Identifying Recurrent Patterns of Chromatin Modifications at Regulatory Regions on Genome

DISSERTATION

Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the Graduate School of The Ohio State University

By

Nan Meng

Graduate Program in Computer Science and Engineering

The Ohio State University

2015

Dissertation Committee:

Dr. Raghu Machiraju, Adviser

Dr. Kun Huang, Co-adviser

Dr. Yuejie Chi

Copyright by

Nan Meng

2015

Abstract

Epigenetics is an important regulatory layer of DNA in cells. Post-translational modification of histones is one of the most studied aspects of epigenetics. The distribution of chromatin modifications at known regulatory regions could provide invaluable information on the epigenetic regulation of DNA transcription.

First, a pipeline was developed to quantitatively identify and analyze distinct recurrent distribution patterns of a single chromatin modification at regulatory regions. The clustering-based method identified several recurrent patterns in cell lines from different tissue types. One particular pattern identified at promoter regions indicates activation of transcription. Further investigation shows that genes displaying this pattern play important roles in cell development and differentiation. The study was later extended to regulatory regions of long non-coding RNAs (lncRNAs). It shows that similar epigenetic regulatory functions apply to lncRNAs as well.

Then a computational framework was developed to study combinatorial patterns of multiple chromatin modifications. Results show that recurrent combinatorial patterns provide more insight into subtle details of epigenetic regulation. However, it is computationally challenging and unnecessary to include all chromatin modifications, as some of their distributions may be correlated. Our framework provides an efficient way to select a subset of chromatin modifications that best represents the recurrent combinatorial patterns. In this framework, high-dimensional distribution data of chromatin modifications are considered to be embedded in several local low-dimensional subspaces. The distribution of a chromatin modification can be expressed as a linear combination of others in the same subspace. Hence, multiple chromatin modifications residing in the same subspace can be represented by a single representative. Recurrent combinatorial patterns of all the selected representatives are then identified and carefully studied. The study shows that different recurrent combinatorial patterns are associated with different states of promoters. The previously un-annotated promoters identified by these patterns are further analyzed and show strong associations with activation of DNA transcription. This framework also provides information on similarity among chromatin modifications, which could guide the selection of chromatin modifications for future experiments.

The proposed methodology could be used to reveal distinct patterns of chromatin modifications associated with various epigenetic regulation mechanisms at various regulatory regions or across the whole genome, depending on the user’s choice of genomic resolution. It could also be adapted to analyze other types of genomic data or to generate interesting hypotheses for future investigation. These methods could be used in various ways to interpret and study epigenetic regulation.

Dedication

This document is dedicated to my family.

Acknowledgments

First, I would like to thank my adviser, Prof. Raghu Machiraju, for his unceasing guidance and support. Without his encouragement and help, I could not have completed my work. In addition to my adviser, I am very grateful to have worked with my co-adviser, Prof. Kun Huang. Many ideas were inspired by discussions with my two mentors, and it was a great honor to work with both of them. I am also very grateful to my committee members, Prof. Yuejie Chi and Prof. Paul Granello, for their invaluable comments and questions regarding my research. Prof. Chi in particular has been a great role model for me since the first time I met her. I also want to thank Prof. Clark Anderson for his guidance and support during the early years of my PhD program.

I will always be thankful for the people I worked with in my lab: Chao Wang, Hao Ding, Qihang Li, Zhi Han, Jie Zhang, Brian Arand, Arunima Srivastava, Travis Johnson and Ross Donatelli. Furthermore, I am very grateful to have met some lifelong friends along the way: Jihyun Lee, Pei Zhang, Yufang Sun and James Gentry stood by my side through thick and thin for the past few years. I also want to send special thanks to the friends I knew before grad school who accompanied me through this journey from near and far: Qian Gao, Chuan Lu, Jinyan Guan, Wei Zhang and Hua Fu.

Finally, I want to thank my parents for their unconditional love and support.

Vita

2009 ...... B.S. Computer Science, Hebei University of Technology, China

2013 ...... M.S. Computer Science and Engineering, The Ohio State University

2014 to present ...... Graduate Teaching Associate, Department of Biomedical Informatics, The Ohio State University

Publications

Meng N, Machiraju R, Huang K. Identify Critical Genes in Development with Consistent H3K4me2 Patterns across Multiple Tissues. IEEE/ACM Trans Comput Biol Bioinforma 2015, PP:1–1.

Fields of Study

Major Field: Computer Science and Engineering

Table of Contents

Abstract

Dedication

Acknowledgments

Vita

List of Tables

List of Figures

Chapter 1: Introduction

1.1 Investigate epigenetic regulatory regions with recurrent patterns of chromatin modifications

1.2 From regulatory regions to recurrent patterns of chromatin modifications

1.3 Thesis Statement

1.4 Outline of Solution

1.5 Contributions

Chapter 2: Related Work

2.1 Data collection and Challenges

2.2 Methods and tools

2.3 Sparse subspace clustering and combinatorial patterns of chromatin modifications

2.4 From past to present

Chapter 3: Identify Recurrent Patterns of H3K4me2 at promoters of Critical Developmental Genes across Multiple Tissues

2.1 Introduction

2.2 Material and Methods

2.3 Results

2.4 Discussion and Summary

Chapter 4: Identify recurrent patterns of H3K4me2 at promoters of Critical lncRNAs across Multiple Tissues

4.1 Background

4.2 Methods

4.3 Results

4.4 Summary

Chapter 5: Identify Combinatorial Recurrent Patterns of Chromatin Modifications at Promoters of Different States

5.1 Introduction

5.2 Method

5.3 Results

5.4 Summary

Chapter 6: Conclusion and future work

References

Appendix A: Supplementary material for Chapter 3

Appendix B: Supplementary Material for Chapter 4

Appendix C: Supplementary Material for Chapter 5

List of Tables

Table 1. More details on different types of regulatory regions: locations to their targeted regions, strand and their regulatory functions.

Table 2. Five H3K4me2 ChIP-seq Data sets with listed accession numbers are downloaded from NCBI Gene Expression Omnibus.

Table 3. Five Microarray Data sets with listed accession number are downloaded from NCBI Gene Expression Omnibus.

Table 4. Top enriched GO terms of cluster 1 from each cell line.

Table 5. PPI density of cluster 1 from all tissue types.

Table 6. Accession numbers of ChIP-seq data sets.

Table 7. Accession numbers of Microarray data sets for expression level study and co-expression network constructions (only normal samples are included).

Table 8. PPI density of genes that are highly co-expressed with selected long-tail lncRNA groups.

Table 9. Most Long-tail lncRNAs clusters are highly enriched with diseases and dysfunctions related lncRNAs.

Table 10. Sizes of identified clusters and the correlations between matching clusters from the two cell lines.

Table 11. Further analyses of the identified putative promoters.

Table 12. Unsupervised clustering method k-means was applied to cluster TSS region profiles of 41413 genes in each dataset.

Table 13. Gene symbols of all 217 genes in the core group.

Table 14. Top 20 enriched GO terms in molecular function, biological process and mouse phenotype of core gene group.

Table 15. Number of lncRNAs in each cluster.

Table 16. Functional enrichment analyses of genes that co-expressed with long-tail lncRNAs (Group Two).

Table 17. Functional enrichment analyses of genes that co-expressed with long-tail lncRNAs (Group One).

Table 18. Group one genes co-expressed with long-tail lncRNAs from blood tissue cells are enriched with tissue specific mouse phenotypes.

Table 19. Enriched GO terms for genes displaying CP1 at their promoters.

Table 20. Enriched GO terms for genes displaying CP2 at their promoters.

Table 21. Enriched GO terms for genes displaying CP3 at their promoters.

Table 22. Enriched GO terms for genes displaying CP4 at their promoters.

List of Figures

Figure 1. Diagrams of how the promoter (from [1]), enhancer and insulator regulate transcription of DNA. The promoter regulates transcription of the region directly downstream. The enhancer regulates transcription by acting on promoters from a distance, either upstream or downstream. Insulators can either block the spread of heterochromatin (a tightly packed form of DNA) or block the interaction between enhancers and promoters.

Figure 2. DNA wraps around histone octamers in its usual state. The tails of histones are subject to various types of modifications, which could impact the accessibility of DNA for transcription [3].

Figure 3. As shown here, various types of modifications can occur on the tails of histones H3 and H4. Each modification, individually or in combination with others, could influence the transcription of DNA.

Figure 4. General workflow of ChIP-seq procedure is shown here. DNA sequences bound by targeted chromatin modifications are selected and later sequenced, figure from [11].

Figure 5. Selection of chromatin modifications by greedy algorithm. Certain chromatin modifications bring more information for identification of recurrent states compared to others [5].

Figure 6. General workflow of subspace clustering by Ucar et al. [25]. This approach takes the distribution pattern into consideration. It searches for similar patterns from recurrent combination of chromatin modifications.

Figure 7. Workflow of the analysis used in Chapter 3.

Figure 8. Diagram of tail length definition.

Figure 9. Average H3K4me2 distribution signal in clusters identified by K-means clustering.

Figure 10. Gene expression level (A) and interaction density (B) are plotted against tail length.

Figure 11. Diagram of workflow used in Chapter 4.

Figure 12. Definition of tail length.

Figure 13. Clustering result based on data from skeletal muscular cells.

Figure 14. Distributions of H3K4me2 (left) and PolII (right) at lncRNA promoter regions.

Figure 15. Investigation on the PhastCon scores of long-tail lncRNAs show they are highly conserved.

Figure 16. Workflow of the framework proposed in Chapter 5.

Figure 17. Value of λ is empirically selected by comparing the two affinity matrices generated based on data from two different cell lines.

Figure 18. Hierarchical clustering and subset selection of chromatin modifications.

Figure 19. Combinatorial patterns (CP) identified in this study and the average profile of each CP.

Figure 20. Expression levels of identified clusters from the two cell lines.

Figure 21. Distributions of PolII for identified clusters.

Figure 22. Comparison of expression levels of regions regulated by putative and identified active promoters.

Figure 23. Combinatorial patterns and PolII distributions of putative promoters.

Figure 24. PolII distributions at putative promoters (identified in cell line GM12878) in other cell lines.

Figure 25. Silhouette coefficient (top) and sum of point-to-centroid distances (bottom) based on K values ranging from 2 to 30 are plotted for skeletal muscular tissue cells dataset. It appears that 7 is an optimal value choice for K.

Figure 26. Sum of point-to-centroid distances (bottom) based on K values ranging from 2 to 15 are plotted for embryonic stem cells dataset. It appears that 7 is an optimal value choice for K.

Figure 27. Average H3K4me2 distribution signal in clusters identified by K-means clustering in Epithelial tissue cells are plotted (top), using K=7. The average distribution pattern of seven clusters are plotted in the order with decreasing tail length.

Figure 28. Average H3K4me2 distribution signal in clusters identified by K-means clustering in Epithelial tissue cells are plotted (top), using K=7. The average distribution pattern of seven clusters are plotted in the order with decreasing tail length.

Figure 29. Average H3K4me2 distribution signal in clusters identified by K-means clustering in Connective tissue cells are plotted (top), using K=7. The average distribution pattern of seven clusters are plotted in the order with decreasing tail length.

Figure 30. Average H3K4me2 distribution signal in clusters identified by K-means clustering in Neural Progenitor Cells are plotted (top), using K=7. The average distribution pattern of seven clusters are plotted in the order with decreasing tail length.

Figure 31. Average H3K4me2 distribution signal in clusters identified by K-means clustering in Embryonic Stem Cells are plotted, using K=7. The average distribution pattern of seven clusters are plotted in the order with decreasing tail length.

Figure 32. H3K4me2 distribution signal in clusters identified by K-means clustering in Embryonic Stem Cells are plotted, using K=7. The location of clusters are marked at the left side of the figure.

Figure 33. H3K4me2 distribution signal in clusters identified by K-means clustering in Epithelial tissue cells are plotted, using K=7. The location of clusters are marked at the left side of the figure.

Figure 34. H3K4me2 distribution signal in clusters identified by K-means clustering in Blood tissue cells are plotted, using K=7. The location of clusters are marked at the left side of the figure.

Figure 35. H3K4me2 distribution signal in clusters identified by K-means clustering in Skeletal Muscular tissue cells are plotted, using K=7. The location of clusters are marked at the left side of the figure.

Figure 36. H3K4me2 distribution signal in clusters identified by K-means clustering in Neural Progenitor cells are plotted, using K=7. The location of clusters are marked at the left side of the figure.

Figure 37. The core group is identified by the intersection of genes from all long tail clusters identified in various tissue types.

Figure 38. Silhouette coefficient and Sum of squared Euclidean (SSE) distance based on k values ranging from 2 to 30 for skeletal muscular tissue cells dataset. It appears that 5 is an optimal value choice for k.

Figure 39. Pol II and H3K4me2 signal at lncRNA TSS regions for various cell lines, from left to right: Hela cells, HUVEC cells and K562 cells.

Figure 40. Cluster Profiles from Each Cell Line (heat map of the distribution signals, average cluster profile plot, expression level box plot, and histogram of cluster size) in blood cells.

Figure 41. Cluster Profiles from Each Cell Line (heat map of the distribution signals, average cluster profile plot, expression level box plot, and histogram of cluster size) in Connective tissue cells.

Figure 42. Cluster Profiles from Each Cell Line (heat map of the distribution signals, average cluster profile plot, expression level box plot, and histogram of cluster size) in Epithelial tissue cells.

Figure 43. Cluster Profiles from Each Cell Line (heat map of the distribution signals, average cluster profile plot, expression level box plot, and histogram of cluster size) in Embryonic stem cells.

Figure 44. Cluster Profiles from Each Cell Line (heat map of the distribution signals, average cluster profile plot, expression level box plot, and histogram of cluster size) in Nervous tissue cells.

Chapter 1: Introduction

Distributions of chromatin modifications on the human genome are hardly random. Certain patterns recur frequently, and it has been shown that recurrent patterns of chromatin modifications can be used to infer the epigenetic regulatory functions of the regions in which they reside. Hence, much attention has been devoted to investigating recurrent patterns of chromatin modifications. In particular, as the number of discovered modifications increases, identifying subsets of chromatin modifications that form recurrent patterns becomes an important task. Doing so simplifies the analysis and, at the same time, provides guidance for future experimental design. This dissertation provides a computational framework to identify and verify biologically meaningful recurrent patterns at regulatory regions, quantitatively select subsets of chromatin modifications, and discover novel regulatory regions.

1.1 Investigate epigenetic regulatory regions with recurrent patterns of chromatin modifications

Epigenetics plays an important regulatory role in gene expression. The exact same genotype can lead to drastically different phenotypes due to variations in epigenetics. There are several types of regulatory regions, each associated with different regulatory functions. To better understand epigenetic regulation, it is important to locate these regulatory regions on the genome so that their underlying regulatory functions can be examined. Previous studies have shown that recurrent patterns of chromatin modifications can be indicators of epigenetic regulatory regions. Hence, recurrent patterns are utilized to discover previously un-annotated regulatory regions on the genome.

In this dissertation, I present a computational framework to identify recurrent patterns of chromatin modifications at regulatory regions and discover novel regulatory regions by utilizing the identified patterns. Essentially this work answers the following questions:

Q1: What are recurrent patterns of chromatin modifications at specific regulatory regions and are they associated with regulatory functions?

Q2: How to select a subset of chromatin modifications to identify recurrent patterns at specific regulatory regions that are associated with regulatory functions?

Q3: How to discover novel regulatory regions by utilizing the identified recurrent patterns at specific regulatory regions?

1.2 From regulatory regions to recurrent patterns of chromatin modifications

To further explain the reasoning behind using recurrent patterns of chromatin modifications to uncover epigenetic regulatory regions, more detailed background is given below. The next few subsections first introduce different types of regulatory regions and their functions. This is followed by a brief explanation of DNA structure and of how chromatin modifications relate to the epigenetic regulation of DNA transcription. Finally, the association between recurrent patterns of chromatin modifications and regulatory functions is established.

1.2.1 Various types of regulatory regions on genome: promoter, enhancer and silencer

As protein-coding genes account for approximately 2% of the human genome, there is a large number of non-coding regulatory regions. Several important types of regulatory regions have been discovered and carefully studied, such as promoters, enhancers, insulators and silencers. A comparison of the different regulatory regions is shown in Table 1.

Figure 1. Diagrams of how the promoter (from [1]), enhancer and insulator regulate transcription of DNA. The promoter regulates transcription of the region directly downstream. The enhancer regulates transcription by acting on promoters from a distance, either upstream or downstream. Insulators can either block the spread of heterochromatin (a tightly packed form of DNA) or block the interaction between enhancers and promoters.

The promoter is a region of DNA that initiates transcription. Promoters are usually located near the transcription start sites (TSS) of genes, on the same strand and upstream on the DNA. Recent research shows that not only protein-coding genes but also non-coding regions are regulated by promoters.

The enhancer is a region of DNA that can be bound by proteins to activate tissue-specific transcription. Enhancers are usually located up to 1 million base pairs (1 Mbp) away from the TSS, either upstream or downstream. It is estimated that there are hundreds of thousands of enhancers in the human genome.

The insulator is a region of DNA that blocks the interaction between enhancers and promoters. Insulators serve as genetic boundaries that allow adjacent genes to be regulated with different levels of transcription. As a result, induction or repression of one gene does not interfere with the neighboring gene. It has been established that the distribution of the DNA-binding transcriptional repressor CTCF (also known as CCCTC-binding factor) is a good predictor of insulators.

The silencer is a DNA region whose sequence is capable of binding transcription regulation factors that prevent RNA polymerase from binding to the promoter region, essentially silencing the genomic region for transcription.

Studies have shown that the patterns of chromatin modifications vary across different regulatory regions of the human genome [2]. Examining recurrent patterns at specific regulatory regions could provide insight into the epigenetic functions of the regions in which they reside. In this dissertation, several case studies of the proposed computational framework are conducted on data extracted from promoter regions. Promoter regions were chosen because a well-established database of promoter locations exists and verification of their regulatory functions is straightforward.

| Name of the regulatory region | Location to transcription regions they regulate | Strand | Regulatory functions |
| --- | --- | --- | --- |
| Promoter | Promoter sequences directly upstream or at the 5' region adjacent to the transcriptional start site | define the direction of transcription and indicate which DNA strand will be transcribed | transcription factors bind to the promoter to initiate transcription of DNA |
| Enhancer | Enhancer sequences can be located thousands of base pairs away from the transcription start site of the gene being regulated | can be positioned in both forward or reversed sequence orientations and still affect gene transcription | regulate promoters from distance to regulate transcription of DNA |
| Insulator | between the enhancer and promoter | | inhibit interactions between promoters and enhancers; prevent the spread of heterochromatin |
| Silencer | Upstream of the targeted region with various distances | | bind with repressors to prevent transcription of DNA |

Table 1. More details on different types of regulatory regions: locations to their targeted regions, strand and their regulatory functions.

1.2.2 Chromatin modifications influence the accessibility of DNA for transcription

In its usual state, DNA is wrapped around octamers of core histones to form nucleosomes, as shown in Figure 2. Post-translational modification of histones can influence DNA-histone and histone-histone interactions, which in turn affect DNA transcription. There are various types of post-translational histone modifications, such as methylation, acetylation and phosphorylation. Many modifications have been shown to be associated with gene activation, silencing, heterochromatin formation, and DNA damage sensing and repair. Other DNA-binding proteins, such as RNA polymerase II (PolII) and the transcriptional repressor CTCF, also have strong impacts on gene transcription.

Furthermore, transcription of non-coding RNAs is also subject to epigenetic regulation. In particular, the epigenetic regulation of long non-coding RNAs (lncRNAs) is confirmed in Chapter 4 to be similar to that of their coding counterparts. In this dissertation, post-translational modifications of histones and DNA-binding proteins are collectively referred to as chromatin modifications.

Figure 2. DNA wraps around histone octamers in its usual state. The tails of histones are subject to various types of modifications, which could impact the accessibility of DNA for transcription [3].

Histones are subject to modifications at many sites. Figure 3 shows some commonly detected histone modifications at various locations on the tails of H3 and H4. The number of discovered histone modifications is still increasing.

Figure 3. As shown here, various types of modifications can occur on the tails of histones H3 and H4. Each modification, individually or in combination with others, could influence the transcription of DNA.

1.2.3 Histone code hypothesis and chromatin state: Annotation of the human genome by recurrent patterns of chromatin modifications

A significant amount of research has centered on the “histone code” hypothesis. First proposed by David Allis [4], this hypothesis involves the notion that histone modifications can modulate the accessibility of DNA to transcription factors. It further proposes that chromatin-DNA interactions are guided by the combination of histone modifications. The concept of “chromatin states” was later proposed by Ernst et al.; it refers to biologically meaningful and spatially coherent combinations of chromatin modifications [5]. With this definition, the focus of study extended from histones alone to other DNA-binding proteins, such as RNA Polymerase II (PolII) and CTCF.

Overall, it is well accepted that some recurrent combinatorial patterns of chromatin modifications, whether called “histone code” or “chromatin states”, can serve as indicators of epigenetic regulatory regions and possibly of the corresponding epigenetic regulatory functions of the regions in which they reside. Hence, more regulatory regions could be discovered on the genome by investigating the recurrent patterns of chromatin modifications.

1.3 Thesis Statement

Recurrent patterns of chromatin modifications that are indicative of the epigenetic regulatory functions of their residing genomic regions, such as the long-tail binding patterns of H3K4me2 and other combinatorial patterns of multiple chromatin modifications at promoter regions of genes and lncRNAs, can be identified by well-constructed frameworks leveraging clustering-based methods, which are evaluated by integrative analyses with other data evidence and further utilized to discover novel regulatory regions.

1.4 Outline of Solution

As distribution patterns of chromatin modifications are indicative of genomic regulatory regions, identifying recurrent patterns at regulatory regions could provide invaluable information for discovering novel regulatory regions on the genome. This dissertation provides an analytic framework to identify recurrent patterns of single or multiple chromatin modifications at regulatory regions. It further demonstrates that the identified patterns can be utilized to discover previously un-annotated regulatory regions.

The key components of this framework are the following:

Q1: Identification of recurrent patterns of chromatin modifications associated with various functions of specific regulatory regions

At annotated regulatory regions, different patterns of chromatin modifications exist, indicating various regulatory functions. Here a clustering-based method is proposed to identify recurrent distribution patterns at regulatory regions. Each identified recurrent pattern is further examined by an integrative analysis with other data evidence, such as transcription levels and functional enrichment analyses of the targeted genes/non-coding RNAs. More detailed information and explanation are provided in Chapters 3 and 4.
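As an illustration of the clustering step described above, the following is a minimal sketch, not the pipeline used in later chapters. It assumes each regulatory region is summarized as a fixed-length vector of binned signal around the TSS, and it runs a plain Lloyd's k-means on synthetic profiles. The function names (`kmeans_profiles`, `toy_profiles`), the bin count, the noise level and the two toy shapes (a sharp peak, and a peak with a slowly decaying downstream tail) are all hypothetical choices made for this example.

```python
import numpy as np

def kmeans_profiles(profiles, k, n_iter=100):
    """Cluster rows of `profiles` (regions x bins) with Lloyd's k-means."""
    profiles = np.asarray(profiles, dtype=float)
    # Deterministic farthest-point initialisation: start from the first
    # profile, then repeatedly add the profile farthest from those chosen.
    chosen = [0]
    min_d = ((profiles - profiles[0]) ** 2).sum(axis=1)
    while len(chosen) < k:
        nxt = int(min_d.argmax())
        chosen.append(nxt)
        min_d = np.minimum(min_d, ((profiles - profiles[nxt]) ** 2).sum(axis=1))
    centroids = profiles[chosen].copy()
    for _ in range(n_iter):
        # Squared Euclidean distance from every profile to every centroid.
        d = ((profiles[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        updated = centroids.copy()
        for j in range(k):
            members = profiles[labels == j]
            if len(members):  # keep the old centroid if a cluster is empty
                updated[j] = members.mean(axis=0)
        if np.allclose(updated, centroids):
            break
        centroids = updated
    return labels, centroids

def toy_profiles(n_per_group=50, n_bins=40, noise=0.05, seed=1):
    """Synthetic (hypothetical) profiles: sharp peaks vs. peaks with a tail."""
    rng = np.random.default_rng(seed)
    x = np.linspace(-1.0, 1.0, n_bins)          # position relative to the TSS
    peak = np.exp(-(x / 0.1) ** 2)              # narrow peak only
    tail = np.where(x < 0, peak, np.exp(-np.abs(x) / 0.6))  # peak plus tail
    shapes = np.repeat(np.vstack([peak, tail]), n_per_group, axis=0)
    return shapes + rng.normal(0.0, noise, size=shapes.shape)

profiles = toy_profiles()
labels, centroids = kmeans_profiles(profiles, k=2)
print("cluster sizes:", np.bincount(labels))
```

Each row of `centroids` is the average profile of one cluster, which is the kind of summary that can then be compared against expression levels or enrichment results.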

Q2: Identification of minimal subsets of chromatin modifications forming combinatorial patterns at specific regulatory regions

First, chromatin modifications are grouped into clusters based on their distribution patterns at regulatory regions. Then, one chromatin modification is selected as the representative of each cluster. The combinatorial patterns formed by all selected representatives are then examined in detail. Recurrent combinatorial patterns associated with specific features are identified, as discussed further in Chapter 5.
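The grouping and representative-selection idea can be sketched as follows. This is a simplified stand-in, not the framework itself: the framework described in this dissertation builds on sparse subspace clustering, whereas the sketch below substitutes absolute Pearson correlation as the affinity between modifications and groups them by thresholded connectivity. The function names, the threshold of 0.8 and the synthetic signals are assumptions made purely for illustration.

```python
import numpy as np

def group_marks(signals, threshold=0.8):
    """Group rows of `signals` (marks x features) whose |correlation| is high.

    Groups are connected components of the graph linking marks whose
    absolute Pearson correlation is at least `threshold`.
    """
    affinity = np.abs(np.corrcoef(signals))
    unassigned = set(range(len(signals)))
    groups = []
    while unassigned:
        start = min(unassigned)
        unassigned.remove(start)
        stack, group = [start], []
        while stack:
            i = stack.pop()
            group.append(i)
            linked = [j for j in sorted(unassigned) if affinity[i, j] >= threshold]
            for j in linked:
                unassigned.remove(j)
                stack.append(j)
        groups.append(sorted(group))
    return groups, affinity

def pick_representatives(groups, affinity):
    """For each group, pick the mark with the highest total within-group affinity."""
    reps = []
    for g in groups:
        sub = affinity[np.ix_(g, g)]
        reps.append(g[int(sub.sum(axis=1).argmax())])
    return reps

# Synthetic example: marks 0-2 follow one underlying signal, marks 3-4 another.
rng = np.random.default_rng(0)
base_a, base_b = rng.normal(size=(2, 500))
signals = np.vstack([
    1.0 * base_a, 0.5 * base_a, 2.0 * base_a,
    1.0 * base_b, 0.7 * base_b,
]) + rng.normal(0.0, 0.1, size=(5, 500))

groups, affinity = group_marks(signals)
reps = pick_representatives(groups, affinity)
print("groups:", groups)
print("representatives:", reps)
```

Only the representatives would be carried forward to the search for combinatorial patterns, which is what reduces the dimensionality of the later analysis.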

Q3: Discovering novel regulatory regions using identified combinatorial patterns

With the identified recurrent combinatorial patterns, the whole genome was searched for candidate regulatory regions that display similar combinatorial patterns. Furthermore, each candidate region is examined by an integrative analysis with other data evidence, along with expression levels of neighboring regions, as shown in Chapter 5.
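One simple way to picture the genome-wide search is a sliding-window comparison between a signal track and a pattern template. The sketch below is an assumption-laden illustration rather than the procedure of Chapter 5: it uses a single binned track instead of multiple modifications, Pearson correlation as the similarity measure, and a hypothetical cutoff of 0.9. The function name `scan_for_pattern` and the synthetic track are invented for this example; for several modifications, the per-mark windows could be concatenated before comparison.

```python
import numpy as np

def scan_for_pattern(track, template, min_corr=0.9):
    """Return window starts whose Pearson correlation with `template` is high.

    `track` is a 1-D binned signal along the genome and `template` is a
    1-D pattern profile (for example, a cluster average).
    """
    track = np.asarray(track, dtype=float)
    template = np.asarray(template, dtype=float)
    w = len(template)
    starts = np.arange(len(track) - w + 1)
    # Build all windows at once by fancy indexing (rows = windows).
    windows = track[starts[:, None] + np.arange(w)[None, :]]
    wc = windows - windows.mean(axis=1, keepdims=True)
    tc = template - template.mean()
    denom = np.linalg.norm(wc, axis=1) * np.linalg.norm(tc)
    # Flat windows have zero variance; assign them a correlation of 0.
    corr = np.divide(wc @ tc, denom, out=np.zeros(len(starts)), where=denom > 0)
    return starts[corr >= min_corr], corr

# Synthetic example: a low-noise track with one copy of the template at bin 100.
rng = np.random.default_rng(0)
template = np.exp(-((np.arange(30) - 15) / 5.0) ** 2)
track = rng.normal(0.0, 0.01, size=300)
track[100:130] += template

hits, corr = scan_for_pattern(track, template, min_corr=0.9)
print("candidate window starts:", hits)
```

Neighboring windows that overlap a true match also score highly, so in practice adjacent hits would be merged into one candidate region before any follow-up analysis.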

1.5 Contributions - Informatics

Pipeline to study recurrent patterns of single chromatin modification mark in cells of multiple tissue types

Prior to our study, analyses of recurrent distribution patterns of a single histone modification were only conducted on genes in specific cell lines. The proposed pipeline provides a systematic approach to identify and compare recurrent patterns of chromatin modifications detected at regulatory regions of both genes and non-coding RNAs in cells of multiple tissue types.

Identification of combinatorial patterns with minimal subsets of chromatin modifications

As more data sets of chromatin modifications become available, it is computationally challenging and unnecessary to include all of them in the analysis. To tackle this problem, the framework identifies minimal subsets of chromatin modifications that form recurrent patterns. This selection step not only significantly simplifies the search for recurrent combinatorial patterns, but also provides guidance for future experimental design on data collection.

Discovery of novel regulatory regions by combinatorial patterns of chromatin modification

In order to gain more understanding of epigenetic regulations, functional annotation of regulatory regions is of great importance. Several novel promoters were discovered by utilizing combinatorial patterns identified in our study. This framework provides a useful approach in discovering previously un-annotated regulatory regions and further annotating the genome.

1.6 Contribution – Biological Implications

Besides the computational contributions mentioned above, the case study presented in each chapter also leads to biologically meaningful discoveries. In Chapter 3, a distinct recurrent pattern of H3K4me2 is confirmed to regulate transcription of genes in multiple cell lines of different tissue types. In Chapter 4, this particular recurrent pattern is found to regulate transcription of lncRNAs as well. Later, in Chapter 5, several recurrent combinatorial patterns are identified and shown to be associated with different states of promoters. Novel promoter regions were discovered by detecting these identified patterns on the human genome, and further analysis confirmed that they are indeed associated with promoter functions.

1.7 Scope of the dissertation

This framework analyzes recurrent patterns of chromatin modifications at specific regulatory regions on the human genome. In other words, it requires the locations of known regulatory regions as part of the input. As shown in later chapters, it is easier to associate recurrent patterns identified at known regulatory regions with epigenetic regulatory functions than those identified at unknown regions. Furthermore, all case studies in this dissertation are conducted at promoter regions. This choice was made because the target regions of promoters are directly downstream of them. In the case studies, target regions are examined to verify the regulatory functions associated with the identified promoter patterns. For other regulatory regions, more effort would be needed to search for and confirm their target regions. Though this framework could identify recurrent patterns at other regulatory regions, locating their target regions would be difficult and is outside the scope of this study.


Chapter 2: Related Work

Investigating the associations between recurrent patterns of chromatin modifications and regulatory functions of genomic regions is an emerging field of research. This chapter presents a review of data acquisition, challenges in studying recurrent patterns of chromatin modifications, and previously reported methods and tools to identify regulatory regions by patterns of chromatin modifications. Section 2.1 opens the chapter with a brief discussion of the experimental procedures of data collection and the challenges in studying patterns of chromatin modifications. Section 2.2 describes methods and tools reported previously for identifying recurrent patterns. Subsequently, in Section 2.3, the work on sparse subspace clustering (SSC) is briefly introduced, as it inspired the subset selection component of the framework. Finally, the chapter ends with a brief summary in Section 2.4.

2.1 Data collection and Challenges

2.1.1 Data Acquisition and preprocessing

New developments in technology bring new opportunities to study chromatin modifications on the genome. Previously, scientists were limited to data of the residue of histone modifications on DNA[6, 7]. Later, Chromatin Immunoprecipitation (ChIP) followed by microarray hybridization (ChIP-chip) offered a better alternative by providing the distributions of chromatin modifications at specific loci on the genome. Recently, the rapid development of next generation sequencing made ChIP followed by high-throughput sequencing (ChIP-seq, procedure described in Figure 4) an attractive alternative method. ChIP-seq directly sequences all retained DNA segments and is considered the state of the art for studying genomic distributions of DNA binding proteins, such as histones and transcription factors.

Both ChIP-chip and ChIP-seq assays start with chromatin immunoprecipitation. Afterward, the ChIP-chip assay relies on microarray chips for sequence detection, which means that only the subset of segments matching the microarray chips is detected. ChIP-seq, on the other hand, records all segments.

Research shows that compared to ChIP-chip, ChIP-seq generally produces sequence reads with better signal-to-noise ratio and allows detection of more peaks[8]. The same study suggested that ChIP-seq also generates profiles with higher spatial resolution and dynamic range.

The Encyclopedia of DNA Elements (ENCODE) project is a public research project launched by the US National Human Genome Research Institute (NHGRI) in September 2003[9, 10]. Intended as a follow-up to the Human Genome Project (Genomic Research), ENCODE aims to identify all functional elements in the human genome. So far, ENCODE has generated a huge amount of data that reflects the binding signals of various chromatin modifications on the genome.


Figure 4. General workflow of ChIP-seq procedure is shown here. DNA sequences bound by targeted chromatin modifications are selected and later sequenced (figure from [11]).


2.1.2 Challenges in studying patterns of chromatin modifications

With current technology, we are able to examine distributions of the DNA binding proteins with resolution higher than ever. However, there are several challenges remaining in this field.

Firstly, data collection of chromatin modifications is still ongoing. So far, data for a short list of chromatin modifications are collected for multiple cell lines, while experiments for many others are still in process. Hence, studying comprehensive recurrent patterns of chromatin modifications and further verifying those identified patterns in multiple cell lines is still a difficult task. As ChIP-seq experiments are expensive and time consuming, the ENCODE project organizes many institutions to generate data collaboratively and the project has made encouraging progress on collecting data for future studies.

Secondly, preprocessing is crucial for analyzing ChIP-seq data sets[9]. If a study includes data sets collected by multiple labs, the first step is to examine and possibly eliminate batch effects. Then it is also important to conduct another round of preprocessing to reduce noise and normalize datasets for analyses. There is no standard procedure in preprocessing data sets of chromatin modifications. Hence this step could possibly bring bias into subsequent computational analyses.

Thirdly, identifying and examining patterns of chromatin modifications is a computationally complex task. There are a large number of loci on the human genome, so comparing the distributions of chromatin modifications at these loci and identifying recurrent patterns requires well constructed formulations. In later chapters of this dissertation, recurrent patterns of one or more chromatin modifications are identified with clustering-based methods.

In particular, it has been shown that by including more chromatin modifications, more recurrent patterns could be identified. However, the complexity of the analysis increases significantly when more chromatin modifications are included. Hence, there is a tradeoff between the number of chromatin modifications and the complexity of the analysis. Furthermore, it has been demonstrated in previous studies that some subsets of chromatin modifications are sufficient to identify recurrent combinatorial patterns[12]. Yet a rigorous and efficient method to select suitable subsets of chromatin modifications is still not available. In Chapter 5, a quantitative approach to select subsets of chromatin modifications is proposed and tested.

Finally, it remains challenging to verify and evaluate the identified recurrent patterns. Examining whether there is any association between the identified patterns and epigenetic regulatory functions requires evidence from a variety of data sources. As shown in this dissertation, the verification process is straightforward for promoters because they lie directly upstream of the DNA regions they regulate. Verification for other types of regulatory regions remains difficult because their targeted DNA regions are hard to locate on the genome.

2.2 Methods and tools

Much effort has been spent on studying distribution patterns of chromatin modifications, aiming to reveal the underlying epigenetic regulation mechanisms. After recurrent patterns are identified, they need to be examined by integrative analyses with other data evidence. Furthermore, novel regions could be discovered by searching for regions displaying similar patterns. Here, the review of previously reported methods and tools is divided into four aspects: identifying recurrent patterns of chromatin modifications, identifying subsets of representative chromatin modifications, integrative analyses on identified patterns, and new discoveries based on identified patterns.

2.2.1 Identify recurrent patterns of chromatin modifications

Analyses on distribution patterns of single chromatin modifications

Recurrent patterns of single chromatin modifications are found to be associated with different regulatory functions. Roh et al. studied the distribution patterns of H3K9acK14ac (H3 K9/K14 di-acetylation), H3K4me3 (H3 K4 tri-methylation), and H3K27me3 (H3 K27 tri-methylation) in primary human T cells[13]. Results show that while abundance of H3K9acK14ac and H3K4me3 at promoter regions indicates active transcription of genes, enrichment of H3K27me3 signals repression of transcription.

Later Roh et al. examined the di-acetylation of histone H3 at Lys 9 and Lys 14 in resting and activated human T cells by the genome-wide mapping technique (GMAT)[14]. Their study shows that the chromatin accessibility and gene expression of a genetic domain are correlated with hyper-acetylation of promoters and other regulatory elements, but not with generally elevated acetylation of the entire domain. Furthermore, the authors compared the acetylation island sequences with distantly and closely related vertebrates, namely pufferfish and mouse. One third of the conserved sequences between human and pufferfish were found to have enhancer activities in reporter gene assays. In human-mouse acetylation island sequence comparisons, half of the non-conserved sequences have enhancer activities in human Jurkat T cells. Barski et al. examined distributions of 20 histone modifications at various regulatory regions on the genome, such as enhancers, promoters, insulators and transcribed regions[13]. Patterns of each histone modification are divided into groups and plotted according to the expression levels of neighboring genes. This study clearly demonstrated that there exist associations between recurrent patterns of chromatin modifications at regulatory regions and expression levels of genes. Histone modifications are then categorized as associated with ‘activating’ or ‘repressing’ functions.

In particular, Pekowska et al. first established that there is a distinct distribution pattern of a specific histone modification mark (H3K4me2, di-methylation of the 4th residue from the start of the H3 protein) at promoter regions of genes, which is associated with tissue specificity and higher expression in human CD4+ T cells and mouse whole brain cells[15]. Later Zhang et al. further investigated the same pattern at different developmental stages of human cells, namely embryonic stem cells, neural progenitor cells and mature brain cells[16]. Their study confirmed that this distinct pattern not only indicates tissue specific genes in differentiated cells, but also signals active transcription of crucial genes from early developmental stages.

However, studying epigenetic regulation based on single chromatin modifications may lead to conflicting conclusions. Vakoc et al. reported that H3K9me3 occurs both at silent heterochromatin and at the transcribed regions of active mammalian genes[17]. This modification is observed to be dynamic; it increases during activation of transcription and rapidly decreases upon gene repression. Later the authors extended their study to include six chromatin modifications (namely H3K4me3, H3K9me3, H3K27me1, H3K36me3, H3K79me3 and H4K20me1) [16]. Results show high tri-methylation of H3K4, H3K9, H3K36, and H3K79 in the transcribed region. H4K20me1, previously linked with repression, could also be considered an indicator of transcription elongation in mammalian cells. H3K27me1, a modification enriched at pericentromeric heterochromatin, was observed to be broadly distributed throughout all euchromatic sites analyzed, with selective depletion in the vicinity of the transcription start sites of active genes. H4K20me1 and H3K27me1 are versatile and dynamic with respect to gene activity, suggesting the existence of novel site-specific methyltransferases and demethylases coupled to the transcription cycle.

Besides histone modifications, recurrent patterns of other chromatin modifications are shown to be good indicators of regulatory regions. These chromatin modifications could also serve as external data evidence for verification and evaluation of putative regulatory regions. Kim et al. reported a genome-wide map of promoters in human fibroblast cells by examining the distribution pattern of the RNA polymerase II preinitiation complex (PIC) throughout the genome[18]. PIC includes RNA polymerase II (PolII), the transcription factor IID (TFIID) and other general transcription factors. Furthermore, four classes of promoters are identified by analyzing the expression profiles of genes associated with these promoters. Later the same group reported a list of insulators by examining the distribution signal of CTCF throughout the genome in human fibroblast cells[19]. Most identified insulators are located far from TSS and their distributions strongly correlate with levels of gene expression. To uncover genomic locations of enhancers, this group proposed to search for P300 binding as an indicator throughout the genome[20]. Their study was carried out in mouse embryonic forebrain, midbrain and limb tissues, and the identified sequences show highly reproducible enhancer activity in the tissues in which they were detected. Especially for cardiac transcriptional enhancers, few were detected by searching for non-coding sequence with high conservation.

Analyses of combinatorial distribution patterns of multiple chromatin modifications

As more data became available, researchers started analyzing multiple chromatin modifications together at regulatory regions. Heintzman et al. examined the distribution patterns of six histone modifications (H4ac, H3ac, H3K4me1/2/3, H3) and three general transcription factors (PolII, TAF1, P300) in the ENCODE regions (30Mb)[2]. For each type of regulatory region, the distribution patterns of multiple histone modifications were concatenated as one combinatorial distribution. By comparing the histone modification profiles of known regulatory regions, the authors reported that while H3K4me2 is universally distributed at both promoter and enhancer regions, H3K4me1 is enriched at active enhancers and H3K4me3 is enriched at active promoters. Their algorithm can be used to identify and distinguish novel regulatory elements in the human genome. Later they investigated distribution patterns of three histone modifications in five cell lines[21]. Their study established that patterns of histone modifications at promoter and insulator regions are cell-type-invariant whereas those at enhancers are cell-type-specific. Furthermore, chromatin modifications at enhancers are globally related to the expression of cell-type-specific genes.


Hon et al. proposed a probabilistic approach, ChromSig, to search for recurrent combinatorial patterns of chromatin modifications[22]. ChromSig first identifies genomic loci with enriched chromatin modifications as candidate loci. Then the likelihood that each candidate locus belongs to the background or to a biologically meaningful pattern is calculated and examined. The authors first identified 8 patterns by analyzing 12 chromatin modifications from a ChIP-chip data set of HeLa cells. Five of the 8 patterns correspond to known promoters and enhancers. Furthermore, the authors show that ChromSig identified 16 patterns from 21 chromatin modifications in ChIP-seq data sets of CD4 T cells. This study clearly confirms that by including more chromatin modifications in the analysis, more distinct patterns of regulatory regions could be identified.

Yu et al. developed an algorithm to study the causal relationships between chromatin modifications and gene expression using Bayesian networks[23]. They studied the distributions of 20 histone modifications and 3 DNA-binding proteins surrounding the TSS of genes. For each locus, the distribution of each chromatin modification is categorized as ‘low’, ‘medium’ or ‘high’, as is the expression level of the corresponding gene. Nodes of the Bayesian networks are chromatin modifications and gene expression; the edges are constructed based on the joint conditional probabilities between nodes. The construction of the Bayesian network is repeated 100 times and only the common sub-networks agreed on by all trials are retained. This study provided an interesting perspective on chromatin modifications and their regulatory roles in gene expression. According to this study, subsets of chromatin modifications form unique relationship networks and their distributions indicate specific regulations of gene expression. However, the strong assumption that binding signals of chromatin modifications have causal relationships among each other is difficult to verify. Furthermore, the algorithm only accounts for the relative levels of chromatin modifications distributed on the genome. Different choices of thresholding parameters may lead to different representations of the same data set, which in turn leads to different conclusions.
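The sensitivity to thresholding noted above can be illustrated with a minimal sketch. The signal values and both pairs of cutoffs below are hypothetical and are not taken from Yu et al.; the function `discretize` is an illustrative helper, not part of their algorithm.

```python
def discretize(signal, low_cut, high_cut):
    """Map a numeric binding signal to 'low', 'medium', or 'high'."""
    if signal < low_cut:
        return "low"
    if signal < high_cut:
        return "medium"
    return "high"

# Hypothetical normalized signals of one modification at five loci.
signals = [0.5, 1.8, 2.2, 3.9, 6.0]

# Two hypothetical, equally plausible threshold pairs.
labels_a = [discretize(s, 1.0, 4.0) for s in signals]
labels_b = [discretize(s, 2.0, 3.0) for s in signals]

# Loci whose categorical label depends on the threshold choice.
changed = [i for i, (a, b) in enumerate(zip(labels_a, labels_b)) if a != b]

print(labels_a)
print(labels_b)
print("loci whose label changed:", changed)
```

Two of the five loci receive different labels under the two cutoff pairs, so any network learned from the categorical data would be built on different inputs.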

Ernst et al. developed a tool called ChromHMM to discover chromatin states (according to the authors’ definition, they are “recurring biologically meaningful combinatorial patterns of multiple histone modifications”) on the human genome[5, 24]. The algorithm is based on a multivariate Hidden Markov Model (HMM) that models the presence or absence of each chromatin modification. ChromHMM is the first computational tool that could handle a large number of chromatin modifications (18 acetylations, 20 methylations, H2AZ, CTCF and PolII in CD4 T cells). The authors reported 51 chromatin states associated with promoters, transcription regions, active intergenic regions, repressed regions and repetitive regions. Also in this study, the authors pointed out that the annotation of chromatin states could be conducted with subsets of chromatin modifications. In a later study from the same group, the human genome was annotated with 15 chromatin states based on recurrent patterns of nine chromatin modifications in nine human cell lines[12]. One drawback of this model is that it converts all chromatin modification distributions to a binary representation, so a modification is either “present” or “absent”. It does not account for the detailed distribution patterns of chromatin modifications. As established in previous studies, not only is the amount of the histone modification informative, but the detailed patterns can also provide insights into the underlying epigenetic regions and their regulatory functions.
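The information lost by a binary representation can be seen in a small sketch. The two profiles and the cutoff below are hypothetical, and a fixed per-bin cutoff is assumed here purely for illustration; it stands in for whatever presence call a given tool actually makes.

```python
def binarize(profile, cutoff):
    """Per-bin presence (1) or absence (0) call using a fixed cutoff."""
    return [1 if value >= cutoff else 0 for value in profile]

# Hypothetical read counts in eight consecutive bins at two loci.
sharp_peak = [2, 9, 30, 9, 2, 1, 1, 1]    # narrow, high summit
flat_plateau = [1, 6, 7, 6, 1, 1, 1, 1]   # low, even enrichment

cutoff = 5  # hypothetical presence threshold
binary_peak = binarize(sharp_peak, cutoff)
binary_plateau = binarize(flat_plateau, cutoff)

print(binary_peak)
print(binary_plateau)
print("identical after binarization:", binary_peak == binary_plateau)
```

The two loci become indistinguishable after binarization even though their summit heights differ by more than a factor of four, which is the kind of shape information the pattern-based analyses in later chapters retain.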

2.2.2 Identify subsets of chromatin modifications for studying combinatorial patterns

As data sets of more chromatin modifications become available, scientists soon realized that it is computationally challenging and unnecessary to include all chromatin modifications when studying combinatorial patterns. The additional information each chromatin modification provides is not equal. Hence, a subset of well selected chromatin modifications could provide sufficient information for identifications of recurrent combinatorial patterns. There are several algorithms proposed to tackle this problem.

For instance, ChromHMM was originally proposed to analyze 41 chromatin modifications to identify recurrent patterns[5]. In this study, the authors also provided a figure demonstrating the ordering of chromatin modifications by a greedy forward selection algorithm based on minimization of a squared error penalty (Figure 5). This figure clearly shows that the information gained from each additional chromatin modification is diminishing. Hence, a subset of all available chromatin modifications could recover different chromatin states. Later the same authors recovered 15 chromatin states by using 10 chromatin modifications, and the results were replicated and confirmed in 9 different cell lines[12].
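The general idea of greedy forward selection with diminishing returns can be sketched as follows. This is not the ChromHMM procedure, whose penalty is defined on chromatin state assignments; here it is assumed, for illustration only, that every mark is approximated by its closest selected mark and that the penalty is the resulting total squared error. The mark names and profiles are hypothetical.

```python
def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def penalty(profiles, selected):
    """Total squared error when each mark is replaced by its closest selected mark."""
    return sum(min(sq_dist(p, profiles[s]) for s in selected)
               for p in profiles.values())

def greedy_forward_selection(profiles, n_select):
    """Add, one at a time, the mark that lowers the penalty the most."""
    selected, history = [], []
    remaining = sorted(profiles)
    for _ in range(n_select):
        best = min(remaining, key=lambda m: penalty(profiles, selected + [m]))
        selected.append(best)
        remaining.remove(best)
        history.append((best, penalty(profiles, selected)))
    return history

# Hypothetical signal profiles of five marks over four loci.
profiles = {
    "markA1": [5, 5, 0, 0],
    "markA2": [5, 4, 0, 0],
    "markB1": [0, 0, 6, 6],
    "markB2": [0, 1, 6, 5],
    "markC": [3, 0, 3, 0],
}

history = greedy_forward_selection(profiles, len(profiles))
for name, err in history:
    print(name, err)
```

On this toy input the penalty falls steeply for the first three marks and barely changes afterwards, because the remaining marks are near-duplicates of ones already selected.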

Yu et al. constructed Bayesian networks to investigate the causal relationship between chromatin modifications and gene expression[23]. According to this study, subsets of chromatin modifications form unique relationship networks and their distributions reflect specific gene expression regulations.


Figure 5. Selection of chromatin modifications by greedy algorithm. Certain chromatin modifications bring more information for identification of recurrent states compared to others [5].

Ucar et al. developed an algorithm to identify regulatory regions on the genome by examining the actual patterns of multiple chromatin modifications[25], shown in Figure 6 below. The algorithm starts with the assumption that only a subset of the observed chromatin modifications is involved in forming combinatorial patterns. There are two major steps. First, identify the maximal set of chromatin modifications that exhibit a coherent signal at each locus; this set is named the maximal sample set. Second, identify the genomic loci at which all selected chromatin modifications display coherent distribution patterns. This algorithm takes patterns of chromatin modifications as input instead of simplified representations of patterns. However, the authors assumed that all chromatin modifications of a recurrent pattern would display correlated distributions. As a result, recurrent patterns containing chromatin modifications with uncorrelated distributions would not be discovered by this algorithm.


Figure 6. General workflow of subspace clustering by Ucar et al. [25]. This approach takes the distribution pattern into consideration. It searches for similar patterns from recurrent combination of chromatin modifications.

2.2.3 Integrative analyses on identified patterns

After recurrent patterns of chromatin modifications are identified at regulatory regions, it is important to examine their associations with underlying epigenetic regulatory functions. Usually, the evaluation involves integrating evidence from published literature or other data sources. For instance, Yu et al. confirmed their findings with existing literature, as certain discoveries are difficult to validate quantitatively without further data evidence[23]. Ucar et al. also integrated established studies and other data sources to evaluate their identified combinatorial patterns[25].


On the other hand, in later chapters of this dissertation, the identified recurrent patterns at promoter regions were examined by the expression levels of targeted genes/lncRNAs, the interactions between proteins coded by targeted genes, conservation scores and GWAS recorded mutations of the targeted regions. To evaluate the chromatin states identified by ChromHMM, the authors compared the annotated states with previously published databases, such as RefSeq and GWAS[5]. Furthermore, the authors investigated the enrichment of DNaseI hypersensitive sites, CpG islands, evolutionarily conserved motifs and transcription factors at each chromatin state. The neighboring genes of each state were also carefully analyzed in their study.

2.2.4 Discoveries of novel regulatory regions

Identified recurrent patterns of chromatin modifications are utilized to discover novel regulatory regions on the genome. In ChromHMM, the combinatorial patterns of 41 chromatin modifications segmented the whole human genome into 51 chromatin states[5]. In another study, the human genome was segmented into 15 chromatin states[12]. The annotated genome serves as a roadmap for further in-depth investigation. In Chapter 5, novel regulatory regions were discovered based on the identified recurrent patterns.

2.3 Sparse subspace clustering and combinatorial patterns of chromatin modifications

Sparse subspace clustering (SSC) was developed to cluster data drawn from low-dimensional linear subspaces embedded in a high-dimensional space[26]. The essential idea of SSC is that each data point can be represented by a linear combination of other data points from the same low-dimensional subspace. By enforcing an L0 optimization, that is, by seeking the sparsest such representation, the points from the same low-dimensional subspace receive much higher weights than those from other subspaces. Hence, the weights can also be considered as an affinity matrix between points. Spectral clustering is then carried out on the obtained affinity matrix. SSC can be solved efficiently by relaxing the L0 optimization to an L1 optimization without compromising the accuracy of the algorithm. It also performs well with data contaminated by noise, missing entries and outliers.
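The self-expression step described above can be sketched on a toy example. This is a simplified illustration under stated assumptions, not the SSC implementation of [26]: the L1-relaxed problem is solved by plain coordinate descent with a fixed number of sweeps, the data are noise-free points from two orthogonal planes, and connected components of the affinity graph stand in for the spectral clustering step. All names, the penalty weight `lam` and the data are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def soft_threshold(x, lam):
    if x > lam:
        return x - lam
    if x < -lam:
        return x + lam
    return 0.0

def sparse_self_expression(points, lam=0.01, sweeps=50):
    """Express each point as an L1-penalized combination of the other points.
    Returns the coefficient matrix C with C[i][i] = 0 (coordinate descent)."""
    n = len(points)
    coef = [[0.0] * n for _ in range(n)]
    for i, target in enumerate(points):
        residual = list(target)
        for _ in range(sweeps):
            for j, atom in enumerate(points):
                if j == i:
                    continue
                norm_sq = dot(atom, atom)
                if norm_sq == 0.0:
                    continue
                # Correlation with the residual that excludes this atom's own term.
                rho = dot(atom, residual) + coef[i][j] * norm_sq
                new_c = soft_threshold(rho, lam) / norm_sq
                delta = new_c - coef[i][j]
                residual = [r - delta * a for r, a in zip(residual, atom)]
                coef[i][j] = new_c
    return coef

def group_by_affinity(coef, tol=1e-8):
    """Connected components of the graph with affinity |C_ij| + |C_ji| > tol
    (a stand-in for spectral clustering on the same affinity matrix)."""
    n = len(coef)
    labels = [-1] * n
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(n):
                if labels[j] == -1 and abs(coef[i][j]) + abs(coef[j][i]) > tol:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels

# Three points in the plane spanned by axes 1-2, three in the plane of axes 3-4.
points = [(1.0, 0.0, 0.0, 0.0), (1.0, 1.0, 0.0, 0.0), (-1.0, 2.0, 0.0, 0.0),
          (0.0, 0.0, 1.0, 0.0), (0.0, 0.0, 1.0, 1.0), (0.0, 0.0, -1.0, 2.0)]
coef = sparse_self_expression(points)
labels = group_by_affinity(coef)
print(labels)
```

Because the two planes are orthogonal in this toy case, every coefficient linking points from different planes is exactly zero, so the affinity graph separates into one component per subspace. With noisy or non-orthogonal data the cross-subspace weights are only small rather than zero, which is why spectral clustering is used in practice.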

One major challenge in analyzing the distributions of chromatin modifications is that the data has very high dimensionality. Inspired by SSC, distributions of chromatin modifications are considered as high-dimensional data with an embedded low-dimensional structure. Here it is hypothesized that there are several low-dimensional subspaces, each containing the chromatin modifications with closely related distributions at corresponding genomic loci. Hence, the distribution of one chromatin modification can be sufficiently explained by those from the same subspace. This formulation provides a new perspective on analyzing distributions of chromatin modifications and has yielded encouraging results.

2.4 From past to present

In this chapter, more detailed information regarding data acquisition and the challenges in studying chromatin modifications is presented. Furthermore, multiple previously proposed methods and tools are reviewed. As mentioned earlier, currently available methods are inadequate given certain challenges in this field. In the next few chapters, different case studies are presented to demonstrate the proposed framework, which is designed to tackle problems that were not completely solved by existing methods.


Chapter 3: Identify Recurrent Patterns of H3K4me2 at promoters of Critical Developmental Genes across Multiple Tissues

Histone modification is an important epigenetic event which plays essential roles in cell differentiation and tissue development. Recent studies show that a unique distribution pattern of dimethylation of the lysine 4 residue on histone 3 (H3K4me2) around transcription starting sites (TSS) of genes marks tissue specific genes in human CD4+ T cells and mouse nervous tissue cells[15, 16]. However, the existence of this pattern has not been widely tested in other tissue types and the implication of this pattern remains unclear. In this chapter, the H3K4me2 distribution patterns across six different cell lines from five major tissue types (including muscular tissue, nervous tissue, non-blood connective tissue, blood, and epithelial tissue) as well as embryonic stem cells were studied. A metric, ‘tail length’, was defined to quantitatively describe H3K4me2 distribution patterns around the TSS. While the observations confirmed that genes with long H3K4me2 tails around TSS are enriched with tissue specific functions, a group of 217 genes with ubiquitous long-tail H3K4me2 patterns in all the tested tissues as well as the embryonic stem cells (ESC) was identified. Since it was observed that the long-tail H3K4me2 pattern is often associated with high transcription activity, it is hypothesized that these genes are active in multiple developmental stages and are of significant importance in tissue differentiation and development. Functional enrichment analysis confirmed that these genes are critical for development. Further analysis shows that genes in this group are highly interactive with other tissue specific genes, as evinced by protein-protein interaction networks, suggesting their critical regulatory functions. Here, results suggest that rich information on gene functions and epigenetic events can be revealed using pattern recognition methods.

3.1 Introduction

With the rapid development of next generation sequencing (NGS) technologies, the ChIP-seq technique has become the standard means for biologists to characterize genome-wide protein-DNA interaction patterns. Among proteins, the interaction between different histone marks and DNA has received a lot of attention due to their critical roles in epigenetic regulation of chromatin conformation and gene expression[6, 27–29]. Histones are the major protein components of chromatin in eukaryotic cells. In the nuclei, DNA molecules wrap around an octamer of histones to form the nucleosomes[29]. Each histone molecule has two “tails” which can be post-translationally modified. These modifications change the 3D conformation of the histone molecules, which consequently affects the accessibility of the resident DNA to RNA polymerase and thus controls gene transcription[5, 6, 25, 27, 30].

Among the large variety of different post-translational modifications, methylation of histone 3 (H3) at the 4th lysine position has been extensively studied[31]. Depending on the number of methyl groups attached to the lysine residue, there are three common forms, namely H3K4me1, H3K4me2 and H3K4me3. It has been shown that while H3K4me1 preferentially exists around enhancer regions and H3K4me3 preferentially exists around gene promoter regions, H3K4me2 exists in both enhancer and promoter regions with a positive role in facilitating gene transcription. Using the ChIP-seq technique, not only can the level of H3K4me2 on the gene promoter be measured, but the patterns of H3K4me2 distribution over certain genomic regions have also been shown to be related to gene transcriptional activities[2, 21]. Specifically, it has been observed that a “long tail” distribution pattern of H3K4me2 over the promoter regions is indicative of tissue specific gene expression[15]. Recently it was observed that such long-tail patterns of H3K4me2 can be acquired at different stages of tissue differentiation, including as early as the embryonic stem cell stage[16].

However, most studies to date are based on only one or two types of tissues. Given the availability of the huge amount of ChIP-seq data generated by the ENCODE project, it is now possible to survey the H3K4me2 distribution patterns over multiple tissue types. In this study, six different cell lines were selected from five major tissue types (including muscular tissue, nervous tissue, non-blood connective tissue, blood, and epithelial tissue) as well as embryonic stem cells.

Specifically, by formally defining a parameter called cluster tail length and using it to gauge the strength and integrity of different clusters of genes obtained from an unsupervised clustering algorithm, a previous observation was verified: gene clusters with long-tail H3K4me2 distribution patterns are enriched with tissue specific genes [15]. A positive correlation between cluster tail length and mean gene expression level was thereby established. More importantly, the interactions among the genes and their protein products in each long-tail cluster were investigated by using PPI network analyses, and a positive correlation between the cluster tail length and the density of the PPI network in each gene cluster was discovered.

Figure 7. Workflow of the analysis used in Chapter 3.

Furthermore, a set of 217 core genes was identified, which have ubiquitous long-tail patterns in all the tested tissues as well as the embryonic stem cell (ESC). These genes form a highly dense PPI sub-network (18 times denser than the overall PPI network) and they have an even stronger connection with other tissue specific genes, suggesting a fundamental regulatory role of this particular gene group. In fact, gene set enrichment analyses show that these genes are highly enriched with lethal and developmental defect genes, establishing fundamental roles for these genes in tissue development. These findings suggest that histone mark data, when combined with functional genomics data, can lead to the discovery of key genes and networks in epigenetic and gene regulation as well as potential lethal and disease genes.

3.2 Material and Methods

3.2.1 Workflow

The data analysis workflow includes three major steps, as shown in Figure 7. First, ChIP-seq data of H3K4me2 generated from multiple cell lines are downloaded from repositories, pre-processed and clustered based on H3K4me2 distribution patterns over the promoter and downstream regions in the genome. Then, each cluster is studied in detail by calculating the cluster tail length and correlating it with other cluster features such as cluster functional enrichment and gene expression level, followed by examination of cluster PPI density. Finally, the tissue specificity of the long-tail genes is confirmed and genes with a long-tail distribution of H3K4me2 in all datasets are identified. Each step is described in detail in the following sections.

H3K4me2 ChIP-seq datasets pre-processing and clustering

Six H3K4me2 ChIP-seq datasets of aligned short reads from the ENCODE project are downloaded from NCBI Gene Expression Omnibus with accession numbers listed in

33

Table I. Each dataset corresponds to either embryonic stem cells or one of the five basic tissue types identified: muscular tissue, nervous tissue, non-blood connective tissue, blood, and epithelial tissue. The 36-bp short reads were generated using Illumina GA sequencing platform. In this analysis, the aligned short reads were extended by 200 bp.

Histograms of extended read counts over a 10-kb genomic region around the transcription start site (TSS) (i.e., -2 kb to +8 kb of the TSS) were created using a bin size of 20 bp. TSS loci were extracted using the RefSeq hg19 reference genome available through the UCSC genome browser.
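The read extension and binning described above can be sketched as follows. This is a minimal illustration, not the pipeline actually used: the function name `tss_profile`, the `(position, strand)` read representation, the assumption that the gene lies on the forward strand, and the rule of counting a read in every bin it overlaps are all assumptions. "Extended by 200 bp" is interpreted here as each read covering 200 bp from its 5' end.

```python
def tss_profile(reads, tss, extend=200, upstream=2000, downstream=8000, bin_size=20):
    """Count extended reads per bin over [tss - upstream, tss + downstream).

    reads: iterable of (five_prime_position, strand) with strand '+' or '-'.
    Assumes (hypothetically) a forward-strand gene and overlap-based counting.
    """
    window_start = tss - upstream
    n_bins = (upstream + downstream) // bin_size  # 500 bins for the defaults
    window_end = window_start + n_bins * bin_size
    counts = [0] * n_bins
    for pos, strand in reads:
        # Extend the read in its own direction; intervals are half-open.
        if strand == "+":
            start, end = pos, pos + extend
        else:
            start, end = pos - extend + 1, pos + 1
        # Clip the extended read to the profiled window.
        lo, hi = max(start, window_start), min(end, window_end)
        if lo >= hi:
            continue  # read does not overlap the window
        first_bin = (lo - window_start) // bin_size
        last_bin = (hi - 1 - window_start) // bin_size
        for b in range(first_bin, last_bin + 1):
            counts[b] += 1
    return counts


# Usage: one forward read starting exactly at the TSS.
profile = tss_profile([(10000, "+")], tss=10000)
```

With the default parameters, each gene is represented by a 500-dimensional vector, which is the input to the clustering step.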

Data Set Accession Number   Cell Description               Tissue Type
GSM733769                   B-lymphocyte, lymphoblastoid   Blood
GSM733781                   Lung fibroblasts               Connective
GSM733686                   Epidermal keratinocytes        Epithelial
GSM733768                   Skeletal muscle myoblasts      Muscular
GSM908957                   Neural progenitor cells        Nervous
GSM733670                   Embryonic stem cells           Stem cells

Table 2. Six H3K4me2 ChIP-seq data sets with listed accession numbers are downloaded from NCBI Gene Expression Omnibus.

The unsupervised clustering method K-means, implemented in Matlab, was applied to cluster the TSS region profiles of 41,413 genes in each dataset (K = 7, repeats = 50, distance = squared Euclidean). Detailed information on the sizes of all resulting clusters is listed in Table 12 in Appendix A. While the K-means algorithm is widely used in machine learning and bioinformatics applications, the choice of the cluster number K is often ad hoc, based on prior knowledge, assumptions, and empirical experience [32]. In previous studies on clustering chromatin modification marks, various K values were adopted. For instance, K was set to 5 for H3K4me2 marks in [15], and to 3 for clustering multiple chromatin modification marks at enhancer regions and 4 for promoters in [2]. In this study, a wide range of values for K (from 2 to 30, see Fig. 5) was surveyed, and the optimal choice was determined based on two widely adopted criteria: the sum of squared Euclidean distances (SSE) and the silhouette coefficient. The optimal K value is identified as the knee point of the curves, i.e., the point where successive decreases in silhouette coefficient and SSE become noticeably smaller and fall below a certain threshold.

From the plots, 7 appears to be an optimal choice of K.
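The knee-point selection on the SSE curve can be illustrated with a small, self-contained sketch. It is not the Matlab procedure used in this study: the deterministic farthest-point initialisation (in place of 50 random repeats), the function names, and the relative-drop `threshold` of 0.1 are assumptions chosen only to make the idea concrete on toy data.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def init_centroids(points, k):
    """Farthest-point initialisation (an assumption; the study used random repeats)."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        nxt = max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        centroids.append(list(nxt))
    return centroids


def kmeans_sse(points, k, iters=100):
    """Run Lloyd's algorithm and return the sum of squared distances to centroids."""
    centroids = init_centroids(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            groups[j].append(p)
        updated = [
            [sum(col) / len(g) for col in zip(*g)] if g else centroids[j]
            for j, g in enumerate(groups)
        ]
        if updated == centroids:
            break  # assignments are stable
        centroids = updated
    return sum(min(sq_dist(p, c) for c in centroids) for p in points)


def knee(sse_by_k, threshold=0.1):
    """Smallest K after which the SSE drop, relative to the first SSE, is below threshold."""
    ks = sorted(sse_by_k)
    total = sse_by_k[ks[0]]
    for k_prev, k_next in zip(ks, ks[1:]):
        if (sse_by_k[k_prev] - sse_by_k[k_next]) / total < threshold:
            return k_prev
    return ks[-1]


# Toy data: three well-separated groups of three points each.
points = [[0, 0], [1, 0], [0, 1], [10, 0], [11, 0], [10, 1], [0, 10], [1, 10], [0, 11]]
sse = {k: kmeans_sse(points, k) for k in range(1, 6)}
best_k = knee(sse)
```

On real profiles, the same sweep would be run for K from 2 to 30 and read together with the silhouette curve rather than from SSE alone.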

Analysis of gene clusters features

Multiple analyses were performed to study the clusters identified in each tissue sample, including a study of the cluster tail length of the H3K4me2 profiles, gene set functional enrichment analysis, an estimation of the average gene expression level in each cluster, co-expression coefficients of the genes, and their corresponding protein-protein interaction (PPI) density.

Cluster tail length

In order to quantitatively compare the H3K4me2 patterns of all clusters, a metric called 'cluster tail length' is calculated for each cluster. Given the average H3K4me2 distribution pattern of a cluster, the loci with the highest and lowest H3K4me2 signals downstream of the TSS are identified, and the average signal of these two loci is calculated. A midpoint is then located in the distribution pattern as the first locus whose H3K4me2 signal matches this average. The cluster tail length is calculated as the horizontal distance from the TSS to the midpoint, as shown in Figure 8.
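A minimal sketch of the tail length calculation is given below. The text does not state on which side of the peak the midpoint is searched, so this sketch assumes the first bin at or after the downstream peak whose signal has dropped to the midpoint level; that choice, the function name, and the bin layout (-2 kb to +8 kb in 20-bp bins) are assumptions.

```python
def cluster_tail_length(profile, upstream=2000, bin_size=20):
    """Tail length in bp for an average cluster profile (one value per bin).

    Assumption: the midpoint is the first bin at or after the downstream peak
    whose signal is at or below the mean of the highest and lowest signals.
    """
    tss_bin = upstream // bin_size
    downstream = profile[tss_bin:]          # bins at and after the TSS
    highest, lowest = max(downstream), min(downstream)
    mid_level = (highest + lowest) / 2
    peak = downstream.index(highest)
    for offset in range(peak, len(downstream)):
        if downstream[offset] <= mid_level:
            return offset * bin_size        # horizontal distance from the TSS
    return len(downstream) * bin_size       # signal never drops to the midpoint


# Toy profile: flat upstream, then a decaying signal downstream of the TSS.
toy_profile = [0] * 100 + [10, 10, 8, 6, 4, 2, 0, 0]
tail = cluster_tail_length(toy_profile)
```

A cluster whose signal decays slowly after the TSS reaches the midpoint level later and therefore receives a longer tail length.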

Functional enrichment analysis

GO term enrichment analyses were performed with DAVID (http://david.abcc.ncifcrf.gov). Other functional enrichments of selected genes are analyzed using the ToppGene Suite (http://toppgene.cchmc.org/).

Figure 8. Diagram of tail length definition.

Gene expression level study

To further investigate the gene expression level of clusters, five independent microarray datasets are obtained from NCBI Gene Expression Omnibus. Each data set corresponds to one tissue type. More information on these data sets is listed in Table 3.


Data Set Accession Number   Cell Description          Tissue Type   Sample Size
GDS3713                     B-lymphocytes             Blood         60
GSE15359                    Lung fibroblasts          Connective    6
GDS4426                     Epidermis                 Epithelial    6
GDS4104                     Skeletal muscle           Muscular      5
GSE13307                    Neural progenitor cells   Nervous       4

Table 3. Five microarray data sets with listed accession numbers are downloaded from NCBI Gene Expression Omnibus.

Interaction density based on PPI network

The interactions within and among the gene groups were studied based on the PINA PPI network database (downloaded from PINA on Jan 2, 2013). PINA integrates curated PPI data from six sources (IntAct, BioGRID, MINT, DIP, HPRD, and MIPS/MPact) and thus decreases the chance of bias caused by the choice of a specific PPI database. The interaction density of one gene cluster is defined as follows:

$$D = \frac{N_I}{N_C (N_C - 1) / 2}$$

where $N_I$ is the number of confirmed interactions between proteins encoded by genes within this particular cluster, and $N_C$ is the number of genes in this particular cluster.

Therefore the interaction density is a metric ranging from 0 to 1 with 0 indicating no interaction recorded and 1 indicating an interaction exists between every pair of proteins.

Similarly, the interaction density between two different gene groups is defined as the number of interactions divided by the total number of protein pairs, as given by the following formula:

$$D = \frac{N_I}{N_{C1} \times N_{C2}}$$

where $N_I$ is the number of interactions between the two gene clusters recorded in the PPI network. Each counted interaction involves two genes, one from each gene cluster. Finally, $N_{C1}$ and $N_{C2}$ are the numbers of genes in each cluster, which are also the numbers of proteins encoded by the genes in the corresponding cluster.
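Both density definitions translate directly into code. The sketch below is illustrative only: it assumes the PPI network is available as a set of unordered gene pairs (`frozenset` objects), that the two groups in the between-group case are disjoint, and that the function names are hypothetical.

```python
from itertools import combinations


def within_density(genes, interactions):
    """Interactions among the genes divided by the number of gene pairs."""
    genes = sorted(set(genes))
    n = len(genes)
    if n < 2:
        return 0.0
    n_i = sum(1 for a, b in combinations(genes, 2) if frozenset((a, b)) in interactions)
    return n_i / (n * (n - 1) / 2)


def between_density(group1, group2, interactions):
    """Interactions linking the two (assumed disjoint) groups over all cross pairs."""
    g1, g2 = set(group1), set(group2)
    if not g1 or not g2:
        return 0.0
    n_i = sum(1 for a in g1 for b in g2 if frozenset((a, b)) in interactions)
    return n_i / (len(g1) * len(g2))


# Toy network with placeholder gene names: A-B, B-C, C-D.
ppi = {frozenset(pair) for pair in [("A", "B"), ("B", "C"), ("C", "D")]}
d_within = within_density(["A", "B", "C"], ppi)           # 2 of 3 pairs interact
d_between = between_density(["A", "B"], ["C", "D"], ppi)  # 1 of 4 cross pairs interacts
```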

The statistical significance of a cluster's interaction density is estimated by comparing it with the interaction densities of randomly simulated gene groups. Given a gene cluster of size N, 100 groups of N genes are randomly selected from all genes in the protein-protein interaction network, and the interaction density of each group is calculated. An interaction density distribution is then constructed from the densities of all simulated gene groups. The statistical significance of a given cluster is characterized by a Z-score and a p-value, which are estimated from this constructed distribution.
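The random-group comparison can be sketched as below. The text does not specify how the p-value is derived from the simulated distribution, so the one-sided normal approximation used here is an assumption, as are the function names, the fixed random seed, and the toy network.

```python
import math
import random
import statistics
from itertools import combinations


def within_density(genes, interactions):
    """Interactions among the genes divided by the number of gene pairs."""
    genes = sorted(set(genes))
    n = len(genes)
    if n < 2:
        return 0.0
    n_i = sum(1 for a, b in combinations(genes, 2) if frozenset((a, b)) in interactions)
    return n_i / (n * (n - 1) / 2)


def density_z_score(cluster, all_genes, interactions, n_random=100, seed=0):
    """Z-score and one-sided p-value (normal approximation, an assumption)."""
    rng = random.Random(seed)
    observed = within_density(cluster, interactions)
    null = [
        within_density(rng.sample(all_genes, len(cluster)), interactions)
        for _ in range(n_random)
    ]
    mean, sd = statistics.mean(null), statistics.stdev(null)
    if sd == 0:
        raise ValueError("simulated densities have zero variance")
    z = (observed - mean) / sd
    p = 0.5 * math.erfc(z / math.sqrt(2))  # upper-tail probability of N(0, 1)
    return z, p


# Toy network: a sparse chain of 40 placeholder genes plus a clique on the first five.
genes = [f"G{i}" for i in range(40)]
ppi = {frozenset((genes[i], genes[i + 1])) for i in range(39)}
ppi |= {frozenset(pair) for pair in combinations(genes[:5], 2)}
z, p = density_z_score(genes[:5], genes, ppi)
```

An empirical p-value (the fraction of simulated groups at least as dense as the observed cluster) would be an alternative, but with only 100 simulations its resolution is limited to 0.01.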


Figure 9. Average H3K4me2 distribution signal in clusters identified by K-means clustering.

3.3 Results

3.3.1 H3K4me2 profiles distinguish different groups of genes in cells of all tissue types

As mentioned earlier, previous studies have shown that H3K4me2 profiles at the TSS regions of genes distinguish different classes of genes in human CD4+ cells and mature mouse brain cells. While CD4+ cells could be considered blood tissue cells and mature brain cells nervous tissue cells, a similar conclusion could not be reached for cells of other tissue types. In this study, all five basic human tissue types and embryonic stem cells are included. This study confirmed previous findings [15, 16] that tissue-specific genes are highly enriched in the cluster of genes with prolonged H3K4me2 marks, or long-tail profiles; it also confirmed that this group of genes displays a higher level of gene expression than those in the remaining clusters. Furthermore, this study shows that the long-tail genes consistently display higher interaction densities in all tissue types.

In each data set associated with one of the five tissue types (muscular tissue, nervous tissue, non-blood connective tissue, blood, and epithelial tissue) or embryonic stem cells, seven clusters of genes with distinctive H3K4me2 binding patterns were identified by K-means clustering. The average H3K4me2 distribution patterns for the clusters of skeletal muscular tissue cells are plotted in Figure 9A, along with the plots of other cluster features. Here all clusters are arranged by tail length in descending order, so cluster 1 has the distribution pattern with the longest tail. In the rest of this chapter, the cluster with the longest tail in the corresponding tissue sample is always referred to as ‘cluster 1’. Figure 9B shows the gene expression level of each cluster. To compare the gene expression levels of the clusters, the Wilcoxon rank sum (i.e., Mann–Whitney) test was used. This non-parametric test is chosen instead of the t-test to minimize the effect of outlying observations. P-values from this test are marked between the tested pairs of clusters. As shown in Figure 9C, as the cluster tail length decreases, the level of gene expression also decreases. In addition, the PPI interaction density decreases as cluster tail length decreases, as shown in Figure 9B. The same trend has been observed in all the tissues examined in this study. In certain cases, interaction density can distinguish different clusters that have similar levels of gene expression. For instance, in skeletal muscular tissue cells, cluster 1 and cluster 2 have similar gene expression levels, yet their estimated interaction densities are quite different.

H3K4me2 tail length positively correlates with gene expression levels and interaction densities. The previous results show that H3K4me2 patterns distinguish genes with different features. In order to further investigate the differences in H3K4me2 patterns among all clusters, the cluster tail length metric (see Methods) was used to quantitatively characterize the H3K4me2 pattern of each cluster. Results show that the average gene expression level of a cluster increases as the cluster tail length increases (Spearman rank correlation coefficient of 0.85 for skeletal muscular tissue cells). Moreover, as cluster tail length increases, the PPI interaction density increases as well (Spearman rank correlation coefficient of 0.97 for skeletal muscular tissue cells). In all datasets, cluster 1, with its long-tail profile, has the highest gene expression level and interaction density. Other clusters display lower gene expression levels and interaction densities as the cluster tail length decreases, as plotted in Figure 10. The same trend, with similar Spearman rank correlation coefficient values, has been observed in all datasets investigated in this study.
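For reference, the Spearman rank correlation used above is the Pearson correlation of the rank-transformed values. The sketch below is a generic standard-library implementation with average ranks for ties; the function names and the numbers in the usage example are illustrative and are not values from this study.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Illustrative values only: per-cluster tail lengths and a made-up cluster feature.
tail_lengths = [1, 2, 3, 4, 5]
feature = [1, 3, 2, 5, 4]
rho = spearman(tail_lengths, feature)
```

Because only ranks enter the calculation, the coefficient is insensitive to the scale of expression values or densities, which suits a comparison across only seven clusters.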


Figure 10. Gene expression level (A) and interaction density (B) are plotted against tail length.

Long-tail genes are tissue specific. Clustering analysis identified one cluster with a prolonged H3K4me2 tail in the transcribed region in all five different tissues. The top five enriched GO terms with the lowest p-values from each tissue sample are listed in Table 4. With the exception of epithelial tissue (from the epidermal keratinocyte cell line), genes in cluster 1 from the remaining four tissues display strong tissue specificity that matches the cell type. For the epithelial tissue sample, the genes are enriched with functions such as ectoderm and tube development, which are consistent with its tissue specificity and origin, while the most significantly enriched functions are related to blood vessel development (see Discussion section).

3.3.2 Core group: Common cluster 1 genes in all tissue types

As the previous results show, cluster 1, with its long-tail H3K4me2 pattern, contains highly expressed genes that actively interact with other genes. Based on previous findings, it is likely that cluster 1 genes carry out developmental/differentiation functions in each specific tissue.

Data Set Accession Number   Tissue Type   Enriched Functions in Cluster 1 (p-value)
GSM733769                   Blood         Leukocyte activation (4.7E-11)
                                          Lymphocyte activation (9.4E-11)
                                          Immune system process (1.0E-8)
                                          T cell activation (2.0E-8)
                                          Hemopoietic or lymphoid organ development (2.8E-8)
GSM733781                   Connective    Bone development (2.1E-8)
                                          Osteoblast differentiation (5.5E-5)
                                          Regulation of osteoblast differentiation (7.3E-5)
                                          Cartilage development (1.6E-4)
                                          Positive regulation of osteoblast differentiation (3.0E-3)
GSM733686                   Epithelial    Vasculature development (2.0E-15)
                                          Blood vessel development (7.1E-15)
                                          Blood vessel morphogenesis (7.3E-15)
                                          Ectoderm development (3.1E-7)
                                          Tube development (3.6E-7)
GSM733768                   Muscular      Skeletal system development (2.6E-18)
                                          Skeletal system morphogenesis (2.3E-12)
                                          Muscle organ development (1.6E-11)
                                          Embryonic skeletal system development (1.1E-9)
                                          Embryonic skeletal system morphogenesis (2.2E-7)
GSM908957                   Nervous       Nervous system development (8.2E-50)
                                          Neurogenesis (1.7E-38)
                                          Generation of neurons (8.8E-37)
                                          Neuron differentiation (4.6E-30)
                                          Neuron development (7.5E-22)

Table 4. Top enriched GO terms of cluster 1 from each cell line.

Since cluster 1 genes are highly interactive, a mutation in these genes might lead to loss of protein interactions, and possibly to diseases or dysfunctions. Since certain genes display long tails in multiple data sets, the hypothesis is that a gene with a long-tail H3K4me2 pattern in multiple cell types may play key roles in tissue differentiation and development. Mutations in such genes could cause more serious dysfunctions than mutations in other genes.

Core group identification: To test our hypothesis, recurrent cluster 1 genes, i.e., genes that display a long-tail H3K4me2 pattern in all five tissue types and in embryonic stem cells, are identified. In other words, the core group consists of the intersection of the cluster 1 gene sets from all datasets. Gene symbols of all genes contained in the core group are listed in Table.
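The core group definition is a plain set intersection. The sketch below assumes a hypothetical dictionary that maps each data set to its cluster 1 gene symbols; the data set labels and gene names are placeholders, not results.

```python
def core_group(cluster1_by_dataset):
    """Genes that appear in cluster 1 of every data set."""
    gene_sets = [set(genes) for genes in cluster1_by_dataset.values()]
    return set.intersection(*gene_sets) if gene_sets else set()


# Placeholder memberships for three hypothetical data sets.
cluster1 = {
    "blood": ["geneA", "geneB", "geneC"],
    "muscular": ["geneA", "geneC", "geneD"],
    "stem_cells": ["geneC", "geneA", "geneE"],
}
core = core_group(cluster1)  # {"geneA", "geneC"}
```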

Core group of genes displays higher interaction density: Furthermore, the PPI density within the core group is even higher than that of cluster 1 from any tissue type, as listed in Table 5. The PPI density of this core group is 7.2E-3, which is about three times the density of cluster 1 in these tissue/cell types and about 18 times the average PPI density of the overall PPI network. More interestingly, the PPI density between this core group and the remaining genes in cluster 1 of any given data set is around 16.0E-3, compared to an interaction density of about 2.5E-3 for cluster 1 itself. This result shows that the core group forms a highly interactive protein network and maintains active interactions with all tissue-specific gene clusters. It should be noted that all other genes form an even less dense PPI network around cluster 1 genes.
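As a quick arithmetic check of the ratios quoted above, take the core group density of 7.2E-3, a typical cluster 1 density of 2.5E-3, the core-to-cluster-1 density of 16.0E-3, and the average network density of 3.9E-4 reported with the PPI density table:

$$\frac{7.2\times10^{-3}}{2.5\times10^{-3}} \approx 2.9, \qquad \frac{7.2\times10^{-3}}{3.9\times10^{-4}} \approx 18.5, \qquad \frac{16.0\times10^{-3}}{2.5\times10^{-3}} = 6.4$$

These agree with the statements of roughly three-fold and 18-fold enrichment, and show that core genes interact with the remaining cluster 1 genes about six times more densely than cluster 1 genes interact among themselves.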


Tissue Type            C1 (*10^-3)   C1*: C1-core genes (*10^-3)   C1* vs. core genes (*10^-3)
Blood                  2.1           1.9                           19.3
Connective             2.6           2.3                           15.8
Epithelial             2.7           2.4                           15.7
Muscular               2.8           2.6                           19.2
Nervous                2.5           2.1                           13.7
Embryonic stem cells   2.4           2.1                           16.6
Core group             7.2

Average interaction density of PPI network is 3.9*10^-4

Table 5. PPI density of cluster 1 from all tissue types.

Gene set enrichment of core group: The fact that the 217 genes from the core group are active in all tissues and developmental stages, and that they interact strongly among themselves, suggests that these genes are involved in important biological functions. Detailed enrichment analysis results are listed in Table 14 in Appendix A. As predicted, these 217 genes are highly enriched in fundamental biological processes, including regulation of gene expression (e.g., GO:0010629, negative regulation of gene expression, with unadjusted p-value 1.442E-12; GO:0000122, negative regulation of transcription from RNA polymerase II promoter, with p-value 6.086E-13; and GO:0010628, positive regulation of gene expression, with p-value 2.790E-9), RNA splicing (e.g., GO:0000398, mRNA splicing via spliceosome, with p-value 8.073E-7), embryo development (GO:0009790, p-value 1.978E-6), and histone modification (GO:0016570, p-value 1.246E-5). Among these genes, twenty-nine are known to have transcription factor activity, including TGIF1, SMAD7, ATF4, and FOXO3.


Genes from this group are also significantly enriched in important signaling pathways, such as regulation of nuclear SMAD2/3 signaling (PID, p-value 4.680E-7), hedgehog signaling (PantherDB, p-value 2.752E-6), BMP receptor signaling (PID, p-value 4.663E-6), and PI3K-Akt signaling (KEGG, p-value 5.643E-6). Based on the above observations, this core group appears to consist mostly of genes essential for development and differentiation. In fact, mouse phenotype analysis indicates that this group is highly enriched with embryonic lethal genes: defects in 56 of the 217 identified genes are known to cause lethal effects in mouse embryonic development (p-value 5.414E-6).

The enrichment analysis further reveals how this group of genes, which perform critical biological functions, is regulated. Specifically, this group is enriched with targets of well-known development- and differentiation-related transcription factors: YY1 (23 targets, p-value 7.365E-11), PAX4 (35 targets, p-value 4.875E-7), MYC (28 targets, p-value 2.888E-6), ATF4 (13 targets, p-value 3.870E-6), OCT1 (12 targets, p-value 7.754E-6), and TEF1 (15 targets, p-value 1.276E-5). Furthermore, this group is highly enriched with predicted targets of specific microRNAs, such as hsa-miR-570 (predicted by PITA, 42 targets, p-value 7.091E-15), the hsa-miR-548 family (nine members, predicted by PITA, p-values ranging from 5.238E-13 to 2.353E-8 with 25 to 39 predicted targets), and hsa-miR-802 (predicted by PITA, 23 targets, p-value 3.032E-10), to name a few.

3.4 Discussion and Summary

This study performed clustering analysis on H3K4me2 distribution patterns around the TSS regions of genes across the whole genome. Results confirmed in all tissue types that a heavy presence of H3K4me2 downstream of the TSS marks tissue-specific genes, and that in each dataset the cluster containing long-tail genes shows higher gene expression levels than other clusters. The only exception is that the cluster 1 genes detected in epidermal keratinocytes are not only enriched with ectoderm and tube development functions, which are consistent with the tissue type, but are even more significantly enriched with functions such as blood vessel development. While this may seem contradictory, it should be noted that while endothelial cells line the inside of blood vessels, a thin layer of serosa composed of epithelium surrounds arteries. In addition, the similar role of lining tubular structures (endothelial cells for blood vessels and epithelial cells for lumens) suggests that these active genes marked by H3K4me2 may be function specific rather than tissue specific.

In addition to confirming the tissue-specific functions, the cluster tail length was defined as a quantitative measure of the H3K4me2 distribution pattern. The study shows that the tail length is highly correlated with other cluster features, such as gene expression and cluster PPI density. This study revealed that genes with the longest tail pattern, i.e., a heavy presence of H3K4me2 downstream of the TSS, are highly interactive at the protein level, suggesting that they are involved in related tissue-specific functions.

Furthermore, this study identified a small group of 217 genes that show consistent long-tail H3K4me2 patterns from stem cells to all the examined differentiated tissue cells. Contrary to previous findings, these genes are mostly essential genes enriched with fundamental biological functions. Mouse phenotype analysis of this group indicates that they are enriched with embryonic lethality, implying their fundamental importance. In addition, the fact that they are highly interactive with all the tissue-specific gene groups suggests that they are not only critical in early developmental stages but also play pivotal roles in tissue differentiation and maintenance.

An interesting observation concerns the regulation of this group of 217 genes. While many of them are transcription factors or co-factors themselves, they are strongly regulated by well-known development- and differentiation-related transcription factors such as YY1, PAX4, and MYC, suggesting a hierarchy in the regulation of stem cell differentiation. Furthermore, many of these genes appear to be predicted targets of certain microRNAs. For instance, hsa-miR-548d-3p is predicted to regulate 33 of the 217 genes. Recently, it has been shown that hsa-miR-548d is a superior regulator in pancreatic cancer [33]. Similarly, mutations in binding sites for another highly enriched microRNA, hsa-miR-570, have been shown to be strongly associated with the risk of gastric cancers [34, 35]. These results can shed light on the potentially important roles of these microRNAs in diseases.

Overall, by carrying out a comparative analysis of H3K4me2-marked genes among multiple tissues and integrating functional genomics and PPI data, it is demonstrated that the role of H3K4me2 is not limited to individual genes but extends to protein network activities with important regulatory mechanisms. While the method presented in this study is specifically designed for H3K4me2, it demonstrates that ChIP-seq data contain rich information, such as the distribution patterns and shapes of protein and histone marks. Therefore, pattern recognition and signal processing methods can play important roles in extracting such information from these data. It can be expected that more sophisticated analyses of more histone marks and regulatory proteins will yield even deeper insight into the relationships between epigenetic regulation and molecular network activities in the future.


Chapter 4: Identify recurrent patterns of H3K4me2 at promoters of Critical lncRNAs across Multiple Tissues

Non-coding RNAs (ncRNA) have long been known to regulate gene expression in cell development and differentiation. Long non-coding RNAs (lncRNA) typically occupy large portions of non-coding regions and have attracted much attention. However, the epigenetic regulation of lncRNA transcription remains unclear. Previously, it has been shown that a distinct H3K4me2 distribution pattern indicates highly expressed, tissue-specific genes. Therefore, the hypothesis is that this pattern may also indicate activation of lncRNA transcription. By integrating next generation sequencing and microarray datasets of cell lines derived from multiple tissue types, it is shown that promoter regions of lncRNAs also display prolonged H3K4me2 signals downstream of their transcription start sites (TSS). Further, it is shown that the epigenetic regulation mechanism is similar to that for protein-coding genes. Long-tail lncRNAs display higher expression levels. They also co-express with tissue-specific and highly interactive genes. A core group of 182 lncRNAs was identified based on the fact that they display ubiquitous long-tail profiles in multiple tissue types. Functional enrichment analyses show that lncRNAs from this core group play essential roles in tissue development and differentiation.

Further analysis shows that the long-tail lncRNAs are also highly conserved across species. Consequently, mutations of these long-tail lncRNAs may be implicated in serious diseases and dysfunctions. Distinct patterns of H3K4me2 at lncRNA promoter regions indicate active transcription of lncRNAs. LncRNAs displaying this long-tail pattern are essential to tissue development and differentiation. These results suggest that characterization of histone modifications, when combined with the expression levels of lncRNAs, can reveal new insights into the epigenetic regulation of lncRNAs.

4.1 Background

Although the human genome is pervasively transcribed, protein-coding genes account for only approximately 2% of the whole genome [36]. Studies have shown that the non-protein-coding portion of the genome plays an essential role in the regulation of gene expression [37]. Long non-coding RNAs (lncRNAs) account for the largest portion of non-coding RNAs, yet the function and regulation of most lncRNAs remain uncertain [38, 39]. In this work, the aim is to study the epigenetic regulation of lncRNAs and their biological implications.

Usually, an lncRNA is defined as an RNA molecule longer than 200 nucleotides that is not translated into protein. Recent evidence shows that lncRNAs can regulate gene expression [40–42] and mediate chromatin changes [43]. In addition, mutations of lncRNAs have been linked to a variety of human diseases [44]. Evidence suggests that lncRNA expression shows even more cell specificity than that of their coding counterparts [45–50]. For instance, a group of lncRNAs shows subtype-specific expression in more than one cancer type [51]. Epigenetics plays an essential regulatory role in the expression level of lncRNAs as well as genes. Compared to that of their coding counterparts, the epigenetic regulation of lncRNA transcription has not been widely explored. Among various chromatin modifications, post-translational modification of histones is one of the most studied epigenetic regulation mechanisms. Previous studies established that histone modifications are associated with the expression of tissue-specific or developmental state-specific lncRNAs in mouse brain tissue [52]. Among these modifications, methylation of histone 3 (H3) at the 4th lysine position (H3K4) has been extensively studied [2, 30, 31]. Depending on the number of methyl groups attached to the lysine residue, there are three common forms: H3K4me1, H3K4me2, and H3K4me3. It has been shown that while H3K4me1 binding is preferentially detected around enhancer regions and H3K4me3 binding is preferentially detected around gene promoter regions, H3K4me2 binding is detected at both enhancer and promoter regions, with a positive role in facilitating gene transcription [2, 21]. Specifically, it has been observed that a “long tail” distribution pattern of H3K4me2 over the promoter regions is indicative of the expression of tissue-specific genes [15, 53]. Recently, it was discovered that such long-tail patterns of H3K4me2 can be acquired at different stages of tissue differentiation, including as early as the embryonic stem cell stage [16].

With the extensive amount of ChIP-seq data generated by the ENCODE project [10], it is now possible to study the H3K4me2 distribution pattern over multiple tissue types. In this study, six different cell lines were selected from five major tissue types (i.e., muscular, nervous, epithelial, non-blood connective tissues and blood) as well as embryonic stem cells (ESC) to investigate the patterns of H3K4me2 distribution at the promoter regions of lncRNAs and examine their tissue-specific functions. A pipeline utilizing a clustering methodology was implemented to identify recurrent patterns of H3K4me2 at lncRNA promoters. Specifically, it is shown that the previously observed long-tail H3K4me2 pattern at gene promoters also exists for lncRNAs and that these patterns are enriched with tissue-specific lncRNAs. It is also shown that these “long-tail” lncRNAs are highly conserved across species and that their mutations are linked to diseases and dysfunctions, which suggests that this group of lncRNAs plays an important role in the specific regulation of normal cell functions. Furthermore, a set of 182 core lncRNAs was identified based on the fact that they display ubiquitous long-tail patterns in all tested tissues as well as in ESC. These lncRNAs strongly co-express with essential developmental and housekeeping genes. Also, a conservation study shows that lncRNAs in this core group are conserved across 100 species with even higher conservation scores, implying fundamental roles for these lncRNAs. This core group is highly enriched with lncRNAs related to diseases and dysfunctions, suggesting fundamental regulatory roles for the group.

To reiterate, this study shows that long-tail lncRNAs carry essential gene regulatory powers across multiple tissue types. It also demonstrates that H3K4me2 regulates the transcription not only of protein-coding genes but also of lncRNAs. Overall, these results suggest that characterization of histone modifications, when combined with the expression levels of lncRNAs, can reveal new insights into the epigenetic regulation of lncRNAs.

4.2 Methods

4.2.1 Overview of Workflow

The data analysis workflow includes three major steps, as shown in Figure 11. It begins by downloading from repositories the H3K4me2 ChIP-seq data generated from multiple cell lines, which are then pre-processed and clustered based on the H3K4me2 distribution over the promoter and downstream regions of lncRNAs. Next, each cluster is examined in depth by calculating its tail length and expression level, and a core group of lncRNAs is identified as those displaying a ubiquitous long-tail profile in all cell lines.

Further, the biological functions of lncRNAs are inferred by studying the genes that are co-expressed with them. The conservation across species and disease-causing mutations of long-tail lncRNAs are also reported.

Figure 11. Diagram of workflow used in Chapter 4.


H3K4me2 ChIP-seq datasets clustering

Nine H3K4me2 and four PolII ChIP-seq datasets of aligned short reads from the ENCODE project are downloaded from the NCBI Gene Expression Omnibus, with accession numbers listed in Table 6. Each dataset pertains to either embryonic stem cells or one of the five basic tissue types we identified: muscular, nervous, epithelial, non-blood connective tissues, and blood. The 36-bp short reads were generated using an Illumina GA sequencing platform. In this analysis, the aligned short reads were extended by 200 bp as in [5, 12, 53, 54]. Histograms of extended read counts over a 10-kb genomic region around the transcription start site (TSS) (i.e., -2 kb to +8 kb of the TSS) were created using a bin size of 20 bp. The TSS loci were extracted using the RefSeq hg19 reference genome available through the UCSC genome browser.

Data Set Accession Number   Cell Description                                  Protein Type   Tissue Type
GSM733670                   Embryonic stem cells                              H3K4me2        Stem cells
GSM733769                   B-lymphocyte, lymphoblastoid                      H3K4me2        Blood
GSM733781                   Lung fibroblasts                                  H3K4me2        Connective
GSM733768                   Skeletal muscle myoblasts                         H3K4me2        Muscular
GSM908957                   Neural progenitor cells                           H3K4me2        Nervous
GSM733686                   Epidermal keratinocytes                           H3K4me2        Epithelial
GSM733671                   Epidermal keratinocytes                           PolII          Epithelial
GSM733683                   Human umbilical vein endothelial cells (HUVEC)    H3K4me2        Epithelial
GSM733749                   Human umbilical vein endothelial cells (HUVEC)    PolII          Epithelial
GSM733651                   Human erythroleukemic cells (K562)                H3K4me2        Blood
GSM733643                   Human erythroleukemic cells (K562)                PolII          Blood
GSM733734                   Human cervical cancer cells (Hela-S3)             H3K4me2        Epithelial

Table 6. Accession numbers of ChIP-seq data sets.


K-means clustering, implemented in Matlab, was applied to cluster the TSS region profiles of 24,327 long non-coding RNAs (lncRNAs) in each dataset (K = 5, repeats = 50, distance metric = squared Euclidean). Detailed information on the sizes of all resulting clusters is listed in Table 15 in Appendix B. In previous studies on clustering chromatin modification marks, various K values were adopted. For instance, K was set to 5 for H3K4me2 marks [15], and to 3 for clustering multiple chromatin modification marks at enhancer regions and 4 for promoters [2]. In this study, a wide range of values was surveyed for K (from 2 to 30, see Figure 38 in Appendix B), and the optimal choice was determined based on two widely adopted criteria: the sum of point-to-centroid distances and the silhouette coefficient. The optimal K value is identified as the knee point of the curves, i.e., the point where successive decreases become noticeably smaller and fall below a certain threshold. From the plots, 5 appears to be the optimal value. With the same choice of K, the clusters displaying distinct binding patterns are identified in each cell line. This choice of K not only follows common practice but also allows for the comparison of binding signals among all cell lines.
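The second criterion, the silhouette coefficient, can be computed from a clustering as sketched below. The sketch uses plain Euclidean distance via `math.dist`; whether the study used Euclidean or squared Euclidean distance for the silhouette is not stated, so that choice and the function name are assumptions. At least two clusters are required.

```python
import math


def silhouette_coefficient(points, labels):
    """Mean silhouette value over all points (requires at least two clusters)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster.
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(math.dist(p, q) for q in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


# Toy data: two tight, well-separated pairs of points.
pts = [(0, 0), (0, 1), (10, 0), (10, 1)]
good = silhouette_coefficient(pts, [0, 0, 1, 1])  # close to 1
bad = silhouette_coefficient(pts, [0, 1, 0, 1])   # negative
```

Values near 1 indicate compact, well-separated clusters, so plotting this quantity against K complements the point-to-centroid distance curve when locating the knee point.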

Analyses of lncRNA clusters

Multiple analyses were performed to characterize the identified clusters, including a study of the cluster tail length and an estimation of the average lncRNA expression level.

Functional enrichment analyses and PPI density analyses were also performed on the genes that co-expressed with long-tail lncRNAs, followed by the studies of conservation and mutations of long-tail lncRNAs.


Cluster Tail Length

This metric is adopted from our previously published study [53]. It captures the length of the genomic region that has an elevated H3K4me2 distribution signal, as shown in Figure 12.

Figure 12. Definition of tail length.

LncRNA Expression Level

Based on a recent study [51], previously published Affymetrix microarray gene expression data sets can be repurposed to study the expression level of lncRNAs. Specifically, out of the 54,675 probe sets for the Affymetrix HG-U133 Plus 2.0 GeneChip, 4,575 are identified to target lncRNAs. Therefore, to further investigate the expression levels of lncRNA clusters, five independent microarray datasets using the Affymetrix HG-U133 Plus 2.0 GeneChip were obtained from the NCBI Gene Expression Omnibus, as listed in Table 7. Each data set corresponds to one tissue type.


Data Set Accession Number | Sample Cell Description | Tissue Type | Sample Size
GSE46480 | B-lymphocytes | Blood | 98
GSE23066 | Lung fibroblasts | Connective | 5
GDS2611 | Epidermis | Epithelial | 3
GSE28422 | Skeletal muscle | Muscular | 110
GSE13307 | Neural progenitor cells | Nervous | 4
GSE7307 | 90 distinct cell types | Multiple | 504

Table 7. Accession numbers of microarray data sets for the expression level study and co-expression network construction (only normal samples are included).

Gene and lncRNA Co-expression

Genes co-expressed with long-tail lncRNAs were used as surrogates to infer the biological functions of long-tail lncRNAs. Co-expression between a gene and an lncRNA is measured by the Pearson Correlation Coefficient (PCC) between their expression values. First, the PCCs between all long-tail lncRNAs and genes are calculated; then the co-expressed genes are identified in each tissue type as those whose PCC exceeds a certain threshold. Finally, functional enrichment analyses on the co-expressed genes are carried out using the ToppGene Suite [55]. The enriched GO functions and human or mouse phenotypes of co-expressed genes are used to infer the biological functions of long-tail lncRNAs.
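The co-expression step can be sketched in Python as follows. The helper names (`pearson`, `coexpressed_genes`), the 0.8 threshold, and the toy expression values are assumptions for illustration; the thresholds used in this study are tissue dependent and are not reproduced here.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant profile carries no co-expression signal
    return cov / sqrt(var_x * var_y)


def coexpressed_genes(lncrna_expr, gene_expr, threshold=0.8):
    """Map each gene to the long-tail lncRNAs whose PCC with it exceeds
    `threshold`. Both inputs map a name to expression values measured
    over the same samples."""
    partners = {}
    for gene, gene_values in gene_expr.items():
        hits = [name for name, lnc_values in lncrna_expr.items()
                if pearson(lnc_values, gene_values) > threshold]
        if hits:
            partners[gene] = hits
    return partners


# Toy data: four samples, two lncRNAs, two genes (all values hypothetical).
lncrnas = {"lnc1": [1, 2, 3, 4], "lnc2": [4, 3, 2, 1]}
genes = {"geneA": [2, 4, 6, 8], "geneB": [1, 1, 2, 1]}
partners = coexpressed_genes(lncrnas, genes)
```

The number of lncRNA partners recorded for each gene is the quantity that later separates genes co-expressed with few long-tail lncRNAs from those co-expressed with many.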

Interaction density using the PPI network

To study the interactions within and among the gene groups, we used the PINA PPI network database [56]. PINA integrates curated PPI data from six sources (IntAct, BioGRID, MINT, DIP, HPRD and MIPS/MPact) and thus decreases the chance of bias caused by the choice of a specific PPI database. The interaction density of one gene group is defined as follows:

$$D_1 = \frac{N_I}{N_C (N_C - 1) / 2}$$

where $N_I$ is the number of confirmed interactions between proteins encoded by genes from this particular group, and $N_C$ is the number of genes in this group. The interaction density is a metric ranging from 0 to 1, with 0 indicating that no interaction is recorded and 1 indicating that an interaction exists between every pair of proteins. Similarly, the interaction density between two different gene groups is defined as the number of interactions divided by the total number of protein pairs, as in the following formula:

$$D_2 = \frac{N_I}{N_{C1} \times N_{C2}}$$

where $N_I$ is the number of confirmed interactions between the two gene groups recorded in the PPI network. Each counted interaction involves two genes, one from each of the two gene groups. Finally, $N_{C1}$ and $N_{C2}$ are the numbers of genes in the two groups respectively, which are also the numbers of proteins encoded by each group.
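As a minimal sketch, the two densities can be computed from an undirected edge list as shown below. The function names and the toy network are illustrative assumptions, and the two groups are assumed to be disjoint; the last line checks the within-group formula against the counts reported for the muscular G1 group in Table 8 (116 genes, 135 interactions).

```python
def within_group_density(edges, group):
    """D1: recorded interactions inside `group` divided by the number of
    possible gene pairs, N_C * (N_C - 1) / 2."""
    n_interactions = sum(1 for a, b in edges if a in group and b in group)
    n_genes = len(group)
    return n_interactions / (n_genes * (n_genes - 1) / 2)


def between_group_density(edges, group_1, group_2):
    """D2: recorded interactions linking the two (disjoint) groups divided
    by the number of cross-group pairs, N_C1 * N_C2."""
    n_interactions = sum(
        1 for a, b in edges
        if (a in group_1 and b in group_2) or (a in group_2 and b in group_1)
    )
    return n_interactions / (len(group_1) * len(group_2))


# Toy undirected PPI network (hypothetical gene names).
edges = [("A", "B"), ("B", "C"), ("A", "X"), ("C", "Y")]
g1, g2 = {"A", "B", "C"}, {"X", "Y"}
d1 = within_group_density(edges, g1)        # 2 of 3 possible pairs
d2 = between_group_density(edges, g1, g2)   # 2 of 6 possible pairs

# Same formula applied directly to counts: 135 interactions among 116 genes.
muscular_g1_d1 = 135 / (116 * 115 / 2)
```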

PhastCon Scores

PhastCon Score is a value that indicates the conservation rate of a base pair on the human genome by comparing it to 99 other species [57, 58]. The value of the PhastCon score ranges from 0 to 1. The higher the score, the more conserved the base pair. Since the focus of this study is the highly conserved base pairs of long-tail lncRNAs, the percentage of base pairs with a PhastCon score no less than 0.8 is calculated and plotted for all long-tail lncRNAs. Overall, 4.62% of the whole genome has a PhastCon score no less than 0.8.

Disease causing mutations by using the GWAS catalog

The GWAS (Genome-Wide Association Studies) Catalog was downloaded from the National Human Genome Research Institute (NHGRI) website [59][60]. The numbers of mutations that are associated with identified long-tail lncRNAs are recorded.

4.3 Results

4.3.1 H3K4me2 profiles distinguish different groups of lncRNAs in cells of all tissue types

Nine H3K4me2 and four PolII ChIP-seq datasets of aligned short reads from the ENCODE project were downloaded from the NCBI Gene Expression Omnibus with accession numbers listed in Table 6. Each dataset corresponds to either ESC or one of the five basic tissue types we identified: muscular, nervous, epithelial, non-blood connective tissues, and blood.


Figure 13. Clustering result based on data from skeletal muscular cells.

In each data set associated with one of the five tissue types (muscular, nervous, epithelial, non-blood connective tissues, and blood) or ESC, five clusters of lncRNAs with distinctive H3K4me2 binding patterns at promoter regions were identified using K-means clustering (see Figure S3 for more details). The average H3K4me2 distributions for clusters of muscular tissue cells are plotted in Figure 13A. Clusters are distinguished by a parameter called “tail length” (see the Methods section for details on the tail length definition), which characterizes the pervasiveness of H3K4me2 downstream of the transcription start site (TSS). Here all clusters are ranked by tail length in descending order. LncRNAs in the cluster with the longest tail length are designated as “long-tail” in each tissue type.

4.3.2 H3K4me2 tail length positively correlates with lncRNA expression levels

As established in Chapter 3, H3K4me2 profiles at TSS regions of protein-coding genes distinguish different classes of genes in all five basic human tissue types and ESC. Additionally, genes with prolonged H3K4me2 distribution at promoter regions consistently display higher expression levels than other genes. Figure 13C shows the expression level of each cluster. To compare the expression levels of the lncRNA clusters, the Wilcoxon rank sum (i.e. the Mann–Whitney) test was used. This non-parametric test was chosen over the t-test to minimize the effect of outliers. The p-values are marked for chosen pairs of clusters. In addition, this study shows that, as the tail length of the cluster increases, the level of lncRNA expression also increases (as shown in Figure 13D). Results for the muscular tissue show that the mean expression level of a cluster increases as the cluster tail length increases, with a Pearson’s Correlation Coefficient of 0.93 between the two. The same trend with similar Spearman Rank Correlation Coefficient values was observed in all five basic tissue type datasets investigated in this study. This study does not include ESC due to the unavailability of gene expression data of ESC.
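For illustration, a rank sum comparison between two clusters can be sketched in Python as below. This is a simplified version using the normal approximation without tie or continuity correction, which is an assumption of the sketch and not necessarily the variant used in this study; the expression values are hypothetical.

```python
from math import erfc, sqrt


def rank_sum_test(sample_a, sample_b):
    """Two-sided Wilcoxon rank sum (Mann-Whitney U) test.

    Uses the normal approximation with no tie or continuity correction.
    Returns (U statistic of sample_a, approximate p-value)."""
    pooled = sorted(
        (value, label)
        for label, sample in ((0, sample_a), (1, sample_b))
        for value in sample
    )
    rank_sum_a = 0.0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        # Tied values occupy ranks i+1 .. j and share their average rank.
        average_rank = (i + 1 + j) / 2
        rank_sum_a += average_rank * sum(
            1 for k in range(i, j) if pooled[k][1] == 0
        )
        i = j
    n_a, n_b = len(sample_a), len(sample_b)
    u_a = rank_sum_a - n_a * (n_a + 1) / 2
    mean_u = n_a * n_b / 2
    sd_u = sqrt(n_a * n_b * (n_a + n_b + 1) / 12)
    z = (u_a - mean_u) / sd_u
    return u_a, erfc(abs(z) / sqrt(2))


# Hypothetical expression values for a short-tail and a long-tail cluster.
short_tail = [1.0, 2.0, 3.0, 4.0, 5.0]
long_tail = [6.0, 7.0, 8.0, 9.0, 10.0]
u_stat, p_value = rank_sum_test(short_tail, long_tail)
```

Because the test works on ranks only, a single extreme expression value shifts the statistic no more than any other value on the same side, which is the robustness property motivating the choice over the t-test.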

H3K4me2 tail length of lncRNAs positively correlates with Pol II binding signals in epithelial tissue cells

RNA polymerase II (PolII) is known to play an important role in mRNA transcription; however, its role in lncRNA transcription remains unclear. To study this role, the distribution patterns of PolII at TSS regions of lncRNAs are examined in epidermal keratinocyte cells from epithelial tissue. This particular cell line was chosen as it is one of the four cell lines that have PolII distribution data available, in addition to H3K4me2 distribution data, from ENCODE (see similar plots for the other three cell lines in Figure S2). For each cluster identified by the H3K4me2 signal (Figure 14, left), the average PolII distribution is plotted in the right panel of Figure 14. As the tail length decreases, the amount of PolII binding decreases as well. It is noteworthy that highly expressed lncRNAs have high levels of PolII binding at TSS regions. Hence, this observation suggests that PolII also participates in lncRNA transcription.

Figure 14. Distributions of H3K4me2 (left) and PolII (right) at lncRNA promoter regions.

4.3.3 Core group of lncRNAs display ubiquitous long tail profile in all tissue types

A core group of 182 lncRNAs shows recurrent long-tail H3K4me2 patterns in all five tissue types and ESC. This group accounts for more than 15% of all long-tail lncRNAs identified in each cell line (see the sizes of clusters of each cell line in Table 15 in Appendix B). Ensembl transcript IDs of all lncRNAs contained in the core group are listed in Table 18 in Appendix B. The lncRNAs from the core group are examined as a special group in the remainder of this chapter.

4.3.4 Long-tail lncRNAs co-expressed with tissue specific and highly interactive genes

Since there is currently no comprehensive database for the annotation of lncRNA functions, genes that are highly co-expressed with lncRNAs are often utilized as surrogates to infer the functions of lncRNAs. As shown in multiple studies, coding non-coding (CNC) networks are utilized as a means of studying the biological functions of lncRNAs; the genes co-expressed with lncRNAs usually provide insightful information on the functions of lncRNAs [61–63].

Co-expression between a gene and an lncRNA is measured by the Pearson Correlation Coefficient between their expression values. Due to the small sample sizes of the other cell lines, this study was conducted on cell lines of blood and muscular tissues only. Furthermore, we included a dataset that contains cells of multiple tissue types to study the functions of core lncRNAs. First, the co-expression coefficients between all pairs of long-tail lncRNAs and genes are calculated. The results demonstrate that most genes co-express with only a few long-tail lncRNAs, and few genes co-express with multiple long-tail lncRNAs. Thus two gene groups were identified for each co-expression study: Group One (G1) contains genes that co-express with only a few (fewer than 3 for muscle tissue, fewer than 4 for blood tissue, and fewer than 3 for the core group) long-tail lncRNAs; and Group Two (G2) contains genes that co-express with multiple (more than 8 for muscle tissue, more than 11 for blood tissue, and more than 7 for the core group) long-tail lncRNAs. Finally, the enriched GO functions of co-expressed genes are used to infer the biological functions of long-tail lncRNAs.

Functional enrichment analyses were performed on genes co-expressed with long-tail lncRNAs in blood tissue, in muscular tissue, and with core lncRNAs. The top five enriched GO molecular functions, biological processes, human phenotypes, and mouse phenotypes are listed in Tables 16 and 17 in Appendix B. G1 genes from muscular tissue cells are highly enriched with tissue-specific biological processes (the top enriched terms are not tissue specific, hence the tissue-specific terms are listed as the 6th to 9th terms in bold in the table cell), and G2 genes are enriched with more general functions. Similar results are observed in blood tissue cells: G1 genes are highly enriched with tissue-specific mouse phenotypes (see the complete list of all enriched terms in mouse phenotypes in Table 17 in Appendix B). For core lncRNAs, both groups are enriched with fundamental functions with different emphases.

Genes that are highly co-expressed with long-tail lncRNAs are further analyzed by their interactions with other genes. It is our hypothesis that long-tail lncRNAs carry essential regulatory functions in cells of various tissue types; hence genes that are highly co-expressed with long-tail lncRNAs would also play important roles in cellular functions. As a result, these genes are more likely to interact with other genes. In this study, the number of gene interactions was used as an indicator of regulatory power.

Here, the hypothesis is that the more interactions a gene has, the more important its role in cellular functions, and the more essential the lncRNA it is co-expressed with. As a result, if a gene group contains genes with high regulatory power, it should have more interactions within the group and with other genes. In this study, gene interactions within a gene group and between two gene groups are quantified by a ‘PPI density’ metric (see the Methods section for more detail). As shown in Table 8, genes that are highly co-expressed with long-tail lncRNAs have more interactions than other genes. For muscular tissue, the within-group PPI density of G1 (20.2×10^-3) is more than fifty times the average interaction density of the entire PPI network (3.9×10^-4). Similar results are also observed in the blood tissue. Of all the observed PPI densities, G2 from the core group has the highest PPI density both within the group and with all other genes in the entire network. This result indicates that long-tail lncRNAs co-express with highly interactive genes, and it is likely that long-tail lncRNAs fill an essential regulatory role in tissue differentiation and development.

Long-tail lncRNA group | Number of genes | Number of recorded interactions (within gene group) | D1: PPI density (×10^-3) | Number of recorded interactions (with all other genes) | D2: PPI density (×10^-3)
Muscular G1 | 116 | 135 | 20.2 | 4892 | 2.9
Muscular G2 | 139 | 50 | 5.2 | 2438 | 1.2
Blood G1 | 175 | 52 | 3.4 | 2012 | 0.78
Blood G2 | 127 | 84 | 10.5 | 2534 | 1.4
Core lncRNA G1 | 100 | 79 | 16 | 1759 | 1.2
Core lncRNA G2 | 57 | 66 | 41.4 | 3025 | 3.7

Average interaction density of the PPI network is 3.9×10^-4.

Table 8. PPI density of genes that are highly co-expressed with selected long-tail lncRNA groups.


4.3.5 Long-tail lncRNAs are highly conserved

Based on previous findings, it is likely that long-tail lncRNAs carry out essential developmental/differentiation functions as they interact with highly-expressed and tissue- specific genes. Hence, mutations in these lncRNAs might lead to loss of interactions with genes, and possibly dysfunctions and even abnormal phenotypes or diseases. As a result, such lncRNAs should be highly conserved across species as mutations on such lncRNAs could cause more serious dysfunctions than that of other lncRNAs.

Long-tail lncRNAs are indeed highly conserved. PhastCon score is a value that indicates the conservation rate of a base pair on the human genome by comparing it to 99 other species. The value of a PhastCon score ranges from 0 to 1. The higher the score, the more conserved the region. As the focus of study here is the highly conserved regions in long-tail lncRNAs, this study was conducted on the base pairs that have a PhastCon score no less than 0.8. Overall, 4.62% of the whole genome has a PhastCon score no less than 0.8, plotted as the rightmost box plot in the left panel of Figure 15. Long-tail lncRNAs identified in cells from various tissue types show higher conservation levels than the entire human genome, as shown in the box plots in the left panel of Figure 15. The long-tail lncRNA groups are clearly enriched with base pairs associated with high PhastCon scores. To further demonstrate the enrichment of highly conserved base pairs in long-tail lncRNAs, multiple Wilcoxon rank sum tests were conducted between each long-tail lncRNA group and all known lncRNAs. The resulting p-values indicate that the enrichment of highly conserved base pairs in long-tail lncRNA groups is highly significant.


Figure 15. Investigation of the PhastCon scores of long-tail lncRNAs shows that they are highly conserved.

4.3.6 Mutations on long-tail lncRNAs are more likely to be associated with diseases and dysfunctions

According to the NHGRI GWAS Catalog, mutations on 1,852 lncRNAs are associated with 535 diseases or dysfunctions, constituting 7.61% (1,852 out of 24,327) of all lncRNAs included in this study. Long-tail lncRNAs are highly enriched with lncRNAs associated with diseases and dysfunctions. For the long-tail lncRNAs identified in each tissue type, the percentage of lncRNAs associated with diseases and dysfunctions is higher (with most of the differences being statistically significant), as shown in Table 9. As expected, the percentage of disease- and dysfunction-related lncRNAs is even higher in the lncRNA core group, at 19.78%.


Tissue Type | Number of long-tail lncRNAs containing mutations associated with diseases and dysfunctions | Percentage of long-tail lncRNAs containing mutations associated with diseases and dysfunctions | Chi-square test (between each group and all lncRNAs observed) p-value
Blood | 98 | 9.34 | 3.91E-2
Connective | 89 | 9.57 | 2.76E-2
Epithelial | 101 | 10.26 | 4.02E-5
Muscular | 70 | 8.40 | 3.97E-1
Nervous | 127 | 11.11 | 1.55E-5
ESC | 117 | 10.34 | 7.90E-4
Core Group | 36 | 19.78 | 8.49E-10
All | 1852 | 7.61 | -

Table 9. Most long-tail lncRNA clusters are highly enriched with disease- and dysfunction-related lncRNAs.
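The chi-square comparison between one group and all lncRNAs can be sketched as follows. How the contingency table was constructed in this study is not stated; the sketch treats the group and the full lncRNA set as two independent samples and applies no continuity correction, both of which are assumptions. The counts used are those of the core group (36 of 182) and of all lncRNAs (1,852 of 24,327).

```python
from math import erfc, sqrt


def chi_square_2x2(hits_a, total_a, hits_b, total_b):
    """Pearson chi-square test (1 degree of freedom, no continuity
    correction) comparing the proportions hits_a/total_a and
    hits_b/total_b. Returns (chi-square statistic, p-value)."""
    a, b = hits_a, total_a - hits_a
    c, d = hits_b, total_b - hits_b
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # For 1 degree of freedom the upper tail is erfc(sqrt(chi2 / 2)).
    return chi2, erfc(sqrt(chi2 / 2))


# Core group versus all lncRNAs (counts taken from the text and Table 9).
chi2_core, p_core = chi_square_2x2(36, 182, 1852, 24327)
```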

4.4 Summary

This study performed clustering analysis on H3K4me2 distributions around the TSS regions of identified lncRNAs. Results confirmed that a heavy presence of H3K4me2 downstream of the TSS indicates highly transcribed and tissue-specific lncRNAs in all tissue types. In each dataset, the cluster containing long-tail lncRNAs shows statistically higher expression levels than other clusters. Further analyses confirmed that long-tail lncRNAs are highly co-expressed with tissue-specific and highly interactive genes. This observation suggests that long-tail lncRNAs carry essential regulatory functions in tissue differentiation and development. Long-tail lncRNAs are highly conserved across species in multiple tissue types. Evidence shows that in each tissue type approximately 10% of long-tail lncRNAs carry mutations associated with diseases and dysfunctions.


Furthermore, in this study, a group of core lncRNAs is identified that shows ubiquitous long-tail H3K4me2 profiles in cells of all tissue types. This group of 182 lncRNAs shows high co-expression with genes that carry essential functions and are highly interactive with other genes. In addition, the conservation scores for these lncRNAs are higher than those of general long-tail lncRNAs, suggesting the evolutionary selection of these lncRNAs. This core group also contains a higher percentage of lncRNAs that are associated with diseases and dysfunctions, approximately double the percentage found in the long-tail lncRNA group of any single tissue type. All these observations suggest the biological importance of this group of lncRNAs. This is similar to the critical roles of the protein-coding genes with ubiquitous long-tail H3K4me2 profiles, suggesting a role of H3K4me2 in marking active genes, including non-coding RNAs.

To summarize, this study reveals the regulatory power of long-tail lncRNAs in multiple tissue types. Comparative analyses of the H3K4me2 marked lncRNAs were carried out among multiple tissues. By integrating functional genomics, PPI data and the GWAS catalog, it is demonstrated that H3K4me2 not only regulates protein-coding genes but is also involved in important regulatory mechanisms for the transcription of lncRNAs. It can be conceived that more integrative analyses on epigenetics, such as histone modifications and regulatory proteins, will yield even deeper insight into the epigenetic regulation of non-coding RNA activities in the future.


Chapter 5: Identify Combinatorial Recurrent Patterns of Chromatin Modifications at Promoters of Different States

Previous chapters show that recurrent patterns of a single chromatin modification at regulatory regions are associated with regulatory functions. In this chapter, recurrent combinatorial patterns of multiple chromatin modifications are identified and analyzed. Furthermore, as more data become available, it is computationally expensive and unnecessary to study combinatorial patterns of all available modifications. Here a novel framework is proposed to investigate recurrent combinatorial patterns of a subset of available chromatin modifications. This subset is selected quantitatively based on their distribution patterns along the genome. A case study conducted at promoter regions is presented: four out of twelve chromatin modifications were selected, eight different promoter states were identified, and the identified patterns of active promoters were further utilized to discover novel promoter regions. Several previously un-annotated promoters were discovered by the identified recurrent combinatorial patterns, and further investigations confirm their promoter functions. With this proposed framework, functions of previously un-annotated regions could be discovered, which leads to a better understanding of epigenetic regulation.

5.1 Introduction

Previous studies show that recurrent combinatorial patterns of multiple chromatin modifications at regulatory regions are associated with various regulatory functions and provide valuable information on annotating the human genome [5, 22, 25, 64, 65]. Currently, there are several types of known regulatory regions, and studying their regulatory mechanisms remains an active field of research [18–20, 65–69]. Progress has been made as more data become available and more algorithms are developed to analyze recurrent patterns. For instance, much effort was spent on analyzing chromatin modifications in human CD4+ T cells as more data became available. First, salient patterns of multiple chromatin modifications were reported by Roh et al. [70, 71]. Later, ChromSig was developed by Hon et al. to use a combination of 21 chromatin modifications to search for commonly recurring chromatin signatures in the updated data set [22, 30]. Subsequently, as more data became available for this cell line, ChromHMM was developed by Ernst et al. to annotate the genome based on the distributions of 41 chromatin modifications [5]. The same group later reported that the human genome can be annotated with 15 chromatin states based on 10 chromatin modifications using ChromHMM [12]. It is noteworthy that computationally sophisticated methods become crucial for analyzing patterns of chromatin modifications as more data become available. Furthermore, these studies also demonstrate that chromatin modifications do not contribute equally to the process of identifying recurrent patterns. Hence, the identification of recurrent combinatorial patterns could be simplified significantly if a suitable subset of chromatin modifications could be identified. Moreover, such a subset could also provide guidelines for future experimental design.

In this study, a computational framework is designed to select subsets of chromatin modifications that form distinct recurrent patterns at regulatory regions. The identified recurrent combinatorial patterns can be further utilized to discover novel regulatory regions. A case study of promoters yields encouraging results: 4 out of 12 available chromatin modifications were selected and 8 different recurrent patterns were identified. In-depth analyses show that the combinatorial patterns are associated with different states of promoters, as confirmed by the expression levels of genes and the enriched distributions of PolII. Recurrent combinatorial patterns of active promoters were further utilized to discover novel promoters. The identified putative promoters are shown to be related to transcription activation. The proposed framework can be easily adapted to study other regulatory regions or extended to annotate the whole genome.

5.2 Method

5.2.1 Workflow

The workflow (Figure 16) of the proposed framework is as follows. First, data of all candidate chromatin modifications are pre-processed. Then, the distribution of each chromatin modification is expressed as a weighted sum of all other modifications in the candidate pool. The resulting coefficients are recorded in the affinity matrix of all chromatin modifications. This affinity matrix is enforced to be sparse, as the distribution of each chromatin modification is expected to be a weighted sum of a few other closely related chromatin modifications. Subsequently, the chromatin modifications are clustered into different groups via hierarchical clustering. In this step, chromatin modifications with closely related distributions are grouped into the same cluster. Then, a representative is selected from each cluster. After the subset is identified, recurrent combinatorial patterns of the selected modifications are identified, and the regulatory functions associated with these combinatorial patterns are further confirmed by evidence from other databases. The identified patterns then lead to the discovery of novel regulatory regions.

Figure 16. Workflow of the framework proposed in Chapter 5.

5.2.2 Problem Formulation

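Suppose the distributions of N chromatin modifications at M loci are collected via ChIP-seq experiments. The data set H could be written as follows,

$$H = [h_1, h_2, \ldots, h_N]$$

Here $h_i$ collects the distributions of the $i$-th chromatin modification over all loci, and $x_{i,j}$ denotes the distribution of chromatin modification $h_i$ at the $j$-th ($j \in [1, M]$) locus on the genome, as follows

$$h_i = [x_{i,1}, x_{i,2}, \ldots, x_{i,M}]^T$$

Suppose each locus is of length L, then the vector $x_{i,j}$ could be further expressed as

$$x_{i,j} = [x_{i,j}^{1}, x_{i,j}^{2}, \ldots, x_{i,j}^{L}]$$

where $i \in [1, N]$ and $j \in [1, M]$.

Affinity matrix of chromatin modifications

The following formulation is proposed to identify subsets of chromatin modifications forming recurrent patterns on the genome. Suppose there exists a subspace P in which a few chromatin modifications reside. Then the distribution of one chromatin modification could be expressed as a linear combination of the distributions of the remaining chromatin modifications in the same subspace, as follows

$$h_i = \sum_{j \in P,\ j \neq i} \alpha_j h_j$$

It could also be written as follows,

$$h_i = \sum_{j = 1,\ j \neq i}^{N} \alpha_j h_j$$

where $\alpha_j = 0$ for all $j \notin P$. Here $\alpha_j$ could be considered as a coefficient measuring how the distributions of the $i$-th and $j$-th chromatin modifications are related. Furthermore, this could be rewritten as,

$$h_i = H \alpha_i$$

where $\alpha_{ii} = 0$, $\alpha_i \in \mathbb{R}^N$ and $\|\alpha_i\|_0 = |P| - 1$. This formulation follows the assumption that a distribution can be explained by the closely related distributions of other chromatin modifications. Hence, to calculate $\alpha_i$, it shall follow,

$$\min_{\alpha_i} \|\alpha_i\|_0 \quad \text{s.t.} \quad h_i = H \alpha_i,\ \alpha_{ii} = 0$$

As minimizing the L0-norm is a non-convex problem, the formulation is relaxed to minimize the tightest convex relaxation of the L0-norm, i.e.

$$\min_{\alpha_i} \|\alpha_i\|_1 \quad \text{s.t.} \quad h_i = H \alpha_i,\ \alpha_{ii} = 0$$

which can be solved efficiently and prefers sparse solutions. This sparse optimization program could also be rewritten for all data points i = 1, …, N in matrix form as

$$\min_{A} \|A\|_1 \quad \text{s.t.} \quad H = H A,\ \operatorname{diag}(A) = 0$$

where $A = [\alpha_1, \ldots, \alpha_N] \in \mathbb{R}^{N \times N}$. This affinity matrix A is then used to cluster chromatin modifications.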
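A minimal sketch of the sparse self-representation is given below. It solves a penalized (lasso) form, minimizing 0.5·‖h_i − Σ_{j≠i} α_j h_j‖² + λ·‖α_i‖₁ by coordinate descent; this penalized form, the solver, and the names `soft_threshold` and `sparse_self_representation` are assumptions of the sketch, chosen because a sparsity parameter λ is tuned later in this chapter. The toy profiles are hypothetical.

```python
def soft_threshold(value, threshold):
    """Shrink `value` toward zero by `threshold` (proximal map of |.|)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def sparse_self_representation(profiles, lam=0.01, n_iter=200):
    """Express each profile as a sparse combination of the others.

    profiles: list of N equal-length lists, one per chromatin modification.
    Returns an N x N nested list A in which A[j][i] is the coefficient of
    profile j in the reconstruction of profile i, with A[i][i] = 0."""
    n, dim = len(profiles), len(profiles[0])
    norms = [sum(v * v for v in p) for p in profiles]
    affinity = [[0.0] * n for _ in range(n)]
    for i in range(n):
        alpha = [0.0] * n
        residual = list(profiles[i])  # h_i minus the current reconstruction
        for _ in range(n_iter):
            for j in range(n):
                if j == i or norms[j] == 0.0:
                    continue
                # Correlation of h_j with the residual that excludes h_j.
                rho = sum(profiles[j][k] * residual[k] for k in range(dim))
                rho += alpha[j] * norms[j]
                new_alpha = soft_threshold(rho, lam) / norms[j]
                delta = new_alpha - alpha[j]
                if delta != 0.0:
                    for k in range(dim):
                        residual[k] -= delta * profiles[j][k]
                    alpha[j] = new_alpha
        for j in range(n):
            affinity[j][i] = alpha[j]
    return affinity


# Toy data: profiles 0 and 1 share one subspace, profiles 2 and 3 another.
toy_profiles = [[1, 2, 0, 0], [2, 4, 0, 0], [0, 0, 1, 1], [0, 0, 3, 3]]
toy_affinity = sparse_self_representation(toy_profiles)
```

In the toy example, each profile is reconstructed only from the profile lying in its own subspace, so the nonzero entries of the affinity matrix reveal the grouping that the subsequent clustering step recovers.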

5.2.3 Selection of chromatin modifications and identification of combinatorial patterns

The affinity matrix A is then utilized to cluster chromatin modifications via hierarchical clustering. Each cluster is considered as a collection of chromatin modifications displaying linearly related distributions. Hence, for each cluster, one chromatin modification is selected to represent the distribution signal of this cluster.

After the set of chromatin modifications is selected, the distributions of all selected modifications are concatenated into one vector. Recurrent combinatorial distribution patterns are then identified by K-means clustering. Here, it is hypothesized that recurrent combinatorial patterns could be indicators of different states of regulatory regions. Hence, each pattern is further analyzed to confirm whether it is indeed associated with a distinct epigenetic regulatory function.

5.2.4 Discovery of novel regulatory regions

The identified combinatorial patterns are then utilized to discover novel regulatory regions. Here, Pearson correlation coefficient (PCC) is used to quantify the similarity between distributions of two chromatin modifications. The similarity metric is defined as the mean of correlation coefficients of each pair of chromatin modifications. Putative regulatory regions are selected by thresholding the similarity metrics. The quality of the putative regulatory regions is further analyzed by confirming with existing annotations of the human genome and other data evidence.
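One reading of this similarity metric can be sketched as follows: for every selected chromatin modification, the PCC between its distribution at a candidate region and its average profile in an identified pattern is computed, and the mean over modifications is thresholded. This pairing, the 0.8 threshold, the function names, and the toy profiles are assumptions made for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat profile is treated as uncorrelated
    return cov / sqrt(var_x * var_y)


def pattern_similarity(candidate, pattern):
    """Mean PCC over matching chromatin modifications. Both arguments map
    a modification name to its distribution profile."""
    scores = [pearson(candidate[name], pattern[name]) for name in pattern]
    return sum(scores) / len(scores)


def putative_regions(candidates, pattern, threshold=0.8):
    """Names of candidate regions whose similarity to `pattern` reaches
    `threshold`."""
    return [name for name, profiles in candidates.items()
            if pattern_similarity(profiles, pattern) >= threshold]


# Hypothetical pattern and candidate regions with two modifications each.
active_pattern = {"mod1": [0, 1, 4, 1, 0], "mod2": [0, 2, 3, 2, 0]}
regions = {
    "regionA": {"mod1": [0, 2, 8, 2, 0], "mod2": [1, 3, 4, 3, 1]},
    "regionB": {"mod1": [4, 1, 0, 1, 4], "mod2": [3, 2, 0, 2, 3]},
}
selected = putative_regions(regions, active_pattern)
```

Because the PCC is invariant to scaling and offset, a region is selected for the shape of its profiles rather than for the absolute signal strength.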

5.2.5 Case study at promoter region: data collection and pre-processing

Genome-wide maps of 2 histone acetylations, 8 histone methylations, the histone variant H2A.Z and CTCF in human skeletal muscular cells and B-lymphocyte cells were generated by the ENCODE project. For each chromatin modification, the raw data of summary tag counts obtained at every 100 bp was pre-processed before analyses.

Distributions of chromatin modifications in the -5 kbp to +5 kbp region around each annotated Transcription Start Site (TSS) were extracted. The TSS list was downloaded from the UCSC Genome Browser website. Overall, there are 41,413 annotated TSSs from refGene. In this study, the distribution of each chromatin modification at every captured promoter region is represented by a vector of length 100 (the locus is of length 10 kbp and each genomic window is of length 100 bp). Consequently, for each chromatin modification, the data matrix is of size 41,413×100.
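The extraction of one promoter profile can be sketched as below. The input layout (a dictionary from bin start coordinate to tag count for one chromosome) and the function name `promoter_profile` are assumptions for illustration; strand orientation is ignored in this sketch.

```python
def promoter_profile(tag_counts, tss, flank=5000, bin_size=100):
    """Return the tag counts of the bins covering [tss - flank, tss + flank).

    tag_counts: dict mapping the start coordinate of each `bin_size`-wide
    bin on one chromosome to its summary tag count; missing bins count as 0.
    With the defaults the result has 2 * 5000 / 100 = 100 entries."""
    first_bin = (tss - flank) // bin_size * bin_size
    n_bins = 2 * flank // bin_size
    return [tag_counts.get(first_bin + k * bin_size, 0) for k in range(n_bins)]


# Hypothetical binned counts around a TSS located at position 10,000.
counts = {4900: 3, 5000: 5, 9900: 7, 10000: 12}
profile = promoter_profile(counts, tss=10000)
```

Stacking one such vector per annotated TSS yields the 41,413×100 data matrix of a single chromatin modification.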

In this study, ToppGene was used to study the enriched biological functions of gene groups displaying identified combinatorial patterns at promoter regions. Putative promoters are further analyzed by using evidence from other databases. Other approaches to examine the putative promoters include the investigation of the expression levels of downstream regions and PolII distributions, which are usually considered as good indicators of promoter activities.

5.3 Results

5.3.1 Subset identification

Data from human skeletal muscular cells and B-lymphocyte cells were used in this study. Overall, this study includes 12 chromatin modifications: 2 histone acetylations, 8 histone methylations, the histone variant H2A.Z and the transcriptional repressor CTCF. The annotation of promoters was obtained from the UCSC Genome Browser refGene annotation.

One affinity matrix of chromatin modifications was generated for each cell line individually (see Methods section). Here, the hypothesis is that the distribution of one chromatin modification mark could be expressed as a weighted sum of few linearly related others. Therefore, the resulting affinity matrix shall be sparse. To further enforce this assumption, the value of parameter λ is empirically tested and selected.

Value of λ was chosen by empirical tests

Since the value of λ has a great impact on the sparsity of the resulting affinity matrix, it was empirically chosen by comparing two affinity matrices. As previous studies show, the recurrent patterns at promoter regions remain cell type invariant [2, 21]. Hence, the affinity matrices from the two cell lines should remain similar to each other. To compare the similarity between the two affinity matrices, the PCC between all matching entries was calculated for different choices of λ. The value of λ that gives the highest PCC was chosen, as shown in Figure 17.

Figure 17. Value of λ is empirically selected by comparing the two affinity matrices generated based on data from two different cell lines.

Clustering chromatin modifications

To divide the set of chromatin modifications into clusters, hierarchical clustering was applied to the affinity matrices. The clustering was tested with K = 3, 4 and 5 to partition the set of 12 chromatin modifications. In the end, we empirically selected K = 4. As shown in Figure 18, the identified chromatin modification clusters largely overlap between the two cell lines. For each cluster, one chromatin modification is selected to represent the cluster. Therefore, a group of 4 chromatin modifications is selected to represent the overall distributions of all chromatin modifications. The selected chromatin modifications are underlined in Figure 18.

Figure 18. Hierarchical clustering and subset selection of chromatin modifications.


Figure 19. Combinatorial patterns (CP) identified in this study and the average profile of each CP.

5.3.2 Identification of combinatorial patterns of chromatin modifications

Recurrent combinatorial patterns of chromatin modifications were detected in both cell lines via K-means clustering. First, the distributions of the selected chromatin modifications are concatenated into one vector. Therefore, for each known promoter, a vector of length N′×L is generated to represent the combinatorial distribution (N′ being the number of selected chromatin modifications). Then, K-means clustering was performed to identify recurrent combinatorial patterns at promoters. To select an optimal value of K, the silhouette values and the sum of point-to-centroid distances were examined for K values from 2 to 20. Empirically, K is set to 8 for both cell lines (please refer to Table 10 for the sizes of all clusters in both cell lines). Figure 19 shows the clustering results from both cell lines. The recurrent combinatorial patterns (CP) are ranked by the expression level of their target genes (as shown in Figure 20). It is observed that similar combinatorial patterns exist in both cell lines.

Similarity between two combinatorial patterns is calculated by modified PCC: the mean of PCC between all matching pairs of chromatin modifications. As shown in Figure

19 and Table 10, modified PCC between combinatorial patterns discovered in both cell lines is quite high.

Cluster   Cluster size (GM12878)   Cluster size (HSMM)   Modified PCC
CP1       4655                     3190                  0.945
CP2       6954                     6486                  0.980
CP3       6154                     7551                  0.973
CP4       4145                     4572                  0.956
CP5       2799                     3657                  0.956
CP6       5259                     6865                  0.877
CP7       795                      995                   0.223
CP8       10652                    8097                  0.684

Table 10. Sizes of identified clusters and the correlations between matching clusters from the two cell lines.
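The modified PCC reported in Table 10 can be written out as a short Python sketch. It assumes each combinatorial pattern is stored as a mapping from a chromatin modification name to its average profile. The data layout and function names are illustrative assumptions.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient (PCC) of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # convention chosen here for constant inputs
    return cov / sqrt(sxx * syy)


def modified_pcc(pattern_a, pattern_b):
    """Mean of the PCCs between matching chromatin modifications.

    Each pattern maps a modification name to its average profile.
    """
    shared = sorted(set(pattern_a) & set(pattern_b))
    if not shared:
        raise ValueError("patterns share no chromatin modifications")
    return sum(pearson(pattern_a[m], pattern_b[m]) for m in shared) / len(shared)
```

Because the PCC is computed per modification before averaging, a strong match in one modification cannot hide a mismatch in another as easily as a single PCC over the concatenated vector could.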

Analyses of gene expression levels show that different combinatorial patterns are associated with different promoter states. Each state is considered to carry out a different epigenetic regulatory function. The same recurrent combinatorial pattern is associated with similar expression levels in both cell lines. As Figure 20 shows, the combinatorial patterns can be divided into 3 groups: patterns of active promoters (CP1-CP4), weak promoters (CP5, CP6) and inactive/poised promoters (CP7, CP8).


Figure 20. Expression levels of identified clusters from the two cell lines.

Another indicator of activation of transcription is the enriched distribution of PolII at promoters, as it is the enzyme that catalyzes the transcription of DNA at TSS. Here, distributions of PolII at promoter regions of genes were investigated as well. As plotted in Figure 21, results show significant PolII enrichment at active promoters (CP1-CP4), scarce distribution at weak promoters (CP5, CP6), and almost no clear distribution at inactive/poised promoters (CP7, CP8).


Figure 21. Distributions of PolII for identified clusters.

5.3.3 Distinct combinatorial patterns are indicators of specific regulatory functions

To thoroughly investigate the differences among genes associated with patterns of active promoters, they are further examined with functional enrichment analyses. Results show that genes displaying CP1 are enriched with tissue specific functions and genes displaying CP2-4 are associated with mostly housekeeping functions.

CP1: tissue specific genes. Functional enrichment analysis of genes displaying CP1 at promoter regions yields several tissue specific biological processes and mouse phenotypes. The enriched GO terms and associated p-values are listed in Table 19 in Appendix C.

CP2-4: housekeeping genes. For genes displaying CP2, CP3 and CP4 at promoter regions, functional enrichment analyses indicate that they are mostly associated with housekeeping functions. It is noteworthy that the enriched functions usually overlap significantly for genes displaying the same pattern in both cell lines. Detailed analysis results are listed in Tables 20-22 in Appendix C. GO terms that are enriched in gene groups from both cell lines are listed in bold. The remaining non-overlapping GO terms are mostly related to the overlapping GO terms. For example, in the table for CP2, one GO term enriched in both cell lines is “regulation of cellular protein metabolic process”, and the non-overlapping GO terms include “negative regulation of metabolic process” and “negative regulation of cellular metabolic process”. Even though some GO terms do not appear in both columns, the functions of both gene groups are closely related.

5.3.4 Discovery of novel promoters

As the identified recurrent combinatorial patterns are associated with promoters of different states, they can be utilized to discover novel promoters. In this study, un-annotated promoter regions are discovered if they display the identified patterns of active promoters. Here, the human genome is divided into 10 kbp loci using a 2 kbp sliding window. Then, the combinatorial distribution at each locus is compared to the identified recurrent patterns of active promoters. The similarity between two combinatorial patterns is calculated as the mean of the PCC of all matching pairs. A locus is considered a putative promoter only if every individual PCC is above a certain threshold. After all the candidate loci are selected, loci with high similarity scores (similarity coefficient greater than 0.75) are further analyzed. The search is carried out on both DNA strands.
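The search procedure can be sketched in Python as follows. The sketch makes several assumptions. It reads the text as describing 10 kbp windows advanced in 2 kbp steps. The per-modification threshold `per_mark_min` is a hypothetical parameter, since its value is not stated here. Each locus and pattern is represented as a mapping from a modification name to a profile. All names are introduced for illustration only.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient (PCC) of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # convention chosen here for constant inputs
    return cov / sqrt(sxx * syy)


def sliding_windows(chrom_length, window=10_000, step=2_000):
    """Yield (start, end) for every full-size window on a chromosome."""
    for start in range(0, chrom_length - window + 1, step):
        yield start, start + window


def matches_pattern(locus, pattern, per_mark_min, mean_min=0.75):
    """Apply both criteria: every per-modification PCC above per_mark_min
    (hypothetical value) and the mean PCC above mean_min."""
    pccs = [pearson(locus[m], pattern[m]) for m in pattern]
    return min(pccs) > per_mark_min and sum(pccs) / len(pccs) > mean_min
```

In a full scan, each window on each strand would be binned into profiles of the same length as the pattern profiles and passed to `matches_pattern` once per active-promoter pattern.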

5.3.5 Evaluation of the putative promoters

Putative promoter regions are further analyzed: the expression levels of downstream regions are examined along with the PolII distributions. Investigations show that the regions downstream of putative promoters have expression levels similar to those of genes displaying the same patterns at their promoters, as shown in Figure 22. Furthermore, putative promoter regions display the PolII distribution patterns that are expected for active promoter regions, as shown in Figure 23.

Figure 22. Comparison of expression levels of regions regulated by putative and identified active promoters.

Figure 23. Combinatorial patterns and PolII distributions of putative promoters.


Further analyses indicated that putative promoters mostly consist of promoter regions of non-coding RNAs, regions within the gene bodies of known genes, and regions without annotations. The breakdown of the putative promoters is listed in Table 11.

                                                     Regions overlapping with annotations
Pattern   Number of putative   Un-annotated   Regions between   Regions within
          promoters            regions        gene bodies       gene bodies
CP1       10                   3              3                 4
CP2       46                   6              17                25
CP3       15                   1              4                 11
CP4       105                  27             8                 72

Table 11. Further analyses of the identified putative promoters.

As shown above, the un-annotated regions downstream of active promoter patterns have expression levels similar to those of known genes with the same promoter pattern, as well as similar PolII distributions. The PolII distributions of putative promoters were also investigated in other cell lines, namely HUVEC, K562 and HeLa (Figure 24). Results show that the putative promoters in these three cell lines also display enriched PolII distributions. For more details, please refer to Appendix C. One interesting observation is that the PolII distributions differ among these three cell lines, which suggests that some identified promoters are likely to be tissue specific. Hence, some of them are active in GM12878 but less so in other cells.


Figure 24. PolII distributions at putative promoters (identified in cell line GM12878) in other cell lines.

5.4 Summary

In this chapter, a framework is proposed to investigate recurrent combinatorial patterns of chromatin modifications at regulatory regions. The framework is demonstrated by a case study conducted at promoter regions. By using the proposed framework, a subset of the available chromatin modifications was identified based on their distribution patterns at promoter regions. Furthermore, recurrent patterns formed by the selected subset of chromatin modifications were identified. The investigations show that the identified recurrent combinatorial patterns are associated with different states of promoters, as confirmed by the expression levels of downstream genes and PolII distributions at promoter regions. The identified patterns were further utilized for discovering putative promoters. Further analysis shows that the putative promoters are indeed related to activation of transcription. Promoter regions were chosen to demonstrate this framework as their targeted regions are easy to locate. It is worth mentioning that this framework can be easily adapted to other regulatory regions with suitable data sets, or extended to study genome-wide recurrent patterns and annotate the whole human genome.


Chapter 6: Conclusion and future work

Recurrent patterns of chromatin modifications along the genome are informative indicators of regulatory regions. With the rapid development of sequencing technologies in recent years, massive high-throughput sequencing data have been generated to capture distributions of chromatin modifications. These data sets provide invaluable information to decode the associations between patterns of chromatin modifications and epigenetic regulatory functions of their residing regions.

This dissertation presents a framework to identify and analyze recurrent patterns of chromatin modifications on the genome. It is designed to answer the following questions:

Q1: What are recurrent patterns of chromatin modifications at specific regulatory regions and do they associate with regulatory functions of their residing regions?

Q2: How to select a subset of chromatin modifications to identify recurrent patterns at specific regulatory regions that are associated with regulatory functions?

Q3: How to discover novel regulatory regions by utilizing the identified recurrent patterns at specific regulatory regions?

To address the first question, a pipeline to identify and examine recurrent patterns of a single chromatin modification is presented for genes (Chapter 3) and lncRNAs (Chapter 4) at promoter regions. In Chapter 5, a computational approach to quantitatively select subsets of chromatin modifications is demonstrated. Here, chromatin modifications with related distributions are grouped into the same cluster, and a representative is selected from each cluster. Recurrent combinatorial patterns of the selected chromatin modifications at promoter regions are then identified and examined. Analyses of identified regulatory regions of different states require integrating evidence from other data sources. For instance, expression levels of downstream genes and distributions of PolII at the promoter regions were utilized to verify the promoter functions. Also in Chapter 5, the identified recurrent patterns are utilized to discover novel promoter regions by searching for genomic regions displaying similar patterns. The similarity between a pair of distributions is calculated by PCC.

In addition to the computational pipelines and frameworks proposed, the case studies presented in this dissertation also led to biologically meaningful discoveries. In Chapter 3, a distinct recurrent pattern of H3K4me2 is confirmed to regulate transcription of genes in multiple cell lines of different tissue types. Genes displaying this pattern at promoter regions have higher expression levels and carry more important roles in tissue development and differentiation. Furthermore, the study of this particular recurrent pattern is extended to promoters of lncRNAs in Chapter 4. It was discovered that this pattern regulates transcription of lncRNAs as well. Later in Chapter 5, a subset of the available chromatin modifications was selected quantitatively and several recurrent combinatorial patterns were identified. Further analyses confirmed that these patterns are associated with different states of promoters. Previously un-annotated promoter regions were discovered based on these identified patterns. Further analysis verified that they are indeed associated with promoter functions.


The proposed framework provides a quantitative approach to tackle the problems raised at the beginning of the dissertation. Furthermore, there are several aspects of this framework that could be improved to provide more insights into epigenetics:

Extension to other types of regulatory regions

The proposed framework does not make any assumption about the type of regulatory regions used as input. Hence, it could be utilized to identify recurrent patterns at any other regulatory regions. In this dissertation, promoter regions were selected for all case studies as their targeted regions are straightforward to locate and their regulatory functions are easy to confirm. If information on the targeted regions of other regulatory regions becomes available, adapting the framework to analyze patterns at those regions would be straightforward. Such information is not always available at present: for instance, even though the locations of some enhancers have been discovered, a commonly accepted experimental procedure to locate their targeted promoters is still lacking.

Annotation of the whole genome

Specific regulatory regions were chosen as their regulatory functions could be verified with evidence from other data sources. This framework can also be utilized to search for recurrent patterns on the whole genome, as the framework does not make any assumptions about the input. However, special attention has to be paid to deciding the number of recurrent patterns on the genome. As the whole genome contains regions associated with various types of regulatory functions, the number of recurrent patterns would have a great impact on the quality of annotation. The number of recurrent patterns was chosen empirically in the presented study for promoter regions. Since the amount of data would increase tremendously when analyzing the whole genome, an empirical search for the optimal number of recurrent patterns needs to be carefully constructed.

Verification of functions of regulatory regions

For any identified pattern, it is of great importance to verify if there exist associations with regulatory functions. Though this framework provided several approaches to examine regulatory functions of promoters, it could be further improved to integrate more data evidence for promoters. Furthermore, other types of regulatory regions require integrative analyses to verify their regulatory functions as well.

Comparisons among different cell lines

Epigenetic regulation turns genes and non-coding RNAs on and off at strategic moments. During cell differentiation and disease development, certain critical epigenetic regulations are altered in cells. These critical epigenetic regulatory functions, if identified, would bring more insights into cell differentiation and disease development. For instance, identifying genes that display patterns of different promoter states during cell differentiation could lead to a better understanding of the process. Furthermore, identifying genes that consistently display different patterns between patients and a control group might provide more indicators for future diagnosis of the disease.

This dissertation serves as one attempt to utilize chromatin modifications to gain a better understanding of epigenetic regulation on the genome. The listed future directions can further improve these studies, leading to more insights into epigenetic regulation.


References

1. Heger P, Wiehe T. New tools in the box: an evolutionary synopsis of chromatin insulators. Trends Genet 2014, 30:161–71.
2. Heintzman ND, Stuart RK, Hon G, Fu Y, Ching CW, Hawkins RD, et al. Distinct and predictive chromatin signatures of transcriptional promoters and enhancers in the human genome. Nat Genet 2007, 39:311–8.
3. Histone Modifications | What is Epigenetics? http://www.whatisepigenetics.com/histone-modifications/
4. Strahl BD, Allis CD. The language of covalent histone modifications. Nature 2000, 403:41–5.
5. Ernst J, Kellis M. Discovery and characterization of chromatin states for systematic annotation of the human genome. Nat Biotechnol 2010, 28:817–25.
6. Kondo Y, Shen L, Cheng AS, Ahmed S, Boumber Y, Charo C, et al. Gene silencing in cancer by histone H3 lysine 27 trimethylation independent of promoter DNA methylation. Nat Genet 2008, 40:741–50.
7. Seligson DB, Horvath S, Shi T, Yu H, Tze S, Grunstein M, et al. Global histone modification patterns predict risk of prostate cancer recurrence. Nature 2005, 435:1262–6.
8. Ho JWK, Bishop E, Karchenko PV, Nègre N, White KP, Park PJ. ChIP-chip versus ChIP-seq: lessons for experimental design and data analysis. BMC Genomics 2011, 12:134.
9. Link C. A user’s guide to the encyclopedia of DNA elements (ENCODE). PLoS Biol 2011, 9:e1001046.
10. Bernstein BE, Birney E, Dunham I, Green ED, Gunter C, Snyder M. An integrated encyclopedia of DNA elements in the human genome. Nature 2012, 489:57–74.
11. Kidder BL, Hu G, Zhao K. ChIP-Seq: technical considerations for obtaining high-quality data. Nat Immunol 2011, 12:918–22.

12. Ernst J, Kheradpour P, Mikkelsen TS, Shoresh N, Ward LD, Epstein CB, et al. Mapping and analysis of chromatin state dynamics in nine human cell types. Nature 2011, 473:43–49.
13. Barski A, Cuddapah S, Cui K, Roh T-Y, Schones DE, Wang Z, et al. High-resolution profiling of histone methylations in the human genome. Cell 2007, 129:823–37.
14. Araki Y, Wang Z, Zang C, Wood WH, Schones D, Cui K, et al. Genome-wide analysis of histone methylation reveals chromatin state-based regulation of gene transcription and function of memory CD8+ T cells. Immunity 2009, 30:912–25.
15. Pekowska A, Benoukraf T, Ferrier P, Spicuglia S. A unique H3K4me2 profile marks tissue-specific gene regulation. Genome Res 2010, 20:1493–502.
16. Zhang J, Parvin J, Huang K. Redistribution of H3K4me2 on neural tissue specific genes during mouse brain development. BMC Genomics 2012, 13 Suppl 8(Suppl 8):S5.
17. Vakoc CR, Mandat SA, Olenchock BA, Blobel GA. Histone H3 lysine 9 methylation and HP1gamma are associated with transcription elongation through mammalian chromatin. Mol Cell 2005, 19:381–91.
18. Kim TH, Barrera LO, Zheng M, Qu C, Singer MA, Richmond TA, et al. A high-resolution map of active promoters in the human genome. Nature 2005, 436:876–80.
19. Kim TH, Abdullaev ZK, Smith AD, Ching KA, Loukinov DI, Green RD, et al. Analysis of the vertebrate insulator protein CTCF-binding sites in the human genome. Cell 2007, 128:1231–45.
20. Visel A, Blow MJ, Li Z, Zhang T, Akiyama JA, Holt A, et al. ChIP-seq accurately predicts tissue-specific activity of enhancers. Nature 2009, 457:854–8.
21. Heintzman ND, Hon GC, Hawkins RD, Kheradpour P, Stark A, Harp LF, et al. Histone modifications at human enhancers reflect global cell-type-specific gene expression. Nature 2009, 459:108–12.
22. Hon G, Ren B, Wang W. ChromaSig: a probabilistic approach to finding common chromatin signatures in the human genome. PLoS Comput Biol 2008, 4:e1000201.
23. Yu H, Zhu S, Zhou B, Xue H, Han J-DJ. Inferring causal relationships among different histone modifications and gene expression. Genome Res 2008, 18:1314–24.

24. Ernst J, Kellis M. ChromHMM: automating chromatin-state discovery and characterization. Nat Methods 2012, 9:215–6.
25. Ucar D, Hu Q, Tan K. Combinatorial chromatin modification patterns in the human genome revealed by subspace clustering. Nucleic Acids Res 2011, 39:4063–75.
26. Elhamifar E, Vidal R. Sparse subspace clustering. In 2009 IEEE Conference on Computer Vision and Pattern Recognition. IEEE; 2009:2790–2797.
27. He HH, Meyer CA, Shin H, Bailey ST, Wei G, Wang Q, et al. Nucleosome dynamics define transcriptional enhancers. Nat Genet 2010, 42:343–7.
28. Schmid CD, Bucher P. ChIP-Seq data reveal nucleosome architecture of human promoters. Cell 2007, 131:831–2; author reply 832–3.
29. Barski A, Cuddapah S, Cui K, Roh T-Y, Schones DE, Wang Z, et al. High-resolution profiling of histone methylations in the human genome. Cell 2007, 129:823–37.
30. Hon G, Wang W, Ren B. Discovery and annotation of functional chromatin signatures in the human genome. PLoS Comput Biol 2009, 5:e1000566.
31. Robertson AG, Bilenky M, Tam A, Zhao Y, Zeng T, Thiessen N, et al. Genome-wide relationship between histone H3 lysine 4 mono- and tri-methylation and transcription factor binding. Genome Res 2008, 18:1906–17.
32. Hamerly G, Elkan C. Learning the k in k-means. In Advances in Neural Information Processing Systems; 2004:281–288.
33. Heyn H, Schreek S, Buurman R, Focken T, Schlegelberger B, Beger C. MicroRNA miR-548d is a superior regulator in pancreatic cancer. Pancreas 2012, 41:218–21.
34. Wang W, Sun J, Li F, Li R, Gu Y, Liu C, et al. A frequent somatic mutation in CD274 3’-UTR leads to protein over-expression in gastric cancer by disrupting miR-570 binding. Hum Mutat 2012, 33:480–4.
35. Wang W, Li F, Mao Y, Zhou H, Sun J, Li R, et al. A miR-570 binding site polymorphism in the B7-H1 gene is associated with the risk of gastric adenocarcinoma. Hum Genet 2013, 132:641–8.
36. Lander ES, Linton LM, Birren B, Nusbaum C, Zody MC, Baldwin J, et al. Initial sequencing and analysis of the human genome. Nature 2001, 409:860–921.
37. Kaikkonen MU, Lam MTY, Glass CK. Non-coding RNAs as regulators of gene expression and epigenetics. Cardiovasc Res 2011, 90:430–40.
38. Wilusz JE, Sunwoo H, Spector DL. Long noncoding RNAs: functional surprises from the RNA world. Genes Dev 2009, 23:1494–504.
39. Mercer TR, Dinger ME, Mattick JS. Long non-coding RNAs: insights into functions. Nat Rev Genet 2009, 10:155–9.
40. Vadaie N, Morris KV. Long antisense non-coding RNAs and the epigenetic regulation of gene expression. Biomol Concepts 2013, 4:411–5.
41. Morris KV, Vogt PK. Long antisense non-coding RNAs and their role in transcription and oncogenesis. Cell Cycle 2010, 9:2544–7.
42. Maia BM, Rocha RM, Calin GA. Clinical significance of the interaction between non-coding RNAs and the epigenetics machinery: challenges and opportunities in oncology. Epigenetics 2014, 9:75–80.
43. Hung T, Chang HY. Long noncoding RNA in genome regulation: prospects and mechanisms. RNA Biol, 7:582–5.
44. Chen G, Wang Z, Wang D, Qiu C, Liu M, Chen X, et al. LncRNADisease: a database for long-non-coding RNA-associated diseases. Nucleic Acids Res 2013, 41(Database issue):D983–6.
45. Rinn JL, Chang HY. Genome regulation by long noncoding RNAs. Annu Rev Biochem 2012, 81:145–66.
46. Derrien T, Johnson R, Bussotti G, Tanzer A, Djebali S, Tilgner H, et al. The GENCODE v7 catalog of human long noncoding RNAs: analysis of their gene structure, evolution, and expression. Genome Res 2012, 22:1775–89.
47. Mercer TR, Dinger ME, Sunkin SM, Mehler MF, Mattick JS. Specific expression of long noncoding RNAs in the mouse brain. Proc Natl Acad Sci U S A 2008, 105:716–21.
48. Cabili MN, Trapnell C, Goff L, Koziol M, Tazon-Vega B, Regev A, et al. Integrative annotation of human large intergenic noncoding RNAs reveals global properties and specific subclasses. Genes Dev 2011, 25:1915–27.
49. Pauli A, Valen E, Lin MF, Garber M, Vastenhouw NL, Levin JZ, et al. Systematic identification of long noncoding RNAs expressed during zebrafish embryogenesis. Genome Res 2012, 22:577–91.

50. Guttman M, Rinn JL. Modular regulatory principles of large non-coding RNAs. Nature 2012, 482:339–46.
51. Du Z, Fei T, Verhaak RGW, Su Z, Zhang Y, Brown M, et al. Integrative genomic analyses reveal clinically relevant long noncoding RNAs in human cancer. Nat Struct Mol Biol 2013, 20:908–13.
52. Lv J, Liu H, Huang Z, Su J, He H, Xiu Y, et al. Long non-coding RNA identification over mouse brain development by integrative modeling of chromatin and genomic features. Nucleic Acids Res 2013, 41:10044–61.
53. Meng N, Machiraju R, Huang K. Identify Critical Genes in Development with Consistent H3K4me2 Patterns across Multiple Tissues. IEEE/ACM Trans Comput Biol Bioinforma 2015, PP:1–1.
54. Ernst J, Kellis M. Interplay between chromatin state, regulator binding, and regulatory motifs in six human cell types. Genome Res 2013.
55. ToppGene Suite. https://toppgene.cchmc.org/
56. Protein Interaction Network Analysis (PINA). http://cbg.garvan.unsw.edu.au/pina/
57. Nielsen R. Statistical Methods in Molecular Evolution. New York, NY: Springer New York; 2005. [Statistics for Biology and Health]
58. Siepel A, Bejerano G, Pedersen JS, Hinrichs AS, Hou M, Rosenbloom K, et al. Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res 2005, 15:1034–50.
59. Welter D, MacArthur J, Morales J, Burdett T, Hall P, Junkins H, et al. The NHGRI GWAS Catalog, a curated resource of SNP-trait associations. Nucleic Acids Res 2014, 42(Database issue):D1001–6.
60. Catalog of Published Genome-Wide Association Studies. http://www.genome.gov/gwastudies/
61. Liao Q, Liu C, Yuan X, Kang S, Miao R, Xiao H, et al. Large-scale prediction of long non-coding RNA functions in a coding-non-coding gene co-expression network. Nucleic Acids Res 2011, 39:3864–78.
62. Guo X, Gao L, Liao Q, Xiao H, Ma X, Yang X, et al. Long non-coding RNAs function annotation: a global prediction method based on bi-colored networks. Nucleic Acids Res 2013, 41:e35.

63. Cogill SB, Wang L. Co-expression Network Analysis of Human lncRNAs and Cancer Genes. Cancer Inform 2014, 13(Suppl 5):49–59.
64. Blow MJ, McCulley DJ, Li Z, Zhang T, Akiyama JA, Holt A, et al. ChIP-Seq identification of weakly conserved heart enhancers. Nat Genet 2010, 42:806–10.
65. Roh T-Y, Wei G, Farrell CM, Zhao K. Genome-wide prediction of conserved and nonconserved enhancers by histone acetylation patterns. Genome Res 2006, 17:74–81.
66. Bernstein BE, Kamal M, Lindblad-Toh K, Bekiranov S, Bailey DK, Huebert DJ, et al. Genomic maps and comparative analysis of histone modifications in human and mouse. Cell 2005, 120:169–81.
67. Firpi HA, Ucar D, Tan K. Discover regulatory DNA elements using chromatin signatures and artificial neural network. Bioinformatics 2010, 26:1579–86.
68. Hu G, Schones DE, Cui K, Ybarra R, Northrup D, Tang Q, et al. Regulation of nucleosome landscape and transcription factor targeting at tissue-specific enhancers by BRG1. Genome Res 2011, 21:1650–8.
69. Wu W, Cheng Y, Keller CA, Ernst J, Kumar SA, Mishra T, et al. Dynamics of the epigenetic landscape during erythroid differentiation after GATA1 restoration. Genome Res 2011, 21:1659–71.
70. Roh T-Y, Cuddapah S, Cui K, Zhao K. The genomic landscape of histone modifications in human T cells. Proc Natl Acad Sci U S A 2006, 103:15782–7.
71. Roh T-Y, Cuddapah S, Zhao K. Active chromatin domains are defined by acetylation islands revealed by genome-wide mapping. Genes Dev 2005, 19:542–52.

Appendix A: Supplementary material for Chapter 3

Data set Accession Number   Tissue Type            C1     C2     C3     C4     C5     C6     C7
GSM733769                   Blood                  5918   5129   5234   5673   5815   2569   11075
GSM733781                   Connective             3961   6346   6953   6252   5289   3315   9297
GSM733686                   Epithelial             3699   7182   3106   6486   8163   3768   9009
GSM733768                   Muscular               4560   6833   6335   2966   8159   7313   5247
GSM908957                   Nervous                3349   7189   2334   7582   7751   6994   6214
GSM733670                   Embryonic stem cells   4244   7377   2548   7228   7819   6761   5436

Table 12. The unsupervised clustering method K-means was applied to cluster the TSS region profiles of 41413 genes in each dataset.

Gene symbols for the core gene group:

SEPT9, ACSL3, ACTN4, ADNP, AHDC1, AHNAK, AKIRIN2, ALKBH5, AMD1, ANKRD10, ANKRD11, ANKRD34A, ANP32A, ARF1, ARGLU1, ARHGAP21, ARHGDIA, ARID1A, ATF4, ATN1, ATP1B1, ATP2A2, ATP6V0B, ATP6V0C, BASP1, BAZ2A, BCL2L1, BCOR, BMF, BRD2, BTG1, C16ORF72, C8ORF58, CALM2, CAPRIN1, CCR10, CDCA4, CDK6, CDKN1A, CHD2, CIRBP, CIRBP-AS1, CREBBP, CSNK1A1, CSNK1E, CSRNP1, CTDSP2, DDX3X, DIDO1, DLX2, DUSP4, DYRK1A, EGLN2, EML2, ENO1, EWSR1, FAM179B, FAM53B, FAM60A, FBXO5, FDFT1, FEM1B, FOXO3, FSCN1, FUBP1, GAS5-AS1, GNAS, GPANK1, GPBP1, GSK3B, H2AFZ, HDGF, HEG1, HIST1H2AI, HIST1H2AJ, HIST1H3B, HMGA1, HMGCS1, HNRNPA2B1, HNRNPC, HNRNPD, HNRNPF, HNRNPH1, HNRNPH3, HSP90AB1, IER5, IRF2BP2, IRS1, ITGB1, JAG1, KANSL1, KCTD1, KLF16, KLF6, KLF7, LAMC1, LASP1, LDB1, LMNA, LMO4, LOC100216545, LOC256021, LOC338799, MAN2A1, MAP7D1, MAPK1, MAT2B, MATR3, MCL1, MEPCE, MFHAS1, MICA, MIDN, MIR195, MIR4520A, MIR4530, MIR497, MIR5188, MIR661, MLL, MNT, MORF4L1, MSL2, MSMO1, MYO1C, NDUFAF3, NDUFV2, NEAT1, NFIC, NR2F2, NRIP1, NRM, PANK2, PDE4DIP, PHF12, PHF23, PHLPP1, PKM, PLEC, PLEKHO2, POLR2A, PPP1CB, PPP1R14B, PPP1R15B, PPP1R18, PRRT2, PTCH1, PTMA, PTPN12, QKI, RAD23B, RARA, RASSF1, RBM15, RBM4, RGMB, RPL17, RPL17-C18ORF32, RPL3, SETD5, SETD8, SFPQ, SHC1, SIN3A, SKI, SLC16A3, SLC29A1, SLC2A1, SLC35G6, SLC38A2, SLC39A1, SLC3A2, SLC7A5P2, SMAD7, SNORD12, SNORD12B, SNORD12C, SNORD43, SNORD4A, SNX5, SOGA1, SP1, SPRY4, SPTBN1, SRSF2, SRSF3, SSSCA1, STK24, TGIF1, TLE3, TMEM203, TMEM253, TPM4, TRA2B, TRIO, TSC22D2, UBA1, UBE2D3, UBE2E1, UBE2V1, USP22, USP9X, VEGFA, WDR1, WNK1, YBX1, YWHAE, YWHAZ, ZBTB7A, ZC3H4, ZFHX3, ZFP36L1, ZFP36L2, ZMYND8, ZNF516, ZNF580, ZNFX1-AS1

Table 13. Gene symbols of all 217 genes in the core group.


Top 20 enriched GO terms (p value)

Molecular Function
transcription corepressor activity (8.70E-13)
transcription cofactor activity (1.96E-10)
transcription factor binding transcription factor activity (7.50E-10)
protein binding transcription factor activity (8.72E-10)
transcription factor binding (1.04E-07)
RNA binding (1.71E-07)
enzyme binding (4.65E-07)
mRNA binding (8.05E-06)
sequence-specific DNA binding transcription factor activity (8.55E-06)
nucleic acid binding transcription factor activity (8.70E-06)
RNA polymerase II transcription factor binding transcription factor activity involved in negative regulation of transcription (1.24E-05)
kinase binding (2.55E-05)
RNA polymerase II transcription corepressor activity (5.22E-05)
protein kinase binding (1.01E-04)
transcription cofactor binding (1.34E-04)
pre-mRNA intronic binding (3.34E-04)
ATPase activity, coupled to transmembrane movement of ions, phosphorylative mechanism (6.18E-04)
transcription corepressor binding (6.64E-04)
BH3 domain binding (6.64E-04)
chromatin binding (7.95E-04)

Biological Process
negative regulation of cellular macromolecule biosynthetic process (2.75E-13)
negative regulation of transcription from RNA polymerase II promoter (5.28E-13)
negative regulation of macromolecule biosynthetic process (1.07E-12)
negative regulation of gene expression (1.21E-12)
negative regulation of RNA metabolic process (3.01E-12)
negative regulation of transcription, DNA-templated (3.62E-12)
negative regulation of cellular biosynthetic process (4.30E-12)
negative regulation of biosynthetic process (7.24E-12)
negative regulation of nucleobase-containing compound metabolic process (4.88E-11)
negative regulation of nitrogen compound metabolic process (7.88E-11)
positive regulation of nitrogen compound metabolic process (1.22E-10)
positive regulation of cellular biosynthetic process (1.28E-10)
positive regulation of biosynthetic process (2.51E-10)
positive regulation of nucleobase-containing compound metabolic process (7.06E-10)
positive regulation of RNA metabolic process (1.06E-09)
positive regulation of macromolecule biosynthetic process (1.97E-09)
positive regulation of gene expression (8.27E-09)
positive regulation of transcription, DNA-templated (1.15E-08)
cellular response to stress (1.05E-07)
mRNA splicing, via spliceosome (7.61E-07)

Mouse Phenotype
embryonic lethality (8.50E-11)
abnormal prenatal growth/weight/body size (3.56E-10)
abnormal prenatal body size (1.70E-09)
abnormal embryogenesis/development (3.60E-09)
embryogenesis phenotype (3.60E-09)
complete embryonic lethality during organogenesis (2.25E-07)
embryonic lethality during organogenesis (2.56E-07)
abnormal myocardium layer morphology (2.05E-06)
decreased embryo size (2.09E-06)
abnormal embryonic growth/weight/body size (2.69E-06)
abnormal embryo size (2.75E-06)
abnormal heart layer morphology (2.79E-06)
abnormal fetal growth/weight/body size (3.73E-06)
decreased vascular endothelial cell number (4.67E-06)
abnormal fetal size (5.69E-06)
abnormal viscerocranium morphology (1.64E-05)
decreased fetal size (1.87E-05)
prenatal growth retardation (2.68E-05)
abnormal ventricle myocardium morphology (3.31E-05)
abnormal extraembryonic tissue morphology (4.04E-05)

Table 14. Top 20 enriched GO terms in molecular function, biological process and mouse phenotype of the core gene group.

Figure 25. Silhouette coefficients (top) and sums of point-to-centroid distances (bottom) for K values ranging from 2 to 30, plotted for the skeletal muscular tissue cell dataset. K=7 appears to be an optimal choice.

Figure 26. Sums of point-to-centroid distances (bottom) for K values ranging from 2 to 15, plotted for the embryonic stem cell dataset. K=7 appears to be an optimal choice.

Figure 27. Average H3K4me2 distribution signals of the clusters identified by K-means clustering (K=7) in Epithelial tissue cells (top). The average distribution patterns of the seven clusters are plotted in order of decreasing tail length.

Figure 28. The average H3K4me2 distribution signal of the clusters identified by K-means clustering (K = 7) in epithelial tissue cells is plotted (top). The average distribution patterns of the seven clusters are plotted in order of decreasing tail length.

Figure 29. The average H3K4me2 distribution signal of the clusters identified by K-means clustering (K = 7) in connective tissue cells is plotted (top). The average distribution patterns of the seven clusters are plotted in order of decreasing tail length.

Figure 30. The average H3K4me2 distribution signal of the clusters identified by K-means clustering (K = 7) in neural progenitor cells is plotted (top). The average distribution patterns of the seven clusters are plotted in order of decreasing tail length.

Figure 31. The average H3K4me2 distribution signal of the clusters identified by K-means clustering (K = 7) in embryonic stem cells is plotted. The average distribution patterns of the seven clusters are plotted in order of decreasing tail length.

Figure 32. The H3K4me2 distribution signal of the clusters identified by K-means clustering (K = 7) in embryonic stem cells is plotted. The cluster locations are marked on the left side of the figure.

Figure 33. The H3K4me2 distribution signal of the clusters identified by K-means clustering (K = 7) in epithelial tissue cells is plotted. The cluster locations are marked on the left side of the figure.

Figure 34. The H3K4me2 distribution signal of the clusters identified by K-means clustering (K = 7) in blood tissue cells is plotted. The cluster locations are marked on the left side of the figure.

Figure 35. The H3K4me2 distribution signal of the clusters identified by K-means clustering (K = 7) in skeletal muscular tissue cells is plotted. The cluster locations are marked on the left side of the figure.

Figure 36. The H3K4me2 distribution signal of the clusters identified by K-means clustering (K = 7) in neural progenitor cells is plotted. The cluster locations are marked on the left side of the figure.

Figure 37. The core group is the intersection of the genes in the long-tail clusters identified across the various tissue types.
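The intersection shown in this figure amounts to a single set operation. A minimal sketch, assuming each tissue type's long-tail cluster is available as a set of gene symbols; the tissue keys and gene names below are placeholders, not data from this study.

```python
# Hypothetical long-tail gene sets per tissue type (placeholder names only).
long_tail_genes = {
    "tissue_a": {"GENE1", "GENE2", "GENE3", "GENE4"},
    "tissue_b": {"GENE2", "GENE3", "GENE5"},
    "tissue_c": {"GENE2", "GENE3", "GENE4", "GENE6"},
}

# The core group is the set of genes present in every tissue's long-tail cluster.
core_group = set.intersection(*long_tail_genes.values())

print(sorted(core_group))
```

A gene missing from even one tissue's long-tail cluster drops out, so the core group can only shrink as more tissue types are added.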

Appendix B: Supplementary Material for Chapter 4

Figure 38. The silhouette coefficient and the sum of squared Euclidean (SSE) distances for k values ranging from 2 to 30 for the skeletal muscular tissue cell dataset. k = 5 appears to be an optimal choice.

Figure 39. Pol II and H3K4me2 signal at lncRNA TSS regions for various cell lines, from left to right: HeLa cells, HUVEC cells, and K562 cells.

Figure 40. Cluster Profiles from Each Cell Line (heat map of the distribution signals, average cluster profile plot, expression level box plot, and histogram of cluster size) in blood cells.

Figure 41. Cluster Profiles from Each Cell Line (heat map of the distribution signals, average cluster profile plot, expression level box plot, and histogram of cluster size) in Connective tissue cells.

Figure 42. Cluster Profiles from Each Cell Line (heat map of the distribution signals, average cluster profile plot, expression level box plot, and histogram of cluster size) in Epithelial tissue cells.

Figure 43. Cluster Profiles from Each Cell Line (heat map of the distribution signals, average cluster profile plot, expression level box plot, and histogram of cluster size) in Embryonic stem cells.

Figure 44. Cluster Profiles from Each Cell Line (heat map of the distribution signals, average cluster profile plot, expression level box plot, and histogram of cluster size) in Nervous tissue cells.

                      Cluster 1  Cluster 2  Cluster 3  Cluster 4  Cluster 5
Blood                      1240       2643       2481       1273      25088
Connective                 1065       2549       2689       1167      25255
Epithelial                 1128       2077       2911       1088      25521
Muscular                    949       2022       3008       1177      25569
Nervous                    2754       1234       1814       1025      25898
Embryonic Stem cells       1261       2441       2812       1161      25050

Table 15. Number of lncRNAs in each cluster.

Columns: Tissue Type; Enriched GO: Biological Processes; Enriched GO: Molecular Functions; Enriched Human Phenotype; Enriched Mouse Phenotype (all from co-expressed genes).

Muscular (Group Two: 139 genes)
  Enriched GO: Biological Processes
    RNA splicing (p-value = 5.025E-8)
    RNA processing (p-value = 1.325E-6)
    mRNA processing (p-value = 4.271E-6)
    maintenance of protein location in cell (p-value = 2.139E-5)
    maintenance of location in cell (p-value = 3.682E-5)
  Enriched GO: Molecular Functions
    helicase activity (p-value = 5.206E-6)
    RNA binding (p-value = 8.776E-6)
    poly(A) RNA binding (p-value = 3.866E-5)
    1-phosphatidylinositol-3-kinase activity (p-value = 8.193E-5)
    protein serine/threonine kinase activity (p-value = 1.156E-4)
  Enriched Mouse Phenotype
    abnormal prenatal growth/weight/body size (p-value = 1.106E-7)
    abnormal embryonic growth/weight/body size (p-value = 1.552E-7)
    decreased cell proliferation (p-value = 6.747E-6)
    abnormal prenatal body size (p-value = 1.344E-5)
    embryogenesis phenotype (p-value = 2.064E-5)

Blood (Group Two: 127 genes)
  Enriched GO: Biological Processes
    Translation (p-value = 3.827E-18)
    SRP-dependent cotranslational protein targeting to membrane (p-value = 9.325E-12)
    cotranslational protein targeting to membrane (p-value = 1.165E-11)
    protein targeting to ER (p-value = 1.796E-11)
    ribonucleoprotein complex biogenesis (p-value = 1.968E-11)
  Enriched GO: Molecular Functions
    RNA binding (p-value = 2.483E-17)
    structural constituent of ribosome (p-value = 2.986E-17)
    poly(A) RNA binding (p-value = 5.550E-14)
    structural molecule activity (p-value = 4.504E-7)
    hydrogen ion transmembrane transporter activity (p-value = 7.281E-5)
  Enriched Human Phenotype
    Anemia of inadequate production (p-value = 1.698E-4)
    Macrocytic anemia (p-value = 2.874E-4)
    Erythrocyte macrocytosis (p-value = 2.907E-4)
    Persistence of hemoglobin F (p-value = 2.907E-4)
  Enriched Mouse Phenotype
    abnormal cell proliferation (p-value = 3.537E-5)
    prenatal lethality (p-value = 8.619E-5)

core lncRNA (Group Two: 57 genes)
  Enriched GO: Biological Processes
    mRNA processing (p-value = 9.292E-14)
    RNA processing (p-value = 1.232E-12)
    RNA splicing (p-value = 2.456E-11)
    mRNA metabolic process (p-value = 4.057E-11)
    mRNA splicing, via spliceosome (p-value = 5.605E-11)
  Enriched GO: Molecular Functions
    RNA binding (p-value = 3.012E-11)
    poly(A) RNA binding (p-value = 3.576E-9)
    pre-mRNA intronic binding (p-value = 5.286E-5)
    RNA helicase activity (p-value = 1.703E-4)
    RS domain binding (p-value = 2.448E-4)
  Enriched Human Phenotype
    Displacement of the external urethral meatus (p-value = 1.782E-5)
    Abnormality of the urethra (p-value = 3.510E-5)
    Prominent digit pad (p-value = 6.271E-5)
    Prominent fingertip pads (p-value = 6.271E-5)
    Delayed speech and language development (p-value = 1.231E-4)

Table 16. Functional enrichment analyses of genes co-expressed with long-tail lncRNAs (Group Two).

Columns: Tissue Type; Enriched GO: Molecular Functions; Enriched GO: Biological Processes; Enriched Human Phenotype; Enriched Mouse Phenotype (all from co-expressed genes).

Muscular (Group One: 116 genes)
  Enriched GO: Molecular Functions
    poly(A) RNA binding (p-value = 1.810E-7)
    RNA binding (p-value = 3.375E-5)
    thioesterase binding (p-value = 5.082E-5)
    single-stranded DNA binding (p-value = 1.307E-4)
    structure-specific DNA binding (p-value = 3.696E-4)
  Enriched GO: Biological Processes
    macromolecule catabolic process (p-value = 6.628E-9)
    cellular macromolecule catabolic process (p-value = 1.273E-8)
    protein catabolic process (p-value = 1.553E-8)
    cellular protein catabolic process (p-value = 2.223E-8)
    positive regulation of catabolic process (p-value = 4.458E-7)
    regulation of muscle cell differentiation (p-value = 1.451E-5)
    muscle cell differentiation (p-value = 4.189E-5)
    muscle structure development (p-value = 1.861E-4)
    regulation of striated muscle cell differentiation (p-value = 2.340E-4)
  Enriched Human Phenotype
    Hyperpituitarism (p-value = 1.166E-4)
    Peripheral axonal atrophy (p-value = 1.390E-4)
    Pituitary adenoma (p-value = 2.081E-4)
    Growth hormone excess (p-value = 2.907E-4)
    Neoplasm of the anterior pituitary (p-value = 2.907E-4)

Blood (Group One: 175 genes)
  Enriched GO: Molecular Functions
    monoamine transmembrane transporter activity (p-value = 6.547E-5)
    leukemia inhibitory factor receptor binding (p-value = 8.841E-5)
  Enriched Mouse Phenotype
    abnormal IgG1 level (p-value = 2.326E-6)
    abnormal B cell proliferation (p-value = 8.371E-6)
    abnormal IgG level (p-value = 1.278E-5)
    decreased IgG1 level (p-value = 1.338E-5)
    abnormal B cell activation (p-value = 2.238E-5)

core lncRNA (Group One: 100 genes)
  Enriched GO: Molecular Functions
    extracellular matrix structural constituent (p-value = 1.980E-15)
    collagen binding (p-value = 6.961E-8)
    heparin binding (p-value = 9.473E-8)
    glycosaminoglycan binding (p-value = 1.159E-7)
    platelet-derived growth factor binding (p-value = 3.534E-7)
  Enriched GO: Biological Processes
    extracellular matrix organization (p-value = 1.428E-25)
    extracellular structure organization (p-value = 1.538E-25)
    extracellular matrix disassembly (p-value = 1.807E-16)
    collagen metabolic process (p-value = 7.254E-14)
    multicellular organismal macromolecule metabolic process (p-value = 1.378E-13)
  Enriched Human Phenotype
    Joint laxity (p-value = 2.879E-6)
    Atrophic scars (p-value = 8.336E-6)
    Aplasia/Hypoplasia of the skin (p-value = 1.411E-5)
    Abnormality of the joints of the lower limbs (p-value = 1.890E-5)
    Abnormality of toe (p-value = 1.946E-5)
  Enriched Mouse Phenotype
    decreased skin tensile strength (p-value = 4.438E-9)
    abnormal skin tensile strength (p-value = 1.611E-8)
    abnormal tendon morphology (p-value = 5.834E-8)
    abnormal cartilage morphology (p-value = 3.316E-7)
    abnormal blood vessel morphology (p-value = 5.905E-7)

Table 17. Functional enrichment analyses of genes co-expressed with long-tail lncRNAs (Group One).

GO Term ID  GO Term Name                                     P-value
MP:0020175  abnormal IgG1 level                              2.326E-6
MP:0005153  abnormal B cell proliferation                    8.371E-6
MP:0020174  abnormal IgG level                               1.278E-5
MP:0008495  decreased IgG1 level                             1.338E-5
MP:0008217  abnormal B cell activation                       2.238E-5
MP:0008781  abnormal B cell apoptosis                        3.186E-5
MP:0002461  increased immunoglobulin level                   3.862E-5
MP:0002459  abnormal B cell physiology                       4.115E-5
MP:0002490  abnormal immunoglobulin level                    4.157E-5
MP:0008782  increased B cell apoptosis                       4.381E-5
MP:0004976  abnormal B-1 B cell number                       4.553E-5
MP:0004940  abnormal B-1 B cell morphology                   4.916E-5
MP:0001807  decreased IgA level                              5.309E-5
MP:0001805  decreased IgG level                              5.385E-5
MP:0020180  abnormal IgM level                               7.878E-5
MP:0002460  decreased immunoglobulin level                   8.471E-5
MP:0004978  decreased B-1 B cell number                      9.211E-5
MP:0001800  abnormal humoral immune response                 1.072E-4
MP:0005022  abnormal immature B cell morphology              1.132E-4
MP:0010326  malleus hypoplasia                               1.314E-4
MP:0008202  absent B-1 B cells                               2.249E-4
MP:0002359  abnormal spleen germinal center morphology       2.376E-4
MP:0002144  abnormal B cell differentiation                  2.585E-4
MP:0002494  increased IgM level                              2.658E-4
MP:0004939  abnormal B cell morphology                       2.796E-4
MP:0020171  abnormal IgA level                               2.810E-4
MP:0008472  abnormal spleen secondary B follicle morphology  2.810E-4
MP:0001802  arrested B cell differentiation                  3.596E-4
MP:0005432  abnormal pro-B cell morphology                   3.808E-4
MP:0004365  abnormal strial basal cell morphology            3.869E-4
MP:0002357  abnormal spleen white pulp morphology            4.220E-4
MP:0003945  abnormal lymphocyte physiology                   4.281E-4
MP:0000726  absent lymphocyte                                4.325E-4
MP:0005093  decreased B cell proliferation                   4.603E-4
MP:0013664  abnormal immature B cell number                  5.752E-4
MP:0008071  absent B cells                                   6.352E-4

Table 18. Group One genes co-expressed with long-tail lncRNAs from blood tissue cells are enriched with tissue-specific mouse phenotypes.

Appendix C: Supplementary Material for Chapter 5

CP1: Biological Process

GM12878
lymphocyte activation (4.06E-12)
leukocyte activation (2.05E-11)
immune response (5.48E-11)
regulation of lymphocyte activation (6.79E-11)
T cell activation (1.39E-10)
regulation of leukocyte activation (2.23E-10)
regulation of immune system process (4.62E-10)
immune effector process (8.76E-09)
positive regulation of T cell activation (1.09E-08)
regulation of T cell activation (1.12E-08)

HSMM
cardiovascular system development (8.08E-22)
muscle structure development (2.31E-21)
skeletal system development (8.68E-19)
muscle tissue development (1.60E-17)
striated muscle tissue development (5.22E-17)
muscle organ development (2.85E-16)
skeletal system morphogenesis (1.72E-15)
embryonic skeletal system development (4.92E-15)
heart development (1.04E-12)
embryonic skeletal system morphogenesis (1.21E-12)

CP1: Mouse Phenotype

GM12878
abnormal leukocyte physiology (3.12E-26)
abnormal lymphocyte physiology (5.95E-26)
abnormal hematopoietic system physiology (9.38E-26)
abnormal immune cell physiology (1.52E-25)
abnormal cell-mediated immunity (2.36E-25)
abnormal blood cell physiology (4.31E-25)
abnormal adaptive immunity (4.32E-25)
abnormal T cell physiology (3.52E-24)
abnormal lymphocyte morphology (4.07E-23)
abnormal hemopoiesis (5.24E-22)

HSMM
abnormal axial skeleton morphology (4.27E-10)
abnormal muscle morphology (1.69E-09)
abnormal thoracic cage morphology (3.81E-09)
abnormal skeleton morphology (5.59E-08)
abnormal joint morphology (7.09E-08)
abnormal cardiovascular system morphology (7.65E-08)
abnormal cardiovascular development (8.17E-08)
abnormal rib morphology (1.32E-07)
skeleton phenotype (2.09E-07)
abnormal skeletal muscle morphology (9.56E-07)

Table 19. Enriched GO terms for genes displaying CP1 at their promoters.

CP2: Biological Process

GM12878
cell cycle (2.87E-36)
mitotic cell cycle (1.04E-34)
single-organism organelle organization (1.37E-29)
cell cycle process (1.75E-29)
mitotic cell cycle process (5.91E-29)
intracellular transport (3.13E-26)
cellular macromolecule catabolic process (1.14E-25)
single-organism intracellular transport (5.40E-25)
regulation of cellular protein metabolic process (7.11E-25)
viral process (2.57E-24)
multi-organism cellular process (4.41E-24)
cellular response to stress (5.67E-24)
protein modification by small protein conjugation or removal (7.06E-24)
macromolecule catabolic process (8.07E-24)
programmed cell death (3.53E-23)
negative regulation of macromolecule metabolic process (6.33E-23)
protein localization (9.59E-23)
nucleobase-containing compound catabolic process (1.38E-22)
cellular macromolecule localization (3.51E-22)
interspecies interaction between organisms (4.67E-22)

HSMM
regulation of cellular protein metabolic process (1.07E-40)
negative regulation of macromolecule metabolic process (1.62E-37)
regulation of cellular component organization (2.40E-36)
negative regulation of metabolic process (2.26E-35)
cell cycle (2.17E-33)
single-organism organelle organization (6.55E-31)
transcription from RNA polymerase II promoter (2.64E-30)
programmed cell death (2.65E-30)
regulation of protein modification process (5.82E-30)
negative regulation of cellular metabolic process (5.85E-30)
apoptotic process (9.19E-30)
cellular protein localization (5.43E-29)
protein phosphorylation (6.46E-29)
cellular macromolecule localization (6.69E-29)
mitotic cell cycle (1.52E-28)
regulation of transcription from RNA polymerase II promoter (2.03E-28)
intracellular transport (1.07E-27)
phosphorylation (1.94E-27)
enzyme linked receptor protein signaling pathway (1.98E-27)
cell cycle process (2.30E-27)

Table 20. Enriched GO terms for genes displaying CP2 at their promoters.

CP3: Biological Process

GM12878
tRNA metabolic process (2.71E-11)
ncRNA metabolic process (3.64E-09)
tRNA processing (6.71E-09)
protein modification by small protein conjugation or removal (3.57E-08)
ncRNA processing (1.07E-07)
protein polyubiquitination (1.89E-07)
electron transport chain (1.90E-07)
protein modification by small protein conjugation (2.24E-07)
respiratory electron transport chain (2.62E-07)
cellular respiration (4.12E-07)
regulation of ligase activity (4.73E-07)
mitochondrion organization (5.41E-07)
regulation of ubiquitin-protein transferase activity (1.18E-06)
negative regulation of ubiquitin-protein transferase activity (1.49E-06)
negative regulation of ligase activity (1.49E-06)
oxoacid metabolic process (2.00E-06)
carboxylic acid metabolic process (2.39E-06)
protein ubiquitination (2.66E-06)
oxidation-reduction process (2.72E-06)
phosphatidylinositol metabolic process (2.83E-06)

HSMM
protein modification by small protein conjugation or removal (1.15E-14)
protein modification by small protein conjugation (2.62E-13)
ncRNA metabolic process (5.48E-13)
cellular respiration (9.00E-13)
tRNA metabolic process (1.17E-12)
cellular response to DNA damage stimulus (1.52E-12)
RNA processing (5.05E-12)
protein ubiquitination (1.62E-11)
mitotic cell cycle (1.80E-10)
DNA repair (2.50E-10)
respiratory electron transport chain (5.18E-10)
ncRNA processing (1.31E-09)
electron transport chain (1.62E-09)
tRNA processing (2.01E-09)
mitochondrion organization (2.81E-09)
mitotic cell cycle process (3.25E-09)
cell cycle (4.26E-09)
cellular response to stress (5.77E-09)
cofactor metabolic process (8.18E-09)
single-organism intracellular transport (3.24E-08)

Table 21. Enriched GO terms for genes displaying CP3 at their promoters.

CP4-Biological Process GM12878 HSMM RNA processing 5.30E-09 RNA processing 5.51E-15 tRNA metabolic process 4.06E-08 ncRNA metabolic process 1.40E-12 DNA metabolic process 1.44E-06 DNA metabolic process 3.64E-11 tRNA processing 5.01E-06 ncRNA processing 4.52E-11 cellular response to DNA 5.93E-06 DNA repair 1.04E-09 damage stimulus cellular response to DNA ncRNA metabolic process 6.76E-06 1.41E-09 damage stimulus ncRNA processing 7.45E-06 RNA modification 6.34E-09 iron-sulfur cluster assembly 1.43E-05 tRNA metabolic process 9.22E-09 metallo-sulfur cluster assembly 1.43E-05 cell cycle 1.05E-07 ribonucleoprotein complex organization 1.95E-05 1.50E-07 biogenesis DNA repair 1.95E-05 rRNA metabolic process 7.54E-07 nucleoside monophosphate mRNA processing 2.81E-05 1.22E-06 metabolic process transcription from RNA respiratory electron transport 4.03E-05 1.24E-06 polymerase III promoter chain transcription from RNA DNA recombination 5.67E-05 1.31E-06 polymerase III promoter cellular respiration 7.56E-05 cell cycle process 1.33E-06 Table 22. Enriched GO terms for genes displaying CP4 at their promoters.
