Computational Modeling of Long Noncoding RNA Transcriptional

COMPUTATIONAL MODELING OF LONG NONCODING RNA

TRANSCRIPTIONAL REGULATION

Haozhe Wang

APPROVED BY SUPERVISORY COMMITTEE:

______Zhenyu Xuan, Chair

______Michael Q. Zhang

______Tae Hoon Kim

______Zachary Campbell

Haozhe Wang

To my family

COMPUTATIONAL MODELING OF LONG NONCODING RNA

TRANSCRIPTIONAL REGULATION

HAOZHE WANG, MS

DISSERTATION

Presented to the Faculty of

The University of Texas at Dallas

in Partial Fulfillment

of the Requirements

for the Degree of

DOCTOR OF PHILOSOPHY IN

MOLECULAR AND CELL BIOLOGY

THE UNIVERSITY OF TEXAS AT DALLAS

December 2018

ACKNOWLEDGMENTS

I would like to express my deepest appreciation to my supervisor, Professor Zhenyu Xuan, for his patience, support, and assistance to my research. His guidance and persistent help encouraged me in all the time of research.

I would also like to thank my committee members: Professor Michael Q. Zhang, Professor Tae

Hoon Kim, and Professor Zachary Campbell for their advice and guidance. Especially, I am extremely thankful to Professor Michael Q. Zhang for his insightful comments and encouragement.

I am grateful to all the members of the Xuan lab and Zhang lab for the help and pleasant time of working together. Special thanks go to Dr. Yong Chen and Dr. Yunfei Wang for their sharing of expertise, extensive support, and sincere guidance and encouragement.

Last but not least, I would like to thank my parents and my husband for having faith in me and extending their support at all times.

October 2018

COMPUTATIONAL MODELING OF LONG NONCODING RNA

TRANSCRIPTIONAL REGULATION

Haozhe Wang, PhD The University of Texas at Dallas, 2018

Supervising Professor: Zhenyu Xuan

With work from the past decade and rapid development of high throughput transcriptome sequencing techniques, our knowledge of noncoding RNAs has broadened. The human genome is pervasively transcribed; however, a significant fraction of the transcripts generated from the human genome does not code for proteins. Among noncoding RNAs, long non-coding RNAs

(lncRNAs) play a variety of roles, ranging from transcription regulation, controlling chromatin epigenetic state, participating alternative splicing to subnuclear compartment formation. Most lncRNAs exert their functions through the interaction of protein partners. Elucidation of RNA- protein interactions is essential for understanding many critical biological processes. In particular, lncRNAs interact with ribonucleoprotein complexes and numerous chromatin regulators to target appropriate locations in the genome. Based on the support vector machine method, we present LncLink, a computational method used to infer the set of the most probable proteins involved in lncRNA and the genomic regions that they control. We modeled LncLink using data obtained by capture hybridization analysis of RNA targets. The inferences derived from this method are obtained by integrating transcription factor binding sites and genome-wide chromatin interactions as predictive features. To validate the method, we applied LncLink to

CHART-seq data obtained from MCF-7 cells to identify putative protein binding partners of lncRNA NEAT1. Our method generated a signature of 27 proteins highly predicted to be involved in NEAT1 interaction and mediating NEAT1 chromatin targeting. Furthermore, several of these proteins have been implicated in NEAT1 binding and NEAT1-mediated transcription in the literature, confirming the reliability of our results. The findings revealed by our work provide novel insights into our understanding of key players targeted by lncRNA.

vii

TABLE OF CONTENTS

ACKNOWLEDGMENTS ...... v

ABSTRACT ...... vi

LIST OF FIGURES ...... x

LIST OF TABLES ...... xi

CHAPTER 1 INTRODUCTION ...... 1

1.1 Noncoding RNA ...... 1

1.2 Long noncoding RNA ...... 1

1.3 lncRNA NEAT1 ...... 13

1.4 Aim of Dissertation ...... 16

CHAPTER 2 DATA RESOURCES AND METHODS ...... 19

2.1 Overall Design ...... 19

2.2 Datasets ...... 21

2.3 Methods...... 23

CHAPTER 3 RESULTS ...... 27

3.1 Genome-wide identification of NEAT1 target sites ...... 27

3.2 Motif Analysis ...... 27

3.3 NEAT1 genomic targets enriched in open promoter regions ...... 28

3.4 Prediction of NEAT1 target sites using genomic features ...... 30

3.5 Prioritization of co-factor candidates using SVM model output ...... 32

viii

3.6 PPIs of selected proteins and CHART-MS proteins ...... 36

CHAPTER 4 CONCLUSION AND DISCUSSION ...... 38

4.1 Conclusion ...... 38

4.2 Discussion ...... 39

APPENDIX ...... 41

REFERENCES ...... 47

BIOGRAPHICAL SKETCH ...... 59

CURRICULUM VITAE

LIST OF FIGURES

Figure 1-1. The number of publications on lncRNAs in past decades...... 2

Figure 1-2. Classification of lncRNAs based on genomic location respect to protein coding genes...... 3

Figure 1-3. Functions of lncRNA...... 5

Figure 1-4. Schematic of CHART-seq...... 11

Figure 1-5. Transcripts of NEAT1 gene...... 14

Figure 2-1. Computational workflow of feature selection in lncRNA genomic targets...... 20

Figure 3-1. NEAT1 genomic targets...... 28

Figure 3-2. Distribution of NEAT1 CHART-seq signals...... 29

Figure 3-3. Heatmap of matrices and matrices correlation distribution...... 31

Figure 3-4. Model prediction performance...... 32

Figure 3-5. Feature weights ranked by L1-SVM...... 33

Figure 3-6. Protein-protein interactions of selected proteins and CHART-MS proteins...... 37

LIST OF TABLES

Table 3-1. Performance of training set and test set from the best model...... 32

Table 3-2. List of feature weights identified by L1-norm SVM...... 34

Table 4-1. Performance of training set and test set using 579 predicted transcription factor binding profiles together with original 74 transcription factor binding profiles as features...... 39

Table A-1. Feature list used for predicting lncRNA targets...... 41

Table A-2. Sequence motifs found within NEAT1 target regions...... 45

CHAPTER 1

INTRODUCTION

1.1 Noncoding RNA

The Encyclopedia of DNA Elements (ENCODE) project has revealed that 75% of the human genome can be transcribed, whereas protein-coding genes only account for 1-2% of the genome

(Djebali et al., 2012; ENCODE Project Consortium, 2012). Pervasively transcribed non-coding

RNAs (ncRNAs) indicates that RNAs have a diverse role in biological processes including chromatin remodeling, transcription, and post-transcriptional regulations. The diverse functions of noncoding RNA endow them with the potential of acting as new biomarkers and novel therapeutic cancer targets.

Noncoding RNAs can be classified into two groups according to their length. One group is small

RNA with size shorter than 200 nt. They have been relatively well characterized, and include transfer RNAs, microRNAs, small nuclear RNAs, PIWI-interacting RNAs, etc. The other group with the length greater than 200nt is defined as long noncoding RNA (lncRNA) (Prasanth &

Spector, 2007).

1.2 Long noncoding RNA

The human genome produces approximately 16,000 potential lncRNAs with 9,640 gene loci

(Derrien et al., 2012). LncRNAs are defined as transcripts that are longer than 200 nucleotides that do not have protein-coding potential. Most of the characterized lncRNAs are transcribed by

RNA polymerase II (RNA Pol II) as are mRNAs (Guttman et al., 2009). These lncRNAs also have mRNA-like features such as similar chromatin modifications on regulatory elements, 5′

terminal methylguanosine cap, alternative splicing and are polyadenylated (Guttman et al., 2009;

Mercer & Mattick, 2013).

Recently lncRNAs have become an important class of genes in the human genome. There has been an exponential increase of scientific publications in the lncRNA study field near recent decades (Figure 1-1). The sheer number and the increasing pace of lncRNAs studies provide a better understanding of this new class transcripts.

3000

2500

2000

1500 Count

1000

500

0 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018

Figure 1-1. The number of publications on lncRNAs in past decades. The bar plot shows the number of publications related to lncRNAs, and the number of publications increased exponentially in the last decades. The keywords (“long noncoding RNA” OR “lnc RNA”) were used for searching in PubMed (http://www.ncbi.nlm.nih.gov/pubmed). Statistics for years 1988-2018 (until September) are shown.

1.2.1 Classification of lncRNA

LncRNAs can be classified based on their features, such as, by function mechanisms, by targeting mechanisms, by the types of action mechanisms and most commonly, by their genomic

locations relative to the protein-coding genes (Chen, L. & Carmichael, 2009; Chen, M. et al.,

2015). Based on their genomic locations, lncRNAs can be classified as (Derrien et al., 2012;

Ulitsky, Shkumatava, Jan, Sive, & Bartel, 2011) (Figure 1-2):

1. Intergenic, located in between two protein-coding genes.

2. Intronic, resided entirely within introns of a protein-coding gene.

3. Antisense, transcribed from the opposite strand of protein-coding genes with overlapping.

4. Sense, transcribed from the same strand of a protein-coding gene and overlaps with it.

5. Bidirectional, also known as promoter associated lncRNAs, are divergently transcribed

from proximal promoter of protein-coding genes.

Intergenic lncRNA Gene A lncRNA Gene B

Intronic lncRNA Exon 1 lncRNA Exon 2

Gene A Antisense lncRNA lncRNA

Sense lncRNA Gene A lncRNA

Bidirectional lncRNA

Gene A Promoter lncRNA

Protein coding gene Long noncoding RNA Figure 1-2. Classification of lncRNAs based on genomic location respect to protein coding genes. Protein-coding genes are shown in blue; lncRNAs are shown in orange. Intergenic lncRNA, transcribed from intergenic regions on both strands. Intronic lncRNA, transcribed from introns of protein-coding genes. Antisense lncRNA, transcribed from the opposite strand of protein-coding genes with overlapping. Sense lncRNA, transcribed from the same strand of protein-coding genes. Bidirectional lncRNAs, are divergently transcribed from proximal promoter of protein-coding genes.

1.2.2 Roles and functions of lncRNAs

With the rapid growth of lncRNA study, their roles and functions have been demonstrated in various aspects of mammalian biology recently. LncRNAs can regulate gene expression and transcriptional process at multiple levels (Figure 1-3), such as remodeling the chromatin status by recruiting epigenetic modifying complexes (Khalil et al., 2009; Forrest & Khalil, 2017)

(Figure 1-3 A), regulate gene transcription by recruiting or decoying transcription machinery

(Figure 1-3 B), organizing nuclear structure (Figure 1-3 D) and subnuclear compartment (Figure

1-3 E). In addition, lncRNAs can also regulate gene expression at the post-transcriptional level, such as regulating pre-mRNA splicing (Figure 1-3 C), controlling mRNA stability (Figure 1-3

G), modulating translation (Figure 1-3 F), or acting as competing endogenous RNA (ceRNA) to regulate mRNA level (Geisler & Coller, 2013; Quinodoz & Guttman, 2014; Vance & Ponting,

2014) (Figure 1-3 H). lncRNAs regulate gene expression through chromatin modifiers

In past decade, lncRNA have been reported as major players in epigenetic regulation. In nucleus, many lncRNAs may scaffold or recruit multiple chromatin modifiers to reprogramming the chromatin status. For instance, lncRNA HOTAIR (HOX transcript antisense RNA) can scaffold with PRC2 (Polycomb Repressive Complex 2) with its 5’ domain which trimethylates histone 3 lysine-27, and recruit LSD1/CoREST/REST complex which can demethylate H3K4me2 and

H3K4me1 by using 3’ domain to the HOX-D gene clusters, removing active histone marks and maintaining repression across 40 kb of the HOXD locus (Li, Lingjie et al., 2013; Rinn et al., 2007;

Tsai et al., 2010).

Figure 1-3. Functions of lncRNA. LncRNAs can bind to DNA, RNA, proteins and/or their complexes acting diverse functions in the cell. (A) lncRNAs (in red) guide chromatin remodeling complexes to chromatin mediating active (green dots) or repressive (red dots) histone modifications. (B) lncRNAs can decoy or recruit transcription factors, RNA polymerase and/or cofactors to gene promoters, thus controlling gene transcription. (C) LncRNAs can regulate splicing events and mRNA stability. (D) lncRNAs mediate chromatin structure. (E) LncRNAs participate in organizing subnuclear compartments via the scaffold function. (F) lncRNAs can control the translation rate of mRNA by influencing polysomes loaded to mRNA. (G) lncRNAs can control mRNA decay by base-pairing with mRNAs. (H) LncRNAs act as ceRNAs competing for miRNA binding, thus influencing the level of the mRNAs targeted by the sequestered miRNA. Reprint from (Neguembor, Jothi, & Gabellini, 2014), under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0).

The interplay between lncRNAs and transcription factors

Additionally, lncRNA can recruit or prevent the binding of the transcription factor, RNA polymerase and/or cofactors. LncRNAs that interact with DNA binding proteins could be

targeted to the chromatin indirectly through RNA–protein–DNA interactions. For example, brain specific lncRNA RMST (rhabdomyosarcoma 2-associated transcript) plays an important role in pluripotency maintenance and modulating neurogenesis. RMST physically interacts with transcription factor sex-determining region Y-box (SOX2) which known as a regulator of neural fate. RNA interference and RNA pull down studies reveals that lncRNA RMST is required for the correct binding of SOX2 to promoter regions of neurogenic target genes and activates gene transcription in human ESCs (Ng, S., Bogu, Soh, & Stanton, 2013). In addition, zinc finger transcription factor YinYang 1 (YY1) is required for Xist tethering to chromatin in mouse embryonic fibroblasts (Hasegawa et al., 2010; Jeon & Lee, 2011) . Other examples including lncRNA PANDA and transcription factor Y subunit alpha (NF-YA) in U2OS cells (Hung et al.,

2011) , Evf-2 and Dlx-2 protein in mouse embryos (Feng et al., 2006) , lncRNA DINO and p53 mouse embryonic fibroblasts (Schmitt et al., 2016), etc. Therefore, lncRNAs play important roles in gene transcription. lncRNAs organize nuclear structure

Another function of lncRNA is to regulate transcription vis enhancer lncRNA (eRNA) mediated chromatin loops. It has been demonstrated that eRNAs can stabilize DNA loops which bring distal enhancer to its regulated promoter by recruiting transcription machinery or chromatin remodelers (Mousavi et al., 2013). For instance, the interaction between FOXC1 and its enhancer is stabilized by eRNA, as well as estrogen receptor alpha in MCF-7 human breast cancer cells.

(Li, Wenbo et al., 2013). lncRNA in subnuclear compartments

Mammalian nucleus is in highly organization and contains distinct membrane-less organelles, including nucleoli, nuclear speckles, paraspeckles, and etc. These nuclear bodies contain specific

proteins and RNAs involved in particular nuclear processes (Spector, 2006). Many lncRNA localize to different nuclear bodies stably and abundantly with a series of protein factors. For example, the lncRNA MALAT1(metastasis-associated long adenocarcinoma transcript 1) sequesters and binds to serine/arginine (SR) splicing factors to form nuclear speckles (Tripathi et al., 2010) and the lncRNA NEAT1 serves as essential architectural component of paraspeckles

(Zhang, B., Mao, Sunwoo, & Spector, 2011). Other nuclear bodies also require RNAs for assembly, indicating a broad role for lncRNAs in nuclear formation (Chujo, Yamazaki, &

Hirose, 2016).

Post-transcriptional regulation by lncRNA

Besides the function of transcriptional regulation, lncRNAs are also able to influence post- transcriptional processes by affecting mRNA splicing, stability, degradation, and translation efficiency. For instance, alternative splicing of pre-mRNA is one of the key steps for post- transcriptional regulation of gene expression (Blencowe, 2006; Hallegger, Llorian, & Smith,

2010; Licatalosi & Darnell, 2010). The lncRNA MALAT1 was demonstrated to function in pre- mRNA splicing by influencing the functions and distributions of SR proteins (Tripathi et al.,

2010).

Multiple lncRNAs have miRNA binding sites and have been identified to function as competing endogenous RNA (ceRNA) by sequestering miRNAs. CeRNAs sequester miRNA from their target mRNAs, thus influence the levels of mRNAs (Chen, M. et al., 2015; Peng et al., 2015;

Salmena, Poliseno, Tay, Kats, & Pandolfi, 2011).

1.2.3 LncRNA genomic targeting

Mechanisms of lncRNA recruitment to genomic DNA

Summarizing from the mechanism of lncRNAs exerting above, three mechanisms have been proposed for how lncRNAs recognize specific genomic targets (Vance & Ponting, 2014). (i) lncRNAs can interact with DNA directly via nucleic-acid hybridization, such as, through base- pairing like telomerase RNA and telomeric DNA (Harrington, 1991) , or by triplex-helices mediated interactions (Martianov, Ramadass, Barros, Chow, & Akoulitchev, 2007; Schmitz,

Mayer, Postepska, & Grummt, 2010) . (ii) RNA–RNA interactions between nascent RNA and lncRNAs could also guide lncRNAs to genomic targets. The example of lncRNA utilizing the

RNA-RNA interaction mechanism is lncRNA Malat1, a lncRNA in nuclear speckles

(Hutchinson et al., 2007). The interaction between Malat1 and chromatin is dependent on transcription, and the interaction of Malat1 with nascent RNA is mediated by proteins (Engreitz et al., 2014) . (iii) lncRNAs indirectly interact with DNA via interaction with DNA binding proteins. This mechanism can explain the localization of several lncRNAs including Xist. Xist and its interactome involve in X chromosome inactivation, it is capable to stick to chromatin in cis. Xist is thought to interact with DNA binding protein YY1 and hnRNPU that are required for tethering Xist to DNA (Hasegawa et al., 2010; Jeon & Lee, 2011) .

To answer how lncRNAs regulate transcription, it is important to determine where lncRNAs localize on the genome, and what proteins they interact with. Several techniques have been developed to map genomic binding sites of lncRNAs, and/or to systematically study the proteins bound to them. These techniques, termed chromatin isolation by RNA purification (CHIRP)

(Chu, Qu, Zhong, Artandi, & Chang, 2011) , capture hybridization analysis of RNA targets

(CHART) (Simon et al., 2011) and RNA antisense purification (RAP) (Engreitz et al., 2013).

Capture hybridization analysis of RNA targets (CHART)

Capture Hybridization Analysis of RNA Targets (CHART) has been developed to map the genomic binding sites of chromatin-associated endogenous RNAs in flies and humans by

Kingston’s lab (Simon et al., 2011). It is also commonly used to purify lncRNA-associated proteins. This protocol used a hybridization-based technique (Figure 1-4). To ensure the capture oligos can hybridize to the desired RNA, the oligos are designed from regions on RNA that are accessible and not occluded by RNA secondary structure or protein binding. The RNase H mapping assay was performed to identify the sites available to be hybridized by capture oligonucleotides. Then a set of 25 nucleotide biotinylated antisense oligonucleotides were designed based on the result of RNase H mapping assay and formaldehyde-cross-linked to lncRNA of interest. To purify target chromatin from the cross-linked complex, RNA-chromatin complexes are next immobilized on beads, washed, and eluted using RNase H (Simon et al.,

2011; Simon, 2013a). High-throughput sequencing was used to reveal associated genomic DNA.

In addition to analyzing DNA that is associated with the desired RNA, CHART can also be used to detect associated proteins followed by western blot or mass spectrometry (CHART-MS) (West et al., 2014).

The basic idea of RAP, CHIRP and CHART is similar. In general, the antisense oligos were added to fragmented and crosslinked chromatin. The crosslinking allows the lncRNA, its interacted proteins, and the associated chromatin to be pulled down by biotin-binding beads. The differences in the methods include cross-linking reagents, chromatin shearing, the strength of washing buffer and probe design. The major difference among them is the probe design. CHART designs effective probe and avoids sequence nonspecific effects using RNase H assay which narrows the search space for hybridization spots. In contrast, ChIRP and RAP do not require a

priori knowledge of lncRNA structure and protein binding situation. They tile oligonucleotides across the entire target lncRNA so that all potential hybridization spots can be detected.

Computational methods for predicting lncRNA-DNA interactions

There are few computational methods to predict lncRNA-DNA interactions. Current computational methods for predicting lncRNA-DNA interactions generally are based on nucleic acid hybridization. For example, a computational method called Triplexator (Buske, Bauer,

Mattick, & Bailey, 2012), integrated all aspects of the triplex formation to predict. Another computational tool LongTarget uses Hoogsteen base-pairing analysis to predict lncRNA DNA- binding motifs and binding sites(He, Zhang, Liu, & Zhu, 2014). All of these methods only detect the interactions mediated by lncRNA-DNA nucleic-acid hybridization, instead of protein- mediated interactions.

Figure 1-4. Schematic of CHART-seq. The protocol uses a set of 25 nucleotide biotinylated antisense oligonucleotides, complementary to desired RNA. Target RNA cross linked to interacted chromatin and proteins. RNA-chromatin- protein complexes are next immobilized on beads, washed, and eluted using RNase H. Interacted genomic DNA is then sequenced using high-throughput sequencing and associated proteins are identified by western blot. Reprint with permission from (Simon et al., 2011). "Copyright (2011) National Academy of Sciences."

1.2.4 LncRNA-protein interaction

To decipher the mechanisms by which RNA binding proteins affect ncRNA functions, several methods have been developed to determine protein bound RNAs on a transcriptome-wide scale

(Jankowsky & Harris, 2015; König, Zarnack, Luscombe, & Ule, 2012; McHugh, Russell, &

Guttman, 2014). Most of the techniques based on protein immunoprecipitation, which utilize antibodies to pull down the protein of interest and combined with high throughput sequencing to sequence its interacted RNAs. These techniques termed RNA immunoprecipitation (RIP) (Zhao et al., 2010), RNA–protein immunoprecipitation in tandem (RIPiT) (Singh, Ricci, & Moore,

2014), and crosslinking and immunoprecipitation (CLIP) (Darnell, 2010; Licatalosi et al., 2008) with several protocol variants: PAR-CLIP (Photoactivatable-Ribonucleoside-

Enhanced Crosslinking and Immunoprecipitation) (Hafner et al., 2010), iCLIP (individual- nucleotide CLIP) (König et al., 2010), eCLIP (enhanced CLIP) (Van Nostrand et al., 2016) and irCLIP (infrared-CLIP) (Zarnegar et al., 2016). These methods revealed large numbers of protein-binding RNAs and numerous binding sites on RNAs. The binding sites usually defined as consensus motifs for protein binding (Milek, Wyler, & Landthaler, 2012).

Computational methods provide a less consuming method comparing to the experimental techniques of RNA-protein interactions. RNA-binding domains (RBDs) tend to show RNA sequence and/or structural motifs specificities, therefore different computational approaches utilized this to predict RNA-protein interactions. An approach called catRAPID (Livi, Klus,

Delli Ponti, & Tartaglia, 2015) was developed to evaluate the interaction propensities of polypeptides and nucleotide chains between protein and RNA using secondary structure propensities, hydrogen-bonding propensities, and van der Waals interaction propensities. A purely sequence-based approach RPISeq used protein and RNA sequences information including

amino acid and nucleotide composition, and further constructed Random Forest and Support

Vector Machine classifiers to predict the probability of ncRNA-protein interactions (Shen et al.,

2007). There are other methods, such as RPI-Pred (Suresh, Liu, Adjeroh, & Zhou, 2015), IncPro

(Lu et al., 2013) together with the methods mentioned above which mainly focus on intrinsic features of RNAs and proteins to reveal the direct RNA-protein binding and consensus motifs, rather than provide the information of protein-RNA complexes, which is important for understanding the molecular function.

Moreover, RNA binding is not restricted to proteins with RNA-binding domains (Jankowsky &

Harris, 2015). It has been revealed extensive RNA associated with numbers of metabolic enzymes that does not have previously defined RBDs (Castello et al., 2012; Hentze & Preiss,

2010; Mitchell, Jain, She, & Parker, 2013). Other studies have shown the interactions of RNAs

(mostly lncRNA) with transcription factors (Di Ruscio et al., 2013; Hudson & Ortlund, 2014; Ng et al., 2013).

1.3 LncRNA NEAT1

Among the lncRNAs identified to date, nuclear paraspeckle assembly transcript 1 (also called nuclear enriched abundant transcript, NEAT1) is one of the most abundant lncRNAs expressed in human cells (Gibb et al., 2011). NEAT1 is located on chromosome 11q13.1. It has recently emerged as a key regulator involved in various cellular processes, cancer development and progression. Two isoforms are generated from a common promoter of NEAT1 gene, 3.7kb

NEAT1_1 and 23kb NEAT1_2 (Hutchinson et al., 2007). The shorter isoform NEAT1_1 completely overlaps the 5′ end of NEAT1_2 with polyadenylated tail (Hutchinson et al., 2007;

Sasaki, Ideue, Sano, Mituyama, & Hirose, 2009), whereas the longer isoform NEAT1_2 has a

triple helix structure at the 3′ end by RNase P cleavage (Brown, Valenstein, Yario, Tycowski, &

Steitz, 2012). Both transcripts do not contain introns and are retained in nucleus (Figure 1-5). c hr1 y

chr11 1

NEAT1_1 3.7 kb

NEAT1_2 23 kb Figure 1-5. Transcripts of NEAT1 gene. NEAT1 has two isoforms. The human nuclear paraspeckle assembly transcript 1 (NEAT1) gene is located in chromosome 11q13.1. Two isoforms of NEAT1, NEAT1_1 (3.7 kilobases) and NEAT1_2 (23 kilobases) are transcribed from the same promoter.

Aberrant lncRNA expressions were associated with cancer development. NEAT1 has been reported to be overexpressed and have important functions in many types of cancer, including breast cancer (Li, Wanjin et al., 2017; Zhang, M., Wu, Wang, & Wang, 2017), prostate cancer

(Chakravarty et al., 2014; Li, Xin et al., 2018), non-small cell lung cancer (Sun et al., 2016), cervical cancer (Guo et al., 2018), ovarian cancer (Ding, Wu, Tao, & Peng, 2017), colorectal cancer (Li, Yunlong et al., 2015), gliomas (Gong et al., 2016), gastric cancer (Fu, Kong, & Sun,

2016) and pancreatic cancer (Huang et al., 2017). NEAT1 plays a considerable role in tumorigenesis including cell proliferation, apoptosis, migration and metastasis which suggest

NEAT1 act as an oncogene in many solid tumors. This makes NEAT1 potentially a good diagnostic and prognostic biomarker in these cancers.

1.3.1 NEAT1 function

Essential structural component in paraspeckle

As previously reported that NEAT1 is an essential architectural component of nuclear body paraspeckle (Clemson et al., 2009). Paraspeckles are mammal-specific membrane-less organelles that regulate gene expression. Nascent transcribed NEAT1 acts as ‘seed’ to recruit specific component proteins, including DBHS (Drosophila behavior/human splicing) family, to form paraspeckle (Zhang et al., 2011). The longer isoform NEAT1_2 is regarded as ‘architectural’ backbone of paraspeckle and plays an important role in paraspeckle integrity. Knocking down

NEAT1 causes the loss of paraspeckle (Souquere, Beauclair, Harper, Fox, & Pierron, 2010;

Sunwoo et al., 2009; Zhang et al., 2011).

NEAT1 mediates transcriptional regulation

NEAT1 modulates gene expression through paraspeckles which sequestrate mRNAs and transcription factors (TFs) (Chen & Carmichael, 2009). For instance, increasing in levels of

NEAT1 was observed upon immune stimuli. As NEAT1 level increases, paraspeckles elongate and require more SFPQ (Splicing factor, proline- and glutamine-rich). Thus, the relative protein level of SFPQ is decreased in the nucleoplasm. Meanwhile, SFPQ is also a transcription factor in nucleoplasm, which can either repress (such as an antivirus gene IL-8), or activate (adenosine deaminase (ADARB2)) gene expression in HeLa cells. The decreasing of SFPQ protein level in nucleoplasm alters the transcription level of target genes (Imamura et al., 2014). In addition,

NEAT1 was also found to be recruited to chromatin including several key prostate cancer genes and change the epigenetic marks to an active state, which promotes prostate tumorigenesis

(Chakravarty et al., 2014). Moreover, NEAT1 directly interacts with putative transcription factor

CDC5L to regulate potential downstream genes of CDC5L in prostate cancer (Li et al., 2018).

NEAT1 functions as competing endogenous RNA

Additionally, NEAT1 has been demonstrated to interact with miRNA and function as a competing endogenous RNA (ceRNA) in many cancers (Ding et al., 2017; Gong et al., 2016;

Huang et al., 2017; Sun et al., 2016). For instance, it sequesters and reduces the level of hsa- miR-377-3p and let-7e, which results in tumor progression through the miR-377–3p/E2F3 (E2F transcription factor 3) in non-small cell lung cancer and let-7e-NRAS signaling pathway in glioma stem cells, respectively (Gong et al., 2016; Zhang, J., Li, Dong, & Wu, 2017).

NEAT1 enhances pri-miRNA processing

Previous study has revealed that NEAT1 plays an important role in pri-miRNA processing.

NEAT1 has a pseudo pri-miRNA structure near 3′ end acting as anchor to attract Microprocessor complex to paraspeckle. Additionally, pri-miRNA are transported into paraspeckle by multiple

RNA binding proteins. NEAT1 scaffolds NONO–SFPQ heterodimer together with other RNA binding proteins bringing pri-miRNA proximity to Microprocessor, helps to generate mature miRNAs in HeLa cells (Jiang et al., 2017).

1.4 Aim of the dissertation

Long noncoding RNAs have recently been recognized as important regulators of gene expression and form complexes with multiple RNA-binding proteins (RBPs) (Engreitz, Ollikainen, &

Guttman, 2016; Quinn & Chang, 2016). Many lncRNAs take their functions with the help of proteins they bind to. Therefore, identifying RBPs interacting with lncRNA and lncRNA-protein complexes binding to DNA are important to understand their functions and working mechanisms. DNA-protein interactions include those between DNA and transcription factors or other regulatory proteins. LncRNA-chromatin interactions indicate possible genes that the

lncRNAs regulate. Therefore, identifying lncRNA-protein-DNA interactions provides tremendous insights of lncRNA functions and mechanisms.

NEAT1 has been reported to exhibit strong functional roles in transcriptional regulation by modulating chromosomal architecture and controlling the recruitment of transcription factors to target genes in prostate cancer (Chakravarty et al., 2014). Furthermore, NEAT1 can localize to active gene loci in breast cancer cell line (West et al., 2014), which indicates that the proteins with which NEAT1 interacts have a high affinity for actively transcribed loci, and also a sign of

NEAT1 being as part of the transcriptional machinery.

NEAT1 does not interact with the genome by directly hybridizing to DNA (West et al., 2014).

Therefore, to further elucidate the mechanism of how NEAT1 (and likely other lncRNAs) regulates gene transcription, it is essential to identify which proteins are involved in NEAT1 interactions with chromatin and subsequent transcriptional regulation. Of the current techniques used to address these interactions, capture hybridization analysis of RNA targets (CHART) is a technique used to localize the genome-wide binding sites (CHART-seq) as well as the interacting proteins of lncRNAs (CHART-MS) (Simon et al., 2011; Simon, 2013b; West et al., 2014).

CHART-seq can only provide information on RNA-DNA interactions but tell us little about the proteins responsible for these interactions. Some candidate proteins can be inferred using

CHART-MS (West et al., 2014) , but most of the proteins identified by CHART-MS are RNA- binding proteins rather than DNA-binding proteins, leaving a substantial information gap in the full interaction complex involving proteins, lncRNAs, and DNA.

In this study, we present a computational method called LncLink to fill this gap and identify the proteins involved in RNA-DNA interactions. This method applies an L1-norm support vector machine (SVM)-based approach, which automatically eliminates redundant or noisy features.

This computational method is based on the observation that a lncRNA binds to chromatin by the help of proteins; thus, the method aims to infer the most likely co-factors involved in lncRNA interactions with its chromatin targets to decipher lncRNA functions.

CHAPTER 2

DATA RESOURCES AND METHODS

2.1 Overall Design

To generalize our target findings and translate the results to elucidate how the NEAT1 lncRNA recognizes its target regions, we developed a computational framework (Figure 2-1), called

LncLink, for lncRNA genomic targets classification using transcription factor binding sites and genome-wide chromatin interactions as predictive features. For each genomic region, including targets and nontarget set, the normalized read coverage depth according to the uniquely mapped reads of each feature was calculated, which composed of input instance matrix for SVM model.

The input instance matrix was supplied to SVM to train the model by maximizing the margin between target set and nontarget set. The parameters, hyperparameter and soft margin parameter

C of the model were optimized by cross-validation. Once the model was obtained, the coefficients � ∈ ℝ were used to determine the relevance of each feature. NEAT1 targeting related proteins were identified with a feature selection procedure that ranks the proteins according to their discriminative power.

1228 NEAT1 targets overlap with DNaseI region and TSS+/-3k 767 NEAT1 targets

Refine NEAT1 targets by correlation with feature matrix of sense oligo targeted regions + Non-target regions filtered by DNaseI, TSS+/-3kb and GC content

Regions feature1... feature n

Target1 ... Nontarget

Split samples Select 20% of samples for testing

Train Test

Learning algorithm Test dataset training Predict class label L1-norm based feature selection

Hyper-parameter optimization 3-fold Cross-validation

average error rate over Performance evaluation of all CV runs independent test set

Rank of features

Figure 2-1. Computational workflow of feature selection in lncRNA genomic targets. A total of 1228 NEAT1 target regions in MCF7 cells were obtained from previous CHART-seq experiments. Those regions overlapping DNase I hypersensitive sites and within 3 kb of transcriptions start sites (TSS) were further selected. This matrix of putative transcriptional regulatory binding sites was correlated with the sense matrix to filter out potential non-specific interactions. Input matrix was split into a training set and an independent test set for algorithm learning and performance evaluation respectively.

2.2 Datasets

NEAT1 target regions

We used the CHART-seq experiments (GSE58444) performed in MCF7 cell line by West et al. to identify NEAT1 target regions (West et al., 2014). We first used bowtie (Langmead, Trapnell,

Pop, & Salzberg, 2009) to align reads and MACS (Zhang, Y. et al., 2008) to call the peaks with a p-value cutoff of 10. A total of 1228 putative NEAT1 genomic binding sites were identified.

We then filter the peaks further by applying three criteria: 1) overlapping open chromatin regions defined by ENCODE DNase I hypersensitive (DHS) regions, 2) overlapping genomic DNA regions around ± 3kb of the transcription start sites (TSSs) of genes; and 3) the feature distribution is not similar to regions pull-down by sense oligonucleotides of NEAT1, which are negative controls of CHART-seq experiment. The reason to perform the third criteria is that some of the genomic DNA regions pulled down by NEAT1 antisense oligonucleotides potentially represent direct oligonucleotide-chromatin interactions that are not mediated by

NEAT1; thus, removing the regions similar to negative controls would help eliminate non- specific results. We used the Pearson correlation coefficient to determine the similarities and filter those regions with correlations greater than 0. In total, we obtained 416 NEAT1 targets after the above filtering and defined them as target regions in promoter regions.

For each true target region, we randomly selected a genomic DNA region from the DHS regions in the same chromosome of the same length and GC content. This region should also overlap the

TSS of a gene but not overlap any NEAT1 antisense oligonucleotide pull-down regions. In total, we obtained 416 nontarget regions as a control set for support vector machine modeling. To avoid the potential bias of using randomly selected nontarget regions, we repeated this step 10 times to obtain 10 nontarget region sets for evaluating the modeling performance.

Feature sets

In a typical machine learning classification problem, data is represented as a matrix of instances

(also called samples) that is described by features (also called attributes) and a label that indicates its class, where positive instances with label of +1 and a set of negative instances with label of -1. We collected 74 features (Table A-1) for modeling how the lncRNA recognizes its chromatin targets. These features included 73 chromatin immunoprecipitation sequencing (ChIP- seq) datasets in MCF7 cell line, generated from the ENCODE project (ENCODE Project

Consortium, 2012) and GEO datasets, and one chromatin interaction profile for NEAT1 gene, obtained from a Hi-C study in MCF-7 cell line (Barutcu et al., 2015). For each genomic region, we calculated the normalized read coverage depth according to the uniquely mapped reads of each feature, which composed of instance matrix for SVM model.

Sense oligonucleotides

A control experiment to support the specificity of CHART is to perform sense oligonucleotides

(sense COs) experiment. Sense oligonucleotides have sequence identity to NEAT1 transcripts.

Other controls include scrambled oligonucleotides, or using oligonucleotides directed against an unrelated RNA. The aim of sense oligo control is to detect the artificial signal caused by the direct interaction between capture oligos and the DNA. Peaks of sense CO were also called from

MACS (Zhang et al., 2008) with p-value cutoff being 10. These peaks are sites where the capture oligos directly interact DNA. However, some sense-oligo profiles coincide with the peaks in NEAT1 CHART-seq. To ensure the specificity for lncRNA-associated DNA, we eliminated pull-down regions by antisense-oligo which show similar signal of sense CO pull- down regions (the third criterion of filtering listed above).

2.3 Methods

Motif analysis

We used HOMER (Heinz et al. 2010) to identify both known and de novo motifs enriched in the target regions comparing with control regions. All of the repeat elements in the sequences were masked, and we have searched motifs with lengths of 8, 10, 12, 14 and 16 bp. The significance of enrichment was calculated by using hypergeometric distribution. Known motif database was from HOMER human v6.0 for package hg19.

Modeling with L1-norm SVM.

Support Vector Machine (SVM) is a discriminative classifier defined by a separating hyperplane

(Boser, Guyon, & Vapnik, 1992; Cortes & Vapnik, 1995). It aims to maximize the distances between nearest data point in different classes and hyperplane, which determines the optimal hyperplane. The distance between the support vectors and the decision boundary called Margin.

Feature selection is the process of testing and removing irrelevant information from the original dataset. This may allow learning algorithms to operate faster and reduces overfitting by reducing the dimensionality of the data (Chandrashekar & Sahin, 2014; Janecek, Gansterer, Demel, &

Ecker, 2008). Embedded feature selection methods perform feature selection in the process of training. L1-norm SVM is an embedded feature selection method that performs feature weighting using lasso regularization with objective functions that minimize fitting errors and force the feature coefficients to be small or zero.

We utilized the L1-norm SVM to build a classifier that can distinguish the true target regions from the nontarget regions. The L1-norm SVM with lasso penalty (formula 1 below) have shown advantages when redundant features are present (Zhu, Rosset, Tibshirani, & Hastie, 2004) , because it can select features by shrinking the small coefficients of the hyperplane to exactly

zero before overfitting (Ng, A. Y., 2004) .

1 min ‖�‖ + � � ,, 2

subject to �� + � �(�) ≥ 1 − �, � ≥ 0, . (formula 1)

Here � ∈ ℝ , � = 1, … , �,and a vector � ∈ {1, −1} . ‖�‖ = ∑ |�|, where � is a kernel function that mapped training data into higher dimensional spaces. � are the non-negative slack variables that allow points to be misclassified on soft margin as well as the decision boundary.

Therefore, it describes the overlap between the two classes, and C is a regularization parameter that controls the amount of overlap which is optimized by cross-validation in practice, also known as soft margin parameter.

For any testing instance �, the decision function (predictor) is

�(�) = �� (��(�) + �) (formula 2)

We applied the L1-norm SVM using the scikit-learn library (Pedregosa et al., 2011) with the linear kernel for both feature ranking and classification. A total of 80% of the true target regions and non-target regions were used for training, and three-fold cross-validation was used to avoid overfitting. The final performance was evaluated using the rest 20% of the data. The search space for hyperparameter C ranged from 10 to 10, and C was determined via cross validation. Eventually, hyperparameter C was determined to be 0.06 to achieve the best performance. The L1-nom SVM was fitted to instance matrix and the model was obtained. The coefficients � ∈ ℝ were used to determine the relevance of each feature (Guyon, Weston,

Barnhill, & Vapnik, 2002). The positive coefficients mean the features contribute to positive set,

the negative coefficients mean the features contribute to negative set, and the irrelative features have been assigned no weights by L1-SVM. Therefore, we identified potential proteins involved in NEAT1 genomic targeting using features with positive coefficients by the process of prediction.

Performance evaluation

To construct the model and evaluate the performance of classification, we split the samples into two sets, the training set and the test set. Training set was used to build the learning model by 3- fold cross validation, while test set was used to evaluate the model. The test set was supplied to the model with labels hidden, and then predicted class labels were assigned by the model. With comparing the assigned labels and the corresponding original labels, we calculated the prediction accuracy. To more impartially evaluate the classification results, some other evaluation metrics are constructed:

Sensitivity (SN), specificity (SPC) and accuracy (ACC) were calculated with the following definitions to evaluate the modeling performance.

Sensitivity (SN) = TP / (TP + FN);

Specificity (SPC) = TN / (TN + FP);

Accuracy (ACC) = (TP + TN) / (TP + FN + TN + FP);

Where TP, FN, TN and FP stand for true positive, false negative, true negative and false positive respectively. We use accuracy as the sole index to determine the best model.

Receiver operating characteristic (ROC) curve is used to illustrates the prediction performance of a classifier at different threshold. The ROC curve is generated by plotting sensitivity versus 1- specificity as threshold is varied. Thus, each point on the ROC curve represents a pair of sensitivity and specificity for one selected threshold. The area under the ROC curve (AUC) is a

measure of how well a parameter can distinguish between two groups. Receiver operating characteristic (ROC) analysis and the area under the curve (AUC) were used to evaluate the overall performance of model.

Determine the best model

As was mentioned before, we obtained 10 nontarget region sets to avoid the potential bias of randomly selection. We then used the model of each set to apply to other 9 sets, and calculated the average ACC. We used ACC as the sole index to determine the best model.

Interpreting selected features from model.

We obtained protein-protein interactions (PPI) using the online STRING database version 10.5 with a confidence score > 0.4 to defined significant interactions (Szklarczyk et al., 2014) by inputting the NEAT1-interacting proteins identified previously by CHART-MS (West et al.,

2014) and selected positive features. The average connectivity between these two protein sets was calculated. To identify the selected features are significantly connected to CHART-MS proteins, the average connectivity also was calculated between selected positive features and randomly selected proteins with the same number and degree distribution as those of the

CHART-MS proteins, as well as randomly selected DNA-binding proteins with the same number to positive features and with same degree distribution to positive features and CHART-MS proteins. The same significance tests were then performed on both the negative weighted features and zero-weighted features. Each random set was generated by 10 times in order to calculate p- value with Z-transformation. We imported the PPI interaction data into the Cytoscape software

(Shannon et al., 2003) to create subnetwork of proteins related to the features with positive and negative coefficients in our model (Figure 3-6a).

CHAPTER 3

RESULTS

In this Chapter, we confirmed that proteins mediate NEAT1 to its genomic target sites in promoter region. We used SVM to predict potential transcription factors/cofactors involved in genomic targeting by lncRNA NEAT1. Our results show that DNA-binding proteins may directly link NEAT1 to its DNA targets and can also interact with ribonucleoprotein complexes to assist lncRNA targeting. Several of the 27 top-ranked DNA-binding proteins predicted by our method are supported by existing literatures as having associations with NEAT1. The findings revealed by our work provide novel insights into our understanding of critical players targeted by lncRNA.

3.1 Genome-wide identification of NEAT1 target sites

We used previous CHART-seq data from MCF7 cells (West et al., 2014) to identify NEAT1 target DNA sites and obtained 1228 putative NEAT1 genomic binding sites (Figure 3-1 a) prior to applying our filtering criteria. Of these 1228 sites, 95% are located in open chromatin regions that were identified by overlapping with ENCODE DHS sites. The NEAT1 binding sites exhibit a significantly higher GC content (58%) comparing with the randomly sampled regions (39%, p- value < 2.2e-16) (Figure 3-1 b). When we selected the negative set, we balanced the GC content as the same distribution in the positive set (Figure 3-1 c).

3.2 Motif analysis

We identified 16 known DNA motifs and 2 de novo DNA motifs enriched in NEAT1 target regions by software HOMER (Table A-2). The motifs do not match the sequence of NEAT1

transcript by BLAST analysis. Therefore, DNA motif analysis shows that there is no direct

RNA-DNA hybridization between NEAT1 and DNA at target regions identified by CHART-seq. a) b)

chr1 0.8

chr2

chr3 0.6 chr4

chr5 0.4

chr6

chr7 GC content

chr8 0.2

chr9

chr10 0.0 chr11 c) random targets chr12

0.8

chr13

chr14 0.7 chr15

chr16 0.6 chr17

chr19 GC content 0.5 chr20

chr21 0.4 chr22

chrX negative positive 0 Mb 50 Mb 100 Mb 150 Mb 200 Mb 250 Mb

Figure 3-1. NEAT1 genomic targets. a) Chromosomal distribution of 1228 binding sites of the lncRNA NEAT1 determined by CHART- seq. b) GC contents of NEAT1 target regions and randomly selected regions. c) GC contents of NEAT1 target regions (positive set) and negative set. The GC content was balanced in the negative set as the same distribution as in the positive set.

3.3 NEAT1 genomic targets enriched in open promoter regions

By metagene analysis, NEAT1 showed enrichment both at the 5′ and 3′ ends of genes. In these genes, NEAT1 CHART-seq showed a sharp peak at the transcription start site (TSS) and a broader peak around transcription end site (TES) with slightly enrichment. After inhibiting transcription elongation by treating cells with flavopiridol, the peak around TES of genes significantly reduced, while the enrichment was partially maintained at the TSS of genes (Figure

3-2). This suggests that NEAT1 may interact with these sites around TSS through a mechanism that does not involve interactions with nascent RNAs. Therefore, we hypothesized that NEAT1 interacts with chromatin through proteins in the promoter regions.

To generate a homogenous NEAT1 target set, we focused on the targets located in open chromatin regions near TSSs, by applying our first (overlapping DHS regions) and second criteria (within 3 kb of a TSS) described in the Methods. 767 NEAT1 genomic targets were found within 3kb of TSSs. The regions with feature protein coverages were composed of an instance matrix, called pre-positive matrix.

0.7 Untreated 0.6 + favopiridol 0.5 0.4 0.3 (CHART-seq vs Input) vs (CHART-seq

2 0.2

log 0.1 0.0 −0.1 -3.0 TSS TES 3.0Kb

Figure 3-2. Distribution of NEAT1 CHART-seq signals. Metagene plots for CHART-seq signals across gene body (blue line) and CHART-seq signals from cells treated with flavopiridol (green line).

To ensure that the lncRNA-associated DNA sites are specifically associated with NEAT1, we eliminated the antisense oligos pull-down regions with the signals similar to those obtained by sense oligonucleotides CHART-seq. Sense oligonucleotides are control oligonucleotides, which are from part of the same strand as lncRNA, but may bind to double-stranded genomic DNA directly. A total of 172 peaks were identified in sense oligonucleotides biological replicates. For

each genomic region, we calculated the normalized read coverage depth according to the uniquely mapped reads of each feature. The regions with feature protein coverages are composed of a sense matrix (Figure 3-3 a). The Pearson’s correlation coefficients between the pre-positive matrix and sense matrix was calculated (Figure 3-3 b). The rational of doing so is to filer potential non-specific NEAT1 targets. We retained only the regions that showed negative or no correlation (corr ≤ 0) with the regions isolated by the sense oligonucleotides, which is third criteria described in the Methods. A total of 416 regions were remained as the positive set.

3.4 Prediction of NEAT1 target sites using genomic features

The sequence analysis implied that NEAT1 contacts with the transcription process related proteins that are responsible for genomic localization. Thus, we constructed a computational framework to predict the transcription process related proteins involved in lncRNA NEAT1 targeting using transcription factor binding profiles and genome-wide chromatin interactions as predictive features. The instance matrix was supplied to a linear SVM to train the classifier aiming to distinguish targets versus nontargets, and the feature coefficients were obtained during the learning process.

a) b)

40 Frequency

−0.2 −0.1 0.0 0.1 0.2 Pearson's correlation Figure 3-3. Heatmap of matrices and matrices correlation distribution. a) Heat map image of pre-positive matrix (upper panel) and sense matrix (lower panel). The color of each cell represents the protein ChIP-seq coverage of a target region (row) for a feature (column). b) Distribution of Pearson’s correlation coefficients between the positive matrix and sense matrix.

The performance of the model was measured by three-fold cross-validation using a number of indicators (ACC, SN, SP and AUC). To ensure that the randomly selected nontarget regions were not biased, we prepared 10 different sets of non-target regions. The stable performances of

10 different sets are shown in Figure 3-4 a). The selection of nontarget region sets was not biased of random sampling. We chose the best model with the best ACC for all 10 independent test sets.

Our model was successful in predicting DNA binding proteins involved in lncRNA and target chromatin interactions, and the best performance was achieved on training set and independent test set, respectively (Table 3-1). The results showed that the model achieved a high AUC of

0.93, as well as a test ACC of 87.3%, SN of 86.7% and SP of 87.8% (Figure 3-4 b and Table 3-

1).

a) b)

1.0

0.9

category 0.8 Test Training Performance

0.7

0.6

ACC SN SP

c) Figure 3-4. Model prediction performance. From the best a) Performances of the L1-norm SVMACC model for 10 SN replicates. b) Receiver SP operating characteristics (ROC)model curve of the best model. The area under the curve (AUC) was 0.93.

Training 0.929 0.910 0.948 Table 3-1. Performance of training set and test set from the best model. ACC refers to accuracy, SN refers to sensitivity and SPC refers to specificity. Test 0.861 0.885 0.839 From the best model ACC SN SPC

Training 90.9 88.4 93.3

Testing 87.3 86.7 87.8

3.5 Prioritization of co-factor candidates using SVM model output

In addition to targets training, the model is also useful in indicating transcription factors/cofactors that facilitate NEAT1 targeting. To this end, we applied L1-norm constrains in the training procedure (see Methods), which could effectively select features that contribute significantly to the model. The procedure identified 28 positively coefficient features, 33 unweighted features and 13 negatively coefficient features (Figure 3-5 and Table 3-2).

Figure 3-5. Feature weights ranked by L1-SVM. Blue bars and red bars represent negatively and positively weighted features, respectively.

27 proteins and Hi-C profile have the positive coefficients in the best model, which indicates the potential factors involved in NEAT1 targeting (Table 3-2). Several of these proteins have been reported to plays roles in NEAT1 mechanisms. Among these proteins, SFPQ (rank 4, see Table

3-2) and NONO (rank 18) are core paraspeckle proteins that directly bind to NEAT1 to form, and maintain paraspeckle integrity (Fox, Bond, & Lamond, 2005; Yamazaki et al., 2018).

Additionally, SFPQ and NONO act as both a transcriptional corepressors and coactivators

(Chakravarty et al., 2014; Imamura et al., 2014), implying that these proteins and lncRNA

NEAT1 physically interact with certain NEAT1 chromatin targets.

Among the other proteins with positive coefficients, zinc finger protein (ZNF) 207 (rank 5) was identified previously as a NEAT1-interacting protein by CHART-MS (West et al., 2014).

Additionally, it is a DNA-binding protein with ENCODE ChIP-seq data showed that it binds to

certain NEAT1 genomic targets, which renders ZNF207 a highly likely cofactor involved in the interaction between NEAT1 and its genomic target sites.

As reported previously, ZNF24 (rank 25) and RNA polymerase II (rank 8) are both paraspeckle proteins, and ZNF24 contributes to paraspeckle integrity (Fong et al., 2013; Xie, Martin, Guillot,

Bentley, & Pombo, 2006). ZKSCAN1 (rank 8) physically interacts with ZNF24 in paraspeckle

(Fong et al., 2013). Moreover, these paraspeckle proteins can also be found outside of paraspeckles as transcription factors (Jia, Huang, Bischoff, & Moses, 2014; Nie, Li, Li, Mu, &

Wang, 2016), suggesting that they may interact with NEAT1 directly or indirectly.

These findings suggest that lncRNA localization may not depend on a single factor but rather multiple factors. In addition, many proteins related to the features with positive coefficients have no report to mediate NEAT1 targeting in the literature. Our results may provide novel candidate proteins involved in NEAT1 functions to be evaluated.

Table 3-2. List of feature weights identified by L1-norm SVM.

Rank Feature Weight Feature Weight Feature Weight

1 MAZ 0.53 ARID3A 0 ZBTB40 -0.001

2 POLR2A 0.253 HSF1 0 PML -0.003

3 HDGF 0.241 HDAC2 0 SIX4 -0.006

4 SFPQ 0.229 ZBTB33 0 EP300 -0.008

5 ZNF207 0.197 MNT 0 P53 -0.018

6 DPF2 0.167 JUND 0 TCF7L2 -0.022

7 ZNF217 0.166 ZNF592 0 TCF12 -0.022

Rank Feature Weight Feature Weight Feature Weight

8 ZKSCAN1 0.15 RCOR1 0 FOXK2 -0.025

9 MYC 0.121 FOXM1 0 HA-E2F1 -0.026

10 GABPA 0.117 ELF1 0 MLLT1 -0.035

11 ZBTB11 0.087 TEAD4 0 SIN3A -0.073

12 MAX 0.084 HCFC1 0 GTF2F1 -0.078

13 PKNOX1 0.079 JUN 0 EGR1 -0.088

14 SRF 0.079 SREBF1 0

15 CHD1 0.074 H3F3A 0

16 NR2F2 0.054 FOSL2 0

17 YBX1 0.039 MTA1 0

18 NONO 0.036 MTA2 0

19 HiC 0.031 ESRRA 0

20 AGO1 0.031 FOS 0

21 MTA3 0.031 SP1 0

22 NBN 0.023 ZNF687 0

23 CEBPB 0.016 REST 0

24 H2AFZ 0.014 CUX1 0

25 ZNF24 0.01 GATA3 0

26 RAD21 0.008 ESR1 0

27 ZHX2 0.007 RFX1 0

Rank Feature Weight Feature Weight Feature Weight

28 CTCF 0.006 FOXA1 0

ELK1 0

MBD2 0

GATAD2B 0

CTBP1 0

NRF1 0

3.6 PPI of selected proteins and CHART-MS proteins

To further confirm that the selected features are proteins involved in NEAT1 targeting, we compared the average connectivity of protein-protein subnetworks between LncLink identified signature proteins and CHART-MS proteins (Figure 3-6). The average connectivity is defined connectivity normalized by the number of proteins in each feature set. We obtained protein- protein interactions (PPI) using the STRING database version 10.5 by inputting the NEAT1- interacting proteins identified previously by CHART-MS (West et al., 2014) and selected positive features. The average connectivity between these two protein sets was calculated. To identify the selected features are significantly related to CHART-MS proteins, the average connectivity was calculated between selected positive features and randomly selected proteins with the same number and degree distribution as those of the CHART-MS proteins, as well as randomly selected DNA-binding proteins with the same number to positive features and with same degree distribution to positive features and CHART-MS proteins. The same significance

tests were then performed on both the negative weighted features and zero-weighted features.

The p-value was calculated by Z-score for 10 repeats.

Clearly, positive weighted features had a significantly higher average connectivity with CHART-

MS proteins than with the same number randomly selected proteins with controlled degrees (Z- test p-value < 10). Additionally, CHART-MS proteins had a significantly higher average connectivity with the positively weighted features than with randomly selected transcription factors with controlled degrees. In addition, the positive weighted features had a higher average connectivity with CHART-MS proteins than with negative weighted features and features with no weight (Figure 3-6 b). All of these results indicated that our constructed model was effective in identifying novel binding proteins of lncRNAs, and the method we propose is reliable. a) b)

**** SFPQ 6 YBX1 ZNF217 **** SRFS NONO RAD21 RCOR1 MYC MAZ SP1 AGO1 HDAC2 ZNF24 NBN CPNE2 DPF2 CSTF3 ZC3H14 RALY ERH MBD2 PKNOX1 ZNF687 ZNF207 HDGF AZGP1 CUX1 DDX42 RAB6C FAM98B H2AFZ CHD1 ZKSCAN1 FXR2 GABPAA AKAP8 SPEG CTCF PSMA3 ADAR MAX PML PSPC1 SETD9 POLR2A 4 EGR1 GATAD2B MAF NUFIP2 FOXK2 RPA1 TARDBP GALK1 TCF7L2 EIF3G FOXM1 FOS LSM12 RNF200 API5 SREBF1 JUND CDC6 U2AF1 TEAD4 SIN3A EWSR1 YTHDF1 RBM17 ZNF638 COL1A2 category SIX4 TP53 LYZ RBM14 NCOA5 NRF1 GTF2F1 TCF12 SART1 CPNE3 EP300 PSMB1 negative_weight MLLT1 E2F1 CDC5L PRSS1 REST ACIN1 RAB11A EIF3H NCBP1 RFX1 ZBTB40 positive_weight MYH1 FAM98A DDX5 EIF3I SRRT PIP DYSF COG22 SAFB zero_weight PURA PRMT6 BCLAF1 connectivity RAVER1 SRSF5 RBM45 FAM120A PIN4 YAF2 ALKBH5 RBM4 PTBP2 PMP22 RBM12B 2 HNRNPK ESRP2 C17orf85 CIRBP LSM14A FIP1L1 RBM10 HNRNPH2 LSM2 HNRNPUL2 RBM4B HNRNPF LSM3 EIF4B

GATA3

CEBPB ZBTB11 ARID3A 0 HSF1 ELF1 CTBP1 ELK1 ZBTB333 ZHX2 ZNF592 MTA2 MTA3 HCFC1 ESR1 NR2F2 ESRRA H3F3A JUN MTA1M FOSL2 FOXA1 random zero_MS

positive_MSposrand_MS zerorand_MSzero_randomnegative_MSnegrand_MS

positive_ negative_random Figure 3-6. Protein-protein interactions of selected proteins and CHART-MS proteins. a) Protein-protein interaction sub-networks from STRING database. The connectivity between positively weighted features (shown in green nodes) and proteins identified by CHART-MS (shown in grey nodes), negatively weighted features (shown in pink nodes) and proteins identified by CHART-MS and zero-weighted features (shown blue nodes) and proteins identified by CHART-MS are shown, respectively. b) The average connectivity of protein-protein interaction sub-networks obtained from the STRING database. **** p-value < 10-4 by Z-test.

CHAPTER 4

CONCLUSION AND DISCUSSION

4.1 Conclusion

The prediction of lncRNAs-protein-DNA interactions is essential for uncovering the functions of lncRNAs. Here we introduced a computational method, LncLink, to infer lncRNA function by identifying a set of the most probable proteins involved in lncRNA and the genomic regions that they control. This work integrated multiple omics data in human lncRNA studies, offering novel and more specific lncRNA function indication.

For the purpose of identification genome-wide NEAT1-protein-DNA interactions, we first identified lncRNA NEAT1 chromatin targets using CHART-seq data in cell line MCF-7. Also, we collected the profiles of DNA-binding proteins in the same cell line. Then we used support vector machine (SVM) based method to predict proteins or cofactors help lncRNA binding. In this way, we linked lncRNA, protein, and DNA together to further predict the regulator function of lncRNA.

The chromatin targets of NEAT1 are heterogeneous with the most of them being located in the open chromatin region. While some target regions overlap with promoter regions, there are also regions located in the gene bodies and intergenic regions. In order to get homogenous target set, we focused on DNaseI hypersensitivity and TSS +/- 3kb overlapped regions followed by Sense matrix correlation filtering. The negative set was a set of randomly selected genomic region with control of same GC content and filtering criterion as target set. We collected DNA-binding proteins as feature set. We use an embedded feature selection method for feature selection and

prediction. Using L1-norm SVM method, we identified a subset of non-redundant, discriminative, predictive features. Moreover, some of them are supported by literatures.

We demonstrated that LncLink is capable of identifying candidate proteins for lncRNA targeting and targeting regulation. The higher connectivities of identified proteins in the protein-protein interaction network than random controls support them as potential cofactors. All of these results indicated that our constructed model was effective in identifying novel binding cofactors of lncRNAs, which could provide new insight of the lncRNA-specific function.

4.2 Discussion

Although the training performance can achieve 90.9% and the testing performance can achieve

87.3% using 74 features, an in silico method was used to test whether the performance can be further improved if more features are added. A computational method called Protein interaction quantitation (PIQ) (Sherwood et al., 2014) was used to predict the transcription factor binding sites by integrating cell line specific genome-wide DNase I hypersensitivity profiles and transcription factor binding motifs from JASPAR database (Khan et al., 2017). 579 transcription factor binding profiles were predicted by PIQ and added with original 74 features to predict

NEAT1 targets. The performances were shown in Table 4-1.

Table 4-1. Performance of training set and test set using 579 predicted transcription factor binding profiles together with original 74 transcription factor binding profiles as features. ACC refers to accuracy, SN refers to sensitivity and SPC refers to specificity. From the best model ACC SN SPC

Training 89.8 86.7 93.0

Testing 80.6 75.3 85.7

The performance was not improved, this indicates that either the original 74 features have caught the most information for lncRNA targeting or the in silico method is not accurate and the predicted protein-DNA binding profiles contain noises.

In addition, although 27 proteins were found involved in NEAT1 targeting by our model, whether NEAT1-promoter interactions mediated by those proteins can regulate gene expression is still unclear. To deeply understand how NEAT1 regulates transcription together with transcription factors, we need to measure the changes of the targets genes with the perturbations of NEAT1 and transcription factors.

Furthermore, it has been reported that the overexpression of short isoform NEAT1_1, instead of

NEAT1_2, promoted the tumorigenicity of prostate cancer (Chakravarty et al., 2014). In another study, the longer isoform NEAT1_2 has been identified to be a promising predictor of chemotherapeutic response in ovarian cancer. Therefore, the two isoforms of NEAT1 might have different functions in cells. However, the CHART-seq data cannot distinguish the genomic targets for individual isoform since the capture oligo designed in the overlap region of two isoforms. To this end, studies that distinguish the functions of different NEAT1 isoforms are needed in the future.

APPENDIX

APPENDIX TABLES

Table A-1. Feature list used for predicting lncRNA targets. Available public dataset of ChIP-seq and Hi-C in MCF7 cell line from ENCODE database and GEO database. Feature Source Description Argonaute 1, RISC Catalytic Component. Argonaute family of AGO1 ChIP-seq signal GSM1369944 proteins which associate with small RNAs and have important roles in RNA interference (RNAi) and RNA silencing. ARID3A ChIP-seq ENCSR231ENE AT-Rich Interaction Domain 3A. Transcription factor signal CCAAT/Enhancer Binding Protein Beta. Transcription factor that CEBPB ChIP-seq signal ENCSR000BSR contains a basic leucine zipper (bZIP) domain. Chromodomain helicase DNA binding protein 1; ATP-dependent chromatin-remodeling factor which functions as substrate CHD1 ChIP-seq signal ENCSR360JOC recognition component of the transcription regulatory histone acetylation (HAT) complex SAGA. C-terminal binding protein 1; Corepressor targeting diverse CTBP1 ChIP-seq signal ENCSR636EYA transcription regulators such as GLIS2 or BCL6. CCCTC-binding factor (zinc finger protein); Involved in transcriptional regulation by binding to chromatin insulators and CTCF ChIP-seq signal ENCSR000DWH preventing interaction between promoter and nearby enhancers and silencers. Cut-like homeobox 1; Probably has a broad role in mammalian development as a repressor of developmentally regulated gene CUX1 ChIP-seq signal ENCSR017CEO expression. May act by preventing binding of positively-activing CCAAT factors to promoters. Double PHD Fingers 2; Functions as a transcription factor which DPF2 ChIP-seq signal ENCSR234VCE is necessary for the apoptotic response following deprivation of survival factors. Early Growth Response 1; Nuclear protein and functions as a EGR1 ChIP-seq signal ENCSR000BUX transcriptional regulator. The products of target genes it activates are required for differentitation and mitogenesis. E74 Like ETS Transcription Factor 1; The encoded protein is primarily expressed in lymphoid cells and acts as both an ELF1 ChIP-seq signal ENCSR000BSS enhancer and a repressor to regulate transcription of various genes. ELK1, ETS Transcription Factor; Transcription factor that binds ELK1 ChIP-seq signal ENCSR382WLL to purine-rich DNA sequences. E1A Binding Protein P300; It functions as histone acetyltransferase that regulates transcription via chromatin EP300 ChIP-seq signal ENCSR000BTR remodeling and is important in the processes of cell proliferation and differentiation. Estrogen Receptor 1, a ligand-activated transcription factor ER� ChIP-seq signal GSE41561 composed of several domains important for hormone binding, DNA binding, and activation of transcription. ESRRA ChIP-seq Estrogen Related Receptor Alpha, a nuclear receptor that is ENCSR954WVZ signal closely related to the estrogen receptor. Fos Proto-Oncogene, AP-1 Transcription Factor Subunit; Nuclear FOS ChIP-seq signal ENCSR569XNP phosphoprotein which forms a tight but non-covalently linked complex with the JUN/AP-1 transcription factor.

Feature Source Description FOS Like 2, AP-1 Transcription Factor Subunit; Controls FOSL2 ChIP-seq signal ENCSR000BUI osteoclast survival and size. Forkhead Box A1; Transcription factor that is involved in FOXA1 ChIP-seq embryonic development, establishment of tissue-specific gene ENCSR126YEB signal expression and regulation of gene expression in differentiated tissues. FOXK2 ChIP-seq Forkhead Box K2; Transcriptional regulator that recognizes the ENCSR465VLK signal core sequence 5-TAAACA-3. Forkhead Box M1; Transcriptional factor regulating the FOXM1 ChIP-seq ENCSR000BUJ expression of cell cycle genes essential for DNA replication and signal mitosis. GA Binding Protein Transcription Factor Alpha Subunit; GABPA ChIP-seq ENCSR000BUK Transcription factor capable of interacting with purine rich signal repeats (GA repeats). GATA3 ChIP-seq ENCSR423RTK GATA Binding Protein 3; Transcriptional activator which binds signal to the enhancer of the T-cell receptor alpha and delta genes. GATAD2B ChIP-seq GATA Zinc Finger Domain Containing 2B; Transcriptional ENCSR389BLX signal repressor. General Transcription Factor IIF Subunit 1; TFIIF is a general GTF2F1 ChIP-seq transcription initiation factor that binds to RNA polymerase II ENCSR557JTZ signal and helps to recruit it to the initiation complex in collaboration with TFIIB. It promotes transcription elongation. H2A Histone Family Member Z; Variant histone H2A which H2AFZ ChIP-seq signal ENCSR057MWG replaces conventional H2A in a subset of nucleosomes. H3 Histone Family Member 3A; Variant histone H3 which H3F3A ChIP-seq signal ENCSR197XDQ replaces conventional H3 in a wide range of nucleosomes in active genes. HA-E2F1 ChIP-seq ENCSR000EWX E2F Transcription Factor 1; Transcription activator. signal HCFC1 ChIP-seq signal ENCSR697CUP Host Cell Factor C1; Involved in control of the cell cycle. Histone Deacetylase 2; They catalyze acetyl group removal from HDAC2 ChIP-seq ENCSR000BTP lysine residues in histones and non-histone proteins, causing signal transcriptional repression. HDGF ChIP-seq signal ENCSR200CUA Heparin Binding Growth Factor; transcriptional repressor. HSF1 ChIP-seq signal ENCSR062HDL Heat Shock Transcription Factor 1. Jun Proto-Oncogene, AP-1 Transcription Factor Subunit; JUN ChIP-seq signal ENCSR176EXN Transcription factor that recognizes and binds to the enhancer heptamer motif 5-TGA[CG]TCA-3. JunD Proto-Oncogene, AP-1 Transcription Factor Subunit; JunD ChIP-seq signal ENCSR000BSU Transcription factor binding AP-1 sites. MAX ChIP-seq signal ENCSR000BUL MYC Associated Factor X; Transcription regulator. MYC Associated Zinc Finger Protein; transcription factor with MAZ ChIP-seq signal ENCSR288IJC dual roles in transcription initiation and termination. Methyl-CpG Binding Domain Protein 2; Binds CpG islands in MBD2 ChIP-seq signal ENCSR940MHE promoters where the DNA is methylated at position 5 of cytosine within CpG dinucleotides. MLLT1 ChIP-seq MLLT1, Super Elongation Complex Subunit; Component of the ENCSR427BBI signal super elongation complex (SEC). MAX Network Transcriptional Repressor; Binds DNA as a MNT ChIP-seq signal ENCSR663ZZZ heterodimer with MAX and represses transcription. Metastasis Associated 1; Transcriptional coregulator which can MTA1 ChIP-seq signal ENCSR348JOJ act as both a transcriptional corepressor and coactivator. Metastasis Associated 1 Family Member 2; Involved in the MTA2 ChIP-seq signal ENCSR551ZDZ regulation of gene expression as repressor and activator.

Feature Source Description MTA3 ChIP-seq signal ENCSR391KQC Metastasis Associated 1 Family Member 3. MYC Proto-Oncogene, BHLH Transcription Factor; Transcription factor that binds DNA in a non-specific manner, yet MYC ChIP-seq signal ENCSR000DMQ also specifically recognizes the core sequence 5-CAC[GA]TG-3. Activates the transcription of growth-related genes. Nibrin; Component of the MRE11-RAD50-NBN (MRN NBN ChIP-seq signal ENCSR591EBL complex) which plays a critical role in the cellular response to DNA damage and the maintenance of chromosome integrity. Non-POU Domain Containing Octamer Binding; DNA- and RNA NONO ChIP-seq signal ENCSR912NMR binding protein which involved in pre-mRNA splicing, probably as a heterodimer with SFPQ. Nuclear Receptor Subfamily 2 Group F Member 2; Ligand- NR2F2 ChIP-seq signal ENCSR000BUY activated transcription factor. NRF1 ChIP-seq signal ENCSR135ANT Nuclear Respiratory Factor 1; Transcription factor. PKNOX1 ChIP-seq ENCSR986XYK PBX/Knotted 1 Homeobox 1. signal PML ChIP-seq signal ENCSR000BUZ Promyelocytic Leukemia. RNA Polymerase II Subunit A; DNA-dependent RNA POLR2A ChIP-seq ENCSR000DMT polymerase catalyzes the transcription of DNA into RNA using signal the four ribonucleoside triphosphates as substrates. Splicing Factor Proline And Glutamine Rich; DNA- and RNA SFPQ/PSF ChIP-seq binding protein which is an essential GSE58444 signal pre-mRNA splicing factor required early in spliceosome formation. RAD21 Cohesin Complex Component; Cleavable component of RAD21 ChIP-seq signal ENCSR000BTQ the cohesin complex, involved in chromosome cohesion during cell cycle, in DNA repair, and in apoptosis. REST Corepressor 1; Essential component of the BHC complex, RCOR1 ChIP-seq ENCSR391JII a corepressor complex that represses transcription of neuron- signal specific genes in non-neuronal cells. RE1 Silencing Transcription Factor; Transcriptional repressor REST ChIP-seq signal ENCSR000BSP which binds neuron-restrictive silencer element (NRSE) and represses neuronal gene transcription in non-neuronal cells. Regulatory Factor X1; Regulatory factor essential for MHC class RFX1 ChIP-seq signal ENCSR066TET II genes expression. SIN3 Transcription Regulator Family Member A; Acts as a SIN3A ChIP-seq signal ENCSR468LUO transcriptional repressor. SIX Homeobox 4; Transcriptional regulator which can act as both SIX4 ChIP-seq signal ENCSR279IEM a transcriptional repressor and activator. Sp1 Transcription Factor; Transcription factor that can activate SP1 ChIP-seq signal ENCSR729LGA or repress transcription in response to physiological and pathological stimuli. SREBF1 ChIP-seq Sterol Regulatory Element Binding Transcription Factor 1; ENCSR197DJH signal Transcriptional activator required for lipid homeostasis. Serum Response Factor; Transcription factor that binds to the SRF ChIP-seq signal ENCSR000BVA serum response element (SRE). TCF12 ChIP-seq signal ENCSR000BUN Transcription Factor 12; Transcriptional regulator. Transcription Factor 7 Like 2; Participates in the Wnt signaling TCF7L2 ChIP-seq ENCSR000EWT pathway and modulates MYC expression by binding to its signal promoter in a sequence-specific manner. TEA Domain Transcription Factor 4; Transcription factor which TEAD4 ChIP-seq signal ENCSR000BUO plays a key role in the Hippo signaling pathway. Y-Box Binding Protein 1; Mediates pre-mRNA alternative YBX1 ChIP-seq signal ENCSR864KNH splicing regulation.

Feature Source Description ZBTB11 ChIP-seq Zinc Finger And BTB Domain Containing 11; Involved in ENCSR155VDK signal transcriptional regulation. ZBTB33 ChIP-seq Zinc Finger And BTB Domain Containing 33; Transcriptional ENCSR231YFE signal regulator with bimodal DNA-binding specificity. ZBTB40 ChIP-seq Zinc Finger And BTB Domain Containing 40; Involved in ENCSR318LVG signal transcriptional regulation. Zinc Fingers And Homeoboxes 2; Acts as a transcriptional ZHX2 ChIP-seq signal ENCSR876UYH repressor. ZKSCAN1 ChIP-seq Zinc Finger With KRAB And SCAN Domains 1; Involved in ENCSR449UFF signal transcriptional regulation. ZNF207 ChIP-seq Zinc Finger Protein 207; Kinetochore- and microtubule-binding ENCSR096KWU signal protein that plays a key role in spindle assembly[23, 24]. ZNF217 ChIP-seq Zinc Finger Protein 217; Binds to the promoters of target genes ENCSR912ATU signal and functions as repressor. ZNF24 ChIP-seq signal ENCSR503VTG Zinc Finger Protein 24. ZNF592 ChIP-seq Zinc Finger Protein 592; Involved in transcriptional regulation. signal ZNF687 ChIP-seq ENCSR899BKM Zinc Finger Protein 687; Involved in transcriptional regulation. signal Regions which has physical interaction with NEAT1 gene locus Hi-C GSE66733 identified by Hi-C.

Table A-2. Sequence motifs found within NEAT1 target regions. Sequence motifs found in known and de novo enrichment analysis using HOMER.Corresponding name, p-value and q-value are indicated for each motif and, when applicable, transcription factors associated with similar motifs. Only motifs discovered using HOMER with an q-value less than 0.05 are displayed. p- q- Motif Name value value Known motifs 1e- NFY(CCAAT)/Promoter/Homer 0 10

1e- CRE(bZIP)/Promoter/Homer 0 10

KLF14(Zf)/HEK293- KLF14.GFP-ChIP- 1e-6 0 Seq(GSE58341)/Homer E-box(bHLH)/Promoter/Homer 1e-5 0.0004

Sp1(Zf)/Promoter/Homer 1e-4 0.0008

Usf2(bHLH)/C2C12-Usf2-ChIP- 1e-4 0.0009 Seq(GSE36030)/Homer

Atf1(bZIP)/K562-ATF1-ChIP- 1e-4 0.0010 Seq(GSE31477)/Homer

Atf7(bZIP)/3T3L1-Atf7-ChIP- 1e-4 0.0024 Seq(GSE56872)/Homer

Pbx3(Homeobox)/GM12878- PBX3-ChIP- 1e-4 0.0033 Seq(GSE32465)/Homer Atf2(bZIP)/3T3L1-Atf2-ChIP- 1e-3 0.0069 Seq(GSE56872)/Homer

GFY-Staf(?,Zf)/Promoter/Homer 1e-3 0.0112

Maz(Zf)/HepG2-Maz-ChIP- 1e-3 0.0132 Seq(GSE31477)/Homer

Pit1+1bp(Homeobox)/GCrat-Pit1- 1e-3 0.0132 ChIP-Seq(GSE58009)/Homer

Klf9(Zf)/GBM-Klf9-ChIP- 1e-2 0.0231 Seq(GSE62211) /Homer

c-Jun-CRE(bZIP)/K562-cJun- 1e-2 0.0269 ChIP-Seq(GSE31477)/Homer

GFY(?)/Promoter/Homer 1e-2 0.0444 de novo Best match: 1e- 0.001 EGR1/MA0162.2/Jaspar(0.818) 19

1e- Best match: 17 0.001 ZNF263/MA0528.1/Jaspar(0.672)

REFERENCES

Barutcu, A. R., Lajoie, B. R., McCord, R. P., Tye, C. E., Hong, D., Messier, T. L., . . . Stein, J. L. (2015). Chromatin interaction analysis reveals changes in small chromosome and telomere clustering between epithelial and breast cancer cells. Genome Biology, 16(1), 214.

Blencowe, B. J. (2006). Alternative splicing: New insights from global analyses. Cell, 126(1), 37-47.

Boser, B. E., Guyon, I. M., & Vapnik, V. N. (1992). A training algorithm for optimal margin classifiers. Paper presented at the Proceedings of the Fifth Annual Workshop on Computational Learning Theory, 144-152.

Brown, J. A., Valenstein, M. L., Yario, T. A., Tycowski, K. T., & Steitz, J. A. (2012). Formation of triple-helical structures by the 3′-end sequences of MALAT1 and MENβ noncoding RNAs. Proceedings of the National Academy of Sciences, 109(47), 19202-19207.

Buske, F. A., Bauer, D. C., Mattick, J. S., & Bailey, T. L. (2012). Triplexator: Detecting nucleic acid triple helices in genomic and transcriptomic data. Genome Research,

Castello, A., Fischer, B., Eichelbaum, K., Horos, R., Beckmann, B. M., Strein, C., . . . Steinmetz, L. M. (2012). Insights into RNA biology from an atlas of mammalian mRNA-binding proteins. Cell, 149(6), 1393-1406.

Chakravarty, D., Sboner, A., Nair, S. S., Giannopoulou, E., Li, R., Hennig, S., . . . Kossai, M. (2014). The oestrogen receptor alpha-regulated lncRNA NEAT1 is a critical modulator of prostate cancer. Nature Communications, 5, 5383.

Chandrashekar, G., & Sahin, F. (2014). A survey on feature selection methods. Computers & Electrical Engineering, 40(1), 16-28.

Chen, L., & Carmichael, G. G. (2009). Altered nuclear retention of mRNAs containing inverted repeats in human embryonic stem cells: Functional role of a nuclear noncoding RNA. Molecular Cell, 35(4), 467-478.

Chen, M., Lin, H., Shen, C., Ma, Y., Wang, F., Zhao, H., . . . Zhang, J. (2015). The PU. 1- regulated long noncoding RNA lnc-MC controls human monocyte/macrophage

differentiation through interaction with MicroRNA-199a-5p. Molecular and Cellular Biology, , 15.

Chu, C., Qu, K., Zhong, F. L., Artandi, S. E., & Chang, H. Y. (2011). Genomic maps of long noncoding RNA occupancy reveal principles of RNA-chromatin interactions. Molecular Cell, 44(4), 667-678.

Chujo, T., Yamazaki, T., & Hirose, T. (2016). Architectural RNAs (arcRNAs): A class of long noncoding RNAs that function as the scaffold of nuclear bodies. Biochimica Et Biophysica Acta (BBA)-Gene Regulatory Mechanisms, 1859(1), 139-146.

Clemson, C. M., Hutchinson, J. N., Sara, S. A., Ensminger, A. W., Fox, A. H., Chess, A., & Lawrence, J. B. (2009). An architectural role for a nuclear noncoding RNA: NEAT1 RNA is essential for the structure of paraspeckles. Molecular Cell, 33(6), 717-726.

Cortes, C., & Vapnik, V. (1995). Support-vector networks. Machine Learning, 20(3), 273-297.

Darnell, R. B. (2010). HITS-CLIP: Panoramic views of protein–RNA regulation in living cells. Wiley Interdisciplinary Reviews: RNA, 1(2), 266-286.

Derrien, T., Johnson, R., Bussotti, G., Tanzer, A., Djebali, S., Tilgner, H., . . . Knowles, D. G. (2012). The GENCODE v7 catalog of human long noncoding RNAs: Analysis of their gene structure, evolution, and expression. Genome Research, 22(9), 1775-1789.

Di Ruscio, A., Ebralidze, A. K., Benoukraf, T., Amabile, G., Goff, L. A., Terragni, J., . . . Zhang, P. (2013). DNMT1-interacting RNAs block gene-specific DNA methylation. Nature, 503(7476), 371.

Ding, N., Wu, H., Tao, T., & Peng, E. (2017). NEAT1 regulates cell proliferation and apoptosis of ovarian cancer by miR-34a-5p/BCL2. OncoTargets and Therapy, 10, 4905.

Djebali, S., Davis, C. A., Merkel, A., Dobin, A., Lassmann, T., Mortazavi, A., . . . Schlesinger, F. (2012). Landscape of transcription in human cells. Nature, 489(7414), 101.

ENCODE Project Consortium. (2012). An integrated encyclopedia of DNA elements in the human genome. Nature, 489(7414), 57.

Engreitz, J. M., Ollikainen, N., & Guttman, M. (2016). Long non-coding RNAs: Spatial amplifiers that control nuclear structure and gene expression. Nature Reviews Molecular Cell Biology, 17(12), 756.

Engreitz, J. M., Pandya-Jones, A., McDonel, P., Shishkin, A., Sirokman, K., Surka, C., . . . Lander, E. S. (2013). The xist lncRNA exploits three-dimensional genome architecture to spread across the X chromosome. Science, 341(6147), 1237973.

Engreitz, J. M., Sirokman, K., McDonel, P., Shishkin, A. A., Surka, C., Russell, P., . . . Lander, E. S. (2014). RNA-RNA interactions enable specific targeting of noncoding RNAs to nascent pre-mRNAs and chromatin sites. Cell, 159(1), 188-199.

Feng, J., Bi, C., Clark, B. S., Mady, R., Shah, P., & Kohtz, J. D. (2006). The evf-2 noncoding RNA is transcribed from the dlx-5/6 ultraconserved region and functions as a dlx-2 transcriptional coactivator. Genes & Development, 20(11), 1470-1484.

Fong, K., Li, Y., Wang, W., Ma, W., Li, K., Qi, R. Z., . . . Chen, J. (2013). Whole-genome screening identifies proteins localized to distinct nuclear bodies. J Cell Biol, 203(1), 149- 164.

Forrest, M. E., & Khalil, A. M. (2017). Regulation of the cancer epigenome by long non-coding RNAs. Cancer Letters, 407, 106-112.

Fox, A. H., Bond, C. S., & Lamond, A. I. (2005). P54nrb forms a heterodimer with PSP1 that localizes to paraspeckles in an RNA-dependent manner. Molecular Biology of the Cell, 16(11), 5304-5315.

Fu, J., Kong, Y., & Sun, X. (2016). Long noncoding RNA NEAT1 is an unfavorable prognostic factor and regulates migration and invasion in gastric cancer. Journal of Cancer Research and Clinical Oncology, 142(7), 1571-1579.

Geisler, S., & Coller, J. (2013). RNA in unexpected places: Long non-coding RNA functions in diverse cellular contexts. Nature Reviews Molecular Cell Biology, 14(11), 699.

Gibb, E. A., Vucic, E. A., Enfield, K. S., Stewart, G. L., Lonergan, K. M., Kennett, J. Y., . . . Brown, C. J. (2011). Human cancer long non-coding RNA transcriptomes. PloS One, 6(10), e25915.

Gong, W., Zheng, J., Liu, X., Ma, J., Liu, Y., & Xue, Y. (2016). Knockdown of NEAT1 restrained the malignant progression of glioma stem cells by activating microRNA let-7e. Oncotarget, 7(38), 62208.

Guo, H. M., Yang, S. H., Zhao, S. Z., Li, L., Yan, M. T., & Fan, M. C. (2018). LncRNA NEAT1 regulates cervical carcinoma proliferation and invasion by targeting AKT/PI3K. European Review for Medical and Pharmacological Sciences, 22(13), 4090-4097.

Guttman, M., Amit, I., Garber, M., French, C., Lin, M. F., Feldser, D., . . . Cassady, J. P. (2009). Chromatin signature reveals over a thousand highly conserved large non-coding RNAs in mammals. Nature, 458(7235), 223.

Guyon, I., Weston, J., Barnhill, S., & Vapnik, V. (2002). Gene selection for cancer classification using support vector machines. Machine Learning, 46(1-3), 389-422.

Hafner, M., Landthaler, M., Burger, L., Khorshid, M., Hausser, J., Berninger, P., . . . Munschauer, M. (2010). Transcriptome-wide identification of RNA-binding protein and microRNA target sites by PAR-CLIP. Cell, 141(1), 129-141.

Hallegger, M., Llorian, M., & Smith, C. W. (2010). Alternative splicing: Global insights. The FEBS Journal, 277(4), 856-866.

Harrington, L. A. (1991). Telomerase primer specificity and chromosome healing. Nature, 353(6343), 451.

Hasegawa, Y., Brockdorff, N., Kawano, S., Tsutui, K., Tsutui, K., & Nakagawa, S. (2010). The matrix protein hnRNP U is required for chromosomal localization of xist RNA. Developmental Cell, 19(3), 469-476.

He, S., Zhang, H., Liu, H., & Zhu, H. (2014). LongTarget: A tool to predict lncRNA DNA- binding motifs and binding sites via hoogsteen base-pairing analysis. Bioinformatics, 31(2), 178-186.

Hentze, M. W., & Preiss, T. (2010). The REM phase of gene regulation. Trends in Biochemical Sciences, 35(8), 423-426.

Huang, B., Liu, C., Wu, Q., Zhang, J., Min, Q., Sheng, T., . . . Zou, Y. (2017). Long non-coding RNA NEAT1 facilitates pancreatic cancer progression through negative modulation of miR- 506-3p. Biochemical and Biophysical Research Communications, 482(4), 828-834.

Hudson, W. H., & Ortlund, E. A. (2014). The structure, function and evolution of proteins that bind DNA and RNA. Nature Reviews Molecular Cell Biology, 15(11), 749.

Hung, T., Wang, Y., Lin, M. F., Koegel, A. K., Kotake, Y., Grant, G. D., . . . Wang, P. (2011). Extensive and coordinated transcription of noncoding RNAs within cell-cycle promoters. Nature Genetics, 43(7), 621.

Hutchinson, J. N., Ensminger, A. W., Clemson, C. M., Lynch, C. R., Lawrence, J. B., & Chess, A. (2007). A screen for nuclear transcripts identifies two linked noncoding RNAs associated with SC35 splicing domains. BMC Genomics, 8(1), 39.

Imamura, K., Imamachi, N., Akizuki, G., Kumakura, M., Kawaguchi, A., Nagata, K., . . . Yoneda, M. (2014). Long noncoding RNA NEAT1-dependent SFPQ relocation from promoter region to paraspeckle mediates IL8 expression upon immune stimuli. Molecular Cell, 53(3), 393-406.

Janecek, A., Gansterer, W., Demel, M., & Ecker, G. (2008). On the relationship between feature selection and classification accuracy. Paper presented at the New Challenges for Feature Selection in Data Mining and Knowledge Discovery, 90-105.

Jankowsky, E., & Harris, M. E. (2015). Specificity and nonspecificity in RNA–protein interactions. Nature Reviews Molecular Cell Biology, 16(9), 533.

Jeon, Y., & Lee, J. T. (2011). YY1 tethers xist RNA to the inactive X nucleation center. Cell, 146(1), 119-133.

Jia, D., Huang, L., Bischoff, J., & Moses, M. A. (2014). The endogenous zinc finger transcription factor, ZNF24, modulates the angiogenic potential of human microvascular endothelial cells. The FASEB Journal, 29(4), 1371-1382.

Jiang, L., Shao, C., Wu, Q., Chen, G., Zhou, J., Yang, B., . . . Wang, Y. (2017). NEAT1 scaffolds RNA-binding proteins and the microprocessor to globally enhance pri-miRNA processing. Nature Structural and Molecular Biology, 24(10), 816.

Khalil, A. M., Guttman, M., Huarte, M., Garber, M., Raj, A., Morales, D. R., . . . Van Oudenaarden, A. (2009). Many human large intergenic noncoding RNAs associate with chromatin-modifying complexes and affect gene expression. Proceedings of the National Academy of Sciences, 106(28), 11667-11672.

Khan, A., Fornes, O., Stigliani, A., Gheorghe, M., Castro-Mondragon, J. A., van der Lee, R., . . . Tan, G. (2017). JASPAR 2018: Update of the open-access database of transcription factor binding profiles and its web framework. Nucleic Acids Research, 46(D1), D266.

König, J., Zarnack, K., Luscombe, N. M., & Ule, J. (2012). Protein–RNA interactions: New genomic technologies and perspectives. Nature Reviews Genetics, 13(2), 77.

König, J., Zarnack, K., Rot, G., Curk, T., Kayikci, M., Zupan, B., . . . Ule, J. (2010). iCLIP reveals the function of hnRNP particles in splicing at individual nucleotide resolution. Nature Structural & Molecular Biology, 17(7), 909.

Langmead, B., Trapnell, C., Pop, M., & Salzberg, S. L. (2009). Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biology, 10(3), R25.

Li, L., Liu, B., Wapinski, O. L., Tsai, M., Qu, K., Zhang, J., . . . Gupta, R. A. (2013). Targeted disruption of hotair leads to homeotic transformation and gene derepression. Cell Reports, 5(1), 3-12.

Li, W., Zhang, Z., Liu, X., Cheng, X., Zhang, Y., Han, X., . . . Xu, B. (2017). The FOXN3- NEAT1-SIN3A repressor complex promotes progression of hormonally responsive breast cancer. The Journal of Clinical Investigation, 127(9), 3421-3440.

Li, W., Notani, D., Ma, Q., Tanasa, B., Nunez, E., Chen, A. Y., . . . Song, X. (2013). Functional roles of enhancer RNAs for oestrogen-dependent transcriptional activation. Nature, 498(7455), 516.

Li, X., Wang, X., Song, W., Xu, H., Huang, R., Wang, Y., . . . Yang, X. (2018). Oncogenic properties of NEAT1 in prostate cancer cells depend on the CDC5L-AGRN transcriptional regulation circuit. Cancer Research, , canres. 0688.2018.

Li, Y., Li, Y., Chen, W., He, F., Tan, Z., Zheng, J., . . . Li, J. (2015). NEAT expression is associated with tumor recurrence and unfavorable prognosis in colorectal cancer. Oncotarget, 6(29), 27641.

Licatalosi, D. D., & Darnell, R. B. (2010). RNA processing and its regulation: Global insights into biological networks. Nature Reviews Genetics, 11(1), 75.

Licatalosi, D. D., Mele, A., Fak, J. J., Ule, J., Kayikci, M., Chi, S. W., . . . Wang, X. (2008). HITS-CLIP yields genome-wide insights into brain alternative RNA processing. Nature, 456(7221), 464.

Livi, C. M., Klus, P., Delli Ponti, R., & Tartaglia, G. G. (2015). Cat RAPID signature: Identification of ribonucleoproteins and RNA-binding regions. Bioinformatics, 32(5), 773- 775.

Lu, Q., Ren, S., Lu, M., Zhang, Y., Zhu, D., Zhang, X., & Li, T. (2013). Computational prediction of associations between long non-coding RNAs and proteins. BMC Genomics, 14(1), 651.

Martianov, I., Ramadass, A., Barros, A. S., Chow, N., & Akoulitchev, A. (2007). Repression of the human dihydrofolate reductase gene by a non-coding interfering transcript. Nature, 445(7128), 666.

McHugh, C. A., Russell, P., & Guttman, M. (2014). Methods for comprehensive experimental identification of RNA-protein interactions. Genome Biology, 15(1), 203.

Mercer, T. R., & Mattick, J. S. (2013). Structure and function of long noncoding RNAs in epigenetic regulation. Nature Structural & Molecular Biology, 20(3), 300.

Milek, M., Wyler, E., & Landthaler, M. (2012). Transcriptome-wide analysis of protein–RNA interactions using high-throughput sequencing. Paper presented at the Seminars in Cell & Developmental Biology, , 23(2) 206-212.

Mitchell, S. F., Jain, S., She, M., & Parker, R. (2013). Global analysis of yeast mRNPs. Nature Structural & Molecular Biology, 20(1), 127.

Mousavi, K., Zare, H., Dell’Orso, S., Grontved, L., Gutierrez-Cruz, G., Derfoul, A., . . . Sartorelli, V. (2013). eRNAs promote transcription by establishing chromatin accessibility at defined genomic loci. Molecular Cell, 51(5), 606-617.

Neguembor, M. V., Jothi, M., & Gabellini, D. (2014). Long noncoding RNAs, emerging players in muscle differentiation and disease. Skeletal Muscle, 4(1), 8.

Ng, A. Y. (2004). Feature selection, L 1 vs. L 2 regularization, and rotational invariance. Paper presented at the Proceedings of the Twenty-First International Conference on Machine Learning, 78.

Ng, S., Bogu, G. K., Soh, B. S., & Stanton, L. W. (2013). The long noncoding RNA RMST interacts with SOX2 to regulate neurogenesis. Molecular Cell, 51(3), 349-359.

Nie, H., Li, Y., Li, Z., Mu, J., & Wang, J. (2016). Effects of ZNF139 on gastric cancer cells and mice with gastric tumors. Oncology Letters, 12(4), 2550-2554.

Pachnis, V., Brannan, C. I., & Tilghman, S. M. (1988). The structure and expression of a novel gene activated in early mouse embryogenesis. The EMBO Journal, 7(3), 673-681.

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., . . . Dubourg, V. (2011). Scikit-learn: Machine learning in python. Journal of Machine Learning Research, 12(Oct), 2825-2830.

Peng, W., Si, S., Zhang, Q., Li, C., Zhao, F., Wang, F., . . . Ma, R. (2015). Long non-coding RNA MEG3 functions as a competing endogenous RNA to regulate gastric cancer progression. Journal of Experimental & Clinical Cancer Research, 34(1), 79.

Prasanth, K. V., & Spector, D. L. (2007). Eukaryotic regulatory RNAs: An answer to the ‘genome complexity’conundrum. Genes & Development, 21(1), 11-42.

Quinn, J. J., & Chang, H. Y. (2016). Unique features of long non-coding RNA biogenesis and function. Nature Reviews Genetics, 17(1), 47.

Quinodoz, S., & Guttman, M. (2014). Long noncoding RNAs: An emerging link between gene regulation and nuclear organization. Trends in Cell Biology, 24(11), 651-663.

Rinn, J. L., Kertesz, M., Wang, J. K., Squazzo, S. L., Xu, X., Brugmann, S. A., . . . Segal, E. (2007). Functional demarcation of active and silent chromatin domains in human HOX loci by noncoding RNAs. Cell, 129(7), 1311-1323.

Salmena, L., Poliseno, L., Tay, Y., Kats, L., & Pandolfi, P. P. (2011). A ceRNA hypothesis: The rosetta stone of a hidden RNA language? Cell, 146(3), 353-358.

Sasaki, Y. T., Ideue, T., Sano, M., Mituyama, T., & Hirose, T. (2009). MENε/β noncoding RNAs are essential for structural integrity of nuclear paraspeckles. Proceedings of the National Academy of Sciences, 106(8), 2525-2530.

Schmitt, A. M., Garcia, J. T., Hung, T., Flynn, R. A., Shen, Y., Qu, K., . . . Baum, R. (2016). An inducible long noncoding RNA amplifies DNA damage signaling. Nature Genetics, 48(11), 1370.

Schmitz, K., Mayer, C., Postepska, A., & Grummt, I. (2010). Interaction of noncoding RNA with the rDNA promoter mediates recruitment of DNMT3b and silencing of rRNA genes. Genes & Development, 24(20), 2264-2269.

Shannon, P., Markiel, A., Ozier, O., Baliga, N. S., Wang, J. T., Ramage, D., . . . Ideker, T. (2003). Cytoscape: A software environment for integrated models of biomolecular interaction networks. Genome Research, 13(11), 2498-2504.

Shen, J., Zhang, J., Luo, X., Zhu, W., Yu, K., Chen, K., . . . Jiang, H. (2007). Predicting protein– protein interactions based only on sequences information. Proceedings of the National Academy of Sciences, 104(11), 4337-4341.

Sherwood, R. I., Hashimoto, T., O'donnell, C. W., Lewis, S., Barkal, A. A., Van Hoff, J. P., . . . Gifford, D. K. (2014). Discovery of directional and nondirectional pioneer transcription factors by modeling DNase profile magnitude and shape. Nature Biotechnology, 32(2), 171.

Simon, M. D. (2013a). Capture hybridization analysis of RNA targets (CHART). Current Protocols in Molecular Biology, 101(1), 21.25. 16.

Simon, M. D. (2013b). Capture hybridization analysis of RNA targets (CHART). Current Protocols in Molecular Biology, 101(1), 21.25. 16.

Simon, M. D., Wang, C. I., Kharchenko, P. V., West, J. A., Chapman, B. A., Alekseyenko, A. A., . . . Kingston, R. E. (2011). The genomic binding sites of a noncoding RNA. Proceedings of the National Academy of Sciences, 108(51), 20497-20502.

Singh, G., Ricci, E. P., & Moore, M. J. (2014). RIPiT-seq: A high-throughput approach for footprinting RNA: Protein complexes. Methods, 65(3), 320-332.

Souquere, S., Beauclair, G., Harper, F., Fox, A., & Pierron, G. (2010). Highly ordered spatial organization of the structural long noncoding NEAT1 RNAs within paraspeckle nuclear bodies. Molecular Biology of the Cell, 21(22), 4020-4027.

Spector, D. L. (2006). SnapShot: Cellular bodies. Cell, 127(5), 1071. e2.

Sun, C., Li, S., Zhang, F., Xi, Y., Wang, L., Bi, Y., & Li, D. (2016). Long non-coding RNA NEAT1 promotes non-small cell lung cancer progression through regulation of miR-377-3p- E2F3 pathway. Oncotarget, 7(32), 51784.

Sunwoo, H., Dinger, M. E., Wilusz, J. E., Amaral, P. P., Mattick, J. S., & Spector, D. L. (2009). MEN ε/β nuclear-retained non-coding RNAs are up-regulated upon muscle differentiation and are essential components of paraspeckles. Genome Research,

Suresh, V., Liu, L., Adjeroh, D., & Zhou, X. (2015). RPI-pred: Predicting ncRNA-protein interaction using sequence and structural information. Nucleic Acids Research, 43(3), 1370- 1379.

Szklarczyk, D., Franceschini, A., Wyder, S., Forslund, K., Heller, D., Huerta-Cepas, J., . . . Tsafou, K. P. (2014). STRING v10: Protein–protein interaction networks, integrated over the tree of life. Nucleic Acids Research, 43(D1), D452.

Tripathi, V., Ellis, J. D., Shen, Z., Song, D. Y., Pan, Q., Watt, A. T., . . . Bubulya, P. A. (2010). The nuclear-retained noncoding RNA MALAT1 regulates alternative splicing by modulating SR splicing factor phosphorylation. Molecular Cell, 39(6), 925-938.

Tsai, M., Manor, O., Wan, Y., Mosammaparast, N., Wang, J. K., Lan, F., . . . Chang, H. Y. (2010). Long noncoding RNA as modular scaffold of histone modification complexes. Science, , 1192002.

Ulitsky, I., Shkumatava, A., Jan, C. H., Sive, H., & Bartel, D. P. (2011). Conserved function of lincRNAs in vertebrate embryonic development despite rapid sequence evolution. Cell, 147(7), 1537-1550.

Van Nostrand, E. L., Pratt, G. A., Shishkin, A. A., Gelboin-Burkhart, C., Fang, M. Y., Sundararaman, B., . . . Elkins, K. (2016). Robust transcriptome-wide discovery of RNA- binding protein binding sites with enhanced CLIP (eCLIP). Nature Methods, 13(6), 508.

Vance, K. W., & Ponting, C. P. (2014). Transcriptional regulatory functions of nuclear long noncoding RNAs. Trends in Genetics, 30(8), 348-355.

West, J. A., Davis, C. P., Sunwoo, H., Simon, M. D., Sadreyev, R. I., Wang, P. I., . . . Kingston, R. E. (2014). The long noncoding RNAs NEAT1 and MALAT1 bind active chromatin sites. Molecular Cell, 55(5), 791-802.

Xie, S. Q., Martin, S., Guillot, P. V., Bentley, D. L., & Pombo, A. (2006). Splicing speckles are not reservoirs of RNA polymerase II, but contain an inactive form, phosphorylated on serine2 residues of the C-terminal domain. Molecular Biology of the Cell, 17(4), 1723-1733.

Yamazaki, T., Souquere, S., Chujo, T., Kobelke, S., Chong, Y. S., Fox, A. H., . . . Hirose, T. (2018). Functional domains of NEAT1 architectural lncRNA induce paraspeckle assembly through phase separation. Molecular Cell, 70(6), 1053. e7.

Zarnegar, B. J., Flynn, R. A., Shen, Y., Do, B. T., Chang, H. Y., & Khavari, P. A. (2016). irCLIP platform for efficient characterization of protein–RNA interactions. Nature Methods, 13(6), 489.

Zhang, B., Mao, Y. S., Sunwoo, H., & Spector, D. L. (2011). Direct visualization of the co- transcriptional assembly of a nuclear body by noncoding RNAs. Nature Cell Biology, 13(1), 95-101. doi:10.1038/ncb2140

Zhang, J., Li, Y., Dong, M., & Wu, D. (2017). Long non-coding RNA NEAT1 regulates E2F3 expression by competitively binding to miR-377 in non-small cell lung cancer. Oncology Letters, 14(4), 4983-4988.

Zhang, M., Wu, W. B., Wang, Z. W., & Wang, X. H. (2017). lncRNA NEAT1 is closely related with progression of breast cancer via promoting proliferation and EMT. Eur.Rev.Med.Pharmacol.Sci, 21, 1020-1026.

Zhang, Y., Liu, T., Meyer, C. A., Eeckhoute, J., Johnson, D. S., Bernstein, B. E., . . . Li, W. (2008). Model-based analysis of ChIP-seq (MACS). Genome Biology, 9(9), R137.

Zhao, J., Ohsumi, T. K., Kung, J. T., Ogawa, Y., Grau, D. J., Sarma, K., . . . Lee, J. T. (2010). Genome-wide identification of polycomb-associated RNAs by RIP-seq. Molecular Cell, 40(6), 939-953.

Zhu, J., Rosset, S., Tibshirani, R., & Hastie, T. J. (2004). 1-norm support vector machines. Paper presented at the Advances in Neural Information Processing Systems, 49-56.

BIOGRAPHICAL SKETCH

Haozhe Wang was born in Qingdao, China. She entered Qingdao University in 2007. After

Haozhe graduated from Qingdao University with a Bachelor of Science in Biotechnology in

2011, she joined in the molecular and cell biology program at UT Dallas. She received a Master of Science in Molecular and Cell Biology in 2013 and continued on in the Doctoral program in the same year.

CURRICULUM VITAE

HAOZHE WANG

EDUCATION

Ph.D., Molecular & Cell Biology – Computational Biology, expected Winter 2018

The University of Texas at Dallas – Richardson, TX

M.S., Bioinformatics & Computational Biology, expected Winter 2018

The University of Texas at Dallas – Richardson, TX

M.S., Molecular & Cell Biology, 2013

The University of Texas at Dallas – Richardson, TX

B.S., Biotechnology, 2011

Qingdao University – Shandong, China

ACADEMIC PROJECTS

The University of Texas at Dallas, Richardson, TX

Ph.D. Candidate, 2013 – Present

• Computational modeling long non-coding RNAs mediated transcription regulation Led mining, review, and quality analysis of genomic and epigenetic data. Applied machine- learning methods to predict the long non-coding RNAs mediated transcription regulation. Modeled lncRNA-protein-DNA interaction via support vector machine and cluster analysis.

• Predicting genes and pathways related to breast cancer by network optimization

In addition to collecting protein interaction and signaling data from multiple sources, steered construction of protein interaction networks and design of graph theoretical approaches to support cancer-related candidate gene research.

Institute of Oceanology, Chinese Academy of Sciences, Shandong, China

Project Assistant, 2010 – 2011

• Function verification of actin promoter in Laminaria japonica Led comparison of kelp and related species across conservation areas. Predicted actin gene sequence, designed PCR primers, conducted genomic walking to obtain promoter sequence, and verified results via transient expression.

PROFESSIONAL EXPERIENCE

The University of Texas at Dallas, Richardson, TX

Teaching Assistant, 2012 – Present

Teaching Assistant for Introductory Biology Laboratory (2012 to present) and Biochemistry

Laboratory (2013-14).

• Supervise weekly laboratory periods to include setting up and demonstrating experiments. • Grade assignments and score exams per set standards. • Promote a safe, instructive atmosphere.

The University of Texas at Dallas, Richardson, TX

Instructor, 2017 – 2017

• Design and develop learning materials, preparing schemes of work and maintaining records to monitor student progress and achievement of programming workshop.

PUBLICATIONS & PAPERS

Publication:

Guo, Y., Wang, H. Z., Wu, C. H., Fu, H. H., & Jiang, P. (2017). “Cloning and characterization

of nitrate reductase gene in Ulva prolifera (Ulvophyceae, Chlorophyta).” Journal of

Phycology.

Working Papers:

Wang, H., Gu J., & Xuan, Z. “Computational modeling long non-coding RNAs mediated transcription regulation.”

Wang, Y., Sun, M., Zhu, X., Chen, Y., Tu, S, Chen, Y., Bai, B., Chen, M., Dai Q., Wang, H.,

Xuan, Z., & Zhang, M. Q. “RSQ: a statistical method for quantification of isoform-specific

structurome using transcriptome-wide structural profiling data.”

TECHNICAL STRENGTHS

• Programming Languages

Proficient: Python, R, Bash (Unix shell)

Experience: Java, Matlab, Perl, Git and Github

• Computational skills

Proficient: Support Vector Machine, Random Forest, Clustering Analysis, feature selection

Experience: Logistic regression, Neural Network, and deep learning

• Data Analysis

Proficient: sequencing data (RNA-, ChIP-, CHART- and GRO-seq) analysis