BIOINFORMATICS ANALYSIS AND COMPUTATIONAL BIOLOGY MODELING OF

GENETIC AND EPIGENETIC ELEMENTS DICTATING CHROMATIN ORGANIZATION,

TRANSCRIPTION REGULATION, AND DISEASE STATE IN CANCER GENOMES

JACQUELINE CHYR

A Dissertation Submitted to the Graduate Faculty of

WAKE FOREST UNIVERSITY GRADUATE SCHOOL OF ARTS AND SCIENCES

in Partial Fulfillment of the Requirements

for the Degree of

DOCTOR OF PHILOSOPHY

Cancer Biology

May 2020

Winston-Salem, North Carolina

Approved by:

Xiaobo Zhou, Ph.D., Advisor

Gregory A. Hawkins, Ph.D., Chair

William Gmeiner, Ph.D.

Lance D. Miller, Ph.D.

Kounosuke Watabe, Ph.D.

ACKNOWLEDGEMENTS

I would like to begin by thanking my advisor and mentor, Dr. Xiaobo Zhou, for his years

of guidance and support throughout my graduate studies. His experience, knowledge, and ready

availability to discuss my ideas and project were critical in the development and completion of my

work. I would also like to thank past and present members of Zhou’s lab for their valuable input

and crucial assistance. Working with Dr. Zhou and our lab members has been the most wonderful

and memorable experience in my life.

Next, I would like to thank my committee members: Dr. Gregory A. Hawkins, Dr. William

Gmeiner, Dr. Lance D. Miller, and Dr. Kounosuke Watabe for their advice and discussions. I greatly appreciate the time and energy they put into keeping me on track with my education. In addition, I would like to thank all my professors, my peers, the Cancer Biology Department, and

Wake Forest University School of Medicine.

Furthermore, I would like to thank my mother and father for providing the soil for me to grow, and my sisters: Christina, Erica, and Gloria for being my light. I would also like to thank my friends because without their constant motivation and encouragement, I would not be here today.

Finally, above all, I would like to thank William G. Huang for his great love and many sacrifices for me. Thank you all for supporting me through this journey.

TABLE OF CONTENTS

ACKNOWLEDGEMENTS

TABLE OF CONTENTS

LIST OF ABBREVIATIONS

LIST OF FIGURES

LIST OF TABLES

CHAPTERS

1 General Introduction

1.1 Bridging between cancer biology and computational science

1.2 Immune system in head and neck cancers

1.3 Stem cell transcription factors in lung cancer

1.4 Chromatin organization in breast cancer

1.5 Enhancer and promoter interactions

1.6 Significance and innovation

2 Immune signaling-based Cascade Propagation approach re-stratifies HNSCC patients

2.1 Abstract

2.2 Introduction

2.2.1 Head and neck squamous cell carcinoma

2.2.2 Big data analysis and patient stratification

2.2.3 Cancer immune system

2.2.4 Environmental factors in HNSCC

2.2.5 EGFR signaling pathway in HNSCC

2.2.6 Innovation

2.3 Experimental Design and Methods

2.3.1 Patient clinical and genomic data

2.3.2 Immune signaling interaction database

2.3.3 Statistical analysis of gene interaction on survival

2.3.4 Clustering patients by dynamic network tree cutting

2.3.5 Protein-protein interaction (PPI) information

2.3.6 Smoothing somatic mutations network

2.3.7 Non-negative matrix factorization (NMF)-based clustering

2.4 Results

2.4.1 Cascade Propagation Overview

2.4.2 Patient stratification and Identification of Co-dependent Pathways

2.4.3 Enriched Immune Signaling Pathway

2.4.4 Somatic Mutation Propagation for Further stratification

2.4.5 Distinct clinical characteristics for Subnetworks

2.4.6 Network analysis of Mutation Subnetworks

2.5 Discussion

3 SNP variant regulates SOX2 modulation of VDAC3 in LSCC patients

3.1 Abstract

3.2 Introduction

3.2.1 Prevalence and subtypes of lung cancer

3.2.2 SOX2 amplification in lung cancer

3.2.3 Mechanisms of SOX2 and SNPs

3.2.4 eQTL analysis and LD SNPs

3.3 Experimental Design and Methods

3.3.1 LSCC patient data

3.3.2 Matrix eQTL computation

3.3.3 SOX2 siRNA lung cancer cell line

3.3.4 Linkage Disequilibrium SNPs and binding motifs

3.3.5 SOX2 and H3K27ac ChIP-seq in HCC95 cell line

3.3.6 Lung tissue and cell line Hi-C data

3.4 Results

3.4.1 Overview of our workflow

3.4.2 SOX2 is amplified in LSCC patients

3.4.3 Matrix eQTL identifies SNP to VDAC3 expression association

3.4.4 SOX2 regulates VDAC3 expression

3.4.5 SNPs or LD SNPs are located in SOX2 binding motifs

3.4.6 VDAC3 and rs58163073 are located within the same TAD

3.4.7 Significant association between SNP and clinical features

3.5 Discussion

4 Chromatin organization alterations lead to oncogene overexpression in breast cancer

4.1 Abstract

4.2 Introduction

4.2.1 Chromatin organization

4.2.2 Topologically associated domains

4.2.3 TAD and TAD boundary predictions

4.2.4 Gradient Boosting Machine classification model

4.2.5 Significance and innovation

4.3 Experimental Design and Methods

4.3.1 Hi-C data and TAD boundaries

4.3.2 Epigenomic and Genomic Elements

4.3.3 PredTAD Boundary Prediction Model

4.3.4 Boundary alterations prediction

4.3.5 Genomic interaction prediction

4.3.6 Gene expression analysis

4.4 Results

4.4.1 Epigenetic and genetic features in TAD boundaries

4.4.2 PredTAD model

4.4.3 Comparison to other models

4.4.4 Boundary alterations are controlled by H3K9ac, CTCF, RAD21, and SMC3

4.4.5 Epigenomic and genomic features accurately predict genomic interactions

4.4.6 Altered chromatin organization contributes to oncogenic pathway activation

4.4.7 Loss of TAD boundary promotes RET oncogene expression in breast cancer

4.5 Discussion

5 Sequence-based modeling of enhancer and promoter interactions

5.1 Abstract

5.2 Introduction

5.2.1 Enhancers and promoters

5.2.2 Annotation of enhancer-promoter pairs

5.2.3 Deep-learning models for genomics

5.2.4 Development of sequence-based models

5.2.5 Application of sequence-based models

5.2.6 Predicting enhancer-promoter interactions

5.3 Experimental Design and Methods

5.3.1 Two-step enhancer-promoter interaction prediction in GM12878

5.3.2 Modeling enhancer-promoter interactions in six cell lines

5.3.3 Identification of important binding motifs using deep learning framework

5.4 Results

5.4.1 Classification of enhancer-promoter interactions using chromatin features

5.4.2 Classification of enhancer-promoter interactions using sequence information

5.4.3 Systematically identifying important positions

5.4.4 POLR2A, HDAC1, and TRIM28 are important

5.4.5 Identification of SNPs that affect classification accuracy

5.5 Discussion

6 Summary and Discussion

6.1 Cascade Propagation approach re-stratified HNSCC patients

6.2 SNP variant regulates SOX2 modulation of VDAC3 in LSCC patients

6.3 Chromatin organization alterations lead to oncogene overexpression in breast cancer

6.4 Sequence-based modeling of enhancer and promoter interactions

6.5 Future Directions

CURRICULUM VITAE

REFERENCES

LIST OF ABBREVIATIONS

3D three-dimensional

A549 adenocarcinomic alveolar basal epithelial cell line

AFR African

AMR American

ANOVA analysis of variance

ASN Asian

ATAC-seq assay for transposase-accessible chromatin sequencing

AUC area under the curve

BART Bayesian additive regression trees

BCR B-cell receptor

BioGrid The Biological General Repository for Interaction Datasets

BLNK B-cell linker

bp base pair

BRCA1/2 breast cancer type 1/2 susceptibility protein

BTK Bruton tyrosine kinase

CAGE cap-analysis of gene expression

CasP Cascade Propagation

CBLB Cbl proto-oncogene B

CD14 cluster of differentiation 14

CD3G T-cell surface glycoprotein CD3 gamma chain

CD79A B-cell antigen receptor complex-associated protein alpha chain

CD81 target of antiproliferative antibody

CDKN2A/B cyclin dependent kinase inhibitor 2A/2B

CGCI The Cancer Genome Characterization Initiative

ChIA-PET chromatin interaction analysis by paired-end tag sequencing

ChIP-seq chromatin immunoprecipitation sequencing

c-Myc MYC proto-oncogene, BHLH transcription factor

COCA Constructing Optimal Cluster Architecture

CpG cytosine-phosphodiester-guanine

CTCF CCCTC-binding factor

dCTP deoxycytidine triphosphate

DeepBind deep learning framework for predicting DNA or RNA factor binding

DeepSEA deep learning framework for predicting chromatin effects of sequence alterations

DEseq differential gene expression analysis based on the negative binomial distribution

DNMT1 DNA methyltransferase 1

DNMT3a DNA methyltransferase 3 alpha

DNMT3b DNA methyltransferase 3 beta

DNA deoxyribonucleic acid

E2 estrogen

EGFR epidermal growth factor receptor

ELF1 E74-like ETS transcription factor 1

ENCODE Encyclopedia of DNA Elements

EP300 histone acetyltransferase P300

eQTL expression quantitative trait loci

ER estrogen receptor

eRNA enhancer RNA

EUR European

ExPecto tissue-specific gene expression effect predictions for human mutations

FANTOM5 functional annotation of the mammalian genome

FDA The Food and Drug Administration

FDR false discovery rate

FOCS FDR-corrected OLS with cross-validation and shrinkage

FOXA1 forkhead box A1

FPKM fragments per kilobase of transcript per million

GABPA human nuclear respiratory factor-2 subunit alpha

GATA3 trans-acting T-cell-specific transcription factor GATA-3

GBM gradient boosting machine

GEO Gene Expression Omnibus

GM12878 human lymphoblastoid cell line

GRO-seq global run-on sequencing

GWAS genome wide association study

H1 hESC human embryonic stem cell line

H3K27ac acetylation of lysine 27 of histone H3

H3K27me3 trimethylation at lysine 27 of histone H3

H3K36me3 trimethylation of lysine 36 of histone H3

H3K4me1 monomethylation at lysine 4 of histone H3

H3K4me3 trimethylation of lysine 4 of histone H3

H3K9ac acetylation of lysine 9 of histone H3

H3K9me3 trimethylation at lysine 9 of histone H3

H4K20me1 monomethylation at lysine 20 of histone H4

H520 lung squamous cell carcinoma cell line

HCC95 lung squamous cell carcinoma cell line

HDAC1 histone deacetylase 1

HeLa-S3 clonal derivative of a cervical cancer cell line

HER2 (Her2) human epidermal growth factor receptor 2

Hi-C high-throughput chromatin conformation capture

HindIII type II site-specific deoxyribonuclease restriction enzyme

HMG high-mobility group

HNSCC head and neck squamous cell carcinoma

HP1 heterochromatin protein 1

HPV human papilloma virus

HT-SELEX high throughput systematic evolution of ligands by exponential enrichment

HUVEC human umbilical vein endothelial cell line

ICE iterative correction and eigenvalue decomposition

ICGC International Cancer Genome Consortium

iCluster integrative cluster

IMR90 lung fibroblast cell line

InnateDB systems biology of the innate immune response database

IPA Ingenuity Pathway Analysis

IRAK1 interleukin 1 receptor associated kinase 1

IRAK3 interleukin 1 receptor associated kinase 3

IRF8 interferon regulatory factor 8

ITK interleukin-2-inducible T-cell kinase

JAK tyrosine-protein kinase pathway

JASPAR database of transcription factor binding profiles

K562 myelogenous leukemia cell line

Kaviar Known VARiants

KEGG Kyoto Encyclopedia of Genes and Genomes

KIT KIT proto-oncogene, receptor tyrosine kinase

kb or kbps thousand base pairs

LAT linker for activation of T-cells

LCK lymphocyte cell-specific protein-tyrosine kinase

LCP2 lymphocyte cytosolic protein 2

LD linkage disequilibrium

LK2 lung cancer cell line

log logarithm

log2FC logarithm 2-fold change

LSCC lung squamous cell carcinoma

LUAD lung adenocarcinoma

LY86 lymphocyte antigen 86

LY96 lymphocyte antigen 96

MB million base pairs

MACS model-based analysis of ChIP-seq

MAF minor allele frequency

MARGE model-based analysis of regulation of gene expression

MAX basic helix-loop-helix protein 4, MYC associated factor X

MCF10A non-tumorigenic epithelial cell line

MCF7 breast cancer cell line

MYD88 innate immune signal transduction adaptor

NCBI National Center for Biotechnology Information

NHEK epidermal keratinocytes cell line

NMF non-negative matrix factorization

nt nucleotide

PBM protein binding microarrays

PI3K phosphatidylinositol-4,5-bisphosphate 3-kinase catalytic subunit alpha

PLCγ1 phospholipase C gamma 1

PML promyelocytic leukemia

POLR2A RNA polymerase II subunit A

PPI protein-protein interaction

PredTAD prediction of TAD

PSLM position-specific linear model

Pten phosphatase and tensin homolog

PWM position weight matrix

r2 square of correlation coefficient

RAD21 cohesin complex component

RET receptor tyrosine kinase

RF random forest

RMA robust multichip average

RNA-seq RNA sequencing

ROC receiver operating characteristic

RP regulatory potential

SIN3A histone deacetylase complex subunit Sin3a

siRNA small interfering RNA

SMARCA5 SWI/SNF related actin dependent regulator of chromatin A5

SMARCE1 SWI/SNF related actin dependent regulator of chromatin E1

SMC3 structural maintenance of chromosome 3

SNP single nucleotide polymorphism

SNV single nucleotide variations

SOS1 son of sevenless homolog 1, Ras/Rac guanine nucleotide exchange factor 1

SOX2 SRY-box 2, transcription factor

SRA Sequence Read Archive

SRC SRC proto-oncogene, non-receptor tyrosine kinase

SRF serum response factor

SRY sex determining region Y

STAT signal transducer and activator of transcription pathway

T47D breast cancer cell line

TAD topologically associating domain

TAD-BR TAD boundary

TARGET Therapeutically Applicable Research to Generate Effective Treatments

TCGA The Cancer Genome Atlas

TCR T-cell receptor

TF transcription factor

TFBS transcription factor binding site

TLR toll-like receptor

TLR2 toll-like receptor 2

TLR3 toll-like receptor 3

TLR4 toll-like receptor 4

TLR9 toll-like receptor 9

TP53 tumor protein p53

TRIM28 transcription intermediary factor 1-beta

TruSeq Illumina RNA Library Prep Kit

TSS transcription start site

UCSC University of California, Santa Cruz

VDAC3 voltage-dependent anion-selective channel 3

Wnt1 wingless-type MMTV integration site family member 1

Zap70 zeta chain of T-cell receptor associated protein kinase 70

LIST OF FIGURES

CHAPTER II.

Immune signaling-based Cascade Propagation approach re-stratifies HNSCC patients

Figure 1. Overview of CasP subtyping approach

Figure 2. The Dynamic Network Tree Cutting Approach

Figure 3. Branch 14 network and survival curve

Figure 4. HPV status in Network-positive group

Figure 5. Overview of somatic mutation analysis

Figure 6. Cox survival analysis of subnetworks

Figure 7. Fully connected subnetworks after further stratification with mutational data

CHAPTER III.

SNP variant regulates SOX2 modulation of VDAC3 in LSCC patients

Figure 8. Brief overview of workflow

Figure 9. Correlation between methylation and SOX2 expression

Figure 10. Multiple genomic datatypes are used to group SOX2 patients into two groups

Figure 11. Gene expression profile of two SOX2-silenced LSCC cell lines

Figure 12. SOX2 regulates VDAC3 expression

Figure 13. SNP rs58163073 alters the PWM of SOX2

Figure 14. VDAC3 and rs58163073 are located within the same TAD in lung cell line

Figure 15. Hi-C heat maps for lung cancer cell line and lung tissues

Figure 16. LSCC Clinical Analysis

Figure 17. Graphical summary of the regulation of VDAC3 by SOX2

CHAPTER IV.

Chromatin organization alterations lead to oncogene overexpression in breast cancer

Figure 18. Epigenomic and genomic features

Figure 19. Overview of PredTAD

Figure 20. PredTAD top predictive features

Figure 21. Comparison of PredTAD to other models

Figure 22. Comparison of normal and breast cancer TAD boundaries

Figure 23. Expression of top five features in conserved and perturbed TAD boundaries

Figure 24. Genome-wide interaction study

Figure 25. Interaction prediction accuracy across each chromosome

Figure 26. Comparing interacting and non-interacting regions

Figure 27. TAD boundary changes near RET gene

Figure 28. RET in breast cancer cell line and patients

CHAPTER V.

Sequence-based modeling of enhancer and promoter interactions

Figure 29. Overview of our two-step enhancer-promoter interaction prediction

Figure 30. Detailed view of our two-step enhancer-promoter interaction model

Figure 31. Variable importance plot for enhancer-promoter interactions

Figure 32. Genome browser view of an interacting enhancer-promoter pair

Figure 33. ROC curves for enhancer-promoter interaction results

Figure 34. P1 probability score changes for one sample

Figure 35. Top 20 most recurrent DNA binding proteins

LIST OF TABLES

Table 1. Top ranking branches based on Cox regression p-values

Table 2. Patient HPV status in Network-positive and Network-negative subgroups

Table 3. Clinical characteristics of Subnetworks

Table 4. Identified genes in each subnetwork

Table 5. Master regulator potential

Table 6. eQTL results for SOX2-active patients

Table 7. eQTL results for SOX2-inactive patients

Table 8. Binding motifs located around target SNPs

Table 9. Association of SNP rs798827 and clinical features in LSCC patients

Table 10. Signaling pathway analysis of differentially expressed genes near altered boundaries

Table 11. Enhancer-promoter interaction prediction details and results

Table 12. Small percentage of positions leads to a classification labeling change

CHAPTERS

CHAPTER I:

GENERAL INTRODUCTION

1 General Introduction

1.1 Bridging between cancer biology and computational science

Cancer is the second leading cause of death worldwide, responsible for 9.6 million deaths [1]. In the United States, there were 1,762,450 new cancer cases and 606,880 cancer deaths in 2019 [2]. Unsurprisingly, it is one of the biggest research topics of our time. Cancer is an abnormal growth of human cells that can invade adjacent tissues and spread to other organs, causing impaired organ function and death [3]. It can affect any part of the body. In this work, we explore the classification and mechanisms of head and neck cancer, lung cancer, and breast cancer.

There are two main causes of cancer: genetic and environmental. Exposure to carcinogens, such as radiation, asbestos, or tobacco smoke, and underlying genetic defects, such as mutations of TP53, BRCA1/2, or SOX2, can cause the development of cancer [4-7]. Additionally, our immune system functions to detect and eliminate abnormal cancer cells [8, 9]. However, there are still unknown causes of cancer, and the heterogeneity of cancer cells and patients makes it a challenge to study.

With the emergence of powerful next-generation technologies, such as high-throughput sequencing, the availability of biological data has increased. Cancer cells can now be analyzed in great detail. The analysis and integration of genetic and clinical data benefit from the use of bioinformatics, while the development of mathematical and prediction models is made possible by computational biology. Here, we applied bioinformatics and computational biology techniques to multiple cancer types to better understand the mechanisms of cancers and their biology.

Machine learning and deep learning techniques are among the most advanced computational methods. Computational algorithms allow for a deeper understanding by extracting complex patterns from high-dimensional data. In our work, we begin by conducting bioinformatics analysis and then progress to more advanced computational algorithms and modeling.

1.2 Immune system in head and neck cancers

Head and neck squamous cell carcinoma (HNSCC) is a complex disease of the oral cavity, oropharynx, larynx, or hypopharynx. It is the sixth most common malignancy in the world, with a five-year survival rate of 60% [10, 11]. The immune system plays an important role in detecting and controlling the progression of tumors. HNSCC, in particular, is considered an immunosuppressive disease in which immunocompetent cells are dysregulated [11-13]. The tumor microenvironment, which includes surrounding immune and normal cells, can communicate with cancer cells and affect their development and progression [14, 15].

Environmental factors such as consumption of alcohol and tobacco also influence the etiology and pathogenesis of HNSCC [16, 17]. In addition, infection with human papilloma viruses (HPVs) is a risk factor for HNSCC. Oncogenic HPV types, especially HPV-16 and HPV-18, are associated with up to 70% of HNSCC [14, 15]. The relationship between these infections and the immune system may also contribute to the progression and prognosis of HNSCC.

The availability of high-throughput genomic assays and rich electronic medical records aids in the identification of HNSCC immune-related subtypes with greater accuracy and resolution. However, integration of multiplatform, heterogeneous, and high-dimensional data remains a challenge in bioinformatics research. Therefore, we developed a novel immune signaling-based Cascade Propagation (CasP) subtyping approach to stratify HNSCC patients based on their immune functions. Our CasP approach utilized multiple data types and identified clinically relevant subtypes of HNSCC patients.

1.3 Stem cell transcription factors in lung cancer

Lung cancer is the second most common cancer in both men and women. There are two major subtypes of lung cancer: lung adenocarcinoma (LUAD) and lung squamous cell carcinoma (LSCC). There are several FDA-approved targeted therapies for LUAD; however, there are none for LSCC. The genomic heterogeneity and high molecular complexity of LSCC have hampered our understanding of the disease and the discovery of druggable targets.

Recent studies have found a number of DNA alterations in LSCC [18, 19]. The most notable alteration is the amplification of chromosome 3q, which contains the SOX2 locus. SOX2 is overexpressed in 63% of LSCC cases at the transcript level and in 80-90% at the protein level [20-22]. SOX2 is a transcription factor that plays a role in self-renewal and differentiation of stem cells [23-26]. When dysregulated, stem cell transcription factors can lead to the abnormal proliferation and growth of cancer cells. Other studies have found that SOX2 expression can promote lung cancer and that downregulation of SOX2 inhibits proliferation and induces apoptosis in tumor cells [27].

As a transcription factor, SOX2 binds to and regulates the transcription of several genes. However, SOX2’s gene targets and downstream mechanisms are unclear. To investigate SOX2’s gene targets in LSCC, we first subgrouped LSCC patients based on their SOX2 activity (high expression, promoter hypomethylation, and copy number gain). Then, we conducted separate expression quantitative trait loci (eQTL) analyses for the two groups. eQTL analysis identifies genetic variants, such as SNPs, that influence the expression of one or more genes [28]. By first subgrouping LSCC patients by their SOX2 activity, we were able to identify SOX2-related eQTLs and make more relevant discoveries.
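The idea of running a separate eQTL test within each SOX2 subgroup can be sketched as follows. This is a minimal illustration under stated assumptions, not the Matrix eQTL computation used in this work: genotypes are coded as minor-allele dosage (0, 1, or 2), the association is summarized by a simple least-squares slope and Pearson correlation without covariates or significance testing, and the function names and toy values are hypothetical.

```python
from math import sqrt


def eqtl_slope_and_r(genotypes, expression):
    """Least-squares slope and Pearson r of expression on allele dosage (0/1/2)."""
    n = len(genotypes)
    if n != len(expression) or n < 3:
        raise ValueError("need matched vectors with at least 3 samples")
    mean_g = sum(genotypes) / n
    mean_e = sum(expression) / n
    sxx = sum((g - mean_g) ** 2 for g in genotypes)
    syy = sum((e - mean_e) ** 2 for e in expression)
    sxy = sum((g - mean_g) * (e - mean_e) for g, e in zip(genotypes, expression))
    if sxx == 0 or syy == 0:
        raise ValueError("genotype or expression has no variance")
    return sxy / sxx, sxy / sqrt(sxx * syy)


def eqtl_by_group(samples):
    """samples: iterable of (group, dosage, expression); returns {group: (slope, r)}."""
    grouped = {}
    for group, dosage, expr in samples:
        dosages, exprs = grouped.setdefault(group, ([], []))
        dosages.append(dosage)
        exprs.append(expr)
    return {g: eqtl_slope_and_r(d, e) for g, (d, e) in grouped.items()}


# Toy data invented for illustration only (not from the study):
# one SNP and one gene, measured in two hypothetical patient subgroups.
toy = [
    ("active", 0, 1.0), ("active", 1, 2.0), ("active", 2, 3.0), ("active", 1, 2.0),
    ("inactive", 0, 2.0), ("inactive", 1, 2.1), ("inactive", 2, 1.9), ("inactive", 1, 2.0),
]
results = eqtl_by_group(toy)
for group, (slope, r) in results.items():
    print(f"{group}: slope={slope:.2f}, r={r:.2f}")
```

In this toy example the SNP is strongly associated with expression only in the "active" subgroup, which is the kind of subgroup-specific signal that pooling all patients could dilute.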

1.4 Chromatin organization in breast cancer

The human genome is very large, with more than three billion nucleotides. In order to fit this vast genomic material in a small nuclear space, the chromatin undergoes multiple levels of organization. The genome is beautifully organized into fractal globular-like structures with extensive and specific looping and folding. The chromatin structure is not random and plays a critical role in gene regulation. However, it is often altered in diseases such as breast cancer.

Recent development of high-throughput chromosome conformation capture technologies such as 3C, ChIA-PET, and Hi-C allows us to experimentally map long-range chromatin interactions [29, 30]. Large-scale 3D chromatin structures such as topologically associating domains (TADs) and TAD boundaries can be detected by analyzing these datatypes [31, 32]. TAD boundaries, in particular, are important in gene regulation because they insulate neighboring regions and prevent them from interacting with each other. Perturbed TAD boundaries may affect the interaction and regulation of enhancers and promoters. For example, novel TAD boundaries may prevent enhancers from interacting with and regulating their target genes, and destroyed TAD boundaries may allow new, unprecedented interactions to occur [33-35]. Usually, there are about 3000 TAD boundaries within a cell [34, 36]. In breast cancer cells, there is a gain and loss of approximately 600 TAD boundaries [34]. The reasons for these changes are unclear, and the downstream outcomes of these perturbed boundaries are unknown.
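To make the notion of gained and lost boundaries concrete, the comparison of two boundary sets can be sketched as below. This is an illustrative sketch only: the matching rule (two boundaries are the same if they lie within a fixed distance), the tolerance value, the function name, and the coordinates are all assumptions and do not describe the procedure used in this work.

```python
def compare_boundaries(normal, cancer, tolerance=0):
    """Classify TAD boundaries on one chromosome as conserved, lost, or gained.

    normal, cancer: iterables of boundary positions in base pairs.
    A boundary counts as matched if the other set has a boundary
    within `tolerance` base pairs (an assumed matching rule).
    """
    normal = sorted(normal)
    cancer = sorted(cancer)

    def has_match(pos, others):
        return any(abs(pos - other) <= tolerance for other in others)

    return {
        "conserved": [p for p in normal if has_match(p, cancer)],
        "lost": [p for p in normal if not has_match(p, cancer)],
        "gained": [p for p in cancer if not has_match(p, normal)],
    }


# Hypothetical coordinates, invented for illustration.
normal_boundaries = [100_000, 500_000, 900_000]
cancer_boundaries = [120_000, 900_000, 1_400_000]
result = compare_boundaries(normal_boundaries, cancer_boundaries, tolerance=40_000)
print(result)
```

Genes located near the "lost" and "gained" positions are the natural candidates for the downstream expression analysis described in this section.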

Previous studies have noted enrichment of certain factors at TAD boundaries, such as CTCF binding and housekeeping genes [37]. However, the distribution of CTCF alone is not sufficient to predict TAD boundaries, and housekeeping genes were enriched in only some TAD boundaries. In our work, we utilized a comprehensive set of epigenetic and genetic features to model and understand TAD boundaries. We included DNA binding factors, histone modification features, transcription factor binding motifs, DNA methylation, and more datatypes in our prediction of TAD boundaries. Unlike other TAD prediction models, we also included information from flanking regions and interrogated the entire genome [37-39]. Then, we modeled large-scale interactions using the same epigenomic and genomic features. Moreover, we analyzed the effects of TAD boundary changes on gene expression and regulation. We conducted signaling pathway analysis and identified oncogenic activation due to TAD boundary perturbations.

By studying chromatin structures in breast cancer cells, we gained a better understanding of the mechanisms involved in genomic interactions, altered gene expression, and disease state in breast cancer cells.

1.5 Enhancer and promoter interactions

In complex genomes such as the human genome, transcriptional regulatory elements play a major role in the transcription of genes. Targeted gene transcription contributes to the function and phenotype of a cell. There are two major classes of regulatory elements: enhancers and promoters. Promoters are regions where transcription of genes is initiated, and enhancers are regulatory regions that activate transcription. For an enhancer to activate a target gene’s transcription, it needs to bind to and interact with the promoter, forming DNA loops. An enhancer and its target promoter can be separated by thousands or millions of base pairs [40]. Previously, enhancer and promoter pairs were indirectly identified through genetic associations such as eQTLs, enhancer and promoter chromatin state, and gene expression [41]. Now, high-throughput methods such as ChIA-PET and Hi-C allow us to directly define long-range enhancer-promoter pairs [42]. However, high-resolution, high-throughput chromatin conformation capture experiments are costly and time consuming.

Increasing evidence suggests that complex patterns in our DNA sequence can dictate enhancer-promoter interaction pairs, as well as DNA binding protein affinities, alternative polyadenylation, and more [43-45]. With advances in computational algorithms and computational power, it is now possible to utilize DNA sequence in prediction models. In our work, we developed two new models that predict the interaction status of enhancer-promoter pairs using DNA sequence information. We further analyzed the sequence positions that contribute most to the classification model and identified important patterns and features.
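A common way to feed DNA sequence into a prediction model is one-hot encoding, in which each base becomes a four-element indicator vector. The sketch below assumes this encoding purely for illustration; the text above does not specify how sequences are encoded in our models, and the function names and example sequences are hypothetical.

```python
BASES = "ACGT"


def one_hot(sequence):
    """Encode a DNA string as a list of 4-element rows in A, C, G, T order.

    Any character outside A/C/G/T (for example N) becomes an all-zero row.
    """
    return [[1 if base == b else 0 for b in BASES] for base in sequence.upper()]


def encode_pair(enhancer_seq, promoter_seq):
    """Return the encoded (enhancer, promoter) pair as two separate matrices."""
    return one_hot(enhancer_seq), one_hot(promoter_seq)


# Hypothetical short sequences; real regulatory elements are far longer.
enhancer_matrix, promoter_matrix = encode_pair("ACGTN", "ttga")
print(enhancer_matrix)
print(promoter_matrix)
```

With sequences in this form, changing a single base changes a single row, which is what makes it possible to ask how much each position contributes to a model's prediction.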

1.6 Significance and innovation

In all our projects, we implemented powerful bioinformatics and computational biology concepts, ideas, and tools to answer biological questions. 1) We developed a novel immune signaling-based subtyping approach to stratify HNSCC patients into clinically relevant subtypes with different immune signatures. 2) For the first time in eQTL analysis, we subgrouped genomically complex LSCC patients based on SOX2 activity prior to conducting eQTL analysis. This allowed us to identify gene targets that are affected by the overexpression of the stem cell transcription factor. 3) We developed a machine learning model of chromatin organization to better understand the underlying factors that contribute to chromatin structure changes in breast cancer. We also analyzed downstream effects of chromatin structure alterations in breast cancer cells and identified plausible activation of signaling pathways and oncogenes. 4) Finally, we developed two “sequence-based” methods to study enhancer-promoter interactions. Both methods extracted meaningful information and patterns from DNA sequences and accurately predicted enhancer-promoter interactions. This project enhanced our understanding of the genome and allowed us to predict the functional importance of mutations, SNPs, or indels in patient data. Our models and pipelines tackle cancer biology problems and can easily be applied to other diseases.

CHAPTER II:

IMMUNE SIGNALING-BASED CASCADE PROPAGATION APPROACH RE-STRATIFIES

HNSCC PATIENTS

Chapter II contains text and figures from Liu, K., Chyr, J., Zhao, W., and Zhou, X. Immune signaling-based Cascade Propagation approach re-stratifies HNSCC patients. Methods, 2016. 111:72-79.

This research article is part of a special issue: Big Data Bioinformatics.

Liu, K. and Chyr, J. contributed equally to the work.

2 Immune signaling-based Cascade Propagation approach re-stratifies HNSCC patients

2.1 Abstract

The availability of high-throughput genomic assays and rich electronic medical records allows for identification of cancer subtypes with greater accuracy and resolution. Unfortunately, the integration of multiplatform, heterogeneous, and high-dimensional data remains an enormous challenge in bioinformatics research. Previous methods have been developed for patient stratification; however, these approaches did not incorporate prior knowledge and offer limited biological insight. New computational methods are needed to better utilize multiple types of information to identify clinically meaningful subtypes. Recent studies have shown that many immune functional genes are associated with cancer progression, recurrence, and prognosis in head and neck squamous cell carcinoma (HNSCC). Therefore, we developed a novel immune signaling-based Cascade Propagation (CasP) subtyping approach to stratify HNSCC patients. Unlike previous stratification methods that use only patient genomic data, our approach makes use of prior biological information such as immune signaling and protein-protein interactions, as well as patient survival information. CasP is a multi-step stratification procedure composed of a dynamic network tree cutting step followed by a mutational stratification step. Using our approach, HNSCC patients were stratified into clinically relevant subgroups with different survival outcomes and distinct immunogenic features. We found that the good outcome of a subgroup of HNSCC patients was due to an enhanced immune response. The gene sets were characterized by a significant activation of T cell receptor signaling pathways, in addition to other important cancer-related pathways such as PI3K and JAK/STAT signaling pathways. Further stratification of patients based on somatic mutation profiles detected three survival-distinct subnetworks. Our newly developed CasP subtyping approach allowed us to integrate multiple data types and identify clinically relevant subtypes of HNSCC patients.

2.2 Introduction

2.2.1 Head and neck squamous cell carcinoma

Head and neck squamous cell carcinoma (HNSCC) is a group of biologically similar cancers that develop from the mucosal lining of the upper aerodigestive tract [46, 47]. It is the sixth most common malignancy worldwide, with more than 550,000 cases annually

[11, 13]. The five-year survival rate of patients with HNSCC is about 60% and has not markedly improved over the past few decades [10, 11]. The severe heterogeneity among HNSCC patients makes it difficult to detect useful biomarkers for clinical outcome prediction and treatment [48,

49]. To better understand HNSCC development and recurrence, new computational modeling is needed to stratify patients into biologically meaningful subtypes with different clinical outcomes.

However, the heterogeneity of data sources and the high dimensionality of big data make such data difficult to work with [50].

2.2.2 Big data analysis and patient stratification

Various genomic and second-generation sequencing technologies have generated large volumes of heterogeneous data. These data are available from The Cancer Genome Atlas (TCGA),

Therapeutically Applicable Research to Generate Effective Treatments (TARGET), The Cancer

Genome Characterization Initiative (CGCI), and International Cancer Genome Consortium (ICGC) consortiums [51, 52]. The number of samples and sequencing reads stored in the Gene Expression

Omnibus (GEO) and Sequence Read Archive (SRA) databases increased exponentially in the past few years. As of March 2016, GEO contained 1,768,216 samples and SRA had 1,352,046 samples of sequencing data. Different omics data, such as transcriptome data, whole genome sequencing data, SNP array data, DNA methylation data, and reverse phase protein array data, are generated by different labs on different platforms. Mining this heterogeneity and achieving dimension reduction are the central challenges of big data analysis.

To address both the challenges of big data and the challenge of patient stratification, effective integrative methods are urgently required to fully utilize publicly available patient data.

Advancement in big data analysis will allow us to properly stratify patients and guide treatment

and disease subtyping, as well as reveal disease mechanism and discover new targets. The standard

approach for integrating multiple cancer genomic datasets, such as Constructing Optimal Cluster

Architecture or “COCA”, is to perform cluster analysis on each cancer data type individually and

then integrate the diverse platform-specific cluster assignments to group tumors into subtypes that

share features across multiple datasets [53]. Unfortunately, this approach is not bi-clustering,

which means that it does not cluster samples and signatures to patterns simultaneously [54, 55].

Another recent clustering approach, called integrative cluster (iCluster), can capture the

associations among different data types and variance-covariance structures within data types in one

model [56, 57]. However, despite various studies that have used genomic data to stratify cancer patients, most did not take into account prior knowledge and clinical phenotypes in their patient stratification and some of the identified pathways were not druggable [58-60].

2.2.3 Cancer immune system

Accumulating evidence indicates the importance of the immune system in controlling the progression of tumors. The immune system has been recognized as an extrinsic tumor-suppressor that can detect and destroy abnormal cells and prevent the development of HNSCC [61-64].

Immune cells and other normal cells make up the HNSCC microenvironment and they can communicate with cancer cells and affect their development and progression [14, 15]. The difference in immune responses between patients may explain the mechanism of cancer resistance and progression, and it may provide new treatment strategies.

2.2.4 Environmental factors in HNSCC

The etiology and pathogenesis of HNSCC are also influenced by environmental factors.

Although consumption of alcohol and tobacco are known to be primary risk factors for HNSCC

[16, 17], recent epidemiological studies have found that infection with human papillomaviruses (HPVs) is also an important risk factor for HNSCC. Oncogenic HPV types, especially HPV-16 and HPV-18, have been found to be the cause of up to 70% of HNSCC cases [65-67]. We further

investigated the clinical characteristics of our subtypes.

2.2.5 EGFR signaling pathway in HNSCC

EGFR overexpression has been associated with poor survival of HNSCC patients;

however, EGFR inhibition as a monotherapy has not been very successful in the past [68-70]. Given

the cytotoxic and non-specific nature of standard cancer therapies and the limited efficacies of

EGFR treatments, there is a tremendous need for development of effective targeted therapies with

minimal negative effects on the quality of life of individuals. The incorporation of important

signaling information in our new CasP stratification approach will offer more insights and

generate biologically interpretable results that can explain the mechanisms of HNSCC progression.

2.2.6 Innovation

We developed a novel Cascade Propagation (CasP) subtyping approach to investigate co-

dependent immune signaling pathways involved in HNSCC. CasP is a two-stage subtyping

approach with a dynamic network tree cutting step followed by a somatic mutational stratification

step. Distinct from molecular-based approaches, the dynamic network tree cutting method uses

signaling interaction information to explain immune responses within HNSCC patients. Because

somatic mutation profiles of cancers can also stratify cancer patients, we employed a non-negative

matrix factorization-based clustering to further stratify HNSCC patients based on their somatic

mutation profiles [60, 71]. Network propagation was applied to smooth the network profile for each patient since mutations were not frequent. Ingenuity Pathway Analysis (or IPA) was used to identify enriched pathways in the subtypes of HNSCC patients.

2.3 Experimental Design and Methods

2.3.1 Patient clinical and genomic data

Clinical data were downloaded from The Cancer Genome Atlas (TCGA) data portal in

June 2014. Survival time for each patient is determined by the date of death or last time of visit.

Only 377 patients had survival information. The HPV status of this cohort was also extracted from

TCGA clinical dataset for further survival analysis. Level 3 normalized RNA-seq data and level 2

somatic mutation data were also obtained from TCGA. Only RNA-seq data generated using the

IlluminaHiSeq_RNAseqV2 platform was used. Based on clinical data, we retained 377 patients

with 19,978 genes from the cohort. For somatic mutation data, we only used data generated by the

Illumina GA platform. Screening the mutation data, we constructed a binary matrix with rows

representing genes and columns representing the bit vectors for each patient mutation status.

Excluding the patients with fewer than 10 mutations, we obtained 369 patients with 16,610

corresponding mutated genes.
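The construction of the binary mutation matrix can be sketched as follows. This is a minimal illustration, not the pipeline used in this work: the input format (a list of patient and gene pairs), the function name `build_mutation_matrix`, and the choice to count distinct mutated genes per patient when applying the threshold of 10 are all assumptions made for the example.

```python
import numpy as np

def build_mutation_matrix(mutations, min_mutations=10):
    """Build a binary gene-by-patient mutation matrix.

    mutations: iterable of (patient_id, gene) pairs (hypothetical input format).
    Patients with fewer than `min_mutations` distinct mutated genes are dropped
    (assumption: the threshold is applied to distinct genes).
    Returns (matrix, genes, patients), with rows = genes and columns = patients.
    """
    per_patient = {}
    for patient, gene in mutations:
        per_patient.setdefault(patient, set()).add(gene)

    # Keep only patients that pass the mutation-count filter.
    kept = sorted(p for p, g in per_patient.items() if len(g) >= min_mutations)

    # Genes mutated in at least one retained patient.
    genes = sorted(set().union(*(per_patient[p] for p in kept))) if kept else []
    gene_index = {g: i for i, g in enumerate(genes)}

    matrix = np.zeros((len(genes), len(kept)), dtype=np.uint8)
    for j, patient in enumerate(kept):
        for gene in per_patient[patient]:
            matrix[gene_index[gene], j] = 1
    return matrix, genes, kept
```

Each column of the returned matrix is the bit vector of one patient, so column sums give the number of mutated genes per retained patient.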

2.3.2 Immune signaling interaction database

We used the publicly available InnateDB database (http://www.innatedb.com, downloaded in May 2014) for immune protein interactions. InnateDB is a knowledge resource for innate

immunity which contains more than 196,000 experimentally validated molecular interactions in

human, mouse, and bovine samples. Considering only uniquely validated interactions for humans,

we have identified 11,378 immune-related protein-protein interactions (PPIs) among 1,043 proteins.

2.3.3 Statistical analysis of gene interaction on survival

To associate gene expression level with patient survival status, we fitted a Cox proportional

hazards regression model using the R survival package by Fox J, et al. [72]. A likelihood-ratio test

and corresponding p-value were calculated to evaluate the association between each gene and survival

time. To convert gene-level expression values to interaction-level, or ‘network-level’, only protein

coding genes that can be mapped to InnateDB were considered. The PPIs in the high-quality

immune signaling network were converted to gene-gene interactions. We defined the expression

values of gene-gene interaction across patient samples as the combination of the expression levels

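of a pair of genes and the corresponding p-value in survival analysis. For an interaction $g_i \leftrightarrow g_j$, its expression value for patient $k$ is defined as:

\[
E_{g_i \leftrightarrow g_j,\,k} = \frac{\log(pvalue_{g_i})}{\log(pvalue_{g_i}) + \log(pvalue_{g_j})}\, E_{g_i,k} + \frac{\log(pvalue_{g_j})}{\log(pvalue_{g_i}) + \log(pvalue_{g_j})}\, E_{g_j,k}
\]

where $pvalue_{g_i}$ is the statistical p-value of the Cox regression between gene $g_i$ and patient survival time, and $E_{g_i,k}$ and $E_{g_j,k}$ represent the gene expression levels of genes $g_i$ and $g_j$ in patient $k$.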
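As an illustration of how gene-level values can be converted to interaction-level values, the sketch below assumes the interaction value is a weighted average of the two genes' expression levels, with each weight proportional to the log of that gene's Cox regression p-value. The function name `interaction_expression` and the requirement that both p-values lie strictly between 0 and 1 are assumptions of this example.

```python
import numpy as np

def interaction_expression(expr_i, expr_j, p_i, p_j):
    """Interaction-level expression for a gene pair across patients.

    expr_i, expr_j: expression of genes g_i and g_j, one value per patient.
    p_i, p_j: Cox regression p-values of the two genes (assumed 0 < p < 1,
    so both logs are negative and the weights are positive and sum to 1).
    """
    log_i, log_j = np.log(p_i), np.log(p_j)
    total = log_i + log_j
    w_i, w_j = log_i / total, log_j / total
    return w_i * np.asarray(expr_i, dtype=float) + w_j * np.asarray(expr_j, dtype=float)
```

Under this weighting, the gene with the smaller p-value (the stronger survival association) contributes more to the interaction value.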

2.3.4 Clustering patients by dynamic network tree cutting

We developed a dynamic network tree cutting approach based on the Dynamic Tree Cut algorithm [73] to separate immune signaling interactions into functional networks. The Dynamic Tree Cut algorithm cuts a tree structure into branches [73]. Each branch contains a number of gene-gene interactions (corresponding to PPIs after mapping genes to proteins). The roles of gene-gene interactions in both the classification of patients and the regression of patients’ survival time were evaluated for every branch. Patients in each branch were clustered into groups based on the expression values of the gene-gene interactions. More specifically, we used the Dynamic Tree Cut algorithm to cut the clustering tree of patients into three groups. If the number of samples in a group was lower than a threshold of 30, that group was combined with another group to generate a total of two groups. Otherwise, we iteratively took one branch as one group and the other two branches as another group. The two groups were then assigned a network status: the patient group with higher average expression values was called Network-positive and the one with lower average expression values was called Network-negative. Next, the performance of each branch in survival analysis was evaluated by a Cox proportional-hazards regression model between patients’ network statuses and patients’ survival time.
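The grouping and status assignment can be sketched as below. The tree cut itself is not implemented here: the three-group labels are assumed to come from a Dynamic Tree Cut of the patient clustering tree. The function names, and the rule that an undersized group is never left on its own (so it only appears merged with one of the other groups), are assumptions of this example, since the text does not specify which group a small group is merged into.

```python
import numpy as np

def one_vs_rest_splits(labels, min_size=30):
    """Candidate two-group splits from a three-group tree cut.

    labels: one group label per patient (assumed output of a tree cut).
    Returns boolean masks (True = singled-out group). A group smaller than
    `min_size` is never singled out, so it is always merged with another group.
    """
    labels = np.asarray(labels)
    splits = []
    for group in np.unique(labels):
        mask = labels == group
        if mask.sum() >= min_size and (~mask).sum() >= min_size:
            splits.append(mask)
    return splits

def network_status(interaction_expr, mask):
    """Label the two patient groups by average interaction expression.

    interaction_expr: interactions-by-patients matrix for one branch.
    The group with the higher mean is Network-positive, the other Network-negative.
    """
    per_patient = np.asarray(interaction_expr, dtype=float).mean(axis=0)
    mean_in, mean_out = per_patient[mask].mean(), per_patient[~mask].mean()
    positive = mask if mean_in >= mean_out else ~mask
    return np.where(positive, "Network-positive", "Network-negative")
```

Each candidate split would then be scored with a Cox proportional-hazards model on the resulting statuses, which is outside the scope of this sketch.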

2.3.5 Protein-protein interaction (PPI) information

Human PPIs were downloaded from the BioGrid Database in May 2014. BioGrid contains

manually curated PPIs from literature. The PPI data set used in this study had 137,506 protein

interactions among 13,588 unique proteins.

2.3.6 Smoothing somatic mutations network

Somatic mutation data are binary, which distinguishes them from gene expression data. The

mutation profiles were mapped to protein interaction networks. Because of the sparse nature of

mutations, network propagation was applied to smooth the mutation profile for each patient [74].

The mutated genes were the source nodes in a gene-gene interaction network, and the mutation information was propagated to neighboring genes along the edges of the network. Each node in the network pumped flow to its neighbors while simultaneously receiving flow from them. The propagation function is defined as:

\[
S_{t+1} = \alpha \times S_t \times W + (1 - \alpha) \times Y,
\]

where $S_t$ is the patient-by-gene matrix at iteration $t$, $W$ is the adjacency matrix of the gene interaction network, and $Y$ is a patient-by-gene matrix representing the initial information. The parameter $\alpha$, which weighs the relative contribution of neighborhoods, was set to 0.5. The propagation process was run iteratively until the mean square deviation between two consecutive steps was less than $1 \times 10^{-4}$. The smoothed mutation profiles were normalized by quartile method and used as the input matrix to cluster patients into a predefined number of subnetworks.
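A minimal sketch of the propagation loop follows, assuming the update S ← α·S·W + (1 − α)·Y with α = 0.5 and a stopping threshold of 1e-4. Two details are assumptions of this example rather than statements about the original implementation: the adjacency matrix is row-normalized by node degree before propagation, and the "mean square deviation" is taken as the mean of squared element-wise differences between consecutive iterations.

```python
import numpy as np

def propagate(Y, W, alpha=0.5, tol=1e-4, max_iter=1000):
    """Smooth binary mutation profiles over a gene interaction network.

    Y: patient-by-gene initial matrix (1 = mutated, 0 = not mutated).
    W: gene-by-gene adjacency matrix of the interaction network.
    """
    Y = np.asarray(Y, dtype=float)
    W = np.asarray(W, dtype=float)

    # Assumption: normalize each row by node degree so a gene spreads a
    # total weight of 1 to its neighbors; isolated genes keep a zero row.
    degree = W.sum(axis=1, keepdims=True)
    W_norm = np.divide(W, degree, out=np.zeros_like(W), where=degree > 0)

    S = Y.copy()
    for _ in range(max_iter):
        S_next = alpha * S @ W_norm + (1 - alpha) * Y
        # Stop when consecutive iterations differ by less than the tolerance.
        if np.mean((S_next - S) ** 2) < tol:
            return S_next
        S = S_next
    return S
```

With α = 0.5, half of each gene's signal is retained from the original profile at every step, so mutated genes remain the strongest nodes while their neighbors receive smaller, non-zero scores.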

2.3.7 Non-negative matrix factorization (NMF)-based clustering

The NMF method decomposed the smoothed patient-by-gene matrix into two matrices,

called “subtype prototypes matrix” and “mutation prototypes matrix” [71]. The consensus

clustering method was employed to get robust results by randomly selecting ninety percent of the patients from the entire dataset for classification. This was repeated 1000 times and the results were put into a co-occurrence matrix. Each number in the matrix indicated the frequency of a pair of patients belonging to the same cluster. Finally, the patients were stratified into different subtypes by

using the complete linkage hierarchical clustering approach.
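The consensus step can be illustrated with the sketch below, which builds the co-occurrence matrix from repeated NMF runs on random 90% subsamples. It is a simplified stand-in, not the implementation used here: the NMF is a basic multiplicative-update version, patients are assigned to the factor with the largest loading, and the function names and default iteration count are assumptions of the example.

```python
import numpy as np

def nmf(X, k, n_iter=200, seed=0):
    """Basic multiplicative-update NMF: X (non-negative) is approximated by W @ H."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    W = rng.random((n, k)) + 1e-3
    H = rng.random((k, m)) + 1e-3
    eps = 1e-9  # avoids division by zero
    for _ in range(n_iter):
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H

def consensus_matrix(X, k, n_runs=1000, frac=0.9, seed=0):
    """Co-occurrence frequencies from NMF clustering of random patient subsamples.

    X: patient-by-gene smoothed mutation matrix (non-negative).
    Entry (a, b) is the fraction of runs containing both patients a and b
    in which they were assigned to the same cluster.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    together = np.zeros((n, n))
    sampled = np.zeros((n, n))
    size = int(round(frac * n))
    for run in range(n_runs):
        idx = rng.choice(n, size=size, replace=False)
        W, _ = nmf(X[idx], k, seed=run)
        labels = W.argmax(axis=1)  # dominant factor per sampled patient
        same = (labels[:, None] == labels[None, :]).astype(float)
        together[np.ix_(idx, idx)] += same
        sampled[np.ix_(idx, idx)] += 1.0
    return np.divide(together, sampled, out=np.zeros_like(together), where=sampled > 0)
```

The final grouping could then be obtained by complete-linkage hierarchical clustering on one minus this matrix, for example with `scipy.cluster.hierarchy.linkage(..., method="complete")`.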

2.4 Results

2.4.1 Cascade Propagation Overview

Detection of signaling patterns from gene expression or somatic mutation information was

critical to stratify HNSCC patients and identify co-dependent signaling pathways important for

HNSCC progression. We developed a novel two-stage subtyping approach called Cascade

Propagation (CasP), which is composed of a dynamic network tree cutting step followed by a mutation propagation step, to stratify HNSCC patients. First, immune protein-protein interactions (PPIs) from InnateDB,

and gene expression and clinical information from TCGA are used to cluster patients into two

distinct network statuses. Then, mutation data also from TCGA and human PPIs from BioGRID

are used to further stratify the networks into subnetworks. Survival analysis showed that these

subgroups are associated with different clinical outcomes and functional analysis identified

enriched co-dependent immune pathways (Figure 1).

Figure 1. Overview of CasP subtyping approach. HNSCC patients are stratified with network-based clustering and further stratified with mutation propagation. Immune signaling information from InnateDB, and gene expression and clinical data from TCGA are used to cluster HNSCC patients into functional networks with two statuses: positive and negative. PPIs from BioGRID and mutational data from TCGA are used to further stratify HNSCC patients. Co-dependent immune signaling pathways and druggable targets are identified in the different subgroups.

2.4.2 Patient stratification and Identification of Co-dependent Pathways

The dynamic network tree cutting method uses signaling interaction information to explain

immune responses of HNSCC patients and elucidate candidate anti-tumor pathways. This novel

approach is able to cut the clustering tree of immune signaling interactions into functional networks,

or branches, using a Dynamic Tree Cut algorithm (Figure 2A). For each branch, the patients are

assigned to two groups with different status: Network-positive for higher average expression values

and Network-negative for lower average expression values (Figure 2B and C). Expression values

of gene-gene interactions across patient samples were defined as the combination of gene expression

levels of each pair of genes and the corresponding p-value in survival analysis. Finally, the performance of each branch in survival was evaluated by Cox proportional-hazards regression

model between patients’ network statuses and survival time (Figure 2D).

Branches with significantly low p-values will contain signaling interactions that are

essential for predicting patients’ survival outcomes. The top ranked branches with the lowest Cox

regression p-values are Branch 22, Branch 11, and Branch 14 (Table 1). The associated coefficients

of hazards ratios of these branches indicate that there is also a survival difference between the

Network-positive and Network-negative groups. The two network groups of HNSCC patients in

Branch 14 and the corresponding survival curve are shown in Figure 3A and B.

Figure 2. The Dynamic Network Tree Cutting Approach. (A) A dynamic tree cut algorithm is used to group immune signaling interactions into functional networks, termed branches. (B) In each branch, the patients are assigned into two groups with different status based on average expression values. (C) The patient status with higher average expression values is marked as Network-positive and the one with lower average expression values is marked as Network-negative. (D) Finally, the performance of each branch is evaluated by Cox proportional-hazards regression model.

Branch   Number of genes   Cox regression P value   Coefficient of hazard ratios
22       49                0.000248                 0.3987
11       91                0.000507                 0.3715
14       80                0.000552                 0.5155

Table 1. Top ranking branches based on Cox regression p-values. Branches are ranked based on Cox regression p-value. The top three branches with the lowest Cox regression P values are listed. The number of genes in the branches and the coefficient of hazard ratios between the Network-positive and Network-negative groups in each branch are shown.

Figure 3. Branch 14 network and survival curve. (A) All HNSCC patients in Branch 14 are clustered into two groups based on average expression values. The groups are assigned a network status: Network-positive (Red, +) and Network-negative (Green, -). n=304 (B) The corresponding survival curve shows the survival difference between the Network-positive and the Network-negative groups in Branch 14. The p-value is <0.05 and the hazard ratio is 0.515.

2.4.3 Enriched Immune Signaling Pathway

Ingenuity Pathway Analysis (IPA) was used to identify enriched pathways for high ranking branches. Comparing significant p-values of the pathways in the top three branches, the most enriched pathway was the T cell receptor signaling pathway. There were 20 genes involved in the T cell receptor signaling pathway in Branch 14, 10 in Branch 22, and 14 in Branch 11. HPV infection is increasingly identified as an important causative factor for HNSCC [65-67]. Patients with HPV+ HNSCC have a significantly better prognosis than HPV- patients [75, 76]. The Chi-squared test indicates that HPV+ patients were enriched in the Network-positive group in the top three branches (Figure 4). The Network-positive patients in the top three branches also had a more favorable clinical outcome. Further analysis of Branch 14 shows that more than half of the HPV+ patients were assigned to Network-positive status (Table 2). Functional analysis of the genes in Branch 14 revealed important cancer related pathways, such as the PI3K signaling pathway (p-value = 3.33E-13) and the JAK/STAT signaling pathway (p-value = 3.93E-11), which are already reported in the literature to be relevant to cancers [77-81].

[Figure 4: bar chart. Y-axis: percentage of patients (0% to 100%). X-axis: Branch 22, Branch 11, Branch 14, Total. Legend: HPV-, HPV+.]

Figure 4. HPV status in Network-positive group. Percentage of HPV+ patients in the Network-positive groups of the top three branches are shown along with the total percentage of HPV+ patients in the cohort. The red color indicates the percentage of HPV+ patients and the blue color represents the percentage of HPV- patients in the Network-positive groups.

         Network positive   Network negative   Total
HPV+     35                 34                 69
HPV-     78                 227                305
Total    113                261                374

Table 2. Patient HPV status in Network-positive and Network-negative subgroups. Patients’ HPV statuses in the two networks in Branch 14 are shown. The values indicate the number of patients in each category.
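The association between HPV status and network status in Branch 14 can be checked with a Pearson chi-squared test on the 2×2 counts of Table 2. The sketch below is an illustration only; it uses no continuity correction, which is an assumption, since the text does not state which variant of the test was applied. The function name `chi_squared_2x2` is hypothetical.

```python
import math

def chi_squared_2x2(table):
    """Pearson chi-squared test (1 degree of freedom, no continuity correction).

    table: [[a, b], [c, d]] observed counts.
    Returns (statistic, p_value).
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)

    statistic = 0.0
    for i, row_total in enumerate(row_totals):
        for j, col_total in enumerate(col_totals):
            expected = row_total * col_total / n
            statistic += (table[i][j] - expected) ** 2 / expected

    # Survival function of the chi-squared distribution with 1 d.f.
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value

# Rows: HPV+, HPV-; columns: Network-positive, Network-negative (Table 2).
stat, p_value = chi_squared_2x2([[35, 34], [78, 227]])
```

A statistic above 3.84, the 5% critical value for one degree of freedom, indicates that HPV status and network status are not independent in this table.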

2.4.4 Somatic Mutation Propagation for Further stratification

Some cancers have been successfully stratified based on somatic mutation profiles [60]. In

Branch 14, 168 patients had mutations in the PI3K signaling pathway, and 56 patients and 69 patients had mutations in the JAK/STAT signaling pathway and T cell receptor signaling pathway,

respectively. Mutational status can provide more information to stratify HNSCC patients into

different subtypes. Thus, we conducted somatic mutation analysis to further stratify HNSCC patients using a non-negative matrix factorization (NMF) approach. Briefly, this stratification process consisted of three steps which are highlighted in Figure 5. First, a somatic mutation network

was generated by mapping mutated genes to a protein interaction network. Approximately 295 patients (80%) had ≤200 gene mutations. Because mutations are not frequent, network propagation was applied to smooth the mutation profile for each patient. Then, using the NMF algorithm, patients were clustered into subtypes according to the similarity of their smoothed mutation profile.

Lastly, the association between the subtypes and clinical data was evaluated using Cox proportional-hazards regression models.

Figure 5. Overview of somatic mutation analysis. A smoothed somatic mutation network is generated using network propagation and HNSCC patients are further stratified into three subnetworks using NMF-based approach. The survival outcomes of the three subnetworks are evaluated with Cox proportional-hazards regression model.

Network propagation was applied to Network-positive and Network-negative groups to further stratify them into 2 to 5 subnetworks. The p-value from the Cox proportional hazards regression model was used to determine biologically meaningful subtypes. Branch 14’s Network-positive group was further stratified into three distinct subnetworks with different survival distributions (p-value = 0.029), whereas the Network-negative group did not further stratify into subnetworks with different survival outcomes (Figure 6).

Figure 6. Cox survival analysis of subnetworks. The Network-positive group of Branch 14 was further stratified into three subnetworks using mutation information. The survival curve, p-value, and hazard ratio are shown.

2.4.5 Distinct clinical characteristics for Subnetworks

The three subnetworks have distinct somatic mutations and different survival distribution.

Subnetwork 1 and Subnetwork 3 had a similar number of patients (20 and 21, respectively); however, patients in Subnetwork 1 had a better survival outcome and those in Subnetwork 3 had a poor outcome. Clinical analysis of these patients showed that there were more patients in stage 1 and stage 2 in Subnetwork 1 than in Subnetwork 3. This suggests that somatic mutation information can stratify HNSCC patients and that the mutation profiles can distinguish between early and late stage patients and may also define prognosis. HPV+ status was equally distributed in the three subtypes.

Smoking was still a high-risk factor associated with patient survival, and there were only six

smokers in the good outcome Subnetwork 1 compared to 14 smokers in poor outcome Subnetwork

3. The effect of alcohol was not significant. More patients in Subnetwork 3 were undergoing

radiation or drug therapy (Table 3).

                  Subnetwork 1   Subnetwork 3
Stage I           1              0
Stage II          8              5
Stage III         2              4
Stage IVA         11             16
Stage IVB         1              2
HPV+              7              8
HPV-              16             19
Smoker            6              14
Reformed smoker   11             8
Non-smoker        7              5
Alcohol           17             20
No alcohol        7              7
Chemotherapy      6              9
Radiotherapy      11             16

Table 3. Clinical characteristics of Subnetworks. Clinical data of patients in Subnetwork 1 and 3 are presented. Patient tumor stage, HPV status, smoking status, alcohol consumption status, and tumor treatment information are listed.

2.4.6 Network analysis of Mutation Subnetworks

The subnetworks generated after somatic mutation propagation and NMF clustering further stratified HNSCC patients into distinct groups. Using Student’s t-test, three sets of significantly mutated genes were identified (Table 4). To construct a fully connected subnetwork, other related signaling pathways were investigated. Five connecting genes for Subnetwork 1, two for

Subnetwork 2, and three for Subnetwork 3 were used to link the genes in a connected network

(Figure 7). Biological functional analysis of Subnetwork 1 showed that approximately 50% of identified genes belonged to Toll-like receptor (TLR)-associated signaling pathways, including

CD14, IRAK1, IRAK3, IRF8, LY86, LY96, TLR2, TLR3, TLR4, TLR9, and MYD88 [82, 83].

TLRs are an important family of evolutionarily conserved pattern recognition receptors that play an essential role in the innate immune response and the regulation of adaptive immunity [84]. Currently there are at least 11 functional TLRs identified in humans, and activation of TLRs often indicates secretion of inflammatory cytokines, activation of adaptive immunity, and maturation of dendritic cells [85-87]. The patients in Subnetwork 3 had poor survival, and the identified genes were associated with adaptive immune pathways; they included BLNK, BTK, CBLB, CD3G, CD79A, CD81, ITK, KIT, LAT, LCK, LCP2 (SLP-76), PLCγ1, SOS1, SRC, and ZAP70. Many of these genes are involved in T cell receptor (TCR) signaling [88-90]. CD81 is a transmembrane protein that is highly expressed in all stages of T cell development [91]. Members of the Src family of protein tyrosine kinases, such as

SRC and LCK, are responsible for T-cell activation and development [92]. Some genes are related to B cell receptor (BCR) signaling, such as BLNK, BTK, CD79A, and PLCγ1 [93, 94].

Subnetwork     Total   Genes
Subnetwork 1   20      CD14, EGFR, ERBB2, FGR, FGF2, FGFR1, IRAK1, IRAK3, IRF8, ISG15, LY86, LY96, MYD88, SRC, STAT3, SYK, TLR2, TLR3, TLR4, TLR9
Subnetwork 2   12      BTK, CBLB, DHX58, ERBB3, IRF8, LY86, LY96, NCF2, SOS1, TLR4, TRAF6, UBC
Subnetwork 3   21      AP1M1, AP1S2, BLNK, BTK, CBLB, CD3G, CD79A, CD81, EGFR, ERBB3, ITK, KIT, LAT, LCK, LCP2, PLCG1, SOS1, SRC, TEC, UBC, ZAP70

Table 4. Identified genes in each subnetwork. Patients were further stratified into three subnetworks using mutational data. Significantly mutated genes and connecting genes are listed. Connecting genes are genes that link the other genes in a coherent network.

Figure 7. Fully connected subnetworks after further stratification with mutational data. The nodes in yellow are mutated genes in each subnetwork. The nodes in green are linking genes. The complete networks for Subnetwork 1 and 3 are shown.

2.5 Discussion

The major issues in current therapies for HNSCC are disease recurrence and resistance to therapy, and current treatment regimens including surgery, radiation, and chemotherapy often lead to excess morbidity and reduced quality of life for patients. The recent advances in genomics, bioinformatics, and systems biology methods give us the opportunity to identify molecular alterations and allow us to explore the abnormalities of cancers based on large data sets from

TCGA. The challenge of integrating heterogeneous and high dimensional data has limited the potential of big data. Consequently, we have developed a unique and novel patient stratification method. Our approach is an immune CasP subtyping procedure, which is composed of a dynamic network tree cutting approach followed by an NMF mutation-based stratification. Unlike previous stratification methods, CasP utilizes not only genomic data, but also prior signaling and interaction information. Thus, our approach offers more insights and biologically interpretable results. HNSCC patients were stratified into clinically meaningful subtypes using multiple data types, and co-dependent signaling pathways were identified. Network analysis showed that gene sets are characterized by a significant activation of T cell receptor signaling pathways, as well as other important cancer related pathways such as PI3K and JAK/STAT. Differences in gene expression were able to separate HNSCC patients into two distinct groups. Identification of immune signaling networks that play a role in HNSCC patient survival would not be possible without the development of novel subtyping approaches.

Patient stratification using mutation data has been a challenge in the past due to the sparse nature of mutations. However, mutational information can provide more information to stratify patients into different subtypes. To bypass this challenge, mutation profiles were mapped to protein interaction networks and network propagation was applied to smooth the mutation profile. Further stratification with mutation data identified three subnetworks with distinct molecular profiles and different survival outcomes. The patients in Subnetwork 1 had mutations in genes associated with

TLR signaling pathways whereas those in Subnetwork 3 had mutations in genes associated with T cell receptor and B cell receptor signaling. Since TLRs primarily recognize pathogen-associated molecular patterns, they play a critical role in innate immune responses, which are not as important for tumor cell recognition and targeting. Therefore, these patients have relatively better survival. On the other hand, patients in Subnetwork 3 have poorer survival. These patients have mutations in the T cell and B cell receptor signaling pathways, which may affect T cell or B cell activation. This suggests that a functional adaptive immune system may be important for better survival outcomes for HNSCC patients. Many co-stimulatory receptors, such as CD28 and CD45, were also mutated and these regulate TCR activation. Elevated expression of EGFR was seen in over 67% of HNSCC patients and it was correlated with poor prognosis. Because of EGFR’s critical role in survival and proliferation, it has been a commonly used target of anti-cancer treatment. Our novel

CasP subtyping approach allows us to stratify HNSCC patients into subgroups using multiple types of data. Our method can facilitate the identification of co-dependent immune signaling pathways important for HNSCC progression and guide the discovery of relevant druggable targets.

CHAPTER III:

SNP VARIANT REGULATES SOX2 MODULATION OF VDAC3 IN LSCC PATIENTS

Chapter III contains text and figures from Chyr, J., Guo, D., and Zhou, X. LSCC SNP variant regulates SOX2 modulation of VDAC3. Oncotarget, 2018. 9(32):22340-22352.

3 SNP variant regulates SOX2 modulation of VDAC3 in LSCC patients

3.1 Abstract

Lung squamous cell carcinoma (LSCC) is a genomically complex malignancy with no

effective treatments. Recent studies have found a large number of DNA alterations such as SOX2

amplification in LSCC patients. As a stem cell transcription factor, SOX2 is important for the

maintenance of pluripotent cells and may play a role in cancer. To study the downstream

mechanisms of SOX2, we employed expression quantitative trait loci (eQTLs) technology to

investigate how the presence of SOX2 affects the expression of target genes. We discovered unique

eQTLs, such as rs798827-VDAC3 (FDR p-value = 0.0034), that are found only in SOX2-active patients and not in SOX2-inactive patients. SNP rs798827 is in strong linkage disequilibrium (r² = 1) with rs58163073, where the rs58163073 [T] allele increases the binding affinity of SOX2 and

allele [TA] decreases it. In our analysis, SOX2 silencing downregulates VDAC3 in two LSCC cell

lines. Chromatin conformation capture data indicate that this SNP is located within the same Topologically Associating Domain (TAD) as VDAC3, further suggesting SOX2’s role in the

regulation of VDAC3 through the binding of rs58163073. By first subgrouping patients based on

SOX2 activity, we made more relevant eQTL discoveries and our analysis can be applied to other

diseases.

3.2 Introduction

3.2.1 Prevalence and subtypes of lung cancer

There are two major types of lung cancer: non-small cell lung cancer and small cell lung

cancer. Non-small cell lung cancer accounts for about 85% of all lung cancer cases [18, 95]. It

includes three subtypes: adenocarcinoma (LUAD), squamous cell carcinoma (LSCC), and large

cell carcinoma. LUAD is the most common form of lung cancer, but LSCC also represents a large portion of lung cancers with over 60,000 new cases diagnosed each year. The five-year survival

rate for LSCC is only 12% [11, 18, 96, 97]. Nearly 87% of all lung cancers are smoking-related; however, genetic defects and other environmental factors may also play a role.

LUAD and LSCC are histologically different diseases. LUAD starts in glandular cells,

which secrete mucus, and LSCC begins in squamous cells. LUAD tends to develop in alveoli and is located along the outer edges of the lungs. Squamous cells that develop into LSCC tend to line the inner airways, and LSCC often occurs in the central part of the lung or in one of the main airways. Due to the location of LSCC, common symptoms are coughing, trouble breathing, chest pain, and blood in the sputum, whereas LUAD patients may not exhibit any symptoms.

Lung cancer patients are often treated with surgery, general chemotherapy, radiation

therapy, or a combination of these treatments. Recently, immunotherapies have been developed for both LUAD and LSCC. These drugs work by targeting and blocking immune checkpoint

mechanisms so that the immune system can target cancer cells. For LUAD, there are also several

FDA-approved targeted therapies that target the cancer cells directly. However, there are no

targeted treatments for LSCC [98-100]. The higher molecular complexity of LSCC has hampered

our understanding and the discovery of druggable targets for LSCC for many years.

3.2.2 SOX2 amplification in lung cancer

Recent studies, including the comprehensive analysis by The Cancer Genome Atlas

(TCGA) network, have found a large number of DNA alterations in LSCC [18]. The most notable alteration

is amplification of chromosome 3q, which contains the SOX2 locus [22]. SRY (sex determining

region Y)-box 2 (SOX2) is a stem cell transcription factor that plays a role in cell self-renewal and

differentiation [23-26]. It is located on chromosome 3 and encodes a relatively small protein of only 317 amino acids. SOX2 belongs to the SOX gene family and contains three main

domains: N-terminal, high-mobility group (HMG), and transactivation domain. It functions as a

transcription factor that binds to the DNA and regulates the expression of multiple genes. Initially,

SOX2 was reported to be involved with the inhibition of neuronal differentiation. Now, SOX2 is better known as an essential embryonic stem cell gene that plays a role in cell self-renewal and

differentiation. It is also required in reprogramming human somatic cells to pluripotent stem cells

[23-26]. It has been found to be amplified and/or overexpressed in 63% of LSCC at the transcript level and 80-90% at the protein level [20-22]. SOX2 is also found to be amplified in other cancers such as adenocarcinomas, esophageal and oral SCC, and glioblastoma. SOX2 overexpression is crucial in promoting LSCC formation in vivo upon loss of Pten and Cdkn2ab [27]. SOX2 also cooperates with oncogenes such as Wnt1, c-Myc, and Notch to promote lung cancer [101]. Downregulation of SOX2 inhibited proliferation and induced apoptosis in tumor cells, which further emphasizes the major role SOX2 plays in lung cancer.

3.2.3 Mechanisms of SOX2 and SNPs

SOX2 contains a DNA-binding region and regulates the expression of many downstream genes by binding to enhancer regions and facilitating chromatin remodeling and the subsequent initiation of transcription of target genes [102-104]. Other studies have found that single nucleotide polymorphisms, or SNPs (pronounced “snips”), located within the binding regions of transcription factors can change the binding affinity of the factors and consequently the expression of their target genes [105-109]. SNPs are the most common type of genetic variation in the human genome. Our genome comprises over 3 billion base pairs, or 6 billion nucleotides.

Each SNP represents a difference in a single nucleotide. A SNP occurs almost once in every 1,000 nucleotides on average, which means there are roughly 4 to 5 million SNPs in a person's genome.

These variations may be rare or occur in many individuals. There are more than 100 million annotated SNPs in populations around the world. SNPs are inherited genetic variation and this difference in DNA sequence contributes to phenotypic variation, including disease risk.

Genome-wide association studies, or GWAS, identify SNPs that are strongly associated with a phenotype in a population. Currently, there are more than 10,000 genomic regions or loci identified by GWAS for human diseases and traits. For cancer, there are more than 262 distinct, cancer-associated genomic regions. Breast and prostate cancers have the greatest number of risk loci identified by GWAS. This is likely related to the greater statistical power due to the large sample sizes available for these two diseases. However, there are only 18 risk loci identified for lung cancer. This may be due to non-genetic risk factors in the etiology of this cancer.

When SNPs occur within a gene, they may play a direct role in disease by altering the amino acid sequence and therefore affecting the gene’s function. However, most SNPs are found

in non-coding regions, or regions outside of genes, and play an indirect role in gene expression and

regulation. SNPs found in enhancer regions may increase or decrease the binding affinity of

regulatory transcription factors such as SOX2. The increased or reduced binding of SOX2 to DNA

can affect the transcription and expression of nearby target genes.

3.2.4 eQTL analysis and LD SNPs

Expression quantitative trait loci (eQTL) analysis aims to identify genetic variants such as

SNPs that influence the expression of one or more genes. Matrix eQTL allows us to quickly and

efficiently identify local and distal eQTLs that are associated with the expression of target genes

[28]. We can understand the downstream mechanism of SOX2 by identifying eQTLs in SOX2-

active patients that are not in SOX2-inactive patients [28, 71, 110].

Linkage disequilibrium is defined as the difference between the observed frequency of a particular combination of alleles at two SNPs and the frequency expected under random association. In other words, when two neighboring alleles are in linkage disequilibrium, the haplotypes do not occur at the expected frequencies. They may occur together more frequently, as if they were physically linked. Linkage disequilibrium is determined by calculating the correlation between two SNP alleles. The normalized coefficient of linkage disequilibrium, D', is a measure of how often two alleles are coinherited. A D' of 0.80 is considered high disequilibrium. Often, r2 is also used as a measure of coinheritance. r2 is the squared correlation coefficient between the alleles at the two SNPs, and it also takes into account the frequencies of the SNP alleles.
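The relationship between D, D' and r2 can be made concrete with a small calculation. The sketch below uses the standard textbook definitions for two biallelic SNPs and takes haplotype counts as input; the function name `ld_stats`, the haplotype labels, and any example counts are illustrative assumptions and are not part of the analysis pipeline used in this study.

```python
from typing import Dict


def ld_stats(hap_counts: Dict[str, int]) -> Dict[str, float]:
    """Compute D, signed D' and r2 for two biallelic SNPs.

    hap_counts maps the four haplotypes 'AB', 'Ab', 'aB', 'ab' to counts
    (hypothetical labels: A/a are the alleles of SNP 1, B/b of SNP 2).
    """
    n = sum(hap_counts.values())
    p_ab = hap_counts["AB"] / n                        # observed A-B haplotype frequency
    p_a = (hap_counts["AB"] + hap_counts["Ab"]) / n    # frequency of allele A
    p_b = (hap_counts["AB"] + hap_counts["aB"]) / n    # frequency of allele B

    # D: observed haplotype frequency minus the frequency expected
    # if the two alleles were randomly associated.
    d = p_ab - p_a * p_b

    # D' rescales D by its maximum attainable magnitude given allele frequencies.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = d / d_max if d_max > 0 else 0.0

    # r2 is the squared correlation between the two alleles.
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    r2 = d * d / denom if denom > 0 else 0.0
    return {"D": d, "D_prime": d_prime, "r2": r2}
```

With hypothetical counts such as AB = 10, Ab = 0, aB = 40, ab = 50, D' equals 1 while r2 is only about 0.11, which illustrates why r2 is the stricter measure when allele frequencies differ between the two SNPs.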

Analysis of the flanking sequences of the eQTL SNPs or of SNPs in strong linkage disequilibrium (LD) (r2 > 0.8) can elucidate which motifs the SNPs may alter. Additional analysis of ChIP-seq and high-throughput chromatin conformation capture (Hi-C) data can further confirm the interaction of SOX2 with specific regions of the genome and thereby its modulation of target genes [31, 111].

3.3 Experimental Design and Methods

3.3.1 LSCC patient data

Patient gene expression, copy number variation, methylation, and SNP data were collected from The Cancer Genome Atlas (TCGA) database. Specifically, Illumina HiSeq RNAseqv2 was used for gene expression data, Genome Wide SNP 6.0 was used for both copy number variation and SNP genotype data, and Human Methylation 450 bioassay data set was used for methylation data. A total number of 366 LSCC patients had all three datatypes available.

3.3.2 Matrix eQTL computation

Expression quantitative trait loci (eQTL) analysis was conducted with Matrix eQTL (“Matrix eQTL: Ultra-fast eQTL analysis via large matrix operations” by Andrey Shabalin). Software and resources were obtained from http://www.bios.unc.edu/research/genomic_software/Matrix_eQTL/ . The output p-value threshold was set at 0.05. Local eQTLs (or cis-eQTLs) were defined as SNP-gene pairs located within the default distance of 1 million base pairs of each other.
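The cis definition and the underlying association test can be sketched as follows. This is a minimal illustration in Python, not the Matrix eQTL implementation (which is R software); it assumes an additive model with genotypes coded as 0/1/2 allele dosages and no covariates, and the function names and argument layout are hypothetical.

```python
import math
from typing import Sequence

CIS_DIST = 1_000_000  # cis distance in base pairs, as stated in the text


def is_cis(snp_chrom: str, snp_pos: int,
           gene_chrom: str, gene_start: int, gene_end: int,
           max_dist: int = CIS_DIST) -> bool:
    """Return True if the SNP lies within max_dist of the gene on the same chromosome."""
    if snp_chrom != gene_chrom:
        return False
    if gene_start <= snp_pos <= gene_end:
        return True
    return min(abs(snp_pos - gene_start), abs(snp_pos - gene_end)) <= max_dist


def additive_t_stat(genotypes: Sequence[int], expression: Sequence[float]) -> float:
    """t statistic for the slope of expression regressed on genotype dosage."""
    n = len(genotypes)
    mean_g = sum(genotypes) / n
    mean_e = sum(expression) / n
    sxx = sum((g - mean_g) ** 2 for g in genotypes)
    syy = sum((e - mean_e) ** 2 for e in expression)
    sxy = sum((g - mean_g) * (e - mean_e) for g, e in zip(genotypes, expression))
    if sxx == 0 or syy == 0:
        raise ValueError("genotype and expression must both vary")
    r = sxy / math.sqrt(sxx * syy)          # Pearson correlation
    if abs(r) >= 1.0:
        return math.copysign(math.inf, r)   # perfect fit
    # With no covariates, the slope test is equivalent to testing r with n - 2 df.
    return r * math.sqrt((n - 2) / (1.0 - r * r))
```

A pair would be tested only when `is_cis` is true; converting the t statistic to a p-value requires the t distribution with n - 2 degrees of freedom, which is omitted here to keep the sketch free of external dependencies.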

3.3.3 SOX2 siRNA lung cancer cell line

SOX2 silencing experimental data was obtained from NCBI GSE48871 [32]. Two LSCC cell lines (H520 and LK2) were treated with either pooled siRNA sequences targeting SOX2 or scrambled control siRNAs, and their gene expression was profiled using Illumina’s BeadChip Human HT-12v3 array. The gene expression distribution of each sample was normalized to be centered at 0.

3.3.4 Linkage Disequilibrium SNPs and binding motifs

LD SNPs were explored using HaploReg v4.1, hosted by the Broad Institute [39]. The database table is curated from the 1000 Genomes Project. Pairwise LD was calculated for all pairs of SNPs within 250 kb. An LD threshold of r2 > 0.8 was used to query variants. The 16 eQTLs were in strong LD with 263 SNPs. HaploReg v4.1 also contains information on regulatory motif changes. Only the SNP variants that overlapped the binding motif of SOX2 and changed the position weight matrix (PWM) score were considered. Under these conditions, only seven unique LD SNPs were identified.
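The filtering steps described above can be expressed as a short sketch. It assumes the LD query results have already been exported into simple records; the `LdRecord` structure, its field names, and the `sox2_altering_snps` helper are hypothetical and do not correspond to a HaploReg API.

```python
from typing import Dict, List, NamedTuple, Tuple


class LdRecord(NamedTuple):
    eqtl_snp: str                                   # SNP from the eQTL analysis
    ld_snp: str                                     # SNP in LD with the eQTL SNP
    r2: float                                       # pairwise LD (r2)
    motif_scores: Dict[str, Tuple[float, float]]    # motif -> (ref PWM score, alt PWM score)


def sox2_altering_snps(records: List[LdRecord],
                       r2_min: float = 0.8,
                       motif: str = "Sox2") -> List[str]:
    """Return unique LD SNPs in strong LD that change the PWM score of the given motif."""
    hits = set()
    for rec in records:
        if rec.r2 <= r2_min:
            continue                      # keep only strong LD (r2 > 0.8)
        scores = rec.motif_scores.get(motif)
        if scores is None:
            continue                      # SNP does not overlap the motif
        ref_score, alt_score = scores
        if ref_score != alt_score:
            hits.add(rec.ld_snp)          # the variant changes the PWM score
    return sorted(hits)
```

Collecting the hits in a set mirrors the point made in the text that the same LD SNP can be reached from more than one eQTL SNP and should be counted once.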

3.3.5 SOX2 and H3K27ac ChIP-seq in HCC95 cell line

SOX2 ChIP-seq data was obtained from NCBI GSE46837 [30]. HCC95 is an LSCC cell line with SOX2 amplification. SOX2 binding sites were detected using Model-based Analysis of ChIP-Seq (MACS) and normalized for copy number variation. There were 5371 peaks with high SOX2 interaction. H3K27ac ChIP-seq data was obtained from NCBI GSE66992 [38]. This experiment also used the HCC95 cell line, and H3K27ac binding sites were also called by MACS.

3.3.6 Lung tissue and cell line Hi-C data

Hi-C data was obtained and visualized on the 3D Genome Browser [59]. IMR90 Hi-C data was from Rao S, et al. (2014) [48], A549 cell line Hi-C data was from ENCODE Encyclopedia version 3 (2010) [29, 60], and Hi-C data for two normal lung tissue samples were from Schmitt A, et al. [49]. The Hi-C heat map resolutions are 10 kb, 40 kb, 40 kb, and 40 kb, respectively, and the maps span positions 41,000,000 to 44,000,000 on chromosome 8, where the VDAC3 gene and its associated SNPs are located.

3.4 Results

3.4.1 Overview of our workflow

We aim to understand the complex mechanisms of LSCC. To do so, we first classified LSCC patients into two groups: SOX2-active and SOX2-inactive patients. Next, Matrix eQTL analysis was performed on each group separately to identify significantly associated SNP-gene pairs. Our goal is to identify pairs that may be regulated by the transcription factor SOX2. Finally, we conducted further analysis on the binding motifs and chromatin organization of our targets. Our workflow is summarized in Figure 8.

Figure 8. Brief overview of workflow. Left: LSCC patients are first clustered into two groups, SOX2-active and SOX2-inactive, based on their SOX2 gene expression, copy number variation, and methylation. Matrix eQTL analyses are performed on each group of patients to identify group-specific SNP-gene pairs. Only the eQTL pairs with genes that are downregulated after SOX2 silencing are included in our analysis. Only the SNP-gene pairs that are unique to SOX2-active patients are considered. Right: Further analyses on SOX2 motifs, ChIP-seq binding peaks, and topologically associating domains are conducted to validate a SOX2 → SNP → Gene relationship.

Table 5. Master regulator potential

Rank  Gene  Score      Rank  Gene  Score      Rank  Gene  Score      Rank  Gene  Score
1  CD44  189699.77      26  TM4SF1  50732.38      51  RN45S  46032.00      76  LTBR  35200.76
2  FJX1  189699.77      27  TNRC18  50679.97      52  KRT17  46032.00      77  CDKN1B  35156.14
3  EHF  189699.77      28  CAV1  50679.97      53  TXNRD1  45890.67      78  RPS21  34581.60
4  NFE2L2  189699.77      29  HES1  49662.93      54  AKR1B10  45890.67      79  C22orf26  34414.60
5  APIP  189699.77      30  SCNN1A  49442.34      55  MIR23A  45414.47      80  LOC150381  34252.45
6  PDHX  189699.77      31  LOC100128675  48753.18      56  SDC1  42671.31      81  LOC730091  34169.57
7  FLJ35776  189699.77      32  ALDH3A1  48182.53      57  PLEC  42357.95      82  TBL1XR1  33990.03
8  PTHLH  189699.77      33  KRT13  47224.15      58  IRF2BP2  39823.72      83  HIST1H2BF  33899.67
9  CAT  127763.12      34  FOXE1  46683.63      59  MIR27A  39156.91      84  NAT10  33824.86
10  PKP1  97244.88      35  CLIP4  46683.63      60  LOC727677  38740.68      85  HIST1H2AD  33520.88
11  MIR205HG  76837.13      36  DDR1  46683.63      61  MIR24-2  38200.49      86  HIST1H3D  33520.88
12  SLC20A2  76837.13      37  MALAT1  46683.63      62  BCL2L1  37779.46      87  MIR2117  33520.88
13  IER2  69792.78      38  LDLRAD3  46683.63      63  BCL9L  37417.72      88  IKBKB  33248.69
14  STX10  68970.01      39  SLC47A2  46683.63      64  LOC284454  37391.04      89  HIST1H4E  33248.69
15  TRIM44  68896.29      40  LAX1  46683.63      65  MIR661  37326.36      90  EGFR  33081.68
16  KRT42P  68896.29      41  FOSL2  46683.63      66  MIR4492  37080.82      91  AKR1C2  32320.20
17  FXYD3  68896.29      42  KRT19  46032.00      67  JUP  37009.58      92  PA2G4P4  31707.47
18  NRG1  66717.41      43  KRT5  46032.00      68  TNK2  36976.19      93  MYOF  31224.56
19  PAMR1  66717.41      44  TGIF1  46032.00      69  ST6GALNAC2  36962.04      94  DDX59  31224.56
20  MIR205  60390.76      45  DLG1  46032.00      70  LOC344887  36856.98      95  TLCD1  31162.34
21  RN5-8S1  53118.22      46  SOX2  46032.00      71  PHLDA3  35943.35      96  CAV2  31115.21
22  ABTB2  53118.22      47  MIR2278  46032.00      72  ZFHX3  35838.98      97  TNFRSF1A  31095.41
23  MIR3128  52973.59      48  ZMYND8  46032.00      73  KIF13A  35588.39      98  MIR4640  31075.60
24  LAMA5  51725.02      49  DLG1-AS1  46032.00      74  MIR21  35588.39      99  ZFP36  31075.60
25  KRT15  51725.02      50  GPR87  46032.00      75  PIM3  35483.63      100  IL1F10  31075.60

Model-based analysis of regulation of gene expression (MARGE) of HCC95 H3K27ac ChIP-Seq data. Regulatory potential was calculated using the MARGE-potential function as published by Wang, S, et al. (2016), and the regulatory potential (RP) score is listed. Genes are ranked based on their RP scores, from highest to lowest. Only the top 100 unique genes are listed. SOX2 is bolded.

Figure 9. Correlation between methylation and SOX2 expression. DNA methylation β-values are shown for LSCC patients with high, medium, or low expression of SOX2. CpG methylation sites are shown in relation to the location of the SOX2 gene.

3.4.2 SOX2 is amplified in LSCC patients

Unlike in LUAD, SOX2 is often amplified in LSCC patients [18]. Model-based analysis of regulation of gene expression (MARGE) confirms the high regulatory potential of SOX2 in LSCC and suggests that it is a master regulator [112] (Table 5). The increased presence of this transcription factor may lead to differential regulation of downstream genes. To understand SOX2’s roles in LSCC, multiple types of data from TCGA were analyzed. Patients were first divided into two groups, SOX2-active and SOX2-inactive, based on the gene expression, copy number variation, and methylation of SOX2. For methylation, only the 14 CpGs located within the promoter region of SOX2 were considered. These CpGs have high variance and are highly correlated with SOX2 gene expression (Figure 9). Patients with activated SOX2 met all three of the following criteria (Figure 10A-C):

1) SOX2 expression value greater than the mean (log2 expression value > 10.87797)

2) SOX2 copy number amplified relative to normal (segment mean > 0.5)

3) SOX2 promoter region hypomethylated (methylation β-value ≤ 0.4)

Patients that did not meet all three criteria were considered SOX2-inactive. A total of 366 LSCC patients had all three data types available and were included in our study. Out of the 366 patients, 219 had high SOX2 expression, 212 had SOX2 copy number amplification, and 292 had SOX2 hypomethylation. Although many patients met more than one criterion, only the 159 patients who met all three criteria were grouped as SOX2-active, and the other 196 patients were grouped as SOX2-inactive (Figure 10D).
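The three-criterion rule can be written down directly. The sketch below uses the thresholds stated in the text; the `Sox2Profile` structure and the use of a single summary β-value for the promoter CpGs are assumptions made for illustration, since the way the 14 CpGs are combined is not specified here.

```python
from typing import NamedTuple

EXPR_MIN = 10.87797   # log2 expression must exceed the cohort mean
CN_MIN = 0.5          # copy number segment mean must exceed this value
BETA_MAX = 0.4        # promoter methylation beta-value must not exceed this value


class Sox2Profile(NamedTuple):
    log2_expression: float
    cn_segment_mean: float
    promoter_beta: float   # assumed summary of the promoter CpG beta-values


def is_sox2_active(profile: Sox2Profile) -> bool:
    """A patient is SOX2-active only if all three criteria are met."""
    return (profile.log2_expression > EXPR_MIN
            and profile.cn_segment_mean > CN_MIN
            and profile.promoter_beta <= BETA_MAX)
```

Requiring all three conditions is deliberately conservative: a patient with high expression but no amplification, for example, is placed in the SOX2-inactive group.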

Figure 10. Multiple genomic data types are used to divide LSCC patients into two groups. The highlighted bars in the histograms indicate patients with high SOX2 gene expression (A), SOX2 copy number amplification (B), and SOX2 promoter-region hypomethylation (C). Patients with SOX2 expression > 10.88, copy number segment means > 0.5, and methylation β-values ≤ 0.4 are considered SOX2-active patients. A total of 366 patients had all three data types available and were included in our analysis. (D) Of the 366 patients, 159 were considered SOX2-active; all other patients were considered SOX2-inactive.

3.4.3 Matrix eQTL identifies SNP to VDAC3 expression association

Expression quantitative trait loci (eQTL) analysis correlates SNP genotypes with gene expression variation [28]. There are different sets of eQTLs in SOX2-active patients that are not in SOX2-inactive patients. Using a highly efficient and accurate eQTL analysis tool called Matrix eQTL [28, 110], we identified 686,251 cis SNP-gene pairs in the SOX2-active patients and 688,591 pairs in SOX2-inactive patients (p < 0.05), with the SNP and gene within 1 Mb of each other. Since we are interested in the mechanism of SOX2, we focused on SNP-gene pairs that are putatively affected by SOX2 expression. Fang WT, et al. significantly knocked down SOX2 expression using siRNAs in two

LSCC cell lines with high expression levels of SOX2 (LK2 and NCI-H520) [113]. The gene expression of the SOX2-knockdown cells was profiled using an array. After normalization, we conducted differential expression analysis and found 266 unique genes downregulated in SOX2 siRNA cells compared to control cells, as shown in Figure 11A. Of the large number of SNP-gene pairs from our Matrix eQTL analysis, 8,546 SOX2-active and 8,956 SOX2-inactive pairs contained a gene that is downregulated upon silencing SOX2. Among the top results with FDR p < 0.01, only 16 and 38 cis SNP-gene pairs remain, respectively (Table 6 and Table 7). For each pair, the expression of the gene is correlated with the genotype of the SNP. For example, in patients with active SOX2, the genotype of SNP rs798827 is significantly correlated with the expression of the gene VDAC3 (FDR p-value = 0.0034), where allele G is correlated with lower expression of VDAC3 and allele T is correlated with higher expression of VDAC3. Further analysis reveals one SNP-gene pair (rs4654947-NBPF3) that overlaps between the SOX2-active and SOX2-inactive cis-Matrix eQTL results. The correlation of this SNP-gene pair may be independent of SOX2 activity.
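The FDR filter applied above can be illustrated with a short sketch. It assumes that the reported FDR values follow the Benjamini-Hochberg procedure; the function name and the example p-values are hypothetical, and in practice the adjustment is computed over all tests performed rather than over a handful of values.

```python
from typing import List, Sequence


def benjamini_hochberg(pvalues: Sequence[float]) -> List[float]:
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])   # indices by ascending p-value
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def significant_pairs(pairs: Sequence[str], pvalues: Sequence[float],
                      fdr_cutoff: float = 0.01) -> List[str]:
    """Keep the SNP-gene pairs whose adjusted p-value is below the cutoff."""
    fdr = benjamini_hochberg(pvalues)
    return [pair for pair, q in zip(pairs, fdr) if q < fdr_cutoff]
```

Because the adjustment scales each p-value by the number of tests divided by its rank, a nominal p-value near 1E-06 can correspond to an FDR near 1E-03 when hundreds of thousands of pairs are tested, which is the pattern seen in Table 6 and Table 7.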

SNP  Allele A  Allele B  Gene  P-value  FDR
rs4654947  C  T  NBPF3  4.60E-11  3.45E-07
rs9984519  C  T  IFNAR1  1.38E-07  5.59E-04
rs17420195  C  T  NBPF3  4.18E-07  1.45E-03
rs2465941  C  T  ZCCHC12  8.23E-07  2.60E-03
rs2290163  C  T  LMCD1  8.48E-07  2.67E-03
rs798827  G  T  VDAC3  1.14E-06  3.43E-03
rs4747471  A  G  MSRB2  1.22E-06  3.63E-03
rs16913776  C  G  RAB38  1.31E-06  3.83E-03
rs7624916  C  G  ARL6IP5  1.48E-06  4.27E-03
rs10968209  A  G  MOBKL2B  1.49E-06  4.27E-03
rs16850158  A  G  ALCAM  1.87E-06  5.14E-03
rs12057041  C  T  MOBKL2B  3.20E-06  7.98E-03
rs10968456  C  T  MOBKL2B  3.20E-06  7.98E-03
rs4790508  A  C  CRK  3.37E-06  8.33E-03
rs308819  A  C  RAB38  4.00E-06  9.57E-03
rs308814  C  T  RAB38  4.00E-06  9.57E-03

Table 6. eQTL results for SOX2-active patients. Significant eQTLs (FDR p-value < 0.01) for SOX2-active patients are shown.

3.4.4 SOX2 regulates VDAC3 expression

Among the 16 SNP-gene pairs in SOX2-active patients, the rs798827-VDAC3 pair, which was not found in SOX2-inactive patients, caught our interest (Table 6). Voltage-dependent anion-selective channel 3 (VDAC3) is a small integral membrane channel important for controlling the flux of metabolites between the mitochondria and the cytoplasm [114, 115]. It has previously been shown to play a role in apoptotic and oxidative stress signaling [116-118]. It was also found to be downregulated upon SOX2 silencing in two LSCC cell lines (Figure 11B).

SNP  Allele A  Allele B  Gene  P-value  FDR
rs10505902  A  G  ETNK1  2.87E-17  3.30E-13
rs16925264  A  G  ETNK1  5.06E-14  4.28E-10
rs12299764  A  T  ETNK1  5.06E-14  4.28E-10
rs16925217  A  G  ETNK1  9.57E-13  7.31E-09
rs988175  C  G  ETNK1  5.13E-10  2.80E-06
rs16925483  A  G  ETNK1  5.13E-10  2.80E-06
rs12304738  A  G  ETNK1  5.13E-10  2.80E-06
rs7972877  C  T  ETNK1  5.13E-10  2.80E-06
rs7980969  A  C  ETNK1  5.13E-10  2.80E-06
rs6491171  C  T  CDK8  8.97E-10  4.69E-06
rs11046495  A  G  ETNK1  1.38E-09  6.98E-06
rs7134724  A  G  CPM  1.79E-09  8.84E-06
rs1047290  A  T  SKAP2  2.40E-09  1.17E-05
rs6956721  A  C  SKAP2  2.40E-09  1.17E-05
rs10505886  C  T  ETNK1  1.04E-08  4.35E-05
rs759931  A  G  ETNK1  1.46E-08  5.95E-05
rs12305233  A  C  ETNK1  1.46E-08  5.95E-05
rs2251988  C  T  ETNK1  9.33E-08  3.12E-04
rs16924949  A  G  ETNK1  1.02E-07  3.36E-04
rs4654947  C  T  NBPF3  3.43E-07  9.74E-04
rs12306101  A  G  ETNK1  6.29E-07  1.66E-03
rs11046794  G  T  ETNK1  7.25E-07  1.87E-03
rs12298399  C  T  ETNK1  7.53E-07  1.94E-03
rs10279895  C  T  SKAP2  9.24E-07  2.30E-03
rs4002871  A  T  CHL1  9.52E-07  2.36E-03
rs16891267  C  T  VDAC3  1.28E-06  3.04E-03
rs11046707  C  T  ETNK1  1.57E-06  3.62E-03
rs7309425  C  G  ETNK1  1.62E-06  3.70E-03
rs11990074  C  T  VDAC3  1.77E-06  4.00E-03
rs2699843  C  T  ETNK1  1.91E-06  4.26E-03
rs10478632  A  G  ZNF608  2.03E-06  4.47E-03
rs7975461  C  T  ETNK1  3.55E-06  7.10E-03
rs4325353  A  C  ETNK1  3.94E-06  7.71E-03
rs16925588  C  T  ETNK1  4.53E-06  8.66E-03
rs1363501  A  G  GALNT10  4.79E-06  9.09E-03
rs16925627  A  G  ETNK1  5.00E-06  9.42E-03
rs17082292  C  T  CDK8  5.15E-06  9.62E-03
rs4402409  C  T  CDK8  5.15E-06  9.62E-03

Table 7. eQTL results for SOX2-inactive patients. Significant eQTLs (FDR p-value < 0.01) for SOX2-inactive patients are shown.

Figure 11. Gene expression profile of two SOX2-silenced LSCC cell lines. SOX2 was knocked down in two LSCC cell lines: H520 and LK2 and gene expression was profiled using an array. (A) Top 266 downregulated genes are shown in a heat map. Downregulated genes are defined as genes with combined value differences between SOX2 siRNA and control > 1.0. Yellow color represents higher expression and dark blue color represents lower expression. (B) VDAC3’s normalized expression values are shown as an example. All samples were normalized to 0.

Not only is the expression of VDAC3 significantly higher in SOX2-active patients than in SOX2-inactive patients (p = 0.0031), but the expression of VDAC3 is also correlated with the genotype of SNP rs798827 in SOX2-active patients and not in SOX2-inactive patients (ANOVA t-test p < 0.05) (Figure 12A and B). The T genotype of rs798827 correlated with a significantly higher expression of VDAC3 than the G genotype. This association is seen only in SOX2-active patients and not in SOX2-inactive patients (Figure 12B). SOX2 ChIP-seq data from Watanabe H, et al. identified 5371 regions with high SOX2 interaction in an LSCC cell line (HCC95) [111]. SOX2 peaks were detected using MACS and normalized for copy number variation. Figure 12C shows the SOX2 ChIP-seq peak at the promoter region of VDAC3. ChIP-seq peaks for H3K27ac, an active enhancer mark, are also shown for the same LSCC cell line [119]. Collectively, our analysis indicates that SOX2 binds to the promoter region of VDAC3 and subsequently regulates the expression of VDAC3.

Figure 12. SOX2 regulates VDAC3 expression. (A) VDAC3 expression is significantly higher in SOX2-active patients compared to SOX2-inactive patients, t-test p=0.0031. (B) In SOX2-active patients, the SNP genotype is associated with a significant difference in VDAC3 expression, ANOVA t-test p = 6.91E-08. In SOX2-inactive patients, the difference is not present, ANOVA t-test p = 0.459. (C) SOX2 and H3K27ac ChIP-seq peaks for cell line HCC95 are shown for the promoter region of VDAC3.

3.4.5 SNPs or LD SNPs are located in SOX2 binding motifs

The HaploReg v4.1 web interface by the Broad Institute is a well-annotated database of SNPs and their linkage disequilibrium (LD) SNPs [120-122]. When a SNP is in strong LD with another SNP, the two SNPs are highly associated with one another. In other words, the alleles of a few SNPs can suggest the alleles of their LD SNPs. Analyzing the 16 target SNPs from our cis-Matrix eQTL analysis of SOX2-active patients, we identified over 350 SNPs in strong LD (r2 > 0.80) with our 16 SNPs. SOX2 is a member of the SOX HMG box family of transcription factors [123]. It can bind to specific regions of the genome at consensus binding sequences. Looking at the flanking sequences of those LD SNPs, seven unique SNPs are located in, and may alter, a SOX family binding motif (Table 8). The SNP from the rs798827-VDAC3 pair is in strong LD with SNP rs58163073. This LD SNP is actually an insertion variant in which an [A] nucleotide is inserted. The flanking sequences of LD SNP rs58163073 make up the SOX2 binding motif, and the SNP genotype modifies the position weight matrix score. The [T] allele has a stronger binding affinity for SOX2 (11.5) than the [TA] allele (10.6) (Figure 13). The [T] allele of rs58163073 is linked to the [T] allele of rs798827, which is correlated with a higher expression of VDAC3.
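To show how an insertion can lower a position weight matrix score, the sketch below scores two short sequences against a toy matrix. The matrix values, the sequences, and the function name are invented for illustration only; they are not the SOX2 PWM or the flanking sequence of rs58163073, and the resulting scores are unrelated to the 11.5 and 10.6 reported by HaploReg.

```python
from typing import Dict, List

# Hypothetical 4-position log-odds matrix (NOT the real SOX2 PWM).
TOY_PWM: List[Dict[str, float]] = [
    {"A": 1.2, "C": -1.0, "G": -1.0, "T": -0.5},
    {"A": -1.0, "C": -1.0, "G": -0.5, "T": 1.3},
    {"A": -0.8, "C": -1.0, "G": -1.0, "T": 1.4},
    {"A": -1.0, "C": -0.5, "G": 1.1, "T": -1.0},
]


def best_pwm_score(seq: str, pwm: List[Dict[str, float]]) -> float:
    """Slide the PWM along the sequence and return the highest window score."""
    width = len(pwm)
    if len(seq) < width:
        raise ValueError("sequence is shorter than the motif")
    return max(
        sum(pwm[i][seq[start + i]] for i in range(width))
        for start in range(len(seq) - width + 1)
    )


REF_SEQ = "CCATTGCC"    # hypothetical reference allele context
ALT_SEQ = "CCATATGCC"   # same context with an A inserted inside the motif

ref_score = best_pwm_score(REF_SEQ, TOY_PWM)
alt_score = best_pwm_score(ALT_SEQ, TOY_PWM)
```

In this toy case the inserted base breaks the best-matching window, so the alternative sequence scores lower than the reference, which is the same direction of effect described for the [T] and [TA] alleles.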

Figure 13. SNP rs58163073 alters the PWM score of SOX2. The flanking sequences of rs58163073 from chromosome 8 position 42904940 +/- 29 nucleotides are shown for the reference and alternative alleles. The position weight matrix (PWM) scores were obtained from HaploReg v4.1. The reference [T] allele has a stronger binding affinity for SOX2 than the alternative [TA] allele.

SNP  LD SNP  Ref  Alt  D'  r2  AFR  AMR  ASN  EUR  Motifs
rs798827  rs58163073  T  TA  -1  1  0.7  0.83  0.85  0.97  Cart1, Dbx1, Foxa2, Foxp1, HDAC2, Ncx2, Sox2, Sox5, Zfp105, p300
rs4747471  rs199772546  TA  T  1  1  0.04  0.02  0.21  0  Arid3a2, Dbx1, Dbx2, FAC1, Foxa2, Foxa4, Foxj2, Foxk1, Foxo2, Foxp1, HNF1, Hlx1, Hoxa10, Hoxa5, Hoxc6, Hoxd8, Lhx3, Mef2, Msx-1, Nanog, Ncx2, Nkx6-1, PLZF, Pax-6, Pou2f2, Pou3f2, Pou3f4, Pou4f3, Prrx1, Sox13, Sox18, Sox19, Sox2, Sox5, Sox6, Sox7, Zfp105, p300
rs4747471  rs200774383  AAT  A  1  1  0.04  0.02  0.21  0  CDP7, Dbx1, Dbx2, Evi-1, FAC1, Foxa2, Foxa4, Foxj2, Foxk1, Foxo2, Foxp1, HNF1, Hlx1, Hoxa10, Hoxa5, Hoxd8, Lhx3, Lhx3, Mef2, Nanog, Ncx_2, Nkx6-1, Nkx6-2, PLZF, Pax-6, Pou2f2, Pou3f4, Pou6f1, Prrx1, Sox13, Sox18, Sox19, Sox2, Sox5, Sox6, Sox7, Zfp105, p300
rs4747471  rs7087230  C  T  1  1  0.3  0.06  0.28  0  Fox, Foxk1, Foxp1, Hoxa10, Hoxd8, Lhx3, Pou2f2, Sox18, Sox2, Sox3, Sox7, TATA
rs4747471  rs12217320  A  C  1  1  0.03  0.02  0.25  0  FAC1, Foxa4, Foxd3, Foxk1, Foxo1, Foxo2, Foxp1, HDAC2, Irf, Mef2, Nanog, RREB-1, Sox13, Sox2, Sox6, Sox7, Zfp105
rs4747471  rs1398027  G  C  1  1  0.002  0.02  0.25  0  Sox2
rs12057041  rs10968463  C  T  1  1  0  0.03  0.2  0.01  Foxj2, Sox2
rs10968456  rs10968463  C  T  1  1  0  0.03  0.2  0.01  Foxj2, Sox2

Table 8. Binding motifs located around target SNPs. Seven LD SNPs are located within and alter a SOX2 binding motif. The eQTL SNP, LD SNP, and reference and alternative alleles of the LD SNPs are shown. The coefficient of linkage disequilibrium (D') and the squared correlation coefficient (r2) are shown along with the frequency of the alternative allele in African (AFR), American (AMR), Asian (ASN), and European (EUR) populations. Motifs whose position weight matrix scores are affected by the LD SNPs are listed.

3.4.6 VDAC3 and rs58163073 are located within the same TAD

The spatial organization of the genome plays a role in the transcriptional regulation of genes [124]. Chromatin conformation capture methods such as Hi-C can reveal regions of the genome that interact frequently [30, 125, 126]. These regions are referred to as Topologically Associating Domains (TADs) [124, 127]. Hi-C data can be visualized as Hi-C heat maps [127]. Rao, S. S. P., Schmitt, A., and the Dekker Laboratory generated multiple Hi-C data sets for lung tissues and cell lines [30, 31, 36, 42, 128]. Figure 14 shows the Hi-C heat map at chr8:41000000-44000000 for the lung cell line IMR90. Figure 15 shows the Hi-C heat maps for the lung cancer cell line A549 and two lung tissue samples. Figures were generated using the 3D Genome Browser [129]. The locations of VDAC3 and SNP rs58163073 are indicated. From analyzing Hi-C maps of lung cell lines (normal and cancer) and lung tissues, we have confirmed that VDAC3 and rs58163073 are located within the same TAD.
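Once TAD boundaries have been called, checking whether a SNP and a gene share a domain reduces to an interval lookup. The sketch below is a minimal illustration: the TAD boundaries and the VDAC3 coordinate are placeholders invented for the example, while the SNP position (chr8:42904940) is the one given in the Figure 13 caption. In this study the assessment was made visually from Hi-C heat maps rather than with code of this kind.

```python
from typing import List, Optional, Tuple

Tad = Tuple[int, int]  # half-open interval [start, end) on one chromosome


def find_tad(pos: int, tads: List[Tad]) -> Optional[Tad]:
    """Return the TAD containing the position, or None if it falls outside all TADs."""
    for start, end in tads:
        if start <= pos < end:
            return (start, end)
    return None


def same_tad(pos_a: int, pos_b: int, tads: List[Tad]) -> bool:
    """True if both positions fall inside the same TAD."""
    tad_a = find_tad(pos_a, tads)
    return tad_a is not None and tad_a == find_tad(pos_b, tads)


# Hypothetical TAD calls within chr8:41000000-44000000 (placeholders only).
EXAMPLE_TADS: List[Tad] = [
    (41_000_000, 41_800_000),
    (41_800_000, 43_100_000),
    (43_100_000, 44_000_000),
]
SNP_POS = 42_904_940    # rs58163073, from the Figure 13 caption
GENE_POS = 42_250_000   # placeholder coordinate standing in for VDAC3
```

Calling `same_tad(SNP_POS, GENE_POS, EXAMPLE_TADS)` with these placeholder boundaries returns True; with real data the boundaries would come from a TAD caller applied to each Hi-C data set.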

Figure 14. VDAC3 and rs58163073 are located within the same TAD in a lung cell line. The Hi-C heat map of lung cell line IMR90 is shown for chr8:41000000-44000000, resolution 10 kb. The locations of VDAC3 and rs58163073 are indicated with dotted lines. The different TADs are marked with pale yellow and blue bars below the heat map. The locations of other genes are shown.

Figure 15. Hi-C heat maps for lung cancer cell line and lung tissues. Hi-C heat maps of lung cancer cell line A549 and two lung tissue samples are shown for chr8:41000000-44000000, resolution 40 kb. The VDAC3 gene and SNP rs58163073 are marked with dotted lines.

3.4.7 Significant association between SNP and clinical features

Minor allele frequencies (MAF) of rs798827 and rs58163073 varied by race, with Black or African Americans more likely to carry the minor alleles than Whites and Asians. The MAFs of rs798827 were 0.32 for Black or African American, 0.03 for White, and 0.15 for Asian populations according to the 1000 Genomes Project. These frequencies were similar in LSCC patients (0.48, 0.07, and 0.22, respectively), again with a higher frequency of the minor allele in Black or African American patients (Figure 16A and B). There is no significant difference in race distribution between the SOX2-active and SOX2-inactive groups (Figure 16C, chi-square test p-value = 0.555909). However, SNP rs798827 is associated with tumor location. Tumors in patients with the major allele were located 47% in the left lung and 53% in the right lung. Patients with the minor allele had tumors located predominantly in the right lung (24% left and 76% right, chi-square test p-value = 0.007461; Figure 16D). There was no significant association between SNP genotype and stage or age at diagnosis (Table 9).
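The left-versus-right comparison is a chi-square test on a 2x2 table of allele group by tumor side. The sketch below implements the standard Pearson chi-square statistic for a 2x2 table without continuity correction; the function name is hypothetical, and the patient counts are not given in the text, so any counts passed to it are illustrative and will not reproduce the reported p-value exactly.

```python
import math
from typing import Tuple


def chi_square_2x2(a: int, b: int, c: int, d: int) -> Tuple[float, float]:
    """Pearson chi-square test for the 2x2 table [[a, b], [c, d]].

    Returns (statistic, p_value) with 1 degree of freedom and no
    continuity correction.
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column must contain at least one observation")
    chi2 = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # For 1 degree of freedom the upper tail equals erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value
```

For example, with made-up counts of 47 left and 53 right tumors among major-allele patients and 12 left and 38 right among minor-allele patients, `chi_square_2x2(47, 53, 12, 38)` gives a statistic of roughly 7.4 and a p-value below 0.01.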

Figure 16. LSCC Clinical Analysis. SNP rs798827 minor allele frequency (MAF) is shown for different races in LSCC patients (A) and in SOX2-active patients (B). (C) Percentage of each race is shown for SOX2-active (left) and SOX2-inactive (right) patients. (D) Frequency of tumor location (left or right) is shown for patients with the major allele and minor allele.

Feature  Category  Major Allele  Minor Allele
Race  Black or African American  5%  36%
Race  White  93%  60%
Race  Asian  2%  5%
Location  Left-Lower  18%  14%
Location  Left-Upper  30%  11%
Location  Right-Lower  24%  32%
Location  Right-Upper  29%  43%
Age  Min  39  40
Age  1st Q  62  63
Age  Median  68  69
Age  Mean  67.22  67.72
Age  3rd Q  73  74.75
Age  Max  90  84
Stage  1  48%  57%
Stage  2  33%  25%
Stage  3  17%  16%
Stage  4  1%  2%

Table 9. Association of SNP rs798827 and clinical features in LSCC patients. Distribution of race, location of cancer in the lungs, age of the patient at diagnosis, and stage of cancer are shown for patients with the major or minor allele of SNP rs798827.

3.5 Discussion

LSCC remains a complex disease with many molecular alterations. SOX2 is often amplified in LSCC patients and plays an important role in the development and progression of LSCC; however, its mechanisms are not well understood. Utilizing multiple layers of data, from genomic to epigenomic, we are able to depict a potential mechanism of SOX2 in LSCC. First,

LSCC patients were separated into two groups: SOX2-active and SOX2-inactive based on their

SOX2 gene expression, copy number variation, and methylation. By first distinguishing patients based on their SOX2 activity, the mechanisms of SOX2 can be better elucidated. Using Matrix

eQTL analysis, we linked variations in gene expression to SNP genotypes. Differences in SNP

genotypes, especially within regulatory elements or binding motifs, can affect the binding affinity

of transcription factors such as SOX2, and alter downstream transcription of target genes.

In our study, we focused on eQTLs that are found in SOX2-active patients but not in SOX2-

inactive patients. We identified a SNP-gene pair (rs798827-VDAC3) that is unique in SOX2-active patients. The genotype of rs798827 is significantly correlated to the expression of VDAC3 in

SOX2-active patients but not in SOX2-inactive patients. Since SOX2 is a transcription factor that binds to specific regions of the genome, variations in the binding sequences may alter the binding

affinity of SOX2 to that region. Using HaploReg v4.1, we identified multiple SNPs in strong LD with our eQTL SNPs. We found that rs798827 is in strong LD with SNP rs58163073, and motif analysis indicates that SNP rs58163073 is located within the binding motif of SOX2. The [T] allele of SNP rs58163073 increased the position weight matrix score for SOX2 and the [TA] allele decreased it. This SNP directly alters the binding sequence of SOX2, and the [T] allele increases its binding affinity.

Our cell line and ChIP-seq analysis further suggests that SOX2 regulates VDAC3. When

SOX2 was silenced in two LSCC cell lines, the expression of VDAC3 was also downregulated. In

addition, SOX2 ChIP-seq peaks show SOX2 binding in the promoter region of VDAC3. SNP

rs58163073 and VDAC3 are also located within the same TAD in lung tissues, normal lung, and

lung cancer cell lines. VDAC3 is the least investigated isoform of voltage-dependent anion

channels. These channels are localized in the outer mitochondrial membrane and are pore-forming structures

that control the exchange of metabolites between the mitochondria and cytoplasm [114]. They also play crucial roles in oxidative stress, maintaining redox status, and mediating cytochrome c apoptosis [115, 130]. Other studies have reported that VDAC3-deficient cancer cells have reduced permeability for ADP/ATP and decreased mitochondrial membrane potential [131-133]. According to Ingenuity Pathway Analysis (IPA), the pathways in which VDAC3 was most involved were cancer and reproductive system disease [134]. Deletion of VDAC3 significantly increases cell resistance to the anti-tumor agent Erastin, which targets VDAC2 and VDAC3 [135, 136].

Further analysis shows that Black or African American patients have a higher MAF than White or Asian patients, and that patients carrying the minor allele had tumors located predominantly in the right lung. Collectively, these analyses support the notion that the SNP variant rs58163073 affects the binding of SOX2, which in turn affects SOX2’s modulation of target genes such as

VDAC3 (Figure 17). This relationship is detected only when SOX2 activity is first considered. Our analysis can be used to understand the downstream mechanisms of transcription factors in other diseases and cancers.

Figure 17. Graphical summary of the regulation of VDAC3 by SOX2. In SOX2-active patients, the genotype of SNP rs798827 is associated with the expression of VDAC3. This SNP is in strong LD with SNP rs58163073, which overlaps the binding motif of SOX2. This mechanism is only seen in patients with SOX2 activity. Positions of genes and SNPs are not drawn to scale.

CHAPTER IV:

Chromatin organization alterations lead to oncogene overexpression in breast cancer

Chapter IV contains text and figures from Chyr, J., Zhang, Z., D., and Zhou, X. Chromatin organization alterations lead to oncogene overexpression in breast cancer. Under review. 2020

4 Chromatin organization alterations lead to oncogene overexpression in breast cancer

4.1 Abstract

Topologically associated domains (TADs) and TAD boundaries play important roles in

genome organization and gene regulation. However, they are often altered in diseases. In our study,

we addressed the relationship between genomic features and higher-order chromatin structures

using a newly developed machine-learning model called PredTAD. Unlike previous models, our

model utilizes a diverse set of epigenomic and genomic features as well as neighboring information

to classify the entire genome as boundary or non-boundary regions. In breast cancer cells, there is

a gain of 566 new boundaries and a loss of 612 existing boundaries compared to normal breast

epithelial cells. Among the most important features for predicting boundary alterations were CTCF,

subunits of cohesin (RAD21 and SMC3), and chromosome number, suggesting their roles in

the formation of conserved and dynamic boundaries. Genes near TAD boundary alterations were found

to be involved in several important breast cancer signaling pathways such as Ras, Jak-STAT, and

estrogen signaling pathways. Finally, we discovered a TAD boundary alteration that contributes to

RET oncogene overexpression. In conclusion, studying the epigenomic and genomic characteristics

of TAD boundaries allowed for a more complete understanding of the dynamic chromatin

structures involved in signaling pathway activation, altered gene expression, and disease state in breast cancer cells.

4.2 Introduction

4.2.1 Chromatin organization

The human genome consists of more than three billion nucleotides, spanning over two meters in length. In order to fit this genomic material within the micrometer-sized nuclear space, the chromatin undergoes multiple levels of organization that involve extensive looping and folding, and it is tightly packaged into nucleosomes. Although tightly packed, the human genome is still accessible to cellular components and transcription factors.

High-throughput chromosome conformation capture technologies such as 3C, 4C, ChIA-

PET, and Hi-C have been developed to experimentally map long-range chromatin interactions [29,

30]. The genome is organized in a fractal globular-like structure where open and closed chromatin

form two genome-wide compartments [31].

4.2.2 Topologically associated domains

The genome consists of genomic loci with frequent chromatin interactions, commonly

known as topologically associated domains or TADs [31, 32]. The region between two adjacent

TADs is called a TAD boundary. This region limits the interactions between the two adjacent

TADs. TADs and TAD boundaries are important in orchestrating enhancer-promoter activity and

cell-specific gene expression. In cancer, alterations in chromatin structures may activate oncogenic

enhancer-promoters or silence tumor suppressive interactions [33-35].

4.2.3 TAD and TAD boundary predictions

It has been noted that chromatin interactions are associated with distinct patterns of TF binding, namely architectural-related protein CTCF, and histone modifications [38, 137]. Histone

modifications (acetylation, methylation, phosphorylation and ubiquitination) serve as distinct

markers for transcription regulation, and function to control the accessibility of the chromatin and

the recruitment of DNA binding proteins. Additionally, DNA methylation and DNA accessibility

have also been known to affect TF binding and histone methylation [138-140]. Although there is an

enrichment of CTCF and housekeeping genes in TAD boundaries, distribution of CTCF alone is

not sufficient to predict TAD boundaries, and housekeeping genes were only enriched in some TAD boundaries [37]. This suggests that computational models using a combination of ChIP-seq and other forms of data are capable of predicting, and improving our understanding of, large-scale chromatin structures.

A number of studies have already begun to predict TAD interaction hubs and TAD boundaries using readily available ChIP-seq data [37-39]. One group used the computational

classifier Bayesian additive regression trees (BART) to successfully predict TAD boundaries in

IMR90 cells with good prediction accuracy (AUC = 0.77) [38]. Another group used a Bayesian Ridge

model to develop a computational model called Position-specific linear model (PSLM) which

considered both position and density of various genomic elements. They also classified genomic

segments as TAD boundaries or non-TAD boundaries [37]. However, these studies have not

considered a vast combination of both genomic and epigenomic data. We utilized various

epigenomic data such as DNA methylation, transcription factor binding, histone modifications, and

chromatin accessibility as well as several genomic data such as transcription start site, gene density, and location information. Moreover, previous studies focused only on a very small portion of the genome: TAD boundaries and randomly selected non-TAD boundary regions at a 1:1 ratio.

In contrast, we interrogated the entire genome by binning it into 10 kb regions. This allowed us to retain the most information by providing a reasonably high resolution. We also included feature information for each bin’s neighbors in our model. Finally, these studies did not examine the differences between normal and cancer genomes. In our project, we sought to understand the biological significance of chromatin organization alterations in breast cancer.

4.2.4 Gradient Boosting Machine classification model

We employed a machine learning algorithm called gradient boosting machine (GBM) to

classify genomic regions as boundary or non-boundary. GBM is a popular classification method based on an ensemble of decision trees. Unlike the Random Forest (RF) model which builds an

ensemble of deep independent trees, GBM builds shallow trees one at a time with each new tree

correcting the errors made by previously trained trees [141, 142]. Other advantages of GBM

include the ability to handle missing data, no requirement for pre-processed data, and often better

predictive performance than RF. However, because trees are built sequentially, GBM models take longer to train and are

more prone to overfitting.
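
The sequential, error-correcting behavior of boosting can be illustrated with a minimal sketch. It assumes a single input feature, depth-one trees (stumps), and a squared-error loss; these are simplifications for illustration and do not reproduce the GBM implementation or loss function used in this work. The function names and the learning rate are hypothetical.

```python
def fit_stump(xs, residuals):
    """Find the single threshold split that best fits the residuals (squared error)."""
    best = None
    for t in sorted(set(xs))[:-1]:  # candidate thresholds between observed values
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        err = sum((r - left_mean) ** 2 for r in left)
        err += sum((r - right_mean) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, left_mean, right_mean)
    return best[1:]


def boost(xs, ys, n_trees=50, learning_rate=0.1):
    """Build stumps one at a time; each new stump is fit to the current residuals."""
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, preds)]
        t, left_mean, right_mean = fit_stump(xs, residuals)
        stumps.append((t, left_mean, right_mean))
        preds = [p + learning_rate * (left_mean if x <= t else right_mean)
                 for x, p in zip(xs, preds)]
    return base, learning_rate, stumps


def predict(model, x):
    """Sum the shrunken contributions of all stumps on top of the base value."""
    base, learning_rate, stumps = model
    return base + learning_rate * sum(lm if x <= t else rm for t, lm, rm in stumps)


# Toy data: the label switches from 0 to 1 between x = 3 and x = 4.
xs = [1, 2, 3, 4, 5, 6]
ys = [0, 0, 0, 1, 1, 1]
model = boost(xs, ys)
print(round(predict(model, 2), 3), round(predict(model, 5), 3))
```

Each stump only removes a fraction of the remaining error (set by the learning rate), which is why many shallow trees are combined and why training is sequential rather than parallel.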

4.2.5 Significance and innovation

In our study, we deciphered the relationship between epigenomic and genomic features and

higher-order chromatin structures. By studying chromatin organization, we uncovered large-scale

structural changes that contribute to breast cancer disease state. Our analysis of the chromatin

organization offers an in-depth understanding of gene regulation, signaling pathway activation,

and disease state.

4.3 Experimental Design and Methods

4.3.1 Hi-C data and TAD boundaries

Normal breast epithelial cell line data (MCF10A) and breast cancer luminal A subtype

(ER+/Her2-) cell line data (MCF7 and T47D) were obtained from GEO and ENCODE. MCF10A

and MCF7 Hi-C data were obtained from Barutcu et al. (GSE66733) [34]. The Hi-C library was prepared using a slightly modified Hi-C library preparation protocol from Dekker. Briefly, cultured

MCF7 and MCF10A cells were crosslinked using formaldehyde. The cells were lysed, and the

crosslinked chromatin was digested with the restriction enzyme HindIII. Next, the fragments were

labeled with biotin-14-dCTP, and the blunt ends were ligated. The DNA was then purified, and biotinylated Hi-C ligation products were pulled down with streptavidin. Finally, the Hi-C library

was sequenced on a HiSeq 2000 instrument. The 100-bp paired-end reads were mapped to hg19. Due

to aneuploidy of MCF7 cells, iterative correction and eigenvalue decomposition (ICE) was performed [34, 42]. The ICE method normalizes the interaction counts of aneuploid chromosomes by dividing interaction count by the total sum of all interactions.
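
The idea behind iterative correction can be sketched as follows. This is a simplified illustration under stated assumptions (a small symmetric toy matrix with no empty rows, a fixed number of iterations, no filtering of low-coverage bins); it is not the pipeline used to process the MCF7 and MCF10A data. In each round, every bin's total contact count is compared to the mean across bins, and each entry is divided by the relative coverage of its two bins until all bins have approximately equal coverage.

```python
def ice_balance(matrix, n_iter=50):
    """Iteratively rescale a symmetric contact matrix toward equal row sums."""
    n = len(matrix)
    w = [row[:] for row in matrix]  # work on a copy
    for _ in range(n_iter):
        sums = [sum(row) for row in w]
        mean = sum(sums) / n
        # Relative coverage of each bin; empty bins are left unchanged.
        bias = [s / mean if s > 0 else 1.0 for s in sums]
        w = [[w[i][j] / (bias[i] * bias[j]) for j in range(n)] for i in range(n)]
    return w


# Toy matrix: equal true contacts distorted by hypothetical per-bin biases of 1, 2, and 4.
raw = [
    [0.0, 2.0, 4.0],
    [2.0, 0.0, 8.0],
    [4.0, 8.0, 0.0],
]
balanced = ice_balance(raw)
row_sums = [sum(row) for row in balanced]
print([round(s, 4) for s in row_sums])
```

After balancing, the three bins have nearly identical total coverage, so differences caused by bin-specific factors such as copy number are reduced.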

TAD calling and TAD boundary identification were performed using the insulation square

analysis at the 40 kb resolution [34]. The insulation square analysis slides a 1 Mb by 1 Mb square

along the diagonal of the interaction matrix. For each square, there are 25 by 25 bins. The average

interaction of all 25 x 25 bins in the 1 Mb x 1 Mb square is recorded as the insulation score. Valleys,

or decreases, in insulation scores indicate a depletion of Hi-C interactions. These valleys represent

TAD boundaries. Regions between TAD boundaries are considered TADs. The width of the boundaries was extended by 80 kb on both sides to account for variation between biological

replicates (n=3). The final width of each TAD boundary is 200 kb. Weak boundaries with a boundary strength of <0.15 were excluded from our analysis. A total of 3213 and 3167 boundaries were used for MCF10A and MCF7, respectively. T47D TAD data were obtained from

ENCODE (ENCFF437EBV). TAD boundaries were defined as the region between two TADs.

Each TAD boundary region was standardized to 200 kb width. A total number of 3166 boundaries

were used for T47D.
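
A minimal sketch of an insulation-style score is given below. It assumes one common formulation, in which the square for a bin covers contacts between the bins immediately to its left and the bins immediately to its right; the toy matrix, the window of 2 bins, and the rule for handling plateaus are illustrative assumptions rather than the exact procedure of the cited analysis. With 40 kb bins, a window of 25 bins would correspond to the 1 Mb square described above.

```python
def insulation_scores(matrix, window):
    """Mean contact count in the square linking `window` bins on each side of bin i."""
    n = len(matrix)
    scores = [None] * n  # bins too close to the matrix edge get no score
    for i in range(window, n - window):
        total = 0.0
        for a in range(i - window, i):
            for b in range(i + 1, i + window + 1):
                total += matrix[a][b]
        scores[i] = total / (window * window)
    return scores


def call_valleys(scores):
    """Return local minima; a flat minimum is reported once, at its left edge."""
    valleys = []
    for i in range(1, len(scores) - 1):
        left, mid, right = scores[i - 1], scores[i], scores[i + 1]
        if left is None or mid is None or right is None:
            continue
        if mid < left and mid <= right:
            valleys.append(i)
    return valleys


# Toy matrix: two 6-bin domains with strong contacts inside and weak contacts between.
n_bins, domain_size = 12, 6
toy = [[10.0 if a // domain_size == b // domain_size else 1.0
        for b in range(n_bins)] for a in range(n_bins)]
scores = insulation_scores(toy, window=2)
valleys = call_valleys(scores)
print(scores)
print(valleys)
```

The score drops where the square straddles the two domains, and the valley marks the candidate boundary between them.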

4.3.2 Epigenomic and Genomic Elements

MCF10A, MCF7, and T47D histone modification and DNA binding protein ChIP-seq data

were obtained from the ENCODE project and GEO database [143, 144]. DNA chromatin

accessibility was determined by ATAC-seq, which was also obtained from GEO [145, 146]. Fastq

sequence files of ChIP-seq and ATAC-seq data were mapped to hg19 with BWA-MEM and narrow peaks were called with MACS2 [147, 148]. MCF7 and MCF10A DNA methylation from Illumina

Methylation 450K BeadChip array was obtained from ENCODE. T47D DNA methylation, also

Illumina Methylation 450K BeadChip array, was obtained from Uehiro, et al. GSE87177 [149].

Transcription factor binding sites were obtained from UCSC (wgEncodeRegTfbsClusteredV3

table) [144, 150]. Location of coding, noncoding, and housekeeping genes were also obtained from

UCSC. Gene density was calculated based on the number of TSS of each gene in each genomic bin. Chromosome length and location of centromeres were also obtained from UCSC. Distance to

centromere is defined as the percentage of the chromosomal arm the region is in, where 0 is at the

centromere and 1 is at the telomere.
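
The relative distance to the centromere can be written as a short function. This is a sketch of the definition given above; the function name and the coordinates in the example are hypothetical, and positions inside the centromere are assigned 0 by assumption.

```python
def relative_arm_position(pos, cen_start, cen_end, chrom_len):
    """Fraction of the chromosomal arm between the centromere (0) and the telomere (1)."""
    if pos < cen_start:  # short (p) arm, left of the centromere
        return (cen_start - pos) / cen_start
    if pos > cen_end:  # long (q) arm, right of the centromere
        return (pos - cen_end) / (chrom_len - cen_end)
    return 0.0  # inside the centromere


# Hypothetical chromosome of 100 Mb with a centromere at 40-50 Mb.
print(relative_arm_position(20_000_000, 40_000_000, 50_000_000, 100_000_000))
print(relative_arm_position(75_000_000, 40_000_000, 50_000_000, 100_000_000))
```

Both example positions lie halfway along their respective arms, so both return 0.5 even though their absolute distances from the centromere differ.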

4.3.3 PredTAD Boundary Prediction Model

The entire genome is binned into 10 kb regions. Regions in the centromere and telomere

were excluded from the study due to lack of Hi-C reads in these regions. Training and test sets were

randomly selected at a 7:3 ratio per chromosome. Epigenomic and genomic information of each bin as well as information of neighboring bins were used as the features. Specifically, ten 10 kb bins

to the left (or upstream) and ten 10 kb bins to the right (or downstream) were included as features.

This method keeps spatial feature information instead of averaging out the signals in the 200 kb

region. For ChIP-seq and ATAC-seq data, the mean signal values of narrowPeak peaks per bin

were used. For methylation data, we used the average methylation signal values of all CpG sites in

each bin. For transcription factor binding sites and transcription start sites, we took the number of each

genomic element in each bin.
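
The construction of neighbor-aware features can be sketched as follows. Each bin's feature vector is concatenated with the vectors of its flanking bins, so that position-specific signal is preserved rather than averaged. The function name is hypothetical, and padding missing neighbors at chromosome ends with a constant is an assumption made for this sketch; the treatment of edge bins is not specified above.

```python
def add_neighbor_features(bin_features, flank=10, pad=0.0):
    """Concatenate each bin's features with those of `flank` bins on each side."""
    n = len(bin_features)
    width = len(bin_features[0])
    rows = []
    for i in range(n):
        row = []
        for offset in range(-flank, flank + 1):
            j = i + offset
            if 0 <= j < n:
                row.extend(bin_features[j])
            else:
                row.extend([pad] * width)  # assumed padding beyond the chromosome end
        rows.append(row)
    return rows


# Toy example: three bins, one feature each, one flanking bin on each side.
toy_bins = [[1.0], [2.0], [3.0]]
print(add_neighbor_features(toy_bins, flank=1))
```

With ten flanking bins on each side, every sample carries 21 copies of the per-bin feature set, one for each position in the 210 kb neighborhood centered on the bin.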

We used the machine learning technique Gradient Boosting Machine (GBM) to classify

each 10 kb genomic bin as either TAD boundary or non-TAD boundary. The number of trees built

in the models was set at 500 and the max depth was 10. Although deeper trees may provide better

accuracy on a training set, they are prone to overfitting. In total, there were 3305 TAD boundaries for

MCF10A, 3273 for MCF7, and 3166 for T47D. For MCF10A, there were 65453 positive samples

(10 kb regions within a TAD boundary) and 227588 negative samples (10 kb regions not within a TAD boundary).

For MCF7, there were 64547 positive and 228494 negative samples. For T47D, there were 64784 positive samples and 231867 negative samples. Any 10 kb region partially located in a TAD boundary was excluded.
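
The labeling of 10 kb bins can be sketched as follows. The function name and the toy coordinates are hypothetical; the rule follows the description above, where bins fully inside a boundary are positive, bins with no overlap are negative, and bins that only partially overlap a boundary are dropped.

```python
def label_bins(chrom_len, boundaries, bin_size=10_000):
    """Map each bin start to 1 (inside a boundary) or 0 (outside); skip partial overlaps."""
    labels = {}
    for start in range(0, chrom_len - bin_size + 1, bin_size):
        end = start + bin_size
        label = 0
        for b_start, b_end in boundaries:
            if b_start <= start and end <= b_end:
                label = 1  # bin lies entirely within the boundary
                break
            if start < b_end and b_start < end:
                label = None  # partial overlap: exclude this bin
                break
        if label is not None:
            labels[start] = label
    return labels


# Toy chromosome of 100 kb with one hypothetical boundary at 25-60 kb.
toy_labels = label_bins(100_000, [(25_000, 60_000)])
print(toy_labels)
```

In this toy case the bin starting at 20 kb straddles the boundary edge and is excluded, three bins are positive, and the remaining six bins are negative.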

4.3.4 Boundary alterations prediction

A modification of PredTAD was used to predict TAD boundary alterations between the normal breast MCF10A cell line and the breast cancer MCF7 or T47D cell lines. As in PredTAD, 10 kb regions were used as samples. Each region could belong to one of three classes: boundary gain, boundary loss, or conserved boundary. A boundary gain is defined as a TAD boundary that is absent in normal

MCF10A but present in breast cancer MCF7 or T47D (also referred to as MCF7-specific or T47D-specific boundaries). Boundary loss is the opposite, where a boundary is found

in MCF10A but is lost in MCF7 or T47D (also considered as MCF10A-specific boundaries).

Conserved boundaries are defined as boundaries found in MCF10A, MCF7, and T47D.
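
The three classes can be expressed with simple set logic. The sketch below compares one normal and one cancer cell line at a time, which is a simplification of the definition above (where conserved boundaries are shared by all three cell lines); the function name and the bin indices are hypothetical.

```python
def classify_boundary_bins(normal_bins, cancer_bins):
    """Label every boundary bin as conserved, gain (cancer only), or loss (normal only)."""
    classes = {}
    for b in sorted(normal_bins | cancer_bins):
        if b in normal_bins and b in cancer_bins:
            classes[b] = "conserved"
        elif b in cancer_bins:
            classes[b] = "gain"
        else:
            classes[b] = "loss"
    return classes


# Hypothetical boundary bin indices for a normal and a cancer cell line.
normal = {10, 11, 40, 41}
cancer = {10, 11, 70, 71}
print(classify_boundary_bins(normal, cancer))
```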

The features used were the same as in PredTAD, with an additional set of features: the differences between the two cell lines. That is, for ChIP-seq, ATAC-seq, and methylation data, we also included

the difference in mean signal values (between normal and breast cancer cell lines) as additional

features in our model.

4.3.5 Genomic interaction prediction

To predict if two regions have interactions based on epigenomic and genomic features, the

entire genome is first binned at 40 kb. Each 40 kb region is paired with each other 40 kb region

located within 1 million base pairs of it. In total, there are 1,966,393 pairs. Due to the large

number of samples, we sampled 250,000 pairs for our model and also modeled each chromosome

individually. Again, 70% of the pairs were used for training and 30% for testing. Random

Forest was used as the classifier. Positive samples were defined as two regions with more than one

Hi-C read. The features used were average ChIP-seq signals, number of TFBS and TSS, distance

to centromeres, and distance between two regions. The output of our model is interaction or no

interaction, determined by overlapping reads in Hi-C data. MCF7 and MCF10A interaction predictions were conducted separately.
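
The pairing and labeling of bins can be sketched as follows. The sketch assumes that the distance between two bins is measured between their start coordinates, so that with 40 kb bins and a 1 Mb limit each bin is paired with at most 25 bins to its right. The function names and the toy read counts are hypothetical.

```python
def bin_pairs_within(n_bins, bin_size=40_000, max_dist=1_000_000):
    """Yield index pairs (i, j), i < j, whose start-to-start distance is at most max_dist."""
    max_sep = max_dist // bin_size
    for i in range(n_bins):
        for j in range(i + 1, min(i + max_sep, n_bins - 1) + 1):
            yield i, j


def label_pairs(pairs, read_counts, min_reads=2):
    """Label a pair 1 if it is supported by more than one Hi-C read, else 0."""
    return {pair: 1 if read_counts.get(pair, 0) >= min_reads else 0 for pair in pairs}


# Toy chromosome of four 40 kb bins, pairing bins at most 80 kb apart.
pairs = list(bin_pairs_within(4, bin_size=40_000, max_dist=80_000))
labels = label_pairs(pairs, {(0, 1): 5, (1, 3): 1})
print(pairs)
print(labels)
```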

4.3.6 Gene expression analysis

MCF10A and MCF7 RNA-seq libraries were obtained from Barutcu et al. (GSE71862) [34].

The libraries were generated with TruSeq Stranded Total RNA with Ribo-Zero Gold Kit and sequenced as 100-bp single-end reads using HiSeq 2000. Gene expression (transcripts per million) was quantified by RSEM v.1.2.7 as previously described [34, 151]. Differentially expressed genes were defined as log2FC > 1 with a p-value < 0.05, n = 3 biological replicates. There were 6271 differentially expressed genes between MCF10A and MCF7 cell lines.

Estrogen related gene list was curated from analysis of two independent datasets:

GSE11352 [152] and GSE55922 [153]. The first dataset is from Lin CY, et al [152]. Briefly, MCF7 cells were treated with 10 nM of 17β-estradiol (E2, estrogen) or vehicle control for 12 hrs. RNA was extracted and gene expression was profiled with Affymetrix U133 A and B GeneChips. Data was normalized using the Robust Multichip Average (RMA) normalization method as previously described [152, 154]. Differentially expressed genes were defined as log2FC > 1 with an adjusted p-value < 0.2, n = 3 biological replicates. The second dataset is from Nagarajan, et al [153]. Briefly,

MCF7 cells were treated with 10 nM of 17β-estradiol or vehicle control for 2 hrs. RNA was

extracted and sequenced with HiSeq 2000 from Illumina. RNA-seq was mapped to hg19 using

Bowtie2 v.2.1.0 and differential expression was measured using DESeq v.1.14.0 as previously

described [155, 156]. Differentially expressed genes were defined as log2FC > 1 with a p-value <

0.05, n = 2 biological replicates. A total number of 521 unique genes were considered as estrogen-

related genes.

Breast cancer patient RNA-seq data were obtained from the TCGA cohort. Luminal A

(ER+/Her2-) patients were selected because MCF7 is a luminal A breast cancer cell line. In total,

there were 315 tumor samples and 40 matched normal breast samples. There were 3947

differentially expressed genes (log2FC > 1, p-value < 0.05) between breast cancer tumor samples

and matched normal control.

4.4 Results

4.4.1 Epigenetic and genetic features in TAD boundaries

Previous studies have reported enrichment of CTCF binding and transcription start sites of

housekeeping genes at TAD boundaries. This is also the case in MCF10A and MCF7 TAD boundaries (Figure 18A). Previous studies have also reported distinct patterns of histone marks

around TAD boundaries in H1 hESC and IMR90 cell lines [32, 36-38]. DNA methylation has been

reported to be negatively correlated with transcription factor binding, namely the TAD boundary

associated protein CTCF (Figure 18B) [140]. Thus, it is possible to utilize both epigenomic and

genomic data in the modeling and prediction of TAD boundaries.

Figure 18. Epigenomic and genomic features. (A) Enrichment of CTCF (left) and house-keeping genes (right) centered on TAD boundaries for MCF10A and MCF7 cell lines. (B) Relationship between DNA methylation and CTCF ChIP-seq reads in MCF10A (left) and MCF7 (right) cell lines.

4.4.2 PredTAD model

An overview of the PredTAD model is shown in Figure 19. Specifically, we used a machine learning technique called Gradient Boosting Machine (GBM), also known as generalized boosted models, to classify genomic regions as TAD boundaries or non-TAD boundaries. This method is often used for regression and classification problems, and it builds a model from an ensemble of sequentially trained decision trees. The whole genome was binned into 10 kb sample regions. We excluded telomeric and centromeric regions due to the lack of Hi-C reads in these regions. In total, we had 296,651 samples. We used 70% of the samples for training and withheld 30% for testing.

For epigenomic features, we included 9 histone modifications (H3K27ac, H3K27me3,

H3K36me3, H3K4me1, H3K4me2, H3K4me3, H3K9ac, H3K9me3, and H4K20me1), 15 DNA binding proteins (CTCF, ELF1, EGR1, EP300, FOXA1, GABPA, GATA3, MAX, PML, POLR2A,

RAD21, SIN3A, SMARCA5, SMARCE1, and SRF), DNA accessibility (ATAC-seq), and DNA

methylation. For genomic features, we included the chromosomal location, relative distance on

chromosome (determined by relative distance from the centromere), gene density (number of

transcription start sites (TSS) of noncoding, coding, and housekeeping genes), and binding sites of

161 transcription factors. To improve the accuracy of our model, we also included information of

each sample’s neighboring regions as features. Specifically, we included epigenomic and genomic

features for each bin’s ten upstream and ten downstream bins.

Figure 19. Overview of PredTAD. The entire genome is binned into 10 kb bins. For each bin, 192 epigenomic and genomic features were measured. Feature information for ten upstream and ten downstream bins was also included. Gradient Boosting Machine (or GBM) was used to classify each 10 kb sample as either a TAD boundary or a non-TAD boundary.

We obtained an AUC of 0.7954, 0.7993, and 0.7487 in MCF10A, MCF7 and T47D,

respectively. The top features for MCF10A were chromosome number, transcription factor binding

sites of SMC3, RAD21, and CTCF, histone modifications (H3K9ac, H3K27me3, and H3K9me3),

gene density of noncoding genes, relative distance on chromosome, and DNA methylation. For

MCF7, the top features were chromosome number, CTCF, RAD21, gene density of noncoding genes, DNA accessibility, and DNA methylation. T47D had similar top features: chromosome number, CTCF, and transcription factor binding sites of RAD21, CTCF, and SMC3 (Figure 20).

These results suggest that higher-order chromatin structures depend on not just

CTCF and other chromatin remodeling proteins such as SMC3 and RAD21, but also the basic properties of the chromosome region (i.e. chromosome number and location), gene density, chromatin accessibility, and DNA methylation.
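
For readers less familiar with the AUC values reported in this section, the metric can be computed directly from its rank-based definition: the probability that a randomly chosen positive sample receives a higher predicted score than a randomly chosen negative sample. The sketch below uses made-up labels and scores and a hypothetical function name; it is not the evaluation code used for PredTAD.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve as the fraction of positive-negative pairs ranked correctly."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half
    return wins / (len(pos) * len(neg))


# Made-up example: two boundary bins (1) and two non-boundary bins (0).
toy_labels = [1, 1, 0, 0]
toy_scores = [0.9, 0.4, 0.5, 0.1]
print(roc_auc(toy_labels, toy_scores))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, which gives context for values near 0.75 to 0.80.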

Figure 20. PredTAD top predictive features. The top 20 most informative epigenomic and genomic features are ranked for MCF10A, MCF7, and T47D. Feature importance is determined by calculating the relative influence of each variable. Feature importance is scaled between 0 and 1. The features are color coded based on feature type.

4.4.3 Comparison to other models

Our method is better at predicting TAD boundaries than two previously published methods: PSLM and BART. When these two methods were applied to MCF7, we obtained lower AUCs (Figure 21). Position-specific linear model (or PSLM) is a Bayesian Ridge model with regularized linear regression. It uses DNA binding and histone modification ChIP-seq as well as transcription start sites and transcription factor binding sites. It uses 300 kb samples, which are further divided into 11 sub-bins, and the number of each genomic element in each sub-bin is counted. Bayesian Additive Regression Trees (or BART) is a “sum of trees” model that averages results from an ensemble of regression trees. However, this model only uses CTCF and histone modification ChIP-seq data. Additionally, it uses 300 kb samples with no additional sub-bins.

Figure 21. Comparison of PredTAD to other models. Performance of PredTAD is compared to two other TAD prediction models: PSLM and BART. MCF7 cell line was used in the comparison. The AUC of the testing set is shown for all three models.

4.4.4 Boundary alterations are controlled by H3K9ac, CTCF, RAD21, and SMC3.

Higher-order chromatin organization is important for the proper regulation of genes.

However, TADs and TAD boundaries are often perturbed in cancer [34, 157]. Although the majority of

TAD boundaries in the normal breast epithelial and breast cancer cell lines overlapped, roughly 17.9-19.0%

of the boundaries are different. There are 566 new TAD boundaries found in the breast

cancer cell line MCF7 that are not found in the normal breast epithelial cell line MCF10A. There is

also a loss of 612 TAD boundaries in MCF7 compared to MCF10A (Figure 22A). Studying these TAD boundary alterations can offer insight into gene regulation and the disease state of breast cancer.

Figure 22. Comparison of normal and breast cancer TAD boundaries. (A) Venn diagram of number of TAD boundaries in MCF10A and MCF7 cell lines. (B) Top 20 most informative epigenomic and genomic features for predicting TAD boundary alterations.

We asked which features were most involved in TAD boundary changes between normal

MCF10A cells and breast cancer MCF7 cells. A TAD boundary change can be either gained or

lost: a boundary gain is a boundary found in MCF7 but not in MCF10A (or MCF7-specific boundary) and a TAD boundary loss is a boundary found in MCF10A but not in MCF7 (or

MCF10A-specific boundary). A conserved TAD boundary or “no change” is one that is found in both MCF7 and MCF10A cell lines. We used the same epigenomic and genomic features to predict

if a boundary region is gained, lost, or conserved. The 5-fold cross validation AUC is 0.7389. The

top five important features are H3K9ac, CTCF, RAD21, SMC3, and H3K4me1 (Figure 22B). A

closer analysis of these top features reveals that these features are significantly enriched in

conserved boundaries (Figure 23). Interestingly, H3K4me1 functions to protect regulatory regions

from DNA methylation and marks open chromatin [158]. H3K9ac marks active enhancers [159,

160]. RAD21 regulates the structure and organization of the chromosome, particularly during

cell division. SMC3 is part of the structural maintenance of chromosomes (SMC) family of proteins. Both RAD21 and SMC3 are part of the cohesin complex and play important roles in gene

regulation, DNA damage repair, and stabilization of the genome [161, 162].

Figure 23. Expression of top five features in conserved and perturbed TAD boundaries. Mean signal values for H3K9ac, CTCF, and H3K4me1 are shown for conserved, gained, and lost TAD boundaries (200 kb regions). Transcription factor binding site counts for RAD21 and SMC3 are shown for conserved, gained, and lost TAD boundaries (200 kb regions). ANOVA p-values for the top five features are shown in the table.

4.4.5 Epigenomic and genomic features accurately predict genomic interactions

Next, we asked whether large-scale chromatin interactions can be modeled using epigenetic and genetic features. The entire genome was binned into 40 kb sections, matching the resolution at which the Hi-C data were processed. If Hi-C reads linked two bins, the two bins were considered to be interacting. Only bins within 1 Mb of each other were included in our study. In total, there were 1,966,393 bin pairs. We sampled 250,000 pairs for our model and used random forest as our classifier. We obtained an AUC of 0.8242. The most important feature was the distance between the two regions, followed by chromosome number (Figure 24).

Figure 24. Genome-wide interaction study. The interaction between two 40 kb regions is modeled. The ROC curve for the testing set is shown on the left. The variable importance plot is on the right. The x-axis represents the mean decrease in node impurity.

We also modeled each chromosome individually. Sample size ranged from 30,000 to

162,000 per chromosome. Interacting 40 kb pairs were individually modeled and predicted for each

chromosome. We obtained AUCs ranging from 0.7593 to 0.9061 across all 23 chromosomes. The

specific testing AUCs for each chromosome are shown in Figure 25. Chromosomes 9, 14, 15, 16,

21, and 22 had the highest accuracy in MCF7. We looked at the top 10 most important features and

found that distance between two regions, relative distance to the centromere, CTCF, SMARCA5,

SMARCE1, FOXA1, H3K4me1, and transcription factor binding sites of CEBPB, MAFK, and

CTCF were among the most important features across the six best performing chromosomes. We plotted the ratio and differences of these features and found that the distance between two regions was shorter in interacting pairs than in non-interacting pairs (Figure 26A). We also found that non-interacting regions had a greater difference in average CTCF, SMARCA5, SMARCE1, FOXA1, and H3K4me1 signals as well as a greater difference in the number of transcription factor binding sites of

CEBPB, MAFK, and CTCF (Figure 26B).

Figure 25. Interaction prediction accuracy across each chromosome. The interaction between two 40 kb regions was modeled for each chromosome individually. The testing AUC is plotted for each chromosome and for MCF7 and MCF10A separately.

Figure 26. Comparing interacting and non-interacting regions. Chromosomes 9, 14, 15, 16, 21, and 22 had the highest performance. (A) The ratio of epigenetic and genetic features for non-interacting regions over interacting regions is plotted. (B) On the left, differences in ChIP-seq signals for the top five ChIP-seq features are plotted for interacting (red) and non-interacting (teal) samples. On the right, differences in the number of transcription factor binding sites of three TFs are plotted for interacting and non-interacting samples (T-test p-values < 0.05).

4.4.6 Altered chromatin organization contributes to oncogenic pathway activation

TAD boundary alterations can affect the expression of nearby genes. We examined the

genes near (+/- 1 Mb) MCF7 boundary gains and losses. There are 6443 genes near MCF7 gained boundaries. Of those genes, 1375 are differentially expressed in cell line RNA-seq data (p-value < 0.05) and 108 are estrogen-related genes. For MCF7 lost boundaries (found in MCF10A but not in MCF7), 7298 genes are affected. Of those genes, 1551 are differentially expressed in cell line RNA-seq data (p-value < 0.05) and 106 are estrogen-related genes.

Using the DAVID online functional annotation tool, we examined which KEGG pathways these genes are involved in [163]. KEGG PATHWAY is a collection of pathway maps on the molecular interaction, reaction, and relation networks for metabolism, genetic information processing, environmental information processing, cellular processes, organismal systems, human diseases, and drug development [164]. The differentially expressed genes near gained boundaries were involved in pathways such as the Hippo signaling pathway, estrogen signaling pathway, and Ras and

Jak-STAT signaling pathways. The differentially expressed genes near lost boundaries were involved in estrogen signaling pathway and metabolic pathways. The full list is shown in Table 10.

Table 10. Signaling pathway analysis of differentially expressed genes near altered boundaries. Enriched signaling pathways are listed for differentially expressed genes near gained (left) and lost (right) TAD boundaries.

4.4.7 Loss of TAD boundary promotes RET oncogene expression in breast cancer

We selected the gene RET for further analysis due to its proximity to a TAD boundary alteration, gene expression change in cancer cells, and its previous implication in breast cancer.

RET is a receptor tyrosine kinase that phosphorylates PTK2/FAK1. Our PredTAD model predicted a boundary loss in chr10:43,600,000-43,800,000 in both MCF7 and T47D cell lines. This boundary is just upstream of the RET gene (Figure 27). Closer analysis indicates low H3K4me1 ChIP-seq signals and DNA hypermethylation. As a result, the TAD boundary is lost and RET is significantly overexpressed in MCF7 cells compared to MCF10A cells (log2FC > 1, p-value < 0.05, n = 3; Figure 28A). RET is also overexpressed in the T47D cell line. There is also an increase in RET gene expression in TCGA breast cancer patients compared to matched normal breast tissue samples (log2FC > 1, p-value < 0.05; Figure 28B). Overexpression of the RET gene is associated with a poor prognosis in breast cancer (Cox p-value = 0.038; Figure 28C). There are several tyrosine kinase inhibitors that can enhance the sensitivity of breast cancer to tamoxifen therapy and/or re-sensitize tumors that have developed tamoxifen resistance [165].

Figure 27. TAD boundary changes near RET gene. Hi-C heat maps for MCF10A (red) and MCF7 (blue) are shown for chr10:42400000-46400000. Conserved TAD boundaries are indicated with black dots. The MCF10A-specific boundary is indicated with a red dot. The location of RET is marked with a star.

Figure 28. RET in breast cancer cell line and patients. (A) RET gene expression is shown for MCF10A and MCF7 cell lines (n=3, p-value = 0.0024). (B) RET log2 expression for TCGA breast cancer tumor samples and matched normal samples is shown (p-value = 1.4069e-15). (C) Kaplan-Meier curves for TCGA breast cancer patients are shown (Cox regression p-value = 0.0384). Patients were stratified into two groups: high and low expression of RET.

4.5 Discussion

The 3D chromatin organization plays an important role in gene regulation. Chromatin

structures such as TADs and TAD boundaries regulate gene expression by dictating genomic

interactions. Previous studies have shown that TAD boundaries are enriched with certain factors

such as the core architectural protein CTCF and house-keeping genes. In fact, TAD boundaries can be characterized by a combination of multiple epigenomic and genomic elements. In our study, not

only were CTCF and histone modifications important in the prediction of TAD boundaries, but

chromosome number was one of the top features in MCF10A, MCF7, and T47D TAD boundary predictions. Genes are not evenly distributed among the 23 pairs of chromosomes. For example,

chromosomes 16-22, but not chromosome 18, are considered small, gene-rich chromosomes, and

these tend to cluster together in the nucleus. A large portion of these chromosomes are often active

and have different combinations of histone marks compared to gene-poor chromosomes. Therefore,

including chromosome number in our model increased the accuracy in our TAD boundary prediction.

We used GBM in our PredTAD model because it is a strong learner. It builds the model

in an additive and sequential manner – with each subsequent tree learning from the previous trees’

errors. We selected 500 trees with a maximum depth of 10 to offer a deep model without

overfitting. Additionally, a higher number of trees increases the run time drastically and does not

significantly increase the accuracy. A lower number of trees such as 100 trees decreases the run

time but offers an AUC of around 0.77 compared to 0.80 for MCF7.
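The additive, sequential training described above can be illustrated with a minimal sketch. This is not the PredTAD implementation: it assumes a single numeric feature with at least two distinct values, squared-error loss, and depth-1 trees (stumps), and the names `fit_stump` and `boost` are hypothetical. Each new tree is fit to the residual errors left by the trees before it.

```python
def fit_stump(xs, residuals):
    """Fit a depth-1 tree: one threshold on a 1-D feature, chosen to
    minimize squared error on the current residuals.
    Assumes xs contains at least two distinct values."""
    best = None
    for t in sorted(set(xs))[:-1]:  # every split leaves both sides non-empty
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm


def boost(xs, ys, n_trees=20, learning_rate=0.5):
    """Additive model: start from the mean, then add trees one at a time,
    each trained on the errors of the ensemble built so far."""
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    trees = []
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, preds)]
        tree = fit_stump(xs, residuals)
        trees.append(tree)
        preds = [p + learning_rate * tree(x) for x, p in zip(xs, preds)]
    return lambda x: base + learning_rate * sum(t(x) for t in trees)


# Toy data: the label switches from 0 to 1 between x = 3 and x = 4.
xs = [1, 2, 3, 4, 5, 6]
ys = [0, 0, 0, 1, 1, 1]
model = boost(xs, ys)
```

In this toy setting the training error shrinks with every added tree, which mirrors the trade-off noted above between the number of trees, run time, and accuracy.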

When comparing top prediction features between MCF10A and MCF7, we noticed one

main difference: the type of feature. For MCF10A, it was the transcription factor binding sites of

multiple genes, but in MCF7 and T47D, it was a mix of transcription factor binding sites and ChIP-

seq mean signal values. This was not a surprise since transcription factor binding sites are generated

from their corresponding proteins’ ChIP-seq data and offer similar, redundant information. Since

MCF10A is a normal cell line, the generic transcription factor binding sites were more informative,

whereas for the abnormal breast cancer cell lines which have genomic alterations, dynamic ChIP-

seq signals were a more accurate representation of current status and thus, more informative.

When compared to other TAD boundary prediction methods, our method performed the best. PredTAD utilizes more epigenomic and genomic features than the other prediction models.

Moreover, we used much smaller bin sizes, which increases the resolution and prevents averaging high and low values together. We also included information from ten upstream and ten downstream neighboring bins in the prediction. When predicting TAD boundaries that were gained, lost, or conserved between MCF10A and MCF7, we found that SMC3 and RAD21 were among the most important features. This was not surprising since both SMC3 and RAD21 are part of the cohesin complex and are involved in chromatin organization. We also found that H3K4me1 is enriched in conserved TAD boundary regions. H3K4me1 interferes with DNA methyltransferase binding and may protect regulatory regions from being targeted for DNA methylation [166, 167]. Inhibition of de novo DNA methylation may protect TAD boundaries from alteration.
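As an illustration of the neighboring-bin features mentioned above, the following sketch builds, for each bin, a row containing the value of that bin together with the values of its ten upstream and ten downstream neighbors. It assumes one signal value per bin and zero-padding at chromosome ends; the padding choice and the function name `add_neighbor_features` are assumptions, not details taken from the PredTAD code.

```python
def add_neighbor_features(bin_values, flank=10, pad=0.0):
    """Return one feature row per bin: the bin's own value plus the values
    of `flank` bins on each side. Positions beyond the ends of the
    chromosome are filled with `pad` (an assumed convention)."""
    n = len(bin_values)
    rows = []
    for i in range(n):
        row = []
        for offset in range(-flank, flank + 1):
            j = i + offset
            row.append(bin_values[j] if 0 <= j < n else pad)
        rows.append(row)
    return rows


# With flank=10, every row has 2 * 10 + 1 = 21 values per input signal.
rows = add_neighbor_features([0.2, 0.9, 0.1, 0.4], flank=10)
```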

Modeling large-scale genomic interactions was also possible using epigenetic and genetic elements. The most informative features in distinguishing between interacting and non-interacting genomic regions were the distance between two regions, chromosome number, relative distance to the centromere, CTCF, SMARCA5, SMARCE1, FOXA1, H3K4me1, and transcription factor binding sites of CEBPB, MAFK, and CTCF. Different chromosomes have different characteristics. Active, open chromosomes will have a different epigenomic landscape than inactive, heterochromatic chromosomes.

The distance between two regions is the most informative feature because the closer two regions are to each other, the more likely they are to interact [31, 32, 168]. The relative position on the chromosome was also informative because telomeres tend to be gene-poor, inactive, and clumped together [169]. Closer analysis revealed that interacting regions shared similar epigenetic and genetic elements, which is consistent with previous findings [38, 170].

In summary, we accurately predicted TAD boundaries and large-scale genomic interactions using a diverse set of epigenomic and genomic features with chromosome number being the most informative feature in all three cell lines. For predicting conserved or perturbed boundaries in breast cancer cell lines, H3K9ac, CTCF, RAD21, SMC3, and H3K4me1 were found to be most important.

We also identified a list of genes near perturbed boundaries that were involved in a number of oncogenic pathways. Finally, using our PredTAD model, we discovered a boundary loss near the oncogene RET which may lead to overexpression of RET in cell line and patient data. In conclusion, studying chromatin organization offers a better understanding of gene regulation, signaling pathway activation, and disease state.


CHAPTER V

Sequence-based modeling of enhancer and promoter interactions


5 Sequence-based modeling of enhancer and promoter interactions

5.1 Abstract

Enhancer-promoter interactions are important in gene activation and transcription.

Previous studies defined these interactions indirectly through association studies between gene

expression and enhancer state. More recently, high-throughput chromatin conformation capture

assays, such as Hi-C or ChIA-PET, can measure chromatin interactions and detect true enhancer-promoter interactions. However, these experimental methods are costly. Growing evidence

suggests that DNA sequences contain rich information for predicting enhancer-promoter links.

Therefore, we developed two new models to predict enhancer-promoter interactions from DNA

sequence. Our first model uses an intermediate chromatin feature prediction step, while our second model

uses DNA sequence directly to predict enhancer-promoter interactions. Our models accurately

distinguish interacting enhancer-promoter pairs from non-interacting pairs across multiple cell lines. Further

analysis revealed important genomic positions with strong binding affinities for transcription-

regulating transcription factors such as POLR2A, TRIM28, and HDAC1.

5.2 Introduction

5.2.1 Enhancers and promoters

There are two main classes of regulatory elements in our genome that play key roles in the

transcription of genes. They are commonly called enhancers and promoters. An enhancer is a short

region of DNA that amplifies the transcription of a gene. Enhancer regions are typically a few

hundred base pairs in length. Promoters are regions where transcription is initiated. Promoters are

about 1-2 kb in length [43]. Most enhancers regulate their nearest gene; however, up to 50% of

enhancers control a more distal gene [171-173]. Enhancers can also regulate more than one gene

target.

The Encyclopedia of DNA Elements (ENCODE) consortium and the Roadmap

Epigenomics project have identified hundreds of thousands of putative enhancer regions in the genome [173].

These regulatory regions are associated with several epigenetic marks, such as H3K4me1 and DNase I

hypersensitive sites (DHSs) profiled by DHS-seq. The discovery of a new class of non-coding

transcripts called enhancer RNAs (eRNAs) further improves our identification of functional

enhancers [174]. eRNAs are transcribed regions of enhancers that are actively engaged in

transcriptional regulation. Like mRNA and most small nuclear and microRNAs, eRNAs are also

transcribed by RNA Polymerase II. However, because eRNAs are not polyadenylated and are

typically expressed at low levels, standard RNA-seq protocols cannot detect them. Two new

sequencing techniques, global run-on sequencing (GRO-seq) and cap analysis of gene

expression (CAGE), allow us to readily detect eRNAs [175, 176]. GRO-seq measures production

rates of all nascent RNAs and CAGE is able to capture the 5’ start of a transcript.

5.2.2 Annotation of enhancer-promoter pairs

Using eRNA expression and gene expression, the FANTOM5 consortium generated an

atlas of predicted enhancers in human cancer and primary cell lines and tissues [177]. This study

used pairwise correlation to infer enhancer-promoter links. FOCS (FDR-corrected OLS with Cross-

validation and Shrinkage) also infers enhancer-promoter links based on correlated activity patterns

across many heterogeneous samples [172]. Additionally, Whalen et al. curated enhancer-promoter

datasets for six cell lines, which we used to train one of our models [43]. They identified TSS-

containing promoter regions and both strong and weak enhancer regions from combined ENCODE

Segway and ChromHMM annotations [178, 179]. Then, they annotated interacting enhancer-promoter pairs using high-resolution genome-wide Hi-C data from Rao et al. [36].

5.2.3 Deep-learning models for genomics

Deep learning describes a class of machine learning techniques that use deep neural

networks to extract complicated patterns from high-dimensional input data. Deep learning models

are already widely used for image classification and natural language processing and are now being

applied in genomics [44, 45, 180, 181].

Deep learning holds a promising path in genome biology. However, several precautions must be taken. Deep learning should only be applied to datasets of sufficient size with at least thousands

of samples [182, 183]. If the sample size is too small, variations may affect the output and lead to

overfitting. Deep learning networks are also difficult to interpret. An intrinsic property of deep

learning networks is their “black box” nature. Imagine a box with hundreds of buttons, knobs, and

switches as well as other dials we cannot quantitatively measure (yet). Different combinations of

these inputs would lead to some output, but for simplicity: let’s say the output is to turn on a light.

We have a few sets of inputs with an observable output, that is, we know that some combinations

turn on the light, while other combinations do not. However, we do not know what is in the box. We

do not know the wiring, connections, and otherwise linked processes within the box. With a large

enough sample size, deep learning models become more accurate in predicting the output.

However, in genomics, we are limited by the number of biological samples as well as not knowing

all the inputs that are involved. Deep learning is also a computationally intensive process and careful

consideration of inputs, sample sets, and features remains a challenge.

5.2.4 Development of sequence-based models

The human genome is vast and complex, comprising over 3 billion base pairs worth of

information that is still poorly understood. Deep learning techniques allow us to understand

complicated sequences (DNA, RNA, or even protein) in order to connect genotype to phenotype,

as well as predict regulatory functions and classify regions. One-hot encoding is a process to

represent categorical variables as binary vectors [184, 185]. For DNA sequences, there are four

categories: A, T, C, and G. Chen, et al. developed a PyTorch-based deep learning library for

sequence data called Selene [186]. Selene is a framework for developing sequence-level deep

learning networks that provides scientists with comprehensive support for model training,

evaluation, and application across a broad range of biological questions. Sequence-level data refers

to any type of biological sequence such as DNA, RNA, or protein sequences and their measured properties (for example, binding of transcription factors or RNA-binding proteins, or DNase

sensitivity).
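The one-hot encoding described above can be sketched in a few lines. This is a generic illustration rather than Selene's own encoder: the column order A, C, G, T and the all-zero row for ambiguous bases such as N are assumptions made here for the example.

```python
BASES = "ACGT"  # assumed column order for the four categories


def one_hot(seq):
    """Encode a DNA sequence as a list of 4-element binary vectors.
    Bases outside A/C/G/T (for example N) become all-zero rows,
    which is an assumed convention."""
    return [[1 if base == b else 0 for b in BASES] for base in seq.upper()]


encoded = one_hot("ACGTN")
# A 401-bp enhancer and a 2001-bp promoter would give matrices of
# 401 x 4 and 2001 x 4 entries, respectively.
```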


5.2.5 Application of sequence-based models

In 2015, Zhou J, et al. developed a deep learning-based algorithmic framework, DeepSEA,

which predicted noncoding-variant effects de novo from sequence [180]. The output consisted of

919 chromatin features (TF binding, chromatin accessibility, and histone marks) from the ENCODE and Roadmap

Epigenomics databases. They considered 2,608,181 sample regions with at least one epigenomic

event. The features were 1000 bp sequences of those regions. Their model was able to predict the

chromatin landscapes of 919 features with single-nucleotide sensitivity.

Then in 2018, the same group developed an updated deep learning framework, called

ExPecto, to predict gene expression from sequences alone [187]. Here, Zhou J, et al. considered

200 spatial bins of 200 bp (40 kb region in total) centered at the promoters of genes. Like DeepSEA,

Zhou J, et al. also predicted chromatin features in these regions, but instead of 919

chromatin features, they considered 2002 genome-wide TF binding, chromatin accessibility, and

histone marks. Due to the large number of features, the second component of ExPecto reduced the

dimensionality of the 200 spatial bins of 2002 chromatin features (400,400 features total) with ten

exponential functions. More importantly, Zhou J, et al. used the spatially transformed features to predict gene expression levels of Pol II-transcribed genes using a gradient boosting algorithm.
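To make the dimensionality reduction concrete, the sketch below collapses per-bin chromatin scores into a small number of distance-weighted sums using exponential decay functions. It assumes five decay rates applied separately upstream and downstream of the TSS bin (ten functions in total); the specific rates, the treatment of the TSS bin as downstream, and the function name `spatial_transform` are illustrative assumptions, not details taken from the text above. Under these assumptions, 200 bins x 2002 features (400,400 values) would reduce to 10 x 2002 = 20,020 values.

```python
import math


def spatial_transform(bin_scores, tss_bin, decay_rates=(0.01, 0.02, 0.05, 0.1, 0.2)):
    """Reduce per-bin feature vectors to 2 * len(decay_rates) weighted sums
    per feature. `bin_scores` is a list of equal-length feature vectors,
    one per spatial bin; the decay rates are illustrative values."""
    n_features = len(bin_scores[0])
    out = []
    for side in ("upstream", "downstream"):
        for rate in decay_rates:
            sums = [0.0] * n_features
            for i, scores in enumerate(bin_scores):
                is_upstream = i < tss_bin
                if (side == "upstream") != is_upstream:
                    continue  # this bin belongs to the other side
                weight = math.exp(-rate * abs(i - tss_bin))
                for f, s in enumerate(scores):
                    sums[f] += weight * s
            out.extend(sums)
    return out


# Toy input: 200 bins with 3 features each instead of 2002.
toy = [[1.0, 0.5, 0.0] for _ in range(200)]
reduced = spatial_transform(toy, tss_bin=100)
```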

5.2.6 Predicting enhancer-promoter interactions

In our previous work (see 4.4.5), we predicted interacting genomic regions using

epigenomic and genomic information. This was done at a 40kb resolution, which is too coarse to

look at enhancer-promoter interactions. Additionally, one limitation of this prediction is that not

all cell lines and tissue types have ChIP-seq data. To further enhance our work and increase the

resolution, we developed a novel two-step model. The first step of the model takes enhancer and promoter sequences and predicts chromatin profiles using the DeepSEA framework previously

developed by Zhou J, et al. [180] and described in 5.2.5. This step completely bypasses the need

for any TF or histone modification ChIP-seq data. The second step takes the predicted chromatin

features from the first step and predicts interaction status (Figure 29).

Figure 29. Overview of our two-step enhancer-promoter interaction prediction. To bypass the need for ChIP-seq data, we developed a novel two-step model where DNA sequences are used to predict chromatin features and then, the predicted chromatin features are used to train our enhancer-promoter interaction model.

5.3 Experimental Design and Methods

5.3.1 Two-step enhancer-promoter interaction prediction in GM12878

Our two-step enhancer-promoter interaction model consists of a chromatin prediction step

and an interaction step (Figure 30). Enhancer and promoter regions were extracted from the

FANTOM5 database. Positive samples were enhancer-promoter pairs with Hi-C reads overlapping

the two regions. Negative samples were non-enhancer-promoter pairs with no overlapping reads or

completely random regions. In total we had 19,230 samples, 6410 of which were positive samples

and 12,820 were negative samples. First, we used the deepsea.beluga.2002.cpu trained model by Zhou,

J. et al. to predict chromatin profiles of 2002 chromatin features [180, 187]. Specifically, this model predicts the chromatin features of 200 bp at a time but requires 900 bp flanking sequences on each side in the prediction. The outputs of this step are chromatin probability scores between 0 and 1 for 2002 chromatin features. Next, we take the predicted chromatin profiles and use them as features to predict enhancer-promoter interactions. In detail, enhancer and promoter sequences were 2000 bp each. Each region was further subdivided into ten 200-bp regions. A distinct chromatin profile

was predicted for each of the ten 200-bp regions. In total, there are 10*2002 chromatin features and

these features were used to model enhancer-promoter interactions. The classification method we

used was gradient boosting machine (or GBM). 70% of the samples were used in training and 30%

were withheld for testing.
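The subdivision described above can be sketched as a coordinate calculation. The example assumes half-open coordinates and that each 200-bp sub-region is scored with 900 bp of flanking sequence on either side (a 2000-bp input window); how flanks were handled at the edges of each region in the actual pipeline is not stated, so this is only one plausible layout. The function name `context_windows` is hypothetical.

```python
def context_windows(region_start, region_len=2000, bin_size=200, flank=900):
    """Split a region into consecutive bins and return, for each bin, the
    (start, end) of the input window: the bin plus `flank` bp on each side.
    Coordinates are half-open, which is an assumed convention."""
    windows = []
    for k in range(region_len // bin_size):
        bin_start = region_start + k * bin_size
        windows.append((bin_start - flank, bin_start + bin_size + flank))
    return windows


windows = context_windows(10000)
# Ten windows of 2000 bp each; with 2002 predicted features per window,
# one region yields 10 * 2002 = 20,020 feature values.
n_features = len(windows) * 2002
```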

5.3.2 Modeling enhancer-promoter interactions in six cell lines

Enhancer and promoter regions were obtained from Whalen, et al [43]. Briefly, TSS-

containing promoter regions and enhancer regions were previously identified using combined

ENCODE Segway and ChromHMM annotations for GM12878, HeLa-S3, HUVEC, IMR90, K562,

and NHEK [178, 179]. Since we were interested in modeling long-range interactions, enhancers

within 10 kb of the nearest promoter were discarded. Only promoters that are actively transcribed

(mean FPKM > 0.3) were retained. Each sample consists of one enhancer and one promoter. To

standardize the length of the annotated promoters and enhancers, each enhancer was standardized

to 401 bp and each promoter was standardized to 2001 bp in length. There were 2113, 1740, 1524,

1254, 1977, and 1291 positively interacting enhancer-promoter pairs for the cell lines GM12878,

HeLa-S3, HUVEC, IMR90, K562, and NHEK, respectively. For negative samples, an equal number

of non-interacting enhancer-promoter pairs were sampled. Additionally, random genomic regions that were neither enhancers nor promoters were also sampled as another set of negative samples. These negative

samples were distance-matched with the positive samples. That is, the distances between

enhancers and promoters in the positive and negative samples followed the same distribution. In total,

there were 6339, 5220, 4572, 3762, 5931, and 3873 samples for the six cell lines.
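The length standardization and the sample totals above can be checked with a short sketch. It assumes that each annotated region is replaced by a fixed-length window centered on its midpoint, with inclusive coordinates; the centering rule and the function name `standardize` are assumptions, since the text does not state how the windows were anchored.

```python
def standardize(start, end, length):
    """Return an inclusive (start, end) window of odd `length` centered
    on the midpoint of the original region (assumed anchoring rule)."""
    mid = (start + end) // 2
    half = length // 2
    return mid - half, mid + half


enhancer = standardize(1000, 1100, 401)   # 401-bp enhancer window
promoter = standardize(5000, 5600, 2001)  # 2001-bp promoter window

# Each cell line has one positive set and two equally sized negative sets,
# so the total is three times the number of positive pairs.
positives = [2113, 1740, 1524, 1254, 1977, 1291]
totals = [3 * p for p in positives]
```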

Gradient boosting machine (GBM) was used to classify whether a sample is a positively

interacting enhancer-promoter pair or not. 80% of the samples were used in training and 20% were

used for testing. The number of trees used was 100, and the maximum depth was 10 per tree. AUC

was used to measure the performance.
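For reference, the AUC can be computed directly from predicted probabilities and true labels as the fraction of positive-negative pairs in which the positive sample receives the higher score, with ties counted as one half. The sketch below uses this pairwise definition for clarity; it is a generic illustration, not the routine used to produce the reported values, and it assumes both classes are present.

```python
def auc(probs, labels):
    """Area under the ROC curve via the pairwise (Mann-Whitney) definition.
    Quadratic in the number of samples; intended only as an illustration."""
    pos = [p for p, y in zip(probs, labels) if y == 1]
    neg = [p for p, y in zip(probs, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


example_auc = auc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])
```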


5.3.3 Identification of important binding motifs using deep learning framework

Since this is a binary classification model, the labels for classification are based on the maximum F1 score threshold (the probability threshold at which the F1 score of the trained model is highest). All samples with a predicted P1 probability less than the maximum F1 score threshold are classified as 0 (or negative). All samples with a predicted P1 probability greater than the maximum F1 score threshold are classified as 1 (or positive). In our

study, we extracted the 21 nt sequence surrounding the position with the highest P1 score change

and used DeepBind to score the binding affinities of 137 human DNA binding proteins. DeepBind

is a deep learning approach developed by Alipanahi, et al. to predict sequence specificities of DNA binding proteins [44]. Their prediction models were trained on a combined 12 terabases of sequence

data from public protein binding microarrays (PBM), RNAcompete, ChIP-seq, and HT-SELEX

experiments [188, 189]. They released their trained DeepBind models for 927 distinct transcription

factors and RNA binding proteins in humans and other species.

The DNA binding proteins with the highest binding affinity scores were counted. We also

selected positions with classification labeling changes. That is, after removing 1 bp, the predicted

P1 probability decreased or increased past the F1 score threshold, yielding a different classification

result.
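The thresholding rule used throughout this section can be sketched as follows. The example scans the observed probabilities as candidate thresholds and keeps the one with the highest F1 score. Treating a probability equal to the threshold as positive and breaking ties in favor of the lowest threshold are assumptions made for this sketch, and the function names are hypothetical.

```python
def f1_at(threshold, probs, labels):
    """F1 score when samples with probability >= threshold are called positive."""
    tp = sum(1 for p, y in zip(probs, labels) if p >= threshold and y == 1)
    fp = sum(1 for p, y in zip(probs, labels) if p >= threshold and y == 0)
    fn = sum(1 for p, y in zip(probs, labels) if p < threshold and y == 1)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def max_f1_threshold(probs, labels):
    """Return the candidate threshold with the highest F1 score.
    Candidates are the observed probabilities; on ties the lowest wins."""
    return max(sorted(set(probs)), key=lambda t: f1_at(t, probs, labels))


probs = [0.1, 0.4, 0.35, 0.8]
labels = [0, 0, 1, 1]
threshold = max_f1_threshold(probs, labels)
predicted = [1 if p >= threshold else 0 for p in probs]
```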

5.4 Results

5.4.1 Classification of enhancer-promoter interactions using chromatin features

Our two-step enhancer-promoter interaction model encompasses a chromatin prediction

step and an interaction step (Figure 30). First, sequences of 2000 bp are inputted into a deep

convolutional network, called DeepSEA, developed by Zhou, J. et al. [180, 187]. Originally,

DeepSEA only predicted 919 chromatin features, but the updated version predicts 2002 chromatin

features. Specifically, this model predicts the chromatin features of 200 bp at a time but uses a total

number of 2000 bp in the prediction. The output of DeepSEA is a chromatin feature score between

0 and 1. Next, we take the predicted chromatin profiles and use these as features to predict

enhancer-promoter interactions. We used the classification method called gradient boosting machine (or GBM) to classify whether two regions are interacting or not. Our AUC is 0.7962. The most important chromatin features are ranked in Figure 31. We tested 15 SNPs that are likely to affect chromatin feature binding; however, none changed the classification results.

Figure 30. Detailed view of our two-step enhancer-promoter interaction model. Genomic sequences are inputted into trained DeepSEA networks to generate predicted chromatin profiles for 2002 chromatin features, including TF and histone modification across multiple cell lines and tissue types. Then the predicted chromatin features are used to model enhancer and promoter interaction status.

A portion of this figure was adapted from Zhou, J. et al. [180].


Figure 31. Variable importance plot for enhancer-promoter interactions. Top chromatin features are listed along with their relative importance in our two-step enhancer-promoter interaction model.

5.4.2 Classification of enhancer-promoter interactions using sequence information

Another way to model enhancer-promoter interactions is to use DNA sequence directly in the classification. Here, we built a more straightforward model using DNA sequences. First, we obtained positively interacting enhancer-promoter samples in GM12878, HeLa-S3, HUVEC,

IMR90, K562, and NHEK, all previously annotated by Whalen, et al. [43]. The location of one interacting enhancer-promoter pair is shown in Figure 32. For negative samples, we randomly selected an equal number of non-interacting enhancer-promoter pairs and an equal number of completely random regions. We used gradient boosting machine (or GBM) as our classifier. 80% of the samples were used for training and 20% were withheld for testing. The sample sizes, along with five-fold cross-validation accuracy and testing AUCs, are shown in Table 11. The ROC curves for our testing sets are shown in Figure 33.


Figure 32. Genome browser view of an interacting enhancer-promoter pair. The genomic position of a positively interacting enhancer-promoter pair is shown. The 401 bp enhancer region (chr1:9685861-9686261) and the 2 kb promoter region (chr1:9747402-9749402) are indicated with red and blue rectangles, respectively. This interacting pair is found in GM12878.


Table 11. Enhancer-promoter interaction prediction details and results. Enhancer-promoter pairs from six different cell lines were individually modeled. The number of positive and negative samples are listed. 80% of the samples were used in training. The 5-fold cross validation AUC is shown. 20% of the samples were used for testing and the testing AUCs are also listed.

Figure 33. ROC curves for enhancer-promoter interaction results. We modeled enhancer-promoter interactions across multiple cell lines using a GBM model. The ROC curves plot the true positive rate against the false positive rate.


5.4.3 Systematically identifying important positions

To test each position’s importance in classifying interacting and non-interacting samples, we systematically removed each position. That is, for every sample, one base pair was removed at a time. Then, each altered sample was classified using our previously trained

model, which was trained on all samples. Changes in P1 probability scores were compared to the

original samples’ P1 probability scores. Predicted P1 probability scores inform us which label to

give the sample. If the predicted P1 probability score is less than the maximum F1 score threshold,

then the sample will be labeled as 0 or negative. If the predicted P1 probability score is greater than the maximum F1 score threshold, then the sample will be labeled as 1 or positive. The F1 score at each threshold

is calculated from confusion matrix values (true positives, false positives, and false negatives).

In total, there were 15,226,278 (6339 samples x 2402 bp) perturbed samples for GM12878,

12,538,440 for HeLa-S3, 10,981,944 for HUVEC, 9,036,324 for IMR90, 14,246,262 for K562, and

9,302,946 for NHEK. The changes in P1 score after removing one base at a time for one sample are shown in Figure 34.
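The perturbation procedure above can be sketched as a single-base deletion scan. The scoring function below is a hypothetical stand-in for the trained classifier (it simply reports whether a fixed motif is present), so the numbers it produces are illustrative only; the scan itself follows the description in the text, removing one base at a time and recording the change in score relative to the unperturbed sequence.

```python
def deletion_scan(seq, score):
    """Remove each position in turn and return the change in score
    (perturbed minus original) for every position."""
    base = score(seq)
    return [score(seq[:i] + seq[i + 1:]) - base for i in range(len(seq))]


def toy_score(seq):
    """Hypothetical stand-in for the model's P1 probability:
    1.0 if the motif TATA is present, otherwise 0.0."""
    return 1.0 if "TATA" in seq else 0.0


deltas = deletion_scan("GGTATAGG", toy_score)

# Number of perturbed samples for GM12878: one per sample per position,
# with 401 + 2001 = 2402 positions per enhancer-promoter pair.
n_perturbed = 6339 * (401 + 2001)
```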


Figure 34. P1 probability score changes for one sample. Each position was systematically removed, and the perturbed sample was inputted into our trained model. The new P1 scores were compared to the P1 score of the original unperturbed sample. The differences in P1 scores between the perturbed and original samples are shown. The red (left) indicates the enhancer sequences and the blue (right) indicates the promoter sequences. A portion of the figure is magnified for clarity.


5.4.4 POLR2A, HDAC1, and TRIM28 are important

Nucleotide positions with the largest change in P1 probability score may have an important biological function or meaning. For each sample, we extracted the position with the highest change in P1 probability score. We also extracted its flanking sequences (10 bp upstream and 10 bp downstream), yielding a final length of 21 bp. Then, we used Alipanahi et al.’s DeepBind tool to identify the DNA-binding protein with the strongest binding affinity to our 21-bp samples [44]. The most recurrent DNA binding proteins for the 21-bp regions were POLR2A, TRIM28, and HDAC1. The ranking of DNA binding proteins for all six cell lines is shown in Figure 35.
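The window extraction step can be sketched as follows. The example picks the position with the largest absolute change in score and returns it with 10 bp on each side. Using the absolute value of the change and truncating the window at sequence ends are assumptions made for this sketch; the DeepBind scoring step itself is not reproduced here, and the function name `top_window` is hypothetical.

```python
def top_window(seq, deltas, flank=10):
    """Return (position, window) for the position with the largest absolute
    score change, where window spans `flank` bp on each side. Near the ends
    of the sequence the window is truncated (assumed behavior)."""
    pos = max(range(len(deltas)), key=lambda i: abs(deltas[i]))
    start = max(0, pos - flank)
    return pos, seq[start:pos + flank + 1]


seq = "A" * 15 + "C" + "G" * 15
deltas = [0.0] * 15 + [-0.3] + [0.0] * 15
pos, window = top_window(seq, deltas)
```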

Figure 35. Top 20 most recurrent DNA binding proteins. Sequences flanking the position with the highest P1 score change were inputted into DeepBind and the DNA binding protein with the strongest binding affinity was extracted. Top recurrent DNA binding proteins are ranked from most common to least.

5.4.5 Identification of SNPs that affect classification accuracy

Next, we wanted to know which position can cause a change in the classification result. In other words, we wanted to determine which position(s), when removed, can change the classification prediction from true positive to false negative, or true negative to false positive. To do so, we looked at the classification results of all perturbed samples. For GM12878, we found

6896 classification changes, all of which were from positive samples to negative samples. Considering just the 2113 positive samples, 0.14% of positions were important for accurate classification of enhancer-promoter interactions.

We queried whether these positions were also SNPs using the Kaviar database [190].

Kaviar (~Known VARiants) is a compilation of SNVs, indels, and complex variants observed in humans. It contains 162 million SNV sites and incorporates data from 35 projects encompassing

77,781 individuals. Out of the 6896 positions that affect the classification accuracy, 238 were located in a known SNP or SNV region. The results for all six cell lines are shown in Table 12.
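The GM12878 percentages quoted above follow from simple arithmetic, shown here as a quick check. The only assumption is that the denominator for the 0.14% figure is the total number of positions across the 2113 positive samples (2402 positions each).

```python
def percent(part, whole):
    """Express `part` as a percentage of `whole`."""
    return 100.0 * part / whole


positions_per_sample = 401 + 2001            # enhancer + promoter length
total_positions = 2113 * positions_per_sample

pct_label_changing = percent(6896, total_positions)  # share of positions that flip the label
pct_known_variant = percent(238, 6896)               # share of those found in Kaviar
```

Rounded to two decimals, these evaluate to about 0.14% and 3.45%, respectively.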

Table 12. Small percentage of positions leads to a classification labeling change. After perturbation of each bp position in every sample, a small percentage of perturbed samples led to a change in the classification outcome of the sample. Out of these positions, a small number were in an annotated SNP region as queried with Kaviar.

5.5 Discussion

Distal enhancers regulate promoter regions by forming enhancer-promoter loops. Recent technologies such as high-throughput chromatin conformation capture techniques have led to the direct identification of enhancer-promoter pairs [42]. However, these techniques are costly and time consuming. A growing number of studies have developed computational methods to extract vast information from our DNA sequence [44, 180]. In our study, we developed two methods to study and classify enhancer-promoter pairs using sequence information. In our first method, we first extracted chromatin profile properties from DNA sequences using a modified version of DeepSEA [180, 187].

Then using the predicted chromatin profiles, we classified enhancer-promoter samples as interacting or non-interacting. Although this method can classify our samples with an AUC of almost 0.80,

it was not sensitive to a single nucleotide change. This is, in part, due to the multi-step process, which loses information in each step. Additionally, 200-bp sub-regions were used as an

input to predict chromatin profiles, and then chromatin profiles of the ten 200-bp regions were used

to predict interaction status. In this model, a single nucleotide change did not have a large enough

effect to affect the classification results.

Consequently, we developed another model which uses DNA sequences directly in the prediction of enhancer-promoter interactions. We applied our model on six cell lines and obtained

AUCs between 0.9246 and 0.9748. One limitation of using DNA sequence information directly in

the prediction model is that it is difficult to draw meaningful conclusions. Additional steps need to be taken to extract meaning from the sequence. To achieve this goal, we systematically evaluated

the importance of each position by removing 1 bp at a time and running the new samples through

our trained model. P1 probability score changes between the new and original sequences were used

to evaluate the degree of importance each position has. Then, we extracted the 21-nucleotide

sequence surrounding the position with the greatest change in P1 probability scores and used

DeepBind to identify which DNA binding protein has the greatest affinity to the sequence.

Among the top DNA binding proteins that bound to the 21 nucleotide sequences were

POLR2A, TRIM28, and HDAC1. POLR2A is the largest subunit of RNA polymerase II, which is

the polymerase directly responsible for RNA transcription. TRIM28, or transcription intermediary

factor 1-beta, also mediates transcription control. TRIM28 interacts with a variety of effector proteins, including heterochromatin protein 1, histone deacetylases, such as HDAC1, and DNA

methyltransferases, such as DNMT1, DNMT3a, and DNMT3b [191-195].

HDAC1 modulates chromatin structure and transcription [196, 197]. Histone deacetylases

(HDACs) are essential for core-regulatory TF transcription [198]. HDACs catalyze removal of the

acetyl group from lysine residues in histones, which restores their positive charge and permits

interactions with the negatively charged DNA molecules. In general, HDACs cause transcription

repression by closing or condensing the chromatin structure. However, in other studies, formation of chromatin loops between enhancers and promoters is also associated with extensive recruitment of HDACs, indicating that this class of corepressor may play a previously unrecognized role during enhancer activation [199]. Given the functions of POLR2A, TRIM28, and HDAC1, it is not surprising that these three transcription-related DNA binding proteins ranked the highest in our analysis.

When we considered sequence positions that led to classification labeling changes, we identified several positions that led to a classification change from positive to negative when removed. Among the small percentage of sequences that led to a change in classification labeling, a total of 510 were located at a SNP. Further analysis needs to be conducted to determine the functional importance of these positions.

In conclusion, DNA sequence harbors information that can be extracted to predict interacting enhancer-promoter pairs. Using two newly developed models, we predicted interacting enhancer-promoter regions across multiple cell lines with AUCs of up to 0.97. Deeper analysis revealed strong binding affinities of POLR2A, TRIM28, and HDAC1 transcription-regulating

DNA binding proteins to important sequence positions. Our sequence-based models can be applied to patient genomes to study their enhancer-promoter interactions without the need to generate expensive Hi-C data.


CHAPTER 6

Summary and Discussion


6 Summary and Discussion

6.1 Cascade Propagation approach re-stratified HNSCC patients

HNSCC comprises extremely heterogeneous cancers of the oral cavity, oropharynx, larynx, or

hypopharynx. As in many other cancers, the immune system plays an important role in the progression of HNSCC. To better understand HNSCC development and recurrence, we developed

a novel immune-signaling-based Cascade Propagation (CasP) subtyping approach to stratify

HNSCC patients. Our approach is a multi-step stratification approach composed of a dynamic

network tree cutting step followed by a somatic-mutational stratification step. It uses patient

genomic data, such as gene expression and mutation profiles, patient clinical and survival

information, as well as prior signaling and protein-protein interaction information to stratify

HNSCC patients into more biologically interpretable and clinically meaningful subtypes.

In our work, we found that activation of T cell receptor signaling pathways and other

important cancer related pathways such as PI3K and JAK/STAT were useful in the stratification of

HNSCC patients. Further stratification using mutational data identified two distinct molecular profiles with different survival outcomes. One group had mutations in genes associated with TLR

(innate immune system) signaling pathways while the other group had mutations in T cell and B

cell receptor signaling pathways (adaptive immune system). The former group had better survival

than the latter group. This suggests that a functional adaptive immune system is important for

HNSCC survival and that the innate immune system may not play a vital role in the

detection and clearance of cancer cells or in patient survival.

In conclusion, our novel CasP approach stratified HNSCC patients into survival-distinct

and clinically meaningful subtypes. Our method can facilitate the discovery of relevant

druggable targets in HNSCC patients.


6.2 SNP variant regulates SOX2 modulation of VDAC3 in LSCC patients

LSCC is another genomically heterogeneous cancer with no targeted therapy. Recent studies have found amplification or overexpression of the stem cell transcription factor SOX2 in over 60% of LSCC patients. SOX2 also plays an important role in the progression and development of LSCC; however, the downstream mechanisms of SOX2 are unclear. As a transcription factor,

SOX2 binds to DNA and regulates the transcription and expression of target genes. However, SNPs may affect the binding affinity of SOX2 to regulatory elements. To fully distinguish which genes

SOX2 targets, we employed independent-eQTL analysis. In other words, we first subgrouped

LSCC patients based on their SOX2 activity (gene expression, copy number variation, and methylation), and then conducted separate eQTL analyses for each subgroup. By doing so, we determined SOX2-specific SNP-to-gene associations. We found that SNP rs798827 is correlated with the expression of the VDAC3 gene in SOX2-active patients but not in SOX2-inactive patients. This discovery could only be made when SOX2 activity was considered first. Our pipeline and methodology can be applied to the study of any transcription factor in other diseases and cancers.
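The subgroup-then-test logic described above can be illustrated with a minimal sketch. It is not the pipeline used in this work: the patient records, the field names (`tf_active`, `genotype`, `expression`), and the use of a plain Pearson correlation in place of a full eQTL test are all assumptions made for illustration, and the toy values are invented.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # no variation, so no measurable association
    return sxy / sqrt(sxx * syy)


def independent_eqtl(patients):
    """Split patients by transcription factor activity, then test the
    SNP-to-expression association separately within each subgroup."""
    groups = {}
    for p in patients:
        groups.setdefault(p["tf_active"], []).append(p)
    return {
        active: pearson_r(
            [m["genotype"] for m in members],    # minor-allele count: 0, 1, or 2
            [m["expression"] for m in members],  # target gene expression
        )
        for active, members in groups.items()
    }


# Hypothetical toy cohort: the SNP tracks expression only in TF-active patients.
toy_patients = [
    {"tf_active": True, "genotype": 0, "expression": 1.0},
    {"tf_active": True, "genotype": 1, "expression": 2.1},
    {"tf_active": True, "genotype": 2, "expression": 2.9},
    {"tf_active": True, "genotype": 1, "expression": 1.9},
    {"tf_active": False, "genotype": 0, "expression": 1.5},
    {"tf_active": False, "genotype": 1, "expression": 1.4},
    {"tf_active": False, "genotype": 2, "expression": 1.5},
    {"tf_active": False, "genotype": 1, "expression": 1.6},
]

result = independent_eqtl(toy_patients)
print(result)
```

In this toy cohort the association is strong in the active subgroup and absent in the inactive one, which is the kind of pattern a pooled analysis could dilute.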

6.3 Chromatin organization alterations lead to oncogene overexpression in breast cancer

Chromatin organization is important in the regulation of gene expression and is often perturbed in cancer genomes. In breast cancer cells, there are chromatin structure changes such as

creation of new TAD boundaries and destruction of normal TAD boundaries. TAD boundaries can be characterized by a combination of multiple epigenomic and genomic elements and previous studies have already started to model TAD boundaries using CTCF and histone modification elements. In our work, we modeled TAD boundaries of normal breast and breast cancer cell lines.

We used a wide range of features, from ChIP-seq data to transcription factor binding sites to DNA methylation in our model. We found that chromosome number was the most informative feature.

We suspect the uneven distribution of genes among the 23 pairs of chromosomes is the reason why chromosome number was important.


When modeling TAD boundary changes (gain, loss, or conserved) between normal and

breast cancer cell lines, we found that the subunits of cohesin, RAD21 and SMC3, were among the

most important features in distinguishing between conserved and altered TAD boundaries. We also

found that H3K4me1 is enriched in conserved TAD boundaries. H3K4me1 interferes with DNA

methyltransferase binding and may protect TAD boundary regions from being targeted for DNA

methylation. When modeling large scale genomic interactions, we discovered that regions closer

in distance are more likely to interact, which is consistent with other findings. Additionally,

interacting regions shared more similar chromatin profiles than non-interacting regions. Features

such as CTCF, SMARCA5, SMARCE1, FOXA1, and H3K4me1 were among the most important

features in distinguishing between interacting and non-interacting genomic regions.

We also studied the downstream effects of perturbed TAD boundaries by analyzing genes

near TAD boundary alterations. We found that genes near perturbed TAD boundaries were

involved in a number of breast cancer-related pathways such as estrogen signaling, Ras, and

Jak-STAT. In the end, we identified one key gene that is putatively affected by the loss of a nearby

TAD boundary. RET is significantly overexpressed in breast cancer cells compared to normal cells.

It is also overexpressed in breast cancer patients. Overexpression of RET is associated with a poorer prognosis of breast cancer.

In conclusion, chromatin structures, such as TADs and TAD boundaries, can be accurately modeled using our tool. Additionally, chromatin structure alterations affected gene expression, activated oncogenic signaling pathways, and led to the eventual disease state of breast cancer cells.

6.4 Sequence-based modeling of enhancer and promoter interactions

Enhancer and promoter interactions are important in the activation and transcription of

target genes. Our previous work studied chromatin structures and large-scale interactions at the 40

kb resolution. We further enhanced our work by studying enhancer and promoter regions

specifically. We also bypassed the limitation of needing cell line specific ChIP-seq and other

experimentally generated data in our models. Here, we developed two new computational models.

Both models use sequence data as the input. The first model consists of two steps. First, sequence data is used to predict ChIP-seq information, histone modification, etc. in a generalized manner.

Then the predicted chromatin profiles are used to predict enhancer-promoter interactions. While this method classifies enhancer-promoter pairs with almost 80% accuracy, it was not sensitive to single-nucleotide changes. This is due to its multi-step nature: some information is lost in the chromatin feature prediction step. The second model we developed uses DNA sequences directly in the prediction. This method predicts interacting enhancer-promoter pairs with 92-97% accuracy across multiple cell lines. Moreover, it detects classification changes when single nucleotides are removed.
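The single-nucleotide removal test mentioned above can be sketched as a deletion scan. This is an illustration only: `toy_classifier` is a hypothetical stand-in that looks for a fixed motif, not the sequence-based model developed in this work, and the example sequence is invented.

```python
def deletion_scan(sequence, classify):
    """Remove each nucleotide in turn and record the positions whose
    removal changes the classifier's label relative to the intact sequence."""
    baseline = classify(sequence)
    flipped = []
    for i in range(len(sequence)):
        mutated = sequence[:i] + sequence[i + 1:]  # sequence without position i
        if classify(mutated) != baseline:
            flipped.append(i)
    return baseline, flipped


def toy_classifier(sequence):
    """Hypothetical stand-in for a trained model: labels a sequence
    positive when it contains the motif TATAAA."""
    return "TATAAA" in sequence


seq = "GGCTATAAAGGC"
baseline, flipped = deletion_scan(seq, toy_classifier)
print(baseline, flipped)
```

Any callable that maps a sequence to a label can be passed as `classify`, so the same loop would apply to a trained model's prediction function.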

In addition, using DeepBind, we detected several DNA binding proteins with high affinity to important sequence positions. We discovered that transcription regulating proteins such as

POLR2A, TRIM28, and HDAC1 were among the top DNA binding proteins for our sequence samples. In conclusion, with the help of powerful computational models, we can extract meaningful information from DNA sequences to study and predict genomic interactions, regulation, and transcription of genes. We can apply our models to patient genomes without the need to generate costly and time-consuming Hi-C, RNA-seq, and ChIP-seq data.


6.5 Future Directions

The DNA sequence contains an immense amount of information that we have yet to fully understand. Powerful computational and analytical tools now make it possible to extract patterns from our genome. Sequence-based models such as the ones developed in Chapter 5 can be modified and applied to other biological studies. For example, sequence-based models can be used to predict mRNA polyadenylation sites. Alternative polyadenylation (APA) is a post-transcriptional RNA processing mechanism that plays a major role in mRNA stability, translocation, and expression. The polyadenylation cleavage site is determined by several factors, including the presence of a polyadenylation signal hexamer motif, flanking U-rich and U-/GU-rich sequences, and the expression of polyadenylation-related genes. We can generate a sequence-based model that determines whether certain mutations or alterations in the sequences will affect APA in disease-related genes. Additionally, all our developed models, pipelines, and methodologies can be applied to other cancers and diseases and used to answer other biological questions.
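As a much simpler baseline than the sequence-based model proposed above, the effect of a mutation on the polyadenylation signal hexamer can be checked by direct motif matching. The sketch below assumes the canonical hexamer AAUAAA (AATAAA in DNA) and ignores the flanking U-rich and GU-rich elements and the expression of polyadenylation-related genes; the function names and the example sequence are hypothetical.

```python
CANONICAL_PAS = "AATAAA"  # canonical polyadenylation signal hexamer (DNA alphabet)


def find_pas(sequence, motif=CANONICAL_PAS):
    """Return the 0-based start positions of the hexamer in the sequence.
    RNA input is accepted by converting U to T."""
    seq = sequence.upper().replace("U", "T")
    width = len(motif)
    return [i for i in range(len(seq) - width + 1) if seq[i:i + width] == motif]


def pas_lost_by_substitution(sequence, position, alt_base):
    """Return the hexamer start positions that are present in the reference
    sequence but disappear after a single-base substitution."""
    before = set(find_pas(sequence))
    mutated = sequence[:position] + alt_base + sequence[position + 1:]
    after = set(find_pas(mutated))
    return sorted(before - after)


reference = "CCAATAAAGGTGTTT"
print(find_pas(reference))                          # hexamer start positions
print(pas_lost_by_substitution(reference, 4, "G"))  # substitution inside the hexamer
print(pas_lost_by_substitution(reference, 10, "A"))  # substitution outside the hexamer
```

A learned model would be needed to capture weaker signal variants and context effects; exact matching only flags disruption of the canonical motif.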


CURRICULUM VITAE

Jacqueline Chyr 786.838.3570

Education

Ph.D Wake Forest School of Medicine, Cancer Biology May 2020 Dissertation: Bioinformatics Analysis and Computational Biology Modeling of Genetic and Epigenetic Elements Dictating Chromatin Organization, Transcription Regulation, and Disease State in Cancer Genomes Advisor: Dr. Xiaobo Zhou, PhD Committee: Gregory A. Hawkins, Ph.D. (Chair) William Gmeiner, Ph.D. Lance D. Miller, Ph.D. Kounosuke Watabe, Ph.D.

BS University of Miami, Microbiology and Immunology Dec 2012 Graduated Magna Cum Laude, Honors Minor in Chemistry

Work Experience

Research Assistant I, University of Texas Health Science Center, School of Biomedical Informatics, Houston, TX, Jul 2017 - present

Research Experience

Dr. Bonnie Blomberg’s Lab – Microbiology and Immunology Aug 2012 – Dec 2012 Univ. of Miami’s Miller Medical Campus, Miami, FL • Honors Thesis for Departmental Honors • Studied the effects of aging on the immune response of B cells against the influenza vaccine and established biomarkers that predict vaccine effectiveness in elderly individuals

Dr. Dale Ramsden’s Lab – Biochemistry and Biophysics Summer 2012 UNC Chapel Hill, Chapel Hill, NC • Summer Undergraduate Research Experience (SURE) REU Program • Characterized POLQ’s role in double strand break repair. • Poster Presentation – 1st Place award (Jul 2012)

Dr. Leonidas Bachas’ Lab – Chemistry Jan 2011 – May 2012 University of Miami, Coral Gables, FL • Work Study Program • Analyzed the selectivity of poly(vinyl chloride)-based cation-selective electrodes with different molar ratios of ionophores. • Characterized aequorin’s bioluminescence properties by truncating the protein at different locations to produce different bioluminescent properties.

Dr. Jeffery Plunkett’s Lab – Biochemistry Aug 2009 – May 2010 St. Thomas University, Miami Gardens, FL • Honors Executive Internship Program • Developed and characterized a novel primary culture system to study zebrafish neuron regeneration after spinal cord injury.

Dr. Richard Myer’s Lab – Biochemistry and Molecular Biology Summer 2009 Univ. of Miami’s Miller Medical Campus, Miami, FL • Howard Hughes Medical Institute (HHMI) summer research program • Genetically altered the lambda gene that encodes for the SynExo protein in E. coli to test its efficiency in homologous recombination.

Medical Experience

Intern, Emergency Room at Aventura Hospital 2009 – 2010

Volunteer, Pharmacy Department, Palmetto General Hospital Summer of 2007 and 2008

Publications

Chyr J, Zhang Z, Zhou X. Chromatin organization alterations lead to oncogene overexpression in breast cancer. 2020. Under review.

Li Z, Jia Z, Chyr J, Wang L, Hu X, Wu X, Song C. Identification of hub genes associated with hypertension and their interaction with miRNA based on WGCNA analysis. 2020. Under review.

Peak TC, Panigrahi GK, Praharaj PP, Su Y, Shi L, Chyr J, Rivera-Chávez J, Flores-Bocanegra L, Singh R, Vander Griend DJ, Oberlies NH, Kerr BA, Hemal A, Bitting RL, Deep G. Syntaxin 6-mediated exosome secretion regulates enzalutamide resistance in prostate cancer. Molecular Carcinogenesis. 2019. PMID: 31674708.

Peak TC, Su Y, Chapple AG, Chyr J, Deep G. Syntaxin 6: A novel predictive and prognostic biomarker in papillary renal cell carcinoma. Scientific Reports. 2019; 9(1):3146. PMID: 30816681.

Liu C, Chyr J, Zhao W, Xu Y, Ji Z, Tan H, Soto C, Zhou X. Genome-Wide Association and Mechanistic Studies Indicate That Immune Response Contributes to Alzheimer's Disease Development. Frontiers in Genetics. 2018. 9:410. PMID: 30319691.

Chyr J, Guo D, Zhou X. LSCC SNP variant regulates SOX2 modulation of VDAC3. Oncotarget. 2018. 9(32):22340-52. PMID: 29854282.

Liu K, Chyr J, Zhao W, Zhou X. Immune signaling-based Cascade Propagation approach re-stratifies HNSCC patients. Methods. 2016; 111:72-9. PMID: 27339942.


REFERENCES

1. Gloeckler Ries, L.A., M.E. Reichman, D.R. Lewis, B.F. Hankey, and B.K. Edwards, Cancer survival and incidence from the Surveillance, Epidemiology, and End Results (SEER) program. The oncologist, 2003. 8(6): p. 541-552.
2. Siegel, R.L., K.D. Miller, and A. Jemal, Cancer statistics, 2019. CA: a cancer journal for clinicians, 2019. 69(1): p. 7-34.
3. Jemal, A., R. Siegel, E. Ward, Y. Hao, J. Xu, and M. Thun, Cancer statistics. Ca Cancer J Clin, 2009. 59(4).
4. Petitjean, A., E. Mathe, S. Kato, C. Ishioka, S.V. Tavtigian, P. Hainaut, and M. Olivier, Impact of mutant p53 functional properties on TP53 mutation patterns and tumor phenotype: lessons from recent developments in the IARC TP53 database. Human mutation, 2007. 28(6): p. 622-629.
5. Leis, O., A. Eguiara, E. Lopez-Arribillaga, M. Alberdi, S. Hernandez-Garcia, K. Elorriaga, A. Pandiella, R. Rezola, and A. Martin, Sox2 expression in breast tumours and activation in breast cancer stem cells. Oncogene, 2012. 31(11): p. 1354-1365.
6. Ford, D., D.F. Easton, D.T. Bishop, S.A. Narod, and D.E. Goldgar, Risks of cancer in BRCA1-mutation carriers. The Lancet, 1994. 343(8899): p. 692-695.
7. Boffetta, P., Human cancer from environmental pollutants: the epidemiological evidence. Mutation Research/Genetic Toxicology and Environmental Mutagenesis, 2006. 608(2): p. 157-162.
8. Couzin-Frankel, J., Cancer immunotherapy. 2013, American Association for the Advancement of Science.
9. Croft, M., The role of TNF superfamily members in T-cell function and diseases. Nature Reviews Immunology, 2009. 9(4): p. 271-285.
10. Leemans, C.R., B.J. Braakhuis, and R.H. Brakenhoff, The molecular biology of head and neck cancer. Nature Reviews Cancer, 2011. 11(1): p. 9-22.
11. Jemal, A., F. Bray, M.M. Center, J. Ferlay, E. Ward, and D. Forman, Global cancer statistics. CA: a cancer journal for clinicians, 2011. 61(2): p. 69-90.
12. Varilla, V., J. Atienza, and C.A. Dasanu, Immune alterations and immunotherapy prospects in head and neck cancer. Expert opinion on biological therapy, 2013. 13(9): p. 1241-1256.
13. Chaturvedi, A.K., W.F. Anderson, J. Lortet-Tieulent, M.P. Curado, J. Ferlay, S. Franceschi, P.S. Rosenberg, F. Bray, and M.L. Gillison, Worldwide trends in incidence rates for oral cavity and oropharyngeal cancers. Journal of Clinical Oncology, 2013. 31(36): p. 4550-4559.
14. Yu, H., M. Kortylewski, and D. Pardoll, Crosstalk between cancer and immune cells: role of STAT3 in the tumour microenvironment. Nature Reviews Immunology, 2007. 7(1): p. 41-51.
15. Yang, L., Y. Pang, and H.L. Moses, TGF-β and immune cells: an important regulatory axis in the tumor microenvironment and progression. Trends in immunology, 2010. 31(6): p. 220-227.
16. Sturgis, E.M., Q. Wei, and M.R. Spitz. Descriptive epidemiology and risk factors for head and neck cancer. in Seminars in oncology. 2004. Elsevier.
17. Talamini, R., C. Bosetti, C. La Vecchia, L. Dal Maso, F. Levi, E. Bidoli, E. Negri, C. Pasche, S. Vaccarella, and L. Barzan, Combined effect of tobacco and alcohol on laryngeal cancer risk: a case–control study. Cancer Causes & Control, 2002. 13(10): p. 957-964.
18. Network, C.G.A.R., Comprehensive genomic characterization of squamous cell lung cancers. Nature, 2012. 489(7417): p. 519.
19. Campbell, J.D., A. Alexandrov, J. Kim, J. Wala, A.H. Berger, C.S. Pedamallu, S.A. Shukla, G. Guo, A.N. Brooks, and B.A. Murray, Distinct patterns of somatic genome alterations in lung adenocarcinomas and squamous cell carcinomas. Nature genetics, 2016. 48(6): p. 607.
20. Sholl, L.M., J.A. Barletta, B.Y. Yeap, L.R. Chirieac, and J.L. Hornick, Sox2 protein expression is an independent poor prognostic indicator in stage I lung adenocarcinoma. The American journal of surgical pathology, 2010. 34(8): p. 1193.
21. Bass, A.J., H. Watanabe, C.H. Mermel, S. Yu, S. Perner, R.G. Verhaak, S.Y. Kim, L. Wardwell, P. Tamayo, and I. Gat-Viks, SOX2 is an amplified lineage-survival oncogene in lung and esophageal squamous cell carcinomas. Nature genetics, 2009. 41(11): p. 1238-1242.
22. Hussenet, T., S. Dali, J. Exinger, B. Monga, B. Jost, D. Dembelé, N. Martinet, C. Thibault, J. Huelsken, and E. Brambilla, SOX2 is an oncogene activated by recurrent 3q26.3 amplifications in human lung squamous cell carcinomas. PloS one, 2010. 5(1): p. e8960.
23. Basu-Roy, U., E. Seo, L. Ramanathapuram, T.B. Rapp, J.A. Perry, S.H. Orkin, A. Mansukhani, and C. Basilico, Sox2 maintains self renewal of tumor-initiating cells in osteosarcomas. Oncogene, 2012. 31(18): p. 2270-2282.
24. Chen, Y., L. Shi, L. Zhang, R. Li, J. Liang, W. Yu, L. Sun, X. Yang, Y. Wang, and Y. Zhang, The molecular mechanism governing the oncogenic potential of SOX2 in breast cancer. Journal of Biological Chemistry, 2008. 283(26): p. 17969-17978.
25. Jia, X., X. Li, Y. Xu, S. Zhang, W. Mou, Y. Liu, Y. Liu, D. Lv, C.-H. Liu, and X. Tan, SOX2 promotes tumorigenesis and increases the anti-apoptotic property of human prostate cancer cell. Journal of molecular cell biology, 2011. 3(4): p. 230-238.
26. Wang, Q., W. He, C. Lu, Z. Wang, J. Wang, K.E. Giercksky, J.M. Nesland, and Z. Suo, Oct3/4 and Sox2 are significantly associated with an unfavorable clinical outcome in human esophageal squamous cell carcinoma. Anticancer research, 2009. 29(4): p. 1233-1241.
27. Ferone, G., J.-Y. Song, K.D. Sutherland, R. Bhaskaran, K. Monkhorst, J.-P. Lambooij, N. Proost, G. Gargiulo, and A. Berns, SOX2 is the determining oncogenic switch in promoting lung squamous cell carcinoma from different cells of origin. Cancer Cell, 2016. 30(4): p. 519-532.
28. Shabalin, A.A., Matrix eQTL: ultra fast eQTL analysis via large matrix operations. Bioinformatics, 2012. 28(10): p. 1353-1358.
29. Dekker, J., K. Rippe, M. Dekker, and N. Kleckner, Capturing chromosome conformation. science, 2002. 295(5558): p. 1306-1311.
30. Belton, J.-M., R.P. McCord, J.H. Gibcus, N. Naumova, Y. Zhan, and J. Dekker, Hi–C: a comprehensive technique to capture the conformation of genomes. Methods, 2012. 58(3): p. 268-276.
31. Lieberman-Aiden, E., N.L. Van Berkum, L. Williams, M. Imakaev, T. Ragoczy, A. Telling, I. Amit, B.R. Lajoie, P.J. Sabo, and M.O. Dorschner, Comprehensive mapping of long-range interactions reveals folding principles of the human genome. science, 2009. 326(5950): p. 289-293.
32. Dixon, J.R., S. Selvaraj, F. Yue, A. Kim, Y. Li, Y. Shen, M. Hu, J.S. Liu, and B. Ren, Topological domains in mammalian genomes identified by analysis of chromatin interactions. Nature, 2012. 485(7398): p. 376.
33. Guo, Y., Q. Xu, D. Canzio, J. Shou, J. Li, D.U. Gorkin, I. Jung, H. Wu, Y. Zhai, and Y. Tang, CRISPR inversion of CTCF sites alters genome topology and enhancer/promoter function. Cell, 2015. 162(4): p. 900-910.
34. Barutcu, A.R., B.R. Lajoie, R.P. McCord, C.E. Tye, D. Hong, T.L. Messier, G. Browne, A.J. van Wijnen, J.B. Lian, and J.L. Stein, Chromatin interaction analysis reveals changes in small chromosome and telomere clustering between epithelial and breast cancer cells. Genome biology, 2015. 16(1): p. 214.


35. Hnisz, D., D.S. Day, and R.A. Young, Insulated neighborhoods: structural and functional units of mammalian gene control. Cell, 2016. 167(5): p. 1188-1200.
36. Rao, S.S., M.H. Huntley, N.C. Durand, E.K. Stamenova, I.D. Bochkov, J.T. Robinson, A.L. Sanborn, I. Machol, A.D. Omer, and E.S. Lander, A 3D map of the human genome at kilobase resolution reveals principles of chromatin looping. Cell, 2014. 159(7): p. 1665-1680.
37. Hong, S. and D. Kim, Computational characterization of chromatin domain boundary-associated genomic elements. Nucleic acids research, 2017. 45(18): p. 10403-10414.
38. Huang, J., E. Marco, L. Pinello, and G.-C. Yuan, Predicting chromatin organization using histone marks. Genome biology, 2015. 16(1): p. 162.
39. Moore, B.L., S. Aitken, and C.A. Semple, Integrative modeling reveals the principles of multi-scale chromatin boundary formation in human nuclear organization. Genome biology, 2015. 16(1): p. 110.
40. Marsman, J. and J.A. Horsfield, Long distance relationships: enhancer–promoter communication and dynamic gene transcription. Biochimica et Biophysica Acta (BBA)-Gene Regulatory Mechanisms, 2012. 1819(11-12): p. 1217-1227.
41. Cao, Q., C. Anyansi, X. Hu, L. Xu, L. Xiong, W. Tang, M.T. Mok, C. Cheng, X. Fan, and M. Gerstein, Reconstruction of enhancer–target networks in 935 samples of human primary cells, tissues and cell lines. Nature genetics, 2017. 49(10): p. 1428.
42. Imakaev, M., G. Fudenberg, R.P. McCord, N. Naumova, A. Goloborodko, B.R. Lajoie, J. Dekker, and L.A. Mirny, Iterative correction of Hi-C data reveals hallmarks of chromosome organization. Nature methods, 2012. 9(10): p. 999.
43. Whalen, S., R.M. Truty, and K.S. Pollard, Enhancer–promoter interactions are encoded by complex genomic signatures on looping chromatin. Nature genetics, 2016. 48(5): p. 488.
44. Alipanahi, B., A. Delong, M.T. Weirauch, and B.J. Frey, Predicting the sequence specificities of DNA-and RNA-binding proteins by deep learning. Nature biotechnology, 2015. 33(8): p. 831-838.
45. Bogard, N., J. Linder, A.B. Rosenberg, and G. Seelig, A deep neural network for predicting and engineering alternative polyadenylation. Cell, 2019. 178(1): p. 91-106.e23.
46. Pignon, J., J. Bourhis, C.o. Domenge, and L. Designe, Chemotherapy added to locoregional treatment for head and neck squamous-cell carcinoma: three meta-analyses of updated individual data. The Lancet, 2000. 355(9208): p. 949-955.
47. Vokes, E.E., R.R. Weichselbaum, S.M. Lippman, and W.K. Hong, Head and neck cancer. New England Journal of Medicine, 1993. 328(3): p. 184-194.
48. Mroz, E.A. and J.W. Rocco, MATH, a novel measure of intratumor genetic heterogeneity, is high in poor-outcome classes of head and neck squamous cell carcinoma. Oral oncology, 2013. 49(3): p. 211-215.
49. Argiris, A., M.V. Karamouzis, D. Raben, and R.L. Ferris, Head and neck cancer. The Lancet, 2008. 371(9625): p. 1695-1709.
50. Fan, J., F. Han, and H. Liu, Challenges of big data analysis. National science review, 2014. 1(2): p. 293-314.
51. Tomczak, K., P. Czerwińska, and M. Wiznerowicz, The Cancer Genome Atlas (TCGA): an immeasurable source of knowledge. Contemporary oncology, 2015. 19(1A): p. A68.
52. Behrman, S. and J. Mazerik, The International Cancer Genome Consortium.
53. Li, H., J. Cao, and J. Xiong. Constructing optimal clustering architecture for maximizing lifetime in large scale wireless sensor networks. in Parallel and Distributed Systems (ICPADS), 2009 15th International Conference on. 2009. IEEE.
54. Network, C.G.A.R., Integrated genomic characterization of endometrial carcinoma. Nature, 2013. 497(7447): p. 67-73.
55. Hoadley, K.A., C. Yau, D.M. Wolf, A.D. Cherniack, D. Tamborero, S. Ng, M.D. Leiserson, B. Niu, M.D. McLellan, and V. Uzunangelov, Multiplatform analysis of 12 cancer types reveals molecular classification within and across tissues of origin. Cell, 2014. 158(4): p. 929-944.
56. Shen, R., Q. Mo, N. Schultz, V.E. Seshan, A.B. Olshen, J. Huse, M. Ladanyi, and C. Sander, Integrative subtype discovery in glioblastoma using iCluster. PloS one, 2012. 7(4).
57. Shen, R. and M.R. Shen, Package ‘iCluster’. 2012.
58. Network, C.G.A.R., Integrated genomic analyses of ovarian carcinoma. Nature, 2011. 474(7353): p. 609-615.
59. Konstantinopoulos, P.A., D. Spentzos, and S.A. Cannistra, Gene-expression profiling in epithelial ovarian cancer. Nature Clinical Practice Oncology, 2008. 5(10): p. 577-587.
60. Hofree, M., J.P. Shen, H. Carter, A. Gross, and T. Ideker, Network-based stratification of tumor mutations. Nature methods, 2013. 10(11): p. 1108-1115.
61. Wanebo, H.J., M.Y. Jun, E.W. Strong, and H. Oettgen, T-cell deficiency in patients with squamous cell cancer of the head and neck. The American Journal of Surgery, 1975. 130(4): p. 445-451.
62. Lyford-Pike, S., S. Peng, G.D. Young, J.M. Taube, W.H. Westra, B. Akpeng, T.C. Bruno, J.D. Richmon, H. Wang, and J.A. Bishop, Evidence for a role of the PD-1: PD-L1 pathway in immune resistance of HPV-associated head and neck squamous cell carcinoma. Cancer research, 2013. 73(6): p. 1733-1741.
63. De Visser, K.E., A. Eichten, and L.M. Coussens, Paradoxical roles of the immune system during cancer development. Nature reviews cancer, 2006. 6(1): p. 24-37.
64. Balkwill, F. and A. Mantovani, Inflammation and cancer: back to Virchow? The lancet, 2001. 357(9255): p. 539-545.
65. Gillison, M.L., W.M. Koch, R.B. Capone, M. Spafford, W.H. Westra, L. Wu, M.L. Zahurak, R.W. Daniel, M. Viglione, and D.E. Symer, Evidence for a causal association between human papillomavirus and a subset of head and neck cancers. Journal of the National Cancer Institute, 2000. 92(9): p. 709-720.
66. Begum, S., D. Cao, M. Gillison, M. Zahurak, and W.H. Westra, Tissue distribution of human papillomavirus 16 DNA integration in patients with tonsillar carcinoma. Clinical Cancer Research, 2005. 11(16): p. 5694-5699.
67. Kreimer, A.R., G.M. Clifford, P. Boyle, and S. Franceschi, Human papillomavirus types in head and neck squamous cell carcinomas worldwide: a systematic review. Cancer Epidemiology Biomarkers & Prevention, 2005. 14(2): p. 467-475.
68. Vermorken, J.B., J. Trigo, R. Hitt, P. Koralewski, E. Diaz-Rubio, F. Rolland, R. Knecht, N. Amellal, A. Schueler, and J. Baselga, Open-label, uncontrolled, multicenter phase II study to evaluate the efficacy and toxicity of cetuximab as a single agent in patients with recurrent and/or metastatic squamous cell carcinoma of the head and neck who failed to respond to platinum-based therapy. Journal of Clinical Oncology, 2007. 25(16): p. 2171-2177.
69. Cohen, E.E., M.A. Kane, M.A. List, B.E. Brockstein, B. Mehrotra, D. Huo, A.M. Mauer, C. Pierce, A. Dekker, and E.E. Vokes, Phase II trial of gefitinib 250 mg daily in patients with recurrent and/or metastatic squamous cell carcinoma of the head and neck. Clinical Cancer Research, 2005. 11(23): p. 8418-8424.
70. Soulieres, D., N.N. Senzer, E.E. Vokes, M. Hidalgo, S.S. Agarwala, and L.L. Siu, Multicenter phase II study of erlotinib, an oral epidermal growth factor receptor tyrosine kinase inhibitor, in patients with recurrent or metastatic squamous cell cancer of the head and neck. Journal of Clinical Oncology, 2004. 22(1): p. 77-85.
71. Lee, D.D. and H.S. Seung, Learning the parts of objects by non-negative matrix factorization. Nature, 1999. 401(6755): p. 788-791.
72. Fox, J., Cox proportional-hazards regression for survival data. An R and S-PLUS companion to applied regression, 2002: p. 1-18.


73. Langfelder, P., B. Zhang, and S. Horvath, Defining clusters from a hierarchical cluster tree: the Dynamic Tree Cut package for R. Bioinformatics, 2008. 24(5): p. 719-720.
74. Vanunu, O., O. Magger, E. Ruppin, T. Shlomi, and R. Sharan, Associating genes and protein complexes with disease via network propagation. PLoS Comput Biol, 2010. 6(1): p. e1000641.
75. Dayyani, F., C.J. Etzel, M. Liu, C.-H. Ho, S.M. Lippman, and A.S. Tsao, Meta-analysis of the impact of human papillomavirus (HPV) on cancer risk and overall survival in head and neck squamous cell carcinomas (HNSCC). Head & neck oncology, 2010. 2(1): p. 1.
76. Syrjänen, S., The role of human papillomavirus infection in head and neck cancers. Annals of oncology, 2010. 21(suppl 7): p. vii243-vii245.
77. Wong, K.-K., J.A. Engelman, and L.C. Cantley, Targeting the PI3K signaling pathway in cancer. Current opinion in genetics & development, 2010. 20(1): p. 87-90.
78. Lui, V.W., M.L. Hedberg, H. Li, B.S. Vangara, K. Pendleton, Y. Zeng, Y. Lu, Q. Zhang, Y. Du, and B.R. Gilbert, Frequent mutation of the PI3K pathway in head and neck cancer defines predictive biomarkers. Cancer discovery, 2013. 3(7): p. 761-769.
79. Nagpal, J.K., R. Mishra, and B.R. Das, Activation of Stat-3 as one of the early events in tobacco chewing-mediated oral carcinogenesis. Cancer, 2002. 94(9): p. 2393-2400.
80. Lai, S.Y. and F.M. Johnson, Defining the role of the JAK-STAT pathway in head and neck and thoracic malignancies: implications for future therapeutic approaches. Drug Resistance Updates, 2010. 13(3): p. 67-78.
81. Yoshikawa, H., K. Matsubara, G.-S. Qian, P. Jackson, J.D. Groopman, J.E. Manning, C.C. Harris, and J.G. Herman, SOCS-1, a negative regulator of the JAK/STAT pathway, is silenced by methylation in human hepatocellular carcinoma and shows growth-suppression activity. Nature genetics, 2001. 28(1): p. 29-35.
82. Adachi, O., T. Kawai, K. Takeda, M. Matsumoto, H. Tsutsui, M. Sakagami, K. Nakanishi, and S. Akira, Targeted disruption of the MyD88 gene results in loss of IL-1-and IL-18-mediated function. Immunity, 1998. 9(1): p. 143-150.
83. Yamamoto, M., S. Sato, H. Hemmi, K. Hoshino, T. Kaisho, H. Sanjo, O. Takeuchi, M. Sugiyama, M. Okabe, and K. Takeda, Role of adaptor TRIF in the MyD88-independent toll-like receptor signaling pathway. Science, 2003. 301(5633): p. 640-643.
84. Akira, S. and K. Takeda, Toll-like receptor signalling. Nature reviews immunology, 2004. 4(7): p. 499-511.
85. Hemmi, H. and S. Akira, TLR signalling and the function of dendritic cells. 2005.
86. Akira, S., K. Takeda, and T. Kaisho, Toll-like receptors: critical proteins linking innate and acquired immunity. Nature immunology, 2001. 2(8): p. 675-680.
87. Iwasaki, A. and R. Medzhitov, Toll-like receptor control of the adaptive immune responses. Nature immunology, 2004. 5(10): p. 987-995.
88. Arpaia, E., M. Shahar, H. Dadi, A. Cohen, and C.M. Rolfman, Defective T cell receptor signaling and CD8+ thymic selection in humans lacking zap-70 kinase. Cell, 1994. 76(5): p. 947-958.
89. Cooke, M.P., K.M. Abraham, K.A. Forbush, and R.M. Perimutter, Regulation of T cell receptor signaling by a src family protein-tyrosine kinase (p59 fyn). Cell, 1991. 65(2): p. 281-291.
90. Zhang, W., J. Sloan-Lancaster, J. Kitchen, R.P. Trible, and L.E. Samelson, LAT: the ZAP-70 tyrosine kinase substrate that links T cell receptor to cellular activation. Cell, 1998. 92(1): p. 83-92.
91. Boismenu, R., M. Rhein, W.H. Fischer, and W.L. Havran, A role for CD81 in early T cell development. Science, 1996. 271(5246): p. 198.
92. Palacios, E.H. and A. Weiss, Function of the Src-family kinases, Lck and Fyn, in T-cell development and activation. Oncogene, 2004. 23(48): p. 7990-8000.

93. O'Neill, S.K., A. Getahun, S.B. Gauld, K.T. Merrell, I. Tamir, M.J. Smith, J.M. Dal Porto, Q.-Z. Li, and J.C. Cambier, Monophosphorylation of CD79a and CD79b ITAM motifs initiates a SHIP-1 phosphatase-mediated inhibitory signaling cascade required for B cell anergy. Immunity, 2011. 35(5): p. 746-756.
94. Dal Porto, J.M., S.B. Gauld, K.T. Merrell, D. Mills, A.E. Pugh-Bernard, and J. Cambier, B cell antigen receptor signaling 101. Molecular immunology, 2004. 41(6): p. 599-613.
95. Field, R., B. Smith, C. Platz, R. Robinson, J. Neuberger, C. Brus, and C. Lynch, Lung cancer histologic type in the surveillance, epidemiology, and end results registry versus independent review. Journal of the National Cancer Institute, 2004. 96(14): p. 1105-1107.
96. Jemal, A., R. Siegel, E. Ward, Y. Hao, J. Xu, T. Murray, and M.J. Thun, Cancer statistics, 2008. CA: a cancer journal for clinicians, 2008. 58(2): p. 71-96.
97. Siegel, R.L., K.D. Miller, and A. Jemal, Cancer statistics, 2015. CA: a cancer journal for clinicians, 2015. 65(1): p. 5-29.
98. Ladanyi, M. and W. Pao, Lung adenocarcinoma: guiding EGFR-targeted therapy and beyond. Modern Pathology, 2008. 21: p. S16-S22.
99. Oka, S., H. Uramoto, H. Shimokawa, T. Iwanami, and F. Tanaka, The expression of Ki-67, but not proliferating cell nuclear antigen, predicts poor disease free survival in patients with adenocarcinoma of the lung. Anticancer research, 2011. 31(12): p. 4277-4282.
100. Paik, P.K., M.E. Arcila, M. Fara, C.S. Sima, V.A. Miller, M.G. Kris, M. Ladanyi, and G.J. Riely, Clinical characteristics of patients with lung adenocarcinomas harboring BRAF mutations. Journal of clinical oncology, 2011. 29(15): p. 2046-2051.
101. Chen, S., Y. Xu, Y. Chen, X. Li, W. Mou, L. Wang, Y. Liu, R.A. Reisfeld, R. Xiang, and D. Lv, SOX2 gene regulates the transcriptional network of oncogenes and affects tumorigenesis of human lung cancer cells. PloS one, 2012. 7(5).
102. Scaffidi, P. and M.E. Bianchi, Spatially precise DNA bending is an essential activity of the sox2 transcription factor. Journal of Biological Chemistry, 2001. 276(50): p. 47296-47302.
103. Xiang, R., D. Liao, T. Cheng, H. Zhou, Q. Shi, T. Chuang, D. Markowitz, R. Reisfeld, and Y. Luo, Downregulation of transcription factor SOX2 in cancer stem cells suppresses growth and metastasis of lung cancer. British journal of cancer, 2011. 104(9): p. 1410-1417.
104. Harley, V.R., R. Lovell-Badge, and P.N. Goodfellow, Definition of a consensus DNA binding site for SRY. Nucleic acids research, 1994. 22(8): p. 1500.
105. Tuupanen, S., M. Turunen, R. Lehtonen, O. Hallikas, S. Vanharanta, T. Kivioja, M. Björklund, G. Wei, J. Yan, and I. Niittymäki, The common colorectal cancer predisposition SNP rs6983267 at chromosome 8q24 confers potential to enhanced Wnt signaling. Nature genetics, 2009. 41(8): p. 885-890.
106. Zhang, X., L. Zhou, G. Fu, F. Sun, J. Shi, J. Wei, C. Lu, C. Zhou, Q. Yuan, and M. Yang, The identification of an ESCC susceptibility SNP rs920778 that regulates the expression of lncRNA HOTAIR via a novel intronic enhancer. Carcinogenesis, 2014: p. bgu103.
107. Tokuhiro, S., R. Yamada, X. Chang, A. Suzuki, Y. Kochi, T. Sawada, M. Suzuki, M. Nagasaki, M. Ohtsuki, and M. Ono, An intronic SNP in a RUNX1 binding site of SLC22A4, encoding an organic cation transporter, is associated with rheumatoid arthritis. Nature genetics, 2003. 35(4): p. 341-348.
108. Ng, M.T.H., R. van't Hof, J.C. Crockett, M.E. Hope, S. Berry, J. Thomson, M.H. McLean, K.E. McColl, E.M. El-Omar, and G.L. Hold, Increase in NF-κB binding affinity of the variant C allele of the Toll-like receptor 9 −1237T/C polymorphism is associated with Helicobacter pylori-induced gastric disease. Infection and immunity, 2010. 78(3): p. 1345-1352.
109. Guo, H., M. Ahmed, F. Zhang, C.Q. Yao, S. Li, Y. Liang, J. Hua, F. Soares, Y. Sun, and J. Langstein, Modulation of long noncoding RNAs by risk SNPs underlying genetic predispositions to prostate cancer. Nature genetics, 2016.

110. Majewski, J. and T. Pastinen, The study of eQTL variations by RNA-seq: from SNPs to phenotypes. Trends in Genetics, 2011. 27(2): p. 72-79.
111. Watanabe, H., Q. Ma, S. Peng, G. Adelmant, D. Swain, W. Song, C. Fox, J.M. Francis, C.S. Pedamallu, and D.S. DeLuca, SOX2 and p63 colocalize at genetic loci in squamous cell carcinomas. The Journal of clinical investigation, 2014. 124(4): p. 1636-1645.
112. Wang, S., C. Zang, T. Xiao, J. Fan, S. Mei, Q. Qin, Q. Wu, X. Li, K. Xu, and H.H. He, Modeling cis-regulation with a compendium of genome-wide histone H3K27ac profiles. Genome Research, 2016. 26(10): p. 1417-1429.
113. Fang, W.T., C.C. Fan, S.M. Li, T.H. Jang, H.P. Lin, N.Y. Shih, C.H. Chen, T.Y. Wang, S.F. Huang, and A.Y.L. Lee, Downregulation of a putative tumor suppressor BMP4 by SOX2 promotes growth of lung squamous cell carcinoma. International journal of cancer, 2014. 135(4): p. 809-819.
114. Baines, C.P., R.A. Kaiser, T. Sheiko, W.J. Craigen, and J.D. Molkentin, Voltage-dependent anion channels are dispensable for mitochondrial-dependent cell death. Nature cell biology, 2007. 9(5): p. 550-555.
115. De Pinto, V., F. Guarino, A. Guarnera, A. Messina, S. Reina, F.M. Tomasello, V. Palermo, and C. Mazzoni, Characterization of human VDAC isoforms: a peculiar function for VDAC3? Biochimica et Biophysica Acta (BBA)-Bioenergetics, 2010. 1797(6): p. 1268-1275.
116. Cheng, E.H.-Y., T.V. Sheiko, J.K. Fisher, W.J. Craigen, and S.J. Korsmeyer, VDAC2 inhibits BAK activation and mitochondrial apoptosis. Science, 2003. 301(5632): p. 513-517.
117. Reina, S., V. Checchetto, R. Saletti, A. Gupta, D. Chaturvedi, C. Guardiani, F. Guarino, M.A. Scorciapino, A. Magrì, and S. Foti, VDAC3 as a sensor of oxidative state of the intermembrane space of mitochondria: the putative role of cysteine residue modifications. Oncotarget, 2016. 7(3): p. 2249.
118. De Stefani, D., A. Bononi, A. Romagnoli, A. Messina, V. De Pinto, P. Pinton, and R. Rizzuto, VDAC1 selectively transfers apoptotic Ca2+ signals to mitochondria. Cell Death & Differentiation, 2012. 19(2): p. 267-273.
119. Zhang, X., P.S. Choi, J.M. Francis, M. Imielinski, H. Watanabe, A.D. Cherniack, and M. Meyerson, Identification of focally amplified lineage-specific super-enhancers in human epithelial cancers. Nature genetics, 2015.
120. Ward, L.D. and M. Kellis, HaploReg v4: systematic mining of putative causal variants, cell types, regulators and target genes for human complex traits and disease. Nucleic acids research, 2016. 44(D1): p. D877-D881.
121. Reich, D.E., M. Cargill, S. Bolk, J. Ireland, P.C. Sabeti, D.J. Richter, T. Lavery, R. Kouyoumjian, S.F. Farhadian, and R. Ward, Linkage disequilibrium in the human genome. Nature, 2001. 411(6834): p. 199-204.
122. Ward, L.D. and M. Kellis, HaploReg: a resource for exploring chromatin states, conservation, and regulatory motif alterations within sets of genetically linked variants. Nucleic acids research, 2012. 40(D1): p. D930-D934.
123. Wakamatsu, Y., Y. Endo, N. Osumi, and J.A. Weston, Multiple roles of Sox2, an HMG-box transcription factor in avian neural crest development. Developmental dynamics, 2004. 229(1): p. 74-86.
124. Dixon, J.R., S. Selvaraj, F. Yue, A. Kim, Y. Li, Y. Shen, M. Hu, J.S. Liu, and B. Ren, Topological domains in mammalian genomes identified by analysis of chromatin interactions. Nature, 2012. 485(7398): p. 376-380.
125. Dekker, J., M.A. Marti-Renom, and L.A. Mirny, Exploring the three-dimensional organization of genomes: interpreting chromatin interaction data. Nature Reviews Genetics, 2013. 14(6): p. 390-403.

126. Simonis, M., P. Klous, E. Splinter, Y. Moshkin, R. Willemsen, E. De Wit, B. Van Steensel, and W. De Laat, Nuclear organization of active and inactive chromatin domains uncovered by chromosome conformation capture–on-chip (4C). Nature genetics, 2006. 38(11): p. 1348-1354.
127. Jost, D., P. Carrivain, G. Cavalli, and C. Vaillant, Modeling epigenome folding: formation and dynamics of topologically associated chromatin domains. Nucleic acids research, 2014: p. gku698.
128. Schmitt, A.D., M. Hu, I. Jung, Z. Xu, Y. Qiu, C.L. Tan, Y. Li, S. Lin, Y. Lin, and C.L. Barr, A compendium of chromatin contact maps reveals spatially active regions in the human genome. Cell Reports, 2016. 17(8): p. 2042-2059.
129. Wang, Y., B. Zhang, L. Zhang, L. An, J. Xu, D. Li, M.N. Choudhary, Y. Li, M. Hu, and R. Hardison, The 3D Genome Browser: a web-based browser for visualizing 3D genome organization and long-range chromatin interactions. bioRxiv, 2017: p. 112268.
130. Shimizu, S., M. Narita, and Y. Tsujimoto, Bcl-2 family proteins regulate the release of apoptogenic cytochrome c by the mitochondrial channel VDAC. Nature, 1999. 399(6735): p. 483-487.
131. Maldonado, E.N. and J.J. Lemasters, Warburg revisited: regulation of mitochondrial metabolism by voltage-dependent anion channels in cancer cells. Journal of Pharmacology and Experimental Therapeutics, 2012. 342(3): p. 637-641.
132. Okazaki, M., K. Kurabayashi, M. Asanuma, Y. Saito, K. Dodo, and M. Sodeoka, VDAC3 gating is activated by suppression of disulfide-bond formation between the N-terminal region and the bottom of the pore. Biochimica et Biophysica Acta (BBA)-Biomembranes, 2015. 1848(12): p. 3188-3196.
133. Maldonado, E.N., K.L. Sheldon, D.N. DeHart, J. Patnaik, Y. Manevich, D.M. Townsend, S.M. Bezrukov, T.K. Rostovtseva, and J.J. Lemasters, Voltage-dependent anion channels modulate mitochondrial metabolism in cancer cells regulation by free tubulin and erastin. Journal of Biological Chemistry, 2013. 288(17): p. 11920-11929.
134. Messina, A., S. Reina, F. Guarino, A. Magrì, F. Tomasello, R.E. Clark, R.R. Ramsay, and V. De Pinto, Live cell interactome of the human voltage dependent anion channel 3 (VDAC3) revealed in HeLa cells by affinity purification tag technique. Molecular BioSystems, 2014. 10(8): p. 2134-2145.
135. Yagoda, N., M. Von Rechenberg, E. Zaganjor, A.J. Bauer, W.S. Yang, D.J. Fridman, A.J. Wolpaw, I. Smukste, J.M. Peltier, and J.J. Boniface, RAS–RAF–MEK-dependent oxidative cell death involving voltage-dependent anion channels. Nature, 2007. 447(7146): p. 865-869.
136. Simamura, E., H. Shimada, T. Hatta, and K.-I. Hirai, Mitochondrial voltage-dependent anion channels (VDACs) as novel pharmacological targets for anti-cancer agents. Journal of bioenergetics and biomembranes, 2008. 40(3): p. 213-217.
137. Ong, C.-T. and V.G. Corces, CTCF: an architectural protein bridging genome topology and function. Nature reviews Genetics, 2014. 15(4): p. 234.
138. Fuks, F., DNA methylation and histone modifications: teaming up to silence genes. Current opinion in genetics & development, 2005. 15(5): p. 490-495.
139. Tate, P.H. and A.P. Bird, Effects of DNA methylation on DNA-binding proteins and gene expression. Current opinion in genetics & development, 1993. 3(2): p. 226-231.
140. Wang, H., M.T. Maurano, H. Qu, K.E. Varley, J. Gertz, F. Pauli, K. Lee, T. Canfield, M. Weaver, and R. Sandstrom, Widespread plasticity in CTCF occupancy linked to DNA methylation. Genome research, 2012. 22(9): p. 1680-1688.
141. Natekin, A. and A. Knoll, Gradient boosting machines, a tutorial. Frontiers in neurorobotics, 2013. 7: p. 21.


142. Ogutu, J.O., H.-P. Piepho, and T. Schulz-Streeck. A comparison of random forests, boosting and support vector machines for genomic selection. in BMC proceedings. 2011. BioMed Central.
143. Franco, H.L., A. Nagari, V.S. Malladi, W. Li, Y. Xi, D. Richardson, K.L. Allton, K. Tanaka, J. Li, and S. Murakami, Enhancer transcription reveals subtype-specific gene expression programs controlling breast cancer pathogenesis. Genome research, 2018. 28(2): p. 159-170.
144. Rosenbloom, K.R., C.A. Sloan, V.S. Malladi, T.R. Dreszer, K. Learned, V.M. Kirkup, M.C. Wong, M. Maddren, R. Fang, and S.G. Heitner, ENCODE data in the UCSC Genome Browser: year 5 update. Nucleic acids research, 2012. 41(D1): p. D56-D63.
145. Liu, Y., N.M. Walavalkar, M.G. Dozmorov, S.S. Rich, M. Civelek, and M.J. Guertin, Identification of breast cancer associated variants that modulate transcription factor binding. PLoS genetics, 2017. 13(9): p. e1006761.
146. Porter, J.R., B.E. Fisher, L. Baranello, J.C. Liu, D.M. Kambach, Z. Nie, W.S. Koh, J. Luo, J.M. Stommel, and D. Levens, Global Inhibition with Specific Activation: How p53 and MYC Redistribute the Transcriptome in the DNA Double-Strand Break Response. Molecular cell, 2017. 67(6): p. 1013-1025.e9.
147. Li, H., Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv preprint arXiv:1303.3997, 2013.
148. Zhang, Y., T. Liu, C.A. Meyer, J. Eeckhoute, D.S. Johnson, B.E. Bernstein, C. Nusbaum, R.M. Myers, M. Brown, and W. Li, Model-based analysis of ChIP-Seq (MACS). Genome biology, 2008. 9(9): p. R137.
149. Uehiro, N., F. Sato, F. Pu, S. Tanaka, M. Kawashima, K. Kawaguchi, M. Sugimoto, S. Saji, and M. Toi, Circulating cell-free DNA-based epigenetic assay can detect early breast cancer. Breast Cancer Research, 2016. 18(1): p. 129.
150. Karolchik, D., A.S. Hinrichs, T.S. Furey, K.M. Roskin, C.W. Sugnet, D. Haussler, and W.J. Kent, The UCSC Table Browser data retrieval tool. Nucleic acids research, 2004. 32(suppl_1): p. D493-D496.
151. Li, B. and C.N. Dewey, RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome. BMC bioinformatics, 2011. 12(1): p. 323.
152. Lin, C.-Y., V.B. Vega, J.S. Thomsen, T. Zhang, S.L. Kong, M. Xie, K.P. Chiu, L. Lipovich, D.H. Barnett, and F. Stossi, Whole-genome cartography of estrogen receptor α binding sites. PLoS genetics, 2007. 3(6): p. e87.
153. Nagarajan, S., T. Hossan, M. Alawi, Z. Najafova, D. Indenbirken, U. Bedi, H. Taipaleenmäki, I. Ben-Batalla, M. Scheller, and S. Loges, Bromodomain protein BRD4 is required for estrogen receptor-dependent enhancer activation and gene transcription. Cell reports, 2014. 8(2): p. 460-469.
154. Bolstad, B.M., R.A. Irizarry, M. Åstrand, and T.P. Speed, A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics, 2003. 19(2): p. 185-193.
155. Anders, S. and W. Huber, Differential expression analysis for sequence count data. Genome biology, 2010. 11(10): p. R106.
156. Langmead, B. and S.L. Salzberg, Fast gapped-read alignment with Bowtie 2. Nature methods, 2012. 9(4): p. 357.
157. Taberlay, P.C., J. Achinger-Kawecka, A.T. Lun, F.A. Buske, K. Sabir, C.M. Gould, E. Zotenko, S.A. Bert, K.A. Giles, and D.C. Bauer, Three-dimensional disorganization of the cancer genome occurs coincident with long-range genetic and epigenetic alterations. Genome research, 2016. 26(6): p. 719-731.
158. Meissner, A., T.S. Mikkelsen, H. Gu, M. Wernig, J. Hanna, A. Sivachenko, X. Zhang, B.E. Bernstein, C. Nusbaum, and D.B. Jaffe, Genome-scale DNA methylation maps of pluripotent and differentiated cells. Nature, 2008. 454(7205): p. 766.

159. Guertin, M.J., A.L. Martins, A. Siepel, and J.T. Lis, Accurate prediction of inducible transcription factor binding intensities in vivo. PLoS genetics, 2012. 8(3): p. e1002610.
160. Danko, C.G., S.L. Hyland, L.J. Core, A.L. Martins, C.T. Waters, H.W. Lee, V.G. Cheung, W.L. Kraus, J.T. Lis, and A. Siepel, Identification of active transcriptional regulatory elements from GRO-seq data. Nature methods, 2015. 12(5): p. 433.
161. Strunnikov, A.V. and R. Jessberger, Structural maintenance of chromosomes (SMC) proteins: conserved molecular properties for multiple biological functions. European journal of biochemistry, 1999. 263(1): p. 6-13.
162. Deardorff, M.A., M. Kaur, D. Yaeger, A. Rampuria, S. Korolev, J. Pie, C. Gil-Rodríguez, M. Arnedo, B. Loeys, and A.D. Kline, Mutations in cohesin complex members SMC3 and SMC1A cause a mild variant of Cornelia de Lange syndrome with predominant mental retardation. The American Journal of Human Genetics, 2007. 80(3): p. 485-494.
163. Dennis, G., B.T. Sherman, D.A. Hosack, J. Yang, W. Gao, H.C. Lane, and R.A. Lempicki, DAVID: database for annotation, visualization, and integrated discovery. Genome biology, 2003. 4(9): p. R60.
164. Kanehisa, M. and S. Goto, KEGG: Kyoto encyclopedia of genes and genomes. Nucleic acids research, 2000. 28(1): p. 27-30.
165. Garcia, R., T.L. Bowman, G. Niu, H. Yu, S. Minton, C.A. Muro-Cacho, C.E. Cox, R. Falcone, R. Fairclough, and S. Parsons, Constitutive activation of Stat3 by the Src and JAK tyrosine kinases participates in growth regulation of human breast carcinoma cells. Oncogene, 2001. 20(20): p. 2499.
166. Calo, E. and J. Wysocka, Modification of enhancer chromatin: what, how, and why? Molecular cell, 2013. 49(5): p. 825-837.
167. Ooi, S.K., C. Qiu, E. Bernstein, K. Li, D. Jia, Z. Yang, H. Erdjument-Bromage, P. Tempst, S.-P. Lin, and C.D. Allis, DNMT3L connects unmethylated lysine 4 of histone H3 to de novo methylation of DNA. Nature, 2007. 448(7154): p. 714.
168. Hall, M.A., A. Shundrovsky, L. Bai, R.M. Fulbright, J.T. Lis, and M.D. Wang, High-resolution dynamic mapping of histone-DNA interactions in a nucleosome. Nature structural & molecular biology, 2009. 16(2): p. 124.
169. Cremer, T., G. Kreth, H. Koester, R. Fink, R. Heintzmann, M. Cremer, I. Solovei, D. Zink, and C. Cremer, Chromosome territories, interchromatin domain compartment, and nuclear matrix: an integrated view of the functional nuclear architecture. Critical reviews in eukaryotic gene expression, 2000. 10(2): p. 179.
170. Roy, S., A.F. Siahpirani, D. Chasman, S. Knaack, F. Ay, R. Stewart, M. Wilson, and R. Sridharan, A predictive modeling approach for cell line-specific long-range regulatory interactions. Nucleic acids research, 2015. 43(18): p. 8694-8712.
171. Andersson, R., C. Gebhard, I. Miguel-Escalada, I. Hoof, J. Bornholdt, M. Boyd, Y. Chen, X. Zhao, C. Schmidl, and T. Suzuki, An atlas of active enhancers across human cell types and tissues. Nature, 2014. 507(7493): p. 455-461.
172. Hait, T.A., D. Amar, R. Shamir, and R. Elkon, FOCS: a novel method for analyzing enhancer and gene activity patterns infers an extensive enhancer–promoter map. Genome biology, 2018. 19(1): p. 56.
173. Consortium, E.P., The ENCODE (ENCyclopedia of DNA elements) project. Science, 2004. 306(5696): p. 636-640.
174. Wang, D., I. Garcia-Bassets, C. Benner, W. Li, X. Su, Y. Zhou, J. Qiu, W. Liu, M.U. Kaikkonen, and K.A. Ohgi, Reprogramming transcription by distinct classes of enhancers functionally defined by eRNA. Nature, 2011. 474(7351): p. 390-394.
175. Ntini, E., A.I. Järvelin, J. Bornholdt, Y. Chen, M. Boyd, M. Jørgensen, R. Andersson, I. Hoof, A. Schein, and P.R. Andersen, Polyadenylation site–induced decay of upstream transcripts enforces promoter directionality. Nature structural & molecular biology, 2013. 20(8): p. 923.

176. Danko, C.G., S.L. Hyland, L.J. Core, A.L. Martins, C.T. Waters, H.W. Lee, V.G. Cheung, W.L. Kraus, J.T. Lis, and A. Siepel, Identification of active transcriptional regulatory elements from GRO-seq data. Nature methods, 2015. 12(5): p. 433-438.
177. Lizio, M., J. Harshbarger, H. Shimoji, J. Severin, T. Kasukawa, S. Sahin, I. Abugessaisa, S. Fukuda, F. Hori, and S. Ishikawa-Kato, Gateways to the FANTOM5 promoter level mammalian expression atlas. Genome biology, 2015. 16(1): p. 22.
178. Hoffman, M.M., O.J. Buske, J. Wang, Z. Weng, J.A. Bilmes, and W.S. Noble, Unsupervised pattern discovery in human chromatin structure through genomic segmentation. Nature methods, 2012. 9(5): p. 473.
179. Ernst, J. and M. Kellis, ChromHMM: automating chromatin-state discovery and characterization. Nature methods, 2012. 9(3): p. 215-216.
180. Zhou, J. and O.G. Troyanskaya, Predicting effects of noncoding variants with deep learning–based sequence model. Nature methods, 2015. 12(10): p. 931-934.
181. Litjens, G., T. Kooi, B.E. Bejnordi, A.A.A. Setio, F. Ciompi, M. Ghafoorian, J.A. Van Der Laak, B. Van Ginneken, and C.I. Sánchez, A survey on deep learning in medical image analysis. Medical image analysis, 2017. 42: p. 60-88.
182. Eraslan, G., Ž. Avsec, J. Gagneur, and F.J. Theis, Deep learning: new computational modelling techniques for genomics. Nature Reviews Genetics, 2019. 20(7): p. 389-403.
183. Park, Y. and M. Kellis, Deep learning for regulatory genomics. Nature biotechnology, 2015. 33(8): p. 825.
184. Choong, A.C.H. and N.K. Lee. Evaluation of convolutionary neural networks modeling of DNA sequences using ordinal versus one-hot encoding method. in 2017 International Conference on Computer and Drone Applications (IConDA). 2017. IEEE.
185. Kelley, D.R., J. Snoek, and J.L. Rinn, Basset: learning the regulatory code of the accessible genome with deep convolutional neural networks. Genome research, 2016. 26(7): p. 990-999.
186. Chen, K.M., E.M. Cofer, J. Zhou, and O.G. Troyanskaya, Selene: a PyTorch-based deep learning library for sequence data. Nature methods, 2019. 16(4): p. 315-318.
187. Zhou, J., C.L. Theesfeld, K. Yao, K.M. Chen, A.K. Wong, and O.G. Troyanskaya, Deep learning sequence-based ab initio prediction of variant effects on expression and disease risk. Nature genetics, 2018. 50(8): p. 1171-1179.
188. Ray, D., K.C. Ha, K. Nie, H. Zheng, T.R. Hughes, and Q.D. Morris, RNAcompete methodology and application to determine sequence preferences of unconventional RNA-binding proteins. Methods, 2017. 118: p. 3-15.
189. Orenstein, Y. and R. Shamir, A comparative analysis of transcription factor binding models learned from PBM, HT-SELEX and ChIP data. Nucleic acids research, 2014. 42(8): p. e63-e63.
190. Glusman, G., J. Caballero, D.E. Mauldin, L. Hood, and J.C. Roach, Kaviar: an accessible system for testing SNV novelty. Bioinformatics, 2011. 27(22): p. 3216-3217.
191. Rowe, H.M., A. Kapopoulou, A. Corsinotti, L. Fasching, T.S. Macfarlan, Y. Tarabay, S. Viville, J. Jakobsson, S.L. Pfaff, and D. Trono, TRIM28 repression of retrotransposon-based enhancers is necessary to preserve transcriptional dynamics in embryonic stem cells. Genome research, 2013. 23(3): p. 452-461.
192. Lechner, F., D.K. Wong, P.R. Dunbar, R. Chapman, R.T. Chung, P. Dohrenwend, G. Robbins, R. Phillips, P. Klenerman, and B.D. Walker, Analysis of successful immune responses in persons infected with hepatitis C virus. The Journal of experimental medicine, 2000. 191(9): p. 1499-1512.
193. Schultz, R.A. and T.R. Watters, Forward mechanical modeling of the Amenthes Rupes thrust fault on Mars. Geophysical Research Letters, 2001. 28(24): p. 4659-4662.
194. Chen, L., D.-T. Chen, C. Kurtyka, B. Rawal, W.J. Fulp, E.B. Haura, and W.D. Cress, Tripartite motif containing 28 (Trim28) can regulate cell proliferation by bridging HDAC1/E2F interactions. Journal of Biological Chemistry, 2012. 287(48): p. 40106-40118.
195. Zuo, X., J. Sheng, H.-T. Lau, C.M. McDonald, M. Andrade, D.E. Cullen, F.T. Bell, M. Iacovino, M. Kyba, and G. Xu, Zinc finger protein ZFP57 requires its co-factor to recruit DNA methyltransferases and maintains DNA methylation imprint in embryonic stem cells via its transcriptional repression domain. Journal of Biological Chemistry, 2012. 287(3): p. 2107-2118.
196. Annunziato, A.T. and J.C. Hansen, Role of histone acetylation in the assembly and modulation of chromatin structures. Gene Expression, The Journal of Liver Research, 2001. 9(1-2): p. 37-61.
197. Zhou, Y., R. Santoro, and I. Grummt, The chromatin remodeling complex NoRC targets HDAC1 to the ribosomal gene promoter and represses RNA polymerase I transcription. The EMBO journal, 2002. 21(17): p. 4632-4640.
198. Gryder, B.E., L. Wu, G.M. Woldemichael, S. Pomella, T.R. Quinn, P.M. Park, A. Cleveland, B.Z. Stanton, Y. Song, and R. Rota, Chemical genomics reveals histone deacetylases are required for core regulatory transcription. Nature communications, 2019. 10(1): p. 1-12.
199. Siersbæk, R., J.G.S. Madsen, B.M. Javierre, R. Nielsen, E.K. Bagge, J. Cairns, S.W. Wingett, S. Traynor, M. Spivakov, and P. Fraser, Dynamic rewiring of promoter-anchored chromatin loops during adipocyte differentiation. Molecular cell, 2017. 66(3): p. 420-435.e5.
