
Identification of endometrial cancer methylation features using a combined methylation analysis method

DISSERTATION

Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the Graduate School of The Ohio State University

By

Michael P Trimarchi B.S.

Biomedical Sciences Graduate Program

The Ohio State University

2016

Dissertation Committee:

Joanna L Groden, Advisor

Paul Goodfellow

Ralf A. Bundschuh

Jeffrey Parvin

Pearlly Yan

Copyrighted by

Michael P Trimarchi

2016

Abstract

Introduction: DNA methylation is a stable epigenetic mark that is frequently altered in tumors. DNA methylation marks are attractive biomarkers for disease states given the stability of DNA methylation in living cells and in biologic specimens. Widespread accumulation of methylation in regulatory elements in some tumors (termed the CpG island methylator phenotype, CIMP) can play an important role in tumorigenesis. High-resolution assessment of CIMP for the entire genome, however, remains cost prohibitive and requires quantities of DNA that are not available for many tissue samples of interest. Genome-wide scans of methylation have been undertaken for large numbers of tumors, and higher resolution analyses have been performed for a limited number of cancer specimens. Yet methods for analyzing these large datasets and integrating findings from different studies have not been fully developed. An approach was developed to profile CIMP by combining the strengths of two different methylome profiling techniques.

Methods: Methylomes for 76 primary endometrial cancer and 12 normal endometrial samples were generated using methylated fragment capture and second-generation sequencing (MethylCap-seq). Publicly available data from The Cancer Genome Atlas (TCGA) for 203 endometrial cancers, analyzed using the Infinium HumanMethylation 450 beadchip, were compared to the MethylCap-seq data. A MethylCap-seq quality control module was developed to exclude sequencing samples with poor-quality methylation data from analysis. Additional MethylCap-seq datasets were also used to develop and validate the quality control module.

Results: Analysis of total methylation in CpG islands (CGIs) identified a subset of tumors with a methylator phenotype. I developed a 13-region methylation signature associated with a “hypermethylator state” using a training set of five highly methylated and eight lowly methylated tumors. The signature was validated using data from TCGA. High signature methylation score was associated with mismatch repair deficiency, high mutation rate, and low somatic copy number alteration in the TCGA test set. In addition, the methylation signature distinguished >90% of endometrioid endometrial tumors from normal controls in the test set. Furthermore, classification of tumors by signature methylation score proved highly robust, showing good agreement with previously published methylation clusters for the test set as well as consistent ranking of tumors across alternative signatures.

Conclusion: I identified a methylation signature for a “hypermethylator phenotype” in endometrial cancer and developed methods that could prove useful for identifying extreme methylation phenotypes in other cancers.


Acknowledgments

My sincere thanks to my committee and my former advisor Tim H.M. Huang for making this project possible, as well as the current and former members of the Yan and Groden labs. Special thanks to Paul Goodfellow for helping reorient the project after my former advisor left, and to Ralf Bundschuh for his guidance in data analysis.


Vita

June 2002 ...... North Olmsted High School

June 2006 ...... B.S. Microbiology, Minor in Chemistry, The Ohio State University

June 2007 to present ...... Graduate Research Associate, Biomedical Sciences Graduate Program, The Ohio State University

Publications

2016 Michael P Trimarchi, Pearlly Yan, Joanna Groden, Ralf Bundschuh, Paul Goodfellow. Identification of endometrial cancer methylation features using a combined methylation analysis method. Manuscript in preparation.

2012 Michael P Trimarchi, Mark Murphy, David Frankhouser, Benjamin AT Rodriguez, John Curfman, Guido Marcucci, Pearlly Yan, Ralf Bundschuh. Enrichment-based DNA methylation analysis using next-generation sequencing: sample exclusion, estimating changes in global methylation, and the contribution of replicate lanes. BMC Genomics 2012, 13(Suppl 8):S6 (17 December 2012).

2012 Rodriguez B, Tam HH, Frankhouser D, Trimarchi M, Murphy M, Kuo C, Parikh D, Ball B, Schwind S, Curfman J, Blum W, Marcucci G, Yan P, Bundschuh R. A Scalable, Flexible Workflow for MethylCap-Seq Data Analysis. BMC Genomics 2012, 13(Suppl 6):S14 (26 October 2012).

2012 Yan P, Frankhouser D, Murphy M, Tam HH, Rodriguez B, Curfman J, Trimarchi M, Geyer S, Wu YZ, Whitman SP, Metzeler K, Walker A, Klisovic R, Jacob S, Grever MR, Byrd JC, Bloomfield CD, Garzon R, Blum W, Caligiuri MA, Bundschuh R, Marcucci G. Genome-wide methylation profiling in decitabine-treated patients with acute myeloid leukemia. Blood. 2012 Jul 11.

2011 Trimarchi MP, Mouangsavanh M, Huang TH. Cancer epigenetics: a perspective on the role of DNA methylation in acquired endocrine resistance. Chin J Cancer. 2011 Nov;30(11):749-56.

2010 Cottrell, C. E., Bir, N., Varga, E., Alvarez, C. E., Bouyain, S., Zernzach, R., Thrush, D. L., Evans, J., Trimarchi, M., Butter, E. M., Cunningham, D., Gastier-Foster, J. M., McBride, K. L. and Herman, G. E. Contactin 4 as an autism susceptibility locus. Autism Res. 2011 Jun;4(3):189-99.


Fields of Study

Major Field: Biomedical Sciences

Specialization: Cancer Research

Interdisciplinary Specialization: Biomedical, Clinical & Translational Sciences


Table of Contents

Abstract

Acknowledgments

Vita

List of Figures

List of Tables

Chapter 1. Background

I. Epigenetics in cancer, with a focus on DNA methylation

Chromatin model

DNA methylation

DNA methylation in cancer

Histone modifications

DNA methylation interaction with histone modifications

DNA methylation as a biomarker and therapeutic target

II. Methylome Profiling

Analysis methods

Methylome Profiling: The Biology

Chapter 2. Thesis Rationale and Research Objectives

Chapter 3. Enrichment-based DNA methylation analysis using next-generation sequencing: quality control, estimating changes in global methylation and the effects of increased sequencing depth

I. Introduction

II. Results and Discussion

Quality control exclusion criteria reduce noise in methylation signal and improve analytical power

The effect of additional sequencing lanes on quality control metrics

The global methylation indicator (GMI) correlates inversely with an in vitro methylated tracer sequence

III. Methods

Patient samples

Methylated-DNA capture (MethylCap-seq)

MethylCap-seq experimental quality control and exclusion criteria

Standard sequence file processing and alignment

Standard global methylation analysis workflow

Calculation of noise in methylation signal

Calculation of the Global Methylation Indicator (GMI)

Assessment of methylated fragment enrichment using an in vitro methylated construct

IV. Conclusions

Chapter 4. Identification of endometrial cancer methylation features using a combined methylation analysis method

I. Introduction

II. Results

Characterizing a CpG island methylator phenotype

Methylation signature construction and technical validation

Methylation signature stratifies endometrioid endometrial tumors by methylation phenotype and distinguishes tumors from normal controls

High methylation score is associated with mismatch repair deficiency, high mutation rate, and low somatic copy number alteration

Methodological validity

A signature for CIMP

III. Discussion

IV. Conclusion

V. Methods

Patient samples

MethylCap-seq quality control

MethylCap-seq data analysis

Infinium validation of methylation signature candidates

Computation of methylation score using the 13 promoter CGI signature

In silico analysis of TCGA endometrioid endometrial tumors

Replicate signature analysis

Chapter 5. Summary and Discussion

I. Summary

II. Discussion

References

Appendix A. Supplementary data for Chapter 3

Appendix A1 – QC table for endometrial cancer study

Appendix A2 – Replicate lane correlation, endometrial QC passed vs. QC failed samples

Appendix A3 – QC table for ovarian study

Appendix A4 – QC, GMI, plasmid RPM table for AML study

Appendix B. Supplementary data for Chapter 4

Appendix B1 - Cohort characteristics

Appendix B2 - Sequencing summary

List of Figures

Figure 1. QC exclusion criteria reduce noise in methylation signals.

Figure 2. Replicate sequencing lanes for MethylCap-seq experiments correlate highly.

Figure 3. Additional lanes of sequencing data moderately increase saturation but greatly increase 5X CpG coverage.

Figure 4. Global methylation indicator scales inversely with read counts from a "spiked" in vitro methylated construct.

Figure 5. Endometrioid endometrial malignancies show increased methylation in promoter CGI compared to unmatched normal controls.

Figure 6. Loci methylated in CGI-H tumors account for many of the cancer-associated methylation gains.

Figure 7. Methylation of 13 promoter-associated CGI distinguishes tumors with high promoter CGI methylation (CGI-H) from those with baseline promoter CGI methylation (CGI-0).

Figure 8. Methylation signature stratifies endometrioid endometrial tumors by methylation phenotype and distinguishes tumors from normal controls.

Figure 9. Aggregate promoter CGI methylation shows weaker stratification potential compared to the 13-region methylation signature.

Figure 10. Methylation of EPHX3 is associated with decreased expression.

Figure 11. High methylation score is associated with mismatch repair deficiency, high mutation rate, and low somatic copy number alteration.

Figure 12. Replicate 13-region methylation signatures rank tumors similarly.

List of Tables

Table 1. Differentially Methylated Regions, Endometrial Tumors vs. Nonmalignant Endometrial Tissue

Table 2. Term enrichment associated with hypermethylated promoter CGI (MSigDB Perturbation)

Table 3. Promoter CGI that distinguish CGI-H from CGI-0 tumors as measured by MethylCap-seq

Table 4. Promoter CGI included in the final signature after validation by Infinium

Table 5. Promoter CGI discarded from the final signature after validation by Infinium

Table 6. Correlation between promoter CGI methylation and gene expression in TCGA tumors

Chapter 1. Background

I. Epigenetics in cancer, with a focus on DNA methylation

Chromatin model

DNA in cells does not exist freely, but associates with proteins to form a complex termed chromatin. According to the “beads on a string” model, DNA (the string) in cells is wound around nucleosomes (the beads), which in turn are composed of histone proteins. The four core histone subunits (H2A, H2B, H3, H4) form heterodimers that complex together as an octamer to form the nucleosome core. The nucleosomes in turn are connected to a scaffolding by linker histones (H1, H5), contributing to higher order chromatin structure. Changes in patterns of DNA methylation and histone modification alter the conformation of chromatin and accessibility of the DNA to transcriptional machinery, either by directly introducing steric hindrance or indirectly causing surrounding chromatin to adopt an “open” or “closed” conformation. In addition, epigenetic modifications may cause individual nucleosomes to shift laterally across the DNA, exposing some areas of DNA for transcription and covering others [1].

DNA methylation

DNA methylation is a covalent modification that occurs at cytosine residues, in particular cytosines that precede a guanine (CpGs) [2]. The process is catalyzed by DNA methyltransferases (DNMTs), which transfer the methyl group from S-adenosylmethionine to the target cytosine. Two families of DNMTs have been identified: DNMT1, which predominantly functions in maintenance of DNA methylation during DNA replication, and DNMT3, which is thought to be primarily responsible for de novo CpG methylation [3]. CpGs are strikingly rare in the genome compared to what would be expected from probabilistic estimates, and outside of transcribed regions CpGs are generally methylated. Areas of high CpG content, termed CpG islands, are found in approximately 40% of mammalian promoters and, unlike CpGs in the rest of the genome, are usually unmethylated [4]. The methylation state of CpG islands in promoters can be an important factor controlling gene expression, with heavy methylation blocking gene transcription and sparse methylation permitting it. In addition, evidence suggests that heavy methylation throughout a region of chromatin can mediate long-range silencing that extends even to adjacent unmethylated genes [5].
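The depletion of CpGs relative to probabilistic expectation is commonly expressed as an observed-to-expected CpG ratio. The following Python sketch illustrates one widely used form of that ratio (observed CpG count multiplied by sequence length, divided by the product of the C and G counts). It is included for illustration only; the function name `cpg_obs_exp` and the example sequences are hypothetical and were not part of the analyses in this dissertation.

```python
def cpg_obs_exp(seq: str) -> float:
    """Observed/expected CpG ratio for a DNA sequence.

    Expected CpG count under independence is (#C * #G) / N, so the
    ratio is (#CpG * N) / (#C * #G). Returns 0.0 when the sequence
    contains no C or no G (the ratio is undefined in that case; using
    0.0 is an assumption made for this sketch).
    """
    seq = seq.upper()
    n = len(seq)
    c_count = seq.count("C")
    g_count = seq.count("G")
    cpg_count = seq.count("CG")  # "CG" cannot overlap itself, so count() is exact
    if c_count == 0 or g_count == 0:
        return 0.0
    return cpg_count * n / (c_count * g_count)


# A CpG-dense toy sequence versus one with C and G but no CpG dinucleotide.
dense = cpg_obs_exp("CGCGCGCG")
depleted = cpg_obs_exp("CATGCATGCA")
```

A ratio well below 1 indicates CpG depletion, whereas CpG islands show ratios closer to 1 or above; specific thresholds vary between published definitions and are not asserted here.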

DNA methylation in cancer

Precisely controlled DNA methylation is important for imprinting (allele-specific expression of some genes) and cell differentiation in normal cells. In cancer cells, aberrant patterns of DNA methylation are frequently observed. In general, cancer cells feature global hypomethylation and focal promoter hypermethylation. Global hypomethylation has been tied to genomic instability, loss of imprinting, and activation of oncogenes [4,6]. Promoter hypermethylation, namely within CpG islands, leads to epigenetic silencing of the target gene. In contrast to the rest of the genome, tumor suppressor promoters are frequently hypermethylated in tumors, suggesting that epigenetic silencing of tumor suppressors can be functionally equivalent to loss-of-function mutations [7]. DNA methylation perturbation has also been linked to resistance to therapeutics [8-10]. While epigenetic changes and genetic mutations may be mutually sufficient in some cases, mutations can also perturb the epigenome, and epigenetic changes can lead to increased mutation rate [11].

Histone modifications

Histones also represent a target for epigenetic regulation, and H2A, H2B, H3, and H4 are all known to be post-translationally modified. These modifications generally occur on the N-terminal tails which protrude from the nucleosome, though H2A, H2B, and H3 are also known to be modified in the core globular domains. A wide range of modifications have been observed, including methylation, acetylation, phosphorylation, ubiquitination, SUMOylation, and ADP-ribosylation, with methylation (lysine and arginine residues) and acetylation (lysine residues) being the most studied [12]. The potentially limitless combinations of modifications and the clear role of histone modification in epigenetic regulation have inspired the concept of a “histone code”, a pattern of modifications that directs cellular development, selectively filtering the more expansive genetic code [13]. While the concept of a histone code remains controversial, several general themes have emerged. Histone acetylation is generally associated with opening of the chromatin and an activating effect on gene expression, with deacetylation having opposite effects and being responsible for gene silencing. Histone methylation can be either activating or repressive, depending on the residue, the number of methyl groups on the residue (ranging from 0 to 3), and the presence of other histone modifications. While the effect of histone acetylation could potentially be explained by electrostatic interactions (as acetylation decreases the affinity of histones for DNA and thus might be hypothesized to “push” the chromatin apart), the variable effects of histone methylation suggest other enzyme intermediates. Indeed, the enzymes responsible for modifying histones (e.g., histone acetylases, deacetylases, and methyltransferases) are themselves often found complexed with other proteins (e.g., the Polycomb repressive complex (PRC)) which may help mediate downstream effects [14].

DNA methylation interaction with histone modifications

Changes in patterns of histone modifications are associated with promoter CpG island DNA hypermethylation, which in turn is associated with the epigenetic silencing of tumor suppressor genes commonly found in various cancers. Recent evidence shows that histone methylation, depending on the residues involved, may either recruit (H3K27, H3K9) or prevent recruitment (H3K4) of DNMT3. Recruitment of DNMT3 subsequently results in DNA methylation and transcriptional downregulation of the target gene, providing a link between the two pathways of epigenetic control. This model would explain why CpG island hypermethylation in tumors is often correlated with increased H3K27 and H3K9 methylation, and decreased H3K4 methylation. Another model suggests that repressive histone modifications and DNA methylation may be mutually sufficient in some instances, such that the presence of one can compensate for the other. For example, Polycomb group proteins (e.g., EZH2) associated with repressive histone modifications can recruit DNMT3, leading to hypermethylation of the DNA, loss of the repressive histone modifications, and maintenance of silencing. In addition, the enzymes that mediate and maintain epigenetic control (e.g., Polycomb group proteins) are often dysregulated in cancer cells, showing a mechanism by which cancer cells can manipulate gene expression on a more global scale [15].

DNA methylation as a biomarker and therapeutic target

The study of cancer epigenetics has great potential to further the diagnosis and treatment of cancer. Promoter DNA hypermethylation is common in tumors, and could potentially serve as an early diagnostic marker. In prostate cancer, the gene GSTP1 is found to be methylated in 90% of clinical malignancies, but not benign hyperplasias, and can be detected in a variety of bodily fluids, including blood plasma [16]. Discovery of similar genes for breast cancer could potentially lead to more effective early detection strategies. In addition, DNA methylome profiles and histone modification maps could potentially serve as prognostic tools once a tumor is identified. As expression levels of certain genes are correlated with response to treatment, and epigenetic factors exert a powerful and durable effect on gene expression, such whole genome methods have the potential to become powerful tools for personalized treatment.

Elucidation of epigenetic mechanisms may lead to the development of new treatments that specifically target epigenetic abnormalities or vulnerabilities in cancer cells. Examples of two such classes of drugs include DNA demethylating agents and histone deacetylase inhibitors, which have shown promise in the treatment of leukemia and T-cell lymphoma, respectively [4]. Broadly demethylating or acetylating the genome may have adverse consequences, however, as use of such agents has been shown to promote metastasis and resistance to other therapeutics in some contexts [17,18]. In addition, promoter methylation of genes like MGMT may make some tumors more susceptible to specific therapies, and demethylating these regions could be counterproductive [19]. Greater understanding of epigenetic pathways should enable development of more targeted therapeutics with fewer off-target effects.

II. Methylome Profiling

Affordable genome-wide approaches for assessing methylation have opened the way for methylome profiling of human cohorts. As sequencing costs continue to plummet, the limiting factor in genome-wide studies will no longer be the costs of sequencing, but the costs of analyzing the enormous datasets that result. While next-generation sequencing is becoming increasingly standardized, analysis is a complex task and an ongoing area of research. This section will summarize analytical methodology involved in methylome profiling, then discuss recent findings that are shaping methylome analysis.

Analysis methods

This section provides a conceptual overview of methylome analysis that begins with a discussion of methylation detection methods, and then focuses on several topics specific to MethylCap-seq: relative vs. absolute methylation, normalization, and quantification of MethylCap-seq data. A summary of notable literature utilizing MethylCap-seq is also presented. For additional background, see two excellent reviews on methylome profiling methods and analysis [20,21].

Detection methods

Because methylation marks are erased by typical amplification techniques (e.g., PCR), DNA must be specially processed prior to amplification and sequencing. One of three methods is typically used to detect DNA methylation: methylation-sensitive restriction enzyme digestion, sodium bisulfite treatment, or methylation-specific fragment capture (also referred to as affinity enrichment) [22]. Processed DNA is then incorporated into libraries for sequencing. Raw sequencing data, typically consisting of millions of short reads, is then aligned to the appropriate genome, providing a DNA methylation map of the sample. The methylation detection methods yield different types of signal, with corresponding advantages and disadvantages [21,23-25]. Sodium bisulfite treatment converts unmethylated cytosines to thymine (via a uracil intermediate), so that only methylated cytosines remain as cytosines (assuming complete conversion). Overlapping DNA fragments are then compared to determine the methylation percentage of particular CpGs (a measure of absolute methylation), which is comparable across the genome and between samples. However, this single-base resolution has a significant tradeoff: profiling every CpG in the human genome requires as much as a billion aligned reads, which is presently infeasible for most clinical studies. Instead, reads are typically focused in regions of interest, such as by selecting for CpG dense regions (e.g., reduced representation bisulfite sequencing). Thus breadth of genomic coverage is sacrificed for single-base resolution.
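To make the notion of absolute methylation concrete, the methylation percentage at a single CpG can be sketched as the fraction of overlapping bisulfite reads that retain a cytosine at that position. The Python snippet below is a simplified illustration under stated assumptions (complete conversion, one base call per read, other base calls ignored); the function name `methylation_fraction` and the input format are hypothetical and do not describe any specific published pipeline.

```python
def methylation_fraction(base_calls):
    """Fraction of reads supporting methylation at one CpG cytosine.

    base_calls: iterable of single-character base calls observed at the
    cytosine position across overlapping bisulfite reads.
      'C' = cytosine retained (read supports methylation)
      'T' = cytosine converted (read supports no methylation)
    Any other call (e.g. 'N', 'A', 'G') is ignored as uninformative;
    this is an assumption of the sketch.
    Returns None when no informative reads cover the site.
    """
    methylated = sum(1 for base in base_calls if base == "C")
    unmethylated = sum(1 for base in base_calls if base == "T")
    informative = methylated + unmethylated
    if informative == 0:
        return None
    return methylated / informative


# Three of four informative reads retain the cytosine at this site.
fraction = methylation_fraction("CCCTN")
```

Because the value is a proportion of reads at a single site, it does not depend on sequencing yield, which is why it can be compared across the genome and between samples.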

Affinity enrichment-based methods (e.g., MethylCap-seq) capture methylated fragments from the sample, with the resolution determined by the fragment size (typically ~130-180bp in our studies). Because the fragments were captured, we know that they are methylated relative to the rest of the sample. However, absence of methylation must be indirectly inferred from absence of signal, which can be problematic when coverage depth is insufficient (in which case absence of signal is not equivalent to absence of methylation). Signal must be normalized to account for variable sequencing lane yield; this is accomplished by dividing aligned read counts in a region by the total aligned read count. The result is a relative methylation profile of a sample. These relative methylation profiles can be compared between samples (if the assumption of comparable overall methylation between samples is valid; this assumption is usually not tested directly), and across the genome within the sample if further normalization for CpG density is performed. Confusing the matter, multiple published studies claim to convert relative methylation to absolute methylation values using CpG normalization [23,26]; while this may produce results that correlate on average with bisulfite-based methods, it is doubtful that such methods could account for systematic differences in global methylation between samples or groups of samples. In theory, an absolute methylation profile could be reconstructed by pairing sequencing data with a scaling factor, a measure of the overall methylation in a sample relative to other samples. Enrichment-based methods, despite their lower resolution vs. bisulfite-based methods, can achieve much greater coverage of CpGs in the genome with the same sequencing depth [23].
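The yield normalization described above can be sketched in a few lines of Python. The example expresses signal as reads per million aligned reads and shows why the resulting profile is relative: two samples with the same distribution of reads but different lane yields give identical normalized values. The function names, the per-million scale, and the `scaling_factor` argument are assumptions introduced for illustration; how a scaling factor would be measured is not specified by this sketch.

```python
def reads_per_million(bin_counts, total_aligned_reads):
    """Divide per-region aligned read counts by the total aligned read
    count, expressed per million reads."""
    if total_aligned_reads <= 0:
        raise ValueError("total_aligned_reads must be positive")
    return [count * 1_000_000 / total_aligned_reads for count in bin_counts]


def apply_scaling_factor(normalized_signal, scaling_factor):
    """Rescale a relative profile by a per-sample factor representing
    overall methylation relative to other samples (hypothetical input)."""
    return [value * scaling_factor for value in normalized_signal]


# Same read distribution, twofold difference in lane yield.
sample_a = reads_per_million([10, 30], total_aligned_reads=1_000_000)
sample_b = reads_per_million([20, 60], total_aligned_reads=2_000_000)

# If sample_b were known to carry half the overall methylation of
# sample_a, a scaling factor of 0.5 would separate the two profiles.
sample_b_scaled = apply_scaling_factor(sample_b, scaling_factor=0.5)
```

The identical values for `sample_a` and `sample_b` illustrate the limitation discussed in the text: a global difference in methylation between the two samples would be invisible after yield normalization alone.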

Normalization and quantification of MethylCap-seq signal

Normalization and quantification of methylation signal can be envisioned as a 5-step process. Fragment estimation and sequencing yield normalization are essential elements of the process. Count aggregation and exclusion of duplicate reads are recommended, while CpG density normalization is optional. Fragment estimation involves extending the short reads from the sequencer to reconstruct the original captured fragments. Count aggregation aggregates the reads in genomic bins of set size, which smooths the data and condenses it. As methylation of adjacent CpGs is highly correlated and resolution of enrichment-based methods is inherently limited, count aggregation is a simple way to package the data with minimal data loss. Sequencing yield normalization normalizes signal to account for arbitrary sequencing lane yield; this is accomplished by dividing aligned read counts in a region by the total aligned read count. Duplicate exclusion attempts to correct for PCR artifacts introduced by genome amplification, where certain fragments are replicated more frequently relative to other fragments, sometimes by orders of magnitude. Under the assumption that sequencing depth is low compared to the diversity of fragments used to generate the sequencing library, when multiple reads share the same exact sequence, all but one are assumed to be PCR artifacts, and these duplicate reads are excluded from analysis. Duplicate exclusion can potentially result in analysis artifacts, as regions with higher CpG density or higher methylation may be more prone to random duplicates representing identical fragments from different cells (as opposed to PCR artifacts).
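The steps above (excluding optional CpG density normalization) can be sketched as a small Python pipeline. This is a simplified illustration, not the workflow used in this project: the read representation (chromosome, 5' position, strand), the identification of duplicates by identical alignment position and strand, the counting of a fragment in every bin it overlaps, and the use of the deduplicated read total as the normalization denominator are all assumptions made for the example. The 150bp fragment size and 500bp bin size are placeholder defaults.

```python
from collections import Counter

FRAGMENT_SIZE = 150  # bp; placeholder default for this sketch
BIN_SIZE = 500       # bp; placeholder default for this sketch


def drop_duplicates(reads):
    """Keep one read per (chrom, 5' position, strand); the rest are
    treated as PCR duplicates (assumption: position identifies sequence)."""
    seen = set()
    unique = []
    for read in reads:
        if read not in seen:
            seen.add(read)
            unique.append(read)
    return unique


def extend_to_fragment(read, fragment_size=FRAGMENT_SIZE):
    """Extend a read from its 5' end to the estimated fragment.
    Returns (chrom, start, end) as a half-open interval."""
    chrom, five_prime, strand = read
    if strand == "+":
        return chrom, five_prime, five_prime + fragment_size
    start = max(0, five_prime - fragment_size + 1)
    return chrom, start, five_prime + 1


def aggregate_counts(fragments, bin_size=BIN_SIZE):
    """Count each fragment once in every fixed-size bin it overlaps."""
    counts = Counter()
    for chrom, start, end in fragments:
        for bin_index in range(start // bin_size, (end - 1) // bin_size + 1):
            counts[(chrom, bin_index)] += 1
    return counts


def normalize_yield(counts, total_reads):
    """Convert bin counts to reads per million aligned reads."""
    return {key: n * 1_000_000 / total_reads for key, n in counts.items()}


def process(reads):
    unique = drop_duplicates(reads)
    fragments = [extend_to_fragment(read) for read in unique]
    return normalize_yield(aggregate_counts(fragments), len(unique))


# Two identical plus-strand reads (one is dropped) and one minus-strand read.
toy_reads = [("chr1", 400, "+"), ("chr1", 400, "+"), ("chr1", 1200, "-")]
signal = process(toy_reads)
```

In this toy input the plus-strand fragment spans positions 400-549 and therefore contributes to two bins, which shows how the choice of bin assignment rule affects the aggregated signal.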

Tools for MethylCap-seq data processing

Two popular tools for processing of enrichment-based data are Batman [27] and MEDIPS [26]. After working briefly with MEDIPS, we developed our own custom workflow for enrichment-based sequencing data to enable greater flexibility in our analyses [28].

Batman was developed to address the poor agreement between enrichment-based methods and bisulfite-based methods, which is especially problematic since bisulfite-based methods are often used to validate the findings of enrichment-based methods. The authors use a Bayesian deconvolution strategy to account for differences in CpG density that cause genomic regions with high CpG density (and thus more opportunities for methylation) to be preferentially enriched, a factor that does not affect bisulfite-based methods. Fragment estimation assumes a 500bp fragment size, and count aggregation uses 100bp bins. The tool is implemented in Java. While the tool greatly improves the agreement between enrichment-based and bisulfite-based data, the Bayesian deconvolution model is computationally intensive. In addition, the tool is optimized for use with microarray data, and sequencing data requires substantial preprocessing before it can be input into Batman.

MEDIPS similarly applies CpG density normalization to improve agreement with bisulfite-based methods. The authors use coupling factors rather than a Bayesian deconvolution model, simplifying calculations and improving computation speed vs. Batman [26]. Fragment estimation and count aggregation bin sizes are customizable; default bin size is 50bp. MEDIPS is implemented in R, and use requires some familiarity with R. MEDIPS includes a methylProfiling function to compare two samples for differentially methylated regions (DMRs). Unfortunately, in most scenarios researchers wish to compare two or more groups of samples, and thus the feature has limited usefulness. While MEDIPS is functional, the tool is still slow and RAM intensive, and we found it difficult to use for larger studies involving many samples.

The workflow used for this project skips the CpG density normalization step in favor of simplicity and transparency. Fragment estimation and count aggregation are customizable; I used 150bp fragment size and 500bp bins for our dataset of endometrial tumors profiled using MethylCap-seq. Implemented in a combination of bash and C++, the workflow is optimized for parallel processing in a supercomputing environment, and outputs a binary count file that is used for downstream analysis.

Other published methods

Harris et al. compared enrichment-based and bisulfite-based methylome profiling methods [24]. For enrichment-based methods, fragment estimation assumed a 150bp fragment size, duplicate reads were excluded, counts were aggregated in 200bp or 1000bp windows, and CpG density normalization was performed by iteratively comparing CpG contribution per estimated fragment with the regional fragment distribution. For comparison between methods, methylation was binarized by setting tag density or beta-value thresholds (0.2 for the bisulfite-based methods). To avoid overlap during region classification, regions were assigned to UCSC features in a set priority order (promoter > exon > UTR > intron > intergenic).
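The priority-order assignment and threshold-based binarization described above can be illustrated with a short Python sketch. The priority order and the 0.2 threshold are taken from the description in the text; the function names, the use of a set of overlapping feature labels as input, and the choice of a greater-than-or-equal comparison at the threshold are assumptions made for illustration and may differ from the published implementation.

```python
# Priority order as described in the text: earlier entries win.
FEATURE_PRIORITY = ["promoter", "exon", "UTR", "intron", "intergenic"]


def classify_region(overlapping_features):
    """Assign a region to the single highest-priority feature it overlaps.

    overlapping_features: collection of feature labels overlapping the
    region. Regions overlapping no annotated feature are called
    intergenic (assumption of this sketch).
    """
    for feature in FEATURE_PRIORITY:
        if feature in overlapping_features:
            return feature
    return "intergenic"


def binarize_methylation(beta_value, threshold=0.2):
    """Call a site or region methylated when its beta-value meets the
    threshold (>= is an assumption; the boundary rule is not stated)."""
    return beta_value >= threshold


# A region overlapping both an intron and an exon is assigned to "exon".
label = classify_region({"intron", "exon"})
calls = [binarize_methylation(beta) for beta in (0.05, 0.2, 0.85)]
```

Assigning each region to exactly one feature class keeps per-class summaries from double counting regions that overlap several annotations.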

Bock et al. reported a method for integrated analysis of enrichment-based data alongside bisulfite-based data [23]. For enrichment-based methods, duplicate reads were excluded, counts were not aggregated, and linear regression models were used to normalize for CpG density. They observed that combining lanes of data for the same sample improved correlation with bisulfite-based data. Statistical testing for differentially methylated regions was performed using Fisher’s exact test.
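For readers unfamiliar with Fisher’s exact test, the following Python sketch computes a two-sided p-value for a 2x2 table using only the standard library. The table layout used in the example (rows are two samples, columns are reads inside versus outside a candidate region) is an assumption made for illustration and is not claimed to match the published analysis; the function name is hypothetical. The implementation enumerates all tables with the observed margins and is intended for small counts only.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    The p-value is the total hypergeometric probability of all tables
    with the same margins that are no more probable than the observed one.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    total = row1 + row2
    denominator = comb(total, col1)

    def probability(x):
        # Probability that the top-left cell equals x given fixed margins.
        return comb(row1, x) * comb(row2, col1 - x) / denominator

    observed = probability(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Small relative tolerance so ties with the observed table are included.
    return sum(
        p
        for p in (probability(x) for x in range(lowest, highest + 1))
        if p <= observed * (1 + 1e-7)
    )


# Toy counts: reads inside / outside a region for two samples.
p_value = fisher_exact_two_sided(3, 1, 1, 3)
```

With genome-scale read totals an exhaustive enumeration like this becomes slow, so a statistical library would normally be used instead.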

Lan et al. applied a bi-asymmetric-Laplace model to disentangle signal from neighboring CpGs and thus output a methylation score for each CpG [29]. Notably, they eschewed fragment estimation in favor of a peak-finding approach (commonly used for ChIP-seq analyses), with the basic assumption that an adequate pile of fragments surrounding a methylated site provides sufficient information to attribute signal to individual CpGs. The approach claims to be able to increase resolution beyond the natural limits of fragment size (typically ~150bp) to less than 50bp, and the same approach was applied in a subsequent research article [30].

MethylCap-seq literature

Sequencing-based profiling of the methylome began in earnest around 2008, with a subset of those studies using a MethylCap-seq approach. This section summarizes the primary literature utilizing MethylCap-seq by category: methods article, methods comparison, research article.

Method articles

[27,29,31,32]

Methods comparison

[23-25,33]

Research articles

[30,34-50]

Methylome Profiling: The Biology

This section will focus on biological findings resulting from and influencing methylome profiling. As the success of computational investigation hinges on knowing where to look and what to look for, this section will particularly emphasize genomic regions of interest.

Promoter CGI methylation

Methylation within promoter CGIs is well established as a mechanism of epigenetic silencing, and is usually the first genomic feature examined in any methylome profiling study. Many array-based methods, including early variants of the Infinium HumanMethylation bead array, only interrogate CpGs within promoter CGI. CGI are rarely methylated in the genome, even more so if they lie within promoters. Methylation of promoter CGI is thought to mediate silencing by suppressing the binding of transcription factors necessary for transcription initiation, most likely by recruiting suppressive factors, typically complexed with methyl-CpG-binding proteins, that provide a direct steric hindrance or maintain the DNA in a closed conformation that is inaccessible to transcription factor complexes.

CGI shore methylation

Andrew Feinberg’s group proposed that methylation changes associated with changes in gene expression tended to occur not in CGIs or promoters, but in flanking regions. This is consistent with so-called methylation spread theory

[51], which postulates that methylation changes begin at foci adjacent to a gene promoter and gradually spread into the promoter, with transcriptional potential dropping off early in this process. Using the CHARM assay, which pairs methylation-sensitive restriction enzyme digestion with a tiling array, they showed that CGI shores (defined as flanking regions 200-2000bp distant from a CGI) were the primary hotspots for epigenetic change in both colon cancer and hematopoietic progenitor differentiation [52,53]. They further showed that cancer-associated methylation accumulated in shores associated with tissue-specific methylation, also consistent with methylation spread theory. The role of CpG shores has since been independently confirmed in other contexts, including induced pluripotent stem cells [54], acute myeloid leukemia [55], and bladder cancer [56].

First exon methylation

Early methylome profiling showed that first exons were frequently unmethylated together with CGI and promoters, suggesting a possible role in gene regulation [57]. This potential was confirmed in a recent study by Brenet et al., which compared first exon and promoter methylation and showed that first exon methylation was more tightly coupled to expression in an acute myeloid leukemia cell line [37].

Gene body methylation

While first exons are typically unmethylated, gene bodies are frequently methylated [57,58]. Multiple studies have reported a direct correlation between gene body methylation and gene expression [41,59,60]. Particularly intriguing is that exon/intron borders often show sharp transitions in methylation, with exons heavily methylated and introns less methylated.

While the role of gene body methylation remains unclear, the distinct pattern of methylation in exons and introns suggests that methylation in tandem with nucleosome positioning may play a role in alternative gene splicing.

Intragenic CGI methylation

While the role of promoter CGI methylation in gene silencing is well established, the role of intragenic CGI methylation is less clear. Using a method similar to MethylCap-seq, Adrian Bird’s group found that intragenic

CGI were frequently differentially methylated compared to promoter CGI and promoter CGI shores in differentiated mouse hematopoietic cells [42].

Unexpectedly, intragenic CGI methylation was associated with gene silencing.

It is not clear whether this finding translates to human cells and other contexts; a previous study by their group found that intragenic CGI were differentially methylated with similar frequency in mouse and human cells

(embryonic stem cells vs. whole blood), although intragenic CGI in colon cancer cells were not differentially methylated compared to normal tissue [49].

Partially Methylated Domains

Lister et al. introduced the concept of partially methylated domains (PMDs): large (median size ~150kb) genomic regions that showed reduced methylation and expression compared to surrounding regions [61]. Curiously, these PMDs were seen in differentiated cell lines but not embryonic stem cell lines. Reduced methylation seen in PMDs was restored when the cells were reprogrammed via induced pluripotency, suggesting a role for PMDs in cell differentiation [62].

Laird’s group subsequently uncovered a unique pattern of aberrant methylation in colon tumors: focal CGI methylation occurring within extended regions of reduced methylation compared to surrounding regions [63]. These regions of reduced methylation overlapped significantly with Lister’s PMDs and previously reported multi-gene domains of long range epigenetic silencing

(LRES) in [5]. Significantly, methylation gains in tumors vs. normal tissue were more likely to occur in PMDs, and constitutive methylation was less likely to occur in PMDs. PMDs were seen in colon tumors and immortalized somatic cell lines but not in adjacent normal tissue; however, only the colon tumors showed focal CGI hypermethylation within the PMDs.

PMDs corresponded to nuclear-lamina-associated domains (LADs) seen in a fibroblast cell line, suggesting a general biological concept behind this striking methylation phenotype.

Feinberg's group uncovered large blocks of contiguous hypomethylation in multiple colon tumors relative to normal tissue [64].

Notably, methylation loss occurred predominantly in regions that were highly methylated (~75%) in normal tissue and fell to intermediate levels (~55%) in tumors. These blocks overlapped highly with Lister's PMDs. Most genes in the blocks were silenced in both tumors and normal tissue; however, genes from the blocks that were expressed in normal tissue were at increased risk for silencing in tumors. The reverse was also true; block genes silenced in normal tissue were often reexpressed in the tumors, with considerable variation between tumors.

Taken together, one could speculate that PMDs represent potential hotspots for epigenetic change: zones where tumor suppressors are particularly vulnerable to silencing and oncogenes can be reactivated. The publications listed here all used whole genome bisulfite sequencing; it remains to be seen whether the findings can be recapitulated in large cohorts using less costly methods.

Methylation and transcription factor binding sites

While regulation of gene expression via promoter methylation is generally thought to be mediated through its effects on transcription factor binding, the role of DNA methylation at distal enhancers is not well characterized. Using whole genome bisulfite sequencing data from a mouse embryonic stem cell and a neural progenitor cell line, Stadler et al. characterized short stretches of low-intermediate methylation (10-50%) which they termed LMRs [65]. These LMRs tended to be distal from promoters and bore chromatin marks of enhancers. Furthermore, the authors demonstrated that the characteristic methylation state of LMRs was established by transcription factor binding.


More broadly, DNase I hypersensitivity, chromatin marks (e.g.,

H3K4me1 and ), and binding by the histone acetyltransferase p300 are known to be associated with enhancers [66]. An ENCODE track provides compiled transcription factor binding sites and DNase I hypersensitivity regions from aggregate cell lines. The degree of context dependency of these data, and thus their applicability to studies in other tissues and malignancies, remains unclear. When examining profound epigenetic alterations such as those expected in cancer, comparisons with and transcription factor data from the same samples or same tissues would likely yield more robust results.

Exon methylation and regulation of alternative splicing

While several genomic profiling studies revealed sharp methylation changes defining exon and intron boundaries (see the discussion of gene body methylation), the effects of such methylation and a mechanism for these effects remained elusive. A recent study by Shukla et al. revealed a role for exon methylation in alternative splicing [67]. Exon methylation inhibited CTCF binding in human B-cell lines, decreasing incorporation of the exons into RNA transcripts. Exon incorporation in turn was mediated by RNA polymerase pausing during transcript elongation. In addition to DNA methylation, alternative splicing may be marked by other epigenetic factors including H3K36me3 [68].


Chapter 2. Thesis Rationale and Research Objectives

DNA methylation is a stable epigenetic mark often perturbed during [4]. Growing evidence demonstrates a role for DNA methylation both as a regulator of gene expression and as a potential biomarker [69,70]. Widespread accumulation of methylation in regulatory elements in specific cancer types, termed the CpG island methylator phenotype (CIMP), may even direct carcinogenesis [71,72]. Few studies, however, have examined the CpG island methylator phenotype genome-wide in a clinical cohort, as until recently analyzing methylation genome-wide in a large number of samples was cost-prohibitive. As a result, the methodology for analyzing these large datasets remains poorly developed. The Infinium

HumanMethylation beadchip is a popular tool for sampling methylation at many loci across the genome [73], but the suitability of this tool for defining

CIMP de novo in an uncharacterized cancer type (as opposed to verifying

CIMP) has not been thoroughly evaluated. Studies show general but not perfect agreement between Infinium and other methylome profiling methods in tumors and normal tissue [23,25], but agreement has not yet been evaluated in the context of the profound methylation dysregulation associated with

CIMP. A clear need exists for a study that validates Infinium-based CIMP discovery using an alternative methylome profiling method.


To address this problem, I analyzed methylation in endometrioid endometrial cancers and unmatched normal endometria [74]. These data were generated using MethylCap-seq. Initial analyses suggested deficiencies in library preparation in some samples that could compromise the validity of methylation calls, leading me to develop a quality control module to identify and control for these issues. I also analyzed the Infinium methylation data from an endometrioid endometrial cohort from The Cancer Genome Atlas

Consortium, which later published a cluster analysis of the Infinium data [75].

I set out to identify CIMP using endometrioid endometrial cancer as the model cancer and MethylCap-seq as the methylome-profiling method. The goals of this project were to: 1) develop and validate a MethylCap-seq quality control module to identify and exclude samples with spurious methylation data; 2) develop a method to identify putative CIMP tumors using the MethylCap-seq platform; 3) describe the differences between putative CIMP tumors using the Infinium platform; 4) compare agreement in CIMP classification between my method based on MethylCap-seq and the published clustering method based on Infinium; and 5) demonstrate a signature that could classify CIMP tumors prospectively without a full methylome profiling experiment. Chapter 3 addresses goal 1 as well as the general methodology developed for these analyses. Chapter 4 addresses goals 2 through 5 and presents data that were collected along the way. This thesis tests the hypothesis that a combined methylation analysis method using MethylCap-seq and Infinium can be used to define CIMP de novo in endometrial cancer.


Chapter 3. Enrichment-based DNA methylation analysis using next-generation sequencing: quality control, estimating changes in global methylation and the effects of increased sequencing depth.

I. Introduction

Epigenetic mechanisms are responsible for determining and maintaining cell fate, stably differentiating various tissues in the human body.

Epigenetic changes can induce chromatin remodeling, leading to lasting effects on gene expression. Epigenetic aberrations, including perturbed DNA methylation, have been implicated in a diverse array of cancer-associated pathways, including silencing of tumor suppressors [4], activation of oncogenes [6], promotion of metastasis [17] and resistance to therapeutic drugs [8-10]. DNA methylation is thought to be a secondary or tertiary effector in the epigenetic cascade [76], with its accumulation in promoter CpG islands leading to durable gene silencing [58]. Due to its biological and experimental stability as an epigenetic marker, DNA methylation is a potential biomarker for disease, especially cancer [70]. While methylation marks are erased by standard PCR, other methods preserve this information, including affinity-based capture of methylated regions and bisulfite conversion [21].

Low-cost genome sequencing is revolutionizing the field of DNA methylation analysis. New techniques are enabling deep methylome profiling of biological samples on an unprecedented scale [21]. These sequencing projects can yield several GB of data per sample, presenting significant


bioinformatic challenges. While standardized pipelines have been developed for the most common elements of analysis such as sequencing data processing and alignment, methylome analysis often relies on custom methods specific to the methylation assay. To address quality control concerns and to determine required sequencing depth, analytic methods must not only be assay-specific but must also be tailored to the specific experiment

[20,24].

MethylCap-seq is a method for genome-wide profiling of DNA methylation that relies on enrichment of methylated DNA fragments using methyl-CpG binding domain protein 2 (MBD2) [77]. These fragments are then shotgun-sequenced, yielding a map of DNA methylation across the genome.

Advantages of MethylCap-seq compared to other platforms include cost-effective profiling of CpG dense regions and true whole genome coverage [25].

Optimization of MethylCap-seq data collection and analysis requires attention to data quality and saturation. Proper quality control ensures that methylation calls are not impacted by inconsistencies in methylation enrichment. Greater saturation increases statistical confidence in methylation calls, especially in sparsely methylated regions [78].

A challenge currently facing enrichment-based methods that rely on sequencing, such as MethylCap-seq, is estimating changes in global methylation. Information on absolute methylation levels is erased during sequencing normalization, which is necessary to control for day-to-day


variations in sequencing output. Thus, this information must be recaptured using other computational methods or assays.

In this study, we provide evidence-based guidelines for quality control, illustrate the effects of additional reads on quality control metrics and provide empirical support for a global methylation indicator, an analytic tool that correlates with overall methylation levels. We hope these findings will help other groups standardize and optimize their MethylCap-seq experiments to take advantage of this promising methylome profiling method.

II. Results and Discussion

Quality control exclusion criteria reduce noise in methylation signal and improve analytical power.

Our automated quality control (QC) module, based on MEDIPS [26], was implemented to identify technical problems in the sequencing data and flag potentially spurious samples. One goal of the QC module was to provide rapid feedback to investigators regarding dataset quality, facilitating protocol optimization prior to committing resources to a larger scale sequencing project. A second goal was to identify samples that should be excluded from analyses due to data validity concerns. The validity of a MethylCap-seq experiment is dependent on enrichment of methylated fragments prior to sequencing. A failure in enrichment invalidates any downstream data, and therefore identifying such failures is vital. Also important is verifying the statistical reproducibility of the data for each sample. Generating replicate sequencing lanes for each sample to assess experimental reproducibility


empirically is often not cost effective, and thus addressing this issue computationally is desirable. Similarly, the confidence in methylation calls is related to the breadth and strength of signal at the CpGs in the genome. We assessed enrichment of methylated fragments using the CpG enrichment parameter, which compares the frequency of CpGs in the sequenced sample with the frequency of CpGs in a reference genome (hg18 for this study).

Statistical reproducibility was assessed by calculation of saturation, the

Pearson correlation of two random partitions of the sequenced sample [26].

Breadth and strength of methylation signal was assessed using 5X CpG coverage, representing the fraction of CpG loci with five or more reads in the sample compared to the total number of CpGs in the reference genome.

These QC parameters were calculated for each sample using MEDIPS [26].
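As a rough illustration of two of these metrics, the sketch below computes a saturation-style correlation and 5X CpG coverage from toy data. It is not the MEDIPS implementation: the function names, the representation of each read as a bin index, and the use of a single random split are assumptions made for illustration only.

```python
import random
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def saturation(read_bins, n_bins, seed=0):
    """Split reads at random into two equal partitions, count reads per
    genomic bin in each partition, and correlate the two count vectors.

    read_bins: one bin index per read (hypothetical input format).
    """
    rng = random.Random(seed)
    reads = list(read_bins)
    rng.shuffle(reads)
    half = len(reads) // 2
    counts_a = [0] * n_bins
    counts_b = [0] * n_bins
    for b in reads[:half]:
        counts_a[b] += 1
    for b in reads[half:2 * half]:
        counts_b[b] += 1
    return pearson(counts_a, counts_b)


def coverage_5x(reads_per_cpg, min_reads=5):
    """Fraction of reference CpGs covered by at least `min_reads` reads.

    reads_per_cpg: read depth at every CpG in the reference, including
    CpGs with zero reads (hypothetical input format).
    """
    covered = sum(1 for depth in reads_per_cpg if depth >= min_reads)
    return covered / len(reads_per_cpg)
```

A lane whose reads concentrate reproducibly in the same bins yields a correlation near 1, whereas a lane with sparse or randomly scattered reads yields a lower value.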

Appendix A1 demonstrates the results of the QC module for the

Endometrial Dataset. 203 lanes of sequencing data were generated for 101 unique samples. 43 lanes failed QC, representing 21 unique samples. To assess how lanes that pass QC might differ from lanes that failed QC, we computed the noise in methylation signal, representing percentage of uniquely aligned extended reads falling in 500bp bins without CpG dinucleotides

(Figure 1). Median noise in samples that failed QC (6.40%) was more than three-fold greater than in samples that did not fail QC (2.04%, p<0.001), and closely resembled noise in input (7.82%). Excluding "QC-failed" lanes did not significantly decrease median noise levels (2.04 vs. 2.22, p=0.08) but did greatly decrease the variation in noise levels between samples. As the distribution of noise levels is positively skewed and not normal, a small


number of outliers would not be expected to significantly shift the median noise level. To investigate whether the additional noise seen in "QC-failed" samples impacted sequencing reproducibility, we computed the Pearson correlation between replicate lanes of samples that passed QC (n = 68) vs. those that failed QC (n = 9) (Appendix A2). Replicates of samples that passed QC correlated more highly than replicates of samples that failed QC

(average r = 0.90 vs. 0.59; p<0.001). Variation in replicate correlation between samples was also noticeably less in the QC pass group (relative standard deviation = 6.7% vs. 27.1%). We surmise that failures in methylation enrichment result in a more random sampling of the fragment distribution regardless of methylation status, resulting in increased signal in regions where methylation should not be detectable.


Figure 1. QC exclusion criteria reduce noise in methylation signals. Percentages of uniquely aligned reads falling in 500bp bins containing no

CpG dinucleotides pre- and post-QC analysis were plotted as a standard box plot for samples prior to QC filtering, samples that passed QC, and samples that did not pass QC in the endometrial dataset. An input from a sample that was not subjected to methylation capture is included for reference. The number of samples in each group is included above the baseline. Values for replicate lanes in each group were averaged, and samples were compared statistically using a Wilcoxon rank-sum test. Whiskers indicate 10th and 90th percentiles. 13.5% of 500bp bins in the genome are classified as CpG- barren.


As the goal of many methylome profiling studies is to identify differentially methylated regions (DMRs) between biological groups, we next assessed whether our QC exclusion criteria might improve our analytical power to detect DMRs. We compared DMRs between 89 endometrial tumors and 12 nonmalignant endometrial tissue samples across several genomic features. Excluding sequencing lanes that failed QC (corresponding to 19 tumor and 2 nonmalignant samples) resulted in more DMRs in every genomic feature assessed (Table 1). The greatest gains were seen in promoters and

CpG shores, where the number of DMRs increased 22-fold and 2-fold, respectively. Gains in CpG islands and promoter-associated CpG islands were more modest (1.6-fold and 1.05-fold). These results trend inversely with

CpG density, perhaps reflecting greater benefit from QC exclusion in regions where coverage is lower. We speculate that the improvements in DMR detection resulting from exclusion of samples that fail QC would be even greater when working with smaller sample sizes or biological groups with more similar methylation patterns.

Table 1. Differentially Methylated Regions, Endometrial Tumors vs. Nonmalignant Endometrial Tissue

Genomic feature All samples Samples Passing QC only CpG islands Promoter-associated CpG shores Promoters

The effect of additional sequencing lanes on quality control metrics


DNA sequencing cores are frequently asked whether additional lanes of sequencing data are necessary or desirable for MethylCap-seq experiments. To address this question, we analyzed a large dataset of ovarian tumors of which 7 samples had been resequenced (using the same genomic library), for a total of 15 lanes (Appendix A3). First, the degree of correlation between the replicate lanes was analyzed to ensure that additional lanes of data would not introduce excessive variation. As shown in Figure 2, replicate lanes from sequencing the same library twice correlated highly (R2 value of 0.98). Note that the question here was the value of additional sequencing lanes and not of additional technical or biological replicates – the correlation between technical or biological replicates would be expected to be much lower than the correlation between two lanes sequencing the same library.

CpG enrichment, saturation and 5X coverage were then evaluated for individual lanes and combined lanes (Figure 3). CpG enrichment varied somewhat between samples (range: 2.33-3.02), but was extremely similar for replicate lanes (<1% deviation from the combined lane on average).

Saturation improved modestly from a median of 0.79 to a median of 0.86. As saturation values for individual lanes of MethylCap-seq data typically range from 0.6 to 0.85 in our hands and we consider a saturation value of 0.6 acceptable for analysis, this improvement may be inconsequential although it is statistically significant. 5X coverage improved noticeably from a median of 0.21 to a median of 0.28, representing an average 38% gain. As 5X coverage represents a minimum signal level needed to reliably differentiate a methylated locus from a locus with no methylation (or the absence of a methylation signal), we speculate that this increase could significantly increase the statistical power to detect DMRs, particularly in small or lightly methylated regions.


Figure 2. Replicate sequencing lanes for MethylCap-seq experiments correlate highly. Replicate lanes for each sample were randomly assigned to two partitions, and the average rpm of 6000 (of 6M) randomly selected 500bp bins were compared between partitions.


Figure 3. Additional lanes of sequencing data moderately increase saturation but greatly increase 5X CpG coverage. Variations in CpG enrichment (A), saturation (B), and 5X coverage (C) were assessed for 15 lanes of data in the ovarian study corresponding to 7 samples by generating plots of individual lanes and combined replicate lanes for each sample. (D) Average percent deviation of the individual lanes from the combined lane for each sample was plotted for each parameter. Error bars for (D) represent standard error. Asterisks represent Student t-test p<0.05.

The global methylation indicator (GMI) correlates inversely with an in vitro methylated tracer sequence.

We recently proposed a computational method to compare genome-wide changes in methylation patterns between samples in a given experiment [77]. As MethylCap-seq signal (in reads) is normalized by raw read counts to adjust for variability in lane yield, two samples with identically distributed methylation yet different absolute levels of methylation would be expected to yield identical normalized methylation signals at any given locus. The GMI method relies on the observation that in vitro methylated samples display characteristic changes in the methylation signal distribution as quantified in a MethylCap-seq experiment, and these changes are CpG-density dependent.

Methylation signal shifts from low CpG content regions to high CpG content regions; this difference can be quantified by calculating the area under the curve of the average normalized methylation signal plotted across CpG density. The GMI is a potentially powerful tool for capturing differences in methylation distribution between samples.

In an effort to validate the GMI as a surrogate for global methylation, we developed a complementary analysis utilizing an in vitro methylated construct. Samples from an acute myeloid leukemia (AML) study were used for this analysis [39]. This methylated construct was spiked into the genomic

DNA in the AML samples prior to sonication at a defined concentration and subjected to methylated DNA enrichment along with the sample DNA. The

"spike-in" was originally intended to verify successful enrichment; if enrichment occurs, PCR for the methylated plasmid would show increased copy number after enrichment. Additionally, this "spike-in" also indicates global methylation levels in a sample since the methylated plasmid competes with the natively methylated genomic DNA fragments for binding to the MBD protein. When the proportion of methylated to unmethylated genomic


fragments is high prior to enrichment, the methylated plasmid "spike-in" gets enriched relatively less, and vice versa. Indeed, we found that read counts aligned to the plasmid correlate inversely with GMI (Figure 4, Appendix A4).

This result provides empirical evidence that GMI can capture changes in absolute global methylation levels for MethylCap-seq experiments. Such a metric could be useful for gauging response to treatments that are known or expected to alter the methylome.


Figure 4. Global methylation indicator scales inversely with read counts from a "spiked" in vitro methylated construct. The pIRES2-EGFP plasmid was in vitro methylated and "spiked" at a set concentration into each of 14 samples from the decitabine study prior to sequencing. After sequencing, GMI was calculated and plotted against the inverse of the number of normalized reads aligning to the plasmid. A linear best fit was drawn through the experimental points (p = 0.036, R2 = 0.318).

III. Methods

Patient samples

89 endometrioid endometrial cancer and 12 unmatched nonmalignant endometrium samples were obtained from Washington University. All studies involving human samples were approved by the Human Studies Committee at Washington University and at The Ohio State University.

Seven samples from a larger patient cohort were obtained from TriService General Hospital, Taipei, Taiwan. All studies involving human ovarian cancer samples were approved by the Institutional

Review Boards of TriService General Hospital and National Defense Medical

Center.

Fourteen bone marrow samples from a single-center Phase II clinical trial involving patients with acute myeloid leukemia (AML) at The Ohio State

University were obtained for this investigation. The study design and the results of the trial for the entire patient cohort have been reported elsewhere

[79]. All studies involving these samples were approved by The Ohio State

University Human Studies Committee.

Methylated-DNA capture (MethylCap-seq)

Enrichment of methylated DNA was performed with the Methyl Miner

Kit (Invitrogen) according to the manufacturer’s protocol as previously described [77]. Briefly, one microgram of sonicated DNA was incubated at room temperature on a rotator mixer in a solution containing 3.5 micrograms of MBD-Biotin Protein coupled to M-280 Streptavidin Dynabeads. Non-captured DNA was removed by collecting beads with bound methylated DNA on a magnetic stand and washing three times with Bind/Wash Buffer.

Enriched, methylated DNA was eluted from the bead complex with 1M NaCl and purified by ethanol precipitation. Library generation and 36-bp single-ended sequencing were performed on the Illumina Genome Analyzer IIx according to the manufacturer’s standard protocol.

MethylCap-seq experimental quality control and exclusion criteria

The automated quality control (QC) module was implemented as previously described [77]. Pre-aligned sorted.txt files from the Illumina

CASAVA 1.7 pipeline were used to reduce turnaround time for analysis. In brief, duplicate alignments were removed from the aligned sequencing file (a correction for potential PCR artifacts), and the resulting output was loaded into an R workspace. MEDIPS [26] was used to analyze CpG enrichment, saturation, and CpG coverage.

Sequencing lanes were excluded using the following thresholds: CpG enrichment < 1.4, saturation < 0.5 and CpG 5x coverage < 0.05. These criteria and thresholds were chosen for technical relevance and their ability to identify known technical issues without a bias for specific biological groups.

Samples were excluded if any of the criteria were not met. As CpG coverage was assessed qualitatively for analysis of the endometrial dataset, five lanes of data with borderline 5x CpG coverage that would otherwise have qualified for exclusion under this criterion were retained.
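The exclusion rule can be sketched as a simple threshold filter. The thresholds below are the ones quoted above; the dictionary keys and function names are hypothetical and do not correspond to the QC module's actual interface.

```python
# Thresholds quoted in the text. A lane fails QC if ANY metric falls
# below its threshold. Key names are illustrative assumptions.
QC_THRESHOLDS = {
    "cpg_enrichment": 1.4,
    "saturation": 0.5,
    "coverage_5x": 0.05,
}


def failed_criteria(lane_metrics, thresholds=QC_THRESHOLDS):
    """Return the names of the QC metrics that fall below threshold."""
    return [name for name, cutoff in thresholds.items()
            if lane_metrics[name] < cutoff]


def passes_qc(lane_metrics, thresholds=QC_THRESHOLDS):
    """A lane passes only if no criterion is failed."""
    return not failed_criteria(lane_metrics, thresholds)
```

Reporting which criteria failed, rather than only a pass/fail flag, makes it easier to distinguish an enrichment failure from insufficient sequencing depth.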

For the DMR comparison (Table 1), methylation signal was normalized for each lane and averaged among replicate lanes for each sample. The “All” group includes samples with merged QC pass lanes, samples with merged

QC fail lanes, and samples with merged QC pass and QC fail lanes, and thus is not a straight sum of the QC-passed and QC-failed samples.


For the reproducibility comparison (Appendix A2), Pearson r was calculated using two replicate lanes corresponding to each sample represented in the

QC pass and QC fail groups. When a sample had more than two replicate lanes in a single group, two lanes were randomly chosen for the analysis.

Samples lacking two replicate lanes in either the QC pass or QC fail group were excluded from this analysis. Lanes corresponding to the same sample but generated using different library preparations were also excluded.

Sequencing and QC summaries corresponding to the datasets referenced in this chapter can be viewed in Appendices A2, A3, and A4.

Standard sequence file processing and alignment

Sequence files were processed and aligned as previously described

[77]. Briefly, QSEQ files from the Illumina CASAVA 1.7 pipeline were converted to FASTA format, duplicate reads were removed (to control for PCR bias), and the remaining reads were uniquely aligned with Bowtie to generate SAM files using the following options: -f -t -p 1 -n 3 -l 32 -k 1 -m 1 -S -y --chunkmbs 1024 --max --best [80]. Duplicate alignments (reads aligning to the same genomic position) were removed using SAMtools [81].

Standard global methylation analysis workflow

Aligned sequence files in SAM format were analyzed using a custom analysis workflow as previously described [77]. Briefly, aligned reads were extended to the average fragment length (as determined by BioAnalyzer fragment analysis) and counted in 500bp bins genome-wide. The resulting


count distribution was normalized against the total aligned reads by conversion to reads per million (RPM).
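A minimal sketch of the extension, binning and normalization steps follows. It assumes a single contig, reads described by their 5' position and strand, and that an extended read is counted once in every 500bp bin it overlaps; the counting rule used in the actual workflow [77] may differ, and all names here are hypothetical.

```python
from collections import Counter

BIN_SIZE = 500  # bp, as stated in the text


def extended_interval(five_prime, strand, fragment_length):
    """Extend a read from its 5' end to the average fragment length.

    Returns a half-open interval (start, end) in 0-based coordinates.
    """
    if strand == "+":
        return five_prime, five_prime + fragment_length
    # Minus-strand reads extend toward lower coordinates.
    return max(0, five_prime - fragment_length + 1), five_prime + 1


def bin_counts(reads, fragment_length, bin_size=BIN_SIZE):
    """Count extended reads per bin.

    Assumption: a read contributes to every bin its fragment overlaps.
    """
    counts = Counter()
    for five_prime, strand in reads:
        start, end = extended_interval(five_prime, strand, fragment_length)
        for b in range(start // bin_size, (end - 1) // bin_size + 1):
            counts[b] += 1
    return counts


def to_rpm(counts, total_aligned_reads):
    """Convert raw bin counts to reads per million aligned reads."""
    return {b: c * 1e6 / total_aligned_reads for b, c in counts.items()}
```

For example, with a 150bp fragment length, a plus-strand read starting at position 450 is extended to position 599 and therefore contributes to both the first and second bins.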

Methylation was categorized by genomic feature as follows: CpG islands (CGI, as defined in the UCSC genome browser), promoters (2kb in length, 1kb upstream and downstream of the TSS), CGI shores (200bp to 2kb distant from both ends of each CGI), and the first exon of RefSeq genes. CGI were further subdivided by proximity to promoters (within 10kb upstream or 1kb downstream of a 2kb promoter), and 2kb promoters were subdivided by overlap with CGI.
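The feature definitions above translate into simple interval arithmetic. The sketch below assumes 0-based, half-open coordinates on a single chromosome; the function names are hypothetical, and strand is ignored because the promoter window is symmetric about the TSS.

```python
def promoter_interval(tss, flank=1000):
    """2kb promoter: 1kb upstream and 1kb downstream of the TSS."""
    return max(0, tss - flank), tss + flank


def cgi_shores(cgi_start, cgi_end, inner=200, outer=2000):
    """Shores: regions 200bp to 2kb away from each end of a CGI.

    Returns (upstream_shore, downstream_shore) as half-open intervals.
    """
    upstream = (max(0, cgi_start - outer), max(0, cgi_start - inner))
    downstream = (cgi_end + inner, cgi_end + outer)
    return upstream, downstream


def overlaps(a, b):
    """True if two half-open intervals share at least one base."""
    return a[0] < b[1] and b[0] < a[1]
```

The `overlaps` helper is the kind of test that would be needed to subdivide promoters by overlap with CGI.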

Differentially methylated regions were identified by summing RPM across the bins for each locus in the genomic feature, then performing a

Wilcoxon rank sum test to assess differences in these summed RPMs between sample groups. Results were then adjusted for multiple comparisons by setting a false discovery rate (FDR) cutoff of 0.05.
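The testing procedure can be sketched in pure Python as follows. Two details are assumptions not stated in the text: the rank-sum p-value uses a normal approximation without tie or continuity correction, and the FDR adjustment uses the Benjamini-Hochberg procedure. All function names are hypothetical.

```python
from math import erfc, sqrt


def rank_sum_p(x, y):
    """Two-sided Wilcoxon rank-sum p-value (normal approximation)."""
    pooled = sorted((v, g) for g, vals in enumerate((x, y)) for v in vals)
    rank_sum_x = 0.0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2  # mean of ranks i+1 .. j (ties share it)
        rank_sum_x += avg_rank * sum(1 for k in range(i, j)
                                     if pooled[k][1] == 0)
        i = j
    n1, n2 = len(x), len(y)
    mean = n1 * (n1 + n2 + 1) / 2
    sd = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (rank_sum_x - mean) / sd
    return erfc(abs(z) / sqrt(2))


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda k: pvalues[k])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from largest p to smallest
        k = order[rank - 1]
        running_min = min(running_min, pvalues[k] * m / rank)
        adjusted[k] = running_min
    return adjusted


def call_dmrs(summed_rpm_by_locus, fdr=0.05):
    """summed_rpm_by_locus maps locus -> (group A values, group B values),
    where each value is the RPM summed across the bins of that locus."""
    loci = list(summed_rpm_by_locus)
    pvals = [rank_sum_p(*summed_rpm_by_locus[locus]) for locus in loci]
    qvals = benjamini_hochberg(pvals)
    return [locus for locus, q in zip(loci, qvals) if q < fdr]
```

With small groups an exact test would be preferable to the normal approximation; a statistics library implementation could be substituted for `rank_sum_p` without changing the rest of the sketch.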

Calculation of noise in methylation signal

Noise in methylation signal, representing extended reads falling in regions without CG dinucleotides, was quantified as the summation of reads falling into bins with zero CpG content. If a sample in a given group had multiple lanes of data, noise was computed for each lane individually and averaged among replicate lanes in the group. As a single sample could have a lane that passed QC and a lane that failed QC, the number of samples in each group does not sum to the total number of samples in the study.
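A minimal sketch of this calculation is given below. It assumes per-bin read counts and per-bin CpG counts are already available as parallel lists, and it expresses noise as a percentage of all binned reads; the function names and the choice of denominator are illustrative assumptions.

```python
def noise_percent(reads_per_bin, cpgs_per_bin):
    """Percentage of binned reads that fall in bins with zero CpGs."""
    total = sum(reads_per_bin)
    barren = sum(reads for reads, n_cpg in zip(reads_per_bin, cpgs_per_bin)
                 if n_cpg == 0)
    return 100.0 * barren / total


def sample_noise(lanes, cpgs_per_bin):
    """Average the per-lane noise across replicate lanes of one sample."""
    values = [noise_percent(lane, cpgs_per_bin) for lane in lanes]
    return sum(values) / len(values)
```

Noise is computed per lane before averaging, so a low-yield lane carries the same weight as a high-yield lane from the same sample.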

Calculation of the Global Methylation Indicator (GMI)

To assess genome-wide changes in methylation patterns for each sample in an experiment, a custom parameter termed the global methylation indicator (GMI) was calculated as previously described [77]. Briefly, normalized read counts (in RPM) were classified by CpG density and averaged to construct a methylation distribution. The average RPM were then summed across the distribution (i.e., the estimated area under the methylation distribution curve) to yield the GMI.
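
The GMI description can be sketched as follows. This is an interpretation rather than the implementation from [77]: CpG density is taken to be the CpG count of each bin, bins without reads are assumed to contribute 0 RPM, and the function name is hypothetical.

```python
from collections import defaultdict


def global_methylation_indicator(bin_rpm, bin_cpg_count):
    """Return (GMI, distribution) for one sample.

    distribution maps each CpG density class to the mean RPM of its bins;
    the GMI is the sum of those means across all classes.
    """
    totals = defaultdict(float)
    n_bins = defaultdict(int)
    for key, cpg in bin_cpg_count.items():
        totals[cpg] += bin_rpm.get(key, 0.0)  # bins without reads count as 0
        n_bins[cpg] += 1
    distribution = {cpg: totals[cpg] / n_bins[cpg] for cpg in totals}
    return sum(distribution.values()), distribution
```

Because each density class contributes its mean rather than its total, the abundant CpG-poor bins do not dominate the indicator.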

Assessment of methylated fragment enrichment using an in vitro methylated construct

Experimental procedure

The 5.3 kb plasmid vector pIRES2-EGFP, which contains three CpG islands, was used to assess methylated fragment enrichment. The construct was linearized with NheI and then in vitro methylated with M.SssI. The methylated "spiked-in" DNA was quantified by the Qubit High Sensitivity Assay and diluted. Plasmid was spiked into genomic DNA at a concentration of 1.5 pg plasmid / 1 µg genomic DNA (~2.5 plasmid copies per cell) prior to sonication of genomic DNA for library generation.

Analysis

Reads mapping to the construct were identified by converting QSEQ files to FASTA format as described above, then aligning the files with Bowtie using the following options: -q -t -p 1 -n 3 -l 32 -k 1 -S --chunkmbs 1024 --max --best. Duplicate reads were retained for this analysis. To control for variation in construct aligned read counts attributable to fluctuations in lane yield, construct aligned read counts were normalized against the total raw read counts by conversion to reads per million (RPM).

IV. Conclusions

This study shows that post-sequencing QC metrics can exclude poor quality samples from analysis, decreasing noise in methylation signal and improving power to detect DMRs. Furthermore, we show that resequenced lanes from the same library correlate very well, and that additional lanes of data have a small impact on saturation (data reproducibility) and a large impact on 5X CpG coverage (confidence in methylation calls at a given locus).

Finally, we demonstrate that our computational indicator of global methylation correlates with an unrelated method that utilizes spike-in of DNA with known methylation status. These findings verify that MethylCap-seq, with appropriate quality control, is a reliable tool that provides reproducible relative methylation information on a feature by feature basis, provides information about levels of global methylation, and can be used to analyze large cohorts of hundreds of patients.

Chapter 4. Identification of endometrial cancer methylation features using a combined methylation analysis method

I. Introduction

During carcinogenesis, cells within solid tumors acquire numerous aberrations, including mutations that alter the coding sequence of genes, as well as changes in gene expression. Changes in gene expression may be mediated in many ways, including altered transcription factor levels or function, mutations in DNA binding elements, miRNAs, and chromatin remodeling. Chromatin remodeling, including epigenetic modifications to histones and DNA methylation, normally plays a key role in cell differentiation, stably switching cellular pathways on or off until the cells reach a terminally differentiated state that is typically irreversible. Epigenetic aberrancies can allow cancer cells to silence tumor suppressors and re-express oncogenes, giving tumor cells an additional option besides mutation to dysregulate key pathways [82].

DNA methylation is one of the better understood mechanisms of epigenetic control. DNA methylation in humans is mediated by the DNA methyltransferases DNMT1 and DNMT3, which add a methyl group to the carbon-5 position of cytosine. In differentiated cells, DNA methylation occurs in the context of cytosine followed by guanine (CpG) in the DNA sequence. DNA methylation in promoter CpG islands (CGI) has been shown to mediate stable gene silencing. Tumor suppressor silencing via DNA methylation is found in a wide range of tumor types [69]. While methylation marks are erased by standard PCR, several methods have been developed to preserve this information, including affinity-based capture of methylated regions and bisulfite conversion [21]. DNA methylation is increasingly being recognized as a potential biomarker [70].

The CpG island methylator phenotype (CIMP) is a cancer-specific accumulation of DNA methylation in the CpG islands of some tumors. Originally identified more than 15 years ago in [83], CIMP has since been identified in multiple cancer types: [84], breast cancer [85], acute myeloid leukemia [86], gastric cancer [87], clear cell renal cell carcinoma [88], oral squamous cell carcinoma [89], and hindbrain ependymomas [90]. CIMP may occur early in tumorigenesis: CIMP can be detected in colorectal serrated adenomas prior to malignant progression and the development of microsatellite instability [91]. Early diagnosis presents an opportunity for early intervention and improved outcomes, including lower treatment morbidity and lower recurrence rate. CIMP classification may have prognostic value: CIMP is associated with good prognosis in some cancer types (e.g., colorectal, breast) and poor prognosis in others (e.g., renal cell carcinoma) [72]. Accurate differentiation of relatively benign and aggressive cancer subtypes allows benign subtypes to be treated less aggressively and aggressive subtypes to be treated more aggressively. CIMP could also represent a therapeutic target for demethylating therapies [92]. Despite its potential diagnostic, prognostic and therapeutic value, CIMP and its manifestations in different cancer types remain poorly understood.

Defining CIMP in a new cancer type requires extensive methylome profiling. However, methylome profiling studies examining CIMP genome-wide in endometrial cancer cohorts remain sparse. Several methods have emerged to profile the methylome, including the Infinium beadchip [73] and affinity-based methylation capture followed by shotgun sequencing (e.g., MethylCap-seq [77]). The Infinium beadchip is frequently used in clinical studies for its cost-effectiveness, scalability for large cohorts, high accuracy, and user-friendly analysis pipeline. The method relies on hybridization of bisulfite-converted DNA to the beadchip, followed by single-base extension. The end result is a readout of percent methylation for individual CpGs, with ~7 CpGs assessed per promoter CGI using the HumanMethylation 450 kit, or roughly 8% of the CpGs in promoter CGI. Methylation of nearby CpGs is not measured, but is assumed to be similar. MethylCap-seq is one of several affinity-based capture methods that leverage shotgun sequencing to assess methylation patterns. MethylCap-seq uses the MBD2 protein to capture methylated fragments, which are then sequenced to yield piles of methylation tags across the genome. By comparing tag frequency between samples, relative methylation levels can be inferred for a given region. As sequencing costs continue to fall, MethylCap-seq and similar methods will become increasingly cost-effective. For analysis of promoter CGI, MethylCap-seq has a particular advantage over Infinium: average methylation over the regions is measured, rather than assumed.

In this study, we developed a 13-region signature that stratifies endometrioid endometrial tumors by CpG island methylation status and distinguishes tumors from both normal control and adjacent normal tissue. This signature is based on a training set of MethylCap-seq data and validated for use on Infinium datasets. We also demonstrate a general method for identifying methylator phenotypes based on total promoter CGI methylation. This signature could prove useful for detecting and classifying endometrioid endometrial carcinomas, as well as catalyzing research into the role of dysregulated methylation in this disease.

II. Results

Characterizing a CpG island methylator phenotype

Methylome data were analyzed from a previously published MethylCap-seq dataset of 76 endometrioid endometrial primary carcinomas and 12 non-matched normal control samples (hereafter referred to as the discovery set) [93]. To assess the extent of CpG island (CGI) methylation in tumors relative to normal controls, we compared overall methylation in CGIs as well as in other genomic features known to acquire methylation during carcinogenesis [37,52] (Figure 5A). Tumors overall showed a nearly 2-fold increase in methylation of promoter CGI, with smaller gains in CGI shores. These increases correlated with methylation gains in promoters containing CGI (but not promoters without CGI), as well as gains in first exons. To examine the variability of methylation gains between different tumors, promoter CGI methylation was plotted for each tumor and compared to normal controls (Figure 5B). Tumors displayed a spectrum of methylation gains ranging from slightly below the levels of normal controls to 5-fold higher.

[Figure 5 image. Panel A: fold change in RPM (y-axis, 0.00 to 3.00) by genomic feature for Normal and Malignant samples, with asterisks marking significant features. Panel B, "Defining the CIMP Training Set": promoter CGI methylation in RPM (y-axis, 0 to 30000) for Normal (n=12) and Malignant (n=76) samples, with the CGI-H (n=5) and CGI-0 (n=8) groups marked.]

Figure 5. Endometrioid endometrial malignancies show increased methylation in promoter CGI compared to unmatched normal controls.

(A) MethylCap-seq normalized signal in the discovery set was compared between malignancies (n=76) and normals (n=12) and plotted across autosomes for several genomic features: CpG islands (CGI), promoters (1 kb upstream and downstream of the TSS), CGI shores (200 bp to 2 kb distant from both ends of each CGI), and the first exon of RefSeq genes. CGI were further subdivided by proximity to promoters (within 10 kb upstream or 1 kb downstream of a 2 kb promoter), and 2 kb promoters were subdivided by overlap with CGI. Bars denote mean fold change relative to normal controls; error bars depict 25th and 75th percentiles. Asterisks denote Bonferroni-adjusted Wilcoxon rank sum test p<0.05.

(B) Most endometrioid endometrial malignancies show increased promoter CGI methylation compared to normal controls, with a subset of highly methylated tumors (CGI-H) showing over three-fold more methylation compared to tumors with similar methylation to normal controls (CGI-0). To identify where methylation changes occur in the most methylated of tumors, thresholds were drawn at the upper and lower extremes of the tumor methylation spectrum (dotted lines).

To rule out the contribution of an enrichment bias, which would produce differences in methylation signal based on the efficiency of methylated fragment enrichment rather than biological methylation, we compared mitochondrial methylation amongst the 76 tumors as a negative control. Mitochondrial and nuclear methylation utilize different cellular machinery and occur in different cellular compartments [94,95], thus a positive correlation between mitochondrial methylation and CGI methylation would indicate a potential enrichment bias. No significant correlation was seen between mitochondrial methylation and promoter CGI methylation (Spearman r=-0.15, p=0.2).
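
The negative-control comparison rests on a rank correlation between two per-tumor measurements. The sketch below shows how a Spearman coefficient can be computed as the Pearson correlation of average ranks; the function names are hypothetical, and the calculation of the associated p-value is omitted.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))
```

Here `x` would hold mitochondrial methylation and `y` promoter CGI methylation for the same tumors, in the same order. A coefficient near zero, as reported above, argues against a shared enrichment artifact.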

To define the changes in methylation underlying a putative CpG island methylator phenotype, we examined regional methylation differences between tumors with the highest (CGI-H) and lowest (CGI-0) levels of promoter CpG island methylation (Figure 5B, dashed lines). Consistent with the premise of a CpG island methylator phenotype, methylation differences were almost exclusively unidirectional (4672 hypermethylated vs. 17 hypomethylated), and encompassed a large fraction of all promoter CGI (29%) (Figure 6B). To focus on methylation changes that might occur during carcinogenesis, the identified regions were further overlapped with methylation differences observed in tumors vs. normal samples (Figure 6A). A total of 2269 promoter CGI were shared between the two comparisons (Figure 6C). Of the CGI-H v. CGI-0 hypermethylated loci, 49% were also hypermethylated in tumors compared to normal tissue, which was twice the expected rate of 23% (chi-squared p-value < 0.001), suggesting that many of these loci represent "hotspots" in the genome that are particularly vulnerable to aberrant methylation in endometrial tissue. Pathway analysis of these shared promoter CGI showed enrichment for known targets of epigenetic regulation, including targets of the Polycomb Repressive Complex and regions known to be methylated in other cancers (Table 2).
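
The observed and expected overlap rates can be recomputed from the counts reported in Figure 6. The form of the chi-squared test is not stated in the text; the sketch below assumes a one-degree-of-freedom goodness-of-fit test against the background hypermethylation rate, and the function name is hypothetical.

```python
from math import erfc, sqrt

# Counts reported in Figure 6.
hyper_tumor_vs_normal = 3730
tested_tumor_vs_normal = 3730 + 95 + 12349  # hyper + hypo + unchanged
hyper_h_vs_0 = 4672
shared = 2269

expected_rate = hyper_tumor_vs_normal / tested_tumor_vs_normal
observed_rate = shared / hyper_h_vs_0


def chi_squared_gof(observed_hits, n, p_expected):
    """Goodness-of-fit test of observed_hits out of n against p_expected."""
    expected_hits = n * p_expected
    expected_miss = n - expected_hits
    observed_miss = n - observed_hits
    stat = ((observed_hits - expected_hits) ** 2 / expected_hits
            + (observed_miss - expected_miss) ** 2 / expected_miss)
    p_value = erfc(sqrt(stat / 2))  # chi-squared upper tail, 1 d.f.
    return stat, p_value


stat, p_value = chi_squared_gof(shared, hyper_h_vs_0, expected_rate)
```

Under these assumptions the rates round to 0.23 expected and 0.49 observed, and the p-value falls far below 0.001, in line with the reported result.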

[Figure 6 image. Panel A, "Comparison of methylation differences between tumors and normals", Malignant (n=76) v. Normal (n=12): 3730 hypermethylated, 95 hypomethylated, 12349 unchanged. Panel B, "Comparison of methylation differences between hypermethylator and non-hypermethylator tumors", CGI-H (n=5) v. CGI-0 (n=8): 4672 hypermethylated, 17 hypomethylated, 11346 unchanged. Panel C, "Comparison of regions methylated in hypermethylator tumors and frequently methylated in endometrioid endometrial cancer": of 3730 loci hypermethylated in Malignant v. Normal and 4672 in CGI-H v. CGI-0, 2269 are shared, 1461 are unique to Malignant v. Normal, and 2403 are unique to CGI-H v. CGI-0.]

Figure 6. Loci methylated in CGI-H tumors account for many of the cancer-associated methylation gains.

(A) Endometrioid endometrial tumors show increased methylation of over 20% of promoter CGI compared to unmatched normal controls. Differentially methylated promoter CGI were quantified in the discovery set. Depicted is the proportion of loci that were hypermethylated (Hyper), hypomethylated (Hypo), or unchanged in malignancies relative to normals.

(B) CGI-H tumors gain methylation at nearly 30% of promoter CGI compared to tumors with methylation similar to normal controls (CGI-0).

(C) Differentially methylated regions between Malignant v. Normal and CGI-H v. CGI-0 show considerable overlap. Hypermethylated promoter CGI from (A) and (B) were intersected to identify cancer-specific hypermethylated regions.

Table 2. Term enrichment associated with hypermethylated promoter CGI (MSigDB Perturbation)

Input: hypermethylated promoter CGI, overlap comparison (the 2269 promoter CGI shared between Malignant v. Normal and CGI-H v. CGI-0)

Term name                                                     Hyper FDR Q-val   Hyper fold enrichment   Hyper foreground region hits   Hyper total regions
Genes identified as targets of the Polycomb protein SUZ12     2.76 x 10^-19     2.5                     126                            1669
Genes possessing promoter trimethylated H3K27 ()              2.2 x 10^-19      2.4                     134                            1854
Polycomb Repression Complex 2 (PRC) targets                   2.77 x 10^-18     2.8                     97                             1127
Genes identified as targets of the Polycomb protein EED       4.18 x 10^-18     2.4                     124                            1705
Genes with hypermethylated DNA in lung cancer samples         9.14 x 10^-6      2.5                     44                             574
Genes de novo DNA methylated in cancer                        0.000399          3.7                     19                             169

Methylation signature construction and technical validation

To identify a methylation signature that could be translated to other platforms (including Infinium and assays of individual regions), 16 candidates were selected from the 2269 promoter CGI from Figure 6C (Table 3). The 2269 promoter CGI were sorted by Kruskal-Wallis p-value and fold difference for the CGI-H vs. CGI-0 comparison; candidates that overlapped regions associated with copy number amplification in TCGA endometrioid samples were excluded (8 of the top 50). In addition, a threshold Student's t-test p-value of p<0.01 (not corrected for multiple comparisons) was imposed to exclude candidates with large fold differences that appeared to be driven by outlier samples (5 of the top 21). Methylation of 15/16 candidates robustly distinguished the CGI-H and CGI-0 tumors; TMEM115, however, showed hypermethylation in only 3 of 6 CGI-H tumors (Figure 7A).

Table 3. Promoter CGI that distinguish CGI-H from CGI-0 tumors as measured by MethylCap-seq

Gene symbol   Chromosomal location (hg19)   CpG count   Fold change RPM
TMEM115       chr3:50402103-50402942        66          18.9
TBX18         chr6:85472702-85474132        129         18.9
NODAL         chr10:72200065-72201368       106         16.9
SVEP1         chr9:113341213-113342029      99          16.2
TNFSF11       chr13:43148277-43149282       83          16
OR10H2        chr19:15833733-15833983       23          14.5
KDM2B         chr12:122016170-122017693     125         14.4
FGF12         chr3:192125818-192127991      176         14
APCDD1L       chr20:57089460-57090237       71          13.6
EPHX3         chr19:15344091-15344419       33          13
ASCL1         chr12:103351579-103352695     105         13
EXOC3L4       chr14:103557606-103558235     63          12.6
SMOC2         chr6:168841818-168843100      125         12.5
B4GALNT1      chr12:58025661-58027056       124         12.2
GRM8          chr7:126891300-126894205      234         12.1
VILL          chr3:38035701-38036000        29          11.7

[Figure 7 image. Panel A (MethylCap-seq) and panel B (Infinium): heatmaps of Log2 norm index (scale -3 to 3) with tumors grouped as CGI-0 and CGI-H and candidate CGI labeled as Signature or Discarded. Panel C: signature average beta-value (y-axis, 0.0 to 1.0) for CGI-0 (n=5) and CGI-H (n=6) tumors, p=0.007.]

Figure 7. Methylation of 13 promoter-associated CGI distinguishes tumors with high promoter CGI methylation (CGI-H) from those with baseline promoter CGI methylation (CGI-0). 11 tumors from the discovery set of 76 endometrioid endometrial tumors were chosen for technical validation via Infinium and indexed using a signature of 13 promoter-associated CGI. Two candidate regions that showed <0.1 difference in beta-value and p>0.05 between groups in the technical validation Infinium dataset were discarded. An additional region that showed a negative difference in beta-value was also discarded.

(A) Methylation was compared between CGI-H (n=6) and CGI-0 tumors (n=5) across 16 promoter-associated CGI signature candidates using MethylCap-seq. Relative methylation was compared between regions by normalizing to the region average, then applying a log2 transformation.

(B) Methylation differences seen in (A) were technically validated using the Infinium HumanMethylation 450 platform. Tumors were indexed using the average beta-value of all probes in each region (total of 88 probes), and relative methylation was compared as in (A). Regions (columns) were sorted in ascending order by row product for visual representation.

(C) Methylation score, an average of the beta-values of the 13 validated promoter CGI, was plotted for each tumor. Methylation score was compared between CGI-H and CGI-0 tumors using Student's t-test. P-value for the 16-region score was 0.01 (data not shown).

To verify differential methylation, the 16 signature candidates were technically validated in 6 CGI-H tumors (4 of the 5 from the original definition and 2 additional from the upper fringe) and 5 of the original 8 CGI-0 tumors using the Infinium HumanMethylation450 beadchip (Table 4 and Table 5). One CGI-H tumor and three CGI-0 tumors from the original definition were not included in this technical validation due to limited availability of sample DNA. Thirteen of the 16 candidates showed differential methylation between CGI-H and CGI-0 tumors according to the following criteria: beta-value difference of greater than +0.1 (CGI-H minus CGI-0) or Student's t-test p<0.1. Exclusion criteria were kept loose to avoid overfitting. The 13 validated signature regions comprise a total of 88 Infinium HumanMethylation450 probes located within the respective CGI, with a median of 6 probes per region and a range of 2 to 14 (Table 4).
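
The inclusion criteria can be checked against the values in Tables 4 and 5 with a short sketch. The rule encoded here is an interpretation: a candidate passes if its beta-value difference exceeds 0.1 or its p-value is below 0.1, with the added assumption (taken from the Figure 7 legend) that a negative difference disqualifies a region regardless of p-value. The function and variable names are hypothetical.

```python
from statistics import median

# (gene, probes, delta beta-value, p-value), transcribed from Tables 4 and 5.
CANDIDATES = [
    ("SVEP1", 4, 0.34, 0.012), ("FGF12", 14, 0.323, 0.02),
    ("NODAL", 6, 0.313, 0.025), ("TNFSF11", 10, 0.31, 0.011),
    ("TBX18", 10, 0.268, 0.061), ("OR10H2", 3, 0.265, 0.047),
    ("VILL", 2, 0.251, 0.144), ("ASCL1", 9, 0.203, 0.129),
    ("EPHX3", 3, 0.172, 0.002), ("GRM8", 11, 0.152, 0.058),
    ("TMEM115", 3, 0.12, 0.118), ("APCDD1L", 9, 0.105, 0.007),
    ("EXOC3L4", 4, 0.038, 0.082), ("B4GALNT1", 11, 0.003, 0.32),
    ("KDM2B", 5, -0.01, 0.183), ("SMOC2", 7, -0.029, 0.077),
]


def passes(delta_beta, p_value, min_delta=0.1, max_p=0.1):
    """Apply the validation rule; negative differences never pass (assumption)."""
    if delta_beta <= 0:
        return False
    return delta_beta > min_delta or p_value < max_p


validated = [c for c in CANDIDATES if passes(c[2], c[3])]
probe_counts = [c[1] for c in validated]
```

Under this reading, 13 regions pass, and their probe counts total 88 with a median of 6, matching the figures given above. SMOC2 is the case that requires the direction assumption, since its p-value alone would satisfy the second criterion.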

Table 4. Promoter CGI included in the final signature after validation by Infinium

Gene symbol   Chromosomal region (hg19)     # probes   Δ beta-value (1)   p-value (2)
SVEP1         chr9:113341213-113342029      4          0.34               0.012
FGF12         chr3:192125818-192127991      14         0.323              0.02
NODAL         chr10:72200065-72201368       6          0.313              0.025
TNFSF11       chr13:43148277-43149282       10         0.31               0.011
TBX18         chr6:85472702-85474132        10         0.268              0.061
OR10H2        chr19:15833733-15833983       3          0.265              0.047
VILL          chr3:38035701-38036000        2          0.251              0.144
ASCL1         chr12:103351579-103352695     9          0.203              0.129
EPHX3         chr19:15344091-15344419       3          0.172              0.002
GRM8          chr7:126891300-126894205      11         0.152              0.058
TMEM115       chr3:50402103-50402942        3          0.12               0.118
APCDD1L       chr20:57089460-57090237       9          0.105              0.007
EXOC3L4       chr14:103557606-103558235     4          0.038              0.082

(1) Measured as the average beta-value for CGI-H minus CGI-0 tumors
(2) Calculated using Student's t-test

Table 5. Promoter CGI discarded from the final signature after validation by Infinium

Gene symbol   Chromosomal region (hg19)     # probes   Δ beta-value (1)   p-value (2)
B4GALNT1      chr12:58025661-58027056       11         0.003              0.32
KDM2B         chr12:122016170-122017693     5          -0.01              0.183
SMOC2         chr6:168841818-168843100      7          -0.029             0.077

(1) Measured as the average beta-value for CGI-H minus CGI-0 tumors
(2) Calculated using Student's t-test

The validated signature regions robustly distinguished 5/6 of the CGI-H tumors from the CGI-0 tumors (Figure 7B). The aggregate signature composed of these 13 promoter CGI likewise distinguished CGI-H from CGI-0 tumors (mean average beta-value of 0.47 vs. 0.26, Student's t-test p<0.05) (Figure 7C).

Methylation signature stratifies endometrioid endometrial tumors by methylation phenotype and distinguishes tumors from normal controls

To test the ability of the signature to stratify tumors by methylation phenotype in an independent cohort, methylation profiles for 203 endometrioid endometrial carcinomas from The Cancer Genome Atlas (TCGA) were examined (hereafter referred to as the test set). These methylation profiles were generated using the Infinium HumanMethylation450 beadchip, and represent all endometrioid tumors with methylation data on this platform from the previously published 373 endometrial tumor cohort [75]. The test set was considerably larger than the discovery set and representative of the clinical presentation of the disease, making it well-suited for follow-up analyses.

The signature was originally generated by comparing groups of tumors that showed the largest differences in overall promoter CGI methylation. To assess whether the signature methylation score correlated with overall promoter CGI methylation in this independent dataset, the two parameters were compared with a scatter plot (Figure 8A). Signature methylation score showed a strong linear correlation with overall promoter CGI methylation (Pearson r=0.83, p<0.001), suggesting that the methylation signature can be used to estimate overall promoter CGI methylation in endometrioid endometrial cancer. This result highlights that endometrial CIMP is a genome-wide phenomenon rather than isolated to a specific set of loci.

[Figure 8 image. Panel A: scatter plot of signature methylation (average beta-value) against promoter CGI methylation (average beta-value), Pearson r=0.83, p<0.001, fit line y=4.6741x-0.4908. Panel B: signature average beta-value by methylation cluster, MC1 (n=56), MC2 (n=66), MC3 (n=25), MC4 (n=56). Panel C: signature average beta-value for unmatched normal (n=11), tumor (n=203), matched normal (n=13), and matched tumor (n=13) samples; annotations p<0.001 and p<0.001. Panel D: ROC curve of sensitivity against 1 - specificity for methylation score, A = 0.99, p<0.001.]

Figure 8. Methylation signature stratifies endometrioid endometrial tumors by methylation phenotype and distinguishes tumors from normal controls.

(A) Signature methylation score shows a strong linear relationship with overall promoter CGI methylation. Infinium methylation data were analyzed for 203 endometrioid endometrial tumors from The Cancer Genome Atlas. Average methylation for the 13 signature promoter CGI from Figure 7 was compared against average methylation for all promoter CGI.

(B) Methylation signature stratifies endometrioid endometrial tumors by methylation phenotype. Tumors from (A) were classified into methylation clusters as previously published (Nature 497, 67–73 (02 May 2013)), and their signature average beta-values plotted as a standard box plot. A Kruskal-Wallis test was performed with Dunn's post-hoc on all pairwise comparisons (p<0.05 for all comparisons except MC3 vs. MC4). Whiskers denote 10th and 90th percentiles.

(C) Methylation signature distinguishes tumors from normal controls. Methylation score was plotted for the tumors in (A) alongside 11 unmatched normal controls, as well as for tissue from matched tumor and adjacent normal samples, all from The Cancer Genome Atlas. A Wilcoxon rank sum test was used to compare unmatched normals and tumors, while a paired Student's t-test was used to compare matched normals and matched tumors.

(D) Methylation signature distinguishes tumors from normal controls with high sensitivity and specificity. Matched and unmatched normals from (C) were pooled, and an ROC curve was generated. Sensitivity represents the true positive rate for tumors at a given signature score threshold (% tumors correctly categorized as tumors), while 1 - specificity represents the false positive rate (% normal controls incorrectly categorized as tumors).

Methylation clusters were previously identified in the test set: one very highly methylated (MC1), one highly methylated (MC2), one with intermediate methylation (MC4), and one with methylation similar to normal (MC3). Tumors in each cluster were scored using the 13-region signature and plotted (Figure 8B). Signature score distinguished all methylation clusters except MC3 and MC4. Mean signature score for MC1 was similar to the score for the CGI-H group in the discovery set (0.45 vs. 0.48, Student's t-test p=0.44).

Methylation score of tumors was compared to matched and unmatched normal control tissue (Figure 8C). Of the endometrioid endometrial tumors, 95% (192/203) showed a higher methylation score than unmatched normal controls (n=11), suggesting that the 13-region signature could also be useful for distinguishing tumors from normal tissue. Likewise, >90% of matched tumors (12/13) displayed a higher methylation score than adjacent normal tissue, representing a three-fold increase on average. The methylation signature showed a sensitivity of 0.95 +/- 0.03 and a specificity of 0.93 +/- 0.07 (95% confidence interval) for distinguishing tumors from normal tissue at a methylation score threshold of 0.16 (Figure 8D). Since previous studies have shown that cancer-specific methylated sequences could be used to identify circulating tumor DNA in blood plasma [96], a cancer-specific methylation signature could have broad applications from early diagnosis to measuring response to therapy.
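
The threshold-based metrics can be sketched as follows. The method used to obtain the confidence intervals is not stated in the text; the sketch assumes a normal-approximation (Wald) interval, and the function names are hypothetical. The area under the ROC curve is computed as the probability that a randomly chosen tumor scores above a randomly chosen normal sample.

```python
from math import sqrt


def sensitivity_specificity(tumor_scores, normal_scores, threshold):
    """Classify scores above the threshold as tumor; return (sens, spec)."""
    true_pos = sum(s > threshold for s in tumor_scores)
    true_neg = sum(s <= threshold for s in normal_scores)
    return true_pos / len(tumor_scores), true_neg / len(normal_scores)


def wald_half_width(p, n, z=1.96):
    """Half-width of a 95% normal-approximation interval for a proportion."""
    return z * sqrt(p * (1 - p) / n)


def roc_auc(tumor_scores, normal_scores):
    """Area under the ROC curve via pairwise comparison (ties count 0.5)."""
    wins = 0.0
    for t in tumor_scores:
        for s in normal_scores:
            wins += 1.0 if t > s else 0.5 if t == s else 0.0
    return wins / (len(tumor_scores) * len(normal_scores))
```

With 192 of 203 tumors above the threshold, `wald_half_width(192 / 203, 203)` comes to about 0.03, consistent with the interval reported for sensitivity.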

The stratification potential of total promoter CGI methylation was also assessed (Figure 9). As measured by Infinium, aggregate promoter CGI methylation mirrored results seen with the 13-region signature, but with diminished separation. Aggregate promoter CGI methylation distinguished tumors and normal controls relatively poorly (A = 0.90, Figure 9) compared to the 13-region methylation signature (A = 0.99, Figure 8D). As signature candidates were screened in part based on their ability to differentiate tumors and normal tissue (Figure 6C), it is not surprising that the 13-region signature would better differentiate tumors and normal tissue compared to background (all promoter CGI).

[Figure 9 image. Panel A: promoter CGI average beta-value (y-axis, 0.10 to 0.28) by methylation cluster, MC1 (n=56), MC2 (n=66), MC3 (n=25), MC4 (n=56). Panel B: promoter CGI average beta-value for unmatched normal (n=11), tumor (n=203), matched normal (n=13), and matched tumor (n=13) samples; annotations p<0.001 and p=0.002. Panel C: ROC curve of sensitivity against 1 - specificity for aggregate promoter CGI methylation, A = 0.90, p<0.001.]

Figure 9. Aggregate promoter CGI methylation shows weaker stratification potential compared to the 13-region methylation signature.

(A) Aggregate promoter CGI methylation correlates with previously published methylation clusters. Infinium methylation data were analyzed for 203 endometrioid endometrial tumors from The Cancer Genome Atlas, and probes lying in promoter-associated CpG islands were averaged to estimate aggregate promoter CGI methylation. Tumors were classified into methylation clusters as previously published [75], and average beta-values plotted as a standard box plot. A Kruskal-Wallis test was performed with Dunn's post-hoc on all pairwise comparisons (p<0.05 for all comparisons except MC3 vs. MC4). Whiskers denote 10th and 90th percentiles.

(B) Aggregate promoter CGI methylation distinguishes tumors from normal controls. Methylation score was plotted for the tumors in (A) alongside 11 unmatched normal controls, as well as for tissue from matched tumor and adjacent normal samples, all from The Cancer Genome Atlas. A Wilcoxon rank sum test was used to compare unmatched normals and tumors, while a paired Student's t-test was used to compare matched normals and matched tumors.

(C) Aggregate promoter CGI methylation distinguishes tumors from normal controls, albeit with reduced sensitivity and specificity compared to the 13-region methylation signature. Matched and unmatched normals from (B) were pooled, and an ROC curve was generated. Sensitivity represents the true positive rate for tumors at a given signature score threshold (% tumors correctly categorized as tumors), while 1 - specificity represents the false positive rate (% normal controls incorrectly categorized as tumors). Distinguishing power, represented by area under the curve, was lower for aggregate promoter CGI methylation (A = 0.90) compared to the 13-region methylation signature (A = 0.99, see Figure 8D).

As each promoter CGI in the signature was associated with a gene, we assessed whether increasing promoter methylation impacted mRNA expression in test set tumors. Methylation of promoter CGI is often but not always associated with decreased gene expression. 7/13 genes showed a significant association between increased methylation and decreased gene expression (Table 6). EPHX3 expression notably decreased with increasing promoter methylation (Figure 10).

Table 6. Correlation between promoter CGI methylation and gene expression in TCGA tumors

Gene symbol   Upper quartile TPM (1)   Spearman r   p-value
EPHX3         47.2                     -0.601       2E-07
TBX18         448.0                    -0.494       2E-07
TNFSF11       30.1                     -0.379       3.48E-07
VILL          275.8                    -0.37        6.74E-07
FGF12         21.1                     -0.365       9.75E-07
ASCL1         6.1                      -0.243       0.00138
APCDD1L       18.9                     -0.238       0.00172
SVEP1         397.8                    -0.118       0.122
NODAL         10.5                     -0.113       0.141
GRM8          38.1                     -0.108       0.159
EXOC3L4       23.3                     -0.0245      0.75
TMEM115       1701.9                   0.0192       0.802
OR10H2        0.8                      0.164        0.0318

(1) Transcripts per million, as calculated by RSEM

[Figure 10 image. Scatter plot of mRNA abundance in TPM (y-axis, log2 scale from 0.5 to 2048) against methylation beta-value (x-axis, 0 to 1).]

Figure 10. Methylation of EPHX3 is associated with decreased gene expression. RNA expression vs. promoter CGI methylation of EPHX3 was plotted for 172 endometrioid endometrial tumors from The Cancer Genome Atlas. A linear fit line (r2=0.4) depicts the inverse relationship between RNA expression and methylation, corresponding to a Spearman correlation coefficient of r=-0.60 and p<0.001. TPM indicates transcripts per million, as calculated by RSEM. Methylation beta-value represents the average methylation of all Infinium probes within the CGI.

High methylation score is associated with mismatch repair deficiency, high mutation rate, and low somatic copy number alteration

In colorectal cancer, CIMP is frequently associated with defects in mismatch repair [91], which manifest as microsatellite instability (MSI); in endometrial cancer, methylation-mediated silencing of the mismatch repair gene MLH1 accounts for ~80% of sporadic mismatch repair defects [97]. Silencing of MLH1 leads to an accumulation of mutations as well as abnormal expansion and contraction of microsatellite repeats. Methylation score was compared across integrated clusters previously reported for the test set (Figure 11A), which were derived by integrating microsatellite instability (MSI) status, mutation frequency, and somatic copy number alteration (SCNA) cluster data. MSI (Hyper-mutated) and Copy-number high (Serous-like) clusters showed significant differences in methylation score (median 0.39 vs. 0.25, ANOVA with Holm-Sidak post-hoc p<0.001). To elucidate specific correlates, we compared methylation score across the single parameter clusters used to generate the integrated clusters (Figure 11, B-D). Methylation score correlated with MSI status (MSI+ vs. MSI-, median 0.40 vs. 0.27, Wilcoxon rank sum p<0.001) and mutation frequency (High vs. Low clusters, mean 0.38 vs. 0.28, ANOVA with Holm-Sidak post-hoc p<0.001). Methylation score also distinguished the low SCNA clusters 2 and 3 from the high SCNA cluster 4 (median 0.33 and 0.39 vs. 0.24, Kruskal-Wallis with Bonferroni-corrected Student's t-test post-hoc p<0.01), as well as low SCNA cluster 3 from very low SCNA cluster 1 (median 0.39 vs. 0.28). As promoter methylation of the mismatch repair gene MLH1 is implicated in ~80% of MSI+ sporadic endometrioid endometrial tumors [97], the strong correlation of methylation score with MSI and high mutation rate was expected and shows that the signature is capable of reproducing known CIMP correlates. As previously reported, POLE mutation dramatically increases mutation rate via a mismatch repair independent mechanism [75]; the "Highest" cluster in Figure 11C, predominantly composed of tumors with POLE mutations, did not correlate with methylation score. The association of high methylation score with low somatic copy number alteration suggests that methylation accumulation and pervasive copy number alteration represent distinct pathways of tumorigenesis with different molecular drivers.

[Figure 11 plot: four panels (A-D), each showing signature average beta-value (y-axis, 0.0 to 1.0) by group. (A) Integrated clusters: MSI (Hyper-mutated), POLE (Ultra-mutated), Copy-number high (Serous-like), Copy-number low (Endometrioid). (B) MSS (n=122) vs. MSI (n=80). (C) Mutation rate cluster: Low, High, Highest. (D) Copy number cluster: CN1, CN2, CN3, CN4.]

Figure 11. High methylation score is associated with mismatch repair deficiency, high mutation rate, and low somatic copy number alteration. Methylation score was compared among published clusters for 203 endometrioid endometrial tumors in The Cancer Genome Atlas. Parameters considered included (A) the published integrated clusters factoring in microsatellite instability (MSI) status, mutation frequency, and somatic copy number alteration frequency; (B) microsatellite instability (MSI) status; (C) mutation rate cluster; (D) copy number cluster. Statistical comparisons were performed using ANOVA with Holm-Sidak post-hoc (A, C), Wilcoxon rank-sum (B), and Kruskal-Wallis with Bonferroni-corrected t-test or Wilcoxon post-hoc as appropriate (D). Significant differences were reported if differences persisted across 2 additional independently-generated replicate signatures at a threshold of p<0.01.

Methylation score was further compared across other published cluster data for the test set as well as known clinical and molecular covariates of endometrial cancer. Methylation score showed no significant relationship (threshold of p<0.01) with mRNA or miRNA expression clusters, stage, grade, relapse-free survival, BMI, or age at diagnosis (data not shown).

Methodological validity

The reproducibility of the methodology used to generate the methylation signature was verified by generating two additional replicate signatures, representing independent sets of 13 promoter CGI selected using a similar methodology. As shown in Figure 12, the replicate signatures ranked tumors in the test set similarly to the original signature, whereas a negative control signature did not (r=0.82 and 0.89 for replicates R1 and R2 vs. original signature O, p<0.001; r=0.144 for negative control vs. original signature, p>0.01), demonstrating that the methodology underlying our methylation score is highly robust. Correlates of methylation score (MSI, mutation rate, copy number alteration) were subsequently validated using the replicate signatures; the reported correlates were significant across all three signatures unless otherwise noted.

[Figure 12 heat map: signatures O, R1, R2, and NC plotted against samples; color scale shows log2 normalized index from -2 to 2.]

Figure 12. Replicate 13-region methylation signatures rank tumors similarly. CGI-H and CGI-0 tumors were compared from the MethylCap-seq discovery set of 76 endometrioid endometrial tumors, and signatures were generated from the top differentially methylated promoter CpG islands (O: original 13-region signature, R1: replicate 1, R2: replicate 2, NC: negative control). Candidates that showed <0.1 difference in beta-value between groups in the Infinium technical validation set were discarded. 203 endometrioid endometrial tumors from The Cancer Genome Atlas were indexed using the average beta-value of all regions in the signature, and relative index values between replicates were compared by plotting as a normalized log2 transformed heat map. Samples were ranked by the original signature index for visual comparison. Statistical comparison of rank correlation vs. the original signature was performed using a Spearman test (r=0.82, 0.89 for replicates and p<0.001; r=0.14 for NC and p>0.01).

A signature for CIMP

Delineating a threshold is useful for classification purposes, yet can be challenging when the data resemble a continuum rather than being clearly modal. The agreement of signature-based methylation score with published methylation clusters was discussed earlier. A threshold signature score of 0.4 captures ~75% of the tumors from the very highly methylated (MC1) cluster and excludes all tumors from the less methylated MC3 and MC4 clusters (Figure 8B), and also excludes the outlier CGI-H tumor from the discovery set (Figure 7C). Using this classification scheme, 58/203 (29%) of endometrioid tumors from the test set would be characterized as CIMP-H, which is comparable to the size of MC1 (56/203, 28%).
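The thresholding step can be illustrated with a minimal sketch. The function names, the example scores, and the choice to treat a score exactly equal to 0.4 as CIMP-H are assumptions made for illustration; the text specifies only the threshold value.

```python
# Hypothetical helpers: label tumors by comparing a signature methylation
# score against the suggested threshold of 0.4.
CIMP_THRESHOLD = 0.4  # value suggested in the text

def classify_cimp(scores, threshold=CIMP_THRESHOLD):
    """Map sample ID -> 'CIMP-H' or 'CIMP-0'.

    Assumption: scores equal to the threshold are called CIMP-H.
    """
    return {
        sample: ("CIMP-H" if score >= threshold else "CIMP-0")
        for sample, score in scores.items()
    }

def fraction_cimp_high(labels):
    """Fraction of samples labeled CIMP-H."""
    return sum(1 for v in labels.values() if v == "CIMP-H") / len(labels)

# Made-up scores, for illustration only (not data from the study).
example_scores = {"T1": 0.52, "T2": 0.31, "T3": 0.40, "T4": 0.18}
example_labels = classify_cimp(example_scores)
example_fraction = fraction_cimp_high(example_labels)

# Percentages quoted in the text, recomputed from the reported counts.
pct_cimp_h = round(100 * 58 / 203)  # 29
pct_mc1 = round(100 * 56 / 203)     # 28
```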

III. Discussion

DNA methylation is increasingly being recognized as a potential biomarker with diagnostic, prognostic, and therapeutic implications [70,98]. DNA methylation is frequently perturbed in cancer [4], and in certain contexts these changes may happen early in tumorigenesis [16]. Distinctive DNA methylation profiles may also be used to molecularly subtype tumors for personalized treatment [99]. In addition, many tumor types shed DNA into bodily fluids or excretions that can be harvested non-invasively (e.g., blood, urine, feces) [100,101]. Unlike mRNA, which is susceptible to degradation and is often expressed in a highly time-dependent manner, DNA methylation is molecularly stable and stably inherited during cellular division. These characteristics make DNA methylation a promising biomarker in cancer. In addition, CIMP tumors show highly dysregulated methylation that may make these tumors particularly vulnerable to demethylation therapies [92].

While CIMP has been defined in other tumor types, few studies have leveraged genome-wide methylome profiling techniques to examine CIMP in endometrial cancer [75,102,103]. Armed with an alternative methylome profiling technique that measures regional methylation over CGI rather than assuming regional methylation based on a few data points, we set out to develop an analysis method that could identify samples with large-scale methylation differences, and then pinpoint regions that could be used in a signature. This method would leverage the increased CpG coverage of enrichment-based methylation profiling vs. Infinium, yet would yield a signature that could be applied to Infinium datasets, especially datasets in The Cancer Genome Atlas.

A few biological insights arose from our study. Our analysis was based on the premise that CIMP could be identified based on aggregate methylation, which is an unusual approach for defining CIMP. We show that an approach for identifying CIMP based on aggregate methylation shows general agreement with a clustering-based approach (Figure 8B), and furthermore show that the methylation score yielded by our signature reflects aggregate CpG island methylation (Figure 8A). In addition, most tumors show more aggregate CGI methylation than normal controls (Figure 5, Figure 8C-D), suggesting that promoter methylation is more prevalent in endometrioid endometrial cancer than previously thought. This is a potentially exciting insight, as it increases the likelihood that methylation could be used as a biomarker to diagnose and track endometrial cancer, or that this aberrant methylation could serve as a useful therapeutic target.

An advantage of comparing aggregate methylation was that it was relatively straightforward to implement compared to unsupervised clustering approaches, which require complex data normalization and correction for batch effects and also require considerable trial and error to refine. Nonetheless, we had to account for technical biases that could make aggregate methylation appear very low or very high. We excluded samples with very poor CpG enrichment, which indicated a strong possibility that the enrichment for methylated fragments was poor or incomplete. However, we cannot rule out the possibility that in doing so we excluded samples with very low CGI methylation (significantly below that of normal tissue); our data therefore do not exclude the possibility of a CpG island demethylator phenotype in endometrioid endometrial cancer. We further reassured ourselves that no significant enrichment bias was present in the remaining data by comparing mitochondrial methylation signal against genomic methylation signal, since an enrichment bias would be expected to affect both (biologically, methylation in the two compartments would be expected to be unrelated); no significant correlation was observed between mitochondrial and genomic methylation among the 76 samples (data not shown). The validity of our approach for identifying CIMP in endometrioid endometrial cancer is also supported by data showing that our CIMP signature correlates extremely well with aggregate promoter CGI methylation using a different methylation platform and an independent dataset (Figure 8A).

Methylation in normal tissues is highly tissue-specific, and the exact methylation perturbations in different tumor types tend to vary based on cell of origin [104]. Thus there is no single CIMP signature that can identify CIMP in all cancers; instead CIMP must be examined individually for each tumor type, and broad methylation dysregulation may be more or less common in different tumor types. Nonetheless, some areas of the genome may be more susceptible to methylation across tissue types. We identified three CGI from our 13-region signature associated with genes known to be methylated in other tumor types: EPHX3 (alias: ABHD9), FGF12, and ASCL1. Methylation of EPHX3 in primary prostate malignancies was associated with early recurrence [105]. Methylation of FGF12 was observed in five colorectal tumors but not in matched controls [106]. ASCL1 was also frequently methylated in 125 colorectal tumors compared to 29 normal controls [104], suggesting that ASCL1 and FGF12 methylation may have diagnostic potential in both colorectal and endometrial cancers.

A 2013 study by the TCGA Consortium was the basis for the in silico analysis of our signature. This study profiled methylation of 373 endometrial tumors using the Infinium beadchip [75]. Unsupervised hierarchical clustering was used on the most variable probes to segregate tumors by methylation phenotype. Two methylation clusters were identified that showed overall hypermethylation. However, a signature for recapitulating these clusters was not provided, as methylome analysis was not the focus of the study.

Our study corroborates and supplements the excellent 2013 study by the TCGA Consortium. Their identification of CIMP relied on methylation data obtained from the Infinium platform. The Infinium platform samples CpGs from across the genome but is not a true whole genome methylome profiling method. The Infinium HumanMethylation450 platform used for the TCGA study features 485,577 probes, of which 113,521 lie within promoter CGI, covering 16,079/16,394 (98%) of promoter CGI in the human genome with an average of 7 probes per CGI. These promoter CGI have an average length of 904bp and 84 CpGs per island; therefore the Infinium platform measures methylation of 8% of CpGs in promoter CGI. While Illumina suggests that methylation of Infinium probes is representative of regional methylation, it may be inaccurate to assume that representative probes in normal tissues are also representative in tumors with profound dysregulation of methylation patterns genome-wide, especially when the analysis is expanded to the scope of a genome-wide study with no additional validation. Our methylation signature is based on genome-wide promoter CGI data collected using MethylCap-seq; the agreement of our methylation score data with the clusters in the TCGA Consortium study validates their method as well as our own. Our method has an additional advantage: identifying endometrial CIMP samples using our method requires measuring methylation of only 82 CpGs corresponding to 13 regions, as opposed to the 785 probes scattered throughout the genome used for clustering in the TCGA Consortium study. Our signature could likely be further pared down to facilitate high-throughput screening without sacrificing accuracy. In addition, we provide a suggested methylation score threshold of 0.4 for identifying arbitrary endometrial CIMP tumors, whereas the TCGA Consortium identified CIMP-like tumors in their dataset, but did not provide a method for identifying CIMP in arbitrary samples.
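The coverage figures quoted above can be checked with a few lines of arithmetic. This is only a sketch: the variable names are illustrative, and the 8% estimate assumes that each probe interrogates one distinct CpG, which the text implies but does not state.

```python
# Counts taken from the paragraph above.
promoter_cgi_probes = 113521   # Infinium probes within promoter CGI
promoter_cgi_covered = 16079   # promoter CGI with at least one probe
promoter_cgi_total = 16394     # promoter CGI in the human genome
cpgs_per_island = 84           # average CpGs per promoter CGI

# Fraction of promoter CGI covered by at least one probe (~98%).
coverage = promoter_cgi_covered / promoter_cgi_total

# Average probes per covered promoter CGI (~7).
probes_per_cgi = promoter_cgi_probes / promoter_cgi_covered

# Assumption: one probe measures one CpG, so the fraction of CpGs
# measured per island is probes per island over CpGs per island (~8%).
fraction_cpgs_measured = probes_per_cgi / cpgs_per_island

print(f"coverage: {coverage:.1%}")
print(f"probes per CGI: {probes_per_cgi:.2f}")
print(f"CpGs measured: {fraction_cpgs_measured:.1%}")
```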

A potential drawback of our approach is that any threshold we might use to distinguish CIMP-H and CIMP-0 tumors would be arbitrary; in fact, our observations suggest that CIMP in endometrioid endometrial cancer could be viewed as a continuum rather than as a discrete phenomenon (Figure 5B, Figure 8C). We offer a methylation score threshold of 0.4 for distinguishing CIMP-H tumors, but pending comparison to other data sets, this recommendation remains unvalidated.

It is important to recall that the methylation of 13 CpG islands in and of itself is not CIMP, but rather that methylation of these regions correlated with broad gains in CpG island methylation in both a discovery set and an independent test set. A methylation signature could be useful for diagnosis and classification of cancer, yet indicate no underlying methylator phenotype, and vice versa.

IV. Conclusion

In summary, we used methylome profiling techniques to stratify tumors by overall promoter CGI methylation, identified a signature to reproduce this stratification, and verified that stratifying tumors using this signature reproduced known characteristics of CIMP tumors (e.g., the association with microsatellite instability). Furthermore, we demonstrated methodological robustness by repeating the stratification with two additional signatures derived using the same methodology.

More generally, we demonstrate an approach for translating methylome profiling findings to the Infinium platform, which will become increasingly important as publicly available methylation datasets (e.g., those in The Cancer Genome Atlas, TCGA) mature. We also illustrate a general method for identifying methylator phenotypes that may be applied to other tumor types; this method does not rely on unsupervised clustering, which is sensitive to technical artifacts such as batch effects [107]. In addition, our results suggest that widespread promoter methylation is more prevalent in endometrioid endometrial cancer than previously thought, and that promoter methylation could be a useful marker for distinguishing tumors and normal tissue. We hope that our methylation signature will catalyze further investigation of the methylator phenotype in endometrioid endometrial cancer to better understand the mechanisms and consequences of wide-scale epigenetic dysregulation.

V. Methods

Patient samples

76 primary human endometrioid endometrial cancer and 12 nonmalignant endometrial samples were analyzed from a previously published cohort [108]. Cohort characteristics are shown in Appendix B1. A sequencing read summary is provided in Appendix B2. All studies involving human endometrial cancer samples were approved by the Human Studies Committee at Washington University and at The Ohio State University.

MethylCap-seq quality control

MethylCap-seq quality control was implemented as previously described [109], and 14 of 102 samples that showed evidence of poor methylated fragment enrichment or poor sequencing reproducibility were excluded from analysis. This method was demonstrated to reduce noise in methylation signal and improve the ability to discriminate between tumors and normal tissue.

MethylCap-seq data analysis

Sequence files were aligned and processed as previously described [109].

Reads were extended to the average fragment length and the resulting count distribution was normalized against the total aligned reads by conversion to reads per million (RPM). Differentially methylated promoter CGI were identified by performing a Wilcoxon rank sum test for each CGI across the two sample groups being considered. Results were adjusted for multiple comparisons by setting a false discovery rate (FDR) cutoff of 0.05.
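The normalization and multiple-testing steps described above can be sketched as follows. This is an illustration under stated assumptions, not the pipeline used in the study: the function names are hypothetical, the per-CGI p-values are assumed to come from a Wilcoxon rank sum test computed elsewhere, and the Benjamini-Hochberg procedure is assumed as the FDR method because the text gives only the cutoff of 0.05.

```python
def reads_per_million(counts, total_aligned_reads):
    """Scale raw per-region read counts to reads per million (RPM)."""
    scale = 1_000_000 / total_aligned_reads
    return [c * scale for c in counts]

def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in input order.

    Assumption: BH is used here for illustration; the source states
    only that an FDR cutoff of 0.05 was applied.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

def significant_regions(region_ids, p_values, fdr=0.05):
    """Keep regions whose adjusted p-value is at or below the cutoff."""
    adjusted = benjamini_hochberg(p_values)
    return [r for r, q in zip(region_ids, adjusted) if q <= fdr]

# Made-up inputs, for illustration only.
example_rpm = reads_per_million([10, 0, 25], total_aligned_reads=5_000_000)
example_hits = significant_regions(
    ["cgi_a", "cgi_b", "cgi_c", "cgi_d"], [0.001, 0.04, 0.03, 0.2]
)
```

Taking p-values as input keeps the sketch free of external dependencies; in practice the rank sum test itself would come from a statistics library.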

Methylation was categorized by genomic feature as follows: CpG islands (CGI, as defined in the UCSC genome browser), promoters (2kb in length, 1kb upstream and downstream of the TSS), CGI shores (200bp to 2kb distant from both ends of each CGI), and the first exon of RefSeq genes. CGI were further subdivided by proximity to promoters (within 10kb upstream or 1kb downstream of a 2kb promoter), and 2kb promoters were subdivided by overlap with CGI.
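The feature definitions above can be expressed as simple interval arithmetic. The sketch below is one possible reading, with several assumptions that are not in the text: 0-based closed coordinates, strand given as "+" or "-", hypothetical function names, and a CGI counted as promoter-associated when it overlaps the promoter window extended by 10kb upstream and 1kb downstream.

```python
def promoter_window(tss, flank=1000):
    """2kb promoter: 1kb on either side of the TSS (clipped at 0)."""
    return (max(0, tss - flank), tss + flank)

def cgi_shores(cgi_start, cgi_end, inner=200, outer=2000):
    """Shores: 200bp to 2kb away from each end of the CGI."""
    left_shore = (max(0, cgi_start - outer), max(0, cgi_start - inner))
    right_shore = (cgi_end + inner, cgi_end + outer)
    return left_shore, right_shore

def is_promoter_cgi(cgi_start, cgi_end, tss, strand,
                    upstream=10000, downstream=1000):
    """True if the CGI overlaps the extended promoter region.

    Assumption: 'upstream' means lower coordinates on the + strand
    and higher coordinates on the - strand.
    """
    prom_start, prom_end = promoter_window(tss)
    if strand == "+":
        region = (max(0, prom_start - upstream), prom_end + downstream)
    else:
        region = (max(0, prom_start - downstream), prom_end + upstream)
    return cgi_start <= region[1] and cgi_end >= region[0]

# Made-up coordinates, for illustration only.
example_promoter = promoter_window(5000)
example_shores = cgi_shores(10000, 11000)
```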

Infinium validation of methylation signature candidates

11 of 76 tumors were chosen for technical validation using the Infinium HumanMethylation450 beadchip platform, a well-validated bisulfite-based method for assessing methylation of individual CpGs genome-wide. The assay was performed according to manufacturer protocol by the University of Southern California Epigenome Center. Methylation was reported using beta-values, which represent the fraction of DNA fragments that were methylated at a given CpG site.

Computation of methylation score using the 13 promoter CGI signature

Methylation score was computed by taking the average of the beta-values for all probes within a promoter CGI, then averaging the result across the 13 promoter CGI in the signature. The final signature comprised a total of 88 Infinium HumanMethylation450 probes.
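The two-stage average described above can be sketched as follows. The function name, the input layout (a mapping from CGI identifier to that CGI's probe beta-values for one sample), and the handling of missing probes are assumptions for illustration.

```python
from statistics import mean

def methylation_score(probe_betas_by_cgi):
    """Average probe beta-values within each CGI, then across CGI.

    probe_betas_by_cgi: dict mapping a CGI identifier to the list of
    beta-values (0 to 1) of the probes within that CGI for one sample.
    Assumption: probes with missing values (None) are skipped, and a
    CGI with no usable probes is left out of the second average.
    """
    per_cgi_means = []
    for betas in probe_betas_by_cgi.values():
        usable = [b for b in betas if b is not None]
        if usable:
            per_cgi_means.append(mean(usable))
    return mean(per_cgi_means)

# Made-up beta-values for three regions, for illustration only.
example_sample = {
    "CGI_A": [0.2, 0.4],
    "CGI_B": [0.9],
    "CGI_C": [0.1, 0.2, 0.3],
}
example_score = methylation_score(example_sample)
```

Averaging within each CGI first gives every region equal weight regardless of how many probes it carries; pooling all probes into a single mean would instead weight probe-rich islands more heavily.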

In silico analysis of TCGA endometrioid endometrial tumors

Methylation was analyzed for 203 endometrioid endometrial tumors from the original published Cancer Genome Atlas cohort of 373 endometrial tumors [75]. The 170 tumors that lacked Infinium HumanMethylation450 data or were not of the endometrioid subtype were excluded from analysis. Some analyses assessed fewer than 203 samples due to gaps in data availability for each assay. Methylation was assessed using Level 3 data from The Cancer Genome Atlas Data Portal, while clinical and molecular correlate data were gathered from cBioPortal for Cancer Genomics (Memorial Sloan Kettering Cancer Center).

Replicate signature analysis

To demonstrate the reproducibility of our method for identifying tumors with a CpG island methylator phenotype, two additional 13-region signatures were compiled from the original list of top differentially methylated promoter CGI between CGI-H and CGI-0 tumors in the discovery set. Regions that had already been considered for the original signature were excluded from this analysis. Mirroring the technical validation of the original signature, candidate regions that showed <0.1 difference in average beta-value between groups in the Infinium technical validation set were discarded. An additional negative control signature was populated with the 13 promoter CGI that showed the least difference in methylation between groups in the discovery set (as determined by fold change). Endometrioid endometrial tumors from the test set were indexed using all four signatures, and methylation score was computed using the average beta-value of the regions in each signature. Rank correlation of tumor methylation scores between replicate signatures and the original signature was compared using a Spearman test.
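The rank comparison between signatures can be sketched with a small pure-Python Spearman coefficient. The function names and example values are illustrative assumptions; ties receive average ranks, and the p-value calculation used in the study is not reproduced here.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson_r(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def spearman_r(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))

# Made-up methylation scores for five tumors under two signatures.
original_scores = [0.10, 0.20, 0.30, 0.40, 0.50]
replicate_scores = [0.22, 0.15, 0.45, 0.38, 0.60]
example_r = spearman_r(original_scores, replicate_scores)
```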


Chapter 5. Summary and Discussion

I. Summary

The goals of this project were:

1) Develop and validate a MethylCap-seq quality control module to identify and exclude samples with spurious methylation data.

2) Develop a method to identify putative CIMP tumors using the MethylCap-seq platform.

3) Describe the differences between putative CIMP tumors and non-CIMP control tumors using the Infinium platform.

4) Compare agreement in CIMP classification between my method based on MethylCap-seq and the published clustering method based on Infinium.

5) Demonstrate a signature that could classify CIMP tumors prospectively without a full methylome profiling experiment.

Hypothesis: A combined methylation analysis method using MethylCap-seq and Infinium can be used to define CIMP de novo in endometrial cancer.

Chapter 3 outlines and validates the quality control module that was developed for this project. The validity of a MethylCap-seq experiment is dependent on enrichment of methylated fragments prior to sequencing. A failure in enrichment invalidates any downstream data, and therefore identifying such failures is vital. 203 lanes of sequencing data were generated for 101 unique samples. 43 lanes failed QC, representing 21 unique samples.


The QC module excluded samples with noisy methylation signal, increased a measure of sequencing reproducibility, and increased power to detect differentially methylated regions.

The subsequent goals are addressed in Chapter 4. Analysis with MethylCap-seq identified a wide distribution of total promoter CpG island methylation among endometrioid endometrial tumors, with normal controls showing similar methylation to the lower end of the tumor methylation spectrum. The CpG island methylator phenotype (CIMP) is often regarded as a discrete phenomenon occurring in a specific subset of genomic regions, but I hypothesized that comparing the two ends of the tumor methylation spectrum would also be a valid starting point to identify CIMP.

To test my hypothesis, I had to validate a putative CIMP signature's ability to differentiate "true positives" and "true negatives". The easiest way to do this was to compare against methylation data from a large published cohort where CIMP had been identified using a traditional method. TCGA provided this opportunity, but used a different methylation platform. Therefore I had to describe the differences between putative CIMP tumors and non-CIMP control tumors in my dataset using the same platform as the TCGA: Infinium. To this end, I sampled tumors on the high and low ends of the methylation spectrum and identified a set of loci that distinguished them, using normal controls to screen for loci that may have undergone cancer-associated methylation gains. I also profiled this subset of tumors with Infinium and discarded the loci that were not differentially methylated between groups, as determined by Infinium. This also served as a technical validation to show that MethylCap-seq and Infinium were yielding similar results, which helped rule out biases that could have cast doubt on my methodology.

A 13-region methylation signature for identifying CIMP, composed of 82 Infinium probes, emerged from this analysis. Yet since the original data were a continuum, I lacked a binary method to classify CIMP. Instead of drawing an arbitrary threshold, I chose to compare against correlates from TCGA using the entire spectrum of the signature methylation score and look for the expected differences. Signature methylation score differentiated the previously published methylation clusters, demonstrating the agreement between my method and a traditional clustering method. In addition, the typical CIMP correlates emerged: microsatellite instability, high mutation rate, and low somatic copy number alteration. Two alternative signatures generated using the same methodology showed a similar result, while a negative control signature did not.

The final goal was for the methylation signature to be useful for classifying CIMP tumors prospectively. The signature is composed of 82 CpGs spread across 13 genomic regions, and therefore these markers are measurable without a full methylome profiling study. In addition, measurements of the methylation state of individual CpGs can be performed with high technical reliability and reproducibility using relatively cost-effective methods such as small-scale beadchip arrays and bisulfite pyrosequencing. My method by its nature does not lend itself to binary calling of "CIMP" or "not CIMP", but I provide a cutoff threshold signature score based on comparison with TCGA methylation clusters for reference and future validation.

II. Discussion

CIMP appears to have multiple definitions in the literature, depending on the context of the particular study. In some studies, CIMP is regarded as a reproducible pattern of methylated promoter CpG islands, similar to what I term a "signature". Critics have questioned whether the term itself has meaning or whether there is any consistent biology behind the phenomenon.

I strictly define CIMP in my study as a genome-wide phenomenon that must be initially identified using a genome-wide approach. I argue that such a definition, likely involving broad perturbation of major epigenetic pathways, is most likely to have a consistent biological basis across tissue types--even if the exact regions affected vary greatly across different tumor types.

Methylation in normal tissues is highly tissue-specific, and the exact methylation perturbations in different tumor types tend to vary based on cell of origin [104]. Thus there is no single CIMP signature that can identify CIMP in all cancers; instead CIMP must be examined individually for each tumor type, and broad methylation dysregulation may be more or less common in different tumor types. Nonetheless, some areas of the genome may be more susceptible to methylation across tissue types. We identified three CGI from our 13-region signature associated with genes known to be methylated in other tumor types: EPHX3 (alias: ABHD9), FGF12, and ASCL1. Methylation of EPHX3 in primary prostate malignancies was associated with early recurrence [105]. Methylation of FGF12 was observed in five colorectal tumors but not in matched controls [106]. ASCL1 was also frequently methylated in 125 colorectal tumors compared to 29 normal controls [104], suggesting that ASCL1 and FGF12 methylation may have diagnostic potential in both colorectal and endometrial cancers.

The method I used to analyze MethylCap-seq data differs from other published methods in that no attempt was made to normalize CpG density differences across the genome. With enrichment-based methods for profiling methylation like MethylCap-seq, areas with greater CpG density have more opportunities for methylation and therefore a higher likelihood of being enriched, independent of the ratio of methylated CpGs to unmethylated CpGs. This methodology discrepancy complicates comparison with bisulfite-based methodologies, which measure the methylation rate of a given CpG in the sampled cell population. Thus CpG density normalization is typically performed to make data from enrichment-based methylation profiling techniques look more similar to that obtained from bisulfite-based techniques. Doing so can streamline analysis pipelines, since normalized data could theoretically be analyzed in a single platform-independent manner. Normalization is also required if meaningful comparisons of methylation are to be made between two different loci in the same sample. However, CpG density normalization is computationally complex and was not necessary in my study, as I was interested in comparing methylation between the same loci in many samples (vertically) rather than multiple loci in the same sample (horizontally). Skipping CpG density normalization could have potentially biased the top differentially methylated regions that composed my final signature towards regions with higher CpG density. The regions were subsequently screened with Infinium to discard the loci that were not differentially methylated on both platforms. As the signature regions were treated as markers, a bias towards higher CpG density would not be undesirable, and in fact the observed bias was small (6%, 94 CpGs per island on average among regions in the final signature vs. 89 among all promoter CGI, as defined and interrogated by our particular method).

References

1. Berger SL (2007) The complex language of chromatin regulation during transcription. Nature 447: 407-412.

2. Robertson KD, Wolffe AP (2000) DNA methylation in health and disease. Nat Rev Genet 1: 11-19.

3. Yang X, Yan L, Davidson NE (2001) DNA methylation in breast cancer. Endocr Relat Cancer 8: 115-127.

4. Esteller M (2008) Epigenetics in cancer. N Engl J Med 358: 1148-1159.

5. Coolen MW, Stirzaker C, Song JZ, Statham AL, Kassir Z, et al. (2010) Consolidation of the cancer genome into domains of repressive chromatin by long-range epigenetic silencing (LRES) reduces transcriptional plasticity. Nat Cell Biol 12: 235-246.

6. Hatziapostolou M, Iliopoulos D Epigenetic aberrations during oncogenesis. Cellular and Molecular Life Sciences: 1-22.

7. Baylin SB (2005) DNA methylation and gene silencing in cancer. Nat Clin Pract Oncol 2 Suppl 1: S4-11.

8. Ferreri AJ, Dell'Oro S, Capello D, Ponzoni M, Iuzzolino P, et al. (2004) Aberrant methylation in the promoter region of the reduced folate carrier gene is a potential mechanism of resistance to methotrexate in primary central nervous system lymphomas. Br J Haematol 126: 657-664.

9. Nakayama M, Wada M, Harada T, Nagayama J, Kusaba H, et al. (1998) Hypomethylation status of CpG sites at the promoter region and overexpression of the human MDR1 gene in acute myeloid leukemias. Blood 92: 4296-4307.

10. Strathdee G, MacKean MJ, Illand M, Brown R (1999) A role for methylation of the hMLH1 promoter in loss of hMLH1 expression and drug resistance in ovarian cancer. Oncogene 18: 2335-2341.

11. You JS, Jones PA (2012) Cancer genetics and epigenetics: two sides of the same coin? Cancer Cell 22: 9-20.

12. Nightingale KP, O'Neill LP, Turner BM (2006) Histone modifications: signalling receptors and potential elements of a heritable epigenetic code. Curr Opin Genet Dev 16: 125-136.

13. Margueron R, Trojer P, Reinberg D (2005) The key to development: interpreting the histone code? Curr Opin Genet Dev 15: 163-176.

14. Cedar H, Bergman Y (2009) Linking DNA methylation and histone modification: patterns and paradigms. Nat Rev Genet 10: 295-304.

15. Sparmann A, van Lohuizen M (2006) Polycomb silencers control cell fate, development and cancer. Nat Rev Cancer 6: 846-856.

16. Laird PW (2003) The power and the promise of DNA methylation markers. Nat Rev Cancer 3: 253-266.

17. Szyf M (2009) Epigenetics, DNA methylation, and chromatin modifying drugs. Annu Rev Pharmacol Toxicol 49: 243-263.

18. van Agthoven T, van Agthoven TL, Dekker A, Foekens JA, Dorssers LC (1994) Induction of estrogen independence of ZR-75-1 human breast cancer cells by epigenetic alterations. Mol Endocrinol 8: 1474-1483.

19. Hegi ME, Diserens AC, Gorlia T, Hamou MF, de Tribolet N, et al. (2005)

MGMT gene silencing and benefit from temozolomide in glioblastoma.

N Engl J Med 352: 997-1003.

20. Bock C (2012) Analysing and interpreting DNA methylation data. Nat Rev

Genet 13: 705-719.

21. Laird PW (2010) Principles and challenges of genomewide DNA

methylation analysis. Nat Rev Genet 11: 191-203.

22. Trimarchi MP, Mouangsavanh M, Huang TH (2011) Cancer epigenetics: a

perspective on the role of DNA methylation in acquired endocrine

resistance. Chin J Cancer 30: 749-756.

23. Bock C, Tomazou EM, Brinkman AB, Muller F, Simmer F, et al. (2010)

Quantitative comparison of genome-wide DNA methylation mapping

technologies. Nat Biotechnol 28: 1106-1114.

24. Harris RA, Wang T, Coarfa C, Nagarajan RP, Hong C, et al. (2010)

Comparison of sequencing-based methods to profile DNA methylation

and identification of monoallelic epigenetic modifications. Nat

Biotechnol 28: 1097-1105.

25. Robinson MD, Statham AL, Speed TP, Clark SJ (2010) Protocol matters:

which methylome are you actually studying? Epigenomics 2: 587-598.

26. Chavez L, Jozefczuk J, Grimm C, Dietrich J, Timmermann B, et al. (2010)

Computational analysis of genome-wide DNA methylation during the

differentiation of human embryonic stem cells along the endodermal

lineage. Genome Res 20: 1441-1450.

27. Down TA, Rakyan VK, Turner DJ, Flicek P, Li H, et al. (2008) A Bayesian

deconvolution strategy for immunoprecipitation-based DNA methylome

analysis. Nat Biotechnol 26: 779-785.

28. Rodriguez B, Frankhouser D, Murphy M, Trimarchi M, Tam HH, et al.

(2012) A Scalable, Flexible Workflow for MethylCap-Seq Data

Analysis. BMC Genomics.

29. Lan X, Adams C, Landers M, Dudas M, Krissinger D, et al. (2011) High

resolution detection and analysis of CpG dinucleotides methylation

using MBD-Seq technology. PLoS One 6: e22226.

30. Rao X, Evans J, Chae H, Pilrose J, Kim S, et al. (2012) CpG island shore

methylation regulates caveolin-1 expression in breast cancer.

Oncogene.

31. Serre D, Lee BH, Ting AH (2010) MBD-isolated Genome Sequencing provides a

high-throughput and comprehensive survey of DNA methylation in the

human genome. Nucleic Acids Res 38: 391-399.

32. Li N, Ye M, Li Y, Yan Z, Butcher LM, et al. (2010) Whole genome DNA

methylation analysis based on high throughput sequencing technology.

Methods 52: 203-212.

33. Nair SS, Coolen MW, Stirzaker C, Song JZ, Statham AL, et al. (2011)

Comparison of methyl-DNA immunoprecipitation (MeDIP) and methyl-

CpG binding domain (MBD) protein capture for genome-wide DNA

methylation analysis reveal CpG sequence coverage bias. Epigenetics

6: 34-44.

34. Bogdanovic O, Long SW, van Heeringen SJ, Brinkman AB, Gomez-

Skarmeta JL, et al. (2011) Temporal uncoupling of the DNA methylome

and transcriptional repression during embryogenesis. Genome Res 21:

1313-1327.

35. Xu Y, Hu B, Choi AJ, Gopalan B, Lee BH, et al. (2012) Unique DNA

methylome profiles in CpG island methylator phenotype colon cancers.

Genome Res 22: 283-291.

36. Kim M, Kang TW, Lee HC, Han YM, Kim H, et al. (2011) Identification of

DNA methylation markers for lineage commitment of in vitro

hepatogenesis. Hum Mol Genet 20: 2722-2733.

37. Brenet F, Moh M, Funk P, Feierstein E, Viale AJ, et al. (2011) DNA

methylation of the first exon is tightly linked to transcriptional silencing.

PLoS One 6: e14524.

38. Park JH, Park J, Choi JK, Lyu J, Bae MG, et al. (2011) Identification of

DNA methylation changes associated with human gastric cancer. BMC

Med Genomics 4: 82.

39. Yan P, Frankhouser D, Murphy M, Tam HH, Rodriguez B, et al. (2012)

Genome-wide methylation profiling in decitabine-treated patients with

acute myeloid leukemia. Blood 120: 2466-2474.

40. Lee BH, Taylor MG, Robinet P, Smith JD, Schweitzer J, et al. (2012)

Dysregulation of cholesterol homeostasis in human prostate cancer

through loss of ABCA1. Cancer Res.

41. Hogart A, Lichtenberg J, Ajay SS, Anderson S, Margulies EH, et al. (2012)

Genome-wide DNA methylation profiles in hematopoietic stem and

progenitor cells reveal overrepresentation of ETS transcription factor

binding sites. Genome Res 22: 1407-1418.

42. Deaton AM, Webb S, Kerr AR, Illingworth RS, Guy J, et al. (2011) Cell

type-specific DNA methylation at intragenic CpG islands in the immune

system. Genome Res 21: 1074-1086.

43. Yan H, Choi AJ, Lee BH, Ting AH (2011) Identification and functional

analysis of epigenetically silenced microRNAs in colorectal cancer

cells. PLoS One 6: e20628.

44. Decock A, Ongenaert M, Hoebeeck J, De Preter K, Van Peer G, et al.

(2012) Genome-wide promoter methylation analysis in neuroblastoma

identifies prognostic methylation biomarkers. Genome Biol 13: R95.

45. Jin B, Ernst J, Tiedemann RL, Xu H, Sureshchandra S, et al. (2012)

Linking DNA Methyltransferases to Epigenetic Marks and Nucleosome

Structure Genome-wide in Human Tumor Cells. Cell Rep 2: 1411-

1424.

46. Simmer F, Brinkman AB, Assenov Y, Matarese F, Kaan A, et al. (2012)

Comparative genome-wide DNA methylation analysis of colorectal

tumor and matched normal tissues. Epigenetics 7.

47. Carvalho RH, Haberle V, Hou J, van Gent T, Thongjuea S, et al. (2012)

Genome-wide DNA methylation profiling of non-small cell lung

carcinomas. Epigenetics Chromatin 5: 9.

48. Yu W, Jin C, Lou X, Han X, Li L, et al. (2011) Global analysis of DNA

methylation by Methyl-Capture sequencing reveals epigenetic control

of cisplatin resistance in ovarian cancer cell. PLoS One 6: e29450.

49. Illingworth RS, Gruenewald-Schneider U, Webb S, Kerr AR, James KD, et

al. (2010) Orphan CpG islands identify numerous conserved promoters

in the mammalian genome. PLoS Genet 6.

50. Skene PJ, Illingworth RS, Webb S, Kerr AR, James KD, et al. (2010)

Neuronal MeCP2 is expressed at near histone-octamer levels and

globally alters the chromatin state. Mol Cell 37: 457-468.

51. Turker MS (2002) Gene silencing in mammalian cells and the spread of

DNA methylation. Oncogene 21: 5388-5393.

52. Irizarry RA, Ladd-Acosta C, Wen B, Wu Z, Montano C, et al. (2009) The

human colon cancer methylome shows similar hypo- and

hypermethylation at conserved tissue-specific CpG island shores. Nat

Genet 41: 178-186.

53. Ji H, Ehrlich LI, Seita J, Murakami P, Doi A, et al. (2010) Comprehensive

methylome map of lineage commitment from haematopoietic

progenitors. Nature 467: 338-342.

54. Doi A, Park IH, Wen B, Murakami P, Aryee MJ, et al. (2009) Differential

methylation of tissue- and cancer-specific CpG island shores

distinguishes human induced pluripotent stem cells, embryonic stem

cells and fibroblasts. Nat Genet 41: 1350-1353.

55. Akalin A, Garrett-Bakelman FE, Kormaksson M, Busuttil J, Zhang L, et al.

(2012) Base-pair resolution DNA methylation sequencing reveals

profoundly divergent epigenetic landscapes in acute myeloid leukemia.

PLoS Genet 8: e1002781.

56. Dudziec E, Miah S, Choudhry HM, Owen HC, Blizard S, et al. (2011)

Hypermethylation of CpG islands and shores around specific

microRNAs and mirtrons is associated with the phenotype and

presence of bladder cancer. Clin Cancer Res 17: 1287-1296.

57. Rollins RA, Haghighi F, Edwards JR, Das R, Zhang MQ, et al. (2006)

Large-scale structure of genomic methylation patterns. Genome Res

16: 157-163.

58. Jones PA (2012) Functions of DNA methylation: islands, start sites, gene

bodies and beyond. Nat Rev Genet 13: 484-492.

59. Ball MP, Li JB, Gao Y, Lee JH, LeProust EM, et al. (2009) Targeted and

genome-scale strategies reveal gene-body methylation signatures in

human cells. Nat Biotechnol 27: 361-368.

60. Laurent L, Wong E, Li G, Huynh T, Tsirigos A, et al. (2010) Dynamic

changes in the human methylome during differentiation. Genome Res

20: 320-331.

61. Lister R, Pelizzola M, Dowen RH, Hawkins RD, Hon G, et al. (2009)

Human DNA methylomes at base resolution show widespread

epigenomic differences. Nature 462: 315-322.

62. Lister R, Pelizzola M, Kida YS, Hawkins RD, Nery JR, et al. (2011)

Hotspots of aberrant epigenomic reprogramming in human induced

pluripotent stem cells. Nature 471: 68-73.

63. Berman BP, Weisenberger DJ, Aman JF, Hinoue T, Ramjan Z, et al.

(2012) Regions of focal DNA hypermethylation and long-range

hypomethylation in colorectal cancer coincide with nuclear lamina-

associated domains. Nat Genet 44: 40-46.

64. Hansen KD, Timp W, Bravo HC, Sabunciyan S, Langmead B, et al. (2011)

Increased methylation variation in epigenetic domains across cancer

types. Nat Genet 43: 768-775.

65. Stadler MB, Murr R, Burger L, Ivanek R, Lienert F, et al. (2011) DNA-

binding factors shape the mouse methylome at distal regulatory

regions. Nature 480: 490-495.

66. Dunham I, Kundaje A, Aldred SF, Collins PJ, Davis CA, et al. (2012) An

integrated encyclopedia of DNA elements in the human genome.

Nature 489: 57-74.

67. Shukla S, Kavak E, Gregory M, Imashimizu M, Shutinoski B, et al. (2011)

CTCF-promoted RNA polymerase II pausing links DNA methylation to

splicing. Nature 479: 74-79.

68. Zhou Y, Lu Y, Tian W (2012) Epigenetic features are significantly

associated with alternative splicing. BMC Genomics 13: 123.

69. Herman JG, Baylin SB (2003) Gene Silencing in Cancer in Association

with Promoter Hypermethylation. New England Journal of Medicine

349: 2042-2054.

70. Tost J (2010) DNA Methylation: An Introduction to the Biology and the

Disease-Associated Changes of a Promising Biomarker. Molecular

Biotechnology 44: 71-81.

71. Issa JP (2004) CpG island methylator phenotype in cancer. Nat Rev

Cancer 4: 988-993.

72. Hughes LA, Melotte V, de Schrijver J, de Maat M, Smit VT, et al. (2013)

The CpG island methylator phenotype: what's in a name? Cancer Res

73: 5858-5868.

73. Bibikova M, Le J, Barnes B, Saedinia-Melnyk S, Zhou L, et al. (2009)

Genome-wide DNA methylation profiling using Infinium® assay.

Epigenomics 1: 177-200.

74. Huang YW, Luo J, Weng YI, Mutch DG, Goodfellow PJ, et al. (2010)

Promoter hypermethylation of CIDEA, HAAO and RXFP3 associated

with microsatellite instability in endometrial carcinomas. Gynecol Oncol

117: 239-247.

75. Kandoth C, Schultz N, Cherniack AD, Akbani R, Liu Y, et al. (2013)

Integrated genomic characterization of endometrial carcinoma. Nature

497: 67-73.

76. Bergman Y, Cedar H (2013) DNA methylation dynamics in health and

disease. Nat Struct Mol Biol 20: 274-281.

77. Rodriguez BA, Frankhouser D, Murphy M, Trimarchi M, Tam H-H, et al.

(2012) Methods for high-throughput MethylCap-Seq data analysis.

BMC Genomics 13: S14.

78. Lienhard M, Grimm C, Morkel M, Herwig R, Chavez L (2014) MEDIPS:

genome-wide differential coverage analysis of sequencing data derived

from DNA enrichment experiments. Bioinformatics 30: 284-286.

79. Blum W, Garzon R, Klisovic RB, Schwind S, Walker A, et al. (2010)

Clinical response and miR-29b predictive significance in older AML

patients treated with a 10-day schedule of decitabine. Proc Natl Acad

Sci U S A 107: 7473-7478.

80. Langmead B, Trapnell C, Pop M, Salzberg SL (2009) Ultrafast and

memory-efficient alignment of short DNA sequences to the human

genome. Genome Biol 10: R25.

81. Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, et al. (2009) The

Sequence Alignment/Map format and SAMtools. Bioinformatics 25:

2078-2079.

82. Jones PA, Baylin SB (2007) The epigenomics of cancer. Cell 128: 683-

692.

83. Toyota M, Ahuja N, Ohe-Toyota M, Herman JG, Baylin SB, et al. (1999)

CpG island methylator phenotype in colorectal cancer. Proc Natl Acad

Sci U S A 96: 8681-8686.

84. Noushmehr H, Weisenberger DJ, Diefes K, Phillips HS, Pujara K, et al.

(2010) Identification of a CpG island methylator phenotype that defines

a distinct subgroup of glioma. Cancer Cell 17: 510-522.

85. Fang F, Turcan S, Rimner A, Kaufman A, Giri D, et al. (2011) Breast

cancer methylomes establish an epigenomic foundation for metastasis.

Sci Transl Med 3: 75ra25.

86. Figueroa ME, Abdel-Wahab O, Lu C, Ward PS, Patel J, et al. (2010)

Leukemic IDH1 and IDH2 mutations result in a hypermethylation

phenotype, disrupt TET2 function, and impair hematopoietic

differentiation. Cancer Cell 18: 553-567.

87. Zouridis H, Deng N, Ivanova T, Zhu Y, Wong B, et al. (2012) Methylation

subtypes and large-scale epigenetic alterations in gastric cancer. Sci

Transl Med 4: 156ra140.

88. Arai E, Chiku S, Mori T, Gotoh M, Nakagawa T, et al. (2012) Single-CpG-

resolution methylome analysis identifies clinicopathologically

aggressive CpG island methylator phenotype clear cell renal cell

carcinomas. Carcinogenesis 33: 1487-1493.

89. Jithesh PV, Risk JM, Schache AG, Dhanda J, Lane B, et al. (2013) The

epigenetic landscape of oral squamous cell carcinoma. Br J Cancer

108: 370-379.

90. Mack SC, Witt H, Piro RM, Gu L, Zuyderduyn S, et al. (2014) Epigenomic

alterations define lethal CIMP-positive ependymomas of infancy.

Nature 506: 445-450.

91. Issa J-P (2008) Colon cancer: it's CIN or CIMP. Clinical Cancer Research

14: 5939-5940.

92. Turcan S, Fabius AW, Borodovsky A, Pedraza A, Brennan C, et al. (2013)

Efficient induction of differentiation and growth inhibition in IDH1

mutant glioma cells by the DNMT Inhibitor Decitabine.

93. Hsu YT, Gu F, Huang YW, Liu J, Ruan J, et al. (2013) Promoter

hypomethylation of EpCAM-regulated bone morphogenetic protein

gene family in recurrent endometrial cancer. Clin Cancer Res 19: 6272-

6285.

94. Shmookler Reis RJ, Goldstein S (1983) Mitochondrial DNA in mortal and

immortal human cells. Genome number, integrity, and methylation. J

Biol Chem 258: 9078-9085.

95. Shock LS, Thakkar PV, Peterson EJ, Moran RG, Taylor SM (2011) DNA

methyltransferase 1, cytosine methylation, and cytosine

hydroxymethylation in mammalian mitochondria. Proc Natl Acad Sci U

S A 108: 3630-3635.

96. Goessl C, Krause H, Muller M, Heicappell R, Schrader M, et al. (2000)

Fluorescent methylation-specific polymerase chain reaction for DNA-

based detection of prostate cancer in bodily fluids. Cancer Res 60:

5941-5945.

97. Simpkins SB, Bocker T, Swisher EM, Mutch DG, Gersell DJ, et al. (1999)

MLH1 Promoter Methylation and Gene Silencing is the Primary Cause

of Microsatellite Instability in Sporadic Endometrial Cancers. Hum Mol

Genet 8: 661-666.

98. Issa JP (2007) DNA methylation as a therapeutic target in cancer. Clin

Cancer Res 13: 1634-1637.

99. Rhee JK, Kim K, Chae H, Evans J, Yan P, et al. (2013) Integrated

analysis of genome-wide DNA methylation and gene expression

profiles in molecular subtypes of breast cancer. Nucleic Acids Res 41:

8464-8474.

100. Widschwendter M, Menon U (2006) Circulating methylated DNA: a new

generation of tumor markers. Clin Cancer Res 12: 7205-7208.

101. Melotte V, Yi JM, Lentjes MH, Smits KM, Van Neste L, et al. (2015)

Spectrin repeat containing nuclear envelope 1 and forkhead box

protein E1 are promising markers for the detection of colorectal cancer

in blood. Cancer Prev Res (Phila) 8: 157-164.

102. Zhang B, Xing X, Li J, Lowdon RF, Zhou Y, et al. (2014) Comparative

DNA methylome analysis of endometrial carcinoma reveals complex

and distinct deregulation of cancer promoters and enhancers. BMC

Genomics 15: 868.

103. Kolbe DL, DeLoia JA, Porter-Gill P, Strange M, Petrykowska HM, et al.

(2012) Differential Analysis of Ovarian and Endometrial Cancers

Identifies a Methylator Phenotype. PLoS One 7: e32941.

104. Sproul D, Kitchen RR, Nestor CE, Dixon JM, Sims AH, et al. (2012)

Tissue of origin determines cancer-associated CpG island promoter

hypermethylation patterns. Genome Biol 13: R84.

105. Cottrell S, Jung K, Kristiansen G, Eltze E, Semjonow A, et al. (2007)

Discovery and validation of 3 novel DNA methylation markers of

prostate cancer prognosis. J Urol 177: 1753-1758.

106. Li H, Du Y, Zhang D, Wang LN, Yang C, et al. (2012) Identification of

novel DNA methylation markers in colorectal cancer using MIRA-based

microarrays. Oncol Rep 28: 99-104.

107. Leek JT, Scharpf RB, Bravo HC, Simcha D, Langmead B, et al. (2010)

Tackling the widespread and critical impact of batch effects in high-

throughput data. Nature Reviews Genetics 11: 733-739.

108. Zighelboim I, Goodfellow PJ, Gao F, Gibb RK, Powell MA, et al. (2007)

Microsatellite instability and epigenetic inactivation of MLH1 and

outcome of patients with endometrial carcinomas of the endometrioid

type. J Clin Oncol 25: 2042-2048.

109. Trimarchi MP, Murphy M, Frankhouser D, Rodriguez BA, Curfman J, et

al. (2012) Enrichment-based DNA methylation analysis using next-

generation sequencing: sample exclusion, estimating changes in global

methylation, and the contribution of replicate lanes. BMC Genomics 13

Suppl 8: S6.

Appendix A. Supplementary data for Chapter 3

Appendix A1 – QC table for endometrial cancer study

Excel 2007 format (.xlsx)

A listing of CpG enrichment, saturation, 5x coverage and read information for each sample lane in the endometrial dataset.

Appendix A2 – Replicate lane correlation, endometrial QC passed vs. QC failed samples

Excel 2007 format (.xlsx)

A table showing Pearson correlation of replicate lanes for samples that passed QC vs. failed QC. Data are presented both as a group summary and for individual samples.

Appendix A3 – QC table for ovarian study

Excel 2007 format (.xlsx)

A listing of CpG enrichment, saturation, 5x coverage and read information for each sample lane in the ovarian dataset.

Appendix A4 – QC, GMI, plasmid RPM table for AML study

Excel 2007 format (.xlsx)

A listing of CpG enrichment, saturation, 5x coverage, read information, global methylation indicator and plasmid reads per million for each sample lane in the AML dataset.

Appendix B. Supplementary data for Chapter 4

Appendix B1 – Cohort characteristics

Excel 2007 format (.xlsx)

Cohort characteristics for the endometrial training set.

Appendix B2 – Sequencing summary

Excel 2007 format (.xlsx)

A listing of sequencing read information for each sample in the endometrial training set.
