Identification of endometrial cancer methylation features using a combined
methylation analysis method
DISSERTATION
Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the Graduate School of The Ohio State University
By
Michael P Trimarchi B.S.
Biomedical Sciences Graduate Program
The Ohio State University
2016
Dissertation Committee:
Joanna L Groden, Advisor
Paul Goodfellow
Ralf A. Bundschuh
Jeffrey Parvin
Pearlly Yan
Copyrighted by
Michael P Trimarchi
2016
Abstract
Introduction: DNA methylation is a stable epigenetic mark that is frequently altered in tumors. DNA methylation marks are attractive biomarkers for disease states given the stability of DNA methylation in living cells and in biologic specimens. Widespread accumulation of methylation in regulatory elements in some cancers (termed the CpG island methylator phenotype, CIMP) can play an important role in tumorigenesis. High-resolution assessment of CIMP for the entire genome, however, remains cost prohibitive and requires quantities of DNA that are not available for many tissue samples of interest. Genome-wide scans of methylation have been undertaken for large numbers of tumors, and higher resolution analyses have been performed for a limited number of cancer specimens. Yet methods for analyzing these large datasets and integrating findings from different studies have not been fully developed. An approach was developed to profile CIMP by combining the strengths of two different methylome profiling techniques.
Methods: Methylomes for 76 primary endometrial cancer and 12 normal endometrial samples were generated using methylated fragment capture and second generation sequencing (MethylCap-seq). Publicly available data from The Cancer Genome Atlas (TCGA) for 203 endometrial cancers, analyzed using the Infinium HumanMethylation 450 beadchip, were compared to the MethylCap-seq data. A MethylCap-seq quality control module was
developed to exclude sequencing samples with poor-quality methylation data from analysis. Additional MethylCap-seq datasets were also used to develop and validate the quality control module.
Results: Analysis of total methylation in promoter CpG islands (CGIs) identified a subset of tumors with a methylator phenotype. I developed a 13-region methylation signature associated with a “hypermethylator state” using a training set of five highly methylated and eight lowly methylated tumors. The signature was validated using data from TCGA. High signature methylation score was associated with mismatch repair deficiency, high mutation rate, and low somatic copy number alteration in the TCGA test set. In addition, the methylation signature distinguished >90% of endometrioid endometrial tumors from normal controls in the test set. Furthermore, classification of tumors by signature methylation score proved highly robust, showing good agreement with previously published methylation clusters for the test set as well as consistent ranking of tumors across alternative signatures.
Conclusion: I identified a methylation signature for a “hypermethylator phenotype” in endometrial cancer and developed methods that could prove useful for identifying extreme methylation phenotypes in other cancers.
Acknowledgments
My sincere thanks to my committee and my former advisor Tim H.M. Huang for making this project possible, as well as the current and former members of the Yan and Groden labs. Special thanks to Paul Goodfellow for helping reorient the project after my former advisor left, and to Ralf Bundschuh for his guidance in data analysis.
Vita
June 2002 ...... North Olmsted High School
June 2006 ...... B.S. Microbiology, Minor in
Chemistry, The Ohio State University
June 2007 to present ...... Graduate Research Associate,
Biomedical Sciences Graduate
Program, The Ohio State University
Publications
2016 Michael P Trimarchi, Pearlly Yan, Joanna Groden, Ralf
Bundschuh, Paul Goodfellow. Identification of endometrial
cancer methylation features using a combined methylation
analysis method. Manuscript in preparation.
2012 Michael P Trimarchi, Mark Murphy, David Frankhouser,
Benjamin AT Rodriguez, John Curfman, Guido Marcucci, Pearlly
Yan, Ralf Bundschuh. Enrichment-based DNA methylation
analysis using next-generation sequencing: sample exclusion,
estimating changes in global methylation, and the contribution of
replicate lanes. BMC Genomics 2012, 13(Suppl 8):S6 (17
December 2012).
2012 Rodriguez B, Tam HH, Frankhouser D, Trimarchi M, Murphy M,
Kuo C, Parikh D, Ball B, Schwind S, Curfman J, Blum W,
Marcucci G, Yan P, Bundschuh R. A Scalable, Flexible
Workflow for MethylCap-Seq Data Analysis. BMC Genomics
2012, 13(Suppl 6):S14 (26 October 2012).
2012 Yan P, Frankhouser D, Murphy M, Tam HH, Rodriguez B,
Curfman J, Trimarchi M, Geyer S, Wu YZ, Whitman SP,
Metzeler K, Walker A, Klisovic R, Jacob S, Grever MR, Byrd JC,
Bloomfield CD, Garzon R, Blum W, Caligiuri MA, Bundschuh R,
Marcucci G. Genome-wide methylation profiling in decitabine-
treated patients with acute myeloid leukemia. Blood. 2012 Jul
11.
2011 Trimarchi MP, Mouangsavanh M, Huang TH. Cancer
epigenetics: a perspective on the role of DNA methylation in
acquired endocrine resistance. Chin J Cancer. 2011
Nov;30(11):749-56.
2010 Cottrell, C. E., Bir, N., Varga, E., Alvarez, C. E., Bouyain, S.,
Zernzach, R., Thrush, D. L., Evans, J., Trimarchi, M., Butter, E.
M., Cunningham, D., Gastier-Foster, J. M., McBride, K. L. and
Herman, G. E. Contactin 4 as an autism susceptibility locus.
Autism Res. 2011 Jun;4(3):189-99.
Fields of Study
Major Field: Biomedical Sciences
Specialization: Cancer Research
Interdisciplinary Specialization: Biomedical, Clinical & Translational Sciences
Table of Contents
Abstract ...... ii
Acknowledgments ...... iv
Vita ...... v
List of Figures ...... xii
List of Tables ...... xiv
Chapter 1. Background ...... 1
I. Epigenetics in cancer, with a focus on DNA methylation ...... 1
Chromatin model ...... 1
DNA methylation ...... 1
DNA methylation in cancer ...... 2
Histone modifications ...... 3
DNA methylation interaction with histone modifications ...... 4
DNA methylation as a biomarker and therapeutic target ...... 5
II. Methylome Profiling ...... 6
Analysis methods ...... 6
Methylome Profiling: The Biology ...... 13
Chapter 2. Thesis Rationale and Research Objectives ...... 19
Chapter 3. Enrichment-based DNA methylation analysis using next-generation sequencing: quality control, estimating changes in global methylation and the effects of increased sequencing depth ...... 21
I. Introduction ...... 21
II. Results and Discussion ...... 23
Quality control exclusion criteria reduce noise in methylation signal and
improve analytical power...... 23
The effect of additional sequencing lanes on quality control metrics ...... 27
The global methylation indicator (GMI) correlates inversely with an in vitro
methylated tracer sequence...... 31
III. Methods ...... 34
Patient samples ...... 34
Methylated-DNA capture (MethylCap-seq) ...... 35
MethylCap-seq experimental quality control and exclusion criteria ...... 36
Standard sequence file processing and alignment ...... 37
Standard global methylation analysis workflow ...... 37
Calculation of noise in methylation signal ...... 38
Calculation of the Global Methylation Indicator (GMI) ...... 39
Assessment of methylated fragment enrichment using an in vitro
methylated construct ...... 39
IV. Conclusions ...... 40
Chapter 4. Identification of endometrial cancer methylation features using a combined methylation analysis method ...... 41
I. Introduction ...... 41
II. Results ...... 44
Characterizing a CpG island methylator phenotype ...... 44
Methylation signature construction and technical validation ...... 50
Methylation signature stratifies endometrioid endometrial tumors by
methylation phenotype and distinguishes tumors from normal controls ...... 56
High methylation score is associated with mismatch repair deficiency, high
mutation rate, and low somatic copy number alteration ...... 64
Methodological validity ...... 68
A signature for CIMP ...... 70
III. Discussion ...... 70
IV. Conclusion ...... 75
V. Methods ...... 76
Patient samples ...... 76
MethylCap-seq quality control ...... 76
MethylCap-seq data analysis ...... 77
Infinium validation of methylation signature candidates ...... 77
Computation of methylation score using the 13 promoter CGI signature ...... 78
In silico analysis of TCGA endometrioid endometrial tumors ...... 78
Replicate signature analysis ...... 78
Chapter 5. Summary and Discussion ...... 80
I. Summary ...... 80
II. Discussion ...... 83
References ...... 86
Appendix A. Supplementary data for Chapter 3 ...... 101
Appendix A1 – QC table for endometrial cancer study ...... 101
Appendix A2 – Replicate lane correlation, endometrial QC passed vs. QC
failed samples ...... 101
Appendix A3 – QC table for ovarian study ...... 101
Appendix A4 – QC, GMI, plasmid RPM table for AML study ...... 101
Appendix B. Supplementary data for Chapter 4 ...... 103
Appendix B1 - Cohort characteristics ...... 103
Appendix B2 - Sequencing summary ...... 103
List of Figures
Figure 1. QC exclusion criteria reduce noise in methylation signals...... 26
Figure 2. Replicate sequencing lanes for MethylCap-seq experiments correlate highly...... 30
Figure 3. Additional lanes of sequencing data moderately increase saturation but greatly increase 5X CpG coverage...... 31
Figure 4. Global methylation indicator scales inversely with read counts from a "spiked" in vitro methylated construct...... 34
Figure 5. Endometrioid endometrial malignancies show increased methylation in promoter CGI compared to unmatched normal controls...... 46
Figure 6. Loci methylated in CGI-H tumors account for many of the cancer-associated methylation gains...... 49
Figure 7. Methylation of 13 promoter-associated CGI distinguishes tumors with high promoter CGI methylation (CGI-H) from those with baseline promoter CGI methylation (CGI-0)...... 53
Figure 8. Methylation signature stratifies endometrioid endometrial tumors by methylation phenotype and distinguishes tumors from normal controls...... 58
Figure 9. Aggregate promoter CGI methylation shows weaker stratification potential compared to the 13-region methylation signature...... 61
Figure 10. Methylation of EPHX3 is associated with decreased gene expression...... 64
Figure 11. High methylation score is associated with mismatch repair deficiency, high mutation rate, and low somatic copy number alteration...... 67
Figure 12. Replicate 13-region methylation signatures rank tumors similarly...... 69
List of Tables
Table 1. Differentially Methylated Regions, Endometrial Tumors vs.
Nonmalignant Endometrial Tissue ...... 27
Table 2. Term enrichment associated with hypermethylated promoter CGI
(MSigDB Perturbation) ...... 50
Table 3. Promoter CGI that distinguish CGI-H from CGI-0 tumors as measured by MethylCap-seq ...... 52
Table 4. Promoter CGI included in the final signature after validation by
Infinium ...... 55
Table 5. Promoter CGI discarded from the final signature after validation by
Infinium ...... 56
Table 6. Correlation between promoter CGI methylation and gene expression in TCGA tumors ...... 63
Chapter 1. Background
I. Epigenetics in cancer, with a focus on DNA methylation
Chromatin model
DNA in cells does not exist freely, but associates with proteins to form a complex termed chromatin. According to the “beads on a string” model, DNA (the string) in cells is wound around nucleosomes (the beads), which in turn are composed of histone proteins. The four core histone subunits (H2A, H2B, H3, H4) form heterodimers that complex together as an octamer to form the nucleosome core. The nucleosomes in turn are connected to a scaffolding by linker histones (H1, H5), contributing to higher order chromatin structure.
Changes in patterns of DNA methylation and histone modification alter the conformation of chromatin and accessibility of the DNA to transcriptional machinery, either by directly introducing steric hindrance or indirectly causing surrounding chromatin to adopt an “open” or “closed” conformation. In addition, epigenetic modifications may cause individual nucleosomes to shift laterally across the DNA, exposing some areas of DNA for transcription and covering others [1].
DNA methylation
DNA methylation is a covalent modification that occurs at cytosine nucleotides, in particular cytosines that precede a guanine (CpGs) [2]. The
process is catalyzed by DNA methyltransferases (DNMTs), which transfer the methyl group from S-adenosylmethionine to the target cytosine. Two families of DNMTs have been identified: DNMT1, which predominantly functions in maintenance of DNA methylation during DNA replication, and DNMT3, which is thought to be primarily responsible for de novo CpG methylation [3]. CpGs are strikingly rare in the genome compared to what would be expected from probabilistic estimates, and outside of transcribed regions CpGs are generally methylated. Areas of high CpG content, termed CpG islands, are found in approximately 40% of mammalian promoters, and unlike CpGs in the rest of the genome are usually unmethylated [4]. The methylation state of CpG islands in promoters can be an important factor controlling gene expression, with heavy methylation blocking gene transcription and sparse methylation permitting it. In addition, evidence suggests that heavy methylation throughout a region of chromatin can mediate long-range silencing that extends even to adjacent unmethylated genes [5].
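The rarity of CpGs relative to expectation can be made concrete with the observed/expected ratio that is widely used when defining CpG islands. The sketch below is illustrative only and is not taken from this work: the function name is hypothetical, and it assumes the common formulation in which the expected CpG count is the product of the C and G counts divided by the sequence length.

```python
def cpg_observed_expected(seq: str) -> float:
    """Ratio of observed CpG dinucleotides to the count expected from base composition.

    Assumes the common formulation: expected = (#C * #G) / sequence length.
    """
    seq = seq.upper()
    n = len(seq)
    c_count = seq.count("C")
    g_count = seq.count("G")
    if n == 0 or c_count == 0 or g_count == 0:
        return 0.0
    # "CG" occurrences cannot overlap one another, so str.count is exact here.
    observed = seq.count("CG")
    expected = c_count * g_count / n
    return observed / expected


if __name__ == "__main__":
    # A CpG-dense toy sequence versus one with the same base composition
    # but only a single CpG.
    print(cpg_observed_expected("CGCGCGCG"))
    print(cpg_observed_expected("CCCCGGGG"))
```

A ratio well below 1 indicates CpG depletion, as seen across most of the genome; CpG islands are the regions where the ratio approaches the value expected from base composition.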
DNA methylation in cancer
Precisely controlled DNA methylation is important for imprinting (allele-specific expression of some genes) and cell differentiation in normal cells. In cancer cells, aberrant patterns of DNA methylation are frequently observed.
In general, cancer cells feature global hypomethylation and focal promoter hypermethylation. Global hypomethylation has been tied to genomic instability, loss of imprinting, and activation of oncogenes [4,6]. Promoter hypermethylation, namely within CpG islands, leads to epigenetic silencing of
the target gene. In contrast to the rest of the genome, tumor suppressor promoters are frequently hypermethylated in tumors, suggesting that epigenetic silencing of tumor suppressors can be functionally equivalent to loss-of-function mutations [7]. DNA methylation perturbation has also been linked to resistance to therapeutics [8-10]. While epigenetic changes and genetic mutations may be mutually sufficient in some cases, mutations can also perturb the epigenome, and epigenetic changes can lead to increased mutation rate [11].
Histone modifications
Histones also represent a target for epigenetic regulation, and H2A,
H2B, H3, and H4 are all known to be post-translationally modified. These modifications generally occur on the N-terminal tails which protrude from the nucleosome, though H2A, H2B, and H3 are also known to be modified in the core globular domains. A wide range of modifications have been observed, including methylation, acetylation, phosphorylation, ubiquitination,
SUMOylation, and ADP-ribosylation, with methylation (lysine and arginine residues) and acetylation (lysine residues) being the most studied [12]. The potentially limitless combinations of modifications and the clear role of histone modification in epigenetic regulation have inspired the concept of a “histone code”, a pattern of modifications that directs cellular development, selectively filtering the more expansive genetic code [13]. While the concept of a histone code remains controversial, several general themes have emerged. Histone acetylation is generally associated with opening of the chromatin and an activating effect on gene expression, with deacetylation having opposite
effects and being responsible for gene silencing. Histone methylation can be either activating or repressive, depending on the residue, the number of methyl groups on the residue (ranging from 0 to 3), and the presence of other histone modifications. While the effect of histone acetylation could potentially be explained by electrostatic interactions (as acetylation decreases the affinity of histones for DNA and thus might be hypothesized to “push” the chromatin apart), the variable effects of histone methylation suggest other enzyme intermediates. Indeed, the enzymes responsible for modifying histones (e.g., histone acetylases, deacetylases, and methyltransferases) are themselves often found complexed with other proteins (e.g., the Polycomb repressor complex (PRC)) which may help mediate downstream effects [14].
DNA methylation interaction with histone modifications
Changes in patterns of histone modifications are associated with promoter CpG island DNA hypermethylation, which in turn is associated with the epigenetic silencing of tumor suppressor genes commonly found in various cancers. Recent evidence shows that histone methylation, depending on the residues involved, may either recruit (H3K27, H3K9) or prevent recruitment (H3K4) of DNMT3. Recruitment of DNMT3 subsequently results in DNA methylation and transcriptional downregulation of the target gene, providing a link between the two pathways of epigenetic control. This model would explain why CpG island hypermethylation in tumors is often correlated with increased H3K27 and H3K9 methylation, and decreased H3K4 methylation. Another model suggests that repressive histone modifications and DNA methylation may be mutually sufficient in some instances, such that
the presence of one can compensate for the other. For example, Polycomb group proteins (e.g., EZH2) associated with repressive histone modifications can recruit DNMT3, leading to hypermethylation of the DNA, loss of the repressive histone modifications, and maintenance of silencing. In addition, the enzymes that mediate and maintain epigenetic control (e.g., Polycomb group proteins) are often dysregulated in cancer cells, showing a mechanism by which cancer cells can manipulate gene expression on a more global scale [15].
DNA methylation as a biomarker and therapeutic target
The study of cancer epigenetics has great potential to further the diagnosis and treatment of cancer. Promoter DNA hypermethylation is common in tumors, and could potentially serve as an early diagnostic marker.
In prostate cancer, the gene GSTP1 is found to be methylated in 90% of clinical malignancies, but not benign hyperplasias, and can be detected in a variety of bodily fluids, including blood plasma [16]. Discovery of similar genes for breast cancer could potentially lead to more effective early detection strategies. In addition, DNA methylome profiles and histone modification maps could potentially serve as prognostic tools once a tumor is identified.
As expression levels of certain genes are correlated with response to treatment, and epigenetic factors exert a powerful and durable effect on gene expression, such whole genome methods have the potential to become powerful tools for personalized treatment.
Elucidation of epigenetic mechanisms may lead to the development of new treatments that specifically target epigenetic abnormalities or
vulnerabilities in cancer cells. Examples of two such classes of drugs include
DNA demethylating agents and histone deacetylase inhibitors, which have shown promise in the treatment of leukemia and T-cell lymphoma, respectively [4]. Broadly demethylating or acetylating the genome may have adverse consequences, as use of such agents has been shown to promote metastasis and resistance to other therapeutics in some contexts [17,18]. In addition, promoter methylation of genes like MGMT may make some tumors more susceptible to specific therapies, and demethylating these regions could be counterproductive [19]. Greater understanding of epigenetic pathways should enable development of more targeted therapeutics with fewer off-target effects.
II. Methylome Profiling
Affordable genome-wide approaches for assessing methylation have opened the way for methylome profiling of human cohorts. As sequencing costs continue to plummet, the limiting factor in genome-wide studies will no longer be the costs of sequencing, but the costs of analyzing the enormous datasets that result. While next-generation sequencing is becoming increasingly standardized, analysis is a complex task and an ongoing area of research. This section will summarize analytical methodology involved in methylome profiling, then discuss recent findings that are shaping methylome analysis.
Analysis methods
This section provides a conceptual overview of methylome analysis that begins with a discussion of methylation detection methods, and then
focuses on several topics specific to MethylCap-seq: relative vs. absolute methylation, normalization, and quantification of MethylCap-seq data. A summary of notable literature utilizing MethylCap-seq is also presented. For additional background, see two excellent reviews on methylome profiling methods and analysis [20,21].
Detection methods
Because methylation marks are erased by typical amplification techniques (e.g., PCR), DNA must be specially processed prior to amplification and sequencing. One of three methods is typically used to detect DNA methylation: methylation-sensitive restriction enzyme digestion, sodium bisulfite treatment, or methylation-specific fragment capture (also referred to as affinity enrichment) [22]. Processed DNA is then incorporated into libraries for sequencing. Raw sequencing data, typically consisting of millions of short reads, is then aligned to the appropriate genome, providing a
DNA methylation map of the sample. The methylation detection methods yield different types of signal, with corresponding advantages and disadvantages [21,23-25]. Sodium bisulfite treatment converts unmethylated cytosines to uracils, which are read as thymines after amplification, while methylated cytosines remain unconverted (assuming complete conversion). Overlapping DNA fragments are then compared to determine the methylation percentage of particular CpGs (a measure of absolute methylation), which is comparable across the genome and between samples. However, this single-base resolution has a significant tradeoff: profiling every CpG in the human genome requires as many as a billion aligned reads, which is presently infeasible for most clinical studies.
Instead, reads are typically focused in regions of interest, such as by selecting for CpG dense regions (e.g., reduced representation bisulfite sequencing).
Thus breadth of genomic coverage is sacrificed for single-base resolution.
Affinity enrichment-based methods (e.g., MethylCap-seq) capture methylated fragments from the sample, with the resolution determined by the fragment size (typically ~130-180bp in our studies). Because the fragments were captured, we know that they are methylated relative to the rest of the sample.
However, absence of methylation must be indirectly inferred from absence of signal, which can be problematic when coverage depth is insufficient (in which case absence of signal is not equivalent to absence of methylation). Signal must be normalized to account for variable sequencing lane yield; this is accomplished by dividing aligned read counts in a region by the total aligned read count. The result is a relative methylation profile of a sample. These relative methylation profiles can be compared between samples (if the assumption of comparable overall methylation between samples is valid—this assumption is usually not tested directly), and across the genome within the sample if further normalization for CpG density is performed. Confusing the matter, multiple published studies claim to convert relative methylation to absolute methylation values using CpG normalization [23,26]; while this may produce results that correlate on average with bisulfite-based methods, it is doubtful that such methods could account for systematic differences in global methylation between samples or groups of samples. In theory, an absolute methylation profile could be reconstructed by pairing sequencing data with a scaling factor, a measure of the overall methylation in a sample relative to
other samples. Enrichment-based methods, despite their lower resolution vs. bisulfite-based methods, can achieve much greater coverage of CpGs in the genome with the same sequencing depth [23].
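The sequencing yield normalization and the proposed scaling factor described above can be written compactly. The notation is introduced here for illustration only and does not appear in the original text: c_{s,r} is the aligned read count for sample s in region r, N_s is the total aligned read count for sample s, and g_s is a hypothetical per-sample scaling factor reflecting overall methylation relative to other samples.

```latex
\[
  m^{\mathrm{rel}}_{s,r} = \frac{c_{s,r}}{N_s},
  \qquad
  m^{\mathrm{abs}}_{s,r} \approx g_s \, m^{\mathrm{rel}}_{s,r}
\]
```

As a worked example under these definitions, a region with 40 aligned reads in a sample with 20 million aligned reads has a relative signal of 40 / 20,000,000 = 2 × 10^-6, or 2 reads per million. If every sample is assigned g_s = 1, the comparison between samples rests entirely on the assumption of comparable overall methylation.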
Normalization and quantification of MethylCap-seq signal
Normalization and quantification of methylation signal can be envisioned as a 5-step process. Fragment estimation and sequencing yield normalization are essential elements of the process. Count aggregation and exclusion of duplicate reads are recommended, while CpG density normalization is optional. Fragment estimation involves extending the short reads from the sequencer to reconstruct the original captured fragments.
Count aggregation aggregates the reads in genomic bins of set size, which smooths the data and condenses it. As methylation of adjacent CpGs is highly correlated and resolution of enrichment-based methods is inherently limited, count aggregation is a simple way to package the data with minimal data loss. Sequencing yield normalization normalizes signal to account for arbitrary sequencing lane yield; this is accomplished by dividing aligned read counts in a region by the total aligned read count. Duplicate exclusion attempts to correct for PCR artifacts introduced by genome amplification, where certain fragments are replicated more frequently relative to other fragments, sometimes by orders of magnitude. Under the assumption that sequencing depth is low compared to the diversity of fragments used to generate the sequencing library, when multiple reads share exactly the same sequence all but one are assumed to be PCR artifacts, and these duplicate reads are excluded from analysis. Duplicate exclusion can
potentially result in analysis artifacts, as regions with higher CpG density or higher methylation may be more prone to random duplicates representing identical fragments from different cells (as opposed to PCR artifacts).
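The first four steps described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the workflow used in this project: the `Read` type and function names are hypothetical, duplicates are identified by identical alignment (chromosome, start, length, strand) as a proxy for identical sequence, the total used for yield normalization is the post-deduplication read count, and a fragment contributes one count to every bin it overlaps. CpG density normalization is omitted.

```python
from collections import Counter
from typing import Dict, Iterable, NamedTuple, Tuple


class Read(NamedTuple):
    chrom: str
    start: int   # 0-based leftmost aligned position
    length: int  # aligned read length in bp
    strand: str  # "+" or "-"


def extend_to_fragment(read: Read, fragment_size: int) -> Tuple[int, int]:
    """Fragment estimation: extend a read from its 5' end to the assumed
    fragment size. Returns a half-open interval (start, end)."""
    if read.strand == "+":
        return read.start, read.start + fragment_size
    # For a minus-strand read the 5' end is the rightmost aligned base.
    end = read.start + read.length
    return max(0, end - fragment_size), end


def binned_signal(
    reads: Iterable[Read],
    fragment_size: int = 150,    # illustrative default
    bin_size: int = 500,         # illustrative default
    scale: float = 1_000_000.0,  # reads-per-million scaling (assumption)
) -> Dict[Tuple[str, int], float]:
    """Return normalized signal keyed by (chrom, bin_start)."""
    # Duplicate exclusion: keep one read per identical alignment.
    unique_reads = set(reads)
    total = len(unique_reads)

    # Count aggregation: each fragment adds one count to every bin it overlaps.
    counts: Counter = Counter()
    for read in unique_reads:
        start, end = extend_to_fragment(read, fragment_size)
        for b in range(start // bin_size, (end - 1) // bin_size + 1):
            counts[(read.chrom, b * bin_size)] += 1

    # Sequencing yield normalization: divide by the total aligned read count.
    return {key: n * scale / total for key, n in counts.items()}


if __name__ == "__main__":
    toy_reads = [
        Read("chr1", 100, 36, "+"),
        Read("chr1", 100, 36, "+"),  # exact duplicate, excluded
        Read("chr1", 900, 36, "-"),
        Read("chr1", 450, 36, "+"),  # fragment spans two bins
    ]
    for key, value in sorted(binned_signal(toy_reads, scale=1.0).items()):
        print(key, round(value, 3))
```

Deduplication is performed first in this sketch so that the normalizing total is not inflated by PCR duplicates; whether to normalize by the pre- or post-deduplication total is a design choice that should be applied consistently across samples.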
Tools for MethylCap-seq data processing
Two popular tools for processing of enrichment-based data are Batman
[27] and MEDIPS [26]. After working briefly with MEDIPS, we developed our own custom workflow for enrichment-based sequencing data to enable greater flexibility in our analyses [28].
Batman was developed to address the poor agreement between enrichment-based methods and bisulfite-based methods—especially problematic since bisulfite-based methods are often used to validate the findings of enrichment-based methods. The authors use a Bayesian deconvolution strategy to account for differences in CpG density that cause genomic regions with high CpG density (and thus more opportunities for methylation) to be preferentially enriched, a factor that does not affect bisulfite-based methods. Fragment estimation assumes a 500bp fragment size, and count aggregation uses 100bp bins. The tool is implemented in
Java. While the tool greatly improves the agreement between enrichment-based and bisulfite-based data, the Bayesian deconvolution model is computationally intensive. In addition, the tool is optimized for use with microarray data, and sequencing data requires substantial preprocessing before it can be input into Batman.
MEDIPS similarly applies CpG density normalization to improve agreement with bisulfite-based methods. The authors use coupling factors
rather than a Bayesian deconvolution model, simplifying calculations and improving computation speed vs. Batman [26]. Fragment estimation and count aggregation bin sizes are customizable; default bin size is 50bp.
MEDIPS is implemented in R, and use requires some familiarity with R.
MEDIPS includes a methylProfiling function to compare two samples for differentially methylated regions (DMRs). Unfortunately, in most scenarios researchers wish to compare two or more groups of samples, and thus the feature has limited usefulness. While MEDIPS is functional, the tool is still slow and RAM intensive, and we found it difficult to use for larger studies involving many samples.
The workflow used for this project skips the CpG density normalization step in favor of simplicity and transparency. Fragment estimation and count aggregation are customizable; I used a 150bp fragment size and 500bp bins for our dataset of endometrial tumors profiled using MethylCap-seq.
Implemented in a combination of bash and C++, the workflow is optimized for parallel processing in a supercomputing environment, and outputs a binary count file that is used for downstream analysis.
Other published methods
Harris et al. compared enrichment-based and bisulfite-based methylome profiling methods [24]. For enrichment-based methods, fragment estimation assumed a 150bp fragment size, duplicate reads were excluded, counts were aggregated in 200bp or 1000bp windows, and CpG density normalization was performed by iteratively comparing CpG contribution per estimated fragment with the regional fragment distribution. For comparison
between methods, methylation was binarized by setting tag density or beta-value thresholds (0.2 for the bisulfite-based methods). To avoid overlap during region classification, regions were assigned to UCSC features in a set priority order (promoter > exon > UTR > intron > intergenic).
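The two rules described above, thresholded binarization and priority-ordered feature assignment, can be illustrated with a short sketch. The function names are hypothetical, and details not stated in the text are assumptions: a region is called methylated when its value is strictly greater than the threshold, and a region overlapping no annotated feature is labeled intergenic.

```python
from typing import Iterable

# Priority order as described in the text (highest priority first).
FEATURE_PRIORITY = ["promoter", "exon", "UTR", "intron", "intergenic"]


def classify_region(overlapping_features: Iterable[str]) -> str:
    """Assign a region to the single highest-priority feature it overlaps."""
    overlapping = set(overlapping_features)
    for feature in FEATURE_PRIORITY:
        if feature in overlapping:
            return feature
    # Assumption: a region with no annotated overlap is treated as intergenic.
    return "intergenic"


def binarize(value: float, threshold: float = 0.2) -> bool:
    """Call a region methylated when its value exceeds the threshold.

    The 0.2 default mirrors the beta-value threshold quoted for the
    bisulfite-based methods; the strict inequality is an assumption.
    """
    return value > threshold


if __name__ == "__main__":
    print(classify_region({"intron", "exon"}))  # exon outranks intron
    print(binarize(0.25), binarize(0.10))
```

Fixing the priority order guarantees that each region is counted exactly once, at the cost of under-representing lower-priority features wherever annotations overlap.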
Bock et al. reported a method for integrated analysis of enrichment-based data alongside bisulfite-based data [23]. For enrichment-based methods, duplicate reads were excluded, counts were not aggregated, and linear regression models were used to normalize for CpG density. They observed that combining lanes of data for the same sample improved correlation with bisulfite-based data. Statistical testing for differentially methylated regions was performed using Fisher’s exact test.
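To show how Fisher’s exact test applies to a single candidate region, the sketch below computes a two-sided p-value for a 2x2 table using only the Python standard library. How Bock et al. constructed their contingency tables is not described here, so the layout is an assumption: rows are two samples, and columns are reads falling inside versus outside the region. The function name is hypothetical, and the implementation is suited to small illustrative counts rather than genome-scale read totals.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact test p-value for the table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same
    margins that are no more probable than the observed table.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        # Probability of x in the top-left cell given fixed margins.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lo = max(0, col1 - row2)
    hi = min(row1, col1)
    p_observed = prob(a)
    # Small relative tolerance guards against floating-point ties.
    p_value = sum(
        p for p in (prob(x) for x in range(lo, hi + 1))
        if p <= p_observed * (1 + 1e-9)
    )
    return min(1.0, p_value)


if __name__ == "__main__":
    # Assumed layout: sample A has 8 reads in the region and 2 elsewhere;
    # sample B has 1 read in the region and 5 elsewhere (toy numbers).
    print(fisher_exact_two_sided(8, 2, 1, 5))
```

In a genome-wide scan the test would be repeated for every region, so the resulting p-values would require correction for multiple testing.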
Lan et al. applied a bi-asymmetric-Laplace model to disentangle signal from neighboring CpGs and thus output a methylation score for each CpG
[29]. Notably, they eschewed fragment estimation in favor of a peak-finding approach (commonly used for ChIP-seq analyses), with the basic assumption that an adequate pile of fragments surrounding a methylated site provides sufficient information to attribute signal to individual CpGs. The approach claims to be able to increase resolution beyond the natural limits of fragment size (typically ~150bp) to less than 50bp, and the same approach was applied in a subsequent research article [30].
MethylCap-seq literature
Sequencing-based profiling of the methylome began in earnest around 2008, with a subset of those studies using a MethylCap-seq approach. This section
summarizes the primary literature utilizing MethylCap-seq by category: methods article, methods comparison, research article.
Methods articles
[27,29,31,32]
Methods comparison
[23-25,33]
Research articles
[30,34-36] [37-50]
Methylome Profiling: The Biology
This section will focus on biological findings resulting from and influencing methylome profiling. As the success of computational investigation hinges on knowing where to look and what to look for, this section will particularly emphasize genomic regions of interest.
Promoter CGI methylation
Methylation within promoter CGIs is well established as a mechanism of epigenetic silencing, and is usually the first genomic feature examined in any methylome profiling study. Many array-based methods, including early variants of the Infinium HumanMethylation bead array, only interrogate CpGs within promoter CGI. CGI are rarely methylated in the genome, even more so if they lie within promoters. Methylation of promoter CGI is thought to mediate silencing by suppressing the binding of transcription factors necessary for transcription initiation, most likely by recruiting suppressive factors, typically complexed with methyl-CpG-binding proteins, that provide a direct steric
hindrance or maintain the DNA in a closed conformation that is inaccessible to transcription factor complexes.
CGI shore methylation
Andrew Feinberg’s group proposed that methylation changes associated with changes in gene expression tended to occur not in CGIs or promoters, but in flanking regions. This is consistent with so-called methylation spread theory
[51], which postulates that methylation changes begin at foci adjacent to a gene promoter and gradually spread into the promoter, with transcriptional potential dropping off early in this process. Using the CHARM assay, which pairs methylation-sensitive restriction enzyme digestion with a tiling array, they showed that CGI shores (defined as flanking regions 200-2000bp distant from a CGI) were the primary hotspots for epigenetic change in both colon cancer and hematopoietic progenitor differentiation [52,53]. They further showed that cancer-associated methylation accumulated in shores associated with tissue-specific methylation, also consistent with methylation spread theory. The role of CpG shores has since been independently confirmed in other contexts, including induced pluripotent stem cells [54], acute myeloid leukemia [55], and bladder cancer [56].
First exon methylation
Early methylome profiling showed that first exons were frequently unmethylated together with CGI and promoters, suggesting a possible role in gene regulation [57]. This potential was confirmed in a recent study by Brenet et al., which compared first exon and promoter methylation and showed that
first exon methylation was more tightly coupled to expression in an acute myeloid leukemia cell line [37].
Gene body methylation
While first exons are typically unmethylated, gene bodies are frequently methylated [57,58]. Multiple studies have reported a direct correlation between gene body methylation and gene expression [41,59,60]. Particularly intriguing is that exon/intron borders often show sharp transitions in methylation, with exons heavily methylated and introns less methylated.
While the role of gene body methylation remains unclear, the distinct pattern of methylation in exons and introns suggests that methylation in tandem with nucleosome positioning may play a role in alternative gene splicing.
Intragenic CGI methylation
While the role of promoter CGI methylation in gene silencing is well established, the role of intragenic CGI methylation is less clear. Using a method similar to MethylCap-seq, Adrian Bird’s group found that intragenic
CGI were frequently differentially methylated compared to promoter CGI and promoter CGI shores in differentiated mouse hematopoietic cells [42].
Unexpectedly, intragenic CGI methylation was associated with gene silencing.
It is not clear whether this finding translates to human cells and other contexts; a previous study by their group found that intragenic CGI were differentially methylated with similar frequency in mouse and human cells
(embryonic stem cells vs. whole blood), although intragenic CGI in colon cancer cells were not differentially methylated compared to normal tissue [49].
Partially Methylated Domains
Lister et al. introduced the concept of partially methylated domains
(PMDs): large (median size ~150kb) genomic regions that showed reduced methylation and expression compared to surrounding regions [61]. Curiously, these PMDs were seen in differentiated cell lines but not embryonic stem cell lines. Reduced methylation seen in PMDs was restored when the cells were reprogrammed via induced pluripotency, suggesting a role for PMDs in cell differentiation [62].
Laird’s group subsequently uncovered a unique pattern of aberrant methylation in colon tumors: focal CGI methylation occurring within extended regions of reduced methylation compared to surrounding regions [63]. These regions of reduced methylation overlapped significantly with Lister’s PMDs and previously reported multi-gene domains of long range epigenetic silencing
(LRES) in prostate cancer [5]. Significantly, methylation gains in tumors vs. normal tissue were more likely to occur in PMDs, and constitutive methylation was less likely to occur in PMDs. PMDs were seen in colon tumors and immortalized somatic cell lines but not in adjacent normal tissue; however, only the colon tumors showed focal CGI hypermethylation within the PMDs.
PMDs corresponded to nuclear-lamina-associated domains (LADs) seen in a fibroblast cell line, suggesting a general biological concept behind this striking methylation phenotype.
Feinberg's group uncovered large blocks of contiguous hypomethylation in multiple colon tumors relative to normal tissue [64].
Notably, methylation loss was predominantly in regions of high methylation
(~75%) in normal tissue, to intermediate levels in tumors (~55%). These
blocks overlapped highly with Lister's PMDs. Most genes in the blocks were silenced in both tumors and normal tissue; however, genes from the blocks that were expressed in normal tissue were at increased risk for silencing in tumors. The reverse was also true: block genes silenced in normal tissue were often re-expressed in the tumors, with considerable variation between tumors.
Taken together, one could speculate that PMDs represent potential hotspots for epigenetic change: zones where tumor suppressors are particularly vulnerable to silencing and oncogenes can be reactivated. The publications listed here all used whole genome bisulfite sequencing; it remains to be seen whether the findings can be recapitulated in large cohorts using less costly methods.
Methylation and transcription factor binding sites
While regulation of gene expression via promoter methylation is generally thought to be mediated through its effects on transcription factor binding, the role of DNA methylation at distal enhancers is not well characterized. Using whole genome bisulfite sequencing data from a mouse embryonic stem cell and a neural progenitor cell line, Stadler et al. characterized short stretches of low-intermediate methylation (10-50%), which they termed LMRs [65]. These LMRs tended to be distal from promoters and bore chromatin marks of enhancers. Furthermore, the authors demonstrated that the characteristic methylation state of LMRs was established by transcription factor binding.
More broadly, DNase I hypersensitivity, chromatin marks (e.g.,
H3K4me1 and H3K27ac), and binding by the histone acetyltransferase p300 are known to be associated with enhancers [66]. An ENCODE track provides compiled transcription factor binding sites and DNase I hypersensitivity regions from aggregate cell lines. The degree of context dependency of these data, and thus their applicability to studies in other tissues and malignancies, remains unclear. When examining profound epigenetic alterations such as those expected in cancer, comparisons with enhancer and transcription factor data from the same samples or same tissues would likely yield more robust results.
Exon methylation and regulation of alternative splicing
While several genomic profiling studies revealed sharp methylation changes defining exon and intron boundaries (see the discussion of gene body methylation), the effects of such methylation and a mechanism for these effects remained elusive. A recent study by Shukla et al. revealed a role for exon methylation in alternative splicing [67]. Exon methylation inhibited CTCF binding in human B-cell lines, decreasing incorporation of the exons into RNA transcripts. Exon incorporation in turn was mediated by RNA polymerase pausing during transcript elongation. In addition to DNA methylation, alternative splicing may be marked by other epigenetic factors including
H3K36me3 [68].
Chapter 2. Thesis Rationale and Research Objectives
DNA methylation is a stable epigenetic mark often perturbed during carcinogenesis [4]. Growing evidence demonstrates a role for DNA methylation both as a regulator of gene expression and as a potential biomarker [69,70]. Widespread accumulation of methylation in regulatory elements in specific cancer types, termed the CpG island methylator phenotype (CIMP), may even direct carcinogenesis [71,72]. Few studies, however, have examined the CpG island methylator phenotype genome-wide in a clinical cohort, as until recently analyzing methylation genome-wide in a large number of samples was cost-prohibitive. As a result, the methodology for analyzing these large datasets remains poorly developed. The Infinium
HumanMethylation beadchip is a popular tool for sampling methylation at many loci across the genome [73], but the suitability of this tool for defining
CIMP de novo in an uncharacterized cancer type (as opposed to verifying
CIMP) has not been thoroughly evaluated. Studies show general but not perfect agreement between Infinium and other methylome profiling methods in tumors and normal tissue [23,25], but agreement has not yet been evaluated in the context of the profound methylation dysregulation associated with
CIMP. A clear need exists for a study that validates Infinium-based CIMP discovery using an alternative methylome profiling method.
To address this problem, I analyzed methylation in endometrioid endometrial cancers and unmatched normal endometria [74]. These data were generated using MethylCap-seq. Initial analyses suggested deficiencies in library preparation in some samples that could compromise the validity of methylation calls, leading me to develop a quality control module to identify and control for these issues. I also analyzed the Infinium methylation data from an endometrioid endometrial cohort from The Cancer Genome Atlas
Consortium, which later published a cluster analysis of the Infinium data [75].
I set out to identify CIMP using endometrioid endometrial cancer as the model cancer and MethylCap-seq as the methylome-profiling method. The goals of this project were: 1) develop and validate a MethylCap-seq quality control module to identify and exclude samples with spurious methylation data; 2) develop a method to identify putative CIMP tumors using the
MethylCap-seq platform; 3) describe the differences between putative CIMP tumors using the Infinium platform; 4) compare agreement in CIMP classification between my method based on MethylCap-seq and the published clustering method based on Infinium; and 5) demonstrate a signature that could classify CIMP tumors prospectively without a full methylome profiling experiment. Chapter 3 addresses goal 1 as well as the general methodology developed for these analyses. Chapter 4 addresses goals 2 through 5 and presents data that were collected along the way. This thesis tests the hypothesis that a combined methylation analysis method using MethylCap-seq and Infinium can be used to define CIMP de novo in endometrial cancer.
Chapter 3. Enrichment-based DNA methylation analysis using next-generation sequencing: quality control, estimating changes in global methylation and the effects of increased sequencing depth.
I. Introduction
Epigenetic mechanisms are responsible for determining and maintaining cell fate, stably differentiating various tissues in the human body.
Epigenetic changes can induce chromatin remodeling, leading to lasting effects on gene expression. Epigenetic aberrations, including perturbed DNA methylation, have been implicated in a diverse array of cancer-associated pathways, including silencing of tumor suppressors [4], activation of oncogenes [6], promotion of metastasis [17] and resistance to therapeutic drugs [8-10]. DNA methylation is thought to be a secondary or tertiary effector in the epigenetic cascade [76], with its accumulation in promoter CpG islands leading to durable gene silencing [58]. Due to its biological and experimental stability as an epigenetic marker, DNA methylation is a potential biomarker for disease, especially cancer [70]. While methylation marks are erased by standard PCR, other methods preserve this information, including affinity-based capture of methylated regions and bisulfite conversion [21].
Low-cost genome sequencing is revolutionizing the field of DNA methylation analysis. New techniques are enabling deep methylome profiling of biological samples on an unprecedented scale [21]. These sequencing projects can yield several GB of data per sample, presenting significant
bioinformatic challenges. While standardized pipelines have been developed for the most common elements of analysis such as sequencing data processing and alignment, methylome analysis often relies on custom methods specific to the methylation assay. To address quality control concerns and to determine required sequencing depth, analytic methods must not only be assay-specific but must also be tailored to the specific experiment
[20,24].
MethylCap-seq is a method for genome-wide profiling of DNA methylation that relies on enrichment of methylated DNA fragments using methyl-CpG binding domain protein 2 (MBD2) [77]. These fragments are then shotgun-sequenced, yielding a map of DNA methylation across the genome.
Advantages of MethylCap-seq compared to other platforms include cost-effective profiling of CpG dense regions and true whole genome coverage
[25].
Optimization of MethylCap-seq data collection and analysis requires attention to data quality and saturation. Proper quality control ensures that methylation calls are not impacted by inconsistencies in methylation enrichment. Greater saturation increases statistical confidence in methylation calls, especially in sparsely methylated regions [78].
A challenge currently facing enrichment-based methods that rely on sequencing, such as MethylCap-seq, is estimating changes in global methylation. Information on absolute methylation levels is erased during sequencing normalization, which is necessary to control for day-to-day
variations in sequencing output. Thus, this information must be recaptured using other computational methods or assays.
In this study, we provide evidence-based guidelines for quality control, illustrate the effects of additional reads on quality control metrics and provide empirical support for a global methylation indicator, an analytic tool that correlates with overall methylation levels. We hope these findings will help other groups standardize and optimize their MethylCap-seq experiments to take advantage of this promising methylome profiling method.
II. Results and Discussion
Quality control exclusion criteria reduce noise in methylation signal and improve analytical power.
Our automated quality control (QC) module, based on MEDIPS [26], was implemented to identify technical problems in the sequencing data and flag potentially spurious samples. One goal of the QC module was to provide rapid feedback to investigators regarding dataset quality, facilitating protocol optimization prior to committing resources to a larger scale sequencing project. A second goal was to identify samples that should be excluded from analyses due to data validity concerns. The validity of a MethylCap-seq experiment is dependent on enrichment of methylated fragments prior to sequencing. A failure in enrichment invalidates any downstream data, and therefore identifying such failures is vital. Also important is verifying the statistical reproducibility of the data for each sample. Generating replicate sequencing lanes for each sample to assess experimental reproducibility
empirically is often not cost effective, and thus addressing this issue computationally is desirable. Similarly, the confidence in methylation calls is related to the breadth and strength of signal at the CpGs in the genome. We assessed enrichment of methylated fragments using the CpG enrichment parameter, which compares the frequency of CpGs in the sequenced sample with the frequency of CpGs in a reference genome (hg18 for this study).
Statistical reproducibility was assessed by calculation of saturation, the
Pearson correlation of two random partitions of the sequenced sample [26].
Breadth and strength of methylation signal was assessed using 5X CpG coverage, representing the fraction of CpG loci with five or more reads in the sample compared to the total number of CpGs in the reference genome.
These QC parameters were calculated for each sample using MEDIPS [26].
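The three QC parameters described above can be illustrated with a minimal sketch. This is not the MEDIPS implementation; it follows only the verbal definitions given in this section, and the function names, the input representations (a bin index per read, a read count per CpG) and the fixed random seed are assumptions made for illustration.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def cpg_enrichment(cpg_in_reads, bases_in_reads, cpg_in_ref, bases_in_ref):
    """CpG frequency in the sequenced sample relative to the reference genome."""
    return (cpg_in_reads / bases_in_reads) / (cpg_in_ref / bases_in_ref)


def saturation(read_bins, n_bins, seed=0):
    """Correlate per-bin read counts of two random partitions of one sample.

    read_bins holds one genomic bin index per read (hypothetical input format).
    """
    shuffled = list(read_bins)
    random.Random(seed).shuffle(shuffled)
    half = len(shuffled) // 2
    counts_a = [0] * n_bins
    counts_b = [0] * n_bins
    for b in shuffled[:half]:
        counts_a[b] += 1
    for b in shuffled[half:2 * half]:
        counts_b[b] += 1
    return pearson(counts_a, counts_b)


def cpg_coverage(reads_per_cpg, min_reads=5):
    """Fraction of reference CpGs covered by at least min_reads reads."""
    return sum(1 for r in reads_per_cpg if r >= min_reads) / len(reads_per_cpg)
```

With `min_reads=5`, `cpg_coverage` corresponds to the 5X CpG coverage used here; `reads_per_cpg` must contain one entry for every CpG in the reference, including CpGs with zero reads.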
Appendix A1 demonstrates the results of the QC module for the
Endometrial Dataset. 203 lanes of sequencing data were generated for 101 unique samples. 43 lanes failed QC, representing 21 unique samples. To assess how lanes that pass QC might differ from lanes that failed QC, we computed the noise in methylation signal, representing percentage of uniquely aligned extended reads falling in 500bp bins without CpG dinucleotides
(Figure 1). Median noise in samples that failed QC (6.40%) was more than three-fold greater than in samples that did not fail QC (2.04%, p<0.001), and closely resembled noise in input (7.82%). Excluding "QC-failed" lanes did not significantly decrease median noise levels (2.04% vs. 2.22%, p=0.08) but did greatly decrease the variation in noise levels between samples. As the distribution of noise levels is positively skewed and not normal, a small
number of outliers would not be expected to significantly shift the median noise level. To investigate whether the additional noise seen in "QC-failed" samples impacted sequencing reproducibility, we computed the Pearson correlation between replicate lanes of samples that passed QC (n = 68) vs. those that failed QC (n = 9) (Appendix A2). Replicates of samples that passed QC correlated more highly than replicates of samples that failed QC
(average r = 0.90 vs. 0.59; p<0.001). Variation in replicate correlation between samples was also noticeably less in the QC pass group (relative standard deviation = 6.7% vs. 27.1%). We surmise that failures in methylation enrichment result in a more random sampling of the fragment distribution regardless of methylation status, resulting in increased signal in regions where methylation should not be detectable.
Figure 1. QC exclusion criteria reduce noise in methylation signals. Percentages of uniquely aligned reads falling in 500bp bins containing no
CpG dinucleotides pre- and post-QC analysis were plotted as a standard box plot for samples prior to QC filtering, samples that passed QC, and samples that did not pass QC in the endometrial dataset. An input from a sample that was not subjected to methylation capture is included for reference. The number of samples in each group is included above the baseline. Values for replicate lanes in each group were averaged, and samples were compared statistically using a Wilcoxon rank-sum test. Whiskers indicate 10th and 90th percentiles. 13.5% of 500bp bins in the genome are classified as CpG-barren.
As the goal of many methylome profiling studies is to identify differentially methylated regions (DMRs) between biological groups, we next assessed whether our QC exclusion criteria might improve our analytical power to detect DMRs. We compared DMRs between 89 endometrial tumors and 12 nonmalignant endometrial tissue samples across several genomic features. Excluding sequencing lanes that failed QC (corresponding to 19 tumor and 2 nonmalignant samples) resulted in more DMRs in every genomic feature assessed (Table 1). The greatest gains were seen in promoters and
CpG shores, where the number of DMRs increased 22-fold and 2-fold, respectively. Gains in CpG islands and promoter-associated CpG islands were more modest (1.6-fold and 1.05-fold). These results trend inversely with
CpG density, perhaps reflecting greater benefit from QC exclusion in regions where coverage is lower. We speculate that the improvements in DMR detection resulting from exclusion of samples that fail QC would be even greater when working with smaller sample sizes or biological groups with more similar methylation patterns.
Table 1. Differentially Methylated Regions, Endometrial Tumors vs. Nonmalignant Endometrial Tissue
Genomic feature All samples Samples Passing QC only CpG islands Promoter-associated CpG shores Promoters
The effect of additional sequencing lanes on quality control metrics
DNA sequencing cores are frequently asked whether additional lanes of sequencing data are necessary or desirable for MethylCap-seq experiments. To address this question, we analyzed a large dataset of ovarian tumors of which 7 samples had been resequenced (using the same genomic library), for a total of 15 lanes (Appendix A3). First, the degree of correlation between the replicate lanes was analyzed to ensure that additional lanes of data would not introduce excessive variation. As shown in Figure 2, replicate lanes from sequencing the same library twice correlated highly (R² value of 0.98). Note that the question here was the value of additional sequencing lanes and not of additional technical or biological replicates – the correlation between technical or biological replicates would be expected to be much lower than the correlation between two lanes sequencing the same library.
CpG enrichment, saturation and 5X coverage were then evaluated for individual lanes and combined lanes (Figure 3). CpG enrichment varied somewhat between samples (range: 2.33-3.02), but was extremely similar for replicate lanes (<1% deviation from the combined lane on average).
Saturation improved modestly from a median of 0.79 to a median of 0.86. As saturation values for individual lanes of MethylCap-seq data typically range from 0.6 to 0.85 in our hands and we consider a saturation value of 0.6 acceptable for analysis, this improvement may be inconsequential even though it is statistically significant. 5X coverage improved noticeably from a median of 0.21 to a median of 0.28, representing an average 38% gain. As
5X coverage represents a minimum signal level needed to reliably
differentiate a methylated locus from a locus with no methylation (or the absence of a methylation signal), we speculate that this increase could significantly increase the statistical power to detect DMRs, particularly in small or lightly methylated regions.
Figure 2. Replicate sequencing lanes for MethylCap-seq experiments correlate highly. Replicate lanes for each sample were randomly assigned to two partitions, and the average rpm of 6000 (of 6M) randomly selected 500bp bins were compared between partitions.
Figure 3. Additional lanes of sequencing data moderately increase saturation but greatly increase 5X CpG coverage. Variations in CpG enrichment (A), saturation (B), and 5X coverage (C) were assessed for 15 lanes of data in the ovarian study corresponding to 7 samples by generating plots of individual lanes and combined replicate lanes for each sample. (D) Average percent deviation of the individual lanes from the combined lane for each sample was plotted for each parameter. Error bars for (D) represent standard error. Asterisks represent Student t-test p<0.05.
The global methylation indicator (GMI) correlates inversely with an in vitro methylated tracer sequence.
We recently proposed a computational method to compare genome-wide changes in methylation patterns between samples in a given experiment [77]. As MethylCap-seq signal (in reads) is normalized by raw read counts to adjust for variability in lane yield, two samples with identically distributed methylation yet different absolute levels of methylation would be expected to yield identical normalized methylation signals at any given locus. The GMI method relies on the observation that in vitro methylated samples display characteristic changes in the methylation signal distribution as quantified in a
MethylCap-seq experiment, and these changes are CpG-density dependent.
Methylation signal shifts from low CpG content regions to high CpG content regions; this difference can be quantified by calculating the area under the curve of the average normalized methylation signal plotted across CpG density. The GMI is a potentially powerful tool for capturing differences in methylation distribution between samples.
In an effort to validate the GMI as a surrogate for global methylation, we developed a complementary analysis utilizing an in vitro methylated construct. Samples from an acute myeloid leukemia (AML) study were used for this analysis [39]. This methylated construct was spiked into the genomic
DNA in the AML samples prior to sonication at a defined concentration and subjected to methylated DNA enrichment along with the sample DNA. The
"spike-in" was originally intended to verify successful enrichment; if enrichment occurs, PCR for the methylated plasmid would show increased copy number after enrichment. Additionally, this "spike-in" also indicates global methylation levels in a sample since the methylated plasmid competes with the natively methylated genomic DNA fragments for binding to the MBD protein. When the proportion of methylated to unmethylated genomic
fragments is high prior to enrichment, the methylated plasmid "spike-in" gets enriched relatively less, and vice versa. Indeed, we found that read counts aligned to the plasmid correlate inversely with GMI (Figure 4, Appendix A4).
This result provides empirical evidence that GMI can capture changes in absolute global methylation levels for MethylCap-seq experiments. Such a metric could be useful for gauging response to treatments that are known or expected to alter the methylome.
Figure 4. Global methylation indicator scales inversely with read counts from a "spiked" in vitro methylated construct. The pIRES2-EGFP plasmid was in vitro methylated and "spiked" at a set concentration into each of 14 samples from the decitabine study prior to sequencing. After sequencing, GMI was calculated and plotted against the inverse of the number of normalized reads aligning to the plasmid. A linear best fit was drawn through the experimental points (p = 0.036, R² = 0.318).
III. Methods
Patient samples
89 endometrioid endometrial cancer and 12 unmatched nonmalignant endometrium samples were obtained from Washington University. All studies
involving human samples were approved by the Human Studies Committee at Washington University and at The Ohio State University.
Seven ovarian cancer samples from a larger patient cohort were obtained from TriService General Hospital, Taipei, Taiwan. All studies involving human ovarian cancer samples were approved by the Institutional
Review Boards of TriService General Hospital and National Defense Medical
Center.
Fourteen bone marrow samples from a single-center Phase II clinical trial involving patients with acute myeloid leukemia (AML) at The Ohio State
University were obtained for this investigation. The study design and the results of the trial for the entire patient cohort have been reported elsewhere
[79]. All studies involving these samples were approved by The Ohio State
University Human Studies Committee.
Methylated-DNA capture (MethylCap-seq)
Enrichment of methylated DNA was performed with the Methyl Miner
Kit (Invitrogen) according to the manufacturer’s protocol as previously described [77]. Briefly, one microgram of sonicated DNA was incubated at room temperature on a rotator mixer in a solution containing 3.5 micrograms of MBD-Biotin Protein coupled to M-280 Streptavidin Dynabeads. Non-captured DNA was removed by collecting beads with bound methylated DNA on a magnetic stand and washing three times with Bind/Wash Buffer.
Enriched, methylated DNA was eluted from the bead complex with 1M NaCl and purified by ethanol precipitation. Library generation and 36-bp single-ended sequencing were performed on the Illumina Genome Analyzer IIx according to the manufacturer’s standard protocol.
MethylCap-seq experimental quality control and exclusion criteria
The automated quality control (QC) module was implemented as previously described [77]. Pre-aligned sorted.txt files from the Illumina
CASAVA 1.7 pipeline were used to reduce turnaround time for analysis. In brief, duplicate alignments were removed from the aligned sequencing file (a correction for potential PCR artifacts), and the resulting output was loaded into an R workspace. MEDIPS [26] was used to analyze CpG enrichment, saturation, and CpG coverage.
Sequencing lanes were excluded using the following thresholds: CpG enrichment < 1.4, saturation < 0.5 and CpG 5x coverage < 0.05. These criteria and thresholds were chosen for technical relevance and their ability to identify known technical issues without a bias for specific biological groups.
Samples were excluded if any one of the criteria was not met (i.e., if any metric fell below its threshold). As CpG coverage was assessed qualitatively for analysis of the endometrial dataset, five lanes of data with borderline 5X CpG coverage that would have qualified for exclusion under this criterion were retained.
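The exclusion rule can be expressed compactly. The following is a minimal sketch using the thresholds stated above; the dictionary keys and function names are hypothetical and do not correspond to the QC module's actual interface.

```python
# Exclusion thresholds as stated in the text; a lane fails a criterion when
# its metric falls below the corresponding cutoff.
QC_THRESHOLDS = {
    "cpg_enrichment": 1.4,
    "saturation": 0.5,
    "cpg_5x_coverage": 0.05,
}


def failed_criteria(metrics):
    """Return the names of all QC criteria that a lane does not meet."""
    return [name for name, cutoff in QC_THRESHOLDS.items()
            if metrics[name] < cutoff]


def passes_qc(metrics):
    """A lane passes only if every criterion is met."""
    return not failed_criteria(metrics)
```

Reporting which criteria failed, rather than a bare pass/fail flag, supports the first goal of the module: giving investigators feedback on what went wrong in a given lane.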
For the DMR comparison (Table 1), methylation signal was normalized for each lane and averaged among replicate lanes for each sample. The “All” group includes samples with merged QC pass lanes, samples with merged
QC fail lanes, and samples with merged QC pass and QC fail lanes, and thus is not a straight sum of the QC-passed and QC-failed samples.
For the reproducibility comparison (Appendix A2), Pearson r was calculated using two replicate lanes corresponding to each sample represented in the
QC pass and QC fail groups. When a sample had more than two replicate lanes in a single group, two lanes were randomly chosen for the analysis.
Samples lacking two replicate lanes in either the QC pass or QC fail group were excluded from this analysis. Lanes corresponding to the same sample but generated using different library preparations were also excluded.
Sequencing and QC summaries corresponding to the datasets referenced in this chapter can be viewed in Appendices A2, A3, and A4.
Standard sequence file processing and alignment
Sequence files were processed and aligned as previously described
[77]. Briefly, QSEQ files from the Illumina CASAVA 1.7 pipeline were converted to FASTA format, duplicate reads were removed (to control for PCR bias), and the remaining reads were uniquely aligned with Bowtie to generate SAM files using the following options: -f -t -p 1 -n 3 -l 32 -k 1 -m 1 -S -y --chunkmbs 1024 --max --best [80]. Duplicate alignments (reads aligning to the same genomic position) were removed using SAMtools [81].
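The logic of duplicate-alignment removal can be sketched as follows. This is an illustration of the concept only, not a reproduction of the SAMtools procedure used in this work; treating strand as part of the duplicate key, and the dictionary representation of an alignment, are assumptions.

```python
def remove_duplicate_alignments(alignments):
    """Keep the first alignment seen at each (chromosome, position, strand).

    Each alignment is a dict with 'chrom', 'pos' and 'strand' keys
    (a hypothetical simplification of a SAM record).
    """
    seen = set()
    unique = []
    for aln in alignments:
        key = (aln["chrom"], aln["pos"], aln["strand"])
        if key not in seen:
            seen.add(key)
            unique.append(aln)
    return unique
```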
Standard global methylation analysis workflow
Aligned sequence files in SAM format were analyzed using a custom analysis workflow as previously described [77]. Briefly, aligned reads were extended to the average fragment length (as determined by BioAnalyzer fragment analysis) and counted in 500bp bins genome-wide. The resulting
count distribution was normalized against the total aligned reads by conversion to reads per million (RPM).
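The read extension, binning and RPM normalization steps can be sketched as below for a single chromosome. The workflow itself is described in [77]; here, the tuple representation of a read, the choice to count an extended read in every bin it overlaps, and the function names are assumptions made for illustration.

```python
from collections import Counter

BIN_SIZE = 500  # bin width in bp, as used in this chapter


def extend_read(start, read_len, strand, fragment_len):
    """Return the 0-based half-open interval of the inferred fragment."""
    if strand == "+":
        return start, start + fragment_len
    end = start + read_len
    return max(0, end - fragment_len), end


def bin_counts(reads, fragment_len):
    """Count extended reads per bin; reads are (start, read_len, strand).

    Assumption: a fragment contributes one count to every bin it overlaps.
    """
    counts = Counter()
    for start, read_len, strand in reads:
        frag_start, frag_end = extend_read(start, read_len, strand, fragment_len)
        for b in range(frag_start // BIN_SIZE, (frag_end - 1) // BIN_SIZE + 1):
            counts[b] += 1
    return counts


def to_rpm(counts, total_aligned_reads):
    """Normalize bin counts to reads per million aligned reads."""
    scale = 1_000_000 / total_aligned_reads
    return {b: c * scale for b, c in counts.items()}
```

For example, with a fragment length of 150bp, a forward read starting at position 450 is extended to 450-600 and therefore contributes to both the first and second 500bp bins.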
Methylation was categorized by genomic feature as follows: CpG islands (CGI, as defined in the UCSC genome browser), promoters (2kb in length, 1kb upstream and downstream of the TSS), CGI shores (200bp to 2kb distant from both ends of each CGI), and the first exon of RefSeq genes. CGI were further subdivided by proximity to promoters (within 10kb upstream or
1kb downstream of a 2kb promoter), and 2kb promoters were subdivided by overlap with CGI.
Differentially methylated regions were identified by summing RPM across the bins for each locus in the genomic feature, then performing a
Wilcoxon rank sum test to assess differences in these summed RPMs between sample groups. Results were then adjusted for multiple comparisons by setting a false discovery rate (FDR) cutoff of 0.05.
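The DMR procedure can be sketched as follows. Several details are assumptions rather than the method used in this work: the rank-sum p-value uses a normal approximation with mid-ranks and no tie correction to the variance, the FDR adjustment uses the Benjamini-Hochberg procedure (the text specifies only an FDR cutoff of 0.05), and the input layout and function names are hypothetical.

```python
import math


def rank_sum_p(group_a, group_b):
    """Two-sided Wilcoxon rank-sum p-value via the normal approximation."""
    pooled = sorted((v, g) for g, vals in enumerate((group_a, group_b))
                    for v in vals)
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):  # assign mid-ranks to runs of tied values
        j = i
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        for k in range(i, j + 1):
            ranks[k] = (i + j) / 2 + 1
        i = j + 1
    n_a, n_b = len(group_a), len(group_b)
    w = sum(r for r, (_, g) in zip(ranks, pooled) if g == 0)
    mean = n_a * (n_a + n_b + 1) / 2
    sd = math.sqrt(n_a * n_b * (n_a + n_b + 1) / 12)
    return math.erfc(abs(w - mean) / sd / math.sqrt(2))


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda idx: p_values[idx])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def call_dmrs(loci, fdr=0.05):
    """loci maps a locus name to (group 1, group 2), each a list of
    per-sample lists of bin RPM values covering that locus."""
    names = list(loci)
    p_values = []
    for name in names:
        group_1, group_2 = loci[name]
        p_values.append(rank_sum_p([sum(bins) for bins in group_1],
                                   [sum(bins) for bins in group_2]))
    q_values = benjamini_hochberg(p_values)
    return [n for n, q in zip(names, q_values) if q <= fdr]
```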
Calculation of noise in methylation signal
Noise in methylation signal, representing extended reads falling in regions without CG dinucleotides, was quantified as the summation of reads falling into bins with zero CpG content. If a sample in a given group had multiple lanes of data, noise was computed for each lane individually and averaged among replicate lanes in the group. As a single sample could have a lane that passed QC and a lane that failed QC, the number of samples in each group does not sum to the total number of samples in the study.
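The noise calculation can be sketched as below, expressing noise as a fraction of all binned reads (multiply by 100 for the percentages reported in Figure 1). The parallel-list input format and the function names are assumptions made for illustration.

```python
def noise_fraction(bin_reads, bin_cpg_counts):
    """Fraction of extended reads falling in bins that contain no CpGs.

    bin_reads and bin_cpg_counts are parallel lists with one entry per bin.
    """
    total = sum(bin_reads)
    barren = sum(r for r, c in zip(bin_reads, bin_cpg_counts) if c == 0)
    return barren / total


def sample_noise(lanes, bin_cpg_counts):
    """Average the per-lane noise over a sample's replicate lanes."""
    per_lane = [noise_fraction(lane, bin_cpg_counts) for lane in lanes]
    return sum(per_lane) / len(per_lane)
```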
Calculation of the Global Methylation Indicator (GMI)
To assess genome-wide changes in methylation patterns for each sample in an experiment, a custom parameter termed the global methylation indicator (GMI) was calculated as previously described [77]. Briefly, normalized read counts (in RPM) were classified by CpG density and averaged to construct a methylation distribution. The average RPM were then summed across the distribution (i.e., the estimated area under the methylation distribution curve) to yield the GMI.
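The GMI calculation can be sketched as below. The details are assumptions for illustration only: CpG density is represented by the CpG count of each bin, bins without reads contribute 0 RPM, and the function name is hypothetical; the procedure in [77] may differ in how density classes are defined.

```python
from collections import defaultdict


def global_methylation_indicator(bin_rpm, bin_cpg_count):
    """Return (GMI, mean RPM per CpG-density class).

    Bins are grouped by CpG count, RPM is averaged within each group to
    form the methylation distribution, and the averages are summed.
    """
    by_density = defaultdict(list)
    for b, cpg in bin_cpg_count.items():
        by_density[cpg].append(bin_rpm.get(b, 0.0))
    mean_rpm = {d: sum(v) / len(v) for d, v in by_density.items()}
    return sum(mean_rpm.values()), mean_rpm


# Hypothetical toy data: five bins in three CpG-density classes.
cpg = {0: 0, 1: 0, 2: 5, 3: 5, 4: 10}
rpm = {0: 0.2, 2: 1.0, 3: 3.0, 4: 6.0}
toy_gmi, toy_distribution = global_methylation_indicator(rpm, cpg)
```

Averaging within each density class before summing keeps the abundant low-CpG bins from dominating the indicator.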
Assessment of methylated fragment enrichment using an in vitro methylated construct
Experimental procedure
The 5.3kb plasmid vector pIRES2-EGFP, which contains three CpG islands, was used to assess methylated fragment enrichment. The construct was linearized with NheI and then in vitro methylated with M.SssI. The methylated "spiked-in" DNA was quantified by the Qubit High Sensitivity
Assay and diluted. Plasmid was spiked into genomic DNA at a concentration of 1.5pg plasmid / 1µg genomic DNA (~2.5 plasmid copies per cell) prior to sonication of genomic DNA for library generation.
Analysis
Reads mapping to the construct were identified by converting QSEQ files to FASTA format as described above, then aligning the files with Bowtie using the following options: -q -t -p 1 -n 3 -l 32 -k 1 -S --chunkmbs 1024 --max
--best. Duplicate reads were retained for this analysis. To control for
variation in construct aligned read counts attributable to fluctuations in lane yield, construct aligned read counts were normalized against the total raw read counts by conversion to reads per million (RPM).
IV. Conclusions
This study shows that post-sequencing QC metrics can exclude poor quality samples from analysis, decreasing noise in methylation signal and improving power to detect DMRs. Furthermore, we show that resequenced lanes from the same library correlate very well, and that additional lanes of data have a small impact on saturation (data reproducibility) and a large impact on 5X CpG coverage (confidence in methylation calls at a given locus).
Finally, we demonstrate that our computational indicator of global methylation correlates with an unrelated method that utilizes spike-in of DNA with known methylation status. These findings verify that MethylCap-seq, with appropriate quality control, is a reliable tool that provides reproducible relative methylation information on a feature by feature basis, provides information about levels of global methylation, and can be used to analyze large patient cohorts of hundreds of patients.
Chapter 4. Identification of endometrial cancer methylation features
using a combined methylation analysis method
I. Introduction
During carcinogenesis cells within solid tumors acquire numerous aberrations, including mutations that alter the coding sequence of genes, as well as changes in gene expression. Changes in gene expression may be mediated in many ways including: altered transcription factor levels or function, mutations in DNA binding elements, miRNAs, and chromatin remodeling. Chromatin remodeling, including epigenetic modifications to histones and DNA methylation, normally plays a key role in cell differentiation, stably switching cellular pathways on/off until the cells reach a terminally differentiated state that is typically irreversible. Epigenetic aberrancies can allow cancer cells to silence tumor suppressors and re-express oncogenes, giving tumor cells an additional option besides mutation to dysregulate key pathways [82].
DNA methylation is one of the better understood mechanisms of epigenetic control. DNA methylation in humans is mediated by the DNA methyltransferases DNMT1 and DNMT3, which add a methyl group to the 5-carbon (C5 position) of cytosine. In differentiated cells, DNA methylation occurs in the context of cytosine followed by guanine (CpG) in the DNA sequence. DNA methylation in promoter CpG islands (CGI) has been shown to mediate stable
gene silencing. Tumor suppressor silencing via DNA methylation is found in a wide range of tumor types [69]. While methylation marks are erased by standard PCR, several methods have been developed to preserve this information, including affinity-based capture of methylated regions and bisulfite conversion [21]. DNA methylation is increasingly being recognized as a potential biomarker [70].
The CpG island methylator phenotype (CIMP) is a cancer-specific accumulation of DNA methylation in the CpG islands of some tumors.
Originally identified more than 15 years ago in colorectal cancer [83], CIMP has since been identified in multiple cancer types: glioma [84], breast cancer
[85], acute myeloid leukemia [86], gastric cancer [87], clear cell renal cell carcinoma [88], oral squamous cell carcinoma [89], and hindbrain ependymomas [90]. CIMP may occur early in tumorigenesis: CIMP can be detected in colorectal serrated adenomas prior to malignant progression and the development of microsatellite instability [91]. Early diagnosis presents an opportunity for early intervention and improved outcomes, including lower treatment morbidity and lower recurrence rate. CIMP classification may have prognostic value: CIMP is associated with good prognosis in some cancer types (e.g., colorectal, breast) and poor prognosis in others (e.g., renal cell carcinoma) [72]. Accurate differentiation of relatively benign and aggressive cancer subtypes allows benign subtypes to be treated less aggressively and aggressive subtypes to be treated more aggressively. CIMP could also represent a therapeutic target for demethylating therapies [92]. Despite its
potential diagnostic, prognostic and therapeutic value, CIMP and its manifestations in different cancer types remain poorly understood.
Defining CIMP in a new cancer type requires extensive methylome profiling. However, methylome profiling studies examining CIMP genome-wide in endometrial cancer cohorts remain sparse. Several methods have emerged to profile the methylome, including the Infinium beadchip [73] and affinity-based methylation capture followed by shotgun sequencing (e.g.,
MethylCap-seq [77]). The Infinium beadchip is frequently used in clinical studies for its cost-effectiveness, scalability for large cohorts, high accuracy, and user-friendly analysis pipeline. The method relies on hybridization of bisulfite-converted DNA to the beadchip, followed by single-base extension.
The end result is a readout of percent methylation for individual CpGs, with ~7
CpGs assessed per promoter CGI using the HumanMethylation 450 kit, or roughly 8% of the CpGs in promoter CGI. Methylation of nearby CpGs is not measured, but is assumed to be similar. MethylCap-seq is one of several affinity-based capture methods that leverage shotgun sequencing to assess methylation patterns. MethylCap-seq uses the MBD2 protein to capture methylated fragments, which are then sequenced to yield piles of methylation tags across the genome. By comparing tag frequency between samples, relative methylation levels can be inferred for a given region. As sequencing costs continue to fall, MethylCap-seq and similar methods will become increasingly cost-effective. For analysis of promoter CGI, MethylCap-seq has a particular advantage over Infinium: average methylation over the regions is measured, rather than assumed.
In this study, we developed a 13-region signature that stratifies endometrioid endometrial tumors by CpG island methylation status and distinguishes tumors from both normal control and adjacent normal tissue.
This signature is based on a training set of MethylCap-seq data and validated for use on Infinium datasets. We also demonstrate a general method for identifying methylator phenotypes based on total promoter CGI methylation.
This signature could prove useful for detecting and classifying endometrioid endometrial carcinomas, as well as catalyzing research into the role of dysregulated methylation in this disease.
II. Results
Characterizing a CpG island methylator phenotype
Methylome data were analyzed from a previously published
MethylCap-seq dataset of 76 endometrioid endometrial primary carcinomas and 12 non-matched normal control samples (hereafter referred to as the discovery set) [93]. To assess the extent of CpG island (CGI) methylation in tumors relative to normal controls, we compared overall methylation in CGIs as well as in other genomic features known to acquire methylation during carcinogenesis [37,52] (Figure 5A). Tumors overall showed a nearly 2-fold increase in methylation of promoter CGI, with reduced gains in CGI shores.
These increases correlated with methylation gains in promoters containing
CGI (but not promoters without CGI), as well as gains in first exons. To examine the variability of tumor methylation gains between different tumors, promoter CGI methylation was plotted for each tumor and compared to
normal controls (Figure 5B). Tumors displayed a spectrum of methylation gains ranging from slightly below the levels of normal controls to 5-fold higher.
[Figure 5 (image not reproduced). Panel A: fold change in RPM by genomic feature for normal and malignant samples; asterisks mark significant differences. Panel B, "Defining the CIMP Training Set": promoter CGI methylation (RPM) for normal (n=12) and malignant (n=76) samples, with the CGI-H (n=5) and CGI-0 (n=8) groups indicated.]
Figure 5. Endometrioid endometrial malignancies show increased methylation in promoter CGI compared to unmatched normal controls. (A) MethylCap-seq normalized signal in the discovery set was compared
between malignancies (n=76) and normals (n=12) and plotted across
autosomes for several genomic features: CpG islands (CGI), promoters
(1kb upstream and downstream of the TSS), CGI shores (200bp to 2kb
distant from both ends of each CGI), and the first exon of RefSeq
genes. CGI were further subdivided by proximity to promoters (within
10kb upstream or 1kb downstream of a 2kb promoter), and 2kb
promoters were subdivided by overlap with CGI. Bars denote mean
fold change relative to normal controls; error bars depict 25th and 75th
percentiles. Asterisks denote Bonferroni-adjusted Wilcoxon rank sum
test p<0.05.
(B) Most endometrioid endometrial malignancies show increased promoter
CGI methylation compared to normal controls, with a subset of highly
methylated tumors (CGI-H) showing over three-fold more methylation
compared to tumors with similar methylation to normal controls
(CGI-0). To identify where methylation changes occur in the most
methylated of tumors, thresholds were drawn at the upper and lower
extremes of the tumor methylation spectrum (dotted lines).
To rule out the contribution of an enrichment bias, which would produce differences in methylation signal based on the efficiency of methylated fragment enrichment rather than biological methylation, we compared mitochondrial methylation amongst the 76 tumors as a negative control. Mitochondrial and nuclear methylation utilize different cellular machinery and occur in different cellular compartments [94,95], thus a positive correlation between mitochondrial methylation and CGI methylation would indicate a potential enrichment bias. No significant correlation was seen between mitochondrial methylation and promoter CGI methylation (Spearman r=-0.15, p=0.2).
To define the changes in methylation underlying a putative CpG island methylator phenotype, we examined regional methylation differences between tumors with the highest (CGI-H) and lowest (CGI-0) levels of promoter CpG
island methylation (Figure 5B, dashed lines). Consistent with the premise of a
CpG island methylator phenotype, methylation differences were almost exclusively unidirectional (4672 hypermethylated vs. 17 hypomethylated), and encompassed a large fraction of all promoter CGI (29%) (Figure 6B). To focus on methylation changes that might occur during carcinogenesis, the identified regions were further overlapped with methylation differences observed in tumors vs. normal samples (Figure 6A). A total of 2269 promoter CGI were shared between the two comparisons (Figure 6C). Of the CGI-H v. CGI-0 hypermethylated loci, 49% were also hypermethylated in tumors compared to normal tissue, roughly twice the expected rate of 23% (chi-squared p-value < 0.001), suggesting that many of these loci represent “hotspots” in the genome that are particularly vulnerable to aberrant methylation in endometrial tissue. Pathway analysis of these shared promoter CGI showed enrichment for known targets of epigenetic regulation, including targets of the Polycomb Repressive Complex and regions known to be methylated in other cancers (Table 2).
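The observed and expected overlap rates quoted above can be reproduced from the locus counts reported for Figure 6. The Python sketch below is a worked check under stated assumptions: the expected rate is taken as the fraction of all tested promoter CGI that were hypermethylated in tumors vs. normals, and the test is set up as a one-degree-of-freedom chi-squared goodness-of-fit. The text does not specify how the reported chi-squared test was constructed, so this setup is illustrative only.

```python
# Counts reported for Figure 6 (Malignant v. Normal comparison).
hyper_tumor, hypo_tumor, unchanged_tumor = 3730, 95, 12349
# Counts reported for the CGI-H v. CGI-0 comparison and the overlap.
hyper_cgih, shared = 4672, 2269

total_cgi = hyper_tumor + hypo_tumor + unchanged_tumor
expected_rate = hyper_tumor / total_cgi      # about 0.23
observed_rate = shared / hyper_cgih          # about 0.49
fold_over_expected = observed_rate / expected_rate

# Goodness-of-fit: shared vs. not shared among CGI-H hypermethylated loci.
expected_shared = expected_rate * hyper_cgih
expected_unshared = hyper_cgih - expected_shared
chi2 = ((shared - expected_shared) ** 2 / expected_shared
        + ((hyper_cgih - shared) - expected_unshared) ** 2 / expected_unshared)

CRITICAL_P001_DF1 = 10.83  # chi-squared critical value, p = 0.001, 1 df
significant = chi2 > CRITICAL_P001_DF1
```

Under these assumptions the statistic far exceeds the critical value, consistent with the reported p-value below 0.001.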
[Figure 6 (image not reproduced). Panel A, comparison of methylation differences between tumors and normals, Malignant (n=76) v. Normal (n=12): 3730 hypermethylated, 95 hypomethylated, 12349 unchanged. Panel B, comparison of methylation differences between hypermethylator and non-hypermethylator tumors, CGI-H (n=5) v. CGI-0 (n=8): 4672 hypermethylated, 17 hypomethylated, 11346 unchanged. Panel C, comparison of regions methylated in hypermethylator tumors and frequently methylated in endometrioid endometrial cancer: of 3730 (Malignant v. Normal) and 4672 (CGI-H v. CGI-0) hypermethylated promoter CGI, 2269 are shared, 1461 are unique to Malignant v. Normal, and 2403 are unique to CGI-H v. CGI-0.]
Figure 6. Loci methylated in CGI-H tumors account for many of the cancer-associated methylation gains. (A) Endometrioid endometrial tumors show increased methylation of over
20% of promoter CGI compared to unmatched normal controls.
Differentially methylated promoter CGI were quantified in the discovery
set. Depicted is the proportion of loci that were hypermethylated
(Hyper), hypomethylated (Hypo), or unchanged in malignancies relative
to normals.
(B) CGI-H tumors gain methylation at nearly 30% of promoter CGI
compared to tumors with methylation similar to normal controls
(CGI-0).
(C) Differentially methylated regions between Malignant v. Normal and
CGI-H v. CGI-0 show considerable overlap. Hypermethylated
promoter CGI from (A) and (B) were intersected to identify cancer-
specific hypermethylated regions.
Table 2. Term enrichment associated with hypermethylated promoter CGI (MSigDB Perturbation)
Input: hypermethylated promoter CGI, overlap comparison of Malignant v. Normal and CGI-H v. CGI-0 (Venn diagram inset values: 3730, 2269, 2403, 4672)

Term Name                                                  Hyper FDR Q-Val   Hyper Fold Enrichment   Hyper Foreground Region Hits   Hyper Total Regions
Genes identified as targets of the Polycomb protein SUZ12  2.76 x 10^-19     2.5                     126                            1669
Genes possessing promoter trimethylated H3K27 (H3K27me3)   2.2 x 10^-19      2.4                     134                            1854
Polycomb Repression Complex 2 (PRC) targets                2.77 x 10^-18     2.8                     97                             1127
Genes identified as targets of the Polycomb protein EED    4.18 x 10^-18     2.4                     124                            1705
Genes with hypermethylated DNA in lung cancer samples      9.14 x 10^-6      2.5                     44                             574
Genes de novo DNA methylated in cancer                     0.000399          3.7                     19                             169
Methylation signature construction and technical validation
To identify a methylation signature that could be translated to other platforms (including Infinium and assays of individual regions), 16 candidates were selected from the 2269 promoter CGI from Figure 6C (Table 3). The 2269 promoter CGI were sorted by Kruskal-Wallis p-value and fold difference for the CGI-H vs. CGI-0 comparison; potential candidates were excluded that overlapped regions associated with copy number amplification in TCGA endometrioid samples (8 of the top 50). In addition, a threshold Student's t-test p-value of p<0.01 (not corrected for multiple comparisons) was imposed to exclude candidates with large fold differences that appeared to be driven by outlier samples (5 of the top 21). Methylation of 15/16 candidates robustly distinguished the CGI-H and CGI-0 tumors; TMEM115, however, showed hypermethylation in only 3 of 6 CGI-H tumors (Figure 7A).
Table 3. Promoter CGI that distinguish CGI-H from CGI-0 tumors as measured by MethylCap-seq
Gene symbol   Chromosomal location (hg19)   CpG count   Fold change RPM
TMEM115       chr3:50402103-50402942        66          18.9
TBX18         chr6:85472702-85474132        129         18.9
NODAL         chr10:72200065-72201368       106         16.9
SVEP1         chr9:113341213-113342029      99          16.2
TNFSF11       chr13:43148277-43149282       83          16
OR10H2        chr19:15833733-15833983       23          14.5
KDM2B         chr12:122016170-122017693     125         14.4
FGF12         chr3:192125818-192127991      176         14
APCDD1L       chr20:57089460-57090237       71          13.6
EPHX3         chr19:15344091-15344419       33          13
ASCL1         chr12:103351579-103352695     105         13
EXOC3L4       chr14:103557606-103558235     63          12.6
SMOC2         chr6:168841818-168843100      125         12.5
B4GALNT1      chr12:58025661-58027056       124         12.2
GRM8          chr7:126891300-126894205      234         12.1
VILL          chr3:38035701-38036000        29          11.7
[Figure 7 (image not reproduced). Panel A: MethylCap-seq heat map of signature candidates for CGI-0 and CGI-H tumors (log2 norm index, scale -3 to 3), with signature CGI and discarded candidates marked. Panel B: Infinium heat map of the same candidates (log2 norm index, scale -3 to 3). Panel C: signature average beta-value for CGI-0 (n=5) and CGI-H (n=6) tumors, p=0.007.]
Figure 7. Methylation of 13 promoter-associated CGI distinguishes tumors with high promoter CGI methylation (CGI-H) from those with baseline promoter CGI methylation (CGI-0). 11 tumors from the discovery set of 76 endometrioid endometrial tumors were chosen for technical validation via Infinium and indexed using a signature of
13 promoter-associated CGI. Two candidate regions that showed <0.1 difference in beta-value and p>0.05 between groups in the technical validation
Infinium dataset were discarded. An additional region that showed a negative difference in beta-value was also discarded.
(A) Methylation was compared between CGI-H (n=6) and CGI-0 tumors
(n=5) across 16 promoter-associated CGI signature candidates using
MethylCap-seq. Relative methylation was compared between regions
by normalizing to the region average, then applying a log2
transformation.
(B) Methylation differences seen in (A) were technically validated using the
Infinium HumanMethylation 450 platform. Tumors were indexed using
the average beta-value of all probes in each region (total of 88 probes),
and relative methylation was compared as in (A). Regions (columns)
were sorted in ascending order by row product for visual
representation.
(C) Methylation score, an average of the beta-values of the 13 validated
promoter CGI, was plotted for each tumor. Methylation score was
compared between CGI-H and CGI-0 tumors using Student’s t-test. The
P-value for the 16-region score was 0.01 (data not shown).
To verify differential methylation, the 16 signature candidates were technically validated in 6 CGI-H (4 of the 5 from the original definition and 2 additional from the upper fringe) and 5 of the original 8 CGI-0 tumors using the Infinium HumanMethylation450 beadchip (Table 4 and Table 5). 1 CGI-H tumor and 3 CGI-0 tumors from the original definition were not included in this technical validation due to limited availability of sample DNA. 13/16 candidates showed differential methylation between CGI-H and CGI-0 tumors according to the following criteria: beta-value difference of greater than +0.1
(CGI-H – CGI-0) or Student’s t-test p<0.1. Exclusion criteria were kept loose to avoid overfitting. The 13 validated signature regions comprise a total of 88
Infinium HumanMethylation450 probes located within the respective CGI, with a median of 6 probes per region and a range of 2 to 14 (Table 4).
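The retention rule and the probe summary above can be checked against the values reported in Tables 4 and 5. The Python sketch below is illustrative: the rule is encoded as "positive difference, and either a difference above 0.1 or p<0.1", which combines the stated criteria with the exclusion of a region showing a negative difference noted in the Figure 7 legend. That combined reading, and the helper names, are assumptions; the methylation score is computed as the mean of per-region average beta-values, following the Figure 7 legend.

```python
from statistics import mean, median

# (gene, probes, delta beta [CGI-H minus CGI-0], t-test p-value),
# transcribed from Tables 4 and 5.
CANDIDATES = [
    ("SVEP1", 4, 0.340, 0.012), ("FGF12", 14, 0.323, 0.020),
    ("NODAL", 6, 0.313, 0.025), ("TNFSF11", 10, 0.310, 0.011),
    ("TBX18", 10, 0.268, 0.061), ("OR10H2", 3, 0.265, 0.047),
    ("VILL", 2, 0.251, 0.144), ("ASCL1", 9, 0.203, 0.129),
    ("EPHX3", 3, 0.172, 0.002), ("GRM8", 11, 0.152, 0.058),
    ("TMEM115", 3, 0.120, 0.118), ("APCDD1L", 9, 0.105, 0.007),
    ("EXOC3L4", 4, 0.038, 0.082), ("B4GALNT1", 11, 0.003, 0.320),
    ("KDM2B", 5, -0.010, 0.183), ("SMOC2", 7, -0.029, 0.077),
]


def is_validated(delta_beta, p_value):
    """Assumed rule: gain in CGI-H, and either a large gain or p < 0.1."""
    return delta_beta > 0 and (delta_beta > 0.1 or p_value < 0.1)


def signature_score(betas_by_region):
    """Mean of per-region average beta-values for one tumor."""
    return mean(mean(betas) for betas in betas_by_region.values())


kept = [c for c in CANDIDATES if is_validated(c[2], c[3])]
discarded = {c[0] for c in CANDIDATES if not is_validated(c[2], c[3])}
total_probes = sum(c[1] for c in kept)
median_probes = median(c[1] for c in kept)
```

Averaging within each region first gives every region equal weight in the score regardless of how many probes it contains.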
Table 4. Promoter CGI included in the final signature after validation by Infinium
Gene symbol   Chromosomal region (hg19)     # probes   Δ beta-value (1)   p-value (2)
SVEP1         chr9:113341213-113342029      4          0.34               0.012
FGF12         chr3:192125818-192127991      14         0.323              0.02
NODAL         chr10:72200065-72201368       6          0.313              0.025
TNFSF11       chr13:43148277-43149282       10         0.31               0.011
TBX18         chr6:85472702-85474132        10         0.268              0.061
OR10H2        chr19:15833733-15833983       3          0.265              0.047
VILL          chr3:38035701-38036000        2          0.251              0.144
ASCL1         chr12:103351579-103352695     9          0.203              0.129
EPHX3         chr19:15344091-15344419       3          0.172              0.002
GRM8          chr7:126891300-126894205      11         0.152              0.058
TMEM115       chr3:50402103-50402942        3          0.12               0.118
APCDD1L       chr20:57089460-57090237       9          0.105              0.007
EXOC3L4       chr14:103557606-103558235     4          0.038              0.082
(1) Measured as the average beta-value for CGI-H minus CGI-0 tumors
(2) Calculated using Student’s t-test
Table 5. Promoter CGI discarded from the final signature after validation by Infinium
Gene symbol   Chromosomal region (hg19)     # probes   Δ beta-value (1)   p-value (2)
B4GALNT1      chr12:58025661-58027056       11         0.003              0.32
KDM2B         chr12:122016170-122017693     5          -0.01              0.183
SMOC2         chr6:168841818-168843100      7          -0.029             0.077
(1) Measured as the average beta-value for CGI-H minus CGI-0 tumors
(2) Calculated using Student’s t-test
The validated signature regions robustly distinguished 5/6 of the CGI-H tumors from the CGI-0 tumors (Figure 7B). The aggregate signature composed of these 13 promoter CGI likewise distinguished CGI-H from CGI-0 tumors (mean average beta-value of 0.47 vs. 0.26, Student’s t-test p<0.05)
(Figure 7C).
Methylation signature stratifies endometrioid endometrial tumors by methylation phenotype and distinguishes tumors from normal controls
To test the ability of the signature to stratify tumors by methylation phenotype in an independent cohort, methylation profiles for 203 endometrioid endometrial carcinomas from The Cancer Genome Atlas (TCGA) were examined (hereafter referred to as the test set). These methylation profiles were generated using the Infinium HumanMethylation450 beadchip, and represent all endometrioid tumors with methylation data on this platform from the previously published 373 endometrial tumor cohort [75]. The test set was
considerably larger than the discovery set and representative of the clinical presentation of the disease, making it well-suited for follow-up analyses.
The signature was originally generated by comparing groups of tumors that showed the largest differences in overall promoter CGI methylation. To assess whether the signature methylation score correlated with overall promoter CGI methylation in this independent dataset, the two parameters were compared with a scatter plot (Figure 8A). Signature methylation score showed a strong linear correlation with overall promoter CGI methylation
(Pearson r=0.83, p<0.001), suggesting that the methylation signature can be used to estimate overall promoter CGI methylation in endometrioid endometrial cancer. This result highlights that endometrial CIMP is a genome-wide phenomenon rather than isolated to a specific set of loci.
[Figure 8 (image not reproduced). Panel A: scatter plot of signature methylation (average beta-value) against promoter CGI methylation (average beta-value); Pearson r=0.83, p<0.001; fit line y=4.6741x-0.4908. Panel B: signature average beta-value by methylation cluster, MC1 (n=56), MC2 (n=66), MC3 (n=25), MC4 (n=56). Panel C: signature average beta-value for unmatched normal (n=11), tumor (n=203), matched normal (n=13), and matched tumor (n=13) samples; p<0.001 for both comparisons. Panel D: ROC curve (sensitivity vs. 1 - specificity) for methylation score, A = 0.99, p<0.001.]
Figure 8. Methylation signature stratifies endometrioid endometrial tumors by methylation phenotype and distinguishes tumors from normal controls. (A) Signature methylation score shows a strong linear relationship with
overall promoter CGI methylation. Infinium methylation data were
analyzed for 203 endometrioid endometrial tumors from The Cancer
Genome Atlas. Average methylation for the 13 signature promoter CGI
from Figure 7 were compared against average methylation for all
promoter CGI.
(B) Methylation signature stratifies endometrioid endometrial tumors by
methylation phenotype. Tumors from (A) were classified into
methylation clusters as previously published (Nature 497, 67–73 (02
May 2013)), and their signature average beta-values plotted as a
standard box plot. A Kruskal-Wallis test was performed with Dunn’s
post-hoc on all pairwise comparisons (p<0.05 for all comparisons
except MC3 vs. MC4). Whiskers denote 10th and 90th percentiles.
(C) Methylation signature distinguishes tumors from normal controls.
Methylation score was plotted for the tumors in (A) alongside 11
unmatched normal controls, as well as for tissue from matched tumor and
adjacent normal samples, all from The Cancer Genome Atlas. A
Wilcoxon rank sum test was used to compare unmatched normals and
tumors, while a paired Student’s t-test was used to compare matched
normals and matched tumors.
(D) Methylation signature distinguishes tumors from normal controls with
high sensitivity and specificity. Matched and unmatched normals from
(C) were pooled, and an ROC curve was generated. Sensitivity
represents the true positive rate for tumors at a given signature score
threshold (% tumors correctly categorized as tumors), while 1 - specificity
represents the false positive rate (% normal controls incorrectly
categorized as tumors).
Methylation clusters were previously identified in the test set: one very highly methylated (MC1), one highly methylated (MC2), one with intermediate methylation (MC4), and one with methylation similar to normal (MC3).
Tumors in each cluster were scored using the 13-region signature and plotted
(Figure 8B). Signature score distinguished all methylation clusters except
MC3 and MC4. Mean signature score for MC1 was similar to the score for the
CGI-H group in the discovery set (0.45 vs. 0.48, Student’s t-test p=0.44).
Methylation score of tumors was compared to matched and unmatched normal control tissue (Figure 8C). 95% (192/203) of endometrioid endometrial tumors showed a higher methylation score than unmatched normal controls (n=11), suggesting that the 13-region signature could also be useful for distinguishing tumors from normal tissue. Likewise, >90% of tumors
(12/13) displayed a higher methylation score than adjacent normal tissue, representing a three-fold increase on average. The methylation signature showed a sensitivity of 0.95 +/- 0.03 and a specificity of 0.93 +/- 0.07 (95% confidence interval) for distinguishing tumors from normal tissue at a methylation score threshold of 0.16 (Figure 8D). Since previous studies have shown that cancer-specific methylated sequences could be used to identify circulating tumor DNA in blood plasma [96], a cancer-specific methylation signature could have broad applications from early diagnosis to measuring response to therapy.
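The sensitivity, specificity, and area-under-the-curve quantities used here can be sketched in Python as follows. The scores below are hypothetical and are not TCGA data; the function names are likewise illustrative. The area under the ROC curve is computed through its rank interpretation, the probability that a randomly chosen tumor scores higher than a randomly chosen normal sample, with ties counted as one half.

```python
def sensitivity_specificity(tumor_scores, normal_scores, threshold):
    """Call a sample 'tumor' when its score exceeds the threshold."""
    true_pos = sum(s > threshold for s in tumor_scores)
    true_neg = sum(s <= threshold for s in normal_scores)
    return true_pos / len(tumor_scores), true_neg / len(normal_scores)


def roc_auc(tumor_scores, normal_scores):
    """Area under the ROC curve via pairwise comparison of scores."""
    wins = 0.0
    for t in tumor_scores:
        for n in normal_scores:
            if t > n:
                wins += 1.0
            elif t == n:
                wins += 0.5
    return wins / (len(tumor_scores) * len(normal_scores))


# Hypothetical methylation scores for illustration only.
toy_tumors = [0.10, 0.20, 0.35, 0.50]
toy_normals = [0.05, 0.12, 0.18]
toy_sens, toy_spec = sensitivity_specificity(toy_tumors, toy_normals, 0.16)
toy_auc = roc_auc(toy_tumors, toy_normals)
```

Sweeping the threshold across all observed scores and plotting sensitivity against 1 - specificity traces out the ROC curve itself.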
The stratification potential of total promoter CGI methylation was also assessed (Figure 9). As measured by Infinium, aggregate promoter CGI methylation mirrored results seen with the 13-region signature, but with diminished separation. Aggregate promoter CGI methylation distinguished tumors and normal controls relatively poorly (A = 0.90, Figure 9) compared to the 13-region methylation signature (A = 0.99, Figure 8D). As signature candidates were screened in part based on their ability to differentiate tumors and normal tissue (Figure 6C), it is not surprising that the 13-region signature would better differentiate tumors and normal tissue compared to background
(all promoter CGI).
[Figure 9 (image not reproduced). Panel A: promoter CGI average beta-value by methylation cluster, MC1 (n=56), MC2 (n=66), MC3 (n=25), MC4 (n=56). Panel B: promoter CGI average beta-value for unmatched normal (n=11), tumor (n=203), matched normal (n=13), and matched tumor (n=13) samples; annotated p<0.001 and p=0.002. Panel C: ROC curve (sensitivity vs. 1 - specificity) for aggregate promoter CGI methylation, A = 0.90, p<0.001.]
Figure 9. Aggregate promoter CGI methylation shows weaker stratification potential compared to the 13-region methylation signature. (A) Aggregate promoter CGI methylation correlates with previously
published methylation clusters. Infinium methylation data were
analyzed for 203 endometrioid endometrial tumors from The Cancer
Genome Atlas, and probes lying in promoter-associated CpG islands
were averaged to estimate aggregate promoter CGI methylation.
Tumors were classified into methylation clusters as previously
published [75], and average beta-values plotted as a standard box plot.
A Kruskal-Wallis test was performed with Dunn’s post-hoc on all
pairwise comparisons (p<0.05 for all comparisons except MC3 vs.
MC4). Whiskers denote 10th and 90th percentiles.
(B) Aggregate promoter CGI methylation distinguishes tumors from normal
controls. Methylation score was plotted for the tumors in (A) alongside
11 unmatched normal controls, as well as for tissue from matched tumor
and adjacent normal samples, all from The Cancer Genome Atlas. A
Wilcoxon rank sum test was used to compare unmatched normals and
tumors, while a paired Student’s t-test was used to compare matched
normals and matched tumors.
(C) Aggregate promoter CGI methylation distinguishes tumors from normal
controls, albeit with reduced sensitivity and specificity compared to the
13-region methylation signature. Matched and unmatched normals
from (B) were pooled, and an ROC curve was generated. Sensitivity
represents the true positive rate for tumors at a given signature score
threshold (% tumors correctly categorized as tumors), while 1 - specificity
represents the false positive rate (% normal controls incorrectly
categorized as tumors). Distinguishing power, represented by area
under the curve, was lower for aggregate promoter CGI methylation (A
= 0.90) compared to the 13-region methylation signature (A = 0.99, see
Figure 8D).
As each promoter CGI in the signature was associated with a gene, we assessed whether increasing promoter methylation impacted mRNA expression in test set tumors. Methylation of promoter CGI is often but not always associated with decreased gene expression. 7/13 genes showed a significant association between increased methylation and decreased gene expression (Table 6). EPHX3 expression notably decreased with increasing promoter methylation (Figure 10).
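The rank correlation used for this comparison can be sketched in pure Python. The implementation below computes Spearman's coefficient as the Pearson correlation of average ranks, which handles tied values; the paired beta-value and TPM inputs are hypothetical and the function names are illustrative.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def pearson(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation as Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical promoter CGI beta-values and expression (TPM) for one gene.
toy_beta = [0.1, 0.2, 0.4, 0.6, 0.9]
toy_tpm = [120, 80, 90, 10, 2]
toy_rho = spearman(toy_beta, toy_tpm)
```

A rank-based coefficient suits this comparison because expression values span several orders of magnitude and the relationship need not be linear.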
Table 6. Correlation between promoter CGI methylation and gene expression in TCGA tumors
Gene symbol   Upper quartile TPM (1)   Spearman r   p-value
EPHX3         47.2                     -0.601       2E-07
TBX18         448.0                    -0.494       2E-07
TNFSF11       30.1                     -0.379       3.48E-07
VILL          275.8                    -0.37        6.74E-07
FGF12         21.1                     -0.365       9.75E-07
ASCL1         6.1                      -0.243       0.00138
APCDD1L       18.9                     -0.238       0.00172
SVEP1         397.8                    -0.118       0.122
NODAL         10.5                     -0.113       0.141
GRM8          38.1                     -0.108       0.159
EXOC3L4       23.3                     -0.0245      0.75
TMEM115       1701.9                   0.0192       0.802
OR10H2        0.8                      0.164        0.0318
(1) Transcripts per million, as calculated by RSEM
[Figure 10 (image not reproduced). Scatter plot of mRNA abundance (TPM, log2-scaled axis from 0.5 to 2048) against methylation (beta-value, 0 to 1).]
Figure 10. Methylation of EPHX3 is associated with decreased gene expression. RNA expression vs. promoter CGI methylation of EPHX3 was plotted for 172 endometrioid endometrial tumors from The Cancer Genome Atlas. A linear fit line (r2=0.4) depicts the inverse relationship between RNA expression and methylation, corresponding to a Spearman correlation coefficient of r=-0.60 and p<0.001. TPM indicates transcripts per million, as calculated by RSEM.
Methylation beta-value represents the average methylation of all Infinium probes within the CGI.
High methylation score is associated with mismatch repair deficiency, high mutation rate, and low somatic copy number alteration
In colorectal cancer, CIMP is frequently associated with defects in mismatch repair [91], which manifests as microsatellite instability (MSI); in endometrial cancer, methylation-mediated silencing of the mismatch repair gene MLH1 accounts for ~80% of sporadic mismatch repair defects [97].
Silencing of MLH1 leads to an accumulation of mutations as well as abnormal
expansion and contraction of microsatellite repeats. Methylation score was compared across integrated clusters previously reported for the test set
(Figure 11A), which were conceived by integrating microsatellite instability
(MSI) status, mutation frequency, and somatic copy number alteration (SCNA) cluster data. MSI (Hyper-mutated) and Copy-number high (Serous-like) clusters showed significant differences in methylation score (median 0.39 vs.
0.25, ANOVA with Holm-Sidak post-hoc p<0.001). To elucidate specific correlates, we compared methylation score across the single parameter clusters used to generate the integrated clusters (Figure 11, B-D).
Methylation score correlated with MSI status (MSI+ vs. MSI-, median 0.40 vs.
0.27, Wilcoxon rank sum p<0.001) and mutation frequency (High vs. Low clusters, mean 0.38 vs. 0.28, ANOVA with Holm-Sidak post-hoc p<0.001).
Methylation score also distinguished the low SCNA clusters 2 and 3 from the high SCNA cluster 4 (median 0.33 and 0.39 vs. 0.24, Kruskal-Wallis with
Bonferroni-corrected Student's t-test post-hoc p<0.01), as well as low SCNA cluster 3 from very low SCNA cluster 1 (median 0.39 vs. 0.28). As promoter methylation of the mismatch repair gene MLH1 is implicated in ~80% of MSI+ sporadic endometrioid endometrial tumors [97], the strong correlation of methylation score with MSI and high mutation rate was expected and shows that the signature is capable of reproducing known CIMP correlates. As previously reported, POLE mutation dramatically increases mutation rate via a mismatch repair-independent mechanism [75]; the “Highest” cluster in Figure 11C, predominantly composed of tumors with POLE mutations, did not correlate with methylation score. The association of high methylation score
with low somatic copy number alteration suggests that methylation accumulation and pervasive copy number alteration represent distinct pathways of tumorigenesis with different molecular drivers.
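The Holm-Sidak post-hoc adjustment named in the comparisons above can be illustrated with a short sketch. It assumes the standard step-down form of the correction; the function name `holm_sidak` and the raw p-values are hypothetical and are not taken from the study.

```python
def holm_sidak(p_values):
    """Return Holm-Sidak step-down adjusted p-values in the input order."""
    m = len(p_values)
    # Indices of the p-values from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The k-th smallest p-value is corrected for the m - k + 1
        # hypotheses that have not yet been rejected (rank is k - 1).
        adj = 1.0 - (1.0 - p_values[idx]) ** (m - rank)
        # Adjusted p-values must not decrease along the sorted order.
        running_max = max(running_max, adj)
        adjusted[idx] = min(1.0, running_max)
    return adjusted


# Hypothetical raw p-values for three pairwise cluster comparisons.
raw = [0.0004, 0.03, 0.2]
adj = holm_sidak(raw)
print(adj)
```

A comparison would then be reported as significant only if its adjusted p-value falls below the chosen threshold.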
[Figure 11 plots: y-axis in all panels, signature average beta-value (0.0 to 1.0). (A) Integrated clusters: MSI (Hyper-mutated), POLE (Ultra-mutated), Copy-number high (Serous-like), Copy-number low (Endometrioid). (B) MSS (n=122) vs. MSI (n=80). (C) Mutation rate cluster: Low, High, Highest. (D) Copy number cluster: CN1, CN2, CN3, CN4.]
Figure 11. High methylation score is associated with mismatch repair deficiency, high mutation rate, and low somatic copy number alteration. Methylation score was compared among published clusters for 203 endometrioid endometrial tumors in The Cancer Genome Atlas. Parameters considered included (A) the published integrated clusters factoring in microsatellite instability (MSI) status, mutation frequency, and somatic copy number alteration frequency; (B) microsatellite instability (MSI) status; (C) mutation rate cluster; (D) copy number cluster. Statistical comparisons were performed using ANOVA with Holm-Sidak post-hoc (A, C), Wilcoxon rank-sum
(B), and Kruskal-Wallis with Bonferroni-corrected t-test or Wilcoxon post-hoc as appropriate (D). Significant differences were reported if differences
persisted across 2 additional independently-generated replicate signatures at a threshold of p<0.01.
Methylation score was further compared across other published cluster data for the test set as well as known clinical and molecular covariates of endometrial cancer. Methylation score showed no significant relationship
(threshold of p<0.01) with mRNA or miRNA expression clusters, stage, grade, relapse-free survival, BMI, or age of diagnosis (data not shown).
Methodological validity
The reproducibility of the methodology used to generate the methylation signature was verified by generating two additional replicate signatures, representing independent sets of 13 promoter CGI selected using a similar methodology. As shown in Figure 12, the replicate signatures ranked tumors in the test set similarly to the original signature, whereas a negative control signature did not
(r=0.82 and 0.89 for replicates R1 and R2 vs. original signature O, p<0.001; r=0.144 for negative control vs. original signature, p>0.01), demonstrating that the methodology underlying our methylation score is robust. Correlates with methylation score (MSI, mutation rate, copy number alteration) were subsequently validated using the replicate signatures; the reported correlates were significant across all three signatures unless otherwise noted.
[Figure 12 heat map: signatures O, R1, R2, and NC vs. samples; color scale, log2 normalized index from -2 to 2.]
Figure 12. Replicate 13-region methylation signatures rank tumors similarly. CGI-H and CGI-0 tumors were compared from the MethylCap-seq discovery set of 76 endometrioid endometrial tumors, and signatures were generated from the top differentially methylated promoter CpG islands (O: original 13-region signature, R1: replicate 1, R2: replicate 2, NC: negative control).
Candidates that showed <0.1 difference in beta-value between groups in the
Infinium technical validation set were discarded. 203 endometrioid endometrial tumors from The Cancer Genome Atlas were indexed using the average beta-value of all regions in the signature, and relative index values between replicates were compared by plotting as a normalized log2 transformed heat map. Samples were ranked by the original signature index for visual comparison. Statistical comparison of rank correlation vs. the original signature was performed using a Spearman test (r=0.82, 0.89 for replicates and p<0.001; r=0.14 for NC and p>0.01).
A signature for CIMP
Delineating a threshold is useful for classification purposes, yet can be challenging when the data resemble a continuum rather than being clearly modal. The agreement of signature-based methylation score with published methylation clusters was discussed earlier. A threshold signature score of 0.4 captures ~75% of the tumors from the very highly methylated (MC1) cluster and excludes all tumors from the less methylated MC3 and MC4 clusters
(Figure 8B), and also excludes the outlier CGI-H tumor from the discovery set
(Figure 7C). 58/203 (29%) of endometrioid tumors from the test set would be characterized as CIMP-H using this classification scheme, which would be comparable to the size of MC1 (56/203, 28%).
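As an illustration of the classification scheme described above, the following sketch applies the suggested threshold of 0.4 to a list of signature scores. The function names and the example scores are hypothetical, and treating a score exactly equal to the threshold as CIMP-H is an assumption that the text does not specify.

```python
CIMP_H_THRESHOLD = 0.4  # signature score cutoff suggested in the text


def classify_cimp(score, threshold=CIMP_H_THRESHOLD):
    """Label a tumor by its signature methylation score.

    Assumption: a score equal to the threshold is called CIMP-H.
    """
    return "CIMP-H" if score >= threshold else "not CIMP-H"


def cimp_h_fraction(scores, threshold=CIMP_H_THRESHOLD):
    """Fraction of tumors whose score is at or above the threshold."""
    calls = [classify_cimp(s, threshold) for s in scores]
    return calls.count("CIMP-H") / len(calls)


# Hypothetical scores for five tumors (not data from the study).
example_scores = [0.52, 0.41, 0.27, 0.33, 0.18]
print([classify_cimp(s) for s in example_scores])
print(cimp_h_fraction(example_scores))

# Percentages quoted in the text: 58/203 and 56/203.
print(round(100 * 58 / 203), round(100 * 56 / 203))
```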
III. Discussion
DNA methylation is increasingly being recognized as a potential biomarker with diagnostic, prognostic, and therapeutic implications [70,98].
DNA methylation is frequently perturbed in cancer [4], and in certain contexts these changes may happen early in tumorigenesis [16]. Distinctive DNA methylation profiles may also be used to molecularly subtype tumors for personalized treatment [99]. In addition, many tumor types shed DNA into bodily fluids or excretions that can be harvested non-invasively (e.g., blood, urine, feces) [100,101]. Unlike mRNA which is susceptible to degradation and is often expressed in a highly time-dependent manner, DNA methylation is molecularly stable and stably inherited during cellular division. These characteristics make DNA methylation a promising biomarker in cancer. In
addition, CIMP tumors show highly dysregulated methylation that may make these tumors particularly vulnerable to demethylation therapies [92].
While CIMP has been defined in other tumor types, few studies have leveraged genome-wide methylome profiling techniques to examine CIMP in endometrial cancer [75,102,103]. Armed with an alternative methylome profiling technique that measures regional methylation over CGI rather than assuming regional methylation based on a few data points, we set out to develop an analysis method that could identify samples with large-scale methylation differences, and then pinpoint regions that could be used in a signature. This method would leverage the increased CpG coverage of enrichment-based methylation profiling vs. Infinium, yet would yield a signature that could be applied to Infinium datasets, especially datasets in the
Cancer Genome Atlas.
A few biological insights arose from our study. Our analysis was based on the premise that CIMP could be identified based on aggregate methylation, which is an unusual approach for defining CIMP. We show that an approach for identifying CIMP based on aggregate methylation shows general agreement with a clustering-based approach (Figure 8B), and furthermore show that the methylation score yielded by our signature reflects aggregate
CpG island methylation (Figure 8A). In addition, most tumors show more aggregate CGI methylation than normal controls (Figure 5, Figure 8C-D), suggesting that promoter methylation is more prevalent in endometrioid endometrial cancer than previously thought. This is a potentially exciting insight, as it increases the likelihood that methylation could be used as a
biomarker to diagnose and track endometrial cancer, or that this aberrant methylation could serve as a useful therapeutic target.
An advantage of comparing aggregate methylation was that it was relatively straightforward to implement compared to unsupervised clustering approaches, which require complex data normalization and correction for batch effects and also require considerable trial and error to refine.
Nonetheless, we had to account for technical biases that could make aggregate methylation appear very low or very high. We excluded samples with very poor CpG enrichment, which indicates a strong possibility that the enrichment for methylated fragments was poor or incomplete. However, we cannot rule out the possibility that in doing so we excluded samples with very low CGI methylation (significantly below that of normal tissue); our data therefore do not exclude the possibility of a CpG island demethylator phenotype in endometrioid endometrial cancer. We further reassured ourselves that no significant enrichment bias was present in the remaining data by comparing mitochondrial methylation signal against genomic methylation signal. An enrichment bias would be expected to affect both signals, whereas biologically the methylation of the two compartments would be expected to be unrelated; no significant correlation was observed between mitochondrial and genomic methylation amongst the 76 samples (data not shown). The validity of our approach for identifying CIMP in endometrioid endometrial cancer is also supported by data showing that our CIMP signature correlates extremely well with aggregate promoter CGI methylation using a different methylation platform and an independent dataset (Figure 8A).
Methylation in normal tissues is highly tissue-specific, and the exact methylation perturbations in different tumor types tend to vary based on cell of origin [104]. Thus there is no single CIMP signature that can identify CIMP in all cancers; instead CIMP must be examined individually for each tumor type, and broad methylation dysregulation may be more or less common in different tumor types. Nonetheless, some areas of the genome may be more susceptible to methylation across tissue types. We identified three CGI from our 13-region signature associated with genes known to be methylated in other tumor types: EPHX3 (alias: ABHD9), FGF12, and ASCL1. Methylation of EPHX3 in primary prostate malignancies was associated with early recurrence [105]. Methylation of FGF12 was observed in five colorectal tumors but not in matched controls [106]. ASCL1 was also frequently methylated in 125 colorectal tumors compared to 29 normal controls [104], suggesting that ASCL1 and FGF12 methylation may have diagnostic potential in both colorectal and endometrial cancers.
A 2013 study by the TCGA Consortium was the basis for the in silico analysis of our signature. This study profiled methylation of 373 endometrial tumors using the Infinium beadchip [75]. Unsupervised hierarchical clustering was used on the most variable probes to segregate tumors by methylation phenotype. Two methylation clusters were identified that showed overall hypermethylation. However, a signature for recapitulating these clusters was not provided, as methylome analysis was not the focus of the study.
Our study corroborates and supplements the excellent 2013 study by the TCGA Consortium. Their identification of CIMP relied on methylation data
obtained from the Infinium platform. The Infinium platform samples CpGs from across the genome but is not a true whole genome methylome profiling method. The Infinium HumanMethylation450 platform used for the TCGA study features 485,577 probes, of which 113,521 lie within promoter CGI, covering 16,079/16,394 (98%) of promoter CGI in the human genome with an average of 7 probes per CGI. These promoter CGI have an average length of
904bp and 84 CpGs per island; therefore the Infinium platform measures methylation of 8% of CpGs in promoter CGI. While Illumina suggests that methylation of Infinium probes is representative of regional methylation, it may be inaccurate to assume that representative probes in normal tissues are also representative in tumors with profound dysregulation of methylation patterns genome-wide, especially when the analysis is expanded to the scope of a genome-wide study with no additional validation. Our methylation signature is based on genome-wide promoter CGI data collected using MethylCap-seq; the agreement of our methylation score data with the clusters in the TCGA
Consortium study validates their method as well as our own. Our method has an additional advantage. Identifying endometrial CIMP samples using our method requires measuring methylation of only 82 CpGs corresponding to 13 regions as opposed to the 785 probes scattered throughout the genome used for clustering in the TCGA Consortium study. Our signature could likely be further pared down to facilitate high-throughput screening without sacrificing accuracy. In addition, we provide a suggested methylation score threshold of
0.4 for identifying arbitrary endometrial CIMP tumors, whereas the TCGA
Consortium identified CIMP-like tumors in their dataset, but did not provide a method for identifying CIMP in arbitrary samples.
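The probe coverage figures quoted earlier in this paragraph follow from simple ratios, reproduced in the sketch below. The variable names are illustrative, and the final ratio assumes that each probe interrogates a single CpG and that the average of 84 CpGs per island applies to the covered islands.

```python
# Counts quoted in the text for the Infinium HumanMethylation450 platform.
probes_in_promoter_cgi = 113521
promoter_cgi_covered = 16079
promoter_cgi_total = 16394
mean_cpgs_per_cgi = 84

# Average number of probes per covered promoter CGI (about 7).
probes_per_cgi = probes_in_promoter_cgi / promoter_cgi_covered

# Share of promoter CGI with at least one probe (about 98%).
cgi_coverage = promoter_cgi_covered / promoter_cgi_total

# Share of CpGs within a promoter CGI that are measured (about 8%),
# assuming one CpG per probe.
cpg_fraction = probes_per_cgi / mean_cpgs_per_cgi

print(f"{probes_per_cgi:.2f} probes per CGI")
print(f"{cgi_coverage:.1%} of promoter CGI covered")
print(f"{cpg_fraction:.1%} of CpGs measured")
```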
A potential drawback of our approach is that any threshold we might use to distinguish CIMP-H and CIMP-0 tumors would be arbitrary; in fact, our observations suggest that CIMP in endometrioid endometrial cancer could be viewed as a continuum rather than as a discrete phenomenon (Figure 5B,
Figure 8C). We offer a methylation score threshold of 0.4 for distinguishing
CIMP-H tumors, but pending comparison to other data sets, this recommendation remains unvalidated.
It is important to recall that the methylation of 13 CpG islands in and of itself is not CIMP, but rather that methylation of these regions correlated with broad gains in CpG island methylation in both a discovery and independent test set. A methylation signature could be useful for diagnosis and classification of cancer, yet indicate no underlying methylator phenotype, and vice versa.
IV. Conclusion
In summary, we used methylome profiling techniques to stratify tumors by overall promoter CGI methylation, identified a signature to reproduce this stratification, and verified that stratifying tumors using this signature reproduced known characteristics of CIMP tumors (e.g., the association with microsatellite instability). Furthermore, we demonstrated methodological robustness by repeating the stratification with two additional signatures derived using the same methodology.
More generally, we demonstrate an approach for translating methylome profiling findings to the Infinium platform, which will become increasingly important as publicly available methylation datasets (e.g., those in The Cancer Genome Atlas, TCGA) mature. We also illustrate a general method for identifying methylator phenotypes that may be applied to other tumor types; this method does not rely on unsupervised clustering, which is sensitive to technical artifacts such as batch effects [107]. In addition, our results suggest that widespread promoter methylation is more prevalent in endometrioid endometrial cancer than previously thought, and that promoter methylation could be a useful marker for distinguishing tumors and normal tissue. We hope that our methylation signature will catalyze further investigation of the methylator phenotype in endometrioid endometrial cancer to better understand the mechanisms and consequences of wide-scale epigenetic dysregulation.
V. Methods
Patient samples
76 primary human endometrioid endometrial cancer and 12 nonmalignant endometrial samples were analyzed from a previously published cohort [108].
Cohort characteristics are shown in Appendix B1. A sequencing read summary is provided in Appendix B2. All studies involving human endometrial cancer samples were approved by the Human Studies
Committee at Washington University and at The Ohio State University.
MethylCap-seq quality control
MethylCap-seq quality control was implemented as previously described
[109], and 14 of 102 samples that showed evidence of poor methylated fragment enrichment or poor sequencing reproducibility were excluded from analysis. This method was demonstrated to reduce noise in methylation signal and improve the ability to discriminate between tumors and normal tissue.
MethylCap-seq data analysis
Sequence files were aligned and processed as previously described [109].
Reads were extended to the average fragment length and the resulting count distribution was normalized against the total aligned reads by conversion to reads per million (RPM). Differentially methylated promoter CGI were identified by performing a Wilcoxon rank sum test for each CGI across the two sample groups being considered. Results were adjusted for multiple comparisons by setting a false discovery rate (FDR) cutoff of 0.05.
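The normalization and testing steps described above can be sketched as follows. This is a minimal illustration rather than the pipeline used in the study: the rank-sum p-value uses the normal approximation without tie or continuity correction, the Benjamini-Hochberg procedure is assumed as the FDR method (the text does not name one), and all function names and example values are hypothetical.

```python
import math


def to_rpm(counts, total_aligned_reads):
    """Scale raw per-region read counts to reads per million aligned reads."""
    return [c * 1_000_000 / total_aligned_reads for c in counts]


def average_ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def rank_sum_p(group_a, group_b):
    """Two-sided Wilcoxon rank-sum p-value (normal approximation)."""
    n1, n2 = len(group_a), len(group_b)
    ranks = average_ranks(list(group_a) + list(group_b))
    w = sum(ranks[:n1])  # rank sum of the first group
    mean = n1 * (n1 + n2 + 1) / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (w - mean) / sd
    return math.erfc(abs(z) / math.sqrt(2))


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical RPM values for two promoter CGI in two groups of five tumors.
cgi_rpm = {
    "CGI_1": ([1, 2, 3, 4, 5], [6, 7, 8, 9, 10]),
    "CGI_2": ([3, 5, 4, 6, 2], [4, 3, 6, 5, 2]),
}
p_raw = [rank_sum_p(a, b) for a, b in cgi_rpm.values()]
p_adj = benjamini_hochberg(p_raw)
significant = [name for name, q in zip(cgi_rpm, p_adj) if q < 0.05]
print(significant)
```

With only five samples per group an exact test would normally be preferred; the normal approximation is used here only to keep the sketch free of external dependencies.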
Methylation was categorized by genomic feature as follows: CpG islands
(CGI, as defined in the UCSC genome browser), promoters (2 kb in length,
1 kb upstream and downstream of the TSS), CGI shores (200 bp to 2 kb distant from both ends of each CGI), and the first exon of RefSeq genes. CGI were further subdivided by proximity to promoters (within 10 kb upstream or 1 kb downstream of a 2 kb promoter), and 2 kb promoters were subdivided by overlap with CGI.
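The feature definitions above can be expressed as interval arithmetic, as in the sketch below. Several details are assumptions made for illustration: coordinates are treated as 0-based half-open intervals, proximity to a promoter is measured from the promoter edges and satisfied by any overlap with the extended window, and the function names are hypothetical.

```python
def promoter_window(tss, flank=1000):
    """2 kb promoter: 1 kb on either side of the TSS.

    The window is symmetric, so it does not depend on strand.
    """
    return (max(0, tss - flank), tss + flank)


def cgi_shores(cgi_start, cgi_end, inner=200, outer=2000):
    """Shore intervals lying 200 bp to 2 kb beyond each end of a CGI."""
    left = (max(0, cgi_start - outer), max(0, cgi_start - inner))
    right = (cgi_end + inner, cgi_end + outer)
    return left, right


def is_promoter_cgi(cgi_start, cgi_end, tss, strand, up=10000, down=1000):
    """True if the CGI overlaps the promoter extended 10 kb upstream
    and 1 kb downstream (direction taken relative to the gene's strand)."""
    p_start, p_end = promoter_window(tss)
    if strand == "+":
        lo, hi = p_start - up, p_end + down
    else:
        lo, hi = p_start - down, p_end + up
    return cgi_start < hi and cgi_end > lo


# Hypothetical coordinates.
print(promoter_window(5000))
print(cgi_shores(10000, 11000))
print(is_promoter_cgi(1500, 2500, 12000, "+"))
```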
Infinium validation of methylation signature candidates
11 of 76 tumors were chosen for technical validation using the Infinium
HumanMethylation450 beadchip platform, a well-validated bisulfite-based
method for assessing methylation of individual CpGs genome-wide. The assay was performed according to manufacturer protocol by the University of
Southern California Epigenome Center. Methylation was reported using beta-values, which represent the fraction of DNA fragments that were methylated at a given CpG site.
Computation of methylation score using the 13 promoter CGI signature
Methylation score was computed by taking the average of the beta-values for all probes within a promoter CGI, then averaging the result across the 13 promoter CGI in the signature. The final signature comprised a total of 88
Infinium HumanMethylation450 probes.
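The two-stage averaging described above can be sketched as follows. The function name, the region labels, and the beta-values are hypothetical; skipping regions with no usable probes is an assumption, since the text does not describe the handling of missing values.

```python
from statistics import mean


def methylation_score(betas_by_cgi):
    """Average probe beta-values within each CGI, then across CGI.

    betas_by_cgi maps a CGI label to the list of beta-values of the
    probes that fall within that CGI.
    """
    per_cgi = [mean(betas) for betas in betas_by_cgi.values() if betas]
    return mean(per_cgi)


# Hypothetical beta-values for three regions of one tumor.
tumor = {
    "CGI_A": [0.2, 0.4],
    "CGI_B": [0.8],
    "CGI_C": [0.1, 0.2, 0.3],
}
score = methylation_score(tumor)
print(round(score, 3))
```

Averaging within each CGI first gives every region equal weight regardless of how many probes it contains; pooling all six probes in this example would instead give about 0.333.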
In silico analysis of TCGA endometrioid endometrial tumors
Methylation was analyzed for 203 endometrioid endometrial tumors from the original published Cancer Genome Atlas cohort of 373 endometrial tumors
[75]. The other 170 tumors either lacked Infinium
HumanMethylation450 data or were not of the endometrioid subtype and were excluded from analysis. Some analyses assessed fewer than 203 samples due to gaps in data availability for each assay. Methylation was assessed using Level 3 data from The Cancer
Genome Atlas Data Portal, while clinical and molecular correlate data were gathered from cBioPortal for Cancer Genomics (Memorial Sloan Kettering
Cancer Center).
Replicate signature analysis
To demonstrate the reproducibility of our method for identifying tumors with a
CpG island methylator phenotype, two additional 13-region signatures were compiled from the original list of top differentially methylated promoter CGI
between CGI-H and CGI-0 tumors in the Discovery set. Regions that had already been considered for the original signature were excluded from this analysis. Mirroring the technical validation of the original signature, candidate regions that showed <0.1 difference in average beta-value between groups in the Infinium technical validation set were discarded. An additional negative control signature was populated with the 13 promoter CGI that showed the least difference in methylation between groups in the discovery set (as determined by fold change). Endometrioid endometrial tumors from the test set were indexed using all four signatures, and methylation score was computed using the average beta-value of the regions in each signature.
Rank correlation of tumor methylation scores between replicate signatures and the original signature was compared using a Spearman test.
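The rank correlation between signatures can be illustrated with the sketch below, which computes Spearman's coefficient as the Pearson correlation of average ranks. It computes only the coefficient, not the associated p-value, and the function names and scores are hypothetical.

```python
import math


def average_ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical methylation scores for five tumors under two signatures.
original = [0.10, 0.25, 0.40, 0.55, 0.70]
replicate = [0.12, 0.30, 0.28, 0.60, 0.66]
rho = spearman(original, replicate)
print(round(rho, 2))
```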
Chapter 5. Summary and Discussion
I. Summary
The goals of this project were:
1) Develop and validate a MethylCap-seq quality control module to identify and exclude samples with spurious methylation data.
2) Develop a method to identify putative CIMP tumors using the MethylCap-seq platform.
3) Describe the differences between putative CIMP tumors using the Infinium platform.
4) Compare agreement in CIMP classification between my method based on
MethylCap-seq and the published clustering method based on Infinium.
5) Demonstrate a signature that could classify CIMP tumors prospectively without a full methylome profiling experiment.
Hypothesis: A combined methylation analysis method using MethylCap-seq and Infinium can be used to define CIMP de novo in endometrial cancer.
Chapter 3 outlines and validates the quality control module that was developed for this project. The validity of a MethylCap-seq experiment is dependent on enrichment of methylated fragments prior to sequencing. A failure in enrichment invalidates any downstream data, and therefore identifying such failures is vital. 203 lanes of sequencing data were generated for 101 unique samples. 43 lanes failed QC, representing 21 unique samples.
The QC module excluded samples with noisy methylation signal, increased a measure of sequencing reproducibility, and increased power to detect differentially methylated regions.
The subsequent goals are addressed in Chapter 4. Analysis with
MethylCap-seq identified a wide distribution of total promoter CpG island methylation among endometrioid endometrial tumors, with normal controls showing similar methylation to the lower end of the tumor methylation spectrum. The CpG island methylator phenotype (CIMP) is often regarded as a discrete phenomenon occurring in a specific subset of genomic regions, but
I hypothesized that comparing the two ends of the tumor methylation spectrum would also be a valid starting point to identify CIMP.
To test my hypothesis, I had to validate a putative CIMP signature's ability to differentiate "true positives" and "true negatives". The easiest way to do this was to compare against methylation data from a large published cohort where CIMP had been identified using a traditional method. TCGA provided this opportunity, but used a different methylation platform. Therefore
I had to describe the differences between putative CIMP tumors and non-
CIMP control tumors in my dataset using the same platform as the TCGA:
Infinium. To this end, I sampled tumors on the high and low ends of the methylation spectrum and identified a set of loci that distinguished them, using normal controls to screen for loci that may have undergone cancer-associated methylation gains. I also profiled this subset of tumors with Infinium and discarded the loci that were not differentially methylated between groups, as determined by Infinium. This also served as a technical validation to show
that MethylCap-seq and Infinium were yielding similar results, which helped rule out biases that could have cast doubt on my methodology.
A 13-region methylation signature for identifying CIMP, composed of
82 Infinium probes, emerged from this analysis. Yet since the original data were a continuum, I lacked a binary method to classify CIMP. Instead of drawing an arbitrary threshold, I chose to compare against correlates from
TCGA using the entire spectrum of the signature methylation score and look for the expected differences. Signature methylation score differentiated the previously published methylation clusters, demonstrating the agreement between my method and a traditional clustering method. In addition, the typical CIMP correlates emerged: microsatellite instability, high mutation rate, and low somatic copy number alteration. Two alternative signatures generated using the same methodology showed a similar result, while a negative control signature did not.
The final goal was for the methylation signature to be useful for classifying CIMP tumors prospectively. The signature is composed of 82
CpGs spread across 13 genomic regions, and therefore these markers are measurable without a full methylome profiling study. In addition, measurements of the methylation state of individual CpGs can be performed with high technical reliability and reproducibility using relatively cost-effective methods such as small-scale beadchip arrays and bisulfite pyrosequencing.
My method by its nature does not lend itself to binary calling of "CIMP" or "not
CIMP", but I provide a cutoff threshold signature score based on comparison with TCGA methylation clusters for reference and future validation.
II. Discussion
CIMP appears to have multiple definitions in the literature, depending on the context of the particular study. In some studies, CIMP is regarded as a reproducible pattern of methylated promoter CpG islands, similar to what I term a "signature". Critics have questioned whether the term itself has meaning or whether there is any consistent biology behind the phenomenon.
I strictly define CIMP in my study as a genome-wide phenomenon that must be initially identified using a genome-wide approach. I argue that such a definition, likely involving broad perturbation of major epigenetic pathways, is most likely to have a consistent biological basis across tissue types--even if the exact regions affected vary greatly across different tumor types.
Methylation in normal tissues is highly tissue-specific, and the exact methylation perturbations in different tumor types tend to vary based on cell of origin [104]. Thus there is no single CIMP signature that can identify CIMP in all cancers; instead CIMP must be examined individually for each tumor type, and broad methylation dysregulation may be more or less common in different tumor types. Nonetheless, some areas of the genome may be more susceptible to methylation across tissue types. We identified three CGI from our 13-region signature associated with genes known to be methylated in other tumor types: EPHX3 (alias: ABHD9), FGF12, and ASCL1. Methylation of EPHX3 in primary prostate malignancies was associated with early recurrence [105]. Methylation of FGF12 was observed in five colorectal tumors but not in matched controls [106]. ASCL1 was also frequently methylated in 125 colorectal tumors compared to 29 normal controls [104],
suggesting that ASCL1 and FGF12 methylation may have diagnostic potential in both colorectal and endometrial cancers.
The method I used to analyze MethylCap-seq data differs from other published methods in that no attempt was made to normalize CpG density differences across the genome. With enrichment-based methods for profiling methylation like MethylCap-seq, areas with greater CpG density have more opportunities for methylation and therefore a higher likelihood of being enriched--independent of the ratio of methylated CpGs to unmethylated
CpGs. This methodological discrepancy complicates comparison with bisulfite-based methodologies, which measure the methylation rate of a given CpG in the sampled cell population. Thus CpG density normalization is typically performed to make data from enrichment-based methylation profiling techniques look more similar to that obtained from bisulfite-based techniques.
Doing so can streamline analysis pipelines, since normalized data could theoretically be analyzed in a single platform-independent manner.
Normalization is also required if meaningful comparisons of methylation are to be made between two different loci in the same sample. However, CpG density normalization is computationally complex and was not necessary in my study, as I was interested in comparing methylation between the same loci in many samples (vertically) rather than multiple loci in the same sample
(horizontally). Skipping CpG density normalization could have potentially biased the top differentially methylated regions that composed my final signature towards regions with higher CpG density. The regions were subsequently screened with Infinium to discard the loci that were not
differentially methylated between both platforms. As the signature regions were treated as markers, a bias towards higher CpG density would not be undesirable, and in fact the observed bias was small (6%, 94 CpGs per island on average among regions in the final signature vs. 89 among all promoter
CGI, as defined and interrogated by our particular method).
References
1. Berger SL (2007) The complex language of chromatin regulation during
transcription. Nature 447: 407-412.
2. Robertson KD, Wolffe AP (2000) DNA methylation in health and disease.
Nat Rev Genet 1: 11-19.
3. Yang X, Yan L, Davidson NE (2001) DNA methylation in breast cancer.
Endocr Relat Cancer 8: 115-127.
4. Esteller M (2008) Epigenetics in cancer. N Engl J Med 358: 1148-1159.
5. Coolen MW, Stirzaker C, Song JZ, Statham AL, Kassir Z, et al. (2010)
Consolidation of the cancer genome into domains of repressive
chromatin by long-range epigenetic silencing (LRES) reduces
transcriptional plasticity. Nat Cell Biol 12: 235-246.
6. Hatziapostolou M, Iliopoulos D Epigenetic aberrations during oncogenesis.
Cellular and Molecular Life Sciences: 1-22.
7. Baylin SB (2005) DNA methylation and gene silencing in cancer. Nat Clin
Pract Oncol 2 Suppl 1: S4-11.
8. Ferreri AJ, Dell'Oro S, Capello D, Ponzoni M, Iuzzolino P, et al. (2004)
Aberrant methylation in the promoter region of the reduced folate
carrier gene is a potential mechanism of resistance to methotrexate in
primary central nervous system lymphomas. Br J Haematol 126: 657-
664.
9. Nakayama M, Wada M, Harada T, Nagayama J, Kusaba H, et al. (1998)
Hypomethylation status of CpG sites at the promoter region and
overexpression of the human MDR1 gene in acute myeloid leukemias.
Blood 92: 4296-4307.
10. Strathdee G, MacKean MJ, Illand M, Brown R (1999) A role for
methylation of the hMLH1 promoter in loss of hMLH1 expression and
drug resistance in ovarian cancer. Oncogene 18: 2335-2341.
11. You JS, Jones PA (2012) Cancer genetics and epigenetics: two sides of
the same coin? Cancer Cell 22: 9-20.
12. Nightingale KP, O'Neill LP, Turner BM (2006) Histone modifications:
signalling receptors and potential elements of a heritable epigenetic
code. Curr Opin Genet Dev 16: 125-136.
13. Margueron R, Trojer P, Reinberg D (2005) The key to development:
interpreting the histone code? Curr Opin Genet Dev 15: 163-176.
14. Cedar H, Bergman Y (2009) Linking DNA methylation and histone
modification: patterns and paradigms. Nat Rev Genet 10: 295-304.
15. Sparmann A, van Lohuizen M (2006) Polycomb silencers control cell fate,
development and cancer. Nat Rev Cancer 6: 846-856.
16. Laird PW (2003) The power and the promise of DNA methylation markers.
Nat Rev Cancer 3: 253-266.
17. Szyf M (2009) Epigenetics, DNA methylation, and chromatin modifying
drugs. Annu Rev Pharmacol Toxicol 49: 243-263.
18. van Agthoven T, van Agthoven TL, Dekker A, Foekens JA, Dorssers LC
(1994) Induction of estrogen independence of ZR-75-1 human breast
cancer cells by epigenetic alterations. Mol Endocrinol 8: 1474-1483.
19. Hegi ME, Diserens AC, Gorlia T, Hamou MF, de Tribolet N, et al. (2005)
MGMT gene silencing and benefit from temozolomide in glioblastoma.
N Engl J Med 352: 997-1003.
20. Bock C (2012) Analysing and interpreting DNA methylation data. Nat Rev
Genet 13: 705-719.
21. Laird PW (2010) Principles and challenges of genomewide DNA
methylation analysis. Nat Rev Genet 11: 191-203.
22. Trimarchi MP, Mouangsavanh M, Huang TH (2011) Cancer epigenetics: a
perspective on the role of DNA methylation in acquired endocrine
resistance. Chin J Cancer 30: 749-756.
23. Bock C, Tomazou EM, Brinkman AB, Muller F, Simmer F, et al. (2010)
Quantitative comparison of genome-wide DNA methylation mapping
technologies. Nat Biotechnol 28: 1106-1114.
24. Harris RA, Wang T, Coarfa C, Nagarajan RP, Hong C, et al. (2010)
Comparison of sequencing-based methods to profile DNA methylation
and identification of monoallelic epigenetic modifications. Nat
Biotechnol 28: 1097-1105.
25. Robinson MD, Statham AL, Speed TP, Clark SJ (2010) Protocol matters:
which methylome are you actually studying? Epigenomics 2: 587-598.
26. Chavez L, Jozefczuk J, Grimm C, Dietrich J, Timmermann B, et al. (2010)
Computational analysis of genome-wide DNA methylation during the
differentiation of human embryonic stem cells along the endodermal
lineage. Genome Res 20: 1441-1450.
27. Down TA, Rakyan VK, Turner DJ, Flicek P, Li H, et al. (2008) A Bayesian
deconvolution strategy for immunoprecipitation-based DNA methylome
analysis. Nat Biotechnol 26: 779-785.
28. Rodriguez B, Frankhouser D, Murphy M, Trimarchi M, Tam HH, et al.
(2012) A Scalable, Flexible Workflow for MethylCap-Seq Data
Analysis. BMC Genomics.
29. Lan X, Adams C, Landers M, Dudas M, Krissinger D, et al. (2011) High
resolution detection and analysis of CpG dinucleotides methylation
using MBD-Seq technology. PLoS One 6: e22226.
30. Rao X, Evans J, Chae H, Pilrose J, Kim S, et al. (2012) CpG island shore
methylation regulates caveolin-1 expression in breast cancer.
Oncogene.
31. Serre D, Lee BH, Ting AH (2010) MBD-isolated Genome Sequencing provides a
high-throughput and comprehensive survey of DNA methylation in the
human genome. Nucleic Acids Res 38: 391-399.
32. Li N, Ye M, Li Y, Yan Z, Butcher LM, et al. (2010) Whole genome DNA
methylation analysis based on high throughput sequencing technology.
Methods 52: 203-212.
33. Nair SS, Coolen MW, Stirzaker C, Song JZ, Statham AL, et al. (2011)
Comparison of methyl-DNA immunoprecipitation (MeDIP) and methyl-
CpG binding domain (MBD) protein capture for genome-wide DNA
methylation analysis reveal CpG sequence coverage bias. Epigenetics
6: 34-44.
34. Bogdanovic O, Long SW, van Heeringen SJ, Brinkman AB, Gomez-
Skarmeta JL, et al. (2011) Temporal uncoupling of the DNA methylome
and transcriptional repression during embryogenesis. Genome Res 21:
1313-1327.
35. Xu Y, Hu B, Choi AJ, Gopalan B, Lee BH, et al. (2012) Unique DNA
methylome profiles in CpG island methylator phenotype colon cancers.
Genome Res 22: 283-291.
36. Kim M, Kang TW, Lee HC, Han YM, Kim H, et al. (2011) Identification of
DNA methylation markers for lineage commitment of in vitro
hepatogenesis. Hum Mol Genet 20: 2722-2733.
37. Brenet F, Moh M, Funk P, Feierstein E, Viale AJ, et al. (2011) DNA
methylation of the first exon is tightly linked to transcriptional silencing.
PLoS One 6: e14524.
38. Park JH, Park J, Choi JK, Lyu J, Bae MG, et al. (2011) Identification of
DNA methylation changes associated with human gastric cancer. BMC
Med Genomics 4: 82.
39. Yan P, Frankhouser D, Murphy M, Tam HH, Rodriguez B, et al. (2012)
Genome-wide methylation profiling in decitabine-treated patients with
acute myeloid leukemia. Blood 120: 2466-2474.
40. Lee BH, Taylor MG, Robinet P, Smith JD, Schweitzer J, et al. (2012)
Dysregulation of cholesterol homeostasis in human prostate cancer
through loss of ABCA1. Cancer Res.
41. Hogart A, Lichtenberg J, Ajay SS, Anderson S, Margulies EH, et al. (2012)
Genome-wide DNA methylation profiles in hematopoietic stem and
progenitor cells reveal overrepresentation of ETS transcription factor
binding sites. Genome Res 22: 1407-1418.
42. Deaton AM, Webb S, Kerr AR, Illingworth RS, Guy J, et al. (2011) Cell
type-specific DNA methylation at intragenic CpG islands in the immune
system. Genome Res 21: 1074-1086.
43. Yan H, Choi AJ, Lee BH, Ting AH (2011) Identification and functional
analysis of epigenetically silenced microRNAs in colorectal cancer
cells. PLoS One 6: e20628.
44. Decock A, Ongenaert M, Hoebeeck J, De Preter K, Van Peer G, et al.
(2012) Genome-wide promoter methylation analysis in neuroblastoma
identifies prognostic methylation biomarkers. Genome Biol 13: R95.
45. Jin B, Ernst J, Tiedemann RL, Xu H, Sureshchandra S, et al. (2012)
Linking DNA Methyltransferases to Epigenetic Marks and Nucleosome
Structure Genome-wide in Human Tumor Cells. Cell Rep 2: 1411-1424.
46. Simmer F, Brinkman AB, Assenov Y, Matarese F, Kaan A, et al. (2012)
Comparative genome-wide DNA methylation analysis of colorectal
tumor and matched normal tissues. Epigenetics 7.
47. Carvalho RH, Haberle V, Hou J, van Gent T, Thongjuea S, et al. (2012)
Genome-wide DNA methylation profiling of non-small cell lung
carcinomas. Epigenetics Chromatin 5: 9.
48. Yu W, Jin C, Lou X, Han X, Li L, et al. (2011) Global analysis of DNA
methylation by Methyl-Capture sequencing reveals epigenetic control
of cisplatin resistance in ovarian cancer cell. PLoS One 6: e29450.
49. Illingworth RS, Gruenewald-Schneider U, Webb S, Kerr AR, James KD, et
al. (2010) Orphan CpG islands identify numerous conserved promoters
in the mammalian genome. PLoS Genet 6.
50. Skene PJ, Illingworth RS, Webb S, Kerr AR, James KD, et al. (2010)
Neuronal MeCP2 is expressed at near histone-octamer levels and
globally alters the chromatin state. Mol Cell 37: 457-468.
51. Turker MS (2002) Gene silencing in mammalian cells and the spread of
DNA methylation. Oncogene 21: 5388-5393.
52. Irizarry RA, Ladd-Acosta C, Wen B, Wu Z, Montano C, et al. (2009) The
human colon cancer methylome shows similar hypo- and
hypermethylation at conserved tissue-specific CpG island shores. Nat
Genet 41: 178-186.
53. Ji H, Ehrlich LI, Seita J, Murakami P, Doi A, et al. (2010) Comprehensive
methylome map of lineage commitment from haematopoietic
progenitors. Nature 467: 338-342.
54. Doi A, Park IH, Wen B, Murakami P, Aryee MJ, et al. (2009) Differential
methylation of tissue- and cancer-specific CpG island shores
distinguishes human induced pluripotent stem cells, embryonic stem
cells and fibroblasts. Nat Genet 41: 1350-1353.
55. Akalin A, Garrett-Bakelman FE, Kormaksson M, Busuttil J, Zhang L, et al.
(2012) Base-pair resolution DNA methylation sequencing reveals
profoundly divergent epigenetic landscapes in acute myeloid leukemia.
PLoS Genet 8: e1002781.
56. Dudziec E, Miah S, Choudhry HM, Owen HC, Blizard S, et al. (2011)
Hypermethylation of CpG islands and shores around specific
microRNAs and mirtrons is associated with the phenotype and
presence of bladder cancer. Clin Cancer Res 17: 1287-1296.
57. Rollins RA, Haghighi F, Edwards JR, Das R, Zhang MQ, et al. (2006)
Large-scale structure of genomic methylation patterns. Genome Res
16: 157-163.
58. Jones PA (2012) Functions of DNA methylation: islands, start sites, gene
bodies and beyond. Nat Rev Genet 13: 484-492.
59. Ball MP, Li JB, Gao Y, Lee JH, LeProust EM, et al. (2009) Targeted and
genome-scale strategies reveal gene-body methylation signatures in
human cells. Nat Biotechnol 27: 361-368.
60. Laurent L, Wong E, Li G, Huynh T, Tsirigos A, et al. (2010) Dynamic
changes in the human methylome during differentiation. Genome Res
20: 320-331.
61. Lister R, Pelizzola M, Dowen RH, Hawkins RD, Hon G, et al. (2009)
Human DNA methylomes at base resolution show widespread
epigenomic differences. Nature 462: 315-322.
62. Lister R, Pelizzola M, Kida YS, Hawkins RD, Nery JR, et al. (2011)
Hotspots of aberrant epigenomic reprogramming in human induced
pluripotent stem cells. Nature 471: 68-73.
63. Berman BP, Weisenberger DJ, Aman JF, Hinoue T, Ramjan Z, et al.
(2012) Regions of focal DNA hypermethylation and long-range
hypomethylation in colorectal cancer coincide with nuclear lamina-
associated domains. Nat Genet 44: 40-46.
64. Hansen KD, Timp W, Bravo HC, Sabunciyan S, Langmead B, et al. (2011)
Increased methylation variation in epigenetic domains across cancer
types. Nat Genet 43: 768-775.
65. Stadler MB, Murr R, Burger L, Ivanek R, Lienert F, et al. (2011) DNA-
binding factors shape the mouse methylome at distal regulatory
regions. Nature 480: 490-495.
66. Dunham I, Kundaje A, Aldred SF, Collins PJ, Davis CA, et al. (2012) An
integrated encyclopedia of DNA elements in the human genome.
Nature 489: 57-74.
67. Shukla S, Kavak E, Gregory M, Imashimizu M, Shutinoski B, et al. (2011)
CTCF-promoted RNA polymerase II pausing links DNA methylation to
splicing. Nature 479: 74-79.
68. Zhou Y, Lu Y, Tian W (2012) Epigenetic features are significantly
associated with alternative splicing. BMC Genomics 13: 123.
69. Herman JG, Baylin SB (2003) Gene Silencing in Cancer in Association
with Promoter Hypermethylation. N Engl J Med 349: 2042-2054.
70. Tost J (2010) DNA Methylation: An Introduction to the Biology and the
Disease-Associated Changes of a Promising Biomarker. Mol Biotechnol
44: 71-81.
71. Issa JP (2004) CpG island methylator phenotype in cancer. Nat Rev
Cancer 4: 988-993.
72. Hughes LA, Melotte V, de Schrijver J, de Maat M, Smit VT, et al. (2013)
The CpG island methylator phenotype: what's in a name? Cancer Res
73: 5858-5868.
73. Bibikova M, Le J, Barnes B, Saedinia-Melnyk S, Zhou L, et al. (2009)
Genome-wide DNA methylation profiling using Infinium® assay.
Epigenomics 1: 177-200.
74. Huang YW, Luo J, Weng YI, Mutch DG, Goodfellow PJ, et al. (2010)
Promoter hypermethylation of CIDEA, HAAO and RXFP3 associated
with microsatellite instability in endometrial carcinomas. Gynecol Oncol
117: 239-247.
75. Kandoth C, Schultz N, Cherniack AD, Akbani R, Liu Y, et al. (2013)
Integrated genomic characterization of endometrial carcinoma. Nature
497: 67-73.
76. Bergman Y, Cedar H (2013) DNA methylation dynamics in health and
disease. Nat Struct Mol Biol 20: 274-281.
77. Rodriguez BA, Frankhouser D, Murphy M, Trimarchi M, Tam HH, et al.
(2012) Methods for high-throughput MethylCap-Seq data analysis.
BMC Genomics 13: S14.
78. Lienhard M, Grimm C, Morkel M, Herwig R, Chavez L (2014) MEDIPS:
genome-wide differential coverage analysis of sequencing data derived
from DNA enrichment experiments. Bioinformatics 30: 284-286.
79. Blum W, Garzon R, Klisovic RB, Schwind S, Walker A, et al. (2010)
Clinical response and miR-29b predictive significance in older AML
patients treated with a 10-day schedule of decitabine. Proc Natl Acad
Sci U S A 107: 7473-7478.
80. Langmead B, Trapnell C, Pop M, Salzberg SL (2009) Ultrafast and
memory-efficient alignment of short DNA sequences to the human
genome. Genome Biol 10: R25.
81. Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, et al. (2009) The
Sequence Alignment/Map format and SAMtools. Bioinformatics 25:
2078-2079.
82. Jones PA, Baylin SB (2007) The epigenomics of cancer. Cell 128: 683-692.
83. Toyota M, Ahuja N, Ohe-Toyota M, Herman JG, Baylin SB, et al. (1999)
CpG island methylator phenotype in colorectal cancer. Proc Natl Acad
Sci U S A 96: 8681-8686.
84. Noushmehr H, Weisenberger DJ, Diefes K, Phillips HS, Pujara K, et al.
(2010) Identification of a CpG island methylator phenotype that defines
a distinct subgroup of glioma. Cancer Cell 17: 510-522.
85. Fang F, Turcan S, Rimner A, Kaufman A, Giri D, et al. (2011) Breast
cancer methylomes establish an epigenomic foundation for metastasis.
Sci Transl Med 3: 75ra25.
86. Figueroa ME, Abdel-Wahab O, Lu C, Ward PS, Patel J, et al. (2010)
Leukemic IDH1 and IDH2 mutations result in a hypermethylation
phenotype, disrupt TET2 function, and impair hematopoietic
differentiation. Cancer Cell 18: 553-567.
87. Zouridis H, Deng N, Ivanova T, Zhu Y, Wong B, et al. (2012) Methylation
subtypes and large-scale epigenetic alterations in gastric cancer. Sci
Transl Med 4: 156ra140.
88. Arai E, Chiku S, Mori T, Gotoh M, Nakagawa T, et al. (2012) Single-CpG-
resolution methylome analysis identifies clinicopathologically
aggressive CpG island methylator phenotype clear cell renal cell
carcinomas. Carcinogenesis 33: 1487-1493.
89. Jithesh PV, Risk JM, Schache AG, Dhanda J, Lane B, et al. (2013) The
epigenetic landscape of oral squamous cell carcinoma. Br J Cancer
108: 370-379.
90. Mack SC, Witt H, Piro RM, Gu L, Zuyderduyn S, et al. (2014) Epigenomic
alterations define lethal CIMP-positive ependymomas of infancy.
Nature 506: 445-450.
91. Issa JP (2008) Colon cancer: it's CIN or CIMP. Clin Cancer Res 14:
5939-5940.
92. Turcan S, Fabius AW, Borodovsky A, Pedraza A, Brennan C, et al. (2013)
Efficient induction of differentiation and growth inhibition in IDH1
mutant glioma cells by the DNMT Inhibitor Decitabine.
93. Hsu YT, Gu F, Huang YW, Liu J, Ruan J, et al. (2013) Promoter
hypomethylation of EpCAM-regulated bone morphogenetic protein
gene family in recurrent endometrial cancer. Clin Cancer Res 19:
6272-6285.
94. Shmookler Reis RJ, Goldstein S (1983) Mitochondrial DNA in mortal and
immortal human cells. Genome number, integrity, and methylation. J
Biol Chem 258: 9078-9085.
95. Shock LS, Thakkar PV, Peterson EJ, Moran RG, Taylor SM (2011) DNA
methyltransferase 1, cytosine methylation, and cytosine
hydroxymethylation in mammalian mitochondria. Proc Natl Acad Sci U
S A 108: 3630-3635.
96. Goessl C, Krause H, Muller M, Heicappell R, Schrader M, et al. (2000)
Fluorescent methylation-specific polymerase chain reaction for DNA-
based detection of prostate cancer in bodily fluids. Cancer Res 60:
5941-5945.
97. Simpkins SB, Bocker T, Swisher EM, Mutch DG, Gersell DJ, et al. (1999)
MLH1 Promoter Methylation and Gene Silencing is the Primary Cause
of Microsatellite Instability in Sporadic Endometrial Cancers. Hum Mol
Genet 8: 661-666.
98. Issa JP (2007) DNA methylation as a therapeutic target in cancer. Clin
Cancer Res 13: 1634-1637.
99. Rhee JK, Kim K, Chae H, Evans J, Yan P, et al. (2013) Integrated
analysis of genome-wide DNA methylation and gene expression
profiles in molecular subtypes of breast cancer. Nucleic Acids Res 41:
8464-8474.
100. Widschwendter M, Menon U (2006) Circulating methylated DNA: a new
generation of tumor markers. Clin Cancer Res 12: 7205-7208.
101. Melotte V, Yi JM, Lentjes MH, Smits KM, Van Neste L, et al. (2015)
Spectrin repeat containing nuclear envelope 1 and forkhead box
protein E1 are promising markers for the detection of colorectal cancer
in blood. Cancer Prev Res (Phila) 8: 157-164.
102. Zhang B, Xing X, Li J, Lowdon RF, Zhou Y, et al. (2014) Comparative
DNA methylome analysis of endometrial carcinoma reveals complex
and distinct deregulation of cancer promoters and enhancers. BMC
Genomics 15: 868.
103. Kolbe DL, DeLoia JA, Porter-Gill P, Strange M, Petrykowska HM, et al.
(2012) Differential Analysis of Ovarian and Endometrial Cancers
Identifies a Methylator Phenotype. PLoS One 7: e32941.
104. Sproul D, Kitchen RR, Nestor CE, Dixon JM, Sims AH, et al. (2012)
Tissue of origin determines cancer-associated CpG island promoter
hypermethylation patterns. Genome Biol 13: R84.
105. Cottrell S, Jung K, Kristiansen G, Eltze E, Semjonow A, et al. (2007)
Discovery and validation of 3 novel DNA methylation markers of
prostate cancer prognosis. J Urol 177: 1753-1758.
106. Li H, Du Y, Zhang D, Wang LN, Yang C, et al. (2012) Identification of
novel DNA methylation markers in colorectal cancer using MIRA-based
microarrays. Oncol Rep 28: 99-104.
107. Leek JT, Scharpf RB, Bravo HC, Simcha D, Langmead B, et al. (2010)
Tackling the widespread and critical impact of batch effects in high-
throughput data. Nat Rev Genet 11: 733-739.
108. Zighelboim I, Goodfellow PJ, Gao F, Gibb RK, Powell MA, et al. (2007)
Microsatellite instability and epigenetic inactivation of MLH1 and
outcome of patients with endometrial carcinomas of the endometrioid
type. J Clin Oncol 25: 2042-2048.
109. Trimarchi MP, Murphy M, Frankhouser D, Rodriguez BA, Curfman J, et
al. (2012) Enrichment-based DNA methylation analysis using next-
generation sequencing: sample exclusion, estimating changes in global
methylation, and the contribution of replicate lanes. BMC Genomics 13
Suppl 8: S6.
Appendix A. Supplementary data for Chapter 3
Appendix A1 – QC table for endometrial cancer study
Excel 2007 format (.xlsx)
A listing of CpG enrichment, saturation, 5x coverage and read information for each sample lane in the endometrial dataset.
Appendix A2 – Replicate lane correlation, endometrial QC passed vs. QC failed samples
Excel 2007 format (.xlsx)
A table showing Pearson correlation of replicate lanes for samples that passed QC vs. failed QC. Data are presented both as a group summary and for individual samples.
Appendix A3 – QC table for ovarian study
Excel 2007 format (.xlsx)
A listing of CpG enrichment, saturation, 5x coverage and read information for each sample lane in the ovarian dataset.
Appendix A4 – QC, GMI, plasmid RPM table for AML study
Excel 2007 format (.xlsx)
A listing of CpG enrichment, saturation, 5x coverage, read information, global methylation indicator and plasmid reads per million for each sample lane in the AML dataset.
Appendix B. Supplementary data for Chapter 4
Appendix B1 – Cohort characteristics
Excel 2007 format (.xlsx)
Cohort characteristics for the endometrial training set.
Appendix B2 – Sequencing summary
Excel 2007 format (.xlsx)
A listing of sequencing read information for each sample in the endometrial training set.