METHODS AND ANALYSES IN THE STUDY OF
HUMAN DNA METHYLATION
by
KE HU
Submitted in partial fulfillment of the requirements
For the Degree of Doctor of Philosophy
Department of Electrical Engineering and Computer Science
CASE WESTERN RESERVE UNIVERSITY
May, 2018
CASE WESTERN RESERVE UNIVERSITY
SCHOOL OF GRADUATE STUDIES
We hereby approve the thesis/dissertation of
Ke Hu
candidate for the degree of Doctor of Philosophy *.
Committee Chair
Dr. Jing Li
Committee Member
Dr. Angela Ting
Committee Member
Dr. Fulai Jin
Committee Member
Dr. Xusheng Xiao
Date of Defense
March 29, 2018
*We also certify that written approval has been obtained
for any proprietary material contained therein.
Table of Contents
TABLE OF CONTENTS ...... I
LIST OF TABLES ...... IV
LIST OF FIGURES ...... V
ACKNOWLEDGEMENTS ...... VIII
ABSTRACT ...... 1
CHAPTER 1 INTRODUCTION ...... 3
1.1 DNA METHYLATION IN MAMMALS ...... 3
1.2 METHODS TO MEASURE DNA METHYLATION ...... 3
1.3 DISSERTATION ORGANIZATION ...... 5
CHAPTER 2 DETECTION OF CPG SITES WITH MULTI-MODAL DNA METHYLATION
LEVEL DISTRIBUTIONS ...... 6
2.1 MOTIVATION ...... 6
2.2 METHODS ...... 8
2.2.1 Data ...... 8
2.2.2 Gaussian Mixture Model Clustering ...... 9
2.2.3 Detection of multimodal CpG sites ...... 9
2.2.4 Associating GMM cluster labels with genotypes ...... 11
2.3 RESULTS ...... 11
2.3.1 Genome-wide survey of mmCpG sites in GAW20 dataset ...... 11
2.3.2 Association between mmCpGs and SNPs ...... 13
2.4 DISCUSSION ...... 16
CHAPTER 3 DISCOVERING DNA METHYLATION CO-OCCURRENCE PATTERNS ...... 18
3.1 MOTIVATION ...... 18
3.2 METHODS ...... 20
3.2.1 Characteristics ...... 20
3.2.2 Workflow ...... 21
3.3 RESULT ...... 27
3.3.1 Experiment Summary ...... 27
3.3.2 DNA methylation co-occurrence pattern analysis ...... 28
3.3.3 Potential ASM detection ...... 29
3.3.4 Efficiency ...... 32
3.4 DISCUSSION ...... 35
CHAPTER 4 GENOME WIDE PROFILING OF ALLELE-SPECIFIC DNA METHYLATION ..... 37
4.1 MOTIVATION ...... 37
4.2 METHODS ...... 43
4.2.1 Data ...... 43
4.2.2 Analysis flow ...... 46
4.2.3 The proposed ASM detection method ...... 50
4.2.3.1 Step 1: Mapping and methylation calling ...... 50
4.2.3.2 Step 2: Candidate region definition ...... 50
4.2.3.3 Step 3 ASM detection based on a Graph model ...... 51
4.2.3.4 Step 4 Final analysis ...... 55
4.2.4 Genome annotation ...... 55
4.2.5 CTCF binding data ...... 56
4.2.6 RNA-seq data ...... 56
4.2.7 SNP calling ...... 56
4.2.8 Checking consistency between heterozygous alleles and ASM partitions ...... 59
4.2.9 amrfinder result ...... 59
4.3 RESULT ...... 60
4.3.1 ASM is ubiquitous across the genome and is cell line specific ...... 60
4.3.2 Enrichment in female X chromosomes ...... 70
4.3.2.1 Overlaps with RefSeq Genes...... 71
4.3.2.2 Overlaps with ENCODE regulatory elements...... 73
4.3.2.3 Relationship with gene expression levels...... 80
4.3.3 ASM significantly overlaps imprinted gene regions ...... 82
4.3.3.1 Majority of imprinted genes have ASMs...... 82
4.3.3.2 Imprinted genes overlap strong ASMs...... 89
4.3.3.3 Variability of ASM in imprinted regions in different cell lines...... 91
4.3.3.4 Overlaps with promoter regions and correlation with gene expression...... 94
4.3.4 ASM patterns in autosomes ...... 95
4.3.4.1 ASM distributions...... 95
4.3.4.2 Overlaps with regulatory elements and relations with expression levels...... 96
4.3.5 Heterozygous SNPs located in identified ASM regions strongly support read partitions ..... 98
4.3.6 Comparison with amrfinder ...... 106
4.4 DISCUSSION ...... 116
CHAPTER 5 A GENERAL BISULFITE SEQUENCE CLUSTERING AND VISUALIZATION
TOOL ...... 120
5.1 BACKGROUND ...... 120
5.2 IMPLEMENTATION AND DESCRIPTION ...... 120
5.3 EXAMPLE AND DISCUSSION ...... 123
CHAPTER 6 CONCLUSION ...... 126
BIBLIOGRAPHY ...... 127
List of Tables
TABLE 2.1 SUMMARY OF MMCPGS ...... 15
TABLE 3.1 SIZES OF DATASETS USED IN THE EXPERIMENTS ...... 33
TABLE 4.1 SUMMARY STATISTICS OF DATA PROCESSING...... 45
TABLE 4.2 HETEROZYGOUS SNPS IN ASM REGIONS...... 58
TABLE 4.3 DETECTED ASM REGIONS IN DIFFERENT CELL LINES AND THEIR STATISTICS...... 64
TABLE 4.4 DISTRIBUTION OF ASM REGIONS ON AUTOSOMES...... 65
TABLE 4.5 DISTRIBUTION OF ASM REGIONS ON X CHROMOSOME...... 65
TABLE 4.6 DISTRIBUTION OF ASM REGIONS ON Y CHROMOSOME...... 66
TABLE 4.7 THE OVERLAPS OF KNOWN IMPRINTED GENES WITH DETECTED ASMS IN ALL 8 CELL LINES...... 85
TABLE 4.8 SNP CALLING RESULT ...... 105
TABLE 4.9 SUMMARY OF 1K RANGE MERGED ASM REGIONS DETECTED BY ASM-DETECTOR ...... 110
TABLE 4.10 SUMMARY OF ASM REGIONS DETECTED BY AMRFINDER ...... 111
TABLE 4.11 COMPARISON OF ASM REGIONS DETECTED BY ASM-DETECTOR AND AMRFINDER IN MALE X
AND ALL Y CHROMOSOME...... 115
List of Figures
FIGURE 2.1 VENN DIAGRAM OF GMMC AND GAPHUNTER RESULTS ON THE PRE-TREATMENT (A) AND POST-
TREATMENT (B) DATASET...... 13
FIGURE 2.2: NUMBER OF MMCPGS DETECTED BY GMMC AND GAPHUNTER IN DIFFERENT SAMPLE SIZE...... 13
FIGURE 2.3: EXAMPLE OF MMCPG AND NON-MMCPG...... 14
FIGURE 2.4: DISTANCE DISTRIBUTION FROM MMCPG SITES AND THEIR ASSOCIATED SNPS...... 16
FIGURE 3.1: WORKFLOW OF BSPAT...... 23
FIGURE 3.2: EXAMPLES OF DNA METHYLATION CO-OCCURRENCE PATTERNS...... 29
FIGURE 3.3: AN ALLELE SPECIFIC METHYLATION EXAMPLE NEAR PAX6...... 31
FIGURE 3.4: EFFICIENCY COMPARISON OF BSPAT AND BIQ ANALYZER HT (REFERRED AS BIQ HT HERE)
USING DIFFERENT SETTINGS...... 34
FIGURE 3.5: PEAK MEMORY USAGE COMPARISON OF BSPAT AND BIQ ANALYZER HT (REFERRED AS BIQ HT
HERE) USING DIFFERENT SETTINGS...... 35
FIGURE 4.1: A TOY EXAMPLE ILLUSTRATING THE CLUSTERING ALGORITHM IMPLEMENTED IN ASM-
DETECTOR...... 49
FIGURE 4.2: GENOME-WIDE DISTRIBUTION OF ASMS (BLUE) AND SOME OTHER GENOMIC FEATURES
INCLUDING IMPRINTED GENES (NAMES), DNASE I HYPERSENSITIVE SITES (BLACK), TRANSCRIPTION
FACTOR BINDING SITES (RED) AND CPG ISLANDS (GREEN)...... 67
FIGURE 4.3 THE LENGTH DISTRIBUTION OF ASMS IN DIFFERENT CELL LINES. THE THREE ADS CELL LINES
SEQUENCED USING PAIRED-END TECHNOLOGY HAVE LONGER ASM REGIONS AND THE REST HAVE
SIMILAR LENGTH DISTRIBUTIONS...... 68
FIGURE 4.4 AUTOSOMAL ASMS IDENTIFIED IN EACH CELL LINE AND THEIR CORRESPONDING METHYLATION
LEVEL DISTRIBUTIONS OF THE SAME GENOMIC REGIONS IN OTHER CELL LINES...... 69
FIGURE 4.5 ASMS IDENTIFIED FROM X CHROMOSOME IN EACH CELL LINE AND THEIR CORRESPONDING
METHYLATION LEVEL DISTRIBUTIONS OF THE SAME GENOMIC REGIONS IN OTHER CELL LINES...... 70
FIGURE 4.6 GENOMIC DISTRIBUTIONS OF ASM REGIONS AND THEIR OVERLAPS WITH DIFFERENT GENE
ANNOTATIONS INCLUDING PROMOTER (RED), EXON (BLUE), INTRON (GREEN), AND INTERGENIC
(PURPLE) REGIONS BASED ON REFSEQ GENE ANNOTATIONS. DISTRIBUTIONS IN OTHER CONTEXTS
(ASMS ON AUTOSOMES, ON X CHROMOSOME, AND OVERLAPPING WITH IMPRINTED GENE REGIONS) ARE
ALSO SHOWN...... 73
FIGURE 4.7 FRACTIONS OF ASM REGIONS LOCATED IN DHS REGIONS IN EACH CELL LINE, BASED ON LENGTH.
THE BAR LABELED WITH “GENOME” IN EACH PANEL SHOWS THE FRACTION OF EACH FEATURE ON THE
GENOME...... 75
FIGURE 4.8 FRACTIONS OF ASM REGIONS LOCATED IN TFBS REGIONS IN EACH CELL LINE, BASED ON
LENGTH...... 76
FIGURE 4.9 FRACTIONS OF ASM REGIONS LOCATED IN CGI REGIONS IN EACH CELL LINE, BASED ON LENGTH.
...... 77
FIGURE 4.10 VIOLIN PLOT SHOWS THE SCORE DISTRIBUTION OF DHS IN ASM AND NONASM REGIONS FOR
EACH OF THE CELL LINES...... 78
FIGURE 4.11: ASMS IDENTIFIED IN EACH CELL LINE AROUND FIRRE GENE, TOGETHER WITH THE
METHYLATION LEVELS, AND SIGNALS OF DHS, TFBS, AND CGI...... 80
FIGURE 4.12: BOXPLOTS OF GENE TRANSCRIPT ABUNDANCE ON X CHROMOSOME OF THREE FEMALE CELL
LINES...... 82
FIGURE 4.13 THE NUMBER OF IMPRINTED GENES COVERED BY PREDICTED ASM REGIONS IN EACH CELL LINE
BY THE PROPOSED METHOD ASM-DETECTOR AND AN EXISTING ALGORITHM AMRFINDER. BLUE
REPRESENTS REGIONS DETECTED BY OUR METHOD ALONE, RED REPRESENTS REGIONS DETECTED BY
AMRFINDER ALONE, AND GREEN REPRESENTS SHUFFLED ASM REGIONS...... 84
FIGURE 4.14 THE DISTRIBUTION OF ASM REGIONS IN TERMS OF THEIR LENGTH AND THE NUMBER OF CPG
SITES WITHIN THEM. THE TOP ASM REGIONS ARE LABELED BY GENE NAMES THAT OVERLAP THEM, RED
FOR KNOWN IMPRINTED GENES AND BLUE FOR POSSIBLE IMPRINTED GENES. THE FIGURE SHOWS THE
UNION SET OF ASM REGIONS FROM ALL CELL LINES...... 90
FIGURE 4.15 ALL CELL LINES HAVE ASMS AROUND TSS OF KCNQ1OT1 AND SNRPN...... 92
FIGURE 4.16 ASMS ONLY OCCUR IN SOMATIC CELL LINES AROUND TSS OF GENE PEG3 AND MEG3...... 93
FIGURE 4.17: CELL LINE SPECIFIC ASMS ARE FOUND AROUND TSS OF DIRAS3 AND BLCAP...... 93
FIGURE 4.18 DISTANCE DISTRIBUTIONS OF ASM REGIONS TO THEIR NEAREST TSS FOR ALL THE CELL LINES.
ASMS ARE SIGNIFICANTLY CLOSE TO TSS, COMPARING WITH THE NULL DISTRIBUTIONS (GREY LINES)
GENERATED BY SHUFFLING ASM REGIONS RANDOMLY IN EACH SAMPLE...... 95
FIGURE 4.19 BOXPLOTS OF GENE TRANSCRIPT ABUNDANCE ON AUTOSOMES OF THE FOUR CELL LINES...... 98
FIGURE 4.20 EXAMPLES OF ASM REGIONS WITH READ PARTITIONS CONSISTENT WITH ALLELES OF
HETEROZYGOUS GENOTYPES...... 102
FIGURE 4.21 SIMILAR TO FIGURE 4.20, PANEL A IS UCSC GENOME BROWSER VIEW OF ANOTHER EXAMPLE IN
THE INTRAGENIC REGION OF TRAPPC9 GENE...... 104
FIGURE 4.22 COMPARISON OF OUR RESULTS AND AMRFINDER RESULTS...... 108
FIGURE 4.23 ASM REGIONS DETECTED BY BOTH APPROACHES AND BY INDIVIDUAL APPROACHES ALONE AND
THEIR OVERLAPS WITH DHS...... 109
FIGURE 4.24 UCSC GENOME BROWSER VIEW OF THE REGION AROUND TSS OF GENE ERICH3 (A) AND
AROUND TSS OF XIST (B)...... 114
FIGURE 4.25 GENOMIC DISTRIBUTIONS OF CPMRS AND THEIR OVERLAPS WITH DIFFERENT GENE
ANNOTATIONS. SAME LEGENDS ARE USED IN FIGURE 4.6...... 119
FIGURE 5.1 EXAMPLE OF GROUP PATTERN VIEW AND INDIVIDUAL METHYLATION PATTERN VIEW FIGURES. 123
Acknowledgements
I would like to thank my advisor, Dr. Jing Li, for his guidance, patience and support over the past years. It has been my pleasure to be his student during the long journey of Ph.D. study. I would also like to thank Dr. Angela Ting, Dr. Fulai Jin and Dr. Xusheng Xiao for serving on my dissertation committee and providing valuable advice. Many thanks to my lab mates and classmates at Case. They made my life at Case pleasant and valuable.
Furthermore, I would like to thank my wife. Her love and understanding helped me get through the hard times. Most importantly, I would like to thank my parents for their love and unconditional support throughout my life.
Methods and Analyses in the Study of Human DNA
Methylation
by
KE HU
Abstract
DNA methylation is an important epigenetic mechanism. Analysis of DNA methylation patterns helps in understanding the mechanism and function of DNA methylation and the diseases associated with it. Advances in technology have increased both the depth and the breadth of DNA methylation measurement, making it possible to detect multi-modal CpG sites, capture DNA methylation co-occurrence patterns, and profile genome-wide allele-specific DNA methylation (ASM) patterns from different types of data. In this dissertation, we describe novel tools and methods designed for analyzing human DNA methylation data.
DNA methylation BeadChip assays enable studies at the population level. We have developed a Gaussian Mixture Model Clustering (GMMC) based approach to systematically detect CpG sites with multi-modal methylation level distributions (mmCpGs) across the genome based on Illumina 450K data. Comparison with an existing approach illustrates that our GMMC based method is more accurate and consistent.
Ultra-deep bisulfite sequencing allows a depth of coverage of more than ten thousand reads for certain genomic regions. We developed BSPAT, an efficient and user-friendly tool to discover and visualize DNA methylation co-occurrence patterns from ultra-deep bisulfite sequencing datasets. In addition, BSPAT can identify potential ASM patterns from co-occurrence patterns that contain SNPs.
More recently, Whole Genome Bisulfite Sequencing (WGBS) has made it possible to study DNA methylation genome-wide at single-nucleotide resolution. We have developed a novel computational method to better detect ASM regions from WGBS data and have performed a comprehensive analysis of their distributions by applying the method to WGBS datasets from eight human cell lines. The results show that ASM regions are ubiquitous and functional in the human genome. Our findings confirm previous observations that ASM can be found in most imprinted genes and on the female X chromosome. Our method is highly reliable, with very low false positive rates, and the partition of reads in predicted ASMs is in high concordance with the two alleles when ASMs overlap heterozygous SNPs.
Based on our previous work, we have implemented a general bisulfite sequence clustering tool called BS-Cluster. It relaxes the requirements and setup of BSPAT and ASM profiling, and can therefore be applied to any kind of bisulfite sequencing dataset, including both ultra-deep bisulfite sequencing and WGBS.
Chapter 1
Introduction
1.1 DNA methylation in mammals
DNA methylation is one type of epigenetic event that plays an important role in gene regulation and normal development [1]. In mammals, DNA methylation mostly occurs at CpG sites, where a cytosine is followed by a guanine and the two are connected by one phosphate. A methylated CpG dinucleotide contains a 5-methylcytosine instead of an unmodified cytosine. DNA methylation has been shown to be associated with many regulatory mechanisms such as gene silencing, gene imprinting and X chromosome inactivation. Many studies have found that abnormal DNA methylation at CpG dinucleotides is involved in human diseases such as cancer [2]. In addition, aberrant reprogramming of DNA methylation has been observed in induced pluripotent stem cells [3]. Analysis of DNA methylation patterns is of great importance in understanding the mechanism of DNA methylation and its functions [4].
1.2 Methods to measure DNA methylation
Accurate measurement of DNA methylation is essential for studying its mechanism and function. Many technologies have been developed to systematically acquire DNA methylation information [5]. According to the type of pretreatment, methods fall into three main categories: enzyme digestion, affinity enrichment and bisulfite conversion. Among them, bisulfite conversion is considered the “gold standard” for measuring DNA methylation. Bisulfite treatment of a DNA sample converts unmethylated cytosine to uracil while leaving methylated cytosine unaffected. During PCR, all uracils are amplified as thymine. Thus, in the PCR product we can distinguish methylated from unmethylated cytosines: originally methylated cytosines remain cytosines, whereas unmethylated cytosines have been converted to thymines.
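The conversion logic described above can be illustrated with a minimal Python sketch. It is an in-silico toy, not part of any tool described in this dissertation: the function names, the example sequence and the chosen methylated positions are all assumptions made for illustration, and only a single strand with no sequencing errors is considered.

```python
def bisulfite_convert(sequence, methylated_positions):
    """Simulate bisulfite treatment followed by PCR on one DNA strand.

    Unmethylated C is read as T after PCR; methylated C stays C.
    """
    converted = []
    for i, base in enumerate(sequence):
        if base == "C" and i not in methylated_positions:
            converted.append("T")
        else:
            converted.append(base)
    return "".join(converted)


def call_methylation(reference, read):
    """Return {position: is_methylated} for every reference C covered by the read."""
    calls = {}
    for i, (ref_base, read_base) in enumerate(zip(reference, read)):
        if ref_base != "C":
            continue
        if read_base == "C":
            calls[i] = True      # cytosine survived conversion: methylated
        elif read_base == "T":
            calls[i] = False     # cytosine converted: unmethylated
        # Any other base is left uncalled (mismatch or sequencing error).
    return calls


# Hypothetical reference with cytosines at positions 1, 5, 8 and 9.
reference = "ACGTTCGACCGA"
read = bisulfite_convert(reference, methylated_positions={1, 9})
print(read)
print(call_methylation(reference, read))
```

Comparing the converted read with the reference recovers the methylation state of each cytosine, which is the comparison that array-based and sequencing-based readouts both rely on.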
To obtain methylation information from the PCR product of bisulfite-treated DNA samples, different technologies have been developed. Array-hybridization based methods (such as the Illumina Infinium HumanMethylation450 BeadChip) use quantitative genotyping to detect DNA methylation at individual CpG loci. They enable genome-wide study of DNA methylation at a selected set of CpG sites with large sample sizes and low cost, and are therefore mostly used in population epigenetics studies. In contrast, sequencing based methods acquire single-nucleotide resolution information on the DNA sequence and on the DNA methylation within it, with both higher breadth and higher depth of coverage than array-hybridization based methods. Ultra-deep bisulfite sequencing allows a depth of coverage as high as hundreds of thousands of reads per locus [6], [7]. It provides an opportunity to study epigenetic variation within a sample, such as variation between subtypes of cancer cells. Whole-genome bisulfite sequencing (WGBS), on the other hand, makes it possible to profile DNA methylation across the whole genome at single-nucleotide resolution. In the first WGBS dataset of the human genome, 94% of the cytosines in the genome were covered [8]. In addition, sequencing based methods provide enough information to allow the study of allelic differences in DNA methylation.
1.3 Dissertation organization
The remainder of this dissertation is organized by the type of DNA methylation measurement and the methods built on it: 1) detection of CpG sites with multi-modal DNA methylation level distributions from Illumina 450K data; 2) identification of DNA methylation co-occurrence patterns from ultra-deep bisulfite sequencing data; 3) profiling of genome-wide ASM patterns from WGBS data; 4) a general bisulfite sequence clustering and visualization tool. Each topic is divided into motivation, methods, results and discussion.
Chapter 2
Detection of CpG sites with multi-modal DNA methylation
level distributions
2.1 Motivation
DNA methylation is one of the most widely studied epigenetic marks and plays an important role in gene regulation, which may result in phenotypic differences among individuals, as well as phenotypic differences in the same individual before and after treatment [9]. Although epigenetics is traditionally defined as heritable changes in gene activity that do not involve genetic mutations, recent studies have suggested that associations exist between genetic variants and differences in DNA methylation levels [10], [11]. Large-scale genome-wide DNA methylation profiling (e.g., using the Illumina Infinium HumanMethylation450 BeadChip, a.k.a. Illumina 450K), together with genome-wide genotyping assays using SNP arrays, enables studies of associations between genetic variation and differences in DNA methylation levels.
While many studies have treated DNA methylation levels as a quantitative trait and performed so-called meQTL analysis, two recent studies [12], [13] have investigated multimodal distributions of methylation levels at CpG sites, primarily as a quality control step to correct methylation signals from Illumina 450K chips. Daca-Roszak et al. [12] studied relationships between SNP genotypes and methylation levels of 96 CpG sites from European and Asian populations. They observed multi-modal distributions among individual samples for CpG sites with SNPs. However, their study was limited to a very small subset of CpG sites and only considered CpG sites that physically overlapped with SNPs. In another attempt, Andrews et al. developed an interval based clustering method called Gaphunter to identify CpG sites with multi-modal distributions [13], which was implemented in the Bioconductor package minfi [14]. Gaphunter first sorts the individual DNA methylation levels of a candidate CpG site and then groups them into clusters using predefined methylation level thresholds. An optional post-processing step can be used to exclude outlier-driven clusters, which are defined as clusters that are small relative to the total sample size and the size of the largest cluster. This simple algorithm is fast and works well for moderately sized datasets with little experimental noise in the methylation measurements. The authors also explored applications of mmCpGs such as probe quality control and population stratification adjustment. However, threshold-based approaches such as Gaphunter are sensitive to both noise levels and sample size.
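To make the threshold-based idea concrete, the following Python sketch sorts the methylation levels of one CpG site and starts a new cluster wherever two neighbouring values differ by more than a fixed gap. It is a simplified illustration of the general approach described above, not the minfi implementation of Gaphunter: the function name, the gap value of 0.05 and the example data are assumptions, and the optional outlier filtering step is omitted.

```python
def gap_clusters(levels, gap_threshold):
    """Split sorted methylation levels wherever neighbours differ by more than gap_threshold."""
    ordered = sorted(levels)
    if not ordered:
        return []
    clusters = [[ordered[0]]]
    for previous, current in zip(ordered, ordered[1:]):
        if current - previous > gap_threshold:
            clusters.append([current])       # gap found: open a new cluster
        else:
            clusters[-1].append(current)     # no gap: extend the current cluster
    return clusters


# Hypothetical methylation levels of eight individuals at one CpG site.
levels = [0.10, 0.12, 0.11, 0.50, 0.52, 0.90, 0.91, 0.13]
print([len(c) for c in gap_clusters(levels, gap_threshold=0.05)])

# With only three sparse samples, every sample becomes its own cluster.
print(len(gap_clusters([0.1, 0.2, 0.3], gap_threshold=0.05)))
```

The second call hints at the sample-size sensitivity noted above: when few samples are available, ordinary spread between neighbouring values can exceed a fixed gap and be reported as separate clusters.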
To overcome these limitations, we propose a more generic and more robust clustering method to identify mmCpGs. The method is based on the Gaussian Mixture Model, and we apply it to the Genetic Analysis Workshop 20 (GAW20) datasets to identify mmCpGs. We further examine the relationships between SNPs and mmCpGs, in terms of direct overlaps of their genomic locations as well as statistical associations between mmCpG clusters and genotypes of SNPs that are physically close to mmCpGs. The analysis shows that 68 to 70% of mmCpG sites are associated with at least one SNP within their 100 kb neighborhood, suggesting high concordance between mmCpG clusters and individual genotypes. A comparison shows that our approach is more robust and more stable than Gaphunter.
2.2 Methods
2.2.1 Data
In this study, we analyzed the genome-wide DNA methylation data collected before and after treatment, as well as the dense SNP genotype data provided by GAW20. There are 995/530 individuals from 182/153 families in the pre/post-treatment methylation datasets, respectively. Among them, 823 individuals have been genotyped. The numbers of individuals that have both methylation and SNP data in the pre/post-treatment datasets are 717/507, respectively. We performed mmCpG prediction on all individuals with methylation data, separately for the pre-treatment and post-treatment datasets. Due to time limitations, when assessing associations between mmCpGs and genotypes, we randomly picked one member from each family. Associations between mmCpGs and SNPs in related individuals will be examined in future studies. The number of CpG sites included is 463,995. The DNA methylation level of each CpG site in an individual is a numeric value between 0 and 1. The SNP array data consist of 718,566 SNPs. Each genotype is coded as a dosage of 0, 1 or 2 copies of the coded allele. Because the genotype and DNA methylation data may contain missing values for some SNPs or CpG sites, we only include those individuals that have both genotype and DNA methylation information when associating cluster labels with genotypes.
2.2.2 Gaussian Mixture Model Clustering
The goal of our method is to identify, for each CpG site, clusters of individuals that have distinct distributions of DNA methylation levels, without using any prior knowledge of genotypes or phenotypes. The Gaussian Mixture Model (GMM) is one of the most widely used model-based clustering approaches and is suitable for identifying cluster structure in a mixture of multiple distributions. A GMM is a weighted sum of M component Gaussian densities, as shown in the formula below,
$$p(x \mid \lambda) = \sum_{i=1}^{M} w_i \, g(x \mid \mu_i, \sigma_i^2) \qquad (2.1)$$
where $w_i$ is the weight of component $i$, and $g(x \mid \mu_i, \sigma_i^2)$, $i = 1, \ldots, M$, are the component Gaussian densities.
The assumption is that when methylation levels are affected by genotypes, each distinct genotype corresponds to a different distribution. In a population with different genotypes, the methylation levels will exhibit the characteristics of a mixture distribution. In our study, we use the mixtools package [15] to perform GMMC; it estimates the model parameters of each cluster using the EM algorithm for a given number of clusters. It also provides the posterior probabilities of a sample belonging to each of the clusters. Evaluation metrics such as the Bayesian information criterion (BIC) and the log-likelihood of the mixture model can be used for model selection.
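As an illustration of equation (2.1) and of EM fitting, the sketch below implements a one-dimensional GMM in plain Python. It is a stand-in written for this explanation and does not reproduce mixtools: the function names, the chunk-based initialisation, the fixed number of iterations, the variance floor and the simulated data are all assumptions. The BIC is written in the "larger is better" form, 2 ln L - (3k - 1) ln n, which is the sign convention implied by the selection rule in Section 2.2.3.

```python
import math
import random


def gaussian_pdf(x, mean, var):
    """Density g(x | mean, var) of a univariate Gaussian."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def mixture_pdf(x, weights, means, variances):
    """Equation (2.1): weighted sum of the component densities."""
    return sum(w * gaussian_pdf(x, m, v) for w, m, v in zip(weights, means, variances))


def posteriors(data, weights, means, variances):
    """Posterior probability of every component for every sample (E-step)."""
    rows = []
    for x in data:
        dens = [w * gaussian_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
        total = sum(dens)
        # Fall back to a uniform posterior if every density underflows to zero.
        rows.append([d / total for d in dens] if total > 0 else [1.0 / len(dens)] * len(dens))
    return rows


def fit_gmm(data, k, iterations=200, min_var=1e-6):
    """Fit a k-component 1-D GMM by plain EM; returns (weights, means, variances)."""
    ordered = sorted(data)
    n = len(ordered)
    # Assumed initialisation: split the sorted data into k equal chunks.
    chunks = [ordered[i * n // k:(i + 1) * n // k] for i in range(k)]
    means = [sum(c) / len(c) for c in chunks]
    variances = [max(sum((x - m) ** 2 for x in c) / len(c), min_var)
                 for c, m in zip(chunks, means)]
    weights = [1.0 / k] * k
    for _ in range(iterations):
        post = posteriors(data, weights, means, variances)
        for j in range(k):  # M-step, one component at a time
            resp = sum(row[j] for row in post)
            if resp < 1e-12:
                continue  # leave an empty component unchanged
            weights[j] = resp / n
            means[j] = sum(row[j] * x for row, x in zip(post, data)) / resp
            spread = sum(row[j] * (x - means[j]) ** 2 for row, x in zip(post, data)) / resp
            variances[j] = max(spread, min_var)
    return weights, means, variances


def bic(data, weights, means, variances):
    """BIC in the 'larger is better' convention: 2 ln L - (3k - 1) ln n."""
    log_lik = sum(math.log(mixture_pdf(x, weights, means, variances)) for x in data)
    return 2.0 * log_lik - (3 * len(weights) - 1) * math.log(len(data))


# Simulated methylation levels for three genotype groups (illustrative only).
random.seed(0)
levels = [random.gauss(mu, 0.03) for mu in (0.1, 0.5, 0.9) for _ in range(100)]
for k in (1, 2, 3):
    w, m, v = fit_gmm(levels, k)
    print(k, [round(x, 2) for x in m], round(bic(levels, w, m, v), 1))
```

On data with three well-separated groups, the three-component fit should recover means close to the simulated ones and obtain a much larger BIC than the uni-modal fit. A production analysis would normally add a convergence check and multiple restarts, which this sketch leaves out.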
2.2.3 Detection of multimodal CpG sites
To determine the best number of clusters, we try different numbers of clusters iteratively and select the best model using the BIC. More specifically, we apply the algorithm below:
0. Starting with k=1, calculate BIC1 based on the uni-modal model GMM1.
Set BEST_MODEL=GMM1.
1. If k ≥ MAX_K, stop the iteration.
Otherwise, set k=k+1. Apply GMMC with k components. Obtain BICk and GMMk.
2. If BICk > BICk-1+BIC_INC_THRESHOLD, set BEST_MODEL=GMMk and return to step 1. Otherwise, stop the iteration.
Given the nature of our application, we set MAX_K to 3, corresponding to the three distinct genotypes of a SNP that is potentially associated with the mmCpG. The BIC increment threshold (BIC_INC_THRESHOLD) is used to control model complexity: a model with more clusters is accepted only if its BIC is substantially higher than the BIC of the model with one fewer cluster. In practice, a higher threshold makes the method less sensitive. We set BIC_INC_THRESHOLD=100 in our analysis so that our results are more conservative.
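The stepwise selection rule can be sketched in Python as follows. This is an illustrative reading of the algorithm, not the code used in the study: the function names are hypothetical, the fitting routine is passed in as a callable that returns a BIC (larger is better) and a model, the loop never fits more than MAX_K components, and the BIC values in the demonstration are invented.

```python
def select_best_model(fit, max_k=3, bic_inc_threshold=100.0):
    """Pick the number of GMM components by the stepwise BIC rule.

    fit(k) must return (bic, model), with larger BIC meaning a better model.
    """
    best_bic, best_model = fit(1)          # step 0: uni-modal baseline
    k = 1
    while k < max_k:                       # step 1: never exceed max_k components
        k += 1
        bic, model = fit(k)
        if bic > best_bic + bic_inc_threshold:
            best_bic, best_model = bic, model   # step 2: gain is large enough
        else:
            break                          # gain too small: keep the simpler model
    return best_model


def make_fit(bic_by_k, calls):
    """Hypothetical stand-in for a real GMM fit, backed by fixed BIC values."""
    def fit(k):
        calls.append(k)                    # record which k values were tried
        return bic_by_k[k], "GMM%d" % k
    return fit


calls = []
print(select_best_model(make_fit({1: -400.0, 2: 150.0, 3: 180.0}, calls)), calls)
calls = []
print(select_best_model(make_fit({1: 900.0, 2: 950.0, 3: 2000.0}, calls)), calls)
```

In the second call the search stops after k=2 because the gain over k=1 is below the threshold, so the three-component model is never fitted even though its BIC is far higher. This greedy stopping is a property of the stepwise rule as stated, and it is one reason the rule behaves conservatively.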
Once the model is fixed, each individual is assigned to the cluster for which its posterior probability is highest. Our method also incorporates a post-processing step that uses several thresholds to filter out low-quality clusters. First, the largest cluster cannot be too big: if the fraction of samples in the largest cluster is greater than 1-OUT_CUTT, where OUT_CUTT is a user-specified parameter, the CpG site is excluded from further analysis. Second, samples within each cluster should have small variance, which is controlled by a threshold MAX_STD on the maximum allowed standard deviation within each cluster. Finally, we require that cluster centers be separable from each other, which is controlled by a threshold MIN_MEAN_DIFF. In our study, we set MAX_STD=0.1 and MIN_MEAN_DIFF=0.2.
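The three post-processing filters can be expressed as a single predicate, sketched below in Python. The function name and example data are hypothetical, and several details are assumptions rather than facts from the text: the value 0.05 used for OUT_CUTT in the demonstration, the use of the population standard deviation, the use of the means of the assigned samples as cluster centers, and the comparison of adjacent centers only.

```python
import statistics


def passes_cluster_filters(levels, labels, out_cutt, max_std=0.1, min_mean_diff=0.2):
    """Return True if the clustering of one CpG site passes all three filters."""
    clusters = {}
    for level, label in zip(levels, labels):
        clusters.setdefault(label, []).append(level)
    if len(clusters) < 2:
        return False                                   # uni-modal: not an mmCpG
    # Filter 1: the largest cluster must not hold more than 1 - out_cutt of samples.
    largest_fraction = max(len(m) for m in clusters.values()) / len(levels)
    if largest_fraction > 1.0 - out_cutt:
        return False
    # Filter 2: every cluster must be tight (standard deviation at most max_std).
    if any(statistics.pstdev(m) > max_std for m in clusters.values()):
        return False
    # Filter 3: adjacent cluster centers must be at least min_mean_diff apart.
    centers = sorted(statistics.mean(m) for m in clusters.values())
    return all(b - a >= min_mean_diff for a, b in zip(centers, centers[1:]))


# Three tight, well-separated clusters: kept.
print(passes_cluster_filters([0.10, 0.12, 0.11, 0.50, 0.52, 0.51, 0.90, 0.88],
                             [1, 1, 1, 2, 2, 2, 3, 3], out_cutt=0.05))
# One dominant cluster plus a single outlier: rejected by filter 1.
print(passes_cluster_filters([0.5] * 99 + [0.9], [1] * 99 + [2], out_cutt=0.05))
# Two clusters whose centers are only 0.15 apart: rejected by filter 3.
print(passes_cluster_filters([0.40, 0.41, 0.55, 0.56], [1, 1, 2, 2], out_cutt=0.05))
```

Keeping the three checks in one function makes it easy to see which threshold rejects a given site, which is useful when tuning OUT_CUTT for a dataset of a different size.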
2.2.4 Associating GMM cluster labels with genotypes
To study the relationships between genotypes and mmCpGs, we evaluated genotype data and GMM cluster labels together to assess the strength of their association. We included all SNPs located less than 50 kb from an mmCpG site on either side. For each pair of a SNP and an mmCpG, we first constructed the contingency table of the three genotypes against the three cluster labels. A chi-square p-value was then calculated and adjusted by Bonferroni correction for multiple testing. Among all the SNPs near an mmCpG, the one with the minimum p-value is taken as the measure of SNP-mmCpG association. Finally, a significance threshold of 0.001 on the Bonferroni-corrected p-value was used to determine whether an mmCpG has a strong association with at least one nearby SNP.
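A minimal Python sketch of this association test is given below. It is an illustration under stated assumptions rather than the analysis code: the function names are hypothetical, empty rows and columns are dropped before the test, the p-value uses closed forms of the chi-square tail that cover only the degrees of freedom arising from 2x2 up to 3x3 tables, and the number of tests used for the Bonferroni correction (`n_tests`) is left as a parameter because the text does not specify the family of tests.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of the chi-square distribution for df = 1 or even df."""
    if df == 1:
        return math.erfc(math.sqrt(x / 2.0))
    if df % 2 == 0:
        half, term, total = x / 2.0, 1.0, 1.0
        for j in range(1, df // 2):
            term *= half / j
            total += term
        return math.exp(-half) * total
    raise ValueError("only df = 1 or even df is supported in this sketch")


def chi_square_test(table):
    """Pearson chi-square test on a genotype-by-cluster table; returns (stat, df, p)."""
    rows = [r for r in table if sum(r) > 0]             # drop empty genotypes
    cols = [c for c in zip(*rows) if sum(c) > 0]        # drop empty clusters
    observed = [list(r) for r in zip(*cols)]
    if len(observed) < 2 or len(cols) < 2:
        return 0.0, 0, 1.0                              # nothing to test
    total = sum(map(sum, observed))
    row_sums = [sum(r) for r in observed]
    col_sums = [sum(c) for c in cols]
    stat = 0.0
    for i, row in enumerate(observed):
        for j, count in enumerate(row):
            expected = row_sums[i] * col_sums[j] / total
            stat += (count - expected) ** 2 / expected
    df = (len(observed) - 1) * (len(cols) - 1)
    return stat, df, chi2_sf(stat, df)


def best_snp_association(tables_by_snp, n_tests, alpha=0.001):
    """Return (snp, corrected p, is_significant) for the SNP with the smallest p-value."""
    best_snp, best_p = None, 1.0
    for snp, table in tables_by_snp.items():
        corrected = min(1.0, chi_square_test(table)[2] * n_tests)
        if best_snp is None or corrected < best_p:
            best_snp, best_p = snp, corrected
    return best_snp, best_p, best_p <= alpha


# Hypothetical tables: rows are genotypes 0/1/2, columns are clusters 1/2/3.
tables = {
    "snp_a": [[30, 0, 0], [0, 40, 0], [0, 0, 30]],      # clusters follow genotype
    "snp_b": [[10, 10, 10], [20, 20, 20], [10, 10, 10]],  # no association
}
print(best_snp_association(tables, n_tests=len(tables)))
```

For a table whose clusters follow the genotypes exactly, the statistic equals twice the sample size and the corrected p-value is far below 0.001, while a table with identical cluster proportions in every genotype gives a statistic of zero.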
2.3 Results
2.3.1 Genome-wide survey of mmCpG sites in GAW20 dataset
We applied our GMMC based method to both the pre-treatment (995 individuals) and post-treatment (530 individuals) Illumina 450K datasets and detected 3785 and 3847 mmCpGs, respectively. A large majority of them (2965 mmCpGs) were found in both datasets; 820 and 882 mmCpGs were unique to the pre- and post-treatment datasets, respectively. To compare our method with Gaphunter, we also applied Gaphunter to the same datasets. Gaphunter identified 4313 and 5632 mmCpGs in the pre/post-treatment datasets, respectively. About 78% and 91% of the mmCpGs identified by our method were also included in the Gaphunter results. Moreover, the number of mmCpGs identified by our method alone is much smaller than the number identified by Gaphunter alone, which indicates that our method is much more conservative than Gaphunter (Figure 2.1). To evaluate the sensitivity of both methods to sample size, we randomly picked different numbers of individuals from the pre-treatment DNA methylation data and applied both methods to chromosome 21 of the resulting sub-datasets. Our analysis shows that Gaphunter reports many more mmCpGs at small sample sizes, and that the number of mmCpGs it identifies decreases as the sample size increases (Figure 2.2). In contrast, our method is very stable across all tested sample sizes. Many of the mmCpGs reported by Gaphunter at small sample sizes are likely false positives. In summary, the results support that our method is more conservative at small sample sizes and more stable than Gaphunter.
Figure 2.1 Venn diagram of GMMC and Gaphunter results on the pre-treatment (A) and post-treatment (B) dataset.
Figure 2.2: Number of mmCpGs detected by GMMC and Gaphunter at different sample sizes.
2.3.2 Association between mmCpGs and SNPs
To investigate the relationships between mmCpGs and SNPs, we separate mmCpGs into two categories: 1) mmCpGs that have a SNP directly overlapping them (at either the C position or the G position); 2) mmCpGs with no directly overlapping SNP that nevertheless have strong associations with SNPs in close physical proximity. Since family structure may affect the correlation between genotype and methylation level, we further conducted the analysis on unrelated individuals (182/153), which detected 3014 and 3128 mmCpGs in the pre- and post-treatment datasets, respectively. In total, 453 CpG sites directly overlap a SNP; 180/190 of them are detected as mmCpGs in the pre/post-treatment datasets, respectively. Figure 2.3 shows an example of an mmCpG with an overlapping SNP and an example of a non-mmCpG site with an overlapping SNP.
[Figure 2.3, panels A and B: methylation level (0.00 to 1.00) plotted against genotype (0, 1, 2), with points colored by cluster (1, 2, 3).]
Figure 2.3: Example of mmCpG and non-mmCpG. A) An example of an mmCpG with a genotyped SNP physically overlapped with its location. Each point represents an individual. B) An example of a non-mmCpG with a genotyped SNP physically overlapped with its location. Distributions of different genotype groups are similar.
In addition to SNPs that directly overlap an mmCpG, nearby SNPs may affect or interact with mmCpGs. We examined SNPs located within 50 kb on either side of each CpG site (a 100 kb window). By matching genotypes with GMM cluster labels, we measured the association between mmCpGs and their nearby SNPs based on a contingency table. The results show that 68 to 70% of mmCpGs in the pre/post-treatment datasets are associated with at least one nearby SNP (Table 2.1). This observation supports our hypothesis that most mmCpGs are affected by SNPs. Moreover, we found that the most strongly associated SNPs were mostly located less than 20 kb away from the mmCpGs (Figure 2.4).
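Collecting the SNPs that fall inside the window around a CpG site is a simple range query. The Python sketch below uses binary search on a sorted list of SNP positions for one chromosome; the function names and coordinates are hypothetical, and the strict "less than 50 kb" boundary follows the wording of Section 2.2.4, which is an assumption about how the boundary was actually handled.

```python
from bisect import bisect_left, bisect_right


def snps_near(snp_positions, cpg_position, flank=50_000):
    """Return SNP positions strictly less than `flank` bp from the CpG site.

    snp_positions must be sorted and lie on the same chromosome as the CpG site.
    """
    lo = bisect_right(snp_positions, cpg_position - flank)   # first SNP past the left edge
    hi = bisect_left(snp_positions, cpg_position + flank)    # first SNP at or past the right edge
    return snp_positions[lo:hi]


def signed_distance_kb(snp_position, cpg_position):
    """Signed distance from the CpG site to a SNP in kb (negative means upstream)."""
    return (snp_position - cpg_position) / 1000.0


# Hypothetical sorted SNP coordinates around a CpG site at position 200,000.
snp_positions = [100_000, 149_999, 150_000, 200_000, 249_999, 250_000, 300_000]
nearby = snps_near(snp_positions, cpg_position=200_000)
print(nearby)
print([signed_distance_kb(p, 200_000) for p in nearby])
```

SNPs exactly 50 kb away are excluded under the strict boundary, while a SNP at the CpG position itself is returned with distance zero, so direct overlaps and nearby SNPs can be separated afterwards by checking the distance.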
Table 2.1 Summary of mmCpGs

              mmCpG_pre   Percent   mmCpG_post   Percent
All results   3014                  3128
p<=0.001      2073        68.78%    2207         70.56%
p>0.001       941         31.22%    921          29.44%
[Figure 2.4: density (0.00 to 0.05) plotted against the distance from each mmCpG site to the SNP with the minimum p-value (kb), covering 50 kb on either side of the site.]
Figure 2.4: Distribution of distances between mmCpG sites and their associated SNPs.
2.4 Discussion
In this dissertation, we have proposed a novel GMMC based method to detect mmCpGs genome-wide from Illumina 450K data. We applied this method to the GAW20 dataset and found that the majority of mmCpGs are associated with SNPs that either directly overlap the CpG sites or are in close proximity to them. Empirical analysis demonstrates that our method is more stable than Gaphunter, a threshold-based method. The ideas underlying threshold-based clustering and model-based clustering are quite different. Threshold-based methods such as Gaphunter use a fixed cut-off value to draw a boundary that separates data points, which may not capture the characteristics of different clusters. First of all, the choice of cut-off values is mostly arbitrary. The same cut-off values may not be valid for different datasets or, even worse, for different CpG sites in the same dataset, because the methylation level distributions of different CpG sites may have different characteristics; for example, some CpG sites have larger gaps between clusters while others have smaller gaps. Moreover, the cut-off values can be quite sensitive to sample size. When the sample size is small, the distribution of DNA methylation levels among individuals is sparse and the within-cluster distances may be large, so threshold-based methods are prone to false positives. When the sample size is large, the distribution is dense and clusters may overlap, so it is hard for threshold-based approaches to cluster the samples correctly. In contrast, model-based clustering methods, including the one proposed here, are designed to obtain models that fit the distributions and can therefore naturally capture the characteristics of the cluster structure. As a result, they usually provide more accurate results. In addition, the Gaussian Mixture Model can detect clusters that overlap, as long as they remain identifiable.
The current study mainly focused on the detection of mmCpGs. Our findings suggest that there may be connections between genetics and epigenetics. One should not treat mmCpGs as irregularities and filter them out of further analysis. Instead, careful characterization after identification is crucial to better understand the biological significance of mmCpGs. In the future, we will continue to investigate the association between mmCpGs and genotypes and explore how these mmCpGs might relate to phenotypes.
Chapter 3
Discovering DNA methylation co-occurrence patterns
3.1 Motivation
With the growth of bisulfite sequencing data, many bisulfite sequencing data analysis tools have been proposed in recent years. Among them, QUMA [16], BISMA [17] and BiQ Analyzer [18] are earlier tools that have been widely adopted. However, none of these tools can handle large datasets with ultra-high read coverage or a large number of targeted regions, which are increasingly common in real data analysis. For example, the QUMA web server limits the maximum number of bisulfite sequence reads per request to 400. Similarly, for BISMA, the number of sequences that can be uploaded is limited to 400 and the upload file size is limited to 10 MB. Even later tools such as BiQ Analyzer HT [19], which was designed specifically for processing large datasets, cannot keep up with the throughput of data generation, mainly because they rely on a global sequence alignment algorithm. This alignment strategy also limits their use to very small genomic regions.
More recently, newer tools such as Bismark [20] and BS-Seeker [21] have adopted more efficient mapping tools with modifications for bisulfite sequencing data. Therefore, they can effectively handle larger datasets, especially those generated by next-generation sequencing (NGS) technologies [22]. However, the primary focus of these tools is to map sequence reads and to call the methylation status at each site. Other functionality for downstream pattern analysis and visualization is limited. Furthermore, most existing tools provide little, if any, functionality for analyzing methylation co-occurrence patterns or for correlating methylation patterns with mutations.
The term DNA methylation co-occurrence patterns used here refers to groups of sequencing reads sharing specific DNA methylation patterns. There are many possible sources of co-occurrence patterns, such as tumor heterogeneity, cell heterogeneity in the same tissue and allele differences. Investigating such patterns can provide further insights in distinguishing different cancer subtypes [23], in revealing mechanisms of cancer development [24], and in detecting allele-specific methylation.
The difference between co-occurrence pattern analysis and traditional DNA methylation pattern analysis is clear. Traditional DNA methylation pattern analysis focuses on the methylation level of each individual CpG site. It condenses two-dimensional data (reads by CpG sites) into one dimension, i.e., it discards the read of origin of each methylation call. We can then no longer tell the methylation level of subgroups of sequencing reads. Further, the methylation level of a CpG site is contaminated by sequencing or bisulfite conversion noise. A small number of reads whose methylation patterns differ from those of the majority of reads are likely to be affected by noise. Thus, co-occurrence pattern analysis can reveal information that cannot be discovered by traditional pattern analysis and can produce more accurate methylation levels.
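A small, hypothetical example makes the loss of information concrete. In the sketch below, two toy read sets over three CpG sites have identical per-site methylation levels but very different co-occurrence patterns. The read encoding ("M" for methylated, "U" for unmethylated) and the function names are assumptions chosen for illustration only.

```python
from collections import Counter

def site_levels(reads):
    """Fraction of methylated calls at each CpG site (one value per site)."""
    n_sites = len(reads[0])
    return [sum(r[i] == "M" for r in reads) / len(reads) for i in range(n_sites)]

def cooccurrence(reads):
    """Number of reads sharing each full methylation pattern."""
    return Counter(reads)

# Toy data: each string is one read covering three CpG sites.
two_subgroups = ["MMM"] * 5 + ["UUU"] * 5
mixed_reads = ["MUM", "UMU", "MMU", "UUM", "MUU", "UMM", "MMM", "UUU", "MUM", "UMU"]

levels_a = site_levels(two_subgroups)    # per-site view of the first set
levels_b = site_levels(mixed_reads)      # per-site view of the second set
patterns_a = cooccurrence(two_subgroups)
patterns_b = cooccurrence(mixed_reads)
```

Both read sets are 50% methylated at every site, so the per-site view cannot separate them. Only the pattern counts show that the first set consists of two homogeneous subgroups while the second is a mixture of many patterns.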
Although DNA methylation co-occurrence patterns can be called from any type of bisulfite sequencing data, a low depth of coverage does not provide enough confidence for pattern detection. Ultra-deep bisulfite sequencing makes it feasible to analyze co-occurrence patterns. At the same time, extremely high coverage poses a huge challenge for processing, analyzing and visualizing the data. Thus, it is necessary to develop a novel method to efficiently detect and visualize DNA methylation co-occurrence patterns.
In this dissertation, we present a web application service named BSPAT, for Bisulfite Sequencing Pattern Analysis Tool, which takes advantage of Bismark’s read alignment and methylation calling functionality and provides further quality control, co-occurrence pattern analysis, simple allele-specific methylation analysis, visualization, and integration with other databases and tools. In addition to the web service, the source code of the tool is also made available, which enables advanced users to deploy BSPAT on their own machines for dedicated analysis of large volumes of data without uploading them to our server. We have applied BSPAT to a real dataset generated from two prostate cancer cell lines and one normal prostate epithelial cell line. The results show some interesting methylation co-occurrence patterns that differ between cell lines. A potential allele-specific methylation case is also observed. We have also compared the performance of BSPAT with that of a popular tool, BiQ Analyzer HT [19]. The results show that BSPAT is much faster, uses less memory, and generates more results for visualization and further analysis.
3.2 Methods
3.2.1 Characteristics
Compared with existing tools, BSPAT has several important features: 1) The methylation pattern analysis features provided by most existing tools focus on either the overall methylation status of a CpG-rich region or the methylation level of each CpG site. Although detailed single-read methylation patterns may be presented, the significant co-occurrence patterns are not summarized. BSPAT summarizes these patterns. 2) BSPAT also provides a feature to automatically discover potential allele-specific DNA methylation co-occurrence patterns in a targeted region. 3) By utilizing a sequence mapping approach instead of sequence alignment algorithms, BSPAT is much faster than existing tools, as demonstrated in the Result section. 4) BSPAT implements an easy-to-use integrated workflow and visualizes results in multiple formats.
3.2.2 Workflow
The workflow of BSPAT is shown in Figure 3.1. It mainly consists of two stages: a mapping stage and an analysis stage. We discuss both of them in detail in this subsection.
For sequence reads generated from bisulfite sequencing projects, BSPAT accepts both FASTA and FASTQ formats as its inputs (Figure 3.1 A) for mapping. Four different types of quality scores (i.e., phred33, phred64, solexa and solexa1.3) are supported for the FASTQ format. Reads from multiple experiments can be uploaded at the same time. Each experiment can consist of one or more genomic regions. A utility script is also provided to extract data from multiplex experiments. BSPAT also requires users to provide a reference sequence file in FASTA format, which can contain reference sequences from all the regions/experiments. Because the program uses a mapping strategy instead of an alignment strategy, it assumes that read lengths are smaller than the lengths of the reference sequences. BSPAT is designed mainly for targeted sequencing data, where the regions sequenced are known a priori. Therefore, users should provide reference sequences of the targeted regions, not the whole human genome, to speed up the mapping and analysis. To obtain genome coordinates of these regions for the analysis stage, BSPAT calls the Blat service hosted by the UCSC Genome Browser [25], [26] to automatically acquire the genome coordinates of the reference sequences. Three versions of the genome assembly (i.e., hg38, hg19, hg18) are currently supported. The top Blat result for each region, which in general represents the true region, is selected for use in the analysis step. To map bisulfite-converted sequence reads to reference regions, BSPAT relies on another program, Bismark (Figure 3.1 B), which in turn calls Bowtie [27] to perform the mapping. The mapping step takes the majority of the execution time. BSPAT allows up to three mismatches in the seed region of each read, but gaps are not allowed. Reads with low mapping quality are discarded. Users are notified by email (if provided) when the mapping result is ready. A unique identifier is assigned to each executed job, and users can use that identifier to retrieve the results. The webpage is also refreshed when the result is ready and provides summary information about the mapping result, the genomic coordinates of the targeted regions, and a link to the detailed results generated by Bismark.
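The quality score types listed above differ mainly in how a FASTQ quality character is converted to a numeric score. The sketch below decodes the two Phred-based encodings by subtracting the conventional ASCII offsets of 33 and 64. It is an illustration under stated assumptions, not BSPAT or Bismark code: the function names are hypothetical, and the Solexa encoding, which uses a different score scale, is deliberately left out.

```python
# Conventional ASCII offsets for the two Phred-based FASTQ encodings.
QUALITY_OFFSETS = {"phred33": 33, "phred64": 64}

def decode_qualities(quality_string, encoding="phred33"):
    """Convert a FASTQ quality string into a list of Phred scores."""
    offset = QUALITY_OFFSETS[encoding]
    scores = [ord(ch) - offset for ch in quality_string]
    if any(q < 0 for q in scores):
        # A negative score usually means the wrong encoding was selected.
        raise ValueError(f"quality string is not valid for {encoding}")
    return scores

def error_probability(phred_score):
    """Base-calling error probability implied by a Phred score."""
    return 10 ** (-phred_score / 10)

scores_33 = decode_qualities("I5!", "phred33")
scores_64 = decode_qualities("hT@", "phred64")
```

Choosing the wrong encoding shifts every score by 31, which is why the encoding has to be stated explicitly when reads are submitted.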
Figure 3.1: Workflow of BSPAT.
A) Example of input sequence reads in FASTQ format. B) Sequence reads are mapped to the reference. C) For a given targeted region, only reads that cover all CpG sites in the region are considered in generating co-occurrence patterns. D) Methylation patterns and mismatch information at the single-read level. E) Visualization of results in three different formats. 1) DNA methylation co-occurrence patterns in text format. ‘@@’ represents a methylated CpG site; ‘**’ represents an unmethylated CpG site; ‘-’ represents a non-CpG context nucleotide; a mismatch is represented by the variant allele at the position. 2) Graphical representation of methylation co-occurrence patterns with genomic coordinate information. A black circle represents a methylated CpG site and a white one represents an unmethylated CpG site. The last row represents the proportion of methylated reads to the total number of reads at each site. The colored circles show methylation rates from low (green) to high (red). The variant allele in each pattern is represented by a blue bar. 3) Methylation patterns shown as a UCSC Genome Browser custom track.
Based on the mapping results, BSPAT not only summarizes the methylation level at each CpG site but, more importantly, also examines methylation co-occurrence patterns of CpG sites in close proximity. BSPAT does so in several steps. First, low-quality reads are filtered out based on user-defined parameters such as bisulfite conversion rate and sequence identity. Second, in order to view co-occurrence patterns, a user needs to specify a window by providing its genomic coordinates. If no such window is given, BSPAT uses a default window of 70 bp starting at the first CpG site of the reference sequence. Only reads that cover all the CpG sites in the view window are considered in generating co-occurrence patterns (Figure 3.1 C). For each read, the methylation status at all CpG sites covered by the read is regarded as its methylation signature, or pattern (Figure 3.1 D). Then, all reads with the same signature are grouped into a methylation co-occurrence pattern, and the number of such reads is the support of the pattern.
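The window selection and grouping steps described above can be sketched as follows. This is a minimal illustration rather than the BSPAT source code: the representation of a read as a dictionary from CpG position to methylation call, the function name, the restriction of the signature to CpG sites inside the window, and the half-open window boundary are all assumptions.

```python
from collections import Counter

def cooccurrence_patterns(reads, cpg_positions, window_start=None, window_size=70):
    """Group reads by methylation signature inside a view window.

    reads: list of dicts mapping CpG position -> True (methylated) / False.
    Returns the CpG sites in the window and a Counter of signature -> support.
    """
    if window_start is None:
        # Default window starts at the first CpG site of the reference.
        window_start = min(cpg_positions)
    window_end = window_start + window_size  # assumed exclusive upper bound
    sites = sorted(p for p in cpg_positions if window_start <= p < window_end)

    support = Counter()
    for calls in reads:
        # Keep only reads that cover every CpG site in the window.
        if all(p in calls for p in sites):
            signature = tuple(calls[p] for p in sites)
            support[signature] += 1
    return sites, support

# Hypothetical reference with four CpG sites and four mapped reads.
cpg_positions = [10, 25, 60, 95]
reads = [
    {10: True, 25: True, 60: False},
    {10: True, 25: True, 60: False, 95: True},
    {10: False, 25: False, 60: False},
    {25: True, 60: False},  # does not cover position 10, so it is skipped
]
sites, support = cooccurrence_patterns(reads, cpg_positions)
```

With the default 70 bp window, the site at position 95 falls outside the view, the first two reads share one signature with a support of two, and the incomplete read does not contribute to any pattern.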