METHODS AND ANALYSES IN THE STUDY OF

HUMAN DNA METHYLATION

by

KE HU

Submitted in partial fulfillment of the requirements

For the Degree of Doctor of Philosophy

Department of Electrical Engineering and Computer Science

CASE WESTERN RESERVE UNIVERSITY

May, 2018

CASE WESTERN RESERVE UNIVERSITY

SCHOOL OF GRADUATE STUDIES

We hereby approve the thesis/dissertation of

Ke Hu

candidate for the degree of Doctor of Philosophy *.

Committee Chair

Dr. Jing Li

Committee Member

Dr. Angela Ting

Committee Member

Dr. Fulai Jin

Committee Member

Dr. Xusheng Xiao

Date of Defense

March 29, 2018

*We also certify that written approval has been obtained

for any proprietary material contained therein.

Table of Contents

TABLE OF CONTENTS ...... I

LIST OF TABLES ...... IV

LIST OF FIGURES ...... V

ACKNOWLEDGEMENTS ...... VIII

ABSTRACT ...... 1

CHAPTER 1 INTRODUCTION ...... 3

1.1 DNA METHYLATION IN MAMMALS ...... 3

1.2 METHODS TO MEASURE DNA METHYLATION ...... 3

1.3 DISSERTATION ORGANIZATION ...... 5

CHAPTER 2 DETECTION OF CPG SITES WITH MULTI-MODAL DNA METHYLATION

LEVEL DISTRIBUTIONS ...... 6

2.1 MOTIVATION ...... 6

2.2 METHODS ...... 8

2.2.1 Data ...... 8

2.2.2 Gaussian Mixture Model Clustering ...... 9

2.2.3 Detection of multimodal CpG sites ...... 9

2.2.4 Associating GMM cluster labels with genotypes ...... 11

2.3 RESULTS ...... 11

2.3.1 Genome-wide survey of mmCpG sites in GAW20 dataset ...... 11

2.3.2 Association between mmCpGs and SNPs ...... 13

2.4 DISCUSSION ...... 16

CHAPTER 3 DISCOVERING DNA METHYLATION CO-OCCURRENCE PATTERNS ...... 18

3.1 MOTIVATION ...... 18

3.2 METHODS ...... 20

3.2.1 Characteristics ...... 20

3.2.2 Workflow ...... 21

3.3 RESULT ...... 27

3.3.1 Experiment Summary ...... 27

3.3.2 DNA methylation co-occurrence pattern analysis ...... 28

3.3.3 Potential ASM detection ...... 29

3.3.4 Efficiency ...... 32

3.4 DISCUSSION ...... 35

CHAPTER 4 GENOME WIDE PROFILING OF ALLELE-SPECIFIC DNA METHYLATION ..... 37

4.1 MOTIVATION ...... 37

4.2 METHODS ...... 43

4.2.1 Data ...... 43

4.2.2 Analysis flow ...... 46

4.2.3 The proposed ASM detection method ...... 50

4.2.3.1 Step 1: Mapping and methylation calling ...... 50

4.2.3.2 Step 2: Candidate region definition ...... 50

4.2.3.3 Step 3 ASM detection based on a Graph model ...... 51

4.2.3.4 Step 4 Final analysis ...... 55

4.2.4 Genome annotation ...... 55

4.2.5 CTCF binding data ...... 56

4.2.6 RNA-seq data ...... 56

4.2.7 SNP calling ...... 56

4.2.8 Checking consistency between heterozygous alleles and ASM partitions ...... 59

4.2.9 amrfinder result ...... 59

4.3 RESULT ...... 60

4.3.1 ASM is ubiquitous across the genome and is cell line specific ...... 60

4.3.2 Enrichment in female X ...... 70

4.3.2.1 Overlaps with RefSeq ...... 71

4.3.2.2 Overlaps with ENCODE regulatory elements...... 73

4.3.2.3 Relationship with expression levels...... 80

4.3.3 ASM significantly overlaps imprinted gene regions ...... 82

4.3.3.1 Majority of imprinted genes have ASMs...... 82

4.3.3.2 Imprinted genes overlap strong ASMs...... 89

4.3.3.3 Variability of ASM in imprinted regions in different cell lines...... 91

4.3.3.4 Overlaps with promoter regions and correlation with gene expression...... 94

4.3.4 ASM patterns in autosomes ...... 95

4.3.4.1 ASM distributions...... 95

4.3.4.2 Overlaps with regulatory elements and relations with expression levels...... 96

4.3.5 Heterozygous SNPs located in identified ASM regions strongly support read partitions ..... 98

4.3.6 Comparison with amrfinder ...... 106

4.4 DISCUSSION ...... 116

CHAPTER 5 A GENERAL BISULFITE SEQUENCE CLUSTERING AND VISUALIZATION TOOL ...... 120

5.1 BACKGROUND ...... 120

5.2 IMPLEMENTATION AND DESCRIPTION ...... 120

5.3 EXAMPLE AND DISCUSSION ...... 123

CHAPTER 6 CONCLUSION ...... 126

BIBLIOGRAPHY ...... 127

List of Tables

TABLE 2.1 SUMMARY OF MMCPGS ...... 15

TABLE 3.1 SIZES OF DATASETS USED IN THE EXPERIMENTS ...... 33

TABLE 4.1 SUMMARY STATISTICS OF DATA PROCESSING...... 45

TABLE 4.2 HETEROZYGOUS SNPS IN ASM REGIONS...... 58

TABLE 4.3 DETECTED ASM REGIONS IN DIFFERENT CELL LINES AND THEIR STATISTICS...... 64

TABLE 4.4 DISTRIBUTION OF ASM REGIONS ON AUTOSOMES...... 65

TABLE 4.5 DISTRIBUTION OF ASM REGIONS ON X ...... 65

TABLE 4.6 DISTRIBUTION OF ASM REGIONS ON Y CHROMOSOME...... 66

TABLE 4.7 THE OVERLAPS OF KNOWN IMPRINTED GENES WITH DETECTED ASMS IN ALL 8 CELL LINES...... 85

TABLE 4.8 SNP CALLING RESULT ...... 105

TABLE 4.9 SUMMARY OF 1K RANGE MERGED ASM REGIONS DETECTED BY ASM-DETECTOR ...... 110

TABLE 4.10 SUMMARY OF ASM REGIONS DETECTED BY AMRFINDER ...... 111

TABLE 4.11 COMPARISON OF ASM REGIONS DETECTED BY ASM-DETECTOR AND AMRFINDER IN MALE X

AND ALL Y CHROMOSOME...... 115

List of Figures

FIGURE 2.1 VENN DIAGRAM OF GMMC AND GAPHUNTER RESULTS ON THE PRE-TREATMENT (A) AND POST-

TREATMENT (B) DATASET...... 13

FIGURE 2.2: NUMBER OF MMCPGS DETECTED BY GMMC AND GAPHUNTER IN DIFFERENT SAMPLE SIZE...... 13

FIGURE 2.3: EXAMPLE OF MMCPG AND NON-MMCPG...... 14

FIGURE 2.4: DISTANCE DISTRIBUTION FROM MMCPG SITES AND THEIR ASSOCIATED SNPS...... 16

FIGURE 3.1: WORKFLOW OF BSPAT...... 23

FIGURE 3.2: EXAMPLES OF DNA METHYLATION CO-OCCURRENCE PATTERNS...... 29

FIGURE 3.3: AN ALLELE SPECIFIC METHYLATION EXAMPLE NEAR PAX6...... 31

FIGURE 3.4: EFFICIENCY COMPARISON OF BSPAT AND BIQ ANALYZER HT (REFERRED AS BIQ HT HERE)

USING DIFFERENT SETTINGS...... 34

FIGURE 3.5: PEAK MEMORY USAGE COMPARISON OF BSPAT AND BIQ ANALYZER HT (REFERRED AS BIQ HT

HERE) USING DIFFERENT SETTINGS...... 35

FIGURE 4.1: A TOY EXAMPLE ILLUSTRATING THE CLUSTERING ALGORITHM IMPLEMENTED IN ASM-

DETECTOR...... 49

FIGURE 4.2: GENOME-WIDE DISTRIBUTION OF ASMS (BLUE) AND SOME OTHER GENOMIC FEATURES

INCLUDING IMPRINTED GENES (NAMES), DNASE I HYPERSENSITIVE SITES (BLACK), TRANSCRIPTION

FACTOR BINDING SITES (RED) AND CPG ISLANDS (GREEN)...... 67

FIGURE 4.3 THE LENGTH DISTRIBUTION OF ASMS IN DIFFERENT CELL LINES. THE THREE ADS CELL LINES

SEQUENCED USING PAIRED-END TECHNOLOGY HAVE LONGER ASM REGIONS AND THE REST HAVE

SIMILAR LENGTH DISTRIBUTIONS...... 68

FIGURE 4.4 AUTOSOMAL ASMS IDENTIFIED IN EACH CELL LINE AND THEIR CORRESPONDING METHYLATION

LEVEL DISTRIBUTIONS OF THE SAME GENOMIC REGIONS IN OTHER CELL LINES...... 69

FIGURE 4.5 ASMS IDENTIFIED FROM X CHROMOSOME IN EACH CELL LINE AND THEIR CORRESPONDING

METHYLATION LEVEL DISTRIBUTIONS OF THE SAME GENOMIC REGIONS IN OTHER CELL LINES...... 70

FIGURE 4.6 GENOMIC DISTRIBUTIONS OF ASM REGIONS AND THEIR OVERLAPS WITH DIFFERENT GENE

ANNOTATIONS INCLUDING PROMOTER (RED), EXON (BLUE), INTRON (GREEN), AND INTERGENIC

(PURPLE) REGIONS BASED ON REFSEQ GENE ANNOTATIONS. DISTRIBUTIONS IN OTHER CONTEXTS

(ASMS ON AUTOSOMES, ON X CHROMOSOME, AND OVERLAPPING WITH IMPRINTED GENE REGIONS) ARE

ALSO SHOWN...... 73

FIGURE 4.7 FRACTIONS OF ASM REGIONS LOCATED IN DHS REGIONS IN EACH CELL LINE, BASED ON LENGTH.

THE BAR LABELED WITH “GENOME” IN EACH PANEL SHOWS THE FRACTION OF EACH FEATURE ON THE

GENOME...... 75

FIGURE 4.8 FRACTIONS OF ASM REGIONS LOCATED IN TFBS REGIONS IN EACH CELL LINE, BASED ON

LENGTH...... 76

FIGURE 4.9 FRACTIONS OF ASM REGIONS LOCATED IN CGI REGIONS IN EACH CELL LINE, BASED ON LENGTH.

...... 77

FIGURE 4.10 VIOLIN PLOT SHOWS THE SCORE DISTRIBUTION OF DHS IN ASM AND NONASM REGIONS FOR

EACH OF THE CELL LINES...... 78

FIGURE 4.11: ASMS IDENTIFIED IN EACH CELL LINE AROUND FIRRE GENE, TOGETHER WITH THE

METHYLATION LEVELS, AND SIGNALS OF DHS, TFBS, AND CGI...... 80

FIGURE 4.12: BOXPLOTS OF GENE TRANSCRIPT ABUNDANCE ON X CHROMOSOME OF THREE FEMALE CELL

LINES...... 82

FIGURE 4.13 THE NUMBER OF IMPRINTED GENES COVERED BY PREDICTED ASM REGIONS IN EACH CELL LINE

BY THE PROPOSED METHOD ASM-DETECTOR AND AN EXISTING ALGORITHM AMRFINDER. BLUE

REPRESENTS REGIONS DETECTED BY OUR METHOD ALONE, RED REPRESENTS REGIONS DETECTED BY

AMRFINDER ALONE, AND GREEN IN REPRESENTS SHUFFLED ASM REGIONS...... 84

FIGURE 4.14 THE DISTRIBUTION OF ASM REGIONS IN TERMS OF THEIR LENGTH AND THE NUMBER OF CPG

SITES WITHIN THEM. THE TOP ASM REGIONS ARE LABELED BY GENE NAMES THAT OVERLAP THEM, RED

FOR KNOWN IMPRINTED GENES AND BLUE FOR POSSIBLE IMPRINTED GENES. THE FIGURE SHOWS THE

UNION SET OF ASM REGIONS FROM ALL CELL LINES...... 90

FIGURE 4.15 ALL CELL LINES HAVE ASMS AROUND TSS OF KCNQ1OT1 AND SNRPN...... 92

FIGURE 4.16 ASMS ONLY OCCUR IN SOMATIC CELL LINES AROUND TSS OF GENE PEG3 AND MEG3...... 93

FIGURE 4.17: CELL LINE SPECIFIC ASMS ARE FOUND AROUND TSS OF DIRAS3 AND BLCAP...... 93

FIGURE 4.18 DISTANCE DISTRIBUTIONS OF ASM REGIONS TO THEIR NEAREST TSS FOR ALL THE CELL LINES.

ASMS ARE SIGNIFICANTLY CLOSE TO TSS, COMPARING WITH THE NULL DISTRIBUTIONS (GREY LINES)

GENERATED BY SHUFFLING ASM REGIONS RANDOMLY IN EACH SAMPLE...... 95

FIGURE 4.19 BOXPLOTS OF GENE TRANSCRIPT ABUNDANCE ON AUTOSOMES OF THE FOUR CELL LINES...... 98

FIGURE 4.20 EXAMPLES OF ASM REGIONS WITH READ PARTITIONS CONSISTENT WITH ALLELES OF

HETEROZYGOUS GENOTYPES...... 102

FIGURE 4.21 SIMILAR TO FIGURE 4.20, PANEL A IS UCSC GENOME BROWSER VIEW OF ANOTHER EXAMPLE IN

THE INTRAGENIC REGION OF TRAPPC9 GENE...... 104

FIGURE 4.22 COMPARISON OF OUR RESULTS AND AMRFINDER RESULTS...... 108

FIGURE 4.23 ASM REGIONS DETECTED BY BOTH APPROACHES AND BY INDIVIDUAL APPROACHES ALONE AND

THEIR OVERLAPS WITH DHS...... 109

FIGURE 4.24 UCSC GENOME BROWSER VIEW OF THE REGION AROUND TSS OF GENE ERICH3 (A) AND

AROUND TSS OF XIST (B)...... 114

FIGURE 4.25 GENOMIC DISTRIBUTIONS OF CPMRS AND THEIR OVERLAPS WITH DIFFERENT GENE

ANNOTATIONS. SAME LEGENDS ARE USED IN FIGURE 4.6...... 119

FIGURE 5.1 EXAMPLE OF GROUP PATTERN VIEW AND INDIVIDUAL METHYLATION PATTERN VIEW FIGURES. 123

Acknowledgements

I would like to thank my advisor, Dr. Jing Li, for his guidance, patience and support over the past years. It has been my pleasure to be his student during the long journey of Ph.D. study. I would also like to thank Dr. Angela Ting, Dr. Fulai Jin, and Dr. Xusheng Xiao for serving on my dissertation committee and providing valuable advice. Many thanks to my lab mates and classmates at Case. They made my life at Case pleasant and valuable.

Furthermore, I would like to thank my wife. Her love and understanding helped me get through the hard times. Most importantly, I would like to thank my parents for their love and unconditional support throughout my life.

Methods and Analyses in the Study of DNA Methylation

by

KE HU

Abstract

DNA methylation is an important epigenetic mechanism. Analysis of DNA methylation patterns helps us understand the mechanism and function of DNA methylation and the diseases associated with it. Advances in technology have increased both the depth and breadth of DNA methylation measurement, making it possible to detect multi-modal CpG sites, capture DNA methylation co-occurrence patterns, and profile genome-wide allele-specific DNA methylation (ASM) patterns from different types of data. In this dissertation, we describe novel tools and methods designed for analyzing human DNA methylation data.

DNA methylation BeadChip assays enable studies at the population level. We have developed a Gaussian Mixture Model Clustering (GMMC) based approach to systematically detect CpG sites with multi-modal methylation level distributions (mmCpGs) across the genome based on Illumina 450K data. Comparison with an existing approach shows that our GMMC based method is more accurate and consistent.

Ultra-deep bisulfite sequencing allows a depth of coverage of more than ten thousand reads for selected genomic regions. We developed BSPAT, an efficient and user-friendly tool to discover and visualize DNA methylation co-occurrence patterns from ultra-deep bisulfite sequencing datasets. In addition, BSPAT can identify potential ASM patterns from co-occurrence patterns that contain SNPs.

Recently, Whole Genome Bisulfite Sequencing (WGBS) has made it possible to study DNA methylation genome-wide at single-nucleotide resolution. We have developed a novel computational method to better detect ASM regions from WGBS data and have performed a comprehensive analysis of their distributions by applying the method to WGBS datasets from eight human cell lines. Results have shown that ASM regions are ubiquitous and functional in . Our findings confirm previous observations that ASM can be found in most imprinted genes and on the female X chromosome. Our method is highly reliable, with very low false positive rates, and the partition of reads in predicted ASMs is in high concordance with the two alleles when ASMs overlap heterozygous SNPs.

Based on our previous work, we have implemented a general bisulfite sequence clustering tool called BS-Cluster. It relaxes the requirements and setup of BSPAT and ASM profiling, and can therefore be applied to any kind of bisulfite sequencing dataset, including both ultra-deep bisulfite sequencing and WGBS.

Chapter 1

Introduction

1.1 DNA methylation in mammals

DNA methylation is a type of epigenetic event that plays an important role in gene regulation and during normal development [1]. In mammals, DNA methylation mostly occurs at CpG sites, where a cytosine is followed by a guanine and the two are connected by one phosphate. Methylated CpG dinucleotides contain a 5-methylcytosine instead of a normal cytosine. DNA methylation has been shown to be associated with many regulatory mechanisms such as gene silencing, gene imprinting and X chromosome inactivation. Many studies have found that abnormal DNA methylation at CpG dinucleotides is involved in human diseases such as cancer [2]. In addition, aberrant reprogramming of DNA methylation has been observed in induced pluripotent stem cells [3]. Analysis of DNA methylation patterns is of great importance in understanding the mechanism of DNA methylation and its functions [4].

1.2 Methods to measure DNA methylation

Accurate measurement of DNA methylation is essential for studying its mechanism and function. Many technologies have been developed to systematically acquire DNA methylation information [5]. According to the type of pretreatment, methods fall into three main categories: enzyme digestion, affinity enrichment and bisulfite conversion. Among them, bisulfite conversion is considered the “gold standard” for measuring DNA methylation. Bisulfite treatment of a DNA sample converts unmethylated cytosines to uracil while leaving methylated cytosines unaffected. During PCR, all uracils are amplified as thymines. Thus, in the PCR product, we can distinguish methylated from unmethylated cytosines: originally methylated cytosines remain cytosines, while unmethylated cytosines have been converted to thymines.
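As a small illustration of this principle, the following Python sketch simulates bisulfite conversion followed by PCR on a forward-strand sequence and then recovers the methylation status of each CpG cytosine by comparing the converted read with the reference. The function names and the toy sequence are assumptions made for this example only; real reads also require alignment, handling of the reverse strand, and handling of sequencing errors.

```python
def bisulfite_convert(seq, methylated_positions):
    """Simulate bisulfite treatment plus PCR on the forward strand.

    Unmethylated C is read as T (C -> U -> T); methylated C stays C.
    """
    converted = []
    for i, base in enumerate(seq):
        if base == "C" and i not in methylated_positions:
            converted.append("T")
        else:
            converted.append(base)
    return "".join(converted)


def call_cpg_methylation(reference, read):
    """Call methylation at each CpG cytosine of an already aligned read."""
    calls = {}
    for i in range(len(reference) - 1):
        if reference[i:i + 2] == "CG":
            if read[i] == "C":
                calls[i] = True    # cytosine retained: methylated
            elif read[i] == "T":
                calls[i] = False   # converted to thymine: unmethylated
            # any other base is left uncalled
    return calls


# Toy reference with CpG cytosines at positions 1, 4 and 8 (0-based).
reference = "ACGTCGACCG"
read = bisulfite_convert(reference, methylated_positions={1, 8})
calls = call_cpg_methylation(reference, read)
print(read, calls)
```

Note that the non-CpG cytosine at position 7 is also converted to thymine, which is why calling is restricted to positions where the reference carries a CpG.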

To obtain methylation information from the PCR product of bisulfite-treated DNA samples, different technologies have been developed. Array-hybridization based methods (such as the Illumina Infinium HumanMethylation450 BeadChip) use quantitative genotyping to detect DNA methylation at individual CpG loci. They enable genome-wide study of DNA methylation at a selected set of CpG sites with large sample sizes and low cost, and are therefore mostly used in population epigenetics studies. In contrast, sequencing based methods acquire single-nucleotide resolution information on both the DNA sequence and the DNA methylation within it, with higher breadth and depth of coverage than array-hybridization based methods. Ultra-deep bisulfite sequencing allows a depth of coverage as high as a hundred thousand reads per locus [6], [7]. It provides an opportunity to study epigenetic variation within a sample, such as variation between subtypes of cancer cells. Whole-genome bisulfite sequencing (WGBS), on the other hand, makes it possible to profile DNA methylation across the whole genome at single-nucleotide resolution. In the first WGBS dataset of the human genome, 94% of the cytosines in the genome were covered [8]. In addition, sequencing based methods provide enough information to allow the study of allelic differences in DNA methylation.

1.3 Dissertation organization

The remainder of this dissertation is organized by the type of DNA methylation measurement and the methods built on it: 1) detection of CpG sites with multi-modal DNA methylation level distributions from Illumina 450K data; 2) identification of DNA methylation co-occurrence patterns from ultra-deep bisulfite sequencing data; 3) profiling of genome-wide ASM patterns from WGBS data; 4) a general bisulfite sequence clustering and visualization tool. Each topic is divided into motivation, methods, results and discussion.

Chapter 2

Detection of CpG sites with multi-modal DNA methylation level distributions

2.1 Motivation

DNA methylation is one of the most widely used epigenetic marks and plays an important role in gene regulation, which may result in phenotypic differences among individuals, as well as phenotypic differences in the same individual before and after treatment [9]. Although epigenetics is traditionally defined as heritable changes in gene activity that do not involve genetic mutations, recent studies have suggested that associations exist between genetic variants and differences in DNA methylation levels [10], [11]. Large-scale genome-wide DNA methylation profiling (e.g., using the Illumina Infinium HumanMethylation450 BeadChip, a.k.a. Illumina 450K), together with genome-wide genotyping assays using SNP arrays, enables studies of associations between genetic variation and differences in DNA methylation levels.

While many studies have treated DNA methylation levels as a quantitative trait and performed so-called meQTL analysis, two recent studies [12], [13] have investigated multimodal distributions of methylation levels at CpG sites, primarily as a quality control step to correct methylation signals from Illumina 450K chips. Daca-Roszak et al. [12] studied relationships between SNP genotypes and methylation levels of 96 CpG sites from European and Asian populations. They observed multi-modal distributions among individual samples for CpG sites with SNPs. However, their study was limited to a very small subset of CpG sites and only considered CpG sites that physically overlapped with SNPs. In another attempt, Andrews et al. developed an interval based clustering method called Gaphunter to identify CpG sites with multi-modal distributions [13], which was implemented in the Bioconductor package minfi [14]. Gaphunter first sorts the individual DNA methylation levels of candidate CpG sites and then groups them into clusters using predefined methylation level thresholds. An optional post-processing step can be used to exclude outlier-driven clusters, which are defined as clusters that are small relative to the total sample size and the size of the largest cluster. This simple algorithm is fast and works well for moderately sized datasets with little experimental measurement noise in methylation levels. The authors also explored applications of CpG sites with multi-modal methylation level distributions (mmCpGs), such as probe quality control and population stratification adjustment. However, threshold-based approaches such as Gaphunter are sensitive to both noise levels and sample size.
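The gap-based idea described above can be sketched in a few lines of Python. This is a simplified illustration of sorting followed by threshold-based splitting, not the minfi implementation; the function name `gap_clusters` and the threshold value of 0.05 are assumptions chosen for the example, and the optional outlier filtering step is omitted.

```python
def gap_clusters(values, gap_threshold=0.05):
    """Split sorted methylation levels wherever two neighbours differ by
    more than gap_threshold (illustrative value, assumed for this sketch)."""
    if not values:
        return []
    ordered = sorted(values)
    clusters = [[ordered[0]]]
    for previous, current in zip(ordered, ordered[1:]):
        if current - previous > gap_threshold:
            clusters.append([current])    # a gap starts a new cluster
        else:
            clusters[-1].append(current)  # otherwise extend the current one
    return clusters


# Methylation levels of one CpG site across seven individuals.
levels = [0.10, 0.12, 0.11, 0.55, 0.52, 0.90, 0.93]
clusters = gap_clusters(levels)
print([len(c) for c in clusters])
```

Because a single fixed gap decides every split, a sparse sample can produce gaps by chance, while a dense or noisy sample can fill in a real gap. This is the sensitivity to sample size and noise noted above.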

To overcome those limitations, we propose a more generic and more robust clustering method to identify mmCpGs. The method is based on the Gaussian Mixture Model, and we apply it to the Genetic Analysis Workshop 20 (GAW20) datasets to identify mmCpGs. We further examine the relationships between SNPs and mmCpGs, in terms of both direct overlaps of their genomic locations and statistical associations between mmCpG clusters and genotypes of SNPs that are physically close to mmCpGs. Our analysis shows that 68% to 70% of mmCpG sites are associated with at least one SNP within their 100 kb neighborhood, suggesting high concordance between mmCpG clusters and individual genotypes. A comparison with Gaphunter shows that our approach is more robust and more stable.

2.2 Methods

2.2.1 Data

In this study, we analyzed the genome-wide DNA methylation data before and after treatment, as well as the dense SNP genotype data, provided by GAW20. There are 995 and 530 individuals from 182 and 153 families in the pre- and post-treatment methylation datasets, respectively. Among them, 823 individuals have been genotyped. The numbers of individuals that have both methylation and SNP data in the pre- and post-treatment datasets are 717 and 507, respectively. We performed mmCpG prediction on all individuals with methylation data, separately for the pre-treatment and post-treatment datasets. Due to time limitations, when assessing associations between mmCpGs and genotypes, we randomly picked one member from each family. Associations between mmCpGs and SNPs in related individuals will be examined in future studies. The number of CpG sites included is 463,995. The DNA methylation level of each CpG site in an individual is a numeric value between 0 and 1. The SNP array data consist of 718,566 SNPs. Each genotype is coded as a dosage of 0, 1 or 2 copies of the coded allele. Because the genotype and DNA methylation data may contain missing values for some SNPs or CpG sites, we only include individuals that have both genotype and DNA methylation information when associating cluster labels with genotypes.

2.2.2 Gaussian Mixture Model Clustering

The goal of our method is to identify, for each CpG site, clusters of individuals that have distinct distributions of DNA methylation levels, without using any prior knowledge of genotypes or phenotypes. The Gaussian Mixture Model (GMM) is one of the most widely used model-based clustering approaches and is suitable for identifying cluster structures from a mixture of multiple distributions. A GMM is a weighted sum of M component Gaussian densities, as shown in the formula below,

$$p(x \mid \lambda) = \sum_{i=1}^{M} w_i \, g(x \mid \mu_i, \sigma_i^2) \tag{2.1}$$

where $w_i$ is the weight of component $i$, and $g(x \mid \mu_i, \sigma_i^2)$, $i = 1, \ldots, M$, are the component Gaussian densities.

The assumption is that when methylation levels are affected by genotypes, each distinct genotype corresponds to a different distribution. In a population with different genotypes, methylation levels will then exhibit the characteristics of a mixture distribution. In our study, we use the mixtools package [15] to perform GMMC; it estimates model parameters for each cluster using the EM algorithm for a given number of clusters. It also provides the posterior probabilities of a sample belonging to each of the clusters. Evaluation metrics such as the Bayesian information criterion (BIC) and the log-likelihood of the mixture model can be used for model selection.
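To make formula (2.1) and the posterior probabilities concrete, the following Python sketch evaluates a one-dimensional mixture density and the posterior probability of each component for a single methylation level. It is an illustration written for this text, not the mixtools code; the parameter values (three components centered at 0.1, 0.5 and 0.9) are hypothetical.

```python
import math


def gaussian_pdf(x, mean, var):
    """Component density g(x | mu, sigma^2)."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def mixture_pdf(x, weights, means, variances):
    """Mixture density p(x | lambda) = sum_i w_i * g(x | mu_i, sigma_i^2)."""
    return sum(w * gaussian_pdf(x, m, v)
               for w, m, v in zip(weights, means, variances))


def posteriors(x, weights, means, variances):
    """Posterior probability that x was generated by each component."""
    joint = [w * gaussian_pdf(x, m, v)
             for w, m, v in zip(weights, means, variances)]
    total = sum(joint)
    return [p / total for p in joint]


# Hypothetical parameters for a CpG site with three clusters.
weights = [0.5, 0.3, 0.2]
means = [0.1, 0.5, 0.9]
variances = [0.05 ** 2] * 3

post = posteriors(0.12, weights, means, variances)
label = max(range(len(post)), key=post.__getitem__)  # index of the most probable cluster
print(label, post)
```

Assigning each individual to the component with the highest posterior probability, as in the last lines, gives the cluster labels used in the rest of this chapter.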

2.2.3 Detection of multimodal CpG sites

To determine the best number of clusters, we try different numbers of clusters iteratively and determine the best model using the BIC criterion. More specifically, we apply the algorithm below:

0. Starting with k=1, calculate BIC1 based on the uni-modal GMM1. Set BEST_MODEL=GMM1.

1. If k >= MAX_K, stop the iteration. Otherwise, set k=k+1, apply GMMC with k components, and obtain BICk and GMMk.

2. If BICk > BICk-1+BIC_INC_THRESHOLD, set BEST_MODEL=GMMk and return to step 1. Otherwise, stop the iteration.

Given the nature of our specific application, we set MAX_K to 3, corresponding to the three distinct genotypes of a SNP that is potentially associated with the mmCpG. The BIC increment threshold (BIC_INC_THRESHOLD) is used to control the model complexity. A larger number of clusters is accepted only if its BIC is substantially higher than the BIC with a smaller number of clusters. In practice, a higher value of the threshold makes the method less sensitive. We set BIC_INC_THRESHOLD=100 in our analysis so that our results are more conservative.
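The model selection loop can be sketched in Python as follows. This is a minimal, self-contained illustration and not the code used in our analysis: the EM routine `fit_gmm` is a plain textbook implementation standing in for mixtools, its quantile-based initialization and variance floor are assumptions, and the BIC is assumed to follow the convention BIC = 2 log L - p log n, in which larger values are better, consistent with the comparison in step 2.

```python
import math
import random


def log_likelihood(xs, weights, means, variances):
    """Log-likelihood of the data under a 1-D Gaussian mixture."""
    total = 0.0
    for x in xs:
        density = sum(
            w * math.exp(-(x - m) ** 2 / (2.0 * v)) / math.sqrt(2.0 * math.pi * v)
            for w, m, v in zip(weights, means, variances))
        total += math.log(density + 1e-300)  # guard against log(0)
    return total


def fit_gmm(xs, k, n_iter=200, min_var=1e-4):
    """Plain EM for a k-component 1-D GMM (illustrative stand-in for mixtools)."""
    n = len(xs)
    ordered = sorted(xs)
    means = [ordered[int((j + 0.5) * n / k)] for j in range(k)]  # quantile init
    grand_mean = sum(xs) / n
    pooled_var = max(sum((x - grand_mean) ** 2 for x in xs) / n, min_var)
    variances = [pooled_var] * k
    weights = [1.0 / k] * k
    for _ in range(n_iter):
        # E-step: responsibility of each component for each sample.
        resp = []
        for x in xs:
            joint = [
                w * math.exp(-(x - m) ** 2 / (2.0 * v)) / math.sqrt(2.0 * math.pi * v)
                for w, m, v in zip(weights, means, variances)]
            norm = sum(joint) + 1e-300
            resp.append([p / norm for p in joint])
        # M-step: re-estimate weights, means and variances.
        for j in range(k):
            nj = sum(r[j] for r in resp) + 1e-12
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            variances[j] = max(
                sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, xs)) / nj,
                min_var)
    return weights, means, variances


def bic(xs, model):
    """BIC = 2 log L - p log n (larger is better), with p = 3k - 1 parameters."""
    k = len(model[0])
    return 2.0 * log_likelihood(xs, *model) - (3 * k - 1) * math.log(len(xs))


def select_model(xs, max_k=3, bic_inc_threshold=100.0):
    """Increase k while the BIC improves by more than the threshold."""
    k = 1
    best_model = fit_gmm(xs, k)
    prev_bic = bic(xs, best_model)
    while k < max_k:
        k += 1
        model = fit_gmm(xs, k)
        cur_bic = bic(xs, model)
        if cur_bic <= prev_bic + bic_inc_threshold:
            break
        best_model, prev_bic = model, cur_bic
    return best_model


# Simulated methylation levels: one bimodal site and one unimodal site.
random.seed(0)
bimodal = ([random.gauss(0.2, 0.03) for _ in range(100)]
           + [random.gauss(0.8, 0.03) for _ in range(100)])
unimodal = [random.gauss(0.5, 0.05) for _ in range(200)]
k_bimodal = len(select_model(bimodal)[0])
k_unimodal = len(select_model(unimodal)[0])
print(k_bimodal, k_unimodal)
```

With a threshold of 100, an extra component is kept only when it raises the log-likelihood by a large margin, so well-separated clusters are accepted while minor departures from a single Gaussian are not.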

Once the model is fixed, each individual is assigned to the cluster for which its posterior probability is highest. Our method also incorporates a post-processing step that uses several thresholds to filter out low quality clusters. First, the largest cluster cannot be too big: if the fraction of samples in the largest cluster is greater than 1-OUT_CUTT, where OUT_CUTT is a user-specified parameter, the CpG site is excluded from further analysis. Second, samples within each cluster should have small variance, controlled by a threshold MAX_STD for the maximum allowed standard deviation in each cluster. Finally, we require that cluster centers be separable from each other, which is controlled by a threshold MIN_MEAN_DIFF. In our study, we set MAX_STD=0.1 and MIN_MEAN_DIFF=0.2.
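The three filters can be expressed as a short Python function. This sketch is illustrative only. The default `out_cutt=0.05` is an assumed value, since OUT_CUTT is user specified; the use of the population standard deviation and the comparison of adjacent cluster means are likewise assumptions about details not fixed by the description above.

```python
import statistics


def passes_quality_filters(values, labels, out_cutt=0.05,
                           max_std=0.1, min_mean_diff=0.2):
    """Return True if the clustering of one CpG site passes all three filters."""
    clusters = {}
    for value, label in zip(values, labels):
        clusters.setdefault(label, []).append(value)
    if len(clusters) < 2:
        return False  # a single cluster is not multi-modal

    # Filter 1: the largest cluster must not hold more than 1 - OUT_CUTT of samples.
    largest = max(len(members) for members in clusters.values())
    if largest / len(values) > 1.0 - out_cutt:
        return False

    # Filter 2: every cluster must be tight (standard deviation <= MAX_STD).
    if any(statistics.pstdev(members) > max_std for members in clusters.values()):
        return False

    # Filter 3: adjacent cluster centers must differ by at least MIN_MEAN_DIFF.
    centers = sorted(statistics.mean(members) for members in clusters.values())
    if any(b - a < min_mean_diff for a, b in zip(centers, centers[1:])):
        return False
    return True


well_separated = passes_quality_filters(
    [0.10, 0.12, 0.11, 0.90, 0.88, 0.91], [0, 0, 0, 1, 1, 1])
outlier_driven = passes_quality_filters([0.1] * 99 + [0.9], [0] * 99 + [1])
too_close = passes_quality_filters([0.40, 0.41, 0.50, 0.51], [0, 0, 1, 1])
print(well_separated, outlier_driven, too_close)
```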

2.2.4 Associating GMM cluster labels with genotypes

To study the relationships between genotypes and mmCpGs, we evaluated genotype data and GMM cluster labels together to assess the strength of their association. We included all SNPs located less than 50 kb from an mmCpG site on either side. For each pair of a SNP and an mmCpG, we first constructed the contingency table for three genotypes and three cluster labels. Then a chi-square p-value was calculated and corrected by Bonferroni correction for multiple testing. Among all the nearby SNPs around an mmCpG, only the one with the minimum p-value is used as the measure of SNP-mmCpG association. Finally, a significance threshold of 0.001 (after Bonferroni correction) was used to determine whether an mmCpG is strongly associated with at least one nearby SNP.
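The association test can be sketched in Python as below. This is an illustrative version written for this text, not the code used in our analysis. Two details are assumptions: empty rows or columns of the contingency table are dropped before testing, and the number of tests used in the Bonferroni correction is passed in as a parameter (`n_tests`), because the exact count is not specified above. The SNP names and genotype vectors are made up for the example.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square distribution with df degrees of freedom."""
    if x <= 0:
        return 1.0
    if df % 2 == 0:
        k, q = 2, math.exp(-x / 2.0)
    else:
        k, q = 1, math.erfc(math.sqrt(x / 2.0))
    # Recurrence: Q(k + 2) = Q(k) + (x/2)^(k/2) * exp(-x/2) / Gamma(k/2 + 1)
    while k < df:
        q += (x / 2.0) ** (k / 2.0) * math.exp(-x / 2.0) / math.gamma(k / 2.0 + 1.0)
        k += 2
    return q


def chi_square_pvalue(genotypes, labels):
    """Chi-square test of independence between genotype dosages and cluster labels."""
    rows, cols = sorted(set(genotypes)), sorted(set(labels))
    if len(rows) < 2 or len(cols) < 2:
        return 1.0  # no variation in one of the variables, nothing to test
    n = len(genotypes)
    observed = {(g, c): 0 for g in rows for c in cols}
    for g, c in zip(genotypes, labels):
        observed[(g, c)] += 1
    row_tot = {g: sum(observed[(g, c)] for c in cols) for g in rows}
    col_tot = {c: sum(observed[(g, c)] for g in rows) for c in cols}
    stat = 0.0
    for g in rows:
        for c in cols:
            expected = row_tot[g] * col_tot[c] / n
            stat += (observed[(g, c)] - expected) ** 2 / expected
    return chi2_sf(stat, (len(rows) - 1) * (len(cols) - 1))


def best_snp_association(snp_genotypes, labels, n_tests):
    """Return the nearby SNP with the smallest Bonferroni-corrected p-value."""
    best_snp, best_p = None, 1.0
    for snp, genotypes in snp_genotypes.items():
        p = min(1.0, chi_square_pvalue(genotypes, labels) * n_tests)
        if best_snp is None or p < best_p:
            best_snp, best_p = snp, p
    return best_snp, best_p


# Cluster labels of 30 individuals at one mmCpG and two hypothetical nearby SNPs.
labels = [1] * 10 + [2] * 10 + [3] * 10
nearby = {
    "snp_a": [0] * 10 + [1] * 10 + [2] * 10,  # genotype tracks the cluster label
    "snp_b": [0, 1, 2] * 10,                  # genotype unrelated to the cluster label
}
snp, p_corrected = best_snp_association(nearby, labels, n_tests=len(nearby))
is_associated = p_corrected <= 0.001
print(snp, p_corrected, is_associated)
```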

2.3 Results

2.3.1 Genome-wide survey of mmCpG sites in GAW20 dataset

We applied our GMMC based method to both the pre-treatment (995 individuals) and post-treatment (530 individuals) Illumina 450K datasets and detected 3785 and 3847 mmCpGs, respectively. A large majority of them (2965 mmCpGs) were found in both datasets, while 820 and 882 mmCpGs were unique to the pre- and post-treatment datasets, respectively. To compare our method with Gaphunter, we also applied Gaphunter to the same datasets. Gaphunter identified 4313 and 5632 mmCpGs in the pre- and post-treatment datasets, respectively. About 78% and 91% of the mmCpGs identified by our method were also included in the Gaphunter results. Moreover, the number of mmCpGs identified by our method alone is much smaller than the number identified by Gaphunter alone, which indicates that our method is much more conservative than Gaphunter (Figure 2.1). To evaluate the sensitivity of both methods to sample size, we randomly picked different numbers of individuals from the pre-treatment DNA methylation data and applied both methods to chromosome 21 of the resulting sub-datasets. Our analysis shows that Gaphunter reports many more mmCpGs at small sample sizes, and the number of mmCpGs it identifies decreases as the sample size increases (Figure 2.2). In contrast, our method is very stable across all tested sample sizes. Many of the mmCpGs reported by Gaphunter at small sample sizes are likely false positives. In summary, the results support that our method is more conservative at small sample sizes and more stable than Gaphunter.


Figure 2.1 Venn diagram of GMMC and Gaphunter results on the pre-treatment (A) and post-treatment (B) dataset.

Figure 2.2: Number of mmCpGs detected by GMMC and Gaphunter at different sample sizes.

2.3.2 Association between mmCpGs and SNPs

To investigate the relationships between mmCpGs and SNPs, we separate mmCpGs into two categories: 1) mmCpGs with a SNP directly overlapping them (at either the C position or the G position); 2) mmCpGs with no directly overlapping SNP but with strong associations with some SNPs in close physical proximity. Since family structure may affect the correlation between genotype and methylation level, we further conducted the analysis on unrelated individuals (182 and 153 in the pre- and post-treatment datasets), which detected 3014 and 3128 mmCpGs in the pre- and post-treatment datasets, respectively. In total, 453 CpG sites directly overlap a SNP at their locations; 180 and 190 of them are detected as mmCpGs in the pre- and post-treatment datasets, respectively. Figure 2.3 shows an example of an mmCpG with an overlapping SNP and an example of a non-mmCpG site with an overlapping SNP.

Figure 2.3: Example of mmCpG and non-mmCpG. A) An example of an mmCpG with a genotyped SNP physically overlapped with its location. Each point represents an individual. B) An example of a non-mmCpG with a genotyped SNP physically overlapped with its location. Distributions of different genotype groups are similar. In both panels, the x-axis is the genotype (0, 1, 2) and the y-axis is the methylation level (0 to 1); the legend lists clusters 1 to 3.

In addition to SNPs that directly overlap an mmCpG, nearby SNPs may affect or interact with mmCpGs. We examined SNPs located within 50 kb on either side of each CpG site (a 100 kb window). By matching genotypes and GMM cluster labels, we measured the association between mmCpGs and their nearby SNPs based on a contingency table. The results show that 68% to 70% of mmCpGs in the pre- and post-treatment datasets are associated with at least one nearby SNP (Table 2.1). This observation supports our hypothesis that most mmCpGs are affected by SNPs in some way. Moreover, we found that the most strongly associated SNPs were typically located less than 20 kb away from the mmCpGs (Figure 2.4).

Table 2.1 Summary of mmCpGs

              mmCpG_pre   Percent   mmCpG_post   Percent
All result    3014                  3128
p<=0.001      2073        68.78%    2207         70.56%
p>0.001       941         31.22%    921          29.44%

Figure 2.4: Distance distribution from mmCpG sites and their associated SNPs. The x-axis shows the distance from each mmCpG site to the SNP with the minimum p-value (kb, from -50 to 50); the y-axis shows density.

2.4 Discussion

In this dissertation, we have proposed a novel GMMC based method to detect mmCpGs genome-wide from data generated by Illumina 450K chips. We applied this method to the GAW20 dataset and found that the majority of mmCpGs are associated with SNPs that either directly overlap the CpG sites or are in close proximity to them.

Empirical analysis demonstrates that our method is more stable than Gaphunter, a threshold-based method. The ideas underlying threshold-based clustering and model-based clustering are quite different. Threshold-based methods such as Gaphunter use a fixed cut-off value to draw a boundary that separates data points, which may not capture the characteristics of different clusters. First of all, the choice of cut-off values is mostly arbitrary. The same cut-off values may not be valid in different datasets or, even worse, for different CpG sites in the same dataset, because methylation level distributions of different CpG sites may have different characteristics; for example, some CpG sites have larger gaps between clusters while others have smaller gaps. Moreover, the cut-off values can be quite sensitive to sample size. When the sample size is small, the distribution of DNA methylation levels among individuals is sparse and within-cluster distances may be large, so threshold-based methods are prone to false positives. When the sample size is large, the distribution is dense and clusters may overlap, making it hard for threshold-based approaches to cluster samples correctly. In contrast, model-based clustering methods, including the one proposed here, are designed to obtain models that fit the distributions and can therefore naturally capture the characteristics of cluster structures. As a result, they usually provide more accurate results. In addition, the Gaussian Mixture Model can detect clusters with identifiable overlaps.

The current study mainly focused on the detection of mmCpGs. Our findings suggest that there may be connections between genetics and epigenetics. One should not treat mmCpGs as irregularities and filter them out of further analysis. Instead, careful characterization after identification is crucial for better understanding the biological significance of mmCpGs. In the future, we will continue to investigate the association between mmCpGs and genotypes and explore how these mmCpGs might relate to phenotypes.

Chapter 3

Discovering DNA methylation co-occurrence patterns

3.1 Motivation

Along with the generation of bisulfite sequencing data, many bisulfite sequencing data analysis tools have been proposed in recent years. Among them, QUMA [16], BISMA [17] and BiQ Analyzer [18] are earlier tools that have been widely adopted. However, none of these tools can handle large datasets with ultra-high read coverage or a large number of targeted regions, which are increasingly common in real data analysis. For example, the QUMA web server limits the number of bisulfite sequence reads per request to 400. Similarly, BISMA limits the number of sequences that can be uploaded to 400 and the upload file size to 10 MB. Even for later tools such as BiQ Analyzer HT [19], which was designed specifically for processing large datasets, performance still cannot keep up with the throughput of data generation, mainly because it uses a global sequence alignment algorithm. This alignment strategy also limits its use to very small genomic regions.

More recently, newer tools such as Bismark [20] and BS-Seeker [21] have adopted more efficient mapping tools with modifications for bisulfite sequencing data. Therefore, they can effectively handle larger datasets, especially those generated by next-generation sequencing (NGS) technologies [22]. However, the primary focus of these tools is to map sequence reads and to call the methylation status at each site. Functionalities for downstream pattern analysis and visualization are limited. Furthermore, most existing tools provide few if any functions for analyzing methylation co-occurrence patterns or for correlating methylation patterns with mutations.

The term DNA methylation co-occurrence patterns used here refers to groups of sequencing reads sharing specific DNA methylation patterns. There are many possible sources of co-occurrence patterns, such as tumor heterogeneity, cell heterogeneity in the same tissue and allele differences. Investigating such patterns can provide further insights in distinguishing different cancer subtypes [23], in revealing mechanisms of cancer development [24], and in detecting allele-specific methylation.

The difference between co-occurrence pattern analysis and traditional DNA methylation pattern analysis is clear. Traditional DNA methylation pattern analysis focuses on the methylation level of each individual CpG site. It condenses two-dimensional data (reads by CpG sites) into one dimension, i.e., it discards the information about which read each methylation call comes from. As a result, we can no longer tell the methylation level of subgroups of sequencing reads. Further, the methylation level of a CpG site is contaminated by sequencing and bisulfite conversion noise. A small number of reads whose methylation patterns differ from those of the majority of reads are likely to be products of such noise. Thus, co-occurrence pattern analysis can reveal information that cannot be discovered by traditional pattern analysis and can produce more accurate methylation levels.
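The loss of information caused by collapsing reads into per-site methylation levels can be seen in a small sketch. Both read sets below are hypothetical, and the helper names `site_levels` and `pattern_counts` are introduced only for this illustration.

```python
from collections import Counter
from itertools import product


def site_levels(reads):
    """Per-site methylation level: fraction of reads methylated at each CpG."""
    n = len(reads)
    return [sum(read[i] for read in reads) / n for i in range(len(reads[0]))]


def pattern_counts(reads):
    """Number of reads carrying each read-level methylation pattern."""
    return Counter(reads)


# Each read is a tuple of calls over the same three CpG sites
# (1 = methylated, 0 = unmethylated).
# Sample 1: a mixture of fully methylated and fully unmethylated reads.
sample_mixed = [(1, 1, 1)] * 40 + [(0, 0, 0)] * 40
# Sample 2: all eight possible patterns in equal proportion.
sample_random = list(product((0, 1), repeat=3)) * 10

print(site_levels(sample_mixed))          # [0.5, 0.5, 0.5]
print(site_levels(sample_random))         # [0.5, 0.5, 0.5]
print(len(pattern_counts(sample_mixed)))  # 2 distinct patterns
print(len(pattern_counts(sample_random))) # 8 distinct patterns
```

The two samples are indistinguishable at the level of individual CpG sites, yet their read-level patterns are entirely different.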

Although DNA methylation co-occurrence patterns can be called from any type of bisulfite sequencing data, a low depth of coverage does not provide enough confidence for pattern detection. Ultra-deep bisulfite sequencing makes it feasible to analyze co-occurrence patterns. At the same time, extremely high coverage poses a huge challenge for processing, analyzing and visualizing the data. Thus, it is necessary to develop a novel method to efficiently detect and visualize DNA methylation co-occurrence patterns.

In this dissertation, we present a web application service named BSPAT (Bisulfite Sequencing Pattern Analysis Tool), which takes advantage of Bismark’s read alignment and methylation calling functionalities, and provides further quality control, co-occurrence pattern analysis, simple allele-specific methylation analysis, visualization and integration with other databases and tools. In addition to the web service, the source code of the tool is also made available, which enables advanced users to deploy BSPAT on their own machines for dedicated analysis of large volumes of data without uploading them to our server. We have applied BSPAT to a real dataset generated from two prostate cancer cell lines and one normal prostate epithelial cell line. The results show some interesting methylation co-occurrence patterns that differ between cell lines. A potential allele-specific methylation case is also observed. We have also compared the performance of BSPAT with a popular tool, BiQ Analyzer HT [19]. The results show that BSPAT is much faster, uses less memory, and generates more results for visualization and further analysis.

3.2 Methods

3.2.1 Characteristics

Compared with existing tools, BSPAT has several important features: 1) The methylation pattern analysis features provided by most existing tools focus on either the overall methylation status of a CpG-rich region or the methylation level of each CpG site. Although detailed single-read methylation patterns may be presented, significant co-occurrence patterns are not summarized; BSPAT summarizes them. 2) BSPAT also provides a feature to automatically discover potential allele-specific DNA methylation co-occurrence patterns in a targeted region. 3) By utilizing a sequence mapping approach instead of sequence alignment algorithms, BSPAT is much faster than existing tools, as demonstrated in the Result section. 4) BSPAT implements an easy-to-use integrated workflow and visualizes results in multiple formats.

3.2.2 Workflow

The workflow of BSPAT is shown in Figure 3.1. It consists of two main stages: a mapping stage and an analysis stage. We discuss both in detail in this subsection.

For sequence reads generated from bisulfite sequencing projects, BSPAT accepts both FASTA and FASTQ formats as input (Figure 3.1 A) for mapping. Four different types of quality scores (i.e., phred33, phred64, solexa and solexa1.3) are supported for the FASTQ format. Reads from multiple experiments can be uploaded at the same time. Each experiment can consist of one or more genomic regions. A utility script is also provided to extract data from multiplexed experiments. BSPAT also requires users to provide a reference sequence file in FASTA format, which can contain reference sequences from all the regions/experiments. Because the program uses a mapping strategy instead of an alignment strategy, it assumes that read lengths are smaller than the lengths of the reference sequences. BSPAT is designed mainly for targeted sequencing data, where the regions sequenced are known a priori. Therefore, users should provide reference sequences of the targeted regions, not the whole human genome, to speed up the mapping and analysis. To obtain genome coordinates of these regions for the analysis stage, BSPAT calls the Blat service hosted by the UCSC Genome Browser [25], [26] to automatically acquire the genome coordinates of the reference sequences. Three versions of the genome assembly (i.e., hg38, hg19, hg18) are currently supported. The top Blat result for each region, which in general represents the true region, is selected for use in the analysis step. To map bisulfite-converted sequence reads to the reference regions, BSPAT relies on another program, Bismark (Figure 3.1 B), which in turn calls Bowtie [27] to perform the mapping. The mapping step takes the majority of the execution time. BSPAT allows up to three mismatches in the seed region of each read, but gaps are not allowed. Reads with low mapping quality are discarded. Users are notified by email (if provided) when the mapping result is ready. A unique identifier is assigned to each executed job, and users can use that identifier to retrieve the results. The webpage is also refreshed when the result is ready; it provides summary information about the mapping result, the genomic coordinates of the targeted regions, and a link to the detailed results generated by Bismark.


Figure 3.1: Workflow of BSPAT.

A) Example of input sequence reads in FASTQ format. B) Sequence reads are mapped to the reference. C) For a given targeted region, only reads that cover all CpG sites in the region are considered in generating co-occurrence patterns. D) Methylation patterns and mismatch information at single read level. E) Visualization of results in three different formats. 1) DNA methylation co-occurrence patterns in text format. ‘@@’ represents a methylated CpG site; ‘**’ represents an unmethylated CpG site; ‘-’ represents a non-CpG context nucleotide; a mismatch is represented by the variant allele at the position. 2) Graphical representation of methylation co-occurrence patterns with genomic coordinate information. A black circle represents a methylated CpG site and a white one represents an unmethylated CpG site. The last row represents the proportion of methylated reads to the total number of reads at each site. The colored circles show methylation rates from low (green) to high (red). The variant allele in each pattern is represented by a blue bar. 3) Methylation patterns are shown as a UCSC Genome Browser custom track.
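The text format described in panel E1 can be sketched as follows. This is only an illustration of the encoding given in the caption, not BSPAT's own code: the function name, its arguments and the example sequence are hypothetical, and the sketch does not handle a variant that falls inside a CpG site.

```python
def render_pattern(reference, methylated, variants=None):
    """Render one co-occurrence pattern as text.

    reference:  reference sequence of the region (string of A/C/G/T).
    methylated: dict mapping the 0-based index of the C in each CpG to
                True (methylated) or False (unmethylated).
    variants:   optional dict mapping a 0-based index to a variant allele.
    """
    variants = variants or {}
    out = []
    i = 0
    while i < len(reference):
        if reference[i:i + 2] == "CG":
            # A CpG site spans two characters: '@@' methylated, '**' unmethylated.
            out.append("@@" if methylated.get(i, False) else "**")
            i += 2
        else:
            # Non-CpG position: '-' unless a variant allele is reported there.
            out.append(variants.get(i, "-"))
            i += 1
    return "".join(out)


print(render_pattern("ACGTTCGA", {1: True, 5: False}))            # -@@--**-
print(render_pattern("ACGTTCGA", {1: True, 5: False}, {3: "A"}))  # -@@A-**-
```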

Based on the mapping results, BSPAT not only summarizes the methylation level at each CpG site but, more importantly, also examines methylation co-occurrence patterns of CpG sites in close proximity. BSPAT does so in several steps. First, low-quality reads are filtered out based on user-defined parameters such as bisulfite conversion rate and sequence identity. Second, in order to view co-occurrence patterns, a user needs to specify a window by providing its genomic coordinates. If no such window is given, BSPAT uses a default window of 70 bp starting at the first CpG site of the reference sequence. Only reads that cover all the CpG sites in the view window are considered in generating co-occurrence patterns (Figure 3.1 C). For each read, the methylation status at all CpG sites covered by the read is regarded as its methylation signature, or pattern (Figure 3.1 D). Then, all reads with the same signature are grouped into a methylation co-occurrence pattern, and the number of such reads is the support of the pattern.
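The grouping step can be sketched in a few lines. The read representation (start, end, and a dictionary of methylation calls), the signature alphabet ('M'/'U') and the function name are assumptions made for this illustration and do not reflect BSPAT's internal data structures.

```python
from collections import Counter


def cooccurrence_patterns(reads, cpg_positions):
    """Group reads by methylation signature over the CpG sites in a window.

    reads:         list of (start, end, calls), where calls maps a CpG
                   position to True (methylated) or False (unmethylated).
    cpg_positions: sorted CpG coordinates inside the view window.
    Returns a Counter mapping each signature to its support (read count).
    """
    support = Counter()
    for start, end, calls in reads:
        # Keep only reads that span every CpG site in the window.
        if start > cpg_positions[0] or end < cpg_positions[-1]:
            continue
        if any(pos not in calls for pos in cpg_positions):
            continue
        signature = "".join("M" if calls[pos] else "U" for pos in cpg_positions)
        support[signature] += 1
    return support


# Hypothetical reads over a window containing CpG sites at 105, 120 and 150.
reads = [
    (100, 170, {105: True, 120: True, 150: True}),
    (100, 170, {105: True, 120: True, 150: True}),
    (100, 170, {105: False, 120: False, 150: False}),
    (110, 180, {120: True, 150: True}),  # does not cover site 105, so it is skipped
]
patterns = cooccurrence_patterns(reads, [105, 120, 150])
print(patterns.most_common())  # [('MMM', 2), ('UUU', 1)]
```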

Given the noisy nature of the data, in general, only prevalent patterns with enough support are meaningful/significant. To filter out random patterns, users can use a simple fraction threshold (i.e., the percentage of the number of reads supporting a pattern over the number of all reads). In addition, BSPAT provides a simple Z-score-like statistic to measure the significance of a pattern. Basically, it assumes that all CpG sites in the region are independently methylated with a probability of 0.5. Therefore, for $k$ CpG sites in a region, there are $2^k$ different patterns, each with equal probability $1/2^k$. Any pattern with a frequency significantly greater than this probability is potentially important. However, the assumption may not hold in reality, in the sense that the total number of reads in the region may not be sufficiently large relative to the total number of CpG sites, and the methylation status of nearby CpG sites may be correlated. Therefore, instead of this probability, we define the baseline probability $p_0$ as one over the number of observed patterns in the data, to better reflect dependencies among methylated CpG sites in close proximity. Assume $\hat{p}$ is the percentage of reads supporting a pattern and $n$ is the total number of reads. Then one can utilize the one-sample Z-test for proportions to assess the significance of each pattern, with the alternative hypothesis $H_1: \hat{p} > p_0$. The Z-score can be calculated based on Equation (3.1), where the numerator represents the difference between the observed frequency and the expected frequency, and the denominator is the estimated standard deviation under the binomial distribution.

$$Z = \frac{\hat{p} - p_0}{\sqrt{p_0 (1 - p_0) / n}} \qquad (3.1)$$

If the p-value corresponding to the Z-score is smaller than a predefined threshold, the co-occurrence pattern is treated as significant. All significant patterns are shown in the results in descending order of significance.
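The significance test can be sketched as below, assuming the standard one-sample Z-test for proportions with the baseline probability taken as one over the number of observed patterns. The function name, the default threshold of 0.05 and the read counts are hypothetical; this is an illustration of the test described above rather than BSPAT's actual code.

```python
import math


def pattern_z_test(support, alpha=0.05):
    """One-sided one-sample Z-test for proportions on each pattern.

    support: dict mapping a pattern signature to its read count.
    Returns (pattern, z, p_value, is_significant) tuples, most significant first.
    """
    if len(support) < 2:
        raise ValueError("need at least two observed patterns")
    n = sum(support.values())
    p0 = 1 / len(support)                    # baseline: 1 / number of observed patterns
    se = math.sqrt(p0 * (1 - p0) / n)        # standard deviation under the null
    results = []
    for pattern, count in support.items():
        p_hat = count / n
        z = (p_hat - p0) / se
        p_value = 0.5 * math.erfc(z / math.sqrt(2))  # upper-tail normal probability
        results.append((pattern, z, p_value, p_value < alpha))
    return sorted(results, key=lambda r: r[2])


# Hypothetical supports for four observed patterns (1000 reads in total).
support = {"MMM": 600, "UUU": 300, "MUM": 60, "UMU": 40}
for pattern, z, p_value, significant in pattern_z_test(support):
    print(pattern, round(z, 2), significant)
```

In this example the baseline probability is 0.25, so only the two patterns supported by more than a quarter of the reads by a clear margin are reported as significant.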

In order to assess potential allele-specific methylation patterns, BSPAT first needs to discover mutations from the mapped reads. In the current implementation, it simply defines a mutation as a mismatch supported by a sufficiently large number of reads, as determined by a user-defined threshold. When a mutation exists, BSPAT naturally separates all reads into two groups: reads with the reference allele and reads with the mutated allele. For each group, BSPAT assesses the methylation level at each CpG site and assigns each CpG site to one of three categories based on the proportion of methylated reads covering the site: low methylation level (≤ 20% of reads are methylated), high methylation level (≥ 80% of reads are methylated), and intermediate level (otherwise). If the two groups corresponding to the two alleles have at least one CpG site where their methylation levels fall in two different categories and the actual difference between their methylation proportions is larger than 20%, BSPAT regards the region as a potential allele-specific methylation region. Then, within each group, BSPAT further generates methylation co-occurrence patterns by grouping reads with the same methylation signature.
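The decision rule for a potential allele-specific methylation region can be written as a short sketch. The thresholds follow the text (20%, 80%, and a minimum difference of 20%); the function names and the example methylation levels are hypothetical.

```python
def methylation_category(level):
    """Classify a per-site methylation proportion (0..1) into three categories."""
    if level <= 0.2:
        return "low"
    if level >= 0.8:
        return "high"
    return "intermediate"


def is_potential_asm(ref_levels, var_levels, min_diff=0.2):
    """Return True if at least one CpG site differs in category between the
    reference-allele and variant-allele read groups and the two methylation
    proportions differ by more than min_diff."""
    for ref, var in zip(ref_levels, var_levels):
        different_category = methylation_category(ref) != methylation_category(var)
        if different_category and abs(ref - var) > min_diff:
            return True
    return False


# Hypothetical per-site methylation levels for the two allele groups.
print(is_potential_asm([0.90, 0.85, 0.50], [0.10, 0.15, 0.45]))  # True
# Categories differ at the first site, but the difference is only 0.10.
print(is_potential_asm([0.75, 0.50], [0.85, 0.55]))              # False
```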

When BSPAT finishes the analysis, it visualizes significant methylation co-occurrence patterns and allele-specific methylation patterns in three different formats: text format (Figure 3.1 E1), graph (PNG or EPS) format (Figure 3.1 E2), and a format that can be loaded directly into the UCSC Genome Browser [28] as a custom track (Figure 3.1 E3). In addition, when a mutation coincides with an existing SNP in the dbSNP database [29], BSPAT provides a link to that SNP.

3.3 Result

3.3.1 Experiment Summary

To test the functions and performance of BSPAT, we performed analyses on a real bisulfite amplicon sequencing dataset as well as a simulated dataset derived from it. The real dataset consists of three prostate-related cell lines (DU145, LNCaP, PrEC), each with 24 genomic regions. DU145 and LNCaP are prostate cancer cell lines. PrEC is a normal prostate epithelial cell line. Genomic DNA from each cell line was bisulfite treated. The bisulfite-treated DNA was PCR amplified using primers specific for the 24 regions of interest. PCR products for all 24 amplicons were pooled for each cell line and used for subsequent Illumina next-generation sequencing library construction. To enable multiplexing, a uniquely indexed adapter was used for each cell line during library preparation. The final libraries for the cell lines were pooled in equimolar ratios before sequencing on one lane of an Illumina GAIIx. The average length of a region is about 127 bp, and the total length of all regions is 3020 bp. The whole dataset contains about half a million reads, with read lengths varying from 69 to 80 bp after trimming the library index and PCR primers. With default mapping parameters (maximum permitted mismatches = 2), 93.88% of reads were mapped uniquely to the reference sequences, with an average read depth of 18,886. The unmapped reads (6.12%) all had low quality scores or contained gaps. Default parameters were used in performing pattern analysis (e.g., bisulfite conversion rate 0.95, sequence identity 0.9, p-value 0.05 and mutation threshold 0.2). By examining the results, we found some interesting patterns that are potentially biologically important, which are discussed here. A more thorough analysis of the dataset will be presented elsewhere.

3.3.2 DNA methylation co-occurrence pattern analysis

Unlike overall methylation patterns that summarize methylation levels at each individual CpG site, methylation co-occurrence patterns can reveal rich information that could be biologically important. For example, Figure 3.2 A shows the methylation patterns in the CYP1B1 gene region for two cell lines, DU145 and LNCaP. Although the overall methylation patterns are similar in these two cell lines, the significant methylation co-occurrence patterns are different, with DU145 showing a single significant pattern and LNCaP showing two additional patterns. The diversity may be due to the existence of sub-categories in the LNCaP samples. Also, because the number of reads covering this region is extremely high, simply sorting and displaying all reads (as some other tools do) is not helpful in this case. In contrast, significant co-occurrence patterns give a clear and direct view of the methylation patterns. This is best illustrated by another example in the downstream region of gene HIST1H4D. There are two significant methylation co-occurrence patterns in the DU145 cell line: all CpG sites are completely methylated in one, and all CpG sites are completely unmethylated in the other (Figure 3.2 B). This suggests that the partial methylation status at those CpG sites is likely caused by a mixture of fully methylated and fully unmethylated reads [30]. Some other methylation co-occurrence patterns reveal possibly correlated methylation among neighboring CpG sites. Two examples are shown in Figure 3.2 C and D for genes TLX3 and NPR3, respectively. For TLX3, the methylation status of the first and the last CpG sites seems correlated, while for NPR3, the methylation status of the first and the third CpG sites seems correlated. By using a simple contingency table based on the read count of each pattern, we can calculate the significance level of such dependency based on a χ² statistic. The p-values for the two cases are 0.0046 and <0.0001, respectively. The observation supports the general notion that nearby CpG sites may be methylated together, but the biological mechanism of this dependence needs further investigation.
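A test of this kind can be sketched for a 2×2 contingency table of read counts at two CpG sites. The sketch assumes the Pearson χ² statistic with one degree of freedom and no continuity correction; the text does not state which variant was used, and the counts below are hypothetical rather than taken from the TLX3 or NPR3 data.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square test (1 d.f., no continuity correction) for the
    table [[a, b], [c, d]], e.g. rows = site 1 methylated / unmethylated,
    columns = site 2 methylated / unmethylated."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    chi2 = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # For one degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical read counts: both methylated, only site 1, only site 2, neither.
chi2, p_value = chi_square_2x2(40, 10, 10, 40)
print(chi2, p_value < 0.0001)  # 36.0 True
```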

Figure 3.2: Examples of DNA methylation co-occurrence patterns. A) DU145 and LNCaP cell lines have different significant methylation co-occurrence patterns in the CYP1B1 region. B) Two distinct co-occurrence patterns (in one, all sites are methylated; in the other, all sites are unmethylated) in the downstream region of HIST1H4D in the DU145 cell line. Examples of correlated partially methylated CpG sites in a region upstream of TLX3 from the PrEC cell line (C) and in the NPR3 region from LNCaP (D). For all sub-panels, coordinates are based on hg18. Because not all reads belong to a significant pattern, the sum of the percentages of all significant patterns (on the right-hand side of each pattern) is not necessarily 100%.

3.3.3 Potential ASM detection

From the pattern analysis results, we have found a potential allele-specific methylation pattern in the PAX6 region, as shown in Figure 3.3. The mutation identified is at the third CpG site and is also reported in dbSNP as SNP rs4440995. The nucleotide in the reference sequence is G and the variant allele is A. We first notice that in the LNCaP cell line, the overall methylation levels of reads with the reference allele and reads with the variant allele are significantly different (Figure 3.3 A). Further investigation based on co-occurrence patterns shows that the reference allele is associated with hypermethylation while the variant allele is associated with hypomethylation (Figure 3.3 B). We further examined the mutation and co-occurrence patterns in the other two cell lines in this region (Figure 3.3 C and D). Both alleles in the normal cell line (PrEC) are the reference allele, while both alleles in the DU145 cell line are the variant allele. The association between alleles and methylation co-occurrence patterns is different from that observed in the LNCaP cell line: the variant allele in DU145 exhibits hypermethylation patterns, while the reference allele in PrEC exhibits hypomethylation patterns.


Figure 3.3: An allele-specific methylation example near PAX6. A) Potential allele-specific methylation patterns were discovered in the LNCaP cell line near gene PAX6. The first row is the overall methylation level associated with the reference allele. The second row is the overall methylation level associated with the variant allele (indicated by the blue bar). Significant methylation co-occurrence patterns in LNCaP (B), in DU145 (C), and in PrEC (D) for the same region. Coordinates used here are based on hg18.

There are several possible explanations for this observation. First, PrEC is a normal cell line and has intact machinery to maintain the normal methylation pattern, which is largely unmethylated. This locus may be free of methylation in all normal prostate cells. In cancer cell lines, when methylation becomes abnormal, this locus gets methylated to achieve some desirable function, and the reference allele has a higher chance of becoming methylated (in LNCaP). Another possibility is that the reference allele in LNCaP is in linkage disequilibrium with something that needs to be methylated here in order to achieve desirable effects. For example, the reference allele in LNCaP is linked to a wild-type that needs to be silenced. The SNP is linked to a mutant protein that is already inactive. In DU145, both alleles are variant alleles and need to be silenced. Further studies and experiments are needed to confirm which hypothesis is true.

3.3.4 Efficiency

To evaluate the efficiency of BSPAT on larger datasets, we generated simulated datasets by replicating the reads from the original data multiple times (2X, 5X, 10X; see Table 3.1). We compared its performance with a state-of-the-art tool called BiQ Analyzer HT. BiQ Analyzer HT is a standalone program written in Java that was developed specifically for high-throughput bisulfite sequencing data. It performs read alignment and can visualize the methylation level at each CpG site and the methylation status of each read. Unlike BSPAT, however, it does not generate methylation co-occurrence patterns. BiQ Analyzer HT can only take FASTA format input files, whereas BSPAT can take both FASTA and FASTQ formats. We compared the memory usage and the time needed to perform the analysis by BSPAT and by BiQ Analyzer HT. All experiments were executed on the same computer with a 4-core 3 GHz CPU and 12 GB of memory. BiQ Analyzer HT was executed from the command line interface with the JVM heap setting -Xmx12g. The same JVM heap parameter was used in the Tomcat Server, which hosts BSPAT. BiQ Analyzer HT can only run in single-thread mode. We tested BSPAT in both single-thread and multi-thread modes (3 threads for the 3 cell lines in the experiments).

Table 3.1 Sizes of datasets used in the experiments

Replication   Read Count   File Size (MB)
                           FASTA    FASTQ
1x               482,791      67      134
2x               965,582     134      268
5x             2,413,955     335      570
10x            4,827,910     670     1340

Figure 3.4 shows that BSPAT is much faster than BiQ Analyzer HT under all settings. When using the same setting, i.e., the same FASTA format input and single-thread mode for both tools, BSPAT is about 3 to 4 times faster than BiQ Analyzer HT. When using multi-thread mode, BSPAT is about 6 to 7 times faster than BiQ Analyzer HT. The time BSPAT needs for FASTQ input is almost the same as for FASTA input. When BSPAT is used as a web service, its memory usage has no influence on end users. However, users can also deploy BSPAT on their own servers. In this case, BSPAT still has lower peak memory usage than BiQ Analyzer HT (Figure 3.5). Compared with BiQ Analyzer HT, single-thread BSPAT used about half as much memory. Multi-thread BSPAT used more memory than the single-thread version, but still less than BiQ Analyzer HT. In summary, BSPAT provides more features and has better performance than BiQ Analyzer HT in terms of both running time and memory usage.

Figure 3.4: Efficiency comparison of BSPAT and BiQ Analyzer HT (referred to as BiQ HT here) using different settings. BSPAT outperformed BiQ HT in all cases. BSPAT can accept FASTA or FASTQ format and run in single-thread or multi-thread mode. All experiments were run on the same computer with a quiet background. For BSPAT, the Tomcat Server did not host any other applications.


Figure 3.5: Peak memory usage comparison of BSPAT and BiQ Analyzer HT (referred to as BiQ HT here) using different settings. BSPAT used less memory than BiQ HT in all cases. Here the peak memory usage of BSPAT was measured by monitoring the memory usage of the Tomcat Server. For smaller datasets, the majority of the memory used by BSPAT was consumed by the Tomcat Server itself, so there are no significant differences between single-thread and multi-thread modes for the 1X dataset.

3.4 Discussion

In this dissertation, we have presented BSPAT, a web application for methylation pattern analysis based on bisulfite sequencing data. BSPAT capitalizes on ultra-deep sequence data in targeted regions to automate the detection of methylation co-occurrence patterns and allele-specific methylation. The implementation is efficient and also provides great flexibility in parameter settings. Visualization of the resulting patterns and integration with the UCSC Genome Browser allow users to examine other genomic features in the same regions together. In future work, we will refine mutation calling by combining prior information on genetic variations with more advanced variant calling algorithms. Furthermore, we will extend BSPAT to handle non-human bisulfite sequencing data.

Chapter 4

Genome wide profiling of Allele-specific DNA methylation

4.1 Motivation

Allele-specific DNA methylation (ASM) refers to the fact that different DNA methylation patterns exist in regions of two homologous chromosomes, either because of the different parental origins of the two haplotypes/alleles, or because the paternal allele and the maternal allele are different. In the first case (parent-of-origin ASM), whether a haplotype is methylated depends on whether it is inherited from the mother or the father, regardless of whether the two haplotypes/alleles are the same. In the second case (sequence-dependent ASM), a specific allele is associated with methylation and the other allele is associated with non-methylation, and such an association may exist in different individuals (sometimes referred to as the genetics of epigenetics). As a special type of differential DNA methylation, ASM is associated with many biological processes such as autosomal genomic imprinting and X chromosome inactivation (XCI). Genomic imprinting is an epigenetic phenomenon in which only one allele of an imprinted gene is expressed and the other one is silenced/imprinted. The allele to be expressed is determined by its parental origin. ASM has been found in many known imprinted gene clusters, and it is believed that imprinted genes are epigenetically marked through the use of DNA methylation [31]. X chromosome inactivation is the process by which one copy of the X chromosome in females is inactivated during embryonic development as an important dosage compensation mechanism. Therefore, only a single X chromosome remains active in both male and female cells. ASM is involved in the maintenance of stable XCI, in which the active X chromosome (Xa) and the inactive X chromosome (Xi) have different DNA methylation patterns [32]. Sequence-dependent ASM is not well understood but has been observed in both human and mouse [33]–[36].

Functions of DNA methylation have been studied for decades. It is generally believed that DNA methylation in gene promoter regions represses gene transcription. However, the understanding of the potential functions of ASM is limited and incomplete. Studies have shown that long-term stable silencing of genes on one haplotype/chromosome in genomic imprinting and X chromosome inactivation is associated with hypermethylation of gene promoters [4]. However, genome-wide studies of the relationship between ASM and allele-specific expression are lacking. Furthermore, identification of ASM regions may aid the identification of novel imprinted genes and genes on the X chromosome silenced by Xi or escaped from Xi. For example, a recent study has demonstrated that combining allele-specific expression with allele-specific methylation could improve the accuracy of novel imprinted gene detection [37]. Chromosome-wide DNA methylation analysis has been used to identify genes silenced by Xi or escaped from Xi [38], [39]. In addition, ASM profiling can also aid studies of epigenetic changes during disease and cell development. Different DNA methylation patterns have been widely observed among samples from different tissues, from different developmental stages [3], [8], [40], [41], from different disease statuses [42], [43], or from different types of cell lines (e.g., embryonic cells vs. somatic cells) [32]. Studies of these differential methylation patterns in general, and of ASM in particular, may lead to the discovery of novel biological functions of methylation.

The prerequisite for studying the functions of ASM is to systematically map ASM across the genome. Many previous studies treated regions with “intermediate methylation levels” as ASM regions, and thus could not distinguish real ASM regions from other types of intermediate methylation regions. Novel approaches that can accurately and efficiently survey genome-wide ASM regions are greatly needed. Earlier methods (using microarrays [33] or bisulfite sequencing [34], [35]) only measured signals in targeted sites/regions. With the development of whole-genome bisulfite sequencing technologies, genome-wide ASM detection at base-pair resolution has become a reality [36], [41], [44]. However, most of these methods rely on single-nucleotide polymorphism (SNP) information observed in sequencing reads to detect ASM, and consequently have two shortcomings: first, parent-of-origin ASM may not show direct associations with SNPs; second, the density of SNPs in the genome limits the discovery power of these methods. Compared with CpG sites, where DNA methylation normally occurs, SNPs are far sparser. Furthermore, it is unclear how far away a SNP can be while still having a direct impact on ASM. Therefore, only a small fraction of ASM can potentially be detected by using SNP information. A few methods have been developed recently that utilize DNA methylation information alone to detect ASM [45], [46]. The method by Peng and Ecker [45] utilizes a supervised learning approach to classify ASM regions, which relies heavily on training data. A classifier trained on one dataset may not be suitable for other datasets, and the acquisition of accurate training data is in general challenging. Furthermore, their approach chooses candidate ASM regions by using a fixed-width sliding window and a fixed distance threshold between two consecutive CpG sites, which makes accurate detection of ASM boundaries impossible. Another widely used method called amrfinder [46] also utilizes the sliding window approach to choose candidate regions. It then applies both single-allele and two-allele models to each candidate region. The Bayesian information criterion (BIC) or a likelihood ratio test (LRT) is used to select the best model. Like other sliding-window-based methods, amrfinder may not accurately predict the boundaries of ASM regions, which adds noise to downstream analysis of ASM. Furthermore, it is hard to set an optimal window size, which may depend on the dataset itself. Finally, amrfinder provides neither partitions of reads nor the methylation pattern of each allele/haplotype, both of which are quite useful in validation and follow-up studies.

A general assumption about ASM is that the methylation status of CpGs in reads from the same haplotype should be similar, and that reads from the two haplotypes form two distinct DNA methylation patterns. Two pieces of information are missing in detecting ASM. First, it is generally unknown which read comes from which haplotype. Second, the methylation pattern of each haplotype is also unknown. The key observation in detecting ASM regions is that one can utilize the methylation status of CpGs shared by reads to group them into two different clusters, corresponding to the two haplotypes. Although theoretically simple, there are many practical reasons that make the problem hard. For example, for any given ASM, it is possible, but not necessary, that all cytosine nucleotides in CpG sites are totally methylated in one haplotype but totally unmethylated in the other haplotype. Sequencing and mapping errors add further complications. In addition, due to CpG density and limited read length, a large portion of reads from existing data contain only one or even zero CpG sites, and thus provide little information for ASM detection. Other factors that affect detection include low or uneven coverage, read bias between the two haplotypes, incomplete bisulfite conversion, and errors in methylation calling or mapping.

Here we present a novel method named ASM-Detector that uses whole-genome bisulfite sequencing data to accurately detect ASM regions. The candidate regions are determined by the nature of the data without using a fixed sliding window. The boundaries of detected ASM regions are therefore primarily defined by the data (e.g., CpG site distribution, read coverage). A clustering method is then applied to each candidate region, and a statistic based on Fisher's combination method is used to assess significance.

By taking advantage of the newly proposed approach, we have systematically assayed ASMs in whole-genome bisulfite sequencing datasets of eight human cell lines, with a total size of 646G. In total, we have detected about 23 thousand ASM regions with a total length of 3.68M base pairs, covering 223 thousand CpG sites. The results show that ASMs are ubiquitous across the genome and are also cell line specific. ASMs are significantly enriched on X chromosomes in female cell lines, accounting for more than one third of the total number of ASMs detected in four out of six female lines. ASMs also have a strong presence in imprinted gene regions, covering 55 out of 88 validated imprinted genes. Furthermore, ASM regions are distributed throughout the autosomes as well. In terms of their relationships with genes, ASMs are significantly enriched in gene promoter regions.

Although promoter regions account for less than 2% of the genome length, 13% to 52% of ASM regions (in terms of length) from different cell lines are within promoter regions. To further assess their potential roles in transcription regulation, we have measured the overlaps of the predicted ASMs with other transcription regulatory markers, including DNase I hypersensitive sites (DHS), transcription factor binding sites (TFBS), and CpG islands (CGI). Extremely high concordance is observed between ASM and DHS (60% to 84% of ASMs of different cell lines are within DHS), and between ASM and TFBS (49% to 88% of ASMs of various cell lines overlap TFBS), implying that most detected ASMs are likely functional and most likely related to transcription regulation. Further analysis of gene expression data from a subset of the cell lines indicates that ASMs in promoter regions are associated with intermediate gene expression levels, providing additional evidence that ASM might play an important role in gene regulation and might lead to allele-specific expression. Although our method does not rely on SNP information in predicting ASMs, we have found that when a predicted ASM region overlaps heterozygous SNPs, the two methylation patterns are in high concordance with the two alleles (74.6%-94.0%), which strongly supports that our proposed method correctly partitions sequence reads into the two haplotypes. Finally, in comparison with an existing algorithm (amrfinder), we have detected more but shorter ASM regions, indicating better boundary detection by our method. In terms of length, 29% to 87% of ASM regions in our results overlap 8% to 36% of ASM regions detected by amrfinder.

4.2 Methods

4.2.1 Data

In recent years, with the help of the NIH Roadmap Epigenomics Project [47] and its participants [3], [8], genome-wide DNA methylation datasets have become increasingly available at finer resolutions, many of which provide quantitative, base-pair-resolution measurements of cytosine methylation across the genome. The most commonly used approach for generating such large-scale DNA methylation data is shotgun bisulfite sequencing, termed MethylC-Seq or BS-Seq, in which genomic DNA from primary tissues or cell lines is treated with sodium bisulfite to convert cytosine, but not methylcytosine, to uracil, which is subsequently read as thymidine through high-throughput sequencing. Sequence reads are then mapped to the reference genome and methylcytosines are called. In this dissertation, we collected a diverse group of eight datasets from [3], [8]: one fetal lung fibroblast cell line (IMR90), one male embryonic stem cell line (H1ESC), one male cell line from foreskin fibroblasts (FF), one female embryonic stem cell line (H9ESC), an induced pluripotent stem cell line derived from the fetal lung fibroblast cell line (IMR90-iPSC), and three cell lines based on adipose-derived stem cells (ADS, ADS-adipose and ADS-iPSC). The original data generators studied the methylation levels and distribution patterns of these cell lines, as well as their differentiation. However, none of them focused their analyses on allele-specific methylation patterns. This rich group of datasets enables us to study ASM patterns in different types of cells (e.g., stem cells, somatic cells and induced pluripotent stem cells) and in both sexes. In addition, by integrating gene functional annotations, information on imprinted gene regions, signals of regulatory elements, SNP genotypes called from the same datasets, and expression profiles of four of the cell lines, we were able to characterize the distributions of ASMs and their overlaps with functional elements, and to elucidate possible types and functional roles of detected ASMs.

In total, the raw data in this analysis amounted to around 646G and consisted of more than 10 billion reads. After trimming and duplicate removal, more than 5 billion reads were uniquely mapped to the human reference sequence. All three ADS cell lines were sequenced using paired-end technology with a read length of ~75 base pairs per end (75*2), while all other cell lines were sequenced using single-end technology with a read length of ~85 base pairs. The average coverage varies from 9X to 32X. More than 86% of the genome was covered by mapped reads for all cell lines. The percentages of covered CpG sites varied from 79% to 97%, and the actual coverage over CpG sites varied from 7X to 30X. There were two experimental replicates for each of the IMR90 and H1ESC cell lines. Mapped reads from the replicates were pooled before methylation calling. More data processing details for each cell line can be found in Table 4.1.


Table 4.1 Summary statistics of data processing.

Dataset      Type    Read length  # raw reads    % after processing  % of length covered  Overall coverage  % CpG covered  CpG coverage
IMR90_r1     SE(1)   87           1,336,843,086  43.41%              88.15%               15.88             88.82%         10.29
IMR90_r2     SE      87           1,480,805,943  41.29%              88.10%               16.51             88.16%         11.51
H1ESC_r1     SE      87           863,764,536    60.90%              86.77%               11.54             79.17%         7.04
H1ESC_r2     SE      87           1,118,907,995  48.10%              88.27%               14.71             88.01%         10.51
ADS          PE(2)   75*2         1,198,077,473  41.86%              90.33%               25.32             97.05%         20.33
ADS-adipose  PE      75*2         1,224,724,789  49.10%              90.37%               30.44             97.26%         25.06
ADS-iPSC     PE      75*2         1,322,574,755  53.60%              89.66%               32.50             96.98%         29.69
FF           SE      85           829,077,268    68.06%              88.46%               17.06             91.27%         12.94
H9ESC        SE      85           456,414,029    72.19%              87.57%               9.97              89.88%         8.03
IMR90-iPSC   SE      85           442,626,970    73.19%              87.24%               9.87              89.93%         8.12

(1) SE: Single-end.

(2) PE: Paired-end.

4.2.2 Analysis flow

Our analysis consists of four sequential steps. We outline these steps here and leave the details to Section 4.2.3.

In the mapping and methylation calling step, existing tools (e.g., Trim Galore! (http://www.bioinformatics.babraham.ac.uk/projects/trim_galore/), Bismark [20], Bowtie [27] and Picard (http://broadinstitute.github.io/picard/)) were used to trim reads, to map them to the human reference genome, and to remove duplicated reads. Methylation level and coverage at each CpG site were called using Bismark. We only consider methylcytosines within CpG sites in this study.

In the quality control step, in order to identify allele-specific methylation regions, we filtered CpG sites based on their coverage and methylation levels. Only CpG sites with high coverage that exhibited intermediate methylation levels were retained for further analysis. We then regarded two CpG sites as connected if at least one read covered both of them. Through these connections, CpG sites across the genome were naturally separated into local regions, each consisting of two or more CpG sites in close proximity. We further filtered these regions by the number of CpG sites and the number of reads within each region, and retained only high-quality regions for the next steps. After filtering, the candidate regions are of high quality in terms of both the number of supporting reads and CpG density, and real ASM regions are more likely to reside in them. We term these regions consecutively connected partial methylation regions (CPMRs).

The third step is the prediction step. We have developed a graph-based partitioning algorithm, termed ASM-Detector, to identify ASM regions from CPMRs and to assess the significance of each ASM region using Fisher's combination method [48] based on the difference between the two methylation patterns identified. The basic idea is that in a real ASM region, reads from the same haplotype are expected to have the same or similar methylation patterns, while reads from different haplotypes are expected to exhibit distinct methylation patterns. In contrast, if the region is not an ASM region, the methylation patterns of reads from the two haplotypes cannot be distinguished from each other. Therefore, for an ASM region, one can potentially partition the overlapping reads that share one or more CpG sites into two groups according to their methylation patterns. More specifically, ASM-Detector works as follows. First, for each CPMR, a graph is constructed in which a vertex is created for each read that contains at least one CpG site. An edge is created between two vertexes if the two corresponding reads share at least one CpG site. The edge weight is defined by the level of consistency of the methylation status of all CpG sites shared by the two reads (see Figure 4.1 for an example). We then apply a step-wise clustering algorithm that iteratively merges the two vertexes joined by the edge with the highest weight and updates the edge weights after each iteration, until there are no more positive edges in the graph. We further require that, at each CpG site, all the reads covering the site belong to at most two clusters: two clusters correspond to two distinguishable haplotypes, and one cluster corresponds to two indistinguishable haplotypes. If all the reads are grouped into one cluster, the candidate is not an ASM region. Otherwise, we further test whether the methylation patterns of the two clusters are significantly different from each other based on Fisher's combination score [48]. The significance of the score is assessed through a permutation test that randomly shuffles the methylation status of each read at each CpG site independently. Only significant regions are declared true ASM regions. We applied ASM-Detector to the CPMRs and identified ASM regions in each cell line.

The final analysis step of our study consists of five tasks: a) check and compare the distributions of the identified ASM regions from different cell lines; b) examine the overlaps of ASM regions with other genomic annotations; c) validate predicted regions by comparing them with regions/chromosomes of known allele-dependent methylation, such as imprinted gene regions and X chromosomes in female cell lines; d) evaluate the potential functional roles of newly identified ASM regions; e) assess the genetic effect on epigenetics by correlating ASM regions with SNPs and expression data.


Figure 4.1: A toy example illustrating the clustering algorithm implemented in ASM-Detector. Each horizontal bar represents one read. There are six reads in total, labeled a to f. Reads are ordered by chromosomal position from left to right. Black/white circles on each read are methylated/unmethylated CpG sites. One heterozygous SNP is located within the region, and the corresponding allele is shown on each read; the allele is unknown to the algorithm. In constructing the graph, one vertex is created for each read and labeled with the read label (a to f). Edges are created between vertexes that share at least one CpG site, and edge weights are calculated based on the consistency of the methylation status of the shared CpG sites. For illustration purposes, a solid edge represents a positive edge weight, and a dotted edge represents a negative or zero weight. The algorithm iteratively merges the two vertexes joined by the edge with the highest weight until there are no positive edges. The reads are separated into two groups, with {a, d, e} in one group and {b, c, f} in the other. Reads are then reordered to show the two distinct methylation patterns of the two groups.

4.2.3 The proposed ASM detection method

As outlined in Section 4.2.2, the proposed method consists of four steps. We discuss them in detail here.

4.2.3.1 Step 1: Mapping and methylation calling

All WGBS datasets were obtained from the UCSD Human Reference Epigenome Mapping Project (SRP000941). The IMR90, H1ESC, FF, H9ESC, IMR90-iPSC, ADS, ADS-adipose and ADS-iPSC cell lines were originally used by [3], [49]. All sequencing data were trimmed using Trim Galore! v0.4.0 (http://www.bioinformatics.babraham.ac.uk/projects/trim_galore/) and mapped to the human reference genome assembly hg19 with Bismark v0.14.5 [20] and Bowtie 1.1.2 [27]. PCR duplicates were removed using Picard 1.138 (http://broadinstitute.github.io/picard/). Mapped reads from the replicates of IMR90 and H1ESC were pooled before methylation calling. Methylation level and read coverage at each CpG site were called for each dataset using Bismark. Because only cytosine methylation within CpG sites was considered in this study, reads not covering any CpG site were discarded from further analysis. The called results are available in bigWig format at our website dedicated to this project (cbc.case.edu/ASM). Summary statistics of the processed data can be found in Table 4.1.

4.2.3.2 Step 2: Candidate region definition

Based on the mapped reads, we then defined candidate genomic regions that were likely to be ASM regions. Well-chosen candidate regions not only limit the search space but are also more likely to provide biologically meaningful results. We aimed to identify candidate regions that had intermediate methylation levels with relatively high CpG density and high read depth. Therefore, we only retained CpG sites with a depth of coverage of at least 4 and a methylation level in the range [0.3, 0.7]. After this step, adjacent CpG sites were naturally grouped if they shared at least one read. If a group has at least 5 CpG sites and at least 10 reads covering the region, it is defined as a consecutively connected partial methylation region (CPMR), i.e., a candidate ASM region. A candidate region defined this way naturally captures the methylation variation and characteristics of the input data, and therefore tends to have more accurate boundaries than regions from sliding-window-based approaches. In addition, it is worth mentioning that in our implementation all these values are parameters that can be easily adjusted based on user preference and/or input data.
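The candidate region definition above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the ASM-Detector implementation: the read representation (a dict mapping CpG position to 0/1 methylation status), the function name `find_cpmrs`, and the rule that a read "covers" a region if it contains at least one retained CpG site of that region are all hypothetical choices made for the example. The default thresholds mirror the values quoted in the text.

```python
from collections import defaultdict

def find_cpmrs(reads, min_depth=4, level_range=(0.3, 0.7),
               min_cpgs=5, min_reads=10):
    """Group retained CpG sites into candidate regions (CPMRs).

    reads: list of dicts {cpg_position: 0 or 1} for one chromosome
           (hypothetical input format).
    Returns a sorted list of (sorted CpG positions, covering read indices).
    """
    # Per-site read depth and methylated-read count.
    depth = defaultdict(int)
    meth = defaultdict(int)
    for read in reads:
        for pos, status in read.items():
            depth[pos] += 1
            meth[pos] += status

    # Keep sites with sufficient depth and an intermediate methylation level.
    lo, hi = level_range
    kept = {p for p in depth
            if depth[p] >= min_depth and lo <= meth[p] / depth[p] <= hi}

    # Union-find: two retained sites are connected if one read covers both.
    parent = {p: p for p in kept}

    def find(p):
        while parent[p] != p:
            parent[p] = parent[parent[p]]  # path halving
            p = parent[p]
        return p

    for read in reads:
        sites = [p for p in read if p in kept]
        for p in sites[1:]:
            parent[find(p)] = find(sites[0])

    groups = defaultdict(set)
    for p in kept:
        groups[find(p)].add(p)

    # Keep groups with enough CpG sites and enough covering reads.
    regions = []
    for sites in groups.values():
        covering = [i for i, read in enumerate(reads)
                    if not sites.isdisjoint(read)]
        if len(sites) >= min_cpgs and len(covering) >= min_reads:
            regions.append((sorted(sites), covering))
    return sorted(regions)
```

Because the region boundaries fall out of the connected components, no window size has to be chosen; only the four thresholds remain as parameters.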

4.2.3.3 Step 3: ASM detection based on a Graph model

For each CPMR, we developed a graph-based partitioning/clustering algorithm, combined with a statistical test, to report the likelihood of the region being a true allele-specific methylation region and to report the two methylation patterns if it is indeed an ASM.

Suppose there are $n$ reads $R = \{r_i \mid i = 1, \dots, n\}$ covering the region and each read covers at least one CpG site. To construct a weighted graph $G = (V, E, W)$, one vertex is created for each read (i.e., $V = R$). Two vertexes are connected if the corresponding reads cover at least one common CpG site. The cytosine in a CpG site for each read is either methylated or unmethylated. Therefore, a read $r$ can also be represented by a binary fingerprint $(c_{1,r}, c_{2,r}, \dots, c_{m,r})$, where $m$ is the number of CpGs of read $r$. The weight $w$ on an edge $e \in E$ is defined based on the consistency of the methylation states of the common CpG sites of the two overlapping reads. More specifically, we define a consistency/conflict score $s_j$ at the $j$-th CpG site covered by both $r_1$ and $r_2$:

\[ s_j = \begin{cases} 1, & \text{if } c_{j,r_1} = c_{j,r_2} \\ -1, & \text{if } c_{j,r_1} \neq c_{j,r_2} \end{cases} \tag{4.1} \]

The weight $w_{r_1,r_2}$ on the edge connecting $r_1$ and $r_2$ is defined as the sum of the consistency/conflict scores across all $t$ shared CpG sites:

\[ w_{r_1,r_2} = \sum_{j=1}^{t} s_j \tag{4.2} \]

Therefore, the weight of an edge reflects the degree of consistency/conflict in methylation status between the two reads over their shared CpG sites: $w_{r_1,r_2} > 0$ if the two reads agree at more shared CpG sites than they disagree, and $w_{r_1,r_2} \le 0$ otherwise. Once again, the underlying assumption is that in an ASM region, reads from the same haplotype should have the same methylation pattern, whereas reads from different haplotypes should have different methylation patterns. Thus, the problem of ASM prediction can be viewed as a node-partitioning problem on the defined graph. We next present an iterative clustering algorithm for the problem.
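As a small illustration of the edge weight defined above, the following Python sketch computes the weight for two reads. The dict-based read representation (CpG position mapped to 0/1 status) and the function name `edge_weight` are assumptions made for this example only.

```python
def edge_weight(read_a, read_b):
    """Sum of +1/-1 consistency scores over the CpG sites shared by two reads.

    Each read is a dict {cpg_position: 0 or 1} (hypothetical format).
    Returns None when the reads share no CpG site, i.e. there is no edge.
    """
    shared = read_a.keys() & read_b.keys()
    if not shared:
        return None
    # +1 for each shared site with equal status, -1 for each conflicting site.
    return sum(1 if read_a[p] == read_b[p] else -1 for p in shared)
```

For instance, two reads that agree at one shared site and disagree at another get weight 0, so they would not be merged by a rule that only follows positive edges.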

The idea of the proposed clustering algorithm is to group similar vertexes together iteratively. When two existing vertexes are merged into a new vertex, the new vertex inherits all the connections of the two original vertexes, and the weights are updated accordingly. For a vertex $v$, let $|v|$ be the size of $v$, i.e., the number of reads it contains. Initially, $|v| = 1$ for every vertex $v$. Let $e_{max}$ denote the edge with the maximum weight $w_{max}$, which connects two vertexes $v_l$ and $v_r$. We merge the two vertexes into one new vertex $v' = v_l \cup v_r$, which replaces $v_l$ and $v_r$ in graph $G$. If a vertex $v$ was connected to both $v_l$ and $v_r$ before merging, through two edges $e_1$ and $e_2$ with weights $w_1$ and $w_2$, respectively, a new edge $e'$ with weight $w_1 + w_2$ is created between $v$ and $v'$ in the merged graph. If a vertex was originally connected to $v_l$ or $v_r$, but not both, it is now connected to the new vertex $v'$ with the same weight. At the same time, we remove edge $e_{max}$; all other edges (not affected by the merging) remain the same. When there is a tie for $w_{max}$, we use the total size of the two vertexes that each tied edge connects as a tie-breaker and prefer to merge the pair with the smaller total size. In the case of equal sizes, we select an edge at random. We iteratively merge vertexes until no positive edges remain (i.e., $w_{max} \le 0$) or only one vertex is left. Because diploid organisms such as humans have only two homologous haplotypes/alleles, we further examine each CpG site to ensure that at most two clusters/vertexes cover it. Otherwise, we continue to merge the vertexes covering the site until only two are left. Finally, it is possible that at most two clusters cover each CpG site, but there are more than two clusters in total for the whole region. This happens when some clusters do not overlap each other. Such vertexes/clusters are presumably from the same haplotype and are merged into one. If, at the end, there are two clusters in the result, the candidate region is regarded as a true ASM region.

We further assess how significantly the two methylation patterns corresponding to the two clusters differ from each other using Fisher's combination test [48]. More specifically, at each CpG site, we apply Fisher's exact test to the numbers of methylated/unmethylated reads in each cluster. Fisher's exact test is more accurate than other tests of independence (e.g., the χ2 test) when the expected counts are small (as in the case of low coverage). A small p-value indicates that the two haplotypes have different methylation status at the CpG site. To assess the overall significance of the whole region, Fisher's method [48] is adopted to calculate the combined test statistic $X^2$ from the p-values of all $k$ CpG sites in the region:

\[ X^2_{2k} = -2 \sum_{i=1}^{k} \ln p_i \tag{4.3} \]

The test statistic $X^2$ follows a $\chi^2$ distribution with $2k$ degrees of freedom under the assumption that the $k$ tests are independent. When the p-values of the CpG sites are small, $X^2$ tends to be large, indicating that the overall methylation patterns of the two clusters are different. Because the methylation status of nearby CpG sites may in general not be independent, the test statistic $X^2$ may not follow a $\chi^2$ distribution under the null hypothesis. To assess the significance more accurately, we rely on a permutation test that shuffles the methylation status at each CpG site in the same region ε times. For each permutation, we redo the clustering and recalculate the $X^2$ statistic. The region is regarded as significant only if the original statistic is greater than the statistics calculated from all permutations. In our experiments, we choose ε = 100,000 to account for the genome-wide multiple testing problem.
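The combination statistic can be illustrated with a short, dependency-free Python sketch. It is an assumption-laden example rather than the authors' code: the two-sided Fisher's exact test is implemented directly from the hypergeometric distribution, reads are dicts mapping CpG position to 0/1 status, sites covered by only one cluster are skipped, and the names `fisher_exact_two_sided` and `fisher_combination` are hypothetical. The permutation step (shuffle, re-cluster, recompute) would wrap around this statistic and is not shown.

```python
from math import comb, log

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x):
        # Hypergeometric probability of x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    observed = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables at least as extreme as the observed one.
    total = sum(prob(x) for x in range(lo, hi + 1)
                if prob(x) <= observed * (1 + 1e-9))
    return min(1.0, total)

def fisher_combination(cluster_a, cluster_b, sites):
    """Return (X2, k): X2 = -2 * sum(ln p_i) over the k sites covered by both clusters."""
    stat, k = 0.0, 0
    for pos in sites:
        a_vals = [r[pos] for r in cluster_a if pos in r]
        b_vals = [r[pos] for r in cluster_b if pos in r]
        if not a_vals or not b_vals:
            continue  # assumption: skip sites seen in only one cluster
        p = fisher_exact_two_sided(sum(a_vals), len(a_vals) - sum(a_vals),
                                   sum(b_vals), len(b_vals) - sum(b_vals))
        stat += -2.0 * log(p)
        k += 1
    return stat, k
```

With ε permutations and the "greater than all permuted statistics" rule, the smallest attainable empirical p-value is on the order of 1/ε, which is why a large ε is needed for genome-wide testing.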

The computational complexity of the graph-based clustering algorithm for each region is analyzed here. The time for constructing the graph is proportional to the number of vertexes (i.e., reads) plus the number of edges. Because both the number of CpG sites per read and the read depth at each site can be treated as small constants (i.e., the degree of any vertex is bounded by a constant), the number of edges in the graph is effectively linear in the number of vertexes/reads in the region in the worst case. The time to find the edge with maximum weight in each iteration is linear in the number of edges. The time to merge two vertexes is also bounded by the number of edges. The total number of iterations is bounded by the total number of vertexes. Therefore, the overall complexity of the graph clustering algorithm is quadratic in the total number of vertexes/reads. In practice, the algorithm is efficient and can finish a whole-genome analysis within minutes for each dataset when no permutations are performed. For our data analysis with 100,000 permutations, the program can finish one dataset in a few dozen hours.
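The iterative merging procedure of Step 3 can be sketched in Python as below. This is a simplified illustration, not the ASM-Detector code: reads are assumed to be dicts mapping CpG position to 0/1 status, the function name `cluster_reads` is hypothetical, remaining ties are broken deterministically (first edge found) instead of at random, the graph is built by checking all read pairs for simplicity, and the later per-site "at most two clusters" check and the merging of non-overlapping clusters are omitted.

```python
import itertools

def cluster_reads(reads):
    """Greedy merging of reads; returns clusters as sets of read indices."""
    clusters = {i: {i} for i in range(len(reads))}

    # Build edge weights: +1 per consistent shared CpG site, -1 per conflict.
    weights = {}
    for a, b in itertools.combinations(range(len(reads)), 2):
        shared = reads[a].keys() & reads[b].keys()
        if shared:
            weights[frozenset((a, b))] = sum(
                1 if reads[a][p] == reads[b][p] else -1 for p in shared)

    while len(clusters) > 1 and weights:
        # Highest weight first; ties go to the pair with the smaller total size.
        edge = max(weights, key=lambda e: (
            weights[e], -sum(len(clusters[v]) for v in e)))
        if weights[edge] <= 0:
            break  # no positive edge left
        left, right = sorted(edge)
        clusters[left] |= clusters.pop(right)
        del weights[edge]
        # Re-attach edges of the absorbed vertex; weights to a common
        # neighbour are summed, as described in the text.
        for e in [e for e in weights if right in e]:
            (other,) = e - {right}
            key = frozenset((left, other))
            weights[key] = weights.get(key, 0) + weights.pop(e)
    return sorted(clusters.values(), key=min)
```

On six reads arranged like the toy example in Figure 4.1 (three reads methylated at their CpG sites and three unmethylated, overlapping in a chain), the sketch returns two clusters of three reads each.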

4.2.3.4 Step 4: Final analysis

The study of the properties of predicted ASMs involves obtaining genomic annotations and RNA-seq data, SNP calling and consistency check, and obtaining amrfinder results.

4.2.4 Genome annotation

All genome annotations used in this study were downloaded directly from the UCSC Genome Browser under assembly hg19 [50]. RefSeq and CpG island annotations were downloaded on Nov 18, 2015. Version 3 of the ENCODE DHS clusters and TFBS clusters was used in this study. The DAC Blacklist and Duke Excluded regions were merged into a single blacklist. CPMRs located in these regions were excluded from further analysis. Genome annotation files were processed using Bedtools v2.25.0 [51]. Promoter regions were defined as regions ±1kb around TSSs. Intergenic regions were generated as the complement of genic regions. Regions overlapping promoters were excluded from exon, intron and intergenic regions. Overlaps between exons and introns were included only in exon regions.

4.2.5 CTCF binding data

The CTCF binding data used in figures were loaded from the hub “Transcription Factor ChIP-seq Uniform Peaks from ENCODE/Analysis” in the UCSC Genome Browser, originally generated by the ENCODE project [52].

4.2.6 RNA-seq data

The gene expression data used in this study for four cell lines (H1ESC, ADS, ADS-adipose, ADS-iPSC) were generated using RNA-seq and were downloaded directly from http://neomorph.salk.edu/ips_methylomes.

4.2.7 SNP calling

SNP calling from bisulfite-treated sequences differs from SNP calling based on normal DNA sequences, and special programs have been developed to handle the converted cytosines. In this study, SNPs were called from WGBS data using BisSNP 0.82.2 [53], a program designed specifically for SNP and methylation calling. We performed SNP calling following the recommendations of BisSNP, which consist of several steps including indel realignment, base quality recalibration, and filtering of fake SNPs. Only SNPs with heterozygous genotypes after post-processing were used in the analysis (Table 4.2). For the IMR90 and H1ESC cell lines, SNPs were called in the individual replicates first and then combined for the analysis. Overall, the numbers of SNPs range from 1.07 million to 2.04 million, with an average of 1.48 million SNPs across all the cell lines. Among them, 870K to 1.68M SNPs are heterozygous. These numbers are in line with the numbers of SNPs discovered in other studies using normal DNA sequences [54]. To assess the calling quality, we compared the calling results from the two replicates of the IMR90 and H1ESC cell lines. For IMR90, the concordance rates between the two replicates are 71.6% and 73.6%, respectively. For H1ESC, the concordance rates between the two replicates are 54.5% and 60.7%, respectively. The discrepancy in called SNPs between replicates results from several factors, including coverage variation between replicates and possible mapping or calling errors. In our analysis, we used the union of called SNPs from both replicates. For the paired-end data (ADS, ADS-adipose and ADS-iPSC, all derived from the same ADS cell line), pairwise concordance rates among called SNPs between any two cell lines are in the range of 76.0% to 82.5%.


Table 4.2 Heterozygous SNPs in ASM regions.

Cell line    #SNPs in ASMs  Expected #SNPs  #ASMs with SNPs  % of ASMs with SNPs  # consistent ASMs  % consistent ASMs
IMR90        201            177.48          181              6.40%                136                75.13%
H1ESC        77             47.75           59               4.82%                44                 74.57%
FF           179            104.58          154              6.29%                128                83.11%
H9ESC        91             82.37           84               3.56%                79                 94.04%
IMR90-iPSC   117            82.79           110              5.72%                101                91.81%
ADS          687            500.41          572              8.47%                476                83.21%
ADS-adipose  878            630.75          737              9.28%                614                83.31%
ADS-iPSC     949            684.33          786              9.98%                661                84.09%
Sum          3,179          -               2,683            -                    2,239              -
Union        2,843          -               2,332            -                    1,958              -

4.2.8 Checking consistency between heterozygous alleles and ASM partitions

When a heterozygous SNP is located in an ASM region, we can directly check how the reads carrying each allele are grouped into the two partitions. Assume that the two alleles are $A_0$ and $A_1$, and that the reads covering this SNP in the region have been partitioned into two subsets $P^0$ and $P^1$ according to their methylation patterns. Reads with allele $A_0$ can be in set $P^0$ or in set $P^1$, and similarly for reads with allele $A_1$. We define the SNP as consistent with the partition if the majority (≥50%) of reads in $P^0$ support one allele while the majority of reads in $P^1$ support the other allele. More formally, let $n_i$ represent the number of reads in partition $P^i$, and let $n_{i,j}$ represent the number of reads in partition $P^i$ that carry allele $A_j$. If $n_{i,i} / n_i \ge 0.5$ holds for both $i = 0$ and $i = 1$, or $n_{i,1-i} / n_i \ge 0.5$ holds for both $i = 0$ and $i = 1$, we say the SNP is consistent with the partition. Because the reads are bisulfite converted, we assume that all Cs in non-CpG context have been converted to Us and then read as Ts. For a cytosine in a CpG site, we assume that both C and T can be observed in reads and that both are consistent with C.
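The consistency rule stated in prose above (a majority of reads in one partition support one allele while a majority in the other partition support the other allele) can be sketched as a small Python function. The 2x2 count layout, the function name `snp_consistent`, and the treatment of an empty partition as inconsistent are assumptions of this example.

```python
def snp_consistent(counts, threshold=0.5):
    """Check whether a heterozygous SNP agrees with a two-way read partition.

    counts[i][j]: number of reads in partition i carrying allele j,
                  for i, j in {0, 1} (hypothetical layout).
    """
    totals = [sum(row) for row in counts]
    if min(totals) == 0:
        return False  # assumption: an empty partition cannot be checked
    # Partition 0 mostly carries allele 0 and partition 1 mostly allele 1 ...
    same = all(counts[i][i] / totals[i] >= threshold for i in (0, 1))
    # ... or the other way around.
    swapped = all(counts[i][1 - i] / totals[i] >= threshold for i in (0, 1))
    return same or swapped
```

A SNP whose two partitions both mostly carry the same allele is reported as inconsistent, which is the case the check is meant to flag.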

4.2.9 amrfinder result

Allele-specific methylation regions called by amrfinder in the same cell lines used in this study were downloaded from the MethBase [55] UCSC Genome Browser hub under assembly hg19.


4.3 Result

4.3.1 ASM is ubiquitous across the genome and is cell line specific

After applying our approach to the eight whole-genome bisulfite sequencing datasets, we obtained 21,123 ASM regions in total as a union across all the samples (overlapping regions from different cell lines were treated as one in the union), which cover 3.68 million base pairs (0.12% of the genome) and 223 thousand CpG sites (0.80% of the 28M CpG sites [56]) (Table 4.3). Our results show for the first time that ASM regions are ubiquitous and spread throughout the genome on all chromosomes (Figure 4.2). The X chromosome shows the highest number of ASMs relative to its length, accounting for 17% of the total ASMs. They are almost exclusively identified in female cell lines. Figure 4.2 also shows the distribution of imprinted genes, DHS, TFBS and CpG islands. Their relationships with ASMs will be discussed later. The average length of ASM regions is about 160 base pairs (median 120), and the average number of CpG sites that each ASM region covers is about 9.6 (median 7). To assess the false discovery rate of the proposed framework, we examined the “ASM regions” detected on the X and Y chromosomes in the two male cell lines (H1ESC and FF). Because a male cell has only a single copy of the X chromosome and a single copy of the Y chromosome (other than the small pseudoautosomal regions), ASM regions detected on the male X and on all Y chromosomes are treated as false positives. In total, there were 4 ASM regions on the Y chromosome and 13 ASM regions on the X chromosome for H1ESC, and 3 ASM regions on the Y chromosome and 17 ASM regions on the X chromosome for FF (Table 4.4, Table 4.5, Table 4.6). None of them was from the pseudoautosomal regions. These X/Y ASM regions accounted for 1.39% and 0.81% of all ASM regions detected in H1ESC and FF, respectively. These results therefore suggest that our method has a very low false discovery rate.
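As a quick check of the H1ESC figure, using the total of 1,224 ASM regions reported for H1ESC in this section, the fraction of X/Y regions works out as:

\[ \frac{4 + 13}{1224} = \frac{17}{1224} \approx 1.39\% \]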

The numbers of ASM regions detected in different cell lines vary from about twelve hundred to about eight thousand (Table 4.3). The average lengths of ASM regions in different cell lines range from ~92 to ~180 (median: 81 to 135). The average numbers of CpG sites covered by an ASM region in different cell lines range from ~7 to ~11 (median: 6 to 8). The differences in the numbers and distributions of ASM regions among cell lines could result from several factors, including differences between the cell lines themselves and differences in sequencing technology (i.e., single-end vs. paired-end). For example, the three ADS cell lines, which were sequenced using paired-end technology, have significantly more ASM regions (6757-7942 regions) than the other five samples (1224-2826), which were sequenced using single-end reads (Table 4.3). Similarly, the total lengths of ASM regions in the three ADS cell lines, as well as the lengths of individual ASMs, are significantly longer than those in the other cell lines (Table 4.3, Figure 4.3), mainly because read pairs cover longer regions than single reads. Although read depth may affect the power of ASM detection in general, as long as the genome is reasonably covered, the number of ASM regions detected is more a characteristic of the type of cell line. For example, H1ESC and IMR90 have similar genomic coverage and similar numbers of mapped reads, but very different numbers of ASM regions (1224 vs 2826). The primary reason is that the female cell line IMR90 has 1067 ASM regions on the X chromosome, which are most likely associated with X chromosome inactivation, whereas the male cell line H1ESC has only 13 predicted ASM regions on the X chromosome, which are regarded as false positives. Apart from the X chromosome, H1ESC also has fewer ASM regions on autosomes (1207) than IMR90 (1759). The observation that the number of autosomal ASM regions in ESC cells is significantly smaller than that in somatic cells also holds in other cell lines, regardless of sequencing technology or read coverage depth. For example, the two ESC cell lines H1ESC and H9ESC, as well as an iPSC cell line, IMR90-iPSC, all have around twelve hundred autosomal ASM regions. On the other hand, IMR90 and FF have 1759 and 2428 autosomal ASM regions, respectively. Similarly, for the paired-end data, the adult stem cell line ADS and the iPSC line derived from it (ADS-iPSC) have similar numbers of autosomal ASM regions (5658 and 5345, respectively), while the differentiated ADS cell line (ADS-adipose) has many more autosomal ASM regions (6602).

The cell lines differ not only in the numbers of their ASM regions, but also in the genomic distributions of those regions. ASM regions detected in one cell line and the methylation level distributions of the same genomic locations in all the cell lines are shown in Figure 4.4 (autosomes) and Figure 4.5 (X chromosome). Clearly, the ASM profiles are cell line specific. As expected, ASM regions detected in each individual cell line all have intermediate methylation levels (along the diagonal), while the methylation levels of the same genomic regions in other cell lines are widely spread. For example, for ASM regions detected in the IMR90 cell line, their methylation levels in other cells (first row in Figure 4.4) spread widely across the spectrum. Interestingly, their distributions in the ESC/iPSC cells (H1ESC, H9ESC, IMR90-iPSC and ADS-iPSC) are similar, and their distributions in the other two ADS cell lines are also similar. The distribution in the FF cell line is somewhat similar to those in the two ADS cell lines. This observation is not limited to the first row. In fact, the cell lines can be roughly partitioned into two groups based on their methylation level distributions across all autosomal ASMs: ESC/iPSC cell lines and differentiated cell lines. The only exception is the ADS-iPSC cell line, which is similar to the ESC/iPSC lines for some of the ASMs and similar to the other cell lines for others (Figure 4.4). For the X chromosome, the two male cell lines are clearly more similar to each other and differ from the female cell lines (Figure 4.5). Beyond that, one can also see a clear distinction between ESC/iPSC cells and differentiated cells. Compared with autosomal ASMs, methylation levels of ASMs on the X chromosome are in general more conserved across female cell lines. Figure 4.4 and Figure 4.5 clearly demonstrate that many ASMs are cell line specific and that ASMs in similar types of cell lines exhibit somewhat similar distributions.

Table 4.3 Detected ASM regions in different cell lines and their statistics.

Cell line    #ASM regions  Total length  Avg length/ASM  Median length/ASM  Total #CpG  Avg #CpG/ASM  Median #CpG/ASM
IMR90        2,826         269,691       95.4            86                 19,811      7.0           6
H1ESC        1,224         112,794       92.2            81                 9,078       7.4           7
FF           2,448         245,013       100.1           89                 18,774      7.7           6
H9ESC        2,362         238,768       101.1           87                 21,418      9.1           8
IMR90-iPSC   1,923         191,907       99.8            86                 17,063      8.9           8
ADS          6,757         973,328       144.0           118                53,610      7.9           7
ADS-adipose  7,942         1,158,825     145.9           116                63,371      8.0           7
ADS-iPSC     7,873         1,424,782     181.0           135                91,453      11.6          8
Union        23,123        3,684,822     159.4           120                223,024     9.6           7

Table 4.4 Distribution of ASM regions on autosomes.

Cell line    #ASM    Length     Avg length  #CpG     Avg #CpG
IMR90        1,759   169,490    96.36       11,719   6.66
H1ESC        1,207   111,431    92.32       8,962    7.43
FF           2,428   243,262    100.19      18,627   7.67
H9ESC        1,282   131,754    102.77      11,828   9.23
IMR90-iPSC   1,256   119,246    94.94       10,993   8.75
ADS          5,658   806,315    142.51      45,446   8.03
ADS-adipose  6,602   956,364    144.86      53,301   8.07
ADS-iPSC     5,345   856,775    160.29      54,916   10.27
Union        19,270  2,812,185  145.94      171,694  8.91

Table 4.5 Distribution of ASM regions on X chromosome. The ASMs on X chromosome of the two male cell lines (H1ESC and FF) are treated as false positives. The last two columns give the X chromosome share of each cell line's total number of ASM regions and total ASM length.

Cell line    #ASM    Length     Avg length  #CpG     Avg #CpG  % of #ASM  % of length
IMR90        1,067   100,201    93.91       8,092    7.58      37.76%     37.15%
H1ESC        13      1,040      80.00       85       6.54      1.06%      0.92%
FF           17      1,524      89.65       128      7.53      0.69%      0.62%
H9ESC        1,080   107,014    99.09       9,590    8.88      45.72%     44.82%
IMR90-iPSC   667     72,661     108.94      6,070    9.10      34.69%     37.86%
ADS          1,099   167,013    151.97      8,164    7.43      16.26%     17.16%
ADS-adipose  1,339   202,368    151.13      10,064   7.52      16.86%     17.46%
ADS-iPSC     2,524   567,508    224.84      36,509   14.46     32.06%     39.83%
Union        3,842   871,586    226.86      51,251   13.34     16.62%     23.65%

Table 4.6 Distribution of ASM regions on Y chromosome. The ASMs on Y chromosome are treated as false positives. The last two columns give the Y chromosome share of each cell line's total number of ASM regions and total ASM length.

Cell line    #ASM    Length     Avg length  #CpG     Avg #CpG  % of #ASM  % of length
IMR90        0       0          0.00        0        0.00      0.00%      0.00%
H1ESC        4       323        80.75       31       7.75      0.33%      0.29%
FF           3       227        75.67       19       6.33      0.12%      0.09%
H9ESC        0       0          0.00        0        0.00      0.00%      0.00%
IMR90-iPSC   0       0          0.00        0        0.00      0.00%      0.00%
ADS          0       0          0.00        0        0.00      0.00%      0.00%
ADS-adipose  1       93         93.00       6        6.00      0.01%      0.01%
ADS-iPSC     4       499        124.75      28       7.00      0.05%      0.04%
Union        11      1,051      95.55       79       7.18      0.05%      0.03%
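The per-chromosome counts in Tables 4.4 to 4.6 can be cross-checked against the totals in Table 4.3. The following is a minimal Python sketch of such a check; the dictionary and function names are chosen here for illustration only, and the numbers are the #ASM columns copied from the four tables.

```python
# Number of ASM regions per cell line, copied from Tables 4.3-4.6.
# Tuple order: (total, autosomes, X chromosome, Y chromosome).
ASM_COUNTS = {
    "IMR90":       (2826, 1759, 1067, 0),
    "H1ESC":       (1224, 1207, 13, 4),
    "FF":          (2448, 2428, 17, 3),
    "H9ESC":       (2362, 1282, 1080, 0),
    "IMR90-iPSC":  (1923, 1256, 667, 0),
    "ADS":         (6757, 5658, 1099, 0),
    "ADS-adipose": (7942, 6602, 1339, 1),
    "ADS-iPSC":    (7873, 5345, 2524, 4),
    "Union":       (23123, 19270, 3842, 11),
}


def is_consistent(counts):
    """True if the total equals autosomal + X + Y counts."""
    total, autosomal, chr_x, chr_y = counts
    return total == autosomal + chr_x + chr_y


for cell_line, counts in ASM_COUNTS.items():
    print(f"{cell_line:12s} consistent={is_consistent(counts)}")
```

Every row, including the union set, decomposes exactly into its autosomal, X and Y components.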

Figure 4.2: Genome-wide distribution of ASMs (blue) and some other genomic features including imprinted genes (names), DNase I hypersensitive sites (black), transcription factor binding sites (red) and CpG islands (green).

Figure 4.3 The length distribution of ASMs in different cell lines. The three ADS cell lines sequenced using paired-end technology have longer ASM regions and the rest have similar length distributions.

Figure 4.4 Autosomal ASMs identified in each cell line and their corresponding methylation level distributions of the same genomic regions in other cell lines.

Figure 4.5 ASMs identified from X chromosome in each cell line and their corresponding methylation level distributions of the same genomic regions in other cell lines.

4.3.2 Enrichment in female X chromosomes

DNA methylation plays an important role in the initiation and, especially, the maintenance of X chromosome inactivation. Studies [39], [57] have shown that in somatic cells, the inactive X chromosome (Xi) has low methylation in gene-poor regions and high methylation in gene-rich regions, compared with the active X chromosome (Xa). Therefore, one would expect to observe a large number of ASM regions on X chromosomes in female cell lines. Indeed, among the six female cell lines, four have roughly one third or more of their detected ASM regions located on the X chromosome, accounting for more than 37% of the total length of their detected ASM regions (Table 4.5). The other two cell lines (ADS and ADS-adipose) have around 16% of their detected ASM regions located on the X chromosome (17% in terms of length). Detected ASM regions are thus significantly enriched on the X chromosome, given that the length of the X chromosome is only about 5% of the total length of the genome. In total, 3842 ASM regions were detected on the X chromosome from these six cell lines, covering 871 kb in length and 51 thousand CpG sites. As with the overall number of ASM regions, the absolute numbers of ASM regions on the X chromosome in different cell lines may also depend on several factors, such as sequencing technology, cell line type and tissue of origin. However, due to the limited sample size and cell line diversity, it is hard to separate the effects of the different factors.
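The degree of enrichment can be made concrete by dividing the fraction of ASM regions found on the X chromosome by the X chromosome's share of the genome. The Python sketch below does this for the six female cell lines using the counts in Tables 4.3 and 4.5; the function name is hypothetical, and the 5% genome share is the approximate figure quoted above rather than an exact value.

```python
def x_enrichment(x_regions, total_regions, x_genome_share=0.05):
    """Return (fraction of ASM regions on X, fold enrichment over the genome share).

    x_genome_share defaults to the approximate 5% quoted in the text.
    """
    fraction = x_regions / total_regions
    return fraction, fraction / x_genome_share


# (ASM regions on X, total ASM regions), from Tables 4.5 and 4.3.
FEMALE_LINES = {
    "IMR90":       (1067, 2826),
    "H9ESC":       (1080, 2362),
    "IMR90-iPSC":  (667, 1923),
    "ADS":         (1099, 6757),
    "ADS-adipose": (1339, 7942),
    "ADS-iPSC":    (2524, 7873),
}

for name, (x_count, total) in FEMALE_LINES.items():
    frac, fold = x_enrichment(x_count, total)
    print(f"{name:12s} {100 * frac:5.2f}% on X, {fold:4.1f}-fold over genome share")
```

Under this rough measure, even the two cell lines with the lowest X fractions (ADS and ADS-adipose) show about a threefold excess over the genome share.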

4.3.2.1 Overlaps with RefSeq Genes.

To study the potential functions of ASM regions, we first examined the relationship between the locations of ASM regions and RefSeq annotated genes, as well as some other genomic/epigenomic markers. The results on the X chromosome are reported here; results in imprinted gene regions and other regions on autosomes are discussed in the following sections. We first took a simple approach and separated the genome into four non-overlapping regions: the promoter region (defined as ±1 kb from any transcription start site (TSS)), the exon region (all exons excluding overlaps with the promoter region), the intron region (all introns excluding overlaps with the promoter and exon regions), and the intergenic region (all other regions). A distinct feature of the distribution (Figure 4.6) is that ASM regions are significantly enriched in the promoter region. Although its total length accounts for less than 2% of the X chromosome, the promoter region harbors more than 45% of all ASM regions in the union set from the six female cell lines. The exon region also shows enrichment, although not as strongly as the promoter region. On the other hand, the proportion of ASM regions in the intergenic region (~26%) is significantly low relative to the length of the region (~62% of the genome). The intron region also shows a depletion. The distribution cannot be explained by the CpG site distribution; rather, it strongly suggests that ASM mainly functions as an element in the complex gene regulatory machinery. Another noticeable feature is that different cell lines have different distribution profiles, which reflects the characteristics of the cell lines. For example, ESC (H9ESC) and iPSC (IMR90-iPSC, ADS-iPSC) lines have the highest proportions of ASM regions in the promoter region, while somatic cells (ADS, ADS-adipose) have relatively lower proportions, with the exception that the fetal lung fibroblast cell line (IMR90) also has a high proportion in the promoter region.
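The four-way partition described above amounts to a precedence rule: promoter first, then exon, then intron, then intergenic. A minimal Python sketch of that rule follows. It classifies a single genomic position; how a whole ASM region spanning several categories was assigned is not specified here, so the function, its arguments and the toy coordinates are all illustrative assumptions.

```python
def contains(intervals, pos):
    """True if pos falls inside any half-open interval (start, end)."""
    return any(start <= pos < end for start, end in intervals)


def classify_position(pos, tss_sites, exons, gene_bodies, flank=1000):
    """Assign a position to promoter, exon, intron or intergenic.

    Precedence follows the text: promoter (+/- flank around a TSS) wins over
    exon, and exon wins over intron (gene body minus promoter and exons).
    """
    promoters = [(tss - flank, tss + flank) for tss in tss_sites]
    if contains(promoters, pos):
        return "promoter"
    if contains(exons, pos):
        return "exon"
    if contains(gene_bodies, pos):
        return "intron"
    return "intergenic"


# Toy gene model with hypothetical coordinates on one chromosome.
TSS = [5000]
EXONS = [(5000, 5500), (7000, 7500)]
GENES = [(5000, 9000)]

for pos in (4500, 5200, 7200, 6500, 20000):
    print(pos, classify_position(pos, TSS, EXONS, GENES))
```

Position 5200 lies in the first exon but is reported as promoter, which illustrates why the exon and intron categories exclude overlaps with higher-precedence regions.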

Figure 4.6 Genomic distributions of ASM regions and their overlaps with different gene annotations including promoter (red), exon (blue), intron (green), and intergenic (purple) regions based on RefSeq gene annotations. Distributions in other contexts (ASMs on autosomes, on X chromosome, and overlapping with imprinted gene regions) are also shown.

4.3.2.2 Overlaps with ENCODE regulatory elements.

Given the significant enrichment of ASM regions in promoters, to further assess their potential regulatory roles in fine-tuning gene expression, we next examined their overlaps with some important regulatory markers, including DHS clusters and TFBS clusters generated by the ENCODE project [52] as well as CGI regions obtained from the UCSC genome browser. The length distributions of DHS, TFBS and CGI regions across the genome, autosomes, imprinted genes and the X chromosome, as well as the fractions of ASM regions located in these annotated regions for all four categories, can be found in Figure 4.7, Figure 4.8 and Figure 4.9. The annotated DHS clusters (Figure 4.7), TFBS clusters (Figure 4.8) and CGIs (Figure 4.9) account for only 15.34%, 12.29% and 0.71% of the human genome, respectively. Their proportions on the X chromosome are even lower. However, our detected ASM regions are predominantly located in these annotated regions across all cell lines, suggesting that most ASM regions are functional and that their functions may be closely related to these regulatory elements. For example, the significant concordance between ASM regions and TFBS supports the hypothesis that the onset of DNA methylation on one haplotype/chromosome prevents transcription factors from binding to it, so that it behaves differently from the other haplotype/chromosome; through this mechanism, ASM plays an important role in allele-specific regulation and expression. Furthermore, the data in Figure 4.10 show that DHS clusters that overlap ASMs are usually associated with higher scores, indicating that ASM regions are not only enriched in DHS regions, but also preferentially located in DHS regions with strong signals. The coincidence of the two features from two different data sources provides another level of support that the detected ASMs are real and most likely function as parts of the overall regulatory machinery.
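The length-based fractions reported in Figures 4.7 to 4.9 correspond to a simple interval computation: the number of ASM bases covered by an annotation divided by the total number of ASM bases. The Python sketch below illustrates the idea on a single chromosome. It assumes half-open intervals and non-overlapping ASM regions; the function names and toy coordinates are hypothetical and do not reproduce the actual pipeline.

```python
def merge(intervals):
    """Merge overlapping or touching half-open intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def overlap_fraction(asm_regions, annotation):
    """Fraction of total ASM length that lies inside the annotation.

    ASM regions are assumed not to overlap one another; the annotation
    intervals are merged first so that no base is counted twice.
    """
    merged = merge(annotation)
    total = sum(end - start for start, end in asm_regions)
    covered = 0
    for start, end in asm_regions:
        for a_start, a_end in merged:
            covered += max(0, min(end, a_end) - max(start, a_start))
    return covered / total if total else 0.0


# Toy example: two ASM regions and two overlapping annotation clusters.
asm = [(100, 200), (300, 400)]
dhs = [(150, 250), (240, 320)]
print(overlap_fraction(asm, dhs))
```

In the toy example, 70 of the 200 ASM bases fall inside the merged annotation, giving a fraction of 0.35. For genome-scale data a sorted sweep or an interval tree would replace the nested loop.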

Figure 4.7 Fractions of ASM regions located in DHS regions in each cell line, based on length. The bar labeled with “Genome” in each panel shows the fraction of each feature on the genome.

Figure 4.8 Fractions of ASM regions located in TFBS regions in each cell line, based on length.

Figure 4.9 Fractions of ASM regions located in CGI regions in each cell line, based on length.

Figure 4.10 Violin plots showing the score distributions of DHS in ASM and non-ASM regions for each of the cell lines.

Results also show that the high concordance between ASM and TFBS/DHS is not limited to gene promoter regions; high concordance has been observed in gene body regions as well. As an example, Figure 4.11 shows that detected ASM regions in FIRRE, a gene on the X chromosome, have significant overlaps with TFBS and DHS clusters. A recent study by Yang et al. [58] using mouse cells has shown that Firre anchors the inactive mouse X chromosome near the nucleolus through the binding of the transcription factor CTCF to its intragenic region. The authors suspect that for the active copy of the X chromosome in females (as well as the X chromosome in males), intragenic regions of Firre are highly methylated during embryonic cell differentiation, which precludes CTCF from binding. Examination of the human data in our study shows that ASM regions in FIRRE were found in all female differentiated cell lines. In contrast, in all ESC/iPSC cell lines (other than ADS-iPSC) and in male cell lines, the FIRRE gene is highly methylated. Although ASMs have been detected in ADS-iPSC, the overall methylation level across the FIRRE gene is very high compared with ADS and ADS-adipose. Furthermore, the CTCF binding profiles from the ENCODE project for two of the cell lines (H1ESC and IMR90) show CTCF binding signals consistent with ASM regions in IMR90 and a lack of CTCF binding in H1ESC, which is highly methylated (Figure 4.11). The result is mostly consistent with the findings of the mouse study that, at some point during development, the onset of DNA methylation on the active copy of the X chromosome precludes CTCF from binding, while CTCF can still bind the un-methylated inactive X chromosome, which helps to position the Xi near the nucleolus. However, we do observe an important difference between the human and mouse data in ESCs. In mouse ESCs, CTCF binding is initially present in both males and females, whereas in human (H1ESC), we observe high methylation levels and a lack of CTCF binding. The difference may be due to differences in cell lines (e.g., development time of embryonic cells), or to differences between mouse and human. More data (methylation as well as CTCF binding) from different cell lines are needed to elucidate the difference.

[Figure 4.11 image: genome browser view of hg19 chrX:130,850,000-130,950,000 (scale bar 50 kb). Annotation tracks: FIRRE, CpG Islands, DNase Clusters, Txn Factor ChIP, IMR90 CTCF, and three H1-hESC CTCF tracks (b, h, t). Cell line tracks: IMR90, H1ESC, FF, H9ESC, IMR90-iPSC, ADS, ADS-adipose, ADS-iPSC.]

Figure 4.11: ASMs identified in each cell line around FIRRE gene, together with the methylation levels, and signals of DHS, TFBS, and CGI.

4.3.2.3 Relationship with gene expression levels.

Because ASM regions are significantly enriched in promoter regions and have significant overlaps with transcription regulatory elements, we sought to further examine the relationship between ASMs in promoter regions and the corresponding transcription levels. Previous studies on intermediate methylation regions around transcription start sites have found that they were associated with intermediate transcriptional activities [59]. However, other evidence suggested that genes associated with intermediate methylation levels were transcriptionally repressed in immortalized cell lines such as IMR90, ADS and ADS-adipose [3], [41]. To investigate whether the ASM regions identified around promoter regions in our study correlate directly with gene expression, we obtained independent RNA-Seq data (available for only four cell lines: ADS, ADS-adipose, ADS-iPSC and H1ESC) and examined their relationships. We partitioned RefSeq genes into four classes: genes whose promoters overlap ASM regions, and genes whose promoters do not overlap ASM regions but have low, intermediate or high methylation levels. For each gene, all its isoforms were considered. Promoter regions were defined in the same way as before (i.e., ±1 kb from the TSS). The average methylation level of all CpG sites in each non-ASM-overlapping promoter was calculated. Following common practice, a promoter region was regarded as having a low, high or intermediate methylation level if the average was less than 30%, greater than 70%, or in between, respectively. Our results show that genes on the X chromosome whose promoter regions overlap ASM regions have a wide range of expression levels, but mostly intermediate expression levels, in the three female cell lines (ADS, ADS-adipose, ADS-iPSC) (Figure 4.12), and the distribution is similar to that of genes with lowly methylated promoters. This may not seem intuitive, but it can be explained by X inactivation. For genes with lowly methylated promoters, because of XCI, only the copy of each gene that is on Xa is expressed. For genes with ASMs in their promoters, it is expected that one and only one copy of each gene is expressed. It is quite possible that different mechanisms are involved in the regulation of these two types of genes, but their expression levels are similar, as observed here (Figure 4.12). This result suggests that, as one of possibly many regulatory mechanisms on the X chromosome, ASM correlates well with gene expression levels. Genes with intermediate and high methylation levels in their promoters are almost completely repressed, with the only exception being the ADS-iPSC cell line, which has a small number (23) of genes with intermediate methylation that are moderately expressed. Similar results have been reported by other researchers [38]. Our result illustrates that ASM and intermediately methylated regions are different, and that it is important to identify ASM regions accurately in order to better understand their regulatory roles.
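The four promoter classes used in this comparison follow directly from the rule stated above: promoters overlapping an ASM region form their own class, and the remaining promoters are split by average CpG methylation at the 30% and 70% thresholds. A minimal Python sketch of that rule is given below; the function name and its inputs (a list of per-CpG methylation levels between 0 and 1, plus a flag for ASM overlap) are assumptions made for illustration.

```python
def promoter_class(cpg_levels, overlaps_asm, low=0.30, high=0.70):
    """Classify a promoter as 'ASM', 'low', 'intermediate' or 'high'.

    cpg_levels: methylation levels (0.0 to 1.0) of the CpG sites in the promoter.
    overlaps_asm: True if the promoter overlaps a detected ASM region.
    Returns None for a non-ASM promoter without any covered CpG site.
    """
    if overlaps_asm:
        return "ASM"
    if not cpg_levels:
        return None
    mean_level = sum(cpg_levels) / len(cpg_levels)
    if mean_level < low:
        return "low"
    if mean_level > high:
        return "high"
    return "intermediate"


print(promoter_class([0.05, 0.10, 0.20], overlaps_asm=False))
print(promoter_class([0.45, 0.55], overlaps_asm=False))
print(promoter_class([0.90, 0.85], overlaps_asm=False))
print(promoter_class([0.50, 0.48], overlaps_asm=True))
```

The ASM flag is checked first because an ASM promoter would otherwise be labeled intermediate by its average level alone, which is precisely the distinction this section draws.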

Figure 4.12: Boxplots of gene transcript abundance on X chromosome of three female cell lines. Genes are separated into four categories based on the methylation status of their promoter regions: with ASM, and with Low, Medium and High methylation levels. The number of genes in each category is shown. P values of the Wilcoxon test between the distribution of ASM and that of each of the other categories are shown on top of the boxplot of each category.

4.3.3 ASM significantly overlaps imprinted gene regions

4.3.3.1 Majority of imprinted genes have ASMs.

Genomic imprinting refers to the fact that certain genes are expressed or silenced based on their parent of origin through epigenetic regulation such as DNA methylation. It is of great interest to assess the relationship between allele-specific methylation patterns and imprinted genes, at least as a way of validating the proposed approach, because different methylation patterns of the two alleles are expected to be maintained for many imprinted genes during development and cell division. We therefore examined the overlaps between the detected ASM regions and 88 confirmed imprinted genes obtained from the gene imprinting website (www.geneimprint.org). To define overlaps, the coordinates of imprinted genes were extended by 1 kb upstream of the transcription start sites to include the promoter regions. In total, we identified 55 imprinted genes that overlap ASM regions in one or more cell lines. For each individual cell line, the number of imprinted genes overlapping ASM regions ranged from 22 to 46. The overlapping details for all the imprinted genes in all cell lines can be found in Table 4.7. Overall, more overlaps were identified in the cell lines with paired-end reads. We also observed that ESC and iPSC cells tended to have fewer overlaps in general than the other cell lines. To assess whether the overlaps arose purely by chance, we shuffled the locations of ASM regions along the genome using Bedtools [51] while keeping their lengths, and observed that a much smaller number of known imprinted genes overlapped the shuffled regions in all the cell lines (Figure 4.13). Overall, the enrichment factors across different cell lines range from 3 to 9, which indicates that the overlaps cannot be explained by chance. As an indirect validation, the detection of a large number of imprinted genes overlapping ASM regions demonstrates the capability of the proposed approach.
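The shuffling test can be illustrated with a small simulation. The Python sketch below is a simplified stand-in for the Bedtools-based procedure, not a reproduction of it: it works on a single hypothetical chromosome, places each region uniformly at random while preserving its length, and reports the ratio of the observed number of overlapped genes to the mean number under shuffling. All function names, the number of shuffles, the seed and the toy coordinates are assumptions.

```python
import random


def overlaps(a, b):
    """True if half-open intervals a=(start, end) and b=(start, end) share a base."""
    return a[0] < b[1] and b[0] < a[1]


def genes_hit(regions, genes):
    """Count genes overlapped by at least one region."""
    return sum(any(overlaps(gene, region) for region in regions) for gene in genes)


def shuffle_regions(regions, chrom_length, rng):
    """Move each region to a uniformly random start, keeping its length."""
    shuffled = []
    for start, end in regions:
        length = end - start
        new_start = rng.randrange(0, chrom_length - length + 1)
        shuffled.append((new_start, new_start + length))
    return shuffled


def enrichment_factor(regions, genes, chrom_length, n_shuffles=200, seed=0):
    """Observed gene overlaps divided by the mean over shuffled region sets."""
    rng = random.Random(seed)
    observed = genes_hit(regions, genes)
    null = [genes_hit(shuffle_regions(regions, chrom_length, rng), genes)
            for _ in range(n_shuffles)]
    mean_null = sum(null) / n_shuffles
    return float("inf") if mean_null == 0 else observed / mean_null


# Toy data: three gene intervals and three regions placed inside them.
genes = [(10_000, 12_000), (50_000, 51_000), (90_000, 95_000)]
asm = [(10_500, 10_600), (50_200, 50_350), (93_000, 93_100)]
print(genes_hit(asm, genes), enrichment_factor(asm, genes, chrom_length=1_000_000))
```

In the actual analysis, the gene intervals would also include the 1 kb upstream extension described above, and shuffling would be carried out genome-wide.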

Figure 4.13 The number of imprinted genes covered by predicted ASM regions in each cell line by the proposed method ASM-Detector and an existing algorithm amrfinder. Blue represents regions detected by our method alone, red represents regions detected by amrfinder alone, and green represents shuffled ASM regions.

Table 4.7 The overlaps of known imprinted genes with detected ASMs in all 8 cell lines. The overlapped ones are colored yellow for easy illustration.

name       IMR90  H1ESC  FF     H9ESC  IMR90-iPSC  ADS    ADS-adipose  ADS-iPSC  Union
count      33     26     36     22     28          46     46           36        55
TP73       false  false  false  false  false       true   true         false     true
DIRAS3     true   true   true   false  true        true   true         true      true
INPP5F     true   true   true   false  true        true   true         true      true
H19        true   true   true   true   true        true   true         true      true
IGF2       false  false  true   false  false       true   true         true      true
IGF2-AS    false  false  true   false  false       true   true         true      true
INS        false  false  false  false  false       false  false        false     false
KCNQ1      true   true   true   true   true        true   true         true      true
KCNQ1OT1   true   true   true   true   true        true   true         true      true
KCNQ1DN    false  true   true   true   false       false  false        false     true
CDKN1C     false  false  false  false  false       false  false        false     false
SLC22A18   false  false  false  false  false       false  false        false     false
PHLDA2     false  false  false  false  false       false  false        false     false
OSBPL5     false  false  false  false  false       false  false        false     false
WT1        false  false  false  false  false       false  false        true      true
ANO1       false  false  false  false  false       false  false        false     false
ZC3H12C    false  false  false  false  false       false  false        false     false
NTM        false  false  true   false  false       true   true         true      true
RBP5       false  false  false  false  false       false  false        false     false
RB1        false  false  false  false  false       true   true         false     true
DLK1       false  false  false  false  false       false  false        false     false
MEG3       true   false  true   false  false       true   true         true      true
RTL1       false  false  false  false  false       false  false        false     false
MEG8       false  false  false  false  false       false  false        false     false
MKRN3      false  false  true   false  false       false  false        false     true
MAGEL2     false  true   true   false  true        true   true         true      true
NDN        true   false  true   true   true        true   true         true      true
NPAP1      false  false  false  false  false       false  false        false     false
SNRPN      true   true   true   true   true        true   true         true      true
SNURF      true   true   true   true   true        true   true         true      true
SNORD107   false  false  false  false  false       false  false        false     false
SNORD64    false  false  false  false  false       false  false        false     false
SNORD108   false  false  false  false  false       false  false        false     false
SNORD109A  false  false  false  false  false       false  false        false     false
SNORD109B  false  false  false  false  false       false  false        false     false
UBE3A      false  false  false  false  false       false  false        false     false
ATP10A     true   false  false  false  false       true   false        false     true
IRAIN      false  false  false  false  false       false  false        false     false
ZNF597     true   true   true   true   true        true   true         true      true
NAA60      true   true   true   true   true        true   true         true      true
TCEB3C     false  false  false  false  false       false  false        false     false
DNMT1      false  false  false  false  false       false  false        false     false
MIR371A    false  false  false  false  true        false  false        false     true
NLRP2      false  false  false  false  false       false  true         true      true
ZIM2       true   false  true   false  false       true   true         false     true
PEG3       true   false  true   false  false       true   true         false     true
MIMT1      true   false  true   false  false       true   true         false     true
LRRTM1     false  false  true   true   false       false  false        false     true
GPR1-AS    true   true   true   false  true        true   true         true      true
ZDBF2      false  false  false  false  false       true   true         false     true
BLCAP      true   false  false  false  true        true   true         true      true
NNAT       true   false  false  false  true        true   true         true      true
L3MBTL1    true   true   true   true   true        true   true         true      true
SGK2       false  false  false  false  false       false  false        false     false
GDAP1L1    true   false  false  false  false       true   true         false     true
MIR296     false  false  false  false  false       false  false        false     false
MIR298     false  false  false  false  false       false  false        false     false
GNAS-AS1   true   true   true   true   true        true   true         true      true
GNAS       true   true   true   true   true        true   true         true      true
DGCR6      false  false  false  false  false       true   false        false     true
DGCR6L     false  false  false  false  false       false  false        false     false
NAP1L5     true   true   true   true   true        true   true         true      true
RNU5D-1    false  false  false  false  false       true   true         false     true
VTRNA2-1   false  true   true   false  false       true   true         true      true
FAM50B     true   false  true   false  false       true   true         true      true
LIN28B     false  false  false  false  false       false  false        false     false
AIM1       false  false  false  false  false       true   false        false     true
PLAGL1     true   true   true   true   true        true   true         true      true
HYMAI      true   true   true   true   true        true   true         true      true
SLC22A2    false  false  false  false  false       false  true         false     true
SLC22A3    false  false  false  false  false       false  false        false     false
DDC        false  false  false  false  false       false  false        false     false
GRB10      true   true   true   true   true        true   true         true      true
MAGI2      false  false  false  false  true        true   false        true      true
TFPI2      false  false  false  false  false       false  false        false     false
SGCE       true   true   true   true   true        true   true         true      true
PEG10      true   true   true   true   true        true   true         true      true
PPP1R9A    false  false  false  false  false       false  false        false     false
DLX5       false  false  true   false  false       true   true         false     true
CPA4       false  false  false  false  false       false  false        false     false
MEST       true   true   true   true   true        true   true         true      true
MESTIT1    true   true   true   true   true        true   true         true      true
KLF14      false  false  false  false  false       false  false        false     false
DLGAP2     true   true   false  true   true        true   true         true      true
ZFAT       false  false  false  false  false       true   true         true      true
ZFAT-AS1   false  false  false  false  false       false  false        false     false
KCNK9      false  false  false  false  false       false  true         false     true
GLIS3      true   true   false  false  false       false  true         false     true

4.3.3.2 Imprinted genes overlap strong ASMs.

One interesting observation from our results is that known imprinted genes usually overlap ASM regions with strong signals (i.e., long ASM regions with large numbers of CpG sites, Figure 4.14), which are most likely true signals. This shows that there is indeed a strong correlation between ASM and gene imprinting, as suggested by previous studies, and that our approach can identify these regions with high confidence. Furthermore, for genes that overlap strong ASM signals but are not labeled as imprinted genes, it will be of great interest to study their functional roles. One possibility is that some of them are in fact imprinted genes that have not yet been identified. To test this hypothesis, we manually checked a few genes that overlap long and strong ASM regions but are not in the known imprinted gene list (Figure 4.14). We found that both TRAPPC9 and ZNF331 had recently been reported to be imprinted genes [37]. Another study [60] reported that the gene PAX8 showed heterogeneous patterns of monoallelic expression in different individuals and classified it as a possible variably imprinted gene. Yet another gene, IGF2R, has been reported to be parentally imprinted, but only in a minority of individuals [61]; its murine homologue (Igf2r) is parentally imprinted.

Figure 4.14 The distribution of ASM regions in terms of their length and the number of CpG sites within them. The top ASM regions are labeled by gene names that overlap them, red for known imprinted genes and blue for possible imprinted genes. The figure shows the union set of ASM regions from all cell lines.

4.3.3.3 Variability of ASM in imprinted regions in different cell lines.

Extensive studies have revealed that gene imprinting is maintained through epigenetic marks, mainly allele-specific methylation and histone modifications. It is thus expected that ASM around imprinted gene regions is maintained through development and is stable. However, studies using human embryonic stem cell expression data have shown that although many imprinted genes maintain mono-allelic expression, loss of allele-specific expression has been observed in some imprinted genes [62]. Here we directly examine the stability of ASM in imprinted regions of different cell lines, and the results indeed show variability of ASM in imprinted regions. For example, for some imprinted gene regions (SNRPN, KCNQ1OT1, GNAS, PEG10, and MEST), ASMs were found in all cell lines (two representatives can be found in Figure 4.15). However, for some other imprinted gene regions such as MEG3 and PEG3, strong ASM signals were found in all somatic cell lines, but not in any ESC/iPSC cell lines (Figure 4.16). Previous studies using allele-specific gene expression data also reported conservation of mono-allelic expression patterns in the SNRPN and KCNQ1OT1 genes in different cell lines, and variability in expression for the MEG3 gene [62]–[65]. In addition, some other imprinted regions have potentially cell line specific ASM patterns. For example, ASM was found in DIRAS3 for all cell lines except H9ESC, while ASM was found in BLCAP for all but the FF, H1ESC, and H9ESC cell lines (Figure 4.17). Interestingly, the gene regions in those cell lines that did not show ASM were all highly methylated. There are several possible explanations for the observed ASM variation in imprinted regions. First, the stability of imprinted genes differs between cell lines, which may in part depend on the developmental stages of the cell lines. Second, different cell lines may be cultured under different conditions for various time periods, which may affect ASM patterns. Third, data quality differs between cell lines, which may result in false negatives in ASM detection.

Figure 4.15 All cell lines have ASMs around TSS of KCNQ1OT1 and SNRPN. All examples are shown in UCSC Genome Browser. CGI, DHS and TFBS tracks are provided by UCSC Genome Browser. For each cell line in our study, two tracks of data are displayed: blue horizontal bars represent ASM regions and red slim vertical lines are methylation level of CpG sites.

Figure 4.16 ASMs only occur in somatic cell lines around TSS of gene PEG3 and MEG3.

Figure 4.17: Cell line specific ASMs are found around TSS of DIRAS3 and BLCAP.

4.3.3.4 Overlaps with promoter regions and correlation with gene expression.

Another noticeable observation is that a significant portion of the overlapped ASM regions (~50%-70% in terms of length) are located in the promoter regions (~3% in length) of the imprinted genes (Figure 4.6). For most cell lines, the portion of ASMs in promoters of imprinted genes is even higher than that on the X chromosome. Furthermore, ASM signals in imprinted regions are in high concordance with DHS signals and TFBS signals for all cell lines (Figure 4.7, Figure 4.8, Figure 4.9). For both DHS and TFBS, more than 70% of ASM regions overlap them. This is a strong indication that genomic imprinting is mainly achieved by preferential methylation of one of the two chromosomes in the regulatory regions, which in turn regulates transcription initiation. The conclusion is probably true in general: ASMs work similarly in non-imprinted regions. We therefore assessed the distribution of distances between ASM regions and their nearest TSS across the genome. Results in Figure 4.18 show that most ASM regions are in close proximity to a TSS (i.e., within ±1 kb), another strong indication of their potential roles in regulating transcription initiation. The imprinted genes in the four cell lines with RNA-seq data have intermediate expression levels (Figure 4.19), comparable to the expression levels of genes with ASM in their promoter regions. The result is also consistent with the hypothesis that ASM regulates imprinted genes, which show intermediate expression levels through allele-specific expression.
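The distance from each ASM region to its nearest TSS can be computed efficiently with a binary search over sorted TSS coordinates. The Python sketch below shows one way to do so on a single chromosome. Measuring the distance from the region midpoint is an assumption made here for simplicity, and the function name and toy coordinates are hypothetical.

```python
import bisect


def distance_to_nearest_tss(region, sorted_tss):
    """Distance from the midpoint of region=(start, end) to the closest TSS.

    sorted_tss must be a non-empty list of TSS coordinates in ascending order.
    """
    start, end = region
    midpoint = (start + end) // 2
    i = bisect.bisect_left(sorted_tss, midpoint)
    # Only the TSS just before and just after the midpoint can be the nearest.
    candidates = sorted_tss[max(0, i - 1):i + 1]
    return min(abs(midpoint - tss) for tss in candidates)


tss_sites = [1000, 5000, 9000]
for region in [(4800, 4900), (100, 200), (9500, 9700)]:
    print(region, distance_to_nearest_tss(region, tss_sites))
```

Applying the same function to shuffled regions would yield a null distribution of distances, analogous to the grey lines in Figure 4.18.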

Figure 4.18 Distance distributions of ASM regions to their nearest TSS for all the cell lines. ASMs are significantly closer to TSS than expected under the null distributions (grey lines) generated by shuffling ASM regions randomly in each sample.

4.3.4 ASM patterns in autosomes

4.3.4.1 ASM distributions.

We next examined ASM regions detected on autosomes that did not overlap known imprinted genes. Figure 4.2 clearly shows that ASM is not limited to the X chromosome or imprinted gene regions; instead, ASM regions are distributed throughout the autosomes. In terms of their relation to gene annotations, they are clearly enriched in promoter/exon regions in all the cell lines (Figure 4.6). However, a significant portion of ASM regions (> 50% in many cell lines) was also found in intron/intergenic regions. Compared with the X chromosome or imprinted genes, the proportions of ASM regions located in the promoter regions of autosomes are smaller, indicating that some of them may function differently. In addition, the distributions of ASM regions in different cell lines can be very different. Notably, all ESC and iPSC cells are very similar to one another and have high fractions of ASM regions located in promoter/exon regions. In contrast, they are very different from somatic cell lines, which tend to have smaller fractions of ASM regions in promoter/exon regions. The trend is similar to that on the X chromosome, further confirming the differences in ASM patterns between ESC/iPSC cells and somatic cells. In addition, as noted earlier, many ASM regions are cell line specific. Methylation levels in the ASM regions of any particular cell line are all intermediate (Figure 4.4, diagonal), which is expected because in ASM regions, one haplotype is mostly methylated while the other is mostly un-methylated. However, methylation levels of the same regions in other cell lines have wide and distinct distributions, implying that the same regions may not be ASM regions in other cell lines.

4.3.4.2 Overlaps with regulatory elements and relations with expression levels.

Although the fractions of overlaps between ASM regions and promoters on autosomes in different cell lines are not as high as those on the X chromosome or in imprinted regions, they are all highly enriched (Figure 4.6). Furthermore, autosomal ASM regions also have very high concordance rates with DHS, TFBS and CGI (Figure 4.7, Figure 4.8, Figure 4.9). The non-random distribution is a strong indication that many identified ASM regions on autosomes are likely to be functional and that their functions may be closely related to gene regulation, just like the ones on the X chromosome and the ones near imprinted genes, although they may involve types of regulation other than XCI or parent-of-origin effects.

Furthermore, gene expression data show that genes on autosomes with promoters overlapping ASM regions have intermediate expression levels across all four cell lines (Figure 4.19). The observation is consistent with the belief that allele-specific methylation mediates mono-allelic expression. Also, very similar to the X chromosome, genes with intermediate methylation levels in their promoters (that are not ASM regions) are repressed in somatic cell lines but have intermediate expression levels in ESC/iPSC lines (Figure 4.19). On one hand, results on somatic cell lines indicate the importance of separating ASM regions from other intermediate methylation regions. Although they may share similar methylation levels, they function differently. This also explains inconsistent results in some earlier studies in which researchers did not separate them. For example, Elliott et al. found that intermediate methylation regions around transcription start sites were associated with intermediate transcriptional activity, whereas other studies have shown that in immortalized cell lines such as IMR90, ADS, and ADS-adipose, genes with intermediate methylation levels in promoters were transcriptionally repressed [3], [41]. Rigorous algorithms such as the one developed here, which can reliably separate ASMs from other intermediate methylation regions (which may be caused by measurement noise at individual CpG sites or by artifacts of the region/promoter definition), are essential to study the functions of these partially methylated regions. On the other hand, further studies are needed to explain why genes with intermediate methylation levels in their promoters also have intermediate expression levels in ESC/iPSC lines. The difference may be due to differences between the cell lines, and intermediate methylation may function differently in different types of cell lines. We also notice that the total numbers of genes in this category for both of these cell lines are much smaller than those in the somatic cell lines.

Figure 4.19 Boxplots of gene transcript abundance on autosomes of the four cell lines. The legend is the same as in Figure 4.12. Imprinted genes are grouped into a separate category.

4.3.5 Heterozygous SNPs located in identified ASM regions strongly support read partitions

The proposed approach predicts ASM regions without assuming any prior knowledge about heterozygous SNPs in each sample, and it does not rely on such information. Doing so allows detection not only of ASM regions that are linked to genetic variants in a sample, but also of ASM regions that are linked to parent-of-origin (i.e., the two alleles might be the same). Nevertheless, it is interesting to see how many of the detected ASM regions actually overlap heterozygous SNPs. More importantly, for those heterozygous SNPs that are located in ASM regions, we can check how well the two alleles align with the two methylation patterns predicted by the algorithm. The latter can be viewed as an in silico validation of the correctness of read partitions that our algorithm derives solely from methylation patterns. We therefore called SNPs for all the samples from the same bisulfite-converted sequencing data using Bis-SNP 0.82.2 [53], a software tool developed specifically for SNP calling from bisulfite-treated DNA sequencing data. In total, the numbers of heterozygous SNPs in different cell lines vary from 870K to 1.68M (Table 4.8). More details about SNP calling and quality control criteria can be found in the Methods section. Given the total length of the detected ASM regions, the absolute number of heterozygous SNPs that are located in ASM regions for each cell line is in the hundreds. The total number of unique SNPs from all cell lines is 2,843, overlapping 2,332 unique ASM regions (some long ASM regions may overlap two or more SNPs).

The numbers of heterozygous SNPs located in ASM regions were slightly greater than expected given the total lengths of ASM regions for all cell lines (1.1- to 1.7-fold enrichment). The fraction of ASM regions containing heterozygous SNPs ranges from 3.6% to 10.0% across cell lines (Table 4.2). We say that a heterozygous SNP is consistent with the two methylation patterns at an overlapping ASM region if a majority of reads (≥ 50%) carrying one allele have their methylation status in full agreement with one pattern and a majority of reads carrying the other allele have their methylation status in full agreement with the other pattern. Results show that a large portion of these heterozygous SNPs (74.6%-94.0%) are consistent with the two methylation patterns in these ASM regions (Table 4.2), which strongly supports that our approach correctly assigns reads to their corresponding haplotypes based purely on methylation patterns. Two patterns arising from two different haplotypes is a strong indication that the ASM regions are true signals, which implies that our approach has a very low false positive rate. We present here some of the consistent SNP-ASM pairs that have also been observed in previous experimental studies. For example, in exon 1 of the PTCHD3 gene, we predicted ASM regions in five of the eight cell lines (Figure 4.20 A). Three of them (ADS-adipose, ADS-iPSC and IMR90) have ASM regions directly overlapping a SNP (rs11015753). The two ADS cell lines have the heterozygous genotype GA, which allows us to check the SNP-ASM partition. For both cell lines, the two alleles and the two methylation patterns are highly consistent (Figure 4.20 B). This ASM region has also been identified and experimentally validated using bisulfite PCR sequencing in different samples by a previous study [59]. The result further supports the notion that consistent SNP-ASM pairs are more likely to be true signals. Note that this evidence alone still cannot differentiate variant-associated ASM from parent-of-origin-associated ASM. Inferring causal relationships between SNPs and ASMs across different samples is much harder because of the dynamic nature of DNA methylation. It has been observed that methylation levels and patterns in different types of cell lines (e.g., ESC vs. adult) with the same genetic background can be different. Nevertheless, we checked the genotypes of all cell lines at this SNP and examined their correlations with ASMs. We observed that: 1) all cell lines with a heterozygous genotype (ADS-adipose, ADS-iPSC, ADS, FF) have ASMs, regardless of whether these ASM regions overlap the SNP; 2) all ESC/iPSC cells (H1ESC, IMR90-iPSC, H9ESC) are highly methylated regardless of their genotypes (H1ESC and IMR90-iPSC have genotype GG, and H9ESC has genotype AA); 3) not all results from all cell lines are consistent, e.g., IMR90 has the homozygous genotype GG but has predicted ASMs. There are many possible explanations. Regardless, when we focus only on the relationship between the two alleles and the read partitions provided by our method based on methylation levels, the high consistency indicates that our method correctly grouped reads from the same chromosome/haplotype together and that the methylation patterns of the two chromosomes were completely different. Another validated ASM region (in a known imprinted gene, TRAPPC9 [66]) that is also predicted by our method can be found in Figure 4.21, which also shows that the partition of the reads based on methylation is consistent with the two variants at a SNP position (rs4455807). These examples illustrate that ASMs called without using SNP information can be validated in silico based on SNP genotypes at a later stage. The consistency between our computational validation method and experimental validation approaches using traditional bisulfite PCR sequencing supports confidence in our validation method. The computational validation method can be applied to all ASM regions overlapping heterozygous SNPs without incurring any additional experimental costs.
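The consistency criterion defined above can be sketched in Python as follows. This is a simplified illustration rather than the implementation used in this study: it assumes each read overlapping the SNP has already been reduced to a pair (allele, partition), where the partition label (0 or 1) stands in for "full agreement with one methylation pattern", and the function name is hypothetical.

```python
from collections import Counter


def snp_consistent(reads, min_fraction=0.5):
    """Check whether a heterozygous SNP agrees with a two-way read partition.

    reads: iterable of (allele, partition) pairs, partition being 0 or 1.
    Returns True if at least `min_fraction` of the reads of one allele fall
    in one partition and at least `min_fraction` of the reads of the other
    allele fall in the other partition.
    """
    counts = Counter(reads)
    alleles = sorted({allele for allele, _ in counts})
    if len(alleles) != 2:
        # Not informative: fewer or more than two alleles observed.
        return False
    a, b = alleles

    def frac(allele, part):
        total = counts[(allele, 0)] + counts[(allele, 1)]
        return counts[(allele, part)] / total

    # Try both assignments of alleles to partitions.
    return any(
        frac(a, p) >= min_fraction and frac(b, 1 - p) >= min_fraction
        for p in (0, 1)
    )
```

For instance, five G reads in partition 0 together with four A reads in partition 1 and one A read in partition 0 would be counted as consistent, whereas a SNP whose two alleles both fall entirely in partition 0 would not.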


Figure 4.20 Examples of ASM regions with read partitions consistent with alleles of heterozygous genotypes. Panel A is a UCSC Genome Browser view of the region around the TSS of the PTCHD3 gene. SNP rs11015753 was covered by ASM regions in ADS-adipose, ADS-iPSC, and IMR90, but heterozygous genotypes were called only in ADS-adipose and ADS-iPSC. Panels B and C show partitions of reads containing rs11015753 in ADS-adipose and ADS-iPSC, respectively. Each line represents one read or a pair of reads, aligned by position. Black/white circles are methylated/unmethylated CpG sites on the reads. A gap between a pair of reads is represented by spaces. The horizontal blue bar separates the two partitions. Characters on the right side of the vertical yellow bar are the called nucleotides. The genotype at the SNP is AG. The A allele can be called as A from the plus strand (represented as A+), or as T from the minus strand (represented in the figure as its reverse-complement counterpart A-). For the G allele, the plus strand reports G+. The minus strand C is not within a CpG site; therefore, it is not methylated and is first converted to U and then to T (represented as A-). In this case, one cannot distinguish whether A- is from the A allele or the G allele. Here, as long as there are no conflicts, we assume they are consistent, i.e., A- from the methylated pattern is from allele A and A- from the non-methylated pattern is from allele G.
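The strand ambiguity described in this caption follows from the general rule that bisulfite conversion turns unmethylated C into T on the original strand. The Python sketch below, which is illustrative only and uses a hypothetical function name, lists which forward-strand alleles are compatible with an observed base; it assumes the observed base is already expressed on the forward strand, as in the A+/A- notation of the figure.

```python
def compatible_alleles(observed, strand, alleles):
    """Return the forward-strand alleles that could produce `observed`.

    observed: read base expressed on the forward strand (A, C, G or T).
    strand:   '+' or '-', the original strand the read derives from.
    alleles:  iterable of candidate forward-strand alleles at the SNP.
    """
    # Seen from the forward strand, bisulfite conversion appears as
    # C->T for plus-strand reads and G->A for minus-strand reads.
    converted_from, converted_to = ("C", "T") if strand == "+" else ("G", "A")
    result = set()
    for allele in alleles:
        if allele == observed:
            result.add(allele)
        elif allele == converted_from and observed == converted_to:
            result.add(allele)
    return result
```

For the A/G SNP in this figure, a plus-strand A or G is unambiguous, while a minus-strand A (A-) is compatible with both alleles, which is why such reads are only required not to conflict with the partition.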


Figure 4.21 Similar to Figure 4.20, Panel A is a UCSC Genome Browser view of another example, in the intragenic region of the TRAPPC9 gene. SNP rs4455807 is covered by ASM regions in the ADS and ADS-adipose cell lines. Panels C and D show partitions of reads containing rs4455807 in the ADS-adipose and ADS cell lines, respectively.


Table 4.8 SNP calling result

Cell line     #SNP        #SNP after       %SNP after       #Heterozygous   %Heterozygous
                          postprocessing   postprocessing   SNP             SNP
IMR90_r1      4,285,329   2,017,503        47.08%           1,620,179       80.31%
IMR90_r2      4,360,779   1,978,565        45.37%           1,577,155       79.71%
H1ESC_r1      3,151,286   1,085,275        34.44%           870,123         80.18%
H1ESC_r2      3,708,373   1,333,467        35.96%           968,188         72.61%
ADS           3,905,665   2,277,174        58.30%           1,591,545       69.89%
ADS-adipose   3,736,341   2,182,156        58.40%           1,684,969       68.14%
ADS-iPSC      4,031,969   2,468,233        61.22%           1,486,864       68.27%
FF            3,657,051   1,866,319        51.03%           1,321,336       70.80%
H9ESC         3,435,637   1,332,363        38.78%           1,067,891       80.15%
IMR90-iPSC    3,776,680   1,579,779        41.83%           1,335,565       84.54%

4.3.6 Comparison with amrfinder

We further compared our detected ASM regions with those detected by amrfinder, a program that implements a probabilistic model to predict ASMs without using SNP information [46]. We obtained amrfinder results on the same dataset analyzed in our study directly from MethBase [55], which uses amrfinder for its analysis. In terms of the number of detected ASM regions and their total length, we reported many more ASM regions (23,123 vs. 7,970), but the total length of the ASM regions in each cell line is much shorter than that predicted by amrfinder. There are two possible reasons for the difference. First, amrfinder uses a sliding window approach to choose candidate regions, with a default window size of 10 CpG sites. Second, it merges nearby ASM regions if they are within 1 kb of each other. In contrast, our method does not rely on a sliding window and only requires that an ASM region have at least 5 CpG sites. In addition, to retain clear boundaries, we did not merge nearby ASM regions.

Therefore, at least three possible cases contribute to the differences: a) our results contain many short ASM regions that are missed by amrfinder; b) amrfinder results may contain sub-regions that do not overlap CpG sites because of merging; and c) one ASM region from amrfinder may correspond to multiple ASM regions from our method. For a fair comparison, we applied the same merging criterion to our results and compared the merged results with amrfinder’s results again. Our method still reported more ASM regions with shorter lengths (Table 4.9, Table 4.10). In terms of consistency, across different cell lines, 50% to 68% of ASM regions detected by amrfinder overlapped our results, and about 30% to 87% of ASM regions in our results overlapped amrfinder’s results (Figure 4.22 A). In terms of overlapped length, 8% to 36% of ASM regions by amrfinder completely overlapped 29% to 87% of ASM regions in our results (Figure 4.22 B), which shows that a large portion of our regions were found in amrfinder results, but not vice versa, for the reasons discussed earlier. Because ASM regions detected by both methods are more likely to be true positives, we can examine these regions and use their characteristics to assess ASM signals that were detected by only one of the two approaches. One feature we can use is the percentage of ASM regions that overlap DHS. About 72% to 86% of ASM regions detected by both methods overlap DHS. About 48% to 75% of ASM regions detected by our approach alone overlap DHS, while only 32% to 54% of ASM regions detected by amrfinder alone overlap DHS (Figure 4.23). The result shows two things. First, the regions detected by both methods are indeed of very high quality. Second, ASM signals detected by our method alone are of better quality than regions detected by amrfinder alone. Furthermore, to compare false positive discoveries by the two methods, we compared the numbers of ASM regions detected on the X chromosome in the two male cell lines by both approaches, which can be used as an indication of false positives, as discussed earlier. The proportions of ASMs on the X chromosome out of the total number of ASMs detected by our method are 0.69% and 1.06% for FF and H1ESC, respectively.

The proportions from amrfinder results are 2.06% and 1.98%, respectively. Occasionally, reads from female cell lines can be mapped to the Y chromosome because of mapping errors or short homologous regions shared between the X and Y chromosomes, so ASM regions may be called on chromosome Y by either program. Results show that amrfinder also has higher error rates than our approach on the Y chromosome (Table 4.11). In addition, our method detected more imprinted genes than amrfinder in 6 out of 8 cell lines (Figure 4.13).

Figure 4.22 Comparison of our results and amrfinder results. Percentage of overlapped and non-overlapped ASM regions detected by each method, based on the number of regions (A) and length (B). Yellow represents regions detected by both approaches, red represents regions detected by our method alone, blue represents regions detected by amrfinder alone.
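The two kinds of percentages shown in Figure 4.22 (by number of regions and by length) can be computed along the following lines. This is a minimal Python sketch under stated assumptions, not the code used for the figure: regions are half-open (start, end) intervals on one chromosome, any shared base counts as an overlap, the regions in `others` do not overlap each other, and the function names are hypothetical.

```python
def overlaps_any(region, others):
    """True if the half-open region shares at least one base with any other region."""
    start, end = region
    return any(s < end and start < e for s, e in others)


def overlap_fraction_by_count(regions, others):
    """Fraction of `regions` that overlap at least one region in `others`."""
    return sum(overlaps_any(r, others) for r in regions) / len(regions)


def overlap_fraction_by_length(regions, others):
    """Fraction of the total length of `regions` covered by `others`.

    Assumes the regions in `others` are mutually non-overlapping.
    """
    total = sum(end - start for start, end in regions)
    covered = sum(
        max(0, min(end, e) - max(start, s))
        for start, end in regions
        for s, e in others
    )
    return covered / total
```

Note that the two measures are not symmetric: a single long region from one method can overlap several short regions from the other, so the fraction computed from each side can differ substantially.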


Figure 4.23 ASM regions detected by both approaches and by individual approaches alone and their overlaps with DHS.


Table 4.9 Summary of ASM regions detected by ASM-Detector after merging regions within 1 kb

Cell line     Number   Length      Median length   #CpG      Median   %genome   % of number of   % of ASM region
                                   of ASM                    #CpG               ASM regions      length before
                                                                                before merge     merge
IMR90         2,130    454,470     97              19,811    6        0.0147%   75.37%           168.52%
H1ESC         925      186,181     90              9,078     7        0.0060%   75.57%           165.06%
FF            1,986    333,363     95              18,774    7        0.0108%   81.13%           136.06%
H9ESC         1,137    497,937     173             21,418    14       0.0161%   48.14%           208.54%
IMR90-iPSC    1,059    385,857     125             17,063    11       0.0125%   55.07%           201.06%
ADS           5,341    1,233,080   142             53,610    7        0.0398%   79.04%           126.69%
ADS-adipose   6,154    1,482,420   144             63,371    7        0.0479%   77.49%           127.92%
ADS-iPSC      4,616    1,856,743   190             91,453    10       0.0600%   58.63%           130.32%
Union         16,150   4,880,099   150             285,593   7        0.16%     69.84%           132.44%
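The merging step behind Table 4.9 can be illustrated with a short Python sketch. It is an assumption-laden example rather than the post-processing code used here: regions are (start, end) intervals on one chromosome, two regions are merged when the gap between them is at most `max_gap` bp (whether the 1 kb threshold is inclusive is an assumption), and the function name is hypothetical.

```python
def merge_nearby(regions, max_gap=1000):
    """Merge regions whose gap to the previous merged region is <= max_gap bp."""
    merged = []
    for start, end in sorted(regions):
        if merged and start - merged[-1][1] <= max_gap:
            # Extend the previous region; the gap becomes part of it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(r) for r in merged]
```

Because a merged region also spans the gap between its parts, the total length after merging can exceed the total length before merging, which is consistent with the length percentages above 100% in the last column of Table 4.9.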

Table 4.10 Summary of ASM regions detected by amrfinder

Amrfinder     Number   Length      Median length   #CpG      Median   %genome
                                   of ASM                    #CpG
IMR90         999      1,196,914   944             58,664    44       0.04%
H1ESC         606      731,289     827.5           34,414    47       0.02%
FF            1,362    916,468     401             47,257    20       0.03%
H9ESC         1,218    1,153,783   651.5           75,230    43       0.04%
IMR90-iPSC    1,115    928,201     529             62,036    36       0.03%
ADS           2,909    2,090,345   463             118,450   23       0.07%
ADS-adipose   4,336    3,207,012   474             167,170   21       0.10%
ADS-iPSC      1,621    1,160,216   468             71,547    28       0.04%
Union         7,970    6,409,177   525             324,900   22       0.21%

The proposed method offers several advantages over amrfinder. First, our method can detect short ASM regions that are filtered out by amrfinder because of its sliding window approach. For example, in the promoter region of ERICH3, our method detected two short ASM regions (5 and 6 CpGs) in the H9ESC cell line, while amrfinder did not report any, although manual examination shows that both regions have intermediate methylation levels with high coverage at the CpG sites (Figure 4.24 A). Second, our method tends to predict boundaries more accurately than amrfinder. Because it does not use a fixed window size, it can naturally follow changes in methylation level more closely. For example, for the same region (Figure 4.24 A), both amrfinder and our algorithm detected ASM signals in the IMR90-iPSC cell line. However, the ASM region detected by amrfinder included some CpG sites with lower coverage and/or a methylation level of 0. Another example can be found in the promoter region of the XIST gene (Figure 4.24 B). XIST is known to be essential for the initiation of XCI during embryogenesis. Expression of XIST is time dependent and changes throughout embryonic development [32]. Whether expression of XIST from Xi is needed to maintain XCI in somatic cells remains largely inconclusive, with evidence on both sides [67]. Both amrfinder and our method detected ASM regions around XIST for the three female somatic cell lines: ADS, ADS-adipose and IMR90. In addition, our method also detected ASM regions in ADS-iPSC cells. Comparing the results of our approach with those of amrfinder, we make the following observations.

First, the ASM regions detected by amrfinder include some CpG sites with extreme (high or low) methylation levels on their left sides. The boundaries detected by our method are more consistent with the raw data (Figure 4.24 B). Second, amrfinder reported one region for each of the three female cell lines, whereas our method reported several small regions. However, these regions can be merged into one if the post-processing step is applied with the same parameter (1 kb). Third, ASMs are observed in female somatic cell lines, while high methylation levels are observed in male cell lines and in female ESC/iPSC cell lines except for ADS-iPSC, for which the two programs give inconsistent results. Further examination of the raw methylation data and coverage (Figure 4.24 B) shows that the region in ADS-iPSC should be called as ASM. This result is consistent with our previous observation that ADS-iPSC is more similar to somatic cell lines than to ESC/iPSC cell lines. Fourth, RNA-seq data for XIST show that all three ADS cell lines had RPKM values above 10. In contrast, the male sample H1ESC had an RPKM value below 1. The co-occurrence of ASM in the promoter region and intermediate transcription of XIST supports the notion that XIST is not only important in the initiation of XCI but may also play an important role in the maintenance of XCI in somatic cells. Through ASM in XIST’s promoter region, the low-methylated copy allows transcription of the corresponding XIST on the inactive X chromosome, which helps to maintain its inactive state. The co-occurrence of intermediate transcription of XIST and ASM in the XIST promoter in the ADS-iPSC cell line suggests that the ASM region identified by our method is likely to be a true signal.


Figure 4.24 UCSC Genome Browser view of the region around the TSS of ERICH3 (A) and around the TSS of XIST (B). Green rectangles highlight the regions where the two methods make different predictions. ASM regions detected by amrfinder are shown as purple bars at the top. For H9ESC and IMR90-iPSC, methylation level (yellow lines) and coverage (black lines) tracks are included to show that the differences between the results were not caused by differences in mapping.


Table 4.11 Comparison of ASM regions detected by ASM-Detector and amrfinder on the X chromosome of male cell lines and on the Y chromosome of all cell lines.

X chromosome
              ASM-detector                       amrfinder
Cell line     #ASM   Length   %ASM    %Length    #ASM   Length   %ASM    %Length
H1ESC         13     1040     1.06%   0.92%      12     11339    1.98%   1.55%
FF            17     1524     0.69%   0.62%      28     17041    2.06%   1.86%

Y chromosome
              ASM-detector                       amrfinder
Cell line     #ASM   Length   %ASM    %Length    #ASM   Length   %ASM    %Length
IMR90         0      0        0.00%   0.00%      3      2713     0.30%   0.23%
H1ESC         4      323      0.33%   0.29%      6      10003    0.99%   1.37%
FF            3      227      0.12%   0.09%      16     12801    1.17%   1.40%
H9ESC         0      0        0.00%   0.00%      5      7371     0.41%   0.64%
IMR90-iPSC    0      0        0.00%   0.00%      2      6234     0.18%   0.67%
ADS           0      0        0.00%   0.00%      4      1625     0.14%   0.08%
ADS_adipose   1      93       0.01%   0.01%      4      11071    0.09%   0.35%
ADS-iPSC      4      499      0.05%   0.04%      2      5573     0.12%   0.48%

4.4 DISCUSSION

We have presented an effective method to accurately detect ASM from whole-genome bisulfite sequencing data, and we have generated genome-wide ASM profiles of eight human cell lines by applying the method. Our method defines the boundaries of ASM regions from the data themselves, without using a fixed sliding window; therefore, it can detect ASM regions more accurately, including short ASM signals. Accurate detection is crucial to the characterization and functional analysis of ASM regions. Results on the X and Y chromosomes in male cell lines suggest that our method has a very low false positive rate. High consistency between read partitions in ASM regions and read partitions by independently called heterozygous SNPs further validates the correctness of our clustering algorithm. The facts that our predicted ASMs are highly enriched on the X chromosome of female cell lines and that the majority of known imprinted genes are covered by our predicted ASMs provide further support for the proposed method.

Accurate genomic profiling of ASMs provides a great opportunity to characterize ASMs and to explore their potential functional roles. Our analysis shows that ASMs are significantly enriched in promoter regions and have high concordance with DHS and TFBS. Furthermore, ASMs in promoter regions are generally associated with intermediate transcriptional activity. All the evidence supports the hypothesis that ASMs in gene promoters function through regulation of downstream gene expression. Such regulation manifests itself as long-term stable gene silencing in genomic imprinting, maintenance of X chromosome inactivation, and allele-specific gene expression on autosomes. There are also some differences among ASMs on the X chromosome, ASMs in imprinted gene regions, and ASMs in other regions of autosomes. For example, imprinted gene regions normally have very strong signals with long ASM regions, which may be associated with the long-term effect of gene silencing. On the other hand, ASMs on autosomes may be time-dependent or environment-dependent, resulting in cell line specific ASM. In addition to ASMs in promoter regions, there are many ASMs in gene body regions or in gene desert regions. Analysis of their functions needs more investigation. Because of the limited number and diversity of the cell lines, cell line specific ASM cannot be thoroughly studied here. However, our results clearly illustrate the differences between ASM profiles in somatic cell lines and those in ESC/iPSC cell lines, giving strong support to the view that ASM changes during cell differentiation.

Many previous studies do not distinguish ASM regions from partially methylated regions (PMDs), the latter of which are defined based on average methylation levels, similar to the candidate region definition in our study (i.e., CPMR). Our analysis shows that ASM is a very special type of PMD and that it is critical to separate it from other types of PMD in order to better understand ASM’s functions. For example, ASMs are highly enriched in promoter regions compared with the distribution of the rest of the CPMRs (Figure 4.25). When the two are mixed together, it is harder to understand their roles in gene regulation.


PMDs that are not ASMs may also be functional. Abundant partially methylated genomic regions have been observed in immortalized cell lines, cancer cell lines, placenta and pancreas, but not as often in embryonic stem cell lines or induced pluripotent cell lines [3], [8], [41], [55]. In our study, we also observed that the numbers of CPMRs in somatic cell lines were about 4 times those in ESC/iPSC cell lines, and the lengths were 5 times longer, while the numbers and lengths of ASM regions were on the same scale across all cell lines. On one hand, this demonstrates the effectiveness of our method in separating ASMs from other types of CPMRs. On the other hand, CPMRs other than ASMs may function differently, and they are more of a transient phenomenon in adult somatic cells.


Figure 4.25 Genomic distributions of CPMRs and their overlaps with different gene annotations. The legend is the same as in Figure 4.6.

Similar to the study of differential methylation across samples, the study of ASMs is important for understanding their roles in genomic imprinting, X inactivation, epigenetic changes during development, and gene regulation. The comprehensive survey of ASMs from eight human cell lines presented here has revealed rich information about their distributions and their functions. With the increasing availability of methylation data, coupled with other genomic data (e.g., SNPs, expression, transcriptional regulatory markers) measured from the same set of samples, more knowledge about ASM will certainly be gained in the near future.


Chapter 5

A general bisulfite sequence clustering and visualization tool

5.1 Background

Many bioinformatics tools have been developed to process, analyze, and visualize DNA methylation data. In particular, single-nucleotide-level DNA methylation information acquired from sequencing data has been widely used in different types of studies. Recently, a few algorithms have been proposed to detect allele-specific DNA methylation (ASM) genome-wide, such as amrfinder [46] and ASM-Detector (Chapter 4). In addition, some other tools are designed to discover DNA methylation co-occurrence patterns from ultra-deep bisulfite sequencing data ([19] and Chapter 3). Here we present a general tool called BS-Cluster to identify clusters of bisulfite sequences that have distinct DNA methylation patterns. It utilizes an efficient heuristic graph-clustering algorithm (Chapter 4) to accurately group bisulfite sequences with different DNA methylation patterns. BS-Cluster can be applied to both ultra-deep and whole-genome sequencing datasets, for example to profile ASM genome-wide and to identify sub-patterns in tumor tissues. In addition, BS-Cluster provides visualization of DNA methylation patterns in multiple views, as well as raw results for further downstream analysis.

5.2 Implementation and description

We implemented BS-Cluster as a command line tool. Users can easily integrate it into large analysis pipelines and run it in both PC and high performance computing (HPC) environments. BS-Cluster is implemented efficiently in Java using HTSJDK.

In addition to a reference sequence in FASTA format, it takes mapped reads in standard SAM or BAM format as input. Both single-end and paired-end reads are supported. Indexed BAM input provides better performance when querying reads in a specific region. The reference file should be the same as the one used for read mapping.

BS-Cluster’s workflow consists of three steps. First, it obtains Consecutively Covered Methylation Regions (CCMRs) according to the given parameters (partial methylation threshold, minimum adjacent CpG coverage, minimum interval CpG number, and minimum interval read number). Next, it applies the heuristic clustering algorithm to each CCMR. Clustered sequence groups and their methylation levels are saved in text-format output files, which can be used in downstream analysis. In general, BS-Cluster may produce a clustering result with an arbitrary number of groups (up to the number of sequences). ASM is a special case: in the human genome, an ASM region contains at most two groups, which have distinct DNA methylation patterns. To detect ASM with BS-Cluster, the user only needs to add one more parameter (-a) when performing clustering, which limits the maximum number of groups to two. BS-Cluster uses the P value of each CpG site in the region, which is provided in the output files, to identify bona fide ASM regions. Finally, BS-Cluster provides DNA methylation pattern visualization in two different views: the group pattern view and the individual sequence view. The group pattern view (Figure 5.1 A) provides summaries of the methylation level and the number of sequences in each group. It is especially useful when hundreds or thousands of sequences are mapped to the region, in which case a plot of all sequences would not be readable. When the number of sequences is small, the individual sequence view (Figure 5.1 B) is feasible and provides more details. In addition to visualizing the methylation state of each CpG in each sequence, BS-Cluster can display SNPs located in each sequence to help reveal the association between SNPs and DNA methylation.
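To make the first step of this workflow more concrete, the following Python sketch shows one way consecutively covered regions could be derived from CpG positions and mapped reads. It is a simplified illustration and not BS-Cluster's Java implementation: the function and parameter names are hypothetical, the interpretation of each threshold is an assumption, reads are reduced to half-open (start, end) intervals, and the partial methylation threshold is omitted for brevity.

```python
def find_ccmrs(cpg_positions, reads, min_adjacent_cov=2, min_cpgs=5, min_reads=4):
    """Return (first_cpg, last_cpg) pairs for consecutively covered regions.

    cpg_positions: sorted CpG coordinates on one chromosome.
    reads: list of (start, end) intervals, end exclusive.
    A region is extended while at least `min_adjacent_cov` reads span
    both the previous CpG and the next one.
    """
    def spanning(a, b):
        # Number of reads covering both CpG a and CpG b (a < b).
        return sum(1 for s, e in reads if s <= a and b < e)

    runs, current = [], []
    for pos in cpg_positions:
        if current and spanning(current[-1], pos) < min_adjacent_cov:
            runs.append(current)
            current = []
        current.append(pos)
    if current:
        runs.append(current)

    kept = []
    for cpgs in runs:
        # Reads touching any part of the run, from first to last CpG.
        n_reads = sum(1 for s, e in reads if s <= cpgs[-1] and cpgs[0] < e)
        if len(cpgs) >= min_cpgs and n_reads >= min_reads:
            kept.append((cpgs[0], cpgs[-1]))
    return kept
```

In this sketch a region ends wherever too few reads connect two neighboring CpGs, so the boundaries come from the data rather than from a fixed window size.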


Figure 5.1 Examples of the group pattern view and the individual sequence pattern view. A) Group pattern figure of a region in the PAX6 gene. Each circle refers to a CpG site. Methylation level is represented by color from green (0%) to red (100%) and is shown under each circle. The number of sequences is shown on the right of each group. B) Individual sequence pattern figure of a region in the TRAPPC9 gene. Circles represent methylated CpGs (black) and unmethylated CpGs (white). Gaps between CpGs may be caused by the gap between paired-end sequences or by a non-CG/TG sequence at a CpG site. All sequences covering SNP rs4455807 are shown in the figure. The allele and strand of each sequence are shown on the right.

5.3 Example and discussion

Here we demonstrate the usage of BS-Cluster with examples from two real datasets. First, we use sequences generated from the LNCaP prostate cancer cell line by ultra-deep bisulfite sequencing. In this example, sequences are mapped to target regions that were selected before sequencing. In the first step, BS-Cluster recognizes the target regions as CCMRs, since they are designed to be consecutively covered by sequences. We then apply clustering to each CCMR. Since each CCMR contains hundreds of sequences, we use the group pattern view to visualize the clustering result. Figure 5.1 A shows an example region in the PAX6 gene, in which 820 sequences are clustered into four groups. Each group has a distinct DNA methylation pattern, which suggests that each group may represent a subtype of this cancer cell line.
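The information shown in the group pattern view can be thought of as a small per-group summary table. The Python sketch below computes, for each group, the number of sequences and the methylation level at each CpG site. The 'M'/'U'/'-' read encoding, the integer group labels, and the function name are hypothetical and serve only to illustrate the idea; the sketch does not reproduce BS-Cluster's output format.

```python
def summarize_groups(reads, labels):
    """Return {label: (number_of_reads, [methylation level per CpG or None])}.

    reads  : list of strings, one character per CpG site
             ('M' methylated, 'U' unmethylated, '-' not covered)
    labels : list of group labels, one per read
    """
    groups = {}
    for read, label in zip(reads, labels):
        groups.setdefault(label, []).append(read)

    summary = {}
    for label, members in groups.items():
        levels = []
        for i in range(len(members[0])):
            # Only reads that actually cover this CpG contribute to its level.
            calls = [r[i] for r in members if r[i] in "MU"]
            levels.append(sum(c == "M" for c in calls) / len(calls) if calls else None)
        summary[label] = (len(members), levels)
    return summary
```

A plotting routine could then map each level to a color between green (0%) and red (100%) and print the read count next to each group, as in Figure 5.1 A.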

Our second example is a WGBS dataset generated from adipose-derived stem cells (ADS), which we use to illustrate how to detect ASM regions in WGBS data. First, we use BS-Cluster to obtain CCMRs. Since we aim to detect ASM regions, we specify the partial methylation parameter to filter out CCMRs that are either hypomethylated or hypermethylated. Next, we apply clustering to the candidate regions with the ASM parameter, which limits the number of groups in the clustering result to at most two. To support the clustering result, we supply a SNP (rs4455807) called in the same region to generate the individual methylation pattern view. SNPs can be called from a whole-genome sequencing (WGS) dataset of the same sample, or from the same WGBS dataset by using tools such as Bis-SNP [53]. The individual pattern figure of the TRAPPC9 gene (Figure 5.1 B) shows an ASM region detected in this dataset. Sequences clustered in the methylated group carry allele T, whereas sequences clustered in the unmethylated group carry allele G. The consistency between the SNP alleles and the clustering result suggests that this region is a bona fide ASM region.
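The two checks used in this example, filtering for partially methylated regions and comparing cluster membership with SNP alleles, can be sketched in Python as follows. The thresholds of 0.2 and 0.8, the 'M'/'U'/'-' read encoding, and the function names are assumptions chosen for illustration; they are not the parameters or the code of BS-Cluster.

```python
from collections import Counter


def is_partially_methylated(reads, low=0.2, high=0.8):
    """True if the overall methylation level of the region lies in [low, high].

    The default thresholds are hypothetical values for this sketch.
    """
    calls = [c for read in reads for c in read if c in "MU"]
    if not calls:
        return False
    level = calls.count("M") / len(calls)
    return low <= level <= high


def allele_cluster_concordance(labels, alleles):
    """Return the majority allele of each cluster and the fraction of reads
    whose allele equals the majority allele of their own cluster."""
    by_cluster = {}
    for label, allele in zip(labels, alleles):
        by_cluster.setdefault(label, Counter())[allele] += 1
    majority = {label: counts.most_common(1)[0][0]
                for label, counts in by_cluster.items()}
    agree = sum(majority[label] == allele for label, allele in zip(labels, alleles))
    return majority, agree / len(labels)
```

In this sketch, a region would be consistent with ASM when the two clusters have different majority alleles and the concordance is close to 1, which corresponds to the T/G separation described for the TRAPPC9 region.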

In summary, BS-Cluster is, to our knowledge, the first general bisulfite sequence clustering tool that can handle different kinds of bisulfite sequencing data. It provides accurate clustering results and multiple ways to visualize the DNA methylation patterns of clustered groups. Our examples illustrate its usefulness in cancer studies and in the detection of ASM.

Chapter 6

Conclusion

In this dissertation, we described several novel tools and methods for analyzing human DNA methylation data. These tools and methods are designed to support different DNA methylation measurement technologies, which vary in both the depth and the breadth of measurement. The mmCpGs detected by our GMMC-based method will be useful in population epigenetic studies. DNA methylation co-occurrence patterns identified by BSPAT can help the study of tumor heterogeneity from an epigenetic perspective. Genome-wide ASM profiling performed by ASM-detector reveals the features of ASM in the human genome and provides a new approach to studying gene imprinting and X chromosome inactivation. BS-Cluster gives a general solution for sequence clustering in any kind of bisulfite sequencing dataset.

Bibliography

[1] H. Cedar and Y. Bergman, “Programming of DNA methylation patterns,” Ann Rev

Biochem, vol. 81, pp. 97–117, Jan. 2012.

[2] G. Egger, G. Liang, A. Aparicio, and P. A. Jones, “Epigenetics in human disease

and prospects for epigenetic therapy,” Nature, vol. 429, no. 6990, p. 457, 2004.

[3] R. Lister et al., “Hotspots of aberrant epigenomic reprogramming in human

induced pluripotent stem cells.,” Nature, vol. 471, no. 7336, pp. 68–73, 2011.

[4] P. A. Jones, “Functions of DNA methylation: islands, start sites, gene bodies and

beyond,” Nat Rev Genet, vol. 13, no. 7, pp. 484–92, Jul. 2012.

[5] P. W. Laird, “Principles and challenges of genome-wide DNA methylation

analysis,” Nat Rev Genet, vol. 11, no. 3, pp. 191–203, Mar. 2010.

[6] K. H. Taylor et al., “Ultradeep bisulfite sequencing analysis of DNA methylation

patterns in multiple gene promoters by 454 sequencing,” Cancer Res, vol. 67, no.

18, pp. 8511–8, Sep. 2007.

[7] Y. Korshunova et al., “Massively parallel bisulphite pyrosequencing reveals the

molecular complexity of breast cancer-associated cytosine-methylation patterns

obtained from tissue and serum DNA,” Genome Res., vol. 18, no. 1, pp. 19–29,

2008.

[8] R. Lister et al., “Human DNA methylomes at base resolution show widespread

epigenomic differences.,” Nature, vol. 462, no. 7271, pp. 315–22, Nov. 2009.

[9] M. R. Irvin et al., “Epigenome-Wide Association Study of Fasting Blood Lipids in

the Genetics of Lipid-Lowering Drugs and Diet Network Study,” Circulation,

vol. 130, no. 7, pp. 565–572, 2014.

[10] A. H. Olsson et al., “Genome-wide associations between genetic and epigenetic

variation influence mRNA expression and insulin secretion in human pancreatic

islets,” PLoS Genet., vol. 10, no. 11, p. e1004735, 2014.

[11] A. K. Smith et al., “Methylation quantitative trait loci (meQTLs) are consistently

detected across ancestry, developmental stage, and tissue type,” BMC Genom, vol.

15, 2014.

[12] P. Daca-Roszak et al., “Impact of SNPs on methylation readouts by Illumina

Infinium HumanMethylation450 BeadChip Array: implications for comparative

population studies.,” BMC Genomics, vol. 16, no. 1, p. 1003, 2015.

[13] S. V. Andrews, C. Ladd-Acosta, A. P. Feinberg, K. D. Hansen, and M. D. Fallin,

“‘Gap hunting’ to characterize clustered probe signals in Illumina methylation

array data,” Epigenetics Chromatin, vol. 9, no. 1, p. 56, 2016.

[14] M. J. Aryee et al., “Minfi: a flexible and comprehensive Bioconductor package for

the analysis of Infinium DNA methylation microarrays,” Bioinf Oxf Engl, vol. 30,

2014.

[15] T. Benaglia, D. Chauveau, D. R. Hunter, and D. S. Young, “mixtools: An R

package for analyzing mixture models,” J. Stat. Softw., vol. 32, no. 1, pp. 1–29,

2009.

[16] Y. Kumaki, M. Oda, and M. Okano, “QUMA: quantification tool for methylation

analysis,” Nucleic Acids Res., vol. 36, no. suppl_2, pp. W170–W175, 2008.

[17] C. Rohde, Y. Zhang, R. Reinhardt, and A. Jeltsch, “BISMA-Fast and accurate

bisulfite sequencing data analysis of individual clones from unique and repetitive

sequences,” BMC Bioinformatics, vol. 11, p. 230, Jan. 2010.

[18] C. Bock, S. Reither, T. Mikeska, M. Paulsen, J. Walter, and T. Lengauer, “Biq

analyzer: visualization and quality control for DNA methylation data from bisulfite

sequencing,” Bioinformatics, vol. 21, no. 21, pp. 4067–8, Nov. 2005.

[19] P. Lutsik, L. Feuerbach, J. Arand, T. Lengauer, J. Walter, and C. Bock, “Biq

analyzer ht: locus-specific analysis of DNA methylation by high-throughput

bisulfite sequencing,” Nucleic Acids Res, vol. 39, no. Web Server issue,

pp. W551–6, Jul. 2011.

[20] F. Krueger and S. R. Andrews, “Bismark: a flexible aligner and methylation caller

for Bisulfite-Seq applications,” Bioinformatics, vol. 27, no. 11, pp. 1571–2, Jun.

2011.

[21] P.-Y. Chen, S. J. Cokus, and M. Pellegrini, “BS Seeker: precise mapping for

bisulfite sequencing,” BMC Bioinformatics, vol. 11, no. 1, p. 203, 2010.

[22] F. Krueger, B. Kreck, A. Franke, and S. R. Andrews, “DNA methylome analysis

using short bisulfite sequencing data,” Nat Methods, vol. 9, no. 2, pp. 145–51, Feb.

2012.

[23] Y. Xu et al., “Unique DNA methylome profiles in CpG island methylator

phenotype colon cancers,” Genome Res, vol. 22, no. 2, pp. 283–91, Feb. 2012.

[24] M. Brait et al., “Correlation between BRAF mutation and promoter methylation of

TIMP3, RARβ2 and RASSF1A in thyroid cancer,” Epigenetics, vol. 7, no. 7, pp.

710–719, 2012.

[25] W. J. Kent, “BLAT—the BLAST-like alignment tool,” Genome Res., vol. 12, no.

4, pp. 656–664, 2002.

[26] R. M. Kuhn, D. Haussler, and W. J. Kent, “The UCSC genome browser and

associated tools,” Brief. Bioinform., vol. 14, no. 2, pp. 144–161, 2012.

[27] B. Langmead, C. Trapnell, M. Pop, and S. L. Salzberg, “Ultrafast and memory-

efficient alignment of short DNA sequences to the human genome.,” Genome

Biol., vol. 10, no. 3, p. R25, Jan. 2009.

[28] D. Karolchik et al., “The UCSC Genome Browser database: 2014 update,” Nucleic

Acids Res., vol. 42, no. D1, pp. D764–D770, 2014.

[29] S. T. Sherry et al., “dbSNP: the NCBI database of genetic variation,” Nucleic

Acids Res., vol. 29, no. 1, pp. 308–311, 2001.

[30] T. Mikeska, I. L. M. Candiloro, and A. Dobrovic, “The implications of

heterogeneous DNA methylation for the accurate quantification of methylation,”

Epigenomics, vol. 2, no. 4, pp. 561–573, 2010.

[31] D. P. Barlow and M. S. Bartolomei, “Genomic imprinting in mammals.,” Cold

Spring Harb. Perspect. Biol., vol. 6, no. 2, p. a018382, 2014.

[32] A. Wutz, “Gene silencing in X-chromosome inactivation: advances in

understanding facultative heterochromatin formation.,” Nat. Rev. Genet., vol. 12,

no. 8, pp. 542–553, 2011.

[33] K. Kerkel et al., “Genomic surveys by methylation-sensitive SNP analysis identify

sequence-dependent allele-specific DNA methylation.,” Nat. Genet., vol. 40, no. 7,

pp. 904–908, 2008.

[34] R. Shoemaker, J. Deng, W. Wang, and K. Zhang, “Allele-specific methylation is

prevalent and is contributed by CpG-SNPs in the human genome.,” Genome Res.,

vol. 20, no. 7, pp. 883–9, Jul. 2010.

[35] J. Gertz et al., “Analysis of DNA methylation in a three-generation family reveals

widespread genetic influence on epigenetic regulation,” PLoS Genet., vol. 7, no. 8,

2011.

[36] W. Xie et al., “Base-resolution analyses of sequence and parent-of-origin

dependent DNA methylation in the mouse genome,” Cell, vol. 148, no. 4,

pp. 816–31, Feb. 2012.

[37] T. Babak et al., “Genetic conflict reflected in tissue-specific maps of genomic

imprinting in human and mouse,” Nat. Genet., vol. 47, no. 5, pp. 544–549, 2015.

[38] Y. Yasukochi et al., “X chromosome-wide analyses of genomic DNA methylation

states and gene expression in male and female neutrophils.,” Proc. Natl. Acad. Sci.

U. S. A., vol. 107, no. 8, pp. 3704–9, 2010.

[39] A. M. Cotton et al., “Chromosome-wide DNA methylation analysis predicts

human tissue-specific X inactivation,” Hum. Genet., vol. 130, no. 2, pp. 187–201,

2011.

[40] Roadmap Epigenomics Consortium et al., “Integrative analysis of 111 reference human

epigenomes,” Nature, vol. 518, no. 7539, pp. 317–330, 2015.

[41] M. D. Schultz et al., “Human body epigenome maps reveal noncanonical DNA

methylation variation,” Nature, 2015.

[42] M. Lalande, “Parental imprinting and human disease.,” Annu. Rev. Genet., vol. 30,

pp. 173–195, 1996.

[43] D. Monk, “Deciphering the cancer imprintome.,” Brief. Funct. Genomics, vol. 9,

no. 4, pp. 329–39, 2010.

[44] V. Kuleshov et al., “Whole-genome haplotyping using long reads and statistical

methods.,” Nat. Biotechnol., vol. 32, no. 3, pp. 261–6, Mar. 2014.

[45] Q. Peng and J. R. Ecker, “Detection of allele-specific methylation through a

generalized heterogeneous epigenome model.,” Bioinformatics, vol. 28, no. 12, pp.

i163-71, Jun. 2012.

[46] F. Fang, E. Hodges, A. Molaro, M. Dean, G. J. Hannon, and A. D. Smith,

“Genomic landscape of human allele-specific DNA methylation,” Proc. Natl.

Acad. Sci., vol. 109, no. 19, pp. 7332–7337, May 2012.

[47] B. E. Bernstein et al., “The NIH Roadmap Epigenomics Mapping Consortium,”

Nat Biotech, vol. 28, no. 10, pp. 1045–1048, Oct. 2010.

[48] R. A. Fisher, Statistical methods for research workers, vol. 354. 1925.

[49] R. Lister et al., “Human DNA methylomes at base resolution show widespread

epigenomic differences,” Nature, vol. 462, no. 7271, pp. 315–322, 2009.

[50] K. R. Rosenbloom et al., “The UCSC Genome Browser database: 2015 update.,”

Nucleic Acids Res., vol. 43, no. Database issue, pp. D670-81, 2015.

[51] A. R. Quinlan and I. M. Hall, “BEDTools: a flexible suite of utilities for

comparing genomic features.,” Bioinformatics, vol. 26, no. 6, pp. 841–842, 2010.

[52] The ENCODE Project Consortium et al., “An integrated encyclopedia of DNA

elements in the human genome.,” Nature, vol. 489, no. 7414, pp. 57–74, Sep.

2012.

[53] Y. Liu, K. D. Siegmund, P. W. Laird, and B. P. Berman, “Bis-SNP: Combined

DNA methylation and SNP calling for Bisulfite-seq data,” Genome Biol., vol. 13,

no. 7, p. R61, 2012.

[54] J. R. Dixon et al., “Chromatin architecture reorganization during stem cell

differentiation,” Nature, vol. 518, no. 7539, pp. 331–336, 2015.

[55] Q. Song et al., “A Reference Methylome Database and Analysis Pipeline to

Facilitate Integrative and Comparative Epigenomics,” PLoS One, vol. 8, no. 12, p.

e81148, 2013.

[56] M. Stevens et al., “Estimating absolute methylation levels at single-CpG resolution

from methylation enrichment and restriction enzyme sequencing methods,”

Genome Res., vol. 23, no. 9, pp. 1541–1553, Sep. 2013.

[57] M. Weber et al., “Chromosome-wide and promoter-specific analyses identify sites

of differential DNA methylation in normal and transformed human cells,” Nat

Genet, vol. 37, no. 8, pp. 853–862, Aug. 2005.

[58] F. Yang et al., “The lncRNA Firre anchors the inactive X chromosome to the

nucleolus by binding CTCF and maintains H3K27me3 methylation,” Genome

Biol., vol. 16, no. 1, p. 52, 2015.

[59] G. Elliott et al., “Intermediate DNA methylation is a conserved signature of

genome regulation,” Nat. Commun., vol. 6, p. 6363, 2015.

[60] Y. Baran et al., “The landscape of genomic imprinting across diverse adult human

tissues,” Genome Res., vol. 25, no. 7, pp. 927–936, Jul. 2015.

[61] Y. Q. Xu, C. G. Goodyer, C. Deal, and C. Polychronakos, “Functional

polymorphism in the parental imprinting of the human IGF2R gene,” Biochem.

Biophys. Res. Commun., vol. 197, no. 2, pp. 747–754, 1993.

[62] S. Takikawa, C. Ray, X. Wang, Y. Shamis, T. Y. Wu, and X. Li, “Genomic

imprinting is variably lost during reprogramming of mouse iPS cells,” Stem Cell

Res., vol. 11, no. 2, pp. 861–873, 2013.

[63] P. J. Rugg-Gunn, A. C. Ferguson-Smith, and R. A. Pedersen, “Status of genomic

imprinting in human embryonic stem cells as revealed by a large cohort of

independently derived and maintained lines.,” Hum. Mol. Genet., vol. 16 Spec No,

no. 2, pp. R243-51, 2007.

[64] M. Stadtfeld et al., “Aberrant silencing of imprinted genes on chromosome 12qF1

in mouse induced pluripotent stem cells.,” Nature, vol. 465, no. 7295,

pp. 175–181, 2010.

[65] Y. Stelzer, O. Yanuka, and N. Benvenisty, “Global analysis of parental imprinting

in human parthenogenetic induced pluripotent stem cells.,” Nat. Struct. Mol. Biol.,

vol. 18, no. 6, pp. 735–741, 2011.

[66] F. Court et al., “The PEG13-DMR and brain-specific enhancers dictate imprinted

expression within the 8q24 intellectual disability risk locus,” Epigenetics

Chromatin, vol. 7, no. 1, p. 5, 2014.

[67] G. Csankovszki, A. Nagy, and R. Jaenisch, “Synergism of Xist RNA, DNA

methylation, and histone hypoacetylation in maintaining X chromosome

inactivation.,” J. Cell Biol., vol. 153, no. 4, pp. 773–784, May 2001.
