Computational Discovery of DNA Methylation Patterns As Biomarkers of Ageing, Cancer, and Mental Disorders

Digital Comprehensive Summaries of Uppsala Dissertations from the Faculty of Science and Technology 1520

Computational discovery of DNA methylation patterns as biomarkers of ageing, cancer, and mental disorders

Algorithms and Tools

BEHROOZ TORABI MOGHADAM

ACTA UNIVERSITATIS UPSALIENSIS ISSN 1651-6214 ISBN 978-91-554-9924-2 UPPSALA urn:nbn:se:uu:diva-320720 2017 Dissertation presented at Uppsala University to be publicly examined in A1:111a, BMC Building, Husargatan 3, Uppsala, Monday, 12 June 2017 at 09:00 for the degree of Doctor of Philosophy. The examination will be conducted in English. Faculty examiner: Professor Jaak Vilo (University of Tartu).

Abstract Torabi Moghadam, B. 2017. Computational discovery of DNA methylation patterns as biomarkers of ageing, cancer, and mental disorders. Algorithms and Tools. Digital Comprehensive Summaries of Uppsala Dissertations from the Faculty of Science and Technology 1520. 55 pp. Uppsala: Acta Universitatis Upsaliensis. ISBN 978-91-554-9924-2.

Epigenetics refers to mitotically heritable modifications in gene expression that occur without a change in the genetic code. A combination of molecular, chemical and environmental factors constituting the epigenome is involved, together with the genome, in setting up the unique functionality of each cell type. DNA methylation is the most studied epigenetic mark in mammals, where a methyl group is added to the cytosine in a cytosine-phosphate-guanine dinucleotide, or CpG site. It has been shown to have a major role in various biological phenomena such as X chromosome inactivation, regulation of gene expression, cell differentiation, and genomic imprinting. Furthermore, aberrant patterns of DNA methylation have been observed in various diseases including cancer. In this thesis, we have utilized machine learning methods and developed new methods and tools to analyze DNA methylation patterns as biomarkers of ageing, cancer subtypes and mental disorders. In Paper I, we introduced a pipeline of Monte Carlo Feature Selection and rule-based modeling using ROSETTA in order to identify combinations of CpG sites that classify samples into different age intervals based on their DNA methylation levels. The combinations of genes that appeared to act together motivated us to develop an interactive pathway browser, named PiiL, to check the methylation status of multiple genes in a pathway. The tool facilitates the detection of differential patterns of DNA methylation and/or gene expression by allowing large data sets to be assessed quickly. In Paper III, we developed a novel unsupervised clustering method, methylSaguaro, for analyzing various types of cancer and detecting cancer subtypes based on their DNA methylation patterns. Using this method, we confirmed previously reported findings that challenge the histological grouping of patients, and proposed new subtypes based on DNA methylation patterns.
In Paper IV, we investigated the DNA methylation patterns in a cohort of schizophrenic and healthy samples, using all the methods that were introduced and developed in the first three papers.

Keywords: DNA methylation, machine learning, biomarker, cancer, ageing, classification

Behrooz Torabi Moghadam, Department of Cell and Molecular Biology, Box 596, Uppsala University, SE-75124 Uppsala, Sweden. Science for Life Laboratory, SciLifeLab, Box 256, Uppsala University, SE-75105 Uppsala, Sweden.

ISSN 1651-6214 ISBN 978-91-554-9924-2 urn:nbn:se:uu:diva-320720 (http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-320720)

“There are no incurable diseases; only the lack of will. There are no worthless herbs; only the lack of knowledge.”

Avicenna

To my family

“A soul in tension that is learning to fly Condition grounded but determined to try Can't keep my eyes from the circling skies Tongue-tied and twisted just an earth-bound misfit, I”

Pink Floyd – Learning to fly

List of Papers

This thesis is based on the following papers, which are referred to in the text by their Roman numerals.

I Torabi Moghadam, B., Dabrowski, M., Kaminska, B., Grabherr, M. G., and Komorowski, J. (2016) Combinatorial identification of DNA methylation patterns over age in the human brain. BMC Bioinformatics, 17(1):393

II Torabi Moghadam, B., Zamani, N., Komorowski, J., Grabherr, M. G. (2017) PiiL: visualization of DNA methylation and gene expression data in gene pathways. BMC Genomics, in press.

III Torabi Moghadam, B., Zamani, N., Gao, J., Meadows, J. R. S., Sundström, G., Holmfeldt, L., Komorowski, J., Grabherr, M. G. (2017) An unsupervised approach subgroups cancer types by distinct local DNA methylation patterns. Manuscript.

IV Torabi Moghadam, B., Etemadikhah, M., Rajkova, G., Stockmeier, C., Grabherr, M. G., Komorowski, J., Feuk, L. and Lindholm Carlström, E. (2017) Analyzing DNA methylation patterns in schizophrenic patients using machine learning methods. Manuscript.

Reprints were made with permission from the respective publishers.

List of additional publications

(Not included in this thesis)

I Torabi Moghadam, B. *, Zamani, N. *, Rafati, N.*, Lamichhaney, S.*, Sundström, G., Lantz, H., Martinez Barrio, A., Komorowski, J., Clavijo, B. J., Jern, P., and Grabherr, M. G. A local web interface for protein and transcript aligners: Smörgås. Manuscript submitted. * These authors contributed equally

II Torabi Moghadam, B., Alvarsson, J., Holm, M., Eklund, M., Carlsson, L., Spjuth, O. (2015) Scaling predictive modeling in drug development with cloud computing. Journal of Chemical Information and Modeling, 55(1):19–25

III Lindqvist, C. M., Nordlund, J., Ekman, D., Johansson, A., Torabi Moghadam, B., Raine, A., Övernäs, E., Dahlberg, J., Wahlberg, P., Henriksson, N., Abrahamsson, J., Frost, B. M., Grandér, D., Heyman, M., Larsson, R., Palle, J., Söderhäll, S., Forestier, E., Lönnerholm, G., Syvänen, A. C., Berglund, E. C. (2015) The mutational landscape in pediatric acute lymphoblastic leukemia deciphered by whole genome sequencing. Human Mutation, 36(1):118–28

IV Hoeppner, M. P., Lundquist, A., Pirun, M., Meadows, J. R., Zamani, N., Johnson, J., Sundström, G., Cook, A., FitzGerald, M. G., Swofford, R., Mauceli, E., Torabi Moghadam, B., Greka, A., Alföldi, J., Abouelleil, A., Aftuck, L., Bessette, D., Berlin, A., Brown, A., Gearin, G., Lui, A., Macdonald, J. P., Priest, M., Shea, T., Turner-Maier, J., Zimmer, A., Lander, E. S., di Palma, F., Lindblad-Toh, K., Grabherr, M. G. (2014) An improved canine genome and a comprehensive catalogue of coding genes and non-coding transcripts. PLOS ONE, 9(3):e91172

V Bornelöv, S., Sääf, A., Melén, E., Bergström, A., Torabi Moghadam, B., Pulkkinen, V., Acevedo, N., Orsmark Pietras, C., Ege, M., Braun-Fahrländer, C., Riedler, J., Doekes, G., Kabesch, M., van Hage, M., Kere, J., Scheynius, A., Söderhäll, C., Pershagen, G., Komorowski, J. (2013) Rule-based models of the interplay between genetic and environmental factors in childhood allergy. PLOS ONE, 8(11):e80080

Contents

Biological background
  The human genome
  The human epigenome
  Biomarkers
  DNA methylation
    DNAm establishment and maintenance
    DNA demethylation mechanisms
    DNA methylation functions
  DNA methylation and disease
  DNA methylation and cancer
    DNA hypomethylation in cancer
    DNA hypermethylation in cancer
  DNA methylation and aging
  Gene expression pathways
  Measuring DNA methylation
Computational background
  Machine learning in life sciences
  Supervised learning
    Decision trees
    Rough sets
    Rule-based modeling using ROSETTA
  Feature selection
    Monte Carlo Feature Selection
  Unsupervised learning
    K-means clustering
    Hierarchical clustering
    Self Organizing Map
    Hidden Markov Model (HMM)
    Hypothesis-free pattern recognition using Saguaro
  Visualization
Aims
Methods and Results
  Paper I
    Background
    Data
    Methods
    Results
  Paper II
    Background
    Data
    Methods
    Results
  Paper III
    Background
    Data
    Methods
    Results
  Paper IV
    Background
    Data
    Methods
    Results
Concluding remarks and future prospects
Acknowledgements
References

Abbreviations

5hmC    5-hydroxymethylcytosine
5mC     5-methylcytosine
AML     Acute Myeloid Leukemia
CGI     CpG island
CLL     Chronic Lymphocytic Leukemia
CpG     Cytosine-phosphate-Guanine
DMR     Differentially Methylated Region
DNA     Deoxyribonucleic acid
DNAm    DNA methylation
DNase   Deoxyribonuclease
DNMT    DNA methyltransferase
GBM     Glioblastoma multiforme
HMM     Hidden Markov Model
ICR     Imprinting Control Region
KICH    Kidney Chromophobe
KIRC    Kidney Renal Clear Cell Carcinoma
KIRP    Kidney Renal Papillary Cell Carcinoma
LGG     Lower Grade Glioma
LOI     Loss Of Imprinting
MBD     Methyl-CpG-binding Domain
MCFS    Monte Carlo Feature Selection
meCP    Methyl-CpG-binding domain protein
meCpG   Methylated CpG
MeDIP   Methylated DNA immunoprecipitation
RI      Relative Importance
RNA     Ribonucleic acid
SNP     Single nucleotide polymorphism
SNV     Single Nucleotide Variant
SOM     Self Organizing Map
TSS     Transcription Start Site
WGBS    Whole Genome Bisulfite Sequencing

Biological background

The human genome

The human genome comprises around 3 billion base pairs, distributed over 22 autosomal chromosomes and two sex chromosomes, X and Y. Chromosomes occur in pairs in the cell nucleus, with one member of each pair inherited from each parent. Only about 2% of the whole genome, known as the coding DNA, codes for proteins. The rest of the DNA, i.e. non-coding DNA, includes regulatory elements, retroviruses, transposable elements, introns and sections that encode non-coding RNAs.

The long stretch of DNA is condensed and packed into a structure called chromatin. Chromatin consists of subunits called nucleosomes, each of which consists of 146 bp of DNA wrapped around a histone octamer. Chromatin structure is a key player during mitosis and in preventing DNA damage. It is also strongly involved in gene regulation, since it determines the accessibility of the DNA for transcription. Heterochromatin refers to the densely packed state of chromatin and euchromatin to its open state. Figure 1 depicts various forms of DNA compaction.

Figure 1. Major structures of compaction of DNA (by Richard Wheeler, from Wikimedia Commons).

The human epigenome

The term epigenetics, literally meaning “on top of or above genetics”, refers to mitotically heritable patterns of gene expression that arise without any change in the underlying DNA sequence. This explains how more than 200 cell types with various functionalities and phenotypes differentiate consistently from the same DNA code [1,2]. The epigenome is then the combination of dynamic layers of molecular, chemical and environmental factors that, along with the genome, determines each cell type’s unique functional identity [3,4]. The physical organization of the DNA, affected by the way it is packed into chromatin, is a key factor in epigenetic gene regulation, as it determines the accessibility of the genomic code to transcription factors. Epigenetic regulation of gene expression is driven by four major mechanisms: (1) cytosine methylation, (2) post-translational modifications of histone proteins, such as methylation or acetylation, (3) chromatin remodeling and (4) non-coding RNAs.

In recent years, an increasing number of publications have reported the fundamental role of epigenetic alterations in pathogenesis and disease susceptibility, including cancer [5,6]. This has been a key motivation for large-scale projects providing publicly available resources of epigenomic maps for multiple tissues, namely the Roadmap Epigenomics Project [1] and the Blueprint Epigenome Project [7].

Biomarkers

A biomarker is any objectively measurable biological feature or characteristic that can be evaluated as an indicator of a normal biological process, a pathogenic process, or a pharmacological response to a therapeutic intervention [8]. Biomarkers can be divided into categories based on the stage of a disease at which they can be used: risk biomarkers can be associated with disease cause or latency, diagnostic biomarkers with the onset of the disease, prognostic biomarkers with the clinical course, predictive biomarkers with the response to treatment and, finally, exposure biomarkers with specific environments. In complex diseases, biomarkers may be influenced by a combination of genetic and environmental factors. The relation between exposure and disease can also be reflected in biomarkers, which can then be used to classify patients according to risk or prognosis. The chosen biomarkers must be measurable with high accuracy in multiple samples and across populations [9].

DNA methylation

Cytosine methylation is a DNA modification that involves the addition of a methyl group to the 5 position (carbon 5) of cytosine residues, creating 5-methylcytosine, denoted as 5mC (Figure 2). In mammals, this modification happens almost exclusively at cytosines located 5’ to a guanosine, often denoted as CpG sites, where ‘p’ represents the phosphodiester bond linking cytosine and guanine [10,11]. However, non-CpG methylation (mCHG and mCHH, where H = A, C or T) has been observed in embryonic stem cells [12]. Methylated cytosine can be oxidized and converted into 5-hydroxymethylcytosine (5hmC), which was first found in cerebellar Purkinje neurons and in embryonic stem cells, but its biological function is not yet fully understood [13].

Figure 2. A methyl group is added to the 5 position of cytosine, creating 5-methylcytosine (by Mariuswalter, via Wikimedia Commons).

DNA methylation, first explained in 1975 [14,15], is the most studied epigenetic mark in mammals and is reported to have a major role in multiple biological phenomena, including X chromosome inactivation, imprinting, cell differentiation and development [16-18]. It also has an important role in maintaining chromosomal stability through its effect on repetitive regions such as centromeres [19].

The distribution of methylation is not uniform across the genome. The mammalian genome is mostly depleted of CpGs because spontaneous deamination of 5mC converts it to thymine in the germline [20], whereas the uracil produced by spontaneous deamination of unmethylated cytosine is recognized and repaired by the DNA repair machinery [3]. Although CpG dinucleotides are generally less frequent than expected across the genome, their frequency is higher than expected in some short regions (approximately 1 kb) known as CpG islands (CGIs). The human genome contains about 28 million CpG sites, of which 60-80 percent are methylated. CGIs are usually associated with the promoter regions of genes, but later studies have shown that about 50 percent of them are located in intragenic or intergenic regions [21].
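The notion of a CpG frequency that is "lower" or "higher than expected" can be made concrete with an observed-to-expected ratio. The following is a minimal sketch, assuming the commonly used normalization in which the expected number of CpGs equals (number of C × number of G) / sequence length. The function name and the toy sequences are hypothetical and are not taken from the thesis.

```python
def cpg_observed_expected(seq: str) -> float:
    """Return the observed/expected CpG ratio of a DNA sequence.

    Assumption: expected CpG count = (#C * #G) / length, i.e. the count
    expected if C and G occurred independently of each other.
    """
    seq = seq.upper()
    n = len(seq)
    c_count = seq.count("C")
    g_count = seq.count("G")
    if n == 0 or c_count == 0 or g_count == 0:
        return 0.0
    # str.count counts non-overlapping matches; "CG" cannot overlap itself.
    observed = seq.count("CG")
    expected = c_count * g_count / n
    return observed / expected


# Toy sequences (hypothetical): one CpG-rich, one without any CpG.
cpg_rich = "CGCGCGATCGCGTACGCG"
cpg_poor = "CAGTCCTGGACCTGAGCA"

print(round(cpg_observed_expected(cpg_rich), 2))
print(round(cpg_observed_expected(cpg_poor), 2))
```

A ratio well above that of the surrounding sequence, sustained over a stretch of roughly 1 kb, is the kind of signal that marks a candidate CpG island. Real CGI callers add further criteria, such as minimum length and GC content, which are omitted here.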

DNAm establishment and maintenance

A family of enzymes called DNA methyltransferases (DNMTs) is involved in setting up and maintaining DNA methylation. DNMT1, DNMT2, DNMT3a, DNMT3b and DNMT3L are the members of this family identified in mammals. DNMT1 is responsible for maintaining DNA methylation by recognizing hemi-methylated DNA molecules and copying the pattern from the parent strand to the daughter strand during replication. Two other members of the family, DNMT3a and DNMT3b, are involved in de novo methylation, acting on unmethylated CpG sites, and also help DNMT1 maintain DNA methylation during replication. DNMT2 is mainly involved in the methylation of RNA. DNMT3L was shown to stimulate de novo DNAm by DNMT3a and to mediate a transcriptionally repressive state by interacting with histone deacetylase 1 [22].

DNA demethylation mechanisms

DNA demethylation occurs either in a gradual and passive way, where DNA methylation fails to be maintained after replication, or in a fast and active manner, whose exact mechanisms are not yet fully understood. Multiple enzymes, such as Ten-Eleven Translocation (TET), activation-induced cytidine deaminase (AID) and thymine DNA glycosylase (TDG), have been shown to be involved in active and passive demethylation [20].

DNA methylation functions

Although DNAm is generally associated with gene silencing, it may have different effects on gene regulation depending on the CpG density and the location of the CpG sites. The majority of CGIs located at TSSs are not methylated. Methylated CGIs at TSSs are associated with a long-term repressed state, as in X chromosome inactivation and imprinted genes. On the other hand, methylation within the gene body is correlated with highly expressed genes. Hypermethylation of transposable elements and repetitive regions in the gene body silences them and at the same time enhances transcriptional elongation of the host gene. The regions up to 2 kb upstream or downstream of CGIs, the so-called “CpG shores”, have been reported to be the most differentially methylated regions between healthy cell types, owing to their tissue specificity, as well as in cancer cells and during aging [20]. Figure 3 depicts the general pattern of methylation over the genome.

Figure 3. The CpG sites are usually visualized using lollipops, with a global pattern of methylated sites at transposable elements and gene bodies and unmethylated sites at CpG islands and CpG-rich promoters (by Mariuswalter, via Wikimedia Commons).

DNAm may affect gene regulation through three different mechanisms: (1) changing the binding affinity of transcription factors to gene promoters, (2) changing the binding affinity of methylation-specific recognition factors to gene promoters or gene bodies, and (3) modifying chromatin structure and the accessibility of the DNA to transcription factors [17].

There are two major mechanisms by which DNA methylation leads to silencing:

1) A methylated CpG in a CpG island is associated with the formation of a repressive chromatin structure, where:
• meCpG can be bound by methyl-CpG-binding proteins, e.g. MeCP1 and MeCP2
• MeCP proteins have a DNA-binding domain and a transcriptional repression domain
• MeCP proteins can recruit other factors that condense the chromatin
2) A methylated CpG can prevent transcription factor binding and alter gene expression, probably for rare transcription factors in CpG-poor promoters

DNA methylation and disease

DNA methylation patterns need to be established and maintained properly for the development and normal function of an adult mammalian organism. Embryonic stem cells with defective DNMTs do not survive once they begin to differentiate [23]. Experiments on mice have shown that knocking out DNMT3a and DNMT3b results in early embryonic lethality; hence, de novo methylation is essential for mammalian development [24].

One category of DNAm-related diseases is imprinting disorders. Genomic imprinting refers to an epigenetic modification where gene expression is regulated in a parent-of-origin-specific way [25]. The imprinted allele is silenced and only the other allele is expressed. Imprinting control regions (ICRs), which regulate gene expression within imprinted domains, overlap with areas known as differentially methylated regions (DMRs) [23,26]. Gain or loss of DNAm may result in loss of imprinting (LOI), which affects normal allele-specific gene expression. LOI studies have provided evidence for the importance of methylation in imprinting. For example, the use of assisted reproductive technologies has increased the risk of several imprinting diseases such as Beckwith-Wiedemann syndrome, Prader-Willi syndrome, Angelman syndrome, Albright hereditary osteodystrophy (AHO) and transient neonatal diabetes mellitus (TNDM) [27-31].

Another group of diseases related to DNAm defects comprises those associated with repeat instability, for example de novo methylation of expanded CGG repeats in fragile X syndrome [32]. In addition, a wide range of diseases have been reported to be associated with defective methylation patterns, including autism [33], Alzheimer’s disease [34] and rheumatoid arthritis [35].

DNA methylation and cancer

For a long time, cancer has been considered the result of an accumulation of genetic mutations, but genetic alterations occur at low frequency and are therefore not by themselves an efficient route to malignant transformation. It is now known that genetic and epigenetic mechanisms influence each other and cooperate in cells showing hallmarks of cancer [36,37]. It is also known that environmental factors, such as exposure to certain chemicals, toxins or heavy metals, can alter the epigenome and result in an increased risk of cancer [9]. The epigenome is altered during tumor initiation and progression in multiple ways, including aberrant DNAm patterns, alterations in nucleosome occupancy, and histone modifications, which determine tumor cell heterogeneity and tumorigenesis [38].

DNAm is the most studied marker in cancer epigenetics. We now know that the methylome of tumor cells differs from that of normal cells. Global DNA hypomethylation in cancer cell lines was reported about three decades ago [39]. This finding was later followed by contrasting results showing hypermethylation of CpG islands in cancer, especially in CG-rich promoters of tumor-suppressor genes [40]. These opposing patterns have been reported in various types of cancer [41]. It has been shown that cancer cells cannot survive without disrupted methylation of specific promoter regions [42].

DNA hypomethylation in cancer

DNAm contributes to genome stability by silencing repetitive genomic sequences and satellite sequences. Global loss of DNAm contributes to genomic instability, which is a hallmark of cancer and mutagenesis. This may happen through an increased chance of undesired mitotic recombination and/or reactivation of transposable elements that can move to random places in the genome. Hypomethylation of Sat2 and Satα is associated with early tumor development in breast cancer, but is involved in tumor progression in ovarian cancer [43,44]. Some incorporated viral sequences can also be activated by loss of DNAm. For example, in the case of genital human papillomavirus (HPV), viral sequences that have been silenced by methylation can be re-expressed and may lead to cervical cancer [45]. The genes affected by hypomethylation may be associated with different stages of tumorigenesis, depending on their functions, and may be involved in drug resistance in cancer therapy; one instance is the multidrug resistance 1 gene (MDR1), whose promoter region is hypomethylated in breast cancer cells [22,46].

Although DNAm loss occurs in cancer, no enzyme for active demethylation has been identified yet. Contrary to the expected reduction in DNMT activity, DNMTs are overexpressed in cancer. To explain this phenomenon, it has been proposed that some inactive variants of DNMT3B are involved in misregulating the activity of DNMT3B [47].

DNA hypermethylation in cancer

While a global pattern of DNA methylation loss is observed in cancer, some genes become silenced because of hypermethylation of CpG islands in their regulatory regions. The hypermethylation of the promoter region of the retinoblastoma tumor-suppressor gene (RB1) was the first evidence of this phenomenon [48]. Supported by various studies since then, it is now known that gene silencing through DNA methylation is a hallmark of all types of human cancer and a key player in tumorigenesis [49].

Among the genes affected by hypermethylation are those associated with DNA repair processes. For example, O-6-methylguanine-DNA methyltransferase (MGMT), which is responsible for counteracting the negative effects of DNA alkylation, is silenced in many carcinomas [50]. An association between hypermethylation of the MLH1 promoter and defective DNA mismatch repair was reported in many cases of gastric cancer [51]. The promoter region of BRCA1, which is engaged in the repair of double-stranded DNA breaks, transcription, and recombination, also becomes hypermethylated in breast and ovarian cancer [52].

Silencing of tumor-suppressor genes, or of genes associated with tumor suppression, by DNA hypermethylation is much more common than mutations in these genes. While tens of genes are mutated in cancer cells, hundreds to thousands become hypermethylated [38].

Epigenetic mechanisms can influence mutation frequencies; for example, epigenetic silencing of the mismatch repair gene MLH1 significantly increases mutation rates [36]. Methylation of cytosines within exons is the major cause of C -> T transition mutations, which account for around 30% of all harmful mutations in the germline [53] and are about 10-fold more frequent than any other SNV in the human genome [54]. The resulting substitution, T:A, is not recognizable by the repair machinery, so this effect is more pronounced in highly proliferative tissues. Around 25% of all TP53 mutations occurring in human cancer are associated with this epigenetic phenomenon [36].

Hypermethylation of the promoters of some genes involved in cell adhesion, for example CDH1 and CDH13, may affect tumor progression and metastasis [55,56]. Genes with pro-apoptotic functions that become inactive in this way, for example death-associated protein kinase 1 (DAPK1), affect cancer cell survival [57].

Promoter methylation is cancer type specific, and each cancer type can be characterized by a specific methylome. DNAm patterns may reflect different stages of tumorigenesis, show different histological features and distinguish between subtypes of cancer [22].

DNA methylation and aging

Aging is generally defined as the gradual, time-dependent decline of the biological function of living organisms. Understanding the mechanisms of aging and possible ways of slowing it has long been a key undertaking in biological research. Epigenetic alterations and genomic instability are among the proposed hallmarks of aging. This gradual loss or decline of function is the main risk factor in many human pathologies such as cancer, diabetes, cardiovascular and neurodegenerative diseases [58].

Since DNAm is a more accessible mark for quantitative measurements, it is also a good biomarker for studying the epigenetics of aging. DNAm alterations begin at conception and continue throughout the life span [58]. Multiple studies have reported that a gradual change in DNA methylation occurs genome-wide with aging, especially at repetitive elements. For example, Alu and LINE-1, which are both repetitive elements, show a decreased level of DNAm and increased variability with age [59]. This gradual loss of DNAm due to inaccurate maintenance of DNAm over cell divisions is referred to as “epigenetic drift” [60].

According to the results of various studies, the common pattern is that sites with a generally low methylation level, such as CpG-rich promoters, tend to become hypermethylated, while sites with a generally high methylation level, such as intergenic non-island sites, tend to become hypomethylated (which may lead to genomic instability, one of the hallmarks of aging). Since most CpG sites are highly methylated, this results in a general loss of DNAm and a shift toward the mean methylation level with aging [61]. However, the rate of change is much higher in early life than at older ages, and the affected locations vary. The gain of DNAm in early life occurs mostly at island shores and intergenic regions, whereas the general loss of DNAm in later life is accompanied by some gain at islands and shores [61]. Despite the tissue-specific nature of DNAm, the overall pattern of an increased average methylation level in early life and a gradual decline later in life is valid in most tissues, such as skin [62], muscle [63], blood [64] and brain [65].

In contrast to epigenetic drift, which is defined as a set of age-related DNAm alterations within an individual, the epigenetic clock refers to age-dependent sites that are consistent between individuals [66]. Some sites are better correlated with age in a specific tissue, showing that the epigenetic clock is tissue specific, but pan-tissue epigenetic clocks have also been proposed [67].

Since DNAm can be affected by external stimuli and can be inherited through cell divisions, it can be used as a lasting signature of prior exposures accumulating over the lifespan [68]. For instance, cigarette smoking has been shown to modify DNAm at the AHRR locus in smokers and their children [69,70]. Childhood adversities have also been linked to DNAm changes lasting into adulthood [71].

The epigenetic clock is a good biomarker of aging, since it may reflect the age-dependent epigenetic changes that are shared between individuals [72]. However, some individuals may be epigenetically older or younger than their chronological age, as shown, for example, in a study of centenarians with much younger epigenetic ages [73]. Furthermore, an epigenetic age 5 years older than the chronological age has been associated with a 21% higher mortality risk [74], suggesting the potential of epigenetic age as a better indicator of health than chronological age.
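The epigenetic-clock idea can be illustrated with a small sketch. The code below fits a straight line that predicts age from the methylation level of a single age-associated CpG site and reports the difference between the predicted (epigenetic) age and the chronological age. Everything in it is an assumption made for illustration: the data are invented toy numbers, the function names are hypothetical, and published clocks combine many CpG sites using penalized regression rather than a single-site least-squares fit.

```python
from statistics import mean


def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


# Invented toy training data: beta-values of one CpG site and ages in years.
train_beta = [0.20, 0.30, 0.40, 0.50, 0.60]
train_age = [20.0, 35.0, 50.0, 65.0, 80.0]

intercept, slope = fit_line(train_beta, train_age)


def epigenetic_age(beta):
    """Age predicted by the fitted single-site toy clock."""
    return intercept + slope * beta


def age_acceleration(beta, chronological_age):
    """Positive: epigenetically older; negative: epigenetically younger."""
    return epigenetic_age(beta) - chronological_age


print(round(age_acceleration(0.45, 50.0), 1))
```

The sign of the returned difference corresponds to the distinction made above between individuals who are epigenetically older and those who are epigenetically younger than their chronological age.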

Gene expression pathways

In complex diseases, including cancer, a combination of multiple genetic and environmental risk factors is involved. In this case, the phenotype or disease is not the outcome of one main player, but of multiple players that influence each other and, in combination, the outcome.

A biological pathway is defined as a series of molecular actions within a cell that results in a new product or a change in the state of the cell. Cells react to different stimuli by communicating with each other via signals that are sent through pathways. The major categories of biological pathways are those involved in metabolism, gene regulation, and signal transduction. Biological pathways also interact with each other, forming a biological network. Identifying the genes, proteins and other molecules that are involved in and form a pathway is an ongoing field of research. As an example, the circadian entrainment pathway is shown in Figure 4.

Studying phenotypes in the context of gene pathways helps to shed light on the interdependencies of the underlying genes and proteins and, at the same time, enables the investigation of causal effects in the disrupted mechanisms related to a specific pathway. Databases such as KEGG [75] integrate and describe known pathways in different species.

Figure 4. Circadian entrainment pathway. The blue dashed arrows and black arrows show inhibition and activation relation between the genes, respectively.

Measuring DNA methylation

There are several methods to detect DNAm patterns genome-wide, each with different strengths and weaknesses. Among them, three approaches are widely used:

a) Bisulfite sequencing
b) Bisulfite microarrays
c) Enrichment-based methods

In bisulfite-treated DNA, unmethylated Cs are converted to Us and amplified as Ts by PCR, whereas 5mCs are protected from this conversion. Next-generation sequencing of the bisulfite-treated DNA then identifies the methylated and unmethylated CpGs, an approach usually referred to as whole genome bisulfite sequencing (WGBS). While comprehensive genomic coverage is the advantage of this method, the high cost of whole-genome resequencing and the difficulty of telling apart 5mC and 5hmC are its disadvantages.

In bisulfite microarrays, a preselected set of CpG sites is targeted. This reduces the per-sample cost compared to WGBS, although some informative sites may be lost. Illumina provides arrays with different numbers of targeted CpG sites:

a) Infinium HumanMethylation27 BeadChip, profiling more than 27,000 sites, usually referred to as the 27K array
b) Infinium HumanMethylation450 BeadChip, profiling more than 480,000 sites, usually referred to as the 450K array
c) Infinium MethylationEPIC Kit, profiling more than 850,000 sites, usually referred to as the 850K array

The data analyzed in this thesis were profiled using the 450K array, except for the first paper, where the 27K array was used. According to Illumina’s technical notes, the 450K array covers 99% of RefSeq genes. It includes around 90% of the content of the 27K array and in addition covers:

• Sites located outside of CG rich regions
• Non-CG methylated sites found in human stem cells
• FANTOM4 promoters
• DNase hypersensitive sites
• Sites identified as differentially methylated in cancer versus normal comparisons

The limitation of these arrays is that they are available only for the human genome.

Enrichment-based methods use different techniques to enrich DNA. Methyl-DNA immunoprecipitation (MeDIP), methyl-CpG-binding domain (MBD) proteins or methylation-dependent restriction enzymes can be used for enriching methylated DNA. On the other hand, restriction enzymes that specifically cut unmethylated DNA are used to enrich unmethylated DNA. The advantages of these methods are the ability to distinguish 5mC from 5hmC and a relatively low cost; the disadvantage is a low accuracy at CpG-poor regions [76].

On Illumina’s methylation assays, a pair of probes is used to measure the intensity of the methylated and unmethylated alleles at each targeted CpG site. The methylation level is then estimated from the measured intensities of the methylated and unmethylated probes and is reported as a value between 0 and 1, known as the beta-value. The beta-value is the ratio of the methylated probe intensity to the sum of the methylated and unmethylated probe intensities. Illumina suggests adding a constant (default: 100) to the denominator to handle situations where both probe intensities are very low [77].
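The beta-value calculation described above can be sketched in a few lines of Python. This is a minimal illustration rather than Illumina's implementation: the function name `beta_value`, the parameter name `offset` and the example intensities are assumptions made here; only the formula and the default constant of 100 come from the text.

```python
def beta_value(methylated: float, unmethylated: float, offset: float = 100.0) -> float:
    """Estimate the methylation level of one CpG site from its probe intensities.

    `offset` is the constant added to the denominator (100 by default, as in the text).
    """
    if methylated < 0 or unmethylated < 0:
        raise ValueError("probe intensities must be non-negative")
    return methylated / (methylated + unmethylated + offset)


# Hypothetical intensities, chosen only for illustration.
strong_signal = beta_value(4500.0, 400.0)  # 4500 / 5000
weak_signal = beta_value(30.0, 20.0)       # 30 / 150: the offset dominates
print(strong_signal, weak_signal)
```

With weak intensities the offset pulls the estimate towards 0, which is why a site with a 30:20 signal is not reported as 60% methylated.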

Computational background

Machine learning in life sciences

The exponential expansion of biological data has broadened the use of machine learning algorithms and methods, aiming at finding needles in a haystack, i.e. extracting meaningful information and hidden patterns from large and complex datasets.

Machine learning is defined as constructing computer programs that automatically improve with experience [78]. Machine learning algorithms aim at learning from the existing data, or training set, and at optimizing a performance criterion. The optimized criterion is assessed by accuracy in modeling problems and by the value of a fitness function in optimization problems. In a modeling problem, the algorithm aims at inducing a model out of the training dataset and deriving conclusions. The aim in an optimization problem is to find a near-optimal solution out of all possible solutions, which is a crucial task in a modeling problem [79].

Machine learning techniques have been applied successfully to different domains of genomics, such as gene prediction [80], detecting regulatory elements [81], biological sequence analysis [82] and identifying functional RNAs [83], and to epigenetics domains such as epigenomic-based prediction of gene expression [84], prediction of allele-specific DNAm [85] and DNAm-based age prediction [86]. They have also been used in systems biology for modeling biological networks [87] and in evolution for constructing phylogenetic trees [88].

Machine learning methods are usually divided into two major groups: supervised and unsupervised methods. In both scenarios, the method investigates a group of samples that have been characterized by a vector of features (attributes) in order to learn from them and infer knowledge.

Supervised learning

In supervised learning, the samples are labeled with the group or class each sample belongs to. Hence the system is supervised by knowing the labels when it tries to find the patterns that are common between the members of each group.

In classification, the main application of supervised learning, the algorithm aims at classifying the samples based on the values of their features. The learning process is done on the training set, producing a classifier, and a subset of the data is used as the test set to evaluate the predictive power of the model. There are various methods and algorithms for classification with different advantages and disadvantages. We have used rule-based classification (Paper I), which is explained in detail in the following section. The case analyzed in [89] is used here to explain the method with examples.

Decision trees

A decision tree is an acyclic and directed graph containing nodes connected by edges that represents an algorithm for classification. The node at the top of the tree is called the root node, and a node with no children is called a leaf node. At each non-leaf node, a condition is checked and, based on the result, the path is directed either to a leaf node, if there are no more conditions to check, or to another node to continue the algorithm until it reaches a leaf node. The tree is grown using a training set. The classifier generated using this method may be overfitted, which is the case when the path to a leaf node specifies very few samples. To overcome this problem, these nodes are pruned. Decision trees are easy to implement and interpret, and are also computationally cheap.

Rough sets

The classification tasks in this thesis have been done using rule-based modeling based on rough sets theory [90]. The foundation of rough sets is a tabular data structure called a decision system. In a decision system, all the samples (objects) are listed on the rows with the values of the features (attributes) on the columns, where the last column holds the label of each sample, known as the decision class. Table 1 shows an example of a decision system taken from [89], where the features are the methylation status of the targeted CpG sites and the decision is the age class of the sample.

Table 1. A sample decision table, with samples on the rows and discretized methylation level of each CpG site on the columns.

Sample ID/CpG ID | CpG1 | CpG2 | … | CpGN | Class
Sample 1 | Methylated | Intermediate | … | Unmethylated | youngerThan0
Sample 2 | Methylated | Methylated | … | Intermediate | olderThan0
… | … | … | … | … | …
Sample N | Unmethylated | Methylated | … | Intermediate | youngerThan0

The rough set system aims at finding the minimal set of attributes that can discern between the samples of each decision class. This minimal set is known as a reduct [91]. A reduct can be extended to an approximate reduct, where the objects are discernible up to a certain threshold. For computing reducts, discernibility functions are generated. A discernibility function is a Boolean function, i.e. it returns either true or false. Its input variables are Boolean, showing whether an attribute is included or not, and each minimal solution of the function represents a reduct. More details of the method can be found in [92].
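To make the notion of a reduct concrete, the following Python sketch enumerates reducts of a tiny decision system by brute force. It is an illustration only and not the algorithm used by any rough-set software: the function names, the toy table and its values are assumptions, the table is assumed to be consistent (no two identical samples with different classes), and exhaustive search is only feasible for a handful of attributes.

```python
from itertools import combinations


def discerns(rows, attrs):
    """True if every pair of rows from different classes differs on some attribute in attrs."""
    for (xa, ca), (xb, cb) in combinations(rows, 2):
        if ca != cb and all(xa[a] == xb[a] for a in attrs):
            return False
    return True


def reducts(rows, attributes):
    """Brute-force search for minimal attribute subsets that keep the classes discernible."""
    found = []
    for size in range(1, len(attributes) + 1):
        for subset in combinations(attributes, size):
            if any(set(r) <= set(subset) for r in found):
                continue  # a smaller reduct is already contained in this subset
            if discerns(rows, subset):
                found.append(subset)
    return found


# Toy decision system (hypothetical values): M = methylated,
# I = intermediate, U = unmethylated.
rows = [
    ({"CpG1": "M", "CpG2": "I", "CpG3": "U"}, "younger"),
    ({"CpG1": "M", "CpG2": "M", "CpG3": "I"}, "older"),
    ({"CpG1": "U", "CpG2": "M", "CpG3": "I"}, "younger"),
    ({"CpG1": "U", "CpG2": "I", "CpG3": "U"}, "younger"),
]
toy_reducts = reducts(rows, ["CpG1", "CpG2", "CpG3"])
print(toy_reducts)
```

In this toy table no single CpG site separates the two classes, but CpG1 combined with either CpG2 or CpG3 does, so both pairs are reducts.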

Rule-based modeling using ROSETTA

ROSETTA [93,94] is a software system that implements rough-sets methods and calculates the reducts using approximation algorithms. Here the classifier consists of a set of IF-THEN rules, which are a representation of the computed reducts. Rules may have one or multiple conditions in the IF part, and the decision is shown in the THEN part of the rule. The rules with one condition are called singleton rules and the ones with more than one condition are called conjunctive rules. Each rule can be evaluated by two factors: support, which is the number of samples in the training set for which both the IF part and the THEN part match, and accuracy, which is defined as the support divided by the number of samples for which the IF part matches. An example of a conjunctive rule:

IF cg26158194=methylated AND cg12078929=intermediate THEN ‘olderThan50’ Accuracy=0.95, support=21

This rule can be read as: if the conditions in the IF part are satisfied for a sample, the model predicts it to be older than 50 years. The reported accuracy means that 95% of the samples that have these conditions are actually ‘older than 50 years old’, i.e. 20 samples out of 21 that were predicted by the model to have that age were classified correctly. The performance of the classifier is usually summarized in a table called a confusion matrix, which shows the number of samples in each class and the number of samples classified correctly or incorrectly classified as other classes (Table 2 shows an example).
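The two rule statistics can be computed directly from their definitions. The sketch below is a hedged illustration, not ROSETTA code: the function `evaluate_rule`, the sample layout and the five toy samples are assumptions made here, and only the definitions of support and accuracy follow the text.

```python
def evaluate_rule(samples, conditions, decision):
    """Return (support, accuracy) of an IF-THEN rule on labeled samples.

    support  = number of samples matching both the IF part and the THEN part
    accuracy = support / number of samples matching the IF part
    """
    matches_if = [
        s for s in samples
        if all(s["features"].get(site) == level for site, level in conditions.items())
    ]
    support = sum(1 for s in matches_if if s["label"] == decision)
    accuracy = support / len(matches_if) if matches_if else 0.0
    return support, accuracy


# Hypothetical training samples using the CpG identifiers of the example rule.
samples = [
    {"features": {"cg26158194": "methylated", "cg12078929": "intermediate"}, "label": "olderThan50"},
    {"features": {"cg26158194": "methylated", "cg12078929": "intermediate"}, "label": "olderThan50"},
    {"features": {"cg26158194": "methylated", "cg12078929": "intermediate"}, "label": "youngerThan50"},
    {"features": {"cg26158194": "unmethylated", "cg12078929": "intermediate"}, "label": "olderThan50"},
    {"features": {"cg26158194": "methylated", "cg12078929": "methylated"}, "label": "youngerThan50"},
]
rule_if = {"cg26158194": "methylated", "cg12078929": "intermediate"}
support, accuracy = evaluate_rule(samples, rule_if, "olderThan50")
print(support, round(accuracy, 3))
```

Three of the toy samples satisfy the IF part and two of them carry the predicted label, so the rule has a support of 2 and an accuracy of 2/3.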

Table 2. A sample confusion matrix. The total number of samples in each class is shown in parentheses. The columns specify the number of samples classified by the model as a specific class. The last column shows the accuracy of the model for predicting each separate class and the model’s general accuracy.

Observed class | Classified as Class A | Classified as Class B | Classified as Class C | Accuracy per class
Class A (25) | 20 | 4 | 1 | 80 %
Class B (40) | 0 | 40 | 0 | 100 %
Class C (30) | 5 | 5 | 20 | 66.6 %
Model accuracy | | | | 82.2 %

Feature selection

In many biological problems, a large number of features have been quantified for a limited number of samples. For instance, thousands to millions of CpG sites or millions of SNPs are analyzed for a hundred or a few hundred samples. In these cases, many features might be redundant or non-informative, i.e. adding no predictive power to the model, and they would exponentially increase the computational time. Feature selection methods are applied to save computational time and to improve the model quality by including only informative features. Feature selection can employ univariate methods, which evaluate each feature separately, or multivariate methods, which take feature dependencies into account.

Monte Carlo Feature Selection

The feature selection method used in this thesis is named Monte Carlo Feature Selection (MCFS) [95]. MCFS is a multivariate technique that identifies features that contribute significantly to classification and ranks the features according to this criterion. The Monte Carlo method, the foundation of MCFS, is based on sample randomization techniques. Figure 5 summarizes the MCFS procedure. Out of the set of all d features in the original dataset, s subsets are selected randomly, each containing all samples but only m features (where m << d). Each subset is then split t times into a training set and a test set. For each of these pairs, a decision tree classifier is trained on the training set and tested using the test set. The classification ability of each tree is then assessed by calculating its weighted accuracy:

\[
wAcc = \frac{1}{c} \sum_{i=1}^{c} \frac{n_{ii}}{n_{i1} + n_{i2} + \dots + n_{ic}}
\]

In this formula, $c$ represents the number of classes and $n_{ij}$ the number of samples from class $i$ classified as class $j$.
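As a worked example of the weighted accuracy, the Python sketch below applies the formula to the counts of Table 2. The function names are assumptions introduced here; the plain overall accuracy is included only for comparison.

```python
def weighted_accuracy(confusion):
    """Mean of the per-class accuracies n_ii / (n_i1 + ... + n_ic).

    confusion[i][j] is the number of samples of class i classified as class j.
    Assumes every class contains at least one sample.
    """
    c = len(confusion)
    return sum(row[i] / sum(row) for i, row in enumerate(confusion)) / c


def overall_accuracy(confusion):
    """Fraction of all samples on the diagonal, shown for comparison."""
    correct = sum(row[i] for i, row in enumerate(confusion))
    return correct / sum(sum(row) for row in confusion)


# Counts taken from Table 2 (rows: observed class A, B, C).
table2 = [[20, 4, 1], [0, 40, 0], [5, 5, 20]]
print(f"wAcc    = {weighted_accuracy(table2):.3f}")
print(f"overall = {overall_accuracy(table2):.3f}")
```

For these counts the weighted accuracy is (0.80 + 1.00 + 0.667) / 3, about 0.822, which agrees with the model accuracy reported in Table 2, whereas the unweighted fraction of correctly classified samples is 80/95, about 0.842. Averaging per class prevents the largest class from dominating the score.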

Figure 5. Summary of the MCFS procedure. Reprinted with permission from Oxford University Press.

Using the weighted accuracy the Relative Importance (RI) is calculated for each feature:

\[
RI_{g_k} = \sum_{\tau=1}^{s \cdot t} (wAcc)^{u} \sum_{n_{g_k}(\tau)} IG\bigl(n_{g_k}(\tau)\bigr) \left( \frac{\text{no.in } n_{g_k}(\tau)}{\text{no.in } \tau} \right)^{v}
\]

In this formula, $s \cdot t$ trees are generated; for each tree ($\tau$) the weighted accuracy ($wAcc$) is calculated as explained before, $c$ shows the number of decision classes, $n_{g_k}(\tau)$ is a node of tree $\tau$ that splits on feature $g_k$, $IG(n_{g_k}(\tau))$ represents the information gain for node $n_{g_k}(\tau)$, no.in $n_{g_k}(\tau)$ shows the number of samples in node $n_{g_k}(\tau)$, no.in $\tau$ is the number of samples in the root node of the $\tau$-th tree, and $u$ and $v$ are optional weighting parameters, usually set to 1 as the default value.
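The accumulation in the RI formula can be sketched as follows. This is not the MCFS software: the representation of a tree as a dictionary with its weighted accuracy, root size and a list of (feature, information gain, samples in node) tuples is an assumption made for illustration, as are all numbers in the example.

```python
from collections import defaultdict


def relative_importance(trees, u=1.0, v=1.0):
    """Sum wAcc^u * IG(node) * (samples in node / samples in root)^v per feature."""
    ri = defaultdict(float)
    for tree in trees:
        weight = tree["wacc"] ** u
        for feature, info_gain, n_samples in tree["nodes"]:
            ri[feature] += weight * info_gain * (n_samples / tree["root_samples"]) ** v
    return dict(ri)


# Two hypothetical trees; each node is (split feature, information gain, samples in node).
trees = [
    {"wacc": 0.8, "root_samples": 100, "nodes": [("cg1", 0.5, 100), ("cg2", 0.2, 50)]},
    {"wacc": 0.5, "root_samples": 100, "nodes": [("cg1", 0.4, 100)]},
]
ri = relative_importance(trees)
for feature, score in sorted(ri.items(), key=lambda item: -item[1]):
    print(feature, round(score, 3))
```

A feature scores highly when it is used for splits with a large information gain, close to the root, in trees that classify well; here cg1 outranks cg2 on all three counts.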

Unsupervised learning

In unsupervised learning the goal is to segment the samples into groups or clusters based on their similarity, evaluated from the values of their features. The main difference from supervised learning is that the labels of the samples are not known, i.e. we do not know how many classes there are, or the labels are known but we want to check for possible subgroups within the samples of the same group. There are different approaches in unsupervised learning, such as clustering and neural networks.

Clustering is the most widely used method in unsupervised learning, aiming at finding hidden patterns or groupings in the data. The similarity of different samples is evaluated by a defined distance function, which could be the Euclidean distance or another, probabilistic distance. Multiple algorithms can be used for clustering, such as hierarchical clustering, k-means clustering, self-organizing maps, and hidden Markov models.

K-means clustering

First, K random objects (K being the number of clusters) are chosen as the initial centroids of the clusters. Then, the other objects are assigned to the nearest centroid. This process continues by finding the centroids of the newly formed clusters and reassigning the objects according to the new centroids, until no object’s cluster changes. Since the initial selection of centroids may be random, re-running the clustering algorithm may give a different outcome.
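The k-means iteration described above can be written as a short, dependency-free Python sketch. It assumes squared Euclidean distance, keeps the old centroid if a cluster becomes empty, and uses a `seed` parameter to make the random initialization repeatable; these choices and the toy points are assumptions of this illustration.

```python
import random


def kmeans(points, k, seed=0, max_iter=100):
    """Cluster tuples of numbers; returns (cluster index per point, centroids)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # K random objects as initial centroids
    assignment = None
    for _ in range(max_iter):
        new_assignment = [
            min(range(k), key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no object changed its cluster
        assignment = new_assignment
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return assignment, centroids


# Hypothetical beta values of six samples at two CpG sites.
points = [(0.1, 0.1), (0.15, 0.2), (0.2, 0.1), (0.8, 0.9), (0.85, 0.8), (0.9, 0.9)]
labels, centers = kmeans(points, k=2)
print(labels)
```

The cluster indices themselves are arbitrary: a different seed may swap the labels of the two groups, and on less clearly separated data it may also change the grouping.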

Hierarchical clustering

There are two approaches in hierarchical clustering. In the first approach, known as bottom-up or agglomerative, each object is initially considered a separate cluster and the two most similar clusters are then merged into a new cluster. This iterates until the number of clusters is reduced to the targeted number of clusters. In the second approach, known as top-down or divisive, all the objects initially form one cluster and in each iteration one cluster is divided into two until the desired number of clusters is reached.
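A minimal sketch of the bottom-up approach is given below. The text does not prescribe a distance or a merge criterion, so Euclidean distance and average linkage are assumed here; the function name and the toy points are likewise illustrative.

```python
def agglomerative(points, n_clusters):
    """Merge the two closest clusters until n_clusters remain; returns lists of point indices."""
    clusters = [[i] for i in range(len(points))]  # every object starts as its own cluster

    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(points[a], points[b])) ** 0.5

    def linkage(c1, c2):
        # Average linkage: mean pairwise distance between the members of two clusters.
        return sum(dist(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda pair: linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is unaffected
    return clusters


# Hypothetical beta values of six samples at two CpG sites.
points = [(0.1, 0.1), (0.15, 0.2), (0.2, 0.1), (0.8, 0.9), (0.85, 0.8), (0.9, 0.9)]
clusters = agglomerative(points, n_clusters=2)
print(sorted(sorted(c) for c in clusters))
```

Unlike k-means, this procedure has no random initialization, so repeated runs on the same data give the same grouping.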

Self-Organizing Map

Artificial neural networks are computational models inspired by the way neurons were hypothesized to work in the brain. The neurons, or nodes, are highly interconnected processing elements that work simultaneously, are organized in layers, and are configured via a learning process in order to solve a specific problem.

A Self-Organizing Map (SOM) is a type of artificial neural network that is based on competitive-and-cooperative learning (instead of error-correction learning), where only one neuron is the winning neuron for any given input, updating all other neurons through a topological distance function. As a result, the neurons organize themselves according to the input patterns during the learning process to optimally represent the input space given the connections between the neurons, often organized as a hypertorus. One main aim of a SOM is to reduce the dimension of the input patterns into a lower-dimensional discrete structure known as a map, where the SOM uses a neighborhood function to keep the topological properties of the transformed space. Sample input data build the map in the training mode, and in the mapping mode new input data are classified.

Hidden Markov Model (HMM)

A stochastic or random process is a collection of random variables. A stochastic process is called a Markov process when the future state of the process depends only on the present state, and not on any of the events preceding it. A hidden Markov model (HMM) refers to a model where the system is a Markov process with hidden or unseen states. By iteratively optimizing the overall fit through dynamic programming under a defined cost function, followed by updating the parameters, hidden Markov models can be used to model spatial or temporal processes.

Hypothesis-free pattern recognition using Saguaro

Saguaro [96] is an unsupervised pattern recognition method combining an HMM and multiple SOMs. Input is presented as a stream of vectors containing either discrete or continuous variables, where missing data are permissible if indicated as such. The HMM segments the data into regions such that the sequence of SOMs, which are the states in the HMM, yields the overall optimal statistical representation of the data. SOMs are split based on their topology by identifying the least well modeled regions, which are recognizable as neurons that are farthest from the “center of gravity”. As a result, both the SOMs, represented by “cacti”, and the segmentation of the data are given as output.

Visualization

The complement to the data mining algorithms and methods that discover hidden patterns in noisy multidimensional biological data is the ability to interpret the discovered models. Visualization helps significantly in this phase by enabling researchers to see structures or patterns in the data being analyzed and to make sense of the statistical models [97]. Visualization can take many forms, ranging from a simple histogram showing, for example, the distribution of the data, to complex graphical platforms containing a multitude of methods for presenting data analysis results. Figure 6 shows how visualization helps in detecting the cutoffs of 0.2 and 0.8 for discretizing the beta values.

[Figure 6 plot: histogram titled “all sites (excluding XY) − brain”; x-axis: Beta Values (0.0 to 1.0); y-axis: Frequency (0e+00 to 4e+06).]

Figure 6. Histogram of the beta values for all the CpG sites on 450K array (excluding sites on chromosomes X and Y).

In rule-based modeling, visualization can show how a rule works in separating the samples of the targeted class (Figure 7), or can be used to detect and highlight rule interactions (see [98]).

[Figure 7 plot: samples plotted for the rule “IF cg18486150=intermediate AND cg24691453=methylated THEN Age28plus”; x-axis: Methylation (cg18486150_KIF17); y-axis: Methylation (cg24691453_S100A4); legend: Age 0−4, Age 5−27, Age 28plus.]

Figure 7. The way a conjunctive rule generated by ROSETTA separates the samples belonging to the targeted class is visualized. Adapted from Torabi Moghadam et al. 2016, BMC Bioinformatics.

In clustering problems, visualization can also be used to translate the distance matrix into a human-readable presentation (Figure 8). Visualization also allows for imputing the status or value of samples with missing values or unknown status, based on the distribution and pattern of the other samples. For example, in Figure 9 the mutation status of the IDH gene for the samples with a missing state (NA) can be inferred as wild type (WT) according to the DNAm pattern in the gene CFLAR across the samples.

[Figure 8 plot: samples plotted by cluster and histological label; legend: Cluster1 and Cluster2, each split into glioblastoma, astrocytoma, oligodendroglioma, oligoastrocytoma and NA.]

Figure 8. Visualization of the methylation patterns using an unsupervised learning method (methylSaguaro), where the samples are marked with their histological labels, showing that histology does not classify the samples accurately.

Figure 9. The IDH mutation status of the samples with an NA (not available) status, can be inferred to be wild type (WT) according to the methylation levels of other samples.

Aims

The overall aim of this thesis was to investigate, detect and understand DNA methylation patterns as a key player in the epigenetic regulation of gene expression. The specific goals of this thesis were:

1. Investigating the patterns of DNA methylation in the human brain and the way these patterns change over the lifespan using supervised learning methods.
2. Developing a visualization-based analytical tool that enhances the investigation of DNA methylation and gene expression data over gene pathways.
3. Improving cancer subtyping and classification based on DNA methylation patterns using unsupervised learning methods.
4. Identifying differentiating patterns of DNA methylation between schizophrenic and control samples using the supervised learning, unsupervised learning and visualization methods that were introduced and developed in the first three stages.

Methods and Results

This thesis is based on the results of four papers, which are summarized in this section together with the methods applied for achieving them. Paper I describes a machine learning method for identifying combinatorial patterns of DNAm over the lifespan in the human brain. Paper II presents a novel interactive visualization tool for browsing DNAm and gene expression data over gene pathways. Paper III introduces a machine learning method for detecting cancer subtypes based on DNAm data. Finally, Paper IV investigates differentiating patterns of DNAm in a cohort of schizophrenic and control samples using the methods and tools introduced in the first three papers.

Paper I

In Paper I, we introduced a method to explore the combinations of CpG sites and associated genes that contribute to changing patterns in the methylation levels of the healthy human brain with age. Our rule-based classifier reports combinations of CpG sites together with their methylation status in the form of easy-to-read IF-THEN rules, which allows for assaying the genes associated with the underlying sites.

Background

DNA methylation is a major player in multiple biological phenomena such as development, chromosome X inactivation, cell differentiation and regulation of gene expression. The dominant form in mammals is the methylation of 5’ cytosine in the context of CpG dinucleotides. DNA methylation patterns are tissue specific and several studies have reported that these patterns change over the life span. Detecting these changes over the lifetime of an individual can be used for different applications, such as the identification of genes, their interactions and the processes they are involved in, as well as for forensic genetics and disease prevention. Considering the complexity of the mechanisms controlled by DNA methylation, we aimed at finding combinations of CpG sites that could characterize different stages of aging.

Data

We used DNA methylation data from 108 healthy samples published by Numata et al. [99], with ages ranging from the fetal period up to 84 years. The methylation data were profiled using Illumina’s Infinium HumanMethylation27 BeadChip, which targets 27,578 sites.

Methods

We first filtered out the sites meeting any of the following conditions: (a) located on chromosome X, (b) located on a polymorphic probe, (c) located on a non-uniquely mapped probe and (d) having a standard deviation of beta values over all samples smaller than 0.02. We then discretized the beta values into three methylation levels: (a) unmethylated, if the beta value is smaller than or equal to 0.2, (b) intermediate, if the beta value is larger than 0.2 and smaller than 0.8, and (c) methylated, if it is larger than or equal to 0.8. We next constructed a binary decision table for each age from 0 to 60 years, labeling samples as younger than that age, or as older than or equal to it. Figure 10 summarizes the method schematically. The significant features returned by MCFS [95] were selected from each age-specific decision table. Since the number of samples in each class differed greatly, we performed a 100-fold undersampling to avoid the effect of overrepresented samples of one group. A rule-based classifier was generated using ROSETTA [93] for each undersampled decision table and the results were merged into one classifier for each age. In order to study patterns of changes in DNAm in more general age intervals, we computed the Jaccard distance between all age decision tables based on the overlap between the Significant Features sets for Age i (SFAi) and Age j (SFAj) of the individual two-class decision tables:

Distance(SFAi, SFAj) = 1 − |SFAi ∩ SFAj| / |SFAi ∪ SFAj|
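The discretization thresholds and the distance above can be sketched in Python as follows. The function names and the example CpG identifiers are hypothetical; the cutoffs of 0.2 and 0.8 and the Jaccard distance are those stated in the text. Returning a distance of 0 for two empty sets is a convention chosen here.

```python
def discretize(beta, low=0.2, high=0.8):
    """Map a beta value to one of the three methylation levels."""
    if beta <= low:
        return "unmethylated"
    if beta >= high:
        return "methylated"
    return "intermediate"


def jaccard_distance(features_i, features_j):
    """1 - |intersection| / |union| of two sets of significant features."""
    a, b = set(features_i), set(features_j)
    union = a | b
    if not union:
        return 0.0  # convention for two empty sets
    return 1.0 - len(a & b) / len(union)


# Hypothetical significant-feature sets for two ages.
sfa_10 = {"cg01", "cg02", "cg03"}
sfa_20 = {"cg02", "cg03", "cg04", "cg05"}
print([discretize(b) for b in (0.05, 0.2, 0.5, 0.8, 0.97)])
print(round(jaccard_distance(sfa_10, sfa_20), 2))
```

The two example sets share 2 of 5 distinct sites, giving a distance of 0.6; identical sets give 0 and disjoint sets give 1, so ages with similar significant features end up close together in the subsequent clustering.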

Results

The filtering steps resulted in 11,000 sites to be included in the analysis. Out of these sites, a union of 283 was selected by MCFS [95] for the different age classes, where the number of significant sites dropped as the age increased, indicating that there were fewer sites to discern between the samples at older ages. This confirmed previous results that DNAm changes are most pronounced at early ages, especially in the fetal period, and decrease as the age increases.

Figure 10. A schematic review of the steps done for feature selection and classification. Adapted from Torabi Moghadam et al. 2016, BMC Bioinformatics.

We then annotated the sites showing up in each age class using ANNOVAR [100] and checked the genes that were unique for classifying a specific age and the overlapping genes. The hierarchical clustering of the sets of significant features of each age class divided the samples into four age intervals: ‘fetus’, ‘age0-4’, ‘age5-27’ and ‘age28plus’. These groupings were employed to construct a new four-class decision table, and running ROSETTA [93] generated a classifier with a mean accuracy of 90%, where the ‘fetus’ and ‘age0-4’ classes had 100% accuracy and ‘age5-27’ was the most difficult one to classify. Our results showed that the patterns of methylation changes in a healthy individual’s brain were highly complex and interdependent. The CpG sites present in the significant rules were annotated to genes involved in brain formation and general development, as well as genes linked to cancer and Alzheimer’s disease.

Paper II

In Paper II, we developed an open-source interactive tool for visualising and analysing DNA methylation and gene expression data in the context of gene pathways. Public datasets of DNAm and gene expression of two different types of brain cancer, namely glioblastoma multiforme and lower grade glioma, were investigated with PiiL as a use case to show the functionalities, and previously reported patterns were confirmed.

Background

Gene networks or pathways depict a group of regulatory interdependent genes involved in a certain biological phenomenon. In addition, DNAm is a major player in the regulation of the transcription machinery. Since there are millions of CpG sites that can be altered, the relation between DNAm and the expression of its host gene or surrounding genes is complex. Hence, a tool that can investigate the patterns of DNAm and gene expression at the same time in the context of gene pathways, enhanced by visualization utilities, is invaluable. We developed PiiL (Pathway interactive visualization tooL), an interactive regulation browser that facilitates specific hypothesis testing by quickly assessing large data sets.

Data

As a use case, we used publicly available DNAm and gene expression data generated by The Cancer Genome Atlas project [101] from 65 patients with glioblastoma multiforme (GBM), published by [102], and 100 patients with lower grade glioma (LGG), published by [103].

Methods

PiiL has been designed with a focus on providing a user-friendly interface. It is platform independent, developed in Java and available under the GPL license from: https://github.com/behroozt/PiiL.git

PiiL loads and visualizes pathways and gene networks directly from the KEGG database [104]. The user then loads the DNAm and/or gene expression data, and the genes that have a match in the opened pathway are color coded according to the methylation or expression level for a specific sample. PiiL facilitates exploring the data by navigating through the samples in a static way as stacked boxes or by showing them consecutively to validate or discover patterns in the data. Additional information for the samples can also be loaded to enable the user to use the group-wise view, where any characteristic of the samples can be used to group them. In this view mode, the average methylation/expression of the members of each group is shown for each gene in the pathway.

To make sense of the genes in a pathway, the user can directly check their functional annotation in GeneCards (http://www.genecards.org), NCBI PubMed (http://www.ncbi.nlm.nih.gov/pubmed) or Ensembl (http://www.ensembl.org), displayed in a web browser. A summary of the data for each gene is also available as barplots and histograms with a single click.

As there are usually multiple CpG sites associated with a single gene, averaging the beta values of all sites to determine the methylation level of a gene may weaken the differentiating signal. To address this problem, PiiL provides a function to include or exclude a subset of CpG sites in the analysis. The sites can be filtered based on the standard deviation or the range of the beta values, on the genomic annotation (for example, including only intronic sites), or by selecting a set of informative sites that have been computed by other methods such as MCFS [95], methylSaguaro (Paper III) or other techniques.

Also worth mentioning is the multiple-tab structure of the platform, which enables the user to investigate multiple pathways and various datasets at the same time. The pathways and color-coded data can be exported to vector-based images.
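The effect of averaging over all CpG sites of a gene, and of filtering the sites first, can be illustrated with a small Python sketch. This is not PiiL code (PiiL is written in Java): the function `gene_methylation`, its parameters, the use of the population standard deviation and the toy beta values are all assumptions made here for illustration.

```python
from statistics import mean, pstdev


def gene_methylation(site_betas, min_sd=0.0, keep=None):
    """Average beta values per sample over the selected CpG sites of one gene.

    site_betas: dict mapping a CpG site ID to its beta values (one per sample).
    min_sd:     drop sites whose standard deviation across samples is below this.
    keep:       optional set of site IDs to restrict the analysis to.
    """
    selected = [
        betas for site, betas in site_betas.items()
        if pstdev(betas) >= min_sd and (keep is None or site in keep)
    ]
    if not selected:
        return None  # no site passed the filters
    return [mean(per_sample) for per_sample in zip(*selected)]


# Hypothetical gene with one informative site and one constant site (four samples).
sites = {
    "cgA": [0.1, 0.9, 0.1, 0.9],
    "cgB": [0.5, 0.5, 0.5, 0.5],
}
print(gene_methylation(sites))              # all sites: contrast is diluted
print(gene_methylation(sites, min_sd=0.1))  # only the variable site is kept
```

Averaging both sites shrinks the difference between the samples from 0.8 to 0.4, while filtering out the constant site restores the full contrast, which is the situation the site-selection function is meant to address.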

Results

We analyzed publicly available DNAm and gene expression data for two types of brain cancer, lower grade glioma and glioblastoma multiforme ([103] and [102]), to show the functionalities and features of PiiL. Using PiiL’s “group wise” view we could easily and quickly investigate the differentiating patterns between samples grouped by different clinical features, such as IDH mutation status, MGMT promoter status and TERT promoter status. We confirmed the findings of Noushmehr et al. [105] that IDH mutation status, rather than histology, should be used to classify samples. The visualization technique helped us to impute the IDH mutation status of the samples with a missing or unknown status.

Applying different filters in PiiL, we could investigate the effect of including or excluding a specific subset of CpG sites that hit a specific gene. For example, Figure 11 shows the effect of selecting only exonic sites versus upstream (promoter region) sites.

Figure 11. Part of the “TNF signaling pathway” where only the sites located upstream of the genes are selected (top) versus where only exonic sites are selected (bottom). Blue represents hypomethylation and red shows hypermethylation.

Furthermore, we loaded a list of informative sites, computed by methylSaguaro (Paper III), as the sub-selection of CpG sites. This enabled us to highlight the key regulator genes and the subset of the CpG sites that show a high correlation with the expression of their host gene (Figure 12).

[Figure 12 plot: pathway panels titled “Apoptosis” showing Methylation and Expression for WT and Mutant samples, and a plot of CASP8 Expression (log2) against Methylation with rho = −0.81, p-value < 2.2e−16.]

Figure 12. A subset of CpG sites found by methylSaguaro was selected using PiiL from all the CpG sites hitting the genes in this part of the TNF signaling pathway. This highlighted CASP8, with a high negative correlation between methylation and expression, among the genes leading to apoptosis.

Paper III

In Paper III, we introduced a novel machine learning method for identifying subtypes of cancers based on DNAm patterns.

Background

Since DNAm is involved in transcriptional regulation, its aberrant patterns have been observed in various diseases, among them various types of cancer. Thus, detecting differentially methylated regions can provide biomarkers for cancer subtyping and help to identify the genes and gene pathways involved in pathogenesis and outcome. However, this comparison requires labeled samples, and in complex and heterogeneous diseases like cancer the labels are not known a priori. Unsupervised learning methods help in these scenarios, where all the samples have the same label and a supervised learning method is therefore not applicable.

Data

We took the publicly available DNAm datasets from the TCGA project [101], which were profiled using the 450K array, for different cancer types as listed:

a) Brain tumors, namely glioblastoma multiforme (GBM) and lower grade glioma (LGG), published by [102] and [103] respectively.
b) Three types of kidney tumors, namely kidney renal clear cell (KIRC), kidney renal papillary cell (KIRP) and kidney chromophobe (KICH), published by [106], [107] and [108] respectively.
c) Acute Myeloid Leukemia (AML), published by [109].
d) Colon cancer, published by [110].

These datasets were published by The Cancer Genome Atlas (TCGA) project. In addition, we included DNAm data profiled using the same array for diagnostic and follow-up samples from patients with Chronic Lymphocytic Leukemia (CLL), published by [111].

Methods

methylSaguaro, developed in this paper, uses DNAm data and identifies a set of CpG sites with predictive power for cancer subtypes. This method builds on Saguaro [96], a software that was first designed to detect selective forces in whole genome population analyses by combining hidden Markov models and neural networks. The base concept in this method is a cactus, which is a symmetric n*n matrix (n being the number of samples) of pairwise comparisons, where each element shows the distance between two samples relative to all other samples. Here, the input is all CpG sites from the 450K methylation array sorted by their genomic position. The aim of each cactus is to represent the segments of the genome, each covering at least 3 consecutive CpG sites, that match the pattern of that cactus. The HMM is responsible for the segmentation, and the SOM is then utilized for pattern recognition and unsupervised classification to decide the shape of the cactus. The SOM here aims at generating a set of cacti that show differential methylation patterns, summarizing the local patterns in at least one cactus. Since it is an unsupervised approach, no a priori assumption is required and the system learns the classification from the data. This enabled us to detect the possible effects of a genetic alteration, such as a mutation, on DNAm patterns without knowing the alteration beforehand.

Results In the brain tumor data, our method confirmed the association between mutation in the IDH gene and DNAm levels. Previous research has suggested classifying these tumors by this association rather than by their histological categories [105]. We also observed the reported [112] subtypes of CLL based on the mutation status of the IGHV gene. Analyzing the kidney tumor data, we found that KICH samples differ markedly from KIRC and KIRP samples in their DNAm patterns. Among the differentially methylated genes are CLDN8, an immunohistochemical marker for the differential diagnosis of CRCC and renal oncocytoma; the homeobox genes HOXB5, HOXC4, and HOXD8; and genes linked to tumorigenesis or tumor suppression, including CASZ1, ITGA6, ZMIZ1, and EBF3. In AML, we found complex DNAm patterns that coincide with the current risk assessment based on cytogenetics, but possibly provide improved resolution.

Paper IV In Paper IV, we investigated DNA methylation data for biomarkers of schizophrenia using the tools and methods that we introduced and tested on the same type of data in Papers I, II and III.

Background Schizophrenia is a common mental disorder that has a strong tendency to become chronic. This complex disease affects about 0.5-1% of the population and shortens life expectancy by ten to twelve years because of the associated high suicide rate and physical problems. While it has been shown to have a strong genetic component with complex, non-Mendelian inheritance [113], the reported associations between psychotic disorders, including schizophrenia, and environmental factors such as prenatal infection [114], maltreatment and trauma during childhood [115] and use of cannabis [116] suggest that epigenetic factors may be involved in the development of schizophrenia.

Data Frontal cortex post-mortem brain tissue samples collected from international brain banks were profiled using the 450K array for 73 patients diagnosed with schizophrenia and 52 healthy controls.

Methods In order to investigate differentially methylated patterns between schizophrenic and healthy samples, we applied the three methods that we introduced and developed for analyzing DNAm data and mining differential patterns, namely:

a) Supervised learning (classification) by combining feature selection using Monte Carlo Feature Selection with rule-based modeling using ROSETTA

b) Unsupervised clustering of the samples using methylSaguaro

c) Pathway-based analysis of the data using PiiL

In the classification analysis, in addition to the beta value discretization described in Paper I, where beta values were discretized as unmethylated (beta values smaller than or equal to 0.2), intermediate (beta values between 0.2 and 0.8) and methylated (beta values greater than or equal to 0.8), we also divided the intermediate class into two equal bins, intermediate_unmethylated and intermediate_methylated, and trained the model on this finer discretization.
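The two discretization schemes can be sketched in Python as follows. This is a minimal illustration, not the code used in the papers: the function name `discretize_beta` and the `four_level` flag are hypothetical, and the sketch assumes that "two equal bins" means splitting the intermediate range at its midpoint, 0.5, with 0.5 itself assigned to the upper bin.

```python
def discretize_beta(beta: float, four_level: bool = False) -> str:
    """Map a beta value in [0, 1] to a methylation class label.

    With four_level=False this follows the three-class scheme described
    for Paper I; with four_level=True the intermediate class is split in
    two (split point 0.5 is an assumption of this sketch).
    """
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta values must lie in [0, 1]")
    if beta <= 0.2:
        return "unmethylated"
    if beta >= 0.8:
        return "methylated"
    if not four_level:
        return "intermediate"
    # Assumed midpoint split of the open interval (0.2, 0.8).
    if beta < 0.5:
        return "intermediate_unmethylated"
    return "intermediate_methylated"
```

Applying the function to every CpG site of every sample yields the categorical decision table that a rule-based classifier can consume; the finer scheme only changes the labels of values strictly between 0.2 and 0.8.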

Results Although DNAm was shown to be an adequate biomarker for tracking age-related changes over the lifespan and identifying cancer subtypes using our methods, it did not turn out to be a marker reflecting differences between schizophrenic and healthy samples. MCFS did not return any CpG site as significant with any of the beta value discretization methods. Furthermore, methylSaguaro generated one cactus with two clusters that do not match the schizophrenic state. Projecting the data onto KEGG’s “dopaminergic synapse” pathway also confirmed that the difference between the beta values of the samples was very small, and in the cases where there was some difference, it did not reflect the schizophrenic state. Since the methods we applied in our previous papers worked successfully on the same type of data, the reason we obtained no results here could be that DNAm is not a suitable biomarker for studying schizophrenia, at least with the configuration of this study. The configuration includes the resolution of the methylation array, the region of the brain that was profiled and, more generally, the use of methylation arrays that cannot tell 5mC apart from 5hmC, if that distinction matters in schizophrenia.

Concluding remarks and future prospects

DNA methylation, as an important player in the epigenetic regulation of the transcription machinery, has come into the spotlight of epigenetics research. In this thesis, we have introduced and developed computational methods for mining DNAm data and discovering differential patterns in different biological phenotypes. The story started with analyzing DNAm data in order to investigate the patterns during the lifespan and detect biomarkers of ageing. The extreme changes between the prenatal period and the period after birth were reflected in highly accurate singleton rules (rules with one condition) for the early ages, but predicting age appeared to be less accurate for elderly samples. We believe that doing this analysis with a higher resolution array, i.e. one targeting more CpG sites, would help identify the patterns that tell older samples apart, since the differences are subtler. The combinatorial analysis of the first dataset generated conjunctive rules consisting of sets of CpG sites that change together in order to classify samples. This motivated us to check whether the annotated genes (of which the sites are part) are involved in any common biological processes and how the sites affect the expression of their host genes. To address this question, we developed open-source software, PiiL, which facilitates hypothesis testing by quickly assessing large datasets of DNAm and gene expression. This tool provides a handful of analytical functions through a user-friendly interface that helps biologists confirm and discover patterns by visually translating data into easy-to-detect signals. Among the functions is a unique feature that includes or excludes a subset of the CpG sites related to a gene based on a criterion, which enhances pattern recognition and allows checking the correlation with the expression of the host gene for each selected subset. PiiL is updated frequently and aims to support all widely used technologies for measuring DNAm and gene expression.
In Paper III, we showed how differentially methylated patterns can be used as biomarkers for detecting cancer subtypes. We developed a state-of-the-art method for detecting tumor subtypes using DNAm data without any a priori assumption. Using this method we confirmed some previous findings and, based on that, proposed new subgroups in various cancer types. We hope that this helps towards a better understanding of cancer mechanisms and provides better diagnosis for patients, leading to proper treatment.

While DNAm appeared to be a good biomarker of ageing and cancer, we did not find such a clear signal when we applied our methods to schizophrenic samples. This may be because of the region of the brain that was profiled or because of technological limitations such as the lack of discernibility between 5mC and 5hmC. In conclusion, with the work accomplished in this thesis, we have contributed to discovering DNAm patterns as biomarkers of ageing and cancer subtyping by developing novel bioinformatics methods and tools. We note that most of the datasets analyzed had been generated and published previously by other groups, yet our algorithms were able to find hidden treasures that conventional approaches had missed. This illustrates the importance of continuously developing novel algorithms to overcome the limitations and shortcomings of existing methods, and shows that great improvements are possible in many areas. While we have targeted DNAm in this project, some of our methods can be tailored and applied to mine other types of biological phenomena, where we expect similar gains over conventional analyses in future projects.

Acknowledgements

The first lesson I learnt during the years of mountaineering was that no achievement was possible without the help and contribution of many people and factors, from the guy driving us to the close by village to the one walking in front of us, controlling our steps and breaths to reach the peak; and that I would praise and appreciate all of them since the synergy of the group made me reach the goal. Similarly, the work presented in this thesis would not have been done without the help, knowledge, guidance, support, positive energy and love of invaluable people around me, or far from me but with their influence and love accompanying me in my mind and my heart. Especially I would like to thank:

My main supervisor, Professor Jan Komorowski, for having me in the group and introducing me to the world of machine learning. I am grateful for your guidance and trust to make me grow as an independent researcher. Beside that I learned about various topics brought up in our lunchtime discussions. I am not into sailing like you but if you sail in the Persian Gulf one day please let me know.

My co-supervisor Associate Professor Manfred Grabherr. I could not wish for a better one who is not only a brilliant scientist, but also a trustable friend and a brother. I could not be at the stage that I am without your kind and endless support, help and motivation. Thanks a lot for believing in me and listening to my naggings patiently! I hope we have more time for chess and Ping-Pong later since those are the fields I can challenge you ☺

Professor Georgy Bakalkin for financially supporting my PhD studies.

Professor Claes Wadelius, for all nice discussions in our group meetings and journal club. I will remember the chocolate box on the last meeting of each year and I should eat more fish per week to stay healthy and wise like you.
All the present and former members of Komorowski’s group who made a collaborative and friendly environment: Zeeshan and Husein, we were “the three musketeers”, spending most of this journey together. I enjoyed all our discussions from language, to poetry (which mostly involved Zeeshan), politics, immigration and everything, and thank you for accompanying me to “lunch out” events even when I could not win in a democratic voting. Nickolas, thanks for all gym motivation gestures. I will miss our conversations in sign language before saying goodbye in the afternoons. Klev and Sara: thanks for comments, suggestions and discussions on various topics. Mateusz, thanks for tolerating me during the final stressful months in the office. You are such a nice and calm officemate and if you continue with this speed you will be the first Polish bioinformatician who speaks Persian fluently ☺. Karolina, thank you for introducing me to the ‘candy monster’, but I strongly suggest that you improve your musical pattern recognition algorithms ☺ Fredrick, for kind help and suggestions for post-doc positions. Susanne, who graduated two years ago and was always helpful from the first period that I joined the group for my research training until later days.

Neda Zamani, for proofreading the thesis and for tuning my stress level whenever it was below or over the expected threshold ☺

Ola Spjuth, for the great help and support in the final year of my master studies to get my first paper.

Eva and Lars for the great collaboration for producing the fourth paper.

Linda Holmfeldt and Görel for their invaluable help for paper 3.

Jiangning, for the precious help with visualization part of paper 3. I am really thankful for your quick responses and being helpful even during weekends and while you were traveling.

Professor Ann-Christine Syvänen, for having me in her group for my master thesis project that led me to know many intelligent and nice people in her group. Eva and Mårten, who helped me a lot in developing my bioinformatics skills during my master’s thesis period. Christofer, for all your help with R. Juliana my former office mate at the group, thanks for the suggestions about thesis production. And others whom I got to know there: Per, Anders, Tomas, Yanara, Jonas, Johanna, Helena, Jessica, Magnus, Camilia, Karin, Elin, Mathias and others.

My friends Nima, Behdad and Aida, for their encouragement and help especially during the thesis production period.

During these years I was very lucky to meet and get to know a lot of people at BMC and EBC. It is not possible to name all of them but: Franziska, Maria, Mitra, Mandana, Magda, Mahsa, Stefan, Adnan, Zaza, Mohammad, Martin, Gang, Marco, Mayank, Christian, Alejandro and Masoud for all friendly and scientific chats and discussions and happy moments.

All people working at the administration of the department of Cell and Molecular Biology for helping me with visa-related, contract and similar issues.

My former and present friends at our futsal club, for the energy pump every Friday. I hope they start a league in Sweden and we win the first championship.

My friends at Kavian Mountaineering Club, who are so many that I cannot name and have been always in my heart and a source of inspiration and motivation. I am thankful for all the good days I have spent with you and use the lessons I learnt there in my life.

All my friends out of BMC with whom I had fun and joyful time: Pooya, Hooman, Afrooz, Alireza, Saba, Jonas and Neda.

I could not carry on this journey without endless love and support of my lovely family members. My parents, I have always tried my best to make you happy and words come short to say thanks to you for all your love and support and teaching me patience, honesty and hard work. My lovely sisters, Marjan, Mojdeh and Behnaz, you are my role models. I never felt alone facing any difficulties in life with your encouragement and belief. It is a great pity that you will not be present at my thesis defense. My nephews and nieces, for bringing smile, hope and happiness to our lives. My parents-in-law, thanks a lot for supporting me like your son. My sister-in-law Shirin, for all sweet cakes ☺ And last but not least, I am so blessed to have Sahar by my side. Your endless love has enlightened my life and every goal seems reachable with your encouraging and supportive spirit. You have been my source of energy for staying strong and going on through difficult days.

References

1. Bernstein BE, Stamatoyannopoulos JA, Costello JF, Ren B, Milosavljevic A, et al. (2010) The NIH Roadmap Epigenomics Mapping Consortium. Nature Biotechnology 28: 1045-1048.
2. Esteller M (2006) The necessity of a human epigenome project. Carcinogenesis 27: 1121-1125.
3. Kanherkar RR, Bhatia-Dey N, Csoka AB (2014) Epigenetics across the human lifespan. Frontiers in Cell and Developmental Biology 2.
4. Jaenisch R, Bird A (2003) Epigenetic regulation of gene expression: how the genome integrates intrinsic and environmental signals. Nature Genetics 33: 245-254.
5. Feinberg AP (2007) Phenotypic plasticity and the epigenetics of human disease. Nature 447: 433-440.
6. Jones PA, Baylin SB (2007) The Epigenomics of Cancer. Cell 128: 683-692.
7. Adams D, Altucci L, Antonarakis SE, Ballesteros J, Beck S, et al. (2012) BLUEPRINT to decode the epigenetic signature written in blood. Nature Biotechnology 30: 224-226.
8. Naylor S (2003) Biomarkers: current perspectives and future prospects. Expert Review of Molecular Diagnostics 3: 525-529.
9. Mikeska T, Craig JM (2014) DNA Methylation Biomarkers: Cancer and Beyond. Genes 5: 821-864.
10. Newell-Price J, Clark AJL, King P (2000) DNA Methylation and Silencing of Gene Expression. Trends in Endocrinology & Metabolism 11: 142-148.
11. Weber M, Schübeler D (2007) Genomic patterns of DNA methylation: targets and function of an epigenetic mark. Current Opinion in Cell Biology 19: 273-280.
12. Lister R, Pelizzola M, Dowen RH, Hawkins RD, Hon G, et al. (2009) Human DNA methylomes at base resolution show widespread epigenomic differences. Nature 462: 315-322.
13. Pfeifer GP, Kadam S, Jin S-G (2013) 5-hydroxymethylcytosine and its potential roles in development and cancer. Epigenetics & Chromatin 6: 10.
14. Holliday R, Pugh JE (1975) DNA modification mechanisms and gene activity during development. Science 187: 226-232.
15. Riggs AD (1975) X inactivation, differentiation, and DNA methylation. Cytogenetics and Cell Genetics 14: 9-25.
16. Hackett JA, Surani MA (2013) DNA methylation dynamics during the mammalian life cycle. Philosophical Transactions of the Royal Society of London Series B, Biological Sciences 368: 20110328.
17. Zhang G, Pradhan S (2014) Mammalian epigenetic mechanisms. IUBMB Life 66: 240-256.
18. Smith ZD, Meissner A (2013) DNA methylation: roles in mammalian development. Nature Reviews Genetics 14: 204-220.
19. Moarefi AH, Chédin F (2011) ICF Syndrome Mutations Cause a Broad Spectrum of Biochemical Defects in DNMT3B-Mediated De Novo DNA Methylation. Journal of Molecular Biology 409: 758-772.
20. Jones PA (2012) Functions of DNA methylation: islands, start sites, gene bodies and beyond. Nature Reviews Genetics 13: 484-492.
21. Illingworth R, Kerr A, DeSousa D, Jørgensen H, Ellis P, et al. (2008) A Novel CpG Island Set Identifies Tissue-Specific Methylation at Developmental Gene Loci. PLoS Biology 6: e22.
22. Kulis M, Esteller M (2010) DNA Methylation and Cancer. Epigenetics and Cancer, Pt A 70: 27-56.
23. Robertson KD (2005) DNA methylation and human disease. Nature Reviews Genetics 6: 597-610.
24. Okano M, Bell DW, Haber DA, Li E (1999) DNA Methyltransferases Dnmt3a and Dnmt3b Are Essential for De Novo Methylation and Mammalian Development. Cell 99: 247-257.
25. Feil R, Berger F (2007) Convergent evolution of genomic imprinting in plants and mammals. Trends in Genetics 23: 192-199.
26. Maupetit-Méhouas S, Montibus B, Nury D, Tayama C, Wassef M, et al. (2016) Imprinting control regions (ICRs) are marked by mono-allelic bivalent chromatin when transcriptionally inactive. Nucleic Acids Research 44: 621-635.
27. DeBaun MR, Niemitz EL, Feinberg AP (2003) Association of in vitro fertilization with Beckwith-Wiedemann syndrome and epigenetic alterations of LIT1 and H19. American Journal of Human Genetics 72: 156-160.
28. Goldstone AP (2004) Prader-Willi syndrome: advances in genetics, pathophysiology and treatment. Trends in Endocrinology and Metabolism 15: 12-20.
29. Temple IK, Shield JPH (2002) Transient neonatal diabetes, a disorder of imprinting. Journal of Medical Genetics 39: 872-875.
30. Nicholls RD, Knepper JL (2001) Genome organization, function and imprinting in Prader-Willi and Angelman syndromes. Annual Review of Genomics and Human Genetics 2: 153-175.
31. Lalande M (2001) Imprints of disease at GNAS1. Journal of Clinical Investigation 107: 793-794.
32. Oberle I, Rousseau F, Heitz D, Kretz C, Devys D, et al. (1991) Instability of a 550-base pair DNA segment and abnormal methylation in fragile X syndrome. Science 252: 1097-1102.
33. Ladd-Acosta C, Hansen KD, Briem E, Fallin MD, Kaufmann WE, et al. (2014) Common DNA methylation alterations in multiple brain regions in autism. Molecular Psychiatry 19: 862-871.
34. De Jager PL, Srivastava G, Lunnon K, Burgess J, Schalkwyk LC, et al. (2014) Alzheimer's disease: early alterations in brain DNA methylation at ANK1, BIN1, RHBDF2 and other loci. Nature Neuroscience 17: 1156-1163.
35. Liu Y, Aryee MJ, Padyukov L, Fallin MD, Hesselberg E, et al. (2013) Epigenome-wide association data implicate DNA methylation as an intermediary of genetic risk in rheumatoid arthritis. Nature Biotechnology 31: 142-147.
36. Shen H, Laird PW (2013) Interplay between the Cancer Genome and Epigenome. Cell 153: 38-55.
37. Hanahan D, Weinberg RA (2011) Hallmarks of Cancer: The Next Generation. Cell 144: 646-674.
38. Choi JD, Lee J-S (2013) Interplay between Epigenetics and Genetics in Cancer. Genomics & Informatics 11: 164-173.
39. Diala ES, Hoffman RM (1982) Hypomethylation of HeLa-Cell DNA and the Absence of 5-Methylcytosine in SV40 and Adenovirus (Type-2) DNA - Analysis by HPLC. Biochemical and Biophysical Research Communications 107: 19-26.
40. Jones PA, Baylin SB (2002) The fundamental role of epigenetic events in cancer. Nature Reviews Genetics 3: 415-428.
41. Baylin SB, Jones PA (2011) A decade of exploring the cancer epigenome - biological and translational implications. Nature Reviews Cancer 11: 726-734.
42. De Carvalho DD, Sharma S, You JS, Su S-F, Taberlay PC, et al. (2012) DNA Methylation Screening Identifies Driver Epigenetic Events of Cancer Cell Survival. Cancer Cell 21: 655-667.
43. Costa FF, Paixão VA, Cavalher FP, Ribeiro KB, Cunha IW, et al. (2006) SATR-1 hypomethylation is a common and early event in breast cancer. Cancer Genetics and Cytogenetics 165: 135-143.
44. Widschwendter M, Jiang G, Woods C, Müller HM, Fiegl H, et al. (2004) DNA hypomethylation and ovarian cancer biology. Cancer Research 64: 4472-4480.
45. Badal V, Chuang LSH, Tan EH-H, Badal S, Villa LL, et al. (2003) CpG Methylation of Human Papillomavirus Type 16 DNA in Cervical Cancer Cell Lines and in Clinical Specimens: Genomic Hypomethylation Correlates with Carcinogenic Progression. Journal of Virology 77: 6227-6234.
46. Sharma G, Mirza S, Parshad R, Srivastava A, Datta Gupta S, et al. (2010) CpG hypomethylation of MDR1 gene in tumor and serum of invasive ductal breast carcinoma patients. Clinical Biochemistry 43: 373-379.
47. Ostler KR, Davis EM, Payne SL, Gosalia BB, Expósito-Céspedes J, et al. (2007) Cancer cells express aberrant DNMT3B transcripts encoding truncated proteins. Oncogene 26: 5553-5563.
48. Greger V, Passarge E, Hopping W, Messmer E, Horsthemke B (1989) Epigenetic Changes May Contribute to the Formation and Spontaneous Regression of Retinoblastoma. Human Genetics 83: 155-158.
49. Esteller M (2008) Epigenetics in Cancer. New England Journal of Medicine 358: 1148-1159.
50. Weller M, Stupp R, Reifenberger G, Brandes AA, van den Bent MJ, et al. (2010) MGMT promoter methylation in malignant gliomas: ready for personalized medicine? Nature Reviews Neurology 6: 39-51.
51. Fleisher AS, Esteller M, Tamura G, Rashid A, Stine OC, et al. (2001) Hypermethylation of the hMLH1 gene promoter is associated with microsatellite instability in early human gastric neoplasia. Oncogene 20: 329-335.
52. Catteau A, Morris JR (2002) BRCA1 methylation: a significant role in tumour development? Seminars in Cancer Biology 12: 359-371.
53. Cooper DN, Youssoufian H (1988) The CpG dinucleotide and human genetic disease. Human Genetics 78: 151-155.
54. Hodgkinson A, Eyre-Walker A (2011) Variation in the mutation rate across mammalian genomes. Nature Reviews Genetics 12: 756-766.
55. Huang L, Wu R-L, Xu AM (2015) Epithelial-mesenchymal transition in gastric cancer. American Journal of Translational Research 7: 2141-2158.
56. Kim JS, Han J, Shim YM, Park J, Kim D-H (2005) Aberrant methylation of H-cadherin (CDH13) promoter is associated with tumor progression in primary nonsmall cell lung carcinoma. Cancer 104: 1825-1833.
57. Michie AM, McCaig AM, Nakagawa R, Vukovic M (2010) Death-associated protein kinase (DAPK) and signal transduction: regulation in cancer. FEBS Journal 277: 74-80.
58. López-Otín C, Blasco MA, Partridge L, Serrano M, Kroemer G (2013) The Hallmarks of Aging. Cell 153: 1194-1217.
59. Bollati V, Schwartz J, Wright R, Litonjua A, Tarantini L, et al. (2009) Decline in genomic DNA methylation through aging in a cohort of elderly subjects. Mechanisms of Ageing and Development 130: 234-239.
60. Egger G, Liang G, Aparicio A, Jones PA (2004) Epigenetics in human disease and prospects for epigenetic therapy. Nature 429: 457-463.
61. Teschendorff AE, West J, Beck S (2013) Age-associated epigenetic drift: implications, and a case of epigenetic thrift? Human Molecular Genetics 22: R7-R15.
62. Grönniger E, Weber B, Heil O, Peters N, Stäb F, et al. (2010) Aging and chronic sun exposure cause distinct epigenetic changes in human skin. PLoS Genetics 6: e1000971.
63. Ong M-L, Holbrook JD (2014) Novel region discovery method for Infinium 450K DNA methylation data reveals changes associated with aging in muscle and neuronal pathways. Aging Cell 13: 142-155.
64. Weidner CI, Lin Q, Koch CM, Eisele L, Beier F, et al. (2014) Aging of blood can be tracked by DNA methylation changes at just three CpG sites. Genome Biology 15: R24.
65. Lister R, Mukamel EA, Nery JR, Urich M, Puddifoot CA, et al. (2013) Global Epigenomic Reconfiguration During Mammalian Brain Development. Science 341: 1237905.
66. Jones MJ, Goodman SJ, Kobor MS (2015) DNA methylation and healthy human aging. Aging Cell 14: 924-932.
67. Day K, Waite LL, Thalacker-Mercer A, West A, Bamman MM, et al. (2013) Differential DNA methylation with age displays both common and dynamic features across human tissues that are influenced by CpG landscape. Genome Biology 14: R102.
68. Feil R, Fraga MF (2012) Epigenetics and the environment: emerging patterns and implications. Nature Reviews Genetics 13: 97-109.
69. Monick MM, Beach SRH, Plume J, Sears R, Gerrard M, et al. (2012) Coordinated changes in AHRR methylation in lymphoblasts and pulmonary macrophages from smokers. American Journal of Medical Genetics Part B: Neuropsychiatric Genetics 159B: 141-151.
70. Joubert BR, Håberg SE, Nilsen RM, Wang X, Vollset SE, et al. (2012) 450K epigenome-wide scan identifies differential DNA methylation in newborns related to maternal smoking during pregnancy. Environmental Health Perspectives 120: 1425-1431.
71. Essex MJ, Boyce WT, Hertzman C, Lam LL, Armstrong JM, et al. (2013) Epigenetic vestiges of early developmental adversity: childhood stress exposure and DNA methylation in adolescence. Child Development 84: 58-75.
72. Horvath S (2013) DNA methylation age of human tissues and cell types. Genome Biology 14: R115.
73. Gentilini D, Mari D, Castaldi D, Remondini D, Ogliari G, et al. (2013) Role of epigenetics in human aging and longevity: genome-wide DNA methylation profile in centenarians and centenarians' offspring. Age 35: 1961-1973.
74. Marioni RE, Shah S, McRae AF, Chen BH, Colicino E, et al. (2015) DNA methylation age of blood predicts all-cause mortality in later life. Genome Biology 16: 25.
75. Kanehisa M, Goto S, Hattori M, Aoki-Kinoshita KF, Itoh M, et al. (2006) From genomics to chemical genomics: new developments in KEGG. Nucleic Acids Research 34: D354-D357.
76. Bock C (2012) Analysing and interpreting DNA methylation data. Nature Reviews Genetics 13: 705-719.
77. Du P, Zhang X, Huang C-C, Jafari N, Kibbe WA, et al. (2010) Comparison of Beta-value and M-value methods for quantifying methylation levels by microarray analysis. BMC Bioinformatics 11: 587.
78. Mitchell T (1997) Machine learning. New York: McGraw-Hill.
79. Larrañaga P, Calvo B, Santana R, Bielza C, Galdiano J, et al. Machine learning in bioinformatics. Briefings in Bioinformatics: 86-112.
80. Mathé C, Sagot M-F, Schiex T, Rouzé P (2002) Current methods of gene prediction, their strengths and weaknesses. Nucleic Acids Research 30: 4103-4117.
81. Aerts S, Van Loo P, Moreau Y, De Moor B (2004) A genetic algorithm for the detection of new cis-regulatory modules in sets of coregulated genes. Bioinformatics 20: 1974-1976.
82. Won K-J, Prügel-Bennett A, Krogh A (2004) Training HMM structure with genetic algorithm for biological sequence analysis. Bioinformatics 20: 3613-3619.
83. Carter RJ, Dubchak I, Holbrook SR (2001) A computational approach to identify genes for functional RNAs in genomic sequences. Nucleic Acids Research 29: 3928-3938.
84. Li J, Ching T, Huang S, Garmire LX (2015) Using epigenomics data to predict gene expression in lung cancer. BMC Bioinformatics 16 Suppl 5: S10.
85. He J, Sun M-a, Wang Z, Wang Q, Li Q, et al. (2015) Characterization and machine learning prediction of allele-specific DNA methylation. Genomics 106: 331-339.
86. Vidaki A, Ballard D, Aliferi A, Miller TH, Barron LP, et al. (2017) DNA methylation-based forensic age prediction using artificial neural networks and next generation sequencing. Forensic Science International Genetics.
87. Karsakov A, Bartlett T, Ryblov A, Meyerov I, Ivanchenko M, et al. (2017) Parenclitic Network Analysis of Methylation Data for Cancer Identification. PLoS ONE 12: e0169661.
88. Slabbinck B, Waegeman W, Dawyndt P, De Vos P, De Baets B (2010) From learning taxonomies to phylogenetic learning: integration of 16S rRNA gene data into FAME-based bacterial classification. BMC Bioinformatics 11: 69.
89. Torabi Moghadam B, Dabrowski M, Kaminska B, Grabherr MG, Komorowski J (2016) Combinatorial identification of DNA methylation patterns over age in the human brain. BMC Bioinformatics 17: 393.
90. Pawlak Z (1982) Rough Sets. International Journal of Computer & Information Sciences 11: 341-356.
91. Skowron A, Rauszer C (1992) The discernibility matrices and functions in information systems. Intelligent Decision Support: Springer. pp. 331-362.
92. Komorowski J, Pawlak Z, Polkowski L, Skowron A (1999) Rough sets: A tutorial. Rough fuzzy hybridization: A new trend in decision-making: 3-98.
93. Øhrn A, Komorowski J. Rosetta--a rough set toolkit for analysis of data; 1997. Citeseer.
94. Komorowski J, Øhrn A, Skowron A. Case studies: public domain, multiple mining tasks systems: ROSETTA rough sets; 2002. Oxford University Press, Inc. pp. 554-559.
95. Draminski M, Rada-Iglesias A, Enroth S, Wadelius C, Koronacki J, et al. (2008) Monte Carlo feature selection for supervised classification. Bioinformatics 24: 110-117.
96. Zamani N, Russell P, Lantz H, Hoeppner MP, Meadows JRS, et al. (2013) Unsupervised genome-wide recognition of local relationship patterns. BMC Genomics 14.
97. Weiss TL, Zieselman A, Hill DP, Diamond SG, Shen L, et al. (2015) The role of visualization and 3-D printing in biological data mining. BioData Mining 8: 22.
98. Bornelov S, Marillet S, Komorowski J (2014) Ciruvis: a web-based tool for rule networks and interaction detection using rule-based classifiers. BMC Bioinformatics 15.
99. Numata S, Ye TZ, Hyde TM, Guitart-Navarro X, Tao R, et al. (2012) DNA Methylation Signatures in Development and Aging of the Human Prefrontal Cortex. American Journal of Human Genetics 90: 260-272.
100. Wang K, Li MY, Hakonarson H (2010) ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data. Nucleic Acids Research 38.
101. Weinstein JN, Collisson EA, Mills GB, Shaw KRM, Ozenberger BA, et al. (2013) The Cancer Genome Atlas Pan-Cancer analysis project. Nature Genetics 45: 1113-1120.
102. Brennan CW, Verhaak RGW, McKenna A, Campos B, Noushmehr H, et al. (2013) The Somatic Genomic Landscape of Glioblastoma. Cell 155: 462-477.
103. Brat DJ, Verhaak RGW, Al-Dape KD, Yung WKA, Salama SR, et al. (2015) Comprehensive, Integrative Genomic Analysis of Diffuse Lower-Grade Gliomas. New England Journal of Medicine 372: 2481-2498.
104. Kanehisa M, Goto S (2000) KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Research 28: 27-30.
105. Noushmehr H, Weisenberger DJ, Diefes K, Phillips HS, Pujara K, et al. (2010) Identification of a CpG Island Methylator Phenotype that Defines a Distinct Subgroup of Glioma. Cancer Cell 17: 510-522.
106. Gooskens S, Gadd S, Meerzaman D, Khan J, de Krijger R, et al. (2014) Comprehensive Molecular Characterization of Clear Cell Sarcoma of the Kidney (CCSK): A Children's Oncology Group Target Project. Pediatric Blood & Cancer 61: S129-S130.
107. Linehan WM, Spellman PT, Ricketts CJ, Creighton CJ, Fei SS, et al. (2016) Comprehensive Molecular Characterization of Papillary Renal-Cell Carcinoma. New England Journal of Medicine 374: 135-145.
108. Davis CF, Ricketts CJ, Wang M, Yang LX, Cherniack AD, et al. (2014) The Somatic Genomic Landscape of Chromophobe Renal Cell Carcinoma. Cancer Cell 26: 319-330.
109. Ley TJ, Miller C, Ding L, Raphael BJ, Mungall AJ, et al. (2013) Genomic and Epigenomic Landscapes of Adult De Novo Acute Myeloid Leukemia. New England Journal of Medicine 368: 2059-2074.
110. Muzny DM, Bainbridge MN, Chang K, Dinh HH, Drummond JA, et al. (2012) Comprehensive molecular characterization of human colon and rectal cancer. Nature 487: 330-337.
111. Cahill N, Bergh AC, Kanduri M, Göransson-Kultima H, Mansouri L, et al. (2013) 450K-array analysis of chronic lymphocytic leukemia cells reveals global DNA methylation to be relatively stable over time and similar in resting and proliferative compartments. Leukemia 27: 150-158.
112. Cahill N, Rosenquist R (2013) Uncovering the DNA methylome in chronic lymphocytic leukemia. Epigenetics 8: 138-148.
113. Rujescu D (2012) Schizophrenia genes: on the matter of their convergence. Current Topics in Behavioral Neurosciences 12: 429-440.
114. Brown AS, Derkits EJ (2010) Prenatal Infection and Schizophrenia: A Review of Epidemiologic and Translational Studies. American Journal of Psychiatry 167: 261-280.
115. Arseneault L, Cannon M, Fisher HL, Polanczyk G, Moffitt TE, et al. (2011) Childhood trauma and children's emerging psychotic symptoms: A genetically sensitive longitudinal cohort study. American Journal of Psychiatry 168: 65-72.
116. Moore THM, Zammit S, Lingford-Hughes A, Barnes TRE, Jones PB, et al. (2007) Cannabis use and risk of psychotic or affective mental health outcomes: a systematic review. The Lancet 370: 319-328.

Acta Universitatis Upsaliensis Digital Comprehensive Summaries of Uppsala Dissertations from the Faculty of Science and Technology 1520 Editor: The Dean of the Faculty of Science and Technology

A doctoral dissertation from the Faculty of Science and Technology, Uppsala University, is usually a summary of a number of papers. A few copies of the complete dissertation are kept at major Swedish research libraries, while the summary alone is distributed internationally through the series Digital Comprehensive Summaries of Uppsala Dissertations from the Faculty of Science and Technology. (Prior to January, 2005, the series was published under the title “Comprehensive Summaries of Uppsala Dissertations from the Faculty of Science and Technology”.)

ACTA UNIVERSITATIS UPSALIENSIS Distribution: publications.uu.se UPPSALA urn:nbn:se:uu:diva-320720 2017