Bioinformatics for epigenomics

Pablo Cingolani

Master of Science

Computer science

McGill University, Montreal, Quebec, 2008-08-30


Library and Archives Canada, Published Heritage Branch
395 Wellington Street, Ottawa ON K1A 0N4, Canada

ISBN: 978-0-494-56817-0

NOTICE:

The author has granted a non-exclusive license allowing Library and Archives Canada to reproduce, publish, archive, preserve, conserve, communicate to the public by telecommunication or on the Internet, loan, distribute and sell theses worldwide, for commercial or non-commercial purposes, in microform, paper, electronic and/or any other formats.

The author retains copyright ownership and moral rights in this thesis. Neither the thesis nor substantial extracts from it may be printed or otherwise reproduced without the author's permission.

In compliance with the Canadian Privacy Act some supporting forms may have been removed from this thesis.

While these forms may be included in the document page count, their removal does not represent any loss of content from the thesis.

ACKNOWLEDGEMENTS

I would like to thank Dr. Mathieu Blanchette and Dr. Michael Hallett for supervising my Master's thesis. My most grateful recognition goes to Matt Sudderman for his advice on the analysis and his help in editing this document. Many thanks to the lab members. I must also thank Dr. Moshe Szyf for fruitful discussions and for allowing the use of his data.

ABSTRACT

Epigenetics refers to reversible, heritable changes in gene regulation that occur without a change in DNA sequence. These changes are usually due to methylation of cytosine bases in DNA. In this work we review existing methodologies and propose new ones for use in epigenomics. We developed high-throughput methods to estimate methylation levels, as well as methods to interpret the data biologically based on gene set enrichment. High correlation was obtained between our methylation estimates and experimental data from MeDIP experiments. Our proposed methods for gene set enrichment performed better than well-known methods.

ABRÉGÉ

Epigenetics describes reversible, heritable changes in gene regulation that occur without changes in the DNA sequence. These changes are usually due to the methylation of cytosines in DNA. In this thesis, we review existing bioinformatics methods and propose new methods for problems related to epigenetics. High-throughput methods for estimating methylation levels are developed, as well as methods for the biological interpretation of the data based on the enrichment of sets of genes sharing the same function. High levels of correlation are obtained between our estimates and experimental data from MeDIP experiments. The methods we propose for gene function enrichment analysis perform better than other existing methods.

TABLE OF CONTENTS

ACKNOWLEDGEMENTS
ABSTRACT
ABRÉGÉ
LIST OF TABLES
LIST OF FIGURES

1 Introduction: Epigenetics
  1.1 Background on epigenetics and epigenomics
  1.2 DNA methylation
    1.2.1 Imprinting
    1.2.2 Health
    1.2.3 Methodologies for analyzing DNA methylation
  1.3 Histone changes

2 Modeling methylated DNA immunoprecipitation
  2.1 Introduction
    2.1.1 Introduction to our model
  2.2 The model
  2.3 DNA fragment length
  2.4 Sonication score
    2.4.1 Sonication score auto-correlation
    2.4.2 Methylation is auto-correlated
  2.5 Immunoprecipitation
  2.6 Hybridization
    2.6.1 Melting temperature Tm
    2.6.2 Probe sequence content
  2.7 Relationship between sonication score, melting temperature, base position and M-values
  2.8 Predicting sonication signal
  2.9 Results
  2.10 Conclusions

3 Multiple PCR and primer selection problem
  3.1 Probability of mispriming
  3.2 Probability of primer pair search failure
  3.3 Feasibility
  3.4 Formal problem definition
  3.5 Problem complexity
    3.5.1 Definition of 3SAT
    3.5.2 Mapping 3SAT to primer selection problem
    3.5.3 Sequence construction
  3.6 Non-linear dynamic solution
  3.7 Getting out of local minima: Stochastic approach
  3.8 Multiple PCR
    3.8.1 Lower bound using heuristic approach
    3.8.2 Solving multiple PCR: Simulated annealing
    3.8.3 Multiple PCR: Getting as many amplicons in one test tube
  3.9 Discussion

4 Gene sets analysis
  4.1 Introduction
  4.2 Previous work
  4.3 Mutual information
    4.3.1 Mutual information for gene sets
    4.3.2 Simulations and results
  4.4 Ranked list
    4.4.1 Simulations and results
  4.5 Discussion

5 Conclusions

Appendix A
  A.1 Sonication score: Formulation details
  A.2 Sonication score's autocorrelation
  A.3 Wiener filter model
    A.3.1 Wiener filter with additional parameters

Appendix B
  B.1 Ranked sum with replacement
    B.1.1 Approximation by normal distribution
    B.1.2 Exact calculation
    B.1.3 Fast algorithm
  B.2 Ranked sum without replacement
    B.2.1 Min / Max values
    B.2.2 Exact calculation
    B.2.3 Normal approximation
    B.2.4 Approximation

Appendix C
  C.1 Simulated annealing: Energy difference

Appendix D
  5.2 Symbol reference
  5.3 Definitions

References

LIST OF TABLES

Table
1–1 Methylation detection methods
3–1 How to map each of the eight possible clauses
3–2 Heuristic algorithm results for a particular primer selection problem
4–1 Algorithm for ranking gene sets using p-values
4–2 Greedy algorithm optimizing mutual information
4–3 Methodology for comparing algorithms in our simulations
4–4 Algorithm comparison: Mean recovery rate and standard deviation
4–5 Methodology for selecting GO-terms and genes for our simulation
4–6 Algorithm comparison: Mean recovery rate and standard deviation
4–7 Algorithm comparison with GSEA using MSigDB set C2
4–8 Algorithm comparison with GSEA using MSigDB set C5
5–1 Rank sum approximation

LIST OF FIGURES

Figure
1–1 Methyl group
1–2 5-methyl C nucleotides accidentally deaminated
1–3 How DNA methylation patterns are inherited
1–4 Mammalian X-chromosome inactivation
1–5 Bisulfite modification of a C nucleotide
1–6 Steps in DMH
2–1 Methylated DNA Immunoprecipitation
2–2 Model diagram
2–3 Gel run image using sonicated DNA
2–4 Function $S_\tau(x_{cg} - x_p)$
2–5 Auto-correlation function $R_{ss}(d)$ for $S_\tau$
2–6 Scatter plot $m_{p+\tau}$ vs $m_p$ shows auto-correlation
2–7 Sample M-value auto-correlation function
2–8 Methylation sample auto-correlation using HEP data
2–9 Immunoprecipitation efficiency as a function of methylated CpGs
2–10 Histograms for sonication scores $S(x_p)$
2–11 $T_m$ histogram and effect on probe M-values
2–12 Base position influence on hybridization score
2–13 Relationship of parameters in a fully methylated DNA sample MeDIP experiment
2–14 M-values and M-value estimates
2–15 Correlation between sonication signal $s_p$ and predicted sonication signal $\hat{s}_p$ for different microarrays
2–16 Difference between Wiener filter's coefficients for two different experiments
2–17 Correlation histogram for all "regions"
2–18 Original signal $S_p(x)$ and interpolated approximation
3–1 Number of possible primers needed
3–2 Optimization network
3–3 Matrix W is a sparse block matrix
4–1 GO structure: A directed acyclic graph (DAG) for each ontology
4–2 In an experiment analyzing N genes, n are "interesting"
4–3 Elim algorithm
4–4 Gene sets A, B and B′ and set of interesting genes I
5–1 Scoring for a given probe p
5–2 Binding probability
5–3 Two probes centered at $x_i$ and $x_j$ share a common genomic region
5–4 Probability density function $P_{N,N_T}(R)$
5–5 Normal approximation's RMS error
5–6 $P_{N,N_T}(R)$ for different values of $N$ and $N_T$

CHAPTER 1
Introduction: Epigenetics

Genetics studies how living organisms inherit features from one generation to the next. In 1795, Jean-Baptiste Lamarck proposed the "inheritance of acquired characteristics", the idea that an organism can pass on to its offspring characteristics that it acquired during its lifetime. In 1865, Gregor Mendel proposed "the laws of the inheritance of traits", the idea that organisms inherit traits from their parents in a discrete manner. Nowadays, it is well known that genetic information is carried by a DNA sequence and that this DNA is copied and inherited across generations, as Mendel proposed. Nevertheless, there is evidence that some inheritance does not involve DNA sequence changes, suggesting a Lamarckian-like mechanism.

Epigenetics refers to reversible, heritable changes in gene regulation that occur without a change in DNA sequence [56]. Epigenetic modifications are molecular "flags" on chromosomes that play important roles in regulating function, including activation and repression of some genes. The main epigenetic changes include DNA methylation, histone and chromatin structure modification [44], and RNAi.

In this thesis, we focus mainly on DNA methylation and on methods for obtaining and analyzing methylation profiles using high-throughput approaches. We show how to process this data to estimate methylation levels, how to validate the estimates using low-throughput methodology, and how to properly interpret their biological meaning.

In Chapter 2 we study methods to estimate methylation levels. In Chapter 3 we analyze how to select primers for multiple PCR experiments, in order to speed up validation experiments. In Chapter 4 we propose methods to make a biological interpretation of high-throughput data. Because each chapter relates to a topic with its own literature, the relevant literature is reviewed in each chapter rather than here.

The more detailed structure of this thesis is as follows:

Chapter 1: In this chapter we review the background on epigenomics.

Chapter 2: Methylated DNA Immunoprecipitation (MeDIP) experiments give us data that we can use to detect methylation levels. To use this data efficiently we need to understand the MeDIP process and create models and methods that allow us to detect methylation status. We introduce a model for MeDIP that takes into account several factors of the experiment, including sonication, pull-down efficiency, and probe efficiency.

Chapter 3: One of the bottlenecks for epigenetic profiling in a wet lab is performing several PCR amplifications. As each amplification requires time, it could be useful to speed up this process by doing multiple PCR amplifications, i.e. several amplifications in the same test tube. Unfortunately, selecting primers for bisulfite-treated DNA is not trivial, so we propose methods to solve this problem.

Chapter 4: Following high-throughput experiments we want to interpret the biological meaning of large amounts of data. We describe new methods, based on information theory and rank statistics, for obtaining a meaningful biological interpretation. We also show improved performance compared to other well-known methods.

Appendix A: Mathematical details: Sonication score model.

Appendix B: Mathematical details: Rank sum statistics.

Appendix C: Mathematical details: Primer selection.

Appendix D: Symbol references and definitions.

1.1 Background on epigenetics and epigenomics

Epigenetics refers to changes in gene expression that persist through the organism's life, sometimes spanning multiple generations, without any change in the underlying DNA sequence of the organism. The main epigenetic changes are:

Methylation: the attachment or substitution of a methyl group. A methyl group is a hydrophobic alkyl functional group derived from methane (CH₄). It has the formula −CH₃ (see Figure 1–1) and is very often abbreviated −Me [2]. Methylation can occur in proteins or in DNA bases. In DNA, methylation of a cytosine converts it to 5-methylcytosine [2].

Figure 1–1: Methyl group

Histones: DNA winds around histones to form a compact structure. Histones are modified by various chemical additions, including phosphorylation, acetylation, and methylation. Deacetylation of histone tails in methylated DNA produces a more condensed chromatin structure, which is inaccessible to the transcriptional machinery [55], [41].

RNA interference (RNAi): a mechanism that inhibits the expression of genes. Since the mother contributes a large amount of RNA and protein to the zygote, some RNAi regulation can be inherited.

1.2 DNA methylation

The modified base 5-methylcytosine (5-mC) was first detected in DNA by Hotchkiss [17]. A brief list of well-known facts about methylation follows.

Where is it found? In vertebrate DNA, methylation is restricted to cytosine (C) nucleotides in the sequence CpG, which is base-paired to exactly the same sequence (in opposite orientation) on the other strand of the DNA helix [9]. In other organisms methylation is also found at CpG dinucleotides (e.g. in plants, approximately 80% of cytosines at CpG sites are methylated). CpNpG triplets are also frequently methylated, and non-symmetrical sites such as CpTpA can be methylated as well [17]. Methylation does not occur only at cytosine: in E. coli, methyl groups are added to all A residues in the sequence GATC, but not until some time after the A has been incorporated into a newly synthesized DNA chain.

CpG islands: CpG sequences are very unevenly distributed in the genome. They are present at 10 to 20 times their average density in selected regions, called CpG islands, that are 1000 to 2000 nucleotide pairs long [9] [4]. With some important exceptions, these islands seem to remain unmethylated in all cell types. CpG islands are often near the promoters of the so-called housekeeping genes¹ [9]. The distribution of CpG islands can be explained if we assume that CpG methylation was adopted in vertebrates primarily as a way of maintaining DNA in a transcriptionally inactive state. In vertebrates, new methyl-C to T mutations can be transmitted to the next generation only if they occur in the germ line² [9]. Most of the DNA in vertebrate germ cells is inactive and highly methylated. Over long periods of evolutionary time, the methylated CpG sequences in these inactive regions have presumably been lost through spontaneous mutation events that were not properly repaired [9]. The human genome contains an estimated 20,000 CpG islands. Most of the islands mark the 5' ends of transcription units and thus, presumably, of genes [9]. Although 60% of human genes have CpG islands in the promoter or first exon, more than 80% of all CpG islands have no genes and are unlikely to regulate expression [23].

¹ Genes that code for proteins that are essential for cell viability and are therefore expressed in most cells.
² The cell lineage that gives rise to sperm or eggs.

CpG deamination: The accidental deamination of methylated C nucleotides produces the natural nucleotide T. This T is then paired with a G on the opposite strand, forming a mismatched base pair [9] (see Figure 1–2). A special DNA glycosylase recognizes a mismatched base pair involving T in the sequence T-G and removes the T. Because this DNA repair mechanism is not fully accurate, methylated C nucleotides are common sites for mutations in vertebrate DNA [9]³. During the course of evolution, more than three out of every four CpGs have been lost in this way, leaving vertebrates with a remarkable deficiency of this dinucleotide.

³ Methylated nucleotides account for about one-third of the single-base mutations that have been observed in inherited human diseases [9].

Figure 1–2: 5-methyl C nucleotides accidentally deaminated. Figure from [9]

Gene repression / activation: Methylation can modulate DNA-protein interactions both positively and negatively [7], although it appears that methylation is mostly associated with the inactivity of genes. It is sometimes stated that the role of methylation is to "lock in" an inactive state, as a consequence of the silencing of a gene by some other mechanism [17]. Active promoters in the human genome are frequently hypomethylated or unmethylated [7]. Vertebrate cells contain a family of proteins that bind methylated DNA (e.g. methylated DNA binding domain proteins, MBD). These DNA-binding proteins interact with chromatin remodeling complexes and histone deacetylases that condense chromatin into a compact structure. Transcription proteins have difficulty accessing DNA in this compact structure, which results in gene silencing [1].

Tissue differentiation: There are at least 180 cell types in the human body, and each individual tissue type has a unique methylation pattern [17] [7]. Some tissue-specific genes, which code for proteins needed only in selected types of cells, are associated with CpG islands [9] [7]. DNA methylation plays an important role in regulating tissue- and stage-specific genes during development [48]. The reversible and stable nature of DNA methylation makes it a reasonable mechanism for regulating genes in a tissue-specific or developmentally specific manner; nonetheless, the role of DNA methylation in the control of developmentally regulated mammalian genes (those that are not subject to genomic imprinting) has been controversial [48].

Defense mechanism: Transcriptional silencing in vertebrates is also particularly important for repressing the proliferation of transposable elements. While coding sequences make up only a few percent of a typical vertebrate genome, transposable elements can comprise nearly half of these genomes. Transposable elements can make copies of themselves and insert these copies elsewhere in the genome, potentially disrupting genes or important regulatory sequences. By suppressing the transcription of transposable elements, DNA methylation limits their spread and thereby maintains the integrity of the genome [9]. Most methylated DNA within the genome is located in repetitive parasitic sequence elements such as transposons and endogenous retroviruses. The primary function of DNA methylation might therefore be as a host-defense mechanism that protects against transcriptional activation of these repetitive elements [48]. The methylation profile of the Arabidopsis genome largely reflects the dense methylation of transposons and other repetitive sequences that are clustered in heterochromatin, where 80% of interspersed repeats are methylated, as well as 66% of tandem repeats and 46% of inverted repeats [46]. Some intragenomic parasites, such as viruses and retro-elements, are indeed inactivated by methylation [55].

Inheritance: It was proposed that DNA methylation can be inherited if there exists a maintenance DNA methyltransferase⁴ that recognizes hemimethylated DNA (i.e. DNA that is methylated in one strand and unmethylated in the complementary strand) just after replication and subsequently methylates the new strand [17]. The methyltransferase acts preferentially on those CpG sequences that are base-paired with a CpG sequence that is already methylated. As a result, the pattern of DNA methylation on the parental DNA strand serves as a template for the methylation of the daughter DNA strand, causing this pattern to be inherited directly following DNA replication [9] (see Figure 1–3). Demethylation of DNA can occur when replication happens in the absence of Dnmt1 (passive demethylation). Active demethylation may occur in vivo in early mammalian embryos and in tumor cell lines, but the enzymes have not been identified [55].

⁴ It is thought that Dnmt1 is responsible for copying DNA methylation patterns to the daughter strands during DNA replication.

Figure 1–3: How DNA methylation patterns are inherited. Figure from [9]
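The notions of CpG density and CpG depletion used earlier in this section can be made concrete with a small computation. The following is a minimal sketch, not a method from this thesis: the function names `cpg_stats` and `sliding_cpg_density` are hypothetical, and the observed/expected ratio (CpG count times sequence length, divided by the product of the C and G counts) is a commonly used summary statistic that is assumed here rather than taken from the text.

```python
def cpg_stats(seq):
    """Return (CpG count, CpG density per base, observed/expected CpG ratio).

    The observed/expected ratio is an assumed, commonly used statistic:
    observed CpG count divided by (count of C * count of G / length).
    """
    seq = seq.upper()
    n = len(seq)
    c = seq.count("C")
    g = seq.count("G")
    # "CG" cannot overlap itself, so str.count gives the exact number
    # of CpG dinucleotides on this strand.
    cpg = seq.count("CG")
    density = cpg / n if n else 0.0
    expected = c * g / n if n else 0.0
    ratio = cpg / expected if expected else 0.0
    return cpg, density, ratio


def sliding_cpg_density(seq, window, step):
    """CpG density in successive windows, as (start, density) pairs."""
    return [
        (start, cpg_stats(seq[start:start + window])[1])
        for start in range(0, len(seq) - window + 1, step)
    ]


if __name__ == "__main__":
    print(cpg_stats("ACGTCGCGAT"))
    print(sliding_cpg_density("CGCGATATAT", window=4, step=2))
```

Under these assumptions, a region whose window density is many times the genome-wide average would be a candidate CpG island, while a genome-wide observed/expected ratio well below 1 would reflect the loss of CpGs through deamination described above.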

DNA methylation patterns are dynamic during vertebrate development [9]. Shortly after fertilization there is a genome-wide wave of demethylation, when the vast majority of methyl groups are lost from the DNA. This demethylation may occur either through a specific demethylating enzyme or through suppression of maintenance DNA methyltransferase activity, which results in the passive loss of methyl groups during each round of DNA replication. Later in development, at the time that the embryo implants in the wall of the uterus, new methylation patterns are established by several de novo DNA methyltransferases that modify specific unmethylated CpG dinucleotides [9]. Once the new patterns of methylation are established, they can be propagated through rounds of DNA replication by the maintenance methyltransferases.
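The interplay between maintenance methylation and passive demethylation described above can be illustrated with a toy simulation. This is a minimal sketch under stated assumptions, not a model from this thesis: strands are lists of booleans (one entry per CpG site), there is no de novo methylation, sites are independent, and the hypothetical parameter `efficiency` is the probability that the maintenance methyltransferase copies a methyl mark from the template strand to the new strand.

```python
import random


def maintenance_copy(template, efficiency, rng):
    """Synthesize a daughter strand from a template strand.

    A site on the new strand is methylated only if the template site is
    methylated (no de novo methylation is modeled) and the maintenance
    step succeeds, which happens with probability `efficiency`.
    """
    return [m and (rng.random() < efficiency) for m in template]


def methylated_fraction(pattern, rounds, efficiency, seed=0):
    """Fraction of the original methyl marks present after replication.

    Starts from one duplex whose two strands both carry `pattern`, then
    lets every strand template one new strand in each round.
    """
    rng = random.Random(seed)
    strands = [list(pattern), list(pattern)]
    for _ in range(rounds):
        daughters = [maintenance_copy(s, efficiency, rng) for s in strands]
        strands.extend(daughters)
    marked = sum(pattern)
    if marked == 0:
        return 0.0
    total = sum(sum(s) for s in strands)
    return total / (marked * len(strands))


if __name__ == "__main__":
    pattern = [True, False, True, True]
    for eff in (1.0, 0.5, 0.0):
        print(eff, methylated_fraction(pattern, rounds=3, efficiency=eff))
```

In this toy model, perfect maintenance (`efficiency` of 1) preserves the pattern indefinitely, whereas with no maintenance activity the fraction of strands carrying the original marks halves in every round. More generally, the expected fraction after n rounds is ((1 + e) / 2)^n for efficiency e, which follows from each round adding as many new strands as there are templates.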

It has also been proposed that "developmental clocks", controlled by DNA methylation, may be an important component of the developmental program [17].

De novo methylation: During early mammalian embryogenesis, DNA methylation patterns are largely erased and then re-established shortly after implantation in a wave of de novo methylation [22]. These newly established patterns are then thought to be copied after each round of replication onto the newly synthesized DNA [17]. It is thought that Dnmt3a and Dnmt3b are de novo methyltransferases.

Small RNA: A post-transcriptional gene silencing process, termed RNAi, is initiated by the enzyme Dicer, which cleaves long double-stranded RNA (dsRNA) molecules into short fragments known as small interfering RNA (siRNA) [34]. siRNAs are known to cause RNA-directed DNA methylation (RdDM) [46]. Methylation associated with post-transcriptional gene silencing (PTGS) in plants is acquired in transcribed or coding regions, where its role remains unclear [34]. Involvement of dsRNA in directing DNA methylation would indicate that this molecule acts not only in the cytoplasm, to initiate the RNA degradation step of PTGS, but also at the genome level, to induce epigenetic modifications. dsRNA might thus be a common molecular component of the mechanisms of PTGS and of at least some cases of transcriptional gene silencing (TGS) [34].

Sequence-specific methylation signals consisting of either DNA-DNA or RNA-DNA associations are believed to be involved. DNA pairing might nevertheless provide a signal for de novo methylation in plants considering the recent finding of an Arabidopsis [34].

1.2.1 Imprinting

Genomic imprinting is an epigenetic marking process that causes some genes to be expressed according to their parental origin [44] [48]. The two parental genomes exhibit functional asymmetry due to the presence of imprinted genes that are expressed mono-allelically [48]: within the same cell, one allele of an imprinted gene is expressed while the other allele is silent [48]. Approximately 0.1% of genes in mammals and in flowering plants are repressed on one of the chromosomes, and which one depends on the parental origin of the gene. A brief list of facts about imprinting follows.

Regulation: In mammals, imprinted genes are implicated in the regulation of fetal growth, development and function of the placenta, and postnatal behaviors. In fetal growth regulation, imprinted gene action exhibits a specific directionality, with the majority of paternally expressed genes enhancing fetal growth [55].

Methylation: DNA methylation clearly has an important role in imprinting, both in silencing certain genes (e.g. H19) and in activating others (Igf2, Igf2R). Most imprinted genes have differentially methylated regions (DMRs) [55].

Where: Many imprinted genes are located in clusters, and studies on the mechanisms of imprinting have identified elements that are differentially modified on the two parental chromosomes and can act locally on the adjacent gene or over a longer range, affecting the activity of multiple imprinted genes in the cluster [55]. Approximately 50 imprinted genes have been identified, mapping to 12 locations within the genome. Many, although not all, are located in clusters that contain genes expressed from maternally inherited alleles alongside genes that are expressed from paternally inherited alleles [48]. In general, gene imprinting appears to be conserved between different mammalian species. One notable exception is the Igf2r locus, which does not show consistent imprinting in humans [48].

Many imprinted genes have DMRs outside their promoters that are methylated on the active allele. Some of these regions appear to contain promoters for anti-sense RNAs [55].

Imprinting during early stages: Initial parental-origin-specific imprinting signals are likely to be established during gametogenesis. Old imprints are erased and new parental-specific imprints are established. This is the first phase of so-called nuclear reprogramming, apparently occurring where the two parental genomes are isolated from each other [48]. The mammalian genome undergoes major reprogramming of modification patterns in the germ cells and in the early embryo [55]. After fertilization and before implantation, a second wave of nuclear reprogramming occurs, during which the two parental genomes are once again subject to changes in their epigenetic status [48].

After fertilization, differential methylation of imprinted genes needs to be maintained. This might seem trivial, were it not for the fact that drastic demethylation of the whole genome occurs in mouse preimplantation embryos [55]. The paternal genome is actively demethylated. The unmethylated allele of an imprinted gene has to resist genome-wide de novo methylation by Dnmt3a and Dnmt3b after implantation, and it has been suggested that this resistance may be due to a specialized chromatin structure in the unmethylated allele [55].

Methylation differences due to imprinting are erased by demethylation during the development of germ cells and are thus reprogrammed. It is remarkable that genome-wide demethylation also occurs during preimplantation development. One suggestion is that the paternal demethylation is an "anti-imprinting weapon". Another view is that demethylation may prime the genome for the widespread transcriptional activation that occurs early in mammals [55].

A definitive correlation between methylation and the silencing of developmentally regulated genes is lacking. Both direct and indirect evidence suggest that, although many developmentally regulated genes are not regulated by methylation, others clearly are [48].

X inactivation: Males and females differ in their sex chromosomes. Females have two X chromosomes, whereas males have one X and one Y chromosome. As a result, female cells contain twice as many copies of X-chromosome genes as do male cells. In mammals, the X and Y sex chromosomes differ radically in gene content: the X chromosome is large and contains more than a thousand genes, whereas the Y chromosome is smaller and contains fewer than 100 genes. Mammals have evolved a dosage compensation mechanism to equalize the dosage of X chromosome gene products between males and females. Mutations that interfere with dosage compensation are lethal, demonstrating the necessity of maintaining the correct ratio of X chromosome to autosome (non-sex chromosome) gene products [9]. The initial choice of which X chromosome to inactivate (the maternally inherited one or the paternally inherited one) is random [9]. X-chromosome inactivation also depends on methylation [55]. Epigenetic mechanisms are involved in inactivation of the X chromosome in female mammals and in silencing repetitive "parasitic" elements that make up a significant portion of the mammalian genome [48]. The inactive X is characterized by asynchronous DNA replication and by epigenetic modifications to the DNA and chromatin, including DNA methylation and histone H3/H4 hypoacetylation [48]. X-chromosome inactivation is initiated at, and spreads from, a single site in the middle of the X chromosome, the X-inactivation center (XIC). Portions of the X chromosome that are removed from the XIC and fused to an autosome escape inactivation. In contrast, autosomes that are fused to the XIC of an inactive X chromosome are transcriptionally silenced. The XIC (a DNA sequence of approximately $10^6$ nucleotide pairs) can therefore be considered a large regulatory element that seeds the formation of heterochromatin and facilitates its bi-directional spread along the entire chromosome [9]. XIST RNA, which is expressed solely from the inactive X chromosome, is not translated into protein. The spread of XIST RNA from the XIC over the entire chromosome correlates with the spread of gene silencing, indicating that XIST RNA participates in the formation and spread of heterochromatin [9] (see Figure 1–4).

Figure 1–4: Mammalian X-chromosome inactivation. Figure from [9]

1.2.2 Health

DNA methylation has been linked to some diseases:

Cancer: Global methylation levels in cancer cells have been observed to be lower than the levels in normal cells. This phenomenon could also correlate with the genetic instability seen in cancer due to the activation of endogenous transposable elements and other mobile elements found in the human genome.

13 In addition to hypomethylation, cancer cells exhibit regional hypermethy- lation. At least 60 genes have been shown to be abnormally hypermethy- lated in cancers. Several studies have demonstrated a direct role of abnormal CpG-island promoter methylation in gene silencing in cancer [17]. An esti- mated average of 600 CpG islands5 were aberrantly methylated in the tumors [44]. Aberrant hypomethylation is known to induce activation of oncogenes in cancer[23]. Genome wide analysis of hypomethylated promoter sequences in cancer demonstrated low CG/GC ratio of these sequences, suggesting that CpG-poor genes are sensitive to demethylation activity[23]. Rett syndrome: Rett syndrome is the most common sporadically in- herited form of mental retardation in females [58]. It is caused by mutations in an X-linked gene encoding the MeCP2 protein, that binds to methylated DNA and links DNA methylation to transcriptional repression, causing a se- vere neurological disorder[58]. Mutations in MeCP2 involve loss of imprinting of DLX5 [4]. The missense and truncating mutations in the MeCP2 gene found in Rett patients are scattered across the gene [58]. Several mutations are clustered in two functionally characterized domains of the protein: the “methyl-CpG binding domain” (MBP) and the “transcriptional repression do- main” (TRD) [58]. Mutations in the MBD have shown disturbed binding to methylated CpGs, mostly caused by protein misfolding. Knockout mutations of the mouse Mecp2 reproduce the key aspects of this disease [58]. ICF Syndrome: This is a rare autosomal-recessive disease characterized by immunodeficiency, instability of the centromeric region, and facial anoma- lies [58]. The syndrome is accompanied by hypomethylation of the classical satellite 2 and 3 sequences [58]. The disease is caused by mutations in the DNA

5 There are ∼45,000 CpG islands in total in the human genome.

methyltransferase 3B gene (DNMT3B) [58]. ICF patients usually have reductions or absence of at least two immunoglobulin isotypes, causing defective cell-mediated immunity. In addition, chromosomes 1, 9, and 16 in mitogen-stimulated lymphocytes of ICF patients exhibit pericentromeric condensation anomalies and enhanced chromosomal instability. These alterations coincide with a mild decrease in overall genomic 5-methylcytosine levels and a striking hypomethylation of repetitive sequences in the pericentromeric region of these chromosomes [58].

ATR-X Syndrome: X-linked alpha-thalassemia/mental retardation syndrome. Mutations in the ATRX gene give rise to characteristic developmental abnormalities, including severe mental retardation, facial dysmorphism, urogenital abnormalities, alpha-thalassaemia, and sex-reversal phenotypes [58]. The ATRX gene codes for a protein that contains a plant homeodomain-like domain, present in many chromatin-associated proteins. The ATRX protein acts as a transcriptional regulator by modifying the local chromatin structure [58]. In addition, the protein contains a carboxy-terminal domain, which identifies it as a member of the SNF2 family of helicase/ATPases [58]. Mutations found in ATR-X patients give rise to changes in the pattern of methylation of several highly repeated sequences, including rDNA arrays, a Y-specific satellite, and subtelomeric repeats. The severe reduction in methylation, in conjunction with the appearance of the chromatin-remodeling SNF2 helicase domain, suggests that the ATR-X protein links alterations in DNA methylation to chromatin-remodeling effects [58].

Fragile-X Syndrome: Mutation of the FMR1 gene causes fragile-X mental retardation syndrome [58]. The most common FMR1 mutation is an expansion of a CCG repeat tract at the 5’ end of FMR1, which causes the CCG repeat to become methylated (at the CpG dinucleotides), resulting in FMR1

gene being transcriptionally silenced [58]. The methylated 5’ end of the FMR1 gene is associated with deacetylated forms of histones H3 and H4, whereas non-expanded (unmethylated) repeats are wrapped around acetylated histones.

Imprinting related diseases: Epimutations are abnormal changes in epigenetic modifications, just as mutations are changes in the DNA sequence. A few patients with Beckwith-Wiedemann syndrome or Prader-Willi/Angelman syndrome have epimutations in imprinted genes [55]. Early embryos and their stem cells (ES cells) seem particularly prone to acquiring epimutations when exposed to manipulations or adverse culture conditions, and these are not normally reversible during somatic development [55].

Beckwith-Wiedemann syndrome (BWS) is a rare genetic or epigenetic overgrowth syndrome associated with an elevated risk of embryonic tumor formation. It is caused by mutations in growth-regulating genes on chromosome 11 (specifically 11p15) or by errors in genomic imprinting. Patients typically present with omphalocele, macroglossia (large tongue), and macrosomia (large birth weight). The BWS gene locus is adjacent to the WT1 gene implicated in Wilms’ tumor development, and thus the BWS locus has been named WT2. BWS-affected individuals are at an elevated risk of developing Wilms’ tumors as well as other neoplasias such as hepatoblastomas [2].

Prader-Willi Syndrome: PWS is characterized by hyperphagia⁶ and food preoccupations, as well as small stature and learning difficulties [2]. It is caused by absence of the paternally derived PWS/AS region of chromosome 15 (15q11-13) by one of several genetic mechanisms, including uniparental disomy (when a person receives two copies of a chromosome from one parent

6 Excessive hunger and abnormally large intake of solids by mouth.

and no copies from the other parent), imprinting mutations (i.e. inappropriate “paternal imprinting”), chromosome translocations, and gene deletions [2].

Other: Schizophrenia and even diabetes have been proposed to be linked to epigenetic effects. These proposed links are based on cases where the disease affects only one of two identical twins [6]. Another study stated that grandchildren of Swedish men born between 1890 and 1920 (when there were crop records) had a higher risk of dying from diabetes [6].

Imprinting explains why, in mammals, parthenogenetic (bi-maternal) development from eggs only is not possible [55]. The presence of imprinted genes is the reason why mammals do not support parthenogenetic or androgenetic (bi-paternal) development [48].

Age effects: The evidence for methylation changes during the aging of organisms is somewhat inconsistent. It is highly probable that human cells, as well as those from other long-lived species, are much more resistant to random epigenetic changes in gene expression than cells from short-lived species such as rodents [17]. There seems to be an accumulation of methylation errors with age, probably due to imperfect copying of methylation patterns to the daughter strands after replication by the maintenance methyltransferase [40]. A connection has been proposed between the control of total DNA methylation and the loss or maintenance of telomeres, as suggested by the analysis of immortalized cell lines. It is not known why cells with limited in-vitro lifespan lose methylation at a constant rate, whereas immortalized cells stably maintain a given level of DNA methylation [17].

It is apparent that aging cells, particularly epithelial cells, are characterized by an increased frequency of silencing at multiple loci, such that some cells in older individuals have lost the ability to express a certain percentage of previously active genes [50]. The epigenome, then, displays considerable

mosaicism (the presence of two populations of cells with different genotypes in one individual) in adults, and this mosaicism increases with age and is influenced by lifestyle and environmental exposures. Age-related methylation is typically partial, with quantitative assays revealing ranges of 5%-50%, although most values are in the lower end of this range. Methylation could be sparsely affecting all cells under study, which may have little effect on gene expression, or methylation could be extensive, but limited to a few cells in a given population. Age-related diseases may be affected by promoter methylation in normal tissues (these include disorders such as senile dementia, insulin resistance, associated diabetes, inflammatory bowel disease, Barrett’s esophagus, and chronic hepatitis) [50]. A large body of evidence suggests that many diseases that are strongly modulated by environmental factors, as well as age-related diseases, are associated with, if not caused by, altered DNA methylation patterns in particular tissues [40].

Diet: Evidence has emerged that the essential vitamin folate, vitamin B12, choline, and methionine play a central role in maintaining the stability of DNA by providing and enabling donation of carbon atoms (or methyl groups) for synthesis of the DNA bases and for the maintenance of methylation patterns on DNA [58]. These results suggest that optimizing genome methylation by dietary means or by supplementation should prove to be an effective strategy for preventing diseases caused by genomic instability [58]. Nutritional and environmental influences, such as uptake of toxins, have a significant effect on the genomic methylation pattern. Hence, poor nutrition could cause regulatory disturbances in the genome, leading to metabolic disorders or cardiovascular disease [40].

DNA methylation is likely to be involved in mediating the deleterious effects of increased body fat and high-fat diet on insulin sensitivity of insulin target tissues and on the cardiovascular system.

Diagnostic: Several additional technical advantages make methylation an ideal tool for routine diagnostic assays:
• DNA is a very stable molecule (unlike mRNA and proteins) [40]. DNA methylation techniques have no particular requirements for sample handling, and tissue samples can be analyzed retrospectively.
• Methylation and SNP signals can be measured in the same systems [40].
• Methylation information relevant to tumors can be acquired from either circulating tumor cells or free DNA in the blood of cancer patients [40].
• Repeated PCR-amplification cycles do not disturb the sharpness of the methylation signal [40].
• For most of the important cancer indications, methylation patterns can be used to subclassify cancer in much the same way, and with much the same resolving power, as whole-genome mRNA expression microarrays [40].
• DNA methylation picks up permanent expression changes rather than short-term alterations, so fewer false positives occur due to temporary expression changes [40].

Therapy: Unlike genetic changes, which are largely irrevocable once established, epigenetic changes have the potential to be reversible, which may lead to some form of “epigenetic therapy”. Nuclear transfer technology can result in substantial epigenetic reprogramming [50]. DNA methylation in promoter regions can be targeted by inhibiting DNA methyltransferases [40]. The major concern with nonspecific gene demethylation is that global demethylation may lead to reactivation of silenced proviral sequences or imprinted genes

with tumor growth-promoting effects [40]. Another highly promising target for epigenetic therapy is histone modification [40]. Again, as with unspecific demethylating agents, the question arises of how to target the effects to specific genes, the re-expression of which is needed to restore apoptotic or tumor growth-suppressing pathways or sensitivity to other anticancer agents [40].

An approach that may overcome the problem of unspecific activation of genes by both DNA demethylating agents and histone deacetylase (HDAC) inhibitors is to direct a specific activity⁷ to a particular site in the genome. In this approach, the Cys2-His2 zinc-finger DNA binding domain (the most common DNA binding motif found in nature) is used as a scaffold for constructing factors with the desired specificity for a particular DNA binding site [40].

1.2.3 Methodologies for analyzing DNA methylation

Several methods exist to detect and analyze cytosine methylation (5-mC). There are three main strategies for methylation detection: the digestion of DNA by a methylation-sensitive or -insensitive restriction endonuclease; the chemical modification of DNA by sodium bisulfite or metabisulfite; and immunoprecipitation of 5-methylcytosine to directly separate the unmethylated and methylated fractions of the genome [4] (see Table 1–1). With less modern technologies it was possible to assay over 1,000 preselected CpG sites in hundreds of genes, but labor costs made whole genome sequencing prohibitive [46]. This changed with the development of new technologies such as high-throughput sequencing. Table 1–1 lists some of the methods usually applied for methylation analysis.

7 Acetyl transferase, histone deacetylase, K9 methyl transferase.

Restriction enzymes

Restriction enzymes cut DNA into fragments in or near their specific recognition or restriction sites. Interestingly, many enzymes can recognize the same target sequences but cut DNA differently, due to their sensitivity to methylcytosine⁸ [44]. The resulting DNA fragments can be separated by gel electrophoresis and detected by Southern blotting.
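To illustrate how an isoschizomer pair distinguishes methylated from unmethylated sites, the following is a minimal sketch that applies only the rules stated in footnote 8 for HpaII and MspI. The function names, the representation of methylation as a set of 0-based cytosine positions, and the assumed cut point between the two cytosines of CCGG are illustrative assumptions, not part of the methods described in this thesis.

```python
import re

SITE = "CCGG"  # recognition sequence shared by HpaII and MspI


def cut_positions(seq, methylated, enzyme):
    """Return the indices at which `enzyme` cuts `seq`.

    `methylated` is a set of 0-based positions of methylated cytosines
    (a hypothetical representation chosen for this sketch). Both enzymes
    are assumed to cut between the two C's of the site (C^CGG).
    """
    cuts = []
    # A lookahead is used so that every occurrence of the site is found.
    for match in re.finditer("(?=" + SITE + ")", seq):
        outer_c = match.start()
        inner_c = outer_c + 1
        if enzyme == "HpaII":
            # Blocked when the central CpG cytosine is methylated.
            blocked = inner_c in methylated
        elif enzyme == "MspI":
            # Insensitive to the central cytosine, inhibited by the outer one.
            blocked = outer_c in methylated
        else:
            raise ValueError("unknown enzyme: " + enzyme)
        if not blocked:
            cuts.append(outer_c + 1)
    return cuts


def digest(seq, methylated, enzyme):
    """Return the list of fragments produced by digesting `seq`."""
    bounds = [0] + cut_positions(seq, methylated, enzyme) + [len(seq)]
    return [seq[a:b] for a, b in zip(bounds, bounds[1:])]
```

Comparing `digest(seq, methylated, "HpaII")` with `digest(seq, methylated, "MspI")` on the same sequence shows longer HpaII fragments wherever a central cytosine is methylated, which is the signal these assays rely on.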

The bisulfite method

In single-stranded DNA, cytosines are converted to uracil when reacting with bisulfite (see Figure 1–5), whereas 5-methylcytosines (5-mC) are unreactive. The modified DNA strands, which now have very few C residues, can be amplified by PCR and then sequenced [17].

Figure 1–5: Bisulfite modification of a C nucleotide. Figure from [44]
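The conversion rule described above can be sketched in a few lines. This is an illustrative simulation only: the function names and the encoding of methylation as a set of 0-based positions are assumptions made for the example, and the sketch treats a single strand with complete conversion (uracil is read as T after PCR).

```python
def bisulfite_convert(seq, methylated):
    """Simulate bisulfite treatment followed by PCR on one strand.

    Unmethylated C is converted to U and read as T after amplification;
    methylated C (positions listed in `methylated`) is left unchanged.
    Complete conversion is assumed, which real experiments only approximate.
    """
    return "".join(
        "T" if base == "C" and i not in methylated else base
        for i, base in enumerate(seq)
    )


def call_methylation(reference, converted):
    """Return positions where a reference C is still read as C."""
    return [
        i
        for i, (ref_base, conv_base) in enumerate(zip(reference, converted))
        if ref_base == "C" and conv_base == "C"
    ]
```

Comparing the converted read against the untreated reference recovers the methylated positions, which is the basis of the sequencing-based variants of the method.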

Restriction landmark genomic scanning

Restriction landmark genomic scanning (RLGS) is a method that enables searching for changes in DNA methylation in a genome. It uses a combination of DNA-cutting enzymes, one of which is methylation sensitive. The genomic DNA is digested into distinctive restriction fragments, labeled by a radioactive isotope followed by separation

8 E.g. MspI and HpaII are isoschizomers that recognize the CCGG sequence. When the central CpG residue within CCGG is methylated at cytosine, HpaII cannot cleave it, whereas MspI restricts the site regardless of the state of methylation of the central cytosine (but is inhibited when the outside C is methylated) [44].

of these fragments by two-dimensional gel electrophoresis [44]. It can be applied to genomic DNA of any species without prior knowledge of sequence information. The restriction enzyme used is a rare-cutting enzyme⁹, which preferentially recognizes CpG islands located in the promoter regions of genes [44].

Arbitrarily primed PCR

The AP-PCR method is based on the ability of PCR to generate a reproducible group of DNA fragments when the reaction is performed at low annealing temperature. To identify methylated sites in the genome, AP-PCR can be performed on DNA samples digested with methylation-sensitive restriction enzymes (methylation-sensitive HpaII and its isoschizomer MspI). In addition, a second restriction enzyme, RsaI, unrelated to DNA methylation, is used to cut the DNA into smaller fragments [28], [44]. A band is identified as
• Methylated if a PCR product is present in both the RsaI-digested and the RsaI + HpaII doubly digested samples, but not in the RsaI + MspI doubly digested sample.
• Unmethylated when a PCR product is present in the RsaI-digested sample only, but not in doubly-digested samples.
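The two rules above amount to a small decision table over the three digests. A minimal sketch follows; the function name and the "uninformative" label for any other band pattern are assumptions introduced for this example and do not come from [28] or [44].

```python
def classify_band(rsai, rsai_hpaii, rsai_mspi):
    """Classify an AP-PCR band from its presence in three digests.

    Each argument is True if a PCR product is seen in that sample:
    RsaI only, RsaI + HpaII, and RsaI + MspI respectively.
    """
    if rsai and rsai_hpaii and not rsai_mspi:
        # Survives the methylation-sensitive enzyme but not its isoschizomer.
        return "methylated"
    if rsai and not rsai_hpaii and not rsai_mspi:
        # Cut by both enzymes, so the site carries no protective methylation.
        return "unmethylated"
    # Any other pattern is not covered by the two rules (hypothetical label).
    return "uninformative"
```

For example, a band seen in the RsaI and RsaI + HpaII lanes but absent from the RsaI + MspI lane is classified as methylated.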

Differential methylation hybridization

DMH is an array-based approach for screening methylation changes of CpG islands in the genome [44]. This proceeds as follows:
1. DNA sequences are selected as probes in a microarray.
2. To prepare targets, DNA from test and control samples is digested with MseI, whose recognition site TTAA rarely occurs in GC-rich regions.

9 Cuts at fewer than 5000 sites in the human genome [44].

The enzyme cuts genomic DNA into very small fragments but leaves most CpG islands intact. The digested ends of CpG island fragments are ligated to linkers, which are used for anchoring sequences of a PCR primer.
3. The ligated DNA is digested with methylation-sensitive enzymes HpaII and BstUI.
4. The digested DNA is amplified by PCR using a primer that binds to the flanking linkers. DNA fragments containing methylated sites cannot be digested by the methylation-sensitive enzymes and are amplified by this linker-PCR approach. Many of these amplified fragments are expected to be present in tumor, due to abnormal DNA methylation, whereas the same unmethylated fragments are digested and not present in the normal control.
5. Amplified target DNA is purified, and aminoallyl-dUTP is incorporated into DNA through a random primed labeling procedure. The Cy5 or Cy3 fluorescent dye is then coupled to the aminoallyl-dUTP residue in the incorporated site.
6. The fluorescently labeled test and reference targets are pooled and hybridized to a microarray slide. Blocking reagents are included to minimize hybridization noise.
7. Hybridized chips are scanned.

Methylated DNA immunoprecipitation

In a Methylated DNA Immunoprecipitation (MeDIP) experiment, DNA is isolated and ultrasound is applied to produce cuts at random positions. Antibodies that are specific for 5-methylcytosine are used to capture methylated DNA. Then whole genome amplification is performed to generate a higher

DNA concentration. DNA is fluorescently labeled and hybridized to a microarray, so microarray probes having high intensities indicate that methylation is present (see section 2.1 for details).

MethyLight

The MethyLight technique relies on the bisulfite modification of genomic DNA to create methylation-dependent sequence changes. PCR primers can be designed to amplify only specific variants of such sequence patterns (such as fully methylated or fully unmethylated versions). Detection of the PCR products occurs in real time during the PCR reaction, using fluorescence [44].

Sequencing

This method is a variation of the bisulfite method (see section 1.2.3) [30]. Bisulfite-treated DNA is sequenced using next-generation sequencing platforms [24] [29] [26] [33]. This has been successfully applied to detect methylation levels on whole genomes, achieving single-nucleotide resolution [5].

1.3 Histone changes

Histones are subject to post-translational modification by enzymes, primarily on their N-terminal tails but also in their globular domains. Such modifications include methylation, citrullination, acetylation, phosphorylation, sumoylation, ubiquitination, and ADP-ribosylation. These modifications affect their gene regulatory function [2]. Deacetylation of histone tails in methylated DNA produces a condensed chromatin structure inaccessible to the transcriptional machinery [55].

Researchers realized that histone modifications are dynamic during development, vary among different tissues, are regulated by specific enzymes, play major roles in the control of gene expression, and interact with other epigenetic control systems such as DNA methylation. A family of proteins binding to methylated DNA (MBDs) has been identified, and some of these are in complexes with histone deacetylases. This leads to local deacetylation of histone tails in methylated DNA and thus more condensed chromatin, which is inaccessible to the transcriptional machinery [55]. In general, histone acetylation is associated with gene activity, and lack of acetylation with gene repression. Histone methylation is associated with activity or repression, depending on which lysine is modified [48].

The H3 and H4 histones have long tails protruding from the nucleosome, which can be covalently modified at several places (methylation, acetylation, phosphorylation, ubiquitination, sumoylation, citrullination, and ADP-ribosylation) [2]. Acetylation of various amino acid residues of histones H3 and H4 is generally associated with an active chromatin configuration and expressed genes (euchromatin). In contrast, histone methylation is generally associated with condensed or heterochromatic chromatin and gene repression. However, many exceptions to this simple rule exist [55].

Table 1–1: Methylation detection methods

Epigenomic modification: DNA methylation
• BAC arrays coupled to NotI digestion: Biotin-labeled DNA is generated from end filling of genomic DNA samples digested with the rare-cutting methylation-sensitive restriction endonuclease NotI. The samples are then hybridized onto a microarray of BAC clones.
• HELP: The HELP assay co-hybridizes HpaII digestion products (unmethylated DNA enrichment) with digestion fragments from a methylation-insensitive isoschizomer (MspI) onto a customized array.
• MSDK: MSDK uses methylcytosine-sensitive restriction endonucleases to discriminate methylated from unmethylated DNA and maps their position in the genome using SAGE.
• Bisulfite treatment: Bisulfite treatment of DNA converts cytosine to uracil unless the base is methylated, allowing one to discriminate methylated from unmethylated DNA. The converted product can be read using high-throughput sequencing or MALDI-TOF mass spectrometry.
• MethyLight: Fluorescence-based real-time PCR technique that quantifies methylated DNA by using primers that anneal differently to bisulfite-treated DNA.
• MeDIP: Immunoprecipitation of methylated DNA by an antibody specific for 5-methylcytidine. The enriched fraction is hybridized to a CGH array in order to simultaneously investigate methylation changes and DNA mutation.
• GMAT: Following ChIP, DNA is ligated to biotinylated linkers and bound to streptavidin beads. Digestion with NlaIII cleaves the ChIP DNA, a linker with an MmeI restriction site is added, concatemerized and then cleaved with MmeI to form 21 and 22 bp fragments for sequencing.

Epigenomic modification: Chromatin modifications
• ChIP on chip: DNA is crosslinked to its associated binding proteins in vivo by chemical treatment. Following shearing, the DNA-protein complex is immunoprecipitated by antibodies specific for a protein or histone modification of interest. The DNA sequence location can be identified by hybridization to a microarray.
• SACO: ChIP product is ligated to PCR adapters. Following digestion with NlaIII, the material is divided into two pools and ligated to distinct MmeI adaptors. The adaptors are bound by streptavidin and released by digestion to produce tags. Tags are concatemerized and sequenced.
• ChIP-PET: Immunoprecipitated DNA is cloned into a DNA library and then converted into paired-end ditags (PETs). The PETs are concatenated and cloned into a ChIP-PET library for sequencing.
• DNase I QCP: Intact nuclei are isolated and half are treated with DNaseI. Primers are designed to amplify 250 bp fragments across candidate gene loci. The relative number of intact amplicons is calculated for the treated sample versus the control. A plot of ratios according to genomic position is produced and a regional baseline of sensitivity is determined. Outliers are identified as hypersensitive sites and likely regulatory elements.

Figure 1–6: Steps in DMH. Figure from [44]

CHAPTER 2
Modeling methylated DNA immunoprecipitation

2.1 Introduction

In order to study methylation and how it affects different biological processes, it is important to understand the differences in methylation among different DNA samples (e.g. how methylation differs in treated and control cells). In other words, the biological problem that needs to be solved (the ultimate goal of this project) is to identify the location of each methylated base in the whole genome.

In theory, it is possible to know the location of each methylated CpG using bisulphite sequencing [5], but at present the cost is very high (this is rapidly changing). Thus, an alternative approach is to use Methylated DNA Immunoprecipitation (MeDIP), which allows the estimation of averages of methylation in CpG dinucleotides for a genomic region in a cost-effective way.

As mentioned in Chapter 1, MeDIP is a method that uses microarrays to obtain information about methylation in different genomic regions. Ideally we would like to get information about the methylation status of every single CpG in those genomic regions, but MeDIP cannot achieve that resolution, yielding only a noisy measurement of methylation averages across hundreds of bases. Different methods have been proposed for the inference of methylation status based on microarray data, such as probabilistic inference [47] [15] and genomic weighted smoothing [54] (i.e. averaging values of probes within a small continuous genomic region).

In order to overcome some of the limitations of this technology, we developed a model for MeDIP that takes into account several factors of the

experiment, including: sonication, pull-down efficiency, and probe efficiency. Our results show that methylation is autocorrelated. Thus, by sorting the data by genomic location we can take advantage of this autocorrelation and improve methylation predictions. In addition, by applying optimal filtering to the data, using well-known digital signal processing techniques, an algorithm of lower complexity than other proposed models is obtained.

2.1.1 Introduction to our model

An MeDIP experiment consists of several steps, as illustrated in Figure 2–1:

Sonication: Sonication means applying ultrasound to produce cuts at random positions in DNA strands; typically fragments are 300 to 1500 bases long.

Immunoprecipitation (IP): Antibodies specific for a particular molecule (e.g. 5-methylcytosine) are immobilized on a solid-phase substrate such as microscopic agarose beads or magnetic beads and added to the sonicated DNA. Molecules targeted by the antibodies are captured by the beads via the antibodies and separated from the sample (i.e. immunoprecipitated). In MeDIP, the molecule of interest is sheared methylated DNA, so it is immunoprecipitated with a monoclonal antibody that specifically recognizes 5-methylcytidine (methylated C). Some of the original sonicated DNA is retained to be used later as a reference (usually called “control” or “input”).

Whole genome amplification (WGA): Immunoprecipitation typically yields small amounts of DNA, so it must be amplified. Based on the Polymerase Chain Reaction (PCR) technique, WGA uses many PCR cycles with Taq polymerase and random primers that anneal at low temperature to increase yields in an unbiased manner.

Chip Hybridization: A microarray consists of an arrayed series of millions of spots of DNA, called probes, each containing a specific DNA sequence. The probes are attached to a solid surface by a covalent bond, so we know the

position of each probe and its sequence. When the mixture of fluorescently labeled DNA is “poured” onto the chip, DNA fragments will bind to the probe having a complementary subsequence. Fluorescent dyes are used for DNA labelling (typically Cy3, with a wavelength of 570 nm, and Cy5, with a wavelength of 670 nm). The two Cy-labelled DNA samples are mixed before being hybridized to a microarray. Typically we label immunoprecipitated DNA with Cy5 (red) and reference (or control) DNA with Cy3 (green).

Scanning: A scanner reads the light intensities for each dye on the microarray using a laser, producing an image that is processed so that intensities can be mapped to each probe and thus to each sequence in the genome.

Intensities are related to the number of methylated DNA fragments attached to each probe. This provides an indirect measurement of methylation levels near the probe.

2.2 The model

As mentioned, the purpose of an MeDIP experiment is to determine methylation levels. After performing the experiment we have the intensities, for both immunoprecipitated and control DNA, for each probe in the chip. We also have gel electrophoresis data (“gel run”) showing the fragment lengths during the sonication step. Based on this information, we develop our model to obtain a quantification of methylation levels in the genomic regions of interest.

We model each step of the MeDIP process as follows (see Figure 2–2):

Sonication: The first problem we have to solve is to understand the probability distribution of sonicated fragments. We use a “gel run” and basic image processing to extract the data from the gel and then fit a parametric distribution (section 2.3). We then model the sonication process by calculating the “Sonication score”, a number that is proportional to the number

of methylated DNA fragments that contain a given probe sequence (section 2.4). As we show later, sonication score is autocorrelated. Consequently, to take advantage of this fact in our model we assume that probes are ordered by genomic location.

Immunoprecipitation (IP): This process is not one hundred percent effective. The efficiency of IP depends on the number of methylated CpG dinucleotides in a DNA fragment. We describe how we include this nonlinear efficiency in our model in section 2.5.

Whole genome amplification (WGA): Although WGA is not perfect, there are no known systematic errors, so we will simplify this process by modeling it as an amplification plus random noise.

Figure 2–1: Methylated DNA Immunoprecipitation (MeDIP). Figure from [47] and [18]

Chip Hybridization: There are several factors that influence probe efficiency (the capacity of a probe to capture DNA). Some of the factors depend on probe sequence and we try to measure their impact. In sections 2.6.1 and 2.6.2 we study the effect of melting temperature and probe sequence. Melting temperature is the temperature at which half of the DNA strands are separated from their complementary strand (in this case the complementary strand is the probe). Probe sequence is also important because A-T base pairs form fewer hydrogen bonds than C-G base pairs, so the binding is stronger for CG-rich sequences.

Scanning: We assume that feature extraction software eliminates systematic biases, so what remains is random noise.

In this chapter we define $r_i$ and $g_i$ as the Cy5 (red) and Cy3 (green) intensities for probe number $i$, and $m_i = \log_2(r_i / g_i)$ as the M-value for probe $i$. We set $mth(x) = 1$ if there is a CpG at position $x$ (0 otherwise), and $r(x) = 1$ if there is a probe centered at position $x$ (0 otherwise). Figure 2–2 shows some of these values plotted just for illustration purposes.
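As a small illustration of these definitions, the following sketch computes M-values from paired intensities and builds a 0/1 indicator signal such as $mth(x)$ or $r(x)$. The function names and the use of plain Python lists are assumptions made for this example; intensities are assumed to be strictly positive, as expected after feature extraction.

```python
import math


def m_values(red, green):
    """Return m_i = log2(r_i / g_i) for paired Cy5 and Cy3 intensities."""
    if len(red) != len(green):
        raise ValueError("red and green must have the same length")
    # Intensities are assumed strictly positive (assumption of this sketch).
    return [math.log2(r / g) for r, g in zip(red, green)]


def indicator(positions, length):
    """Return a 0/1 list of size `length` with 1 at each listed position.

    This is the form taken by mth(x) (CpG positions) and r(x) (probe centers).
    """
    marked = set(positions)
    return [1 if x in marked else 0 for x in range(length)]
```

A positive M-value means the immunoprecipitated (Cy5) channel is brighter than the reference (Cy3) channel for that probe, and a negative value means the opposite.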

Figure 2–2: Model diagram and some signals in the model

2.3 DNA fragment length

Sonication means applying ultrasound to produce cuts at random positions in DNA molecules. Here we use a gel electrophoresis image to infer the probability distribution of those fragments by matching fragment sizes to gel band intensities. In Figure 2–3.A, we see seven bands. The “bp” band, commonly known as a ladder, contains DNA fragments of known sizes that are used as a reference. The remaining bands correspond to sonicated DNA. Most of the gel band intensities beyond 1500bp are clearly due to noise, so we will assume that most of the fragments are shorter than 1500bp.

Figure 2–3: Gel run image using sonicated DNA. A) Image showing 7 bands. B) Normalized image intensity for some control and MBD3 bands. C) Normalized Image intensity for “ladder” band (red) and ladder’s peak detection (green).

We approximate the relative amount of DNA as a function of fragment size by fitting a parametric distribution. The amount of DNA is proportional to the image intensity in the bands (Figure 2–3.A). We extract gel band intensities by calculating intensity averages using a sliding window to reduce noise (i.e. by applying a low-pass filter), to obtain a vector $\bar{i}^{raw}$ of intensity values. The process is repeated for each gel band so that for the example shown

in Figure 2–3 we obtain seven vectors. Each vector is normalized to [0, 1] simply by subtracting the minimum value and scaling by the range,

$$\bar{i}_j = \frac{\bar{i}_j^{raw} - \min(\bar{i}_j^{raw})}{\max(\bar{i}_j^{raw}) - \min(\bar{i}_j^{raw})}.$$

To achieve a more robust signal, we then average intensities in all bands (except the ladder band),

$$\bar{i}^{avg} = \frac{1}{6} \sum_{j=1}^{6} \bar{i}_j.$$

In order to map intensities to fragment sizes, we use the ladder band. A simple peak detection algorithm is used to detect each local maximum that matches a known ladder fragment size (see Figure 2–3.C). It is easy to see that the peaks are distributed on a logarithmic scale. We fit the data to the formula $\xi = \beta_1 + \beta_2 \ln(x)$, where $x$ is the 'real' scale and $\xi$ is the observed one. The results of the fit are then used to fit $\bar{i}^{avg}$ to a lognormal distribution approximating the distribution of fragment sizes.

2.4 Sonication score

We define the “sonication score” for a given probe as the proportion of methylated DNA molecules that could bind to the probe. The idea is that the M-value is a measure of the number of molecules that actually bind to the probe, so M-value and sonication score should be strongly correlated. As the sonication score depends on the underlying methylation, it is a link between M-value and methylation level. In theory, then, we could use the probe M-value to infer the sonication score and thus the methylation level at a probe. A similar approach has also been proposed in [47], [15], and [42].

We consider all the possible fragments for each possible number of methylation sites (CpG dinucleotides) and multiply all probabilities of each fragment size. This is the same as applying a weighted sum function $S_\tau(\cdot)$ to the distance between each methylation site and the probe (see appendix A.1 for detailed formulas). Figure 2–4 shows the shape of the “weighting function” $S_\tau$. Figure 2–2 shows a diagram for this first part of our model, as well as some plots to show what these signals look like.
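The weighted-sum form of the score can be sketched as follows. The actual $S_\tau$ is derived in appendix A.1 from the fragment length distribution and is not reproduced here; the triangular weight below, which falls linearly to zero at 1500bp, is only a hypothetical stand-in chosen because fragments are assumed to be shorter than 1500bp. All function and parameter names are illustrative.

```python
def triangular_weight(distance, max_dist=1500):
    """Hypothetical stand-in for S_tau: 1 at distance 0, 0 beyond max_dist.

    The real weighting function comes from the fitted fragment length
    distribution (appendix A.1); this shape is an assumption.
    """
    return max(0.0, 1.0 - abs(distance) / max_dist)


def sonication_score(probe_pos, methylated_sites, weight=triangular_weight):
    """Sum the weights of all methylated sites around one probe.

    Each site contributes weight(x_cg - x_p), so nearby methylated CpGs
    count more than distant ones and sites out of reach count zero.
    """
    return sum(weight(x_cg - probe_pos) for x_cg in methylated_sites)


def score_track(probe_positions, methylated_sites, weight=triangular_weight):
    """Return the sonication score for each probe, in the given order."""
    return [sonication_score(p, methylated_sites, weight) for p in probe_positions]
```

Because neighboring probes share most of the methylated sites within reach, their scores are similar, which is the intuition behind the auto-correlation studied next.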

Figure 2–4: Function $S_\tau(x_{cg} - x_p)$ depends on the distance between the probe $x_p$ and the methylation site $x_{cg}$

2.4.1 Sonication score auto-correlation

One of our hypotheses is that sonication score is auto-correlated and that we could take advantage of that correlation to detect methylation status, so we want to find the auto-correlation function of $S_p(x_p)$. The mathematical details of this derivation can be found in appendix A.2. The final formula we get is

2 X Rss(d) = (pth − pth) Sτ (t)Sτ (t − d) (2.1) t where d is the separation between probes d = xj −xi, and pth the probability of having methylation in a nucleotide. It’s worth noting that the auto-correlation depends the separation between probes d = xj−xi, the methylation probability pth and the shape of the sonication score function Sτ (Figure 2–5 shows the auto-correlation function).

35 2 As we can see, equation 2.1 has two factors. The first factor is (pth − pth) is either strictly positive when there is methylation and zero when there is no P methylation. Similarly the second term is t Sτ (t)Sτ (t − d) is strictly positive as long as there is overlap between the curves (i.e. the distance is less than 3000bp) and zero when the distance is greater than that number. Thus the auto-correlation is strictly positive whenever there is some methylation and distance is less than 3000bp.

The derivation of Rss(d) assumes that methylation between two sites is independent. In fact this is false because we know that methylation is autocorrelated (see section 2.4.2). Thus our derived formula is actually a lower bound for the true sonication score auto-correlation. Figure 2–7 shows an example auto-correlation that is in fact higher than our lower bound.
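As a rough numerical illustration of equation 2.1, the sketch below evaluates the lower bound for a few probe separations. The real shape of the weighting function is derived in appendix A.1; here a triangular window with a 3000bp support is assumed purely for illustration, and the names `s_tau` and `r_ss` are hypothetical.

```python
def s_tau(t, half_width=1500):
    """Hypothetical triangular stand-in for the weighting function S_tau.

    Only the finite support (3000bp wide, so two curves stop overlapping
    once probes are 3000bp apart) is taken from the text.
    """
    return max(0.0, 1.0 - abs(t) / half_width)


def r_ss(d, p_th, half_width=1500):
    """Lower bound R_ss(d) = (p_th - p_th^2) * sum_t S_tau(t) * S_tau(t - d)."""
    overlap = sum(
        s_tau(t, half_width) * s_tau(t - d, half_width)
        for t in range(-half_width, half_width + 1)
    )
    return (p_th - p_th ** 2) * overlap


if __name__ == "__main__":
    for d in (0, 500, 1500, 3000):
        print(d, round(r_ss(d, p_th=0.5), 2))
```

With this assumed window the value is largest at d = 0, decays as d grows and vanishes at d = 3000, which is the qualitative behaviour described above.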

Figure 2–5: Auto-correlation function Rss(d) for Sτ , lower bound shown in a dashed line.

According to our model, the sonication score is auto-correlated and the M-value is an indirect measurement of the sonication score. Thus we also expect M-values of nearby probes to be auto-correlated. As we can see in Figure 2–6, there is in fact auto-correlation between M-values that decreases as the distance between probes increases (data from a fully methylated human DNA sample, promoter microarray). Figure 2–7 shows the sample auto-correlations for M-values.

Figure 2–6: Scatter plots of $m_{p+\tau}$ vs $m_p$ show auto-correlation for probes separated by 100bp to 800bp, and no correlation for probes separated by 100,000bp; a lowess line is also shown

2.4.2 Methylation is auto-correlated

It is known that methylation is usually low or high in different genomic regions for different organisms: e.g. promoters are usually unmethylated, CpG islands are usually unmethylated in mammals or methylated in some plants, repetitive regions and transposons are usually methylated, genes are mostly methylated in mammals, etc. [32]. This indicates that there is some autocorrelation in methylation status. A study performing bisulphite sequencing on the A. thaliana genome [5] also reported that methylation is locally correlated up to 5000bp and that it has a marked periodicity of around ten nucleotides (i.e. one DNA helical turn). We performed a methylation autocorrelation study of the human genome using HEP [18] [35] data (see Figure 2–8), which also shows high auto-correlation.

Figure 2–7: Left: Sample M-value auto-correlation function; the line shows a low-pass filtered value. Right: Sample sonication score auto-correlation function; the line shows a low-pass filtered value and the dashed line shows the theoretical lower bound

2.5 Immunoprecipitation

The second step in the MeDIP process is immunoprecipitation (IP), shown as the second block in Figure 2–2. It is known that IP efficiency has a nonlinear relation to the number of methylated Cs [25]. An approximation of the efficiency curve, based on empirical data, is shown in Figure 2–9. We approximate the empirical curve using the following formula, which keeps the result within [0, 1]:

$$\mathrm{eff}(n) = \max\left\{ \min\left[ \alpha_{eff}\, n + \beta_{eff} \left( 1 - e^{-\frac{n - k_{eff1}}{k_{eff2}}} \right) + \gamma_{eff},\; 1 \right],\; 0 \right\}$$

where $\alpha_{eff}$, $\beta_{eff}$, $k_{eff1}$, $k_{eff2}$, $\gamma_{eff}$ are suitable coefficients. When we incorporate $\mathrm{eff}(n)$ into our sonication score, the resulting function is no longer linear and we cannot express it as a convolution. Instead, we start with equation 2.2 and define

$$S^{eff}_{pcg}(x_p, x_{cg}) = \sum_{x=-\infty}^{+\infty} P_{bind}(x, x_{cg}, x_p)\; \mathrm{eff}[cg(x, x_{cg}, x_p)] \qquad (2.2)$$

Figure 2–8: Methylation sample auto-correlation using HEP data (grey) and a low-pass filter of the results (red).

where $\mathrm{eff}()$ is our efficiency function and $cg(x, x_{cg}, x_p)$ is the number of CpGs covered by a fragment that starts at position $x$ and ends at position $\max(x_{cg}, x_p)$ (i.e. it covers both the probe and the CpG we are analyzing).

The new sonication score for probe $x_p$ is found by summing $S^{eff}_{pcg}(x_p, x_{cg})$ over all CpGs (Figure 2–10 compares histograms of the original and the new sonication score):

$$S_{eff}(x_p) = \sum_{x_{cg} \in \{\text{All CpGs}\}} S^{eff}_{pcg}(x_p, x_{cg}) \qquad (2.3)$$
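The efficiency curve and the sum in equation 2.2 can be sketched as follows. This is only an illustration: the coefficient values are made up to give a saturating curve (the fitted values are not given here), the clamp to [0, 1] is assumed to be the intent of the formula, and the binding probabilities and CpG counts are taken as inputs because their formulas live in the appendix.

```python
import math

# Hypothetical coefficients, chosen only to produce a plausible saturating curve.
ALPHA, BETA, GAMMA, K1, K2 = 0.01, 0.8, 0.0, 1.0, 3.0


def eff(n, alpha=ALPHA, beta=BETA, gamma=GAMMA, k1=K1, k2=K2):
    """IP efficiency for a fragment with n methylated CpGs, clamped to [0, 1]."""
    raw = alpha * n + beta * (1.0 - math.exp(-(n - k1) / k2)) + gamma
    return max(min(raw, 1.0), 0.0)


def s_eff_pcg(p_bind, cg_count):
    """Equation 2.2 over a finite list of fragment start positions.

    p_bind[x] is the binding probability of the fragment starting at x and
    cg_count[x] is the number of CpGs that fragment covers; both are supplied
    by the caller.
    """
    return sum(p * eff(c) for p, c in zip(p_bind, cg_count))


if __name__ == "__main__":
    print([round(eff(n), 3) for n in range(0, 12, 2)])
```

Summing `s_eff_pcg` over all CpGs near a probe would then give the score of equation 2.3.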

2.6 Hybridization

The amount of DNA that gets attached to a probe depends, of course, on the amount of DNA that has a sequence matching the probe sequence, but there are also other factors that influence hybridization signal intensity [21] [52] [19]: melting temperature and base content.

Figure 2–9: Efficiency curve: Immunoprecipitation efficiency as a function of methylated CpGs. Empirical data from [25]

Figure 2–10: Left: Histograms for sonication scores S(xp) (blue) and Seff (xp) (orange). Right: Scatter-plot and lowess for Seff (xp) vs S(xp)

2.6.1 Melting temperature Tm

Melting temperature Tm is defined as the temperature at which half of the DNA strands are in random-coil state and half are in double-helical state.

Probes with higher Tm can form more stable double-stranded DNA at lower hybridization temperatures, resulting in higher signal. Using the UNAFold program [52] we can calculate the melting temperatures for all the probes on a chip. Figure 2–11 shows a typical Tm histogram as well as the influence of Tm on M-values (we get the same relationship if we use the log intensity of the input channel).

Figure 2–11: Tm Histogram and effect on probe M-values

Given the melting temperature $tm_i$ of each probe, we fitted a linear model to the M-values:

$$\hat{m}^{tm}_i = \alpha_{tm}\, tm_i + \beta_{tm} \qquad (2.4)$$
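A minimal sketch of the fit in equation 2.4, using ordinary least squares written out in plain Python. The function names and the sample numbers are illustrative assumptions, not data from the experiments.

```python
def fit_line(x, y):
    """Ordinary least squares for y ~ alpha * x + beta; returns (alpha, beta)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    alpha = sxy / sxx
    beta = mean_y - alpha * mean_x
    return alpha, beta


def predict_m_from_tm(tm, alpha_tm, beta_tm):
    """Equation 2.4: estimated M-value for each probe melting temperature."""
    return [alpha_tm * t + beta_tm for t in tm]


if __name__ == "__main__":
    # Made-up melting temperatures (Celsius) and M-values, for illustration only.
    tm = [68.0, 70.0, 72.0, 74.0, 76.0]
    m = [-0.4, -0.1, 0.1, 0.3, 0.6]
    alpha_tm, beta_tm = fit_line(tm, m)
    print(alpha_tm, beta_tm, predict_m_from_tm([71.0], alpha_tm, beta_tm))
```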

2.6.2 Probe sequence content

Probe sequence content affects the M-value because base pairs with three hydrogen bonds (C and G) produce stronger attachment than those with two (A and T). In addition, bases closer to the slide have less impact than bases on the protruding part of the probe [21]. To analyze this, we define a model for each nucleotide: A, C, G, T. We start by defining $a_{i,s} = 1$ if the sequence for probe $i$ contains an A at position $s$ (and 0 otherwise), and then define $A$ as the matrix formed by $a_{i,s}$. Now we fit four models (we adopted a methodology similar to [21]):

$$\bar{m} = \bar{w}_A^T A + w_{A0} + \bar{e}_A \qquad (2.5)$$

where $\bar{m}$ is a vector with all the M-values for a chip, $\bar{w}_A$ is the vector of model parameters $w_{Ai}$, $w_{A0}$ is the intercept term and $\bar{e}_A$ is the error term. Similarly we define $c_{i,s}$, $g_{i,s}$ and $t_{i,s}$, and the corresponding matrices $C$, $G$ and $T$, and fit the same equation for each nucleotide. The values for $\bar{w}_A$, $\bar{w}_C$, $\bar{w}_G$ and $\bar{w}_T$ are shown in Figure 2–12 (black lines). The resulting parameter values are quite 'noisy', so we fit them to a model that captures the aforementioned bias of bases closer to the slide: $w'_{Ai} = \alpha_A i + \beta_A (1 - e^{-i}) + \gamma_A \approx w_{Ai}$ (again, we repeat the procedure for each nucleotide, obtaining $w'_{Ci}$, $w'_{Gi}$ and $w'_{Ti}$). Now we calculate the influence of base position on M-values as follows:

$$\bar{m}^{base} = \bar{w}'^{T}_A A + \bar{w}'^{T}_C C + \bar{w}'^{T}_G G + \bar{w}'^{T}_T T \qquad (2.6)$$
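To make the indicator matrices and equation 2.6 concrete, the sketch below evaluates the base-position estimate for a single probe. It assumes positions are numbered from 1 along the probe and that the smoothed parameters (alpha, beta, gamma) for each nucleotide have already been fitted; all names and numbers are hypothetical.

```python
import math

BASES = "ACGT"


def indicator_rows(probe):
    """One 0/1 row per nucleotide: rows[b][s] is 1 if the probe has base b at s."""
    return {b: [1 if ch == b else 0 for ch in probe] for b in BASES}


def smoothed_weight(i, alpha, beta, gamma):
    """Fitted positional weight w'_i = alpha * i + beta * (1 - exp(-i)) + gamma."""
    return alpha * i + beta * (1.0 - math.exp(-i)) + gamma


def m_base(probe, params):
    """Equation 2.6 for one probe: sum over nucleotides of w'_b . indicator_b."""
    rows = indicator_rows(probe)
    total = 0.0
    for b in BASES:
        alpha, beta, gamma = params[b]
        total += sum(
            smoothed_weight(s + 1, alpha, beta, gamma) * rows[b][s]
            for s in range(len(probe))
        )
    return total


if __name__ == "__main__":
    # Made-up parameters: C and G contribute more than A and T.
    params = {"A": (0.001, 0.01, 0.0), "C": (0.002, 0.03, 0.0),
              "G": (0.002, 0.03, 0.0), "T": (0.001, 0.01, 0.0)}
    print(m_base("ACGTTGCA", params))
```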

Figure 2–12: Influence of base position on hybridization score (M-values). Black: raw values; green: low-pass filtered; red: fitted model $w_i \approx \alpha i + \beta (1 - e^{-i}) + \gamma$. Dashed lines show the average parameter value when fitting random data.

As mentioned before, sequences rich in Cs and Gs form stronger double-stranded bonds and thus have a higher melting temperature. Figure 2–13-C confirms the relationship between CG-content and melting temperature.

Figure 2–13: Relationship of parameters in a fully methylated DNA sample MeDIP experiment, lowess curves: A) Sonication score vs melting temperature B) Sonication score vs CG-content C) Melting temperature vs CG-content.

2.7 Relationship between sonication score, melting temperature, base position and M-values

In order to validate our assumptions, we show that there is indeed a relationship between sonication score, melting temperature, base position and M-values (mentioned in sections 2.4, 2.6.1 and 2.6.2). To calculate sonication score, we need to know methylation levels. One easy way to know methylation levels is to perform an MeDIP experiment using fully methylated DNA (i.e. all CpG dinucleotides were methylated before hybridization).

We already discussed the assumption that sonication scores are an indication of M-values and that both quantities are spatially correlated. Figure 2–14 shows low-pass filtered signals from a fully methylated DNA MeDIP experiment. The correlation between sonication score and M-values is 0.95 (low-pass filtered data); the correlation using the sonication score with IP efficiency is 0.96 (also low-pass filtered data).

We also mentioned that M-values are influenced by melting temperature and probe sequence. Figure 2–14 also shows M-value estimates from melting temperature, $\hat{m}^{tm}_p$ (using equation 2.4), and M-value estimates using probe sequence information, $\hat{m}^{base}_p$ (using equation 2.6), again as low-pass filtered data. The correlation between $\hat{m}^{tm}_p$ and M-values is 0.89; the correlation between $\hat{m}^{base}_p$ and M-values is also 0.89.

Given that the assumptions seem to be correct, we can use these pieces of information to refine our model and thus improve our estimation of methylation levels.

2.8 Predicting sonication signal

In this section we explore a method for approximating methylation levels from MeDIP experimental data. Based on the previous discussion, this involves deconvolving the sonication score for each probe from its M-values. Unfortunately, this deconvolution is not trivial even though it is conceptually simple. If we assume a linear relationship between M-values and sonication scores, then there exists a simple function $h()$ mapping sonication scores to M-values, as well as an inverse model $h^{inv}()$ mapping M-values to sonication scores. This well-known problem [57] is known as the deconvolution problem.

There have been several attempts to solve this problem for microarray experiments [47], [15], [54] and [42]. In some cases they propose that M-values are normally distributed around some methylation-related quantity (very similar to our sonication score) and they apply probabilistic methods to infer methylation from M-values. It is not clear how successful these methods are because true CpG methylation levels are usually not known. The one exception [42] compared microarray data to sequencing from the Human Epigenome Project (HEP [18]), but unfortunately only 1500 of the total 385,000 probes on the array overlapped with the sequencing data.

Figure 2–14: All values have been filtered using a moving average (window size 3900), then normalized to µ = 0, σ = 1. M-values (black), M-value estimates using probe sequence information $\hat{m}^{base}_p$ (orange), M-value estimates from melting temperature $\hat{m}^{tm}_p$ (blue), sonication score $S(x_p)$ (red) and sonication score with IP efficiency $S_{eff}(x_p)$ (green).

We solve the deconvolution problem using the Wiener filter. In particular, we will assume a linear model:

$$s_t = \sum_{k=-\infty}^{+\infty} h^{inv}_k\, m_{t-k} + e_t = \hat{s}_t + e_t \qquad (2.7)$$

where $h^{inv}_k$ are the coefficients, $e_t$ is the error and $\hat{s}_t$ is our estimation of $s_t$. Because the auto-correlation of both methylation levels and M-values is essentially zero for large distances, we will limit our model to a small number of coefficients (see footnote 1): $s_t = \sum_{k=-P}^{+P} h_k\, m_{t-k} + e_t$. The solution to this problem is (mathematical details can be found in appendix A.3) $h = R_{mm}^{-1} r_{ms}$, where $r_{ms}$ is the cross-correlation vector and $R_{mm}$ is the auto-correlation matrix. This equation is just a Yule-Walker equation or Wiener filter equation; the only difference is that we are using a non-causal model. Explicit consideration of the noise model makes the solution more “robust” (see section A.3). After computing the filter coefficients, we apply them to the data to get estimates of $s_t$. In section A.3.1 we show the mathematical details of how to incorporate melting temperature and probe sequence into our filter.

Due to genomic sequence irregularities, such as repeats, it is often not possible to place usable probes at every desired location. Consequently, a simple extension of this model can be implemented by dividing the genome into “regions” where probes are less than 3000bp apart (i.e. the support of the $S_\tau$ function).

2.9 Results

Performance: Our filter's performance predicting sonication values based on M-values is shown in Figure 2–15. The correlation between the estimated signal $\hat{s}_p$ and the real signal $s_p$ is over 0.70. Given that the autocorrelation is non-zero when probes are located less than 1500 bases away and that the inter-probe distance is around 100bp (median for our chip design), as expected we did not get much performance increase for filter sizes over 30 (i.e. 1500bp on each side of the predicted value).

Comparing filter coefficients for different chips: The filter coefficients for two different sets of microarrays are shown in Figure 2–16. As we can see, the coefficients are similar even for experiments with very different settings. One of the experiments used a human DNA array; the other used a rat DNA tiling array on a small genomic region. In both cases the DNA was 100% methylated so that we could train the model.

Footnote 1: We do not impose any restrictions on the causality of our model because we have the whole signal at the time of processing (i.e. $h_k$ can be nonzero for $k < 0$).

Region wise prediction: Many regions of the genome cannot be tiled by probes of sufficient quality or are not interesting for a particular experiment, so there are often gaps between consecutive probes. We split the genome into regions with probe separation less than 3000bp and then applied the filter on those regions. The median correlation was 0.81, showing an improvement over the global approach. This improvement could be explained by the fact that small genomic regions fit our assumptions better (e.g. probe distance, wide-sense stationarity). Figure 2–17 shows the correlation distribution for all regions where we applied the predictor.

Methylation prediction on regions between probes: In an ideal case we would like to be able to recover the methylation signal mth(x) (defined in section 2.2) using M-values. This is not possible because the sonication process is a non-invertible transformation. The best we can hope to achieve is to recover the sonication score between probes using an interpolation algorithm. Figure 2–18 shows a simulated signal $S_p(x)$ and its approximation $\hat{S}_p(x)$ by a cubic spline.
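The non-causal Wiener filter of section 2.8, whose performance is reported above, can be sketched in plain Python as follows. This is an illustrative reconstruction from the relation $h = R_{mm}^{-1} r_{ms}$ only: it uses simple sample correlations, ignores the noise model, melting temperature and probe sequence terms of appendix A.3, and all function names are assumptions.

```python
import random


def xcorr(a, b, k):
    """Sample estimate of E[a_t * b_(t-k)] over the indices where both exist."""
    n = len(a)
    return sum(a[t] * b[t - k] for t in range(max(0, k), min(n, n + k))) / n


def solve(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def wiener_coefficients(m, s, half_size):
    """Filter coefficients h_k for k = -P..P, from h = R_mm^-1 r_ms."""
    lags = list(range(-half_size, half_size + 1))
    r_mm = [[xcorr(m, m, abs(kj - ki)) for kj in lags] for ki in lags]
    r_ms = [xcorr(s, m, k) for k in lags]
    return solve(r_mm, r_ms)


def apply_filter(m, h, half_size):
    """Estimate s_t = sum_k h_k * m_(t-k), skipping samples outside the signal."""
    n = len(m)
    out = []
    for t in range(n):
        acc = 0.0
        for idx, k in enumerate(range(-half_size, half_size + 1)):
            if 0 <= t - k < n:
                acc += h[idx] * m[t - k]
        out.append(acc)
    return out


if __name__ == "__main__":
    random.seed(1)
    # Synthetic stand-ins: a smooth "sonication score" and a noisy "M-value".
    noise = [random.gauss(0, 1) for _ in range(300)]
    s = [sum(noise[max(0, t - 4): t + 1]) / 5 for t in range(300)]
    m = [0.8 * v + random.gauss(0, 0.2) for v in s]
    h = wiener_coefficients(m, s, half_size=5)
    s_hat = apply_filter(m, h, half_size=5)
    print([round(c, 3) for c in h], len(s_hat))
```

For the region-wise variant, the same two functions would be applied separately to each run of probes that are less than 3000bp apart.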

Figure 2–15: Correlation between the sonication signal $s_p$ and the predicted sonication signal $\hat{s}_p$ for different microarrays (human tiling arrays, fully methylated DNA)

Figure 2–16: Difference between the Wiener filter coefficients for two different experiments on different microarrays for different organisms

Figure 2–17: Correlation histogram for all “regions”. Median correlation using M-values (green) and median correlation using quantile normalized M-values (red)

Figure 2–18: Original signal $S_p(x)$ (black), sampled signal $S(x)$ (green) and interpolated approximation $\hat{S}_p(x)$ (red).

2.10 Conclusions

In this chapter, we implemented a method to improve the information obtained from MeDIP experiments. We showed that methylation is autocorrelated. We compared the autocorrelation from MeDIP data with the theoretical autocorrelation expected from our model, showing that our model acts as a “lower bound” of the data. In order to exploit this autocorrelation, microarray data from an MeDIP experiment can be sorted by genomic location and then optimal digital filter theory can be used to predict average methylation status in genomic regions.

Our models are based on optimal linear digital filtering, which reduces the complexity of previously published models (e.g. [47] that uses Bayesian networks, [15] that uses Lasso equations and [42] that uses “Bayesian deconvolution”). Given that true CpG methylation levels are unknown, it is not possible to assess whether these methods are indeed successful. For the same reason, a comparison of the performance of previous methods with ours is not possible.

We also showed that the predictor's coefficients are similar even for different organisms. This is not surprising since the coefficients only depend on the autocorrelation structure of the genomic region, and this can be similar even for different organisms.

Regarding the overall goal of obtaining the detailed methylation status of a genomic region, this technology does not allow us to do more than an “educated guess” of the overall methylation status of a wide genomic area. Our model clearly shows that we cannot reconstruct methylation information for regions smaller than 1500bp even if we are able to obtain perfect information from the microarray. Moreover, it is well-known that microarray hybridizations are error prone and the data is very noisy. Thus, for detailed methylation information, other, sequencing-based methods look much more promising than microarray-based ones.

CHAPTER 3
Multiple PCR and primer selection problem

In order to validate a high-throughput experiment, low-throughput methodologies such as bisulfite sequencing should be used. In bisulfite sequencing, cytosines in single-stranded DNA are converted to uracil when reacting with bisulfite, whereas methylated cytosines remain unreactive. The modified DNA strands are amplified by PCR and then sequenced.

Polymerase chain reaction (PCR) is used to amplify (make copies of) regions of DNA, called amplicons. As PCR progresses, the DNA generated is used as a template for amplification, thus the DNA template is exponentially amplified. In order to perform PCR amplification, we need a DNA template (to be amplified), primers (short strands of DNA that bind to complementary DNA to initiate DNA replication) and Taq polymerase (an enzyme that makes copies of DNA fragments bound by primers). PCR consists of a series of cycles, each consisting of three steps: i) denaturation: heating causes melting of the DNA template and primers by disrupting the hydrogen bonds, yielding single strands of DNA; ii) annealing: the temperature is lowered, allowing annealing of the primers to the single-stranded DNA template; iii) extension: the temperature is raised to allow Taq polymerase to synthesize a new DNA strand complementary to the DNA template strand. These cycles are repeated twenty to forty times.

PCR amplification is one of the main experimental bottlenecks in most labs, because each amplification requires a large amount of time. Thus, it would be very useful to speed up this process by doing multiple PCR amplifications (i.e. several amplifications in the same test tube).

Epigenetic experimental setups often use bisulfited DNA. Unfortunately, finding primers to amplify bisulfite-treated DNA can be very difficult because the treatment converts only unmethylated Cs into Us, which are converted to Ts during PCR amplification. As a result, primer design can rarely assume the conversion of CpG, so primers must avoid CpG sites. All other Cs are unmethylated, so bisulfite primers are restricted to matching regions of the treated genome composed only of As, Gs and Ts, making it even more difficult to find primers with the desired sequence characteristics as well as an acceptable number of genomic binding sites.

In this chapter, we address the problem of optimizing the PCR amplification step. We show the main problems related to multiple PCR, we demonstrate that selecting primers is NP-hard and we approach this problem using non-linear dynamic systems as well as stochastic optimization. In simulations of different PCR priming problems, which are a subset of the real problems of our lab, the solution found by our optimization algorithms is as good as the optimal lower bound found by exploring the full solution space using a simple heuristic algorithm.

3.1 Probability of mispriming

Mispriming occurs when two primers initiate amplification of an unintended part of the genome. For this to occur, the primers must bind the genome within some maximum distance $N_{Taq}$ of one another; we assume here less than one thousand bases apart. In this section we calculate the probability of mispriming. For a genome that is $N_G$ bases long and a primer $r_i$ that is $N_r$ bases long, the expected number of matches in the genome is:

$$E_m = N_G \prod_{k=1}^{N_r} P[r_i(k)]$$

where $r_i(k)$ is base $k$ in primer $r_i$; it can be {A, C, G, T} for DNA or {A, G, T} for bisulfited DNA. If we approximate $P_A = P_G = P_T = 1/3 = P_{AGT}$, then $E_m = N_G (P_{AGT})^{N_r}$. Thus for two primers $r_i$ and $r_j$, the probability of mispriming is

$$P_{mp} = E_m \frac{2 N_{Taq} E_m}{N_G} = \frac{2 E_m^2 N_{Taq}}{N_G} = \frac{2 N_{Taq} \left( N_G (P_{AGT})^{N_r} \right)^2}{N_G} = 2 N_{Taq} N_G (P_{AGT})^{2 N_r}$$

$$P_{mp} \approx \frac{2 N_{Taq} N_G}{3^{2 N_r}}$$

whereas when using untreated DNA the probability is $P_{mp} \approx 2 N_{Taq} N_G / 4^{2 N_r}$. Thus the probability of mispriming in bisulfited DNA is approximately $(4/3)^{2 N_r}$ times larger than in regular DNA (e.g. for a 20-base primer it is about 100,000 times larger; for a 25-base primer it is about 1.7 million times larger).

3.2 Probability of primer pair search failure

Let A be the set of amplicons that we want to amplify using PCR. Then 2|A| primers must be selected, one forward and one reverse for each amplicon. Suppose that for each amplicon there are M forward primers and M reverse primers to choose from. Then the probability that a primer r of the 2|A| selected primers does not misprime is:

$$P_{ok}(r) = B(0, 2|A| - 1, P_{mp}) = (1 - P_{mp})^{2|A| - 1}$$

where $B(k, n, p) = \binom{n}{k} p^k (1 - p)^{n - k}$ is the binomial distribution and $P_{mp}$ is the probability of two primers mispriming. The probability that r misprimes is

$$P_{bad}(r) = 1 - P_{ok}(r) = 1 - (1 - P_{mp})^{2|A| - 1}$$

If we have M primers to choose from, then the probability of having M incompatible possible primers (i.e. the probability that we cannot select any primer from the possible primers) is $P_{bad}(r)^M$. If none of the forward or none of the reverse primers can be selected, then the amplicon cannot be amplified. The probability of this occurring is:

$$P_{bad}(a) = 2 P_{bad}(r)^M = 2 \left[ 1 - (1 - P_{mp})^{2|A| - 1} \right]^M$$

The probability of not being able to select primers for a set of |A| amplicons is the probability of not being able to select appropriate primers for at least one of them:

$$P_{bad}(A) = 1 - \prod_{a \in A} \left[ 1 - P_{bad}(a) \right] = 1 - \left\{ 1 - 2 \left[ 1 - (1 - P_{mp})^{2|A| - 1} \right]^M \right\}^{|A|} \qquad (3.1)$$
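The probabilities derived in sections 3.1 and 3.2 are easy to evaluate numerically. The sketch below does so under stated assumptions: the genome size of 3e9 bases and the function names are illustrative choices, and the factor 2 in P_bad(a) is kept exactly as approximated in the text, so the result is only meaningful while that quantity stays below 1.

```python
def p_misprime(n_genome, n_taq, n_r, alphabet_size):
    """P_mp ~ 2 * N_Taq * N_G / alphabet_size**(2 * N_r).

    alphabet_size is 3 for bisulfited DNA (A, G, T) and 4 for untreated DNA.
    """
    return 2.0 * n_taq * n_genome / alphabet_size ** (2 * n_r)


def p_bad_primer(n_amplicons, p_mp):
    """P_bad(r): a selected primer misprimes with one of the other 2|A| - 1."""
    return 1.0 - (1.0 - p_mp) ** (2 * n_amplicons - 1)


def p_bad_amplicon(n_amplicons, m, p_mp):
    """P_bad(a) = 2 * P_bad(r)**M, as approximated in the text."""
    return 2.0 * p_bad_primer(n_amplicons, p_mp) ** m


def p_bad_set(n_amplicons, m, p_mp):
    """Equation 3.1: no valid primer choice for at least one of |A| amplicons."""
    return 1.0 - (1.0 - p_bad_amplicon(n_amplicons, m, p_mp)) ** n_amplicons


if __name__ == "__main__":
    n_g, n_taq = 3.0e9, 1000  # assumed genome size and maximum priming distance
    for n_r in (20, 25):
        ratio = p_misprime(n_g, n_taq, n_r, 3) / p_misprime(n_g, n_taq, n_r, 4)
        print(n_r, f"{ratio:.3g}")
```

The printed ratios correspond to the factor $(4/3)^{2 N_r}$ quoted in section 3.1 for 20-base and 25-base primers.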

3.3 Feasibility

Solving Equation 3.1 for M gives

$$M = \frac{\log\left\{ \frac{1}{2} \left[ 1 - (1 - P_{bad}(A))^{1/|A|} \right] \right\}}{\log\left[ 1 - (1 - P_{mp})^{2|A| - 1} \right]}$$

which is the number of primers needed to ensure a solution with probability $1 - P_{bad}(A)$. Figure 3–1 shows the values of M if we set $P_{bad}(r) = 1/2$. For example, with M = 150 candidate primers, a set of primers that amplifies more than 60 amplicons without mispriming will be found less than 50% of the time. Notice that if we want our forward and reverse primers to bind at most 1000 bases apart, this limits our amplicon to at most 700bp.

3.4 Formal problem definition

Here we introduce a more formal definition of the primer selection problem. Note that although it applies to primer design in general, because we ignore DNA sequence, later we address specifically bisulfited DNA sequence. Let:
• D be a set of amplicons that we want to amplify,
• M be a set of possible primers,
• A be a set of all amplicons that primers M can amplify,
• f be a function that maps pairs of primers to the amplicons that those primers can amplify.

Given sets A, M and D ⊆ A and a function $f : M \times M \mapsto A \cup \emptyset$, is there a subset $M^* \subseteq M$ such that $f(M^*, M^*) = D \cup \emptyset$?

Figure 3–1: Number of possible primers needed (bisulfited data)

3.5 Problem complexity

In this section, we show that the problem is NP-Complete. We first show that the correctness of any solution can be checked in polynomial time. We then show that the problem is as hard as 3SAT, a well-known NP-Complete problem [51] (sections 3.5.1 and 3.5.2).

If we have a guessed set of primers $M^*$, it is easy to check whether the conditions are satisfied. Checking that all amplicons are covered by at least one forward and one reverse primer can be done in $O(|M^*|) = O(2M|A|) = O(|A|)$. We also need to check that there are no conflicting pairs of primers, i.e. $\{f(r_i, r_j), \forall r_i, r_j \in M^*\} \cap \{A \setminus D\} = \emptyset$. This can be done in polynomial time, more precisely in $O(|M^*|^2) = O(|A|^2)$. Thus a solution can be checked in polynomial time $O(|A|^2)$.

3.5.1 Definition of 3SAT

We show that our problem is NP-complete. We do this by mapping a well-known NP-Complete problem to our problem, thus showing that our problem is at least as hard as the NP-Complete problem.

One of Karp's 21 NP-complete problems [51], 3SAT, is to decide whether or not there exists a satisfying truth assignment for a given expression. The expression consists of products of sums of three variables or their negations (in Boolean expressions, product is AND, addition is OR), e.g. $E = (u_{i1} + \bar{u}_{j1} + u_{k1}) \cdot (\bar{u}_{i2} + \bar{u}_{j2} + u_{k2}) \cdots (u_{im} + u_{jm} + \bar{u}_{km})$, where $\bar{u}$ denotes the negation of variable $u$. Note that each variable can appear multiple times in the expression.

3.5.2 Mapping 3SAT to primer selection problem

Each clause in E looks like (ui + uj + uk) where the variables can be negated or not, so we have a total of eight possible expressions. We’ll show how to map each of them into our primer selection problem.

First, for each variable $u_i$ in E:
• Create amplicons $A_i$ and $B_i$
• Create new primers $r_i$ and $r_{i'}$
• Set $f(r_i, r_i) = f(r_{i'}, r_{i'}) = A_i$ and $f(r_{i'}, r_i) = f(r_i, r_{i'}) = B_i$

Selecting $r_i$ means that $u_i$ = true; selecting $r_{i'}$ means that $u_i$ = false. Note that to find a solution to our primer selection problem, at least one of them has to be selected, because they are the only primers for amplicon $A_i$, and only one of them can be selected, because they conflict (selecting both would amplify $B_i$, which is not a desired amplicon).

Expression E is formed by several clauses. For clause number h we create one amplicon $C_h$ and use primers (both forward and reverse) according to the variables in the clause. If the variable is $u_i$, we use $r_i$ as forward and reverse primer for $C_h$. If the clause contains $\bar{u}_i$, then we use $r_{i'}$ as forward and reverse primer for $C_h$. Table 3–1 shows how to map each of the eight possible clauses.

We set $A = \{A_i\} \cup \{B_i\} \cup \{C_h\}$ and $D = \{A_i\} \cup \{C_h\}$. If E is satisfiable, then there is a set of primers that amplify exactly the desired amplicons. Conversely, if there are primers that amplify exactly the desired amplicons, then E is satisfiable.

Table 3–1: How to map each of the eight possible clauses

Clause                                      Create amplicon   Set function f
$(\bar{u}_i + \bar{u}_j + \bar{u}_k)$       $C_h$             $f(r_{i'}, r_{i'}) = f(r_{j'}, r_{j'}) = f(r_{k'}, r_{k'}) = C_h$
$(\bar{u}_i + \bar{u}_j + u_k)$             $C_h$             $f(r_{i'}, r_{i'}) = f(r_{j'}, r_{j'}) = f(r_k, r_k) = C_h$
$(\bar{u}_i + u_j + \bar{u}_k)$             $C_h$             $f(r_{i'}, r_{i'}) = f(r_j, r_j) = f(r_{k'}, r_{k'}) = C_h$
$(\bar{u}_i + u_j + u_k)$                   $C_h$             $f(r_{i'}, r_{i'}) = f(r_j, r_j) = f(r_k, r_k) = C_h$
$(u_i + \bar{u}_j + \bar{u}_k)$             $C_h$             $f(r_i, r_i) = f(r_{j'}, r_{j'}) = f(r_{k'}, r_{k'}) = C_h$
$(u_i + \bar{u}_j + u_k)$                   $C_h$             $f(r_i, r_i) = f(r_{j'}, r_{j'}) = f(r_k, r_k) = C_h$
$(u_i + u_j + \bar{u}_k)$                   $C_h$             $f(r_i, r_i) = f(r_j, r_j) = f(r_{k'}, r_{k'}) = C_h$
$(u_i + u_j + u_k)$                         $C_h$             $f(r_i, r_i) = f(r_j, r_j) = f(r_k, r_k) = C_h$
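The construction above can be checked on small formulas with a short sketch. The encoding is an assumption made for illustration: literals are signed integers (+i for $u_i$, -i for its negation), primers and amplicons are tuples, and f is stored as a mapping from a primer pair to the set of amplicons that pair amplifies. The search is brute force, so it only serves to confirm the mapping on tiny instances.

```python
from itertools import product


def build_instance(n_vars, clauses):
    """Map a 3SAT formula to (primers, f, desired) following the construction.

    Primer ("r", i) stands for r_i and ("r'", i) for r_i'.
    """
    primers, f, desired = [], {}, set()

    def add(pair, amplicon):
        f.setdefault(frozenset(pair), set()).add(amplicon)

    for i in range(1, n_vars + 1):
        pos, neg = ("r", i), ("r'", i)
        primers += [pos, neg]
        add((pos, pos), ("A", i))
        add((neg, neg), ("A", i))
        add((pos, neg), ("B", i))  # undesired: amplified only if both are chosen
        desired.add(("A", i))
    for h, clause in enumerate(clauses):
        for lit in clause:
            primer = ("r", lit) if lit > 0 else ("r'", -lit)
            add((primer, primer), ("C", h))
        desired.add(("C", h))
    return primers, f, desired


def amplified(selected, f):
    """All amplicons produced by any pair of selected primers, i.e. f(M*, M*)."""
    chosen = set(selected)
    result = set()
    for pair, amplicons in f.items():
        if pair <= chosen:
            result |= amplicons
    return result


def primer_selection_exists(n_vars, clauses):
    """Brute force over all primer subsets; exponential, for tiny instances only."""
    primers, f, desired = build_instance(n_vars, clauses)
    for mask in product([False, True], repeat=len(primers)):
        selected = [p for p, keep in zip(primers, mask) if keep]
        if amplified(selected, f) == desired:
            return True
    return False


if __name__ == "__main__":
    print(primer_selection_exists(3, [(1, -2, 3), (-1, 2, -3)]))
```

Note that `amplified` is also the polynomial-time check of section 3.5: it visits each stored primer pair once.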

Using this construction for each clause in E, we can map any 3SAT problem to our primer selection problem, and it is easy to see that this mapping can be done in polynomial time. Since 3SAT is NP-hard, we have shown that our primer selection problem is also NP-hard. The problem is therefore NP-Complete because it is trivial to determine in polynomial time whether or not a proposed set of primers will amplify all of the amplicons and no other regions.

3.5.3 Sequence construction

In fact, to define a real problem in terms of a sequence, we need to be able to show that there exists a sequence (genome) that contains all the amplicons and possible primers created in the previous section. This means that there exists a genome for every possible 3SAT problem, so that we can map it to our primer selection problem.

Given a 3SAT problem as an expression E and a set of variables U, first we map all possible primers to DNA sequences $L_P$ bases long that do not include any CG dinucleotides (these are primers for bisulfited DNA). We code primer i as a number in base 4, but using {A, C, G, T} instead of the digits {0, 1, 2, 3}. We prepend A for variable $u_i$ (i.e. primer $r_i$) and we prepend T for variable $\bar{u}_i$ (i.e. primer $r_{i'}$). We also prepend a CG which acts as a separator. For instance, if U = 8, for variables $u_1$ and $\bar{u}_2$ we would create sequences CGAAAA and CGTAAC respectively.

It is easy to see that we can build amplicon sequences that do not conflict with primers by including CG dinucleotides separated by less than $L_P$ bases. In order to build the complete sequence, we need to concatenate all forward primers, amplicons and reverse primers. We conclude that there exists a sequence containing all the amplicons and possible primers needed to map a given 3SAT problem. The length of the sequence is bounded by

$$L \le U (L_A + 6 L_P) + 3U (L_A + 4 L_P)$$

where $L$ is the sequence length, $L_A$ is the amplicon sequence length, $L_P$ is the primer sequence length and $U$ is the number of variables. The primer sequence length is $L_P = \lceil \log_4(16U/15) \rceil + 3$. So the total sequence length is $O(U \log U)$; the sequence does not grow exponentially.

3.6 Non-linear dynamic solution

In order to solve this problem, we tried some heuristic techniques, but the results were not satisfactory. Thus we ended up using dynamic non-linear systems. It is well-known that a dynamic system will converge to a minimum of a Lyapunov-candidate-function [49], if such a function exists for the system.

Our dynamic system is a simple network as shown in Figure 3–2. Each node i, drawn as a circle, has output $x_i = +1$ if “active” and $x_i = 0$ otherwise. The solution to the problem is the set of nodes that are active (in our problem, a node represents a primer). In this network, the input of node i can be connected to the output of another node j by a link with a weight denoted $w_{i,j}$. In the extreme case, the network is fully connected, so we have a dense matrix W composed of non-zero values $w_{i,j}$. Each node also has an input called 'bias', a constant positive input used to activate nodes.

Figure 3–2: Optimization network

The activation function for a node is [49] [27]

$$x_i = f(h_i) = \frac{1}{1 + e^{-\beta h_i}} \qquad (3.2)$$

where $\beta$ is a positive constant and $h_i = \sum_{j=1}^{N} w_{i,j} x_j + w_{i,0} x_0$; here N is the total number of nodes in the network, so in our case N = 2|A|M, $x_0$ is the bias and $w_{i,0}$ is a bias weight.

In our problem we require xi to be binary, but the equations allow any xi ∈ [0, 1]. It’s easy to see that f(hi) → {0, 1} as β → ∞, so β is usually increased as the system evolves, making the outputs ‘look binary’.
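The effect of increasing β can be seen with a few lines of code. This sketch assumes the sign convention in which a positive net input drives the output toward 1, consistent with a positive bias activating a node; the function names are hypothetical.

```python
import math


def activation(h, beta):
    """Logistic activation 1 / (1 + exp(-beta * h)), written to avoid overflow."""
    z = beta * h
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def net_input(i, weights, x, bias_weight, bias=1.0):
    """h_i = sum_j w[i][j] * x[j] + w_i0 * x_0."""
    return sum(w * xj for w, xj in zip(weights[i], x)) + bias_weight * bias


if __name__ == "__main__":
    # The same small inputs look more and more binary as beta grows.
    for beta in (1, 10, 100):
        print(beta, activation(0.2, beta), activation(-0.2, beta))
```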

In order to enforce other constraints on the network so that network solutions correspond to valid solutions to the amplification problem, we construct the network as follows:
• Set each weight between nodes corresponding to incompatible primers to $w_c$, a negative constant, so that incompatible nodes inhibit each other, preventing both from being active.
• Set each weight between forward primers in the same amplicon to $w_p$, also a negative constant. Likewise, set each weight between reverse primers in the same amplicon to $w_p$.
• Connect each node to a positive input (bias) to make it 'active'. This is done to avoid a trivial solution with all nodes inactive.
• Connect each node to itself with a positive weight. The reason is that if a node is active, we want to keep it active when competing with other nodes.

After building the network, we initialize the nodes to random values near 1/2 and let it 'evolve' using the former equations. It can be shown that there exists a Lyapunov function or energy function, which can be written as [49] [27]

$$E = -\frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} x_i w_{i,j} x_j + \sum_{i=1}^{N} k_i \int_0^{x_i} f^{-1}(x)\, dx$$

where $k_i$ are constants that depend on the model. The existence of this Lyapunov function implies that the dynamical system converges to a stable point. The energy of this system is minimized until a stable state is reached. In the case of a discretized model, there is a simplified energy function

$$E = -\frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} x_i w_{i,j} x_j \qquad (3.3)$$

It is not hard to see that this model has lower energy for valid solutions to our problem. For example, if primers i and j are incompatible and both of them are active, then the term $-\frac{1}{2} x_i w_{i,j} x_j$ is positive and increases the energy (because $w_{i,j}$ is negative). However, if either $x_i$ or $x_j$ is 0, the term vanishes and the energy does not change.

Based on our construction, matrix W is a sparse block matrix (see Figure 3–3). We exploit this structure in our implementation to optimize memory usage.
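As a small illustration of equation 3.3, the following sketch evaluates the discrete energy for a toy network of three primer nodes. The weight value `W_C = -1.0`, the network size and the function name `energy` are assumptions made for this example only; they are not taken from our implementation.

```python
def energy(x, w):
    """Discrete energy E = -1/2 * sum_i sum_j x_i * w[i][j] * x_j (equation 3.3)."""
    n = len(x)
    return -0.5 * sum(x[i] * w[i][j] * x[j] for i in range(n) for j in range(n))


W_C = -1.0  # hypothetical negative weight between two incompatible primers

# Nodes 0 and 1 are incompatible; node 2 has no conflicts.
w = [
    [0.0, W_C, 0.0],
    [W_C, 0.0, 0.0],
    [0.0, 0.0, 0.0],
]

e_conflict = energy([1, 1, 0], w)  # both incompatible primers active
e_valid = energy([1, 0, 1], w)     # only one of the incompatible primers active
```

Under these assumptions the conflicting state has strictly higher energy than the valid one, which is the property the minimization relies on.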

Figure 3–3: Matrix W is a sparse block matrix. Negative weights are shown in red, positive in green, and zero weights are shown in white.

3.7 Getting out of local minima: Stochastic approach

A simple way to obtain solutions from the network is to initialize nodes to random values (close to 0.5) and then repeatedly update them using the equation $\bar{x}_{t+1} = f(W \bar{x}_t)$. It can be shown that under certain conditions this network does not converge because all nodes are updated at the same time [27]. Instead, we update nodes in random order (asynchronous update), in which case the network always converges [49]. For small networks, solutions are found efficiently, but as the number of nodes and connections increases, the landscape of the energy function becomes more complex and the danger of becoming trapped in a local minimum increases. To avoid local minima we use stochastic nodes [27], similar to simulated annealing. The update is performed by drawing values from a distribution defined by a "cooling" temperature, more specifically:

$$p_i = f(h_i) = \frac{1}{1 + e^{-\beta h_i}}, \qquad x_i = \begin{cases} 1 & \text{with probability } p_i \\ 0 & \text{otherwise} \end{cases}$$

and $\beta = 1/T$, where T is the temperature. The temperature is decreased with time, causing the system to converge with higher probability to a global minimum as time tends to infinity [38].

3.8 Multiple PCR

The problem so far is described as finding primers for all amplicons as- suming we perform a multiple PCR experiment in one test tube. The next step is to extend this problem to multiple test tubes. For a set of amplicons and a given number of test tubes NT , we want to select a set of primers and assign them each to one of NT test tubes in a way that avoids conflicts. Let:

• $\mathcal{T} = \{tt_1, \ldots, tt_{N_T}\}$ be a set of test tubes.
• $\Upsilon: A \mapsto \mathcal{T}$ be a mapping that assigns each amplicon to a test tube.
• $\delta_{i,j}$ be an indicator function that is 1 if primers $r_i$ and $r_j$ belong to amplicons that are in the same test tube. That is

$$\delta_{i,j} = \begin{cases} 1 & \text{if } \Upsilon[f(r_i, r)] = \Upsilon[f(r_j, r')] \text{ for any } r, r' \in M \\ 0 & \text{otherwise} \end{cases}$$

Our energy function is thus

$$E = -\frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N} \delta_{i,j}\, x_i\, w_{i,j}\, x_j \qquad (3.4)$$

We have to find the minimum of this energy as before, but we also want to assign a test tube to each amplicon optimally, i.e. $\min_{\bar{x},\Upsilon}(E)$. This problem reduces to the former problem when the number of test tubes is 1, so it is also NP-hard. To solve it, we used a heuristic approach and simulated annealing (a network representation would be too complicated).

3.8.1 Lower bound using heuristic approach

Obviously it’s impossible to enumerate all primer combinations, but using heuristics to explore the space efficiently, one hopes to have some kind of bound on how many test tubes we might need for a given problem instance. We order the amplicons by number of available primer pairs, and the amplicons with fewer available primers are explored first. Obviously the search space grows exponentially, but we only want to explore the first amplicons which are the most “problematic” in order to estimate the minimum number of test tubes needed for a given problem. Table 3–2 shows the heuristic algorithm output for a problem instance. In this instance the first 11 amplicons considered (out of 39), require 7 test tubes, providing a trivial lower bound of 7.

Table 3–2: Heuristic algorithm results for a particular primer selection problem

Order   Amplicon   Minimum number of test tubes
1       12         2
2       20         3
3       31         4
4       17         5
5       5          6
6       7          6
7       6          6
8       18         6
10      1          7
11      30         7

3.8.2 Solving multiple PCR: Simulated annealing

Simulated annealing requires calculating ∆E at each step which, if done naively, is computationally expensive: $O(N^2)$. However, some simplifications can speed up the algorithm. If we assume that we want to change only one output, $x_k(t+1) \neq x_k(t)$, the energy difference is (see mathematical details in appendix 5)

$$\Delta E = \left(x_k(t) - x_k(t+1)\right) \left[ \sum_{i \neq k} \delta_{i,k}\, x_i(t)\, w_{i,k} \right]$$

This equation is faster to calculate because it is $O(N)$.

3.8.3 Multiple PCR: Getting as many amplicons in one test tube

An extension to the multiple test tube problem is to include as many amplicons as possible in only one test tube. We can solve this problem by assuming that we have a test tube and a garbage can. Whenever an amplicon is in the garbage can we ignore any conflicts with other primers. The equations are similar to the equations in section 3.8: we solve $\min_{\bar{x},\Upsilon}(E)$ for equation 3.4. We define test tube number 1 as the garbage can, thus

$$\delta_{i,j} = \begin{cases} 1 & \text{if } \Upsilon[f(r_i, r)] = \Upsilon[f(r_j, r')] = tt_2 \text{ for any } r, r' \in M \\ 0 & \text{otherwise} \end{cases}$$
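The following sketch puts together the energy of equation 3.4, the single-flip energy difference of section 3.8.2 and the garbage can indicator defined above. It is only an illustration under stated assumptions: W is symmetric, self-weights are left out of the sums, outputs are binary, and the names `amplicon_of`, `tube_of` and `garbage` are hypothetical data structures chosen for this example.

```python
def delta(i, j, amplicon_of, tube_of, garbage=None):
    """1 if primers i and j belong to amplicons in the same non-garbage test tube."""
    ti = tube_of[amplicon_of[i]]
    tj = tube_of[amplicon_of[j]]
    return 1 if ti == tj and ti != garbage else 0


def energy(x, w, amplicon_of, tube_of, garbage=None):
    """Equation 3.4 restricted to pairs i != j; costs O(N^2)."""
    n = len(x)
    return -0.5 * sum(
        delta(i, j, amplicon_of, tube_of, garbage) * x[i] * w[i][j] * x[j]
        for i in range(n) for j in range(n) if i != j
    )


def delta_energy(k, new_xk, x, w, amplicon_of, tube_of, garbage=None):
    """Energy change when only node k changes to new_xk; costs O(N)."""
    field = sum(
        delta(i, k, amplicon_of, tube_of, garbage) * x[i] * w[i][k]
        for i in range(len(x)) if i != k
    )
    return (x[k] - new_xk) * field


# Toy instance: primers 0,1 belong to amplicon 0 and primers 2,3 to amplicon 1.
amplicon_of = [0, 0, 1, 1]
w = [
    [0.0, -0.5, -1.0, 0.0],
    [-0.5, 0.0, 0.0, 0.0],
    [-1.0, 0.0, 0.0, -0.5],
    [0.0, 0.0, -0.5, 0.0],
]
x_before = [1, 0, 1, 0]
x_after = [1, 0, 0, 0]  # node 2 switched off

same_tube = ["tt2", "tt2"]  # both amplicons share the test tube
fast = delta_energy(2, 0, x_before, w, amplicon_of, same_tube)
slow = energy(x_after, w, amplicon_of, same_tube) - energy(x_before, w, amplicon_of, same_tube)

with_garbage = ["tt2", "tt1"]  # amplicon 1 is moved to the garbage can
e_garbage = energy(x_before, w, amplicon_of, with_garbage, garbage="tt1")
```

In this toy instance the O(N) difference agrees with the full recomputation, and moving the conflicting amplicon to the garbage can removes its contribution to the energy.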

3.9 Discussion

The algorithms for finding the multiple PCR primers for one or multiple test tubes give results that seem to be nearly optimal when compared with a heuristic algorithm that explores the whole search space. In most cases that are interesting for practical reasons, there seems to be either no solution (one test tube) or a solution that has more test tubes than desired (multiple test tubes). So even if our algorithms are able to find good or optimal solutions, these are not good enough to be interesting for real-life application in a wet-lab. Another important limiting factor is that the technique for performing multiple PCR in a wet-lab seems to be complicated and error prone. So far there has not been much success in performing multiple PCR of bisulfited DNA in our labs.

It seems that the multiple PCR problem, at least in the cases we have studied, does not have a solution interesting enough for biologists to implement it. We do not know what percentage of multiple PCR problems have good solutions (one or only a few test tubes), or under what conditions "reasonable" solutions exist. Even though further study of the problem could be interesting, we have to keep in mind that alternative solutions to this problem do exist (e.g. pipetting & PCR robots that can perform many PCRs in parallel).

CHAPTER 4
Gene sets analysis

4.1 Introduction

Usually, as a result of a high-throughput experiment, we want to interpret the biological meaning of large amounts of data, e.g. in the previous chapters we had methylation information for all the probes in a microarray from an MeDIP experiment.

In order to understand the meaning of these large genomic data sets, there are several efforts that map genes into "biologically meaningful information". The most well-known are: Gene Ontology (GO) [53], which provides a controlled vocabulary to describe genes and gene product attributes in many organisms; Kyoto Encyclopedia of Genes and Genomes (KEGG) [31], a collection of enzymatic pathways; and Molecular Signatures Database (MSigDB), a collection of gene sets for use with GSEA software [45].

In the Gene Ontology project, genes are mapped to one or more nodes in a graph, called GO-terms. These GO-terms are part of a directed acyclic graph (DAG), a tree-like hierarchical structure (see Figure 4–1). In a DAG, parent nodes contain all child nodes but, unlike in a tree, a node in a DAG can have more than one parent. Furthermore, one gene can have many GO-terms associated. There are three DAGs, called ontologies, where GO-terms are annotated using a controlled vocabulary that defines the function of the genes they contain. The ontologies are molecular function, biological processes and cellular components. In our methods, we do not make explicit use of this DAG structure, so the methodologies can be applied to other information sources (not just GO).

The problem in this chapter is the identification of biologically meaningful gene sets from our experimental data. The most well-known approaches to this problem use either a hypergeometric distribution, a χ² probability distribution, or some sort of Brownian bridge statistics (more details in the next sections).

In our approach, we used two different methods, one based on Information Theory (for labelled data) and one based on non-parametric statistics, "Rank Sum Statistics", to be used with ranked data. Both proposed methods show significant performance improvement compared to well-known methodologies. In particular, the "Rank Sum Statistics" method not only shows an improvement of more than 100% over GSEA (the most commonly used non-parametric algorithm), but is also significantly faster (by up to two orders of magnitude). This software is available as an open source package at http://rssgsc.sourceforge.net/.

4.2 Previous work

Biologically meaningful gene sets (for example from GO or KEGG) are usually identified using the following steps [43, 11, 13, 16, 39, 20, 36, 12, 37]: i) genes are ordered using experimental data, ii) a threshold is defined and genes above that threshold are marked as "interesting", iii) the significance of the overlap between each gene set and the set of interesting genes is calculated, iv) multiple testing correction is applied [8] and v) gene sets are ranked by the corrected significance. The algorithm is shown in Table 4–1.

Figure 4–1: GO structure: A directed acyclic graph (DAG) for each ontology. Figure from [53]

In our experimental data we have a set of N genes, n of which are "interesting genes". We want to analyze a GO-term T containing $N_T$ genes, $n_T$ of which are "interesting" (see Figure 4–2). Intuitively we can explain how to calculate p-values using the following analogy: assume we have an urn with 100 marbles (N genes), 30 of them red (n interesting genes) and 70 white. We draw 10 marbles (GO-term T containing $N_T$ genes). The probability of having drawn exactly 6 red marbles ($n_T$) is calculated using a hypergeometric distribution [14]

$$f(n_T; N, n, N_T) = \frac{\binom{n}{n_T}\binom{N-n}{N_T-n_T}}{\binom{N}{N_T}}$$

The probability of having drawn 6 or more red marbles is calculated using Fisher's exact test, which is a sum of hypergeometric probabilities [14]:

$$F(n_T; N, n, N_T) = \sum_{k=n_T}^{n} f(k; N, n, N_T) = \sum_{k=n_T}^{n} \frac{\binom{n}{k}\binom{N-n}{N_T-k}}{\binom{N}{N_T}}$$

As Fisher's exact test is hard to compute, a Chi-square approximation or a z-score is often used [13], [39].

There are some difficulties with this approach (the algorithm in Table 4–1): i) the number of significant gene sets might be too large to interpret manually, and ii) gene set significance depends on an arbitrary threshold.
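The urn analogy above can be checked numerically. The sketch below is a direct transcription of the two formulas using Python's standard `math.comb`; the function names `hypergeom_pmf` and `fisher_tail` are chosen for this example only.

```python
from math import comb


def hypergeom_pmf(k, N, n, N_T):
    """f(k; N, n, N_T): probability of exactly k red marbles among N_T drawn."""
    if k < 0 or k > n or N_T - k < 0 or N_T - k > N - n:
        return 0.0
    return comb(n, k) * comb(N - n, N_T - k) / comb(N, N_T)


def fisher_tail(n_T, N, n, N_T):
    """F(n_T; N, n, N_T): probability of n_T or more red marbles.

    Terms with k > N_T are zero, so the sum stops at min(n, N_T).
    """
    return sum(hypergeom_pmf(k, N, n, N_T) for k in range(n_T, min(n, N_T) + 1))


# Urn with N = 100 marbles, n = 30 red; we draw N_T = 10 marbles.
p_exactly_6 = hypergeom_pmf(6, 100, 30, 10)
p_6_or_more = fisher_tail(6, 100, 30, 10)
total = sum(hypergeom_pmf(k, 100, 30, 10) for k in range(0, 11))
```

With these numbers the probability of exactly 6 red marbles is roughly 3%, and the tail probability is slightly larger since it also counts 7 to 10 red marbles.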

Table 4–1: Algorithm for ranking gene sets using p-values

Problem: Extract biologically meaningful information from the results of a high-throughput experiment.
Input:
    G = {g1, g2, ..., gN} : A gene set
    V = {v1, v2, ..., vN} : Experimental results for each gene
    th : A threshold
    Θ = {T1, T2, ..., TH} : A collection of gene sets (e.g. GO-terms)
Output: Gene sets ranked by p-value
Assumptions: Very low p-values indicate a gene set is "meaningful", so biological interpretation of low p-value gene sets should "explain the data"
Algorithm:
    Calculate a set of "interesting genes" I ⊂ G: gi ∈ I ⇔ vi ≥ th
    For each T ∈ Θ
        Calculate p-value
            N = |G| : Number of genes in the experiment
            n = |I| : Number of "interesting" genes
            NT = |T| : Number of genes in T
            nT = |I ∩ T| : Number of "interesting" genes in T
            p-value $= F(n_T; N, n, N_T) = \sum_{k=n_T}^{n} f(k; N, n, N_T) = \sum_{k=n_T}^{n} \binom{n}{k}\binom{N-n}{N_T-k} \Big/ \binom{N}{N_T}$
    Perform multiple testing correction
    Rank gene sets in Θ by p-value
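A compact sketch of the procedure in Table 4–1 is given below. The table does not fix a particular multiple testing correction, so a Bonferroni correction is used here purely as an assumption; the function names and the toy data are also hypothetical.

```python
from math import comb


def fisher_p(n_T, N, n, N_T):
    """Upper tail of the hypergeometric distribution (Fisher's exact test)."""
    upper = min(n, N_T)  # terms beyond N_T are zero
    tail = sum(comb(n, k) * comb(N - n, N_T - k) for k in range(n_T, upper + 1))
    return tail / comb(N, N_T)


def rank_gene_sets(values, gene_sets, th):
    """Rank gene sets by corrected p-value, following the steps of Table 4-1."""
    genes = set(values)
    interesting = {g for g, v in values.items() if v >= th}
    N, n = len(genes), len(interesting)
    ranked = []
    for name, members in gene_sets.items():
        T = set(members) & genes
        p = fisher_p(len(T & interesting), N, n, len(T))
        # Bonferroni correction: an assumption, not prescribed by the table.
        ranked.append((name, min(1.0, p * len(gene_sets))))
    return sorted(ranked, key=lambda item: item[1])


# Toy data: 10 genes, the first 4 have high experimental values.
values = {f"g{i}": (0.9 if i < 4 else 0.1) for i in range(10)}
gene_sets = {
    "A": ["g0", "g1", "g2"],  # all members interesting
    "B": ["g3", "g5", "g6"],  # only one member interesting
}
ranking = rank_gene_sets(values, gene_sets, th=0.5)
```

Note that the ranking depends on the threshold `th`, which is exactly the difficulty ii) mentioned in the text.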

An algorithm that attempts to solve the first problem is the "elim algorithm" [8] (see Figure 4–3). The elim method investigates the nodes in the GO graph bottom-up. The level of a node is defined as the minimum number of nodes to reach the root of the DAG (high level nodes are more specific). The algorithm starts by processing the nodes from the highest level (bottom-most) and then iteratively moves to lower level nodes (towards the root). The underlying assumption is that high level nodes are more specific, so they are more useful in describing the biological meaning of an interesting set of genes. A Fisher exact test (or an approximation) is performed for each node. If the p-value is less than a threshold, the genes in the node are "marked" and propagated as marked to every parent node (upper induced graph). When a node is processed, the genes that have been marked are removed before calculating p-values. While trying to solve this problem, the elim method creates another problem: we need to find elim's threshold value (although the authors propose a value, it seems rather "ad hoc" for each problem).

Figure 4–2: In an experiment analyzing N genes, n are "interesting" (red). Gene set T (shown in green) has $N_T$ genes, $n_T$ of which are interesting.

Motivation: We mentioned that what most current algorithms do is to calculate p-values for every gene set and produce a p-value ordered collection of all gene sets. We would like to extract the most meaningful gene sets. We attack problem i by comparing collections of gene sets. Given two collections of gene sets Θ1 = {T1, T2, T3} and Θ2 = {T4, T5, T6, T7}, we need a way to decide if Θ1 is more informative than Θ2. We propose a solution using mutual information [3].

4.3 Mutual information

For intuition consider the following toy example. Suppose that we have three significant gene sets A, B, and B′, but we would like to select only two, hopefully the two most informative with respect to the set of interesting genes I. Suppose also that B has the most significant overlap with I and that B′ is almost identical to B. Note that B′ will have a more significant overlap with I than A. In this case, standard methods would pick B and B′ even though B with A is intuitively more informative (see Figure 4–4). We will show that ranking by mutual information agrees with this intuition.

Figure 4–3: Elim algorithm. Figure from [8]

Mutual information between two random variables measures the information "gained" when one random variable is known. Here one random variable is related to the set of interesting genes I, and the other random variable is related to a collection of gene sets (e.g. {B, A}), so mutual information tells us how much information the collection of gene sets gives us about the interesting set. In our example, mutual information for gene sets {B, A} will be larger than for gene sets {B, B′}. Thus, by maximizing mutual information we can find the most informative collection of gene sets with respect to the set of interesting genes.

For a random variable X which is distributed as $p_i = p(X = x_i)$, the information is defined as $I(x_i) = -\log_2[p(x_i)]$. The entropy of a data source is the average number of bits per symbol needed to encode it. The entropy is just the expected information $H(X) = E[I(X)] = -\sum_i p(x_i)\log_2[p(x_i)]$. Likewise, given a conditional probability distribution $p(X|Y)$, the conditional entropy for X given that $Y = y_i$ is $H(X|Y = y_i) = -\sum_j p(x_j|y_i)\log_2[p(x_j|y_i)]$. The overall conditional entropy is just a weighted sum, $H(X|Y) = \sum_i p(y_i)\, H(X|Y = y_i)$, so $H(X|Y) = -\sum_{i,j} p(y_i)\, p(x_j|y_i)\log_2[p(x_j|y_i)]$. Using the product rule $p(x_j, y_i) = p(y_i)\,p(x_j|y_i)$, we get $H(X|Y) = -\sum_{i,j} p(x_j, y_i)\log_2[p(x_j|y_i)]$. Intuitively, $H(X|Y)$ represents the entropy that remains when random variables X and Y are observed, but only Y is 'revealed'.

Figure 4–4: Gene sets A, B and B′ and set of interesting genes I (red)

Mutual information is defined as $I(X|Y) = H(X) - H(X|Y)$. Intuitively, if the initial entropy is $H(X)$ and the entropy remaining after Y is 'revealed' is $H(X|Y)$, then the difference $H(X) - H(X|Y)$ is the information gained about X.

4.3.1 Mutual information for gene sets

Using our first example (see Figure 4–2), we want to calculate the mutual information $I(I|T)$, where I is the set of interesting genes and T is the gene set we are analyzing (e.g. a GO-term). Let G be a randomly picked gene, then

$$\begin{aligned} P(G \in I \mid G \in T) &= \frac{n_T}{N_T} & P(G \notin I \mid G \in T) &= 1 - \frac{n_T}{N_T} \\ P(G \in I \mid G \notin T) &= \frac{n - n_T}{N - N_T} & P(G \notin I \mid G \notin T) &= 1 - \frac{n - n_T}{N - N_T} \end{aligned} \qquad (4.1)$$

Then, applying the entropy definition,

$$H(I) = -\left\{ P(G \notin I)\log_2[P(G \notin I)] + P(G \in I)\log_2[P(G \in I)] \right\}$$

So, the conditional entropy is calculated as:

$$\begin{aligned} H(I|T) = &-P(G \in T)\,\{P(G \in I \mid G \in T)\log_2[P(G \in I \mid G \in T)] \\ &\qquad + P(G \notin I \mid G \in T)\log_2[P(G \notin I \mid G \in T)]\} \\ &-P(G \notin T)\,\{P(G \in I \mid G \notin T)\log_2[P(G \in I \mid G \notin T)] \\ &\qquad + P(G \notin I \mid G \notin T)\log_2[P(G \notin I \mid G \notin T)]\} \end{aligned} \qquad (4.2)$$

Then, mutual information is $I(I|T) = H(I) - H(I|T)$. So, for each node we have a measurement $I(I|T)$ that can be used to compare gene sets (instead of using p-values from Fisher's test).

Now that we have a formula for one gene set T we can try to extend it to a collection of gene sets Θ = {T1, T2, ..., Ts}. For a given collection of s gene sets, there are $2^s$ non-overlapping gene sets (remember that a gene may belong to many GO-terms); we call the collection of non-overlapping gene sets Θ*. This means that the mutual information equation has $2^s$ terms. For example, if Θ = {T1, T2, T3}, we can enumerate $2^3 = 8$ disjoint gene sets in Θ* as follows:

Number   Set T
0        G \ (T1 ∪ T2 ∪ T3)
1        T1 \ (T2 ∪ T3)
2        T2 \ (T1 ∪ T3)
3        (T1 ∩ T2) \ T3
4        T3 \ (T1 ∪ T2)
5        (T1 ∩ T3) \ T2
6        (T2 ∩ T3) \ T1
7        T1 ∩ T2 ∩ T3
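The numbering in the table above coincides with reading a gene's memberships as a binary number (T1 as bit 0, T2 as bit 1, T3 as bit 2), which suggests a simple way to build the disjoint sets. The sketch below uses this to compute mutual information for a collection of gene sets, weighting the entropy of each disjoint set by the fraction of genes it contains. Function names and the toy gene sets (chosen to mimic the A, B, B′ example of section 4.3) are assumptions made for illustration.

```python
from collections import Counter
from math import log2


def cell_number(gene, gene_sets):
    """Bit j is set when the gene belongs to gene_sets[j] (T1 -> bit 0, T2 -> bit 1, ...)."""
    return sum(1 << j for j, T in enumerate(gene_sets) if gene in T)


def binary_entropy(p):
    """Entropy in bits of a two-outcome variable with probabilities p and 1 - p."""
    return 0.0 if p in (0.0, 1.0) else -(p * log2(p) + (1 - p) * log2(1 - p))


def mutual_information(genes, interesting, gene_sets):
    """H(I) minus the conditional entropy over the disjoint sets induced by gene_sets."""
    size, hits = Counter(), Counter()
    for g in genes:
        c = cell_number(g, gene_sets)
        size[c] += 1
        hits[c] += g in interesting  # True counts as 1
    N = len(genes)
    h_i = binary_entropy(len(interesting) / N)
    h_cond = sum(size[c] / N * binary_entropy(hits[c] / size[c]) for c in size)
    return h_i - h_cond


# Toy example: 100 genes, the first 20 are interesting.
genes = list(range(100))
I = set(range(20))
B = set(range(10)) | {20, 21}          # strong overlap with I
B_prime = B - {9}                      # almost identical to B
A = set(range(10, 18)) | {22, 23, 24}  # covers a different part of I

mi_B_A = mutual_information(genes, I, [B, A])
mi_B_Bprime = mutual_information(genes, I, [B, B_prime])
```

In this toy setting the pair {B, A} yields more mutual information than {B, B′}, in line with the intuition discussed in section 4.3.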

The conditional entropy is

$$H(I|\Theta^*) = \sum_{T \in \Theta^*} p(G \in T)\, H(I|T)$$

where $H(I|T)$ is calculated as shown in equations 4.2 and 4.1. To select gene sets using the information theory approach we perform a simple greedy search, as shown in Table 4–2. On each iteration we select the term that best improves $I(I|\Theta^*)$ and add it to our collection of selected gene sets.

4.3.2 Simulations and results

We performed several simulations to test our algorithm and compare the results to other well-known algorithms. We randomly selected GO-terms and genes, and then used three algorithms to try to recover the GO-terms selected originally. The methodology for selecting GO-terms and genes for our simulation is shown in Table 4–3. Table 4–4 shows the mean gene set recovery rate (how many gene sets were correctly recovered) for the three algorithms tested ("p-values", "Elim" and "Information theory"); the p-value is calculated using a Wilcoxon test between our method and the best of the other methods. The mean recovery rate for Table 4–4 is calculated as the number of correctly recovered GO-terms divided by the number of GO-terms selected originally. Our information theory approach has a higher mean recovery rate. The comparison shows that our "Information theory" algorithm performs better than the other algorithms under all tested conditions (different values of gene selection probability and noise-to-signal ratio).

Table 4–2: Greedy algorithm optimizing mutual information

Problem: Select gene sets maximizing mutual information.
Input:
    G = {g1, g2, ..., gN} : A gene set
    V = {v1, v2, ..., vN} : Experimental results for each gene
    I : A set of interesting genes
    Θ = {T1, T2, ..., TH} : A collection of gene sets (e.g. GO-terms)
    s : Number of gene sets to be selected
Output: $\hat{\Theta}$ : A collection of s gene sets
Algorithm:
    Initialize $\hat{\Theta} = \emptyset$
    Repeat s times (greedy search):
        Initialize $m_{max} = 0$, $T_{max} = \emptyset$
        For each T ∈ Θ
            Create temporary collection: $\Theta_{tmp} = \hat{\Theta} \cup T$
            Create a disjoint sets collection: $\Theta^*_{tmp}$
            Calculate mutual information: $m = I(I|\Theta^*_{tmp})$
            if ($m > m_{max}$) ⟹ $T_{max} = T$; $m_{max} = m$
        $\hat{\Theta} = \hat{\Theta} \cup T_{max}$
    Return $\hat{\Theta}$

Table 4–3: Methodology for comparing algorithms in our simulations

Initialize:
    Number of gene sets to select: s
    Symbol (gene) selection probability: p
    Noise to signal ratio: ns
    Set of interesting genes (empty): I = ∅
Select gene sets:
    Randomly select s gene sets: Θori = {T1, T2, ..., Ts}
Select genes:
    For each gene set T ∈ Θori
        Randomly select genes g ∈ T with a probability p
        Add each selected gene g to set I = I ∪ {g}
Add 'noise':
    Calculate the number of "noise" genes to add (number of genes in I times noise to signal ratio): ng = ns |I|
    Randomly select ng genes that do not belong to I, add those genes to I.
Apply algorithms:
    For each algorithm {"p-value", "Elim", "Information theory"}
        $\hat{\Theta}$ = Apply algorithm selecting best s gene sets
        Recovery rate: $rr_{algorithm} = |\hat{\Theta} \cap \Theta_{ori}| / |\Theta_{ori}|$
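The simulation methodology of Table 4–3 can be sketched as follows. This is an illustrative outline rather than the code used for our experiments: the function names, the toy gene sets and the use of Python's `random.Random` as the random source are all assumptions.

```python
import random


def simulate_interesting(gene_sets, all_genes, s, p, ns, rng):
    """Pick s gene sets, sample their genes with probability p, then add noise genes."""
    original = set(rng.sample(sorted(gene_sets), s))
    interesting = set()
    for name in sorted(original):
        for g in sorted(gene_sets[name]):
            if rng.random() < p:
                interesting.add(g)
    # Noise: genes outside I, as many as ns times the current size of I.
    candidates = sorted(set(all_genes) - interesting)
    n_noise = min(len(candidates), round(ns * len(interesting)))
    interesting.update(rng.sample(candidates, n_noise))
    return original, interesting


def recovery_rate(selected, original):
    """Fraction of the originally selected gene sets that were recovered."""
    return len(set(selected) & set(original)) / len(original)


# Hypothetical toy universe of 50 genes and three gene sets.
all_genes = [f"g{i}" for i in range(50)]
gene_sets = {
    "T1": {f"g{i}" for i in range(0, 10)},
    "T2": {f"g{i}" for i in range(5, 15)},
    "T3": {f"g{i}" for i in range(20, 30)},
}
rng = random.Random(0)
original, interesting = simulate_interesting(gene_sets, all_genes, s=2, p=0.5, ns=1.0, rng=rng)
```

Any gene set selection algorithm can then be scored by calling `recovery_rate` with the gene sets it returns and `original`.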

Table 4–4: Algorithm comparison: Mean recovery rate (and standard deviation) for different numbers of gene sets "s" (based on 8500 simulations). Gene selection probability p ∈ [0.1, ..., 0.7], noise to signal ratio ns ∈ [10%, ..., 260%]

Number of gene sets S   Elim          Information theory   Fisher        Significance (p-value)
1                       0.39 ±0.49    0.66 ±0.47           0.65 ±0.48    0.03754
2                       0.59 ±0.63    1.03 ±0.69           0.88 ±0.45    0.01024
3                       0.74 ±0.73    1.40 ±0.83           1.01 ±0.47    < 2.2e−16
4                       0.92 ±0.83    1.74 ±0.99           1.04 ±0.51    < 2.2e−16
5                       1.06 ±0.92    2.01 ±1.12           1.11 ±0.56    < 2.2e−16

4.4 Ranked list

In the previous section we assumed that we had a set of genes G = {g1, ..., gN}, a set V = {v1, ..., vN} of experimental results for each gene and a subset of "interesting" genes I ⊂ G. To select these interesting genes, usually a threshold th is defined and all genes gi having values vi ≥ th are considered "interesting". Usually there is no exact way to define the threshold; this was mentioned in section 4.2 as problem ii, and the methods we have mentioned so far also have this problem. In order to avoid it, we use rank statistics. There are algorithms that use some type of rank statistics; the most well-known is GSEA [45], [10], which performs Brownian bridge statistics. In this section we use rank sum statistics. We assume genes are ranked by their experimental values, i.e. we sort genes by V and assign a rank ri ∈ {1, ..., N}. So all genes are ordered and assigned a rank in the list:

Rank   Gene
1      $g_{i_1}$
2      $g_{i_2}$
3      $g_{i_3}$
...    ...
N      $g_{i_N}$

Then we define the rank sum of a set of k genes as $R = \sum_{i=1}^{k} r_i$. There is a relationship between the ranked list and the set of "interesting" genes I. Genes are "interesting" if they are high in the ranking (i.e. above a threshold). This means that the mean rank of genes in I is less than $r_{mean}$ (a rank mean value threshold) or the rank sum of genes in I is less than $r_{sum}$ (a rank sum value threshold). For instance, if there are N = 1000 genes in the ranked list and we say that the first 100 are the "interesting" genes, this is equivalent to saying that the average rank is $r_{mean} \le 50.5$, or that the rank sum is $r_{sum} \le 101 \cdot 50 = 5050$. So for a set of genes T, the probability of a gene being interesting $P(G \in I \mid G \in T)$ is analogous to the rank sum cumulative probability for the genes in T, $P(R \le r_{sum} \mid T)$.

The probability density function for a random variable that represents a rank sum is derived in appendix B. As calculating these probabilities is computationally intensive, we also derive formulas to approximate these probability density functions.

4.4.1 Simulations and results

We applied the rank sum probabilities in a simulation similar to the one described in section 4.3.2. We performed simulations comparing two simple algorithms, one using p-values (selecting the gene sets with the lowest p-values) and another that creates sets using a greedy strategy (at each iteration we keep the set that, when incorporated into the current set, has the lowest p-value). Table 4–5 shows the criteria used for gene selection, and Table 4–6 shows the mean recovery rate for both algorithms. The mean recovery rate is calculated the same way as in section 4.3.2. Table 4–7 shows a comparison with GSEA; in our simulations, our algorithm performed over 100% better (for this simulation MSigDB C2 was used). It is important to take into account that our algorithm is up to two orders of magnitude faster than GSEA.

The methodology shown in Table 4–5 uses a β distribution to generate ranked genes. In order to make sure that our performance improvement was not just a result of how these parameters are chosen or of the selected gene set collection, we performed simulations for different β distributions, also using another gene set collection (this time MSigDB C5, Gene Ontology, was used). Table 4–8 shows the results of these simulations; again the performance improvement was over 100% (between 155% and 188%).
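To give a feeling for how a rank sum can be turned into a score, the sketch below uses the standard normal approximation for the sum of k ranks drawn without replacement from {1, ..., N} (mean k(N+1)/2, variance k(N−k)(N+1)/12). This approximation is swapped in for illustration only; it is not the exact distribution nor the approximation formulas derived in appendix B, and the function names are hypothetical.

```python
from math import erf, sqrt


def rank_sum_z(ranks, N):
    """Standardized rank sum of the given ranks among N ranked genes."""
    k = len(ranks)
    mean = k * (N + 1) / 2
    var = k * (N - k) * (N + 1) / 12
    return (sum(ranks) - mean) / sqrt(var)


def rank_sum_p(ranks, N):
    """Approximate P(R <= observed rank sum) for a random set of the same size."""
    z = rank_sum_z(ranks, N)
    return 0.5 * (1 + erf(z / sqrt(2)))


N = 1000
top = list(range(1, 101))         # the first 100 genes of the ranked list
spread = list(range(5, N, 10))    # 100 genes spread evenly over the list

top_sum = sum(top)                # 101 * 50 = 5050, as in section 4.4
p_top = rank_sum_p(top, N)
p_spread = rank_sum_p(spread, N)
```

A gene set concentrated at the top of the ranking gets a probability close to zero, whereas a gene set spread evenly over the list gets a probability close to 1/2.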

Table 4–5: Methodology for selecting GO-terms and genes for our simulation

Initialize:
    Number of gene sets to select: s
    Symbol (gene) selection probability: p
    Noise to signal ratio: ns
    A set of genes (empty): G = ∅
    Experimental values (empty): V = ∅
Select gene sets:
    Randomly select s gene sets: Θori = {T1, T2, ..., Ts}
Select genes:
    For each gene set T ∈ Θori
        Randomly select genes g ∈ T with a probability p
        Add each selected gene g to set G = G ∪ {g}
        Assign an experimental value v ∼ β(β1, β2) using a beta distribution
        Add corresponding values v to set V = V ∪ {v}
Add 'noise':
    Calculate the number of "noise" genes to add (number of genes in G times noise to signal ratio): ng = ns |G|
    Randomly select ng genes that do not belong to G, add those genes to G.
    Assign experimental values v ∼ U(0, 1) using a uniform distribution to the genes recently added to G.
Rank genes:
    Sort genes by experimental value v and assign ranks
Apply algorithms:
    For each algorithm {"p-value", "p-value Greedy", "GSEA"}
        $\hat{\Theta}$ = Apply algorithm selecting best s gene sets
        Recovery rate: $rr_{algorithm} = |\hat{\Theta} \cap \Theta_{ori}| / |\Theta_{ori}|$

Table 4–6: Algorithm comparison: Mean recovery rate (and standard deviation) for different numbers of gene sets "s" (based on 38000 simulations, gene selection probability p ∈ [0.1, ..., 0.9], noise to signal ratio ns ∈ [100%, ..., 2000%])

Number of gene sets: S   "RankSum"     "Greedy RankSum"
1                        0.30 ±0.46    0.32 ±0.46
4                        1.06 ±0.61    1.60 ±1.12
7                        1.64 ±0.80    3.07 ±1.59
10                       2.22 ±1.02    4.65 ±1.98
13                       2.78 ±1.20    6.26 ±2.39
16                       3.31 ±1.33    7.93 ±2.71
19                       3.81 ±1.45    9.47 ±3.04
22                       4.36 ±1.57    11.06 ±3.25
25                       4.89 ±1.67    12.73 ±3.64
28                       5.37 ±1.80    14.09 ±3.91

Table 4–7: Algorithm comparison with GSEA using MSigDB set C2 (curated gene sets). Mean recovery rate and standard deviation, based on 300 simulations; p-value is less than $2.2 \times 10^{-16}$.

                            "GSEA"          "Greedy RankSum"
Number of gene sets: S
    10                      17.4% ±9.9%     39.7% ±18.3%
Selection probability: p
    30.0%                   16.8% ±10%      36.6% ±16%
    50.0%                   13.2% ±8%       29.1% ±15%
    70.0%                   22.4% ±9%       53.5% ±14%
Noise to signal: ns
    100.0%                  13.2% ±8%       29.1% ±15%
    500.0%                  19.6% ±10%      45.1% ±17%

Table 4–8: Algorithm comparison with GSEA using MSigDB set C5 (Gene Ontology). Mean recovery rate and standard deviation, based on more than 800 simulations with different parameters. Number of gene sets is 10. Noise to signal ratio is 500%. P-value is less than $2.2 \times 10^{-16}$.

Selection probability   α   β   "GSEA"         "Greedy RankSum"   Improvement
70%                     3   1   2.09 ± 0.80    6.02 ± 1.52        188%
30%                     4   1   2.15 ± 0.99    5.49 ± 1.91        155%
50%                     4   1   2.55 ± 1.03    6.91 ± 1.37        172%
70%                     4   1   2.45 ± 0.93    7.07 ± 1.79        188%
70%                     5   1   2.85 ± 1.08    8.00 ± 1.25        181%

4.5 Discussion

We mentioned that some difficulties exist when interpreting results from current algorithms for finding biologically meaningful gene sets from high-throughput experimental data, mainly due to the length of the results.

In section 4.3.1, we proposed a method based on information theory to solve this problem by selecting a small collection of gene sets (the "most informative" ones). The selection of this small collection of gene sets is possible because we explore the search space using a greedy algorithm combined with our mutual information method. In many methods, gene sets containing redundant genes (i.e. genes that are contained in previously selected gene sets) can be chosen over gene sets containing novel genes (i.e. genes that are not in any previously chosen gene set). Our algorithm tends to disregard gene sets containing redundant genes.

In section 4.4, we proposed a method based on "Rank Sum Statistics" (non-parametric statistics) to avoid making an arbitrary threshold selection. This method uses a greedy algorithm for the same reasons as in the previous paragraph. As shown in our simulations, there is a performance improvement between 118% and 188% over GSEA. Moreover, our algorithm is up to two orders of magnitude faster. The software is available as an open source package at http://rssgsc.sourceforge.net/. Some people have mentioned that GSEA has to be "tuned" for optimal performance which is, of course, a rather difficult task unless you know the correct answer (but if you did, you probably would not need to run GSEA at all). This is yet another advantage of our algorithm: since there are no parameters, there is no need for "tuning" at all.

As future work, we would like to merge our information theory method with rank sum statistics, thus providing a more comprehensive methodology.

CHAPTER 5
Conclusions

New methods for analyzing epigenetic experimental data were proposed to estimate methylation levels, perform validation using low-throughput methodology, and properly interpret the biological meaning of the data.

To analyze DNA methylation high-throughput experimental data, we built a model of the MeDIP experimental process (Chapter 2). Based on that model, it was possible to estimate methylation values by "filtering" M-values reordered by genomic probe position. High correlation was obtained between our estimation and data from fully methylated DNA. As future work we would like to extend this method to use multiple chip data in the prediction, and also to cover some of the problems that arise when performing next generation bisulfited sequencing.

Low-throughput validation requires many PCR amplifications of bisulfited DNA. Multiple PCR could help to speed up the process. We showed that the problem of selecting the primers required for multiple PCR amplification of bisulfited DNA is NP-Complete. We proposed methods based on non-linear dynamic systems as well as simulated annealing to solve this problem and showed that they provide good solutions when compared to lower bounds obtained by heuristic algorithms.

Finally, we analyzed the problem of interpreting the biological meaning of high-throughput experimental data based on Gene Ontology analyses. Methods based on information theory and rank sum statistics were proposed in order to obtain the most informative functional annotation. On simulated data, our methods performed better than well-known methods. As future work we would like to integrate both proposed methods into a comprehensive one.

Appendix A

Important: By default we assume that $r_i$, $g_i$ and $m_i$ have been reordered by genomic position (i.e. chromosome and start position). We will also use some signal processing techniques [57], so we may refer to a set of values ordered by position as "a signal". Furthermore, we often use analogies with "time" as the reference variable, instead of genomic position¹.

A.1 Sonication score: Formulation details

Here we show the details of the formulas for the "sonication score" introduced in section 2.4. As we already mentioned, we consider all the possible fragments for each possible number of 'methylation sites' and multiply all probabilities of each fragment size. This is equivalent to performing a convolution between the distribution function and the 'methylation sites'. So we call

• $P(s)$: Probability of DNA fragment size s after sonication (which has a distribution $s \sim LogNormal(\mu, \sigma)$).
• $x_p$: Probe's position
• $\lambda_p$: Half of the probe's length (e.g. probe's length $= 2\lambda_p = 60$ bases)

Assume we only have one CpG, located at $x_{cg}$ after the probe (i.e. $x_p < x_{cg}$). We define $P_{bind}(x)$ as the probability that a fragment starting at position x in the genome binds to probe $x_p$; this is the probability that the fragment's size is long enough to 'cover' both the whole probe and the site $x_{cg}$, as shown in Figure 5–1:

$$P_{bind}(x, x_{cg}, x_p, \lambda_p) = \begin{cases} P[s > (x_{cg} - x)] & \text{if } x \le (x_p - \lambda_p) < x_p < x_{cg} \\ 0 & \text{if } (x_p - \lambda_p) < x \le x_p < x_{cg} \end{cases} \qquad \text{(A-1)}$$

¹ Citing Emerson: "A foolish consistency is the hobgoblin of little minds"

Figure 5–1: Scoring for a given probe p (only 1 CpG)

If the CpG is located before the probe (i.e. $x_{cg} \le x_p$), the formulas are slightly different (see Figure 5–2):

Figure 5–2: Binding probability for a given probe when $x_{cg} < x_p$

$$P_{bind}(x, x_{cg}, x_p, \lambda_p) = \begin{cases} P[s > (x_p + \lambda_p - x)] & \text{if } x \le x_{cg} \le x_p \\ 0 & \text{if } x_{cg} < x \le x_p \end{cases} \qquad \text{(A-2)}$$

So, combining equations A-1 and A-2 we get

$$P_{bind}(x, x_{cg}, x_p, \lambda_p) = \begin{cases} 1 - F(x_{cg} - x) & \text{if } x \le (x_p - \lambda_p) < x_p < x_{cg} \\ 0 & \text{if } (x_p - \lambda_p) < x \le x_p < x_{cg} \\ 1 - F(x_p + \lambda_p - x) & \text{if } x \le x_{cg} \le x_p \\ 0 & \text{if } x_{cg} < x \le x_p \end{cases}$$

where $F()$ is the cumulative distribution function of our fragment size distribution (which is LogNormal, as stated in section 2.3). Using the step function $u(t)$ we can rewrite this formula

$$P_{bind}(x, x_p, \lambda_p, x_{cg}) = [1 - F(x_{cg} - x)]\,u(x_p - \lambda_p - x)\,u(x_{cg} - x_p) + [1 - F(x_p + \lambda_p - x)]\,u(x_{cg} - x)\,u(x_p - x_{cg} - 1)$$

Noting that $u(x_{cg} - x_p) = 1 - u(x_p - x_{cg} - 1)$, the previous formula can be stated as

$$P_{bind}(x, x_p, \lambda_p, x_{cg}) = u(x_{cg} - x_p)\,[1 - F(x_{cg} - x)]\,u(x_p - \lambda_p - x) + [1 - u(x_{cg} - x_p)]\,[1 - F(x_p + \lambda_p - x)]\,u(x_{cg} - x) \tag{A-3}$$

For most of this study, we will assume that the probe half-length $\lambda_p$ is small compared to the average distance to a methylation site, i.e. $\lambda_p \ll |x_{cg} - x_p|$. Then we can approximate equation A-3 by setting $\lambda_p = 0$, so:

$$P_{bind}(x, x_p, x_{cg}) = u(x_{cg} - x_p)\,[1 - F(x_{cg} - x)]\,u(x_p - x) + [1 - u(x_{cg} - x_p)]\,[1 - F(x_p - x)]\,u(x_{cg} - x) \tag{A-4}$$
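As a numerical illustration of equation A-4, the following Python sketch evaluates the binding probability for a single CpG. It assumes $u(0) = 1$; the function names and the LogNormal parameters used here ($\mu = \ln 500$, $\sigma = 0.5$) are hypothetical placeholders chosen only for illustration, not values taken from this work.

```python
import math

def lognormal_cdf(s, mu, sigma):
    """F(s): cumulative probability of a LogNormal(mu, sigma) fragment size."""
    if s <= 0:
        return 0.0
    z = (math.log(s) - mu) / (sigma * math.sqrt(2.0))
    return 0.5 * math.erfc(-z)

def step(t):
    """Step function u(t); we assume u(0) = 1."""
    return 1 if t >= 0 else 0

def p_bind(x, x_p, x_cg, mu, sigma):
    """Equation A-4 (lambda_p = 0): probability that a fragment starting
    at x covers both the probe at x_p and the CpG at x_cg."""
    cpg_after_probe = step(x_cg - x_p)
    term_after = (1.0 - lognormal_cdf(x_cg - x, mu, sigma)) * step(x_p - x)
    term_before = (1.0 - lognormal_cdf(x_p - x, mu, sigma)) * step(x_cg - x)
    return cpg_after_probe * term_after + (1 - cpg_after_probe) * term_before

# Hypothetical fragment-size parameters (median fragment of 500 bases).
MU, SIGMA = math.log(500.0), 0.5
```

With these placeholder parameters, a fragment that starts after the probe can never bind it, and fragments that start closer to the probe are more likely to reach a CpG located downstream.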

For a given probe located at $x_p$ we define the sonication value as the combined probability over all possible $x$

$$S_{pcg}(x_p, x_{cg}) = \sum_{x=-\infty}^{+\infty} P_{bind}(x, x_{cg}, x_p) \tag{A-5}$$

The shape of this function is shown in Figure 2–4. The total sonication value for probe $x_p$ can be found by summing $S_{pcg}(x_p, x_{cg})$ over all CpGs:

$$S_p(x_p) = \sum_{x_{cg} \in \{\text{All CpGs}\}} S_{pcg}(x_p, x_{cg})$$

$$S_p(x_p) = \sum_{x_{cg} \in \{\text{All CpGs}\}} \; \sum_{x=-\infty}^{+\infty} P_{bind}(x, x_{cg}, x_p, \lambda_p) \tag{A-6}$$

Combining equations A-4 and A-5 we get:

$$S_{pcg}(x_p, x_{cg}) = \sum_{x=-\infty}^{+\infty} \Big\{ u(x_{cg} - x_p)\,[1 - F(x_{cg} - x)]\,u(x_p - x) + [1 - u(x_{cg} - x_p)]\,[1 - F(x_p - x)]\,u(x_{cg} - x) \Big\}$$

defining

$$f(z) = 1 - F(-z)$$

we can write $S_{pcg}(x_p, x_{cg})$ as

$$S_{pcg}(x_p, x_{cg}) = u(x_{cg} - x_p) \sum_{x=-\infty}^{+\infty} f(x - x_{cg})\,u(x_p - x) + [1 - u(x_{cg} - x_p)] \sum_{x=-\infty}^{+\infty} f(x - x_p)\,u(x_{cg} - x)$$

changing variables $k = x - x_{cg}$ and $l = x - x_p$

$$= u(x_{cg} - x_p) \sum_{k=-\infty}^{+\infty} f(k)\,u(x_p - x_{cg} - k) + [1 - u(x_{cg} - x_p)] \sum_{l=-\infty}^{+\infty} f(l)\,u(x_{cg} - x_p - l)$$

now we define $\tau = x_{cg} - x_p$, then

$$= u(\tau) \sum_{k=-\infty}^{+\infty} f(k)\,u(-\tau - k) + [1 - u(\tau)] \sum_{l=-\infty}^{+\infty} f(l)\,u(\tau - l)$$

so, as expected, $S_{pcg}(x_p, x_{cg})$ only depends on the distance between the probe and the methylation site. We can write $S_{pcg}(x_p, x_{cg}) = S_\tau(\tau)$

$$S_\tau(\tau) = S_{pcg}(x_{cg} - x_p) = u(\tau) \sum_{k=-\infty}^{+\infty} f(k)\,u(-\tau - k) + [1 - u(\tau)] \sum_{l=-\infty}^{+\infty} f(l)\,u(\tau - l)$$

It is easy to see that the sums are convolutions

$$S_\tau(\tau) = u(\tau)\,[f * u](-\tau) + [1 - u(\tau)]\,[f * u](\tau) \tag{A-7}$$

Now we can reformulate equation A-6 based on A-7.

$$S_p(x_p) = \sum_{x_{cg} \in \{\text{All CpGs}\}} S_{pcg}(x_p, x_{cg})$$

We define $mth(x)$ as a function which is 1 if position $x$ is methylated and 0 otherwise

$$S_p(x_p) = \sum_{x=-\infty}^{\infty} S_\tau(x - x_p)\,mth(x)$$

$$S_p(x_p) = \sum_{x=-\infty}^{\infty} S_\tau(-(x_p - x))\,mth(x)$$

It is easy to see that $S_\tau(\tau) = S_\tau(-\tau)$, so $S_\tau$ is a symmetric function. Then

$$S_p(x_p) = \sum_{x=-\infty}^{\infty} S_\tau(x_p - x)\,mth(x)$$

so $S_p(x_p)$ is just a convolution sum

$$S_p(x_p) = [S_\tau * mth](x_p) \tag{A-8}$$
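The convolution in equation A-8 can be sketched numerically as follows. This is only an illustration under stated assumptions: $\lambda_p = 0$, $u(0) = 1$, the infinite sum is truncated at a hypothetical maximum fragment length, and the LogNormal parameters and function names are placeholders that do not come from this work.

```python
import math
from functools import lru_cache

# Hypothetical LogNormal parameters and truncation length (assumptions).
MU, SIGMA = math.log(500.0), 0.5
MAX_FRAGMENT = 5000

def survival(s):
    """1 - F(s): probability that a fragment is longer than s."""
    if s <= 0:
        return 1.0
    z = (math.log(s) - MU) / (SIGMA * math.sqrt(2.0))
    return 0.5 * math.erfc(z)

@lru_cache(maxsize=None)
def s_tau(tau):
    """S_tau(tau) with lambda_p = 0: sum over fragments that start k bases
    before the nearer of (probe, CpG) and must reach the farther one."""
    distance = abs(tau)
    return sum(survival(distance + k) for k in range(MAX_FRAGMENT))

def sonication_value(x_p, methylated_sites):
    """Equation A-8 evaluated at one probe: sum of S_tau(x_p - x) over
    the positions x where mth(x) = 1."""
    return sum(s_tau(x_p - x) for x in methylated_sites)
```

For example, `sonication_value(1000, [900, 1200])` adds the contributions of two methylated sites located 100 bases before and 200 bases after the probe.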

We now define $r(x)$ as an indicator function that has value 1 when there is a probe at position $x$ and 0 otherwise. We also define $S(x)$

$$S(x) = r(x)\,S_p(x) = r(x)\,[S_\tau * mth](x)$$

We usually work with M-values, which are the $\log_2$ of probe intensity, so we define the sonication score for probe $p$ in a microarray as the $\log_2$ of $S(x)$

$$s_p = \log_2[S(x_p)] \tag{A-9}$$

A.2 Sonication’s score autocorrelation

One of our hypotheses is that the sonication score is autocorrelated and that we can take advantage of that correlation to detect methylation status, so we want to find the autocorrelation function of $S_p(x_p)$. Let us assume we have two probes $i$ and $j$, separated by a distance $d$, whose sonication scores are $s_i$ and $s_j$ (see Figure 5–3).

Figure 5–3: Two probes centered at $x_i$ and $x_j$ share a common genomic region (red) where methylated sites influence both sonication scores; this produces an autocorrelation of $S_p(x_p)$.

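By definition the autocorrelation is $R_{ss}(x_i - x_j) = E[s_i s_j] - E[s_i]^2$. Using equation A-7 we can write

$$E[s_i s_j] = E\left[ \left( \sum_{\tau=-\infty}^{\infty} S_\tau(x_i - \tau)\,mth(\tau) \right) \left( \sum_{\tau'=-\infty}^{\infty} S_\tau(x_j - \tau')\,mth(\tau') \right) \right]$$

$$= E\left[ \sum_{\tau=-\infty}^{\infty} \sum_{\tau'=-\infty}^{\infty} S_\tau(x_i - \tau)\,mth(\tau)\,S_\tau(x_j - \tau')\,mth(\tau') \right]$$

We can split this double sum in two: one where $\tau = \tau'$ and one where $\tau \ne \tau'$

$$E[s_i s_j] = E\left[ \sum_{\tau=-\infty}^{\infty} S_\tau(x_i - \tau)\,mth(\tau)\,S_\tau(x_j - \tau)\,mth(\tau) + \sum_{\tau=-\infty}^{\infty} \sum_{\tau' \ne \tau} S_\tau(x_i - \tau)\,mth(\tau)\,S_\tau(x_j - \tau')\,mth(\tau') \right]$$

$$E[s_i s_j] = \sum_{\tau} S_\tau(x_i - \tau)\,S_\tau(x_j - \tau)\,E[mth(\tau)^2] + \sum_{\tau} \sum_{\tau' \ne \tau} S_\tau(x_i - \tau)\,S_\tau(x_j - \tau')\,E[mth(\tau)\,mth(\tau')]$$

We assume that the probability of methylation is $p_{th}$ and that the methylation of two sites is independent, i.e. $P[mth(x_i), mth(x_j)] = P(mth(x_i))\,P(mth(x_j))$. Then $E[mth(\tau)\,mth(\tau')] = P[mth(\tau), mth(\tau')] = p_{th}^2$. Also $E[mth(\tau)^2] = E[mth(\tau)] = p_{th}$ (because $mth$ only takes values in $\{0, 1\}$). Then

$$E[s_i s_j] = \sum_{\tau} S_\tau(x_i - \tau)\,S_\tau(x_j - \tau)\,p_{th} + \sum_{\tau} \sum_{\tau' \ne \tau} S_\tau(x_i - \tau)\,S_\tau(x_j - \tau')\,p_{th}^2$$

$$= p_{th} \sum_{\tau} S_\tau(x_i - \tau)\,S_\tau(x_j - \tau) + p_{th}^2 \sum_{\tau} \sum_{\tau' \ne \tau} S_\tau(x_i - \tau)\,S_\tau(x_j - \tau')$$

As the probes are separated by $d$, we know that $x_j = x_i + d$

$$= p_{th} \sum_{\tau} S_\tau(x_i - \tau)\,S_\tau(x_i + d - \tau) + p_{th}^2 \sum_{\tau} \sum_{\tau' \ne \tau} S_\tau(x_i - \tau)\,S_\tau(x_i + d - \tau')$$

We change variables $t = \tau - x_i$ and $t' = \tau' - x_i$ (note that the limits of the sums do not change because they are $-\infty$ to $+\infty$)

$$= p_{th} \sum_{t} S_\tau(-t)\,S_\tau(d - t) + p_{th}^2 \sum_{t} \sum_{t' \ne t} S_\tau(-t)\,S_\tau(d - t')$$

We know that $S_\tau$ is symmetric, so $S_\tau(t) = S_\tau(-t)$

$$= p_{th} \sum_{t} S_\tau(t)\,S_\tau(t - d) + p_{th}^2 \sum_{t} \sum_{t' \ne t} S_\tau(t)\,S_\tau(t' - d)$$

Now we add and subtract the term $p_{th}^2 \sum_t S_\tau(t)\,S_\tau(t - d)$

$$= p_{th} \sum_{t} S_\tau(t)\,S_\tau(t - d) + p_{th}^2 \sum_{t} \sum_{t' \ne t} S_\tau(t)\,S_\tau(t' - d) + p_{th}^2 \sum_{t} S_\tau(t)\,S_\tau(t - d) - p_{th}^2 \sum_{t} S_\tau(t)\,S_\tau(t - d)$$

Rearranging the sums, we get

$$= (p_{th} - p_{th}^2) \sum_{t} S_\tau(t)\,S_\tau(t - d) + p_{th}^2 \sum_{t} \sum_{t' \ne t} S_\tau(t)\,S_\tau(t' - d) + p_{th}^2 \sum_{t} S_\tau(t)\,S_\tau(t - d)$$

Note that $p_{th}^2 \sum_t S_\tau(t)\,S_\tau(t - d) = p_{th}^2 \sum_t \sum_{t' = t} S_\tau(t)\,S_\tau(t' - d)$, so

$$= (p_{th} - p_{th}^2) \sum_{t} S_\tau(t)\,S_\tau(t - d) + p_{th}^2 \sum_{t} \sum_{t' \ne t} S_\tau(t)\,S_\tau(t' - d) + p_{th}^2 \sum_{t} \sum_{t' = t} S_\tau(t)\,S_\tau(t' - d)$$

Collapsing the two double sums into one double sum, we get

$$E[s_i s_j] = (p_{th} - p_{th}^2) \sum_{t} S_\tau(t)\,S_\tau(t - d) + p_{th}^2 \sum_{t} \sum_{t'} S_\tau(t)\,S_\tau(t' - d)$$

Changing variables $t'' = t' - d$ (notice again that the limits of the sum remain the same: $-\infty$ to $+\infty$)

$$E[s_i s_j] = (p_{th} - p_{th}^2) \sum_{t} S_\tau(t)\,S_\tau(t - d) + p_{th}^2 \sum_{t} \sum_{t''} S_\tau(t)\,S_\tau(t'')$$

$$= (p_{th} - p_{th}^2) \sum_{t} S_\tau(t)\,S_\tau(t - d) + p_{th}^2 \left( \sum_{t} S_\tau(t) \right) \left( \sum_{t''} S_\tau(t'') \right)$$

$$E[s_i s_j] = (p_{th} - p_{th}^2) \sum_{t} S_\tau(t)\,S_\tau(t - d) + p_{th}^2 \left( \sum_{t} S_\tau(t) \right)^2 \tag{A-10}$$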

Now that we know $E[s_i s_j]$, we need to calculate $E[s_i]$ to complete the formula:

$$E[s_i] = E\left[ \sum_{\tau=-\infty}^{\infty} S_\tau(x_i - \tau)\,mth(\tau) \right] = \sum_{\tau=-\infty}^{\infty} S_\tau(x_i - \tau)\,E[mth(\tau)]$$

As already mentioned, $E[mth(\tau)] = p_{th}$, so

$$= p_{th} \sum_{\tau=-\infty}^{\infty} S_\tau(x_i - \tau)$$

We can change variables $t = \tau - x_i$ and use the symmetry property of $S_\tau$. Finally, we get

$$E[s_i] = p_{th} \sum_{t=-\infty}^{\infty} S_\tau(t) \tag{A-11}$$

Using equations A-10 and A-11 we can calculate the autocorrelation

$$R_{ss}(x_i - x_j) = E[s_i s_j] - E[s_i]^2$$

$$= (p_{th} - p_{th}^2) \sum_{t} S_\tau(t)\,S_\tau(t - d) + p_{th}^2 \left( \sum_{t} S_\tau(t) \right)^2 - \left( p_{th} \sum_{t=-\infty}^{\infty} S_\tau(t) \right)^2$$

Cancelling the last two terms we finally get

$$R_{ss}(x_i - x_j) = (p_{th} - p_{th}^2) \sum_{t} S_\tau(t)\,S_\tau(t - d)$$

Replacing $d = x_j - x_i$

$$R_{ss}(d) = (p_{th} - p_{th}^2) \sum_{t} S_\tau(t)\,S_\tau(t - d) \tag{A-12}$$

Figure 2–5 shows the autocorrelation function.
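Equation A-12 can be evaluated directly once $S_\tau$ is sampled on a finite window. The sketch below is a minimal illustration, assuming the kernel is given as a list of samples on consecutive offsets and is zero outside that window; the function name and the toy kernel in the usage note are hypothetical.

```python
def autocorrelation(kernel, p_th, d):
    """R_ss(d) = (p_th - p_th^2) * sum_t S_tau(t) * S_tau(t - d)  (eq. A-12).

    kernel: samples of S_tau on consecutive offsets (assumed zero elsewhere).
    p_th:   probability that a site is methylated.
    d:      distance between the two probes, in the same units as the offsets.
    """
    n = len(kernel)
    overlap = sum(kernel[t] * kernel[t - d] for t in range(n) if 0 <= t - d < n)
    return (p_th - p_th ** 2) * overlap
```

For a toy symmetric kernel such as `[1, 2, 3, 2, 1]` and `p_th = 0.5`, the value is largest at `d = 0`, is the same for `d` and `-d`, and vanishes once the two windows no longer overlap.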

A.3 Wiener filter model

Here we show the mathematical details of deriving the filter parameters used in section 2.8. In that section, a linear model was proposed, from which we derive the “inverse model”

$$m_p = \sum_{k=-\infty}^{+\infty} h_k\,s_{p-k} + \epsilon_p \;\Longrightarrow\; s_t = \sum_{k=-P}^{+P} h_k^{inv}\,m_{t-k} + e_t \tag{A-13}$$

where $h_k^{inv}$ are the inverse model coefficients and $e_t$ is the difference between our model's estimation and the real signal (i.e. the error): $s_t = \hat{s}_t + e_t \Rightarrow e_t = s_t - \hat{s}_t$. We can construct a vector of M-values $\mathbf{m}_t = [m_{t+P}, ..., m_t, ..., m_{t-P}]^T$, a vector $\mathbf{h}^{inv} = [h_{-P}^{inv}, ..., h_0^{inv}, ..., h_P^{inv}]^T$, and a vector of errors $\mathbf{e}_t$; then $s_t = \mathbf{m}_t^T \mathbf{h}^{inv} + e_t$. Similarly we can construct a vector of sonication scores $\mathbf{s}_t = [s_{t+P}, ..., s_t, ..., s_{t-P}]^T$, then

$$\mathbf{s}_t = \mathbf{M}_t\,\mathbf{h}^{inv} + \mathbf{e}_t \tag{A-14}$$

where $\mathbf{M}_t$ is a matrix whose rows are $\mathbf{m}_t^T$

$$\mathbf{M}_t = \begin{bmatrix} \mathbf{m}_{t-P}^T \\ \vdots \\ \mathbf{m}_{t-2}^T \\ \mathbf{m}_{t-1}^T \\ \mathbf{m}_{t}^T \\ \mathbf{m}_{t+1}^T \\ \mathbf{m}_{t+2}^T \\ \vdots \\ \mathbf{m}_{t+P}^T \end{bmatrix} = \begin{bmatrix} m_{t} & \cdots & m_{t-P} & \cdots & m_{t-2P} \\ \vdots & & \vdots & & \vdots \\ m_{t+P-2} & \cdots & m_{t-2} & \cdots & m_{t-P-2} \\ m_{t+P-1} & \cdots & m_{t-1} & \cdots & m_{t-P-1} \\ m_{t+P} & \cdots & m_{t} & \cdots & m_{t-P} \\ m_{t+P+1} & \cdots & m_{t+1} & \cdots & m_{t-P+1} \\ m_{t+P+2} & \cdots & m_{t+2} & \cdots & m_{t-P+2} \\ \vdots & & \vdots & & \vdots \\ m_{t+2P} & \cdots & m_{t+P} & \cdots & m_{t} \end{bmatrix}$$

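Premultiplying equation A-14 by $\mathbf{m}_t$ on both sides and applying the expectation operator $E(\cdot)$ we get $E(\mathbf{m}_t \mathbf{s}_t) = E(\mathbf{m}_t \mathbf{M}_t)\,\mathbf{h}^{inv} + E(\mathbf{m}_t \mathbf{e}_t)$. The last term is zero because we assume that the error signal is uncorrelated with the M-values (otherwise we would be able to predict the error). So we get $E(\mathbf{m}_t \mathbf{s}_t) = E(\mathbf{m}_t \mathbf{M}_t)\,\mathbf{h}^{inv}$. The left term is the cross-correlation vector $\mathbf{r}_{ms}$ and $E(\mathbf{m}_t \mathbf{M}_t)$ is an autocorrelation matrix $\mathbf{R}_{mm}$², so $\mathbf{r}_{ms} = \mathbf{R}_{mm}\,\mathbf{h}^{inv}$. Solving for $\mathbf{h}^{inv}$

$$\mathbf{h}^{inv} = \mathbf{R}_{mm}^{-1}\,\mathbf{r}_{ms} \tag{A-15}$$

Robustness: In our “direct model”, proposed in equation A-13, we have $m_p = \sum_{k=-\infty}^{+\infty} h_k\,s_{p-k}$ plus noise, or in matrix notation $\mathbf{m} = \mathbf{H}\,\mathbf{s} + \mathbf{e}$. This implies that $\mathbf{R}_{mm} = E[\mathbf{m}\,\mathbf{m}^T] = \mathbf{H}\,\mathbf{R}_{ss}\,\mathbf{H}^T + \mathbf{R}_{ee}$. Also $\mathbf{r}_{ms} = \mathbf{H}\,\mathbf{r}_{ss}$, so the final equation is

$$\mathbf{h}^{inv} = (\mathbf{H}\,\mathbf{R}_{ss}\,\mathbf{H}^T + \mathbf{R}_{ee})^{-1}\,\mathbf{H}\,\mathbf{r}_{ss} \tag{A-16}$$

The noise autocorrelation $\mathbf{R}_{ee}$ is usually a diagonal matrix. Intuitively, this diagonal matrix reduces singularity problems when calculating the inverse.

A.3.1 Wiener filter with additional parameters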

We show here how to include melting temperature (or any additional parameters) in our model. We add the probe melting temperature $T_m(t)$ to equation A-13 by adding a coefficient $h_{T_m}^{inv}$

$$s_t = \sum_{k=-P}^{+P} h_k^{inv}\,m_{t-k} + h_{T_m}^{inv}\,T_m(t) + e_t \tag{A-17}$$

We create a new vector $\mathbf{T}_m(t) = [T_m(t+P), ..., T_m(t), ..., T_m(t-P)]^T$; then, in matrix notation, $\mathbf{s}_t = \mathbf{M}_t\,\mathbf{h}^{inv} + h_{T_m}^{inv}\,\mathbf{T}_m(t) + \mathbf{e}_t$. We premultiply this equation by $\mathbf{m}_t$ on both sides and apply the expectation operator $E(\cdot)$, obtaining $\mathbf{r}_{ms} = \mathbf{R}_{mm}\,\mathbf{h}^{inv} + h_{T_m}^{inv}\,\mathbf{r}_{mT_m}$, where $\mathbf{r}_{mT_m}$ is the cross-correlation between M-values and melting temperatures. We can re-express this equation as $\mathbf{r}_{ms} = \mathbf{R}^*\,\mathbf{h}^*$, where $\mathbf{h}^* = [h_{T_m}^{inv}, \mathbf{h}^{inv}]^T$ and $\mathbf{R}^*$ is a matrix whose first column is $\mathbf{r}_{mT_m}$ and all other columns are $\mathbf{R}_{mm}$³. The equation is solved as $\mathbf{h}^* = \mathbf{R}^{*-1}\,\mathbf{r}_{ms}$.

2 Note that by doing this, we are implicitly assuming that the process is Wide Sense Stationary (WSS)

3 Last column is removed to get a square matrix

Appendix B

If we have a set of genes ranked from 1 to $N$, and we randomly select $N_T$ genes and add their ranks, we obtain a “rank sum” (equation B-1). In this appendix we explain the mathematical details of calculating the probability density function of the rank sum.

$$r_{sum} = \sum_{g_i \in T} r_i \tag{B-1}$$

It is useful to keep in mind the following analogy: we have an urn filled with balls numbered from 1 to $N$, we randomly draw $N_T$ balls and add the numbers. There are two different cases: i) each time we draw a ball, we place it back in the urn, or ii) we do not replace it. The probability density function for the first case is easier to derive, so we will start by assuming replacement. Then we will address the case when there is no replacement, as the derivation is similar.

B.1 Ranked sum with replacement

In these sections we calculate the probability density function of a rank sum. We assume that genes are ranked and that ranks can be repeated (e.g. gene $g_i$ is ranked $r_i$ and gene $g_j$ can also be ranked $r_i$); this is equivalent to saying that ranks are drawn with replacement. As an intuitive way of looking at this, assume we have all the ranks written on small pieces of paper in a bag; we draw one piece of paper, read it and place it back in the bag. It is also important to note that we assume that all ranking positions are used, which means that all values from 1 to $N$ are assigned to at least one gene.

B.1.1 Approximation by normal distribution

If we are analyzing a set of genes $T$ which has $N_T$ genes, we calculate the rank sum as

$$r_{sum} = \sum_{g_i \in T} r_i$$

where $r_i$ is the rank of gene $g_i$. This is a sum of random variables that converges to a normal distribution when the number of genes in $T$ tends to infinity. The expectation of $r_i$ is $(N + 1)/2$ and the variance is $(N^2 - 1)/12$ (these are the mean and variance of a discrete uniform distribution between 1 and $N$). Then the mean of $r_{sum}$ is

$$E[r_{sum}] = E\left[ \sum_{g_i \in T} r_i \right] = \sum_{g_i \in T} E[r_i] = N_T\,E[r_i] = N_T\,\frac{(N + 1)}{2}$$

and the variance is

$$Var(r_{sum}) = Var\left[ \sum_{g_i \in T} r_i \right] = \sum_{g_i \in T} Var[r_i] = N_T\,\frac{N^2 - 1}{12}$$

so we can approximate $r_{sum}$ by a normal distribution

$$\mathcal{N}\left( \mu = N_T\,\frac{N + 1}{2},\; \sigma^2 = N_T\,\frac{N^2 - 1}{12} \right)$$

Now we can calculate $P(R \le r_{sum})$ by using the normal cumulative distribution. However, there are several problems with this approximation, especially when the number of terms in the set that we want to analyze is small (e.g. less than 20). In the following section we show how to calculate the exact value of this probability density function.

B.1.2 Exact calculation

Now we calculate the exact value of a rank sum probability. In order to do this, we assume that there can be repeated rank values (e.g. two genes $g_i$ and $g_j$ can share rank $r_i = r_j$ in a list).

We start by considering the trivial case when the rank sum has only one term (i.e. $N_T = 1$ and $T = \{r_1\}$). In this case, the probability is

$$P_{N,1}(R = r_{sum}) = \begin{cases} 1/N & \text{if } (1 \le R \le N) \text{ and } (N > 0) \\ 0 & \text{otherwise} \end{cases}$$

because the rank distribution is uniform between 1 and $N$ (only for valid values of $R$ and $N$, of course).

Now let us consider the case when the rank sum is composed of two or more terms (i.e. $N_T \ge 2$ and $T = \{r_1, r_2, r_3, ..., r_{N_T}\}$). To calculate the probability that the two or more ranks add to $R$, we have to add all possible combinations such that $r_1 + r_2 + ... + r_{N_T} = R$. This is the same as the probability of $r_1 = r$ and $r_2 + r_3 + ... + r_{N_T} = R - r$ for all possible values of $r$. The formula is:

$$P_{N,N_T}(R = r_{sum}) = \sum_{r=1}^{N} P_{N,1}(r)\,P_{N,N_T-1}(R - r) \tag{B-2}$$

This formula can be optimized by noting that the maximum value for $r$ in this sum is either $N$ or $R - N_T + 1$. This is because $P_{N,1}(r)$ is non-zero only for values of $r \in [1, N]$. Likewise, $P_{N,N_T-1}(R - r)$ is zero when $R - r < N_T - 1$ (i.e. there is no way to add $N_T - 1$ ranks and obtain less than $N_T - 1$, because the minimum possible rank is 1), that is, when $r > R - N_T + 1$. This means we can rewrite the limits of the sum in equation B-2 as

$$P_{N,N_T}(R = r_{sum}) = \sum_{r=1}^{r_{max}} P_{N,1}(r)\,P_{N,N_T-1}(R - r) \tag{B-3}$$

where $r_{max} = \min(R - N_T + 1, N)$.

Figure 5–4 shows some calculated, simulated and normally approximated probability density functions for different $N$ and $N_T$. The root mean squared (RMS) error between the exact calculation and the normal approximation is shown in Figure 5–5. As we can see, the error is very small when either $N$ or $N_T$ is over 20, so this is the criterion we use to decide when to use the normal approximation.
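The recursion in equation B-2 can be sketched in a few lines of Python. This is only an illustrative implementation (the function names are hypothetical): it applies B-2 iteratively, adding one uniformly distributed rank at a time, and uses exact fractions so that the result can be compared against the mean and variance given in section B.1.1.

```python
from fractions import Fraction

def rank_sum_pmf_with_replacement(n, n_t):
    """Return {R: P_{N,N_T}(R)} for ranks drawn with replacement from 1..n,
    built by applying equation B-2 once per additional term."""
    single = {r: Fraction(1, n) for r in range(1, n + 1)}   # P_{N,1}
    pmf = dict(single)
    for _ in range(n_t - 1):
        nxt = {}
        for partial_sum, p in pmf.items():
            for r, q in single.items():
                nxt[partial_sum + r] = nxt.get(partial_sum + r, 0) + p * q
        pmf = nxt
    return pmf

def pmf_mean_and_variance(pmf):
    """Exact mean and variance of a probability mass function given as a dict."""
    mean = sum(value * p for value, p in pmf.items())
    variance = sum((value - mean) ** 2 * p for value, p in pmf.items())
    return mean, variance
```

With $N = 6$ and $N_T = 2$ this reproduces the familiar distribution of the sum of two dice, whose mean $N_T (N+1)/2 = 7$ and variance $N_T (N^2 - 1)/12 = 35/6$ agree with the formulas used for the normal approximation.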

Figure 5–4: Probability density function PN,NT (R) shown in blue, simulated values (red) and normal approximation (green).

B.1.3 Fast algorithm

Looking at equation B-2 we can see that the probabilities are zero outside the limits of summation, so we can extend those limits to $\pm\infty$

$$P_{N,N_T}(R = r_{sum}) = \sum_{r=1}^{N} P_{N,1}(r)\,P_{N,N_T-1}(R - r) = \sum_{r=-\infty}^{+\infty} P_{N,1}(r)\,P_{N,N_T-1}(R - r) = [P_{N,1} * P_{N,N_T-1}](R)$$

so we can express this probability as the convolution of the two probabilities, evaluated at $R$. Expanding the term $P_{N,N_T-1}$

$$P_{N,N_T}(R = r_{sum}) = [P_{N,1} * P_{N,1} * P_{N,N_T-2}](R) = [P_{N,1} * P_{N,1} * \cdots * P_{N,N_T-(N_T-1)}](R)$$

This is a convolution of $N_T$ functions. If we apply the Fourier transform on both sides, $\mathcal{F}[P_{N,N_T}] = \mathcal{F}[P_{N,1}]^{N_T} \Rightarrow P_{N,N_T}(R = r_{sum}) = \mathcal{F}^{-1}\{\mathcal{F}[P_{N,1}]^{N_T}\}(R)$. If we calculate the probability density function this way, we reduce the complexity of the computation to $O[N_T N \log(N_T N)]$, which is the complexity of the Fast Fourier transform.

Figure 5–5: Normal approximation's RMS error for different $N$ and $N_T$ values.

B.2 Ranked sum without replacement

In these sections we calculate the probability density function of a rank sum when there is no replacement (i.e. once a gene is assigned a rank, no other gene can have the same rank). As an intuitive way of looking at this, assume we have all the ranks written on small pieces of paper in a bag; we draw one piece of paper, read it and throw it away (we do not place it back in the bag). Adding all the numbers we have read, we get the rank sum. As before, we assume that all ranking positions are used, which means that each value from 1 to $N$ is assigned to one gene.

B.2.1 Min / Max values

Before calculating the probability density function, we need to know the minimum and maximum possible values of a rank sum without replacement. Let us calculate the minimum value of a rank sum when we select $N_T$ terms out of $N$ ranks. Clearly the minimum possible value is

$$R_{min}(N, N_T) = \sum_{i=1}^{N_T} i = \frac{N_T}{2}\,(N_T + 1)$$

We would like to calculate the minimum possible rank sum when selecting $N_T$ items that have been ranked from $r_{min}$ to $N$ (note that the minimum rank is not 1):

$$R_{min}(N, N_T, r_{min}) = \sum_{i=r_{min}}^{r_{min}+N_T-1} i = \sum_{j=1}^{N_T} (j + r_{min} - 1) = N_T\,(r_{min} - 1) + \sum_{j=1}^{N_T} j$$

the last term is $R_{min}(N, N_T) = R_{min}(N, N_T, 1)$, so

$$R_{min}(N, N_T, r_{min}) = N_T\,(r_{min} - 1) + R_{min}(N, N_T) \tag{B-4}$$

Now the maximum possible value when we select $N_T$ items ranked from 1 to $N$:

$$R_{max}(N, N_T) = \sum_{i=N-N_T+1}^{N} i$$

Changing variables $i = N - (N_T - j)$:

$$R_{max}(N, N_T) = \sum_{j=1}^{N_T} [N - (N_T - j)] = \sum_{j=1}^{N_T} N - \sum_{j=1}^{N_T} N_T + \sum_{j=1}^{N_T} j = N_T\,(N - N_T) + R_{min}(N, N_T)$$

$$R_{max}(N, N_T) = N_T\,(N - N_T) + R_{min} \tag{B-5}$$
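Equations B-4 and B-5 translate directly into code. The following sketch (with hypothetical function names) can be checked against the sums of the smallest and largest ranks.

```python
def rank_sum_min(n_t, rank_min=1):
    """Equation B-4: minimum rank sum of n_t distinct ranks, all >= rank_min."""
    return n_t * (rank_min - 1) + n_t * (n_t + 1) // 2

def rank_sum_max(n, n_t):
    """Equation B-5: maximum rank sum of n_t distinct ranks taken from 1..n."""
    return n_t * (n - n_t) + n_t * (n_t + 1) // 2
```

For instance, with $N = 10$ and $N_T = 3$ the smallest possible sum is $1 + 2 + 3 = 6$ and the largest is $8 + 9 + 10 = 27$.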

B.2.2 Exact calculation

In this section we find the probability density function for a ranked sum when there are no repeated ranks (i.e. no replacement). The only difference between this section and section B.1.2 is that we assume that for any two different genes $g_i$ and $g_j$ the ranks are different (i.e. $r_i \ne r_j \;\forall g_i \ne g_j$). Although the formulas and the distribution shapes are different, the methodology used to derive the formulas is almost the same.

We start by considering the case when the rank sum has only one term (i.e. $N_T = 1$ and $T = \{r_1\}$). We add two parameters: $N_{out}$ to specify how many ranks have already been drawn (i.e. we cannot use those ranks) and $r_{min}$ to specify the minimum value to consider in the rank sum (the use of this variable will become evident later). In this case, the probability is uniform; we can only select 1 out of $N - N_{out}$ ranks:

$$P_{N,1}(R = r_{sum} | N_{out}) = \begin{cases} 1/(N - N_{out}) & \text{if } (N > 0) \text{ and } (1 \le R \le N) \\ 0 & \text{otherwise} \end{cases}$$

Now we define

$$Q_{N,N_T}(R | N_{out}, r_{min}) = P_{N,N_T}(R | N_{out})\,\delta_R$$

When the rank sum is composed of two or more terms (i.e. $N_T \ge 2$ and $T = \{r_1, ..., r_{N_T}\}$), as we did before, we add all possible combinations such that $r_1 + ... + r_{N_T} = R$. This is the same as the probability of $r_1 = r$ and $r_2 + ... + r_{N_T} = R - r$ for all possible values of $r$. This is the same as doing the calculation for $r_1 < (r_2 + ... + r_{N_T})$ and then multiplying by all possible combinations. We use the parameter $r_{min}$ to indicate that $r_2 + ... + r_{N_T}$ cannot be less than or equal to $r_1$. So the formula is:

$$Q_{N,N_T}(R = r_{sum} | N_{out}, r_{min}) = N_T \sum_{r=r_{min}}^{N} Q_{N,1}(r | N_{out}, r)\,Q_{N,N_T-1}(R - r | N_{out} + 1, r + 1)$$

This is a recursive formula that depends on five different parameters, so it is important to narrow down the recursion as much as possible. In order to reduce our search space, we can use equations B-4 and B-5, which give the minimum and maximum possible values for $R$. The final equation becomes

$$Q_{N,N_T}(R = r_{sum} | N_{out}, r_{min}) = \begin{cases} N_T \sum_{r=1}^{N} Q_{N,1}(r | N_{out}, r)\,Q_{N,N_T-1}(R - r | N_{out} + 1, r + 1) & \text{if } R_{min}(N, N_T, r_{min}) \le R \le R_{max}(N, N_T) \\ 0 & \text{otherwise} \end{cases}$$

So if we want to calculate the probability of having $R = r_{sum}$ when we draw $N_T$ numbers from a ranked list of $N$ numbers, we just need to calculate

$$P_{N,N_T}(R = r_{sum}) = Q_{N,N_T}(R = r_{sum} | N_{out} = 0, r_{min} = 1)$$
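As a cross-check of the distribution described in this section, the sketch below computes the same probabilities with a different technique: instead of the recursion above, it counts the subsets of $\{1, ..., N\}$ of size $N_T$ that add up to each $R$ using dynamic programming, and divides by the number of subsets. The function names are hypothetical, and the brute-force enumeration is included only to validate small cases.

```python
from fractions import Fraction
from itertools import combinations
from math import comb

def rank_sum_pmf_without_replacement(n, n_t):
    """Return {R: P(r_sum = R)} for n_t distinct ranks drawn from 1..n.

    counts[k][s] = number of k-element subsets of the ranks processed so far
    whose sum is s (subset-counting dynamic programming, not the recursion
    used in the text).
    """
    max_sum = n_t * (2 * n - n_t + 1) // 2          # R_max from equation B-5
    counts = [[0] * (max_sum + 1) for _ in range(n_t + 1)]
    counts[0][0] = 1
    for r in range(1, n + 1):
        # Descending k guarantees that each rank is used at most once.
        for k in range(min(r, n_t), 0, -1):
            for s in range(r, max_sum + 1):
                counts[k][s] += counts[k - 1][s - r]
    total = comb(n, n_t)
    return {s: Fraction(c, total) for s, c in enumerate(counts[n_t]) if c}

def brute_force_pmf(n, n_t):
    """Same distribution by explicit enumeration (only feasible for small n)."""
    tally = {}
    for subset in combinations(range(1, n + 1), n_t):
        tally[sum(subset)] = tally.get(sum(subset), 0) + 1
    total = comb(n, n_t)
    return {s: Fraction(c, total) for s, c in tally.items()}
```

For $N = 5$ and $N_T = 2$ the possible sums range from 3 to 9, in agreement with equations B-4 and B-5.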

B.2.3 Normal approximation

In order to approximate this function using a Gaussian probability density function, we need to calculate the mean and variance.

Mean: Now we want to calculate the mean of a rank sum when there is no replacement. We start by writing the definition of a rank sum (see equation B-1) as

$$r_{sum} = \sum_{g_i \in T} r_i = \sum_{r=1}^{N} r\,I(r) \tag{B-6}$$

where $I(r)$ is the indicator function that is 1 when the gene that has rank $r$ belongs to set $T$. The mean value is

$$E[r_{sum}] = \sum_{r=1}^{N} r\,E[I(r)]$$

We know that there are $N_T$ items in $T$, so the expected value of the indicator function is $E[I(r)] = N_T / N$ (i.e. the set $T$ consists of $N_T$ items chosen from $N$ possible ones), then:

$$E[r_{sum}] = \frac{N_T}{N} \sum_{r=1}^{N} r = \frac{N_T}{N}\,\frac{(N + 1)\,N}{2} = N_T\,\frac{(N + 1)}{2} \tag{B-7}$$

This is the same value as the mean rank sum with replacement (see section B.1.1).

Variance: Now we want to calculate the variance of a rank sum when there is no replacement. We start by using equation B-6

$$Var(r_{sum}) = E[r_{sum}^2] - E[r_{sum}]^2 = E\left[ \left( \sum_{g_i \in T} r_i \right)^2 \right] - E[r_{sum}]^2 = E\left[ \left( \sum_{r=1}^{N} r\,I(r) \right)^2 \right] - E[r_{sum}]^2$$

Using equation B-7 we know that $\mu = E[r_{sum}] = N_T\,\frac{(N+1)}{2}$

$$Var[r_{sum}] = E\left[ \left( \sum_{r=1}^{N} r\,I(r) \right)^2 \right] - \mu^2 = \sum_{r=1}^{N} \sum_{r'=1}^{N} r\,r'\,E[I(r)\,I(r')] - \mu^2 \tag{B-8}$$

Now we need to calculate $E[I(r)\,I(r')]$. When $r = r'$ this is just the probability of picking the gene ranked $r$ when choosing $N_T$ out of $N$ genes, that is

$$E[I(r)\,I(r)] = E[I(r)] = \frac{N_T}{N} \tag{B-9}$$

When $r \ne r'$ we can calculate $E[I(r)\,I(r')]$ as the probability of choosing $N_T$ out of $N$ values where two of them are 'fixed' (one is $r$ and the other is $r'$), that is

$$E[I(r)\,I(r')] = \frac{\binom{N-2}{N_T-2}}{\binom{N}{N_T}} = \frac{N_T\,(N_T - 1)}{N\,(N - 1)} \tag{B-10}$$

Using equations B-9 and B-10 in equation B-8

$$Var(r_{sum}) = \sum_{r=1}^{N} \sum_{r'=1}^{N} r\,r'\,E[I(r)\,I(r')] - \mu^2$$

$$= \sum_{r=1}^{N} \sum_{r' \ne r} r\,r'\,E[I(r)\,I(r')] + \sum_{r=1}^{N} \sum_{r' = r} r\,r'\,E[I(r)\,I(r')] - \mu^2$$

$$= \frac{N_T\,(N_T - 1)}{N\,(N - 1)} \sum_{r=1}^{N} \sum_{r' \ne r} r\,r' + \frac{N_T}{N} \sum_{r=1}^{N} r^2 - \mu^2 \tag{B-11}$$

It is known that $\sum_{r=1}^{N} r^2 = (N + 1)(2N + 1)N/6$; we call this $K_r$, so

$$K_r = \sum_{r=1}^{N} r^2 = (N + 1)(2N + 1)N/6 \tag{B-12}$$

replacing B-12 in B-11

$$Var(r_{sum}) = \frac{N_T\,(N_T - 1)}{N\,(N - 1)} \sum_{r=1}^{N} \sum_{r' \ne r} r\,r' + \frac{N_T}{N}\,K_r - \mu^2$$

Now we need to calculate $\sum_{r=1}^{N} \sum_{r' \ne r} r\,r'$; we call that term $K_{rr'}$

$$K_{rr'} = \sum_{r=1}^{N} \sum_{r' \ne r} r\,r' = \sum_{r=1}^{N} \sum_{r'=1}^{N} r\,r' - \sum_{r=1}^{N} \sum_{r' = r} r\,r'$$

$$= \sum_{r=1}^{N} r \left[ \sum_{r'=1}^{N} r' \right] - \sum_{r=1}^{N} r^2$$

$$= \left[ \sum_{r=1}^{N} r \right] \left[ \sum_{r'=1}^{N} r' \right] - K_r = \left[ N\,\frac{(N + 1)}{2} \right]^2 - K_r \tag{B-13}$$

replacing this latest result into B-11 we get

$$Var(r_{sum}) = \frac{N_T\,(N_T - 1)}{N\,(N - 1)}\,K_{rr'} + \frac{N_T}{N}\,K_r - \mu^2 \tag{B-14}$$

where $\mu$, $K_r$ and $K_{rr'}$ are defined in equations B-7, B-12 and B-13 respectively (these terms depend only on $N$ and $N_T$).

B.2.4 Approximation

Calculating the probability of a rank sum without replacement means evaluating the recursive formula of section B.2.2, which is computationally expensive even for small $N$. As we can see from Figure 5–6, the shape of the distribution becomes similar to a Gaussian as $N$ increases, but even for large $N$ a Gaussian approximation cannot be used when $N_T \in \{1, 2, N - 2, N - 1\}$, because in these cases the distributions are always uniform ($N_T = 1$ and $N_T = N - 1$) or triangular ($N_T = 2$ and $N_T = N - 2$). A summary of how to approximate the rank sum distribution is shown in Table 5–1.

Table 5–1: Rank sum approximation

If $N_T \in \{1, N - 1\}$:
  $R_{min}(N, N_T, r_{min}) = N_T\,(r_{min} - 1) + R_{min}(N, N_T)$
  $R_{max}(N, N_T) = N_T\,(N - N_T) + R_{min}$
  Uniform distribution $[R_{min}, R_{max}]$

If $N_T \in \{2, N - 2\}$:
  $R_{min}(N, N_T, r_{min}) = N_T\,(r_{min} - 1) + R_{min}(N, N_T)$
  $\mu = N_T\,\frac{(N+1)}{2}$
  $R_{max}(N, N_T) = N_T\,(N - N_T) + R_{min}$
  Triangular distribution $[R_{min}, \mu, R_{max}]$

If $N > 30$ and $N_T \notin \{1, 2, N - 2, N - 1\}$:
  $\mu = N_T\,\frac{(N+1)}{2}$
  $K_{rr'} = \left[ \frac{N}{2}\,(N + 1) \right]^2 - K_r$
  $K_r = (N + 1)(2N + 1)N/6$
  $\sigma = \sqrt{ \frac{N_T\,(N_T - 1)}{N\,(N - 1)}\,K_{rr'} + \frac{N_T}{N}\,K_r - \mu^2 }$
  Gaussian distribution $\mathcal{N}(\mu, \sigma)$

Otherwise: Should not approximate; use the exact recursive formula of section B.2.2.
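The mean and variance used for the Gaussian case (equations B-7 and B-12 to B-14) can be checked numerically. The sketch below, with hypothetical function names, evaluates the closed-form expressions with exact fractions and compares them with a brute-force enumeration of all subsets, which is only practical for small $N$.

```python
from fractions import Fraction
from itertools import combinations

def rank_sum_moments(n, n_t):
    """Mean (B-7) and variance (B-14) of a rank sum without replacement.
    Requires n >= 2 so that N(N - 1) is non-zero."""
    mu = Fraction(n_t * (n + 1), 2)
    k_r = Fraction((n + 1) * (2 * n + 1) * n, 6)            # B-12
    k_rr = Fraction(n * (n + 1), 2) ** 2 - k_r               # B-13
    variance = (Fraction(n_t * (n_t - 1), n * (n - 1)) * k_rr
                + Fraction(n_t, n) * k_r - mu ** 2)          # B-14
    return mu, variance

def brute_force_moments(n, n_t):
    """Mean and variance obtained by enumerating every subset of size n_t."""
    sums = [sum(subset) for subset in combinations(range(1, n + 1), n_t)]
    mean = Fraction(sum(sums), len(sums))
    variance = Fraction(sum(s * s for s in sums), len(sums)) - mean ** 2
    return mean, variance
```

For example, `rank_sum_moments(4, 2)` gives a mean of 5 and a variance of 5/3, matching the enumeration of the six possible pairs.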

Figure 5–6: $P_{N,N_T}(R)$ for different values of $N$ and $N_T$

Appendix C

C.1 Simulated annealing: Energy difference

As mentioned in section 3.8.2, simulated annealing requires calculating $\Delta E$ on each step, which, if done naively, is computationally expensive: $O(N^2)$. But there are some simplifications that can speed up the algorithm. Assume that we want to change only one output, $x_k(t + 1) \ne x_k(t)$. The energy difference is $\Delta E = E(t + 1) - E(t)$, and we know that

$$E(t) = -\frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \delta_{i,j}\,x_i(t)\,w_{i,j}\,x_j(t)$$

$$= -\frac{1}{2} \left[ \sum_{i \ne k} \sum_{j \ne k} \delta_{i,j}\,x_i(t)\,w_{i,j}\,x_j(t) + \sum_{i \ne k} \delta_{i,k}\,x_i(t)\,w_{i,k}\,x_k(t) + \sum_{j \ne k} \delta_{k,j}\,x_k(t)\,w_{k,j}\,x_j(t) + \delta_{k,k}\,x_k(t)\,w_{k,k}\,x_k(t) \right]$$

By definition $\delta_{k,k} = 1$ and $W$ is symmetric (i.e. $w_{i,j} = w_{j,i}$), so

$$E(t) = -\frac{1}{2} \left[ \sum_{i \ne k} \sum_{j \ne k} \delta_{i,j}\,x_i(t)\,w_{i,j}\,x_j(t) + 2 \sum_{i \ne k} \delta_{i,k}\,x_i(t)\,w_{i,k}\,x_k(t) + x_k(t)^2\,w_{k,k} \right]$$

Replacing the above equation in $\Delta E$

$$\Delta E = -\frac{1}{2} \left[ \sum_{i \ne k} \sum_{j \ne k} \delta_{i,j}\,x_i(t+1)\,w_{i,j}\,x_j(t+1) + 2 \sum_{i \ne k} \delta_{i,k}\,x_i(t+1)\,w_{i,k}\,x_k(t+1) + x_k(t+1)^2\,w_{k,k} - \sum_{i \ne k} \sum_{j \ne k} \delta_{i,j}\,x_i(t)\,w_{i,j}\,x_j(t) - 2 \sum_{i \ne k} \delta_{i,k}\,x_i(t)\,w_{i,k}\,x_k(t) - x_k(t)^2\,w_{k,k} \right]$$

We know that $x_i(t + 1) = x_i(t) \;\forall i \ne k$, so the double sums are the same and cancel out

$$\Delta E = -\frac{1}{2} \left[ 2 \sum_{i \ne k} \delta_{i,k}\,x_i(t+1)\,w_{i,k}\,x_k(t+1) + x_k(t+1)^2\,w_{k,k} - 2 \sum_{i \ne k} \delta_{i,k}\,x_i(t)\,w_{i,k}\,x_k(t) - x_k(t)^2\,w_{k,k} \right]$$

$$\Delta E = \sum_{i \ne k} \delta_{i,k}\,x_i(t)\,w_{i,k}\,x_k(t) - \sum_{i \ne k} \delta_{i,k}\,x_i(t+1)\,w_{i,k}\,x_k(t+1) + \frac{1}{2} \left( x_k(t)^2\,w_{k,k} - x_k(t+1)^2\,w_{k,k} \right)$$

$$\Delta E = (x_k(t) - x_k(t+1)) \sum_{i \ne k} \delta_{i,k}\,x_i(t)\,w_{i,k} + \frac{x_k(t)^2 - x_k(t+1)^2}{2}\,w_{k,k}$$

Our outputs are binary, so $x_k(t)^2 = x_k(t)$

$$\Delta E = (x_k(t) - x_k(t+1)) \left[ \sum_{i \ne k} \delta_{i,k}\,x_i(t)\,w_{i,k} + \frac{w_{k,k}}{2} \right]$$

Finally, if we use simulated annealing, we can set $w_{k,k} = 0$, simplifying the equation a little more

$$\Delta E = (x_k(t) - x_k(t+1)) \sum_{i \ne k} \delta_{i,k}\,x_i(t)\,w_{i,k}$$
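The final expression can be checked against the naive $O(N^2)$ energy computation. The sketch below is only an illustration: the function names, the small weight matrix and the $\delta$ matrix are hypothetical, and it assumes binary outputs, $w_{k,k} = 0$, and that both $W$ and $\delta$ are symmetric.

```python
def energy(x, w, delta):
    """Naive O(N^2) energy: E = -1/2 * sum_ij delta_ij * x_i * w_ij * x_j."""
    n = len(x)
    return -0.5 * sum(delta[i][j] * x[i] * w[i][j] * x[j]
                      for i in range(n) for j in range(n))

def delta_energy(x, w, delta, k, new_xk):
    """O(N) energy difference when only output k changes to new_xk
    (assumes binary outputs, w[k][k] = 0, symmetric w and delta)."""
    local_field = sum(delta[i][k] * x[i] * w[i][k]
                      for i in range(len(x)) if i != k)
    return (x[k] - new_xk) * local_field

# Hypothetical 3-neuron example (symmetric weights, zero diagonal).
W_EXAMPLE = [[0, 2, -1],
             [2, 0, 3],
             [-1, 3, 0]]
DELTA_EXAMPLE = [[1, 1, 0],
                 [1, 1, 1],
                 [0, 1, 1]]
X_EXAMPLE = [1, 0, 1]
```

In this toy example, switching neuron 1 from 0 to 1 changes the energy by $-5$, which is the same value obtained by recomputing the full energy before and after the change.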

Appendix D

5.2 Symbol reference

These are the labels we use to refer to the context where each variable / formula is being applied:

DMH: Differential Methylation Hybridization, chapter 2
Gene sets: Information theory applied to Gene sets, chapter 4
DNA Fragment: DNA sonication fragment's length, section 2.3
Primer selection: Multiple PCR and Primer selection problem, chapter 3
Prob.: Probabilistic approach: Bimodal distribution, section

Symbol | Context | Meaning
$A$ | Primer selection | Set of amplicons
$B()$ | | Binomial distribution
$\beta$ | Primer selection | Gain constant for neuron's transfer function (also $\beta = 1/T$ in simulated annealing)
$E, E(t)$ | Primer selection | Energy function and energy function at time $t$
$e_t$ | DMH | Predictor's error at time step $t$
$eff_p$ | DMH | Immunoprecipitation efficiency for probe $p$
$\mathcal{F}$ | | Fourier transform
$f()$ | Primer selection | Neuron's activation function
$f(r_i, r_j)$ | Primer selection | Function returning an amplicon amplified by primers $r_i$ and $r_j$
$G = \{g_1, g_2, ..., g_N\}$ | Gene sets | The set of every gene in the ontology (GO)
$g_i$ | DMH | Green intensity for probe number $i$
$\bar{g}$ | DMH | Vector of green intensities for all probes in a chip
$\bar{i}_{raw}$ | DNA Fragment | Raw intensity after sliding window for an image
$\bar{i}_{norm}$ | DNA Fragment | Normalized intensities for an image
$\bar{s}_{norm}$ | DNA Fragment | Normalized intensities for a size band image
$H(I|T)$ | Gene sets | Conditional entropy of interesting gene set, given a gene set $T$
$h_i$ | Primer selection | Activation function for neuron $i$
$h_t$ | DMH | Transfer function at time step $t$
$I$ | Gene sets | A set of “interesting” genes
$I(I|T)$ | Gene sets | Mutual information of interesting gene $I$ given a gene set $T$
$k_i$ | Primer selection | Number of “interesting” genes in $t_i$
$L_A$ | Primer selection | Length of amplicon $a$
$L_P$ | Primer selection | Primer's effective length (number of bases that bind to the genome)
$\lambda_p$ | DMH | Half probe's length
$M$ | Primer selection | Number of possible (forward or reverse) primers per amplicon
$m_i$ | DMH | M-value for probe $i$
$\bar{m}$ | DMH | M-values for all probes in a chip
$mth(x)$ | DMH | Indicator function for methylation at position $x$
$N$ | Primer selection | Total number of possible primers, $N = 2|A|M$
$N$ | Gene sets | Number of genes (in the experiment)
$n$ | Gene sets | Number of “interesting” genes
$N_G$ | Primer selection | Genome's length
$N_T$ | Primer selection | Number of test tubes in a multiple PCR experiment
$N_T$ | Gene sets | Number of genes in GO-Term $T$
$N_{Taq}$ | Primer selection | Number of bases that can be replicated during PCR
$\mathcal{N}(\mu, \sigma)$ | | Normal distribution
$p_i$ | Primer selection | Activation probability of neuron $i$ (stochastic units)
$P_A, P_C, P_G, P_T, P_{ACGT}$ | Primer selection | Probability of base $\{A, C, G, T\}$ in a genome
$P_{mp}$ | Primer selection | Mispriming probability for 2 primers (i.e. probability of matching another part of the genome)
$P_{N,N_T}(r_{sum} = R)$ | Gene sets | Probability of a rank sum being $R$ when $N_T$ ranks are randomly selected from $[1, N]$
$P_{N,N_T}(r_{sum} = R, N_{out}, r_{min})$ | Gene sets | Probability of a rank sum being $R$ when $N_T$ ranks are randomly selected from $[r_{min}, N]$ but $N_{out}$ have already been selected
$P(s)$ | DMH | Probability of DNA fragment size $s$ after sonication
$P_{bind}(x, x_{cg}, x_p, \lambda_p)$ | DMH | Probability that a fragment starting at position $x$ in the genome binds to probe $x_p$ (probe's length is $2\lambda_p$)
$M^*$ | Primer selection | A set of primers that solves our primer selection problem
$p_{th}$ | DMH | Probability of a site being methylated
$R_{min}(N, N_T)$ | Gene sets | Minimum possible rank sum when $N_T$ ranks are randomly selected from $[1, N]$
$R_{max}(N, N_T)$ | Gene sets | Maximum possible rank sum when $N_T$ ranks are randomly selected from $[1, N]$
$R_{mm}$ | DMH | M-value's autocorrelation matrix
$R_{ss}(\tau)$ | DMH | Sonication score's autocorrelation function
$r_i$ | DMH | Red intensity for probe number $i$
$\bar{r}$ | DMH | Vector of red intensities for all probes in a chip
$r_i$ | Primer selection | Primer number $i$
$r_i(k)$ | Primer selection | Base number $k$ in primer number $i$
$r(x)$ | DMH | Indicator function for chip's probe centered at position $x$
$r_{ms}$ | DMH | Cross-correlation vector (M-values vs. sonication score)
$r_{sum}$ | Gene sets | Rank sum
$S_p(x_p)$ | DMH | Sonication score for probe located at $x_p$
$S_{pcg}(x_p, x_{cg})$ | DMH | Sonication score for probe located at $x_p$ and methylation site $x_{cg}$
$S_\tau(\tau)$ | DMH | Sonication score for a methylated site located at a distance $\tau$ from the probe
$S_{ss}(\omega)$ | DMH | Sonication score's power spectral density
$s_t$ | DMH | Sonication score at time step $t$
$\hat{s}_t$ | DMH | Sonication score's prediction at time step $t$
$tm_i$ | DMH | Melting temperature for probe $i$
$tt_i$ | Primer selection | Test tube number $i$
$\mathcal{T}$ | Primer selection | A set of all test tubes, $\mathcal{T} = \{tt_1, ..., tt_{N_T}\}$
$T$ | Primer selection | System's temperature (simulated annealing)
$\Theta = \{T_1, T_2, ..., T_H\}$ | Gene sets | A collection of gene sets
$u_i, \bar{u}_i$ | 3SAT | Variable $i$ in a problem; $\bar{u}_i$ corresponds to its negated value
$U$ | 3SAT | Number of variables ($u_i$) in a given problem
$\mathcal{U}$ | 3SAT | A set of all variables ($u_i$) in a given problem
$x_i$ | Primer selection | Output of neuron $i$
$\bar{x}(t)$ | Primer selection | A vector consisting of every neuron's output at time $t$
$w_{i,j}$ | Primer selection | Weight between neuron $i$ and neuron $j$
$W$ | Primer selection | Weight matrix consisting of $w_{i,j}$
$WC()$ | | Watson-Crick's complement
$x_p$ | DMH | Probe's position

5.3 Definitions

454 Sequencing: It is a massively-parallel sequencing-by-synthesis (SBS). The system relies on fixing nebulized and adapter-ligated DNA fragments to

115 small DNA-capture beads in a water-in-oil emulsion. The DNA fixed to these beads is then amplified by PCR. Finally, each DNA-bound bead is placed into a 44µm well on a PicoTiterPlate, a fiber optic chip. A mix of enzymes such as polymerase, sulfurase, and luciferase are also packed into the well. The Pi- coTiterPlate is then placed for sequencing. At this stage, the four nucleotides (TAGC) are washed in series over the PicoTiterPlate. During the nucleotide flow, each of the hundreds of thousands of beads with millions of copies of DNA is sequenced in parallel. If a nucleotide complementary to the template strand is flowed into a well, the polymerase extends the existing DNA strand by adding nucleotide(s). Addition of one (or more) nucleotide(s) results in a reaction that generates a light signal that is recorded by the CCD camera in the instrument. The signal strength is proportional to the number of nucleotides, for example, homopolymer stretches, incorporated in a single nucleotide flow [2]. ChIP on chip: ChIP assay followed by the hybridization of the enriched fraction on a microarray in order to locate protein binding sites [4]. CpG: CpG sites are regions of DNA where a cytosine nucleotide occurs next to a guanine nucleotide in the linear sequence of bases along its length. “CpG” stands for cytosine and guanine separated by a phosphate, which links the two nucleosides together in DNA. The “CpG” notation is used to distin- guish a cytosine followed by guanine from a cytosine base paired to a guanine [2].

CpG islands: Regions of DNA found in or near approximately 40% of the promoters of mammalian genes. They are regions where a large number of cytosines and guanines are adjacent to each other [2]. These islands are frequently located at the 5' ends of genes and may participate in the regulation of RNA synthesis initiation, an early transcription event that occurs in the promoter sequence located at the beginning of a gene [44].
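As a rough illustration of the CpG island definition above, the following Python sketch counts CpG dinucleotides in a sequence and reports its GC content. The function name `cpg_stats` is hypothetical, and the observed/expected ratio it also returns is a commonly used heuristic that is an assumption here, not something the definition itself specifies.

```python
def cpg_stats(seq: str) -> dict:
    """Count CpG dinucleotides and compute simple composition statistics."""
    s = seq.upper()
    n = len(s)
    c = s.count("C")
    g = s.count("G")
    # "CG" cannot overlap with itself, so str.count gives the exact number of CpG sites.
    cpg = s.count("CG")
    gc_content = (c + g) / n if n else 0.0
    # Assumed heuristic: observed/expected CpG = (CpG count * length) / (C count * G count).
    obs_exp = (cpg * n) / (c * g) if c and g else 0.0
    return {"cpg": cpg, "gc_content": gc_content, "obs_exp": obs_exp}


if __name__ == "__main__":
    print(cpg_stats("ACGCGTTCGA"))
```

In practice such statistics would be computed over a sliding window along the genome; the thresholds used to call an island vary between studies and are not fixed by this sketch.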

DMR: Differentially methylated regions. Regions whose methylation differs between the two parental chromosomes.

dsRNA: Double-stranded RNA (dsRNA) is RNA with two complementary strands, similar to the DNA found in all "higher" cells. dsRNA forms the genetic material of some viruses. In eukaryotes, it acts as a trigger to initiate the process of RNA interference and is present as an intermediate step in the formation of siRNAs (small interfering RNAs) [2].

Hemimethylation: When only one strand is methylated (e.g., in a CpG doublet).

HpaII: The enzyme HpaII cuts the substrate CCGG, but not if the central CpG is methylated.
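The methylation sensitivity of HpaII can be sketched as a small in-silico digestion. This is a minimal, single-strand illustration: the function name `hpaii_fragments` and the representation of methylation as a set of 0-based cytosine positions are hypothetical, and the cut is assumed to fall after the first C of the site (C^CGG). Hemimethylation and the complementary strand are ignored.

```python
def hpaii_fragments(seq, methylated=frozenset()):
    """Cut seq at every CCGG site whose central CpG cytosine is unmethylated.

    methylated: 0-based positions of methylated cytosines (illustrative encoding).
    """
    s = seq.upper()
    cuts = []
    start = 0
    while True:
        i = s.find("CCGG", start)
        if i == -1:
            break
        # The central CpG cytosine is the second base of the site, at index i + 1.
        if (i + 1) not in methylated:
            cuts.append(i + 1)  # assumed cut position: C^CGG
        start = i + 1
    fragments = []
    prev = 0
    for cut in cuts:
        fragments.append(s[prev:cut])
        prev = cut
    fragments.append(s[prev:])
    return fragments


if __name__ == "__main__":
    print(hpaii_fragments("AACCGGTT"))        # unmethylated site is cut
    print(hpaii_fragments("AACCGGTT", {3}))   # methylated central C blocks the cut
```

Comparing the fragments obtained with and without methylation at a site shows how a digestion pattern can reveal the methylation state of CCGG sites.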

Imprinting: is a biological phenomenon observed in some genes in which the two inherited copies of the gene have opposite expression patterns. Each of the two inherited copies of the gene is either expressed or silenced, depending on the parental origin of that copy. Whereas in some genes the maternal copy is the one being expressed, the opposite pattern is observed in other genes [2].

Isoschizomer: Restriction enzymes that recognize the same target sequence. Isoschizomers may nevertheless cut DNA differently because of differences in their sensitivity to methylcytosine.

Methylation: The covalent modification of DNA at the cyclic carbon-5 of a cytosine residue (5-methylcytosine: 5-mC).

miRNA: microRNAs (miRNA) are single-stranded RNA molecules of about 21-23 nucleotides in length thought to regulate the expression of other genes. miRNAs are encoded by genes that are transcribed from DNA but not translated into protein (non-coding RNA); instead they are processed from primary transcripts known as pri-miRNA to short stem-loop structures called pre-miRNA and finally to functional miRNA. Mature miRNA molecules are partially complementary to one or more messenger RNA (mRNA) molecules, and they function to downregulate gene expression [2].

Northern blot: A technique used in research to study gene expression. It takes its name from its similarity to the Southern blot procedure (named for biologist Edwin Southern), which is used to study DNA. The key difference is that RNA, rather than DNA, is the substance analyzed by electrophoresis and detected with a hybridization probe [2].

Restriction enzymes: Restriction enzymes cut DNA into fragments in or near their specific recognition or restriction sites.

siRNA: Small interfering RNA (siRNA), sometimes known as short interfering RNA or silencing RNA, are a class of 20-25 nucleotide-long double-stranded RNA molecules that play a variety of roles in biology. Most notably, siRNA is involved in the RNA interference (RNAi) pathway where the siRNA interferes with the expression of a specific gene [2].

Sodium bisulfite treatment: A method to detect 5-methylcytosine in DNA. Following treatment of DNA, all unmethylated cytosines are converted to uracil. Subsequent PCR amplification displays these bases as thymine. The methyl group at C5 of cytosine inhibits the hydrolytic deamination of cytosine to uracil, so methylated cytosines remain unchanged [4].

Southern blot: A method in molecular biology for enhancing the result of an agarose gel electrophoresis by marking specific DNA sequences [2].
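The sodium bisulfite treatment defined above can be illustrated with a short Python sketch that simulates the conversion on a single strand and then infers methylation by comparing the converted read with the reference. The function names and the encoding of methylation as a set of 0-based positions are hypothetical, and the sketch assumes complete conversion and error-free reads.

```python
def bisulfite_convert(seq, methylated=frozenset()):
    """Simulate bisulfite treatment followed by PCR on one strand.

    Unmethylated C is read as T; methylated C (positions in `methylated`) stays C.
    """
    out = []
    for i, base in enumerate(seq.upper()):
        if base == "C" and i not in methylated:
            out.append("T")
        else:
            out.append(base)
    return "".join(out)


def call_methylation(reference, converted):
    """Map each reference cytosine position to True (methylated) or False."""
    calls = {}
    for i, (ref_base, read_base) in enumerate(zip(reference.upper(), converted.upper())):
        if ref_base == "C":
            if read_base == "C":
                calls[i] = True    # protected from conversion
            elif read_base == "T":
                calls[i] = False   # converted, hence unmethylated
    return calls


if __name__ == "__main__":
    ref = "ACGTCG"
    read = bisulfite_convert(ref, methylated={1})
    print(read, call_methylation(ref, read))
```

A cytosine in the reference that appears as any base other than C or T in the read is left uncalled, since under the stated assumptions it cannot be explained by conversion.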

Western blot: (a.k.a. immunoblot) A method in molecular biology/biochemistry/immunogenetics to detect protein in a given sample of tissue homogenate or extract. It uses gel electrophoresis to separate denatured proteins by mass [2].

References

[1] Vardhman K. Rakyan, Adele Murrell, and Stephan Beck. From genome to epigenome. Human Molecular Genetics, 2005.
[2] Anonymous. Wikipedia. 2008.
[3] Robert B. Ash. Information theory. Dover, 1990.
[4] Pauline A. Callinan and Andrew P. Feinberg. The emerging science of epigenomics. Human Molecular Genetics, pages R95–R101, 2006.
[5] Shawn J. Cokus. Shotgun bisulphite sequencing of the Arabidopsis genome reveals DNA methylation patterning. Nature, pages 215–219, 2008.
[6] Carina Dennis. Altered states. Nature, pages 686–688, 2003.
[7] Walter Doerfler. (Epi)genetic signals: Towards a human genome sequence of all five nucleotides. The Epigenome: Molecular Hide and Seek, 2003.
[8] Adrian Alexa et al. Improved scoring of functional groups from gene expression data by decorrelating GO graph structure. Bioinformatics, pages 1600–1607, 2006.
[9] Alberts et al. Molecular biology of the cell. 2002.
[10] Aravind Subramanian et al. Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. PNAS, pages 15545–15550, 2005.
[11] Barry R Zeeberg et al. GoMiner: a resource for biological interpretation of genomic and proteomic data. Genome Biol., 2003.
[12] Barry R Zeeberg et al. High-throughput GoMiner, an 'industrial-strength' integrative gene ontology tool for interpretation of multiple-microarray experiments, with application to studies of common variable immune deficiency (CVID). BMC Bioinformatics, page 168, 2005.
[13] Bing Zhang et al. GOTree Machine (GOTM): a web-based platform for interpreting sets of interesting genes using gene ontology hierarchies. BMC Bioinformatics, 2004.
[14] D. Wackerly et al. Mathematical statistics with applications, sixth edition. Duxbury, 2002.


[15] David J. Reiss et al. Model-based deconvolution of genome-wide DNA binding. Bioinformatics, page 396, 2006.
[16] Douglas A Hosack et al. Identifying biological themes within lists of genes with EASE. Genome Biology, 2003.
[17] Douglas S Millar et al. Five not four: History and significance of the fifth base. The Epigenome: Molecular Hide and Seek, 2003.
[18] Eckhardt F et al. DNA methylation profiling of human chromosomes 6, 20 and 22. Nat Genet., pages 1378–85, 2006.
[19] G. A. Held et al. Modeling of DNA microarray data by using physical properties of hybridization. PNAS, pages 7575–7580, 2003.
[20] Gil Alterovitz et al. GO PaD: the gene ontology partition database. Proc Natl Acad Sci, pages 15545–15550, 2005.
[21] Hairong Wei et al. A study of the relationships between oligonucleotide properties and hybridization signal intensities from NimbleGen microarray datasets. Nucleic Acids Res., pages 2926–2938, 2008.
[22] Heming Zhu et al. Lsh is involved in de novo methylation of DNA. The EMBO Journal, pages 335–345, 2006.
[23] I Hatada et al. Genome wide profiling of promoter methylation in human. Oncogene, pages 3059–3064, 2006.
[24] Ido Braslavsky et al. Sequence information can be obtained from single DNA molecules. PNAS, pages 3960–3964, 2003.
[25] Ilana Keshet et al. Evidence for an instructive mechanism of de novo methylation in cancer cells. Nature Genetics, page 149, 2006.
[26] Jae Bum Kim et al. Polony multiplex analysis of gene expression (PMAGE) in mouse hypertrophic cardiomyopathy. Science, page 1481, 2007.
[27] John Hertz et al. Introduction to the theory of neural computation. Santa Fe Institute Studies in the Sciences of Complexity, 1991.
[28] Jorn Lewin et al. Quantitative DNA methylation analysis based on four-dye trace data from direct sequencing of PCR amplificates. Bioinformatics, pages 3005–3012, 2004.
[29] Ju J et al. Four-color DNA sequencing by synthesis using cleavable fluorescent nucleotide reversible terminators. PNAS, pages 19635–19640, 2006.

[30] Kristen H. Taylor et al. Ultradeep bisulfite sequencing analysis of DNA methylation patterns in multiple gene promoters by 454 sequencing. Cancer Research, pages 8511–8518, 2007.
[31] M. Kanehisa et al. KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Res., pages 27–30, 2000.
[32] M. Suzuki et al. DNA methylation landscapes: provocative insights from epigenomics. Nature Reviews Genetics, pages 465–476, 2008.
[33] Marcel Margulies et al. Genome sequencing in microfabricated high-density picolitre reactors. Nature, pages 376–380, 2005.
[34] M.F. Mette et al. Transcriptional silencing and promoter methylation triggered by double-stranded RNA. The EMBO Journal, pages 5194–5201, 2000.
[35] Rakyan VK et al. DNA methylation profiling of the human major histocompatibility complex: a pilot study for the human epigenome project. PLoS, page 405, 2004.
[36] S. Falcon et al. Using GOstats to test gene lists for GO term association. Bioinformatics, pages 257–258, 2006.
[37] S. Falcon et al. Using GOstats to test gene lists for GO term association. Bioinformatics, 2007.
[38] S. Kirkpatrick et al. Optimization by simulated annealing. Science, pages 671–680, 1983.
[39] Scott W Doniger et al. MAPPFinder: using gene ontology and GenMAPP to create a global gene-expression profile from microarray data. Genome Biology, 2003.
[40] Sven Olek et al. Digitizing molecular diagnostics: Current and future applications of epigenome technology. The Epigenome: Molecular Hide and Seek, 2003.
[41] T. Grange et al. Epigenomics: Large scale analysis of chromatin modifications and transcription factors / genome interactions. BioEssays, page 1203, 2005.
[42] Thomas A Down et al. A Bayesian deconvolution strategy for immunoprecipitation-based DNA methylome analysis. Nature Biotechnology, pages 779–785, 2008.
[43] Tim Beissbarth et al. GOstat: Find statistically overrepresented gene ontologies within a group of genes. Bioinformatics, pages 1464–1465, 2004.

[44] Tim Hui-Ming Huang et al. Epi meets genomics: Technologies for finding and reading the 5th base. The Epigenome: Molecular Hide and Seek, 2003.
[45] Vamsi K Mootha et al. PGC-1alpha-responsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes. Nature Genetics, page 267, 2003.
[46] Xiaoyu Zhang et al. Genome-wide high-resolution mapping and functional analysis of DNA methylation in Arabidopsis. Cell, pages 1189–1201, 2006.
[47] Yuan Qi et al. High-resolution computational models of genome binding events. Nat Biotechnol., page 963, 2006.
[48] Anne Ferguson-Smith. At the controls: Genomic imprinting and the epigenetic regulation of gene expression. The Epigenome: Molecular Hide and Seek, 2003.
[49] S. Haykin. Neural networks, a comprehensive foundation. Prentice Hall, 1999.
[50] Jean-Pierre Issa. Living longer: The aging epigenome. The Epigenome: Molecular Hide and Seek, 2003.
[51] Richard M. Karp. Reducibility among combinatorial problems. Complexity of Computer Computations, pages 85–103, 1972.
[52] N. R. Markham and M. Zuker. DINAMelt web server for nucleic acid melting prediction. Nucleic Acids Research, pages W577–W581, 2005.
[53] The Gene Ontology Consortium. Gene ontology: tool for the unification of biology. Nature Genetics, pages 25–29, 2000.
[54] R.A. Irizarry et al. Comprehensive high-throughput arrays for relative methylation (CHARM). Genome Res., page 780, 2008.
[55] Wolf Reik and Wendy Dean. Mammalian epigenomics: Reprogramming the genome for development and therapy. The Epigenome: Molecular Hide and Seek, 2003.
[56] Moshe Szyf. The dynamic epigenome and its implications in toxicology. Toxicological Sciences, pages 7–23, 2007.
[57] Saeed V. Vaseghi. Advanced digital signal processing and noise reduction. 2006.
[58] Jörn Walter and Martina Paulsen. Epigenetic trouble: Human diseases caused by epimutations. The Epigenome: Molecular Hide and Seek, 2003.