Bioinformatics for Epigenomics

Pablo Cingolani
Master of Science, Computer Science
McGill University, Montreal, Quebec
2008-08-30
ISBN: 978-0-494-56817-0

ACKNOWLEDGEMENTS

I would like to thank Dr. Mathieu Blanchette and Dr. Michael Hallett for supervising my Master's thesis. My most grateful recognition goes to Matt Sudderman for his advice on the analysis and his help in editing this document. Many thanks to the lab members. I must also thank Dr. Moshe Szyf for fruitful discussions and for allowing the use of his data.

ABSTRACT

Epigenetics refers to reversible, heritable changes in gene regulation that occur without a change in DNA sequence. These changes are usually due to methylation of cytosine bases in DNA. In this work we review existing methodologies and propose new ones for use in epigenomics. We developed high-throughput methods to estimate methylation levels, as well as methods for interpreting the data biologically based on gene set enrichment. Our methylation estimates correlated highly with experimental data from MeDIP experiments. Our proposed methods for gene set enrichment performed better than well-known methods.

ABRÉGÉ

Epigenetics describes reversible, heritable changes in gene regulation that occur without changes in the DNA sequence. These changes are usually due to the methylation of cytosines in DNA. In this thesis, we review existing bioinformatics methods and propose new methods for problems related to epigenetics. High-throughput methods for estimating methylation levels are developed, as well as methods for the biological interpretation of the data based on the enrichment of sets of genes sharing the same function. High levels of correlation are obtained between our estimates and experimental data from MeDIP-type experiments. The methods we propose for gene function enrichment analysis perform better than other existing methods.

TABLE OF CONTENTS

ACKNOWLEDGEMENTS
ABSTRACT
ABRÉGÉ
LIST OF TABLES
LIST OF FIGURES
1 Introduction: Epigenetics
  1.1 Background on epigenetics and epigenomics
  1.2 DNA methylation
    1.2.1 Imprinting
    1.2.2 Health
    1.2.3 Methodologies for analyzing DNA methylation
  1.3 Histone changes
2 Modeling methylated DNA immunoprecipitation
  2.1 Introduction
    2.1.1 Introduction to our model
  2.2 The model
  2.3 DNA fragment length
  2.4 Sonication score
    2.4.1 Sonication score auto-correlation
    2.4.2 Methylation is auto-correlated
  2.5 Immunoprecipitation
  2.6 Hybridization
    2.6.1 Melting temperature Tm
    2.6.2 Probe sequence content
  2.7 Relationship between sonication score, melting temperature, base position and M-values
  2.8 Predicting sonication signal
  2.9 Results
  2.10 Conclusions
3 Multiple PCR and primer selection problem
  3.1 Probability of mispriming
  3.2 Probability of primer pair search failure
  3.3 Feasibility
  3.4 Formal problem definition
  3.5 Problem complexity
    3.5.1 Definition of 3SAT
    3.5.2 Mapping 3SAT to primer selection problem
    3.5.3 Sequence construction
  3.6 Non-linear dynamic solution
  3.7 Getting out of local minima: Stochastic approach
  3.8 Multiple PCR
    3.8.1 Lower bound using heuristic approach
    3.8.2 Solving multiple PCR: Simulated annealing
    3.8.3 Multiple PCR: Getting as many amplicons in one test tube
  3.9 Discussion
4 Gene sets analysis
  4.1 Introduction
  4.2 Previous work
  4.3 Mutual information
    4.3.1 Mutual information for gene sets
    4.3.2 Simulations and results
  4.4 Ranked list
    4.4.1 Simulations and results
  4.5 Discussion
5 Conclusions
Appendix A
  A.1 Sonication score: Formulation details
  A.2 Sonication score's autocorrelation
  A.3 Wiener filter model
    A.3.1 Wiener filter with additional parameters
Appendix B
  B.1 Ranked sum with replacement
    B.1.1 Approximation by normal distribution
    B.1.2 Exact calculation
    B.1.3 Fast algorithm
  B.2 Ranked sum without replacement
    B.2.1 Min / Max values
    B.2.2 Exact calculation
    B.2.3 Normal approximation
    B.2.4 Approximation
Appendix C
  C.1 Simulated annealing: Energy difference
Appendix D
  5.2 Symbol reference
  5.3 Definitions
References

LIST OF TABLES

1–1 Methylation detection methods
3–1 How to map each of the eight possible clauses
3–2 Heuristic algorithm results for a particular primer selection problem
4–1 Algorithm for ranking gene sets using p-values
4–2 Greedy algorithm optimizing mutual information
4–3 Methodology for comparing algorithms in our simulations
4–4 Algorithm comparison: Mean recovery rate and standard deviation
4–5 Methodology for selecting GO-terms and genes for our simulation
4–6 Algorithm comparison: Mean recovery rate and standard deviation
4–7 Algorithm comparison with GSEA using MSigDB set C2
4–8 Algorithm comparison with GSEA using MSigDB set C5
5–1 Rank sum approximation

LIST OF FIGURES

1–1 Methyl group
1–2 5-methyl C nucleotides accidentally deaminated
1–3 How DNA methylation patterns are inherited
1–4 Mammalian X-chromosome inactivation
1–5 Bisulfite modification of a C nucleotide
1–6 Steps in DMH
2–1 Methylated DNA Immunoprecipitation
2–2 Model diagram
2–3 Gel run image using sonicated DNA
2–4 Function S_τ(x_cg − x_p)
2–5 Auto-correlation function R_ss(d) for S_τ
2–6 Scatter plot m_{p+τ} vs m_p shows auto-correlation
2–7 Sample M-value auto-correlation function
2–8 Methylation sample auto-correlation using HEP data
2–9 Immunoprecipitation efficiency as a function of methylated CpGs
2–10 Histograms for sonication scores S(x_p)
2–11 Tm histogram and effect on probe M-values
2–12 Base position influence on hybridization score
2–13 Relationship of parameters in a fully methylated DNA sample MeDIP experiment
2–14 M-values and M-value estimates
2–15 Correlation between sonication signal s_p and predicted sonication signal ŝ_p for different microarrays
2–16 Difference between Wiener filter's coefficients for two different experiments
2–17 Correlation histogram for all "regions"
2–18 Original signal S_p(x) and interpolated approximation
3–1 Number of possible primers needed
3–2 Optimization network
3–3 Matrix W is a sparse block matrix
4–1 GO structure: A directed acyclic graph (DAG) for each ontology
4–2 In an experiment analyzing N genes, n are "interesting"
4–3 Elim algorithm
4–4 Gene sets A, B and B′ and set of interesting genes I
5–1 Scoring for a given probe p
5–2 Binding probability
5–3 Two probes centered at x_i and x_j share a common genomic region
5–4 Probability density function P_{N,N_T}(R)
5–5 Normal approximation's RMS error
5–6 P_{N,N_T}(R) for different values of N and N_T

CHAPTER 1
Introduction: Epigenetics

Genetics studies how living organisms inherit features from one generation to the next.
