Robust Normalization of Next Generation Sequencing Data
Total Page:16
File Type:pdf, Size:1020Kb
Robust Normalization of Next Generation Sequencing Data Dissertation zur Erlangung des Grades eines Doktors der Naturwissenschaften (Dr. rer. nat.) am Fachbereich Mathematik und Informatik der Freien Universität Berlin vorgelegt von Johannes Helmuth Berlin, 2017 Erstgutachter: Prof. Dr. Martin Vingron Zweitgutachter: Prof. Dr. Uwe Ohler Tag der Disputation: 15.05.2017 Preface This dissertation introduces a robust normalization method to uncover signals in noisy next generation sequencing data. The genesis of the described approach is the observation that next generation sequencing resembles a sampling process that can be modeled by means of discrete statistics. The specic and sensitive detection of signals from sequencing data pushes the eld of molecular biology forward towards a comprehensive understanding of the functional basic unit of life – the cell. The thesis is structured in three parts: Part I provides a background on molecular biology and statistics that is needed to understand Part II. I describe the fascinating subject of gene regulation and how diverse next genera- tion sequencing techniques have been developed to study cellular processes at the molec- ular level. The data generated in these experiments are naturally modeled with statistical models. To accurately quantify sequencing data, I propose the computational program “bamsignals” which was developed in collaboration with Alessandro Mammana [1]. Part II introduces a novel sequencing data normalization method which was developed under supervision of Dr. Ho-Ryun Chung from the Epigenomics laboratory at the Max Planck Institute for Molecular Genetics in Berlin. A manuscript describing the approach is de- posited on bioRxiv [2] and the method was also featured as a journal article in Springer Press BioSpektrum [3]. The method was implemented as an open-source software [4]. Part III provides the conclusion. The ndings of Part II are wrapped up. Furthermore, I give future directions for research in the eld of next generation sequencing data analysis. Acknowledgments Along the way to write this dissertation, many people helped me in pursuing my goal to become a computational biologist. First of all, I would like to thank Dr. Ho-Ryun Chung for his supervi- sion and nancial support. I always cherish his advice and critic on all my scientic endeavors. His comments on the manuscript greatly improved this thesis. Fruitful discussions with my ref- erees Prof. Dr. Martin Vingron and Prof. Dr. Uwe Ohler helped me understand the science of computational biology. I would like to give thanks to all members of the “Otto-Warburg-Laboratories: Computa- tional Epigenomics” group and the department of “Computational Molecular Biology” at the Max Planck Institute for Molecular Genetics in Berlin. The work environment was always inspiring and gave rise to multiple successful collaborations that did not make it into the book [5–7]. In iii iv particular, I thank Dr. Brian Carey (“Bhuaigh Èire!”) and Marcos Torroba (“¡Sigue así!”) for proofreading the manuscript. Also, I am thankful to be given the opportunity to represent the PhD students of the Max Planck Institute for Molecular Genetics in “Student Association” in the year 2013. The organi- zation of scientic and social events was fun and provided me with feedback on my scientic work. Finally, I would like to thank my darling Sina (“Ich liebe Dich!”), my parents Ingo and Char- lotte and my brother Christian for their motivation along the way. List of Publications Kinkley, S* & Helmuth, J*, Polansky, JK, Dunkel, I, Gasparoni G., Fröhler, S., Chen, W., Walter, J., Hamann, A. & Chung, HR.: “reChIP-seq reveals widespread bivalency of H3K4me3 and H3K27me3 in CD4+ memory T cells”. Nature Communications (2016) 7:12514 doi:10.1038/ncomms12514 Helmuth, J. & Chung, HR.: “Auswertung von Histonmodikations-ChIP-Seq-Datensätzen”. Biospektrum (2016) 22: 568. doi:10.1007/s12268-016-0728-6 Helmuth, J., Li, N., Arrigoni, L., Gianmoena, K., Cadenas, C., Gasparoni, G., Sinha, A., Rosenstiel, P., Walter, J., Hengstler, JG., Manke, T. & Chung, HR.: “normR: Regime enrichment calling for ChIP-seq data”. bioRxiv (October 2016). http://dx.doi.org/10.1101/082263 Corwin, T., Woodsmith, J., Apelt, F., Fontaine, JF., Meierhofer, D., Helmuth, J., Grossmann, A., Andrade-Navarro MA., Ballif, B., Stelzl, U.: “Interaction networks mediate human tyrosine kinase specicity”. Cell Systems (accepted) Yang, J., Moeinzadeh, MH., Kuhl, H., Helmuth, J., Xiao, P., Liu, MG., Zheng, JL., Sun, Z., Fan, W., Deng, MG., Wang, H., Hu, F., Fernie, A., Timmermann, B., Zhang, P. & Vingron, M.: “The haplotype-resolved genome sequence of hexaploid Ipomoea batatas reveals its evolutionary his- tory”. Nature Plants (accepted) Contents I Background 1 1 The Basics of Molecular Biology 3 1.1 The Genome and the Readout of Genetic Information . 3 1.1.1 The DNA . 4 1.1.2 The Chromatin . 5 1.1.3 The Readout of Genetic Information . 7 1.2 Measuring the Cell by Next Generation Sequencing . 9 1.2.1 Gene Expression . 9 1.2.2 Chromatin Modifications . 10 2 Mathematical Concepts 11 2.1 Statistical Prerequisites . 11 2.1.1 Statistical Inference . 11 2.1.2 Multiple Testing Correction and the T Method . 13 2.1.3 The Binomial Distribution . 16 2.1.4 Sampling from Binomial Distributions . 17 2.1.5 Mixture Models . 18 2.2 Model Parameter Estimation . 20 2.2.1 Suicient Statistics . 20 2.2.2 Maximum Likelihood Estimation . 22 2.2.3 The Expectation-Maximization Algorithm . 24 v vi Contents II Normalization of NGS Read Count Data 27 3 The normR Framework 29 3.1 Motivation . 29 3.2 The normR Approach . 31 3.2.1 Sequencing is a (Multinomial) Sampling Trial . 32 3.2.2 Deliberations on the Signal-to-Noise Ratio (S/N) . 34 3.2.3 The normR Method . 35 3.2.4 Why Not Use a Negative Binomial or Multinomial Distribution? . 37 3.3 Outlook . 39 4 ChIP-seq Enrichment Calling with enrichR 41 4.1 Introduction . 41 4.2 Methods . 44 4.2.1 Data Sets . 44 4.2.2 The normR Methods: enrichR ........................ 46 4.2.3 Confidence-Weighted antification of DNA-Methylation . 50 4.2.4 Comparison of Enrichment Callers . 50 4.2.5 Correlating enrichR-estimated Enrichment to NCIS and HMD% . 55 4.2.6 Chromatin Segmentation Based on enrichR Enrichment Calls . 56 4.3 Results – Enrichment Calling in High and Low S/N . 58 4.3.1 Systematic Comparison of Available Enrichment Callers . 60 4.3.2 enrichR Normalization Corresponds to Published In Silico as well as In Vitro Normalization Methods . 67 4.3.3 Improved Chromatin Segmentation with an enrichR-chromHMM Hy- brid Approach . 71 4.4 Discussion . 74 5 Regime Enrichment Calling with regimeR 77 5.1 Introduction . 77 5.2 Methods . 79 5.2.1 Data Sets . 79 5.2.2 The normR Methods: regimeR ........................ 79 5.2.3 Validation of regimeR Calls via Sequence Features . 81 5.3 Results - Distinct Heterochromatic Enrichment Regimes . 82 5.3.1 H3K27me3 Peaks Coincide with CpG Islands Bound by EZH2 . 83 5.3.2 H3K9me3 Peaks are Found within Repeats Bound by ZNF274 . 83 Contents vii 5.3.3 Heterochromatic Peaks Resemble Nucleation Sites for Heterochromatin Embedded within Regions of Broad Enrichment . 85 5.3.4 H3K27me3 and H3K9me3 do Overlap by a Minority within and between Tissues . 86 5.4 Discussion . 88 6 ChIP-seq Dierence Calling with diffR 91 6.1 Introduction . 91 6.2 Methods . 93 6.2.1 Data Sets . 93 6.2.2 The normR Methods: diffR ......................... 93 6.2.3 Gene Ontology Analysis . 95 6.2.4 Comparison of ChIP-seq Dierence Callers . 96 6.3 Results . 99 6.3.1 Dierence Calling in HepG2 Cells and Primary Human Hepatocytes . 99 6.3.2 Comparison of ChIP-seq Dierence Callers . 102 6.4 Discussion . 109 III Conclusion 111 Bibliography 115 A Supplementary Figures 131 B Supplementary Tables 137 C Abstract 140 D Zusammenfassung 142 E Selbstständigkeitserklärung 144 viii Contents List of Figures 1.1 The DNA Double Helix . 4 1.2 Levels of Dynamic Chromatin Compaction . 5 1.3 The Combinatorial Action of Histone Modifications . 6 2.1 A Simple Poisson Model for ChIP-seq Read Count Data . 13 2.2 The Anatomy of the P-Value Distribution . 15 3.1 The Inter-Dependency of Dierence Calling and Normalization . 31 3.2 A Sequencing Experiment Constitutes a Multinomial Sampling Trial . 32 3.3 The Sequencing Depth Ratio Is a Background Estimation Biased towards Signal- Regions . 33 3.4 The Closed System of Statistical Power . 35 3.5 A Negative Multinomial Mixture Fit Does Not Model The Interrelation between Treatment and Control . 38 4.1 enrichR: The normR Method for Two Components . 47 4.2 Enrichment Calling with enrichR on H3K4me3 and H3K36me3 ChIP-seq Data in Primary Human Hepatocytes . 59 4.3 enrichR H3K4me3 Enrichment Calls Agree with Peaks Reported by Competitor Methods . 61 4.4 enrichR H3K36me3 Enrichment Calling Outperforms Competitor Methods . 63 4.5 Precision-Recall-Curves Based on a Bona-Fide Benchmark Support the Superior Performance of enrichR . 64 4.6 High Fold-Change and Read Coverage in Tool-Specific Regions Support Validity of enrichR Calls . 66 4.7 Saturation Analysis Based on a Unified Bona-Fide Benchmark Set for enrichR, MACS2, DFilter, CisGenome, SPP, BCP and MUSIC . 67 4.8 enrichR Normalization Improves on NCIS’ Estimated Normalization Factor to- wards a Correct Background Estimation . 68 ix x List of Figures 4.9 enrichR Enrichment e and ICeChIP’s HMD% are Strongly Correlated but HMD% Incorrectly Inflates Low Count Fold Changes . 70 4.10 chromHMM Binarization Input: enrichR Improved on the Mutual Information between Histone Modifications Associated to Active Transcription in GM12878 ChIP-seq Data . 71 4.11 chromHMM Segmentation Output: An enrichR-chromHMM Hybrid Approach Increased Resolution of Chromatin Segmentation in GM12878 Cells . 73 5.1 regimeR Identifies H3K27me3 and H3K9me3 Enrichment Regimes at the FCGBP Gene Locus .