Methods for the Analysis of High Throughput Sequencing Data

by Christopher Fletez-Brant, MHS, MS

A dissertation submitted to The Johns Hopkins University in conformity with the requirements for the degree of Doctor of Philosophy

Baltimore, Maryland
July, 2018

© 2018 by Kipper Fletez-Brant
All rights reserved

Abstract

In this thesis I describe methods for the quality control or analysis of genomics data. I first develop a method for correcting for unwanted variation across samples in Hi-C data, and compare it to other possible approaches. I then develop a method for clustering features in high dimensional Bayesian inference, and apply it to gene expression data and the Bayesian non-negative matrix factorization algorithm CoGAPS.

Thesis Committee

Primary Readers

Kasper Daniel Hansen
Associate Professor
Department of Biostatistics
Johns Hopkins Bloomberg School of Public Health

Elana Fertig
Associate Professor
Department of Oncology
SKCCC, Johns Hopkins University

Acknowledgments

Alexis has been at my side for good times, my compass when lost, and she is as wise as she is lovely. This work is yours too. My family, parents Frank and Lorraine and brother Alec, have all provided support over the years, as well as some much-needed laughs.

Kasper Hansen has taught me much about thinking about science and being a scientist. Elana Fertig has encouraged me to pursue difficult problems. Other people at JHU to whom I owe various thank you’s include Andy McCallion for allowing me to intern in his lab, and Hans Bjornsson, Hongkai Ji and Jeff Leek for being parts of my thesis committee and sharing their feedback. Hans also introduced me to Kasper. Additionally, Chuck Rohde has taught me much, without which this work would not have been possible. The MD-GEM program was an amazing opportunity, and Priya Duggal, Jennifer Deal, Dave Valle and Sandy Muscelli are each to be commended for the amazing program they have put together.
Finally, grad student life would have been much less fun without my lab mates, JP, Leslie, Pete, Leandros and Albert - thanks guys.

Table of Contents

Table of Contents
List of Tables
List of Figures
1 Introduction
  1.1 The 3D Genome
    1.1.1 3D Assays
  1.2 Batch Effects and Normalization
    1.2.1 Hi-C and Batch Effect or Normalization
  1.3 Gene Expression
  1.4 Matrix Factorization
    1.4.1 CoGAPS
  1.5 Thesis Outline
    1.5.1 Unwanted Variation in Hi-C Data
    1.5.2 Gene Expression Clustering
2 Removing unwanted variation between samples in Hi-C experiments
  2.1 Abstract
  2.2 Introduction
  2.3 Results
    2.3.0.1 Unwanted variation in Hi-C data varies between distance stratum
    2.3.0.2 Band-wise normalization and batch correction
  2.4 Discussion
  2.5 Methods
    2.5.0.1 Data Generation
    2.5.0.2 Band Matrices
    2.5.0.3 Log counts per million transformation
    2.5.0.4 HiCNorm
    2.5.0.5 BNBC
    2.5.0.6 Explained variation and smoothed boxplot
    2.5.0.7 A/B compartments from smoothed contact matrices
    2.5.0.8 Acknowledgements
  2.6 Tables
  2.7 Figures
  2.8 Supplementary Figures
3 Unsupervised feature learning in Bayesian high dimensional inference
  3.1 Abstract
  3.2 Introduction
  3.3 PUMP: Probability of Unique Membership in Patterns
  3.4 Experimental Results
    3.4.1 Algorithm choice
    3.4.2 Data description
      3.4.2.1 PUMP findings for simulation
    3.4.3 PUMP findings for HNSCC
  3.5 Conclusion
  3.6 Tables
  3.7 Figures
  3.8 Appendix
4 Discussion and Conclusion

List of Tables

2.1 Sample Information
3.1 PUMP results for each factor.
3.2 Differences in gene-factor assignments.
3.3 Selection of GO Terms
3.4 Complete list of significant GO terms

List of Figures

2.1 Unwanted variation in Hi-C data.
2.2 Substantial unwanted variation in Hi-C data.
S1 The performance assessment of all methods using correlation of batch with principal components.
S2 Marginals and R² for unprocessed data.
S3 Use of HiCNorm and choice of width impact unwanted variation.
S4 The performance of BNBC by distance.
S5 Marginal distributions after BNBC.
S6 BNBC preserves structural features of Hi-C contact maps.
S7 The performance assessment of all methods using R².
3.1 Simulated data.
3.2 Timecourse Gene Expression Factors
3.3 Timecourse Factor Gene Expression
3.4 Entropy and PUMP scores

Chapter 1

Introduction

This thesis is concerned with methods for the analysis of genomics data. There are two major foci of this thesis: the analysis of Hi-C data (Chapter 2) and the analysis of gene expression data (Chapter 3). Hi-C data gives insight into the three-dimensional conformation of DNA in the cell nucleus, while gene expression data gives insight into the transcriptional landscape of a cell or cell type.

This introduction provides an overview of these data and of concepts related to the methods developed in my research projects. Specifically, I will describe the three-dimensional genome and assays designed to interrogate it (Section 1.1), as well as the general concept of batch effect correction and how it relates to assays for the 3D genome (Section 1.2). I will also discuss gene expression assays in general (Section 1.3), and matrix factorization and the CoGAPS algorithm (Section 1.4). Finally, I will conclude with an outline of this thesis.

1.1 The 3D Genome

Historically, genome research has focused on the linear genome. Early efforts focused on identifying specific features, such as the location of exon boundaries, or the location of enhancers. Characterizing such features implies a focus on individual genomic loci.
Advances in per-locus research have yielded insight into the nature of histone occupancy, DNA methylation, sequence features of enhancers and promoters, and many other phenomena as well. This line of study has allowed researchers to describe a given locus in terms of putative functions and sequence features. Despite this, the linking of loci, such as enhancers and promoters, is generally non-trivial.

The importance of understanding regulatory interactions, such as the interaction between a promoter and its enhancers, means that this difficulty has not prevented people from trying to link loci. Rather, clever analysis of data derived from existing techniques has made some inroads here. An adaptation of the CAGE-seq assay allowed researchers to find promoter-enhancer pairs active in multiple cell types (Andersson et al., 2014), while integration of multiple assays such as ChIP-seq and RNA-seq has allowed for linking of some promoter-enhancer pairs in a given cell type (Waszak et al., 2015).

While enhancer-promoter pairs are important, the overall 3D conformation of DNA within the cell nucleus has functional relevance as well (Dekker, Marti-Renom, and Mirny, 2013). To study this conformation, a variety of sequencing-based approaches have been developed. I will describe these next.

1.1.1 3D Assays

Many of the assays used to describe the 3D arrangement of DNA in the nucleus exploit the fact that the proximity of two loci can be detected by clever uses of DNA crosslinking and sequencing. This crosslinking is referred to as the 'capture' of the interaction, which has given rise to the family of 'C' methods (Wit and Laat, 2012). I will focus on these, although the discussion in this section is not exhaustive.

The first assay to be described is 'Chromosome Conformation Capture', or 3C (Dekker et al., 2002). This assay can be used to test if known candidate interactions are occurring in a given sample.
The assay works by chemically crosslinking the DNA in the cell, enzymatically digesting away the non-crosslinked DNA, and then ligating the remaining crosslinked pieces together and undoing the crosslinking. The result is a hybrid DNA fragment, with one end coming from one interacting locus and the other end from the other locus. PCR primers targeting the putative interaction can then be used to test by PCR whether the interaction is present.

While 3C allows for the testing of interactions between a priori specified loci, 4C ('Chromosome Conformation Capture-on-Chip') (Simonis et al., 2006) and 5C ('Chromosome Conformation Capture Carbon Copy') (Dostie et al., 2006) both try to loosen the requirement of prior knowledge. 4C follows the same crosslinking strategy as 3C, except that after the ligation and reversal of crosslinking the assay uses PCR primers targeting only one locus to amplify all interactions with this locus. 5C has a different approach, focusing instead on the set of all interactions for a given genomic region. This is achieved by the same steps as 3C and 4C until the ligation and reversal of crosslinking step, at which point 5C ligates universal primers to all fragments and uses locus-specific primers to find interactions of interest.

3C, 4C and 5C give insight into, respectively, one-to-one, one-to-all, and many-to-many relationships. By contrast, Hi-C was devised to obtain the complete picture of all inter-locus interactions of the genome. Hi-C (Lieberman-Aiden et al., 2009) achieves this by taking the same crosslinking strategy as before, but then uses paired-end sequencing to sequence all hybrid DNA fragments.
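To make the output of a Hi-C experiment concrete, the sketch below shows one common way to summarise paired-end reads: each read pair is assigned to a pair of fixed-width genomic bins, and the pairs are counted into a symmetric contact matrix. This is an illustration only, not the processing pipeline used in this thesis. The function name `contact_matrix`, the input format (already-aligned base-pair coordinates on a single chromosome), and the bin size are all assumptions made for the example.

```python
from collections import Counter


def contact_matrix(read_pairs, bin_size, n_bins):
    """Count read pairs per pair of genomic bins on one chromosome.

    read_pairs: iterable of (pos1, pos2) base-pair coordinates, one tuple
    per hybrid fragment (hypothetical, already-aligned input).
    """
    counts = Counter()
    for pos1, pos2 in read_pairs:
        i, j = pos1 // bin_size, pos2 // bin_size
        if i >= n_bins or j >= n_bins:
            continue  # ignore pairs that fall outside the binned region
        # An interaction has no direction, so store it under a sorted key.
        counts[(min(i, j), max(i, j))] += 1

    # Expand the counts into a dense symmetric matrix.
    matrix = [[0] * n_bins for _ in range(n_bins)]
    for (i, j), c in counts.items():
        matrix[i][j] = c
        matrix[j][i] = c
    return matrix


# Toy data: four read pairs on a 3 kb region, binned at 500 bp.
pairs = [(120, 950), (980, 130), (400, 450), (2500, 100)]
m = contact_matrix(pairs, bin_size=500, n_bins=6)
```

In this toy example the first two pairs link bins 0 and 1 regardless of read order, the third pair falls entirely within bin 0, and the fourth links bins 0 and 5.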
