
Methods for the Analysis of High Throughput Sequencing Data

by

Christopher Fletez-Brant, MHS, MS

A dissertation submitted to The Johns Hopkins University in conformity with the requirements for the degree of

Doctor of Philosophy

Baltimore, Maryland July, 2018

© 2018 by Kipper Fletez-Brant

All rights reserved

Abstract

In this thesis I describe methods for the quality control or analysis of genomics data. I first develop a method for correcting for unwanted variation across samples in Hi-C data, and compare it to other possible approaches. I then develop a method for clustering features in high dimensional Bayesian inference, and apply it to gene expression data and the Bayesian non-negative matrix factorization algorithm CoGAPS.

Thesis Committee

Primary Readers

Kasper Daniel Hansen Associate Professor Department of Biostatistics Johns Hopkins Bloomberg School of Public Health

Elana Fertig Associate Professor Department of Oncology SKCCC, Johns Hopkins University

Acknowledgments

Alexis has been at my side for good times, my compass when lost, and she is as wise as she is lovely. This work is yours too.

My family, parents Frank and Lorraine and brother Alec, have all provided support over the years, as well as some much-needed laughs.

Kasper Hansen has taught me much about thinking about science and being a scientist. Elana Fertig has encouraged me to pursue difficult problems.

Other people at JHU to whom I owe various thank you’s include Andy McCallion for allowing me to intern in his lab, and Hans Bjornsson, Hongkai Ji and Jeff Leek for being parts of my thesis committee and sharing their feedback.

Hans also introduced me to Kasper. Additionally, Chuck Rohde has taught me much, without which this work would not have been possible. The MD-GEM program was an amazing opportunity and Priya Duggal, Jennifer Deal, Dave Valle and Sandy Muscelli are each to be commended for the amazing program they have put together. Finally, grad student life would have been much less fun without my lab mates, JP, Leslie, Pete, Leandros and Albert - thanks guys.

Table of Contents

Table of Contents

List of Tables

List of Figures

1 Introduction

1.1 The 3D Genome

1.1.1 3D Assays

1.2 Batch Effects and Normalization

1.2.1 Hi-C and Batch Effect or Normalization

1.3 Gene Expression

1.4 Matrix Factorization

1.4.1 CoGAPS

1.5 Thesis Outline

1.5.1 Unwanted Variation in Hi-C Data

1.5.2 Gene Expression Clustering

2 Removing unwanted variation between samples in Hi-C experiments

2.1 Abstract

2.2 Introduction

2.3 Results

2.3.0.1 Unwanted variation in Hi-C data varies between distance stratum

2.3.0.2 Band-wise normalization and batch correction

2.4 Discussion

2.5 Methods

2.5.0.1 Data Generation

2.5.0.2 Band Matrices

2.5.0.3 Log counts per million transformation

2.5.0.4 HiCNorm

2.5.0.5 BNBC

2.5.0.6 Explained variation and smoothed boxplot

2.5.0.7 A/B compartments from smoothed contact matrices

2.5.0.8 Acknowledgements

2.6 Tables

2.7 Figures

2.8 Supplementary Figures

3 Unsupervised feature learning in Bayesian high dimensional inference

3.1 Abstract

3.2 Introduction

3.3 PUMP: Probability of Unique Membership in Patterns

3.4 Experimental Results

3.4.1 Algorithm choice

3.4.2 Data description

3.4.2.1 PUMP findings for simulation

3.4.3 PUMP findings for HNSCC

3.5 Conclusion

3.6 Tables

3.7 Figures

3.8 Appendix

4 Discussion and Conclusion

List of Tables

2.1 Sample Information

3.1 PUMP results for each factor.

3.2 Differences in gene-factor assignments.

3.3 Selection of GO Terms

3.4 Complete list of significant GO terms

List of Figures

2.1 Unwanted variation in Hi-C data.

2.2 Substantial unwanted variation in Hi-C data.

S1 The performance assessment of all methods using correlation of batch with principal components.

S2 Marginals and R² for unprocessed data.

S3 Use of HiCNorm and choice of width impact unwanted variation.

S4 The performance of BNBC by distance.

S5 Marginal distributions after BNBC.

S6 BNBC preserves structural features of Hi-C contact maps.

S7 The performance assessment of all methods using R².

3.1 Simulated data.

3.2 Timecourse Gene Expression Factors

3.3 Timecourse Factor Gene Expression

3.4 Entropy and PUMP scores

Chapter 1

Introduction

This thesis is concerned with methods for the analysis of genomics data. There are two major foci of this thesis: the analysis of Hi-C data (Chapter 2) and the analysis of gene expression data (Chapter 3). Hi-C data gives insight into the three-dimensional conformation of DNA in the cell nucleus, while gene expression data gives insight into the transcriptional landscape of a cell or cell type. This introduction provides an overview of these data and concepts related to the methods developed in my research projects. Specifically, I will describe the three-dimensional genome and assays designed to interrogate it (Section 1.1), as well as the general concept of batch effect correction and how it relates to assays for the 3D genome (Section 1.2). I will also give a general discussion of gene expression assays (Section 1.3), matrix factorization and the CoGAPS algorithm (Section 1.4). Finally, I will conclude with an outline of this thesis.

1.1 The 3D Genome

Historically, genome research has focused on the linear genome. Early efforts focused on identifying specific features, such as the location of exon boundaries, or the location of enhancers. Characterizing such features implies a focus on individual genomic loci.

Advances in per-locus research have yielded insight into the nature of histone occupancy, DNA methylation, sequence features of enhancers and promoters, and many other phenomena as well. This line of study has allowed researchers to describe a given locus in terms of putative functions and sequence features. Despite this, the linking of loci, such as enhancers and promoters, is generally non-trivial.

The importance of understanding regulatory interactions such as the interaction between a promoter and its enhancers, however, means that the difficulty has not prevented people from trying to link loci. Rather, clever analysis of data derived from existing techniques has made some inroads here. An adaptation of the CAGE-seq assay allowed researchers to find promoter-enhancer pairs active in multiple cell types (Andersson et al., 2014), while integration of multiple assays such as ChIP-seq and RNA-seq has allowed for linking of some promoter-enhancer pairs in a given cell type (Waszak et al., 2015).

While enhancer-promoter pairs are important, the overall 3D conformation of DNA within the cell nucleus has functional relevance as well (Dekker, Marti-Renom, and Mirny, 2013). To address this question, a variety of sequencing-based approaches have been developed. I will describe these next.

1.1.1 3D Assays

Many of the assays used to describe the 3D arrangement of DNA in the nucleus exploit the fact that the proximity of two loci can be detected by clever uses of DNA crosslinking and sequencing. This crosslinking is referred to as the ’capture’ of the interaction, which has given rise to the family of ’C’ methods (Wit and Laat, 2012). I will focus on these, although the discussion in this section is not exhaustive.

The first assay to be described is ’Chromosome Conformation Capture’, or 3C (Dekker et al., 2002). This assay can be used to test if known candidate interactions are occurring in a given sample. The assay works by chemically crosslinking the DNA in the cell, enzymatically digesting away the non-crosslinked DNA, and then ligating the remaining crosslinked pieces together and undoing the crosslinking. The result is a hybrid DNA fragment, with one end coming from one interacting locus and the other end from the other locus. PCR primers targeting the putative interaction can then be used to perform PCR to test if the interaction is present.

While 3C allows for the testing of interactions between a priori specified loci, 4C (’Chromosome Conformation Capture-on-Chip’) (Simonis et al., 2006) and 5C (’Chromosome Conformation Capture Carbon Copy’) (Dostie et al., 2006) both try to loosen the requirement of prior knowledge. 4C follows the same crosslinking strategy as 3C, except that after the ligation and reversal of crosslinking the assay uses PCR primers targeting only one locus to amplify all interactions with this locus. 5C has a different approach, focusing instead on the set of all interactions for a given genomic region. This is achieved by the same steps as 3C and 4C until the ligation and reversal of crosslinking step, at which point 5C ligates universal primers to all fragments and uses locus-specific primers to find interactions of interest.

3C, 4C and 5C give insight into, respectively, 1-1, 1-All, and Many-Many relationships. By contrast, Hi-C was devised to obtain the complete picture of all inter-locus interactions of the genome. Hi-C (Lieberman-Aiden et al., 2009) achieves this by taking the same crosslinking strategy as before, but then uses paired-end sequencing to sequence all hybrid DNA fragments. These sequenced fragments are then mapped to the genome, giving a comprehensive picture of all pair-wise interactions for the genome, with the relative strength of interactions being indicated by the number of reads mapping to a given pair of loci. Typically the resulting data is summarized into a table of pairwise interactions. This table is a symmetric matrix in which entry $(i, j)$ gives the interaction strength for loci $i$ and $j$. These matrices are called ’contact maps’, and a distinction is made between interactions between loci on the same chromosome (called cis-interactions), and between loci on different chromosomes (called trans-interactions). Part of this thesis will focus on cis-interactions for Hi-C.
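As an illustration of the contact map just described, the following is a minimal sketch, assuming the reads have already been mapped and each read pair has been reduced to the pair of genomic bins its two ends fall into. The function name `contact_map` and the toy input are hypothetical and are not part of any Hi-C pipeline discussed here.

```python
import numpy as np

def contact_map(bin_pairs, n_bins):
    """Count read pairs per pair of bins and return a symmetric matrix."""
    m = np.zeros((n_bins, n_bins), dtype=np.int64)
    for i, j in bin_pairs:
        m[i, j] += 1
        if i != j:
            m[j, i] += 1  # mirror so that entry (i, j) equals entry (j, i)
    return m

# Hypothetical toy input: each tuple is one read pair, given as the bins
# that its two ends map to on a single chromosome (cis-interactions only).
pairs = [(0, 1), (1, 0), (0, 1), (2, 2), (1, 3)]
cmap = contact_map(pairs, n_bins=4)
```

Entry `cmap[i, j]` is then the number of read pairs linking bins `i` and `j`, which is the raw interaction strength referred to above.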

1.2 Batch Effects and Normalization

It has long been appreciated that for genomics assays exploring gene expression, there is unwanted variation between different samples that are processed on different days, or at different times, or by different labs (Leek et al., 2010). This variation, which is often completely unrelated to the biological question of interest, is generally referred to as a ’batch’ effect, where ’batch’ simply describes a group of samples processed at the same time. There is an extensive statistical literature describing methods to estimate and then remove the effect of batches for gene expression data (gene expression data is described below).

Popular methods include ComBat (Johnson, Li, and Rabinovic, 2007), PEER (Stegle et al., 2010), SVA (Leek and Storey, 2007) and RUV (Gagnon-Bartsch and Speed, 2012).

Normalization is a general term that refers to correcting for biases or technical effects that will impact comparisons between experimental units. As with batch effects, there is also a rich literature of methods for normalization of gene expression data. Common items to be corrected for include the sequencing depth of an RNA-seq experiment, the non-Gaussian nature of raw count data, or general distributional differences between samples. These issues are generally fixed, respectively, by scaling by library size, log- or arcsin-transforming a given sample’s gene expression data if from RNA-seq, or quantile normalization, which forces every sample to have the same quantiles.
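The corrections listed above can be sketched in a few lines. This is an illustrative sketch only, assuming a genes-by-samples count matrix; the function names and the offset of 1 inside the logarithm are assumptions made for the example, and ties in the quantile step are broken arbitrarily.

```python
import numpy as np

def log_cpm(counts):
    """Scale each sample (column) by its library size, then log2-transform."""
    lib_size = counts.sum(axis=0)
    cpm = counts / lib_size * 1e6
    return np.log2(cpm + 1.0)  # the +1 offset is an assumption, to avoid log(0)

def quantile_normalize(x):
    """Force every column of x to share the same set of quantiles."""
    order = np.argsort(x, axis=0)
    ranks = np.argsort(order, axis=0)            # rank of each entry in its column
    reference = np.sort(x, axis=0).mean(axis=1)  # mean of the k-th smallest values
    return reference[ranks]

# Hypothetical toy data: 4 genes (rows) measured in 3 samples (columns).
x = np.array([[5.0, 4.0, 3.0],
              [2.0, 1.0, 4.0],
              [3.0, 4.0, 6.0],
              [4.0, 2.0, 8.0]])
qn = quantile_normalize(x)
```

After quantile normalization, sorting any column of `qn` gives the same vector, which is the sense in which every sample has the same quantiles.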

1.2.1 Hi-C and Batch Effect or Normalization

While this is discussed in greater detail in Chapter 2, it is worth briefly commenting on normalization and batch effects in Hi-C data. As a high-throughput sequencing assay, Hi-C is subject to effects from GC content, sequencing fragment length, and mappability of a given locus (mappability can be thought of as ’uniqueness’) (Hu et al., 2012; Yaffe and Tanay, 2011; Imakaev et al., 2012). These sources of bias impact the relative strength of interaction reported in a given single sample’s contact matrices: for two different interactions (matrix cells), the difference in strength may be attributable to differences in GC content of the pairs of interacting loci, for example. This prevents single-sample, cross-interaction comparisons from being sensibly made. HiCNorm explicitly corrects for these three sources of bias, while other methods such as ICE or the Knight-Ruiz matrix balancing algorithm (Knight and Ruiz, 2013) non-parametrically correct for inter-interaction variation generally.

Another source of bias for Hi-C in comparing interactions within the same sample is the distance between loci. Hi-C contact matrices exhibit an exponential decay in strength of interaction as the distance between the interacting loci increases. This makes comparisons between interactions difficult as some of the difference may be attributable to different distances between the pairs of loci. The observed/expected transformation (Lieberman-Aiden et al., 2009) is designed to resolve this issue, and corrects for the effect of distance by scaling all contacts separated by a given distance by their mean. The exponential decay of interaction strength as a function of distance is the subject of some discussion in Chapter 2.
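The observed/expected transformation can be illustrated as follows. This is a minimal sketch, assuming a single symmetric cis contact matrix in which the distance between loci is measured in bins; the function name and the toy matrix are hypothetical, and published implementations may differ in detail.

```python
import numpy as np

def observed_over_expected(cmap):
    """Divide each contact by the mean of all contacts at the same distance."""
    n = cmap.shape[0]
    out = np.zeros_like(cmap, dtype=float)
    for d in range(n):
        idx = np.arange(n - d)
        band = cmap[idx, idx + d]            # all contacts separated by d bins
        mean = band.mean()
        if mean > 0:
            out[idx, idx + d] = band / mean
            out[idx + d, idx] = band / mean  # keep the matrix symmetric
    return out

# Hypothetical toy map whose signal depends only on distance (pure decay).
n = 6
pos = np.arange(n)
toy = 100.0 * 0.5 ** np.abs(pos[:, None] - pos[None, :])
oe = observed_over_expected(toy)
```

Because the toy map contains nothing but distance decay, every entry of `oe` equals 1; in real data, departures from 1 are what remains after the effect of distance is removed.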

Finally, the notion of batch effect in Hi-C is hitherto unexplored, largely owing to the difficulty and expense of the assay. Most studies compare cell types, or are interested in how two interactions differ for a given cell type. In this thesis, I am interested in fixing one interaction, and asking how that interaction varies across samples. To this end, I will characterize batch effect in Hi-C, and detail a method to resolve it.

1.3 Gene Expression

The characterization of the transcriptome for a given cell or cell type has been achieved mainly with two technologies. Microarrays are an older chip-based technology that requires knowing which genes to assay, and RNA-seq is a newer technology that uses high-throughput sequencing to directly sequence various RNA species. Regardless of which technology one uses, a given sample’s gene expression data is a vector of expression quantities. This vector, which is of length p for the p genes examined, is real-valued in the case of microarrays or, in the case of RNA-seq, consists of integers indicating how many reads mapped to that gene.

A given experiment generally results in a matrix of dimension $p \times n$, with $n$ being the number of samples examined, and then one can perform various analyses of this data. Examples include differential expression, where the goal is to find genes that have statistically significant differences in expression between different sample groups, or to perform a factor analysis or clustering, to discover biologically relevant groups of either samples or genes in a super- or unsupervised manner. I will describe a method for a clustering analysis in this thesis.

Of course, this is a gross simplification of this entire field of research. However, this thesis considers gene expression data in a manner that is agnostic to the technology used, and assumes simply that data either were real-valued or were transformed to be real-valued. Readers wishing to learn more about gene expression technologies or related statistical concerns are directed to Lowe et al. (2017).

1.4 Matrix Factorization

Matrix factorization is a collection of techniques to decompose an observed matrix $D$ (of dimension $p \times n$) into component factors

$$D = AP, \quad A \in \mathbb{R}^{p \times k}, \quad P \in \mathbb{R}^{k \times n}.$$

Some common examples are principal component analysis (PCA), singular value decomposition (SVD) and non-negative matrix factorization (NMF). These techniques differ in important ways: PCA and SVD are mathematical equalities, while NMF is a type of algorithm (and so the equality is replaced by an approximation); PCA and NMF decompose $D$ into two factor matrices, while SVD decomposes $D$ into three factors, $D = USV^t$. Each nonetheless gives a different type of insight into the structure of $D$. This thesis will focus on NMF-related methods (see below).
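The contrast between an exact decomposition and an approximate one can be made concrete with a small sketch. It assumes NumPy, and for the NMF it uses the classical multiplicative update rules for squared error rather than the Bayesian approach taken by CoGAPS; the function name `nmf`, the iteration count and the toy matrix are all choices made for illustration.

```python
import numpy as np

def nmf(D, k, n_iter=500, seed=0, eps=1e-9):
    """Approximate D (p x n) by A (p x k) times P (k x n), both non-negative."""
    rng = np.random.default_rng(seed)
    p, n = D.shape
    A = rng.random((p, k)) + eps
    P = rng.random((k, n)) + eps
    for _ in range(n_iter):
        # Multiplicative updates keep every entry non-negative.
        P *= (A.T @ D) / (A.T @ A @ P + eps)
        A *= (D @ P.T) / (A @ P @ P.T + eps)
    return A, P

# Hypothetical toy data: a non-negative 6 x 5 matrix of rank 2.
rng = np.random.default_rng(1)
D = rng.random((6, 2)) @ rng.random((2, 5))

A, P = nmf(D, k=2)                                # D is approximately A @ P
U, S, Vt = np.linalg.svd(D, full_matrices=False)  # D equals U @ diag(S) @ Vt
```

The SVD reproduces `D` up to floating point error, whereas the quality of the NMF approximation depends on the number of iterations and on the starting values.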

In NMF, the $k$ factors can be thought of as latent groups. Different groups have different relative contributions from different features and from different samples. The factor matrices of an NMF-decomposed matrix are sparse and non-negative. Generally, $A$ assigns features to factors (also sometimes called ’patterns’), and $P$ assigns samples to factors. The non-zero elements of these matrices are the contributions of a given feature or sample to a given pattern.

By considering both pattern matrices together, statements like ’samples from the control group had strong contributions to gene expression of the following genes: ... ’ can be readily made. This allows for understanding the interplay of features and samples.

1.4.1 CoGAPS

CoGAPS (Coordinated Gene Activity in Pattern Sets) (Fertig et al., 2010) is a Bayesian non-negative matrix factorization algorithm. For this version of NMF, the pattern matrices are estimated using a Bayesian approach, which learns the distribution

$$p(A, P \mid D) = \frac{p(D \mid A, P)\, p(A, P)}{p(D)} \propto p(D \mid A, P)\, p(A, P).$$

The priors on these matrices (skilling) are chosen such that there are few non-zero elements and the magnitude of these elements is likely to be small; the effect is that large matrix elements are unlikely to occur unless the data $D$ requires them. The estimates $\hat{A}$, $\hat{P}$ are the expectation of the posterior. I highlight a method I develop by implementing it in the CoGAPS framework.

1.5 Thesis Outline
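As a hedged illustration of what taking the expectation of the posterior means in practice, suppose that $T$ draws $(A^{(t)}, P^{(t)})$ from the posterior are available, for example from a Markov chain Monte Carlo sampler. The number of draws $T$ and the superscript notation are assumptions introduced here. The posterior expectations can then be approximated by sample averages:

```latex
\hat{A} = \mathbb{E}[A \mid D] \approx \frac{1}{T} \sum_{t=1}^{T} A^{(t)},
\qquad
\hat{P} = \mathbb{E}[P \mid D] \approx \frac{1}{T} \sum_{t=1}^{T} P^{(t)}.
```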

This section provides brief descriptions of the main chapters of this thesis.

1.5.1 Unwanted Variation in Hi-C Data

In Chapter 2, several points are made. I describe batch effect in Hi-C, features of batch effect, and how to identify its presence. I next develop a way to correct for it, then demonstrate its success at removing batch effect. I show that our method, BNBC, preserves biologically relevant features of the data despite removing batch effect, and that BNBC is superior to other approaches.

1.5.2 Gene Expression Clustering

Chapter 3 discusses a technique for gene expression clustering. Using the posterior samples of a Bayesian non-negative matrix factorization algorithm, I develop a method to assign genes to factors. I show that these groupings are biologically relevant using a GTEx dataset of neuronal cell types. I also show that other approaches miss important features of gene expression data, and give insight about why this approach works.

References

Andersson, Robin, Claudia Gebhard, Irene Miguel-Escalada, Ilka Hoof, Jette Bornholdt, Mette Boyd, Yun Chen, Xiaobei Zhao, Christian Schmidl, Takahiro Suzuki, Evgenia Ntini, Erik Arner, Eivind Valen, Kang Li, Lucia Schwarzfischer, Dagmar Glatz, Johanna Raithel, Berit Lilje, Nicolas Rapin, Frederik Otzen Bagger, Mette Jørgensen, Peter Refsing Andersen, Nicolas Bertin, Owen Rackham, A Maxwell Burroughs, J Kenneth Baillie, Yuri Ishizu, Yuri Shimizu, Erina Furuhata, Shiori Maeda, Yutaka Negishi, Christopher J Mungall, Terrence F Meehan, Timo Lassmann, Masayoshi Itoh, Hideya Kawaji, Naoto Kondo, Jun Kawai, Andreas Lennartsson, Carsten O Daub, Peter Heutink, David A Hume, Torben Heick Jensen, Harukazu Suzuki, Yoshihide Hayashizaki, Ferenc Müller, FANTOM Consortium, Alistair R R Forrest, Piero Carninci, Michael Rehli, and Albin Sandelin (2014). “An atlas of active enhancers across human cell types and tissues”. In: Nature 507.7493, pp. 455–461. DOI: 10.1038/nature12787.

Dekker, Job, Marc A Marti-Renom, and Leonid A Mirny (2013). “Exploring the three-dimensional organization of genomes: interpreting chromatin interaction data”. In: Nature Reviews Genetics 14, pp. 390–403. DOI: 10.1038/nrg3454.

Dekker, Job, Karsten Rippe, Martijn Dekker, and Nancy Kleckner (2002). “Capturing chromosome conformation”. In: Science 295.5558, pp. 1306–1311. ISSN: 0036-8075, 1095-9203. DOI: 10.1126/science.1067799. URL: http://dx.doi.org/10.1126/science.1067799.

Dostie, Josée, Todd A Richmond, Ramy A Arnaout, Rebecca R Selzer, William L Lee, Tracey A Honan, Eric D Rubio, Anton Krumm, Justin Lamb, Chad Nusbaum, Roland D Green, and Job Dekker (2006). “Chromosome Conformation Capture Carbon Copy (5C): a massively parallel solution for mapping interactions between genomic elements”. In: Genome Res. 16.10, pp. 1299–1309. ISSN: 1088-9051. DOI: 10.1101/gr.5571506. URL: http://dx.doi.org/10.1101/gr.5571506.

Fertig, Elana J, Jie Ding, Alexander V Favorov, Giovanni Parmigiani, and Michael F Ochs (2010). “CoGAPS: an R/C++ package to identify patterns and biological process activity in transcriptomic data”. In: Bioinformatics 26.21, pp. 2792–2793. ISSN: 1367-4803, 1367-4811. DOI: 10.1093/bioinformatics/btq503. URL: http://dx.doi.org/10.1093/bioinformatics/btq503.

Gagnon-Bartsch, Johann A and Terence P Speed (2012). “Using control genes to correct for unwanted variation in microarray data”. In: Biostatistics 13, pp. 539–552. DOI: 10.1093/biostatistics/kxr034.

Hu, Ming, Ke Deng, Siddarth Selvaraj, Zhaohui Qin, Bing Ren, and Jun S Liu (2012). “HiCNorm: removing biases in Hi-C data via Poisson regression”. In: Bioinformatics 28, pp. 3131–3133. DOI: 10.1093/bioinformatics/bts570.

Imakaev, Maxim, Geoffrey Fudenberg, Rachel Patton McCord, Natalia Naumova, Anton Goloborodko, Bryan R Lajoie, Job Dekker, and Leonid A Mirny (2012). “Iterative correction of Hi-C data reveals hallmarks of chromosome organization”. In: Nature Methods 9, pp. 999–1003. DOI: 10.1038/nmeth.2148.

Johnson, W Evan, Cheng Li, and Ariel Rabinovic (2007). “Adjusting batch effects in microarray expression data using empirical Bayes methods”. In: Biostatistics 8, pp. 118–127. DOI: 10.1093/biostatistics/kxj037.

Knight, Philip A and Daniel Ruiz (2013). “A fast algorithm for matrix balancing”. In: IMA Journal of Numerical Analysis 33, pp. 1029–1047. DOI: 10.1093/imanum/drs019.

Leek, Jeffrey T and John D Storey (2007). “Capturing heterogeneity in gene expression studies by surrogate variable analysis”. In: PLOS Genetics 3, pp. 1724–1735. DOI: 10.1371/journal.pgen.0030161.

Leek, Jeffrey T, Robert B Scharpf, Héctor Corrada Bravo, David Simcha, Benjamin Langmead, W Evan Johnson, Donald Geman, Keith Baggerly, and Rafael A Irizarry (2010). “Tackling the widespread and critical impact of batch effects in high-throughput data”. In: Nature Reviews Genetics 11, pp. 733–739. DOI: 10.1038/nrg2825.

Lieberman-Aiden, Erez, Nynke L van Berkum, Louise Williams, Maxim Imakaev, Tobias Ragoczy, Agnes Telling, Ido Amit, Bryan R Lajoie, Peter J Sabo, Michael O Dorschner, Richard Sandstrom, Bradley Bernstein, M A Bender, Mark Groudine, Andreas Gnirke, John Stamatoyannopoulos, Leonid A Mirny, Eric S Lander, and Job Dekker (2009). “Comprehensive mapping of long-range interactions reveals folding principles of the human genome”. In: Science 326, pp. 289–293. DOI: 10.1126/science.1181369.

Lowe, Rohan, Neil Shirley, Mark Bleackley, Stephen Dolan, and Thomas Shafee (2017). “Transcriptomics technologies”. In: PLoS Comput. Biol. 13.5, e1005457. ISSN: 1553-734X, 1553-7358. DOI: 10.1371/journal.pcbi.1005457. URL: http://dx.doi.org/10.1371/journal.pcbi.1005457.

Simonis, Marieke, Petra Klous, Erik Splinter, Yuri Moshkin, Rob Willemsen, Elzo de Wit, Bas van Steensel, and Wouter de Laat (2006). “Nuclear organization of active and inactive chromatin domains uncovered by chromosome conformation capture-on-chip (4C)”. In: Nat. Genet. 38.11, pp. 1348–1354. ISSN: 1061-4036. DOI: 10.1038/ng1896. URL: http://dx.doi.org/10.1038/ng1896.

Stegle, Oliver, Leopold Parts, Richard Durbin, and John Winn (2010). “A Bayesian framework to account for complex non-genetic factors in gene expression levels greatly increases power in eQTL studies”. In: PLoS Computational Biology 6, e1000770. DOI: 10.1371/journal.pcbi.1000770.

Waszak, Sebastian M, Olivier Delaneau, Andreas R Gschwind, Helena Kilpinen, Sunil K Raghav, Robert M Witwicki, Andrea Orioli, Michael Wiederkehr, Nikolaos I Panousis, Alisa Yurovsky, Luciana Romano-Palumbo, Alexandra Planchon, Deborah Bielser, Ismael Padioleau, Gilles Udin, Sarah Thurnheer, David Hacker, Nouria Hernandez, Alexandre Reymond, Bart Deplancke, and Emmanouil T Dermitzakis (2015). “Population Variation and Genetic Control of Modular Chromatin Architecture in Humans”. In: Cell 162.5, pp. 1039–1050. DOI: 10.1016/j.cell.2015.08.001.

Wit, Elzo de and Wouter de Laat (2012). “A decade of 3C technologies: insights into nuclear organization”. In: Genes & Development 26, pp. 11–24. DOI: 10.1101/gad.179804.111.

Yaffe, Eitan and Amos Tanay (2011). “Probabilistic modeling of Hi-C contact maps eliminates systematic biases to characterize global chromosomal architecture”. In: Nature Genetics 43, pp. 1059–1065. DOI: 10.1038/ng.947.

Chapter 2

Removing unwanted variation between samples in Hi-C experiments

This chapter is adapted from a paper currently under submission (Fletez-Brant et al., 2017). In addition to myself, other authors on this paper are Yunjiang Qiu, David U. Gorkin, Ming Hu and Kasper D. Hansen. I am responsible for the development of methods and their subsequent characterization presented in this chapter, under the predoctoral supervision of Hansen. Hi-C data referred to in this chapter was obtained from experiments performed by Gorkin and then preprocessed by Qiu.

2.1 Abstract

Hi-C data is commonly normalized using single sample processing methods, with focus on comparisons between regions within a given contact map. Here, we aim to compare contact maps across different samples. We demonstrate that unwanted variation is present in Hi-C data on biological replicates, and that this unwanted variation changes across the contact map. We present BNBC, a method for normalization and batch correction of Hi-C data and show that it substantially improves comparisons across samples.

2.2 Introduction

The Hi-C assay allows for genome-wide measurements of chromatin interactions between different genomic regions (Lieberman-Aiden et al., 2009; Wit and Laat, 2012; Dekker, Marti-Renom, and Mirny, 2013; Schmitt, Hu, and Ren, 2016; Davies et al., 2017). Hi-C has predominantly been used to comprehensively study differences in 3D genome structure between loci within a cell type. Partly because of the high cost of the assay, the role of interpersonal variation in 3D genome structure is largely unexplored.

When comparing genomic data between samples, variation can arise from numerous sources that do not reflect the biology of interest, including sample procurement, sample storage, library preparation, and sequencing. We refer to these sources of variation as ’unwanted’ here, because they obscure the underlying biology that is of interest when performing a between-sample comparison. It is critical to correct for this unwanted variation in analysis (Leek et al., 2010). A number of tools and extensions have been successful at this, particularly for analysis of gene expression data (Leek and Storey, 2007; Leek and Storey, 2008; Gagnon-Bartsch and Speed, 2012; Johnson, Li, and Rabinovic, 2007; Stegle et al., 2010; Leek, 2014; Risso et al., 2014).

Existing normalization methods for Hi-C data are single sample methods, focused on comparisons between different loci in the genome. To facilitate this, some methods explicitly model sources of unwanted variation, such as GC content of interaction loci, fragment length, mappability and copy number (Yaffe and Tanay, 2011; Hu et al., 2012; Vidal et al., 2017). Other methods are agnostic to sources of bias and attempt to balance the marginal distribution of contacts (Imakaev et al., 2012; Knight and Ruiz, 2013; Rao et al., 2014; Yan et al., 2017). A comparison of some of these methods found extremely high correlation between their correction factors (Rao et al., 2014); we will use HiCNorm as an exemplar of these within-sample normalization methods (Hu et al., 2012).

By contrast, there has been less work on between-sample normalization. Two existing methods have considered between-sample normalization in the context of a differential comparison, both based on the idea of loess normalization from gene expression microarrays (Yang et al., 2002). In these methods, the estimated fold-change between conditions is modeled using a loess smoother as a function of either average contact strength (Lun and Smyth, 2015) or distance between loci (Stansfield and Dozmorov, 2017). Using the estimated model, the data are corrected so there is no effect of the covariate on the fold-change. These approaches require a specific comparison of interest and only stabilize the mean fold-change.

To address the pressing need for cross-sample normalization methods for Hi-C, we developed BNBC (Bandwise Normalization and Batch Correction), a method for normalization and batch correction of Hi-C data. The method is focused on making individual entries in the contact maps comparable across samples. Our approach is inspired by the observation that patterns of variation between replicates of Hi-C data are different depending on the distance between interacting loci. Therefore, our approach conditions on 1D genomic distance, aggregating all contacts between loci separated by a specific distance across all samples into a matrix where the columns are each sample’s Hi-C interactions for a specific inter-locus interaction distance. Normalization and removal of unwanted variation proceeds on these separate matrices, which we term band matrices. We show that important biological and statistical features of the Hi-C contact maps are preserved using this approach, while stabilizing the marginal distributions across samples and substantially reducing unwanted variation.
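The construction of a band matrix can be sketched as follows. This is an illustrative sketch only, assuming that every sample's contact map for a chromosome is a square NumPy array of the same size and that distance is measured in bins; the function name `band_matrix` and the simulated maps are hypothetical and do not reproduce the BNBC software.

```python
import numpy as np

def band_matrix(cmaps, d):
    """Stack the d-th diagonal of every sample's contact map as columns."""
    return np.column_stack([np.diagonal(m, offset=d) for m in cmaps])

# Hypothetical toy data: three samples, each a symmetric 5 x 5 contact map.
rng = np.random.default_rng(0)
cmaps = []
for _ in range(3):
    m = rng.poisson(5.0, size=(5, 5))
    cmaps.append(m + m.T)

band2 = band_matrix(cmaps, d=2)  # rows: locus pairs 2 bins apart; columns: samples
```

Each row of `band2` is one pair of loci at the chosen distance and each column is one sample, so any between-sample normalization or batch correction method that expects a features-by-samples matrix can be applied to it.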

2.3 Results

2.3.0.1 Unwanted variation in Hi-C data varies between distance stratum

It is well described that a Hi-C contact map exhibits an exponential decay in signal as the distance between loci increases (Lieberman-Aiden et al., 2009).

When we quantify this behavior across biological replicates (lymphoblastoid cell lines generated from 3 individuals from each of 3 trios from the HapMap project, Table 2.1), each with 2 growth replicates, we observe substantial variation in the decay rate from sample to sample (Figure 2.1a). We use the term “biological replicate” here, as it is widely used in the context of a population-based study where each biological replicate is a sample from a different individual. Our samples are lymphoblastoid cell lines from the HapMap project (International HapMap Consortium, 2003), because these cell lines have been a widely used model system to study inter-individual variation and genetic mechanisms in numerous molecular phenotypes including gene expression, chromatin accessibility, histone modification, and DNA methylation (Stranger et al., 2007; Pickrell et al., 2010; Montgomery et al., 2010; Degner et al., 2012; Kasowski et al., 2013; McVicker et al., 2013; Kilpinen et al., 2013; Bell et al., 2011). The Hi-C data from 8 lymphoblastoid cell lines were normalized within growth replicate using HiCNorm (Hu et al., 2012). Following application of HiCNorm, contact maps were corrected for library size using the log counts per million transformation and smoothed using the HiCRep approach (Yang et al., 2017); a bandwidth of 5 was selected using this approach (Methods). Smoothing of the contact map has been found beneficial (Yang et al., 2017; Ursu et al., 2017; Yaffe and Tanay, 2011; Imakaev et al., 2012); recent work, which we confirm, has found that the correlation between technical replicates is increased by smoothing (Yang et al., 2017).

The library preparation of these samples was done at 3 different time points, and we use these time points to define a batch factor (Table 2.1). This batch factor encompasses other potential differences between the samples (aside from library preparation batch), because the trios sampled here come from 3 different human populations (Yoruba, Han Chinese and Puerto Rican), and all of the Yoruba libraries (from Ibadan, Nigeria) were prepared on the same date. The other two populations have one growth replicate in each of two batches. It has been established that phenotypic differences, which are unlikely to be explained by genetics, exist between lymphoblastoid cell lines from different HapMap populations (Stark et al., 2010; Choy et al., 2008; Stranger et al., 2007). These differences might be related to cell line creation and division (Stark et al., 2010). For this reason, it is hard to separate the effect of Hi-C experimental batch from that of cell line creation and division in our data. Nonetheless, both types of effects represent unwanted variation insofar as they confound attempts to study inter-individual variation in 3D genome organization.

To assess unwanted variation beyond changes in the mean, we represented our data as a set of matrices indexed by genomic distance (Figure 2.1b). Each matrix contains all contacts between loci at a fixed genomic distance for all samples (Methods). We call this a band transformation, since these contacts form diagonal bands in the original Hi-C contact matrices. For each band, we observe substantial variation in the distribution of contacts between samples (Figure 2.1c-e). These marginal distributions suggest the presence of unwanted variation (Leek et al., 2010). We argue that this variation is unwanted across biological replicates, since our data reveal it is at least partly technical. Note that not all contact distances are treated equally when interpreting Hi-C data: one goal of Hi-C experiments is to identify enhancer-promoter contacts, which are thought to occur primarily within 1 Mb (Vernimmen and Bickmore, 2015).

To assess the impact of unwanted variation on our Hi-C data, we first asked, for each contact, how much variation is explained by the batch factor. We measured the amount of explained variation using $R^2$ from a linear mixed effects model with a random effect to model the increased correlation between growth replicates (Methods). We observe an association between explained variation and distance between loci (Figure 2.2a), with an average $R^2$ value of 0.667. This suggests that the effect of the batch factor is substantial and changes with distance. We note again that the “batch factor” here is not simply a Hi-C experiment batch, because the variation between these batches has several potential sources as explained above. To further explore the effect of batch, we performed PCA on each of the band matrices and computed the Spearman correlation between each of the first four principal components and the batch indicator. Since our batch factor has 3 levels there are 3 possible orderings of the factor (Figure 2.2b for one ordering, Supplementary Figure S1 for the other two). This again shows substantial unwanted variation associated with the batch factor, and furthermore shows the dynamic nature of the unwanted variation as distance changes.

To ensure that these data characteristics were not introduced by HiCNorm combined with smoothing, we performed the same measurements on the raw Hi-C contact maps and found similar characteristics (Supplementary Figure S2), with one exception. Specifically, we found that smoothing with a larger bandwidth greatly increased the variation explained by the batch factor (Supplementary Figure S3).

Together, our results highlight the need for between-sample normalization and removal of unwanted variation for Hi-C data, and demonstrate that the effect of unwanted variation depends on the genomic distance between loci.

2.3.0.2 Band-wise normalization and batch correction

To normalize the data and remove unwanted variation we used the band transformation framework. Prior to band transformation we use a 2D smoother on the contact maps. Following smoothing we perform quantile normalization separately on each band matrix. Finally, we apply ComBat (Johnson, Li, and Rabinovic, 2007) separately on each band matrix using the batch factor variable. As stated above, we refer to our method as band-wise normalization and batch correction (BNBC) (see Methods). This approach is not critically dependent on the choice of smoother or bandwidth, nor on the usage of HiCNorm; we observe similar performance across these choices (Supplementary Figure S3c).

To assess the effect of BNBC, we again measured the variation explained by the batch factor and observed a remarkable decrease of this quantity, which no longer changes as a function of distance (Figure 2.2c). Specifically, mean $R^2$ decreases from 0.667 to 0.09. Comparing $R^2$ between HiCNorm and BNBC shows a decrease for essentially every individual contact (Figure 2.2e); this pattern depends on distance (Supplementary Figure S4). Likewise, the correlation between each of the first 4 principal components and the batch factor was close to zero (Figure 2.2d). In addition, the marginal distributions are stabilized, which is expected since we performed quantile normalization (Supplementary Figure S5). This shows that BNBC removes substantial unwanted variation associated with batch.

We next investigated the impact of BNBC on features of the contact map. The BNBC-corrected data exhibit the standard decay pattern of Hi-C data, without variation across replicates (Supplementary Figure S5a). More interestingly, we observe a contact map very similar to that of HiCNorm (Supplementary Figure S6). The same is true for its associated first eigenvector, which is commonly used to identify A/B compartments (Supplementary Figure S6). We conclude that the application of BNBC does not distort gross features of the contact map.

Above we show that increasing the bandwidth of the smoother increases the variation explained by the batch factor, and also increases the correlation between technical replicates. When we examined the impact of increased smoothing bandwidth following application of BNBC, we found little effect of bandwidth or, put differently, that BNBC was able to correct for the increase. Since increasing the bandwidth does increase the correlation between technical replicates, we use the bandwidth recommended by the HiCRep approach (a bandwidth of 5). We note that the HiCRep criterion does not include consideration of biological signal, and we caution that such signal could be diminished. For example, in work on normalization of DNA methylation arrays, we found that the methods which perform best at reducing technical variation do not necessarily perform best when the assessment is replication of biological signal (Fortin et al., 2014).

Popular alternatives to ComBat include SVA (Leek and Storey, 2007; Leek and Storey, 2008; Leek, 2014), RUV (Gagnon-Bartsch and Speed, 2012; Risso et al., 2014) and PEER (Stegle et al., 2010), which are all variations of factor models. These methods construct surrogate variables which represent unmeasured sources of unwanted variation. We experimented with the use of PEER instead of ComBat and observed dramatically reduced performance compared to ComBat (Supplementary Figures S7 and S1). Note that the choice of $R^2$ as evaluation metric can be considered unfair since ComBat uses the batch factor as input; the correlation plots should not be affected by this. We ran PEER using both 1 and 4 factors; results were very similar. The performance of PEER raises the question of how to best correct for unwanted variation when an explicit batch factor is unavailable. This is an important open question because (1) ComBat requires two samples for each level of the batch factor and (2) unwanted variation may be mediated through factors other than library preparation batch.

2.4 Discussion

To analyze Hi-C data across samples, including biological replicates, it is clear that between-sample normalization methods are necessary. Here, we have characterized unwanted variation present in Hi-C contact maps and have developed a correction method named BNBC. We show that unwanted variation exhibits a distance-dependent effect, in addition to known distance-based features of Hi-C contact maps. We present BNBC, a modular approach where we combine band transformation with existing tools for normalization and removal of unwanted variation. We show that BNBC performs well in reducing the impact of unwanted variation while still preserving important 3D features, such as the structure of the contact map and A/B compartments. Our focus in this work has been the normalization of individual entries in the contact map, but we note that proper normalization of such entries is not a requirement for normalization of higher-order structures. For example, we have previously observed that A/B compartments reproduce well between samples from the same cell type (Fortin and Hansen, 2015). Note that the batch factor we have analyzed here could be driven by either or both of cell line construction and Hi-C library preparation. Proper normalization and correction for unwanted variation will be critical for comparing Hi-C contact maps between different samples.

2.5 Methods

2.5.0.1 Data Generation

Hi-C experiments: The lymphoblastoid Hi-C data analyzed here were generated by the dilution Hi-C method using HindIII (Lieberman-Aiden et al., 2009) on 9 lymphoblastoid cell lines derived from the 1000 Genomes project (Table 2.1). Data are publicly available through the 1000 Genomes project (Chaisson et al., 2017) as well as through the 4D Nucleome data portal (https://data.4dnucleome.org; accessions 4DNESYUYFD6H, 4DNESVKLYDOH, 4DNESHGL976U, 4DNESJ1VX52C, 4DNESI2UKI7P, 4DNESTAPSPUC, 4DNES4GSP9S4, 4DNESJIYRA44, 4DNESE3ICNE1). Hi-C contact matrices were generated by tiling the genome into 40 kb bins and counting the number of interactions between bins. We refer to these as raw contact matrices.

Hi-C read alignment and contact matrices: Reads were aligned to the hg19 reference genome using bwa-mem (Li, 2013). Read ends were aligned independently, because the paired-end mode in BWA cannot handle the long insert size of Hi-C reads. Aligned reads were further filtered to keep only the 5′ alignment. Read ends were then manually paired. Read pairs with low mapping quality (MAPQ<10) were discarded, and PCR duplicates were removed using Picard tools 1.131 (http://broadinstitute.github.io/picard). To construct the contact matrices, Hi-C read pairs were assigned to predefined 40 kb genomic bins. Bins with low mapping quality (< 0.8), low GC content (< 0.3), and low fragment length (< 10% of the bin size) were discarded.

2.5.0.2 Band Matrices

To make comparisons across individuals, we form band matrices, which are matrices whose columns are all matrix band $i$ from each sample. A matrix band is a collection of entries in a contact matrix between two loci at a fixed distance. Formally, band $i$ is the collection of $(j, k)$ entries with $|j - k| + 1 = i$.
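As an illustration of this definition, the following is a minimal Python sketch of the band transformation. It is not the implementation in the bnbc package; the function names `band` and `band_matrix` are hypothetical, contact matrices are assumed to be square, symmetric lists of lists with 0-based indices, and only the upper triangle is read (so band 1 is the main diagonal).

```python
def band(matrix, i):
    """Return band i of a square matrix as a list.

    Band i holds the entries (j, k) with |j - k| + 1 == i. The matrix is
    assumed symmetric, so only entries with k >= j are read.
    """
    offset = i - 1
    n = len(matrix)
    return [matrix[j][j + offset] for j in range(n - offset)]


def band_matrix(samples, i):
    """Stack band i of every sample as columns of one matrix.

    Rows index positions along the band; columns index samples.
    All samples are assumed to share the same dimensions.
    """
    columns = [band(m, i) for m in samples]
    return [list(row) for row in zip(*columns)]


# Two toy 3 x 3 contact matrices standing in for two samples.
sample_a = [[1, 2, 3], [2, 4, 5], [3, 5, 6]]
sample_b = [[10, 20, 30], [20, 40, 50], [30, 50, 60]]
band_2 = band_matrix([sample_a, sample_b], 2)
```

With 40 kb bins, band 2 in this indexing corresponds to loci separated by one bin (40 kb), which matches the band numbers quoted in the figure captions.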

2.5.0.3 Log counts per million transformation

We use the logCPM (log counts per million) transformation previously described (Law et al., 2014). Specifically, for a contact matrix $X$ we estimate library size $L$ by the sum of the upper triangular matrix of each of the chromosome specific contact matrices. This discards inter-chromosomal contacts as well as the diagonal of the contact matrix. The logCPM matrix $Y$ is defined as

$$Y_{ij} = \log\left(\frac{X_{ij} + 0.5}{L + 1} \cdot 10^{6}\right)$$

where $X_{ij}$ refers to element $i, j$ from the contact matrix $X$ and $L$ is the estimated library size for that matrix. For data normalized using HiCNorm, neither $X$ nor $L$ is integer-valued.
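The transformation can be sketched in a few lines of Python. This is an illustrative sketch only: the function names are hypothetical, matrices are plain lists of lists, and the base of the logarithm is an assumption (base 2, as in voom), since the text above does not state it.

```python
import math


def library_size(chrom_matrices):
    """Sum the strictly upper-triangular entries of each chromosome matrix.

    The diagonal is excluded, and inter-chromosomal contacts never enter
    because only chromosome-specific matrices are passed in.
    """
    total = 0.0
    for m in chrom_matrices:
        n = len(m)
        for j in range(n):
            for k in range(j + 1, n):
                total += m[j][k]
    return total


def log_cpm(matrix, lib_size, log=math.log2):
    """Apply Y_ij = log((X_ij + 0.5) / (L + 1) * 1e6) entry by entry.

    The default base-2 logarithm is an assumption, not taken from the text.
    """
    return [[log((x + 0.5) / (lib_size + 1.0) * 1e6) for x in row]
            for row in matrix]


toy = [[5, 1, 2], [1, 5, 3], [2, 3, 5]]
L = library_size([toy])   # 1 + 2 + 3
Y = log_cpm(toy, L)
```

The offsets 0.5 and 1 keep the logarithm finite for empty cells, which is why the transform can be applied to sparse long-range bands.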

2.5.0.4 HiCNorm

We normalized data using HiCNorm (Hu et al., 2012) with an updated implementation (https://github.com/ren-lab/HiCNorm). Following HiCNorm normalization, we applied the log counts per million transformation (see above). We then smooth the contact matrices with a box smoother with a bandwidth of 5 bins; we use HiCRep to choose the bandwidth based on the correlation between technical replicates (Yang et al., 2017). The bandwidth we select is the same as the bandwidth selected for 40 kb resolution Hi-C data in Yang et al. (2017). Smoothing was performed using the EBImage package (Pau et al., 2010); this is a separate but equivalent implementation to HiCRep.
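To make the smoothing step concrete, here is a minimal pure-Python sketch of a 2D box (mean) smoother. It does not reproduce the EBImage or HiCRep code. Two assumptions are made and should be checked against the software actually used: the parameter `h` is treated as a half-width, so the window is $(2h + 1) \times (2h + 1)$, and windows are truncated at the matrix border rather than padded.

```python
def box_smooth(matrix, h):
    """Replace each entry by the mean of its (2h+1) x (2h+1) neighborhood.

    Assumptions (not taken from the text): h is a half-width, and the
    window is clipped at the matrix edges.
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    out = [[0.0] * n_cols for _ in range(n_rows)]
    for r in range(n_rows):
        for c in range(n_cols):
            r0, r1 = max(0, r - h), min(n_rows, r + h + 1)
            c0, c1 = max(0, c - h), min(n_cols, c + h + 1)
            window = [matrix[a][b]
                      for a in range(r0, r1)
                      for b in range(c0, c1)]
            out[r][c] = sum(window) / len(window)
    return out


# A single spike is spread over its neighborhood.
spike = [[0, 0, 0], [0, 9, 0], [0, 0, 0]]
smoothed = box_smooth(spike, 1)
```

With `h = 0` the matrix is returned unchanged, which gives a convenient unsmoothed baseline when comparing bandwidths.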

2.5.0.5 BNBC

BNBC has the following components: separate smoothing of each contact matrix, application of the band transformation, quantile normalization on each band matrix and finally application of ComBat on each band matrix.

Following the log counts per million transformation of the raw contact matrices, we smooth individual chromosome matrices using a box smoother with a bandwidth of 5, as selected by the HiCRep approach (Yang et al., 2017). Each contact matrix and each chromosome is smoothed separately. We next apply the band transformation (see above) and quantile normalize each band matrix separately (Bolstad et al., 2003).
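The quantile normalization of a single band matrix can be sketched as follows. This is a simplified illustration of the procedure of Bolstad et al. (2003), not the code used in this work: the function name is hypothetical, rows are band positions and columns are samples, and tied values are assigned consecutive ranks in their original order rather than averaged.

```python
def quantile_normalize(band_matrix):
    """Give every column (sample) the same empirical distribution.

    The reference distribution is the row-wise mean of the sorted columns;
    each value is then replaced by the reference value at its rank.
    Ties are broken by position (a simplification).
    """
    n_rows, n_cols = len(band_matrix), len(band_matrix[0])
    columns = [[band_matrix[r][c] for r in range(n_rows)]
               for c in range(n_cols)]
    sorted_cols = [sorted(col) for col in columns]
    reference = [sum(col[r] for col in sorted_cols) / n_cols
                 for r in range(n_rows)]
    out = [[0.0] * n_cols for _ in range(n_rows)]
    for c, col in enumerate(columns):
        order = sorted(range(n_rows), key=lambda r: col[r])
        for rank, r in enumerate(order):
            out[r][c] = reference[rank]
    return out


# Four band positions (rows) measured in three samples (columns).
toy_band = [[5, 4, 3], [2, 1, 4], [3, 4, 6], [4, 2, 8]]
qn = quantile_normalize(toy_band)
```

After this step every column contains the same set of values, which is why the marginal distributions are stabilized by construction.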

Following quantile normalization we apply ComBat (Johnson, Li, and Rabinovic, 2007) to each band matrix separately. We apply the parametric prior described in Johnson, Li, and Rabinovic (2007). Prior to applying ComBat, we filter out matrix cells for which the intra-batch variance is zero for all batches. After applying ComBat we set filtered matrix cells to zero.
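The filtering around the batch correction step can be sketched as below. ComBat itself is not implemented here; `correct` is a hypothetical placeholder for any function that takes the retained rows and the batch labels and returns corrected rows. The function names are likewise illustrative, and zero variance within a batch is detected by checking that all values in the batch are identical.

```python
def zero_variance_rows(band_matrix, batches):
    """Indices of rows whose values are constant within every batch."""
    levels = sorted(set(batches))
    flagged = []
    for r, row in enumerate(band_matrix):
        constant_in_every_batch = True
        for level in levels:
            values = [x for x, b in zip(row, batches) if b == level]
            if max(values) != min(values):
                constant_in_every_batch = False
                break
        if constant_in_every_batch:
            flagged.append(r)
    return flagged


def correct_with_filter(band_matrix, batches, correct):
    """Apply `correct` to the retained rows; filtered rows are set to zero.

    `correct` is a stand-in for the batch correction step (hypothetical
    signature: correct(rows, batches) -> rows).
    """
    drop = set(zero_variance_rows(band_matrix, batches))
    kept = [row for r, row in enumerate(band_matrix) if r not in drop]
    corrected = iter(correct(kept, batches)) if kept else iter(())
    return [
        [0.0] * len(batches) if r in drop else list(next(corrected))
        for r in range(len(band_matrix))
    ]


toy_band = [[1, 1, 2, 2], [1, 2, 3, 4], [0, 0, 0, 0]]
labels = ["a", "a", "b", "b"]
result = correct_with_filter(toy_band, labels, lambda rows, b: rows)
```

Rows with no within-batch variation carry no information for estimating batch-specific variances, which is presumably why they are set aside before the correction.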

Our implementation of BNBC is available in the bnbc R package from the Bioconductor project (Gentleman et al., 2004; Huber et al., 2015) (https://www.bioconductor.org/packages/bnbc).

2.5.0.6 Explained variation and smoothed boxplot

To assess unwanted variation for each matrix cell in a contact matrix, we employ a linear mixed model approach. Specifically, we fit a mixed effect model regressing Hi-C contact strength on batch indicator, with a random effect at the subject level to capture the increased correlation between technical replicates. This model is fit using the R package varComp (Qu, Guennel, and Marshall, 2013) and $R^2$ for this model is calculated using the method of Edwards et al. (2008).

To display $R^2$ as a function of distance, we first compute a series of box plots of $R^2$, one for each band matrix. We extract the summary measures for the box plots (median, 1st and 3rd quartiles, and the two whisker positions at 1.5 times the inter-quartile range). We then display these 5 curves, with color fills. Medians are black, 1st and 3rd quartiles are pink and 1.5 times the inter-quartile range is blue.
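A sketch of how the five curves could be computed is given below. The function names and the input layout (a dictionary from band number to a list of $R^2$ values) are hypothetical. The quartile convention (`method="inclusive"`) and the placement of the whiskers at exactly 1.5 times the inter-quartile range beyond the quartiles are assumptions; plotting software often clips whiskers to the most extreme observed value instead.

```python
from statistics import quantiles


def box_summary(values):
    """Five box plot summaries for one band."""
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return {
        "lower": q1 - 1.5 * iqr,   # lower whisker position (assumed fence)
        "q1": q1,
        "median": median,
        "q3": q3,
        "upper": q3 + 1.5 * iqr,   # upper whisker position (assumed fence)
    }


def summary_curves(values_by_band):
    """Turn per-band summaries into five curves indexed by band number."""
    bands = sorted(values_by_band)
    stats = [box_summary(values_by_band[b]) for b in bands]
    keys = ("lower", "q1", "median", "q3", "upper")
    return bands, {k: [s[k] for s in stats] for k in keys}


toy = {2: [1, 2, 3, 4, 5], 3: [2, 4, 6, 8, 10]}
bands, curves = summary_curves(toy)
```

Each of the five lists in `curves` can then be drawn against `bands`, with the regions between them filled.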

2.5.0.7 A/B compartments from smoothed contact matrices

A/B compartments were originally proposed to be estimated using the first eigenvector of a suitable transform of the contact matrix (Lieberman-Aiden et al., 2009). Specifically, the contact matrix was transformed using the observed-expected transformation, where each matrix band was divided by its mean. Our contact matrices following application of the log counts per million transform and smoothing are on the log scale. To get A/B compartments from the output of BNBC (Supplementary Figure S6), we exponentiate every entry in the matrix, multiply by $10^6$, apply the observed-expected transformation and compute the first eigenvector. Finally, we standardize the first eigenvectors to be in $(-1, 1)$ and then smooth the standardized eigenvectors using a moving average as done by Fortin and Hansen (2015).
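The observed-expected transformation alone can be sketched as follows. The function name is hypothetical, the input is assumed to be a symmetric matrix already on the original (non-log) scale, and bands with zero mean are set to zero, which is an assumption. The eigenvector step is omitted; it can be carried out with any symmetric eigensolver.

```python
def observed_expected(matrix):
    """Divide every matrix band by its mean.

    The input is assumed symmetric, so each band is read from the upper
    triangle and mirrored into the lower triangle.
    """
    n = len(matrix)
    out = [[0.0] * n for _ in range(n)]
    for d in range(n):                      # d = 0 is the main diagonal
        values = [matrix[j][j + d] for j in range(n - d)]
        mean = sum(values) / len(values)
        for j in range(n - d):
            # Zero-mean bands are set to 0 (assumption).
            ratio = matrix[j][j + d] / mean if mean != 0 else 0.0
            out[j][j + d] = ratio
            out[j + d][j] = ratio
    return out


toy = [[4, 2, 1], [2, 8, 6], [1, 6, 12]]
oe = observed_expected(toy)
```

Dividing by the band mean removes the distance-dependent decay, so the remaining structure reflects enrichment or depletion relative to other contacts at the same distance.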

2.5.0.8 Acknowledgements

Funding: Research reported in this publication was supported by National Institute of Diabetes and Digestive and Kidney Diseases and the National Cancer Institute of the National Institutes of Health under award numbers 54DK107977 and U24CA180996. KFB was supported by the Maryland Genetics, Epidemiology and Medicine (MD-GEM) program. DUG was supported by funding from the A.P. Giannini Foundation and the San Diego Institutional Research and Academic Career Development Award (IRACDA) program.

Disclaimer: The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

Conflict of Interest: None declared.

2.6 Tables

Table 2.1: Sample Information

Sample    Replicate  Ethnicity  Sex  Family  Role    Batch  Library preparation date
GM19238   1          YRI        F    1       Mother  1      9/26/14
GM19238   2          YRI        F    1       Mother  1      9/26/14
GM19239   2          YRI        M    1       Father  1      9/26/14
HG00512   1          CHS        M    2       Father  2      3/4/15
HG00512   2          CHS        M    2       Father  3      5/28/15
HG00513   1          CHS        F    2       Mother  2      3/4/15
HG00513   2          CHS        F    2       Mother  3      5/28/15
HG00514   1          CHS        F    2       Child   2      3/4/15
HG00514   2          CHS        F    2       Child   3      5/28/15
HG00731   1          PUR        M    3       Father  2      3/4/15
HG00731   2          PUR        M    3       Father  3      5/28/15
HG00732   1          PUR        F    3       Mother  2      3/4/15
HG00732   2          PUR        F    3       Mother  3      5/28/15
HG00733   1          PUR        F    3       Child   2      3/4/15
HG00733   2          PUR        F    3       Child   3      5/28/15

2.7 Figures

[Figure 2.1 image not reproduced. Panels (a)-(e); samples are colored by batch (Batch 1, Batch 2, Batch 3). Panel (a): log(Contact) against band number (50, 125, 200) and genomic distance (2 Mb, 5 Mb, 8 Mb). Panel (b): band transformation for samples A, B and C. Panels (c)-(e): log(Contacts) per sample.]

Figure 2.1: Unwanted variation in Hi-C data. We display Hi-C data from chromosome 14 from 8 different individuals, 7 of which have 2 technical replicates, processed in 3 batches. Each sample is normalized using HiCNorm followed by spatial smoothing using HiCRep; data are on a logarithmic scale. (a) Mean contact as a function of distance. Each sample is a separate curve. (b) Band transformation of a collection of Hi-C contact maps. (c)-(e) Boxplots of the marginal distribution of contacts across samples, for loci separated by (c) 40 kb (band 2), (d) 2 Mb (band 50) and (e) 8 Mb (band 200).

[Figure 2.2 image not reproduced. Panels (a) and (c): $R^2$ against band number (50, 125, 200) and distance (2 Mb, 5 Mb, 8 Mb). Panels (b) and (d): correlation of PC1-PC4 with the batch factor against the same axes. Panel (e): $R^2$ (BNBC) against $R^2$ (HiCNorm).]

Figure 2.2: Substantial unwanted variation in Hi-C data. (a) The percentage of variance explained ($R^2$) by the batch factor for the HiCNorm processed data, as a function of distance (Methods). The distributions are displayed as a series of smoothed boxplots (black: median, pink: 1st and 3rd quartiles, blue: 1.5 times inter-quartile range, see Methods). (b) The Spearman correlation of the 1st-4th principal components of each band matrix with the batch factor, as a function of distance, for the HiCNorm processed data. Other permutations of the batch factor are shown in Supplementary Figure S1. (c) Like (a) but for data processed using BNBC. (d) Like (b) but for data processed using BNBC. (e) A scatterplot of $R^2$ for data processed using BNBC vs. data processed using HiCNorm, for entries in the contact map separated by less than 10 Mb (band 250).

Supplementary Materials

for

Distance-dependent between-sample normalization for Hi-C experiments

Kipper Fletez-Brant, Yunjiang Qiu, David U. Gorkin, Ming Hu, Kasper D. Hansen

Contains Supplementary Figures S1-S7.

2.8 Supplementary Figures

[Supplementary Figure S1 image not reproduced. Rows (a)-(e), one per method; columns by batch ordering (batch 1, 2, 3; batch 2, 1, 3; batch 2, 3, 1). Each panel: correlation of PC1-PC4 with the batch factor against band number (50, 125, 200) and distance (2 Mb, 5 Mb, 8 Mb).]

Supplementary Figure S1: The performance assessment of all methods using correlation of batch with principal components. As in Figure 2.2b we assess the influence of batch using Spearman correlation between the batch factor and the 1st-4th principal components of each band matrix, for various methods. Column 1 uses the ordering batch 1, batch 2, batch 3. Column 2 uses the ordering batch 2, batch 1, batch 3. Column 3 uses the ordering batch 2, batch 3, batch 1. (a) Unprocessed data. (b) HiCNorm. (c) BNBC. (d) BNBC using PEER with 1 hidden factor. (e) BNBC using PEER with 4 hidden factors. The first column and the first 3 rows reproduce Supplementary Figure S2f, Figure 2.2b, and Figure 2.2d.

[Supplementary Figure S2 image not reproduced. Panel (a): log(Contact) against band number and genomic distance, samples colored by batch. Panels (b)-(d): log(Contacts) per sample. Panel (e): $R^2$ against band number and distance. Panel (f): correlation of PC1-PC4 with the batch factor against band number and distance.]

Supplementary Figure S2: Unprocessed data. As Figures 2.1 and 2.2 but using data prior to normalization by HiCNorm and smoothing by HiCRep. Data has been corrected for library size using the log counts per million transformation. (a) Mean contact as a function of distance. Each sample is a separate curve. (b)-(d) Boxplots of the marginal distribution of contacts across samples, for loci separated by (b) 40 kb (band 2), (c) 2 Mb (band 50) and (d) 8 Mb (band 200). (e)-(f) As Figure 2.2a,b. The correlations with the first four principal components are jagged due to the lack of smoothing.

[Supplementary Figure S3 image not reproduced. Panel (a): $R^2$ (Unprocessed, bw 5) against $R^2$ (Unprocessed, bw 1). Panel (b): $R^2$ (HiCNorm, bw 5) against $R^2$ (HiCNorm, bw 1). Panel (c): $R^2$ (HiCNorm, bw 1) against $R^2$ (Unprocessed, bw 1). Panel (d): $R^2$ (HiCNorm, bw 5) against $R^2$ (Unprocessed, bw 5). Panel (e): $R^2$ (BNBC, bw 5) against $R^2$ (BNBC, bw 1).]

Supplementary Figure S3: Use of HiCNorm and choice of width impact unwanted variation. Pairwise scatterplots of explained variation by batch ($R^2$), comparing various methods. As Figure 2.2e. (a) Unprocessed data, smoothed with a bandwidth of 1 and 5. (b) HiCNorm data, smoothed with a bandwidth of 1 and 5. (c) HiCNorm data vs. Unprocessed data, both smoothed with a bandwidth of 1. (d) HiCNorm data vs. Unprocessed data, both smoothed with a bandwidth of 5. (e) Data processed using BNBC, smoothed with a bandwidth of 1 and 5.

[Supplementary Figure S4 image not reproduced. Panels (a)-(d): $R^2$ (BNBC) against $R^2$ (HiCNorm).]

Supplementary Figure S4: The performance of BNBC by distance. We show a comparison between $R^2$ for data processed using HiCNorm and BNBC. (a) Loci separated by 10 Mb or less (bands 2-251) (Figure 2.2e reproduced). (b) Loci separated by 0-2 Mb (bands 2-51). (c) Loci separated by 2-6 Mb (bands 52-151). (d) Loci separated by 6-10 Mb (bands 152-251).

[Supplementary Figure S5 image not reproduced. Panel (a): log(Contacts) against band number (50, 125, 200) and genomic distance (2 Mb, 5 Mb, 8 Mb), samples colored by batch. Panels (b)-(d): log(Contacts) per sample.]

Supplementary Figure S5: Marginal distributions after BNBC. As Figure 2.1 but for data processed using BNBC. (a) Mean contact as a function of distance. Each sample is a separate curve. (b)-(d) Boxplots of the marginal distribution of contacts across samples, for loci separated by (b) 40 kb (band 2), (c) 2 Mb (band 50) and (d) 8 Mb (band 200).

[Supplementary Figure S6 image not reproduced. Panels (a)-(d), each showing Sample A and Sample B; horizontal axis: genomic position (10 Mb, 40 Mb, 70 Mb); PC1 ranges from -1 to 1.]

Supplementary Figure S6: BNBC preserves structural features of Hi-C contact maps. Data from two different biological replicates (samples A, B) on chromosome 14. (a) Contact maps for data processed using HiCNorm. (b) First eigenvector of the contact maps in (a); this is used to estimate A/B compartments. (c) Contact maps for data processed using BNBC. (d) First eigenvector of the contact maps in (c).

[Supplementary Figure S7 image not reproduced. Panels (a)-(e): $R^2$ against band number (50, 125, 200) and distance (2 Mb, 5 Mb, 8 Mb).]

Supplementary Figure S7: The performance assessment of all methods using $R^2$. As in Figure 2.2a we assess the influence of batch using the percent variation explained by the batch factor ($R^2$), as a function of distance, for various methods. (a) Unprocessed data (Supplementary Figure S2e reproduced). (b) HiCNorm (Figure 2.2a reproduced). (c) BNBC (Figure 2.2c reproduced). (d) BNBC using PEER with 1 hidden factor. (e) BNBC using PEER with 4 hidden factors.

References

Bell, Jordana T, Athma A Pai, Joseph K Pickrell, Daniel J Gaffney, Roger Pique-Regi, Jacob F Degner, Yoav Gilad, and Jonathan K Pritchard (2011). “DNA methylation patterns associate with genetic and gene expression variation in HapMap cell lines”. In: Genome Biology 12, R10. DOI: 10.1186/gb-2011-12-1-r10.

Bolstad, B M, R A Irizarry, M Astrand, and T P Speed (2003). “A comparison of normalization methods for high density oligonucleotide array data based on variance and bias”. In: Bioinformatics 19, pp. 185–193. DOI: 10.1093/bioinformatics/19.2.185.

Chaisson, Mark J P, Ashley D Sanders, Xuefang Zhao, Ankit Malhotra, David Porubsky, Tobias Rausch, Eugene J Gardner, Oscar Rodriguez, Li Guo, Ryan L Collins, Xian Fan, Jia Wen, Robert E Handsaker, Susan Fairley, Zev N Kronenberg, Xiangmeng Kong, Fereydoun Hormozdiari, Dillon Lee, Aaron M Wenger, Alex Hastie, Danny Antaki, Peter Audano, Harrison Brand, Stuart Cantsilieris, Han Cao, Eliza Cerveira, Chong Chen, Xintong Chen, Chen-Shan Chin, Zechen Chong, Nelson T Chuang, Deanna M Church, Laura Clarke, Andrew Farrell, Joey Flores, Timur Galeev, David Gorkin, Madhusudan Gujral, Victor Guryev, William Haynes-Heaton, Jonas Korlach, Sushant Kumar, Jee Young Kwon, Jong Eun Lee, Joyce Lee, Wan-Ping Lee, Sau Peng Lee, Patrick Marks, Karine Valud-Martinez, Sascha Meiers, Katherine M Munson, Fabio Navarro, Bradley J Nelson, Conor Nodzak, Amina Noor, Sofia Kyriazopoulou-Panagiotopoulou, Andy Pang, Yunjiang Qiu, Gabriel Rosanio, Mallory Ryan, Adrian Stutz, Diana C J Spierings, Alistair Ward, Annemarie E Welsch, Ming Xiao, Wei Xu, Chengsheng Zhang, Qihui Zhu, Xiangqun Zheng-Bradley, Goo Jun, Li Ding, Chong Lek Koh, Bing Ren, Paul Flicek, Ken Chen, Mark B Gerstein, Pui-Yan Kwok, Peter M Lansdorp, Gabor Marth, Jonathan Sebat, Xinghua Shi, Ali Bashir, Kai Ye, Scott E Devine, Michael Talkowski, Ryan E Mills, Tobias Marschall, Jan Korbel, Evan E Eichler, and Charles Lee (2017). “Multi-platform discovery of haplotype-resolved structural variation in human genomes”. In: bioRxiv, p. 193144. DOI: 10.1101/193144.

Choy, Edwin, Roman Yelensky, Sasha Bonakdar, Robert M Plenge, Richa Saxena, Philip L De Jager, Stanley Y Shaw, Cara S Wolfish, Jacqueline M Slavik, Chris Cotsapas, Manuel Rivas, Emmanouil T Dermitzakis, Ellen Cahir-McFarland, Elliott Kieff, David Hafler, Mark J Daly, and David Altshuler (2008). “Genetic analysis of human traits in vitro: drug response and gene expression in lymphoblastoid cell lines”. In: PLOS Genetics 4, e1000287. DOI: 10.1371/journal.pgen.1000287.

Davies, James O J, A Marieke Oudelaar, Douglas R Higgs, and Jim R Hughes (2017). “How best to identify chromosomal interactions: a comparison of approaches”. In: Nature Methods 14, pp. 125–134. DOI: 10.1038/nmeth.4146.

Degner, Jacob F, Athma A Pai, Roger Pique-Regi, Jean-Baptiste Veyrieras, Daniel J Gaffney, Joseph K Pickrell, Sherryl De Leon, Katelyn Michelini, Noah Lewellen, Gregory E Crawford, Matthew Stephens, Yoav Gilad, and Jonathan K Pritchard (2012). “DNase I sensitivity QTLs are a major determinant of human expression variation”. In: Nature 482, pp. 390–394. DOI: 10.1038/nature10808.

Dekker, Job, Marc A Marti-Renom, and Leonid A Mirny (2013). “Exploring the three-dimensional organization of genomes: interpreting chromatin interaction data”. In: Nature Reviews Genetics 14, pp. 390–403. DOI: 10.1038/nrg3454.

Edwards, Lloyd J, Keith E Muller, Russell D Wolfinger, Bahjat F Qaqish, and Oliver Schabenberger (2008). “An $R^2$ statistic for fixed effects in the linear mixed model”. In: Statistics in Medicine 27, pp. 6137–6157. DOI: 10.1002/sim.3429.

Fletez-Brant, Kipper, Yunjiang Qiu, David U Gorkin, Ming Hu, and Kasper D Hansen (2017). “Removing unwanted variation between samples in Hi-C experiments”. URL: https://www.biorxiv.org/content/early/2017/11/06/214361.

Fortin, Jean-Philippe and Kasper D Hansen (2015). “Reconstructing A/B compartments as revealed by Hi-C using long-range correlations in epigenetic data”. In: Genome Biology 16, p. 180. DOI: 10.1186/s13059-015-0741-y.

Fortin, Jean-Philippe, Aurélie Labbe, Mathieu Lemire, Brent W Zanke, Thomas J Hudson, Elana J Fertig, Celia Mt Greenwood, and Kasper D Hansen (2014). “Functional normalization of 450k methylation array data improves replication in large cancer studies”. In: Genome Biology 15, p. 503. DOI: 10.1186/s13059-014-0503-2.

Gagnon-Bartsch, Johann A and Terence P Speed (2012). “Using control genes to correct for unwanted variation in microarray data”. In: Biostatistics 13, pp. 539–552. DOI: 10.1093/biostatistics/kxr034.

Gentleman, Robert C, Vincent J Carey, Douglas M Bates, Ben Bolstad, Marcel Dettling, Sandrine Dudoit, Byron Ellis, Laurent Gautier, Yongchao Ge, Jeff Gentry, Kurt Hornik, Torsten Hothorn, Wolfgang Huber, Stefano Iacus, Rafael Irizarry, Friedrich Leisch, Cheng Li, Martin Maechler, Anthony J Rossini, Gunther Sawitzki, Colin Smith, Gordon Smyth, Luke Tierney, Jean Y H Yang, and Jianhua Zhang (2004). “Bioconductor: open software development for computational biology and bioinformatics”. In: Genome Biology 5, R80. DOI: 10.1186/gb-2004-5-10-r80.

Hu, Ming, Ke Deng, Siddarth Selvaraj, Zhaohui Qin, Bing Ren, and Jun S Liu (2012). “HiCNorm: removing biases in Hi-C data via Poisson regression”. In: Bioinformatics 28, pp. 3131–3133. DOI: 10.1093/bioinformatics/bts570.

Huber, Wolfgang, Vincent J Carey, Robert Gentleman, Simon Anders, Marc Carlson, Benilton S Carvalho, Hector Corrada Bravo, Sean Davis, Laurent Gatto, Thomas Girke, Raphael Gottardo, Florian Hahne, Kasper D Hansen, Rafael A Irizarry, Michael Lawrence, Michael I Love, James MacDonald, Valerie Obenchain, Andrzej K Oleś, Hervé Pagès, Alejandro Reyes, Paul Shannon, Gordon K Smyth, Dan Tenenbaum, Levi Waldron, and Martin Morgan (2015). “Orchestrating high-throughput genomic analysis with Bioconductor”. In: Nature Methods 12, pp. 115–121. DOI: 10.1038/nmeth.3252.

Imakaev, Maxim, Geoffrey Fudenberg, Rachel Patton McCord, Natalia Naumova, Anton Goloborodko, Bryan R Lajoie, Job Dekker, and Leonid A Mirny (2012). “Iterative correction of Hi-C data reveals hallmarks of chromosome organization”. In: Nature Methods 9, pp. 999–1003. DOI: 10.1038/nmeth.2148.

International HapMap Consortium (2003). “The International HapMap Project”. In: Nature 426, pp. 789–796. DOI: 10.1038/nature02168.

Johnson, W Evan, Cheng Li, and Ariel Rabinovic (2007). “Adjusting batch effects in microarray expression data using empirical Bayes methods”. In: Biostatistics 8, pp. 118–127. DOI: 10.1093/biostatistics/kxj037.

Kasowski, Maya, Sofia Kyriazopoulou-Panagiotopoulou, Fabian Grubert, Judith B Zaugg, Anshul Kundaje, Yuling Liu, Alan P Boyle, Qiangfeng Cliff Zhang, Fouad Zakharia, Damek V Spacek, Jingjing Li, Dan Xie, Anthony Olarerin-George, Lars M Steinmetz, John B Hogenesch, Manolis Kellis, , and Michael Snyder (2013). “Extensive variation in chromatin states across humans”. In: Science 342, pp. 750–752. DOI: 10.1126/science.1242510.

Kilpinen, Helena, Sebastian M Waszak, Andreas R Gschwind, Sunil K Raghav, Robert M Witwicki, Andrea Orioli, Eugenia Migliavacca, Michaël Wiederkehr, Maria Gutierrez-Arcelus, Nikolaos I Panousis, Alisa Yurovsky, Tuuli Lappalainen, Luciana Romano-Palumbo, Alexandra Planchon, Deborah Bielser, Julien Bryois, Ismael Padioleau, Gilles Udin, Sarah Thurnheer, David Hacker, Leighton J Core, John T Lis, Nouria Hernandez, Alexandre Reymond, Bart Deplancke, and Emmanouil T Dermitzakis (2013). “Coordinated effects of sequence variation on DNA binding, chromatin structure, and transcription”. In: Science 342, pp. 744–747. DOI: 10.1126/science.1242463.

Knight, Philip A and Daniel Ruiz (2013). “A fast algorithm for matrix balancing”. In: IMA Journal of Numerical Analysis 33, pp. 1029–1047. DOI: 10.1093/imanum/drs019.

Law, Charity W, Yunshun Chen, Wei Shi, and Gordon K Smyth (2014). “voom: Precision weights unlock linear model analysis tools for RNA-seq read counts”. In: Genome Biology 15, R29. DOI: 10.1186/gb-2014-15-2-r29.

Leek, Jeffrey T (2014). “svaseq: removing batch effects and other unwanted noise from sequencing data”. In: Nucleic Acids Research 42, gku864. DOI: 10.1093/nar/gku864.

Leek, Jeffrey T and John D Storey (2007). “Capturing heterogeneity in gene expression studies by surrogate variable analysis”. In: PLOS Genetics 3, pp. 1724–1735. DOI: 10.1371/journal.pgen.0030161.

Leek, Jeffrey T and John D Storey (2008). “A general framework for multiple testing dependence”. In: PNAS 105, pp. 18718–18723. DOI: 10.1073/pnas.0808709105.

Leek, Jeffrey T, Robert B Scharpf, Héctor Corrada Bravo, David Simcha, Benjamin Langmead, W Evan Johnson, Donald Geman, Keith Baggerly, and Rafael A Irizarry (2010). “Tackling the widespread and critical impact of batch effects in high-throughput data”. In: Nature Reviews Genetics 11, pp. 733–739. DOI: 10.1038/nrg2825.

Li, Heng (2013). “Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM”. In: arXiv, p. 1303.3997. URL: http://arxiv.org/abs/1303.3997.

43 Lieberman-Aiden, Erez, Nynke L van Berkum, Louise Williams, Maxim Imakaev, Tobias Ragoczy, Agnes Telling, Ido Amit, Bryan R Lajoie, Peter J Sabo, Michael O Dorschner, Richard Sandstrom, Bradley Bernstein, M A Bender, Mark Groudine, Andreas Gnirke, John Stamatoyannopoulos, Leonid A Mirny, Eric S Lander, and Job Dekker (2009). “Comprehensive mapping of long-range interactions reveals folding principles of the human genome”. In: Science 326, pp. 289–293. DOI: 10.1126/science.1181369. Lun, Aaron T L and Gordon K Smyth (2015). “diffHic: a Bioconductor pack- age to detect differential genomic interactions in Hi-C data”. In: BMC Bioinformatics 16, p. 258. DOI: 10.1186/s12859-015-0683-0. McVicker, Graham, Bryce van de Geijn, Jacob F Degner, Carolyn E Cain, Nicholas E Banovich, Anil Raj, Noah Lewellen, Marsha Myrthil, Yoav Gilad, and Jonathan K Pritchard (2013). “Identification of genetic variants that affect histone modifications in human cells”. In: Science 342, pp. 747– 749. DOI: 10.1126/science.1242429. Montgomery, Stephen B, Micha Sammeth, Maria Gutierrez-Arcelus, Radoslaw P Lach, Catherine Ingle, James Nisbett, Roderic Guigo, and Emmanouil T Dermitzakis (2010). “Transcriptome genetics using second generation sequencing in a Caucasian population”. In: Nature 464, pp. 773–777. DOI: 10.1038/nature08903. Pau, Grégoire, Florian Fuchs, Oleg Sklyar, Michael Boutros, and Wolfgang Huber (2010). “EBImage – an R package for image processing with ap- plications to cellular phenotypes”. In: Bioinformatics 26, pp. 979–981. DOI: 10.1093/bioinformatics/btq046. Pickrell, Joseph K, John C Marioni, Athma A Pai, Jacob F Degner, Barbara E Engelhardt, Everlyne Nkadori, Jean-Baptiste Veyrieras, Matthew Stephens, Yoav Gilad, and Jonathan K Pritchard (2010). “Understanding mechanisms underlying human gene expression variation with RNA sequencing”. In: Nature 464, pp. 768–772. DOI: 10.1038/nature08872. Qu, Long, Tobias Guennel, and Scott L Marshall (2013). 
“Linear score tests for variance components in linear mixed models and applications to genetic association studies”. In: Biometrics 69, pp. 883–892. DOI: 10.1111/biom. 12095. Rao, Suhas S P, Miriam H Huntley, Neva C Durand, Elena K Stamenova, Ivan D Bochkov, James T Robinson, Adrian L Sanborn, Ido Machol, Arina D Omer, Eric S Lander, and Erez Lieberman Aiden (2014). “A 3D map of the human genome at kilobase resolution reveals principles of chromatin looping”. In: Cell 159, pp. 1665–1680. DOI: 10.1016/j.cell.2014.11.021.

44 Risso, Davide, John Ngai, Terence P Speed, and Sandrine Dudoit (2014). “Nor- malization of RNA-seq data using factor analysis of control genes or sam- ples”. In: Nature Biotechnology 32, pp. 896–902. DOI: 10.1038/nbt.2931. Schmitt, Anthony D, Ming Hu, and Bing Ren (2016). “Genome-wide mapping and analysis of chromosome architecture”. In: Nat. Rev. Mol. Cell Biol. 17, pp. 743–755. DOI: 10.1038/nrm.2016.104. Stansfield, John and Mikhail G Dozmorov (2017). “HiCdiff: A method for joint normalization of Hi-C datasets and differential chromatin interaction detection”. In: bioRxiv, p. 147850. DOI: 10.1101/147850. Stark, Amy L, Wei Zhang, Tong Zhou, Peter H O’Donnell, Christine M Beiswanger, R Stephanie Huang, Nancy J Cox, and M Eileen Dolan (2010). “Popu- lation differences in the rate of proliferation of international HapMap cell lines”. In: American Journal of Human Genetics 87, pp. 829–833. DOI: 10.1016/j.ajhg.2010.10.018. Stegle, Oliver, Leopold Parts, Richard Durbin, and John Winn (2010). “A Bayesian framework to account for complex non-genetic factors in gene expression levels greatly increases power in eQTL studies”. In: PLoS Com- putational Biology 6, e1000770. DOI: 10.1371/journal.pcbi.1000770. Stranger, Barbara E, Alexandra C Nica, Matthew S Forrest, Antigone Di- mas, Christine P Bird, Claude Beazley, Catherine E Ingle, Mark Dunning, Paul Flicek, Daphne Koller, Stephen Montgomery, Simon Tavaré, Panos Deloukas, and Emmanouil T Dermitzakis (2007). “Population genomics of human gene expression”. In: Nature Genetics 39, pp. 1217–1224. DOI: 10.1038/ng2142. Ursu, Oana, Nathan Boley, Maryna Taranova, Y X Rachel Wang, Galip Gurkan Yardimci, William Stafford Noble, and Anshul Kundaje (2017). “GenomeDISCO: A concordance score for chromosome conformation capture experiments using random walks on contact map graphs”. In: bioRxiv, p. 181842. DOI: 10.1101/181842. Vernimmen, Douglas and Wendy A Bickmore (2015). 
“The Hierarchy of Tran- scriptional Activation: From Enhancer to Promoter”. In: Trends in Genetics 31, pp. 696–708. DOI: 10.1016/j.tig.2015.10.004. Vidal, Enrique, Francois le Dily, Javier Quilez, Ralph Stadhouders, Yasmina Cuartero, Thomas Graf, Marc A Marti-Renom, Miguel Beato, and Guil- laume Filion (2017). “OneD: increasing reproducibility of Hi-C Samples with abnormal karyotypes”. In: bioRxiv, p. 148254. DOI: 10.1101/148254.

45 Wit, Elzo de and Wouter de Laat (2012). “A decade of 3C technologies: insights into nuclear organization”. In: Genes & Development 26, pp. 11–24. DOI: 10.1101/gad.179804.111. Yaffe, Eitan and Amos Tanay (2011). “Probabilistic modeling of Hi-C contact maps eliminates systematic biases to characterize global chromosomal architecture”. In: Nature Genetics 43, pp. 1059–1065. DOI: 10.1038/ng.947. Yan, Koon-Kiu, Galip Gürkan Yardimci, Chengfei Yan, William S Noble, and Mark Gerstein (2017). “HiC-Spector: A matrix library for spectral and re- producibility analysis of Hi-C contact maps”. In: Bioinformatics 33, pp. 2199– 2201. DOI: 10.1093/bioinformatics/btx152. Yang, Tao, Feipeng Zhang, Galip Gurkan Yardimci, Fan Song, Ross C Hardison, William Stafford Noble, Feng Yue, and Qunhua Li (2017). “HiCRep: assess- ing the reproducibility of Hi-C data using a stratum- adjusted correlation coefficient”. In: Genome Research. DOI: 10.1101/gr.220640.117. Yang, Yee Hwa, Sandrine Dudoit, Percy Luu, David M Lin, Vivian Peng, John Ngai, and Terence P Speed (2002). “Normalization for cDNA microarray data: a robust composite method addressing single and multiple slide systematic variation”. In: Nucleic Acids Research 30, e15.

Chapter 3

Unsupervised feature learning in Bayesian high dimensional inference

This chapter is adapted from a paper currently under submission. In addition to myself, the other authors on this paper are Genevieve Stein-O’Brien, Tom Sherman and Elana Fertig. I am responsible for the development of the methods presented in this chapter, under the predoctoral supervision of Fertig, with analyses characterizing these methods contributed by both myself and Stein-O’Brien. The productionization of the methods in this paper was completed by Sherman.

3.1 Abstract

Decoding messages sent over noisy channels requires a good probability model of the kinds of messages being sent, a problem rooted in information theory. Bayesian matrix factorization (BMF) algorithms address both parts of this problem by simultaneously solving for global correlation structure with locally optimized parts-based approaches. By estimating posterior distributions of factor matrices, BMFs are an effective way to identify latent structure in high-dimensional data, with notable applications in computational biology. The resulting factors capture dependent patterns of data usage, i.e. words or genes that are used together. However, the same dependent nature of these "patterns" that makes the overall solution more robust also makes it difficult to interpret the significance of individual components of each pattern. Here, we build a statistic, called the Probability of Unique Membership in Patterns (PUMP), that uses the posterior distribution of BMFs to group features into high and low probability of relevance. We compare our method to entropy-based approaches and show that it reveals novel information. We apply our method to an example from computational biology and demonstrate that it discovers biologically relevant structure.

3.2 Introduction

Matrix factorization techniques decompose a data matrix $D \in \mathbb{R}^{p \times n}$ into the product $AP$ of factor matrices $A \in \mathbb{R}^{p \times k}$ and $P \in \mathbb{R}^{k \times n}$. In many settings $p \gg n$: for example, gene expression data often comprise >20,000 genes observed on possibly hundreds of samples. In this setting matrix factorization is a useful tool for identifying underlying relationships between either features or samples, as often the number of patterns $k \ll p$ and so the $k$ patterns can be seen to group data elements together. There are a variety of methods for estimating factor matrices: singular value decomposition (SVD) is a common algorithm that finds three matrices $U \Sigma V^{T}$ with orthogonal matrices $U$, $V$ and a diagonal matrix of singular values $\Sigma$, while ICA (Hyvärinen and Oja, 2000) finds statistically independent factors. Bayesian methods are most relevant to machine learning as they naturally lend themselves to a learning process: prior distributions representing a rational hypothesis are updated via the likelihood generated from newly observed data to form a posterior distribution from which inference can be made. These methods can be implemented using different approaches such as variational inference (Paisley, Blei, and Jordan, 2015; Nakajima, Sugiyama, and Babacan, 2011; Ma et al., 2015) or Markov chain Monte Carlo (Xie, Zhou, and Xu, 2017; Salakhutdinov and Mnih, 2007; Ahn et al., 2015) to find factor matrices which are most probable given the observed data (Stein-O’Brien et al., 2018b).
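As a minimal illustration of the factorization setup described above, the sketch below builds a toy rank-k matrix with features in rows and samples in columns, then recovers a p x k and a k x n factor from a truncated SVD. The sizes, the gamma-distributed toy data and the variable names are assumptions made for illustration; this is not the Bayesian algorithm used later in the chapter.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n, k = 200, 12, 3  # hypothetical sizes: p features, n samples, k patterns

# Toy non-negative data of exact rank k (illustrative, not thesis data).
A_true = rng.gamma(shape=2.0, scale=1.0, size=(p, k))
P_true = rng.gamma(shape=2.0, scale=1.0, size=(k, n))
D = A_true @ P_true  # p x n data matrix

# Thin SVD: D = U @ diag(s) @ Vt.
U, s, Vt = np.linalg.svd(D, full_matrices=False)

# Keep the k leading singular triplets to obtain two factor matrices.
A_svd = U[:, :k] * s[:k]  # p x k feature-by-factor matrix
P_svd = Vt[:k, :]         # k x n factor-by-sample matrix

# Relative reconstruction error of the rank-k approximation.
rel_err = np.linalg.norm(D - A_svd @ P_svd) / np.linalg.norm(D)
```

Because the toy matrix has rank k by construction, the rank-k SVD reproduces it up to floating-point error. Unlike the non-negative Bayesian factors discussed in this chapter, the SVD factors are orthogonal and may contain negative entries.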

Computational biology and genomics in particular have successfully used matrix factorization techniques to give insight into the structure of data (Engelhardt and Stephens, 2010; Yang and Michailidis, 2016; Wang, Zheng, and Zhao, 2015; Jiang, Hu, and Xu, 2017; Li and Wang, 2017). A striking example: after performing PCA on data derived from genotypes of individuals of European ancestry, the patterns in the $k \times n$ factor-by-sample matrix can be used to cluster samples according to their common ancestry (Novembre et al., 2008). Similarly, one can cluster individual genetic variants using the $p \times k$ feature-by-factor matrix. While obtaining such groupings is informative, it is unclear how to ascertain which features are important for a given factor.

In the present work, we leverage the learning process inherent to Bayesian MCMC-based matrix factorization to address the question of assigning a measure of importance to the components of the data patterns. Our algorithm utilizes the posterior samples of the $p \times k$ feature-by-factor matrix $A$ to identify features which uniquely contribute to one of the $k$ patterns by assigning a probability of relevance for each pattern and feature combination. We refer to this probability as the Probability of Unique Membership in Patterns, or PUMP, and threshold features based on their PUMP values to find strongly relevant features for a given factor. We apply our method to gene expression data by analyzing a dataset of gene expression derived from head and neck cancer, estimating factor matrices using an algorithm (Ochs et al., 1999; Fertig et al., 2010) for Bayesian matrix factorization, and discover functionally relevant clusters of genes for several biologically distinct clusters of samples. We then compare the genes found by our method to those found by standard differential expression analysis. Finally, we compare the performance of PUMP to standard entropy measures from information theory and find that, while closely related, the two convey different properties of the data.

3.3 PUMP: Probability of Unique Membership in Patterns

In the Bayesian context, the matrices $A$, $P$ are estimated as the expectation $\hat{A}$, $\hat{P}$ of the posterior distribution $p(A, P \mid D)$. PUMP begins by associating a given feature $i$ with a pattern $F_j$ by finding the one-hot vector¹ $\beta^j$ which minimizes

\[
\phi(A_i) = \operatorname*{argmin}_{j \in \{1, \dots, k\}} \sqrt{(A_i - \beta^j)^{T} (A_i - \beta^j)},
\]

where $A_i$ is the vector of factor values for feature $i$. This simply searches over all vectors of length $k$ which are 0 in all positions except that indexed by $j$, which has the value 1, and returns the value $j$ which minimizes the L2 norm. This approach of minimizing the L2 norm between patterns and one-hot vectors was previously developed by Stein-O’Brien et al. (2017), but we restate it for clarity. We note also that this is equivalent to picking the element with the largest factor value (see Appendix).

¹ A ’one-hot’ vector is a vector $x$ in which element $x_i = 1$ and all other elements $x_j = 0$, $j \neq i$.

We use the established function $\phi(\cdot)$ as the basis for a novel algorithm. We first define an indicator function which assigns a feature $i$ to a factor $F_j$:

\[
\mathbb{1}\{i \in F_j\} =
\begin{cases}
1 & j = \phi(A_i) \\
0 & \text{else.}
\end{cases}
\]

This function indicates whether the value of $\phi(A_i)$ is $j$, and so makes a binary assignment of relevance. In expectation, however, this function is equal to the probability that feature $i$ is associated with factor $F_j$. We approximate this probability using the $m$ posterior samples of the factor matrix $A^{(d)}$, $d \in \{1, \dots, m\}$:

\[
P(i \in F_j) = E[\mathbb{1}\{i \in F_j\}] \approx \frac{1}{m} \sum_{d=1}^{m} \mathbb{1}^{(d)}\{i \in F_j\}.
\]

In other words, the posterior probability of feature i being associated with factor Fj is estimated by the fraction of posterior samples for which this indicator function evaluates to 1.
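The estimate above can be sketched in a few lines. The function name `pump`, the `(m, p, k)` array layout for the posterior samples and the toy numbers are assumptions made for illustration; this is not the CoGAPS implementation.

```python
import numpy as np

def pump(posterior_A):
    """Estimate P(i in F_j) from posterior samples of A.

    posterior_A: array of shape (m, p, k) holding m posterior draws of the
    p x k feature-by-factor matrix (an assumed layout for this sketch).
    Returns a p x k matrix whose (i, j) entry is the fraction of draws in
    which feature i is assigned to factor j.
    """
    m, p, k = posterior_A.shape
    one_hot = np.eye(k)  # row j is the one-hot vector beta^j
    # Squared L2 distance from every sampled row A_i^(d) to every beta^j.
    diff = posterior_A[:, :, None, :] - one_hot[None, None, :, :]
    dist2 = np.sum(diff ** 2, axis=-1)                  # shape (m, p, k)
    assigned = np.argmin(dist2, axis=-1)                # phi(A_i^(d)), shape (m, p)
    indicators = assigned[:, :, None] == np.arange(k)   # shape (m, p, k)
    return indicators.mean(axis=0)

# Toy posterior: m = 4 draws, p = 2 features, k = 2 factors.
samples = np.array([
    [[0.9, 0.1], [0.2, 0.7]],
    [[0.8, 0.3], [0.6, 0.5]],
    [[0.7, 0.2], [0.1, 0.9]],
    [[0.6, 0.4], [0.3, 0.8]],
])
scores = pump(samples)
```

In this toy example the first feature is assigned to the first factor in every draw, while the second feature is assigned to the second factor in three of the four draws. The square root is omitted because it does not change which one-hot vector is closest.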

3.4 Experimental Results

3.4.1 Algorithm choice

We chose to estimate factor matrices using an algorithm for Bayesian matrix factorization (Ochs et al., 1999; Fertig et al., 2010), which performs a non- negative matrix factorization and induces sparsity in the factor matrices by use of atomic domain priors (Sibisi and Skilling, 1997).

3.4.2 Data description

We evaluate our algorithm on two datasets. One is a simulated dataset of strongly grouped features, while the other is a gene expression dataset derived from a head and neck squamous cell carcinoma (HNSCC) experiment. In that experiment, Cetuximab-sensitive cell lines were exposed weekly for 11 weeks either to Cetuximab, until they developed resistance, or to PBS, a control. There are also 2 technical controls not exposed to either, one of each parental cell line, obtained prior to the time course sampling. Previous analyses (Stein-O’Brien et al., 2018a) have indicated that setting k = 5 describes the data well, and that one factor represents highly expressed genes which do not vary between experimental and control groups. This factor is not considered in downstream analyses. Expression data were quantile normalized and corrected for sequencing depth prior to matrix factorization to ensure comparability between samples.

3.4.2.1 PUMP findings for simulation

We first consider a simulated gene expression dataset in which we have created groupings of genes for which expression of a given gene is non-zero in only a subset of samples (Figure 3.1), thereby creating a known association of genes and patterns. We apply PUMP to this dataset, specifying k = 3 and recover perfect assignment of genes to patterns (Table 3.1).

3.4.3 PUMP findings for HNSCC

We next consider results derived from the HNSCC dataset. The four factors previously found to affect the dynamics over time of the cellular response to Cetuximab are shown in Figure 3.2, where the sample-specific values for each factor give insight into broad patterns of expression over time. Factor 1 mainly delineates the two technical controls from both the Cetuximab and time course control groups, while factor 2 identifies increasing changes of gene expression over time in the treatment group compared to the PBS control. Factor 3 describes an opposing trend, of decreasing gene expression over time relative to control, while factor 4 shows that there is also an overall decrease in expression-response over time for both Cetuximab- and PBS-exposed groups.

These factors have genes which PUMP identifies as being strongly associated with them, as seen in Figure 3.3. While there are some very clear differences in gene expression within a factor, PUMP is capable of identifying effects not assayable through, for example, traditional differential expression (DE) analysis. We performed standard DE analysis using the limma (Ritchie et al., 2015) method, looking for genes that are differentially expressed between one group and the other two (for example, the PBS group versus the group composed of both the Cetuximab samples and the 2 technical controls), for all three possible one-vs-all comparisons, and kept as a final list the set of all genes found to be differentially expressed in one or more of these comparisons. While there is substantial overlap between the PUMP and DE gene lists for factors 1 and 4, factors 2 and 3, the Cetuximab-specific increasing and decreasing factors, are a different story. In fact, about 21% of genes identified by PUMP for factor 2, and just under 50% of genes found by PUMP for factor 3, are missed by a DE analysis (Table 3.2).

In light of the discrepancy between DE and PUMP identified genes, it is worth considering what the PUMP approach is learning. As PUMP assigns features to one of k factors, we can view feature assignment as a categorical variable with k classes, and consider the entropy of assignment of one feature over the entirety of the posterior distribution. In Figure 3.4 we consider the entropy of all genes’ assignments in the HNSCC experiment and compare to their PUMP scores, and observe a strong relationship. As expected, genes for which PUMP makes the same assignment for all posterior samples (PUMP ≈ 1) have extremely low entropy. However, as PUMP decreases slightly (PUMP ≈ 0.95), entropy increases in proportion to the assignment of a gene to factors other than the one assigned by PUMP.
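The relationship between the two quantities can be sketched as follows, assuming that a gene's assignment distribution is its row of assignment frequencies across the k factors and that entropy is measured in nats (the text does not state the logarithm base). The function name and the toy rows are hypothetical.

```python
import numpy as np

def assignment_entropy(pump_scores):
    """Shannon entropy (in nats) of each feature's assignment distribution.

    pump_scores: p x k matrix whose rows are the fractions of posterior
    samples assigning a feature to each factor (rows sum to 1).
    """
    q = np.asarray(pump_scores, dtype=float)
    # Treat 0 * log(0) as 0 by taking the log only where q > 0.
    logq = np.zeros_like(q)
    np.log(q, out=logq, where=q > 0)
    return -(q * logq).sum(axis=1)

toy = np.array([
    [1.00, 0.00, 0.00, 0.00],              # same factor in every posterior sample
    [0.95, 0.05, 0.00, 0.00],              # remaining mass on one other factor
    [0.95, 0.05 / 3, 0.05 / 3, 0.05 / 3],  # remaining mass spread over three factors
])
ent = assignment_entropy(toy)
```

The second and third rows share the same largest assignment frequency of 0.95 but have different entropies, because entropy also depends on how the remaining 5% of assignments is spread over the other factors.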

In addition to not being readily assayable by DE analyses, the genes identified by PUMP are biologically relevant as well. We performed a GO term enrichment analysis (using the MSigDB resource (Liberzon et al., 2015)), looking for terms that appear in each factor’s set of genes. We assessed statistical significance of terms by computing a bootstrap null distribution, and call as significant all GO terms with a bootstrap p < 0.01.
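The text does not specify the test statistic or the resampling scheme, so the following is only one plausible reading of a resampling-based null for term enrichment. It assumes the statistic is the number of a factor's genes that belong to the term, and that the null is built from random gene sets of the same size drawn from a background list. The function name, arguments and toy gene identifiers are all hypothetical.

```python
import numpy as np

def resampling_p_value(factor_genes, term_genes, background,
                       n_resamples=1000, seed=0):
    """Empirical p-value for the overlap between a gene set and a GO term."""
    rng = np.random.default_rng(seed)
    term = set(term_genes)
    observed = sum(g in term for g in factor_genes)
    background = np.asarray(background)
    null = np.empty(n_resamples, dtype=int)
    for b in range(n_resamples):
        # Random gene set of the same size as the factor's gene set.
        draw = rng.choice(background, size=len(factor_genes), replace=False)
        null[b] = sum(g in term for g in draw.tolist())
    # Add-one correction keeps the estimate away from exactly zero.
    return (1 + np.sum(null >= observed)) / (1 + n_resamples)

# Toy example: 1000 background genes, a 50-gene term, and a 40-gene factor
# set that contains 20 of the term's genes.
background = [f"gene{i}" for i in range(1000)]
term_genes = background[:50]
factor_genes = background[:20] + background[500:520]
p_val = resampling_p_value(factor_genes, term_genes, background)
```

With 1000 resamples the smallest attainable p-value is 1/1001, so a threshold of 0.01 is resolvable. A finer threshold would need more resamples.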

We found 99 such significant GO terms (see Appendix), and give a sample of the results in Table 3.3. Factor 1 has a large number of genes associated with the two control samples, and a GO analysis suggests that this difference may be due to genes involved in cell migration. Factor 2 describes a rising trend over time of Cetuximab-specific gene expression, which seems to be related to tissue development. By contrast, factor 3 describes a decreasing trend over time of Cetuximab-specific gene expression, which seems to be related to the plasma membrane. Factor 4 describes a general decrease over time, which seems to describe the general stimulus response. Of note is the involvement of prostaglandin response, which has been previously observed in connection with HNSCC (Camacho et al., 2008).

3.5 Conclusion

We have demonstrated a method for identifying dominant features of a low dimensional representation of high dimensional data obtained by Bayesian matrix factorization. Our method, which is grounded in Bayesian theory and easily computable, uncovers relevant latent structure that standard differential expression analysis does not recover. Moreover, in contrast to many methods, our approach is unsupervised and does not require prior knowledge or manual labeling. We demonstrate the utility of our approach with application to gene expression, which is a common and high dimensional data type in computational biology. Finally, we demonstrate that our approach recovers related but different information from the data when compared to metrics of entropy.

While our approach is designed to induce a partition of the set of features with respect to the set of factors, it may be desirable to find combinations of features that determine a data set. This could arise, for example, in a setting where two factors represent related biological processes, and the goal is to discover features commonly driving both of them. This is a current active research effort.

3.6 Tables

Pattern No.   Genes         Mean PUMP Score
Pattern 1     Genes 1-10    1.0
Pattern 2     Genes 11-20   1.0
Pattern 3     Genes 21-30   1.0

Table 3.1: PUMP results for each factor. A mean of 1.0 for a factor means that every gene assigned to that factor has a PUMP score of 1.0, which can only happen when all posterior samples assign a given gene to the same pattern.

Factor   #PUMP   Fraction found by DE
1        258     0.97
2        107     0.79
3        77      0.52
4        74      0.91

Table 3.2: Different factors have different genes identified by PUMP as strongly related to that factor. Not all genes identified by PUMP are identified by DE analysis (see main text), and in fact DE analysis can miss nearly half of the genes identified by PUMP.

Term  Factor
GO_CELL_MIGRATION_INVOLVED_IN_SPROUTING_ANGIOGENESIS  1
GO_BLOOD_VESSEL_ENDOTHELIAL_CELL_MIGRATION  1
GO_ENDOTHELIAL_CELL_MIGRATION  1
GO_REGULATION_OF_BIOMINERAL_TISSUE_DEVELOPMENT  2
GO_POSITIVE_REGULATION_OF_BIOMINERAL_TISSUE_DEVELOPMENT  2
GO_CELL_FATE_COMMITMENT  2
GO_INTRINSIC_COMPONENT_OF_PLASMA_MEMBRANE  3
GO_POSTSYNAPTIC_MEMBRANE  3
GO_PLASMA_MEMBRANE_REGION  3
GO_CELLULAR_RESPONSE_TO_PROSTAGLANDIN_STIMULUS  4
GO_RESPONSE_TO_ENDOGENOUS_STIMULUS  4
GO_CELLULAR_RESPONSE_TO_ENDOGENOUS_STIMULUS  4

Table 3.3: As seen in Figure 3.2, different factors specify different sets of samples, which have different genes associated with them, as seen in Figure 3.3. This selection of GO terms (see main text) gives some insight into biological processes driving each factor. See Appendix for full list of significant GO terms.

3.7 Figures

[Figure 3.1 image; axes labeled Genes and Samples]

Figure 3.1: Simulated data in which specific gene expression occurs in a limited subset of genes.

[Figure 3.2 image; panels: Factor 1, Factor 2, Factor 3, Factor 4; x-axis: Wk 1 to Wk 11; legend: Tech Control, PBS, Cetuximab]

Figure 3.2: This experiment, which observes changes in HNSCC cell lines over 11 weeks in response to Cetuximab exposure, supports inference of four factors which describe changes in gene expression over time. There are three groups: 2 controls which have no exposure (black points), 11 time points corresponding to exposure to PBS, a control (blue points), and also 11 time points corresponding to Cetuximab exposure (red points). Clearly different factors describe different trajectories over time. Understanding genes associated with these factors gives insight into these temporal dynamics.

[Figure 3.3 image; column groups: Tech, PBS, Cetuximab; PBS and Cetuximab columns run from Wk 1 to Wk 11]

Figure 3.3: Shown are the subset of genes found to be strongly associated with a factor, clustered by expression values within a factor. Columns are organized such that the left 2 columns represent technical controls ("Tech"), then the next 11 columns represent the 11 weekly observations for PBS/control ("PBS") and finally the last 11 columns represent the 11 timepoints observed for the Cetuximab/treatment group. Factors are ordered 1 - 4, top to bottom. There are clear factor-specific expression groupings. Importantly, not all genes found by PUMP are identifiable through differential expression analysis.

[Figure 3.4 image; x-axis: Entropy (0.0 to 1.4); y-axis: PUMP (0.4 to 1.0)]

Figure 3.4: Entropy of the assignment of a gene to a pattern (x-axis) is computed by considering the number of times during the MCMC chain that a gene is assigned to one of the several patterns, compared to PUMP score for the same gene (y-axis). For a range of strong PUMP scores (PUMP ∈ (0.95, 1.0)) entropy varies considerably, making PUMP a more reliable metric.

3.8 Appendix

Equivalence of per-feature factor maximum and one-hot vector PUMP

Let $F$ be the number of factors, let $v^{(h)} \in \{0, 1\}^{F}$ be a one-hot vector as used in PUMP, with elements $v_i^{(h)}$, and let $x \in \mathbb{R}_{+}^{F}$ be a row of the feature-by-factor matrix $A$, with elements $x_i$. Then the L2 distance between these vectors is:

\[
\| v^{(h)} - x \|_2 = \left( \sum_{i=1}^{F} (v_i^{(h)} - x_i)^2 \right)^{1/2}
\]

Let $v^{(m)}$ be the vector with a 1 in position $m$ such that $m$ is the position of the largest element in $x$, and let $v^{(g)}$, $g \neq m$, be some other vector with a 1 in a position that is not the largest element in $x$. Then the squares of the L2 distances between each of these vectors and $x$ are as follows.

\[
\begin{aligned}
(\| v^{(g)} - x \|_2)^2 &= (v_m^{(g)} - x_m)^2 + (v_g^{(g)} - x_g)^2 + \sum_{i \neq g,\, i \neq m}^{F} (v_i^{(g)} - x_i)^2 \\
&= x_m^2 + (v_g^{(g)} - x_g)^2 + \sum_{i \neq g,\, i \neq m}^{F} x_i^2 \\
&= x_m^2 + \left( (v_g^{(g)})^2 - 2 x_g v_g^{(g)} + x_g^2 \right) + \sum_{i \neq g,\, i \neq m}^{F} x_i^2 \\
&= x_m^2 + 1 - 2 x_g + x_g^2 + \sum_{i \neq g,\, i \neq m}^{F} x_i^2
\end{aligned}
\]

because all elements of $v^{(g)}$ are 0 except element $g$, which is 1. Similarly for $v^{(m)}$:

\[
\begin{aligned}
(\| v^{(m)} - x \|_2)^2 &= (v_m^{(m)} - x_m)^2 + (v_g^{(m)} - x_g)^2 + \sum_{i \neq g,\, i \neq m}^{F} (v_i^{(m)} - x_i)^2 \\
&= x_g^2 + (v_m^{(m)} - x_m)^2 + \sum_{i \neq g,\, i \neq m}^{F} x_i^2 \\
&= x_g^2 + \left( (v_m^{(m)})^2 - 2 x_m v_m^{(m)} + x_m^2 \right) + \sum_{i \neq g,\, i \neq m}^{F} x_i^2 \\
&= x_g^2 + 1 - 2 x_m + x_m^2 + \sum_{i \neq g,\, i \neq m}^{F} x_i^2
\end{aligned}
\]

As the last summation is the same for both $v^{(g)}$ and $v^{(m)}$, we can drop it from consideration. By definition, $x_m > x_g$, so the following inequality holds:

\[
(x_g^2 + x_m^2 + 1) - 2 x_m < (x_g^2 + x_m^2 + 1) - 2 x_g
\]

The left-hand side is the squared distance to $v^{(m)}$ and the right-hand side is the squared distance to $v^{(g)}$, each without the common summation. This implies that, when using a one-hot vector, the L2 distance is always minimized at the position of the maximum value of the vector of factors for a given gene.
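The equivalence can also be checked numerically. The sketch below draws random non-negative rows and confirms that the nearest one-hot vector always sits at the position of the largest element. The number of rows, the gamma distribution and the variable names are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
F = 5                                                 # number of factors
X = rng.gamma(shape=2.0, scale=1.0, size=(1000, F))   # non-negative rows x

one_hot = np.eye(F)  # row h is the one-hot vector v^(h)
# dist[r, h] = || v^(h) - x_r ||_2 for every row r and every position h.
dist = np.linalg.norm(X[:, None, :] - one_hot[None, :, :], axis=-1)

nearest = dist.argmin(axis=1)   # position of the closest one-hot vector
largest = X.argmax(axis=1)      # position of the largest element of x
agree = bool(np.all(nearest == largest))
```

The draws come from a continuous distribution, so exact ties between the two largest elements, where the strict inequality above would not apply, are not expected to occur.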

No.  Term  Factor
1  GO_PHENOL_CONTAINING_COMPOUND_METABOLIC_PROCESS  1
2  GO_AMMONIUM_ION_METABOLIC_PROCESS  1
3  GO_REGULATION_OF_EPIDERMIS_DEVELOPMENT  1
4  GO_CGMP_METABOLIC_PROCESS  1
5  GO_NEGATIVE_REGULATION_OF_GLIAL_CELL_DIFFERENTIATION  1
6  GO_CELL_MIGRATION_INVOLVED_IN_SPROUTING_ANGIOGENESIS  1
7  GO_INORGANIC_ANION_TRANSPORT  1
8  GO_NEGATIVE_REGULATION_OF_RESPONSE_TO_OXIDATIVE_STRESS  1
9  GO_NEGATIVE_REGULATION_OF_ASTROCYTE_DIFFERENTIATION  1
10  GO_REGULATION_OF_ASTROCYTE_DIFFERENTIATION  1
11  GO_ENDOTHELIAL_CELL_MIGRATION  1
12  GO_BLOOD_VESSEL_ENDOTHELIAL_CELL_MIGRATION  1
13  GO_ECTODERM_DEVELOPMENT  1
14  GO_NEGATIVE_REGULATION_OF_RECEPTOR_ACTIVITY  1
15  GO_EPIDERMIS_DEVELOPMENT  1
16  GO_TRIGLYCERIDE_CATABOLIC_PROCESS  1
17  GO_REGULATION_OF_EPIDERMAL_CELL_DIFFERENTIATION  1
18  GO_REGULATION_OF_OXIDATIVE_STRESS_INDUCED_CELL_DEATH  1
19  GO_NEUROEPITHELIAL_CELL_DIFFERENTIATION  1
20  GO_REGULATION_OF_OXIDATIVE_STRESS_INDUCED_NEURON_DEATH  1
21  GO_REGULATION_OF_RESPONSE_TO_OXIDATIVE_STRESS  1
22  GO_IMPORT_INTO_CELL  1
23  GO_O_GLYCAN_PROCESSING  1
24  GO_PLATELET_DENSE_TUBULAR_NETWORK  1
25  GO_EXTRACELLULAR_SPACE  1
26  GO_INORGANIC_ANION_TRANSMEMBRANE_TRANSPORTER_ACTIVITY  1
27  GO_TRANSCRIPTIONAL_ACTIVATOR_ACTIVITY_RNA_POLYMERASE_II_TRANSCRIPTION_REGULATORY_REGION_SEQUENCE_SPECIFIC_BINDING  1
28  GO_PROTEASE_BINDING  1
29  GO_ANION_CHANNEL_ACTIVITY  1
30  GO_PHOSPHATIDYLCHOLINE_1_ACYLHYDROLASE_ACTIVITY  1
31  GO_TRANSMEMBRANE_TRANSPORTER_ACTIVITY  1
32  GO_EMBRYONIC_SKELETAL_SYSTEM_DEVELOPMENT  2
33  GO_EMBRYONIC_SKELETAL_SYSTEM_MORPHOGENESIS  2
34  GO_KIDNEY_EPITHELIUM_DEVELOPMENT  2
35  GO_REGULATION_OF_HAIR_CYCLE  2
36  GO_ESTABLISHMENT_OF_MITOTIC_SPINDLE_ORIENTATION  2
37  GO_REGULATION_OF_BIOMINERAL_TISSUE_DEVELOPMENT  2
38  GO_NEPHRON_EPITHELIUM_DEVELOPMENT  2
39  GO_POSITIVE_REGULATION_OF_BIOMINERAL_TISSUE_DEVELOPMENT  2
40  GO_REGULATION_OF_MEGAKARYOCYTE_DIFFERENTIATION  2
41  GO_REGULATION_OF_FIBROBLAST_GROWTH_FACTOR_RECEPTOR_SIGNALING_PATHWAY  2
42  GO_POSITIVE_REGULATION_OF_HAIR_CYCLE  2
43  GO_REGULATION_OF_ODONTOGENESIS  2
44  GO_ORGANIC_ACID_TRANSMEMBRANE_TRANSPORT  2
45  GO_CELL_FATE_COMMITMENT  2
46  GO_RESPONSE_TO_PARATHYROID_HORMONE  2
47  GO_MYOTUBE_DIFFERENTIATION  2
48  GO_NEPHRON_DEVELOPMENT  2
49  GO_PATTERN_SPECIFICATION_PROCESS  2
50  GO_REGIONALIZATION  2
51  GO_SKELETAL_SYSTEM_MORPHOGENESIS  2
52  GO_REGULATION_OF_ODONTOGENESIS_OF_DENTIN_CONTAINING_TOOTH  2
53  GO_TRANSCRIPTIONAL_ACTIVATOR_ACTIVITY_RNA_POLYMERASE_II_DISTAL_ENHANCER_SEQUENCE_SPECIFIC_BINDING  2
54  GO_MODIFIED_AMINO_ACID_TRANSPORT  3
55  GO_RESPONSE_TO_BMP  3
56  GO_NEGATIVE_REGULATION_OF_MUSCLE_TISSUE_DEVELOPMENT  3
57  GO_INTRINSIC_COMPONENT_OF_PLASMA_MEMBRANE  3
58  GO_SODIUM_CHANNEL_COMPLEX  3
59  GO_NEURON_SPINE  3
60  GO_POSTSYNAPTIC_MEMBRANE  3
61  GO_PLASMA_MEMBRANE_REGION  3
62  GO_CELL_BODY  3
63  GO_MEMBRANE_REGION  3
64  GO_SYNAPTIC_MEMBRANE  3
65  GO_METALLOENDOPEPTIDASE_ACTIVITY  3
66  GO_CELLULAR_RESPONSE_TO_PROSTAGLANDIN_STIMULUS  4
67  GO_MAMMARY_GLAND_EPITHELIAL_CELL_DIFFERENTIATION  4
68  GO_RESPONSE_TO_ENDOGENOUS_STIMULUS  4
69  GO_POSITIVE_REGULATION_OF_PROTEIN_TYROSINE_KINASE_ACTIVITY  4
70  GO_RESPONSE_TO_PROSTAGLANDIN  4
71  GO_ENZYME_LINKED_RECEPTOR_PROTEIN_SIGNALING_PATHWAY  4
72  GO_TERPENOID_METABOLIC_PROCESS  4
73  GO_ACTOMYOSIN_STRUCTURE_ORGANIZATION  4
74  GO_REGULATION_OF_NEURON_MIGRATION  4
75  GO_PARTURITION  4
76  GO_POSITIVE_REGULATION_OF_LONG_TERM_SYNAPTIC_POTENTIATION  4
77  GO_GLIAL_CELL_MIGRATION  4
78  GO_MULTI_MULTICELLULAR_ORGANISM_PROCESS  4
79  GO_LONG_TERM_SYNAPTIC_POTENTIATION  4
80  GO_MULTI_ORGANISM_REPRODUCTIVE_PROCESS  4
81  GO_PEPTIDYL_TYROSINE_MODIFICATION  4
82  GO_REGULATION_OF_LONG_TERM_SYNAPTIC_POTENTIATION  4
83  GO_CELLULAR_RESPONSE_TO_ENDOGENOUS_STIMULUS  4
84  GO_MAMMARY_GLAND_LOBULE_DEVELOPMENT  4
85  GO_REPRODUCTION  4
86  GO_BIOLOGICAL_ADHESION  4
87  GO_TRANSMEMBRANE_RECEPTOR_PROTEIN_SERINE_THREONINE_KINASE_SIGNALING_PATHWAY  4
88  GO_EXTRACELLULAR_MATRIX  4
89  GO_ENDOSOME_LUMEN  4
90  GO_PROTEINACEOUS_EXTRACELLULAR_MATRIX  4
91  GO_ENDOPLASMIC_RETICULUM_LUMEN  4
92  GO_CELL_SURFACE  4
93  GO_PEPTIDASE_ACTIVITY  4
94  GO_SULFUR_COMPOUND_BINDING  4
95  GO_HEPARIN_BINDING  4
96  GO_LIPOPROTEIN_PARTICLE_RECEPTOR_BINDING  4
97  GO_EXTRACELLULAR_MATRIX_STRUCTURAL_CONSTITUENT  4
98  GO_METALLOENDOPEPTIDASE_ACTIVITY  4
99  GO_METALLOPEPTIDASE_ACTIVITY  4

Table 3.4: Complete list of significant GO terms

References

Ahn, Sungjin, Anoop Korattikara, Nathan Liu, Suju Rajan, and Max Welling (2015). “Large-Scale Distributed Bayesian Matrix Factorization using Stochastic Gradient MCMC”. In: arXiv: 1503.01596 [cs.LG]. URL: http://arxiv.org/abs/1503.01596.

Camacho, Mercedes, Xavier León, María-Teresa Fernández-Figueras, Miquel Quer, and Luis Vila (2008). “Prostaglandin E(2) pathway in head and neck squamous cell carcinoma”. In: Head Neck 30.9, pp. 1175–1181. ISSN: 1043-3074, 1097-0347. DOI: 10.1002/hed.20850. URL: http://dx.doi.org/10.1002/hed.20850.

Engelhardt, Barbara E and Matthew Stephens (2010). “Analysis of population structure: a unifying framework and novel methods based on sparse factor analysis”. In: PLoS Genet. 6.9, e1001117. ISSN: 1553-7390, 1553-7404. DOI: 10.1371/journal.pgen.1001117. URL: http://dx.doi.org/10.1371/journal.pgen.1001117.

Fertig, Elana J, Jie Ding, Alexander V Favorov, Giovanni Parmigiani, and Michael F Ochs (2010). “CoGAPS: an R/C++ package to identify patterns and biological process activity in transcriptomic data”. In: Bioinformatics 26.21, pp. 2792–2793. ISSN: 1367-4803, 1367-4811. DOI: 10.1093/bioinformatics/btq503. URL: http://dx.doi.org/10.1093/bioinformatics/btq503.

Hyvärinen, A and E Oja (2000). “Independent component analysis: algorithms and applications”. In: Neural Netw. 13.4-5, pp. 411–430. ISSN: 0893-6080. DOI: 10.1016/S0893-6080(00)00026-5. URL: https://www.ncbi.nlm.nih.gov/pubmed/10946390.

Jiang, Xingpeng, Xiaohua Hu, and Weiwei Xu (2017). “Microbiome Data Representation by Joint Nonnegative Matrix Factorization with Laplacian Regularization”. In: IEEE/ACM Trans. Comput. Biol. Bioinform. 14.2, pp. 353–359. ISSN: 1545-5963, 1557-9964. DOI: 10.1109/TCBB.2015.2440261. URL: http://dx.doi.org/10.1109/TCBB.2015.2440261.

Li, Jianqiang and Fei Wang (2017). “Towards Unsupervised Gene Selection: A Matrix Factorization Framework”. In: IEEE/ACM Trans. Comput. Biol. Bioinform. 14.3, pp. 514–521. ISSN: 1545-5963, 1557-9964. DOI: 10.1109/TCBB.2016.2591545. URL: http://dx.doi.org/10.1109/TCBB.2016.2591545.

Liberzon, Arthur, Chet Birger, Helga Thorvaldsdóttir, Mahmoud Ghandi, Jill P Mesirov, and Pablo Tamayo (2015). “The Molecular Signatures Database (MSigDB) hallmark gene set collection”. In: Cell Syst 1.6, pp. 417–425. ISSN: 2405-4712. DOI: 10.1016/j.cels.2015.12.004. URL: http://dx.doi.org/10.1016/j.cels.2015.12.004.

Ma, Zhanyu, Andrew E Teschendorff, Arne Leijon, Yuanyuan Qiao, Honggang Zhang, and Jun Guo (2015). “Variational Bayesian Matrix Factorization for Bounded Support Data”. In: IEEE Trans. Pattern Anal. Mach. Intell. 37.4, pp. 876–889. ISSN: 0162-8828. DOI: 10.1109/TPAMI.2014.2353639. URL: http://dx.doi.org/10.1109/TPAMI.2014.2353639.

Nakajima, Shinichi, Masashi Sugiyama, and S D Babacan (2011). “Global Solution of Fully-Observed Variational Bayesian Matrix Factorization is Column-Wise Independent”. In: Advances in Neural Information Processing Systems 24. Ed. by J Shawe-Taylor, R S Zemel, P L Bartlett, F Pereira, and K Q Weinberger. Curran Associates, Inc., pp. 208–216.

Novembre, John, Toby Johnson, Katarzyna Bryc, Zoltán Kutalik, Adam R Boyko, Adam Auton, Amit Indap, Karen S King, Sven Bergmann, Matthew R Nelson, Matthew Stephens, and Carlos D Bustamante (2008). “Genes mirror geography within Europe”. In: Nature 456.7218, pp. 98–101. ISSN: 0028-0836, 1476-4687. DOI: 10.1038/nature07331. URL: http://dx.doi.org/10.1038/nature07331.

Ochs, M F, R S Stoyanova, F Arias-Mendoza, and T R Brown (1999). “A New Method for Spectral Decomposition Using a Bilinear Bayesian Approach”. In: J. Magn. Reson. 137.1, pp. 161–176.

Paisley, John, David M Blei, and Michael I Jordan (2015). “Bayesian nonnegative matrix factorization with stochastic variational inference”. In: Handbook of Mixed Membership Models and Their Applications. Chapman and Hall/CRC. URL: http://citeseerx.ist.psu.edu/viewdoc/citations;jsessionid=BAA89A82A1C34BB78EF65828A7039910?doi=10.1.1.675.4586.

Ritchie, Matthew E, Belinda Phipson, Di Wu, Yifang Hu, Charity W Law, Wei Shi, and Gordon K Smyth (2015). “limma powers differential expression analyses for RNA-sequencing and microarray studies”. In: Nucleic Acids Res. 43.7, e47. DOI: 10.1093/nar/gkv007. URL: http://dx.doi.org/10.1093/nar/gkv007.

65 Salakhutdinov, Ruslan and Andriy Mnih (2007). “Probabilistic Matrix Factor- ization”. In: Proceedings of the 20th International Conference on Neural Infor- mation Processing Systems. NIPS’07. USA: Curran Associates Inc., pp. 1257– 1264. ISBN: 9781605603520. URL: http://dl.acm.org/citation.cfm?id= 2981562.2981720. Sibisi, Sibusiso and John Skilling (1997). “Prior Distributions on Measure Space”. In: J. R. Stat. Soc. Series B Stat. Methodol. 59.1, pp. 217–235. ISSN: 1369-7412, 1467-9868. DOI: 10.1111/1467-9868.00065. URL: http://doi. wiley.com/10.1111/1467-9868.00065. Stein-O’Brien, Genevieve, Luciane T. Kagohara, Sijia Li, Manjusha Thakar, Ruchira Ranaweera, Hiroyuki Ozawa, Haixia Cheng, Michael Considine, Sandra Schmitz, Alexander Favorov, Ludmila Danilova, Joseph A. Cali- fano, Evgeny Izumchenko, Daria A. Gaykalova, Christine H. Chung, and Elana J. Fertig (2018a). “Integrated time course omics analysis distinguishes immediate therapeutic response from acquired resistance.” In: bioRxiv. DOI: 10.1101/136564. eprint: https://www.biorxiv.org/content/early/ 2018/01/23/136564.full.pdf. URL: https://www.biorxiv.org/content/ early/2018/01/23/136564. Stein-O’Brien, Genevieve L, Jacob L Carey, Wai Shing Lee, Michael Con- sidine, Alexander V Favorov, Emily Flam, Theresa Guo, Sijia Li, Luigi Marchionni, Thomas Sherman, Shawn Sivy, Daria A Gaykalova, Ronald D McKay, Michael F Ochs, Carlo Colantuoni, and Elana J Fertig (2017). “PatternMarkers & GWCoGAPS for novel data-driven biomarkers via whole transcriptome NMF”. In: Bioinformatics 33.12, pp. 1892–1894. DOI: 10.1093/bioinformatics/btx058. Stein-O’Brien, Genevieve L, Raman Arora, Aedin C Culhane, Alexander Fa- vorov, Lana X Garmire, Casey Greene, Loyal A Goff, Yifeng Li, Alioune Ngom, Michael F Ochs, Yanxun Xu, and Elana J Fertig (2018b). “Enter the matrix: factorization uncovers knowledge from omics”. en. In: bioRxiv, p. 196915. DOI: 10 . 1101 / 196915. URL: https : / / www . biorxiv . 
org / content/early/2018/04/02/196915. Wang, Hong-Qiang, Chun-Hou Zheng, and Xing-Ming Zhao (2015). “jN- MFMA: a joint non-negative matrix factorization meta-analysis of tran- scriptomics data”. en. In: Bioinformatics 31.4, pp. 572–580. ISSN: 1367-4803, 1367-4811. DOI: 10.1093/bioinformatics/btu679. URL: http://dx.doi. org/10.1093/bioinformatics/btu679.

66 Xie, Fangzheng, Mingyuan Zhou, and Yanxun Xu (2017). “BayCount: A Bayesian Decomposition Method for Inferring Tumor Heterogeneity using RNA-Seq Counts”. In: Yang, Zi and George Michailidis (2016). “A non-negative matrix factorization method for detecting modules in heterogeneous omics multi-modal data”. en. In: Bioinformatics 32.1, pp. 1–8. ISSN: 1367-4803, 1367-4811. DOI: 10. 1093 / bioinformatics / btv544. URL: http : / / dx . doi . org / 10 . 1093 / bioinformatics/btv544.

Chapter 4

Discussion and Conclusion

I have presented two projects in this thesis. The first focused on statistical methods for Hi-C data and contributed a thorough investigation into effective diagnostics for such data. The second focused on learning biologically relevant groupings of genes from gene expression data using a Bayesian non-negative matrix factorization algorithm, and I showed the utility of this approach. Each project, however, leaves open questions.

For the Hi-C data in Chapter 2, the BNBC algorithm estimates and removes batch effects from each distance stratum separately. This probably leaves information on the table: it is reasonable to think that two consecutive distance strata, say interactions separated by i bins and by i + 1 bins, are very similar, so there is probably information that could be shared between them. While I did not pursue this further, one could describe Hi-C data with a hierarchical model in which distance strata form one level of the hierarchy.
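To make the idea of sharing information across strata concrete, the following is a minimal sketch in Python. It is not the BNBC algorithm and not a hierarchical model: it assumes a crude per-batch mean offset within each stratum and then pulls each stratum's offsets toward those of its immediate neighbours. The function names, the toy offset estimate, and the `weight` parameter are all hypothetical and exist only for illustration.

```python
def stratum(matrix, i):
    """Entries of a square contact matrix whose two bins are i bins apart."""
    n = len(matrix)
    return [matrix[r][r + i] for r in range(n - i)]


def stratum_batch_offsets(samples, batches, i):
    """Per-batch mean of stratum i minus the overall mean of stratum i.

    A toy stand-in for a batch effect estimate (an assumption, not BNBC).
    """
    values = {}
    for mat, batch in zip(samples, batches):
        values.setdefault(batch, []).extend(stratum(mat, i))
    pooled = [v for vs in values.values() for v in vs]
    grand_mean = sum(pooled) / len(pooled)
    return {b: sum(vs) / len(vs) - grand_mean for b, vs in values.items()}


def smooth_across_strata(offsets_by_stratum, weight=0.5):
    """Pull each stratum's offsets toward the mean of its adjacent strata."""
    smoothed = []
    last = len(offsets_by_stratum) - 1
    for i, offsets in enumerate(offsets_by_stratum):
        neighbours = [offsets_by_stratum[j] for j in (i - 1, i + 1) if 0 <= j <= last]
        if not neighbours:  # a single stratum has nothing to borrow from
            smoothed.append(dict(offsets))
            continue
        out = {}
        for batch, value in offsets.items():
            neighbour_mean = sum(nb[batch] for nb in neighbours) / len(neighbours)
            out[batch] = (1 - weight) * value + weight * neighbour_mean
        smoothed.append(out)
    return smoothed


# Toy data: two 4 x 4 "contact matrices", one per batch.
sample_a = [[1.0] * 4 for _ in range(4)]
sample_b = [[3.0] * 4 for _ in range(4)]
per_stratum = [
    stratum_batch_offsets([sample_a, sample_b], ["b1", "b2"], i) for i in range(3)
]
shared = smooth_across_strata(per_stratum)
```

A hierarchical model would replace the fixed `weight` with a degree of shrinkage learned from the data, which is the part this sketch deliberately leaves out.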

Additionally, the comparatively poor performance of algorithms such as PEER raises questions of its own: what is it about Hi-C data that makes these algorithms perform so badly? Does the answer give insight into batch effects in Hi-C? I did not investigate this, although being able to use unsupervised methods such as PEER would remove the requirement of knowing the experimental batch each sample came from, which is a major limitation of BNBC.

The PUMP algorithm in Chapter 3 can also be taken in interesting directions. PUMP as described relies on one-hot vectors, which allow each gene to be linked to only one pattern, but it would be interesting to relax this and allow real-valued vectors, possibly with no zero-valued elements. General real-valued vectors would correspond to linear combinations of patterns, which can be thought of as combinations of biological processes. Methods for matching genes to specific combinations of patterns would therefore provide insight into the interactions of multiple cellular processes.
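The contrast between one-hot and real-valued targets can be illustrated with a small Python sketch. It assumes that a gene is matched to whichever candidate vector lies closest, in Euclidean distance, to the gene's pattern weights rescaled to sum to one. That matching rule, the candidate names, and the toy weights are assumptions made for illustration and are not taken from the PUMP definition in Chapter 3.

```python
import math


def normalize(row):
    """Rescale non-negative pattern weights so that they sum to one."""
    total = sum(row)
    return [x / total for x in row]


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def nearest(row, candidates):
    """Name of the candidate vector closest to the normalized row."""
    unit = normalize(row)
    return min(candidates, key=lambda name: euclidean(unit, candidates[name]))


# One-hot candidates: each gene can be linked to exactly one pattern.
one_hot = {
    "P1": [1.0, 0.0, 0.0],
    "P2": [0.0, 1.0, 0.0],
    "P3": [0.0, 0.0, 1.0],
}

# Relaxed candidates: also allow an equal mixture of patterns 1 and 2.
mixed = dict(one_hot)
mixed["P1+P2"] = [0.5, 0.5, 0.0]

gene = [4.0, 5.0, 1.0]  # hypothetical weights of one gene on three patterns

print(nearest(gene, one_hot))  # P2
print(nearest(gene, mixed))    # P1+P2
```

With only one-hot candidates the gene is forced onto pattern 2, even though it loads almost equally on pattern 1. Adding a mixture candidate lets the match reflect both patterns.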

The PUMP algorithm also suggests another methodological question. By the central limit theorem, the elements of the mean matrices $\hat{A}$ and $\hat{P}$ are approximately normally distributed, with specific means and variances. It would be of interest to develop asymptotics that allow analytic computation of the probabilities of membership, as PUMP currently requires access to the posterior sample matrices, either as they are sampled or after the fact, when they have generally been saved to disk. This is an area of active interest, and hopefully more is to come soon.
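One way to picture what such asymptotics might buy is the Python sketch below. It keeps only a running mean and variance for each matrix element (Welford's method), so that no sample matrices need to be stored, and then uses a normal approximation to compare two elements. It rests on strong assumptions that are mine and not results from this thesis: posterior draws are treated as independent (MCMC draws are typically autocorrelated), the two elements are treated as independent of each other, and "membership" is reduced to one element's mean exceeding another's.

```python
import math


class RunningMoments:
    """Streaming mean and variance of one matrix element (Welford's method)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self):
        """Sample variance; requires at least two updates."""
        return self.m2 / (self.n - 1)


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def prob_first_exceeds(first, second):
    """Normal-approximation probability that the first mean exceeds the second.

    Uses the variance of each sample mean (variance / n) and assumes the
    two elements are independent.
    """
    se2 = first.variance / first.n + second.variance / second.n
    diff = first.mean - second.mean
    if se2 == 0.0:  # degenerate case: no variability in either element
        return 0.5 if diff == 0.0 else float(diff > 0.0)
    return normal_cdf(diff / math.sqrt(se2))


# Hypothetical posterior draws of one gene's weight on two patterns.
pattern_1, pattern_2 = RunningMoments(), RunningMoments()
for a1, a2 in [(1.0, 0.0), (2.0, 1.0), (3.0, 2.0)]:
    pattern_1.update(a1)
    pattern_2.update(a2)

p = prob_first_exceeds(pattern_1, pattern_2)
```

Only the count, mean, and sum of squared deviations are kept per element, so memory use does not grow with the number of posterior draws.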

All the work I describe in this thesis attempts to answer concrete and specific questions: BNBC addresses cross-sample normalization of Hi-C data under a specific set of circumstances, while PUMP provides a clustering method for a particular class of algorithms. In these projects, I have learned that useful methods are frequently simple but creatively applied. In fact, creativity has been a broad theme: good questions have regularly required creativity not just in their answers, but in their phrasing and description.

CHRISTOPHER FLETEZ-BRANT

240-522-8292 cafl[email protected]

EDUCATION

Johns Hopkins School of Medicine, July 2018
Ph.D. Candidate, Human Genetics
Advisor: Kasper Hansen, PhD

Johns Hopkins School of Public Health, September 2017
M.H.S., Biostatistics

Johns Hopkins University, May 2012
M.S., Biotechnology

St. John’s College, May 2007
B.A., Liberal Arts

EXPERIENCE

Johns Hopkins University School of Medicine, Baltimore, MD
Graduate Research Assistant, August 2013 - Present
· Developed batch correction and normalization algorithm for Hi-C data to characterize healthy donor 3D chromatin biology.
· Identified use of empirical null distributions in analysis of ChIP-seq data to characterize non-diseased interpersonal chromatin variation and function.
· Derived measures of confidence for Bayesian non-negative matrix factorization-based clustering assignments for gene expression data to identify cell type-specific gene clusters in multiple neuronal cell types.

Vaccine Research Center, NIAID, NIH, Bethesda, MD
Bioinformatics Scientist, June 2012 - August 2013
· Developed algorithm for automated quality control of High Throughput System flow cytometry to process high throughput flow cytometry data collected from vaccine trials for HIV and influenza vaccines.
· Implemented support vector machine, principal component analysis-based feature extraction methodology for Fluidigm data to characterize HIV vaccine response.

Johns Hopkins University School of Medicine, Baltimore, MD
Research Assistant, June 2011 - June 2012
· Explored novel method for enhancer discovery using ChIP-Seq and support vector machines to learn normal healthy donor gene regulatory loci.
· Developed web server for support vector machine-based ChIP-Seq analysis tool to support use of machine learning technology through primarily point-and-click interface.

FELLOWSHIPS

Maryland Genetics, Epidemiology and Medicine (MD-GEM) Fellow, 2014 - 2017
The MD-GEM cross-trains individuals enrolled in both genetics and epidemiology PhD programs to be proficient in both epidemiological and molecular genetics.

PUBLICATIONS

Removing unwanted variation between samples in Hi-C experiments. K Fletez-Brant, Y Qiu, DU Gorkin, M Hu, KD Hansen. https://www.biorxiv.org/content/early/2017/11/06/214361

Reconstitution of Peripheral T Cells by Tissue-Derived CCR4+ Central Memory Cells Following HIV-1 Antiretroviral Therapy. K Fletez-Brant, J Špidlen, RR Brinkman, M Roederer, PK Chattopadhyay. Cytometry Part A 89 (5), 461-47.

kmer-SVM: a Web-based Toolkit for the Computational Identification of Predictive Regulatory Sequence Features in Genomic Datasets. Fletez-Brant C*, Lee D*, McCallion AS and Beer MA. 2012, Nucleic Acids Research.

Integration of ChIP-seq and Machine Learning Reveals Enhancers and a Predictive Regulatory Sequence Vocabulary in Melanocytes. Gorkin DU, Lee D, Reed X, Fletez-Brant C, Bessling SL, Loftus SK, Beer MA, Pavan WJ, and McCallion AS. 2012, Genome Research.

SOFTWARE

flowClean - R/Bioconductor module for the efficient quality control of flow cytometry data. Also available as module FCSClean on the GenePattern website.

bnbc - R/Bioconductor and C++ module for performing batch correction and normalization of Hi-C data.

CoGAPS - R/Bioconductor library for Bayesian Non-negative Matrix Factorization. I contributed C++ methods to compute posterior probabilities of class membership.

SKILLS

Computer Languages: R, Python, C++, Java, Perl
Databases: MySQL, PostgreSQL, Microsoft SQL
Frameworks: Bioconductor, NumPy