Predicting Methylation from Sequence and Gene Expression Using Deep Learning with Attention
Total Page:16
File Type:pdf, Size:1020Kb
bioRxiv preprint doi: https://doi.org/10.1101/491357; this version posted December 9, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. Predicting Methylation from Sequence and Gene Expression Using Deep Learning with Attention Alona Levy-Jurgenson12, Xavier Tekpli3, Vessela N. Kristensen3 and Zohar Yakhini12 Abstract— DNA methylation has been extensively linked DNA methylation plays a key role in disease development. to alterations in gene expression, playing a key role in the Specifically, hypermethylation can lead to stable silencing of manifestation of multiple diseases, most notably cancer. For tumor suppressor genes [21]. This process has therefore been this reason, researchers have long been measuring DNA methylation in living organisms. The relationship between extensively observed and studied in the context of cancer methylation and expression, and between methylation in [24], [15], [41], [29]. While various forms of cancer are different genomic regions is of great theoretical interest from a central to the discussion on DNA methylation, it has also molecular biology perspective. Therefore, several models have been linked to other diseases and biological processes such been suggested to support the prediction of methylation status as cardiovascular disease [13] and Alzheimer’s Disease [19] in samples. These models, however, have two main limitations: (a) they heavily rely on partially measured methylation levels as well as gene expression regulation in general [40] and as input, somewhat defeating the object as one is required epigenetic editing [35]. Hence, researchers have long been to collect measurements from the sample of interest before measuring DNA methylation levels. applying the model; and (b) they are largely based on human Currently, there are several methods available for mea- mediated feature engineering, thus preventing the model from suring DNA methylation [23]. Some of these methods, unveiling its own representations. To address these limitations we used deep learning, with an attention mechanism, to however, require specialized protocols or a relatively large produce a general model that predicts DNA methylation for a DNA sample size. Hence, depending on the required task, given sample in any CpG position based solely on the sample’s the costs could be significant and the data collection may gene expression profile and the sequence surrounding the be impractical. For this reason, the prediction of DNA CpG. methylation levels through other means could prove highly We show that our model is capable of generalizing to a completely separate test set of CpG positions and subjects. useful. In addition, and perhaps more importantly, the link Depending on gene-CpG proximity conditions, our model can between gene expression and methylation is still an open- attain a Spearman correlation of up to 0.8 and MAE of 0.14 ended question and predictive analyses may provide insight for thousands of CpG sites in the test data. We also identify into this relationship. In this work we set out to address both and analyze several motifs and genes that our model suggests aspects of methylation. may be linked to methylation activity, such as Nodal and Hand1. Moreover, our approach, and most notably the use of The contribution of this work includes: (1) We provide attention mechanisms, offers a novel framework with which a practical tool that enables potential users to input any to extract valuable insights from gene expression data when CpG position for which DNA methylation was not measured, combined with sequence information. along with the sample’s gene-expression profile, and obtain The code and trained models are available at: a prediction. We do so by training a machine learning model https://github.com/YakhiniGroup/Methylation that combines gene expression data with the ambient DNA sequence at the CpG of interest. We demonstrate that this I. INTRODUCTION model is capable of generalizing across CpG sites as well as across samples. (2) From a theoretical perspective, this DNA methylation is a chemical process that modifies result provides proof for a sharp, albeit not necessarily DNA in living organisms and can significantly affect gene causal, link between sequence and expression and between expression, mostly through the inhibition of transcription. In local methylation events. Furthermore, we observe better humans, DNA methylation refers to the presence of a methyl predictability for CpGs that reside closer to genes. We group at a defined position of a cytosine and occurs mostly also unveil motifs and genes that the model identified as in CpG dinucleotides. It has been particularly shown to affect significant contributors to the prediction. Specifically, we gene expression in gene promoter regions with relatively link HAND1 and NODAL to methylation activity in the dense CpGs, known as CpG islands (CGI). When a large cohort analyzed. (3) We provide a novel model design and number of proximal CpGs are methylated, the transcription framework that support the combination of gene-expression of nearby downstream genes may be inhibited. This process data with genomic sequences to extract valuable molecular- is, for example, prominent in the silent X-chromosome in level insights. females [37]. A. Related Work 1 Technion - Israel Institute of Technology, Haifa, Israel Over the past decade, researchers have been investigating 2School of Computer Science, Interdisciplinary Center, Herzeliya, Israel 3Oslo University Hospital, The Norwegian Radium Hospital, Oslo, Nor- the use of machine learning for the prediction of methy- way lation. In [4] and [10] the authors used classifiers such bioRxiv preprint doi: https://doi.org/10.1101/491357; this version posted December 9, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. as Support Vector Machines (SVMs) and decision trees to III. METHODS determine the status of a given CpG using both structural A. Datasets and sequence-specific features. Similarly, [44] suggested a random forest classifier that uses features such as genomic We used data from two cancer cohorts: (1) 782 breast position and neighbor methylation levels. The latter were cancer patients and (2) 498 prostate cancer patients. For each noted as significant contributors but clearly require collecting patient, we obtained two types of data: (a) gene expression partial methylation data. Others [26], have used a regression data in RSEM normalized count for 17,997 genes (RNA- approach to predict continuous methylation levels across seq) and (b) methylation levels at 360,531 CpG sites (450K tissues, also using SVMs. While the use of regression is Illumina array). indeed more appropriate in the context of continous methy- In addition to the patient-specific data, we also use data lation measurements, this approach requires extensive data specific to any CpG locus: (a) the ambient sequence - 399 collection from a source tissue. More recently, [42] used a base-pairs upstream of the CpG and 399 downstream, for a deep learning model to predict whether a CpG was hypo- total of 800 base-pairs and (b) the genomic distance between or hyper-methylated by using DNA patterns and topological each gene in the profile considered and the locus of interest. features. The latter are human engineered features taken as B. Constructing the Model input by the network model. Like previous methods, this model is limited to binary classification, and is specifically Our task is to predict the methylation level at a CpG site constrained to hypo-/hyper- methylation. in a sample taken from a given subject, using the samples’s To conclude, the main limitations posed by previous gene expression profile and the ambient sequence at the CpG models include: (1) The need to measure methylation in some site. To do so, we created a multi-modal neural network (or all, in the case of learning between tissues) CpG sites. (2) comprised of four sub-networks: one CNN, which acts as Extensive use of human-engineered features. This not only a motif detector for the surrounding sequence, and three incorporates human biases, but also prevents the model from attention components which act as gene amplifiers, each unveiling novel representations. (3) The majority are binary based on the input provided. These sub-networks are then classifiers when in reality methylation levels are measured combined into a single fully-connected network to produce continuously, representing fractions of cells with any given the final prediction. status. Input Data II. APPROACH We will define a single training example to represent one To address the limitations posed by the aforementioned subject (or sample) and one CpG. It contains the following methods, we suggest a general deep learning model that does components: not require measuring methylation levels in the sample of 1) The subject’s gene expression vector e, where each interest, is not limited to specific CpGs, uses neural networks entry, ei, represents the expression level of a gene gi. as feature extractors instead of human-engineered features 2) The sequence surrounding the CpG of interest, repre- and provides continuous predictions. Specifically, we predict sented as a one-hot matrix S. methylation levels at a given CpG in a given sample based 3) A vector d, where di is computed based on the distance, on three factors: in base-pairs, between gi and the CpG of interest. Specifically, a gene residing within the first 2,000 base- • The sequence surrounding the CpG. pairs received a value of 1, the next 2,000 a value of • The sample’s gene expression profile. 0.5 and so on until the last bucket of 2,000 was given • The distance between the profiled genes and the CpG. 9 a value of 0:5 . Beyond this point di was set to 0. For We use a generalized approach that can be applied to any set genes residing on a different chromosome this value was of CpGs.