Discovering Epistatic Feature Interactions from Neural Network Models of Regulatory DNA Sequences

bioRxiv preprint doi: https://doi.org/10.1101/302711; this version posted July 26, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.

Peyton Greenside1, Tyler Shimko2, Polly Fordyce2,3,4,5, and Anshul Kundaje2,6,#

1 Biomedical Informatics Training Program, Stanford University, Stanford, 94305
2 Department of Genetics, Stanford University, Stanford, 94305
3 Department of Bioengineering, Stanford University, Stanford, 94305
4 Chan Zuckerberg Biohub, San Francisco, CA, 94158
5 Chem-H Institute, Stanford University, Stanford, 94305
6 Department of Computer Science, Stanford University, Stanford, 94305
# Corresponding author

Abstract

Motivation: Transcription factors bind regulatory DNA sequences in a combinatorial manner to modulate gene expression. Deep neural networks (DNNs) can learn the cis-regulatory grammars encoded in regulatory DNA sequences associated with transcription factor binding and chromatin accessibility. Several feature attribution methods have been developed for estimating the predictive importance of individual features (nucleotides or motifs) in any input DNA sequence to its associated output prediction from a DNN model. However, these methods do not reveal higher-order feature interactions encoded by the models.

Results: We present a new method called Deep Feature Interaction Maps (DFIM) to efficiently estimate interactions between all pairs of features in any input DNA sequence. DFIM accurately identifies ground truth motif interactions embedded in simulated regulatory DNA sequences. DFIM identifies synergistic interactions between GATA1 and TAL1 motifs from in vivo TF binding models.
DFIM reveals epistatic interactions involving nucleotides flanking the core motif of the Cbf1 TF in yeast from in vitro TF binding models. We also apply DFIM to regulatory sequence models of in vivo chromatin accessibility to reveal interactions between regulatory genetic variants and proximal motifs of target TFs as validated by TF binding quantitative trait loci. Our approach makes significant strides in improving the interpretability of deep learning models for genomics.

Availability: Code is available at: https://github.com/kundajelab/dfim.

Contact: [email protected]

1. Introduction

Genome-wide biochemical profiling experiments have revealed millions of putative regulatory elements in diverse cell states. These massive datasets have spurred the development of deep neural network (DNN) models to predict cell-type specific or context-specific molecular phenotypes such as TF binding, chromatin accessibility and gene expression from DNA sequence [1, 2, 3]. Beyond high prediction accuracy, the primary appeal of DNNs is that they are capable of learning predictive sequence features and modeling non-linear feature interactions directly from raw DNA sequence without any prior assumptions. Hence, interpreting these purported black box models could reveal novel insights into the combinatorial regulatory code.

Advances in feature attribution methods for DNNs have enabled the identification of predictive cis-regulatory patterns in DNA sequences used as input to the models. Feature attribution methods estimate the contribution (or importance) of features, such as individual nucleotides or contiguous subsequences (e.g. motifs), in an input DNA sequence to a model's output prediction. A perturbation-based, forward-propagation approach known as in-silico mutagenesis (ISM) quantifies the importance of a nucleotide in an input DNA sequence as the maximal change in the output prediction from the DNN model when the observed nucleotide (e.g. a G) at that position is mutated to any of the alternative bases (e.g. A, C or T). ISM has been used to score the potential molecular impact of genetic variants in regulatory DNA sequences [1, 2, 3]. However, ISM is computationally inefficient since each perturbation at every position in an input sequence requires a separate forward propagation to the output through the network. ISM also fails to highlight important features masked by saturation due to buffering interactions with other features (e.g. multiple motif instances in a sequence that buffer each other) [4]. SHAP is a perturbation-based feature attribution method that borrows from game theory [5]. Max-Ent is a feature attribution method that uses a Markov chain Monte Carlo algorithm to find the maximum-entropy distribution of inputs that produced a similar hidden representation to the chosen input [6]. While SHAP and Max-Ent show improved sensitivity and specificity relative to ISM, they do not scale efficiently to comprehensively characterize feature importance across millions of regulatory sequences.

An alternative family of computationally efficient backpropagation approaches decomposes the output prediction corresponding to an input sequence by recursively propagating contribution scores through the layers of the DNN from the output to the input. One single backpropagation pass provides the contribution of all nucleotides in an input DNA sequence to the output prediction. The gradient of the output with respect to each nucleotide in the input DNA sequence (known as a saliency map [7]) is one such estimate of importance and has been used to identify predictive nucleotides in regulatory DNA sequences. Other related approaches such as DeepLIFT [4] and integrated gradients [8] differ in the definition of the importance score that is backpropagated and provide improved sensitivity in the presence of saturation effects. DeepLIFT [4] has also been shown to be an efficient approximation of SHAP scores [9].

Current feature attribution methods only provide the importance of individual features. They do not highlight predictive, higher-order feature interactions that are learned by the DNN model. Perturbation-based approaches such as ISM cannot scale to comprehensively score all pairwise and higher-order interactions between nucleotides or subsequence features. Recently, an efficient algorithm was proposed to calculate SHAP-based pairwise feature interaction scores [9] specifically from tree-based ensemble models. However, computing SHAP interactions from neural network models between all pairs of features in regulatory DNA sequences is computationally inefficient and cannot scale to reveal comprehensive interaction maps across millions of regulatory sequences.

Here, we present an efficient approach called Deep Feature Interaction Maps (DFIM) to estimate pairwise interactions between features (nucleotides or subsequences) in an input DNA sequence mapped to an associated regulatory phenotype by a neural network. We define a novel Feature Interaction Score (FIS) between any pair of features (source feature and target feature) in an input DNA sequence as the change in the importance score of the target feature when the source feature is perturbed, while keeping all the other features in the sequence intact. By leveraging efficient backpropagation-based feature attribution methods, we can efficiently compute FIS between all pairs of nucleotides or predictive motifs across large sets of input DNA sequences. Aggregate summary statistics of the pairwise Feature Interaction Score across multiple sequences provide insights into common, shared patterns of feature interactions.

Figure 1: Deep Feature Interaction Maps: DFIM, illustrated in 6 steps, quantifies the maximal Feature Interaction Score (FIS) of every position in a sequence with all other positions. Steps shown in the figure for the example one-hot encoded sequence AACGTCTG:
1. Compute importance scores for each nucleotide at each position in a sequence.
2. Mutate source feature (nucleotide C to A at position 6).
3. Compute importance scores of target features (nucleotides at all other positions).
4. Compute change in importance scores (FIS) between original sequence and mutated sequence.
5. Quantify maximal FIS induced by any mutation [A, G, T] at source feature.
6. Combine interactions between all source-target feature pairs in Deep Feature Interaction Map.

[...] importance scores for the observed nucleotides b at each position p, i.e. $C_{X_0} = w_0[b; p]\,X_0[b; p]$. Only the observed nucleotides at each position can have non-zero values. DeepLIFT contribution scores quantify the sensitivity of the output to finite changes in the input [4]. This is in contrast to gradients, which measure the sensitivity of the output to infinitesimal changes in the input. Specifically, the DeepLIFT algorithm backpropagates a score (analogous to gradients) which is based
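To make the ISM and FIS definitions from the Introduction concrete, the following is a minimal sketch, not the authors' implementation. It assumes a hypothetical toy model (the product of two linear detectors, so that the two "motifs" interact), uses gradient times input as a stand-in for DeepLIFT importance scores, and takes the absolute change as the measure of "change"; the function names `make_toy_model`, `importance`, `ism` and `dfim` are illustrative only.

```python
import numpy as np

BASES = "ACGT"


def one_hot(seq):
    """Encode a DNA string as a 4 x L one-hot matrix (rows ordered as BASES)."""
    x = np.zeros((4, len(seq)))
    for pos, base in enumerate(seq):
        x[BASES.index(base), pos] = 1.0
    return x


def make_toy_model(length):
    """Hypothetical model (an assumption): output = a * b for two linear detectors.

    Detector a rewards C, G at 0-based positions 2, 3; detector b rewards C, T at 5, 6.
    """
    w1 = np.zeros((4, length))
    w2 = np.zeros((4, length))
    w1[BASES.index("C"), 2] = w1[BASES.index("G"), 3] = 1.0
    w2[BASES.index("C"), 5] = w2[BASES.index("T"), 6] = 1.0
    return w1, w2


def predict(x, w1, w2):
    """Forward pass of the toy model."""
    return np.sum(w1 * x) * np.sum(w2 * x)


def importance(x, w1, w2):
    """Gradient x input per position (stand-in for DeepLIFT, an assumption)."""
    a, b = np.sum(w1 * x), np.sum(w2 * x)
    grad = w1 * b + w2 * a  # analytic gradient of a * b with respect to x
    return np.sum(grad * x, axis=0)


def mutants(seq, pos):
    """Yield the three sequences that differ from seq only at pos."""
    for alt in BASES:
        if alt != seq[pos]:
            yield seq[:pos] + alt + seq[pos + 1:]


def ism(seq, w1, w2):
    """ISM: maximal absolute change in the output over the alternative bases."""
    ref = predict(one_hot(seq), w1, w2)
    return np.array([
        max(abs(ref - predict(one_hot(m), w1, w2)) for m in mutants(seq, pos))
        for pos in range(len(seq))
    ])


def dfim(seq, w1, w2):
    """FIS map: fis[s, t] is the maximal absolute change in the importance of
    target t over all mutations of source s (diagonal set to zero)."""
    base = importance(one_hot(seq), w1, w2)
    n = len(seq)
    fis = np.zeros((n, n))
    for source in range(n):
        for m in mutants(seq, source):
            delta = np.abs(base - importance(one_hot(m), w1, w2))
            fis[source] = np.maximum(fis[source], delta)
        fis[source, source] = 0.0
    return fis
```

With the Figure 1 sequence, `dfim("AACGTCTG", *make_toy_model(8))[5]` (the source at position 6 in the figure's 1-based numbering) is nonzero only at 0-based targets 2 and 3, the positions read by the other detector. Position 6, which shares a detector with the source, contributes additively in this toy model and therefore shows no interaction. Each source mutation costs one importance computation covering all targets at once, whereas scoring the same pairs with ISM alone would need a separate forward pass for every pair of mutations.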
