arXiv:2105.12807v2 [q-bio.GN] 4 Jul 2021

arXiv, 2021, pp. 1–12
Preprint
PROBLEM SOLVING PROTOCOL

XOmiVAE: an interpretable deep learning model for cancer classification using high-dimensional omics data

Eloise Withnell,1,2,† Xiaoyu Zhang,1,∗,† Kai Sun1 and Yike Guo1,3

1 Data Science Institute, Imperial College London, SW7 2AZ, London, UK, 2 Department of Health Informatics, University College London, WC1E 6BT, London, UK and 3 Department of Computer Science, Hong Kong Baptist University, Hong Kong, China
∗ Corresponding author. [email protected]
† The first two authors contributed equally to this paper.

Abstract
The lack of explainability is one of the most prominent disadvantages of deep learning applications in omics. This “black box” problem can undermine the credibility and limit the practical implementation of biomedical deep learning models. Here we present XOmiVAE, a variational autoencoder (VAE) based interpretable deep learning model for cancer classification using high-dimensional omics data. XOmiVAE is capable of revealing the contribution of each gene and latent dimension for each classification prediction, and the correlation between each gene and each latent dimension. It is also demonstrated that XOmiVAE can explain not only the supervised classification but the unsupervised clustering results from the deep learning network. To the best of our knowledge, XOmiVAE is one of the first activation level-based interpretable deep learning models explaining novel clusters generated by VAE. The explainable results generated by XOmiVAE were validated by both the performance of downstream tasks and the biomedical knowledge. In our experiments, XOmiVAE explanations of deep learning based cancer classification and clustering aligned with current domain knowledge including biological annotation and academic literature, which shows great potential for novel biomedical knowledge discovery from deep learning models.

Key words: explainable artificial intelligence, deep learning, cancer classification, omics data, gene expression

Introduction
High-dimensional omics data (e.g., gene expression and DNA methylation) comprises up to hundreds of thousands of molecular features (e.g., gene and CpG site) for each sample. As the number of features is normally considerably larger than the number of samples for genome-wide omics datasets, the analysis of omics data often leads to overfitting and suffers from “the curse of dimensionality”, which impedes wider application. Therefore, performing feature selection and dimensionality reduction prior to the downstream analysis has become a common practice in omics data modelling and analysis [Meng et al., 2016]. Standard dimensionality reduction methods like Principal Component Analysis (PCA) [Ringnér, 2008] learn a linear transformation of the high-dimensional data, which struggles with the complicated non-linear patterns that are intractable to capture from omics data. Other non-linear methods such as t-distributed Stochastic Neighbour Embedding (t-SNE) [van der Maaten and Hinton, 2008] and Uniform Manifold Approximation and Projection (UMAP) [McInnes et al., 2020] have become increasingly popular, but still have limitations in terms of scalability.

Deep learning has proven to be a powerful methodology for capturing non-linear patterns from high-dimensional data [LeCun et al., 2015]. Variational Autoencoder (VAE) [Kingma and Welling, 2014] is one of the emerging deep learning methods that have shown promise in embedding omics data to lower dimensional latent space. With a classification downstream network, the VAE-based model is able to classify tumour samples and outperform other machine learning and deep learning methods [Zhang et al., 2019, Azarkhalili et al., 2019, Hira et al., 2021, Zhang et al., 2021]. Among them, OmiVAE [Zhang et al., 2019] is one of the first VAE-based multi-omics deep learning models for dimensionality reduction and tumour type classification. An accuracy of 97.49% was achieved for the classification of 33 pan-cancer tumour types and the normal control using gene expression and DNA methylation profiles from the Genomic Data Commons (GDC) dataset [Grossman et al., 2016]. Similar
to OmiVAE, DeePathology [Azarkhalili et al., 2019] applied two types of deep autoencoders, Contractive Autoencoder (CAE) and Variational Autoencoder (VAE), with only the gene expression data from the GDC dataset, and reached an accuracy of 95.2% for the same tumour type classification task. Hira et al. [2021] adopted the architecture of OmiVAE with Maximum Mean Discrepancy VAE (MMD-VAE), and classified the molecular subtypes of ovarian cancer with an accuracy of 93.2-95.5%. Zhang et al. [2021] synthesised previous models and developed a unified multi-task multi-omics deep learning framework named OmiEmbed, which supported dimensionality reduction, multi-omics integration, tumour type classification, phenotypic feature reconstruction and survival prediction. Despite the breakthrough of the aforementioned work, a key limitation is prevalent among deep learning based omics analysis methods. Most of these models are “black boxes” that lack explainability, as the contribution of each input feature and latent dimension towards the downstream prediction is obscured.

Various strategies have been proposed for interpreting deep learning models. Among them, the probing strategy, which inspects the structure and parameters learnt by a trained model, has been shown to be the most promising [Azodi et al., 2020]. Probing strategies generally fall into one of three categories: connection weight-based, gradient-based and activation level-based approaches [Montavon et al., 2018]. The connection weight-based approach sums the learnt weights between each input dimension and the output layer to quantify the contribution score of each feature [Garson, 1991, Olden and Jackson, 2002]. Way and Greene [2018] and Bica et al. [2020] adopted this probing strategy to explain the latent space of VAE on gene expression data. However, the connection weight-based approach can be limited or even misleading when positive and negative weights offset each other, when features do not have the same scale, or when neurons with large weights are not activated [Shrikumar et al., 2017]. In the gradient-based approach, contribution scores (or saliency) are measured by calculating the gradient when the input is perturbed [Simonyan et al., 2014]. Dincer et al. [2018] applied a gradient-based approach, Integrated Gradients [Sundararajan et al., 2017], to explain a VAE model for gene expression profiles. This approach overcomes limitations of the connection weight-based method. Despite this, it is inaccurate when small changes of the input do not affect the output [Shrikumar et al., 2017]. The activation level-based approach overcomes these drawbacks by comparing the feature activation level of an instance of interest and a reference instance [Azodi et al., 2020]. An activation level-based method named Layer-wise Relevance Propagation (LRP) has been used to explain a deep neural network for gene expression [Hanczar et al., 2020]. Nevertheless, LRP can produce incorrect results with model saturation [Shrikumar et al., 2017]. Deep SHAP [Lundberg and Lee, 2017], which applies the key principles from DeepLIFT [Shrikumar et al., 2017], has been used in a variety of biological applications [Tasaki et al., 2020, Lemsara et al., 2020]. However, there is a lack of research on the application of Deep SHAP to interpret the latent space of VAE models and the VAE-based cancer classification using gene expression profiles.

Here we proposed explainable OmiVAE (XOmiVAE), a VAE-based explainable deep learning omics data analysis model for low dimensional latent space extraction and cancer classification. XOmiVAE took advantage of Deep SHAP [Lundberg and Lee, 2017] to provide the contribution score of each input molecular feature and omics latent dimension for the cancer classification prediction. Deep SHAP was selected as the interpretation approach of XOmiVAE due to its ability to provide more accurate explanations than other methods, which likely provides a better signal-to-noise ratio in the top genes selected. With XOmiVAE, we are able to reveal the contribution of each gene towards the prediction of each tumour type using gene expression profiles. XOmiVAE can also explain unsupervised tumour type clusters produced by the VAE embedding part of the deep neural network. Additionally, we raised crucial issues to consider when interpreting deep learning models for tumour classification using the probing strategy. For instance, we demonstrate the importance of choosing reference samples that make biological sense and the limitations of the connection weight-based approach to explain latent dimensions of VAE. The results generated by XOmiVAE were fully validated by both biomedical knowledge and the performance of downstream tasks for each tumour type. XOmiVAE explanations of deep learning based cancer classification and clustering aligned with current domain knowledge including biological annotation and literature, which shows great potential for novel biomedical knowledge discovery from deep learning models.

Methods

Datasets and Pre-processing
The Cancer Genome Atlas Program (TCGA) [Network et al., 2013] pan-cancer dataset, which comprises gene expression profiles of 33 tumour types, was used in the experiment as an example to demonstrate the explainability of XOmiVAE. A total of 9,081 samples from TCGA were selected for training and testing our proposed model, of which 407 were normal tissue samples. The TCGA dataset was downloaded from the UCSC Xena data portal [Goldman et al., 2020] (https://xenabrowser.net/datapages/, accessed on 1 May 2019). We followed the same omics data pre-processing steps as OmiVAE [Zhang et al., 2019] and OmiEmbed [Zhang et al., 2021]. Genes targeting the Y chromosome, genes with zero expression level in all samples, and genes with missing values (N/A) in more than 10% of the samples were removed to ensure the gene expression data was fair and clean across samples. Furthermore, the remaining N/A values that did not reach the 10% threshold were replaced by the expression level of corresponding genes. The Fragments Per Kilobase of transcript per Million mapped reads (FPKM) values were normalised to the unit interval of 0 to 1 to meet the input requirement of the network. The phenotype data of each sample was also downloaded from UCSC Xena, and comprises the age and gender of the subjects and the primary site and disease stage of the samples. The detailed cancer subtype information of each tumour sample was obtained from Sanchez-Vega et al. [2018].

Explainable OmiVAE (XOmiVAE)
Based on vanilla OmiVAE, we proposed an interpretable deep learning model for cancer classification using high-dimensional omics data, named Explainable OmiVAE, aka XOmiVAE. The overall architecture of XOmiVAE is illustrated in Figure 1. The input omics data, which was genome-wide gene expression profiles here, was first passed through a VAE embedding network to reduce the dimensionality of the input data from 58,043 to 128. The encoder of the embedding network contained
[Figure 1. Panel A: Supervised. Panel B: Unsupervised. Each panel shows an Encoder (GPU: 0) and a Decoder (GPU: 1) built from FC blocks (fully connected layer, batchnorm, activation). Example output tables in the figure list gene contributions (LYVE1 0.086, LPL 0.074, MMP11 0.062; SEMG1 0.009, CD171 0.000, PCA3 0.032) and latent dimension contributions (#45 1.892, #125 1.586, #50 1.220), labelled "for BRCA" and "for dim#42".]
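The gene filtering and normalisation steps described under Datasets and Pre-processing can be sketched in a few lines of plain Python. This is only an illustrative sketch under stated assumptions, not the authors' pipeline: the function name `preprocess`, the dictionary layout, the toy gene names, the use of the per-gene mean to fill remaining N/A values, and per-gene min-max scaling are all assumptions made here, since the text does not specify them.

```python
from statistics import mean


def preprocess(expr, y_genes, max_missing=0.10):
    """Filter and scale a toy expression table.

    expr maps gene name -> list of FPKM values (one per sample),
    with None marking a missing (N/A) value. Hypothetical layout.
    """
    cleaned = {}
    for gene, values in expr.items():
        if gene in y_genes:
            continue  # remove genes targeting the Y chromosome
        observed = [v for v in values if v is not None]
        n_missing = len(values) - len(observed)
        if n_missing > max_missing * len(values):
            continue  # remove genes with N/A in more than 10% of samples
        if not observed or all(v == 0 for v in observed):
            continue  # remove genes with zero expression in all samples
        # ASSUMPTION: remaining N/A values are filled with the gene's mean.
        fill = mean(observed)
        cleaned[gene] = [fill if v is None else v for v in values]

    scaled = {}
    for gene, values in cleaned.items():
        # ASSUMPTION: min-max scaling to [0, 1] is applied per gene.
        lo, hi = min(values), max(values)
        span = hi - lo
        scaled[gene] = [(v - lo) / span if span else 0.0 for v in values]
    return scaled


# Toy data with 10 samples per gene (gene names are made up).
toy = {
    "GENE_A": [0.0, 2.0, 4.0, None, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0],  # 1/10 missing: kept
    "GENE_B": [None, None, 1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 1.0, 2.0],  # 2/10 missing: removed
    "GENE_Y": [1.0] * 10,  # listed as a Y chromosome gene: removed
    "GENE_Z": [0.0] * 10,  # zero in all samples: removed
}
result = preprocess(toy, y_genes={"GENE_Y"})
print(sorted(result))
print(result["GENE_A"][:4])
```

In this toy run only `GENE_A` survives the three filters, and its single missing value is filled before scaling. A real pipeline would operate on the full expression matrix and may scale globally rather than per gene.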