
Deep-Exposome: A Predictive and Interpretative Deep Neural Network Ensemble for Exposome Data

Fei Zou

Department of Biostatistics, University of North Carolina at Chapel Hill

Exposome Data Challenge Event, April 2021

Fei Zou, Deep Learning for Exposome Data, 1/15

Outline
- Deep Learning Predictive Model
- PermFIT: Permutation-based Feature Importance Test
- Analysis Results on Exposome Data
- Closing Remarks

Deep Neural Network (DNN)

Deep neural networks (DNNs) are among the most popular machine learning tools and are frequently applied in biomedical research.
Advantage: highly flexible in approximating any complex function, by the universal approximation theorem: a feedforward DNN with finitely many hidden units and at least one hidden layer can approximate any continuous function on a closed and bounded subset of \mathbb{R}^q (Hornik et al., 1989; Cybenko, 1989).

Y = f(X) + \text{error}
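The regression view Y = f(X) + error can be sketched with a minimal one-hidden-layer feedforward network. This is not the author's code; the target function, architecture, and training settings are illustrative stand-ins, assuming only numpy:

```python
import numpy as np

# Minimal sketch of the universal approximation idea: a one-hidden-layer
# feedforward network fit to Y = f(X) + error by plain gradient descent.
rng = np.random.default_rng(0)
X = rng.uniform(-2.0, 2.0, size=(200, 1))
y = np.sin(2.0 * X) + 0.1 * rng.standard_normal((200, 1))  # Y = f(X) + error

H = 32  # hidden units
W1, b1 = 0.5 * rng.standard_normal((1, H)), np.zeros(H)
W2, b2 = 0.5 * rng.standard_normal((H, 1)), np.zeros(1)

def predict(X):
    h = np.tanh(X @ W1 + b1)        # hidden-layer activations
    return h, h @ W2 + b2           # activations and network output

mse_before = np.mean((predict(X)[1] - y) ** 2)
lr = 0.1
for _ in range(3000):               # full-batch gradient descent
    h, out = predict(X)
    err = (out - y) / len(X)        # gradient of half the mean squared error
    dh = (err @ W2.T) * (1.0 - h ** 2)   # backprop through tanh
    W2 -= lr * (h.T @ err); b2 -= lr * err.sum(0)
    W1 -= lr * (X.T @ dh);  b1 -= lr * dh.sum(0)

mse_after = np.mean((predict(X)[1] - y) ** 2)
```

With enough hidden units, the same recipe approximates any continuous target on a bounded domain; here the fit visibly reduces the training MSE.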

Ensemble DNN: Improved Bootstrap Aggregating

Challenges:

- In DNN models, the total number of parameters, N_\theta, is often substantially larger than the total sample size, N, leading to unstable DNNs.
- Bagging has been used successfully to improve unstable procedures such as classification and regression trees and neural networks (Breiman, 1996, 2001).
- For neural network ensembles, it has been argued that "many could be better than all" (Zhou et al., 2002), motivating a novel filtering approach that removes low-performing DNNs from the final ensembled model.
- Feature importance assessment is difficult given the black-box nature of DNN models.


The filtering procedure is based on the performance score: for a continuous outcome,

v_k = \frac{1}{|D_{O_k}|} \sum_{i \in D_{O_k}} \left[ (r_i - \bar{r}_{O_k})^2 - (r_i - \hat{r}_{ik})^2 \right],

and for a binary outcome,

v_k = \frac{1}{|D_{O_k}|} \sum_{i \in D_{O_k}} \left[ r_i \log\frac{\hat{r}_{ik}}{\bar{r}_{O_k}} + (1 - r_i) \log\frac{1 - \hat{r}_{ik}}{1 - \bar{r}_{O_k}} \right],

where D_{O_k} is the set of out-of-bag (OOB) samples for the k-th bootstrap model, \bar{r}_{O_k} = \sum_{i \in D_{O_k}} r_i / |D_{O_k}|, and \hat{r}_{ik} = \hat{f}_k(x_i) for i \in D_{O_k}. The filtered ensemble is

\hat{f}(X) = \frac{1}{\sum_{k=1}^{K} I(v_k > v)} \sum_{k=1}^{K} \hat{f}_k(X) \, I(v_k > v),

given a cutoff v. Details can be found in Mi et al. (2019); software is available on GitHub: https://github.com/SkadiEye/deepTL
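The bagging-with-filtering scheme can be sketched as follows for a continuous outcome. This is an illustrative sketch, not the deepTL implementation: a least-squares fit stands in for each DNN f_k, and the cutoff v = 0 is an assumption made for the example:

```python
import numpy as np

# Sketch: each bootstrap model k is scored on its out-of-bag (OOB) samples by
#   v_k = mean_i[(r_i - rbar_Ok)^2 - (r_i - rhat_ik)^2],
# and only models with v_k above the cutoff enter the final averaged ensemble.
rng = np.random.default_rng(1)
N, p, K = 120, 5, 20
X = rng.standard_normal((N, p))
beta = rng.standard_normal(p)
y = X @ beta + 0.5 * rng.standard_normal(N)

models, scores = [], []
for k in range(K):
    boot = rng.integers(0, N, N)               # bootstrap sample indices
    oob = np.setdiff1d(np.arange(N), boot)     # OOB set D_Ok
    coef, *_ = np.linalg.lstsq(X[boot], y[boot], rcond=None)
    r, rhat = y[oob], X[oob] @ coef
    rbar = r.mean()                            # OOB outcome mean, rbar_Ok
    v_k = np.mean((r - rbar) ** 2 - (r - rhat) ** 2)
    models.append(coef); scores.append(v_k)

keep = [m for m, v in zip(models, scores) if v > 0.0]   # filter by cutoff v = 0
f_hat = lambda Xnew: np.mean([Xnew @ m for m in keep], axis=0)
```

A model with v_k ≤ 0 predicts its OOB samples no better than their mean, so dropping it can only help the average.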

PermFIT: Permutation-based Feature Importance Test

Feature importance testing serves two goals: biomarker identification and prediction improvement. Existing methods include:
- Local Interpretable Model-Agnostic Explanations (LIME) (Craven and Shavlik, 1996; Ribeiro et al., 2016)
- Shapley value-based methods (SHAP) (Shapley, 1953; Lundberg and Lee, 2017)
- conditional randomization tests (CRTs) (Candes et al., 2018)
- the model-X knockoff (Candes et al., 2018; Lu et al., 2018)
- the holdout randomization test (HRT) (Tansey et al., 2018)
- Gaussian mirrors (Xing et al., 2019, 2020; Dai et al., 2020)
PermFIT empirically evaluates the importance score of the j-th feature, defined below.

M_j = \mathbb{E}_{X, X_j'} \left[ Y - f\left(X^{(j)}\right) \right]^2 - \mathbb{E}_X \left[ Y - f(X) \right]^2    (1)

estimated via permutation, where X^{(j)} = (X_1, \ldots, X_{j-1}, X_j', X_{j+1}, \ldots, X_p) and X_j' is a random vector whose elements are independently drawn from the distribution of X_j.
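The permutation estimate of M_j in Eq. (1) can be sketched directly. This is an illustrative sketch, not the PermFIT package: a known linear f (an assumption of the example) stands in for a fitted black-box model, and no refitting is needed:

```python
import numpy as np

# Sketch of the PermFIT importance score M_j: permute feature j, compare
# squared prediction errors before and after, with the model f held fixed.
rng = np.random.default_rng(2)
N, p = 500, 4
X = rng.standard_normal((N, p))
f = lambda X: 2.0 * X[:, 0] + 0.0 * X[:, 1]   # only feature 0 matters
y = f(X) + 0.3 * rng.standard_normal(N)

def importance(f, X, y, j):
    Xp = X.copy()
    Xp[:, j] = rng.permutation(X[:, j])       # X_j' drawn independently of the rest
    return np.mean((y - f(Xp)) ** 2) - np.mean((y - f(X)) ** 2)

M = [importance(f, X, y, j) for j in range(p)]
```

Permuting an informative feature inflates the prediction error (large M_j), while permuting an irrelevant one leaves it essentially unchanged (M_j near zero); PermFIT additionally attaches a formal test to these scores.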

PermFIT

- Statistically valid, via cross-fitting and bootstrap aggregating
- Computationally efficient: empirical permutation avoids the need for model refitting
- Universally applicable to various types of "black-box" models: random forests, DNNs, SVMs, etc.
- Robust in the presence of highly correlated features, with improved feature ranking
- Software and the associated paper (Mi et al., 2021) are available at https://github.com/SkadiEye/deepTL

Analysis Results

Exposome data:
- 1301 samples in total
- Outcomes: birth weight, BMI z-score, IQ score, behavior score, and asthma (binary)
- Predictors: 222 exposome variables plus demographic variables for all outcomes, except birth weight, for which only the 88 prenatal exposome variables are used
- Models compared: DNN, Lasso, SVM, and RF
- Cross-validated accuracy reported, using either all available features or the top 50 features from PermFIT
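The evaluation protocol above can be sketched as k-fold cross-validated "scaled MSE" (MSE divided by the outcome variance, so 1.0 means no better than predicting the mean). The fold count, the simulated data, and the least-squares stand-in model are assumptions of this sketch:

```python
import numpy as np

# Sketch of cross-validated scaled MSE, the accuracy measure reported for
# the continuous outcomes (lower is better; 1.0 = mean-only prediction).
rng = np.random.default_rng(3)
N, p = 200, 10
X = rng.standard_normal((N, p))
y = X[:, 0] - X[:, 1] + rng.standard_normal(N)

def cv_scaled_mse(X, y, folds=5):
    idx = rng.permutation(len(y))
    errs = []
    for k in range(folds):
        test = idx[k::folds]                   # every folds-th shuffled index
        train = np.setdiff1d(idx, test)
        coef, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
        errs.append(np.mean((y[test] - X[test] @ coef) ** 2))
    return np.mean(errs) / np.var(y)           # scale by outcome variance

score = cv_scaled_mse(X, y)
```

The same loop applies to any of the compared learners; for the binary asthma outcome, ROC-AUC and PR-AUC replace the scaled MSE.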


Table 1: Prediction Accuracy [1]

Outcome           DNN All   DNN Top50   RF All   RF Top50   SVM All   SVM Top50   Lasso All
IQ                0.437     0.449       0.443    0.445      0.438     0.435       0.445
BW [2]            0.645     0.644       0.659    0.664      0.674     0.666       0.659
BMI               0.744     0.736       0.758    0.750      0.736     0.710       0.781
Behavior [3]      0.838     0.872       0.866    0.911      0.862     0.882       0.890
Asthma, ROC-AUC   0.604     0.619       0.589    0.550      0.436     0.424       0.433
Asthma, PR-AUC    0.158     0.161       0.136    0.119      0.087     0.090       0.086

[1] ROC-AUC and PR-AUC are reported for the binary asthma outcome; scaled MSE is reported for all other outcomes.
[2] BW: birth weight
[3] Behavior: behavior score


Table 2: Top 5 features from DNN models

Asthma                                    BMI
Cadmium (Cd) in mother                    Maternal pre-pregnancy BMI
Sex                                       Child birth weight
Mercury (Hg) in mother                    Dichlorodiphenyldichloroethylene (DDE) in child
Postnatal fruit consumption               Polychlorinated biphenyl-170 (PCB-170) in child
Dimethyl thiophosphate (DMTP) in child    Maternal age

All variables above have been reported in the literature.

Remarks

- Ensemble DNNs are promising ML algorithms for exposome research.
- PermFIT is a useful tool for assessing feature importance.
- Few significant features were identified; possible explanations include poly-risk factors with mixture effects (each with a small effect but jointly important) and missing important biomarkers/features.
- Integrated analysis with other types of data, such as metabolomics and transcriptomics data, is promising: for BMI, our preliminary analysis suggests that serum metabolomics data substantially improve the performance of the DNN models.

References

Breiman, L. (1996). Bagging predictors. Machine Learning, 24(2):123–140.
Breiman, L. (2001). Random forests. Machine Learning, 45(1):5–32.
Candes, E., Fan, Y., Janson, L., and Lv, J. (2018). Panning for gold: 'model-X' knockoffs for high dimensional controlled variable selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 80(3):551–577.
Craven, M. and Shavlik, J. W. (1996). Extracting tree-structured representations of trained networks. In Advances in Neural Information Processing Systems, pages 24–30.
Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals, and Systems (MCSS), 2(4):303–314.
Dai, C., Lin, B., Xing, X., and Liu, J. S. (2020). False discovery rate control via data splitting. arXiv preprint arXiv:2002.08542.


Hornik, K., Stinchcombe, M., and White, H. (1989). Multilayer feedforward networks are universal approximators. Neural Networks, 2(5):359–366.
Lu, Y., Fan, Y., Lv, J., and Noble, W. S. (2018). DeepPINK: reproducible feature selection in deep neural networks. In Advances in Neural Information Processing Systems, pages 8676–8686.
Lundberg, S. M. and Lee, S.-I. (2017). A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems, pages 4765–4774.
Mi, X., Zou, F., and Zhu, R. (2019). Bagging and deep learning in optimal individualized treatment rules. Biometrics, 75(2):674–684.
Mi, X., Zou, F., Zou, B., and Hu, J. (2021). Permutation-based identification of important biomarkers for complex diseases via black-box models. Nature Communications (in press).


Ribeiro, M. T., Singh, S., and Guestrin, C. (2016). "Why should I trust you?": Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1135–1144. ACM.
Shapley, L. (1953). A value for n-person games. In Kuhn, H. W., editor, Contributions to the Theory of Games.
Tansey, W., Veitch, V., Zhang, H., Rabadan, R., and Blei, D. M. (2018). The holdout randomization test: Principled and easy black box feature selection. arXiv preprint arXiv:1811.00645.
Xing, X., Gui, Y., Dai, C., and Liu, J. S. (2020). Neural Gaussian mirror for controlled feature selection in neural networks. arXiv preprint arXiv:2010.06175.
Xing, X., Zhao, Z., and Liu, J. S. (2019). Controlling false discovery rate using Gaussian mirrors. arXiv preprint arXiv:1911.09761.
Zhou, Z.-H., Wu, J., and Tang, W. (2002). Ensembling neural networks: many could be better than all. Artificial Intelligence, 137(1-2):239–263.

Acknowledgement

Collaborative Work:
- Xinlei Mi, Northwestern Univ.
- Baiming Zou, UNC
- Julia Rager, UNC
- Mike O'Shea, UNC
- Rebecca Fry, UNC

Funding Support:
- UNC-SRP (Superfund Research Program): P42ES031007
- UNC Center for Environmental Health & Susceptibility: P30ES010126
