
Deep-Exposome: A Predictive and Interpretative Deep Neural Network Ensemble for Exposome Data

Fei Zou

Department of Biostatistics, University of North Carolina at Chapel Hill

Exposome Data Challenge Event, April 2021

Fei Zou, Deep Learning for Exposome Data, 1/15

Outline
- Deep Learning Predictive Model
- PermFIT: Permutation-based Feature Importance Test
- Analysis Results on Exposome Data
- Closing Remarks

Deep Neural Network (DNN)

Deep neural networks (DNNs) are among the most popular machine learning tools and are frequently applied in biomedical research.
Advantage: highly flexible in approximating any complex function, by the universal approximation theorem: a feedforward DNN with finitely many hidden units and at least one hidden layer can approximate any continuous function on a closed and bounded subset of \mathbb{R}^q (Hornik et al., 1989; Cybenko, 1989).

Y = f(X) + \text{error}
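The regression view Y = f(X) + error can be sketched with a minimal one-hidden-layer feedforward network. This is not the author's code; the target function, architecture, and training settings are illustrative stand-ins, assuming only numpy:

```python
import numpy as np

# Minimal sketch of the universal approximation idea: a one-hidden-layer
# feedforward network fit to Y = f(X) + error by plain gradient descent.
rng = np.random.default_rng(0)
X = rng.uniform(-2.0, 2.0, size=(200, 1))
y = np.sin(2.0 * X) + 0.1 * rng.standard_normal((200, 1))  # Y = f(X) + error

H = 32  # hidden units
W1, b1 = 0.5 * rng.standard_normal((1, H)), np.zeros(H)
W2, b2 = 0.5 * rng.standard_normal((H, 1)), np.zeros(1)

def predict(X):
    h = np.tanh(X @ W1 + b1)        # hidden-layer activations
    return h, h @ W2 + b2           # activations and network output

mse_before = np.mean((predict(X)[1] - y) ** 2)
lr = 0.1
for _ in range(3000):               # full-batch gradient descent
    h, out = predict(X)
    err = (out - y) / len(X)        # gradient of half the mean squared error
    dh = (err @ W2.T) * (1.0 - h ** 2)   # backprop through tanh
    W2 -= lr * (h.T @ err); b2 -= lr * err.sum(0)
    W1 -= lr * (X.T @ dh);  b1 -= lr * dh.sum(0)

mse_after = np.mean((predict(X)[1] - y) ** 2)
```

With enough hidden units, the same recipe approximates any continuous target on a bounded domain; here the fit visibly reduces the training MSE.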

Ensemble DNN: Improved Bootstrap Aggregating

Challenges:

- In DNN models, the total number of parameters, N_\theta, is often substantially larger than the total sample size, N, leading to unstable DNNs.
- Bagging has been used successfully to improve unstable procedures such as classification and regression trees and neural networks (Breiman, 1996, 2001).
- For neural network ensembles, it has been argued that "many could be better than all" (Zhou et al., 2002), motivating a novel filtering approach that removes low-performing DNNs from the final ensembled model.
- Feature importance assessment is difficult given the black-box nature of DNN models.


The filtering procedure is based on the performance score: for a continuous outcome,

v_k = \frac{1}{|D_{O_k}|} \sum_{i \in D_{O_k}} \left[ (r_i - \bar{r}_{O_k})^2 - (r_i - \hat{r}_{ik})^2 \right],

and for a binary outcome,

v_k = \frac{1}{|D_{O_k}|} \sum_{i \in D_{O_k}} \left[ r_i \log\frac{\hat{r}_{ik}}{\bar{r}_{O_k}} + (1 - r_i) \log\frac{1 - \hat{r}_{ik}}{1 - \bar{r}_{O_k}} \right],

where D_{O_k} is the set of out-of-bag (OOB) samples for the k-th bootstrap model, \bar{r}_{O_k} = \sum_{i \in D_{O_k}} r_i / |D_{O_k}|, and \hat{r}_{ik} = \hat{f}_k(x_i) for i \in D_{O_k}. The filtered ensemble is

\hat{f}(X) = \frac{1}{\sum_{k=1}^{K} I(v_k > v)} \sum_{k=1}^{K} \hat{f}_k(X) \, I(v_k > v),

given a cutoff v. Details can be found in Mi et al. (2019); software is available on GitHub: https://github.com/SkadiEye/deepTL
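The bagging-with-filtering scheme can be sketched as follows for a continuous outcome. This is an illustrative sketch, not the deepTL implementation: a least-squares fit stands in for each DNN f_k, and the cutoff v = 0 is an assumption made for the example:

```python
import numpy as np

# Sketch: each bootstrap model k is scored on its out-of-bag (OOB) samples by
#   v_k = mean_i[(r_i - rbar_Ok)^2 - (r_i - rhat_ik)^2],
# and only models with v_k above the cutoff enter the final averaged ensemble.
rng = np.random.default_rng(1)
N, p, K = 120, 5, 20
X = rng.standard_normal((N, p))
beta = rng.standard_normal(p)
y = X @ beta + 0.5 * rng.standard_normal(N)

models, scores = [], []
for k in range(K):
    boot = rng.integers(0, N, N)               # bootstrap sample indices
    oob = np.setdiff1d(np.arange(N), boot)     # OOB set D_Ok
    coef, *_ = np.linalg.lstsq(X[boot], y[boot], rcond=None)
    r, rhat = y[oob], X[oob] @ coef
    rbar = r.mean()                            # OOB outcome mean, rbar_Ok
    v_k = np.mean((r - rbar) ** 2 - (r - rhat) ** 2)
    models.append(coef); scores.append(v_k)

keep = [m for m, v in zip(models, scores) if v > 0.0]   # filter by cutoff v = 0
f_hat = lambda Xnew: np.mean([Xnew @ m for m in keep], axis=0)
```

A model with v_k ≤ 0 predicts its OOB samples no better than their mean, so dropping it can only help the average.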

PermFIT: Permutation-based Feature Importance Test

Feature importance testing serves two goals: biomarker identification and prediction improvement. Existing methods include:
- Local Interpretable Model-Agnostic Explanations (LIME) (Craven and Shavlik, 1996; Ribeiro et al., 2016)
- Shapley value-based methods (SHAP) (Shapley, 1953; Lundberg and Lee, 2017)
- conditional randomization tests (CRTs) (Candes et al., 2018)
- the model-X knockoff (Candes et al., 2018; Lu et al., 2018)
- the holdout randomization test (HRT) (Tansey et al., 2018)
- Gaussian mirrors (Xing et al., 2019, 2020; Dai et al., 2020)
PermFIT empirically evaluates the importance score of the j-th feature, defined below.

M_j = \mathbb{E}_{X, X_j'} \left[ Y - f\left(X^{(j)}\right) \right]^2 - \mathbb{E}_X \left[ Y - f(X) \right]^2    (1)

estimated via permutation, where X^{(j)} = (X_1, \ldots, X_{j-1}, X_j', X_{j+1}, \ldots, X_p) and X_j' is a random vector whose elements are independently drawn from the distribution of X_j.
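The permutation estimate of M_j in Eq. (1) can be sketched directly. This is an illustrative sketch, not the PermFIT package: a known linear f (an assumption of the example) stands in for a fitted black-box model, and no refitting is needed:

```python
import numpy as np

# Sketch of the PermFIT importance score M_j: permute feature j, compare
# squared prediction errors before and after, with the model f held fixed.
rng = np.random.default_rng(2)
N, p = 500, 4
X = rng.standard_normal((N, p))
f = lambda X: 2.0 * X[:, 0] + 0.0 * X[:, 1]   # only feature 0 matters
y = f(X) + 0.3 * rng.standard_normal(N)

def importance(f, X, y, j):
    Xp = X.copy()
    Xp[:, j] = rng.permutation(X[:, j])       # X_j' drawn independently of the rest
    return np.mean((y - f(Xp)) ** 2) - np.mean((y - f(X)) ** 2)

M = [importance(f, X, y, j) for j in range(p)]
```

Permuting an informative feature inflates the prediction error (large M_j), while permuting an irrelevant one leaves it essentially unchanged (M_j near zero); PermFIT additionally attaches a formal test to these scores.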

PermFIT

- Statistically valid, via cross-fitting and bootstrap aggregating
- Computationally efficient: empirical permutation avoids the need for model refitting
- Universally applicable to various types of "black-box" models: random forests, DNNs, SVMs, etc.
- Robust in the presence of highly correlated features, with improved feature ranking
- Software and the associated paper (Mi et al., 2021) are available at https://github.com/SkadiEye/deepTL

Analysis Results

Exposome data:
- 1301 samples in total
- Outcomes: birth weight, BMI z-score, IQ score, behavior score, and asthma (binary)
- Predictors: 222 exposome variables plus demographic variables for all outcomes, except birth weight, for which only the 88 prenatal exposome variables are used
- Models compared: DNN, Lasso, SVM, and RF
- Cross-validated accuracy reported, using either all available features or the top 50 features from PermFIT
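The evaluation protocol above can be sketched as k-fold cross-validated "scaled MSE" (MSE divided by the outcome variance, so 1.0 means no better than predicting the mean). The fold count, the simulated data, and the least-squares stand-in model are assumptions of this sketch:

```python
import numpy as np

# Sketch of cross-validated scaled MSE, the accuracy measure reported for
# the continuous outcomes (lower is better; 1.0 = mean-only prediction).
rng = np.random.default_rng(3)
N, p = 200, 10
X = rng.standard_normal((N, p))
y = X[:, 0] - X[:, 1] + rng.standard_normal(N)

def cv_scaled_mse(X, y, folds=5):
    idx = rng.permutation(len(y))
    errs = []
    for k in range(folds):
        test = idx[k::folds]                   # every folds-th shuffled index
        train = np.setdiff1d(idx, test)
        coef, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
        errs.append(np.mean((y[test] - X[test] @ coef) ** 2))
    return np.mean(errs) / np.var(y)           # scale by outcome variance

score = cv_scaled_mse(X, y)
```

The same loop applies to any of the compared learners; for the binary asthma outcome, ROC-AUC and PR-AUC replace the scaled MSE.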


Table 1: Prediction Accuracy [1]

Outcome           DNN All   DNN Top50   RF All   RF Top50   SVM All   SVM Top50   Lasso All
IQ                0.437     0.449       0.443    0.445      0.438     0.435       0.445
BW [2]            0.645     0.644       0.659    0.664      0.674     0.666       0.659
BMI               0.744     0.736       0.758    0.750      0.736     0.710       0.781
Behavior [3]      0.838     0.872       0.866    0.911      0.862     0.882       0.890
Asthma, ROC-AUC   0.604     0.619       0.589    0.550      0.436     0.424       0.433
Asthma, PR-AUC    0.158     0.161       0.136    0.119      0.087     0.090       0.086

[1] ROC-AUC and PR-AUC are reported for the binary asthma outcome; scaled MSE is reported for all other outcomes.
[2] BW: birth weight
[3] Behavior: behavior score


Table 2: Top 5 features from DNN models

Asthma                                    BMI
Cadmium (Cd) in mother                    Maternal pre-pregnancy BMI
Sex                                       Child birth weight
Mercury (Hg) in mother                    Dichlorodiphenyldichloroethylene (DDE) in child
Postnatal fruit consumption               Polychlorinated biphenyl-170 (PCB-170) in child
Dimethyl thiophosphate (DMTP) in child    Maternal age

All variables above have been reported in the literature.

Remarks

- Ensemble DNNs are promising ML algorithms for exposome research.
- PermFIT is a useful tool for assessing feature importance.
- Few significant features were identified; possible explanations include poly-risk factors with mixture effects (each with a small effect but jointly important) and missing important biomarkers/features.
- Integrated analysis with other types of data, such as metabolomics and transcriptomics data, is promising: for BMI, our preliminary analysis suggests that serum metabolomics data substantially improve the performance of the DNN models.

References

Breiman, L. (1996). Bagging predictors. Machine Learning, 24(2):123–140.
Breiman, L. (2001). Random forests. Machine Learning, 45(1):5–32.
Candes, E., Fan, Y., Janson, L., and Lv, J. (2018). Panning for gold: 'model-X' knockoffs for high dimensional controlled variable selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 80(3):551–577.
Craven, M. and Shavlik, J. W. (1996). Extracting tree-structured representations of trained networks. In Advances in Neural Information Processing Systems, pages 24–30.
Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals, and Systems (MCSS), 2(4):303–314.
Dai, C., Lin, B., Xing, X., and Liu, J. S. (2020). False discovery rate control via data splitting. arXiv preprint arXiv:2002.08542.


Hornik, K., Stinchcombe, M., and White, H. (1989). Multilayer feedforward networks are universal approximators. Neural Networks, 2(5):359–366.
Lu, Y., Fan, Y., Lv, J., and Noble, W. S. (2018). DeepPINK: reproducible feature selection in deep neural networks. In Advances in Neural Information Processing Systems, pages 8676–8686.
Lundberg, S. M. and Lee, S.-I. (2017). A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems, pages 4765–4774.
Mi, X., Zou, F., and Zhu, R. (2019). Bagging and deep learning in optimal individualized treatment rules. Biometrics, 75(2):674–684.
Mi, X., Zou, F., Zou, B., and Hu, J. (2021). Permutation-based identification of important biomarkers for complex diseases via black-box models. Nature Communications (in press).


Ribeiro, M. T., Singh, S., and Guestrin, C. (2016). "Why should I trust you?": Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1135–1144. ACM.
Shapley, L. (1953). A value for n-person games. In Kuhn, H. W., editor, Contributions to the Theory of Games.
Tansey, W., Veitch, V., Zhang, H., Rabadan, R., and Blei, D. M. (2018). The holdout randomization test: Principled and easy black box feature selection. arXiv preprint arXiv:1811.00645.
Xing, X., Gui, Y., Dai, C., and Liu, J. S. (2020). Neural Gaussian mirror for controlled feature selection in neural networks. arXiv preprint arXiv:2010.06175.
Xing, X., Zhao, Z., and Liu, J. S. (2019). Controlling false discovery rate using Gaussian mirrors. arXiv preprint arXiv:1911.09761.
Zhou, Z.-H., Wu, J., and Tang, W. (2002). Ensembling neural networks: many could be better than all. Artificial Intelligence, 137(1-2):239–263.

Acknowledgement

Collaborative Work:
- Xinlei Mi, Northwestern Univ.
- Baiming Zou, UNC
- Julia Rager, UNC
- Mike O'Shea, UNC
- Rebecca Fry, UNC

Funding Support:
- UNC-SRP (Superfund Research Program): P42ES031007
- UNC Center for Environmental Health & Susceptibility: P30ES010126
