Reviewers: This Article Was Reviewed by Lan Hu, Tim Beissbarth and Dimitar Vassilev

Polewko-Klim et al. Biology Direct (2018) 13:17
https://doi.org/10.1186/s13062-018-0222-9

RESEARCH Open Access

Integration of multiple types of genetic markers for neuroblastoma may contribute to improved prediction of the overall survival

Aneta Polewko-Klim1*, Wojciech Lesiński1, Krzysztof Mnich2, Radosław Piliszek2 and Witold R. Rudnicki1,2,3

Abstract

Background: Modern experimental techniques deliver data sets containing profiles of tens of thousands of potential molecular and genetic markers that can be used to improve medical diagnostics. Previous studies performed with three different experimental methods for the same set of neuroblastoma patients create opportunity to examine whether augmenting gene expression profiles with information on copy number variation can lead to improved predictions of patients survival. We propose methodology based on comprehensive cross-validation protocol, that includes feature selection within cross-validation loop and classification using machine learning. We also test dependence of results on the feature selection process using four different feature selection methods.

Results: The models utilising features selected based on information entropy are slightly, but significantly, better than those using features obtained with t-test. The synergy between data on genetic variation and gene expression is possible, but not confirmed. A slight, but statistically significant, increase of the predictive power of machine learning models has been observed for models built on combined data sets. It was found while using both out of bag estimate and in cross-validation performed on a single set of variables. However, the improvement was smaller and non-significant when models were built within full cross-validation procedure that included feature selection within cross-validation loop.
Good correlation between performance of the models in the internal and external cross-validation was observed, confirming the robustness of the proposed protocol and results.

Conclusions: We have developed a protocol for building predictive machine learning models. The protocol can provide robust estimates of the model performance on unseen data. It is particularly well-suited for small data sets. We have applied this protocol to develop prognostic models for neuroblastoma, using data on copy number variation and gene expression. We have shown that combining these two sources of information may increase the quality of the models. Nevertheless, the increase is small and larger samples are required to reduce noise and bias arising due to overfitting.

Reviewers: This article was reviewed by Lan Hu, Tim Beissbarth and Dimitar Vassilev.

Keywords: Neuroblastoma, Feature selection, Machine learning, Random forest, Copy number variation, Gene expression, Synergy

*Correspondence: [email protected]
1Institute of Informatics, University of Białystok, Konstantego Ciołkowskiego 1M, 15-245 Białystok, Poland
Full list of author information is available at the end of the article

© The Author(s). 2018 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Background

The current study is the answer to the CAMDA Neuroblastoma Data Integration Challenge (camda.info). The goal of the challenge was the exploration of the opportunities given by the availability of different types of molecular data for improving prediction of patient survival in neuroblastoma.

Neuroblastoma is a cancer manifesting in early childhood. It displays a heterogeneous clinical course and a large fraction of patients with neuroblastoma will eventually enter metastasis and have a poor outcome. Accurate identification of the high-risk group is critical for delivering an appropriate targeted therapy [1]. Currently, the prognosis is based on clinical stage and age of the patient [2]. However, research towards inclusion and integration of genomic data with expression profiles and traditional clinical data is actively pursued in the field [3]. In particular, the effort towards establishing a connection between clinical outcome and gene expression has been recently the subject of a multinational project involving multiple bioinformatical and analytical laboratories [4], where gene expression profiles of 498 patients were examined using both microarrays and RNA sequencing. Within the CAMDA Neuroblastoma Challenge this data has been accompanied with previously generated data relating copy number variation (CNV) for the subset of patients consisting of 145 individuals [2, 5–7]. The clinical data was available for all patients, including survival time, classification to the low- or high-risk subset, as well as sex.

Most of the data in the challenge was already used in the study aiming at comparison of utility of RNA-seq and microarray data sets for prediction of the clinical endpoint for neuroblastoma. What is more, the goal of the CAMDA challenge is a logical extension of goals pursued in that study. Therefore, the current study is based on general methodology proposed by Zhang et al.

However, the detailed analysis of the results obtained in that study shows that significant modifications in the methodology are required. In particular, the design of the Zhang et al. did not allow for the robust and reproducible estimate of predictive power of different models. The study was performed using a single split of data between training set, used to develop models, and validation set, used for assessing the quality of predictions. Six independent groups developed models using data from the training set, the quality of which was then assessed on the validation set. Sixty models using different approaches and different sets of variables were built for each of the six different clinical endpoints. The predictive power of each model was also estimated using cross-validation on the training set. The metric of choice was Matthews Correlation Coefficient (MCC) [8] which is a balanced measure of the predictive power of a binary classifier. In comparison with the simple accuracy measure, it assigns greater weight to prediction of minority class for unbalanced data sets.

Unfortunately, the predictive power of models measured on the training set was not correlated with the predictive power measured on the validation set. Only for models predicting the sex of a patient, correlation between the quality of the model measured on the training set and that measured on the validation set was 0.41, which is statistically significant, if not very high. Nevertheless, this endpoint is not clinically interesting and it was used in the study merely as a reference representing a very easy modelling target.

For all other clinical endpoints correlations between MCC obtained in cross-validation and MCC obtained on validation sets are very small, confined to a small interval between -0.1 and 0.11. What is more, the variance of MCC obtained both on training and validation sets was very high. For example, the following results were obtained for the overall survival: the mean MCC on the training set and validation set for 60 models was 0.48 and 0.46, and 95% confidence interval is (0.46, 0.51) for the former and (0.45, 0.49) for the latter. The high variance and lack of correlation between predictive power of the models obtained on the training and the validation sets precludes definitive statements about overall superiority of one classifier over another, including comparison of relative merits of different data sets used to build the classifiers.

Since the main goal of the current study is to examine whether integrating multiple lines of experimental evidence can improve the quality of predictive models, high confidence in robustness of results is crucial. For this purpose, we propose a protocol that gives robust results that are well correlated between training and validation sets. The protocol is based on an extensive cross-validation and utilises four methods for selecting informative features used for model building. We apply this protocol to examine the relative utility of different data sets for predicting a single clinical endpoint, namely the overall survival. Finally, we apply the same protocol to examine whether models that utilise informative variables from more than one data set have a higher predictive power in comparison with the models utilising information from a single data set. The protocol includes a feature selection step. Hence, it allows to explore differences and similarities between genes selected as most informative from three independent experimental methods.

Methods

The single split of data between training set and validation set is not sufficient for robust estimate of performance of the machine learning model on external data. Modelling procedure that includes variable selection and model building is prone to overfitting in both steps. The variable selection finds variables that are informative due
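
The idea described above, namely scoring a binary classifier with MCC while keeping feature selection inside the cross-validation loop, can be illustrated with a minimal sketch. It is not the authors' implementation: the data are synthetic, the t-statistic ranking stands in for one of the four selection methods, a nearest-centroid rule replaces the classifier used in the paper, and every function name (`mcc`, `select_features`, `cross_validated_mcc`, `make_data`) is hypothetical.

```python
import math
import random


def mcc(y_true, y_pred):
    """Matthews Correlation Coefficient for binary labels coded as 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention assumed here: MCC is 0 when a marginal count is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0


def t_score(X, y, rows, j):
    """Absolute Welch t-statistic of feature j, computed on `rows` only."""
    a = [X[i][j] for i in rows if y[i] == 1]
    b = [X[i][j] for i in rows if y[i] == 0]
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    var_a = sum((v - mean_a) ** 2 for v in a) / (len(a) - 1)
    var_b = sum((v - mean_b) ** 2 for v in b) / (len(b) - 1)
    se = math.sqrt(var_a / len(a) + var_b / len(b))
    return abs(mean_a - mean_b) / se if se else 0.0


def select_features(X, y, rows, n_top):
    """Indices of the n_top features with the largest t-score on `rows`."""
    ranked = sorted(range(len(X[0])),
                    key=lambda j: t_score(X, y, rows, j), reverse=True)
    return ranked[:n_top]


def fit_centroids(X, y, rows, feats):
    """Per-class means of the selected features (stand-in model)."""
    cents = {}
    for c in (0, 1):
        members = [i for i in rows if y[i] == c]
        cents[c] = [sum(X[i][j] for i in members) / len(members) for j in feats]
    return cents


def predict(x, feats, cents):
    """Assign the class whose centroid is closer (squared Euclidean distance)."""
    dist = {c: sum((x[j] - m) ** 2 for j, m in zip(feats, cents[c]))
            for c in (0, 1)}
    return 1 if dist[1] < dist[0] else 0


def cross_validated_mcc(X, y, n_top=10, n_folds=5, seed=0):
    """K-fold CV in which feature selection is repeated on each training part."""
    idx = list(range(len(y)))
    random.Random(seed).shuffle(idx)
    folds = [idx[k::n_folds] for k in range(n_folds)]
    y_pred = [0] * len(y)
    for test in folds:
        held_out = set(test)
        train = [i for i in idx if i not in held_out]
        # Selection and fitting see training rows only; test rows stay unseen.
        feats = select_features(X, y, train, n_top)
        cents = fit_centroids(X, y, train, feats)
        for i in test:
            y_pred[i] = predict(X[i], feats, cents)
    return mcc(y, y_pred)


def make_data(n=60, n_features=200, n_informative=5, shift=1.5, seed=1):
    """Synthetic data: Gaussian noise, with a class shift in the first few features."""
    rng = random.Random(seed)
    y = [i % 2 for i in range(n)]
    X = [[rng.gauss(0.0, 1.0)
          + (shift if (j < n_informative and y[i] == 1) else 0.0)
          for j in range(n_features)] for i in range(n)]
    return X, y


X, y = make_data()
print("cross-validated MCC:", round(cross_validated_mcc(X, y), 3))
# A classifier that always predicts the majority class has 90% accuracy here
# but an MCC of 0, which is why MCC is preferred for unbalanced data.
print("majority-class MCC:", mcc([0] * 9 + [1], [0] * 10))
```

Selecting features once on the full data set and only then cross-validating the classifier would let information from the held-out samples leak into the model, which is the source of optimistic bias that the protocol is designed to avoid.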
