Borisov et al. BMC Medical Genomics 2020, 13(Suppl 8):111 https://doi.org/10.1186/s12920-020-00759-0

RESEARCH Open Access

Cancer gene expression profiles associated with clinical outcomes to chemotherapy treatments

Nicolas Borisov1,2*, Maxim Sorokin1,3, Victor Tkachev1, Andrew Garazha1 and Anton Buzdin1,2,3,4

From 11th International Young Scientists School “Systems Biology and Bioinformatics”–SBB-2019 Novosibirsk, Russia. 24-28 June 2019

Abstract
Background: Machine learning (ML) methods still have limited applicability in personalized oncology because few clinically annotated molecular profiles are available. This prevents sufficient training of ML classifiers that could be used to improve molecular diagnostics.
Methods: We reviewed published datasets of high-throughput gene expression profiles from cancer patients with known responses to chemotherapy treatments. We searched the Gene Expression Omnibus (GEO), The Cancer Genome Atlas (TCGA) and Tumor Alterations Relevant for GEnomics-driven Therapy (TARGET) repositories.
Results: We identified data collections suitable for building ML models that predict responses to certain chemotherapeutic schemes. We identified 26 datasets, ranging from 41 to 508 cases per dataset. All identified datasets were checked for ML applicability and robustness using leave-one-out cross-validation. Twenty-three datasets, which had balanced numbers of treatment responder and non-responder cases, were found suitable for ML.
Conclusions: We collected a database of gene expression profiles associated with clinical responses to chemotherapy for 2786 individual cancer cases. Seven of the datasets included RNA sequencing data (for 645 cases), and the others contained microarray expression profiles. The cases represented breast cancer, lung cancer, low-grade glioma, endothelial carcinoma, multiple myeloma, adult leukemia, pediatric leukemia and kidney tumors. Chemotherapeutics included taxanes, bortezomib, vincristine, trastuzumab, letrozole, tipifarnib, temozolomide, busulfan and cyclophosphamide.
Keywords: Machine learning, Transcriptomics, Gene expression, RNA sequencing, Microarrays, Molecular diagnostics, Biomarkers detection, Cancer, Clinical oncology, Personalized medicine, Chemotherapy

* Correspondence: [email protected] 1Department of Bioinformatics and Molecular Networks, OmicsWay Corporation, Walnut, CA 91788, USA 2Moscow Institute of Physics and Technology, Dolgoprudny, Moscow Oblast 141701, Russia Full list of author information is available at the end of the article

© The Author(s). 2020 Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

Background
A personalized approach provides important advantages in clinical oncology in terms of improved patient survival and lower drug toxicities [1, 2]. However, so far it can cover only a minor fraction of cancer patients [3, 4] because robust prognostic biomarkers are lacking for most treatments [5]. The proportion of patients eligible for personalized oncology is growing slowly. For example, the percentage of US patients with cancer estimated to benefit from personalized prescriptions of targeted therapeutics was only 0.7% in 2006 and had increased to ~ 5% in 2018 [4]. This progress could be more significant if more companion diagnostic tests were available for the standardly used cancer drugs. In this regard, gene expression data, obtained either by RNA sequencing [1] or using microarrays [6], frequently provide an advantage over genomic tests. Several trials and clinical case reports published recently provide evidence for the high efficiency of gene expression-based prescriptions of cancer chemotherapeutics. Cancer gene expression data can be used per se or can be normalized on the available profiles of healthy human tissues [7].

Using transcriptomic data, bioinformatic models can be built for patient-oriented ranking of cancer drugs [8]. These models can be hypothesis-driven, e.g. based on knowledge of the specific mechanisms of the anticancer activity of drugs [9–11]. Alternatively, hypothesis-free approaches like machine learning (ML) do not need any theoretical background but instead strongly require sufficient training and validation datasets. Many ML methods may be used for such applications, e.g. decision trees [12, 13], random forests, RF [14, 15], linear [16], logistic [17], lasso [18, 19] and ridge [15, 20] regressions, multi-layer perceptron, MLP [12, 15, 21, 22], support vector machines [12, 13, 15, 23–25], adaptive boosting [26–28], as well as the binomial naïve Bayesian [15] method.

High-quality training and validation datasets are required to run both types of the above models. Currently there is a shortage of clinically annotated molecular data that would help in developing ML-assisted diagnostic tools. The datasets available are usually considered too small for applying ML [23, 25, 26, 29–33]. Indeed, dozens or hundreds of annotated biosamples are negligible in comparison with the ~ 20,000 protein coding genes measured in transcriptomic assays. Intelligent data filtering is, therefore, needed to reduce the dimensionality of the data [8]. However, a recent approach using dynamic feature extraction, or flexible data trimming, can significantly improve the performance of ML-based methods for real-world datasets [15, 25].

This study was performed to review available clinically annotated datasets of cancer transcriptomic profiles that may be suitable for applications in ML models. To our knowledge, this is the largest published collection of processed gene expression data coupled with case history excerpts indicating positive or negative response to certain treatment protocols for cancer patients. This manually curated collection of molecular datasets will be helpful for those working with ML or artificial intelligence applications in oncology, as well as for fundamental research and development of cancer biomarkers.

Methods
We curated the GEO [34], TARGET [35] and TCGA [36] repositories to extract cancer gene expression profiles associated with the clinical outcomes of chemotherapeutic treatments. We attempted to build a knowledgebase of molecular datasets suitable for building ML classifiers of clinical responses to chemotherapy treatments (Table 1, Additional file 1). Every included dataset met the following criteria:

– at least 40 gene expression profiles present;
– data obtained for the same cancer type and using the same experimental platform
– every profile is linked with the case clinical history
– all cancers treated with at least one common drug or
– treatment outcomes are available that allow every case to be classified as either responder or non-responder.

We used different approaches to discriminate between the treatment responders and non-responders. Where available, e.g. for the datasets extracted from the GEO repository, we used the responder/non-responder marks assigned by the authors of the original communications publishing these data. In many instances, the number of response groups was more than two and included groups like "partial responders". However, most frequently binary ML-assisted drug response classifiers are needed that assign patients to only two classes: either responders or non-responders [8, 23, 25, 29, 30]. If a binary classifier is needed, then the number of clinical response groups in the training/validation datasets must also be condensed to two, i.e. responders and non-responders. In such a case, the groups identified by the authors as partial responders can probably be combined with the responders. This is the case for all current breast cancer datasets, namely GSE25066 [37, 38], GSE41998 [39], GSE20271 [40], GSE50948 [41], GSE18728 [48], GSE20181 [49, 59], GSE20194 [50], GSE23988 [51], GSE22358 [52], GSE32646 [53], GSE37946 [54], GSE42822 [55], GSE59515 [57] and GSE76360 [58].

For the TCGA profiles, namely for the low-grade glioma (TCGA-LGG), lung cancer (TCGA-LC), and uterine corpus endothelial carcinoma (TCGA-UEC) datasets, and for the acute myeloid leukemia dataset GSE5122

Table 1 Overview of selected transcriptomic datasets of responders/non-responders to cancer chemotherapy, responders (R) vs non-responders (NR)

Reference | Dataset ID | Disease type | Therapy | Experimental platform | Number N of cases (R vs NR) | Number of core marker genes (S)
[37, 38] | GSE25066 | Breast cancer with different hormonal and HER2 status | Neoadjuvant + | Affymetrix Human Genome U133 Array | 508 (118 R: complete response + partial response; 389 NR: residual disease + progressive disease) | 20
[39] | GSE41998 | Breast cancer with different hormonal and HER2 status | Neoadjuvant + cyclophosphamide, followed by | Affymetrix Human Genome U133 Array | 124 (90 R: complete response + partial response; 34 NR: residual disease + progressive disease) | 11
[40] | GSE20271 | Breast cancer with different hormonal and HER2 status | Paclitaxel + + adriamycin + cyclophosphamide | Affymetrix Human Genome U133A Array | 85 (18 R: complete response + partial response; 66 NR: residual disease + progressive disease) | 11
[41] | GSE50948 | Breast cancer with different hormonal and HER2 status | Paclitaxel + doxorubicin followed by cyclophosphamide + /fluorouracil followed by trastuzumab | Affymetrix Human Genome U133 Plus 2.0 Array | 156 (53 R: complete response + partial response; 103 NR: residual disease + progressive disease) | 19
[42] | GSE9782 | Multiple myeloma | Bortezomib monotherapy | Affymetrix Human Genome U133 Array | 169 (85 R: complete response + partial response; 84 NR: no change + progressive disease) | 18
[43] | GSE39754 | Multiple myeloma | Vincristine + adriamycin + dexamethasone followed by autologous stem cell transplantation (ASCT) | Affymetrix Human Exon 1.0 ST Array | 136 (74 R: complete, near-complete and very good partial responders; 62 NR: partial, minor and worse) | 16
[44] | GSE68871 | Multiple myeloma | Bortezomib-thalidomide-dexamethasone | Affymetrix Human Genome U133 Plus | 118 (69 R: complete, near-complete and very good partial responders; 49 NR: partial, minor and worse) | 12
[45] | GSE55145 | Multiple myeloma | Bortezomib followed by ASCT | Affymetrix Human Exon 1.0 ST Array | 61 (33 R: complete, near-complete and very good partial responders; 28 NR: partial, minor and worse) | 14
[35, 46] | TARGET-50 | Childhood kidney Wilms tumor | Vincristine sulfate + cyclosporine, , + conventional surgery + radiation therapy | Illumina HiSeq 2000 | 122 (36 R: complete, near-complete and very good partial responders; 86 NR: partial, minor and worse) | 14
[35, 47] | TARGET-10 | Childhood B acute lymphoblastic leukemia | Vincristine sulfate + , cyclophosphamide, doxorubicin | Illumina HiSeq 2000 | 98 (30 R, 68 NR: see Fig. 1) | 14
[35] | TARGET-20 | Childhood acute myeloid leukemia | Non-target drugs (, cyclosporine, cytarabine, daunorubicin, ; methotrexate, ) including busulfan and cyclophosphamide | Illumina HiSeq 2000 | 54 (31 R, 23 NR: see Fig. 1) | 10
[35] | TARGET-20 | Childhood acute myeloid leukemia | Same non-target drugs, but excluding busulfan and cyclophosphamide | Illumina HiSeq 2000 | 142 (62 R, 80 NR: see Fig. 1) | 16
Reference | Dataset ID | Disease type | Therapy | Experimental platform | Number NC of cases (R vs NR) | Number of core marker genes (NS)
[48] | GSE18728 | Breast cancer | , | Affymetrix Human Genome U133 Plus 2.0 Array | 61 (23 R: complete response + partial response; 38 NR: residual disease + progressive disease) | 16
[49] | GSE20181 | Breast cancer | Letrozole | Affymetrix Human Genome U133A Array | 52 (37 R: complete response + partial response; 15 NR: residual disease + progressive disease) | 11
[50] | GSE20194 | Breast cancer | Paclitaxel; (tri)fluoroacetyl chloride; 5-fluorouracil, , cyclophosphamide | Affymetrix Human Genome U133A Array | 52 (11 R: complete response + partial response; 41 NR: residual disease + progressive disease) | 10
[51] | GSE23988 | Breast cancer | Docetaxel, capecitabine | Affymetrix Human Genome U133A Array | 61 (20 R: complete response + partial response; 41 NR: residual disease + progressive disease) | 18
[52] | GSE22358 | Breast cancer | Docetaxel, capecitabine | Agilent UNC Perou Lab Homo sapiens 1X44K Custom Array | 122 (116 R: complete response + partial response; 6 NR: residual disease + progressive disease) | 2
[53] | GSE32646 | Breast cancer | Paclitaxel, 5-fluorouracil, epirubicin, cyclophosphamide | Affymetrix Human Genome U133 Plus 2.0 Array | 115 (27 R: complete response + partial response; 88 NR: residual disease + progressive disease) | 17
[54] | GSE37946 | Breast cancer | Trastuzumab | Affymetrix Human Genome U133A Array | 50 (27 R: complete response + partial response; 23 NR: residual disease + progressive disease) | 14
[55] | GSE42822 | Breast cancer | Docetaxel, 5-fluorouracil, epirubicin, cyclophosphamide, capecitabine | Affymetrix Human Genome U133A Array | 91 (38 R: complete response + partial response; 53 NR: residual disease + progressive disease) | 13
[56] | GSE5122 | Acute myeloid leukemia | Tipifarnib | Affymetrix Human Genome U133A Array | 57 (13 R: complete response + partial response + stable disease; 44 NR: progressive disease) | 10
[57] | GSE59515 | Breast cancer | Letrozole | Illumina HumanHT-12 V4.0 expression beadchip | 75 (51 R: complete response + partial response; 24 NR: residual disease + progressive disease) | 15
[58] | GSE76360 | Breast cancer | Trastuzumab | Illumina HumanHT-12 V3.0 expression beadchip | 48 (42 R: complete response + partial response; 6 NR: residual disease + progressive disease) | 3
[36] | TCGA-LGG | Low-grade glioma | Temozolomide + (optionally) mibefradil | Illumina HiSeq 2000 | 131 (100 R: complete response + partial response + stable disease; 31 NR: progressive disease) | 9
[36] | TCGA-LC | Lung cancer all types | Paclitaxel + (optionally), /carboplatin, reolysin | Illumina HiSeq 2000 | 41 (24 R: complete response + partial response + stable disease; 17 NR: progressive disease) | 7
[36] | TCGA-UEC | Uterine corpus endothelial carcinoma | Paclitaxel + (optionally) carboplatin, cisplatin, doxorubicin | Illumina HiSeq 2000 | 57 (57 R: complete response + partial response + stable disease; 7 NR: progressive disease) | 2
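The R/NR groupings listed in Table 1 amount to a lookup from an author-assigned clinical response category to a binary label. A minimal Python sketch of that condensation follows. The scheme names (`breast`, `tcga_gse5122`, `myeloma_gse9782`, `myeloma_other`), the exact category strings and the function `to_binary` are hypothetical and chosen only for illustration; the "worse" categories for the myeloma datasets are not enumerated in the text and are therefore left out.

```python
from typing import Dict, FrozenSet, Tuple

# (responder categories, non-responder categories) per grouping scheme.
# Scheme names and category spellings are assumptions of this sketch.
SCHEMES: Dict[str, Tuple[FrozenSet[str], FrozenSet[str]]] = {
    "breast": (
        frozenset({"complete response", "partial response"}),
        frozenset({"residual disease", "progressive disease"}),
    ),
    "tcga_gse5122": (
        frozenset({"complete response", "partial response", "stable disease"}),
        frozenset({"progressive disease"}),
    ),
    "myeloma_gse9782": (
        frozenset({"complete response", "partial response"}),
        frozenset({"no change", "progressive disease"}),
    ),
    "myeloma_other": (
        frozenset({"complete response", "near-complete response",
                   "very good partial response"}),
        frozenset({"partial response", "minor response"}),
    ),
}


def to_binary(scheme: str, clinical_response: str) -> str:
    """Condense a clinical response category to 'R' or 'NR'."""
    responders, non_responders = SCHEMES[scheme]
    key = clinical_response.strip().lower()
    if key in responders:
        return "R"
    if key in non_responders:
        return "NR"
    # Refuse to guess: unknown categories must be curated by hand.
    raise ValueError(f"unmapped response category: {clinical_response!r}")
```

Raising on unmapped categories, instead of defaulting to NR, keeps curation errors visible; note that "partial response" maps to R in the breast cancer scheme but to NR in the three myeloma datasets graded on the finer scale.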

[56], stable disease cases can most probably be classified as responders, whereas progressive disease cases can be classified as non-responders. For the multiple myeloma dataset GSE9782 [42], the classification can be used as defined by the authors, where patients with complete and partial response were annotated as responders, and those with no change and progressive disease as non-responders. For three other multiple myeloma datasets, namely GSE39754 [43], GSE68871 [44], and GSE55145 [45], the complete, near-complete and very good partial response groups can most likely be considered responders, whereas the partial, minor and worse response groups can be considered non-responders.

Classification of the TARGET repository profiles was more complicated because no responder classification was given by the authors. This was the case for the datasets of pediatric Wilms kidney tumor (TARGET-50), acute myeloid leukemia (TARGET-20) and acute lymphoblastic leukemia (TARGET-10) extracted from the gene expression repository of the National Cancer Institute [35]. However, these clinical cases were annotated with the time of event-free survival. The distributions of event-free survival time enabled us to identify, for every dataset, two different modes of survival with different slopes (Fig. 1), which can be recognized as either responders or non-responders.

Results
For raw gene expression data, the number of features, i.e. interrogated genes, usually exceeds the number of tumor cases by roughly two orders of magnitude. Therefore, for robust application of ML, the dimensionality of the data must be reduced so that the number of selected features is lower than the number of tumor cases or at least comparable to it (Fig. 2a). To reduce dimensionality, the gene expression data can be aggregated into higher-order molecular markers such as activation profiles of molecular pathways [23, 29, 30, 60, 61]. Alternatively, the most informative fraction of the initial data can be selected that can distinguish between the responder and non-responder classes. For selection of such marker features, several approaches have been proposed, e.g. Pearson chi-squared test [62], correlation test [27, 62], variance thresholding, genetic algorithms [63], univariate feature selection, recursive feature elimination, principal component analysis [27], CUR matrix [64], decomposition [65] and covariate regression [66].

In the current research, we applied the following leave-one-out-based method for finding robust marker gene features [25] (Fig. 2c). Suppose we have a gene expression dataset comprising N clinical cases, each with a corresponding expression profile. For each clinical case i = 1, …, N, we determine the top Q marker genes that distinguish responding and non-responding cases in a sub-dataset that contains all samples but i. In other words, for all N sub-datasets, each having N-1 cases, we interrogate each gene taken one by one and retrieve the top Q set of genes that showed the highest ROC AUC values for the difference between responder and non-responder profiles. The quality metric area under the ROC curve (AUC) is the universal metric of a biomarker robustness that depends on its sensitivity and specificity

Fig. 1 Distribution of event-free survival time for patients with (a) childhood kidney Wilms tumor from the TARGET-50 dataset, (b) childhood ALL from the TARGET-10 dataset and (c) childhood AML from the TARGET-20 dataset [35]. Patients to the left of the vertical threshold can be considered non-responders, and those to the right can be considered responders to the treatment
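The thresholding shown in Fig. 1 can be illustrated with a few lines of Python. This is only a sketch: the function name `label_by_efs`, the unit (days) and any threshold value passed to it are hypothetical, since the text does not state the thresholds numerically; treating a case lying exactly on the threshold as a responder is likewise an assumption, and censoring of survival times is ignored.

```python
from typing import List, Sequence


def label_by_efs(efs_days: Sequence[float], threshold_days: float) -> List[str]:
    """Label each case by its event-free survival (EFS) time.

    Cases left of the threshold become 'NR' (non-responders); cases at or
    right of it become 'R' (responders). The tie rule is an assumption.
    """
    return ["R" if t >= threshold_days else "NR" for t in efs_days]


def class_counts(labels: Sequence[str]) -> dict:
    """Count responders and non-responders to check class balance."""
    return {"R": sum(1 for x in labels if x == "R"),
            "NR": sum(1 for x in labels if x == "NR")}
```

For example, `class_counts(label_by_efs(times, threshold))` gives the R/NR split for a chosen threshold, which can then be compared with the counts reported in Table 1.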

Fig. 2 Possible scenarios of using ML to build classifiers based on gene expression datasets. a Methods of data dimensionality reduction; b approaches to merging and enlarging gene expression datasets for ML application; c general workflow for core marker set determination

[67]. It correlates positively with the quality of a biomarker and varies from 0.5 to 1. The standard discrimination threshold is 0.7: entries with higher AUC are considered high-quality biomarkers, and vice versa [68]. AUC is broadly used for detection of biomarkers in oncology [69–73].

To provide robust feature selection, the number Q should not exceed the number of cases N. In the current application, we took Q equal to 30 because all datasets under consideration had more than 40 cases. The final list of core marker genes was obtained by intersecting the top Q gene sets for all N sub-datasets.

We applied this procedure to all the clinically annotated cancer transcriptomic datasets under consideration and identified core marker genes for them (Table 1). Twenty-three out of the 26 datasets investigated provided 7–20 core marker gene features for further ML applications (Table 1). The remaining three datasets, namely GSE22358 [52], GSE76360 [58] and TCGA-UEC [36], were poorly balanced because the numbers of responders greatly exceeded the respective numbers of non-responders, or vice versa. For these three datasets we were unable to generate robust core marker gene sets for ML applications because the number of such genes was too low (two or three per dataset, Table 1).
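The leave-one-out procedure described above (rank genes by ROC AUC in each of the N sub-datasets, keep the top Q, intersect the N top-Q sets) can be sketched in plain Python as follows. This is an illustrative sketch, not the authors' implementation: the function names and the dictionary-based data layout are hypothetical, the AUC is computed with the pairwise (Mann-Whitney) formulation, and folding the AUC onto the 0.5 to 1 range so that both up- and down-regulated genes qualify is an assumption consistent with the stated range of the metric.

```python
from typing import Dict, List, Optional, Sequence, Set


def roc_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """AUC as the fraction of (responder, non-responder) pairs in which the
    responder has the higher score; ties count as one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def top_q_genes(expr: Dict[str, List[float]], labels: Sequence[int],
                q: int) -> Set[str]:
    """Return the q genes with the highest folded AUC (max(AUC, 1 - AUC))."""
    ranked = []
    for gene, values in expr.items():
        auc = roc_auc(values, labels)
        ranked.append((max(auc, 1.0 - auc), gene))
    # Sort by descending score; break ties by gene name for determinism.
    ranked.sort(key=lambda item: (-item[0], item[1]))
    return {gene for _, gene in ranked[:q]}


def core_marker_genes(expr: Dict[str, List[float]], labels: Sequence[int],
                      q: int = 30) -> Set[str]:
    """Intersect the top-q gene sets over all leave-one-out sub-datasets."""
    core: Optional[Set[str]] = None
    for i in range(len(labels)):
        sub_labels = [y for j, y in enumerate(labels) if j != i]
        sub_expr = {g: [v for j, v in enumerate(vals) if j != i]
                    for g, vals in expr.items()}
        top = top_q_genes(sub_expr, sub_labels, q)
        core = top if core is None else core & top
    return core if core is not None else set()
```

Here `expr` maps each gene name to its expression values across the N cases, and `labels` holds 1 for responders and 0 for non-responders in the same order. In a strongly unbalanced dataset, leaving out one of the few minority cases changes the gene ranking substantially, so the intersection shrinks; this is one plausible reading of why the three unbalanced datasets yielded only two or three core genes.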

Discussion
To date, ML has not revolutionized biomedicine [12]. This may be partly connected with the relatively recent emergence of experimental methods that generate large amounts of biomedical data, combined with the developed IT infrastructure. Among these game-changing methods, the major role was played by next-generation sequencing (NGS) and novel mass-spectrometry approaches, which made whole genome, transcriptome, proteome and metabolome analyses relatively fast and cheap [74–76].

Yet further development of ML methods in personalized oncology is still strongly limited by the low number of clinically annotated cancer patient molecular datasets. A dataset suitable for ML should contain both a sufficient number of high-throughput molecular profiles and the associated clinical case history records describing the success of the therapeutic regimen used.

In this paper we reviewed three major repositories of omics data for the available responder/non-responder datasets including more than 40 cancer cases treated with the same chemotherapeutics. We identified 26 datasets with 2786 cases in total, ranging from 41 to 508 cases per dataset (Table 1). We checked the robustness of these datasets and their suitability for ML applications using our previous method of core marker feature determination [25]. According to this test, 23/26 datasets were suitable for ML, each having 7–20 core marker genes/features for further ML applications. In contrast, the remaining three datasets produced only two or three features, which may be insufficient for ML. The poor performance of these three datasets was most likely due to the unbalanced numbers of clinical responder/non-responder cases included.

To increase the number of cases (Fig. 2b), datasets for the same disease or drug treatment conditions can be merged using cross-dataset harmonization. Different methods can be used to harmonize data obtained using the same [77, 78] or two different experimental platforms [79, 80], or even using multiple platforms [81] (Fig. 2b).

In addition, when cases are scarce, transfer learning methods may be used for a certain disease or drug condition. Using this approach, the ML training process may be performed on the multiple available molecular profiles corresponding to cell cultures treated with certain drugs [82], whereas the ML classifier validation may be done on the rarer patient cancer cases [23, 29, 30].

Conclusions
We identified 26 clinically annotated gene expression datasets ranging from 41 to 508 cases per dataset (Table 1). Collectively, they covered 2786 individual cancer cases. Seven of the datasets included RNA sequencing data (for 645 cases), and the others contained microarray expression profiles. The datasets represented breast cancer, lung cancer, low-grade glioma, endothelial carcinoma, multiple myeloma, adult leukemia, pediatric leukemia and kidney tumors. Chemotherapeutics used included taxanes, bortezomib, vincristine, trastuzumab, letrozole, tipifarnib, temozolomide, busulfan and cyclophosphamide.

We hope that the presented collection of clinically annotated transcriptomic profiles will be useful to those working with data analysis in oncology, as well as for fundamental research and development of next-generation cancer biomarkers.

Supplementary information
Supplementary information accompanies this paper at https://doi.org/10.1186/s12920-020-00759-0.

Additional file 1. Clinically annotated datasets and samples they contain.

Abbreviations
ALL: Acute lymphoblastic leukemia; AML: Acute myelogenous leukemia; ASCT: Autologous stem cell transplantation; GEO: Gene expression omnibus; GSE: GEO series; LOO: Leave-one-out; ML: Machine learning; MLP: Multi-layer perceptron; PM: Personalized medicine; RF: Random forest; ROC: Receiver operating characteristic; SVM: Support vector machines; TARGET: Tumor Alterations Relevant for GEnomics-driven Therapy; TCGA: The Cancer Genome Atlas

Acknowledgements
Not applicable.

About this supplement
This article has been published as part of BMC Medical Genomics Volume 13 Supplement 8, 2020: Selected Topics in “Systems Biology and Bioinformatics” - 2019: medical genomics. The full contents of the supplement are available online at https://bmcmedgenomics.biomedcentral.com/articles/supplements/volume-13-supplement-8.

Authors’ contributions
NB, MS and AB contributed conception and design of the study. NB, MS, VT and AG analyzed the data. NB and AB wrote the manuscript. All authors read and approved the final manuscript.

Funding
The work was funded by the Russian Science Foundation grant no. 18-15-00061. Publication costs have been funded by the Russian Science Foundation grant no. 18-15-00061. The funding body played no role in the design of the study and collection, analysis, and interpretation of data and in writing of the manuscript.

Availability of data and materials
All the data, including IDs of expression profiles, treatment methods and clinical response assessment, both done by the teams who had worked with them, and our binary (“P-vs-N”) response classifications, are provided in Additional file 1.

Ethics approval and consent to participate
Current research did not involve any new human material. All the gene expression data that were used for research were taken from the publicly available repositories GEO, TARGET and TCGA and had been previously anonymized by the teams who had worked with them.

Consent for publication
Not applicable.

Competing interests
NB, MS, VT, AG and AB have a financial relationship with Omicsway Corp.

Author details
1Department of Bioinformatics and Molecular Networks, OmicsWay Corporation, Walnut, CA 91788, USA. 2Moscow Institute of Physics and Technology, Dolgoprudny, Moscow Oblast 141701, Russia. 3I.M. Sechenov First Moscow State Medical University (Sechenov University), Moscow 119991, Russia. 4Shemyakin-Ovchinnikov Institute of Bioorganic Chemistry, Moscow 117997, Russia.

Received: 14 July 2020 Accepted: 27 July 2020 Published: 18 September 2020

References
1. Buzdin A, Sorokin M, Garazha A, Glusker A, Aleshin A, Poddubskaya E, et al. RNA sequencing for research and diagnostics in clinical oncology. Semin Cancer Biol. 2020;60:311–23.
2. Zhukov NV, Tjulandin SA. Targeted therapy in the treatment of solid tumors: practice contradicts theory. Biochem Biokhimiia. 2008;73:605–18.
3. Katz SJ, Ward KC, Hamilton AS, Mcleod MC, Wallner LP, Morrow M, et al. Gaps in receipt of clinically indicated genetic counseling after diagnosis of breast cancer. J Clin Oncol. 2018;36:1218–24.
4. Marquart J, Chen EY, Prasad V. Estimation of the percentage of US patients with cancer who benefit from genome-driven oncology. JAMA Oncol. 2018;4:1093–8.
5. Buzdin A, Sorokin M, Garazha A, Sekacheva M, Kim E, Zhukov N, et al. Molecular pathway activation - new type of biomarkers for tumor morphology and personalized selection of target drugs. Semin Cancer Biol. 2018;53:110–24.
6. Rodon J, Soria J-C, Berger R, Miller WH, Rubin E, Kugel A, et al. Genomic and transcriptomic profiling expands precision cancer medicine: the WINTHER trial. Nat Med. 2019;25:751–8.
7. Devizorova Z, Mironov S, Buzdin A. Theory of magnetic domain phases in ferromagnetic superconductors. Phys Rev Lett. 2019;122:117002.
8. Borisov N, Buzdin A. New paradigm of machine learning (ML) in personalized oncology: data trimming for squeezing more biomarkers from clinical datasets. Front Oncol. 2019;9:658.
9. Artemov A, Aliper A, Korzinkin M, Lezhnina K, Jellen L, Zhukov N, et al. A method for predicting target drug efficiency in cancer based on the analysis of signaling pathway activation. Oncotarget. 2015;6:29347–56.
10. Shepelin D, Korzinkin M, Vanyushina A, Aliper A, Borisov N, Vasilov R, et al. Molecular pathway activation features linked with transition from normal skin to primary and metastatic melanomas in human. Oncotarget. 2016;7:656–70.
11. Zolotovskaia MA, Sorokin MI, Emelianova AA, Borisov NM, Kuzmin DV, Borger P, et al. Pathway based analysis of mutation data is efficient for scoring target cancer drugs. Front Pharmacol. 2019;10. https://doi.org/10.3389/fphar.2019.00001.
12. Robin X, Turck N, Hainard A, Lisacek F, Sanchez J-C, Müller M. Bioinformatics for protein biomarker panel classification: what is needed to bring biomarker panels into in vitro diagnostics? Expert Rev Proteomics. 2009;6:675–89.
13. Bartlett P, Shawe-Taylor J. Generalization performance of support vector machines and other pattern classifiers. In: Advances in Kernel Methods: Support Vector Learning. Cambridge: MIT Press; 1999. p. 43–54. ISBN 0262194163.
14. Toloşi L, Lengauer T. Classification with correlated features: unreliability of feature ranking and solutions. Bioinformatics. 2011;27:1986–94.
15. Tkachev V, Sorokin M, Borisov C, Garazha A, Buzdin A, Borisov N. Flexible data trimming improves performance of global machine learning methods in omics-based personalized oncology. Int J Mol Sci. 2020;21:713.
16. Stigler SM. The history of statistics: the measurement of uncertainty before 1900. Cambridge: Belknap Press of Harvard University Press; 1986.
17. Cramer JS. The origins of logistic regression. SSRN Electron J. 2003. https://doi.org/10.2139/ssrn.360300.
18. Santosa F, Symes WW. Linear inversion of band-limited reflection seismograms. SIAM J Sci Stat Comput. 1986;7:1307–30.
19. Tibshirani R. The lasso method for variable selection in the Cox model. Stat Med. 1997;16:385–95.
20. Tikhonov AN, Arsenin VI. Solutions of ill-posed problems. Washington, New York, Winston: distributed solely by Halsted Press; 1977.
21. Minsky ML, Papert SA. Perceptrons - expanded edition: an introduction to computational geometry. Boston: MIT Press; 1987.
22. Prados J, Kalousis A, Sanchez J-C, Allard L, Carrette O, Hilario M. Mining mass spectra for diagnosis and biomarker discovery of cerebral accidents. Proteomics. 2004;4:2320–32.
23. Borisov N, Tkachev V, Suntsova M, Kovalchuk O, Zhavoronkov A, Muchnik I, et al. A method of gene expression data transfer from cell lines to cancer patients for machine-learning prediction of drug efficiency. 2018;17:486–91.
24. Osuna E, Freund R, Girosi F. An improved training algorithm for support vector machines: IEEE; 1997. p. 276–85. https://doi.org/10.1109/NNSP.1997.622408.
25. Tkachev V, Sorokin M, Mescheryakov A, Simonov A, Garazha A, Buzdin A, et al. FLOating-window projective separator (FloWPS): a data trimming tool for support vector machines (SVM) to improve robustness of the classifier. Front Genet. 2019;9:717. https://doi.org/10.3389/fgene.2018.00717.
26. Turki T, Wang JTL. Clinical intelligence: new machine learning techniques for predicting clinical drug response. Comput Biol Med. 2019;107:302–22.
27. Wang Z, Yang H, Wu Z, Wang T, Li W, Tang Y, et al. In silico prediction of blood-brain barrier permeability of compounds by machine learning and resampling methods. ChemMedChem. 2018;13:2189–201.
28. Yosipof A, Guedes RC, García-Sosa AT. Data mining and machine learning models for predicting drug likeness and their disease or organ category. Front Chem. 2018;6. https://doi.org/10.3389/fchem.2018.00162.
29. Borisov N, Tkachev V, Muchnik I, Buzdin A. Individual drug treatment prediction in oncology based on machine learning using cell culture gene expression data: ACM Press; 2017. p. 1–6. https://doi.org/10.1145/3155077.3155078.
30. Borisov N, Tkachev V, Buzdin A, Muchnik I. Prediction of drug efficiency by transferring gene expression data from cell lines to cancer patients. In: Rozonoer L, Mirkin B, Muchnik I, editors. Braverman readings in machine learning. Key ideas from inception to current state. Cham: Springer International Publishing; 2018. p. 201–12. https://doi.org/10.1007/978-3-319-99492-5_9.
31. Turki T, Wei Z. A link prediction approach to cancer drug sensitivity prediction. BMC Syst Biol. 2017;11. https://doi.org/10.1186/s12918-017-0463-8.
32. Turki T, Wei Z, Wang JTL. Transfer learning approaches to improve drug sensitivity prediction in multiple myeloma patients. IEEE Access. 2017;5:7381–93.
33. Turki T, Wei Z, Wang JTL. A transfer learning approach via procrustes analysis and mean shift for cancer drug sensitivity prediction. J Bioinforma Comput Biol. 2018;16:1840014.
34. Edgar R, Domrachev M, Lash AE. Gene expression omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Res. 2002;30:207–10.
35. Goldman M, Craft B, Swatloski T, Cline M, Morozova O, Diekhans M, et al. The UCSC cancer genomics browser: update 2015. Nucleic Acids Res. 2015;43:D812–7.
36. Tomczak K, Czerwińska P, Wiznerowicz M. The Cancer Genome Atlas (TCGA): an immeasurable source of knowledge. Contemp Oncol Poznan Pol. 2015;19:A68–77.
37. Hatzis C, Pusztai L, Valero V, Booser DJ, Esserman L, Lluch A, et al. A genomic predictor of response and survival following taxane-anthracycline chemotherapy for invasive breast cancer. JAMA. 2011;305:1873–81.
38. Itoh M, Iwamoto T, Matsuoka J, Nogami T, Motoki T, Shien T, et al. Estrogen receptor (ER) mRNA expression and molecular subtype distribution in ER-negative/progesterone receptor-positive breast cancers. Breast Cancer Res Treat. 2014;143:403–9.
39. Horak CE, Pusztai L, Xing G, Trifan OC, Saura C, Tseng L-M, et al. Biomarker analysis of neoadjuvant doxorubicin/cyclophosphamide followed by or paclitaxel in early-stage breast cancer. Clin Cancer Res Off J Am Assoc Cancer Res. 2013;19:1587–95.
40. Tabchy A, Valero V, Vidaurre T, Lluch A, Gomez H, Martin M, et al. Evaluation of a 30-gene paclitaxel, fluorouracil, doxorubicin, and cyclophosphamide chemotherapy response predictor in a multicenter randomized trial in breast cancer. Clin Cancer Res Off J Am Assoc Cancer Res. 2010;16:5351–61.
41. Prat A, Bianchini G, Thomas M, Belousov A, Cheang MCU, Koehler A, et al. Research-based PAM50 subtype predictor identifies higher responses and

improved survival outcomes in HER2-positive breast cancer in the NOAH 61. Tarca AL, Draghici S, Khatri P, Hassan SS, Mittal P, Kim J-S, et al. A novel study. Clin Cancer Res Off J Am Assoc Cancer Res. 2014;20:511–21. signaling pathway impact analysis. Bioinforma Oxf Engl. 2009;25:75–82. 42. Mulligan G, Mitsiades C, Bryant B, Zhan F, Chng WJ, Roels S, et al. Gene 62. Cho H-J, Lee S, Ji YG, Lee DH. Association of specific gene mutations expression profiling and correlation with outcome in clinical trials of the derived from machine learning with survival in lung adenocarcinoma. PLoS bortezomib. Blood. 2007;109:3177–88. One. 2018;13:e0207204. 43. Chauhan D, Tian Z, Nicholson B, Kumar KGS, Zhou B, Carrasco R, et al. A 63. Soufan O, Kleftogiannis D, Kalnis P, Bajic VB. DWFS: a wrapper feature small molecule inhibitor of ubiquitin-specific protease-7 induces apoptosis selection tool based on a parallel genetic algorithm. PLoS One. 2015;10: in multiple myeloma cells and overcomes bortezomib resistance. Cancer e0117988. Cell. 2012;22:345–58. 64. Mahoney MW, Drineas P. CUR matrix decompositions for improved data 44. Terragna C, Remondini D, Martello M, Zamagni E, Pantani L, Patriarca F, analysis. Proc Natl Acad Sci. 2009;106:697–702. et al. The genetic and genomic background of multiple myeloma patients 65. Turki T, Wei Z. Learning approaches to improve prediction of drug achieving complete response after induction therapy with bortezomib, sensitivity in breast cancer patients. In: 2016 38th annual international thalidomide and dexamethasone (VTD). Oncotarget. 2016;7:9666–79. conference of the IEEE engineering in medicine and biology society (EMBC). 45. Amin SB, Yip W-K, Minvielle S, Broyl A, Li Y, Hanlon B, et al. Gene expression Orlando: IEEE; 2016. p. 3314–20. https://doi.org/10.1109/EMBC.2016.7591437. profile alone is inadequate in predicting complete response in multiple 66. Menden MP, Casale FP, Stephan J, Bignell GR, Iorio F, McDermott U, et al. myeloma. Leukemia. 
2014;28:2229–34. The germline genetic component of drug sensitivity in cancer cell lines. Nat 46. Walz AL, Ooms A, Gadd S, Gerhard DS, Smith MA, Guidry Auvil JM, et al. Commun. 2018;9. https://doi.org/10.1038/s41467-018-05811-3. Recurrent DGCR8, DROSHA, and SIX Homeodomain mutations in favorable 67. Green DM, Swets JA. Signal detection theory and psychophysics. Repr. ed. histology Wilms tumors. Cancer Cell. 2015;27:286–97. Los Altos Hills: Peninsula Publ; 2000. 47. Tricoli JV, Blair DG, Anders CK, Bleyer WA, Boardman LA, Khan J, et al. 68. Boyd JC. Mathematical tools for demonstrating the clinical usefulness of Biologic and clinical characteristics of adolescent and young adult cancers: biochemical markers. Scand J Clin Lab Investig Suppl. 1997;227:46–63. acute lymphoblastic leukemia, colorectal cancer, breast cancer, melanoma, 69. Borisov NM, Terekhanova NV, Aliper AM, Venkova LS, Smirnov PY, and sarcoma: biology of AYA cancers. Cancer. 2016;122:1017–28. Roumiantsev S, et al. Signaling pathways activation profiles make better 48. Korde LA, Lusa L, McShane L, Lebowitz PF, Lukes L, Camphausen K, et al. markers of cancer than expression of individual genes. Oncotarget. 2014;5: Gene expression pathway analysis to predict response to neoadjuvant 10198–205. docetaxel and capecitabine for breast cancer. Breast Cancer Res Treat. 2010; 70. Chen L, Zhou Y, Tang X, Yang C, Tian Y, Xie R, et al. EGFR mutation 119:685–99. decreases FDG uptake in non-small cell lung cancer via the NOX4/ROS/ 49. Miller WR, Larionov A. Changes in expression of oestrogen regulated and GLUT1 axis. Int J Oncol. 2018. https://doi.org/10.3892/ijo.2018.4626. proliferation genes with neoadjuvant treatment highlight heterogeneity of 71. Liu T, Cheng G, Kang X, Xi Y, Zhu Y, Wang K, et al. Noninvasively evaluating clinical resistance to the aromatase inhibitor, letrozole. Breast Cancer Res. the grading and IDH1 mutation status of diffuse gliomas by three- 2010;12:R52. 
dimensional pseudo-continuous arterial spin labeling and diffusion- – 50. Popovici V, Chen W, Gallas BG, Hatzis C, Shi W, Samuelson FW, et al. Effect weighted imaging. Neuroradiology. 2018;60:693 702. of training-sample size and classification difficulty on the accuracy of 72. Tanioka M, Fan C, Parker JS, Hoadley KA, Hu Z, Li Y, et al. Integrated analysis genomic predictors. Breast Cancer Res. 2010;12:R5. of RNA and DNA from the phase III trial CALGB 40601 identifies predictors 51. Iwamoto T, Bianchini G, Booser D, Qi Y, Coutant C, Shiang CY-H, et al. Gene of response to trastuzumab-based neoadjuvant chemotherapy in HER2- – pathways associated with prognosis and chemotherapy sensitivity in positive breast cancer. Clin Cancer Res. 2018;24:5292 304. molecular subtypes of breast cancer. J Natl Cancer Inst. 2011;103:264–72. 73. Zolotovskaia MA, Sorokin MI, Roumiantsev SA, Borisov NM, Buzdin AA. 52. Glück S, Ross JS, Royce M, McKenna EF, Perou CM, Avisar E, et al. TP53 Pathway instability is an effective new mutation-based type of cancer genomics predict higher clinical and pathologic tumor response in biomarkers. Front Oncol. 2019;8. https://doi.org/10.3389/fonc.2018.00658. operable early-stage breast cancer treated with docetaxel-capecitabine ± 74. Chu Y, Corey DR. RNA sequencing: platform selection, experimental design, – trastuzumab. Breast Cancer Res Treat. 2012;132:781–91. and data interpretation. Nucleic Acid Ther. 2012;22:271 4. 53. Miyake T, Nakayama T, Naoi Y, Yamamoto N, Otani Y, Kim SJ, et al. GSTP1 75. Cox J, Mann M. Quantitative, high-resolution proteomics for data-driven – expression predicts poor pathological complete response to neoadjuvant systems biology. Annu Rev Biochem. 2011;80:273 99. chemotherapy in ER-negative breast cancer. Cancer Sci. 2012;103:913–20. 76. Pettersson E, Lundeberg J, Ahmadian A. Generations of sequencing technologies. Genomics. 2009;93:105–11. 54. Liu JC, Voisin V, Bader GD, Deng T, Pusztai L, Symmans WF, et al. Seventeen- 77. 
Love MI, Huber W, Anders S. Moderated estimation of fold change and gene signature from enriched Her2/Neu mammary tumor-initiating cells dispersion for RNA-seq data with DESeq2. Genome Biol. 2014;15:550. predicts clinical outcome for human HER2+:ERα- breast cancer. Proc Natl 78. Bolstad BM, Irizarry RA, Astrand M, Speed TP. A comparison of normalization Acad Sci U S A. 2012;109:5832–7. methods for high density oligonucleotide array data based on variance and 55. Shen K, Qi Y, Song N, Tian C, Rice SD, Gabrin MJ, et al. Cell line derived bias. Bioinforma Oxf Engl. 2003;19:185–93. multi-gene predictor of pathologic response to neoadjuvant chemotherapy 79. Huang H, Lu X, Liu Y, Haaland P, Marron JS. R/DWD: distance-weighted in breast cancer: a validation study on US oncology 02-103 . discrimination for classification, visualization and batch adjustment. BMC Med Genet. 2012;5:51. Bioinformatics. 2012;28:1182–3. 56. Raponi M, Harousseau J-L, Lancet JE, Löwenberg B, Stone R, Zhang Y, et al. 80. Shabalin AA, Tjelmeland H, Fan C, Perou CM, Nobel AB. Merging two gene- Identification of molecular predictors of response in a study of tipifarnib expression studies via cross-platform normalization. Bioinformatics. 2008;24: treatment in relapsed and refractory acute myelogenous leukemia. Clin 1154–60. Cancer Res Off J Am Assoc Cancer Res. 2007;13:2254–60. 81. Borisov N, Shabalina I, Tkachev V, Sorokin M, Garazha A, Pulin A, et al. 57. Turnbull AK, Arthur LM, Renshaw L, Larionov AA, Kay C, Dunbier AK, et al. Shambhala: a platform-agnostic data harmonizer for gene expression data. Accurate prediction and validation of response to endocrine therapy in BMC Bioinformatics. 2019;20:66. breast cancer. J Clin Oncol Off J Am Soc Clin Oncol. 2015;33:2270–8. 82. Yang W, Soares J, Greninger P, Edelman EJ, Lightfoot H, Forbes S, et al. 58. Varadan V, Gilmore H, Miskimen KLS, Tuck D, Parsai S, Awadallah A, et al. 
Genomics of drug sensitivity in cancer (GDSC): a resource for therapeutic Immune signatures following single dose Trastuzumab predict pathologic biomarker discovery in cancer cells. Nucleic Acids Res. 2013;41(Database response to PreoperativeTrastuzumab and chemotherapy in HER2-positive issue):D955–61. early breast cancer. Clin Cancer Res Off J Am Assoc Cancer Res. 2016;22: 3249–59. 59. Miller WR, Larionov A, Anderson TJ, Evans DB, Dixon JM. Sequential changes Publisher’sNote in gene expression profiles in breast cancers during treatment with the Springer Nature remains neutral with regard to jurisdictional claims in aromatase inhibitor, letrozole. Pharmacogenomics J. 2012;12:10–21. published maps and institutional affiliations. 60. Ozerov IV, Lezhnina KV, Izumchenko E, Artemov AV, Medintsev S, Vanhaelen Q, et al. In silico pathway activation network decomposition analysis (iPANDA) as a method for biomarker development. Nat Commun. 2016;7: 13427.