Cancer Gene Expression Profiles Associated with Clinical Outcomes To
Total Page:16
File Type:pdf, Size:1020Kb
Borisov et al. BMC Medical Genomics 2020, 13(Suppl 8):111 https://doi.org/10.1186/s12920-020-00759-0 RESEARCH Open Access Cancer gene expression profiles associated with clinical outcomes to chemotherapy treatments Nicolas Borisov1,2* , Maxim Sorokin1,3, Victor Tkachev1, Andrew Garazha1 and Anton Buzdin1,2,3,4 From 11th International Young Scientists School “Systems Biology and Bioinformatics”–SBB-2019 Novosibirsk, Russia. 24-28 June 2019 Abstract Background: Machine learning (ML) methods still have limited applicability in personalized oncology due to low numbers of available clinically annotated molecular profiles. This doesn’t allow sufficient training of ML classifiers that could be used for improving molecular diagnostics. Methods: We reviewed published datasets of high throughput gene expression profiles corresponding to cancer patients with known responses on chemotherapy treatments. We browsed Gene Expression Omnibus (GEO), The Cancer Genome Atlas (TCGA) and Tumor Alterations Relevant for GEnomics-driven Therapy (TARGET) repositories. Results: We identified data collections suitable to build ML models for predicting responses on certain chemotherapeutic schemes. We identified 26 datasets, ranging from 41 till 508 cases per dataset. All the datasets identified were checked for ML applicability and robustness with leave-one-out cross validation. Twenty-three datasets were found suitable for using ML that had balanced numbers of treatment responder and non-responder cases. Conclusions: We collected a database of gene expression profiles associated with clinical responses on chemotherapy for 2786 individual cancer cases. Among them seven datasets included RNA sequencing data (for 645 cases) and the others – microarray expression profiles. The cases represented breast cancer, lung cancer, low-grade glioma, endothelial carcinoma, multiple myeloma, adult leukemia, pediatric leukemia and kidney tumors. Chemotherapeutics included taxanes, bortezomib, vincristine, trastuzumab, letrozole, tipifarnib, temozolomide, busulfan and cyclophosphamide. Keywords: Machine learning, Transcriptomics, Gene expression, RNA sequencing, Microarrays, Molecular diagnostics, Biomarkers detection, Cancer, Clinical oncology, Personalized medicine, Chemotherapy * Correspondence: [email protected] 1Department of Bioinformatics and Molecular Networks, OmicsWay Corporation, Walnut, CA 91788, USA 2Moscow Institute of Physics and Technology, Dolgoprudny, Moscow Oblast 141701, Russia Full list of author information is available at the end of the article © The Author(s). 2020 Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data. Borisov et al. BMC Medical Genomics 2020, 13(Suppl 8):111 Page 2 of 9 Background processed gene expression data coupled with case Personalized approach provides important advantages in history excerpts indicating positive or negative response clinical oncology in terms of improved patient survival to certain treatment protocols for cancer patients. This and lower drug toxicities [1, 2]. However, so far it can manually curated collection of molecular datasets will be only cover a minor fraction of cancer patients [3, 4] due helpful for those working with the ML or artificial to lack of robust prognostic biomarkers for most of the intelligence applications in oncology, as well as for the treatments [5]. The proportion of patients eligible for fundamental research and development of cancer personalized oncology slightly grows. For example, the biomarkers. percentage of US patients with cancer estimated to benefit from personalized prescriptions of targeted ther- Methods apeutics was only 0.7% in 2006, and it had increased to We curated GEO [34], TARGET [35] and TCGA [36] ~ 5% in 2018 [4]. However, this progress could be more repositories to extract cancer gene expression profiles significant if more companion diagnostic tests would be associated with the clinical outcomes of chemotherapeu- available for the standardly used cancer drugs. In this tic treatments. We attempted to build a knowledgebase regard, gene expression data, either obtained by RNA se- of molecular datasets suitable for building ML classifiers quencing [1] or using microarrays [6], frequently provide of clinical responses on chemotherapy treatments an advantage over genomic tests. Several trials and clin- (Table 1, Additional file 1). Every included dataset met ical case reports were published recently evidencing high the following criteria: efficiency of gene expression-based prescriptions of cancer chemotherapeutics. Cancer gene expression data – at least 40 gene expression profiles present; can be used per se or can be normalized on the available – data obtained for the same cancer type and using profiles of healthy human tissues [7]. the same experimental platform Using transcriptomic data, bioinformatic models can – every profile is linked with the case clinical history be built for patient-oriented ranking of cancer drugs [8]. – all cancers treated with at least one common drug These models can be hypothesis-driven, e.g. based on or chemotherapy regimen the knowledge of the specific mechanisms of drugs anti- – treatment outcomes are available enabling to classify cancer activities [9–11]. Alternatively, hypothesis-free every case as either responder or non-responder. approaches like machine learning (ML) don’t need any theoretic background but instead strongly require We used different approaches to discriminate between sufficient training and validation datasets. Many ML the treatment responders and non-responders. Where methods may be used for such applications, e.g. decision available, e.g. for the datasets extracted from the GEO trees [12, 13], random forests, RF [14, 15], linear [16], repository, we used the responder/non-responder marks logistic [17], lasso [18, 19], ridge [15, 20] regressions, assigned by the authors of the original communications multi-layer perceptron, MLP [12, 15, 21, 22], support publishing these data. In many instances, the number of vectors machines [12, 13, 15, 23–25], adaptive boosting response groups was more than two and included [26–28], as well as binomial naïve Bayesian [15] method. groups like “partial responders”. However, most High-quality training and validation datasets are re- frequently binary ML-assisted drug response classifiers quired to run both types of the above models. Nowadays are needed that classify patients in only two classes: there is a shortage of clinically annotated molecular data either responders or non-responders [8, 23, 25, 29, 30]. that would help developing ML-assisted diagnostic tools. If a binary classifier is needed, then the number of The datasets available are usually considered too small clinical response groups in the training/validation data- for applying ML [23, 25, 26, 29–33]. Indeed, the figure sets must be also condensed to two, i.e. responders and of dozens or hundreds of annotated biosamples is negli- non-responders. In such case, the groups identified by gible in comparison with ~ 20,000 protein coding genes the authors as partial responders probably can be measured in transcriptomic assays. Intelligent data filter- combined with the responders. This is the case for all ing is, therefore, needed to reduce dimensionality of data current breast cancer datasets, namely GSE25066 [37, [8]. However, a recent approach using dynamic feature 38], GSE41998 [39], GSE20271 [40], GSE50948 [41], GS extraction, or flexible data trimming, can significantly E18728 [48], GSE20181 [49, 59], GSE20194 [50], GSE23 improve performances of ML-based methods for the 988 [51], GSE22358 [52], GSE32646 [53], GSE37946 real-world datasets [15, 25]. [54], GSE42822 [55], GSE59515 [57] and GSE76360 [58]. This study was performed to review available clinically For the TCGA profiles, namely for the low-grade gli- annotated datasets of cancer transcriptomic profiles that oma (TCGA-LGG), lung cancer (TCGA-LC), and uter- may be suitable for applications in ML models. To our ine corpus endothelial carcinoma (TCGA-UEC) datasets, knowledge, this is the largest published collection of and for the acute myeloid leukemia dataset GSE5122 Borisov et al. BMC Medical Genomics 2020, 13(Suppl 8):111 Page 3 of 9 Table 1 Overview of selected transcriptomic datasets of responders/non-responders to cancer chemotherapy, responders (R) vs non-responders (NR) Reference Dataset Disease type Therapy Experimental Number N of cases (R vs NR) Number of ID platform core marker genes (S) [37, 38] GSE25066 Breast cancer with Neoadjuvant taxane + anthracycline