Most Variable Genes and Transcription Factors in Acute Lymphoblastic Leukemia Patients
Interdisciplinary Sciences: Computational Life Sciences
https://doi.org/10.1007/s12539-019-00325-y

ORIGINAL RESEARCH ARTICLE

Anil Kumar Tomar¹ · Rahul Agarwal² · Bishwajit Kundu¹

Received: 24 September 2018 / Revised: 21 January 2019 / Accepted: 26 February 2019
© International Association of Scientists in the Interdisciplinary Areas 2019

Abstract
Acute lymphoblastic leukemia (ALL) is a hematologic tumor caused by cell cycle aberrations due to accumulating genetic disturbances in the expression of transcription factors (TFs), signaling oncogenes and tumor suppressors. Though the survival rate in childhood ALL patients has increased up to 80% with recent medical advances, treatment of adults and childhood relapse cases remains challenging. Here, we have performed bioinformatics analysis of 207 ALL patients’ mRNA expression data retrieved from the ICGC data portal with the objective of marking out the decisive genes and pathways responsible for ALL pathogenesis and aggression. For analysis, the 3361 most variable genes, including 276 transcription factors (out of 16,807 genes), were sorted based on the coefficient of variance. Silhouette width analysis classified the 207 ALL patients into 6 subtypes, and heat map analysis suggests the need for a large, multicenter dataset for non-overlapping subtype classification. Overall, 265 GO terms and 32 KEGG pathways were enriched. The lists were dominated by cancer-associated entries and highlight crucial genes and pathways that can be targeted for designing more specific ALL therapeutics. Differential gene expression analysis identified upregulation of two important genes, JCHAIN and CRLF2, in the dead patients’ cohort, suggesting their possible involvement in different clinical outcomes in ALL patients undergoing the same treatment.
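The variability ranking mentioned in the abstract can be illustrated with a minimal sketch. It assumes that the coefficient of variance is computed per gene as the sample standard deviation divided by the mean across patients, and that expression values are non-negative; the paper does not spell out either point. The function name `most_variable_genes`, the dictionary layout and the toy values are hypothetical, and the 0.8 cut-off is the one stated in the Methods. The authors worked with R/Bioconductor packages, so this Python version is illustrative only.

```python
from statistics import mean, stdev


def most_variable_genes(expr, cv_cutoff=0.8):
    """Return (gene, CV) pairs with CV >= cv_cutoff, highest CV first.

    expr maps a gene symbol to its expression values across samples.
    """
    scored = []
    for gene, values in expr.items():
        m = mean(values)
        if m == 0:
            continue  # CV is undefined when the mean is zero
        cv = stdev(values) / m  # sample standard deviation over mean
        if cv >= cv_cutoff:
            scored.append((gene, cv))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy input with made-up values (not ICGC data): 3 genes x 4 samples.
toy = {
    "geneA": [10, 10, 10, 10],  # constant expression, CV = 0
    "geneB": [1, 1, 1, 9],      # one outlying sample, high CV
    "geneC": [5, 6, 5, 6],      # small fluctuation, low CV
}
selected = most_variable_genes(toy)
print(selected)
```

On the toy input only `geneB` passes the cut-off. Applied to a real matrix, the same filter would be run once over all genes before any clustering step.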
Keywords Gene expression · KEGG pathways · Leukemia · Most variable genes · Subtype classification

Electronic supplementary material The online version of this article (https://doi.org/10.1007/s12539-019-00325-y) contains supplementary material, which is available to authorized users.

* Anil Kumar Tomar
[email protected]

1 Introduction

Leukemia, cancer of blood or bone marrow, is widely classified into four major categories: acute myeloid leukemia (AML), chronic myeloid leukemia (CML), acute lymphocytic leukemia (ALL) and chronic lymphocytic leukemia (CLL). The basic parameters of this classification are the rate of cancer progression and the site of cancer development (http://www.cancercenter.com/). ALL is a blood malignancy characterized by uncontrolled proliferation of lymphoblasts, immature B and/or T cells. Though ALL can occur at any age, it is most common in children and adolescents. B-cell acute lymphocytic leukemia (B-ALL) accounts for about 85% and 75% of childhood ALL and adult ALL cases, respectively, with male predominance, while T-cell acute lymphocytic leukemia (T-ALL) accounts for the remaining cases [1, 2]. With recent medical advances in treatment protocols, the global survival rate in childhood ALL has increased substantially (> 80%); however, the survival rate in adults still remains less than 40% [3–5]. Also, survival in ALL patients who experience a relapse is very poor [6].

Uncontrolled cell proliferation due to loss of cell cycle control is the hallmark of cancer [7, 8]. Chromosomal rearrangements are common genetic abnormalities in B-ALL, e.g., BCR-ABL1, ETV6-RUNX1 and TCF3-PBX1 [9]. Also, aberrant expression of transcription factors associated with lymphoid development, e.g., PAX5, EBF1 and IKZF1, has been reported in more than 60% of B-ALL cases [10, 11]. CRLF2 rearrangements and JAK mutations are also detected in B-ALL cases [12].
The genomic profiling of high-risk ALL patients has identified rearrangements of ABL1, JAK2, PDGFRB, CRLF2 and EPOR, activating mutations of IL7R and FLT3 and deletion of SH2B3 [13]. Regardless of all the advances in the molecular-level understanding of the disease, ALL remains a challenging and aggressive disease due to high genetic heterogeneity among patients, and its prognosis is uncertain in relapse cases. For tailoring effective and specific therapies, it is essential to classify patients and recognize those with a high probability of relapse at the time of disease diagnosis. Comprehensive subtype classification of risk groups of a disease is important for more specific treatment of patients and better therapeutic outcomes.

The International Cancer Genome Consortium (ICGC) coordinates a large number of projects elucidating the genomic changes in various cancer types. Through its data portal (https://dcc.icgc.org/), ICGC has provided open access to the gene expression data of about 70 cancer projects to the research community worldwide. Here, we have performed bioinformatics analysis of 207 ALL patients’ mRNA expression data (16,807 genes) retrieved from the ICGC data portal. The primary objective was to delineate the crucial genes, transcription factors and pathways responsible for ALL pathogenesis. Also, differential gene expression analysis was performed (male vs. female and alive vs. dead) to identify genomic variability in patient subgroups. Male predominance is well known in leukemia, which led us to perform differential gene expression analysis in male vs. female B-ALL patients to identify decisive gene(s), if any, for the high occurrence of ALL in males, while alive vs. dead patient cohorts were chosen for differential gene expression analysis to identify crucial genes that possibly can define disease aggression. Recent studies have shown interest in global profiling of differentially expressed genes (DEGs) in ALL. Li et al. have identified DEGs between diagnostic and relapsed cases with the aim of exploring the underlying mechanism of relapsed ALL [14]. In another study, Sedek et al. have shown aberrant (over)expression of CD73, CD86 and CD304 in a substantial percentage of B-ALL patients [15].

¹ Kusuma School of Biological Sciences, Indian Institute of Technology Delhi, Hauz Khas, New Delhi 110016, India
² Department of Reproductive Biology, All India Institute of Medical Sciences, New Delhi 110029, India

2 Materials and Methods

2.1 Retrieval of Patient Data

Gene expression array matrices and associated clinical data of 207 high-risk B-ALL patients were retrieved from the ICGC data portal (Project: ALL-US; DCC data release; December 7, 2016) and the expression matrix of 16,807 genes for all of the patients was normalized. The sample details are given in Table 1.

Table 1 Sample details

Description                              Percentage (number)
Total samples                            207
Sex                        Male          66.18% (137)
                           Female        33.81% (70)
Vital status               Alive         33.81% (70)
                           Deceased      25.12% (52)
                           No data       41.06% (85)
Age at diagnosis (years)   1–9           35.74% (74)
                           10–19         63.28% (131)
                           20–29         0.96% (2)

2.2 Subtype Classification and Survival Analysis

To predict B-ALL subtypes using ICGC-ALL data, genes were filtered based on the coefficient of variance (CV). A CV value of ≥ 0.8 was used as the cut-off to define most variable genes. Overall, 3361 most variable genes were sorted out for predicting the subtypes. Unsupervised hierarchical clustering was done on these genes across all the 207 patient samples using the Bioconductor R package ConsensusClusterPlus [16]. The final cluster attained consensus after 1000 reiterations. The number of clusters that represented the expression data most significantly was selected by the silhouette method of KMeans clustering, a method that calculates the separation distance between the resulting clusters. This method basically estimates how close each point in one of the clusters is to the points of the neighboring clusters. The value of a silhouette coefficient always lies in the range [−1, 1]. The R package Cluster [17] was used to estimate these coefficients. Samples with positive silhouette coefficient values were selected for further analysis. Top variable genes were obtained for each k = 1 to n subtypes by employing the sam function from the Bioconductor package siggenes [18]. Overall median survival analysis of predicted B-ALL subtypes was performed using the coxph model [19] and a Kaplan–Meier (KM) curve was used for presenting the results [20].

2.3 Pathway Analysis

Gene ontology (GO) annotations and pathway analysis of 3000 most variable genes among 207 ALL patients was performed using the Database for Annotation, Visualization and Integrated Discovery (DAVID) gene enrichment tool with default settings [21]. GO annotations and pathways with FDR < 0.05 were considered significant. This program lists enriched GO terms and pathways as an output along with many other important features.

2.4 Differential Gene Expression

Gene expression array data of 207 samples were pre-processed and genes with more than 50% missing data were excluded. Those genes which have expression greater than 5 (in more than 80% of the samples) were used for further analysis. The patient samples were grouped separately in two different biological categories (as per the information provided in clinical data), male/female and dead/alive. Differential gene expression analysis was performed by employing the gene expression data analysis package limma [22]. To account for multiple testing, the adjusted p value was estimated using the Benjamini–Hochberg method. Genes having log2-fold change values > 1.5 (for up-regulation) and < −1.5 (for down-regulation) were considered as significantly differentially expressed. Differentially expressed genes (DEGs) in predicted subtypes (associated with least and maximum survival) were also identified.

3 Results and Discussion

3.1 Variably Expressed Genes

As assessed by box plot analysis of randomly selected 50 samples, high gene expression variability was observed that

Table 2 List of most variable genes and transcription factors among 207 ALL patients

S. no.   Most variable genes   Most variable TFs
1.       S100P                 FOXC1
2.       SFTPA1                FOXR1
3.       GJB6                  SOX8
4.       LTF                   HOXA5
5.       IFNB1                 NFIB
6.       FOXC1                 IRX3
7.       CD1E                  MYT1L
8.       FOXR1                 SIX3
9.       MS4A3                 SALL4
10.      PTPRZ1                MEIS1
11.      RBFOX2                ZNF521
12.      MT1E                  IRX2
13.      S100A12               SOX11
14.      SOX8                  CEBPD
15.      MT1H                  ID1
16.