IDENTIFICATION OF CELL SURFACE MARKERS WHICH CORRELATE WITH SALL4 IN A B-CELL ACUTE LYMPHOBLASTIC LEUKEMIA WITH t(8;14) DISCOVERED THROUGH BIOINFORMATIC ANALYSIS OF MICROARRAY GENE EXPRESSION DATA
Citable link
http://nrs.harvard.edu/urn-3:HUL.InstRepos:38962442
Terms of Use
This article was downloaded from Harvard University’s DASH repository, and is made available under the terms and conditions applicable to Other Posted Material, as set forth at http://nrs.harvard.edu/urn-3:HUL.InstRepos:dash.current.terms-of-use#LAA
ROBERT PAUL WEINBERG
A Thesis Submitted to the Faculty of
The Harvard Medical School
in Partial Fulfillment of the Requirements
for the Degree of Master of Medical Sciences in Immunology
Harvard University
Boston, Massachusetts.
June 30, 201?
Thesis Advisor: Dr. Li Chai
Department of Pathology
Brigham and Women’s Hospital
77 Francis Street
Boston, MA 02215

Author: Robert Paul Weinberg
Candidate MMSc in Immunology
Harvard Medical School
25 Shattuck Street
Boston, MA 02215
IDENTIFICATION OF CELL SURFACE MARKERS WHICH CORRELATE WITH SALL4 IN A B-CELL ACUTE LYMPHOBLASTIC LEUKEMIA WITH TRANSLOCATION t(8;14) DISCOVERED THROUGH BIOINFORMATICS ANALYSIS OF MICROARRAY GENE EXPRESSION DATA
Abstract
Acute lymphoblastic leukemia (ALL) is the most common leukemia in children, causing significant morbidity and mortality annually in the U.S. We performed exploratory data analysis on several microarray gene expression datasets publicly available in the Gene Expression Omnibus (GEO) repository, which is maintained at the National Center for Biotechnology Information of the National Library of Medicine under the National Institutes of Health (http://ncbi.nlm.nih.gov), looking for novel associations and relationships between the zinc finger transcription factor SALL4 and leukemia.
Through this data mining, we found a subset of B-cell ALL in which multiple cell surface markers have relatively high correlation with SALL4. However, in part due to the small number of samples in this group (n = 13), the results of these analyses must be considered with caution until they can be validated experimentally in the lab with living leukemia cells.
We evaluated the transcriptome changes in these leukemia datasets that are associated with SALL4 expression. The correlation analysis of the microarray data revealed that in a small subset of B-cell ALL comprising 13 samples, a mature B-cell acute lymphoblastic leukemia with a t(8;14) translocation [B-ALL with t(8;14)], multiple cell surface marker genes showed relatively high correlation with SALL4 expression (|r| > 0.60), whereas the 16 other leukemia subsets showed only low to moderate correlation of the same cell surface biomarkers with SALL4 (|r| < 0.45).
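As an illustration of the correlation screen described above, the following minimal Python sketch computes a correlation coefficient between one marker gene and SALL4 within a single leukemia subset and labels it using the |r| cut-offs quoted in this abstract. It assumes Pearson correlation, which the text above does not specify. The thesis work itself used R and Bioconductor tools; the function names, the "intermediate" label for values between the two cut-offs, and the expression values are hypothetical and serve only to show the calculation.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return sxy / math.sqrt(sxx * syy)


def classify_correlation(r, high=0.60, low=0.45):
    """Label |r| with the cut-offs quoted in the abstract.

    The 'intermediate' label is an assumption of this sketch; the abstract
    only describes |r| > 0.60 and |r| < 0.45.
    """
    magnitude = abs(r)
    if magnitude > high:
        return "relatively high"
    if magnitude < low:
        return "low to moderate"
    return "intermediate"


# Synthetic log2 expression values for 13 samples (illustrative only,
# NOT data from GSE13159 or from the thesis).
sall4 = [5.1, 5.4, 6.0, 6.2, 6.8, 7.1, 7.5, 7.9, 8.3, 8.6, 9.0, 9.4, 9.9]
marker = [4.8, 5.9, 5.5, 6.6, 6.1, 7.4, 6.9, 8.2, 7.7, 8.8, 8.5, 9.6, 9.1]

r = pearson_r(sall4, marker)
print(round(r, 2), classify_correlation(r))
```

With only 13 samples, a coefficient computed this way has a wide confidence interval, which is one reason the abstract asks for cautious interpretation.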
The microarray gene expression data were obtained using the Affymetrix gene chip HG-U133Plus2, a 3’ IVT oligonucleotide array for the detection of cDNA synthesized from mRNA extracted from the relevant human cells. The array consists of both Perfect Match and Mismatch probes for the detection and differential analysis of some 23,520 probe-gene pairs. The luminosity readout from the gene chip assay then undergoes a number of statistical manipulations, including standardization and normalization, prior to deposit in the GEO library. Within each dataset the gene expression data are normalized, but special methods must be used to compare data between different datasets from different experiments in the GEO repository. Some datasets include the raw luminosity read-outs.
The majority of this thesis focuses on one specific microarray gene expression dataset, GSE13159, which comprises some 2,096 samples taken from patients with acute lymphoblastic leukemia (ALL), acute myeloid leukemia (AML), chronic lymphocytic leukemia (CLL), chronic myeloid leukemia (CML), myelodysplastic syndrome (MDS) and normal healthy controls.
After finding the B-ALL with t(8;14) subset, in which the cell surface markers correlate highly with SALL4, we used the limma package from the R-based Bioconductor platform to perform a linear regression analysis looking for differential expression of genes in the transcriptome. The analysis reveals that this B-cell leukemia subset has differentially expressed genes distinct from the average pattern of gene expression of the other lymphoblastic leukemias.
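The gene-wise linear model mentioned above can be sketched for the simplest two-group comparison. The block below is not limma: it fits an ordinary least-squares model for a single gene (equivalent to a pooled-variance t-test) and omits limma's empirical Bayes variance moderation and any multiple-testing adjustment. It is written in Python purely for illustration, although the thesis used R, and the function name and expression values are hypothetical.

```python
import math


def two_group_fit(group_a, group_b):
    """Ordinary least-squares fit of y = b0 + b1 * indicator for one gene.

    For a two-group design the slope b1 equals mean(group_a) - mean(group_b),
    i.e. the log2 fold change when the inputs are log2 expression values.
    Returns (log_fold_change, t_statistic, degrees_of_freedom).
    """
    n_a, n_b = len(group_a), len(group_b)
    if n_a < 2 or n_b < 2:
        raise ValueError("each group needs at least two samples")
    mean_a = sum(group_a) / n_a
    mean_b = sum(group_b) / n_b
    # Residual sum of squares around each group mean.
    rss = sum((v - mean_a) ** 2 for v in group_a)
    rss += sum((v - mean_b) ** 2 for v in group_b)
    df = n_a + n_b - 2
    residual_variance = rss / df
    std_error = math.sqrt(residual_variance * (1.0 / n_a + 1.0 / n_b))
    if std_error == 0:
        raise ValueError("zero residual variance; t-statistic is undefined")
    log_fc = mean_a - mean_b
    return log_fc, log_fc / std_error, df


# Synthetic log2 expression values for one gene (illustrative only).
subset_samples = [8.0, 9.0, 10.0]
other_samples = [5.0, 6.0, 7.0]

log_fc, t_stat, df = two_group_fit(subset_samples, other_samples)
print(log_fc, round(t_stat, 3), df)
```

In a real analysis this fit would be repeated for every probe set on the array, which is why a moderated statistic and correction for multiple testing matter, especially with a group as small as 13 samples.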
Extensive bioinformatic analyses were carried out on this small group of samples, and the limitations of these analyses are examined further in the discussion section. Some preliminary functional genomic analysis was carried out on these differentially expressed genes (DEGs), which were assigned to specific gene ontologies (GO) and KEGG pathways, including the hematopoietic pathway. These supplementary data can be found in the attached appendices.
There is some overlap of the Gene Ontologies and the KEGG pathways, including the hematopoietic pathway, among the 17 leukemia / myelodysplastic groups analyzed, but the B-ALL with t(8;14) showed differences from the other leukemias.
SALL4 is a zinc-finger transcription factor important in maintaining the pluripotency of embryonic and hematopoietic stem cells as evidenced in transgenic animal models and genetically modified cell lines with either deletion of SALL4 or forced over-expression of SALL4. Experimental evidence also suggests that SALL4 plays an important role in leukemogenesis as well as other oncogenic processes in other neoplasms.
Potentially, the association found between these specific cell surface biomarkers and SALL4 expression in this B-ALL with t(8;14) subset may facilitate future research on SALL4. The iPathway tool (www.advaitabio.com) was used to further characterize this B-ALL t(8;14) subset. It identified 549 differentially expressed genes (DEGs) relative to the normal samples, out of a total of 20,388 genes with measured expression. These 549 DEGs have a significant impact on 34 biological pathways by KEGG analysis and show significant enrichment for 1,431 Gene Ontology (GO) terms, 237 predicted miRNAs and 57 diseases, based on uncorrected p-values. The DEGs were analyzed in the context of pathways from the Kyoto Encyclopedia of Genes and Genomes (KEGG), the Gene Ontology Consortium database (GO), and the miRBase and TargetScan databases. Some of the iPathway results can be found in the appendices. These results must be interpreted with caution given the significant limitations of this study.
Table of Contents
Abstract
Table of Contents
List of Figures
List of Tables
Acknowledgments
Chapter 1: Background
1.1 The pluripotency-maintaining transcription factor SALL4
1.2 The genetics of B-cell ALL with the t(8;14) translocation
Chapter 2: Data and Methods
2.1 Microarray analysis of gene expression
2.2 Bioinformatics and computational biology tools
2.3 R programming language and Bioconductor software tools
2.4 Data and computational results
2.5 Brief discussion of results
Chapter 3: Discussion and Perspectives
3.1 Discussion
3.2 Limitations
3.3 Future research paths
Bibliography
Appendices
Appendix A: Correlation of cell surface markers for other groups
Appendix B: Biology and genetics of acute lymphoblastic leukemias
Appendix C: Top 100 differentially expressed genes, B-ALL t(8;14) v normal
Appendix D: Top 100 DEGs for B-ALL without t(8;14)
Appendix E: DEGs and Gene Set Enrichment Analysis
Appendix F: DEGs with greater than 2-fold change from normal
Appendix G: KEGG pathway analysis on DEGs
Appendix H: Gene Ontology analysis of DEGs
Appendix I: Focused analysis of KEGG hematopoietic pathway
Appendix J: Putative modulating microRNAs based on DEG expression
Appendix K: Relative ranks of DEGs among 3 ALL groups
Appendix L: Master list of DEGs with greater than 2 log-fold change
Appendix M: Master list of DEGs found in B-ALL with t(8;14)
Appendix N: Master list of DEGs found in pooled B-ALL without t(8;14)
Appendix O: Master list of DEGs found in T-ALL
Appendix P: Comprehensive KEGG pathway analysis: biologic pathways
Appendix Q: Curriculum Vitae of Dr Robert P. Weinberg, DO, JD
LIST OF FIGURES
Figure 1: SALL4 expressed in GSE13159 (2,096 samples)
Figure 2: SALL4 expressed in the ALL leukemias
Figure 3: SALL4 expressed in B-ALL with t(8;14)
Figure 4: SALL4 expressed in B-ALL without t(8;14)
Figure 5: SALL4 expressed in T-ALL leukemia
Figure 6: SALL4 expression in AML leukemia
Figure 7: SALL4 expressed in CLL leukemia
Figure 8: SALL4 expressed in CML leukemia
Figure 9: SALL4 expression in MDS
Figure 10: SALL4 expression in healthy cells
Figure 11: Scatter plot showing correlation between CD96 and SALL4
Figure 12: Scatter plot showing correlation between CD36 and SALL4
Figure 13: Scatter plot showing correlation between CD4 and SALL4
Figure 14: Scatter plot showing correlation between CD32 and SALL4
Figure 15: Scatter plot showing correlation between CD2 and SALL4
Figure 16: Scatter plot showing correlation between CD7 and SALL4
Figure 17: Mean expression of SALL4 in the 8 ALL data sets
LIST OF TABLES
Table 1: List of 2,096 leukemia patient samples in GSE13159 microarray data
Table 2: Correlation coefficients: 29 cell surface markers vs SALL4 in ALL
Table 3: 7 cell surface markers most highly correlated with SALL4 expression
Table 4: 35 SALL4-associated cell surface markers in the top 400 correlated genes
Table 5: Top 300 genes positively correlated with SALL4: r > 0.67
Acknowledgements
I would like to thank my thesis advisor Li Chai for accepting me as an advisee, even though bioinformatics lies outside her field of expertise. I would also like to thank my thesis co-advisor Rafael Irizarry for his invaluable suggestions and advice on bioinformatic analysis. I am most thankful to my director Shiv Pillai for his encouragement and mentorship throughout the project and his persevering belief that I could accomplish this. I am also thankful to Michael Carroll for his support and encouragement when things became challenging.
I would also like to acknowledge the following institutions for their software tools used in these genomics analyses: R Foundation: R programming language; RStudio: IDE platform; Broad Institute: GenomeSpace; GenePattern, Cytoscape; biomaRt; PerkinElmer: TIBCO Spotfire; and Advaita Bio for the use of the iPathway tool for KEGG & GO analysis.
I would also like to thank Yunpeng Liu for providing me with invaluable tutoring on programming in R and using Bioconductor. Finally I would like to thank my wife and family for their support and steadfastness in spite of my crankiness and irritability during stressful periods of my research.
This work was conducted with support from Students in the Master of Medical Sciences in Immunology program of Harvard Medical School. The content is solely the responsibility of the author and does not necessarily represent the official views of Harvard University and its affiliated academic health care centers.
"Success is not final, failure is not fatal, it is the courage to continue that counts..."
- Winston Churchill
Chapter 1. Background
1.1. The transcription factor SALL4
SALL genes (Spalt-like gene family, homologous to a homeotic gene in D. melanogaster) encode zinc-finger transcription factors, with 4 human family members (SALL1, 2, 3, 4) [1]. The SALL4 gene encodes 3 isoforms, called A, B and C, produced by alternative splicing. SALL4 has multiple interactions with different co-factors and with epigenetic complexes [2-5]. Early studies identified SALL4 as an important embryonic stem cell (ESC) factor for the maintenance and renewal of pluripotency [6-8]. SALL4 has 3 clusters of zinc fingers near the carboxy terminus and one zinc finger at the amino terminus [9].
SALL4 may form both homo- and heterodimers through interaction of the conserved glutamine-rich region, and it also contains a nuclear localization signal [10,11]. SALL4 binds retinoblastoma-binding protein 4 (RBBP4), a component of the multiprotein nucleosome remodeling and histone deacetylase (NuRD) complex, along with several additional binding partners including chromodomain-helicase-DNA-binding proteins (CHD3/4, or Mi-2α/β), metastasis-associated proteins (MTA), methyl-CpG-binding domain proteins (MBD2 or MBD3), and histone deacetylases (HDAC1 and HDAC2) [12,13]. This complex effectively represses transcription through its localization to heterochromatin. SALL4 also shows transcriptional repressor activity through its binding to other epigenetic modifiers such as histone lysine-specific demethylase 1 (LSD1), which is often associated with NuRD [14]. SALL4 also induces gene activity through its association with the mixed lineage leukemia protein (MLL), which has histone H3 lysine 4 (H3K4) trimethylation activity [15-18]. SALL4 also binds the stem cell octamer-binding transcription factor 4 (Oct4) and the pluripotency factors Nanog and sex determining region Y (SRY)-box 2 protein (Sox2) [19-22]. These complexes affect transcription in embryonic stem cell regulatory circuits. SALL4 has additional binding partners including the T-box 5 protein (Tbx5), promyelocytic leukemia zinc finger protein (PLZF), Rad50, and β-catenin, a downstream effector of Wnt signaling [23-25]. Some of these binding partners were identified by mass spectrometry or co-immunoprecipitation.
A key regulatory role for SALL4 involves phosphatidylinositol-3,4,5-trisphosphate 3-phosphatase (PTEN). PTEN is a vital tumor suppressor which induces apoptosis. By binding to the PTEN promoter, SALL4 recruits the NuRD complex to repress PTEN expression, leading to uncontrolled proliferation of the cells [17]. SALL4 is important in maintaining the pluripotency of both mouse and human embryos, and the loss of SALL4 results in differentiation of these cells [26-30]. Some of its activity may be due to the down-regulation of Pou5f1 (encoding Oct4) gene expression along with the up-regulation of caudal-type homeobox 2 (Cdx2) expression [31,32]. The transcriptional regulatory network for pluripotency includes SALL4, Oct4, Nanog and Sox2 [6,8,19-22].
SALL4 knock-out mice exhibit congenital malformations of the hand and eyes, which are analogous to the Okihiro/Duane-radial ray syndrome in human patients with SALL4 mutation [33,34]. SALL4 is upregulated in several human cancers such as acute myeloid leukemia (AML), B-cell acute lymphoblastic leukemia (B-ALL), germ cell tumors, gastric cancer, breast cancer, hepatocellular carcinoma (HCC), lung cancer, and glioma (GBM) [35-50]. SALL4 is associated with worse prognosis and outcome in HCC, metastatic endometrial cancer, colorectal cancer, and esophageal squamous cell carcinoma. In B-ALL, SALL4 expression is associated with hypomethylation of the intron 1 region [51]. SALL4 is directly activated by the signal transducer and activator of transcription 3 (STAT3) [52]. Canonical Wnt signaling activates SALL4 expression during development and in cancer [53,54]. When SALL4 is overexpressed in mice, they develop an MDS-like syndrome and AML [55].
High-risk MDS patients have elevated levels of SALL4 expression [56]. Knocking down SALL4 expression in leukemia cells with short hairpin RNA, or treating the cells with a peptide that mimics the N-terminal 12 amino acids of SALL4, results in cell death by apoptosis [57].
1.2. The genetics of B-cell ALL with the t(8;14) translocation
The biology and genetics of acute lymphoblastic leukemias may be found in Appendix B [references 58-125]. The t(8;14)(q11.2;q32) translocation is relatively rare in B-ALL but occurs more often in patients with Down syndrome [126-128]. Patients with t(8;14)(q11.2;q32) are estimated to account for 0.2% of children with B-cell ALL (B-ALL), while the Mitelman database reports a prevalence of 0.7% in pediatric ALL [129]. Burkitt's lymphoma [104] and B-cell acute lymphoblastic leukemias (B-ALL) [105] are often characterized by a reciprocal chromosomal translocation t(8;14), which juxtaposes the immunoglobulin heavy chain locus on chromosome 14 [106, 107] with the c-myc oncogene on chromosome 8 [108-110]. Cloning and sequencing of the t(8;14) breakpoints in a Burkitt's lymphoma cell line and a pre-B ALL cell line have revealed recombination of the JH region on chromosome 14 with a specific region on chromosome 8 containing homologous signal sequences recognized by the V-D-J recombinase [111-113].
V-D-J recombinase is also involved in the chromosomal translocations in B-cell chronic lymphocytic leukemias (CLL) with the t(11;14) translocation and in follicular lymphomas with the t(14;18) translocation [114, 115]. Eighty percent of Burkitt lymphomas carry a t(8;14)(q24;q32) translocation, which juxtaposes the immunoglobulin heavy chain locus with the MYC proto-oncogene, resulting in its activation. A smaller number of Burkitt lymphomas display the t(2;8) or the t(8;22) translocations, which juxtapose the MYC gene with the κ or λ immunoglobulin light chain loci. Eighty-five percent of follicular lymphomas carry a t(14;18)(q32;q21) translocation resulting in the overexpression of Bcl-2 protein, which has an inhibitory effect on apoptosis [121, 122].
Translocations that dysregulate Bcl-2 produce a more indolent malignancy than translocations that dysregulate c-MYC. The genetic rearrangement with Bcl-2 presumably occurs during improper DH-JH joining of the IgH gene [123, 124]. A prominent feature in these lymphomas as well as numerous other B-cell malignancies, including the follicular lymphomas, acute lymphoblastic leukemias (ALL) and chronic lymphocytic leukemias (CLL), is that the abnormal translocation results in the activation of MYC because the fusion product is under the regulatory control of the immunoglobulin elements [116-118]. There are similar chromosomal translocations involving the T-cell receptor in T-cell malignancies [119, 120]. In each of these translocation-induced malignancies, an oncogene or putative oncogene is deregulated when it comes under the regulatory control of the juxtaposed immunoglobulin control elements. This mechanism of oncogenesis involves the V-D-J joining recombinases, which normally play an important role in generating receptor variability and diversity in the B-cell and T-cell receptors, but here act erroneously to generate chromosomal breakage and translocation.
One case report describes an acute lymphoblastic leukemia carrying t(14;18)/BCL2, t(8;14)/cMYC and also t(1;2)/FCGR2B [125]. This case was classified as acute lymphoblastic leukemia FAB subtype L2 (ALL-L2), in which, despite immature morphology, the leukemic blasts had a mature B-cell phenotype: positive for B-cell markers such as CD19, CD20, HLA-DR and surface immunoglobulin, but negative for the more primitive markers CD34, TdT and CD10. Consistent with their mature B-cell phenotype, the cells were also negative for the megakaryocytic marker CD41, the myeloid markers CD13 and CD33, and the T-cell markers CD3 and CD7. In terms of the neoplastic hallmarks, it is believed that the cMYC and FCGR2B would tend to increase cell proliferative capacity, while the