USE OF CHRONIC LYMPHOCYTIC LEUKEMIA RESEARCH CONSORTIUM DATA REPOSITORY AND EXPRESSION OMNIBUS TO GENERATE AND TEST HYPOTHESES FOR BIOMARKER IDENTIFICATION AND DEVELOPMENT

THESIS

Presented in Partial Fulfillment of the Requirements for

the Degree Master of Science in the Graduate

School of The Ohio State University

By

Kristin Chelsea Keen, B.A.

*****

The Ohio State University 2009

Thesis Committee:

Professor Kun Huang, Adviser

Professor Philip Payne

Approved by

______

Adviser, Pathology Graduate Program

ABSTRACT

Chronic lymphocytic leukemia (CLL) is the most common adult leukemia in the United States. There is no known cure for CLL. While biomarkers such as CD38, IGHV, and ZAP-70 have been found to correlate with disease progression, there is a need for further validation of these biomarkers as well as new biomarker discovery. In this study, publicly available data from NCBI’s Gene Expression Omnibus was used to identify genes with expression correlated to ZAP-70 and CD38 mRNA expression patterns. We also used the large volume of data from the Chronic Lymphocytic Leukemia Research Consortium (CRC) to search for novel correlations between clinical markers and disease progression and treatment outcome. We found several hundred genes with expression patterns correlated to ZAP-70. We also found several clinical, genetic, biologic, and immunologic CRC data fields that were significantly correlated and at least weakly to strongly associated.


Dedicated to my family


ACKNOWLEDGEMENTS

I thank my adviser, Kun Huang, for his positive support and trust in me as a student and a researcher.

I thank my committee member, Philip Payne, for his tangible enthusiasm and thoughtful advice.

I thank Gulcin Ozer for contributing her time, effort, and experience to the correlation analysis of Aim 1.

I thank Cenny Taslim for helping me with Matlab when I was in a pinch.

Lastly, I thank my husband, Jared Circle, and mother, Valerie Keen, for their hours of time doing more than their share of caring for my daughter while I finished this project. I couldn’t have done this without them.


VITA

2003……………………………………………….B.A., Life Sciences, Otterbein College

FIELDS OF STUDY

Major Field: Pathology


TABLE OF CONTENTS

Abstract

Dedication

Acknowledgements

Vita

List of Figures

Chapters:

1. Introduction

1.1 Specific Aims

2. Background and Significance

2.1 Chronic Lymphocytic Leukemia

2.2 The CLL Research Consortium

2.3 Bioinformatics and CLL

3. Methods

3.1 Methods for Specific Aim 1

3.2 Methods for Specific Aim 2

4. Results

4.1 Results for Specific Aim 1

4.2 Results for Specific Aim 2

5. Conclusions

6. Discussion

References

Appendix A: Tables and Figures for CRC data repository and GEO analysis

LIST OF FIGURES

Figure 1: Valid and meaningful hypotheses from [1]

Figure 2: Flow chart for Aim 1 methods

Figure 3: Flow chart for Aim 2 methods

Figure 4: Fields queried from CRC research data repository for Aim 1

Figure 5: Fields analyzed and bins for these fields

Figure 6: Comparison pairs between CRC query datasheets and methodology for combining datasheets for analysis

Figure 7: GDS dataset information summary

Figure 8: Correlated CRC data fields, p ≤ 0.05, phi ≥ 0.3

Figure 9: Correlation gene lists for ZAP-70 and CD38 for GDS1388, GDS1454, and GDS2501, threshold 0.4

Figure 10: Randomness calculation for ZAP-70 and CD38 gene list intersections

Figure 11: Annotated IPA gene lists for correlated GDS2501 gene lists and intersected gene lists for ZAP-70 and CD38

Figure 12: IPA pathways showing only connected genes, for combined gene lists for ZAP-70

Figure 13: IPA pathway showing only connected genes, for combined gene lists for CD38

Figure 14: Ohio State University Medical Center clinical reference ranges, November 2008

CHAPTER 1

INTRODUCTION

Chronic lymphocytic leukemia (CLL) is the most common adult leukemia in the United States [2]. Although some leukemias and lymphomas can be cured, there is no known cure for CLL [3]. While some biomarkers have been found to correlate with disease progression, such as CD38, IGHV, and ZAP-70, there is a need for further validation of these biomarkers as well as new biomarker discovery [4]. In this study, publicly available gene expression data from NCBI’s Gene Expression Omnibus is used to identify genes with expression correlated to ZAP-70 and CD38 mRNA expression patterns. We also utilize vast amounts of data from the Chronic Lymphocytic Leukemia Research Consortium (CRC) to search for novel correlations between clinical markers and disease progression and treatment outcome. Our aims are as follows:

Specific Aim 1: Test previously formed hypotheses, generated using a knowledge engineering (KE)-based approach, using CRC data and correlation analysis as a novel method of biomarker discovery.

Specific Aim 2: Identify genes whose mRNA expression is correlated with known CLL biomarkers ZAP-70 and CD38.


Based on previous studies using gene expression correlation and genelist intersection from multiple datasets, we expect expression of several genes to correlate with ZAP-70 and CD38 in multiple datasets [5, 6]. We expect that our correlation analysis of CRC data will confirm the KE-generated hypotheses and discover new biomarkers for disease progression.


CHAPTER 2

BACKGROUND AND SIGNIFICANCE

2.1 Chronic Lymphocytic Leukemia

Chronic lymphocytic leukemia (CLL) is the most common adult leukemia in the United States. Nearly 100,000 Americans live with CLL, most of them over fifty years old. Rates of CLL incidence are increasing, and there is no known cure [7]. CLL usually develops slowly, and the symptoms of the disease are similar to many other more common conditions. This increases the difficulty in diagnosis, and it is only after a battery of testing that most patients find out that they have CLL.

CLL is diagnosed through blood tests including white blood cell count and complete blood count. Once a diagnosis is made, staging must be done to determine whether the disease is in the beginning, intermediate, or advanced stage. Some patients remain in the beginning stages of the disease progression and are able to live long lives, never having to deal with many of the disease’s worst symptoms [8, 9]. This results in two distinct groups of patients: those with advancing disease and those with disease that doesn’t seem to progress. Those with the non-progressive manifestation of the disease seem not to need treatment until the disease begins progressing and they become more symptomatic [4].


Early determination of which group a patient belongs in, progressive or non-progressive CLL, would serve an important function. If this information could be determined in advance, it could enable a better course of action for disease management and treatment [10]. The end result would be improvement of patients’ conditions and possibly saved lives. Biomarkers have proven helpful in identifying patient groups for other diseases [9]. ZAP-70, CD38, and IGHV have been named in multiple studies as biomarkers for CLL disease progression [4, 11, 12]. A positive ZAP-70 test means that a patient would be placed in the progressive group. While this is progress toward earlier characterization of an individual’s disease state, ZAP-70 testing only yields definitive results if conducted during later, symptomatic phases of disease progression [3]. A more useful approach would be to identify biomarkers or tests that can determine, early in the course of the disease, the likelihood that a patient will soon stop responding to treatment or will begin more rapid disease progression. A large-scale study of thousands of patients with CLL has been done; with it, a database was created that contains hundreds of data fields. This database has the potential to lead to the identification of new biomarkers or tests that can assist in determining disease progression, early disease state, refractoriness, and patient response to treatment.

2.2 The CLL Research Consortium

The CLL Research Consortium (CRC) is a multi-site research group funded by the National Institute whose primary function is to conduct studies of the genetic, biochemical, and immunologic origins of CLL [13, 14]. The CRC has conducted studies that have been responsible for important new insights into CLL pathophysiology and treatment as well as multi-site group data repository management [10, 13-16]. The CRC’s goals are to pursue new treatments for CLL and to examine phenotypic and biomarker relationships specifically geared toward improving staging and predicting patient response to disease treatment. To assist in reaching these goals, the CRC maintains a data repository containing genetic, biochemical, immunologic, and clinical patient data for the more than 4,000 patients who have agreed to have their case information contributed to the CRC [17]. Because of the volume of data compiled from thousands of patients in hundreds of separate data fields, the CRC data repository is the ideal initial resource to consult for potential biomarker discovery for CLL, provided a positive control for analysis can be effectively identified and utilized.

2.3 Bioinformatics and CLL

Conceptual knowledge acquisition methods involve combining basic units of information and meaningful relationships between those basic units [18, 19]. This information could come from virtually any verifiable source, including journal articles and abstracts, databases of information, or experts within a field. One well-described method for acquiring, refining, and validating knowledge is called knowledge engineering (KE). In practice, there are many theories and methods for performing conceptual knowledge acquisition and KE. In our case, basic factual units from the CRC database were mapped to ontologies, or collections of domain-specific knowledge, to generate seventy hypotheses. These hypotheses were examined by a team of CLL field experts, who determined that nine of those hypotheses were valid and meaningful (Figure 1) [1]. In a large database such as the CRC data repository, having a starting point and a positive control when beginning an analysis can allow for more intelligent querying and a better perspective for viewing results. If these nine hypotheses are found to be true using the CRC data repository, then this is further support for the relevance of this ontology-anchored approach to conceptual knowledge engineering [1].

In order to claim a link between a biomarker and disease progression or patient response to treatment, it is essential to first detect a correlation between pairs of data fields [20-22]. A basic type of correlation is Spearman’s correlation coefficient. When working with exceptionally large amounts of data, it becomes necessary to “bin” the data, or divide it into separate and more manageable groups for analysis. These bins provide the starting points for the analysis needed to detect a correlation; if a correlation is found, it is then possible to move forward into a more detailed examination of the relationship between the biomarker and disease progression in patients.

These bioinformatic techniques can be applied to CLL to determine a link between any of a number of biomarkers and data fields representing disease progression or response to treatment. It is our eventual goal to evaluate this data for evidence of change over time to determine whether there is a correlation between that change (delta) and disease progression or response to treatment. This application is intended to lay a foundation for this work. However, upon beginning a study such as this one, technical issues come to the forefront. These limitations may include visit windowing due to non-aligning dates, use of a convenience sample, and novel hypotheses that are less common and thus more difficult to detect in a large sample such as the CRC data repository.


CHAPTER 3

METHODS

Flowcharts for Aim 1 and Aim 2 can be found in Figures 2 and 3 respectively.

3.1 Methods for Specific Aim 1

Specific Aim 1: Test previously formed hypotheses, generated using a knowledge engineering (KE)-based approach, using CRC data and correlation analysis as a novel method of biomarker discovery.

Data source

Data to be analyzed was taken from the CRC research data repository, maintained at University of California, San Diego, by the CRC Biomedical Informatics Program [17].

Data selection and preparation

A query was made for patients with multiple center visits for the 182 clinical, biological, genetic, and immunologic fields (Figure 4). The output file was delivered in Excel format as five different files: BsData_table, Clinical_Tablesv3, cytogenetic, Facs_table, and Registration Table v2.

Positive controls


Nine valid, meaningful conceptual knowledge hypotheses which were developed in a previous study using CRC research data repository fields [1] were used as a positive control for Aim 1 (Figure 1).

Binning process

For each data field, data was sorted using Microsoft Excel's data sort function to display in order of ascending values. Empty cells and undefined or unknown entries were removed from consideration.

If the analyzed data was determined to be categorical, bins were created for each category of viable data. In some cases, more than eight individual categories were present. In these cases, bins were created for ranges of categories.

To further assist in defining parameters for the data sets, the numerical data was checked against The Ohio State University Medical Center (OSUMC) clinical data reference ranges (Figure 14). If the data field corresponded to an existing OSUMC reference range, this range was used to create bins for below normal, normal, and above normal. If the OSUMC reference range was gender-specific, the normal ranges for male and female were combined to create a larger normal range. If the data contained very high values (defined as more than ten times the highest normal value), an additional bin was created for these very high values. If there was no OSUMC reference range, the numerical data was divided into four bins of as close to equal size as possible. A few exceptions were made where an extreme lack of variation in the data did not allow for four equal-sized bins.

Figure 5 lists data fields analyzed and bins for each data field; exceptions are noted.
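The binning rules for numerical fields can be illustrated with a minimal sketch. The thesis performed this step in Microsoft Excel; the Python below is only an approximation of the rules as described, and the function names and the example ranges in the usage note are hypothetical.

```python
def bin_with_reference(value, low, high):
    """Bin a numeric value against a clinical reference range [low, high].

    Values above ten times the upper limit go into a separate bin,
    following the 'very high' rule described in the text.
    """
    if value < low:
        return "below normal"
    if value <= high:
        return "normal"
    if value <= 10 * high:
        return "above normal"
    return "very high"


def combined_range(male, female):
    """Merge gender-specific (low, high) ranges into one wider normal range."""
    return (min(male[0], female[0]), max(male[1], female[1]))


def quartile_edges(values):
    """Three cut points that split the sorted values into four near-equal bins.

    Assumption: used only when no reference range exists for the field.
    """
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(n * k) // 4] for k in (1, 2, 3)]


def bin_by_quartile(value, edges):
    """Return a bin index from 0 to 3 for a value given three cut points."""
    return sum(value >= edge for edge in edges)
```

For instance, with made-up male and female ranges of (13.5, 17.5) and (12.0, 15.5), `combined_range` yields (12.0, 17.5), and each value can then be labeled with `bin_with_reference`.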

Combining data


Several factors made it prohibitive to create a single, combined data matrix with the assembled information. In some cases, dates did not align perfectly between different Excel files. The problem of visit windowing is significant for large-scale research data repositories such as the CRC data repository and warrants more in-depth examination in future studies. In addition, not all patients had entries for all data fields. These factors made it unreasonable to create one combined data matrix; however, it was possible to create a combined data matrix for each comparison pair of Excel files. Because five separate Excel files were used, ten pairwise comparisons of the datasets were needed (Figure 6). The use of comparison pairs also allowed different rules to be applied to comparisons between different types of data (Figure 6) to accommodate misalignments of dates.
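The count of ten comparison pairs follows from choosing two of the five files. The sketch below enumerates those pairs and shows one possible windowing rule. The actual rules used are given in Figure 6 and are not reproduced here, so the nearest-visit-within-a-tolerance rule, the row layout, and the function names are all assumptions made for illustration.

```python
from datetime import date
from itertools import combinations

SHEETS = [
    "BsData_table",
    "Clinical_Tablesv3",
    "cytogenetic",
    "Facs_table",
    "Registration Table v2",
]


def comparison_pairs(sheets):
    """All unordered pairs of datasheets (five sheets give ten pairs)."""
    return list(combinations(sheets, 2))


def match_visits(rows_a, rows_b, max_days):
    """Pair each row in rows_a with the closest-dated row in rows_b.

    Rows are (patient_id, visit_date, value) tuples. A pair is kept only
    when both rows belong to the same patient and the visit dates differ
    by at most max_days. This rule is hypothetical, not the thesis's.
    """
    matched = []
    for pid_a, date_a, val_a in rows_a:
        candidates = [
            (abs((date_b - date_a).days), val_b)
            for pid_b, date_b, val_b in rows_b
            if pid_b == pid_a
        ]
        candidates = [c for c in candidates if c[0] <= max_days]
        if candidates:
            _, val_b = min(candidates, key=lambda c: c[0])
            matched.append((pid_a, val_a, val_b))
    return matched
```

A wider `max_days` keeps more patients in the combined matrix at the cost of pairing measurements taken further apart, which is the trade-off that makes visit windowing difficult.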

Correlation analysis

Three statistics were calculated for each comparison between bins and fields: residual, phi coefficient, and p-value. Contingency tables and Fisher’s exact test were used to calculate these values.
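As an illustration of these three statistics, the sketch below computes them for a single 2x2 contingency table using only the Python standard library. It assumes each bin-to-bin comparison is collapsed to a 2x2 table (in the bin or not) and uses Pearson residuals; the thesis does not state which residual was used or which software performed the test, so both choices are assumptions.

```python
from math import comb, sqrt


def expected_counts(table):
    """Expected cell counts under independence for an r x c table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    return [[r * c / total for c in col_totals] for r in row_totals]


def pearson_residuals(table):
    """(observed - expected) / sqrt(expected) for every cell."""
    expected = expected_counts(table)
    return [
        [(o - e) / sqrt(e) for o, e in zip(obs_row, exp_row)]
        for obs_row, exp_row in zip(table, expected)
    ]


def phi_coefficient(table):
    """Phi coefficient of association for a 2x2 table [[a, b], [c, d]]."""
    (a, b), (c, d) = table
    return (a * d - b * c) / sqrt((a + b) * (c + d) * (a + c) * (b + d))


def fisher_exact_two_sided(table):
    """Two-sided Fisher exact p-value for a 2x2 table.

    Sums the hypergeometric probabilities of all tables with the same
    margins that are no more likely than the observed table.
    """
    (a, b), (c, d) = table
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2

    def prob(x):
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    p_observed = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    p_value = sum(
        prob(x)
        for x in range(lowest, highest + 1)
        if prob(x) <= p_observed * (1 + 1e-7)
    )
    return min(1.0, p_value)
```

Under the thresholds used in this study, a pair of fields would be reported when the p-value is at most 0.05 and phi is at least 0.3.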

3.2 Methods for Specific Aim 2

Specific Aim 2: Identify genes whose mRNA expression is correlated with known CLL biomarkers ZAP-70 and CD38.

Data selection

GEO Datasets was queried using the search term "chronic lymphocytic leukemia" [23]. Five GDS dataset results were generated: GDS2676, GDS2643, GDS2501, GDS1454, and GDS1388. Those datasets with more than one cell type or comparison were eliminated from this study; the datasets left were GDS2501, GDS1454, and GDS1388. SOFT files for these GDS datasets were downloaded for correlation analysis. Figure 7 contains platform type and comparisons made within each selected GDS dataset.

Correlation

The Pearson correlation coefficient (PCC) was calculated for genes CD38 and ZAP-70 on the three selected GDS datasets using a MATLAB script. The script calculates the PCC between each gene and one selected gene (in this case, CD38 or ZAP-70) based on gene expression values. The resulting output *.txt file consists of a gene list containing only those genes with a PCC absolute value of 0.4 or higher. The gene symbols are displayed next to their respective PCC values in descending order. Because there were three GDS datasets and two selected genes of interest, the PCC analysis was conducted six times.
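The correlation step can be sketched as follows. The original analysis used a MATLAB script that is not reproduced in this thesis, so this Python version is only an approximation of the described behavior; the function names and the dictionary input format are assumptions.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant profile has no defined correlation
    return cov / sqrt(var_x * var_y)


def correlated_genes(expression, target, threshold=0.4):
    """List (gene, PCC) pairs with |PCC| >= threshold against the target gene.

    expression maps a gene symbol to its expression values across samples.
    Results are sorted by PCC in descending order, as in the output file
    described in the text.
    """
    reference = expression[target]
    hits = []
    for gene, values in expression.items():
        if gene == target:
            continue
        r = pearson(reference, values)
        if abs(r) >= threshold:
            hits.append((gene, r))
    return sorted(hits, key=lambda item: item[1], reverse=True)
```

Running `correlated_genes` once per dataset and per gene of interest gives the six gene lists described above.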

Intersection

To determine which gene symbols could be found on multiple correlation genelists, genelist intersection was performed using a MATLAB script. Genes with a PCC absolute value of less than 0.4 were excluded from the results [6]. The script generates an output *.txt file viewable in Microsoft Excel as a table. The table uses each gene symbol as a row header and each GDS dataset number as a column header. Cells within the table contain either a PCC value or a “-”. The “-” symbol indicates that the gene was not correlated with the selected gene at a PCC of 0.4 or better. This analysis was performed twice, once for CD38 and once for ZAP-70.
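The intersection table can be approximated with the short sketch below. It is not the MATLAB script used in the study; the function names are hypothetical, and the gene symbols and PCC values in the usage note are invented for illustration.

```python
def intersect_genelists(lists, threshold=0.4):
    """Build a gene-by-dataset table of PCC values.

    lists maps a dataset id to a dict of gene symbol -> PCC. A cell holds
    the PCC when |PCC| >= threshold, and "-" otherwise.
    """
    datasets = sorted(lists)
    genes = sorted(set().union(*(lists[d].keys() for d in datasets)))
    table = {}
    for gene in genes:
        row = []
        for dataset in datasets:
            r = lists[dataset].get(gene)
            row.append(r if r is not None and abs(r) >= threshold else "-")
        table[gene] = row
    return datasets, table


def shared_same_direction(table):
    """Genes correlated in every dataset with the same sign."""
    shared = []
    for gene, row in table.items():
        if "-" in row:
            continue
        if all(v > 0 for v in row) or all(v < 0 for v in row):
            shared.append(gene)
    return shared
```

For example, given made-up lists for GDS1388 and GDS1454, a gene with PCC 0.62 in one and 0.41 in the other is kept, while a gene with -0.45 in one and 0.48 in the other is dropped because the directions disagree.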

Randomness Test for Intersection

In order to calculate the number of genes on the intersection gene list which could be attributed to randomness, a randomness test was done. The percentage of genes on the gene chip which were correlated at PCC = 0.4 or better was calculated. The percentages for intersected datasets GDS1388 and GDS1454 were multiplied to arrive at the percentage of genes which could be attributed to randomness.
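The calculation above amounts to multiplying the two correlated fractions and scaling by the number of genes on the chip. A minimal sketch follows, assuming that the two datasets are independent and share the same chip; the function name and the counts in the usage note are hypothetical and are not the values reported in this study.

```python
def expected_chance_overlap(n_first, n_second, n_total):
    """Expected number of genes on both lists if list membership were random.

    n_first and n_second are the counts of correlated genes in each
    dataset; n_total is the number of genes on the shared chip.
    """
    fraction_first = n_first / n_total
    fraction_second = n_second / n_total
    return fraction_first * fraction_second * n_total
```

With 500 and 200 correlated genes out of 10,000, about 10 shared genes would be expected by chance. In the Results, this estimate is reported separately for positive and negative correlations, which corresponds to calling the function once per direction.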

Pathway Analysis and Gene Ontology Annotation

Pathway analysis was done using Ingenuity Pathway Analysis (IPA) [24]. Each genelist was copied and pasted in the Gene Search box at the top of the software interface. Only the best match was selected for each gene, which was most often a direct gene symbol match and rarely a synonym match. Selected genes were used to create a new gene list in IPA. Each gene list was annotated using IPA’s annotate function. Each gene list was selected and placed in a new pathway and connected under standard settings using the “Build” and “Connect” functions. Then the “Auto-Layout” button was used, and all unconnected genes were removed from the pathway. This was done for each gene list and for combined ZAP-70 and CD38 intersected and correlated gene lists for a total of six pathways.

Gene ontology annotations were gathered using GeneInfoViz’s batch search function [25]. Each IPA gene list was pasted into the search box. Gene ontology annotations from each gene were copied and pasted into the IPA annotation tables.


CHAPTER 4

RESULTS

4.1 Results for Specific Aim 1

Specific Aim 1: Test previously formed hypotheses, generated using a knowledge engineering (KE)-based approach, using CRC data and correlation analysis as a novel method of biomarker discovery.

The following four dataset comparisons were made: Facs_table to Clinical_Tablesv3, Facs_table to cytogenetic, Clinical_Tablesv3 to cytogenetic, and Clinical_Tablesv3 to BsData_table. A total of 1003 data field comparison pairs had a p-value less than or equal to 0.05. Eighty-nine data field comparison pairs from separate datasheets met the requirements of a p-value less than or equal to 0.05 and a phi coefficient greater than or equal to 0.3 (Figure 8). None of the nine hypotheses were included in those results because, out of 63 possible comparisons between CRC fields cited in the hypotheses, only one was done.

4.2 Results for Specific Aim 2

Specific Aim 2: Identify genes whose mRNA expression is correlated with known CLL biomarkers ZAP-70 and CD38.


GDS1388 and GDS1454 each contained 12651 probesets. Of these probesets, 908 and 115 were found to correlate with CD38 at or above the threshold value of 0.4 for GDS1388 and GDS1454 respectively. For ZAP-70, GDS1388 contained 754 probesets at or above the threshold value, while GDS1454 contained 179. These results can be viewed in Figure 9. The genelists containing the gene symbols for these probesets were intersected for both CD38 and ZAP-70; 38 genes that were present in both genelists were correlated at or above the threshold value in the same direction for ZAP-70, and 8 genes were intersected for CD38.

A randomness calculation was performed for both intersections (Figure 10). For CD38, the number of intersected genes with positive correlations due to randomness was calculated to be 5.76; the actual number of genes on the genelist was 8. The number of intersected genes with negative correlations due to randomness was calculated as 0.02; the actual number of genes on this genelist was 0. For ZAP-70, the number of intersected genes with positive correlations due to randomness was calculated to be 4.10; the actual number of genes on this genelist was 8. The number of intersected genes with negative correlations due to randomness was calculated as 5.64; the actual number of genes on this genelist was 30.

All genes on the intersected genelist for GDS1388 and GDS1454 and all genes on the correlated genelist for GDS2501 at or above R = 0.9 or the top twenty, whichever was greater, were entered into Ingenuity Pathway Analysis. Seven of the eight gene symbols on the CD38 intersected list were found in IPA, as were 36/40 on the CD38 correlated list, 33/38 on the ZAP-70 intersected list, and 68/82 on the ZAP-70 correlated list (Figure 9). The 42 total CD38 correlates and 101 ZAP-70 correlates were used to create combined pathways using IPA’s pathway function. Connections were found between 2/42 CD38 correlates and 25/101 ZAP-70 correlates (Figures 12 and 13).

Gene ontology information is found in the right column of the gene annotations (Figure 11). For ZAP-70, 53 of the 68 genes on the IPA correlated list had gene ontology information, as did 25/33 ZAP-70 correlates from the intersected genelist, 27/36 genes on the CD38 correlated genelist, and 5/7 CD38 correlates from the intersected genelist.


CHAPTER 5

CONCLUSIONS

We did not test all nine of the hypotheses, as stated in Aim 1. Several limitations hindered our ability to analyze the hypotheses within the timeframe, including the problems of convenience sampling, visit windowing, and novel hypotheses. These problems merit extensive further study that is beyond the scope of this application. Instead of stalling the study because of these limitations, we analyzed several thousand CRC data repository field comparisons for significance and strength of association and found eighty-nine pairs of fields that were both statistically significant and at least weakly to strongly associated.

We were able to successfully identify genes with mRNA expression correlated with ZAP-70 and CD38 expression. Genes correlated with ZAP-70 were less likely to be due to randomness. Pathway analysis results suggest that a network of genes that has been shown to interact with ZAP-70, either directly or indirectly, may be involved in pathogenesis of CLL and merits further biological validation and examination.


CHAPTER 6

DISCUSSION

The results of Aim 1’s preliminary statistical analysis suggest that more work is needed before our KE-based approach can be tested, with more sophisticated analysis, as an acceptable means of hypothesis generation. There was difficulty combining the five datasets because of nonmatching dates. This larger problem of visit windowing will need to be solved before Aim 1 can be achieved. Because only one of the sixty-three possible comparisons within the nine initial KE-generated hypotheses was made, it is impossible for us to validate the method in the way we initially proposed. An additional hurdle was the challenge of validating novel hypotheses instead of more common, expected hypotheses. There are some controls built into the analysis as performed. For example, test results for kappa and lambda light chains and CD20 are included in both the FACS datasheet and the clinical datasheet through two different testing methods. The results of these comparisons are viewable in Figure 8, because the comparisons were both statistically significant and had at least a weak association as determined through Fisher’s exact testing and contingency tables. However, because there is not an open line of research between all sites in the consortium, there is user bias in determining some of these quantitative test results.


Although we have not yet achieved Aim 1, thousands of comparisons were made, and many interesting results have been generated. LDH, lactate dehydrogenase, is under consideration by some groups as a potential biomarker for CLL prognosis [26]. In our study, levels of LDH within normal reference ranges had a significant association with normal karyotype pattern and zero percent of cells with abnormal karyotype, while LDH levels higher than reference were significantly associated with both abnormal karyotype pattern and 10.1-50% cells with abnormal karyotype (Figure 8). These results suggest that changes in LDH levels over time might indicate a change in karyotype and disease status; this possibility merits further investigation.

The applied concepts of mRNA expression correlation and mRNA expression intersection are novel, but preliminary results for these methods suggest that they merit further testing. Aim 2 of this application was meant to address this issue. In order to visually examine the results of these tests, IPA was used to generate pathways containing the resulting genes and their known relationships.

Based on the randomness calculation, it was concluded that most genes on the ZAP-70 intersected list are not due to chance. However, most genes on the CD38 intersected list could be due to chance. Additionally, the combined pathway for ZAP-70 resulted in a large cluster of sixteen genes, while there was only one connection between two genes out of the 42 CD38 correlates used in the IPA analysis. These results suggest that ZAP-70 is regulated at an mRNA expression level in CLL, while regulation of CD38 expression in CLL occurs at a translational or post-translational level.


Two genes on the ZAP-70 correlated genelist are known tyrosine kinases, as is ZAP-70 itself. These genes are DDR2 and MAPK10, and both are shown in the combined ZAP-70 IPA pathway in Figure 12. Other genes with tyrosine kinase signal transduction-associated gene ontology on the ZAP-70 correlation lists include MPZL1, TYK2, and HGS. All of this information combined suggests that in CLL, multiple tyrosine kinases and their signal transduction pathways are dysregulated. Tyrosine kinases have been successfully targeted for therapy in other cancers, including breast cancer and chronic myeloid leukemia [27, 28]. This suggests that researchers should continue to look at tyrosine kinases in CLL as a potential therapeutic target, where it may be possible to find a first CLL-specific drug.

Within the large cluster of correlated genes in the ZAP-70 IPA pathway, it is notable that several genes in the Rho signal transduction network are present. RhoA is two steps away from ZAP-70 in the network and is known to be involved in actin cytoskeleton organization and cell motility. RhoA has never been implicated in CLL or suggested as a biomarker, although it has been shown to be highly expressed in hairy cell leukemia (HCL), and its signaling pathways have been implicated in other cancers, including chronic myelogenous leukemia [29, 30]. In the same study that found RhoA constitutively expressed in HCL, B-CLL cells were found not to overexpress RhoA. The results of our data analysis suggest that another look be taken at RhoA and its network in CLL.

In the future, we would like to calculate change over time for CRC patient data and complete additional correlation analyses. We would also like to analyze these data further for biomarkers that may depend on more than one factor. Before these goals can be achieved, the limitations encountered in this study must be addressed: visit windowing, convenience sampling, and forming a more open line of research for some CRC data field tests. These problems are likely shared with other NIH-funded research consortia. As we begin to solve these problems, we will become better able to mine the knowledge currently waiting in consortium data repositories.


REFERENCES

1. Payne, P.R., et al. Ontology-anchored Approaches to Conceptual Knowledge Discovery in a Multi-dimensional Research Data Repository. in AMIA Translational Bioinformatics Summit Proc. 2008. 2008. 2. Raval, A., J.C. Byrd, and C. Plass, Epigenetics in chronic lymphocytic leukemia. Semin Oncol, 2006. 33(2): p. 157-66. 3. Byrd, J.C., T.S. Lin, and M.R. Grever, Treatment of relapsed chronic lymphocytic leukemia: old and new therapies. Semin Oncol, 2006. 33(2): p. 210-9. 4. Rassenti, L.Z., et al., Relative value of ZAP-70, CD38, and immunoglobulin mutation status in predicting aggressive disease in chronic lymphocytic leukemia. Blood, 2008. 112(5): p. 1923-30. 5. Rybaczyk, L.A., et al., An indicator of cancer: downregulation of monoamine oxidase-A in multiple organs and species. BMC Genomics, 2008. 9: p. 134. 6. Pujana, M.A., et al., Network modeling links breast cancer susceptibility and centrosome dysfunction. Nat Genet, 2007. 39(11): p. 1338-49. 7. Shanafelt, T.D., et al., Narrative review: initial management of newly diagnosed, early-stage chronic lymphocytic leukemia. Ann Intern Med, 2006. 145(6): p. 435- 47. 8. Marti, G.E., et al., Diagnostic criteria for monoclonal B-cell lymphocytosis. Br J Haematol, 2005. 130(3): p. 325-32. 9. Moreno, C. and E. Montserrat, New prognostic markers in chronic lymphocytic leukemia. Blood Rev, 2008. 22(4): p. 211-9. 10. Alinari, L., et al., Alemtuzumab (Campath-1H) in the treatment of chronic lymphocytic leukemia. Oncogene, 2007. 26(25): p. 3644-53. 11. Chen, L., et al., Expression of ZAP-70 is associated with increased B-cell receptor signaling in chronic lymphocytic leukemia. Blood, 2002. 100(13): p. 4609-14. 12. Humphries, C.G., et al., A new human immunoglobulin VH family preferentially rearranged in immature B-cell tumours. Nature, 1988. 331(6155): p. 446-9. 13. Greaves, A.W., et al., CRC Tissue Core Management System (TCMS): integration of basic science and clinical data for translational research. 
AMIA Annu Symp Proc, 2003: p. 853. 14. Payne, P.R., A.W. Greaves, and T.J. Kipps, CRC Clinical Trials Management System (CTMS): an integrated information management solution for collaborative clinical research. AMIA Annu Symp Proc, 2003: p. 967.


15. Ghia, E.M., et al., Use of IGHV3-21 in chronic lymphocytic leukemia is associated with high-risk disease and reflects antigen-driven, post-germinal center leukemogenic selection. Blood, 2008. 111(10): p. 5101-8.
16. Widhopf, G.F., 2nd, et al., Nonstochastic pairing of immunoglobulin heavy and light chains expressed by chronic lymphocytic leukemia B cells is predicated on the heavy chain CDR3. Blood, 2008. 111(6): p. 3137-44.
17. Chronic Lymphocytic Leukemia Research Consortium (CRC). 2008 [cited 2008 December 1]; Available from: http://cll.ucsd.edu/.
18. Payne, P.R., et al., Conceptual knowledge acquisition in biomedicine: A methodological review. J Biomed Inform, 2007. 40(5): p. 582-602.
19. Payne, P.R., et al., Supporting the Design of Translational Clinical Studies Through the Generation and Verification of Conceptual Knowledge-anchored Hypotheses. AMIA Annu Symp Proc, 2008: p. 566-70.
20. Heeboll, S., et al., SMARCC1 expression is upregulated in prostate cancer and positively correlated with tumour recurrence and dedifferentiation. Histol Histopathol, 2008. 23(9): p. 1069-76.
21. Sastre, J., et al., Circulating tumor cells in colorectal cancer: correlation with clinical and pathological variables. Ann Oncol, 2008. 19(5): p. 935-8.
22. Gormley, M., et al., Prediction potential of candidate biomarker sets identified and validated on gene expression data from multiple datasets. BMC Bioinformatics, 2007. 8: p. 415.
23. NCBI GEO. [cited 2008 December 1]; Available from: http://www.ncbi.nlm.nih.gov/sites/entrez?db=gds.
24. Ingenuity Pathway Analysis Software. 2008 [cited 2008]; Available from: http://www.ingenuity.com/.
25. Cui, Y. GeneInfoViz. 2008 [cited 2008 December 1]; Available from: http://genenet2.utmem.edu/geneinfoviz/search.php.
26. Shen, Q.D., et al., [Prognostic significance of lactate dehydrogenase and beta2-microglobulin in chronic lymphocytic leukemia]. Zhongguo Shi Yan Xue Ye Xue Za Zhi, 2007. 15(6): p. 1305-8.
27. Gora-Tybor, J. and T. Robak, Targeted drugs in chronic myeloid leukemia. Curr Med Chem, 2008. 15(29): p. 3036-51.
28. Stankov, K., S. Stankov, and S. Popovic, Translational research in complex etiopathogenesis and therapy of hematological malignancies: the specific role of tyrosine kinases signaling and inhibition. Med Oncol, 2008.
29. Kuzelova, K. and Z. Hrkal, Rho-signaling pathways in chronic myelogenous leukemia. Cardiovasc Hematol Disord Drug Targets, 2008. 8(4): p. 261-7.
30. Zhang, X., et al., Constitutively activated Rho guanosine triphosphatases regulate the growth and morphology of hairy cell leukemia cells. Int J Hematol, 2003. 77(3): p. 263-73.


APPENDIX A

TABLES AND FIGURES FOR CRC DATA REPOSITORY AND GEO ANALYSIS


Figure 1: Valid and meaningful hypotheses from [1]

Mapped from CRC Field          Mapped from CRC Field

FISHchr17_per_abncells         ct_crc002_registration::AlkylatorRefractory
FISHchr17_abn_category         ct_crc002_registration::PurineAnalogueRefractory
FISHchr17_mayo_per_abncells    ct_crc002_registration::ImmunotherapyRefractory

FISHchr12_per_abncells         ct_crc002_registration::AlkylatorRefractory
FISHchr12_abn_category         ct_crc002_registration::PurineAnalogueRefractory
FISHchr12_mayo_per_abncells    ct_crc002_registration::ImmunotherapyRefractory

FISHchr12_per_abncells         ct_crc002_registration::AlkylatorRefractory
FISHchr12_abn_category         ct_crc002_registration::PurineAnalogueRefractory
FISHchr12_mayo_per_abncells    ct_crc002_registration::ImmunotherapyRefractory

FISHchr17_per_abncells
FISHchr17_abn_category
FISHchr17_mayo_per_abncells    ct_crc002_clinical_data::SGPT

FISHchr17_per_abncells
FISHchr17_abn_category
FISHchr17_mayo_per_abncells    ct_crc002_clinical_data::SGPT

FISHchr12_per_abncells         ct_crc002_registration::AlkylatorRefractory
FISHchr12_abn_category         ct_crc002_registration::PurineAnalogueRefractory
FISHchr12_mayo_per_abncells    ct_crc002_registration::ImmunotherapyRefractory

FISHchr6_per_abncells
FISHchr6_abn_category
FISHchr6_mayo_per_abncells     ct_crc002_clinical_data::WBC

FISHchr11_per_abncells         ct_crc002_registration::AlkylatorRefractory
FISHchr11_abn_category         ct_crc002_registration::PurineAnalogueRefractory
FISHchr11_mayo_per_abncells    ct_crc002_registration::ImmunotherapyRefractory

FISHchr17_per_abncells         ct_crc002_registration::AlkylatorRefractory
FISHchr17_abn_category         ct_crc002_registration::PurineAnalogueRefractory
FISHchr17_mayo_per_abncells    ct_crc002_registration::ImmunotherapyRefractory


Figure 2: Flow chart for Aim 1.

Flow chart elements for Aim 1: querying the CRC data repository; query results; joining tables; combined datasheets; binning process; pairwise comparison (Fisher exact test, correlation analysis); results; hypothesis testing of the hypotheses generated in [1].


Figure 3: Flow chart for Aim 2.

Aim 2 (repeated for both ZAP-70 and CD38 analyses)

Flow chart elements for Aim 2: GDS1388 soft file, correlation, GDS1388 correlated gene list; GDS1454 soft file, correlation, GDS1454 correlated gene list; intersection, intersected gene list; GDS2501 soft file, correlation, GDS2501 correlated gene list; IPA pathway; IPA annotations and Gene Ontology.


Figure 4: Fields queried from CRC research data repository for Aim 1.

Kar_type Static_FACS_2:bcell_cd38_lymph ct_crc002_clinical_data:PBsIgD Kar_pattern Static_FACS_2:bcell_cd38_mfir ct_crc002_clinical_data:PBFMC7 Kar_per_abncells Static_FACS_2:bcell_igm_lymph ct_crc002_clinical_data:bm_immuno_date Karyotype Static_FACS_2:bcell_igm_mfir ct_crc002_clinical_data:BMCD19CD5 Kar_anom_complex Static_FACS_2:bcell_igd_lymph ct_crc002_clinical_data:BMCD3 comments Static_FACS_2:bcell_igd_mfir ct_crc002_clinical_data:BMCD4 FISHchr6_abn_category Static_FACS_2:bcell_kappa_lymph ct_crc002_clinical_data:BMCD8 FISHchr6_comments Static_FACS_2:bcell_kappa_mfir ct_crc002_clinical_data:BMCD5 FISHchr6_mayo_comments Static_FACS_2:bcell_lambda_lymph ct_crc002_clinical_data:BMCD10 FISHchr6_mayo_pattern Static_FACS_2:bcell_lambda_mfir ct_crc002_clinical_data:BMCD20 FISHchr6_mayo_per_abncells ResearchID ct_crc002_clinical_data:BMCD23 FISHchr6_pattern ClosedUI ct_crc002_clinical_data:BMCD52 FISHchr6_per_abncells Open_UI ct_crc002_clinical_data:BMKappa FISHchr11_abn_category VH:PerformanceDate ct_crc002_clinical_data:BMLambda FISHchr11_comments VH:IGHV_gene ct_crc002_clinical_data:BMsIgD FISHchr11_mayo_comments VH:IGHV_gene_subfamily ct_crc002_clinical_data:BMFMC7 FISHchr11_mayo_pattern VH:jh_gene ct_crc002_clinical_data:bm_morphology_date FISHchr11_mayo_per_abncells VH:d_gene ct_crc002_clinical_data:CellularityBx FISHchr11_pattern VH:Percent_VH_Homology ct_crc002_clinical_data:CellularityAsp FISHchr11_per_abncells VH:per_Zap_flow ct_crc002_clinical_data:Lymphocytes FISHchr12_abn_category ResearchID ct_crc002_clinical_data:Prolymphocytes FISHchr12_comments Protocol ct_crc002_clinical_data:bm_morphology_blastspros FISHchr12_mayo_comments ClosedUI ct_crc002_clinical_data:bm_morphology_metasmyelos FISHchr12_mayo_pattern VisitDate ct_crc002_clinical_data:bm_morphology_monocytes FISHchr12_mayo_per_abncells ct_crc002_clinical_data:hematology_date ct_crc002_clinical_data:bm_morphology_eosinophils FISHchr12_pattern ct_crc002_clinical_data:WBC 
ct_crc002_clinical_data:bm_morphology_basophils FISHchr12_per_abncells ct_crc002_clinical_data:RBC ct_crc002_clinical_data:bm_morphology_erythoid FISHchr13_abn_category ct_crc002_clinical_data:HgB ct_crc002_clinical_data:Pattern FISHchr13_comments ct_crc002_clinical_data:Plt ct_crc002_clinical_data:Other FISHchr13_mayo_comments ct_crc002_clinical_data:reticulocytes ct_crc002_clinical_data:alkylator_refrac FISHchr13_mayo_pattern ct_crc002_clinical_data:Neuts ct_crc002_clinical_data:pur_analogue_refrac FISHchr13_mayo_per_abncells ct_crc002_clinical_data:Band ct_crc002_clinical_data:BP FISHchr13_pattern ct_crc002_clinical_data:Lymph ct_crc002_clinical_data:Temp FISHchr13_per_abncells ct_crc002_clinical_data:Prolymph ct_crc002_clinical_data:Pulse FISHchr17_abn_category ct_crc002_clinical_data:Mono ct_crc002_clinical_data:Resp FISHchr17_comments ct_crc002_clinical_data:Baso ct_crc002_clinical_data:PerformanceStatus FISHchr17_mayo_comments ct_crc002_clinical_data:Eo ct_crc002_clinical_data:Liver FISHchr17_mayo_pattern ct_crc002_clinical_data:sma_date ct_crc002_clinical_data:Spleen FISHchr17_mayo_per_abncells ct_crc002_clinical_data:Albumin ct_crc002_clinical_data:LymphNodesAxOne FISHchr17_pattern ct_crc002_clinical_data:TotalProtein ct_crc002_clinical_data:LymphNodesAxTwo FISHchr17_per_abncells ct_crc002_clinical_data:Calcium ct_crc002_clinical_data:LymphNodesCxOne FISHchr1114_abn_category ct_crc002_clinical_data:InorganicPhos ct_crc002_clinical_data:LymphNodesCxTwo FISHchr1114_comments ct_crc002_clinical_data:Glucose ct_crc002_clinical_data:LymphNodesIngOne FISHchr1114_pattern ct_crc002_clinical_data:BUN ct_crc002_clinical_data:LymphNodesIngTwo FISHchr1114_per_abncells ct_crc002_clinical_data:Creatinine ct_crc002_clinical_data:lymph_nodes_supone FISH_present ct_crc002_clinical_data:UricAcid ct_crc002_clinical_data:lymph_nodes_suptwo RecordStatus ct_crc002_clinical_data:Bilirubin ct_crc002_clinical_data:lymph_nodes_total Cutoff_11q_calc 
ct_crc002_clinical_data:AlkalinePhos Protocol Cutoff_13q_calc ct_crc002_clinical_data:LDH VisitDate Cutoff_17p_calc ct_crc002_clinical_data:SGPT ct_crc002_registration:DateOfCLLDiagnosis Cutoff_Tri12_calc ct_crc002_clinical_data:PBFC_Performance_Date ct_crc002_registration:DateOfCRCEvaluation sampledate ct_crc002_clinical_data:PBCD19CD5 ct_crc002_registration:MRDStatus Static_FACS_2:flowdate ct_crc002_clinical_data:PBCD3 ct_crc002_registration:NumberOfPriorTherapies Static_FACS_2:cd19_lymph ct_crc002_clinical_data:PBCD4 ct_crc002_registration:AlkylatorRefractory Static_FACS_2:cd19_mfir ct_crc002_clinical_data:PBCD8 ct_crc002_registration:PurineAnalogueRefractory Static_FACS_2:bcell_cd5_lymph ct_crc002_clinical_data:PBCD10 ct_crc002_registration:ImmunotherapyRefractory Static_FACS_2:bcell_cd5_mfir ct_crc002_clinical_data:PBCD20 ct_crc002_registration:Fever Static_FACS_2:bcell_cd20_lymph ct_crc002_clinical_data:PBCD23 ct_crc002_registration:Infection Static_FACS_2:bcell_cd20_mfir ct_crc002_clinical_data:PBCD52 ct_crc002_registration:WeightLoss Static_FACS_2:bcell_cd23_lymph ct_crc002_clinical_data:PBKappa ct_crc002_registration:NightSweats Static_FACS_2:bcell_cd23_mfir ct_crc002_clinical_data:PBLambda ct_crc002_clinical_data:PBsIgD


Figure 5: Fields analyzed and bins for these fields.

*Static_FACS_2:cd19_lymph: 0 to 80; 80.01 to 90; 90.01 to 95; 95.01 to 100 (NR, Q)
*Static_FACS_2:cd19_mfir: 0 to 37; 37.01 to 65; 65.01 to 105; >105 (NR, Q)
*Static_FACS_2:bcell_cd5_lymph: 0 to 90; 90.01 to 99; 99.01 to 100 (NR, Q)
*Static_FACS_2:bcell_cd5_mfir: 0 to 30; 30.01 to 80; 80.01 to 125; >125 (NR, Q)
*Static_FACS_2:bcell_cd20_lymph: 0 to 23; 23.01 to 60; 60.01 to 83; 83.01 to 100 (NR, Q)
*Static_FACS_2:bcell_cd20_mfir: 0 to 3.5; 3.51 to 7.5; 7.51 to 14; >14 (NR, Q)
*Static_FACS_2:bcell_cd23_lymph: 0 to 45; 45.01 to 70; 70.01 to 88 (NR, Q)
*ct_crc002_clinical_data:RBC: 0 to 3.79; 3.8 to 5.74; >5.74 (REF)
*ct_crc002_clinical_data:HgB: 0 to 11.69; 11.7 to 17.3; >17.3 (REF)
*ct_crc002_clinical_data:Plt: 0 to 149; 150 to 400; >400 (REF)
*ct_crc002_clinical_data:reticulocytes: 0 to 0.4; 0.5 to 1.5; >1.5 (REF)
*ct_crc002_clinical_data:Neuts: 0 to 10; 10.1 to 30; 30.1 to 60; >60.1 (NR, Q)
*ct_crc002_clinical_data:Band: 0 to 0; 1 to 1; 2 to 10; >10 (NR, Q)
*ct_crc002_clinical_data:Lymph: 0 to 39; 39.01 to 72; 72.01 to 87; >87 (NR, Q)
*ct_crc002_clinical_data:Prolymph: 0 to 0 (NR, **)
*ct_crc002_clinical_data:BMKappa: neg*/-/0 to 19.9; >19.9/+ (NR, ***)
*ct_crc002_clinical_data:BMLambda: neg*/-/0 to 19.9; >19.9/+ (NR, ***)
*ct_crc002_clinical_data:BMsIgD: neg*/-/0 to 19.9; >19.9/+ (NR, ***)
*ct_crc002_clinical_data:BMFMC7: neg*/-/0 to 19.9; >19.9/+ (NR, ***)
*ct_crc002_clinical_data:CellularityBx: 0 to 30; 30.1 to 50; 50.1 to 80; 80.1 to 100 (NR, Q)
*ct_crc002_clinical_data:CellularityAsp: 0 to 30; 30.1 to 50; 50.1 to 70; 70.1 to 100 (NR, Q)
*ct_crc002_clinical_data:Lymphocytes: 0 to 20; 20.1 to 52; 52.1 to 79; 79.1 to 100 (NR, Q)
*ct_crc002_clinical_data:Prolymphocytes: 0 to 0; 0.01 to 2; 2.01 to 30 (NR, **)
*ct_crc002_clinical_data:bm_morphology_blastspros: 0 to 1 (NR, ****)
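The binning step can be illustrated with a minimal sketch. It uses the RBC bins listed in Figure 5 (0 to 3.79, 3.8 to 5.74, and above 5.74); the helper name `assign_bin`, the label strings, and the handling of missing values are assumptions made for illustration and are not taken from the analysis itself.

```python
from bisect import bisect_left

# Upper bounds of the closed RBC bins from Figure 5; values above the
# last bound fall into the open-ended ">5.74" bin.
RBC_UPPER_BOUNDS = [3.79, 5.74]
RBC_LABELS = ["0-3.79", "3.8-5.74", ">5.74"]


def assign_bin(value, upper_bounds, labels):
    """Return the label of the bin containing value (hypothetical helper)."""
    if value is None:
        return None  # assumption: missing measurements stay unbinned
    # bisect_left places a value equal to an upper bound in that bound's bin.
    # A value falling in the gap between two bins (e.g. 3.795) goes to the
    # next bin up; that choice is also an assumption.
    return labels[bisect_left(upper_bounds, value)]


rbc_bins = [assign_bin(v, RBC_UPPER_BOUNDS, RBC_LABELS) for v in (3.2, 4.5, 6.0)]
```

Binning each continuous field in this way turns it into a categorical field, which is what allows pairs of fields to be compared in contingency tables.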


Figure 6: Comparison pairs between CRC query datasheets and methodology for combining datasheets for analysis.

Facs_table to Clinical_Tablesv3: Match ResearchID, then sampledate/visitdate. If no direct match between sampledate/visitdate, then link nearest visit date if date is within five days. If no direct match and no visit date within five days, do not include this data.

Facs_table to cytogenetic: In cytogenetic sampledate column, convert dates later than the present date (e.g., 2099) from 20* to 19*. Match ResearchID, then sampledate. If no direct match for date, then match cytogenetic-sampledate to nearest within five days or first subsequent Facs_table-sampledate. If no direct match and no date within five days or subsequent, do not include this data.

Facs_table to BsData_table: Match ResearchID, then sampledate/PerformanceDate. If no direct match for date, then match BsData_table-PerformanceDate to nearest within five days or first subsequent Facs_table-sampledate. If no direct match and no date within five days or subsequent, do not include this data.

Facs_table to Registration Tablev2: Match ResearchID, then VisitDate/sampledate. If no direct match for date, then link nearest visit date within five days. If no direct match and no visit date within five days, do not include this data.

Clinical_Tablesv3 to cytogenetic: Match ResearchID, then sampledate/visitdate. If no direct match for date, then match cytogenetic-sampledate to nearest within five days or first subsequent Clinical_Tables-visitdate. If no direct match and no date within five days or subsequent, do not include this data.

Clinical_Tablesv3 to BsData_table: Match ResearchID, then visitdate/PerformanceDate. If no direct match for date, then match BsData_table-PerformanceDate to nearest within five days or first subsequent Clinical_Tables-visitdate. If no direct match and no date within five days or subsequent, do not include this data.

Clinical_Tablesv3 to Registration Tablev2: Match ResearchID, then VisitDate. If no direct match for date, then link nearest visit date within five days. If no direct match and no visit date within five days, do not include this data.

cytogenetic to BsData_table: Match ResearchID, then sampledate/PerformanceDate. If no direct match for date, then match BsData_table-PerformanceDate to nearest within one year of cytogenetic-sampledate. If no date entry for BsData_table, match to earliest date of cytogenetic.

cytogenetic to Registration Tablev2: Match ResearchID, then sampledate/visitdate. If no direct match for date, then match cytogenetic-sampledate to nearest within five days or first subsequent RegistrationTable-visitdate. If no direct match and no date within five days or subsequent, do not include this data.

BsData_table to Registration Tablev2: Match ResearchID, then PerformanceDate/visitdate. If no direct match for date, then match BsData_table-PerformanceDate to nearest within five days or first subsequent RegistrationTable-visitdate. If no direct match and no date within five days or subsequent, do not include this data.
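The simplest of these linking rules (match on ResearchID, then on an exact date, otherwise on the nearest date within five days, otherwise drop the record) can be sketched as follows. The row layout `(ResearchID, date, payload)` and the function names are hypothetical, and the "first subsequent date" variant used for some table pairs is not covered here.

```python
from datetime import date, timedelta


def nearest_within(target, candidates, max_days=5):
    """Return the candidate date closest to target, or None if none is within max_days."""
    if target in candidates:
        return target  # a direct match takes priority
    window = timedelta(days=max_days)
    close = [d for d in candidates if abs(d - target) <= window]
    return min(close, key=lambda d: abs(d - target)) if close else None


def link_rows(left_rows, right_rows, max_days=5):
    """Pair rows by ResearchID, then by exact or nearest date (hypothetical layout)."""
    by_id = {}
    for rid, visit_date, payload in right_rows:
        by_id.setdefault(rid, {})[visit_date] = payload
    linked = []
    for rid, sample_date, payload in left_rows:
        dates = by_id.get(rid, {})
        match = nearest_within(sample_date, list(dates), max_days)
        if match is not None:  # rows without a match are left out
            linked.append((rid, sample_date, match, payload, dates[match]))
    return linked


# Made-up records: R1 has a visit three days after the sample, R2 does not.
facs = [("R1", date(2005, 3, 1), "facs-a"), ("R2", date(2005, 3, 1), "facs-b")]
clinical = [("R1", date(2005, 3, 4), "clin-a"), ("R2", date(2005, 4, 1), "clin-b")]
linked = link_rows(facs, clinical)
```

In this toy example only the R1 pair is kept, because the R2 visit falls a month after the sample date.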


Figure 7: GDS dataset information summary.

GDS#       Comparison                                       Tissue Type            Sample Count    Features

GDS1388    indolent (stable) B-CLL vs. progressive B-CLL    primary lymphocytes    21              12651

GDS1454    healthy control vs. B-CLL                        primary lymphocytes    111             12651

GDS2501    B-CLL/ZAP70-CD38- vs. B-CLL/ZAP70+CD38+          B lymphocytes          16              22464


Figure 8: Correlated CRC data fields, p ≤ 0.05, phi ≥ 0.3.

Field 1    Field 2    Correlated values    Phi    P value
ct_crc002_clinical_data:Neuts    Static_FACS_2:cd19_lymph    0-10 ~ 95.01-100    0.46    1.72E-60
ct_crc002_clinical_data:WBC    Static_FACS_2:cd19_lymph    >100 ~ 95.01-100    0.43    1.33E-58
ct_crc002_clinical_data:WBC    Static_FACS_2:cd19_lymph    4.5-11 ~ 0-80    0.44    1.16E-55
ct_crc002_clinical_data:Lymph    Static_FACS_2:cd19_lymph    >87 ~ 95.01-100    0.41    8.27E-53
ct_crc002_clinical_data:Neuts    Static_FACS_2:cd19_lymph    30.1-60 ~ 0-80    0.38    1.40E-37
ct_crc002_clinical_data:Lymph    Static_FACS_2:cd19_lymph    0-39 ~ 0-80    0.35    9.75E-36
ct_crc002_clinical_data:Lymph    Static_FACS_2:cd19_mfir    0-39 ~ 0-37    0.33    6.25E-30
ct_crc002_clinical_data:Neuts    Static_FACS_2:cd19_mfir    >60 ~ 0-37    0.3    5.27E-21
ct_crc002_clinical_data:BMLambda    Static_FACS_2:bcell_lambda_mfir    >19.9 ~ >5.2    0.65    6.00E-17
ct_crc002_clinical_data:Liver    FISHchr13_mayo_per_abncells    >1 ~ 0-2.9    0.44    1.41E-16
ct_crc002_clinical_data:Liver    FISHchr13_mayo_per_abncells    0-1 ~ 3-100    0.44    1.41E-16
ct_crc002_clinical_data:BMLambda    Static_FACS_2:bcell_lambda_lymph    >19.9 ~ 70.01-100    0.63    4.65E-15
ct_crc002_clinical_data:LDH    Kar_per_abncells    100-190 ~ 0-0    0.34    2.02E-10
ct_crc002_clinical_data:BMLambda    Static_FACS_2:bcell_kappa_mfir    -1000-19.9 ~ >22    0.41    1.14E-09
ct_crc002_clinical_data:BMLambda    Static_FACS_2:bcell_kappa_mfir    >19.9 ~ 0-1.6    0.48    2.25E-09
ct_crc002_clinical_data:BMKappa    Static_FACS_2:bcell_kappa_lymph    >19.9 ~ 98.51-100    0.43    4.20E-09
ct_crc002_clinical_data:BMKappa    Static_FACS_2:bcell_lambda_mfir    -1000-19.9 ~ >5.2    0.4    5.56E-09
ct_crc002_clinical_data:Band    Static_FACS_2:flowdate    >10 ~ 5/22/2003    0.66    7.16E-09
ct_crc002_clinical_data:BMKappa    Static_FACS_2:bcell_kappa_mfir    -1000-19.9 ~ 0-1.6    0.38    1.10E-08
ct_crc002_clinical_data:BMLambda    Static_FACS_2:bcell_kappa_lymph    -1000-19.9 ~ 98.51-100    0.37    1.40E-08
ct_crc002_clinical_data:LDH    Kar_anom_complex    100-190 ~ 0    0.3    2.00E-08
ct_crc002_clinical_data:LDH    Kar_per_abncells    190.1-500 ~ 10.1-50    0.31    2.36E-08
ct_crc002_clinical_data:BMKappa    Static_FACS_2:bcell_kappa_mfir    >19.9 ~ >22    0.39    7.44E-08
ct_crc002_clinical_data:BMLambda    Static_FACS_2:bcell_lambda_mfir    -1000-19.9 ~ 0-1.3    0.35    2.50E-07
ct_crc002_clinical_data:BMKappa    Static_FACS_2:bcell_kappa_lymph    -1000-19.9 ~ 0-1.5    0.34    4.21E-07
ct_crc002_clinical_data:BMCD20    Static_FACS_2:bcell_cd20_mfir    -1000-19.9 ~ 0-3.5    0.66    4.26E-06
ct_crc002_clinical_data:BMLambda    Static_FACS_2:bcell_lambda_lymph    -1000-19.9 ~ 0-0.6    0.31    5.87E-06
ct_crc002_clinical_data:BMKappa    Static_FACS_2:bcell_kappa_lymph    -1000-19.9 ~ 1.51-50    0.3    8.51E-06
FISHchr12_per_abncells    Static_FACS_2:bcell_cd20_mfir    3-100 ~ >14    0.34    8.85E-06
ct_crc002_clinical_data:BMKappa    Static_FACS_2:bcell_lambda_lymph    -1000-19.9 ~ 70.01-100    0.31    9.98E-06
ct_crc002_clinical_data:PBCD3    Static_FACS_2:cd19_lymph    >19.9 ~ 0-80    0.43    1.08E-05
ct_crc002_clinical_data:BMLambda    Static_FACS_2:bcell_kappa_lymph    >19.9 ~ 0-1.5    0.35    1.73E-05
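For readers who want to see how a phi value and a Fisher exact p value of the kind reported in Figure 8 arise from one pair of bins, the sketch below computes both for a 2x2 contingency table. It is a generic textbook calculation written for illustration; the function names and the toy counts are assumptions, and it is not the software used for the analysis.

```python
from math import comb, sqrt


def phi_coefficient(a, b, c, d):
    """Phi coefficient for the 2x2 table [[a, b], [c, d]]."""
    denom = sqrt((a + b) * (c + d) * (a + c) * (b + d))
    return (a * d - b * c) / denom if denom else 0.0


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p value: total probability of all tables with
    the same margins that are no more likely than the observed table."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d
    total = comb(n, col1)

    def prob(x):  # hypergeometric probability of x in the top-left cell
        return comb(row1, x) * comb(row2, col1 - x) / total

    observed = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    probs = (prob(x) for x in range(lo, hi + 1))
    # The small relative tolerance guards against floating-point ties.
    return sum(p for p in probs if p <= observed * (1 + 1e-9))


# Toy counts (not CRC data): rows are "in bin A" / "not in bin A",
# columns are "in bin B" / "not in bin B".
phi = phi_coefficient(3, 1, 1, 3)
p_value = fisher_exact_two_sided(3, 1, 1, 3)
```

A pair of bins would appear in Figure 8 only if its p value were at most 0.05 and its phi at least 0.3; the toy table above has a phi of 0.5 but, with only eight observations, a p value well above 0.05.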


Figure 9: Correlation gene lists for ZAP-70 and CD38 for GDS1388, GDS1454, and GDS2501, threshold 0.4.

GDS1388    GDS1454    GDS2501    GDS1388    GDS1454
ZAP70 1    ZAP70 1    ZAP70 1    CD38 1    CD38
AF014118 0.761877    MTA1 0.590415    EHBP1L1 0.958558    RUNX1 0.842006    USH1C
W26640 0.753988    LCP2 0.531581    DOK3 0.954622    FABP1 0.834785    ROD1
HPGD 0.739961    TPST2 0.497283    IFT122 0.94937    DTNA 0.814985    DPP6
KCNQ1 0.731021    PKN1 0.496205    DTX3 0.94743    AL109691 0.795703    TIGR:HG2028-HT2082
POU4F2 0.71143    APLP2 0.490371    C1ORF144 0.945776    SIGLEC6 0.747721    NPAS2
TIGR:HG3527-HT3721 0.67972    PBXIP1 0.488079    ATG4B 0.945213    IFI6 0.747075    PAH
PCTK1 0.672659    CABIN1 0.476412    C19ORF6 0.944112    DNM1L 0.745606    RYR2
DEPDC5 0.671785    ARHGAP1 0.473291    CABIN1 0.942791    MAP1B 0.731437    CXCR6
DNASE1 0.669263    RP4-691N24.1 0.472004    ATP2A3 0.940758    CLEC2B 0.725273    ABCA12
DNM1 0.664518    DGKZ 0.466883    CNOT3 0.938867    EVI1 0.724979    TBX2
W26627 0.662644    TRAF1 0.466373    GRIN3B 0.932735    NDRG2 0.721593    IGFBP5
SMG5 0.658862    EML3 0.459666    RXRA 0.932461    CASR 0.721396    X12949
P2RX4 0.657786    SELPLG 0.45795    KCTD20 0.929487    TSPAN1 0.710331    #NAME?
RANBP9 0.651297    LCK 0.454862    STXBP2 0.928299    DZIP1 0.70921    CCNO
W22110 0.645469    RRBP1 0.452828    C19ORF6 0.927867    TRIM13 0.70642    LOC257407
CHRNA6 0.640786    LEF1 0.448462    ATP2A3 0.926586    PHACTR2 0.702286    EYA1
TMEM110 0.640517    PACS2 0.448085    GGA3 0.926283    BAAT 0.701827    PCGF2
LRRN2 0.639657    MAZ 0.443079    LOC90379 0.92572    ATP10A 0.699162    POLR3G
AL050053 0.636678    CTSD 0.443033    ABHD4 0.925172    MAPK12 0.69782    MYB
CD300C 0.635927    CDC25B 0.441665    TIAF1 0.924337    RB1 0.696141    SLIT3
SERBP1 0.627837    ARHGEF1 0.437756    GNRHR 0.924084    TBKBP1 0.694382    COL10A1
EXO1 0.625259    OS9 0.434866    MAPK10 0.922852    RPS24 0.69362    PDHA2
SOX2 0.624761    IRF3 0.43196    DOCK6 0.921808    BCL7A 0.691753    AL049990
IGL@ 0.623824    LCK 0.431374    SH3GLB2 0.920427    CYP3A5 0.691526    CALD1
GABRA5 0.621396    ATP11A 0.430725    DNM2 0.919353    CLDN3 0.690939    RUNX2
L14837 0.621322    MYLK 0.430244    MPZL1 0.917953    TBC1D9 0.680379    FAM134B
WASF2 0.621078    MTA1 0.42867    RHOT2 0.917358    W28731 0.67877    CCL2
PITPNA 0.61933    AF034176 0.428129    MSTO1 0.916624    AB001915 0.678547    RYK
BCR 0.617548    STK10 0.428016    FLJ14154 0.91653    RPGRIP1 0.67696    FMO3
LRP3 0.615812    GRINA 0.426775    TAS2R4 0.916476    CCL8 0.676242    CCR9
SLC12A4 0.613785    TIGR:HG4724-HT5166 0.426136    ARHGAP1 0.914202    CDON 0.67158    AI312646
MPZ 0.613173    FEZ2 0.425676    NM_018544 0.91405    MPHOSPH9 0.671479    CITED1
W28732 0.612571    MYO18A 0.42281    KCTD2 0.913171    MYO10 0.67139    ALDOB
TXK 0.606445    PLCB2 0.422398    TOP3A 0.913029    CORT 0.668359    NPY6R
SMARCD1 0.605121    STAT2 0.422105    LOC100130863 0.912607    PITX2 0.666574    MMP14
Y09160 0.604457    UBTF 0.421073    HGS 0.911103    RAMP1 0.664601    W28760
PRODH 0.602096    P2RX1 0.420448    SAPS1 0.910442    P4HB 0.663916    HMGN3
SIPA1L3 0.600198    FCGRT 0.416489    TYK2 0.910399    SETD1B 0.660944    #NAME?
MATK 0.597389    CAMKK2 0.416075    AA877910 0.90926    PPFIBP1 0.65925    D26561
AD000092 0.595514    LDOC1 0.412768    ATP2A3 0.909127    C1S 0.658609    PAX3
PIP5K1C 0.594681    PDIA4 0.411864    FLJ10404 0.908973    DIAPH2 0.657154    GJA1
D26561 0.588783    RNF19B 0.411436    FAM65A 0.908879    MARCO 0.655271    MAP1A
SLC22A3 0.58855    TMEM63A 0.411433    ALOX5 0.908634    ARF6 0.654926    M30773
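A correlated gene list of this kind can be produced by correlating every gene's expression profile with the marker's profile and keeping genes at or above the 0.4 threshold. The sketch below assumes Pearson correlation, which is an assumption on our part for illustration; the gene names `GENE_A` to `GENE_C` and the expression values are made up.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def correlated_genes(expression, marker, threshold=0.4):
    """Genes whose profile correlates with the marker's at or above threshold,
    strongest first. The marker itself appears with a correlation of 1."""
    ref = expression[marker]
    scores = {gene: pearson(ref, profile) for gene, profile in expression.items()}
    hits = [(gene, r) for gene, r in scores.items() if r >= threshold]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Toy expression matrix: one profile per gene across four samples.
toy_expression = {
    "ZAP70": [1.0, 2.0, 3.0, 4.0],
    "GENE_A": [2.0, 4.0, 6.0, 8.5],
    "GENE_B": [4.0, 3.0, 2.0, 1.0],
    "GENE_C": [1.0, 3.0, 2.0, 4.0],
}
zap70_list = correlated_genes(toy_expression, "ZAP70")

# Intersecting lists from two datasets is then a set operation on gene names.
shared = {g for g, _ in zap70_list} & {"GENE_A", "GENE_X"}
```

Negative correlates can be collected the same way by testing `r <= -threshold` instead.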


Figure 10: Randomness calculation for ZAP-70 and CD38 gene list intersections.

                    GDS1388     GDS1454
total genes         12651       12651
correlates          639         114
% genes selected    0.05051     0.009011
expected random     5.758122
Actual intersected CD38 positive correlates    8

                    GDS1388     GDS1454
total genes         12651       12651
correlates          269         1
% genes selected    0.021263    7.9E-05
expected random     0.021263
Actual intersected CD38 negative correlates    0

                    GDS1388     GDS1454
total genes         12651       12651
correlates          944         55
% genes selected    0.074619    0.004347
expected random     4.104023
Actual intersected ZAP-70 positive correlates    8

                    GDS1388     GDS1454
total genes         12651       12651
correlates          575         124
% genes selected    0.045451    0.009802
expected random     5.635918
Actual intersected ZAP-70 negative correlates    30
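The "expected random" values in Figure 10 are consistent with the product of the two list sizes divided by the total number of genes, which is the expected overlap if both lists were drawn independently at random from the same 12651 features. A minimal sketch of that arithmetic follows; the function name is hypothetical.

```python
def expected_random_overlap(n_first, n_second, total_genes):
    """Expected intersection size of two gene lists drawn independently at random."""
    return n_first * n_second / total_genes


# List sizes taken from Figure 10 (GDS1388 and GDS1454, 12651 genes each).
cd38_positive = expected_random_overlap(639, 114, 12651)
zap70_negative = expected_random_overlap(575, 124, 12651)
```

For the ZAP-70 negative correlates this gives about 5.6 expected genes against 30 actually intersected, which is the comparison the figure is meant to convey.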


Figure 11: Annotated IPA gene lists for correlated GDS2501 gene lists and intersected gene lists for ZAP-70 and CD38.

Symbol: API5
Synonym(s): AAC-11, AI196452, API5L1, FIF, MGC187857
Gene Name: inhibitor 5
Location: Cytoplasm
Type: other

Symbol: CD38
Synonym(s): ADP-ribosyl cyclase, CD38 ANTIGEN, Cd38-rs1, NAD+ glycohydrolase, T10
Gene Name: CD38 molecule
Location: Plasma Membrane

Symbol: ESR2
Synonym(s): BETA ESTROGEN RECEPTOR, ER-BETA, ERB, ERB2, ER[b], ESR-BETA, ESRB, ESTRB, ESTROGEN RECEPTOR BETA, NR3A2
Gene Name: estrogen receptor 2 (ER beta)
Location: Nucleus
Type: ligand-dependent nuclear receptor
Drugs: 17-alpha-ethinylestradiol, fulvestrant, beta-estradiol, estradiol 17beta-cypionate, estrone, estradiol valerate, 3-(4-methoxy)phenyl-4-((4-(2-(1-piperidinyl)ethoxy)phenyl)methyl)-2H-1-benzopyran-7-ol, bazedoxifene, estradiol valerate/testosterone enanthate, TAS-108, ethinyl estradiol/ethynodiol diacetate, estradiol acetate, esterified estrogens, estradiol cypionate/medroxyprogesterone acetate, conjugated estrogens/meprobamate, estradiol/norethindrone acetate, synthetic conjugated estrogens, A, estradiol cypionate/testosterone cypionate, synthetic conjugated estrogens, B, CC8490, MITO-4509, ethinyl estradiol/desogestrel, ethinyl estradiol/drospirenone, premarin, ethinyl estradiol/norelgestromin, ethinyl estradiol/norethindrone, ethinyl estradiol/levonorgestrel, ethinyl estradiol/norgestrel, ethinyl estradiol/norgestimate, conjugated estrogen/medroxyprogesterone acetate, diethylstilbestrol, FC1271A, toremifene, tamoxifen, ERB-041, raloxifene, arzoxifene, clomiphene, estramustine phosphate, diethylstilbestrol d

Symbol: FAM134B
Synonym(s): 1810015C04RIK, AU015349, FLJ20152, FLJ22155, FLJ22179
Gene Name: family with sequence similarity 134, member B
Location: Unknown
Type: other


Figure 12: IPA pathways showing only connected genes, for combined gene lists for ZAP-70.


Figure 13: IPA pathway showing only connected genes, for combined gene lists for CD38.


Figure 14: Ohio State University Medical Center clinical reference ranges, November 2008.
