Lung Disease Diagnosis from Gene Expression Profiles

LUNG DISEASE DIAGNOSIS FROM GENE EXPRESSION PROFILES

SHUYI MA

THESIS

Submitted in partial fulfillment of the requirements for the degree of Master of Science in Chemical Engineering in the Graduate College of the University of Illinois at Urbana-Champaign, 2011

Urbana, Illinois

Adviser:

Assistant Professor Nathan D. Price

ABSTRACT

Lung diseases include some of the most widespread and deadly conditions known to affect people in the US today. One of the main challenges in treating lung disease is the difficulty of diagnosis. Clinical diagnosis remains largely dependent upon symptomatic-based diagnoses; many cases can be either misdiagnosed or undiagnosed until disease has progressed to a more severe stage. Most studies aimed at finding molecular-based diagnostics have focused on one or two diseases at a time, yielding limited success. Instead, we searched for biomarkers reflective of the global health state of the lung by studying data taken from a broad range of lung diseases. We used gene expression microarray data from five different lung diseases—lung adenocarcinoma, lung squamous cell carcinoma, large cell lung carcinoma, chronic obstructive pulmonary disease, and asthma—as well as a non-diseased phenotype, to train a classification tree scheme based on the Top Scoring Pair algorithm (Geman et al., Stat Appl Genet Mol

Biol. 2004; 3: Article 19). The algorithm identified a 32 gene-pair panel that classified all of the phenotypes considered and another panel of 21 gene pair classifiers that classified the three cancers explicitly with sensitivity of 67±8% and 79±6% in ten-fold cross validation (p < 0.001), respectively.

Several of the markers have been previously cited in literature as linked to these cancers. Thus, a TSP- based classification tree scheme accurately identifies lung diseases from the relative expression of a parsimonious set of diagnostic gene pairs.

ACKNOWLEDGMENTS

This project would not have been possible without the support of many people. Many thanks to my adviser, Prof. Nathan D. Price, who helped to make sense of the confusion found in numerous revisions. In addition, special thanks to Jaeyun Sung for guiding me through the initial steps of the project and for informative discussions. Thanks also to Andrew Magis who helped with data processing and to

Yuliang Wang for helpful discussions. Finally, I gratefully acknowledge the funding from the National

Science Foundation Graduate Research Fellowship, which made this project possible.

iii

TABLE OF CONTENTS

CHAPTER 1: INTRODUCTION ...... 1 CHAPTER 2: METHODS ...... 4 2.1 Data Collection and Preprocessing ...... 4 2.2 Disease Classification Algorithm ...... 5 2.3 Disease Classification Scheme Parameter Selection ...... 7 2.4 Classification Scheme Performance Testing ...... 9 2.5 Gene Classifier Interrogation ...... 11 CHAPTER 3: RESULTS ...... 12 3.1 Overview of Data Collected ...... 12 3.2 TSP Tree Classification Scheme Output for All Diseases ...... 13 3.3 TSP Tree Classification Scheme Output for Non-Small Cell Lung Cancers ...... 15 3.4 Biological Significance of the Gene Classifiers ...... 21 CHAPTER 4: DISCUSSION ...... 25 APPENDIX A: COMPARISON OF MICROARRAY PRE-PROCESSING TECHNIQUES ...... 27 APPENDIX B: DETAILED OVERVIEW OF EXPERIMENTAL DATA INCLUDED IN META- ANALYSIS ...... 32 REFERENCES ...... 36

CHAPTER 1 INTRODUCTION

Lung diseases are some of the most pervasive and deadly diseases worldwide, causing over four million deaths each year [1]. In the US alone, asthma affects approximately 30 million people and is the most common chronic disease in children [2]. Chronic obstructive pulmonary disease (COPD), a progressive disease which causes difficulty breathing, affects about 16 million people and is the fourth leading cause of death worldwide [3]. Lung cancer is the most fatal cancer, leading to over 150,000 deaths per year [4, 5].

Currently, a main challenge of treating lung disease is the difficulty of diagnosis. In the case of asthma and COPD, late detection results in significantly reduced quality of life because the effects of the disease are not fully reversible. Most urgently, in the case of lung cancer, the tumor has often metastasized beyond the chest wall by the time of diagnosis. This late-stage diagnosis undermines the efficacy traditional treatments. Two-thirds of lung cancer patients are diagnosed at a late, metastatic, stage, which has a five-year survival rate of 2%, whereas 49% of patients diagnosed at the earliest stage can survive five or more years [5].

A potentially valuable source of information to facilitate more effective diagnosis is the blood as it is comparatively non-invasive to sample and has been shown to contain organ-specific secreted gene products with concentrations that depend on disease state [5]. Despite the latent wealth of information found in blood, discovering viable blood markers for disease has so far yielded limited results. Although a handful of protein markers, including CA15-3, CEA, EGFR, and PAP have been used to aid clinical diagnosis of lung cancer, these markers have substantial false positive rates associated with noncancerous and non-diseased health states. The search for novel protein biomarkers with more favorable false positive rates is hindered by the fact that current proteomics methods cannot adequately survey and quantify the entire set of blood proteins, which have concentrations that range over at least nine orders of

1 magnitude [6]. Proteomics methods that attempt to systematically characterize as much of a biological sample as possible do not yield results as reliable as more targeted protein measurement techniques that quantify specific proteins of known identity. Therefore, a more effective biomarker discovery strategy might involve a two-phase procedure that involves discovering candidate sets of disease markers before further quantification on the protein level.

We propose to identify likely blood marker candidates based on a meta-analysis of microarray data, which comprehensively quantifies gene expression on the mRNA level. To analyze these high- throughput data, which include hundreds of measurements of tens of thousands of genes, we leverage computational algorithms to learn the relevant diagnostic signatures. Many studies in search of molecular diagnostics have been conducted on the genome scale, focusing on the expression levels of all human genes. These have typically focused on characterizing one disease compared to a control [7-9]. This strategy tends to yield limited long term success; few biomarkers survive through clinical trials. We attempt a holistic meta-analysis approach by collectively examining several lung associated phenotypes, both diseased and non-diseased, to find context specific biomarkers that are indicative of specific lung health states rather than general markers of presence or absence of disease.

Previous analyses of multiple diseases have examined expression profiles across hundreds of genes which, although informative, are too numerous to serve as biomarker candidates for blood screening. An alternative approach based on the top scoring pair (TSP) algorithm [10], which classifies based on relative expression of a handful of genes. As this method predisposes the selection of a parsimonious set of classifiers, it is more desirable for identifying candidate gene markers of disease.

Searching for likely blood markers with known identities facilitates targeted assays for more sensitive detection and quantification. We wish to harness a classification scheme based on TSP analysis to identify potential blood markers that can diagnose multiple lung diseases with high accuracy. We will examine data from five different lung disease phenotypes—the three types of non-small cell lung cancers: adenocarcinoma (ADC), squamous cell carcinoma (SCC), large cell lung carcinoma (LCLC); asthma

(AST); chronic obstructive pulmonary disease (COPD), and a non-diseased phenotype (NORM). Our meta-analysis will be based on the gene expression microarray data of tissue samples collected from human subjects from several different experimental studies.

CHAPTER 2 METHODS

2.1 Data Collection and Preprocessing

We assembled lung related expression data from the publicly available online databases, Gene

Expression Omnibus (GEO) and ArrayExpress. The heterogeneous sources of the data included in this meta-analysis introduce several aspects of variability among samples that are unrelated to the disease state. One of the main sources of variability stems from the fact that the data included in the study were originally collected from multiple experimental studies. These studies were performed by laboratories that sampled different geographical patient populations, collected and prepared the tissue samples using different methods, measured the gene expression using different microarray platforms, and utilized different data preprocessing methods to yield gene expression values from hybridization intensities measured by the microarrays. We attempted to minimize the sources of variability that were within our control.

To reduce inter-platform variation, we included only lung-associated data that were collected from two Affymetrix (Santa Clara, CA, USA) oligonucleotide microarray platforms: 1) HG-U133 Plus 2, and 2) HG-U133A, which share a large number of common probes (22277) and have similar manufacturing procedures. These microarray platforms measure gene expression by quantifying the amount of gene-specific mRNA that is hybridized to ‘perfect match’ oligonucleotide probes that are complementary to the genes being measured. In addition to the ‘perfect match’ probes, the microarrays also have ‘mismatch’ probes, which have a mismatched nucleotide half way along the probe sequence.

These mismatch probes attempt to estimate non-specific binding for the probes corresponding to each mRNA gene sequence.

We investigated the effect of microarray preprocessing procedures on reported gene expression values and resulting classification, as described in Appendix A. The findings demonstrate that

4 preprocessing the raw hybridization intensities from different study batches together yielded reduced the resulting gene expression variability due to study batch source. To further mitigate the effects of study batch bias, we considered only phenotypes with data collected from at least two independent study batches, as labeled by the study labels provided by GEO or ArrayExpress, for our study. We used these data to train the classification algorithm and to test its performance.

2.2 Disease Classification Algorithm

Our classification scheme is based on a machine learning algorithm called top scoring pair (TSP) analysis, which analyzes ranked expression levels of samples for a binary phenotype classification [10].

The TSP algorithm finds a single gene pair that displays maximal relative expression reversal between samples of the two phenotypes, as assessed by:

| ( ) ( )|,

where

( ) ( )

is the probability that has greater expression than in patients of phenotype . This probability is estimated when applied to experimental data by comparing the relative frequency of occurrence in patients with the known phenotype . Gene pairs with high ( ) for one phenotype and low ( ) for the other (i.e., high TSP scores) have the desired relative expression reversal that facilitates definitive classification. Thus, the classifying gene pair with the highest TSP score is selected, and the classifying condition is: IF THEN phenotype 1, ELSE phenotype 2. Figure 1 shows the relative expression of the classifying gene pair selected by TSP for the phenotypes AST and COPD. In this example, (ITGB5) is always greater in expression than (NUPL2) for AST patients, and the opposite is true for COPD patients.

Figure 1: Standard TSP classification output applied to AST and COPD patients. The relative expressions of the classifying gene pair are plotted on the axes. In this case, the relative expression of the gene ITGB5 is always greater than NUPL2 in AST, and the opposite is observed in COPD.

The generalization of TSP, called k-TSP, selects a panel of disjoint gene pair classifiers with the highest TSP scores, where is a positive integer that can yield greatest classification sensitivity

( ) [11]. We implement a variation of k-TSP, in which a

test sample that has an expression profile such that an integer of the gene pairs display the relative expression reversal indicative of phenotype 1 over phenotype 2 will be classified as phenotype 1. We select the parameter to maximize the specificity while maintaining apparent sensitivity during training above a specified threshold.

TSP and k-TSP perform binary classification. To classify more phenotypes, we structure a series of binary TSP classifications into a hierarchical classification tree. The tree is constructed by successively grouping together phenotypes with the most similar expression profiles as quantified by TSP score, resulting in an agglomerative hierarchical clustering dendrogram with the TSP score as the distance metric. The complete tree has all considered phenotype classes in the topmost node. Branches that

6 separate groups of phenotypes with the highest TSP score reside close to the top of the tree, and branches that separate groups with the lowest TSP score split at the bottom.

The classification tree informs two methods of sequential TSP comparisons. The first, node-based method focuses on the nodes of each tree. The k-TSP variant method is applied, yielding panels of gene pair classifiers designed to determine whether or not test samples are members of the phenotypes that are grouped at each node. If a sample is affirmatively classified, then it passes through both branches spanning from the node and faces classification at both downstream nodes until it reaches nodes with only one phenotype. If a sample is not classified as belonging to one of the phenotypes at a node, then it is rejected from further classification at downstream nodes. This node-based classification can make multiple diagnoses because a sample could pass classification at multiple branches in the tree.

Furthermore, rejection of a sample from being classified by the tree indicates that the sample does not appear to resemble any of the phenotypes in the tree.

To break ties resulting from the node-based classification, a second, branch-based method is implemented. The branch-based classification compares groups on each left branch against groups on the corresponding right branch with the classic TSP classification approach, yielding one gene pair classifier per branch. If a sample is classified as belonging to a phenotype in the left branch node, then it is passed only down the branches that are downstream of the left node. Otherwise, the sample is passed down for subsequent classification at the branches downstream of the right node, akin to a decision tree with TSP as the decision metric.

2.3 Disease Classification Scheme Parameter Selection

The training of the TSP tree classification scheme involves examining the expression values of the genes that are selected as input to the algorithm. We adopted two primary strategies to refine the set of genes that were considered by the TSP tree scheme.

Our first gene refinement strategy involved examining the quality of the microarray expression measurements for each gene and each sample. The Affymetrix oligonucleotide microarrays standard data preprocessing protocol includes a function that assesses the expression value of each gene as ‘present,’

‘marginal’ or ‘absent’ based on whether or not the measured expression of the perfect match probes are significantly greater than the mismatch probes [12]. The ‘present’ call is made when the p-value of difference is less than 0.04, the ‘marginal’ call is made when the p-value of the difference is between 0.04 and 0.06, and the ‘absent’ call is made when the p-value of the difference is greater than 0.06, representing no significant difference between the measurements of the perfect match and the mismatch probes. These expression quality calls can be used as a basis for filtering the expression probes to be considered in the classification scheme.

Average Sensitivity Number of Genes Remaining 1 25000

0.9 0.8 20000

0.7

0.6 15000 0.5

0.4 10000 Sensitivity 0.3 0.2 5000

0.1 Remaining Genes of Number 0 0 10 20 30 40 50 60 70 80 90 100 Percent Absent Exclusion Threshold

Figure 2: Probe filtering effect on the number of genes excluded (dashed line) and the resulting sensitivities of the TSP tree classification averaged across the phenotypes tested (colored solid line). The error bars represent the standard deviation of the sensitivities averaged across the phenotypes.

We investigated the effect of excluding expression probes that have ‘absent’ calls identified in a range of sample percentages on the resulting TSP tree classification performance. Figure 2 shows the performance of the TSP tree classification scheme as a function of the number of genes excluded from the algorithm’s consideration based on the number of ‘absent’ calls made for each gene. Although the number of genes excluded changes substantially depending on the threshold of ‘absent’ calls that are acceptable

8 for a gene to be considered in subsequent classification (ranges from 5818 to 21647 probes), the performance of the TSP tree classification scheme does not significantly change in the ‘absent’ call threshold. We chose to use a 90% absent exclusion threshold; ie., genes were excluded from consideration if more than 90% of the samples were labeled as ‘absent.’

Our second strategy for refining the set of genes to be considered by the classification algorithm is motivated by the long-term goal of finding diagnostic markers in the blood. To maximize the likelihood of finding classifiers that can be ultimately measured in the blood, we restricted the number of genes considered to those that code for proteins that have been reportedly found in the blood. We used a list of

3020 peptides found in the blood by the HUPO plasma proteome project as the reference list [13], and found 2819 expression probes on the microarrays that code that overlap with the HUPO list. Out of these,

2100 expression probes passed the absent call exclusion threshold and were considered in subsequent classification.

2.4 Classification Scheme Performance Testing

We subjected the classification scheme to a series of tests to assess the performance of the TSP tree classification performance. Our primary performance assessment involved executing a ten-fold cross validation with the data. The purpose of implementing cross validation is to test the performance of the classification scheme on data on which the algorithm was not trained, which mitigates the effect of over- fitting and more closely approximates a clinically relevant test scenario. Cross validation is implemented by randomly excluding one tenth of the samples of each phenotype from the training data and by evaluating accuracy based on the classification of this excluded test set. The process is iterated ten times so that each excluded test set is disjoint. In order to maintain an approximately uniform distribution of phenotypes in the training dataset, we implemented a stratified cross validation wherein for each iteration of the outer loop, 58 ADC samples, 48 SCC samples and 47 NORM samples were included into the inner cross-validation to determine the training set.

We compared the performance of the TSP tree classification scheme with another popular machine learning classification algorithm, the support vector machine (SVM) [14]. We trained a one-vs.- one multi-class SVM with a linear kernel on the disease data using the LIBSVM library for MATLAB

[15]. We further implemented feature selection to choose the genes to be included in the analysis to be equal to the number of parameters selected from the TSP classifiers. The feature selection is based on the

F-score metric [16], which is calculated for each gene as [17]:

∑ ∑ ( ) ( ̅̅̅ ̅ ̅ ) ( ) ∑ ∑ ( ) ( ̅̅̅ ̅ )

where labels the total number of samples included in the analysis, and labels the phenotypes to be classified. The represents the expression of the th gene in the th sample, and the represents the phenotype class of the th sample. The terms ̅̅̅ ̅ and ̅ represent the mean expression level for a gene within a phenotype and across all phenotypes, respectively. The ( ) represents the Kronecker delta function:

( ) {

To test the significance of the TSP tree classification scheme performance, we randomly permuted the phenotype class labels associated with the expression data and trained the classification scheme on these permuted expression values 1000 times. We used the resulting distribution of sensitivities to estimate a p-value for the performance of the classification scheme derived from the un- permuted data.

To estimate the effect of experimental batch sources of variability on the data, we implemented a leave lab out validation, wherein the expression data from each experimental study was excluded from the training of the data and used to evaluate the performance of the classification algorithm. We also

10 investigated the effect of excluding multiple experimental studies from the training set on the resulting classification performance.

2.5 Gene Classifier Interrogation

As an additional, informal validation, we examined protein expression for the genes selected by the classification algorithms by visually inspecting immunohistochemistry (IHC) tissue stains the Human

Protein Atlas (http://www.proteinatlas.org/) [18]. We compared the tissue stains of lung cancer and normal bronchial tissue for the genes in the lung cancer-associated classifier panels.

We investigated the biological significance of the individual gene classifiers selected by the TSP tree classification algorithm by performing a literature search aided by the gene annotation compendium,

Gene Cards (www.genecards.org) [19]. We investigated functional interactions between genes using the curated protein interaction database Metacore (Genego, St. Joseph, MI), which generated the interaction networks and provided the links to the literature substantiation.

CHAPTER 3 RESULTS

3.1 Overview of Data Collected

We gathered lung disease related data from the publicly available online repositories: Gene

Expression Omnibus (GEO) and ArrayExpress. Table 1 summarizes the gene expression microarray data collected, and Appendix B gives details about the studies and experimental procedures that generated these data. Each phenotype has data collected from three or more independent studies, with sample sizes ranging from 49 to 580.

Table 1: This reports the number of samples, the number of studies, the types of platforms, and the method of tissue extraction used to collect the samples of the data included in the study. The platform labels represent 1) HG-U133 Plus 2, and 2) HG-U133A.

# Disease Label Platforms # Studies Sample Type Samples Adenocarcinoma ADC 1, 2 14 580 resection Squamous Cell Carcinoma SCC 1, 2 7 239 resection Large Cell Lung Carcinoma LCLC 1,2 4 49 resection airway epithelial Asthma AST 1 3 70 brushing, bronchoalveolar lavage Chronic Obstuctive airway epithelial COPD 1,2 4 63 Pulmonary Disease brushing, resection airway epithelial Normal NORM 1,2 11 469 brushing, resection

A potential source of variability originates from the heterogeneity of sample collection methods.

Cancer tissue samples were collected from surgical resection, whereas some COPD samples were collected from airway epithelial endoscopic brushings. AST samples were collected by brushing or by bronchoalveolar lavage. NORM samples consist of tissue collected from sudden death autopsy subjects as well as controls from studies that supplied the diseased tissue samples. The surgical resection control tissues collected in the studies generating the cancer-associated data were collected from patients with lung cancer locations distal to the tumor.

3.2 TSP Tree Classification Scheme Output for All Diseases

We trained the TSP tree algorithm on these data, including both disease and non-disease phenotypes for consideration. In this paradigm, the top node does not have a classifier panel as all samples were assumed to belong to one of the six phenotypes. We used a sensitivity threshold of 98% for finding optimal . Figure 3 shows the TSP tree algorithm output consisting of the classification tree and gene pair classifier panels. The vote integer at each node is equal to 1 unless otherwise specified.

Figure 3: TSP classification tree and gene pair markers. The tables associated with each node specify the node- based classifiers. The diamond boxes show the branch-based classifiers. The classifiers consist of 32 gene pairs of 54 unique genes.

Notably, tree branching reflects phenotypic characteristics. For example, the lung cancers, ADC

LCLC, and SCC, are first separated from the non-cancerous phenotypes before being separated from each other. Additionally of interest is the fact that several of the genes selected as classifiers by the algorithm have literature evidence suggesting a biological role in the associated diseases, demarcated in the figure by yellow or green boxes or text. This will be expounded further in Section 3.4.

Figure 4 compares ten-fold stratified ten-fold cross validation performance results of the TSP tree method against one-vs-one multiclass SVM with feature selection. The TSP tree method has an average sensitivity of 67±8%, which is comparable to the average performance of one-vs-one SVM: 66±6%. The

TSP tree classifies lung disease expression data reasonably well, with 81% accuracy for SCC and COPD and at least 88% accuracy for the other three phenotypes. In contrast, classification sensitivity of LCLC and NORM samples is 46±10% and 52±10%, respectively. While these sensitivities are 2.8 fold better than chance and are comparable to the SVM performances, they are still suboptimal.

All Health State Classification

TSP Tree One-vs-One SVM 1 0.9 0.8 0.7

0.6 0.5

Sensitivity 0.4 0.3 0.2 0.1 0 ADC SCC LCLC AST COPD NORM Health State

Figure 4: Ten-fold cross validation results of TSP tree algorithm compared to one-vs-one SVM. The TSP method performs better in classifying most of the diseases and worse in classifying non-disease.

The confusion matrix in

Table 2 shows how samples of each phenotype are classified by the TSP tree algorithm. While

overall precision ( ) excluding rejection is 72±11%, the precision of LCLC and

NORM samples is significantly poorer. LCLC samples have significant misclassification for ADC and

SCC (20±9% each). NORM samples have significant false positive rates (20±7%) for COPD.

Table 2: The confusion matrix shows sample performance in ten-fold cross-validation. The rows represent the actual phenotype, and the columns represent the predicted phenotype. The values in the boxed cell is the fraction of a particular phenotype that was predicted to be a given phenotype ( ( )) for * + and * +. The prediction represents samples that were not explicitly designated a phenotype by the classification algorithm.

Predicted Phenotype ADC SCC LCLC AST COPD NORM REJECT 0.66 0.04 0.11 0.002 0.02 0.07 0.11 ADC 58 ± 0.09 ± 0.02 ± 0.03 ± 0.006 ± 0.02 ± 0.04 ± 0.06

0.05 0.72 0.10 0 0.02 0.02 0.09 SCC 47.8 ± 0.05 ± 0.10 ± 0.05 ± 0 ± 0.02 ± 0.02 ± 0.03 0.20 0.20 0.46 0 0.02 0.02 0.10 LCLC 49 ± 0.09 ± 0.09 ± 0.10 ± 0 ± 0.02 ± 0.02 ± 0.04 0 0.001 0.004 0.95 0.01 0.01 0.03 AST 70 ± 0 ± 0.005 ± 0.007 ± 0.02 ± 0.01 ± 0.02 ± 0.03 0.02 0.01 0.01 0.01 0.72 0.20 0.04 COPD 63 ActualPhenotype ± 0.02 ± 0.01 ± 0.01 ± 0.01 ±0.07 ± 0.07 ± 0.04 0.07 0.02 0.01 0.10 0.21 0.52 0.07 NORM 46.9 ± 0.04 ± 0.02 ± 0.02 ± 0.03 ± 0.09 ± 0.10 ± 0.04 55.2 47.7 35 72.3 58.6 43.3 22.6 334.7

3.3 TSP Tree Classification Scheme Output for Non-Small Cell Lung Cancers

Accounting for the difficulty in separating NORM from COPD, we constructed a second TSP tree that explicitly classifies only the cancer phenotypes (Figure 5).

Figure 5: TSP tree classification scheme for lung cancers. The classifiers consist of 21 gene pairs of 37 unique genes.

Unlike the tree containing all phenotypes, classification is performed at the top node, which distinguishes whether a sample is one of the lung cancers considered or a non-cancer. Additionally of interest is the fact that several of the genes selected as classifiers by the algorithm have literature evidence suggesting a biological role in the associated diseases, demarcated in the figure by yellow or green boxes or text. This will be expounded further in Section 3.4.

Figure 6 compares ten-fold cross-validation performance of the TSP cancer tree to one-vs-one

SVM classification, and Table 3 reports the corresponding confusion matrix. Performance on non-cancer phenotypes is measured as the fraction of samples correctly rejected from further classification in the cancer tree.

Cancer State Classification TSP Tree One-vs-One SVM 1

0.8

0.6

0.4 Sensitivity 0.2

0 ADC SCC LCLC AST COPD NORM Health State Figure 6: Ten-fold cross validation results of TSP cancer tree and SVM classification.

The TSP tree method has an average sensitivity of 79±6%, which is comparable to the average performance of one-vs.-one SVM: 79±5%. The cancer tree rejects non-cancerous phenotypes with accuracies of 87-98±2%. The poorest overall performer, LCLC, has a sensitivity of 50±12% with the main sources of error deriving from misclassification as ADC or SCC with 21±9% and 20±6% respectively. This effect is possibly due to a shared non-small cell lung cancer signature. NORM has the greatest non-cancerous false rejection rate and is mistaken for ADC with 10±4% misclassification. This observation may be as a consequence of a significant number of NORM tissue samples being derived from adjacent non-cancer lung tissue resected from lung cancer patients.

Table 3: The confusion matrix shows sample performance in ten-fold cross-validation. The rows represent the actual phenotype, and the columns represent the predicted phenotype. The rows represent the actual phenotype, and the columns represent the predicted phenotype. The values in the boxed cell is the fraction of a particular phenotype that was predicted to be a given phenotype ( ( )) for * + and * +. The prediction represents samples that were not explicitly designated a phenotype by the classification algorithm

Predicted Phenotype ADC SCC LCLC REJECT 0.70 0.04 0.11 0.15 ADC ± 0.07 ± 0.02 ± 0.04 ± 0.05 58

0.08 0.74 0.10 0.09 SCC ± 0.05 ± 0.09 ± 0.07 ± 0.06 47.8 0.21 0.20 0.50 0.09 LCLC ± 0.09 ± 0.06 ± 0.12 ± 0.05 49 0.01 0.01 0.01 0.98 AST ± 0.01 ± 0.02 ± 0.01 ± 0.02 70 0.05 0.005 0.01 0.93

ActualPhenotype COPD ± 0.02 ± 0.01 ± 0.02 ± 0.02 63 0.10 0.02 0.01 0.87 NORM ± 0.04 ± 0.02 ± 0.01 ± 0.04 46.9 62.2 49.3 37.4 185.8 334.7

Figure 7 shows the TSP tree classification performance when trained with label permuted data.

The graph compares the distribution of the resulting sensitivities yielded from 1000 training iterations with the performance trained from actual data. The label permuted training data yielded sensitivities of

36±28%, which is significantly different from the performance trained by un-permuted data (79±6%) with a p-value < 0.001.

Figure 7: Histogram of the overall classification sensitivities resulting from 1000 label permutations. The red line indicates the overall classification sensitivity of the classification without label permutation.

Figure 8 shows the leave lab out validation performance results. The overall resulting sensitivities for the phenotypes are: 71±7% for ADC, 65±9% for SCC, 18±10% for LCLC, 71±26% for AST, 90±9% for COPD, and 89±11% for NORM. In general, we observe the trend that phenotypes with source data from many independent studies performed better than the phenotypes with source data from fewer source.

For example, COPD has a greater average sensitivity than AST, and NORM has a greater accuracy than

SCC, which in turn has a greater accuracy than LCLC. The main discrepancy in this trend is between

LCLC and AST. This exception might be due to the fact that LCLC is substantially more similar to the other non-small cell lung cancers than AST is to NORM or COPD.

ADC SCC LCLC AST COPD NORM 1 0.9 0.8 0.7 0.6 0.5 0.4

Sensitivity 0.3 0.2 0.1

E_15(ADC)

E_231(ADC)

E_2109(SCC) E_3141(SCC) E_6253(SCC) E_7368(AST) E_4302(AST)

E_47(NORM) E_15(NORM)

E_3141(ADC) E_7670(ADC) E_2109(ADC)

E_2109(LCLC)

E_19188 (SCC) E_14814 (SCC) E_10245 (SCC) E_18842 (SCC) E_18965 (AST)

E_231(NORM) E_994(NORM)

E_14814 (ADC) E_19188 (ADC) E_12667 (ADC) E_10799 (ADC) E_17475 (ADC) E_10445 (ADC) E_10245 (ADC) E_10072 (ADC) E_18842 (ADC)

E_1650(COPD) E_8581(COPD) E_8545(COPD) E_5058(COPD)

E_14814 (LCLC) E_19188 (LCLC) E_10445 (LCLC)

E_1650(NORM) E_4302(NORM) E_1643(NORM) E_7670(NORM) E_8545(NORM) E_5058(NORM) E_8581(NORM)

E_10799 (NORM) E_12345 (NORM) E_18842 (NORM) E_18965 (NORM) E_10072 (NORM) E_19188 (NORM) Study ID Excluded

Figure 8: TSP tree classification performance with leave lab out validation. The horizontal axis labels each experimental batch that was excluded from the training set. Each color represents the performance associated with a different phenotype.

We further investigated the trend of increased number of source experimental batches leading to improved leave lab out validation performance by evaluating the classification performance resulting from excluding multiple labs from the training set. We tested the three phenotypes with more than 5 independent experimental batches. Figure 9 shows the resulting sensitivities yielded from a range of the number of experimental batches excluded from the training to be used as the test sets. To compare between the three phenotypes, each of which has a different number of source experimental batches, we normalized the number of labs excluded as a percent of the total number of source studies in each

18 phenotype. The overall trend suggests a slight negative correlation between the number of labs excluded and the resulting sensitivities; all of the trendlines are have negative slopes.

ADC SCC NORM 1 0.9 0.8

0.7

0.6 0.5

0.4 Sensitivity 0.3 0.2 0.1 0 0 0.2 0.4 0.6 0.8 1 Percent of Studies Excluded

Figure 9: Leave multiple lab validation performance. The average overall classification sensitivity is plotted as a function of the fraction of total experimental studies excluded from the algorithm training step. The colored lines are trendlines estimated from linear regressions of the data from the respective phenotypes.

To investigate the correspondence of gene expression level with protein level of the TSP cancer tree classifiers, we examined immunohistochemistry (IHC) stains for lung cancer from the Human Protein

Atlas [18]. Figure 10 shows IHC stains of cancerous and non-diseased tissue for the gene pair classifiers from the (ADC SCC LCLC) node of the TSP cancer tree (Figure 5). Each image has cells stained blue to indicate the presence of the nuclei and brown to indicate the presence of the target protein. Visual inspection indicates that many gene pairs display the same relative expression profiles as predicted by

TSP. In other words, the genes in the left column tend have greater protein expression, reflected in a more intense brown color, than the genes in the right column for the cancer stains. In contrast, genes in the right column have greater expression than those of the left column in the normal tissue stains. Note that several of these genes appear to be differentially expressed between cancer and normal tissue. This suggests that the gene pair markers selected by the TSP tree algorithm from gene expression data on the mRNA level might have similar predictive power when measured on the protein concentration level.

Figure 10: IHC stains of gene pair classifiers for the (ADC SCC LCLC) node from the TSP cancer classification tree. Brown indicates presence of protein and blue shows cell nuclei. Markers in each left column have greater expression than markers in the right columns in the TSP cancer tree method. Note that the COL5A2-FAM107A pair did not have IHC stain images available on the Human Protein Atlas.

3.4 Biological Significance of the Gene Classifiers

The genes selected as classifiers by the TSP tree classification algorithm for both the all- phenotype (Figure 3) and the cancer specific (Figure 5) cases have notable connections to the phenotypes they were selected to classify.

In both the classification scheme for all diseases and the cancer-specific classification scheme, the algorithm selected several genes associated with cancer. For example, ABCC3, JUP, DSC3, and KRT5 have been directly associated with their diagnostic phenotypes, ADC and SCC, respectively [20-23].

Similarly, LCN2 has been linked to cancer cell proliferation in ADC [24]. Other genes have been associated with cancers of relevant histologies in other parts of the human body. For instance, CAST, a classifier linked to high relative expression in ADC, has been linked to colorectal adenocarcinoma [25].

In contrast, PPL and CSF1R, classifiers linked to low relative expression in LCLC and ADC respectively, have been linked to esophageal and oral squamous cell carcinoma [26, 27].

Besides these connections, other gene markers had reported links to lung cancer in general. For example, EGFR is linked to non-small cell lung cancer and has been used as a blood marker in the clinic for lung cancer [28, 29]. GSTM3 has been found to modulate susceptibility to smoking-related lung cancer [30]. TPI was linked to lung cancer in an in vitro carcinogenesis model [31]. These genes were selected as classifiers linked to low relative expression in AST and COPD, which makes sense because a significant portion of the data that the algorithm is seeking to reject at the AST and COPD nodes is lung cancer data.

Several other genes in the classification schemes have been associated with cancer-associated processes. For example, TRAP1, a marker linked to high relative expression in LCLC, is an apoptosis inhibitor in lung cancer [32]. Furthermore, inhibition of KIF11, a classifier linked to the (ADC LCLC) node, is associated with reduced tumor growth and is [33]. Indirect cancer associations are also found in the COPD panel. For example, alterations of the acetylated HIST4H4, a low relative expression COPD

21 marker, have been attributed to decrease in protein levels of the transcription factor c-Myc, which contributes to regulating cell cycle progression, apoptosis, and cellular transformation under hypoxic conditions [34, 35].

The classification trees also isolated genes associated with inflammation in relevant markers. In particular, MRC1, a marker selected as having low relative expression in ADC, LCLC, and SCC, has been linked to inflammatory response by mediating ligand binding and endocytosis of macrophages [36], probably because a significant portion of the data that the algorithm is seeking to reject at the cancer node is AST and COPD data. Additionally, the positive COPD and negative NORM marker CAMK2G has been linked to airway inflammation, which is one of the pathophysiological effects of COPD [37].

In addition to isolating individual classifiers that have biological relevance to the target phenotypes, the classification trees also selected gene pairs that have functional interactions associated with the phenotypes. The first two examples are classifier pairs in the ADC and LCLC panels from the all phenotypes classification scheme. TRIM28 is a corepressor that contributes to the inactivation of the tumor suppressor p53 [38]. IQGAP1 had expression in the cell membrane that was previously found to be inversely correlated with E-cadherin, in gastric cancers, which contributes to tumor cell migration [39].

Increased expression of DNMT1 results in hypermethylation of tumor suppressor gene promoters and subsequent inhibition of these genes, and has been shown to correlate with poor prognosis in lung cancer patients [40]. NOTCH2 is contributes to a signaling pathway that induces apoptosis, which is linked to a classifier that has low relative expression in LCLC. Interestingly, IQGAP1 and TRIM 28 are linked by molecular interaction, as are DNMT1 and NOTCH2, as illustrated in Figure 11. IQGAP1 has been found to activate ERK1 by binding, which serves to propagate the MAPK cascade to regulate cellular processes including proliferation, differentiation, and cellular signals [41] and has been shown to play a role in lung cancer cell lines [42, 43]. ERK1 in turn phosphorylates TRIM28 with undetermined effect [44]. DNMT1 transcriptionally inhibits the transcriptional factor ESR1 [45], which is primarily responsible for

22 regulating cellular growth and differentiation in reproductive organs but has also been found to be expressed in ADC [46]. ESR1 in turn transcriptionally activates NOTCH2 [47].

Figure 11: Interaction network associated with the ADC node of the all phenotype TSP tree (Figure 5). The classifiers are circled in blue. The interaction of IQGAP1 and TRIM28 (i.e., TIF1-beta) is linked by ERK1/2 (highlighted in pink), and the interaction of DNMT1 and NOTCH2 are linked by ESR1 (highlighted in blue). Network representation is generated by Metacore (Genego, St. Joseph, MI). The highlighted interactions are also observed in the LCLC node.

In addition, the genes COL3A1 and EPAS1 are linked by molecular interaction via MMP14

(Figure 12). EPAS1, which has been associated with oxygen homeostasis and angiogenesis in cancer [48,

49], has been shown to transcriptionally activate MMP4 under hypoxic conditions [50]. MMP14 in turn inhibits COL3A1 by cleaving the collagen into peptide fragments [51]. The linking gene MMP14 has been shown to be differentially expressed in non-small cell lung cancer compared to normal lung tissue

[52], and has been linked to tumor cell migration and metastasis by participating in the invasion of the 3D extracellular collagen matrices [53].

Figure 12: Interaction network associated with the (ADC LCLC SCC) node of the cancer specific TSP tree (Figure 5). The classifiers are circled in blue. The interaction of COL3A1 (i.e. collagen III) and EPAS1 is linked by MMP14 (highlighted in blue). Network representation is generated by Metacore (Genego, St. Joseph, MI).

Examination of the gene pair classifiers derived from the TSP classification scheme suggests putative biological links between the classifiers and the phenotypes being classified.

CHAPTER 4 DISCUSSION

We have developed a lung disease diagnosis scheme based on a meta-analysis of gene expression microarray data. The meta-analysis across many experimental studies and hundreds of samples was facilitated by a novel consensus preprocessing of the raw microarray intensity measurements to reduce experimental batch-associated variability. We developed two classification schemes: a generic classifier which explicitly diagnoses all of the phenotypes considered and a cancer-specific classifier which only explicitly diagnoses the three non-small cell lung cancer phenotypes. This all-phenotype tree classifies phenotypes with significant accuracy that was comparable to performance with one-vs.-one SVM, but it has substantial misclassification from NORM samples as being classified as AST or COPD. This might be because a substantial number of NORM samples were non-cancerous tissue taken from lung cancer patients. These subjects might have had undocumented COPD or AST. Also, a significant number of samples were controls from studies examining AST and COPD. COPD and AST had comparatively few numbers of samples and independent studies undertaken, which potentially allows for a lab-specific signature to dominate the expression profiles. Given the difficulty separating NORM from COPD and

AST, we chose to focus on the lung cancers.

The cancer-specific tree scheme explicitly diagnoses only the cancer phenotypes and rejects non- cancers from classification. This cancer tree classifies the phenotypes examined with comparably to one- vs.-one SVM with good accuracy on all phenotypes except LCLC. We surmise that the misclassification of LCLC might be due to a combination of the molecular similarity of LCLC with ADC and SCC— making it a harder classification problem than AST vs. ADC, for example—in concert with the relative dearth of data available compared to the other phenotypes. Leave-lab-out validation substantiates the premise that improved increasing the amount of training data available from more diverse set of experimental studies results in improved classification efficacy by showing a negative correlation between the accuracy and the number of independent studies available to train the classification

25 algorithm. Encouragingly, immunohistochemistry stains showed that the gene products of many classifiers display similar relative concentrations to those predicted by TSP, which substantiates the possibility that these classifiers have diagnostic protein concentration profiles in the blood.

Interestingly, several classifiers isolated by the algorithm can be biochemically linked to their disease phenotypes in literature. This suggests that there might be an underlying biological reason why the gene pairs are diagnostic. We plan to examine further the potential biological significance of classifiers, especially in the context of probing the underlying gene and protein networks that are perturbed by disease.

We have taken a first step toward finding diagnostic blood markers for lung disease. Guided by the lists of gene pairs inferred from the TSP tree algorithm, we can target sensitive assays to quantify protein levels of promising classifiers in the blood. If a set of blood markers can demonstrate predictive reproducibility, these can be harnessed as a platform to aid effective diagnosis and presymptomatic screening of disease. Moreover, elucidating the potential contributions of the markers to disease-inducing perturbations might bring greater insight into underlying mechanisms of the diseases and inform a closer monitoring of disease states. Thus, markers potentially have diagnostic utility to aid effective early screening of patients and to assess the state of diseases through its progression and through treatments.

APPENDIX A: COMPARISON OF MICROARRAY PRE-PROCESSING TECHNIQUES

The gene expression microarray data measures intensity of fluorescently labeled gene fragments that are bound to sets of oligonucleotide probes with specific sequences tailored to be complementary to the target genes. The raw measurement for each gene is a set of intensities for each set of probes, which require in silico preprocessing to reconcile into a single expression value. The integration of the data involves correcting for background variability, normalizing the intensities, and summarizing the set of probes into a single value. Several preprocessing methods have been developed by multiple sources to perform this analysis; Table 4 summarizes the features of the three methods that were applied to the data by the experimental studies included in this meta-analysis.

Table 4: Description of three gene expression microarray preprocessing procedures, as adopted from [54]. Each of these procedures have distinct approaches to the background correction, normalization, and summarization steps.

Background correction Normalization Summarization Procedure Reference Method Method Method Full or partial mismatch Constant MAS5 Tukey biweight [55] probe subtraction parameter

Signal and noise close- RMA Quantile Median polish [56] form transformation

Optical noise, probe GCRMA affinity and mismatch Quantile Median polish [57] probe adjustment

We studied the effect of including data processed with different procedures the resulting expression values reported as well as and its effect on TSP classification performance on a small test subset of the meta-analysis data. The test set incorporated data from ADC (77 samples from GSE10245,

GSE17475, and GSE10799) and SCC (75 samples from GSE10245, GSE2109, and GSE6253). The raw intensity files were compiled and preprocessed each sample with RMA and GCRMA using MATLAB.

We subsequently assembled datasets composed of different mixtures of preprocessing methods (ie.: 0%,

25%, 50%, 75%, 100% RMA preprocessing) and compared the performance of standard binary TSP. output of binary TSP using input of data with mixed preprocessing method. For each input dataset

27 containing both RMA and GCRMA preprocessed data, we tested 20 dataset splicing permutations that preserved the RMA/GCRMA ratio.

Figure 13 shows the ten-fold cross validation performance of the binary TSP classification as a function of the fraction of samples that were preprocessed with RMA. The accuracy of the binary TSP classification is substantially undermined by including datasets generated by heterogeneous preprocessing methods. Interestingly, datasets preprocessed consistently yielded comparable accuracies independent of the preprocessing method selected.

Figure 13: Performance of binary TSP classification at different RMA/GCRMA preprocessing ratios. The error bars represent the standard deviation from 20 permutations.

We investigated a potential cause for this difference in classification performance. Figure 14 illustrates the distribution of the gene classifiers selected by TSP during different rounds of training. The

TSP algorithm selected classifiers much more consistently for datasets containing pure RMA and pure

GCRMA than for datasets containing mixed RMA and GCRMA, which show almost a uniform distribution of gene selection. This suggests that different preprocessing methods transform the data in

28 different ways, so that mixing preprocessing procedures introduces a source of non-phenotype associated variability. Applying a single preprocessing procedure consistently mitigates this effect.

Figure 14: Distribution of classifiers selected by TSP in different iterations of training. The horizontal axis is the gene label and the vertical axis is the frequency that any particular gene were chosen.

Although the RMA and GCRMA preprocessed sets yielded comparable accuracy, we wished to determine whether there is substantial difference in selecting one preprocessing method over the other.

Table 5 reports the top ten classifier pairs selected by the TSP algorithm based on input data processed only by RMA and only by GCRMA, respectively. There is little overlap between the classifiers selected

29 in two preprocessing sets; only 5 out of the 20 genes are in common, and none of the gene pairs are preserved across the two sets.

Table 5: Top ten TSP classifier gene pairs selected based on data from RMA and GCRMA consensus preprocessing, respectively. The genes colored blue and underlined have literature references associated with the relevant phenotype.

Because we wished to assess which of the two preprocessing methods is more favorable for the purpose of the meta-analysis, we compared the literature associations with each gene on the two gene lists. We found that the GCRMA list had substantially more genes that had literature references associated to the relevant phenotype (ADC in the case of the column, SCC in the case of the column), which are demarcated in the table as the boxes highlighted in green or yellow. Therefore, we elected to use the GCRMA method in MATLAB to preprocess all of the data included in the meta-analysis.

Figure 15 compares the effect of consistent versus non-consistent preprocessing on the variance in the entire meta-analysis dataset as measured by principal component analysis. Panel A and C show the data preprocessed differently by each experimental study, and panels B and D show the data preprocessed consistently by GCRMA in MATLAB.

A B

C D

Figure 15: Principal component analysis of the lung-associated gene expression data post consensus preprocessing. The first two principal components are shown. In panels A and B, the colors denote different phenotypes, whereas in panels C and D, the different colors denote different experimental batch.

Visual inspection suggests that in both the consistent preprocessing and the study specific preprocessing cases, the data tend to be separated by experimental study more prominently than by phenotype. However, consistent preprocessing substantially reduces the variation associated with batch effects.

APPENDIX B: DETAILED OVERVIEW OF EXPERIMENTAL DATA INCLUDED IN META- ANALYSIS

Table 6: Description of the experimental studies from which the data included in the meta-analysis were derived. The ‘Study Label’ column designates the GEO or ArrayExpress study identification numbers. The ‘GPL’ column designates the microarray platform used to take the expression measurement, as labeled by GEO (GPL570 represents HG-U133 Plus 2, and GPL96 represents HG-U133A). The bolded, boxed numbers in the ‘Samples’ column report the total number of samples for each phenotype.

Table 6 (cont.) Pre- Study Sample RNA Extraction Hybridization Phenotype GPL Samples Citation processing Notes Label Type Method Protocol Method phenol chloroform isoamyl alcohol Expression steroid broncho- extraction Analysis resistant vs Asthma Goleva alveolar

GSE7368 570 6 ethanol pre- Technical MAS5.0 steroid (AST) (no paper) lavage cipitation recovery, Manual of succeptible cells Qiagen RNeasy Kit Affymetrix. subgroups purification 2.75ug cRNA, airway NuGen Ovation 16 hr, 45C, RMA

GSE4302 570 55 [58] epithelial RNA Affymetrix (Bioconductor) brushing amplification Fluidics Station 450 cells were 15ug cRNA, cultured for airway 16 hr, 45C, Qiagen RNeasy GCRMA 24 hours

GSE18965 96 9 [59] epithelial Affymetrix Kit (Bioconductor) before brushing Fluidics RNA Station 450 extraction 70

trizol extraction, Chronic 15ug cRNA, Qiagen RNeasy Obstructive airway 16 hr, 45C, Kit purification

Pulmonary GSE5058 570 15 [60, 61] epithelial Affymetrix RMA Expression Disease brushing Fluidics Analysis (COPD) Station 450 Technical Manual trizol extraction, Qiagen RNeasy 15ug cRNA, airway Kit 16 hr, 45C,

GSE8545 570 15 [62] epithelial RMA Expression Fluidics brushing Analysis Station 450 Technical Manual resection, 20ug cRNA Expression snap 45C overnight,

GSE8581 570 15 [7] Analysis MAS5.0 frozen 450 Fluidics Technical Manual tumor Station trizol extraction, Superscript II reverse trans- criptase, oligo dT 8-10ug cRNA, resection, primer, T7 RNA overnight, snap emphysema

GSE1650 96 18 [63] pol promoter, Agilent RMA frozen only ENZO Bioarray confocal laser tumor RNA transcript scanning labeling kit, Qiagen RNeasy Kit purification 63

Adeno- resection, Expression carcinoma snap Qiagen RNeasy Analysis GCRMA

GSE10245 570 40 [64] (ADC) frozen Kit Technical (Bioconductor) tumor Manual. resection, Qiashredder, Expression snap Qiagen RNeasy Analysis

GSE17475 96 28 [65] MAS5.0 frozen Kit RNA Technical tumor extraction Manual.

Table 6 (cont.) Pre- Study Sample RNA Extraction Hybridization Phenotype GPL Samples Citation processing Notes Label Type Method Protocol Method 50ng cRNA, stromal cell 16hr 45C, resection, Expression content: Adeno- Fluidics snap Analysis <30%,

carcinoma GSE10799 570 16 [66] Station 450 GCRMA frozen Technical Manual MAIME (ADC) stain, tumor protocol standards GeneChip followed Scanner 7G resection, Expression Expression snap Analysis

GSE7670 96 27 [67] Analysis MAS5.0 frozen Technical Technical Manual tumor Manual trizol extraction, resection, Qiagen RNeasy Expression snap Kit, Analysis RMA

GSE10072 96 58 [68] frozen Expression Technical (Bioconductor) tumor Analysis Manual Technical Manual resection, trizol extraction, Expression snap Expression Analysis >70%

GSE12667 570 75 [69] MAS5 frozen Analysis Technical tumor cell tumor Technical Manual Manual resection, Expression Expression snap Analysis

GSE3141 570 58 [70] Analysis MAS5 frozen Technical Technical Manual tumor Manual resection, Expression Expression snap Analysis Analysis

GSE2109 570 61 [71] MAS5 frozen Technical Manual Technical tumor protocol Manuall trizol extraction, 20ug cRNA resection, Qiagen RNeasy 16hr 45C, 450 snap Kit, Fluidics

GSE19188 570 45 [72] RMA frozen Expression Station, tumor Analysis GeneChip Technical Manual Scanner 3000 Expression resection, Expression Analysis snap Analysis

GSE18842 570 14 [73] Technical RMA frozen Technical Manual Manual tumor protocol protocol guanidium isothiocyanate resection, Expression solution snap Analysis RMAexpress

GSE14814 96 28 [74] homogenization, frozen Technical v0.3 acid phenol- tumor Manual chloroform extraction Superscript ds- cDNA synthesis Expression kit, Analysis resection, ENZO Bioarray hi Technical E-TABM- Blum snap 96 23 yield RNA Manual, MAS5.0

15 (no paper) frozen transcript labeling Fluidics tumor kit, Station Qiagen RNeasy Protocol Kit purification GeneChip GeneChip resection, Expression Expression Tumor E-MEXP- snap Analysis 96 49 [75] Analysis MAS5.0 content:

231 frozen Technical Technical Manual >70% tumor Manual protocol protocol resection, trizol extraction, Expression snap Expression Analysis

GSE10445 570 59 [76] RMA frozen Analysis Technical tumor Technical Manual Manual 580

Qiagen RNeasy Squamous resection, Kit purification Expression controls Cell snap Expression Analysis GCRMA

GSE10245 570 18 [64] were qPCR Carcinoma frozen Analysis Technical (Bioconductor) verified (SCC) tumor Technical Manual Manual

Table 6 (cont.) Pre- Study Sample RNA Extraction Hybridization Phenotype GPL Samples Citation processing Notes Label Type Method Protocol Method Squamous resection, Expression flash frozen Cell snap Analysis

GSE6253 96 18 [77] standard protocol, RMA Carcinoma frozen Technical 1-5ug RNA used (SCC) tumor Manual resection, Expression Expression snap Analysis

GSE3141 570 53 [70] Analysis MAS5 frozen Technical Technical Manual tumor Manual resection, Expression Expression snap Analysis

GSE2109 570 40 [71] Analysis MAS5 frozen Technical Technical Manual tumor Manual 20ug cRNA resection, trizol extraction, 16hr 45C, 450 snap Expression Fluidics

GSE19188 570 27 [72] RMA frozen Analysis Station, tumor Technical Manual GeneChip Scanner 3000 Qiagen RNAeasy resection, Mini Kit Expression snap extraction Analysis

GSE18842 570 32 [73] RMA frozen Expression Technical tumor Analysis Manual Technical Manual guanidium isothiocyanate Expression resection, solution Analysis snap RMAexpress

GSE14814 96 52 [74] homogenization, Technical frozen v0.3 acid phenol- Manual tumor chloroform protocol extraction 239

resection, Expression Large Cell Expression snap Analysis

Carcinoma GSE2109 570 7 [71] Analysis MAS5 frozen Technical (LCLC) Technical Manual tumor Manual 20ug cRNA resection, trizol extraction, 16hr 45C, 450 snap Expression Fluidics

GSE19188 570 19 [72] RMA frozen Analysis Station, tumor Technical Manual GeneChip Scanner 3000 guanidium isothiocyanate resection, Expression solution snap Analysis RMAexpress

GSE14814 96 10 [74] homogenization, frozen Technical v0.3 acid phenol- tumor Manual chloroform extraction resection, trizol extraction, Expression snap Expression Analysis

GSE10445 570 13 [76] RMA frozen Analysis Technical tumor Technical Manual Manual 49

Qiagen MiniElute Expression resection, lung tissue protocol Analysis Normal snap from organ

GSE1643 570 40 [78] extraction, Technical GCRMA (NORM) frozen transplant ENZO Affymetrix Manual tissue donors, cRNA synthesis protocol trizol extraction, Expression airway Expression Analysis

GSE994 96 75 [79] epithelial MAS5.0 Analysis Technical brushing Technical Manual Manual trizol RNA Expression airway extraction, Analysis MAS5.0,

GSE3320 96 11 [80] epithelial Expression Technical RMA brushing Analysis Manual Technical Manual 2.75ug cRNA, airway NuGen Ovation 16 hr, 45C, RMA

GSE4302 570 44 [58] epithelial RNA Fluidics (Bioconductor) brushing amplification Station 450

Table 6 (cont.) Pre- Study Sample RNA Extraction Hybridization Phenotype GPL Samples Citation processing Notes Label Type Method Protocol Method cells were 15ug cRNA, cultured for airway Normal Qiagen RNeasy 16 hr, 45C, GCRMA 24 hours

GSE18965 96 7 [59] epithelial (NORM) Kit extraction Fluidics (Bioconductor) before brushing Station 450 RNA extraction trizol extraction, Qiagen RNeasy 15ug cRNA, airway Kit purification, 16 hr, 45C,

GSE5058 570 24 [60, 61] epithelial RMA Expression Fluidics brushing Analysis Station 450 Technical Manual trizol extraction, Qiagen RNeasy 15ug cRNA, airway Kit purification, 16 hr, 45C,

GSE8545 570 16 [62] epithelial RMA Expression Fluidics brushing Analysis Station 450 Technical Manual resection, 20ug cRNA Expression snap 45C overnight,

GSE8581 570 19 [7] Analysis MAS5.0 frozen 450 Fluidics Technical Manual tissue Station trizol extraction, Superscript II reverse trans- criptase, oligo dT 8-10ug cRNA, resection, primer, T7 RNA overnight, snap

GSE1650 96 12 [63] pol promoter, Agilent RMA frozen ENZO Bioarray confocal laser tumor transcript scanning labeling, Qiagen RNeasy Kit purification 50ng cRNA, resection, Expression 16hr 45C, 450 snap Analysis Fluidics

GSE10799 570 3 [66] GCRMA frozen Technical Manual Station stain, tissue protocol GeneChip Scanner 7G resection, Expression Expression snap Analysis

GSE7670 96 27 [67] Analysis MAS5.0 frozen Technical Technical Manual tissue Manual trizol extraction, resection, Qiagen RNeasy Expression snap Kit purification, Analysis RMA

GSE10072 96 49 [68] frozen Expression Technical (Bioconductor) tissue Analysis Manual Technical Manual 20ug cRNA resection, trizol extraction, 16hr 45C, 450 snap Expression Fluidics

GSE19188 570 65 [72] RMA frozen Analysis Station, tumor Technical Manual GeneChip Scanner 3000 resection, Expression Qiagen RNAeasy snap Analysis

GSE18842 570 45 [73] Mini Kit RMA frozen Technical extraction tumor Manual Superscript ds- Expression cDNA synthesis Analysis resection, kit, ENZO Technical E-TABM- Blum snap Bioarray hi yield 96 18 Manual, MAS5.0

15 (no paper) frozen RNA transcript Fluidics tumor labeling kit, Station Qiagen RNeasy Protocol Kit purification Expression GeneChip surgical Analysis Expression resection, E-MEXP- Technical Analysis 96 9 [75] snap MAS5.0

231 Manual, GNF Technical frozen Fluidics station Manual tumor cDNA synthesis protocol 469

REFERENCES

1. WHO. Fact Sheet: The top 10 causes of death. 2008 [cited; Available from: http://www.who.int/mediacentre/factsheets/fs310/en/index.html. 2. Asthma Statistics. 2010 [cited; Available from: http://www.achooallergy.com/asthma- statistics.asp. 3. COPD Statistical Information. 2004 [cited; Available from: http://www.copd- international.com/library/statistics.htm. 4. Lung Cancer. 2010 [cited; Available from: http://www.emedicinehealth.com/lung_cancer/article_em.htm. 5. Leidinger, P., et al., Identification of lung cancer with high sensitivity and specificity by blood testing. Respir Res, 2010. 11: p. 18. 6. Adkins, J.N., et al., Toward a human blood serum proteome: analysis by multidimensional separation coupled with mass spectrometry. Mol Cell Proteomics, 2002. 1(12): p. 947-55. 7. Bhattacharya, S., et al., Molecular biomarkers for quantitative and discrete COPD phenotypes. Am J Respir Cell Mol Biol, 2009. 40(3): p. 359-67. 8. Maciel, C.M., et al., Differential proteomic serum pattern of low molecular weight proteins expressed by adenocarcinoma lung cancer patients. J Exp Ther Oncol, 2005. 5(1): p. 31-8. 9. Molina, R., et al., Tumor markers (CEA, CA 125, CYFRA 21-1, SCC and NSE) in patients with non-small cell lung cancer as an aid in histological diagnosis and prognosis. Comparison with the main clinical and pathological prognostic factors. Tumour Biol, 2003. 24(4): p. 209-18. 10. Geman, D., et al., Classifying gene expression profiles from pairwise mRNA comparisons. Stat Appl Genet Mol Biol, 2004. 3: p. Article19. 11. Tan, A.C., et al., Simple decision rules for classifying human cancers from gene expression profiles. Bioinformatics, 2005. 21(20): p. 3896-904. 12. Liu, W.M., et al., Analysis of high density expression microarrays with signed-rank call algorithms. Bioinformatics, 2002. 18(12): p. 1593-9. 13. Omenn, G.S., et al., Overview of the HUPO Plasma Proteome Project: results from the pilot phase with 35 collaborating laboratories and multiple analytical groups, generating a core dataset of 3020 proteins and a publicly-available database. Proteomics, 2005. 5(13): p. 3226-45. 14. Ben-Hur, A., et al., Support vector machines and kernels for computational biology. PLoS Comput Biol, 2008. 4(10): p. e1000173. 15. Chang, C.C. and C.J. Lin (2001) LIBSVM: a library for support vector machines. Volume, 16. Chen, Y.W. and C.J. Lin, Combining SVMs with various feature selection strategies, in Feature Extraction Foundations and Applications, I. Guyon, et al., Editors. 2006, Springer. 17. Mundra, P. and J. Rajapakse, F-score with Pareto Front Analysis for Multiclass Gene Selection. Evolutionary Computation, Machine Learning and Data Mining in Bioinformatics, 2009: p. 56- 67. 18. Berglund, L., et al., A genecentric Human Protein Atlas for expression profiles based on antibodies. Mol Cell Proteomics, 2008. 7(10): p. 2019-27. 19. Safran, M., et al., GeneCards Version 3: the human gene integrator. Database (Oxford), 2010. 2010: p. baq020. 20. Nuber, U.A., et al., Patterns of desmocollin synthesis in human epithelia: immunolocalization of desmocollins 1 and 3 in special epithelia and in cultured cells. Eur J Cell Biol, 1996. 71(1): p. 1- 13. 21. Hanada, S., et al., Expression profile of early lung adenocarcinoma: identification of MRP3 as a molecular marker for early progression. J Pathol, 2008. 216(1): p. 75-82.

22. Hidaka, N., et al., Expression of E-cadherin, alpha-catenin, beta-catenin, and gamma-catenin in bronchioloalveolar carcinoma and conventional pulmonary adenocarcinoma: an immunohistochemical study. Mod Pathol, 1998. 11(11): p. 1039-45. 23. Johansson, L., Histopathologic classification of lung cancer: Relevance of cytokeratin and TTF-1 immunophenotyping. Ann Diagn Pathol, 2004. 8(5): p. 259-67. 24. Tong, Z., et al., Neutrophil gelatinase-associated lipocalin as a survival factor. Biochem J, 2005. 391(Pt 2): p. 441-8. 25. Lakshmikuttyamma, A., et al., Overexpression of m-calpain in human colorectal adenocarcinomas. Cancer Epidemiol Biomarkers Prev, 2004. 13(10): p. 1604-9. 26. Ohara, M., Restriction fragment length polymorphism of the c-fms gene in the human oral squamous cell carcinomas. Hiroshima J Med Sci, 1994. 43(1): p. 23-9. 27. Nishimori, T., et al., Proteomic analysis of primary esophageal squamous cell carcinoma reveals downregulation of a cell adhesion protein, periplakin. Proteomics, 2006. 6(3): p. 1011-8. 28. Sasaki, H., et al., Elevated serum epidermal growth factor receptor level is correlated with lymph node metastasis in lung cancer. Int J Clin Oncol, 2003. 8(2): p. 79-82. 29. AACC. Tumor Markers. Lab Tests Online 2010 [cited; Available from: http://labtestsonline.org/understanding/analytes/tumor_markers/glance-3.html. 30. Jourenkova-Mironova, N., et al., Role of glutathione S-transferase GSTM1, GSTM3, GSTP1 and GSTT1 genotypes in modulating susceptibility to smoking-related lung cancer. Pharmacogenetics, 1998. 8(6): p. 495-502. 31. Kim, J.E., et al., Identification of potential lung cancer biomarkers using an in vitro carcinogenesis model. Exp Mol Med, 2008. 40(6): p. 709-20. 32. Masuda, Y., et al., Involvement of tumor necrosis factor receptor-associated protein 1 (TRAP1) in apoptosis induced by beta-hydroxyisovalerylshikonin. J Biol Chem, 2004. 279(41): p. 42503-15. 33. Koller, E., et al., Use of a chemically modified antisense oligonucleotide library to identify and validate Eg5 (kinesin-like 1) as a target for antineoplastic drug development. Cancer Res, 2006. 66(4): p. 2059-66. 34. Li, Q. and M. Costa, c-Myc mediates a hypoxia-induced decrease in acetylated histone H4. Biochimie, 2009. 91(10): p. 1307-10. 35. Zhang, L., et al., The impact of C-MYC gene expression on gastric cancer cell. Mol Cell Biochem. 344(1-2): p. 125-35. 36. Liu, L., et al., The beneficial effect of Rheum tanguticum polysaccharide on protecting against diarrhea, colonic inflammation and ulceration in rats with TNBS-induced colitis: the role of macrophage mannose receptor in inflammation and immune response. Int Immunopharmacol, 2008. 8(11): p. 1481-92. 37. Sathish, V., et al., Effect of proinflammatory cytokines on regulation of sarcoplasmic reticulum Ca2+ reuptake in human airway smooth muscle. Am J Physiol Lung Cell Mol Physiol, 2009. 297(1): p. L26-34. 38. Wang, C., et al., MDM2 interaction with nuclear corepressor KAP1 contributes to p53 inactivation. EMBO J, 2005. 24(18): p. 3279-90. 39. Takemoto, H., et al., Localization of IQGAP1 is inversely correlated with intercellular adhesion mediated by e-cadherin in gastric cancers. Int J Cancer, 2001. 91(6): p. 783-8. 40. Lin, R.K., et al., The tobacco-specific carcinogen NNK induces DNA methyltransferase 1 accumulation and tumor suppressor gene hypermethylation in mice and lung cancer patients. J Clin Invest, 2010. 120(2): p. 521-32. 41. Johnson, M., M. Sharma, and B.R. Henderson, IQGAP1 regulation and roles in cancer. Cell Signal, 2009. 21(10): p. 1471-8. 42. Su, Y.J., et al., Role of Rad51 down-regulation and extracellular signal-regulated kinases 1 and 2 inactivation in emodin and mitomycin C-induced synergistic cytotoxicity in human non-small-cell lung cancer cells. Mol Pharmacol, 2010. 77(4): p. 633-43.

43. Zhang, Z.F., J.Q. Ma, and L. Zhang, [Expression and clinical significance of MAPK in non-small cell lung cancer]. Zhonghua Yi Xue Za Zhi, 2005. 85(5): p. 339-42. 44. Kosako, H., et al., Phosphoproteomics reveals new ERK MAP kinase targets and links ERK to nucleoporin-mediated nuclear transport. Nat Struct Mol Biol, 2009. 16(10): p. 1026-35. 45. Macaluso, M., et al., pRb2/p130-E2F4/5-HDAC1-SUV39H1-p300 and pRb2/p130-E2F4/5- HDAC1-SUV39H1-DNMT1 multimolecular complexes mediate the transcription of estrogen receptor-alpha in breast cancer. Oncogene, 2003. 22(23): p. 3511-7. 46. Gomez-Fernandez, C., et al., Immunohistochemical expression of estrogen receptor in adenocarcinomas of the lung: the antibody factor. Appl Immunohistochem Mol Morphol, 2010. 18(2): p. 137-41. 47. Papoutsi, Z., et al., Binding of estrogen receptor alpha/beta heterodimers to chromatin in MCF-7 cells. J Mol Endocrinol, 2009. 43(2): p. 65-72. 48. Sato, M., et al., The PAI-1 gene as a direct target of endothelial PAS domain protein-1 in adenocarcinoma A549 cells. Am J Respir Cell Mol Biol, 2004. 31(2): p. 209-15. 49. Xia, G., et al., Regulation of vascular endothelial growth factor transcription by endothelial PAS domain protein 1 (EPAS1) and possible involvement of EPAS1 in the angiogenesis of renal cell carcinoma. Cancer, 2001. 91(8): p. 1429-36. 50. Petrella, B.L., J. Lohi, and C.E. Brinckerhoff, Identification of membrane type-1 matrix metalloproteinase as a target of hypoxia-inducible factor-2 alpha in von Hippel-Lindau renal cell carcinoma. Oncogene, 2005. 24(6): p. 1043-52. 51. Ohuchi, E., et al., Membrane type 1 matrix metalloproteinase digests interstitial collagens and other extracellular matrix macromolecules. J Biol Chem, 1997. 272(4): p. 2446-51. 52. Atkinson, J.M., et al., Membrane type matrix metalloproteinases (MMPs) show differential expression in non-small cell lung cancer (NSCLC) compared to normal lung: correlation of MMP-14 mRNA expression and proteolytic activity. Eur J Cancer, 2007. 43(11): p. 1764-71. 53. Fisher, K.E., et al., Tumor cell invasion of collagen matrices requires coordinate lipid agonist- induced G-protein and membrane-type matrix metalloproteinase-1-dependent signaling. Mol Cancer, 2006. 5: p. 69. 54. Lim, W.K., et al., Comparative analysis of microarray normalization procedures: effects on reverse engineering gene networks. Bioinformatics, 2007. 23(13): p. i282-8. 55. Hubbell, E., W.M. Liu, and R. Mei, Robust estimators for expression analysis. Bioinformatics, 2002. 18(12): p. 1585-92. 56. Irizarry, R.A., et al., Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics, 2003. 4(2): p. 249-64. 57. Wu, Z., et al., A model-based background adjustment for oligonucleotide expression arrays. Journal of the American Statistical Association, 2004. 99(468): p. 909-917. 58. Woodruff, P.G., et al., Genome-wide profiling identifies epithelial cell genes associated with asthma and with treatment response to corticosteroids. Proc Natl Acad Sci U S A, 2007. 104(40): p. 15858-63. 59. Kicic, A., et al., Decreased fibronectin production significantly contributes to dysregulated repair of asthmatic epithelium. Am J Respir Crit Care Med. 181(9): p. 889-98. 60. Carolan, B.J., et al., Up-regulation of expression of the ubiquitin carboxyl-terminal hydrolase L1 gene in human airway epithelium of cigarette smokers. Cancer Res, 2006. 66(22): p. 10729-40. 61. Tilley, A.E., et al., Down-regulation of the notch pathway in human airway epithelium in association with smoking and chronic obstructive pulmonary disease. Am J Respir Crit Care Med, 2009. 179(6): p. 457-66. 62. Ammous, Z., et al., Variability in small airway epithelial gene expression among normal smokers. Chest, 2008. 133(6): p. 1344-53. 63. Spira, A., et al., Gene expression profiling of human lung tissue from smokers with severe emphysema. Am J Respir Cell Mol Biol, 2004. 31(6): p. 601-10.

64. Kuner, R., et al., Global gene expression analysis reveals specific patterns of cell junctions in non-small cell lung cancer subtypes. Lung Cancer, 2009. 63(1): p. 32-8. 65. Neumann, J., et al., Gene expression profiles of lung adenocarcinoma linked to histopathological grading and survival but not to EGF-R status: a microarray study. BMC Cancer. 10: p. 77. 66. Wrage, M., et al., Genomic profiles associated with early micrometastasis in lung cancer: relevance of 4q deletion. Clin Cancer Res, 2009. 15(5): p. 1566-74. 67. Su, L.J., et al., Selection of DDX5 as a novel internal control for Q-RT-PCR from microarray data using a block bootstrap re-sampling scheme. BMC Genomics, 2007. 8: p. 140. 68. Landi, M.T., et al., Gene expression signature of cigarette smoking and its role in lung adenocarcinoma development and survival. PLoS One, 2008. 3(2): p. e1651. 69. Ding, L., et al., Somatic mutations affect key pathways in lung adenocarcinoma. Nature, 2008. 455(7216): p. 1069-75. 70. Bild, A.H., et al., Oncogenic pathway signatures in human cancers as a guide to targeted therapies. Nature, 2006. 439(7074): p. 353-7. 71. Consortium, I.G., Expo Public GEO Data. 2008. 72. Hou, J., et al., Gene expression-based classification of non-small cell lung carcinomas and survival prediction. PLoS One, 2010. 5(4): p. e10312. 73. Sanchez-Palencia, A., et al., Gene expression profiling reveals novel biomarkers in nonsmall cell lung cancer. Int J Cancer. 74. Zhu, C.Q., et al., Prognostic and predictive gene signature for adjuvant chemotherapy in resected non-small-cell lung cancer. J Clin Oncol, 2010. 28(29): p. 4417-24. 75. Yap, Y.L., et al., Conserved transcription factor binding sites of cancer markers derived from primary lung adenocarcinoma microarrays. Nucleic Acids Res, 2005. 33(1): p. 409-21. 76. Broet, P., et al., Prediction of clinical outcome in multiple lung cancer cohorts by integrative genomics: implications for chemotherapy selection. Cancer Res, 2009. 69(3): p. 1055-62. 77. Lu, Y., et al., A gene expression signature predicts survival of patients with stage I non-small cell lung cancer. PLoS Med, 2006. 3(12): p. e467. 78. Gruber, M.P., et al., Human lung project: evaluating variance of gene expression in the human lung. Am J Respir Cell Mol Biol, 2006. 35(1): p. 65-71. 79. Spira, A., et al., Effects of cigarette smoke on the human airway epithelial cell transcriptome. Proc Natl Acad Sci U S A, 2004. 101(27): p. 10143-8. 80. Harvey, B.G., et al., Modification of gene expression of the small airway epithelium in response to cigarette smoking. J Mol Med, 2007. 85(1): p. 39-53.