A Thesis

entitled

Data Pooling to Identify Differentially Expressed Genes in Lung Cancer of Nonsmokers

by

Nicole Carr

Submitted to the Graduate Faculty as partial fulfillment of the requirements for the

Master of Science Degree in Biomedical Sciences

______Dr. Sadik A. Khuder, Committee Chair

______Dr. Barbara Saltzman, Committee Member

______Dr. Alexei Fedorov, Committee Member

______Dr. Patricia R. Komuniecki, Dean College of Graduate Studies

The University of Toledo

May, 2016

Copyright 2016, Nicole M. Carr

This document is copyrighted material. Under copyright law, no parts of this document may be reproduced without the expressed permission of the author.

An Abstract of

Data Pooling to Identify Differentially Expressed Genes in Lung Cancer of Nonsmokers

by

Nicole Carr

Submitted to the Graduate Faculty as partial fulfillment of the requirements for the Master of Science Degree in Biomedical Sciences

The University of Toledo May 2016

Lung cancer is the leading cause of cancer deaths in the United States. While cigarette smoking is the principal causal agent of lung cancer, 10–15% of patients have no history of smoking, yet still manifest expression similar to smokers. Although microarrays have been applied widely to lung cancer, few studies have explored the specific expression differences between smokers and nonsmokers. The purpose of this study was to identify differentially expressed genes in lung cancer of nonsmokers by combining data from different microarray platforms using quantile transformation.

Two statistical approaches were introduced in this study: ordinal logistic regression and a new statistical test based on the DISCO Normal distribution. Results from these models were corroborated against the findings of the modified t-test. The top reported genes were shown to play multiple differing roles in the cellular environment, including muscle maintenance, DNA damage repair mechanisms, cellular metabolism, proliferation and growth, and tumor progression. The results of this study give further insight into the genetic differences observed in lung cancer patients and demonstrate the opportunity to combine data from different microarray platforms.

Acknowledgements

I would like to extend a most sincere thank you to my mentor Dr. Sadik Khuder for all he has taught me. Without his kindness, patience, knowledge, and experience, this project would not have been possible. He’s an ordinary person, with an extraordinary heart.

Thank you to Dr. Barbara Saltzman, for not only her participation as a committee member but for her unparalleled kindness and sincere guidance through my academic career. Her words of wisdom will never be forgotten.

Additional thanks to Dr. Alexei Fedorov, again not only for being a committee member, but for always driving me towards success and inspiring me to learn. I feel sincerely privileged to have had him as a professor.

Many thanks as well to Dr. Sjaak Philipsen, of the Erasmus University Medical Center Rotterdam, for supplying phenotype information beyond what was publicly available for his study.

My deepest gratitude to Dr. Bob Blumenthal and Jo Anne Gray. Without the two of them, their understanding, their perseverance, and their dedication, I may never have made it through the program. I am forever indebted to their kindness.

Finally, sincere thanks to my family and Nathanial Carter. Their endless love and support have always inspired me to strive for the highest.

Table of Contents

An Abstract of

Acknowledgements

Table of Contents

List of Tables

List of Figures

1 Introduction

2 Background

2.1 Histologies of Lung Cancer

2.2 Causes of Lung Cancer

2.3 Familial Lung Cancer

2.4 Microarray Analysis

2.5 DNA Hybridization

2.6 Types of Microarrays

2.7 Cross Validation Studies

2.8 Statistical Methods

2.8.1 DISCO Normal Distribution

2.8.2 Ordinal Logistic Regression

3 Methods

3.1 Data Collection

3.1.1 Tumor Tissue Data

3.1.2 Tumor and Adjacent Normal Tissue Data

3.2 Data Analysis

3.3 Data Comparison

3.4 Quantile Scoring Method

4 Results

4.1 Individual Platform Results

4.2 Quantile Combination: Non-Smokers

4.3 Quantile Combination: Smokers

4.4 Quantile Combination: Tumor Samples

4.5 Common Non-Smoker / Smoker Results

5 Discussion

6 Conclusion

References

A Supplementary Tables and Figures

B R-Codes

List of Tables

Table 4.1: Top 25 common DEGs from non-smoker samples

Table 4.2: Top 25 common DEGs from smoker samples

Table 4.3: Top 25 common DEGs from tumor samples

Table 4.4: Common DEGs from smoker and non-smoker results

Table A.1: List of microarray analyses used in this study

Table A.2: Common 407 DEGs from non-smoker data pool

Table A.3: 493 common DEGs from smoker data pool

Table A.4: Top 100 common DEGs from tumor samples

List of Figures

Figure 3-1: Example Venn diagram

Figure 3-2: Schematic representation of Quantile Transformation technique

Figure 4-1: Relationship of statistical results for non-smokers

Figure 4-2: Relationship of statistical results for smokers

Figure 4-3: Relationship of statistical results for tumor samples

Figure A-1: HGU133A platform merge results

Figure A-2: HGU133A platform merge results

Figure A-3: GSE31548 results

Figure A-4: GSE32863 results

Figure A-5: GSE62949 results

Figure A-6: GSE63459 results

Chapter 1

Introduction

Lung cancer is the leading cause of cancer deaths in both men and women in the United States. In 2011, there were 1,095,000 new lung cancer cases reported across the world. That same year, 951,000 deaths due to lung cancer were reported in men and 427,000 in women. To put these figures in context, lung cancer accounted for 12.7% of cancer cases and 18.2% of cancer deaths worldwide [2]. Not only is the mortality rate for this disease high, but its effects are also rapidly degenerative: only 15% of patients diagnosed with lung cancer are still alive five years after their initial diagnosis. One of the main reasons for such a low survival rate is the lack of early detection. Symptoms of lung cancer, such as persistent coughs, chest pain, and shortness of breath, typically do not manifest until the disease is advanced. According to the American Cancer Society, early detection is presently limited to imaging techniques such as chest x-rays and low-dose spiral computed tomography (LDCT). Consequently, there is a clear need for new biomarkers to aid in early detection of the disease. An estimated 224,390 new cases of lung cancer are expected in the United States in 2016 [3].

Although microarrays have been applied widely to lung cancer, few studies have explored the specific differences between lung cancer in smokers and nonsmokers due to the limited number of nonsmokers with lung cancer. Those studies that do include data on smokers versus non-smokers typically have rather small sample sizes. Through preliminary searching, this study has identified seventeen studies which include a total of 822 smokers and 491 non-smokers for comparison. Currently, it is difficult to combine data from different microarray platforms, and most studies utilize a meta-analysis approach. In this study, we present a new statistical approach for combining information across different microarray platforms. We use the quantile scoring method to transform data from different microarray platforms into standardized data on the same scale, which allows data from different platforms to be combined.

This study aims to determine differentially expressed genes in non-smokers versus smokers with lung cancer, and to employ the quantile scoring method as a mechanism for combining data from multiple microarray platforms to achieve greater statistical power.

The specific objectives of this study are as follows. The first objective is to identify the top differentially expressed genes in non-smokers with lung cancer by pooling raw gene expression values from multiple studies using the quantile transformation technique. The second objective is to introduce the DISCO normal distribution and the logistic regression method as novel methods for combining data from different microarray platforms. The results of these models will be corroborated against the findings of the modified t-test. Finally, this study aims to simplify the steps needed for combining data, to determine proper scoring, and to identify the pathways which involve the differentially expressed genes. The results of this study give further insight into the genetic differences observed in lung cancer patients and demonstrate the opportunity to combine data from different microarray platforms.

Chapter 2

Background

The American Cancer Society released the 2016 Cancer Facts and Figures report, which estimates 158,080 deaths due to lung cancer alone. In comparison, the report estimates 49,190 deaths due to colorectal cancer, 41,780 deaths due to pancreatic cancer, 40,890 deaths due to breast cancer, 27,170 deaths due to liver cancer, and 26,120 deaths due to prostate cancer [3]. The astonishing number of expected lung cancer deaths stresses the importance of a search for biomarkers. Yet multiple complications arise when examining the issue in practice. For example, different histological subtypes of lung cancer have different genetic footprints, and lung cancer has multiple causes. Furthermore, there are multiple options for examining differential gene expression. This chapter addresses each of these areas.

2.1 Histologies of Lung Cancer

There are two main histologies of lung cancer: non-small cell lung cancer (NSCLC) and small-cell carcinomas (SCLC). NSCLC type tumors develop from the bronchial epithelial cell precursors and can be divided into three separate histologies: squamous cell carcinomas, adenocarcinomas, and large-cell carcinomas. Squamous cell carcinomas and adenocarcinomas represent 80% of all lung cancers worldwide [4]. The SCLC type of lung tumors develop from neuroendocrine cell precursors and only account for a small proportion of cancer cases [4]. The study participants in this research all suffered from either adenocarcinomas or other non-small cell lung cancers (NSCLC). Limiting the data to one main histological subtype helps to reduce genetic differences between histologies.

2.2 Causes of Lung Cancer

Cigarette smoking is the most common cause of lung cancer, accounting for 80% of cases [3]. However, that leaves 20% of cases which must arise from other causes. Other environmental exposures include the release of radon gas from soil and other materials, which has been listed as the second leading cause of lung cancer in the US. Certain occupations can lead to an increased risk of lung cancer, including those that involve work with hazardous materials or chemicals, such as paving, painting, and chimney sweeping. Other causes include secondhand smoke, air pollution, and diesel exhaust. In some cases, preexisting medical conditions, such as a severe history of tuberculosis, can also lead to increased risk of lung cancer. Yet these causes do not account for all cancer cases.

2.3 Familial Lung Cancer

It is widely recognized that there is a genetic component related to lung cancer. In terms of familial history, risk of lung cancer among first degree relatives was increased 1.9 times, which is comparable to that of other common cancers. For example, familial risks for other cancers are 1.5 times (breast cancer), 1.9 times (colon cancer), and 2.7 times (prostate cancer) [2]. These statistics support the search for a genetic cause; however, like the other cancers mentioned, genetic components of cancer are multifactorial. The question is therefore not whether there is a genetic component to lung cancer, but what it is and where it lies. One way to examine this question is through microarray analysis for differentially expressed genes in non-smokers between their adenocarcinoma samples and adjacent normal tissues.

2.4 Microarray Analysis

Microarray technologies are very beneficial for genome-wide gene expression studies. Unfortunately, not all data from microarray platforms can be directly compared. This issue arises from differences in sample management, data collection, and probe design that are inconsistent between different microarray platforms. For example, the experiment could use one dye or two dyes (Cy3 and/or Cy5), and it could have a photolithographic design or use particles or beads. Furthermore, the probes tested could be short-oligonucleotide, long-oligonucleotide, cDNA, etc. On a molecular level, the chips can also use different forms of hybridization, such as competitive or non-competitive. The method of labeling and data processing of the probes is often different as well, which makes it nearly impossible to directly compare results from two different microarray platforms [5].

2.5 DNA Hybridization

One of the fundamental aspects of microarray technology is the use of nucleic acid hybridization. On a basic level, hybridization is achieved by heating DNA to a high temperature (approximately 100 °C), thereby denaturing the double-stranded helix. Then, when slightly cooled, a complementary RNA strand can be introduced to the experiment and annealed to the single-stranded DNA. Between microarray platforms, this process can be either competitive or non-competitive. The competitive manner is constructed through the use of two labeled complementary probes that contain the same nucleotide sequence as the target strand. On the other hand, non-competitive hybridization utilizes two probes, a donor and an acceptor, which both hybridize to the target strand.

2.6 Types of Microarrays

One type of microarray platform analyzed in this study was the Affymetrix© gene chip. These platforms are an example of photolithographic synthesis. On a molecular level, the Affymetrix© chips utilize DNA hybridization between complementary strands. One complementary strand is found as a probe attached to the chip, and the other is the target, found in the sample. After hybridization, the expression chips are washed and stained with a fluorescent dye. Using a laser to detect the fluorescence, the results are measured in relative fluorescent units and translated into transcript abundance. The report is then the average expression level of the probes [5-6]. Different chips have differing numbers of probes, up to 50,000. Many Affymetrix© chips include multiple perfect match probes, as well as one-base mismatch probes, in order to increase reliability and control for non-specific binding. These probes are typically 25 base pairs in length. Furthermore, the probes are arranged in series, called probe sets, that measure expression for specific mRNAs [7].

A second type of microarray platform analyzed in this study was the Illumina© microarray. Unlike Affymetrix© chips, Illumina© utilizes BeadArray technologies, where the DNA oligonucleotides (typically 50-mer probes) are attached to magnetic beads and allowed to hybridize to the sample DNA. Each bead carries many copies of one type of oligo. Like Affymetrix©, Illumina© also uses more than one probe per gene (approximately 30) in order to increase reliability; however, the probes can be different isoforms of the gene. After hybridization, a scanner utilizing laser and optics analyzes the beads and creates an image file that can be analyzed using Bead Studio [5, 8].

This study also utilized Agilent© data sets for analysis. These platforms use inkjet printing, an alternative to the photolithographic system, to assemble oligonucleotide probes, typically 60 base pairs in length, onto glass slides. This process is arguably more uniform and produces more consistent features. As such, a gene is usually represented by only one probe. Agilent ink-jet arrays can be either one- or two-color arrays. In the case of two-color arrays, once two samples are converted to cDNA, they are labeled with a fluorescent dye, using Cy3 (green) and/or Cy5 (red), and then allowed to hybridize to the array. After scanning with a laser, the ratio of red/green fluorescence indicates the fold change in expression level [8-9].

2.7 Cross Validation Studies

Previous studies have cross-validated results from different microarray platforms. For example, Barnes et al. compared two platform chips that are analyzed in this study: Affymetrix HGU133 Plus 2.0 and Illumina HumanRef-8 platforms. In the study, 30 volunteers consented to blood draws using Acid Citrate Dextrose as an anti-coagulant, and the isolated genetic material for each participant was analyzed using the two platforms, including technical replicates for each. Their results showed that the two platforms are highly comparable; however, there were some unexplainable discrepancies in the results. The authors also commented that data annotation can be detrimental when comparing cross-platform results if not done carefully. For example, they argue that relying on GenBank identifiers to compare results is a flawed practice, as a “particular accession number only indicates the source of the probe sequence and does not imply that the GenBank sequence is specific for a particular gene” (Barnes et al., 2005).

2.8 Statistical Methods

There are three statistical methods used in this analysis: modified t-test, DISCO normal distribution, and ordinal logistic regression. Each of these methods was programmed using R statistical software.

2.8.1 DISCO Normal Distribution

The DISCO Normal distribution includes continuous, discrete, and mixed discrete-continuous distributions [10]. One advantage of using the DISCO Normal distribution is that the distribution can take any shape, which allows for more flexibility in data analysis. For example, in this distribution, β (like the standard deviation) is not required to be positive; it is unrestricted and can range from -∞ to ∞. The true expression level of a gene on the jth individual in a particular group i, denoted as x_{ij}, can be modeled as:

$$p(x_{ij}) = \frac{e^{\alpha x_{ij} + \beta x_{ij}^{2}}}{\sum_{r=r_1}^{r_s} e^{\alpha r + \beta r^{2}}}$$

where r_1 and r_s are the lowest and highest score, respectively. The α signifies the average expression level and β represents the biological variability in gene expression. The values of α and β determine the shape of the distribution, which can be a U, J, inverted J, bell, or irregular shape. The algorithm used, based on Newton-Raphson, estimates these parameters iteratively through the first and second derivatives. This method was developed based on the quantile combination, and a complete R package is available from the original authors [1].
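As a rough illustration of the estimation step described above, the following Python sketch fits α and β by Newton-Raphson on a small set of quantile scores. It assumes the probability of a score x is proportional to exp(αx + βx²) over the score set r_1, …, r_s; that form, the function names, and the toy data are assumptions made for illustration and are not taken from the authors' R package.

```python
import math

def disco_pmf(scores, alpha, beta):
    """Probabilities proportional to exp(alpha*r + beta*r^2) over the score set."""
    logits = [alpha * r + beta * r * r for r in scores]
    top = max(logits)  # subtract the maximum before exponentiating, for stability
    weights = [math.exp(v - top) for v in logits]
    total = sum(weights)
    return [w / total for w in weights]

def moments(scores, probs):
    """First four raw moments of the score distribution."""
    return [sum(p * r ** k for p, r in zip(probs, scores)) for k in (1, 2, 3, 4)]

def fit_disco(data, scores, max_iter=100, tol=1e-12):
    """Newton-Raphson maximum-likelihood estimates of (alpha, beta)."""
    n = len(data)
    obs1 = sum(data) / n                 # observed mean of x
    obs2 = sum(x * x for x in data) / n  # observed mean of x^2
    alpha = beta = 0.0
    for _ in range(max_iter):
        e1, e2, e3, e4 = moments(scores, disco_pmf(scores, alpha, beta))
        # First derivatives of the mean log-likelihood.
        g1, g2 = obs1 - e1, obs2 - e2
        # Negative second derivatives: covariance matrix of (x, x^2).
        v11, v12, v22 = e2 - e1 * e1, e3 - e1 * e2, e4 - e2 * e2
        det = v11 * v22 - v12 * v12
        if det < 1e-14:
            break  # covariance matrix is numerically singular; stop updating
        d_alpha = (v22 * g1 - v12 * g2) / det
        d_beta = (v11 * g2 - v12 * g1) / det
        alpha += d_alpha
        beta += d_beta
        if abs(d_alpha) + abs(d_beta) < tol:
            break
    return alpha, beta

scores = [-2, -1, 0, 1, 2]                 # hypothetical common scoring scheme
data = [-2, -1, -1, 0, 0, 0, 1, 1, 2]      # toy quantile scores for one gene
alpha_hat, beta_hat = fit_disco(data, scores)
```

For this symmetric toy sample the estimate of α is essentially zero and β is negative, which corresponds to a bell-shaped distribution over the scores.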

2.8.2 Ordinal Logistic Regression

Ordinal logistic regression offers several advantages: the data do not have to be normally distributed, and the variables do not need to have equal variance. Another advantage is that logistic regression does not assume a linear relationship between variables, which also allows for more flexibility in analysis [1]. The model is:

$$\log\left(\frac{P(Y \le y_j \mid x)}{P(Y > y_j \mid x)}\right) = \alpha_j + \beta x$$

where y_j is the quantile value at the j = 1, …, k cutpoints and x is the grouping variable (0 = “smoker”, 1 = “non-smoker”).
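To make the role of the cutpoints and the grouping variable concrete, the sketch below computes category probabilities for each group under the common cumulative-logit (proportional odds) form, in which the log odds of Y ≤ y_j equal α_j + βx. The direction of the inequality, the intercepts, and the value of β are assumptions chosen for illustration; the thesis fitted its models in R.

```python
import math

def logistic(z):
    """Inverse of the logit function."""
    return 1.0 / (1.0 + math.exp(-z))

def category_probs(alphas, beta, x):
    """Probability of each ordered category given increasing intercepts alphas.

    With k cutpoints there are k + 1 ordered categories (quantile scores).
    """
    cumulative = [logistic(a + beta * x) for a in alphas] + [1.0]
    probs, previous = [], 0.0
    for c in cumulative:
        probs.append(c - previous)
        previous = c
    return probs

alphas = [-1.5, -0.5, 0.5, 1.5]  # hypothetical intercepts, one per cutpoint
beta = 0.8                       # hypothetical group effect
p_smoker = category_probs(alphas, beta, 0)
p_nonsmoker = category_probs(alphas, beta, 1)
```

Under this form, β is the shift in cumulative log odds between the two groups, and it is the same at every cutpoint.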

Chapter 3

Methods

3.1 Data Collection

Microarray data were identified by searching the National Center for Biotechnology Information Gene Expression Omnibus (NCBI GEO) and ArrayExpress databases. The search keywords included “lung cancer”, “nonsmokers”, “smokers”, and “lung adenocarcinomas” for studies of type “expression profiling by array”. The search excluded results for “miRNA”, “epigenetics”, “methylation”, and organisms other than Homo sapiens. The search returned 44 candidate studies; based on sample type, phenotype information, and known smoking status, 17 publicly available studies were chosen for analysis. Two authors were contacted to obtain more specific phenotype data beyond what was publicly available, and one responded. To begin, the studies were split into two types: those that contained samples for both tumor and adjacent normal tissue, and those that contained only tumor tissue.

3.1.1 Tumor Tissue Data

For this set of samples, only author table data were downloaded from the GEO website. Following the download, phenotype files were created for each study using Microsoft Excel, identifying sample smoker status, histology, gender (if available), and age (if available). As knowledge of smoking status was critical to the study, any samples with an unknown smoking status were omitted. Studies where a reduced sample size was used are denoted by a * in Table A.1, located in Appendix A. Following the creation of the phenotype files, each sample name (ex: GSE...) was replaced with a clear and unique identifier called MERGE_ID (ex: NS_tumor1), so that specific samples could be identified throughout the data merging process later.
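The filtering and renaming step can be sketched as follows. Only the NS_tumor1 pattern appears in the text; the "S" prefix for smokers, the status labels, and the placeholder sample names are assumptions made for illustration.

```python
def assign_merge_ids(samples):
    """Map original sample names to MERGE_ID values.

    samples: list of (sample_name, smoking_status, tissue) tuples.
    Samples whose smoking status is not recognized are omitted.
    """
    prefixes = {"non-smoker": "NS", "smoker": "S"}  # "S" is an assumed prefix
    counters = {}
    mapping = {}
    for name, status, tissue in samples:
        if status not in prefixes:
            continue  # unknown smoking status: drop the sample
        key = f"{prefixes[status]}_{tissue}"
        counters[key] = counters.get(key, 0) + 1
        mapping[name] = f"{key}{counters[key]}"
    return mapping

# Placeholder sample names, not real GEO accessions.
samples = [
    ("sample_A", "non-smoker", "tumor"),
    ("sample_B", "unknown", "tumor"),
    ("sample_C", "smoker", "tumor"),
    ("sample_D", "non-smoker", "normal"),
    ("sample_E", "non-smoker", "tumor"),
]
merge_ids = assign_merge_ids(samples)
```

Keeping the mapping as a lookup table makes it possible to trace any pooled value back to its original sample.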

3.1.2 Tumor and Adjacent Normal Tissue Data

To begin, both author table data and raw non-normalized data (if available) were downloaded from the GEO website. Raw data were available for download as a TAR (of CEL) for Affymetrix files, and were then compiled into a readable document using the R affy library. Raw data were available as zipped text files for Illumina, and TAR (of .txt) for Agilent platforms. Following the download, phenotype files were created for each study using Microsoft Excel, identifying sample smoker status, histology, gender (if available), and age (if available). As knowledge of smoking status was critical to the study, any samples with an unknown smoking status were omitted. Studies where a reduced sample size was used are denoted by a * in Table A.1, located in Appendix A. Following the creation of the phenotype files, each sample name (ex: GSE...) was replaced with a clear and unique identifier called MERGE_ID (ex: NS_tumor1), so that specific samples could be identified throughout the data merging process later. Because these studies contained both tumor and adjacent normal tissue data, the smoker and non-smoker phenotypes were separated into separate files (both .csv and .txt) for analysis.

Raw data were downloaded for the Affymetrix platforms as a comparison mechanism, in order to determine whether results from the author-normalized data and the data normalized in this study were comparable. Two popular normalization methods were utilized: RMA (Robust Multi-array Average) and MAS5. MAS5 normalizes each array independently and sequentially; RMA uses a multi-chip model. The R-codes used for the normalization process are available in Appendix B.

3.2 Data Analysis

After the data download, phenotype matching, filtering, and renaming were completed, all data were processed using R statistical software (v 3.1.0). The data were analyzed using three statistical approaches: modified t-test, DISCO Normal Distribution, and ordinal logistic regression. Each of these methods was programmed using R statistical software, and all the R-codes used in these procedures are available in Appendix B.

The modified t-test is currently the standard test used to determine differential gene expression. In this study, the modified t-test is used as the standard against which to compare the results from the other two statistical approaches introduced in this study: the DISCO normal and logistic regression tests.

3.3 Data Comparison

The top 1000 differentially expressed genes resulting from each of the three statistical methods were compared in order to determine common results. For example:

Figure 3-1: Example Venn diagram depicting comparison of top 1000 differentially expressed genes from each statistical test (ordinal logistic regression, DISCO normal, and t-test; 1000 genes each).
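The comparison behind such a Venn diagram amounts to set operations on the three top-gene lists. The sketch below counts the seven regions for three lists; the function name and the toy gene labels are hypothetical.

```python
def venn_regions(a, b, c):
    """Count the seven regions of a three-set Venn diagram."""
    a, b, c = set(a), set(b), set(c)
    return {
        "only_a": len(a - b - c),
        "only_b": len(b - a - c),
        "only_c": len(c - a - b),
        "a_and_b_only": len((a & b) - c),
        "a_and_c_only": len((a & c) - b),
        "b_and_c_only": len((b & c) - a),
        "all_three": len(a & b & c),
    }

# Toy top-gene lists standing in for the top 1000 from each test.
t_test = ["G1", "G2", "G3", "G4"]
disco = ["G2", "G3", "G4", "G5"]
logit = ["G3", "G4", "G6", "G7"]

regions = venn_regions(t_test, disco, logit)
common = sorted(set(t_test) & set(disco) & set(logit))
```

The genes in `common` correspond to the central region of the diagram, which is the set carried forward as the common results.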

3.4 Quantile Scoring Method

The quantile scoring method is an alternative to a meta-analysis. The advantage of using the quantile scoring method rather than a meta-analysis is that the quantile combination allows us to work with the raw study data rather than the p-values. Furthermore, it allows us to work with subsets of the data rather than whole data sets.

To use the quantile scoring method, we select cut points along the distribution of gene expression values as defined by each specific platform. Gene expression values for a particular gene on a selected platform are pooled together from the groups to be compared for differential expression. These values are ordered by increasing or decreasing size, and the set is then divided into equal subsets. The number of subdivisions is dictated by the sample size in each group. A score is assigned to each interval in increasing or decreasing order.

The general scheme of the quantile scoring method is as follows:

• Gene expression values for a particular gene on a selected platform are pooled together from the groups to be compared for differential expression.

• Order the values by increasing or decreasing size, then divide the set into precisely equal subsets.

• The number of subdivisions is dictated by the sample size in each group and the amount of information to be retained in the transformed data.

• Assign scores to the consecutive sub-intervals in increasing or decreasing order.

• The idea here is to devise a common scoring scheme for different platforms.

• One possible scoring may be (-2, -1, 0, 1, 2).

An example is shown in Figure 3-2.

Figure 3-2: Schematic representation of Quantile Transformation technique. Based on ranked expression values, data is converted to an assigned score and combined [1].
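The scheme can be sketched in a few lines of Python. This is a minimal illustration, assuming five equal subsets scored (-2, -1, 0, 1, 2) and ties broken by input order; the function name and the expression values are hypothetical, and the thesis implemented the method in R.

```python
def quantile_scores(values, scores=(-2, -1, 0, 1, 2)):
    """Convert pooled expression values for one gene into quantile scores.

    Values are ranked in increasing order and split into len(scores)
    subsets that are as equal in size as possible; each value receives
    the score of its subset. Ties are broken by input order (an assumption).
    """
    n = len(values)
    k = len(scores)
    order = sorted(range(n), key=lambda i: values[i])
    result = [None] * n
    for rank, index in enumerate(order):
        result[index] = scores[rank * k // n]
    return result

# Toy pooled values for one gene on two platforms with different scales.
platform_a = [5.1, 7.3, 6.0, 9.9, 8.2, 4.4, 6.6, 7.9, 5.5, 9.0]
platform_b = [v * 100 for v in platform_a]

scores_a = quantile_scores(platform_a)
scores_b = quantile_scores(platform_b)
```

Because the scores depend only on rank, the two platforms yield identical scores despite their different measurement scales, which is what allows the transformed data to be pooled.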

The quantile scoring method was used to combine expression values in three different capacities. First, all studies containing both tumor and adjacent normal tissue and non-smoker patients were combined and analyzed using the three aforementioned statistical tests. Similarly, the studies containing both tumor and adjacent normal tissue and smoker patients were combined and analyzed the same way. Last, the same methods were used for the studies containing only tumor samples.

The quantile expression value data from each study were annotated with the platform annotation provided by NCBI GEO. This annotation contained a unique gene identifier for each probe. The data merge was accomplished by merging on the Entrez Gene ID, as identified by NCBI.
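A merge keyed on Entrez Gene ID might look like the following sketch. It assumes each study is held as a mapping from Entrez Gene ID to per-sample quantile scores and that only genes present in every study are kept; both the data layout and the inner-join choice are assumptions, and the scores shown are invented for illustration.

```python
def merge_by_entrez(studies):
    """Merge per-study quantile scores on Entrez Gene ID.

    studies: list of dicts mapping Entrez Gene ID -> {MERGE_ID: score}.
    Only genes found in every study are retained (assumed inner join).
    """
    shared = set(studies[0])
    for study in studies[1:]:
        shared &= set(study)
    merged = {}
    for gene in sorted(shared):
        row = {}
        for study in studies:
            row.update(study[gene])  # MERGE_IDs are unique across studies
        merged[gene] = row
    return merged

# Entrez IDs 5832 (ALDH18A1), 226 (ALDOA), and 1301 (COL11A1); toy scores.
study_1 = {
    5832: {"NS_tumor1": 2, "NS_normal1": -2},
    226: {"NS_tumor1": 1, "NS_normal1": -1},
}
study_2 = {
    5832: {"NS_tumor2": 1, "NS_normal2": -1},
    1301: {"NS_tumor2": 2, "NS_normal2": -2},
}
pooled = merge_by_entrez([study_1, study_2])
```

Here only gene 5832 survives the merge, since it is the only identifier annotated in both toy studies.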

Chapter 4

Results

There are four main results of this study: the number of common genes found by the three statistical analyses on individual platforms, the top common differentially expressed genes (DEGs) from the quantile combination of non-smoker samples with tumor and adjacent normal tissues, the top common DEGs from the quantile combination of smoker samples with tumor and adjacent normal tissues, and the top common DEGs from tumor samples when compared between smokers and nonsmokers.

4.1 Individual Platform Results

The first results depict comparisons from the tri-statistical analysis of the individual studies that contained both tumor and adjacent normal tissue data. The results are depicted in Venn diagrams that can be viewed in supplementary Figures A-1 through A-6. Information regarding the number of samples and number of genes tested for each platform is available in Table A.1. The results showed that the top genes determined by the DISCO normal test overlap more closely with the results from the modified t-test than do the results from the logistic regression test.

4.2 Quantile Combination: Non-Smokers

The first quantile combination was completed comparing data from non-smokers between tumor tissues and their adjacent normal samples. The data were analyzed using the three statistical methods described in Chapter 2. There were 316 samples and 3,987 genes analyzed. Figure 4-1 depicts the relationship of common genes found in the top 1000 results of each statistical analysis. Of the top 1000 DEGs from each test, 255 genes were unique to the t-test, 245 were unique to the logistic regression test, and 169 were unique to the DISCO test. The logistic regression results shared 131 genes with the t-test only and 217 genes with the DISCO test only. The DISCO results shared 207 genes with the t-test only, and 407 genes were common to all three tests. Of the 407 common genes, 322 were statistically significant between tumor and adjacent normal tissue based on an adjusted p-value from the t-test of less than 0.05 (Table A.2). Table 4.1 shows the top 25 genes from the 407 common results. The complete list of all 407 common genes can be viewed in supplementary Table A.2.

Figure 4-1: Relationship of results between top 1000 differentially expressed genes for each: modified t-test, ordinal logistic regression, and DISCO normal tests in nonsmoker samples with tumor and adjacent normal information

Table 4.1: Top 25 common differentially expressed genes from non-smoker samples (Figure 4-1). The full list of 407 common genes can be found in Table A.2 in Appendix A.

Gene       Entrez ID   logFC      P.Value    adj.P.Val
ALDH18A1   5832        -5.94552   1.25E-17   1.19E-14
ALDOA      226         -5.40044   8.21E-15   5.19E-12
COL11A1    1301        -5.01774   5.43E-13   2.06E-10
CCDC6      8030        -4.80598   4.86E-12   1.54E-09
MMP11      4320        -4.76002   7.72E-12   1.83E-09
DDX11      1663        -4.65693   2.15E-11   4.08E-09
OLA1       29789       -4.65693   2.15E-11   4.08E-09
GAPDH      2597        -4.56673   5.18E-11   8.19E-09
KIF14      9928        -4.48942   1.09E-10   1.58E-08
KIF23      9493        -4.33479   4.60E-10   5.13E-08
TEX11      56159       -4.24459   1.04E-09   9.43E-08
TRAF4      9618        -4.239     1.10E-09   9.47E-08
SNF8       11267       -4.13591   2.74E-09   2.00E-07
TMEM106C   79022       -4.12861   2.92E-09   2.05E-07
TMEM53     79639       -4.11573   3.27E-09   2.22E-07
GALNT10    55568       -4.08996   4.10E-09   2.68E-07
ARAP1      116985      -4.0513    5.72E-09   3.50E-07
HHLA3      11147       -4.02553   7.14E-09   4.11E-07
SGK2       10110       -3.97398   1.11E-08   5.83E-07
TSGA14     95681       -3.9611    1.23E-08   6.33E-07
SDHC       6391        -3.95551   1.29E-08   6.46E-07
AMACR      23600       -3.93533   1.53E-08   7.09E-07
FGGY       55277       -3.93533   1.53E-08   7.09E-07
GPRC5C     55890       -3.90955   1.90E-08   8.19E-07
ALKBH1     8846        -3.84512   3.23E-08   1.36E-06

Of the top 25 results, three genes were reported with a log fold change (logFC) of -5 or lower in non-smokers. The first gene, ALDH18A1, was reported with an adjusted p-value of 1.19E-14. The second gene, ALDOA, was reported with an adjusted p-value of 5.19E-12. Finally, the third gene, COL11A1, was reported with an adjusted p-value of 2.06E-10. Each of the reported adjusted p-values is taken from the modified t-test results.

4.3 Quantile Combination: Smokers

The second quantile combination was completed comparing data from smokers between tumor tissues and their adjacent normal samples. There were 321 samples and 3,985 genes analyzed. Figure 4-2 depicts the relationship of common genes found in the top 1000 results of each statistical analysis. Of the top 1000 DEGs from each test, 2 genes were unique to the t-test, 503 were unique to the logistic regression test, and 2 were unique to the DISCO test. The logistic regression results shared 2 genes with the t-test only and 2 genes with the DISCO test only. The DISCO results shared 503 genes with the t-test only, and 493 genes were common to all three tests. Table 4.2 shows the top 25 genes from the 493 common results. Of the 493 common genes, 487 were statistically significant between tumor and adjacent normal tissue based on an adjusted p-value from the t-test of less than 0.05 (Table A.3). A more extensive version of Table 4.2 can be viewed in supplementary Table A.3.

Figure 4-2: Relationship of results between top 1000 differentially expressed genes for each: modified t-test, ordinal logistic regression, and DISCO normal tests in smoker samples with tumor and adjacent normal information
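
The overlap counts summarized in this kind of three-way diagram can be reproduced from the three top-1000 gene lists with plain set operations. The following is a minimal sketch, not the code used in this study: the function names and the toy gene labels are hypothetical, and the reported counts are entered under the assumption that each pairwise count excludes the genes common to all three tests.

```python
def venn3_regions(a, b, c):
    """Return the sizes of the seven regions of a three-set Venn diagram."""
    a, b, c = set(a), set(b), set(c)
    return {
        "a_only": len(a - b - c),
        "b_only": len(b - a - c),
        "c_only": len(c - a - b),
        "a_b_only": len((a & b) - c),
        "a_c_only": len((a & c) - b),
        "b_c_only": len((b & c) - a),
        "all_three": len(a & b & c),
    }


def list_sizes(regions):
    """Recover the size of each original list from its four regions."""
    size_a = regions["a_only"] + regions["a_b_only"] + regions["a_c_only"] + regions["all_three"]
    size_b = regions["b_only"] + regions["a_b_only"] + regions["b_c_only"] + regions["all_three"]
    size_c = regions["c_only"] + regions["a_c_only"] + regions["b_c_only"] + regions["all_three"]
    return size_a, size_b, size_c


# Toy gene lists (hypothetical labels, not study data)
t_test = {"G1", "G2", "G3", "G4"}
logistic = {"G3", "G4", "G5", "G6"}
disco = {"G2", "G3", "G4", "G7"}
regions = venn3_regions(t_test, logistic, disco)

# Counts reported for the smoker pool, with a = t-test, b = logistic
# regression, c = DISCO (assumption: pairwise counts exclude the triple overlap)
reported_smoker = {
    "a_only": 2, "b_only": 503, "c_only": 2,
    "a_b_only": 2, "a_c_only": 503, "b_c_only": 2,
    "all_three": 493,
}
smoker_sizes = list_sizes(reported_smoker)
```

Under that reading, the reported smoker counts add up to 1000 genes for each of the three tests, which is a quick consistency check on the figure.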

Of the top 25 results, four genes were reported with a logFC of -5 or lower in smokers. The first gene, TRAF4, was reported with an adjusted p-value of 1.26E-11. The second gene, TMED9, was reported with an adjusted p-value of 1.94E-10. The third gene, PPP1R14B, was reported with an adjusted p-value of 4.45E-10. Finally, the fourth gene, DEPDC6, was reported with an adjusted p-value of 6.16E-10. Each of the reported adjusted p-values is taken from the modified t-test results.

Table 4.2: Top 25 common differentially expressed genes from smoker samples (Figure 4-2). The full list of 493 common genes can be found in Table 7 in Appendix A.

Gene      Entrez Gene ID  logFC     P.Value   adj.P.Val
TRAF4     9618            -5.58116  3.16E-15  1.26E-11
TMED9     54732           -5.2315   1.46E-13  1.94E-10
PPP1R14B  26472           -5.07146  7.81E-13  4.45E-10
DEPDC6    64798           -5.02676  1.24E-12  6.16E-10
SYNJ2     8871            -4.62599  6.36E-11  1.81E-08
NUP62     23636           -4.47464  2.59E-10  5.74E-08
TMED10    10972           -4.35033  7.96E-10  1.51E-07
ACTR2     10097           -4.31928  1.05E-09  1.74E-07
TBC1D16   125058          -4.21236  2.67E-09  3.54E-07
MCTS1     28985           -4.19029  3.23E-09  4.02E-07
ALDH18A1  5832            -4.17583  3.65E-09  4.41E-07
PFKP      5214            -4.16473  4.02E-09  4.57E-07
SP1       6667            -4.06343  9.44E-09  9.30E-07
GART      2618            -4.04443  1.11E-08  1.05E-06
HMGB3     3149            -4.0091   1.48E-08  1.34E-06
CBLC      23624           -3.98367  1.83E-08  1.62E-06
MED17     9440            -3.96039  2.21E-08  1.87E-06
C1orf27   54953           -3.9533   2.34E-08  1.94E-06
TAF10     6881            -3.95022  2.40E-08  1.95E-06
EIF4E     1977            -3.88037  4.21E-08  3.29E-06
MKLN1     4289            -3.83501  6.04E-08  4.23E-06
TIMM13    26517           -3.80262  7.79E-08  5.25E-06
CBX4      8535            -3.7974   8.11E-08  5.28E-06
C11orf80  79703           -3.78697  8.80E-08  5.57E-06

4.4 Quantile Combination: Tumor Samples

The data pooling combination compared tumor samples between smokers and non-smokers. There were 731 samples and 11,307 genes analyzed. Figure 4-3 depicts the relationship of common genes found in the top 1000 results of each statistical analysis. Of the top 1000 DEGs from each test, 26 genes were unique to the t-test, 821 were unique to the logistic regression test, and 27 were unique to the DISCO test. The logistic regression results shared 6 genes with the t-test only and 5 genes with the DISCO test only. The DISCO results shared 800 genes with the t-test only, and 168 genes were common to all three tests. Table 4.3 shows the top 25 genes from the 168 common results. Of the 168 common genes, none were statistically significant between smokers and non-smokers, based on an adjusted p-value from the t-test of less than 0.05. A more extensive version of Table 4.3 can be viewed in supplementary Table 8.

Figure 4-3: Relationship of results between top 1000 differentially expressed genes for each: modified t-test, ordinal logistic regression, and DISCO normal tests in tumor samples between smokers and nonsmokers.

Table 4.3: Top 25 common differentially expressed genes from tumor samples (Figure 4-3). The full list of 168 common genes can be found in Table 8 in Appendix A.

Gene      Entrez Gene ID  logFC     P.Value   adj.P.Val
APP       351             -1.82024  0.000622  0.319659
KIFC1     3833            -1.95851  0.000232  0.319659
CENPF     1063            -1.77929  0.000823  0.343535
CBR1      873             -1.67082  0.001684  0.432717
ATP2A2    488             -1.57098  0.003145  0.461865
KIF1A     547             -1.52788  0.004076  0.461865
BPGM      669             -1.58751  0.002842  0.461865
CBR3      874             -1.53327  0.003947  0.461865
CDKN2C    1031            -1.55734  0.003416  0.461865
CENPE     1062            -1.5013   0.004769  0.461865
DDX10     1662            -1.5031   0.004719  0.461865
EFNA2     1943            -1.4679   0.005789  0.461865
FEN1      2237            -1.52967  0.004033  0.461865
FGF12     2257            -1.49375  0.004984  0.461865
GABPA     2551            -1.5074   0.004601  0.461865
GARS      2617            -1.56235  0.003314  0.461865
GRIK2     2898            -1.55265  0.003514  0.461865
H2AFZ     3015            -1.5444   0.003693  0.461865
HSD17B10  3028            -1.57887  0.002997  0.461865
HIF1A     3091            -1.46287  0.005959  0.461865
HMMR      3161            -1.50273  0.004729  0.461865
KIF3C     3797            -1.55553  0.003453  0.461865
MAD2L1    4085            -1.5566   0.003431  0.461865
MAP1B     4131            -1.6446   0.001991  0.461865
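
The table above shows raw p-values well below 0.05 alongside adjusted p-values that are not significant, and many genes share the same adjusted value. A multiple-testing correction of the Benjamini-Hochberg type produces exactly this pattern. The sketch below illustrates that procedure on made-up p-values. It is an assumption that this is the adjustment behind the adj.P.Val column, since the method is not named in this section, and the function name `bh_adjust` is hypothetical.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    n = len(pvalues)
    # Indices of the p-values sorted from smallest to largest
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, so that each adjusted value is
    # never larger than the adjusted value of the next-larger p-value
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


# Made-up raw p-values (not study data)
raw = [0.001, 0.008, 0.039, 0.041, 0.60]
adj = bh_adjust(raw)
```

In this toy example the third and fourth p-values are both below 0.05 before adjustment, but they receive the same adjusted value, which lies above 0.05. With 11,307 genes tested, the scaling factor is far larger, which is consistent with none of the 168 common genes remaining significant.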

4.5 Common Non-Smoker / Smoker Results

In the interest of determining differentially expressed genes related to the pathology of lung cancer, the 407 non-smoker DEGs and the 493 smoker DEGs were compared for common genes. There were 105 genes common to the two lists, of which 95 were statistically significant between tumor and adjacent normal tissue, based on an adjusted p-value from the t-test of less than 0.05. The results can be viewed below in Table 4.4.
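
The comparison of the two result lists amounts to a set intersection followed by a significance filter. The sketch below shows one way to express it. The function name is hypothetical, the adjusted p-values are the ones reported in this chapter for a handful of genes, and requiring significance in both pools is only one possible reading of the criterion, which the text does not spell out.

```python
def common_significant(pool_a, pool_b, alpha=0.05):
    """Find genes present in both pools and those significant in both.

    pool_a and pool_b map a gene symbol to its adjusted p-value.
    """
    common = sorted(set(pool_a) & set(pool_b))
    # Assumption: a gene counts as significant only if it passes in both pools
    both_significant = [g for g in common if pool_a[g] < alpha and pool_b[g] < alpha]
    return common, both_significant


# Adjusted p-values quoted in this chapter (small illustrative subset)
non_smoker = {"ALDOA": 5.19e-12, "TRAF4": 9.47e-08, "ADORA3": 0.039845, "DERL1": 6.25e-05}
smoker = {"ALDOA": 0.000149, "TRAF4": 1.26e-11, "ADORA3": 0.051135, "SYNJ2": 1.81e-08}

common, both_significant = common_significant(non_smoker, smoker)
```

Here ADORA3 is common to both lists but passes the 0.05 threshold only in the non-smoker pool, so it is dropped by the stricter filter, while DERL1 and SYNJ2 each appear in one pool only.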

Some genes that were common between the two pooled analyses have been shown in the literature to play a role in lung cancer. For example: ALDOA [14], for which the adjusted p-value in this study was 5.19E-12 for non-smokers and 0.000149 for smokers; CCDC6 [13], which had adjusted p-values of 1.54E-09 for non-smokers and 0.003949 for smokers; and KIF14 [17], which had adjusted p-values of 1.58E-08 and 0.000147 for non-smokers and smokers, respectively. Another example is TRAF4 [21], which had adjusted p-values of 9.47E-08 for non-smokers and 1.26E-11 for smokers. There were also genes shown to be uniquely differentially expressed in non-smokers, for example HNRNPA2B1 [25] and DERL1 [26], which had adjusted p-values of 0.000114 and 6.25E-05, respectively. As these genes each play important roles in the cellular environment, they were chosen for further investigation in this study.

Table 4.4: Common DEGs from smoker and non-smoker results. For each gene, the first row (unshaded in the original) gives results from the smoker data pool and the second row (shaded in the original) gives results from the non-smoker data pool.

                                        T-test                      LR       DISCO
Gene        Pool       Entrez Accession logFC    P.Value  adj.P.Val ordp     ChiSq
ADORA3      Smoker     140    AF226731  -1.76569 0.012617 0.051135  1.81E-12 0.020341
ADORA3      Non-smoker 140    AF226731  -1.70607 0.01417  0.039845  1.37E-12 0.033159
AKIRIN1     Smoker     79647  AI205764  -2.51358 0.000384 0.003828  2.44E-15 0.000545
AKIRIN1     Non-smoker 79647  BG109865  -1.99857 0.00406  0.014791  1.95E-13 0.128259
ALDH18A1    Smoker     5832   U76542    -4.17583 3.65E-09 4.41E-07  0        8.40E-10
ALDH18A1    Non-smoker 5832   U76542    -5.94552 1.25E-17 1.19E-14  0        6.60E-05
ALDOA       Smoker     226    AI921586  -3.21009 5.76E-06 0.000149  2.22E-16 7.09E-06
ALDOA       Non-smoker 226    AI921586  -5.40044 8.21E-15 5.19E-12  0        0.001237
AMMECR1     Smoker     9949   AA707320  -1.83808 0.009413 0.041622  1.11E-13 0.01531
AMMECR1     Non-smoker 9949   W84774    -2.16996 0.001809 0.007858  2.38E-14 0.10934
ANGPTL3     Smoker     27329  AV659209  -3.7602  1.08E-07 6.54E-06  0        1.25E-07
ANGPTL3     Non-smoker 27329  AV659209  -2.76271 7.12E-05 0.00065   0        0.020088
APOBEC3F    Smoker     200316 BF511268  -1.83286 0.009617 0.042243  9.34E-13 0.008314
APOBEC3F    Non-smoker 200316 BF511268  -2.46634 0.000391 0.002483  2.22E-15 0.007472
APOOL       Smoker     139322 BG290639  -1.82818 0.009803 0.042778  6.52E-13 0.01839
APOOL       Non-smoker 139322 AA053853  -1.53856 0.026962 0.066287  4.93E-12 0.376136
ASCC3       Smoker     10973  AU153330  -2.25706 0.00143  0.010282  1.07E-12 0.003692
ASCC3       Non-smoker 10973  AU153330  -1.89936 0.006318 0.020639  4.57E-13 0.000539
ATG10       Smoker     83734  AI027990  -1.86351 0.008474 0.038625  5.78E-13 0.012014
ATG10       Non-smoker 83734  AL136912  -2.51788 0.000295 0.001982  1.55E-15 0.026243
BET1L       Smoker     51272  BC000688  -2.92453 3.60E-05 0.000579  2.22E-15 6.37E-05
BET1L       Non-smoker 51272  BC000688  -2.38902 0.000593 0.00332   4.22E-15 0.001383
ADPRM       Smoker     56985  BC001294  -2.22949 0.001635 0.011367  3.02E-14 0.001774
ADPRM       Non-smoker 56985  BC001294  -1.42258 0.040823 0.091911  2.79E-11 0.053892
ELP5        Smoker     23587  BC002762  -1.83862 0.009392 0.041622  2.23E-13 0.024472
ELP5        Non-smoker 23587  BC002762  -1.64894 0.017751 0.047654  2.55E-12 0.059964
C1GALT1C1   Smoker     29071  BF749723  -2.17704 0.002101 0.014     3.20E-14 0.00609
C1GALT1C1   Non-smoker 29071  BF749723  -2.22151 0.001403 0.006528  1.35E-14 0.004204
COA7        Smoker     65260  NM_023077 -2.15081 0.002378 0.01523   2.24E-12 0.007116
COA7        Non-smoker 65260  NM_023077 -2.82714 4.81E-05 0.000473  0        0.089149
C20orf24    Smoker     55969  BC004446  -2.03667 0.004012 0.022481  6.40E-12 0.005336
C20orf24    Non-smoker 55969  AF274948  -2.09265 0.002624 0.010396  7.31E-14 0.002135
C3orf14     Smoker     57415  AI458586  -1.87743 0.007996 0.036911  4.81E-13 0.014834
C3orf14     Non-smoker 57415  AI458586  -3.34258 1.54E-06 2.76E-05  0        0.004273
CALU        Smoker     813    BG392414  -2.85267 5.58E-05 0.000826  2.22E-15 5.03E-05
CALU        Non-smoker 813    BG392414  -3.05909 1.09E-05 0.000136  0        0.008297
CANX        Smoker     821    AW294207  -3.44105 1.17E-06 4.70E-05  0        1.25E-06
CANX        Non-smoker 821    AW294207  -2.53248 0.000272 0.001854  1.55E-15 0.000355
CAPN5       Smoker     726    BF195709  -2.41295 0.000652 0.005738  6.14E-13 0.001012
CAPN5       Non-smoker 726    BF195709  -1.4097  0.042683 0.09542   9.63E-12 0.002643
CAPRIN1     Smoker     4076   BG107845  -3.20032 6.15E-06 0.000157  4.44E-16 1.33E-05
CAPRIN1     Non-smoker 4076   BG107845  -3.41989 8.79E-07 1.79E-05  0        0.004345
CBX4        Smoker     8535   AI570531  -3.7974  8.11E-08 5.28E-06  0        5.80E-08
CBX4        Non-smoker 8535   AI570531  -1.70607 0.01417  0.039845  1.57E-12 0.004957
CCDC6       Smoker     8030   AK024913  -2.50662 0.000398 0.003949  1.59E-13 0.000702
CCDC6       Non-smoker 8030   AK024913  -4.80598 4.86E-12 1.54E-09  0        0.113999
CCNL2       Smoker     81669  AK000685  -1.97859 0.005187 0.027085  1.82E-11 0.011644
CCNL2       Non-smoker 81669  AK000685  -1.80916 0.009292 0.02863   8.65E-13 0.076152
CDCP1       Smoker     64866  AK026028  -2.44855 0.000542 0.005022  5.20E-11 0.001286
CDCP1       Non-smoker 64866  AK026028  -1.74473 0.012125 0.035459  4.21E-13 0.009319
CEP250      Smoker     11190  BF511695  -2.0546  0.003701 0.021127  7.99E-14 0.004523
CEP250      Non-smoker 11190  BF511695  -1.6803  0.015698 0.043119  1.43E-12 0.184967
CES3        Smoker     23491  AK000105  -2.19336 0.001944 0.013108  3.11E-15 0.002379
CES3        Non-smoker 23491  AK000105  -3.1437  6.19E-06 8.83E-05  0        0.506527
CNIH4       Smoker     29097  AK024569  -2.49926 0.000414 0.004027  3.55E-15 0.000414
CNIH4       Non-smoker 29097  AA834560  -1.37104 0.048699 0.106119  2.93E-11 0.230646
COL11A1     Smoker     1301   BG028597  -2.41295 0.000652 0.005738  5.75E-13 0.000884
COL11A1     Non-smoker 1301   BG028597  -5.01774 5.43E-13 2.06E-10  0        0.001282
COL3A1      Smoker     1281   AU146808  -2.21517 0.001752 0.011991  1.13E-12 0.001957
COL3A1      Non-smoker 1281   AU146808  -2.54365 0.000255 0.00178   8.88E-16 0.001109
COPB2       Smoker     9276   AL110129  -3.22347 5.27E-06 0.000144  0        5.59E-06
COPB2       Non-smoker 9276   AL110129  -1.97667 0.004484 0.015788  1.58E-13 0.004409
CP          Smoker     1356   AI922198  -1.75311 0.013263 0.052998  4.54E-11 0.008105
CP          Non-smoker 1356   AI684991  -2.07976 0.002788 0.010888  3.84E-14 0.035358
DDX11       Smoker     1663   AW571709  -2.55774 0.000302 0.003161  2.22E-16 0.000388
DDX11       Non-smoker 1663   AW571709  -4.65693 2.15E-11 4.08E-09  0        0.001985
DENND1A     Smoker     57706  AB046828  -1.81373 0.010399 0.04474   6.54E-13 0.019085
DENND1A     Non-smoker 57706  AB046828  -2.33748 0.000777 0.004087  7.55E-15 0.032402
DHTKD1      Smoker     55526  AI934407  -1.77037 0.012384 0.050497  5.08E-11 0.014244
DHTKD1      Non-smoker 55526  AI934407  -3.75492 6.72E-08 2.55E-06  0        0.012805
EIF4E       Smoker     1977   AA913840  -3.88037 4.21E-08 3.29E-06  0        8.13E-08
EIF4E       Non-smoker 1977   T82467    -2.66864 0.000125 0.001007  2.22E-16 0.000494
ERO1L       Smoker     30001  AW268365  -3.53847 5.77E-07 2.64E-05  0        5.34E-07
ERO1L       Non-smoker 30001  AW268365  -3.08486 9.20E-06 0.000122  0        0.000101
G3BP1       Smoker     10146  AV705516  -1.88987 0.007589 0.03532   2.98E-13 0.01401
G3BP1       Non-smoker 10146  BE673925  -2.69828 0.000105 0.000883  2.22E-16 0.029657
GAPDH       Smoker     2597             -3.6668  2.22E-07 1.16E-05  0        2.78E-07
GAPDH       Non-smoker 2597             -4.56673 5.18E-11 8.19E-09  0        2.68E-05
GDF15       Smoker     9518   AA129612  -1.94166 0.006088 0.030395  2.06E-13 0.006677
GDF15       Non-smoker 9518   AA129612  -2.55654 0.000237 0.00168   8.88E-16 0.028689
GFER        Smoker     2671   AI332613  -3.42955 1.27E-06 4.85E-05  0        2.90E-06
GFER        Non-smoker 2671   AI332613  -3.71627 9.14E-08 3.15E-06  0        0.021329
HBS1L       Smoker     10767  AI801875  -2.95223 3.04E-05 0.000506  0        7.64E-05
HBS1L       Non-smoker 10767  AI801875  -1.25507 0.071156 0.144134  7.27E-11 0.359141
HIST3H2A    Smoker     92815  AI268420  -2.19069 0.001969 0.013232  1.67E-14 0.002717
HIST3H2A    Non-smoker 92815  AI268420  -1.42258 0.040823 0.091911  1.12E-11 0.029913
HNRNPC      Smoker     3183   NM_004500 -2.4258  0.00061  0.005453  4.31E-11 0.000693
HNRNPC      Non-smoker 3183   AA126793  -1.69319 0.014917 0.041514  1.68E-12 0.005153
HSPA8       Smoker     3312   AF217511  -2.12632 0.002666 0.016465  6.79E-14 0.00201
HSPA8       Non-smoker 3312   AF217511  -2.30053 0.000941 0.0047    9.55E-15 0.080995
ICA1        Smoker     3382   AI291524  -2.22976 0.001633 0.011367  8.66E-15 0.002469
ICA1        Non-smoker 3382   AI291524  -2.5952  0.000191 0.001424  4.44E-16 0.034474
IDH1        Smoker     3417   AA825652  -1.92654 0.006496 0.031638  3.13E-14 0.003712
IDH1        Non-smoker 3417   AA825652  -2.67251 0.000122 0.000988  2.22E-16 0.144997
IDS         Smoker     3423   BF055317  -1.79901 0.011038 0.046537  9.09E-13 0.007134
IDS         Non-smoker 3423   BF055317  -1.56433 0.024505 0.061849  6.65E-12 0.004871
IL17RB      Smoker     55540  AF208111  -2.12739 0.002652 0.016465  7.44E-10 0.004833
IL17RB      Non-smoker 55540  AF208111  -1.72626 0.013067 0.037692  1.15E-12 0.061927
ITGA2       Smoker     3673   N95414    -2.34698 0.000914 0.007272  1.27E-12 0.00158
ITGA2       Non-smoker 3673   N95414    -2.74983 7.70E-05 0.000689  2.22E-16 0.371413
KCNA1       Smoker     3736   N64750    -2.0562  0.003675 0.02103   1.44E-13 0.004171
KCNA1       Non-smoker 3736   N64750    -2.44057 0.00045  0.002755  1.78E-15 0.100383
KCNC2       Smoker     3747   AW131785  -1.92988 0.006404 0.031375  1.72E-13 0.007416
KCNC2       Non-smoker 3747   AW131785  -3.44566 7.27E-07 1.62E-05  0        0.003074
KDM4B       Smoker     23030  AW450344  -2.27874 0.001286 0.009449  6.44E-15 0.00146
KDM4B       Non-smoker 23030  AW450344  -2.46976 0.000384 0.002478  2.44E-15 0.048832
KIAA0922    Smoker     23240  AW575183  -1.89763 0.007345 0.034506  6.94E-13 0.018089
KIAA0922    Non-smoker 23240  AW575183  -2.72405 8.98E-05 0.000786  2.22E-16 0.144819
KIF14       Smoker     9928   AW183154  -3.21344 5.64E-06 0.000147  4.44E-16 1.76E-05
KIF14       Non-smoker 9928   AW183154  -4.48942 1.09E-10 1.58E-08  0        0.000636
KIFC1       Smoker     3833   U85977    -2.17048 0.002168 0.014274  4.17E-14 0.002341
KIFC1       Non-smoker 3833   U85977    -2.67251 0.000122 0.000988  2.22E-16 0.002601
KRTAP 1-3-3 Smoker     81850  X63338    -1.99665 0.004792 0.025489  9.04E-14 0.005374
KRTAP 1-3-3 Non-smoker 81850  X63338    -2.42209 0.000497 0.00293   2.22E-15 0.140302
MED17       Smoker     9440   AK001674  -3.96039 2.21E-08 1.87E-06  0        6.25E-08
MED17       Non-smoker 9440   AK001674  -2.20862 0.001496 0.006826  1.15E-14 0.118285
MFSD9       Smoker     84804  BC006242  -1.92265 0.006605 0.031972  2.64E-11 0.007985
MFSD9       Non-smoker 84804  BC006242  -2.91004 2.87E-05 0.000309  0        0.004376
MGEA5       Smoker     10724  AF307332  -2.40533 0.000679 0.005917  1.85E-10 0.003441
MGEA5       Non-smoker 10724  AF307332  -2.37614 0.000635 0.003492  2.22E-15 0.280154
MICALL2     Smoker     79778  AI821474  -2.60886 0.000228 0.002501  1.08E-13 0.00034
MICALL2     Non-smoker 79778  AI821474  -3.52298 4.08E-07 9.80E-06  0        1.90E-05
MMP11       Smoker     4320   AW511464  -3.0645  1.50E-05 0.000314  0        3.66E-05
MMP11       Non-smoker 4320   AW511464  -4.76002 7.72E-12 1.83E-09  0        0.001074
MRPL4       Smoker     51073  BC000756  -3.7776  9.47E-08 5.80E-06  0        7.85E-08
MRPL4       Non-smoker 51073  BC000756  -1.61587 0.020167 0.052506  2.62E-12 0.199805
MRPS15      Smoker     64960  AF265439  -3.05647 1.58E-05 0.00032   0        2.45E-05
MRPS15      Non-smoker 64960  AF265439  -2.99466 1.67E-05 0.000196  0        0.001158
MUS81       Smoker     80198  AA767217  -2.03332 0.004073 0.022756  5.44E-14 0.004336
MUS81       Non-smoker 80198  AA767217  -2.26016 0.001156 0.005526  7.33E-15 0.048862
N6AMT1      Smoker     29104  AK021678  -3.32892 2.57E-06 7.93E-05  0        4.78E-06
N6AMT1      Non-smoker 29104  AK021678  -2.16996 0.001809 0.007858  1.80E-14 0.064579
NOL6        Smoker     65083  BE780892  -2.1021  0.002981 0.017969  1.10E-13 0.003499
NOL6        Non-smoker 65083  AK026258  -1.52567 0.028268 0.068698  4.09E-12 0.001125
OLA1        Smoker     29789  AL136546  -2.20567 0.001833 0.012443  5.37E-10 0.001195
OLA1        Non-smoker 29789  AL136546  -4.65693 2.15E-11 4.08E-09  0        7.84E-05
OR51E2      Smoker     81285  AF311306  -2.30242 0.001143 0.008676  1.42E-10 0.000618
OR51E2      Non-smoker 81285  AI805082  -3.25238 2.92E-06 4.59E-05  0        0.111815
OTUB2       Smoker     78990  AI656232  -1.86699 0.008352 0.038201  3.45E-13 0.010698
OTUB2       Non-smoker 78990  AI656232  -3.07197 1.00E-05 0.000129  0        0.04167
PAK1        Smoker     5058   AU154408  -2.14425 0.002452 0.015525  3.55E-15 0.007161
PAK1        Non-smoker 5058   AU154408  -2.61538 0.00017  0.001304  4.44E-16 0.000702
PGK1        Smoker     5230   AA069778  -2.37201 0.000805 0.00667   6.38E-11 0.003407
PGK1        Non-smoker 5230   AA069778  -1.57721 0.023351 0.059489  3.36E-12 0.218347
PHF8        Smoker     23133  BE676640  -2.24047 0.00155  0.01097   2.22E-14 0.002854
PHF8        Non-smoker 23133  BE676640  -1.3195  0.057813 0.121786  2.09E-11 0.009682
PHKA1       Smoker     5255   BE503584  -3.63683 2.78E-07 1.38E-05  0        8.36E-07
PHKA1       Non-smoker 5255   BE503584  -3.56164 3.04E-07 8.25E-06  0        0.002728
PLEKHA6     Smoker     22874  AA535361  -2.38525 0.000753 0.006312  5.26E-13 0.000904
PLEKHA6     Non-smoker 22874  AA535361  -1.71896 0.013457 0.038466  1.74E-12 0.079047
PMEPA1      Smoker     56937  AL035541  -3.15817 8.14E-06 0.000188  0        6.62E-06
PMEPA1      Non-smoker 56937  AL035541  -2.05399 0.003146 0.012013  7.15E-14 0.029128
POLDIP2     Smoker     26073  AW151250  -1.83741 0.009439 0.04169   1.34E-11 0.02001
POLDIP2     Non-smoker 26073  AW151250  -2.07976 0.002788 0.010888  3.49E-14 0.000723
PPIA        Smoker     5478   T62044    -2.05741 0.003655 0.020951  4.93E-14 0.006932
PPIA        Non-smoker 5478   T62044    -1.43331 0.039327 0.088967  1.60E-11 0.011629
PSAT1       Smoker     29968  BC004863  -2.59695 0.000244 0.002625  0        0.000195
PSAT1       Non-smoker 29968  BC004863  -3.79358 4.92E-08 1.99E-06  0        5.99E-05
PTOV1       Smoker     53635  AW298601  -2.53084 0.00035  0.003537  1.11E-15 0.000923
PTOV1       Non-smoker 53635  AW298601  -3.74204 7.45E-08 2.77E-06  0        0.040834
RAD1        Smoker     5810   AI796010  -2.10424 0.002952 0.017847  1.28E-13 0.003021
RAD1        Non-smoker 5810   AI796010  -1.62876 0.019193 0.050316  1.86E-12 0.202549
RBM15B      Smoker     29890  W68720    -3.40452 1.51E-06 5.48E-05  0        3.58E-06
RBM15B      Non-smoker 29890  W68720    -1.20352 0.083563 0.161182  5.55E-11 0.00428
RPL37       Smoker     6167   BF216701  -1.91382 0.006858 0.032838  2.47E-11 0.002931
RPL37       Non-smoker 6167   BF724210  -2.56123 0.000231 0.001655  4.22E-13 0.027418
SDF4        Smoker     51150  BC006211  -3.10866 1.13E-05 0.000248  4.44E-16 8.45E-06
SDF4        Non-smoker 51150  BC006211  -3.0462  1.19E-05 0.000146  0        0.015333
SDHC        Smoker     6391   AW183074  -3.00027 2.25E-05 0.000419  0        2.61E-05
SDHC        Non-smoker 6391   AW183074  -3.95551 1.29E-08 6.46E-07  0        0.011359
SEZ6L2      Smoker     26470  BC000567  -2.59534 0.000246 0.002641  2.22E-16 0.000313
SEZ6L2      Non-smoker 26470  BF316510  -2.18285 0.001699 0.007569  2.80E-14 0.020029
SFSWAP      Smoker     6433   BG434474  -2.4725  0.000478 0.004565  4.45E-11 0.000381
SFSWAP      Non-smoker 6433   BG434474  -2.09265 0.002624 0.010396  6.59E-14 0.010474
SIX3        Smoker     6496   AW473656  -1.83942 0.009361 0.041532  6.76E-11 0.007349
SIX3        Non-smoker 6496   BF433916  -2.18285 0.001699 0.007569  2.80E-14 0.081326
SLC12A2     Smoker     6558   AK025062  -2.61368 0.000222 0.002445  1.34E-11 0.000317
SLC12A2     Non-smoker 6558   AK025062  -1.57721 0.023351 0.059489  5.26E-12 0.156817
SNF8        Smoker     11267  AA935515  -3.16486 7.78E-06 0.000182  1.11E-15 1.51E-05
SNF8        Non-smoker 11267  AA935515  -4.13591 2.74E-09 2.00E-07  0        0.003046
SYNCRIP     Smoker     10492  BF593158  -3.1393  9.21E-06 0.000207  0        1.05E-05
SYNCRIP     Non-smoker 10492  BF593158  -2.6025  0.000183 0.001382  4.44E-16 0.001046
SYTL2       Smoker     54843  N21426    -2.23511 0.001591 0.011179  2.57E-12 0.001942
SYTL2       Non-smoker 54843  N21426    -2.19573 0.001594 0.007205  3.18E-14 0.020896
TDP1        Smoker     55775  AK023514  -2.4448  0.000553 0.005096  2.89E-15 0.001767
TDP1        Non-smoker 55775  AK023514  -2.8658  3.78E-05 0.000384  0        0.01834
TEP1        Smoker     7011   BF197089  -2.40292 0.000687 0.005978  6.28E-13 0.000877
TEP1        Non-smoker 7011   BF197089  -1.61587 0.020167 0.052506  4.80E-12 0.331577
TMEM132A    Smoker     54972  W74594    -2.00723 0.004574 0.024719  2.04E-13 0.005786
TMEM132A    Non-smoker 54972  W74594    -2.26016 0.001156 0.005526  1.40E-14 0.008964
TMEM53      Smoker     79639  AA160797  -2.99077 2.39E-05 0.00043   3.11E-15 6.45E-05
TMEM53      Non-smoker 79639  AA160797  -4.11573 3.27E-09 2.22E-07  0        0.001182
TMEM8A      Smoker     58986  BC004276  -2.09099 0.003137 0.018627  5.36E-12 0.001803
TMEM8A      Non-smoker 58986  BC004276  -1.80357 0.009512 0.029165  5.36E-13 0.000885
TRAF4       Smoker     9618   AI992283  -5.58116 3.16E-15 1.26E-11  0        5.55E-16
TRAF4       Non-smoker 9618   AI992283  -4.239   1.10E-09 9.47E-08  0        0.011824
TRMT5       Smoker     57570  AI168767  -2.25786 0.001424 0.01026   2.24E-10 0.001546
TRMT5       Non-smoker 57570  AI168767  -1.63218 0.018942 0.049932  3.09E-12 0.105785
TULP4       Smoker     56995  AF288480  -1.76074 0.012868 0.051835  7.02E-11 0.026072
TULP4       Non-smoker 56995  H15278    -1.91225 0.005972 0.019884  4.35E-13 0.317301
UBE2Z       Smoker     65264  AA147920  -2.50114 0.00041  0.004016  1.85E-13 0.000618
UBE2Z       Non-smoker 65264  BE544096  -1.6876  0.015251 0.042258  1.23E-12 0.019487
UGGT1       Smoker     56886  AI672492  -2       0.004722 0.025247  1.19E-13 0.004393
UGGT1       Non-smoker 56886  AI672492  -3.56164 3.04E-07 8.25E-06  0        0.000257
YIPF5       Smoker     81555  AW473802  -1.89803 0.007332 0.034506  2.80E-11 0.007884
YIPF5       Non-smoker 81555  AA169752  -3.40142 1.01E-06 1.99E-05  0        0.005871
ZC3H3       Smoker     23144  AW024128  -2.80463 7.43E-05 0.00105   2.98E-12 6.63E-05
ZC3H3       Non-smoker 23144  AW024128  -2.19573 0.001594 0.007205  1.93E-14 0.057621
ZFX         Smoker     7543   AI745209  -2.09367 0.003099 0.018482  6.51E-14 0.004793
ZFX         Non-smoker 7543   AI745209  -1.70607 0.01417  0.039845  1.39E-12 0.033233

Chapter 5

Discussion

This study aimed to emphasize the acute need for lung cancer biomarkers, specifically for non-smokers. The results identified candidate biomarkers that fall into multiple biological pathways, including muscle maintenance, DNA damage repair mechanisms, cellular metabolism, proliferation and growth, and tumor progression.

One objective of this study was to determine which genes were differentially expressed in non-smokers with lung cancer. Following the quantile combination of non-smoker samples with tumor and adjacent normal tissue information, multiple genes were selected for further analysis. Additionally, multiple common genes were found between the smoker and non-smoker data pools. This information is critical to understanding the pathogenesis of lung cancer, and this study explored multiple potential markers.

There were 105 common genes between the smoker and non-smoker data pools. For example, the gene ALDOA was reported as differentially expressed (adj p = 5.19E-12 in non-smokers and 0.000149 in smokers). Tumors typically have a metabolism altered from that of normal cells, which aids increased growth and, in some cases, metastasis. One example is the higher intake of glucose by tumor cells than by normal cells [14]. Consequently, genes involved in tumor metabolism have been targets of study for many years. ALDOA (fructose-bisphosphate aldolase A) plays a role in glycolysis, converting fructose-1,6-bisphosphate to glyceraldehyde-3-phosphate and dihydroxyacetone phosphate. This gene has been previously examined in relation to lung cancer (specifically lung squamous cell carcinoma, LSCC) by Du et al. They examined the role of ALDOA in muscle maintenance and mobility, as well as actin organization and cell shape. Their results showed that ALDOA overexpression in the lungs was correlated with LSCC tumor growth. Furthermore, reduction of ALDOA expression in LSCC cell lines reduced tumor growth and proliferation [11].

Genes involved in DNA damage repair are often good candidates for biomarkers, as they play a critical role in the cell cycle. Results from the quantile-combined data of non-smoker samples with tumor and adjacent normal tissue groups showed differential expression of CCDC6 (adj p = 1.54E-09 in non-smokers and 0.003949 in smokers). Rearrangement of coiled-coil domain containing 6 produces an oncogene that has been studied previously in multiple cancers, including NSCLC. In normal functioning, CCDC6 is involved in DNA damage repair and is part of a stable cycle of ubiquitination and degradation. However, rearrangement of this gene disrupts this cycle [12]. CCDC6 was studied in NSCLC by Morra et al. in relation to DNA damage repair and tumor metastasis [13].

Like metabolism, cell growth and proliferation are altered in tumor cells compared with normal tissues. The results from this study showed differential expression of genes involved in cell growth and proliferation; for example, KIF14 (adj p = 1.58E-08 in non-smokers and 0.000147 in smokers), a member of the kinesin motor protein superfamily. Underexpression of KIF14 has been associated with a delay of the metaphase-to-anaphase mitotic transition, and was studied by Hung et al. in relation to increased tumor growth in lung adenocarcinomas and decreased survival. In the same study, overexpression of KIF14 decreased tumor growth and migration in vivo [17]. Interestingly, KIF14 knockdown has also been shown to decrease tumor growth in samples where the gene is overexpressed [18]. A similar gene common to the smoker and non-smoker data pools in this study was KIFC1 (adj p = 0.000988 in non-smokers and 0.014274 in smokers; Table 4.4). KIFC1, a motor protein of the kinesin-14 family, has been shown to be predictive of brain metastasis in lung cancer [19]. Furthermore, this gene has also been studied as a potential biomarker for ovarian cancer, where, because of its activity as a centrosome clustering molecule, it is required to sustain cancer cells containing extra centrosomes. It is thought that upregulation of KIFC1 is a useful 'tactic' of cancer cells to retain functionality with extra chromosomes [20].

Another result of this study was TRAF4 (adj p = 9.47E-08 in non-smokers and 1.26E-11 in smokers). This gene plays a role in signal transduction. Overexpression of tumor necrosis factor receptor-associated factor 4 has been identified in multiple tumor types, including lung cancer [21]. A study by Li et al. utilized RNA interference to knock down TRAF4 in lung tumor cells. As a result, tumor growth and proliferation were weakened [22].

In addition to the genes shared between the smoker and non-smoker data pools, there were also genes found to be uniquely differentially expressed in the non-smoker data pool. One such gene is HNRNPA2B1 (heterogeneous nuclear ribonucleoprotein A2/B1), which is a nuclear binding protein. The adjusted p-value for this gene was 0.000114. Overexpression of this gene has been studied in multiple types of cancer as a control mechanism of mRNA splicing [23]. When studied in lung cancer cell lines, a decrease of HNRNPA2B1 expression resulted in a decrease of tumor cell proliferation and increase in ... [24]. Through further examination, studies such as that of Guo et al. have examined the role of HNRNPA2B1 in IRF-3 (interferon regulatory factor 3) splicing in lung cancer cells. Specifically, increased exclusion of exons 2 and 3 from IRF-3, through RNAi knockdown of HNRNPA2B1, has been shown to affect cytokine production in NSCLC [25].

A second gene found specific to the non-smoker pool was DERL1 (adj. p-value = 6.25E-05). Derlin-1 has been studied as a commonly overexpressed gene in lung cancer whose overexpression increases cancer cell invasion [26]. Derlin-1 is a direct effector of the EGFR pathway, which is commonly analyzed in lung adenocarcinomas. Dong et al. studied the mechanism of DERL1 in NSCLC and reported an upregulation of MMP-2 and MMP-9. They explain that the upregulation was likely mediated by the EGFR-ERK pathway, because of the direct influence of DERL1 on EGFR [27].

A third gene found in this study to be unique to the non-smoker data pool was GTSE1 (adj. p-value = 3.43E-06). G2 and S expressed protein-1 has the ability to bind to and negatively regulate p53 [28]. Because of this, GTSE1 was studied by Tian et al., who utilized verification techniques such as real-time PCR to determine the relationship of GTSE1 to lung cancer. However, they reported that the correlation between GTSE1 and lung cancer survival was not statistically significant, and that GTSE1 would therefore not be an optimal choice as a biomarker [29].

Microarray analysis is a very useful method for studying gene expression. This study has described the strengths associated with combining data from multiple microarray platforms. This is the first study to combine data from multiple microarray platforms using quantile combination while also examining results from three statistical analyses: the modified t-test, ordinal logistic regression, and the DISCO normal distribution. The benefit of utilizing three statistical methods is greater confidence in the results: when a single gene is determined to be differentially expressed by three statistical methods, rather than just one, the result is more reliable.

The t-test is the standard test for differential gene expression analyses; however, the ordinal logistic regression and DISCO normal test offer strengths that the t-test does not. The DISCO normal test utilizes a variety of distributions and is thus more flexible. The algorithm is based on the Newton-Raphson method and estimates the parameters iteratively. The statistical comparisons show that, a majority of the time, the t-test results overlap more closely with those of the DISCO normal distribution than with those of the ordinal logistic regression model. The results also showed that as the number of probes per array increased, the number of common probes in the top 1000 results decreased. With more probes per chip, a fixed list of 1000 covers a smaller fraction of the genes tested, so it is less likely that the same genes will appear in the top 1000 results of every test.
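
For reference, the generic Newton-Raphson update for maximizing a log-likelihood can be written as below. This is the textbook form only; the specific likelihood and parameterization of the DISCO normal test are not reproduced here, and the symbols $\theta$ (parameter vector), $\ell$ (log-likelihood), and $k$ (iteration index) are notation introduced for this illustration.

```latex
\theta^{(k+1)} = \theta^{(k)}
  - \left[ \nabla^{2} \ell\left(\theta^{(k)}\right) \right]^{-1}
    \nabla \ell\left(\theta^{(k)}\right)
```

The iteration is typically stopped when successive estimates differ by less than a chosen tolerance.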

This study has multiple strengths. First, we compare the results of three statistical tests while also utilizing the quantile combination technique. Furthermore, over 1,100 samples were analyzed, with up to 11,000 common genes in the largest data pool. The larger sample number increases the power to identify differential gene expression and improves statistical accuracy, as more samples result in fewer false positives.

The limitations of this study are partly related to the nature of the data pooled from the individual studies included in this analysis. Lung cancer is an extensively heterogeneous disease, and it is possible that these studies are not an accurate representation of NSCLC. Furthermore, limited phenotype information is provided for the samples; for example, not all studies designate whether the sample came from a male or female patient, the age, tumor progression, etc. More phenotype information would greatly benefit the study. However, when working with publicly available data, gathering unreported phenotype information is difficult.

In terms of the limitations of the technology, there are many differences between platforms, outlined in this study, that make it impossible to compare microarray data directly; for example, the number of genes per array, the specific probes per array, and the array production and workflow. Moreover, the studies included in this analysis range from 2008 to 2014, and the evolution and advancement of microarray technology over that six-year span could contribute further to the differences between technologies. Furthermore, when using quantile combination, only the genes common to every platform can be combined; thus, some genes that may play important roles in lung cancer are lost if their probes are not found on every study platform. Finally, as with any microarray analysis, there are always false positive and false negative results associated with the analysis.
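
The restriction to common genes, and the reason rank-based scores make platforms comparable, can be illustrated with a short sketch. This is not the quantile scoring procedure used in this study: the function names, the choice of four bins, and the expression values are all hypothetical, and each platform is represented by a single sample for brevity.

```python
def common_genes(platforms):
    """Return the genes measured on every platform.

    platforms maps a platform name to a dict of gene -> expression value.
    """
    gene_sets = [set(values) for values in platforms.values()]
    return sorted(set.intersection(*gene_sets))


def quantile_scores(sample, genes, bins=4):
    """Rank the chosen genes within one sample and bin the ranks into 1..bins."""
    ordered = sorted(genes, key=lambda g: sample[g])
    n = len(ordered)
    return {g: (rank * bins) // n + 1 for rank, g in enumerate(ordered)}


# Hypothetical expression values on two platforms with different scales
platforms = {
    "platform_A": {"ALDOA": 9.1, "TRAF4": 7.4, "KIF14": 5.2, "CCDC6": 6.8, "GTSE1": 4.0},
    "platform_B": {"ALDOA": 1500.0, "TRAF4": 640.0, "KIF14": 90.0, "CCDC6": 310.0},
}

shared = common_genes(platforms)  # GTSE1 is dropped: it is missing from platform_B
scores_a = quantile_scores(platforms["platform_A"], shared)
scores_b = quantile_scores(platforms["platform_B"], shared)
```

Although the raw values differ by orders of magnitude, the two samples receive identical scores because only the within-sample ordering is used. The gene absent from one platform cannot be scored on it and is excluded, which is the loss of information described above.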

Despite these limitations, we have introduced two new statistical methods for studying differential gene expression and compared them to the current modified t-test method. Additionally, we were able to study approximately 500 non-smoker samples, which to our knowledge has not been done previously. Above all, this study has identified multiple potential biomarkers for further study. Recommendations for further studies include the use of RT-PCR for validation and a comparison of these data with both miRNA and RNAseq data. For the data analysis, we recommend a meta-analysis of the studies combined in this analysis as an additional check on the validity of the genes identified in this study. We also recommend simulation studies to compare the performance of ordinal logistic regression versus the DISCO normal distribution. Furthermore, additional study of the most efficient gene ID (Entrez ID, GenBank Accession Number, etc.) by which to pool data would be beneficial.

Chapter 6

Conclusion

Lung cancer is responsible for an enormous number of deaths each year around the world, and the estimates for 2016 are higher than for any other cancer type. Both the aggressive nature of the disease and the typically late diagnosis contribute to the high mortality rate. As such, there is an acute and heightened need for lung cancer biomarkers.

This study aimed to determine possible candidates for lung cancer biomarkers, specifically for non-smokers. The results identified candidate genes that fall into multiple biological pathways: muscle maintenance, DNA damage repair mechanisms, cellular metabolism, proliferation and growth, and tumor progression.

One of the primary objectives of this study was to introduce new methodology. This study compared results from the t-test, DISCO normal distribution, and logistic regression models, and showed that the t-test and DISCO normal distribution results typically have the greater overlap. In terms of future use, the representation of gene expression using ordinal values can often be more meaningful than the raw representation.

By pooling data using the quantile combination method, this study was able to gain greater power and statistical accuracy by analyzing over 1000 samples. The data pooling was carried out for non-smoker and smoker tumor samples with adjacent normal tissue, as well as additional pure tumor samples. Genes were found to be common between the smoker and non-smoker data pools, which sheds light on the pathogenesis of lung cancer; however, there were also differentially expressed genes that were unique to non-smoker samples. Further validation of these findings is merited, through the use of RT-PCR, miRNA analysis, and/or RNAseq analysis. It is the vision of this study to utilize this new methodology to determine potential biomarkers for subgroups, including different populations, races, and genders, which may benefit the search for both early diagnosis and prognosis methods for lung cancer.

References

Khuder, S. A.; Bazeley, P., Quantile Scores for Combining Results from Different Microarray Platforms. 2009, 135-138.

Brennan, P.; Hainaut, P.; Boffetta, P., Genetics of lung-cancer susceptibility. The Lancet. Oncology 2011, 12 (4), 399-408.

Cancer Facts and Figures 2016; American Cancer Society: Atlanta, 2016.

Yokota, J.; Shiraishi, K.; Kohno, T., Genetic basis for susceptibility to lung cancer: Recent progress and future directions. Advances in cancer research 2010, 109, 51-72.

Barnes, M.; Freudenberg, J.; Thompson, S.; Aronow, B.; Pavlidis, P., Experimental comparison and cross-validation of the Affymetrix and Illumina gene expression analysis platforms. Nucleic Acids Res 2005, 33 (18), 5914-23.

Kothiyal, P.; Cox, S.; Ebert, J.; Aronow, B. J.; Greinwald, J. H.; Rehm, H. L., An overview of custom array sequencing. Current protocols in human genetics / editorial board, Jonathan L. Haines ... [et al.] 2009, Chapter 7, Unit 7 17.

HGU133_plus2 Technotes. http://www.affymetrix.com/support/technical/technotes/hgu133_p2_technote.pdf.

Gene expression microarrays: Platform Comparison: Illumina, Affymetrix, Agilent; University of Turku.

Agilent Microarray Technology. http://www.genomics.agilent.com/article.jsp?pageId=2011&_requestid=40749.

Agresti, A., Categorical Data Analysis. 3 ed.; Wiley: 1990.

Du, S.; Guan, Z.; Hao, L.; Song, Y.; Wang, L.; Gong, L.; Liu, L.; Qi, X.; Hou, Z.; Shao, S., Fructose-bisphosphate aldolase a is a potential metastasis-associated marker of lung squamous cell carcinoma and promotes lung cell tumorigenesis and migration. PloS one 2014, 9 (1), e85804.

Zhao, J.; Tang, J.; Men, W.; Ren, K., FBXW7-mediated degradation of CCDC6 is impaired by ATM during DNA damage response in lung cancer cells. FEBS letters 2012, 586 (24), 4257-63.

Morra, F.; Luise, C.; Visconti, R.; Staibano, S.; Merolla, F.; Ilardi, G.; Guggino, G.; Paladino, S.; Sarnataro, D.; Franco, R.; Monaco, R.; Zitomarino, F.; Pacelli, R.; Monaco, G.; Rocco, G.; Cerrato, A.; Linardopoulos, S.; Muller, M. T.; Celetti, A., New therapeutic perspectives in CCDC6 deficient lung cancer cells. International journal of cancer. Journal international du cancer 2015, 136 (9), 2146-57.

Koppenol, W. H.; Bounds, P. L.; Dang, C. V., Otto Warburg's contributions to current concepts of cancer metabolism. Nature reviews. Cancer 2011, 11 (5), 325-37.

Cuezva, J. M.; Chen, G.; Alonso, A. M.; Isidoro, A.; Misek, D. E.; Hanash, S. M.; Beer, D. G., The bioenergetic signature of lung adenocarcinomas is a molecular marker of cancer diagnosis and prognosis. Carcinogenesis 2004, 25 (7), 1157-63.

Puzone, R.; Savarino, G.; Salvi, S.; Dal Bello, M. G.; Barletta, G.; Genova, C.; Rijavec, E.; Sini, C.; Esposito, A. I.; Ratto, G. B.; Truini, M.; Grossi, F.; Pfeffer, U., Glyceraldehyde-3-phosphate dehydrogenase gene over expression correlates with poor prognosis in non small cell lung cancer patients. Molecular cancer 2013, 12 (1), 97.

Hung, P. F.; Hong, T. M.; Hsu, Y. C.; Chen, H. Y.; Chang, Y. L.; Wu, C. T.; Chang, G. C.; Jou, Y. S.; Pan, S. H.; Yang, P. C., The motor protein KIF14 inhibits tumor growth and cancer metastasis in lung adenocarcinoma. PloS one 2013, 8 (4), e61664.

Corson, T. W.; Zhu, C. Q.; Lau, S. K.; Shepherd, F. A.; Tsao, M. S.; Gallie, B. L., KIF14 messenger RNA expression is independently prognostic for outcome in lung cancer. Clinical cancer research : an official journal of the American Association for Cancer Research 2007, 13 (11), 3229-34.

Grinberg-Rashi, H.; Ofek, E.; Perelman, M.; Skarda, J.; Yaron, P.; Hajduch, M.; Jacob-Hirsch, J.; Amariglio, N.; Krupsky, M.; Simansky, D. A.; Ram, Z.; Pfeffer, R.; Galernter, I.; Steinberg, D. M.; Ben-Dov, I.; Rechavi, G.; Izraeli, S., The expression of three genes in primary non-small cell lung cancer is associated with metastatic spread to the brain. Clinical cancer research : an official journal of the American Association for Cancer Research 2009, 15 (5), 1755-61.

Pawar, S.; Donthamsetty, S.; Pannu, V.; Rida, P.; Ogden, A.; Bowen, N.; Osan, R.; Cantuaria, G.; Aneja, R., KIFC1, a novel putative prognostic biomarker for ovarian adenocarcinomas: delineating protein interaction networks and signaling circuitries. Journal of ovarian research 2014, 7, 53.

Camilleri-Broet, S.; Cremer, I.; Marmey, B.; Comperat, E.; Viguie, F.; Audouin, J.; Rio, M. C.; Fridman, W. H.; Sautes-Fridman, C.; Regnier, C. H., TRAF4 overexpression is a common characteristic of human carcinomas. Oncogene 2007, 26 (1), 142-7.

Li, W.; Peng, C.; Lee, M. H.; Lim, D.; Zhu, F.; Fu, Y.; Yang, G.; Sheng, Y.; Xiao, L.; Dong, X.; Ma, W.; Bode, A. M.; Cao, Y.; Dong, Z., TRAF4 is a critical molecule for Akt activation in lung cancer. Cancer research 2013, 73 (23), 6938-50.

David, C. J.; Chen, M.; Assanah, M.; Canoll, P.; Manley, J. L., HnRNP controlled by c-Myc deregulate pyruvate kinase mRNA splicing in cancer. Nature 2010, 463 (7279), 364-8.

Han, J.; Tang, F. M.; Pu, D.; Xu, D.; Wang, T.; Li, W., Mechanisms underlying regulation of cell cycle and apoptosis by hnRNP B1 in human lung adenocarcinoma A549 cells. Tumori 2014, 100 (1), 102-11.

Guo, R.; Li, Y.; Ning, J.; Sun, D.; Lin, L.; Liu, X., HnRNP A1/A2 and SF2/ASF regulate alternative splicing of interferon regulatory factor-3 and affect immunomodulatory functions in human non-small cell lung cancer cells. PloS one 2013, 8 (4), e62729.

Ran, Y.; Hu, H.; Hu, D.; Zhou, Z.; Sun, Y.; Yu, L.; Sun, L.; Pan, J.; Liu, J.; Liu, T.; Yang, Z., Derlin-1 is overexpressed on the tumor cell surface and enables antibody-mediated tumor targeting therapy. Clinical cancer research : an official journal of the American Association for Cancer Research 2008, 14 (20), 6538-45.

Dong, Q. Z.; Wang, Y.; Tang, Z. P.; Fu, L.; Li, Q. C.; Wang, E. D.; Wang, E. H., Derlin-1 is overexpressed in non-small cell lung cancer and promotes cancer cell invasion via EGFR-ERK-mediated up-regulation of MMP-2 and MMP-9. The American journal of pathology 2013, 182 (3), 954-64.

(a) Collavin, L.; Monte, M.; Verardo, R.; Pfleger, C.; Schneider, C., Cell-cycle regulation of the p53-inducible gene B99. FEBS letters 2000, 481 (1), 57-62; (b) Monte, M.; Benetti, R.; Buscemi, G.; Sandy, P.; Del Sal, G.; Schneider, C., The cell cycle-regulated protein human GTSE-1 controls DNA damage-induced apoptosis by affecting p53 function. The Journal of biological chemistry 2003, 278 (32), 30356-64.

Tian, T.; Zhang, E.; Fei, F.; Li, X.; Guo, X.; Liu, B.; Li, J.; Chen, Z.; Xing, J., Up-regulation of GTSE1 lacks a relationship with clinical data in lung cancer. Asian Pacific journal of cancer prevention : APJCP 2011, 12 (8), 2039-43.

Landi, M. T.; Dracheva, T.; Rotunno, M.; Figueroa, J. D.; Liu, H.; Dasgupta, A.; Mann, F. E.; Fukuoka, J.; Hames, M.; Bergen, A. W.; Murphy, S. E.; Yang, P.; Pesatori, A. C.; Consonni, D.; Bertazzi, P. A.; Wacholder, S.; Shih, J. H.; Caporaso, N. E.; Jen, J., Gene expression signature of cigarette smoking and its role in lung adenocarcinoma development and survival. PloS one 2008, 3 (2), e1651.

Tarca, A. L.; Lauria, M.; Unger, M.; Bilal, E.; Boue, S.; Kumar Dey, K.; Hoeng, J.; Koeppl, H.; Martin, F.; Meyer, P.; Nandy, P.; Norel, R.; Peitsch, M.; Rice, J. J.; Romero, R.; Stolovitzky, G.; Talikka, M.; Xiang, Y.; Zechner, C.; Collaborators, I. D., Strengths and limitations of microarray-based phenotype prediction: lessons learned from the IMPROVER Diagnostic Signature Challenge. Bioinformatics 2013, 29 (22), 2892-9.

Lu, T. P.; Tsai, M. H.; Lee, J. M.; Hsu, C. P.; Chen, P. C.; Lin, C. W.; Shih, J. Y.; Yang, P. C.; Hsiao, C. K.; Lai, L. C.; Chuang, E. Y., Identification of a novel biomarker, SEMA5A, for non-small cell lung carcinoma in nonsmoking women. Cancer epidemiology, biomarkers & prevention : a publication of the American Association for Cancer Research, cosponsored by the American Society of Preventive Oncology 2010, 19 (10), 2590-7.

Neumann, J.; Feuerhake, F.; Kayser, G.; Wiech, T.; Aumann, K.; Passlick, B.; Fisch, P.; Werner, M.; Zur Hausen, A., Gene expression profiles of lung adenocarcinoma linked to histopathological grading and survival but not to EGF-R status: a microarray study. BMC cancer 2010, 10, 77.

Okayama, H.; Kohno, T.; Ishii, Y.; Shimada, Y.; Shiraishi, K.; Iwakawa, R.; Furuta, K.; Tsuta, K.; Shibata, T.; Yamamoto, S.; Watanabe, S.; Sakamoto, H.; Kumamoto, K.; Takenoshita, S.; Gotoh, N.; Mizuno, H.; Sarai, A.; Kawano, S.; Yamaguchi, R.; Miyano, S.; Yokota, J., Identification of genes upregulated in ALK-positive and EGFR/KRAS/ALK-negative lung adenocarcinomas. Cancer research 2012, 72 (1), 100-11.

Ding, L.; Getz, G.; Wheeler, D. A.; Mardis, E. R.; McLellan, M. D.; Cibulskis, K.; Sougnez, C.; Greulich, H.; Muzny, D. M.; Morgan, M. B.; Fulton, L.; Fulton, R. S.; Zhang, Q.; Wendl, M. C.; Lawrence, M. S.; Larson, D. E.; Chen, K.; Dooling, D. J.; Sabo, A.; Hawes, A. C.; Shen, H.; Jhangiani, S. N.; Lewis, L. R.; Hall, O.; Zhu, Y.; Ew, T.; Ren, Y.; Yao, J.; Scherer, S. E.; Clerc, K.; Metcalf, G. A.; Ng, B.; Milosavljevic, A.; Gonzalez-Garay, M. L.; Osborne, J. R.; Meyer, R.; Shi, X.; Tang, Y.; Koboldt, D. C.; Lin, L.; Abbott, R.; Miner, T. L.; Pohl, C.; Fewell, G.; Haipek, C.; Schmidt, H.; Dunford-Shore, B. H.; Kraja, A.; Crosby, S. D.; Sawyer, C. S.; Vickery, T.; Sander, S.; Robinson, J.; Winckler, W.; Baldwin, J.; Chirieac, L. R.; Dutt, A.; Fennell, T.; Hanna, M.; Johnson, B. E.; Onofrio, R. C.; Thomas, R. K.; Tonon, G.; Weir, B. A.; Zhao, X.; Ziaugra, L.; Zody, M. C.; Giordano, T.; Orringer, M. B.; Roth, J. A.; Spitz, M. R.; Wistuba, II; Ozenberger, B.; Good, P. J.; Chang, A. C.; Beer, D. G.; Watson, M. A.; Ladanyi, M.; Broderick, S.; Yoshizawa, A.; Travis, W. D.; Pao, W.; Province, M. A.; Weinstock, G. M.; Varmus, H. E.; Gabriel, S. B.; Lander, E. S.; Gibbs, R. A.; Meyerson, M.; Wilson, R. K., Somatic mutations affect key pathways in lung adenocarcinoma. Nature 2008, 455 (7216), 1069-75.

Wilkerson, M. D.; Yin, X.; Walter, V.; Zhao, N.; Cabanski, C. R.; Hayward, M. C.; Miller, C. R.; Socinski, M. A.; Parsons, A. M.; Thorne, L. B.; Haithcock, B. E.; Veeramachaneni, N. K.; Funkhouser, W. K.; Randell, S. H.; Bernard, P. S.; Perou, C. M.; Hayes, D. N., Differential pathogenesis of lung adenocarcinoma subtypes involving sequence mutations, copy number, chromosomal instability, and methylation. PloS one 2012, 7 (5), e36530.

Wilkerson, M. D.; Yin, X.; Hoadley, K. A.; Liu, Y.; Hayward, M. C.; Cabanski, C. R.; Muldrew, K.; Miller, C. R.; Randell, S. H.; Socinski, M. A.; Parsons, A. M.; Funkhouser, W. K.; Lee, C. B.; Roberts, P. J.; Thorne, L.; Bernard, P. S.; Perou, C. M.; Hayes, D. N., Lung squamous cell carcinoma mRNA expression subtypes are reproducible, clinically important, and correspond to normal cell types. Clinical cancer research : an official journal of the American Association for Cancer Research 2010, 16 (19), 4864-75.

Yu, G.; Herazo-Maya, J. D.; Nukui, T.; Romkes, M.; Parwani, A.; Juan-Guardela, B. M.; Robertson, J.; Gauldie, J.; Siegfried, J. M.; Kaminski, N.; Kass, D. J., Matrix metalloproteinase-19 promotes metastatic behavior in vitro and is associated with increased mortality in non-small cell lung cancer. American journal of respiratory and critical care medicine 2014, 190 (7), 780-90.

Selamat, S. A.; Chung, B. S.; Girard, L.; Zhang, W.; Zhang, Y.; Campan, M.; Siegmund, K. D.; Koss, M. N.; Hagen, J. A.; Lam, W. L.; Lam, S.; Gazdar, A. F.; Laird-Offringa, I. A., Genome-scale analysis of DNA methylation in lung adenocarcinoma and integration with mRNA expression. Genome research 2012, 22 (7), 1197-211.

Staaf, J.; Jonsson, G.; Jonsson, M.; Karlsson, A.; Isaksson, S.; Salomonsson, A.; Pettersson, H. M.; Soller, M.; Ewers, S. B.; Johansson, L.; Jonsson, P.; Planck, M., Relation between smoking history and gene expression profiles in lung adenocarcinomas. BMC medical genomics 2012, 5, 22.

Robles, A. I.; Arai, E.; Mathe, E. A.; Okayama, H.; Schetter, A. J.; Brown, D.; Petersen, D.; Bowman, E. D.; Noro, R.; Welsh, J. A.; Edelman, D. C.; Stevenson, H. S.; Wang, Y.; Tsuchiya, N.; Kohno, T.; Skaug, V.; Mollerup, S.; Haugen, A.; Meltzer, P. S.; Yokota, J.; Kanai, Y.; Harris, C. C., An Integrated Prognostic Classifier for Stage I Lung Adenocarcinoma Based on mRNA, microRNA, and DNA Methylation Biomarkers. Journal of thoracic oncology : official publication of the International Association for the Study of Lung Cancer 2015, 10 (7), 1037-48.

Hamilton, A.; Caplan, J.; Dell, K.; Dorfman, A.; Fitzpatrick, L.; Fox, J.; Gregory, S.; Grossman, L.; Kiviat, B.; Kluger, J.; Lacayo, R.; Lemonick, M.; McLaughlin, L.; Newton-Small, J.; Park, A.; Thompson, M.; Walsh, B.; Keegan, R. W., The 50 Best Inventions Of the Year. Time 2008, 172 (19), 67-92.

Hudson, K.; Javitt, G.; Burke, W.; Byers, P.; American Society of Human Genetics Social Issues, C., ASHG Statement* on direct-to-consumer genetic testing in the United States. The American Journal of Human Genetics 2007, 110 (6), 1392-5.

Mrazek, M.; Koenig, B.; Skime, M.; Snyder, K.; Hook, C.; Black, J., 3rd; Mrazek, D., Assessing attitudes about genetic testing as a component of continuing medical education. Academic Psychiatry 2007, 31 (6), 447-51.

Appendix A

Supplementary Tables and Figures

A.1 Public Study Overview

Table A.1: Public microarray analyses utilized in this study

Microarray Study List

| GEO ID | Ref. | Platform | Smoker | Non-Smoker | Male | Female | Gene | Sample | Sample Types |
|---|---|---|---|---|---|---|---|---|---|
| GSE10072 | 30 | Affymetrix | 76 | 31 | 69 | 38 | 22,283 | 107 | Adenocarcinoma and normal tissue |
| GSE43580* | 31 | Affymetrix | 116 | 28 | 120 | 29 | 54,675 | 150 | Adenocarcinoma and Squamous Cell Carcinoma |
| GSE19804 | 32 | Affymetrix | 0 | 120 | 0 | 120 | 54,675 | 120 | Tumor and Normal tissue samples |
| GSE31546* | | Affymetrix | 12 | 4 | 3 | 14 | 54,675 | 17 | Adenocarcinoma |
| GSE31547* | | Affymetrix | 34 | 14 | 14 | 36 | 22,283 | 50 | Adenocarcinoma and normal |
| GSE31548* | | Affymetrix | 34 | 14 | 14 | 36 | 22,645 | 50 | Adenocarcinoma |
| GSE17475* | 33 | Affymetrix | 24 | 1 | 13 | 15 | 22,283 | 28 | Adenocarcinoma |
| GSE31210 | 34 | Affymetrix | 123 | 123 | 116 | 130 | 54,675 | 246 | Lung tumors and normal samples |
| GSE12667* | 35 | Affymetrix | 43 | 8 | 10 | 14 | 54,675 | 75 | Adenocarcinomas |
| GSE19188* | | Affymetrix | 28 | 0 | 26 | 2 | 54,675 | 28 | Adenocarcinomas |
| GSE26939* | 36 | Agilent | 99 | 12 | 49 | 51 | 33,421 | 116 | Adenocarcinoma |
| GSE17710 | 37 | Agilent | 56 | 0 | 32 | 24 | 33,421 | 56 | Squamous Cell Carcinoma |
| GSE47115 | 38 | Illumina | 42 | 4 | 25 | 21 | 29,372 | 46 | NSCLC tumors |
| GSE32863 | 39 | Illumina | 57 | 59 | 26 | 90 | 48,803 | 116 | Adenocarcinoma and normal tissue |
| GEO29016* | 40 | Illumina | 51 | 9 | 31 | 42 | 1,668 | 73 | Multi-stage carcinomas |
| GSE63459* | 41 | Illumina | 55 | 8 | 31 | 34 | 24,525 | 65 | Adenocarcinoma and normal tissue |
| GSE62949 | | Illumina | 0 | 56 | NA | NA | 24,525 | 56 | Adenocarcinomas |

A.2 Individual Study Venn Diagrams

[Venn diagrams, non-smoker samples (tumor vs adjacent normal). Panels: Author Data; RMA Normalized.]

[Venn diagrams, smoker samples (tumor vs adjacent normal). Panels: Author Data; RMA Normalized; MAS5 Normalized.]

Figure A-1: HGU133A platform merge results. Includes: GSE10072 and GSE31547. This chip tested 22,283 genes.

[Venn diagrams, non-smoker samples (tumor vs adjacent normal). Panels: Author Data; RMA Normalized; MAS5 Normalized.]

Figure A-2: HGU133 Plus 2 platform merge results. Includes: GSE19804, GSE19188, and GSE31210. This chip tested 54,675 genes.

[Venn diagrams, non-smoker samples (tumor vs adjacent normal). Panels: Author Data; RMA Normalized; MAS5 Normalized.]

[Venn diagrams, smoker samples (tumor vs adjacent normal). Panels: Author Data; RMA Normalized; MAS5 Normalized.]

Figure A-3: GSE31548 results. This chip tested 22,645 genes.

[Venn diagram, non-smoker samples (tumor vs adjacent normal). Panel: Author Data.]

[Venn diagram, smoker samples (tumor vs adjacent normal). Panel: Author Data.]

Figure A-4: GSE32863 results. This chip tested 48,803 genes.

[Venn diagram, non-smoker samples (tumor vs adjacent normal). Panel: Author Data.]

Figure A-5: GSE62949 results (all samples were non-smokers). This chip tested 24,525 genes.

[Venn diagram, non-smoker samples (tumor vs adjacent normal). Panel: Author Data.]

[Venn diagram, smoker samples (tumor vs adjacent normal). Panel: Author Data.]

Figure A-6: GSE63459 results. This chip tested 24,525 genes.

A.3 Lists of Common Genes

Table A.2: Common 407 DEGs from non-smoker data pool samples with tumor and adjacent normal tissues

| Gene | Accession | t-test logFC | t-test P.Value | t-test adj.P.Val | LR ordp | DISCO ChiSq | DISCO t.test |
|---|---|---|---|---|---|---|---|
| ALDH18A1 | 5832 | -5.94552 | 1.25E-17 | 1.19E-14 | 0 | 6.60E-05 | 0.000933 |
| ALDOA | 226 | -5.40044 | 8.21E-15 | 5.19E-12 | 0 | 0.001237 | 0.001283 |
| COL11A1 | 1301 | -5.01774 | 5.43E-13 | 2.06E-10 | 0 | 0.001282 | 0.00078 |
| CCDC6 | 8030 | -4.80598 | 4.86E-12 | 1.54E-09 | 0 | 0.113999 | 0.080018 |
| MMP11 | 4320 | -4.76002 | 7.72E-12 | 1.83E-09 | 0 | 0.001074 | 0.000362 |
| DDX11 | 1663 | -4.65693 | 2.15E-11 | 4.08E-09 | 0 | 0.001985 | 0.003142 |
| OLA1 | 29789 | -4.65693 | 2.15E-11 | 4.08E-09 | 0 | 7.84E-05 | 0.000153 |
| GAPDH | 2597 | -4.56673 | 5.18E-11 | 8.19E-09 | 0 | 2.68E-05 | 2.52E-05 |
| KIF14 | 9928 | -4.48942 | 1.09E-10 | 1.58E-08 | 0 | 0.000636 | 0.001793 |
| KIF23 | 9493 | -4.33479 | 4.60E-10 | 5.13E-08 | 0 | 0.000368 | 0.000805 |
| TEX11 | 56159 | -4.24459 | 1.04E-09 | 9.43E-08 | 0 | 0.059042 | 0.095679 |
| TRAF4 | 9618 | -4.239 | 1.10E-09 | 9.47E-08 | 0 | 0.011824 | 0.009895 |
| SNF8 | 11267 | -4.13591 | 2.74E-09 | 2.00E-07 | 0 | 0.003046 | 0.010897 |
| TMEM106C | 79022 | -4.12861 | 2.92E-09 | 2.05E-07 | 0 | 0.023291 | 0.025374 |
| TMEM53 | 79639 | -4.11573 | 3.27E-09 | 2.22E-07 | 0 | 0.001182 | 0.003981 |
| GALNT10 | 55568 | -4.08996 | 4.10E-09 | 2.68E-07 | 0 | 0.000267 | 0.000334 |
| ARAP1 | 116985 | -4.0513 | 5.72E-09 | 3.50E-07 | 0 | 0.000393 | 0.001215 |
| HHLA3 | 11147 | -4.02553 | 7.14E-09 | 4.11E-07 | 0 | 0.010054 | 0.009721 |
| SGK2 | 10110 | -3.97398 | 1.11E-08 | 5.83E-07 | 0 | 0.000604 | 0.00114 |
| TSGA14 | 95681 | -3.9611 | 1.23E-08 | 6.33E-07 | 0 | 0.166773 | 0.092344 |
| SDHC | 6391 | -3.95551 | 1.29E-08 | 6.46E-07 | 0 | 0.011359 | 0.011512 |
| AMACR | 23600 | -3.93533 | 1.53E-08 | 7.09E-07 | 0 | 0.018832 | 0.01335 |
| FGGY | 55277 | -3.93533 | 1.53E-08 | 7.09E-07 | 0 | 0.037605 | 0.029199 |
| GPRC5C | 55890 | -3.90955 | 1.90E-08 | 8.19E-07 | 0 | 0.024586 | 0.004538 |
| ALKBH1 | 8846 | -3.84512 | 3.23E-08 | 1.36E-06 | 0 | 0.007076 | 0.009307 |
| PSAT1 | 29968 | -3.79358 | 4.92E-08 | 1.99E-06 | 0 | 5.99E-05 | 7.32E-05 |
| WDR59 | 79726 | -3.7807 | 5.46E-08 | 2.16E-06 | 0 | 0.00043 | 0.000227 |
| DHTKD1 | 55526 | -3.75492 | 6.72E-08 | 2.55E-06 | 0 | 0.012805 | 0.007636 |
| PTOV1 | 53635 | -3.74204 | 7.45E-08 | 2.77E-06 | 0 | 0.040834 | 0.023182 |
| KLHDC8A | 55220 | -3.72915 | 8.25E-08 | 3.01E-06 | 0 | 0.007208 | 0.009354 |
| GFER | 2671 | -3.71627 | 9.14E-08 | 3.15E-06 | 0 | 0.021329 | 0.019705 |
| MR1 | 3140 | -3.71627 | 9.14E-08 | 3.15E-06 | 0 | 0.00207 | 0.003173 |
| GTSE1 | 51512 | -3.70338 | 1.01E-07 | 3.43E-06 | 0 | 0.001632 | 0.002112 |
| PVRL1 | 5818 | -3.68491 | 1.17E-07 | 3.80E-06 | 0 | 1.74E-05 | 6.21E-06 |
| H1FX | 8971 | -3.67761 | 1.24E-07 | 3.80E-06 | 0 | 0.000749 | 0.00341 |
| JTB | 10899 | -3.67761 | 1.24E-07 | 3.80E-06 | 0 | 0.000145 | 0.000139 |
| FAM192A | 80011 | -3.67761 | 1.24E-07 | 3.80E-06 | 0 | 0.002012 | 0.004951 |
| PHKA1 | 5255 | -3.56164 | 3.04E-07 | 8.25E-06 | 0 | 0.002728 | 0.005017 |
| UGGT1 | 56886 | -3.56164 | 3.04E-07 | 8.25E-06 | 0 | 0.000257 | 0.000198 |
| KIAA0319L | 79932 | -3.56164 | 3.04E-07 | 8.25E-06 | 0 | 0.029346 | 0.027136 |
| PSMD7 | 5713 | -3.52298 | 4.08E-07 | 9.80E-06 | 0 | 0.00595 | 0.002139 |
| PVRL2 | 5819 | -3.52298 | 4.08E-07 | 9.80E-06 | 0 | 0.001997 | 0.002214 |
| DCAF13 | 25879 | -3.52298 | 4.08E-07 | 9.80E-06 | 0 | 0.000266 | 0.000167 |
| MICALL2 | 79778 | -3.52298 | 4.08E-07 | 9.80E-06 | 0 | 1.90E-05 | 3.37E-05 |
| PUS7L | 83448 | -3.50451 | 4.69E-07 | 1.11E-05 | 0 | 0.001156 | 0.001 |
| KARS | 3735 | -3.48432 | 5.46E-07 | 1.26E-05 | 0 | 0.016683 | 0.021335 |
| ECE2 | 9718 | -3.47143 | 6.01E-07 | 1.37E-05 | 0 | 0.007418 | 0.004294 |
| KCNC2 | 3747 | -3.44566 | 7.27E-07 | 1.62E-05 | 0 | 0.003074 | 0.007646 |
| ELAVL3 | 1995 | -3.43278 | 8.00E-07 | 1.71E-05 | 0 | 0.01428 | 0.003976 |
| PDXK | 8566 | -3.43278 | 8.00E-07 | 1.71E-05 | 0 | 0.113261 | 0.092526 |
| CAPRIN1 | 4076 | -3.41989 | 8.79E-07 | 1.79E-05 | 0 | 0.004345 | 0.008607 |
| RNPS1 | 10921 | -3.41989 | 8.79E-07 | 1.79E-05 | 0 | 0.134622 | 0.048601 |
| YIPF5 | 81555 | -3.40142 | 1.01E-06 | 1.99E-05 | 0 | 0.005871 | 0.003434 |
| TH1L | 51497 | -3.35546 | 1.40E-06 | 2.61E-05 | 0 | 0.023315 | 0.009903 |
| PLSCR2 | 57047 | -3.35546 | 1.40E-06 | 2.61E-05 | 0 | 0.001956 | 0.004385 |
| CENPO | 79172 | -3.35546 | 1.40E-06 | 2.61E-05 | 0 | 0.001968 | 0.002657 |
| YY1 | 7528 | -3.34988 | 1.46E-06 | 2.70E-05 | 0 | 0.086069 | 0.136903 |
| C3orf14 | 57415 | -3.34258 | 1.54E-06 | 2.76E-05 | 0 | 0.004273 | 0.004786 |
| SLC2A8 | 29988 | -3.30392 | 2.03E-06 | 3.51E-05 | 0 | 0.003083 | 0.004826 |
| GTPBP10 | 85865 | -3.30392 | 2.03E-06 | 3.51E-05 | 0 | 0.002661 | 0.00168 |
| C1orf56 | 54964 | -3.29103 | 2.23E-06 | 3.77E-05 | 0 | 0.006049 | 0.010099 |
| TIPRL | 261726 | -3.28545 | 2.32E-06 | 3.89E-05 | 0 | 0.024345 | 0.047875 |
| PADI1 | 29943 | -3.25238 | 2.92E-06 | 4.59E-05 | 0 | 0.003196 | 0.008394 |
| OR51E2 | 81285 | -3.25238 | 2.92E-06 | 4.59E-05 | 0 | 0.111815 | 0.036551 |
| RIF1 | 55183 | -3.2339 | 3.33E-06 | 5.09E-05 | 0 | 0.001698 | 0.001413 |
| SLC38A7 | 55238 | -3.2266 | 3.50E-06 | 5.31E-05 | 0 | 0.025467 | 0.043948 |
| DERL1 | 79139 | -3.20083 | 4.18E-06 | 6.25E-05 | 0 | 3.87E-05 | 6.17E-06 |
| DCAF4 | 26094 | -3.14929 | 5.96E-06 | 8.57E-05 | 0 | 0.040182 | 0.161086 |
| ZDHHC24 | 254359 | -3.14929 | 5.96E-06 | 8.57E-05 | 0 | 0.001002 | 0.001179 |
| CES3 | 23491 | -3.1437 | 6.19E-06 | 8.83E-05 | 0 | 0.506527 | 0.158634 |
| RASAL1 | 8437 | -3.12352 | 7.09E-06 | 9.90E-05 | 0 | 0.00013 | 7.83E-05 |
| LONP2 | 83752 | -3.11063 | 7.74E-06 | 0.000105 | 0 | 0.006486 | 0.005409 |
| HNRNPA2B1 | 3181 | -3.09774 | 8.44E-06 | 0.000114 | 0 | 0.000941 | 0.000926 |
| SLC35C2 | 51006 | -3.09216 | 8.76E-06 | 0.000117 | 0 | 0.05436 | 0.013669 |
| ERO1L | 30001 | -3.08486 | 9.20E-06 | 0.000122 | 0 | 0.000101 | 0.000171 |
| PIK3R2 | 5296 | -3.07197 | 1.00E-05 | 0.000129 | 0 | 5.30E-05 | 7.57E-05 |
| UNC13A | 23025 | -3.07197 | 1.00E-05 | 0.000129 | 0 | 0.001268 | 0.001516 |
| OTUB2 | 78990 | -3.07197 | 1.00E-05 | 0.000129 | 0 | 0.04167 | 0.060378 |
| CALU | 813 | -3.05909 | 1.09E-05 | 0.000136 | 0 | 0.008297 | 0.007034 |
| SDF4 | 51150 | -3.0462 | 1.19E-05 | 0.000146 | 0 | 0.015333 | 0.017226 |
| MRPS15 | 64960 | -2.99466 | 1.67E-05 | 0.000196 | 0 | 0.001158 | 0.000916 |
| SUB1 | 10923 | -2.98348 | 1.79E-05 | 0.000205 | 0 | 0.005195 | 0.005209 |
| MUC13 | 56667 | -2.98177 | 1.81E-05 | 0.000206 | 0 | 0.073419 | 0.032462 |
| TMEM59 | 9528 | -2.94825 | 2.25E-05 | 0.000251 | 0 | 0.070669 | 0.040794 |
| TIAM2 | 26230 | -2.94311 | 2.32E-05 | 0.000255 | 0 | 0.007476 | 0.007198 |
| SLC25A15 | 10166 | -2.91734 | 2.74E-05 | 0.000297 | 0 | 0.037574 | 0.016011 |
| MFSD9 | 84804 | -2.91004 | 2.87E-05 | 0.000309 | 0 | 0.004376 | 0.003071 |
| SLC30A5 | 64924 | -2.90446 | 2.97E-05 | 0.000317 | 0 | 0.000855 | 0.000239 |
| LOXL2 | 4017 | -2.8658 | 3.78E-05 | 0.000384 | 0 | 0.191276 | 0.156526 |
| SYNE2 | 23224 | -2.8658 | 3.78E-05 | 0.000384 | 0 | 0.067376 | 0.119205 |
| SOCS7 | 30837 | -2.8658 | 3.78E-05 | 0.000384 | 0 | 0.001566 | 0.002982 |
| DPH5 | 51611 | -2.8658 | 3.78E-05 | 0.000384 | 0 | 0.035526 | 0.053609 |
| TBC1D8B | 54885 | -2.8658 | 3.78E-05 | 0.000384 | 0 | 0.023675 | 0.016506 |
| TDP1 | 55775 | -2.8658 | 3.78E-05 | 0.000384 | 0 | 0.01834 | 0.025752 |
| C14orf102 | 55051 | -2.85291 | 4.10E-05 | 0.000414 | 0 | 0.276059 | 0.141721 |
| C1orf163 | 65260 | -2.82714 | 4.81E-05 | 0.000473 | 0 | 0.089149 | 0.0851 |
| ZNF362 | 149076 | -2.82714 | 4.81E-05 | 0.000473 | 0 | 0.045241 | 0.05202 |
| TMEM57 | 55219 | -2.81426 | 5.21E-05 | 0.000499 | 0 | 0.000118 | 0.000111 |
| MRPL52 | 122704 | -2.81426 | 5.21E-05 | 0.000499 | 0 | 0.015368 | 0.013671 |
| IL18BP | 10068 | -2.80137 | 5.63E-05 | 0.000535 | 0 | 0.024957 | 0.01205 |
| CASC5 | 57082 | -2.7756 | 6.59E-05 | 0.00061 | 0 | 0.005544 | 0.009181 |
| FBXO9 | 26268 | -2.76271 | 7.12E-05 | 0.00065 | 0 | 0.038008 | 0.033666 |
| ANGPTL3 | 27329 | -2.76271 | 7.12E-05 | 0.00065 | 0 | 0.020088 | 0.017829 |
| GTPBP8 | 29083 | -2.76271 | 7.12E-05 | 0.00065 | 0 | 0.209909 | 0.12118 |
| ITGA2 | 3673 | -2.74983 | 7.70E-05 | 0.000689 | 2.22E-16 | 0.371413 | 0.061078 |
| CBX3 | 11335 | -2.74983 | 7.70E-05 | 0.000689 | 2.22E-16 | 0.000775 | 0.000249 |
| SLC44A1 | 23446 | -2.74983 | 7.70E-05 | 0.000689 | 0 | 0.029251 | 0.005357 |
| KCNK10 | 54207 | -2.74983 | 7.70E-05 | 0.000689 | 2.22E-16 | 0.256165 | 0.220871 |
| KIAA0922 | 23240 | -2.72405 | 8.98E-05 | 0.000786 | 2.22E-16 | 0.144819 | 0.132488 |
| CDCA3 | 83461 | -2.72405 | 8.98E-05 | 0.000786 | 0 | 0.008943 | 0.009739 |
| FGD2 | 221472 | -2.71117 | 9.70E-05 | 0.000837 | 2.22E-16 | 0.024657 | 0.008721 |
| SLC1A7 | 6512 | -2.7073 | 9.93E-05 | 0.000852 | 2.22E-16 | 0.054523 | 0.033483 |
| G3BP1 | 10146 | -2.69828 | 0.000105 | 0.000883 | 2.22E-16 | 0.029657 | 0.033684 |
| SCUBE3 | 222663 | -2.69828 | 0.000105 | 0.000883 | 2.22E-16 | 0.006458 | 0.004378 |
| SLC25A13 | 10165 | -2.6854 | 0.000113 | 0.00094 | 2.22E-16 | 0.027511 | 0.031322 |
| IDH1 | 3417 | -2.67251 | 0.000122 | 0.000988 | 2.22E-16 | 0.144997 | 0.190022 |
| KIFC1 | 3833 | -2.67251 | 0.000122 | 0.000988 | 2.22E-16 | 0.002601 | 0.003544 |
| EIF4E | 1977 | -2.66864 | 0.000125 | 0.001007 | 2.22E-16 | 0.000494 | 0.000332 |
| GALNT6 | 11226 | -2.64674 | 0.000142 | 0.001115 | 4.44E-16 | 0.49869 | 0.153271 |
| C17orf63 | 55731 | -2.64674 | 0.000142 | 0.001115 | 4.44E-16 | 0.072801 | 0.052082 |
| CABYR | 26256 | -2.62097 | 0.000164 | 0.001268 | 4.44E-16 | 0.00045 | 0.000277 |
| PAK1 | 5058 | -2.61538 | 0.00017 | 0.001304 | 4.44E-16 | 0.000702 | 0.000512 |
| DFFA | 1676 | -2.60808 | 0.000177 | 0.001344 | 4.44E-16 | 0.09059 | 0.082687 |
| BACE2 | 25825 | -2.60808 | 0.000177 | 0.001344 | 4.44E-16 | 0.03294 | 0.004895 |
| SYNCRIP | 10492 | -2.6025 | 0.000183 | 0.001382 | 4.44E-16 | 0.001046 | 0.000961 |
| ICA1 | 3382 | -2.5952 | 0.000191 | 0.001424 | 4.44E-16 | 0.034474 | 0.042569 |
| YAP1 | 10413 | -2.58961 | 0.000197 | 0.001464 | 4.44E-16 | 0.04797 | 0.021085 |
| GIGYF2 | 26058 | -2.58574 | 0.000201 | 0.001491 | 2.22E-16 | 0.03531 | 0.0546 |
| ABHD2 | 11057 | -2.56942 | 0.000221 | 0.001586 | 8.88E-16 | 0.125638 | 0.051511 |
| BCL2L14 | 79370 | -2.56942 | 0.000221 | 0.001586 | 1.33E-15 | 0.008865 | 0.00815 |
| RPL37 | 6167 | -2.56123 | 0.000231 | 0.001655 | 4.22E-13 | 0.027418 | 0.025036 |
| GDF15 | 9518 | -2.55654 | 0.000237 | 0.00168 | 8.88E-16 | 0.028689 | 0.02605 |
| COL3A1 | 1281 | -2.54365 | 0.000255 | 0.00178 | 8.88E-16 | 0.001109 | 0.000847 |
| MRPL39 | 54148 | -2.54365 | 0.000255 | 0.00178 | 4.44E-16 | 0.103723 | 0.043793 |
| MGAT4B | 11282 | -2.53807 | 0.000263 | 0.001823 | 8.88E-16 | 0.000448 | 0.000167 |
| CANX | 821 | -2.53248 | 0.000272 | 0.001854 | 1.55E-15 | 0.000355 | 0.000122 |
| LMF1 | 64788 | -2.51788 | 0.000295 | 0.001982 | 1.78E-15 | 0.016617 | 0.005505 |
| ATG10 | 83734 | -2.51788 | 0.000295 | 0.001982 | 1.55E-15 | 0.026243 | 0.020468 |
| NPRL3 | 8131 | -2.505 | 0.000316 | 0.002107 | 1.11E-15 | 0.008744 | 0.006351 |
| MKKS | 8195 | -2.505 | 0.000316 | 0.002107 | 1.78E-15 | 0.174576 | 0.083902 |
| CDKL2 | 8999 | -2.49211 | 0.00034 | 0.002238 | 8.88E-16 | 0.066706 | 0.071532 |
| ATAD3A | 55210 | -2.49211 | 0.00034 | 0.002238 | 8.88E-16 | 0.034732 | 0.037301 |
| MAPT | 4137 | -2.47922 | 0.000365 | 0.002361 | 1.78E-15 | 0.440651 | 0.146239 |
| PDE4DIP | 9659 | -2.47922 | 0.000365 | 0.002361 | 1.11E-15 | 0.053567 | 0.033311 |
| MST4 | 51765 | -2.47922 | 0.000365 | 0.002361 | 1.55E-15 | 0.123709 | 0.127043 |
| KDM4B | 23030 | -2.46976 | 0.000384 | 0.002478 | 2.44E-15 | 0.048832 | 0.036907 |
| ADAMTS2 | 9509 | -2.46634 | 0.000391 | 0.002483 | 3.11E-15 | 0.004902 | 0.006676 |
| APOBEC3F | 200316 | -2.46634 | 0.000391 | 0.002483 | 2.22E-15 | 0.007472 | 0.012102 |
| OVOL1 | 5017 | -2.44787 | 0.000432 | 0.002709 | 2.22E-15 | 0.016694 | 0.010238 |
| CREB3L2 | 64764 | -2.44787 | 0.000432 | 0.002709 | 1.33E-15 | 0.103782 | 0.05465 |
| FOLR2 | 2350 | -2.44057 | 0.00045 | 0.002755 | 2.22E-15 | 0.002027 | 0.001604 |
| KCNA1 | 3736 | -2.44057 | 0.00045 | 0.002755 | 1.78E-15 | 0.100383 | 0.130017 |
| SLC13A2 | 9058 | -2.44057 | 0.00045 | 0.002755 | 2.89E-15 | 0.200784 | 0.187826 |
| PSME4 | 23198 | -2.44057 | 0.00045 | 0.002755 | 8.88E-16 | 0.037204 | 0.025821 |
| PIWIL2 | 55124 | -2.44057 | 0.00045 | 0.002755 | 2.00E-15 | 0.003301 | 0.004408 |
| WDR91 | 29062 | -2.43498 | 0.000464 | 0.002821 | 1.78E-15 | 0.071794 | 0.062378 |
| NPM1 | 4869 | -2.43237 | 0.00047 | 0.002838 | 2.00E-15 | 0.009032 | 0.005955 |
| GALT | 2592 | -2.42768 | 0.000482 | 0.002861 | 1.55E-15 | 0.089828 | 0.086512 |
| QPRT | 23475 | -2.42768 | 0.000482 | 0.002861 | 3.77E-15 | 0.030754 | 0.013553 |
| CNGB3 | 54714 | -2.42209 | 0.000497 | 0.00293 | 3.11E-15 | 0.001135 | 0.001712 |
| KRTAP1-3 | 81850 | -2.42209 | 0.000497 | 0.00293 | 2.22E-15 | 0.140302 | 0.128823 |
| THYN1 | 29087 | -2.41479 | 0.000517 | 0.002991 | 2.44E-15 | 0.00825 | 0.008126 |
| DNAJC15 | 29103 | -2.41479 | 0.000517 | 0.002991 | 3.11E-15 | 0.020611 | 0.026566 |
| ANKRD5 | 63926 | -2.41479 | 0.000517 | 0.002991 | 2.66E-15 | 0.004991 | 0.003115 |
| ANGEL1 | 23357 | -2.40191 | 0.000554 | 0.003175 | 4.00E-15 | 0.035249 | 0.022084 |
| BRAF | 673 | -2.38902 | 0.000593 | 0.00332 | 4.88E-15 | 0.032138 | 0.022726 |
| UROS | 7390 | -2.38902 | 0.000593 | 0.00332 | 3.77E-15 | 0.018868 | 0.022588 |
| BET1L | 51272 | -2.38902 | 0.000593 | 0.00332 | 4.22E-15 | 0.001383 | 0.000787 |
| ANO2 | 57101 | -2.38902 | 0.000593 | 0.00332 | 2.44E-15 | 0.233557 | 0.164426 |
| GHSR | 2693 | -2.37614 | 0.000635 | 0.003492 | 5.55E-15 | 0.00156 | 0.001294 |
| MGEA5 | 10724 | -2.37614 | 0.000635 | 0.003492 | 2.22E-15 | 0.280154 | 0.180072 |
| CLN6 | 54982 | -2.37614 | 0.000635 | 0.003492 | 6.66E-15 | 0.130334 | 0.054928 |
| WNT4 | 54361 | -2.36325 | 0.000679 | 0.003684 | 3.33E-15 | 0.094192 | 0.044891 |
| SIRPG | 55423 | -2.36325 | 0.000679 | 0.003684 | 4.00E-15 | 0.00588 | 0.003623 |
| TMEM209 | 84928 | -2.35036 | 0.000727 | 0.003864 | 6.88E-15 | 0.054596 | 0.031761 |
| PLG | 5340 | -2.33748 | 0.000777 | 0.004087 | 7.77E-15 | 0.125131 | 0.096945 |
| DENND1A | 57706 | -2.33748 | 0.000777 | 0.004087 | 7.55E-15 | 0.032402 | 0.037391 |
| UBE2G2 | 7327 | -2.32459 | 0.000831 | 0.00431 | 5.55E-15 | 0.019093 | 0.019171 |
| ADAM22 | 53616 | -2.32459 | 0.000831 | 0.00431 | 8.44E-15 | 0.058235 | 0.063455 |
| MAGIX | 79917 | -2.32459 | 0.000831 | 0.00431 | 4.66E-15 | 0.249052 | 0.188316 |
| ADAM11 | 4185 | -2.32072 | 0.000848 | 0.004386 | 7.55E-15 | 0.053778 | 0.046722 |
| C6orf64 | 55776 | -2.31901 | 0.000856 | 0.004389 | 6.66E-15 | 0.330233 | 0.131819 |
| ATP2B2 | 491 | -2.31171 | 0.000888 | 0.004485 | 3.77E-15 | 0.00094 | 0.000695 |
| HSPA8 | 3312 | -2.30053 | 0.000941 | 0.0047 | 9.55E-15 | 0.080995 | 0.068367 |
| DNASE1 | 1773 | -2.28594 | 0.001014 | 0.004973 | 7.55E-15 | 0.017691 | 0.016613 |
| RAI1 | 10743 | -2.28594 | 0.001014 | 0.004973 | 8.44E-15 | 0.004934 | 0.005035 |
| LARP4 | 113251 | -2.28206 | 0.001034 | 0.005059 | 9.33E-15 | 0.480742 | 0.132179 |
| IL1F7 | 27178 | -2.27305 | 0.001083 | 0.005256 | 2.35E-14 | 0.003064 | 0.00235 |
| UBFD1 | 56061 | -2.27305 | 0.001083 | 0.005256 | 1.13E-14 | 0.093587 | 0.063114 |
| LRP3 | 4037 | -2.26016 | 0.001156 | 0.005526 | 1.44E-14 | 0.047339 | 0.013322 |
| TMEM132A | 54972 | -2.26016 | 0.001156 | 0.005526 | 1.40E-14 | 0.008964 | 0.017292 |
| MUS81 | 80198 | -2.26016 | 0.001156 | 0.005526 | 7.33E-15 | 0.048862 | 0.020439 |
| FOXRED2 | 80020 | -2.23439 | 0.001316 | 0.006182 | 1.62E-14 | 0.0103 | 0.011683 |
| AIPL1 | 23746 | -2.22151 | 0.001403 | 0.006528 | 1.82E-14 | 0.285406 | 0.08596 |
| C1GALT1C1 | 29071 | -2.22151 | 0.001403 | 0.006528 | 1.35E-14 | 0.004204 | 0.003068 |
| SEMA4F | 10505 | -2.21592 | 0.001443 | 0.006696 | 8.28E-12 | 0.193741 | 0.168512 |
| MED17 | 9440 | -2.20862 | 0.001496 | 0.006826 | 1.15E-14 | 0.118285 | 0.109744 |
| NFE2L3 | 9603 | -2.20862 | 0.001496 | 0.006826 | 2.89E-14 | 0.000336 | 0.000414 |
| CACNA1D | 776 | -2.19573 | 0.001594 | 0.007205 | 1.75E-14 | 0.226254 | 0.160591 |
| ZC3H3 | 23144 | -2.19573 | 0.001594 | 0.007205 | 1.93E-14 | 0.057621 | 0.06625 |
| SYTL2 | 54843 | -2.19573 | 0.001594 | 0.007205 | 3.18E-14 | 0.020896 | 0.025201 |
| METT11D1 | 64745 | -2.19573 | 0.001594 | 0.007205 | 1.78E-14 | 0.103581 | 0.060818 |
| ODZ3 | 55714 | -2.18456 | 0.001685 | 0.007558 | 2.58E-14 | 0.023176 | 0.014983 |
| SIX3 | 6496 | -2.18285 | 0.001699 | 0.007569 | 2.80E-14 | 0.081326 | 0.050007 |
| SEZ6L2 | 26470 | -2.18285 | 0.001699 | 0.007569 | 2.80E-14 | 0.020029 | 0.021192 |
| NRAP | 4892 | -2.17726 | 0.001746 | 0.007706 | 2.51E-14 | 0.280398 | 0.15563 |
| CAND1 | 55832 | -2.17726 | 0.001746 | 0.007706 | 4.57E-14 | 0.015261 | 0.009842 |

Table A.3: 493 Common DEGs from smoker samples with tumor and adjacent normal tissues

Description t-test LR DISCO Gene Accession logFC P.Value adj.P.Val ordp ChiSq t.test TRAF4 9618 -5.58116 3.16E-15 1.26E-11 0 5.55E-16 8.41E-17 TMED9 54732 -5.2315 1.46E-13 1.94E-10 0 2.05E-14 1.08E-14 PPP1R14B 26472 -5.07146 7.81E-13 4.45E-10 0 2.06E-13 1.14E-13 DEPDC6 64798 -5.02676 1.24E-12 6.16E-10 0 6.94E-13 6.68E-14 SYNJ2 8871 -4.62599 6.36E-11 1.81E-08 0 7.73E-11 1.01E-11 NUP62 23636 -4.47464 2.59E-10 5.74E-08 0 2.63E-10 5.07E-11 TMED10 10972 -4.35033 7.96E-10 1.51E-07 0 3.92E-10 2.07E-10 ACTR2 10097 -4.31928 1.05E-09 1.74E-07 0 7.15E-10 4.11E-10 TBC1D16 125058 -4.21236 2.67E-09 3.54E-07 0 1.24E-09 9.09E-10 MCTS1 28985 -4.19029 3.23E-09 4.02E-07 0 7.97E-09 8.05E-10 ALDH18A1 5832 -4.17583 3.65E-09 4.41E-07 0 8.40E-10 1.32E-09 PFKP 5214 -4.16473 4.02E-09 4.57E-07 0 1.95E-09 1.81E-09 SP1 6667 -4.06343 9.44E-09 9.30E-07 0 1.96E-08 4.52E-09 GART 2618 -4.04443 1.11E-08 1.05E-06 0 1.52E-08 4.13E-09 HMGB3 3149 -4.0091 1.48E-08 1.34E-06 0 3.50E-09 7.29E-09 CBLC 23624 -3.98367 1.83E-08 1.62E-06 0 1.20E-08 9.86E-09 MED17 9440 -3.96039 2.21E-08 1.87E-06 0 6.25E-08 1.14E-08 C1orf27 54953 -3.9533 2.34E-08 1.94E-06 0 4.50E-08 1.22E-08 TAF10 6881 -3.95022 2.40E-08 1.95E-06 0 2.68E-08 7.64E-09 EIF4E 1977 -3.88037 4.21E-08 3.29E-06 0 8.13E-08 1.59E-08 MKLN1 4289 -3.83501 6.04E-08 4.23E-06 0 6.17E-08 2.84E-08 TIMM13 26517 -3.80262 7.79E-08 5.25E-06 0 6.29E-08 5.03E-08 CBX4 8535 -3.7974 8.11E-08 5.28E-06 0 5.80E-08 3.94E-08 C11orf80 79703 -3.78697 8.80E-08 5.57E-06 0 1.74E-08 3.81E-08 MRPL4 51073 -3.7776 9.47E-08 5.80E-06 0 7.85E-08 3.57E-08 ANGPTL3 27329 -3.7602 1.08E-07 6.54E-06 0 1.25E-07 5.80E-08 ATP6V1C1 528 -3.70721 1.63E-07 9.41E-06 1.33E-15 8.61E-08 1.10E-07 STYXL1 51657 -3.70507 1.66E-07 9.43E-06 0 6.64E-07 8.45E-08 USP32 84669 -3.6862 1.91E-07 1.04E-05 1.55E-15 1.68E-07 1.20E-07 GSPT1 2935 -3.68165 1.98E-07 1.07E-05 0 8.92E-07 1.39E-07 GAPDH 2597 -3.6668 2.22E-07 1.16E-05 0 2.78E-07 1.23E-07 MFF 56947 -3.64954 2.53E-07 1.31E-05 1.33E-15 2.53E-07 1.69E-07 PHKA1 5255 
-3.63683 2.78E-07 1.38E-05 0 8.36E-07 1.37E-07 SPATS2L 26010 -3.64071 2.70E-07 1.38E-05 0 7.90E-08 1.22E-07 AGR2 10551 -3.60217 3.60E-07 1.75E-05 0 2.67E-07 2.38E-07 C1GALT1 56913 -3.56684 4.68E-07 2.22E-05 0 8.07E-07 2.67E-07 SIX1 6495 -3.55654 5.05E-07 2.37E-05 0 3.52E-07 2.84E-07

61

ZNF146 7705 -3.54382 5.55E-07 2.57E-05 0 2.21E-07 3.23E-07
ERO1L 30001 -3.53847 5.77E-07 2.64E-05 0 5.34E-07 3.98E-07
ADSS 159 -3.52803 6.23E-07 2.79E-05 0 3.07E-07 5.27E-07
CANX 821 -3.44105 1.17E-06 4.70E-05 0 1.25E-06 8.07E-07
HNRNPU 3192 -3.43931 1.18E-06 4.71E-05 0 1.28E-06 6.53E-07
GFER 2671 -3.42955 1.27E-06 4.85E-05 0 2.90E-06 7.89E-07
BRP44 25874 -3.42633 1.30E-06 4.87E-05 0 1.88E-06 1.06E-06
RBM15B 29890 -3.40452 1.51E-06 5.48E-05 0 3.58E-06 1.01E-06
MTCH2 23788 -3.38753 1.70E-06 5.96E-05 0 5.01E-06 9.11E-07
SFXN1 94081 -3.37937 1.81E-06 6.20E-05 0 2.31E-07 1.21E-06
RPS6KB1 6198 -3.37201 1.90E-06 6.44E-05 2.22E-16 2.32E-06 1.11E-06
TMC5 79838 -3.3621 2.04E-06 6.58E-05 0 3.45E-06 1.52E-06
SEL1L3 23231 -3.34364 2.32E-06 7.35E-05 2.93E-14 5.45E-06 1.55E-06
SMC6 79677 -3.3431 2.33E-06 7.35E-05 0 4.55E-06 1.79E-06
N6AMT1 29104 -3.32892 2.57E-06 7.93E-05 0 4.78E-06 2.33E-06
PAPOLA 10914 -3.32758 2.59E-06 7.94E-05 0 2.20E-06 2.04E-06
PSMB7 5695 -3.32611 2.62E-06 7.96E-05 1.55E-14 9.01E-06 1.49E-06
ENTPD6 955 -3.3154 2.82E-06 8.44E-05 0 4.11E-06 1.93E-06
XPO1 7514 -3.30844 2.96E-06 8.72E-05 0 2.53E-06 1.44E-06
BRCC3 79184 -3.28877 3.38E-06 9.84E-05 0 4.79E-06 2.19E-06
ATP2A2 488 -3.24689 4.50E-06 0.000126 7.02E-14 3.12E-06 3.43E-06
DNAJC3 5611 -3.23752 4.79E-06 0.000134 0 9.20E-06 3.96E-06
WHSC1 7468 -3.2311 5.00E-06 0.000138 0 5.82E-06 4.94E-06
COPB2 9276 -3.22347 5.27E-06 0.000144 0 5.59E-06 3.94E-06
DLG5 9231 -3.21972 5.40E-06 0.000145 2.22E-16 2.40E-05 3.21E-06
GFPT1 2673 -3.2133 5.64E-06 0.000147 2.22E-16 4.88E-06 4.11E-06
KIF14 9928 -3.21344 5.64E-06 0.000147 4.44E-16 1.76E-05 3.96E-06
ALDOA 226 -3.21009 5.76E-06 0.000149 2.22E-16 7.09E-06 3.85E-06
MRPL12 6182 -3.20701 5.88E-06 0.000151 0 6.28E-06 3.56E-06
CAPRIN1 4076 -3.20032 6.15E-06 0.000157 4.44E-16 1.33E-05 4.81E-06
PPDPF 79144 -3.19136 6.53E-06 0.000165 6.68E-14 8.51E-06 5.28E-06
ILF2 3608 -3.18881 6.64E-06 0.000166 0 1.33E-05 4.11E-06
SNF8 11267 -3.16486 7.78E-06 0.000182 1.11E-15 1.51E-05 5.55E-06
ARSD 414 -3.16058 8.01E-06 0.000187 2.22E-16 6.65E-06 5.19E-06
PMEPA1 56937 -3.15817 8.14E-06 0.000188 0 6.62E-06 5.54E-06
GSTCD 79807 -3.15362 8.38E-06 0.000192 8.88E-16 1.01E-05 7.16E-06
SYNCRIP 10492 -3.1393 9.21E-06 0.000207 0 1.05E-05 8.08E-06
NFKBIB 4793 -3.12552 1.01E-05 0.000226 1.11E-15 1.13E-05 7.68E-06
BCLAF1 9774 -3.12405 1.02E-05 0.000227 0 8.01E-06 7.48E-06
H2AFV 94239 -3.11267 1.10E-05 0.000243 8.88E-16 1.26E-05 8.30E-06
SDF4 51150 -3.10866 1.13E-05 0.000248 4.44E-16 8.45E-06 8.91E-06

COX6A1 1337 -3.10598 1.14E-05 0.000249 0 2.18E-05 9.89E-06
RFC3 5983 -3.09354 1.24E-05 0.000267 0 1.30E-05 1.11E-05
MMP11 4320 -3.0645 1.50E-05 0.000314 0 3.66E-05 1.04E-05
TPM2 7169 -3.06236 1.52E-05 0.000316 0 3.54E-05 1.08E-05
SLC6A20 54716 -3.05593 1.58E-05 0.00032 0 4.74E-05 1.35E-05
MRPS15 64960 -3.05647 1.58E-05 0.00032 0 2.45E-05 1.26E-05
FA2H 79152 -3.04269 1.72E-05 0.000346 2.44E-15 1.38E-05 1.55E-05
RPS6KA3 6197 -3.03894 1.76E-05 0.000352 0 0.00013 1.44E-05
CCNB2 9133 -3.03774 1.78E-05 0.000352 0 2.10E-05 1.10E-05
ENO1 2023 -3.03559 1.80E-05 0.000355 2.22E-15 1.51E-05 1.48E-05
AGFG1 3267 -3.03479 1.81E-05 0.000355 0 4.57E-05 1.60E-05
GUF1 60558 -3.02449 1.93E-05 0.000372 0 4.60E-05 1.17E-05
LPGAT1 9926 -3.01914 2.00E-05 0.000382 6.37E-13 7.72E-05 1.66E-05
KIAA1609 57707 -3.00401 2.20E-05 0.000415 1.11E-15 7.11E-05 1.55E-05
SDHC 6391 -3.00027 2.25E-05 0.000419 0 2.61E-05 2.08E-05
PUM2 23369 -3.00134 2.24E-05 0.000419 0 2.49E-05 1.84E-05
DHX40 79665 -3.00094 2.24E-05 0.000419 0 7.01E-05 1.79E-05
ERGIC2 51290 -2.99826 2.28E-05 0.00042 0 4.09E-05 2.02E-05
C18orf10 25941 -2.99679 2.30E-05 0.00042 0 3.52E-05 2.11E-05
TMEM53 79639 -2.99077 2.39E-05 0.00043 3.11E-15 6.45E-05 1.65E-05
MTHFD2 10797 -2.98555 2.47E-05 0.000436 0 2.59E-05 2.02E-05
UGGT2 55757 -2.97846 2.58E-05 0.000447 0 3.17E-05 2.22E-05
UBIAD1 29914 -2.9727 2.67E-05 0.000457 2.00E-15 4.40E-05 2.11E-05
PTCD1 26024 -2.96735 2.77E-05 0.000467 3.55E-15 3.38E-05 2.12E-05
DNAJC12 56521 -2.96548 2.80E-05 0.00047 4.88E-15 2.48E-05 2.58E-05
HBS1L 10767 -2.95223 3.04E-05 0.000506 0 7.64E-05 2.44E-05
EPN3 55040 -2.95116 3.06E-05 0.000508 2.44E-15 9.82E-06 2.66E-05
CNTN1 1272 -2.94982 3.08E-05 0.00051 2.22E-16 0.000105 2.65E-05
GTF3C3 9330 -2.94514 3.17E-05 0.000523 5.33E-15 3.59E-05 2.59E-05
TMEM177 80775 -2.9442 3.19E-05 0.000523 0 4.05E-05 2.72E-05
PPID 5481 -2.93376 3.40E-05 0.000554 0 7.80E-05 3.28E-05
KCNJ3 3760 -2.92921 3.50E-05 0.000565 0 2.86E-05 2.96E-05
BET1L 51272 -2.92453 3.60E-05 0.000579 2.22E-15 6.37E-05 2.92E-05
FGD6 55785 -2.92132 3.68E-05 0.000588 8.66E-15 3.63E-05 3.14E-05
TPD52 7163 -2.87716 4.81E-05 0.000743 2.22E-16 5.57E-05 4.41E-05
STAT3 6774 -2.87261 4.95E-05 0.000758 0 3.01E-05 3.72E-05
RAP2B 5912 -2.87194 4.97E-05 0.000758 0 6.82E-05 3.40E-05
SLC26A6 65010 -2.86953 5.04E-05 0.000766 2.22E-16 8.66E-05 4.08E-05
GRAMD3 65983 -2.86164 5.29E-05 0.000796 1.20E-14 0.000299 3.83E-05
GALNT7 51809 -2.85428 5.52E-05 0.000821 4.66E-15 0.000155 4.47E-05
CALU 813 -2.85267 5.58E-05 0.000826 2.22E-15 5.03E-05 3.60E-05

DENND1B 163486 -2.83808 6.09E-05 0.000885 8.66E-15 0.000153 4.91E-05
TRNAU1AP 54952 -2.83701 6.13E-05 0.000888 2.22E-16 0.000375 5.42E-05
SIPA1L3 23094 -2.83514 6.20E-05 0.000894 3.70E-12 0.000108 5.59E-05
CXCR7 57007 -2.82751 6.48E-05 0.000933 0 9.20E-05 5.00E-05
HSPD1 3329 -2.81948 6.80E-05 0.000971 3.17E-12 2.86E-05 5.83E-05
ZC3H3 23144 -2.80463 7.43E-05 0.00105 2.98E-12 6.63E-05 6.21E-05
RBM47 54502 -2.79727 7.76E-05 0.001092 2.60E-14 6.21E-05 6.87E-05
HOOK1 51361 -2.79299 7.96E-05 0.001112 1.53E-14 3.09E-05 7.92E-05
TLK1 9874 -2.79018 8.09E-05 0.001119 2.53E-14 0.000449 5.72E-05
SLC16A7 9194 -2.77492 8.85E-05 0.001216 1.55E-14 0.000188 7.65E-05
NDUFB1 4707 -2.76275 9.50E-05 0.001283 3.63E-12 8.68E-05 8.17E-05
TMCO3 55002 -2.75552 9.91E-05 0.001319 5.33E-12 0.000319 8.80E-05
DCLK1 9201 -2.75418 9.99E-05 0.001322 4.44E-16 0.000265 9.86E-05
MED28 80306 -2.742 0.000107 0.001412 2.22E-16 0.000163 8.53E-05
EPHA4 2043 -2.73946 0.000109 0.001417 8.44E-15 7.22E-05 0.000101
NUTF2 10204 -2.7396 0.000109 0.001417 6.66E-16 0.000187 0.000101
YTHDF2 51441 -2.74026 0.000108 0.001417 2.33E-14 4.38E-05 0.000105
AP4S1 11154 -2.73398 0.000112 0.001458 4.44E-16 0.000119 0.000122
NDUFB4 4710 -2.73036 0.000115 0.001484 4.69E-14 0.000536 0.000102
RPS6 6194 -2.72836 0.000116 0.001491 2.22E-16 0.000244 0.000115
NIT1 4817 -2.72635 0.000117 0.001499 8.71E-12 0.000239 0.000115
PCDH7 5099 -2.72501 0.000118 0.001506 2.22E-16 0.000148 0.000111
MAGOHB 55110 -2.72447 0.000119 0.001506 4.44E-16 0.000106 9.29E-05
S1PR5 53637 -2.7218 0.000121 0.001519 4.44E-16 0.000274 0.000107
SAP30L 79685 -2.70815 0.00013 0.001623 6.79E-14 0.00049 0.000109
SLC25A23 79085 -2.70561 0.000132 0.001631 1.69E-14 0.000198 0.000122
MCM3AP-AS 114044 -2.70266 0.000134 0.001649 0 0.000122 0.000113
CCDC56 28958 -2.69825 0.000138 0.001681 2.22E-16 0.000442 0.000119
CTNNB1 1499 -2.69597 0.00014 0.001697 2.22E-16 0.000203 0.000147
DDA1 79016 -2.69035 0.000144 0.001732 6.75E-14 0.0002 0.000118
HELLS 3070 -2.67363 0.000159 0.001887 5.54E-12 0.00018 0.00015
ZIC1 7545 -2.67001 0.000162 0.001915 2.26E-11 0.000897 0.000127
IMPDH2 3615 -2.66238 0.000169 0.001982 2.22E-16 0.000265 0.000136
ENAH 55740 -2.65529 0.000176 0.002051 3.24E-14 0.000496 0.000146
SNRPA1 6627 -2.65449 0.000177 0.002054 6.66E-16 0.00025 0.000152
ATP5G3 518 -2.64124 0.00019 0.002175 6.66E-16 0.000285 0.000172
RNF216 54476 -2.64218 0.000189 0.002175 8.88E-16 0.000435 0.000178
TSHZ2 128553 -2.64164 0.00019 0.002175 1.26E-13 0.000265 0.000173
PDK1 5163 -2.63964 0.000192 0.002188 5.17E-14 0.000278 0.000176
NAPG 8774 -2.63602 0.000196 0.002217 4.44E-16 0.000238 0.000167
ZFP64 55734 -2.63669 0.000195 0.002217 4.51E-14 0.000367 0.000142

OSBPL7 114881 -2.62211 0.000212 0.002359 0 0.000388 0.000173
SIX5 147912 -2.61823 0.000217 0.002404 5.37E-14 0.000246 0.000182
NDUFA10 4705 -2.61689 0.000218 0.002415 8.88E-16 0.000159 0.000186
SLC12A2 6558 -2.61368 0.000222 0.002445 1.34E-11 0.000317 0.000212
KHSRP 8570 -2.60859 0.000229 0.002501 1.84E-11 0.000307 0.000204
MICALL2 79778 -2.60886 0.000228 0.002501 1.08E-13 0.00034 0.000196
CTPS2 56474 -2.60792 0.000229 0.002504 2.22E-16 5.17E-05 0.000186
ZNF12 7559 -2.60083 0.000239 0.002583 0 0.000607 0.000178
PSAT1 29968 -2.59695 0.000244 0.002625 0 0.000195 0.000211
SEZ6L2 26470 -2.59534 0.000246 0.002641 2.22E-16 0.000313 0.000206
C4orf19 55286 -2.58972 0.000254 0.002717 8.88E-16 0.00102 0.000269
CXorf56 63932 -2.57567 0.000274 0.002904 1.33E-15 0.000535 0.000274
EIF3D 8664 -2.56577 0.000289 0.003057 4.44E-16 0.000447 0.000292
ITGBL1 9358 -2.56323 0.000293 0.003092 2.44E-15 0.000596 0.000253
MRPL35 51318 -2.55948 0.000299 0.003139 4.44E-16 0.000733 0.000299
DDX11 1663 -2.55774 0.000302 0.003161 2.22E-16 0.000388 0.000254
BMPR1B 658 -2.54489 0.000324 0.003303 8.88E-16 0.000502 0.000348
PPP2R5E 5529 -2.54583 0.000323 0.003303 2.22E-15 0.000766 0.000338
HN1 51155 -2.54208 0.000329 0.003337 2.45E-13 0.000571 0.000308
PTOV1 53635 -2.53084 0.00035 0.003537 1.11E-15 0.000923 0.000315
CHRNB2 1141 -2.51519 0.000381 0.003827 3.77E-15 0.000494 0.000334
NARS 4677 -2.51479 0.000381 0.003827 1.33E-15 0.000695 0.00035
TOX3 27324 -2.51331 0.000384 0.003828 4.83E-11 0.000544 0.000327
AKIRIN1 79647 -2.51358 0.000384 0.003828 2.44E-15 0.000545 0.000336
CCDC6 8030 -2.50662 0.000398 0.003949 1.59E-13 0.000702 0.000338
SPCS3 60559 -2.50529 0.000401 0.003957 7.79E-11 0.000868 0.000329
BAD 572 -2.50274 0.000407 0.003992 2.22E-16 0.001074 0.000315
ATG16L1 55054 -2.50274 0.000407 0.003992 2.30E-13 0.000308 0.000431
UBE2Z 65264 -2.50114 0.00041 0.004016 1.85E-13 0.000618 0.000329
CNIH4 29097 -2.49926 0.000414 0.004027 3.55E-15 0.000414 0.00044
EGLN1 54583 -2.49819 0.000417 0.00404 2.98E-13 0.000535 0.000341
RAB3D 9545 -2.49351 0.000427 0.004132 4.66E-11 0.000159 0.000439
MAP3K2 10746 -2.47812 0.000464 0.004463 2.66E-15 0.000736 0.000406
SFSWAP 6433 -2.4725 0.000478 0.004565 4.45E-11 0.000381 0.00046
NDUFS8 4728 -2.46916 0.000486 0.004624 4.00E-11 0.000626 0.00044
HFE 3077 -2.46099 0.000508 0.00477 3.33E-13 0.002654 0.00043
CKLF 51192 -2.45939 0.000512 0.0048 2.79E-13 0.001141 0.000445
FBXO2 26232 -2.45537 0.000523 0.004891 3.77E-15 0.0013 0.000424
C20orf11 54994 -2.45403 0.000527 0.004914 6.61E-11 0.000781 0.000511
KRT19 3880 -2.45149 0.000534 0.004968 5.11E-15 0.000664 0.000521
GLT8D1 55830 -2.44948 0.000539 0.005009 2.44E-15 0.00109 0.000484

CDCP1 64866 -2.44855 0.000542 0.005022 5.20E-11 0.001286 0.000509
ZNF821 55565 -2.44601 0.000549 0.005077 2.22E-15 0.000533 0.000485
RPS7 6201 -2.444 0.000555 0.005096 8.15E-13 0.000472 0.000557
TDP1 55775 -2.4448 0.000553 0.005096 2.89E-15 0.001767 0.000479
FBXL6 26233 -2.43771 0.000574 0.00521 6.37E-13 0.000459 0.000593
TMEM165 55858 -2.43758 0.000574 0.00521 5.76E-13 0.000921 0.00054
THOC2 57187 -2.43824 0.000572 0.00521 5.28E-13 0.000868 0.000536
SLC1A4 6509 -2.43262 0.000589 0.005322 2.22E-16 0.001711 0.000477
RPS27L 51065 -2.43276 0.000589 0.005322 4.88E-15 0.000953 0.000522
HNRNPC 3183 -2.4258 0.00061 0.005453 4.31E-11 0.000693 0.000542
IPP 3652 -2.42473 0.000614 0.005471 2.86E-13 0.001653 0.000544
UMPS 7372 -2.42232 0.000622 0.005515 4.00E-15 0.001453 0.000493
AGPAT5 55326 -2.4175 0.000637 0.005642 1.64E-14 0.001137 0.000618
CAPN5 726 -2.41295 0.000652 0.005738 6.14E-13 0.001012 0.000625
COL11A1 1301 -2.41295 0.000652 0.005738 5.75E-13 0.000884 0.000618
RPS19 6223 -2.41295 0.000652 0.005738 8.16E-11 0.00183 0.000602
C7orf42 55069 -2.40693 0.000673 0.005894 4.00E-15 0.000909 0.000664
MGEA5 10724 -2.40533 0.000679 0.005917 1.85E-10 0.003441 0.000567
TEP1 7011 -2.40292 0.000687 0.005978 6.28E-13 0.000877 0.000651
P4HA1 5033 -2.39944 0.0007 0.006047 1.22E-14 0.001171 0.000668
UBE2B 7320 -2.39984 0.000698 0.006047 1.16E-10 0.000844 0.00066
MRTO4 51154 -2.39823 0.000704 0.006071 3.11E-15 0.000976 0.000665
MUC4 4585 -2.39208 0.000727 0.006173 4.00E-15 0.001882 0.000656
ASAP1 50807 -2.39261 0.000725 0.006173 1.67E-14 0.000714 0.000674
PLEKHA6 22874 -2.38525 0.000753 0.006312 5.26E-13 0.000904 0.000747
STAM2 10254 -2.38365 0.000759 0.006351 7.11E-15 0.000833 0.000727
PPP2CA 5515 -2.38177 0.000766 0.00639 1.51E-14 0.001581 0.000655
CEP70 80321 -2.38164 0.000767 0.00639 7.77E-15 0.003362 0.000794
PGK1 5230 -2.37201 0.000805 0.00667 6.38E-11 0.003407 0.000711
TFG 10342 -2.36692 0.000826 0.006789 7.11E-15 0.001991 0.000798
TSR1 55720 -2.36772 0.000823 0.006789 3.55E-15 0.000832 0.000725
TMEM156 80008 -2.36344 0.000841 0.006868 8.51E-13 0.00087 0.000798
SIRT6 51548 -2.35782 0.000866 0.007052 5.20E-13 0.001352 0.00083
GUSB 2990 -2.35314 0.000886 0.007163 8.22E-15 0.00127 0.000904
C16orf53 79447 -2.35327 0.000886 0.007163 7.69E-13 0.001584 0.000859
SMEK1 55671 -2.351 0.000896 0.007217 7.11E-15 0.003005 0.000934
GSR 2936 -2.34939 0.000903 0.007242 6.88E-15 0.001879 0.000862
ELF2 1998 -2.34859 0.000907 0.007247 1.98E-14 0.000469 0.000792
BCAN 63827 -2.34779 0.000911 0.007257 1.49E-12 0.001262 0.000906
ITGA2 3673 -2.34698 0.000914 0.007272 1.27E-12 0.00158 0.000893
COQ9 57017 -2.34217 0.000937 0.007421 6.23E-13 0.000865 0.000866

SCN9A 6335 -2.33922 0.000951 0.007503 6.22E-15 0.001656 0.001005
B4GALT3 8703 -2.33708 0.000961 0.007529 4.66E-15 0.00089 0.000902
SCAMP2 10066 -2.33748 0.000959 0.007529 1.34E-12 0.002396 0.00091
TMEM120B 144404 -2.33695 0.000962 0.007529 2.89E-15 0.001525 0.000833
CD46 4179 -2.32985 0.000997 0.007772 1.76E-10 0.002001 0.000889
HS2ST1 9653 -2.32985 0.000997 0.007772 9.10E-15 0.000961 0.000944
IGFBP1 3484 -2.32838 0.001004 0.007815 6.22E-15 0.001551 0.001004
HSPA5 3309 -2.32356 0.001029 0.007898 3.09E-10 0.001236 0.001003
UBE2H 7328 -2.32356 0.001029 0.007898 8.99E-11 0.000807 0.000942
ZDHHC13 54503 -2.32464 0.001023 0.007898 7.52E-13 0.001198 0.000927
ERCC4 2072 -2.31406 0.001079 0.008219 1.09E-14 0.004038 0.00109
OR51E2 81285 -2.30242 0.001143 0.008676 1.42E-10 0.000618 0.001046
RHOC 389 -2.30162 0.001148 0.008679 6.22E-15 0.001131 0.001146
LRIG2 9860 -2.30122 0.00115 0.008679 5.42E-13 0.002869 0.001027
SNX13 23161 -2.296 0.00118 0.008873 4.88E-15 0.001499 0.001226
USP3 9960 -2.29065 0.001212 0.009077 2.91E-14 0.003472 0.001281
HKDC1 80201 -2.29024 0.001215 0.009078 1.04E-14 0.002399 0.001303
C2orf55 343990 -2.28074 0.001273 0.009426 1.95E-14 0.001709 0.00126
ZZZ3 26009 -2.27967 0.00128 0.009441 1.49E-14 0.001371 0.001262
RNF123 63891 -2.27967 0.00128 0.009441 3.95E-14 0.002817 0.001386
KDM4B 23030 -2.27874 0.001286 0.009449 6.44E-15 0.00146 0.001269
ABCC3 8714 -2.27807 0.00129 0.009463 1.33E-15 0.000764 0.001175
HTATIP2 10553 -2.27245 0.001326 0.009693 4.20E-10 0.003419 0.001186
UTP11L 51118 -2.27017 0.001341 0.009784 6.00E-15 0.002761 0.001198
PITPNC1 26207 -2.26589 0.001369 0.009955 1.07E-14 0.002629 0.001138
C2orf3 6936 -2.26321 0.001387 0.010068 2.91E-10 0.002169 0.001154
ENY2 56943 -2.26174 0.001397 0.010104 2.44E-14 0.002373 0.001402
CUL1 8454 -2.2592 0.001415 0.010212 2.13E-14 0.004063 0.001192
TRMT5 57570 -2.25786 0.001424 0.01026 2.24E-10 0.001546 0.001368
ASCC3 10973 -2.25706 0.00143 0.010282 1.07E-12 0.003692 0.001279
NIPAL2 79815 -2.25572 0.001439 0.010312 9.55E-15 0.000881 0.001541
SRD5A3 79644 -2.25304 0.001458 0.010399 1.87E-12 0.002449 0.001419
TOM1L1 10040 -2.24742 0.001499 0.010661 1.11E-15 0.001831 0.001307
SLC9A5 6553 -2.24435 0.001521 0.010784 1.40E-14 0.002267 0.001514
PHF8 23133 -2.24047 0.00155 0.01097 2.22E-14 0.002854 0.001517
ARHGEF12 23365 -2.23899 0.001561 0.011029 1.24E-14 0.001492 0.001391
GALNT1 2589 -2.23565 0.001587 0.011169 4.26E-14 0.00253 0.001449
SYTL2 54843 -2.23511 0.001591 0.011179 2.57E-12 0.001942 0.001588
AK4 205 -2.23337 0.001604 0.011248 1.31E-12 0.004479 0.001509
TRIM14 9830 -2.23311 0.001606 0.011248 9.33E-15 0.00287 0.001569
GALNT2 2590 -2.23096 0.001623 0.011326 6.67E-13 0.001141 0.001487

ICA1 3382 -2.22976 0.001633 0.011367 8.66E-15 0.002469 0.001593
C17orf48 56985 -2.22949 0.001635 0.011367 3.02E-14 0.001774 0.001571
CHD7 55636 -2.22882 0.00164 0.011384 3.53E-14 0.00161 0.001535
PRKAA2 5563 -2.22427 0.001677 0.011596 1.02E-14 0.001772 0.001803
PPAT 5471 -2.22334 0.001684 0.011628 3.77E-15 0.002438 0.001484
HNRNPR 10236 -2.22253 0.001691 0.011653 2.55E-14 0.002413 0.001563
GTF2H5 404672 -2.21678 0.001738 0.01194 1.64E-14 0.002765 0.001521
KDM6A 7403 -2.21611 0.001744 0.011957 3.15E-14 0.00212 0.001675
COL3A1 1281 -2.21517 0.001752 0.011991 1.13E-12 0.001957 0.001712
FAM162A 26355 -2.21464 0.001756 0.012001 3.26E-12 0.003868 0.001685
OLA1 29789 -2.20567 0.001833 0.012443 5.37E-10 0.001195 0.001564
PLEKHA5 54477 -2.20594 0.001831 0.012443 2.08E-12 0.004507 0.001768
ENTPD7 57089 -2.20608 0.00183 0.012443 5.01E-10 0.00399 0.00175
FNTB 2342 -2.19577 0.001922 0.01298 4.34E-12 0.002968 0.001791
NEFM 4741 -2.19577 0.001922 0.01298 1.80E-14 0.001812 0.001918
CES3 23491 -2.19336 0.001944 0.013108 3.11E-15 0.002379 0.001734
HIST3H2A 92815 -2.19069 0.001969 0.013232 1.67E-14 0.002717 0.0018
TMC6 11322 -2.18493 0.002024 0.013553 3.07E-10 0.004158 0.001883
C1GALT1C1 29071 -2.17704 0.002101 0.014 3.20E-14 0.00609 0.002212
MTFR1 9650 -2.17557 0.002116 0.014051 4.13E-10 0.004257 0.001903
HGD 3081 -2.17463 0.002125 0.01409 6.30E-10 0.006682 0.001976
ENTPD2 954 -2.17249 0.002147 0.014195 8.53E-10 0.003612 0.002024
KIFC1 3833 -2.17048 0.002168 0.014274 4.17E-14 0.002341 0.002055
S1PR2 9294 -2.16887 0.002184 0.014344 5.82E-14 0.002947 0.002104
CEP68 23177 -2.16754 0.002198 0.014402 4.00E-12 0.002685 0.002147
TFAM 7019 -2.16165 0.00226 0.014735 2.71E-14 0.00388 0.002317
LIG4 3981 -2.15536 0.002328 0.01503 4.71E-14 0.006275 0.002194
EIF3B 8662 -2.15148 0.00237 0.01523 3.35E-12 0.004667 0.002253
USP2 9099 -2.15081 0.002378 0.01523 3.62E-14 0.004058 0.002073
C1orf163 65260 -2.15081 0.002378 0.01523 2.24E-12 0.007116 0.002244
KLHL36 79786 -2.1472 0.002418 0.015441 8.40E-10 0.005586 0.002479
SLC35A3 23443 -2.14559 0.002437 0.015483 4.80E-12 0.004846 0.002295
TRIM9 114088 -2.14586 0.002434 0.015483 2.22E-14 0.004171 0.002296
PAK1 5058 -2.14425 0.002452 0.015525 3.55E-15 0.007161 0.002121
TRIM2 23321 -2.13328 0.002581 0.016141 5.11E-15 0.004527 0.00237
WBSCR22 114049 -2.131 0.002608 0.016262 5.55E-14 0.003978 0.002461
ZNF318 24149 -2.1286 0.002638 0.016419 7.28E-14 0.007945 0.00249
HSPA8 3312 -2.12632 0.002666 0.016465 6.79E-14 0.00201 0.002614
YARS 8565 -2.12672 0.002661 0.016465 1.49E-11 0.006674 0.002724
TES 26136 -2.12632 0.002666 0.016465 5.44E-14 0.003189 0.002553
NUP62CL 54830 -2.12672 0.002661 0.016465 3.40E-12 0.001828 0.002297

IL17RB 55540 -2.12739 0.002652 0.016465 7.44E-10 0.004833 0.002727
APH1A 51107 -2.12498 0.002682 0.016542 6.50E-12 0.007446 0.002472
NCOR2 9612 -2.11802 0.00277 0.017005 8.70E-14 0.002907 0.00294
ARF6 382 -2.11736 0.002779 0.017006 4.62E-14 0.003653 0.002875
FANCI 55215 -2.11026 0.002871 0.017438 2.84E-14 0.004614 0.002719
S100A6 6277 -2.10612 0.002927 0.017748 8.87E-10 0.007147 0.002969
RAD1 5810 -2.10424 0.002952 0.017847 1.28E-13 0.003021 0.003393
NOL6 65083 -2.1021 0.002981 0.017969 1.10E-13 0.003499 0.002916
NELF 26012 -2.1013 0.002992 0.017981 4.73E-14 0.005395 0.002695
TRABD 80305 -2.1005 0.003003 0.018021 6.24E-14 0.003593 0.003167
CORO2A 7464 -2.09983 0.003013 0.018049 1.58E-11 0.00673 0.003085
ASB7 140460 -2.09902 0.003024 0.018088 1.20E-14 0.005157 0.002679
NRD1 4898 -2.09648 0.003059 0.018273 6.88E-15 0.005598 0.002823
ZFX 7543 -2.09367 0.003099 0.018482 6.51E-14 0.004793 0.003007
KDM5C 8242 -2.093 0.003108 0.018512 5.51E-14 0.005071 0.00322
RPL13 6137 -2.09247 0.003116 0.018529 1.03E-13 0.006191 0.003221
TMEM8A 58986 -2.09099 0.003137 0.018627 5.36E-12 0.001803 0.003211
UTP6 55813 -2.08979 0.003155 0.018702 5.97E-14 0.008024 0.003232
SORD 6652 -2.08711 0.003193 0.018848 5.07E-12 0.006549 0.002881
SLC17A5 26503 -2.08149 0.003276 0.019264 9.37E-14 0.005674 0.003391
SHB 6461 -2.07587 0.003361 0.019636 3.57E-14 0.004725 0.003387
KLHL12 59349 -2.07587 0.003361 0.019636 3.14E-12 0.007446 0.002929
NUDT21 11051 -2.07494 0.003376 0.019674 1.14E-13 0.005648 0.003128
UGT8 7368 -2.07213 0.003419 0.019857 4.37E-14 0.003688 0.003478
CAPN1 823 -2.07039 0.003446 0.019985 3.87E-12 0.00308 0.003435
GPATCH4 54865 -2.06798 0.003484 0.020146 1.15E-14 0.004255 0.003281
TFB1M 51106 -2.06249 0.003572 0.020593 8.66E-15 0.010479 0.003174
PPIA 5478 -2.05741 0.003655 0.020951 4.93E-14 0.006932 0.003745
KCNA1 3736 -2.0562 0.003675 0.02103 1.44E-13 0.004171 0.00364
CEP250 11190 -2.0546 0.003701 0.021127 7.99E-14 0.004523 0.00356
MYCN 4613 -2.05232 0.00374 0.021266 1.88E-13 0.011769 0.003353
DPYSL3 1809 -2.04697 0.003831 0.021648 3.74E-13 0.007736 0.003758
APLP2 334 -2.03881 0.003974 0.022329 1.92E-13 0.005571 0.003873
C20orf24 55969 -2.03667 0.004012 0.022481 6.40E-12 0.005336 0.003784
MUS81 80198 -2.03332 0.004073 0.022756 5.44E-14 0.004336 0.004007
CHML 1122 -2.03278 0.004082 0.022778 6.60E-12 0.008547 0.003929
LANCL2 55915 -2.03024 0.004129 0.022988 8.90E-14 0.006808 0.003653
B4GALT6 9331 -2.02971 0.004139 0.022997 2.08E-13 0.003686 0.004411
PDIA3 2923 -2.02369 0.004251 0.023459 4.64E-14 0.003353 0.004335
SOX12 6666 -2.01485 0.004422 0.024113 6.04E-14 0.00643 0.004199
CCT5 22948 -2.01552 0.004409 0.024113 4.20E-11 0.007575 0.004514

MLF1IP 79682 -2.01592 0.004401 0.024113 8.30E-14 0.003636 0.004762
EEF1A1 1915 -2.00776 0.004563 0.024699 2.08E-13 0.006902 0.004436
TMEM132A 54972 -2.00723 0.004574 0.024719 2.04E-13 0.005786 0.004695
ZNF193 7746 -2.00482 0.004623 0.024868 6.28E-14 0.01149 0.004434
DTL 51514 -1.99973 0.004727 0.025247 6.79E-14 0.003333 0.004934
UGGT1 56886 -2 0.004722 0.025247 1.19E-13 0.004393 0.005138
ISYNA1 51477 -1.99679 0.004789 0.025489 1.51E-11 0.005034 0.004844
KRTAP1-3 81850 -1.99665 0.004792 0.025489 9.04E-14 0.005374 0.004746
DUSP4 1846 -1.99625 0.004801 0.0255 2.71E-13 0.0081 0.00506
PDXDC1 23042 -1.99425 0.004843 0.025692 9.77E-12 0.006848 0.004405
CARS 833 -1.99331 0.004863 0.02576 2.39E-11 0.002309 0.004627
RFX1 5989 -1.99304 0.004869 0.02576 6.55E-14 0.004211 0.004438
SHH 6469 -1.98421 0.005061 0.026566 2.68E-13 0.010568 0.005602
DNAJC10 54431 -1.98167 0.005118 0.026793 2.21E-11 0.011396 0.005087
CCNL2 81669 -1.97859 0.005187 0.027085 1.82E-11 0.011644 0.004738
PCSK6 5046 -1.97725 0.005218 0.027208 3.29E-13 0.010707 0.005548
COX4NB 10328 -1.97444 0.005282 0.027401 1.11E-14 0.005462 0.005007
STK16 8576 -1.95477 0.005753 0.029162 1.65E-13 0.006035 0.005391
GOLT1B 51026 -1.95263 0.005807 0.029395 2.78E-11 0.011843 0.005528
TAF1D 79101 -1.95223 0.005817 0.029409 9.61E-12 0.0064 0.00532
ZNF107 51427 -1.95102 0.005847 0.02945 1.77E-11 0.012057 0.005184
SMG1 23049 -1.94942 0.005888 0.029581 2.75E-11 0.004973 0.005549
TFCP2L1 29842 -1.94942 0.005888 0.029581 1.25E-13 0.01318 0.006037
RAB7L1 8934 -1.94835 0.005915 0.02966 5.50E-13 0.021902 0.006086
GDF15 9518 -1.94166 0.006088 0.030395 2.06E-13 0.006677 0.006386
RNF19A 25897 -1.94099 0.006106 0.030444 2.62E-13 0.011788 0.006408
NDUFS1 4719 -1.93858 0.006169 0.030723 1.62E-13 0.008349 0.006119
CSNK2A1 1457 -1.93563 0.006248 0.031036 1.33E-13 0.012745 0.005995
ZMYM5 9205 -1.93376 0.006298 0.031149 1.25E-11 0.006251 0.005918
CYCS 54205 -1.93363 0.006302 0.031149 5.09E-13 0.009021 0.006264
ZFAND6 54469 -1.93082 0.006378 0.031311 1.87E-13 0.010277 0.006292
METT10D 79066 -1.93068 0.006382 0.031311 1.45E-13 0.007967 0.006657
GLRB 2743 -1.92935 0.006418 0.031375 3.37E-13 0.00759 0.006636
KCNC2 3747 -1.92988 0.006404 0.031375 1.72E-13 0.007416 0.006696
ARHGAP26 23092 -1.92935 0.006418 0.031375 3.18E-13 0.013382 0.00674
API5 8539 -1.92881 0.006433 0.031409 1.62E-11 0.009472 0.006071
PTER 9317 -1.92801 0.006455 0.031478 1.20E-11 0.016056 0.0061
IDH1 3417 -1.92654 0.006496 0.031638 3.13E-14 0.003712 0.006236
ZNF552 79818 -1.92613 0.006507 0.031654 1.21E-13 0.008514 0.006268
CHD2 1106 -1.92453 0.006552 0.031792 3.07E-13 0.004544 0.007001
MFSD9 84804 -1.92265 0.006605 0.031972 2.64E-11 0.007985 0.006744

SPTLC2 9517 -1.91984 0.006684 0.032279 2.77E-13 0.020391 0.006729
CHD9 80205 -1.91891 0.006711 0.032368 6.28E-11 0.014248 0.006257
ORC2 4999 -1.91784 0.006742 0.032438 2.38E-11 0.007637 0.006433
CMAS 55907 -1.91717 0.006761 0.032491 3.85E-11 0.018396 0.006451
RPL37 6167 -1.91382 0.006858 0.032838 2.47E-11 0.002931 0.006968
GDI2 2665 -1.91034 0.00696 0.033205 4.27E-13 0.014211 0.006921
AP1AR 55435 -1.91048 0.006956 0.033205 1.64E-13 0.007551 0.006419
ZMIZ2 83637 -1.90967 0.00698 0.033222 1.67E-13 0.008837 0.007278
RQCD1 9125 -1.90539 0.007107 0.03371 1.96E-11 0.012798 0.006795
SR140 23350 -1.90392 0.007152 0.03388 1.37E-11 0.012004 0.006389
CELF1 10658 -1.90285 0.007184 0.033994 9.51E-13 0.021065 0.006626
KIAA0922 23240 -1.89763 0.007345 0.034506 6.94E-13 0.018089 0.006862
LRP10 26020 -1.89777 0.00734 0.034506 2.59E-13 0.009079 0.007364
LRRC8E 80131 -1.89803 0.007332 0.034506 1.70E-11 0.009899 0.007416
YIPF5 81555 -1.89803 0.007332 0.034506 2.80E-11 0.007884 0.00717
UQCRB 7381 -1.89683 0.00737 0.034581 6.41E-13 0.01215 0.006796
ESRP2 80004 -1.89415 0.007453 0.034893 1.80E-11 0.018212 0.007104
SMC5 23137 -1.89308 0.007487 0.035009 2.69E-13 0.006751 0.00793
CPSF6 11052 -1.89148 0.007538 0.035123 1.06E-10 0.011535 0.00771
G3BP1 10146 -1.88987 0.007589 0.03532 2.98E-13 0.01401 0.007671
SLMO2 51012 -1.88505 0.007744 0.035958 2.58E-13 0.013074 0.008036
FAM3C 10447 -1.88144 0.007862 0.036465 2.09E-11 0.009576 0.007597
PACSIN3 29763 -1.88077 0.007884 0.036525 5.58E-13 0.023172 0.007564
C3orf14 57415 -1.87743 0.007996 0.036911 4.81E-13 0.014834 0.008224
RPS11 6205 -1.87354 0.008126 0.037428 2.43E-13 0.008767 0.008254
ZNF214 7761 -1.869 0.008282 0.03797 1.06E-12 0.01517 0.00799
OTUB2 78990 -1.86699 0.008352 0.038201 3.45E-13 0.010698 0.008606
INPP4A 3631 -1.86525 0.008412 0.038435 5.82E-11 0.014989 0.008255
ATG10 83734 -1.86351 0.008474 0.038625 5.78E-13 0.012014 0.007851
SOX2 6657 -1.86043 0.008583 0.039033 9.50E-14 0.010513 0.008488
TMEM50B 757 -1.85655 0.008722 0.039531 7.81E-13 0.009872 0.008674
SSR3 6747 -1.85494 0.00878 0.03975 7.27E-11 0.015221 0.008452
PMS1 5378 -1.85401 0.008814 0.039858 3.11E-13 0.00981 0.008467
MAFG 4097 -1.8516 0.008903 0.040054 3.37E-13 0.01217 0.008731
RPL27 6155 -1.85147 0.008908 0.040054 4.08E-13 0.00883 0.009415
SEZ6L 23544 -1.84999 0.008962 0.040161 7.37E-13 0.010781 0.009726
ZNF117 51351 -1.84946 0.008982 0.040161 3.12E-13 0.022471 0.009166
SEC61A2 55176 -1.84973 0.008972 0.040161 4.00E-13 0.015572 0.009307
CADM4 199731 -1.85013 0.008957 0.040161 9.26E-14 0.006725 0.008617
ZNF587 84914 -1.84531 0.009137 0.040718 8.04E-14 0.006001 0.008906
SAR1B 51128 -1.84317 0.009218 0.040987 4.91E-11 0.018087 0.008885

SIX3 6496 -1.83942 0.009361 0.041532 6.76E-11 0.007349 0.009262
AMMECR1 9949 -1.83808 0.009413 0.041622 1.11E-13 0.01531 0.009143
C17orf81 23587 -1.83862 0.009392 0.041622 2.23E-13 0.024472 0.009675
POLDIP2 26073 -1.83741 0.009439 0.04169 1.34E-11 0.02001 0.008935
USP9X 8239 -1.83594 0.009496 0.041804 4.84E-13 0.008614 0.009813
LIN37 55957 -1.83327 0.009601 0.04222 4.48E-13 0.015196 0.009329
APOBEC3F 200316 -1.83286 0.009617 0.042243 9.34E-13 0.008314 0.009229
AGGF1 55109 -1.83005 0.009728 0.042545 1.80E-12 0.02111 0.009662
APOOL 139322 -1.82818 0.009803 0.042778 6.52E-13 0.01839 0.009725
FKBP3 2287 -1.8247 0.009944 0.043344 5.71E-13 0.014602 0.009621
SV2C 22987 -1.82256 0.010031 0.043582 9.38E-13 0.014766 0.009721
ERBB3 2065 -1.81681 0.010269 0.044303 4.91E-14 0.015589 0.009534
DENND1A 57706 -1.81373 0.010399 0.04474 6.54E-13 0.019085 0.010772
CCDC41 51134 -1.81266 0.010444 0.044887 5.48E-13 0.020671 0.010831
ARFGAP1 55738 -1.81052 0.010536 0.045133 2.33E-11 0.011422 0.010406
ABHD5 51099 -1.80436 0.010802 0.045978 8.11E-13 0.010288 0.011253
PSMF1 9491 -1.79981 0.011003 0.046486 2.02E-12 0.017676 0.010772
PIGV 55650 -1.79955 0.011015 0.046486 6.06E-13 0.020636 0.010762
IDS 3423 -1.79901 0.011038 0.046537 9.09E-13 0.007134 0.011252
PRRC1 133619 -1.79794 0.011086 0.046689 2.03E-11 0.017257 0.011028
IFT81 28981 -1.79165 0.011371 0.047687 6.33E-13 0.014347 0.011575
TECR 9524 -1.79031 0.011433 0.047745 6.66E-13 0.008795 0.01188
PCYT2 5833 -1.7883 0.011525 0.047906 1.50E-10 0.03922 0.010677
PCCA 5095 -1.7867 0.0116 0.04814 1.23E-12 0.027125 0.011964
DDAH1 23576 -1.78536 0.011662 0.048249 7.98E-13 0.051518 0.011459
GABRD 2563 -1.78228 0.011807 0.048747 5.61E-13 0.007564 0.011632
CA10 56934 -1.77787 0.012018 0.049412 7.62E-11 0.025888 0.012156
EIF4E2 9470 -1.77532 0.012141 0.049866 5.55E-13 0.015946 0.010641
DIRAS2 54769 -1.77452 0.01218 0.049872 8.08E-13 0.017905 0.012636
IL17RC 84818 -1.77452 0.01218 0.049872 8.78E-11 0.027844 0.012239
CREB1 1385 -1.77318 0.012245 0.050088 9.55E-13 0.01648 0.012679
POGK 57645 -1.77118 0.012344 0.050439 8.52E-13 0.023387 0.012824
DHTKD1 55526 -1.77037 0.012384 0.050497 5.08E-11 0.014244 0.012244
AURKAIP1 54998 -1.76957 0.012423 0.050556 2.65E-11 0.01549 0.011577
RIMS2 9699 -1.76796 0.012503 0.050829 4.38E-13 0.019559 0.012801
ARHGAP1 392 -1.76756 0.012523 0.050858 6.07E-11 0.006094 0.01227
ADORA3 140 -1.76569 0.012617 0.051135 1.81E-12 0.020341 0.012347
C3orf63 23272 -1.76181 0.012813 0.051719 2.52E-12 0.031681 0.012838
TULP4 56995 -1.76074 0.012868 0.051835 7.02E-11 0.026072 0.012195
BHLHE41 79365 -1.7598 0.012916 0.051975 1.06E-10 0.014492 0.012899
METTL8 79828 -1.7586 0.012977 0.052172 1.26E-12 0.024518 0.012638

SLC2A4RG 56731 -1.75793 0.013012 0.052258 9.20E-11 0.034174 0.012574
SLC38A10 124565 -1.75659 0.013081 0.05243 6.03E-11 0.021942 0.013127
WISP3 8838 -1.75458 0.013186 0.052742 7.35E-13 0.0126 0.01274
CP 1356 -1.75311 0.013263 0.052998 4.54E-11 0.008105 0.013171
TRPM3 80036 -1.75231 0.013305 0.053113 5.79E-13 0.01284 0.013062

Table A.4: Top 100 Common DEGs from tumor samples between smokers and non-smokers

Columns: Description (gene symbol); Entrez Gene ID; t-test: logFC, P.Value, adj.P.Val; LR: ordp, ChiSq; DISCO: t.test

APP 351 -1.82024 0.000622 0.319659 0 0.000667 0.000319
KIFC1 3833 -1.95851 0.000232 0.319659 0 7.83E-05 0.000351
CENPF 1063 -1.77929 0.000823 0.343535 0 0.000637 0.00118
CBR1 873 -1.67082 0.001684 0.432717 0 0.001814 0.001788
ATP2A2 488 -1.57098 0.003145 0.461865 0 0.002601 0.004037
KIF1A 547 -1.52788 0.004076 0.461865 0 0.00414 0.002844
BPGM 669 -1.58751 0.002842 0.461865 0 0.002269 0.002664
CBR3 874 -1.53327 0.003947 0.461865 0 0.004364 0.003116
CDKN2C 1031 -1.55734 0.003416 0.461865 0 0.003362 0.003528
CENPE 1062 -1.5013 0.004769 0.461865 0 0.004658 0.004703
DDX10 1662 -1.5031 0.004719 0.461865 0 0.005681 0.004528
EFNA2 1943 -1.4679 0.005789 0.461865 0 0.009118 0.004381
FEN1 2237 -1.52967 0.004033 0.461865 0 0.002034 0.004339
FGF12 2257 -1.49375 0.004984 0.461865 0 0.006689 0.004361
GABPA 2551 -1.5074 0.004601 0.461865 0 0.002616 0.005323
GARS 2617 -1.56235 0.003314 0.461865 0 0.002483 0.003131
GRIK2 2898 -1.55265 0.003514 0.461865 0 0.006921 0.002403
H2AFZ 3015 -1.5444 0.003693 0.461865 0 0.001928 0.0048
HSD17B10 3028 -1.57887 0.002997 0.461865 0 0.003121 0.003715
HIF1A 3091 -1.46287 0.005959 0.461865 0 0.00533 0.007415
HMMR 3161 -1.50273 0.004729 0.461865 0 0.004435 0.004676
KIF3C 3797 -1.55553 0.003453 0.461865 0 0.003139 0.002812
MAD2L1 4085 -1.5566 0.003431 0.461865 0 0.001779 0.004738
MAP1B 4131 -1.6446 0.001991 0.461865 0 0.002666 0.0015
MCM4 4173 -1.48513 0.00524 0.461865 0 0.003008 0.006741
NCBP1 4686 -1.51746 0.004336 0.461865 0 0.005781 0.004292
NPPC 4880 -1.56739 0.003214 0.461865 0 0.004687 0.002541
NRAP 4892 -1.59288 0.00275 0.461865 0 0.001847 0.002812
EIF4EBP1 1978 -1.44813 0.006483 0.463572 0 0.008314 0.007084
GCLC 2729 -1.45532 0.006222 0.463572 0 0.008921 0.005759
CALML3 810 -1.42228 0.007501 0.47916 0 0.012367 0.006688
BRF1 2972 -1.3993 0.008526 0.496555 0 0.012044 0.006629
LECT2 3950 -1.39678 0.008645 0.496555 0 0.011698 0.008664
KRT85 3891 -1.38889 0.009029 0.506358 0 0.012397 0.00779
CA12 771 -1.37416 0.009787 0.50931 0 0.006709 0.009596
GOLGA3 2802 -1.37236 0.009884 0.50931 0 0.0083 0.010225

HRG 3273 -1.37451 0.009769 0.50931 0 0.008556 0.00858
ORC1L 4998 -1.37091 0.009962 0.50931 0 0.006673 0.011785
DAGLA 747 -1.34434 0.011498 0.511806 0 0.013869 0.009443
CENPA 1058 -1.34973 0.01117 0.511806 0 0.006894 0.012985
DPYSL3 1809 -1.36733 0.010158 0.511806 0 0.0164 0.009232
FKBP4 2288 -1.36302 0.010398 0.511806 0 0.005896 0.011968
GABRR1 2569 -1.3544 0.010893 0.511806 0 0.01638 0.010731
IGFBP5 3488 -1.34615 0.011387 0.511806 0 0.01939 0.008878
MAFG 4097 -1.34722 0.011322 0.511806 0 0.020877 0.008053
NDUFA10 4705 -1.3544 0.010893 0.511806 0 0.017858 0.011474
EIF4G1 1981 -1.33214 0.012271 0.516081 0 0.010821 0.009729
FKBP3 2287 -1.33105 0.012342 0.516081 0 0.01267 0.009834
GNG4 2786 -1.32532 0.012723 0.516081 0 0.007636 0.01455
FGD1 2245 -1.31741 0.013265 0.521227 0 0.017458 0.011961
KIF11 3832 -1.3192 0.013141 0.521227 0 0.012567 0.016724
MDH1 4190 -1.31776 0.01324 0.521227 0 0.009549 0.015707
H2AFX 3014 -1.31022 0.013776 0.524044 0 0.018112 0.015401
NFE2L1 4779 -1.30844 0.013905 0.524044 0 0.011632 0.01485
IARS 3376 -1.3016 0.014411 0.530259 0 0.014299 0.014472
CA5A 763 -1.29191 0.015156 0.533816 0 0.025593 0.011428
PARP1 142 -1.2348 0.020272 0.53457 0 0.016227 0.020776
ATP4A 495 -1.25096 0.01869 0.53457 0 0.011441 0.020534
ATP5C1 509 -1.28974 0.015327 0.53457 0 0.020279 0.016396
MRPL49 740 -1.25887 0.017956 0.53457 0 0.017027 0.019431
CDC2 983 -1.28185 0.015965 0.53457 0 0.012003 0.01434
CDK5 1020 -1.25024 0.018758 0.53457 0 0.029286 0.015788
CHRM2 1129 -1.23229 0.020528 0.53457 0 0.021754 0.020973
DAB1 1600 -1.23768 0.019982 0.53457 0 0.018551 0.018061
DKC1 1736 -1.2858 0.015643 0.53457 0 0.013372 0.015286
DNA2L 1763 -1.2567 0.018155 0.53457 0 0.013754 0.021536
CELSR3 1951 -1.27287 0.016718 0.53457 0 0.013112 0.01495
STX2 2054 -1.26138 0.017729 0.53457 0 0.01528 0.016368
EPRS 2058 -1.25778 0.018055 0.53457 0 0.01552 0.021019
FANCG 2189 -1.23084 0.020677 0.53457 0 0.025319 0.021526
GAST 2520 -1.28186 0.015964 0.53457 0 0.02502 0.013392
GMFB 2764 -1.25023 0.018759 0.53457 0 0.023476 0.018968
SFN 2810 -1.25815 0.018022 0.53457 0 0.017897 0.019663
GSK3B 2932 -1.22689 0.021088 0.53457 0 0.018555 0.019851
ICT1 3396 -1.22689 0.021088 0.53457 0 0.012895 0.019934
KPNA2 3838 -1.25922 0.017924 0.53457 0 0.011218 0.019902
KRT81 3887 -1.25133 0.018656 0.53457 0 0.013039 0.019607

75

LIMK2 3985 -1.23768 0.019982 0.53457 0 0.018052 0.019003 ALDH3A1 218 -1.20428 0.023581 0.541946 0 0.028252 0.023781 BAK1 578 -1.21325 0.022562 0.541946 0 0.019855 0.021653 CDC27 996 -1.20319 0.023706 0.541946 0 0.031755 0.021931 DRD4 1815 -1.20176 0.023873 0.541946 0 0.029792 0.02303 FGF3 2248 -1.21792 0.022048 0.541946 0 0.030298 0.026779 FGL1 2267 -1.21253 0.022643 0.541946 0 0.040509 0.017673 FH 2271 -1.21828 0.022009 0.541946 0 0.015892 0.024278 HDAC2 3066 -1.20822 0.023129 0.541946 0 0.018599 0.025382 HMGB1 3146 -1.2014 0.023916 0.541946 0 0.028092 0.024466 MDFI 4188 -1.22223 0.021582 0.541946 0 0.028021 0.018344 MAP3K4 4216 -1.20966 0.022965 0.541946 0 0.022134 0.02438 MYBL2 4605 -1.20894 0.023047 0.541946 0 0.019009 0.026644 NRD1 4898 -1.19134 0.025119 0.554677 0 0.037615 0.020389 ABAT 18 -1.17949 0.026602 0.563092 0 0.031676 0.020974 BUB1 699 -1.182 0.026282 0.563092 0 0.019108 0.030352 CHEK1 1111 -1.18271 0.026192 0.563092 0 0.016895 0.027 GCG 2641 -1.18308 0.026145 0.563092 0 0.035138 0.023086 HNRNPA2B1 3181 -1.18487 0.02592 0.563092 0 0.028307 0.023263 MUTYH 4595 -1.18165 0.026327 0.563092 0 0.026162 0.024463 NPPA 4878 -1.17877 0.026695 0.563092 0 0.032374 0.027126 AKR1C1 1645 -1.17589 0.027068 0.564274 0 0.02947 0.023281 APP 351 -1.82024 0.000622 0.319659 0 0.000667 0.000319


Appendix B

R-Codes

B.1 Compile Raw Affymetrix Data

setwd("C:\\Nicole Carr\\Affymetrix\\HG-U133A\\GSE31547")

### RAW DATA ####
library(affy)
batch=ReadAffy()
batch
data.raw=exprs(batch)
write.csv(data.raw,file="Raw_data_GSE31547.csv")
old.par= par(mfrow=c(3,1))
pheno=read.table("GSE31547.Raw.Targets.txt", header=T,sep="\t")
rownames(pheno)=pheno[["FileName"]]
pData(batch)=pheno

B.2 Raw Data Normalization

## RMA NORMALIZATION
eSet=rma(batch)
eSet

## MAS5 NORMALIZATION
library(affy)
eSet=mas5(batch)
eSet
pData(eSet)
head(exprs(eSet))
fData(eSet)
annotation(eSet)
library(multtest)
library(genefilter)
par(old.par)
sds=rowSds(exprs(eSet))
sh=shorth(sds)
sh
eSetfilt=eSet[sds>=sh,]
dim(eSet)
dim(eSetfilt)
write.csv(eSetfilt, file="Raw.MAS5normalized.GSE31547.csv")


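### Transpose file ####
filematrix <- function(ver,outfile) {
  #The opening lines of this function did not survive in the listing; the file
  #connection and the row loop are an assumed reconstruction based on the
  #surviving gsub(), writeLines() and close() calls
  writ = file(outfile,"w")
  for (i in 1:nrow(ver)) {
    str = paste(ver[i,],collapse=",")
    str = gsub("^,","",str);
    writeLines(str,con=writ,sep="\n")
  }
  close(writ)
}
matrx <- read.csv("Raw.MAS5normalized.GSE31547.csv", header=F, dec=".",sep=",")
filematrix(t(matrx),"Raw.MAS5normalized.GSE31547b.csv")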

B.3 Statistical Analysis

Experimental Design

setwd("C:\\Nicole Carr\\experimental design")
#This experimental design is for x platform.
#Read author table for Non-smokers
exprs.eset.df <- read.table("NS_tumorVnormal.txt", header=T)
#OR read author table for smokers
exprs.eset.df <- read.table("S_tumorVnormal.txt", header=T)
#OR read the MAS5 normalized data
exprs.eset.df <- read.table("MAS5tumorVnormalNS.txt", header=T)
exprs.eset.df <- read.table("MAS5tumorVnormal_S.txt", header=T)
#OR read the RMA normalized data
exprs.eset.df <- read.table("RMAtumorVnormalNS.txt", header=T)
exprs.eset.df <- read.table("RMAtumorVnormal_S.txt", header=T)

#Finding differentially expressed probesets using modified t-test
#Two Phenotypes: Normal lung or Tumor
#Replace the bracketed placeholders with the sample counts of the data set
a1=exprs.eset.df
design=cbind(rep(1,[total number]),c(rep(0,[n tumor]),rep(1,[n normal])))
design
# fit the linear model
library(limma)
fit <- lmFit(exprs.eset.df, design)
#Bayesian
ebFit <- eBayes(fit)
#Top Differential Expressed Genes
tp <- topTable(ebFit, coef=2, adjust.method="BH", number=nrow(exprs.eset.df))
write.csv(tp, file="topdiffexpgenes.csv")
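The design matrix in the listing above has an intercept column and a 0/1 indicator for the normal samples, so the second coefficient (the one requested with coef=2) is the normal-minus-tumor difference in mean expression. The following is a minimal sketch of that interpretation for a single gene. The sample counts and expression values are made up for illustration, and base R's lm.fit is used here in place of limma so the example runs without Bioconductor.

```r
# Hypothetical sample counts, for illustration only
n.tumor <- 3
n.normal <- 2

# Same layout as the design matrix in the listing:
# intercept column + indicator column for normal samples
design <- cbind(rep(1, n.tumor + n.normal),
                c(rep(0, n.tumor), rep(1, n.normal)))

# Made-up expression values for one gene (tumor samples first, then normal)
y <- c(5.0, 5.5, 6.0, 8.0, 9.0)

# Ordinary least-squares fit of the two-column model for this one gene
fit <- lm.fit(design, y)
coefs <- unname(fit$coefficients)

coefs[1]  # mean expression of the tumor samples
coefs[2]  # mean(normal) - mean(tumor)
```

The empirical Bayes step moderates the standard errors, and therefore the t-statistics, but it leaves these coefficients unchanged.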

#Ordinal Logistic Regression
source("quant.txt")
dat1 = read.csv("NS_tumorVadeno.csv")
dat.df = dat1
num.quantiles = 10


grp1 = 1:[n tumor]
grp2 = x:[n all]

## save scores ##
gather.scores = function(gene){
  quantiles = get.quantiles(dat.df[gene,],num.quantiles)
  grp1.scores = get.scores(dat.df[gene,grp1],quantiles)
  grp2.scores = get.scores(dat.df[gene,grp2],quantiles)
  c(grp1.scores,grp2.scores)
}
scores = sapply(1:nrow(dat.df),gather.scores)
scores1=t(scores)
write.csv(scores1, file="Quant.GEOID.csv")
library(rms)
g <- c(rep(1,[n tumor]),rep(2,[n normal]))
#get the p-values
ord <- function(x){
  m<-data.frame(g, x);
  ab=lrm(x ~ g,m)
  #Wald z statistic and two-sided p-value
  z=ab$coefficients[3]/sqrt(ab$var[3,3])
  p.value<-2*(1-pnorm(q = abs(z), mean = 0, sd = 1))
  return(p.value)
}
scores1=t(scores)
ordp <- apply(scores1, 1, ord)
data1=cbind(dat1, ordp)
write.csv(data1,"ordregpvalue_.csv",row.names = F)

#get the ordinal logistic regression coefficients
ord2 <- function(x){
  m<-data.frame(g, x);
  ab=lrm(x ~ g,m)
  z=ab$coefficients[3]/sqrt(ab$var[3,3])
  p.value<-2*(1-pnorm(q = abs(z), mean = 0, sd = 1))
  return(ab$coefficients)
}
scores2=t(scores)
ordp2 <- apply(scores2, 1, ord2)
data1=cbind(dat1, t(ordp2))
write.csv(data1,"RMAordregcoeff1_S.csv",row.names = F)
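The ordinal logistic regression scripts call get.quantiles and get.scores from quant.txt, which is not listed in this appendix. The sketch below shows one way such helpers could work, assuming that the pooled values of a gene are split into num.quantiles equal-probability bins and that each sample is scored by the bin it falls into (1 to num.quantiles). Both function bodies are hypothetical stand-ins written for illustration, not the code used in this study, and the expression values are made up.

```r
# Hypothetical stand-in for get.quantiles: interior cut points that split
# the values of one gene into num.quantiles equal-probability bins
get.quantiles <- function(values, num.quantiles) {
  probs <- seq_len(num.quantiles - 1) / num.quantiles
  quantile(as.numeric(values), probs = probs, na.rm = TRUE)
}

# Hypothetical stand-in for get.scores: bin number (1..num.quantiles)
# of each value, given the cut points
get.scores <- function(values, quantiles) {
  findInterval(as.numeric(values), quantiles) + 1
}

# Made-up expression values for one gene across ten samples
gene.exprs <- c(2.1, 3.4, 1.7, 5.9, 4.2, 8.8, 7.5, 6.3, 9.1, 0.4)

cuts <- get.quantiles(gene.exprs, num.quantiles = 5)
scores <- get.scores(gene.exprs, cuts)
scores
```

Scores of this kind are ordered categories, which is what makes an ordinal model such as lrm applicable regardless of the original measurement scale of each platform.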

##DISCO
source("Disco Main Code.txt")
disco(dat.filename="tumorVadenoNS.csv",num.quantiles=10,grp1=c(1:[n tumor]),grp2=c(x:[n normal]),output.filename="tdisco.affy1")

#Disco Main Code


### Code for calculating quantile frequencies ###
#Count frequencies for a given group
bin.exprs = function(grp.exprs,quantiles,labels) {
  #initialize matrix to store adjusted expression values
  values = matrix(0,ncol=1,nrow=length(grp.exprs))
  num.quantiles = length(quantiles)
  values[grp.exprs < quantiles[1]] = labels[1]
  for(index in c(2:num.quantiles)) {
    values[grp.exprs >= quantiles[index-1] & grp.exprs < quantiles[index]] = labels[index]
  }
  #values at or above the top threshold take the last label
  values[grp.exprs >= quantiles[num.quantiles]] = labels[num.quantiles+1]
  #convert expression values to quantile labels
  #values[grp.exprs < quantiles[1]] = -2
  #values[grp.exprs >= quantiles[1] & grp.exprs < quantiles[2]] = -1
  #values[grp.exprs >= quantiles[2] & grp.exprs < quantiles[3]] = 0
  #values[grp.exprs >= quantiles[3] & grp.exprs < quantiles[4]] = 1
  #values[grp.exprs >= quantiles[4]] = 2
  #function to calculate frequencies of quantile labels
  freqs = function(quantile.label) {
    sum(values == quantile.label)
  }
  #return frequencies as a vector
  return(sapply(labels,freqs))
  #return(values) #if want values, rather than freqs
}

#Setup quantiles and groups and get group frequencies
get.quantile.freqs = function(exprs,grp1,grp2,labels) {
  #add check for number of quantiles
  num.quantiles = length(labels) - 1
  quantile.unit = 1/(num.quantiles+1)
  quantile.thresholds = sapply(1:num.quantiles,function(x){quantile.unit*x})
  #calculate quantiles for gene
  # quantiles = quantile(exprs,c(0.2,0.4,0.6,0.8))
  quantiles = quantile(exprs,quantile.thresholds,na.rm=TRUE)
  #grp1.freqs = bin.exprs(exprs[grp1],quantiles,labels)
  #grp2.freqs = bin.exprs(exprs[grp2],quantiles,labels)
  grp1.values = bin.exprs(exprs[grp1],quantiles,labels)
  grp2.values = bin.exprs(exprs[grp2],quantiles,labels)
  #return(cbind(grp1.freqs,grp2.freqs))
  return (list(grp1=grp1.values,grp2=grp2.values))
}
###
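As a quick illustration of the labels and thresholds set up by the main disco function and get.quantile.freqs, the following sketch bins ten made-up expression values with num.quantiles = 4. It uses base R's findInterval as a compact stand-in for bin.exprs (an assumption made to keep the example self-contained), and the expression values are invented.

```r
num.quantiles <- 4

# Centered integer labels, built the same way as in the main disco function
num.labels <- num.quantiles + 1
labels <- seq_len(num.labels) - median(seq_len(num.labels))
if (round(labels[1]) != labels[1]) labels <- labels + 0.5

# Evenly spaced probability thresholds, as in get.quantile.freqs
thresholds <- (1:num.quantiles) / (num.quantiles + 1)

# Made-up expression values for one gene
exprs.values <- 1:10
cuts <- quantile(exprs.values, thresholds, na.rm = TRUE)

# findInterval counts how many cut points are <= each value,
# so adding 1 gives the position of the matching label
binned <- labels[findInterval(exprs.values, cuts) + 1]

# Frequency of each label, the quantity the DISCO test works with
freqs <- sapply(labels, function(l) sum(binned == l))
labels
freqs
```

With four thresholds the labels run from -2 to 2, and because the thresholds are computed from the pooled values of a gene, the pooled frequencies are roughly equal across labels; a difference between groups shows up as a shift in how each group's samples are distributed over those labels.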

###### Code for performing Disco-Normal Analysis ####
#Calculate first and second derivative
derivatives = function(fr,x,beta,k) {
  nrowx = length(x[,1])
  sampln = length(fr[1,])
  nrowb = length(beta)
  n = colSums(fr)
  #n1 = sum(fr[,1])
  #n2 = sum(fr[,2])
  xb1 = fr*x
  #xb1.1 = fr[,1]*x[,1]
  #xb1.2 = fr[,2]*x[,1]
  xb = colSums(xb1)
  #xb.1 = sum(xb1.1)
  #xb.2 = sum(xb1.2)
  xs1 = fr*(x*x)
  #xs1.1 = fr[,1]*(x*x)
  #xs1.2 = fr[,2]*(x*x)
  xs = colSums(xs1)
  #xs.1 = sum(xs1.1)
  #xs.2 = sum(xs1.2)
  nrowxs = length(xs)
  #nrowxs = length(xs.1)
  nr=matrix(0,ncol=sampln, nrow=nrowx)
  ar11= matrix(0,ncol=sampln, nrow=nrowx)
  ar21= matrix(0,ncol=sampln, nrow=nrowx)
  ar31= matrix(0,ncol=sampln, nrow=nrowx)
  ar41= matrix(0,ncol=sampln, nrow=nrowx)
  anr=matrix(0,ncol=sampln, nrow=nrowx)
  aar11= matrix(0,ncol=sampln, nrow=nrowx)
  aar21= matrix(0,ncol=sampln, nrow=nrowx)
  aar31= matrix(0,ncol=sampln, nrow=nrowx)
  aar41= matrix(0,ncol=sampln, nrow=nrowx)
  f1= matrix(0,ncol=1, nrow=nrowb)
  f2= matrix(0,ncol=1, nrow=3)
  d1= matrix(0,ncol=nrowb, nrow=nrowb)
  d2= matrix(0,ncol=3, nrow=3)
  for (j in 1:nrowx) {
    for (i in 1:sampln) {
      a12= cbind(x[j,i], x[j,i]^2)
      if (k==1) {
        beta2= rbind(beta[i], beta[sampln+i])
        nr[j,i]=exp(a12%*%beta2)
        ar11[j,i]=x[j,i]*nr[j,i]
        ar21[j,i]=(x[j,i]^2)*nr[j,i]
        ar31[j,i]=(x[j,i]^3)*nr[j,i]
        ar41[j,i]=(x[j,i]^4)*nr[j,i]
      }
      if (k==2) {
        beta2=rbind(beta[1], beta[i+1])
        anr[j,i]=exp(a12%*%beta2)
        aar11[j,i]=x[j,i]*anr[j,i]
        aar21[j,i]=(x[j,i]^2)*anr[j,i]
        aar31[j,i]=(x[j,i]^3)*anr[j,i]
        aar41[j,i]=(x[j,i]^4)*anr[j,i]
      }
    }
  }
  ar= colSums(nr)
  aar= colSums(anr)
  ar1= colSums(ar11)
  ar2= colSums(ar21)
  ar3= colSums(ar31)
  ar4= colSums(ar41)
  aar1= colSums(aar11)
  aar2= colSums(aar21)
  aar3= colSums(aar31)
  aar4= colSums(aar41)
  ac1=matrix(0, ncol=1, nrow=sampln)
  ac2=matrix(0, ncol=1, nrow=sampln)
  acc1=0
  acc2=0
  xbb=0
  for (l in 1:sampln) {
    if (k==1) {
      f1[l]=xb[l]-n[l]*(ar1[l]/ar[l])
      f1[l+sampln]=xs[l]-n[l]*(ar2[l]/ar[l])
      d1[l, l]=-n[l]*(((-ar1[l]^2)+(ar[l]*ar2[l]))/(ar[l]^2))
      d1[l, sampln+l]=-n[l]*(((-ar2[l]*ar1[l])+(ar[l]*ar3[l]))/(ar[l]^2))
      d1[sampln+l,l]=d1[l,sampln+l]
      d1[l+sampln, l+sampln]=-n[l]*(((-ar2[l]^2)+(ar[l]*ar4[l]))/(ar[l]^2))
    }
    if (k==2) {
      ac1[l]=n[l]*(aar1[l]/aar[l])
      ac2[l]=n[l]*(((-aar1[l]^2)+(aar[l]*aar2[l])))/(aar[l]^2)
      acc1=acc1+ac1[l]
      acc2=acc2+ac2[l]
      f2[l+1]=xs[l]-n[l]*(aar2[l]/aar[l])
      d2[l+1,l+1]=-n[l]*(((-aar2[l]^2)+(aar[l]*aar4[l]))/(aar[l]^2))
      d2[1, l+1]=-n[l]*(((-aar2[l]*aar1[l])+(aar[l]*aar3[l]))/(aar[l]^2))
      d2[l+1,1]=d2[1, l+1]
      xbb=xbb+xb[l]
    }
  }
  if(k==1) {
    f=f1
    d=d1
  }
  if (k==2) {
    f2[1]=xbb-acc1
    d2[1, 1] = -acc2
    f=f2
    d=d2
  }
  return(list(f,d))
}

#Optimize function
newton=function(f,d,beta, tol,fr,x,k) {
  if (k==2) {
    beta=matrix(0, ncol=1, nrow=3)
  }
  #f is a vector; the loop tests its first element, since a while() condition
  #must have length one (longer conditions are an error from R 4.2.0)
  while(abs(f[1]) > tol ) {
    f3=derivatives(fr=fr,x=x,beta=beta,k=k)
    f=f3[[1]]
    d=f3[[2]]
    delta=solve(-1*d)%*%f
    beta=beta+delta
  }
  return(beta)
}

#Compute Chi-Square and p-value for data
disco.chisq = function(fr,x,k) {
  beta=matrix(0, ncol=1, nrow=4)
  f3=derivatives(fr,x,beta,k)
  f=f3[[1]]
  d=f3[[2]]
  beta1=newton(f,d, beta, tol=0.000000000001,fr,x,k=1)
  beta2=newton(f,d, beta, tol=0.000000000001,fr,x,k=2)
  logl=matrix(0,ncol=2, nrow=1)
  sampln=length(fr[1,])
  n1=colSums(fr)
  xb1=fr*x
  xb= colSums(xb1)
  xs1=fr*(x*x)
  xs=colSums(xs1)
  nrowx=length(x[,1])
  nr=matrix(0,ncol=sampln, nrow=nrowx)
  larp1=matrix(0,ncol=1, nrow=2)
  for (k in 1:2) {
    if (k==1) {
      alpha=beta1[1:2]
      bes=beta1[3:4]
    }
    if (k==2) {
      alpha=matrix(beta2[1], ncol=1, nrow=2)
      bes=beta2[2:3]
    }
    for (j in 1:nrowx) {
      for (i in 1:sampln) {
        a12= cbind(x[j,i], x[j,i]^2)
        beta2a= rbind(alpha[i], bes[i])
        nr[j,i]=exp(a12%*%beta2a)
      }
    }
    br=colSums(nr)
    larp1[k]=n1%*%log(br)
    logl[k]=xb%*%alpha+xs%*%bes-larp1[k]
  }
  #likelihood ratio statistic with one degree of freedom
  chisq=-2*(logl[2]-logl[1])
  pvalue=1-pchisq(chisq,1)
  return(list("beta1"=beta1,"beta2"=beta2,"chisq"=chisq,"pvalue"=pvalue))
  #return(pvalue)
}

#Calculate Goodness of Fit
#The expected frequencies are computed for five labels (-2 to 2)
gof = function(fr,beta) {
  sum=matrix(0, ncol=1, nrow=2)
  pii=matrix(0, ncol=2, nrow=5)
  pex=matrix(0, ncol=2, nrow=5)
  tp= colSums(fr)
  for (st in 1:2) {
    for (s in -2:2) {
      sum[st]= sum[st] +exp(s*beta[st] + (s^2)*beta[st+2])
    }
  }
  for (i in 1:5) {
    for (j in 1:2) {
      pii[i,j]= tp[j]*(exp((i-3)*beta[j] + ((i-3)^2)*beta[j+2]))/sum[j]
    }
  }
  for (i in 1:5) {
    for (j in 1:2) {
      pex[i,j]= (fr[i,j] - pii[i,j])^2 /pii[i,j]
    }
  }
  ch2=colSums(pex)
  p_fit=1-pchisq(ch2, 4)
  #return(list(gof=ch2,p.value=p_fit))
  return(list("grp1.gof"=ch2[1],"grp2.gof"=ch2[2],"grp1.pvalue"=p_fit[1],"grp2.pvalue"=p_fit[2]))
}
###

####Calculate p-value using t-test
get.pval.ttest = function(exprs,index1,index2,datafilter=as.numeric) {
  t.test(datafilter(exprs[index1]),datafilter(exprs[index2]))$p.value
}

#main function
disco = function(dat.filename,num.quantiles,grp1,grp2,output.filename) {
  if(length(grep("txt",dat.filename)) == 1) {
    dat.df = read.table(dat.filename,sep="\t",header=T)
  } else {
    if(length(grep("csv",dat.filename)) == 1) {
      dat.df = read.table(dat.filename,sep=",",header=T)
    } else {
      cat("Input file must end in .txt or .csv")
    }
  }
  num.labels = num.quantiles + 1
  seq = seq(length.out = num.labels)
  labels = seq - median(seq)
  #if num.labels is even, will get half values, so add 0.5
  if(round(labels[1]) != labels[1]) {
    labels = labels + 0.5
  }
  get.disco.results = function(gene.exprs) {
    fr = get.quantile.freqs(exprs=as.numeric(gene.exprs),grp1=grp1,grp2=grp2,labels=labels)
    fr.mat = cbind(fr$grp1,fr$grp2)
    disco.res = disco.chisq(fr=fr.mat,x=cbind(labels,labels),k=2)
    disco.res[["ttest.pvalue"]] = get.pval.ttest(gene.exprs,grp1,grp2)
    disco.res[["shapiro.pvalue.grp1"]] = shapiro.test(gene.exprs[grp1])$p.value
    disco.res[["shapiro.pvalue.grp2"]] = shapiro.test(gene.exprs[grp2])$p.value
    gof = gof(fr=fr.mat,beta=disco.res$beta1)
    #disco.res
    return(list("gof"=gof,
                "chisq"=disco.res$chisq,
                "pvalue"=disco.res$pvalue,
                "ttest.pvalue"=disco.res$ttest.pvalue,
                "shapiro.pvalue.grp1"=disco.res$shapiro.pvalue.grp1,
                "shapiro.pvalue.grp2"=disco.res$shapiro.pvalue.grp2))
  }
  disco.res = apply(dat.df,1,get.disco.results)
  grp1.gof.pvals = function(x) {disco.res[[x]]$gof$grp1.pvalue}
  grp2.gof.pvals = function(x) {disco.res[[x]]$gof$grp2.pvalue}
  chisq.pvals = function(x) {disco.res[[x]]$pvalue}
  ttest.pvals = function(x) {disco.res[[x]]$ttest.pvalue}
  shapiro.pvals.grp1 = function(x) {disco.res[[x]]$shapiro.pvalue.grp1}
  shapiro.pvals.grp2 = function(x) {disco.res[[x]]$shapiro.pvalue.grp2}
  pvals=cbind("GOF Grp 1"=sapply(1:length(disco.res),grp1.gof.pvals),
              "Shapiro Grp 1"=sapply(1:length(disco.res),shapiro.pvals.grp1),
              "GOF Grp 2"=sapply(1:length(disco.res),grp2.gof.pvals),
              "Shapiro Grp 2"=sapply(1:length(disco.res),shapiro.pvals.grp2),
              "ChiSq"=sapply(1:length(disco.res),chisq.pvals),
              "t-test"=sapply(1:length(disco.res),ttest.pvals))
  outfile = paste(output.filename,".csv",sep="")
  write.csv(pvals, file=outfile)
}
