GENETIC AND NON-GENETIC FACTORS AFFECTING SURVIVAL OF GLIOBLASTOMA MULTIFORME PATIENTS

By

Kuan-Han Hank Wu

A THESIS

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

Biostatistics – Master of Science

2016

ABSTRACT

GENETIC AND NON-GENETIC FACTORS AFFECTING SURVIVAL OF GLIOBLASTOMA MULTIFORME PATIENTS

By

Kuan-Han Hank Wu

Glioblastoma Multiforme (GBM) is the most malignant and most common primary brain tumor. Despite improvements in clinical treatment, the five-year survival rate is only 10%. In recent years, given the increased availability of omic data, the emphasis in the analysis of cancer datasets has shifted from non-genetic factors toward understanding the genetic factors contributing to the disease. The goal of this study is to jointly model time to death of GBM patients using clinical and demographic covariates along with target single nucleotide polymorphisms (SNP) and gene expression (GE). We used two datasets: The Cancer Genome Atlas (TCGA) and Surveillance, Epidemiology, and End Results (SEER). In concordance with previous studies, the non-genetic factors found to significantly affect survival were age, race, area of residence in the US, treatment, Karnofsky performance score, number of primary tumors, and tumor location. Additionally, expression quantitative trait loci (eQTL) analysis was performed to assess the genetic regulation of expression, and a gene expression wide association study (GE-WAS) was performed to capture associations between survival and whole-genome gene expression. Two hotspot SNP were identified by eQTL analysis; allele ‘A’ tended to be associated with longer survival at both hotspots, but neither hotspot was significantly associated with survival. Finally, fourteen genes were found to be significantly associated with survival in GBM patients, although this significance disappeared after controlling for demographic covariates.

ACKNOWLEDGEMENTS

I would like to express my gratitude to all the committee members for their insight and expertise. I would especially like to thank Dr. Ana I. Vazquez, my committee chair. Without her excellent advice, input, and mentorship, the door of statistical genetics would not have been opened, and I would not have been able to finish my thesis. I would also like to thank Dr. Juan Pedro Steibel. His guidance in the methodology of eQTL analysis helped me to build the heart of my thesis. Finally, I would like to thank Dr. Gustavo de los Campos and Dr. Joseph Gardiner for their professional opinions and invaluable improvements to my thesis.

I also thank my lab members Yeni Bernal, Agustin Gonzalez, and Yogasudha Veturi for their support in data analysis, statistical software coding, and their further help in writing and conducting the study. Thanks also to the entire QuantGen research group for their encouragement and the supportive atmosphere they provided. I would like to recognize The Cancer Genome Atlas and Surveillance, Epidemiology, and End Results for making datasets available.

Finally, to all my friends and family members, this would not have been possible without your help. Your continued support during my years of study was essential to me.

TABLE OF CONTENTS

CHAPTER 1 INTRODUCTION
1.1. Background
1.2. Aims
REFERENCES

CHAPTER 2 NON-GENETIC FACTORS AFFECTING SURVIVAL OF GLIOBLASTOMA MULTIFORME PATIENTS
2.1. Introduction
2.2. Materials and Methods
2.2.1. Data
2.2.1.1. TCGA
2.2.1.2. SEER
2.2.2. Statistical Models
2.3. Results
2.3.1. Descriptive Statistics
2.3.1.1. TCGA
2.3.1.2. SEER
2.3.2. Survival Analysis
2.3.2.1. TCGA
2.3.2.2. SEER
2.4. Discussion and Conclusions
APPENDICES
Appendix A. Data dictionary for TCGA clinical dataset
Appendix B. Data dictionary for SEER clinical dataset
Appendix C. Summary of hazard ratio estimates for clinical covariates (TCGA)
Appendix D. Summary of hazard ratio estimates for clinical covariates (SEER)
REFERENCES

CHAPTER 3 ANALYSIS OF EXPRESSION QUANTITATIVE TRAIT LOCI IN GLIOBLASTOMA MULTIFORME TUMORS
3.1 Introduction
3.2 Materials and Methods
3.2.1 Data
3.2.2 Identification of confounders and pre-correction of gene expression
3.2.3 Statistical models
3.2.3.1 Expression quantitative trait loci (eQTL)
3.2.3.1.1 Model for eQTL analyses
3.2.3.1.2 Hotspot estimation
3.2.3.1.3 Target genes selection
3.2.3.2 Hotspot genotype survival analyses
3.2.3.3 Gene Expression Wide Association Study (GE-WAS)
3.3 Results
3.3.1 Expression quantitative trait loci
3.3.1.1 eQTL analysis
3.3.1.2 Hotspot estimation
3.3.1.3 Target gene selection
3.3.2 Survival analyses for hotspot
3.3.3 Gene Expression Wide Association Study
3.4 Discussion and Conclusions
APPENDICES
Appendix A. Identification of confounders in gene expression
Appendix B. Accounting for population structure in eQTL model
Appendix C. Multiple-test correction methods
Appendix D. Subsets of target genes
REFERENCES

CHAPTER 4 SUMMARY AND CONCLUSIONS

LIST OF TABLES

Table 2.1 Description of the Karnofsky Performance Score

Table 2.2 Descriptive statistics of clinical covariates at cancer diagnosis by vital status at year 1 and year 2 after diagnosis (TCGA).

Table 2.3 Descriptive statistics of clinical covariates at cancer diagnosis by vital status at year 1 and year 2 after diagnosis (SEER).

Table 2.4 ICD-O-3 primary brain tumor topographic sites

Table 2.5 Data dictionary for TCGA clinical dataset

Table 2.6 Data dictionary for SEER clinical dataset

Table 2.7 Hazard ratio, confidence interval (95%), and p-value for clinical factors (TCGA).

Table 2.8 Hazard ratio, confidence interval (95%), and p-value for clinical factors (SEER).

Table 3.1 Frequency table of number of significant genes per SNP (FDR<0.05). For example, one SNP is significantly associated to 20 genes and 26 genes (significant hotspot).

Table 3.2 Number of cis- / trans- acting genes, , and position in the hotspots, rs3172494 and rs7698461.

Table 3.3 Number of patients of genotype with 0, 1, or 2 copies of minor alleles in hotspot.

Table 3.4 Number of genes statistically significant to survival in different gene sets, compared across multiple-test corrections and thresholds.

Table 3.5 Summary of differential expression analysis for baseline model

Table 3.6 Summary of differential expression analysis for model adjusted by batch effect

Table 3.7 Possible outcome of performing m hypothesis tests simultaneously.

Table 3.8 Subsets of target genes

LIST OF FIGURES

Figure 2.1 Flowchart identifying the data cleaning process and the final sample size.

Figure 2.2 Histogram of age at initial pathologic diagnosis and non-parametric density curve

Figure 2.3 Proportion of vital status by age group (quartiles 25, 50, 75, 100). A. Vital status at year 1 in each age group. B. Vital status at year 2 in each age group.

Figure 2.4 Survival curves by age group (TCGA). The solid line represents the mean of the group. Years: years of follow up since diagnosis.

Figure 2.5 Survival curves by Karnofsky Performance Score (TCGA). The solid line represents the mean of the group. Years: years of follow up since diagnosis.

Figure 2.6 Survival curves by targeted molecular therapy (TCGA). The solid line represents the mean of the group. Years: years of patient survival after diagnosis.

Figure 2.7 Survival curves by interaction of age and gender (TCGA). The solid line represents the mean of the group. Years: years of patient survival after diagnosis.

Figure 2.8 Survival curves by age group (SEER). The solid line represents the mean of the group. Years: years of patient survival after diagnosis.

Figure 2.9 Survival curves by race (SEER). The solid line represents the mean of the group. Years: years of patient survival after diagnosis.

Figure 2.10 Survival curves by area of residence in the US (SEER). The solid line represents the mean of the group. Years: years of patient survival after diagnosis.

Figure 2.11 Survival curves by surgery (SEER). The solid line represents the mean of the group. Years: years of patient survival after diagnosis.

Figure 2.12 Survival curves by radiation (SEER). The solid line represents the mean of the group. Years: years of patient survival after diagnosis.

Figure 2.13 Survival curves by sequence of radiation and surgery (SEER). The solid line represents the mean of the group. Years: years of patient survival after diagnosis.

Figure 2.14 Survival curves by number of primary tumors (SEER). The solid line represents the mean of the group. Years: years of patient survival after diagnosis.

Figure 2.15 Survival curves by sequence of primary tumor (SEER). The solid line represents the mean of the group. Years: years of patient survival after diagnosis.

Figure 2.16 Survival curves by primary tumor site (SEER). The solid line represents the mean of the group. Years: years of patient survival after diagnosis.

Figure 3.1 Data available and omics sample size for TCGA

Figure 3.2 Flowchart of observed and permuted false eQTL hotspots. The left side represents the observed dataset, and the right side represents the generation of a new hypothesis by permuting the SNP by individual. The false eQTL results (right) were expected to have fewer associations that are significant because of a biological effect.

Figure 3.3 Flowchart showing the process to select a threshold number of genes significantly associated to a SNP. Above the threshold, the SNP was defined to be a hotspot.

Figure 3.4 Subsets of genes selected from the whole genome set

Figure 3.5 Flowchart of eQTL analysis, showing the number of parameters and the number of effects in each set of analyses. Cis-acting results are shown on the left and trans-acting results on the right. The study found 1,190 cis-eQTL and 4,239 trans-eQTL.

Figure 3.6 eQTL map. Dots represent statistically significant eQTL, located by SNP position (x-axis) and gene position (y-axis). Blue dots are the cis-eQTL and red dots are the hotspot eQTL. The axes show the human chromosome numbers, and the SNP (x) and GE (y) are ordered by position in the chromosome.

Figure 3.7 Manhattan plot for the number of significant genes per SNP observed in the unpermuted data. Using 18 genes as the threshold to define a hotspot, 2 SNP, rs7698461 and rs3172494, were found to be hotspots in GBM patients. SNP are ordered by position in the human chromosome.

Figure 3.8 Manhattan plot for the 99th percentile of the number of genes observed to be significant in 100 permutations per SNP. The dashed line is the significance threshold of 18 genes. SNP are ordered by position in the human chromosome.

Figure 3.9 Survival curves of hotspot rs3172494 by genotype. The solid line represents the mean of the group, and the dotted line represents the 95% confidence interval. Years: years of patient survival after diagnosis.

Figure 3.10 Survival curves of hotspot rs7698461 by genotype. The solid line represents the mean of the group, and the dotted line represents the 95% confidence interval. Years: years of patient survival after diagnosis.

Figure 3.11 Scatter plot of heritability and significance of the genetic effect for the 10,883 genes.

KEY TO ABBREVIATIONS

ABTA American Brain Tumor Association

CLIN Clinical Information

DE Differentially Expressed

EE Equally Expressed

eQTL Expression Quantitative Trait Loci

FDR False Discovery Rate

GBM Glioblastoma Multiforme

GE Gene Expression

GE-WAS Gene Expression Wide Association Study

ICD-O-3 International Classification of Diseases for Oncology, 3rd Edition

KPS Karnofsky Performance Score

LD Linkage Disequilibrium

Mb Megabase

PC Principal Component

pFDR Positive False Discovery Rate

PHM Proportional Hazard Model

SD Standard Deviation

SEER Surveillance, Epidemiology, and End Results

SNP Single Nucleotide Polymorphism

TCGA The Cancer Genome Atlas

TMZ Temozolomide

CHAPTER 1 INTRODUCTION

1.1. Background

Tumors that form in the glial cells of the brain are called ‘gliomas’, and astrocytomas are one type of glioma. Astrocytomas are graded on a scale of I to IV (ABTA, 2014). On this scale, the higher the tumor grade, the more malignant the tumor; conversely, the lower the grade, the higher the chance of recovery (Frederick, 2002). Among astrocytomas, Glioblastoma Multiforme (GBM) is the highest-grade tumor (grade IV): it grows rapidly, spreads faster than lower-grade tumors, and has an ample blood supply nourishing the tumor cells. GBM is the most malignant form of astrocytoma and the most common and lethal primary brain tumor, representing 15.4% of all primary brain tumors, 50% of gliomas, and 60-75% of astrocytomas (ABTA, 2014; Frederick, 2002).

There are two subtypes of glioblastoma: de novo (new or primary) and secondary. Primary GBM tumors form quickly, without clinical or histological evidence of a less malignant precursor lesion. About 90% of glioblastomas are primary, and most patients with primary GBM are elderly, with a mean age of 62 years (Ohgaki, 2005). Secondary GBM tumors, in contrast, develop from a lower-grade astrocytoma (grade II or III) that has progressed through genetic mutation to a highly malignant glioblastoma (Ohgaki & Kleihues, 2013). Secondary GBM tumors grow more slowly and take longer to form, but they are still aggressive. About 10% of GBM are secondary, and the mean age of these patients is 45 years (ABTA, 2014; Ohgaki & Kleihues, 2013).

In general, GBM rarely spreads to other places in the body beyond the cerebral hemispheres of the brain (ABTA, 2014). Because the tumor grows rapidly and the skull cannot expand, pressure in the brain increases, leading to various symptoms including headache, nausea, vomiting, and drowsiness. Depending on its location in the brain, the tumor may also cause weakness on one side of the body, memory and/or speech difficulties, and visual changes (ABTA, 2014).

Glioblastomas contain a variety of cell types and materials, including cystic mineral, calcium deposits, blood vessels, and a mixed grade of cells. This variety presents an additional challenge for GBM therapy. Because different cells react differently to each therapy, a better strategy is to combine several approaches. Generally, in order to reduce the pressure on the brain, the first step is to surgically remove as much of the tumor as possible. When the tumor cannot be removed completely through surgery, the next step consists of radiation and chemotherapy to slow its growth (ABTA, 2014).

In terms of treatment, the standard of care for GBM changed in 2005; today the gold-standard first-line treatment consists of radiotherapy combined with chemotherapy with temozolomide (TMZ). The median survival time improved slightly from the pre-temozolomide era (2000-2003; surgery plus radiation) to the post-temozolomide era (2005-2008; surgery, radiation, and TMZ), increasing from 12.0 to 14.2 months (Johnson & O’Neill, 2012). The use of radiation with TMZ has been suggested to increase the survival rate in GBM patients (Preusser et al., 2015). Specifically, the unadjusted hazard of death for patients who received radiotherapy plus TMZ was significantly lower (hazard ratio: 0.63) than for patients who received radiotherapy alone. The difference in survival between the two treatments was statistically significant, with a two-year survival rate of 10.4% in patients who received radiotherapy alone compared with 26.5% in patients who received radiotherapy and TMZ (Stupp et al., 2005).

According to Stupp et al. (2005), the median survival time for adults with GBM treated with TMZ and radiation therapy is around 14.6 months (95% CI: 13.2-16.8). The American Brain Tumor Association (ABTA, 2014) reported survival rates of 30% and 10% at two and five years, respectively. Children tend to live longer than adults, with a five-year survival close to 40% in children under 16 years old (Song et al., 2010). However, although the standard treatment has improved, GBM patients have not benefited in the long term. There has been only a modest improvement in overall survival, with the one-year survival rate increasing slightly from 40% to 46% with radiation and chemotherapy (van Tellingen et al., 2015).

The American Brain Tumor Association reports that disease incidence varies with age and gender. In general, people between 45 and 65 years of age have the highest risk of GBM, and men have a higher risk than women, with a male-to-female ratio of 1.3 to 1 (ABTA, 2014; Moon & Lesniak, 2014). Racial differences have also been observed in many studies (ABTA, 2014; Muquit et al., 2015; Moon & Lesniak, 2014). For instance, the incidence of GBM in European Americans is double that in African Americans, and Asians have the lowest risk of the three racial groups (Muquit et al., 2015). The average rate for North American and European nations is 3-4 cases per 100,000 people per year, with a lower incidence rate in Asia and South America. This regional pattern of incidence is frequently observed for tumors occurring in the central nervous system (Muquit et al., 2015).

Glioblastoma multiforme is the most common and most malignant brain cancer. Despite the improvements in clinical care and the treatments developed in the past decades, the survival rate is still low. Previous studies (Nelson et al., 2012; Muquit et al., 2015) focused mainly on clinical, geographical, and nutritional factors and on socio-economic status as risk factors. Since cancer is a disease of the genome, and given the availability of gene expression information, studying tumor gene expression to gain in-depth knowledge of cancer has become more important.

One of the hopes of genetic studies is to find novel markers that may be driving the disease and could lead to more accurate prediction of disease progression (Parsons et al., 2008). Henshall et al. (2003) applied multivariate survival analysis to gene expression profiles obtained from prostate cancer patients to identify markers marginally associated with prostate cancer relapse and to enhance the ability to predict relapse. Likewise, research in lung cancer has shown that gene expression profiles can be used to predict the survival of patients in early stages of lung adenocarcinoma (Beer et al., 2002). According to Beer et al. (2002), the selected genes make it possible to identify groups of higher-risk patients who would benefit from additional therapy in early stages. However, the inclusion of gene expression information in models for GBM patients is still incipient, and further evidence and methodological developments are required to demonstrate the importance of genetic factors in the early prediction, diagnosis, and prognosis of the disease.

1.2. Aims

The goal of this study is to model the survival rate and vital status of GBM patients using clinical and demographic covariates (Chapter 2) alongside high-dimensional genomic information (Chapter 3), in order to better characterize GBM tumors and understand the disease. In addition to incorporating gene expression information into survival models, we investigate the genetic regulation of gene expression and associate variants involved in that regulation with survival (Chapter 3).

REFERENCES

ABTA - American Brain Tumor Association (2014). Glioblastoma and Malignant Astrocytoma. Available at: http://www.abta.org/secure/glioblastoma-brochure.pdf.

Beer, D. G., Kardia, S. L., Huang, C. C., Giordano, T. J., Levin, A. M., Misek, D. E., Lin, L., Chen, G., Gharib, T. G., Thomas, D. G., Lizyness, M. L., Kuick, R., Hayasaka, S., Taylor, J. M. G., Iannettoni, M. D., Orringer, M. B., & Hanash, S. (2002). Gene-expression profiles predict survival of patients with lung adenocarcinoma. Nature Medicine, 8(8), 816-824.

Frederick, L., Page, D. L., Fleming, I. D., Fritz, A. G., Balch, C. M., Haller, D. G., & Morrow, M. (2002). AJCC cancer staging manual (Vol. 1). Springer Science & Business Media.

Henshall, S. M., Afar, D. E., Hiller, J., Horvath, L. G., Quinn, D. I., Rasiah, K. K., Gish, K., Willhite, D., Kench, J. G., Gardiner-Garden, M., Stricker, P. D., Scher, H. I., Grygiel, J. J., Agus, D. B., Mack, D. H., & Sutherland, R. L. (2003). Survival analysis of genome-wide gene expression profiles of prostate cancers identifies new prognostic targets of disease relapse. Cancer Research, 63(14), 4196-4203.

Johnson, D. R., & O’Neill, B. P. (2012). Glioblastoma survival in the United States before and during the temozolomide era. Journal of Neuro-Oncology, 107(2), 359-364.

Moon, K., & Lesniak, M. S. (2014). Glioblastoma: Risk factors, diagnosis and treatment options. International Journal of Cancer Research and Prevention, 7(3/4), 183.

Muquit, S., Parks, R., & Basu, S. (2015). Socio-economic characteristics of patients with glioblastoma multiforme. Journal of Neuro-Oncology, 1-5.

Nelson, J. S., Burchfiel, C. M., Fekedulegn, D., & Andrew, M. E. (2012). Potential risk factors for incident glioblastoma multiforme: The Honolulu heart program and honolulu-asia aging study. Journal of Neuro-Oncology, 109(2), 315-321.

Ohgaki, H. (2005). Genetic pathways to glioblastomas. Neuropathology, 25(1), 1-7.

Ohgaki, H., & Kleihues, P. (2013). The definition of primary and secondary glioblastoma. Clinical Cancer Research, 19(4), 764-772.

Parsons, D. W., Jones, S., Zhang, X., Lin, J. C. H., Leary, R. J., Angenendt, P., Mankoo, P., Carter, H., Siu, I., Gallia, G. L., Olivi, A., McLendon, R., Rasheed, B. A., Keir, S., Nikolskaya, T., Nikolsky, Y., Busam, D. A., Tekleab, H., Diaz, L. A., Hartigan, J., Smith, D. R., Strausberg, R. L., Marie, S. K. N., Shinjo, S. M. O., Yan, H., Riggins, G. J., Bigner, D. D., Karchin, R., Papadopoulos, N., Parmigiani, G., Vogelstein, B., Velculescu, V. E., & Kinzler, K. W. (2008). An integrated genomic analysis of human glioblastoma multiforme. Science, 321(5897), 1807-1812.

Preusser, M., Lim, M., Hafler, D. A., Reardon, D. A., & Sampson, J. H. (2015). Prospects of immune checkpoint modulators in the treatment of glioblastoma. Nature Reviews Neurology, 11(9), 504-514.

Song, K. S., Phi, J. H., Cho, B., Wang, K., Lee, J. Y., Kim, D. G., Kim, I. H., Ahn, H. S., Park, S., & Kim, S. (2010). Long-term outcomes in children with glioblastoma. Journal of Neurosurgery: Pediatrics, 6(2), 145.

Stupp, R., Mason, W. P., van den Bent, M. J., Weller, M., Fisher, B., Taphoorn, M. J., Belanger, K., Brandes, A. A., Marosi, C., Bogdahn, U., Curschmann, J., Janzer, R. C., Ludwin, S. K., Gorlia, T., Allgeier, A., Lacombe, D., Cairncross, G., Eisenhauer, E., & Mirimanoff, R. O. (2005). Radiotherapy plus concomitant and adjuvant temozolomide for glioblastoma. New England Journal of Medicine, 352(10), 987-996.

van Tellingen, O., Yetkin-Arik, B., de Gooijer, M. C., Wesseling, P., Wurdinger, T., & de Vries, H. E. (2015). Overcoming the blood–brain tumor barrier for effective glioblastoma treatment. Drug Resistance Updates, 19, 1-12.

CHAPTER 2 NON-GENETIC FACTORS AFFECTING SURVIVAL OF GLIOBLASTOMA MULTIFORME PATIENTS

2.1. Introduction

Glioblastoma Multiforme (GBM) is the highest-grade (grade IV) astrocytoma. It grows rapidly and is the most malignant and most lethal primary brain tumor. The median overall survival is short: only 12 months for GBM, in contrast to 44.4 months for low-grade gliomas (grade II). Even with the standard treatment for GBM (radiation and chemotherapy with temozolomide), the median overall survival increases by only 2.2 months, to 14.2 months (Johnson & O’Neill, 2012; Yoshida et al., 2015).

Genetic and non-genetic factors are known to affect the survival rate of GBM patients. In this chapter, we conduct a survival analysis to identify the non-genetic covariates associated with survival in GBM patients. We used two datasets to identify risk factors associated with mortality: a Cox proportional hazards model was applied to The Cancer Genome Atlas (TCGA) data as well as to the Surveillance, Epidemiology, and End Results (SEER) data. To test for differences in survival between the groups defined by each risk factor, we used the log-rank test.

2.2. Materials and Methods

2.2.1. Data

This study was developed with two datasets. The first, The Cancer Genome Atlas (TCGA), has a rich array of omic platforms performed on brain tumors and supports a range of genomic research objectives. The second is from the Surveillance, Epidemiology, and End Results (SEER) program. Although SEER does not contain any genomic information, it has a larger sample size than TCGA, which allows us to validate the findings on the effects of demographic and clinical factors on GBM survival. The datasets and the data editing process are described in the following sections.

2.2.1.1. TCGA

TCGA is a collaboration between the National Cancer Institute (NCI) and the National Human Genome Research Institute (NHGRI), providing comprehensive, high-dimensional genomic information and multiple data types for 34 major types and subtypes of human cancer from more than 11,000 individuals.

In this study we use data from 523 participants with GBM (after editing) made available by TCGA. Before analysis we excluded patients with conditions (e.g., previous cancers) that may significantly change the omics of the tumor. We also discarded patient groups with few subjects whose omics may differ significantly from the rest of the sample (e.g., patients of Asian origin). Specifically, the patients excluded (with sample sizes) were those with: (i) a prior cancer diagnosis (n=1), (ii) a prior brain tissue diagnosis of lower grade glioma (n=15), (iii) a history of neoadjuvant treatment for the tumor submitted to TCGA (n=21), (iv) Asian origin (n=13), (v) no vital status (n=2), (vi) no chemotherapy (n=20), and (vii) zero days to last follow up (n=1). The data cleaning process is shown in Figure 2.1.

[Figure 2.1 flowchart: TCGA Glioblastoma Multiforme (n=593). Exclusions: prior brain tissue diagnosis of lower grade glioma (n=15); history of neoadjuvant treatment (n=21); prior cancer diagnosed (n=1); no chemotherapy (n=20); Asian (n=13); time to event equal to 0 (n=1); vital status (n=2). Final sample size for GBM (n=523).]

Figure 2.1 Flowchart identifying the data cleaning process and the final sample size.

Demographic and clinical covariates included in the analysis were selected based on their marginal association with survival. Stepwise regression was performed for variable selection, using vital status at year 1 and year 2 and days to last follow up as response variables. Covariates were first tested for significance one at a time by forward selection. A backward elimination step followed, after which all significant variables were retained in the model and those not significant were excluded.

The Karnofsky performance score (KPS) is a numerical scale (Table 2.1) that evaluates the general well-being of cancer patients. It represents a patient's functional capabilities in daily life. It can also be used to evaluate whether the patient can receive chemotherapy and at what dosage. The higher the score, the better the prognosis for survival in cancer patients.

Table 2.1 Description of the Karnofsky Performance Score

KPS   Performance of patients
100   Normal; no complaints; no evidence of disease.
 90   Able to carry on normal activity; minor signs or symptoms of disease.
 80   Normal activity with effort; some signs or symptoms of disease.
 70   Cares for self; unable to carry on normal activity or to do active work.
 60   Requires occasional assistance, but is able to care for most personal needs.
 50   Requires considerable assistance and frequent medical care.
 40   Disabled; requires special care and assistance.
 30   Severely disabled; hospital admission is indicated but death not imminent.
 20   Very sick; hospital admission and active supportive treatment necessary.
 10   Moribund; fatal processes progressing rapidly.
  0   Dead

Score from 0 to 100. The higher the score, the better the patients’ performance. (Crooks, V., Waller, S., Smith, T., & Hahn, T. J. (1991). The use of the Karnofsky Performance Scale in determining outcomes and risk in geriatric outpatients. Journal of Gerontology, 46(4), M139-M144.)

We are also interested in the interaction of age group and gender. Hence, we created a new variable, ‘age-gender’, using 60 years as the age cutoff: patients older than 60 years form one group, and patients aged 60 or younger form a second group. The four age-gender groups are female ≤ 60, male ≤ 60, female > 60, and male > 60. In addition, the 25%, 50%, and 75% quartiles of the distribution of age at initial pathologic diagnosis were used to create age categories. The four age groups are <50, 51-60, 61-69, and >69. The final edited clinical dataset includes data from 523 patients and 6 clinical and demographic explanatory variables (see a comprehensive description in Appendix A).
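The derived variables described above can be illustrated with a minimal Python sketch. It is not the code used in the thesis: the function names, the label strings, and the cut points (50, 60, 69) are assumptions made for illustration, and placing an age that falls exactly on a cut point in the lower category is also an assumption.

```python
import bisect

def age_gender_group(age, gender, cutoff=60):
    """Return the age-gender label, e.g. 'female <= 60' or 'male > 60'."""
    age_label = "<= 60" if age <= cutoff else "> 60"
    return f"{gender.lower()} {age_label}"

def age_category(age, cut_points=(50, 60, 69)):
    """Return the index (0-3) of the age category.

    cut_points are hypothetical quartile boundaries; an age equal to a
    cut point is assigned to the lower category (an assumption).
    """
    return bisect.bisect_left(cut_points, age)

# Example: a 60-year-old woman and a 72-year-old man.
print(age_gender_group(60, "Female"), age_category(60))
print(age_gender_group(72, "Male"), age_category(72))
```

Keeping the cut points as a parameter makes it easy to substitute the quartiles computed from the actual age distribution.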

2.2.1.2. SEER

The SEER program collects cancer data from multiple population-based centers across the nation, providing authoritative information on cancer incidence and survival. Data from SEER were also used to validate the covariates identified as relevant to GBM survival. GBM patients were identified by the diagnosis code of the International Classification of Diseases for Oncology, Third Edition (ICD-O-3). Patients were selected by ICD-O-3 histological type 9440/3 (Glioblastoma, NOS), primary site C71 (Brain), and Grade IV tumor. The SEER clinical dataset covers a total of 15,167 patients. Observations with unknown months to last follow up (n=366) were dropped.

The clinical covariates included in the model were selected by stepwise regression, using vital status at year 1 and year 2 and months to last follow up as response variables. Selected covariates include age at diagnosis, race, primary tumor site, radiation sequence with surgery, surgery performed, radiation performed, number of primary tumors, sequence number of tumor, region in the US, and vital status (see a comprehensive description in Appendix B).

Survival duration was adjusted by adding 0.1 month to every participant in the SEER clinical data, in order to keep the information from patients with 0 months of survival (n=1,165). New variables, ‘USA geographical region’ and ‘surgery’, were created from the variables ‘State’ and ‘reason no cancer directed surgery’, respectively. USA geographical region has 4 categories: West (n=7,282; AL, CA, HI, NM, UT, and WA), Mid-West (n=3,500; IA and MI), South (n=2,393; GA, KY, and LA), and North-East (n=1,992; CT and NJ). Surgery was grouped into yes (n=11,600; surgery performed), no (n=3,505; not recommended, died prior to surgery, or patient refused), and unknown (n=62). The final SEER clinical data consisted of information from 15,167 patients and 9 variables (a detailed description of these variables is included in Appendix B).
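As an illustration of the recoding described above, the following Python sketch shows one way to apply the 0.1-month offset and derive the region and surgery variables. It is a sketch under stated assumptions, not the thesis code: the function names are hypothetical, and the text labels used for ‘reason no cancer directed surgery’ are placeholders because the actual SEER codes are not given here. The state-to-region mapping follows the text as written.

```python
# State-to-region mapping as listed in the text.
REGION_BY_STATE = {
    "AL": "West", "CA": "West", "HI": "West",
    "NM": "West", "UT": "West", "WA": "West",
    "IA": "Mid-West", "MI": "Mid-West",
    "GA": "South", "KY": "South", "LA": "South",
    "CT": "North-East", "NJ": "North-East",
}

# Hypothetical labels for the 'reason no cancer directed surgery' field.
NO_SURGERY_REASONS = {"not recommended", "died prior to surgery", "patient refused"}

def adjust_months(months, offset=0.1):
    """Add a small offset so patients with 0 months of survival are kept."""
    return months + offset

def region_for_state(state):
    """Map a two-letter state code to one of the four regions."""
    return REGION_BY_STATE[state.upper()]

def surgery_group(reason):
    """Collapse the surgery reason into 'yes', 'no', or 'unknown'."""
    if reason == "surgery performed":
        return "yes"
    if reason in NO_SURGERY_REASONS:
        return "no"
    return "unknown"

print(adjust_months(0), region_for_state("mi"), surgery_group("patient refused"))
```

The offset matters because a survival time of exactly zero can be dropped or mishandled by some survival routines; adding a constant to every patient preserves the ordering of survival times.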

2.2.2. Statistical Models

In this study, we are interested in the association of clinical factors with survival time. The analysis was performed in both the TCGA and SEER datasets. The primary objective was to identify differences in survival between the groups defined by each clinical covariate. The survival analysis was carried out with a Proportional Hazards (Cox regression) model (PHM) (Cox, 1972). For each covariate in Appendices A and B, survival curves were generated separately, without controlling for other variables.

The Cox model is the most commonly used semi-parametric regression model. It specifies the hazard function as the product of a parametric term for the covariates and a nonparametric term for time.

The standard form of the PHM is written as

$h(t \mid x) = h_0(t)\exp(\beta' x)$,

where $h(t \mid x)$ represents approximately the instantaneous risk of death at time $t$, given survival up to time $t$ and antecedent covariates $x$, and $h_0(t)$ is the unspecified baseline hazard function (Cox, 1972).

The baseline hazard function is the hazard function for a referent covariate profile. The second term, $\exp(\beta' x)$, provides a means to assess the effect of covariates through the hazard ratio $h(t \mid x)/h_0(t)$. The clinical covariates in TCGA and SEER are fully described in Appendices A and B.
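As a brief worked illustration of the hazard ratio, consider two covariate profiles, written here as $x_a$ and $x_b$ (symbols introduced only for this sketch). Under the proportional hazards form, the baseline hazard cancels:

```latex
\frac{h(t \mid x_a)}{h(t \mid x_b)}
  = \frac{h_0(t)\exp(\beta' x_a)}{h_0(t)\exp(\beta' x_b)}
  = \exp\{\beta'(x_a - x_b)\}
```

The ratio does not depend on $t$, which is the proportional hazards assumption. For a single binary covariate coded 1 versus 0 (the referent), it reduces to $\exp(\beta)$, the form in which the estimates in Appendices C and D are reported.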

The log-rank test was used to test for differences in the survival functions across groups. The log-rank test is nonparametric: no distributional assumption is needed on the underlying survival distributions. It tests whether the survival functions of two or more independent groups differ (Mantel, 1966). For two groups with survival functions $S_1(t)$ and $S_2(t)$, the hypotheses of the log-rank test are

$H_0: S_1(t) = S_2(t)$ for all $t$
$H_1: S_1(t) \neq S_2(t)$ for some $t$.

The survival function $S_1(t)$ is the probability that the survival time $T_1$ of an individual in group 1 is greater than $t$, i.e., $S_1(t) = P[T_1 > t]$. Similarly, for group 2, $S_2(t) = P[T_2 > t]$. Extension of the hypothesis test to multiple groups is straightforward.
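For readers who want to see the mechanics of the two-group test, the sketch below computes the standard log-rank chi-square statistic from observed and expected deaths at each event time. It is a minimal illustration in plain Python; the function name and the toy data are assumptions for the example, and it is not the software used for the analyses in this chapter.

```python
import math


def logrank_test(times1, events1, times2, events2):
    """Two-group log-rank test; returns (chi-square statistic, p-value).

    times*: follow-up times; events*: 1 if death observed, 0 if censored.
    """
    data = [(t, e, 1) for t, e in zip(times1, events1)]
    data += [(t, e, 2) for t, e in zip(times2, events2)]
    event_times = sorted({t for t, e, _ in data if e})

    observed_minus_expected = 0.0
    variance = 0.0
    for t in event_times:
        n = sum(1 for u, _, _ in data if u >= t)               # at risk, both groups
        n1 = sum(1 for u, _, g in data if u >= t and g == 1)   # at risk, group 1
        d = sum(1 for u, e, _ in data if u == t and e)         # deaths at t, both groups
        d1 = sum(1 for u, e, g in data if u == t and e and g == 1)
        observed_minus_expected += d1 - d * n1 / n
        if n > 1:
            # Hypergeometric variance of the number of group-1 deaths at time t.
            variance += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)

    if variance == 0:
        raise ValueError("variance is zero; the test is undefined for these data")
    chi2 = observed_minus_expected ** 2 / variance
    # Survival function of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Toy data (all deaths observed): group 1 dies early, group 2 dies late.
chi2, p = logrank_test([1, 2, 3, 4], [1, 1, 1, 1], [5, 6, 7, 8], [1, 1, 1, 1])

# Identical groups give a statistic of zero.
chi2_same, p_same = logrank_test([1, 2, 3], [1, 1, 1], [1, 2, 3], [1, 1, 1])
```

With more than two groups the same observed-minus-expected idea applies, but the statistic becomes a quadratic form with one fewer degrees of freedom than the number of groups.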


2.3. Results

2.3.1. Descriptive Statistics

2.3.1.1. TCGA

The descriptive statistics of the clinical covariates for patients alive at year 1 and year 2, and for patients who died before year 1 and year 2, in the TCGA dataset are shown in Table 2.2. There were 523 patients included in the study, and the time to 50% survival was 1.1 years. Overall, 217 patients (52%) were alive at year 1 and 204 patients (48%) died during year 1. Patients censored before year 1 were treated as missing. At year 2, the overall numbers of alive and dead patients were 70 (17%) and 331 (83%), respectively.
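A "time to 50% survival" such as the one quoted above is the kind of quantity a Kaplan-Meier estimate provides. The thesis does not state how the value was computed, so the following is only a minimal sketch under that assumption; the function names and the toy follow-up times are illustrative and are not TCGA data.

```python
def kaplan_meier(times, events):
    """Return [(time, survival probability)] at each distinct event time.

    times: follow-up times; events: 1 if death observed, 0 if censored.
    """
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(set(times)):
        deaths = sum(1 for u, e in zip(times, events) if u == t and e)
        leaving = sum(1 for u in times if u == t)  # deaths plus censored at t
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve


def median_survival(curve):
    """First time at which the estimated survival drops to 0.5 or below."""
    for t, s in curve:
        if s <= 0.5:
            return t
    return None  # median not reached during follow-up


# Toy data in years: four deaths and one censored observation (at 1.5).
toy_times = [0.3, 0.8, 1.1, 1.5, 2.0]
toy_events = [1, 1, 1, 0, 1]
curve = kaplan_meier(toy_times, toy_events)
median = median_survival(curve)
```

Unlike the crude year-1 and year-2 proportions, which set aside patients censored before the landmark, the Kaplan-Meier estimate uses censored patients for as long as they remain under observation.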

Table 2.2 Descriptive statistics of clinical covariates at cancer diagnosis by vital status at year 1 and year 2 after diagnosis (TCGA).

Clinical covariate                    Year 1, n (%)               Year 2, n (%)
                                      Alive         Dead          Alive         Dead
Censored before year 1 / year 2       102 (19)                    122 (23)
Vital status                          217 (42)      204 (39)      70 (14)       331 (63)
Gender
  Female                              76 (35)       77 (38)       31 (44)       118 (36)
  Male                                141 (65)      127 (62)      39 (56)       213 (64)
Race
  African American                    14 (7)        16 (8)        5 (7)         21 (7)
  European American                   194 (93)      179 (92)      64 (93)       295 (93)
Age group (at diagnosis)
  < 50                                76 (35)       25 (12)       35 (50)       63 (19)
  51-60                               66 (30)       56 (27)       19 (27)       93 (28)
  61-69                               55 (25)       52 (26)       13 (19)       90 (27)
  >69                                 20 (10)       71 (35)       3 (4)         85 (26)
Targeted molecular treatment
  No                                  112 (65)      99 (85)       36 (62)       164 (77)
  Yes                                 61 (35)       18 (15)       22 (38)       50 (23)
Diagnosis method
  Tumor Resection                     175 (81)      183 (90)      53 (76)       288 (88)
  Excisional Biopsy                   40 (18)       17 (8)        17 (24)       38 (11)
  Other method                        1 (1)         3 (2)         0 (0)         3 (1)

The study consisted of 200 females (38%) and 323 males (62%). Originally, the dataset included European Americans (n=457), African Americans (n=45), and Asians (n=13). The 13 Asian patients were removed from the analysis because of the small sample size. Asian patients not only had longer survival times than the rest of the sample; race-specific allele frequencies would also be difficult to handle with such a small group, and race would act as a confounder in the analysis. The average KPS was 77.9 ± 14.1 (mean ± SD), and the average number of days to last follow-up was 422.2 ± 456.7.

The distribution of age at initial pathologic diagnosis is shown in Figure 2.2. Age ranged from 10 to 89 years; the overall mean and standard deviation of age were 58.3 and 14.4 years, respectively (variance-to-mean ratio of 3.54). Age was grouped into 4 categories by quartile. The percentage of patients who died increased approximately linearly across age groups at both year 1 and year 2: the older the age group, the higher the observed mortality (Figure 2.3). A linear increase in the death rate by age group was also observed in the survival analysis performed with the Cox proportional hazards model.


Figure 2.2 Histogram of age at initial pathologic diagnosis and non-parametric density curve


Figure 2.3 Proportion of vital status among age groups (quartiles 25, 50, 75, 100; x-axis: age group). A. Vital status at year 1 in each age group. B. Vital status at year 2 in each age group.


2.3.1.2. SEER

In the SEER data, the population-based cancer registries provide more information and a larger sample size with which to characterize and evaluate the disease. The clinical data analyzed in this study consist of 15,167 observations. Among all patients, 33% (n=4,791) were alive at year 1 and 67% died before year 1; at year 2, the proportions of alive and dead patients were 11% (n=1,608) and 89% (n=12,812), respectively. The proportions of patients alive at year 1 and year 2 (33% and 11%) were lower than in the TCGA clinical dataset (52% and 17%). Also, the time to 50% survival (0.59 years) was shorter than in the TCGA data (1.1 years). The distribution of all clinical covariates is shown in Table 2.3.

The age at initial diagnosis ranged from 0 to 98 years, with a median of 63 years and a mean of 61.7 ± 14.5 years. In agreement with the TCGA sample, this dataset included a higher proportion of males than females with GBM, with a male-to-female ratio of 1.4 : 1. Most of the patients were recruited in the West region of the US (48%; n=7,282), followed by the Mid-West (23%; n=3,500), South (16%; n=2,393), and North-East (13%; n=1,992). The SEER sample is more ethnically diverse than TCGA, with 0.3% American Indian and Alaska Native (n=49), 3.6% Asian and Pacific Islander (n=551), 4.8% African American (n=734), 91.1% European American (n=13,814), and 0.2% unknown race (n=19). Additional clinical and demographic covariates were collected in the SEER dataset, including the site of the primary tumor in the brain (Table 2.4), number of primary tumors, sequence of surgery and radiation, and sequence of tumors, among others. Each of these covariates provided important information for the analysis. Further results are shown in the survival analysis.

Table 2.3 Descriptive statistics of clinical covariates at cancer diagnosis by vital status at year 1 and year 2 after diagnosis (SEER).

Clinical covariate                    Year 1, n (%)               Year 2, n (%)
                                      Alive         Dead          Alive         Dead
Censored                              548 (4)                     747 (5)
Vital status                          4,791 (31)    9,828 (65)    1,608 (11)    12,812 (84)
Gender
  Female                              1,957 (41)    4,232 (43)    702 (44)      5,397 (42)
  Male                                2,834 (59)    5,596 (57)    906 (56)      7,415 (58)
Race
  American Indian/ Alaska Native      19 (<1)       27 (<1)       10 (<1)       35 (<1)
  Asian or Pacific Islander           202 (4)       317 (3)       81 (5)        425 (3)
  African American                    251 (5)       458 (4)       81 (5)        620 (5)
  European American                   4,313 (90)    9,018 (92)    1,434 (89)    11,722 (92)
  Unknown                             6 (<1)        8 (<1)        2 (<1)        10 (<1)
Age group (at diagnosis)
  < 50                                1,688 (35)    1,191 (12)    754 (47)      2,061 (16)
  51-60                               1,462 (31)    1,893 (19)    451 (28)      2,848 (22)
  61-69                               1,007 (21)    2,506 (26)    277 (17)      3,199 (25)
  >69                                 624 (13)      4,238 (43)    126 (8)       4,704 (37)
Region in USA
  Mid-West                            1,025 (21)    2,382 (24)    318 (20)      3,036 (24)
  North-East                          715 (15)      1,180 (12)    236 (15)      1,629 (13)
  South                               740 (16)      1,568 (16)    267 (17)      2,015 (16)
  West                                2,311 (48)    4,698 (48)    787 (48)      6,132 (47)
Surgery
  No                                  450 (9)       2,999 (30)    139 (9)       3,298 (26)
  Yes                                 4,323 (90)    6,785 (69)    1,466 (91)    9,455 (74)
  Unknown                             18 (1)        44 (1)        3 (<1)        59 (<1)
Radiation
  Beam radiation                      4,223 (88)    6,248 (65)    1,405 (87)    8,893 (69)
  Beam with Implants or Isotopes      99 (2)        46 (<1)       37 (2)        108 (1)
  Implants or Isotopes                15 (<1)       22 (<1)       7 (<1)        28 (<1)
  None                                331 (7)       2,911 (29)    123 (8)       3,100 (24)
  Radiation, not specified            43 (1)        78 (<1)       14 (1)        103 (1)
  Refused                             17 (<1)       265 (4)       1 (<1)        281 (2)
  Unknown                             62 (1)        258 (4)       20 (1)        299 (2)
Sequence of radiation and surgery
  No radiation and/or surgery         816 (17)      5,141 (52)    265 (17)      5,662 (44)
  Radiation after surgery             3,827 (80)    4,513 (46)    1,286 (80)    6,890 (54)
  Radiation before and after surgery  38 (<1)       25 (<1)       12 (<1)       49 (<1)
  Radiation prior to surgery          76 (2)        112 (1)       24 (2)        162 (1)
  Sequence Unknown                    22 (<1)       33 (<1)       13 (<1)       42 (<1)
Primary tumor site
  C71.0-Cerebrum                      153 (3)       423 (4)       60 (4)        510 (4)
  C71.1-Frontal lobe                  1,324 (28)    2,366 (24)    503 (31)      3,127 (24)
  C71.2-Temporal lobe                 1,306 (27)    2,132 (22)    396 (25)      2,976 (23)
  C71.3-Parietal lobe                 850 (18)      1,712 (17)    289 (18)      2,248 (18)
  C71.4-Occipital lobe                226 (5)       406 (4)       60 (4)        566 (4)
  C71.5-Ventricle, NOS*               14 (<1)       42 (<1)       4 (<1)        51 (<1)
  C71.6-Cerebellum, NOS*              30 (<1)       75 (1)        12 (1)        90 (1)
  C71.7-Brain stem                    19 (<1)       68 (1)        8 (<1)        79 (1)
  C71.8-Overlapping lesion of brain   735 (15)      2,036 (20)    237 (15)      2,506 (20)
  C71.9-Brain, NOS*                   134 (3)       568 (6)       39 (2)        659 (5)
Number of primary tumors
  1                                   4,357 (91)    8,646 (88)    1,448 (90)    11,378 (89)
  2                                   395 (8)       1,052 (11)    145 (9)       1,282 (10)
  3                                   37 (<1)       112 (1)       13 (1)        134 (1)
  >4                                  2 (<1)        18 (<1)       2 (<1)        18 (<1)
Sequence of tumors
  One primary tumor                   4,280 (89)    8,334 (85)    1,423 (88)    11,016 (85)
  1st of 2 or more primaries          79 (2)        77 (1)        49 (3)        106 (1)
  2nd of 2 or more primaries          391 (8)       1,228 (12)    123 (8)       1,475 (12)
  3rd of 3 or more primaries          37 (1)        169 (2)       11 (1)        193 (2)
  4th or greater                      4 (<1)        20 (<1)       2 (<1)        22 (<1)

*NOS: not otherwise specified


Table 2.4 ICD-O-3 primary brain tumor topographic sites

ICD-O-3   Term
C71.0     Cerebrum, Basal ganglia, Central white matter, Cerebral cortex, Cerebral hemisphere, Corpus striatum, Globus pallidus, Hypothalamus, Insula, Internal capsule, Island of Reil, Operculum, Pallium, Putamen, Rhinencephalon, Supratentorial brain NOS, and Thalamus.
C71.1     Frontal lobe
C71.2     Temporal lobe, Hippocampus, and Uncus.
C71.3     Parietal lobe
C71.4     Occipital lobe
C71.5     Ventricle NOS, Cerebral ventricle, Choroid plexus NOS, Choroid plexus of lateral ventricle, Choroid plexus of third ventricle, Ependyma, Lateral ventricle NOS, and Third ventricle NOS.
C71.6     Cerebellum NOS, Cerebellopontine angle, and Vermis of cerebellum.
C71.7     Brain stem, Cerebral peduncle, Basis pedunculi, Choroid plexus of fourth ventricle, Fourth ventricle NOS, Infratentorial brain NOS, Medulla oblongata, Midbrain, Olive, Pons, and Pyramid.
C71.8     Overlapping lesion of brain, Corpus callosum, and Tapetum.
C71.9     Brain NOS, Intracranial site, Cranial fossa NOS, Anterior cranial fossa, Middle cranial fossa, Posterior cranial fossa, and Suprasellar.

*NOS: not otherwise specified
Source: http://training.seer.cancer.gov/brain/tumors/abstract-code-stage/topographic.html

2.3.2. Survival Analysis

2.3.2.1. TCGA

In the initial analyses, we characterized survival in GBM patients by clinical covariates. Survival curves for the age groups, constructed after fitting the Cox PHM, are shown in Figure 2.4. Survival differed significantly among the age groups (p=1.1e-16), with younger age groups having better estimated survival. We also found that survival differed significantly by KPS, ≤80 versus >80 (p=2.3e-5), and by targeted molecular therapy, yes versus no (p=8.2e-4) (Figures 2.5 and 2.6). Patients with KPS > 80 had better survival than patients in the lower score group (≤80). Patients who received targeted molecular therapy also had better survival than those who did not.

Figure 2.4 Survival curves by age group (TCGA). The solid line represents the mean of the group. Years: years of patient survival after diagnosis.


Figure 2.5 Survival curves by Karnofsky Performance Score (TCGA). The solid line represents the mean of the group. Years: years of patient survival after diagnosis.


Figure 2.6 Survival curves by targeted molecular therapy (TCGA). The solid line represents the mean of the group. Years: years of patient survival after diagnosis.

The combined age-gender grouping was also significantly associated with survival (Figure 2.7). The age effect had the same direction in both genders: females and males older than 60 years had lower survival than females and males aged 60 or younger. However, Figure 2.7 shows that the ordering of males and females is reversed between the two age groups. From highest to lowest, the survival rates of the four groups were female ≤ 60, male ≤ 60, male > 60, and female > 60. Young females tended to have the highest survival rate, while old females had the lowest among the 4 groups. Gender alone was not significantly associated with survival.

Figure 2.7 Survival curves by interaction of age and gender (TCGA). The solid line represents the mean of the group. Years: years of patient survival after diagnosis.

2.3.2.2. SEER

Analysis of the SEER data also showed statistically significant differences in survival among age groups (Figure 2.8), supporting the results found in the TCGA sample. Moreover, the wide range of variables collected in the SEER dataset provided additional information about patient survival, indicating which treatments, treatment sequences, or populations were associated with the greatest benefit.

Figure 2.8 Survival curves by age group (SEER). The solid line represents the mean of the group. Years: years of patient survival after diagnosis.

Survival did not differ significantly by race in TCGA; however, the five race categories in SEER differed in survival. We observed the same trend in survival reported in previous studies (ABTA, 2014): Asians had the highest survival rate, followed by African Americans and European Americans (Figure 2.9). This is also in line with the longer survival of Asian patients noted in the TCGA sample. Survival also differed significantly among regions of the United States by the log-rank test (p<0.0001). Patients from the North-East tended to live longest, followed by the West and the South, and patients from the Mid-West had the worst survival (Figure 2.10). These differences may be due to confounders: the populations of the regions differ not only in age composition but also in socioeconomic and several other factors, and the regional differences are likely to reflect this heterogeneous composition.

Figure 2.9 Survival curves by race (SEER). The solid line represents the mean of the group. Years: years of patient survival after diagnosis.


Figure 2.10 Survival curves by area of residence in the US (SEER). The solid line represents the mean of the group. Years: years of patient survival after diagnosis.

Whether surgery and radiation were performed, and the sequence of surgery and radiation, were also associated with the survival of GBM patients. Patients who received surgery had a higher chance of longer survival (Figure 2.11). In terms of radiation type, individuals who received beam radiation combined with radioactive implants or radioisotopes had the best survival prognosis. Survival was lower for patients who received a single type of radiation alone. Patients who refused radiation or did not have radiation therapy had lower survival than those who had radiation therapy (Figure 2.12).


Figure 2.11 Survival curves by surgery (SEER). The solid line represents the mean of the group. Years: years of patient survival after diagnosis.


Figure 2.12 Survival curves by radiation (SEER). The solid line represents the mean of the group. Years: years of patient survival after diagnosis.

The benefit of radiation was also confirmed by the sequence of surgery and radiation (Figure 2.13). Patients were categorized as having radiation therapy before, after, or both before and after surgery; the results show that radiation performed both before and after surgery was associated with the longest survival. For patients who received radiation only once, radiation after surgery was associated with better survival than radiation prior to surgery. The results were all statistically significant (p < 0.0001).


Figure 2.13 Survival curves by sequence of radiation and surgery (SEER). The solid line represents the mean of the group. Years: years of patient survival after diagnosis.

Furthermore, the probability of survival differed not only by treatment but also by the number, sequence, and location of tumors in the brain. Survival was modeled for the different orders of the primary brain tumor in a patient's sequence of tumors. In general, patients with fewer primary tumors had lower mortality. A lower risk of death was observed for patients whose GBM was their first primary tumor. The risk of death increased in the order of 2nd, 3rd, and 4th or greater primary (Figure 2.14). However, differences were observed within the group whose GBM was the 1st primary tumor: patients with only 1 primary tumor tended to have lower survival than patients whose GBM was the 1st of 2 or more primaries (Figure 2.15).

Figure 2.14 Survival curves by number of primary tumor (SEER). The solid line represents the mean of the group. Years: years of patient survival after diagnosis.


Figure 2.15 Survival curves by sequence of primary tumor (SEER). The solid line represents the mean of the group. Years: years of patient survival after diagnosis.

The results above are consistent among TCGA, SEER, and previous studies, which further supports the validity of our data. Tumor location in the brain was also significantly associated with survival (p < 0.0001). Ten sites in the brain are recorded by the ICD-O-3 topographic site system (Table 2.4). The results are displayed in Figure 2.16.


Figure 2.16 Survival curves by primary tumor site (SEER). The solid line represents the mean of the group. Years: years of patient survival after diagnosis.

The frontal lobe is the area of the brain that controls memory and cognitive functions such as planning, organizing, and personality. Patients with tumors in the frontal lobe had the lowest risk of death compared with patients with tumors elsewhere. Tumors in the temporal lobe, which is involved in hearing, also carried a lower relative risk of death from GBM than other sites. GBM tumors in the ventricle (not otherwise specified) had the highest risk among the specified sites, although the estimated hazard was not significantly different from that of the reference site, the cerebrum (Appendix D). The ventricles of the brain are a communicating network of fluid-filled cavities.


2.4. Discussion and Conclusions

In this chapter, we identified non-genetic factors affecting the survival of Glioblastoma Multiforme (GBM) patients in the TCGA and SEER datasets. The survival data and covariate distributions in these two independent datasets are slightly different, but most of the results were consistent between the datasets and with previous studies (Appendices C and D). In TCGA, the survival rates at years one, two, and five were 55.6%, 20.8%, and 4%, respectively; in SEER, the survival rates at years one, two, and five were 30.6%, 11.4%, and 3.7%, respectively. Patients in SEER therefore had lower survival rates.

Previous studies highlight a measurable association between age at initial diagnosis and survival among GBM patients (ABTA, 2014; Moon & Lesniak, 2014). A very similar result was observed in our study: survival differed significantly among age groups, with elderly GBM patients having higher mortality than their younger counterparts.

Previous studies have also reported differences in survival by gender and race (Muquit et al., 2015; ABTA, 2014; Moon & Lesniak, 2014). However, our findings were not fully consistent with these studies. In both TCGA and SEER, survival was not found to differ significantly by gender. For race, the SEER data (larger sample size; n=15,167) showed a significant difference, which was not identified in the TCGA data (smaller sample size; n=523). This can be explained by the smaller sample size of TCGA, which limited our power to detect a significant effect.

The age-gender grouping factor was also evaluated in the study. The hazard ratio for gender differed between age groups. In TCGA, females aged 60 or younger had a lower estimated risk of dying from GBM than males in the same age group, whereas females older than 60 years had a slightly higher estimated risk than males older than 60, although the latter difference was not significant (Figure 2.7; Appendix C). This age-gender interaction was absent in the SEER data, and a crossover of risk between genders has not been reported in the previous literature either. Thus, we believe the effect found in TCGA might be a particularity of the TCGA sample. The gender ratio also differs between the datasets, being 1 to 1.6 (female to male) in TCGA and 1 to 1.4 in SEER.

Differences in survival among patients living in different regions of the US were also observed in our study. It is well known that lifestyle varies across regions of the nation. Nelson et al. (2012) reported lifestyle-related risk factors associated with GBM incidence; specifically, individuals with higher education, calorie consumption, sugar intake, and triceps skinfold thickness were more likely to develop GBM. Socio-economic status (SES) was also reported to influence the incidence rate of GBM: people in a higher SES class, living in suburban areas (lower population density), with higher household income, and in populations with a higher percentage of car ownership have a higher incidence rate (Muquit et al., 2015). All of these reported GBM risk factors may vary among regions of the US.


Stupp et al. (2009) reported that combined therapy (radiotherapy with temozolomide) significantly prolonged patient survival compared with radiotherapy alone. Consistent with this finding, we found that surgery and radiation treatments were protective for patients in our study. GBM patients who received surgery, beam radiation combined with implants or isotopes, or radiation both before and after surgery had a lower risk than the rest of the patients.

Tumor location in the brain was also significantly associated with survival. Frontal lobe tumors were associated with the best survival, while the most aggressive cases were tumors located in the ventricle of the brain. Patients with tumors in the ventricle had the highest estimated risk among the specified sites, which is in line with the study by Chaichana et al. (2008), who found that GBM bordering the lateral ventricles was associated with lower survival than GBM not bordering the lateral ventricles. A significant difference in median survival time was seen between patients with ventricle GBM (8 months) and non-ventricle GBM (11 months). Our results from the SEER dataset are consistent with these findings. TCGA, however, does not include tumor location among the reported covariates.

In conclusion, the non-genetic risk factors identified in this study were mostly consistent between the two datasets and with previous studies. Although gender was not found to be significantly associated with survival, other factors, including race, treatment, age, and tumor location, agreed with previous studies and showed consistent results among the evaluated datasets.


APPENDICES


Appendix A. Data dictionary for TCGA clinical dataset

Table 2.5 Data dictionary for TCGA clinical dataset

Variable: bcr_patient_barcode (discrete)
  Description: Patients’ ID

Variable: vital_status (discrete)
  Description: All cause of death
  Descriptors: Alive; Dead

Variable: days_to_last_followup (numeric)
  Description: Time interval from the date of last follow up to the date of initial pathologic diagnosis, represented as a calculated number of days
  Descriptors: Mean (SD) 422.2 (456.7)

Variable: age_at_initial_pathologic_diagnosis (numeric)
  Description: Age of the patient at initial pathologic diagnosis
  Descriptors: Mean (SD) 58.3 (14.4)

Variable: gender (discrete)
  Description: Gender of patients
  Descriptors: Female; Male

Variable: race (discrete)
  Description: Information about race based on the Office of Management and Budget (OMB) categories
  Descriptors: African American; European American

Variable: karnofsky_performance_score (numeric)
  Description: Score for cancer patients’ daily functional capability index
  Descriptors: Mean (SD) 77.9 (14.1)

Variable: method_initial_path_dx (discrete)
  Description: The procedure to secure the tissue used for the original pathologic diagnosis
  Descriptors: Excisional Biopsy; Tumor resection; Other method

Variable: targeted_molecular_therapy (discrete)
  Description: Indicator of whether the patient received targeted molecular therapy
  Descriptors: Yes; No

Source: https://tcga-data.nci.nih.gov/docs/dictionary/


Appendix B. Data dictionary for SEER clinical dataset

Table 2.6 Data dictionary for SEER clinical dataset

Variable: Patient.ID (discrete)
  Description: Patients’ ID

Variable: Death (discrete)
  Description: All cause of death
  Descriptors: Alive; Dead

Variable: Survival_months (numeric)
  Description: Time interval from the date of last follow up to the date of initial pathologic diagnosis, represented as a calculated number of months
  Descriptors: Mean (SD) 11.9 (19.9)

Variable: Age_at_diagnosis (numeric)
  Description: Age of the patient at initial pathologic diagnosis
  Descriptors: Mean (SD) 61.7 (14.5)

Variable: Race (discrete)
  Description: Patients’ race
  Descriptors: American Indian/ Alaska Native; Asian or Pacific Islander; African American; European American; Unknown

Variable: Primary.Site…labeled (discrete)
  Description: ICD-O-3 primary brain tumor topographic code sites
  Descriptors: C71.0-Cerebrum; C71.1-Frontal lobe; C71.2-Temporal lobe; C71.3-Parietal lobe; C71.4-Occipital lobe; C71.5-Ventricle, NOS**; C71.6-Cerebellum, NOS**; C71.7-Brain stem; C71.8-Overlapping lesion of brain; C71.9-Brain, NOS**

Variable: Surgery (discrete)
  Description: Surgery is performed or not
  Descriptors: Yes; No; Unknown

Variable: Radiation.sequence.with.surgery (discrete)
  Description: The order in which surgery and radiation therapies were administered
  Descriptors: No radiation and/or surgery; Radiation after surgery; Radiation before and after surgery; Radiation prior to surgery; Sequence unknown, but both were given

Variable: Radiation (discrete)
  Description: The method or source of radiation administered as a part of the first course of treatment
  Descriptors: Beam radiation; Beam with Implants or Isotopes; Implants or Isotopes; None; Radiation, not specified; Refused; Unknown

Variable: Sequence.number (discrete)
  Description: The order of all primary reportable tumors diagnosed during a patient's lifetime
  Descriptors: One primary only; 1st of 2 or more primaries; 2nd of 2 or more primaries; 3rd of 3 or more primaries; 4th or greater of 4 or more primaries

Variable: Number.of.primaries (discrete)
  Description: The number of primary tumors patients had
  Descriptors: 1; 2; 3; >4

Variable: Region (discrete)
  Description: Region in US where patients were recruited
  Descriptors: Mid-West; North-East; South; West

**NOS: not otherwise specified
Source: http://seer.cancer.gov/archive/manuals/2012/SPCSM_2012_maindoc.pdf


Appendix C. Summary of hazard ratio estimates for clinical covariates (TCGA)

Table 2.7 Hazard ratio, confidence interval (95%), and p-value for clinical factors (TCGA).

Covariates                            Hazard ratio  95% CI              p-value
Gender
  Female                              1.0000        -
  Male                                1.1150        (0.9055-1.3720)     0.306
Race
  European American                   1.0000        -
  African American                    1.2160        (0.8091-1.8280)     0.347
Age Group
  < 50                                1.0000        -
  51-60                               1.5990        (1.2060-2.1200)     0.0011 **
  61-69                               1.8810        (1.4140-2.5030)     < 0.0001 ***
  >69                                 3.7650        (2.7610-5.1330)     < 0.0001 ***
Age-gender
  Female > 60                         1.0000        -
  Male > 60                           0.9667        (0.7203-1.2974)     0.822
  Female ≤ 60                         0.4457        (0.3208-0.6193)     < 0.0001 ***
  Male ≤ 60                           0.5623        (0.4215-0.7500)     < 0.0001 ***
Targeted Molecular Treatment
  No                                  1.0000        -
  Yes                                 0.6239        (0.4721-0.8247)     0.0009 ***
Diagnosis method
  Excisional Biopsy                   1.0000        -
  Tumor Resection                     1.2970        (0.9725-1.7290)     0.0768
  Other method                        2.0090        (0.8003-5.0410)     0.1374
Karnofsky Performance Score
  KPS < 80                            1.0000        -
  KPS ≥ 80                            0.5548        (0.4207-0.7316)     < 0.0001 ***

Significance codes: 0 ‘***’; 0.001 ‘**’; 0.01 ‘*’; 0.05 ‘.’; 0.1 ‘ ’; 1
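The columns of such a table are linked: on the log scale, the 95% confidence interval is the estimate plus or minus about 1.96 standard errors, so a standard error and an approximate two-sided p-value can be recovered from the hazard ratio and its interval. The sketch below assumes a Wald-type test (the thesis does not state which test produced the reported p-values) and uses a hypothetical helper name; it is applied to the Gender (Male) row.

```python
import math


def wald_from_hr(hr, lower, upper, z_crit=1.959964):
    """Recover log(HR), its standard error, Wald z, and two-sided p-value
    from a hazard ratio and its 95% confidence limits (assumes a Wald interval)."""
    log_hr = math.log(hr)
    se = (math.log(upper) - math.log(lower)) / (2 * z_crit)
    z = log_hr / se
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail probability
    return log_hr, se, z, p


# Gender (Male vs. Female) row of the TCGA hazard ratio table.
log_hr, se, z, p = wald_from_hr(1.1150, 0.9055, 1.3720)
```

Under these assumptions the recovered p-value is close to the reported 0.306; small differences are expected from rounding of the tabulated values.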


Appendix D. Summary of hazard ratio estimates for clinical covariates (SEER)

Table 2.8 Hazard ratio, confidence interval (95%), and p-value for clinical factors (SEER).

Covariates                            Hazard ratio  95% CI              p-value
Gender
  Female                              1.0000        -
  Male                                0.9732        (0.9411-1.0060)     0.1120
Race
  American Indian/ Alaska Native      1.0000        -
  Asian or Pacific Islander           1.2821        (0.9250-1.7770)     0.1357
  African American                    1.4466        (1.0472-1.9980)     0.0251 *
  European American                   1.5338        (1.1199-2.1010)     0.0077 **
  Unknown                             0.9932        (0.5086-1.9390)     0.9839
Age Group
  < 50                                1.0000        -
  51-60                               1.4910        (1.4150-1.5710)     < 0.0001 ***
  61-69                               1.9970        (1.8960-2.1040)     < 0.0001 ***
  >69                                 3.4130        (3.2470-3.5870)     < 0.0001 ***
Region in USA
  Mid-West                            1.0000        -
  North-East                          0.8548        (0.8072-0.9052)     < 0.0001 ***
  South                               0.9261        (0.8775-0.9775)     0.0053 **
  West                                0.9180        (0.8805-0.9571)     < 0.0001 ***
Surgery
  No                                  1.0000        -
  Yes                                 0.4577        (0.4401-0.4760)     < 0.0001 ***
  Unknown                             0.7205        (0.5593-0.9282)     0.0112 *
Radiation
  Beam radiation                      1.0000        -
  Beam with Implants or Isotopes      0.6798        (0.5753-0.8033)     < 0.0001 ***
  Implants or Isotopes                0.9155        (0.6538-1.2820)     0.6070
  None                                2.8588        (2.7454-2.9770)     < 0.0001 ***
  Radiation, not specified            1.0027        (0.8302-1.2110)     0.9780
  Refused                             3.5702        (3.1691-4.0220)     < 0.0001 ***
  Unknown                             1.8352        (1.6403-2.0532)     < 0.0001 ***
Sequence of Radiation and Surgery
  No radiation and/or surgery         1.0000        -
  Radiation after surgery             0.3967        (0.3832-0.4106)     < 0.0001 ***
  Both before and after surgery       0.3397        (0.2628-0.4391)     < 0.0001 ***
  Radiation prior to surgery          0.4443        (0.3833-0.5150)     < 0.0001 ***
  Sequence Unknown                    0.4021        (0.3075-0.5257)     < 0.0001 ***
Primary Tumor Site
  C71.0-Cerebrum                      1.0000        -
  C71.1-Frontal lobe                  0.7540        (0.6895-0.8246)     < 0.0001 ***
  C71.2-Temporal lobe                 0.7690        (0.7029-0.8413)     < 0.0001 ***
  C71.3-Parietal lobe                 0.8357        (0.7623-0.9162)     0.0001 ***
  C71.4-Occipital lobe                0.8072        (0.7196-0.9055)     0.0003 ***
  C71.5-Ventricle, NOS*               1.0651        (0.8036-1.4117)     0.6609
  C71.6-Cerebellum, NOS*              0.8856        (0.7145-1.0977)     0.2673
  C71.7-Brain stem                    0.9891        (0.7872-1.2428)     0.9249
  C71.8-Overlapping lesion of brain   0.9724        (0.8876-1.0654)     0.5483
  C71.9-Brain, NOS*                   1.2112        (1.0831-1.3545)     0.0008 ***
Number of Primary Tumor
  1                                   1.0000        -
  2                                   1.1320        (1.0711-1.1970)     < 0.0001 ***
  3                                   1.2110        (1.0270-1.4280)     0.0228 *
  >4                                  1.3160        (0.8286-2.0900)     0.2448
Sequence of Tumor
  One primary tumor                   1.0000        -
  1st of 2 or more primaries          0.5498        (0.4641-0.6512)     < 0.0001 ***
  2nd of 2 or more primaries          1.2739        (1.2086-1.3427)     < 0.0001 ***
  3rd of 3 or more primaries          1.5237        (1.3262-1.7506)     < 0.0001 ***
  4th or greater                      1.7682        (1.1847-2.6393)     0.0053 **

*NOS: not otherwise specified
Significance codes: 0 ‘***’; 0.001 ‘**’; 0.01 ‘*’; 0.05 ‘.’; 0.1 ‘ ’; 1


REFERENCES


ABTA - American Brain Tumor Association (2014). Glioblastoma and Malignant Astrocytoma. Available at: http://www.abta.org/secure/glioblastoma-brochure.pdf.

Chaichana, K. L., McGirt, M. J., Frazier, J., Attenello, F., Guerrero-Cazares, H., & Quinones-Hinojosa, A. (2008). Relationship of glioblastoma multiforme to the lateral ventricles predicts survival following tumor resection. Journal of Neuro-Oncology, 89(2), 219-224.

Cox, D. R. (1972). Regression models and life-tables. Journal of the Royal Statistical Society: Series B (Methodological), 34(2), 187-220.

Johnson, D. R., & O’Neill, B. P. (2012). Glioblastoma survival in the United States before and during the temozolomide era. Journal of Neuro-Oncology, 107(2), 359–364.

Mantel, N. (1966). Evaluation of survival data and two new rank order statistics arising in its consideration. Cancer Chemotherapy Reports, 50(3), 163–170.

Moon, K. S., & Lesniak, M. S. (2014). Glioblastoma: risk factors, diagnosis and treatment options. International Journal of Cancer Research and Prevention, 7(3/4), 183.

Muquit, S., Parks, R., & Basu, S. (2015). Socio-economic characteristics of patients with glioblastoma multiforme. Journal of Neuro-Oncology, 1-5.

Nelson, J. S., Burchfiel, C. M., Fekedulegn, D., & Andrew, M. E. (2012). Potential risk factors for incident glioblastoma multiforme: the Honolulu Heart Program and Honolulu-Asia Aging Study. Journal of Neuro-Oncology, 109(2), 315-321.

Stupp, R., Hegi, M. E., Mason, W. P., van den Bent, M. J., Taphoorn, M. J., Janzer, R. C., Ludwin, S. K., Allgeier, A., Fisher, B., Belanger, K., Hau, P., Brandes, A. A., Gijtenbeek, J., Marosi, C., Vecht, C. J., Mokhtari, K., Wesseling, P., Villa, S., Eisenhauer, E., Gorlia, T., Weller, M., Lacombe, D., Cairncross, J. G., & Mirimanoff, R. (2009). Effects of radiotherapy with concomitant and adjuvant temozolomide versus radiotherapy alone on survival in glioblastoma in a randomised phase III study: 5-year analysis of the EORTC-NCIC trial. The Lancet Oncology, 10(5), 459-466.

Yoshida, E. J., Ortega, A., Patil, C. G., Hu, J. L., Rudnick, J. D., Phuphanich, S., Hakimian, B., Mirhadi, A. J., & Shiao, S. L. (2015). The Impact of Temozolomide on Post Radiation Therapy Progression Free Survival in Low Grade Glioma. International Journal of Radiation Oncology, Biology, Physics, 93(3), E98.


CHAPTER 3 ANALYSIS OF EXPRESSION QUANTITATIVE TRAIT LOCI IN GLIOBLASTOMA MULTIFORME TUMORS

3.1 Introduction

Clinical treatment and care for Glioblastoma Multiforme (GBM) have improved over the past decades, but the survival rate of GBM patients remains low. For instance, the American Brain Tumor Association (ABTA) reported a 5-year survival rate of only 10% for GBM patients (ABTA, 2014). The emphasis in research has shifted from the study of clinical risk factors affecting GBM to understanding the genetic factors that contribute to the development and progression of the disease. Omic data (methylation, gene expression, mutations) can be used to study the processes occurring in the tumor; this information is essential to understanding the carcinogenic process as well as the genetic regulation of gene expression.

Gene expression is the result of multiple biological mechanisms taking place in the tumor. To understand how genetic factors affect gene expression, we conducted an expression quantitative trait loci (eQTL) analysis. An eQTL study analyzes the statistical association between SNP variants in DNA and measures of gene expression. Searching for genetic polymorphisms that affect transcript levels in GBM tumors, Chen et al. (2014) found multiple genes whose expression appears to be regulated by DNA variants. They also found a significant association between selected eQTL-based genes and survival in GBM patients (Chen et al., 2014). This suggests the importance of a genetic component in GBM prognosis. However, Chen et al. (2014) studied the association between gene expression in the tumor and DNA variants from somatic cells. Because of mutations, DNA from tumor cells can differ greatly from somatic-cell DNA.

Therefore, in this chapter we present an eQTL study in which both gene expression and DNA information were assessed in the tumor. Additionally, we assessed the association between gene expression levels and patient survival using a gene expression-wide association study (GE-WAS).


3.2 Materials and Methods

3.2.1 Data

The data used in this chapter were obtained from The Cancer Genome Atlas (TCGA) and include clinical information (CLIN), gene expression (GE), and single nucleotide polymorphisms (SNP) from the tumor, all collected in GBM patients. A total of 460 individuals had complete data for GE and CLIN, and 364 subjects had complete data for SNP (Figure 3.1). TCGA data were accessed under MSU IRB permission #15-745. The GE dataset was obtained from the Broad Institute HT_HG_U133A platform, and the SNP dataset was provided through the Humanhap550 platform.

Gene expression. The available data contained brain tumor tissue expression data (normalized log-intensities) of 460 GBM patients from 9,750 genes; 1,194 of these genes were removed because they showed very low variability across samples (SD<0.2), and 1,133 genes were removed because they did not have an annotated genome position. We also intended to remove genes with more than 10% missing values, but no gene met that criterion (Figure 3.1).

DNA markers. The available genotype dataset for 364 individuals included 462,021 SNP, all called from GBM tumor. Of these SNP, 11,875 were removed because their minor allele frequency was smaller than one percent, and 47,979 were removed because they had more than 10% missing values. One individual was removed from the study because the patient had more than 20% missing genotypes. Genotypes were coded as 0, 1, and 2 according to the count of the minor allele, and missing genotypes were imputed with the mean allelic count of the corresponding SNP across samples.
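The SNP filtering and imputation steps described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the code used for the thesis: the function names, the dictionary layout, and the use of `None` for a missing call are assumptions made for the example, while the thresholds (minor allele frequency of 1%, 10% missing values) are the ones stated in the text.

```python
from typing import Dict, List, Optional

# Hypothetical data layout: one list of calls per SNP, one entry per individual.
# 0, 1, 2 = count of the minor allele; None = missing genotype.
Calls = List[Optional[int]]


def missing_rate(calls: Calls) -> float:
    """Fraction of individuals with a missing call for this SNP."""
    return sum(c is None for c in calls) / len(calls)


def minor_allele_frequency(calls: Calls) -> float:
    """Minor allele frequency computed from the observed calls only."""
    observed = [c for c in calls if c is not None]
    if not observed:
        return 0.0
    freq = sum(observed) / (2 * len(observed))
    return min(freq, 1.0 - freq)


def filter_and_impute(snps: Dict[str, Calls],
                      min_maf: float = 0.01,
                      max_missing: float = 0.10) -> Dict[str, List[float]]:
    """Drop SNP failing the MAF or missingness filters, then mean-impute."""
    kept = {}
    for name, calls in snps.items():
        if missing_rate(calls) > max_missing:
            continue
        if minor_allele_frequency(calls) < min_maf:
            continue
        observed = [c for c in calls if c is not None]
        mean_count = sum(observed) / len(observed)
        # Missing genotypes receive the mean allelic count of the SNP.
        kept[name] = [mean_count if c is None else float(c) for c in calls]
    return kept
```

The removal of individuals with more than 20% missing genotypes would follow the same pattern, applied across SNP for each individual rather than across individuals for each SNP.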

3.2.2 Identification of confounders and pre-correction of gene expression

A preliminary differential expression (DE) analysis was carried out for all genes against possible confounders such as batch, age, and gender. Based on the results of the DE analysis, the principal components (PC) derived from tumor SNP markers and age at initial diagnosis were included in the eQTL analysis. A detailed explanation of the DE analysis and its results is given in Appendix A.

The Cancer Genome Atlas collected samples at multiple hospital sites from 2001 to 2008. Gene expression data were generated in 16 unique batches, with 6 to 58 samples per batch. It is well known that systematic differences between batches (batch effects) may lead to erroneous conclusions (Soneson, Gerster, & Delorenzi, 2014). Therefore, gene expression was pre-corrected for batch effects using a mixed model with batch as a random effect. One such model was fitted to each gene, and the residuals from these models were used in subsequent analyses.

3.2.3 Statistical models

In this section we describe the methods used to assess the association between GE and SNP (eQTL) and between survival and GE (GE-WAS). The sample available for the eQTL analysis comprised 364 patients with expression levels for 9,750 genes and 462,021 SNP; the sample for the GE-WAS comprised 460 individuals with expression levels for 10,883 genes (Figure 3.1).


[Figure 3.1: diagram of the available datasets and the analyses they support. Datasets: CLIN (n=460); GE (n=460; 10,883 genes); SNP (n=364; 462,021 SNP). Analyses: DE analysis; GE-WAS (460 people; 10,883 genes); eQTL (364 people; 9,750 genes; 462,021 SNP).]

Figure 3.1 Data available and omics sample size for TCGA

3.2.3.1 Expression quantitative trait loci (eQTL)

3.2.3.1.1 Model for eQTL analyses

Expression quantitative trait loci (eQTL) analysis is used to analyze the genetic association between GE and SNP. The goal of the eQTL analysis is to search for genetic polymorphisms associated with the expression level of genes. The association between GE levels and SNP was assessed using linear regression models with pre-adjusted GE levels as the response and with age, marker-derived principal components (PC1 and PC2, included to account for confounding due to population structure), and one SNP as predictors. Specifically, the linear model used was:

$$GE_j = \mathbf{1}\mu_{jk} + X\alpha_{jk} + w_k\beta_{jk} + \varepsilon_{jk}$$

$$\varepsilon_{jk} \overset{iid}{\sim} N\left(\mathbf{0},\ I\sigma^2_{\varepsilon_{jk}}\right)$$

j = 1, …, 9,750 genes; k = 1, …, 462,021 SNP

where GE_j is the vector of normalized expression of the jth gene, expressed as log-intensity and pre-corrected for batch effects; μ_jk is an intercept; X is a design matrix of dimension 364 patients by 3 variables (age and the first two SNP-derived PC); α_jk is the vector of the effects of age and of the first two SNP-derived PC on the expression of the jth gene; w_k is the genotype vector of the kth SNP; and β_jk is the effect of the kth SNP on the expression of the jth gene.

The model described above was fitted for all genes and SNP using ordinary least squares. The association between GE levels and SNP was assessed using the p-values for the SNP effect derived from this model. See Appendices A and B for variable selection in the eQTL model.
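As an illustration of a single gene-by-SNP test, the sketch below fits the regression by ordinary least squares and returns the estimated SNP effect with its t-statistic. It is a minimal standard-library sketch under stated assumptions, not the software used in the thesis: the function names are hypothetical, the normal equations are solved by plain Gaussian elimination, and the p-value uses a normal approximation to the t distribution, which is reasonable only because the residual degrees of freedom are large (about 359 here).

```python
import math


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def snp_effect(expr, covariates, snp):
    """OLS of expr on [1, covariates..., snp].

    expr: expression of one gene (one value per patient)
    covariates: one tuple per patient, e.g. (age, PC1, PC2)
    snp: minor-allele counts of one SNP (one value per patient)
    Returns (beta_snp, t_statistic, approximate two-sided p-value).
    """
    rows = [[1.0] + list(cov) + [float(g)] for cov, g in zip(covariates, snp)]
    n, p = len(rows), len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * y for r, y in zip(rows, expr)) for i in range(p)]
    coef = solve(xtx, xty)
    resid = [y - sum(c * v for c, v in zip(coef, r)) for r, y in zip(rows, expr)]
    sigma2 = sum(e * e for e in resid) / (n - p)
    # Last diagonal element of (X'X)^-1 gives the variance factor of the SNP effect.
    unit = [0.0] * (p - 1) + [1.0]
    var_beta = sigma2 * solve(xtx, unit)[-1]
    t_stat = coef[-1] / math.sqrt(var_beta)
    # Assumption: normal approximation to the t distribution (large df).
    p_value = math.erfc(abs(t_stat) / math.sqrt(2))
    return coef[-1], t_stat, p_value
```

In a full eQTL scan this test is repeated for every gene-SNP pair, which is why dedicated matrix-based software is normally preferred over a pair-by-pair loop of this kind.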

Multiple-test corrections. In an eQTL study the association between GE and SNP is tested for one gene and one SNP at a time. The process involves a large number of tests; therefore, when making inferences, the effects of testing multiple hypotheses simultaneously need to be taken into account. There are multiple ways to address this problem, including Bonferroni correction, controlling the false discovery rate (FDR), and controlling the positive FDR (pFDR). A detailed review of these methods is provided in Appendix C; for inference in the eQTL study we used FDR to perform multiple-test corrections.
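To make the difference between two of these corrections concrete, the sketch below computes Bonferroni-adjusted p-values and FDR-adjusted p-values. It assumes that the FDR method meant here is the Benjamini-Hochberg step-up procedure; the text in this section does not name the procedure, so this is an assumption, and the function names are illustrative. The pFDR (q-value) approach additionally estimates the proportion of true null hypotheses and is not shown.

```python
def bonferroni(p_values):
    """Bonferroni-adjusted p-values: multiply by the number of tests, cap at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

A test is declared significant at FDR < 0.05 when its adjusted value is below 0.05. Because the Bonferroni adjustment multiplies every p-value by the full number of tests, it is always at least as conservative as the FDR adjustment.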

Cis versus Trans regulation. The significant associations between expression level and SNP are the expression quantitative trait loci, eQTL (i.e., the loci that control the transcript level). eQTL are classified as ‘local’ or ‘distal’ according to the distance between the variant and the starting position of the gene encoding the transcript. A local eQTL is known as a cis-eQTL, where the SNP affecting the transcript level of a gene maps to a proximal position on the same chromosome. Conversely, a distal (trans-) eQTL is a SNP that maps far away from the position of the gene or to a different chromosome (Rockman & Kruglyak, 2006). The distance threshold that separates cis- from trans-acting eQTL is arbitrary. We used the same cutoff as the GBM eQTL study conducted by Chen et al. (2014), namely 1 megabase (Mb). A cis-acting association was defined as a SNP falling within a 1 Mb window upstream or downstream of the starting position of the associated gene.
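The cis/trans rule above amounts to a simple comparison of chromosome and position. The following sketch shows one way to express it; the function name and the example gene start positions are hypothetical, while the 1 Mb window is the cutoff stated in the text.

```python
CIS_WINDOW_BP = 1_000_000  # 1 Mb, the cutoff adopted from Chen et al. (2014)


def classify_pair(snp_chrom, snp_pos, gene_chrom, gene_start,
                  window=CIS_WINDOW_BP):
    """Label a SNP-gene pair as 'cis' or 'trans'.

    A pair is cis when the SNP is on the same chromosome as the gene and
    lies within `window` base pairs of the gene start, in either direction.
    """
    same_chromosome = snp_chrom == gene_chrom
    if same_chromosome and abs(snp_pos - gene_start) <= window:
        return "cis"
    return "trans"
```

For example, a SNP at position 48,706,491 on chromosome 3 would be cis to a gene starting at 48,000,000 on the same chromosome (an invented start position), and trans to any gene on another chromosome.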

3.2.3.1.2 Hotspot estimation

A hotspot SNP is an eQTL variant that affects the expression of multiple genes, many of them located far from the SNP (Breitling et al., 2008). Breitling et al. (2008) suggested that eQTL with multiple gene associations observed in a real dataset may be an artifact of clusters of genes with highly correlated expression. These correlations between genes may be caused by many non-genetic mechanisms, such as environmental factors. Therefore, they recommend using permutations to determine proper thresholds for hotspots. To set the threshold for the number of significant gene associations per SNP, we performed permutations to find the significance threshold for a hotspot. Permuting the SNP data generates a null distribution; thus, the number of significant genes in the permutations gives an estimate of the number of false associations found by chance for any given SNP. Permutations were conducted 100 times using the permuted SNP dataset and the original, non-permuted gene expression data. Figure 3.2 shows a flowchart of the permutation process.


Figure 3.2 Flowchart of observed and permuted (false) eQTL hotspots. The left side represents the observed dataset and the right side represents the generation of a null dataset by permuting the SNP data across individuals. The false eQTL results (right) were expected to contain fewer significant associations, because none of them can be due to a biological effect.


In each permutation, the SNP information of each patient was randomly matched to another patient’s gene expression levels, and a new set of significant eQTL was obtained. For each SNP, the 99th percentile of the number of significant genes among the 100 permutations was computed. The maximum of these 99th percentiles across the 462,021 SNP was then used as the threshold for defining a hotspot. Any SNP with more significantly associated genes than this threshold was declared a hotspot for the disease. Figure 3.3 is a graphical representation of the permutation procedure.

[Figure 3.3: flowchart. GE is paired with the observed SNP data (Obs.) and with 100 permuted SNP datasets (Perm-1 to Perm-100), and an eQTL analysis is run on each. The 99th percentile of the number of significant genes per SNP among the 100 permutations is identified, and the maximum of these values among the 462,021 SNP gives the threshold that defines a hotspot. SNP with more statistically associated genes than the threshold in the observed data are hotspots.]

Figure 3.3 Flowchart showing the process to select a threshold number of genes significantly associated to a SNP. Above the threshold, the SNP was defined to be a hotspot.
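The permutation procedure summarized in Figure 3.3 can be sketched as follows. This is an illustrative outline under several assumptions: `count_significant_genes` is a hypothetical callback standing in for a full eQTL scan (it receives the permuted individual order and returns, for each SNP, the number of genes significant after multiple-test correction), and the percentile uses the nearest-rank definition, which may differ from the definition used in the thesis.

```python
import math
import random


def nearest_rank_percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def hotspot_threshold(count_significant_genes, n_individuals, n_snps,
                      n_perm=100, pct=99, seed=0):
    """Maximum over SNP of the per-SNP percentile of null gene counts.

    count_significant_genes(order) is a hypothetical user-supplied function:
    order[i] is the individual whose genotypes are paired with the gene
    expression of individual i; it returns one count per SNP.
    """
    rng = random.Random(seed)
    per_snp_counts = [[] for _ in range(n_snps)]
    for _ in range(n_perm):
        order = list(range(n_individuals))
        rng.shuffle(order)  # break the genotype-expression link
        counts = count_significant_genes(order)
        for snp_index, count in enumerate(counts):
            per_snp_counts[snp_index].append(count)
    return max(nearest_rank_percentile(c, pct) for c in per_snp_counts)
```

Permuting whole individuals, rather than each SNP separately, keeps the correlation structure among SNP and among genes intact, so only the link between genotype and expression is broken.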


3.2.3.1.3 Target genes selection

The eQTL analysis indicates which SNP may be modifying GE levels. Subsets of target genes were selected based on the eQTL results in order to further study their association with survival. A gene that is regulated by an eQTL and associated with survival may indicate a genetic predisposition to resistance or sensitivity to the disease.

Target genes were selected in three different ways: (i) the whole eQTL gene set, (ii) the cis-acting gene set, and (iii) the hotspot gene set (Figure 3.4). The eQTL gene set included all unique genes whose expression level was statistically associated with at least one SNP. Within this eQTL gene set, genes with cis-acting associations were extracted to form the cis-acting gene set. Last, all genes significantly associated with the selected hotspot SNP formed the hotspot gene set.

[Figure 3.4: diagram of gene sets: whole genome gene set (p=10,883), eQTL gene set, cis-acting gene set, and hotspot gene set.]

Figure 3.4 Subsets of genes selected from the whole genome set

3.2.3.2 Hotspot genotype survival analyses

Survival analyses were carried out to assess the association between patient survival and the hotspots identified. This analysis used a Cox Proportional Hazards model (PHM) with the minor-allele count of the variant as the predictor. The Cox PHM was described in Chapter 2, section 2.2.2 Statistical models (Cox, 1972). A log-rank test was conducted to evaluate whether survival differed significantly between genotype groups.
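For readers unfamiliar with the log-rank test, the sketch below implements the two-group version (Mantel, 1966) using only the standard library. It is a simplification for illustration: the hotspot analysis compares three genotype groups, which requires the multi-group form of the test with 2 degrees of freedom, and the function name and input layout here are assumptions rather than the thesis code.

```python
import math


def logrank_two_groups(times, events, groups):
    """Two-group log-rank test.

    times: follow-up time of each patient
    events: 1 if death was observed, 0 if censored
    groups: 0 or 1 (group membership)
    Returns (chi-square statistic with 1 df, p-value).
    """
    data = list(zip(times, events, groups))
    event_times = sorted({t for t, e, _ in data if e == 1})
    observed_minus_expected = 0.0
    variance = 0.0
    for t in event_times:
        n = sum(1 for tt, _, _ in data if tt >= t)             # at risk
        n1 = sum(1 for tt, _, g in data if tt >= t and g == 1)  # at risk, group 1
        d = sum(1 for tt, e, _ in data if tt == t and e == 1)   # deaths at t
        d1 = sum(1 for tt, e, g in data if tt == t and e == 1 and g == 1)
        observed_minus_expected += d1 - d * n1 / n
        if n > 1:
            variance += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    chi2 = observed_minus_expected ** 2 / variance
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value
```

With small groups, such as a genotype carried by fewer than ten patients, the variance term is dominated by a handful of event times, which is one reason such comparisons have little power.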

3.2.3.3 Gene Expression Wide Association Study (GE-WAS)

In this study, GE-WAS was performed to identify genes whose expression affects the survival of GBM patients in a large population of unrelated people. However, population structure can lead to false positive associations. Because human populations differ across demographic covariates, we incorporated potential confounders in the model to control for this effect. Specifically, the demographic covariates included in the GE-WAS may be associated with patient survival (the response variable) as well as with the gene expression level (the explanatory variable). We therefore adjusted the model for covariates that might influence survival: age (continuous), gender (male or female), and race (white, black, or unknown).

Survival analyses were carried out with the Cox PH model, fitted to the expression level of one gene at a time for 10,883 genes. The statistical model can be described as follows:

$$h_{ij}(t) = h_{0j}(t)\exp\left(GE_{ij}\beta_{GE_j} + age_i\beta_{age_j} + gender_i\beta_{gender_j} + race_i\beta_{race_j}\right)$$

i = 1, …, 460 individuals; j = 1, …, 10,883 genes

where GE_ij is the normalized expression of the jth gene in the ith individual, expressed as log-intensity and pre-corrected for batch effects; h_0j(t) is the baseline hazard; and β_GEj, β_agej, β_genderj, and β_racej are the effects of GE, age, gender, and race on the hazard in the model for the jth gene.


In the GE-WAS, multiple tests (10,883 tests) were conducted simultaneously, which can produce a large number of false positive associations between survival and gene expression. Therefore, we applied multiple-test corrections to the results. Three methods were used (i.e., Bonferroni, FDR, and pFDR) and their results were compared. A detailed review of these methods is provided in Appendix C. Last, to increase the power to detect true positive associations, GE-WAS was also performed within each selected target gene set.


3.3 Results

3.3.1 Expression quantitative trait loci

3.3.1.1 eQTL analysis

The total number of hypothesis tests used to evaluate the association between genetic variants and gene expression level was 4,504,704,750 (9,750 genes by 462,021 SNP). A cis association between genotype and gene expression was defined as a SNP located within a 1 Mb window of the starting position of the gene. Following this definition, 3,175,427 gene-SNP pairs were categorized as cis-acting and 4,501,529,323 as trans-acting. Among these 4.5 billion tests, 47,794,924 had nominal p-values smaller than 0.01; after multiple-test correction (FDR < 0.05), 5,429 significant GBM eQTL were identified. Cis associations were found in 301 genes and 1,112 SNP, giving 1,190 cis-eQTL, and trans associations were found in 1,175 genes and 2,880 SNP, giving 4,239 trans-eQTL (Figure 3.5).


SNP (after quality control): p=462,021 SNP
GE (after quality control): p=9,750 genes
Total number of tests: 4,504,704,750 tests

                                            Cis-acting         Trans-acting
Tests                                       3,175,427          4,501,529,323
Significant associations (p < 0.01)         80,952             47,713,972
Multiple test correction (FDR < 0.05)       1,190 cis-eQTL     4,239 trans-eQTL
Unique SNP                                  1,112              2,880
Unique genes                                301                1,175

Figure 3.5 Flowchart of eQTL analysis. The flowchart shows the number of tests and the number of significant associations at each step of the analysis. Cis-acting results are shown on the left and trans-acting results on the right. The study found 1,190 cis-eQTL and 4,239 trans-eQTL.


Figure 3.6 eQTL map. Dots represent statistically significant eQTL, plotted at the position of the SNP (x-axis) against the position of the gene (y-axis). Blue dots are cis-eQTL and red dots are hotspot eQTL. The axis labels are human chromosome numbers, and SNP (x) and genes (y) are ordered by position within each chromosome.

Figure 3.6 shows the genes affected by genetic variants in the eQTL analysis. We observed a higher density of cis-acting than trans-acting eQTL, and several SNP with multiple gene associations. In Figure 3.6, each point represents a statistically significant eQTL (after multiple-test correction), and both SNP and genes are ordered by position from chromosome 1 to chromosome 23. The diagonal running from the lower left corner to the upper right corner corresponds to SNP and genes located close to each other, that is, the cis-acting eQTL found in this study. The plot shows a higher intensity of cis-eQTL along this diagonal compared to trans-eQTL, which are located off the diagonal. Approximately 400 significant cis-eQTL were found per 1 million cis tests, whereas only about 1 significant trans-eQTL was found per 1 million trans tests. We also observed several vertical lines of significant tests in the plot, which correspond to SNP with numerous gene associations and are therefore potential hotspots.
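The test counts and the per-million rates quoted above can be reproduced directly from the reported numbers. The short check below uses only figures stated in this section; the variable names are illustrative.

```python
n_genes = 9_750
n_snp = 462_021

total_tests = n_genes * n_snp          # every gene tested against every SNP
cis_tests = 3_175_427                  # gene-SNP pairs within the 1 Mb window
trans_tests = total_tests - cis_tests  # all remaining pairs

cis_eqtl = 1_190                       # significant after FDR < 0.05
trans_eqtl = 4_239

# Significant eQTL per one million tests of each type.
cis_per_million = cis_eqtl / cis_tests * 1e6      # a little under 400
trans_per_million = trans_eqtl / trans_tests * 1e6  # a little under 1

print(f"total tests: {total_tests:,}")
print(f"cis per million: {cis_per_million:.1f}")
print(f"trans per million: {trans_per_million:.2f}")
```

The two rates differ by a factor of several hundred, which is the basis for the statement that cis-eQTL are far denser than trans-eQTL even though trans-eQTL are more numerous in absolute terms.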

Among the 5,429 significant associations between expression level and SNP (both cis- and trans-acting) detected in GBM tumor tissue, there were 3,970 unique eQTL-based SNP. Some of these SNP were linked to more than one gene. In this study, 3,314 eQTL-based SNP (83% of the unique eQTL-based SNP) were significantly associated with only 1 gene, and the largest number of genes associated with a single eQTL-based SNP was 26 (Table 3.1 and Figure 3.7). This large number of eQTL could potentially be explained by the existence of hotspots in the GBM tumor. Additionally, SNP associated with many genes would be more likely to explain variation in tumor gene expression.

Table 3.1 Frequency table of the number of significant genes per SNP (FDR<0.05). For example, one SNP is significantly associated with 20 genes and another with 26 genes (the significant hotspots).

nGene   1       2     3     4    5    6    7    8    9
nSNP    3,314   377   119   55   36   18   16   13   1

nGene   10   11   12   13   14   15   17   20   26
nSNP    6    6    1    1    3    1    1    1    1

nGene: number of significant genes per SNP; nSNP: number of SNP. Significant eQTL (cis and trans) were filtered by FDR < 0.05.
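As a consistency check, the totals quoted in the text can be recovered from Table 3.1. The snippet below simply re-enters the table as a dictionary (number of genes per SNP mapped to number of SNP); the variable names are illustrative.

```python
# Table 3.1: number of significant genes per SNP -> number of such SNP.
table_3_1 = {
    1: 3314, 2: 377, 3: 119, 4: 55, 5: 36, 6: 18, 7: 16, 8: 13, 9: 1,
    10: 6, 11: 6, 12: 1, 13: 1, 14: 3, 15: 1, 17: 1, 20: 1, 26: 1,
}

unique_snp = sum(table_3_1.values())                               # eQTL-based SNP
total_eqtl = sum(genes * snps for genes, snps in table_3_1.items())  # SNP-gene pairs
share_single_gene = table_3_1[1] / unique_snp                       # SNP with one gene
n_hotspots = sum(snps for genes, snps in table_3_1.items() if genes >= 18)

print(unique_snp, total_eqtl, round(100 * share_single_gene), n_hotspots)
```

The table reproduces the 3,970 unique SNP, the 5,429 significant eQTL, and the 83% of SNP associated with a single gene, and exactly two SNP reach the threshold of 18 genes.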


Figure 3.7 Manhattan plot of the number of significant genes per SNP observed in the unpermuted data. Using 18 genes as the threshold to define a hotspot, 2 SNP, rs7698461 and rs3172494, were identified as hotspots in GBM patients. SNP are ordered by position in the human chromosomes.

3.3.1.2 Hotspot estimation

By chance, a maximum of 17 genes would be significantly associated with a single SNP: seventeen was the largest value, across the 462,021 SNP, of the 99th percentile of the number of significant genes per SNP among the 100 permutations (Figure 3.8). Any eQTL-based SNP associated with the expression of 18 or more genes was therefore defined as a hotspot. We found 2 hotspots, each associated with several genes (Table 3.2).

Figure 3.8 Manhattan plot of the 99th percentile of the number of significant genes per SNP observed in 100 permutations. The dashed line is the significance threshold of 18 genes. SNP are ordered by position in the human chromosomes.

Table 3.2 Number of cis- / trans-acting genes, chromosome, and position of the hotspots rs3172494 and rs7698461.

Hotspot     Cis   Trans   Total   Chromosome   Position
rs3172494   1     25      26      3            48706491
rs7698461   0     20      20      4            147102835


The 2 hotspots are rs3172494 and rs7698461. Figure 3.6 shows the hotspots as red vertical lines. Hotspot-rs3172494 is located on chromosome 3 and has 1 cis-acting eQTL and 25 trans-acting eQTL. Hotspot-rs7698461 is on chromosome 4 and is associated with 20 genes, all of which are trans-eQTL (Table 3.2). The hotspot SNP were further studied as genetic factors affecting survival in GBM patients.

3.3.1.3 Target gene selection

The selected target gene sets are the eQTL gene set, the cis-acting gene set, and the hotspot gene set. From the 5,429 eQTL discovered in this study, 1,431 unique genes were selected for the eQTL target gene set. Within this eQTL target gene set, 301 unique cis-eQTL-based genes were selected for the cis-acting target gene set. Hotspot-rs3172494 is associated with 26 genes and hotspot-rs7698461 is associated with 20 genes, which gave a total of 45 unique genes for the hotspot target gene set.

3.3.2 Survival analyses for hotspot

Figures 3.9 and 3.10 show the survival curves obtained from the Cox regression analysis of years of survival after initial diagnosis in patients with GBM. Neither hotspot showed a statistically significant difference in hazard between individuals carrying different counts of the minor allele. The p-values for hotspot-rs3172494 and hotspot-rs7698461 were both greater than 0.05, but the survival curves in these 2 figures do not overlap closely and show a different trend in each group.


Figure 3.9 Survival curves of hotspot-rs3172494 by genotype. The solid line represents the mean of the group, and the dotted line represents the 95% confidence interval. Years: years of patient survival after diagnosis.


Figure 3.10 Survival curves of hotspot-rs7698461 by genotype. The solid line represents the mean of the group, and the dotted line represents the 95% confidence interval. Years: years of patient survival after diagnosis.

Furthermore, some genotype groups have relatively small sample sizes. For rs3172494, only 8 people have 2 copies of the minor allele; for rs7698461, only 7 people carry no copy of the minor allele (Table 3.3). With such small groups, the confidence intervals are relatively wide, which makes it harder to detect statistically significant differences between groups. Nevertheless, the trend suggests that patients with more copies of the minor allele in rs3172494 and rs7698461 live longer than patients with fewer copies.


Table 3.3 Number of patients with 0, 1, or 2 copies of the minor allele at each hotspot.

Hotspot     0 copy     1 copy    2 copies   Total
rs3172494   294 (CC)   75 (AC)   8 (AA)     377
rs7698461   7 (GG)     61 (AG)   309 (AA)   377

3.3.3 Gene Expression Wide Association Study

In the whole genome set of gene expression, 430 genes (3.95%) were significantly associated with survival at a positive FDR below 0.05 when no demographic confounders were included in the model. Results differed across multiple-test corrections. As expected, the most stringent method, Bonferroni, found the fewest significant genes (11 genes; 0.1%) associated with survival, and the FDR method found 315 genes (2.89%) whose expression was associated with survival (Table 3.4).

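Table 3.4 Number of genes significantly associated with survival in different gene sets, by multiple-test correction method and threshold.

Gene set (number of genes)    Covariates              Threshold   Bonferroni   FDR   pFDR
Whole genome set (10,883)     GE                      0.05        11           315   430
Whole genome set (10,883)     GE                      0.10        15           663   1,003
Whole genome set (10,883)     GE+age+gender+race      0.05        0            0     0
Whole genome set (10,883)     GE+age+gender+race      0.10        0            0     22
eQTL gene set (1,431)         GE                      0.05        3            5     5
eQTL gene set (1,431)         GE                      0.10        4            27    37
eQTL gene set (1,431)         GE+age+gender+race      0.05        0            0     0
eQTL gene set (1,431)         GE+age+gender+race      0.10        0            0     0
Cis-acting gene set (301)     GE                      0.05        0            0     7
Cis-acting gene set (301)     GE                      0.10        0            18    30
Cis-acting gene set (301)     GE+age+gender+race      0.05        0            0     0
Cis-acting gene set (301)     GE+age+gender+race      0.10        0            0     0
Hotspot gene set (45)         GE                      0.05        0            0     2
Hotspot gene set (45)         GE                      0.10        0            2     4
Hotspot gene set (45)         GE+age+gender+race      0.05        0            0     0
Hotspot gene set (45)         GE+age+gender+race      0.10        0            0     0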


GE-WAS was also applied to each subset of target genes. Here too, the number of significant genes differed across multiple-test corrections. In all three target gene sets, the Bonferroni method identified the fewest significant genes, followed by FDR and then pFDR, which found the most genes significantly associated with survival.

Across the 3 subsets, the genes found to be significantly associated with survival (pFDR < 0.05) are CD151, DNTTIP2, EFEMP2, FBXO17, HIST1H4B, LRRFIP1, MAN2B2, MTRR, NOL3, PDCL3, TAPBPL, THNSL2, TIMM22, and ZIC3. The proportions of genes found significant at pFDR < 0.05, without controlling for confounders in the model, were 0.35%, 2.33%, and 4.44% in the eQTL, cis-acting, and hotspot gene sets, respectively. The highest proportion of genes significantly associated with survival of GBM patients was therefore found in the hotspot target set.
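The proportions above follow from the pFDR < 0.05 counts for the unadjusted (GE only) model in Table 3.4. The short calculation below re-derives them; the dictionary layout and names are illustrative.

```python
# (significant genes at pFDR < 0.05, genes in the set), GE-only model, Table 3.4.
gene_sets = {
    "eQTL": (5, 1431),
    "cis-acting": (7, 301),
    "hotspot": (2, 45),
}

percent_significant = {
    name: round(100 * hits / size, 2) for name, (hits, size) in gene_sets.items()
}
total_significant = sum(hits for hits, _ in gene_sets.values())

print(percent_significant)
print(total_significant)
```

The three counts add up to the 14 genes listed above, which is consistent with each listed gene being counted in exactly one of the three sets.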

However, after incorporating demographic covariates (i.e., age, gender, and race), no genes were significantly associated with survival at the 0.05 threshold under Bonferroni, FDR, or pFDR correction, in either the whole genome set or the three target subsets. It is possible that age at initial diagnosis, race, and gender affect patient survival via certain genes; it is also possible that these confounders affect survival via another mechanism and that gene expression is merely a marker of age, gender, or race.


3.4 Discussion and Conclusions

In this chapter we identified genetic markers affecting the regulation of gene expression, and we assessed the association between gene expression and time to death in GBM patients. Two eQTL hotspots were discovered, and target gene sets were selected according to the eQTL results. GE-WAS revealed 14 genes significantly associated with survival among the whole genome set and the 3 target gene groups.

Two hotspots were identified by the eQTL analysis when we set a threshold of 18 genes significantly associated with one SNP. The threshold was based on an analysis with 100 permutations. The hotspots rs3172494 and rs7698461 had 26 and 20 associated genes, respectively. As far as we know, these two hotspots have not been reported to be associated with disease progression in GBM or other cancers.

We observed that the survival curves of the genotype groups have the same order in both hotspots. The major and minor alleles are cytosine and adenine for hotspot rs3172494, and guanine and adenine for rs7698461, respectively. In both hotspots, GBM patients with 2 copies of the minor allele have higher survival than patients with 1 copy or none, and individuals with no copy of the minor allele have the shortest survival. Nevertheless, the small sample size limits statistical power and thus the possibility of detecting true effects (Button et al., 2013). The hotspot survival analysis did not show a statistically significant difference in hazard between patients carrying different genotypes.

Another limitation of this study is that the eQTL analysis considered common variants, and SNP genotypes derived from normal and tumor tissue are highly concordant. Common variants therefore reflect the genetic background and composition of the subjects, but not the changes that occur due to the cancer.

GE-WAS without controlling for demographic effects found 14 genes significantly associated with the survival of GBM patients. Among these 14 genes, TAPBPL and THNSL2 were also reported in the GBM eQTL study of Chen et al. (2014). Although the previous study used a different tissue and selection method, the expression levels of these 2 genes were also found in our study to be significantly associated with patients’ time to death.

Nevertheless, a model that does not control for confounders carries the risk of false positives in a structured population (Vilhjálmsson, 2013), and the GE-WAS that included demographic covariates (i.e., age, gender, and race) to avoid confounding did not show any significant gene. To evaluate the true genetic effect on survival, the effect of population structure must be removed. A study with a larger sample size is needed to confirm the associations between time to death and gene expression, as well as between gene expression and genetic variants.

Batch and age at initial diagnosis were each found by the DE analysis to be associated with numerous differentially expressed genes. We therefore pre-corrected GE levels for batch and included age in the eQTL analysis. Soneson et al. (2014) pointed out that batch differences can significantly affect the generalizability of a study’s results.

Our DE analysis also found a significant batch effect (16 levels) on gene expression, with more than 95% of genes differentially expressed across batches. Batch correction was therefore performed before the eQTL analysis. Age at initial diagnosis was the covariate with the most significant effect on the transcript level of genes in the DE analysis. Moreover, age has been found in many GBM studies to be a critical risk factor affecting patient survival (ABTA, 2014; Moon & Lesniak, 2014).

Cancer is characterized by a very large number of somatic mutations and copy number variations (CNV). However, the small number of patients carrying any individual mutation raises power issues, limiting our ability to study individual mutations. Nevertheless, CNV can provide more insight into how DNA changes regulate GE and subsequently affect survival. Thus, further studies of expression and CNV could potentially yield more information about disease progression and distinguish patients who are more resistant to the disease.


APPENDICES


Appendix A. Identification of confounders in gene expression

Differential expression (DE) analysis served as a preliminary test for our study. The main goal of the DE analysis in this chapter is to identify the covariates that need to be included in the eQTL model to account for their effects. When a significant proportion of genes is differentially expressed across the levels of a covariate, the covariate is considered a factor affecting gene expression. Thus, covariates with many differentially expressed genes were added to the eQTL model. On the other hand, if the expression of a gene remains the same across the levels of a covariate, the gene is equally expressed (EE) between groups.

The statistical model used to identify significant DE genes can be described as follows:

$$GE_j = \mathbf{1}\mu_j + X\beta_j + \varepsilon_j$$

$$\varepsilon_j \sim N\left(\mathbf{0},\ I\sigma^2_{\varepsilon_j}\right)$$

j = 1, …, 10,883 genes

where GE_j is the vector of normalized expression of the jth gene, expressed as log-intensity and pre-corrected for batch effects; μ_j is the mean expression level of the jth gene; X is a design matrix of dimension 460 patients by 3 variables, namely one covariate x (e.g., age, gender, or race, among others; see Appendix A in Chapter 2) and PC1 and PC2 derived from tumor SNP; and β_j is the vector of the 3 corresponding effects (covariate x, PC1, and PC2) on the jth gene.

Out of the 10,883 genes remaining after editing for the inclusion criteria and removing genes without variants, 10,592 (97%) were differentially expressed between batches (Table 3.5). This shows that systematic differences between batches can add background noise and lead to a large number of false positives, since a high proportion of the genes appeared differentially expressed because of this artifact.

Table 3.5 Summary of differential expression analysis for baseline model

| Variables | nobs (1) | levels | 5% FDR (2) |
|---|---|---|---|
| Gender | 460 | 2 | 24 |
| Age at initial diagnosis | 460 | cont (3) | 1,152 |
| Race | 445 | 2 | 3 |
| Initial diagnosis method | 460 | 3 | 151 |
| Karnofsky Performance Score | 338 | cont (3) | 0 |
| Targeted molecular therapy | 327 | 2 | 76 |
| Year 1 vital status | 382 | 2 | 1 |
| Year 2 vital status | 368 | 2 | 347 |
| Age group | 460 | 4 | 134 |
| GE batch | 460 | 16 | 10,592 |
| SNP derived PC1 | 364 | cont (3) | 26 |
| SNP derived PC2 | 364 | cont (3) | 65 |
| SNP derived PC3 | 364 | cont (3) | 0 |

nobs (1): number of observations; levels: number of levels in incidence matrix; 5% FDR (2): number of genes with significantly different expression after FDR correction (FDR<0.05); cont (3): continuous variable.

In the DE analysis, PC1 and PC2 derived from tumor SNP were found to have 26 and 65 DE genes, respectively (Table 3.5); hence, the first two PCs were included as fixed effects in the eQTL analysis. Table 3.6 presents the results of the DE analysis after pre-correcting for batch effects and adding PC1 and PC2 derived from tumor SNP; it shows the number of genes differentially expressed per variable after FDR correction.

According to the results of the DE analysis, age at initial diagnosis is the variable with the most differentially expressed genes (Table 3.6). Age has 816 DE genes (8% of the whole gene set), indicating a substantial effect on gene expression.

Thus, age at initial diagnosis was selected as an important demographic covariate whose effect on differential gene expression needs to be accounted for. Hence, in the subsequent eQTL analysis, the covariates included in the model were PC1 and PC2 derived from tumor SNP and the patient’s age at initial diagnosis.

Table 3.6 Summary of differential expression analysis for model adjusted by batch effect

| Variables | nobs (1) | levels | 5% FDR (2) |
|---|---|---|---|
| Gender | 364 | 2 | 27 |
| Age at initial diagnosis | 364 | cont (3) | 816 |
| Race | 351 | 2 | 31 |
| Initial diagnosis method | 364 | 3 | 0 |
| Karnofsky Performance Score | 276 | cont (3) | 0 |
| Targeted molecular therapy | 257 | 2 | 76 |
| Year 1 vital status | 318 | 2 | 1 |
| Year 2 vital status | 311 | 2 | 23 |
| Age group | 364 | 4 | 10 |
| PC3_snp | 364 | cont (3) | 0 |

nobs (1): number of observations; levels: number of levels in incidence matrix; 5% FDR (2): number of genes with significantly different expression after FDR correction (FDR<0.05); cont (3): continuous variable.

Appendix B. Accounting for population structure in eQTL model

In genome-wide association studies, it has been shown that including a polygenic effect in the model accounts for relationships and population stratification (Astle & Balding, 2009). Thus, we fitted a linear mixed model to estimate the heritability of gene expression. Based on the results of the previously described analyses, the model included age at initial diagnosis as a fixed effect and a random individual polygenic effect. The statistical model can be described as follows,

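$$GE_j = \mu_j \mathbf{1} + x\beta_j + u_j + \varepsilon_j, \qquad u_j \sim N(\mathbf{0}, G\sigma^2_u), \qquad \varepsilon_j \sim N(\mathbf{0}, I\sigma^2_{\varepsilon})$$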

for j = 1, …, 10,883 genes, where GEj is the vector of normalized expression measurements of the jth gene across individuals, expressed as log-intensity pre-corrected for batch effects, with dimension 460 individuals by 1 gene; μj is the mean expression level of the jth gene; xβj models age as a fixed effect on the jth gene; and uj is the random individual polygenic effect, which models the SNP effects as random.

Heritability (h2) of gene expression, the proportion of variance of an expression trait that is due to genetic additive variation among individuals, was computed from variance component estimates and can be estimated by,

$$h^2 = \frac{\sigma^2_u}{\sigma^2_u + \sigma^2_{\varepsilon}}$$

To test whether the genetic effect significantly affects gene expression, a likelihood ratio test was performed by comparing the fit of the full model (fixed effect + random effect) to the fit of a reduced model (fixed effect only).
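The two computations described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the variance components and log-likelihood values are hypothetical, the function names are invented for the example, and the chi-square reference distribution with 1 degree of freedom is an assumption, since the reference distribution is not specified here. Because the variance component is tested on the boundary of its parameter space, a mixture distribution is often preferred in practice.

```python
import math


def heritability(var_u, var_e):
    """h^2 = sigma_u^2 / (sigma_u^2 + sigma_e^2) from variance component estimates."""
    return var_u / (var_u + var_e)


def lrt_pvalue(loglik_full, loglik_reduced):
    """Likelihood ratio statistic and its p-value.

    Assumption for illustration: the statistic is referred to a chi-square
    distribution with 1 degree of freedom, whose upper tail probability is
    P(X > s) = erfc(sqrt(s / 2)).
    """
    stat = max(0.0, 2.0 * (loglik_full - loglik_reduced))
    return stat, math.erfc(math.sqrt(stat / 2.0))


# Hypothetical values for a single gene.
h2 = heritability(var_u=0.47, var_e=0.53)
stat, p_value = lrt_pvalue(loglik_full=-1250.0, loglik_reduced=-1252.0)
print(round(h2, 2), stat, round(p_value, 4))
```

A gene for which the random polygenic effect does not improve the fit has equal log-likelihoods under both models, giving a statistic of 0 and a p-value of 1.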

The average estimated heritability of all genes was 0.47 ± 0.43 (mean ± sd).

However, the plot of the significance of the genetic effect against heritability shows a troubling pattern (Figure 3.11). In general, the expected pattern is that significance increases with estimated heritability, so that genes with estimated heritability close to 1.0 would have highly significant p-values. However, Figure 3.11 shows that many genes with heritability close to 1 also have p-values close to 1. This result suggests that the polygenic effect cannot be reliably separated from the residual effects because of the very sparse realized proportion of allele sharing. Based on this result, we decided to account for population stratification by fitting principal components derived from tumor SNP as fixed effects.

Figure 3.11 Scatter plot of heritability and significance of the genetic effect for the 10,883 genes.

Principal component (PC) analysis was used to evaluate whether the effects of population structure needed to be removed. The variation explained by PCz decreases as z increases, and the PCs are mutually uncorrelated (Zhao et al., 2014). The first three PCs derived from SNP markers, together explaining 40% of the total variation, were tested for the number of genes differentially expressed across levels of each PC. Only the first two PCs showed differentially expressed genes. Thus, we added the first two PCs to the model to control for the effect of population structure.

Appendix C. Multiple-test correction methods

Both the eQTL and GE-WAS analyses involve testing a large number of hypotheses simultaneously. Several approaches can be considered to account for multiple testing, including Bonferroni (Dunn, 1961), False Discovery Rate (FDR) (Benjamini & Hochberg, 1995), and positive False Discovery Rate (pFDR) (Storey, 2002).

The Bonferroni method controls the experiment-wise type I error rate and is thus preferable for evaluating statistical significance when only a few tests are conducted. It sets the significance threshold by dividing the desired type I error rate by the number of tests performed (α/m). With this p-value cut-off, the probability of making at least one type I error (false discovery) is at most α; the bound holds whether or not the tests are independent. With high-dimensional genomic data, the Bonferroni method is generally over-conservative and reduces the power of the study.
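To give a sense of how strict the α/m cut-off becomes at genomic scale, the short Python sketch below computes it for a hypothetical set of 10,883 tests (one per gene) at α = 0.05. The function names and the example p-values are illustrative assumptions, not part of the analysis.

```python
def bonferroni_threshold(alpha, m):
    """Per-test significance threshold alpha / m."""
    return alpha / m


def bonferroni_reject(p_values, alpha=0.05):
    """Flag each p-value that falls at or below alpha divided by the number of tests."""
    cutoff = bonferroni_threshold(alpha, len(p_values))
    return [p <= cutoff for p in p_values]


# Hypothetical scale: one test per gene.
cutoff = bonferroni_threshold(0.05, 10883)
print(f"{cutoff:.2e}")

# Three hypothetical p-values: the per-test cut-off is 0.05 / 3.
print(bonferroni_reject([0.001, 0.02, 0.04]))
```

With only three tests, a p-value of 0.02 already fails the corrected threshold, which illustrates why the method loses power quickly as the number of tests grows.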

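Alternatively, one can adopt a criterion that controls the rate of false discoveries (FDR = false discovery rate; Benjamini & Hochberg, 1995). Table 3.7 describes the quantities from which the FDR is computed: the FDR is the expected proportion of false discoveries among the rejected hypotheses, E[V/R], with V/R taken as 0 when R = 0.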

Table 3.7 Possible outcome of performing m hypothesis tests simultaneously.

Columns give the true value of the hypothesis; rows give the test result.

| Test result | H0 is true | Ha is true | Total |
|---|---|---|---|
| Significant (p-value < α) | V (type I error) | S | R |
| Not significant (p-value ≥ α) | U | T (type II error) | m-R |
| Total | m0 | m-m0 | m |

The method of Benjamini & Hochberg fixes the acceptable FDR level beforehand and applies a sequential threshold across the tests, controlling the rejection region for each test. The procedure considers the observed p-values of the m tests in ascending order, p(1) ≤ p(2) ≤ … ≤ p(k) ≤ … ≤ p(m), and recomputes the threshold at each rank. The number of significant tests (k) detected by FDR is estimated by

$$k = \max\left\{k : p_{(k)} \le \frac{k}{m}\,\alpha\right\}$$

which rejects all hypotheses with p-value ≤ p(k). The FDR will then be less than or equal to the chosen level (α).
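The step-up rule described above can be sketched in Python as follows. The function name and the p-values are hypothetical and chosen only to show the behavior of the rule; this is an illustration, not the code used for the analyses.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return (k, rejected) for the Benjamini-Hochberg step-up rule.

    k is the largest rank whose ordered p-value satisfies p_(k) <= (k / m) * alpha;
    rejected flags, in the original order, the k hypotheses with the smallest p-values.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p-value
    k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            k = rank  # keep the largest rank that passes its threshold
    rejected = [False] * m
    for idx in order[:k]:
        rejected[idx] = True
    return k, rejected


# Hypothetical p-values; the thresholds at ranks 1..4 are 0.0125, 0.025, 0.0375, 0.05.
k, rejected = benjamini_hochberg([0.04, 0.01, 0.035, 0.03], alpha=0.05)
print(k, rejected)
```

In this toy example the second smallest p-value (0.03) exceeds its own threshold (0.025), yet it is still rejected because a larger rank passes its threshold. This is the sense in which the rule takes the maximum k rather than stopping at the first failure.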

Storey (2002) provided another method, called positive False Discovery Rate (pFDR). This approach estimates the FDR for each test:

$$\widehat{FDR}(p_{(l)}) = \frac{\hat{\pi}_0 \, p_{(l)}}{l/m}$$

where π̂0 is the estimated proportion of truly null tests. Moreover, the point estimate of FDR is expected to be equal to the true FDR. The number of significant tests (l) rejected by the pFDR method is estimated by

$$l = \max\left\{l : \widehat{FDR}(p_{(l)}) \le \alpha\right\}$$

which rejects all hypotheses with p-value ≤ p(l), and provides FDR ≤ α.

The difference between FDR and pFDR is the inclusion of the proportion of truly null tests (π0) in pFDR, which is estimated by the number of p-values exceeding a cut-off λ divided by the number expected to exceed λ if all hypotheses were null,

$$\hat{\pi}_0(\lambda) = \frac{\#\{p_i > \lambda\}}{(1 - \lambda)\,m}$$

where λ is a tuning parameter that bounds the rejection region [0, λ], and (1-λ)m is the expected number of p-values greater than λ when all m null hypotheses are true.
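The estimate of the proportion of truly null tests can be sketched in Python as shown below. The choice λ = 0.5, the cap at 1, the function name, and the p-values are assumptions made for illustration only; the value of λ used in the analysis is not stated here.

```python
def estimate_pi0(p_values, lam=0.5):
    """Estimate the proportion of truly null tests.

    Counts the p-values above lam and divides by (1 - lam) * m, the count
    expected above lam if every hypothesis were null. The result is capped
    at 1 (a common practical choice, assumed here) because a proportion
    cannot exceed one.
    """
    m = len(p_values)
    above = sum(1 for p in p_values if p > lam)
    return min(1.0, above / ((1.0 - lam) * m))


# Hypothetical p-values: 4 of the 10 exceed lam = 0.5.
p_values = [0.001, 0.004, 0.02, 0.03, 0.2, 0.45, 0.55, 0.7, 0.8, 0.95]
pi0 = estimate_pi0(p_values, lam=0.5)
print(pi0)
```

Here 4 p-values exceed 0.5 while 5 would be expected under a complete null, so the estimate is 4/5.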

The numbers of tests rejected by the two methods are equal (k = l) if π̂0 in pFDR is equal to one,

$$l = \max\left\{l : \frac{p_{(l)}}{l/m} \le \alpha\right\} \iff k = \max\left\{k : p_{(k)} \le \frac{k}{m}\,\alpha\right\}$$

whereas l is greater than or equal to k when π̂0 is less than 1. The pFDR method therefore rejects at least as many hypotheses as FDR while controlling the same false discovery rate.
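The relationship between the two rejection counts can be checked numerically with a small Python sketch. The function name, the p-values, and the value 0.8 used for the estimated null proportion are hypothetical; the sketch simply applies the same rule with and without the null-proportion factor.

```python
def count_rejections(p_values, alpha=0.05, pi0=1.0):
    """Largest rank l with pi0 * p_(l) / (l / m) <= alpha.

    With pi0 = 1.0 this is the Benjamini-Hochberg count; with pi0 < 1 it is
    the count obtained when the estimated FDR includes the null proportion.
    """
    m = len(p_values)
    best = 0
    for l, p in enumerate(sorted(p_values), start=1):
        if pi0 * p * m / l <= alpha:
            best = l
    return best


# Hypothetical p-values for 10 tests.
p_values = [0.001, 0.004, 0.018, 0.03, 0.2, 0.45, 0.55, 0.7, 0.8, 0.95]
k = count_rejections(p_values, alpha=0.05, pi0=1.0)   # FDR (Benjamini & Hochberg)
l = count_rejections(p_values, alpha=0.05, pi0=0.8)   # pFDR with an assumed pi0 of 0.8
print(k, l)
```

In this toy example the third ordered p-value has an estimated FDR of 0.06 when the null proportion is taken as 1 and 0.048 when it is taken as 0.8, so one additional hypothesis is rejected at α = 0.05.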

Appendix D. Subsets of target genes

Table 3.8 Subsets of target genes Subset of genes Gene names eQTL gene AAK1, AASDHPPT, ABCA12, ABCA4, ABCB4, ABCB9, ABCC2, ABCE1, ABCG5, ABHD3, ACBD3, ACE2, ACRV1, ACSBG2, ACSM3, ACTL8, ACVRL1, ACYP1, ADAMDEC1, ADAMTSL3, ADCY8, ADCYAP1, ADD3, ADH1B, ADH1C, ADH7, ADI1, ADIPOQ, ADRA2C, AFAP1, AFP, AGBL2, AGER, AGTR1, AGTR2, AHSA1, AHSG, AICDA, AJAP1, AKAP10, AKAP3, AKR1B10, AKT1, AKT3, ALAS2, ALB, ALDH1B1, ALDH3A1, ALDH8A1, ALMS1, ALPL, ALX1, AMACR, AMBN, AMFR, AMH, ANGEL2, ANKRD1, ANKRD55, ANKS1B, ANPEP, ANTXR1, ANXA13, ANXA5, AOC3, AP3S2, APOBEC3A, APOH, APOOL, APPL1, AQP5, ARFGEF1, ARHGAP15, ARHGAP19, ARHGAP24, ARHGDIB, ARHGEF3, ARHGEF5, ARHGEF6, ARID3B, ARNT, ARRB1, ARSA, ARSB, ARTN, ASB13, ASH2L, ATAD2B, ATF5, ATF6, ATF7IP2, ATG7, ATHL1, ATIC, ATP12A, ATP4A, ATP5A1, ATP6V0A4, ATP6V1B1, ATP6V1D, ATP6V1E1, ATP7A, ATP8A1, ATP9A, ATXN7, ATXN7L1, AZIN1, BARX1, BCL10, BCL11B, BCL2L14, BEGAIN, BEX1, BFAR, BFSP2, BHMT, BHMT2, BLK, BMP5, BMP8B, BMX, BNC1, BRE, BSN, BTC, BTG3, BTN3A2, C14orf105, C18orf8, C1orf27, C1QA, C20orf27, C2orf42, C2orf47, C3, C4BPA, C4BPB, C5orf15, C6, C6orf15, C8G, CA6, CABP1, CABYR, CACNA2D2, CACNA2D3, CACNG2, CALB1, CALCA, CALCB, CALML5, CAMK1G, CAMLG, CAMP, CAND2, CANX, CAPN6, CAPN9, CAPRIN2, CARD8, CARS2, CARTPT, CASQ2, CAT, CBFA2T3, CBLL1, CBLN1, CBR3, CCBL2, CCDC33, CCDC88C, CCKBR, CCL11, CCL19, CCL21, CCL22, CCL4, CCNE1, CCNH, CCNL1, CCR6, CCR7, CD151, CD177, CD19, CD1D, CD226, CD3E, CD70, CD79A, CD79B, CD80, CD81, CD84, CD9, CDC42, CDH12, CDH17, CDH20, CDH7, CDK5, CDKAL1, CDO1, CDT1, CEACAM1, CEACAM5, CEACAM6, CEBPB, CEBPZ, CEL, CELP, CENPI, CENPJ, CENPN, CENPT, CEP135, CEP192, CEP76, CES1, CETN3, CFD, CFHR5, CFTR, CGA, CHCHD2, CHD8, CHIA, CHIT1, CHML, CHRNA1, CHRNA5, CIAO1, CILP, CKB, CLC, CLCA1, CLCA2, CLCA4, CLCN5, CLDN1, CLDN16, CLDN18, CLDN3, CLDN4, CLDN8, CLEC2B, CLEC2D, CLIC5, CLPP, CLU, CNBP, CNGA1, CNN3, CNNM1, CNTN6, COCH, COL10A1, COL11A2, COL15A1, COL18A1, COL1A1, COL2A1, COL5A2, COLQ, COMP, CORIN, CORO2A, CP, CPB2, CPM, 
CPNE1, CPS1, CPZ, CR2, CRABP1, CRABP2, CREB3L1, CREG1, CRH, CRIPT, CRISP2, CRISP3, CROCC, CRTAM, CRTC3, CRYBA1, CRYBA2, CRYBB2, CRYGD, CRYL1, CRYM, CRYZ, CSH1, CSN1S1, CSN3, CSRP1, CST1, CST2, CST6, CTAG2, CTBP1, CTSC, CTSK, CUZD1, CXADR, CXCL9, CXorf56, CYP19A1, CYP20A1, CYP24A1, CYP26A1, CYP27B1, CYP2E1, CYP3A5, CYP4F11, CYP4F2, D4S234E, DAB1, DACH1, DAD1, DAZ1, DAZ4, DAZL, DCHS1, DCT, DCUN1D1, DCUN1D2, DCUN1D4, DDC, DDN, DDX11, DDX3Y, DDX43, DDX56, DDX58, DECR1, DECR2, DEF6, DEFA4, DEFB1, DENND2A, DERA, DGCR8, DHRS1, DHRS2, DHX34, DIO3, DKK2, DLEU1, DLX5, DMBT1, DMRT1, DNAI2, DNAJC15, DNASE1, DNASE1L3, DNASE2B, DNTTIP2, DOCK4, DOK5, DPM1, DPP4, DPPA4, DPT, DPYS, DSC3, DTX3, DUOX1, DUSP12, DUSP2, DUSP22, DYNLT1, EBF2, ECEL1, EDAR, EDN3, EFCAB1, EFCAB2, EFEMP2*, EFNA2, EGLN3, EGR4, EIF1AX, EIF1AY, EIF2B1, EIF2B2, EIF4A3, EIF5A, ELF4, ELK3, ELMO2, ENDOD1, ENO3, ENTPD3, ENTPD5, EPB41L4B, EPHA7, EPS8L3, EPYC, EREG, ERLIN1, ERMAP, ERMP1, ERO1LB, ETAA1, ETNK2, ETS2, EXO1, EXPH5, EXTL1, EYA2, F2RL1, FAAH, FABP1, FABP4, FABP6, FABP7, FAIM, FAM118A, FAM134B, FAM57A, FANCE, FANCF, FAP, FASTKD5, FBLN1, FBXO17*, FCER1A, FCER1G, FCER2, FCGR2A, FCN3, FEM1B, FES, FEZ1, FEZF2, FGA, FGB, FGD6, FGF3, FGFBP1, FGFR4, FGG, FGL1, FIG4, FIGF, FLNB, FMO1, FMO3, FMO5, FN1, FNBP4, FOS, FOXA2, FOXC1, FOXD2, FOXE1, FOXL2, FRS2, FRS3, FSTL1, FTSJ3, FUBP3, FUT3, FXYD2,

Table 3.8 (cont’d) Subset of genes Gene names eQTL gene FXYD6, FYB, FZD10, FZD8, G6PD, GABRA4, GABRB3, GABRD, GABRG2, GABRP, GABRR1, GALNT12, GALNT14, GALNT8, GAPDH, GAS1, GAS7, GATA3, GATA4, GATA6, GBAS, GBE1, GCAT, GCDH, GCG, GCNT2, GCNT3, GDF10, GDF15, GDF3, GEMIN4, GFPT1, GFRA1, GHITM, GINS3, GIPC2, GJB3, GLCE, GLI1, GLRA2, GLT8D2, GLUD2, GNB1L, GNG11, GNG12, GNGT1, GNLY, GNMT, GOLGA5, GOT1, GPA33, GPC1, GPM6B, GPR143, GPR15, GPR171, GPR20, GPR22, GPR35, GPR6, GPR87, GPRC5D, GPX2, GPX3, GREM1, GREM2, GRHPR, GRID2, GRIN2A, GRK6, GRM8, GRP, GSPT1, GSTA1, GSTM3, GSTM5, GSTP1, GSTZ1, GTF2F2, GTF3C2, GTF3C5, GTPBP4, GUCA2A, GUCY2C, GUF1, H1F0, H2AFJ, HABP2, HAL, HBS1L, HCRT, HDDC2, HERPUD1, HGD, HIF3A, HIP1R, HIST1H2AM, HIST1H2BI, HIST1H4B, HIVEP2, HLA-A, HLA-DQA1, HLA-DQB1, HLA-DRB6, HMBOX1, HMGA2, HMGCS2, HMGN4, HMHB1, HNRNPU, HOMER2, HOOK1, HOXB9, HOXC11, HOXC5, HOXD10, HPGD, HPR, HSD17B12, HSD17B2, HSF2, HSPA8, HSPB7, HSPE1, HTR2B, HTR2C, HTR3A, HYAL1, HYAL3, HYMAI, IAPP, ICAM4, ICK, ICOS, ID4, IFT88, IGFBP3, IGHD, IGHM, IGLV3-25, IL12RB1, IL17B, IL1RL1, IL7R, INHBE, INSL4, IQCB1, IRF4, IRF5, IRX4, ISL1, ISLR, ITGA4, ITGA9, ITGB7, ITIH1, ITIH2, ITIH4, ITPA, ITPKC, KAZALD1, KCMF1, KCNG1, KCNJ13, KCNK15, KCNK5, KCNMB3, KCNV1, KCTD12, KERA, KHDRBS2, KIAA0391, KIAA0408, KIF17, KIF1A, KIF22, KIF5A, KIR3DL3, KIT, KLF12, KLF2, KLF3, KLHDC4, KLHL2, KLHL24, KLHL3, KLK7, KLK8, KLRF1, KNG1, KPNA4, KPNA5, KPTN, KRR1, KRT13, KRT14, KRT15, KRT16, KRT17, KRT19, KRT20, KRT23, KRT32, KRT34, KRT4, KRT5, KRT6A, KRT6B, KRT75, KRT86, KTN1, L1TD1, LAD1, LAG3, LAIR2, LAMP2, LAPTM4A, LARS, LARS2, LAT, LAX1, LBR, LCK, LCP2, LDHC, LDLR, LDLRAP1, LEFTY1, LGALS2, LGALS4, LGI2, LGR5, LHX1, LHX6, LIAS, LILRB4, LIME1, LIPT1, LMO2, LMO7, LPL, LRP5L, LRPPRC, LRRC61, LRRC8E, LRRFIP1*, LRRN2, LSM3, LSP1, LSS, LTA, LTBP2, LTBP4, LY6D, LY6G6D, LY75, LY96, LYZ, MAEA, MAFG, MAGEA1, MAGEA10, MAGEA11, MAGEA12, MAGEA4, MAGEA5, MAGEA9, MAGEB1, MAGEB2, MAGEB3, MAGEC1, MAGEC2, MAN2B2, MAP3K7, MAP4K1, MAP7, 
MAPK6, MAPRE1, MARCH2, MARCO, MAST3, MATK, MAX, MC4R, MCM10, MCOLN3, MCPH1, MCTP2, MDM2, MDM4, MED17, MED4, MEOX1, MEP1A, MEPE, METTL3, MFAP2, MFAP5, MFHAS1, MFN1, MIA, MID1, MID2, MIPEP, MLLT10, MME, MMP10, MMP12, MMP13, MMP16, MMP24, MMP3, MOCOS, MPP3, MPP5, MPV17, MPZ, MRE11A, MRPL19, MRPL2, MRPL44, MRPL52, MRPS10, MRPS17, MS4A1, MS4A3, MS4A4A, MS4A6A, MSC, MSMB, MTL5, MTMR11, MTRR, MUC13, MUC16, MUC4, MXD4, MXRA7, MYB, MYBL2, MYO15A, MYO15B, MYO3A, MYO7B, MYOM2, MYOZ2, MYOZ3, MYT1L, NADSYN1, NARS2, NBEAL2, NBR2, NCAPD2, NCF1, NCKIPSD, NCL, NCOA4, NCSTN, NDEL1, NDST3, NDST4, NDUFA10, NDUFA8, NDUFS5, NEK2, NEK3, NEUROD6, NFATC4, NFE2L2, NFX1, NGRN, NHLH1, NHLH2, NINJ1, NIPSNAP1, NIPSNAP3B, NKIRAS2, NKX2- 1, NLGN4Y, NME7, NMUR1, NOC2L, NOL3*, NOS1AP, NOTCH2, NOV, NOX1, NPHS1, NPPB, NPR1, NPR3, NPY1R, NPY2R, NPY5R, NQO2, NR0B1, NR1D2, NR4A3, NRG1, NRIP2, NRTN, NTS, NTSR1, NUBP2, NUCKS1, NUDT2, NUDT9, NUFIP1, NUP188, NUP214, NUP62CL, OCA2, OGN, OLFM4, OLFML1, ONECUT2, OPRK1, OR1F1, OR7A5, OSBPL2, OSGEP, OTOR, OVGP1, OVOL2, P2RX5, P2RY10, P2RY14, P2RY2, PACRG, PACS1, PAEP, PAGE4, PAK6, PAK7, PALLD, PAPPA, PARP11, PART1, PATZ1, PAX1, PAX2, PAX8, PCBP1, PCCA, PCDHB8, PCK1, PCNX, PCSK1N, PDCD1LG2, PDCL3, PDE4B, PDE4C, PDE6A, PDGFRL, PDHX, PDIA2, PDSS2, PEX1, PEX6, PF4V1, PGC, PGGT1B, PGRMC1, PGRMC2, PHGDH, PHKG1, PHTF2, PIGN, PIGR, PIGZ, PIH1D1, PIP, PIR, PITX1, PITX2, PKNOX1, PKP1, PKP2, PLAC1, PLAC8, PLCB4, PLEKHA6, PLEKHG6, PLN, PLTP,

Table 3.8 (cont’d) Subset of genes Gene names eQTL gene PMP2, PNOC, PNRC2, POF1B, POLI, PON3, POU2AF1, POU3F4, POU4F2, PPEF1, PPIE, PPIL2, PPP1CC, PPP1R14D, PPP2CA, PPP4R1, PPRC1, PPY, PRAME, PRB1PRB3, PRDM13, PRDM2, PRDX1, PREPL, PRG4, PRKCQ, PRKRA, PRL, PRMT3, PRMT8, PRND, PRPF18, PRPF40A, PRPSAP1, PRRG4, PRSS1, PRSS12, PRSS16, PRSS21, PRSS23, PRSS3, PSG4, PSG7, PSG9, PSMA2, PSMC2, PSMD5, PSME4, PSORS1C1, PSPH, PTCH1, PTDSS2, PTGER2, PTGFR, PTH2R, PTPN22, PTPRB, PTPRCAP, PTPRZ1, PWP1, PYCR1, QRICH1, QRSL1, RAB11B, RAB11FIP1, RAB15, RAB1A, RAB27A, RAB31, RAB33B, RAB38, RAB9A, RABEP1, RABGGTA, RAD52, RALGDS, RAPGEF2, RASGRF1, RB1, RBBP5, RBCK1, RBMS3, RBPJL, REEP2, REEP5, REG1A, REG1B, REN, RET, RETSAT, RGR, RGS13, RGS14, RIBC2, RIN2, RIT2, RLN2, RNASE6, RNASEH1, RNF32, RNF43, RNPS1, ROR2, RORB, RPA1, RPE65, RPL31, RPL36AL, RPS27, RPS4Y1, RPS6KA1, RPS6KB1, RRAD, RRM1, RWDD2B, RYR1, RYR2, S100A14, S100A7, S100B, SAA4, SAG, SARS, SCARF1, SCEL, SCFD1, SCGB2A1, SCGB2A2, SCGN, SCN7A, SCN8A, SCNN1A, SCRIB, SDC1, SEC13, SELE, SELP, SENP5, SENP7, SEPT2, SEPT4, SEPT6, SERPINA1, SERPINB13, SERPINB2, SERPINB3, SERPINB6, SERPINB7, SERPINB9, SERPIND1, SERTAD2, SETD4, SFRP5, SFTPB, SFTPC, SFTPD, SGCB, SGCD, SGCG, SGPP1, SH2D1A, SH2D3A, SH3TC2, SIGLEC8, SIPA1L3, SIRT4, SIT1, SIX1, SIX2, SIX6, SLAMF1, SLAMF7, SLC12A2, SLC12A3, SLC12A5, SLC13A4, SLC16A3, SLC18A1, SLC22A18AS, SLC22A3, SLC25A16, SLC25A21, SLC25A23, SLC25A40, SLC25A46, SLC26A10, SLC26A3, SLC27A2, SLC29A1, SLC2A9, SLC30A3, SLC34A2, SLC35B1, SLC38A2, SLC38A4, SLC39A1, SLC39A2, SLC43A1, SLC44A4, SLC4A8, SLC5A1, SLC6A11, SLC6A13, SLC6A14, SLC6A16, SLC6A20, SLC6A3, SLC6A4, SLC9A1, SLCO1A2, SLIT3, SMARCD3, SMCP, SMEK1, SMO, SMPDL3B, SMPX, SMR3A, SMU1, SMUG1, SNAI1, SNAPC1, SNCAIP, SNTA1, SNTG2, SNX7, SOCS6, SOD3, SOHLH2, SON, SORCS3, SOSTDC1, SOX13, SOX14, SP140, SP3, SP4, SPAG8, SPARC, SPARCL1, SPATA1, SPEF1, SPIB, SPINK1, SPINK2, SPINK4, SPOCK3, SPRR1A, SPRR1B, SPRR3, SQRDL, SRPK1, SRPK3, SRR, SRY, SST, SSTR2, SSX1, 
ST6GALNAC2, ST8SIA3, ST8SIA5, STAG3, STAP1, STAP2, STARD7, STAT4, STEAP4, STMN1, STOM, STRN3, STX12, STX1A, STYK1, STYXL1, SULT1B1, SULT1C2, SULT1E1, SURF1, SURF2, SUV39H2, SV2C, SYMPK, SYN1, SYN3, SYNCRIP, SYNJ2, SYT13, TAC3, TAF1B, TALDO1, TAPBPL, TAX1BP3, TBC1D8B, TBCE, TBP, TBX2, TBX21, TBX3, TCEA1, TCF21, TCL1A, TDGF1, TEK, TERF2, TERT, TEX14, TFAP2B, TFAP2C, TFDP3, TFF1, TFF3, TFPI2, TH, THBS1, THBS4, THNSL2, THOC5, THOC6, THSD4, TIMM13, TIMM22, TLE6, TLK1, TLR2, TM4SF4, TM6SF1, TM7SF3, TMBIM4, TMC5, TMEM110, TMEM131, TMEM140, TMEM187, TMEM47, TMEM50A, TMEM50B, TMEM63A, TMEM80, TMPRSS2, TMPRSS3, TMPRSS5, TMSB4Y, TNF, TNFAIP3, TNFRSF10C, TNFRSF13B, TNFRSF17, TNFSF14, TNFSF18, TNFSF4, TNFSF9, TNIP2, TNMD, TNNC1, TNNI3, TNNT3, TNS4, TOMM20, TPRKB, TPTE, TRAF1, TRAF3IP2, TRAK2, TRAP1, TRAPPC4, TREX1, TRH, TRIM14, TRIM31, TRIM48, TRIM66, TRIM9, TRIP13, TRPC4, TRPM8, TSC2, TSGA10, TSHR, TSPAN31, TSPAN6, TSPAN8, TSPYL1, TSPYL4, TTC12, TTC27, TTR, TUBA1A, TXNL4B, TYRP1, U2AF1, UBASH3A, UBE2S, UBE3B, UBE4B, UBFD1, UGT2B15, UGT2B17, UIMC1, UNC13A, UNC5C, UNKL, UPK1B, UPK3B, USP22, USP36, USP6, USP9Y, UTY, UXS1, VAMP1, VEGFC, VGLL1, VGLL3, VIL1, VIM, VIPR1, VNN1, VPS4A, VTCN1, WARS, WARS2, WBSCR22, WDR12, WDR18, WDR26, WDR4, WDR41, WDR78, WISP3, WNT10B, WNT11, WNT4, WNT5B, WNT7B, WRB, WT1, XAB2, XPNPEP2, YBX2, YIPF6, YPEL1, YRDC, YWHAE, ZBED2, ZBTB25, ZBTB32, ZBTB39, ZBTB5, ZC3H15, ZCWPW1,

Table 3.8 (cont’d) Subset of genes Gene names eQTL gene ZDHHC14, ZFP36, ZFP36L1, ZFP37, ZFX, ZFY, ZIC3*, ZIM2, ZMYM5, ZMYND11, ZNF124, ZNF131, ZNF141, ZNF165, ZNF185, ZNF232, ZNF257, ZNF266, ZNF267, ZNF287, ZNF3, ZNF337, ZNF37A, ZNF444, ZNF493, ZNF556, ZNF682, ZRSR2, ZWINT, ZXDA Cis-acting gene AASDHPPT, ADAMDEC1, ADH7, ADI1, AHSA1, AKAP10, AKAP3, ALDH8A1, ALMS1, AMACR, AMH, ANKS1B, APOOL, APPL1, ARHGAP15, ARHGEF3, ARSB, ARTN, ATHL1, ATP7A, ATXN7L1, AZIN1, BCL10, BEGAIN, BTN3A2, C14orf105, C18orf8, CAMLG, CAND2, CARS2, CAT, CCBL2, CCL4, CD151*, CD84, CDC42, CDKAL1, CEBPZ, CENPN, CENPT, CEP192, CFD, CHCHD2, CHD8, CLEC2D, CRIPT, CRYBB2, CRYL1, CRYZ, CSRP1, CTBP1, CXorf56, D4S234E, DCT, DDX11, DDX56, DDX58, DECR2, DENND2A, DERA, DGCR8, DHRS1, DLEU1, DNAJC15, EGLN3, EIF5A, ENDOD1, ENTPD5, ERLIN1, ERMAP, ETAA1, ETNK2, ETS2, EXPH5, EXTL1, F2RL1, FAM57A, FIG4, GAS7, GCAT, GCDH, GDF15, GDF3, GEMIN4, GFPT1, GLT8D2, GNB1L, GNLY, GNMT, GPC1, GPX3, GRHPR, GSTM3, GSTM5, GSTZ1, GTF2F2, GUF1, HBS1L, HDDC2, HIST1H4B*, HIVEP2, HLA-A, HLA-DQA1, HLA-DQB1, HLA-DRB6, HMBOX1, HNRNPU, HOXB9, HSD17B12, HSF2, HSPB7, HTR2B, HTR3A, HYAL3, ICK, IFT88, IL12RB1, IL1RL1, INSL4, IQCB1, IRF5, ITIH1, ITIH4, ITPA, KIAA0391, KLF12, KPNA4, LARS, LARS2, LBR, LDHC, LIAS, LIPT1, LRPPRC, LSS, LTBP2, LY75, MAN2B2*, MARCH2, MATK, MAX, MED4, MEOX1, METTL3, MFHAS1, MIPEP, MPP5, MRE11A, MRPL19, MRPL2, MRPL44, MRPL52, MRPS10, MTRR*, MYOM2, MYOZ2, NADSYN1, NARS2, NCAPD2, NCKIPSD, NCL, NDUFS5, NEK3, NEUROD6, NFX1, NMUR1, NQO2, NR1D2, NRG1, NUBP2, NUDT2, NUFIP1, OCA2, OR7A5, OSGEP, PATZ1, PAX8, PCCA, PCDHB8, PCK1, PCNX, PCSK1N, PDE4B, PDHX, PEX6, PGGT1B, PHTF2, PIH1D1, PIR, PPIE, PPP4R1, PRB1, PRDM2, PRG4, PRKCQ, PRMT3, PSMA2, PSORS1C1, PSPH, QRICH1, QRSL1, RAB15, RAB33B, RABEP1, RABGGTA, RAD52, RAPGEF2, RB1, RBCK1, REEP2, RETSAT, RGS14, RIN2, RIT2, RNASE6, RNPS1, RPL31, RPL36AL, RRM1, RWDD2B, SAG, SCFD1, SDC1, SEPT2, SERPINB9, SETD4, SLC25A23, SLC25A40, SLC2A9, SMU1, SOCS6, SOHLH2, SOSTDC1, SP4, SPATA1, SRPK1, 
SRR, STAP2, STRN3, STYXL1, TAF1B, TALDO1, TAPBPL*, TAX1BP3, TBX21, TCL1A, THBS4, THNSL2*, THOC5, THOC6, TIMM22*, TLK1, TM7SF3, TMEM110, TMEM140, TMEM187, TMEM50A, TMEM50B, TMEM80, TMPRSS5, TOMM20, TPRKB, TRAF3IP2, TRAP1, TRAPPC4, TREX1, TRIM66, TRIM9, TSPYL4, TTC12, TTC27, TYRP1, UBE4B, UBFD1, UIMC1, UNC5C, UNKL, USP22, USP6, VIL1, VNN1, WARS, WARS2, WDR18, WDR41, WRB, XAB2, ZBTB25, ZBTB5, ZDHHC14, ZFP36L1, ZFX, ZMYM5, ZNF131, ZNF232, ZNF266, ZNF287, ZNF3, ZNF444, ZNF493, ZNF682 Hotspot gene ABCE1, ASH2L, ATIC, BFAR, BMP8B, C1orf27, C2orf47, CLDN16, DNTTIP2*, DPM1, DUSP12, EIF2B1, FUBP3, GSPT1, GTF3C2, HSPE1, MAP3K7, MAPRE1, MED17, NDUFA10, NFE2L2, NUDT9, PAGE4, PCBP1, PDCL3*, PPP1CC, PPP2CA, PRKRA, PRPF40A, PRPSAP1, PWP1, QRICH1, RAB1A, RNASEH1, RPA1, RPS6KB1, SLC35B1, SOX14, SYNCRIP, TBCE, TCEA1, TMEM131, UXS1, WDR12, ZC3H15 GENE*: Genes significant associated (pFDR<0.05) with survival outcome.

REFERENCES

ABTA - American Brain Tumor Association (2014). Glioblastoma and Malignant Astrocytoma. Available at: http://www.abta.org/secure/glioblastoma-brochure.pdf.

Astle, W., & Balding, D. J. (2009). Population Structure and Cryptic Relatedness in Genetic Association Studies. Statistical Science, 24(4), 451–471.

Benjamini, Y., & Hochberg, Y. (1995). Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. Journal of the Royal Statistical Society. Series B (Methodological), 57(1), 289–300.

Breitling, R., Li, Y., Tesson, B. M., Fu, J., Wu, C., Wiltshire, T., Gerrits, A., Bystrykh, L. V., de Haan, G., Su, A. I., & Jansen, R. C. (2008). Genetical Genomics: Spotlight on QTL Hotspots. PLoS Genetics, 4(10).

Button, K. S., Ioannidis, J. P. A., Mokrysz, C., Nosek, B. A., Flint, J., Robinson, E. S. J., & Munafò, M. R. (2013). Power failure: why small sample size undermines the reliability of neuroscience. Nature Reviews Neuroscience, 14(5), 365–376.

Chen, Q.-R., Hu, Y., Yan, C., Buetow, K., & Meerzaman, D. (2014). Systematic Genetic Analysis Identifies Cis-eQTL Target Genes Associated with Glioblastoma Patient Survival. PLOS ONE, 9(8), e105393.

Cox, D. (1972). Regression models and life-tables. Journal of the Royal Statistical Society, 34(2), 187.

Dunn, O. J. (1961). Multiple Comparisons among Means. Journal of the American Statistical Association, 56(293), 52–64.

Moon, K. S., & Lesniak, M. S. (2014). Glioblastoma: Risk Factors, Diagnosis and Treatment Options. International Journal of Cancer Research and Prevention, 7(3/4), 183–212.

Rockman, M. V., & Kruglyak, L. (2006). Genetics of global gene expression. Nature Reviews Genetics, 7(11), 862.

Soneson, C., Gerster, S., & Delorenzi, M. (2014). Batch Effect Confounding Leads to Strong Bias in Performance Estimates Obtained by Cross-Validation. PLoS ONE, 9(6).

Storey, J. D. (2002). A direct approach to false discovery rates. Journal of the Royal Statistical Society: Series B, 64(3), 479–498.

Vilhjálmsson, B. J., & Nordborg, M. (2013). The nature of confounding in genome-wide association studies. Nature Reviews Genetics, 14(1), 1–2.

Zhao, Q., Shi, X., Xie, Y., Huang, J., Shia, B., & Ma, S. (2014). Combining multidimensional genomic measurements for predicting cancer prognosis: observations from TCGA. Briefings in Bioinformatics, bbu003.

Chapter 4 SUMMARY AND CONCLUSIONS

The objective of this study was to achieve a better understanding of the genetic and non-genetic factors affecting survival of Glioblastoma Multiforme (GBM) patients. Survival analyses were applied on clinical and demographic data to determine the risk factors affecting survival. Subsequently, genes were preselected by eQTL analysis and their effects on GBM patient survival were evaluated.

In Chapter 2, survival analyses were performed in two independent datasets, TCGA and SEER, to identify non-genetic factors affecting survival. In these independent datasets, we confirmed that age at initial diagnosis was significantly associated with survival. Older patients were observed to have higher risk than younger patients, in agreement with the results of previous studies. Also, the interaction between age and gender was found to be significantly associated with survival in TCGA; young female patients tended to live longer than young male patients, in contrast to patients older than 60. Other variables reported to have significantly different hazard ratios across levels of the covariate were Karnofsky performance score (KPS), targeted molecular therapy, area of residence in the US, radiotherapy, surgery, number of primary tumors, and tumor location. Patients with a KPS of 80 or lower were found to have lower survival, while Asians and patients from the North-East of the US were found to have higher survival. Also, in contrast to previous studies, we observed that race and gender did not have a significant interaction. Furthermore, targeted molecular therapy was found to be a protective factor for patients suffering from GBM. Other protective factors included surgery, radiation with both beam implants and isotopes, radiation (performed before as well as after surgery), and fewer primary tumors. Finally, tumors located in the ventricles of the brain were found to be the most fatal.

In Chapter 3, expression quantitative trait loci (eQTL) analyses and a gene expression wide association study (GE-WAS) were conducted to identify genes associated with survival. The statistical model for the eQTL analysis incorporated the important covariates selected from the differential expression (DE) analysis. We pre-adjusted the model for the gene expression batch effect, age at initial diagnosis, and the first two principal components derived from tumor SNP (to control for the effects of population structure). A total of 5,429 eQTL were detected, including 1,190 cis-eQTL and 4,239 trans-eQTL. A higher rate of significant cis-acting eQTL (400 out of 1 million) than of trans-acting eQTL (1 out of 1 million) was observed in the study. Using a permutation analysis, we chose a threshold of at least 18 significant gene associations for declaring the corresponding SNP a hotspot. Two hotspot eQTL were found: rs3172494 and rs7698461, with 26 and 20 significant gene associations, respectively. Patients carrying allele ‘C’ at hotspot rs3172494 on chromosome 3 and ‘G’ at hotspot rs7698461 on chromosome 4 had lower survival rates, whereas patients carrying allele ‘A’ at both hotspots had higher survival. Nevertheless, neither of the two hotspot SNP was significantly associated with survival, in concordance with previous GBM/cancer survival studies. Subsets of target genes were obtained from the results of the eQTL analysis: 1,431 target genes were selected for the eQTL gene set, 301 for the cis-acting gene set, and 45 for the hotspot gene set. From GE-WAS, we found 14 genes statistically associated with survival among the three target gene subsets; however, none of these associations was significant after controlling for demographic factors.

We reason that inadequate sample size lowered our statistical power to detect significant associations, especially since allele ‘A’ at both hotspots tended to be associated with longer patient survival.

In this study, we did not account for potential linkage disequilibrium (LD) between SNP, even though some of the SNP associated with the same gene are likely to be in the same LD block. We also could not validate the eQTL and GE-WAS results (obtained from TCGA) in the independent SEER dataset, since SEER comprises only clinical and demographic variables.

Our research mainly focused on studying the association between SNP variants and GE levels; however, cancer is characterized by a very large number of copy number variations (CNV). In future studies, incorporating CNV information could help researchers obtain more insight into how structural variations in DNA regulate GE and potentially affect survival in GBM patients.
