Transcriptomic Analysis of Prostate Cancer Progression

Bridget Lindsay Bickers

The thesis is submitted in partial fulfilment of the requirements for the award of the degree of Doctor of Philosophy of the University of Portsmouth June 2018

i

Abstract

Background

Prostate cancer has a highly variable clinical course, which cannot be fully explained by clinicopathological variables. New biomarkers are needed to better inform treatment decisions at diagnosis to prevent the overtreatment of indolent disease, whilst allowing more aggressive disease to be appropriately treated with interventional and adjunctive treatments.

Methods

This discovery study used TaqMan arrays to measure the expression of a panel of 91 putative markers of prostate cancer progression in archival biopsy samples from 67 prostate cancer patients treated with radical prostatectomy and/or radiotherapy. The expression data were correlated with clinical outcome and binary logistic regression analysis was used to identify predictive models of 5-year biochemical/clinical recurrence.

Results

Multivariate logistic regression models were focused to a final set of 28 two-gene and 9 three-gene models of recurrence which showed the highest overall performance in terms of calibration (Likelihood Ratio Test statistic (χ2)) and discriminatory accuracy (Area Under the Curve), whilst retaining parsimony (smallest number of predictor variables).

The best performing models according to χ2 were AMACR.EFNA5.SDHA (χ2(3) = 22.483, p<0.0001) and AMACR.HPRT1.SDHA (χ2(3) = 22.265, p<0.0001). In addition to demonstrating a highly significant improvement in fit to the data in comparison to the null model, these models had high discriminatory accuracy (Area Under the Curve of 0.812 and 0.829 respectively). By selecting an optimal threshold for categorisation of the patients into the recurrent or non-recurrent group, AMACR.EFNA5.SDHA could correctly categorise 76.1% of patients with a sensitivity and specificity of 81% and 73.9%

ii respectively. AMACR.HPRT1.SDHA could correctly categorise 85.1% of patients with a sensitivity and specificity of 81% and 87% respectively.

Conclusion

The study identified a selection of two- and three-gene models with potential to predict 5-year recurrence in this set of prostate cancer patients. These models may provide novel prognostic information beyond standard prognostic variables at the time of treatment selection but would need to be externally validated in another set of prostate cancer patients to confirm their generalizability as prostate cancer biomarkers. The findings contributed to knowledge regarding disease progression in prostate cancer. Several of interest were identified, some of which have previously been described in relation to prostate cancer progression and others that showed a novel association. A large proportion of the identified genes were key genes in metabolic pathways, underlining the importance of metabolic reprogramming in prostate cancer progression.

iii

Table of Contents

DECLARATION XI

ABBREVIATIONS XII

ACKNOWLEDGEMENTS XVII

INTRODUCTION 1

1.1 Prostate cancer epidemiology 1

1.2 The prostate 4

1.3 Prostate pathology 8 Benign prostatic hyperplasia 8 High-grade prostatic intraepithelial neoplasia 8 Prostate cancer 9

1.4 Prostate cancer treatments 10

1.5 Prostate cancer diagnosis and the PSA test 12 Transrectal ultrasound (TRUS)-guided prostate biopsy 13

1.6 Prostate cancer prognosis 14 Staging 14 The Gleason grading system 15 Pretreatment serum PSA 16 Risk stratification 16 Limitations of clinical prognostic tools 17 Biomarker research and molecular technologies 19 The hallmark traits of cancer 21 Prostate cancer prognostic biomarkers 36 Gene expression panels 37 This study 39

TAQMAN ARRAY STUDY 40

2.1 Introduction 40 Background 40 Previous pilot study 41

iv

The current study - new research 42 Aims 43

2.2 Materials and methods 44 Introduction/practical considerations 44 Sample size and power calculations 52 Ethical permissions 53 Practical methods 53 Data analysis methods 59

2.3 Results 72 Introduction 72 Samples 72 Method development of practical methods 73 Method development of data analysis 76 Optimisation of the assay 99 Independent samples t-tests 101 Univariate logistic regression analysis 104 Multivariate logistic regression analysis 116 Comparison of single, two-gene and three-gene models 134 Distribution of genes in the top gene models 142 Comparison to the pilot study 148

2.4 Discussion 159 Gleason score analysis 159 Comparison of recurrence analysis between the BB and SL study 160 BB study - recurrence analysis 162 Calibration and discrimination - evaluating the performance of the predictive models 164 Gene prevalence plots 165 Genes included in the highest multivariate logistic regression models according to χ2 167 Discussion regarding genes of interest 179 General discussion and future studies 181

2.5 Conclusions 185

APPENDIX 1 EXAMPLES OF POTENTIAL BIOMARKERS OF PROSTATE CANCER PROGRESSION 188

v

APPENDIX 2 GENES ON THE BB AND SL ARRAY 194

APPENDIX 3 BACKGROUND INFORMATION 204

APPENDIX 4 SELECTION OF REFERENCE GENE 209

APPENDIX 5 BINARY LOGISTIC REGRESSION/RECEIVER OPERATING CHARACTERISTIC (ROC) CURVES 213

APPENDIX 6 ARRAYS FOR METHOD DEVELOPMENT AND TESTING THE CONTROL CDNA 221

APPENDIX 7 PATIENT DEMOGRAPHICS 224

APPENDIX 8 ELECTRONIC APPENDIX 228

APPENDIX 9 FORM UPR16 229

APPENDIX 10 DISSEMINATION LIST 230

BIBLIOGRAPHY 231

vi

Table of Figures

Figure 1.1 The 20 most common cancers, UK, 2015, Cancer Research UK (2) ...... 1

Figure 1.2 Average number of new prostate cancer cases per year and age-specific incidence rates, males, UK, 2013-2015 (7) ...... 2

Figure 1.3 Estimated prostate cancer incidence worldwide (estimated age-standardized rates per 100,000)(12)...... 3

Figure 1.4 The location of the prostate gland (22) ...... 4

Figure 1.5 The zones of the prostate (24, 25) ...... 5

Figure 1.6 High power microscopic image of normal prostate gland acini (26) ...... 6

Figure 1.7 Schematic representation of the normal prostate epithelium (27) ...... 6

Figure 1.8 The TNM staging system (85)...... 14

Figure 1.9 Gleason grading system according to the International Society of Urological Pathology (87) ...... 15

Figure 1.10 The hallmark traits of cancer (adapted from (116)) ...... 21

Figure 1.11 The phases of the cell cycle showing two of the major checkpoints (G1/S, G2/M) and the molecules which regulate them ...... 24

Figure 1.12 Steps involved in angiogenesis (168) ...... 27

Figure 1.13 Metabolism in the healthy prostate and at different stages of prostate cancer: a) Normal prostate b) Early prostate cancer c) Late prostate cancer (adapted from (188)) ...... 29

Figure 1.14 Schematic representation of the multiple stages of metastasis (196) ...... 31

Figure 2.1 Genes included on the SL array and their putative roles in cancer initiation/progression ...... 48

Figure 2.2 Genes included on the BB array and their putative roles in cancer initiation/progression ...... 49

Figure 2.3 Genes on the BB array, categorised according to their roles in cancer initiation/progression 50

Figure 2.4 Amplification plots for the housekeeping gene HMBS (linear view): a) before baseline setting (Rn vs cycle number) b) after correctly setting baseline at 3-19 cycles (ΔRn vs cycle number) ...... 61

Figure 2.5 Amplification plots for HMBS (log view) showing threshold value at 0.426 ...... 61

Figure 2.6 Amplification plots for HMBS (log view) showing high precision of replicates ...... 62

Figure 2.7 Standard curve plot for HMBS ...... 62

Figure 2.8 ROC curve plotted in SPSS...... 70

Figure 2.9 Steps to gain new ethical approval to access patient data ...... 80 vii

Figure 2.10 Inappropriate setting of the baseline ...... 82

Figure 2.11 The threshold is set too high ...... 83

Figure 2.12 VBA script in Excel spreadsheet containing list of two-gene combinations ...... 85

Figure 2.13 Example snippet of SPSS input file for logistic regression analysis ...... 85

Figure 2.14 VBA script ...... 86

Figure 2.15 Final syntax file ...... 86

Figure 2.16 Snippet of final spreadsheet containing information from the logistic regression outputs ... 87

Figure 2.17 Percentage prevalence of each gene in the highest 9.3% of the two-gene models according to overall percent accuracy ...... 91

Figure 2.18 Percentage prevalence of each gene in the highest 12.7% of the three-gene models according to overall percent accuracy ...... 92

Figure 2.19 Percentage prevalence of each gene in the highest 9.3% of the two-gene models according to χ2 and overall percent accuracy ...... 94

Figure 2.20 Percentage prevalence of each gene in the highest 12.7% of the three-gene models according to χ2 and overall percent accuracy ...... 95

Figure 2.21 Percentage prevalence of each gene in the highest 9.3% of the two-gene models in order of Wald significance from univariate logistic regression ...... 97

Figure 2.22 Standard curves for housekeeping genes ...... 99

Figure 2.23 ROC curves for the best single gene models ...... 113

Figure 2.24 ROC curves for the best two-gene models ...... 125

Figure 2.25 ROC curves for the best three-gene models ...... 133

Figure 2.26 Flow diagram to show predictive modelling process ...... 141

Figure 2.27 Distribution of the genes in the highest two-gene models according to χ2 ...... 144

Figure 2.28 Distribution of the genes in the highest two-gene models according to overall percent accuracy ...... 145

Figure 2.29 Distribution of the genes in the highest three-gene models according to χ2 ...... 146

Figure 2.30 Distribution of the genes in the highest three-gene models according to overall percent accuracy ...... 147

viii

Table of Tables Table 1.1 D’Amico risk classifications ...... 17

Table 1.2 Examples of potential biomarkers of prostate cancer progression ...... 37

Table 2.1 Comparison of samples included in the SL and BB study ...... 42

Table 2.2 The events per variable for the single, two-gene and three-gene models ...... 52

Table 2.3 Reagent volumes required for reverse transcription positive reaction ...... 57

Table 2.4 Reagent volumes required for reverse transcription negative reaction ...... 57

Table 2.5 RNA/cDNA concentration and purity of samples 1-11 as measured by UV-spectroscopy ...... 74

Table 2.6 The number of analysable genes (Ct<35) in the NPA and PA samples ...... 76

Table 2.7 Difference in Ct value between the NPA and PA samples for the housekeeping genes ...... 76

Table 2.8 Number of gene combinations for logistic regression analysis ...... 84

Table 2.9 List of tables exported from logistic regression output ...... 87

Table 2.10 The percentage of genes with p-values <0.05 and >0.05 in the top gene models ranked by χ2 and overall percent accuracy ...... 96

Table 2.11 Amplification efficiencies and R2 values for the housekeeping genes ...... 100

Table 2.12 Intra- and inter-assay variation for the housekeeping genes ...... 100

Table 2.13 Genes with p<0.5 in the independent samples t-tests for the comparison of the recurrent and non-recurrent group (Pfaffl dataset) ...... 102

Table 2.14 Genes with p<0.5 in the independent samples t-tests for the comparison of the recurrent and non-recurrent group (2-ΔCt dataset) ...... 103

Table 2.15 Output parameters from the logistic regression analysis of the single gene models (listed in descending order of χ2) ...... 105

Table 2.16 Variables in the equation for the single gene models ...... 108

Table 2.17 Single gene models ranked in descending order of the overall percent accuracy ...... 110

Table 2.18 The χ2 values showing the effect of adding Gleason score to each single gene model ...... 112

Table 2.19 AUC values for the single gene models ...... 114

Table 2.20 Optimal thresholds for the top single gene models ...... 115

Table 2.21 Output parameters from the logistic regression analysis of the top two-gene models (listed in descending order of χ2) ...... 117

Table 2.22 Variables in the equation for the two-gene models ...... 120

Table 2.23 Two-gene models ranked in descending order of the overall percent accuracy ...... 122 ix

Table 2.24 The χ2 values showing the effect of adding Gleason score to each two-gene model ...... 124

Table 2.25 AUC values for the two-gene models ...... 126

Table 2.26 Optimal thresholds for the top two-gene models ...... 127

Table 2.27 Output parameters from the logistic regression analysis of the top three-gene models (listed in descending order of χ2) ...... 128

Table 2.28 Variables in the equation for the three-gene models ...... 131

Table 2.29 Three-gene models ranked in descending order of the overall percent accuracy ...... 132

Table 2.30 AUC values for the three-gene models ...... 133

Table 2.31 Optimal thresholds for the top three-gene models ...... 134

Table 2.32 χ2 values between the two-gene models and their nested single gene models ...... 135

Table 2.33 χ2 values between the three-gene models and their nested two-gene models ...... 137

Table 2.34 Two-gene models with significant predictor variables ...... 139

Table 2.35 Three-gene models with significant predictor variables ...... 140

Table 2.36 Differentially expressed genes (p<0.05) between GS ≥7 and <7 in the SL and BB study ...... 148

Table 2.37 Genes with p-values <0.5 in both studies for the independent samples t-tests (GS ≥7 vs <7) ...... 150

Table 2.38 Independent samples t-tests comparing gene expression between the recurrent and non- recurrent group in the SL and BB study ...... 151

Table 2.39 Genes with p-values <0.5 in both studies in the independent samples t-tests (recurrent vs non-recurrent) ...... 153

Table 2.40 Single gene models for the prediction of recurrence in the SL study (Pfaffl dataset) ...... 156

Table 2.41 Single gene models for the prediction of recurrence in the SL study (2-ΔCt dataset) ...... 156

Table 2.42 Multivariate logistic regression analysis for the prediction of recurrence in the SL study (Pfaffl dataset) ...... 158

Table 2.43 Multivariate logistic regression analysis for the prediction of recurrence in the SL study (2- ΔCt dataset) ...... 158

Table 2.44 Genes found in the best performing two- and three-gene models ...... 166

x

Declaration

Whilst registered as a candidate for the above degree, I have not been registered for any other research award. The results and conclusions embodied in this thesis are the work of the named candidate and have not been submitted for any other academic award.

(Revised) Word Count - 40,010

xi

Abbreviations

ΔG2/χ2 Likelihood Ratio Test statistic %CV percentage coefficient of variation %E percentage amplification efficiency -2LL -2*ln likelihood A260/280 ratio of the absorbance at 260nm and 280nm ADT androgen deprivation therapy ALCAM CD166 antigen AMACR alpha-methyl-CoA racemase ANN artificial neural network AR androgen receptor ARE androgen response element ATP adenosine triphosphate AUC area under the curve BAK1 Bcl-2 homologous antagonist/killer BAX apoptosis regulator BAX BCL2 apoptosis regulator Bcl-2 BCL2L2 Bcl-2-like 2 BCR biochemical recurrence BIRC5 baculoviral IAP repeat-containing protein 5 BPH benign prostatic hyperplasia c cut-off point c* optimal cut-off point CAF cancer associated fibroblast CAM cell adhesion molecule CAPRA Cancer of the Prostate Risk Assessment CCP cell cycle progression CD276 CD276 antigen CDH1 cadherin-1 CDH2 cadherin-2 CDKN1B cyclin-dependent kinase inhibitor 1B

xii cDNA complementary DNA cGMP cyclic guanosine monophosphate Ct threshold cycle CT computerized tomography CTC circulating tumour cell df degrees of freedom DHT dihydrotestosterone DNA deoxyribonucleic acid dNTP deoxyribonucleotide triphosphate DRE digital rectal examination E amplification efficiency EBRT external beam radiotherapy ECM extracellular matrix EGFR epidermal growth factor receptor EMT epithelial-mesenchymal transition EPV events per variable ERBB2 receptor tyrosine-protein kinase erbB-2 ERG ETS transcription factor ERG FASN fatty acid synthase FFPE formalin-fixed, paraffin-embedded FGF2 fibroblast growth factor 2 FISH Fluorescence In-Situ Hybridisation fPSA free PSA FRET fluorescence resonance energy transfer GPS Genomic Prostate Score GS Gleason score GSTP1 glutathione S-transferase pi 1 HGPIN high-grade prostatic intraepithelial neoplasia HRA Health Research Authority HYAL1 hyaluronidase-1 IGF1/2 insulin-like growth factor I/II IHC immunohistochemistry

xiii

J Youden Index KLK2 kallikrein-2 KLK3 kallikrein related peptidase 3 LHRH luteinizing hormone-releasing hormone ln natural logarithm log logarithm logit ln odds LOH loss of heterozygosity LR logistic regression LRT Likelihood Ratio Test LUTS lower urinary tract symptoms MAPK mitogen-activated protein kinase mCRPC metastatic castration-resistant prostate cancer MGB minor groove binder miRNA microRNA MLE maximum likelihood estimation MMP matrix metalloproteinase MMP2 72 kDa type IV collagenase MMP9 matrix metalloproteinase-9 MREC Medical Research and Ethics Committee MRI magnetic resonance imaging mRNA messenger RNA MYC Myc proto-oncogene protein MYC MYC proto-oncogene, bHLH transcription factor NADPH nicotinamide adenine dinucleotide phosphate NE neuroendocrine NF-κB nuclear factor kappa-light-chain-enhancer of activated B cells NKX3-1 NK3 homeobox 1 NPA non-preamplified PA preamplified PAP prostatic acid phosphatase PCSM prostate cancer-specific mortality

xiv

PHT Portsmouth Hospitals Trust PI3K phosphoinositide 3-kinase PIA proliferative inflammatory atrophy PLAU urokinase-type plasminogen activator PLAUR urokinase plasminogen activator surface receptor PSA prostate-specific antigen PTEN phosphatase and tensin homolog PTGS2 prostaglandin G/H synthase 2 QAH Queen Alexandra Hospital qPCR quantitative polymerase chain reaction R2 coefficient of determination RB1 retinoblastoma-associated protein RNA ribonucleic acid ROC Receiver Operating Characteristic RP radical prostatectomy rRNA ribosomal RNA RT radiotherapy SDH succinate dehydrogenase SERPINE1 plasminogen activator inhibitor-1 SFEC Science Faculty Ethics Committee STAT signal transducer and activator of transcription TCA tricarboxylic acid TE Tris-EDTA TGFB1 transforming growth factor beta-1 THBS1 thrombospondin-1 TMPRSS2 transmembrane serine protease 2 TNM tumour-node-metastasis TORC Translational Oncology Research Centre TP53 tumour protein p53 tPSA total PSA TRAF6 TNF receptor-associated factor 6 tRNA transfer RNA

xv

TRUS transrectal ultrasound scan TURP transurethral resection of the prostate UoP University of Portsmouth VBA Visual Basic for Applications VEGFA vascular endothelial growth factor A

xvi

Acknowledgements

I would like to thank the Institute of Biological and Biomedical Sciences (IBBS) at the University of Portsmouth for funding this project. The project also received vital donations from the charitable funds at the Queen Alexandra Hospital, the Prostate Cancer Support Organisation (PCaSO) and from CanTech Ltd, for which I am most grateful.

At the University of Portsmouth, many thanks to my supervisors Dr Alan Cooper, Dr Jeremy Mills and Professor Darek Gorecki for their help. Thank you Jeremy for helping me prepare for the viva presentation.

Unfortunately, Alan has now passed away but I was very lucky to have his encouragement, which allowed me to continue with my studies. Although he was not originally connected to my project, he took over supervision in 2016 and went out of his way to advise and support me. Thank you Alan, you were one in a million.

I would also like to thank Bernard Higgins, who offered essential help with the statistical analysis for my project and helped me understand logistic regression analysis.

Thank you very much to Dr Sam Larkin for her help with the data analysis and being a great source of support.

There are many individuals at the Queen Alexandra Hospital, who played important roles at different stages of my project. I would like to thank everyone in the Cancer Laboratory for their help and friendship over the years. Many thanks to Professor Ian Cree for his supervision during the time of the experimental work and also for his advice after this time. I am especially grateful to Louise Bolton who was always on hand to help me during the experimental work.

Thank you Dr Sharon Glaysher for organising the collection of the recent patient data and everyone else who was involved in this process including the Clinical Trials Assistants and Scott Elliott.

xvii

I am very grateful to Annie Rogers, a volunteer in the Urology Department who spent hours searching the patient databases for me so that I would have as much follow-up data as possible.

I must not forget Dr Tara Walker, the pathologist who kindly agreed to mark the histopathology slides for my project, many years ago when my project had just begun.

Thank you to all my great friends including my animal friends for their support and loyalty over the years.

I would not have been able to complete this project without all the love and support from my Mum, Dad and brother. You have been amazing and I am sorry that it has been such a long and difficult journey. Thanks Andy for the many hours that you spent discussing my data analysis and your wonderful help with formatting. You also wrote the VBA script that allowed me to analyse my data in an automated manner which saved me hours of work. I couldn’t have managed without your help.

xviii

I dedicate this thesis to my lovely parents

xix

Introduction

1.1 Prostate cancer epidemiology

In the UK, prostate cancer is the most common cancer and the second leading cause of cancer-related death in men. The most recently published statistics from the year 2015 show that 47,151 men were diagnosed with prostate cancer within that year alone and 11,631 men died from the disease (1)(Figure 1.1).

Figure 1.1 The 20 most common cancers, UK, 2015, Cancer Research UK (2)

Prostate cancer development is influenced by both genetic and lifestyle/environmental factors, with age, family history and ethnicity being the most significant risk factors (3). The greatest risk factor is undoubtedly age, with over 75% of cases diagnosed in men aged 60 and above (4). The number of prostate cancers diagnosed within each age range and the age-specific incidence rates are shown in Figure 1.2.

The prostate cancer incidence rates have increased considerably over the last few decades in many countries, including the UK, with incidental detection of asymptomatic

1 disease through serum prostate-specific antigen (PSA) testing and transurethral resection of the prostate (TURP) making a significant contribution to this rise (5). Survival for men with prostate cancer has also increased during this time, with the latest statistics showing 5- and 10-year survival rates of 85% and 84% respectively, compared with 37% and 25% in the early 1970s (6).

Figure 1.2 Average number of new prostate cancer cases per year and age-specific incidence rates, males, UK, 2013-2015 (7)

However, although improvements in treatment have contributed to the better survival rates, the widespread use of PSA testing has led to the detection of greater numbers of early stage prostate cancers, which would artifically raise survival rates (8). This is reflected by a prostate cancer mortality rate that has remained relatively stable over the last few decades (9).

Excluding age, a family history of prostate cancer appears to be the strongest risk factor for the disease. The relative risk of prostate cancer is increased two to three fold in individuals who have a first degree relative with prostate cancer and increased further if multiple relatives are affected or if the affected relative was diagnosed before the age of 65 (10).

Prostate cancer incidence rates show a large degree of variation worldwide with the highest rates in northern and western Europe, Australia/New Zealand and North America, and the lowest rates in Asian countries (11). This is largely due to the high

2 prevalence of PSA testing and biopsies in developed countries, although genetic and environmental risk factors such as diet may also play a role (Figure 1.3).

Figure 1.3 Estimated prostate cancer incidence worldwide (estimated age-standardized rates per 100,000)(12)

Prostate cancer mortality rates are higher in less developed countries. However, the variation in worldwide mortality rates is lower than the variation in incidence rates, as PSA testing has a larger influence on incidence than mortality (12).

There are undoubtedly inherent genetic differences involved in disease risk, as evidenced by the variation in prostate cancer incidence between different ethnic groups within the same country (13). Black men of both African and Caribbean descent have a risk of prostate cancer that is approximately three fold greater than their white counterparts of a similar age (14). Importantly, black men tend to show faster disease progression, and have significantly higher prostate cancer-specific mortality rates in comparison to white men (15, 16).

Although no definite modifiable risk factors have been identified for prostate cancer yet, there is some evidence that dietary factors may influence both the development of prostate cancer and its progression (17, 18). For example, a high intake of dairy products

3 and well-cooked red meat may be a risk factor (17, 19) and lycopene (an antioxidant found in tomatoes) may act as a preventative factor (20).

1.2 The prostate

The prostate is an accessory gland of the male reproductive system situated below the bladder and in front of the rectum and completely encircling the urethra at its origin (Figure 1.4). Prostatic fluid secreted by the prostate’s glandular tissue plays an important role in sperm function, including its motility (21). At ejaculation, strong muscular contractions of the prostate propel the prostatic fluid in to the urethra, where it mixes with sperm and seminal fluid to form semen. The prostate also has roles relating to the control of urine flow and to the metabolism of the hormone testosterone.

Figure 1.4 The location of the prostate gland (22)

The prostate is well innervated and surrounded by a fibrocollagenous capsule. It is comprised of four main zones: the peripheral, central and transition zone containing approximately 70%, 25% and 5% of the glandular tissue respectively, and the anterior fibromuscular zone which contains no glandular tissue (23)(Figure 1.5).

The glandular zones contain tubuloacinar glands, which consist of numerous secretory units called acini, connected to a complex network of branching ducts that open onto

4 the urethra. These glands are supported by well vascularised, loose connective tissue and embedded in a supporting fibromuscular stroma (Figure 1.5).

Figure 1.5 The zones of the prostate (24, 25)

The acini of the prostate glands are lined with simple or pseudostratified epithelium, which is thrown into folds projecting into the large irregular lumen. This epithelium consists of a luminal layer of secretory cells, below which is a basal layer of non-secretory cells. Neuroendocrine (NE) cells are sparsely distributed throughout the prostatic epithelium (Figure 1.6 and Figure 1.7).

5

Figure 1.6 High power microscopic image of normal prostate gland acini (26)

Figure 1.7 Schematic representation of the normal prostate epithelium (27)

The luminal cells are terminally differentiated cells, which generate the secretory products of the prostate, including PSA and prostatic acid phosphatase (PAP). As the luminal layer is dependent on androgens for its survival, the luminal secretory cells show an abundance of androgen receptors (AR)(28).

The basal cells are relatively undifferentiated cells (29). These cells lack secretory activity and express only low levels of AR as the basal layer is not dependent on androgens for its survival (30). However, the basal layer contains an important subset of androgen- responsive stem cells (31), which give rise to the more rapidly dividing population of transit amplifying cells, an expanding hierarchy of cells with different stages of

6 differentiation which eventually commit to terminal differentiation (32). This is an androgen-dependent process that allows the secretory layer of the prostate to be maintained (33).

The processes of cell growth, proliferation and differentiation in the prostate gland are tightly regulated to maintain homeostasis of the epithelial layer and are dependent on a delicate balance between androgens, oestrogens and stromal growth factors with stromal-epithelial interactions playing an important role (34, 35).

The NE cells are dendritic intraepithelial cells, which produce a range of neurosecretory products, such as hormones and growth factors (36). NE cells lack the AR and play a role in the regulation of epithelial cell growth and differentiation, through androgen- independent mechanisms (37).

Androgens and the prostate Androgens regulate important processes involved in the growth and function of the prostate (38). Testosterone and its metabolite 5α-dihydrotestosterone (DHT) exert their cellular effects in the prostate through binding to the AR, an intracellular steroid hormone receptor, which acts as a ligand-activated transcription factor. The cytoplasmic AR undergoes a change in conformation with ligand binding, which results in its translocation to the nucleus. In the nucleus, the AR binds to androgen response elements (AREs) in the promoter regions of target genes to regulate their transcription. These androgen responsive genes play vital roles in prostate cell proliferation, differentiation and survival (39). Androgen signalling through the stromal AR is important for maintaining homeostasis of the epithelial layer and androgen signalling through the epithelial AR promotes the secretory function of the prostate through the expression of genes which encode secretory including PSA, PAP and kallikrein- 2 (KLK2)(40, 41).

7

1.3 Prostate pathology

Benign prostatic hyperplasia

Benign prostatic hyperplasia (BPH) is a non-malignant growth of the prostate gland, arising from the uncontrolled proliferation of prostatic stromal and epithelial cells in the transition zone of the prostate. This may cause pressure to be exerted on the urethra resulting in lower urinary tract symptoms (LUTS)(42). In comparison to prostate cancer, which involves the proliferation of cells with marked molecular abnormalities, BPH is an overgrowth of normal prostatic cells. Both conditions are dependent on androgens for their growth and development, and show a marked increase in prevalence with age (40, 43). However, BPH is more prevalent than prostate cancer and occurs almost exclusively in the transition zone of the prostate, which is rarely the site of the index tumour in prostate cancer (44). The presence of BPH does not substantially increase the risk of prostate cancer (45), although the two conditions may coexist.

High-grade prostatic intraepithelial neoplasia

High-grade prostatic intraepithelial neoplasia (HGPIN) is characterised by an enhanced cellular proliferation within the prostatic ducts and acini, and shares structural similarities to prostate cancer, such as a loss of basal cells and nuclear and nucleolar enlargement (46). Unlike prostate cancer, the basal cell layer in HGPIN remains intact. HGPIN is considered a precursor of prostate cancer and is most prevalent in the peripheral zone of the prostate, similar to prostate cancer (47). HGPIN is frequently found concomitant with prostate cancer in regions adjacent to cancer (47). The finding of HGPIN at prostate biopsy, particularly in multiple cores, is significantly associated with an increased risk of prostate cancer at follow-up biopsy (48). Underlining the role of HGPIN as a precursor to prostate cancer are the shared biological and molecular features between the two conditions, with HGPIN reflecting alterations which are intermediate in the transition from normal to malignant prostatic epithelium (49).

8

Prostate cancer

The vast majority of prostate cancers are acinar adenocarcinomas, arising in the epithelial cells lining the acini of the prostate glands. Prostate cancer usually arises in the peripheral zone of the prostate, with only 10-20% of prostate cancers originating in the transition zone (50) and <10% in the central zone (51).

Due to improvements in detection, over 80% of prostate cancers in the UK are diagnosed as localised or locally advanced disease, with the majority of these being localised (52). Prostate cancer is frequently asymptomatic as it tends to occur in the peripheral zone, which is farthest away from the urethra. If symptoms do occur they are usually LUTS including obstructive and voiding symptoms (53).

Due the frequent late presentation of symptoms in prostate cancer, some patients present with metastatic disease, where cancer has spread to the regional lymph nodes or to other parts of the body. The most frequent sites of distant metastases in prostate cancer besides the non-regional lymph nodes are the bones, followed by the liver and lungs (54). In comparison to a nearly 100% disease-specific survival for localised prostate cancer, less than a third of men presenting with metastatic disease survive beyond five years (55).

9

1.4 Prostate cancer treatments

Watchful waiting/active surveillance Watchful waiting involves the monitoring of a prostate cancer patient with regular PSA tests but no active treatment (although symptomatic treatment is given if necessary). This strategy is used for men who are more likely to die from other causes than prostate cancer due to their advanced age, co-morbidities or due to having low risk disease. This approach saves patients from suffering treatment-related impairment of quality of life as a result of undergoing invasive treatments, which are unlikely to benefit them.

Active surveillance aims to avoid or postpone definitive treatment in men with localised, low risk disease and takes a more active approach to monitoring, involving regular PSA tests, digital rectal examinations (DRE) and repeat biopsies, with early curative intervention given if needed.

Radical prostatectomy Radical prostatectomy (RP) is the mainstay of treatment for men with localised disease and a life expectancy of >10 years, who have a high chance of dying from prostate cancer without radical treatment. RP involves the removal of the prostate along with the seminal vesicles and some surrounding tissue. Men with locally advanced disease can be treated with surgery but require adjunctive therapies such as radiotherapy or hormone therapy (56). RP is associated with a risk of acute complications including cardiovascular complications and wound infection/bleeding (57). The incidence of long- term complications from the procedure is high, with a prevalence of erectile dysfunction and incontinence of approximately 80% and 20% respectively (57).

Radiotherapy Radiotherapy may be used to treat localised or locally advanced disease and may be administered as external beam radiotherapy (EBRT) or brachytherapy, either with or without hormone therapy. EBRT involves high dose irradiation of the prostate from outside the body. Brachytherapy is a more focused approach, which irradiates the cancer from within the prostate. EBRT is used as an adjunctive therapy in RP patients to improve outcome, particularly in men with adverse pathologic findings such as pT3 stage, extraprostatic extension or seminal vesicle invasion. Additionally, it is used to

10 treat local disease recurrence in men with biochemical recurrence (BCR) after surgery and has a role in the palliative treatment of advanced disease, where it can be used to shrink bone metastases and alleviate pain. Radiotherapy, particularly EBRT, is associated with both transient and long-term complications such as incontinence, rectal urgency and erectile dysfunction (58).

Androgen deprivation therapy (hormone therapy) Both normal and malignant prostate epithelial cells are dependent on androgen signalling for their survival and growth (59). In prostate cancer, androgen deprivation therapy (ADT) can be used to deprive the androgen-sensitive prostate cancer cells of androgens. A good initial response to ADT is seen in most patients with androgen- dependent disease resulting in tumour regression due to the cessation of proliferation and instigation of apoptosis (60). ADT is usually administered via luteinizing hormone- releasing hormone (LHRH) agonists or antagonists, which halt the production of androgens by the testes. Alternatively, antiandrogens which prevent the binding of androgens to the AR may be used. ADT is associated with significant morbidity including hot flashes, loss of bone mineral density and the development of metabolic syndrome (61-63).

Although ADT may be used as a primary therapy, it is more commonly used as an adjunctive treatment with surgery or radiotherapy for the treatment of intermediate/high risk localised disease, locally advanced disease and lymph node positive disease (64-66).

ADT may be used for disease control and symptomatic relief in metastatic prostate cancer (67). The clinical response to ADT is rarely sustained and tumour growth inevitably recurs due the development of metastatic castration-resistant prostate cancer (mCRPC)(68), which has a poor prognosis (survival is usually less than two years) (69).

Treatments for mCRPC Treatments that may provide symptom relief and improve survival in men with mCRPC include abiraterone acetate, which selectively inhibits a critical enzyme in the androgen biosynthesis pathway (70) or systemic treatments such as docetaxel, a cytotoxic chemotherapeutic agent (71). 11

1.5 Prostate cancer diagnosis and the PSA test

The serum PSA test and the DRE are the most frequently used initial tests for prostate cancer. An indication for a prostate needle biopsy would be an elevated serum PSA measurement and/or abnormal DRE.

PSA is a kallikrein-like serine protease encoded by the androgen-responsive gene kallikrein related peptidase 3 (KLK3) and secreted almost exclusively by the epithelial cells of the prostate gland. The serum PSA level is elevated in prostate cancer due to disruption of the prostate architecture, which allows PSA to pass more easily into the circulation. The risk of finding prostate cancer increases with more elevated PSA concentrations (72), with the likelihood of finding advanced disease increasing with higher PSA concentrations (73). The traditional serum PSA cutoff is 4.0ng/ml (74); however this cutoff value may be adjusted according to age and other factors such as ethnicity (75).

Several limitations are associated with the use of the PSA test for the detection of prostate cancer. Despite being highly tissue-specific, the PSA test is not cancer-specific and elevations are also found in benign conditions of the prostate, including BPH and prostatitis (76). As a result, a large proportion of men with PSA values ≥4.0 ng/ml are not found to have prostate cancer on biopsy, and have been unnecessarily exposed to the risks and side effects of this procedure (77). Although the positive predictive value for a PSA 10-20ng/ml is relatively high (~74%), it is only ~50% for a PSA 4-10ng/ml where there is significant overlap between the PSA ranges of BPH and prostate cancer (78). The serum PSA test also has some limitations with regard to sensitivity as prostate cancer, including high-grade disease, may be found in men with a serum PSA within the normal range (79).

The PSA test plays a central role in the serial monitoring of prostate cancer patients after RP or radiotherapy. After RP, the serum PSA should drop to an almost undetectable level due to the complete removal of the prostate tissue. BCR after RP is usually defined by a serum PSA value of >0.2ng/ml or a slight variation on this definition (80). BCR is a surrogate endpoint and time to systemic progression and death may take a protracted course (81). The timing of BCR is important for assessing the future risk of progression.

12

For instance, a short time interval between the end of surgery/radiotherapy and BCR or a short PSA doubling time is highly predictive of earlier clinical progression and prostate cancer-specific death (82, 83).

Transrectal ultrasound (TRUS)-guided prostate biopsy

Transrectal ultrasound (TRUS)-guided biopsy is used for the definitive diagnosis of prostate cancer and for prognostic purposes. Under the guidance of an ultrasound probe, a fine needle is passed into the prostate to collect several thin cores of tissue for histological examination by a pathologist. If cancer is present, the percentage cancer involvement in each core and the amount of positive cores are used to assess tumour volume. The biopsy samples are histopathologically graded using the Gleason grading system (84).

13

1.6 Prostate cancer prognosis

Stage, Gleason grade and the serum PSA level are the most important prognostic tools, which are used in combination to guide the clinical management of prostate cancer.

Staging

The most common system of staging is the Tumour-Node-Metastasis (TNM) system (85), which describes the extent of the primary tumour (T category) and whether the cancer has spread to the regional lymph nodes (N category) or metastasized to other parts of the body (M category). The clinical stage is an estimate of the extent of disease prior to treatment, based purely on the results of the physical examination (DRE), biopsy and imaging studies (e.g. bone scans, computerized tomography (CT) and magnetic resonance imaging (MRI)). The pathologic stage can only be measured in surgically treated patients as it incorporates both the clinical staging results and the pathology results from the tissue removed at surgery. Pathologic staging is more accurate than clinical staging, as it provides a more direct assessment of the nature and extent of disease. The main categories of the TNM staging system are described in Figure 1.8. Most of the categories in Figure 1.8 have futher subcategories (a, b, c) to allow more specific descriptions.

Figure 1.8 The TNM staging system (85)

14

Prostate cancer can then be broadly categorised as localised, locally advanced and advanced, based on the TNM staging as follows:

Localised T1, N0, M0 T2, N0, M0

Locally advanced T3, N0, M0 T4, N0, M0 any T, N1, M0

Advanced any T, any N, M1

The Gleason grading system

Biopsy and RP samples are graded by the Gleason grading system, which classifies a tissue based on its degree of differentiation (86)(Figure 1.9). This system consists of grades 1-5; with the tissue becoming less differentiated with increasing grade as its resemblance to the normal glandular architecture of the prostate lessens. However, grades 1 and 2 are not generally used for biopsy samples. The Gleason score (GS) is the sum of the primary grade (the most frequent pattern) followed by the secondary grade (second most frequent), e.g. 3+4 = 7.

Figure 1.9 Gleason grading system according to the International Society of Urological Pathology (87)

15

Pretreatment serum PSA

The pretreatment serum PSA test is also used for prognostic purposes as it is positively associated with adverse clinicopathological parameters such as extracapsular extension and seminal vesicle invasion (88). Furthermore, studies have found preoperative PSA to be an independent predictor of biochemical recurrence-free survival and prostate cancer-specific survival after RP (82, 88).

Risk stratification

As a result of improved detection methods, the incidence of men diagnosed with non- metastatic prostate cancer has increased, particularly the proportion of men with localised disease, which has a heterogeneous outcome (52). The majority of these prostate cancers have a slow disease progression, which is unlikely to cause symptoms or become lethal for up to 15 years, even if untreated (89). Often due to the advanced age of most prostate cancer patients, these men will die of competing causes before treatment is ever necessary (57). However, locally advanced disease and a minority of localised prostate cancers have a much higher risk of metastasis and death within this time (90). For these prostate cancers and/or men with a high life expectancy, interventional treatments such as surgery and radiotherapy are potentially curative options, which may increase disease-specific survival (91).

It is of paramount importance to determine the level of risk for each patient at the time of diagnosis in terms of the potential for the cancer to spread, to avoid overtreatment of clinically insignificant disease, whilst ensuring that no curative opportunities are missed to prevent disease progression in patients with more aggressive disease.

Risk stratification is the foundation for clinical decision-making and treatment selection in men with non-metastatic prostate cancer and uses the three most important prognostic indicators available at diagnosis: the pretreatment PSA, clinical tumour stage and biopsy GS. The most widely used method of risk statification is the D’Amico system, which classifies patients into low, intermediate and high risk categories for disease progression after definitive treatment (RP or EBRT) and may be used to guide treatment decisions (92, 93)(Table 1.1).

16

Table 1.1 D’Amico risk classifications

Risk Score Low risk T1-2a and GS ≤6 and PSA <10 Intermediate risk T2b and/or GS 7 and/or PSA 10-20 High risk ≥T2c and/or GS 8-10 and/or PSA >20

Although D’Amico’s risk categories are useful for guiding treatment decisions, there exists a significant heterogeneity of prognoses within each risk category, particularly within the intermediate risk group which is the largest group of prostate cancer patients (94).

Nomograms, which use point scales of clinicopathological variables to calculate the continuous probability of a particular outcome provide a more accurate and individualised method of prediction than risk categories. A widely used preoperative nomogram for the prediction of 5 or 10 year probability of prostate cancer recurrence after surgery is Stephenson’s nomogram (95), which is based on the variables preoperative PSA, clinical stage, biopsy GS, and also includes the results of systematic biopsies and adjusts for the year of surgery to account for stage migration.

A newer predictive tool called the Cancer of the Prostate Risk Assessment (CAPRA) score combines the accuracy of the best nomograms, with the usability of the D’Amico classification system (96). The CAPRA score can be used to predict outcome after RP and other treatments including radiotherapy, primary ADT and watchful waiting/active surveillance (97, 98). The variables incorporated in the CAPRA score are: age, preoperative PSA, biopsy GS, clinical tumour stage and the percentage of positive cores on biopsy. The CAPRA score demonstrates a relatively high discriminatory accuracy for the prediction of BCR after RP (c-index of 0.66-0.81)(99).

Limitations of clinical prognostic tools

Even the best performing predictive tools have significant limitations and there still exists a significant heterogeneity of outcome between prostate cancer patients with similar clinicopathological variables. Prognostic tools such as nomograms are developed using retrospective studies of particular populations of prostate cancer patients. Therefore, they may not be universally applicable to prostate cancer patients with different clinical demographics and geographic location (100). 17

Secondly and most importantly, all the aforementioned predictive tools are restricted by the inherent limitations of their component parts: the GS, staging and the serum PSA. In particular, preoperative prediction tools which can be used at the time of treatment selection, tend to be poorer predictors of outcome than the postoperative tools, which incorporate pathologic findings (101). The lower discriminatory value of the pretreatment tools results from under/overgrading of the biopsy sample in comparison to the pathologic sample and/or inaccuracies in predicting the extent of disease using clinical rather than pathologic staging.

Limitations in the biopsy GS result from both sampling error and the subjective nature of the Gleason grading system. The biopsy procedure samples a very small amount of the total prostate tissue. As the majority of prostate cancers are multifocal exhibiting grade heterogeneity (102), the biopsy may under-represent areas of high grade disease that are significant prognostically or oversample clinically insignificant areas. This results in frequent discordance between the biopsy and pathologic GS, with biopsy undergrading found in over 20% of cases and overgrading in approximately 10% (103). Gleason grading is also limited by its subjective nature and is prone to intra- and inter- pathologist variation of interpretation (104).

Clinical staging at the time of treatment selection is highly subjective and assessment of the extent of disease and prediction of outcome may be inaccurate (105) due to limitations in the available detection techniques such as TRUS, MRI and CT scans (106). A problem of clinical understaging has been reported as most prevalent (105).

Additionally, the pretreatment PSA has limitations as a prognostic tool. In particular, the predictive value of PSA is low for patients with a serum PSA below 10ng/ml, which has accounted for an increasing number of prostate cancer patients in the past few decades (107). The majority of these patients do not require active treatment due to a prolonged natural disease course, but a small number of prostate cancers progress despite a low or undetectable serum PSA level. In fact, in prostate cancer patients with high GS (8-10), low serum PSA levels (<4.0 ng/ml) are associated with a particularly aggressive form of the disease (108).

Limitations in the tools used for risk classification make clincial decision-making difficult. Overgrading/staging has led to the overtreatment of clinically insignificant disease with 18 interventional treatments such as RP and radiotherapy, which are associated with high morbidity and financial cost (109). Undergrading/staging has led to some missed opportunites to provide curative treatment for men with aggressive disease and the failure to identify patients who could benefit from adjunctive treatments alongside RP/radiotherapy to improve their chances of survival (110).

Biomarker research and molecular technologies

The requirement to delineate prostate cancer patients according to their risk of disease progression to better inform treatment decisions has resulted in the search for prognostic biomarkers. A more individualised approach to risk stratification is especially important in prostate cancer due to the heterogeneity of its clinical behaviour, wide range of treatment options and the high risk of impairment to quality of life associated with its treatments (58).

Biomarker research has focused on biomarkers that can distinguish between prostate cancer patients who can confidently be assigned to conservative management (active surveillance) and those who would benefit from interventional treatments such as RP and radiotherapy. A particular challenge is whether to treat patients within the intermediate risk group (according to currently used risk classifications) with active surveillance or definitive treatment (111). Secondly, the 10-year BCR rate after RP and definitive radiotherapy is 20-40% and 30-50% respectively (112, 113). There is therefore a need to more accurately identify patients who are at a high risk of recurrence after surgery or radiotherapy and who would benefit from adjunctive therapies to improve their outlook, e.g. radiotherapy and/or ADT with RP.

Molecular technologies have been critical to understanding the complex molecular alterations that initiate and drive the progression of prostate cancer. As these alterations may underpin the wide array of clinical phenotypes and behaviour in prostate cancer even amongst patients with similar clinicopathological features, these tools have been used in the search for prognostic prostate cancer biomarkers. Although non-invasive biomarkers (e.g. serum or urine) would be ideal, tissue samples may offer the most amenable sample source for biomarker research. Ideally, these markers should

19 be able to be measured in biopsy samples, as this is often the only tissue sample available at the time of treatment selection.

A myriad of different deoxyribonucleic acid (DNA), ribonucleic acid (RNA) and protein markers have been explored in prostate cancer. Despite many of these biomarkers showing significant potential to be useful for risk prediction in prostate cancer, none have so far been routinely incorporated into clinical practice. For a biomarker to be adopted into clinical practice, it would have to demonstrate an ability to improve upon the currently used clinical prognostic tools, either in addition to these tools or alone (114). However, the exploration of these biomarkers has led to an increased knowledge regarding the molecular alterations that are associated with prostate cancer progression. Many of the molecular alterations that have been identified as being associated with disease progression in prostate cancer have a known role in the acquisition of individual or multiple hallmark traits of cancer (115). In the next section, the hallmark traits of cancer and the molecular changes associated with their attainment are described, with an emphasis on the alterations found in prostate cancer progression.

20

The hallmark traits of cancer

Cancer is driven by genetic alterations in somatic cells which dysregulate cellular pathways, primarily through the activation of oncogenes and the disruption of tumour suppressor genes. These changes allow cancer cells to acquire the hallmark traits of cancer as described by Hanahan and Weinberg (116)(Figure 1.10).

Figure 1.10 The hallmark traits of cancer (adapted from (116))

These hallmark traits provide the cell with growth and survival advantages over its neighbouring cells. A process similar to Darwinian selection occurs whereby variant cells, which have acquired changes that aid in the acquisition of the traits, are selected and undergo clonal expansion (117). As further genetic alterations are accumulated and selected, the cells gradually acquire all the hallmark traits. The accumulation of genetic alterations continues throughout disease progression (118).

21

Self-sufficiency in growth signals For normal cells to begin actively proliferating from a quiescent state, they are dependent on extracellular growth signals, which control entry into and progression through growth and division cycles (118). In contrast, cancer cells are not dependent on exogenous growth stimulation, leading to unregulated, chronic proliferation, which interferes with the normal homeostasis of cell number, tissue structure and function (116). Cancer cells attain this characteristic through a variety of mechanisms, which cause dysregulation of important intracellular growth signalling pathways, such as the mitogen-activated protein kinase (MAPK) cascade and the phosphoinositide 3-kinase (PI3K) pathway (118).

Firstly, there may be upregulation of growth factor ligands, which initiate these signal transduction pathways (119). Cancer cells may send signals to surrounding normal cells to increase the release of external growth factors (120). Alternatively, cancer cells may produce their own growth factors, to which they respond by a process of autocrine proliferative stimulation, thereby reducing their reliance on external growth factors (121, 122).

Cancer cells may gain autonomy over growth signalling through the overexpression or mutation of tyrosine kinase growth factor receptors, leading to a heightened response to normal levels of ligand or ligand-independent signalling. The epidermal growth factor receptors, epidermal growth factor receptor (EGFR)(123) and receptor tyrosine-protein kinase erbB-2 (ERBB2)(124) are overexpressed in a proportion of prostate cancers, with expression positively correlated with disease progression (123, 125), particularly with the transition to androgen-independence (126, 127).

Furthermore, growth signalling pathways may be activated by alterations to downstream intracellular signal transducers or the transcription factor targets themselves, leading to an increased expression of proliferation-associated genes (128). For instance, an important transcription factor of the MAPK pathway called the Myc proto-oncogene protein (MYC) is frequently overexpressed in prostate cancer, with copy number gains of its gene (MYC) increasing in frequency with disease progression (129, 130). The frequency of MYC amplification is significantly increased in androgen- independent prostate cancer (131).

22

Insensitivity to antigrowth signals Normal cells receive multiple anti-proliferative signals, which act predominantly through the cell cycle to block proliferation and maintain quiescence. Cancer cells have adopted various strategies to evade these signals through the activation of components that drive cell cycle progression and inactivation of those that restrain its progression. A central point of action in antigrowth signalling is the G1/S checkpoint of the cell cycle, which allows the cell to monitor its environment and halt transition into the S phase if something is awry (132). A key component of the G1 checkpoint is the retinoblastoma- associated protein (RB1) which prevents progression through this checkpoint in its hypophosphorylated state by sequestering the transcription factors necessary for the G1/S transition (133). Dysregulation of the retinoblastoma pathway, which results in progression through this checkpoint and unchecked proliferation, is a universal finding in cancer (118). Alterations may include overexpression of cyclins and cyclin-dependent kinases (134, 135) which act to phosphorylate RB1, as well as the loss of cyclin- dependent kinase inhibitors which keep RB1 in its hypophosphorylated form by inhibiting cyclin/cyclin-dependent kinase complexes (136). Alternatively, upstream signals of the pathway may be affected such as the antiproliferative signalling molecule, transforming growth factor beta-1 (TGFB1)(137).

In prostate cancer, the expression of cyclin-dependent kinase inhibitors such as cyclin- dependent kinase inhibitor 1B (CDKN1B) may be reduced (138). Dysregulation of TGF- beta-1 signalling is a feature of prostate tumourigenesis, where the growth inhibitory effects of TGFB1 are counteracted, leading it to paradoxically stimulate tumour progression (139). Elevated levels of plasma and tissue TGFB1 have been correlated with features of aggressive disease in some studies (140, 141).

Additionally, alterations of the tumor protein p53 gene (TP53) leading to reduced protein function are found in a proportion of prostate cancers and are associated with adverse outcome (142). TP53 is a transcription factor involved in a cell’s response to internal stress such as DNA damage and oncogene activation, through activation of genes involved in cell cycle arrest and apoptosis (143). TP53 acts as a regulator of the G1/S checkpoint and other important transition points including the G2/M checkpoint (144)(Figure 1.11).

23

Another tumour suppressor gene called the phosphatase and tensin homolog (PTEN) is frequently lost in prostate cancer (145). PTEN encodes an important negative regulator of the PI3K pathway, which mediates cellular growth, proliferation and survival (146). Through its inhibitory action on the PI3K pathway, the PTEN protein has a regulatory role in all phases of the cell cycle including the G1/S and G2/M transitions (Figure 1.11). Its loss will therefore allow unimpeded progression through the cell cycle at these checkpoints (147).

Figure 1.11 The phases of the cell cycle showing two of the major checkpoints (G1/S, G2/M) and the molecules which regulate them

Evading apoptosis Expansion of the tumour cell population is not only determined by the rate of cell proliferation but also the rate of cell death, particularly from apoptosis. Apoptosis is a highly ordered process, which is induced in situations such as DNA damage, oncogene activation and hypoxia. It involves two major classes of molecules: cell-surface receptors called sensors which monitor both the intracellular and extracellular environment, and effectors which initiate the apoptotic machinery (118). Apoptosis is prevented by the binding of survival factors such as insulin-like growth factor I and II (IGF1/2) to their specific receptors, whereas death signals such as Fas ligand binding to the Fas receptor will initiate the process (118). Of central importance to the apoptotic cascade is the balance of different members of the Bcl-2 family which consists of pro-apoptotic

24 proteins, e.g. apoptosis regulator BAX (BAX) and Bcl-2 homologous antagonist/killer (BAK1), and anti-apoptotic proteins such as apoptosis regulator Bcl-2 (BCL2) and Bcl-2- like protein 2 (BCL2L2)(148). Although the apoptotic program is initially instigated by oncogene expression to eliminate mutant cells, cancer cells must become resistant to apoptosis to allow a tumour to progress. Inhibition of apoptosis plays a critical role in cancer progression by increasing the survival of tumour cells, which results in an overall net tumour growth and promotes genomic instability (149). Importantly, evasion of apoptosis is necessary for the survival of disseminated cancer cells during metastasis and plays a role in drug resistance (150, 151).

Mechanisms that may contribute to the evasion of apoptosis in prostate cancer and have been correlated with its progression include: increased IGF survival signalling (152), downregulation of components of death signalling pathways such as the Fas pathway (153) and upregulation of the important anti-apoptotic proteins BCL2 (154) and baculoviral IAP repeat-containing protein 5 (BIRC5)(155). Upregulation of BCL2 is particularly important for the transition to androgen-independence that may occur after treatment with androgen ablation therapy (156).

Limitless replicative potential Non-cancerous cells undergo a finite number of divisions before they enter senescence. The cell-autonomous program which prevents limitless replication must be overcome by cancer cells, if a tumour is to expand to a size which compromises health (118). This process was illustrated by Hayflick et al. (157), who showed that normal fibroblasts in culture exhibited a finite replicative potential before they became senescent. In comparison, cells that had their RB1 and TP53 tumour suppressor proteins rendered non-functional multiplied to further generations until a second state called crisis was reached. The crisis state was characterised mainly by cell death and karyotypic disarray, but with an occasional mutant cell arising with the potential to continue multiplying indefinitely. This process was termed immortalization and gave an insight into how cancer cells that have accumulated the necessary attributes for tumour formation are selected for in vivo and can break the mortality barrier (158).

Of great importance to the mortality barrier are telomeres, multiple repeats of a short DNA sequence, which protect the ends of from degradation and fusion

25 with other chromosomes. With each cell division, a portion of the telomere is lost due to the inability of DNA polymerases to entirely replicate the 3’ ends of DNA during the S phase of the cell cycle (159). Progressive telomeric loss eventually triggers damage response pathways, causing the cell to either enter senescence or die through entry into the crisis state. Cancer cells have ingenious mechanisms to maintain telomere stability, allowing them to bypass the mortality barrier, including the reactivation of telomerase (the enzyme responsible for adding DNA sequences to the ends of telomeres)(160, 161) and a process called alternative lengthening of telomeres, which uses recombination- based interchromosomal exchanges of sequence information to maintain telomeres (162).

Telomere shortening appears to be an early stage alteration in prostate cancer (163), rendering chromosomes more vulnerable to fusion, breakage and rearrangement, thereby promoting carcinogenesis through chromosomal instability. In contrast, increased telomerase activity has been shown to correlate with disease progression in prostate cancer (164). In particular, the gene TERT which encodes the catalytic subunit of the telomerase enzyme complex called telomerase reverse transcriptase has been investigated as a non-invasive marker of aggressive disease (165).

Sustained angiogenesis A tumour cannot progress beyond 1-2mm in size, invade surrounding tissue or disseminate without the formation of its own vascular network through a process called angiogenesis (166). Angiogenesis is the development of new blood vessels from pre- existing vasculature and is an essential requirement of carcinogenesis to sustain the growth of the expanding mass, through the provision of nutrients and oxygen and the removal of waste. The angiogenic switch is activated at an early stage of carcinogenesis by an upregulation of angiogenic activators including growth factors, integrins and enzymes; and a downregulation of angiogenesis inhibitors (118). In response to proangiogenic signals, the usually quiescent vascular endothelial cells start to divide and migrate into surrounding tissue forming a vascular sprout. This is followed by the organisation of endothelial cells into hollow tubes that gradually evolve into a network of blood vessels (167)(Figure 1.12).

26

Figure 1.12 Steps involved in angiogenesis (168)

In prostate cancer several angiogenic activators have been found to be significantly upregulated, including a potent proangiogenic growth factor called vascular endothelial growth factor A (VEGFA)(169), which plays a critical role in angiogenesis through its effect on endothelial cell function. A positive correlation of both VEGFA and its receptors with disease progression have been demonstrated (170, 171). Additionally, some studies show upregulation of the enzyme prostaglandin G/H synthase 2 (PTGS2)(172), a central mediator of angiogenesis, as well as induction of matrix metalloproteinases (MMPs)(173, 174) in prostate cancer. MMPs are proteolytic enzymes which enable the extracellular matrix (ECM) to be degraded and remodelled, thereby allowing endothelial cells to migrate and invade surrounding tissue. The expression of PTGS2, 72 kDa type IV collagenase (MMP2) and matrix metalloproteinase-9 (MMP9) have been positively correlated with adverse clinicopathological parameters and poor outcome in prostate cancer (172, 175, 176). Conversely, endogenous angiogenesis inhibitors, such as thrombospondin-1 (THBS1), may be downregulated in prostate cancer (177).

Reprogramming of energy metabolism Rapidly proliferating cells require reprogramming of their energy metabolism to allow the bioenergetic demands of their altered cellular activity to be met. They require energy in the form of adenosine triphosphate (ATP) and a raft of macromolecules such as nucleotides, proteins and fatty acids for the assembly of new cells (178). Hence, 27 dysregulation is seen in the proteins and enzymes of key metabolic pathways during carcinogenesis (179). A characteristic change in energy metabolism, exhibited in many types of cancer cells, is the Warburg effect whereby cancer cells shift their energy metabolism of glucose to glycolysis even when sufficient oxygen is available, rather than through the tricarboxylic acid (TCA) cycle and the electron transfer chain (180). This is called aerobic glycolysis and appears initially to be counterproductive as it provides a much lower yield of ATP than can be achieved through aerobic respiration. However, alongside an increased flux of glucose, the Warburg effect in cancer cells allows an amplified rate of ATP generation. Additionally, it allows glycolytic intermediates to be channelled into various biosynthetic pathways to support proliferation, e.g. nonessential amino acids, nicotinamide adenine dinucleotide phosphate (NADPH), glycerol and citrate for lipids, and ribose sugars for nucleotides (181). To increase glucose flux, cancer cells may increase both the uptake and utilization of glucose, for instance through an upregulation of glucose transporters (182). However, metabolic reprogramming exhibits high variability between different cancers and it is now generally accepted that tumour cells rely on a combination of both aerobic glycolysis and oxidative metabolism to supply their energy and macromolecular needs (183).

In prostate cancer, oxidative metabolism remains important for carcinogenesis and the Warburg effect tends to be a feature of more advanced disease only (184). Importantly, in comparison to other cells in the body, normal peripheral zone cells of the prostate have a distinctive metabolic profile, allowing them to accumulate high levels of citrate, PSA and polyamines, which are secreted into the prostatic fluid (185). The extremely high levels of citrate are due to direct inhibition of m-aconitase, a mitochondrial enzyme that catalyses the first step of citrate oxidation in the TCA cycle, as a result of high intracellular levels of zinc (186). Without citrate oxidation, the usual continuation of the TCA cycle (which is seen in other organs) is effectively halted, resulting in a much lower production of ATP (186). Conversely, energy production in malignant prostate cells becomes more efficient as the ability to accumulate zinc is diminished, leading to reactivation of m-aconitase and the ability to oxidise large quantities of citrate through the TCA cycle (187). These changes are illustrated schematically in Figure 1.13.

28

Figure 1.13 Metabolism in the healthy prostate and at different stages of prostate cancer: a) Normal prostate b) Early prostate cancer c) Late prostate cancer (adapted from (188))

29

Metabolic profiling studies have highlighted important metabolic pathways in prostate cancer as well as identifying prognostic metabolic profiles and individual biomarkers (189). Metabolic changes that have been documented in prostate cancer and may be associated with its progression include: overexpression of fatty acid synthase (FASN) (190), a key enzyme in de novo fatty acid synthesis, which may reflect an increased need for lipid biosynthesis in rapidly proliferating cells for membrane formation and intracellular signalling; increased alpha-methyl-CoA racemase (AMACR), an enzyme which is involved in the beta oxidation of branched chain fatty acids and fatty acid derivatives (191); an increased production of the metabolite sarcosine (192)(a byproduct of methionine metabolism which is a central pathway in the methylation of DNA, RNA and proteins); and an increased (choline+creatine+polyamines)/citrate ratio (193).

Tissue invasion and metastasis Once the growth of a primary malignant tumour is progressive and it has its own vascular network, cancer cells from the tumour may invade surrounding tissue. This may be followed by metastasis, where cancer cells spread through the blood and lymphatic vessels to distant sites in the body. For micrometastases to colonize, they must develop their own vascular network, evade the host immune response and respond to organ- specific factors that promote their growth (194). Cancer cells at the distant site can then invade the host stroma and enter blood vessels to start the process of producing additional metastases (194). Metastases represent a lethal threat to the host, being responsible for 90% of cancer deaths from solid tumours (195). The multi-stage process of metastasis is depicted in Figure 1.14.

30

Figure 1.14 Schematic representation of the multiple stages of metastasis (196)

Invasion and metastasis utilise similar mechanistic strategies, including the release of extracellular proteolytic enzymes such as MMPs, serine proteases and cysteine proteases from cancer cells and conscripted stromal cells, which effectively degrade and remodel the tumour microenvironment to facilitate the passage of cancer cells through the epithelium, blood vessels and stroma (118). Every step involved in metastasis is dependent on changes in the coupling of tumour cells to both adjacent cells and the ECM. Therefore, changes in the expression and distribution of integrins (which mediate cell to ECM adhesion) and changes to cell adhesion molecules (CAMs) such as cadherins and immunoglobulins are also fundamental to allow tumour cells to migrate and invade, and to allow growth at a distant site (197).

In prostate cancer, serine proteases have been found to be upregulated. Studies have demonstrated increased expression of urokinase-type plasminogen activator (PLAU) (198) and upregulation of the gene HPN which codes for serine protease hepsin (199). PLAU catalyses the conversion of plasminogen to plasmin upon binding to its receptor called urokinase plasminogen activator surface receptor (PLAUR), promoting the

31 degradation of the basement membrane and ECM by the activation of a cascade of proteolytic enzymes (200).

Cadherin-1 (CDH1) expression has been found to be significantly downregulated in prostate cancer, with reduced expression correlated with poor prognosis (201). CDH1 is an important cell adhesion molecule of the cadherin family which plays a pivotal role in the maintenance of cell architecture. Changes in the expression of CAMs of the immunoglobulin supergene family, such as the CD166 antigen (ALCAM), have been linked to prostate cancer progression (202).

Epithelial-mesenchymal transition The epithelial-mesenchymal transition (EMT) is a developmental regulatory program, which is activated during the progression of epithelial cancers as it enables cancer cells to acquire a migratory phenotype through the loss of epithelial characteristics and the development of mesenchymal properties (203). Interactions of the cancerous cells with the surrounding stromal cells are important for the induction of EMT, as reflected by the identification of EMT processes concentrated at the invasive front of a primary tumour (204).

Several characteristics of EMT are correlated with disease progression in prostate cancer and may hold prognostic significance, including increased vimentin expression (205) and cadherin switching (206). Cadherin switching refers to the loss of CDH1 and an increased expression of another type of cadherin called cadherin-2 (CDH2) as the normal epithelial architecture is lost and there is a transition to a mesenchymal, more invasive and metastatic phenotype (207).

Genomic instability The development of the aforementioned cancer traits is dependent on a succession of clonal expansions, each of which offers the cancer cells a selective growth advantage. Due to the highly vigilant genome maintenance systems that detect and resolve errors in the DNA, the spontaneous mutation rate of normal somatic cells is extremely low (118). The number of genetic alterations found in tumours is vastly higher than can be explained by this low spontaneous mutation rate (208). This disparity can be explained by genome instability, an increase in the rate at which genetic alterations are accumulated, which aids clonal evolution and thereby encourages the acquisition of 32 additional cancer traits (209). Genome instability can be categorised as nucleotide instability, microsatellite instability which causes gain or loss of short nucleotide repeats, and instability resulting in alterations of chromosome number or structure (209). The molecular causes of genome instability in sporadic cancers are poorly defined, in comparison to hereditary cancer, where genomic instability is linked with specific mutations in DNA repair genes (210).

Chromosomal instability is the most prevalent form of genome instability in sporadic, solid cancers and results from alterations in chromosomal instability genes, such as the genes involved in chromosome segregation and the mitotic spindle checkpoint (211, 212). Chromosomal instability leads to aneuploidy, the gain or loss of whole or segmental chromosomes (e.g. through amplifications, deletions or translocations)(209). Aneuploidy is associated with loss of heterozygosity (LOH), a central mechanism by which tumour suppressor genes are inactivated (213). Microsatellite instability is also seen in some sporadic cancers and is caused by epigenetic inactivation or mutations of mismatch repair genes (214).

In prostate cancer there is an increased frequency of both gains and losses of chromosomal DNA with disease progression (215), and some alterations may be predictive of poor outcome such as amplification at 8q24, a region which contains the oncogene MYC (MYC proto-oncogene, bHLH transcription factor)(129). Probably the most frequent chromosomal rearrangement in prostate cancer is the TMPRSS2:ERG gene fusion between the androgen-regulated gene transmembrane serine protease 2 (TMPRSS2) and ETS transcription factor ERG (ERG)(216).

Evasion of the immune system and tumour-promoting inflammation Evasion of immune destruction is an essential hallmark of cancer as described below. Paradoxically, the inflammatory response may also be tumour promoting and can aid the acquisition of the hallmark capabilities by contributing certain molecules to the tumour microenvironment (217).

Tumour-promoting inflammation Chronic inflammation is a significant risk factor for the development of cancer (218). In the normal acute inflammatory response to infection or injury, regenerative proliferation and inflammation subside once the causative agent is removed and repair 33 is complete. In contrast, the persistence of initiating agents or a failure of the normal mechanisms that resolve inflammation may lead to a state of chronic inflammation (219). If cells have suffered mutagenic changes, chronic inflammation supports their continued proliferation due to the array of growth and survival factors provided by the inflammatory cells, thereby promoting carcinogenesis (219).

After cancer initiation, the tumour microenvironment will be infiltrated by an array of immune cells including tumour-associated macrophages, mast cells and neutrophils (220), which promote malignant progression through the production of cytotoxic mediators such as growth factors, proangiogenic factors and ECM degrading enzymes (221). Additionally, immune cells generate reactive oxygen and nitrogen species, inducing oxidative genomic damage and post-translational modifications, which may dysregulate central pathways involved in cell cycle progression or DNA repair (222).

Several studies have shown an increased risk of prostate cancer in men with a history of prostatitis (inflammation of the prostate)(223), which may be caused by infectious agents or oestrogens (224, 225). The link between chronic inflammation and the development of prostate cancer has led to the characterisation of a potential precursor of prostate cancer called proliferative inflammatory atrophy (PIA), which describes atrophic lesions of the prostate associated with chronic inflammation and proliferation in response to epithelial injury (226). PIA may be a precursor of HGPIN and prostate cancer in some cases (227). There is an identifiable progressive pathway of genetic alterations between PIA, HGPIN and prostate cancer including the reduced expression of tumour suppressor genes such as NK3 homeobox 1 (NKX3-1)(228) and an increased expression of the anti-apoptotic BCL2 (229).

Hypermethylation of the glutathione S-transferase pi 1 (GTSP1) gene may be critical for the transition of PIA into HGPIN and prostate cancer (230). The glutathione S-transferases are a family of enzymes which protect the genome from oxidative stress inflicted by inflammatory processes, by catalysing the detoxification of reactive electrophiles. Promoter hypermethylation of GSTP1, which leads to its inactivation, is probably the most frequent genomic alteration in HGPIN and prostate cancer and leaves cells vulnerable to oxidative genome damage caused by inflammation (231). Hypermethylated GSTP1 has shown some promise as a non-invasive

34 diagnostic/prognostic prostate cancer marker, as it can be measured in samples such as serum and urine (232, 233).

Stromal interactions The interactions between a tumour and its microenvironment are vital for cancer growth and progression. The tumour recruits and/or activates various stromal cells to the local microenvironment to form the tumour stroma which contains cancer associated fibroblasts (CAFs) and cells of the tumour associated vasculature and inflammatory cells (234). In comparison to normal stromal cells, the tumour stromal cells induce or upregulate various molecules which contribute to the proliferative and invasive phenotype of the cancer cells, including growth factors, angiogenic factors and proteolytic enzymes. Furthermore, the tumour stroma contains immune cells which allow the tumour to evade the host immune system (234). Complex reciprocal signalling back and forth between the tumour and stromal cells allows them to coevolve together, which contributes to cancer progression. This process is repeated at distant sites of colonisation, if metastasis occurs (235).

In prostate cancer, the reactive stroma is enriched with myofibroblasts and CAFs and shows a significant reduction in differentiated smooth muscle cells (236). Characteristic changes to the ECM composition include an increased expression of hyaluronan (237) and a reduced expression of desmin (238). Additionally, there is an increased expression of growth factors such as fibroblast growth factor 2 (FGF2)(239) and matrix remodelling enzymes (240).

Some components of the reactive stroma show prognostic significance in prostate cancer. For instance, expression of stromal hyaluronan (237) and tumour epithelial hyaluronidase-1 (HYAL1)(which degrades hyaluronan)(241) are positively correlated with disease progression. Hyaluronan may aid tumour progression through its stimulatory effect on cell proliferation (237) and role in tumour cell motility (242). Hyaluronidases such as HYAL1 are important for processing hyaluronan into fragments with potent angiogenic properties, thereby promoting neovascularisation (243).

Evading immune destruction For tumours to progress they must evade the body’s immune system, either by avoiding immune system detection or by disabling or eliminating immune cells (244). Tumour 35 cells can avoid detection by shedding surface antigens or by the dysregulation of important components required for effective antigen presentation (245, 246). As a result, tumours exhibit a poor ability to stimulate T cells or to act as targets for tumour- specific T cells. Immunosuppression in cancer may directly target the T cell response, e.g. through the downregulation of co-stimulatory molecules on the tumour cell surface (247), which prevents the co-stimulatory signals to the T cells that are necessary for their activation. There may also be upregulation of the co-inhibitory molecules that oppose T cell activation (248). Tumour and tumour stromal cells also produce a host of immunosuppressive factors, including cytokines and enzymes which inhibit immune cell activity (249). Finally, the formation of tumours encourages the recruitment and expansion of suppressor immune cell populations to the area such as regulatory T cells (250) and myeloid-derived suppressor cells (251), which secrete immunosuppressive modulators that suppress the effector function of T cells (252).

Immunosuppressive factors present in the microenvironment of prostate tumours include: signal transducers and activators of transcription (STATs)(253) and immunosuppressive cytokines (254). An expansion in the number of regulatory T cells such as FOXP3+ cells is also a feature of the tumour microenvironment in prostate cancer (255), with higher numbers of these cells representing a poor prognostic factor (256). Additionally, the coinhibitory ligand CD276 antigen (CD276) is upregulated in prostate cancer tissue (257), with increased expression associated with adverse clinical outcome (258).

Prostate cancer prognostic biomarkers

Many of the genes/proteins found to be dysregulated in prostate cancer have been investigated as potential prognostic biomarkers. A small selection of potential prognostic biomarkers, which have been widely investigated in prostate cancer, are critically evaluated in Appendix 1. The biomarkers included in Appendix 1 have been classified according to their putative roles in cancer (Table 1.2).

36

Table 1.2 Examples of potential biomarkers of prostate cancer progression

Self-sufficiency in growth signals MKI67 proliferation marker protein Ki-67 EZH2/EZH2 enhancer of zeste 2 polycomb repressive complex 2 subunit MYC MYC proto-oncogene, bHLH transcription factor Insensitivity to antigrowth signals PTEN/PTEN phosphatase and tensin homolog CCP score 31-gene signature of cell cycle progression genes Evading apoptosis BCL2 apoptosis regulator Bcl-2 IGFBP3 insulin-like growth factor-binding protein 3 Sustained angiogenesis MVD microvessel density PTGS2 prostaglandin G/H synthase 2 Tissue invasion and metastasis PLAU/PLAUR urokinase-type plasminogen activator, urokinase plasminogen activator surface receptor CDH1 cadherin-1 Four-kallikrein panel p2PSA [-2]proPSA and derivatives Genomic instability TMPRSS2:ERG gene fusion Stromal interactions stromal HA/HYAL1 stromal hyaluronan/hyaluronidase-1 serum BAP/uNTX serum bone-specific alkaline phosphatase/urinary N-telopeptide

Gene expression panels

Due to the complex molecular alterations that are characteristic of prostate cancer progression, it would be unrealistic to expect a single biomarker to fulfil the role of risk prediction and small panels of markers may more accurately reflect the clinical phenotype of prostate cancer.

Gene expression studies using complementary DNA (cDNA)/oligonucleotide microarray technology are important in the discovery phase of prognostic biomarkers in prostate cancer. However, such studies produce huge volumes of data that are difficult to resolve further. Additionally, microarray studies have a requirement for high quality RNA and so generally use fresh frozen tissue (259). This presents a problem as fresh frozen tissue is limited in both its availability and its associated clinical follow-up data, with which to correlate the gene expression data. In comparison, formalin-fixed paraffin-embedded (FFPE) samples are more readily available and possess well-documented follow-up data. However, due to a lower quality and quantity of RNA extracted from FFPE samples, a more sensitive technique such as real-time quantitative polymerase chain reaction (qPCR) analysis is required (260). Therefore, genes of interest that have been identified in microarray studies are frequently validated in FFPE samples using real-time qPCR

37

(261). However, real-time qPCR is too labour-intensive and expensive to realistically validate the expression of more than 10-20 genes.

Despite these problems, some studies have identified gene expression profiles that can predict Gleason grade (262) and more relevant clinical endpoints such as BCR, systemic progression and prostate cancer-specific death (263-265). Some of these studies found the identified molecular signatures to be superior to GS for the prediction of clinical outcome (263).

Although some of the results are promising, few of the identified gene panels have been thoroughly validated in independent studies. Additionally, the molecular signatures between studies have rarely overlapped (266). As a result, there are just a handful of gene panels that have been incorporated into commercially available tests and none have been routinely used in clinical practice.

The Prolaris and the Oncotype Dx are both biopsy tests used for risk stratification in men with newly diagnosed low/intermediate risk prostate cancer (267). Biopsy tests have high clinical applicability as they have the potential to be used at the time of diagnosis for the stratification of patients for either active surveillance or therapeutic interventions. These tests have been found to add meaningful prognostic information when used alongside standard clinicopathological variables at the time of treatment selection (268).

Prolaris® test The Prolaris® test (Myriad Genetics, Salt Lake City, UT, USA) is a 46 gene expression panel including 31 cell cycle progression (CCP) genes and 15 housekeeping genes, which is expressed as the CCP score. The CCP score has been found to be an independent predictor of BCR after RP and EBRT, and prostate cancer-specific mortality in conservatively managed patients (263, 269, 270). Prolaris testing may allow risk stratification beyond standard clinicopathological variables (263).

The Prolaris test can also be used with radical prostatectomy samples to select patients with a high risk of BCR who may benefit from adjunctive therapy (271).

38

Oncotype DX® test The Oncotype DX® test (Genomic Health Inc, Redwood City, CA, USA) is a 17 gene expression panel including 12 cancer-related genes representing four biological pathways (stromal response, androgen signalling, proliferation, cellular organization) and 5 reference genes, which are combined to form the Genomic Prostate Score (GPS). The GPS has been found to be an independent predictor of unfavourable pathology found at RP (primary Gleason grade 4 or any grade 5 and/or non-organ confined disease). The addition of the GPS to a pretreatment clinical model such as the CAPRA provides a more accurate prediction of the presence/absence of unfavourable pathology compared with clinical information alone (272, 273).

These tests are not in widespread use as they require further validation through prospective studies to identify the long-term impact of changes in clinical management based on test results (e.g. shifting patients from interventional to non-interventional treatment)(274). These tests are very expensive and the financial and practical implications of using them need to be investigated.

Additionally, although the aforementioned biopsy tests have shown predictive ability for men with low/intermediate risk disease (as defined by standard clinicopathological parameters), these tests are not suitable for men with high risk disease (268).

This study

Therefore this study focused on the identification of progression-related gene markers in prostate cancer biopsy samples that would be relevant to patients with a wide range of conventional risk parameters. Identified genes were investigated as to whether they had potential to be used alone or in combination to provide novel prognostic information about prostate cancer patients and ultimately prove useful to inform treatment decisions.

The TaqMan array study and its data analysis are described in detail in Chapter 2. The findings are described in relation to a pilot study by Larkin et al., from which the idea for this study originated (275).

39

TaqMan array study

2.1 Introduction

Background

The clinical behaviour of prostate cancer is highly variable and despite much research, the cause of this variability remains poorly understood. Many prostate cancers have a long natural history with a low risk of systemic progression or death from the disease during the first 15 years. The deaths that occur within this time are usually the result of competing causes. Conversely, some patients exhibit a more aggressive form of the disease, which metastasizes quickly and proves lethal within a short time frame (276). A major challenge that exists in the clinical management of prostate cancer is the differentiation of patients with aggressive disease from those with a slow-growing form of the disease. Clinicians rely heavily on the use of Gleason grading of biopsy samples and clinical staging for treatment decision-making. These tools are limited in their ability to accurately predict tumour behaviour, as a result of biopsy sampling error and inaccurate clinical staging due to the multifocal and heterogeneous nature of prostate cancer, and the limitations of the Gleason grading system (103-105). As a consequence, there remains a significant problem of overtreatment in men with relatively indolent disease, exposing these men to the unnecessary risks and side effects of treatment, in an attempt to avoid undertreating the minority of prostate cancer patients with aggressive disease (109).

Due to the limitations of clinical prognostic tools, the variability in disease behaviour and the risk of treatment-related impairment on quality of life, a more individualised approach to risk assessment in prostate cancer is necessary. Prognostic molecular assays offer the potential to provide a more individualised method to inform treatment decisions, such as whether to start treatment immediately or to manage conservatively, and whether adjunctive therapy is necessary.

The aim of this research was to identify molecular markers of prostate cancer progression and to investigate whether any of these markers could be used individually or in combination to predict the risk of disease recurrence after RP or radiotherapy. This 40 could potentially provide novel prognostic information to better inform treatment decisions in newly-diagnosed prostate cancer patients.

TaqMan array technology was chosen as it offered a medium throughput method which retained the sensitivity of standard real-time qPCR but allowed the analysis of multiple genes at once (277). This provided a much less time-consuming and labour-intensive method to collect expression data for a large number of genes (96 genes) in multiple samples. This technology was sensitive enough to allow the use of RNA extracted from FFPE biopsy samples (277). The use of FFPE samples allowed the gene expression data to be correlated with patient outcome over a long follow-up period. Gene expression data were correlated with clinical outcome (biochemical/clinical recurrence) and binary logistic regression (LR) analysis was used to identify gene models with the potential to categorise patients according to 5-year disease recurrence after treatment.

Previous pilot study

The study extended a previous ‘proof of concept’ study by Larkin et al., which analysed the expression of 91 putative prostate cancer progression genes in 29 radical prostatectomy samples using a similar TaqMan array to our study (275). Using LR analysis, they identified several genes that in certain combinations, could relatively accurately predict GS ≥7 or biochemical/clinical recurrence in their dataset. The gene models were ranked according to their overall percent accuracy (percentage of the total number of samples that were correctly predicted).

In their study the single gene models of HSPB1 and INMT could predict GS ≥7 vs <7 with an overall percent accuracy of 82.8% and 79.3% respectively. The models ABL1, ANPEP, CD9, EFNA1, GPM6A, ITGB4 and NELL2 could predict GS with an accuracy of 75.9%. PSCA could predict GS with an accuracy of 72.4%.

The multivariate gene models co-utilising the markers ANPEP/PSCA, ABL1/ANPEP and ANPEP/GPM6A correctly predicted 86.2%, 82.8% and 82.8% of cases according to the GS respectively.

ANPEP, INMT and TRIP13 as single gene models could correctly categorise recurrent/non-recurrent disease in 79.3%, 75.9% and 69% of patients. TRIP13

41 expression, in combination with the preoperative PSA level and GS, had the highest overall percent accuracy for the prediction of recurrence (85.7%)(275).

The current study - new research

The present study (BB study) sought to validate and extend the findings of the previous study (SL study) using a similar method of RNA extraction, cDNA synthesis and real-time qPCR analysis. However, a much larger set of prostate cancer samples (67 successfully analysed samples) was used to enhance the reliability of the findings. A limitation of the previous study, besides the small sample size, was the inclusion of RP samples only (275). This represented a possible bias towards disease with a lower risk of recurrence, as RP tends to be performed in organ-confined disease of low/intermediate GS (278). This limitation was overcome in the present study by expanding the range of recurrence risks through the inclusion of biopsy samples from both radiotherapy (RT) patients and RP patients. Importantly, the use of predominantly biopsy samples (where available) sought to increase the clinical utility of the study. Biopsies are often the only samples available at the time of treatment decision-making, so a biopsy-based prognostic biomarker would offer the potential to provide novel objective information to aid the clinician at the time of treatment selection. The differences in the sample set between the SL and the BB study are described in Table 2.1.

Table 2.1 Comparison of samples included in the SL and BB study

Study Primary treatment Sample type RP RT RP TURP Biopsy SL study 29 0 29 0 0 BB study 33 34 12 3 52

The BB study also included several important developments to both the practical and data analysis methods in comparison to the SL study (which was only a pilot study) to improve the validity of the findings. In particular, the use of a more statistically valid method for ranking the predictive models was included in the BB study. These changes are explained in detail in the method development Sections 2.3.3 and 2.3.4.

42

Aims

• To successfully measure the expression of a panel of putative prostate cancer progression genes in archival prostate biopsy samples from men with varying grades and stages of prostate cancer treated with RP and/or RT. • To correlate the gene expression data with clinical outcome and identify any genes with potential as markers of biochemical/clinical recurrence. • To thoroughly investigate whether any of the genes can be utilized, either individually or in combination, as logistic regression models for the prediction of biochemical/clinical recurrence within 5 years from treatment.

43

2.2 Materials and methods

Introduction/practical considerations

Choice of practical methodology Due to resource limitations for storing frozen tissue and the requirement to preserve tissue architecture for diagnosis, tissue samples in pathology departments are routinely formalin-fixed and paraffin-embedded before storage. In addition to their abundance, such archival samples possess well-documented patient data. They are ideal for retrospective molecular studies, which correlate gene expression data with clinical outcome. Therefore, archival prostate cancer samples offered a rich resource of samples for this study.

Studies have found that gene expression profiles using FFPE tissue show a high relative concordance with frozen tissue (279). However, the quality and yield of RNA extracted from FFPE tissue is typically lower than from frozen tissue (260). Due to the formalin fixation, nucleic acid is trapped and modified by the formation of protein-nucleic acid crosslinks, leading to a low yield of RNA, which is typically highly fragmented and chemically modified (279, 280). RNA degradation continues with the time that the samples are kept in storage at room temperature (279).

Due to these factors, gene expression quantification from FFPE samples can only be achieved using a highly sensitive technique such as fluorescence-based real-time qPCR analysis and the detection rate is lower (higher cycle threshold (Ct) values) in comparison to frozen tissue (281).

Several important measures were implemented in this study to overcome the compromised quantity and quality of RNA and to increase the chance of successful real- time qPCR. These measures are described as follows:

RNA extraction The RecoverAll™ Total Nucleic Acid Isolation Kit (Ambion, Austin, TX, USA) was used to extract the RNA from the FFPE samples. This kit was ideal for the recovery of fragmented RNA as it is designed to extract the maximal amount of RNA of all sizes using rapid glass- fibre filter technology. Importantly, this kit includes a protease digestion and heating 44 step to release RNA covalently bound to protein and reverse some of the chemical modifications. The Ambion RecoverAll kit performs highly in terms of the quantity and quality of RNA that can be extracted from FFPE samples, maximising the amount of amplifiable RNA available (282).

Reverse transcription and TaqMan array analysis It was also important to optimise the reverse transcription step as degraded and chemically modified RNA can impair the efficiency of this step. Random primers were used as they tend to generate the highest yield of cDNA (283). These prime at multiple points throughout the messenger RNA (mRNA) transcript rather than at the polyA tail only, as is the case with oligo-dT primers. They are therefore ideally suited to the RNA extracted from FFPE tissue, which is typically degraded and prone to loss of the polyA tails due to the modification of bases by formalin (280).

The TaqMan® array (Applied Biosystems, Carlsbad, CA, USA) was chosen for the study as it offered a medium-throughput method for fluorescence-based real-time qPCR analysis (277). TaqMan arrays offer a sensitive and reproducible method to study gene expression (277) and have been used for the validation of prognostic gene biomarkers in various cancers including prostate cancer (284). This technology requires a low cDNA input and so was ideal for the limited quantities of cDNA obtained from the FFPE tissue in this study, particularly from the minimally-sized biopsy samples (284).

For the real-time qPCR, short amplicons were chosen as they produce lower Ct values (improved sensitivity) than longer amplicons for the amplification of short cDNA templates (from degraded RNA)(285). The use of the internal reference gene hydroxymethylbilane synthase (HMBS) was fundamental to normalise the gene expression data as it corrected for differences in the total RNA quantity and quality between samples.

Design of the new TaqMan array For this study the TaqMan arrays were customised with 96 gene targets, which allowed up to 4 samples to be run per card (two ports per sample). Further background information on TaqMan arrays can be found in Appendix 3.

45

The design of the array was based on the array used in the pilot study (SL study)(275), which had been customised with 91 putative prostate cancer progression markers (Figure 2.1). The majority of the prostate cancer-related genes in the SL study had been identified as important prostate carcinogenesis genes in a study by Narayanan and Keedwell (286), which analysed published microarray data using artificial neural networks. Other genes had been identified through an extensive literature search or as a result of previous work in the department (287, 288).

The ABI manufacturing control 18S ribosomal RNA (18S) was also included on the array in addition to four other housekeeping genes that had been documented as showing stable expression in prostate cancer tissues between different stages and grades (289): hydroxymethylbilane synthase (HMBS), TATA-box binding protein (TBP), hypoxanthine phosphoribosyltransferase 1 (HPRT1) and succinate dehydrogenase complex flavoprotein subunit A (SDHA).

The new array (BB array)(Figure 2.2) included the same genes (including housekeeping genes) as the SL array except for the following changes:

Ten genes were removed, that had been found to show p-values of >0.5 in the independent samples t-tests (for the comparison of the GS ≥7 and <7 group and/or recurrent and non-recurrent group) in the SL study (275). As a p-value of >0.5 suggested that these genes were unlikely to be useful in LR models, they were excluded from the array to provide space for other genes of interest. The excluded genes were CDH2, CTNNA1, HSPA5, KCNN4, MMP9, RBM12, SFPQ, TGM2, HSP90B1 and WIF1.

The genes ERBB2, C3, HPX, MKI67, MT1X, MUC1, PLAU, PTEN and TF were selected for the BB array as replacements for these genes. On theoretical grounds, ERBB2, MK167 MUC1, PLAU and PTEN were highlighted as putative prostate cancer progression markers following a review of current literature.

For instance, MUC1 encodes a transmembrane glycoprotein, which has been strongly associated with clinicopathological parameters of aggressive disease and poor prognosis in several types of cancer including prostate cancer (290-292). As MUC1 is an anti- adhesion molecule, it has been proposed that it may enhance the invasive and metastatic potential of cancer cells (292), for example by reducing cadherin-1 and

46 integrin-mediated cell adhesion (293, 294). In prostate cancer, studies have found MUC1 mRNA and protein overexpression to correlate with aggressive clinicopathological features and poor prognosis (290, 291).

The remainder of the novel genes on the new array (C3, HPX, MT1X and TF) were highlighted as genes of interest based on previous work in our department. For example, hemopexin (HPX) was identified as a putative marker of high grade prostate cancer (GS ≥7) in a serum proteome study (295). A further tissue microarray study showed significant differences in hemopexin expression across benign, low grade (<7), high grade (≥7) and metastatic tissue (295). Hemopexin is a serum glycoprotein which transports heme from the bloodstream to the liver for metabolism and iron recovery (296). Despite the fact that the role of hemopexin in cancer is not fully elucidated, numerous proteomic profiling studies using serum/plasma samples or other body fluids have identified hemopexin as being significantly up- or downregulated in various types of cancer including prostate cancer (297, 298).

The full names of each gene incorporated on the SL and BB array can be found in Appendix 2, where they have been grouped according to their putative roles in cancer initiation/progression.

The proportion of the total number of genes on the BB array within each category are illustrated in Figure 2.3.

47

Samples Port

ABL1 ALCAM ALPG AMACR ANPEP ANXA2 APOBEC3G ASB2 ATP10B BRCA2 18S ATP5M PL CANX CD109 CD4 CD44 CD9 CDC42EP3 CDH1 CDH2* CDKN1C CHM CRHR1 CTNNA1* A 1 CYP11A1 DPT EDNRA EEF1A1 EFNA1 EFNA2 EFNA3 EFNA4 EFNA5 ERG EZH2 F11R FABP5 FADS1 GPM 6A GPR12 GPR19 GSTM 3 GSTM 5 GUCY1A1 HM BS HOXC6 HPN HPRT1 B 1 HSPA5* HSPB1 INM T IRF7 ITGA2 ITGA3 ITGA5 ITGB1 ITGB4 ITGB6 KCNN4* KLHL5 M M P2 M M P9* M NX1 MYC* NDUFAF1 NELL2 PNP NUP50 PCDH18 PGC PLCL2 PLLP C 2 POP5 PRIM 2 PSCA PTGFR PVALB PVT1 RAP1GAP RBM 12* RHBDL1 SDHA SERPINB5 SFPQ* SLC44A2 PRELID3A TBP TBRG4 TERT TGFB2 TGM 2* HSP90B1* TRAFD1 TRIP13 VASH1 WIF1* D

ABL1 ALCAM ALPG AMACR ANPEP ANXA2 APOBEC3G ASB2 ATP10B BRCA2 18S ATP5M PL CANX CD109 CD4 CD44 CD9 CDC42EP3 CDH1 CDH2 * CDKN1C CHM CRHR1 CTNNA1 * E 3 CYP11A1 DPT EDNRA EEF1A1 EFNA1 EFNA2 EFNA3 EFNA4 EFNA5 ERG EZH2 F11R FABP5 FADS1 GPM 6A GPR12 GPR19 GSTM 3 GSTM 5 GUCY1A1 HM BS HOXC6 HPN HPRT1 F 2 HSPA5 * HSPB1 INM T IRF7 ITGA2 ITGA3 ITGA5 ITGB1 ITGB4 ITGB6 KCNN4 * KLHL5 M M P2 M M P9 * M NX1 M YC * NDUFAF1 NELL2 PNP NUP50 PCDH18 PGC PLCL2 PLLP G 4 POP5 PRIM 2 PSCA PTGFR PVALB PVT1 RAP1GAP RBM 12 * RHBDL1 SDHA SERPINB5 SFPQ * SLC44A2 PRELID3A TBP TBRG4 TERT TGFB2 TGM 2 * TRA1 * TRAFD1 TRIP13 VASH1 WIF1 * H

ABL1 ALCAM ALPG AMACR ANPEP ANXA2 APOBEC3G ASB2 ATP10B BRCA2 18S ATP5M PL CANX CD109 CD4 CD44 CD9 CDC42EP3 CDH1 CDH2 * CDKN1C CHM CRHR1 CTNNA1 * I 5 CYP11A1 DPT EDNRA EEF1A1 EFNA1 EFNA2 EFNA3 EFNA4 EFNA5 ERG EZH2 F11R FABP5 FADS1 GPM 6A GPR12 GPR19 GSTM 3 GSTM 5 GUCY1A1 HM BS HOXC6 HPN HPRT1 J 3 HSPA5 * HSPB1 INM T IRF7 ITGA2 ITGA3 ITGA5 ITGB1 ITGB4 ITGB6 KCNN4 * KLHL5 M M P2 M M P9 * M NX1 M YC * NDUFAF1 NELL2 PNP NUP50 PCDH18 PGC PLCL2 PLLP K 6 POP5 PRIM 2 PSCA PTGFR PVALB PVT1 RAP1GAP RBM 12 * RHBDL1 SDHA SERPINB5 SFPQ * SLC44A2 PRELID3A TBP TBRG4 TERT TGFB2 TGM 2 * TRA1 * TRAFD1 TRIP13 VASH1 WIF1 * L

ABL1 ALCAM ALPG AMACR ANPEP ANXA2 APOBEC3G ASB2 ATP10B BRCA2 18S ATP5M PL CANX CD109 CD4 CD44 CD9 CDC42EP3 CDH1 CDH2 * CDKN1C CHM CRHR1 CTNNA1 * M 7 CYP11A1 DPT EDNRA EEF1A1 EFNA1 EFNA2 EFNA3 EFNA4 EFNA5 ERG EZH2 F11R FABP5 FADS1 GPM 6A GPR12 GPR19 GSTM 3 GSTM 5 GUCY1A1 HM BS HOXC6 HPN HPRT1 N 4 HSPA5 * HSPB1 INM T IRF7 ITGA2 ITGA3 ITGA5 ITGB1 ITGB4 ITGB6 KCNN4 * KLHL5 M M P2 M M P9 * M NX1 M YC * NDUFAF1 NELL2 PNP NUP50 PCDH18 PGC PLCL2 PLLP O 8 POP5 PRIM 2 PSCA PTGFR PVALB PVT1 RAP1GAP RBM 12 * RHBDL1 SDHA SERPINB5 SFPQ * SLC44A2 PRELID3A TBP TBRG4 TERT TGFB2 TGM 2 * TRA1 * TRAFD1 TRIP13 VASH1 WIF1 * P 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24

Oncogenes Tumour suppressor genes Apoptosis related Angiogenic Invasion and metastasis related Metabolic Differentiation and proliferation Detox Others Housekeeping genes * Genes included on SL array only Figure 2.1 Genes included on the SL array and their putative roles in cancer initiation/progression 48

Samples Port

ABL1 ALCAM ALPG AMACR ANPEP ANXA2 APOBEC3G ASB2 ATP10B BRCA2 18S ATP5M PL C3* CANX CD109 CD4 CD44 CD9 CDC42EP3 CDH1 CDKN1C CHM CRHR1 CYP11A1 A 1 DPT EDNRA EEF1A1 EFNA1 EFNA2 EFNA3 EFNA4 EFNA5 ERBB2* ERG EZH2 F11R FABP5 FADS1 GPM 6A GPR12 GPR19 GSTM 3 GSTM 5 GUCY1A1 HM BS HOXC6 HPN HPRT1 B 1 HPX* HSPB1 INM T IRF7 ITGA2 ITGA3 ITGA5 ITGB1 ITGB4 ITGB4 ITGB6 ITGB6 KLHL5 M KI67* M M P2 M NX1 M T1X* M UC1* NDUFAF1 NELL2 PNP NUP50 PCDH18 PGC C 2 PLAU* PLCL2 PLLP POP5 PRIM 2 PSCA PTEN* PTGFR PVALB PVT1 RAP1GAP RHBDL1 SDHA SERPINB5 SLC44A2 PRELID3A TBP TBRG4 TERT TF* TGFB2 TRAFD1 TRIP13 VASH1 D

ABL1 ALCAM ALPG AMACR ANPEP ANXA2 APOBEC3G ASB2 ATP10B BRCA2 18S ATP5M PL C3 CANX CD109 CD4 CD44 CD9 CDC42EP3 CDH1 CDKN1C CHM CRHR1 CYP11A1 E 3 DPT EDNRA EEF1A1 EFNA1 EFNA2 EFNA3 EFNA4 EFNA5 ERBB2 ERG EZH2 F11R FABP5 FADS1 GPM 6A GPR12 GPR19 GSTM 3 GSTM 5 GUCY1A1 HM BS HOXC6 HPN HPRT1 F 2 HPX HSPB1 INM T IRF7 ITGA2 ITGA3 ITGA5 ITGB1 ITGB4 ITGB4 ITGB6 ITGB6 KLHL5 M KI67 M M P2 M NX1 M T1X M UC1 NDUFAF1 NELL2 PNP NUP50 PCDH18 PGC G 4 PLAU PLCL2 PLLP POP5 PRIM 2 PSCA PTEN PTGFR PVALB PVT1 RAP1GAP RHBDL1 SDHA SERPINB5 SLC44A2 PRELID3A TBP TBRG4 TERT TF TGFB2 TRAFD1 TRIP13 VASH1 H

ABL1 ALCAM ALPG AMACR ANPEP ANXA2 APOBEC3G ASB2 ATP10B BRCA2 18S ATP5M PL C3 CANX CD109 CD4 CD44 CD9 CDC42EP3 CDH1 CDKN1C CHM CRHR1 CYP11A1 I 5 DPT EDNRA EEF1A1 EFNA1 EFNA2 EFNA3 EFNA4 EFNA5 ERBB2 ERG EZH2 F11R FABP5 FADS1 GPM 6A GPR12 GPR19 GSTM 3 GSTM 5 GUCY1A1 HM BS HOXC6 HPN HPRT1 J 3 HPX HSPB1 INM T IRF7 ITGA2 ITGA3 ITGA5 ITGB1 ITGB4 ITGB4 ITGB6 ITGB6 KLHL5 M KI67 M M P2 M NX1 M T1X M UC1 NDUFAF1 NELL2 PNP NUP50 PCDH18 PGC K 6 PLAU PLCL2 PLLP POP5 PRIM 2 PSCA PTEN PTGFR PVALB PVT1 RAP1GAP RHBDL1 SDHA SERPINB5 SLC44A2 PRELID3A TBP TBRG4 TERT TF TGFB2 TRAFD1 TRIP13 VASH1 L

ABL1 ALCAM ALPG AMACR ANPEP ANXA2 APOBEC3G ASB2 ATP10B BRCA2 18S ATP5M PL C3 CANX CD109 CD4 CD44 CD9 CDC42EP3 CDH1 CDKN1C CHM CRHR1 CYP11A1 M 7 DPT EDNRA EEF1A1 EFNA1 EFNA2 EFNA3 EFNA4 EFNA5 ERBB2 ERG EZH2 F11R FABP5 FADS1 GPM 6A GPR12 GPR19 GSTM 3 GSTM 5 GUCY1A1 HM BS HOXC6 HPN HPRT1 N 4 HPX HSPB1 INM T IRF7 ITGA2 ITGA3 ITGA5 ITGB1 ITGB4 ITGB4 ITGB6 ITGB6 KLHL5 M KI67 M M P2 M NX1 M T1X M UC1 NDUFAF1 NELL2 PNP NUP50 PCDH18 PGC O 8 PLAU PLCL2 PLLP POP5 PRIM 2 PSCA PTEN PTGFR PVALB PVT1 RAP1GAP RHBDL1 SDHA SERPINB5 SLC44A2 PRELID3A TBP TBRG4 TERT TF TGFB2 TRAFD1 TRIP13 VASH1 P 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24

Oncogenes Tumour suppressor genes Apoptosis related Angiogenic Invasion and metastasis related Metabolic Differentiation and proliferation Detox Others Housekeeping genes * Genes included on BB array only Figure 2.2 Genes included on the BB array and their putative roles in cancer initiation/progression 49

Figure 2.3 Genes on the BB array, categorised according to their roles in cancer initiation/progression

Choice of data analysis In this study, two different relative quantification methods were used to convert TaqMan array data (Ct values) to normalised gene expression ratios: the 2-∆Ct method and the Pfaffl method (275). The 2-∆Ct method is relative quantification in its simplest form and is described alongside a description of real-time qPCR data analysis in Appendix 3.

The Pfaffl method calculates a target gene expression ratio as a fold increase relative to a control sample and is normalised to the expression of a reference gene (299). This method accounts for the fact that the target and reference genes may not exhibit the same amplification efficiencies. The equation for the Pfaffl method is as follows:

∆퐶푡,푡푎푟푔푒푡(푐표푛푡푟표푙−푡푒푠푡) 퐸푡푎푟푔푒푡 푅푎푡𝑖표 = ∆퐶푡,푟푒푓(푐표푛푡푟표푙−푡푒푠푡) 퐸푟푒푓

Etarget and Eref are the real-time PCR amplification efficiencies of the target and reference gene transcripts.

∆Cttarget is the Ct of the target gene in the control sample minus the Ct of the target gene in the test sample.

ΔCtref is the Ct of the reference gene in the control sample minus the Ct of the reference gene in the test sample. 50

The housekeeping gene HMBS was used as the reference gene for normalisation as it has been found to demonstrate extremely low variation between grades and stages of prostate cancer (289, 300). The use of HMBS as a suitable reference gene for this study was confirmed by demonstrating its stability between samples (ΔCt method), and its low variability between the GS ≥7 and <7 group and between the recurrent and non- recurrent group (Appendix 4).

Binary logistic regression (LR) was selected as an appropriate statistical method for creation of the predictve models, as the outcome (dependent) variable consisted of only two categories (recurrent and non-recurrent). LR was used to model the probability of the outcome (being in either the recurrent or non-recurrent group) based on the expression of particular genes (independent/predictor variables).

In comparison to linear regression, LR does not require a linear relationship between the independent variables and the dependent variable. This is fundamental, as the use of a categorical dependent variable violates the assumption of linearity found in ordinary regression. LR uses a log transformation of the outcome (dependent variable) to allow the non-linear relationship between the independent variables and dependent variable to be modelled in a linear manner. This was ideal as it allowed an understanding of the effect of several predictor variables (genes) on the single dichotomous dependent variable (i.e. how expression of a set of genes affects membership in the recurrent or non-recurrent categories). Importantly, LR controls for any confounding between predictor variables, which would not be possible if the predictor variables were analysed separately (301). Furthermore, LR does not require that the independent variables or dependent variable be normally distributed or have equal variance between groups.

A more detailed description of LR analysis and Receiver Operating Characteristic (ROC) curves, which were used to test the discriminatory accuracy of the identified predictive models, can be found in Appendix 5. This includes a detailed description of the important parameters found in the LR output from SPSS (statistical package used in this study).

51

Sample size and power calculations

It was essential to ensure that the study included enough data to reliably estimate the parameters for the logistic regression equation in a multivariate analysis, to reduce the risk of creating unreliable models artificially produced by the data. Sample size calculation for LR, which ensures an acceptable level of statistical power, tends to be based on the rule of thumb of 10 events per variable (EPV) where the events describe the number of samples in the minority group (302). However, more recent studies have challenged this rule and suggest that the use of 5-9 EPV is sufficient (303).

Due to limitations in the number of samples that could be accessed and processed, the sample size of this study was small. Additionally, despite processing 105 prostate cancer samples, the final number of samples successfully analysed was only 67. (The reasons behind the exclusion of 38 samples is explained in detail in Section 2.3.2). However, as the number of events (number of patients in the recurrent group) was 21, up to 3 predictor variables could be used (7 EPV). Therefore, single, two-gene and three-gene LR models could be analysed for the categorisation of patients according to recurrence.

The EPV for the single, two- and three-gene models are illustrated in Table 2.2.

Table 2.2 The events per variable for the single, two-gene and three-gene models

Number of positive Number of variables Events per variable events in model (EPV) 21 1 21 21 2 11 21 3 7

A secondary LR analysis was considered for the comparison of late recurrence (between year 5 and 10) and early recurrence (between year 0 and 5). However, complete 10-year outcome data was only available for some of the patients. Importantly, only 6 of these patients could be classified as suffering late recurrence. The low number of events precluded the use of late/early recurrence as the dependent variable in multivariate LR analysis.

Prostate cancer-specific death was also considered as an endpoint for LR analysis. Due to the usually protracted disease progression in prostate cancer, there were few prostate cancer-specific deaths within follow up. More specifically, no prostate cancer-

52 specific deaths occurred within 5 years and only 4 within 10 years. The use of prostate cancer-specific death as an endpoint would have required a much longer follow-up period and a far greater sample size than was possible for this study.

Ethical permissions

The study used archival prostate cancer tissue samples that had been formalin-fixed and paraffin-embedded for routine clinical analysis and storage within the Department of Pathology, Queen Alexandra Hospital (QAH), Portsmouth. At the time of the experimental work, ethical consent for the use of the samples in research was already in place as part of the protocols associated with the Portsmouth Molecular Pathology Tissue Bank (07/MRE08/2). The tissue bank was headed by Professor Ian Cree, the director of the Translational Oncology Research Centre (TORC), QAH, where the practical work was completed in its entirety. Due to ethical permissions lapsing in 2012 with the closure of the tissue bank, new ethical permissions were gained in 2016 so that updated patient outcome data could be accessed for the final data analysis and write- up (see Section 2.3.4.3 for further details).

Practical methods

Sample sourcing and clinical information FFPE sample blocks from patients with varying grades and stages of prostate cancer were collected from storage with their corresponding haematoxylin and eosin labelled slides. The samples consisted of predominantly biopsies but RP and TURP samples were used if a biopsy was not available.

Patient data including pathology results, dates/types of treatment and outcome data (including both biochemical and clinical parameters) were sourced by accessing the databases at the QAH, including the Apex Pathology System and the Graphnet system (Electronic Patient Records). Where insufficient data was found by these methods, patients’ notes were investigated by a research nurse, clinical trials assistant or pathologist.

Patients were categorised as having recurrent disease based on information pertaining to biochemical recurrence and/or clinical recurrence (such as the presence of novel

53 metastases). The presence of BCR in patients who had undergone RP was defined by a serum PSA of ≥0.2ng/ml followed by a subsequent confirmatory PSA of >0.2ng/ml, as suggested by the American Urological Association (80). For patients who had received RT, BCR was defined by a PSA level greater than or equal to the post-treatment nadir plus 2ng/ml, according to the American Society for Therapeutic Radiology and Oncology (304).

Identification and extraction of areas of interest A histopathologist examined the haematoxylin and eosin labelled slides from sections of the sample blocks. Cancerous areas containing the purest tumour, i.e. those with the smallest amount of stroma and inflammation were marked and graded (according to the Gleason system). To ensure reproducibility, the same histopathologist was responsible for marking all the slides.

Two 0.6mm tissue cores (if possible) were punched from the areas of highest Gleason grade on the corresponding sample block using a manual tissue arrayer (Beecher Instruments, Inc., Sun Prairie, WI, USA) and the tissue punches placed in a 1.5ml microcentrifuge tube. Care was taken to avoid cross-contamination between samples by rubbing gloved hands with 70% industrial methylated spirit and cleaning arrayer punches with DNA Zap™ (Applied Biosystems, Carlsbad, CA, USA; AM9890) between samples.

RNA extraction RNA was extracted from the tissue punches using the RecoverAll™ Total Nucleic Acid Isolation Kit (Ambion, Austin, TX, USA; AM1975) with a modified protocol as follows:

Deparaffinization Tubes containing the tissue punches were prewarmed in a heat block at 70⁰C for 15-20 mins followed by the addition of 1ml of 100% xylene to each tube (prewarmed to 50⁰C for 15-20 mins). Samples were incubated at 50⁰C for 3 mins to melt the paraffin followed by centrifugation at 13,000g for 2 mins. Xylene was carefully removed and discarded using a fine-tipped Pasteur pastette, with care taken to avoid disruption to the tissue pellet. The pellet was washed twice to remove residual xylene, with each wash using 1ml of 100% ethanol that was subsequently removed with a pastette after

54 centrifugation as before. After the second wash, the pellet was air dried at room temperature (10-15 mins) with care taken to avoid overdrying the pellet.

Protease digestion 200µl of digestion buffer followed by 4µl of protease solution (both supplied with the kit) were added to each sample. After the tissue pellet was brought down into the solution by pulse centrifugation, the sample was incubated in a heat block at 50⁰C for 15 mins followed by 15 mins at 80⁰C to break protein cross-links and reverse chemical modifications of the RNA.

After the protease digestion, 240µl of isolation additive was added. The lysate was either processed immediately or stored at -20⁰C awaiting the nucleic acid isolation step.

Nucleic acid isolation 550µl of 100% ethanol was added to the lysate and 700µl of ethanol/lysate mixture was loaded onto the filter cartridge, which was held in the collection tube. After centrifugation at 10,000g for 60s and once the flow-through had been discarded, the remaining lysate/ethanol sample was loaded onto the filter and the process repeated. 700µl of wash 1 was then added to the filter cartridge and the tubes centrifuged at 10,000g for 30s with the flow-through discarded. This was followed by the addition of 500µl of wash 2/3 and an identical centrifugation at 10,000g for 30s. A further centrifugation of the filter cartridge tube assembly for 30s removed the residual wash from the filter cartridge.

To remove contaminating DNA from the sample, 60µl of a DNase solution (6µl 10x buffer, 4µl DNase (included in kit) and 50µl nuclease-free water (Promega, Madison, WI, USA; P1193)) was carefully applied to the centre of each filter cartridge. After incubation at room temperature for 30 mins, the samples were washed with 700µl of wash 1 as before except that after the wash was added, the tubes were left at room temperature for 60s before centrifugation. This was followed by two identical washes with wash 2/3 as previously described. The tube assembly was spun for a further 60s to remove the residual wash 2/3. Finally, the filter cartridge was transferred to a new collection tube.

55

Elution of RNA For elution of the RNA, 110µl of nuclease-free water (prewarmed to 95⁰C)(Promega) was carefully added to the centre of each filter cartridge. After incubation at room temperature for 60s, the samples were centrifuged at 10,000g for 60s. The eluate was carefully reapplied to the filter cartridge and the process repeated. The filter cartridge was discarded and the eluate containing the RNA was measured for quantity and purity using a NanoDrop® ND-1000 spectrophotometer (Thermo Scientific, Waltham, MA, USA) as described below.

In instances where RNA was not immediately converted to cDNA, RNA samples were stored at -80⁰C to prevent RNA degradation.

Assessment of RNA concentration and purity The concentration and purity of the RNA samples were measured using a Nanodrop® ND-1000 spectrophotometer (Thermo Scientific) according to the manufacturer’s instructions. 1.3µl of the sample was applied to the sample retention system after zeroing the instrument using nuclease-free water (Promega).

Reverse transcription of RNA to cDNA RNA was converted to cDNA using the High-Capacity cDNA Reverse Transcription Kit (Applied Biosystems; 4368814). Master mixes were prepared to a 2X concentration using stock reagents supplied with the kit and nuclease-free water (Promega) to make a final volume of 75µl and 15µl per sample for the reverse transcription positive (Table 2.3) and reverse transcription negative (Table 2.4) master mix respectively.

Appropriate volumes of each master mix were aliquoted for each sample and an equal quantity of diluted sample RNA was added to each master mix to make a 1X final concentration. The reverse transcription negative master mix was used to check for genomic DNA contamination in the RNA sample and did not contain any reverse transcriptase enzyme.

56

Table 2.3 Reagent volumes required for reverse transcription positive reaction

Reagent Volume (µl) Promega nuclease-free water 31.5 10X reverse transcription buffer 15 25X deoxyribonucleotide triphosphates (dNTPs) 6 10X random primers 15 Multiscribe™ reverse transcriptase (50U/µl) 7.5 Total per 2X Master Mix reaction 75 Add diluted sample 75

Table 2.4 Reagent volumes required for reverse transcription negative reaction

Reagent Volume (µl) Promega nuclease-free water 7.8 10X reverse transcription buffer 3 25X dNTPs 1.2 10X random primers 3 Multiscribe™ reverse transcriptase (50U/µl) 0 Total per 2X Master Mix reaction 15 Add diluted sample 15

After mixing carefully, the tubes were placed in a cooling block until they were loaded onto a peqSTAR 96 Universal thermocycler machine (PEQLAB Ltd., Erlangen, Germany). The thermocycler conditions for the reverse transcription reaction consisted of: step 1 (10 mins at 25⁰C) and step 2 (120 mins at 37⁰C). The cDNA was stored at -20⁰C until use.

Preamplification of cDNA cDNA was preamplified prior to its use on the BB array. Each gene that was present on the array (apart from 18S) underwent the preamplification procedure. Therefore, TaqMan® Gene Expression Assays (Applied Biosystems; 4331182) were required for 95 genes.

Pooling Taqman assays Equal volumes of each 20X TaqMan® Gene Expression Assay (Applied Biosystems) were pooled. The pooled assays were then diluted with 1X Tris-EDTA (TE) buffer (Promega; V6231) for a 0.2X final concentration per assay. The gene expression assay mix was stored at -20⁰C until use.

Preamplification PCR Preamplification reactions were prepared for each sample. Each reaction contained 25µl of TaqMan® PreAmp Master Mix (2X)(Applied Biosystems; 4391128), 12.5µl of pooled assay mix (0.2X, each assay) and 12.5µl of sample cDNA.

57

Preamplification reactions were run on the peqSTAR 96 Universal thermocycler machine (PEQLAB Ltd.) with the following cycling conditions: 95⁰C 10 mins, (14 cycles of: 95⁰C 15s, 60⁰C 4 mins). Preamplification products were either kept on ice for immediate use or stored at -20⁰C until use.

Running the TaqMan array card Design of the TaqMan® array The BB Taqman array cards (Applied Biosystems; 4342259) were customised with 96 TaqMan® Gene Expression Assays (Figure 2.2). Assay numbers are listed in Appendix 2.

Preparation of the TaqMan® array The 7900HT Fast Real-Time PCR machine (Applied Biosystems) was switched on at least 30 mins prior to use, to enable the laser to warm up. During this time, the array cards were brought to room temperature.

100µl of diluted preamplified cDNA was added to 100µl of Taqman® Universal PCR Master Mix (X2)(Applied Biosystems; 4304437). The solution was added to two ports of the array by pipetting 100µl into each port. The sample was introduced via the larger of the two sample holes of the port, with care taken to avoid introducing air bubbles. The plate was centrifuged twice using a Sorvall® Legend T centrifuge (Thermo Scientific) at 331g for 1 min each time. The plate was sealed with an array sealer (Applied Biosystems) and the loading port strip cut off.

The plates were run with identical cycling parameters each time which were: 50⁰C 2 mins, 94.5⁰C 10 mins, (40 cycles of: 97⁰C 30s, 59.7⁰C 1 min) according to the manufacturer’s instructions.

N.B. The 48 gene arrays used for testing the control DNA and for use in the method development are shown in Appendix 6. These plates were run with 8 samples per card as each port led to all 48 genes.

Preparation and testing of control cDNA Control cDNA for use in the standard curve and replicate experiments consisted of pooled cDNA procured from FFPE tissue samples of a variety of different cancers

58 including prostate cancer, using a similar extraction and reverse transcription procedure as described above. However, 8-10 tissue sections from the control samples were used rather than tissue punches to maximise the cDNA yield. An equal volume of each cDNA sample was used to produce the pooled cDNA.

Control cDNA underwent batch testing which consisted of three runs using a 48 gene array (Appendix 6). The first two arrays were run by different operators on different days using 300ng/µl of control cDNA (diluted with nuclease-free water) in five ports with a no-template control (nuclease-free water) in the final three ports. Batch testing was required to ensure that the control cDNA contained sufficient and uniform quantities of the genes on the array and to ensure high reproducibility between control replicates. A final array was run in a similar format but using reverse transcription negative control cDNA to ensure there was no genomic DNA contamination (N.B. the control cDNA was not preamplified at this point).

Standard curve and replicate plates Sample and control cDNA were diluted in 1X TE buffer (Promega) before use in the prostate TaqMan array card (BB array).

A standard curve plate was run before the sample plates and consisted of 1:20, 1:100 and 1:200 dilutions of preamplified control cDNA. A replicate plate was run before and after the sample plates, which consisted of a 1:20 dilution of preamplified control cDNA run in triplicate.

The sample plates were run using a 1:20 dilution of preamplified sample cDNA with a total of four samples run per card. The raw data for each plate (Ct values) were saved as an SDS file.

Data analysis methods

The amplification plots for each gene were initially viewed in SDS software (version 2.4.1; Applied Biosystems). As the optimal baseline and threshold values varied significantly between the genes, these values were set individually for each gene with the settings held constant for every plate (see Section 2.3.4.4).

59

Setting the baseline and threshold values SDS files from the replicate and sample plates were initially converted from AQ to RQ data files using the SDS software, so they could be imported into ExpressionSuite software (version 1.1; Thermo Fisher Scientific), which allowed multi-plate analysis of the amplification plots for each gene. Multi-plate analysis was necessary to optimise the baseline and threshold settings for each gene, thereby ensuring accurate Ct values.

The baseline was determined by viewing the data for each gene in the linear scale amplification plot and if the amplification curves were not flat at zero, adjusting the baseline (from the default setting of 3-15 cycles) so that the baseline stop value was approximately 1-2 cycles before the earliest amplification (Figure 2.4).

The threshold value for each gene was also manually set (using the logarithmic (log) scale amplification plot) at a value significantly above the baseline but within the exponential phase of the amplification plot (Figure 2.5). The precision of the assay was also demonstrated to be high at the threshold value, as evidenced by the merging together of the replicate amplification curves (Figure 2.6).

The data for the standard curve were analysed using the chosen baseline and threshold values for each gene in the SDS software as an AQ file.

Replicate and sample plates were analysed with these baseline and threshold values but using the ExpressionSuite software, as the chosen settings only needed to be imported once for all plates.

After the software analysed the data from each plate with the chosen settings, the Ct values were exported and analysed in Excel before statistical analysis with SPSS (version 24; IBM, Portsmouth, UK).

60

Figure 2.4 Amplification plots for the housekeeping gene HMBS (linear view): a) before baseline setting (Rn vs cycle number) b) after correctly setting baseline at 3-19 cycles (ΔRn vs cycle number)

Figure 2.5 Amplification plots for HMBS (log view) showing threshold value at 0.426

61

Figure 2.6 Amplification plots for HMBS (log view) showing high precision of replicates

Standard curve and replicate plate analysis Before Ct values could be used for quantification, it was essential to confirm that the qPCR assay was optimised using information from the standard curve and replicate plates.

Initially, the standard curve data for each gene was plotted as the Ct value vs the log of each standard dilution (Figure 2.7). As the standard dilutions were expressed as dilution factors (1:20, 1:100 and 1:200) rather than concentrations, they were given proportional values (50, 10 and 5) to enable a plot to be generated.

Figure 2.7 Standard curve plot for HMBS

Standard curve data was collated in Excel and included the parameters calculated automatically by the SDS software: the coefficient of determination (R2), the slope and the y intercept. The R2 was used to measure the fit of the data to the regression line (as

62 a measure of linearity). The slope of the regression line was used to assess the percentage amplification efficiency (%E) of each gene using the following formula:

퐴푚푝푙𝑖푓𝑖푐푎푡𝑖표푛 푒푓푓𝑖푐𝑖푒푛푐푦 (퐸) = 10−1/푠푙표푝푒

%퐸 = (퐸 − 1) 푥 100%.

The precision and reproducibility of the assay were determined from the results of the replicate plates by measuring intra-assay (within array) and inter-assay (between arrays) variation. As Ct values cannot be used directly to calculate variation (as log units), it was necessary to initially convert Ct values for the replicates to actual quantities using the equation of the linear regression line from the standard curve.

Calculation of actual quantities using the linear regression equation (y = mx + c) y = Ct value m = slope x = quantity c = y intercept

퐶푡 = 푠푙표푝푒 (푙표푔 푞푢푎푛푡𝑖푡푦) + 푦 𝑖푛푡푒푟푐푒푝푡 퐶푡−푦 𝑖푛푡푒푟푐푒푝푡 ( ) 푄푢푎푛푡𝑖푡푦 = 10 푠푙표푝푒

The calculation of actual quantities allowed intra- and inter-assay variation to be assessed for all the reference genes, as there was no requirement to normalise to a reference gene. It is important to remember that, as dilution factors were used rather than concentrations for the standard curve, these values were proportional quantities rather than absolute quantities.

To assess intra-assay variation for each gene, the mean and standard deviation between the triplicate samples within each plate were calculated. The percentage coefficient of variation (%CV) was then determined using the following formula:

%퐶푉 = (푠푡푎푛푑푎푟푑 푑푒푣𝑖푎푡𝑖표푛/푚푒푎푛) ∗ 100

Likewise, inter-assay variation was determined as the %CV between the two replicate plates.

63

Relative quantification of sample data Before data quantification of the samples, any undetermined genes were given a Ct value of 40 to make statistical analysis easier. Next, any cDNA samples which showed low expression of the housekeeping gene HMBS (Ct ≥35) were excluded from further analysis as the samples were not deemed to be of sufficient cDNA quality (305).

Ct values were converted to normalised gene expression ratios by two different relative quantification methods: the Pfaffl method (299) and the 2-∆Ct method, using HMBS as the reference gene. The Pfaffl equation (Section 2.2.1.3) used amplification efficiencies of the target and reference genes which were taken from the standard curve experiments. The Ct values of the target and reference genes for the control sample were obtained from the replicate plates (the mean of the 6 replicates).

Both datasets were natural logarithm (ln) transformed to normalise the data. Statistical analysis was performed using SPSS (version 24; IBM, Portsmouth, UK).

Before analysis in SPSS, patients were assigned a binary score according to 5-year disease recurrence (biochemical and/or clinical). Those who did not experience recurrence were assigned to group “0” and those who experienced recurrence were assigned to group “1”. This time period was measured from the date of RP (in surgically managed patients) or the start of RT treatment (if no surgery given).

SPSS analysis - Independent samples t-tests Initially, independent samples t-tests were performed on both Pfaffl and 2-ΔCt datasets to determine any differences in the mean expression of each gene between the recurrent and non-recurrent group.

Only genes with a p-value <0.5 in both datasets were analysed further. Only the Pfaffl dataset was used for the subsequent LR analysis for reasons explained in Section 2.3.4.6.

Both the GS and the pretreatment PSA value were also checked for any differences between the recurrent and non-recurrent group and taken forward into LR analysis if a p-value of <0.5 was found. The pretreatment PSA was analysed with an independent samples t-test as above, but the GS was analysed with a Mann-Whitney U test as it was ordinal data.

64

SPSS analysis - Correlation analysis Prior to LR analysis, it was necessary to assess any correlation between the genes using a Pearson correlation analysis (also using SPSS). Genes which correlated with each other (p<0.05) were not put together in LR models. This step was taken to avoid the problem of multicollinearity, the presence of two or more independent variables (predictors) which are not just correlated with the dependent variable but are also closely correlated to each other. Multicollinearity can confound the results of LR by causing unreliable estimates of the regression coefficients, which may lead to incorrect assumptions about the relationship between predictors and the dependent variable (306).

SPSS analysis - Binary logistic regression analysis Binary LR was performed univariately for the genes selected from the independent samples t-tests. These genes were combined in two and three gene combinations, using only non-correlating genes together in each model. The number of possible gene combinations for the models was prohibitively large for manual ‘point and click’ processing in SPSS. An automated approach was developed using a Visual Basic for Applications (VBA) script in Excel (written by Mr Andrew Bickers) to produce SPSS syntax scripts that would perform all the univariate and multivariate LRs that were required without user intervention. The script also formatted the outputs to produce only tables containing the required parameters. These outputs were combined in Excel to produce a single spreadsheet containing the necessary information, with one row representing each gene combination. The spreadsheet could be used to rank the gene models according to any of the parameters. This provided a simple method to rank the large numbers of gene models by the chosen parameter of overall performance, the Likelihood Ratio Test statistic (ΔG2/χ2). In the method development, it also provided a method by which the gene models could be ordered by the overall percent accuracy (the previously used ranking parameter in the SL study and for the first submission of the BB study). The validity of using the overall percent accuracy for ranking was questioned during the method development and the comparison of the data ranked by each method allowed a critical comparison between the two ranking methods. The rationale behind the decision to use χ2 for ranking the models rather than the overall percent accuracy is described in detail in the method development Section 2.3.4.8.

65

Ranking the logistic regression models

Ranking by the Likelihood Ratio Test statistic (ΔG2) For the final analysis, the single, two-gene and three-gene LR models were ranked in decreasing order of the value for the Likelihood Ratio Test statistic (ΔG2). This value was calculated automatically by SPSS to compare each model with the null model (constant only model without predictor variable(s)).

The ΔG2 is calculated using the Likelihood Ratio Test (LRT). This statistic was chosen as a measure of overall performance of each model as it is considered the most accurate goodness-of-fit test in LR analysis, particularly in small datasets (307). The goodness-of- fit is an essential characteristic of the overall performance of a predictive model and describes how well the model fits the observed data, i.e. the deviance between the observed and predicted outcomes. A smaller deviance indicates a better fit to the observed data. In this study, it described how well the model predicted the observed outcomes for the patients (categorised correctly into the recurrent or non-recurrent group).

Likelihood Ratio Test in SPSS The LRT is used to compare the goodness-of-fit of two models, a restricted model (with fewer predictors) and a final model (more predictors), where the models are hierarchically nested, i.e. the final model differs from the restricted model by the addition of one or more predictors.

SPSS uses the LRT to compare the performance of each model in relation to the null model (constant only model without predictor variable(s)). The LRT statistic (ΔG2) is automatically calculated by the software and describes the improvement in fit under the final model (with predictor variable(s)) in comparison to the fit under the null model. ΔG2 is denoted χ2 in SPSS as it follows a chi-squared distribution.

SPSS calculates χ2 after it has selected the parameter estimates (coefficients) for the final model, which achieve the greatest likelihood of the observed outcome (see Appendix 5 for a fuller description of LR analysis). It separately fits the null model and the final model to the data and calculates the -2*ln likelihood (-2LL) for each. This value is used rather than the likelihood to make calculation easier. The -2LL describes the deviance between

66 the observed and the predicted outcomes, with smaller values reflecting a better fit. SPSS aims to achieve the lowest possible value for the -2LL. χ2 is calculated as the difference between the -2LL of the null model and the -2LL of the final model as shown below:

푙𝑖푘푒푙𝑖ℎ표표푑 푓표푟 푛푢푙푙 푚표푑푒푙 훥퐺2 = 휒2 = −2 ∗ ln 푑푓 푙𝑖푘푒푙𝑖ℎ표표푑 푓표푟 푓𝑖푛푎푙 푚표푑푒푙

= (−2퐿퐿 푛푢푙푙 푚표푑푒푙) − (−2퐿퐿 푓𝑖푛푎푙 푚표푑푒푙) df is the degrees of freedom (the number of predictor variables added to the final model)

A higher value of χ2 reflects a larger difference between the -2LL of the null model and the final model, therefore indicating a better fitting model.

SPSS also calculated a p-value for χ2, which was essential to identify whether the improvement in fit obtained by adding the variable(s) to the null model was statistically significant. χ2 follows a chi-squared distribution, with the degrees of freedom (df) equal to the difference in the number of predictor variables between the null and the final model. A larger value for χ2 results in a smaller p-value, which provides evidence that the final model is a significant improvement over the null model.

The results section includes only models for which the p-value for χ2 was significant to the conventional confidence level of 95% (p<0.05). Additionally, only the two-gene models are included, for which the χ2 exceeded the highest value obtained for the single gene models; and only the three-gene models for which the χ2 exceeded the highest value obtained for the two-gene models.

Ranking by the overall percent accuracy (used in the pilot study and method development) During the method development, before χ2 had been selected as an appropriate measure of overall performance, the predictive models were ranked in descending order of the overall percent accuracy as this had been the method used in the SL study. During the method development, it was decided to use the χ2 in preference to the overall percent accuracy (2.3.4.8). To validate our decision to use χ2, the ranking of the best performing models by χ2 was compared with the ranking by overall percent accuracy (2.3.4.8 and 2.3.10). 67

Overall percent accuracy in SPSS The overall percent accuracy is calculated automatically by SPSS and is the percentage of cases that are correctly classified. The overall percent accuracy is calculated as:

[(true positive+true negatives)/total] * 100

SPSS calculated the overall percent accuracy by initially calculating the predicted probability of being in the recurrent group for each patient by incorporating the appropriate gene expression values into the LR equation (SPSS previously selected the parameter estimates for the equation). The patients were classified into either the recurrent or non-recurrent group based on their predicted probability. SPSS uses a default threshold of 0.5. Therefore, patients with a predicted probability ≥0.5 were classified as recurrent and those with a predicted probability of <0.5 were classified as non-recurrent. SPSS then compared the predicted categories with the observed categories and calculated the number of patients correctly classified into both the recurrent and non-recurrent groups. The overall percent accuracy was then calculated as the total number of correctly classified patients out of the total number of patients.

Several additional parameters from the LR output, that will be referred to later, include the significance of the Wald statistic, the Nagelkerke R2 and the odds ratio.

The Wald statistic is another measure of goodness-of-fit that was included in the SPSS output. The Wald chi-square test describes the significance of each predictor variable in an LR model by testing the null hypothesis that the regression coefficient for the variable is zero. The significance for the Wald statistic could therefore be used to assess the contribution that each gene made to the model (p<0.05 would indicate a significant contribution to a 95% confidence level).

The odds ratio is the exponentiated regression coefficient for the predictor variable. Although the regression coefficient is used in the LR equation, it is difficult to interpret as it is measured in natural logarithmic units. In comparison, the odds ratio is easier to interpret. Here the odds ratio described the number by which the raw odds in favour of the patient being in the recurrent group were multiplied for a one-unit increase in expression of the gene (predictor variable).

68

The Nagelkerke R2 is also described. This is a pseudo R2 statistic, as LR does not have an equivalent to the R2 found in Ordinary Least Squares regression. Although the Nagelkerke R2 has limitations and should not be considered a measure of the goodness of fit, it does offer a relative measure of the proportion of variation in outcome that is explained by a model (307).

Using the LRT to assess the effect of additional predictor variables The second use of the LRT in this study was to compare the fit of each two-gene model with its nested single gene models and each three-gene model with its nested two-gene models. This was necessary to assess whether the addition of another variable significantly improved upon each single and two-gene model and consequently if the addition of the variable was justified. As the -2LL had been calculated for each model automatically by SPSS, the LRT statistic (χ2) comparing the restricted model (fewer predictor variables) with the final model could be calculated manually by subtracting the -2LL of the final model from the -2LL of the restricted model. As the LRT statistic follows a chi-squared distribution, the significance (of the improvement in fit) could be calculated using an online chi square probability calculator using the value for χ2 and the degrees of freedom (N.B. the degrees of freedom here is the number of predictor variables added to the restricted model to make the final model).

푙𝑖푘푒푙𝑖ℎ표표푑 푓표푟 푠푚푎푙푙푒푟 푚표푑푒푙 훥퐺2 = 휒2 = −2 ∗ ln 푑푓 푙𝑖푘푒푙𝑖ℎ표표푑 푓표푟 푙푎푟푔푒푟 푚표푑푒푙

= (−2퐿퐿 푠푚푎푙푙푒푟 푚표푑푒푙) − (−2퐿퐿 푙푎푟푔푒푟 푚표푑푒푙) df is the number of predictor variables added to the smaller model to make the larger model

For instance, the two-gene model AMACR.EFNA5 was compared with both single gene models AMACR and EFNA5. Likewise, the three-gene model AMACR.EFNA5.SDHA was compared to AMACR.EFNA5, AMACR.SDHA and EFNA5.SDHA.

The LRT statistic was also used in a similar manner to test the effect of adding the GS to each of the single and two-gene models. Here, the χ2 was calculated as the (-2LL of gene(s) only model) - (-2LL of the genes(s) and GS model).

69

SPSS - Receiver Operating Characteristic curves The discriminatory ability of the gene models to categorise patients according to recurrence was tested using Receiver Operating Characteristic (ROC) curves.

ROC curves (Figure 2.8) were plotted in SPSS using the predicted probabilities for the samples, which were calculated during the LR analysis by the software using the LR equation.

Figure 2.8 ROC curve plotted in SPSS

The SPSS output produced a list of coordinates of the curve, from which an optimal cut- off point (c*) of predicted probability was determined for each gene model by maximising the value of the Youden Index (J)(308) described by the equation below, where c is the cut-off point.

푌표푢푑푒푛 𝑖푛푑푒푥 (퐽) = 푠푒푛푠𝑖푡𝑖푣𝑖푡푦 (푐) + 푠푝푒푐𝑖푓𝑖푐𝑖푡푦 (푐) − 1

Genes in the top performing gene models To identify the most frequently found genes in the best performing predictive models (those with the highest χ2 values), the percentage prevalence of each gene was calculated in the highest 19.4%, 9.3%, and 2.8% of the two-gene models and the highest 12.7%, 6.1% and 2% of the three-gene models. To compare our method of ranking (χ2) with the method used in the pilot study (overall percent accuracy), the percentage prevalence of each gene was also calculated for similarly sized divisions in the top two- and three-gene models according to the overall percent accuracy. These divisions were

70 selected as appropriate to take into account the quantisation of the overall percent accuracy data, where multiple gene models presented with the same value, and was the reason that more rounded percentages were not used (explained further in the method development Section 2.3.4.8).

The calculations for the prevalence of each gene in each fraction were performed by counting the total number of times each gene appeared. The percentage prevalence was calculated by dividing this number by the total number of genes included in all the models in the fraction and multiplying by 100.

The percentage prevalence of each gene in the three fractions was plotted separately for the highest models according to χ2 and according to overall percent accuracy. The genes were listed from left to right in order of decreasing significance for the Wald statistic (increasing p-value) found in the univariate LR analysis.

71

2.3 Results

Introduction

The results section includes:

• Information about the patients included in the study. • A method development section - developments since the pilot study. • The results of the independent samples t-tests, logistic regression analysis and ROC analysis - identification of predictive models of recurrence. • Description of the prevalence of each gene in the best performing predictive models of recurrence - a novel manner for identifying the most important predictors. • Comparison of the results of this study with the results of the pilot study.

Samples

105 samples were processed through to the relative quantification stage. However, 38 samples were removed for the reasons listed below, leaving 67 samples taken into statistical analysis.

• patient treated with hormones only (7 samples) • patient conservatively managed (3 samples) • patient had metastatic disease at diagnosis (6 samples) • insufficient follow-up data (8 samples) • patient died of other causes within the 5 years (2 samples - these patients died of lymphoma and abdominal aortic aneurysm) • low HMBS expression (Ct value ≥35)(12 samples)

Of the 67 patients included in the final analysis, 21 were categorised as recurrent and 46 as non-recurrent. There were 40 patients classified as GS ≥7 and 27 classified as GS <7.

Clinicopathological information for each of the final 67 patients can be found in Appendix 7.

72

Method development of practical methods

Introduction A preamplification procedure was included in the final methods to enrich the sample cDNA before its use in the TaqMan array. This was necessary, as during the investigative period, a problem had been found regarding the amount of amplifiable cDNA. This problem and the steps taken to resolve it are described in more detail below.

Enrichment of cDNA for successful real-time qPCR The first BB array plate which was run, was a standard curve plate consisting of 150ng/µl, 300ng/µl, and 600ng/µl concentrations of non-preamplified (NPA) control cDNA. This plate showed Ct values of <35 for the majority of genes, but even at the highest concentration, over 15% of the genes on the array had Ct values which were either undetected (≥40) or higher than would be considered optimal for accurate gene expression determination (≥35)(305).

In comparison, when the first sample cDNA was run on the next BB array at 300ng/µl, over 70% of the genes on the array were either undetermined or showed Ct values of ≥35.

To ensure that the problem was not specific to the first 4 samples, another set of samples were run on a 48 gene TaqMan array, which contained a selection of cancer- related genes (Appendix 6). This array was readily available in the department and was used in preference to the BB array due to the limited numbers of the latter. The Ct values remained high (≥35) at a cDNA concentration of both 300ng/µl and 600ng/µl.

In contrast to these findings, preliminary measurements of the RNA and cDNA samples by UV-spectroscopy appeared to show adequate concentrations for downstream real- time qPCR analysis (Table 2.5). The RNA samples showed less than ideal purity as demonstrated by a value of ~1.5 for the ratio of the absorbance at 260nm and 280nm (A260/280)(pure RNA should be ~2.0). However, the A260/280 ratio tends to be lower for RNA extracted from FFPE tissue than from fresh frozen tissue (309).

73

Table 2.5 RNA/cDNA concentration and purity of samples 1-11 as measured by UV-spectroscopy

Sample RNA cDNA

A260/280 Conc (ng/µl) A260/280 Conc (ng/µl) 1 1.4 26.7 1.8 2234.7 2 1.3 50.1 1.8 2275.7 3 1.4 39.5 1.8 2241.4 4 1.4 50.1 1.8 2279.1 5 1.5 35.6 1.8 2146.9 6 1.5 30.5 1.8 2036.4 7 1.5 30.6 1.8 2066.7 8 1.4 57.6 1.8 2158.6 9 1.5 50.2 1.8 2151.2 10 1.6 36.9 1.8 2004.2 11 1.6 44.6 1.8 2092.9

The high Ct values suggested the presence of low target cDNA quantity and/or quality despite a reasonable input of cDNA based on the spectrophotometric measurement. This overestimation of the cDNA concentration by the spectrophotometer would likely have resulted from the contribution of other components of the reverse transcription mix to the absorbance readings at 260nm, e.g. dNTPs and remaining RNA. Therefore, UV-spectroscopy was not a suitable method to measure the cDNA concentration.

Spectrophotometric measurement of the RNA samples before reverse transcription showed a relatively low yield of RNA (27-58ng/µl). The low yield was expected, as RNA yield tends to be significantly lower from FFPE tissue than fresh tissue as protein-nucleic acid crosslinking makes RNA extraction more difficult. Furthermore, the low RNA yield in this study may have been compounded by the limited amount of tissue per sample (two 0.6mm tissue punches), as well as difficulties in procuring these tissue punches due to the minimal size of the biopsy samples.

However, this yield should have been enough to allow successful real-time qPCR analysis. Therefore, the high Ct values suggested a problem with the quality of the RNA. Crucially, spectrophotometric assessment of RNA does not measure integrity, an important parameter of the RNA quality that affects subsequent molecular analysis. The integrity of RNA extracted from FFPE tissue is invariably compromised as a result of the fixation and embedding processes and is typically highly fragmented and chemically modified (279, 280). These samples were also several years old and storage at room temperature tends to cause further degradation of RNA (310). Despite steps to overcome these issues, as described in Section 2.2.1.1, an issue of insufficient

74 amplifiable cDNA remained. The poor quality of the RNA may have limited the efficiency of the reverse transcription reaction leading to a much lower cDNA yield than would be expected for the RNA input amount.

The Ct values were much higher than those found in the pilot study (which did not require a preamplification)(275). Despite these differences, the concentrations of RNA as measured spectrophotometrically were within a similar range between the two studies. The pilot study used only RP samples, whereas our study had used predominantly biopsy samples. The different size of the samples, which can affect how the tissue reacts to processing and storage, may have been responsible for the differences in the RNA quality between the studies (281). Additionally, surgical and biopsy samples are procured and processed differently, so there would likely be many variables between the sample types that could affect RNA quality.

The purity of the RNA, as measured by the A260/A280 ratio, was slightly lower than the generally acceptable value for RNA of ≥1.8. This is a common finding when using FFPE samples (311). A low A260/A280 ratio may be caused by the presence of proteins or other contaminants which absorb at 280nm and may suggest some residual formalin or protein crosslinks from the fixation process (312).

To increase the amount of amplifiable target cDNA for use in the array, a preliminary preamplification step using the TaqMan® PreAmp Master Mix Kit (Applied Biosystems) was included to amplify the 95 target genes (every gene on the TaqMan array apart from 18S). This kit was chosen as it had the capacity to amplify the specific cDNA gene targets on the array multifold with high reproducibility, without significantly modifying the original gene expression profile (313, 314).

As found by other studies, the inclusion of a preamplification step shifted the Ct values to much lower values thereby increasing the sensitivity of the real-time qPCR analysis and expanding the number of genes that could be analysed (315).

As both the NPA and preamplified (PA) cDNA for samples 1-4 were analysed with the BB array, this allowed a direct comparison of the Ct values before and after preamplification. Table 2.6 shows the number of analysable genes (Ct<35) in the NPA and PA cDNA for these samples. It clearly shows the low number of analysable genes in

75 the NPA samples. After preamplification, the number of analysable genes was significantly increased.

Table 2.6 The number of analysable genes (Ct<35) in the NPA and PA samples

Number of genes Ct<35 Sample RNA conc NPA PA 1 26.7 25 81 2 50.1 5 57 3 39.5 25 64 4 50.1 23 81

Table 2.7 shows the decrement in Ct value (ΔCt = CtNPA-CtPA) due to the preamplification for each housekeeping gene. The ΔCt values for the entire set of genes can be found in Appendix 8.

Table 2.7 Difference in Ct value between the NPA and PA samples for the housekeeping genes

Sample HMBS HPRT1 SDHA TBP 1 NPA 37.94 40 35.51 40 PA 26.38 30.28 26.62 28.23 ΔCt 11.55 9.72 8.89 11.77

2 NPA 40 40 40 40 PA 33.08 40 30.45 33.65 ΔCt 6.92 0 9.55 6.35

3 NPA 40 40 34.40 40 PA 29.02 40 27.92 31.35 ΔCt 10.98 0 6.48 8.65

4 NPA 34.23 40 35.42 40 PA 26.84 33.86 27.14 33.68 ΔCt 7.39 6.15 8.28 6.32

Mean ΔCt 9.21 3.97 8.30 8.27

Due to the problems encountered with spectrophotometric measurement, after the preamplification, the cDNA samples were added to the array according to their dilution factor rather than any specific concentration.

Method development of data analysis

Introduction There were many important developments to the data analysis methods incorporated since the pilot study (and since the initial submission of the thesis). These were essential

76 to enhance the validity of the findings and were done in accordance with the recommendations made by the external examiner. These developments are described in the following sections and include:

• Correlation of gene expression data with recurrence rather than GS (Section 2.3.4.2). • Use of a longer follow-up period requiring updated patient outcome data (Section 2.3.4.2 and 2.3.4.3). • Setting baseline and threshold values individually for each gene to ensure optimal results (Section 2.3.4.4). • Relaxation of the criteria for entry of genes into the logistic regression analysis (Section 2.3.4.5). • Use of the Pfaffl dataset only for logistic regression analysis (Section 2.3.4.6). • Use of an automated approach for the logistic regression analysis (Section 2.3.4.7) • Exploration and use of an alternative method for ranking the logistic regression models (Section 2.3.4.8).

Clinical endpoints and updated outcome data

Using recurrence as the clinical endpoint In the pilot study, LR models were identified that could predict RP patients into groups according to GS (≥7 vs <7). A further LR analysis was performed using recurrence as the dependent variable, as a more meaningful clinical endpoint. However, the findings of the recurrence analysis in the pilot study were limited by the incomplete follow-up data available for the patients.

The initial analysis of our dataset (first submission) also included a preliminary LR analysis using GS ≥7 and <7 groups. A secondary analysis was included using recurrence as the dependent variable but the length of follow-up data available for the patients was only short at that time.

The previous use of GS as a surrogate for aggressive and indolent prostate cancer is flawed as GS shows limited correlation with actual patient outcome (see points below).

77

Therefore, the final analysis focused on the identification of predictive models of disease recurrence only.

• GS does not entirely explain the risk of disease progression in prostate cancer. Although GS 8-10 is associated with a high risk of disease progression and death from prostate cancer (316), some patients within this group have a favourable outcome (317). Additionally, some patients with GS 6 have an unfavourable outcome (318).

• In particular, patients with GS 7 are a prognostically heterogeneous group (319). A large proportion of patients in this study fell within this group (>40%). Therefore, patients with both favourable and unfavourable outcomes would be included in the so-called “aggressive” group (GS ≥7). This was confirmed by the mixture of both recurrent and non-recurrent patients with GS 7 in this dataset.

• Some patients only had a biopsy GS available for categorisation (those treated with RT only). Biopsies are prone to inaccuracies of grading for the reasons described earlier, with undergrading of biopsies the most frequently encountered problem (320). Therefore, some of the patients who had a GS of 6 on biopsy and therefore classified in the low GS group (<7), may have actually had a higher GS if the entire prostate had been analysed, and would be more correctly classified as GS ≥7 (321).

Extending follow-up To enhance the clinical relevance of any significant findings from the study, it was necessary to analyse the gene expression data using a longer follow-up period. This necessitated that various parameters of disease outcome for the patients, including biochemical and clinical parameters, be updated to cover the years that had elapsed since this clinical information was last accessed (approximately 2012).

By extending the clinical information available, the statistical analysis could be performed using 5-year disease recurrence/non-recurrence as the dependent variable. Predictive models of disease recurrence within 5 years of RP/RT would be clinically relevant as most recurrences in prostate cancer occur within 5 years from treatment

78

(322). Additionally, earlier disease recurrence may signal more aggressive disease with a significantly increased risk of disease-specific mortality (323).

Previously (at first submission of the thesis) only 3 years outcome data were available for the patients. Similarly, the pilot study had limited outcome data available at that time. In fact, the pilot study did not use a constant follow-up time for the statistical analysis with the length of follow-up varying between patients from 2 to 5 years.

Updating the outcome data before our final analysis meant that some patients whose data had previously been excluded as incomplete, could now be included. This time, the data acquisition included the use of the database Graphnet (Electronic Patient Records) as an additional source of clinical information alongside the other previously used sources (the pathology records and patient notes). This may have increased the amount of clinical information available.

New ethical permissions An issue arose regarding access to the new patient data, as the ethical permissions that had originally allowed the use of the samples in the study (and therefore access to patient data) had lapsed. At the time of the experimental work in the TORC, QAH (2009- 2012), Medical Research and Ethics Committee (MREC) approval was in place for the use of the samples in research through their inclusion in the Portsmouth Molecular Pathology Tissue Bank (07/MRE08/2). This approval expired in 2012 with the closure of the Tissue Bank after the departure of Professor Ian Cree, the director of TORC.

Therefore in 2016, it was necessary to submit a new application for Health Research Authority (HRA) approval before any additional patient data could be accessed. Additionally, an internal review with the Science Faculty Ethics Committee (SFEC) at the University of Portsmouth (UoP) was required.

Both applications were successful, with receipt of an approval letter from the HRA and a letter of Favourable Ethical Opinion from the SFEC. As there was no joint research agreement in place between the Portsmouth Hospitals Trust (PHT) and the UoP, a new data transfer agreement was drawn up and signed by both parties.

Data collection proceeded after a letter of Confirmation of Capacity and Capability from the PHT had been received (Ref PHT/2016/100). 79

The PHT assigned a local collaborator (Dr Sharon Glaysher) at the hospital to oversee the collection of the patient data, which were accessed using hospital databases and patients’ notes by Dr Glaysher and members of staff in the Oncology and Urology teams. The data were anonymized prior to my receiving it, as I no longer held an honorary contract at the PHT.

A graphical depiction of the steps to gain access to the necessary patient data is shown in Figure 2.9.

Figure 2.9 Steps to gain new ethical approval to access patient data

Optimal setting of the baseline and threshold values In the initial data analysis (first submission) the baseline was set at 3-15 cycles (default range) and the threshold value at 0.5, which were the settings used in the pilot study. The same settings were used for every target sequence on the array. It became evident that these settings were not optimal for every gene on the array due to the variability in expression levels between genes. For instance, the baseline stop value of 15 was set too 80 low for less abundant genes and too high for more abundant genes (Figure 2.10). If the baseline is set too low, then there will be insufficient amount of background subtracted. If the baseline is set too high, it would eliminate cycles in which the amplification starts to rise above the background. Either scenario may lead to inaccurate determination of Ct values.

Furthermore, the threshold value of 0.5 was not suitable for every gene. The threshold value should be set at a value above the background fluorescence and within the exponential growth portion of the amplification curve for accurate quantification of starting template amount. The value of 0.5 was found to be too high for many of the less abundant genes, where it was situated in the plateau phase rather than in the exponential portion of the amplification plot (Figure 2.11). The measurement of Ct values in the plateau phase leads to inaccurate quantification as there is no longer doubling of the amplification product with each cycle, so the fluorescent signal cannot be used to accurately determine the amount of template.

Therefore, for the final analysis, the baseline and threshold were set individually for each gene for optimal settings. The baseline was set so that the stop value was approximately 1-2 cycles before the earliest amplification ensuring that background noise was eliminated. The threshold value for each gene was set at a value within the exponential phase of the amplification plot.

To allow the appropriate settings to be found for each gene, it was necessary to view the amplification plots from all the sample plates together. ExpressionSuite software (version 1.1; Thermo Fisher Scientific) was ideal for this purpose. Once the optimal settings for each gene had been selected, they could be imported once into ExpressionSuite and then used for the analysis of the data from every plate, as it was essential to keep the settings constant between the samples.

81

Figure 2.10 Inappropriate setting of the baseline

82

Figure 2.11 The threshold is set too high

Independent samples t-tests - relaxing criterion for entry into multivariate logistic regression analysis In the initial submission of the BB study, only genes found to have a p-value of <0.05 in the independent samples t-test were used as predictors in the multivariate LR models. This stringent criterion had also been used in the pilot study. For the final analysis, the cut-off value for inclusion of the genes into the multivariate LR analysis was relaxed to include every gene with a p-value of <0.5. After relaxing this threshold, the number of genes that could be included in the models increased from 3 in the pilot study to 39 in the final analysis.

The relaxation of the entry criterion for the selection of variables was necessary, as the use of the conventional significance cut-off (p<0.05) may fail to identify variables that could have important predictive value in regression models (324). It is possible for variables in combination to be highly predictive of outcome, despite each variable showing only a weak association with outcome. Several authors suggest selecting all variables with a significance for the Wald statistic of p<0.25 in the univariate LR analysis (325, 326).

A cut-off of p<0.5 was used in this study to ensure that no potentially important genes were excluded, as the sample set was small and p-values may fail to reach significance in small sample sizes, despite a large magnitude of effect (327). Further increasing the threshold was deemed necessary as the selection was based on the p-value from the 83 t-test rather than on the p-value of the Wald statistic from the univariate analysis (as described in these other studies) and these two tests describe the relationship between the predictor variables and outcome in different ways. Therefore, a relaxed criterion of p<0.5 was considered most appropriate.

Pfaffl dataset For the final analysis, only the Pfaffl dataset was used in the LR analysis rather than both datasets (Pfaffl and 2-ΔCt) as had been used previously. The Pfaffl method provided a more reliable method of relative quantification than the 2-ΔCt method, as it controls for differences between the amplification efficiencies of the target and reference genes (299). From the results of the standard curve experiments, it was evident that there were differences between the amplification efficiencies of most of the genes and the reference gene HMBS. Therefore, this method was deemed more suitable. Additionally, it would have been too time-consuming to analyse both datasets in the multivariate LR analysis as the amount of data had increased substantially due to the relaxation of the inclusion criterion for entry of predictor variables.

Automated approach to logistic regression analysis Relaxing the threshold for inclusion of genes resulted in a significant increase in the number of possible gene combinations to be processed. It would have been prohibitively time-consuming to process the large number of two- and three-gene combinations manually in SPSS (Table 2.8). Therefore, an automated process was developed for the project.

Table 2.8 Number of gene combinations for logistic regression analysis

Number of genes Number of combinations Single 39 Two-gene combinations 387 Three-gene combinations 1482

The SPSS package has several methods of automation including the use of macros and scripts in various programming languages. Automation in SPSS allows the same or similar commands to be issued repeatedly and for input sources and output formats to be defined.

84

The most common method of automation is SPSS Scripting, which is based on SaxBasic language. SPSS macros are created from the SPSS command syntax language. Scripting can also be undertaken in Python. Scripts in the SPSS command syntax language and Python can be combined to run together.

In order to produce scripts to perform LR on the lists of gene combinations, a VBA script was used in Microsoft Excel. This script was written by Mr Andrew Bickers. Each gene combination, for which a LR was required, occupied a row in a spread sheet (Figure 2.12). The VBA script, invoked by a button in the spread sheet, looped through the rows generating an SPSS syntax phrase for each gene combination in an ASCII text file that commanded SPSS to perform a LR on those gene combinations. The gene names corresponded to the column titles in an SPSS input file (Figure 2.13).

Figure 2.12 VBA script in Excel spreadsheet containing list of two-gene combinations

Figure 2.13 Example snippet of SPSS input file for logistic regression analysis

The VBA script (Figure 2.14) also used some embedded Python commands to format the output.

85

The final syntax text file was named as *.sps and could be read directly into SPSS as a syntax file (Figure 2.15). The syntax contained formatting commands to output the relevant tables and specify output locations as well as the parameters required to perform LRs.

Figure 2.14 VBA script

Figure 2.15 Final syntax file

The output files generated by executing this syntax contained the relevant tables (Table 2.9) from the LR output. These outputs were exported and combined in Excel to produce a single spreadsheet containing all the required information with one row representing each gene combination (Figure 2.16). These final spread sheets were produced for

86 single, two- and three-gene combinations. The LR outputs in this format provided all the necessary information in one spreadsheet, which simplified analysis and allowed ranking of the gene models by any parameter.

Table 2.9 List of tables exported from logistic regression output

Name of table Information included Classification table (final model) Percent predicted correctly - overall, non-recurrence, recurrence Hosmer and Lemeshow test Chi-square, sig. Omnibus tests of model coefficients Chi-square, sig. Model summary -2 ln likelihood, Cox & Snell R square, Nagelkerke R square Variables in the equation Gene names, sig.

Figure 2.16 Snippet of final spreadsheet containing information from the logistic regression outputs

Insights about ranking the logistic regression models The pilot study (and the first submission) had ranked the LR models according to their overall percent accuracy as this is a highly intuitive method of describing performance of a predictive model. Although this method of ranking was initially used to rank the models in the final analysis, the appropriateness of this method for assessing overall model performance was questioned based on the following findings:

Limitations of the overall percent accuracy Ranking the models by overall percent accuracy led to a quantisation effect of the data, which made ranking difficult. The overall percent accuracies fell into discrete values with multiple models having the same value. This “quantisation” of the data resulted from the fact that the overall percent accuracy is calculated by the software as the number of patients correctly categorised. Therefore, differences in the overall percent accuracy are caused by either an increase or decrease in the number of patients correctly categorised, resulting in the percentage value increasing or decreasing by a specific amount. This effect was compounded by the small sample size in this study, which meant that a 1.49% 87 change in overall percent accuracy was observed for every additional patient correctly or incorrectly predicted [(1/67) * 100].

It was also evident that despite a good overall percent accuracy for some models, they showed imbalanced values for the sensitivity (the proportion of recurrent patients correctly classified as recurrent) and the specificity (the proportion of non-recurrent patients correctly classified as non-recurrent), with a distinctly higher value for the specificity than the sensitivity for all the models. These values are given in the logistic regression output but are not adequately described by the overall percent accuracy, which does not give information about the number of correct and incorrect classifications of each class (328).

Additionally, the overall percent accuracy is not a reliable measure of performance in imbalanced datasets (the number of samples in the “1” group is markedly higher than the “0” group) and leads to misleading accuracies as the minority class contributes less to the overall accuracy. As algorithms are essentially accuracy driven, they tend to predict all samples into the majority class, which may lead to an overestimation of the overall percent accuracy (329). This dataset was moderately imbalanced, with 21 patients in the recurrent group (“1”) and 46 in the non-recurrent group (“0”).

This can be exacerbated by the fact that SPSS uses a threshold of 0.5 to decide between classes. In imbalanced datasets, where the predicted probabilities of the minority class tend to be underestimated, this threshold is not appropriate and may lead to unreliable accuracies and an imbalance of sensitivity and specificity, as seen in our models (330). (The identification of an optimal threshold to maximise sensitivity and specificity is explained in more detail in the results Sections 2.3.7.4, 2.3.8.5 and 2.3.8.8).

The overall percent accuracy is also misleading, if not given in the context of the null model. In this dataset, the value for the null model was high at 68.7%. This reflected the large proportion of non-recurrent cases, as the null model simply classified every sample as non-recurrent. The percent accuracy does not accurately describe the gene models in relation to the null model; i.e. the improvement in performance that can be achieved by the addition of the gene(s) to the null model. Therefore, models appearing to have a good performance based on their overall percent accuracy may not be as promising if the percent accuracy for the null model is high. 88

It was apparent that some of the highest-ranking models according to overall percent accuracy contained variables with a low significance (high p-value) for the Wald statistic. Conversely, many of the lower-ranking gene models contained variables with lower p-values for the Wald significance. From this, it could be deduced that some of the best performing models according to percent accuracy contained variables that were not making any significant contribution to the model and others containing only highly significant variables were ranked much lower. For example, the two-gene model AMACR.ANPEP was ranked at number 5. The Wald significance for the two variables AMACR and ANPEP were 0.014 and 0.440 respectively showing that ANPEP made no significant contribution to the model. Conversely, AMACR.SDHA was ranked at number 43. Both AMACR and SDHA had highly significant values for the Wald statistic of 0.003 and 0.009 respectively. A similar pattern emerged for the three-gene models. For example, the three-gene model ABL1.GUCY1A1.PLLP was ranked number 5 according to the overall percent accuracy. The Wald significance for the three variables ABL1, GUCY1A1 and PLLP were 0.022, 0.041 and 0.279 respectively. This demonstrated that the variable PLLP did not significantly contribute to the model. Conversely, the gene model AMACR.HPRT1.SDHA was ranked at number 54. Each of the three variables AMACR, HPRT1 and SDHA were highly significant at 0.005, 0.049 and 0.008.

A further observation that cast doubt on the use of the overall percent accuracy for ranking the models was found in the results of a preliminary analysis, which had been used to visualise the most important genes in the large number of processed models.

Distribution of genes in the best performing gene models according to overall percent accuracy A graphical representation of the prevalence of each of the 39 genes in the best performing predictive models (according to the overall percent accuracy) was produced. Although a plot of prevalence in the top 10% of the models would have been ideal, due to the overall percent values falling into discrete values, the top 9.3% and 12.7% of the two-gene and three-gene models respectively were selected to ensure that all models with values within this range were included.

The percentage prevalence of each gene within these upper fractions was calculated as described in Section 2.2.5.10. These values were then separately plotted on a bar chart

89 in order of increasing p-value obtained from the independent samples t-test (Figure 2.17 and Figure 2.18).

It was evident from the plots that the gene AMACR, with a prevalence of 26.4% and 20.2% in the two- and three-gene models respectively, was the most prevalent gene in the top models according to overall percent accuracy. The prevalence of AMACR was approximately 3-4 times that found for any other gene. As would be expected, AMACR had a highly significant p-value in the independent samples t-test of 0.018. However, the genes HPRT1 and SDHA, which had the most highly significant p-values in the t-test, had a prevalence of only 5.6% each in the two-gene models and 6.4% and 1.8% respectively in the three-gene models. A number of genes which had much higher p-values were found to be as prevalent or more prevalent than HPRT1 and SDHA, namely PVT1 with a p-value of 0.073 had a prevalence of 6.6% in the three-gene models; GUCY1A1 with a p-value of 0.149 had a prevalence of 5.6% and 6% in the two- and three- gene models respectively and EFNA5 with a p-value of 0.151 had a prevalence of 6.9% in the two-gene models. The relatively high prevalence of some genes with even higher p-values was also noticed, e.g. NELL2 and PLLP with p-values of 0.292 and 0.303 respectively. NELL2 had a prevalence of 4.2% in the two-gene models and PLLP had a prevalence of 4.3% in the three-gene models.

The discrepancies between the p-values for the t-test and the prevalence of the genes in the models with the highest overall percent accuracy contributed to concerns regarding the validity of using the overall percent accuracy to rank the models.

Summary of limitations of the overall percent accuracy To summarise, the overall percent accuracy was not considered appropriate as a measure of overall performance in this study due to its unreliability in imbalanced datasets, its use of a default threshold of 0.5 and its inability to describe both sensitivity and specificity. This decision was backed up by several authors who advised that it is unsuitable as a measure of overall performance of LR models due to the described reasons (328, 330). The overall percent accuracy was also problematic for ranking due to the described problem of quantisation of the data, which prevented individual ranking of the models within each discrete value.

90

P=0.05

Figure 2.17 Percentage prevalence of each gene in the highest 9.3% of the two-gene models according to overall percent accuracy

91

P=0.05

Figure 2.18 Percentage prevalence of each gene in the highest 12.7% of the three-gene models according to overall percent accuracy

92

Ranking by the Likelihood Ratio Test statistic The described limitations of using overall percent accuracy precluded its use as a measure of overall model performance in this study. Instead, it was decided to use the χ2 for ranking purposes as the LRT test is the most useful parameter of goodness-of-fit in an LR model (307)(described in Section 2.2.5.7).

Ranking the gene models in descending order of χ2 was ideal as this value was calculated automatically in SPSS to compare the null model and the final model. Therefore, this parameter gave the performance of the model in the context of the null model, so these values could be directly compared between models and used for ranking purposes. Furthermore, as the χ2 was different for every model, the models could be ranked more specifically than was possible with the overall percent accuracy.

As the LRT can also be used to compare any two nested models, it could also be used for the comparison of each two-gene model with its nested single gene models and each three-gene model with its nested two-gene models. This was essential to identify if the more complex model offered any significant improvement over the restricted model. Only models which offered a significant improvement over their nested models would be considered useful models. This could also be ascertained by investigating whether the added predictor variable had high significance for the Wald statistic. However, as the LRT is considered more reliable than the Wald test, the LRT was preferred for this purpose (307).

To validate the decision to use the χ2, another plot was produced which compared the highest performing models ranked by χ2 with those ranked by the overall percent accuracy. The percentage prevalence of each gene in the highest 9.3% and 12.7% of the two- and three-gene models ranked by χ2 was calculated and the prevalence according to each ranking method was plotted on one graph (Figure 2.19 and Figure 2.20).

93

P=0.05

Figure 2.19 Percentage prevalence of each gene in the highest 9.3% of the two-gene models according to χ2 and overall percent accuracy

94

P=0.05

Figure 2.20 Percentage prevalence of each gene in the highest 12.7% of the three-gene models according to χ2 and overall percent accuracy

95

The plots according to χ2 showed that AMACR was the most prevalent gene at 25% and 23.2% for the two- and three-gene models respectively. The prevalence of AMACR was 2-3 times the prevalence for any other gene. However, although HPRT1 and SDHA were lower than AMACR, their prevalence values were higher than those found in the gene models with the highest overall percent accuracy. Generally, the prevalence of the genes in the top models ranked by χ2 was higher for the genes with low p-values (<0.05) and lower for the genes with high p-values (>0.05). This was reflected graphically by a more gradual decline in the prevalence as the p-value increased than was observed for the overall percent accuracy. The percentage of genes with p-values <0.05 and >0.05 in the top models according to each ranking method is shown in Table 2.10. It shows the higher percentage of genes with p-values <0.05 in the top models ranked by χ2 in comparison to those ranked by overall percent accuracy. For example, the percentage of genes with p-values <0.05 is 62.5% in the top two-gene models according to χ2 compared with 45.8% in the top two-gene models according to overall percent accuracy.

Table 2.10 The percentage of genes with p-values <0.05 and >0.05 in the top gene models ranked by χ2 and overall percent accuracy

p-value In top 9.3% In top 12.7% Two-gene % Three-gene χ2 Two-gene % Three-gene χ2 % genes < p=0.05 45.8 62.5 39.4 46.5 % genes > p=0.05 54.2 37.5 60.6 53.5

A plot has also been included which depicts the genes from left to right according to the increasing p-value for the Wald significance (univariate LR) rather than the p-value for the t-test (Figure 2.21).

It was evident from this plot, that the prevalence and the Wald significance were correlated as a gradual decline in the prevalence was noted as the significance for the Wald statistic reduced (p-value increased). This suggested that the p-value for the Wald significance may be a more suitable measure for selecting genes for inclusion into multivariate LR analysis as described by others (325, 326).

96

P=0.05

Figure 2.21 Percentage prevalence of each gene in the highest 9.3% of the two-gene models in order of Wald significance from univariate logistic regression

97

This plot shows that the genes with lower p-values for the Wald significance had consistently higher prevalence in the top models according to χ2, with a gradual decline in prevalence as the p-value increased. In contrast, the peaks for the overall percent accuracy do not follow this pattern, with a relatively high prevalence observed for genes with high p-values, e.g. a prevalence of 4.2% for NELL2 (p-value of 0.289).

Some interesting observations were also made when viewing the entire set of two- and three-gene models ranked by the different methods. The ranking of the models by χ2 resulted in a very different order to the gene models in comparison to the order obtained using the overall percent accuracy. Ranking according to χ2 resulted in gene models at the top of list having consistently lower p-values for the Wald statistic (p<0.05) than was found in the gene models with the highest overall percent accuracy.

For example, the three-gene model AMACR.HPRT1.SDHA which had been ranked at number 54 according to the overall percent accuracy was ranked at number 2 according to the χ2. Each of the three variables were highly significant at 0.005, 0.049 and 0.008 respectively. ABL1.GUCY1A1.PLLP, which had been ranked at number 5 according to the overall percent accuracy, was ranked at number 421 according to χ2. This gene model contained the variable PLLP which had a low significance of 0.279.

The relationship between χ2 and the Wald statistic was not surprising as these two tests are asymptotically equivalent (331). However, the Wald test is used to test the significance of each individual coefficient in the model whereas χ2 compares the overall model with the null model. The χ2 was chosen to rank the models as it gave a measure of the whole model and was considered superior to the Wald statistic as a measure of overall performance (307).

98

Optimisation of the assay

Optimisation of the assay was demonstrated using important parameters obtained from the standard curve for each gene. The standard curve of every gene included on the array was assessed except for the genes ALPG, CD4 and GPM6A, which were excluded as they showed undetermined Ct values for every dilution of the control cDNA.

There are too many standard curve plots to include in the results, so the plots of the 5 housekeeping genes (18S, HMBS, HPRT1, SDHA and TBP) are presented and described here (Figure 2.22).

Figure 2.22 Standard curves for housekeeping genes

The %E and the R2 values are presented in Table 2.11. The acceptable range of %E in real-time qPCR is 90-110%. The housekeeping gene HMBS showed a value of 103%, which was close to an ideal value of 100% signifying near perfect doubling of the amplification product with each cycle. The %E of HPRT1 and TBP were also within the

99 optimal range at 94% but the amplification efficiencies of 18S and SDHA were slightly lower than optimal at 86% and 82% respectively. The R2 values for 18S, HMBS, HPRT1 and TBP demonstrated high linearity (>0.990). The R2 value for SDHA was marginally below optimal at 0.978.

Table 2.11 Amplification efficiencies and R2 values for the housekeeping genes

Gene %E R2 18S 86.1 0.991 HMBS 102.8 1.000 HPRT1 93.5 0.995 SDHA 81.8 0.978 TBP 94.3 0.994

The %E and R2 values for the entire set of genes on the array can be found in the electronic Appendix 8. The %E tended to be within the optimal range or only slightly outside these boundaries for the majority of the genes. There were only a few genes for which the %E was significantly lower than optimal (APOPBEC3G, CRHR1, HPX, ITGA2, MMP2, MNX1, PSCA, PRELID3A and TERT) and only one gene, where the %E was significantly higher than optimal (EEF1A1). Every gene on the array had an optimal R2 value of >0.980 except for CRHR1, EDNRA, EFNA3, GSTM5, PCDH18 and TERT.

Another important characteristic of an optimised assay is high reproducibility as demonstrated by low intra- and inter-assay variation. The intra- and inter-assay variation (as measured by the %CV) for each housekeeping gene is shown in Table 2.12. The intra-assay variation was ideally low for 18S, HMBS and TBP (<10%). The intra-assay variation was higher for HPRT1 and SDHA. The inter-assay variation was ~15% for all 5 housekeeping genes. The values for the entire set of genes are shown in Appendix 8.

Table 2.12 Intra- and inter-assay variation for the housekeeping genes

%CV %CV %CV between Gene replicate plate 1 replicate plate 2 plate 1 and 2 (intra-assay) (intra-assay) (inter-assay) 18S 3.5 9.2 15.9 HMBS 2.5 7.1 12.5 HPRT1 6.0 37.8 15.6 SDHA 25.4 33.5 15.0 TBP 9.9 8.5 12.1

100

Independent samples t-tests

The independent samples t-tests for comparison of gene expression between the recurrent and non-recurrent group identified 39 genes with a p-value <0.5 in both datasets. The 39 genes are shown in Table 2.13 and Table 2.14 for the Pfaffl and 2-ΔCt datasets respectively. For each gene, the table includes the significance of the t-test (p-value), the mean expression in the recurrent and non-recurrent group and the fold difference between the groups.

From the 39 genes, 23 showed increased expression and 16 showed decreased expression in the recurrent versus non-recurrent group (in both datasets).

Only 5 genes were significantly differentially expressed between the groups in both datasets to the conventional significance level of 95% (p<0.05). These genes were AMACR, HPRT1, PLAU, PTEN and SDHA.

In both datasets, the genes AMACR, HPRT1 and PTEN showed significantly decreased expression in the recurrent versus non-recurrent group. The expression of PLAU and SDHA were significantly increased in the recurrent vs non-recurrent group.

A further 6 genes were significant to a confidence level of 90% (p<0.1): ABL1, ANXA2, ATP10B, CDH1, PGC and PVT1.

The GS was also included in the LR analysis as it had a p-value <0.5 in the Mann-Whitney U test. The preoperative PSA was not included (p>0.5).

101

Table 2.13 Genes with p<0.5 in the independent samples t-tests for the comparison of the recurrent and non-recurrent group (Pfaffl dataset)

Mean ↑/↓ Gene sig. Non-recurrence Recurrence with recurrence Fold difference ABL1 0.045 -0.251 0.006 ↑ 42.83 AMACR 0.018 2.786 1.714 ↓ 1.63 ANPEP 0.205 -1.149 -0.444 ↑ 2.59 ANXA2 0.072 -2.234 -1.808 ↑ 1.24 ATP10B 0.054 -8.812 -7.531 ↑ 1.17 ATP5MPL 0.295 -0.360 -0.213 ↑ 1.69 C3 0.202 -2.195 -1.854 ↑ 1.18 CD109 0.125 -1.753 -1.465 ↑ 1.20 CDH1 0.052 0.126 0.396 ↑ 3.14 CHM 0.331 -0.697 -0.527 ↑ 1.32 DPT 0.098 -1.891 -1.302 ↑ 1.45 EFNA1 0.155 -0.453 -0.246 ↑ 1.84 EFNA5 0.151 -0.548 -1.137 ↓ 2.07 EZH2 0.451 -1.476 -1.708 ↓ 1.16 F11R 0.235 -0.769 -0.623 ↑ 1.23 FABP5 0.274 -1.164 -0.941 ↑ 1.24 GSTM3 0.146 -0.434 -0.150 ↑ 2.89 GSTM5 0.438 -0.483 -0.911 ↓ 1.89 GUCY1A1 0.149 0.662 0.403 ↓ 1.64 HPRT1 0.008 -0.372 -1.272 ↓ 3.42 IRF7 0.356 -1.472 -1.795 ↓ 1.22 ITGA2 0.244 -0.589 -0.350 ↑ 1.68 ITGB1 0.115 -0.460 -0.216 ↑ 2.13 ITGB6b 0.433 -2.701 -3.107 ↓ 1.15 NDUFAF1 0.154 -0.185 -0.698 ↓ 3.77 NELL2 0.292 -0.328 0.040 ↑ 9.2 PNP 0.322 -1.401 -1.714 ↓ 1.22 PCDH18 0.197 -0.076 -0.494 ↓ 6.5 PGC 0.070 -4.145 -5.170 ↓ 1.25 PLAU 0.020 -2.807 -2.297 ↑ 1.22 PLLP 0.303 -1.142 -1.731 ↓ 1.52 POP5 0.088 -0.087 0.126 ↑ 2.45 PTEN 0.038 0.141 -0.788 ↓ 6.59 PTGFR 0.422 -1.732 -2.136 ↓ 1.23 PVT1 0.073 -0.102 0.236 ↑ 3.31 SDHA 0.013 -0.262 0.055 ↑ 5.76 SERPINB5 0.360 -3.942 -4.651 ↓ 1.18 TBRG4 0.362 -0.786 -0.685 ↑ 1.15 TRAFD1 0.149 0.089 0.406 ↑ 4.56

GS 0.372 ↑

102

Table 2.14 Genes with p<0.5 in the independent samples t-tests for the comparison of the recurrent and non-recurrent group (2-ΔCt dataset)

Mean ↑/↓ Gene sig. Non-recurrence Recurrence with recurrence Fold difference ABL1 0.073 1.151 1.369 ↑ 1.19 AMACR 0.023 2.546 1.588 ↓ 1.6 ANPEP 0.214 0.736 1.433 ↑ 1.95 ANXA2 0.069 2.920 3.339 ↑ 1.14 ATP10B 0.049 -8.262 -7.049 ↑ 1.17 ATP5MPL 0.439 1.660 1.777 ↑ 1.07 C3 0.264 1.697 2.015 ↑ 1.19 CD109 0.099 -1.145 -0.851 ↑ 1.35 CDH1 0.049 3.472 3.741 ↑ 1.08 CHM 0.402 -0.808 -0.659 ↑ 1.23 DPT 0.120 0.237 0.818 ↑ 3.45 EFNA1 0.165 1.308 1.508 ↑ 1.15 EFNA5 0.143 -2.037 -2.676 ↓ 1.31 EZH2 0.411 -1.917 -2.188 ↓ 1.14 F11R 0.224 1.992 2.138 ↑ 1.07 FABP5 0.270 0.293 0.512 ↑ 1.75 GSTM3 0.262 -0.311 -0.072 ↑ 4.32 GSTM5 0.389 -3.383 -3.917 ↓ 1.16 GUCY1A1 0.056 1.532 1.152 ↓ 1.33 HPRT1 0.025 -1.595 -2.578 ↓ 1.62 IRF7 0.342 -1.857 -2.196 ↓ 1.18 ITGA2 0.391 0.202 0.368 ↑ 1.82 ITGB1 0.219 2.891 3.089 ↑ 1.07 ITGB6b 0.398 -4.176 -4.645 ↓ 1.11 NDUFAF1 0.183 -3.290 -3.961 ↓ 1.2 NELL2 0.344 -2.096 -1.749 ↑ 1.2 PNP 0.297 -0.260 -0.606 ↓ 2.33 PCDH18 0.206 -0.607 -0.997 ↓ 1.64 PGC 0.062 -4.912 -6.032 ↓ 1.23 PLAU 0.023 0.069 0.568 ↑ 8.23 PLLP 0.250 -2.651 -3.419 ↓ 1.29 POP5 0.114 -0.244 -0.047 ↑ 5.19 PTEN 0.030 -3.756 -4.827 ↓ 1.29 PTGFR 0.393 -2.727 -3.170 ↓ 1.16 PVT1 0.096 -2.330 -2.010 ↑ 1.16 SDHA 0.046 0.898 1.166 ↑ 1.3 SERPINB5 0.368 -3.504 -4.170 ↓ 1.19 TBRG4 0.297 -0.083 0.030 ↑ 3.77 TRAFD1 0.204 0.346 0.635 ↑ 1.84

GS 0.372 ↑

103

Univariate logistic regression analysis

Single gene models ranked by χ2 The selected output parameters from the univariate LR analysis of the 39 genes are listed in Table 2.15. A model containing only GS is also included in the table.

The single gene models were ranked in descending order of the LRT statistic (χ2), which described the improvement in fit achieved through the addition of each gene to the null model. The degrees of freedom (df) and the significance of χ2 are shown to the right of each χ2 value.

The models with the highest value for χ2 were AMACR (χ2(1) = 7.8, p<0.01) and HPRT1 (χ2(1) = 6.7, p<0.01). The higher the value for χ2, the higher the significance (lower p-value) and the greater the improvement in fit under the final model than the null model. To reiterate, χ2 was calculated as the (-2LL of the null model) - (-2LL of the final model). Therefore, as the -2LL of the null model remained constant, a higher value for χ2 reflected a lower -2LL of the final model (the model fitted the data better).

The addition of either AMACR or HPRT1 offered a significant improvement over the null model to a 99% confidence level (p<0.01). SDHA, PLAU, PTEN, ABL1 and CDH1 significantly improved the fit of the null model to a 95% confidence level (p<0.05). A further number of genes improved the fit of the model to a 90% confidence level (p<0.1): ATP10B, PVT1, ANXA2, PGC, EFNA5, TRAFD1, POP5, DPT and GUCY1A1. The remainder of the genes had p-values >0.1, suggesting that these single gene models were not significantly better for predicting the data than the null model. However, this did not prevent the genes being included in the multivariate analysis as they had potential to be useful in combination, particularly those with p-values <0.25.

The GS as a single predictor was ranked near the bottom of the list (p-value of >0.3) showing that it offered no significant improvement over the null model for the prediction of recurrence.

104

Table 2.15 Output parameters from the logistic regression analysis of the single gene models (listed in descending order of χ2)

Gene(s) Wald significance Percent Accuracy (%) Nagelkerke Non- 1 2 3 -2LL χ2 df sig. R2 1 2 3 Recurrent recurrent Overall Null Model 83.324 68.7 AMACR 75.564 7.760 1 0.0053 0.154 0.009 33.3 95.7 76.1 HPRT1 76.595 6.728 1 0.0095 0.134 0.022 14.3 95.7 70.1 SDHA 77.088 6.236 1 0.0125 0.125 0.020 23.8 95.7 73.1 PLAU 77.835 5.489 1 0.0191 0.111 0.026 19.0 93.5 70.1 PTEN 79.164 4.159 1 0.0414 0.085 0.048 19.0 95.7 71.6 ABL1 79.322 4.002 1 0.0455 0.081 0.053 19.0 93.5 70.1 CDH1 79.372 3.952 1 0.0468 0.080 0.060 14.3 97.8 71.6 ATP10B 79.611 3.712 1 0.0540 0.076 0.058 14.3 91.3 67.2 PVT1 79.880 3.444 1 0.0635 0.070 0.083 9.5 97.8 70.1 ANXA2 79.973 3.351 1 0.0672 0.069 0.076 0.0 95.7 65.7 PGC 80.025 3.298 1 0.0693 0.068 0.074 9.5 97.8 70.1 EFNA5 80.235 3.089 1 0.0788 0.063 0.090 14.3 100.0 73.1 TRAFD1 80.313 3.011 1 0.0827 0.062 0.125 9.5 97.8 70.1 POP5 80.374 2.950 1 0.0859 0.061 0.098 4.8 100.0 70.1 DPT 80.481 2.843 1 0.0918 0.058 0.102 4.8 97.8 68.7 GUCY1A1 80.505 2.819 1 0.0931 0.058 0.102 4.8 100.0 70.1 PCDH18 80.625 2.699 1 0.1004 0.055 0.116 9.5 97.8 70.1 ITGB1 80.809 2.515 1 0.1128 0.052 0.118 9.5 97.8 70.1 CD109 80.833 2.491 1 0.1145 0.051 0.128 9.5 100.0 71.6 GSTM3 81.128 2.196 1 0.1384 0.045 0.147 0.0 100.0 68.7 ITGA2 81.263 2.061 1 0.1511 0.043 0.160 4.8 100.0 70.1 EFNA1 81.293 2.030 1 0.1542 0.042 0.160 9.5 97.8 70.1 NDUFAF1 81.367 1.957 1 0.1618 0.040 0.167 14.3 97.8 71.6 PNP 81.448 1.876 1 0.1708 0.039 0.223 4.8 100.0 70.1 ANPEP 81.647 1.677 1 0.1954 0.035 0.204 0.0 100.0 68.7 C3 81.647 1.677 1 0.1953 0.035 0.202 0.0 97.8 67.2 PLLP 81.710 1.614 1 0.2040 0.033 0.205 14.3 97.8 71.6 F11R 81.868 1.456 1 0.2276 0.030 0.234 4.8 100.0 70.1 FABP5 82.095 1.229 1 0.2676 0.026 0.271 0.0 100.0 68.7 NELL2 82.157 1.166 1 0.2801 0.024 0.289 0.0 100.0 68.7 ATP5MPL 82.173 1.151 1 0.2833 0.024 0.292 0.0 100.0 68.7

105

SERPINB5 82.246 1.078 1 0.2992 0.022 0.299 0.0 100.0 68.7 CHM 82.311 1.013 1 0.3143 0.021 0.329 4.8 100.0 70.1 EZH2 82.460 0.864 1 0.3526 0.018 0.358 4.8 100.0 70.1 TBRG4 82.460 0.864 1 0.3525 0.018 0.357 0.0 100.0 68.7 IRF7 82.488 0.835 1 0.3607 0.017 0.359 4.8 97.8 68.7 PTGFR 82.674 0.649 1 0.4203 0.014 0.419 0.0 100.0 68.7 ITGB6b 82.703 0.621 1 0.4306 0.013 0.428 0.0 100.0 68.7 GSTM5 82.719 0.605 1 0.4367 0.013 0.434 0.0 100.0 68.7

GS 82.357 0.966 1 0.3256 0.020 0.330 0.0 100.0 68.7

106

The table also gives a value for the significance of the Wald statistic, which shows the contribution that each variable (coefficient) makes to the model. The significance values for the Wald test were in a similar range to those found for the χ2. This was expected as these tests are asymptotically equivalent as they are both likelihood-based tests.

The values obtained for the pseudo R2 statistic called the Nagelkerke R2 also decreased in parallel with the χ2. The model containing AMACR had the highest value for the Nagelkerke R2 of 0.154, showing that approximately 15% of the variation in recurrence could be described by this model. In comparison, the model GSTM5 had a Nagelkerke R2 of only 0.013, showing that only 1% of the variation in recurrence was described by this model (i.e. GSTM5 as a single gene model was virtually worthless for predicting recurrence).

The overall percent accuracy is included in the table for comparison with the χ2. The percentage of recurrent and non-recurrent cases that were predicted correctly are also included from the SPSS classification table (sensitivity and specificity).

An overall percent accuracy for the single gene model AMACR showed it could predict 76% of the patients correctly. However, despite 96% of patients with non-recurrent disease being correctly predicted, only 33% of those with recurrent disease were correctly predicted (high specificity and low sensitivity). The limitations of using percent accuracy to assess the performance of a predictive model are explained in detail in the method development in Section 2.3.4.8.

Importantly, the LR output also contained the variables for the logistic regression equation including the regression coefficient for each variable and the regression constant. These variables were needed to allow computation of the predicted probability of each patient being in the recurrent group, which was essential before classification into either the recurrent or non-recurrent group could take place. The regression coefficient describes the increase (or decrease) in the ln odds (logit) of a patient being in the recurrent group for a one-unit increase in expression of the gene (predictor variable). Although necessary for the LR equation, this value is extremely difficult to interpret. In comparison, the odds ratio (exponent of the coefficient) is easier to interpret and describes the increase/decrease in raw odds of the patient being in the

107 recurrent group for a one-unit increase in gene expression. The odds ratio therefore provided a useful description of the effect on outcome for each gene in the model. These regression coefficients and odds ratios for the single gene models are shown in Table 2.16.

Table 2.16 Variables in the equation for the single gene models

Model Predictor β coefficient Standard Error Odds Ratio (95%CI) Wald sig. 1 AMACR -0.517 0.196 0.596 (0.406-0.877) 0.009 2 HPRT1 -0.550 0.241 0.577 (0.360-0.925) 0.022 3 SDHA 1.423 0.610 4.148 (1.254-13.716) 0.020 4 PLAU 0.769 0.345 2.158 (1.097-4.243) 0.026 5 PTEN -0.313 0.158 0.732 (0.537-0.997) 0.048 6 ABL1 1.095 0.567 2.988 (0.984-9.077) 0.053 7 CDH1 1.054 0.560 2.869 (0.957-8.606) 0.060 8 ATP10B 0.202 0.107 1.224 (0.993-1.509) 0.058 9 PVT1 0.743 0.428 2.102 (0.909-4.863) 0.083 10 ANXA2 0.557 0.314 1.745 (0.943-3.228) 0.076 11 PGC -0.225 0.126 0.799 (0.624-1.022) 0.074 12 EFNA5 -0.368 0.218 0.692 (0.452-1.060) 0.090 13 TRAFD1 0.800 0.521 2.227 (0.802-6.183) 0.125 14 POP5 0.978 0.590 2.659 (0.836-8.451) 0.098 15 DPT 0.342 0.209 1.408 (0.934-2.122) 0.102 16 GUCY1A1 -0.767 0.468 0.464 (0.185-1.163) 0.102 17 PCDH18 -0.454 0.289 0.635 (0.361-1.118) 0.116 18 ITGB1 0.717 0.458 2.048 (0.834-5.029) 0.118 19 CD109 0.617 0.405 1.853 (0.838-4.097) 0.128 20 GSTM3 0.544 0.375 1.723 (0.826-3.593) 0.147 21 ITGA2 0.604 0.430 1.830 (0.787-4.253) 0.160 22 EFNA1 0.682 0.486 1.978 (0.764-5.125) 0.160 23 NDUFAF1 -0.263 0.190 0.769 (0.530-1.116) 0.167 24 PNP -0.438 0.359 0.645 (0.319-1.306) 0.223 25 ANPEP 0.167 0.131 1.182 (0.913-1.529) 0.204 26 C3 0.344 0.270 1.411 (0.832-2.394) 0.202 27 PLLP -0.188 0.148 0.829 (0.620-1.108) 0.205 28 F11R 0.693 0.582 2.000 (0.639-6.258) 0.234 29 FABP5 0.384 0.349 1.468 (0.741-2.910) 0.271 30 NELL2 0.223 0.211 1.250 (0.827-1.889) 0.289 31 ATP5MPL 0.550 0.522 1.734 (0.623-4.827) 0.292 32 SERPINB5 -0.105 0.101 0.901 (0.739-1.097) 0.299 33 CHM 0.422 0.433 1.525 (0.653-3.561) 0.329 34 EZH2 -0.259 0.282 0.772 (0.444-1.340) 0.358 35 TBRG4 0.598 0.650 1.819 (0.509-6.496) 0.357 36 IRF7 -0.177 0.194 0.837 (0.573-1.224) 0.359 37 PTGFR -0.110 0.137 0.895 (0.685-1.170) 0.419 38 ITGB6b -0.105 0.133 0.900 (0.694-1.168) 0.428 39 GSTM5 -0.097 0.124 0.908 (0.712-1.157) 0.434

GS 0.246 0.253 1.279 (0.779-2.098) 0.330

From Table 2.16 it is evident that the odds of the patient being in the recurrent group were decreased as the expression of AMACR increased by one unit (as reflected by the negative sign of the coefficient and an odds ratio of less than one). The odds ratio of 0.596 shows that the odds of the patient being in the recurrent group approximately halved as AMACR expression increased. Therefore, the patient was more likely to be in

108 the recurrent group with a lower expression of AMACR. In comparison, the odds ratio for SDHA is 4.148, reflecting that the odds of the patient being in the recurrent group were multiplied by approximately 4 for every one-unit increase in SDHA expression.

Conversely, at the end of the list, GSTM5 has an odds ratio of 0.908. This value is close to 1, suggesting that the odds of the patient being in the recurrent group were not affected by the expression of GSTM5.

Single gene models ranked by overall percent accuracy The single gene models ordered by the overall percent accuracy are shown in Table 2.17 for comparison with the order of the models ranked by χ2.

The order of the gene models was substantially different to the order according to χ2. To illustrate the difference between the two ranking methods, the rank of each model using each method is shown in the far left-hand columns. The difference in the rank between the two ranking methods is also included (rank change).

It was evident that ranking according to overall percent accuracy pushed some genes such as HPRT1 and PLAU, which were at the top of the list according to the χ2, further down the list. Other genes near the bottom of the list according to χ2, such as CD109, NDUFAF1 and PLLP, were moved to near the top of the list according to overall percent accuracy. Additionally, ranking by the percent accuracy resulted in the presence of genes with high p-values for the Wald significance in the top portion of the list, e.g. CHM and EZH2, and likewise those with relatively low p-values at the bottom of the list, e.g. ANXA2 and ATP10B.

109

Table 2.17 Single gene models ranked in descending order of the overall percent accuracy

Gene(s) Percent Accuracy (%) Wald significance Rank Rank Rank Non- Nagelkerke change (-2LL) (%) 1 2 3 Recurrent Recurrent Overall χ2 df sig. R2 1 2 3 Null Model 68.7 0 1 1 AMACR 33.3 95.7 76.1 7.760 1 0.0053 0.154 0.009 10 12 2 EFNA5 14.3 100.0 73.1 3.089 1 0.0788 0.063 0.090 0 3 3 SDHA 23.8 95.7 73.1 6.236 1 0.0125 0.125 0.020 15 19 4 CD109 9.5 100.0 71.6 2.491 1 0.1145 0.051 0.128 2 7 5 CDH1 14.3 97.8 71.6 3.952 1 0.0468 0.080 0.060 17 23 6 NDUFAF1 14.3 97.8 71.6 1.957 1 0.1618 0.040 0.167 20 27 7 PLLP 14.3 97.8 71.6 1.614 1 0.2040 0.033 0.205 3 5 8 PTEN 19.0 95.7 71.6 4.159 1 0.0414 0.085 0.048 3 6 9 ABL1 19.0 93.5 70.1 4.002 1 0.0455 0.081 0.053 23 33 10 CHM 4.8 100.0 70.1 1.013 1 0.3143 0.021 0.329 11 22 11 EFNA1 9.5 97.8 70.1 2.030 1 0.1542 0.042 0.160 22 34 12 EZH2 4.8 100.0 70.1 0.864 1 0.3526 0.018 0.358 15 28 13 F11R 4.8 100.0 70.1 1.456 1 0.2276 0.030 0.234 2 16 14 GUCY1A1 4.8 100.0 70.1 2.819 1 0.0931 0.058 0.102 13 2 15 HPRT1 14.3 95.7 70.1 6.728 1 0.0095 0.134 0.022 5 21 16 ITGA2 4.8 100.0 70.1 2.061 1 0.1511 0.043 0.160 1 18 17 ITGB1 9.5 97.8 70.1 2.515 1 0.1128 0.052 0.118 6 24 18 PNP 4.8 100.0 70.1 1.876 1 0.1708 0.039 0.223 2 17 19 PCDH18 9.5 97.8 70.1 2.699 1 0.1004 0.055 0.116 9 11 20 PGC 9.5 97.8 70.1 3.298 1 0.0693 0.068 0.074 17 4 21 PLAU 19.0 93.5 70.1 5.489 1 0.0191 0.111 0.026 8 14 22 POP5 4.8 100.0 70.1 2.950 1 0.0859 0.061 0.098 14 9 23 PVT1 9.5 97.8 70.1 3.444 1 0.0635 0.070 0.083 11 13 24 TRAFD1 9.5 97.8 70.1 3.011 1 0.0827 0.062 0.125 0 25 25 ANPEP 0.0 100.0 68.7 1.677 1 0.1954 0.035 0.204 5 31 26 ATP5MPL 0.0 100.0 68.7 1.151 1 0.2833 0.024 0.292 12 15 27 DPT 4.8 97.8 68.7 2.843 1 0.0918 0.058 0.102 1 29 28 FABP5 0.0 100.0 68.7 1.229 1 0.2676 0.026 0.271 9 20 29 GSTM3 0.0 100.0 68.7 2.196 1 0.1384 0.045 0.147 9 39 30 GSTM5 0.0 100.0 68.7 0.605 1 0.4367 0.013 0.434 5 36 31 IRF7 4.8 97.8 68.7 0.835 1 0.3607 0.017 0.359 6 38 32 ITGB6b 0.0 100.0 68.7 0.621 1 0.4306 0.013 0.428 3 30 33 NELL2 0.0 100.0 68.7 1.166 1 0.2801 0.024 0.289 3 37 34 PTGFR 0.0 100.0 68.7 0.649 1 0.4203 0.014 0.419 110

3 32 35 SERPINB5 0.0 100.0 68.7 1.078 1 0.2992 0.022 0.299 1 35 36 TBRG4 0.0 100.0 68.7 0.864 1 0.3525 0.018 0.357 29 8 37 ATP10B 14.3 91.3 67.2 3.712 1 0.0540 0.076 0.058 12 26 38 C3 0.0 97.8 67.2 1.677 1 0.1953 0.035 0.202 29 10 39 ANXA2 0.0 95.7 65.7 3.351 1 0.0672 0.069 0.076

GS 0.0 100.0 68.7 0.966 1 0.3256 0.020 0.330

111

Gleason score added as a predictor variable to single gene models Although GS as a single variable model provided no significant improvement over the null model, it could potentially have been useful in combination with other variables. The LRT was used to ascertain if the addition of the GS to the single gene models led to a significant improvement in their overall fit. Therefore, χ2 in this circumstance, described the difference between the -2LL of the gene only model and the -2LL of the final model (including both the gene and the GS). These χ2 values are listed in Table 2.18.

Table 2.18 The χ2 values showing the effect of adding Gleason score to each single gene model

Gene(s) χ2 df sig. 1 2 3 diff from gene 1 AMACR GS 1.525 1 0.217 HPRT1 GS 0.808 1 0.369 SDHA GS 1.006 1 0.316 PLAU GS 0.791 1 0.374 PTEN GS 0.451 1 0.502 ABL1 GS 1.545 1 0.214 CDH1 GS 1.332 1 0.248 ATP10B GS 0.418 1 0.518 PVT1 GS 0.608 1 0.436 ANXA2 GS 1.837 1 0.175 PGC GS 0.939 1 0.333 EFNA5 GS 0.360 1 0.549 TRAFD1 GS 0.772 1 0.380 POP5 GS 1.655 1 0.198 DPT GS 1.794 1 0.180 GUCY1A1 GS 1.081 1 0.299 PCDH18 GS 0.648 1 0.421 ITGB1 GS 1.284 1 0.257 CD109 GS 1.648 1 0.199 GSTM3 GS 1.542 1 0.214 ITGA2 GS 1.114 1 0.291 EFNA1 GS 1.219 1 0.270 NDUFAF1 GS 0.756 1 0.385 PNP GS 1.002 1 0.317 ANPEP GS 1.373 1 0.241 C3 GS 1.047 1 0.306 PLLP GS 0.782 1 0.377 FABP5 GS 0.838 1 0.360 NELL2 GS 1.238 1 0.266 ATP5MPL GS 1.359 1 0.244 SERPINB5 GS 1.109 1 0.292 CHM GS 1.456 1 0.228 EZH2 GS 1.512 1 0.219 TBRG4 GS 1.288 1 0.256 IRF7 GS 0.907 1 0.341 PTGFR GS 0.924 1 0.336

The addition of GS led to a slight reduction in the -2LL for the final model (shown by a positive value for χ2) as would be expected with the addition of another predictor

112 variable to the model. However, the value for χ2 was not significant for any of the models, which demonstrated that the addition of GS made no significant improvement to any of the single gene models.

ROC curves of the single gene models ROC curves were plotted for the single gene models as a measure of discriminatory accuracy and to identify the most appropriate threshold value for prediction of the patients in either the recurrent or non-recurrent group. The ROC plots are shown for the genes AMACR, HPRT1, SDHA, PLAU and PTEN, which were the top single gene models (according to the χ2)(Figure 2.23). The area under the curve (AUC) values from the plots are shown in Table 2.19 in descending order of the χ2.

Figure 2.23 ROC curves for the best single gene models

113

Table 2.19 AUC values for the single gene models

Genes ranked by AUC χ2 AUC Rank AMACR 0.677 2 HPRT1 0.731 1 SDHA 0.663 6 PLAU 0.664 5 PTEN 0.671 3 ABL1 0.659 7 CDH1 0.628 12 ATP10B 0.639 9 PVT1 0.599 19 ANXA2 0.657 8 PGC 0.667 4 EFNA5 0.585 25 TRAFD1 0.586 24 POP5 0.598 21 DPT 0.635 10 GUCY1A1 0.616 14 PCDH18 0.549 35 ITGB1 0.608 15 CD109 0.622 13 GSTM3 0.630 11 ITGA2 0.588 23 EFNA1 0.601 17 NDUFAF1 0.568 32 PNP 0.580 28 ANPEP 0.602 16 C3 0.601 18 PLLP 0.516 37 F11R 0.578 29 FABP5 0.582 27 NELL2 0.594 22 ATP5MPL 0.569 31 SERPINB5 0.528 36 CHM 0.571 30 EZH2 0.510 38 TBRG4 0.585 26 IRF7 0.467 39 PTGFR 0.599 20 ITGB6b 0.563 33 GSTM5 0.552 34

GS 0.565

114

The single gene model with the highest AUC was HPRT1 at 0.731. AMACR, SDHA, PLAU, PTEN, ABL1, ANXA2 and PGC had lower AUC values, within a narrow range of 0.657- 0.677. It was evident that a significant proportion of the single gene models had AUC values that were only slightly above 0.5, suggesting that their predictive accuracy was only slightly better than a random classification. The single gene models containing EZH2 and IRF7 had AUC values of 0.510 and 0.467 respectively, showing that these genes had no predictive value as single gene models. The ranking of the AUC values did not completely correlate with the ranking by the χ2, although a similar trend was observed with the AUC decreasing alongside the χ2.

The threshold at which the Youden index (J) was highest, which can be used as an optimal cut-off point for classification, is shown for the top 5 single gene models (according to χ2)(Table 2.20). It is important to note that this value simply provided the value at which the combination of sensitivity and specificity was highest and did not always provide the most balanced combination of these parameters. If an alternative threshold would offer a more balanced combination of sensitivity and specificity, this is provided below the threshold at which J is highest.

Table 2.20 Optimal thresholds for the top single gene models

Single gene model Cut-off value Sensitivity Specificity AMACR 0.335 0.571 0.783 AMACR 0.286 0.619 0.674 HPRT1 0.236 1.000 0.391 HPRT1 0.315 0.524 0.804 SDHA 0.275 0.762 0.587 PLAU 0.309 0.571 0.674 PTEN 0.284 0.667 0.674

For the single gene model AMACR, at a threshold of 0.335 (at which J is highest), a sensitivity of 57% and a specificity of 78% could be attained. However, a slightly more balanced combination of these parameters could be found using an alternative threshold of 0.286, providing a 62% sensitivity and 67% specificity.

Similarly, for HPRT1 at a threshold of 0.236 (at which J is highest), the sensitivity and specificity were 100% and 39% respectively. By selecting a different threshold of 0.315, a more balanced combination of sensitivity and specificity of 52% and 80% respectively could be achieved.

115

Multivariate logistic regression analysis

Correlation analysis The final number of two- and three-gene combinations analysed in the multivariate LR were 387 and 1,482 respectively. As there were many correlating genes which could not be put together, these numbers were a significant reduction from the total number of possible gene combinations from the 39 genes. If there had been no correlating genes, the number of novel two- and three-gene combinations would have been 741 and 9,139 respectively. The results of the correlation analysis can be found in the electronic Appendix 8.

Due to the large number of models that were analysed, the complete results for the multivariate LR analysis are provided in the electronic Appendix 8. From these results, a total of 139 two-gene models and 771 three-gene models demonstrated a significant improvement over the null model (p<0.05 for the χ2).

However, unless a two-gene model offers a lower -2LL than could be achieved with a single gene model; or unless a three-gene model offers a lower -2LL than could be achieved by a two-gene model, the use of the model with more predictor variables would not usually be justified.

Therefore, only the two-gene models which improved upon the best single gene model are included in this section. Similarly, only the three-gene models which improved upon the best two-gene model are included. These two- and three-gene models were the top 79 and 22 respectively (according to χ2).

Two-gene models ranked by χ2 The important parameters from the LR output of the top 79 two-gene models are shown in Table 2.21.

116

Table 2.21 Output parameters from the logistic regression analysis of the top two-gene models (listed in descending order of χ2)

Gene(s) Wald Significance Percent Accuracy (%) Nagelkerke Non- 1 2 3 -2LL χ2 df sig. R2 1 2 3 Recurrent recurrent Overall Null Model 83.324 68.7 AMACR SDHA 66.602 16.722 2 0.0002 0.310 0.003 0.009 38.1 91.3 74.6 HPRT1 PLAU 68.547 14.777 2 0.0006 0.278 0.015 0.011 42.9 93.5 77.6 AMACR HPRT1 70.141 13.183 2 0.0014 0.251 0.015 0.038 42.9 95.7 79.1 HPRT1 SDHA 70.351 12.973 2 0.0015 0.247 0.031 0.021 28.6 97.8 76.1 ANXA2 HPRT1 70.355 12.969 2 0.0015 0.247 0.019 0.009 33.3 87.0 70.1 AMACR ATP10B 70.377 12.947 2 0.0015 0.247 0.005 0.029 33.3 89.1 71.6 AMACR DPT 71.229 12.094 2 0.0024 0.232 0.005 0.047 42.9 89.1 74.6 AMACR TRAFD1 71.233 12.091 2 0.0024 0.232 0.005 0.089 47.6 95.7 80.6 ABL1 AMACR 71.301 12.023 2 0.0025 0.231 0.047 0.008 42.9 95.7 79.1 AMACR PVT1 71.883 11.440 2 0.0033 0.221 0.008 0.080 42.9 93.5 77.6 AMACR EFNA5 71.932 11.392 2 0.0034 0.220 0.007 0.069 42.9 91.3 76.1 AMACR GSTM3 71.975 11.349 2 0.0034 0.219 0.005 0.065 38.1 93.5 76.1 AMACR POP5 71.983 11.341 2 0.0034 0.219 0.007 0.075 33.3 95.7 76.1 AMACR CDH1 72.210 11.114 2 0.0039 0.215 0.011 0.084 33.3 91.3 73.1 HPRT1 POP5 72.215 11.109 2 0.0039 0.215 0.020 0.055 23.8 95.7 73.1 EFNA5 SDHA 72.360 10.964 2 0.0042 0.212 0.051 0.012 33.3 95.7 76.1 AMACR ITGB1 72.395 10.929 2 0.0042 0.211 0.007 0.081 33.3 91.3 73.1 CD109 HPRT1 72.397 10.927 2 0.0042 0.211 0.055 0.014 23.8 95.7 73.1 AMACR FABP5 72.646 10.678 2 0.0048 0.207 0.005 0.096 38.1 93.5 76.1 PTEN SDHA 72.679 10.645 2 0.0049 0.206 0.039 0.018 33.3 89.1 71.6 AMACR SERPINB5 72.746 10.577 2 0.0050 0.205 0.005 0.101 33.3 89.1 71.6 AMACR PTEN 72.801 10.523 2 0.0052 0.204 0.016 0.108 28.6 91.3 71.6 PCDH18 SDHA 72.864 10.460 2 0.0054 0.203 0.060 0.016 23.8 95.7 73.1 PCDH18 PLAU 72.945 10.379 2 0.0056 0.202 0.043 0.013 28.6 95.7 74.6 PGC SDHA 72.954 10.370 2 0.0056 0.201 0.047 0.013 23.8 89.1 68.7 ATP10B HPRT1 73.021 10.302 2 0.0058 0.200 0.062 0.026 23.8 93.5 71.6 PLAU PTEN 73.084 10.240 2 0.0060 0.199 0.021 0.034 38.1 89.1 73.1 PGC PLAU 73.106 10.218 2 0.0060 0.199 0.035 0.014 28.6 93.5 73.1 C3 HPRT1 73.166 10.158 2 0.0062 0.198 0.075 0.017 23.8 93.5 71.6 AMACR ANXA2 73.194 10.130 2 0.0063 0.197 0.014 0.131 42.9 93.5 77.6 F11R HPRT1 73.227 10.097 2 0.0064 0.197 0.081 0.012 28.6 95.7 74.6 AMACR PNP 73.275 10.049 2 0.0066 0.196 0.007 0.178 28.6 93.5 73.1 NDUFAF1 PLAU 73.420 9.904 2 0.0071 0.193 0.041 0.010 33.3 91.3 73.1 AMACR ATP5MPL 73.462 9.862 2 0.0072 0.192 0.006 0.162 42.9 93.5 77.6 GUCY1A1 HPRT1 73.563 9.761 2 0.0076 0.190 0.092 0.017 33.3 93.5 74.6 ABL1 PTEN 73.643 9.681 2 0.0079 0.189 0.024 0.022 28.6 91.3 71.6 AMACR PLLP 73.764 9.560 2 0.0084 0.187 0.008 0.186 38.1 91.3 74.6 117

AMACR C3 73.787 9.537 2 0.0085 0.186 0.008 0.188 42.9 91.3 76.1 AMACR CD109 73.810 9.513 2 0.0086 0.186 0.012 0.199 38.1 93.5 76.1 HPRT1 TRAFD1 73.829 9.495 2 0.0087 0.186 0.034 0.172 23.8 97.8 74.6 ATP5MPL HPRT1 73.939 9.385 2 0.0092 0.184 0.120 0.018 23.8 95.7 73.1 DPT HPRT1 73.947 9.377 2 0.0092 0.184 0.114 0.031 28.6 95.7 74.6 AMACR TBRG4 73.948 9.375 2 0.0092 0.183 0.006 0.213 38.1 93.5 76.1 GSTM3 HPRT1 73.991 9.332 2 0.0094 0.183 0.118 0.023 23.8 95.7 73.1 PLAU PLLP 74.082 9.242 2 0.0098 0.181 0.011 0.058 33.3 91.3 73.1 CHM HPRT1 74.105 9.219 2 0.0100 0.181 0.153 0.016 19.0 95.7 71.6 HPRT1 PGC 74.169 9.155 2 0.0103 0.179 0.030 0.124 28.6 93.5 73.1 HPRT1 NELL2 74.301 9.023 2 0.0110 0.177 0.017 0.141 33.3 95.7 76.1 AMACR ITGA2 74.307 9.017 2 0.0110 0.177 0.012 0.272 38.1 91.3 74.6 AMACR PCDH18 74.311 9.013 2 0.0110 0.177 0.016 0.281 38.1 93.5 76.1 ABL1 GUCY1A1 74.312 9.012 2 0.0110 0.177 0.022 0.034 38.1 95.7 77.6 CDH1 PGC 74.317 9.007 2 0.0111 0.177 0.026 0.031 23.8 93.5 71.6 ABL1 HPRT1 74.342 8.982 2 0.0112 0.176 0.145 0.046 23.8 97.8 74.6 AMACR EFNA1 74.364 8.960 2 0.0113 0.176 0.012 0.282 42.9 93.5 77.6 GUCY1A1 TRAFD1 74.380 8.943 2 0.0114 0.176 0.024 0.032 23.8 95.7 73.1 AMACR NELL2 74.412 8.911 2 0.0116 0.175 0.008 0.295 42.9 93.5 77.6 POP5 PTEN 74.491 8.833 2 0.0121 0.174 0.047 0.021 28.6 91.3 71.6 EFNA1 HPRT1 74.539 8.785 2 0.0124 0.173 0.164 0.022 23.8 95.7 73.1 CDH1 HPRT1 74.549 8.775 2 0.0124 0.172 0.171 0.051 14.3 93.5 68.7 CDH1 GUCY1A1 74.603 8.721 2 0.0128 0.171 0.027 0.039 23.8 93.5 71.6 PNP SDHA 74.624 8.700 2 0.0129 0.171 0.212 0.016 28.6 97.8 76.1 AMACR ITGB6b 74.627 8.697 2 0.0129 0.171 0.007 0.331 33.3 91.3 73.1 AMACR NDUFAF1 74.743 8.581 2 0.0137 0.169 0.014 0.370 23.8 89.1 68.7 AMACR CHM 74.746 8.578 2 0.0137 0.169 0.009 0.384 38.1 93.5 76.1 AMACR F11R 74.751 8.573 2 0.0138 0.169 0.012 0.371 38.1 93.5 76.1 GUCY1A1 PLAU 74.765 8.559 2 0.0139 0.169 0.088 0.025 33.3 95.7 76.1 HPRT1 PVT1 74.878 8.446 2 0.0147 0.166 0.045 0.210 28.6 95.7 74.6 HPRT1 ITGB1 74.883 8.441 2 0.0147 0.166 0.034 0.196 23.8 95.7 73.1 AMACR ANPEP 74.958 8.366 2 0.0153 0.165 0.014 0.440 42.9 93.5 77.6 ATP10B PTEN 75.015 8.309 2 0.0157 0.164 0.047 0.036 38.1 87.0 71.6 PLAU SERPINB5 75.050 8.274 2 0.0160 0.163 0.013 0.098 23.8 93.5 71.6 AMACR GSTM5 75.092 8.232 2 0.0163 0.162 0.009 0.490 28.6 93.5 73.1 AMACR IRF7 75.093 8.231 2 0.0163 0.162 0.010 0.491 33.3 91.3 73.1 CD109 PTEN 75.204 8.120 2 0.0172 0.160 0.059 0.025 23.8 93.5 71.6 FABP5 HPRT1 75.222 8.102 2 0.0174 0.160 0.246 0.026 19.0 95.7 71.6 GSTM5 SDHA 75.227 8.097 2 0.0175 0.160 0.172 0.011 28.6 93.5 73.1 PLAU PVT1 75.397 7.927 2 0.0190 0.157 0.043 0.137 28.6 93.5 73.1 GUCY1A1 PVT1 75.472 7.852 2 0.0197 0.155 0.044 0.047 28.6 97.8 76.1 ABL1 PGC 75.475 7.849 2 0.0198 0.155 0.041 0.054 33.3 95.7 76.1

118

The χ2 values for the listed two-gene models were all significant (p<0.05) showing that these gene models fitted the data significantly better than the null model. The χ2 values were higher than those obtained for any of the single gene models (due to the lower -2LL in the two-gene models versus the single gene models). These findings were expected as only the models that provided a significant improvement over the null model and provided a better fit than any of the single gene models were listed.

The top two-gene models were AMACR.SDHA (χ2(2) = 16.7, p<0.0005) and HPRT1.PLAU (χ2(2) = 14.8, p<0.001). These two-gene models offered much higher values for χ2 (and higher significance) than the values found for either of the single gene models nested within them. For instance, AMACR.SDHA fitted the data better than either AMACR (χ2(1) = 7.8, p<0.01) or SDHA (χ2(1) = 6.2, p<0.05).

The Nagelkerke R2 values for the two-gene models were also substantially higher than those found for the single gene models. For example, the Nagelkerke R2 for AMACR.SDHA was 0.31, which suggests that this two-gene model accounted for 31% of the variation in recurrence. In contrast, only 15% and 13% of the variation in recurrence was accounted for by the single gene models of AMACR and SDHA respectively.

The values for the Wald significance showed whether each gene made a significant contribution to the model. For most of the two-gene models presented here, each gene made a significant contribution to the model (p<0.05). However, there were some gene models (particularly those further down the list), where one of the predictors did not make a significant contribution to the model, e.g. ITGA2 in AMACR.ITGA2 (p-value >0.25).

The variables in the LR equations are given for the top 20 two-gene models (Table 2.22). These include the regression coefficients and odds ratios as had been done previously for the single gene models. However, as there is more than one predictor variable in each model, an adjusted odds ratio is given, which is the increase/decrease in the odds for a one-unit increase in expression of the variable when controlling for all other variables in the model.

119

Table 2.22 Variables in the equation for the two-gene models

Model Predictor β coefficient Standard Odds ratio (95%CI) p-value Error 1 AMACR -0.653 0.220 0.521 (0.338-0.802) 0.003 SDHA 1.899 0.726 6.680 (1.610-27.715) 0.009 2 HPRT1 -0.758 0.311 0.469 (0.255-0.863) 0.015 PLAU 1.080 0.422 2.944 (1.286-6.737) 0.011 3 AMACR -0.489 0.202 0.613 (0.413-0.911) 0.015 HPRT1 -0.506 0.244 0.603 (0.374-0.973) 0.038 4 HPRT1 -0.608 0.282 0.545 (0.313-0.947) 0.031 SDHA 1.488 0.642 4.426 (1.257-15.589) 0.021 5 ANXA2 0.843 0.360 2.322 (1.148-4.699) 0.019 HPRT1 -0.698 0.266 0.498 (0.295-0.838) 0.009 6 AMACR -0.591 0.210 0.554 (0.367-0.836) 0.005 ATP10B 0.261 0.119 1.298 (1.027-1.639) 0.029 7 AMACR -0.592 0.210 0.553 (0.367-0.835) 0.005 DPT 0.455 0.229 1.576 (1.005-2.470) 0.047 8 AMACR -0.573 0.202 0.564 (0.379-0.838) 0.005 TRAFD1 0.999 0.588 2.717 (0.858-8.601) 0.089 9 ABL1 1.205 0.608 3.338 (1.014-10.988) 0.047 AMACR -0.543 0.205 0.581 (0.389-0.868) 0.008 10 AMACR -0.526 0.197 0.591 (0.402-0.869) 0.008 PVT1 0.811 0.463 2.251 (0.908-5.577) 0.080 11 AMACR -0.546 0.201 0.579 (0.391-0.859) 0.007 EFNA5 -0.420 0.232 0.657 (0.417-1.034) 0.069 12 AMACR -0.591 0.211 0.554 (0.366-0.838) 0.005 GSTM3 0.727 0.395 2.069 (0.955-4.484) 0.065 13 AMACR -0.551 0.203 0.577 (0.387-0.859) 0.007 POP5 1.144 0.642 3.140 (0.893-11.042) 0.075 14 AMACR -0.507 0.200 0.603 (0.407-0.891) 0.011 CDH1 1.008 0.583 2.741 (0.874-8.603) 0.084 15 HPRT1 -0.662 0.284 0.516 (0.296-0.900) 0.020 POP5 1.368 0.713 3.927 (0.971-15.882) 0.055 16 EFNA5 -0.501 0.257 0.606 (0.367-1.002) 0.051 SDHA 1.716 0.682 5.564 (1.461-21.189) 0.012 17 AMACR -0.552 0.204 0.576 (0.386-0.858) 0.007 ITGB1 0.845 0.485 2.327 (0.900-6.017) 0.081 18 CD109 0.872 0.454 2.391 (0.982-5.823) 0.055 HPRT1 -0.660 0.270 0.517 (0.304-0.877) 0.014 19 AMACR -0.599 0.212 0.549 (0.363-0.832) 0.005 FABP5 0.645 0.387 1.906 (0.893-4.072) 0.096 20 PTEN -0.337 0.164 0.714 (0.518-0.984) 0.039 SDHA 1.485 0.629 4.413 (1.287-15.138) 0.018

For the highest ranking two-gene model AMACR.SDHA, the adjusted odds ratio is 0.521 for AMACR and 6.680 for SDHA. Therefore, the patient was approximately half as likely to be in the recurrent group for a one-unit increase in AMACR and approximately 7 times as likely to be in the recurrent group for a one-unit increase in SDHA.

120

Two-gene models by overall percent accuracy The top two-gene models ordered by overall percent accuracy are shown in Table 2.23.

The ranking of the two-gene models by overall percent accuracy was very different to the ranking by χ2. Some of the gene models with the highest χ2 had been pushed down the list. AMACR.SDHA was the highest two-gene model according to χ2 but was ranked at 43 according to the overall percent accuracy. Importantly, this gene model showed high significance for the Wald statistic for both AMACR and SDHA (p<0.01), indicating that both predictors were significant. Conversely, ABL1.PTGFR was ranked at 149 according to χ2, providing a non-significant improvement over the null model, but was ranked at 13 according to the overall percent accuracy. This gene model contained only one significant predictor variable (ABL1)(p<0.05). The other predictor variable PTGFR (p>0.15) made no significant contribution to the model. The number of models which contained a non-significant predictor variable was higher in the overall percent accuracy list in comparison to the χ2 list.

121

Table 2.23 Two-gene models ranked in descending order of the overall percent accuracy

Gene(s) Percent Accuracy (%) Wald Significance Rank Rank Rank Non- Nagelkerke change (-2LL) (%) 1 2 3 Recurrent Recurrent Overall χ2 df sig. R2 1 2 3 Null Model 68.7 7 8 1 AMACR TRAFD1 47.6 95.7 80.6 12.091 2 0.0024 0.232 0.005 0.089 7 9 2 ABL1 AMACR 42.9 95.7 79.1 12.023 2 0.0025 0.231 0.047 0.008 0 3 3 AMACR HPRT1 42.9 95.7 79.1 13.183 2 0.0014 0.251 0.015 0.038 47 51 4 ABL1 GUCY1A1 38.1 95.7 77.6 9.012 2 0.0110 0.177 0.022 0.034 64 69 5 AMACR ANPEP 42.9 93.5 77.6 8.366 2 0.0153 0.165 0.014 0.440 24 30 6 AMACR ANXA2 42.9 93.5 77.6 10.130 2 0.0063 0.197 0.014 0.131 27 34 7 AMACR ATP5MPL 42.9 93.5 77.6 9.862 2 0.0072 0.192 0.006 0.162 46 54 8 AMACR EFNA1 42.9 93.5 77.6 8.960 2 0.0113 0.176 0.012 0.282 47 56 9 AMACR NELL2 42.9 93.5 77.6 8.911 2 0.0116 0.175 0.008 0.295 0 10 10 AMACR PVT1 42.9 93.5 77.6 11.440 2 0.0033 0.221 0.008 0.080 9 2 11 HPRT1 PLAU 42.9 93.5 77.6 14.777 2 0.0006 0.278 0.015 0.011 67 79 12 ABL1 PGC 33.3 95.7 76.1 7.849 2 0.0198 0.155 0.041 0.054 136 149 13 ABL1 PTGFR 28.6 97.8 76.1 5.763 2 0.0560 0.116 0.032 0.186 24 38 14 AMACR C3 42.9 91.3 76.1 9.537 2 0.0085 0.186 0.008 0.188 24 39 15 AMACR CD109 38.1 93.5 76.1 9.513 2 0.0086 0.186 0.012 0.199 48 64 16 AMACR CHM 38.1 93.5 76.1 8.578 2 0.0137 0.169 0.009 0.384 6 11 17 AMACR EFNA5 42.9 91.3 76.1 11.392 2 0.0034 0.220 0.007 0.069 47 65 18 AMACR F11R 38.1 93.5 76.1 8.573 2 0.0138 0.169 0.012 0.371 0 19 19 AMACR FABP5 38.1 93.5 76.1 10.678 2 0.0048 0.207 0.005 0.096 8 12 20 AMACR GSTM3 38.1 93.5 76.1 11.349 2 0.0034 0.219 0.005 0.065 29 50 21 AMACR PCDH18 38.1 93.5 76.1 9.013 2 0.0110 0.177 0.016 0.281 9 13 22 AMACR POP5 33.3 95.7 76.1 11.341 2 0.0034 0.219 0.007 0.075 20 43 23 AMACR TBRG4 38.1 93.5 76.1 9.375 2 0.0092 0.183 0.006 0.213 157 181 24 C3 GUCY1A1 28.6 97.8 76.1 5.199 2 0.0743 0.105 0.131 0.069 107 132 25 DPT EFNA5 23.8 100.0 76.1 6.155 2 0.0461 0.123 0.092 0.090 107 133 26 EFNA5 GSTM3 23.8 100.0 76.1 6.149 2 0.0462 0.123 0.070 0.094 74 101 27 EFNA5 POP5 23.8 100.0 76.1 7.213 2 0.0272 0.143 0.053 0.061 12 16 28 EFNA5 SDHA 33.3 95.7 76.1 10.964 2 0.0042 0.212 0.051 0.012 37 66 29 GUCY1A1 PLAU 33.3 95.7 76.1 8.559 2 0.0139 0.169 0.088 0.025 48 78 30 GUCY1A1 PVT1 28.6 97.8 76.1 7.852 2 0.0197 0.155 0.044 0.047 17 48 31 HPRT1 NELL2 33.3 95.7 76.1 9.023 2 0.0110 0.177 0.017 0.141 28 4 32 HPRT1 SDHA 28.6 97.8 76.1 12.973 2 0.0015 0.247 0.031 0.021 73 106 33 IRF7 SDHA 33.3 95.7 76.1 7.131 2 0.0283 0.142 0.350 0.020 300 334 34 NELL2 PLLP 23.8 100.0 76.1 2.752 2 0.2526 0.057 0.294 0.213 26 61 35 PNP SDHA 28.6 97.8 76.1 8.700 2 0.0129 0.171 0.212 0.016 122

148 184 36 PGC PLLP 28.6 97.8 76.1 5.157 2 0.0759 0.104 0.065 0.174 50 87 37 ABL1 EFNA5 23.8 97.8 74.6 7.577 2 0.0226 0.150 0.050 0.077 15 53 38 ABL1 HPRT1 23.8 97.8 74.6 8.982 2 0.0112 0.176 0.145 0.046 122 161 39 ABL1 PLLP 28.6 95.7 74.6 5.570 2 0.0617 0.112 0.054 0.211 33 7 40 AMACR DPT 42.9 89.1 74.6 12.094 2 0.0024 0.232 0.005 0.047 8 49 41 AMACR ITGA2 38.1 91.3 74.6 9.017 2 0.0110 0.177 0.012 0.272 5 37 42 AMACR PLLP 38.1 91.3 74.6 9.560 2 0.0084 0.187 0.008 0.186 42 1 43 AMACR SDHA 38.1 91.3 74.6 16.722 2 0.0002 0.310 0.003 0.009 151 195 44 ANPEP EFNA5 19.0 100.0 74.6 5.005 2 0.0819 0.101 0.175 0.086 137 182 45 ANXA2 EZH2 28.6 95.7 74.6 5.183 2 0.0749 0.105 0.046 0.189 74 120 46 ANXA2 GUCY1A1 28.6 95.7 74.6 6.479 2 0.0392 0.130 0.065 0.085 91 138 47 ATP10B PNP 28.6 95.7 74.6 6.021 2 0.0493 0.121 0.046 0.210 144 192 48 ATP10B PTGFR 23.8 97.8 74.6 5.057 2 0.0798 0.102 0.040 0.248 68 117 49 ATP10B PVT1 28.6 95.7 74.6 6.704 2 0.0350 0.134 0.075 0.099 63 113 50 CDH1 EFNA5 23.8 97.8 74.6 6.874 2 0.0322 0.137 0.070 0.109 79 130 51 CDH1 PNP 19.0 100.0 74.6 6.196 2 0.0451 0.124 0.051 0.208 34 86 52 CDH1 PTEN 28.6 95.7 74.6 7.584 2 0.0226 0.150 0.079 0.062 74 127 53 DPT GUCY1A1 28.6 95.7 74.6 6.328 2 0.0423 0.127 0.071 0.073 12 42 54 DPT HPRT1 28.6 95.7 74.6 9.377 2 0.0092 0.184 0.114 0.031 92 147 55 EFNA5 ITGA2 19.0 100.0 74.6 5.783 2 0.0555 0.116 0.076 0.120 120 176 56 EFNA5 NELL2 19.0 100.0 74.6 5.398 2 0.0673 0.109 0.056 0.142 211 268 57 EFNA5 PLLP 28.6 95.7 74.6 3.987 2 0.1362 0.081 0.129 0.342 146 204 58 EFNA5 TBRG4 23.8 97.8 74.6 4.854 2 0.0883 0.098 0.054 0.197 39 98 59 EZH2 PLAU 33.3 93.5 74.6 7.252 2 0.0266 0.144 0.195 0.019 32 92 60 EZH2 SDHA 28.6 95.7 74.6 7.401 2 0.0247 0.147 0.301 0.018 156 217 61 F11R GUCY1A1 23.8 97.8 74.6 4.667 2 0.0969 0.095 0.182 0.082 31 31 62 F11R HPRT1 28.6 95.7 74.6 10.097 2 0.0064 0.197 0.081 0.012 223 286 63 F11R PLLP 23.8 97.8 74.6 3.614 2 0.1641 0.074 0.166 0.146 82 146 64 GSTM3 GUCY1A1 28.6 95.7 74.6 5.786 2 0.0554 0.116 0.094 0.068 30 35 65 GUCY1A1 HPRT1 33.3 93.5 74.6 9.761 2 0.0076 0.190 0.092 0.017 142 208 66 GUCY1A1 PNP 19.0 100.0 74.6 4.808 2 0.0903 0.097 0.095 0.207 30 97 67 GUCY1A1 POP5 23.8 97.8 74.6 7.260 2 0.0265 0.144 0.049 0.051 1 67 68 HPRT1 PVT1 28.6 95.7 74.6 8.446 2 0.0147 0.166 0.045 0.210 29 40 69 HPRT1 TRAFD1 23.8 97.8 74.6 9.495 2 0.0087 0.186 0.034 0.172 148 218 70 ITGA2 PLLP 23.8 97.8 74.6 4.635 2 0.0985 0.094 0.097 0.109 47 24 71 PCDH18 PLAU 28.6 95.7 74.6 10.379 2 0.0056 0.202 0.043 0.013 50 122 72 PGC PTEN 33.3 93.5 74.6 6.451 2 0.0397 0.129 0.133 0.082 106 179 73 PLLP POP5 23.8 97.8 74.6 5.219 2 0.0736 0.105 0.136 0.072 172 246 74 PLLP TRAFD1 23.8 97.8 74.6 4.355 2 0.1133 0.088 0.250 0.141 175 250 75 POP5 PTGFR 19.0 100.0 74.6 4.281 2 0.1176 0.087 0.071 0.246

123

Addition of Gleason score to the two-gene models The effect of adding the GS to each two-gene model is shown in Table 2.24 (here χ2 describes the difference in -2LL between the two-gene model and the final model including two genes and the GS).

Table 2.24 The χ2 values showing the effect of adding Gleason score to each two-gene model

Gene(s) χ2 1 2 3 diff from gene 1+2 df sig. AMACR SDHA GS 1.703 1 0.192 HPRT1 PLAU GS 0.621 1 0.431 AMACR HPRT1 GS 1.282 1 0.258 HPRT1 SDHA GS 0.737 1 0.391 ANXA2 HPRT1 GS 2.098 1 0.148 AMACR ATP10B GS 0.915 1 0.339 AMACR DPT GS 2.806 1 0.094 AMACR TRAFD1 GS 1.191 1 0.275 ABL1 AMACR GS 2.258 1 0.133 AMACR PVT1 GS 1.098 1 0.295 AMACR EFNA5 GS 0.667 1 0.414 AMACR GSTM3 GS 2.230 1 0.135 AMACR POP5 GS 2.293 1 0.130 AMACR CDH1 GS 1.773 1 0.183 HPRT1 POP5 GS 1.383 1 0.240 EFNA5 SDHA GS 0.247 1 0.619 AMACR ITGB1 GS 2.061 1 0.151 CD109 HPRT1 GS 1.417 1 0.234 AMACR FABP5 GS 1.275 1 0.259 PTEN SDHA GS 0.401 1 0.527 AMACR SERPINB5 GS 1.940 1 0.164 AMACR PTEN GS 0.942 1 0.332 PCDH18 SDHA GS 0.291 1 0.590 PCDH18 PLAU GS 0.144 1 0.704 PGC SDHA GS 0.823 1 0.364 ATP10B HPRT1 GS 0.302 1 0.583 PLAU PTEN GS 0.325 1 0.569 PGC PLAU GS 0.693 1 0.405 C3 HPRT1 GS 0.737 1 0.391 AMACR ANXA2 GS 2.300 1 0.129 AMACR PNP GS 1.625 1 0.202 NDUFAF1 PLAU GS 0.505 1 0.477 AMACR ATP5MPL GS 2.153 1 0.142 GUCY1A1 HPRT1 GS 0.898 1 0.343 ABL1 PTEN GS 0.796 1 0.372 AMACR PLLP GS 1.328 1 0.249 AMACR C3 GS 1.695 1 0.193 AMACR CD109 GS 2.275 1 0.132 HPRT1 TRAFD1 GS 0.465 1 0.495 ATP5MPL HPRT1 GS 1.429 1 0.232 DPT HPRT1 GS 1.625 1 0.202 AMACR TBRG4 GS 2.102 1 0.147 GSTM3 HPRT1 GS 1.714 1 0.191 PLAU PLLP GS 0.551 1 0.458 CHM HPRT1 GS 1.283 1 0.257 HPRT1 PGC GS 0.852 1 0.356 HPRT1 NELL2 GS 1.116 1 0.291 AMACR ITGA2 GS 1.646 1 0.200 AMACR PCDH18 GS 1.232 1 0.267 ABL1 GUCY1A1 GS 1.903 1 0.168 CDH1 PGC GS 1.399 1 0.237 ABL1 HPRT1 GS 1.036 1 0.309 AMACR EFNA1 GS 1.804 1 0.179

124

GUCY1A1 TRAFD1 GS 0.822 1 0.365 AMACR NELL2 GS 1.803 1 0.179 POP5 PTEN GS 0.963 1 0.326 EFNA1 HPRT1 GS 0.944 1 0.331 CDH1 HPRT1 GS 1.151 1 0.283 CDH1 GUCY1A1 GS 1.622 1 0.203 PNP SDHA GS 1.025 1 0.311 AMACR NDUFAF1 GS 1.330 1 0.249 AMACR CHM GS 2.187 1 0.139 GUCY1A1 PLAU GS 0.834 1 0.361 HPRT1 PVT1 GS 0.539 1 0.463 HPRT1 ITGB1 GS 0.989 1 0.320 AMACR ANPEP GS 1.799 1 0.180 ATP10B PTEN GS 0.090 1 0.764 PLAU SERPINB5 GS 0.985 1 0.321 AMACR IRF7 GS 1.483 1 0.223 CD109 PTEN GS 0.994 1 0.319 FABP5 HPRT1 GS 0.739 1 0.390 PLAU PVT1 GS 0.537 1 0.464 GUCY1A1 PVT1 GS 0.575 1 0.448 ABL1 PGC GS 1.425 1 0.233

The low significance for the χ2 values showed that the addition of GS as a predictor variable provided no significant improvement to the fit of the two-gene models.

ROC curves of the two-gene models The ROC curves for the top 5 two-gene models (according to χ2) are shown in Figure 2.24, with the full list of AUC values in Table 2.25.

Figure 2.24 ROC curves for the best two-gene models

125

Table 2.25 AUC values for the two-gene models

AUC Gene models ordered by χ2 AUC Rank AMACR SDHA 0.784 5 HPRT1 PLAU 0.808 1 AMACR HPRT1 0.769 7 HPRT1 SDHA 0.757 12 ANXA2 HPRT1 0.788 3 AMACR ATP10B 0.744 20 AMACR DPT 0.755 13 AMACR TRAFD1 0.738 25 ABL1 AMACR 0.744 21 AMACR PVT1 0.706 50 AMACR EFNA5 0.723 35 AMACR GSTM3 0.747 18 AMACR POP5 0.709 45 AMACR CDH1 0.727 32 HPRT1 POP5 0.764 8 EFNA5 SDHA 0.700 56 AMACR ITGB1 0.726 33 CD109 HPRT1 0.764 9 AMACR FABP5 0.722 37 PTEN SDHA 0.706 51 AMACR SERPINB5 0.732 28 AMACR PTEN 0.713 43 PCDH18 SDHA 0.748 17 PCDH18 PLAU 0.704 53 PGC SDHA 0.749 15 ATP10B HPRT1 0.749 16 PLAU PTEN 0.735 26 PGC PLAU 0.726 34 C3 HPRT1 0.793 2 AMACR ANXA2 0.695 63 F11R HPRT1 0.778 6 AMACR PNP 0.707 49 NDUFAF1 PLAU 0.718 38 AMACR ATP5MPL 0.692 67 GUCY1A1 HPRT1 0.710 44 ABL1 PTEN 0.732 29 AMACR PLLP 0.705 52 AMACR C3 0.708 46 AMACR CD109 0.699 57 HPRT1 TRAFD1 0.741 23 ATP5MPL HPRT1 0.755 14 DPT HPRT1 0.745 19 AMACR TBRG4 0.697 60 GSTM3 HPRT1 0.730 31 PLAU PLLP 0.708 47 CHM HPRT1 0.788 4 HPRT1 PGC 0.741 24 HPRT1 NELL2 0.734 27 AMACR ITGA2 0.704 54 AMACR PCDH18 0.674 75 ABL1 GUCY1A1 0.716 41 CDH1 PGC 0.723 36 ABL1 HPRT1 0.744 22 AMACR EFNA1 0.660 79 GUCY1A1 TRAFD1 0.704 55 AMACR NELL2 0.692 68 POP5 PTEN 0.697 61 EFNA1 HPRT1 0.731 30 CDH1 HPRT1 0.761 10 CDH1 GUCY1A1 0.685 71 PNP SDHA 0.671 78 AMACR ITGB6b 0.695 64 AMACR NDUFAF1 0.697 62

126

AMACR CHM 0.674 76 AMACR F11R 0.674 77 GUCY1A1 PLAU 0.685 72 HPRT1 PVT1 0.716 42 HPRT1 ITGB1 0.759 11 AMACR ANPEP 0.699 58 ATP10B PTEN 0.717 40 PLAU SERPINB5 0.689 69 AMACR GSTM5 0.699 59 AMACR IRF7 0.693 66 CD109 PTEN 0.708 48 FABP5 HPRT1 0.718 39 GSTM5 SDHA 0.682 74 PLAU PVT1 0.695 65 GUCY1A1 PVT1 0.683 73 ABL1 PGC 0.688 70

The ROC plots for the two-gene models showed a greater coverage of the left-hand portion of the plot than was found for the single gene models as reflected by higher AUC values (higher discriminatory accuracy). More than 70% of the two-gene models had AUC values >0.7. The top 5 models according to the χ2 had AUC values of AMACR.SDHA(0.784), HPRT1.PLAU(0.808), AMACR.HPRT1(0.769), HPRT1.SDHA(0.757) and ANXA2.HPRT1(0.788).

The threshold at which J is highest is shown for the top 5 two-gene models (Table 2.26). For HPRT1.PLAU, an alternative threshold is given below the value for the highest J, as this threshold provided a better balance of sensitivity and specificity. At the threshold of 0.245 (at which J is highest), HPRT1.PLAU showed a sensitivity and specificity of 91% and 63% respectively. However, a better balance of sensitivity and specificity of 71% and 76% respectively could be achieved using a threshold of 0.335.

Table 2.26 Optimal thresholds for the top two-gene models

Two-gene model Cut-off value Sensitivity Specificity AMACR SDHA 0.316 0.714 0.783 HPRT1 PLAU 0.245 0.905 0.630 HPRT1 PLAU 0.335 0.714 0.761 AMACR HPRT1 0.371 0.619 0.848 HPRT1 SDHA 0.402 0.524 0.935 ANXA2 HPRT1 0.284 0.810 0.739

Three-gene models ranked by χ2 The parameters from the LR output for the three gene-models (ranked by χ2) are shown in Table 2.27.

127

Table 2.27 Output parameters from the logistic regression analysis of the top three-gene models (listed in descending order of χ2)

Gene(s) Wald Significance Percent Accuracy (%) Nagelkerke Non- 1 2 3 -2LL χ2 df sig. R2 1 2 3 Recurrent recurrent Overall Null Model 83.324 68.7 AMACR EFNA5 SDHA 60.841 22.483 3 0.0001 0.401 0.002 0.039 0.004 61.9 89.1 80.6 AMACR HPRT1 SDHA 61.059 22.265 3 0.0001 0.397 0.005 0.049 0.008 52.4 91.3 79.1 AMACR PNP SDHA 63.162 20.162 3 0.0002 0.365 0.002 0.156 0.007 42.9 91.3 76.1 AMACR PCDH18 SDHA 63.182 20.142 3 0.0002 0.365 0.004 0.080 0.005 57.1 93.5 82.1 AMACR SDHA SERPINB5 63.381 19.943 3 0.0002 0.362 0.002 0.007 0.082 57.1 89.1 79.1 AMACR PTEN SDHA 63.461 19.863 3 0.0002 0.361 0.005 0.084 0.008 47.6 93.5 79.1 AMACR GSTM5 SDHA 64.196 19.128 3 0.0003 0.349 0.002 0.121 0.004 52.4 89.1 77.6 AMACR PLLP SDHA 64.806 18.518 3 0.0003 0.339 0.003 0.186 0.009 47.6 91.3 77.6 HPRT1 PGC PLAU 65.015 18.309 3 0.0004 0.336 0.019 0.066 0.007 42.9 89.1 74.6 AMACR HPRT1 POP5 65.046 18.278 3 0.0004 0.335 0.012 0.028 0.039 52.4 93.5 80.6 GUCY1A1 HPRT1 PLAU 65.137 18.187 3 0.0004 0.334 0.077 0.010 0.010 38.1 95.7 77.6 AMACR ATP10B HPRT1 65.174 18.150 3 0.0004 0.333 0.009 0.032 0.045 47.6 93.5 79.1 HPRT1 PLAU PTGFR 65.208 18.116 3 0.0004 0.333 0.008 0.005 0.069 47.6 93.5 79.1 AMACR ITGB6b SDHA 65.257 18.067 3 0.0004 0.332 0.002 0.246 0.008 47.6 89.1 76.1 AMACR ANXA2 HPRT1 65.350 17.974 3 0.0004 0.331 0.031 0.037 0.016 47.6 93.5 79.1 AMACR HPRT1 TRAFD1 65.761 17.563 3 0.0005 0.324 0.007 0.052 0.114 57.1 95.7 83.6 AMACR ATP10B EFNA5 65.903 17.420 3 0.0006 0.322 0.003 0.020 0.051 47.6 91.3 77.6 AMACR IRF7 SDHA 66.150 17.174 3 0.0007 0.318 0.004 0.505 0.009 47.6 91.3 77.6 AMACR DPT HPRT1 66.265 17.059 3 0.0007 0.316 0.009 0.061 0.059 42.9 91.3 76.1 AMACR GSTM3 HPRT1 66.300 17.024 3 0.0007 0.315 0.009 0.059 0.043 38.1 91.3 74.6 AMACR ATP5MPL HPRT1 66.356 16.968 3 0.0007 0.314 0.009 0.069 0.027 47.6 91.3 77.6 AMACR DPT PVT1 66.359 16.965 3 0.0007 0.314 0.004 0.027 0.048 47.6 89.1 76.1

128

These 22 models had been selected for inclusion into the results section as these were the only three-gene models which provided a significant improvement (lower -2LL) over the best performing two-gene model AMACR.SDHA. The χ2 values for the listed three- gene models were higher than the values obtained for any of the listed two-gene models.

The models with the best overall fit were AMACR.EFNA5.SDHA (χ2(3) = 22.5, p<0.0001) and AMACR.HPRT1.SDHA (χ2(3) = 22.3, p<0.0001). Both predictive models contained only predictor variables which each made a significant contribution to the predictive model as shown by the significance of the Wald statistic (p<0.05).

The Naglekerke R2 of AMACR.EFNA5.SDHA and AMACR.HPRT1.SDHA were 0.401 and 0.397 respectively, suggesting that approximately 40% of the variability in recurrence could be explained by either of these models. This was higher than the values for the two-gene models nested within these models. For instance, the Nagelkerke R2 was only 0.310, 0.220 and 0.212 for AMACR.SDHA, AMACR.EFNA5 and EFNA5.SDHA respectively.

Approximately a third of the three-gene models contained genes, which each made a significant contribution to the model (p<0.05 for the Wald significance). The remainder of the models contained one gene that did not make a significant contribution to the model, e.g. AMACR.IRF7.SDHA contained the variable IRF7 (Wald significance of p>0.5) which failed to make any significant contribution to the performance of the simpler model AMACR.SDHA. Although the significance for the Wald statistic gives a good estimation of the contribution of each gene to the model, the LRT is a more accurate measure of performance in logistic regression. Therefore, in Section 2.3.9 the LRT was used to assess the contribution of each variable in the two- and three-gene models by comparing the overall fit between the restricted model (model without the additional variable) and the complex model (model with the additional variable).

The variables in the LR equations for the three-gene models are listed in Table 2.28.

For the highest model AMACR.EFNA5.SDHA, the odds ratios were 0.497, 0.546 and 9.772 for the variables AMACR, EFNA5 and SDHA. Therefore, the odds of a patient being in the recurrent group were multiplied by approximately 10 for a one-unit increase in SDHA

129 expression. Conversely, the odds of the patient being in the recurrent group were approximately halved for a one-unit increase in AMACR or EFNA5 expression.

Three-gene models by overall percent accuracy The three-gene models ranked by overall percent accuracy are listed in Table 2.29.

The order of the three-gene models according to the overall percent accuracy was very different to the order by χ2 as was found for the two-gene models. For instance, the model AMACR.GUCY1A1.PLLP was ranked at number 5 according to overall percent accuracy but was ranked at number 421 according to the χ2. This model contained the variable PLLP, which had a high p-value for the Wald significance (p>0.25). Conversely the second highest three-gene model according to χ2 (AMACR.HPRT1.SDHA) was ranked at number 54 according to the overall percent accuracy. This gene model contained only highly significant variables (p<0.05).

130

Table 2.28 Variables in the equation for the three-gene models

Model Predictor β coefficient Standard Error Odds ratio (95%CI) p-value 1 AMACR -0.699 0.224 0.497 (0.320-0.771) 0.002 EFNA5 -0.606 0.294 0.546 (0.307-0.971) 0.039 SDHA 2.279 0.793 9.772 (2.064-46.262) 0.004 2 AMACR -0.630 0.223 0.532 (0.344-0.825) 0.005 HPRT1 -0.578 0.293 0.561 (0.316-0.996) 0.049 SDHA 2.014 0.754 7.493 (1.708-32.876) 0.008 3 AMACR -0.698 0.225 0.497 (0.320-0.774) 0.002 PNP -0.792 0.558 0.453 (0.152-1.352) 0.156 SDHA 2.135 0.793 8.457 (1.787-40.022) 0.007 4 AMACR -0.646 0.226 0.524 (0.337-0.816) 0.004 PCDH18 -0.776 0.443 0.460 (0.193-1.097) 0.080 SDHA 2.566 0.921 13.019 (2.139-79.237) 0.005 5 AMACR -0.767 0.242 0.464 (0.289-0.746) 0.002 SDHA 1.943 0.721 6.982 (1.698-28.709) 0.007 SERPINB5 -0.205 0.118 0.814 (0.646-1.027) 0.082 6 AMACR -0.631 0.226 0.532 (0.342-0.828) 0.005 PTEN -0.314 0.181 0.731 (0.512-1.043) 0.084 SDHA 1.993 0.750 7.334 (1.687-31.881) 0.008 7 AMACR -0.683 0.226 0.505 (0.324-0.786) 0.002 GSTM5 -0.217 0.140 0.805 (0.612-1.059) 0.121 SDHA 2.219 0.780 9.201 (1.996-42.422) 0.004 8 AMACR -0.675 0.225 0.509 (0.327-0.792) 0.003 PLLP -0.241 0.182 0.786 (0.551-1.123) 0.186 SDHA 1.970 0.752 7.169 (1.642-31.297) 0.009 9 HPRT1 -0.683 0.291 0.505 (0.285-0.894) 0.019 PGC -0.266 0.145 0.766 (0.577-1.018) 0.066 PLAU 1.208 0.451 3.347 (1.382-8.102) 0.007 10 AMACR -0.528 0.209 0.590 (0.392-0.889) 0.012 HPRT1 -0.630 0.287 0.533 (0.303-0.935) 0.028 POP5 1.571 0.761 4.813 (1.083-21.389) 0.039 11 GUCY1A1 -0.924 0.522 0.397 (0.143-1.105) 0.077 HPRT1 -0.767 0.299 0.464 (0.258-0.835) 0.010 PLAU 1.160 0.451 3.190 (1.317-7.728) 0.010 12 AMACR -0.561 0.214 0.571 (0.375-0.868) 0.009 ATP10B 0.263 0.123 1.301 (1.023-1.654) 0.032 HPRT1 -0.510 0.254 0.601 (0.365-0.989) 0.045 13 HPRT1 -0.865 0.328 0.421 (0.221-0.800) 0.008 PLAU 1.375 0.494 3.957 (1.502-10.422) 0.005 PTGFR -0.291 0.160 0.747 (0.546-1.023) 0.069 14 AMACR -0.677 0.224 0.508 (0.328-0.788) 0.002 ITGB6b -0.175 0.151 0.839 (0.624-1.129) 0.246 SDHA 1.964 0.738 7.126 (1.677-30.285) 0.008 15 AMACR -0.454 0.211 0.635 (0.420-0.960) 0.031 ANXA2 0.759 0.364 2.135 (1.046-4.361) 0.037 HPRT1 -0.644 0.266 0.525 (0.312-0.886) 0.016 16 AMACR -0.562 0.210 0.570 (0.378-0.860) 0.007 HPRT1 -0.608 0.313 0.545 (0.295-1.005) 0.052 TRAFD1 0.828 0.523 2.288 (0.821-6.376) 0.114 17 AMACR -0.636 0.218 0.529 (0.346-0.811) 0.003 ATP10B 0.298 0.128 1.347 (1.048-1.730) 0.020 EFNA5 -0.486 0.249 0.615 (0.378-1.001) 0.051 18 AMACR -0.644 0.221 0.525 (0.341-0.809) 0.004 IRF7 -0.140 0.210 0.869 (0.575-1.313) 0.505 SDHA 1.922 0.734 6.838 (1.621-28.847) 0.009 19 AMACR -0.553 0.212 0.575 (0.380-0.871) 0.009 DPT 0.441 0.235 1.555 (0.980-2.466) 0.061 HPRT1 -0.523 0.277 0.593 (0.344-1.020) 0.059 20 AMACR -0.558 0.214 0.572 (0.376-0.871) 0.009 GSTM3 0.796 0.422 2.218 (0.970 5.072) 0.059 HPRT1 -0.558 0.276 0.572 (0.333-0.982) 0.043 21 AMACR -0.544 0.210 0.581 (0.385-0.876) 0.009 ATP5MPL 1.170 0.643 3.221 (0.913-11.360) 0.069 HPRT1 -0.629 0.284 0.533 (0.306-0.930) 0.027 22 AMACR -0.608 0.210 0.545 (0.361-0.822) 0.004 DPT 0.537 0.243 1.710 (1.061-2.756) 0.027 PVT1 0.995 0.503 2.706 (1.009-7.256) 0.048

131

Table 2.29 Three-gene models ranked in descending order of the overall percent accuracy

Gene(s) Percent Accuracy (%) Wald Significance Rank Rank Rank Non- Nagelkerke change (-2LL) (%) 1 2 3 Recurrent Recurrent Overall χ2 df sig. R2 1 2 3 Null Model 25 26 1 AMACR C3 HPRT1 95.7 57.1 83.6 16.527 3 0.0009 0.307 0.016 0.077 0.025 14 16 2 AMACR HPRT1 TRAFD1 95.7 57.1 83.6 17.563 3 0.0005 0.324 0.007 0.052 0.114 118 121 3 AMACR PCDH18 POP5 97.8 52.4 83.6 13.506 3 0.0037 0.257 0.011 0.167 0.055 94 98 4 ABL1 AMACR PLLP 95.7 52.4 82.1 13.845 3 0.0031 0.262 0.046 0.007 0.183 416 421 5 ABL1 GUCY1A1 PLLP 95.7 52.4 82.1 10.200 3 0.0169 0.198 0.022 0.041 0.279 31 37 6 ABL1 GUCY1A1 PTEN 95.7 52.4 82.1 15.924 3 0.0012 0.297 0.008 0.021 0.013 17 24 7 AMACR EFNA5 POP5 93.5 57.1 82.1 16.565 3 0.0009 0.308 0.004 0.037 0.038 47 55 8 AMACR HPRT1 PVT1 95.7 52.4 82.1 15.180 3 0.0017 0.285 0.013 0.077 0.185 5 4 9 AMACR PCDH18 SDHA 93.5 57.1 82.1 20.142 3 0.0002 0.365 0.004 0.080 0.005 84 94 10 AMACR PVT1 TRAFD1 95.7 52.4 82.1 13.890 3 0.0031 0.263 0.005 0.206 0.207 73 84 11 ABL1 AMACR PCDH18 95.7 47.6 80.6 14.122 3 0.0027 0.267 0.037 0.016 0.179 101 113 12 ABL1 GUCY1A1 HPRT1 97.8 42.9 80.6 13.621 3 0.0035 0.258 0.066 0.042 0.052 207 220 13 AMACR ANPEP TRAFD1 95.7 47.6 80.6 12.121 3 0.0070 0.233 0.006 0.861 0.115 36 50 14 AMACR CHM HPRT1 93.5 52.4 80.6 15.441 3 0.0015 0.289 0.017 0.176 0.024 54 69 15 AMACR EFNA1 HPRT1 93.5 52.4 80.6 14.468 3 0.0023 0.273 0.022 0.270 0.035 42 58 16 AMACR EFNA5 PVT1 91.3 57.1 80.6 15.015 3 0.0018 0.282 0.006 0.075 0.084 16 1 17 AMACR EFNA5 SDHA 89.1 61.9 80.6 22.483 3 0.0001 0.401 0.002 0.039 0.004 12 30 18 AMACR EFNA5 TRAFD1 91.3 57.1 80.6 16.265 3 0.0010 0.303 0.003 0.062 0.093 333 352 19 AMACR F11R PLLP 93.5 52.4 80.6 10.787 3 0.0129 0.209 0.011 0.274 0.144 22 42 20 AMACR GSTM3 PVT1 93.5 52.4 80.6 15.793 3 0.0013 0.295 0.004 0.045 0.057 124 145 21 AMACR GSTM5 TRAFD1 93.5 52.4 80.6 13.132 3 0.0044 0.250 0.005 0.307 0.085 12 10 22 AMACR HPRT1 POP5 93.5 52.4 80.6 18.278 3 0.0004 0.335 0.012 0.028 0.039 193 216 23 AMACR IRF7 TRAFD1 95.7 47.6 80.6 12.201 3 0.0067 0.234 0.005 0.739 0.108 91 115 24 AMACR PNP PVT1 93.5 52.4 80.6 13.584 3 0.0035 0.258 0.006 0.206 0.085 153 178 25 AMACR PLLP PVT1 95.7 47.6 80.6 12.650 3 0.0055 0.242 0.007 0.273 0.107 90 116 26 CD109 HPRT1 PGC 95.7 47.6 80.6 13.580 3 0.0035 0.258 0.049 0.019 0.109 47 74 27 EFNA5 PTEN SDHA 95.7 47.6 80.6 14.271 3 0.0026 0.270 0.078 0.071 0.012 313 341 28 GSTM3 GUCY1A1 PTEN 97.8 42.9 80.6 10.888 3 0.0123 0.211 0.071 0.053 0.031 504 533 29 NELL2 PGC PTEN 97.8 42.9 80.6 9.334 3 0.0252 0.183 0.105 0.067 0.066

132

ROC curves of the three-gene models The ROC curves for the top 5 three-gene models (excluding those with non-significant variables, see Section 2.3.9) are shown in Figure 2.25 and the AUC values for the three- gene models are shown in Table 2.30.

Figure 2.25 ROC curves for the best three-gene models

Table 2.30 AUC values for the three-gene models

AUC 2 Gene models ordered by χ AUC Rank AMACR EFNA5 SDHA 0.812 7 AMACR HPRT1 SDHA 0.829 1 AMACR PNP SDHA 0.802 16 AMACR PCDH18 SDHA 0.804 12 AMACR SDHA SERPINB5 0.816 5 AMACR PTEN SDHA 0.803 14 AMACR GSTM5 SDHA 0.788 20 AMACR PLLP SDHA 0.792 17 HPRT1 PGC PLAU 0.819 3 AMACR HPRT1 POP5 0.808 10 GUCY1A1 HPRT1 PLAU 0.813 6 AMACR ATP10B HPRT1 0.807 11 HPRT1 PLAU PTGFR 0.821 2 AMACR ITGB6b SDHA 0.803 15 AMACR ANXA2 HPRT1 0.818 4 AMACR HPRT1 TRAFD1 0.811 8 AMACR ATP10B EFNA5 0.775 21 AMACR IRF7 SDHA 0.791 18 AMACR DPT HPRT1 0.810 9 AMACR GSTM3 HPRT1 0.804 13 AMACR ATP5MPL HPRT1 0.791 19 AMACR DPT PVT1 0.772 22

133

The ROC plots for the three-gene models covered a greater portion of the left-hand corner of the plot than the two-gene models, which was reflected in AUC values of >0.770 for every model, with over 70% of the three-gene models showing AUC values >0.8. The gene model AMACR.HPRT1.SDHA showed the highest value of 0.829.

The optimal cut-off values for the top 5 three-gene models are shown in Table 2.31.

Table 2.31 Optimal thresholds for the top three-gene models

Three-Gene model Cut-off value Sensitivity Specificity AMACR EFNA5 SDHA 0.219 0.810 0.739 AMACR HPRT1 SDHA 0.312 0.810 0.870 AMACR HPRT1 POP5 0.340 0.714 0.870 AMACR ATP10B HPRT1 0.353 0.619 0.848 AMACR ATP10B HPRT1 0.308 0.714 0.739 AMACR ANXA2 HPRT1 0.288 0.810 0.804

The top three-gene models provided relatively high values for both sensitivity and specificity. For example, using a threshold of 0.312, a sensitivity and specificity of 81% and 87% respectively could be achieved using the gene model AMACR.HPRT1.SDHA.

Comparison of single, two-gene and three-gene models

The primary use of the LRT statistic (χ2) was to compare the overall fit of each model with the null model. These values were used as a method to rank the single, two- and three-gene models. As the LRT can be used to compare any two nested models, a secondary use in this study was to compare the fit of each two-gene model with each of its nested single gene models; and to compare the three-gene models with their nested two-gene models. This was essential to evaluate the significance of the improvement in fit attained by the addition of another gene to a model. This could provide evidence as to whether its addition, which makes a relatively more complex model, can be justified.

The χ2 values presented in Table 2.32 show the difference in the overall fit of each two- gene model with the fit of each of its nested single gene models. This value was calculated as the -2LL (single gene model) minus the -2LL (two-gene model).

134

Table 2.32 χ2 values between the two-gene models and their nested single gene models

Gene(s) χ2 χ2 diff. from diff. from 1 2 3 gene 1 df sig. gene 2 df sig. AMACR SDHA 8.962 1 0.003 10.486 1 0.001 HPRT1 PLAU 8.048 1 0.005 9.288 1 0.002 AMACR HPRT1 5.423 1 0.020 6.454 1 0.011 HPRT1 SDHA 6.244 1 0.013 6.737 1 0.009 ANXA2 HPRT1 9.618 1 0.002 6.240 1 0.013 AMACR ATP10B 5.187 1 0.023 9.234 1 0.002 AMACR DPT 4.335 1 0.037 9.252 1 0.002 AMACR TRAFD1 4.331 1 0.037 9.080 1 0.003 ABL1 AMACR 8.021 1 0.005 4.263 1 0.039 AMACR PVT1 3.681 1 0.055 7.997 1 0.005 AMACR EFNA5 3.632 1 0.057 8.303 1 0.004 AMACR GSTM3 3.589 1 0.058 9.153 1 0.003 AMACR POP5 3.581 1 0.058 8.391 1 0.004 AMACR CDH1 3.354 1 0.067 7.162 1 0.008 HPRT1 POP5 4.380 1 0.036 8.159 1 0.004 EFNA5 SDHA 7.875 1 0.005 4.728 1 0.030 AMACR ITGB1 3.169 1 0.075 8.414 1 0.004 CD109 HPRT1 8.436 1 0.004 4.198 1 0.041 AMACR FABP5 2.918 1 0.088 9.449 1 0.002 PTEN SDHA 6.485 1 0.011 4.409 1 0.036 AMACR SERPINB5 2.818 1 0.093 9.500 1 0.002 AMACR PTEN 2.763 1 0.097 6.363 1 0.012 PCDH18 SDHA 7.761 1 0.005 4.224 1 0.040 PCDH18 PLAU 7.680 1 0.006 4.890 1 0.027 PGC SDHA 7.071 1 0.008 4.134 1 0.042 ATP10B HPRT1 6.590 1 0.010 3.574 1 0.059 PLAU PTEN 4.751 1 0.029 6.080 1 0.014 PGC PLAU 6.919 1 0.009 4.729 1 0.030 C3 HPRT1 8.481 1 0.004 3.429 1 0.064 AMACR ANXA2 2.370 1 0.124 6.779 1 0.009 F11R HPRT1 8.641 1 0.003 3.368 1 0.067 AMACR PNP 2.289 1 0.130 8.173 1 0.004 NDUFAF1 PLAU 7.947 1 0.005 4.415 1 0.036 AMACR ATP5MPL 2.102 1 0.147 8.711 1 0.003 GUCY1A1 HPRT1 6.942 1 0.008 3.032 1 0.082 ABL1 PTEN 5.679 1 0.017 5.521 1 0.019 AMACR PLLP 1.800 1 0.180 7.946 1 0.005 AMACR C3 1.777 1 0.183 7.860 1 0.005 AMACR CD109 1.754 1 0.185 7.023 1 0.008 HPRT1 TRAFD1 2.766 1 0.096 6.484 1 0.011 ATP5MPL HPRT1 8.234 1 0.004 2.656 1 0.103 DPT HPRT1 6.534 1 0.011 2.648 1 0.104 AMACR TBRG4 1.616 1 0.204 8.512 1 0.004 GSTM3 HPRT1 7.137 1 0.008 2.604 1 0.107 PLAU PLLP 3.753 1 0.053 7.628 1 0.006 CHM HPRT1 8.206 1 0.004 2.490 1 0.115 HPRT1 PGC 2.426 1 0.119 5.856 1 0.016 HPRT1 NELL2 2.294 1 0.130 7.856 1 0.005 AMACR ITGA2 1.257 1 0.262 6.956 1 0.008 AMACR PCDH18 1.253 1 0.263 6.314 1 0.012 ABL1 GUCY1A1 5.010 1 0.025 6.193 1 0.013 CDH1 PGC 5.055 1 0.025 5.708 1 0.017 ABL1 HPRT1 4.980 1 0.026 2.253 1 0.133 AMACR EFNA1 1.200 1 0.273 6.929 1 0.009 GUCY1A1 TRAFD1 6.125 1 0.013 5.933 1 0.015 AMACR NELL2 1.152 1 0.283 7.745 1 0.005 POP5 PTEN 5.883 1 0.015 4.673 1 0.031 EFNA1 HPRT1 6.754 1 0.009 2.056 1 0.152 CDH1 HPRT1 4.823 1 0.028 2.046 1 0.153 CDH1 GUCY1A1 4.769 1 0.029 5.902 1 0.015 PNP SDHA 6.824 1 0.009 2.464 1 0.117

135

AMACR ITGB6b 0.937 1 0.333 8.076 1 0.005 AMACR NDUFAF1 0.821 1 0.365 6.624 1 0.010 AMACR CHM 0.818 1 0.366 7.565 1 0.006 AMACR F11R 0.813 1 0.367 7.117 1 0.008 GUCY1A1 PLAU 5.740 1 0.017 3.070 1 0.080 HPRT1 PVT1 1.717 1 0.190 5.002 1 0.025 HPRT1 ITGB1 1.712 1 0.191 5.926 1 0.015 AMACR ANPEP 0.606 1 0.436 6.689 1 0.010 ATP10B PTEN 4.596 1 0.032 4.149 1 0.042 PLAU SERPINB5 2.785 1 0.095 7.196 1 0.007 AMACR GSTM5 0.472 1 0.492 7.627 1 0.006 AMACR IRF7 0.471 1 0.493 7.395 1 0.007 CD109 PTEN 5.629 1 0.018 3.960 1 0.047 FABP5 HPRT1 6.873 1 0.009 1.373 1 0.241 GSTM5 SDHA 7.492 1 0.006 1.861 1 0.173 PLAU PVT1 2.438 1 0.118 4.483 1 0.034 GUCY1A1 PVT1 5.033 1 0.025 4.408 1 0.036 ABL1 PGC 3.847 1 0.050 4.550 1 0.033

These χ2 values were positive for all the gene models demonstrating that the overall fit of the two-gene models was an improvement over their individual single gene models (reduction in -2LL with the addition of another gene). Each row in the table shows a different two-gene model; the column “difference from gene 1” shows the χ2 between a single gene model containing only the first gene and the two-gene model. The column “difference from gene 2” shows the χ2 between the single gene model containing only the second gene and the two-gene model. The df is 1 as one predictor variable (gene) was added to each single gene model. The significance for each χ2 value is included.

The table is better described with an example such as AMACR.SDHA. The χ2 between AMACR and AMACR.SDHA was 8.962 with a high significance (p<0.005). This explained that a significant improvement to the fit of the single gene model AMACR was achieved by adding SDHA to form the two-gene model AMACR.SDHA. The χ2 between SDHA and AMACR.SDHA was slightly higher at 10.486 (p<0.005) verifying that the addition of AMACR to the single gene model SDHA led to a significant improvement in its fit. The effect of adding AMACR to SDHA was slightly higher than that achieved by adding SDHA to AMACR (as AMACR had a lower -2LL than SDHA as a single gene model).

For the top 9 genes on the list of two-gene models, it was evident that the addition of each of the genes in the model led to a significant improvement on either of the single gene models. Therefore, these two-gene models are useful in combination. Further down the list there are two-gene models for which one of the genes is clearly not contributing to the model, e.g. AMACR.TBRG4. The addition of TBRG4 to AMACR

136

resulted in a χ2 of 1.616, which was not significant (p-value of 0.204). However, the addition of AMACR to TBRG4 was significant (p<0.005). This was not surprising due to the low -2LL of AMACR as a single gene model. As the fit under AMACR.TBRG4 was no better than under AMACR alone, the addition of TBRG4 would be worthless.

The χ2 between each three-gene model and its nested two-gene models are presented in Table 2.33. For example, for AMACR.EFNA5.SDHA, the column “difference from gene 2+3”, shows the difference in -2LL between EFNA5.SDHA and AMACR.EFNA5.SDHA; the column “difference from gene 1 + 3” shows the difference in -2LL between AMACR.SDHA and AMACR.EFNA5.SDHA and the column “difference from gene 1+2” shows the difference in -2LL between AMACR.EFNA5 and AMACR.EFNA5.SDHA.

Table 2.33 χ2 values between the three-gene models and their nested two-gene models

Gene(s) χ2 χ2 χ2 diff. from diff. from diff. from 1 2 3 gene 2+3 df sig. gene 1+3 df sig. gene 1+2 df sig. AMACR EFNA5 SDHA 11.519 1 0.001 5.761 1 0.016 11.091 1 0.001 AMACR HPRT1 SDHA 9.292 1 0.002 5.543 1 0.019 9.082 1 0.003 AMACR PNP SDHA 11.462 1 0.001 3.44 1 0.064 10.113 1 0.002 AMACR PCDH18 SDHA 9.682 1 0.002 3.42 1 0.064 11.129 1 0.001 AMACR SDHA SERPINB5 12.735 1 0.000 9.365 1 0.002 3.221 1 0.073 AMACR PTEN SDHA 9.218 1 0.002 3.141 1 0.076 9.34 1 0.002 AMACR GSTM5 SDHA 11.031 1 0.001 2.406 1 0.121 10.896 1 0.001 AMACR PLLP SDHA 10.805 1 0.001 1.796 1 0.180 8.958 1 0.003 HPRT1 PGC PLAU 8.091 1 0.005 3.532 1 0.060 9.154 1 0.003 AMACR HPRT1 POP5 7.169 1 0.007 6.937 1 0.008 5.095 1 0.024 GUCY1A1 HPRT1 PLAU 3.410 1 0.065 9.628 1 0.002 8.426 1 0.004 AMACR ATP10B HPRT1 7.847 1 0.005 4.967 1 0.026 5.203 1 0.023 HPRT1 PLAU PTGFR 10.874 1 0.001 10.435 1 0.001 3.339 1 0.068 AMACR ITGB6b SDHA 10.897 1 0.001 1.345 1 0.246 9.37 1 0.002 AMACR ANXA2 HPRT1 5.005 1 0.025 4.791 1 0.029 7.844 1 0.005 AMACR HPRT1 TRAFD1 8.068 1 0.005 5.472 1 0.019 4.38 1 0.036 AMACR ATP10B EFNA5 10.045 1 0.002 6.029 1 0.014 4.474 1 0.034 AMACR IRF7 SDHA 10.043 1 0.002 0.452 1 0.501 8.943 1 0.003 AMACR DPT HPRT1 7.682 1 0.006 3.876 1 0.049 4.964 1 0.026 AMACR GSTM3 HPRT1 7.691 1 0.006 3.841 1 0.050 5.675 1 0.017 AMACR ATP5MPL HPRT1 7.583 1 0.006 3.785 1 0.052 7.106 1 0.008 AMACR DPT PVT1 9.728 1 0.002 5.524 1 0.019 4.87 1 0.027

The data in Table 2.33 show that the highest three-gene models AMACR.EFNA5.SDHA and AMACR.HPRT1.SDHA contain good combinations of genes. The addition of AMACR to EFNA5.SDHA resulted in a χ2 of 11.519 (p<0.001), the addition of EFNA5 to AMACR. SDHA resulted in a χ2 of 5.761 (p<0.05) and the addition of SDHA to AMACR.EFNA5 resulted in a χ2 of 11.091 (p<0.001). However, realistically it is important to know the improvement that can be achieved from the best of the nested two-gene models. Here,

137

AMACR.SDHA had the lowest -2LL of the nested models, so the improvement attained by adding EFNA5 was more modest with a χ2 = 5.761 (p<0.05). However, the improvement was significant so the addition of EFNA5 to AMACR.SDHA may be justified.

For some of the other three-gene models, although there was a reduction in the -2LL in the three-gene model in comparison to the nested two-gene model (positive value for χ2), the improvement in fit was not significant. For example, the addition of IRF7 to AMACR.SDHA provided virtually no improvement (p>0.5), therefore AMACR.IRF7.SDHA would not be considered an improvement over AMACR.SDHA.

After eliminating any two- and three-gene models that contained one or more non- significant predictors (p>0.05) according to this method, 28 two gene-models and 9 three-gene models remained (Table 2.34 and Table 2.35). These models are the only two- and three-gene models listed which offer a significant improvement over their nested single and two-gene models respectively.

138

Table 2.34 Two-gene models with significant predictor variables

Gene(s) Wald Significance Percent Accuracy (%) Nagelkerke Non- 1 2 3 -2LL χ2 df sig. R2 1 2 3 Recurrent recurrent Overall Null Model 83.324 68.7 AMACR SDHA 66.602 16.722 2 0.0002 0.310 0.003 0.009 38.1 91.3 74.6 HPRT1 PLAU 68.547 14.777 2 0.0006 0.278 0.015 0.011 42.9 93.5 77.6 AMACR HPRT1 70.141 13.183 2 0.0014 0.251 0.015 0.038 42.9 95.7 79.1 HPRT1 SDHA 70.351 12.973 2 0.0015 0.247 0.031 0.021 28.6 97.8 76.1 ANXA2 HPRT1 70.355 12.969 2 0.0015 0.247 0.019 0.009 33.3 87.0 70.1 AMACR ATP10B 70.377 12.947 2 0.0015 0.247 0.005 0.029 33.3 89.1 71.6 AMACR DPT 71.229 12.094 2 0.0024 0.232 0.005 0.047 42.9 89.1 74.6 AMACR TRAFD1 71.233 12.091 2 0.0024 0.232 0.005 0.089 47.6 95.7 80.6 ABL1 AMACR 71.301 12.023 2 0.0025 0.231 0.047 0.008 42.9 95.7 79.1 HPRT1 POP5 72.215 11.109 2 0.0039 0.215 0.020 0.055 23.8 95.7 73.1 EFNA5 SDHA 72.360 10.964 2 0.0042 0.212 0.051 0.012 33.3 95.7 76.1 CD109 HPRT1 72.397 10.927 2 0.0042 0.211 0.055 0.014 23.8 95.7 73.1 PTEN SDHA 72.679 10.645 2 0.0049 0.206 0.039 0.018 33.3 89.1 71.6 PCDH18 SDHA 72.864 10.460 2 0.0054 0.203 0.060 0.016 23.8 95.7 73.1 PCDH18 PLAU 72.945 10.379 2 0.0056 0.202 0.043 0.013 28.6 95.7 74.6 PGC SDHA 72.954 10.370 2 0.0056 0.201 0.047 0.013 23.8 89.1 68.7 PLAU PTEN 73.084 10.240 2 0.0060 0.199 0.021 0.034 38.1 89.1 73.1 PGC PLAU 73.106 10.218 2 0.0060 0.199 0.035 0.014 28.6 93.5 73.1 NDUFAF1 PLAU 73.420 9.904 2 0.0071 0.193 0.041 0.010 33.3 91.3 73.1 ABL1 PTEN 73.643 9.681 2 0.0079 0.189 0.024 0.022 28.6 91.3 71.6 ABL1 GUCY1A1 74.312 9.012 2 0.0110 0.177 0.022 0.034 38.1 95.7 77.6 CDH1 PGC 74.317 9.007 2 0.0111 0.177 0.026 0.031 23.8 93.5 71.6 GUCY1A1 TRAFD1 74.380 8.943 2 0.0114 0.176 0.024 0.032 23.8 95.7 73.1 POP5 PTEN 74.491 8.833 2 0.0121 0.174 0.047 0.021 28.6 91.3 71.6 CDH1 GUCY1A1 74.603 8.721 2 0.0128 0.171 0.027 0.039 23.8 93.5 71.6 ATP10B PTEN 75.015 8.309 2 0.0157 0.164 0.047 0.036 38.1 87.0 71.6 CD109 PTEN 75.204 8.120 2 0.0172 0.160 0.059 0.025 23.8 93.5 71.6 GUCY1A1 PVT1 75.472 7.852 2 0.0197 0.155 0.044 0.047 28.6 97.8 76.1

139

Table 2.35 Three-gene models with significant predictor variables

Gene(s) Wald Significance Percent Accuracy (%) Nagelkerke Non- 1 2 3 -2LL χ2 df sig. R2 1 2 3 Recurrent recurrent Overall AMACR EFNA5 SDHA 60.841 22.483 3 0.0001 0.401 0.002 0.039 0.004 61.9 89.1 80.6 AMACR HPRT1 SDHA 61.059 22.265 3 0.0001 0.397 0.005 0.049 0.008 52.4 91.3 79.1 AMACR HPRT1 POP5 65.046 18.278 3 0.0004 0.335 0.012 0.028 0.039 52.4 93.5 80.6 AMACR ATP10B HPRT1 65.174 18.150 3 0.0004 0.333 0.009 0.032 0.045 47.6 93.5 79.1 AMACR ANXA2 HPRT1 65.350 17.974 3 0.0004 0.331 0.031 0.037 0.016 47.6 93.5 79.1 AMACR HPRT1 TRAFD1 65.761 17.563 3 0.0005 0.324 0.007 0.052 0.114 57.1 95.7 83.6 AMACR ATP10B EFNA5 65.903 17.420 3 0.0006 0.322 0.003 0.020 0.051 47.6 91.3 77.6 AMACR DPT HPRT1 66.265 17.059 3 0.0007 0.316 0.009 0.061 0.059 42.9 91.3 76.1 AMACR DPT PVT1 66.359 16.965 3 0.0007 0.314 0.004 0.027 0.048 47.6 89.1 76.1

140

The different steps involved in the data analysis are summarised in Figure 2.26, which illustrates how the predictive models were selected, and then gradually pared down to those with the best overall performance in terms of the χ2 whilst retaining parsimony.

Figure 2.26 Flow diagram to show predictive modelling process

141

Distribution of genes in the top gene models

By graphically representing the prevalence of each gene in the highest-ranking models (by χ2), genes of interest were identified which may be potential progression markers of prostate cancer with use as predictors in LR models of recurrence. It should be reiterated that the prevalence plots included all gene models in the selected fractions. The prevalence plots were produced before any models containing non-significant variables were removed.

The plots give an insight into the distribution of the genes in the models with the highest values for χ2 and illustrate how the prevalence changes as the selected divisions are narrowed towards the highest percentiles (~top 2%). The prevalence of the highest models according to overall percent accuracy were plotted in the same manner for comparison and to illustrate the differences in the best performing models between the two ranking methods. The four plots are included to allow comparisons between the ranking methods and between the two- and three-gene models. The genes are listed in order of increasing p-value for the Wald significance from the univariate LR analysis. The plots could potentially identify genes, even those with higher than conventional p-values for the Wald significance, that may be useful in combination with other genes, as evidenced by a significant presence in the best performing models.

Figure 2.27 and Figure 2.28 show the distribution of genes in the top two-gene models ranked by χ2 and overall percent accuracy. Figure 2.29 and Figure 2.30 show the distribution of genes in the top three-gene models ranked by χ2 and overall percent accuracy.

The plots show the percentage prevalence of each gene in the highest 19.4% (Fraction 1), 9.3% (Fraction 2) and 2.8% (Fraction 3) of the two-gene models (Figure 2.27 and Figure 2.28) and the highest 12.7% (Fraction 1), 6.1% (Fraction 2) and 2.0% (Fraction 3) of the three-gene models (Figure 2.29 and Figure 2.30).

The percentage prevalence of each gene is illustrated by blue, orange and grey bars for Fraction 1, 2 and 3 respectively. A marker has been included to identify the genes with p-values <0.05.

142

The plots illustrate the high prevalence of AMACR (which had the lowest p-value for the Wald statistic of 0.009) in comparison to any other genes for both the two- and three- gene models ranked by both methods.

The prevalence is lower for HPRT1 and SDHA which also had a Wald significance of p<0.025. The prevalence of these two genes is greater in the χ2 plot in comparison to the overall percent accuracy plot. For example, HPRT1 is approximately twice as prevalent in the top fraction (Fraction 3) in the χ2 compared to the overall percent accuracy plot. The most prevalent genes in the top models according to χ2 are generally those with p-values <0.05. The prevalence tends to reduce after this point, with low prevalence associated with genes with p-values >0.05. In comparison, the prevalence of many of the genes with p-values >0.05 is relatively high in the top models according to overall percent accuracy. The prevalence of some of these genes is as high as the genes with highly significant p-values (<0.05). This pattern is accentuated in the three-gene models ranked by overall percent accuracy. It is also important to note the increased prevalence of some of the genes with p-values >0.05 in the three-gene models compared with the two-gene models ranked by χ2. The prevalence tends to remain relatively high for genes with p-values up to 0.15 (up to GSTM3), a finding which is not evident in the two-gene models.

It is also clear from the χ2 plots that the genes with the highest significance for the Wald statistic (p<0.05), which are those with the highest overall prevalence, show the greatest prevalence in the upper fraction (top ~2%).

143

P=0.05

Figure 2.27 Distribution of the genes in the highest two-gene models according to χ2

144

P=0.05

Figure 2.28 Distribution of the genes in the highest two-gene models according to overall percent accuracy

145

P=0.05

Figure 2.29 Distribution of the genes in the highest three-gene models according to χ2

146

P=0.05

Figure 2.30 Distribution of the genes in the highest three-gene models according to overall percent accuracy

147

Comparison to the pilot study

Independent samples t-tests for the prediction of GS groups As predictive models of disease outcome are more clinically relevant than predictive models of high GS (≥7), the current study (BB study) focused on the identification of predictive models of recurrence.

A logistic regression analysis using recurrence as the dependent variable had also been included in the pilot study, but the limited follow-up data available at that time decreased the validity of the identified models of recurrence. Therefore, the pilot study had focused on the identification of models, which could categorise prostate cancer patients according to GS (≥7 vs <7).

The BB study focused on identifying predictive models of recurrence. However, for comparison with the SL study, independent samples t-tests were also performed on the datasets to compare gene expression between GS ≥7 and <7.

The list of genes showing differential expression between the groups to a conventional level of significance (p<0.05)(both Pfaffl and 2-ΔCt datasets) are shown for each study (Table 2.36).

Table 2.36 Differentially expressed genes (p<0.05) between GS ≥7 and <7 in the SL and BB study

SL Study BB Study Gene Pfaffl Delta Ct Gene Pfaffl Delta Ct ABL1 0.037 * 0.028 * ↓ BRCA2 0.021 * 0.024 * ↓ ANPEP 0.005 ** 0.011 * ↓ EFNA2 0.019 * 0.019 * ↑ CD9 0.028 * 0.018 * ↓ GPR19 0.024 * 0.024 * ↑ EFNA1 0.006 ** 0.004 ** ↓ INMT 0.041 * 0.037 * ↓ GPM6A 0.015 * 0.043 * ↓ RHBDL1 0.019 * 0.019 * ↑ HSPB1 0.025 * 0.037 * ↓ VASH1 0.009 ** 0.011 * ↑ INMT 0.015 * 0.031 * ↓ ITGB4 0.021 * 0.017 * ↓ NELL2 0.018 * 0.015 * ↓ PSCA 0.024 * 0.044 * ↓

PSA 0.346 PSA 0.744 * p<0.05,** p<0.01 ↑/↓ illustrates increased/decreased expression in the GS ≥7 group

148

In the SL study, 10 genes were identified as being significantly differentially expressed between GS ≥7 and <7 (in both datasets). EFNA1 was significant to a 99% confidence level (p<0.01) and ABL1, ANPEP, CD9, GPM6A, HSPB1, INMT, ITGB4, NELL2 and PSCA were significant to a 95% confidence level (p<0.05). All these genes were found to show significantly lower expression in GS ≥7 versus <7.

The BB study found 6 genes to be differentially expressed between the groups to a conventional significance level (p<0.05). These genes were BRCA2, EFNA2, GPR19, INMT, RHBDL1 and VASH1. The genes BRCA2 and INMT showed significantly lower expression in GS ≥7 vs <7. EFNA2, GPR19, RHBDL1 and VASH1 showed significantly higher expression in GS ≥7 vs <7.

The only gene found to be significantly differentially expressed according to GS (p<0.05) in both studies was INMT, which showed reduced expression in the GS ≥7 group (in both studies).

From the genes that were significantly differentially expressed in the SL study, only NELL2, PSCA, ANPEP and CD9 showed p-values <0.5 in the BB study (ANPEP and CD9 had values <0.25). The remainder of the significant genes in the SL study (ABL1, EFNA1, HSPB1, ITGB4) had p-values of >0.5 in the BB study and GPM6A was one of the excluded genes (not expressed).

The genes ANPEP, CD9, and NELL2 showed a lower expression in the high GS group in both studies. PSCA showed higher expression in the high GS group in the BB study but lower expression in this group in the SL study.

Apart from INMT, the genes which were differentially expressed in the BB study (p<0.05) (BRCA2, EFNA2, GPR19 and RHBDL1 and VASH1) were of no significance in the SL study (p>0.5). The genes BRCA2, EFNA2 and GPR19 had not been expressed (therefore excluded) and the genes RHBDL1 and VASH1 had p-values >0.5 in the SL study.

Altogether, a total of 25 genes showed p-values <0.5 in both studies in the independent samples t-tests comparing GS groups (Table 2.37).

149

Table 2.37 Genes with p-values <0.5 in both studies for the independent samples t-tests (GS ≥7 vs <7)

SL Study BB Study Gene Pfaffl Delta Ct Pfaffl Delta Ct ANPEP 0.005 ** 0.011 * ↓ 0.240 0.237 ↓ APOBEC3G 0.249 0.253 ↓ 0.309 0.338 ↑ ASB2 0.061 0.161 ↓ 0.081 0.081 ↓ CANX 0.109 0.357 ↓ 0.429 0.430 ↓ CD44 0.131 0.316 ↓ 0.111 0.113 ↓ CD9 0.028 * 0.018 * ↓ 0.152 0.133 ↓ CDC42EP3 0.059 0.055 ↓ 0.206 0.153 ↓ CDKN1C 0.009 ** 0.086 ↓ 0.299 0.304 ↓ CHM 0.113 0.383 ↓ 0.226 0.223 ↓ EZH2 0.106 0.376 ↓ 0.111 0.119 ↑ F11R 0.170 0.239 ↓ 0.063 0.063 ↓ FABP5 0.132 0.360 ↓ 0.433 0.433 ↑ GSTM3 0.056 0.040 * ↓ 0.146 0.136 ↓ HOXC6 0.131 0.177 ↑ 0.093 0.092 ↑ INMT 0.015 * 0.031 * ↓ 0.041 * 0.037 * ↓ ITGA2 0.133 0.217 ↓ 0.357 0.267 ↓ ITGA3 0.178 0.231 ↓ 0.249 0.264 ↓ ITGA5 0.150 0.441 ↓ 0.308 0.309 ↓ NELL2 0.018 * 0.015 * ↓ 0.250 0.248 ↓ NUP50 0.276 0.342 ↓ 0.387 0.386 ↑ PCDH18 0.491 0.444 ↓ 0.324 0.320 ↓ PSCA 0.024 * 0.044 * ↓ 0.452 0.456 ↑ RAP1GAP 0.234 0.490 ↓ 0.105 0.107 ↓ TBRG4 0.054 0.278 ↓ 0.155 0.166 ↓ TRAFD1 0.072 0.070 ↓ 0.207 0.186 ↑ * p<0.05,** p<0.01 ↑/↓ illustrates increased/decreased expression in the GS ≥7 group

In the SL study, all the genes showed lower expression in the GS ≥7 group compared with the <7 group except HOXC6, which showed higher expression. HOXC6 also showed higher expression in the GS ≥7 group in the BB study. In the BB study, most of the genes showed lower expression in the GS ≥7 group as had been found in the SL study, except APOBEC3G, EZH2, FABP5, NUP50, PSCA and TRAFD1 which showed higher expression in the GS ≥7 group (different direction to the SL study).

The following section compares the results between the two studies of the independent sample t-tests used for the comparison of gene expression between the recurrent and non-recurrent group.

Independent samples t-tests comparing the recurrent and non- recurrent group The results of the t-tests comparing the recurrent and non-recurrent group are presented for both studies (Table 2.38).

150

Table 2.38 Independent samples t-tests comparing gene expression between the recurrent and non- recurrent group in the SL and BB study

SL Study BB Study Gene Pfaffl Delta Ct Gene Pfaffl Delta Ct @18S 0.223 0.325 ↓ @18S 0.528 0.384 ABL1 0.301 0.373 ↓ ABL1 0.045 * 0.073 ↑ ALCAM 0.333 0.637 ALCAM 0.829 0.990 ALPG N N ALPG N N AMACR 0.770 0.334 AMACR 0.018 * 0.023 * ↓ ANPEP 0.030 * 0.041 * ↓ ANPEP 0.205 0.214 ↑ ANXA2 0.987 0.569 ANXA2 0.072 0.069 ↑ APOBEC3G 0.147 0.149 ↓ APOBEC3G 0.881 0.773 ASB2 0.255 0.421 ↓ ASB2 0.894 0.894 ATP5MPL 0.849 0.659 ATP5MPL 0.295 0.439 ↑ ATP10B N N ATP10B 0.054 0.049 * ↑ BRCA2 N N BRCA2 0.417 0.516 C3 C3 0.202 0.264 ↑ CANX 0.531 0.954 CANX 0.571 0.523 CD109 0.343 0.398 ↓ CD109 0.125 0.099 ↑ CD4 0.146 0.230 ↓ CD4 N N CD44 0.100 0.089 ↓ CD44 0.992 0.757 CD9 0.141 0.121 ↓ CD9 0.736 0.938 CDC42EP3 0.115 0.067 ↓ CDC42EP3 0.751 0.956 CDH1 0.496 0.394 ↑ CDH1 0.052 0.049 * ↑ CDH2 0.759 0.260 CDH2 CDKN1C 0.094 0.287 ↓ CDKN1C 0.713 0.762 CHM 0.373 0.755 CHM 0.331 0.402 ↑ CRHR1 0.840 0.659 CRHR1 0.416 0.617 CTNNA1 0.569 0.995 CTNNA1 CYP11A1 N N CYP11A1 0.846 0.841 DPT 0.215 0.241 ↓ DPT 0.098 0.120 ↑ EDNRA 0.281 0.523 EDNRA 0.789 0.768 EEF1A1 0.211 0.315 ↓ EEF1A1 0.193 0.855 EFNA1 0.242 0.429 ↓ EFNA1 0.155 0.165 ↑ EFNA2 N N EFNA2 0.948 0.870 EFNA3 N N EFNA3 0.783 0.680 EFNA4 N N EFNA4 0.958 0.800 EFNA5 0.409 0.804 EFNA5 0.151 0.143 ↓ ERBB2 ERBB2 0.465 0.526 ERG 0.346 0.214 ↑ ERG 0.699 0.741 EZH2 0.881 0.670 EZH2 0.451 0.411 ↓ F11R 0.931 0.671 F11R 0.235 0.224 ↑ FABP5 0.343 0.595 FABP5 0.274 0.270 ↑ FADS1 0.406 0.882 FADS1 0.571 0.453 GPM6A 0.171 0.281 ↓ GPM6A N N GPR12 N N GPR12 0.902 0.783 GPR19 N N GPR19 0.923 0.897 GSTM3 0.159 0.114 ↓ GSTM3 0.146 0.262 ↑ GSTM5 N N GSTM5 0.438 0.389 ↓ GUCY1A1 0.434 0.909 GUCY1A1 0.149 0.056 ↓ HOXC6 0.344 0.416 ↑ HOXC6 0.739 0.720 HPN 0.665 0.683 HPN 0.636 0.985 HPRT1 0.791 0.647 HPRT1 0.008 ** 0.025 * ↓ HPX HPX 0.874 0.632 HSPA5 0.672 0.671 HSPA5 HSPB1 0.048 0.050 ↓ HSPB1 0.541 0.558 INMT 0.006 ** 0.010 * ↓ INMT 0.416 0.507 IRF7 0.371 0.744 IRF7 0.356 0.342 ↓ ITGA2 0.354 0.611 ITGA2 0.244 0.391 ↑ ITGA3 0.391 0.449 ↓ ITGA3 0.880 0.732 ITGA5 0.461 0.807 ITGA5 0.445 0.525 ITGB1 0.448 0.745 ITGB1 0.115 0.219 ↑ ITGB4 0.190 0.369 ↓ ITGB4b 0.413 0.513 ITGB6 0.314 0.586 ITGB6b 0.433 0.398 ↓ KCNN4 0.574 0.609 KCNN4 KLHL5 0.383 0.565 KLHL5 0.702 0.625

151

MKI67 MKI67 0.728 0.829 PVALB 0.161 0.172 ↓ PVALB 0.857 0.849 MMP2 0.186 0.303 ↓ MMP2 0.798 0.634 MMP9 0.831 0.116 MMP9 MNX1 N N MNX1 0.517 0.758 MT1X MT1X 0.940 0.828 MUC1 MUC1 0.702 0.879 MYC 0.126 0.271 ↓ MYC NDUFAF1 0.371 0.577 NDUFAF1 0.154 0.183 ↓ NELL2 0.096 0.080 ↓ NELL2 0.292 0.344 ↑ PNP 0.413 0.340 ↑ PNP 0.322 0.297 ↓ NUP50 0.729 0.817 NUP50 0.668 0.681 PCDH18 0.706 0.667 PCDH18 0.197 0.206 ↓ PGC 0.713 0.587 PGC 0.070 0.062 ↓ PLAU PLAU 0.020 * 0.023 * ↑ PLCL2 0.963 0.827 PLCL2 0.874 0.967 PLLP 0.246 0.350 ↓ PLLP 0.303 0.250 ↓ POP5 0.604 0.940 POP5 0.088 0.114 ↑ PRIM2A 0.249 0.100 ↓ PRIM2 0.624 0.782 PSCA 0.232 0.327 ↓ PSCA 0.879 0.952 PTEN PTEN 0.038 * 0.030 * ↓ PTGFR 0.845 0.163 PTGFR 0.422 0.393 ↓ PVT1 0.725 0.956 PVT1 0.073 0.096 ↑ RAP1GAP 0.469 0.769 RAP1GAP 0.508 0.490 RBM12 0.502 0.760 RBM12 RHBDL1 0.900 0.467 RHBDL1 0.480 0.529 SDHA 0.206 0.309 ↓ SDHA 0.013 * 0.046 * ↑ SFPQ 0.219 0.273 ↓ SFPQ SERPINB5 N N SERPINB5 0.360 0.368 ↓ SLC44A2 0.777 0.712 SLC44A2 0.513 0.534 PRELID3A N N PRELID3A 0.792 0.976 TBP 0.476 0.962 TBP 0.812 0.962 TBRG4 0.539 0.756 TBRG4 0.362 0.297 ↑ TERT N N TERT 0.854 0.824 TF TF 0.476 0.553 TGFB2 N N TGFB2 0.534 0.408 TGM2 0.342 0.410 ↓ TGM2 HSP90B1 0.701 0.509 HSP90B1 TRAFD1 0.508 0.939 TRAFD1 0.149 0.204 ↑ TRIP13 0.030 * 0.031 * ↑ TRIP13 0.615 0.592 VASH1 0.881 0.650 VASH1 0.691 0.570 WIF1 0.738 0.779 WIF1

GS 0.051 GS 0.372 PSA 0.195 PSA 0.842 * p<0.05,** p<0.01 ↑/↓ illustrates increased/decreased expression in the recurrent group N = no expression

The only genes found to be significantly differentially expressed (p<0.05) in the SL study between the recurrent versus non-recurrent group were ANPEP, INMT and TRIP13. The expression of ANPEP and INMT were significantly lower, and the expression of TRIP13 was significantly higher in the recurrent versus non-recurrent group.

In the BB study, a different set of genes were found to be differentially expressed between the recurrent and non-recurrent group to a conventional significance level (p<0.05): AMACR, HPRT1, PLAU, PTEN and SDHA. The expression of AMACR, HPRT1 and

152

PTEN were lower in the recurrent versus non-recurrent group. The expression of PLAU and SDHA was higher in the recurrent versus non-recurrent group.

There were no genes found to be significantly differentially expressed between the recurrent and non-recurrent group (p<0.05) in both studies.

The genes INMT and TRIP13, which were found to be highly significant in the previous study, had p-values >0.5 in the BB study. ANPEP showed a p-value of <0.5 in both studies, but in comparison to the SL study (where ANPEP expression was reduced in the recurrent group), its expression was increased in the recurrent group in the BB study.

The genes AMACR and HPRT1, which were found to be significantly differentially expressed according to recurrence in the BB study, were of no significance in the SL study (p>0.5). SDHA was the only significant gene in our study which had a p-value <0.5 in the previous study. However, SDHA displayed decreased expression in the SL study but increased expression in the BB study in the recurrent group. The final two highly significant genes in our study were PLAU and PTEN, which had been specific to the BB study only.

If the SL study had used our relaxed threshold of p<0.5 for inclusion of variables into the multivariate LR analysis, they would have taken forward 34 genes rather than 3. The BB study used 39 genes for the multivariate LR analysis.

A total of 11 genes were found to show p-values <0.5 in both studies for the comparison of recurrent and non-recurrent groups (Table 2.39).

Table 2.39 Genes with p-values <0.5 in both studies in the independent samples t-tests (recurrent vs non-recurrent)

SL Study BB Study Gene Pfaffl Delta Ct Pfaffl Delta Ct ABL1 0.301 0.373 ↓ 0.045 * 0.073 ↑ ANPEP 0.030 * 0.041 * ↓ 0.205 0.214 ↑ CD109 0.343 0.398 ↓ 0.125 0.099 ↑ CDH1 0.496 0.394 ↑ 0.052 0.049 * ↑ DPT 0.215 0.241 ↓ 0.098 0.120 ↑ EFNA1 0.242 0.429 ↓ 0.155 0.165 ↑ GSTM3 0.159 0.114 ↓ 0.146 0.262 ↑ NELL2 0.096 0.080 ↓ 0.292 0.344 ↑ PNP 0.413 0.340 ↑ 0.322 0.297 ↓ PLLP 0.246 0.350 ↓ 0.303 0.250 ↓ SDHA 0.206 0.309 ↓ 0.013 * 0.046 * ↑ * p<0.05,** p<0.01 ↑/↓ illustrates increased/decreased expression in the recurrent group

153

For most of these genes, the gene expression difference between the recurrent and non- recurrent group was in a different direction between the two studies. The only gene for which the direction of change between the recurrent and non-recurrent group concurred between the studies was CDH1 (higher in recurrent). The BB study had a larger number of genes demonstrating increased expression in the recurrent group in comparison to the SL study.

It is important to note that the significance obtained for the GS and preoperative PSA were much higher in the SL study (lower p-values). Only the GS was included as a predictor variable for the multivariate analysis in the BB study (p<0.5), whereas both GS and PSA had been included in the SL study.

Predictive models of recurrence in SL study Due to their strict entry criterion, the SL study took forward only three genes (ANPEP, INMT and TRIP13) into LR analysis for recurrence.

The output parameters from the univariate LR of these genes are shown for both the Pfaffl (Table 2.40) and 2-ΔCt (Table 2.41) datasets as LR was performed on both datasets in the SL study.

The models are listed in decreasing order of the χ2 (although they were not ordered by this parameter in the SL study).

For both datasets in the SL study, the χ2 values for the single gene models ANPEP and TRIP13 were significant to a 95% confidence level (p<0.05). The χ2 value for the single gene model INMT was significant to a 99% confidence level (p<0.01). Therefore, all three genes showed a significant improvement over the null model for the prediction of recurrence. INMT was the model with the best overall fit of the three genes in both datasets.

A single variable model including only the GS was significantly better than the null model (p<0.05). A model including only PSA was not significantly better than the null model (p-value >0.15).

154

In comparison to the SL study, the variables INMT, TRIP13 and PSA were not included in our LR analysis of recurrence due to their low significance in the t-tests (p>0.5).

ANPEP and GS were included (p<0.5), although the p-values obtained for these variables in the t-tests were much higher than the values found in the SL study. In comparison to the SL study, which found the single gene model ANPEP to significantly improve upon the null model, the BB study did not find ANPEP to significantly improve upon the null model for the prediction of recurrence (χ2(1) = 1.7, p-value 0.2). Furthermore, ANPEP did not feature in the top 10% of two-gene models and only very rarely in the top 10% of three-gene models in the BB study.

Additionally, in the BB study, the GS provided no benefit as either a single predictor model or when added to any single-gene or two-gene models.

155

Table 2.40 Single gene models for the prediction of recurrence in the SL study (Pfaffl dataset)

-2LL Null Nagelkerke Wald Gene(s) model -2LL χ2 df sig. R2 sig. Percent Accuracy (%) AUC Non- 1 2 3 1 2 3 Recurrent recurrent Overall INMT 35.924 28.381 7.542 1 0.006 0.322 0.035 33.3 95.0 75.9 0.772 TRIP13 35.924 30.632 5.292 1 0.021 0.235 0.050 33.3 85.0 69.0 0.767 ANPEP 35.924 30.851 5.073 1 0.024 0.226 0.048 44.4 95.0 79.3 0.783 GS 35.924 31.701 4.223 1 0.040 0.191 0.080 11.1 95.0 69.0 0.703 PSA 33.503 31.766 1.738 1 0.187 0.086 0.201 12.5 95.0 71.4 0.647

Table 2.41 Single gene models for the prediction of recurrence in the SL study (2-ΔCt dataset)

-2LL Null Nagelkerke Wald Gene(s) model -2LL χ2 df sig. R2 sig. Percent Accuracy (%) AUC Non- 1 2 3 1 2 3 Recurrent recurrent Overall INMT 35.924 29.156 6.768 1 0.009 0.293 0.050 22.2 95.0 72.4 0.75 TRIP13 35.924 30.665 5.259 1 0.022 0.234 0.050 22.2 90.0 69.0 0.75 ANPEP 35.924 31.509 4.415 1 0.036 0.199 0.057 22.2 90.0 69.0 0.772 GS 35.924 31.701 4.223 1 0.04 0.191 0.080 11.1 95.0 69.0 0.703 PSA 33.503 31.766 1.738 1 0.187 0.086 0.201 12.5 95.0 71.4 0.647

156

The results from the multivariate LR analysis of the SL study are shown in Table 2.42 and Table 2.43. The two-variable models which offered a significant improvement over the null model (χ2, p<0.05) were INMT.GS, TRIP13.GS and TRIP13.PSA in the Pfaffl dataset and INMT.TRIP13, INMT.GS and TRIP13.GS in the 2-ΔCt dataset. The model TRIP13.GS was the best performing two-variable model in both datasets: χ2(2) = 10.7, p<0.01 (Pfaffl dataset) and χ2(2) = 10.5, p<0.01 (2-ΔCt dataset), with high discriminatory accuracy as demonstrated by AUC values of 0.833 and 0.861 respectively.

However, the only two-variable model which offered a significant improvement over its nested single gene model was TRIP13.GS in both datasets. This was calculated as previously described (-2LL single gene model) - (-2LL two-variable model)(data included in Appendix 8).

The three-variable models, which offered a significant improvement over the null model (χ2, p<0.05) were TRIP13.GS.PSA in the Pfaffl dataset and INMT.TRIP13.GS and TRIP13.GS.PSA in the 2-ΔCt dataset. TRIP13.GS.PSA offered a significant improvement over the two-variable model TRIP13.GS in the Pfaffl dataset only (χ2 = 3.99, p<0.05).

The values obtained for the χ2 (between the null and final model) in the BB study reached higher values for the top two- and three-gene models than those in the SL study. Additionally, there was a more significant improvement in the three-variable models in comparison to the two-variable models in our study. Despite lower χ2 values for the best models in the SL study than the BB study, the overall percent accuracy and AUC values of the best models in the two studies were in a similar range.

157

Table 2.42 Multivariate logistic regression analysis for the prediction of recurrence in the SL study (Pfaffl dataset)

Wald Significance Percent Accuracy (%)

Gene(s) -2LL Null Nagelkerke Non- 1 2 3 model -2LL χ2 df sig. R2 1 2 3 Recurrent recurrent Overall AUC TRIP13 GS PSA 33.503 21.241 12.262 3 0.007 0.508 0.036 0.043 0.2 87.5 85.0 85.7 0.888 TRIP13 GS 35.924 25.231 10.693 2 0.005 0.434 0.04 0.036 55.6 80.0 72.4 0.833 INMT GS 35.924 26.696 9.228 2 0.01 0.384 0.07 0.231 44.4 90.0 75.9 0.822 TRIP13 PSA 33.503 26.56 6.943 2 0.031 0.315 0.054 0.192 50 85.0 75.0 0.819 ANPEP PSA 33.503 28.893 4.61 2 0.1 0.218 0.122 0.384 12.5 90.0 67.9 0.806

Table 2.43 Multivariate logistic regression analysis for the prediction of recurrence in the SL study (2-ΔCt dataset)

Wald Significance Percent Accuracy (%)

Gene(s) -2LL Null Nagelkerke Non- 1 2 3 model -2LL χ2 df sig. R2 1 2 3 Recurrent recurrent Overall AUC INMT TRIP13 GS 35.924 23.799 12.124 3 0.007 0.481 0.272 0.096 0.123 55.6 90 79.3 0.867 TRIP13 GS 35.924 25.389 10.535 2 0.005 0.429 0.044 0.039 77.8 85.0 82.8 0.861 TRIP13 GS PSA 33.503 23.129 10.374 3 0.016 0.444 0.055 0.047 0.319 62.5 90.0 82.1 0.863 INMT TRIP13 35.924 26.527 9.396 2 0.009 0.390 0.110 0.142 44.4 90.0 75.9 0.828 INMT GS 35.924 27.609 8.315 2 0.016 0.351 0.103 0.253 33.3 90.0 72.4 0.817 TRIP13 PSA 33.503 27.838 5.665 2 0.059 0.263 0.078 0.269 50 95.0 82.1 0.763 INMT PSA 33.503 28.965 4.538 2 0.103 0.214 0.135 0.835 12.5 95.0 71.4 0.731 ANPEP PSA 33.503 29.966 3.537 2 0.171 0.17 0.203 0.435 25.0 90.0 71.4 0.756

158

2.4 Discussion

Gleason score analysis

In the independent samples t-tests, there was no overlap between the significantly differentially expressed genes according to recurrence and the significantly differentially expressed genes according to GS (≥7 vs <7)(p<0.05). Furthermore, many of the genes with p-values <0.5 in both analyses showed an expression change between the recurrent and non-recurrent group that was in a different direction to the change between the GS ≥7 and <7 group.

It was also evident that the GS had no use as a predictor variable of recurrence in this dataset in either the univariate or multivariate LR analysis. There was only a marginal difference in the mean GS between the recurrent and non-recurrent group (6.90 and 6.63 respectively, p-value of 0.372 in Mann-Whitney U test). These findings were caused mainly by the large proportion of GS 7 patients in this dataset (>40%). Patients with GS 7 tumours have a heterogeneous clinical outcome (319), as demonstrated here by the finding of 18 non-recurrent and 9 recurrent patients within this group.

Additionally, the GS may not have been accurately assigned for all patients, especially those for whom only a biopsy GS was available (those treated with RT only). Biopsy grading has well-documented limitations and is prone to over- and undergrading (320). It is interesting to note that the GS in the SL study differed to an almost conventional significance level between the recurrent and non-recurrent group (p-value 0.051)(275). GS was also found to be an important predictor variable in their multivariate LR analysis. The GS may have been more reliably assigned in the SL study as only RP patients were included, so only pathologic GS was used for categorisation rather than biopsy GS.

The lack of correlation between the GS and recurrence in the BB study would have negated the prognostic relevance of any identified differentially expressed genes between GS ≥7 and <7. Therefore the GS could not be reliably used as a surrogate for

159 aggressive disease. This validated the choice to use recurrence as the clinical endpoint in the final study.

Contrary to these findings, the GS has been used in the discovery process of some gene expression studies to identify putative genes of prostate cancer progression, which were then further analysed using outcome data (261, 262). However, these studies compared GS ≥8 (or Gleason grade 4/5) and GS ≤6 (or Gleason grade 3) and excluded GS 7 in the discovery phase as these more highly segregated groups have a closer association with outcome (262, 332). Other studies found no overlap between genes which were predictive of GS and those that were predictive of recurrence, suggesting that the gene expression changes associated with dedifferentiation may be different to those associated with outcome (290, 333).

Comparison of recurrence analysis between the BB and SL study

There were few similarities between the SL study and the BB study regarding the genes found to be differentially expressed between the recurrent and non-recurrent group. The significant genes (p<0.05) were different between the two studies and AMACR, which in the BB study was found to be significantly differentially expressed according to recurrence (p-value 0.018) and the most prevalent gene in the best performing LR models, had a p-value of >0.5 in the t-tests of the SL study. HPRT1, the most significantly differentially expressed gene in the BB study (p-value 0.008) had a p-value of >0.5 in the SL study. SDHA was found to be significantly increased in the recurrent group in the BB study (p-value 0.013), whereas it was reduced in the recurrent group in the SL study (although not significantly). Similarly, ANPEP was significantly reduced in the recurrent group in the SL study but increased in this study (although not significantly).

There were only 11 genes with p-values <0.5 in both studies and crucially, all these 11 genes except CDH1, demonstrated a different direction of gene expression in the two studies between the recurrent and non-recurrent group. CDH1 was the only gene with

160 a p-value of <0.5 in both studies, which showed the same pattern of expression between the two studies (increased expression in the recurrent group).

Two of the genes that were found be significant in the BB study were PLAU and PTEN. These genes were novel to the BB array and had been selected based on their association with prostate cancer progression in other studies (see Appendix 1). These genes were found to be important predictor variables in the LR models, which validated the decision to incorporate these genes on the BB array.

The fact that the two studies identified different genes as showing differential expression in recurrent disease is not an uncommon finding in gene expression studies (334). Although a similar assay platform was used, there were many key differences in methodology between the SL and BB study, which resulted from the extensive method development that was carried out in the latter study including the essential optimization of the baseline and threshold settings. Importantly, the BB study used a larger set of samples which were predominantly biopsies, whereas the SL study used only RP samples. The gene expression data in the BB study was analysed with a longer follow- up period than the SL study and the outcome data was more complete.

It is not possible to directly compare the results of the two studies due to these differences. Furthermore, the more relaxed threshold in the BB study for inclusion of predictor variables into multivariate LR models resulted in the analysis of 2,019 models in comparison to only 13 in the SL study.

The SL study tested only one two-gene model (INMT.TRIP13)(2-ΔCt dataset only). In comparison, the BB study tested 39 single-, 387 two-gene and 1482 three-gene models. The remainder of the models in their study had included GS and/or PSA alongside a single gene rather than combinations of genes.

To reiterate, the SL study used overall percent accuracy as the primary measure of overall model performance. In comparison, the BB study used χ2 for the reasons described in Section 2.3.4.8. Both studies included ROC analysis as an important measure of discriminatory accuracy.

161

It must be stressed that the SL study was only a pilot study and the BB study comprised three times as many patients to enhance the validity of the findings. Importantly, the pilot study had only 9 positive events (9 recurrent patients) compared with 21 in the BB study. For LR to be statistically valid and to avoid the identification of spurious models through overfitting, the use of at least 5-9 EPV is advised (303). The validity of the two- and three-variable models in the SL study, which had an EPV of only 4.5 and 3 respectively, were therefore compromised. Despite the study finding some potentially useful two- and three-variable predictive models of recurrence (TRIP13.GS.PSA and TRIP13.GS), the results of the single variable models in their study would be considered the most reliable.

The single gene model INMT was the highest performing single gene model in the SL study for the prediction of recurrence (according to χ2). In comparison, the BB study found AMACR to be the highest performing model of recurrence. The values for the χ2 and its significance were similar between INMT in the SL study and AMACR in the BB study. The AUC value for INMT in their study was higher at 0.772 than AMACR in the BB study (0.677). However, the two- and three-gene models in the BB study (which contained a suitable number of variables for the sample size) provided significantly higher values for both the χ2 and AUC than was found for the single gene models in either study.

BB study - recurrence analysis

The BB study found 7 single gene models (AMACR, HPRT1, SDHA, PLAU, PTEN, ABL1, CDH1) that offered a significant improvement over the null model for the prediction of recurrence in this dataset (χ2, p<0.05). However, the amount of variation in recurrence which was explained by AMACR and HPRT1 (the genes with the highest χ2 value) was low at ~15% and 13% respectively, as described by the Nagelkerke R2. The discriminatory accuracy (as shown by the AUC values) was good for HPRT1, at 0.731. The remaining 6 genes (even AMACR) had AUC values between 0.628-0.677, demonstrating that the discriminatory accuracy was not ideal. Therefore, the ability of the genes individually in LR models to discriminate between recurrent and non-recurrent patients was limited.

162

There were 28 two-gene models which offered a significant improvement over their nested single gene models (as assessed by the secondary use of χ2), and an improvement over the best single gene model (lower -2LL). The two-gene models at the bottom of the list were not ideal as they offered only a marginally lower -2LL value than the lowest value obtained for the single gene models (AMACR). Therefore, they did not offer any significant benefit over the parsimonious single gene model AMACR. The best performing two-gene models of AMACR.SDHA and HPRT1.PLAU showed χ2 values that were approximately twice the value obtained for AMACR as a single gene model. The better fit of these two-gene models to the data was also reflected by much higher significance values for the χ2. These two-gene models explained a much larger amount of the variation in recurrence (31% and 28% respectively). The AUC values were 0.784 and 0.808 for AMACR.SDHA and HPRT1.PLAU respectively, suggesting good discriminatory accuracy. By selecting the optimal threshold for categorisation of patients, a good balance of sensitivity and specificity could be achieved for the predictive models. For HPRT1.PLAU the threshold at which the value for the Youden’s index was highest provided a high sensitivity of 91% and a lower specificity of 63%. A better balance of sensitivity and specificity of 71% and 76% was achieved by selecting an alternative threshold. The sensitivity and specificity of AMACR.SDHA was 71% and 78% respectively.

There were 9 three-gene models that offered both an improvement over their nested two-gene models and over the best two-gene model. The best performing models AMACR.EFNA5.SDHA and AMACR.HPRT1.SDHA could explain ~40% of the variation in recurrence and these two models had AUC values of 0.812 and 0.829 respectively, showing excellent discriminatory accuracy. The sensitivity and specificity at the optimal threshold was 81% and 74% for AMACR.EFNA5.SDHA respectively, and 81% and 87% for AMACR.HPRT1.SDHA. The remainder of the three-gene models provided only limited improvement (marginally lower -2LL) over the best performing two-gene model of AMACR.SDHA. Therefore, the use of AMACR.EFNA5.SDHA and AMACR.HPRT1.SDHA may be justified, but the use of the remainder of the three-gene models may not.

It was not possible to add any further predictor variables to the three-gene models as it was imperative to keep the number of variables to a minimum due to the small sample

163 size (302). Ideally, a larger sample size would have been used, but there were limitations regarding the number of samples that could be accessed, so the study had to adapt the number of variables included in the models according to the final sample size. The use of a model that is too complex for the dataset increases the risk of overfitting, which is when the model becomes tailored to the features of the specific sample that it is tested on, including its noise and anomalies. This leads to misleading values for the parameters in the LR output. An overfitted model would be highly unlikely to fit another random sample and so would have no value as a predictive model for future use (335). Additionally, exceeding the appropriate number of predictor variables will prevent the LR analysis from adequately controlling for confounding variables (301). Confounding variables are extraneous variables which influence both the independent and dependent variables and may lead to a spurious association between them.

Calibration and discrimination - evaluating the performance of the predictive models

It is important to mention that the χ2 and the AUC values were used to measure different characteristics of the overall performance of the predictive models. The χ2 measures goodness-of-fit and is used to measure the calibration of a model, i.e. the agreement between the predicted and observed outcomes. The AUC is a measure of discrimination, i.e. how well the model distinguishes between recurrent and non-recurrent patients (336). Generally, as the χ2 values decreased so did the AUC values; i.e. the models with the highest calibration were usually the models with the highest discriminatory accuracy, although they did not correlate exactly. For example, the single gene model HPRT1 had a higher AUC value than AMACR, despite AMACR having the highest χ2.

Although it is important to assess both calibration and discrimination when evaluating the overall performance of a predictive model, the χ2 in this study provided the most suitable parameter by which to rank the vast numbers of candidate two- and three-gene models. The AUC values were not suitable for ranking purposes as these values fall within a narrow range from 0.5-1. The larger range of χ2 values and the attributes of using this parameter to evaluate the overall performance of the models (explained in the method development Section 2.3.4.8) made it ideal for ranking purposes.

164

Furthermore, the secondary use of the χ2, describing the improvement in fit provided by the addition of another predictor to a single or two-gene model, enabled a useful method by which to assess the importance of each predictor variable in each model.

Gene prevalence plots

The plots of gene prevalence in the top performing models were useful to illustrate how the highest predictive models varied, depending on whether χ2 or the overall percent accuracy was used for ranking. This is explained in detail in the method development Section 2.3.4.8 and contributed to the decision to use χ2 as the ranking method.

The prevalence plots allowed an assessment of the prevalence of each gene in the best performing two- and three-gene models according to χ2. It is important to remember that a limitation of using such a method is the fact that the prevalence plots included every tested model in the top fractions. Therefore, models containing non-significant predictor variables were also included, which could potentially have led to an overrepresentation of some genes without significant predictive value. This was more evident for the three-gene models as some of these models contained a non-significant variable but, due to the high significance of the remaining variables, their ranking was high.

Despite these limitations, the plots provided a useful description of the genes that were found most frequently in the best performing two- and three-gene models and were important predictor variables in this study. It was evident that the prevalence of AMACR far exceeded any of the other genes in the top models. This was important and highlighted AMACR as an important putative progression marker in prostate cancer. Although the genes with p-values <0.05 were more likely to present in the best performing models, the plots also demonstrated that it was important to use a more relaxed p-value threshold for inclusion of predictors into multivariate LR (as genes with p-values >0.05 also presented).

The genes in this study had been selected for LR analysis based on the p-value of the t-test rather than the Wald test (from the univariate LR analysis). In the method development, the prevalence was plotted in increasing order of the p-value for the t-test

165 initially. Subsequently, the prevalence was plotted in increasing order of the p-value for the Wald test. The prevalence correlated better with the significance of the Wald test than the t-test, providing a clearer picture of potential p-value thresholds that could be selected for inclusion of genes into multivariate analysis. In future, the p-value for the Wald test may be a more appropriate method by which to select predictor variables for the multivariate analysis, as has been suggested by others (325, 326). These studies used a threshold of 0.25, which would have been suitable in this study, as no genes were found in the final lists of models which had p-values >0.25. Table 2.44 shows the genes included in the final 28 two-gene and 9 three-gene models, which were the best performing models containing only significant predictor variables. The p-values for both the t-test and the Wald test are included.

Table 2.44 Genes found in the best performing two- and three-gene models

Gene t-test Wald test Two-gene Three-gene p-value p-value models models AMACR 0.018 0.0085 6 9 HPRT1 0.008 0.0225 6 6 SDHA 0.013 0.0197 6 2 PLAU 0.02 0.0258 5 0 PTEN 0.038 0.0479 6 0 ABL1 0.045 0.0535 3 0 CDH1 0.052 0.0600 2 0 ATP10B 0.054 0.0580 2 2 PVT1 0.073 0.0825 1 1 ANXA2 0.072 0.0762 1 1 PGC 0.070 0.0741 3 0 EFNA5 0.151 0.0904 1 2 TRAFD1 0.149 0.1245 2 1 POP5 0.088 0.0975 2 1 DPT 0.098 0.1022 1 2 GUCY1A1 0.149 0.1015 4 0 PCDH18 0.197 0.1157 2 0 CD109 0.125 0.1277 2 0 NDUFAF1 0.154 0.1669 1 0

The following section provides a brief description of the potential significance of the genes found in the best performing predictive models.

166

Genes included in the highest multivariate logistic regression models according to χ2

Genes with high prevalence in the models

AMACR AMACR (alpha-methyl-CoA racemase) was the most prevalent gene in the best performing two- and three-gene models according to χ2 by several fold. AMACR was a predictor in each of the highest ranking three-gene models and demonstrated a significant reduction in expression in the recurrent compared with the non-recurrent samples (p<0.05). AMACR encodes a peroxisomal and mitochondrial enzyme involved in the beta-oxidation of branched chain fatty acids and bile acid intermediates (337) and is well-documented in prostate cancer where it is overexpressed at both the transcriptional and protein level (191, 199). Immunohistochemistry of AMACR is currently utilised as a diagnostic tissue marker of the disease in biopsy samples (338).

Regarding its prognostic potential, the results are conflicting. In agreement with our findings, some studies have found reduced AMACR expression (mRNA/protein) to be correlated with disease progression in prostate cancer. Low mRNA/protein expression of AMACR has been found to be correlated with adverse clinicopathological parameters, including elevated clinical stage (339, 340), and several studies have shown a significantly lower expression in metastatic compared to localized disease (341, 342). Few studies have correlated AMACR expression with outcome, although Rubin et al. found low AMACR protein expression to be an independent predictor of BCR in a cohort of RP patients (343). They also found low AMACR expression to be predictive of disease- specific death in conservatively managed prostate cancer patients (343). Other studies have either found no significant association between AMACR expression and outcome (339) or the opposite finding of an association between high expression and poor prognosis (344, 345).

Increased AMACR expression may be important for energy provision in prostate cancer cells, through its role in the beta-oxidation of fatty acids, a catabolic pathway which generates energy in the form of ATP (346). De novo fatty acid synthesis is also

167 consistently upregulated in prostate cancer through the overexpression of fatty acid synthase, particularly with disease progression (347, 348). The significance of a downregulation in AMACR with disease progression as found in our study and others is not yet known. It may reflect a complex shifting balance between lipid synthesis and oxidation to ensure that both lipids and energy are supplied to the prostate cancer cells for growth and survival (346).

HPRT1 HPRT1 (hypoxanthine phosphoribosyltransferase 1) was the most significantly differentially expressed gene between the groups, showing significantly lower expression in the recurrent versus non-recurrent group (p<0.01).

HPRT1 had originally been included on the array as a housekeeping gene. Its use as a reference gene in this study was precluded by its high variability between the samples, including between the recurrent and non-recurrent group (Appendix 4). The finding that HPRT1 expression varied significantly between the non-recurrent and recurrent group was novel and highlighted the fact that so-called housekeeping genes can vary considerably under different circumstances and need to be selected carefully. Contrary to our findings, some studies have demonstrated the suitability of HPRT1 as a reference gene in prostate cancer as they demonstrated its low variability between samples, particularly between different grades and stages of disease (289). As a result of HPRT1 being considered a housekeeping gene, there are few studies that have investigated its expression as a target gene in prostate cancer.

HPRT1 encodes a central enzyme in the purine salvage pathway, which synthesizes purine nucleotides from intermediates produced in the degradative pathway for RNA and DNA (349). It is important to note that purines can be synthesized by either the de novo pathway or the salvage pathway (350). The de novo synthesis of nucleotides is known to be significantly upregulated in neoplastic cells to support DNA replication and RNA production to maintain cellular proliferation and growth. This is achieved through the utilization of intermediates from glycolysis and the TCA cycle, and the upregulation of enzymes that catalyse the de novo pathway (351, 352). In comparison, the nucleotide salvage pathway in cancer is not well understood. Our findings suggest a possible down-

168 regulation in the purine salvage pathway, which conflicts with some other reports. Other studies have found an upregulation of HPRT1 gene expression with disease progression in prostate cancer (353, 354).

SDHA SDHA showed a significantly increased expression in the recurrent versus non-recurrent group (p<0.05). Similar to HPRT1, SDHA had originally been included on the array as a housekeeping gene.

SDHA (succinate dehydrogenase complex flavoprotein subunit A) encodes one of the main catalytic subunits of succinate dehydrogenase (SDH: mitochondrial complex II), a mitochondrial enzyme involved in two essential processes of cellular energy production: the TCA cycle and oxidative phosphorylation (355). The TCA cycle is a final common pathway for the oxidation of glucose, glutamine and fatty acids, which is central to cellular energy production and provision of intermediates for macromolecule synthesis (356). Succinate dehydrogenase (SDH) couples the oxidation of succinate to fumarate in the TCA cycle, with the transfer of electrons to ubiquinone in the electron transfer chain (355). The electron transfer chain is the site of oxidative phosphorylation and the final stage of aerobic respiration, which is responsible for production of energy in the form of ATP.

Increased SDHA expression could potentially reflect increased activity in both the TCA cycle and the electron transfer chain, and therefore elevated energy production through aerobic respiration. Initially, this finding appears contradictory to the Warburg effect, a feature of many cancers, which describes a shift towards aerobic glycolysis whereby pyruvate is converted to lactate rather than oxidised through the TCA cycle (180). In fact, SDHA mutations leading to a reduction or loss of the SDH catalytic subunit and defective SDH function are found in some cancers such as paragangliomas and gastrointestinal stromal tumours (357, 358). However, metabolic reprogramming exhibits high variability both between and within different cancers and it is now generally accepted that tumour cells rely on a combination of both aerobic glycolysis (Warburg effect) and oxidative metabolism to supply their energy and macromolecular needs (183).

169

As previously described, normal prostate cells have a distinctive metabolism due to a truncated TCA cycle which prevents citrate oxidation and ATP production, thereby limiting oxidative metabolism (359). In comparison to the energy inefficient normal prostate cells, malignant prostate cells switch to a highly energy efficient metabolism, with large quantities of citrate oxidised through the TCA cycle (360). The Warburg effect tends to become pronounced only at the most advanced stages of the disease (184). Our findings suggest increased TCA cycle activity in recurrent disease. Due to the role of SDH in coupling the TCA cycle with the electron transfer chain, this may reflect a point at which there is increased flux through the final stage of cellular respiration (oxidative phosphorylation) to keep up with the increasing energy demands of the cancerous cells as the disease progresses. However, in addition to energy production, the TCA cycle is important for channelling glycolytic intermediates into various anabolic pathways to support the proliferation and growth of cancerous cells (181). For instance, the TCA cycle provides precursors for anabolic pathways such as amino acid and fatty acid synthesis. Activity in these anabolic pathways and also in catabolic pathways that replenish the TCA cycle with metabolites are known to be increased with prostate cancer progression (353, 361).

Other studies have identified TCA cycle activation in prostate cancer (353, 361, 362). A recent study by Shao et al. using a combined approach of metabolomics and transcriptomics found significant accumulations of TCA cycle metabolites in prostate cancer tissue in comparison to normal adjacent tissue (361). The levels of fumarate and malate were positively correlated with GS and tumour stage. Of particular interest is the fact that most genes involved in the TCA cycle showed significantly increased expression in the prostate cancer vs normal tissue. A number of these genes including SDHA were positively correlated with GS and tumour stage. These findings directly implicate an increased gene expression of SDHA with accumulation of fumarate levels and show a positive correlation of SDHA with prostate cancer progression, which concurs with our findings. Importantly, several studies that have identified an increased respiratory capacity in prostate cancer suggest a significant shift during oxidative phosphorylation from NADH (through mitochondrial complex I) towards succinate-supported oxidation (via SDH; mitochondrial complex II)(363, 364). It is also important to link the finding of

170 both increased SDHA expression and reduced PTEN expression in recurrent disease in this study. This concurs with other studies that found a significant association between loss of PTEN in prostate cancer and accumulation of intracellular succinate levels and higher succinate-supported oxidation via mitochondrial complex II, which may provide cancer cells with a growth and survival advantage (364).

PLAU PLAU (plasminogen activator, urokinase), which encodes a serine protease, was found to show significantly increased expression in the recurrent group (p-value 0.020). Of interest, PLAU expression was also increased in the GS ≥7 group. The enzyme PLAU, upon binding its receptor PLAUR, catalyses the conversion of plasminogen to plasmin, promoting the degradation of the basement membrane and ECM by the activation of a cascade of proteolytic enzymes (200). This enzyme is one of an array of different proteolytic enzymes, which may be released from cancer cells (and transformed stromal cells) to facilitate cancer cell invasion and migration through remodelling of the tumour microenvironment (200). In agreement with our findings, several studies have found increased PLAU and PLAUR protein expression in prostate cancer tissue to be correlated with adverse clinicopathological parameters (198, 365). PLAU, either alone or in combination with plasminogen activator inhibitor 1 (SERPINE1) has been found to independently predict BCR after RP in some studies (198, 365). Increased serum/plasma levels of PLAU and PLAUR have also shown promise as independent predictors of poor outcome in prostate cancer (366, 367). Few studies have investigated mRNA expression of PLAU in prostate cancer.

PTEN PTEN (phosphatase and tensin homolog) was significantly downregulated in the recurrent compared with the non-recurrent samples (p-value 0.038). PTEN is a tumour suppressor gene whose protein product is an important negative regulator of the PI3K/AKT pathway, which promotes cellular growth and survival (146). Genomic loss of PTEN is a frequent finding in prostate cancer (145). PTEN deletion is a well-characterised poor prognostic factor in prostate cancer, correlated with adverse clinicopathological parameters (368, 369) and independently predictive of poor outcome such as recurrence after RP (145, 370).

171

Immunohistochemistry (IHC) detection of PTEN protein expression offers a much cheaper and less time-consuming method for the detection of PTEN loss than the detection of PTEN deletion by Fluorescence In Situ Hybridization (FISH). Furthermore, levels of the PTEN protein can be affected by mechanisms other than gene deletion, including mutation and epigenetic regulation (371). This would not be identified by FISH analysis which may underestimate the extent of PTEN loss. A reduction in PTEN protein expression has been found to be a predictor of poor prognosis, including shorter recurrence-free survival after RP (372) and an increased risk of prostate cancer-specific mortality after surgery and other modalities (373, 374). The studies of PTEN/PTEN loss in prostate cancer have favoured in situ methods such as FISH and IHC due to the high degree of variability of PTEN between different regions of a tumour (375, 376). Therefore, the prognostic potential of PTEN loss at the transcriptional level has not been extensively explored. It is interesting to note that loss of PTEN has been found to decrease oxidative phosphorylation and promote glycolysis (the Warburg effect)(364). This may potentially link the findings in this study of a reduction in the expression of both PTEN and NDUFAF1.

ABL1 ABL1 (ABL proto-oncogene 1, non-receptor tyrosine kinase) was significantly upregulated in the recurrent samples (p-value 0.045) and was the gene that showed the greatest upregulation in terms of the fold difference (fold difference of ~43). ABL1 encodes one of the two members of the Abelson family of non-receptor protein tyrosine kinases, which have roles in a variety of biological processes including cell proliferation, survival and motility (377). ABL1 is commonly associated with chronic myelogenous leukaemia, which is driven by activation of the ABL1 gene through its translocation with the Breakpoint Cluster Region gene on chromosome 22 to form a fusion gene BCR-ABL1 (378).

It is now appreciated that ABL1 and/or ABL2 show increased expression and/or activation in many types of solid cancers by mechanisms that do not involve gene mutation or translocation. ABL1 protein expression has been found to be increased in several types of cancer and correlated with poor prognosis (379, 380). ABL1 activation has also been implicated in prostate cancer progression, with some studies identifying

172 a role for ABL1 in the proliferation, survival and motility of prostate cancer cells (381, 382). A role for ABL1 in bone metastatic growth has been documented in prostate cancer (383) and an ABL tyrosine kinase inhibitor has been used experimentally for the treatment of bone metastases (384). Few studies have directly correlated ABL1 expression with prognosis in prostate cancer. However, a recent study by Gupta et al. found an increase in copy number gains of ABL1 in circulating tumour cells (CTCs) from men with mCRPC (385).

CDH1 CDH1 (cadherin 1) showed significantly increased expression in the recurrent versus non-recurrent group (p-value 0.052). CDH1 was the only gene that showed a similar change in expression in the recurrence analysis of the pilot study. CDH1 encodes an important cell adhesion molecule of the cadherin family. The finding of increased CDH1 expression in the recurrent samples was contradictory to the many studies that have documented widespread loss of CDH1 protein expression in epithelial cancer (386). Loss of CDH1 is considered a cardinal feature of EMT, whereby cancerous epithelial cells acquire a migratory phenotype through the loss of epithelial characteristics and the development of mesenchymal properties (387).

In prostate cancer, most studies have found CDH1 protein expression to be significantly downregulated (206, 388). Reduced CDH1 expression has also been correlated with adverse clinicopathological parameters in prostate cancer (206, 389) and found to be a predictor of BCR and cancer-specific mortality after RP (201, 207). A possible explanation of our findings may be the fact that loss of CDH1 is only a transient change that occurs at certain stages of disease progression. Some studies have found CDH1 to be re-expressed in advanced prostate cancer, where its expression may aid the growth and establishment of metastases (390, 391). In fact, CDH1 was downregulated in the GS ≥7 vs <7 group in our study, suggesting that CDH1 loss may be an earlier change in prostate cancer progression.

173

Genes included less frequently in the top two- and three-gene models

ATP10B ATP10B (ATPase phospholipid transporting 10B (putative)) expression was higher in the recurrent versus non-recurrent group, with a p-value of 0.054, which was slightly above the conventional significance level (p<0.05). ATP10B was found in two of the final two- and three-gene models.

ATP10B encodes a protein with a putative role as the catalytic subunit of an enzyme that belongs to the family of P-type ATPases, which function as phospholipid flippases. These enzymes couple the hydrolysis of ATP with the translocation of phospholipids from the outer to the inner surface of membranes (392). The role of ATP10B in cancer is not currently known, although some studies have identified dysregulation of phospholipid flippases in cancer cell lines (393).

PGC PGC (progastricsin) expression was lower in the recurrent versus non-recurrent samples (p-value 0.070). PGC encodes an aspartic protease produced mainly in the stomach, which is a precursor to the digestive enzyme pepsin. However, ectopic expression of PGC is found in other organs including the prostate (394). PGC expression has been found to be increased in cancerous tissue from some of these organs including prostate cancer (395, 396). Our findings concur with several studies which have found increased PGC protein expression to be associated with a more favourable prognosis in various cancers including prostate cancer (397-399).

ANXA2 ANXA2 (annexin A2) was upregulated in the recurrent compared with the non-recurrent group (p-value 0.072). ANXA2 encodes a calcium-dependent phospholipid-binding protein localized on the cell surface, which is involved in various biological processes including cytoskeletal organisation, endosome trafficking and fibrinolysis (400, 401). ANXA2 is also expressed by many types of tumour cells and increased ANXA2 expression has been documented in a variety of cancers (402). In these cancers, a central mechanism by which ANXA2 may promote cancer progression is through its role as a co-

174 receptor for plasminogen, which accelerates the conversion of plasminogen to plasmin by tissue-type plasminogen activator (403). Plasmin is a serine protease which is centrally involved in angiogenesis and tumour cell migration through metalloproteinase activation and degradation of the ECM. In the fewer cancers where ANXA2 is downregulated its role in progression would likely be through plasminogen- independent mechanisms (404).

In prostate cancer ANXA2 expression has been found to be downregulated in well and moderately differentiated prostate cancer tissue in comparison to benign prostate tissue, but then ANXA2 expression re-emerges in poorly differentiated disease (405). A positive relationship between ANXA2 protein expression and disease progression in prostate cancer has been demonstrated in several studies (406, 407). For instance, increased ANXA2 expression has been found to be predictive of earlier BCR after RP or RT (407, 408).

PVT1 PVT1 (Pvt1 oncogene) was upregulated in the recurrent group (p-value 0.073), which is in agreement with an identified tumourigenic role for PVT1 in prostate cancer through its role in cellular proliferation and inhibition of apoptosis (409). The PVT1 gene encodes a long non-coding RNA. This gene is mapped to chromosome 8q24, an important risk locus for many cancers including prostate cancer, which also contains the well- characterised oncogene MYC (410). PVT1 has been found to be upregulated in prostate cancer, with expression positively correlated with grade and tumour stage in some studies (409, 411). Furthermore, increased PVT1 expression has been found to be significantly associated with poor disease-specific and overall survival (409, 411).

POP5

POP5 (POP5 homolog, ribonuclease P/MRP subunit) showed increased expression in the recurrent group (p-value 0.088). The significance of this finding is unclear. The POP5 protein is a subunit of the RNase MRP and RNase P complexes, which are endoribonucleases with a role in the processing of precursor ribosomal RNA (rRNA) and precursor transfer RNA (tRNA) respectively (412). This gene has not been previously associated with prostate cancer or any other cancers in the literature.

175

DPT

DPT (dermatopontin) expression was increased in the recurrent versus non-recurrent group (p-value 0.098). DPT is a tyrosine-rich ECM protein, which has a role in cell-ECM adhesion and ECM assembly, and is centrally involved in wound healing (413). DPT has been found to be downregulated at both the mRNA and protein level in some cancers (414, 415). In these cancers it appears to act as a metastasis suppressor with lower expression correlated with poorer prognosis (414, 415). Its expression in prostate cancer has not been described with regard to prognosis. Although the studies in other cancers suggest a tumour suppressor role for DPT which is contrary to our findings, a recent study has found a pro-angiogenic role for DPT in cancer (416).

CD109 CD109 (CD109 molecule) was upregulated in the recurrent group in our study (p-value 0.125). CD109 is a glycosylphosphatidylinositol-anchored surface antigen with a documented role as a negative regulator of TGF-beta-1 signalling (417). Upregulation of CD109 mRNA/protein expression has been found to be significantly correlated with poor prognosis in some cancers including lung cancer (418, 419). The expression of CD109 in prostate cancer has not been described in the literature.

TRAFD1 TRAFD1 (TRAF-type zinc finger domain containing 1) was upregulated in the recurrent group although the significance was low (p-value of 0.149). TRAFD1 was only found in a couple of the final two-gene models and one three-gene model and did not make a large contribution to any of these models. TRAFD1 is a negative feedback regulator that controls excessive innate immune response through an inhibitory action on TNF receptor-associated factor 6 (TRAF6), thereby suppressing nuclear factor kappa-light- chain-enhancer of activated B cells (NF-κB) and MAPK signalling, which are activated by TRAF6 (420, 421). The findings in this study of an upregulation of TRAFD1 with recurrent disease are contradictory to the studies of other malignancies in which TRAF6 has been found to be upregulated and a poor prognostic factor (422, 423). In prostate cancer TRAF6 and other members of the TRAF family (with which TRAFD1 also interacts) such

176 as TRAF2 and TRAF4 have been found to be upregulated with high expression associated with disease progression (424-426).

GUCY1A1 GUCY1A1 (guanylate cyclase 1 soluble subunit alpha 1) was downregulated in the recurrent group (p-value of 0.149). GUCY1A1 encodes a subunit of the enzyme soluble guanylate cyclase, which through its association with another subunit (GUCY1B1), mediates nitric oxide signalling by catalyzing the synthesis of cyclic guanosine monophosphate (cGMP). GUCY1A1 is an androgen regulated gene (427) and has been found to promote the proliferation and survival of prostate cancer cells independently of its enzyme activity, mediators of nitric oxide or androgens (428). Contrary to our findings, other studies have found GUCY1A1 expression to be significantly overexpressed in prostate cancer tissue (353, 427). GUCY1A1 expression has also shown a positive correlation with advanced stage and lethal disease in some prostate cancer studies (290, 353).

EFNA5 EFNA5 (ephrin A5) was found in the three-gene model with the highest χ2, despite a p-value of 0.151 in the independent samples t-test. EFNA5 expression was reduced in the recurrent group compared with the non-recurrent group. EFNA5 encodes the ephrin-A5 ligand, which is a member of the ephrin A subclass of ephrin ligands.

Ephrins are membrane bound ligands, which activate tyrosine kinase Eph receptors and have regulatory roles in the building and maintenance of organized tissue structures during embryonic development (429). Eph-ephrin interactions may result in forward or reverse signalling, eliciting a wide array of different cellular effects including both cell repulsion and adhesion (430). Eph receptors and ephrins may be upregulated (431, 432) or downregulated (433, 434) in cancer, and have demonstrated roles in tumour cell proliferation, adhesion, migration and angiogenesis (430).

Ephrin-A5 is a glycosylphosphatidylinositol-anchored protein with a role in axon guidance and migration during brain development (435). Our findings of downregulation of EFNA5 in the recurrent patients suggests a tumour suppressor role in prostate cancer. Downregulation of EFNA5 mRNA and protein have been demonstrated in other

177 malignancies including glioma, chondrosarcoma and colon cancer, with expression negatively associated with their progression (436-438). Downregulation of EFNA5 has been documented in malignant versus benign prostate cancer (439, 440), although a prognostic role for EFNA5 has not been previously described. However, some of the receptors with which ephrin-A5 may interact, have been found to be dysregulated in prostate cancer and correlated with disease progression; e.g. downregulation of EPHA5 and EPHA7, and upregulation of EPHA2 and EPHA3 (441-444). A mechanism by which ephrin-A5 may exert a tumour suppressor role could be explained by its effect on cellular adhesion. Studies have found that activation of ephrin-A5 by EPHA5 induces signalling in the ephrin-A5 expressing cells, which enhances their adhesive properties (445). Therefore a reduction in ephrin-A5 may provide a potential mechanism by which tumour cells can lose their adhesive properties, allowing tumour cell migration.

NDUFAF1 NDUFAF1 (NADH:ubiquinone oxidoreductase complex assembly factor 1) presented as a variable in only one of the top two-gene models. A reduction in expression was found in the recurrent versus non-recurrent group (p-value of 0.154). The NDUFAF1 protein has a role in the assembly of mitochondrial complex I (NADH dehydrogenase)(446), the largest enzyme involved in oxidative phosphorylation, which catalyses the first step of the mitochondrial electron transport chain. As NDUFAF1 is essential for the proper functioning of mitochondrial complex I, the down-regulation of NDUFAF1 in our study with recurrent disease implies a possible dysregulation of NADH dehydrogenase in more aggressive disease (446). The findings of increased TCA cycle activation (increased SDHA expression) but potential dysfunction of a central component of the electron transfer chain (reduced NDUFAF1) in more aggressive disease may reflect the start of a shift away from complete oxidative metabolism towards aerobic glycolysis (so-called Warburg effect) to allow the immediate generation of macromolecules and rapid energy production. Alternatively, the two findings together may reflect increased TCA cycle activity and functional oxidative metabolism, but with the aforementioned shift from NADH to succinate-supported ATP production through mitochondrial complex II (363, 364).

178

Some studies have found the expression of several key genes involved in oxidative phosphorylation to be negatively associated with high Gleason grade and poor outcome in prostate cancer (447, 448). These studies did not measure NDUFAF1 expression specifically. Downregulation of NDUFAF1 mRNA or protein expression has been demonstrated in other cancers and identified as a poor prognostic factor (449, 450).

It is difficult to state definitively that reduced expression of NDUFAF1 contributes to a downregulation in oxidative phosphorylation, as the exact role of NDUFAF1 as one of many assembly proteins for mitochondrial complex I is uncertain.

PCDH18 PCDH18 (protocadherin 18) was downregulated in the recurrent group (p-value of 0.197). PCDH18 encodes a member of the nonclustered protocadherins of the cadherin superfamily of adhesion molecules (451). PCDH18 is involved in cell-cell adhesion with important roles during development of the brain (452, 453). Several of the nonclustered protocadherins have now been established as tumour suppressor genes (454, 455).

PCDH8, PCDH10 and PCDH17 are downregulated by promoter methylation in prostate cancer, with downregulation significantly associated with adverse clinicopathological parameters and poor outcome (456-458). PCDH18 has not been specifically identified as a tumour suppressor gene in prostate cancer. In other cancers PCDH18 has been found to be downregulated, with lower PCDH18 expression identified as a poor prognostic factor (459, 460).

Discussion regarding genes of interest

In this study, it was evident that a large proportion of the genes found to be important predictors in the best performing models of recurrence were genes encoding key enzymes of metabolic pathways. These pathways included fatty acid oxidation (AMACR), the TCA cycle (SDHA), oxidative phosphorylation (NDUFAF1) and purine metabolism (HPRT1, GUCY1A1).

Metabolic reprogramming is now considered a hallmark trait of cancer and is necessary to allow tumour cells to meet their metabolic requirements, which differ from normal cells (116). Tumour cells reprogram their anabolic and catabolic metabolism to acquire

179 sufficient energy and cellular building blocks for biomass synthesis, in order to maintain growth and support rapid proliferation (461). The study of metabolism in prostate cancer is an active area of research, and understanding prostate cancer through a metabolic perspective is of great interest (189). In fact, an emerging risk factor for prostate cancer is metabolic syndrome, which describes a collection of metabolic risk factors for cardiovascular disease including visceral obesity, insulin resistance and abnormal cholesterol levels (462). Metabolic syndrome is also associated with aggressive potential in prostate cancer as the chronic inflammation associated with this condition may promote tumour growth (463, 464).

Various metabolic changes have previously been identified in prostate cancer and associated with its progression (465). Activated metabolic pathways, metabolomic signatures and alterations in key metabolic genes have been found to show prognostic potential in some recent studies (361, 466).

The most highly prevalent gene in the best performing gene models of this study, whose prevalence far exceeded any other gene, was AMACR, a well-characterised gene in prostate cancer which encodes an enzyme involved in fatty acid oxidation (191). A large number of studies has described upregulation of AMACR expression (mRNA and protein) in prostate cancer and the diagnostic value of AMACR protein expression has been well- described (191, 467). However, the prognostic role of AMACR has been less fully explored, with conflicting results (343, 345). In agreement with our results, some studies have found reduced AMACR mRNA/protein expression in more aggressive prostate cancer (339, 340). Upregulation of AMACR expression in prostate cancer is likely to reflect the reliance of early prostate cancers on fatty acid oxidation as a source of energy (468). The significance of reduced AMACR expression in recurrent disease is unclear but may relate to the shift towards de novo lipid synthesis, which is feature of advanced prostate cancer, or the changing availability of other energy sources at this stage (469).

The study found a potential shift towards oxidative metabolism in recurrent disease, as shown by possible activation of the TCA cycle (increased SDHA expression). Reactivation of the TCA cycle is a well-documented characteristic in prostate cancer and has been associated with its progression (353, 361). This finding alongside decreased PTEN and

180

NDUFAF1 expression in recurrent disease may reflect a shift towards succinate- supported oxidation in more aggressive prostate cancer, which has been described by others (363, 364). The identification of SDHA and HPRT1 (another metabolic gene) as important predictor variables was novel as these genes have rarely been investigated with regard to outcome in prostate cancer as they have long been regarded as reference genes (289, 470).

This study identified several other genes as important predictor variables in the best performing models, some of which have previously been associated with outcome in prostate cancer (e.g. PLAU and PTEN)(366, 369). In particular, PTEN loss is one of the most commonly cited genomic events in prostate cancer, which has a strong association with disease progression (471). Other important genes of interest found in this study have not been previously described in relation to prostate cancer progression (e.g. ATP10B and EFNA5).

It is important to remember that the study identified genes that were differentially expressed between the recurrent and non-recurrent group and had strong predictive value in combination. It is possible to postulate the effect that these gene changes may have on prostate cancer progression. However, it is difficult to directly link mRNA changes with a final protein and its activity due to the numerous regulatory mechanisms that may occur at the post-transcriptional and post-translational level (472, 473). Furthermore, many of these genes encode ‘components’ of proteins only and the exact role of these components may not be known (e.g. a subunit or assembly factor of an enzyme). Despite a lack of understanding of the precise effect of these transcriptomic changes, this does not reduce their predictive value.

General discussion and future studies

A plethora of studies has identified gene expression profiles of disease progression in prostate cancer using technologies such as oliognucleotide or cDNA microarray technologies. Gene expression profiles have been identified in prostate cancer with the potential to predict various clinical endpoints such as BCR, systemic progression and prostate cancer-specific death (262, 354, 474). These technologies analyse vast numbers of genes, resulting in large amounts of data that are difficult to resolve further.

181

Generally, these studies have used more sensitive techniques such as real-time qPCR or IHC to validate genes of interest in larger numbers of archival samples (261, 290). As real-time qPCR is time-consuming, usually only small numbers of genes are validated using this technique. Despite the large numbers of potential gene biomarkers and panels that have been identified as being associated with outcome in prostate cancer, few have been thoroughly validated or brought into clinical practice. The candidate genes have also rarely overlapped between studies (334).

The TaqMan array was chosen for this study as it allowed a significant number of genes to be analysed at once, whilst retaining the necessary sensitivity of real-time qPCR, which allowed the use of minimally-sized FFPE biopsy samples (277). Archival biopsy samples were both readily available and had the necessary clinical follow-up to allow correlation of the gene expression data with patient outcome. By testing a significant number of genes, it was more likely that useful predictive models could be identified as, due to the complexity of the genetic changes in prostate cancer, predictive models incorporating multiple genes would seem most feasible (475).

TaqMan arrays have been used in a small number of other studies to analyse mRNA panels in prostate cancer using various selections of candidate genes, although they have been used more frequently with microRNA (miRNA) panels (476, 477). Importantly, the aforementioned commercially available prognostic gene expression panels called Prolaris® and Oncotype DX® use TaqMan array technology (268).

Many reported gene expression panels in prostate cancer selected candidate genes from a limited set of genes or biological pathways that had been previously implicated in carcinogenesis. For example, the Prolaris test selected its final gene panel of 31 genes from only 126 cell cycle progression genes chosen from the Gene Expression Omnibus database (271).

The genes on our TaqMan array had been selected from 12,600 genes in a published microarray dataset. A novel method of machine learning based on a single layer artificial neural network (ANN) was utilised to learn from and reduce this data to 44 genes that were strongly associated with prostate cancer (286). This method was chosen as it is ideally suited to handling complex datasets and reduces them to smaller datasets, which

182 are more likely to contain the most relevant genes of interest (478). Other genes were selected due to their repeated association with prostate cancer progression in the literature review or as a result of previous work in the department. To the best of my knowledge, this particular gene panel has not been described in other studies.

A number of two- and three-gene logistic regression models were identified that could predict 5-year recurrence with relatively high accuracy in a set of prostate cancer patients treated with radical prostatectomy/radiotherapy. The particular gene combinations found to have powerful predictive value in this study have not been described in other studies. In particular, a number of key metabolic genes were well represented in the best performing gene models, which have not been extensively explored with regard to risk stratification in prostate cancer.

As a discovery study, this study included various measures to limit the production of spurious models, including the avoidance of too many predictor variables for the sample size and the avoidance of high multi-collinearity. However, it is imperative to remember that these models have only been found to predict recurrence in this particular dataset.

To understand if the final models would be generalizable to other prostate cancer patients, their performance would need to be independently validated in another dataset from a completely separate group of prostate cancer patients (external validation)(479). An external validation study could focus on the final 28 two-gene and 9 three-gene models found in this training dataset. The expression of the 19 genes included in these models could be measured in biopsy samples from the new set of patients (treated with radical prostatectomy/radiotherapy as in the training set) using identical methods. It would be essential that the 5-year recurrence status for each of the new patients be known to allow each of the final models to be applied to the new dataset. The model performance could be assessed initially by calculating the predicted probability of 5-year recurrence for each new patient using the previously identified parameter estimates for each model in the LR equation. ROC curves could then be plotted using these probabilities to assess the discrimination accuracy of the model on the new dataset. The AUC value would be a measure of the accuracy of the model to discriminate patients into either the recurrent or non-recurrent group and this value

183 could be directly compared with the value obtained for the training set (480). A measure of calibration that measures the agreement between predicted and observed outcome (e.g. calibration plot) could also be used to assess the performance of each final model in the validation compared with the training dataset (481).

As the collection of data from an entirely new set of prostate cancer patients would be difficult to achieve, an alternative method would be to run the final 28 two-gene and 9 three-gene models on an in silico validation set such as The Cancer Genome Atlas/Memorial Sloan-Kettering Cancer Center (TCGA/MSKCC) or Stand Up to Cancer/Prostate Cancer Foundation (SU2C/PCF) dataset using a cohort of similarly treated prostate cancer patients to this study.

It is important to consider a potential limitation of using biopsy samples for gene expression studies. The well-characterised problem of intratumour heterogeneity may also affect gene expression analysis. This may have affected both this study and the pilot study as both relied on the use of minimally-sized tissue punches from only one or two individual areas (482). However, the selection of the area of highest Gleason grade may have been more reliable in the pilot study as the entire prostate was available for examination by the pathologist. In comparison, due to sampling error, the biopsy sample may have either missed or over-represented the area of highest Gleason grade (103). Despite these concerns, several studies have suggested that molecular features may be widespread throughout a prostate tumour (483) and prognostic molecular signatures have been identified in biopsy samples that were not affected by sampling error (272, 484).

Another limitation of the study was the use of recurrence as an endpoint. Ideally, systemic progression or prostate cancer-specific death would have been more relevant clinical endpoints. Although patients who had suffered metastastic progression (clinical recurrence) were included in the recurrent group, many of the patients had suffered only BCR within the tested time frame. Therefore, some of the identified genes may have been relevant to BCR only, rather than systemic progression. Men who develop BCR may have no clinical sign of recurrent disease or may have locally recurrent disease or even the presence of metastatic disease, and the clinical course of these patients is

184 highly variable (354). Therefore, BCR is not a relevant predictor of systemic progression and a large proportion of these men do not progress to metastatic progression (81). As explained earlier, there were insufficient numbers of patients who suffered systemic progression or death from prostate cancer within the time frame, to allow these endpoints to be used in this study. Future studies using larger datasets, e.g. in silico, could potentially investigate whether the identified gene models have the potential to predict either systemic progression or prostate cancer-specific death.

2.5 Conclusions

This study used a TaqMan array to measure the expression of a set of putative progression-related genes in diagnostic biopsies of prostate cancer patients treated with radical prostatectomy and/or radiotherapy. Using binary logistic regression analysis, 28 two-gene models and 9 three-gene models were identified, which demonstrated good overall performance for the prediction of 5-year recurrence (biochemical/clinical) in these patients. Many of the genes found to be important predictors in the models were key genes encoding components of metabolic pathways, underlining the importance of metabolic reprogramming in prostate cancer progression.

The best performing three-gene models were: AMACR.EFNA5.SDHA (χ2(3) = 22.483, p<0.0001) and AMACR.HPRT1.SDHA (χ2(3) = 22.265, p<0.0001). In addition to their good overall performance (low p-value for the LRT statistic (χ2) describing the fit of the data under the model as a highly significant improvement over the null model), these models showed high discriminatory accuracy as demonstrated by AUC values of 0.812 and 0.829 respectively. Using an optimal threshold for categorisation of patients into either the recurrent or non-recurrent group, AMACR.EFNA5.SDHA demonstrated a sensitivity and specificity of 81% and 73.9% respectively. AMACR.HPRT1.SDHA had a sensitivity and specificity of 81% and 87% respectively.

The best performing two-gene models were AMACR.SDHA (χ2(2) = 16.722, p<0.0005) and HPRT1.PLAU (χ2(2) = 14.777, p<0.001). In addition to their good overall performance as shown by the high significance of χ2, AMACR.SDHA and HPRT1.PLAU had AUC values

185 of 0.784 and 0.808 respectively. AMACR.SDHA demonstrated a sensitivity and specificity of 71.4% and 78.3% respectively. HPRT1.PLAU had a sensitivity and specificity of 71.4% and 76.1% respectively.

It is important to remember that the two-gene models, despite their poorer overall performance, may be more statistically valid than the three-gene models due to the smaller number of predictors, which would make overfitting of the data less likely, particularly in this relatively small dataset (11 EPV for the two-gene models in comparison to only 7 for the three-gene models).

The study provided evidence that a panel of progression-related genes could be measured in archival prostate biopsies using TaqMan array and a selection of these genes could be useful in combination to inform treatment decisions in prostate cancer patients, e.g. for the selection of RP patients who would benefit from adjunctive treatments such as RT and/or ADT to increase their chance of progression-free survival. The predictive models identified in this study would require further validation. Only if found to improve upon the use of standard clinical and pathological tools would they be suitable for risk stratification at the time of clinical decision-making in prostate cancer.

As the study followed on from and expanded a pilot study, a large component of this study was the method development (275). Several important changes were made to increase both the clinical applicability and potentially the validity of the study in comparison to the pilot study. As a result of these changes, there were few similarities in the results between the two studies regarding the genes of interest identified and the predictive models generated. The changes in the BB study included those to enhance the robustness of the analysis methods, such as the use of individual baseline and threshold settings for each gene to improve the accuracy of the Ct values.

To increase the clinical applicability of the study, biopsy samples were used in preference to radical prostatectomy samples. Importantly, the outcome data for the patients was more complete than in the pilot study and the gene expression data were analysed with a longer follow-up time.

186

A greater number of models were tested in comparison to the pilot study, as the criterion for inclusion of predictors into the multivariate logistic regression analysis was relaxed. This was done to prevent the premature exclusion of variables, which could be important predictors, despite less than conventional significance (in t-tests). This presented a new challenge of ranking the large number of processed models, necessitating a new automated approach to analysis.

A more valid parameter of overall model performance, the LRT statistic (χ2), was used for ranking the models rather than the overall percent accuracy that had been used in the pilot study, which has several limitations.

This study detailed a method for whittling down the large number of ranked models to a final set of models, taking into consideration the need for the logistic regression models to exhibit both high calibration and discriminatory accuracy whilst retaining parsimony to prevent overfitting.

187

Appendix 1 Examples of potential biomarkers of prostate cancer progression

Name of biomarker Role gene/ ↑/↓ Sample Advantages Disadvantages mRNA/ with type protein progression Self-sufficiency in growth signals

MKI67 cellular marker of protein ↑ tissue Prognostic value has been demonstrated in RP and Some biopsy studies failed to find any independent proliferation biopsy samples in prostate cancer (485, 486). prognostic value (489, 490). proliferation marker protein Ki-67 In biopsy samples, expression positively correlated As a biopsy-based marker, may suffer from difficulties with adverse clinicopathological parameters (487); it related to molecular heterogeneity between different is an independent predictor of various clinical regions of the tumour and limited tumour sampling. endpoints including BCR and prostate cancer-specific mortality (PCSM) after RP/RT in some studies (485, Prognostic potential dependent on cut-off value and 488). no standard value has been defined.

EZH2/EZH2 methyltransfer- mRNA ↑ tissue EZH2 mRNA expression upregulated in metastatic vs Some studies found no independent predictive value ase involved in localized prostate cancer tissue (491, 492). for EZH2 mRNA or protein (495, 496). enhancer of zeste 2 transcriptional polycomb repressive repression Few studies have investigated EZH2 as a biopsy-based complex 2 subunit protein EZH2 protein expression in RP samples correlates marker. EZH2 expressed in <10% of prostate cancer with clinicopathological parameters of aggressive cells so could be missed in biopsy samples (495, 497). disease (493); EZH2 found to independently predict BCR after RP in a number of studies (494).

MYC transcription DNA ↑ tissue MYC amplification increases in frequency with MYC amplification may be infrequently found in low factor prostate cancer progression (129); some studies grade/early stage prostate cancers so its prognostic MYC proto-oncogene, have found MYC amplification to be an independent value may be relevant to advanced prostate cancers bHLH transcription factor predictor of BCR after RP/RT (129, 130). only (129).

188

Measurement of MYC amplification requires Fluorescence In Situ Hybridisation (FISH) which makes it difficult to implement routinely.

Insensitivity to anti-growth signals

PTEN/PTEN negative gene ↓ tissue PTEN genomic deletion increases in frequency with Limited number of studies have investigated the regulator of the prostate cancer progression (145); some studies potential of PTEN loss to predict PCSM (369, 498). phosphatase and tensin PI3K/AKT have found PTEN deletion to be significantly homolog pathway associated with early recurrence after RP (145, 369). FISH analysis is labour-intensive and may underestimate the frequency of PTEN protein loss as genomic deletion is not the only mechanism of PTEN loss.

protein ↓ Variability in reported frequency of PTEN protein loss; studies investigating the prognostic value of PTEN protein loss have shown conflicting results with an independent prognostic value demonstrated in some studies (373, 499), but not others (145, 500).

31-gene signature of cell mRNA ↑ CCP score tissue Validated in several independent cohorts using Prospective studies are required to monitor the cycle progression (CCP) biopsy samples and found to be an independent outcome of changing treatments in response to genes predictor of BCR after RP/RT (269, 501); and PCSM in Prolaris test results (particularly shifting patients from conservatively managed patients (502). Prognostic interventional to conservative treatment). ability found to exceed standard clinicopathological variables in some studies (503). Expensive.

Commercially available as the biopsy-based Prolaris test (Myriad Genetics), which may provide additional meaningful information to risk assessment of localized prostate cancer at the time of treatment selection (268).

Evading apoptosis

189

BCL2 inhibits the action Has been found to independently predict BCR after Other studies have found no significant predictive of proapoptotic protein ↑ tissue RP/RT in a handful of studies (504, 505). value (201, 506). apoptosis regulator Bcl-2 proteins at the Limited number of studies. cell membrane

IGFBP3 proapoptotic role protein ↓ plasma/ Inversely correlated with clinicopathological Limited number of studies with some failing to find any by limiting the serum parameters of aggressive disease and poor prognosis prognostic value for circulating IGFBP3 (508, 509). insulin-like growth factor- availability of free in some studies (152, 507). binding protein 3 insulin-like IGFBP3 is highly abundant in circulation and affected growth factors by a variety of factors.

Sustained angiogenesis

MVD measured by IHC protein ↑ tissue Correlated with adverse clinicopathological No independent predictive value found by others against parameters (510); studies have found MVD to be an (514), likely due to differing methodologies including microvessel density endothelial cell independent predictor of BCR after RP (177, 511). It the endothelial marker chosen, the representative markers has also been found to be predictive of poor survival tumour area selected and the counting method (515). (512, 513). Problems with lack of specificity of markers/difficulties with staining newly formed vessels.

Few studies have investigated predictive value of MVD in biopsy samples.

PTGS2 enzyme involved protein ↑ tissue Correlated with adverse clinicopathological Conflicting results by others who have either failed to in the production parameters (172); an independent predictor of BCR find any predictive value for PTGS2 or found no prostaglandin G/H of prostaglandins; and PCSM after RP in some studies (172, 516). independent predictive value (517, 518). synthase 2 a central mediator of Variation in the methods used for measuring staining angiogenesis intensity.

PTGS2 expression may be restricted to infiltrating inflammatory cells including lymphocytes and macrophages (519).

190

Invasion and metastasis

PLAU/PLAUR serine protease protein ↑ plasma/ PLAU and PLAUR positively correlated with clinical Relatively few studies have investigated the prognostic involved in serum parameters of disease progression (520); potential of soluble PLAU/PLAUR. urokinase-type degradation of preoperative PLAU found to be an independent plasminogen the basement predictor of progression after RP (520, 521). Cleaved forms of soluble PLAUR may be most activator/urokinase membrane and significant prognostically (523, 524). plasminogen activator ECM PLAUR alone or in combination with miR-375 surface receptor microRNA has been found to be highly predictive of disease-specific/overall survival (367, 522).

CDH1 epithelial cell protein ↓ tissue Reduced expression is correlated with adverse Other studies failed to find any independent predictive adhesion clinicopathological parameters (206); independent value for CDH1 (526, 527). cadherin-1 molecule predictor of recurrence after RP and RT (201, 525); reduced expression associated with poor disease- A wide variety of different methods for scoring specific survival (207, 389). between studies.

Loss of CDH1 may be a transient change at certain stages of progression only and CDH1 may be re- expressed in mCRPC (205).

Four-kallikrein panel algorithm using protein ↑ plasma Has undergone extensive validation and High cost. results from four commercially available as the 4Kscore® Test (OPKO prostate-specific Lab); improves the prediction of biopsy detectable, Shows significant potential for limiting unnecessary kallikrein assays high grade (≥7) prostate cancer (528, 529). biopsies (532) but prognostic potential is limited by the (total PSA (tPSA), small number of studies that have investigated more free PSA (fPSA), Improves the prediction of aggressive histopathology specific endpoints such as recurrence after treatment intact PSA and found at RP (pT3-pT4, extracapsular extension, and disease-specific mortality. kallikrein-2) and tumour volume >0.5cm3 or any Gleason grade ≥4) clinical when added to a clinical model containing age, information stage, PSA and biopsy findings (530); another study found that the panel predicts 20-year risk of metastasis (531).

191

[-2]proPSA(p2PSA) and its protein ↑ serum PHI has been found to have greater ability to predict Further research needed to assess clinical utility. derivatives; prostate cancer and GS ≥7 disease than tPSA Beckman Coulter or %fPSA (533). Limited number of studies which tended to use Prostate Health Index samples with tPSA values of <10ng/ml-may have (PHI) Predictive of unfavourable pathology and BCR in RP exaggerated the predictive potential of PHI. [(p2PSA/fPSA)x √tPSA] patients (534, 535); may be useful in preoperative setting. p2PSA is an isoform of precursor PSA PHI improves prediction of biopsy reclassification (component of fPSA) during follow-up in active surveillance patients (Gleason ≥7, >2 positive cores, >50% of any core involved with cancer) (536).

Genomic instability

TMPRSS:ERG gene fusion DNA/ positive vs tissue Correlated with adverse clinicopathological Other studies have found a null or even an inverse measured using mRNA/ negative/↑ parameters (537); independent predictor of BCR and association with poor clinical outcome (500, 544). transmembrane serine FISH/ protein PCSM after RP in some studies (538, 539). protease 2:ETS RT-qPCR/ERG Prognostic significance may be dependent on the type transcription factor ERG immunostaining Independent predictor of disease progression and of rearrangement causing the fusion, e.g. PCSM in conservatively managed cohorts in some rearrangement through deletion; therefore FISH may studies (540, 541). be the most appropriate method, which is labour- intensive (545).

ERG immunostaining offers a much easier assay but may also detect fusions between ERG and other genes.

Only a proportion of prostate cancers carry this fusion mRNA ↑ urine gene, hence its prognostic ability may be dependent Useful for the detection of aggressive disease (542, on its combination with other markers. 543).

Stromal interactions stromal HA/tumoural glycosamino- protein ↑ tissue HA correlated with some adverse clinicopathological No independent prognostic value for HA found in HYAL1 glycan parameters (237); HYAL1 alone or in combination other studies (237, 547).

192

component of the with HA has been found to independently predict stromal ECM/ BCR after RP in some studies (241, 546). Limited number of studies. hyaluronan/tumoural hyaluronidase hyaluronidase-1 degrades hyaluronan

serum BAP/uNTX bone formation protein ↑ serum/ Both predict poor overall survival in patients with No prognostic value in non-metastatic disease. marker/bone urine metastatic disease (548, 549); may be useful to serum bone-specific resorption marker monitor response to treatment in metastatic disease alkaline phosphatase/ . urinary N-telopeptide

193

Appendix 2 Genes on the BB and SL array

Appendix 2a List of genes and assay numbers on the BB array. Genes in green were included on the BB array only. Assay numbers in red denote the use of a different assay number to the SL array.

Gene name Alternative Full name ABI assay ID name(s) number Oncogenes APOBEC3G CEM15 apolipoprotein B mRNA Hs00222415_m1 MDS019 editing enzyme catalytic bK150C2.7 subunit 3G dJ494G10.1 EEF1A1 EE1A1 eukaryotic translation Hs00265885_g1 EEF1A elongation factor 1 alpha 1 EF1A LENG7 ERG erg-3 ERG, ETS transcription factor Hs00171666_m1 p55 EZH2 ENX-1 enhancer of zeste 2 polycomb Hs00544830_m1 EZH1 repressive complex 2 subunit KMT6 KMT6A NELL2 NRP2 neural EGFL like 2 Hs00196254_m1 PVT1 LINC00079 Pvt1 oncogene Hs01069044_m1 NCRNA00079 RHBDL1 RHBDL rhomboid like 1 Hs00610210_g1 RRP Tumour suppressor genes BRCA2 BRCC2 BRCA2, DNA repair associated Hs01037414_m1 FACD FAD FAD1 FANCD FANCD1 PTEN BZS phosphatase and tensin Hs00829813_s1 MHAM homolog MMAC1 PTEN1 TEP1 RAP1GAP RAP1GA1 RAP1 GTPase activating Hs00182299_m1 RAP1GAP1 protein RAP1GAPII SERPINB5 PI5 serpin family B member 5 Hs00985285_m1 maspin TGFB2 transforming growth factor Hs00234244_m1 beta 2 Apoptosis related ABL1 JTK7 ABL proto-oncogene 1, Hs00245443_m1 c-ABL non-receptor tyrosine kinase p150 EDNRA ET-A endothelin receptor type A Hs00609865_m1 ETA-R HSPB1 CMT2F heat shock protein family B Hs00356629_g1 Hs.76067 (small) member 1

194

HSP27 HSP28 Hsp25 Angiogenic ANXA2 ANX2 annexin A2 Hs00743063_s1 ANX2L4 CAL1H LIP2 LPC2D EFNA1 ECKLG ephrin A1 Hs00358886_m1 EPLG1 LERK1 TNFAIP4 EFNA2 ELF-1 ephrin A2 Hs00154858_m1 EPLG6 LERK6 EFNA3 EPLG3 ephrin A3 Hs00191913_m1 Ehk1-L LERK3 EFNA4 EPLG4 ephrin A4 Hs00193299_m1 LERK4 EFNA5 AF1 ephrin A5 Hs00157342_m1 EPLG7 LERK7 MMP2 CLG4 matrix metallopeptidase 2 Hs00234422_m1 CLG4A TBE-1 VASH1 KIAA1036 vasohibin 1 Hs00208609_m1 Invasion and metastasis related ALCAM CD166 activated leukocyte cell Hs00233455_m1 MEMD adhesion molecule ANPEP CD13 alanyl aminopeptidase, Hs00174265_m1 gp150 membrane LAP1 p150 PEPN CD44 CSPG8 CD44 molecule (Indian blood Hs00174139_m1 HCELL group) IN MC56 MDU2 MDU3 MIC4 Pgp1 CD9 MIC3 CD9 molecule Hs00233521_m1 MRP-1 TSPAN29 CDC42EP3 BORG2 CDC42 effector protein 3 Hs00377831_m1 CEP3 UB1 CDH1 CD324 cadherin 1 Hs01023894_m1 UVO DPT dermatopontin Hs00355056_m1 F11R CD321 F11 receptor Hs00375889_m1 JAM-1 JAMA JCAM PAM-1 GUCY1A1 GUCY1A3 guanylate cyclase 1 soluble Hs00168325_m1 GC-SA3 subunit alpha 1 GUC1A3 ITGA2 CD49B integrin subunit alpha 2 Hs00158127_m1

195

ITGA3 CD49c integrin subunit alpha 3 Hs00233722_m1 GAP-B3 MSK18 VCA-2 VLA3a ITGA5 CD49e integrin subunit alpha 5 Hs00233743_m1 FNRA ITGB4 CD104 integrin subunit beta 4 Hs00236216_m1 ITGB6 integrin subunit beta 6 Hs00168458_m1 MUC1 CD227 mucin 1, cell surface Hs00904314_g1 MCKD1 associated PEM PUM PCDH18 PCDH68L protocadherin 18 Hs01556217_m1 PLAU UPA plasminogen activator, Hs01547054_m1 URK urokinase Metabolic ALPG ALPPL2 alkaline phosphatase, germ Hs00741068_g1 GCAP cell AMACR P504S alpha-methylacyl-CoA Hs02786742_s1 RACE racemase CYP11A1 CYP11A cytochrome P450 family 11 Hs00167984_m1 P450SCC subfamily A member 1 FABP5 E-FABP fatty acid binding protein 5 Hs02339439_g1 KFABP PA-FABP FADS1 D5D fatty acid desaturase 1 Hs00203685_m1 FADS6 FADSD5 LLCDL1 TU12 HPX hemopexin Hs00167197_m1 INMT indolethylamine N- Hs00198941_m1 methyltransferase PNP NP purine nucleoside Hs01002926_m1 PUNP phosphorylase TF PRO1557 transferrin Hs00169070_m1 PRO2086 Differentiation and proliferation ASB2 ASB-2 ankyrin repeat and SOCS box Hs00387867_m1 containing 2 CDKN1C BWCR cyclin dependent kinase Hs00175938_m1 BWS inhibitor 1C KIP2 P57 p57Kip2 ERBB2 HER-2 erb-b2 receptor tyrosine Hs01001598_g1 HER2 kinase 2 NEU GPR12 GPCR21 G protein-coupled receptor 12 Hs00271037_s1 HOXC6 HOX3 homeobox C6 Hs00171690_m1 HOX3C HPN TMPRSS1 hepsin Hs00170096_m1 IRF7 interferon regulatory factor 7 Hs00242190_g1 ITGB1 CD29 integrin subunit beta 1 Hs00559595_m1 FNRB GPIIA MDF2 MSK12 MKI67 KIA marker of proliferation Ki-67 Hs01032443_m1 MIB-1 NUP50 NPAP60L nucleoporin 50 Hs00538109_m1

196

PSCA prostate stem cell antigen Hs00194665_m1 PTGFR FP prostaglandin F receptor Hs00168763_m1 PRELID3A SLMO1 PRELI domain containing 3A Hs00272505_s1 C18orf43 HFL-EDDG1 TBRG4 Cpr2 transforming growth factor Hs00191057_m1 FASTKD4 beta regulator 4 TRAFD1 FLN29 TRAF-type zinc finger domain Hs00198630_m1 containing 1 Detox GSTM3 GST5 glutathione S-transferase mu 3 Hs00356079_m1 GSTM5 glutathione S-transferase mu 5 Hs00757076_m1 MT1X MT-1l metallothionein 1X Hs00745167_sH MT1 Others ATP10B ATPVB ATPase phospholipid Hs00391638_m1 KIAA0715 transporting 10B (putative) ATP5MPL C14orf2 ATP synthase membrane Hs00191690_m1 MP68 subunit 6.8PL C3 ARMD9 complement C3 Hs01100917_g1 CPAMD1 CANX CNX calnexin Hs00233492_m1 IP90 P90 CD109 CPAMD7 CD109 molecule Hs00370347_m1 CD4 CD4mut CD4 molecule Hs00181217_m1 CHM DXS540 CHM, escort protein 1 Hs00166083_m1 REP-1 TCD CRHR1 CRF-R corticotropin releasing Hs00366363_m1 CRF1 hormone receptor 1 CRHR GPM6A GPM6 glycoprotein M6A Hs00245530_m1 GPR19 G protein-coupled receptor 19 Hs00272049_s1 KLHL5 kelch like family member 5 Hs00375006_m1 MNX1 HB9 motor neuron and pancreas Hs00232128_m1 HLXB9 homeobox 1 HOXHB9 SCRA1 NDUFAF1 CGI-65 NADH:ubiquinone Hs00211245_m1 CIA30 oxidoreductase complex assembly factor 1 PGC progastricsin Hs00160052_m1 PLCL2 PLCE2 phospholipase C like 2 Hs00392897_m1 PLLP PMLP plasmolipin Hs00762550_s1 TM4SF11 POP5 POP5 homolog, ribonuclease Hs00372215_m1 P/MRP subunit PRIM2 PRIM2A DNA primase subunit 2 Hs00386277_m1 PVALB D22S749 parvalbumin Hs00161045_m1 SLC44A2 CTL2 solute carrier family 44 Hs00220814_m1 member 2 TERT EST2 telomerase reverse Hs00162669_m1 TCS1 transcriptase TP2 TRT hEST2 TRIP13 16E1BP thyroid hormone receptor Hs00188500_m1 interactor 13 Housekeeping genes 18S rRNA 18S ribosomal RNA Hs99999901_s1

197

HMBS PBGD hydroxymethylbilane synthase Hs00609297_m1 PORC UPS HPRT1 HGPRT hypoxanthine Hs99999909_m1 HPRT phosphoribosyltransferase 1 SDHA FP succinate dehydrogenase Hs00188166_m1 SDH2 complex flavoprotein subunit SDHF A TBP GTF2D1 TATA-box binding protein Hs00427620_m1 SCA17 TFIID

198

Appendix 2b List of genes and assay numbers on the SL array. Genes in blue were included on the SL array only. Assay numbers in red denote the use of a different assay number to the BB array.

Gene name Alternative Full name ABI assay ID name(s) number Oncogenes APOBEC3G CEM15 apolipoprotein B mRNA Hs00222415_m1 MDS019 editing enzyme catalytic bK150C2.7 subunit 3G dJ494G10.1 EEF1A1 EE1A1 eukaryotic translation Hs00265885_g1 EEF1A elongation factor 1 alpha 1 EF1A LENG7 ERG erg-3 ERG, ETS transcription factor Hs00171666_m1 p55 EZH2 ENX-1 enhancer of zeste 2 polycomb Hs00172783_m1 EZH1 repressive complex 2 subunit KMT6 KMT6A NELL2 NRP2 neural EGFL like 2 Hs00196254_m1 PVT1 LINC00079 Pvt1 oncogene Hs00413039_m1 NCRNA00079

RHBDL1 RHBDL rhomboid like 1 Hs00610210_g1 RRP SFPQ PSF splicing factor proline and Hs00192574_m1 PPP1R140 glutamine rich Tumour suppressor genes BRCA2 BRCC2 BRCA2, DNA repair associated Hs00609060_m1 FACD FAD FAD1 FANCD FANCD1 RAP1GAP RAP1GA1 RAP1 GTPase activating Hs00182299_m1 RAP1GAP1 protein RAP1GAPII SERPINB5 PI5 serpin family B member 5 Hs00184728_m1 Maspin TGFB2 transforming growth factor Hs00236092_m1 beta 2 Apoptosis related ABL1 JTK7 ABL proto-oncogene 1, Hs00245443_m1 c-ABL non-receptor tyrosine kinase p150 EDNRA ET-A endothelin receptor type A Hs00609865_m1 ETA-R HSPA5 GRP78 heat shock protein family A Hs00607129_gH BiP (Hsp70) member 5 HSPB1 CMT2F heat shock protein family B Hs00356629_g1 Hs.76067 (small) member 1 HSP27 HSP28 Hsp25 MYC c-Myc MYC proto-oncogene, bHLH Hs00153408_m1 MYCC transcription factor

199

HSP90B1 TRA1 heat shock protein 90 beta Hs00427665_g1 GRP94 family member 1 GP96 Angiogenic ANXA2 ANX2 annexin A2 Hs00743063_s1 ANX2L4 CAL1H LIP2 LPC2D EFNA1 ECKLG ephrin A1 Hs00358886_m1 EPLG1 LERK1 TNFAIP4 EFNA2 ELF-1 ephrin A2 Hs00154858_m1 EPLG6 LERK6 EFNA3 EPLG3 ephrin A3 Hs00191913_m1 Ehk1-L LERK3 EFNA4 EPLG4 ephrin A4 Hs00193299_m1 LERK4 EFNA5 AF1 ephrin A5 Hs00157342_m1 EPLG7 LERK7 MMP2 CLG4 matrix metallopeptidase 2 Hs00234422_m1 CLG4A TBE-1 MMP9 CLG4B matrix metallopeptidase 9 Hs00234579_m1 VASH1 KIAA1036 vasohibin 1 Hs00208609_m1 Invasion and metastasis related ALCAM CD166 activated leukocyte cell Hs00233455_m1 MEMD adhesion molecule ANPEP CD13 alanyl aminopeptidase, Hs00174265_m1 gp150 membrane LAP1 p150 PEPN CD44 CSPG8 CD44 molecule (Indian blood Hs00174139_m1 HCELL group) IN MC56 MDU2 MDU3 MIC4 Pgp1 CD9 MIC3 CD9 molecule Hs00233521_m1 MRP-1 TSPAN29 CDC42EP3 BORG2 CDC42 effector protein 3 Hs00377831_m1 CEP3 UB1 CDH1 CD324 cadherin 1 Hs00170423_m1 UVO CDH2 NCAD cadherin 2 Hs00169953_m1 CDHN CD325 CTNNA1 CAP102 catenin alpha 1 Hs00426996_m1 DPT dermatopontin Hs00170030_m1 F11R CD321 F11 receptor Hs00375889_m1 JAM-1 JAMA JCAM

200

PAM-1 GUCY1A1 GUCY1A3 guanylate cyclase 1 soluble Hs00168325_m1 GC-SA3 subunit alpha 1 GUC1A3 ITGA2 CD49B integrin subunit alpha 2 Hs00158127_m1 ITGA3 CD49c integrin subunit alpha 3 Hs00233722_m1 GAP-B3 MSK18 VCA-2 VLA3a ITGA5 CD49e integrin subunit alpha 5 Hs00233743_m1 FNRA ITGB4 CD104 integrin subunit beta 4 Hs00173995_m1 ITGB6 integrin subunit beta 6 Hs00168458_m1 PCDH18 PCDH68L protocadherin 18 Hs00251855_m1 TGM2 TGC transglutaminase 2 Hs00190278_m1 WIF1 WNT inhibitory factor 1 Hs00183662_m1 Metabolic ALPG ALPPL2 alkaline phosphatase, germ Hs00741068_g1 GCAP cell AMACR P504S alpha-methylacyl-CoA Hs00204885_m1 RACE racemase CYP11A1 CYP11A cytochrome P450 family 11 Hs00167984_m1 P450SCC subfamily A member 1 FABP5 E-FABP fatty acid binding protein 5 Hs02339439_g1 KFABP PA-FABP FADS1 D5D fatty acid desaturase 1 Hs00203685_m1 FADS6 FADSD5 LLCDL1 TU12 INMT indolethylamine N- Hs00198941_m1 methyltransferase PNP NP purine nucleoside Hs00165367_m1 PUNP phosphorylase Differentiation and proliferation ASB2 ASB-2 ankyrin repeat and SOCS box Hs00387867_m1 containing 2 CDKN1C BWCR cyclin dependent kinase Hs00175938_m1 BWS inhibitor 1C KIP2 P57 p57Kip2 GPR12 GPCR21 G protein-coupled receptor 12 Hs00271037_s1 HOXC6 HOX3 homeobox C6 Hs00171690_m1 HOX3C HPN TMPRSS1 hepsin Hs00170096_m1 IRF7 interferon regulatory factor 7 Hs00242190_g1 ITGB1 CD29 integrin subunit beta 1 Hs00559595_m1 FNRB GPIIA MDF2 MSK12 KCNN4 IK potassium calcium-activated Hs00158470_m1 hSK4 channel subfamily N member KCa3.1 4 NUP50 NPAP60L nucleoporin 50 Hs00855432_g1 PSCA prostate stem cell antigen Hs00194665_m1 PTGFR FP prostaglandin F receptor Hs00168763_m1 PRELID3A SLMO1 PRELI domain containing 3A Hs00398895_m1 C18orf43

201

HFL-EDDG1 TBRG4 Cpr2 transforming growth factor Hs00191057_m1 FASTKD4 beta regulator 4 TRAFD1 FLN29 TRAF-type zinc finger domain Hs00198630_m1 containing 1 Detox GSTM3 GST5 glutathione S-transferase mu 3 Hs00168307_m1 GSTM5 glutathione S-transferase mu 5 Hs00757076_m1 Others ATP10B ATPVB ATPase phospholipid Hs00391638_m1 KIAA0715 transporting 10B (putative) ATP5MPL C14orf2 ATP synthase membrane Hs00191690_m1 MP68 subunit 6.8PL CANX CNX calnexin Hs00233492_m1 IP90 P90 CD109 CPAMD7 CD109 molecule Hs00370347_m1 CD4 CD4mut CD4 molecule Hs00181217_m1 CHM DXS540 CHM, Hs00166083_m1 REP-1 TCD CRHR1 CRF-R corticotropin releasing Hs00366363_m1 CRF1 hormone receptor 1 CRHR GPM6A GPM6 glycoprotein M6A Hs00245530_m1 GPR19 G protein-coupled receptor 19 Hs00272049_s1 KLHL5 kelch like family member 5 Hs00375006_m1 MNX1 HB9 motor neuron and pancreas Hs00232128_m1 HLXB9 homeobox 1 HOXHB9 SCRA1 NDUFAF1 CGI-65 NADH:ubiquinone Hs00211245_m1 CIA30 oxidoreductase complex assembly factor 1 PGC progastricsin Hs00160052_m1 PLCL2 PLCE2 phospholipase C like 2 Hs00392897_m1 PLLP PMLP plasmolipin Hs00762550_s1 TM4SF11 POP5 POP5 homolog, ribonuclease Hs00372215_m1 P/MRP subunit PRIM2 PRIM2A DNA primase subunit 2 Hs00386277_m1 PVALB D22S749 parvalbumin Hs00193520_m1 RBM12 SWAN RNA binding motif protein 12 Hs00246072_s1 KIAA0765 SLC44A2 CTL2 solute carrier family 44 Hs00220814_m1 member 2 TERT EST2 telomerase reverse Hs00162669_m1 TCS1 transcriptase TP2 TRT hEST2 TRIP13 16E1BP thyroid hormone receptor Hs00188500_m1 interactor 13 Housekeeping genes 18S rRNA 18S ribosomal RNA Hs99999901_s1 HMBS PBGD hydroxymethylbilane synthase Hs00609297_m1 PORC UPS HPRT1 HGPRT hypoxanthine Hs99999909_m1 HPRT phosphoribosyltransferase 1

202

SDHA FP succinate dehydrogenase Hs00188166_m1 SDH2 complex flavoprotein subunit SDHF A TBP GTF2D1 TATA-box binding protein Hs00427620_m1 SCA17 TFIID

203

Appendix 3 Background information

Taqman arrays

The TaqMan array is a 384-well microfluidic card, which allows up to 384 real-time qPCR reactions to be run simultaneously. The card can be customised with 12 to 384 gene targets from Applied Biosystems inventoried collection of TaqMan® Gene Expression Assays that are preloaded into each well of the card. cDNA samples mixed with PCR master mix are loaded via the 8 sample loading ports which are each connected to 48 (1µl) wells via a network of microchannels, allowing delivery of the sample to the wells via a centrifugal filling process.

Sample loading

Sample port

Well

Picture of a) TaqMan arrays b) TaqMan array loading ports

TaqMan® Gene Expression Assays use TaqMan detection chemistry, which utilises the innate 5’ nuclease activity of Taq DNA polymerase to cleave a fluorescently labelled sequence-specific probe and produce a detectable signal (550). Each assay contains a forward and reverse primer, which are designed to be complementary to specific regions of the target sequence, and a TaqMan probe which is complementary to an internal region of the template between the two primer sites. The probe is labelled with two dyes of different fluorescence emission wavelengths: a reporter dye and a quencher dye situated on the 5’ and 3’ end respectively. When the probe is intact, an excited

204 reporter dye does not fluoresce but transfers energy to the quencher dye by fluorescence resonance energy transfer (FRET) due to the proximity of the two dyes. If the target DNA sequence is present, annealing of the forward and reverse primers will be followed by binding of the probe to its complementary sequence. Then during the extension step of the PCR amplification reaction, as the enzyme reaches the bound probe, it utilises its 5’ exonuclease activity to displace the probe and cleave the reporter dye from it. The release of the reporter dye prevents FRET, allowing emission of the fluorescent signal from the reporter dye, which increases proportionally to the amount of accumulated PCR product. These assays show high specificity as the generation of a fluorescent signal is dependent on the binding of three different nucleotide sequences (forward and reverse primers and probe)(551).

TaqMan® Gene Expression Assays use a TaqMan MGB probe. The TaqMan MGB probe contains a 6-FAM (fluorescein) reporter dye on its 5’ end and both a non-fluorescent quencher and a conjugated minor groove binder (MGB) on its 3’ end.

The use of a non-fluorescent quencher over the more commonly used quencher dye TAMRA increases the sensitivity of the assay by keeping the background fluorescence low, hence increasing signal to noise ratio (552). The MGB serves to enhance the specificity of the assay by increasing the melting temperature of the probe (553). This allows shorter probes to be used, which may be designed for enhanced sequence discrimination. For instance, in multi-exon genes, probes can be designed which span an exon-exon boundary. This ensures that a fluorescent signal will only be generated from correctly spliced templates thereby preventing interference of gene quantification by genomic DNA (552).

205

Real-time qPCR

In contrast to conventional PCR which relies on end-point analysis, real-time qPCR allows the measurement of accumulated products as the PCR reaction progresses by including a fluorescent molecule in the reaction mixture. As the template (cDNA) is simultaneously amplified and quantified in real-time qPCR, measurement can take place in the exponential phase of the reaction rather than in the plateau phase as in conventional PCR, offering a more accurate approach to nucleic acid quantification. During the exponential phase, the reaction is working at maximum efficiency as there are no limitations in terms of enzyme activity and/or substrate availability (e.g. NTPs) and there is an approximate doubling of the PCR product with each cycle. Therefore during the exponential phase, the fluorescent signal is directly proportional to the amount of PCR product and can be used for quantification of the initial starting amount of the target. Other attributes of real-time qPCR, which make it the preferential technology for gene expression studies include its high sensitivity, wide dynamic range and high reproducibility (554).

The amplification plot is central to the analysis of real-time qPCR data, which shows the fluorescent signal (y-axis) against the cycle number (x-axis).

Real-time qPCR amplification plot (555)

206

It is possible to determine the initial starting amount of a target gene through characterisation of the point in time (fractional number of PCR cycles) at which amplification of the target generates enough fluorescent signal to exceed a threshold value. This timepoint is called the Ct (cycle threshold) value and will be inversely proportional to the starting amount, if the threshold is set within the exponential phase (556). Therefore a larger amount of starting template will have a lower Ct value as fewer cycles are required for the fluorescent signal to cross the threshold, and a smaller amount will have a higher Ct value.

The following equation describes amplification in qPCR:

n Xn = X0 (E+1) n = number of cycles

Xn = number of target amplicons at cycle n

X0 = initial number of target molecules E = efficiency of the amplification of the target

If Xn is the amount of amplicon produced at the threshold, n is the number of cycles required for the amplicon to reach this level (i.e. the Ct value for the sample).

Ct Xn = X0 (E+1)

Therefore the initial number of target molecules is:

Ct X0 = Xn/(E+1)

As the threshold is set during the exponential phase where efficiency is theoretically 1, this equation becomes:

Ct X0 = Xn/2

As both the target and reference gene amplification are compared at a constant threshold value, at which they have exactly the same amounts of amplicon (Xn), Xn can be given the value of 1 in the exponential amplification equation:

Ct X0 = 1/2

207

Therefore, the initial number of target molecules is:

-Ct X0 = 2

Relative quantification

Relative quantification is the most commonly used method for quantification using real- time qPCR data, which normalises the target gene expression to the expression of a reference gene. The simplest method of relative quantification is the 2-∆Ct method which uses the following equation:

2−퐶푡(푡푎푟푔푒푡) Ratio (target/reference) = = 2-[Ct(target)-Ct(reference)] 2−퐶푡(푟푒푓푒푟푒푛푐푒)

The reference gene is also used as an endogenous control that controls for sample to sample variations of input quantity and quality (such as the degree of RNA degradation) and reverse transcription efficiency.

208

Appendix 4 Selection of reference gene

The housekeeping genes HMBS, HPRT1, SDHA and TBP were investigated as candidate reference genes. The housekeeping gene 18S was not included due to its well- documented drawbacks and the fact that it was the only gene that did not undergo a preamplification procedure (557). The stability of the four housekeeping genes was initially tested using a ΔCt approach that compared the relative expression of each pair of housekeeping genes. Firstly, the relative expression of each pair of genes was calculated for each sample (ΔCt) followed by calculation of the mean ΔCt and standard deviation for each gene pair between the samples. The mean standard deviation for each housekeeping gene was then calculated.

Comparison of housekeeping genes using ΔCt method

Genes Mean ΔCt Std dev Mean Std dev

HMBS vs HPRT1 2.832 2.201 HMBS vs SDHA -1.433 0.754 HMBS vs TBP 1.003 1.001 1.319

HPRT1 vs HMBS -2.832 2.201 HPRT1 vs SDHA -4.265 2.137 HPRT1 vs TBP -1.829 2.009 2.116

SDHA vs HMBS 1.433 0.754 SDHA vs HPRT1 4.265 2.137 SDHA vs TBP 2.436 0.873 1.255

TBP vs HMBS -1.003 1.001 TBP vs HPRT1 1.829 2.009 TBP vs SDHA -2.436 0.873 1.294

The housekeeping genes HMBS, SDHA and TBP showed relatively stable expression between the samples, as demonstrated by relatively low values for the mean standard deviations of 1.319, 1.255, and 1.294 respectively.

209

The variation in the Ct values of each housekeeping gene between the samples is shown below. These values are given for the recurrent and non-recurrent group, as well as the GS ≥7 and <7 group.

Mean Ct value of each housekeeping gene with standard deviation and percentage coefficient of variation

Gene Mean Ct Std dev %CV

HMBS Total 26.304 1.855 7.054 Recurrent 26.964 2.257 8.372 Non-recurrent 26.002 1.577 6.066 GS ≥7 26.399 1.689 6.397 GS <7 26.163 2.104 8.041

HPRT1 Total 29.135 3.712 12.742 Recurrent 30.614 4.429 14.467 Non-recurrent 28.460 3.164 11.117 GS ≥7 29.157 3.660 12.554 GS <7 29.103 3.858 13.258

SDHA Total 24.871 1.994 8.017 Recurrent 25.215 2.326 9.223 Non-recurrent 24.713 1.829 7.402 GS ≥7 24.895 1.996 8.019 GS <7 24.834 2.028 8.165

TBP Total 27.307 2.122 7.772 Recurrent 27.905 2.389 8.561 Non-recurrent 27.034 1.956 7.236 GS ≥7 27.280 1.975 7.240 GS <7 27.346 2.362 8.638

HMBS is the housekeeping gene which shows the lowest overall variation in Ct value between the samples (%CV of 7.054). For all four of the housekeeping genes, there is little difference between the mean Ct value of the GS ≥7 and <7 group. The difference between the mean Ct value of the recurrent and non-recurrent group is low for HMBS, SDHA and TBP but much higher for HPRT1.

210

Box plots are also included to show the distribution of the Ct values for each housekeeping gene. Box plots are shown for the recurrent/non-recurrent and GS ≥7/<7 groups.

Box plot representation of raw Ct values of the four housekeeping genes in the recurrent and non-recurrent groups are shown as medians (lines), 25th percentile to 75th percentile (boxes) and ranges (whiskers)

211

Box plot representation of raw Ct values of the four housekeeping genes in the GS ≥7 and <7 groups are shown as medians (lines), 25th percentile to 75th percentile (boxes) and ranges (whiskers)

The short boxes and whiskers for HMBS demonstrate its low variability between the samples. Additionally, the median values are at similar levels between the recurrent and non-recurrent group and between the GS ≥7 and <7 group. In comparison, HPRT1 showed much higher variability and had median values at different levels, particularly between the recurrent and non-recurrent group.

212

Appendix 5 Binary logistic regression/Receiver Operating Characteristic (ROC) curves

Binary logistic regression

Binary logistic regression is a statistical method used for the prediction of dichotomous category membership. This method exploits the association between the dependent and independent variables to predict the category membership of the dependent variable with the highest possible accuracy. In this study, the independent (predictor) variables were relative expression of the chosen genes and the dependent variable (binary outcome) was 5-year disease recurrence (0=non-recurrent, 1=recurrent).

Logistic regression estimates a theoretical probability curve from the data, which shows the predicted probability of an individual being in the “1” binary category (recurrent group) as a function of the predictor variables (gene expression). This probability curve is called the logistic regression function and would be a nonlinear function of the predictor variables. Hence based on the level of expression of a particular gene, the logistic regression probability curve could show the probability of the patient being in the recurrent group. Further predictor variables (expression of other genes) could then be added to improve the predictive accuracy of the model.

To understand how logistic regression works, it is important to understand two important measures of likelihood: odds and probability. The odds of an event are the number of ways that an event could occur divided by the number of ways it could fail to occur. The probability is the number of ways that an event could occur divided by the total number of outcomes possible. Probability is related to the odds as follows:

표푑푑푠 Probability (p) = 1+표푑푑푠

In logistic regression, the probability of the outcome is expressed as a function of the logit, which is the ln of the odds:

Logit = ln(odds)

213

푙표푔푖푡 p = 푒 1+푒푙표푔푖푡

The logit is assumed to show a linear relationship with the predictor variables X1, X2...Xp where b0 is the regression constant and b1, b2...bp are the regression coefficients of the predictor variables. This is shown by the logit equation:

Ln(odds) = b0 + b1x1 + b2x2 + ...... + bpxp

The probability can therefore be described by the logistic regression equation:

푒푏0+푏1푋1+푏2푋2+⋯+푏푝푋푝 p = 1+푒푏0+푏1푋1+푏2푋2+⋯+푏푝푋푝

In this study, the logit could be described as the ln(odds) of the patient being in the recurrent group.

Logistic regression estimates the values of the parameters (regression coefficients) in the logit equation using Maximum Likelihood Estimation (MLE), whereby parameters are selected which maximize the probability of the observed data (likelihood). MLE is performed in SPSS by a series of iterations using highly intensive computing algorithms. The parameter estimation in SPSS aims to achieve the greatest likelihood of the observed outcome, i.e. the smallest deviance between the predicted and the observed data. In logistic regression, the ln likelihood rather than the likelihood is used and this value is multiplied by -2 to make calculation easier. Therefore, SPSS aims to achieve the lowest value possible for the -2*ln likelihood (-2LL), which demonstrates a better fit to the data.

It then uses the Likelihood Ratio Test (LRT) to compare the “goodness of fit” of the final model with the null model (constant only model without predictor variables). It separately fits the null model and the final model to the data and calculates a -2LL value for each. Then it calculates a value for the LRT statistic (ΔG2) as follows:

푙𝑖푘푒푙𝑖ℎ표표푑 푓표푟 푛푢푙푙 푚표푑푒푙 ΔG2 = -2*ln 푙𝑖푘푒푙𝑖ℎ표표푑 푓표푟 푓𝑖푛푎푙 푚표푑푒푙

= (-2LL null model) - (-2LL final model)

214

A higher value for ΔG2 is preferable, as it represents a greater improvement in the fit of the final model (in comparison to the null model).

SPSS also calculates a p-value for ΔG2, which was essential to identify whether the improvement in fit obtained by adding the gene(s) to the null model was statistically significant. The ΔG2 follows a chi-squared distribution, with the degrees of freedom equal to the number of predictor variables added to form the final model. A larger value for ΔG2 results in a smaller p-value, which provides evidence for selecting the final model over the null model.

The important parameters of the SPSS logistic regression output are described below, using the gene model AMACR.EFNA5.SDHA as an example.

The beginning of the SPSS output showed the -2LL value for the null model. This was followed by the classification table for the null model.

215

The predictions of the null model were based on the most frequently predicted category in the dataset which was non-recurrent (46 patients out of 67). SPSS categorised all patients as non-recurrent, resulting in an overall accuracy of the null model of 68.7%.

The next table was the Omnibus Tests of Model Coefficients, which gave the results of the LRT, which tested the fit of the final model in comparison to the null model. This included the value for ΔG2 (χ2) and its significance.

Here, the low p-value indicated that the model including predictor variables (genes) was an improvement over the null model for the prediction of recurrence.

The model summary contained the value of the -2LL for the final model (used in the Omnibus Tests of Model Coefficients) and the pseudo R2 values for the final model.

The Nagelkerke R2 is a pseudo R2 statistic, as logistic regression does not have an equivalent to the R2 found in Ordinary Least Squares regression. Although the Nagelkerke R2 has limitations and should not be considered a measure of the goodness- of-fit, it does offer a relative measure of the proportion of variation in outcome that is explained by a model. Here, this value showed that the gene model AMACR.EFNA5.SDHA explained approximately 40% of the variation in recurrence.

The next table showed the results of the Hosmer and Lemeshow Test, another goodness-of-fit test. The p-value for this test should be >0.05 as demonstrated here to

216 indicate good fit to the data. However, the Hosmer and Lemeshow test was not considered useful in this study as it is unreliable with small sample sizes.

SPSS calculated the predicted probability of each patient being in the recurrent group using the logistic regression equation and the expression of the genes included in the models (predictor variables). It then classified patients into either the recurrent or non- recurrent group based on their predicted probability (classified as recurrent if probability was ≥0.5 and non-recurrent if <0.5). The classification table compared the predicted categories with the observed categories.

The example shows that the gene model AMACR.EFNA5.SDHA correctly categorised 80.6% of the patients in this dataset (overall percent accuracy of 80.6%).

In more detail, the table showed that 13 patients with recurrent disease were correctly classified as recurrent (true positives) and 41 patients with non-recurrent disease were correctly classified as non-recurrent (true negatives). In comparison, 5 patients with non-recurrent disease were incorrectly classified as recurrent (false positives) and 8 patients with recurrent disease were incorrectly classified as non-recurrent (false negatives).

217

From these values, the true positive rate (sensitivity) was calculated as 61.9% [true positives/(true positives+false negatives)](61.9%). The true negative rate (specificity) was calculated as 89.1% [true negatives/(true negatives+false positives)].

The overall percent accuracy was calculated as:

(true positives+true negatives)/total * 100

The final table was very important as it showed the variables in the logistic regression equation.

The parameters for the logistic regression equation were shown in the left hand column

B. In this example, the regression constant (b0) was 0.694. The regression coefficient (b1) for the gene AMACR was -0.699, (b2) for the gene EFNA5 was -0.606 and (b3) for the gene SDHA was 2.279. These values showed the increase or decrease in the logit (ln(odds)) of the patient being in the recurrent group for a one-unit increase in gene expression, whilst keeping all other predictors constant. Here, the negative sign of the coefficient for AMACR and EFNA5 showed how much the ln(odds) of the patient being in the recurrent group decreased for a one-unit increase in AMACR and EFNA5 expression respectively. Conversely, the positive sign of the coefficient for SDHA showed how much the ln(odds) of the patient being in the recurrent group increased for a one- unit increase in SDHA expression.

The righthand column of the table Exp(B) gave a value for the antilog of the regression coefficient or (eB). This value is called the odds ratio and is much easier to interpret than the regression coefficient. The odds ratio was the number by which the odds in favour of the patient being in the recurrent group was multiplied for a one-unit increase in

218 expression of the gene. Here the odds ratio for AMACR was 0.497, meaning that the raw odds of the patient being in the recurrent group were approximately halved for a one- unit increase in AMACR expression. In comparison, the odds ratio for SDHA was 9.772, meaning that the odds in favour of the patient being in the recurrent group were multiplied by nearly 10 for a one unit increase in SDHA expression. Therefore, SDHA expression had the greatest effect out of the three predictors on the raw odds of the patient being in the recurrent group.

This table also contained the results of the Wald test which was used to test the significance of each independent variable in the model by testing the null hypothesis that the regression coefficient for the variable was 0. The significance of the Wald test for the independent variables AMACR, EFNA5 and SDHA was 0.002, 0.039 and 0.004 respectively. Therefore, the null hypothesis that the regression coefficient was 0 could be rejected with a 95% confidence level for each gene. In other words, each of the three genes made a significant contribution to the logistic regression model for the prediction of recurrence.

SPSS calculated the predicted probability of each patient being in the recurrent group by inserting the parameters of the logit equation and the expression of the predictor variables into the logistic regression equation.

푒푙표푔푖푡 Predicted probability = 1+푒푙표푔푖푡

푒0.694−0.699퐴푀퐴퐶푅−0.606퐸퐹푁퐴5+2.279푆퐷퐻퐴 = 1+푒0.694−0.699퐴푀퐴퐶푅−0.606퐸퐹푁퐴5+2.279푆퐷퐻퐴

ROC curves

Goodness-of-fit tests measure the agreement between the observed and predicted probabilities, which provides an important measure of calibration of a logistic regression model. In addition to good calibration, a predictive model should exhibit high discriminatory accuracy. The discriminatory accuracy in this study described the ability of the predictive model to distinguish between the recurrent and non-recurrent group. This was tested using ROC curves.

219

An ROC curve is a plot showing the true positive rate (sensitivity) vs false positive rate (1-specificity) at all possible threshold values (cut-off points). The cut-off points in this study were the calculated probabilities of being in the recurrent group. As the ROC curve is a plot of sensitivity against 1-specificity, a good test shows a high sensitivity and low 1-specificity on the ROC plot, hence will show a curved line that covers much of the top left hand corner of the plot. Ideally, the AUC (area under the curve) should be as near to 1 as possible to reflect the greatest discriminatory ability of a model over all cut-off points.

The output in SPSS showed the ROC plot and the value for the AUC. Importantly, it provided a list of coordinates of the curve showing the sensitivity and 1-specificity at various cut-off points (c-values), which allowed determination of an optimal cut-off point (c*). At c*, the model had optimal differentiating ability with equal weight attributed to sensitivity and specificity.

Patients could then be assigned to the recurrent group if they had a value for the predicted probability above this optimal cut-off point (c*), and to the non-recurrent group if the predicted probability was below this point.

220

Appendix 6 Arrays for method development and testing the control cDNA

The array used in the method development differed from the array used for testing the control cDNA by only one gene (ERBB2 rather than LEMD as shown).

Gene name Alternative Full name ABI assay ID number name(s) ABCA2 ABC2 ATP binding cassette subfamily A Hs01096987_g1 member 2 MTHFR methylenetetrahydrofolate Hs01114482_m1 reductase TYMP ECGF1 thymidine phosphorylase Hs01034314_g1 MNGIE UGT1A6 GNT1 UDP glucuronosyltransferase family 1 Hs00153559_m1 HLUGP member A6 UGT1F UGT2B7 UGT2B9 UDP glucuronosyltransferase family 2 Hs02556232_s1 member B7 SLC28A2 CNT2 solute carrier family 28 member 2 Hs00188407_m1 HCNT2 HsT17153 SPNT1 CADPS2 CAPS2 calcium dependent secretion Hs01095965_m1 activator 2 SCUBE2 Cegb1 signal peptide, CUB domain and EGF Hs00221277_m1 Cegf1 like domain containing 2 CYP19A1 ARO cytochrome P450 family 19 subfamily Hs00903411_m1 ARO1 A member 1 CPV1 CYAR P-450AROM SLC28A1 CNT1 solute carrier family 28 member 1 Hs00984391_m1 18S rRNA 18S ribosomal RNA Hs99999901_s1 SLC28A3 CNT3 solute carrier family 28 member 3 Hs00910437_m1 DCK deoxycytidine kinase Hs00176127_m1 CASP1 ICE caspase 1 Hs01554467_m1 IL1BC BAG1 BCL2 associated athanogene 1 Hs00185390_m1 DAPK3 ZIP death associated protein kinase 3 Hs00154676_m1 ZIPK BCL2L11 BIM BCL2 like 11 Hs00375807_m1 BOD BNIP3 Nip3 BCL2 interacting protein 3 Hs00969293_mH GADD45B GADD45BETA growth arrest and DNA damage Hs01548007_g1 MYD118 inducible beta RAD23B HHR23B RAD23 homolog B, nucleotide Hs01011338_g1 HR23B excision repair protein P58 POLD1 CDC2 DNA polymerase delta 1, catalytic Hs01100833_g1 POLD subunit POLE POLE1 DNA polymerase epsilon, catalytic Hs00173030_m1 subunit POLG POLG1 DNA polymerase gamma, catalytic Hs01018681_m1 POLGA subunit

221

POLH RAD30A DNA polymerase eta Hs00197814_m1 XP-V RPA1 HSSB replication protein A1 Hs01556867_m1 REPA1 RF-A RP-A RPA70 WRN RECQ3 Werner syndrome RecQ like helicase Hs01087898_m1 RECQL2 ERCC4 FANCQ ERCC excision repair 4, endonuclease Hs01063530_m1 RAD1 catalytic subunit ATM ATA ATM serine/threonine kinase Hs01112306_g1 ATC ATD ATDC TEL1 TELO1 MYC c-Myc MYC proto-oncogene, bHLH Hs01562521_m1 MYCC transcription factor TGFA transforming growth factor alpha Hs00608187_m1 ESR1 ESR estrogen receptor 1 Hs00174860_m1 Era NR3A1 ESR2 ER-beta estrogen receptor 2 Hs00230957_m1 Erb NR3A2 PGRMC1 HPR6.6 progesterone receptor membrane Hs00198499_m1 component 1 CDKN1B KIP1 cyclin dependent kinase inhibitor 1B Hs00153277_m1 P27KIP1 TUBA1A B-ALPHA-1 tubulin alpha 1a Hs00362387_m1 TUBA3 CCNB1 CCNB cyclin B1 Hs01030101_g1 S100P S100 calcium binding protein P Hs00195584_m1 AURKA AIK aurora kinase A Hs00269212_m1 ARK1 AurA BTAK PPP1R47 STK15 STK6 STK7 MYB c-myb MYB proto-oncogene, transcription Hs00920554_m1 factor GRB7 growth factor receptor bound Hs00180450_m1 protein 7 LEMD2 NET25 LEM domain containing 2 Hs00397754_m1 (control dJ482C21.1 cDNA test) ERBB2 (method HER-2 erb-b2 receptor tyrosine kinase 2 Hs01001598_g1 development) HER2 NEU ALOX5 5-LOX arachidonate 5-lipoxygenase Hs00167536_m1 PTGS2 COX2 prostaglandin-endoperoxide Hs00153133_m1 synthase 2 IGFBP3 BP-53 insulin like growth factor binding Hs00181211_m1 IBP3 protein 3 MAPT DDPAC microtubule associated protein tau Hs00213487_m1 MAPTL MSTD MTBT1

222

MTBT2 PPND TAU IPC IPC XenoRNA XenoRNA HMBS PBGD hydroxymethylbilane synthase Hs00609297_m1 PORC UPS

223

Appendix 7 Patient demographics

The following table shows detailed information about the final 67 patients including clinicopathological parameters, treatments and follow-up data.

Abbreviations

AA antiandrogen BCR biochemical recurrence DOB date of birth GS Gleason score LHRH luteinizing hormone-releasing hormone agonist mets metastatic disease PCa prostate cancer RT radiotherapy RP radical prostatectomy TURP transurethral resection of the prostate Tx treatment

224

5

RP

10

-

-

T

year

R

-

DOB

Stage

RP GS RP

5

Patient

Sample

Year0

OtherTx

Ethnicity

Year5

(2016/17) GS Biopsy

Date of Date

PrePSA Tx

alive/dead alive/dead

recurrence

Hormone Tx Hormone

Date of death of Date Causedeath of

1998 and AA 2010-13, 1 14/02/31 1998 abiraterone 2016 1 BCR mets alive 27 6 biopsy LHRH 2005, AA and LHRH BCR and 2 26/11/24 C 03/05/05 2009, LHRH 2010/11 1 mets dead 30/10/2014 PCa 0.5 6 8 biopsy LHRH 2002, salvage AA 2005, zoledronic acid 2008-10, BCR and 3 28/09/32 2002 and 11/07 LHRH, stilboestrol 10/07 chemotherapy 10/08 1 mets dead 20/10/2010 PCa 11.1 T2 7 biopsy salvage RT 5 to 4 12/01/42 03/08/03 7/2005 AA 2/05, LHRH 3/05 1 BCR alive 40 8(TURP) 9 TURP 5 04/07/42 08/09/03 0 0 0 alive 8.9 4 biopsy salvage RT 7/07 6 28/07/34 02/12/03 to 8/07 1 BCR alive 7.6 T3 5 6 biopsy LHRH and AA 3/05 after mets, LHRH 3/2014, LHRH BCR and 7 07/09/36 05/01/04 and AA 2/2015 1 mets alive 7.6 T3a 8 5 biopsy 8 28/03/46 13/04/04 0 0 0 alive 4.5 pT2 6 RP AA 2005, salvage LHRH 8/10, 10 20/06/37 C 03/08/04 salvage AA 6/13 0 BCR alive 7.6 6 7 biopsy BCR and 11 19/11/47 C 24/05/04 ?2014 1 mets alive 6 RP 12 04/12/37 25/05/04 0 0 0 alive 10.3 T1c 6 RP 13 10/02/36 12/10/04 0 0 0 alive 13.2 5 biopsy 14 04/08/29 27/07/04 1 BCR alive 10.8 5 RP 15 30/07/47 23/11/04 3/05 0 0 0 alive 8.6 6 7 biopsy other 16 07/11/40 22/11/04 0 0 dead 23/08/2012 causes 2.4 pT2 5 biopsy 17 26/02/43 20/12/04 salvage RT 9/06 1 BCR alive 5.5 7 biopsy salvage AA 7/06, LHRH 12/09 to 2014 after mets, LHRH BCR and 19 28/09/48 09/11/04 2015 1 mets alive 2.9 T3b 7 RP 20 31/01/50 C 12/09/05 0 0 0 alive 4.9 tert 4 6 5 biopsy 225

22 14/11/43 C 31/01/05 0 0 0 alive 5.5 7 RP other 23 17/05/35 12/07 LHRH 2007 0 0 dead 7/01/2013 causes 6.8 7 biopsy 25 24/03/46 08/02/05 ?5/06 ?AA 11/05 1 BCR ? 17 7 RP 26 09/10/46 28/02/05 0 0 0 alive 7 pT2c 5 RP other 27 16/04/59 03/05/05 0 0 dead 30/03/2013 causes 10 7 RP 28 24/04/46 17/05/05 9/05 to 10/05 0 0 ? 8.7 T4 7 RP 30 29/12/33 C finished 8/06 LHRH finished early 2007 0 0 0 alive 8 8 biopsy 31 02/10/35 C started 8/06 started 3/06, LHRH 2014/5 1 BCR alive 20.3 8 biopsy 35 11/08/54 22/05/06 0 0 0 alive 4.6 T2b 6 7 biopsy 37 26/09/32 3/08 to 4/08 8/07 0 0 BCR alive 24 6 biopsy 38 08/11/31 2/08 to 3/08 LHRH 7/07 to 8/08 0 0 alive 18 7 biopsy other 39 05/11/35 11/07 to 11/07 LHRH, AA 9/07 0 0 dead 29/12/2014 causes 19.3 6 biopsy 40 06/08/41 C summer 2010 0 0 alive 25.3 6 biopsy other 42 15/05/40 C 02/01/07 0 0 0 dead 3/04/2017 causes 6.2 pT3 6 biopsy other 43 23/08/35 C 4/07 to 5/07 AA, LHRH 10/06 to 8/07 0 0 dead 22/09/2013 causes 8.9 7 biopsy 44 23/06/40 C 6/07 AA, LHRH 12/06 0 0 0 alive 2.3 7 biopsy 46 08/11/40 C 3/07 to 5/07 LHRH 10/06 to 10/07 0 0 0 alive 10.2 6 biopsy 50 07/09/48 C finished 6/07 1/07 to 1/08 0 0 0 alive 3.6 6 biopsy 51 25/12/39 C 2008 0 0 alive 40.6 7 biopsy 52 24/06/36 2008 1 BCR alive 11.2 7 biopsy LHRH 2/13, AA and LHRH 53 13/07/39 6/08 2014 1 BCR alive 19.3 6 biopsy

54 16/07/34 C 7/07 to 9/07 0 0 0 alive 7.9 8(TURP) TURP AA 2007/AA since 2011, 55 23/05/48 C 25/06/07 10/07 to 11/07 enzalutamide for bony mets 1 BCR mets alive 27/02/2017 ? 2.6 pT4 9 biopsy

56 03/10/37 C 2/11 LHRH 1/07 and 2011 0 0 alive 20.4 7(TURP) TURP 57 26/12/39 C finished 5/08 9/07 0 0 alive 13.3 6 biopsy 226

58 18/09/45 C 12/06/07 9/07 to 11/07 0 0 0 alive 12.8 pT2c 5 biopsy 60 23/05/41 C 11/06/07 11/07 to 1/08 LHRH started 08/07 0 0 alive 0.4 9 biopsy 61 10/01/45 17/07/07 1 BCR alive 11 T2c 6 biopsy 62 11/11/46 06/08/07 1/08 to 2/08 LHRH 2008 0 0 0 alive 13.7 8 biopsy 65 12/11/44 8/07 0 0 0 alive 7.3 6 biopsy 67 07/04/31 finished 10/07 LHRH finished 5/08 0 0 alive 39 T2 6 biopsy 68 07/11/41 2008 LHRH 2008 0 0 alive 18 7 biopsy 69 10/10/34 finished 2/08 LHRH 2007 and 11/11 1 BCR alive 24 7 biopsy 72 12/02/35 4/08 to 5/08 0 0 BCR alive 10 7 biopsy 75 19/05/48 11/07 to 1/08 7/07 0 0 BCR alive 88 7 biopsy 76 24/10/40 16/10/07 12/11 1 BCR alive 6.6 7 biopsy 77 01/05/37 early 2008 restarted LHRH 2016 0 0 BCR alive 6 7 biopsy 78 11/09/29 1/08 to 2/08 2/08 0 0 BCR alive 12 8 biopsy 80 20/09/47 1/08 to 3/08 LHRH 10/07 0 0 ? 7.5 7 biopsy 85 19/03/38 10/09/07 2008 LHRH 2008 and 2/16 1 BCR mets alive 8.1 T2c 8 8 RP 88 21/10/37 10/08 0 0 alive 7.6 6 biopsy other 89 12/03/30 3/08 LHRH 11/07 0 0 dead 3/02/2017 causes 10.1 T1 8 biopsy 91 07/12/33 3/08 LHRH 2007 0 0 alive 18.9 9 biopsy other 92 07/01/37 11/07 LHRH 11/07 0 0 dead 13/12/2014 causes 9.6 7 biopsy LHRH 07 and 2012 LHRH 93 04/02/42 finished 4/08 4/14 chemotherapy 2015 1 mets BCR dead 6/11/2015 PCa 6 7 biopsy 94 24/04/34 2008 LHRH 2008 0 0 alive 9.5 8 biopsy AA and LHRH 2012, then 97 14/04/37 finished 4/08 intermittent LHRH 1 BCR alive 7.2 7 biopsy 99 22/10/42 27/11/07 0 0 ? 7.4 6 RP 101 27/12/36 11/01/05 0 0 0 ? 4.8 6 biopsy

227

Appendix 8 Electronic appendix

• Documents for gaining new ethical permissions

• Parameters from standard curves and replicate plates for all genes

• Full logistic regression dataset for BB study including correlation analysis

• Comparison of Ct values between non-preamplified and preamplified samples

• Logistic regression models of recurrence (SL study)

228

Appendix 9 Form UPR16

229

Appendix 10 Dissemination list

Bickers B, Aukim-Hastie C. New molecular biomarkers for the prognosis and management of prostate cancer – the post PSA era. Anticancer Res. 2009; 29(8):3289- 98.

Bickers B, Aukim-Hastie C. New molecular biomarkers for the prognosis and management of prostate cancer; beyond the prostate-specific antigen test – Oncology News. May/June 2009.

Larkin S, Holmes S, Cree I, Walker T, Basketter V, Bickers B et al. Identification of markers of prostate cancer progression using candidate gene expression. Br J Cancer. 2012; 106(1):157-65

Larkin S, Zeidan B, Taylor M, Bickers B, Al-Ruwaili J, Aukim-Hastie C et al. Proteomics in prostate cancer biomarker discovery. Expert Rev Proteomics. 2010; 7(1):93-102.

230

Bibliography

1. Cancer Research UK. Prostate Cancer Statistics [Internet]. London: Cancer Research UK; 2017 [cited 2018]. Available from: http://www.cancerresearchuk.org/health-professional/cancer- statistics/statistics-by-cancer-type/prostate-cancer#heading-Zero. 2. Cancer Research UK. Cancer Incidence for Common Cancers [Internet]. London: Cancer Research UK; 2018 [cited 2018]. Available from: http://www.cancerresearchuk.org/health-professional/cancer- statistics/incidence/common-cancers-compared#heading-Zero. 3. Gann P. Risk factors for prostate cancer. Rev Urol. 2002;4(Suppl 5):S3–S10. 4. Li J, Djenaba J, Soman A, Rim S, Master V. Recent trends in prostate cancer incidence by age, cancer stage, and grade, the United States, 2001-2007. Prostate Cancer. 2012;2012:691380. 5. Bray F, Lortet-Tieulent J, Ferlay J, Forman D, Auvinen A. Prostate cancer incidence and mortality trends in 37 European countries: an overview. Eur J Cancer. 2010;46(17):3040-52. 6. Quaresma M, Coleman M, Rachet B. 40-year trends in an index of survival for all cancers combined and survival adjusted for age and sex for each cancer in England and Wales, 1971-2011: a population-based study. Lancet. 2015;385(9974):1206-18. 7. Cancer Research UK. Prostate Cancer Incidence by Age [Internet]. London: Cancer Research UK; 2018 [cited 2018]. Available from: http://www.cancerresearchuk.org/health-professional/cancer- statistics/statistics-by-cancer-type/prostate-cancer/incidence#heading-One. 8. Draisma G, Boer R, Otto S, van der Cruijsen I, Damhuis R, Schroder F, et al. Lead times and overdetection due to prostate-specific antigen screening: estimates from the European Randomized Study of Screening for Prostate Cancer. J Natl Cancer Inst. 2003;95(12):868-78. 9. Cancer Research UK. Prostate Cancer Mortality Statistics [Internet]. London: Cancer Research UK; 2017 [cited 2018]. Available from: http://www.cancerresearchuk.org/health-professional/cancer- statistics/statistics-by-cancer-type/prostate-cancer/mortality#ref-2. 10. Zeegers M, Jellema A, Ostrer H. Empiric risk of prostate carcinoma for relatives of patients with prostate carcinoma: a meta-analysis. Cancer. 2003;97(8):1894-903. 11. Ferlay J, Soerjomataram I, Dikshit R, Eser S, Mathers C, Rebelo M, et al. Cancer incidence and mortality worldwide: sources, methods and major patterns in GLOBOCAN 2012. Int J Cancer. 2015;136(5):E359-86. 12. International Agency for Research on Cancer WHO. GLOBOCAN 2018 Age standardized (World) incidence rates, prostate, all ages [Internet]. Lyon: International Agency for Research on Cancer, World Health Organization; 2018 [cited 2018]. Available from: http://gco.iarc.fr/today/data/factsheets/cancers/27-Prostate-fact-sheet.pdf. 13. Kheirandish P, Chinegwundoh F. Ethnic differences in prostate cancer. Br J Cancer. 2011;105(4):481–5. 14. Ben-Shlomo Y, Evans S, Ibrahim F, Patel B, Anson K, Chinegwundoh F, et al. The risk of prostate cancer amongst black men in the United Kingdom: the PROCESS cohort study. Eur Urol 2008;53(1):99- 105. 15. Taksler G, Keating N, Cutler D. Explaining racial differences in prostate cancer mortality. Cancer. 2012;118(17):4280-9. 16. Faisal F, Sundi D, Cooper J, Humphreys E, Partin A, Han M, et al. Racial disparities in oncologic outcomes after radical prostatectomy: long-term follow-up. Urology. 2014;84(6):1434-41. 17. Aune D, Navarro Rosenblatt D, Chan D, Vieira A, Vieira R, Greenwood D, et al. Dairy products, calcium, and prostate cancer risk: a systematic review and meta-analysis of cohort studies. Am J Clin Nutr. 2015;101(1):87-117. 18. Zu K, Mucci L, Rosner B, Clinton S, Loda M, Stampfer M, et al. Dietary lycopene, angiogenesis, and prostate cancer: a prospective study in the prostate-specific antigen era. J Natl Cancer Inst. 2014;106(2):djt430. 19. Joshi A, Corral R, Catsburg C, Lewinger J, Koo J, John E, et al. Red meat and poultry, cooking practices, genetic susceptibility and risk of prostate cancer: results from a multiethnic case-control study. Carcinogenesis. 2012;33(11):2108-18. 20. Chen P, Zhang W, Wang X, Zhao K, Negi D, Zhuo L, et al. Lycopene and risk of prostate cancer: a systematic review and meta-analysis. Medicine (Baltimore). 2015;94(33):e1260.

231

21. Alshahrani S, McGill J, Agarwal A. Prostatitis and male infertility. J Reprod Immunol. 2013;100(1):30-6. 22. Macmillan Cancer Support. The Prostate Gland [Internet]. London: Macmillan Cancer Support; 2015 [cited 2018]. Available from: http://www.macmillan.org.uk/Cancerinformation/Cancertypes/Prostate/Aboutprostatecancer/Theprost ate.aspx. 23. McNeal J. The zonal anatomy of the prostate. Prostate. 1981;2(1):35-49. 24. Carter H, Couzens G. The Whole Life Prostate Book [Internet]. New York: Free Press; 2012. Available from: http://wholelifeprostate.com/prostate.html. 25. Mescher A. Junqueira's Basic Histology: Text and Atlas. 12 ed. New York: McGraw-Hill Medical; 2010. 26. WebPathology. High Quality of Pathology Images of Benign and Malignant Neoplasms and Related Entities [Internet]. Richmond: WebPathology; 2018 [cited 2018]. Available from: http://www.webpathology.com/slides-13/slides/001.%20Prostate_NormalGlands.jpg. 27. Oldridge E, Pellacani D, Collins A, Maitland N. Prostate cancer stem cells: Are they androgen- responsive? Mol Cell Endocrinol. 2012;360(1-2):14-24. 28. Letellier G, Perez M, Yacoub M, Levillain P, Cussenot O, Fromont G. Epithelial phenotypes in the developing human prostate. J Histochem Cytochem. 2007;55(9):885–90. 29. Hudson D, Guy A, Fry P, O’Hare M, Watt F, Masters J. Epithelial cell differentiation pathways in the human prostate: identification of intermediate phenotypes by keratin expression. J Histochem Cytochem. 2001;49(2):271–8. 30. Bonkhoff H, Remberger K. Widespread distribution of nuclear androgen receptors in the basal cell layer of the normal and hyperplastic human prostate. Virchows Arch A Pathol Anat Histopathol. 1993;422(1):35-8. 31. Hudson D, O'Hare M, Watt F, Masters J. Proliferative heterogeneity in the human prostate: evidence for epithelial stem cells. Lab invest. 2000;80(8):1243-50. 32. van Leenders G, Dijkman H, Hulsbergen-van de Kaa C, Ruiter D, Schalken J. Demonstration of intermediate cells during human prostate epithelial differentiation in situ and in vitro using triple-staining confocal scanning microscopy. Lab invest. 2000;80(8):1251-8. 33. English H, Santen R, Isaacs J. Response of glandular versus basal rat ventral prostatic epithelial cells to androgen withdrawal and replacement. Prostate. 1987;11(3):229-42. 34. Cunha G, Hayward S, Dahiya R, Foster B. Smooth muscle-epithelial interactions in normal and neoplastic prostatic development. Acta Anat (Basel). 1996;155(1):63-72. 35. Cunha G, Cooke P, Kurita T. Role of stromal-epithelial interactions in hormonal responses. Arch Histol Cytol. 2004;67(5):417-34. 36. Abrahamsson P, Wadstrom L, Alumets J, Falkmer S, Grimelius L. Peptide-hormone and serotonin- immunoreactive cells in normal and hyperplastic prostate glands. Pathol Res Pract. 1986;181(6):675-83. 37. Bonkhoff H, Stein U, Remberger K. Endocrine-paracrine cell types in the prostate and prostatic adenocarcinoma are postmitotic cells. Hum Pathol. 1995;26(2):167-70. 38. Dehm S, Tindall D. Molecular regulation of androgen action in prostate cancer. J Cell Biochem. 2006;99(2):333-44. 39. Nantermet P, Xu J, Yu Y, Hodor P, Holder D, Adamski S, et al. Identification of genetic pathways activated by the androgen receptor during the induction of proliferation in the ventral prostate gland. J Biol Chem. 2004;279(2):1310-22. 40. Nelson P, Clegg N, Arnold H, Ferguson C, Bonham M, White J, et al. The program of androgen- responsive genes in neoplastic prostate epithelium. Proc Natl Acad Sci USA. 2002;99(18):11890–5. 41. Love H, Booton S, Boone B, Breyer J, Koyama T, Revelo M, et al. Androgen regulated genes in human prostate xenografts in mice: relation to BPH and prostate cancer. PLoS One. 2009;4(12):e8384. 42. Kapoor A. Benign prostatic hyperplasia (BPH) management in the primary care setting. Can J Urol. 2012;19(Suppl 1):10-7. 43. Nicholson T, Ricke W. Androgens and estrogens in benign prostatic hyperplasia: past, present and future. Differentiation. 2011;82(4-5):184-99. 44. Shannon B, McNeal J, Cohen R. Transition zone carcinoma of the prostate gland: a common indolent tumour type that occasionally manifests aggressive behaviour. Pathology. 2003;35(6):467-71. 45. Schenk J, Kristal A, Arnold K, Tangen C, Neuhouser M, Lin D, et al. Association of symptomatic benign prostatic hyperplasia and prostate cancer: results from the prostate cancer prevention trial. Am J Epidemiol. 2011;173(12):1419-28.

232

46. Abdel-Khalek M, El-Baz M, Ibrahiem e-H. Predictors of prostate cancer on extended biopsy in patients with high-grade prostatic intraepithelial neoplasia: a multivariate analysis model. BJU Int. 2004;94(4):528-33. 47. Eminaga O, Hinkelammert R, Abbas M, Titze U, Eltze E, Bettendorf O, et al. High-grade prostatic intraepithelial neoplasia (HGPIN) and topographical distribution in 1,374 prostatectomy specimens: existence of HGPIN near prostate cancer. Prostate. 2013;73(10):1115-22. 48. De Nunzio C, Albisinni S, Cicione A, Gacci M, Leonardo C, Esperto F, et al. Widespread high grade prostatic intraepithelial neoplasia on biopsy predicts the risk of prostate cancer: a 12 months analysis after three consecutive prostate biopsies. Arch Ital Urol Androl. 2013;85(2):59-64. 49. Jung S, Shin S, Kim M, Baek I, Lee J, Lee S, et al. Genetic progression of high grade prostatic intraepithelial neoplasia to prostate cancer. Eur Urol. 2016;69(5):823-30. 50. Iremashvili V, Pelaez L, Jorda M, Manoharan M, Rosenberg D, Soloway M. Prostate cancers of different zonal origin: clinicopathological characteristics and biochemical outcome after radical prostatectomy. Urology. 2012;80(5):1063-9. 51. Cohen R, Shannon B, Phillips M, Moorin R, Wheeler T, Garrett K. Central zone carcinoma of the prostate gland: a distinct tumor type with poor prognostic features. J Urol. 2008;179(5):1762-7. 52. Greenberg D, Wright K, Lophathanon A, Muir K, Gnanapragasam V. Changing presentation of prostate cancer in a UK population--10 year trends in prostate cancer risk profiles in the East of England. Br J Cancer. 2013;109(8):2115-20. 53. Siddiqui E, Mumtaz F, Gelister J. Understanding prostate cancer. J R Soc Promot Health. 2004;124(5):219-21. 54. Gandaglia G, Abdollah F, Schiffmann J, Trudeau V, Shariat S, Kim S, et al. Distribution of metastatic sites in patients with prostate cancer: a population-based analysis. Prostate. 2014;74(2):210-6. 55. National Cancer Registration and Analysis Service. Stage Breakdown by CCG 2016 [Internet]. London: Public Health England; 2018 [cited 2018]. Available from: http://www.ncin.org.uk/publications/survival_by_stage. 56. Lee S, Cho K. Multimodal therapy for locally advanced prostate cancer: the roles of radiotherapy, androgen deprivation therapy, and their combination. Radiat Oncol J. 2017;35(3):189-97. 57. Wilt T, Brawer M, Jones K, Barry M, Aronson W, Fox S, et al. Radical prostatectomy versus observation for localized prostate cancer. N Engl J Med. 2012;367(3):203-13. 58. Resnick M, Koyama T, Fan K, Albertsen P, Goodman M, Hamilton A, et al. Long-term functional outcomes after treatment for localized prostate cancer. N Engl J Med. 2013;368(5):436-45. 59. Chatterjee B. The role of the androgen receptor in the development of prostatic hyperplasia and prostate cancer. Mol Cell Biochem. 2003;253(1-2):89-101. 60. Kahn B, Collazo J, Kyprianou N. Androgen receptor as a driver of therapeutic resistance in advanced prostate cancer. Int J Biol Sci. 2014;10(6):588-95. 61. Gonzalez B, Jim H, Donovan K, Small B, Sutton S, Park J, et al. Course and moderators of hot flash interference during androgen deprivation therapy for prostate cancer: a matched comparison. J Urol. 2015;194(3):690-5. 62. Smith M, Lee W, Brandman J, Wang Q, Botteman M, Pashos C. Gonadotropin-releasing hormone agonists and fracture risk: a claims-based cohort study of men with nonmetastatic prostate cancer. J Clin Oncol. 2005;23(31):7897-903. 63. Bosco C, Crawley D, Adolfsson J, Rudman S, Van Hemelrijck M. Quantifying the evidence for the risk of metabolic syndrome and its components following androgen deprivation therapy for prostate cancer: a meta-analysis. PLoS One. 2015;10(3):e0117344. 64. Nguyen Q, Levy L, Lee A, Choi S, Frank S, Pugh T, et al. Long-term outcomes for men with high- risk prostate cancer treated definitively with external beam radiotherapy with or without androgen deprivation. Cancer. 2013;119(18):3265-71. 65. Bekelman J, Mitra N, Handorf E, Uzzo R, Hahn S, Polsky D, et al. Effectiveness of androgen- deprivation therapy and radiotherapy for older men with locally advanced prostate cancer. J Clin Oncol. 2015;33(7):716-22. 66. Lawton C, Winter K, Grignon D, Pilepich M. Androgen suppression plus radiation versus radiation alone for patients with stage D1/pathologic node-positive adenocarcinoma of the prostate: updated results based on national prospective randomized trial Radiation Therapy Oncology Group 85-31. J Clin Oncol. 2005;23(4):800-7. 67. Salonen A, Taari K, Ala-Opas M, Sankila A, Viitanen J, Lundstedt S, et al. Comparison of intermittent and continuous androgen deprivation and quality of life between patients with locally

233 advanced and patients with metastatic prostate cancer: a post hoc analysis of the randomized FinnProstate Study VII. Scand J Urol. 2014;48(6):513-22. 68. Sharifi N, Dahut W, Steinberg S, Figg W, Tarassoff C, Arlen P, et al. A retrospective study of the time to clinical endpoints for advanced prostate cancer. BJU Int. 2005;96(7):985-9. 69. Smaletz O, Scher H, Small E, Verbel D, McMillan A, Regan K, et al. Nomogram for overall survival of patients with progressive metastatic prostate cancer after castration. J Clin Oncol. 2002;20(19):3972- 82. 70. de Bono J, Logothetis C, Molina A, Fizazi K, North S, Chu L, et al. Abiraterone and increased survival in metastatic prostate cancer. N Engl J Med. 2011;364(21):1995-2005. 71. Zielinski R, Azad A, Chi K, Tyldesely S. Population-based impact on overall survival after the introduction of docetaxel as standard therapy for metastatic castration resistant prostate cancer. Can Urol Assoc J. 2014;8(7-8):E520-3. 72. Thompson I, Ankerst D, Chi C, Goodman P, Tangen C, Lucia M, et al. Assessing prostate cancer risk: results from the Prostate Cancer Prevention Trial. J Natl Cancer Inst. 2006;98(8):529-34. 73. Risko R, Merdan S, Womble P, Barnett C, Ye Z, Linsell S, et al. Clinical predictors and recommendations for staging computed tomography scan among men with prostate cancer. Urology. 2014;84(6):1329-34. 74. Crawford E, DeAntoni E, Etzioni R, Schaefer V, Olson R, Ross C. Serum prostate-specific antigen and digital rectal examination for early detection of prostate cancer in a national community-based program. The Prostate Cancer Education Council. Urology. 1996;47(6):863-9. 75. Espaldon R, Kirby K, Fung K, Hoffman R, Powell A, Freedland S, et al. Probability of an abnormal screening prostate-specific antigen result based on age, race, and prostate-specific antigen threshold. Urology. 2014;83(3):599-605. 76. Sarwar S, Adil M, Nyamath P, Ishaq M. Biomarkers of prostatic cancer: an attempt to categorize patients into prostatic carcinoma, benign prostatic hyperplasia, or prostatitis based on serum prostate specific antigen, prostatic acid phosphatase, calcium, and phosphorus. Prostate Cancer. 2017;2017:5687212. 77. Schröder F, Denis L, Roobol M. Epilogue: different approaches for prostate cancer screening in the EU? Eur J Cancer. 2010;46(17):3120-5. 78. Lojanapiwat B, Anutrakulchai W, Chongruksut W, Udomphot C. Correlation and diagnostic performance of the prostate-specific antigen level with the diagnosis, aggressiveness, and bone metastasis of prostate cancer in clinical practice. Prostate Int. 2014;2(3):133-9. 79. Thompson I, Pauler D, Goodman P, Tangen C, Lucia M, Parnes H, et al. Prevalence of prostate cancer among men with a prostate-specific antigen level < or =4.0 ng per milliliter. N Engl J Med. 2004;350(22):2239-46. 80. Cookson M, Aus G, Burnett A, Canby-Hagino E, D'Amico A, Dmochowski R, et al. Variation in the definition of biochemical recurrence in patients treated for localized prostate cancer: the American Urological Association prostate guidelines for localized prostate cancer update panel report and recommendations for a standard in the reporting of surgical outcomes. J Urol. 2007;177(2):540-5. 81. Boorjian S, Thompson R, Tollefson M, Rangel L, Bergstralh E, Blute M, et al. Long-term risk of clinical progression after biochemical recurrence following radical prostatectomy: the impact of time from surgery to recurrence. Eur Urol. 2011;59(6):893-9. 82. Brockman J, Alanee S, Vickers A, Scardino P, Wood D, Kibel A, et al. Nomogram predicting prostate cancer-specific mortality for men with biochemical recurrence after radical prostatectomy. Eur Urol. 2015;67(6):1160-7. 83. Zumsteg Z, Spratt D, Romesser P, Pei X, Zhang Z, Polkinghorn W, et al. The natural history and predictors of outcome following biochemical relapse in the dose escalation era for prostate cancer patients undergoing definitive external beam radiotherapy. Eur Urol. 2015;67(6):1009-16. 84. Gleason D. Classification of prostatic carcinomas. Cancer Chemother Rep. 1966;50(3):125–8. 85. Buyyounouski M, Choyke P, McKenney J, Sartor O, Sandler H, Amin M, et al. Prostate cancer - major changes in the American Joint Committee on Cancer eighth edition cancer staging manual. CA Cancer J Clin. 2017;67(3):245-53. 86. Helpap B, Egevad L. Modified Gleason grading. An updated review. Histol Histopathol. 2009;24(5):661–6. 87. Epstein J, Allsbrook Jr W, Amin M, Egevad L, Committee IG. The 2005 International Society of Urological Pathology (ISUP) consensus conference on Gleason grading of prostatic carcinoma. Am J Surg Pathol. 2005;29(9):1228-42.

234

88. Freedland S, Hotaling J, Fitzsimons N, Presti J, Kane C, Terris M, et al. PSA in the new millennium: a powerful predictor of prostate cancer prognosis and radical prostatectomy outcomes--results from the SEARCH database. Eur Urol. 2008;53(4):758-64. 89. Klotz L, Vesprini D, Sethukavalan P, Jethava V, Zhang L, Jain S, et al. Long-term follow-up of a large active surveillance cohort of patients with prostate cancer. J Clin Oncol. 2015;33(3):272-7. 90. Gnanapragasam V, Lophatananon A, Wright K, Muir K, Gavin A, Greenberg D. Improving clinical risk stratification at diagnosis in primary prostate cancer: a prognostic modelling study. PLoS Med. 2016;13(8):e1002063. 91. Hamdy F, Donovan J, Lane J, Mason M, Metcalfe C, Holding P, et al. 10-year outcomes after monitoring, surgery, or radiotherapy for localized prostate cancer. N Engl J Med. 2016;375(15):1415-24. 92. D'Amico A, Whittington R, Malkowicz S, Schultz D, Blank K, Broderick G, et al. Biochemical outcome after radical prostatectomy, external beam radiation therapy, or interstitial radiation therapy for clinically localized prostate cancer. JAMA. 1998;280(11):969-74. 93. Boorjian S, Karnes R, Rangel L, Bergstralh E, Blute M. Mayo Clinic validation of the D'amico risk group classification for predicting survival following radical prostatectomy. J Urol. 2008;179(4):1354-60. 94. Beauval J, Ploussard G, Cabarrou B, Roumiguie M, Ouzzane A, Gas J, et al. Improved decision making in intermediate-risk prostate cancer: a multicenter study on pathologic and oncologic outcomes after radical prostatectomy. World J Urol. 2017;35(8):1191-7. 95. Stephenson A, Scardino P, Eastham J, Bianco F, Dotan Z, Fearn P, et al. Preoperative nomogram predicting the 10-year probability of prostate cancer recurrence after radical prostatectomy. J Natl Cancer Inst. 2006;98(10):715-7. 96. Cooperberg M, Pasta D, Elkin E, Litwin M, Latini D, DuChane J, et al. The UCSF Cancer of the Prostate Risk Assessment (CAPRA) Score: a straightforward and reliable preoperative predictor of disease recurrence after radical prostatectomy. J Urol. 2005;173(6):1938–42. 97. Cooperberg M, Broering J, Carroll P. Risk assessment for prostate cancer metastasis and mortality at the time of diagnosis. J Natl Cancer Inst. 2009;101(12):878-87. 98. Cooperberg M, Freedland S, Pasta D, Elkin E, Presti J, Amling C, et al. Multiinstitutional validation of the UCSF Cancer of the Prostate Risk Assessment for prediction of recurrence after radical prostatectomy. Cancer. 2006;107(10):2384-91. 99. Brajtbord J, Leapman M, Cooperberg M. The CAPRA score at 10 years: contemporary perspectives and analysis of supporting studies. Eur Urol. 2017;71(5):705-9. 100. Campbell J, Raymond E, O'Callaghan M, Vincent A, Beckmann K, Roder D, et al. Optimum tools for predicting clinical outcomes in prostate cancer patients undergoing radical prostatectomy: a systematic review of prognostic accuracy and validity. Clin Genitourin Cancer. 2017;15(5):e827-e34. 101. Seo W, Kang P, Kang D, Yoon J, Kim W, Chung J. Cancer of the Prostate Risk Assessment (CAPRA) preoperative score versus postoperative score (CAPRA-S): ability to predict cancer progression and decision-making regarding adjuvant therapy after radical prostatectomy. J Korean Med Sci. 2014;29(9):1212-6. 102. Andreoiu M, Cheng L. Multifocal prostate cancer: biologic, prognostic, and therapeutic implications. Hum Pathol. 2010;41(6):781-93. 103. Danneman D, Drevin L, Delahunt B, Samaratunga H, Robinson D, Bratt O, et al. Accuracy of prostate biopsies for predicting Gleason score in radical prostatectomy specimens: nationwide trends 2000-2012. BJU Int. 2017;119(1):50-6. 104. McKenney J, Simko J, Bonham M, True L, Troyer D, Hawley S, et al. The potential impact of reproducibility of Gleason grading in men with early stage prostate cancer managed by active surveillance: a multi-institutional study. J Urol. 2011;186(2):465-9. 105. Reese A, Sadetsky N, Carroll P, Cooperberg M. Inaccuracies in assignment of clinical stage for localized prostate cancer. Cancer. 2011;117(2):283-9. 106. Outwater E, Montilla-Soler J. Imaging of prostate carcinoma. Cancer Control. 2013;20(3):161-76. 107. Shariat S, Abdel-Aziz K, Roehrborn C, Lotan Y. Pre-operative percent free PSA predicts clinical outcomes in patients treated with radical prostatectomy with total PSA levels below 10 ng/ml. Eur Urol. 2006;49(2):293-302. 108. Falchook A, Martin N, Basak R, Smith A, Milowsky M, Chen R. Stage at presentation and survival outcomes of patients with Gleason 8-10 prostate cancer and low prostate-specific antigen. Urol Oncol. 2016;34(3):119.e19-26. 109. Loeb S, Bjurlin M, Nicholson J, Tammela T, Penson D, Carter H, et al. Overdiagnosis and overtreatment of prostate cancer. Eur Urol. 2014;65(6):1046-55.

235

110. Heidegger I, Skradski V, Steiner E, Klocker H, Pichler R, Pircher A, et al. High risk of under-grading and -staging in prostate cancer patients eligible for active surveillance. PLoS One. 2015;10(2):e0115537. 111. Dall'Era M, Klotz L. Active surveillance for intermediate-risk prostate cancer. Prostate Cancer Prostatic Dis. 2017;20(1):1-6. 112. Kupelian P, Mahadevan A, Reddy C, Reuther A, Klein E. Use of different definitions of biochemical failure after external beam radiotherapy changes conclusions about relative treatment efficacy for localized prostate cancer. Urology. 2006;68(3):593-8. 113. Liesenfeld L, Kron M, Gschwend J, Herkommer K. Prognostic factors for biochemical recurrence more than 10 years after radical prostatectomy. J Urol. 2017;197(1):143-8. 114. Van Den Eeden S, Lu R, Zhang N, Quesenberry C, Shan J, Han J, et al. A biopsy-based 17-gene Genomic Prostate Score as a predictor of metastases and prostate cancer death in surgically treated men with clinically localized disease. Eur Urol. 2018;73(1):129-38. 115. Datta D, Aftabuddin M, Gupta D, Raha S, Sen P. Human prostate cancer hallmarks map. Sci Rep. 2016;6:30691. 116. Hanahan D, Weinberg R. Hallmarks of cancer: the next generation. Cell. 2011;144(5):646-73. 117. Nowell P. The clonal evolution of tumor cell populations. Science. 1976;194(4260):23-8. 118. Hanahan D, Weinberg R. The hallmarks of cancer. Cell. 2000;100(1):57-70. 119. Witsch E, Sela M, Yarden Y. Roles for growth factors in cancer progression. Physiology (Bethesda). 2010;25(2):85-101. 120. Bhowmick N, Neilson E, Moses H. Stromal fibroblasts in cancer initiation and progression. Nature. 2004;432(7015):332-7. 121. Lokker N, Sullivan C, Hollenbach S, Israel M, Giese N. Platelet-derived growth factor (PDGF) autocrine signaling regulates survival and mitogenic pathways in glioblastoma cells: evidence that the novel PDGF-C and PDGF-D ligands may play a role in the development of brain tumors. Cancer Res. 2002;62(13):3729-35. 122. Feng S, Dakhova O, Creighton C, Ittmann M. Endocrine fibroblast growth factor FGF19 promotes prostate cancer progression. Cancer Res. 2013;73(8):2551-62. 123. Peraldo-Neia C, Migliardi G, Mello-Grand M, Montemurro F, Segir R, Pignochino Y, et al. Epidermal Growth Factor Receptor (EGFR) mutation analysis, gene expression profiling and EGFR protein expression in primary prostate cancer. BMC Cancer. 2011;11:31. 124. Ahmad I, Patel R, Babloo Singh L, Nixon C, Seywright M, Barnetson R, et al. HER2 overcomes PTEN (loss)-induced senescence to cause aggressive prostate cancer. Proc Natl Acad Sci USA. 2011;108(39):16392–7. 125. Zhu G, Liu Z, Epstein J, Davis C, Christudass C, Carter H, et al. A novel quantitative multiplex tissue immunoblotting for biomarkers predicts a prostate cancer aggressive phenotype. Cancer Epidemiol Biomarkers Prev. 2015;24(12):1864-72. 126. Osman I, Scher H, Drobnjak M, Verbel M, Morris M, Agus D, et al. HER-2/neu (p185neu) protein expression in the natural or treated history of prostate cancer. Clin Cancer Res. 2001;7(9):2643-7. 127. Shah RB, Ghosh D, Elder JT. Epidermal growth factor receptor (ErbB1) expression in prostate cancer progression: correlation with androgen independence. Prostate. 2006;66(13):1437-44. 128. Sever R, Brugge J. Signal transduction in cancer. Cold Spring Harb Perspect Med. 2015;5(4). 129. Fromont G, Godet J, Peyret A, Irani J, Celhay O, Rozet F, et al. 8q24 amplification is associated with Myc expression and prostate cancer progression and is an independent predictor of recurrence after radical prostatectomy. Hum Pathol. 2013;44(8):1617-23. 130. Zafarana G, Ishkanian A, Malloff C, Locke J, Sykes J, Thoms J, et al. Copy number alterations of c- MYC and PTEN are prognostic factors for relapse after prostate cancer radiotherapy. Cancer. 2012;118(16):4053-62. 131. Edwards J, Krishna N, Witton C, Bartlett J. Gene amplifications associated with the development of hormone-resistant prostate cancer. Clin Cancer Res. 2003;9(14):5271-81. 132. Shackelford R, Kaufmann W, Paules R. Cell cycle control, checkpoint mechanisms, and genotoxic stress. Environ Health Perspect. 1999;107(Suppl 1):5-24. 133. Weinberg R. The retinoblastoma protein and cell cycle control. Cell. 1995;81(3):323-30. 134. Qi F, Yuan Y, Zhi X, Huang Q, Chen Y, Zhuang W, et al. Synergistic effects of AKAP95, Cyclin D1, Cyclin E1, and Cx43 in the development of rectal cancer. Int J Clin Exp Pathol. 2015;8(2):1666-73. 135. Hamilton E, Infante J. Targeting CDK4/6 in patients with cancer. Cancer Treat Rev. 2016;45:129- 38.

236

136. Moselhy S, Kumosani T, Kamal I, Jalal J, Jabaar H, Dalol A. Hypermethylation of P15, P16, and E- cadherin genes in ovarian cancer. Toxicol Ind Health. 2013;31(10):924-30. 137. Neuzillet C, Tijeras-Raballand A, Cohen R, Cros J, Faivre S, Raymond E, et al. Targeting the TGFβ pathway for cancer therapy. Pharmacol Ther. 2015;147:22-31. 138. Wolters T, Vissers K, Bangma C, Schroder F, van Leenders G. The value of EZH2, p27(kip1), BMI-1 and MIB-1 on biopsy specimens with low-risk prostate cancer in selecting men with significant prostate cancer at prostatectomy. BJU Int. 2010;106(2):280-6. 139. Kimbrough-Allah M, Millena A, Khan S. Differential role of PTEN in transforming growth factor β (TGF-β) effects on proliferation and migration in prostate cancer cells. Prostate. 2018;78(5):377-89. 140. Shariat S, Walz J, Roehrborn C, Montorsi F, Jeldres C, Saad F, et al. Early postoperative plasma transforming growth factor-beta1 is a strong predictor of biochemical progression after radical prostatectomy. J Urol. 2008;179(4):1593-7. 141. Wu C, Chang Y, Lin W, Chen W, Chen M. TGF beta1 expression correlates with survival and tumor aggressiveness of prostate cancer. Ann Surg Oncol. 2015;22(Suppl 3):S1587-93. 142. Kluth M, Harasimowicz S, Burkhardt L, Grupp K, Krohn A, Prien K, et al. Clinical significance of different types of p53 gene alteration in surgically treated prostate cancer. Int J Cancer. 2014;135(6):1369- 80. 143. Levine A. p53, the cellular gatekeeper for growth and division. Cell. 1997;88(3):323-31. 144. Giono L, Manfredi J. The p53 tumor suppressor participates in multiple cell cycle checkpoints. J Cell Physiol. 2006;209(1):13-20. 145. Krohn A, Diedler T, Burkhardt L, Mayer P, De Silva C, Meyer-Kornblum M, et al. Genomic deletion of PTEN is associated with tumor progression and early PSA recurrence in ERG fusion-positive and fusion- negative prostate cancer. Am J Pathol. 2012;181(2):401-12. 146. Cantley L. The phosphoinositide 3-kinase pathway. Science. 2002;296(5573):1655-7. 147. Brandmaier A, Hou S, Shen W. Cell cycle control by PTEN. J Mol Biol. 2017;429(15):2265-77. 148. Kale J, Osterlund E, Andrews D. BCL-2 family proteins: changing partners in the dance towards death. Cell Death Differ. 2018;25(1):65-80. 149. Zhivotovsky B, Kroemer G. Apoptosis and genomic instability. Nat Rev Mol Cell Biol. 2004;5(9):752-62. 150. Su Z, Yang Z, Xu Y, Chen Y, Yu Q. Apoptosis, autophagy, necroptosis, and cancer metastasis. Mol Cancer. 2015;14:48. 151. Fulda S. Evasion of apoptosis as a cellular stress response in cancer. Int J Cell Biol. 2010;2010:370835. 152. Rowlands M, Holly J, Hamdy F, Phillips J, Goodwin L, Marsden G, et al. Serum insulin-like growth factors and mortality in localised and advanced clinically detected prostate cancer. Cancer Causes Control. 2012;23(2):347-54. 153. Carvalho J, Filipe L, Costa V, Ribeiro F, Martins A, Teixeira M, et al. Detailed analysis of expression and promoter methylation status of apoptosis-related genes in prostate cancer. Apoptosis. 2010;15(8):956-65. 154. Fleischmann A, Huland H, Mirlacher M, Wilczak W, Simon R, Erbersdobler A, et al. Prognostic relevance of Bcl-2 overexpression in surgically treated prostate cancer is not caused by increased copy number or translocation of the gene. Prostate. 2012;72(9):991-7. 155. Mathieu R, Lucca I, Vartolomei M, Mbeutcha A, Klatte T, Seitz C, et al. Role of Survivin expression in predicting biochemical recurrence after radical prostatectomy: a multi-institutional study. BJU Int. 2016;119(2):234-8. 156. Lin Y, Fukuchi J, Hiipakka R, Kokontis J, Xiang J. Up-regulation of Bcl-2 is required for the progression of prostate cancer cells from an androgen-dependent to an androgen-independent growth stage. Cell Res. 2007;17(6):531-6. 157. Hayflick L. The limited in vitro lifetime of human diploid cell strains. Exp Cell Res. 1965;37:614- 36. 158. Wright W, Pereira-Smith O, Shay J. Reversible cellular senescence: implications for immortalization of normal human diploid fibroblasts. Mol Cell Biol. 1989;9(7):3088-92. 159. Zakian V. Telomeres: the beginnings and ends of eukaryotic chromosomes. Exp Cell Res. 2012;318(12):1456-60. 160. Bertorelle R, Briarava M, Rampazzo E, Biasini L, Agostini M, Maretto I, et al. Telomerase is an independent prognostic marker of overall survival in patients with colorectal cancer. Br J Cancer. 2013;108(2):278-84.

237

161. Kulic A, Plavetic N, Gamulin S, Jakic-Razumovic J, Vrbanec D, Sirotkovic-Skerlev M. Telomerase activity in breast cancer patients: association with poor prognosis and more aggressive phenotype. Med Oncol. 2016;33(3):23. 162. Queisser A, Heeg S, Thaler M, von Werder A, Opitz O. Inhibition of telomerase induces alternative lengthening of telomeres during human esophageal carcinogenesis. Cancer Genet. 2013;206(11):374-86. 163. Meeker A, Hicks J, Platz E, March G, Bennett C, Delannoy M, et al. Telomere shortening is an early somatic DNA alteration in human prostate tumorigenesis. Cancer Res. 2002;62(22):6405-9. 164. Iczkowski K, Pantazis C, McGregor D, Wu Y, Tawfik O. Telomerase reverse transcriptase subunit immunoreactivity: a marker for high-grade prostate carcinoma. Cancer. 2002;95(12):2487-93. 165. March-Villalba J, Martinez-Jabaloyas J, Herrero M, Santamaria J, Alino S, Dasi F. Cell-free circulating plasma hTERT mRNA is a useful marker for prostate cancer diagnosis and is associated with poor prognosis tumor characteristics. PLoS One. 2012;7(8):e43470. 166. Folkman J. Seminars in medicine of the Beth Israel Hospital, Boston. Clinical applications of research on angiogenesis. N Engl J Med. 1995;333(26):1757-63. 167. Tonini T, Rossi F, Claudio P. Molecular basis of angiogenesis and cancer. Oncogene. 2003;22(42):6549-56. 168. Rajabi M, Mousa S. The role of angiogenesis in cancer treatment. Biomedicines. 2017;5(2). 169. Woollard D, Opeskin K, Coso S, Wu D, Baldwin M, Williams E. Differential expression of VEGF ligands and receptors in prostate cancer. Prostate. 2013;73(6):563-72. 170. Wang Q, Diao X, Sun J, Chen Z. Stromal cell-derived factor-1 and vascular endothelial growth factor as biomarkers for lymph node metastasis and poor cancer-specific survival in prostate cancer patients after radical prostatectomy. Urol Oncol. 2013;31(3):312-7. 171. Talagas M, Uguen A, Garlantezec R, Fournier G, Doucet L, Gobin E, et al. VEGFR1 and NRP1 endothelial expressions predict distant relapse after radical prostatectomy in clinically localized prostate cancer. Anticancer Res. 2013;33(5):2065-75. 172. Bin W, He W, Feng Z, Xiangdong L, Yong C, Lele K, et al. Prognostic relevance of cyclooxygenase- 2 (COX-2) expression in Chinese patients with prostate cancer. Acta Histochem. 2011;113(2):131-6. 173. Zhong W, Han Z, He H, Bi X, Dai Q, Zhu G, et al. CD147, MMP-1, MMP-2 and MMP-9 protein expression as significant prognostic factors in human prostate cancer. Oncology. 2008;75(3-4):230-6. 174. Reis S, Pontes-Junior J, Antunes A, de Sousa-Canavez J, Dall'Oglio M, Passerotti C, et al. MMP-9 overexpression due to TIMP-1 and RECK underexpression is associated with prognosis in prostate cancer. Int J Biol Markers. 2011;26(4):255-61. 175. Trudel D, Fradet Y, Meyer F, Harel F, Tetu B. Significance of MMP-2 expression in prostate cancer: an immunohistochemical study. Cancer Res. 2003;63(23):8511-5. 176. Morgia G, Falsaperla M, Malaponte G, Madonia M, Indelicato M, Travali S, et al. Matrix metalloproteinases as diagnostic (MMP-13) and prognostic (MMP-2, MMP-9) markers of prostate cancer. Urol Res. 2005;33(1):44-50. 177. Miyata Y, Mitsunari K, Asai A, Takehara K, Mochizuki Y, Sakai H. Pathological significance and prognostic role of microvessel density, evaluated using CD31, CD34, and CD105 in prostate cancer patients after radical prostatectomy with neoadjuvant therapy. Prostate. 2015;75(1):84-91. 178. Hsu P, Sabatini D. Cancer cell metabolism: Warburg and beyond. Cell. 2008;134(5):703-7. 179. Hirschey M, DeBerardinis R, Diehl A, Drew J, Frezza C, Green M, et al. Dysregulated metabolism contributes to oncogenesis. Semin Cancer Biol. 2015;35:S129-S50. 180. Warburg O. On the origin of cancer cells. Science. 1956;123(3191):309-14. 181. DeBerardinis R, Lum J, Hatzivassiliou G, Thompson C. The biology of cancer: metabolic reprogramming fuels cell growth and proliferation. Cell Metab. 2008;7(1):11-20. 182. Adekola K, Rosen S, Shanmugam M. Glucose transporters in cancer metabolism. Curr Opin Oncol. 2012;24(6):650-4. 183. Zhang Y, Yang J. Altered energy metabolism in cancer: a unique opportunity for therapeutic intervention. Cancer Biol Ther. 2013;14(2):81-9. 184. Cutruzzola F, Giardina G, Marani M, Macone A, Paiardini A, Rinaldo S, et al. Glucose metabolism in the progression of prostate cancer. Front Physiol. 2017;8:97. 185. Trock B. Application of metabolomics to prostate cancer. Urol Oncol. 2011;29(5):572-81. 186. Costello L, Franklin R, Feng P. Mitochondrial function, zinc, and intermediary metabolism relationships in normal prostate and prostate cancer. Mitochondrion. 2005;5(3):143-53. 187. Franz M, Anderle P, Burzle M, Suzuki Y, Freeman M, Hediger M, et al. Zinc transporters in prostate cancer. Mol Aspects Med. 2013;34(2-3):735-41.

238

188. Elia I, Schmieder R, Christen S, Fendt S. Organ-specific cancer metabolism and its potential for therapy. Handb Exp Pharmacol. 2016;233:321-53. 189. Eidelman E, Twum-Ampofo J, Ansari J, Siddiqui M. The metabolic phenotype of prostate cancer. Front Oncol. 2017;7:131. 190. Hamada S, Horiguchi A, Kuroda K, Ito K, Asano T, Miyai K, et al. Increased fatty acid synthase expression in prostate biopsy cores predicts higher Gleason score in radical prostatectomy specimen. BMC Clin Pathol. 2014;14(1):3. 191. Jiang N, Zhu S, Chen J, Niu Y, Zhou L. A-methylacyl-CoA racemase (AMACR) and prostate-cancer risk: a meta-analysis of 4,385 participants. PLoS One. 2013;8(10):e74386. 192. Khan A, Rajendiran T, Ateeq B, Asangani I, Athanikar J, Yocum A, et al. The role of sarcosine metabolism in prostate cancer progression. Neoplasia. 2013;15(5):491-501. 193. Giskeodegard G, Bertilsson H, Selnaes K, Wright A, Bathen T, Viset T, et al. Spermine and citrate as metabolic biomarkers for assessing prostate cancer aggressiveness. PLoS One. 2013;8(4):e62375. 194. Talmadge J, Fidler I. AACR centennial series: the biology of cancer metastasis: historical perspective. Cancer Res. 2010;70(14):5649-69. 195. Gupta G, Massague J. Cancer metastasis: building a framework. Cell. 2006;127(4):679-95. 196. Saxena M, Christofori G. Rebuilding cancer metastasis in the mouse. Mol Oncol. 2013;7(2):283- 96. 197. Guan X. Cancer metastases: challenges and opportunities. Acta Pharm Sin B. 2015;5(5):402-18. 198. Kumano M, Miyake H, Muramaki M, Furukawa J, Takenaka A, Fujisawa M. Expression of urokinase-type plasminogen activator system in prostate cancer: correlation with clinicopathological outcomes in patients undergoing radical prostatectomy. Urol Oncol. 2009;27(2):180-6. 199. Halvorsen O, Oyan A, Bo T, Olsen S, Rostad K, Haukaas S, et al. Gene expression profiles in prostate cancer: association with patient subgroups and tumour differentiation. Int J Oncol. 2005;26(2):329-36. 200. Dass K, Ahmad A, Azmi A, Sarkar S, Sarkar F. Evolving role of uPA/uPAR system in human cancers. Cancer Treat Rev. 2008;34(2):122-36. 201. Ma D, Zhou Z, Yang B, He Q, Zhang Q, Zhang X. Association of molecular biomarkers expression with biochemical recurrence in prostate cancer through tissue microarray immunostaining. Oncol Lett. 2015;10(4):2185-91. 202. Kristiansen G, Pilarsky C, Wissmann C, Kaiser S, Bruemmendorf T, Roepcke S, et al. Expression profiling of microdissected matched prostate cancer samples reveals CD166/MEMD and CD24 as new prognostic markers for patient survival. J Pathol. 2005;205(3):359-76. 203. Guarino M, Rubino B, Ballabio G. The role of epithelial-mesenchymal transition in cancer pathology. Pathology. 2007;39(3):305-18. 204. Sethi S, Macoska J, Chen W, Sarkar F. Molecular signature of epithelial-mesenchymal transition (EMT) in human prostate cancer bone metastasis. Am J Transl Res. 2010;3(1):90-9. 205. Figiel S, Vasseur C, Bruyere F, Rozet F, Maheo K, Fromont G. Clinical significance of epithelial- mesenchymal transition (EMT) markers in prostate cancer. Hum Pathol. 2016;61:26-32. 206. Drivalos A, Chrisofos M, Efstathiou E, Kapranou A, Kollaitis G, Koutlis G, et al. Expression of α5- integrin, α7-integrin, Ε-cadherin, and N-cadherin in localized prostate cancer. Urol Oncol. 2016;34(4):165.e11-8. 207. Gravdal K, Halvorsen O, Haukaas S, Akslen L. A switch from E-cadherin to N-cadherin expression indicates epithelial to mesenchymal transition and is of strong and independent importance for the progress of prostate cancer. Clin Cancer Res. 2007;13(23):7003-11. 208. Loeb L. Mutator phenotype may be required for multistep carcinogenesis. Cancer Res. 1991;51(12):3075-9. 209. Pikor L, Thu K, Vucic E, Lam W. The detection and implication of genome instability in cancer. Cancer Metastasis Rev. 2013;32(3-4):341-52. 210. Romero-Laorden N, Castro E. Inherited mutations in DNA repair genes and cancer risk. Curr Probl Cancer. 2017;41(4):251-64. 211. Solomon D, Kim T, Diaz-Martinez L, Fair J, Elkahloun A, Harris B, et al. Mutational inactivation of STAG2 causes aneuploidy in human cancer. Science. 2011;333(6045):1039-43. 212. Wang X, Cheung H, Chun A, Jin D, Wong Y. Mitotic checkpoint defects in human cancers and their implications to chemotherapy. Front Biosci. 2008;13:2103-14. 213. Michor F, Iwasa Y, Vogelstein B, Lengauer C, Nowak M. Can chromosomal instability initiate tumorigenesis? Semin Cancer Biol. 2005;15(1):43-9.

239

214. Muzny D, Bainbridge M, Chang K, Dinh H, Drummond J, Fowler G, et al. Comprehensive molecular characterization of human colon and rectal cancer. Cancer Genome Atlas Network. Nature. 2012;487(7407):330-7. 215. Williams J, Greer P, Squire J. Recurrent copy number alterations in prostate cancer: an in silico meta-analysis of publicly available genomic data. Cancer Genet. 2014;207(10-12):474-88. 216. Gsponer J, Braun M, Scheble V, Zellweger T, Bachmann A, Perner S, et al. ERG rearrangement and protein expression in the progression to castration-resistant prostate cancer. Prostate Cancer Prostatic Dis. 2014;17(2):126-31. 217. Grivennikov S, Greten F, Karin M. Immunity, inflammation, and cancer. Cell. 2010;140(6):883-99. 218. Hussain S, Harris C. Inflammation and cancer: an ancient link with novel potentials. Int J Cancer. 2007;121(11):2373-80. 219. Coussens L, Werb Z. Inflammation and cancer. Nature. 2002;420(6917):860-7. 220. Vendramini-Costa D, Carvalho J. Molecular link mechanisms between inflammation and cancer. Curr Pharm Des. 2012;18(26):3831-52. 221. Balkwill F, Charles K, Mantovani A. Smoldering and polarized inflammation in the initiation and promotion of malignant disease. Cancer Cell. 2005;7(3):211-7. 222. Hussain S, Hofseth L, Harris C. Radical causes of cancer. Nat Rev Cancer. 2003;3(4):276-85. 223. Doat S, Marous M, Rebillard X, Tretarre B, Lamy P, Soares P, et al. Prostatitis, other genitourinary infections and prostate cancer risk: Influence of non-steroidal anti-inflammatory drugs? Results from the EPICAP study. Int J Cancer. 2018. 224. Ellem S, Wang H, Poutanen M, Risbridger G. Increased endogenous estrogen synthesis leads to the sequential induction of prostatic inflammation (prostatitis) and prostatic pre-malignancy. Am J Pathol. 2009;175(3):1187-99. 225. Iovene M, Martora F, Mallardo E, De Sio M, Arcaniolo D, Del Vecchio C, et al. Enrichment of semen culture in the diagnosis of bacterial prostatitis. J Microbiol Methods. 2018;154:124-6. 226. Palapattu G, Sutcliffe S, Bastian P, Platz E, De Marzo A, Isaacs W, et al. Prostate carcinogenesis and inflammation: emerging insights. Carcinogenesis. 2005;26(7):1170-81. 227. Wang W, Bergh A, Damber J. Morphological transition of proliferative inflammatory atrophy to high-grade intraepithelial neoplasia and cancer in human prostate. Prostate. 2009;69(13):1378-86. 228. Song H, Zhang B, Watson M, Humphrey P, Lim H, Milbrandt J. Loss of Nkx3.1 leads to the activation of discrete downstream target genes during prostate tumorigenesis. Oncogene. 2009;28(37):3307-19. 229. De Marzo A, Marchi V, Epstein J, Nelson W. Proliferative inflammatory atrophy of the prostate: implications for prostatic carcinogenesis. Am J Pathol. 1999;155(6):1985-92. 230. De Nunzio C, Kramer G, Marberger M, Montironi R, Nelson W, Schroder F, et al. The controversial relationship between benign prostatic hyperplasia and prostate cancer: the role of inflammation. Eur Urol. 2011;60(1):106-17. 231. Martignano F, Gurioli G, Salvi S, Calistri D, Costantini M, Gunelli R, et al. GSTP1 methylation and protein expression in prostate cancer: diagnostic implications. Dis Markers. 2016;2016:4358292. 232. Mahon K, Qu W, Devaney J, Paul C, Castillo L, Wykes R, et al. Methylated Glutathione S- transferase 1 (mGSTP1) is a potential plasma free DNA epigenetic marker of prognosis and response to chemotherapy in castrate-resistant prostate cancer. Br J Cancer. 2014;111(9):1802-9. 233. Woodson K, O'Reilly K, Hanson J, Nelson D, Walk E, Tangrea J. The usefulness of the detection of GSTP1 methylation in urine as a biomarker in the diagnosis of prostate cancer. J Urol. 2008;179(2):508- 11. 234. Hofmeister V, Schrama D, Becker J. Anti-cancer therapies targeting the tumor stroma. Cancer Immunol Immunother. 2008;57(1):1-17. 235. Quail D, Joyce J. Microenvironmental regulation of tumor progression and metastasis. Nat Med. 2013;19(11):1423-37. 236. Tuxhorn J, Ayala G, Smith M, Smith V, Dang T, Rowley D. Reactive stroma in human prostate cancer: induction of myofibroblast phenotype and extracellular matrix remodeling. Clin Cancer Res. 2002;8(9):2912-23. 237. Josefsson A, Adamo H, Hammarsten P, Granfors T, Stattin P, Egevad L, et al. Prostate cancer increases hyaluronan in surrounding nonmalignant stroma, and this response is associated with tumor growth and an unfavorable outcome. Am J Pathol. 2011;179(4):1961-8.

240

238. Tomas D, Spajic B, Milosevic M, Demirovic A, Marusic Z, Kruslin B. Intensity of stromal changes predicts biochemical recurrence-free survival in prostatic carcinoma. Scand J Urol Nephrol. 2010;44(5):284-90. 239. Pecqueux C, Arslan A, Heller M, Falkenstein M, Kaczorowski A, Tolstov Y, et al. FGF-2 is a driving force for chromosomal instability and a stromal factor associated with adverse clinico-pathological features in prostate cancer. Urol Oncol. 2018;36(8):365.e15-56.e26. 240. Gravina G, Mancini A, Ranieri G, Di Pasquale B, Marampon F, Di Clemente L, et al. Phenotypic characterization of human prostatic stromal cells in primary cultures derived from human tissue samples. Int J Oncol. 2013;42(6):2116-22. 241. Gomez C, Gomez P, Knapp J, Jorda M, Soloway M, Lokeshwar V. Hyaluronic acid and HYAL-1 in prostate biopsy specimens: predictors of biochemical recurrence. J Urol. 2009;182(4):1350-6. 242. Ricciardelli C, Russell D, Ween M, Mayne K, Suwiwat S, Byers S, et al. Formation of hyaluronan- and versican-rich pericellular matrix by prostate cancer cells promotes cell motility. J Biol Chem. 2007;282(14):10814-25. 243. Lokeshwar V, Rubinowicz D, Schroeder G, Forgacs E, Minna J, Block N, et al. Stromal and epithelial expression of tumor markers hyaluronic acid and HYAL1 hyaluronidase in prostate cancer. J Biol Chem. 2001;276(15):11922-32. 244. Whiteside T. Immune suppression in cancer: effects on immune cells, mechanisms and future therapeutic intervention. Semin Cancer Biol. 2006;16(1):3-15. 245. Liu Y, Komohara Y, Domenick N, Ohno M, Ikeura M, Hamilton R, et al. Expression of antigen processing and presenting molecules in brain metastasis of breast cancer. Cancer Immunol Immunother. 2012;61(6):789-801. 246. Bandoh N, Ogino T, Katayama A, Takahara M, Katada A, Hayashi T, et al. HLA class I antigen and transporter associated with antigen processing downregulation in metastatic lesions of head and neck squamous cell carcinoma as a marker of poor prognosis. Oncol Rep. 2010;23(4):933-9. 247. Zhang Y, Luo Y, Qin S, Mu Y, Qi Y, Yu M, et al. The clinical impact of ICOS signal in colorectal cancer patients. Oncoimmunology. 2016;5(5):e1141857. 248. Wu S, Zhao X, Wu S, Du R, Zhu Q, Fang H, et al. Overexpression of B7-H3 correlates with aggressive clinicopathological characteristics in non-small cell lung cancer. Oncotarget. 2016;7(49):81750-6. 249. Vinay D, Ryan E, Pawelec G, Talib W, Stagg J, Elkord E, et al. Immune evasion in cancer: mechanistic basis and therapeutic strategies. Semin Cancer Biol. 2015;35:S185-S98. 250. Ganesan A, Johansson M, Ruffell B, Beltran A, Lau J, Jablons D, et al. Tumor-infiltrating regulatory T cells inhibit endogenous cytotoxic T cell responses to lung adenocarcinoma. J Immunol. 2013;191(4):2009-17. 251. Jordan K, Amaria R, Ramirez O, Callihan E, Gao D, Borakove M, et al. Myeloid-derived suppressor cells are associated with disease progression and decreased overall survival in advanced-stage melanoma patients. Cancer Immunol Immunother. 2013;62(11):1711-22. 252. Maenhout S, Van Lint S, Emeagi P, Thielemans K, Aerts J. Enhanced suppressive capacity of tumor- infiltrating myeloid-derived suppressor cells compared with their peripheral counterparts. Int J Cancer. 2014;134(5):1077-90. 253. Singh N, Hussain S, Bharadwaj M, Kakkar N, Singh S, Sobti R. Overexpression of signal transducer and activator of transcription (STAT-3 and STAT-5) transcription factors and alteration of suppressor of cytokine signaling (SOCS-1) protein in prostate cancer. J Recept Signal Transduct Res. 2012;32(6):321-7. 254. Lin D, Wang X, Choi S, Ci X, Dong X, Wang Y. Immune phenotypes of prostate cancer cells: evidence of epithelial immune cell-like transition? Asian J Urol. 2016;3(4):195-202. 255. Flammiger A, Weisbach L, Huland H, Tennstedt P, Simon R, Minner S, et al. High tissue density of FOXP3+ T cells is associated with clinical outcome in prostate cancer. Eur J Cancer. 2013;49(6):1273-9. 256. Nardone V, Botta C, Caraglia M, Martino E, Ambrosio M, Carfagno T, et al. Tumor infiltrating T lymphocytes expressing FoxP3, CCR7 or PD-1 predict the outcome of prostate cancer patients subjected to salvage radiotherapy after biochemical relapse. Cancer Biol Ther. 2016;17(11):1213-20. 257. Yuan H, Wei X, Zhang G, Li C, Zhang X, Hou J. B7-H3 over expression in prostate cancer promotes tumor cell progression. J Urol. 2011;186(3):1093-9. 258. Liu Y, Vlatkovic L, Saeter T, Servoll E, Waaler G, Nesland J, et al. Is the clinical malignant phenotype of prostate cancer a result of a highly proliferative immune-evasive B7-H3-expressing cell population? Int J Urol. 2012;19(8):749-56.

241

259. Webster A, Zumbo P, Fostel J, Gandara J, Hester S, Recio L, et al. Mining the archives: a cross- platform analysis of gene expression profiles in archival formalin-fixed paraffin-embedded tissues. Toxicol Sci. 2015;148(2):460-72. 260. Rogerson L, Darby S, Jabbar T, Mathers M, Leung H, Robson C, et al. Application of transcript profiling in formalin-fixed paraffin-embedded diagnostic prostate cancer needle biopsies. BJU Int. 2008;102(3):364-70. 261. Kosari F, Munz J, Savci-Heijink C, Spiro C, Klee E, Kube D, et al. Identification of prognostic biomarkers for prostate cancer. Clin Cancer Res. 2008;14(6):1734-43. 262. Penney K, Sinnott J, Fall K, Pawitan Y, Hoshida Y, Kraft P, et al. mRNA expression signature of Gleason grade predicts lethal prostate cancer. J Clin Oncol. 2011;29(17):2391-6. 263. Cuzick J, Berney D, Fisher G, Mesher D, Moller H, Reid J, et al. Prognostic value of a cell cycle progression signature for prostate cancer death in a conservatively managed needle biopsy cohort. Br J Cancer. 2012;106(6):1095-9. 264. Erho N, Crisan A, Vergara I, Mitra A, Ghadessi M, Buerki C, et al. Discovery and validation of a prostate cancer genomic classifier that predicts early metastasis following radical prostatectomy. PLoS One. 2013;8(6):e66855. 265. Bismar T, Demichelis F, Riva A, Kim R, Varambally S, He L, et al. Defining aggressive prostate cancer using a 12-gene model. Neoplasia. 2006;8(1):59-68. 266. Chen Z, Gerke T, Bird V, Prosperi M. Trends in gene expression profiling for prostate cancer risk assessment: a systematic review. Biomed Hub. 2017;2:472146. 267. Mohler J, Armstrong A, Bahnson R, D'Amico A, Davis B, Eastham J, et al. Prostate Cancer, Version 1.2016. J Natl Compr Canc Netw. 2016;14(1):19-30. 268. Cucchiara V, Cooperberg M, Dall'Era M, Lin D, Montorsi F, Schalken J, et al. Genomic markers in prostate cancer decision making. Eur Urol. 2018;73(4):572-82. 269. Freedland S, Gerber L, Reid J, Welbourn W, Tikishvili E, Park J, et al. Prognostic utility of cell cycle progression score in men with prostate cancer after primary external beam radiation therapy. Int J Radiat Oncol Biol Phys. 2013;86(5):848-53. 270. Bishoff J, Freedland S, Gerber L, Tennstedt P, Reid J, Welbourn W, et al. Prognostic utility of the cell cycle progression score generated from biopsy in men treated with prostatectomy. J Urol. 2014;192(2):409-14. 271. Cuzick J, Swanson G, Fisher G, Brothman A, Berney D, Reid J, et al. Prognostic value of an RNA expression signature derived from cell cycle proliferation genes in patients with prostate cancer: a retrospective study. Lancet Oncol. 2011;12(3):245-55. 272. Klein E, Cooperberg M, Magi-Galluzzi C, Simko J, Falzarano S, Maddala T, et al. A 17-gene assay to predict prostate cancer aggressiveness in the context of Gleason grade heterogeneity, tumor multifocality, and biopsy undersampling. Eur Urol. 2014;66(3):550-60. 273. Cullen J, Rosner I, Brand T, Zhang N, Tsiatis A, Moncur J, et al. A biopsy-based 17-gene Genomic Prostate Score predicts recurrence after radical prostatectomy and adverse surgical pathology in a racially diverse population of men with clinically low- and intermediate-risk prostate cancer. Eur Urol. 2015;68(1):123-31. 274. Shore N, Concepcion R, Saltzstein D, Lucia M, van Breda A, Welbourn W, et al. Clinical utility of a biopsy-based cell cycle gene expression assay in localized prostate cancer. Curr Med Res Opin. 2014;30(4):547-53. 275. Larkin S, Holmes S, Cree I, Walker T, Basketter V, Bickers B, et al. Identification of markers of prostate cancer progression using candidate gene expression. Br J Cancer. 2012;106(1):157-65. 276. Wilt T, Ahmed H. Prostate cancer screening and the management of clinically localized disease. BMJ. 2013;346:f325. 277. Goulter AB, Harmer DW, Clark KL. Evaluation of low density array technology for quantitative parallel measurement of multiple genes in human tissue. BMC Genomics. 2006;7:34. 278. Sooriakumaran P, Nyberg T, Akre O, Haendler L, Heus I, Olsson M, et al. Comparative effectiveness of radical prostatectomy and radiotherapy in prostate cancer: observational study of mortality outcomes. BMJ. 2014;348:g1502. 279. Cronin M, Pho M, Dutta D, Stephans J, Shak S, Kiefer M, et al. Measurement of gene expression in archival paraffin-embedded tissues: development and performance of a 92-gene reverse transcriptase- polymerase chain reaction assay. Am J Pathol. 2004;164(1):35-42.

242

280. Masuda N, Ohnishi T, Kawamoto S, Monden M, Okubo K. Analysis of chemical modification of RNA from formalin-fixed samples and optimization of molecular biology applications for such samples. Nucleic Acids Res. 1999;27(22):4436-43. 281. Li J, Smyth P, Cahill S, Denning K, Flavin R, Aherne S, et al. Improved RNA quality and TaqMan Pre- amplification method (PreAmp) to enhance expression analysis from formalin fixed paraffin embedded (FFPE) materials. BMC Biotechnol. 2008;8:10. 282. Okello J, Zurek J, Devault A, Kuch M, Okwi A, Sewankambo N, et al. Comparison of methods in the recovery of nucleic acids from archival formalin-fixed paraffin-embedded autopsy tissues. Anal Biochem. 2010;400(1):110-7. 283. Bustin S, Nolan T. Pitfalls of quantitative real-time reverse-transcription polymerase chain reaction. J Biomol Tech. 2004;15(3):155-66. 284. Rabiau N, Dantal Y, Guy L, Ngollo M, Dagdemir A, Kemeny J, et al. Gene panel model predictive of outcome in patients with prostate cancer. OMICS. 2013;17(8):407-13. 285. Antonov J, Goldstein D, Oberli A, Baltzer A, Pirotta M, Fleischmann A, et al. Reliable gene expression measurements from degraded RNA by quantitative real-time PCR depend on short amplicons and a proper normalization. Lab Invest. 2005;85(8):1040-50. 286. Narayanan A, Keedwell E. Perceptrons for analysing prostate cancer gene expression data. Unpublished. 2006. 287. Hastie C, Saxton M, Akpan A, Cramer R, Masters J, Naaby-Hansen S. Combined affinity labelling and mass spectrometry analysis of differential cell surface protein expression in normal and prostate cancer cells. Oncogene. 2005;24(38):5905-13. 288. Hastie C, Masters J, Moss S, Naaby-Hansen S. Interferon-gamma reduces cell surface expression of annexin 2 and suppresses the invasive capacity of prostate cancer cells. J Biol Chem. 2008;283(18):12595-603. 289. Ohl F, Jung M, Xu C, Stephan C, Rabien A, Burkhardt M, et al. Gene expression studies in prostate cancer tissue: which reference gene should be selected for normalization? J Mol Med (Berl). 2005;83(12):1014-24. 290. Lapointe J, Li C, Higgins JP, van de Rijn M, Bair E, Montgomery K, et al. Gene expression profiling identifies clinically relevant subtypes of prostate cancer. Proc Natl Acad Sci U S A. 2004;101(3):811-6. 291. Genitsch V, Zlobec I, Thalmann G, Fleischmann A. MUC1 is upregulated in advanced prostate cancer and is an independent prognostic factor. Prostate Cancer Prostatic Dis. 2016;19(3):242-7. 292. Nath S, Mukherjee P. MUC1: a multifaceted oncoprotein with a key role in cancer progression. Trends Mol Med. 2014;20(6):332-42. 293. Roy L, Sahraei M, Subramani D, Besmer D, Nath S, Tinder T, et al. MUC1 enhances invasiveness of pancreatic cancer cells by inducing epithelial to mesenchymal transition. Oncogene. 2011;30(12):1449- 59. 294. Wesseling J, van der Valk S, Vos H, Sonnenberg A, Hilkens J. Episialin (MUC1) overexpression inhibits integrin-mediated cell adhesion to extracellular matrix components. J Cell Biol. 1995;129(1):255- 65. 295. Larkin S. Molecular characterisation of prostate cancer progression markers: University of Portsmouth; 2010. 296. Mehta N, Srinivasa S. Role of hemoglobin/heme scavenger protein hemopexin in atherosclerosis and inflammatory diseases. Curr Opin Lipidol. 2015;26(5):384-7. 297. Sharma S, Ray S, Moiyadi A, Sridhar E, Srivastava S. Quantitative proteomic analysis of meningiomas for the identification of surrogate protein markers. Sci Rep. 2014;4:7140. 298. Bergamini S, Bellei E, Reggiani Bonetti L, Monari E, Cuoghi A, Borelli F, et al. Inflammation: an important parameter in the search of prostate cancer biomarkers. Proteome Sci. 2014;12:32. 299. Pfaffl M. A new mathematical model for relative quantification in real-time RT-PCR. Nucleic Acids Res. 2001;29(9):e45. 300. Demidenko R, Razanauskas D, Daniunaite K, Lazutka J, Jankevicius F, Jarmalaite S. Frequent down-regulation of ABC transporter genes in prostate cancer. BMC Cancer. 2015;15:683. 301. Pourhoseingholi M, Baghestani A, Vahedi M. How to control confounding effects by statistical analysis. Gastroenterol Hepatol Bed Bench. 2012;5(2):79-83. 302. Peduzzi P, Concato J, Kemper E, Holford T, Feinstein A. A simulation study of the number of events per variable in logistic regression analysis. J Clin Epidemiol. 1996;49(12):1373-9. 303. Vittinghoff E, McCulloch C. Relaxing the rule of ten events per variable in logistic and Cox regression. Am J Epidemiol. 2007;165(6):710-8.

243

304. Roach M, Hanks G, Thames Jr H, Schellhammer P, Shipley W, Sokol G, et al. Defining biochemical failure following radiotherapy with or without hormonal therapy in men with clinically localized prostate cancer: recommendations of the RTOG-ASTRO Phoenix Consensus Conference. Int J Radiat Oncol Biol Phys. 2006;65(4):965–74. 305. Mestdagh P, Feys T, Bernard N, Guenther S, Chen C, Speleman F, et al. High-throughput stem- loop RT-qPCR miRNA expression profiling using minute amounts of input RNA. Nucleic Acids Res. 2008;36(21):e143. 306. Midi H, Sarkar S, Rana S. Collinearity diagnostics of binary logistic regression model. Journal of Interdisciplinary Mathematics. 2010;13(3):253-67. 307. Bewick V, Cheek L, Ball J. Statistics review 14: Logistic regression. Crit Care. 2005;9(1):112-8. 308. Youden W. Index for rating diagnostic tests. Cancer. 1950;3(1):32-5. 309. Kalmar A, Wichmann B, Galamb O, Spisak S, Toth K, Leiszter K, et al. Gene expression analysis of normal and colorectal cancer tissue samples from fresh frozen and matched formalin-fixed, paraffin- embedded (FFPE) specimens after manual and automated RNA isolation. Methods. 2013;59(1):S16-9. 310. Stewart G, Baird J, Rae F, Nanda J, Riddick A, Harrison D. Utilizing mRNA extracted from small, archival formalin-fixed paraffin-embedded prostate samples for translational research: assessment of the effect of increasing sample age and storage temperature. Int Urol Nephrol. 2011;43(4):961-7. 311. Nonn L, Vaishnav A, Gallagher L, Gann P. mRNA and micro-RNA expression analysis in laser- capture microdissected prostate biopsies: valuable tool for risk assessment and prevention trials. Exp Mol Pathol. 2010;88(1):45-51. 312. Manchester K. Use of UV methods for measurement of protein and nucleic acid concentrations. Biotechniques. 1996;20(6):968-70. 313. Noutsias M, Rohde M, Block A, Klippert K, Lettau O, Blunert K, et al. Preamplification techniques for real-time RT-PCR analyses of endomyocardial biopsies. BMC Mol Biol. 2008;9:3. 314. Mengual L, Burset M, Marín-Aguilera M, Ribal M, Alcaraz A. Multiplex preamplification of specific cDNA targets prior to gene expression analysis by TaqMan Arrays. BMC Res Notes. 2008;1:21. 315. Ciotti P, Garuti A, Ballestrero A, Cirmena G, Chiaramondia M, Baccini P, et al. Reliability and reproducibility of a RNA preamplification method for low-density array analysis from formalin-fixed paraffin-embedded breast cancer samples. Diagn Mol Pathol. 2009;18(2):112-8. 316. Jang W, Kim L, Yoon C, Rha K, Choi Y, Hong S, et al. Effect of preoperative risk group stratification on oncologic outcomes of patients with adverse pathologic findings at radical prostatectomy. PLoS One. 2016;11(10):e0164497. 317. Fischer S, Lin D, Simon R, Howard L, Aronson W, Terris M, et al. Do all men with pathological Gleason score 8-10 prostate cancer have poor outcomes? Results from the SEARCH database. BJU Int. 2016;118(2):250-7. 318. Reichard C, Klein E. Clinical and molecular rationale to retain the cancer descriptor for Gleason score 6 disease. Nat Rev Urol. 2017;14(1):59-64. 319. Kane C, Eggener S, Shindel A, Andriole G. Variability in outcomes for patients with intermediate- risk prostate cancer (Gleason score 7, International Society of Urological Pathology Gleason group 2-3) and implications for risk stratification: a systematic review. Eur Urol Focus. 2017;3(4-5):487-97. 320. Mortezavi A, Keller E, Poyet C, Hermanns T, Saba K, Randazzo M. Clinical impact of prostate biopsy undergrading in an academic and community setting. World J Urol. 2016;34(10):1481-90. 321. Alberts A, Bokhorst L, Kweldam C, Schoots I, van der Kwast T, van Leenders G, et al. Biopsy undergrading in men with Gleason score 6 and fatal prostate cancer in the European Randomized study of Screening for Prostate Cancer Rotterdam. Int J Urol. 2017;24(4):281-6. 322. Walz J, Chun F, Klein E, Reuther A, Saad F, Graefen M, et al. Nomogram predicting the probability of early recurrence after radical prostatectomy for prostate cancer. J Urol. 2009;181(2):601-7. 323. Dell'Oglio P, Suardi N, Boorjian S, Fossati N, Gandaglia G, Tian Z, et al. Predicting survival of men with recurrent prostate cancer after radical prostatectomy. Eur J Cancer. 2016;54:27-34. 324. Mickey R, Greenland S. The impact of confounder selection criteria on effect estimation. Am J Epidemiol. 1989;129(1):125-37. 325. Bursac Z, Gauss C, Williams D, Hosmer D. Purposeful selection of variables in logistic regression. Source Code Biol Med. 2008;3:17. 326. Zhang Z. Model building strategy for logistic regression: purposeful selection. Ann Transl Med. 2016;4(6):111. 327. Sullivan G, Feinn R. Using effect size—or why the p value is not enough. J Grad Med Educ. 2012;4(3):279-82.

244

328. Lopez V, Fernandez A, Garcia S, Palade V, Herrera F. An insight into classification with imbalanced data: Empirical results and current trends on using data intrinsic characteristics. Information Sciences. 2013;250:113-41. 329. Guo H, Zhi W, Liu H, Xu M. Imbalanced learning based on logistic discrimination. Comput Intell Neurosci. 2016;2016:5423204. 330. Freeman E, Moisen G. A comparison of the performance of threshold criteria for binary classification in terms of predictive prevalence and kappa. Ecological Modelling. 2008;217(1):48-58. 331. Demidenko E. Sample size determination for logistic regression revisited. Stat Med. 2007;26(18):3385-97. 332. Cheville J, Karnes R, Therneau T, Kosari F, Munz J, Tillmans L, et al. Gene panel model predictive of outcome in men at high-risk of systemic progression and death from prostate cancer after radical retropubic prostatectomy. J Clin Oncol. 2008;26(24):3930-6. 333. Singh D, Febbo P, Ross K, Jackson D, Manola J, Ladd C, et al. Gene expression correlates of clinical prostate cancer behavior. Cancer Cell. 2002;1(2):203-9. 334. Chen Z, Gerke T, Bird V, Prosperi M. Trends in gene expression profiling for prostate cancer risk assessment: a systematic review. Biomed Hub. 2017;2(2):472146. 335. Babyak M. What you see may not be what you get: a brief, nontechnical introduction to overfitting in regression-type models. Psychosom Med. 2004;66(3):411-21. 336. Steyerberg E, Eijkemans M, Harrell FJ, Habbema J. Prognostic modeling with logistic regression analysis: in search of a sensible strategy in small data sets. Med Decis Making. 2001;21(1):45-56. 337. Ferdinandusse S, Denis S, IJlst L, Dacremont G, Waterham H, Wanders R. Subcellular localization and physiological role of alpha-methylacyl-CoA racemase. J Lipid Res. 2000;41(11):1890-6. 338. Singh V, Manu V, Malik A, Dutta V, Mani N, Patrikar S. Diagnostic utility of p63 and α-methyl acyl Co A racemase in resolving suspicious foci in prostatic needle biopsy and transurethral resection of prostate specimens. J Cancer Res Ther. 2014;10(3):686-92. 339. Barry M, Dhillon P, Stampfer M, Perner S, Ma J, Giovannucci E, et al. α-Methylacyl-CoA racemase expression and lethal prostate cancer in the Physicians' Health Study and Health Professionals Follow-up Study. Prostate. 2012;72(3):301-6. 340. Kube D, Savci-Heijink C, Lamblin A, Kosari F, Vasmatzis G, Cheville J, et al. Optimization of laser capture microdissection and RNA amplification for gene expression profiling of prostate cancer. BMC Mol Biol. 2007;8:25. 341. Rubin M, Zhou M, Dhanasekaran S, Varambally S, Barrette T, Sanda M, et al. alpha-Methylacyl coenzyme A racemase as a tissue biomarker for prostate cancer. JAMA. 2002;287(13):1662-70. 342. Nowroozi M, Ayati M, Amini E, Mahdian R, Yousefi B, Arbab A, et al. Is There a Role for Genetic Information in Risk Assessment and Decision Making in Prostate Cancer? Nephrourol Mon. 2016;8(6):e41505. 343. Rubin M, Bismar T, Andrén O, Mucci L, Kim R, Shen R, et al. Decreased alpha-methylacyl CoA racemase expression in localized prostate cancer is associated with an increased rate of biochemical recurrence and cancer-specific death. Cancer Epidemiol Biomarkers Prev. 2005;14(6):1424-32. 344. Box A, Alshalalfa M, Hegazy S, Donnelly B, Bismar T. High alpha-methylacyl-CoA racemase (AMACR) is associated with ERG expression and with adverse clinical outcome in patients with localized prostate cancer. Tumour Biol. 2016;37(9):12287-99. 345. Ross-Adams H, Lamb A, Dunning M, Halim S, Lindberg J, Massie C, et al. Integration of copy number and transcriptomics provides risk stratification in prostate cancer: a discovery and validation cohort study. EBioMedicine. 2015;2(9):1133-44. 346. Wu X, Daniels G, Lee P, Monaco M. Lipid metabolism in prostate cancer. Am J Clin Exp Urol. 2014;2(2):111-20. 347. Rossi S, Graner E, Febbo P, Weinstein L, Bhattacharya N, Onody T, et al. Fatty acid synthase expression defines distinct molecular signatures in prostate cancer. Mol Cancer Res. 2003;1(10):707-15. 348. Van de Sande T, Roskams T, Lerut E, Joniau S, Van Poppel H, Verhoeven G, et al. High-level expression of fatty acid synthase in human prostate cancer tissues is linked to activation and nuclear localization of Akt/PKB. J Pathol. 2005;206(2):214-9. 349. Sculley D, Dawson P, Emmerson B, Gordon R. A review of the molecular basis of hypoxanthine- guanine phosphoribosyltransferase (HPRT) deficiency. Hum Genet. 1992;90(3):195-207. 350. Lane A, Fan T. Regulation of mammalian nucleotide metabolism and biosynthesis. Nucleic Acids Res. 2015;43(4):2466-85.

245

351. Tong X, Zhao F, Thompson C. The molecular determinants of de novo nucleotide biosynthesis in cancer cells. Curr Opin Genet Dev. 2009;19(1):32-7. 352. Mannava S, Grachtchouk V, Wheeler L, Im M, Zhuang D, Slavina E, et al. Direct role of nucleotide metabolism in C-MYC-dependent proliferation of melanoma cells. Cell Cycle. 2008;7(15):2392-400. 353. Kelly R, Sinnott J, Rider J, Ebot E, Gerke T, Bowden M, et al. The role of tumor metabolism as a driver of prostate cancer progression and lethal disease: results from a nested case-control study. Cancer Metab. 2016;4:22. 354. Nakagawa T, Kollmeyer T, Morlan B, Anderson S, Bergstralh E, Davis B, et al. A tissue biomarker panel predicting systemic progression after PSA recurrence post-definitive prostate cancer therapy. PLoS One. 2008;3(5):e2318. 355. Rutter J, Winge D, Schiffman J. Succinate dehydrogenase - assembly, regulation and role in human disease. Mitochondrion. 2010;10(4):393-401. 356. Anderson N, Mucka P, Kern J, Feng H. The emerging role and targetability of the TCA cycle in cancer metabolism. Protein Cell. 2018;9(2):216-37. 357. Pillai S, Gopalan V, Lo C, Liew V, Smith R, Lam A. Silent genetic alterations identified by targeted next-generation sequencing in pheochromocytoma/paraganglioma: a clinicopathological correlations. Exp Mol Pathol. 2017;102(1):41-6. 358. Miettinen M, Killian J, Wang Z, Lasota J, Lau C, Jones L, et al. Immunohistochemical loss of succinate dehydrogenase subunit A (SDHA) in gastrointestinal stromal tumors (GISTs) signals SDHA germline mutation. Am J Surg Pathol. 2013;37(2):234-40. 359. Costello L, Franklin R. Novel role of zinc in the regulation of prostate citrate metabolism and its implications in prostate cancer. Prostate. 1998;35(4):285-96. 360. Kurhanewicz J, Vigneron D, Nelson S, Hricak H, MacDonald J, Konety B, et al. Citrate as an in vivo marker to discriminate prostate cancer from benign prostatic hyperplasia and normal prostate peripheral zone: detection via localized proton spectroscopy. Urology. 1995;45(3):459-66. 361. Shao Y, Ye G, Ren S, Piao H, Zhao X, Lu X, et al. Metabolomics and transcriptomics profiles reveal the dysregulation of the tricarboxylic acid cycle and related mechanisms in prostate cancer. Int J Cancer. 2018;143(2):396-407. 362. Latonen L, Afyounian E, Jylha A, Nattinen J, Aapola U, Annala M, et al. Integrative proteomics in prostate cancer uncovers robustness against genomic and transcriptomic aberrations during disease progression. Nat Commun. 2018;9(1):1176. 363. Dueregger A, Schopf B, Eder T, Hofer J, Gnaiger E, Aufinger A, et al. Differential utilization of dietary fatty acids in benign and malignant cells of the prostate. PLoS One. 2015;10(8):e0135704. 364. Weber A, Klocker H, Oberacher H, Gnaiger E, Neuwirt H, Sampson N, et al. Succinate accumulation is associated with a shift of mitochondrial respiratory control and HIF-1α upregulation in PTEN negative prostate cancer cells. Int J Mol Sci. 2018;19(7). 365. Gupta A, Lotan Y, Ashfaq R, Roehrborn C, Raj G, Aragaki C, et al. Predictive value of the differential expression of the urokinase plasminogen activation axis in radical prostatectomy patients. Eur Urol. 2009;55(5):1124-33. 366. Shariat S, Karam J, Walz J, Roehrborn C, Montorsi F, Margulis V, et al. Improved prediction of disease relapse after radical prostatectomy through a panel of preoperative blood-based biomarkers. Clin Cancer Res. 2008;14(12):3785-91. 367. Wach S, Al-Janabi O, Weigelt K, Fischer K, Greither T, Marcou M, et al. The combined serum levels of miR-375 and urokinase plasminogen activator receptor are suggested as diagnostic and prognostic biomarkers in prostate cancer. Int J Cancer. 2015;137(6):1406-16. 368. Shah R, Bentley J, Jeffery Z, DeMarzo A. Heterogeneity of PTEN and ERG expression in prostate cancer on core needle biopsies: implications for cancer risk stratification and biomarker sampling. Hum Pathol. 2015;46(5):698-706. 369. Troyer D, Jamaspishvili T, Wei W, Feng Z, Good J, Hawley S, et al. A multicenter study shows PTEN deletion is strongly associated with seminal vesicle involvement and extracapsular extension in localized prostate cancer. Prostate. 2015;75(11):1206-15. 370. Geybels M, Fang M, Wright J, Qu X, Bibikova M, Klotzle B, et al. PTEN loss is associated with prostate cancer recurrence and alterations in tumor DNA methylation profiles. Oncotarget. 2017;8(48):84338-48. 371. Dillon L, Miller T. Therapeutic targeting of cancers with loss of PTEN function. Curr Drug Targets. 2014;15(1):65-79.

246

372. Lotan T, Heumann A, Rico S, Hicks J, Lecksell K, Koop C, et al. PTEN loss detection in prostate cancer: comparison of PTEN immunohistochemistry and PTEN FISH in a large retrospective prostatectomy cohort. Oncotarget. 2017;8(39):65566-76. 373. Ahearn T, Pettersson A, Ebot E, Gerke T, Graff R, Morais C, et al. A prospective investigation of PTEN loss and ERG expression in lethal prostate cancer. J Natl Cancer Inst. 2015;108(2). 374. Bismar T, Hegazy S, Feng Z, Yu D, Donnelly B, Palanisamy N, et al. Clinical utility of assessing PTEN and ERG protein expression in prostate cancer patients: a proposed method for risk stratification. J Cancer Res Clin Oncol. 2018;144(11):2117-25. 375. Gumuskaya B, Gurel B, Fedor H, Tan H, Weier C, Hicks J, et al. Assessing the order of critical alterations in prostate cancer development and progression by IHC: further evidence that PTEN loss occurs subsequent to ERG gene fusion. Prostate Cancer Prostatic Dis. 2013;16(2):209-15. 376. Krohn A, Freudenthaler F, Harasimowicz S, Kluth M, Fuchs S, Burkhardt L, et al. Heterogeneity and chronology of PTEN deletion and ERG fusion in prostate cancer. Mod Pathol. 2014;27(12):1612-20. 377. Ganguly S, Plattner R. Activation of abl family kinases in solid tumors. Genes Cancer. 2012;3(5- 6):414-25. 378. Kelliher M, McLaughlin J, Witte O, Rosenberg N. Induction of a chronic myelogenous leukemia- like syndrome in mice with v-abl and BCR/ABL. Proc Natl Acad Sci USA. 1990;87(17):6649-53. 379. Singer C, Hudelist G, Lamm W, Mueller R, Czerwenka K, Kubista E. Expression of tyrosine kinases in human malignancies as potential targets for kinase-specific inhibitors. Endocr Relat Cancer. 2004;11(4):861-9. 380. Ganguly S, Fiore L, Sims J, Friend J, Srinivasan D, Thacker M, et al. c-Abl and Arg are activated in human primary melanomas, promote melanoma cell invasion via distinct pathways, and drive metastatic progression. Oncogene. 2012;31(14):1804-16. 381. Balan V, Nangia-Makker P, Kho D, Wang Y, Raz A. Tyrosine-phosphorylated galectin-3 protein is resistant to prostate-specific antigen (PSA) cleavage. J Biol Chem. 2012;287(8):5192-8. 382. Arora S, Saini S, Fukuhara S, Majid S, Shahryari V, Yamamura S, et al. MicroRNA-4723 inhibits prostate cancer growth through inactivation of the Abelson family of nonreceptor protein tyrosine kinases. PLoS One. 2013;8(11):e78023. 383. Lee Y, Huang C, Murshed M, Chu K, Araujo J, Ye X, et al. Src family kinase/abl inhibitor dasatinib suppresses proliferation and enhances differentiation of osteoblasts. Oncogene. 2010;29(22):3196-207. 384. Yu E, Massard C, Gross M, Carducci M, Culine S, Hudes G, et al. Once-daily dasatinib: expansion of phase II study evaluating safety and efficacy of dasatinib in patients with metastatic castration-resistant prostate cancer. Urology. 2011;77(5):1166-71. 385. Gupta S, Li J, Kemeny G, Bitting R, Beaver J, Somarelli J, et al. Whole genomic copy number alterations in circulating tumor cells from men with abiraterone or enzalutamide-resistant metastatic castration-resistant prostate cancer. Clin Cancer Res. 2017;23(5):1346-57. 386. Wong S, Fang C, Chuah L, Leong C, Ngai S. E-cadherin: its dysregulation in carcinogenesis and clinical implications. Crit Rev Oncol Hematol. 2018;121:11-22. 387. Carrizo M. Loss of E-cadherin expression and epithelial mesenchymal transition (EMT) as key steps in tumor progression. J Cancer Prev Curr Res. 2017;7(3):00235. 388. Jaggi M, Johansson S, Baker J, Smith L, Galich A, Balaji K. Aberrant expression of E-cadherin and beta-catenin in human prostate cancer. Urol Oncol. 2005;23(6):402-6. 389. Whiteland H, Spencer-Harty S, Thomas D, Davies C, Morgan C, Kynaston H, et al. Putative prognostic epithelial-to-mesenchymal transition biomarkers for aggressive prostate cancer. Exp Mol Pathol. 2013;95(2):220-6. 390. Rubin M, Mucci N, Figurski J, Fecko A, Pienta K, Day M. E-cadherin expression in prostate cancer: a broad survey using high-density tissue microarray technology. Hum Pathol. 2001;32(7):690-7. 391. De Marzo A, Knudsen B, Chan-Tack K, Epstein J. E-cadherin expression as a marker of tumor aggressiveness in routinely processed radical prostatectomy specimens. Urology. 1999;53(4):707-13. 392. Halleck M, Lawler JJ, Blackshaw S, Gao L, Nagarajan P, Hacker C, et al. Differential expression of putative transbilayer amphipath transporters. Physiol Genomics. 1999;1(3):139-50. 393. Vallabhapurapu S, Blanco V, Sulaiman K, Vallabhapurapu S, Chu Z, Franco R, et al. Variation in human cancer cell external phosphatidylserine is regulated by flippase activity and intracellular calcium. Oncotarget. 2015;6(33):34375–88. 394. Merino A, Vazquez J, Rodriguez J, Fernandez R, Quintela I, Gonzalez L, et al. Pepsinogen C expression in tumors of extragastric origin. Int J Biol Marker. 2000;15(2):165-70.

247

395. Antunes A, Reis S, Leite K, Real D, Sousa-Canavez J, Camara-Lopes L, et al. PGC and PSMA in prostate cancer diagnosis: tissue analysis from biopsy samples. Int Braz J Urol. 2013;39(5):649-56. 396. Fukushima N, Sato N, Prasad N, Leach S, Hruban R, Goggins M, et al. Characterization of gene expression in mucinous cystic neoplasms of the pancreas using oligonucleotide microarrays. Oncogene. 2004;23(56):9042-51. 397. Diaz M, Rodriguez J, Sanchez J, Sanchez M, Martin A, Merino A, et al. Clinical significance of pepsinogen C tumor expression in patients with stage D2 prostate carcinoma. Int J Biol Markers. 2002;17(2):125-9. 398. Truan N, Vizoso F, Fresno M, Fernandez R, Quintela I, Alexandre E, et al. Expression and clinical significance of pepsinogen C in resectable pancreatic cancer. Int J Biol Markers. 2001;16(1):31-6. 399. Fernandez R, Vizoso F, Rodriguez J, Merino A, Gonzalez L, Quintela I, et al. Expression and prognostic significance of pepsinogen C in gastric carcinoma. Ann Surg Oncol. 2000;7(7):508-14. 400. Lokman N, Ween M, Oehler M, Ricciardelli C. The role of annexin A2 in tumorigenesis and cancer progression. Cancer Microenviron. 2011;4(2):199-208. 401. Liu W, Hajjar K. The annexin A2 system and angiogenesis. Biol Chem. 2016;397(10):1005-16. 402. Wang C, Lin C. Annexin A2: its molecular regulation and cellular expression in cancer development. Dis Markers. 2014;2014:308976. 403. Diaz V, Hurtado M, Thomson T, Reventos J, Paciucci R. Specific interaction of tissue-type plasminogen activator (t-PA) with annexin II on the membrane of pancreatic cancer cells activates plasminogen and promotes invasion in vitro. Gut. 2004;53(7):993-1000. 404. Ceruti P, Principe M, Capello M, Cappello P, Novelli F. Three are better than one: plasminogen receptors as cancer theranostic targets. Exp Hematol Oncol. 2013;2(1):12. 405. Griner N, Young D, Chaudhary P, Mohamed A, Huang W, Chen Y, et al. ERG oncoprotein inhibits ANXA2 expression and function in prostate cancer. Mol Cancer Res. 2015;13(2):368-79. 406. Yang N, Wang L, Liu J, Liu L, Huang J, Chen X, et al. MicroRNA-206 regulates the epithelial- mesenchymal transition and inhibits the invasion and metastasis of prostate cancer cells by targeting Annexin A2. Oncol Lett. 2018;15(6):8295-302. 407. Dunne J, Lamb D, Delahunt B, Murray J, Bethwaite P, Ferguson P, et al. Proteins from formalin- fixed paraffin-embedded prostate cancer sections that predict the risk of metastatic disease. Clin Proteomics. 2015;12(1):24. 408. Inokuchi J, Narula N, Yee D, Skarecky D, Lau A, Ornstein D, et al. Annexin A2 positively contributes to the malignant phenotype and secretion of IL-6 in DU145 prostate cancer cells. Int J Cancer. 2009;124(1):68-74. 409. Yang J, Li C, Mudd A, Gu X. LncRNA PVT1 predicts prognosis and regulates tumor growth in prostate cancer. Biosci Biotechnol Biochem. 2017;81(12):2301-6. 410. Ghoussaini M, Song H, Koessler T, Al Olama A, Kote-Jarai Z, Driver K, et al. Multiple loci with different cancer specificities within the 8q24 gene desert. J Natl Cancer Inst. 2008;100(13):962-6. 411. Chang Z, Cui J, Song Y. Long noncoding RNA PVT1 promotes EMT via mediating microRNA-186 targeting of Twist1 in prostate cancer. Gene. 2018;654:36-42. 412. van Eenennaam H, Lugtenberg D, Vogelzangs J, van Venrooij W, Pruijn G. hPop5, a protein subunit of the human RNase MRP and RNase P endoribonucleases. J Biol Chem. 2001;276(34):31635-41. 413. Krishnaswamy V, Korrapati P. Role of dermatopontin in re-epithelialization: implications on keratinocyte migration and proliferation. Sci Rep. 2014;4:7385. 414. Fu Y, Feng M, Yu J, Ma M, Liu X, Li J, et al. DNA methylation-mediated silencing of matricellular protein dermatopontin promotes hepatocellular carcinoma metastasis by α3β1 integrin-Rho GTPase signaling. Oncotarget. 2014;5(16):6701-15. 415. Yamatoji M, Kasamatsu A, Kouzu Y, Koike H, Sakamoto Y, Ogawara K, et al. Dermatopontin: a potential predictor for metastasis of human oral cancer. Int J Cancer. 2012;130(12):2903-11. 416. Krishnaswamy V, Balaguru U, Chatterjee S, Korrapati P. Dermatopontin augments angiogenesis and modulates the expression of transforming growth factor beta 1 and integrin alpha 3 beta 1 in endothelial cells. Eur J Cell Biol. 2017;96(3):266-75. 417. Li C, Hancock M, Sehgal P, Zhou S, Reinhardt D, Philip A. Soluble CD109 binds TGF-β and antagonizes TGF-β signalling and responses. Biochem J. 2016;473(5):537-47. 418. Shukla S, Evans J, Malik R, Feng F, Dhanasekaran S, Cao X, et al. Development of a RNA-Seq based prognostic signature in lung adenocarcinoma. J Natl Cancer Inst. 2016;109(1). 419. Tao J, Li H, Li Q, Yang Y. CD109 is a potential target for triple-negative breast cancer. Tumour Biol. 2014;35(12):12083-90.

248

420. Sanada T, Takaesu G, Mashima R, Yoshida R, Kobayashi T, Yoshimura A. FLN29 deficiency reveals its negative regulatory role in the Toll-like receptor (TLR) and retinoic acid-inducible gene I (RIG-I)-like helicase signaling pathway. J Biol Chem. 2008;283(49):33858-64. 421. Mashima R, Saeki K, Aki D, Minoda Y, Takaki H, Sanada T, et al. FLN29, a novel interferon- and LPS-inducible gene acting as a negative regulator of toll-like receptor signaling. J Biol Chem. 2005;280(50):41289-97. 422. Liu X, Wang Z, Zhang G, Zhu Q, Zeng H, Wang T, et al. High TRAF6 expression is associated with esophageal carcinoma recurrence and prompts cancer cell invasion. Oncol Res. 2017;25(4):485-93. 423. Wu H, Hao A, Cui H, Wu W, Yang H, Hu B, et al. TRAF6 expression is associated with poorer prognosis and high recurrence in urothelial bladder cancer. Oncol Lett. 2017;14(2):2432-8. 424. Singh R, Karri D, Shen H, Shao J, Dasgupta S, Huang S, et al. TRAF4-mediated ubiquitination of NGF receptor TrkA regulates prostate cancer metastasis. J Clin Invest. 2018;128(7):3129-43. 425. Wei B, Liang J, Hu J, Mi Y, Ruan J, Zhang J, et al. TRAF2 is a valuable prognostic biomarker in patients with prostate cancer. Med Sci Monit. 2017;23:4192-204. 426. Rodriguez-Berriguete G, Sanchez-Espiridion B, Cansino J, Olmedilla G, Martinez-Onsurbe P, Sanchez-Chapado M, et al. Clinical significance of both tumor and stromal expression of components of the IL-1 and TNF-α signaling pathways in prostate cancer. Cytokine. 2013;64(2):555-63. 427. Cai C, Chen S, Zheng Z, Omwancha J, Lin M, Balk S, et al. Androgen regulation of soluble guanylyl cyclasealpha1 mediates prostate cancer cell proliferation. Oncogene. 2007;26(11):1606-15. 428. Cai C, Hsieh C, Gao S, Kannan A, Bhansali M, Govardhan K, et al. Soluble guanylyl cyclase α1 and p53 cytoplasmic sequestration and down-regulation in prostate cancer. Mol Endocrinol. 2012;26(2):292- 307. 429. Lisabeth E, Falivelli G, Pasquale E. Eph receptor signaling and ephrins. Cold Spring Harb Perspect Biol. 2013;5(9). 430. Taylor H, Campbell J, Nobes C. Ephs and ephrins. Curr Biol. 2017;27(3):R90-R5. 431. Xi H, Wu X, Wei B, Chen L. Aberrant expression of EphA3 in gastric carcinoma: correlation with tumor angiogenesis and survival. J Gastroenterol. 2012;47(7):785-94. 432. Cui X, Lee M, Yu G, Kim I, Yu H, Song E, et al. EFNA1 ligand and its receptor EphA2: potential biomarkers for hepatocellular carcinoma. Int J Cancer. 2010;126(4):940-9. 433. Fu D, Wang Z, Wang B, Chen L, Yang W, Shen Z, et al. Frequent epigenetic inactivation of the receptor tyrosine kinase EphA5 by promoter methylation in human breast cancer. Hum Pathol. 2010;41(1):48-58. 434. Li X, Wang L, Gu J, Li B, Liu W, Wang Y, et al. Up-regulation of EphA2 and down-regulation of EphrinA1 are associated with the aggressive phenotype and poor prognosis of malignant glioma. Tumour Biol. 2010;31(5):477-88. 435. Drescher U, Kremoser C, Handwerker C, Loschinger J, Noda M, Bonhoeffer F. In vitro guidance of retinal ganglion cell axons by RAGS, a 25 kDa tectal protein related to ligands for Eph receptor tyrosine kinases. Cell. 1995;82(3):359-70. 436. Wang T, Chang J, Ho J, Wu H, Chen T. EphrinA5 suppresses colon cancer development by negatively regulating epidermal growth factor receptor stability. FEBS J. 2012;279(2):251-63. 437. Li J, Liu D, Liu G, Xie D. EphrinA5 acts as a tumor suppressor in glioma by negative regulation of epidermal growth factor receptor. Oncogene. 2009;28(15):1759-68. 438. Kalinski T, Ropke A, Sel S, Kouznetsova I, Ropke M, Roessner A. Down-regulation of ephrin-A5, a gene product of normal cartilage, in chondrosarcoma. Hum Pathol. 2009;40(12):1679-85. 439. Romanuik T, Ueda T, Le N, Haile S, Yong T, Thomson T, et al. Novel biomarkers for prostate cancer including noncoding transcripts. Am J Pathol. 2009;175(6):2264-76. 440. Rosenberg E, Gerashchenko G, Hryshchenko N, Mevs L, Nekrasov K, Lytvynenko R, et al. Expression of cancer-associated genes in prostate tumors. Exp Oncol. 2017;39(2):131-7. 441. Li S, Zhu Y, Ma C, Qiu Z, Zhang X, Kang Z, et al. Downregulation of EphA5 by promoter methylation in human prostate cancer. BMC Cancer. 2015;15:18. 442. Wu R, Wang H, Wang J, Wang P, Huang F, Xie B, et al. EphA3, induced by PC-1/PrLZ, contributes to the malignant progression of prostate cancer. Oncol Rep. 2014;32(6):2657-65. 443. Chen P, Huang Y, Zhang B, Wang Q, Bai P. EphA2 enhances the proliferation and invasion ability of LNCaP prostate cancer cells. Oncol Lett. 2014;8(1):41-6. 444. Guan M, Xu C, Zhang F, Ye C. Aberrant methylation of EphA7 in human prostate cancer and its relation to clinicopathologic features. Int J Cancer. 2009;124(1):88-94.

249

445. Davy A, Robbins S. Ephrin-A5 modulates cell adhesion and morphology in an integrin-dependent manner. EMBO J. 2000;19(20):5396-405. 446. Vogel R, Janssen R, Ugalde C, Grovenstein M, Huijbens R, Visch H, et al. Human mitochondrial complex I assembly is mediated by NDUFAF1. FEBS J. 2005;272(20):5317-26. 447. Jhun M, Geybels M, Wright J, Kolb S, April C, Bibikova M, et al. Gene expression signature of Gleason score is associated with prostate cancer outcomes in a radical prostatectomy cohort. Oncotarget. 2017;8(26):43035-47. 448. Zhong Y, Li X, Ji Y, Li X, Li Y, Yu D, et al. Pyruvate dehydrogenase expression is negatively associated with cell stemness and worse clinical outcome in prostate cancers. Oncotarget. 2017;8(8):13344–56. 449. Wang P, Song M, Zeng Z, Zhu C, Lu W, Yang J, et al. Identification of NDUFAF1 in mediating K-Ras induced mitochondrial dysfunction by a proteomic screening approach. Oncotarget. 2015;6(6):3947-62. 450. Yang J, Seol S, Leem S, Kim Y, Sun Z, Lee J, et al. Genes associated with recurrence of hepatocellular carcinoma: integrated analysis by gene expression and methylation profiling. J Korean Med Sci. 2011;26(11):1428-38. 451. Kim S, Yasuda S, Tanaka H, Yamagata K, Kim H. Non-clustered protocadherin. Cell Adh Migr. 2011;5(2):97-105. 452. Aamar E, Dawid I. Protocadherin-18a has a role in cell adhesion, behavior and migration in zebrafish development. Dev Biol. 2008;318(2):335-46. 453. Iruretagoyena J, Davis W, Bird C, Olsen J, Radue R, Teo Broman A, et al. Differential changes in gene expression in human brain during late first trimester and early second trimester of pregnancy. Prenat Diagn. 2014;34(5):431-7. 454. Hu X, Sui X, Li L, Huang X, Rong R, Su X, et al. Protocadherin 17 acts as a tumour suppressor inducing tumour cell apoptosis and autophagy, and is frequently methylated in gastric and colorectal cancers. J Pathol. 2013;229(1):62-73. 455. Wang K, Lin C, Liu C, Liu D, Huang R, Ding D, et al. Global methylation silencing of clustered proto- cadherin genes in cervical cancer: serving as diagnostic markers comparable to HPV. Cancer Med. 2015;4(1):43-55. 456. Lin Y, Xie P, Wang L, Ma J. Aberrant methylation of protocadherin 17 and its clinical significance in patients with prostate cancer after radical prostatectomy. Med Sci Monit. 2014;20:1376-82. 457. Wang L, Xie P, Lin Y, Ma J, Li W. Aberrant methylation of PCDH10 predicts worse biochemical recurrence-free survival in patients with prostate cancer after radical prostatectomy. Med Sci Monit. 2014;20:1363-8. 458. Lin Y, Li Y, Ma J. Aberrant promoter methylation of protocadherin8 (PCDH8) in serum is a potential prognostic marker for low Gleason score prostate cancer. Med Sci Monit. 2017;23:4895-900. 459. Zhou D, Tang W, Su G, Cai M, An H, Zhang Y. PCDH18 is frequently inactivated by promoter methylation in colorectal cancer. Sci Rep. 2017;7(1):2819. 460. Mishra A, Sriram H, Chandarana P, Tanavde V, Kumar R, Gopinath A, et al. Decreased expression of cell adhesion genes in cancer stem-like cells isolated from primary oral squamous cell carcinomas. Tumour Biol. 2018;40(5):1010428318780859. 461. Sun L, Suo C, Li S, Zhang H, Gao P. Metabolic reprogramming for cancer cells and their microenvironment: beyond the Warburg Effect. Biochim Biophys Acta Rev Cancer. 2018;1870(1):51-66. 462. Bhindi B, Locke J, Alibhai S, Kulkarni G, Margel D, Hamilton R, et al. Dissecting the association between metabolic syndrome and prostate cancer risk: analysis of a large clinical cohort. Eur Urol. 2015;67(1):64-70. 463. De Nunzio C, Simone G, Brassetti A, Mastroianni R, Collura D, Muto G, et al. Metabolic syndrome is associated with advanced prostate cancer in patients treated with radical retropubic prostatectomy: results from a multicentre prospective study. BMC Cancer. 2016;16:407. 464. Staal J, Beyaert R. Inflammation and NF-κB signaling in prostate cancer: mechanisms and clinical implications. Cells. 2018;7(9). 465. Lima A, Bastos Mde L, Carvalho M, Guedes de Pinho P. Biomarker discovery in human prostate cancer: an update in metabolomics studies. Transl Oncol. 2016;9(4):357-70. 466. McDunn J, Li Z, Adam K, Neri B, Wolfert R, Milburn M, et al. Metabolomic signatures of aggressive prostate cancer. Prostate. 2013;73(14):1547-60. 467. Jain D, Gupta S, Marwah N, Kalra R, Gupta V, Gill M, et al. Evaluation of role of alpha-methyl acyl- coenzyme A racemase/P504S and high molecular weight cytokeratin in diagnosing prostatic lesions. J Cancer Res Ther. 2017;13(1):21-5.

250

468. Liu Y. Fatty acid oxidation is a dominant bioenergetic pathway in prostate cancer. Prostate Cancer Prostatic Dis. 2006;9(3):230-4. 469. Galbraith L, Leung H, Ahmad I. Lipid pathway deregulation in advanced prostate cancer. Pharmacol Res. 2018;131:177-84. 470. Souza A, Brum I, Neto B, Berger M, Branchini G. Reference gene for primary culture of prostate cancer cells. Mol Biol Rep. 2013;40(4):2955-62. 471. Jamaspishvili T, Berman D, Ross A, Scher H, De Marzo A, Squire J, et al. Clinical implications of PTEN loss in prostate cancer. Nat Rev Urol. 2018;15(4):222-34. 472. Corbett A. Post-transcriptional regulation of gene expression and human disease. Curr Opin Cell Biol. 2018;52:96-104. 473. Chuh K, Pratt M. Chemical methods for the proteome-wide identification of posttranslationally modified proteins. Curr Opin Chem Biol. 2015;24:27-37. 474. Glinsky GV, Glinskii AB, Stephenson AJ, Hoffman RM, Gerald WL. Gene expression profiling predicts clinical outcome of prostate cancer. J Clin Invest. 2004;113(6):913-23. 475. Bostrom P, Bjartell A, Catto J, Eggener S, Lilja H, Loeb S, et al. Genomic predictors of outcome in prostate cancer. Eur Urol. 2015;68(6):1033-44. 476. Panigrahi G, Ramteke A, Birks D, Abouzeid Ali H, Venkataraman S, Agarwal C, et al. Exosomal microRNA profiling to identify hypoxia-related biomarkers in prostate cancer. Oncotarget. 2018;9(17):13894-910. 477. Daures M, Idrissou M, Judes G, Rifai K, Penault-Llorca F, Bignon Y, et al. A new metabolic gene signature in prostate cancer regulated by JMJD3 and EZH2. Oncotarget. 2018;9(34):23413-25. 478. Narayanan A, Keedwell E, Gamalielsson J, Tatineni S. Single-layer artificial neural networks for gene expression analysis. Neurocomputing. 2004;61:217-40. 479. Altman D, Vergouwe Y, Royston P, Moons K. Prognosis and prognostic research: validating a prognostic model. BMJ. 2009;338:b605. 480. Steyerberg E, Vickers A, Cook N, Gerds T, Gonen M, Obuchowski N, et al. Assessing the performance of prediction models: a framework for traditional and novel measures. Epidemiology. 2010;2(1):128-38. 481. Riley R, Ensor J, Snell K, Debray T, Altman D, Moons K, et al. External validation of clinical prediction models using big datasets from e-health records or IPD meta-analysis: opportunities and challenges. BMJ. 2016;353:i3140. 482. Gerlinger M, Catto J, Orntoft T, Real F, Zwarthoff E, Swanton C. Intratumour heterogeneity in urologic cancers: from molecular evidence to clinical implications. Eur Urol. 2015;67(4):729-37. 483. Boyd L, Mao X, Xue L, Lin D, Chaplin T, Kudahetti S, et al. High-resolution genome-wide copy- number analysis suggests a monoclonal origin of multifocal prostate cancer. Genes Chromosomes Cancer. 2012;51(6):579-89. 484. Shipitsin M, Small C, Choudhury S, Giladi E, Friedlander S, Nardone J, et al. Identification of proteomic biomarkers predicting prostate cancer aggressiveness and lethality despite biopsy-sampling error. Br J Cancer. 2014;111(6):1201-12. 485. Tollefson M, Karnes R, Kwon E, Lohse C, Rangel L, Mynderse L, et al. Prostate cancer Ki-67 (MIB- 1) expression, perineural invasion, and Gleason score as biopsy-based predictors of prostate cancer mortality: the Mayo model. Mayo Clin Proc. 2014;89(3):308-18. 486. Tretiakova M, Wei W, Boyer H, Newcomb L, Hawley S, Auman H, et al. Prognostic value of Ki67 in localized prostate carcinoma: a multi-institutional study of >1000 prostatectomies. Prostate Cancer Prostatic Dis. 2016;19(3):264-70. 487. Zellweger T, Gunther S, Zlobec I, Savic S, Sauter G, Moch H, et al. Tumour growth fraction measured by immunohistochemical staining of Ki67 is an independent prognostic factor in preoperative prostate biopsies with small-volume or low-grade prostate cancer. Int J Cancer. 2009;124(9):2116-23. 488. Verhoven B, Yan Y, Ritter M, Khor L, Hammond E, Jones C, et al. Ki-67 is an independent predictor of metastasis and cause-specific mortality for prostate cancer patients treated on Radiation Therapy Oncology Group (RTOG) 94-08. Int J Radiat Oncol Biol Phys. 2013;86(2):317-23. 489. Carozzi F, Tamburrino L, Bisanzi S, Marchiani S, Paglierani M, Di Lollo S, et al. Are biomarkers evaluated in biopsy specimens predictive of prostate cancer aggressiveness? J Cancer Res Clin Oncol. 2016;142(1):201-12. 490. Rubio J, Ramos D, López-Guerrero J, Iborra I, Collado A, Solsona E, et al. Immunohistochemical expression of Ki-67 antigen, cox-2 and Bax/Bcl-2 in prostate cancer; prognostic value in biopsies and radical prostatectomy specimens. Eur Urol. 2005;48(5):745-51.

251

491. Tomlins S, Mehra R, Rhodes D, Cao X, Wang L, Dhanasekaran S, et al. Integrative molecular concept modeling of prostate cancer progression. Nat Genet. 2007;39(1):41-51. 492. Varambally S, Dhanasekaran S, Zhou M, Barrette T, Kumar-Sinha C, Sanda M, et al. The polycomb group protein EZH2 is involved in progression of prostate cancer. Nature. 2002;419(6907):624-9. 493. Melling N, Thomsen E, Tsourlakis M, Kluth M, Hube-Magg C, Minner S, et al. Overexpression of enhancer of zeste homolog 2 (EZH2) characterizes an aggressive subset of prostate cancers and predicts patient prognosis independently from pre- and postoperatively assessed clinicopathological parameters. Carcinogenesis. 2015;36(11):1333-40. 494. Huber F, Montani M, Sulser T, Jaggi R, Wild P, Moch H, et al. Comprehensive validation of published immunohistochemical prognostic biomarkers of prostate cancer -what has gone wrong? A blueprint for the way forward in biomarker studies. Br J Cancer. 2015;112(1):140-8. 495. van Leenders G, Dukers D, Hessels D, van den Kieboom S, Hulsbergen C, Witjes J, et al. Polycomb- group oncogenes EZH2, BMI1, and RING1 are overexpressed in prostate cancer with adverse pathologic and clinical features. Eur Urol. 2007;52(2):455-63. 496. Berezovska O, Glinskii A, Yang Z, Li X, Hoffman R, Glinsky G. Essential role for activation of the Polycomb group (PcG) protein chromatin silencing pathway in metastatic prostate cancer. Cell Cycle. 2006;5(16):1886-901. 497. Hoogland A, Verhoef E, Roobol M, Schröder F, Wildhagen M, van der Kwast T, et al. Validation of stem cell markers in clinical prostate cancer: α6-Integrin is predictive for non-aggressive disease. Prostate. 2014;74(5):488-96. 498. Reid A, Attard G, Ambroisine L, Fisher G, Kovacs G, Brewer D, et al. Molecular characterisation of ERG, ETV1 and PTEN gene loci identifies patients at low and high risk of death from prostate cancer. Br J Cancer. 2010;102(4):678-8. 499. Chaux A, Peskoe S, Gonzalez-Roibon N, Schultz L, Albadine R, Hicks J, et al. Loss of PTEN expression is associated with increased risk of recurrence after prostatectomy for clinically localized prostate cancer. Mod Pathol. 2012;25(11):1543-9. 500. Kim S, Kim S, Joung J, Lee G, Hong E, Kang K, et al. Overexpression of ERG and wild-type PTEN are associated with favorable clinical prognosis and low biochemical recurrence in prostate cancer. PLoS One. 2015;10(4):e0122498. 501. Leon P, Cancel-Tassin G, Drouin S, Audouin M, Varinot J, Comperat E, et al. Comparison of cell cycle progression score with two immunohistochemical markers (PTEN and Ki-67) for predicting outcome in prostate cancer after radical prostatectomy. World J Urol. 2018;36(9):1495-500. 502. Cuzick J, Stone S, Fisher G, Yang Z, North B, Berney D, et al. Validation of an RNA cell cycle progression score for predicting death from prostate cancer in a conservatively managed needle biopsy cohort. Br J Cancer. 2015;113(3):382-9. 503. Sommariva S, Tarricone R, Lazzeri M, Ricciardi W, Montorsi F. Prognostic value of the cell cycle progression score in patients with prostate cancer: a systematic review and meta-analysis. Eur Urol. 2016;69(1):107-15. 504. Cho I, Chung H, Cho K, Kim J, Joung J, Seo H, et al. Bcl-2 as a predictive factor for biochemical recurrence after radical prostatectomy: an interim analysis. Cancer Res Treat. 2010;42(3):157-62. 505. Pollack A, Cowen D, Troncoso P, Zagars G, von Eschenbach A, Meistrich M, et al. Molecular markers of outcome after radiotherapy in patients with prostate carcinoma: Ki-67, bcl-2, bax, and bcl-x. Cancer. 2003;97(7):1630-8. 506. Miyake H, Muramaki M, Kurahashi T, Takenaka A, Fujisawa M. Expression of potential molecular markers in prostate cancer: correlation with clinicopathological outcomes in patients undergoing radical prostatectomy. Urol Oncol. 2010;28(2):145-51. 507. Terracciano D, Bruzzese D, Ferro M, Mazzarella C, Di Lorenzo G, Altieri V, et al. Preoperative insulin-like growth factor-binding protein-3 (IGFBP-3) blood level predicts Gleason sum upgrading. Prostate. 2012;72(1):100-7. 508. Cao Y, Lindström S, Schumacher F, Stevens V, Albanes D, Berndt S, et al. Insulin-like growth factor pathway genetic polymorphisms, circulating IGF1 and IGFBP3, and prostate cancer survival. J Natl Cancer Inst. 2014;106(6):dju085. 509. Yu H, Nicar M, Shi R, Berkel H, Nam R, Trachtenberg J, et al. Levels of insulin-like growth factor I (IGF-I) and IGF binding proteins 2 and 3 in serial postoperative serum samples and risk of prostate cancer recurrence. Urology. 2001;57(3):471-5.

252

510. Grivas N, Goussia A, Stefanou D, Giannakis D. Microvascular density and immunohistochemical expression of VEGF, VEGFR-1 and VEGFR-2 in benign prostatic hyperplasia, high-grade prostate intraepithelial neoplasia and prostate cancer. Cent European J Urol. 2016;69(1):63-71. 511. Revelos K, Petraki C, Scorilas A, Stefanakis S, Malovrouvas D, Alevizopoulos N, et al. Correlation of androgen receptor status, neuroendocrine differentiation and angiogenesis with time-to-biochemical failure after radical prostatectomy in clinically localized prostate cancer. Anticancer Res. 2007;27(5B):3651-60. 512. Wikstrom P, Lissbrant I, Stattin P, Egevad L, Bergh A. Endoglin (CD105) is expressed on immature blood vessels and is a marker for survival in prostate cancer. Prostate. 2002;51(4):268-75. 513. El-Gohary Y, Silverman J, Olson P, Liu Y, Cohen J, Miller R, et al. Endoglin (CD105) and vascular endothelial growth factor as prognostic markers in prostatic adenocarcinoma. Am J Clin Pathol. 2007;127(4):572-9. 514. Erbersdobler A, Isbarn H, Dix K, Steiner I, Schlomm T, Mirlacher M, et al. Prognostic value of microvessel density in prostate cancer: a tissue microarray study. World J Urol. 2010;28(6):687-92. 515. Miyata Y, Sakai H. Reconsideration of the clinical and histopathological significance of angiogenesis in prostate cancer: usefulness and limitations of microvessel density measurement. Int J Urol. 2015;22(9):806-15. 516. Cohen B, Gomez P, Omori Y, Duncan R, Civantos F, Soloway M, et al. Cyclooxygenase-2 (COX-2) expression is an independent predictor of prostate cancer recurrence. Int J Cancer. 2006;119(5):1082-7. 517. Richardsen E, Uglehus R, Due J, Busch C, Busund L. COX-2 is overexpressed in primary prostate cancer with metastatic potential and may predict survival. A comparison study between COX-2, TGF-beta, IL-10 and Ki67. Cancer Epidemiol. 2010;34(3):316-22. 518. Fernandez-Martinez A, Carmena M, Arenas M, Bajo A, Prieto J, Sanchez-Chapado M. Overexpression of vasoactive intestinal peptide receptors and cyclooxygenase-2 in human prostate cancer. Analysis of potential prognostic relevance. Histol Histopathol. 2012;27(8):1093-101. 519. Zha S, Gage W, Sauvageot J, Saria E, Putzi M, Ewing C, et al. Cyclooxygenase-2 is up-regulated in proliferative inflammatory atrophy of the prostate, but not in prostate carcinoma. Cancer Res. 2001;61(24):8617-23. 520. Shariat S, Roehrborn C, McConnell J, Park S, Alam N, Wheeler T, et al. Association of the circulating levels of the urokinase system of plasminogen activation with the presence of prostate cancer and invasion, progression, and metastasis. J Clin Oncol. 2007;25(4):349-55. 521. Svatek R, Jeldres C, Karakiewicz P, Suardi N, Walz J, Roehrborn C, et al. Pre-treatment biomarker levels improve the accuracy of post-prostatectomy nomogram for prediction of biochemical recurrence. Prostate. 2009;69(8):886-94. 522. Al-Janabi O, Taubert H, Lohse-Fischer A, Frohner M, Wach S, Stohr R, et al. Association of tissue mRNA and serum antigen levels of members of the urokinase-type plasminogen activator system with clinical and prognostic parameters in prostate cancer. Biomed Res Int. 2014;2014:972587. 523. Lippert S, Berg K, Hoyer-Hansen G, Lund I, Iversen P, Christensen I, et al. Copenhagen uPAR prostate cancer (CuPCa) database: protocol and early results. Biomark Med. 2016;10(2):209-16. 524. Almasi C, Brasso K, Iversen P, Pappot H, Hoyer-Hansen G, Dano K, et al. Prognostic and predictive value of intact and cleaved forms of the urokinase plasminogen activator receptor in metastatic prostate cancer. Prostate. 2011;71(8):899-907. 525. Kachroo N, Warren A, Gnanapragasam V. Multi-transcript profiling in archival diagnostic prostate cancer needle biopsies to evaluate biomarkers in non-surgically treated men. BMC Cancer. 2014;14:673. 526. Barber A, Castillo-Martin M, Bonal D, Jia A, Rybicki B, Christiano A, et al. PI3K/AKT pathway regulates E-cadherin and Desmoglein 2 in aggressive prostate cancer. Cancer Med. 2015;4(8):1258-71. 527. Jung W, Sung C, Han S, Kim K, Kim M, Ro J, et al. AZGP-1 immunohistochemical marker in prostate cancer: potential predictive marker of biochemical recurrence in post radical prostatectomy specimens. Appl Immunohistochem Mol Morphol. 2014;22(9):652-7. 528. Parekh D, Punnen S, Sjoberg D, Asroff S, Bailen J, Cochran J, et al. A multi-institutional prospective trial in the USA confirms that the 4Kscore accurately identifies men with high-grade prostate cancer. Eur Urol. 2015;68(3):464-70. 529. Assel M, Sjoblom L, Murtola T, Talala K, Kujala P, Stenman U, et al. A four-kallikrein panel and β- microseminoprotein in predicting high-grade prostate cancer on biopsy: an independent replication from the Finnish section of the European Randomized Study of Screening for Prostate Cancer. Eur Urol Focus. 2017.

253

530. Carlsson S, Maschino A, Schroder F, Bangma C, Steyerberg E, van der Kwast T, et al. Predictive value of four kallikrein markers for pathologically insignificant compared with aggressive prostate cancer in radical prostatectomy specimens: results from the European Randomized Study of Screening for Prostate Cancer section Rotterdam. Eur Urol. 2013;64(5):693-9. 531. Stattin P, Vickers A, Sjoberg D, Johansson R, Granfors T, Johansson M, et al. Improving the specificity of screening for lethal prostate cancer using prostate-specific antigen and a panel of kallikrein markers: a nested case-control study. Eur Urol. 2015;68(2):207-13. 532. Konety B, Zappala S, Parekh D, Osterhout D, Schock J, Chudler R, et al. The 4Kscore® test reduces prostate biopsy rates in community and academic urology practices. Rev Urol. 2015;17(4):231-40. 533. Stephan C, Vincendeau S, Houlgatte A, Cammann H, Jung K, Semjonow A. Multicenter evaluation of [-2]proprostate-specific antigen and the Prostate Health Index for detecting prostate cancer. Clin Chem. 2013;59(1):306-14. 534. Fossati N, Buffi N, Haese A, Stephan C, Larcher A, McNicholas T, et al. Preoperative prostate- specific antigen isoform p2PSA and its derivatives, %p2PSA and Prostate Health Index, predict pathologic outcomes in patients undergoing radical prostatectomy for prostate cancer: results from a Multicentric European Prospective Study. Eur Urol. 2015;68(1):132-8. 535. Lughezzani G, Lazzeri M, Buffi N, Abrate A, Mistretta F, Hurle R, et al. Preoperative Prostate Health Index is an independent predictor of early biochemical recurrence after radical prostatectomy: results from a prospective single-center study. Urol Oncol. 2015;33(8):337.e7-14. 536. Tosoian J, Loeb S, Feng Z, Isharwal S, Landis P, Elliot D, et al. Association of [-2]proPSA with biopsy reclassification during active surveillance for prostate cancer. J Urol. 2012;188(4):1131-6. 537. Lu L, Zhang H, Pang J, Hou G, Lu M, Gao X. ERG rearrangement as a novel marker for predicting the extra-prostatic extension of clinically localised prostate cancer. Oncol Lett. 2016;11(4):2532-8. 538. Nam R, Sugar L, Wang Z, Yang W, Kitching R, Klotz L, et al. Expression of TMPRSS2:ERG gene fusion in prostate cancer cells is an important prognostic factor for cancer progression. Cancer Biol Ther. 2007;6(1):40-5. 539. Spencer E, Johnston R, Gordon R, Lucas J, Ussakli C, Hurtado-Coll A, et al. Prognostic value of ERG oncoprotein in prostate cancer recurrence and cause-specific mortality. Prostate. 2013;73(9):905-12. 540. Berg K. The prognostic and predictive value of TMPRSS2-ERG gene fusion and ERG protein expression in prostate cancer biopsies. Dan Med J. 2016;63(12). 541. Demichelis F, Fall K, Perner S, Andren O, Schmidt F, Setlur S, et al. TMPRSS2:ERG gene fusion associated with lethal prostate cancer in a watchful waiting cohort. Oncogene. 2007;26(31):4596-9. 542. Leyten G, Hessels D, Jannink S, Smit F, de Jong H, Cornel E, et al. Prospective multicentre evaluation of PCA3 and TMPRSS2-ERG gene fusions as diagnostic and prognostic urinary biomarkers for prostate cancer. Eur Urol. 2014;65(3):534-42. 543. Sanda M, Feng Z, Howard D, Tomlins S, Sokoll L, Chan D, et al. Association between combined TMPRSS2:ERG and PCA3 RNA urinary testing and detection of aggressive prostate cancer. JAMA Oncol. 2017;3(8):1085-93. 544. Xu B, Chevarie-Davis M, Chevalier S, Scarlata E, Zeizafoun N, Dragomir A, et al. The prognostic role of ERG immunopositivity in prostatic acinar adenocarcinoma: a study including 454 cases and review of the literature. Hum Pathol. 2014;45(3):488-97. 545. Attard G, Clark J, Ambroisine L, Fisher G, Kovacs G, Flohr P, et al. Duplication of the fusion of TMPRSS2 to ERG sequences identifies fatal human prostate cancer. Oncogene. 2008;27(3):253-63. 546. Ekici S, Cerwinka W, Duncan R, Gomez P, Civantos F, Soloway M, et al. Comparison of the prognostic potential of hyaluronic acid, hyaluronidase (HYAL-1), CD44v6 and microvessel density for prostate cancer. Int J Cancer. 2004;112(1):121-9. 547. Aaltomaa S, Lipponen P, Tammi R, Tammi M, Viitanen J, Kankkunen J, et al. Strong stromal hyaluronan expression is associated with PSA recurrence in local prostate cancer. Urol Int. 2002;69(4):266-72. 548. Som A, Tu S, Liu J, Wang X, Qiao W, Logothetis C, et al. Response in bone turnover markers during therapy predicts overall survival in patients with metastatic prostate cancer: analysis of three clinical trials. Br J Cancer. 2012;107(9):1547-53. 549. Fizazi K, Massard C, Smith M, Rader M, Brown J, Milecki P, et al. Bone-related parameters are the main prognostic factors for overall survival in men with bone metastases from castration-resistant prostate cancer. Eur Urol. 2015;68(1):42-50. 550. Heid C, Stevens J, Livak K, Williams P. Real time quantitative PCR. Genome Res. 1996;6(10):986- 94.

254

551. Holland P, Abramson R, Watson R, Gelfand D. Detection of specific polymerase chain reaction product by utilizing the 5'----3' exonuclease activity of Thermus aquaticus DNA polymerase. Proc Natl Acad Sci USA. 1991;88(16):7276-80. 552. Lazaruk K, Wang Y, Zhong J, Maltchenko S, Rabkin S, Hunkapiller K, et al. The design process for a new generation of quantitative gene expression analysis tools-TaqMan® probe-based assays for human and other model organism genes [Internet]. Foster City: Applied Biosystems; 2006 [cited 2018]. Available from: https://www.thermofisher.com/document-connect/document- connect.html?url=https://assets.thermofisher.com/TFS- Assets/LSG/manuals/cms_040599.pdf&title=White%20Paper:%20The%20Design%20Process%20for%20 a%20New%20Generation%20of%20Quantitative%20Gene%20Expression%20Analysis%20Tools. 553. Kutyavin I, Afonina I, Mills A, Gorn V, Lukhtanov E, Belousov E, et al. 3'-minor groove binder-DNA probes increase sequence specificity at PCR extension temperatures. Nucleic Acids Res 2000;28(2):655- 61. 554. Schmittgen T, Zakrajsek B, Mills A, Gorn V, Singer M, Reed M. Quantitative reverse transcription- polymerase chain reaction to study mRNA decay: comparison of endpoint and real-time methods. Anal Biochem. 2000;285(2):194-204. 555. Bio-Rad Laboratories. Real-time PCR Applications Guide [Internet]. Hercules: Bio-Rad Laboratories, Inc.; 2006 [cited 2018]. Available from: http://www.bio- rad.com/webroot/web/pdf/lsr/literature/Bulletin_5279.pdf. 556. Ginzinger D. Gene quantification using real-time quantitative PCR: an emerging technology hits the mainstream. Exp Hematol. 2002 30(6):503–12. 557. Kozera B, Rapacz M. Reference genes in real-time PCR. J Appl Genet. 2013;54(4):391-406.

255