Identification of pre-diagnostic proteomics markers contributing to the future risk of lung cancer

Sonia Dagnino, Barbara Bodinier, Marc Chadeau-Hyam

8th October, 2020

Barbara Bodinier Pre-diagnostic markers of lung cancer 8th October, 2020 1 / 27

Data overview

Prospective case-control study of lung cancer on 648 participants from EPIC Italy (N=382) and NOWAC (N=266, Women only)

92 circulating inflammatory proteins measured with the Olink platform

Protein levels provided are log2-transformed (Olink’s NPX unit, where an increase of 1 NPX corresponds to a doubling of concentration)

Replicated measurements (N=2) for 56 EPIC Italy participants

Quality control: 16 controls (same sample measured 16 times)

⇒ A total of 720 measurements (704 samples and 16 controls) over 8 plates (N=90 samples each)

In the same participants (NOWAC only, N=222): transcriptomics measurements (N=11,610 probes)
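As a small illustration of the NPX scale described on this slide: because NPX is a log2 unit, a difference of d NPX between two samples corresponds to a 2^d fold change in relative concentration. The function name below is hypothetical and the values are made up.

```python
def npx_fold_change(npx_a: float, npx_b: float) -> float:
    """Relative fold change of sample a over sample b, given NPX (log2) values."""
    return 2.0 ** (npx_a - npx_b)


# A difference of 1 NPX is a doubling; a difference of 2 NPX is a four-fold change
print(npx_fold_change(5.0, 4.0))  # 2.0
print(npx_fold_change(6.5, 4.5))  # 4.0
```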

Participant characteristics

                                      Full population  NOWAC          EPIC Women     EPIC Men
                                      (N=648)          (N=266)        (N=169)        (N=213)
Age at sample (years)                 55.29 (5.58)     56.49 (4.01)   53.17 (7.49)   55.48 (4.99)
Gender: Female                        435 (67.1)       266 (100.0)    169 (100.0)    0 (0.0)
Body Mass Index                       25.58 (3.86)     24.90 (3.53)   25.56 (4.91)   26.41 (3.05)
Centre
  NOWAC                               266 (41.0)       266 (100.0)    0 (0.0)        0 (0.0)
  Florence                            132 (20.4)       0 (0.0)        79 (46.7)      53 (24.9)
  Naples                              8 (1.2)          0 (0.0)        8 (4.7)        0 (0.0)
  Ragusa                              28 (4.3)         0 (0.0)        0 (0.0)        28 (13.1)
  Turin                               122 (18.8)       0 (0.0)        26 (15.4)      96 (45.1)
  Varese                              92 (14.2)        0 (0.0)        56 (33.1)      36 (16.9)
Lung cancer status: case              323 (49.8)       133 (50.0)     84 (49.7)      106 (49.8)
Time to diagnosis (years)             5.80 (3.59)      3.81 (2.02)    7.24 (3.69)    7.17 (3.87)
Subtype
  Adenocarcinoma                      142 (44.0)       63 (47.4)      37 (44.0)      42 (39.6)
  Large-cell carcinoma                42 (13.0)        6 (4.5)        13 (15.5)      23 (21.7)
  Small-cell carcinoma                46 (14.2)        26 (19.5)      10 (11.9)      10 (9.4)
  Squamous-cell carcinoma             50 (15.5)        19 (14.3)      11 (13.1)      20 (18.9)
  Other lung cancer                   43 (13.3)        19 (14.3)      13 (15.5)      11 (10.4)
Smoking status
  Never                               181 (28.1)       70 (26.3)      69 (40.8)      42 (20.0)
  Former                              199 (30.9)       73 (27.4)      35 (20.7)      91 (43.3)
  Current                             265 (41.1)       123 (46.2)     65 (38.5)      77 (36.7)
Smoking intensity                     9.56 (8.75)      7.79 (6.12)    6.87 (8.66)    13.80 (9.89)
Packyears                             14.36 (14.59)    11.44 (10.75)  9.72 (13.00)   21.64 (17.03)
Time since quitting smoking (years)   5.42 (8.77)      4.96 (9.49)    4.06 (7.44)    6.75 (8.51)
Smoking duration                      22.04 (16.83)    24.29 (17.91)  15.31 (15.41)  24.64 (15.08)
Cumulative Smoking Index              0.94 (0.93)      0.56 (0.43)    0.93 (1.03)    1.42 (1.08)
Quality Control: Warning              15 (2.3)         3 (1.1)        1 (0.6)        11 (5.2)
Storage time (years)                  14.91 (4.94)     9.03 (0.98)    18.97 (1.17)   18.74 (1.23)
Gel status: poor quality              21 (10.7)        -              4 (5.2)        17 (14.2)

⇒ Differences in smoking habits between Men and Women in EPIC


⇒ Differences in smoking duration between NOWAC and EPIC Women


⇒ Differences in time to diagnosis between NOWAC and EPIC


⇒ Differences in storage time between NOWAC and EPIC

Below detection limit/missing data by disease status

Proportion of measurements below the limit of detection in cases (N=351 samples from 323 participants) and controls (N=353 samples from 325 participants) separately

[Figure: proportion of measurements below the LOD (y-axis, 0 to 1) for each of the 92 proteins (x-axis), shown separately for healthy controls (N=353) and lung cancer cases (N=351)]

⇒ 21 proteins with more than 30% of measurements below the detection limit

⇒ Additionally, 3 missing values for sample 103590 (warning in QC)

⇒ Very small differences in proportions of missing values between cases and controls
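The 30% rule above can be sketched as follows. This is a minimal illustration on made-up values: the function names, the toy data and the per-protein LOD values are all hypothetical, and treating exactly 30% as acceptable is an assumption, since the slides do not state how the boundary is handled.

```python
def proportion_below_lod(values, lod):
    """Share of measurements that are missing (None) or below the detection limit."""
    flagged = sum(1 for v in values if v is None or v < lod)
    return flagged / len(values)


def keep_proteins(measurements, lods, max_prop=0.30):
    """Names of proteins with at most max_prop of their values below the LOD."""
    return [name for name, vals in measurements.items()
            if proportion_below_lod(vals, lods[name]) <= max_prop]


# Toy NPX values (made up), one list of measurements per protein
measurements = {
    "CDCP1": [2.1, 2.4, 1.9, 2.8, 2.2],
    "IL1alpha": [0.1, None, 0.2, 0.1, 1.5],
}
lods = {"CDCP1": 1.0, "IL1alpha": 0.5}

print(keep_proteins(measurements, lods))  # ['CDCP1']
```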

Quality check: control measurements

Standard deviation between samples (dark blue) and controls (red)

Protein names are ordered by decreasing proportion of samples below the LOD

[Figure: standard deviation (y-axis, 0 to 2) for each of the 92 proteins (x-axis), for samples (N=704) and controls (N=16)]

⇒ More variability between the actual samples than between the controls for all proteins except IL1alpha (∼ 80% of data below the detection limit)

Repeated measurements

Pearson’s correlation between replicates (N=56 samples measured twice) on the 71 proteins with less than 30% of missing values

[Figure: correlation between replicates (y-axis, 0.965 to 1.000) for each of the 56 replicated samples (x-axis), each labelled with the number of proteins used (N between 67 and 71)]

⇒ High consistency between both measurements (correlations above 0.96)

⇒ The average value is taken for replicated measurements

⇒ Remaining missing values are imputed using QRILC for left-censored data
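A minimal sketch of the replicate check and averaging described above, on made-up values. The function names are hypothetical. Two choices are assumptions not stated in the slides: the correlation uses only proteins observed in both replicates, and a protein observed in only one replicate keeps that single value. The QRILC imputation step is not reproduced here.

```python
from math import sqrt


def pearson(x, y):
    """Pearson's correlation between two equally long lists of numbers."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def compare_and_average(rep1, rep2):
    """Correlate two replicate profiles, then average them protein by protein."""
    pairs = [(a, b) for a, b in zip(rep1, rep2) if a is not None and b is not None]
    r = pearson([a for a, _ in pairs], [b for _, b in pairs])
    averaged = []
    for a, b in zip(rep1, rep2):
        observed = [v for v in (a, b) if v is not None]
        averaged.append(sum(observed) / len(observed) if observed else None)
    return r, len(pairs), averaged


# Two replicate profiles over five proteins (made-up NPX values, None = missing)
rep1 = [1.0, 2.0, 3.0, None, 5.0]
rep2 = [1.1, 2.1, 2.9, 4.0, None]
r, n_used, averaged = compare_and_average(rep1, rep2)
print(n_used, round(r, 3), averaged)
```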

Quality control and outlier detection

Principal Component Analysis: score plots along the first 3 PCs

Detection of outliers along the first 5 PCs using the multivariate distance-based algorithm implemented in the R package mvoutlier

[Figure: score plots of Comp 1 (43.7% e.v.) against Comp 2 (9.18% e.v.), Comp 1 against Comp 3 (3.61% e.v.), and Comp 2 against Comp 3. Legend: Selected Samples (N=578); Quality Control (N=5); Sample Conservation (N=6); Detected Outliers (N=40); Quality Control + Detected Outliers (N=4); Sample Conservation + Detected Outliers (N=9); Quality Control + Sample Conservation + Detected Outliers (N=6)]

⇒ All 70 samples with poor-quality data are removed from further analyses
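To illustrate the idea of distance-based outlier detection on PC scores, here is a simplified sketch on simulated data. It is not the mvoutlier algorithm used in the analysis: it computes a plain (non-robust) squared Mahalanobis distance on the first 5 PCs and compares it with a chi-square cutoff. The function name, the simulated data and the choice of the 97.5% quantile are assumptions.

```python
import numpy as np


def pca_outliers(X, n_pcs=5, cutoff=12.83):
    """Flag rows whose squared Mahalanobis distance on the first PCs exceeds cutoff.

    The default cutoff is the 97.5% quantile of a chi-square distribution with 5 df.
    """
    # Standardise each column, then obtain PC scores from the SVD
    Xs = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    U, s, _ = np.linalg.svd(Xs, full_matrices=False)
    scores = U[:, :n_pcs] * s[:n_pcs]
    variances = s[:n_pcs] ** 2 / (X.shape[0] - 1)
    # PCs are uncorrelated, so the distance is a sum of scaled squared scores
    d2 = (scores ** 2 / variances).sum(axis=1)
    return np.where(d2 > cutoff)[0], d2


rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
X[0] += 15.0  # one grossly shifted sample (simulated)
flagged, d2 = pca_outliers(X)
print(flagged)
```

A robust variant, as implemented in dedicated packages, would estimate location and scatter in a way that is less influenced by the outliers themselves.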

Principal component analysis after exclusions

PCA on the imputed data after exclusion of participants (N=578)

[Figure: left, percentage of explained variance against number of components; right, score plot of Comp 1 (27.23% e.v.) against Comp 2 (7.46% e.v.). Legend: NOWAC (N=238), Florence (N=131), Naples (N=7), Turin (N=83), Varese (N=92); Men (N=176), Women (N=402)]

⇒ The first PC explains 27% of the variance

⇒ Score plots show strong differences between EPIC centres and NOWAC

⇒ Need to account for heterogeneity between study/cohorts in the models

Principal component analysis after exclusions

Denoising of the data by extracting the residuals from a linear mixed model with protein levels (outcome) against plate and centre as random effects

PCA on the residuals (denoised data)

[Figure: left, percentage of explained variance against number of components; right, score plot of Comp 1 (18.93% e.v.) against Comp 2 (7.52% e.v.). Legend: NOWAC (N=238), Florence (N=131), Naples (N=7), Turin (N=83), Varese (N=92); Men (N=176), Women (N=402)]

⇒ The first PC explains 19% of the variance

⇒ Linear mixed models were successful in removing centre effects

Analytical plan

1. Investigating the associations with lung cancer in Women only
2. Accounting for the effects of smoking using packyears and smoking status
3. Sensitivity analyses: stratification by cohort and median time to diagnosis
4. Validation in Men
5. Accounting for joint effects of the proteins in multivariate analyses
6. Stratification on main histological subtypes
7. Validation in external cohorts (EPIC, NSHDS)
8. Evaluation of the complementarity between proteins and packyears in lung cancer status discrimination using ROC curves
9. Exploration of the functional role of lung cancer-related proteins via OMICs integration

Univariate analyses in Women

Univariate logistic models with lung cancer status/packyears (outcome) against protein levels (predictor) and adjusted on age and BMI

⇒ 12 significant associations with lung cancer after FDR correction

⇒ 11/12 are also associated with packyears

⇒ Overall attenuation of the strength of the association with lung cancer upon adjustment on packyears; only CDCP1 survives adjustment
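The slides do not say which FDR procedure was applied. As an illustration only, the sketch below assumes the Benjamini-Hochberg procedure, which is a common choice; the function name and the p-values are made up.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone adjusted values
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


pvals = [0.001, 0.04, 0.03, 0.2]  # one p-value per protein (made up)
adj = benjamini_hochberg(pvals)
significant = [i for i, q in enumerate(adj) if q < 0.05]
print(adj, significant)
```

In this toy example only the first test has an adjusted p-value below 0.05, even though three raw p-values are below that threshold.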

Investigating the effects of smoking

Stratified analyses by smoking status

⇒ No significant association in never smoking Women

⇒ CDCP1, SCF and IL6 are significantly associated with future lung cancer status in current smoking Women

Validation in Men

Analyses adjusted on packyears are conducted in Women (N=388) and Men (N=173) separately

[Figure: comparison of results in Women (x-axis) and Men (y-axis): left, effect sizes β; right, −log10(p-value). CDCP1 is labelled in both panels]

⇒ CDCP1 is nominally significant in Men

Pairwise correlations between proteins

Heatmap of Pearson’s correlations between the imputed levels of the 71 inflammatory proteins in controls (left) and cases (right)

Hierarchical clustering performed in healthy controls

[Figure: two correlation heatmaps, CONTROLS (N=290) and CASES (N=288), colour scale from −1 to 1, with the proteins in the same clustered order in both panels]

⇒ Overall, similar correlation patterns in cases and controls

⇒ Some proteins are correlated; this needs to be accounted for in the models

Multivariate analyses

Logistic-LASSO on lung cancer against all proteins, adjusted on age and BMI

Variable selection used in combination with stability analyses (1,000 iterations on subsamples of 80% of the data) to derive selection proportions in the base model and in the model further adjusted on packyears

[Figure: selection proportion (y-axis, 0 to 1) of each of the 71 proteins, for the base model and the model adjusted on packyears, shown alongside the signed −log10(p-value)]

⇒ Good consistency with univariate results

⇒ CDCP1 and IL10 in the model adjusted on smoking

Analyses by histological subtypes

Stability analyses of the logistic-LASSO stratified by subtype: adenocarcinoma (N=91 cases) and small-cell carcinoma (N=32 cases)

[Figure: panels A, B and C, each showing the selection proportion (y-axis, 0 to 1) of each of the 71 proteins, for the base model and the model adjusted on packyears, alongside the signed −log10(p-value)]

⇒ CDCP1 is highly selected, but the re-ordering of the other signals suggests heterogeneity between the subtypes
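The stability analyses used on the last two slides can be sketched as follows, under several assumptions. The data are simulated, the penalty `lam` is fixed by hand (the slides do not describe how it was calibrated), the solver is a simple proximal gradient routine written for this example rather than the software used in the analysis, the adjustment on age and BMI is left out, and only 100 subsamples are drawn instead of 1,000 to keep the example fast. All function names are hypothetical.

```python
import numpy as np


def lasso_logistic(X, y, lam, n_iter=300):
    """L1-penalised logistic regression fitted by proximal gradient descent (ISTA)."""
    n, p = X.shape
    Xa = np.hstack([np.ones((n, 1)), X])          # unpenalised intercept in column 0
    step = 4.0 * n / np.linalg.norm(Xa, 2) ** 2   # 1 / Lipschitz constant of the gradient
    beta = np.zeros(p + 1)
    for _ in range(n_iter):
        prob = 1.0 / (1.0 + np.exp(-Xa @ beta))
        beta = beta - step * Xa.T @ (prob - y) / n
        # Soft-thresholding sets small coefficients to exactly zero
        beta[1:] = np.sign(beta[1:]) * np.maximum(np.abs(beta[1:]) - step * lam, 0.0)
    return beta[1:]


def selection_proportions(X, y, lam, n_sub=100, frac=0.8, seed=0):
    """Share of subsamples in which each predictor has a non-zero coefficient."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    counts = np.zeros(p)
    for _ in range(n_sub):
        idx = rng.choice(n, size=int(frac * n), replace=False)
        counts += lasso_logistic(X[idx], y[idx], lam) != 0
    return counts / n_sub


# Simulated data: only the first of 10 predictors drives the outcome
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 10))
y = (rng.random(200) < 1.0 / (1.0 + np.exp(-2.0 * X[:, 0]))).astype(float)
props = selection_proportions(X, y, lam=0.1)
print(props.round(2))
```

Predictors with a selection proportion close to 1 are selected in almost every subsample, which is the sense in which a signal such as CDCP1 is described as highly selected.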

Validation of CDCP1 in external cohorts

Validation using data from two external cohorts: EPIC (Netherlands, UK, Germany, Spain) and NSHDS (Northern Sweden Health and Disease Study)

Logistic models adjusted on age and BMI (base model), and further adjusted on packyears or smoking status

⇒ Associations with all lung cancer survive adjustment on smoking

⇒ Significant associations with adenocarcinoma despite small sample size

Quantifying the amount of disease-relevant information

ROC analyses with packyears, CDCP1 and LASSO-selected proteins to quantify the amount of disease-relevant information brought by proteins

[Figure: ROC curves (true positive rate against false positive rate) in three panels: A, all lung cancer (N=191 cases); B, adenocarcinoma (N=89 cases); C, small-cell carcinoma (N=30 cases). AUC values from the legends are given below]

Model                        All lung cancer           Adenocarcinoma           Small-cell
Packyears                    0.74 [0.73-0.74]          0.69 [0.68-0.70]         0.88 [0.87-0.88]
CDCP1                        0.65 [0.65-0.66]          0.68 [0.68-0.69]         0.74 [0.73-0.75]
CDCP1 + Packyears            0.75 [0.74-0.75]          0.73 [0.73-0.74]         0.88 [0.87-0.89]
LASSO stable                 0.73 [0.72-0.75] (N=10)   0.71 [0.70-0.71] (N=2)   0.81 [0.78-0.82] (N=3)
LASSO stable + Packyears     0.78 [0.77-0.79] (N=10)   0.76 [0.75-0.76] (N=2)   0.88 [0.86-0.89] (N=3)

⇒ CDCP1 alone yields an AUC of 0.65 (all LC), 0.68 (adenocarcinoma) and 0.74 (small-cell carcinoma)

⇒ Increase from 0.69 to 0.73 with CDCP1 on top of packyears (adenocarcinoma)

⇒ Moderate additional information with more proteins (N=10, AUC going from 0.75 to 0.78)
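For readers less familiar with the AUC, the sketch below computes it as the probability that a randomly chosen case has a higher predicted score than a randomly chosen control, with ties counted as one half. The scores are made up, the function name is hypothetical, and the way the intervals reported on the slide were obtained is not reproduced here.

```python
def auc(scores_cases, scores_controls):
    """Probability that a random case scores higher than a random control."""
    wins = 0.0
    for c in scores_cases:
        for k in scores_controls:
            if c > k:
                wins += 1.0
            elif c == k:
                wins += 0.5  # ties count as half a win
    return wins / (len(scores_cases) * len(scores_controls))


# Predicted probabilities from some model (made-up values)
cases = [0.9, 0.8, 0.4]
controls = [0.7, 0.3, 0.4]
print(round(auc(cases, controls), 3))  # 0.833
```

An AUC of 0.5 corresponds to no discrimination, so a change from 0.69 to 0.73 is a gain of 0.04 on a scale that effectively runs from 0.5 to 1.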

Integration of CDCP1 with transcriptomics

Univariate analyses of CDCP1 against all N=11,610 transcripts measured in the same participants (N=222, NOWAC)

[Figure: volcano plot of −log10(p-value) (y-axis, 0 to 8) against β (x-axis, −0.10 to 0.15) for each transcript; labelled points: LRRN3 (Znl3dSA4HuOuhEnO4o) and SEM1 (Ngi9Xlgh6np8jgjgjk)]

⇒ Significant association of CDCP1 with LRRN3 (marker of tobacco smoking) and SEM1

Exploration of the functional role of CDCP1

11,610 transcripts measured in the same participants (N=222, NOWAC)

Transcript nuIDs could be linked to 11,485 unique symbols

10,656 of these gene symbols could be identified on the Panther Database

Functional annotation based on classifications from knowledgebases:
Biological Process
Reactome

⇒ Each classification provides different levels of grouping of the genes (classification in groups and sub-groups of genes)

⇒ One gene can belong to different groups within each classification

Associations between CDCP1 and Reactome pathways

Grouping of the transcripts by Reactome pathways

⇒ 1,545 functional groups involving 6,401 unique genes (some of these pathways are made of just one gene)

Summary of each group using PCA: all components explaining more than 5% of the variance of the group are kept

⇒ 8,043 PC scores summarising the pathways (new dataset)

⇒ Number of PCs per pathway ranging between 1 and 10

Univariate regressions of CDCP1 against PC scores summarising the groups

Estimation of the Effective Number of Tests (ENT) as the number of PCs needed to explain 90% of the variance over the 8,043 scores (ENT=109)
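The two PCA-based steps on this slide, keeping the components that each explain more than 5% of a group's variance and counting the components needed to reach 90% of the variance across all scores, can be sketched as follows. The data are simulated and the function names are hypothetical; columns are centred but not rescaled, which is an assumption since the slides do not state the scaling used.

```python
import numpy as np


def explained_variance(X):
    """Proportion of variance explained by each principal component of X."""
    Xc = X - X.mean(axis=0)
    s = np.linalg.svd(Xc, compute_uv=False)
    return s ** 2 / np.sum(s ** 2)


def n_pcs_above(X, min_share=0.05):
    """Number of components that each explain more than min_share of the variance."""
    return int(np.sum(explained_variance(X) > min_share))


def effective_number_of_tests(scores, target=0.90):
    """Smallest number of PCs whose cumulative explained variance reaches target."""
    cumulative = np.cumsum(explained_variance(scores))
    return int(np.searchsorted(cumulative, target) + 1)


# Simulated scores: 12 columns built from only 3 independent signals
rng = np.random.default_rng(0)
latent = rng.normal(size=(100, 3))
scores = np.hstack([latent, latent, latent, latent])
print(n_pcs_above(scores), effective_number_of_tests(scores))
```

Because the 12 simulated columns carry only 3 independent signals, the effective number of tests is far smaller than the number of columns, which is the motivation for using the ENT instead of the raw number of tests when correcting for multiple testing.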

Associations between CDCP1 and Reactome pathways

Univariate regressions of CDCP1 against PC scores summarising Reactome pathways

[Figure: volcano plot of −log10(p-value) (y-axis, 0 to 4) against β (x-axis, −1.5 to 1.5) for each PC score, with the member genes of the highlighted pathways labelled. Highlighted pathways: Deactivation of the beta-catenin transactivating complex; Defective GALNT3 causes familial hyperphosphatemic tumoral calcinosis (HFTC); Initial triggering of complement]

⇒ Identification of 3 groups including 6-30 transcripts that were not detected in univariate analyses

Associations between CDCP1 and Biological Processes

Grouping of the transcripts by Biological Processes

⇒ 3,600 functional groups involving 9,826 unique genes (some of these pathways are made of just one gene)

Summary of each group using PCA: all components explaining more than 5% of the variance of the group are kept

⇒ 20,974 PC scores summarising the pathways (new dataset)

⇒ Number of PCs per pathway ranging between 1 and 10

Univariate regressions of CDCP1 against PC scores summarising the groups

Estimation of the Effective Number of Tests (ENT) as the number of PCs needed to explain 90% of the variance over the 20,974 scores (ENT=140)

Associations between CDCP1 and Biological Processes

Univariate regressions of CDCP1 against PC scores summarising Biological Processes

[Figure: volcano plot of −log10(p-value) (y-axis, 0 to 5) against β (x-axis, −1.0 to 1.0) for each PC score, with the member genes of the highlighted groups labelled. Groups named in the legend: Protein localization to nucleus (GO:0034504); Interleukin-18-mediated signaling pathway (GO:0035655); ERAD pathway (GO:0036503); Positive regulation of ERAD pathway (GO:1904294); Positive regulation of metallopeptidase activity (GO:1905050); Response to fluid shear stress (GO:0034405); Positive regulation of dendrite development (GO:1900006); Negative regulation of cAMP-dependent protein kinase activity (GO:2000480); Regulation of chemotaxis (GO:0050920); Regulation of cell-cell adhesion (GO:0022407); Positive regulation of pseudopodium assembly (GO:0031274)]

⇒ Identification of 13 groups including protein localization to nucleus, regulation of cell-cell adhesion and regulation of chemotaxis

Conclusions and perspectives

Analyses of a panel of circulating inflammatory proteins in association with the future risk of lung cancer in two prospective cohorts

⇒ Identification of robust associations between CDCP1 and all lung cancer and adenocarcinoma

⇒ Associations hold in stratified analyses by cohort (EPIC/NOWAC) or median time-to-diagnosis

⇒ Validation of these findings in two independent cohorts

⇒ Moderate gain in AUC on top of packyears (0.04 for adenocarcinoma)

⇒ Limited gain when considering joint effects of inflammatory proteins

CDCP1 (CUB domain containing protein 1): transmembrane noncatalytic receptor involved in the loss of anchorage in epithelial cells during

⇒ Previously found associated with higher proliferation and poor prognosis in lung cancer

Conclusions and perspectives

Using transcriptomics measurements in the same individuals, integration of CDCP1 with transcript levels to gain insight into its functional role

Functional grouping of the transcripts based on the Reactome and Biological Processes knowledgebases

Identification of associations between CDCP1 and summarised pathways

⇒ β-catenin transactivating complex: linked to a range of cancers, implicated in tumour development

⇒ cell-cell adhesion: CDCP1 disruption associated with interference in EGF/EGFR (Epidermal growth factor receptor) induced cell migration

Metabolomics measurements in the same participants: to be integrated with proteins and transcripts to explore joint effects at multiple molecular levels

Acknowledgements

Dr Sonia Dagnino
Prof Marc Chadeau-Hyam (PI)
Prof Roel Vermeulen
Dr Florence Guida
Dr Karl Smith-Byrne
Dr Mattias Johansson
Dr Therese Haugdahl Nost
Prof Torkjel Sandanger

Funding: Cancer Research UK ‘Mechanomics’ PRC project grant (Grant PRC 22184 to MC-H)
