Supplemental Materials Methods, Figures and Tables

Analysis of blood-based gene expression in idiopathic Parkinson disease

Supplementary Methods

Figures: e-1, e-2, e-3

Tables: e-1, e-2, e-3, e-4

Ron Shamir, PhD,* Christine Klein, MD,*,† David Amar, PhD,* Eva-Juliane Vollstedt, MD, Michael Bonin, PhD, Marija Usenovic, PhD, Yvette C. Wong, PhD, Ales Maver, MD, PhD, Sven Poths, Hershel Safer, PhD, Jean-Christophe Corvol, MD, PhD, Suzanne Lesage, PhD, Ofer Lavi, MSc, Günther Deuschl, MD, PhD, Gregor Kuhlenbaeumer, MD, PhD, Heike Pawlack, BSc, Igor Ulitsky, PhD, Meike Kasten, MD, Olaf Riess, MD, Alexis Brice, MD, Borut Peterlin, MD, PhD,† Dimitri Krainc, MD, PhD,†


Supplementary Methods

Sample collection, RNA isolation and microarray processing

We collected venous blood samples using a standardized blood withdrawal protocol. PAXgene (Qiagen) and EDTA tubes were obtained. Because participants were recruited at different time points, PAXgene tubes were inverted 10 times directly after blood collection, kept at room temperature for 24 hours, and subsequently frozen at -80 °C until RNA extraction. RNA was extracted after patient recruitment at all centers was completed, and all extractions were performed by the same individual. Samples were processed according to the manufacturers' protocols. From each patient, four whole blood samples were collected in PAXgene Vacutainer tubes (BD Biosciences). RNA was isolated using the PAXgene 96 RNA purification kit (BD Biosciences). RNA quality was checked on an Agilent Bioanalyzer 2100 (Agilent, Germany), and samples were processed for Affymetrix GeneChips using the Affymetrix 3'-IVT Express labeling kit (Affymetrix, Santa Clara). For globin reduction, 1.5 μg of total RNA from each sample was treated using the GLOBINclear™ Human Kit (Ambion, Austin, TX, USA) according to the manufacturer's instructions. Fragmented and labeled cDNA was hybridized onto GeneChip® U133 Plus 2.0 Arrays (Affymetrix). Staining of biotinylated cDNA and scanning of arrays were performed according to the manufacturer's recommendations.

Computational Preprocessing

The original data contained microarray expression profiles from 523 individuals. We tested three preprocessing methods: RMA, GC-RMA, and MAS5. Under the assumption that blood expression profiles should be highly correlated, we tested the effect of each method on the correlation between samples. We grouped the samples by the year of RNA extraction and examined the distribution of correlations between sample groups. MAS5 yielded lower correlation scores than the other two methods, while RMA had a slight advantage over GC-RMA (Figure e-1). We therefore selected RMA as the preprocessing method. With RMA, 37 samples had low correlation (<0.8) with samples from other years and were removed from the dataset.
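The correlation-based sample filter can be illustrated with a minimal Python sketch. It is not the authors' code (the published analysis was done in R), and the text does not state exactly how "low correlation with other years" was summarized. The sketch assumes the median Pearson correlation between a sample and all samples from other extraction groups; the function names `pearson` and `low_correlation_samples` and the toy data are hypothetical.

```python
from math import sqrt
from statistics import median


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def low_correlation_samples(profiles, groups, threshold=0.8):
    """Return ids of samples that correlate poorly with other groups.

    profiles: dict sample_id -> list of expression values (same probe order)
    groups:   dict sample_id -> group label (e.g. year of RNA extraction)
    Assumption: a sample is flagged when the median of its correlations
    with all samples from *other* groups is below `threshold`.
    """
    flagged = []
    for sid, profile in profiles.items():
        others = [o for o in profiles if groups[o] != groups[sid]]
        if not others:
            continue  # nothing to compare against
        r_med = median(pearson(profile, profiles[o]) for o in others)
        if r_med < threshold:
            flagged.append(sid)
    return flagged


# Toy data: four samples measured on four probes (hypothetical values).
toy_profiles = {
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [1.1, 2.0, 3.2, 3.9],
    "c": [4.0, 1.0, 3.0, 2.0],  # deliberately discordant profile
    "d": [1.0, 2.1, 2.9, 4.2],
}
toy_groups = {"a": "2008", "b": "2009", "c": "2009", "d": "2010"}
print(low_correlation_samples(toy_profiles, toy_groups))
```

The median is used here only so that a single discordant array does not pull down the summary for every other sample; a mean or a per-group summary would be equally defensible readings of the text.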


Batch Effect Reduction

A batch refers to a set of samples produced in the same laboratory on the same date. The majority of batches contained both patient and control samples. To test whether the data contained batch effects, we first applied an SVM classifier using all probes to predict the batch of each sample. For this test we grouped batches by year and laboratory, producing five batches. Leave-one-out cross-validation achieved an ROC score of 0.999 and 98.6% accuracy, indicating an extremely strong batch effect. Similar results were obtained even after removing thousands of probes that individually had a significant association with the batches. We therefore used the fSVA method [1] to reduce batch effects. fSVA produces a model that can also reduce confounding effects in new, independent samples from new batches. The method searches for surrogate variables that represent a significant source of variation in the data; these may be correlated with the batch or represent variance due to unknown factors. The training set was used to infer the surrogate variables, which were then regressed out from the training set and subsequently from the validation and test sets.
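The final step described above, regressing surrogate variables out of the training set and then applying the same correction to later samples, can be sketched in a simplified form. This is not the fSVA algorithm and not the authors' R code. It assumes a single surrogate variable whose value is already known for every sample, whereas fSVA itself estimates these values. The function names `fit_surrogate_adjustment` and `apply_surrogate_adjustment` are hypothetical. The point illustrated is only that the regression coefficients are fitted on training data and then reused unchanged.

```python
def fit_surrogate_adjustment(train_expr, train_sv):
    """Fit one least-squares slope per probe on the training set only.

    train_expr: list of samples, each a list of probe values
    train_sv:   one surrogate-variable value per training sample
                (assumed known here; fSVA would estimate it)
    Returns (mean of the surrogate variable, list of per-probe slopes).
    """
    n = len(train_sv)
    sv_mean = sum(train_sv) / n
    sv_ss = sum((s - sv_mean) ** 2 for s in train_sv)
    if sv_ss == 0:
        raise ValueError("surrogate variable is constant in the training set")
    slopes = []
    for j in range(len(train_expr[0])):
        column = [row[j] for row in train_expr]
        y_mean = sum(column) / n
        sxy = sum((s - sv_mean) * (y - y_mean) for s, y in zip(train_sv, column))
        slopes.append(sxy / sv_ss)
    return sv_mean, slopes


def apply_surrogate_adjustment(expr, sv, model):
    """Remove the fitted surrogate-variable component from any sample set.

    The same frozen `model` is used for training, validation and test
    samples, so no information from held-out data enters the fit.
    """
    sv_mean, slopes = model
    return [
        [y - b * (s - sv_mean) for y, b in zip(row, slopes)]
        for row, s in zip(expr, sv)
    ]


# Toy data: probe 0 depends on the surrogate variable, probe 1 does not.
train = [[3.0, 1.0], [5.0, 1.0], [7.0, 1.0]]
train_sv = [-1.0, 0.0, 1.0]
model = fit_surrogate_adjustment(train, train_sv)
print(apply_surrogate_adjustment(train, train_sv, model))
print(apply_surrogate_adjustment([[9.0, 1.0]], [2.0], model))
```

In a fuller treatment the outcome of interest (IPD versus control) would be included in the model so that the disease signal is not regressed out along with the batch signal; the sketch omits this for brevity.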

Cross validation

To test the performance of our framework on data from new batches, we repeatedly removed a complete batch from the data, learned a classifier (i.e., the complete process, including the fSVA model, feature selection and SVM) using the samples from the remaining batches, and then used it to predict the labels of the samples in the excluded batch. In the first stage of our analysis, in which we analyzed the training set, we removed batches with fewer than 10 samples.
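The leave-batch-out scheme can be sketched as follows. This is an illustrative Python outline under stated assumptions, not the authors' R implementation. The names `leave_batch_out_splits` and `leave_batch_out_accuracy` are hypothetical, and `fit` and `predict` are placeholder callables standing in for the complete pipeline (fSVA model, feature selection and SVM), which must be refitted inside every fold.

```python
from collections import defaultdict


def leave_batch_out_splits(batches, min_batch_size=10):
    """Yield (held_out_batch, train_indices, test_indices).

    batches: one batch label per sample.
    Batches with fewer than `min_batch_size` samples are dropped entirely,
    mirroring the filter described for the training-set analysis.
    """
    by_batch = defaultdict(list)
    for idx, label in enumerate(batches):
        by_batch[label].append(idx)
    kept = {b: idxs for b, idxs in by_batch.items() if len(idxs) >= min_batch_size}
    for held_out, test_idx in kept.items():
        train_idx = [i for b, idxs in kept.items() if b != held_out for i in idxs]
        yield held_out, train_idx, test_idx


def leave_batch_out_accuracy(samples, labels, batches, fit, predict,
                             min_batch_size=10):
    """Return accuracy per held-out batch.

    `fit(train_samples, train_labels)` must run the whole pipeline on the
    training folds only; `predict(model, sample)` labels one held-out sample.
    """
    results = {}
    for held_out, train_idx, test_idx in leave_batch_out_splits(
            batches, min_batch_size):
        model = fit([samples[i] for i in train_idx],
                    [labels[i] for i in train_idx])
        hits = sum(predict(model, samples[i]) == labels[i] for i in test_idx)
        results[held_out] = hits / len(test_idx)
    return results
```

Keeping every fitted step inside `fit` is the design choice that matters: if surrogate variables or selected features were computed once on all samples, information from the held-out batch would leak into the classifier.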

Comparison to previous studies

We compared our signature to those obtained by six other studies. We could not perform a meta-analysis since most studies measured only a few genes [2-4] and/or had very small sample sizes [5, 6]. However, we did apply the signature of Scherzer et al. 2007 to our data and vice versa.

Analysis and Statistics

Classification was performed using a linear SVM. We used the CMA R package [7] for feature selection. Functional and pathway enrichment analyses were done in EXPANDER [8]. Network analysis was done using the Cytoscape [9] plug-in GeneMANIA [10]. All statistical analyses were performed in R. ROC curves were generated using the ROCR R package [11]. All datasets have been deposited in the Gene Expression Omnibus (GEO; accession number GSE99039). The analysis R code is available at https://github.com/Shamir-Lab/GENEPARK.
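For readers less familiar with the AUC scores that summarize the ROC curves in this supplement, the quantity can be computed directly from classifier scores. The sketch below is an independent Python illustration, not the ROCR implementation used in the study. The function name `roc_auc` and the example values are hypothetical. It uses the rank interpretation of the AUC: the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as one half.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve from binary labels (1/0) and scores.

    Computed as the fraction of (positive, negative) pairs in which the
    positive sample has the higher score; tied pairs count as 0.5.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical scores for two patients (label 1) and two controls (label 0).
print(roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1]))  # -> 0.75
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation.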

Supplementary References

1. Leek JT, Johnson WE, Parker HS, Jaffe AE, Storey JD. The sva package for removing batch effects and other unwanted variation in high-throughput experiments. Bioinformatics 2012;28:882-883.
2. Grunblatt E, Zehetmayer S, Jacob CP, Muller T, Jost WH, Riederer P. Pilot study: peripheral biomarkers for diagnosing sporadic Parkinson's disease. Journal of neural transmission 2010;117:1387-1393.
3. Molochnikov L, Rabey JM, Dobronevsky E, et al. A molecular signature in blood identifies early Parkinson's disease. Molecular neurodegeneration 2012;7:26.
4. Chikina MD, Gerald CP, Li X, et al. Low-variance RNAs identify Parkinson's disease molecular signature in blood. Movement disorders: official journal of the Movement Disorder Society 2015;30:813-821.
5. Shehadeh LA, Yu K, Wang L, et al. SRRM2, a potential blood biomarker revealing high alternative splicing in Parkinson's disease. PloS one 2010;5:e9104.
6. Mutez E, Larvor L, Lepretre F, et al. Transcriptional profile of Parkinson blood mononuclear cells with LRRK2 mutation. Neurobiology of aging 2011;32:1839-1848.
7. Slawski M, Daumer M, Boulesteix AL. CMA: a comprehensive Bioconductor package for supervised classification with high dimensional data. BMC bioinformatics 2008;9:439.
8. Ulitsky I, Maron-Katz A, Shavit S, et al. Expander: from expression microarrays to networks and functions. Nature protocols 2010;5:303-322.
9. Smoot ME, Ono K, Ruscheinski J, Wang PL, Ideker T. Cytoscape 2.8: new features for data integration and network visualization. Bioinformatics 2011;27:431-432.
10. Montojo J, Zuberi K, Rodriguez H, et al. GeneMANIA Cytoscape plugin: fast gene function predictions on the desktop. Bioinformatics 2010;26:2927-2928.
11. Sing T, Sander O, Beerenwinkel N, Lengauer T. ROCR: visualizing classifier performance in R. Bioinformatics 2005;21:3940-3941.


Supplementary Figures

Figure e-1. Preprocessing of the original 523 blood expression profiles. We tested the effect of different preprocessing methods on the sample correlation under the assumption that blood expression profiles should be highly correlated. The distribution of the correlation between samples is shown after preprocessing by three different methods (RMA, GC-RMA, and MAS5), grouping the samples by the year of the RNA extraction. 37 samples had low correlation with other years (<0.8) and were removed from the data. A) Correlation between pairs of 2010 microarrays. RMA and GC-RMA average > 0.96. B) Correlation between 2008a and 2009 microarrays. RMA average: 0.93, GC-RMA average: 0.9. For all graphs: y-axis = count (of sample pairs), x-axis = correlation.


Figure e-2. Leave-batch-out cross-validation analysis on the training set. Each point shows the AUC score or accuracy (y-axis) for a different signature size (x-axis).


Figure e-3. Gene signature performance on idiopathic PD, controls and other neurodegenerative diseases (NDD). The plots show performance using our signature on the independent test set when 1) comparing IPD to all others (Control and NDD) (purple line, AUC score 0.63, p=0.033), and 2) comparing diseases (IPD and NDD) to controls (orange line, AUC score 0.72, p=7.5E-5). Of note, our classifier was not initially trained to differentiate between IPD and NDD. For comparison, the result reported in Scherzer et al. (2007) is also shown (grey line, AUC score 0.69, p=0.047), which used fewer samples (105 individuals) and a classifier initially trained to differentiate between PD samples and both NDD samples and healthy controls.


Supplementary Tables

Table e-1. Demographic data of clinical cohorts after preprocessing

Cohorts, in column order: IPD; NDD (MSA, CBD, PSP or PDD); NDD (subset of HD); Controls. Values in each row are listed in column order.

Total #: 205; 21; 19; 233
Age: 62 ± 11; 66 ± 10; 51 ± 10; 58 ± 30
% Males: 95 (50%); 10 (56%); 8 (42%); 75 (35%)
Age at onset: 56 ± 11; 65 ± 8; 43 ± 16; N/A

UPDRS Scales
UPDRS I: 2.2 ± 2.1; 2.0 ± 2.3; 0.5 ± 0.9
UPDRS II: 9.6 ± 6.3; 22.7 ± 7.7; 0.2 ± 0.5
UPDRS III: 23.3 ± 9.9; 29.3 ± 5.4; 0.6 ± 1.4
UPDRS IV: 2.6 ± 3.0; 1.6 ± 1.9; 0.1 ± 0.4

Hoehn & Yahr Stages
Stage 0: 7; 0
Stage 1: 58; 0
Stage 2: 70; 1
Stage 3: 30; 3
Stage 4: 8; 2
Stage 5: 0; 3

MoCA score: 27 ± 3; 26 ± 3

UPDRS I-IV = Unified Parkinson’s Disease Rating Scale part I-IV; MoCA = Montreal Cognitive Assessment


Table e-2. Characteristics of training, validation and test cohorts after preprocessing

                     Training   Validation   Test
# IPDs               140        35           30
# Controls           153        40           40
Average age          62.48      64.84        65.35
Age SD               10.47      8.72         9.14
Females (fraction)   0.61       0.46         0.52


Table e-3. Characteristics of IPD and control samples before and after batch removal analysis

Samples after batch removal analysis
                     Training   Validation   Test
# IPDs               86         15           30
# Controls           107        24           40
Average age          63.26      63.44        65.35
Age SD               9.71       5.66         9.14
Females (fraction)   0.62       0.50         0.52
Number of batches    7          7            13
Mean batch size      27.57      5.57         5.38
Median batch size    26         5            3

Samples before batch removal analysis
                     Training   Validation   Test
Number of batches    32         22           13
Mean batch size      9.16       3.41         5.38
Median batch size    4          2            3


Table e-4. List of 87 genes in blood-based IPD signature

Probe id      Gene ID  Gene name  Importance score  Regulation  Fold-change  P-value
218555_at     29882    ANAPC2     32.940            UP          1.14         2.98E-08
201698_s_at   8683     SRSF9      32.205            UP          1.09         4.15E-08
212187_x_at   5730     PTGDS      32.101            UP          1.40         4.35E-08
203887_s_at   7056     THBD       30.142            UP          1.36         1.06E-07
214119_s_at   2280     FKBP1A     27.992            UP          1.15         2.84E-07
203175_at     391      RHOG       26.640            UP          1.08         5.30E-07
200709_at     2280     FKBP1A     25.653            UP          1.11         8.38E-07
225820_at     NA       NA         25.375            UP          1.17         9.54E-07
211178_s_at   9051     PSTPIP1    24.232            UP          1.10         1.63E-06
215706_x_at   7791     ZYX        23.137            UP          1.13         2.73E-06
202391_at     10409    BASP1      22.886            UP          1.11         3.07E-06
203247_s_at   7572     ZNF24      22.848            DOWN        0.92         3.13E-06
208450_at     3957     LGALS2     22.671            DOWN        0.67         3.40E-06
31837_at      91289    LMF2       22.344            UP          1.07         3.97E-06
212770_at     7090     TLE3       22.094            UP          1.15         4.47E-06
218638_s_at   10417    SPON2      21.816            UP          1.26         5.10E-06
200808_s_at   7791     ZYX        21.550            UP          1.12         5.79E-06
205558_at     7189     TRAF6      21.384            DOWN        0.88         6.27E-06
204263_s_at   1376     CPT2       21.061            UP          1.15         7.32E-06
217929_s_at   79932    KIAA0319L  21.049            UP          1.14         7.36E-06
217874_at     8802     SUCLG1     20.763            DOWN        0.94         8.44E-06
229373_at     NA       NA         20.666            UP          1.17         8.84E-06
244382_at     83874    TBC1D10A   20.536            UP          1.14         9.41E-06
212178_s_at   340318   LOC340318  20.289            UP          1.11         1.06E-05
203760_s_at   6503     SLA        19.591            UP          1.11         1.48E-05
224769_at     NA       NA         18.993            DOWN        0.88         1.98E-05
231974_at     8085     MLL2       18.977            UP          1.11         1.99E-05
217858_s_at   51566    ARMCX3     18.759            UP          1.14         2.22E-05
222871_at     55220    KLHDC8A    18.731            DOWN        0.87         2.25E-05
211748_x_at   5730     PTGDS      18.717            UP          1.27         2.26E-05
202088_at     25800    SLC39A6    18.496            UP          1.17         2.52E-05
209007_s_at   57035    C1orf63    18.349            DOWN        0.87         2.70E-05
205367_at     10603    SH2B2      18.302            UP          1.11         2.77E-05
222468_at     79932    KIAA0319L  18.168            UP          1.13         2.95E-05
218367_x_at   27005    USP21      17.991            UP          1.11         3.22E-05
216036_x_at   23038    WDTC1      17.957            DOWN        0.91         3.27E-05
233007_at     7520     XRCC5      17.794            DOWN        0.86         3.54E-05
1567628_at    972      CD74       17.688            UP          1.15         3.73E-05
212101_at     23633    KPNA6      17.545            UP          1.09         4.00E-05
200701_at     10577    NPC2       17.394            UP          1.09         4.31E-05
202275_at     2539     G6PD       17.168            UP          1.10         4.81E-05
202731_at     27250    PDCD4      17.040            UP          1.15         5.12E-05
215273_s_at   10474    TADA3      16.818            UP          1.07         5.71E-05
215639_at     10044    SH2D3C     16.677            UP          1.15         6.12E-05
217840_at     51428    DDX41      16.573            UP          1.08         6.44E-05
241610_x_at   55690    PACS1      16.522            UP          1.42         6.60E-05
208845_at     7419     VDAC3      16.476            DOWN        0.94         6.75E-05
202215_s_at   4802     NFYC       16.416            UP          1.08         6.96E-05
204206_at     4335     MNT        16.396            UP          1.09         7.02E-05
223852_s_at   83931    STK40      16.366            UP          1.07         7.13E-05
217934_x_at   10273    STUB1      16.347            UP          1.07         7.20E-05
209002_s_at   57658    CALCOCO1   16.333            UP          1.08         7.25E-05
200086_s_at   1327     COX4I1     16.320            DOWN        0.94         7.30E-05
229574_at     29896    TRA2A      16.293            DOWN        0.91         7.39E-05
225289_at     6774     STAT3      16.284            UP          1.10         7.43E-05
221890_at     63925    ZNF335     16.213            UP          1.09         7.69E-05
203758_at     1519     CTSO       16.099            UP          1.11         8.13E-05
225701_at     80709    AKNA       15.753            UP          1.07         9.65E-05
229001_at     90673    PPP1R3E    15.630            UP          1.11         0.000103
222623_s_at   51193    ZNF639     15.591            UP          1.14         0.000105
220494_s_at   NA       NA         15.567            UP          1.38         0.000106
210186_s_at   2280     FKBP1A     15.549            UP          1.12         0.000107
201545_s_at   8106     PABPN1     15.368            UP          1.14         0.000117
218517_at     79960    PHF17      15.333            UP          1.11         0.000119
203523_at     4046     LSP1       15.210            UP          1.08         0.000126
201991_s_at   3799     KIF5B      15.158            DOWN        0.92         0.00013
242398_x_at   515      ATP5F1     15.104            UP          1.09         0.000133
203508_at     7133     TNFRSF1B   15.013            UP          1.06         0.000139
230669_at     5922     RASA2      14.924            DOWN        0.90         0.000146
212041_at     9114     ATP6V0D1   14.912            UP          1.07         0.000146
203109_at     9040     UBE2M      14.878            UP          1.10         0.000149
201180_s_at   2773     GNAI3      14.853            DOWN        0.92         0.000151
216449_x_at   7184     HSP90B1    14.769            DOWN        0.90         0.000157
217870_s_at   51727    CMPK1      14.746            UP          1.16         0.000159
208696_at     22948    CCT5       14.733            DOWN        0.94         0.00016
208645_s_at   6208     RPS14      14.574            DOWN        0.95         0.000173
223135_s_at   56987    BBX        14.569            UP          1.08         0.000174
201298_s_at   55233    MOB1A      14.532            DOWN        0.93         0.000177
212223_at     3423     IDS        14.486            UP          1.10         0.000181
201114_x_at   5688     PSMA7      14.458            DOWN        0.94         0.000184
203723_at     3707     ITPKB      14.299            UP          1.08         0.000199
204669_s_at   11237    RNF24      14.239            UP          1.10         0.000205
242059_at     55500    ETNK1      14.132            DOWN        0.86         0.000216
242134_at     NA       NA         14.113            UP          1.36         0.000218
224778_s_at   NA       NA         14.098            DOWN        0.91         0.00022
218920_at     54540    FAM193B    14.062            UP          1.10         0.000224
218860_at     79050    NOC4L      14.008            UP          1.10         0.00023
213738_s_at   498      ATP5A1     14.007            DOWN        0.95         0.00023
226140_s_at   220213   OTUD1      13.979            UP          1.12         0.000233
223562_at     64098    PARVG      13.964            UP          1.07         0.000235
240166_x_at   158234   TRMT10B    13.853            UP          1.10         0.000249
232909_s_at   2186     BPTF       13.834            DOWN        0.92         0.000251
224377_s_at   22931    RAB18      13.805            DOWN        0.90         0.000255
218251_at     58526    MID1IP1    13.784            UP          1.09         0.000257
205546_s_at   7297     TYK2       13.715            UP          1.06         0.000266
203370_s_at   9260     PDLIM7     13.712            UP          1.12         0.000267
222143_s_at   64419    MTMR14     13.697            UP          1.06         0.000269
234644_x_at   NA       NA         13.690            UP          1.10         0.00027
218738_s_at   51444    RNF138     13.659            UP          1.16         0.000274
238612_at     NA       NA         13.631            UP          1.09         0.000278
