Normalization and Integration of Large-Scale Metabolomics Data Using Support Vector Regression

Supplementary for

Supplementary Fig. 1 SVR-based data normalization results in study 1 using different number of top correlated peaks as independent variables (Injection order, 2, 3, 4, 5, 6, 7, 8, 9, 10 and 15 peaks): the RSD mean after applications of SVR-based data normalization method using different number of peaks. Red represent subject samples and blue represent QC samples.

S1 Supplementary Fig. 2 SVR-based method can normalize different kinds of non-linear signal drifts: (a-c) applications of SVR-based data normalization method to fit disorder (a), increasing (b), and increasing (c) signal drifts; (d-f) QC and subject samples with disorder (d), increasing (e), and increasing (f) signal drifts after normalizations. The two red dotted lines are the margin lines of the support vectors.

S2 Supplementary Fig. 3 The comparisons of VIP values, p values (FDR correction), and RSDs of 30 potential metabolite biomarkers before and after SVR normalization: VIP values (a), p values (FDR correction) (b), and RSDs (c).

Supplementary Table 1. The sample run order of study 1. Run order Sample Run order Sample 1 Blank1 125 QC sample 12 2 Blank2 126-133 8×Subject samples 3 Blank3 134 Blank 16 4-13 10×conditioning samples 135 QC sample 13 14 Blank 4 136-143 8×Subject samples 15 QC sample 1 144 Blank 17 16-23 8×Subject samples 145 QC sample 14 24 Blank 5 146-153 8×Subject samples 25 QC sample 2 154 Blank 18 26-33 8×Subject samples 155 QC sample 15 34 Blank 6 156-163 8×Subject samples 35 QC sample 3 164 Blank 19 36-43 8×Subject samples 165 QC sample 16 44 Blank 7 166-173 8×Subject samples 45 QC sample 4 174 Blank 20 46-53 8×Subject samples 175 QC sample 17 54 Blank 8 176-183 8×Subject samples 55 QC sample 5 184 Blank 21 56-63 8×Subject samples 185 QC sample 18 64 Blank 9 186-193 8×Subject samples 65 QC sample 6 194 Blank 22 66-73 8×Subject samples 195 QC sample 19 74 Blank 10 196-203 8×Subject samples 75 QC sample 7 204 Blank 23 76-83 8×Subject samples 205 QC sample 20 84 Blank 11 206-213 8×Subject samples 85 QC sample 8 214 Blank 24 86-93 8×Subject samples 215 QC sample 21 94 Blank 12 216-223 8×Subject samples 95 QC sample 9 224 Blank 25 96-103 8×Subject samples 225 QC sample 22 104 Blank 13 226-233 8×Subject samples

S3 105 QC sample 10 234 Blank 26 106-113 8×Subject samples 235 QC sample 23 114 Blank 14 236-243 8×Subject samples 115 QC sample 11 244 Blank 27 116-123 8×Subject samples 245 QC sample 24 124 Blank 15 246-253 8×Subject samples 254 Blank 28 324 Blank 35 255 QC sample 25 325 QC sample 32 256-263 8×Subject samples 326-333 8×Subject samples 264 Blank 29 334 Blank 36 265 QC sample 26 335 QC sample 33 266-273 8×Subject samples 336-343 8×Subject samples 274 Blank 30 344 Blank 37 275 QC sample 27 345 QC sample 34 276-283 8×Subject samples 346-353 8×Subject samples 284 Blank 31 354 Blank 38 285 QC sample 28 355 QC sample 35 286-293 8×Subject samples 356-363 8×Subject samples 294 Blank 32 364 Blank 39 295 QC sample 29 365 QC sample 36 296-303 8×Subject samples 366-373 8×Subject samples 304 Blank 33 374 Blank 40 305 QC sample 30 375 QC sample 37 306-313 8×Subject samples 376 Blank 41 314 Blank 34 377 Blank 42 315 QC sample 31 378 Blank 43 316-323 8×Subject samples

Supplementary Table 2. The baseline characteristics of esophagus cancer cohort (study 2). Variable Screen negative Screen positive P value Sample number 236 532 --

Age (mean, sd) 54.7 (7.3) 59.1 (7.2) 3.19e-14 Sex (n, %)

Male 110 (46.6) 279 (52.4) --

Female 126 (53.4) 253 (47.6) -- BMI (mean, sd) 24.2 (6.3) 24.7 (3.4) 0.175

Supplementary Table 3. The sample run order of study 2. Run order Sample Run order Sample 1 Blank1 135 QC sample 13 2 Blank2 136-143 8×Subject samples

S4 3 Blank3 144 Blank 17 4-13 10×conditioning samples 145 QC sample 14 14 Blank 4 146-153 8×Subject samples 15 QC sample 1 154 Blank 18 16-23 8×Subject samples 155 QC sample 15 24 Blank 5 156-163 8×Subject samples 25 QC sample 2 164 Blank 19 26-33 8×Subject samples 165 QC sample 16 34 Blank 6 166-173 8×Subject samples 35 QC sample 3 174 Blank 20 36-43 8×Subject samples 175 QC sample 17 44 Blank 7 176-183 8×Subject samples 45 QC sample 4 184 Blank 21 46-53 8×Subject samples 185 QC sample 18 54 Blank 8 186-193 8×Subject samples 55 QC sample 5 194 Blank 22 56-63 8×Subject samples 195 QC sample 19 64 Blank 9 196-203 8×Subject samples 65 QC sample 6 204 Blank 23 66-73 8×Subject samples 205 QC sample 20 74 Blank 10 206-213 8×Subject samples 75 QC sample 7 214 Blank 24 76-83 8×Subject samples 215 QC sample 21 84 Blank 11 216-223 8×Subject samples 85 QC sample 8 224 Blank 25 86-93 8×Subject samples 225 QC sample 22 94 Blank 12 226-233 8×Subject samples 95 QC sample 9 234 Blank 26 96-103 8×Subject samples 235 QC sample 23 104 Blank 13 236-243 8×Subject samples 105 QC sample 10 244 Blank 27 106-113 8×Subject samples 245 QC sample 24 114 Blank 14 246-253 8×Subject samples 115 QC sample 11 254 Blank 28 116-123 8×Subject samples 255 QC sample 25 124 Blank 15 256 Blank 29 125 QC sample 12 257 Blank 30 126-133 8×Subject samples 258 Blank 31 134 Blank 16

Supplementary Table 4. Peaks frequency (%) distribution in different levels of log intensity in QC samples of study 1.

Intensity (log) Raw data SVR

S5 8-9 48.0 76.0 9-10 53.8 89.7 10-11 57.0 93.8 11-12 57.0 93.6 12-13 58.2 87.9 13-14 58.8 91.8 14-15 57.8 84.4 15-16 65.4 96.2 16-17 40.0 80.0 17-18 100 100

Supplementary Table 5. Median distances in the PCA score plots of study 1.

Sum Batch ratio Batch ratio Sample Raw data Linear LOESS SVR (mean) (median) QC sample 22.3 22.4 22.3 22.3 18.9 15.9 1.68 Subject sample 22.9 22.3 22.9 22.9 18.4 19.2 5.12

Supplementary Table 6. The top 30 potential markers selected according to the rank of VIP values in raw data of study 2. Polarity Name mz RT (s) VIP POS M236T212_POS 236.1839 212 2.5260 POS M274T329_POS 274.2744 329 1.9840 POS M404T354_POS 404.2066 354 1.9820 POS M432T354_POS 432.2378 354 1.9386 POS M703T27_POS 702.8632 27 1.8589 POS M445T354_POS 445.2333 354 1.8581 POS M418T354_POS 418.2221 354 1.8435 POS M650T234_POS 650.3954 234 1.8421 POS M651T234_1_POS 650.8972 234 1.8370 POS M672T235_POS 672.4085 235 1.8350 POS M682T240_2_POS 682.0710 240 1.8331 POS M739T243_POS 739.1200 243 1.8316 POS M740T243_1_POS 739.7887 243 1.8268 POS M373T529_POS 373.0991 529 1.8267 POS M695T241_POS 695.0937 241 1.8251 POS M598T435_POS 598.4519 435 1.8201 POS M656T455_POS 656.4936 455 1.8167 POS M696T241_1_POS 695.7623 241 1.8150 POS M710T241_1_POS 710.1036 241 1.8148 POS M540T413_POS 540.4101 413 1.8110 POS M628T232_POS 628.3822 232 1.8076 POS M651T239_POS 651.0672 239 1.8066 POS M629T232_1_POS 628.8840 232 1.8047 POS M267T354_POS 267.1230 354 1.8046 POS M715T476_POS 714.5356 476 1.8037 POS M783T244_POS 783.1462 244 1.7988

S6 POS M784T244_1_POS 783.8149 244 1.7983 POS M839T27_POS 838.8375 27 1.7976 POS M223T340_POS 223.0640 340 1.7900 POS M608T217_POS 608.3845 217 1.7870

Supplementary Table 7. The top 30 potential markers selected according to the rank of VIP values in SVR normalized data of study 2. Polarity Name mz RT (s) VIP POS M236T212_POS 236.1839 212 3.2367 NEG M267T32_NEG 267.0723 32 2.1879 POS M274T329_POS 274.2745 329 2.1847 POS M207T197_POS 207.1493 197 2.0820 POS M223T340_POS 223.0640 340 1.9970 POS M122T167_POS 122.0965 167 1.9640 POS M672T235_POS 672.4085 235 1.9483 POS M650T234_POS 650.3954 234 1.9380 POS M716T237_POS 716.4348 237 1.9365 POS M717T237_1_POS 716.9366 237 1.9342 POS M651T234_1_POS 650.8972 234 1.9335 NEG M279T646_NEG 279.2329 646 1.930 POS M761T240_1_POS 760.9629 240 1.9250 POS M760T240_POS 760.4611 240 1.9241 POS M738T239_POS 738.4479 239 1.9238 POS M784T244_1_POS 783.8149 244 1.9235 POS M783T244_POS 783.1462 244 1.9224 POS M739T243_POS 739.1200 243 1.9173 POS M740T243_1_POS 739.7887 243 1.9142 POS M656T455_POS 656.4936 455 1.9141 POS M782T241_POS 782.4743 241 1.9094 POS M715T476_POS 714.5356 476 1.9067 POS M677T455_POS 677.4222 455 1.9067 POS M598T435_POS 598.4519 435 1.9055 POS M540T413_POS 540.4101 413 1.9025 POS M628T232_POS 628.3822 232 1.9002 POS M608T217_POS 608.3845 217 1.8981 POS M783T241_1_POS 782.9761 241 1.8973 POS M629T232_1_POS 628.8840 232 1.8966 POS M652T220_POS 652.4106 220 1.8926

Supplementary Table 8. The comparison of VIP values, p values and RSDs of 30 potential markers selected in SVR normalized data of study 2 before and after SVR normalization.

S7 VIP VIP -log10(adjusted p) -log10(adjusted p) RSD RSD Name (Raw data) (SVR) (Raw data) (SVR) (Raw data, %) (SVR, %)

S8 M236T212_POS 2.5265 3.2367 57.7 57.8 16.7 1.13 M267T32_NEG 1.0580 2.1879 5.30 31.5 49.4 10.7 M274T329_POS 1.9841 2.1847 35.8 36.6 14.4 2.12 M207T197_POS 1.6555 2.0820 44.5 44.2 19.5 2.26 M223T340_POS 1.7897 1.9970 32.2 25.7 21.4 3.13 M122T167_POS 1.3410 1.9640 5.70 15.3 35.0 1.89 M672T235_POS 1.8351 1.9483 30.9 31.9 11.1 1.04 M650T234_POS 1.8421 1.9380 33.4 33.4 12.0 1.12 M716T237_POS 1.7727 1.9365 29.9 32.0 16.1 1.19 M717T237_1_POS 1.7669 1.9342 32.0 34.1 16.4 1.17 M651T234_1_POS 1.8370 1.9335 36.6 36.7 12.0 1.02 M279T646_NEG 1.1944 1.930 16.3 35.4 60.9 14.8 M761T240_1_POS 1.7246 1.9250 31.9 34.8 18.4 1.26 M760T240_POS 1.7287 1.9241 30.0 32.8 17.9 1.32 M738T239_POS 1.7526 1.9238 29.5 31.6 16.2 1.24 M784T244_1_POS 1.7983 1.9235 35.5 37.0 16.4 1.88 M783T244_POS 1.7988 1.9224 35.4 37.0 16.1 1.95 M739T243_POS 1.8316 1.9173 36.9 38.6 15.9 1.85 M740T243_1_POS 1.8268 1.9142 36.9 38.5 16.0 1.92 M656T455_POS 1.8167 1.9141 31.3 32.1 9.21 3.11 M782T241_POS 1.6664 1.9094 30.1 33.8 22.1 1.35 M715T476_POS 1.8037 1.9067 44.9 44.7 23.3 3.87 M677T455_POS 1.6383 1.9067 35.9 40.5 33.1 5.78 M598T435_POS 1.8201 1.9055 38.9 39.1 7.18 1.88 M540T413_POS 1.8110 1.9025 37.7 37.5 10.0 1.30 M628T232_POS 1.8076 1.9002 33.3 33.4 12.5 0.86 M608T217_POS 1.7874 1.8981 29.5 30.9 8.70 0.88 M783T241_1_POS 1.6644 1.8973 31.3 34.6 22.3 1.19 M629T232_1_POS 1.8047 1.8966 36.4 36.4 12.5 0.91 M652T220_POS 1.7616 1.8926 29.0 30.9 10.7 1.27

Supplementary Methods Sample Preparation. 50 μL of human serum sample was placed in a 96-well plate and extracted with 150 μL of MeOH using Bravo liquid handling system (Agilent Technologies, USA), followed by vortex for 30 seconds and incubation for 2 hours at -20 °C in order to precipitate proteins. Then the 96-well plate was centrifuged at 4,000 rpm for 20 min at 4 °C. The resulting supernatants were transferred to LC-MS vials (Agilent Technologies, USA) and stored at -80 °C prior to LC-MS analysis. LC-MS Analysis. The LC-MS analyses were performed using a UHPLC system (1290 series, Agilent Technologies, USA) coupled to a quadruple time-of-flight (Q-TOF) mass spectrometer (Agilent 6550 iFunnel Q-TOF, Agilent Technologies, USA). Waters ACQUITY UPLC HSS T3 columns (particle size, 1.8 μm; 100 mm (length) × 2.1 mm (i.d.)) were used for the LC separation. The column was maintained at 25 °C during the LC-MS runs. The mobile phases A = 0.1% FA in water

(positive mode) or 0.5 mM NH4F in water (negative mode), and B = 0.1% FA in ACN (positive mode) or 100% ACN

S9 (negative mode) were used. The linear gradient eluted from 1% B for 1 min, 1% B to 100% B (1 - 8 min), 100% B (8 - 10 min), 100% B to 1% B (10 - 10.1 min), then stayed at 1% B for 1.9 minutes to equilibrate the column. The total time is 12 minutes for each analysis. The flow rate was 0.5 mL/min and the sample injection volume was 6 μL. ESI source parameters were set as follows: sheath gas temperature, 400 °C; dry gas temperature, 250 °C; sheath gas flow, 12 L/min; dry gas flow, 16 L/min; capillary voltage, 3000 V or -3000 V in positive or negative modes, respectively; nozzle voltage, 0 V; and nebulizer pressure, 20 psi or 40 psi in positive or negative modes, respectively. The MS1 data acquisition rate was set as 4 spectra/s, and the TOF mass range was set as m/z 50 – 1200 Da. Data Analysis. MS raw data (.d) files were converted to the mzXML format using ProteoWizard, and processed by XCMS. Part of XCMS processing parameters were optimized and set as follows: mass accuracy in peak detection = 10 ppm; peak width c = (5, 20); snthresh = 6; bw = 10; minfrac = 0.5. R package CAMERA was used for peak annotation after XCMS data processing. Metabolic features detected less than 80% in all the QC samples were discarded. Only metabolite ion peaks annotated by CAMERA were selected for the subsequent statistical analyses. The data normalization and statistical analyses were performed using R (version 3.1.1), except that the orthogonal partial least squares (OPLS) analysis was performed using SIMCA-P 14 (Umetrics AB, Umea, Sweden). Double cross validated PLS-DA and ROC analyses. PLS-DA with double cross validation was used to construct receiver operating characteristic curves (ROCs) for raw and SVR normalized datasets. As shown in Supplementary Fig. 5, double cross validation according to the literature (Rosenberg et al. 2010) with slight modifications is used to optimize the PLS model parameters. In the outer loop, 20% of samples are randomly selected as validation dataset, and the rest 80% samples are the inner loop, which are used to optimize the PLS model and to select the appropriate number of components. In the inner loop, a 10-fold cross validation is used (R package: pls). The mean squared error of prediction (MSEP) is used as the criteria to select the appropriate number of components. The whole process consists as one analysis, and the 80%/20% random selection and analysis repeated for 500 times. All of the prediction results from the 500 repetitive analyses are summaried for ROC analyse showed in Fig. 6c.

S10 Supplementary Fig. 4 Scheme of double cross validation to optimize the PLS model parameters.

Other Normalization Methods Used. Two other commonly used QC-based normalization methods, linear and LOESS normalizations were selected for comparison purposes and implemented as follows. The linear and LOESS models were constructed using the functions lm and loess in R. (1) Linear normalization The single value regression method assumes that the signal shift among QC samples is linear. First, a linear model of each peak in QC samples was first constructed respect to the injection order of the QC samples. Second, the linear model built for each peak was used to predict the intensities of the same peak in subject samples by assuming that the intensity drifts of the QC samples and subject samples were similarly affected by instrument drift. Finally, the intensities of each peak in the subject samples were divided by the predictive peak intensities for normalization to remove unwanted intensity drifts and analytical variations during data acquisition. Peaks in QC samples were normalized in the same way. (2) LOESS normalization First, a LOESS model of each peak in QC samples was constructed respect to the injection order of the QC samples. In this step, there were two parameters that were optimized using leave one out cross validation (LOO) method. First parameter was the smoothing parameter known as “span”, it controlled the proportion of peaks in QC samples for each local regression. The second parameter was “degree”, which constrained the degree (first or second degree) in each local QC sample. We optimized two parameters for each peak in the procedure of model constructing. Second, the LOESS model built for each peak was used to predict the intensities of the same peak in subject samples. Finally, the intensities of each peak in the subject

S11 samples were divided by the predictive peak intensities for normalization. Peaks in QC samples were normalized in the same way.

Reference Rosenberg, L. H., Franzen, B., Auer, G., Lehtio, J., & Forshed, J. (2010). Multivariate meta-analysis of proteomics data from human prostate and colon tumours. BMC Bioinformatics, 11, 468.

S12