Two Alternative Data-Splitting Numerous Hypothesis Tests

Two alternative data-splitting

Numerous hypothesis tests were performed in this study. To reduce false positives due to multiple testing, we sought results that not only had extremely small p values but could also be found robustly in different subsets of our dataset. To do so, we split our dataset into discovery and validation sets, as shown in Table 1. We then performed two additional data splits on the same dataset. The results shown in the main text are from the original data split; those from the two alternative splits are presented here in the Supporting Information. The results from the three data splits showed consistent patterns (Tables S1-S3). Therefore, it is unlikely that our findings are simply a chance aberration due to a particular split.

Analyses of batch effect

During data preprocessing, normalization accounting for the batch effect was performed. Even so, an association between batch and cigarette smoking could still confound the results. We therefore repeated the genome-wide analyses and the focal copy number analyses with adjustment for the batch effect. The batch-adjusted analyses corresponding to Figures 1A and 1B and to Figures 1C and 1D are shown in Tables S4 and S5, respectively. For the focal copy number-smoking association, 1,000 loci were randomly selected for both batch-adjusted and unadjusted analyses, and the p values from the two analyses were compared to see whether they followed similar distributions. Such comparisons were made for both single-marker analyses (Figure S1A) and 10-marker analyses (Figure S1B). Overall, adjusting for the batch effect did not change the pattern of the results or our conclusions.
Multiple-marker analyses of cigarette smoking and copy number: whole-genome, chromosome, and 10-marker focal locus

We developed three methods to compare genome-wide or chromosome/arm-specific copy number patterns between heavy smokers and light/non-smokers, and one method to test the association of chromosome/arm-specific or focal CNs with smoking pack-years.

First, we calculated the total number of copy number gain and loss events and compared them between the two smoking groups with a two-sided t test. This provides a convenient summary index but collapses the CNA information across genomic locations.

The second method applies two-sample tests for continuous copy number by calculating the standardized difference of the two average copy numbers at each locus:

$$c_i = \frac{m_{1i} - m_{2i}}{\sqrt{v_{1i}/n_1 + v_{2i}/n_2}}$$

where $m_{ji}$ and $v_{ji}$ are the estimated mean and variance, respectively, of copy number for group $j$ at locus $i$, and $n_j$ is the sample size in group $j$. We summed $c_i^2$ over the loci $i$ in a chromosome arm to calculate the observed total standardized squared difference ($C_{observed}$). By permuting the two groups and repeating the above procedure 10,000 times, we obtained a non-parametric null distribution ($C_{null}$). P values were then obtained by comparing $C_{observed}$ with $C_{null}$. This permutation procedure provides a valid global test of the overall difference that accounts for multiple comparisons and for the correlation of CNAs between loci.

The third method extends this approach to the discrete CNA variables (CN≥2.7 or not; CN≤1.3 or not) mentioned above, with the advantage of testing gains and losses separately. We applied two-sample tests for binomial data by calculating the standardized difference of the two proportions at each locus:

$$d_i = \frac{p_{1i} - p_{2i}}{\sqrt{p_{1i}(1 - p_{1i})/n_1 + p_{2i}(1 - p_{2i})/n_2}}$$

where $p_{ji}$ is the estimated proportion (stabilized by adding 0.5 in the numerator) of CN gains (or losses) for group $j$ at locus $i$, and $n_j$ is the sample size in group $j$.
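The permutation test for continuous copy number described above (squared standardized differences summed over loci, compared with a null built by permuting group labels) might be sketched as follows. This is a minimal illustration under stated assumptions, not the code used in the study: the function names, the `cn` matrix layout (samples in rows, loci in columns), the boolean `is_heavy` vector, and the choice to report the p value as the plain proportion of permuted statistics at least as large as the observed one are all hypothetical.

```python
import numpy as np

def sum_sq_std_diff(cn, is_heavy):
    """Sum over loci of c_i^2, where c_i is the standardized mean difference."""
    g1, g2 = cn[is_heavy], cn[~is_heavy]
    n1, n2 = g1.shape[0], g2.shape[0]
    # Per-locus standard error: sqrt(v_1i / n_1 + v_2i / n_2)
    se = np.sqrt(g1.var(axis=0, ddof=1) / n1 + g2.var(axis=0, ddof=1) / n2)
    c = (g1.mean(axis=0) - g2.mean(axis=0)) / se
    return float(np.sum(c ** 2))

def permutation_p_value(cn, is_heavy, n_perm=10_000, seed=0):
    """Return (observed statistic, permutation p value)."""
    rng = np.random.default_rng(seed)
    observed = sum_sq_std_diff(cn, is_heavy)
    # Shuffle the group labels; rows of cn stay intact, so the correlation
    # between loci is preserved in every permuted data set.
    null = np.array([
        sum_sq_std_diff(cn, rng.permutation(is_heavy))
        for _ in range(n_perm)
    ])
    # Assumption: upper-tail proportion of the null at or above the observed value.
    p_value = float(np.mean(null >= observed))
    return observed, p_value
```

Because whole samples are permuted rather than individual loci, one statistic per arm is tested instead of one per locus. The same skeleton would presumably serve the proportion-based statistic by swapping in a function that computes $d_i$ from the dichotomized copy numbers.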
We summed $d_i^2$ over the loci $i$ in a chromosome arm to calculate the observed total standardized squared difference ($D_{observed}$). The non-parametric null distribution ($D_{null}$) was approximated by 10,000 permutations, and p values were obtained by comparing $D_{observed}$ with $D_{null}$.

The above three methods require smoking exposure to be dichotomized. To fully capture the continuous dose-response relationship between copy number and pack-years of cigarette smoking, we also developed a test that summarizes this association in a chromosome/arm-specific fashion. We obtained the test statistic

$$F_{observed} = \sum_i f_i$$

where $f_i$ is the F statistic from regressing continuous copy number on smoking pack-years (square-root transformed) up to the quadratic term at locus $i$. Again, the non-parametric null distribution ($F_{null}$) was generated by 10,000 permutations, and p values were obtained as the tail probability of $F_{observed}$ in $F_{null}$.

The proposed test is equivalent to the score test for the variance of the coefficients in a multivariate regression in which the copy numbers of a region or chromosome (as a vector) are regressed on smoking pack-years and the regression coefficients are assumed to follow an arbitrary distribution with mean 0 and variance τ (1). The null hypothesis is that all coefficients relating pack-years to copy numbers are zero, or equivalently that copy numbers at all loci have no association with smoking pack-years, which corresponds to H0: τ=0. The alternative hypothesis is that copy numbers at some loci are associated with smoking pack-years. This variance component test gains power by borrowing information across multiple markers, accounting for the correlation among the CNVs in a marker set, and reducing the degrees of freedom of the test.

To investigate the association of focal copy numbers and smoking pack-years, we analyzed copy number >2 and ≤2 separately.
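The pack-years test described above (per-locus F statistics from a quadratic regression on square-root pack-years, summed and compared with a permutation null) could be sketched as below. This is an illustrative sketch only: the function names, the array layout (samples in rows, loci in columns), and the use of an ordinary least-squares overall F statistic with an intercept are assumptions rather than details given in the text.

```python
import numpy as np

def sum_of_f(cn, pack_years):
    """Sum over loci of the overall F statistic for cn_i ~ x + x^2, x = sqrt(pack-years)."""
    x = np.sqrt(pack_years)
    design = np.column_stack([np.ones_like(x), x, x ** 2])
    n, k = design.shape
    # Fit every locus at once: beta has one column of coefficients per locus.
    beta, *_ = np.linalg.lstsq(design, cn, rcond=None)
    rss = np.sum((cn - design @ beta) ** 2, axis=0)
    tss = np.sum((cn - cn.mean(axis=0)) ** 2, axis=0)
    # F statistic comparing the quadratic model with the intercept-only model.
    f = ((tss - rss) / (k - 1)) / (rss / (n - k))
    return float(np.sum(f))

def sum_of_f_p_value(cn, pack_years, n_perm=10_000, seed=0):
    """Return (observed statistic, permutation p value, null draws)."""
    rng = np.random.default_rng(seed)
    observed = sum_of_f(cn, pack_years)
    # Permute pack-years across samples; the copy number rows are untouched.
    null = np.array([
        sum_of_f(cn, rng.permutation(pack_years))
        for _ in range(n_perm)
    ])
    # Assumption: p value is the upper-tail proportion of the null draws.
    return observed, float(np.mean(null >= observed)), null
```

Returning the null draws as well keeps them available for any later approximation of the null distribution when the observed statistic exceeds every permuted value.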
Two outliers (>2.5 standard deviations) of smoking pack-years were excluded so that they could not drive spurious results. Smoking pack-years were square-root transformed to bring a right-skewed distribution closer to normal (2). Both multiple-marker and single-marker analyses were performed. In the multiple-marker analyses, we grouped ten consecutive SNPs (markers) as a set and calculated $F_{observed}$ and $F_{null}$ by the methods described above to obtain a p value for each marker set. For sets with $F_{observed}$ much greater than $F_{null}$ (from 10,000 permutations), the null distribution was obtained by the Satterthwaite approximation (3,4), in which the first two moments of a scaled χ² distribution were matched with those of $F_{null}$. In total, we performed 25,655 hypothesis tests each for copy number >2 and ≤2. Such multiple-marker analyses have better statistical power than single-marker analyses when the markers are correlated, as is the case for copy number data.

Single-marker analyses of cigarette smoking and focal copy number: the dichotomous version

We also pursued single-marker analyses in a dichotomous fashion. We cross-tabulated copy number (≥2.7 vs. 1.3-2.7 for gains; ≤1.3 vs. 1.3-2.7 for losses) against smoking pack-years (>60 vs. ≤60) and tested their association with the Fisher exact test at 256,554 loci. Logistic regressions were also performed with adjustment for age at diagnosis, gender, the two cohorts, clinical stage and histology. The dichotomous analyses show that copy number gains at 15 loci and copy number losses at 6 loci are associated with heavy smoking (Figure S9A and Table S7). These candidate loci clustered in 8q24, 12q21, 12q23 and 17q22 for gains (Figures S9B-E) and in 8p12 and 8p23 for losses (Figures S9F and G).

References

1. Liu D, Lin X, Ghosh D (2007) Semiparametric regression of multidimensional genetic pathway data: least-squares kernel machines and linear mixed models. Biometrics 63, 1079-1088.
2. Zhou W, Liu G, Park S, Wang Z, Wain JC, et al. (2005) Gene-smoking interaction associations for the ERCC1 polymorphisms in the risk of lung cancer. Cancer Epidemiol Biomarkers Prev 14, 491-496.
3. Satterthwaite FE (1946) An approximate distribution of estimates of variance components. Biometrics Bulletin 2, 110-114.
4. Kwee L, Liu D, Lin X, Ghosh D, Epstein M (2008) A powerful and flexible multilocus association test for quantitative traits. Am J of Human Genetics 82, 386-397.

Table S1. P values comparing the % of probes with CN≥2.7 (or <1.3) between heavy smokers and non-/light smokers, for the three data splits.

                                    Main text   Alternative 1   Alternative 2
% of probes with copy number ≥2.7
  Discovery set                     0.008       0.0076          0.00467
  Validation set                    0.0095      0.0108          0.0147
  Both sets                         0.000246
% of probes with copy number <1.3
  Discovery set                     0.44        0.88            0.64
  Validation set                    0.97        0.29            0.25
  Both sets                         0.61

Table S2. P values testing whether the mean G/T ratio differs from that expected at random (40.64%), for the three data splits. NS/LS: non-smokers/light smokers; HS: heavy smokers.

                            Main text    Alternative 1   Alternative 2
Copy number gain
  Discovery set    NS/LS    0.8          0.11            0.095
                   HS       0.59         0.16            0.87
  Validation set   NS/LS    0.007        0.38            0.41
                   HS       0.023        0.32            0.05
  Both sets        NS/LS    0.08
                   HS       0.083
Copy number loss
  Discovery set    NS/LS    0.011        9.42×10⁻⁴       6.45×10⁻⁵
                   HS       0.78         0.562           0.6
  Validation set   NS/LS    9.80×10⁻⁴    0.0165          0.071
                   HS       0.31         0.097           0.35
  Both sets        NS/LS    5.15×10⁻⁵
                   HS       0.32

Table S3. P values of the eleven loci presented in Table S6 from the single-marker CN-smoking analyses, for the three data splits.
