Linking Disease Associations with Regulatory Information in the Human Genome: Supplementary Material

Marc A. Schaub, Alan P. Boyle, Anshul Kundaje, Serafim Batzoglou, Michael Snyder

Supplementary Tables

Supplementary Table 1: RegulomeDB scoring scheme. Lower scores indicate stronger evidence that the SNP lies in a regulatory region. This scoring scheme is adapted from Boyle et al. 2012, with modifications described in Supplementary Methods.

| RegulomeDB score | Supporting evidence |
|---|---|
| 2a | ChIP-seq peak, matched DNaseI footprint, matched motif, DNaseI peak |
| 2b | ChIP-seq peak, DNaseI footprint and peak, and motif |
| 2c | ChIP-seq peak, matched motif and DNaseI peak |
| 3a | ChIP-seq peak, DNaseI peak and motif |
| 3b | ChIP-seq peak and matched motif |
| 4 | ChIP-seq and DNaseI peak |
| 5a | ChIP-seq peak only |
| 5b | DNaseI peak only |
| 6 | Motif only |
| 7 | No annotation |
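The scoring scheme in Supplementary Table 1 can be read as an ordered set of rules. The following is a minimal sketch of that reading, not the RegulomeDB implementation: the function name and its boolean arguments are hypothetical, and the handling of evidence combinations that the table does not list (for example a DNaseI peak with a motif but no ChIP-seq peak) is an assumption.

```python
def score_from_evidence(chip_peak, dnase_peak, footprint=False, motif=False,
                        matched_motif=False, matched_footprint=False):
    """Return the modified score label of Supplementary Table 1 (eQTLs are handled separately).

    Assumption: a matched motif or footprint also counts as a motif or footprint,
    and combinations not listed in the table fall through to the next listed category.
    """
    motif = motif or matched_motif
    footprint = footprint or matched_footprint
    if chip_peak and dnase_peak:
        if matched_motif and matched_footprint:
            return "2a"
        if motif and footprint:
            return "2b"
        if matched_motif:
            return "2c"
        if motif:
            return "3a"
        return "4"
    if chip_peak:
        return "3b" if matched_motif else "5a"
    if dnase_peak:
        return "5b"
    if motif:
        return "6"
    return "7"
```

For example, `score_from_evidence(True, True, motif=True)` corresponds to the "ChIP-seq peak, DNaseI peak and motif" row of the table.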

Supplementary Table 2: Fraction of associations overlapping functional regions for different linkage disequilibrium thresholds. Only functional SNPs that are in linkage disequilibrium at or above the indicated threshold in all HapMap 2 populations are used. Cells give the count, followed by the percentage of the total in parentheses.

| Annotation | Lead | Perfect LD | r² ≥ 0.9 | r² ≥ 0.8 |
|---|---|---|---|---|
| Total | 4724 | 4724 | 4724 | 4724 |
| Predicted motif | 1335 (28.3%) | 1656 (35.1%) | 1947 (41.2%) | 2181 (46.2%) |
| DNaseI hypersensitivity peak | 1714 (36.3%) | 1979 (41.9%) | 2189 (46.3%) | 2374 (50.3%) |
| DNaseI footprint | 355 (7.5%) | 471 (10.0%) | 600 (12.7%) | 718 (15.2%) |
| ChIP-seq peak | 938 (19.9%) | 1144 (24.2%) | 1318 (27.9%) | 1491 (31.6%) |
| Coding | 223 (4.7%) | 268 (5.7%) | 306 (6.5%) | 345 (7.3%) |
| Exon, non-coding | 146 (3.1%) | 165 (3.5%) | 202 (4.3%) | 232 (4.9%) |
| RegulomeDB Score 2 (total) | 126 (2.7%) | 165 (3.5%) | 197 (4.2%) | 248 (5.2%) |
| RegulomeDB Score 2a | 14 (0.3%) | 16 (0.3%) | 21 (0.4%) | 26 (0.6%) |
| RegulomeDB Score 2b | 109 (2.3%) | 145 (3.1%) | 169 (3.6%) | 212 (4.5%) |
| RegulomeDB Score 2c | 3 (0.1%) | 4 (0.1%) | 7 (0.1%) | 10 (0.2%) |
| RegulomeDB Score 3 (total) | 58 (1.2%) | 75 (1.6%) | 84 (1.8%) | 89 (1.9%) |
| RegulomeDB Score 3a | 55 (1.2%) | 69 (1.5%) | 78 (1.7%) | 83 (1.8%) |
| RegulomeDB Score 3b | 3 (0.1%) | 6 (0.1%) | 6 (0.1%) | 6 (0.1%) |
| RegulomeDB Score 4 | 436 (9.2%) | 492 (10.4%) | 538 (11.4%) | 571 (12.1%) |
| RegulomeDB Score 5 (total) | 1110 (23.5%) | 1192 (25.2%) | 1240 (26.2%) | 1254 (26.5%) |
| RegulomeDB Score 5a | 218 (4.6%) | 246 (5.2%) | 270 (5.7%) | 289 (6.1%) |
| RegulomeDB Score 5b | 892 (18.9%) | 946 (20.0%) | 970 (20.5%) | 965 (20.4%) |
| RegulomeDB Score 6 | 933 (19.8%) | 908 (19.2%) | 873 (18.5%) | 838 (17.7%) |
| RegulomeDB Score 7 | 1692 (35.8%) | 1459 (30.9%) | 1284 (27.2%) | 1147 (24.3%) |
| eQTL | 462 (9.8%) | 514 (10.9%) | 556 (11.8%) | 597 (12.6%) |
| eQTL + Predicted motif | 113 (2.4%) | 206 (4.4%) | 275 (5.8%) | 329 (7.0%) |
| eQTL + DNaseI hypersensitivity peak | 201 (4.3%) | 300 (6.4%) | 359 (7.6%) | 413 (8.7%) |
| eQTL + DNaseI footprint | 40 (0.8%) | 89 (1.9%) | 129 (2.7%) | 165 (3.5%) |
| eQTL + ChIP-seq peak | 118 (2.5%) | 197 (4.2%) | 247 (5.2%) | 306 (6.5%) |
| eQTL + RegulomeDB Score 2 (total) | 17 (0.4%) | 28 (0.6%) | 34 (0.7%) | 45 (1.0%) |
| eQTL + RegulomeDB Score 2a | 1 (0.0%) | 1 (0.0%) | 3 (0.1%) | 6 (0.1%) |
| eQTL + RegulomeDB Score 2b | 16 (0.3%) | 27 (0.6%) | 30 (0.6%) | 38 (0.8%) |
| eQTL + RegulomeDB Score 2c | 0 (0.0%) | 0 (0.0%) | 1 (0.0%) | 1 (0.0%) |
| eQTL + RegulomeDB Score 3 (total) | 5 (0.1%) | 11 (0.2%) | 12 (0.3%) | 14 (0.3%) |
| eQTL + RegulomeDB Score 3a | 4 (0.1%) | 10 (0.2%) | 11 (0.2%) | 13 (0.3%) |
| eQTL + RegulomeDB Score 3b | 1 (0.0%) | 1 (0.0%) | 1 (0.0%) | 1 (0.0%) |
| eQTL + RegulomeDB Score 4 | 56 (1.2%) | 77 (1.6%) | 87 (1.8%) | 94 (2.0%) |
| eQTL + RegulomeDB Score 5 (total) | 117 (2.5%) | 131 (2.8%) | 137 (2.9%) | 130 (2.8%) |
| eQTL + RegulomeDB Score 5a | 27 (0.6%) | 30 (0.6%) | 33 (0.7%) | 36 (0.8%) |
| eQTL + RegulomeDB Score 5b | 90 (1.9%) | 101 (2.1%) | 104 (2.2%) | 94 (2.0%) |
| eQTL + RegulomeDB Score 6 | 200 (4.2%) | 159 (3.4%) | 144 (3.0%) | 136 (2.9%) |

Supplementary Table 3: Fraction of associations overlapping functional regions for different linkage disequilibrium thresholds (European populations). Functional SNPs that are in linkage disequilibrium with the lead SNP at or above the indicated threshold in the HapMap 2 CEU population are used. Only associations that were identified or replicated in populations of European descent are used.

Cells give the count, followed by the percentage of the total in parentheses.

| Annotation | Lead | Perfect LD | r² ≥ 0.9 | r² ≥ 0.8 |
|---|---|---|---|---|
| Total | 2461 | 2461 | 2461 | 2461 |
| Predicted motif | 729 (29.6%) | 1470 (59.7%) | 1678 (68.2%) | 1846 (75.0%) |
| DNaseI hypersensitivity peak | 959 (39.0%) | 1527 (62.0%) | 1709 (69.4%) | 1846 (75.0%) |
| DNaseI footprint | 213 (8.7%) | 614 (24.9%) | 755 (30.7%) | 935 (38.0%) |
| ChIP-seq peak | 527 (21.4%) | 1107 (45.0%) | 1310 (53.2%) | 1492 (60.6%) |
| Coding | 135 (5.5%) | 266 (10.8%) | 333 (13.5%) | 396 (16.1%) |
| Exon, non-coding | 86 (3.5%) | 186 (7.6%) | 223 (9.1%) | 276 (11.2%) |
| RegulomeDB Score 2 (total) | 84 (3.4%) | 204 (8.3%) | 240 (9.8%) | 285 (11.6%) |
| RegulomeDB Score 2a | 9 (0.4%) | 23 (0.9%) | 23 (0.9%) | 31 (1.3%) |
| RegulomeDB Score 2b | 73 (3.0%) | 174 (7.1%) | 207 (8.4%) | 245 (10.0%) |
| RegulomeDB Score 2c | 2 (0.1%) | 7 (0.3%) | 10 (0.4%) | 9 (0.4%) |
| RegulomeDB Score 3 (total) | 28 (1.1%) | 67 (2.7%) | 87 (3.5%) | 92 (3.7%) |
| RegulomeDB Score 3a | 27 (1.1%) | 65 (2.6%) | 85 (3.5%) | 90 (3.7%) |
| RegulomeDB Score 3b | 1 (0.0%) | 2 (0.1%) | 2 (0.1%) | 2 (0.1%) |
| RegulomeDB Score 4 | 245 (10.0%) | 367 (14.9%) | 392 (15.9%) | 397 (16.1%) |
| RegulomeDB Score 5 (total) | 579 (23.5%) | 609 (24.7%) | 597 (24.3%) | 544 (22.1%) |
| RegulomeDB Score 5a | 107 (4.3%) | 175 (7.1%) | 187 (7.6%) | 175 (7.1%) |
| RegulomeDB Score 5b | 472 (19.2%) | 434 (17.6%) | 410 (16.7%) | 369 (15.0%) |
| RegulomeDB Score 6 | 495 (20.1%) | 382 (15.5%) | 310 (12.6%) | 263 (10.7%) |
| RegulomeDB Score 7 | 0 (0.0%) | 380 (15.4%) | 279 (11.3%) | 208 (8.5%) |
| eQTL | 244 (9.9%) | 373 (15.2%) | 419 (17.0%) | 483 (19.6%) |
| eQTL + Predicted motif | 68 (2.8%) | 270 (11.0%) | 333 (13.5%) | 409 (16.6%) |
| eQTL + DNaseI hypersensitivity peak | 112 (4.6%) | 294 (11.9%) | 361 (14.7%) | 439 (17.8%) |
| eQTL + DNaseI footprint | 30 (1.2%) | 168 (6.8%) | 216 (8.8%) | 288 (11.7%) |
| eQTL + ChIP-seq peak | 64 (2.6%) | 234 (9.5%) | 305 (12.4%) | 390 (15.8%) |
| eQTL + RegulomeDB Score 2 (total) | 14 (0.6%) | 32 (1.3%) | 33 (1.3%) | 41 (1.7%) |
| eQTL + RegulomeDB Score 2a | 0 (0.0%) | 8 (0.3%) | 6 (0.2%) | 5 (0.2%) |
| eQTL + RegulomeDB Score 2b | 14 (0.6%) | 23 (0.9%) | 27 (1.1%) | 36 (1.5%) |
| eQTL + RegulomeDB Score 2c | 0 (0.0%) | 1 (0.0%) | 0 (0.0%) | 1 (0.0%) |
| eQTL + RegulomeDB Score 3 (total) | 2 (0.1%) | 12 (0.5%) | 19 (0.8%) | 23 (0.9%) |
| eQTL + RegulomeDB Score 3a | 2 (0.1%) | 11 (0.4%) | 18 (0.7%) | 23 (0.9%) |
| eQTL + RegulomeDB Score 3b | 0 (0.0%) | 1 (0.0%) | 1 (0.0%) | 0 (0.0%) |
| eQTL + RegulomeDB Score 4 | 27 (1.1%) | 46 (1.9%) | 55 (2.2%) | 59 (2.4%) |
| eQTL + RegulomeDB Score 5 (total) | 60 (2.4%) | 68 (2.8%) | 64 (2.6%) | 43 (1.7%) |
| eQTL + RegulomeDB Score 5a | 13 (0.5%) | 18 (0.7%) | 21 (0.9%) | 12 (0.5%) |
| eQTL + RegulomeDB Score 5b | 47 (1.9%) | 50 (2.0%) | 43 (1.7%) | 31 (1.3%) |
| eQTL + RegulomeDB Score 6 | 104 (4.2%) | 56 (2.3%) | 39 (1.6%) | 31 (1.3%) |

Supplementary Table 4: Enrichment of GWAS associations for different types of functional evidence. Entries with significant enrichment (P-value < 0.05) are represented in bold. SNPs are filtered as described in Methods, resulting in smaller sets than in Supplementary Table 2. Only functional SNPs that are in linkage disequilibrium at or above the indicated threshold in all HapMap 2 populations are used. Obs. = observed count (percentage); Exp. = expected percentage; Enr. = enrichment; P = P-value.

| Annotation | Lead SNPs: Obs. | Exp. | Enr. | P | Perfect LD: Obs. | Exp. | Enr. | P | r² ≥ 0.9: Obs. | Exp. | Enr. | P |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Total | 2364 | | | | 2364 | | | | 2364 | | | |
| Predicted motif | 688 (29.1%) | 27.7% | 1.05 | 0.087 | 855 (36.2%) | 37.5% | 0.97 | 0.164 | 1011 (42.8%) | 44.3% | 0.96 | 0.119 |
| DNaseI hypersensitivity peak | 810 (34.3%) | 30.5% | 1.12 | 1.3E-04 | 942 (39.8%) | 37.2% | 1.07 | 0.009 | 1066 (45.1%) | 42.3% | 1.07 | 0.009 |
| DNaseI footprint | 178 (7.5%) | 6.2% | 1.22 | 0.005 | 237 (10.0%) | 9.0% | 1.12 | 0.063 | 307 (13.0%) | 11.5% | 1.13 | 0.047 |
| ChIP-seq peak | 446 (18.9%) | 15.1% | 1.25 | 1.3E-06 | 546 (23.1%) | 20.5% | 1.13 | 0.001 | 642 (27.2%) | 24.9% | 1.09 | 0.012 |
| RegulomeDB Score 2 | 63 (2.7%) | 2.0% | 1.36 | 0.015 | 79 (3.3%) | 2.7% | 1.22 | 0.086 | 97 (4.1%) | 3.4% | 1.22 | 0.041 |
| RegulomeDB Score ≤ 3 | 86 (3.6%) | 2.8% | 1.32 | 0.012 | 107 (4.5%) | 3.9% | 1.15 | 0.149 | 130 (5.5%) | 4.9% | 1.13 | 0.147 |
| RegulomeDB Score ≤ 4 | 290 (12.3%) | 9.5% | 1.29 | 3.6E-05 | 342 (14.5%) | 12.2% | 1.19 | 0.002 | 393 (16.6%) | 14.3% | 1.16 | 0.002 |
| RegulomeDB Score ≤ 5 | 839 (35.5%) | 31.6% | 1.12 | 1.3E-04 | 950 (40.2%) | 36.8% | 1.09 | 0.001 | 1040 (44.0%) | 40.3% | 1.09 | 0.001 |
| RegulomeDB Score ≤ 6 | 1326 (56.1%) | 51.1% | 1.10 | 3.5E-07 | 1441 (61.0%) | 57.0% | 1.07 | 7.1E-05 | 1514 (64.0%) | 60.4% | 1.06 | 1.3E-04 |
| eQTL | 191 (8.1%) | 6.1% | 1.33 | 1.0E-04 | 213 (9.0%) | 7.1% | 1.26 | 0.001 | 234 (9.9%) | 8.0% | 1.24 | 0.001 |
| eQTL + Predicted motif | 53 (2.2%) | 1.6% | 1.44 | 0.015 | 84 (3.6%) | 3.3% | 1.09 | 0.460 | 114 (4.8%) | 4.5% | 1.07 | 0.482 |
| eQTL + DNaseI hypersensitivity peak | 91 (3.8%) | 2.5% | 1.57 | 4.7E-05 | 129 (5.5%) | 4.0% | 1.38 | 0.001 | 159 (6.7%) | 5.1% | 1.32 | 0.001 |
| eQTL + DNaseI footprint | 20 (0.8%) | 0.5% | 1.63 | 0.021 | 39 (1.6%) | 1.3% | 1.30 | 0.096 | 58 (2.5%) | 2.0% | 1.24 | 0.118 |
| eQTL + ChIP-seq peak | 52 (2.2%) | 1.3% | 1.68 | 1.3E-04 | 84 (3.6%) | 2.6% | 1.37 | 0.007 | 111 (4.7%) | 3.6% | 1.30 | 0.007 |
| eQTL + RegulomeDB Score 2 | 10 (0.4%) | 0.2% | 2.40 | 0.003 | 14 (0.6%) | 0.4% | 1.63 | 0.068 | 18 (0.8%) | 0.5% | 1.63 | 0.033 |
| eQTL + RegulomeDB Score ≤ 3 | 13 (0.5%) | 0.2% | 2.36 | 0.002 | 19 (0.8%) | 0.5% | 1.69 | 0.045 | 23 (1.0%) | 0.6% | 1.53 | 0.052 |
| eQTL + RegulomeDB Score ≤ 4 | 32 (1.4%) | 0.8% | 1.63 | 0.002 | 44 (1.9%) | 1.3% | 1.38 | 0.025 | 57 (2.4%) | 1.7% | 1.44 | 0.005 |
| eQTL + RegulomeDB Score ≤ 5 | 85 (3.6%) | 2.3% | 1.54 | 2.4E-04 | 109 (4.6%) | 3.2% | 1.45 | 3.5E-04 | 122 (5.2%) | 3.6% | 1.43 | 2.9E-04 |
| eQTL + RegulomeDB Score ≤ 6 | 162 (6.9%) | 5.2% | 1.31 | 6.5E-04 | 169 (7.1%) | 5.6% | 1.28 | 0.002 | 176 (7.4%) | 5.7% | 1.30 | 0.001 |

| Annotation | r² ≥ 0.8: Obs. | Exp. | Enr. | P | r² ≥ 0.5: Obs. | Exp. | Enr. | P |
|---|---|---|---|---|---|---|---|---|
| Total | 2364 | | | | 2364 | | | |
| Predicted motif | 1144 (48.4%) | 50.6% | 0.96 | 0.032 | 1467 (62.1%) | 64.9% | 0.96 | 0.006 |
| DNaseI hypersensitivity peak | 1171 (49.5%) | 47.4% | 1.05 | 0.038 | 1461 (61.8%) | 60.1% | 1.03 | 0.077 |
| DNaseI footprint | 368 (15.6%) | 14.3% | 1.09 | 0.077 | 554 (23.4%) | 22.6% | 1.04 | 0.283 |
| ChIP-seq peak | 727 (30.8%) | 29.5% | 1.04 | 0.168 | 988 (41.8%) | 41.8% | 1.00 | 0.958 |
| RegulomeDB Score 2 | 125 (5.3%) | 4.0% | 1.33 | 0.001 | 174 (7.4%) | 5.9% | 1.24 | 0.004 |
| RegulomeDB Score ≤ 3 | 162 (6.9%) | 5.8% | 1.18 | 0.038 | 231 (9.8%) | 8.7% | 1.13 | 0.063 |
| RegulomeDB Score ≤ 4 | 449 (19.0%) | 16.4% | 1.16 | 0.002 | 573 (24.2%) | 22.2% | 1.09 | 0.022 |
| RegulomeDB Score ≤ 5 | 1098 (46.4%) | 43.7% | 1.06 | 0.005 | 1239 (52.4%) | 50.9% | 1.03 | 0.112 |
| RegulomeDB Score ≤ 6 | 1562 (66.1%) | 63.1% | 1.05 | 0.001 | 1623 (68.7%) | 67.3% | 1.02 | 0.133 |
| eQTL | 256 (10.8%) | 8.7% | 1.24 | 7.3E-04 | 305 (12.9%) | 10.8% | 1.19 | 0.001 |
| eQTL + Predicted motif | 139 (5.9%) | 5.7% | 1.03 | 0.692 | 223 (9.4%) | 8.7% | 1.08 | 0.206 |
| eQTL + DNaseI hypersensitivity peak | 183 (7.7%) | 6.2% | 1.25 | 0.003 | 258 (10.9%) | 9.1% | 1.21 | 0.001 |
| eQTL + DNaseI footprint | 73 (3.1%) | 2.7% | 1.13 | 0.300 | 135 (5.7%) | 5.1% | 1.11 | 0.162 |
| eQTL + ChIP-seq peak | 136 (5.8%) | 4.6% | 1.24 | 0.014 | 207 (8.8%) | 7.6% | 1.15 | 0.030 |
| eQTL + RegulomeDB Score 2 | 23 (1.0%) | 0.6% | 1.63 | 0.015 | 38 (1.6%) | 0.9% | 1.71 | 1.0E-04 |
| eQTL + RegulomeDB Score ≤ 3 | 29 (1.2%) | 0.8% | 1.51 | 0.020 | 46 (1.9%) | 1.3% | 1.56 | 0.002 |
| eQTL + RegulomeDB Score ≤ 4 | 67 (2.8%) | 2.0% | 1.41 | 0.002 | 94 (4.0%) | 2.8% | 1.44 | 8.6E-05 |
| eQTL + RegulomeDB Score ≤ 5 | 130 (5.5%) | 4.0% | 1.37 | 2.6E-04 | 147 (6.2%) | 4.6% | 1.34 | 6.3E-05 |
| eQTL + RegulomeDB Score ≤ 6 | 181 (7.7%) | 5.8% | 1.32 | 1.7E-04 | 181 (7.7%) | 5.8% | 1.31 | 6.6E-05 |

Supplementary Table 5: Enrichment of GWAS associations for different types of functional evidence (European populations). Entries with significant enrichment (P-value < 0.05) are represented in bold. SNPs are filtered as described in Methods, resulting in smaller sets than in Supplementary Table 2. Functional SNPs that are in linkage disequilibrium with the lead SNP at or above the indicated threshold in the HapMap 2 CEU population are used. Only associations that were identified or replicated in populations of European descent are used.
Obs. = observed count (percentage); Exp. = expected percentage; Enr. = enrichment; P = P-value.

| Annotation | Lead SNP: Obs. | Exp. | Enr. | P | Perfect LD: Obs. | Exp. | Enr. | P | r² ≥ 0.9: Obs. | Exp. | Enr. | P |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Total | 1310 | | | | 1310 | | | | 1310 | | | |
| Predicted motif | 401 (30.6%) | 28.0% | 1.09 | 0.019 | 775 (59.2%) | 62.6% | 0.95 | 0.011 | 888 (67.8%) | 71.1% | 0.95 | 0.013 |
| DNaseI hypersensitivity peak | 474 (36.2%) | 31.1% | 1.16 | 1.3E-04 | 774 (59.1%) | 58.8% | 1.00 | 0.850 | 884 (67.5%) | 66.7% | 1.01 | 0.512 |
| DNaseI footprint | 114 (8.7%) | 6.4% | 1.35 | 4.2E-04 | 302 (23.1%) | 22.3% | 1.03 | 0.526 | 377 (28.8%) | 28.8% | 1.00 | 0.988 |
| ChIP-seq peak | 266 (20.3%) | 15.6% | 1.30 | 4.9E-07 | 553 (42.2%) | 40.8% | 1.04 | 0.233 | 668 (51.0%) | 49.1% | 1.04 | 0.134 |
| RegulomeDB Score 2 | 43 (3.3%) | 2.0% | 1.62 | 0.001 | 105 (8.0%) | 6.0% | 1.34 | 0.002 | 120 (9.2%) | 7.3% | 1.25 | 0.014 |
| RegulomeDB Score ≤ 3 | 55 (4.2%) | 2.8% | 1.48 | 0.002 | 140 (10.7%) | 8.7% | 1.23 | 0.011 | 163 (12.4%) | 10.7% | 1.16 | 0.045 |
| RegulomeDB Score ≤ 4 | 183 (14.0%) | 9.8% | 1.43 | 2.4E-07 | 337 (25.7%) | 21.6% | 1.19 | 1.1E-04 | 389 (29.7%) | 25.3% | 1.18 | 1.7E-04 |
| RegulomeDB Score ≤ 5 | 478 (36.5%) | 32.0% | 1.14 | 0.002 | 670 (51.1%) | 49.5% | 1.03 | 0.175 | 717 (54.7%) | 53.1% | 1.03 | 0.134 |
| RegulomeDB Score ≤ 6 | 758 (57.9%) | 51.3% | 1.13 | 3.3E-06 | 892 (68.1%) | 65.7% | 1.04 | 0.044 | 900 (68.7%) | 67.1% | 1.02 | 0.145 |
| eQTL | 107 (8.2%) | 6.1% | 1.35 | 0.002 | 150 (11.5%) | 10.1% | 1.14 | 0.120 | 173 (13.2%) | 11.7% | 1.13 | 0.096 |
| eQTL + Predicted motif | 34 (2.6%) | 1.6% | 1.67 | 0.005 | 102 (7.8%) | 7.8% | 1.00 | 0.968 | 135 (10.3%) | 9.9% | 1.04 | 0.670 |
| eQTL + DNaseI hypersensitivity peak | 56 (4.3%) | 2.5% | 1.74 | 6.5E-05 | 119 (9.1%) | 8.1% | 1.12 | 0.212 | 151 (11.5%) | 10.2% | 1.13 | 0.138 |
| eQTL + DNaseI footprint | 16 (1.2%) | 0.5% | 2.32 | 2.5E-04 | 64 (4.9%) | 4.4% | 1.11 | 0.385 | 89 (6.8%) | 6.4% | 1.07 | 0.480 |
| eQTL + ChIP-seq peak | 33 (2.5%) | 1.4% | 1.83 | 1.4E-04 | 88 (6.7%) | 6.8% | 0.99 | 0.952 | 123 (9.4%) | 8.9% | 1.05 | 0.571 |
| eQTL + RegulomeDB Score 2 | 8 (0.6%) | 0.2% | 3.56 | 1.2E-04 | 14 (1.1%) | 0.8% | 1.31 | 0.288 | 14 (1.1%) | 1.0% | 1.03 | 0.916 |
| eQTL + RegulomeDB Score ≤ 3 | 10 (0.8%) | 0.2% | 3.42 | 1.1E-04 | 20 (1.5%) | 1.1% | 1.37 | 0.129 | 22 (1.7%) | 1.4% | 1.20 | 0.369 |
| eQTL + RegulomeDB Score ≤ 4 | 21 (1.6%) | 0.8% | 1.89 | 0.001 | 40 (3.1%) | 2.5% | 1.24 | 0.159 | 47 (3.6%) | 2.9% | 1.24 | 0.141 |
| eQTL + RegulomeDB Score ≤ 5 | 47 (3.6%) | 2.4% | 1.52 | 0.007 | 68 (5.2%) | 4.3% | 1.21 | 0.125 | 73 (5.6%) | 4.6% | 1.21 | 0.106 |
| eQTL + RegulomeDB Score ≤ 6 | 87 (6.6%) | 5.2% | 1.28 | 0.022 | 90 (6.9%) | 5.6% | 1.23 | 0.061 | 89 (6.8%) | 5.5% | 1.23 | 0.050 |

| Annotation | r² ≥ 0.8: Obs. | Exp. | Enr. | P | r² ≥ 0.5: Obs. | Exp. | Enr. | P |
|---|---|---|---|---|---|---|---|---|
| Total | 1310 | | | | 1310 | | | |
| Predicted motif | 985 (75.2%) | 77.6% | 0.97 | 0.044 | 1179 (90.0%) | 90.3% | 1.00 | 0.755 |
| DNaseI hypersensitivity peak | 962 (73.4%) | 73.3% | 1.00 | 0.931 | 1154 (88.1%) | 87.0% | 1.01 | 0.207 |
| DNaseI footprint | 475 (36.3%) | 34.9% | 1.04 | 0.297 | 736 (56.2%) | 52.4% | 1.07 | 0.001 |
| ChIP-seq peak | 772 (58.9%) | 56.7% | 1.04 | 0.107 | 1025 (78.2%) | 74.5% | 1.05 | 0.002 |
| RegulomeDB Score 2 | 144 (11.0%) | 8.7% | 1.27 | 0.004 | 199 (15.2%) | 12.3% | 1.23 | 0.002 |
| RegulomeDB Score ≤ 3 | 196 (15.0%) | 12.6% | 1.19 | 0.011 | 262 (20.0%) | 17.6% | 1.14 | 0.024 |
| RegulomeDB Score ≤ 4 | 433 (33.1%) | 28.3% | 1.17 | 1.2E-04 | 490 (37.4%) | 35.0% | 1.07 | 0.075 |
| RegulomeDB Score ≤ 5 | 733 (56.0%) | 55.1% | 1.02 | 0.417 | 734 (56.0%) | 56.2% | 1.00 | 0.863 |
| RegulomeDB Score ≤ 6 | 889 (67.9%) | 66.8% | 1.02 | 0.291 | 807 (61.6%) | 62.4% | 0.99 | 0.490 |
| eQTL | 206 (15.7%) | 13.2% | 1.19 | 0.004 | 304 (23.2%) | 18.5% | 1.25 | 1.7E-06 |
| eQTL + Predicted motif | 170 (13.0%) | 11.9% | 1.09 | 0.188 | 292 (22.3%) | 18.0% | 1.24 | 1.2E-05 |
| eQTL + DNaseI hypersensitivity peak | 186 (14.2%) | 12.1% | 1.18 | 0.014 | 300 (22.9%) | 18.1% | 1.27 | 7.1E-07 |
| eQTL + DNaseI footprint | 125 (9.5%) | 8.2% | 1.16 | 0.040 | 250 (19.1%) | 14.5% | 1.31 | 1.3E-07 |
| eQTL + ChIP-seq peak | 162 (12.4%) | 10.9% | 1.13 | 0.084 | 287 (21.9%) | 17.2% | 1.27 | 8.6E-07 |
| eQTL + RegulomeDB Score 2 | 17 (1.3%) | 1.2% | 1.08 | 0.730 | 36 (2.7%) | 1.9% | 1.46 | 0.027 |
| eQTL + RegulomeDB Score ≤ 3 | 27 (2.1%) | 1.6% | 1.27 | 0.172 | 49 (3.7%) | 2.5% | 1.52 | 0.004 |
| eQTL + RegulomeDB Score ≤ 4 | 53 (4.0%) | 3.2% | 1.28 | 0.042 | 74 (5.6%) | 4.2% | 1.34 | 0.010 |
| eQTL + RegulomeDB Score ≤ 5 | 71 (5.4%) | 4.7% | 1.15 | 0.199 | 88 (6.7%) | 5.4% | 1.25 | 0.030 |
| eQTL + RegulomeDB Score ≤ 6 | 84 (6.4%) | 5.4% | 1.19 | 0.069 | 91 (6.9%) | 5.6% | 1.23 | 0.046 |

Supplementary Table 6: Height functional SNPs overlapping CTCF ChIP-seq peaks, and prostate cancer functional SNPs overlapping Androgen Receptor (AR) ChIP-seq peaks. An empty cell in the Functional SNP column indicates that the associated lead SNP is also functional.

Height-associated functional SNPs overlapping CTCF binding sites

| Chr. | Associated SNP | Functional SNP | Gene | Study | PubMed ID | P-value |
|---|---|---|---|---|---|---|
| chr1 | rs6686842 | rs11209342 | SCMH1 | Weedon et al. 2008 | 18391952 | 2E-08 |
| chr2 | rs2580816 | | NPPC, DIS3L2 | Lango Allen et al. 2010 | 20881960 | 6E-22 |
| chr2 | rs6724465 | rs13419740 | NHEJ1 | Weedon et al. 2008 | 18391952 | 2E-08 |
| chr3 | rs9863706 | | RYBP | Lango Allen et al. 2010 | 20881960 | 4E-13 |
| chr6 | rs6899976 | | L3MBTL3 | Gudbjartsson et al. 2008 | 18391951 | 6E-06 |
| chr6 | rs7742369 | | KRT18P9, CYCSL1 | Okada et al. 2010 | 20189936 | 1E-13 |
| chr7 | rs2730245 | rs6965685 | WDR60 | Lettre et al. 2008 | 18391950 | 3E-07 |
| chr9 | rs7032940 | rs7036157 | AKAP2, C9orf152 | Kim et al. 2010 | 19893584 | 3E-06 |
| chr9 | rs946053 | | COL27A1 | Gudbjartsson et al. 2008 | 18391951 | 2E-07 |
| chr14 | rs1950500 | rs12590407 | RIPK3, NFATC4 | Lango Allen et al. 2010 | 20881960 | 2E-18 |
| chr15 | rs8041863 | rs8030631 | ACAN | Weedon et al. 2008 | 18391952 | 8E-08 |
| chr17 | rs3760318 | | ADAP2 | Gudbjartsson et al. 2008 | 18391951 | 2E-09 |
| chr17 | rs2665838 | | GH2, CSH1 | Lango Allen et al. 2010 | 20881960 | 5E-25 |
| chr19 | rs12986413 | rs1015670 | DOT1L | Lettre et al. 2008 | 18391950 | 3E-08 |
| chr22 | rs5751614 | | BCR | Gudbjartsson et al. 2008 | 18391951 | 6E-06 |

Prostate cancer-associated functional SNPs overlapping AR binding sites

| Chr. | Associated SNP | Functional SNP | Gene | Study | PubMed ID | P-value |
|---|---|---|---|---|---|---|
| chr4 | rs7679673 | rs10007915 | RPL6P14 - TET2 | Eeles et al. 2009 | 19767753 | 3E-14 |
| chr6 | rs339331 | | RFX6 | Takata et al. 2010 | 20676098 | 2E-12 |
| chr7 | rs10486567 | | JAZF1 | Thomas et al. 2008 | 18264096 | 2E-06 |
| chr8 | rs1456315 | | SRRM1P1, POU5F1B | Takata et al. 2010 | 20676098 | 2E-29 |
| chr17 | rs1859962 | | CALM2P1, SOX9 | Schumacher et al. 2011 | 21743057 | 3E-11 |
| | | | | Eeles et al. 2009 | 19767753 | 2E-16 |
| | | | | Eeles et al. 2008 | 18264097 | 1E-06 |
| | | | | Gudmundsson et al. 2007 | 17603485 | 3E-10 |

Supplementary Figures

Supplementary Figure 1: Overview of enrichment. (A) Percentage of associated SNPs mapped to a functional SNP overlapping DNaseI peaks, DNaseI footprints and ChIP-seq peaks (full lines) compared to expected percentages in the matched null sets (dotted lines) for various linkage disequilibrium thresholds. As the LD threshold decreases, the fraction of associations that can be mapped to functional SNPs increases, but the enrichment for functional SNPs amongst associated SNPs decreases. (B, C) Comparison of the fraction of SNPs overlapping DNaseI peaks (B) and ChIP-seq peaks (C) in various null sets of matched random SNPs (blue) and sets of associated SNPs (green). The fraction of random SNPs overlapping DNaseI peaks increases when properties of associated SNPs are matched more closely. The fraction of associated SNPs overlapping DNaseI peaks increases when considering associations supported by more evidence: a total of 1216 lead SNPs were curated from an original study that included a separate replication population, and 478 of them (39.3%) overlap DNaseI peaks; a total of 166 lead SNPs were replicated in a different study, and 85 of them (51.2%) overlap a DNaseI peak. A similar trend can be observed for ChIP-seq peaks. The red arrows indicate the most stringent matched null set and the set of all associations. We compare those two sets in order to show that enrichments are significant.

Supplementary Figure 2: Phenotype-level overview of the overlap between associations and DNaseI-seq peaks. This matrix view shows phenotypes vertically and DNaseI-seq peaks horizontally. Only phenotypes with at least 20 lead SNPs are shown, but totals are computed over the entire data set. In panel A, DNaseI-seq peaks are grouped according to ENCODE cell line tier, sex, karyotype, lineage and tissue. In panel B, cell lines that overlap with at least 100 associated regions are shown individually. Each cell represents the number of SNPs that are in strong LD with an association for the respective phenotype (r² ≥ 0.8 in the CEU population) and that overlap with a DNaseI peak in the respective group of cell lines (panel A) or in the specific cell line (panel B).

[Figure graphics are not recoverable from the extracted text. Panel titles of Supplementary Figure 1: A. Enrichment for different LD thresholds (y-axis: percentage of associations); B. Enrichments for DNaseI peaks; C. Enrichments for ChIP-seq peaks.]

Supplementary Methods

[...] chromosome X. We exclude the single association on chromosome Y from our analysis. The 5694 associations represent a total of 4724 distinct SNPs, 470 phenotypes and 810 studies. 416 SNPs are associated in more than one study.

Linkage Disequilibrium

We use HapMap version 2 (The International HapMap Consortium 2007) and version 3 (The International HapMap Consortium 2010) to obtain linkage disequilibrium (LD) information between SNPs. HapMap 2 provides a higher SNP density, whereas HapMap 3 provides information for more populations. The HapMap 2 data we use includes 2,776,528 SNPs in the CEPH (Utah residents with ancestry from northern and western Europe, abbreviation: CEU) population, 2,554,939 SNPs in the Han Chinese in Beijing, China (abbreviation: CHB) population, 3,114,362 SNPs in the Yoruba in Ibadan, Nigeria (abbreviation: YRI) population, and 2,509,881 SNPs in the Japanese in Tokyo, Japan (abbreviation: JPT) population. We create an intersection set that includes all 2,135,736 SNPs that are assessed in all four HapMap 2 populations. We use HapMap 2 and HapMap 3 data to create a list of pairs of SNPs for which there is some evidence of LD (r² ≥ 0.1) in any HapMap 2 or HapMap 3 population.
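The two set constructions described in this paragraph can be sketched as follows. This is an illustration only: the function names and the input layout (one collection of SNP identifiers per population, and one dictionary of pairwise r² values per population) are hypothetical and do not reflect the HapMap file formats.

```python
def shared_snps(snps_by_population):
    """Return the SNP identifiers that are assessed in every population."""
    sets = [set(snps) for snps in snps_by_population.values()]
    return set.intersection(*sets) if sets else set()


def pairs_with_some_ld(r2_by_population, threshold=0.1):
    """Return unordered SNP pairs with r2 >= threshold in at least one population.

    r2_by_population maps a population name to {(snp_a, snp_b): r2}.
    """
    pairs = set()
    for r2_values in r2_by_population.values():
        for (snp_a, snp_b), r2 in r2_values.items():
            if r2 >= threshold:
                # frozenset makes (a, b) and (b, a) the same pair
                pairs.add(frozenset((snp_a, snp_b)))
    return pairs
```

Storing each pair as a `frozenset` avoids counting the same pair twice when two populations report it in opposite orders.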

Genotyping arrays

We download the list of SNPs that appear on genotyping arrays from the SNP Genotyping Array track of the UCSC genome browser (Kent et al. 2002) and use all SNPs that also appear in dbSNP 132. Arrays include Affymetrix SNP 6.0 (905,283 SNPs), Affymetrix SNP 5.0 (435,360 SNPs), Affymetrix GeneChip Human Mapping 250K Nsp (257,159 SNPs), Affymetrix GeneChip Human Mapping 250K Sty (233,887 SNPs), Illumina Human Hap 650v3 (660,388 SNPs), Illumina Human Hap 550v3 (560,972 SNPs), Illumina Human Hap 300v3 (318,046 SNPs), Illumina Human1M-Duo (1,146,891 SNPs), Illumina Human CytoSNP-12 (299,358 SNPs), Illumina Human 660W-Quad (593,197 SNPs), and Illumina Human Omni1-Quad (972,372 SNPs). This track does not provide information for the Perlegen arrays. We compute combined lists of all SNPs that appear on any Affymetrix array and of all SNPs that appear on any Illumina array, remove SNPs that do not appear in all HapMap 2 populations, and remove SNPs on chromosome Y. We obtain a total of 1,006,273 SNPs for the Illumina arrays and 920,693 SNPs for the Affymetrix arrays.

SNP properties

We use the function information generated by the UCSC genome browser for each SNP in dbSNP 132 (Sayers et al. 2011). The functional role is predicted based on UCSC genes. Functional classes are: near the 3’ end of a gene (within 500 bases of a transcript), near the 5’ end of a gene (within 2 kb of a transcript), coding synonymous, coding non-synonymous (nonsense, missense, frameshift, coding indel or coding unknown), 3’ or 5’ untranslated regions, introns, splice sites, or unknown (intergenic regions). These classifications do not use GENCODE v7 information or any regulatory information from ENCODE. We use the UCSC Genes track to find the closest transcription start site (TSS) for each SNP, and the dbSNP allele frequency information to determine the minor allele frequency of each SNP. We use a total of 26,561,892 SNPs from dbSNP 132.
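Finding the closest transcription start site for a SNP can be done with a binary search over the sorted TSS positions of its chromosome. The sketch below assumes that positions are plain integers on a single chromosome; the function name and inputs are hypothetical and strand is ignored.

```python
from bisect import bisect_left


def distance_to_nearest_tss(position, sorted_tss):
    """Return the distance in bases from position to the closest TSS.

    sorted_tss is an ascending list of TSS positions on the same chromosome.
    """
    if not sorted_tss:
        raise ValueError("no TSS available on this chromosome")
    i = bisect_left(sorted_tss, position)
    # Only the TSS just before and just after the position can be the closest.
    candidates = sorted_tss[max(i - 1, 0):i + 1]
    return min(abs(position - tss) for tss in candidates)
```

With `sorted_tss = [100, 500, 900]`, a SNP at position 450 lies between the first two entries, and only those two are compared.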

Functional annotations

We use RegulomeDB to annotate SNPs with regulatory information. RegulomeDB integrates various types of assays, including DNaseI-seq peaks and ChIP-seq peaks, both generated by the ENCODE consortium, DNaseI footprints, conserved motifs, eQTLs curated from several studies in multiple tissues, and validated functional loci. For each SNP, RegulomeDB provides a list of datasets in which there is evidence of function in a region overlapping the SNP, as well as a score indicating the confidence that the SNP is functional based on all the available evidence for the locus. The RegulomeDB paper describes the database and scoring metric in more detail (Boyle et al. 2012).

We run RegulomeDB on every SNP in dbSNP 132. There is some evidence of function for a total of 13,453,666 SNPs (50.7%). In this study, we use a slightly modified version of the RegulomeDB scores in order to integrate linkage disequilibrium and eQTL information. A RegulomeDB score of 1a through 1f indicates that a SNP is an eQTL, with the letter indicating how much other supporting functional information there is for the SNP. Each letter maps to another score (2a through 5), the only difference being that these higher scores denote SNPs that are not eQTLs. We map scores between 1a and 1f back to the corresponding scores between 2a and 5, and handle eQTLs separately. We also create two additional scores, 5a for SNPs overlapping ChIP-seq peaks only and 5b for SNPs overlapping DNaseI-seq peaks only, whereas RegulomeDB uses a score of 5 for both. Supplementary Table 1 provides an overview of the modified scoring scheme.
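The remapping of eQTL scores can be sketched as a lookup that separates the eQTL flag from the remaining evidence. The letter-to-score correspondence below is an assumption based on the published RegulomeDB categories and is not stated explicitly in this text; the function name and its arguments are hypothetical. Scores 1f and 5 are resolved from the peak evidence, because they cover more than one category of Supplementary Table 1.

```python
# Assumed correspondence between eQTL scores and non-eQTL scores.
EQTL_TO_BASE = {"1a": "2a", "1b": "2b", "1c": "2c", "1d": "3a", "1e": "3b"}


def split_eqtl_score(score, chip_peak=False, dnase_peak=False):
    """Return (is_eqtl, modified_score) for an original RegulomeDB score."""
    is_eqtl = score.startswith("1")
    base = EQTL_TO_BASE.get(score, score)
    if base in ("1f", "5"):
        # These scores do not say which peak type is present, so use the evidence.
        if chip_peak and dnase_peak:
            base = "4"
        elif chip_peak:
            base = "5a"
        elif dnase_peak:
            base = "5b"
        else:
            raise ValueError("score %s requires ChIP-seq or DNaseI peak evidence" % score)
    return is_eqtl, base
```

For example, `split_eqtl_score("1b")` yields an eQTL flag together with score 2b, so that the eQTL evidence and the regulatory score can be tabulated separately.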

Transcribed regions

We use GENCODE v7 (Harrow et al. 2012) to identify SNPs that overlap transcribed regions. We intersect the GENCODE v7 Genes basic set track from the UCSC browser with all SNPs in dbSNP 132 and determine whether each SNP lies in an exon. If the SNP is in an exon, we use the coding region start and end information of the browser table to determine whether the SNP is in a coding region or in a non-coding part of an exon. We treat introns in the same way as intergenic regions, since regulatory elements can be found in both. This leads to a transcriptional annotation for each SNP.
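The exon and coding-region test can be sketched as below. The sketch assumes 0-based, half-open intervals and a transcript record with `exons`, `cds_start` and `cds_end` fields; these names, the label strings, and the rule that the most specific label wins when several transcripts overlap a SNP are all assumptions made for illustration.

```python
LABELS = ("coding", "exon, non-coding", "intron or intergenic")


def annotate_in_transcript(position, exons, cds_start, cds_end):
    """Classify a position against one transcript (0-based, half-open intervals)."""
    in_exon = any(start <= position < end for start, end in exons)
    if not in_exon:
        return LABELS[2]
    if cds_start <= position < cds_end:
        return LABELS[0]
    return LABELS[1]


def annotate_snp(position, transcripts):
    """Return the most specific label over all transcripts on the SNP's chromosome."""
    found = {
        annotate_in_transcript(position, t["exons"], t["cds_start"], t["cds_end"])
        for t in transcripts
    }
    for label in LABELS:  # LABELS is ordered from most to least specific
        if label in found:
            return label
    return LABELS[2]
```

A transcript with `cds_start == cds_end` has an empty coding region, so every exonic position in it is reported as "exon, non-coding".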

Discussion of the integration of linkage disequilibrium information

The use of linkage disequilibrium to identify functional SNPs is based on the assumption that the linkage disequilibrium structure used in this analysis is the same as in the population in which the association study was performed. This assumption is necessary in order to consider that a functional SNP in strong linkage disequilibrium with a lead SNP is associated with the phenotype. If the two SNPs are not correlated in the actual study population, then there is no evidence that the functional SNP has an effect on the phenotype. Linkage disequilibrium patterns differ significantly between populations, and it is therefore challenging to obtain linkage disequilibrium information that closely matches the population in which the GWAS was performed.

We choose to use a conservative approach in which we consider two SNPs to be in strong linkage disequilibrium only if they are in strong linkage disequilibrium (r² ≥ 0.8) in all four HapMap 2 populations. These populations are of European, African, Japanese and Chinese origin, and thus encompass a large part of the variation between populations, and in particular represent populations that diverged early on in recent human evolution. Linkage disequilibrium patterns that are conserved across all four populations are likely to be conserved in the populations studied in GWAS as well, and this approach should thus reduce the number of false positives amongst the functional SNPs we identify. This clearly comes at the cost of sensitivity: when considering SNPs in strong LD with the lead SNP in all populations, 33% of the lead SNPs are annotated but have no SNP in LD with them.

We separately repeat our analysis by considering functional SNPs that are in strong LD in the CEU HapMap population with a lead SNP associated with a phenotype in a population of European descent. This increases the number of SNPs assessed for functional evidence. The fraction of lead SNPs that are mapped to a functional SNP also increases: 80% of the lead SNPs are found to be in strong LD with a SNP overlapping a region identified to be functional in at least one ENCODE assay. Recent analyses of the genetic structure of the European population do, however, show significant differences between sub-populations. It is therefore important to keep in mind that further analysis is needed, as the correlation between a lead SNP and a functional SNP in the HapMap CEU population might be weaker in the population in which the original association was performed.
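The conservative LD criterion discussed here can be sketched as a check over a chosen set of populations. The data layout (`ld[lead][other][population]` holding r²) and the function names are hypothetical; a pair with no r² value in a population is treated as not being in LD there, which is an assumption.

```python
HAPMAP2_POPULATIONS = ("CEU", "CHB", "JPT", "YRI")


def in_strong_ld(r2_by_population, populations=HAPMAP2_POPULATIONS, threshold=0.8):
    """True if r2 reaches the threshold in every requested population."""
    return all(r2_by_population.get(pop, 0.0) >= threshold for pop in populations)


def functional_snps_for_lead(lead, ld, functional,
                             populations=HAPMAP2_POPULATIONS, threshold=0.8):
    """List functional SNPs linked to a lead SNP, including the lead SNP itself."""
    hits = [lead] if lead in functional else []
    for other, r2_by_population in ld.get(lead, {}).items():
        if other in functional and in_strong_ld(r2_by_population, populations, threshold):
            hits.append(other)
    return hits
```

Passing `populations=("CEU",)` corresponds to the separate analysis restricted to associations found in populations of European descent, which accepts more SNP pairs than the all-population criterion.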

Randomizations

We create null sets in which each lead SNP is matched to a random SNP that shares the same properties. We then repeat the annotation steps on each null set, and obtain an empirical distribution of the fraction of functional SNPs expected for matched SNPs, and of the score distribution amongst matched SNPs.

Filtering

In order to obtain null sets that are similar to the set of associated SNPs, we only consider SNPs that were assessed in all HapMap 2 populations, for which the minor allele frequency is known in dbSNP, and that we can map to a genotyping platform. We use the GWAS catalog to determine whether an associated SNP was found using an Affymetrix or an Illumina array, and then determine whether the SNP is in the corresponding list of SNPs we have previously computed. If the SNP is not in the list, or if the platform is neither Affymetrix nor Illumina, then the lead SNP is filtered out. An exception is the case where the SNP was found using imputation, in which case the SNP is kept in the set as long as it is in HapMap 2. A total of 1160 lead SNPs are filtered out at this stage. We then search for pairs of lead SNPs that show some evidence of linkage disequilibrium between them (r² ≥ 0.1 in any HapMap 2 or HapMap 3 population). For each such pair we keep only the lead SNP with the stronger association (more significant P-value reported in the GWAS catalog), and repeat this process until no pair of SNPs in linkage disequilibrium with each other is left in the filtered set. A total of 1200 lead SNPs are filtered out due to LD, leaving us with a set of 2364 SNPs. We repeat the annotation steps on this new set, and perform all enrichment comparisons using this set and its subset of European associations, which is obtained in the same way as for the set of all lead SNPs.
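The LD-based filtering step can be sketched as a greedy procedure that visits lead SNPs from most to least significant and discards any SNP in LD with one already kept. This is one possible reading of the pairwise rule described above, not necessarily the exact procedure used; the function name and input layout are hypothetical.

```python
def prune_by_ld(p_values, ld_pairs):
    """Return lead SNPs such that no two kept SNPs are in LD with each other.

    p_values maps SNP identifier -> association P-value.
    ld_pairs is an iterable of (snp_a, snp_b) pairs with some evidence of LD.
    """
    neighbours = {}
    for snp_a, snp_b in ld_pairs:
        neighbours.setdefault(snp_a, set()).add(snp_b)
        neighbours.setdefault(snp_b, set()).add(snp_a)

    kept, removed = [], set()
    # Most significant first; ties are broken by identifier for reproducibility.
    for snp in sorted(p_values, key=lambda s: (p_values[s], s)):
        if snp in removed:
            continue
        kept.append(snp)
        removed.update(neighbours.get(snp, ()))
    return kept
```

In a chain where A is in LD with B and B with C, but A is not in LD with C, this reading keeps both A and C when A has the smallest P-value, because B is removed before it can exclude C.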

Matched null sets

We compute several matched null sets that model the properties of the associated SNPs increasingly closely. We group SNPs into bins based on their minor allele frequency, with each bin representing a 5% minor allele frequency interval. For all null sets, a lead SNP is always matched to a random SNP in the same minor allele frequency bin. We also ensure that SNPs in each null set are not in linkage disequilibrium with each other, using the same criterion as for the lead SNPs. The null-dbSNP set is obtained by matching each lead SNP to a SNP in dbSNP 132. The null-HapMap2 set is obtained in a similar way, except that the matched SNPs must be amongst the SNPs assessed in all HapMap 2 populations. The null-Array set is obtained by matching a lead SNP with a SNP that appears on the same platform (Illumina or Affymetrix) as the lead SNP, or with a SNP that is in HapMap 2 if the lead SNP has been imputed. If the original study includes both Affymetrix and Illumina platforms, and the lead SNP is present on both, then the SNP is matched to a random SNP from either platform. The null-Array-Function set uses the same criteria as the null-Array set, but in addition requires that the matched SNP is in the same functional category (as predicted using UCSC genes, see above) as the lead SNP. Finally, the null-Array-Function-Distance set is obtained by also requiring that the matched SNP is at a similar distance to the nearest transcription start site (with respect to UCSC genes) as the lead SNP if the lead SNP is located in an intergenic region or an intron. We group SNPs into bins according to the logarithm of their distance to the nearest transcription start site, and each bin represents a log10(distance) interval of 0.1. All SNPs with a log10(distance) under 3.5 are grouped into one bin, and all SNPs with a log10(distance) above 13 are grouped into one bin.

While we use all these random null sets in Figures 3B and 3C, the enrichments presented in Table 3 and Figure 3A are only computed using the most stringent null-Array-Function-Distance sets.
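The binning and matching described above can be sketched as follows for the most stringent null set. This is a minimal illustration under several assumptions: SNPs are represented as dictionaries with hypothetical field names, the functional category labels are placeholders, bin boundaries are treated as left-closed, and the additional requirement that SNPs within a null set are not in LD with each other is omitted for brevity.

```python
import math
import random
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

# Bin limits and width as stated in the text.
LOG_LOW, LOG_HIGH, LOG_WIDTH = 3.5, 13.0, 0.1
N_MID_BINS = round((LOG_HIGH - LOG_LOW) / LOG_WIDTH)
# Categories for which the TSS distance is matched (hypothetical labels).
DISTANCE_MATCHED = {"intergenic", "intron"}


def maf_bin(maf: float) -> int:
    """Index of the 5% minor allele frequency bin (0 to 9)."""
    return min(math.floor(maf * 20 + 1e-9), 9)


def tss_distance_bin(distance_bp: float) -> int:
    """Index of the log10(distance) bin; -1 is the single bin below LOG_LOW."""
    x = math.log10(max(distance_bp, 1.0))
    if x < LOG_LOW:
        return -1
    # Everything at or above LOG_HIGH falls into the last bin.
    return min(math.floor((x - LOG_LOW) / LOG_WIDTH + 1e-9), N_MID_BINS)


def match_key(snp: Dict) -> Tuple:
    """Properties that a null SNP must share with its lead SNP."""
    dist = (tss_distance_bin(snp["tss_distance"])
            if snp["function"] in DISTANCE_MATCHED else None)
    return (maf_bin(snp["maf"]), snp["platform"], snp["function"], dist)


def draw_matched_null(leads: List[Dict], candidates: List[Dict],
                      rng: random.Random) -> List[Optional[Dict]]:
    """Draw one random matched SNP per lead SNP (None if no match exists)."""
    pools = defaultdict(list)
    for snp in candidates:
        pools[match_key(snp)].append(snp)
    null_set = []
    for lead in leads:
        pool = [c for c in pools.get(match_key(lead), []) if c["id"] != lead["id"]]
        null_set.append(rng.choice(pool) if pool else None)
    return null_set


# Hypothetical SNP records, for illustration only.
leads = [{"id": "lead1", "maf": 0.12, "platform": "Illumina",
          "function": "intergenic", "tss_distance": 20000}]
candidates = [
    {"id": "cand1", "maf": 0.14, "platform": "Illumina",
     "function": "intergenic", "tss_distance": 21000},
    {"id": "cand2", "maf": 0.30, "platform": "Illumina",
     "function": "intergenic", "tss_distance": 21000},
]
null_set_example = draw_matched_null(leads, candidates, random.Random(0))
print([snp["id"] if snp else None for snp in null_set_example])
```

Repeating the draw with different random seeds yields the collection of null sets from which the empirical distributions are computed. In this sketch the same candidate may be drawn for several lead SNPs.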

Discussion of the randomization approach

We use a conservative approach in order to estimate enrichment. We do so in two steps, by first filtering lead SNPs and then generating random null sets that closely match the properties of the lead SNPs. In the filtering step, we ensure that the lead SNPs that we use when computing enrichment are independent. If two lead SNPs were in strong linkage disequilibrium, then their sets of SNPs in LD would likely overlap, and a functional region in LD with both SNPs would be double counted. This is a fairly likely situation given that there are large regions of perfect linkage disequilibrium, and that different chips use different tag SNPs to genotype the same region. If a phenotype is assessed on two different platforms, then two different SNPs could be reported as significant even though they are both in the same associated region. Accurately replicating the fine linkage disequilibrium structure between lead SNPs when building random null sets would be extremely challenging, and we therefore consider only SNPs that are not in LD with each other in order to estimate enrichment. Second, as our method for identifying associated functional SNPs relies on linkage disequilibrium information from HapMap, we need to ensure that the fraction of SNPs that are assessed in HapMap does not differ substantially between the actual lead SNP set and the random sets. If we had allowed a lead SNP to be matched to any SNP in dbSNP, then LD information would only be available for a small subset of the SNPs in the random null sets, which would artificially decrease the number of functional SNPs identified in those random sets. In order to avoid any subtle difference that could bias our enrichment estimation, we limit the set of lead SNPs to SNPs that have been assessed in all four HapMap 2 populations, and similarly limit the set of random SNPs. While this substantially decreases the set of lead SNPs that are used to compute enrichment estimates, the fraction of lead SNPs overlapping functional regions in this smaller set (Supplementary Tables 4, 5) is comparable to the fraction obtained using all reported associations (Supplementary Tables 2, 3).

We then compute a random set that matches the properties of the lead SNPs. A lead SNP is matched to a SNP with similar minor allele frequency, since minor allele frequency can affect the strength of the statistical association between a SNP and a phenotype. We then map each lead SNP to another SNP on the same genotyping array. This corrects for biases that may result from the choice of SNPs put on the genotyping array. We also map a lead SNP to a SNP that has the same function (with respect to UCSC genes), such that the fraction of coding SNPs, for example, is the same in the random sets as amongst lead SNPs. Finally, we also ensure that each lead SNP is mapped to a random SNP that is at a similar distance to the nearest transcription start site. This avoids the situation in which an association that is close to a gene, and thus more likely to overlap with some functional data, is matched by a SNP in a so-called gene desert in a null set. We still obtain significant enrichment for functional regions even when taking all these factors into account, and when comparing the most stringent random set to all lead SNPs.

While using such a conservative approach and still reaching significance is strong evidence that there is enrichment for functional elements in regions associated with disease and other phenotypes, one can argue that matching the properties of lead SNPs that closely actually amounts to overcorrecting. For example, while there appears to be a bias for associated SNPs to be closer to known genes than random SNPs on the same genotyping chip, this is actually a biologically interesting property of disease associations. By requiring the SNPs in the matched null sets to also lie closer to known genes, we increase the probability that those random SNPs are in regulatory regions, or in strong LD with a regulatory region. It is likely that the functional SNPs identified when using random null sets, and in particular those supported by multiple types of functional data, also affect some phenotype in some way, but that this association has yet to be discovered in a GWAS.

Functional analysis of SNPs associated with coronary artery disease in the 9p21 region

INTRODUCTION

High-throughput functional information has also been used to identify a large number of enhancers in the 9p21 region, and to determine that two SNPs associated with coronary artery disease overlap an enhancer and disrupt a STAT1 binding site (Harismendy et al. 2011).

We discuss several specific functional SNPs, and in particular provide evidence indicating that rs1333047, a SNP in perfect linkage disequilibrium with coronary artery disease associated SNP rs1333049 in the CEU population only, is likely a functional SNP. This result can explain why the association between rs1333049 and coronary artery disease has not been replicated in populations of African descent.

RESULTS

The 9p21 gene desert is a region that contains multiple SNPs that are strongly associated with coronary artery disease. We consider the functional information available from ENCODE in order to generate candidate functional SNPs in this region. The association between rs1333049 and coronary artery disease has been replicated in multiple studies in populations of European descent (WTCCC 2007, Samani et al. 2007, Broadbent et al. 2008, Wild et al. 2011) as well as in populations of Japanese and Korean descent (Hiura et al. 2008, Hinohara et al. 2008). In the HapMap 2 CEU population, this SNP is part of a haplotype block that includes rs10757278 and rs1333047, both of which are in perfect LD with rs1333049. rs10757278 has itself also been associated with coronary artery disease in multiple populations of European descent (Helgadottir et al. 2007, Shen et al. 2008, Broadbent et al. 2008) and in the Chinese Han population (Ding et al. 2009). Figure 6 (main manuscript) provides an overview of this region. There is no evidence supporting a functional role for rs1333049. However, both rs10757278 and rs1333047 overlap a DNase hypersensitivity peak as well as ChIP-seq peaks for STAT1 and STAT3 in HeLa-S3 cells, and are therefore functional SNPs. Furthermore, rs10757278 lies in a STAT1 binding site, and rs1333047 lies in a binding site and a DNaseI footprint for Interferon-stimulated gene factor 3 (ISGF3). The motif is a good match when extending the less specific part of the motif (positions 8-9) located between the two very specific regions (positions 2-7 and 10-13) by a . This is similar to cases of variable spacer length previously observed for binding motifs (Badis et al. 2009). While the functional role of rs10757278 has been previously reported (Harismendy et al. 2011), evidence of the functional role of rs1333047 is novel. Interestingly, while only 27 base pairs separate the two SNPs, they are in perfect linkage disequilibrium in the CEU population only.
The frequency of the ‘A’ allele at rs1333047 in the Yoruba in Ibadan, Nigeria (YRI) HapMap 2 population is only 0.8%, compared to 50.8% in the CEU population. This allele is part of the protective haplotype found in GWAS performed in populations of European descent. The ‘A’ allele is part of the motif for ISGF3 binding, whereas the ‘T’ allele is not.

The most recent meta-analysis in populations of European descent (Schunkert et al. 2011) identifies rs4977574 as the most strongly associated SNP in 9p21. We find that this SNP overlaps DNase hypersensitivity peaks in two ENCODE cell lines (Hah and Lncap), a conserved motif for the Androgen Receptor (AR), and a ChIP-seq peak for AR in a data set from Wei et al. (Wei et al. 2010). The RegulomeDB score for this SNP is 2c. In the YRI population, the minor allele frequency of this SNP is only 7.5%, and it is in strong LD with rs10757278 (r² = 0.803) and in weaker LD with rs1333049 (r² = 0.382). It is in strong LD with both in the CEU population (r² of 0.874 and 0.885, respectively). The previously identified functional candidate rs1333045 (Jarinova et al. 2009) obtains a RegulomeDB score of 3a, as it overlaps a conserved motif (Hand1::Tcfe2a), a ChIP-seq peak for GATA3 in the T-47D ENCODE cell line, and a DNase peak in the Huvec ENCODE cell line. This SNP has a minor allele frequency of 45.8% in the YRI population, and is in weak LD with both rs10757278 (r² = 0.119) and rs1333049 (r² = 0.251). In the CEU population it is in strong LD with both (r² of 0.808 and 0.815, respectively).
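For readers less familiar with the r² statistic used throughout this section, the following Python sketch computes it from haplotype and allele frequencies using the standard definition (squared correlation between alleles at two biallelic loci). The values reported above come from HapMap; the function name and the frequencies in the example are hypothetical and only illustrate why two SNPs with very different allele frequencies cannot be in strong LD.

```python
def ld_r2(p_ab: float, p_a: float, p_b: float) -> float:
    """Squared correlation r^2 between allele A at locus 1 and allele B at locus 2.

    p_ab: frequency of the haplotype carrying both A and B.
    p_a, p_b: frequencies of allele A and of allele B.
    """
    denom = p_a * (1.0 - p_a) * p_b * (1.0 - p_b)
    if denom == 0.0:
        raise ValueError("r^2 is undefined when a locus is monomorphic")
    d = p_ab - p_a * p_b  # linkage disequilibrium coefficient D
    return d * d / denom


# Two alleles with equal frequency that always occur together: perfect LD.
perfect = ld_r2(p_ab=0.5, p_a=0.5, p_b=0.5)

# A rare allele (0.8%) that always occurs on the background of a common
# allele (hypothetical 50% frequency): r^2 stays low despite full coupling.
rare_vs_common = ld_r2(p_ab=0.008, p_a=0.008, p_b=0.5)

print(f"perfect LD: r2 = {perfect:.3f}")
print(f"rare vs common allele: r2 = {rare_vs_common:.4f}")
```

The second case shows that a large difference in allele frequency bounds r² well below 1, so a common tag SNP is a poor proxy for a rare variant even when the rare allele only occurs on one haplotype background.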

DISCUSSION

A new functional SNP in 9p21 could explain the lack of association in populations of African descent

We identify rs1333047 as a candidate functional SNP in the 9p21 region. This region is associated with coronary artery disease (McPherson et al. 2007) and several other diseases, and the risk of coronary artery disease for the 25% of individuals in populations of European descent that are homozygous for the risk allele is two times higher than for individuals homozygous for protective alleles (McPherson 2010). This region is a gene desert, but the non-coding RNA ANRIL overlaps the SNPs associated with coronary artery disease (Broadbent et al. 2008). A recent study in mice showed that the deletion of the region orthologous to 9p21 leads to changes in the expression of the orthologs of the two human genes closest to 9p21, the cyclin-dependent kinase inhibitor genes CDKN2A and CDKN2B, and has effects on the phenotype (Visel et al. 2010). SNPs associated with coronary artery disease affect the expression of ANRIL, and to a smaller extent CDKN2A and CDKN2B, in humans (Cunnington et al. 2010).

The candidate functional SNP rs1333047 potentially disrupts a binding site for Interferon-stimulated gene factor 3 (ISGF3). ISGF3 is part of the JAK-STAT (Janus Activated Kinase - Signal Transducer and Activator of Transcription) cascade. In Type-I-Interferon signaling, STAT1, STAT2 and IFN-regulatory factor 9 (IRF9) form the ISGF3 complex (Fu et al. 1990) that binds to IFN-stimulated response elements (ISRE) in the nucleus. This contrasts with Type-II-Interferon signaling, in which STAT1-STAT1 homodimers directly bind to IFN-γ-activated sites (GAS). Interferon-γ is the only Type-II-Interferon. A review of both Interferon signaling pathways can be found in Platanias 2005. We find two functional SNPs in perfect linkage disequilibrium with each other and with tag SNP rs1333049 in the HapMap 2 CEU population. The second functional SNP, rs10757278, has been previously shown to be functional (Harismendy et al. 2011), and is located in a GAS. The experimental evidence supporting the functional role of rs10757278 does, however, also support rs1333047. Both SNPs are in perfect linkage disequilibrium in all individuals re-sequenced by Harismendy et al., and would thus be equally strongly associated with the phenotype. Harismendy et al. compare lymphoblastoid cell lines (which have a high expression level of STAT1) that are homozygous for the risk allele to lymphoblastoid cell lines that are homozygous for the protective allele at rs10757278. Given the perfect LD in this region, they are likely also homozygous for the risk and protective alleles, respectively, at rs1333047. They show that in cell lines homozygous for the protective allele, STAT1 knockdown leads to a 7-fold up-regulation of the expression of ANRIL, a non-coding transcript located in the 9p21 region, whereas there is a much smaller effect in cell lines homozygous for the risk allele. This would, however, also be consistent with an effect caused by rs1333047, since STAT1 is part of the ISGF3 complex, and a knockdown of STAT1 would therefore also affect ISGF3. Furthermore, they use ChIP to show that STAT1 binds at rs10757278 only in cell lines with the protective allele. Binding of STAT1 in this region does not imply the absence of ISGF3 binding, and given that STAT1 is part of the ISGF3 complex, it is also possible that ISGF3 binding was detected rather than binding of the STAT1-STAT1 homodimer. Finally, Harismendy et al. show that treatment with Interferon-γ leads to a change in expression of ANRIL in HeLa and HUVEC cell lines. As Interferon-γ is only known to be involved in the Type-II-Interferon pathway, this cannot be explained by ISGF3 binding to the ISRE at rs1333047. This change is, however, in the opposite direction to the change expected from the observation in the STAT1 knockdown experiment, which indicates that there might be multiple binding sites at play in this region. The evidence is therefore compatible with a functional role for both rs1333047 and rs10757278.

While both functional SNPs are in perfect linkage disequilibrium in the HapMap 2 CEU population, as well as in the individuals studied by Harismendy et al., this is not the case in other populations. In particular, the protective allele at rs1333047 is rare (0.8%) in the HapMap 2 YRI population. Major differences in allele frequency and linkage disequilibrium structure between populations in the 9p21 region have been previously identified (Silander et al. 2009). Interestingly, multiple GWAS of coronary artery disease in populations of African descent failed to replicate the association at rs1333049 (Assimes et al. 2008) and rs10757278 (Kral et al. 2011, Lettre et al. 2011). While these studies did not replicate the most strongly associated SNPs in European populations, they did identify SNPs that are associated with coronary artery disease in populations of African descent. Two additional associated SNPs, rs10757274 and rs2383206, were identified in European (McPherson et al. 2007) and South Korean (Shen et al. 2008) populations, but were not replicated in an African-American population (McPherson et al. 2007). rs10757274 has been associated with heart failure in individuals of European descent, but the association was not significant in African Americans (Yamagishi et al. 2009). There is therefore strong evidence that associations with coronary artery disease identified in 9p21 in populations of European origin are not replicated in populations of African origin. If rs10757278 is the functional SNP that has the largest effect on the phenotype in this region, then the absence of replication can only be explained by an interaction between the effect at rs10757278 and some other region in which the populations of African descent differ from the populations of European descent. If, however, rs1333047 functionally affects the phenotype, then the lack of replication can be explained by the lack of linkage disequilibrium between rs1333047 and the genotyped SNPs. Therefore, the lack of replication of this finding in populations of African descent supports rs1333047 as a candidate functional SNP in this region. Furthermore, rs1333047 is the strongest association with coronary artery disease identified in an African American population using a method that combines association and admixture information (Pasaniuc et al. 2011).

The linkage disequilibrium patterns also differ between the HapMap 2 CEU population and the two Asian populations (CHB and JPT). Linkage disequilibrium does, however, remain relatively high (r² of 0.978 and 0.442 between rs1333047 and the associated SNPs rs10757278 and rs1333049, respectively). A large-scale, gene-centric analysis (IBC 50K CAD Consortium 2011) showed that the effect size for the association between the 9p21 SNP rs1333042 and coronary artery disease was larger in European than in Asian populations (odds ratios of 1.27 and 1.14, respectively). We analyze the linkage disequilibrium between rs1333047 and haplotypes identified in a previous study in the Han Chinese population (Ding et al. 2009), which include rs2383206, rs1004638, rs17761446 and rs10757278. Only haplotype AATA includes the protective ‘A’ allele at rs1333047, and rs1333047 is in strong linkage disequilibrium (r² = 0.975) with rs1004638. The AATA haplotype is more frequent in controls (30.5%) than in cases (27.3%). Similarly, we analyze the haploblock reported in an association study of the Korean population (Shen et al. 2008). Only two of its SNPs, rs2383206 and rs10757278, are also in HapMap 2. For these SNPs, only haplotype AA includes the protective ‘A’ allele at rs1333047, and this haplotype is protective in the Korean population, with a frequency of 52.1% in controls and 45.1% in cases. Previous results in populations of Asian ancestry are therefore also compatible with a potential functional role of rs1333047.
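The direction of these haplotype effects can be checked with a simple odds ratio computed from the case and control frequencies quoted above. The sketch below is only an arithmetic illustration: it treats the reported percentages as haplotype frequencies, ignores sample sizes and confidence intervals, and uses a hypothetical function name. It does not reproduce the statistical analysis of the original studies.

```python
def haplotype_odds_ratio(freq_cases: float, freq_controls: float) -> float:
    """Odds ratio of carrying a haplotype in cases relative to controls."""
    for freq in (freq_cases, freq_controls):
        if not 0.0 < freq < 1.0:
            raise ValueError("frequencies must lie strictly between 0 and 1")
    odds_cases = freq_cases / (1.0 - freq_cases)
    odds_controls = freq_controls / (1.0 - freq_controls)
    return odds_cases / odds_controls


# Frequencies quoted in the text (cases, controls).
han_aata = haplotype_odds_ratio(0.273, 0.305)   # AATA haplotype, Han Chinese
korean_aa = haplotype_odds_ratio(0.451, 0.521)  # AA haplotype, Korean

print(f"AATA haplotype (Han Chinese): odds ratio = {han_aata:.2f}")
print(f"AA haplotype (Korean): odds ratio = {korean_aa:.2f}")
```

An odds ratio below 1 indicates a haplotype that is less frequent in cases than in controls, which is the pattern expected for a protective haplotype.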

The implications of this potential functional association are significant. A recent meta-analysis of 7 independent studies with a total of 9,487 cases and 30,171 controls (Wild et al. 2011) identifies rs1333049 as the strongest association with coronary artery disease (P-value 7.12 × 10⁻⁵⁸, odds ratio 1.27). The original study that associates rs10757278 with coronary artery disease in the Icelandic population and three populations in the United States (Helgadottir et al. 2007) shows that the odds ratio for heterozygous carriers of the risk allele is 1.26, that the odds ratio for homozygous carriers is 1.64, and that this association alone might explain up to 21% of the population attributable risk. This study did not specifically analyze rs1333047. Since all three SNPs are part of the same haplotype in the CEU population, and are perfectly correlated, the odds ratios and attributable risks would be similar for rs1333047. The differences in LD structure, however, mean that both rs10757278 and rs1333049 are poor proxies for rs1333047 in other populations. This has important implications for personalized medicine, as testing those SNPs would lead to incorrect risk predictions if rs1333047 is indeed the mutation that plays a functional role in the phenotype. Furthermore, a potential functional role for rs1333047 would mean that the Type-I-Interferon signaling pathway plays a role in the association at 9p21, in addition to the Type-II-Interferon pathway previously identified by Harismendy et al. Finally, it is important to note that in YRI the frequency of the protective allele is very low (0.8%), meaning that most individuals in populations of African descent might be at a higher risk for coronary artery disease if rs1333047 is the mutation that plays a functional role in the disease. Interestingly, given the low minor allele frequency at rs1333047 in the YRI population, even directly genotyping this locus would require a much larger sample in order to reach statistical significance in this population.

This example illustrates the power of combining the results of functional studies such as ENCODE with results from GWAS in multiple populations, and in particular of considering populations in which replication of an association was unsuccessful together with linkage disequilibrium data. Further experimental validation of the binding of ISGF3 to this region, of the effect of rs1333047 on this binding site, and of the association between rs1333047 and coronary artery disease across populations is, however, necessary to definitively establish that rs1333047 is a functional SNP linked to coronary artery disease. Given the evidence supporting a functional role for multiple loci in tight linkage disequilibrium in 9p21, it appears likely that multiple SNPs in binding sites for transcription factors that are part of a broad range of pathways together play a role in the biological process underlying the association of this region with coronary artery disease.

References

References cited in both the main manuscript and supplementary information are not repeated

Badis G, Berger MF, Philippakis AA, Talukder S, Gehrke AR, Jaeger SA, Chan ET, Metzler G, Vedenko A, Chen X et al. 2009. Diversity and complexity in DNA recognition by transcription factors. Science 324: 1720-1723

Cunnington MS, Santibanez Koref M, Mayosi BM, Burn J, Keavney B. 2010. Chromosome 9p21 SNPs Associated with Multiple Disease Phenotypes Correlate with ANRIL Expression. PLoS Genet 6: e1000899 doi:10.1371/journal.pgen.1000899

Ding H, Xu Y, Wang X, Wang Q, Zhang L, Tu Y, Yan J, Wang W, Hui R, Wang CY et al. 2009. 9p21 is a shared susceptibility locus strongly for coronary artery disease and weakly for ischemic stroke in Chinese Han population. Circ Cardiovasc Genet 2: 338-346

Eeles RA, Kote-Jarai Z, Al Olama AA, Giles GG, Guy M, Severi G, Muir K, Hopper JL, Henderson BE, Haiman CA et al. 2009. Identification of seven new prostate cancer susceptibility loci through a genome-wide association study. Nat Genet 41: 1116-1121

Eeles RA, Kote-Jarai Z, Giles GG, Olama AA, Guy M, Jugurnauth SK, Mulholland S, Leongamornlert DA, Edwards SM, Morrison J et al. 2008. Multiple newly identified loci associated with prostate cancer susceptibility. Nat Genet 40: 316-321

Gudbjartsson DF, Walters GB, Thorleifsson G, Stefansson H, Halldorsson BV, Zusmanovich P, Sulem P, Thorlacius S, Gylfason A, Steinberg S et al. 2008. Many sequence variants affecting diversity of adult human height. Nat Genet 40: 609-615

Gudmundsson J, Sulem P, Steinthorsdottir V, Bergthorsson JT, Thorleifsson G, Manolescu A, Rafnar T, Gudbjartsson D, Agnarsson BA, Baker A et al. 2007. Two variants on chromosome 17 confer prostate cancer risk, and the one in TCF2 protects against type 2 diabetes. Nat Genet 39: 977-983

Helgadottir A, Thorleifsson G, Manolescu A, Gretarsdottir S, Blondal T, Jonasdottir A, Jonasdottir A, Sigurdsson A, Baker A, Palsson A et al. 2007. A common variant on chromosome 9p21 affects the risk of myocardial infarction. Science 316: 1491-1493

IBC 50K CAD Consortium. 2011. Large-scale gene-centric analysis identifies novel variants for coronary artery disease. PLoS Genet 7: e1002260 doi:10.1371/journal.pgen.1002260

Kim JJ, Lee HI, Park T, Kim K, Lee JE, Cho NH, Shin C, Cho YS, Lee JY, Han BG et al. 2010. Identification of 15 loci influencing height in a Korean population. J Hum Genet 55: 27-31

Lettre G, Jackson AU, Gieger C, Schumacher FR, Berndt SI, Sanna S, Eyheramendy S, Voight BF, Butler JL, Guiducci C et al. 2008. Identification of ten loci associated with height highlights new biological pathways in human growth. Nat Genet 40: 584-591

McPherson R, Pertsemlidis A, Kavaslar N, Stewart A, Roberts R, Cox DR, Hinds DA, Pennacchio LA, Tybjaerg-Hansen A, Folsom AR et al. 2007. A common allele on chromosome 9 associated with coronary heart disease. Science 316: 1488-1491

McPherson R. 2010. Chromosome 9p21 and coronary artery disease. N Engl J Med 362: 1736-1737

Okada Y, Kamatani Y, Takahashi A, Matsuda K, Hosono N, Ohmiya H, Daigo Y, Yamamoto K, Kubo M, Nakamura Y et al. 2010. A genome-wide association study in 19 633 Japanese subjects identified LHX3-QSOX2 and IGF1 as adult height loci. Hum Mol Genet 19: 2303-2312

Pasaniuc B, Zaitlen N, Lettre G, Chen GK, Tandon A, Kao WH, Ruczinski I, Fornage M, Siscovick DS, Zhu X et al. 2011. Enhanced statistical tests for GWAS in admixed populations: assessment using African Americans from CARe and a Breast Cancer Consortium. PLoS Genet 7: e1001371 doi:10.1371/journal.pgen.1001371

Platanias LC. 2005. Mechanisms of type-I- and type-II-interferon-mediated signalling. Nat Rev Immunol 5: 375-386

Schumacher FR, Berndt SI, Siddiq A, Jacobs KB, Wang Z, Lindstrom S, Stevens VL, Chen C, Mondul AM, Travis RC et al. 2011. Genome-wide association study identifies new prostate cancer susceptibility loci. Hum Mol Genet 20: 3867-3875

Schunkert H, Konig IR, Kathiresan S, Reilly MP, Assimes TL, Holm H, Preuss M, Stewart AF, Barbalic M, Gieger C et al. 2011. Large-scale association analysis identifies 13 new susceptibility loci for coronary artery disease. Nat Genet 43: 333-338

Shen GQ, Li L, Rao S, Abdullah KG, Ban JM, Lee BS, Park JE, Wang QK. 2008. Four SNPs on chromosome 9p21 in a South Korean population implicate a genetic locus that confers high cross-race risk for development of coronary artery disease. Arterioscler Thromb Vasc Biol 28: 360-365

Shen GQ, Rao S, Martinelli N, Li L, Olivieri O, Corrocher R, Abdullah KG, Hazen SL, Smith J, Barnard J et al. 2008. Association between four SNPs on chromosome 9p21 and myocardial infarction is replicated in an Italian population. J Hum Genet 53: 144-150

Silander K, Tang H, Myles S, Jakkula E, Timpson NJ, Cavalli-Sforza L, Peltonen L. 2009. Worldwide patterns of haplotype diversity at 9p21.3, a locus associated with type 2 diabetes and coronary heart disease. Genome Med 1: 51

Takata R, Akamatsu S, Kubo M, Takahashi A, Hosono N, Kawaguchi T, Tsunoda T, Inazawa J, Kamatani N, Ogawa O et al. 2010. Genome-wide association study identifies five new susceptibility loci for prostate cancer in the Japanese population. Nat Genet 42: 751-754

Thomas G, Jacobs KB, Yeager M, Kraft P, Wacholder S, Orr N, Yu K, Chatterjee N, Welch R, Hutchinson A et al. 2008. Multiple loci identified in a genome-wide association study of prostate cancer. Nat Genet 40: 310-315

Visel A, Zhu Y, May D, Afzal V, Gong E, Attanasio C, Blow MJ, Cohen JC, Rubin EM, Pennacchio LA. 2010. Targeted deletion of the 9p21 non-coding coronary artery disease risk interval in mice. Nature 464: 409-412

Weedon MN, Lango H, Lindgren CM, Wallace C, Evans DM, Mangino M, Freathy RM, Perry JR, Stevens S, Hall AS et al. 2008. Genome-wide association analysis identifies 20 loci that influence adult height. Nat Genet

Yamagishi K, Folsom AR, Rosamond WD, Boerwinkle E. 2009. A genetic variant on chromosome 9p21 and incident heart failure in the ARIC study. Eur Heart J 30: 1222-1228