INVESTIGATION

Genetic Variants Contribute to Expression Variability in Humans

Amanda M. Hulse* and James J. Cai*,†,1 *Interdisciplinary Program in Genetics and †Department of Veterinary Integrative Biosciences, Texas A&M University, College Station, Texas 77843-4458

ABSTRACT Expression quantitative trait loci (eQTL) studies have established convincing relationships between genetic variants and gene expression. Most of these studies focused on the mean of gene expression level, but not the variance of gene expression level (i.e., gene expression variability). In the present study, we systematically explore genome-wide association between genetic variants and gene expression variability in humans. We adapt the double generalized linear model (dglm) to simultaneously fit the means and the variances of gene expression among the three possible genotypes of a biallelic SNP. The genomic loci showing significant association between the variances of gene expression and the genotypes are termed expression variability QTL (evQTL). Using a data set of gene expression in lymphoblastoid cell lines (LCLs) derived from 210 HapMap individuals, we identify cis-acting evQTL involving 218 distinct genes, among which 8 genes, ADCY1, CTNNA2, DAAM2, FERMT2, IL6, PLOD2, SNX7, and TNFRSF11B, are cross-validated using an extra expression data set of the same LCLs. We also identify 300 trans-acting evQTL between >13,000 common SNPs and 500 randomly selected representative genes. We employ two distinct scenarios, emphasizing single-SNP and multiple-SNP effects on expression variability, to explain the formation of evQTL. We argue that detecting evQTL may represent a novel method for effectively screening for genetic interactions, especially when the multiple-SNP influence on expression variability is implied. The implication of our results for revealing genetic mechanisms of gene expression variability is discussed.

Genetics, Vol. 193, 95–108, January 2013
Copyright © 2013 by the Genetics Society of America
doi: 10.1534/genetics.112.146779
Manuscript received September 13, 2012; accepted for publication October 31, 2012
Supporting information is available online at http://www.genetics.org/lookup/suppl/doi:10.1534/genetics.112.146779/-/DC1.
1Corresponding author: Department of Veterinary Integrative Biosciences, Texas A&M University, TAMU 4458, College Station, TX 77843. E-mail: [email protected]

QUANTITATIVE genetic analysis has long focused on detecting genetic variants that affect organismal phenotypes. This is often done by contrasting mean differences in phenotypes among genotypes. Despite increasing evidence across several species for genetic control of phenotypic variance (Ansel et al. 2008; Hill and Mulder 2010; Jimenez-Gomez et al. 2011), variance differences in phenotypes have been largely ignored. Recently, however, the fact that variance of phenotypes is genotype dependent has inspired the detection of genetic variants associated with phenotypic variability (Pare et al. 2010; Sudmant et al. 2010; Yang et al. 2012).

When gene expression level is considered as a heritable, quantitative trait, statistical associations between mean gene expression and genotype can be established to identify those genomic loci associated with or linked to gene expression level (i.e., expression QTL, eQTL) (Montgomery and Dermitzakis 2011). The difference in mean gene expression and its genetic control have been extensively examined in humans (Stranger et al. 2005, 2007b; Choy et al. 2008; Montgomery et al. 2010; Pickrell et al. 2010). The difference in variance of gene expression (i.e., gene expression variability) is genetically controlled and likely to be selectable (Raser and O’Shea 2005; Blake et al. 2006; Maheshri and O’Shea 2007; Cheung and Spielman 2009; Zhang et al. 2009). A small number of initial efforts have been made to quantify the difference in gene expression variability (that is, variance of gene expression) (Ho et al. 2008; Li et al. 2010a; Mar et al. 2011; Xu et al. 2011b). Yet, little attention has been paid to the genetic control of gene expression variability in humans.

In the present study, we seek to discover genome-wide genetic variants (i.e., SNPs) associated with differences in the variance of gene expression among individuals. We adapt the double generalized linear model (dglm) (Verbyla and Smyth 1998) to test for the inequality of expression variances and measure the contribution of genetic variants to the expression heteroscedasticity. The model has been recently used to detect genetic loci controlling phenotypic variability in chicken F2 crosses (Ronnegard and Valdar 2011). A likelihood ratio test (LRT) of the dglm method allows us to compare the fit of a “full model” and a “mean model.” The full model takes into account the contribution of genotype to both the mean and the variance of gene expression simultaneously, while the mean model takes into account only the contribution of genotype to the mean, ignoring the contribution to the variance. A significant result of the LRT indicates the nonrandom association between the genotypes and the variances of gene expression. Here we designate the genomic loci statistically associated with gene expression variability expression variability QTL (evQTL). The results of our genome-wide scan for evQTL provide a glimpse into the abundance and distribution of expression variability controlling variants in the human genome. Given that the variance of a quantitative trait is likely to differ under the influence of genetic interactions (Pare et al. 2010; Ronnegard and Valdar 2011), our evQTL detecting method may be used to help detect the interactions between genetic variants controlling gene expression.

Methods

Expression data

Gene expression data from the studies of Stranger et al. (2007a) and Choy et al. (2008) were downloaded from the Gene Expression Omnibus (GEO) website with accession nos. GSE6536 and GSE11582, respectively. The two data sets were designated GSE6536 and GSE11582 thereafter. In the two studies, the expression levels were measured in lymphoblastoid cell lines (LCLs) derived from HapMap individuals using two different platforms: Illumina human whole-genome expression array (WG-6 version 1) for GSE6536 (Stranger et al. 2007a,b) and Affymetrix human genome U133A array for GSE11582 (Choy et al. 2008). The downloaded data had been normalized by using quantile normalization across replicates of a single individual and then median normalized across all HapMap individuals. The downloaded data sets included 16,992 genes (19,440 probes) and 13,012 genes (20,995 probes) in GSE6536 and GSE11582 data sets, respectively, and 11,633 shared genes.

The Sequence Alignment/Map (SAM) files of the RNA sequencing (RNA-seq) data for 60 HapMap individuals of European origin in Utah (CEU) from the study of Montgomery et al. (2010) were downloaded at the website http://jungle.unige.ch/rnaseq_CEU60. We used SAMMate (Xu et al. 2011a) to estimate the expression level using the number of reads per kilobase of transcript per million mapped reads (RPKM) (Mortazavi et al. 2008).

The coefficient of variation (CV) was used as a normalized measure of the dispersion of expression distribution as in previous studies (Maheshri and O’Shea 2007; Ansel et al. 2008; Ronnegard and Valdar 2011). The CV of each gene was computed as CV = 100 × σ/μ, where σ and μ are the standard deviation and the mean of gene expression levels, respectively.

Genotype data

Human polymorphism data were obtained from the HapMap project (International Hapmap Consortium 2007) and the pilot phase of The 1000 Genomes (1000G) Project (The 1000 Genomes Project Consortium 2010). The HapMap data release 28 includes genotypes of 4 million SNPs merged from phases 1, 2, and 3 of the project (International Hapmap Consortium 2007). HapMap samples are from four different populations: Yorubans from Ibadan, Nigeria (YRI), individuals of European origin in Utah (CEU), Han Chinese from Beijing (CHB), and Japanese from Tokyo (JPT). The raw data from the pilot study of the 1000G project contains 15 million SNPs from a total of 179 samples also from the four HapMap populations. From the HapMap data, we extracted genotypes of individuals whose gene expression data are included in the GSE6536 and GSE11582 data sets. The HapMap CEU and YRI populations consist of 60 trios. We removed the 60 child samples from trios and used the remaining 210 unrelated samples in our analysis. From the 1000G data, we extracted genotypes of 153 and 149 individuals whose gene expression data were available in GSE6536 and GSE11582, respectively. The 1000G project data give a snapshot of human polymorphism at an unprecedented scale and resolution. The released 1000G low-coverage data captures nearly all (95%) of the common polymorphism in a relatively ascertainment-free manner (The 1000 Genomes Project Consortium 2010). One disadvantage of using the 1000G genotype data in our case was that the sample size became smaller. In addition, genotypes of 59 individuals were extracted from the HapMap release and paired with expression data in the RNA-seq data set.

The minor allele frequency (MAF) and F_ST statistic for SNPs, and LD estimates between SNP pairs (R² and D′) were computed using Matlab functions in PGEToolbox (Cai 2008). To eliminate low-frequency polymorphisms, we discarded SNPs with MAFs <10% in the YRI, CEU, and CHB/JPT populations. To control for the effect of population stratification, we excluded SNPs with F_ST ≥ 0.2. We also excluded SNPs that were found to deviate from Hardy-Weinberg (HW) equilibrium by using hweStrata, an exact stratified test across populations (Schaid et al. 2006).

Regression models

For each transcript-SNP pair, the association between gene expression level and the genotype is assumed to be linear. The conventional linear regression model is

$$y_i = \mu + g_i \alpha + \varepsilon_i, \qquad \varepsilon_i \sim N(0, \sigma^2), \tag{1}$$

where $y_i$ indicates a gene expression trait of individual $i$, $g_i$ is the genotype at the given SNP (encoded as 0, 1, or 2 for homozygous rare, heterozygous, and homozygous common alleles, respectively), and $\varepsilon_i$ is the residual with variance $\sigma^2$. The significance of association can be assessed using the nominal, parametric P-value of the test of no association, i.e., $\alpha = 0$, or using a permutation test (Stranger et al. 2005).
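
As a minimal sketch of Equation 1 (not the authors' implementation, which the article does not show), the additive model can be fitted by ordinary least squares and the null hypothesis α = 0 assessed by permuting expression values relative to genotypes. The function names, the toy data, the number of permutations, and the seed are all illustrative assumptions.

```python
import random
from statistics import mean


def fit_additive_model(genotypes, expression):
    """Least-squares fit of y_i = mu + g_i * alpha + e_i; returns (mu, alpha)."""
    g_bar, y_bar = mean(genotypes), mean(expression)
    sxx = sum((g - g_bar) ** 2 for g in genotypes)
    sxy = sum((g - g_bar) * (y - y_bar) for g, y in zip(genotypes, expression))
    alpha = sxy / sxx
    mu = y_bar - alpha * g_bar
    return mu, alpha


def permutation_p_value(genotypes, expression, n_perm=1000, seed=0):
    """Two-sided permutation P-value for alpha = 0 (hypothetical helper)."""
    rng = random.Random(seed)
    observed = abs(fit_additive_model(genotypes, expression)[1])
    shuffled = list(expression)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break any genotype-expression link
        if abs(fit_additive_model(genotypes, shuffled)[1]) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_perm + 1)


# Toy data (assumed): genotype codes 0/1/2 and made-up expression values.
genotypes = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2]
expression = [2.1, 1.9, 2.0, 2.2, 5.0, 4.8, 5.1, 5.2, 8.0, 7.9, 8.1, 8.2]

mu, alpha = fit_additive_model(genotypes, expression)
p_perm = permutation_p_value(genotypes, expression)
print(f"mu={mu:.3f} alpha={alpha:.3f} p_perm={p_perm:.4f}")
```

The permutation route is used here only because it needs nothing beyond the standard library; the parametric P-value mentioned in the text would come from the usual t-test on the slope.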

To account for the effect of population difference on gene expression, we introduced a covariate $x$ into the model. We defined $x_i$ as the $i$th row in a design vector of the covariate, which is encoded as 0 for African (YRI) individuals and 1 for Eurasian (non-YRI) individuals initially and then was also assessed with encoding YRI, CEU, and CHB/JPT with 0, 1, and 2, respectively. With these we considered the double generalized linear model as follows:

$$y_i = \mu + x_i \beta + g_i \alpha + \varepsilon_i, \qquad \varepsilon_i \sim N\bigl(0, \sigma^2 \exp(g_i \theta)\bigr), \tag{2}$$

where $\theta$ is the corresponding vector of coefficients of genotype $g_i$ on the residual variance. With this model, the mean and variance of expression $y_i$ are controlled by both effects of population substructure, $x_i$, and SNP genotype, $g_i$. Equation 2 is otherwise equivalent to Equation 1 but with the additional covariate $x_i$ and with $\varepsilon_i \sim N(0, \sigma_i^2)$ [where $\log(\sigma_i^2) = \log(\sigma^2) + g_i \theta$], modeling dispersion in generalized linear models. Equation 2 is termed the dglm (Verbyla and Smyth 1998), which is, more specifically, a classical heteroscedastic dispersion regression model (Bickel 1978). We coded the fitting procedure for Equation 2 using the dglm package in R. For regression computation, we tested one SNP-gene pair and obtained two P-values: Pdispersion and Pmean, for the effects of genotypes on the variance and the mean of expression levels, respectively (Ronnegard and Valdar 2011). SNP-gene pairs that did not make the algorithm converge during computation were discarded.

Assessing significance

To account for multiple testing performed between massive numbers of probe-SNP pairs (747,524 and 763,745 for GSE6536 and GSE11582 data sets, respectively), we used the threshold of Pdispersion < 1 × 10⁻⁸, which is closely equivalent to Bonferroni adjusted P < 0.01, to assess the genome-wide significance. To control for the effect of outlier expression data points, permutation tests were conducted for all significant pairs. Specifically, for each transcript probe-SNP pair, we performed 10,000 permutations of expression phenotype relative to the genotypes of the SNP. An association was considered significant if the observed Pdispersion was lower than the threshold of the 0.001 tail of the distribution of the Pdispersion-values from the 10,000 permutations, that is, Ppermutation < 0.001. In addition, we required the P-value of the F–K test (Fligner and Killeen 1976) for homogeneity of variances in gene expression among different genotype groups to be <0.01, following Fraser and Schadt (2010). The F–K test is a nonparametric alternative to the F-test investigating the homogeneity of variances. The F–K test does not assume that the data are normally distributed and it is also insensitive to violations of this assumption (Hallin and Paindaveine 2008).

Trans-association analysis

For the trans-acting association analysis, we selected 500 representative genes and a subset of genome-wide SNPs. To select representative genes, we applied the k-means clustering algorithm to all genes included in the GSE6536 data set and obtained 500 distinct clusters of genes. We used the k-means clustering algorithm implemented in Matlab function kmeans and we put the initial settings of the random number generator to their default values with seed 0. From each cluster, we randomly selected one gene to form the 500-gene set. To select the subset of genome-wide SNPs, we first detected the genome-wide linkage disequilibrium (LD) blocks with YRI genotypes using Haploview v4.2 with default parameters (Barrett et al. 2005). LD blocks were defined using the method described by Gabriel et al. (2002). From each genomic region of LD block, as well as each genomic region between two neighboring LD blocks (i.e., interblock region), one common SNP (MAF > 10% in all four HapMap populations) was randomly selected. Finally a total of 13,718 HapMap SNPs on autosomes were selected for the trans-acting association analysis. For each pair of gene and SNP, the same criteria for detecting cis-evQTL, with one modification, that is, P < 1 × 10⁻⁹ instead of < 1 × 10⁻⁸, were used to assess the significance of association.

Creation of eQTL by subsampling evQTL

We simulated a population of 10,000 individuals under HW equilibrium with minor allele frequency of 0.4. Three normal distributions of expression levels with equal means and monotonically increased expression variances, one for each genotype, were generated as follows: aa ~ N(50, 2), Aa ~ N(50, 4), and AA ~ N(50, 6). Individuals were randomly assigned a genotype and then a gene expression value drawn randomly from the normal distribution that corresponded to the assigned genotype. From the simulated population a random sample of 210 individuals was extracted. For each random sample eQTL presence was detected using the linear regression Equation 1. This process was repeated for 10,000 random samples to determine the average number of eQTL detected per 10,000 samples.

Analysis of functional annotation

We tested for significant enrichment of genes with specific ontology terms, keywords in annotation, and pathways relative to the rest of the genes in the genome, using DAVID (Huang da et al. 2009). The enrichments of terms, keywords, and pathways among the tested genes were evaluated through the use of Fisher’s exact test. The nominal P-values were corrected for multiple testing using the Benjamini–Hochberg method (Benjamini and Hochberg 1995; Huang da et al. 2009). We determined statistical significance of enrichment using a cutoff of false discovery rate < 5%.

Association with copy number variations and positive selection

To evaluate whether evQTL are more likely to be associated with copy number variation (CNV), we obtained the genome-wide CNV information from the study of Johansson and Feuk (2011). They used a total of 30,047 structural variations from
the Database of Genomic Variants (Iafrate et al. 2004), containing experimentally detected variations larger than 1 kb in healthy controls, to define the CNV-containing regions, as well as the CNV-free regions, in the human genome. They estimated that 36% of the human genome belongs to CNV-containing regions and the remaining 64% belongs to the CNV-free regions (Johansson and Feuk 2011). We took the chromosomal coordinates of these regions and checked how many evQTL genes we identified were located in each region.

To address the question of whether positive selection is likely to be associated with evQTL genes, we checked whether identified evQTL genes appear to be in genomic regions under positive selection. We compiled the results from different genome-scan studies that detected positive selection. These studies employed a variety of statistics to identify a total of 4282 genomic regions (784 Mb) showing signatures of positive selection (Supporting Information, Table S5). We took the chromosomal coordinates of these regions and determined evQTL genes that were located in each region.

Linkage disequilibrium between cis-evQTL SNPs

To measure the linkage between cis-evQTL SNPs, we computed two estimates of linkage disequilibrium, R² and D′, between all possible pairs of cis-evQTL SNPs in each corresponding gene. To obtain the null distributions of R² and D′ values, we randomly sampled the same number of pairs of SNPs in the cis-regions of each gene and computed the two LD estimates between all pairs of SNPs. We plotted the frequency distributions of R² and D′ values obtained between evQTL SNP pairs against those between pairs of randomly selected SNPs. If the distributions of R² and D′ for cis-evQTL SNP pairs shifted toward the higher ends, compared to those for randomly selected SNP pairs, it will indicate that evQTL SNPs are nonrandomly associated with other evQTL SNPs in the same locus and the links are made through haplotypes.

Results

We obtained genome-wide expression data assessed by microarray hybridization in human LCLs from the studies of Stranger et al. (2007a) and Choy et al. (2008). We denote the two data sets by their GEO accessions: GSE6536 and GSE11582. To exclude transcripts that are not expressed or extremely lowly expressed in the LCLs, we removed probes with the intensity signal smaller than the 10th percentile of all intensity values. The final data sets included 15,574 genes (17,496 probes) and 12,076 genes (18,896 probes) in GSE6536 and GSE11582 data sets, respectively, and 10,188 shared genes. The two expression data sets include samples from the same 210 unrelated individuals included in the HapMap consortium studies (International Hapmap Consortium 2005, 2007). We used the two data sets separately to repeat our evQTL analysis and synthesized the results.

Biological variability of gene expression

Using the CV as an estimate of variability, we found that different genes showed substantial differences in their expression variability (Figure S1). For example, FXN and XCL1 have a similar mean level of expression in LCLs; however, the CV of FXN (= 2.5) is only one-fifth that of XCL1 (= 12.8) calculated using the GSE6536 data set across 210 unrelated individuals (Figure 1). The same pattern was observed when using the GSE11582 data set to calculate the CVs for the two genes (= 5.7 and 23.7, respectively) (Figure S2).

Figure 1 Comparison of expression variability of a gene (FXN) with constricted variability vs. a gene (XCL1) with elevated variability. Expression levels shown are from the GSE6536 data set.

To illustrate that the gene expression variability is not explained by measurement error, we compared the microarray data (GSE6536) with the RNA-seq data (Montgomery et al. 2010). For each expressed gene in the two data sets, we calculated the CV of expression levels across individuals. The use of CV allows for dimensionless comparison of different types of normalized expression data sets such as microarray data and RNA-seq data. We found that variability in expression for each gene was similar in microarray and RNA-seq data sets [Spearman’s ρ = 0.703, P < 1 × 10⁻¹⁵ (n = 12,964, assuming independence between genes), Figure S3]. This result suggests that technical error in expression measurement cannot fully account for the expression variability observed in the samples. Therefore, variability in gene expression is a fundamental biological property of gene expression itself, rather than the technology used to measure expression (see also Hansen et al. 2011).

To examine the population specificity of gene expression variability, we performed pairwise correlation analyses between CVs estimated separately for different HapMap populations: YRI, CEU, and CHB/JPT. All three correlations were significantly positive and strong [Spearman’s ρ(cvYRI, cvCEU) = 0.92, ρ(cvCEU, cvCHB/JPT) = 0.91, ρ(cvYRI, cvCHB/JPT) = 0.92, all P < 1 × 10⁻¹⁵]. We also used the coefficient of dispersion (CD), the ratio between the variance σ² and the mean μ, as an alternative statistic of the variability, and conducted the same correlation analysis using the two microarray data sets. Again, all three correlations were significantly positive and strong [Spearman’s ρ(cdYRI, cdCEU) = 0.93, ρ(cdCEU, cdCHB/JPT) = 0.92, ρ(cdYRI, cdCHB/JPT) = 0.94, all P < 1 × 10⁻¹⁵]. These results suggest that the population specificity of gene expression variability is weak. Thus, it might be “safe” to transfer the estimate of expression variability from one population to another. This finding shows good agreement with the observations reported in the studies of Li et al. (2010a) and Storey
et al. (2007), that is, most expression variation is due to variation among individuals rather than among populations.

Identification of cis-acting evQTL

Next we identified genome-wide genetic variants associated with expression variability by considering nonepistatic, additive genetic effects of genotypes on expression levels in an adapted regression model (Methods, Figure 2). As a second-order statistic, the variance of expression levels (or the CV) requires large sample sizes to be accurately estimated (Lee and Nelder 2006; Visscher and Posthuma 2010). To increase the power of association analysis, we pooled data (including normalized expression values and genotypes) from individuals of YRI, CEU, and CHB/JPT populations. This resulted in 210 data points per gene for both GSE6536 and GSE11582 data sets. The pooling operation is justified because gene expression variability is not population specific. Nevertheless, to control for the nonspecific effects of population stratification on the expression variability, we only included SNPs with MAF >10% in each of the three populations in the analysis. We also required that all these included SNPs showed no signs of population differentiation and no signs of deviation from HW equilibrium (Methods). In addition, in the regression model (Methods), we included the population identities as a covariate. We encoded YRI and non-YRI individuals using 0 and 1, respectively, and introduced this variable into the final model as the covariate. We also tried another schema, in which YRI, CEU, and CHB/JPT populations were encoded with 0, 1, and 2, respectively. The two encoding schemas produced highly similar results, as indicated by the significant positive correlation between P-values obtained by using two and three categories of populations as covariates (Spearman’s ρ = 0.98, P < 1 × 10⁻¹⁵, n = 15,399) (Figure S4). Only the results derived from the first encoding schema are reported in this article.

Figure 2 Schematic illustrations of eQTL and evQTL with a real example. For any given biallelic SNP, if one of its two possible homozygous genotypes (e.g., AA) is statistically associated with a high (or low) variance of expression levels of a gene, while the other possible homozygous genotype (e.g., aa) is associated with a low (or high) variance of expression levels of the gene, and its heterozygous genotype is associated with an intermediate variance of expression levels of the gene, then this SNP-gene pair is a potential evQTL (middle). The real example is between SNP rs6572868 and FERMT2 (right). SNP rs6572868 is located in the 5′-UTR of the FERMT2 gene on chromosome 14. The variation of expression of FERMT2 among TT homozygotes is larger than that among CC homozygotes. Heterozygous (CT) individuals show an intermediate level of variation in expression of FERMT2. The significant association between genotypes of the SNP and expression variability of the gene was detected by jointly using the dglm test (Pdispersion = 5.6 × 10⁻⁹), permutation test (P < 0.001), and the Fligner–Killeen (F–K) test (P = 4.8 × 10⁻⁵).

We detected cis-acting evQTL SNPs by considering those SNPs located within a 1-Mb region from the transcription starting site (TSS) of a gene. The method we used was based on the dglm, which is the regression model previously used in detecting major loci controlling the variance of nonexpression phenotypes (Ronnegard and Valdar 2011). We applied the dglm test to identify potentially significant associations between genotypes [encoded with 0, 1, and 2 based on the number of major allele(s)] and variance of gene expression in SNP-gene pairs. A SNP-gene pair with an association reaching the significance threshold of Pdispersion < 1 × 10⁻⁸ in the dglm test was considered a potential evQTL. To rigorously assess the significance of each potential evQTL, we employed the permutation test, in which the gene expression data were randomly shuffled and the evQTL relationship between the shuffled expression replicate and genotype was reassessed. For each evQTL, the shuffling and reassessing processes were repeated 10,000 times to obtain the null distribution of Pdispersion. We required an evQTL association to have a P-value of permutation test, Ppermutation < 0.001, that is, fewer than 10 of the 10,000 Pdispersion values obtained from the dglm tests with the shuffled expression data sets were as small as or smaller than the observed Pdispersion (Methods). The permutation test was less sensitive to outliers, which could have a large impact on the Pdispersion in the regression (see Figure S5 for an example). Finally, we also required PF–K, the P-value of the Fligner–Killeen (F–K) test (Fligner and Killeen 1976), a nonparametric alternative to the F-test, for the homogeneity of variances of expression levels among genotype groups to be smaller than 0.01. The purpose of using the three tests jointly is to reduce false positives and prioritize evQTL of large effect.

Using the two expression data sets, GSE6536 and GSE11582, we detected 166 and 60 genes (corresponding to 179 and 108 probes, respectively) that were significantly associated with at least one SNP that is located in their cis-regions. The full list of these genes and probes is given in Tables S1 and S2. A total of 218 distinct genes were found when combining the results from both data sets, including 8 common genes (Table 1). That is, evQTL SNPs were identified in the genomic regions of these 8 genes with each of the two expression data sets. This number was found to be statistically significant [P (x ≥ 8) = 8.6 × 10⁻⁶, hypergeometric test, Figure S6]. For each of the 8 common genes, the effect of selected evQTL SNP on the gene expression variability as a function of genotypes (0, 1, and 2) is shown in Figure S7.

As mentioned, 166 genes were identified using the GSE6536 data set as they are involved in at least one cis-evQTL (Table S1). Among these genes, 81 have more than
one evQTL SNP. We iterated all 379 gene-SNP pairs and found that 6 gene-SNP pairs show a significant evQTL relationship with both the GSE6536 and GSE11582 data sets. In other words, we replicated the discovery of evQTL in these 6 gene-SNP pairs between the exact same SNPs and corresponding genes using both expression data sets. In all these 6 cases, the direction of the effect of genotypes on the variance of expression is consistent between GSE6536 and GSE11582 data sets (Figure S8).

Table 1 The evQTL genes identified using the expression data sets of GSE6536 and GSE11582, from the studies of Stranger et al. (2007a) and Choy et al. (2008), respectively

| HGNC gene symbol | Gene name | cis-evQTL (expression variance) SNP | No. of trans-evQTL SNPs | cis-eQTL (expression mean) SNP previously detected |
|---|---|---|---|---|
| ADCY1 | Adenylate cyclase 1 | rs7795288 | ND | rs1007572 (Veyrieras et al. 2008) |
| CTNNA2 | Catenin (cadherin-associated protein), alpha 2 | rs11126722, rs6722266, rs12620205, rs12616530, and rs10496228 | 11 | rs17340143 (Pickrell et al. 2010) |
| DAAM2 | Dishevelled associated activator of morphogenesis 2 | rs1616076 and rs1651105 | ND | ND |
| FERMT2 | Fermitin family homolog 2 | rs6572868 | ND | ND |
| IL6 | Interleukin 6 (interferon, beta 2) | rs6969326 and rs2905342 | ND | ND |
| PLOD2 | Procollagen-lysine, 2-oxoglutarate 5-dioxygenase 2 | rs12486562, rs10935575, rs12496198, rs2864793, rs2903479, and rs1877511 | 65 | rs1449444, rs4681294, rs9852153, rs9289716, rs12496198, and rs2864793 (Veyrieras et al. 2008) |
| SNX7 | Sorting nexin 7 | rs12406053, rs12057921, rs11166111, and rs1545835 | 6 | rs9285629, and other 14 SNPs (Pickrell et al. 2010; Stranger et al. 2007b) |
| TNFRSF11B | Tumor necrosis factor receptor superfamily, member 11b | rs1160273, rs7814509, rs1493941, rs10955900, rs3103986, and rs12545372 | ND | rs7010267, rs1872426, rs3134058, rs3134063, rs6469789, rs1385503, and rs2073617 (Veyrieras et al. 2008) |

The cis-acting eQTL SNPs and the references for the previous studies, in which they were detected, are given. In all these studies, gene expression levels are measured in LCLs. The cis-evQTL SNP, rs2864793, that has been previously identified as a cis-eQTL SNP (Veyrieras et al. 2008), is underlined. ND, not detected.

For the remaining 373 gene-SNP pairs, 32 gene-SNP pairs in GSE6536 do not appear in GSE11582. This is because different array specifications used in the two data sets result in the coverage of slightly different gene sets. The rest, 341 gene-SNP pairs, appeared in both data sets, but they were found significant using GSE6536 but not GSE11582. We suspect that this inconsistency was produced for two reasons. First, the GSE11582 data set contains a mixup of a handful of RNA samples, which have been accidentally swapped. This mixup was revealed by a recent application of novel microarray experiment QC methods (see http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE11582). Second, the stringent Pdispersion cutoff for statistical significance blocked many evQTL signals from the GSE11582 data set. We reassessed the distribution of Pdispersion of associations of these 341 gene-SNP pairs using the GSE11582 data. Indeed, the distribution was shifted toward the lower end of the Pdispersion spectrum, albeit these values were not small enough (i.e., < 1 × 10⁻⁸) to become genome-wide significant (Figure S9). We also conducted a side-by-side comparison between GSE6536 and GSE11582 results for evQTL involving 228 SNP-gene pairs (this number is smaller than the original 341 because the expression of some genes in the 341 SNP-gene pairs was measured by multiple probes but we only randomly picked one probe in doing this inspection). We inspected the direction of association between genotypes and the variances of gene expression levels and found most evQTL obtained from the two data sets are in the same direction.

Finally, we used the RNA-seq data to validate the effect of evQTL SNPs. One of the advantages of using the RNA-seq data is that the saturation of signal intensity above a certain abundance level is not a problem (Majewski and Pastinen 2011). We focused on the direction of the allele-specific effect of evQTL SNPs on the increase or decrease of the variability. Again, it is expected that if one allele of an evQTL SNP identified using microarray data is associated with higher expression variability of the corresponding gene, then the same allele-specific effects should be recovered using the RNA-seq data. Here is a positive example: using both GSE6536 and GSE11582 data sets, we found that the expression variability of CTNNA2 decreases with the number of the major allele (C) and increases with the number of the minor allele (T) in genotypes of SNP rs11126722, that is, Var_TT > Var_TC > Var_CC (Figure S7). In this case, we expected to find the same effects (TT associated with higher and CC with lower expression variability of CTNNA2) using RNA-seq data. Indeed, using the RNA-seq data for 59 genotyped CEU individuals, we found 85% of tested evQTL SNPs (35 of the 41) act in the same directions as they were identified using the GSE6536 microarray
data (Table S3). Note that, because only 59 individuals were available to measure variability in the RNA-seq data set, only 41 of the evQTL SNPs were selected for this validation. We required that the variances of RNA-seq expression between the homozygotes of a SNP be significantly different (i.e., P < 0.01, F–K test), as only then could the direction of the SNP's effect on expression variability be assessed. Taken together, these results provide significant evidence that evQTL SNPs are associated with regulating the variability in expression of the corresponding genes in a cis-acting, allele-specific manner.

Spatial distribution of cis-acting evQTL SNPs

To understand the spatial distribution of cis-evQTL SNPs around the corresponding genes, we used genotype data from the pilot study of The 1000 Genomes (1000G) Project to replace HapMap SNPs. The 1000G SNPs cover the genome more thoroughly than HapMap SNPs, so the advantage of using them is that we can literally “impute” genotypes for sites that were not genotyped in the HapMap project. The disadvantage is that the number of genotyped 1000G individuals whose gene expression data are also available is much smaller than that of HapMap individuals (e.g., 153 vs. 210 for GSE6536). Nevertheless, we applied the same criteria for ascertaining evQTL to the 1000G genotype data. Using the GSE6536 data, we found significant cis-evQTL SNPs in all eight previously identified genes (Figure S10). Using the GSE11582 data, we found significant cis-evQTL SNPs in six of the eight genes (all except IL6 and PLOD2, Figure S11).

Because of the large number of 1000G SNPs, additional novel cis-evQTL SNPs were identified using the 1000G genotype data. At the IL6 locus, besides the two SNPs identified using HapMap data, 33 new evQTL SNPs (including 3 intronic SNPs, one in each of introns 2, 3, and 4) were added (Figure 3). In addition, 6, 12, and 37 new evQTL SNPs in introns of ADCY1, PLOD2, and SNX7, respectively, were identified. Within SNX7, a synonymous SNP, rs2019213, and a nonsynonymous SNP, rs35296149, were also identified. Finally, we computed two estimates of linkage disequilibrium, R² and D′, between all possible pairs of cis-evQTL SNPs in each corresponding gene and found that pairs of these SNPs are more tightly linked than pairs of randomly selected SNPs (Figure S12, Methods).

Figure 3 Distribution of cis-acting evQTL SNPs at IL6. Red circles indicate positions of SNPs that are significantly (P_dispersion < 1 × 10^-8, P_permutation < 0.001, and P_F–K < 0.01) associated with the expression variability of IL6. The x-axis shows the chromosomal coordinate in megabases. The y-axis shows the negative logarithm of P_dispersion of the dglm test. The dashed line indicates the significance threshold of P_dispersion = 1 × 10^-8. Black circles above the dashed line indicate SNPs having P_dispersion < 1 × 10^-8 but not P_permutation < 0.001 and/or P_F–K < 0.01. The red arrow points to the transcriptional start site of IL6.

Relationship between evQTL and eQTL

To understand the relationship between evQTL and conventional eQTL, we examined whether evQTL identified using the GSE6536 data set are also eQTL. Using the simple linear regression model (Equation 1 of Methods), we computed eQTL P-values for the 379 evQTL SNP-gene pairs. We found that 40 of the 379 evQTL SNPs have an eQTL with P < 1 × 10^-8; that is, 11% of evQTL SNPs are also eQTL SNPs for the same corresponding genes. These include the evQTL between rs2864793 and PLOD2 (Table 1), which has been previously identified as an eQTL (Veyrieras et al. 2008). We categorized these joint evQTL–eQTL into four groups based on the direction of the genotypes' effects on the mean and the variance of gene expression levels (Figure 4). The distribution of these 40 SNPs in the four groups is 12, 5, 3, and 20. Among them, 32 (80%, groups 1 and 4) showed a pattern of positive correlation; that is, the variance of gene expression increases with increasing mean gene expression. In contrast, 8 (20%, groups 2 and 3) showed a negative correlation. The existence of the anticorrelation rejects the hypothesis that evQTL are merely byproducts of large-effect eQTL.

Given the heterogeneous patterns of the evQTL–eQTL relationship, it is tempting to speculate that some eQTL are, in fact, evQTL. When the sample size is relatively small, evQTL can be falsely identified as eQTL. To show this, we simulated a subsampling process on an evQTL and demonstrated that subsampling can create a sample that is easily falsely discovered as an eQTL. We started with a population of 10,000 individuals and created an ideal evQTL with three genotypes having the same expression mean and monotonically increasing expression variances: Var_aa = 2, Var_Aa = 4, and Var_AA = 6 (Figure S13 A, Methods). From this population, we randomly sampled 210 individuals and computed the eQTL significance between their genotypes and gene expression using the simple linear regression (Equation 1, Methods). Four independent realizations of subsampling, in which eQTL were detected (P < 0.01), are shown in Figure S13 B–E. The random subsampling was repeated and each time the presence of an eQTL

Figure 4 Direction of evQTL and eQTL effects of genotypes in the joint evQTL–eQTL. Lines indicate the result of simple linear regression for eQTL effect as a function of genotypes. The major allele is designated A and the minor allele a. The joint evQTL–eQTL are classified into four groups: 1, both mean and variance increase with the major allele; 2, mean increases but variance decreases; 3, mean decreases but variance increases; and 4, both mean and variance decrease.
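The subsampling experiment described in this section can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the procedure used in the study: the text gives the population size (10,000), the three variances (2, 4, and 6), the subsample size (210), and the two P-value cutoffs, but the genotype frequencies (taken here as equal), the common mean (taken as 0), the random seed, the reduced number of repetitions, and the function names make_population and count_false_eqtl are all assumptions introduced for illustration. The regression is done with scipy.stats.linregress as a stand-in for Equation 1 of Methods.

```python
import numpy as np
from scipy import stats


def make_population(rng, size=10_000, variances=(2.0, 4.0, 6.0)):
    """Idealized evQTL: equal means, genotype-dependent variances.

    Assumption: genotypes aa=0, Aa=1, AA=2 are equally frequent and the
    common expression mean is 0 (neither is stated in this section).
    """
    genotypes = rng.integers(0, 3, size=size)
    sd = np.sqrt(np.asarray(variances))[genotypes]
    expression = rng.normal(loc=0.0, scale=sd)
    return genotypes, expression


def count_false_eqtl(genotypes, expression, rng,
                     n_sample=210, n_rep=2_000, alpha=0.01):
    """Count subsamples in which linear regression reports an eQTL."""
    hits = 0
    for _ in range(n_rep):
        # Draw individuals without replacement from the fixed population.
        idx = rng.choice(genotypes.size, size=n_sample, replace=False)
        fit = stats.linregress(genotypes[idx], expression[idx])
        if fit.pvalue < alpha:
            hits += 1
    return hits


if __name__ == "__main__":
    rng = np.random.default_rng(0)  # arbitrary seed for reproducibility
    g, y = make_population(rng)
    n_rep = 2_000  # fewer than the 10,000 repetitions in the text, for speed
    loose = count_false_eqtl(g, y, rng, n_rep=n_rep, alpha=0.01)
    strict = count_false_eqtl(g, y, rng, n_rep=n_rep, alpha=1e-8)
    print(f"P < 0.01: {loose} of {n_rep} subsamples")
    print(f"P < 1e-8: {strict} of {n_rep} subsamples")
```

Because the mean is identical across genotypes by construction, every detected eQTL in this sketch is a false positive, which is the quantity the text reports per 10,000 subsamplings. The exact counts depend on the assumed genotype frequencies and on the seed, so they are not expected to match the reported numbers.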

was tested. In every 10,000 subsamplings, we detected 76 eQTL on average when using the 0.01 cutoff, and no eQTL when using the 1 × 10^-8 cutoff. It therefore seems unlikely that an evQTL would be detected as an eQTL by the statistical tests adopted by most eQTL studies. However, this is not conclusive, as our simulation was conducted using a “perfectly symmetric” evQTL, in which gene expression exactly fits normal distributions with zero skewness. In real experimental data, if an evQTL is “asymmetric,” such as the one on the right of Figure 2, then the rate of misidentification might be higher.

Functions and genomic features of cis-acting evQTL genes

The eight evQTL genes, which were identified using both microarray data sets, encode proteins with multifaceted functions (Table 1). For example, IL6 encodes interleukin 6, which plays a role in many biological processes, including signaling for the final differentiation of B cells to produce antibodies (Hirano et al. 1986) and stimulating production of hepatocytes (Andus et al. 1987). PLOD2 encodes a telopeptide lysyl hydroxylase, which is essential for the stability of intermolecular collagen cross-linkages because it creates hydroxylysines, which serve as attachment sites for carbohydrate units (van der Slot et al. 2003). Putting all 218 evQTL genes identified with both expression data sets together, we found that these genes tend to encode extracellular and transmembrane proteins and are more likely to be involved in cell adhesion and immunity (Table S4).

Intriguingly, evQTL in five human leukocyte antigen (HLA) genes, HLA-A, HLA-DRB5, HLA-DRB1, HLA-DQA2, and HLA-DPA1, were identified (Tables S1 and S2). Three MHC class II genes, HLA-DRB5, HLA-DRB1, and HLA-DQA2, are located in a 200-kb region on chromosome 6p21, where we identified two evQTL SNPs, rs4410767 and rs9271720, associated with two of these three genes, and one evQTL SNP, rs3129766, associated with all three genes (Figure 5). A remarkable characteristic of evQTL in these HLA genes is that expression levels tend to show a bimodal pattern, rather than a continuous pattern, among individuals with the same genotype. For instance, rs4410767 causes bimodal expression of all three genotypes of HLA-DQA2, as well as bimodal expression of the major allele homozygote of HLA-DRB5. Also, when major allele homozygotes of these SNPs (e.g., TT of rs4410767) are associated with a higher variability of a gene (e.g., HLA-DRB5), the minor allele homozygotes of these SNPs (e.g., CC of rs4410767) are associated with a lower variability of the other gene (e.g., HLA-DQA2), or vice versa. These results broadly correspond with expectations from previous findings: (1) the human MHC exhibits extreme genetic diversity and is maintained by balancing selection (Parham and Ohta 1996; Bergstrom et al. 1998), (2) haplotype-specific differences in gene expression are common across the MHC (Vandiedonck et al. 2011), and (3) this region contains genes (e.g., HLA-DPA1) that show large copy number differences across populations (Sudmant et al. 2010). The expression of a gene, e.g., HLA-DQA2, that falls onto two dichotomous parts could be

due to the effect of an additional unidentified SNP, which increases (or decreases) the mean expression level of the gene by the presence (or absence) of an allele.

Figure 5 Distribution of evQTL SNPs in a 200-kb HLA region on chromosome 6p21 and their effects on the expression variability of corresponding genes. The upward red triangles indicate the positions of three evQTL SNPs. The associations between SNPs' genotypes and expression variability of HLA-DRB5, HLA-DRB1, and HLA-DQA2 are plotted. Genes under the influence of the same SNP are grouped in the boxes. These genes encode cell-surface proteins that belong to the MHC class II and are essential elements of the immune system. MHC regions are among the most highly polymorphic coding regions in the human genome (Bergstrom et al. 1998).

We found that 84% of the 218 evQTL genes are located in CNV-containing regions (Methods). This enrichment is highly significant, as CNV-containing regions occupy only 36% of the whole genome (P = 5 × 10^-17, Fisher's exact test), suggesting that evQTL are more likely to be located within genomic regions that contain CNVs. In addition, 22% (47 of 218) of evQTL genes are located in genomic regions previously reported to be under positive selection (Methods, and see Table S5 for the list of references). This ratio is not significantly different from random expectation, given that the previously reported regions under positive selection collectively occupy 26% of the whole genome (P > 0.1, χ² test).

Genome-wide distribution of trans-acting evQTL

To identify trans-acting evQTL, we selected 500 representative genes and >13,000 SNPs across all autosomes and conducted exhaustive pair-by-pair testing of all possible interactions between these genes and SNPs (Methods). We identified a total of 356 gene-SNP pairs that had a trans-acting evQTL relationship (Figure 6). These trans-evQTL involved 35 distinct genes. Most of these genes (80%) had >1 associated SNP. These include ANO3, NRCAM, PDGFD, and ASPA with 114, 43, 39, and 27 trans-evQTL SNPs, respectively (Table S6). There were 30 SNPs involved in trans-evQTL relationships with two different genes. These results suggest the existence of “evQTL hotspots,” defined as consisting of multiple evQTL SNPs linked to a common gene or one evQTL SNP linked to multiple genes. This is further confirmed by a permutation test in which we randomly shuffled the genotypes and repeated the trans-evQTL detection analysis. With the shuffled data set, we detected only 33 trans-evQTL (false positives) involving 24 genes, among which only 6 genes have >1 trans-evQTL SNP. The randomly shuffled data set produced no apparent evQTL hotspots. Finally, we conducted the same analysis with the GSE11582 data set independently and obtained a comparable number (298) of trans-evQTL, with a distribution that also indicated the presence of evQTL hotspots (Figure S14).

Discussion

We carried out a genome-wide analysis to identify genomic loci associated with gene expression variability. Our analysis is conservative in the sense of not only the stringent criteria we adopted for assessing statistical significance, but also the experimental conditions we applied to maximize the use of information. For example, we pooled the expression data from different HapMap populations and used the full set of expression data in estimating the variance of gene expression. By pooling the data, we increased the sample size. At the same time, we analyzed only SNPs that were both polymorphic and common in each of the populations and therefore lost the chance to detect population-specific evQTL. Thus we may underestimate the number of evQTL because of unreported evQTL relationships present in one population but not the others. Nevertheless, using two expression data sets, we detected a total of 218 distinct loci that contain cis-acting evQTL SNP(s) and cross-validated eight of these loci. We demonstrated that, in addition to cis-acting evQTL, there are many more trans-acting evQTL in the human genome.

Figure 6 Global view of trans-acting evQTL. The x-axis shows the absolute genomic position of SNPs and the y-axis shows the absolute genomic position of the 500 genes under consideration. Red dots indicate that the corresponding SNP-gene pairs are trans-acting evQTL. Larger dots represent evQTL in which the SNPs are located within genomic regions <10 Mb from the transcription start sites of the corresponding genes. A total of 356 trans-evQTL SNPs are distributed in genes (ordered by their genomic position along the y-axis): ALPL (4), LCE1E (3), FCRL2 (1), CDK18 (3), CCL20 (3), EOMES (3), WNT5A (2), PRICKLE2 (8), GAP43 (6), UTS2D (1), KIT (3), FAT1 (14), CXCL14 (10), TGFBI (1), NRCAM (43), COL22A1 (2), ZNF462 (2), ANXA8L2 (3), HHEX (1), ANO3 (114), PDGFD (39), PAH (1), RGS6 (4), SERPINA1 (6), LARP6 (8), NRG4 (2), MT1G (2), ASPA (27), CCL1 (14), CDH2 (1), DSC2 (7), DTNA (8), CLEC4G (6), NLRP4 (1), and C21orf56 (3). The GSE6536 expression data set is used for this analysis.
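As a quick consistency check, the per-gene counts listed in the Figure 6 caption can be tallied against the summary figures quoted in the Results (356 gene-SNP pairs, 35 distinct genes, 80% of genes with more than one associated SNP, and ANO3, NRCAM, PDGFD, and ASPA as the largest hotspots). The sketch below simply transcribes the counts from the caption; the names FIG6_COUNTS and summarize are illustrative and not part of the original analysis, and the counts are interpreted as gene-SNP pairs per gene, which is an assumption.

```python
# Per-gene trans-evQTL counts transcribed from the Figure 6 caption.
FIG6_COUNTS = {
    "ALPL": 4, "LCE1E": 3, "FCRL2": 1, "CDK18": 3, "CCL20": 3, "EOMES": 3,
    "WNT5A": 2, "PRICKLE2": 8, "GAP43": 6, "UTS2D": 1, "KIT": 3, "FAT1": 14,
    "CXCL14": 10, "TGFBI": 1, "NRCAM": 43, "COL22A1": 2, "ZNF462": 2,
    "ANXA8L2": 3, "HHEX": 1, "ANO3": 114, "PDGFD": 39, "PAH": 1, "RGS6": 4,
    "SERPINA1": 6, "LARP6": 8, "NRG4": 2, "MT1G": 2, "ASPA": 27, "CCL1": 14,
    "CDH2": 1, "DSC2": 7, "DTNA": 8, "CLEC4G": 6, "NLRP4": 1, "C21orf56": 3,
}


def summarize(counts):
    """Summarize hotspot structure from a gene -> pair-count mapping."""
    n_genes = len(counts)
    multi = sum(1 for c in counts.values() if c > 1)
    return {
        "total_pairs": sum(counts.values()),
        "n_genes": n_genes,
        "genes_with_multiple_snps": multi,
        "fraction_multiple": multi / n_genes,
        # Genes ranked by number of associated SNPs, largest first.
        "top4": sorted(counts, key=counts.get, reverse=True)[:4],
    }


if __name__ == "__main__":
    for key, value in summarize(FIG6_COUNTS).items():
        print(f"{key}: {value}")
```

The tally gives 356 pairs across 35 genes, with 28 of the 35 genes (80%) carrying more than one associated SNP, in agreement with the numbers stated in the text.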

Our findings provide orthogonal information that is not available in the existing eQTL literature and may lead to the identification of important genetic regulatory mechanisms of gene expression, which may have downstream phenotypic effects.

Formation of evQTL

Throughout the article, we have assumed that gene expression variability is measurable. However, it is worth noting that gene expression variability was measured via a “snapshot” of the difference in the means of gene expression between multiple individuals at a single time point (although two expression data sets were used, they were used independently). This quantity is related to but differs from the stochastic noise of gene expression, that is, the random fluctuation of gene expression within replicated measures. To measure the stochastic noise, we would have to measure gene expression in each of the individuals multiple times. The gene expression variability we measured is the sum of the within-individual variability (i.e., stochastic noise of gene expression within the lines) and the between-individual variability, not either of them alone. Thus, we measured expression variability not as a simple phenotypic trait but as a composite trait. Consequently, we put forward two distinct scenarios to explain the formation of evQTL.

First, we emphasize the influence of a single SNP (i.e., an evQTL SNP) on within-individual variability of gene expression. Imagine that the introduction of an “a” allele of a given SNP lowers the sensitivity of gene regulation in response to environmental changes or decreases the rate of degradation of mRNA of the gene, while introduction of an “A” allele raises the sensitivity or increases the degradation rate. Homozygous “aa” individuals are expected to show a lower degree of variability in the expression level of the corresponding genes because the “a” allele cushions (or canalizes) the impact of a changing environment or maintains intact mRNA for a longer time. When the “a” allele is replaced with “A,” the cushioning or stability effects are removed. Without these cushioning effects, the gene expression phenotype becomes more sensitive to environmental influences, which causes a greater variance among individuals. There is a need for approaches that identify biologically variable loci in nonreplicable systems. Mechanistically, a mutation (e.g., the allele “a” or “A”) may alter the regulation of gene expression at any stage of the coupled processes of transcription, cotranscription, and post-transcription (Pandit et al. 2008; Majewski and Pastinen 2011; Chalancon et al. 2012). A vast array of biophysical parameters governing these processes may also be involved, including the efficiency of initiation, the speed of transcription, the frequency of transcriptional bursting, mRNA stability, the presence of antisense RNA-mediated degradation, nonsense-mediated decay, the status of the regulatory network, and so on (Raser and O’Shea 2005; Newman et al. 2006; Volfson et al. 2006; Raj and van Oudenaarden 2008; Li et al. 2010b; Dahan et al. 2011; Majewski and Pastinen 2011; Xu et al. 2011b; Chalancon et al. 2012; Schoenberg and Maquat 2012). Certainly, there are also many nongenetic factors introduced in the path from the human donor to the study of gene expression of an LCL in vitro (Choy et al. 2008). These nongenetic factors may include the random choice of which subpopulations of B cells are selected in the process of immortalization, the amount of and individual response to the EBV virus, the

history of passage in cell culture and culture conditions, the laboratory protocols and reagents with which assays are performed, and so on. All these make it difficult to study the mechanism of genetic control of expression variability. A well-controlled experimental system is therefore highly desirable for future work. We believe that new technologies are required to better understand the molecular mechanisms behind evQTL control of gene expression variability. New experimental methods, such as single-molecule mRNA decay measurements used for revealing the mechanism of promoter-regulated mRNA stability (Trcek et al. 2011) and single-cell genomics technologies used for real-time investigation of transcription levels and decay (Chalancon et al. 2012), could prove to be important for future expression variability studies. Further application of these powerful technologies holds promise for determining the actual mechanisms behind expression variance and variability.

Second, we attribute the formation of evQTL to the interaction between two or more genotypes (SNP–SNP interaction or epistasis) controlling a single expression phenotype. While the effect of a single SNP may explain the formation of evQTL, it seems more plausible to consider the effects of multiple SNPs. Under this scenario, an evQTL is created through the effect of multiple SNPs that interact with one another to influence the expression of a gene. This notion is supported by the observation that genetic interactions can create complex phenotypic patterns that could also explain the variability of complex traits (Ronnegard and Valdar 2011). What we have mapped as evQTL may be a complex gene expression pattern that is created through the interaction between expression-controlling variants. It is well known that there are many molecular mechanisms that allow different genotypes to influence gene expression. These include allele-specific expression through the modulation of activity of cis-regulatory elements, the regulation of transcript isoform levels through disruption of the splicing machinery, and the modification of chromatin accessibility and transcription factor binding (Wang et al. 2008; Pickrell et al. 2010; Lalonde et al. 2011; Degner et al. 2012). However, the mechanisms underlying the interaction between these expression-controlling genotypes are poorly understood. Nevertheless, compared to the single-SNP explanation for evQTL, the idea of interacting SNPs resulting in evQTL seems to be more widely accepted. In fact, several methods for detecting the interaction between SNPs (e.g., Pare et al. 2010; Struchalin et al. 2010, 2012; Daye et al. 2012) are based on testing for heteroscedasticity of traits. A wider application of these methods (including the dglm method in this study) in genetic association studies will supply more knowledge about the interaction between genetic variants.

Overall, we relied on statistical association to detect evQTL effects on the variance of gene expression. Different evQTL may be created through very different mechanisms. Some evQTL can correspond to loci controlling canalization/decanalization of the gene expression level (decrease or increase of the “environmental variance” of gene expression); these evQTL thus are true variance QTL. However, evQTL can also represent loci that are involved in the interaction with other, unidentified loci. The effect of the evQTL then depends on the genetic background; the true effect could be dissected in larger populations (or by comparing twins with the same genotype), but if all genetic backgrounds are pooled in a single analysis (as we did), the locus seems to induce more variable phenotypes.

A potential pitfall of using the dglm model to detect evQTL needs to be clarified. The dglm model defined in Equation 2 does not model the contribution of additional causal variants to mean expression levels per se. Because of this, the resulting lack of fit of the mean model (relative to the full model) could be interpreted as the existence of significant heteroscedasticity. Using HLA-DQA2 as an example, the bimodal expression of the gene (Figure 5) could be explained by an additional causal SNP controlling mean expression levels under a dominant or recessive model, without invoking the evQTL effect. The dglm model used in this study does not model the additional causal SNP. When the current model is applied for detecting genetic interactions, it should only be used as the first step in screening to prioritize SNPs. This is similar to the use of Levene's test as the first step of the “variance prioritization” method (Pare et al. 2010). Given the flexibility of dglm, the current model (Equation 2) can be extended to incorporate the effect of a single additional causal SNP and the additive (not interaction) effect of multiple SNPs on mean expression levels. Therefore, detecting evQTL and using extended dglm models represent a novel method for effectively screening for genetic interactions, especially when the multiple-SNP influence on expression variability is implied.

Roles of CNV and natural selection

CNVs form an abundant class of genetic variation that is presumed to impact gene expression. Schlattl et al. (2011) identified CNVs associated with the expression of 110 genes in CEU and YRI populations. These included a CNV downstream of HLA-DQB1, which is associated with HLA-DQB1 gene expression. In most of these CNV-associated eQTL, CNVs intersecting genes typically affect genes in the expected direction, i.e., involving a positive correlation between gene copy number and expression level. Thus, CNVs may be implicated in the creation of eQTL via a dosage mechanism. However, this assumption cannot be used to explain the negative correlations between gene copy number and expression level. Schlattl et al. (2011) found that 20% of associations of CNVs displayed the unexpected negative correlations, including several instances that had been previously observed (Stranger et al. 2007a; Henrichsen et al. 2009).

The results presented in this article suggest that most cis-acting evQTL are located in CNV-containing regions of the human genome. This raises the possibility that evQTL

account for the mixture of positive and negative correlations between gene copy number and expression level. In addition, although our results are only correlative, the enrichment of evQTL in CNV-containing regions may deserve a new theory. It has been suggested that the control of gene copy number represents a way to lower the intrinsic noise in gene expression (Raser and O’Shea 2005). Evolutionary analysis showed that, in both yeasts and mammals, the expression of duplicate genes tends to decrease after duplication to rebalance gene dosage (Qian et al. 2010). We argue that noise control by increased copy number provides another possible explanation for the co-occurrence of evQTL and CNVs. In future studies, detailed analyses of fine-resolution CNV data are needed to clarify the basis of these complex correlations. SNPs that tag nearby CNVs could plausibly be used as markers for identifying CNV-associated evQTL. CNVs contribute to the susceptibility of various complex neurological disorders in humans. The link between CNVs and human mental disorders may be through the altered expression variability of related loci. Indeed, the expression variability of genes in cells from patients suffering from schizophrenia and Parkinson disease has been found to be different from that in cells from normal controls (Mar et al. 2011). Thus understanding the relationship between CNVs and evQTL may shed light on the study of these complex disorders.

It is expected that control of gene expression variability is under evolutionary pressure. Expression variability allows simultaneous achievement of multiple steady-state phenotypes in a population (Raser and O’Shea 2005), which may play an important role in differentiation in multicellular organisms, such as that in the human immune system. For example, in T cells, gene expression variability contributes to the useful diversification of biological functions within a clonal population and interferes with accurate antigen discrimination. The genetic components that coregulate the expression levels of multiple genes to achieve phenotypic variability in a controlled manner are obviously under selection (Feinerman et al. 2008). Several evQTL were identified at the human MHC region, which is characterized by a complex genomic landscape maintained by balancing selection (Parham and Ohta 1996; Vandiedonck et al. 2011). Our results underscore the role of natural selection, especially balancing selection, in maintaining expression variability in HLA genes.

In unicellular yeast, phenotypic variability allows some individuals in a population to be in an “anticipatory” state for a sudden environmental change (Acar et al. 2008; Zhang et al. 2009). Increased expression variability is therefore beneficial in changing environmental conditions, because regulatory responses to all possible environmental changes are lacking; higher noise thus allows for faster adaptation to changing environments. Yet, there is no evidence that this notion can be applied to humans. Our results do not suggest an under- or overrepresentation of cis-acting evQTL in genomic regions under positive selection. Further studies could be directed to develop new methods for detecting positive selection in evQTL and to estimate how many evQTL are under positive selection. For eQTL, such a method has already been developed (Fraser et al. 2011), which may prove to be adaptable for evQTL.

Our genome-wide screening for evQTL is an initial step toward better understanding the genetic factors controlling gene expression variability in humans. Based on our results, we hypothesize that there are many unrecognized and broadly distributed genetic variants that play a role in controlling expression variability or are involved in the formation of evQTL through interacting with other variants. The influence of these genetic variants is further modulated by other genomic and cellular features, such as structural variations and transcriptional regulatory networks. These empirical findings underscore the genetic basis underlying phenotypic variability in biological systems and the roles of variability-determining variants and genetic interactions in population evolution.

Acknowledgments

We wish to thank all anonymous reviewers for their constructive comments. We are grateful to Tomasz Koralewski for help with data processing. We thank Han Liang, Lan Zhu, Loren Skow, Ying Zhang, and Quan Long for valuable suggestions. We acknowledge the Texas A&M Supercomputing Facility (http://sc.tamu.edu/) for providing computing resources. This research was supported in part by a Gray Lady Foundation grant to J.J.C.

Literature Cited

Acar, M., J. T. Mettetal, and A. van Oudenaarden, 2008 Stochastic switching as a survival strategy in fluctuating environments. Nat. Genet. 40: 471–475.
Andus, T., T. Geiger, T. Hirano, H. Northoff, U. Ganter et al., 1987 Recombinant human B cell stimulatory factor 2 (BSF-2/IFN-beta 2) regulates beta-fibrinogen and albumin mRNA levels in Fao-9 cells. FEBS Lett. 221: 18–22.
Ansel, J., H. Bottin, C. Rodriguez-Beltran, C. Damon, M. Nagarajan et al., 2008 Cell-to-cell stochastic variation in gene expression is a complex genetic trait. PLoS Genet. 4: e1000049.
Barrett, J. C., B. Fry, J. Maller, and M. J. Daly, 2005 Haploview: analysis and visualization of LD and haplotype maps. Bioinformatics 21: 263–265.
Benjamini, Y., and Y. Hochberg, 1995 Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Stat. Soc., B 57: 289–300.
Bergstrom, T. F., A. Josefsson, H. A. Erlich, and U. Gyllensten, 1998 Recent origin of HLA-DRB1 alleles and implications for human evolution. Nat. Genet. 18: 237–242.
Bickel, P. J., 1978 Using residuals robustly I: tests for heteroscedasticity, nonlinearity. Ann. Stat. 6: 266–291.
Blake, W. J., G. Balazsi, M. A. Kohanski, F. J. Isaacs, K. F. Murphy et al., 2006 Phenotypic consequences of promoter-mediated transcriptional noise. Mol. Cell 24: 853–865.
Cai, J. J., 2008 PGEToolbox: a Matlab toolbox for population genetics and evolution. J. Hered. 99: 438–440.
Chalancon, G., C. N. Ravarani, S. Balaji, A. Martinez-Arias, L. Aravind et al., 2012 Interplay between gene expression noise and regulatory network architecture. Trends Genet. 28: 221–232.

Cheung, V. G., and R. S. Spielman, 2009 Genetics of human gene expression: mapping DNA variants that influence gene expression. Nat. Rev. Genet. 10: 595–604.
Choy, E., R. Yelensky, S. Bonakdar, R. M. Plenge, R. Saxena et al., 2008 Genetic analysis of human traits in vitro: drug response and gene expression in lymphoblastoid cell lines. PLoS Genet. 4: e1000287.
Dahan, O., H. Gingold, and Y. Pilpel, 2011 Regulatory mechanisms and networks couple the different phases of gene expression. Trends Genet. 27: 316–322.
Daye, Z. J., J. Chen, and H. Li, 2012 High-dimensional heteroscedastic regression with an application to eQTL data analysis. Biometrics 68: 316–326.
Degner, J. F., A. A. Pai, R. Pique-Regi, J. B. Veyrieras, D. J. Gaffney et al., 2012 DNase I sensitivity QTLs are a major determinant of human expression variation. Nature 482: 390–394.
Feinerman, O., J. Veiga, J. R. Dorfman, R. N. Germain, and G. Altan-Bonnet, 2008 Variability and robustness in T cell activation from regulated heterogeneity in protein levels. Science 321: 1081–1084.
Fligner, M. A., and T. J. Killeen, 1976 Distribution-free 2-sample tests for scale. J. Am. Stat. Assoc. 71: 210–213.
Fraser, H. B., and E. E. Schadt, 2010 The quantitative genetics of phenotypic robustness. PLoS ONE 5: e8635.
Fraser, H. B., T. Babak, J. Tsang, Y. Zhou, B. Zhang et al., 2011 Systematic detection of polygenic cis-regulatory evolution. PLoS Genet. 7: e1002023.
Gabriel, S. B., S. F. Schaffner, H. Nguyen, J. M. Moore, J. Roy et al., 2002 The structure of haplotype blocks in the human genome. Science 296: 2225–2229.
Hallin, M., and D. Paindaveine, 2008 Optimal rank-based tests for homogeneity of scatter. Ann. Stat. 36: 1261–1298.
Hansen, K. D., Z. Wu, R. A. Irizarry, and J. T. Leek, 2011 Sequencing technology does not eliminate biological variability. Nat. Biotechnol. 29: 572–573.
Henrichsen, C. N., E. Chaignat, and A. Reymond, 2009 Copy number variants, diseases and gene expression. Hum. Mol. Genet. 18: R1–R8.
Hill, W. G., and H. A. Mulder, 2010 Genetic analysis of environmental variation. Genet. Res. (Camb). 92: 381–395.
Hirano, T., K. Yasukawa, H. Harada, T. Taga, Y. Watanabe et al., 1986 Complementary DNA for a novel human interleukin (BSF-2) that induces B lymphocytes to produce immunoglobulin. Nature 324: 73–76.
Ho, J. W., M. Stefani, C. G. dos Remedios, and M. A. Charleston, 2008 Differential variability analysis of gene expression and its application to human diseases. Bioinformatics 24: i390–i398.
Huang da, W., B. T. Sherman, and R. A. Lempicki, 2009 Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat. Protoc. 4: 44–57.
Iafrate, A. J., L. Feuk, M. N. Rivera, M. L. Listewnik, P. K. Donahoe et al., 2004 Detection of large-scale variation in the human genome. Nat. Genet. 36: 949–951.
International HapMap Consortium, 2005 A haplotype map of the human genome. Nature 437: 1299–1320.
International HapMap Consortium, 2007 A second generation human haplotype map of over 3.1 million SNPs. Nature 449: 851–861.
Jimenez-Gomez, J. M., J. A. Corwin, B. Joseph, J. N. Maloof, and D. J. Kliebenstein, 2011 Genomic analysis of QTLs and genes altering natural variation in stochastic noise. PLoS Genet. 7: e1002295.
Johansson, A. C., and L. Feuk, 2011 Characterization of copy number-stable regions in the human genome. Hum. Mutat. 32: 947–955.
Lalonde, E., K. C. Ha, Z. Wang, A. Bemmo, C. L. Kleinman et al., 2011 RNA sequencing reveals the role of splicing polymorphisms in regulating human gene expression. Genome Res. 21: 545–554.
Lee, Y., and J. A. Nelder, 2006 Double hierarchical generalized linear models. J. R. Stat. Soc. Ser. C Appl. Stat. 55: 139–167.
Li, J., Y. Liu, T. Kim, R. Min, and Z. Zhang, 2010a Gene expression variability within and between human populations and implications toward disease susceptibility. PLoS Comput. Biol. 6: e1000910.
Li, J., R. Min, F. J. Vizeacoumar, K. Jin, X. Xin et al., 2010b Exploiting the determinants of stochastic gene expression in Saccharomyces cerevisiae for genome-wide prediction of expression noise. Proc. Natl. Acad. Sci. USA 107: 10472–10477.
Maheshri, N., and E. K. O’Shea, 2007 Living with noisy genes: how cells function reliably with inherent variability in gene expression. Annu. Rev. Biophys. Biomol. Struct. 36: 413–434.
Majewski, J., and T. Pastinen, 2011 The study of eQTL variations by RNA-seq: from SNPs to phenotypes. Trends Genet. 27: 72–79.
Mar, J. C., N. A. Matigian, A. Mackay-Sim, G. D. Mellick, C. M. Sue et al., 2011 Variance of gene expression identifies altered network constraints in neurological disease. PLoS Genet. 7: e1002207.
Montgomery, S. B., and E. T. Dermitzakis, 2011 From expression QTLs to personalized transcriptomics. Nat. Rev. Genet. 12: 277–282.
Montgomery, S. B., M. Sammeth, M. Gutierrez-Arcelus, R. P. Lach, C. Ingle et al., 2010 Transcriptome genetics using second generation sequencing in a Caucasian population. Nature 464: 773–777.
Mortazavi, A., B. A. Williams, K. McCue, L. Schaeffer, and B. Wold, 2008 Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat. Methods 5: 621–628.
Newman, J. R., S. Ghaemmaghami, J. Ihmels, D. K. Breslow, M. Noble et al., 2006 Single-cell proteomic analysis of S. cerevisiae reveals the architecture of biological noise. Nature 441: 840–846.
Pandit, S., D. Wang, and X. D. Fu, 2008 Functional integration of transcriptional and RNA processing machineries. Curr. Opin. Cell Biol. 20: 260–265.
Pare, G., N. R. Cook, P. M. Ridker, and D. I. Chasman, 2010 On the use of variance per genotype as a tool to identify quantitative trait interaction effects: a report from the Women’s Genome Health Study. PLoS Genet. 6: e1000981.
Parham, P., and T. Ohta, 1996 Population biology of antigen presentation by MHC class I molecules. Science 272: 67–74.
Pickrell, J. K., J. C. Marioni, A. A. Pai, J. F. Degner, B. E. Engelhardt et al., 2010 Understanding mechanisms underlying human gene expression variation with RNA sequencing. Nature 464: 768–772.
Qian, W., B. Y. Liao, A. Y. Chang, and J. Zhang, 2010 Maintenance of duplicate genes and their functional redundancy by reduced expression. Trends Genet. 26: 425–430.
Raj, A., and A. van Oudenaarden, 2008 Nature, nurture, or chance: stochastic gene expression and its consequences. Cell 135: 216–226.
Raser, J. M., and E. K. O’Shea, 2005 Noise in gene expression: origins, consequences, and control. Science 309: 2010–2013.
Ronnegard, L., and W. Valdar, 2011 Detecting major genetic loci controlling phenotypic variability in experimental crosses. Genetics 188: 435–447.
Schaid, D. J., A. J. Batzler, G. D. Jenkins, and M. A. Hildebrandt, 2006 Exact tests of Hardy-Weinberg equilibrium and homogeneity of disequilibrium across strata. Am. J. Hum. Genet. 79: 1071–1080.
Schlattl, A., S. Anders, S. M. Waszak, W. Huber, and J. O. Korbel, 2011 Relating CNVs to transcriptome data at fine resolution: assessment of the effect of variant size, type, and overlap with functional regions. Genome Res. 21: 2004–2013.

Schoenberg, D. R., and L. E. Maquat, 2012 Regulation of cytoplasmic mRNA decay. Nat. Rev. Genet. 13: 246–259.
Storey, J. D., J. Madeoy, J. L. Strout, M. Wurfel, J. Ronald et al., 2007 Gene-expression variation within and among human populations. Am. J. Hum. Genet. 80: 502–509.
Stranger, B. E., M. S. Forrest, A. G. Clark, M. J. Minichiello, S. Deutsch et al., 2005 Genome-wide associations of gene expression variation in humans. PLoS Genet. 1: e78.
Stranger, B. E., M. S. Forrest, M. Dunning, C. E. Ingle, C. Beazley et al., 2007a Relative impact of nucleotide and copy number variation on gene expression phenotypes. Science 315: 848–853.
Stranger, B. E., A. C. Nica, M. S. Forrest, A. Dimas, C. P. Bird et al., 2007b Population genomics of human gene expression. Nat. Genet. 39: 1217–1224.
Struchalin, M. V., A. Dehghan, J. C. Witteman, C. van Duijn, and Y. S. Aulchenko, 2010 Variance heterogeneity analysis for detection of potentially interacting genetic loci: method and its limitations. BMC Genet. 11: 92.
Struchalin, M. V., N. Amin, P. H. Eilers, C. M. van Duijn, and Y. S. Aulchenko, 2012 An R package “VariABEL” for genome-wide searching of potentially interacting loci by testing genotypic variance heterogeneity. BMC Genet. 13: 4.
Sudmant, P. H., J. O. Kitzman, F. Antonacci, C. Alkan, M. Malig et al., 2010 Diversity of human copy number variation and multicopy genes. Science 330: 641–646.
The 1000 Genomes Project Consortium, 2010 A map of human genome variation from population-scale sequencing. Nature 467: 1061–1073.
Trcek, T., D. R. Larson, A. Moldon, C. C. Query, and R. H. Singer, 2011 Single-molecule mRNA decay measurements reveal promoter-regulated mRNA stability in yeast. Cell 147: 1484–1497.
van der Slot, A. J., A. M. Zuurmond, A. F. Bardoel, C. Wijmenga, H. E. Pruijs et al., 2003 Identification of PLOD2 as telopeptide lysyl hydroxylase, an important enzyme in fibrosis. J. Biol. Chem. 278: 40967–40972.
Vandiedonck, C., M. S. Taylor, H. E. Lockstone, K. Plant, J. M. Taylor et al., 2011 Pervasive haplotypic variation in the spliceo-transcriptome of the human major histocompatibility complex. Genome Res. 21: 1042–1054.
Verbyla, A. P., and G. K. Smyth, 1998 Double generalized linear models: approximate residual maximum likelihood and diagnostics, pp. 1–15 in Research Report. Department of Statistics, University of Adelaide, Adelaide, Australia.
Veyrieras, J. B., S. Kudaravalli, S. Y. Kim, E. T. Dermitzakis, Y. Gilad et al., 2008 High-resolution mapping of expression-QTLs yields insight into human gene regulation. PLoS Genet. 4: e1000214.
Visscher, P. M., and D. Posthuma, 2010 Statistical power to detect genetic loci affecting environmental sensitivity. Behav. Genet. 40: 728–733.
Volfson, D., J. Marciniak, W. J. Blake, N. Ostroff, L. S. Tsimring et al., 2006 Origins of extrinsic variability in eukaryotic gene expression. Nature 439: 861–864.
Wang, E. T., R. Sandberg, S. Luo, I. Khrebtukova, L. Zhang et al., 2008 Alternative isoform regulation in human tissue transcriptomes. Nature 456: 470–476.
Xu, G., N. Deng, Z. Zhao, T. Judeh, E. Flemington et al., 2011a SAMMate: a GUI tool for processing short read alignments in SAM/BAM format. Source Code Biol. Med. 6: 2.
Xu, Z., W. Wei, J. Gagneur, S. Clauder-Munster, M. Smolik et al., 2011b Antisense expression increases gene expression variability and locus interdependency. Mol. Syst. Biol. 7: 468.
Yang, J., R. J. Loos, J. E. Powell, S. E. Medland, E. K. Speliotes et al., 2012 FTO genotype is associated with phenotypic variability of body mass index. Nature 490: 267–272.
Zhang, Z., W. Qian, and J. Zhang, 2009 Positive selection for elevated gene expression noise in yeast. Mol. Syst. Biol. 5: 299.

Communicating editor: E. Stone

GENETICS

Supporting Information http://www.genetics.org/lookup/suppl/doi:10.1534/genetics.112.146779/-/DC1

Genetic Variants Contribute to Gene Expression Variability in Humans

Amanda M. Hulse and James J. Cai

Copyright © 2013 by the Genetics Society of America
DOI: 10.1534/genetics.112.146779

Contents

Supporting Figures
Fig. S1. Distribution of expression variability of genes.
Fig. S2. Contrasting difference in CVs between two representative genes FXN and XCL1.
Fig. S3. Comparison of gene expression variability estimated by microarray and RNA-seq.
Fig. S4. Effect of covariate of populations encoded differently on the significance of overall Pdispersion.
Fig. S5. Example of an evQTL with Pdispersion < 1 × 10^-8 but Ppermutation > 0.001.
Fig. S6. Hypergeometric test of two independent samples.
Fig. S7. Effect of evQTL genotype on gene expression variability for significant genes.
Fig. S8. Comparison of the effect of evQTL genotypes on the expression variability of corresponding genes.
Fig. S9. Histogram of Pdispersion values in GSE11582 of the top evQTL SNP-gene associations in GSE6536.
Fig. S10. Spatial distribution of cis-evQTL SNPs around the corresponding genes whose expression variability is associated with these SNPs.
Fig. S11. Spatial distribution of cis-evQTL SNPs around the corresponding genes whose expression variability is associated with these SNPs.
Fig. S12. Comparisons of frequency distributions of R2 and D’ between evQTL SNPs and randomly selected SNPs.
Fig. S13. Sub-sampling of evQTL results in the false discovery of eQTL.
Fig. S14. Global view of trans-acting evQTL (GSE11582).

Supporting Tables
Table S1. The cis-acting evQTL genes identified using the expression data set of GSE6536.
Table S2. The cis-acting evQTL genes identified using the expression data set of GSE11582.
Table S3. Analysis of directionality of the effect of evQTL SNPs on expression variability.
Table S4. Functional enrichment of the cis-acting evQTL genes.
Table S5. evQTL genes located in the genomic regions under positive selection.
Table S6. Identified trans-evQTL genes (GSE6536).

References

Supporting Figures

Figure S1 Distribution of expression variability of genes. Gene expression data are from the studies of Stranger et al. (2007) (left panel) and Choy et al. (2008) (right panel). Histograms of the coefficient of variation (CV) of the normalized expression levels of genes are presented. The x-axis is the log10 value of CV and the y-axis is the probability density. Black arrows indicate the CV values for the 8 evQTL genes in Table 1.

[Figure S2 panels: FXN (CV = 5.7) and XCL1 (CV = 23.7); x-axis: Sample Index; y-axis: Normalized Expression Level]

Figure S2 Contrasting difference in CVs between two representative genes FXN and XCL1. Gene expression data used to estimate the CVs were obtained from the study of Choy et al. (2008)(GEO accession GSE11582). The red line indicates the mean expression level of the gene. The x-axis is a random sample index for the 270 samples. The y-axis is the expression level for each individual.


Figure S3 Comparison of gene expression variability estimated by microarray and RNA-seq. The CV estimated using microarray data (x-axis, array intensities) is plotted against that estimated using RNA-seq data (y-axis, RPKM). The microarray data were obtained from the study of Stranger et al. (2007) (GEO accession GSE6536). SAM files of the RNA sequencing (RNA-seq) data for 60 CEU HapMap individuals from the study of Montgomery et al. (2010) were downloaded from http://jungle.unige.ch/rnaseq_CEU60. We quantified transcript levels in reads per kilobase of transcript per million mapped reads (RPKM) (Mortazavi et al. 2008), using SAMMate (Xu et al. 2011). The Spearman correlation coefficient across the sampled genes is 0.703 (P = 0).
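The two quantities compared in this figure can be illustrated with a minimal Python sketch. It assumes the standard RPKM definition (reads × 10^9 / (transcript length in bp × total mapped reads)) and a CV computed as the sample standard deviation divided by the mean. The function names and the example read counts are hypothetical, and whether the CV is reported as a ratio or a percentage in the figures is not stated here.

```python
import statistics


def rpkm(read_count, transcript_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    return read_count * 1e9 / (transcript_length_bp * total_mapped_reads)


def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean (a ratio, not a percentage)."""
    return statistics.stdev(values) / statistics.mean(values)


# Hypothetical gene: a 2,000-bp transcript measured in three individuals.
read_counts = [500, 800, 650]
library_sizes = [10_000_000, 12_000_000, 11_000_000]

expression = [rpkm(c, 2000, n) for c, n in zip(read_counts, library_sizes)]
cv = coefficient_of_variation(expression)
print(expression, cv)
```

Applying `coefficient_of_variation` per gene to both the array intensities and the RPKM values would give the two axes of a plot like this one.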

[Figure S4 scatter plot; x-axis: -log10(Pdispersion), YRI-Non YRI; y-axis: -log10(Pdispersion), YRI-CEU-ASN]

Figure S4 Effect of the population covariate, encoded in two different ways, on the significance of overall Pdispersion. On the x-axis, the populations were encoded as 0 for African (YRI) individuals and 1 for Eurasian (non-YRI) individuals. On the y-axis, the populations were encoded as 0 for African (YRI) individuals, 1 for individuals of European origin in Utah (CEU), and 2 for Asian individuals (ASN, including Chinese from Beijing (CHB) and Japanese from Tokyo (JPT)). Spearman's ρ = 0.980, P = 0 (n = 15,399).


Figure S5 Example of an evQTL with Pdispersion < 1 × 10^-8 but Ppermutation > 0.001. Example of a locus at which the significance tests are influenced by outliers.

[Figure S6 diagram labels: 8; N = 9,511]

Figure S6 Hypergeometric test of two independent samples. Let N be the total number of genes considered, k the number of evQTL genes identified using the GSE6536 data set, n the number of evQTL genes identified using the GSE11582 data set, and x the number of evQTL genes identified using both data sets. The hypergeometric probability p(x; N, n, k) is the probability that an n-trial hypergeometric experiment results in exactly x successes when the population consists of N genes, k of which are identified using one of the data sets. Here, p(≥x; N, n, k) = 4.8 × 10^-5.
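The upper-tail probability p(≥x; N, n, k) defined in this caption can be computed directly from binomial coefficients. The following is a minimal Python sketch, not the authors' code. The function name is hypothetical. In the example call, only N = 9,511 comes from the figure; k = 166 and n = 60 are assumed from the row counts of Tables S1 and S2, and x = 8 from the number of cross-validated genes, so the result need not reproduce the reported value if the authors counted k and n differently.

```python
from math import comb


def hypergeom_upper_tail(x, N, n, k):
    """P(X >= x) for n draws without replacement from N genes, k of them 'successes'."""
    total = comb(N, n)
    upper = min(n, k)
    # math.comb returns 0 when the lower index exceeds the upper one,
    # so impossible overlaps contribute nothing to the sum.
    favorable = sum(comb(k, i) * comb(N - k, n - i) for i in range(max(x, 0), upper + 1))
    return favorable / total


# Illustrative inputs only (see the assumptions stated above).
p_overlap = hypergeom_upper_tail(x=8, N=9511, n=60, k=166)
print(p_overlap)
```

A small tail probability indicates that the overlap between the two evQTL gene sets is larger than expected if the two sets were drawn independently.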


Figure S7 Effect of evQTL genotype on gene expression variability for significant genes. Genotypes 0, 1, and 2 denote the number of copies of the major allele. The expression data shown are from the study of Stranger et al. (2007) (GEO accession GSE6536).


Figure S8 Comparison of the effect of evQTL genotypes on the expression variability of corresponding genes. Six SNPs identified from both the GSE11582 and GSE6536 data sets are shown. The left column contains the results from the GSE11582 data set; the right column contains the results from the GSE6536 data set.


Figure S9 Histogram of Pdispersion values in GSE11582 of the top evQTL SNP-gene associations in GSE6536. Pdispersion values for 241 out of 379 significant SNP-gene evQTL associations detected using the GSE6536 data set were “re-computed” using the GSE11582 data set.


Figure S10 Spatial distribution of cis-evQTL SNPs around the corresponding genes whose expression variability is associated with these SNPs. The GSE6536 expression data set was used. All tested loci are shown; significantly associated evQTL are shown in red, while non-significant loci are shown in black. Red circles indicate positions of SNPs that are significantly (that is, Pdispersion < 1 × 10^-8, Ppermutation < 0.001, and PF-K < 0.01) associated with the expression variability. The cross indicates that Pdispersion < 1.1e-16 (which is reported as P-value = 0 in R) and its y-axis position is putative. Gene tracks from the UCSC genome browser are shown below. Transcriptional start site of the gene tested for association is marked with a red line. The plot for IL6 is a reproduction of Fig. 3 in the main text.
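The three significance criteria quoted in this caption can be combined into a single filter. The sketch below is illustrative Python rather than the authors' pipeline: the record type, function names, and example P-values are hypothetical, and only the three thresholds and the 1.1e-16 floor are taken from the caption. The plotting helper shows one way to place a P-value reported as 0 at a putative y-axis position.

```python
import math
from dataclasses import dataclass


@dataclass
class SnpTest:
    snp_id: str           # hypothetical identifier
    p_dispersion: float   # dglm dispersion test
    p_permutation: float  # permutation test
    p_fk: float           # Fligner-Killeen (F-K) test


def is_significant_evqtl(t, dglm_cut=1e-8, perm_cut=1e-3, fk_cut=1e-2):
    """True only when all three tests pass their thresholds."""
    return (t.p_dispersion < dglm_cut
            and t.p_permutation < perm_cut
            and t.p_fk < fk_cut)


def neg_log10_for_plot(p, floor=1.1e-16):
    """-log10(P), with P-values below the floor (including 0) drawn at the floor."""
    return -math.log10(max(p, floor))


passes = SnpTest("snp_a", 1e-9, 0.0, 0.005)
outlier_driven = SnpTest("snp_b", 1e-9, 0.002, 0.005)  # fails the permutation test
print(is_significant_evqtl(passes), is_significant_evqtl(outlier_driven))
```

Requiring the permutation and F-K tests in addition to the dglm test guards against associations driven by a few outlying expression values.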


Figure S11 Spatial distribution of cis-evQTL SNPs around the corresponding genes whose expression variability is associated with these SNPs. The GSE11582 expression data set was used. All tested loci are shown; significantly associated evQTL are shown in red, while non-significant loci are shown in black. Red circles indicate positions of SNPs that are significantly (that is, Pdispersion < 1 × 10^-8, Ppermutation < 0.001, and PF-K < 0.01) associated with the expression variability. The cross indicates that Pdispersion < 1.1e-16 (which is reported as P-value = 0 in R) and its y-axis position is putative. Gene tracks from the UCSC genome browser are shown below. Transcriptional start site of the gene tested for association is marked with a red line.


Figure S12 Comparisons of frequency distributions of R2 and D’ between evQTL SNPs and randomly selected SNPs.

[Figure S13 panels: A, B, C, D, E; x-axis genotypes: aa, Aa, AA; y-axis: Gene Expression]

Figure S13 Sub-sampling of evQTL results in the false discovery of eQTL. The graph on the left depicts how an eQTL (black dots and line) is created by sub-sampling of a true evQTL (gray dots and line). (A) An evQTL created by simulation of gene expression in 10,000 individuals (Methods). Random samples of 210 individuals were taken from the population. (B, C, D, and E) are four independent realizations of sub-sampling data, in which eQTL is detected (P < 0.01).
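The sub-sampling effect described in this caption can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' simulation: the allele frequency, the genotype-specific standard deviations, the common mean, and the random seeds are hypothetical, and |t| > 2.6 on the ordinary least-squares slope is used as an approximation to a two-sided P < 0.01. Only the population size (10,000) and the sub-sample size (210) come from the caption.

```python
import math
import random


def simulate_population(n=10_000, maf=0.3, sds=(0.5, 1.0, 2.0), mean=8.0, seed=1):
    """Pure evQTL: the mean is the same for every genotype, the variance is not."""
    rng = random.Random(seed)
    population = []
    for _ in range(n):
        genotype = int(rng.random() < maf) + int(rng.random() < maf)
        population.append((genotype, rng.gauss(mean, sds[genotype])))
    return population


def slope_t_statistic(sample):
    """t-statistic of the least-squares slope of expression on genotype."""
    n = len(sample)
    g_mean = sum(g for g, _ in sample) / n
    y_mean = sum(y for _, y in sample) / n
    sxx = sum((g - g_mean) ** 2 for g, _ in sample)
    sxy = sum((g - g_mean) * (y - y_mean) for g, y in sample)
    slope = sxy / sxx
    intercept = y_mean - slope * g_mean
    rss = sum((y - intercept - slope * g) ** 2 for g, y in sample)
    return slope / math.sqrt(rss / (n - 2) / sxx)


def fraction_flagged(population, n_sub=210, n_draws=200, t_cut=2.6, seed=2):
    """Share of random sub-samples in which a mean effect (eQTL) is called."""
    rng = random.Random(seed)
    hits = sum(abs(slope_t_statistic(rng.sample(population, n_sub))) > t_cut
               for _ in range(n_draws))
    return hits / n_draws


population = simulate_population()
flagged = fraction_flagged(population)
print(flagged)
```

Because the variance differs among genotypes while the mean does not, any sub-sample that is flagged here is a false eQTL call of the kind shown in panels B to E.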

[Figure S14 plot; x-axis: Absolute SNP position (Mb); y-axis: Absolute gene position (Mb)]

Figure S14 Global view of trans-acting evQTL (GSE11582). Gene expression data used were obtained from the study of Choy et al. (2008)(GEO accession GSE11582). The x-axis shows the absolute genomic position of SNPs and the y-axis shows the absolute genomic position of the 500 genes under consideration. Red dots indicate that the corresponding SNP-gene pairs are trans-acting evQTL. A total of 298 trans-evQTL SNPs are distributed in genes (ordered by their genomic position along the y-axis): CDA (1), SNX7 (5), NRAS (1), TRIM45 (1), CADM3 (2), TRIB2 (1), KIF3C (2), CTNNA2 (18), IL18R1 (2), TANK (3), ARL8B (4), LARS2 (1), WNT5A (1), GAP43 (1), ACPP (2), PLOD2 (49), TXK (1), TEC (1), CXCL10 (2), SEPT11 (1), PDCD6 (5), PJA2 (1), HSPA4 (1), PCDHA2 (1), PCDHA3 (1), G3BP1 (6), DAAM2 (1), IL6 (3), ADCY1 (38), TTC26 (2), EN2 (1), MSR1 (1), MAK16 (2), STMN2 (1), TNFRSF11B (50), COL5A1 (1), TUBB2C (5), DNAJC12 (4), ADK (4), FAM35A (1), CHST1 (2), TIMM10 (1), NOP2 (1), TRPC4 (1), GPR18 (1), STXBP6 (1), FOXG1 (1), FERMT2 (18), PGF (2), ISLR (1), PRSS21 (3), CACNG4 (1), KCNJ2 (3), RPS28 (1), HNRNPM (3), UBA52 (1), PSENEN (2), SNRPB (2), DYNLRB1 (4), BIRC7 (2), TMPRSS3 (18), and HSFX1 (1).

Supporting Tables

Table S1 The cis-acting evQTL genes identified using the expression data set of GSE6536.

| # | HGNC gene symbol | Probe ID | Gene name |
|---|---|---|---|
| 1 | ABCA12 | GI_30795236-A, GI_30795237-I | ATP-binding cassette, sub-family A (ABC1), member 12 |
| 2 | ABI1 | GI_4885610-S | abl-interactor 1 |
| 3 | ACCS | GI_14211920-S | 1-aminocyclopropane-1-carboxylate synthase homolog (Arabidopsis)(non-functional) |
| 4 | ACSBG1 | GI_27477104-S | acyl-CoA synthetase bubblegum family member 1 |
| 5 | ADAD2 | GI_31543071-S | adenosine deaminase domain containing 2 |
| 6 | ADAM23 | GI_4501912-S | ADAM metallopeptidase domain 23 |
| 7 | ADAM28 | GI_11496993-I | ADAM metallopeptidase domain 28 |
| 8 | ADCY1 | GI_31083192-S | adenylate cyclase 1 (brain) |
| 9 | AK4 | GI_8051578-S | adenylate kinase 4 |
| 10 | ANKDD1A | GI_38176293-S | ankyrin repeat and death domain containing 1A |
| 11 | ANO3 | GI_13899226-S | anoctamin 3 |
| 12 | ANXA8L2 | GI_4502112-S | annexin A8-like 2 |
| 13 | APOBEC3B | GI_22907024-S | apolipoprotein B mRNA editing enzyme, catalytic polypeptide-like 3B |
| 14 | ARAP3 | GI_21264336-S | ArfGAP with RhoGAP domain, ankyrin repeat and PH domain 3 |
| 15 | ASRGL1 | GI_23308566-S | asparaginase like 1 |
| 16 | B3GALTL | GI_34996530-S | beta 1,3-galactosyltransferase-like |
| 17 | BCAS3 | GI_22095352-S | breast carcinoma amplified sequence 3 |
| 18 | BHLHE41 | GI_13540520-S | basic helix-loop-helix family, member e41 |
| 19 | BLOC1S2 | GI_28603839-S | biogenesis of lysosomal organelles complex-1, subunit 2 |
| 20 | BMP4 | GI_19528651-A | bone morphogenetic protein 4 |
| 21 | C10orf11 | GI_24475719-S | chromosome 10 open reading frame 11 |
| 22 | CASQ1 | GI_21536273-S | calsequestrin 1 (fast-twitch, skeletal muscle) |
| 23 | CAV2 | GI_38176291-A | caveolin 2 |
| 24 | CCDC112 | GI_22749136-S | coiled-coil domain containing 112 |
| 25 | CCL1 | GI_4506832-S | chemokine (C-C motif) ligand 1 |
| 26 | CCNA1 | GI_16306528-S | cyclin A1 |
| 27 | CD93 | GI_11496985-S | CD93 molecule |
| 28 | CDH2 | GI_14589888-S | cadherin 2, type 1, N-cadherin (neuronal) |
| 29 | CGNL1 | GI_31982905-S | cingulin-like 1 |
| 30 | CHI3L2 | GI_11993934-S | chitinase 3-like 2 |
| 31 | CHIA | GI_42542397-A | chitinase, acidic |
| 32 | CLEC2D | GI_7019446-S | C-type lectin domain family 2, member D |
| 33 | CLECL1 | GI_40548404-S | C-type lectin-like 1 |
| 34 | CPEB4 | GI_32698754-S | cytoplasmic polyadenylation element binding protein 4 |
| 35 | CRNKL1 | GI_30795219-S | crooked neck pre-mRNA splicing factor-like 1 (Drosophila) |
| 36 | CTNNA2 | GI_7656998-S | catenin (cadherin-associated protein), alpha 2 |
| 37 | CXCR5 | GI_14589867-I | chemokine (C-X-C motif) receptor 5 |
| 38 | D4S234E | GI_36951161-S | |
| 39 | DAAM2 | GI_40548414-S | dishevelled associated activator of morphogenesis 2 |
| 40 | DAPK2 | GI_14670382-S | death-associated protein kinase 2 |
| 41 | DDIT4L | GI_34222182-S | DNA-damage-inducible transcript 4-like |
| 42 | DHDH | GI_31542505-S | dihydrodiol dehydrogenase (dimeric) |
| 43 | DHODH | GI_45006950-S | dihydroorotate dehydrogenase (quinone) |
| 44 | DISC1 | GI_11037064-S | disrupted in schizophrenia 1 |
| 45 | DSC2 | GI_40806177-A | desmocollin 2 |
| 46 | EDNRB | GI_4503466-A, GI_4557546-A | endothelin receptor type B |
| 47 | EEPD1 | GI_24308300-S | endonuclease/exonuclease/phosphatase family domain containing 1 |
| 48 | EFNA1 | GI_33359679-A, GI_33359681-I | ephrin-A1 |
| 49 | EFNB2 | GI_33359689-S | ephrin-B2 |
| 50 | EMP1 | GI_4503558-S | epithelial membrane protein 1 |
| 51 | ESRP2 | GI_13435148-S | epithelial splicing regulatory protein 2 |
| 52 | ETAA1 | GI_37059813-S | Ewing tumor-associated antigen 1 |
| 53 | EVC | GI_24497530-A, GI_24497530-I | Ellis van Creveld syndrome |
| 54 | FAM118A | GI_8923587-S | family with sequence similarity 118, member A |
| 55 | FAM123A | GI_40288202-S | family with sequence similarity 123A |
| 56 | FAT1 | GI_4885228-S | FAT tumor suppressor homolog 1 (Drosophila) |
| 57 | FCGR2A | GI_11056051-S | Fc fragment of IgG, low affinity IIa, receptor (CD32) |
| 58 | FCRL3 | GI_21314763-S | Fc receptor-like 3 |
| 59 | FERMT2 | GI_29789005-S | fermitin family member 2 |
| 60 | FKBP10 | GI_21361894-S | FK506 binding protein 10, 65 kDa |
| 61 | FLRT3 | GI_38202220-A | fibronectin leucine rich transmembrane protein 3 |
| 62 | FOXD1 | GI_4758391-S | forkhead box D1 |
| 63 | GFPT2 | GI_4826741-S | glutamine-fructose-6-phosphate transaminase 2 |
| 64 | GLIPR1L1 | GI_22749526-S | GLI pathogenesis-related 1 like 1 |
| 65 | GP1BB | GI_9945387-S | glycoprotein Ib (platelet), beta polypeptide |
| 66 | GPC5 | GI_34106705-S | glypican 5 |
| 67 | GPR125 | GI_37540666-S | G protein-coupled receptor 125 |
| 68 | GRM3 | GI_4504138-S | glutamate receptor, metabotropic 3 |
| 69 | HIATL1 | GI_14211858-S | hippocampus abundant transcript-like 1 |
| 70 | HLA-A | GI_24797066-S | major histocompatibility complex, class I, A |
| 71 | HLA-DQA2 | GI_11095446-S | major histocompatibility complex, class II, DQ alpha 2 |
| 72 | HLA-DRB1 | GI_4504410-S | major histocompatibility complex, class II, DR beta 1 |
| 73 | HLA-DRB5 | GI_26665892-S | major histocompatibility complex, class II, DR beta 5 |
| 74 | HLF | GI_31542934-S | hepatic leukemia factor |
| 75 | HMBOX1 | GI_13375737-S | homeobox containing 1 |
| 76 | IGFBP3 | GI_19923110-S | insulin-like growth factor binding protein 3 |
| 77 | IL13 | GI_26787977-S | interleukin 13 |
| 78 | IL6 | GI_10834983-S | interleukin 6 (interferon, beta 2) |
| 79 | INPP4B | GI_4504706-S | inositol polyphosphate-4-phosphatase, type II, 105kDa |
| 80 | IQGAP2 | GI_5729886-S | IQ motif containing GTPase activating protein 2 |
| 81 | ISLR2 | GI_22055338-S | immunoglobulin superfamily containing leucine-rich repeat 2 |
| 82 | ITIH1 | GI_4504780-S | inter-alpha-trypsin inhibitor heavy chain 1 |
| 83 | JAG1 | GI_4557678-S | jagged 1 |
| 84 | KHDRBS3 | GI_5730072-S | KH domain, RNA binding, signal transduction associated 3 |
| 85 | KLF5 | GI_14251214-S | Kruppel-like factor 5 (intestinal) |
| 86 | KRT5 | GI_17318577-S | keratin 5 |
| 87 | KRT86 | GI_15431325-S | keratin 86 |
| 88 | LAMA3 | GI_38045907-A | laminin, alpha 3 |
| 89 | LARP6 | GI_37537705-A, GI_37537709-I | La ribonucleoprotein domain family, member 6 |
| 90 | LCT | GI_32481205-S | lactase |
| 91 | LGALS4 | GI_6006017-S | lectin, galactoside-binding, soluble, 4 |
| 92 | LTBP1 | GI_4557730-S | latent transforming growth factor beta binding protein 1 |
| 93 | MAEL | GI_14249589-S | maelstrom homolog (Drosophila) |
| 94 | MAP4K4 | GI_22035601-A, GI_8923351-S | mitogen-activated protein kinase kinase kinase kinase 4 |
| 95 | MGMT | GI_4505176-S | O-6-methylguanine-DNA methyltransferase |
| 96 | MOSC1 | GI_33285009-S | |
| 97 | MT4 | GI_14269577-S | metallothionein 4 |
| 98 | MUC13 | GI_15042952-S | mucin 13, cell surface associated |
| 99 | MYO1B | GI_37549338-S, GI_44889480-S | myosin IB |
| 100 | MYRIP | GI_23308602-S | myosin VIIA and Rab interacting protein |
| 101 | NACA2 | GI_42476196-S | nascent polypeptide-associated complex alpha subunit 2 |
| 102 | NKG7 | GI_20127481-S | natural killer cell group 7 sequence |
| 103 | NOL4 | GI_4505420-S | nucleolar protein 4 |
| 104 | NPTX1 | GI_4505442-S | neuronal pentraxin I |
| 105 | NRCAM | GI_41281388-S | neuronal cell adhesion molecule |
| 106 | OLIG3 | GI_31343439-S | oligodendrocyte transcription factor 3 |
| 107 | OR52E4 | GI_41200999-S | olfactory receptor, family 52, subfamily E, member 4 |
| 108 | PARVB | GI_20127527-S | parvin, beta |
| 109 | PHACTR3 | GI_34304355-A | phosphatase and actin regulator 3 |
| 110 | PKHD1L1 | GI_31377831-S | polycystic kidney and hepatic disease 1 (autosomal recessive)-like 1 |
| 111 | PLEKHH2 | GI_37546901-S | pleckstrin homology domain containing, family H, member 2 |
| 112 | PLG | GI_4505880-S | plasminogen |
| 113 | PLOD2 | GI_33636741-I, GI_4505888-A | procollagen-lysine, 2-oxoglutarate 5-dioxygenase 2 |
| 114 | PPIL3 | GI_19557635-A | peptidylprolyl isomerase (cyclophilin)-like 3 |
| 115 | PPP4R4 | GI_17402885-I | protein phosphatase 4, regulatory subunit 4 |
| 116 | PRICKLE2 | GI_27481547-S, GI_38524619-S | prickle homolog 2 (Drosophila) |
| 117 | PRSS50 | GI_31543829-S | protease, serine, 50 |
| 118 | PRUNE2 | GI_37539630-S | prune homolog 2 (Drosophila) |
| 119 | PTPN14 | GI_34328898-S | protein tyrosine phosphatase, non-receptor type 14 |
| 120 | PTPRG | GI_18860897-S | protein tyrosine phosphatase, receptor type, G |
| 121 | PTPRM | GI_18860903-S | protein tyrosine phosphatase, receptor type, M |
| 122 | RAB32 | GI_20127508-S | RAB32, member RAS oncogene family |
| 123 | REEP1 | GI_12597656-S | receptor accessory protein 1 |
| 124 | RGS17 | GI_21361404-S | regulator of G-protein signaling 17 |
| 125 | RGS6 | GI_31742475-S | regulator of G-protein signaling 6 |
| 126 | RNASET2 | GI_38683865-S | ribonuclease T2 |
| 127 | RNF180 | GI_31341781-S | ring finger protein 180 |
| 128 | RNF183 | GI_34147709-S | ring finger protein 183 |
| 129 | ROR1 | GI_4826867-S | receptor tyrosine kinase-like orphan receptor 1 |
| 130 | RPL37A | GI_16306561-S | ribosomal protein L37a |
| 131 | S100A10 | GI_4506760-S | S100 calcium binding protein A10 |
| 132 | S100Z | GI_18640743-S | S100 calcium binding protein Z |
| 133 | SERPINA1 | GI_21361197-S | serpin peptidase inhibitor, clade A (alpha-1 antiproteinase, antitrypsin), member 1 |
| 134 | SHANK2 | GI_19743793-I | SH3 and multiple ankyrin repeat domains 2 |
| 135 | SHC3 | GI_21314678-S | SHC (Src homology 2 domain containing) transforming protein 3 |
| 136 | SLC23A1 | GI_44680142-A, GI_44680142-I | solute carrier family 23 (nucleobase transporters), member 1 |
| 137 | SNCAIP | GI_4885602-S | synuclein, alpha interacting protein |
| 138 | SNX7 | GI_23111053-I, GI_23111054-A | sorting nexin 7 |
| 139 | SPATA13 | GI_23308552-S | spermatogenesis associated 13 |
| 140 | SPP1 | GI_38146097-S | secreted phosphoprotein 1 |
| 141 | ST8SIA1 | GI_28373095-S | ST8 alpha-N-acetyl-neuraminide alpha-2,8-sialyltransferase 1 |
| 142 | STK32B | GI_8923753-S | serine/threonine kinase 32B |
| 143 | STOX1 | GI_39932584-S | storkhead box 1 |
| 144 | STXBP6 | GI_34147674-S | syntaxin binding protein 6 (amisyn) |
| 145 | SULT1A2 | GI_29550878-A | sulfotransferase family, cytosolic, 1A, phenol-preferring, member 2 |
| 146 | SYCP3 | GI_31559782-S | synaptonemal complex protein 3 |
| 147 | SYTL2 | GI_15011899-A | synaptotagmin-like 2 |
| 148 | TACC2 | GI_11119413-S | transforming, acidic coiled-coil containing protein 2 |
| 149 | TAGAP | GI_23199968-A | T-cell activation RhoGTPase activating protein |
| 150 | TCL1A | GI_11415027-S | T-cell leukemia/lymphoma 1A |
| 151 | TDO2 | GI_5032164-S | tryptophan 2,3-dioxygenase |
| 152 | TLN2 | GI_22035664-S | talin 2 |
| 153 | TMEM119 | GI_32171198-S | transmembrane protein 119 |
| 154 | TMEM200A | GI_40538800-S | transmembrane protein 200A |
| 155 | TNFRSF11B | GI_22547122-S | tumor necrosis factor receptor superfamily, member 11b |
| 156 | TREM1 | GI_31543823-S | triggering receptor expressed on myeloid cells 1 |
| 157 | TRIM55 | GI_34878851-A | tripartite motif containing 55 |
| 158 | TRIM9 | GI_29543553-I | tripartite motif containing 9 |
| 159 | TUSC3 | GI_30410787-I | tumor suppressor candidate 3 |
| 160 | UGT2A1 | GI_5803212-S | UDP glucuronosyltransferase 2 family, polypeptide A1, complex locus |
| 161 | UTS2 | GI_12056480-I | urotensin 2 |
| 162 | WDR17 | GI_31317281-I, GI_31317310-I | WD repeat domain 17 |
| 163 | WDR41 | GI_42716286-S | WD repeat domain 41 |
| 164 | WNT5A | GI_17402917-S | wingless-type MMTV integration site family, member 5A |
| 165 | ZFP57 | GI_37552164-S | zinc finger protein 57 homolog (mouse) |
| 166 | ZNF462 | GI_31377724-S | zinc finger protein 462 |

Table S2 The cis-acting evQTL genes identified using the expression data set of GSE11582.

| # | HGNC gene symbol | Probe ID | Gene name |
|---|---|---|---|
| 1 | ACPP | 204393_s_at | acid phosphatase, prostate |
| 2 | ADCY1 | 213245_at, 215340_at, 215348_at | adenylate cyclase 1 (brain) |
| 3 | APCS | 206350_at | amyloid P component, serum |
| 4 | CALM1 | 200622_x_at, 200623_s_at, 200653_s_at, 200655_s_at, 207243_s_at, 209563_x_at, 211984_at, 211985_s_at, 213688_at | calmodulin 1 (phosphorylase kinase, delta) |
| 5 | CAMK1D | 220246_at | calcium/calmodulin-dependent protein kinase ID |
| 6 | CHPF | 202175_at | chondroitin polymerizing factor |
| 7 | CRHBP | 205984_at | corticotropin releasing hormone binding protein |
| 8 | CTNNA2 | 205373_at | catenin (cadherin-associated protein), alpha 2 |
| 9 | CTTN | 201059_at, 214073_at, 214074_s_at, 214782_at | cortactin |
| 10 | CYP2R1 | 207786_at | cytochrome P450, family 2, subfamily R, polypeptide 1 |
| 11 | DAAM2 | 212793_at | dishevelled associated activator of morphogenesis 2 |
| 12 | DHRS4 | 218021_at | dehydrogenase/reductase (SDR family) member 4 |
| 13 | DNAJC15 | 218435_at | DnaJ (Hsp40) homolog, subfamily C, member 15 |
| 14 | FERMT2 | 209210_s_at, 214212_x_at | fermitin family member 2 |
| 15 | GRIK5 | 214966_at, 217509_x_at | glutamate receptor, ionotropic, kainate 5 |
| 16 | HBD | 206834_at | hemoglobin, delta |
| 17 | HGF | 209961_s_at | hepatocyte growth factor (hepapoietin A; scatter factor) |
| 18 | HIC1 | 208461_at | hypermethylated in cancer 1 |
| 19 | HIC2 | 212964_at, 212965_at, 212966_at, 216911_s_at | hypermethylated in cancer 2 |
| 20 | HLA-DPA1 | 211990_at, 211991_s_at, 213537_at | major histocompatibility complex, class II, DP alpha 1 |
| 21 | IGF1 | 209540_at, 209541_at, 209542_x_at, 211577_s_at | insulin-like growth factor 1 (somatomedin C) |
| 22 | IL1R1 | 202948_at | interleukin 1 receptor, type I |
| 23 | IL6 | 205207_at | interleukin 6 (interferon, beta 2) |
| 24 | LHFP | 218656_s_at | lipoma HMGIC fusion partner |
| 25 | LMO3 | 204424_s_at | LIM domain only 3 (rhombotin-like 2) |
| 26 | LPIN1 | 212272_at, 212274_at, 212276_at | lipin 1 |
| 27 | MAPK9 | 203218_at, 210570_x_at | mitogen-activated protein kinase 9 |
| 28 | MED20 | 206961_s_at, 212872_s_at | mediator complex subunit 20 |
| 29 | MEIS2 | 207480_s_at | Meis homeobox 2 |
| 30 | MGAT1 | 201126_s_at | mannosyl (alpha-1,3-)-glycoprotein beta-1,2-N-acetylglucosaminyltransferase |
| 31 | MYBPC1 | 214087_s_at | myosin binding protein C, slow type |
| 32 | MYH11 | 201495_x_at, 201496_x_at, 201497_x_at, 207961_x_at | myosin, heavy chain 11, smooth muscle |
| 33 | NDUFA6 | 202000_at, 202001_s_at | NADH dehydrogenase (ubiquinone) 1 alpha subcomplex, 6, 14kDa |
| 34 | NOL3 | 221566_s_at, 221567_at, 59625_at | nucleolar protein 3 (apoptosis repressor with CARD domain) |
| 35 | NUP88 | 202900_s_at, 214192_at | nucleoporin 88kDa |
| 36 | PARP8 | 219033_at | poly (ADP-ribose) polymerase family, member 8 |
| 37 | PGM1 | 201968_s_at | phosphoglucomutase 1 |
| 38 | PKM2 | 201251_at | pyruvate kinase, muscle |
| 39 | PLOD2 | 202619_s_at | procollagen-lysine, 2-oxoglutarate 5-dioxygenase 2 |
| 40 | PPFIBP2 | 212841_s_at | PTPRF interacting protein, binding protein 2 (liprin beta 2) |
| 41 | PPIC | 204517_at, 204518_s_at | peptidylprolyl isomerase C (cyclophilin C) |
| 42 | PRKCH | 206099_at, 218764_at | protein kinase C, eta |
| 43 | PRPF4B | 202126_at, 202127_at, 211090_s_at | PRP4 pre-mRNA processing factor 4 homolog B (yeast) |
| 44 | RABEP1 | 214552_s_at | rabaptin, RAB GTPase binding effector protein 1 |
| 45 | RBL2 | 212331_at, 212332_at | retinoblastoma-like 2 (p130) |
| 46 | RGNEF | 219610_at | |
| 47 | RPL11 | 200010_at | ribosomal protein L11 |
| 48 | RPL31 | 200962_at, 200963_x_at, 221593_s_at | ribosomal protein L31 |
| 49 | SLC19A2 | 209681_at | solute carrier family 19 (thiamine transporter), member 2 |
| 50 | SLC22A17 | 218675_at, 221106_at | solute carrier family 22, member 17 |
| 51 | SLC22A4 | 205896_at | solute carrier family 22 (organic cation/ergothioneine transporter), member 4 |
| 52 | SLC6A2 | 210353_s_at, 215715_at, 216610_at, 216611_s_at, 217214_s_at | solute carrier family 6 (neurotransmitter transporter, noradrenalin), member 2 |
| 53 | SNX7 | 205573_s_at | sorting nexin 7 |
| 54 | SRP19 | 205335_s_at | signal recognition particle 19kDa |
| 55 | SV2B | 205551_at | synaptic vesicle glycoprotein 2B |
| 56 | TNFRSF11B | 204933_s_at | tumor necrosis factor receptor superfamily, member 11b |
| 57 | TNK1 | 205793_x_at, 217149_x_at | tyrosine kinase, non-receptor, 1 |
| 58 | UTP18 | 203721_s_at | UTP18, small subunit (SSU) processome component, homolog (yeast) |
| 59 | WBP11 | 217821_s_at, 217822_at | WW domain binding protein 11 |
| 60 | ZDHHC24 | 205634_x_at | zinc finger, DHHC-type containing 24 |

Table S3 Analysis of directionality of the effect of evQTL SNPs on expression variability. Min1 and Maj1 indicate the minor and major alleles used to determine the direction (Dir1) of the allele effect on expression variability assessed using the GSE6536 data. Min2 and Maj2 indicate the minor and major alleles used to determine the direction (Dir2) of the allele effect on expression variability assessed using the RNA-seq data. SameDir: 1 indicates the same direction of the effect, while 0 indicates the opposite direction. P_dglm, P_permu, and P_FK indicate P-values of the dglm, permutation, and F-K tests, respectively.

| # | Gene | Probe | Min1 | Maj1 | Dir1 | Chr | Position (hg18) | P_dglm | P_permu | P_FK | Min2 | Maj2 | Dir2 | SameDir |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | ANKDD1A | GI_38176293-S | T | C | 1 | 15 | 62963701 | 0.0584301 | 0 | 3.25E-05 | T | C | 1 | 1 |
| 2 | ASRGL1 | GI_23308566-S | C | G | -1 | 11 | 61692317 | 8.87E-05 | 0 | 0.00020229 | G | C | 1 | 1 |
| 3 | CAV2 | GI_38176291-A | A | G | -1 | 7 | 115789709 | 3.09E-19 | 0 | 1.55E-16 | A | G | -1 | 1 |
| 4 | CCNA1 | GI_16306528-S | A | G | 1 | 13 | 34978730 | 0.000301071 | 0.0002 | 0.00133202 | G | A | -1 | 1 |
| 5 | CHI3L2 | GI_11993934-S | T | C | 1 | 1 | 111599695 | 2.75E-09 | 0 | 0.000182419 | T | C | 1 | 1 |
| 6 | CLECL1 | GI_40548404-S | A | G | -1 | 12 | 10200399 | 0.00538965 | 0 | 0.000291269 | G | A | 1 | 1 |
| 7 | CLECL1 | GI_40548404-S | T | C | -1 | 12 | 10200558 | 0.00538965 | 0 | 0.000291269 | T | C | -1 | 1 |
| 8 | CTNNA2 | GI_7656998-S | T | C | -1 | 2 | 79465943 | 0.000279227 | 0 | 3.62E-06 | C | T | 1 | 1 |
| 9 | DISC1 | GI_11037064-S | T | A | -1 | 1 | 229802898 | 0.000255121 | 0 | 0.000111018 | A | T | -1 | 0 |
| 10 | DISC1 | GI_11037064-S | T | C | -1 | 1 | 229817264 | 5.45E-09 | 0 | 2.97E-10 | C | T | -1 | 0 |
| 11 | DISC1 | GI_11037064-S | G | A | 1 | 1 | 229910192 | 1.99E-05 | 0 | 9.45E-06 | G | A | -1 | 0 |
| 12 | EMP1 | GI_4503558-S | C | A | -1 | 12 | 12341799 | 0.00466055 | 0.0004 | 0.00947516 | C | A | -1 | 1 |
| 13 | HLA-DRB1 | GI_4504410-S | G | A | 1 | 6 | 32702419 | 0.578276 | 0 | 2.19E-11 | G | A | 1 | 1 |
| 14 | HLA-DRB5 | GI_26665892-S | C | T | 1 | 6 | 32556107 | 2.68E-20 | 0 | 1.28E-15 | C | T | 1 | 1 |
| 15 | HLA-DRB5 | GI_26665892-S | G | A | 1 | 6 | 32702419 | 1.58E-23 | 0 | 1.16E-24 | G | A | 1 | 1 |
| 16 | IQGAP2 | GI_5729886-S | A | T | -1 | 5 | 75174925 | 0.0805595 | 0.0003 | 0.00227382 | T | A | 1 | 1 |
| 17 | KRT5 | GI_17318577-S | G | A | -1 | 12 | 50992821 | 0.00013108 | 0.0001 | 0.00870083 | G | A | -1 | 1 |
| 18 | KRT5 | GI_17318577-S | G | A | -1 | 12 | 50992857 | 0.00013108 | 0 | 0.00870083 | G | A | -1 | 1 |
| 19 | KRT5 | GI_17318577-S | T | C | -1 | 12 | 50997717 | 0.000149378 | 0.0002 | 0.00693345 | T | C | -1 | 1 |
| 20 | NKG7 | GI_20127481-S | C | T | -1 | 19 | 57452843 | 0.461895 | 0 | 0.000882016 | C | T | -1 | 1 |
| 21 | PHACTR3 | GI_34304355-A | T | C | -1 | 20 | 57527300 | 1.42E-07 | 0 | 9.87E-07 | C | T | 1 | 1 |
| 22 | PKHD1L1 | GI_31377831-S | A | G | 1 | 8 | 110614783 | 1.88E-45 | 0 | 7.00E-11 | G | A | -1 | 1 |
| 23 | PTPN14 | GI_34328898-S | G | C | -1 | 1 | 212499946 | 8.89E-05 | 0 | 1.96E-05 | G | C | -1 | 1 |
| 24 | REEP1 | GI_12597656-S | C | T | -1 | 2 | 86326381 | 1.14E-08 | 0 | 0.00524881 | C | T | -1 | 1 |
| 25 | RGS17 | GI_21361404-S | G | C | 1 | 6 | 153437287 | 1.56E-06 | 0 | 1.01E-05 | G | C | 1 | 1 |
| 26 | RGS6 | GI_31742475-S | T | C | -1 | 14 | 72201292 | 0.00279922 | 0.0008 | 3.16E-05 | C | T | -1 | 0 |
| 27 | S100A10 | GI_4506760-S | C | G | -1 | 1 | 150197924 | 3.09E-14 | 0 | 8.81E-09 | C | G | -1 | 1 |
| 28 | SERPINA1 | GI_21361197-S | A | T | -1 | 14 | 93639000 | 1.17E-05 | 0 | 0.0085942 | T | A | 1 | 1 |
| 29 | SERPINA1 | GI_21361197-S | A | T | 1 | 14 | 93869275 | 6.72E-07 | 0.0008 | 4.54E-05 | A | T | 1 | 1 |
| 30 | SNCAIP | GI_4885602-S | A | T | -1 | 5 | 121861852 | 0.0105081 | 0.0004 | 0.000755901 | A | T | -1 | 1 |
| 31 | SNX7 | GI_23111053-I | T | C | -1 | 1 | 98996096 | 5.14E-06 | 0.0008 | 1.44E-07 | T | C | -1 | 1 |
| 32 | SNX7 | GI_23111054-A | C | T | -1 | 1 | 98972034 | 3.82E-05 | 0.0006 | 1.51E-05 | C | T | -1 | 1 |
| 33 | SNX7 | GI_23111054-A | T | C | -1 | 1 | 98996096 | 9.18E-06 | 0.0005 | 1.65E-05 | T | C | -1 | 1 |
| 34 | SPP1 | GI_38146097-S | C | T | 1 | 4 | 89049656 | 1.37E-05 | 0.0002 | 0.000755318 | C | T | 1 | 1 |
| 35 | STXBP6 | GI_34147674-S | C | T | 1 | 14 | 24499438 | 0.12104 | 0.0002 | 0.000152535 | C | T | 1 | 1 |
| 36 | TNFRSF11B | GI_22547122-S | C | T | -1 | 8 | 119862426 | 1.62E-06 | 0 | 3.42E-11 | T | C | 1 | 1 |
| 37 | TNFRSF11B | GI_22547122-S | A | G | -1 | 8 | 119864420 | 1.68E-06 | 0 | 2.07E-10 | G | A | 1 | 1 |
| 38 | TNFRSF11B | GI_22547122-S | A | G | 0 | 8 | 119874067 | 4.41E-09 | 0 | 7.84E-12 | A | G | -1 | 0 |
| 39 | TNFRSF11B | GI_22547122-S | T | C | -1 | 8 | 119877403 | 2.84E-09 | 0 | 1.45E-11 | T | C | -1 | 1 |
| 40 | TUSC3 | GI_30410787-I | T | A | -1 | 8 | 15076561 | 0.0879955 | 0.0007 | 0.00531386 | T | A | -1 | 1 |
| 41 | UTS2 | GI_12056480-I | T | A | -1 | 1 | 8416520 | 1.50E-09 | 0 | 1.70E-08 | T | A | 1 | 0 |
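
The caption of Table S3 does not spell out how SameDir is derived from the two directions. One reading that agrees with every row of the table is sketched below in Python: when the RNA-seq data report the two alleles in swapped order (Min2 equals Maj1 and Maj2 equals Min1), Dir2 is sign-flipped so that it refers to the same allele as Dir1, and the two directions are then compared. The function name `same_direction` and the flipping rule are assumptions inferred from the table, not a procedure described by the authors.

```python
def same_direction(min1, maj1, dir1, min2, maj2, dir2):
    """Return 1 if the two data sets agree on the direction of the allele effect, else 0.

    Assumption (inferred from the rows of Table S3, not stated in the caption):
    if the minor and major alleles are swapped between the two data sets,
    Dir2 is re-expressed relative to Min1 by flipping its sign.
    """
    if (min2, maj2) == (maj1, min1):
        dir2 = -dir2  # alleles swapped: flip the sign of Dir2
    elif (min2, maj2) != (min1, maj1):
        raise ValueError("the two data sets report different allele pairs")
    return int(dir1 == dir2)


# Row 2 (ASRGL1): alleles swapped, Dir1 = -1, Dir2 = 1
print(same_direction("C", "G", -1, "G", "C", 1))
# Row 9 (DISC1): alleles swapped, Dir1 = -1, Dir2 = -1
print(same_direction("T", "A", -1, "A", "T", -1))
```

Under this reading, a row such as row 38 (TNFRSF11B), where Dir1 is 0 and Dir2 is -1, is scored as 0 because the two values differ.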

Table S4 Functional enrichment of the cis-acting evQTL genes. The analysis was performed using DAVID with cis-evQTL genes determined in GSE6536 (Stranger et al. 2007) and GSE11582 (Choy et al. 2008).

| Category | Term | Count | % | PValue | List Total | Pop Hits | Pop Total | Fold Enrichment | FDR |
|---|---|---|---|---|---|---|---|---|---|
| GOTERM_CC_FAT | GO:0070161~anchoring junction | 10 | 4.6 | 3.73E-04 | 165 | 172 | 12782 | 4.50 | 0.48 |
| GOTERM_CC_FAT | GO:0044459~plasma membrane part | 46 | 21.2 | 6.96E-04 | 165 | 2203 | 12782 | 1.62 | 0.89 |
| GOTERM_CC_FAT | GO:0005912~adherens junction | 9 | 4.1 | 8.63E-04 | 165 | 155 | 12782 | 4.50 | 1.11 |
| GOTERM_CC_FAT | GO:0030027~lamellipodium | 6 | 2.8 | 0.002022 | 165 | 70 | 12782 | 6.64 | 2.58 |
| GOTERM_CC_FAT | GO:0005887~integral to plasma membrane | 28 | 12.9 | 0.002381 | 165 | 1188 | 12782 | 1.83 | 3.03 |
| GOTERM_CC_FAT | GO:0031226~intrinsic to plasma membrane | 28 | 12.9 | 0.003261 | 165 | 1215 | 12782 | 1.79 | 4.13 |
| INTERPRO | IPR014745:MHC class II, alpha/beta chain, N-terminal | 4 | 1.8 | 0.002446 | 208 | 22 | 16659 | 14.56 | 3.41 |
| KEGG_PATHWAY | hsa05332:Graft-versus-host disease | 6 | 2.8 | 3.61E-04 | 83 | 39 | 5085 | 9.43 | 0.41 |
| KEGG_PATHWAY | hsa05310:Asthma | 5 | 2.3 | 0.001098 | 83 | 29 | 5085 | 10.56 | 1.24 |
| KEGG_PATHWAY | hsa05330:Allograft rejection | 5 | 2.3 | 0.002499 | 83 | 36 | 5085 | 8.51 | 2.80 |
| KEGG_PATHWAY | hsa04940:Type I diabetes mellitus | 5 | 2.3 | 0.004414 | 83 | 42 | 5085 | 7.29 | 4.91 |
| SP_PIR_KEYWORDS | signal | 64 | 29.5 | 4.77E-06 | 217 | 3250 | 19235 | 1.75 | 0.006 |
| SP_PIR_KEYWORDS | glycoprotein | 78 | 35.9 | 6.35E-06 | 217 | 4318 | 19235 | 1.60 | 0.009 |
| SP_PIR_KEYWORDS | cell adhesion | 15 | 6.9 | 3.13E-04 | 217 | 422 | 19235 | 3.15 | 0.42 |
| SP_PIR_KEYWORDS | polymorphism | 155 | 71.4 | 3.53E-04 | 217 | 11550 | 19235 | 1.19 | 0.48 |
| SP_PIR_KEYWORDS | transmembrane protein | 19 | 8.8 | 3.54E-04 | 217 | 642 | 19235 | 2.62 | 0.48 |
| SP_PIR_KEYWORDS | mhc ii | 5 | 2.3 | 3.84E-04 | 217 | 31 | 19235 | 14.30 | 0.52 |
| SP_PIR_KEYWORDS | heterodimer | 7 | 3.2 | 0.001082 | 217 | 103 | 19235 | 6.02 | 1.46 |
| UP_SEQ_FEATURE | signal peptide | 64 | 29.5 | 5.89E-06 | 217 | 3250 | 19113 | 1.73 | 0.009 |
| UP_SEQ_FEATURE | glycosylation site:N-linked (GlcNAc...) | 74 | 34.1 | 2.20E-05 | 217 | 4129 | 19113 | 1.58 | 0.03 |
| UP_SEQ_FEATURE | sequence variant | 163 | 75.1 | 8.00E-05 | 217 | 11992 | 19113 | 1.20 | 0.12 |
| UP_SEQ_FEATURE | domain:Ig-like C1-type | 5 | 2.3 | 0.001054 | 217 | 40 | 19113 | 11.01 | 1.62 |
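
The Fold Enrichment column of Table S4 can be recomputed from the other columns, assuming the usual DAVID definition: the proportion of annotated list genes carrying the term (Count / List Total) divided by the proportion of annotated background genes carrying it (Pop Hits / Pop Total). The short Python sketch below, with the hypothetical helper name `fold_enrichment`, is offered only as a way to check the tabulated values; it is not part of the authors' analysis.

```python
def fold_enrichment(count, list_total, pop_hits, pop_total):
    """Fold enrichment of a term: list proportion over background proportion."""
    return (count / list_total) / (pop_hits / pop_total)


# (term, Count, List Total, Pop Hits, Pop Total) copied from Table S4
rows = [
    ("GO:0070161~anchoring junction", 10, 165, 172, 12782),
    ("hsa05310:Asthma", 5, 83, 29, 5085),
    ("signal", 64, 217, 3250, 19235),
]

for term, count, list_total, pop_hits, pop_total in rows:
    value = fold_enrichment(count, list_total, pop_hits, pop_total)
    print(f"{term}: {value:.2f}")
```

The three recomputed values agree with the tabulated 4.50, 10.56, and 1.75 to two decimal places.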

Table S5 evQTL genes located in the genomic regions under positive selection.

| # | Gene | Chr | Start (hg18) | End (hg18) | Reference |
|---|---|---|---|---|---|
| 1 | ACPP | 3 | 133518991 | 133569357 | Voight (Europeans) |
| 2 | AK4 | 1 | 65386679 | 65464449 | Voight (Asians) |
| 3 | ANKDD1A | 15 | 62991183 | 63036581 | Tang (Chinese) |
| 4 | BCAS3 | 17 | 56111601 | 56824269 | Oleksyk (Europeans) |
| 5 | CDH2 | 18 | 23786115 | 24010985 | Tang (Europeans) |
| 6 | CTTN | 11 | 69931052 | 69960163 | Tang (Chinese) |
| 7 | DAPK2 | 15 | 61987772 | 62119514 | Carlson (N.S.); Tang (Chinese); Williamson (N.S.); Voight (Asians) |
| 8 | FAM118A | 22 | 44096925 | 44114951 | Oleksyk (Africans) |
| 9 | FAM123A | 13 | 24641742 | 24643758 | Voight (Africans); Tang (Europeans) |
| 10 | FCGR2A | 1 | 159741882 | 159754563 | Voight (Africans) |
| 11 | FERMT2 | 14 | 52394845 | 52487037 | Tang (Europeans) |
| 12 | GRM3 | 7 | 86232398 | 86331608 | Tang (Chinese); Voight (Asians) |
| 13 | HLA-DPA1 | 6 | 33144405 | 33149326 | Voight (Africans) |
| 14 | IL13 | 5 | 132021778 | 132023874 | Tang (Europeans) |
| 15 | INPP4B | 4 | 143169385 | 143571863 | Tang (Europeans) |
| 16 | ITIH1 | 3 | 52786672 | 52800968 | Tang (Chinese) |
| 17 | KHDRBS3 | 8 | 136539292 | 136728510 | Williamson (N.S.) |
| 18 | LAMA3 | 18 | 19523646 | 19788611 | Tang (Europeans) |
| 19 | LCT | 2 | 136262364 | 136311210 | HapmapII (Europeans); Grossman (CEU); Tang (Europeans); Voight (Europeans); Nielsen (N.S.) |
| 20 | LTBP1 | 2 | 33025896 | 33477117 | Kimura (N.S.) |
| 21 | MYO1B | 2 | 191849867 | 191996932 | Tang (Europeans); Kimura (N.S.) |
| 22 | MYRIP | 3 | 39917312 | 40274662 | Tang (Europeans) |
| 23 | NOL3 | 16 | 65765574 | 65766502 | Carlson (N.S.); Kimura (N.S.) |
| 24 | NUP88 | 17 | 5230250 | 5263695 | Voight (Europeans) |
| 25 | PGM1 | 1 | 63831748 | 63897935 | Tang (Chinese) |
| 26 | PKHD1L1 | 8 | 110443986 | 110611496 | Tang (Chinese) |
| 27 | PKM2 | 15 | 70279045 | 70310383 | Tang (Europeans); Oleksyk (Europeans); Voight (Europeans) |
| 28 | PLEKHH2 | 2 | 43725317 | 43846242 | Tang (Chinese) |
| 29 | PPIL3 | 2 | 201444363 | 201460584 | Kimura (N.S.); Williamson (N.S.) |
| 30 | PRKCH | 14 | 60858573 | 61086303 | Oleksyk (Africans & Europeans); Tang (Europeans) |
| 31 | PTPN14 | 1 | 212597888 | 212704770 | Williamson (N.S.) |
| 32 | PTPRM | 18 | 7557817 | 8396161 | Tang (Europeans); Voight (Europeans); HapmapII (Europeans); Grossman (CEU) |
| 33 | RABEP1 | 17 | 5126506 | 5227243 | Voight (Europeans) |
| 34 | RGNEF | 5 | 73016433 | 73272595 | Tang (Europeans) |
| 35 | ROR1 | 1 | 64012677 | 64417127 | Tang (Chinese); Tang (Europeans) |
| 36 | SHANK2 | 11 | 69996622 | 70536021 | Tang (Chinese); Tang (Europeans) |
| 37 | SLC19A2 | 1 | 167701711 | 167721629 | Tang (Chinese) |
| 38 | SLC22A17 | 14 | 22885697 | 22891264 | Williamson (N.S.) |
| 39 | SRP19 | 5 | 112224973 | 112255204 | Tang (Chinese); Voight (Asians) |
| 40 | ST8SIA1 | 12 | 22245753 | 22378434 | Carlson (N.S.) |
| 41 | STK32B | 4 | 5104492 | 5551712 | Tang (Europeans) |
| 42 | STOX1 | 10 | 70257387 | 70322499 | Oleksyk (Africans & Europeans) |
| 43 | SV2B | 15 | 89570498 | 89636787 | Tang (Chinese); Williamson (N.S.) |
| 44 | TNFRSF11B | 8 | 120005794 | 120033242 | Tang (Europeans) |
| 45 | TRIM55 | 8 | 67202058 | 67249383 | Tang (Europeans) |
| 46 | UTP18 | 17 | 46692945 | 46729350 | Wang (All populations) |
| 47 | WNT5A | 3 | 55479160 | 55496054 | Voight (Asians) |

• Akey, J.M., G. Zhang, K. Zhang, L. Jin, and M.D. Shriver. 2002. Interrogating a high-density SNP map for signatures of natural selection. Genome Res 12: 1805-1814.
• Carlson, C.S., D.J. Thomas, M.A. Eberle, J.E. Swanson, R.J. Livingston, M.J. Rieder, and D.A. Nickerson. 2005. Genomic regions exhibiting positive selection identified from dense genotype data. Genome Res 15: 1553-1565.
• Grossman, S.R., I. Shylakhter, E.K. Karlsson, E.H. Byrne, S. Morales, G. Frieden, E. Hostetter, E. Angelino, M. Garber, O. Zuk, E.S. Lander, S.F. Schaffner, and P.C. Sabeti. 2010. A composite of multiple signals distinguishes causal variants in regions of positive selection. Science 327: 883-886.
• International-HapMap-Consortium. 2005. A haplotype map of the human genome. Nature 437: 1299-1320.
• International-HapMap-Consortium. 2007. A second generation human haplotype map of over 3.1 million SNPs. Nature 449: 851-861.
• Kimura, R., A. Fujimoto, K. Tokunaga, and J. Ohashi. 2007. A practical genome scan for population-specific strong selective sweeps that have reached fixation. PLoS ONE 2: e286.
• Nielsen, R., S. Williamson, Y. Kim, M.J. Hubisz, A.G. Clark, and C. Bustamante. 2005. Genomic scans for selective sweeps using SNP data. Genome Res 15: 1566-1575.
• Oleksyk, T.K., K. Zhao, F.M. De La Vega, D.A. Gilbert, S.J. O'Brien, and M.W. Smith. 2008. Identifying selected regions from heterozygosity and divergence using a light-coverage genomic dataset from two human populations. PLoS ONE 3: e1712.
• Sabeti, P.C., P. Varilly, B. Fry, J. Lohmueller, E. Hostetter, C. Cotsapas, X. Xie, E.H. Byrne, S.A. McCarroll, R. Gaudet, S.F. Schaffner, E.S. Lander, and International-HapMap-Consortium. 2007. Genome-wide detection and characterization of positive selection in human populations. Nature 449: 913-918.
• Tang, K., K.R. Thornton, and M. Stoneking. 2007. A new approach for using genome scans to detect recent positive selection in the human genome. PLoS Biol 5: e171.
• Voight, B.F., S. Kudaravalli, X. Wen, and J.K. Pritchard. 2006. A map of recent positive selection in the human genome. PLoS Biol 4: e72.
• Wang, E.T., G. Kodama, P. Baldi, and R.K. Moyzis. 2006. Global landscape of recent inferred Darwinian selection for Homo sapiens. Proc Natl Acad Sci U S A 103: 135-140.
• Williamson, S.H., M.J. Hubisz, A.G. Clark, B.A. Payseur, C.D. Bustamante, and R. Nielsen. 2007. Localizing recent adaptive evolution in the human genome. PLoS Genet 3: e90.

Table S6 Identified trans-evQTL genes (GSE6536). Genes with trans-acting evQTL SNP(s) identified from a set of 500 representative genes and 13,718 SNPs.

| Gene symbol | Number of trans-vQTL SNPs | Chr | Start (hg18) | End (hg18) | Description |
|---|---|---|---|---|---|
| ANO3 | 36 | 11 | 26210829 | 26684835 | anoctamin 3 |
| ASPA | 21 | 17 | 3377404 | 3406713 | aspartoacylase |
| PDGFD | 21 | 11 | 103777914 | 104035107 | platelet derived growth factor D |
| NRCAM | 18 | 7 | 107788082 | 108097161 | neuronal cell adhesion molecule |
| FAT1 | 10 | 4 | 187508937 | 187647876 | FAT tumor suppressor homolog 1 (Drosophila) |
| DSC2 | 6 | 18 | 28645940 | 28682378 | desmocollin 2 |
| CCL1 | 5 | 17 | 32687399 | 32690230 | chemokine (C-C motif) ligand 1 |
| LARP6 | 5 | 15 | 71123891 | 71146498 | La ribonucleoprotein domain family, member 6 |
| PRICKLE2 | 5 | 3 | 64079543 | 64253655 | prickle homolog 2 (Drosophila) |
| CXCL14 | 4 | 5 | 134906373 | 134914969 | chemokine (C-X-C motif) ligand 14 |
| GAP43 | 4 | 3 | 115342171 | 115440337 | growth associated protein 43 |
| C21orf56 | 3 | 21 | 47581062 | 47604390 | chromosome 21 open reading frame 56 |
| DTNA | 3 | 18 | 32073254 | 32471808 | dystrobrevin, alpha |
| MT1G | 3 | 16 | 56700647 | 56701977 | metallothionein 1G |
| RGS6 | 3 | 14 | 72315323 | 73030654 | regulator of G-protein signaling 6 |
| CDK18 | 2 | 1 | 205473723 | 205501921 | cyclin-dependent kinase 18 |
| CLEC4G | 2 | 19 | 7793844 | 7797057 | C-type lectin domain family 4, member G |
| LCE1E | 2 | 1 | 152758690 | 152760902 | late cornified envelope 1E |
| NRG4 | 2 | 15 | 76233277 | 76352069 | neuregulin 4 |
| WNT5A | 2 | 3 | 55499743 | 55523973 | wingless-type MMTV integration site family, member 5A |
| ZNF462 | 2 | 9 | 109625378 | 109775915 | zinc finger protein 462 |
| ANXA8L2 | 1 | 10 | 47746936 | 47763041 | annexin A8-like 2 |
| CCL20 | 1 | 2 | 228678558 | 228682272 | chemokine (C-C motif) ligand 20 |
| CDH2 | 1 | 18 | 25530930 | 25757410 | cadherin 2, type 1, N-cadherin (neuronal) |
| COL22A1 | 1 | 8 | 139600478 | 139926249 | collagen, type XXII, alpha 1 |
| HHEX | 1 | 10 | 94447945 | 94455403 | hematopoietically expressed homeobox |
| KIT | 1 | 4 | 55524085 | 55606881 | v-kit Hardy-Zuckerman 4 feline sarcoma viral oncogene homolog |
| SERPINA1 | 1 | 14 | 94843084 | 94857030 | serpin peptidase inhibitor, clade A (alpha-1 antiproteinase, antitrypsin), member 1 |
| UTS2D | 1 | 3 | 190984957 | 191048325 | urotensin 2 domain containing |

References

Choy E, Yelensky R, Bonakdar S, Plenge RM, Saxena R, De Jager PL, Shaw SY, Wolfish CS, Slavik JM, Cotsapas C et al. 2008. Genetic analysis of human traits in vitro: drug response and gene expression in lymphoblastoid cell lines. PLoS Genetics 4(11): e1000287.

Montgomery SB, Sammeth M, Gutierrez-Arcelus M, Lach RP, Ingle C, Nisbett J, Guigo R, Dermitzakis ET. 2010. Transcriptome genetics using second generation sequencing in a Caucasian population. Nature 464(7289): 773-777.

Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B. 2008. Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Methods 5(7): 621-628.

Stranger BE, Forrest MS, Dunning M, Ingle CE, Beazley C, Thorne N, Redon R, Bird CP, de Grassi A, Lee C et al. 2007. Relative impact of nucleotide and copy number variation on gene expression phenotypes. Science 315(5813): 848-853.

Xu G, Deng N, Zhao Z, Judeh T, Flemington E, Zhu D. 2011. SAMMate: a GUI tool for processing short read alignments in SAM/BAM format. Source Code for Biology and Medicine 6(1): 2.
