Central Annals of Biometrics & Bringing Excellence in Open Access

Editorial

Exploit the Neglected Heteroscedasticity in Genetics Data

Huaizhen Qin* and Weiwei Ouyang
Department of Biostatistics and Bioinformatics, Tulane University School of Public Health and Tropical Medicine, USA

*Corresponding author: Huaizhen Qin, Department of Biostatistics and Bioinformatics, Tulane University School of Public Health and Tropical Medicine, New Orleans, LA 70112, USA, Tel: 504-988-2042; Fax: 504-988-1706; Email:

Submitted: 05 December 2016
Accepted: 05 December 2016
Published: 08 December 2016
Copyright © 2016 Qin et al.
OPEN ACCESS

Cite this article: Qin H, Ouyang W (2016) Exploit the Neglected Heteroscedasticity in Genetics Data. Ann Biom Biostat 3(1): 1026.

EDITORIAL

The use of (generalized) linear models is ubiquitous in genetic association studies of complex phenotypes. In this area, homoscedasticity is often assumed; namely, the variances of modelling errors are assumed to be invariant with respect to the effects being modelled. Such studies aim to localize causal loci by exploiting the effects of genetic variants at loci that are in LD with the genuine causal loci (Figure 1), i.e., testing for $\beta_1 = 0$ under the standard linear regression $Y = \beta_0 + G\beta_1 + e$, assuming homoscedasticity ($\sigma_e^2$, the variance of the error $e$, does not depend on $G$). The homoscedasticity assumption is often violated, to a greater or lesser degree, in real-world genetic association studies. Heteroscedasticity, aka variance heterogeneity, is ubiquitous, since LD is ubiquitous [1]. The distribution of the modelling error in the aforesaid linear regression apparently depends on the genuine causal variants and hence on the observed genotypes at their immediate neighbor markers, i.e., $\sigma_e^2 = h(G)$ for some function $h(\cdot)$. The heteroscedasticity problem in genetic association studies was acknowledged at least as early as 2000 but was historically neglected for simplicity [2]. In the era of deep sequencing, homoscedasticity is still assumed by the vast majority of prominent existing marker-wise and set-wise association tests [3-10]. As warned by the econometrician White [11], the presence of heteroscedasticity can invalidate statistical tests that assume homoscedasticity.

Figure 1 Schematic diagram of association mapping. Genetic loci are said to be in LD if some combinations of alleles occur at these loci more or less frequently than expected from random formation [2]. A correlation between genotyped markers (G) and study trait (Y) would be observed if these markers are in LD with some QTLs (C). Therefore, QTLs can be localized by their immediate neighbor markers.

Recently, there have been emerging investigations on heteroscedasticity in genetic association studies. Roughly speaking, there are two schools of thought about how to deal with potential heteroscedasticity. In the first school, heteroscedasticity is considered an impediment to information mining, and data transformation, e.g., log transformation or Box-Cox transformation, is employed to eliminate it. Sun et al. [12] formally demonstrated that biological interpretations based on genotype-driven heteroscedasticity may not be valid when phenotypic variances with respect to the three genotype groups are not equal. They theoretically showed that the variance can be expressed as a quadratic function of the mean and that a transformation can be employed to equalize the three variances for an autosomal diallelic SNP. The quadratic transformation is sufficient to eliminate heteroscedasticity for a single SNP, because there are only three possible genotypes. However, it may not work well for set-wise tests, e.g., the most popular SKAT [10] for identifying sequence associations. In general, it is hard to choose a transformation that completely eliminates set (e.g., gene, pathway) driven heteroscedasticity and the induced bias in estimates of regression coefficients. Data transformation is a way to calibrate heteroscedasticity rather than a way to directly exploit it.

In the second school, heteroscedasticity is thought to represent an important information resource rather than an impediment for mining genetics data. Venables et al. [13] pointed out that when samples show a significant difference in their variances, this difference would often be more important than any difference in their means. Variability-controlling quantitative trait loci (vQTLs) [14-21] were reported as genetic variants whose allelic states are associated with phenotypic variability. vQTLs show evidence of potential high-order interactions, including networks with other genes (gene-gene interactions) or environmental factors (gene-environment interactions), and may also implicate an increased variation over time among subjects in a given genotype group. Gene expression level is considered a quantitative phenotype. Expression variability QTLs (evQTLs) [22] were reported as loci whose allelic states are associated with variances of gene expression. Formation of evQTLs would be due to gene-gene interactions. The double generalized linear model (DGLM) [23] and the hierarchical generalized linear model (HGLM) [24-26] have been proposed to incorporate mean and dispersion components of testing variants while adjusting for covariates. The DGLM is an extension of a classical heteroscedastic regression model [27], incorporating heteroscedasticity within the dispersion component. The HDGLM is a direct extension of the DGLM to allow both fixed and random effects. Caution should be used to avoid misinterpretation of detected heteroscedasticity [19,28]. Due to various latent predictors, detected heteroscedasticity may not necessarily indicate strong gene-gene or gene-environment interactions.

The aforesaid association methods were designed virtually for genetics data from homogeneous populations. Admixture mapping was thought to have come of age [29], but before this editorial no publication had addressed heteroscedasticity in the admixture mapping regime. We argue that exploiting ancestry-driven heteroscedasticity will have significant relevance to admixture mapping. The genomes of admixed subjects derive from $1 + k$ ($k \ge 1$) distinct ancestral populations and thus are mosaics of segments (aka admixture blocks) with various ancestries (ethnic origins). For example, genomes of African Americans are formed from recent admixture of West African and European ancestries ($k = 1$). There are three LD sources in admixed genomes [30]. Admixture LD (ALD) occurs between the loci within each admixture block due to coherence in local ancestries. Background LD (BLD) is the traditional LD inherited from their homogeneous ancestral populations. Mixture LD (MLD) occurs among loci across the entire mosaic genomes due to variation in global ancestries. ALD has long been exploited for identifying causal admixture blocks (CABs) that harbor causal alleles with distinct ancestral frequencies [29,31-33].

According to our and others' genetic studies in admixed populations [34-37], we depict the generic causal network between a trait and its causal factors within and outside of a given CAB (Figure 2). For each admixed subject $i$, local ancestries $A_i = (A_{i1}, \ldots, A_{ik})$ are the same for all loci within the CAB, where $A_{ij} \in \{0, 1, 2\}$ stands for the number of marker-wise alleles from the $j$th ancestral population ($j = 1, \ldots, k$); and global ancestries $Q_i = (Q_{i1}, \ldots, Q_{ik})$ are the averages of local ancestries across the entire genome. C (G) is the set of all causal (neutral) loci within the CAB. X is the set of all the other causal factors, e.g., other causal loci across the genome, environmental factors, and latent pleiotropic traits. (Q, A) can be associated with some of all other causal factors (X), which directly or indirectly impact the distribution of Y. Let $Z_i = (Z_{i1}, Z_{i2}, \ldots)$ stand for observed covariates. The Armitage trend test (ATT) with correction for global ancestry Q and observed covariates Z is adopted in standard admixture mapping [38]. To be specific, a conventional linear regression model (Box 1) is employed to relate the phenotype to $A_i$, $Q_i$ and $Z_i$ (eq. 1) and to test $H_0: \beta_3 = 0$ (eq. 4), assuming homoscedasticity (eqs. 2 & 3). Undoubtedly, this method is underpowered because it neglects the heteroscedasticity due to local ancestry by gene interactions and local ancestry by environment interactions.

Figure 2 Schematic diagram of admixture mapping. Within a CAB, all loci are in ALD, and a specific causal locus in C can also be in BLD with its immediate neighbor loci. Local ancestries A of the block directly impact the genotypic distributions of all the block-wide loci and the distribution of global ancestries Q; and (Q, A) can be associated with some of all other causal factors (X), which directly or indirectly impact the distribution of Y.

To illustrate local ancestry driven heteroscedasticity, we analyzed 1330 unrelated African American genomes from the SAGE dataset. For each subject, we inferred window-wise European ancestries of adjacent (but non-overlapping) windows on chromosome 1, using the MULTIMIX package [39]. Each window included 100 SNPs, as recommended by the package documentation. Then, we fitted the DGLM [23] of drinking symptoms on age, gender, and the first 10 global PCs as global ancestry surrogates, and computed the calibrated residuals of drinking symptoms. Next, at each window, we divided the subjects into three groups according to their window-wise European ancestries and inspected local ancestry driven heteroscedasticity by Levene's test [40]. We implemented such a two-stage procedure because the existing dglm function [41] did not converge when jointly modeling both local ancestries and the aforesaid predictors. A couple of clear peaks in $-\log_{10}(P)$ emerged (Figure 3), where the signals of heteroscedasticity appeared much stronger than those of the corresponding mean effects. In this analysis, the two pieces of information track each other consistently. Namely, the local ancestries at CABs simultaneously impact the conditional phenotypic mean and the conditional phenotypic variances for the given predictors. By this example, heteroscedastic admixture mapping appears to be a novel and useful complementary tool for localizing genetic causal variants in the era of deep sequencing.

Figure 3 Mean and dispersion tests for alcohol dependence. To exploit ancestry-driven heteroscedasticity, we inferred window-wise European ancestries of 1330 unrelated African American genomes at adjacent (but non-overlapping) windows on chromosome 1, using the MULTIMIX package [39]. Each window included 100 SNPs, as suggested by the package documentation. We then inspected window-wise ancestry-driven heteroscedasticity of calibrated drinking symptom residuals after adjusting for age, gender, and the first 10 global PCs as global ancestry surrogates by the DGLM algorithm [23].

In a broad sense, a genetic variant that alters arbitrary moments of the distribution of a study phenotype can be called a genetic causal variant for that phenotype. Due to the complexity of biological mechanisms, the conditional distribution of a study phenotype for given predictors cannot be completely determined by the conditional phenotypic mean alone. The conditional phenotypic distribution, in theory, can be completely determined by all the conditional moments of the phenotype for the given predictors. Currently, the vast majority of prominent sequence association tests [5-12] merely exploit the effects of testing variants on conditional phenotypic means. These methods would be invalid and underpowered in the presence of heteroscedasticity due to global genetic and non-genetic covariates, genetic causal variants, their ancestries, and mutual interactions between genetic variants, local ancestries and environmental covariates. It would be necessary to calibrate global covariates driven heteroscedasticity for validating association analyses; and power could be improved by jointly exploiting the effects of genetic variants and the aforesaid interactions on both phenotypic means and phenotypic variances. Novel sequence association methods should fulfill the two tasks simultaneously. In particular, we anticipate more significant power gains from the integration of loci set driven heteroscedasticity than from single marker driven heteroscedasticity. Further, it may be instructive to investigate how a set of genetic variants alters the entire conditional phenotypic distribution.

REFERENCES

1. Reich DE, Cargill M, Bolk S, Ireland J, Sabeti PC, Richter DJ, et al. Linkage disequilibrium in the human genome. Nature. 2001; 411: 199-204.
2. Schork NJ, Nath SK, Fallin D, Chakravarti A. Linkage disequilibrium analysis of biallelic DNA markers, human quantitative trait loci, and threshold-defined case and control subjects. Am J Hum Genet. 2000; 67: 1208-1218.
3. Han F, Pan W. A data-adaptive sum test for disease association with multiple common or rare variants. Hum Hered. 2010; 70: 42-54.
4. Li B, Leal SM. Methods for detecting associations with rare variants for common diseases: application to analysis of sequence data. Am J Hum Genet. 2008; 83: 311-321.
5. Li B, Leal SM. Discovery of rare variants via sequencing: implications for the design of complex trait association studies. PLoS Genet. 2009; 5.
6. Madsen BE, Browning SR. A groupwise association test for rare mutations using a weighted sum statistic. PLoS Genet. 2009; 5.
7. Morgenthaler S, Thilly WG. A strategy to discover genes that carry multi-allelic or mono-allelic risk for common diseases: a cohort allelic sums test (CAST). Mutation Research/Fundamental and Molecular Mechanisms of Mutagenesis. 2007; 615: 28-56.
8. Morris AP, Zeggini E. An evaluation of statistical approaches to rare variant analysis in genetic association studies. Genet Epidemiol. 2010; 34: 188-193.
9. Price AL, Kryukov GV, de Bakker PI, Purcell SM, Staples J, Wei LJ, et al. Pooled association tests for rare variants in exon-resequencing studies. Am J Hum Genet. 2010; 86: 832-838.
10. Wu MC, Lee S, Cai T, Li Y, Boehnke M, Lin X. Rare-variant association testing for sequencing data with the sequence kernel association test. Am J Hum Genet. 2011; 89: 82-93.
11. White H. A heteroskedasticity-consistent covariance matrix estimator and a direct test for heteroskedasticity. Econometrica. 1980; 48: 817-838.
12. Sun X, Elston R, Morris N, Zhu X. What Is the Significance of Difference in Phenotypic Variability across SNP Genotypes? Am J Hum Genet. 2013; 93: 390-397.
13. Venables WN. Exegeses on linear models. 1998.
14. Deng Q, Paré G. A fast algorithm to optimize SNP prioritization for gene-gene and gene-environment interactions. Genet Epidemiol. 2011; 35: 729-738.
15. Fraser HB, Schadt EE. The quantitative genetics of phenotypic robustness. PLoS One. 2010; 5.
16. Perry GM, Nehrke KW, Bushinsky DA, Reid R, Lewandowski KL, Hueber P, et al. Sex modifies genetic effects on residual variance in urinary calcium excretion in rat (Rattus norvegicus). Genetics. 2012; 191: 1003-1013.
17. Rönnegård L, Valdar W. Detecting major genetic loci controlling phenotypic variability in experimental crosses. Genetics. 2011; 188: 435-447.
18. Shen X, Pettersson M, Rönnegård L, Carlborg Ö. Inheritance beyond plain heritability: variance-controlling genes in Arabidopsis thaliana. PLoS Genet. 2012; 8.
19. Struchalin MV, Dehghan A, Witteman JC, van Duijn C, Aulchenko YS. Variance heterogeneity analysis for detection of potentially interacting genetic loci: method and its limitations. BMC Genet. 2010; 11.
20. Visscher PM, Posthuma D. Statistical power to detect genetic loci affecting environmental sensitivity. Behav Genet. 2010; 40: 728-733.
21. Yang Y, Christensen OF, Sorensen D. Use of genomic models to study genetic control of environmental variance. Genet Res. 2011; 93: 33-46.
22. Hulse AM, Cai JJ. Genetic variants contribute to gene expression variability in humans. Genetics. 2013; 193: 95-108.
23. Smyth GK. Generalized linear models with varying dispersion. Journal of the Royal Statistical Society. Series B (Methodological). 1989; 51: 47-60.
24. Rönnegård L, Shen X, Alam M. hglm: A Package for Fitting Hierarchical Generalized Linear Models. R Journal. 2010; 2.
25. Lee Y, Nelder JA. Hierarchical generalized linear models. Journal of the Royal Statistical Society. Series B (Methodological). 1996; 619-678.
26. Lee Y, Nelder JA, Pawitan Y. Generalized linear models with random effects: unified analysis via H-likelihood. CRC Press. 2006.
27. Bickel PJ. Using residuals robustly I: Tests for heteroscedasticity, nonlinearity. The Annals of Statistics. 1978; 266-291.
28. Paré G, Cook NR, Ridker PM, Chasman DI. On the use of variance per genotype as a tool to identify quantitative trait interaction effects: a report from the Women's Genome Health Study. PLoS Genet. 2010; 6.
29. Winkler CA, Nelson GW, Smith MW. Admixture mapping comes of age. Annu Rev Genomics Hum Genet. 2010; 11: 65-89.
30. Falush D, Stephens M, Pritchard JK. Inference of population structure using multilocus genotype data: linked loci and correlated allele frequencies. Genetics. 2003; 164: 1567-1587.
31. Zhu X, Tang H, Risch N. Admixture mapping and the role of population structure for localizing disease genes. Adv Genet. 2008; 60: 547-569.
32. Rife DC. Populations of hybrid origin as source material for the detection of linkage. Am J Hum Genet. 1954; 6: 26-33.
33. Darvasi A, Shifman S. The beauty of admixture. Nat Genet. 2005; 37: 118-119.
34. Liu J, Lewinger JP, Gilliland FD, Gauderman WJ, Conti DV. Confounding and heterogeneity in genetic association studies with admixed populations. Am J Epidemiol. 2013; 177: 351-360.
35. Qin H, Morris N, Kang SJ, Li M, Tayo B, Lyon H, et al. Interrogating local population structure for fine mapping in genome-wide association studies. Bioinformatics. 2010; 26: 2961-2968.
36. Qin H, Zhu X. Allowing for Population Stratification in Association Analysis. Statistical Human Genetics. Springer. 2012; 399-409.
37. Wang X, Zhu X, Qin H, Cooper RS, Ewens WJ, Li C, et al. Adjustment for local ancestry in genetic association analysis of admixed populations. Bioinformatics. 2011; 27: 670-677.
38. Pasaniuc B, Zaitlen N, Lettre G, Chen GK, Tandon A, Kao WL, et al. Enhanced statistical tests for GWAS in admixed populations: assessment using African Americans from CARe and a Breast Cancer Consortium. PLoS Genet. 2011; 7.
39. Churchhouse C, Marchini J. Multiway admixture deconvolution using phased or unphased ancestral panels. Genetic Epidemiology. 2013; 37: 1-12.
40. Levene H. Robust tests for equality of variances. Contributions to probability and statistics: Essays in honor of Harold Hotelling. 1960; 2: 278-292.
41. Dunn PK, Smyth GK, Dunn MP. The dglm Package. 2006.
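The warning attributed to White [11] in the editorial, that tests assuming homoscedasticity can be invalidated when the error variance depends on genotype, can be illustrated with a small simulation. The following is a minimal sketch under stated assumptions, not the authors' analysis: the genotype counts, the variance function (error SD equal to 1 + 0.5 G) and the helper name `slope_standard_errors` are all hypothetical, and the robust estimate shown is the basic HC0 sandwich form.

```python
import numpy as np

def slope_standard_errors(g, y):
    """Return (beta1, classical_se, robust_se) for the regression y = b0 + b1*g + e."""
    X = np.column_stack([np.ones_like(g, dtype=float), g.astype(float)])
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    xtx_inv = np.linalg.inv(X.T @ X)
    # Classical covariance: one common error variance for every genotype.
    sigma2 = resid @ resid / (n - p)
    cov_classical = sigma2 * xtx_inv
    # HC0 sandwich covariance: each squared residual keeps its own weight.
    meat = X.T @ (X * resid[:, None] ** 2)
    cov_robust = xtx_inv @ meat @ xtx_inv
    return beta[1], np.sqrt(cov_classical[1, 1]), np.sqrt(cov_robust[1, 1])

rng = np.random.default_rng(2016)
# Hypothetical genotype counts for a SNP with minor allele frequency 0.2.
g = np.repeat([0, 1, 2], [1280, 640, 80])
# Simulated null phenotype: no mean effect, but the error SD grows with G.
y = rng.normal(loc=0.0, scale=1.0 + 0.5 * g)
beta1, se_classical, se_robust = slope_standard_errors(g, y)
print(f"beta1={beta1:.3f}  classical SE={se_classical:.3f}  robust SE={se_robust:.3f}")
```

In this setup the rare homozygous group carries both the largest leverage and the largest error variance, so the classical standard error tends to understate the uncertainty of the slope, which is one way a homoscedastic test can become anti-conservative.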

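The first school discussed in the editorial removes heteroscedasticity by transforming the phenotype. As a hedged illustration, the sketch below simulates a phenotype whose SD is proportional to its genotype-specific mean (a special case of the variance being a quadratic function of the mean) and shows that a log transformation, the lambda = 0 member of the Box-Cox family, roughly equalizes the three genotype variances. The group means, noise level and helper name `variance_ratio` are assumptions made for the example; this is not the specific transformation derived by Sun et al. [12].

```python
import numpy as np

rng = np.random.default_rng(12)
n_per_group = 500
# Hypothetical genotype-specific means; the noise is multiplicative,
# so the raw-scale SD is proportional to the mean.
group_means = {0: 1.0, 1: 2.0, 2: 4.0}
raw = {
    g: mu * np.exp(rng.normal(0.0, 0.3, n_per_group))
    for g, mu in group_means.items()
}

def variance_ratio(groups):
    """Largest over smallest sample variance across the genotype groups."""
    variances = [np.var(values, ddof=1) for values in groups.values()]
    return max(variances) / min(variances)

# Log transformation turns the multiplicative noise into additive noise.
logged = {g: np.log(values) for g, values in raw.items()}
ratio_raw = variance_ratio(raw)
ratio_log = variance_ratio(logged)
print(f"variance ratio, raw scale: {ratio_raw:.2f}")
print(f"variance ratio, log scale: {ratio_log:.2f}")
```

The transformation works here because a single marker has only three genotype groups and one simple mean-variance relationship; as the editorial notes, no such simple choice is generally available for set-driven heteroscedasticity.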

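The two-stage procedure described for the alcohol dependence illustration (residualize the phenotype on covariates, then test residual variances across three local-ancestry groups with Levene's test [40]) can be sketched on simulated data. Everything below is an assumption made for illustration: the data are simulated rather than taken from SAGE, only two ancestry PCs are used, and ordinary least squares stands in for the DGLM fit of stage one. `scipy.stats.levene` and `scipy.stats.f_oneway` are existing SciPy functions; `center="mean"` selects Levene's original form of the test.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1330)
n = 1330
# Simulated stand-ins for the covariates: age, gender and two ancestry PCs.
covariates = np.column_stack([
    rng.normal(40.0, 10.0, n),
    rng.integers(0, 2, n),
    rng.normal(size=(n, 2)),
])
# Hypothetical local ancestry at one window: 0, 1 or 2 European-origin alleles.
local_ancestry = rng.binomial(2, 0.2, n)
# Phenotype with a covariate-driven mean and an ancestry-driven error SD.
phenotype = (
    0.02 * covariates[:, 0]
    + 0.3 * covariates[:, 1]
    + rng.normal(0.0, 1.0 + 0.5 * local_ancestry)
)

# Stage 1: residualize the phenotype on the covariates (OLS stand-in for the DGLM).
X = np.column_stack([np.ones(n), covariates])
beta, *_ = np.linalg.lstsq(X, phenotype, rcond=None)
residuals = phenotype - X @ beta

# Stage 2: Levene's test for equal residual variance across ancestry groups.
groups = [residuals[local_ancestry == a] for a in (0, 1, 2)]
levene_stat, levene_p = stats.levene(*groups, center="mean")
# Mean test for comparison: one-way ANOVA on the same groups.
anova_stat, anova_p = stats.f_oneway(*groups)
print(f"-log10 P (dispersion) = {-np.log10(levene_p):.1f}")
print(f"-log10 P (mean)       = {-np.log10(anova_p):.1f}")
```

Repeating stage two at every window along a chromosome would give the kind of paired mean and dispersion profiles that Figure 3 summarizes; the simulation here has an ancestry effect on the variance only, so it does not reproduce the tracking of mean and dispersion signals reported in the editorial.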