Hodonsky et al. BMC Genomics (2020) 21:228 https://doi.org/10.1186/s12864-020-6626-9

RESEARCH ARTICLE Open Access Ancestry-specific associations identified in genome-wide combined-phenotype study of traits emphasize benefits of diversity in genomics Chani J. Hodonsky1,2* , Antoine R. Baldassari1, Stephanie A. Bien3, Laura M. Raffield4, Heather M. Highland1, Colleen M. Sitlani5, Genevieve L. Wojcik6, Ran Tao7, Marielisa Graff1, Weihong Tang8, Bharat Thyagarajan8, Steve Buyske9, Myriam Fornage10, Lucia A. Hindorff11, Yun Li1, Danyu Lin1, Alex P. Reiner3,12, Kari E. North1,4, Ruth J. F. Loos13, Charles Kooperberg12 and Christy L. Avery1

Abstract Background: Quantitative red blood cell (RBC) traits are highly polygenic clinically relevant traits, with approximately 500 reported GWAS loci. The majority of RBC trait GWAS have been performed in European- or East Asian-ancestry populations, despite evidence that rare or ancestry-specific variation contributes substantially to RBC trait heritability. Recently developed combined-phenotype methods which leverage genetic trait correlation to improve statistical power have not yet been applied to these traits. Here we leveraged correlation of seven quantitative RBC traits in performing a combined-phenotype analysis in a multi-ethnic study population. Results: We used the adaptive sum of powered scores (aSPU) test to assess combined-phenotype associations between ~ 21 million SNPs and seven RBC traits in a multi-ethnic population (maximum n = 67,885 participants; 24% African American, 30% Hispanic/Latino, and 43% European American; 76% female). Thirty-nine loci in our multi- ethnic population contained at least one significant association signal (p < 5E-9), with lead SNPs at nine loci significantly associated with three or more RBC traits. A majority of the lead SNPs were common (MAF > 5%) across all ancestral populations. Nineteen additional independent association signals were identified at seven known loci (HFE, KIT, HBS1L/MYB, CITED2/FILNC1, ABO, HBA1/2, and PLIN4/5). For example, the HBA1/2 locus contained 14 conditionally independent association signals, 11 of which were previously unreported and are specific to African and Amerindian ancestries. One variant in this region was common in all ancestries, but exhibited a narrower LD block in African Americans than European Americans or Hispanics/Latinos. GTEx eQTL analysis of all independent lead SNPs yielded 31 significant associations in relevant tissues, over half of which were not at the immediately proximal to the lead SNP. (Continued on next page)

* Correspondence: [email protected] 1University of North Carolina Gillings School of Public Health, 135 Dauer Dr, Chapel Hill, NC 27599, USA 2University of Virginia Center for Public Health Genomics, 1355 Lee St, Charlottesville, VA 22908, USA Full list of author information is available at the end of the article

© The Author(s). 2020 Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data. Hodonsky et al. BMC Genomics (2020) 21:228 Page 2 of 14

(Continued from previous page) Conclusion: This work identified seven loci containing multiple independent association signals for RBC traits using a combined-phenotype approach, which may improve discovery in genetically correlated traits. Highly complex genetic architecture at the HBA1/2 locus was only revealed by the inclusion of African Americans and Hispanics/ Latinos, underscoring the continued importance of expanding large GWAS to include ancestrally diverse populations. Keywords: Blood cell traits, Combined-phenotype analysis, Pleiotropy, Diversity, Multi-ethnic, GWAS

Background ancestrally diverse Population Architecture using Genomics In the average adult, 200 billion red blood cells (RBCs) are and Epidemiology (PAGE) study [46]. Our findings reinforce generated daily from hematopoietic stem cells in the bone the necessity of incorporating multi-ethnic study populations marrow. The most commonly assessed traits for mature in genomics in order to accurately characterize RBC trait loci RBCs are hematocrit (HCT), hemoglobin concentration and encourage equitable application of the results to transla- (HGB), mean corpuscular hemoglobin (MCH), MCH con- tional work [39]. The complexity of association signals at loci centration (MCHC), mean corpuscular volume (MCV), previously characterized in European- and East Asian- RBC count (RBCC), and red cell distribution width (RDW); ancestry populations also demonstrates improved power to together, these traits are used to characterize RBC develop- perform conditional analysis using a combined-phenotype ment and function, diagnose anemic disorders, and identify model [47]. risk factors for complex chronic diseases [1–6]. RBC traits also are moderately to highly heritable, making these com- Results plex quantitative traits excellent candidates for genomic in- The number of participants with both phenotype and terrogation [7–9]. Improved characterization of RBC genotype data ranged from 33,549 (RDW) to 67,885 molecular pathways has benefitted both disease diagnosis (HCT, see Methods, Tables S2 &S3). Seventy-eight per- and pharmaceutical development, as has been demon- cent of participants were female and participants were strated by recent successes in a BCL11A-silencing gene on average 57 years old at time of blood collection therapy clinical trial for individuals with sickle cell disease (Table S4). Self-reported race/ethnicity in the total study (SCD) [10, 11]. population was approximately 20% African American, Genetic association studies have reported over 500 in- 30% Hispanic/Latino, and 40% European American dependent loci for RBC traits [12–31]. However, several (Table S3). research gaps remain which may be addressed via re- cently developed methods and broadly representative Combined-phenotype analyses study populations. First, previously published RBC trait Approximately 21 M SNPs met our inclusion criteria and genome-wide association study (GWAS) populations were evaluated in our primary analysis, a combined- – have mostly been ancestrally homogeneous [31 39]. phenotype multi-ethnic meta-analysis of seven RBC traits. Utilization of diverse study populations can improve SNP associations with the combined phenotype multi- identification of rare or ancestry-specific variants located ethnic meta-analysis exceeded genome-wide significance in biological pathways that affect phenotypes in global at 39 loci (p <5E-09,FiguresS1,S2), all of which were populations and, when summary data are made publicly identified previously. Lead SNPs at nine loci (KIT, HFE, available, enable construction of broadly applicable poly- HBS1L/MYB, IKZF1, TFR2, HBB, HBA1/2, GCDH, and genic risk scores [40]. Relatedly, gaps between estimated TMPRSS6) were associated with three or more RBC traits heritability and the proportion of variance explained by at genome-wide significance (Tables 1,S5A). HCT, HGB, GWAS findings suggest that additional associations re- and MCHC exhibited genome-wide-significant associa- main to be identified, including rare variants and inde- tions at the fewest loci (eleven, ten, and six, respectively), pendent secondary associations at known loci that are whereas MCH and MCV had the most (twenty and both more likely to be ancestrally specific [12, 41, 42]. twenty-one, respectively, Fig. 1a, Table S5A). Estimated Finally, RBC traits exhibit modest to high correlation, partial correlations by RBC trait pair ranged from HCT- and several dozen loci have been reported for two or MCHC (partial correlation ρ = − 0.02) to HCT-HGB (ρ = more RBC traits, although few studies have leveraged 0.94, Fig. 1b). Consistent with other GWAS of quantitative this shared genetic architecture to increase statistical complex traits, effect size was inversely correlated to allele – power to map novel RBC trait loci [12, 20, 26, 43 45]. frequency across all phenotypes (Fig. 1c). In this work, we examined the individual and shared gen- Trait-specific directions of effect were largely consistent etic architecture of seven RBC traits in participants of the with pairwise correlations. Among 58 independent Hodonsky Table 1 RBC trait loci with evidence of multiple independent signals among PAGE study participants Signal Chr:pos Ref/ CAFa p values Alt a

Multi-ethnic RBC trait-specific Combined phenotype by race/ethnicity Genomics BMC al. et AA HL EU HCT HGB MCH MCHC MCV RBCC RDW AA HL EU N = 67,885 N = 67,870 N = 41,317 N = 67,856 N = 41,276 N = 41,310 N = 33,549 N = 16,802 N = 20,697 N = 29,513 HFE locus rs2032451 1 6:26092170 T/G 0.96 0.88 0.85 4.0E-16 1.4E-30 8.2E-38 1.8E-22 2.3E-27 0.01 3.4E-25 2.0E-4 2.3E-3 1.0E-11 rs1800562 2 6:26093141 A/G 0.98 0.98 0.93 1.4E-4 2.4E-5 7.7E-30 3.1E-3 1.2E-3 0.78 0.05 1.0E-5 1.0E-10 1.0E-11 (2020)21:228 CCND3 locus rs1410492 1 6:41907855 C/G 0.94 0.86 0.75 0.24 0.21 1.0E-14 0.59 3.5E-19 1.5E-14 1.9E-6 0.04 1.4E-7 1.0E-11 rs11964516 2 6:41860252 T/C 0.15 0.16 0.17 0.04 0.01 5.6E-13 0.12 3.0E-12 8.8E-4 0.03 0.02 1.6E-6 1.0E-11 HBS1L/MYB locus rs35786788 1 6:135419042 A/G 0.92 0.85 0.74 8.8E-20 7.0E-11 3.0E-66 6.2E-07 1.1E-59 3.3E-60 1.6E-16 8.0E-7 1.0E-11 1.0E-11 rs12664956 2 6:135384188 T/C 0.22 0.26 0.37 2.6E-4 2.3E-3 1.1E-14 7.9E-3 2.4E-12 2.0E-9 0.52 0.03 8.1E-5 1.0E-6 CITED2 locus rs590856 1 6:139844429 A/G 0.63 0.41 0.45 1.2E-4 2.1E-4 7.7E-16 0.44 1.4E-23 2.2E-8 0.79 1.3E-4 1E-11 5.1E-7 rs607203 2 6:139841653 T/C 0.79 0.93 0.96 0.02 0.04 7.9E-11 0.12 2.1E-13 8.8E-5 0.13 5.3E-3 6.2E-6 2.9E-7 ABO locus rs2519093 1 9:136141870 T/C 0.89 0.85 0.80 5.7E-16 8.7E-18 0.32 0.05 0.85 1.1E-7 0.09 2.1E-4 1.1E-8 1.0E-8 rs10901252 2 9:136128000 C/G 0.84 0.93 0.92 0.02 3.4E-4 4.2E-5 5.5E-4 7.9E-8 8.0E-8 0.82 5.3E-5 2.4E-4 5.0E-6 HBA locus rs9924561 1 16:314780 G/T 0.91 0.99 – 1.4E-4 3.6E-26 5.8E-158 9.7E-87 2.4E-135 8.6E-49 4.9E-6 1.0E-11 –– rs76613236 2 16:230724 C/G 0.98 0.995 – 0.79 0.03 6.2E-20 2.0E-5 3.0E-18 5.3E-14 2.7E-5 1.0E-11 –– rs142154093 3 16:366048 C/G 0.98 0.99 – 0.02 4.0E-7 1.3E-32 2.2E-12 4.9E-33 6.5E-13 2.1E-5 1.0E-11 –– rs186066503 4 16:405483 T/C – 0.994 – 0.34 8.5E-3 3.6E-30 4.4E-18 2.8E-19 2.7E-15 3.5E-6 – 1.0E-11 – rs530159671 5 16:250184 A/G – 0.991 0.997 4.5E-3 7.5E-8 1.7E-24 7.2E-12 2.0E-16 2.3E-7 2.2E-4 – 1.0E-11 – rs8058016 6 16:228786 A/C 0.99 0.992 – 2.2E-3 2.8E-5 1.8E-22 1.7E-5 3.5E-21 1.0E-4 8.0E-3 1E-11 2.6E-8 – rs60616598 7 16:297264 A/G 0.83 0.96 – 0.04 1.0E-5 8.1E-23 2.2E-11 1.8E-16 9.7E-9 2.3E-9 1.0E-11 2.8E-9 – rs145752042 8 16:267208 A/G 0.993 ––1.0E-4 2.4E-7 1.2E-18 1.2E-4 3.1E-19 3.0E-4 0.04 3.2E-10 –– rs145546625 9 16:220583 T/C 0.98 0.93 – 0.19 0.08 2.5E-14 0.17 8.8E-15 1.4E-5 5.9E-3 0.22 1.0E-11 – rs115415087 10 16:205132 T/C 0.018 0.004 – 0.80 0.24 1.2E-12 1.2E-4 3.5E-14 3.2E-10 0.17 1.4E-10 1.3E-3 – rs61743947 11 16:240000 T/C 0.995 ––0.04 7.0E-5 2.4E-13 1.4E-11 8.7E-12 0.13 2.3E-6 3.1E-8 –– ae3o 14 of 3 Page rs60125383 12 16:176446 A/T 0.43 0.62 0.55 0.11 0.001 3.1E-13 2.1E-6 4.5E-8 1.3E-4 0.91 6.1E-7 0.01 2.0E-5 rs55932218 13 16:221151 T/C 0.05 0.01 – 0.52 0.03 6.3E-10 1.2E-4 2.0E-8 2.1E-3 3.2E-5 5.0E-8 0.01 – rs8051004 14 16:198835 T/C 0.10 0.05 0.02 0.81 0.04 7.2E-11 3.2E-8 8.1E-8 0.06 0.04 0.01 5.1E-7 0.04 Table 1 RBC trait loci with evidence of multiple independent signals among PAGE study participants (Continued) Hodonsky Signal Chr:pos Ref/ CAFa p values Alt Multi-ethnic RBC trait-specific Combined phenotype by race/ethnicitya ta.BCGenomics BMC al. et AA HL EU HCT HGB MCH MCHC MCV RBCC RDW AA HL EU N = 67,885 N = 67,870 N = 41,317 N = 67,856 N = 41,276 N = 41,310 N = 33,549 N = 16,802 N = 20,697 N = 29,513 PLIN4/PLIN5 locus rs919797 1 19:4498157 A/G 0.68 0.49 0.46 5.1E-2 5.5E-4 1.9E-11 1.8E-5 6.5E-9 4.4E-1 8.8E-1 0.35 9.5E-3 1.1E-8 rs12459922 2 19:4455862 A/G 0.12 0.25 0.26 0.003 0.003 4.3E-8 0.34 1.4E-9 0.09 0.11 0.17 4.9E-5 1.0E-4

Bold font for combined-phenotype analysis indicates that the index SNP also had the lowest reported p-value for that particular trait. Variants not meeting effective heterozygosity criterion of 35 excluded. AA African (2020)21:228 American, HL Hispanic/Latino, EU European American. aRestricted to populations with > 1000 participants ae4o 14 of 4 Page Hodonsky et al. BMC Genomics (2020) 21:228 Page 5 of 14

Fig. 1 Identification and characterization of 58 independent lead variants in 39 loci in a multi-ethnic study population. a Lead and conditionally independent SNPs from combined-phenotype analysis of total study population show shared genetic architecture directionally consistent with correlation structure. Colored circles to the right of figure correspond to trait-specific associations. X-axis: rsid (bottom) sorted by (top) and position; y-axis: significance of association and direction of effect, represented by t-value (scaled to a maximum of t = |15|). Size of circles is exponentially proportional to effect size standardized to trait means (3Z) to demonstrate differences in average effect size at lead SNPs by trait. Dashed gray lines correspond to genome-wide-significance threshold of a = 5E-09. b RBC trait pair partial correlations among MEGA- genotyped participants adjusted for linear regression model covariates (n = 29,090 for HCT, HGB, and MCHC measurements; n = 22,330 for MCH, MCV, and RBCC; n = 19,573 for RDW). c. Low-frequency and rare alleles exhibit larger magnitude of effect across RBC traits in the total multi- ethnic study population. X-axis: minor allele frequency; y-axis: effect size standardized to trait mean (|Z|). Filled circles represent variants present in all ancestry sub-populations; open circles are monomorphic in one or more ancestries association signals identified via conditional analysis, 64% evidence of association was most significant in European (n = 37) exceeded genome-wide significance for the Americans at HFE and HBS1L/MYB loci, whereas His- combined-phenotype lead SNP in two or more traits. panics/Latinos had the most significant association at both When comparing genome-wide significant associations CITED2 lead SNPs. In two instances, known causal vari- for two traits exhibiting a pairwise correlation >|0.2| ants accounted for the entire association signal after con- among these loci, in 93% of instances (119 of 128) the dir- ditioning. At the HFE locus, both rs1800562 (HFE ection of effect matched the direction of trait correlation p.C282Y) and rs1799945 (HFE p.H63D, r2~0.99 with lead (Fig. 1a, b, Tables S5A, S6). Eight of nine trait-pair associa- SNP rs2032451) are known coding hemochromatosis vari- tionswithdirectionsofeffectoppositeofexpectationwere ants and accounted for all significant associations within instances in which MCH or MCV drove the lead SNP as- +/− 3 Mb of the lead SNP [48]. Similarly, rs2519093 and sociation, and HCT or HGB had a different lead SNP in rs10901252 are in moderate to high LD with variants that high LD with the combined-phenotype lead SNP (r2 >0.8 affect RBC traits but also determine an individual’sABO in the combined MEGA-genotyped study population). blood type, and adjusting for these two variants accounted Only one of nine associations was in a trait pair exhibiting for the entire association at this locus. moderate correlation: HGB and RBCC (ρ = 0.68) exhibit- Of note, the HBA1/2 locus demonstrated ancestry spe- ing opposite directions of effect for rs9924561, the lead cificity (i.e., the lead SNP was monomorphic in one or SNP in the HBA1/2 region on . more ancestries) at 11 of 14 conditionally independent SNPs (Fig. 2a, Tables S5B-D). With the exception of Evidence of independent associations at established loci rs60125383 (frequency of the A allele: 0.43 in African We identified 20 independent association signals at Americans, 0.55 in European Americans, 0.62 in His- seven loci (HFE, CCND3, HBS1L/MYB, CITED2, ABO, panics/Latinos), located in a nonsense-mediated-decay HBA1/2,andPLIN4/5, Table 1, Fig. 1a). The majority of transcript for NPRL3, no lead SNP at this locus was lead SNPs were common to all ancestries (MAF > 0.01); common to all ancestries. The LD block for rs60125383 Hodonsky et al. BMC Genomics (2020) 21:228 Page 6 of 14

Fig. 2 Multiple independent associations with MCH demonstrate complex genetic architecture at HBA1/HBA2 locus. All plots: each point

represents one SNP; x-axis: increasing position on chromosome 16 left to right; y-axis: -log10(p-value) of the association with MCH. a Regional association plot of 14 independent associations in unadjusted analysis of multi-ethnic study population (n = 41,317). Large circles represent conditionally independent lead SNPs, labeled by rsid (order of conditioning is shown in Table 1); small colored SNPs represent variants in high LD (r2 > 0.8 in LD in pooled MEGA subpopulation) with the lead SNP of the corresponding color. b-d Locus-Zoom regional association plots of MCH association with rs60125383 (11th round of conditioning, purple diamond) in African Americans on an African American LD background (b n = 8703), Hispanics/Latinos on a Hispanic/Latino LD background (c, n = 17,380), and European Americans on a European LD background (d n = 14,707). SNP correlation with the lead SNP (r2) is colored according to the legend in (b). Annotated Refseq proximal to the lead SNP are shown by position above the X axis contained fewer variants in African Americans (Fig. 2b, no When adjusting for esv3637548 deletion dosage in the SNPs r2 > 0.4) compared to Hispanics/Latinos (Fig. 2c, 10 MEGA-genotyped subgroup, we observed evidence of SNPs r2 > 0.6) and European Americans (Fig. 2d, 13 SNPs both attenuation and strengthening of effect at otherwise r2 >0.6). conditionally independent lead SNPs at the HBA1/2 locus (Table S8). Specifically, eight lead SNPs lost more Sensitivity analyses than two orders of magnitude p-value after conditioning Trait-specific sensitivity analyses identified two previously- on esv3637548; one increased in significance; and five unreported variants exceeded genome-wide significance for remained unchanged. Among the lead SNPs in this a single RBC trait in the univariate analyses, yet did not chromosomal region which remained significant was meet genome-wide significance in the combined pheno- rs145546625, which was previously reported as signifi- type. Rs6573766 was specific to RBCC (p =1.1E-9)andis cant for MCV independent of esv3637548 in a GWAS of common to all ancestries but was poorly captured by earlier HCHS/SOL participants using a different genotyping genotyping arrays and is not represented in 1000 genomes array [28]. All other PAGE lead SNPs in the HBA1/2 re- phase 3 data (Figure S3,TableS7). Rs145548796 was sig- gion either did not pass QC or imputation criteria for nificant for MCV (p = 4.6E-9) and is rare (< 1%) in all popu- the custom array used in that study, or had p > 1E-07 in lations, only meeting the inclusion criteria in the MEGA the primary analysis. pooled sample and one study sub-population (Figure S4, Table S7). Ancestry-specific sensitivity analyses did not un- Generalization of previously reported associations cover any significant association signals that did not achieve Generalization of previously identified association signals genome-wide significance in the overall study population. varied for trait-specific loci (p < 1.07E-4, Tables S9-S11), Hodonsky et al. BMC Genomics (2020) 21:228 Page 7 of 14

ranging from 50 of 143 (35%) for MCHC to 93 of 121 blood cell traits—specifically with regard to the HBA1/2 (77%) for HGB. Ancestry-specific generalization varied locus—have been primarily analyzed in Eurocentric by trait, with the highest proportion of generalization oc- study populations. Given the high polygenicity and com- curring in the European-ancestry sub-population and plexity of quantitative RBC traits, our identification of the lowest occurring in African Americans, which may over a dozen independent association signals suggests a be due to power differences to detect associations by highly-transcribed region with either complementary or ancestry. redundant regulatory mechanisms that may affect mul- tiple genes. Future work could extend our efforts by eQTL function of index SNPs examining other populations in malaria-endemic re- To assess the potential regulatory roles of lead SNPs, we gions, as well as previously identified and highly influen- evaluated cis-eQTL (< 500 kb) associations for all lead tial structural variants, including a previously identified SNPs in GTEx as available [49]. Thirty-three of 51 SNPs 3.7 kb copy number variant, which we were only able to were low-frequency or common (MAF > 1%) in the evaluate as a sensitivity analysis [28]. European-ancestry GTEx population and had available A combined-phenotype method was selected due to its information in whole blood, liver, spleen, and/or thyroid purported ability to increase statistical power to identify tissues. Fourteen SNPs exhibited significant associations novel loci with modest effects across multiple correlated in RBC-relevant tissues; seven SNPs were eQTLs for traits. However, sample sizes of previous RBC trait multiple genes (Table S12). Although approximately 40 GWAS suggest that many loci with modest effects and genes were within 500 kb of each of the chromosome 16 lead SNPs in the low to common allele frequency range lead SNPs, none of the lead SNPs in this region in European or East Asian populations have already been exceeded a MAF > 1% in the GTEx study population and identified. Power was also lacking to detect loci that hence could not be evaluated for cis-eQTLs. might be specific to other race/ethnic groups—although African Americans and Hispanics/Latinos were well- Discussion represented in this study, sample sizes similar to Euro- RBC traits are complex quantitative phenotypes that have pean populations will not be proportionately representa- been broadly examined in GWAS of European- and East tive of genetic diversity, particularly for variants that are Asian-ancestry study populations. Here, we examine the low-frequency or difficult to impute. This observation benefits of identifying and characterizing RBC trait associ- demands an increase in representation of African Ameri- ations in the ancestrally diverse PAGE study population cans and Hispanics/Latinos, as narrower (on average) using a combined-phenotype approach. Although the LD blocks in populations exhibiting ancestral admixture combined-phenotype method we employed did not enable also improve fine-mapping for prioritizing candidate identification of novel loci, ancestral diversity improved variants for functional characterization. A combined- characterization of loci containing both ancestry-specific phenotype method can also improve the interpretability and common variants. The continued underrepresenta- of association signals when one causal SNP per associ- tion of diverse populations in GWAS despite the growing ation signal is assumed. For example, a direction of ef- clinical and public health significance of GWAS-enabled fect inconsistent with the phenotypic correlation of two tools that are ancestry-specific underscores the continued RBC traits is feasible in some anemia states, for which importance of expanding existing RBC trait GWAS of pre- MCV and RDW—despite being negatively correlated in dominantly European and East Asian populations to glo- healthy individuals—may vary widely depending on the bal populations [50–53]. underlying cause [54, 55]. The African-ancestry-specific With regard to regions exhibiting multiple independ- SNP rs9924561 (previously identified for MCH, MCHC, ent significant associations, our results demonstrate al- and MCV) is an example of a variant that unexpectedly lelic heterogeneity at known RBC trait loci, the showed opposite directions of effect for HGB and RBCC characterization of which was enabled by an inclusive (pairwise correlation = 0.68) in our study [28, 30, 56]. study design. Of particular note was our identification of The mechanism driving very strong associations (p < eleven variants specific to African and/or Amerindian 1E-15 in all traits aside from HCT) with this intronic ancestries within the first megabase of chromosome 16. variant remains uncharacterized, likely because it is The chromosome 16 region includes hemoglobin genes not present in European-ancestry populations and HBA1, HBA2, HBM, and HBZ as well as fifty other hence could not be detected in otherwise highly pow- -coding genes that should be examined for plaus- ered studies [12, 31]. The identification of such candi- ible roles in RBC trait biology. Decades of research have date functional variants for multiple traits with the demonstrated selective pressure in this region occurring added context of the phenotypic correlation can pro- over millennia in malaria-endemic regions of the world vide insight for molecular experimentation examining but, as with many other complex quantitative traits, red causal biological mechanisms. Hodonsky et al. BMC Genomics (2020) 21:228 Page 8 of 14

The possibility that combined-phenotype methods Americans and Native Hawaiians are represented in could benefit the study of other correlated polygenic PAGE, but RBC phenotypes were not measured in con- traits still merits further investigation, particularly with tributing studies. South Asian study populations have groups of traits that may overlap in genetic architecture, been included in several previous RBC trait GWAS; Na- but have not been previously examined in concert. Over tive Americans and Pacific Islanders remain underrepre- the past three decades, RBC traits have been associated sented in GWAS of all complex traits [15, 20, 39, 63]. with cardiovascular disease outcomes like heart failure Third, we were unable to evaluate structural variants, and stroke, highlighting the potential for identifying which have traditionally been difficult to impute, and re- novel pleiotropic loci [6, 57–62]. Indeed, combined- calling all structural variants within significant loci was phenotype approaches that examine the shared genetic outside the scope of this work. A sensitivity analysis ac- architecture underlying intermediate phenotypes and counting for the effect of esv3637548 in MEGA-genotyped clinical events may be particularly powerful for out- study participants suggests that further evaluation is required comes like stroke and heart failure, given that pheno- to determine whether true causalvariantsoverlapthepos- typic heterogeneity of these phenotypes has complicated ition of this 3.7 kb structural variant on other ancestral hap- locus identification and characterization. lotypes. However, it is expected that some structural variants Our evaluation of lead SNPs’ effects on expression in will be adequately represented by proxy SNPs, and future RBC-relevant tissues faced known constraints that lim- sequencing-based studies will be able to characterize these ited interpretation and contextualization of identified rare variants. Finally, eQTL datacouldnotbecomprehen- variants. Crucially, the vast majority of publicly available sively interpreted given the limitations of publicly available functional data were collected from European-ancestry databases as described above. It is imperative that these re- individuals, precluding the use of these databases for sources focus their efforts on improving inclusivity over the interpreting potential effects of ancestry-specific or low- next several years to keep abreast of increased representative- frequency SNPs on gene expression. For example, ness in association studies. rs8051004 is one of two less frequent variants that were detected in European-ancestry populations at the HBA1/ Conclusion 2 locus (CAF = 0.02). However, rs8051004 was reported In conclusion, we identified over 50 association signals as “monoallelic” in spleen tissue in GTEx, despite having within 39 loci in a combined-phenotype analysis of seven a 10% allele frequency in PAGE African Americans and RBC traits. We did not observe large improvement in 12 and 11% in the 1000G African and East Asian super- discovery signal detection by using the combined- populations, respectively. The exclusion of populations phenotype methods, although further work is required with African, Amerindian, and Asian ancestry continues to fully test the utility of these approaches. However, our to hamper the potential benefits of these resources. Add- work demonstrates the benefits of diverse study popula- itionally, while the GTEx consortium has made extensive tions for highly polygenic traits, in spite of the fact that efforts to characterize a wide array of tissue types, bone while global populations are increasing in genetic diver- marrow was not included [49]. RBCs enucleate in the sity, genetic research has become less diverse. As gen- bone marrow prior to entering circulation, with no omics tools become more broadly available, our results nuclear transcription and extremely limited translation underscore the critical importance of including diverse occurring in mature RBCs. Therefore, bone marrow is the global populations so the benefits of genomics research only tissue for which eQTL data characterizing the effects can be equitably applied. of genetic variation on gene expression for RBCs directly. As with other genetic association studies, we faced Methods several limitations. First, sample sizes for RBC trait Study population GWAS have ballooned to nearly 200,000 participants The PAGE study comprises ancestrally-diverse study and we were restricted to a smaller study population. populations from United States cohorts and biobanks However, the PAGE study has recently demonstrated evaluating common complex diseases and accompanying that modest-sized studies that are more ancestrally di- risk factors (see online supplement for more informa- verse improve detection of novel and independent sig- tion). This study used data from self-reported African nals compared to simply increasing the number of American, Asian American, European American, His- European-ancestry individuals [56]. Second, while this panic/Latino, and Native American participants from the study did improve on previous studies in terms of repre- Atherosclerosis Risk in Communities Study (ARIC); the sentation from African and American continental ances- Coronary Artery Risk Development in Young Adults tries, we were unable to evaluate associations in several Study (CARDIA); the Hispanic Community Health populations, particularly South Asians, Pacific Islanders, Study/Study of Latinos (HCHC/SOL); the Icahn Mt. Native Americans, and Native Hawaiians. Native Sinai School of Medicine BioME Biobank (BioME); and Hodonsky et al. BMC Genomics (2020) 21:228 Page 9 of 14

the Women’s Health Initiative (WHI, described above). With regard to quality control, studies employed either Our study population comprised sixteen analytic sub- a 90% (ARIC, MOPMAP) or 98% (all other studies) SNP groups which were genotyped and imputed separately. call-rate threshold. A sample call rate of 95% was Fifteen of the sixteen analytic subgroups were identified employed for ARIC and. A 98% rate for MEGA- by study and self-reported race/ethnicity (Tables S2,S3). genotyped participants, with no sample call rate applied The sixteenth subgroup was a pooled sample of self- to remaining studies. Similarly, a 1E-06 HWE p-value reported African American, Asian American, Hispanic/ threshold was employed for ARIC, and a 1E-04 thresh- Latino, Native American, and “Other” MEGA-genotyped old for MEGA-genotyped participants. Additional study- individuals from BioMe, HCHS/SOL, and WHI. Partici- specific genotype QC criteria are described in Table S2. pants were excluded if they had ever been diagnosed All studies were imputed to the 1000 Genomes phase 3 with HIV or leukemia, were pregnant at time of blood reference panel by the PAGE coordinating center after draw, were receiving chemotherapy at time of blood study-specific quality control criteria were applied (Table draw, or had a severe hereditary anemia (primarily S2, 56). We further excluded SNPs on a sub-study- sickle-cell disease, determined by genotype). specific basis which had poor imputation quality (< 0.4) or an effective heterozygosity < 35 (calculated as 2 x RBC trait measurement CAF x (1-CAF) x N x imputation quality, where CAF is RBC traits were measured with hemanalyzers following coded allele frequency and N is sample size). standardized laboratory protocols from blood draws at the earliest available visit (see online supplement) for the Statistical methods three primary (HCT, HGB, and RBCC) and four derived Overall reporting of results (MCH, MCHC, MCV, and RDW) RBC traits (Table S1). Previously-reported SNPs for the seven RBC traits evaluated RBC trait values that exceeded four standard deviations in this study were identified through review of the NHGRI- from the mean of the trait in the overall study population EB GWAS Catalog [65] as of January 1, 2019, supplemented were excluded, mirroring protocols established by prior by a PubMed search. Multi-ethnic combined-phenotype re- GWAS [28, 45]. Pairwise correlation coefficients were cal- sults were presented as the primary findings, employing culated in the MEGA-genotyped analytic subgroup (see Bonferroni correction assuming 10 M independent tests (i.e., below) adjusting for all the covariates used in univariate genome-wide significance refers to paSPU <5E-9).Wede- regression analysis, specifically age at blood draw, sex, fined a locus using physical proximity (+/− 500 kb from the study site or region, and ancestral principal components. lead SNP), and we defined an association signal as the lead (most significant) SNP and proxy SNPs in local LD based Genotyping, quality control, and imputation on conditional independence within ten megabases. Discov- Genotyping methods have been described for each of ery loci were defined as ≥500 kb from and conditionally in- our study sub-populations previously; all imputation of dependent of a variant previously reported to satisfy the genotype data used in this study was performed by the field standard p < 5E-8 for any of the seven RBC traits. PAGE coordinating center [64]. Briefly, genotyping ar- Ancestry-specific and trait-specific analyses were performed rays and quality control measures used were as follows. as sensitivity analyses to improve interpretation of results. Affymetrix Genome-Wide SNP Array 6.0 for Complete summary-level results are available through ARIC, BioMe Mt. Sinai Biobank European Americans, dbGaP (phs000356). CARDIA, and WHI SHARe. The Illumina OmniExpress was used to genotype individuals for all remaining Univariate analysis BioMe Mt. Sinai Biobank participants. WHI GARNET Univariate associations for the seven RBC traits were es- participants were genotyped on the Illumina Human timated assuming an additive genetic model of inherit- Omni1-Quad v1–0 B array; WHI GECCO participants ance and adjusting for linear effects of age at blood on the Illumina 610 K and Cytochip 370 K arrays; WHI draw, sex, study site or region, and ancestral principal HIPFX participants on the Illumina 550 K and 610 K components [66]. The total MEGA-genotyped subgroup arrays; WHI LLS participants on the Illumina was analyzed using generalized estimating equations HumanOmniExpressExome-8v1_A array; WHI MOP- allowing correlated errors for first or second-degree rela- MAP participants on the Affymetrix Gene Titan, Axiom tives, and independent error distributions by self- Genome-Wide Human CEU I Array Plate; and WHI reported ancestry group [67]. Linear regression was im- WHIMS participants on the HumanOmniExpress plemented in SUGEN for the other 15 analytic sub- Exome-8v1_B array. All remaining participants from groups [67]. For each RBC trait, METAL software was BioMe, HCHS/SOL, and WHI were genotyped on the used to perform inverse-variance-weighted meta-analysis Illumina Infinium Expanded Multi-Ethnic Genotyping across all sub-studies [68]. SNP effect heterogeneity was Array (MEGA). measured with the Cochran’s Q test. SNP meta-analysis Hodonsky et al. BMC Genomics (2020) 21:228 Page 10 of 14

p-values were assessed by RBC trait by calculating gen- analyses of univariate summary statistics followed by omic inflation factors (λ) and plotting the expected dis- combined-phenotype analysis were performed within each tribution against observed results. self-reported race/ethnicity using the same methods de- scribed above for the overall study population to identify p Combined-phenotype analyses genome-wide association signals ( <5E-09). We used an adaptive sum of powered scores (aSPU) We also examined whether there was evidence of sig- simulation-based method to perform a combined- nificant trait-specific loci that were not identified in phenotype analysis incorporating univariate results from combined-phenotype analyses. Meta-analyses of each sevenRBCtraitsinsixteenanalyticsubgroupsthatwere univariate RBC trait across all analytic subpopulations, combined using inverse-variance-weighted meta-analysis. as described above, were evaluated for association signals p To evaluate evidence for shared genetic effects across all exceeding genome-wide significance ( < 5E-09). Al- seven RBC traits, we combined meta-analyzed univariate though RBC traits are expected to share genetic under- results with aSPU to generate a combined-phenotype p- pinnings, particularly within pairs of correlated traits, value for each SNP [28, 69]. In comparison with other avail- association signals which were trait-specific in the well- able methods, we chose aSPU because it exhibited low type powered UK BioBank blood trait GWAS suggest that 1 error rate in simulations; accommodated direction of ef- each trait has its own unique suite of associations [12]. fect; and was computationally scalable to the millions of Finally, in an attempt to examine the influence of the SNPs measured using 1000 Genomes Phase 3 imputed data previously identified 3.7 kb structural variant esv3637548 in the HBA1/2 region of chromosome 16, we also ad- [70]. We implemented aSPU using Julia 1.0 to optimize effi- 2 ciency (https://github.com/kaskarn/aspu_julia). justed for esv3637548 dosage (r = 0.86) in the MEGA- aSPU incorporated univariate summary z-scores, cal- genotyped subgroup [28]. This structural variant either culated for each SNP across all 7 traits, to yield a single overlaps or has the potential to affect chromatic accessi- p-value evaluating whether one or more of the traits bility for multiple variants at this locus, but is present as were associated with a given SNP. Briefly, the procedure both a duplication and a deletion. The duplication was estimates Σ, the 7 × 7 correlation of null z-scores across not able to be imputed, and the deletion only met im- univariate results and draws 1011 Monte-Carlo samples putation quality criteria in the MEGA-genotyped study population, hence esv3637548 could not be evaluated from the multivariate N ð0; Σ^Þ distribution. For each SNP 7 within the entire study population in which this variant j, the results for all 7 traits zj1, …, zj7 are used to form a se- γ may be present. To evaluate the potential effect of this quence of sums of powered scores: SPUðγÞ¼z þ … 1 variant on each lead SNP reported as independent þzγ γ … SPU ∞ ∣ S ∣ 7 ,where =0,1, ,8,plus ( ) = max 7 .Each p 11 within our study, unadjusted combined-phenotype - powered score is compared to the distribution of the 10 values were therefore compared to p-values after condi- powered scores calculated using simulated null values tioning on esv3637548. with the same γ to calculate a Monte-Carlo p-value. An 11 overall SNP p-value (paSPU, possible range: [1/(1 + 10 ), 1]), is calculated by comparing the minimum p-value Generalization of previously reported associations to across the sequence of powered scores to the reference PAGE p distribution of minimum -values across the sequence of All SNPs located within 500 kb of a variant previously powered scores computed using the simulated null data. reported for any RBC trait were evaluated for evidence The adaptive aspect of the test lies in the potential for dif- of association in the combined-phenotype analysis as γ ferent values to yield the maximal SPU across SNPs, well as each individual trait analysis. A generalization maintaining power compared to a test with only a single significance threshold of 1.07E-4 was calculated using possible alternative hypothesis. Bonferroni correction for the previous number of one- megabase genomic regions for which one or more Sensitivity analyses genome-wide-significant variants were reported for one Sensitivity analyses were performed for combined-trait or more RBC traits (n = 466, representing 1308 index results by self-reported race/ethnicity among analytic SNPs previously reported for one or more of the seven subgroups with greater than 1000 participants (i.e., re- RBC traits we evaluated). We first reported trait-specific stricted to African Americans, Hispanics/Latinos, and associations—i.e., index variants that have been reported European Americans). Given the number of known by trait. We did not report loci containing a SNP that ancestry-specific variants driving blood trait values, it exceeded genome-wide-significance for the first time in was necessary to ensure that all self-reported race/ethnic one RBC trait but were previously reported for another groups be evaluated individually for associations that trait as discovery associations; therefore, we also used may be undetectable in the larger population. Meta- the aforementioned significance threshold to evaluate Hodonsky et al. BMC Genomics (2020) 21:228 Page 11 of 14

generalization of association signals in each trait across the lead SNP (r2) is colored according to the legend in Figure S4A. Anno- all known loci. tated Refseq genes proximal to the lead SNP are shown by position above the X axis. Identification of conditionally independent association Additional file 2. Twelve supplemental tables supporting findings reported in the main text. Tables cover trait, genotyping, and QC signals description; ancestry- and trait-specific findings for combined-phenotype Iterative conditional analysis was performed to identify lead SNPs; sensitivity analysis of a deletion at the HBA1/2 locus; all independent, genome-wide-significant combined- generalization of previously reported findings to PAGE study populations; and eQTL findings for PAGE lead SNPs in relevant tissue types. phenotype lead SNPs as described above. To avoid iden- tifying SNPs as independent that were in long-range LD, Abbreviations we began by conditioning on the top SNP within ten ARIC: Atherosclerosis Risk in Communities study; aSPU: adaptive sum of megabase windows on each chromosome. To identify in- powered scores; CARDIA: Coronary Artery Risk Development in Young Adults dependent SNPs, linear models were extended to include Study; eQTL: expression quantitative trait locus; GTEx: Genotype Tissue Expression project; GWAS: Genome-wide association study; HCHS/ all PAGE combined-phenotype lead SNPs on shared SOL: Hispanic Community Health Study/Study of Latinos; HCT: Hematocrit; using the same methods described above HGB: Hemoglobin concentration; kb: kilobase; LD: Linkage disequilibrium; for univariate analysis, with an added covariate to in- MCH: Mean corpuscular hemoglobin; MCHC: Mean corpuscular hemoglobin concentration; MCV: Mean corpuscular volume; MEGA: Multi-ethnic clude the dosage information for each participant at genotyping array; PAGE: Population Architecture using Genomics and each lead SNP. Following each round of conditioning, Epidemiology study; RBC: Red blood cell; RBCC: Red blood cell count; aSPU was re-run on conditioned results. Additional RDW: Red cell distribution width; SNP: Simple nucleotide polymorphism; WHI: Women’s Health Initiative study rounds of conditional analyses were performed as an it- erative process until no genome-wide-significant SNPs Acknowledgements remained in the combined phenotype analysis. The contents of this paper are solely the responsibility of the authors and do not necessarily represent the official views of the National Institutes of Health (NIH). The PAGE consortium thanks the staff and participants of all PAGE Publicly available expression quantitative trait studies for their contributions. We thank R. Williams and M. Ginoza for locus (eQTL) analysis providing assistance with program coordination. The complete list of PAGE members can be found at http://www.pagestudy.org. To help prioritize candidate causal gene-variant associa- tions at identified loci, we evaluated all available lead Authors’ contributions SNPs within significant loci in relevant available tissues All contributing authors were given the opportunity to review and comment on the final manuscript and accompanying materials, and have approved (whole blood, liver, spleen, and thyroid) for evidence of submission for publication. Overall project supervision and management: association with gene expression using the Genotype CJH, CLA, KEN, APR, MG, RT. Genotyping and quality control: GLW, SB, MF, Tissue Expression (GTEx) portal [49]. RJFL, CLK. Phenotype harmonization: CJH, SAB, WT, BT. Association and secondary analyses: CJH, ARB, CS, RT, MG. Manuscript preparation: CJH, ARB, SAB, LMR, HMH, CS, MG, LAH, YL, DL, APR, KEN. Supplementary information Supplementary information accompanies this paper at https://doi.org/10. Authors’ information 1186/s12864-020-6626-9. N/A

Additional file 1: Figure S1. Manhattan and Quantile-Quantile plots for Funding individual RBC traits in the total study population. In Manhattan plots, None of the funding bodies described herein played a role in the design of previously reported loci (published index SNP reported p < 5E-08 within the study; collection, analysis, and interpretation of data; or in writing the manuscript. The Population Architecture Using Genomics and Epidemiology 500 kb of PAGE combined-phenotype lead SNP) are shown in purple; pre- viously unreported loci with a PAGE lead SNP p < 5E-09 are shown in (PAGE) program is funded by the National Research green. In Q-Q plots, all (black) p-values and p-values for variants > 500 kb Institute (NHGRI) with co-funding from the National Institute on Minority from a previously reported significant variant for any RBC trait (blue) are Health and Health Disparities (NIMHD). Assistance with data management, data integration, data dissemination, genotype imputation, ancestry decon- both shown. Figure S2. Evidence of genetic associations shared across correlated RBC traits. X-axis: chromosome and position (top) and rsid volution, population genetics, analysis pipelines and general study coordin- (bottom) for each combined-phenotype lead SNP. Y-axis: trait-specific – ation was provided by the PAGE Coordinating Center (NIH U01HG007419). log (p-values), with increased intensity representing higher significance, Genotyping services were provided by the Center for Inherited Disease Re- 10 search (CIDR). The CIDR is fully funded through a federal contract from the for each combined-phenotype lead SNP. P-values scaled to a maximum – NIH to The Johns Hopkins University, contract number HHSN268201200008I. log10 value of 25 for improved interpretation. Figure S3. Locus-Zoom plots of the association between rs6573766 and RBCC in PAGE African Genotype data quality control and quality assurance services were provided Americans on an African American LD background (A), Hispanics/Latinos by the Genetic Analysis Center in the Biostatistics Department of the Univer- sity of Washington, through support provided by the CIDR contract. The data on a Hispanic/Latino LD background (B), and European Americans on a European LD background (C). Each point represents one SNP; x-axis: in- and materials included in this report result from collaboration between the following studies and organizations: ARIC, BioMe Biobank, CARDIA, HCHS/ creasing position on chromosome 14 left to right; y-axis: -log10(p-value) of the association with MCH SNP correlation with the lead SNP (r2) is col- SOL, PAGE Global Reference Panel and WHI. The BioMe Biobank received funding for the PAGE IPM BioMe Biobank study through the National Human ored according to the legend in Figure S3A. Annotated Refseq genes proximal to the lead SNP are shown by position above the X axis. Figure Genome Research Institute (NIH U01HG007417). Primary funding support to S4. Locus-Zoom plot of the association between MCH (A) and MCV (B) KEN, RT, HMH, CLA, CJH, MF, and D-YL (as part of HCHS/SOL) is provided by and rs145548796 in the total MEGA study population. Each point repre- U01HG007416. Additional support was provided via R01DK101855. The Ath- erosclerosis Risk in Communities Study (ARIC) was carried out as a collabora- sents one SNP; x-axis: increasing position on chromosome 6 left to right; y-axis: -log (p-value) of the association with MCH SNP correlation with tive study supported by R01HL087641, R01HL59367 and R01HL086694; 10 National Human Genome Research Institute contract U01HG004402; and Hodonsky et al. BMC Genomics (2020) 21:228 Page 12 of 14

National Institutes of Health contract HHSN268200625226C. The Coronary Ar- 2. Migone de Amicis M, Chivite D, Corbella X, Cappellini MD, Formiga F. tery Risk Development in Young Adults Study (CARDIA) is supported by con- Anemia is a mortality prognostic factor in patients initially hospitalized for tracts HHSN268201300025C, HHSN268201300026C, HHSN268201300027C, acute heart failure. Intern Emerg Med. 2017 Sep;12(6):749–56. HHSN268201300028C, HHSN268201300029C, and HHSN268200900041C from 3. Dai Y, Konishi H, Takagi A, Miyauchi K, Daida H. Red cell distribution width the National Heart, Lung, and Blood Institute (NHLBI), the Intramural Research predicts short- and long-term outcomes of acute congestive heart failure Program of the National Institute on Aging (NIA), and an intra agency agree- more effectively than hemoglobin. Exp Ther Med. 2014;8(2):600–6. ment between NIA and NHLBI (AG0005). 4. Kellert L, Martin E, Sykora M, Bauer H, Gussmann P, Diedler J, et al. Cerebral The HCHS/SOL study was carried out as a collaborative study supported by oxygen transport failure?: decreasing hemoglobin and hematocrit levels contracts from the National Heart, Lung and Blood Institute (NHLBI) to the after ischemic stroke predict poor outcome and mortality: STroke: RelevAnt University of North Carolina (N01-HC65233), University of Miami (N01- impact of hemoGlobin, hematocrit and transfusion (STRAIGHT)--an HC65234), Albert Einstein College of Medicine (N01-HC65235), Northwestern observational study. Stroke. 2011 Oct;42(10):2832–7. University (N01-HC65236) and San Diego State University (N01-HC65237). 5. Dzierzak E, Philipsen S. Erythropoiesis: development and differentiation. Cold The WHI program is funded by the NHLBI, NIH, US Department of Health Spring Harb Perspect Med. 2013;3(4):a011601. and Human Services through contracts HHSN268201100046C, 6. Barlas RS, Honney K, Loke YK, McCall SJ, Bettencourt-Silva JH, Clark AB, et al. HHSN268201100001C, HHSN268201100002C, HHSN268201100003C, Impact of hemoglobin levels and Anemia on mortality in acute stroke: HHSN268201100004C and HHSN271201100004C. CJH, HMH, and LMR were analysis of UK regional registry data, systematic review, and meta-analysis. J also supported by NHLBI training grant HL129982. HMH was also supported Am Heart Assoc. 2016;17:5(8). by NHLBI training grant T32 HL007055 and ADA grant #1–19-PDF-045. D-YL 7. Whitfield JB, Martin NG. Genetic and environmental influences on the size was also supported by R01CA082659, R01GM047845, and P01CA142538. CLA, and number of cells in the blood. Genet Epidemiol. 1985;2(2):133–44. ARB, CMS, HMH, and CJH were also supported by R01HL142825. 8. Wright FA, Sullivan PF, Brooks AI, Zou F, Sun W, Xia K, et al. Heritability and genomics of gene expression in peripheral blood. Nat Genet. 2014 May; Availability of data and materials 46(5):430–7. Complete summary level results are available through dbGaP (http://www. 9. Patel KV. Variability and heritability of hemoglobin concentration: an ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000356). opportunity to improve understanding of anemia in older adults. Haematologica. 2008 Sep;93(9):1281–3. Ethics approval and consent to participate 10. Brendel C, Guda S, Renella R, Bauer DE, Canver MC, Kim Y-J, et al. Lineage- specific BCL11A knockdown circumvents toxicities and reverses sickle All participating studies listed in the methods section obtained Institutional – Review Board approval and written informed consent from all participants. phenotype. J Clin Invest. 2016 Oct 3;126(10):3868 78. Specifically, for the ARIC study IRBs at the following institutions approved the 11. Trakarnsanga K, Wilson MC, Lau W, Singleton BK, Parsons SF, Sakuntanaga P, β inclusion of this data in PAGE genomics studies: University of North Carolina et al. Induction of adult levels of -globin in human erythroid cells that at Chapel Hill, Johns Hopkins University, University of Minnesota, and intrinsically express embryonic or fetal globin by transduction with KLF1 – University of Mississippi Medical Center. Use of HCHS/SOL study data was and BCL11A-XL. Haematologica. 2014 Nov;99(11):1677 85. approved by IRBs from the following institutions: Northwestern University, 12. Astle WJ, Elding H, Jiang T, Allen D, Ruklisa D, Mann AL, et al. The allelic Albert Einstein School of Medicine, University of Miami, and San Diego State landscape of human blood cell trait variation and links to common – University. Use of the BioMe Biobank data was approved by the Mt. Sinai IRB. complex disease. Cell. 2016;167(5):1415 1429.e19. Use of the WHI data was approved by the Fred Hutchinson Research 13. Ferreira MAR, Hottenga J-J, Warrington NM, Medland SE, Willemsen G, Center IRB. Lawrence RW, et al. Sequence variants in three loci influence monocyte counts and erythrocyte volume. Am J Hum Genet. 2009 Nov;85(5):745–9. 14. Yang Q, Kathiresan S, Lin J-P, Tofler GH, O’Donnell CJ. Genome-wide Consent for publication association and linkage analyses of hemostatic factors and hematological N/A phenotypes in the Framingham Heart Study. BMC Med Genet. 2007;8(Suppl 1):S12. Competing interests 15. Chambers JC, Zhang W, Li Y, Sehmi J, Wass MN, Zabaneh D, et al. Genome- All authors declare that they have no competing financial or non-financial wide association study identifies variants in TMPRSS6 associated with interests. hemoglobin levels. Nat Genet. 2009 Nov;41(11):1170–2. 16. Lo KS, Wilson JG, Lange LA, Folsom AR, Galarneau G, Ganesh SK, et al. Author details Genetic association analysis highlights new loci that modulate 1University of North Carolina Gillings School of Public Health, 135 Dauer Dr, hematological trait variation in Caucasians and African Americans. Hum Chapel Hill, NC 27599, USA. 2University of Virginia Center for Public Health Genet. 2011;129(3):307–17. 3 Genomics, 1355 Lee St, Charlottesville, VA 22908, USA. Fred Hutchinson 17. Yasukochi Y, Sakuma J, Takeuchi I, Kato K, Oguri M, Fujimaki T, et al. Cancer Research Center, 1100 Fairview Ave N, Seattle, WA 98109, USA. Identification of nine novel loci related to hematological traits in a Japanese 4 Department of Genetics, University of North Carolina at Chapel Hill, 120 population. Physiol Genomics. 2018 Sep 1;50(9):758–69. 5 Mason Farm Road, Chapel Hill, NC 27599, USA. University of Washington, 18. Chen Z, Tang H, Qayyum R, Schick UM, Nalls MA, Handsaker R, et al. Genome- 6 1730 Minor Ave, Ste 1360, Seattle, WA 98101, USA. Stanford University wide association analysis of red blood cell traits in African Americans: the 7 School of Medicine, 291 Campus Dr, Stanford, CA 94305, USA. Vanderbilt COGENT network. Hum Mol Genet. 2013 Jun 15;22(12):2529–38. 8 University, 2525 West End Ave #1100, Nashville, TN 37203, USA. University of 19. van Rooij FJA, Qayyum R, Smith AV, Zhou Y, Trompet S, Tanaka T, et al. 9 Minnesota, 420 Delaware St SE, Minneapolis, MN 55455, USA. Rutgers Genome-wide trans-ethnic meta-analysis identifies seven genetic loci 10 University, 683 Hoes Ln W, Piscataway, NJ 08854, USA. University of Texas influencing erythrocyte traits and a role for RBPMS in erythropoiesis. Am J 11 Houston, 7000 Fannin Street, Houston, TX 77030, USA. National Human Hum Genet. 2017 Jan 5;100(1):51–63. Genome Research Institute, 31 Center Dr, Bethesda, MD 20894, USA. 20. van der Harst P, Zhang W, Mateo Leach I, Rendon A, Verweij N, Sehmi J, 12 University of Washington, 1705 NE Pacific St, Seattle, WA 98195, USA. et al. Seventy-five genetic loci influencing the human red blood cell. Nature. 13 Icahn School of Medicine at Mount Sinai, 1468 Madison Ave, New York, NY 2012 Dec 20;492(7429):369–75. 10029, USA. 21. Soranzo N, Spector TD, Mangino M, Kühnel B, Rendon A, Teumer A, et al. A genome-wide meta-analysis identifies 22 loci associated with eight Received: 24 October 2019 Accepted: 26 February 2020 hematological parameters in the HaemGen consortium. Nat Genet. 2009 Nov;41(11):1182–90. 22. Pistis G, Okonkwo SU, Traglia M, Sala C, Shin S-Y, Masciullo C, et al. Genome References wide association analysis of a founder population identified TAF3 as a gene 1. Buttarello M. Laboratory diagnosis of anemia: are the old and new red cell for MCHC in . PLoS One. 2013 Jul 31;8(7):e69206. parameters useful in classification and treatment, how? Int J Lab Hematol. 23. Okada Y, Kamatani Y. Common genetic factors for hematological traits in 2016;38(Suppl 1):123–32. humans. J Hum Genet. 2012 Mar;57(3):161–9. Hodonsky et al. BMC Genomics (2020) 21:228 Page 13 of 14

24. Li J, Glessner JT, Zhang H, Hou C, Wei Z, Bradfield JP, et al. GWAS of blood 45. Chami N, Chen M-H, Slater AJ, Eicher JD, Evangelou E, Tajuddin SM, et al. cell traits identifies novel associated loci and epistatic interactions in Exome genotyping identifies pleiotropic variants associated with red blood Caucasian and African-American children. Hum Mol Genet. 2013 Apr 1;22(7): cell traits. Am J Hum Genet. 2016 Jul 7;99(1):8–21. 1457–64. 46. Matise TC, Ambite JL, Buyske S, Carlson CS, Cole SA, Crawford DC, et al. The 25. Kullo IJ, Ding K, Jouni H, Smith CY, Chute CG. A genome-wide association next PAGE in understanding complex traits: design for the analysis of study of red blood cell traits using the electronic medical record. PLoS One. population architecture using genetics and epidemiology (PAGE) study. Am 2010;28:5(9). J Epidemiol. 2011 Oct 1;174(7):849–59. 26. Kamatani Y, Matsuda K, Okada Y, Kubo M, Hosono N, Daigo Y, et al. 47. Wei P, Cao Y, Zhang Y, Xu Z, Kwak I-Y, Boerwinkle E, et al. On robust Genome-wide association study of hematological and biochemical traits in association testing for quantitative traits and rare variants. G3 (Bethesda). a Japanese population. Nat Genet. 2010 Mar;42(3):210–5. 2016;6(12):3941–50. 27. Iotchkova V, Huang J, Morris JA, Jain D, Barbieri C, Walter K, et al. Discovery 48. Gerhard GS, Paynton BV, DiStefano JK. Identification of genes for hereditary and refinement of genetic loci associated with cardiometabolic risk using hemochromatosis. Methods Mol Biol. 2018;1706:353–65. dense imputation maps. Nat Genet. 2016 Sep 26;48(11):1303–12. 49. GTEx Consortium. Human genomics. The genotype-tissue expression (GTEx) 28. Hodonsky CJ, Jain D, Schick UM, Morrison JV, Brown L, McHugh CP, et al. pilot analysis: multitissue gene regulation in humans. Science. 2015; Genome-wide association study of red blood cell traits in Hispanics/Latinos: 348(6235):648–60. the Hispanic community health study/study of Latinos. PLoS Genet. 2017 50. Lambert SA, Abraham G, Inouye M. Towards clinical utility of polygenic risk Apr 28;13(4):e1006760. scores. Hum Mol Genet. 2019;28(R2):R133–42. 29. Ganesh SK, Zakai NA, van Rooij FJA, Soranzo N, Smith AV, Nalls MA, et al. 51. Martin AR, Kanai M, Kamatani Y, Okada Y, Neale BM, Daly MJ. Clinical use of Multiple loci influence erythrocyte phenotypes in the CHARGE Consortium. current polygenic risk scores may exacerbate health disparities. Nat Genet. Nat Genet. 2009 Nov;41(11):1191–8. 2019 Mar 29;51(4):584–91. 30. Ding K, de Andrade M, Manolio TA, Crawford DC, Rasmussen-Torvik LJ, 52. Barbeira AN, Dickinson SP, Bonazzola R, Zheng J, Wheeler HE, Torres JM, Ritchie MD, et al. Genetic variants that confer resistance to malaria are et al. Exploring the phenotypic consequences of tissue specific gene associated with red blood cell traits in African-Americans: an electronic expression variation inferred from GWAS summary statistics. Nat Commun. medical record-based genome-wide association study. G3 (Bethesda). 2013; 2018 May 8;9(1):1825. 3(7):1061–8. 53. Peterson RE, Kuchenbaecker K, Walters RK, Chen C-Y, Popejoy AB, 31. Kanai M, Akiyama M, Takahashi A, Matoba N, Momozawa Y, Ikeda M, et al. Periyasamy S, et al. Genome-wide association studies in ancestrally diverse Genetic analysis of quantitative traits in the Japanese population links cell populations: opportunities, methods, pitfalls, and recommendations. Cell. types to complex human diseases. Nat Genet. 2018 Feb 5;50(3):390–400. 2019 Oct 17;179(3):589–603. 32. Bustamante CD, Burchard EG, De la Vega FM. Genomics for the world. 54. Evans TC, Jehle D. The red blood cell distribution width. J Emerg Med. 1991; Nature. 2011;475(7355):163–5. 9(Suppl 1):71–4. 33. DIAbetes Genetics Replication And Meta-analysis (DIAGRAM) Consortium, 55. Monzon CM, Beaver BD, Dillon TD. Evaluation of erythrocyte disorders with Asian Genetic Epidemiology Network Type 2 Diabetes (AGEN-T2D) mean corpuscular volume (MCV) and red cell distribution width (RDW). Clin Consortium, South Asian Type 2 Diabetes (SAT2D) Consortium, Mexican Pediatr (Phila). 1987;26(12):632–8. American Type 2 Diabetes (MAT2D) Consortium, Type 2 Diabetes Genetic 56. Wojcik GL, Graff M, Nishimura KK, Tao R, Haessler J, Gignoux CR, et al. Exploration by Nex-generation sequencing in muylti-Ethnic Samples (T2D- Genetic analyses of diverse populations improves discovery for complex GENES) Consortium, Mahajan A, et al. Genome-wide trans-ancestry meta- traits. Nature. 2019 Jun 19;570(7762):514–8. analysis provides insight into the genetic architecture of type 2 diabetes 57. Hsieh Y-P, Chang C-C, Kor C-T, Yang Y, Wen Y-K, Chiu P-F. The predictive susceptibility. Nat Genet. 2014;46(3):234–44. role of red cell distribution width in mortality among chronic kidney disease 34. Kichaev G, Roytman M, Johnson R, Eskin E, Lindström S, Kraft P, et al. patients. PLoS One. 2016 Dec 1;11(12):e0162025. Improved methods for multi-trait fine mapping of pleiotropic risk loci. 58. Tseliou E, Terrovitis JV, Kaldara EE, Ntalianis AS, Repasos E, Katsaros L, et al. Bioinformatics. 2017 Jan 15;33(2):248–55. Red blood cell distribution width is a significant prognostic marker in 35. Schick UM, Jain D, Hodonsky CJ, Morrison JV, Davis JP, Brown L, et al. Genome- advanced heart failure, independent of hemoglobin levels. Hell J Cardiol. wide association study of platelet count identifies ancestry-specific loci in 2014 Dec;55(6):457–61. Hispanic/Latino Americans. Am J Hum Genet. 2016 Feb 4;98(2):229–42. 59. Panwar B, Judd SE, Warnock DG, McClellan WM, Booth JN, Muntner P, et al. 36. Parra EJ, Mazurek A, Gignoux CR, Sockell A, Agostino M, Morris AP, et al. Hemoglobin concentration and risk of incident stroke in community-living Admixture mapping in two Mexican samples identifies significant adults. Stroke. 2016;47(8):2017–24. associations of locus ancestry with triglyceride levels in the BUD13/ZNF259/ 60. Solovieff N, Cotsapas C, Lee PH, Purcell SM, Smoller JW. Pleiotropy in complex APOA5 region and fine mapping points to rs964184 as the main driver of traits: challenges and strategies. Nat Rev Genet. 2013 Jul;14(7):483–95. the association signal. PLoS One. 2017 Feb 28;12(2):e0172880. 61. Chesmore K, Bartlett J, Williams SM. The ubiquity of pleiotropy in human 37. 1000 Genomes Project Consortium, Auton A, Brooks LD, Durbin RM, disease. Hum Genet. 2018 Jan;137(1):39–44. Garrison EP, Kang HM, et al. A global reference for human genetic variation. 62. Gaggin HK, Dec GW. The role of treatment for Anemia as a therapeutic Nature. 2015;526(7571):68–74. target in the Management of Chronic Heart Failure: insights after RED-HF. 38. Li YR, Keating BJ. Trans-ethnic genome-wide association studies: advantages Curr Treat Options Cardiovasc Med. 2014;16(1):279. and challenges of mapping in diverse populations. Genome Med. 2014 Oct 63. Morales J, Welter D, Bowler EH, Cerezo M, Harris LW, McMahon AC, et al. A 31;6(10):91. standardized framework for representation of ancestry data in genomics 39. Petrovski S, Goldstein DB. Unequal representation of genetic variation across studies, with application to the NHGRI-EBI GWAS catalog. Genome Biol. ancestry groups creates healthcare inequality in the application of precision 2018;19(1):21. medicine. Genome Biol. 2016;17(1):157. 64. Bien SA, Wojcik GL, Zubair N, Gignoux CR, Martin AR, Kocarnik JM, et al. 40. Bigdeli TB, Genovese G, Georgakopoulos P, et al. Contributions of common Strategies for enriching variant coverage in candidate disease loci on a genetic variants to risk of schizophrenia among individuals of African and Latino multiethnic genotyping array. PLoS One. 2016 Dec 14;11(12):e0167758. ancestry. Mol Psychiatry. 2019. https://doi.org/10.1038/s41380-019-0517-y. 65. MacArthur J, Bowler E, Cerezo M, Gil L, Hall P, Hastings E, et al. The new 41. Girirajan S. Missing heritability and where to find it. Genome Biol. 2017 May NHGRI-EBI catalog of published genome-wide association studies (GWAS 11;18(1):89. catalog). Nucleic Acids Res. 2017 Jan 4;45(D1):D896–901. 42. Kim H, Grueneberg A, Vazquez AI, Hsu S, de Los Campos G. Will big data 66. Zheng X, Levine D, Shen J, Gogarten SM, Laurie C, Weir BS. A high- close the missing heritability gap? Genetics. 2017 Sep 11;207(3):1135–45. performance computing toolset for relatedness and principal component 43. Pickrell JK, Berisa T, Liu JZ, Ségurel L, Tung JY, Hinds DA. Detection and analysis of SNP data. Bioinformatics. 2012;28(24):3326–8. interpretation of shared genetic influences on 42 human traits. Nat Genet. 67. Lin D-Y, Tao R, Kalsbeek WD, Zeng D, Gonzalez F, Fernández-Rhodes L, et al. 2016 May 16;48(7):709–17. Genetic association analysis under complex survey sampling: the Hispanic 44. Park H, Li X, Song YE, He KY, Zhu X. Multivariate analysis of anthropometric community health study/study of Latinos. Am J Hum Genet. 2014;95(6):675–88. traits using summary statistics of genome-wide association studies from 68. Willer CJ, Li Y, Abecasis GR. METAL: fast and efficient meta-analysis of GIANT Consortium. PLoS One. 2016 Oct 4;11(10):e0163912. genomewide association scans. Bioinformatics. 2010 Sep 1;26(17):2190–1. Hodonsky et al. BMC Genomics (2020) 21:228 Page 14 of 14

69. Kim J, Bai Y, Pan W. An adaptive association test for multiple phenotypes with GWAS summary statistics. Genet Epidemiol. 2015 Dec;39(8):651–63. 70. Sitlani CM, Baldassari AR, Highland HM, Hodonsky CJ, Avery CL. Comparison of multiple phenotype association tests using summary statistics in genome-wide association studies. Wiley Online Library; 2019.

Publisher’sNote Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.