SHARED GENETIC ARCHITECTURE OF TRAITS IN U.S. POPULATIONS

Chani Jo Hodonsky

A dissertation submitted to the faculty at the University of North Carolina at Chapel Hill in partial fulfillment of the requirements for the degree of Doctor of Philosophy in the Department of Epidemiology in the Gillings School of Global Public Health.

Chapel Hill 2019

Approved by:

Christy Avery

Mariaelisa Graff

Kari North

Alex Reiner

Ran Tao

i 228

©2019 Chani Jo Hodonsky ALL RIGHTS RESERVED

ii

ABSTRACT

Chani Jo Hodonsky: Shared Genetic Architecture of Red Blood Cell Traits in U.S. Populations (Under the direction of Christy L. Avery)

Red blood cells are the most numerous cell in the body, and clinical measures used to describe them (RBC traits) are highly polygenic. Hundreds of loci have been identified using traditional genome-wide association study methods. However, the majority of association studies have been performed in European- or East Asian-ancestry populations, and heritability estimates suggest that additional associations remain to be identified. Rare variants, which GWAS are typically underpowered to detect, have been considered as potential contributors to this missing heritability. Of note, European-ancestry populations have both the lowest genetic diversity and the fewest rare variants compared to other ancestry groups.

Both the identification of previously unreported loci and the characterization of known loci for complex quantitative traits benefit from inclusive study populations and recently developed association study methods. The objective of this study was to evaluate genetic associations with seven RBC traits in an ancestrally diverse study population by applying two different methods—a combined-phenotype approach to evaluating common variants that may affect multiple RBC traits, and a -based approach that improves power to detect groups of rare variants acting on a single genetic transcript. We utilized data from a large, multi-ethnic study population from across the United States, including genotypes and data from seven RBC traits: hematocrit, hemoglobin concentration, mean corpuscular hemoglobin, mean corpuscular

iii hemoglobin concentration, mean corpuscular volume, red blood cell count, and red cell distribution width.

Our findings confirm the high polygenicity of RBC traits and the applicability of previously reported RBC trait loci to populations of all ancestries. We identified four previously unreported associated with one or more RBC traits. Additionally, using a combined- phenotype method we identified twenty independent association signals within seven loci, several of which had lead variants only present in African- or American-ancestry populations.

Our work shows the importance of performing association studies in populations of all ancestries, while also calling for increased representation of genetic variation from diverse populations in publicly available resources such as eQTL databases. Continued efforts into the bioinformatic characterization of RBC trait loci will pave the way for molecular work that improves our understanding of RBC physiology and may lead to pharmaceutical innovations for genetically or environmentally induced RBC disorders.

iv I dedicate this work to my parents, Pamela and Joseph Hodonsky; my brother, Eric Hodonsky; and Dr. Paul and Phyllis Schroeder. I am forever grateful for your love and support throughout this journey.

v ACKNOWLEDGMENTS

I would like to thank my dissertation committee (Marielisa Graff, Kari North, Alex

Reiner, and Ran Tao) and chair (Christy Avery) for making this work possible. Each member of my committee offered a breadth and depth of knowledge that inspire me to continue improving every day as I strike out on my own research path.

The work represented by this document may have been performed by me, but would never have been possible without the guidance and encouragement of my family and friends. I begin by thanking my parents, Pam and Joe Hodonsky, who supported me on the winding road of self discovery that ultimately led to the UNC epidemiology program. They were there every step of the way, and that unconditional love sustained me. My brother, Eric Hodonsky, provided a consistent source of love and positive words, checking in from the west coast to make sure I was surviving the travails of coursework and research. I also thank and honor Dr. Paul and

Phyllis Schroeder, without whom I could never have become the person I am today. I can only hope to aspire to their boundless love and generosity.

The life of a PhD student can be a rollercoaster, and over the last five years a large number of people have ridden that rollercoaster with me. I would like to thank my friends and colleagues in the Gillings School of Public Health, who were there to help me grow as a student and researcher: Breakfast Club (Antoine Baldassari, Joseph Engeda, and Rahul Gondalia),

Elizabeth Kamai, Paula Strassle, Nikki Niehoff, Kristin Voltzke Moore, Rebecca Stebbins,

Sydney Thai, Adrien Wilkie, Mackenzie Herzog, Kamika Reynolds, Liz Kelly, Magdalene

vi

Assimon, Anne Justice, Kris Young, Laura Raffield, and Heather Highland. I would also like to thank my “non-epi” friends who loved me (and occasionally fed me) despite the demands of graduate school: Suzanne and Angela Derby-Wright, Hope Tyson, Zinaida Mahmutefendic and

Andrew Hunt, Dylan Williams, Amber Sniff and Jim Heffernan, Dan Mott and Jennifer

Robinson, Lily and Michael Whitton, Kim and Lars Soltmann, Sam Boyarsky, Rim Vilgalys,

Boris Kurktchiev and Kelli Avalos, Will Roper and Sam Ross, Brent Eason, and Cheralyn

Schmidt. I have also had friends from all parts of my life supporting me from afar throughout this process, whom I love and am so glad to have in my life. A special thank you to Eric Witham,

Eric Dennis, Johanna and Wes Craig, Gabrielle and Clare Singleton, Andrew Inman, Michael

Garrison, and Ravin Pan for watching me grow up and loving me anyway. Finally, I have been extremely fortunate to learn from mentors across every aspect of my life. I would specifically like to thank Janet Hegman Shier, John Rubadeau, Tony Antonellis, Ming-Sing Si, Steve

Leibovic, and Arash Babaoff for showing me the best of humanity.

I am honored to have been a member of the PAGE study for the entirety of my graduate work, and I look forward to continuing this collaboration. PAGE has provided an environment of support, generous authorship opportunities, and a shining example of what collaboration in academic science should be. While all of my PAGE collaborators made me feel welcome over the past five years, I would specifically like to thank Drs. Stephanie Bien, Genevieve Wojcik,

Christopher Gignoux, Eimear Kenny, Lucia Hindorff, and Charles Kooperberg. I would also like to acknowledge the contributions of the nearly 100,000 study participants who selflessly gave their time and resources to move forward the fields of cardiovascular and genetic epidemiology.

Lastly, I am grateful for the opportunity to focus on my work as a trainee of the NIH/NHLBI

T32 Cardiovascular Genetic Epidemiology Training Grant (T32 HL129982).

vii

TABLE OF CONTENTS

LIST OF TABLES ...... xiii

LIST OF FIGURES ...... xvii

LIST OF ABBREVIATIONS ...... xviii

LIST OF GENE NAMES ...... xxii

OVERVIEW AND RATIONALE ...... 1

SPECIFIC AIMS ...... 4

2.1. Aim 1: Leverage evidence of pleiotropy to identify and characterize RBC trait loci. 4

2.2. Aim 2: Identify rare variants associated with RBC traits via gene-based analysis. ... 4

BACKGROUND AND SIGNIFICANCE...... 6

3.1. Historical background ...... 6

3.2. Red blood cell biology ...... 7

3.2.1. Three waves of hematopoiesis ...... 8

3.2.2. Hemoglobin: the primary actor in RBC oxygen transportation and delivery ...... 9

3.2.3. Stages of Erythropoiesis and RBC Degradation ...... 10

3.2.4. Iron homeostasis in erythropoiesis...... 12

3.3. Red blood cell trait definitions and measurement...... 13

3.3.1. HCT...... 14

3.3.2. HGB ...... 15

3.3.3. RBCC ...... 15

3.3.4. MCH ...... 15

3.3.5. MCHC ...... 15

viii 3.3.6. MCV ...... 16

3.3.7. RDW ...... 16

3.4. Anemia definition and prevalence ...... 17

3.4.1. Monogenic RBC anemias and related disorders ...... 18

3.4.2. Summary ...... 22

3.5. Red blood cell epidemiology ...... 22

3.5.1. Traditional risk factors ...... 23

3.5.2. RBC trait measurement and error ...... 30

3.5.3. Summary ...... 31

3.6. Clinical and public health significance of RBC traits ...... 32

3.6.1. RDW and ESRD ...... 33

3.6.2. RDW in Failure ...... 34

3.6.3. RDW and ischemic stroke ...... 34

3.7. State of the literature: RBC trait genetics ...... 36

3.7.1. Linkage analysis studies ...... 37

3.7.2. RBC trait candidate gene studies ...... 39

3.7.3. RBC molecular genetics ...... 41

3.7.4. Genome-wide association studies ...... 43

3.7.5. RBC trait genetic association studies ...... 45

3.8. State of the literature: quantitative trait analysis...... 51

3.8.1. Pleiotropy ...... 51

3.8.2. Fine-mapping association signals identified in combined-phenotype GWAS ..... 59

3.8.3. Association studies of rare variants ...... 64

3.9. Public Health Significance ...... 72

3.10. Supporting Tables & Figures ...... 76

ix RESEARCH PLAN ...... 86

4.1. Overview ...... 86

4.2. Study populations...... 86

4.2.1. Population Architecture using Genomics and Epidemiology: The PAGE study . 86

4.3. Phenotypes: outcome and covariate assessment ...... 90

4.3.1. RBC trait descriptive statistics and transformation ...... 90

4.3.2. Exclusion Criteria ...... 90

4.3.3. Adjustment covariates ...... 91

4.4. Specific Aim 1: Combined-phenotype GWAS of red blood cell traits ...... 94

4.4.1. Quality control for genotypes and imputed data and statistical threshold ...... 94

4.4.2. Univariate analyses ...... 95

4.4.3. Comparison of available combined-phenotype methods ...... 97

4.4.4. Combined-phenotype analysis using aSPU ...... 98

4.5. Specific Aim 2: Gene-based rare variant analysis of red blood cell traits ...... 101

4.5.1. Gene-based testing: combined VC and burden test implemented in MASS ...... 102

4.5.2. Interpretation of results ...... 105

4.6. Supporting Tables & Figures ...... 107

MANUSCRIPT A: ANCESTRY-SPECIFIC ASSOCIATIONS IDENTIFIED IN GENOME-WIDE COMBINED-PHENOTYPE STUDY OF RED BLOOD CELL TRAITS EMPHASIZE BENEFITS OF DIVERSITY IN GENOMICS ...... 114

5.1. Overview ...... 114

5.2. Introduction ...... 115

5.3. Methods...... 117

5.3.1. Study population ...... 117

5.3.2. RBC Trait Measurement ...... 117

5.3.3. Genotyping and Imputation ...... 117

x 5.3.4. Statistical Methods ...... 118

5.3.5. Publicly available expression quantitative trait locus (eQTL) analysis ...... 122

5.4. Results ...... 122

5.4.1. Combined-phenotype analyses ...... 122

5.4.2. eQTL function of index SNPs...... 125

5.5. Discussion ...... 125

5.6. Conclusion ...... 129

5.7. Main Figures & Tables ...... 130

MANUSCRIPT B. GENE-BASED TESTING IN AN ANCESTRALLY DIVERSE STUDY POPULATION SUGGESTS COMPLICATED REGULATORY MECHANISMS FOR RED BLOOD CELL TRAITS: THE PAGE STUDY ...... 202

6.1. Overview ...... 202

6.2. Introduction ...... 203

6.3. Methods...... 204

6.3.1. Study population ...... 204

6.3.2. RBC trait measurement ...... 204

6.3.3. Genotyping and Imputation ...... 205

6.3.4. Statistical Methods ...... 205

6.4. Results ...... 208

6.4.1. Genes previously unreported in RBC trait association studies ...... 209

6.4.2. Clustering of significant genes...... 210

6.4.3. Overrepresentation of or molecular pathways ...... 211

6.5. Discussion ...... 211

6.6. Conclusion ...... 214

6.7. Main Figures and Tables ...... 215

CONCLUSIONS ...... 234

xi 7.1. Recapitulations of Specific Aims...... 234

7.2. Main Findings ...... 235

7.3. Strengths ...... 237

7.4. Limitations ...... 237

7.5. Overall Conclusions ...... 238

APPENDIX: AN INTRODUCTION TO BASIC GENETIC PROCESSES...... 239

8.1. Transcription ...... 240

8.2. Intronic/intergenic variation...... 241

8.3. Epigenetic modification ...... 242

8.3.1. DNA methylation ...... 242

8.3.2. 3D chromatin structure ...... 243

8.3.3. Histone modification ...... 244

8.4. Other genetic processes...... 245

8.4.1. MicroRNA ...... 245

8.4.2. Post-translational modification ...... 246

8.5. Supporting Figure ...... 248

WORKS CITED ...... 249

xii LIST OF TABLES

Table 1. Discovery and genetic mapping of hereditary anemias ...... 76

Table 2. Description of involved in iron homeostasis and erythropoiesis...... 78

Table 3. Descriptions of red blood cell primary traits and derived indices...... 79

Table 4. Brief overview of genetic and epigenetic processes and effects of variation...... 80

Table 5. Published English-Language GWAS and Exome Association Studies of RBC Traits ...... 81

Table 6. Evidence of genetic associations shared across RBC traits...... 82

Table 7. Examples of genome-wide significant RBC trait associations with cis-eQTL evidence and a plausible biological mechanism for a proximal gene...... 83

Table 8. The PAGE Study: descriptive statistics for PAGE participants and collaborators by study and self-reported race/ethnicity...... 107

Table 9. Representative sample of bioinformatics analyses for identifying candidate functional variants...... 108

Table 10. RBC trait loci with evidence of multiple independent signals among PAGE study participants...... 135

Table 11. Genotyping platforms and QC measures used by participating study populations...... 137

Table 12. Ancestry representation by self-reported race/ethnicity by RBC trait...... 138

Table 13. Descriptive statistics for PAGE participants by study and race/ethnicity for ARIC, BioME, and CARDIA participants...... 139

Table 14. Descriptive statistics for PAGE participants by study and race/ethnicity for WHI and HCHS/SOL participants...... 140

Table 15. Univariate allele information and frequencies for combined-phenotype lead SNPs in combined multi-ethnic study population...... 141

Table 16. HCT univariate results for combined-phenotype lead SNPs in combined multi-ethnic study population...... 143

Table 17. HGB univariate results for combined-phenotype lead SNPs in combined multi-ethnic study population...... 145

xiii

Table 18. MCH univariate results for combined-phenotype lead SNPs in combined multi-ethnic study population...... 147

Table 19. MCHC univariate results for combined-phenotype lead SNPs in combined multi-ethnic study population...... 149

Table 20. MCV univariate results for combined-phenotype lead SNPs in combined multi-ethnic study population...... 151

Table 21. RBCC univariate results for combined-phenotype lead SNPs in combined multi-ethnic study population...... 153

Table 22. RDW univariate results for combined-phenotype lead SNPs in combined multi-ethnic study population...... 155

Table 23. HCT univariate results for combined-phenotype lead SNPs in African Americans...... 157

Table 24. HGB univariate results for combined-phenotype lead SNPs in African Americans...... 159

Table 25. MCH univariate results for combined-phenotype lead SNPs in African Americans...... 161

Table 26. MCHC univariate results for combined-phenotype lead SNPs in African Americans...... 163

Table 27. MCV univariate results for combined-phenotype lead SNPs in African Americans...... 165

Table 28. RBCC univariate results for combined-phenotype lead SNPs in African Americans...... 167

Table 29. RDW univariate results for combined-phenotype lead SNPs in African Americans...... 169

Table 30. HCT univariate results for combined-phenotype lead SNPs in Hispanics/Latinos...... 171

Table 31. HGB univariate results for combined-phenotype lead SNPs in Hispanics/Latinos...... 173

Table 32. MCH univariate results for combined-phenotype lead SNPs in Hispanics/ Latinos...... 175

xiv Table 33. MCHC univariate results for combined-phenotype lead SNPs in Hispanics/ Latinos...... 177

Table 34. MCV univariate results for combined-phenotype lead SNPs in Hispanics/ Latinos...... 179

Table 35. RBCC univariate results for combined-phenotype lead SNPs in Hispanics/ Latinos...... 181

Table 36. RDW univariate results for combined-phenotype lead SNPs in Hispanics/Latinos...... 183

Table 37. HCT univariate results for combined-phenotype lead SNPs in European Americans...... 185

Table 38. HGB univariate results for combined-phenotype lead SNPs in European Americans...... 186

Table 39. MCH univariate results for combined-phenotype lead SNPs in European Americans...... 187

Table 40. MCHC univariate results for combined-phenotype lead SNPs in European Americans...... 188

Table 41. MCV univariate results for combined-phenotype lead SNPs in European Americans...... 189

Table 42. RBCC univariate results for combined-phenotype lead SNPs in European Americans...... 190

Table 43. RDW univariate results for combined-phenotype lead SNPs in European Americans...... 191

Table 44. LD proxies by ancestry for rs6573766, associated with RBCC in univariate sensitivity analysis...... 192

Table 45. LD proxies by ancestry for rs145548796*, associated with MCV in univariate sensitivity analysis...... 193

Table 46. Comparison of unadjusted and esv3637548-adjusted trait-specific p-values at HBA1/2 locus on 16 in MEGA-genotyped individuals...... 194

Table 47. Shared generalization at previously published index SNPs across seven RBC traits...... 195

xv Table 48. Generalization of previously reported association signals and index SNPs to the PAGE trans-ethnic study population...... 196

Table 49. Generalization of previously reported association signals and index SNPs to the PAGE African American population...... 197

Table 50. Generalization of previously reported association signals and index SNPs to the PAGE Hispanic/Latino population...... 198

Table 51. Generalization of previously reported association signals and index SNPs to the PAGE European-ancestry population...... 199

Table 52. Lead SNPs that are significant eQTLs for genes within 500kb for RBC-relevant tissues in GTEx...... 200

Table 53. Descriptive information for 29,070 MEGA-genotyped study participants...... 218

Table 54. CADD annotation set transcripts significant in one or more RBC traits in 29,070 PAGE participants...... 219

Table 55. Deleterious annotation set transcripts significant in one or more RBC traits in 29,070 PAGE participants...... 221

Table 56. Top variants and p-values by trait and transcript for CADD variant annotation set significant transcripts...... 222

Table 57. Top variants and p-values by trait and transcript for deleterious variant annotation set significant transcripts...... 224

Table 58. Allele frequencies for top variants within CADD annotation set significant transcripts...... 226

Table 59. Allele frequencies for top variants within deleterious annotation set significant transcripts...... 229

Table 60. Association network of all genes significant for CADD or deleterious annotation sets...... 230

Table 61. Gene ontology enrichment for genes significantly associated with one or more RBC traits...... 231

Table 62. P-values that increased after conditioning on a single variant for transcripts significantly associated with one or more RBC traits in the deleterious annotation set...... 233

xvi LIST OF FIGURES

Figure 1. A basic overview of RBC development stages and accompanying molecular processes (prepared by Hodonsky, 2018) ...... 84

Figure 2. Example network of key -protein interactions involving molecules with established roles in RBC biology...... 85

Figure 3. PAGE II participants self-identifying as Hispanic/Latino stratified by ethnicity ...... 109

Figure 4. Ancestral principal components demonstrate the continental-ancestry continuum among participants of the PAGE study...... 110

Figure 5. Comparison of multi-phenotype methods under the assumption of no genetic effect to examine systematic evidence of type I error inflation...... 111

Figure 6. Statistical power for univariate and multi-phenotype associations by effect size...... 112

Figure 7. Selected approximate power curves for 20,000 genes by number of variants per region...... 113

Figure 8. Identification and characterization of 39 loci in a multi-ethnic study population...... 130

Figure 9. Multiple independent associations with MCH demonstrate complex genetic architecture at HBA1/HBA2 locus...... 131

Figure 10. Evidence of genetic associations shared across correlated RBC traits...... 132

Figure 11. Locus-Zoom plots of the association between rs6573766 and RBCC in PAGE African Americans (A), Hispanics/Latinos (B), and European Americans (C) ...... 133

Figure 12. Locus-Zoom plot of the association between MCH (A) and MCV (B) and rs145548796 in the total MEGA study population ...... 134

Figure 13. Number of variants per transcript in deleterious and CADD annotations...... 215

Figure 14. Genome-wide gene-based results using deleterious (top) and CADD (bottom) masks identifies multiple significant transcripts across seven RBC traits...... 216

Figure 15. Evidence of shared pathways in a cluster of CADD-significant genes...... 217

Figure 16. Molecular processes involved in gene transcription, mRNA translation, and detected amounts of products from those processes (prepared by Hodonsky, 2017) ...... 248

xvii LIST OF ABBREVIATIONS

1000G 1000 Genomes project (referred to by phase when relevant)

3' and 5' Downstream and upstream, respectively, in the direction of transcription

AD Autosomal dominant-inherited trait

AFR 1000 Genomes African continental ancestry super population

AMR 1000 Genomes American continental ancestry super population

AR Autosomal recessive-inherited trait

ARIC Atherosclerosis Risk in Communities cohort study

ASW African Americans in the Southwest United States (1000G population)

BCX Blood cell consortium

BFU-E Burst-forming unit erythroid cells

BioME Mt. Sinai BioME biobank

BM Bone marrow

BMI Body mass index

CAF Coded allele frequency (defined as the frequency of the allele listed as the primary allele in PAGE)

CARDIA Coronary Artery Risk Development in Young Adults Study

CEU Northern Europeans from Utah (1000G population)

CFU-E Colony-forming unit erythroid cells

CHARGE Cohorts for Heart and Aging Research in Genetic Epidemiology

CHD Coronary heart disease

CHS Cardiovascular Health Study

CI Confidence interval

CKD Chronic kidney disease

xviii cM Centimorgan

CVD Cardiovascular disease df Degrees of freedom

DNA Deoxyribonucleic acid

EAS 1000 Genomes East Asian ancestry super population eMERGE Electronic Medical Records and Genomics

EMP Erythroblast-macrophage protein

ENCODE Encyclopedia or DNA Elements eQTL Expression quantitative trait locus

ESRD End-stage renal disease

EUR 1000 Genomes European continental ancestry super population

FHS Framingham Heart Study

GEE Generalized estimating equation

GIANT Genetic Investigation of Anthropometric Traits consortium

GWAS Genome-wide association study

GxE Gene-by-environment interaction study h2 Narrow-sense heritability due to additive genetic effects

HCHS/SOL Hispanic Community Health Study/Study of Latinos

HCT Hematocrit

HGB Hemoglobin level

HSC Hematopoietic stem cell

ICC Interclass correlation coefficient iPSC Induced pluripotent stem cells

xix Indel Insertion/deletion genetic variant

IRE Iron-responsive element

ISMMS Icahn School of Medicine at Mt. Sinai

JHS Jackson Heart Study

LD Linkage disequilibrium

LMM Linear mixed (effects) model lncRNA Long non-coding RNA

LZ plot Locus-Zoom plot

MAF Minor allele frequency

MCH Mean corpuscular hemoglobin

MCHC Mean corpuscular hemoglobin concentration

MCV Mean corpuscular volume

MEP Megakaryocyte/erythrocyte precursor

MESA Multi-Ethnic Study of Atherosclerosis miR Micro RNA mRNA Messenger RNA

MS1 Manuscript 1 of dissertation project

MS2 Manuscript 2 of dissertation project

N Number of participants

NHANES National Health and Nutrition Examination Survey

NHLBI National Heart, Lung, and Blood Institute

NIH National Institutes of Health

NMD Nonsense-mediated decay

xx OR Odds ratio

PAGE Population Architecture using Genomics and Epidemiology consortium

PTC Premature termination codon

RBCC Red blood cell count

RBP RNA binding protein

RDW Red cell distribution width

REGARDS REasons for Geographic and Racial Differences in Stroke study

RNA Ribonucleic acid

SAS 1000 Genomes South Asian ancestry super population

SD Standard deviation

SE Standard error

SHARe SNP Health Association Resource study

SHS Strong Heart Study

SNP Single nucleotide polymorphism

T2D Type 2 Diabetes Mellitus tRNA transfer RNA

UTR Untranslated region (of mRNA)

WHI Women’s Health Initiative study

WHO World Health Organization

YRI Yoruba in Ibadan, Nigeria (1000G population)

xxi LIST OF GENE NAMES

ABCB11 ATP binding cassette B11

ADAMTS13 ADAM metallopeptidase with thrombospondin type 1 motif 13

ALAD Aminolevulinate dehydratase

ALAS2 Aminolevulinate delta synthase 2

AK1 Adenylate kinase 1

AMN Amnion-associated transmembrane protein

ANK1 Ankyrin 1

ATXN2 Ataxin 2 (adjacent to SH2B3)

BCL11A B-cell/CLL lymphoma 11A

BMP Bone morphogenetic protein family

BRCA2 Breast cancer 2

BRIP1 BRCA1-interacting protein C-terminal helicase 1

C4BPA Complement-component 4 binding protein alpha

CATSPERB Cation channel sperm-associated auxiliary subunit beta

CCND3 Cyclin D3

CD164 Cluster of differentiation 164 (codes for MUC24)

CDAN1 Codanin 1

CFH Complement factor H

CFHR1 CFH receptor 1 (family of genes, also include CHFR3)

CITED2 CBP/P300-interacting transactivator with glu/asp-rich carboxy terminal domain 2

CPOX Coproporphyrinogen oxidase

CUBN Cubilin (intrinsic factor-cobalamin receptor)

xxii CYP1A2 Cytochrome P450 family 1, subfamily A, polypeptide 2

DARC Duffy blood group chemokine receptor

ERFE Erythroferrone precursor

EPAS1 Endothelial PAS Domain Protein 1

EPB42 Erythrocyte membrane protein band 4.2

EPB4IL2 Erythrocyte membrane protein band 4.1 Like 2

EPO Erythropoietin

FAM234A codes for ITFG3

FANCA Fanconi anemia complementation group A (family of relevant genes, also includes FANCB, FANCC, FANCD2, FANCE, FANCF, FANCG, FANCI, FANCL, and FANCM)

FBXO7 F-box protein 7

FECH Ferrochelatase

G6PD Glucose-6-phosphate dehydrogenase

GATA1 GATA-binding protein 1

HAMP Hepcidin

HBB Hemoglobin beta subunit (adult)

HBA1/2 Hemoglobin alpha subunits 1 and 2 (fetal and adult)

HBD Hemoglobin delta subunit (adult)

HBG1/2 Hemoglobin gamma subunits (fetal)

HBS1L HBS1-like (S. cerevisiae)

HFE Hemochromatosis gene

HK1 Hexokinase 1

HMBS Hydroxymethylbilane synthase

xxiii HMOX2 Heme oxygenase (decycling) 2

IGHV1-3 Immunoglobulin heavy variable 1-3 (predicted RNA)

JAK2 Janus kinase 2

KCNN4 Potassium intermediate/small conductance calcium-activated channel, subfamily N, member 4

KIT KIT Proto-Oncogene Receptor Tyrosine Kinase

KLF1 Kruppel-like Factor 1

LPIN2 Lipin 2

MARCH8 Membrane Associated Ring-CH-Type Finger 8

MCP AKA CD46, complement regulatory protein (CD46)

MYB v-myb myeloblastosis viral oncogene homolog (avian)

PALB2 Partner and localizer of BRCA2

PGK1 Phosphoglycerate kinase 1

PHI codes for GPI

PKLR Pyruvate kinase, liver and RBC

PPHLN1 Periphilin 1

PRF1 Perforin 1 (pore-forming protein)

PRKCE Protein kinase C epsilon

PRKAG2 Protein Kinase AMP-Activated Non-Catalytic Subunit Gamma 2

PSMB8 Proteasome subunit, beta type, 8

RAD51C RAD51 homolog c (S. cerevisiae)

RCL1 RNA Terminal Phosphate Cyclase Like 1

RP11-687M24 Long non-coding RNA anti-sense to PKNOX2 (PBX/Knotted 1 homeobox 2)

RPL5 Ribosomal protein L5 (family of genes, also includes RPL11 and RPL35A)

xxiv RPS7 Ribosomal protein S7 (family of relevant genes, also includes RPS10, RPS17, RPS19, RPS24, and RPS26)

SEC23B SEC23 homolog B (S. cerevisiae)

SH2B3 SH2B adaptor protein 3

SLC11A2 Solute-carrier (family of genes also includes SLC12A2, SLC19A2, SLC25A38, SLC4A1, and SLC5A2)

SLX4 SLX4 structure-specific endonuclease subunit

SMAD Gene family of signal transducers and transcription modulators (similar to ‘Mad’ [Drosophila] and ‘Sma’ [C. elegans])

SPTA1 Spectrin, alpha, erythrocytic 1

SPTB Spectrin, beta, erythrocytic

STX11 Syntaxin 11

STXBP2 STX-binding protein 2

TF Transferrin

TFR2 Transferrin receptor 2

TRFC Transferrin receptor

THRB Thyroid hormone receptor beta

TMPRSS6 Transmembrane protease, serine 6

UNC13D UNC13 homolog D (C. elegans)

UROD Uroporphyrinogen decarboxylase

UROS Uroporphyrinogen III synthase

xxv

OVERVIEW AND RATIONALE

Red blood cell (RBC) traits capture the molecular properties of red blood cells, which are required to transport oxygen to tissues throughout the body. RBC traits are complex, quantitative measures obtained from a complete blood count (CBC), including three primary traits— hematocrit (HCT), hemoglobin (HGB), and RBC count—and four derived indices—mean corpuscular hemoglobin (MCH), MCH concentration (MCHC), mean corpuscular volume

(MCV), and RBC distribution width (RDW). In combination, these correlated traits are used to describe RBC physiology and diagnose RBC disorders. Additionally, population-level variation in these traits has been associated with a wide array of diseases, ranging from chronic kidney disease to autoimmune disorders1-14.

RBC traits are quantitative and tightly regulated, suggesting that a high number of genes contribute to each trait. A large body of literature has identified and characterized RBC trait- associated genes over the past several decades, particularly in the genome-wide association studies (GWAS) era. Several hundred loci have been identified through genetic-association studies; however, effects from these loci explain only a small proportion of the broad-sense variability expected to be attributed to genetics (referred to as "heritability" or h2)15-31.

Several factors contribute to the fact that genetic loci very likely remain to be discovered, which will be examined in the proposed work. First, RBC trait GWAS have included primarily

European- and East Asian-ancestry populations, creating a biased view of human variation underlying RBC traits32. Ancestrally diverse populations may improve capture of variants that

1 are globally rare or population-specific; furthermore, inclusion of ancestrally diverse study populations is necessary for equitable genomics research 32-40.

Second, previous RBC GWAS have examined RBC traits separately despite evidence of shared molecular pathways, common genetic underpinnings, and statistical innovations enabling examination of multiple traits in aggregate. For example, several loci have been consistently associated with multiple RBC traits in GWAS—including HBS1L/MYB, PRKCE, and RCL1— suggesting that studies that leverage evidence of correlation between RBC traits to increase statistical power may help identify additional novel loci. Such “combined phenotype” studies are enabled by statistical and genomic developments, as well as the improved assessment of genetic variation in racial/ethnic minority populations. Together, these suggest that the time is right to perform a combined-phenotype examination of RBC traits34; 41; 42.

Finally, improved coverage of rare variants via innovations in imputation and genome measurement, as well as a vast array of functional annotations, are now available facilitate fine- mapping and functional characterization of RBC trait association signals. Diverse populations are also more likely to have an increased quantity of rare variants when compared with European populations. The interrogation of rare variants in a large study population, particularly one exhibiting ancestral diversity, may thus be beneficial in identifying additional novel loci that are relevant for RBC traits across all ancestries, as has been demonstrated in rare-variant studies of other complex traits43-48.

Innovative, recently developed methods and the availability of GWAS and phenotype data in the large, ancestrally diverse Population Architecture Using Genomics in Epidemiology

(PAGE) consortium uniquely position us to detect and characterize RBC trait genetic associations that have not been identified in prior studies of primarily European or East Asian

2 participants. Continued identification of such associations can help clarify the genetic architecture of RBC traits and elucidate mechanisms of RBC trait biology. An improved understanding of the suite of genes contributing to RBC traits will benefit both basic science and translational research as they pertain to blood diseases as well as phenotypes strongly associated with RBCs.

3

SPECIFIC AIMS

2.1. Aim 1: Leverage evidence of pleiotropy to identify and characterize RBC trait loci.

Aim 1a. Estimate trans-ethnic and ancestry-specific univariate genome-wide genetic associations in approximately 68,000 African American, European American, and

Hispanic/Latino PAGE participants between approximately 25 million variants (effective heterozygosity >35 by study population) and seven RBC traits (HCT, HGB, RBCC, RDW,

MCH, MCHC, and MCV) using SUGEN software.

Aim 1b. Using summary statistics from Aim 1a, apply the adaptive sum of powered score

(aSPU) test, which integrates evidence of association across multiple correlated traits, to identify potentially pleiotropic loci.

Aim 1c. Using summary statistics from Aim 1a, perform iterative conditional analysis, which will identify independent leads SNPs within association signals and determine the presence of secondary signals, i.e., physically proximal or overlapping association signals with independent index variants. Sensitivity analyses will be performed by race-ethnicity at multi- ethnic combined-phenotype lead SNPs.

Aim 1d. Annotate (novel and previously identified) association signals from Aim 1a and

1b using the ENCODE Variant Expression Predictor (VEP) and GTEx eQTL database to assess potential functional significance of candidate variants.

2.2. Aim 2: Identify rare variants associated with RBC traits via gene-based analysis.

Aim 2a. Among approximately 30,000 self-identified African American, Hispanic/Latino

4 American, Asian American, and Native American MEGA-genotyped PAGE Study participants, perform univariate gene-based testing using SKAT-O—a combined burden and variance- component test—for approximately 25,000 genes in each of seven RBC traits (HCT, HGB,

RBCC, RDW, MCH, MCHC, and MCV).

Aim 2b. Annotate the ontology of candidate genes identified in Aim 2a using publicly available databases to determine biological pathways that are well-represented among genome- wide-significant (GWS, Bonferroni-corrected alpha = 2.87x10-7) findings for one or more RBC traits.

Aim 2c. Annotate the most significant variant for each significant transcript identified in

Aim 2c using publicly available databases—including VEP and Blueprint—to identify globally rare and ancestry-specific candidate functional variants.

5

BACKGROUND AND SIGNIFICANCE

3.1. Historical background

Red blood cells have been recognized as a crucial contributor to human physiology for millennia: as far back as 3000 BCE, Egyptian physicians were cognizant of the connection between the heart and circulatory system, although the physiologic functions of this system were not elucidated until much later49-52. From the medical practices of ancient Greek and Islamic physicians through the European Middle Ages, blood was considered one of the four primary

“humors” and the main contributor to the “sanguine temperament”—concordantly, blood-letting was a common treatment for disease states associated with "excessive blood humor", such as insanity, heart and vascular disease, and menstrual disorders53-55. Although the concept of humors fell out of favor in medicine during the 19th century (upon the acceptance of Virchow’s cell theory and, subsequently, germ theory), the practice of blood-letting remained widespread in the Western world into the 20th century56-60. When the discovery was made in the mid-17th century that all living organisms comprised one or more cells, red blood cells were among the earliest to be characterized given their ready availability from human and animal subjects60. By the mid-20th century, the importance of hemoglobin—the RBC-specific protein responsible for the oxygenation of all bodily tissues—was recognized as contributing to several prevalent monogenic hereditary anemias (see Table 1)61-65. In current medical practice, a complete blood count (CBC) describing RBC traits is ubiquitously used to characterize anemic physiologic

6 states, and can also be employed alongside other signs and symptoms in determining the underlying cause of a wide variety of diseases66-74.

3.2. Red blood cell biology

The primary function of RBCs is to transport oxygen from the lungs to tissues throughout the body. In mammals, RBCs absorb oxygen at the interface of alveoli in the lungs and pulmonary vessels; oxygen is then delivered by RBCs to tissues throughout the body as blood is pumped through the vasculature. The average RBC makes over a thousand journeys per day through the heart to deliver oxygen to peripheral tissues. Another task delegated to RBCs is to clear toxic or waste materials from tissues via the bloodstream. This primarily involves interaction with macrophages in the liver or spleen, in which RBCs release unwanted materials through exocytosis for subsequent uptake by the adjacent macrophage75. Given these functions, which involve membrane deformation to pass through arterial and venous capillaries, a typical enucleated RBC is approximately 1/10 the diameter of a human hair (about 8 microns), with over

20 trillion RBCs in circulation in an adult human at any given time. As the most numerous cell type in the body (similar in number to platelets, which may also be considered cell fragments) responsible for oxygenation of all tissues, the maintenance and development of RBCs are tightly regulated.

RBCs begin their journey as multipotent hematopoietic stem cells (HSCs), which are located in the bone marrow. HSCs are self-renewing, meaning this population—which is the predecessor of every blood cell in the body—is sustained from birth through adulthood via mitosis. The HSC lineage is limited to blood-cell types (meaning they cannot differentiate into non-blood tissues), including all cell types derived from the common lymphoid progenitor and

7 common myeloid progenitor cells. The former differentiate into white blood cells (WBCs), whereas the latter may differentiate into several types of white blood cells or the shared early progenitor cell for RBCs and megakaryocytes (which are the precursor cell that become platelets), appropriately designated "megakaryocyte/erythrocyte progenitors" (MEPs). Up to two million RBCs are degraded every second; hence, a similar number of progenitor cells need to be generated at the same rate to maintain a sufficient RBC population in the body. Although not the subject of this dissertation, 5 to 6 million megakaryocytes originating from the same HSC population must also be terminally differentiated each day to maintain the >100 trillion platelets circulating in the vasculature at a given time76. In combination, erythropoiesis and megakaryopoiesis are extremely demanding, further emphasizing the delicate nature of balancing these highly prolific processes.

3.2.1. Three waves of hematopoiesis

The establishment of an RBC population begins very early in fetal development and the required ubiquity of RBCs for oxygenation of all tissues make understanding the molecular underpinnings of any aspect of their development relevant to public health researchers. In utero

RBC development is very different from the process of RBC differentiation in adults (the latter being the focus of this dissertation) with different genetic underpinnings, and will therefore be explained here only in brief. The first wave of hematopoiesis comes early in development from the primitive streak of mesodermal cells in the yolk sac77. These immature RBCs still contain nuclei and are derived from hemangioblasts, which are a shared precursor cells for blood vessel tissues and blood cells77. During the second wave of hematopoiesis—beginning late in fetal development—erythromyeloid progenitors become the primary progenitor cell from which erythroblasts (i.e., immature RBC containing a nucleus) and, subsequently, mature RBCs, are

8 derived. Erythromyeloid progenitors begin in the yolk sac and form colonies similar to those formed in the bone marrow in adults.

The final wave of hematopoiesis involved emergence of HSCs from the haemogenic endothelium, located in the dorsal aorta of the aorta-gonad-mesonephros region of the fetus77.

This population of HSCs migrates to the liver, expands via multiple rounds of mitosis, and eventually migrates further into the bone marrow78. Once this population of HSCs reaches the bone marrow, the cells become a self-replenishing population from which all RBCs are generated throughout the life of the individual77; 79. Because of the limited lifetime supply of these cells, their self-renewing nature, and the large number of cell types for which they act as the source precursor population, HSCs are susceptible to somatic mutations that can result in blood cancers80-82. Thus, to sustain the genomic integrity of this vital population, HSCs not actively involved in the cell division process preceding hematopoiesis are typically maintained in a quiescent state with minimal transcriptional activity83.

3.2.2. Hemoglobin: the primary actor in RBC oxygen transportation and delivery

Before describing stages of erythropoiesis and RBC degradation, we will briefly describe the hemoglobin protein, given its critical role in RBC function and tissue oxygenation.

Hemoglobin is the protein responsible for RBC transportation of oxygen to tissues throughout the body and return of cellular toxins to the liver for processing. Hemoglobin was first identified by Engelhardt in 1825 and described in terms of chemical composition by Hünefeld in 1840, with several other contributors further characterizing this protein in the mid-19th century, including describing the structure and function in humans84; 85. In addition to having a strong affinity for oxygen, hemoglobin can also bind CO2, allowing for removal of the gas generated in the process of cellular respiration. All vertebrates express some form of the hemoglobin protein,

9 which is believed to have developed from a duplication of the myoglobin gene and subsequent functional evolution beginning several hundred million years ago86.

RBCs developed during the first wave of hematopoiesis express the hemoglobin genes most common during fetal development, and do not enucleate until entering circulation.

Erythroblasts during the second wave of hematopoiesis express primarily the "adult" forms of hemoglobin protein subunits, as the fetal genes begin to be downregulated. Once fetal development has progressed to the final stage of hematopoiesis, the adult form of hemoglobin becomes established as the primary form, which this section will briefly describe87. Tight regulation of hemoglobin production and assembly is required for successful generation of RBC cells from the pool of HSCs in the bone marrow, although several compensatory mechanisms with overlapping functions exist that can make up for minor insufficiencies (see Section 3.2.4).

Human adult hemoglobin is tetrahedron-shaped and comprises four subunits—two alpha and two beta—which are coded by the HBA and HBB genes, respectively. Insufficient hemoglobin transcription, translation, or assembly can lead to abnormal RBC trait values (see

Section 3.3). Several examples of mutations in these genes causing Mendelian RBC-related disorders are described briefly in Section 3.4.1 and Table 1; however, these regions both exhibit strong association in GWAS even after adjustment for Mendelian variants, indicating that common variants that play a more modest role in the distribution of RBC trait values in general populations may exist 18.

3.2.3. Stages of Erythropoiesis and RBC Degradation

HSCs are located in the deep bone marrow in clusters of self-renewing cells and, as highlighted above, are progenitors of multiple cell types including RBCs. The first stage of differentiation after the common MEP is "early erythroblasts", which form clusters around a

10 macrophage island in the deep bone marrow, referred to as "blast-forming units" (BFUs)88; 89.

Through the subsequent stages portrayed in Figure 1, each differentiation stage involves decreased cell diameter and volume; more chromatin condensation; reduced diversity of RNA transcription; and an increased emphasis on hemoglobin production75; 89. After HSCs become committed to the RBC path, all stages of erythroblast differentiation in the bone marrow involve direct physical contact with macrophages through a number of cell-membrane proteins, including erythroblast-macrophage protein (EMP), which is found in the membranes of macrophages and developing RBCs75. As the immature erythrocyte progresses through each stage of differentiation, hemoglobin transcription increases and, eventually, hemoglobin subunits become the dominant transcript, accompanying the dramatic downregulation of most other protein- coding genes prior to enucleation90-92.

In the final stage of differentiation that occurs in the bone marrow, orthochromatic erythroblasts in the sinusoidal bone marrow (near an entry point to the vasculature) expel their nucleus and most of the accompanying cellular machinery in a process referred to as "nuclear extrusion"77; 87;

93. The nucleated portion of the RBC remaining after this division is referred to as a pyrenocyte, which is absorbed by the central macrophage75.

The enucleated cell (a reticulocyte) is then released into circulation from the bone marrow, which circulates for one to two days while membrane composition and remaining hemoglobin translation are finalized, resulting in a terminally differentiated or “mature” erythrocyte (henceforth referred to simply as “RBCs”) with the recognizable biconcave shape.

Of note, the enucleation process is unique to mammals, as lower vertebrates such fish and birds have no or very few anuclear RBCs94. While restricting the RBC to a lifespan limited by the half- life of the assembled hemoglobin protein and cell membrane integrity, enucleation leads to the

11 easily identifiable "donut" shape of RBCs, which allows for extensive deformation of the cell, facilitating travel through capillaries and intercellular spaces to deliver oxygen with extremely high efficiency89. After an average lifespan of 120 days, RBCs become irreparably damaged or enter senescence and are degraded in the red pulp of the spleen. Macrophages phagocytose RBCs no longer fit for circulation, recycling the iron to the bone marrow for new RBC production and hemoglobin loading88; 89.

3.2.4. Iron homeostasis in erythropoiesis

RBCs can only perform their primary function of delivering oxygen molecules to peripheral tissues when sufficient iron (approximately 25mg of iron per day) is available to produce and activate hemoglobin95-97. Free iron ions in the bloodstream are toxic, and must therefore be either stored in bodily tissues or, when present in the blood, bound by transferrin molecules97; 98. The amount of iron stored in the body at a given time is estimated based on measurements of molecules in the serum such as hemoglobin level and ferritin99. Iron stores are built up during fetal development—during which time iron molecules are crucial for a large number of biological processes—and infants are born with an average iron store sufficient for all required tissue uses until 6 months of age, with the difference required to maintain the expanding

HSC population absorbed through the diet. The vast majority of iron molecules used in adult hemoglobin synthesis come not from the diet (which provides a relatively low proportion of the total iron used in erythropoiesis), but rather from macrophages, which accumulate iron stores in the spleen via the degradation of senescent RBCs95. A very small proportion (~4mg or 0.1%) of the body’s total iron content is transferrin-bound at any particular time, meaning nearly the entire supply of iron molecules for hemoglobin synthesis is recycled through the liver or macrophages and released in the form of transferrin-bound iron95; 97. The role non-bone-marrow tissues play in

12 iron storage and, therefore, RBC function, underscores the biologic relevance of genes expressed in those tissues (here, WBCs or liver) for RBC function and maintenance. For example, mutations in CFHR1 have been shown to cause a rare form of hemolytic anemia, but this gene is more highly transcribed in the liver than whole blood or bone marrow100; 101. Given the importance of iron homeostasis to maintaining a healthy RBC population, we briefly characterize the shared biological actors involved in iron homeostasis and erythropoiesis below.

3.2.4.1 Genetic regulation of iron homeostasis

A group of proteins co-regulate iron homeostasis and erythropoiesis, several of which have been implicated in monogenic anemias (Table 1). Briefly, Camaschella, et al, propose the following mechanism of inducing or repressing erythropoiesis, which is highly dependent on iron homeostasis (see Table 2 for a description of the relevant genes). Developing erythroblasts express TFR2 on the cell surface, which modulates the sensitivity of RBCs to erythropoietin

(EPO). The presence of EPO can induce acute erythropoiesis expansion, leading to a relative iron deficiency. Subsequently, the presence of surface TFR2 on erythrocytes is diminished, which increases the effect of EPO, leading to an in-kind increase in erythroferrone (ERFE, which is allowed by decreased BMP/SMAD function). The presence of ERFE leads to a downregulation of hepcidin, followed by increased dietary iron absorption through the small intestine95. Once iron stores have been sufficiently restored, hepcidin is upregulated, leading to suppression of iron absorption through intestinal hepcidin-ferroportin interaction102.

3.3. Red blood cell trait definitions and measurement

A complete blood count (CBC) panel, the most common method of evaluating RBC traits, can be used to assess RBC development and maintenance. RBCs are described using several measurements obtained from a complete blood count (CBC) panel and derived indices

13 that allow for population-level comparison of RBC development and maintenance. Recently developed hematology analyzers are capable of measuring a wide array of parameters such as proportion of hypochromic cells or immature reticulocytes (both of which can be used clinically in anemia diagnosis). However, this dissertation will focus on the seven RBC traits—described below—that are most commonly evaluated in clinical, epidemiological, and genomic research studies: hematocrit (HCT), hemoglobin concentration (HGB), red blood cell count (RBCC), red cell distribution width (RDW), mean corpuscular hemoglobin (MCH), mean corpuscular hemoglobin concentration (MCHC), and mean corpuscular volume (MCV). Separately, each parameter describes a distinct aspect of RBCs that is useful in its own right; in combination,

RBC traits are often helpful for distinguishing the cause of an anemia and potential ways to address that cause. Specifically, basic physiological changes that differ by underlying cause of anemia can lead to an increase or decrease in one or more RBC traits traditionally measured in a

CBC, so using RBC trait values in combination can assist in diagnosis. Associations between the

RBC traits described below and traditional epidemiologic risk factors are described in Section

3.5.1.

3.3.1. HCT

HCT refers to the proportion of whole blood volume made up of red blood cells, and is reported as a percent. Low hematocrit is indicative of anemia, with the normal low reference value for adults in the United States defined as 37% for women and 42% for men103. HCT also directly contributes to blood viscosity—for example, individuals with polycythemia vera (excess

RBC production and hence increased HCT and HGB) have an increased blood viscosity, which can cause inappropriate clotting104.

14 3.3.2. HGB

HGB refers to the concentration of the hemoglobin protein in whole blood, and is reported in grams per deciliter (g/dL). As described above, hemoglobin is the primary protein in

RBCs and is required to effectively transport oxygen to tissues throughout the body. Anemia, defined by the WHO as HGB <13 g/dL in men or <12 g/dL in women, decreases with age even in healthy individuals with no other major risk factors (see Section 4.1.1 for description of how this trend varies by biological sex) 105-108.

3.3.3. RBCC

RBCC refers to the average number of mature (i.e., enucleated) RBCs per unit of volume, typically reported in millions of cells per millimeter cubed (106 cells/mm3) or billions of cells per milliliter (109 cells/mL). The average human produces 2 billion RBCs per day, and any perturbations in this tightly regulated process (in the bone marrow, vasculature, or relevant tissues such as liver or spleen) can lead to excessive or insufficient RBC counts75; 87; 92; 93; 109.

3.3.4. MCH

Mean corpuscular hemoglobin (MCH) refers to the average dry-weight mass of hemoglobin protein per RBC, reported in picograms and calculated as shown in Table 3.

Hemoglobin constitutes over 90% of the dry weight of an RBC, including the cell membrane110;

111. An increased MCH could be due to lower MCV with maintained hemoglobin transcription during erythropoiesis, or increased hemoglobin transcription with other parameters held constant.

MCH varies modestly by sex and race/ethnicity, and increases with age106; 112.

3.3.5. MCHC

Mean corpuscular hemoglobin concentration (MCHC) refers to the average concentration of the hemoglobin protein per RBC, reported in grams per deciliter (g/dL) and calculated as

15 described in Table 3. MCHC increases gradually during the life cycle of an RBC as volume decreases more quickly than hemoglobin content111.

3.3.6. MCV

Mean corpuscular volume (MCV) refers to the average cell size of mature RBCs in an individual, and is reported in femtoliters (fL). MCV decreases throughout the life of the mature

RBC, with the HSC being approximately 10 times as large as a red blood cell at the end of its functional life (Figure 1). Incomplete or delayed enucleation can lead to an increased MCV

(macrocytosis). Conversely, iron deficiency or insufficient hemoglobin production can cause decreased MCV (microcytosis)113-115. Additionally, functional deficiencies regarding vesiculation or hemoglobin retention can affect MCV in either direction111.

3.3.7. RDW

Red cell distribution width (RDW) refers to the intra-individual variation in RBC volume. An abnormally high or low RDW can indicate disruptions in RBC development or failure to remove damaged or senescent cells. There are two methods of reporting RDW—RDW-

SD (SD: standard deviation) and RDW-CV (CV: coefficient of variation). RDW-SD is measured as 2 times the standard deviation of the distribution of mean corpuscular volume within an individual. RDW-CV, the measure to be evaluated in the proposed work, is the coefficient of variation (reported in %) of the distribution of MCV within an individual, calculated as shown in

Table 3. Approximately 20% of the hemoglobin protein mass and 30% of the total volume of mature RBCs is gradually lost post-enucleation, meaning a certain amount of variation in MCV is expected111. High RDW (the combined presence of abnormally small and abnormally large

RBCs with a normal MCV) is also referred to as high anisocytosis, which is associated with multiple physiologic conditions (described in Section 3.6)7. As summarized by Patel, et al, high

16 anisocytosis may be caused by "decreased MCV; increased reticulocyte volume variance;

[heterogeneity of volume reduction (AKA, variability in amount of membrane and hemoglobin lost over the lifespan) in circulating RBCs]; and delayed RBC clearance”116.

3.4. Anemia definition and prevalence

The word "anemia" generally refers to a physiologic state in which RBCs cannot perform their cellular functions due to an insufficiency in RBC differentiation, maintenance, or degradation. The World Health Organization standard definition for anemia is HGB <12g/dL in women or <13g/dL in men; this definition is also commonly used in research as CBCs are frequently available in cohort studies for which any type of biological sample is collected.

Additional criteria can be applied by physicians to identify the cause underlying the anemia; for example, individuals with a normal-range MCV (80-100fL) would follow the diagnostic criteria for normocytic anemia, depending on other RBC traits107. While a general criterion for defining anemia within is useful for determining risk factors relevant to the general population, most specific criteria—based, for example, on MCV—can be beneficial on an individual level for describing the exact type of anemia, which will determine the underlying cause best course of treatment.

Approximately 5.6% of U.S. adults are classified as anemic using the WHO definition, with nearly 2% of the population having moderate or severe anemia (HGB <11g/dL in men and non-pregnant women)117. In the NHANES study population, which is considered broadly representative of US adults, the proportion of men classified as anemic exhibits a linear association with age, whereas adult women not currently pregnant show a bimodal association, with anemia increasing during middle age and late life117. In spite of successful implementation

17 of dietary folate supplementation in cereal flours in much of the developed world beginning in the early 1990s—expected to contribute to the notable decrease in anemia prevalence through the mid-2000s—recent data indicates that anemia is once again on the rise, nearing prevalence estimates in all population subgroups similar to the beginning of the 21st century117-121. Given the topical area of this dissertation, below we briefly introduce monogenic anemias and other spontaneous-onset RBC diseases.

3.4.1. Monogenic RBC anemias and related disorders

RBC traits are, as described above, tightly regulated to maximize efficiency of oxygen transportation; genetic dysregulation of RBC maintenance can therefore result in heritable anemias (i.e., insufficient functional hemoglobin in healthy RBCs to meet tissue-oxygenation demands) with a wide range of severity. Dozens of heritable monogenic anemias have been described which are globally rare but relevantly frequent in populations with ancestry-specific causal variants, demonstrating the highly polygenic nature of RBC traits (Table 1). An example of how these proteins interactions drive molecular processes in RBCs is demonstrated by the interaction network shown in Figure 2. Outside of heritable disorders, somatic mutations in the bone marrow HSC population from which RBCs are derived can lead to blood cancers or other blood disorders. Below, we briefly review the anemias caused by known genetic variants, followed by a description of the most prevalent spontaneous-onset RBC diseases.

3.4.1.1 Recessive anemias associated with malaria-protective variants

A large-effect genetic variant causing a recessive anemia that is fatal prior to reproductive age would be expected to remain extremely rare, unless one copy of the allele (i.e., heterozygosity) provided some kind of evolutionary benefit, allowing it to become established.

In areas of the world where malaria has been endemic for thousands of years, genetic adaptations

18 which confer protection against plasmodium infection have become established in human populations living in these areas122-129. The pathogenic variants described in this section are associated with modest effects on HGB and HCT in heterozygotes, but homozygous effect-allele carriers can be severely affected, accompanied by dramatically abnormal RBC trait values.

3.4.1.1.1 Hemoglobinopathies: sickle-cell disease and thalassemia

An example of a well-characterized hereditary anemia is sickle cell disease (SCD), which in the United States almost exclusively affects individuals with West African ancestry

(prevalence: ~3/1,000 African Americans, 1/36,000 Hispanic/Latino Americans). Sickle-cell disease causes extreme pain, has severe complications, and is fatal in the first two decades of life without treatment130. The most common causal variant for sickle-cell trait is rs334 (which leads to transcription of the inefficient hemoglobin “S” beta chain), which is specific to African- descent populations and causes SCD in homozygous individuals. Several other variants in the beta hemoglobin gene are also associated with RBC sickling and exhibit higher allele frequencies in malaria-endemic regions like the Middle East, Mediterranean countries, and South

Asia. While carriers with one copy of the rs334 effect allele carry protection against infection with malaria, even sickle-cell carriers can have symptoms associated with partial sickling of

RBCs122; 130; 131. The heterozygous state for SCD—sickle cell trait—is also associated with complications during exercise, from cramping to exercise-induced sudden death132. Among

African Americans, compared to having two normal copies of the beta hemoglobin gene, sickle- cell trait has also been associated with venous thromboembolism; ischemic stroke; and kidney- function insufficiency (an association also identified in Hispanic/Latino Americans) 133-136.

Another common recessive monogenic anemia is thalassemia, referring to mutations in either the alpha- or beta- component of the hemoglobin alpha gene that lead to an insufficient number of circulating RBCs rather than the sickling phenotype associated with SCD

19 mutations137. When a mutation in the gene for one hemoglobin subunit leads to a reduction in available protein despite a normal level of transcription, excessive build-up of the other subunit negatively affects erythropoiesis and may also cause hemolysis of mature RBCs in the bloodstream126; 137; 138. Established thalassemia variants are most prevalent in Southeast Asia and

Mediterranean countries, including Cyprus, which has a carrier frequency (i.e., heterozygosity) of 14.3%139. Treatments for moderate to severe thalassemia as well as SCD include frequent blood transfusions, which are accompanied by additional risks including iron overload; alloimmunization leading to extensive delayed hemolysis; and transfusion-related infection, particularly in low-resource areas where extensive screening is not available130; 140-142. Neither condition can currently be cured except with a bone marrow transplant, which is accompanied by a high rejection rate, particularly from unrelated donors; gene therapies have recently been explored as a potential alternative with more modest risks143-147.

3.4.1.1.2 Non-hemoglobinopathy malaria-related anemia genes

Several genes for which heterozygotes obtain protection from malaria aside from the hemoglobin chains have been causally associated with RBC-related conditions. Examples of such genes include G6PD deficiency (an X-linked condition with hemizygous inheritance in males), mutations in DARC, and hereditary spherocytosis due to mutations in ANK116; 129; 148; 149.

Unlike the specific variants most commonly associated with the hemoglobinopathies, there are dozens of causal variants found at the DARC, G6PD, and ANK1 loci. Due to the incapacity of

RBCs to transcribe RNA once the nucleus has been expelled, it is perhaps not surprising that variants in one of the many components of the highly specialized RBC membrane-cytoskeleton complex may cause hereditary anemias150. As described in Table 1, in addition to the aforementioned genes, variants within KEL, SLC4A1, SLC2A1, XK, GYPC, RhD, RHCE,

Protein4.2, ANK1, and SPTA1 have all been identified as causal for monogenic anemias151. As

20 with the rs334, most of these variants became established in a malaria-endemic region of the world, and are therefore either population-specific or exhibit large allele-frequency differences by ancestry.

3.4.1.2 Other monogenic RBC anemias

Aside from genetic variants specifically associated with malaria resistance, several dozen genes have been causally associated with dominant, recessive, or X-linked anemias. These anemias range in frequency from relatively common (hereditary spherocytosis affects approximately 0.05% of Northern Europeans, with about half of cases having causal variants in the ANK1 gene) to extremely rare (Nakajo-Nishimura syndrome, caused by a mutation in

PSMB8, has only had approximately 30 cases reported despite being first described in the 1930s; see Table 1)152; 153. These gene products range in function from transcription factors (KLF1) to proteins involved in glycolysis (PKLR) to ribosomal functions (RPL and RPS family members); several causal genes remain incompletely characterized.

3.4.1.3 RBC disorders attributable to somatic mutations

In addition to heritable anemias, several other monogenic diseases can affect RBCs.

Specifically, erythroleukemia can develop due to a somatic mutation, meaning heritability is very low but several genes have been identified for which causal somatic mutations occur more frequently than would be expected due to the normal mutation rate accompanying replication80;

154. As another example, polycythemia vera—a myeloproliferative disorder—is not typically heritable but does occasionally occur in families, particularly those with a predisposition to a specific JAK2 mutation that has been demonstrated to occur somatically in multiple family members154.

21 3.4.2. Summary

Over 50 monogenic anemias—primarily with a recessive mode of inheritance—have been described and genetically mapped to date, providing a broad insight into RBC physiology.

Among adults in the United States, acquired anemias (i.e., non-monogenic and have a high environmental component) account for the large majority of anemia prevalence117; 155. While environmental risk factors can affect RBC traits sufficiently to lead to an anemic state, the number of genes associated with monogenic anemias alone supports these traits being highly polygenic. While monogenic anemias are important for clinical diagnostics because of the relatively large effect magnitude of most known variants, many more genetic loci likely contribute to RBC traits via common or rare variants with small effect sizes. As described in

Appendix A, gene expression and cellular maintenance are subject to many forms of environmental and genetic regulation. The combined effects of common risk factors and underlying genetic variation are expected to form the distribution of RBC traits within any particular population. Below we discuss RBC trait epidemiology, based primarily on cohort studies, and how traditional risk factors contribute to population-level variation for these traits.

3.5. Red blood cell epidemiology

As described in Section 3.6., RBC trait values are associated with myriad diseases of public health importance97; 156-159. Environmental exposures—including acidity/basicity of the cytoplasm; temperature; osmolarity/tonicity; molecular products released by commensal or pathogenic bacteria (tissue-specific); the contents of extracellular vesicles from other cell types; and availability of minerals or other molecules used in molecular processes—can affect gene expression160-172. RBC traits are frequently measured in cohort studies for several reasons.

22 Specifically, blood draws are a familiar process to most people in the United States and are therefore acceptable to most participants; the small amount required for a hematology analyzer to perform a CBC will typically have no negative health effects; and a blood draw can provide a large amount of information about biomarkers in addition to traditional CBC measures.

Understanding the epidemiology of RBC traits is crucial given the increased risk for negative health outcomes and mortality associated with a change in one or more RBC traits across a wide range of circumstances2; 70; 73; 173-179. Characterizing the relationship between demographic and environmental factors that contribute to RBC traits helps researchers to designate relevant covariates for genomic or other epidemiologic analyses of RBC traits as the outcome(s).

3.5.1. Traditional risk factors

3.5.1.1 Sex

The association between sex and RBC traits varies throughout the life course. HCT and

HGB are both strongly associated with biological sex in adults; prior to puberty, RBC trait values do not differ substantially by sex180. The average HCT and HGB for adult pre-menopausal women are typically over a standard deviation lower than that of men at the same age112; 181.

However, after menopause the trend is diminished: post-menopausal women typically have similar HCT and HGB values compared to men of the same age182. The difference in HCT and

HGB values between men and women has diminished slightly across populations in the United

States in parallel to the required folic-acid supplementation of cereal flours beginning in 1998120.

Women also have a lower RBC count on average than men throughout adulthood, although the underlying mechanism remains unclear105; 181. MCH, MCHC, and MCV differences by sex have

23 not been discussed at length in the literature. In GWAS of RBC traits, sex is typically adjusted for because it is significantly associated with all RBC traits in cohort study populations.

3.5.1.2 Age

All seven RBC traits being evaluated in this work are associated with age in adults.

Anemia is also more prevalent in older adults: as of 2009, approximately 6.8% of women and

9.4% of men in the NHANES study population over age 60 meet WHO HGB criteria for anemia, and 10.2% of men and 6.2% of women in the same age group exhibit macrocytosis120. In male

NHANES participants, average HGB increases by approximately 1g/dL between early adulthood and age 40, then decreases at a similar rate through late life (see below for a description of differences by race)106. Hematocrit and hemoglobin trends are discussed more comprehensively above due to the sex-specific trends of these traits over the life course. MCH increases incrementally but steadily throughout the lifespan, whereas MCHC remains constant on average with age, although the two variables are associated in most study populations (including the

PAGE cohorts participating in this work; data not shown). MCV increases with age in early adulthood and plateaus after middle age; RBC count exhibits a similar pattern, leveling off around middle age in women and peaking at the same time in men, with a slight decreasing in average among older men106. For example, mean MCV increases approximately 3fL over the course of adulthood, from 88fL in 15- to 19-year-olds to 91fL in study participants over 70 the overall NHANES study population (differences in mean MCV by race are described below)106;

183. RDW increases with age independently from MCV 106; 181; 183-185. The aforementioned trends over the lifespan indicate that for many traits the effect may be non-linear and/or sex-dependent.

24 3.5.1.3 Race/Ethnicity

In the United States, African Americans are more likely to have lower values for HCT,

HGB, and MCV across the life course compared to non-Hispanic whites, although these differences are well within a standard deviation for either ethnicity in population-based studies106; 112; 186. For example, NHANES data show HGB values approximately 1g/dL lower in

African American or Mexican American males and females than in non-Hispanic whites in the same age category106.

The aforementioned differences in RBC trait values by race/ethnicity have been observed in some but not all studies, and while trends are visually discernible they are not always statistically significant and may be modified by age112; 138; 184; 187; 188. Whether these differences are due to underlying genetic variation leading to expression differences by ancestry or environmental factors such as diet, socioeconomic status, and smoking has not been comprehensively evaluated155; 184; 187. As described above, known ancestry-specific genetic variants can have large effects on RBC trait values, particularly in the context of malaria- infection resistance16; 122; 126; 129; 148. However, the effect alleles for these variants are not expected to be found at the same frequency in admixed populations such as African Americans or Hispanic/Latino Americans as in African populations with no shared European ancestry, making predicting their effect on population-level trait variability difficult. Additionally, cultural and lifestyle differences between groups with shared ancestry make it difficult to elucidate the exact contribution of such genetic variants on population-based trait estimates189-191.

3.5.1.4 Pregnancy

Physiologic iron stores are depleted more easily in pregnant women due to the physical demands of the growing fetus and accompanying increase in red blood cell mass186; 192; 193.

25 Additionally, the total blood volume necessarily increases during pregnancy. This often leads to mild or moderate anemia; dietary changes and iron supplementation may increase hemoglobin concentration, but evidence is inconclusive on whether such changes positively affect pregnancy outcomes194. The relationships between RBC traits, anemia, and pregnancy are well established.

However, even should a pregnant woman's RBC trait measurements fall within the defined normal range, there is no way to tell what her normal values are while she is not pregnant.

3.5.1.5 Smoking and alcohol use

Both cigarette smoking and alcohol use are associated with increases in HGB independent of other environmental factors105; 195; 196. Furthermore, smoking has been associated with increased MCV, MCHC, and HCT, with multiple analyses suggesting that the effect of smoking on hematological parameters differs by sex105; 196-198. For example, Malenica, et al, report an average 0.8 g/dL higher HGB and 4.5fL higher MCV in smokers than non-smokers

(unadjusted), whereas MCHC, MCH, RBCC, and RDW did not differ significantly between the groups198. RBC count has been reported to be increased in smokers; however it has been suggested that this finding might have been misinterpreted and is due instead to macrocytosis

(abnormally high MCV) with a sustained RBC count198; 199. MCH and MCHC have not been consistently associated with smoking; however, correlation between RBC traits suggests that in large study populations a statistical association may be present.

3.5.1.6 Pharmaceutical effects

There is a well-documented history of pharmaceutical treatments for a wide range of conditions having temporary or permanent effects on RBC traits. In addition to being affected by pharmaceutical compounds introduced to bodily tissues, RBCs also play a role in metabolism of these compounds, including measurable cytochrome P450 activity (which otherwise occurs

26 primarily in the liver)200. The proposed biological mechanisms underlying these conditions suggest that the effects often occur on a dose-response curve rather than having a binary threshold relationship201-207. For example, individuals with CKD receiving synthetic EPO exhibit a dose-response relationship with HCT: Cotter, et al, demonstrate in a population of 14,000 hemodialysis patients that during both initiation phase (1-3 months) and maintenance phase (4-6 months) of treatment an S-shaped curve is present for HCT increase based on EPO dose205.

However, the effects of such medications are often described in the literature in the context of progressing to a clinical condition. Therefore, we will briefly discuss two specific pathologies— iron-deficiency anemia secondary to medication use and drug-induced hemolytic anemia—that represent extreme (and therefore the most straightforwardly measured) instances on a spectrum of RBC trait-pharmaceutical associations.

Over-the-counter non-steroidal anti-inflammatory drugs (NSAIDs) are used frequently— often daily—in over 10% of the adult U.S. population, with or without the recommendation of a physician, and this trend has increased over the last several decades208. Frequent therapeutic use of NSAIDs (for instance, in individuals with autoimmune inflammatory diseases) can lead to a substantial decrease in HGB over a relatively short time period. For example, investigators identified a ≥2g/dL decrease over 6 months among 2 to 6% (depending on the NSAID in question) of CLASS and CONDOR trial participants201. Several mechanisms have been proposed for NSAID-induced anemia, including blood protein loss due to gastrointestinal ulcers; protein deficiency caused by interactions between NSAIDs; and physiological changes caused by the underlying condition for which NSAIDs are used as a treatment (e.g., inflammatory diseases such as rheumatoid arthritis)209-213.

27 Another common effector of changes in RBC traits is chemotherapy due to a causal effect on iron stores or RBC production, which can lead to anemias ranging from mild to severe214-216.

A large proportion of chemotherapy-related anemias can be attributed to iron-store depletion, which can sometimes be successfully treated with intravenous iron or erythropoiesis-stimulating agents215; 217. Patients exhibiting chemotherapy-related anemia can subsequently suffer from dose delay/reduction, which can alter the treatment’s maximum effectiveness if the anemia cannot be controlled218.

3.5.1.7 Other environmental risk factors

In addition to the aforementioned influences, several non-traditional risk factors have also been strongly associated with RBC trait values. Diet in particular is associated with iron absorption and iron store availability during RBC production219; 220. A diet rich in bioavailable iron—as well as simultaneous consumption of high-vitamin C foods to improve iron absorption through the small intestine—increases HGB and RBCC221; 222. Dietary iron comes in heme- and non-heme forms, with the former primarily coming from animal proteins and the latter from plants such as dark leafy greens, beans, and iron-fortified foods in developed countries. While heme-iron is generally considered to be more bioavailable, approximately 85% of an omnivore’s dietary iron comes from non-heme forms, which may lead to small differences in RBC traits due to the high variability of the modern diet. For example, only 10% of non-heme iron is estimated to be bioavailable via human digestion, and vegetarian diets exclude heme-iron-containing meat- based proteins, meaning a much larger amount of total iron per meal is necessary to meet physiologic requirements223; 224. Studies performed in the late 20th century implicated specific chemical compounds in dietary iron absorption: vegetables or beverages (particularly black tea) containing a high amount of phenolic groups, which bind iron, have been demonstrated to reduce

28 iron absorption by as much as 75% when consumed as part of a meal, although these effects can be partly counteracted (up to 50%) by simultaneous consumption of ascorbic acid (Vitamin C)225;

226. However, iron fortification has been mandated by law in a large number of developed countries in the world. While the purpose of this fortification was to improve access to dietary iron in individuals—particularly pregnant women—who could not access sufficient iron, recent work has shown broader benefits, although overall diet must include some form of improved bioavailability (such as ascorbic acid) to meaningfully increase iron absorption223; 227.

Exercise can also have short-term and long-term effects on RBC traits. For example, iron stores are depleted following a session of vigorous physical activity, which leads to reduced hemoglobin and can result in the appearance of anemia in an otherwise-healthy individual228.

The benefits of consistent exercise on RBC traits have also been documented—consistent moderate or intense physical activity leads to increased hemoglobin and red cell total mass in circulation228; 229. Trained athletes may even have a lower measured HCT, referred to as “sports anemia”, due to an increase in plasma volume compared to sedentary individuals229.

In terms of environmental exposures, pollution has been strongly associated with multiple inflammatory cell types, including RBCs. Research has implicated an association between air pollution and hemoglobin protein function, as well as effects of pollutants on the structure and function of RBC membranes230-233. There are more studies evaluating associations between pollution and white blood cell types than RBC traits, however, meaning there is room for more research in this area.

Finally, altitude has long been associated with RBC traits, most notably a decrease in

HGB as altitude increases and oxygen becomes scarcer159. Competitive athletes may take advantage of this short-term adaptation by training at altitude, sometimes going as far as drawing

29 blood after high-elevation training sessions for auto-transfusion prior to competition234; 235.

Global populations that have lived at high altitudes for centuries or millennia even exhibit genetic adaptations to decreased oxygen availability236.

3.5.2. RBC trait measurement and error

The automated machines which output CBC measurements are highly accurate and minimally affected by human measurement error which could decrease precision for quantitative traits237. Particularly when a central laboratory is used for all centers within a study, as is often the case, machines are frequently calibrated and systematic error is unlikely. Therefore, RBC traits usually exhibit both high accuracy and high precision and represent excellent traits for epidemiologic study.

Two specific measures are used to capture within-person or between-person variability in measuring quantitative traits: within-person coefficient of variation (CVw) and between-person coefficient of variation (CVg). For traits with a high CVg, the population-level variability is high, whereas for traits with a high CVw, one measurement is unlikely to accurately capture an individual’s mean trait value (although the population mean is not expected to be affected). CVg in NHANES adult participants, calculated as the ratio of the standard deviation to the mean of each measured trait, ranged from 2.25% (MCHC, men; 2.40 in women) to 9.33% (RBC count, women; 8.83 in men)184, suggesting that there is a modest level of between-person variability for the RBC traits discussed in this work. CVw for multiple measures of each trait (a median of 17 days apart) ranged from 0.37% (MCV in men; equivalent in women [0.31%]) to 3.45% (RDW in

184 women, equivalent in men [2.53%]) . Low CVw and CVg for RBC traits indicate that these values are stable across the short term within the individual and that within-person variability is

30 lower than between-person variability, as would be expected based on the population distribution of RBC traits.

A relevant statistic for capturing the level of variation and the utility of population-based

238 reference intervals for a particular trait is the index of individuality, or the ratio CVw/CVg . A low index of individuality means that the within-person variability is likely to be confined to a narrow range within the population-level reference range of a study population184; 239. In contrast, a high index value indicates a loss of power typically accompanied by an increase in false positives for the exposure-outcome analysis. The number of false positives identified in a study can be reduced by considering repeated measures, but such methods do little to address the decreased ability to detect true positives238. In the study population described above, RBC trait indices of individuality ranged from 0.06 for MCV in both men and women to 0.37 for RBCC in

184 women (0.29 in men) . Both the CVw and CVg are small and unlikely to dramatically affect

184; 198 precision . However, the range in CVw/CVg could be a potential contributor to the disparity in the number of genetic loci associated with MCV compared to HCT, for example. A relatively higher index of individuality could add statistical noise, reducing the ability to detect true associations.

3.5.3. Summary

RBC traits are obtained from one the most convenient biological samples (blood draw) and are frequently measured in cohort studies. Therefore, RBC traits and their contributing factors are well characterized within the general population. Due to measurement via automated hematology analyzers, RBC traits are measured with high accuracy; individual-level variability, at least in the short-term, is also small, although some variation by trait is noted. Demographic sources of RBC trait variability include commonly measured variables such as age, sex, and self-

31 reported race/ethnicity, which are frequently measured in population-based studies and can be adjusted for in analyses of RBC traits. Additional sources of population-level RBC trait variation include individual-level variables such as smoking status, pregnancy, and pharmaceutical use, as well an environmental variables such as altitude or air pollution. Together, these results support the feasibility of epidemiologic studies of RBC traits.

3.6. Clinical and public health significance of RBC traits

In addition to being the relevant phenotype(s) for diagnosing both idiopathic anemias (i.e., an anemia for which the underlying cause cannot be explained by another condition or diseases state) and blood disorders with environmental causes, RBC traits— individually or in combination—have been associated with other diseases. To use CVD as an example, over the past several decades RBC trait values have been associated with stroke, myocardial infarction (MI), coronary heart disease, and heart failure5; 67; 173; 240-242. Given the relevance of anemia to a multitude of diseases, researchers have examined whether variation in

RBC traits in populations with or without anemia are associated with a range of chronic diseases5; 66; 68; 177; 241; 243-246.

Using RDW as an example, we describe below how RBC traits have been associated with clinical diseases in epidemiologic studies. RDW has been used as an indicator for iron-deficiency anemia for several decades, although there are likely multiple different biological mechanisms underlying this association. Research in the past decade have identified associations between

RDW and diseases including systemic lupus erythematosus (SLE); rheumatoid arthritis; heart failure; and incident stroke5; 8; 70; 71; 175. To demonstrate the relevance of RBC trait physiology to chronic complex diseases, we will briefly introduce recently reported associated between RDW

32 and end-stage renal disease (ESRD, the most severe stage of chronic kidney disease [CKD]); heart failure (HF); and ischemic stroke.

3.6.1. RDW and ESRD

Kidney insufficiency prevents effective filtration of blood, which can lead to blood toxicity, edema, and—in the case of untreated kidney failure—death. Patients with ESRD who cannot receive either consistent hemodialysis or a kidney transplant have an extremely short life expectancy. Approximately 661,000 Americans suffer from kidney failure, with large health disparities by race/ethnic background247. For example, among Hispanics/Latinos, progression from CKD to ESRD is expected to occur more quickly than in non-Hispanic whites, given that the prevalence of ESRD in Hispanic US populations is nearly double that in Whites248.

Several studies have found an association between RDW and mortality or kidney function, with one recent study specifically evaluating the association between RDW and mortality in hemodialysis patients1; 249-251. Such studies were performed because understanding the association between RBC traits and events secondary to hemodialysis may inform treatments that can limit such negative outcomes. In a retrospective study of approximately 110,000 DaVita hemodialysis patients for years 2007-2011, investigators identified a positive linear association between RDW and mortality249. Compared to the reference group with RDW values between

15.5% and 16.5% using a fully adjusted model, patients with a baseline RDW of 16.5% to 17.5% had a hazard of mortality that was 16% higher (HR= 1.16; 95% confidence interval [CI]: 1.10—

1.22) and the highest category of RDW (>=17.5%) had a hazard of mortality that was 28% higher (HR = 1.28, 95%CI: 1.24—1.33)249. Potential mechanisms underlying this association involve oxidative stress and low to moderate levels of inflammation, both of which are very commonly found in individuals with ESRD. Several pathways crucial to RBC differentiation and

33 maintenance are affected by inflammation and oxidative stress (and could hence affect RBC heterogeneity), although neither of these mechanisms have been evaluated on a functional level with regard to ESRD.

3.6.2. RDW in Heart Failure

Nearly 6 million Americans suffer from HF, which shares risk factors with many other forms of CVD such as overweight, age (particularly over 65), and a history of MI252. The prevalence of anemia (using the WHO HGB definition described above) among patients hospitalized for acute HF has been reported from 20% to over 50% (compared to slightly over

10% in the general population over age 65), and HF patients with anemia exhibit a higher mortality rate than non-anemic patients175; 253. The primary focus of research on the association between RDW and heart failure has evaluated RDW as a predictor of HF prognosis and outcomes. In the past decade, several studies have identified associations between RDW and HF or associated conditions such as raised left-ventricular end-diastolic pressure (LVEDP) 7; 10; 254-

259. For instance, in a study of over 1,000 patients undergoing elective coronary angiography and followed for five years, increased RDW was associated with increased LVEDP in the entire study population (OR: 1.14, 95% CI: 1.00 to 1.29) and in patients with no history of HF (73% of the study population, OR: 1.37, 95% CI: 1.13-1.67)260. One potential mechanism involved in this association is that high levels of chronic inflammation negatively affect erythropoiesis; another pathway to examine is the shared effect of iron deficiency on both LVEDP and erythropoiesis258;

260; 261. However, neither of these mechanisms has been evaluated in a functional context.

3.6.3. RDW and ischemic stroke

Ischemic stroke is the fifth leading cause of death in the United States262. Although ischemic stroke mortality has been decreasing, racial disparities exist in both incidence of and

34 survival from stroke: African Americans suffer from a disproportionately high stroke mortality

(>2-fold) compared to non-Hispanic whites263; 264. Accurately determining a prognosis for stroke survivors remains difficult based on current standards—in particular, the use of National

Institutes of Health Stroke Score (NIHSS) without other risk factors or prognostic indicators265-

270. Because both stroke and RBC traits are relevant to inflammatory pathways, researchers have been interested in potential associations between RBC traits and stroke outcomes.

Multiple RBC traits have been associated with both severity of the ischemic stroke and

30-day mortality for stroke survivors4; 5; 8; 11; 103; 271. While HCT and HGB specifically show strong associations with stroke outcomes, evidence that RDW is informative even in ESRD patients with normal HGB demonstrates the relevance of studying the association between RDW and other cardiovascular and cerebrovascular diseases4; 6; 8; 249; 261; 272; 273. Within the last three years, several studies have specifically evaluated the association between RDW and ischemic stroke outcomes, primarily focusing on severity of disease and short-term mortality7; 9; 265; 266. For example, a study of 837 stroke patients in an Italian hospital found that a score derived from

RDW at admission, age, and NIHSS had a positive predictive value (PPV) of 0.91 for the highest-risk group (>80% risk of unfavorable outcome), with a high specificity (0.97) but moderate sensitivity (0.37)266.

Similar to ESRD and heart failure, inflammatory mechanisms are suspected to underlie these associations, because increased presence of proinflammatory cytokines in the bloodstream, for instance, can lead to inappropriate clot formation. Specifically, animal studies have shown an association between increased RDW and decreased nitric oxide function, which can negatively affect both the normal vascular constriction/dilation pathways and the ability of the brain to receive appropriate oxygen274. These findings are particularly relevant because prediction of

35 functional decline is an important part of stroke care, and current standards using NIHSS alone have been found wanting266; 269. While knowing the baseline RDW of a patient experiencing an acute stroke may not change the immediate course of treatment during the event, it may be beneficial for determining the patient’s prognosis as well as identifying the best follow-up treatment275. While this dissertation focuses on adult RBC traits, RDW has also recently been implicated in distinguishing stroke from similarly presenting diseases in children, for whom accurate and quick diagnoses are crucial to prevent permanent disability276.

3.7. State of the literature: RBC trait genetics

The relationship between genetic variation and expressed phenotypes was famously first described by the Augustinian friar Gregor Mendel in the late 1850s by demonstrating recessive trait expression patterns in pea plants277. The aptly named theory of Mendelian inheritance—i.e., monogenic traits or diseases that are inherited across generations following Mendel's three laws of inheritance—was the primary driver for discovering the genes underlying many traits, particularly recessive diseases, throughout the late 19th and early 20th centuries278. Genetic segregation of diseases using family pedigrees—particularly when the disease is highly penetrant and the mode of inheritance is either autosomal dominant or recessive—was the primary type of analysis used to identify the causes of hereditary monogenic disorders through the mid-20th century. Extremely detailed information on every individual pedigree member’s phenotype, risk factors, and relationships with other pedigree members (including consanguinity when present) was required in early family studies to ensure accuracy in describing the mode of inheritance so that candidate loci could be established in future analyses65. As described below, the challenges of interval-trait analyses and limited clinical applicability led to less interest in interval traits for

36 QTL linkage analysis and candidate-gene studies, so pedigree-based analyses of monogenic anemias dominated RBC genetic studies prior to the GWAS era.

3.7.1. Linkage analysis studies

Identification of causal loci throughout the second half of the 20th century was primarily performed using linkage analysis in family studies with large pedigrees, in which the mode of inheritance was discernable279. Linkage analysis examines the association between genetic markers (originally several thousand microsatellites distributed across all autosomal ) and a trait of interest by calculating the log of the odds (LOD) score, which utilizes the maximum likelihood of the ratio of recombination rate within observed pedigrees to the recombination rate for independently segregating loci (50%)280. The field-standard significance threshold has typically been a LOD score of 3, representing an odds of 1:1000 that the association exists by chance281. As linkage analysis became established as the premier method for studying monogenic disorders, multiple families with shared phenotypes were used in the same study, which had the added benefit of forcing researchers and clinicians to distinguish between similar phenotypes which resulted from different underlying causal loci282.

The first genetic studies performed in RBC traits or diseases, similar to other monogenic and complex traits, were linkage analyses performed in family-based studies281; 283-285.

Monogenic recessive anemias were some of the earliest diseases for which causal loci were identified, and this represented the vast majority of studies involving RBC traits or diseases prior to 2009 (Table 1). In 1997, for example, Diamond-Blackfan anemia, a very severe anemia typically manifesting in infancy, was mapped using linkage analysis to a 1.8Mb region on the long arm of chromosome 19 for both dominant and recessive forms of the disease286. In the same

37 year, the most common form of congenital dyserythropoietic anemia (CDA II) was mapped to chr20q11.2 with a maximum LOD score of 5.4287.

Family-based linkage analysis was highly successful in identifying causal Mendelian- disease loci, but loci associated with several complex diseases—e.g., non-Lynch syndrome colorectal cancer, obesity, and hypertension—have also been identified281; 288-293. With regard to interval-scale RBC traits, several loci were identified by linkage studies that contain genes with functional roles, such as the HBS1L/MYB locus on chr6q23: in a population of 1,789 people, one copy of the allele was associated with a decrease of 0.074x1012 RBCs/mL294; 295. In another example, a 2013 QTL analysis of RBC traits in an Amish pedigree with 384 individuals identified several blood trait associations previously unreported in linkage analysis, including a peak for RDW at chr10p12. The associated region contains both SEC23B, which has been causally associated with congenital dyserythropoietic anemia II, and BMP2, which is a known modifier of hemochromatosis296-298. Finally, another study in a large Amish pedigree that included 395 individuals with von Willebrand disease identified a significant association with

RBCC at chr4q25299. Abbot, et al, did not implicate any candidate genes in this region; however,

RPL34 is located within the presumed peak at chr4q25, and nine other components of the ribosomal protein complex have been causally associated with Diamond-Blackfan anemia, likely due to the highly proliferative nature of RBCs, making this a reasonable gene to explore in candidate-gene studies300.

Because HSCs are the shared progenitor cells for all blood cell types, erythropoiesis requires a highly specialized procession of initiation and subsequent reduction in specific gene products to result in the final product of mature erythrocytes. Based on molecular analyses of candidate genes and the GWAS literature (both described below), we now know that a large

38 proportion of population-level variability in RBC traits is likely attributable to regulatory variation and that all of the traits are highly polygenic31; 301. Since two copies of each gene across autosomal chromosomes exist for each individual and the effect size of non-coding variants is expected to be small on average, loci including primarily regulatory variants are expected to be difficult to detect using linkage analysis. See Figure S1 and Table 4 for a basic overview of how variation in genetic processes may affect a phenotype and which of those processes will be evaluated in this work.

3.7.2. RBC trait candidate gene studies

Genomic regions associated with complex traits in linkage analysis can be several million base pairs (megabases, or Mb) long, meaning a relatively large number of genes could fall within a suspected causal locus and require further investigation. Candidate gene studies are analyses that focus on a narrow genomic region or small set of genes expected to be causally associated with a trait of interest; historically, these studies were performed on genes within regions implicated by linkage analysis280. Once associations are identified, the most promising functional candidates being carried forward to population-based or molecular analysis.

Several dozen genes identified using a candidate-gene approach have been well characterized in human or mouse molecular models of RBCs, whether for general functional analysis or to identify pathogenic variants (see below)78; 87; 302-307. However, the vast majority of genetic characterization in these studies originated from family-based studies of monogenic RBC diseases—rather than interval traits—as the outcome, although several interesting results have been reported in animal candidate-gene analyses. Hickley, et al, included a review of RBC trait genetic studies which included two Framingham QTL/linkage studies; two additional

QTL/linkage studies in humans; a QTL analysis in baboons; a QTL analysis in mice; seven

39 GWAS including one or more RBC traits (Table 5); and only two candidate gene studies of RBC traits in general population members20; 21; 25; 283; 294; 308-315.

In one of few candidate-gene studies of RBC traits, polymorphisms in EPOR (the erythropoietin receptor) were associated with HCT, leading to a better understanding decades later of the effect of synthetic EPO treatment on anemia in hemodialysis patients316. In another study of participants with a long ancestry of high-altitude living, HGB was associated with population-specific variants in EPAS1, which encodes the known inflammatory protein HIF2α, or hypoxia-inducible factor 2 alpha317. The most comprehensive candidate-gene study of RBC interval traits was performed by Lo, et al, in 2011, evaluating genotyped SNPs in approximately

2,100 genes318. This study was the first to report the now well-studied association between rs1050828 and hemolytic anemia due to G6PD deficiency (although the disorder was first mapped to this gene in 1998 by Puck & Willard)318; 319.

Linkage analysis followed by candidate-gene studies is still the field standard for diseases or phenotypes for which only one or a few loci contribute, particularly when extensive pedigrees or a large number of families are available. However, while the traditional design of candidate- gene studies is not inherently weak, this approach was not generally successful for identifying a large number of true associations among complex quantitative traits31; 320; 321. Recognition of this weakness led to the enthusiastic shift toward GWAS once genotyping assays became commercially available (see Section 3.7.4). Although several associations were successfully reported for RBC traits, linkage analysis followed by candidate-gene studies was not nearly as successful as GWAS have been in identifying the large number of loci expected to contribute to these heritable, highly polygenic traits.

40 3.7.3. RBC molecular genetics

Based in large part on the work evaluating monogenic RBC disorders, known contributors to RBC developmental regulation include a large number of transcription factors for promotion or repression of key sets of genes during the differentiation process, as well as microRNAs (see Appendix A). Three relevant proteins are described below for the purposes of explaining how molecular characterization of genetic loci successfully contributed to understanding the biology of RBCs based on analysis of monogenic RBC disorders prior to the

GWAS era. Ideally, future discovery loci for RBC traits could provide candidate genes that result in similar elucidation of RBC physiologic mechanisms. This section addresses successful molecular characterization of genes associated with monogenic RBC disorders, demonstrating the type of information that could be gleaned from identifying RBC trait loci via population- based studies.

3.7.3.1 HbF (fetal hemoglobin)

Of several genes that encode hemoglobin subunits in humans, the primary subunits expressed after birth are almost exclusively hemoglobin alpha (expressed from the HBA1 gene) and hemoglobin beta (expressed from the HBB gene). Fetal hemoglobin, encoded by the HBG1 and HBG2 genes and resulting in similarly structured gamma subunits, are the secondary subunits expressed alongside the second HBA subunit (transcribed from HBA2) in utero322.

HBG1/2 transcription is diminished to being expressed only in F blood cells after birth, with these cells comprising only approximately 0.6% of total RBCs on average. However, the proportion of total RBCs that are F cells varies up to twenty-fold in adults, and exhibits higher heritability than other RBC traits323-325. BCL11A, HBS1L/MYB, and Xmn1-HBG2 are major

QTLs for HbF persistence in adults109. Furthermore, above-average expression of HbF has been

41 demonstrated to ameliorate the severity of beta thalassemia322. Several genes involved directly in regulating HbF have been identified in GWAS (Section 3.7.4).

3.7.3.2 BCL11A

BCL11A is required for HSC function and plays a large role in the repression of gamma- globin genes during the post-natal hemoglobin switch from primarily α2γ2 (fetal hemoglobin, or

322; 324 HbF) to primarily α2β2 (the primary form of adult hemoglobin, or HbA) . BCL11A deficiency leads to HbF induction in mice322. Three DNAseI sites have been identified upstream of the transcription start site (TSS), with the one at +58kb affecting HbF expression to the greatest extent. The BCL11A locus has been associated with MCH, MCV, RBCC, and RDW in

GWAS (see below)26; 301. Due to the role BCL11A plays in HbF expression, it is a target gene of interest for individuals with insufficient generation of HbA. In fact, some progress has been shown in this area of research in mice for other gene-editing therapies or induction of proteins in vitro, with promising results in early gene-therapy human clinical trials for patients with sickle cell disease or alpha- or beta thalassemia322; 326-329. One could expect that genes regulated directly or indirectly by BCL11A are good candidates for being involved in RBC biology, meaning this transcription factor should be considered in bioinformatic analysis of RBC trait GWAS. As described below, the BCL11A locus has been associated with multiple RBC traits17; 26; 301.

3.7.3.3 KLF1

Kruppel-Like Factor 1, or KLF1, was first discovered in 1992, and is a member of a family of zinc-finger-containing transcription factors found in the cytoplasm as well as the nucleus330. Exclusively expressed in erythroblasts (Figure 1), KLF1 acts as primarily as a transcriptional activator, including regulation of long noncoding RNAs (lncRNAs) and BCL11A transcription, contributing to transcriptional differences by stage of RBC differentiation331.

42 Linkage analysis of a family exhibiting hereditary persistence of fetal hemoglobin (HPFH) led to candidate-gene analysis of KLF1, which identified a heterozygous mutation segregating with the phenotype; functional assays then implicated KLF1 haploinsufficiency with overexpression of hemoglobin gamma subunit in adults332. As described below, the KLF1 locus on chr19 has been associated with both RBCC and MCHC; other loci containing members of the KLF gene family have been associated with other RBC traits301.

3.7.4. Genome-wide association studies

The underlying theory behind GWAS is often referred to as the “common-disease- common-variant” hypothesis, or the idea that a large number of common variants (the majority of which are expected to have small effects) contribute to diseases that occur frequently in most global populations. Linkage disequilibrium—i.e., the likelihood that two variants will be contained on the same haplotype across generations—at an associated locus will determine how physically large the association "peak" is, which is relevant to determining functional candidate variants285. GWAS were designed to address several insufficiencies in traditional linkage- analysis and candidate-gene studies that prevented identification of potential causal loci.

Specifically, complex phenotypes—particularly those expected to have a large number of contributing loci with small effect sizes—demanded a larger sample size with denser genomic coverage than is available using standard linkage-analysis markers. Furthermore, incomplete penetrance or lack of large pedigrees limits the utility of such studies—for instance, a phenotype such as SCD might be reportable across multiple generations, but HGB values were not widely measured before the mid 20th century.

A decade of work by two separate groups attempting to sequence the entire genome culminated in reporting the first human reference genome sequence in 2001333; 334. Genotyping

43 arrays were designed both commercially and at select academic institutions to capture variation on a population level (the first arrays included 100,000 to 450,000 single nucleotide polymorphisms [SNPs]); this was followed by the genotyping of thousands of participants from established cohort studies or biorepositories. Based on the common-disease common-variant hypothesis, leveraging the LD structure of large study populations was expected to contribute findings that represented most, if not all, of the heritable portion of common complex diseases or traits.

Although the methods devised to perform GWAS are more powerful than linkage analysis to detect narrow loci with small effect sizes, one GWAS will not be sufficient to identify all loci associated with the trait of interest. Of particular relevance to translational work, strong associations identified in a GWAS may be subject to "winner's curse", or the false-positive identification of an association which does not replicate in similarly structured populations335-338.

In summary, GWAS are a useful tool for identifying genetic associations with complex quantitative traits, and continued efforts in this area complemented by new methods are expected to produce additional associated loci15; 31; 301.

3.7.4.1 Ancestrally diverse study populations in GWAS

An impressive catalog of genes has been associated with complex traits via GWAS, but missing heritability suggests more associations remain to be identified. While other genomic features may account for some of this heritability, it is likely that more genes affecting these traits remain to be discovered339-341. For the first decade of GWAS, the vast majority of study populations were restricted to European or European-descent populations, a practice driven by convenience as well as ignorance regarding the methodological benefits (never mind the ethical imperative) of including globally representative study participants342-345. Genotyping arrays

44 based on European LD structure and performed in European populations can face issues with long LD blocks which can obscure true signals346. Genes with a functional role in a given trait may exhibit multiple causal variants which are in LD in some populations, leading to the inability to detect independent multiple associations in that population or a dilution of the signal in instances for which the coded alleles have opposite directions of effect347. As described in

Sections 3.8.2. and 3.8.3., study populations which are representative of global ancestral populations improve the ability to identify associations which may be detectable only in one ancestral population but applicable to the outcome of interest in all populations.

3.7.5. RBC trait genetic association studies

To date, 17 GWAS have been performed with various combinations of RBC traits as the outcome(s) of interest16-22; 24-27; 29; 301; 309; 310; 314; 348. Several exome-sequencing efforts and meta- analyses have identified additional signals, some of which have independently replicated27; 28; 30.

For a comprehensive description and lists of known genetic associations with the RBC traits discussed in this work, see Table 5 for highlights. For a highlight of associations that exhibit strong associations in multiple GWAS and have also demonstrated evidence of RBC trait eQTLs.

Of note, while loci reported as discoveries in Table 5 showed the first connection to population- level variability of with RBC traits, a large proportion were already known to play a role in RBC biology (such as EPO, BCL11A, and HK1).

3.7.5.1 Major findings

The first GWAS of RBC traits was performed by Yang, et al, in the Framingham study population—comprising approximately 1,000 people representing 310 families—in 2007, and evaluated 12 total phenotypes including HGB, RBCC, MCV, and MCH310. After adjusting for testing 12 traits on the Affymetrix 100K SNP chip (which included 100,000 genotyped variants

45 designed based on European ancestral variation; p<4.2x10-8), the authors did not report any significant associations for the RBC traits they analyzed. However, they did note suggestive associations for SNPs located in or near known RBC trait candidate genes, including HBB and

EPB4IL2. While this effort was an important contribution to the beginning of the GWAS era, their lack of statistically significant findings foreshadowed future studies recognizing that sufficient sample size, dense genotyping arrays with globally representative ancestral variation, and high-quality imputation are important when conducting population-based genetics analyses with such stringent significance thresholds and power requirements310; 349; 350.

GWAS by Soranzo, et al; Ganesh, et al; and Chambers, et al—published within one month of each other in fall 2009—demonstrated the benefits of improved sample size and methodology17; 25; 314. Of these three similarly timed GWAS, Chambers, et al, primarily focused on characterizing two findings: HFE variant rs199846 for HGB, MCH, MCHC, and MCV; and the functional TMPRSS6 variant rs855791 for HGB314. Although conditional analysis was not performed in this study, the authors described 13 TMPRSS6 variants and five HFE variants that exceeded GWS 5x10-8 in their combined study population314. Soranzo, et al, described findings for eight hematological traits (including five RBC traits) in the HaemGen consortium of approximately 14,000 European study participants25. Among their findings were the same

TMPRSS6 and HFE functional variants described by Chambers, et al, as well as findings that were consistent with Ganesh, et al (TFR2, CCND3) or reported for the first time in that study population (FBXO7)17; 314. Ganesh, et al, identified the highest number of previously unreported

GWS associations of the three studies published in the same month, likely due to both sample size and assessing the log-transformed version of each RBC trait, as prior GWAS had used untransformed traits17. Many of the associations identified in the early RBC trait GWAS have

46 since been replicated and generalized to multiple populations—for example, Ganesh, et al, identified TFRC, MARCH8, and PRKCE (Table 7)17.

The first GWAS of hematological traits in an exclusively non-European population was performed by Kamatani, et al, in 2010, and included nearly 15,000 Japanese individuals in

Japan20. This comprehensive work also evaluated platelet and white blood cell traits, and the authors identified 28 RBC trait discovery associations (21 loci, five of which were GWS in more than one RBC trait), several of which have since been classified as RBC trait eQTLs20; 109.

Although a Japanese-only study population might suffer from similar genetic homogeneity issues as a European-ancestry study population, this study identified a large number of discovery loci, possibly due to differences in LD structure between East Asian- and European-ancestry populations. The same year, Kullo, et al, published a study of RBC traits performed using electronic medical record (EMR) data, evaluating slightly fewer than 500,000 SNPs (the Illumina

Human660W-Quadv genotyping platform, excluding about 50,000 SNPs for QC with no imputation) and identifying one new signal (SLC12A2, which is over 1Mb away from the HFE locus but has not been considered an independent locus in other GWAS) in a study population of slightly over 3,000 individuals21.

In 2012, the first RBC trait GWAS incorporating both African American and European

American study participants was published by Li, et al, and evaluated six RBC traits, white blood cell count, and platelet count in over 14,000 children22. This study was performed in children and therefore possibly less relevant to adult phenotypes as described in the rest of this work.

However, Li, et al, contributed the first extensive conditional analysis of the HBA1/2 region

(chr16p13.3) in a study population with African ancestry, demonstrating the complexity of the

47 genetic architecture and the significance of conditional analysis and fine-mapping to identify all potential independent signals within a genomic region22; 351.

Van der Harst, et al, performed the first RBC trait GWAS in a truly large study population of European whites and South Asians from more than a dozen cohort studies26. The title of the article, “Seventy-five genetic loci influencing the human red blood cell,” demonstrates the relative success obtained by leveraging a large study population to improve statistical power and identify previously unreported loci. Additionally, several loci were chosen by the authors for extensive molecular evaluation, resulting in high-quality functional analysis of traits which could be used as a guideline for future molecular follow-up to GWAS findings.

While van der Harst, et al, contributed a large number of previously unreported loci to consider for further evaluation, the study population was restricted to two race/ethnic groups, and hence would be limited with regard to fine-mapping or identification RBC trait associations led by a rare variant on African or Amerindian ancestral haplotypes, for example26; 34; 352.

Chen, et al (2013), and Hodonsky, et al (2017) performed the first RBC trait GWAS in solely African Americans (COGENT) and Hispanic/Latino Americans (HCHS/SOL), respectively18; 29. Notably, among COGENT participants an African-ancestry-specific 3.7kb deletion was demonstrated to associate independently of other polymorphisms at the alpha- globin locus, which could not be evaluated in a European- or Asian-ancestry-only study population29. While Chen, et al, identified associations at three previously unreported loci (one of which was GWS for two traits), none of the associations were replicated in smaller African

American study populations Children's Hospital of Philadelphia (CHOP) or eMERGE (an

NHGRI-funded network of DNA repositories linked to EMR), nor achieving suggestive significance in non-African American populations. Similarly, attempted replication for seven

48 discovery findings among HCHS/SOL participants was performed in a maximum of 7,100

Hispanic/Latino participants of MESA, WHI, and IMSSM; three of seven associations replicated when accounting for multiple testing18. Both of these studies demonstrate that finding an appropriate study population to replicate or generalize findings identified in non-European- ancestry populations remains a challenge. More importantly, the discovery loci and independent associations within previously reported loci identified in each study emphasize the importance of including ancestrally diverse populations in genomic studies. Inclusive genomic analyses will better capture global variation but can also utilize differences in genetic architecture— particularly in populations with ancestral admixture—to better allow identification of narrow candidate functional regions well suited for molecular characterization.

Recently, Astle, et al, published a GWAS of over two dozen WBC, RBC, and platelet traits in 172,000 European-ancestry individuals, including the UK BioBank301. This study reported a vastly higher number of novel associations (>1,000 independent associations across all RBC traits) compared to previous studies, due in part to sheer sample size and improved imputation quality allowing for better representation of genetic variation across the genome.

However, this effort was only recently published, meaning attempts to generalize and subsequent fine-mapping of these associations in non-European populations, for instance, have yet to be reported. Such attempts could provide a more reasonable number of candidate functional genes into focus for further investigation. In a similarly well-powered study, Kanai, et al, published a

GWAS of over 50 traits in 2018—including several quantitative hematological traits—using data from a large Japanese BioBank (n~108,000)348. This work highlighted the relevance of known loci across ancestral populations, and identified an additional 44 previously unreported associations with regard to blood traits.

49 3.7.5.2 Summary

Together, published RBC trait GWAS have repeatedly identified common- or infrequent variants contributing to population-level RBC trait variation at genes with known Mendelian- disease variants. For example, the AK1 locus was associated with RBCC, HCT, and HGB in the

Astle, et al, study, even though the prevalence of the associated monogenic condition, hemolytic anemia due to adenylate kinase deficiency, is very rare in the general population (Table 1)301.

Some of these findings have functions specific to RBCs, whereas some have biological implications for a wide range of tissues, suggesting closer examination of potential functional candidates is merited to determine why significant associations with RBC traits are reported.

Furthermore, hundreds of associations have been identified through GWAS that have not been reported in the candidate-gene literature for monogenic anemias, some of which have been confirmed with in vitro or animal model experimentation26. However, based on RBC trait heritability estimates and the proportion of population variability explained by known loci, there remain additional unreported loci that are causally associated with one or more RBC traits15. As has recently been demonstrated, study populations exclusively representing European are more likely to identify genetic associations that are difficult to characterize because of large LD blocks. Improved representation of global ancestries will improve discovery, particularly for studies including populations exhibiting ancestral admixture301; 344. Additionally, even if a

GWAS index variant is population-specific, identifying the associated gene can yield mechanistic insight into trait physiology relevant for all populations.

50 3.8. State of the literature: quantitative trait analysis

The underlying philosophy driving GWAS stipulates that one or more functional variants within an LD-based region affect gene expression, and this transcription level change sufficiently affects the amount of protein end-product to affect the phenotype of interest. However, the assumptions underlying univariate analysis have failed to account for the proportion of heritability measured in many complex quantitative traits, meaning that one of the following situations may apply: (1) heritability estimates are failing to account for another portion of population-level trait variation that is neither purely genetic nor environmental in nature (see

Appendix A for a brief introduction to epigenetics); (2) a large number of contributing loci remain to be identified for complex quantitative traits; or (3) some combination of both340; 341; 353-

355.

As described above, GWAS have been extremely prolific in terms of identifying genomic loci that exhibit strong evidence of association with complex traits15; 356. However, inconsistencies between estimated and explained heritability,357-359 proportional increases in variance explained when considering all GWAS variants,360-363 and increases in GWAS- identified novel loci with increasing sample size343; 362; 364-388 all suggest that multiple loci remain unidentified.

3.8.1. Pleiotropy

One avenue to address missing heritability is by exploiting the correlation between traits with a suspected shared genetic basis (i.e., “pleiotropy”) to increase statistical power. The concept of pleiotropy was first discussed in the 1950s, and animal models, particularly D.

Melanogaster, have been used in the molecular evaluation of pleiotropy for over 50 years389-392.

Many syndromic genetic disorders that are monogenic but accompanied by a wide range of

51 phenotypes, such as CHARGE syndrome or velocardiofacial syndrome, exhibit pleiotropy by definition393-396. Without identifying the actual genes at play, several family-based studies have successfully used pairs or groups of traits to identify pleiotropic mechanisms underlying phenotypes such as metabolic traits and lipid biomarkers since the early 90s397-399. Correlated traits have long been understood to have a shared genetic basis, and highly polygenic complex traits in particular have been understood to be good candidates for pleiotropy since the development of quantitative trait linkage analysis methods400; 401.

Since the outset of population-based genomic analyses, multiple complex traits being associated with the same locus has been demonstrated for multiple seemingly distinct groups of phenotypes or diseases401-404. Over the past decade it has become clear that pleiotropy is extremely common among complex traits that share biological underpinnings, including RBC traits—within five years of the first GWAS, nearly 5% of GWAS loci were reported to be associated with more than one trait400; 401; 403; 405. As a specific example, GWAS of autoimmune disorders have shown that rheumatoid arthritis (RA), systemic lupus erythematosus (SLE), and

Sjoegren’s Syndrome largely overlap in symptomatology, relevant biomarkers, and genetic- association signatures14; 402; 406-411. These diseases typically do not present together in individuals but they do commonly co-occur in families, suggesting that overlapping genetic architecture of these diseases includes both shared and disease-specific causal variants12; 14; 402; 403; 406; 412.

Therefore, it stands to reason that diseases like RA, SLE, and Sjoegren’s Syndrome may also share a genetic signature that could be identified by incorporating information from all of the traits into one analysis. In a gene-expression meta-analysis of RA, SLE, and Sjoegren’s

Syndrome, Toro-Dominguez, et al, identified 371 genes that were significantly differentially expressed between individuals with one or more auto-immune diseases compared to controls12.

52 In a separate cross-phenotype analysis of immune-disease risk SNPs in seven auto-immune diseases, 47% of SNPs tested were associated with more than one but not all of the phenotypes tested14. Continuing to leverage correlation among traits is important in the genomics era: identification of loci with shared effects across traits of interest as well as clarification of the molecular relationships between correlated traits could lead to new application of treatments that have been successful in diseases with shared genetic architecture413; 414.

Despite the long-standing awareness of pleiotropy in both monogenic and complex phenotypes, genomic studies of pleiotropy have been mostly limited to evaluating shared signals in traits that have been analyzed individually. Methods to evaluate correlated traits in combination have only been developed within the last decade, and implementation of these methods requires a sufficient sample size, often deeply phenotyped data, and access to highly- powered computing clusters. Recently introduced methods have shown the value of analyzing multiple related traits together that are expected to share an underlying biology. Combined- phenotype analysis has been successful in identifying associations that were not significant in univariate analyses of any one trait for groups of phenotypes known to share biological pathways and overlap in physiologic mechanisms 415-417. Below we describe the mechanisms behind pleiotropy as well as the benefits of co-examining traits that likely share genetic underpinnings.

3.8.1.1 Definition and description of biological pleiotropy

In order to describe why assessing pleiotropy improves on standard GWAS methodology, we must consider the mechanism(s) underlying an association signal shared by multiple phenotypes418. As described by Solovieff, et al, in 2013, pleiotropy identified in a genetic association study may be biological, mediated, or spurious in nature402. This proposal will focus

53 on biological, or “true”, pleiotropy. Descriptions in the literature vary widely with regard to whether the term “pleiotropy” refers to a single causal variant, a locus, or a gene. Here, we define pleiotropy as the effect of one variant on multiple phenotypes (in this work, RBC traits), without making claims as to the relevant tissue type(s) or proposed function aside from that which can be implicated via bioinformatics analysis (Section 3.8.2.3.). The expectation of identifying "true" pleiotropic variants is that performing validation studies in a molecular setting would characterize one or more causal SNPs with a measurable effect on multiple associated traits. While we cannot assign functional pleiotropy to potential causal variants within the scope of this work, combined-phenotype analysis of correlated traits which are suspected to exhibit overlapping genetic architecture has been used as a proxy for pleiotropy with success419; 420.

3.8.1.2 Combined-phenotype studies

As described above, correlated traits often show a moderate to significant amount of overlap in associated genomic loci, suggesting a potential point of leverage for identifying loci associated with multiple correlated phenotypes but not previously reported for any of them individually. Correlation among a set of traits that also share some biological underpinnings makes such traits excellent candidates for pleiotropy402; 421; 422. Many groups of correlated phenotypes are expected to share some underlying biology, but this potentially useful feature cannot be utilized in traditional GWAS. While pleiotropy is suspected to be the genomic feature at work behind many shared association signals across correlated traits, causality cannot be assigned directly in association studies. Such limitations have prompted use of the phrase

“combined phenotype association” or “multi-phenotype association” in the absence of a systematic and comprehensive interrogation of pleiotropy.402

54 3.8.1.2.1 Combined-phenotype analysis using summary statistics

One area of active research is the combination of univariate GWAS summary statistics for combined-phenotype inference. Compared to individual-level data, analysis of summary statistics for multiple phenotypes benefits when population sizes differ across traits, or summary statistics for the traits of interest are publicly available (for instance, from dbGAP)345; 347; 419; 423.

Ours and others' work has demonstrated that these methods are scalable to densely imputed

GWAS data; are accurate for common (MAF>0.05) as well as uncommon (0.01

Many combined-phenotype and multi-phenotype methods are currently available; we present a comprehensive evaluation of eight methods in Section 4.4.3, which indicates that the adaptive sum of powered score (aSPU) test is well-suited to the proposed work. In this supportive work, we also performed combined-phenotype analysis of published body-mass index

(BMI),364 waist-hip ratio,365 fasting glucose, and fasting insulin424 summary statistics (HapMap 2 imputed data) from consortia of predominantly EU participants (N range: 51,750 [fasting insulin]

– 339,224 [BMI]) to interrogate potential pleiotropy425; 426. Six previously unreported loci were significantly associated (p < 5.0x10-8) with aggregated information for all four phenotypes, but none of the traits in univariate analysis. These results show the utility of combined-phenotype analyses to identify novel loci undetected by meta-GWAS.

3.8.1.3 Leveraging RBC trait correlation for evidence of shared genetic associations

Identifying pleiotropic loci in the context of multiple traits can benefit basic-science and translational research, as a shared effect on more than one trait may help elucidate undescribed molecular pathways contributing to that trait, or in some cases even implicate shared treatment options for disease phenotypes401; 402; 427. Understanding how overall gene expression affects one disease (likely due to multiple causal variants which may or may not be shared by other diseases)

55 can also improve understanding of that gene’s biological role across phenotypes413; 418; 428; 429.

RBC traits exhibit a range of correlation, with several trait pairs being moderately or highly correlated. While trait correlation is not necessary for pleiotropy to exist, highly correlated traits can be an indicator of a common underlying mechanism, either genetic or environmental, with pleiotropy being a reasonable expectation347.

Of equal import, RBC trait GWAS have uncovered associations that consistently replicate across multiple traits and diverse study populations17; 18; 26; 29; 301. Table 6 demonstrates that while some signals associate strongly and repeatedly with only one or two RBC traits, several association signals have been found for multiple traits across study populations. In combination with the fact that visualization of results shows a distribution including suggestively significant association signals with strong tails in many published RBC trait GWAS, available evidence suggests that RBC traits exhibit pleiotropy and are therefore excellent candidates for a combined-phenotype approach28.

3.8.1.4 Benefits of combined-phenotype analysis

There are multiple benefits to analysis methods that leverage correlation across traits, beginning with statistical gains over traditional GWAS methods418; 430. The number of suggestively significant associations in univariate GWAS—some of which are almost certainly true associations that the analysis is underpowered to detect—implies that increased power would improve the ability to identify new associations. Recently developed methods combining trait summary statistics demonstrate a cost-effective gain in statistical power compared to the alternative of analyzing hundreds of thousands of individuals for each trait301; 425; 430; 431.

Statistical power is increased for discovery when an association is present for multiple traits, without an accompanying negative effect on Type I error421; 432. This is particularly meaningful

56 for highly polygenic traits which likely have a large number of contributing causal variants with

(primarily) small effect sizes that are influenced by lack of precision and relatively small study populations301; 433; 434.

Additionally, fine-mapping of known and previously unreported loci (described more fully in Section 3.8.2) to detect independent associations as well as identifying potential causal variants can be improved when analyzing traits together435. Recent work indicates that fine- mapping of multiple correlated traits can significantly reduce the number of SNPs required to identify the 90% credible set compared to fine-mapping phenotype results individually.

Additionally, shared genetic architecture among traits with different units of measurement (i.e., variable effect sizes) can benefit from a combination of summary statistics which allows for difference in effect by trait (see Section 4.4.4.4.)347; 436. In combination, the benefits of combined-phenotype analysis show great promise to detect and characterize loci underlying a shared genetic architecture.

3.8.1.5 Limitations of combined-phenotype analysis

Several limitations to combined-phenotype analysis do merit examination. Foremost, all of the above pleiotropic mechanisms can be implied but not confirmed by association studies; evidence of pleiotropy requires more extensive functional evaluation than can be ascertained from association studies alone. Additionally, fine mapping of combined-phenotype results cannot detect trait-specific associations within one association signal—fine-mapping methods only test one variant per independent association signal but allelic heterogeneity is common, particularly when multiple tissues may be involved in the physiology of the outcomes. If multiple variants contribute to one signal via different traits, narrowing down a list of causal variants may be difficult or inaccurate 419; 435; 437; 438.

57 GWAS of RBC traits have consistently demonstrated replication in some traits for association signals that are rarely or never found in other traits—for example, the functional variant rs855791 in TMPRSS6 strongly associates with multiple RBC traits but has never been reported for RBC count (Table 6). While our definition of pleiotropy and proposed combined- phenotype analysis methods emphasize the shared effect of a single variant within an association signal on correlated traits, the importance of assessing pleiotropy at the gene level cannot be overlooked, although it outside the scope of the proposed project. Genes expressed in multiple tissue types often exhibit tissue-specific isoforms, and causal variants in shared (or variably expressed) exons may lead to true pleiotropy even in diseases that present very differently428; 439-

442. Finally, some methods for combined analysis of summary statistics from multiple phenotypes require extensive computing resources for simulations, preventing analysts from differentiating between signals that exceed our proposed GWS threshold of ~8x10-9. However, our continued evaluation and refinement of available methods suggests several alternatives that are under investigation. Overall, none of these limitations preclude combined-phenotype analysis from being a useful tool to identify previously unreported genetic associations with RBC traits, and their benefits suggest that they will be useful for identifying associations which do not exceed

GWS in univariate analyses.

3.8.1.6 Summary

The proposed work will leverage RBC trait correlation in a diverse study population and established combined-phenotype analysis methods to attempt to identify genomic loci that previous studies were underpowered to detect. We hypothesize that the improvement in power gained by evaluating traits with a shared genetic architecture will enable the identification of genetic loci associated with multiple RBC traits, which is beneficial not only for discovery

58 purposes but also for contextualizing the fine-mapping results of combined-phenotype genome- wide-significant loci.

3.8.2. Fine-mapping association signals identified in combined-phenotype GWAS

Countless studies have demonstrated that—due to myriad factors including assay design, allelic heterogeneity and ancestral allele differences as described above, and genotyping/imputation quality—the index SNP within a particular association signal is rarely the causal variant 15; 31; 321; 443-445. Multiple functional variants may exist at one association signal within one or more ancestral subpopulations, and the presence of multiple associations may be obscured by LD with one tag SNP351. Additionally, many genetic variants exhibit some level of ancestry specificity (see below)—namely, their allele frequencies differ across global populations or may even be monomorphic in one or more continental ancestry groups. This becomes particularly problematic when considering that the only genotyping assays commercially available until very recently were biased toward European content, constraining studies of diverse populations.

Once genome-wide-significant association signals for a specific trait (or group of correlated traits) have been determined in a GWAS, identifying and characterizing the potential causal variant(s) within each association signal becomes the primary task. Termed “fine- mapping”, this approach attempts to refine loci identified through association studies include visualization of the LD structure within an association signal and comparison of that structure across ancestral populations. In addition to identifying variants most likely to affect a trait of interest, elucidating the LD structure underlying each association signal is merited in order to identify potential independent hits—i.e., the presence of more than one causal variant within one overlapping association.

59 Once a list of potential causal variants has been generated for each GWS association signal, bioinformatics analysis (see below) can aid in identifying which genes may have affected expression based on an individual's alleles. As referenced by van der Harst, et al, a large number of genes can cluster within a small portion of the genome (over 3,000 protein-coding genes were located within 75 GWS loci identified in their study population). Furthermore, up to 95% of genes with more than one exon have multiple reported isoforms, and gene expression is frequently affected by multiple TFBS that may be bound in different combinations across tissues, so multiple causal non-coding variants is a logical possibility for any genetic association446-448.

Fine-mapping can be extremely valuable for narrowing association signals to reduce the number of functional candidate variants. Independent signals within the same physical genomic region may exist, adding to the necessity of defining potential functional variants so functional variation within the locus can be described445. The primary benefits of fine mapping therefore include the ability to characterize independent signals within a locus using LD, as well as identifying a smaller number of potential functional variants for further characterization based on LD with the lead SNP at each association signal. However, in the case of combined-phenotype analysis results, it is possible for independent, physically proximal variants to be differentially causal for multiple RBC traits but contribute to the same association signal. In this situation, fine-mapping can only identify one candidate causal variant per association signal for all traits. This is an admitted limitation of the proposed work, but the broad overlap in biological pathways known to affect RBC traits suggests that prioritization of a single proposed functional variant per combined-phenotype association signal is still merited.

60 3.8.2.1 Allelic heterogeneity and ancestry-specific variation in fine-mapping

Allelic heterogeneity refers to the presence of multiple causal genetic variants within one phenotype-associated locus, or the possibility of multiple regulatory mechanisms being affected for one gene. Causal variants within a locus can act separately but additively; counteract each other; or exhibit interaction if they are physically proximal or affected by the same regulatory pathway444; 449; 450. For example, variants located in two physically proximal transcription factor binding sites (TFBS) may both affect expression of one gene, but these may be detected as one signal (or go undetected) without both a diverse study population and fine-mapping efforts.

Allelic heterogeneity is well-described for many Mendelian disorders—for example, an entire database has been established to describe the heterogeneity contributing to the hemoglobinopathies described above451; 452. This phenomenon has also been widely observed for complex traits, with a recently developed method implicating a substantial proportion of eQTLs

(4-23%) and GWAS of several complex traits (35% for high-density lipoprotein concentrations and 50% for major depressive disorder)444. Sometimes allelic heterogeneity can be detected using conditional analysis, but other methods have shown promise in identifying the presence of multiple causal alleles within one GWAS signal, which benefit particularly from high-density genotyping and imputation of low-frequency or population-specific variants435; 437; 453; 454.

As described above, the majority of published GWAS to date have been performed in exclusively or primarily European-ancestry populations38; 39; 455. The inclusion of ancestrally diverse study populations improves fine-mapping particularly in the arena of being able to detect variants that are very rare or monomorphic in European populations351; 437; 456. In particular, populations exhibiting ancestral admixture (e.g., Hispanics/Latinos and African Americans) provide excellent opportunities for fine-mapping due to shorter shared haplotypes of the

61 contributing continental ancestries450; 457; 458. For example, a fine-mapping analysis in approximately 3,000 Mexican Americans of a large region containing several genes identified rs964184 as the only SNP in the 99% credible interval, exhibiting a strong association for triglyceride concentration459. This SNP has a much higher allele frequency in Native Americans than European-ancestry populations (MAF ~50% vs ~15%), corresponding to the admixture mapping association between Native American ancestry and the outcome. As this study demonstrates, non-coding ancestry-specific variants or variants with large allele-frequency differences across populations are good functional candidates; in this context, multi-ethnic fine- mapping has been shown to greatly reduce the number of candidate functional variants.437; 460-463.

In genomic loci exhibiting allelic heterogeneity (which is typically unknown in GWAS association signals), fine-mapping subsequent to conditional analysis can detect independent associations that may be haplotype-specific, which benefits the characterization of the biology underlying those loci437; 450; 464; 465. In light of the large number of loci involved in quantitative traits like RBC traits, fine-mapping is an important step for clarifying which variants may be good functional candidates when multiple may exist within one association signal in a large, trans-ethnic study population.

3.8.2.2 Bioinformatic characterization of loci identified in association studies

Within molecular genetics, a vast body of work has demonstrated that gene expression can be affected—even to the extreme of binary expression—by ablation or insertion of one nucleotide within a crucial TFBS466-469. Although molecular genetics have been employed for decades longer than cohort-based association studies, fine-mapping of GWAS associations has resulted in the successful identification of causal variants for multiple complex traits, including

RBC traits17; 18; 26. For example, the GWAS of RBC traits by van der Harst, et al, identified 121

62 candidate genes for 75 GWS associations with one or more RBC traits using a combination of

LD information, eQTL data, and pathway analysis; they further identified 100 candidate coding or regulatory sequence variants based on publicly available databases26; 470; 471. These candidate genes showed increased RBC-specific expression patterns compared to other genes when evaluated in vitro; RNA silencing experimentation in D. melanogaster also supported the role of several previously unreported candidate genes in RBC trait biology26. Furthermore, assessment of genomic segments representing regulatory function showed enrichment for RBC-related cell types compared to other cell types26.

For these reasons, bioinformatic characterization of GWS association signals is important to prioritize potential causal variants based on available functional data, particularly via publicly available databases (examples representing the most common types of bioinformatic analysis are presented in Table 9). Computational algorithms which predict the pathogenicity of candidate functional variants have also been available for over a decade472-476. In the last five years, predictive algorithms have been designed for identifying shared pathways, protein-interaction networks (Figure 2), and tissue-specific gene expression based on information ranging from

DNA sequence alone to molecular experimentation results. However, these predictive algorithms, while useful, are rarely tissue- or trait-specific, focusing instead on comparing final protein products for coding variants or the severity of transcription factor binding site changes.

Additional resources have been developed to evaluate potential functional variants by incorporating multiple annotations, such as tissue-specific chromatin immunoprecipitation

(ChIP) sequencing data or expression quantitative trait loci (eQTLs)476-481. To address the characterization of loci and prioritization of variants, other methods have been developed which utilize resources such as in vitro characterization of gene expression in different tissue types482.

63 FastPAINTOR and Mantra are two examples of programs designed to prioritize potential causal variants in fine-mapping studies435; 483. Identification of functional variants is the purpose behind genomic studies, because it will improve an understanding of human genetic architecture as well as uncovering the role of particular variants in a trait of interest484. Bioinformatic analysis is an excellent way to ascertain potential functional candidates within a region.

3.8.3. Association studies of rare variants

Rare-variant studies are motivated by the same reason as common-variant studies: to identify genomic regions and, specifically, variants that may be causally associated with a phenotype of interest. Studies of complex traits such as metabolic disease, psychiatric disorders, and hypertension have identified rare- or low-frequency variants located in genomic regions not previously reported for these traits485-490. A rare variant is typically defined in as one which exhibits a minor allele frequency (MAF) < 0.01 in the total study population, and we will adhere to that standard in this work (Section 4.5.). Highly polygenic complex phenotypes are expected to be affected by a large number of both common and rare variants with variable effect sizes, with some combinations of allele frequency and effect magnitude being more difficult to detect than others using traditional GWAS43; 433.

Inclusion of rare variants in association studies has only really been possible within the last five years due to improved genotyping quality; better assay coverage of non-European populations; increased-accuracy imputation methods; and the introduction of new forms of individual data (specifically, exome or whole-genome sequencing). Standard single-variant examination of rare variants exhibits reduced power compared to common-variant association studies by definition; even a reasonably large study population is unlikely to have enough copies of one rare variant to detect a true association491; 492. For this reason, rare variants are not

64 typically reported in standard GWAS unless the study population is well-powered to detect the effects of rare variants, instead being evaluated using methods that test combined effects of multiple variants43; 491; 492.

Exome-sequencing or whole-genome-sequencing studies are expected to further improve detection of significant associations driven by rare variants because direct measurement rather than genotyping allows for the capture of population-specific variants that are difficult to impute from genotyping arrays491. However, array-based exome-chip studies have shown success in identifying previously unreported associations for complex traits—for example, Nandakumar, et al, identified previously unreported rare variants associated with systolic- or diastolic blood pressure in ten genes in a study of nearly 16,000 African American participants using gene-based testing and single-variant analysis44; 493. While exome chips typically have better coverage of rare coding variants than genotyping arrays, it can be anticipated that high-quality imputation to a genotyping array with good coverage of variants representing multiple ancestries could be similarly successful to earlier efforts. For example, the Illumina Infinium Exome-24 v1.1

Beadchip contains 227,000 exonic variants based on sequencing data from 12,000 individuals of

European, African, Chinese, and Hispanic race/ethnic backgrounds; in contrast, the MEGA array contains 400,000 exonic variants based on sequencing data from over 36,000 individuals of diverse race/ethnic backgrounds, with 16,000 additional hand-curated variants improving coverage for candidate genes/regions from eight groups of traits494; 495. These findings demonstrate that until WGS data is widely available, high-quality imputation to genotyping arrays is a strong method for identifying genomic associations.

65 3.8.3.1 Rare-variant studies in RBC traits

As established above, a significant proportion of RBC trait narrow-sense heritability remains to be explained496; 497. Given the broad array of biological pathways in which rare variants are known to cause monogenic RBC disorders, it is reasonable to expect that rare variants with modest effect sizes contribute to similarly modest effects on RBC traits in loci that have yet to be reported in an association study. With regard to rare-variant studies in RBC traits, two exome-sequencing studies have been performed (see Table 5), but rare-variant findings reported outside of traditional GWAS have been limited because traditional methods with modest study populations are underpowered for rare-variant studies, even when sequencing data is available28; 30; 491. Chami, et al, performed a large exome-genotyping study incorporating both single-variant and gene-based analysis in a majority European-ancestry study population in collaboration with the Blood Cell Consortium (BCX)28. While the study authors reported 21 previously unreported associations with RBC traits, several of those findings were located near genes already known to cause monogenic hereditary anemias and only four lead SNPs were low- frequency or rare (MAF<5%) in the study population. Although the BCX study population comprises primarily European-ancestry study participants, approximately 20% of participants reported having African, East Asian, Hispanic/Latino, or South Asian ancestry, demonstrating that the inclusion of ancestrally-diverse study populations and dense coverage of the genome improve the ability to detect associations.

3.8.3.2 Gene-based testing

One approach to addressing the low power anticipated when testing rare variants (see below) is gene-based testing, or analysis of the combined effects of multiple rare variants within a predefined genomic region. Within the field of rare-variant analysis, a gene is typically defined

66 as all exons and splice site variants for all isoforms of one protein-coding open-reading-frame; sometimes the promoter region (100-1000bp) upstream of the first TSS is included. Several rare- variant studies have successfully implemented gene-based analyses to identify previously unreported genomic loci for quantitative traits488-490; 498. The aforementioned BCX study highlights the relevance of gene-based testing by only reporting one significant association in single-variant analyses (rs201062903, a missense variant at the ALAS2 locus for MCH and MCV which is known to cause sideroblastic anemia), but identifying two additional genic associations in gene-based testing (PKLR for HCT and HGB as well as ALPK3 for MCHC)28. Gene-based findings reinforce the theory that a wide range of methods is required to successfully detect all loci contributing to population-level variability of a complex trait. Therefore, if a trait still exhibits evidence of missing heritability in spite of well-powered GWAS, rare-variant studies are a reasonable approach to identifying new associations that will contribute to characterizing the genetic architecture. However, the assumptions of the method being employed must be taken into consideration to ensure the most appropriate application for the study population(s) in question.

In the most commonly used rare-variant tests, summary statistics from all variants within a pre-defined genomic region are collapsed or summarized to provide a single gene-based value

(see Section 4.5.1.)499. Myriad methods have been designed and their selection depends on the goals of the research group performing the analysis; this work focuses on the combined burden/variance-component tests. Gene-based tests exhibit several benefits compared to single- variant analyses that may be particularly meaningful in an ancestrally diverse study population which better represents the global distribution of rare variants than the most common study populations of primarily European-ancestry individuals.

67 3.8.3.3 Benefits of gene-based testing

A large proportion of rare variants reported in GWAS are ancestry-specific (i.e., variants that arose after human migration led to the establishment of modern continental ancestries)42; 470;

500-502. It follows that rare-variant studies should not be limited to specific populations (e.g.,

European), as there is an established history—particularly with coding variants that play a role in hereditary anemias—of population-specific causal rare variants with causal roles in RBC traits88;

332; 503. For example, the megaloblastic anemia-causing AMN or CUBN variants only occur in

Nordic-descent individual for a disease prevalence of 200,000 in those populations504; 505.

Similarly, an established 3.7kb HBA mutation is uncommon or rare in global populations (<2% in all non-African populations) but has been found to have a nearly 50% prevalence of homozygosity in a malaria-endemic region of Papua New Guinea (Table 1)506; 507. Studies incorporating whole-genome sequencing (WGS) would be the preferred method for discovery when performing gene-based testing, because variants that do not occur in reference genomes but are present in a particular study population would be detected in sequencing491; 508-511. However,

WGS remains cost-prohibitive for study populations sufficiently large to have enough alleles, which limits the power to detect associations using gene-based or single-variant methods491; 509.

Evaluating densely imputed genotype data in populations representing the distribution of global alleles provides an alternative that has shown success identifying variants for a variety of complex traits with different genetic architectures44; 301; 512-514. Therefore, it is advantageous to include large, multi-ethnic study populations, which will increase the number of rare variants being tested per gene when imputation quality is sufficient44; 495; 512.

In summary, the successful causal association of both variants and genes that were not detected in large-scale GWAS of common- or low-frequency variants for these complex traits

68 bolsters the argument in favor of performing rare-variant analyses. Because single variants with large effects are atypical (and a large number have already been reported for RBC traits), gene- based tests have emerged for identification of rare variants, followed by bioinformatics or molecular characterization (see Section 3.8.3.4) to identify the most likely causal variants at candidate loci43; 491; 492. Since there are benefits to using gene-based testing when attempting to discover new associations when the limitations are appropriately addressed, we will employ this methodology for the second aim of this proposed work (see Section 4.5.).

3.8.3.4 Limitations of gene-based rare-variant studies

As with all genomic studies, rare-variant studies have disadvantages that need to be considered. First and foremost, accurate imputation of rare variants (MAF≤0.01) not included on study arrays is required in order to have sufficient information for gene-based testing to improve power over single-variant testing515. Attempts to improve sample size often include the collaboration of multiple cohort studies (as is the case with this work), which can lead to differences in availability of rare variants based on the genotype assay used (if varied across studies) and accompanying imputation quality. Several genotype assays have been recently developed that emphasize non-European variation into the scaffold and fine-mapped regions of the chip, improving the ability to impute across 1000 Genomes populations with greater accuracy and confidence495. A study incorporating imputed rare variants (as opposed to sequencing data) will need to weigh the gain in power of including study participants genotyped on a chip with poorer imputation data against the potential downside of including poorly imputed variants.

Related to imputation concerns is the potential for differences in the presence of rare alleles by population to affect the detection of true associations using gene-based methods. The spectral nature of ancestry (see Section 4.3.3.4.) makes it difficult to define ancestry groups that

69 would prevent population microstructure from affect—even within African Americans, for example, exist large differences in the proportion of African ancestry and regional specificity depending on the geography of the study sub-population516-518. Adjusting for ancestral principal components—even if they were generated using common genotyped variants—does not always completely account for population stratification as well as in single-variant analyses of common or low-frequency variants. However, joint PCs in trans-ethnic study populations have recently been shown to properly account for the majority of population substructure492; 493. There are differences in frequency and race/ethnic distribution of these rare variants, requiring a balance between power of trans-ethnic and race/ethnic-specific pros and cons. Therefore, the decision between a trans-ethnic analysis and race/ethnic-stratified analysis must be made on a study- specific basis.

Although power is improved over single-variant analysis, low power to detect associations via gene-based testing remains an issue, particularly when one or more causal variants are ancestry-specific and therefore highly affected by sample size in multi-ethnic studies492. Additionally, the ability to detect signals depends on the method employed—SKAT-

O, for example, allows for the incorporation of both protective and risk alleles491; 492; 519;

520.Furthermore, sub-analyses based on self-identified race/ethnicity will cause decreased power caused by meta-analyzing smaller populations rather than analyzing the entire study population together. For these reasons, recent efforts have typically performed trans-ethnic pooled analyses for gene-based tests, with ancestry-specific follow-up for variants within a significantly associated gene.

Thirdly, findings from the GWAS literature and molecular studies support the expectation that a large amount of phenotypic variation of complex traits is due to functional

70 non-coding variants regulating expression or other downstream processes463; 468; 469. However, the inclusion of a large number of non-functional variants may dilute the effect of a genic unit, potentially preventing the detection of true associations43; 492. Therefore, a genic unit including only coding variants is anticipated to have a greater magnitude of effect than one containing non- coding variants, because non-coding variants vastly outnumber coding variants, and coding variants are likely to have a larger effect size given their potential to alter the function or structure of a protein491; 509; 520.

Finally, in the event that gene-based testing does identify previously unreported genomic associations for a trait of interest, clarifying which rare variants may be causal is difficult using current methodology given the distribution of these variants by population and that the effect sizes of rare variants can be overstated491. Methods for identification of the variant(s) underlying a particular association currently include rank-based analysis, "leave-one-out" analyses, and

Bayesian model-based analyses of expression data for individuals with both RNA-Seq and association data429; 521-523. The latter method shows great potential for narrowing the list of potential functional variant within a gene, and models have been developed that could be applied to studies with only genotype/imputed data. However, all currently available sequencing data in the GTEx Consortium was performed in European-ancestry individuals, leading to potentially extreme differences in the ability to predict functional consequences of variants that are very rare or monomorphic in Europeans but low-frequency or common in other race/ethnic populations.

In summary, although rare-variant association testing methods have been employed for nearly as long as common-variant GWAS, the question of which methods are most likely to identify candidate variants that are truly functional remains unanswered. However, established gene-based testing methods have demonstrated the utility of combining rare variants to improve

71 power by identifying genomic regions which were not significantly associated with traits in common-variant or rare-variant univariate analyses43; 492.

3.9. Public Health Significance

RBCs perform the crucial function of delivering oxygen to every cell in the body and transporting cellular waste materials for processing in the liver. RBCs are the most numerous cell type in the body, with an average of 1,500 mature RBCs released into the bloodstream each minute in adult humans. Dysfunction or insufficient production is referred to as anemia, with physiologic effects ranging from mild to severe. Anemia affects a large proportion of the population across the life course, with a high public health burden both financially and in disability due to related health consequences, both in the United States and globally524-528. A non- monogenic anemic state is expected to have a large genetic component even for individuals with acquired anemia, based on the complex nature of RBC trait biology. In addition to being relevant for diagnosing blood-related disorders, RBC traits provide clinical information related to the status and prognosis of other diseases. RBC traits have been variously associated with outcomes in patients with end-stage renal disease; autoimmune diseases such as rheumatoid arthritis and systemic lupus erythematosus; cerebrovascular disorders such as ischemic stroke; and cardiovascular diseases such as heart failure and acute myocardial infarction6; 242; 243; 255; 256; 258;

266; 271; 529-533.

RBC traits are complex quantitative traits regulated, we now know, by a vast number of genomic loci. Dozens of genes containing variants that cause monogenic anemias have been identified; several of the most severe anemias are recessive disorders caused by variants specific to populations with evolutionary pressures to adapt to endemic malaria. Mapping these

72 monogenic anemias during the latter half of the 20th century contributed greatly to understanding

RBC development and maintenance150; 322; 330; 534; 535. GWAS within the last decade have identified hundreds of additional loci associated with interval-scale RBC traits; however, as of yet less than half of the population variation in RBC traits attributed to genetic variation can be explained by reported loci15; 31; 301. It is reasonable to expect that genomic loci associated with

RBC traits remain to be discovered, and also that new methods or more diverse study-population composition would benefit the identification of these associations.

The proposed work will examine two areas of RBC trait association genetics left largely unexplored. First, we will leverage correlation among RBC traits, which exhibit moderate to high heritability, to examine evidence of a shared genetic architecture for the first combined- phenotype analysis of these traits. Significant findings from this analysis will be fine-mapped to narrow the association signals, identifying independent signals where they exist as well as potential functional variants within each association signal. Second, we will use gene-based testing to leverage the combined effects of coding and proximal regulatory variants. Analyses of single rare variants are typically extremely underpowered, but a weighted combination of multiple such variants has been shown to identify genes for complex diseases that may have multiple variants contributing to expression difference and, hence, RBC trait associations.

The public health benefits of identifying new and characterizing known genomic loci associated with RBC traits are numerous, though occasionally indirect. The biology underlying

RBC development and gene expression across relevant tissue types remains incompletely explained. GWAS findings of RBC traits thus far imply hundreds or even thousands of additional contributing variants28; 31; 403; 433; 536. The utilization of both combined-phenotype study of common variants and gene-based testing by phenotype for rare variants is anticipated to

73 improve the ability to add to the suite of genomic loci associated with RBC traits. The benefits to increasing the foundational knowledge underlying RBC traits are myriad. Both basic science and also translational research can be improved with further molecular characterization of the pathways involved in RBC development and maintenance. Given the expense of these experiments, association studies are a crucial if not necessary method of identifying candidate genes and variants for further exploration. For example, pharmacogenomics researchers may utilize findings from studies like those proposed in this work for the development and testing of pharmaceutical interventions for blood diseases.

The proposed work also indirectly benefits race/ethnic populations traditionally underrepresented in GWAS. Since the beginning of the GWAS era, the vast majority of published studies comprise either exclusively European or European-ancestry populations or exclusively East Asian populations. Recently published studies have made obvious the benefits of including ancestrally diverse populations in GWAS, particularly populations exhibiting recent admixture; an improvement in discovery of RBC trait-associated genetic loci could be expected by including ancestrally diverse populations 32; 34; 38; 455; 495. In addition to the scientific benefits, research ethics demand equitable inclusion of diverse study populations in genomic studies40; 346.

Lack of inclusivity in study population design can not only lead to inaccuracies in understanding of variant pathogenicity, but also exacerbate existing health disparities with regard to improper diagnosis and failure to identify causal variants or associations specific to excluded ancestral populations38; 537. Failure to include African Americans in studies of hypertrophic cardiomyopathy (HCM), for example, led to decades of misdiagnosis of HCM based on a presumed causal variant because of the allele’s rare frequency in European-ancestry populations537. Furthermore, a causal variant for recessive HCM identified as rare has been

74 discovered to have an allele frequency of nearly 25% in the Garinagu in Honduras, demonstrating the downside of limiting study population to one ancestry and assigning importance of a causal variant based on those findings (G. Wojcik, personal comm. 2018).

Therefore, deliberate inclusion of a diverse study population stands to directly benefit the health of millions of Americans who are currently underrepresented in genomics research by ensuring that more accurate interpretation of results is possible.

In conclusion, continued identification of RBC trait-associated loci may improve our understanding of the exact functions of molecular pathways preceding these RBC phenotypes. A better understanding of RBC physiology is clinically relevant to a broad number of disease pathways; knowledge leading to the identification of causal RBC trait variants could have positive implications for diagnosis and prognosis of diseases associated with RBC traits.

Additionally, a clearer understanding of the genetic mechanisms affecting RBC traits could positively affect pharmaceutical research and genetic testing to address anemias or other RBC- related pathologies. This work proposes improve on the body of knowledge that currently exists with regard to RBC trait genomics while simultaneously increasing representation of ancestrally diverse study populations in this research area and implementing established methods that have yet to be employed in this field.

75 3.10. Supporting Tables & Figures

Table 1. Discovery and genetic mapping of hereditary anemias Anemia Year ID’d Author Prevalence Associated genes Alpha thalassemia (includes 1976 A Deisseroth538 Common (Hb Bart’s very HBA1/2 Hb Bart's) rare) Atypical hemolytic uremic 1967; WW Hagge539; 1 to 9/1 million HF1, CFH, MCP (AKA syndrome mapped Warwicker & CD46), CFHR1, CFHR3, 1997/1998 Goodship540 C4BPA Beta thalassemia / Cooley's 1925 TB Cooley ~1/100,000; more common HBB anemia / Mediterranean in African, Middle Eastern- anemia descent populations Chronic hemolytic anemia 1968 MA Baughan541 ~50 total cases PHI (codes for GPI) Congenital dyserythropoietic 1967; Wendt & Rare (several hundred total CDAN1 (Type I), anemia mapped 1996 Heimpel542; H cases reported) SEC23B (Type II), Tamary543 chr15q22 (Type III, gene unknown), KLF1 Dehydrated hereditary 1974; 2015 Glader544; <1/1 million KCNN4 stomatocytosis (hereditary Glogowska545 and xerocytosis) Rapetti-Mauss546 Diamond-Blackfan 1936 HW Josephs 5 to 7 per million RPL5, RPL11, RPL35A, RPS7, RPS10, RPS17, RPS19, RPS24, RPS26 Dyserythropoietic anemia and 1977; AR Thompson547; C <1/1 million GATA1 (X-linked) thrombocytopenia mapped 2002 Yu548 Elliptocytosis (accompanied 1956; Morton549; 1 per 2,000 – 4,000 EL1 (Elliptocytosis-1); by hemolytic anemia) mapped 1981 Alloisio550 worldwide; increased in SPTA1 (Elliptocytosis-2); malaria-endemic areas SPTB (Elliptocytosis-3); SLC4A1 (Elliptocytosis-4) Familial hemophagocytic 1952; JW Farquhar551; M ~1/50,000 PRF1 & UNC13D lymphohistiocystosis mapped 1999 Ohadi552, R account for 40-60% of Dufourcq- cases; STX11, STXBP2, Lagelouse553 unknown Fanconi anemia / congenital 1927 G Fanconi 1/160,000 (Ashkenazi, BRCA2, BRIP1, FANCA, hypoplastic anemia Roma, black South African FANCB, FANCC, increased) FANCD2, FANCE, FANCF, FANCG, FANCI, FANCL, FANCM, PALB2, RAD51C, SLX4 Hemolytic anemia 1956; PE Carson554; Puck Common in African G6PD (G6PD mapped 1998 & Willard319 Americans, Asians, deficiency), HK1 Mediterranean (different hemolytic anemia) Hemolytic anemia due to 1969; A Szeinberg555; S <1/1 million AK1 adenylate kinase deficiency mapped 1989 Matsuura556 Hereditary pyropoikilocytosis 1975; Zarkowsky557; Uncommon (now classified SPTA1; SPTB mapped 1992 Gallagher558 as a type of elliptocytosis) Hereditary spherocytosis 1962; NE Morton559 & AA 1/2,000 Northern ANK1 (~50%), EPB42, mapped 1975 MacKinney153; WJ Europeans SLC4A1, SPTA1, SPTB Kimberling560 Hypochromic microcytic: 2011 C Kannengiesser561 Rare SLC25A38 sideroblastic Hypochromic microcytic: X- 1992 TC Cox562 Unknown (uncommon) ALAS2 (exacerbated by linked sideroblastic additional mutation in HFE) Hypochromic microcytic: X- 1998 Y Shimada563 <1/1 million ABCB7 linked sideroblastic w/ ataxia Hypochromic microcytic: X- 1964; NT Shahidi564; MP Rare SLC11A2 linked sideroblastic with iron mapped 2005 Mims565 overload

76 Iron-refractory iron deficiency 1981; Buchanan & <1/1 million TMPRSS6 anemia mapped 2008 Sheehan566; KE Finberg567 Majeed syndrome 1993 El-Shanti, H568 3 known families LPIN2 Megaloblastic anemia 1960; Imerslund- 1/200,000 Nordic-descent AMN & CUBN mapped 1995 Grasbeck504; 569; M Aminoff505 Megaloblastic anemia, 1969; LE Rogers570; EJ <1/1 million SLC19A2 thiamine-responsive mapped 1997 Neufeld571 Nakajo-Nishimura syndrome 1936, 1950; Nakajo152 & ~30 cases described PSMB8 mapped 2010 Nishimura572; Garg573, Agarwal574 Nonspherocytic hemolytic 1961; WN Valentine575; H 1/20,000 Europeans PKLR anemia due to pyruvate kinase mapped 1991 Kanno534, A deficiency Larochelle576 Phosphoglycerate kinase 1 1969; 1980 Kraus577; Fujii & <1/1 million PGK1 deficiency Yoshida578 Porphyria (all types) 1937 Waldenstrom63 1/500 to 1/50,000 ALAD, ALAS2, CPOX, (PCT) CYP1A2, FECH, HMBS, UROD, UROS; HFE Thrombotic thrombocytopenic 1978; JD Upshaw579; GG Common ADAMTS13 purpura mapped 2001 Levy580 Hereditary anemias and conditions for which anemia is always present are listed. Several extremely rare diagnoses excluded. First publication date listed refers to phenotype description (often presented as a case report), followed by first published report of identification of any gene associated with the condition.

77 Table 2. Description of proteins involved in iron homeostasis and erythropoiesis.

Protein(s) Transcribed in: Role: Regulates: BMP/SMAD Broadly BMP, a ligand of the TGFβ family, activates SMAD family EPO, hepcidin pathway members by binding TGFβ receptors. SMADs are transcription factors, which activate hepcidin transcription. Erythropoietin kidney Generated in response to hypoxia. Binds EPO receptors on (EPO) RBCs to stimulate erythropoiesis. Recombinant (i.e., exogenous) EPO has been successful in hemodialysis patients with end-stage renal disease, in whom it has been shown to improve hematological parameters as well as increased left-ventricular mass and accompanying decrease in cardiac output177; 581; 582. Recent work in animal models suggests consistent use of EPO can have negative long- term cardiovascular effects67; 583. Transferrin (TF) liver Glycoprotein which reversibly binds free iron (Fe[3+]) ions in the bloodstream Iron regulatory broad (highest in liver) Binds iron-responsive elements (IREs) in transferrin Transferrin, proteins receptor 3' UTRs to regulate cellular uptake of iron. In transferrin (IRP1 and IRP2) iron-deficiency, IRP1 increases iron uptake into cells by receptors, stabilizing transferrin receptor and limits iron sequestration ferritin L & H by limiting ferritin mRNA translation. Iron contributes to subunits, regulation of EPO in kidney through IRP1 and HIF2. ferroportin, Transferrin receptors immature RBCs* Cellular uptake of transferrin-bound iron (HFE binds Hepcidin (TfR1 and TfR2) (TfR1 and TfR2) or competitively). liver (TfR2) Hepcidin liver Activated by TfR2; prevents iron absorption from the small intestines and the release of iron by macrophages into the bloodstream. Repressed during iron deficiency by ERFE. Erythroferrone muscle Rapidly inhibits hepcidin transcription in the liver, leading Hepcidin, (ERFE) to increased iron acquisition. Only co-occurs with low mTOR (liver) activity of the BMP/SMAD pathway. See Table 1 for monogenic anemias caused by mutations in sequences for any of these proteins. * All gene transcription in RBCs occurs at immature stages of development in the bone marrow, prior to nuclear extrusion.

78 Table 3. Descriptions of red blood cell primary traits and derived indices.

RBC trait Calculation Units Indices measured HCT N/A % Proportion of whole blood comprising RBCs HGB N/A g/dL Blood level of hemoglobin RBCC N/A 106 cells / mm3 Number of RBCs per unit of volume

RDW-CV SDMCV % Representation of distribution of RBC size MCV MCH HGB*10 picogram Average hemoglobin protein quantity per RBC (by mass) RBC MCHC HGB*100 g/dL Average hemoglobin protein quantity per RBC (by HCT concentration) MCV HCT*10 femtoliter Average size of RBC RBC HCT: hematocrit; HGB: hemoglobin; MCH: mean corpuscular hemoglobin; MCHC: mean corpuscular hemoglobin concentration; MCV: mean corpuscular volume; RBCC: red blood cell count; RDW-CV: red cell distribution width (coefficient of variation).

79 Table 4. Brief overview of genetic and epigenetic processes and effects of variation. Mechanism Cellular Location Effect of genetic variation Exonic or splice-site Nuclear DNA Change in mRNA sequence or secondary structure (can variant reduce or increase mRNA stability prior to translation) or change protein function (amino acid substitutions/changes) Intronic/intergenic Nuclear DNA Change in enhancer/promoter region can alter binding variants efficiency of regulatory factors, leading to changes in transcription DNA methylation Nuclear DNA Addition or removal of a CpG can add or eliminate methylation with a downstream effect on transcription; may be cell-type specific 3D chromatin structure Nuclear DNA Change in chromatin structure (euchromatin vs heterochromatin) can affect nearby gene expression; may be cell-type specific Histone Modification Nuclear chromosome Change in epigenetic histone marks can increase or decrease structure accessibility of genes for transcription MicroRNA (miR) Nucleus, cytoplasm, Altered miR sequence can affect the binding specificity for extracellular vesicle mRNAs, leading to broad increase or decrease in protein translation; altered miR consensus sequence in one gene can have the same effect Post-translational Nucleus, cytoplasm Protein folding and stability can be affected by intracellular modification conditions or availability of complementary molecules (ex: iron and HGB)

80 Table 5. Published English-Language GWAS and Exome Association Studies of RBC Traits Discovery First Author Year Ancestry N Relevant traits included Associations* Yang310 2007 European 1,000 HGB, MCHC, MCV, RBCC 0 Soranzo25 2009 European 13,943 HGB, MCH, MCHC, MCV, 6 RBCC Chambers314 2009 European; South Asian 16,001 HGB, MCH, MCHC, MCV, 5 RBCC Ganesh17 2009 European 24,167 HCT, HGB, MCH, MCHC, 23 MCV, RBCC Ferreira309 2009 European 6,015 HCT, HGB, MCH, MCHC, 2 MCV, RBCC Kamatani20 2010 Japanese 14,700 HCT, HGB, MCH, MCHC, 28 MCV, RBCC Kullo21 2010 European 3,012 HCT, HGB, MCH, MCHC, 1 MCV, RBCC Van der Harst26 2012 European; South Asian Up to HCT, HGB, MCH, MCHC, 67 85,000 MCV, RBCC Li22 2012 European; African 6,234; HCT, HGB, MCH, MCHC, 4 American 7,943 MCV, RBCC Chen29 2013 African American 16,500 HCT, HGB, MCH, MCHC, 4 MCV, RBCC, RDW Pistis24 2013 European 1,653 HCT, HGB, MCH, MCHC, 1 MCV, RBCC Iotchkova19 2016 European 3,210 HGB, MCH, MCHC, MCV, 2 RBCC Chami28 2016 European; African Up to HCT, HGB, MCH, MCHC, 21 American; 130,273 MCV, RBCC, RDW Hispanic/Latino; East Asian; South Asian Polfus30 2016 European; African Up to HCT, HGB, MCH, MCHC, 0 American 15,459 MCV, RBCC, RDW Astle301 2016 European 172,433 HCT, HGB, MCH, MCHC, 1,362 MCV, RBCC, RDW van Rooij27 2017 European, African Up to HCT, HGB, MCH, MCHC, 9 American, and Asian 65,000 MCV, RBCC, RDW Hodonsky18 2017 Hispanic/Latino 12,500 HCT, HGB, MCH, MCHC, 7 MCV, RBCC, RDW

Kanai348 2018 Japanese 108,794 HCT, HGB, MCH, MCHC, 44 MCV, RBCC Yasukochi584 2018 Japanese 4,884 HCT, HGB, MCH, MCHC, 0 MCV, RBCC, RDW GWS = genome-wide-significant (p<5x10-8); EMR = electronic medical records.* Associations reported as “discovery” if the locus has not been previously associated with the reported RBC trait in a GWAS—multiple associations had been previously reported in linkage analysis or candidate-gene studies. Several associations were reported as novel in multiple papers that were published at the same or similar times.

81 Table 6. Evidence of genetic associations shared across RBC traits.

Gene →

(chr22)

(chr7)

(chr6)

(chr2)

(chr3)

(chr9)

(chr6)

(chr9)

(chr10)

(chr4)

Trait

PRKCE THRB KIT HFE CCND3 MYB / HBS1L (chr6) PRKAG RCL1 ABO HK1 SH2B3/ATXN2 (chr12) TMPRSS6 Heritability HCT (11) 3 1 2 3 2 2 2 2 2 0.04

HGB (15) 2 1 3 2 3 2 3 4 7 0.12 – 0.84

RBCC (15) 3 1 3 1 4 7 1 1 4 1 2 0.23 – 0.61

RDW (4) 1 1 1 1 1 1 1 1 0.35

MCH (14) 2 2 4 3 5 3 1 1 1 6 0.22 MCHC 1 1 3 2 4 0.48 (15) MCV (15) 2 4 4 7 8 5 1 1 1 6 0.20 – 0.96 Number below each trait represents the number of GWAS evaluating that trait. Shading intensity represents proportion of studies evaluating that trait that reported a GWS association for at least one SNP with the named gene as the designated locus (given the allelic heterogeneity reported as common among most complex quantitative traits, we are reporting only by gene and do not imply that only one causal variant is located within each locus). To increase visibility, a baseline of 10% shading is used for all proportions.

82 Table 7. Examples of genome-wide significant RBC trait associations with cis-eQTL evidence and a plausible biological mechanism for a proximal gene.

Gene Chr GWAS Associations Role/function in RBCs* CCND3 6 MCH, MCV, RBCC, Important for erythropoiesis; signal identified in GWAS RDW contains TFBS for TAL1, KLF1, and GATA1. Reduced levels = fewer cell divisions during terminal erythropoiesis, leads to fewer but larger terminally differentiated RBCs. HBS1L-MYB 6 HCT, HGB, MCH, SNPs in 2 QTLs upstream of MYB disrupt enhancer binding MCHC, MCV, RBCC, which affects MYB expression. MYB regulates hematopoiesis RDW and erythropoiesis: indirectly through alteration of kinetics of erythroid differentiation; directly via activation of KLF1 and other repressors of gamma-globin genes. Low MYB levels accelerate erythroid differentiation leading to release of early erythroid progenitor cells that are larger and still synthesizing mostly HbF. CITED2 6 MCH, MCV, RBCC Master regulator of stem cell fate, key role in adult HSC maintenance. Regulates iron homeostasis and erythropoiesis via HIF1 and GATA1. CD164 6 MCH Codes for MUC24, a transmembrane molecule that affects HSC proliferation, adhesion and migration. Interacts with CXCR4, which has a role in HIV cellular transport. TFR2 7 HCT, HGB, MCH, Transferrin receptor 2 is a membrane-bound protein that takes MCHC, MCV, RBCC, up transferrin-bound iron; may be involved in erythropoiesis. RDW Mutations reported in hemochromatosis III (rare). SH2B3 12 HCT, HGB, MCH, Negatively regulates hematopoietic cytokine signaling. MCV, RBCC Suppression increases erythroid expansion and maturation by augmenting EPO and KIT pathways. HMOX2 16 MCH, MCV The constitutive form of heme oxygenase, a membrane-bound enzyme that cleaves the heme ring to form biliverdin (bilirubin precursor). Expressed at highest levels in the spleen. FAM234A 16 HGB, MCH, MCHC, Function unknown—variants in this gene have been causally (codes for MCV, RBCC associated with malaria resistance in the heterozygous state16. ITFG3) Also likely in high LD with HBA1/2 locus (~50kb away).

83

Figure 1. A basic overview of RBC development stages and accompanying molecular processes (prepared by Hodonsky, 2018)

84

From left to right, developmental stages progressing from HSC to senescent RBC. The red-shaded cell represents an RBC at each stage of differentiation (including nucleus with chromosomes when appropriate). The gray cell represents a macrophage, a crucial component of successful RBC differentiation. Top blue bars: location of the developing cell at each stage of differentiation. Bottom green bars: standard amount of time required for the stage(s) of cell differentiation or maintenance occurring in the respective tissue. Text boxes include particularly relevant information about events occurring at each stage. Bottom box outlined in blue indicates percent of activity for DNA transcription, mRNA translation, and chromatin density within the developing RBC. DNA transcription and chromatin density become irrelevant after enucleation. DNA: deoxyribonucleic acid; EMP: erythroblast-megakaryocyte protein; Fe2+: iron ions recycled by macrophages from phagocytosed RBCs to developing RBCs; mRNA: messenger ribonucleic acid.

81 84

Figure 2. Example network of key protein-protein interactions involving molecules with established roles in RBC biology.

https://string-db.org

85

RESEARCH PLAN

4.1. Overview

In this section, I will outline the resources and order of analyses required to investigate specific aims 1 and 2 as introduced above. In sum, the first aim will evaluate both known and previously unreported associations between seven RBC traits and common and infrequent genetic variants in the PAGE and collaborating study populations (see below), and identify candidate functional variants using bioinformatics and fine-mapping tools. The second aim will evaluate both known and previously unreported associations between seven RBC traits and genes in the same study populations. Below, I will first briefly describe participating study populations, genotyping arrays, and all covariates and exclusion criteria relevant to the study population. I will then discuss the statistical methods and relevant tools (including software and public databases) that will be used for each aim. Finally, I will review the strengths and limitations of the proposal at hand, and provide a timeline for expected completion of each aim and the accompanying manuscript.

4.2. Study populations

4.2.1. Population Architecture using Genomics and Epidemiology: The PAGE study

The PAGE consortium was first funded in 2008 as a demonstration of NHGRI support for expanding the genetic diversity of large-scale genomics studies342; 455. For the proposed work, all PAGE II-participating cohorts and collaborating studies with available phenotype and

86 genotype or sequencing data were included. Specifically, the Atherosclerosis Risk in

Communities (ARIC); Coronary Artery Risk Development In young Adults (CARDIA);

Hispanic Community Health Study/Study of Latinos (HCHS/SOL); Mt. Sinai School of

Medicine BioMe Biobank (ISSMS); and Women's Health Initiative (WHI).

The proposed work will include all PAGE participants with phenotype and genotype information self-identifying as Hispanic/Latino, African American, Asian American, Native

American or American Indian, or European American—while Native Hawaiian individuals from multiple cohort studies participated in PAGE cohorts, these studies did not report RBC traits and hence will not be included. Studies participating in the PAGE-II program that may be included in these analyses are the CALiCo Consortium (ARIC, CARDIA, HCHS/SOL), Icahn Mt. Sinai

School of Medicine (IMSSM), and the Women’s Health Initiative (WHI). Over 70,000 total

PAGE participants and participants of collaborating studies with genotype and phenotype information. Not all studies or study visits measured all RBC phenotypes (see below); when required, we will use data from multiple visits from the same cohort study to obtain the earliest phenotype measurements possible (and minimize the effects of loss to follow-up or mortality).

4.2.1.1 Atherosclerosis Risk in Communities (ARIC) Study

The ARIC study recruited 15,792 participants aged 45-64 for the first visit (of seven currently completed or planned) beginning in 1986585. Participants were located in four centers:

Jackson, Mississippi, Forsyth County, North Carolina, Minneapolis, Minnesota, and Washington

County, Maryland. The study was designed to include rural, suburban, and urban households of either African American or European American descent. Approximately 55% of participants were female, with about 4,000 African Americans (25%, primarily from Jackson, Mississippi) included in the baseline visit. Three visits are relevant to this study based on the availability of

87 RBC trait measurements, which will also affect the number of participants reported for each trait:

Visit 1, at which HCT, HGB, and MCHC were first reported; Visit 2, at which MCV, MCH, and

RBCC were first reported; and Visit 5, at which RDW was first reported.

4.2.1.2 BioME Biobank: Icahn Mt. Sinai School of Medicine (IMSSM)

The Medican Institute for Personalized Medicine BioME Biobank is affiliated with

IMSSM and participates in eMERGE (electronic medical records and genomics)586. IMSSM clinics are located in areas serving ethnically diverse populations, including Central Harlem, East

Harlem, and the upper east side of Manhattan. 30 IMSSM clinical care sites are involved in recruitment for participants, who donate blood samples for genotyping; these samples are linked to EMR which is de-identified but kept current through IMSSM. Since recruitment began in

2011, over 32,000 patients have been recruited to participate. Approximately 60% of participants are female and three fourths report the US as their country of origin; with regards to ethnicity, over one third of participants identify as Hispanic/Latino (primarily with Caribbean country of origin), over one quarter as African American, and approximately one quarter white.

4.2.1.3 Coronary Artery Risk Development in Young Adults (CARDIA) Study

The CARDIA study was designed in 1983 to evaluate risk factors for cardiovascular disease, including subclinical phenotypes, throughout the life course starting in early adulthood.

Baseline examinations were performed on 5,115 African American or White men and women

(with approximately equal distribution across ethnicity and gender groups) aged 18 to 30 in

1985-1986 at four centers: Birmingham, Alabama; Chicago, Illinois; Minneapolis, Minnesota; and Oakland, California587. Blood trait phenotypes being collected on the maximum number of participants at baseline, therefore phenotype data from the baseline visit will be used in this work.

88 4.2.1.4 Hispanic Community Health Study/Study of Latinos (HCHS/SOL)

The Hispanic Community Health Study completed its first visit of 16,415

Hispanics/Latinos aged 18-74 in 2011, with the aim of representing the diversity of

Hispanics/Latinos living in the United States. A stratified two-stage area probability sample of household addresses was used to in order to recruit a nationally representative sample.

Participants were recruited from Miami, Florida (school), Chicago, Illinois (Northwestern), the

Bronx, New York City, New York (Albert Einstein), and San Diego, California (school), with the coordinating center at UNC Chapel Hill in North Carolina588. Participants self-identified with one of five national backgrounds or countries of origin, including Cuban, Dominican, Mexican,

Puerto Rican, and South/Central American. Participants at the baseline exam underwent extensive phenotyping and interviewing (conducted in either English or Spanish depending on participants’ preferences). Consent and genotype are available for approximately 12,500 participants.

4.2.1.5 Women’s Health Initiative (WHI)

The Women’s Health Initiative began recruitment in 1991 a clinical trial and an observational cohort study, originally designed to evaluate the health effects of hormone therapy in over 150,000 women aged 50-79589. The study population was broadly representative of post- menopausal women in the US, recruiting from over 40 field centers with race/ethnic sampling based on proportions from the 1990 US census. Distribution of race/ethnicity was primarily white (83%) participants followed by proportionally underrepresented participants self- identifying as African American (9%), Hispanic/Latino American (4%), Asian American (3%), and Native American (0.4%). Genotyping information was obtained from consenting participants at the baseline exam, including approximately 6,000 Hispanics/Latinos, 12,000 African

89 Americans, 550 Asian Americans, 500 participants identifying as "Other" race/ethnicity, and

18,000 European Americans590.

4.3. Phenotypes: outcome and covariate assessment

4.3.1. RBC trait descriptive statistics and transformation

As described above, the following RBC traits will be used as the outcome variables in univariate analyses and gene-based tests for both specific aims: HCT, HGB, RBCC, RDW,

MCH, MCHC, and MCV. As is consistent with population-based genetic association studies of complex quantitative traits, all RBC traits will be evaluated for a skewed distribution of the residuals, which could violate assumptions of the models incorporated in the methods described below. For any trait exhibiting excessive skewing in the overall study population, that trait will be regressed on using all non-genetic terms in the model, and then inverse-normal transformation will be performed on the residuals. Interpretability of results from this type of transformation is limited; however, effect estimates in GWAS discovery findings are not expected to represent the global effect size15; 403. Furthermore, because the lead variant at a given discovery association signal is not necessarily expected to be a functional variant for the trait of interest, the effect estimate may not be an accurate proxy for the functional variant or variants31.

4.3.2. Exclusion Criteria

We will be excluding participants meeting criteria that may have a strong effect on RBC traits, and which may therefore give those observations undue influence on the genotype- phenotype association. As described above, we are concerned that we will not accurately capture these influential individuals by simply evaluating statistical outliers, but not that these variables are necessarily acting in a confounding capacity. All exclusion variables are consistent with the

90 RBC trait GWAS literature, and will exclude based on the following when information is available:

(1) Pregnancy

(2) HIV (+) status

(3) Ever-diagnosis of blood cancer

(4) Current treatment with chemotherapeutic medications (any cancer)

(5) Hereditary anemias (primarily SCD but not sickle-cell trait)

(6) Extreme outliers (+/- 4 standard deviations from the mean for the outcome of interest)

Aside from pregnancy in study populations that recruited pre-menopausal women, all exclusion criteria are rare in PAGE-participating study populations (<0.5% of participants). Of note, individuals with one excluded RBC trait value due to outlier status will not be excluded from analyses for other non-outlying traits.

4.3.3. Adjustment covariates

Genetic association studies typically do not suffer from the same confounding issues as traditional epidemiology studies, because any confounder would need to causally affect both the

SNP of interest and the outcome. Anticipating which covariates may be confounders in this context and, additionally, overadjusting in all analyses for covariates that are only confounders for one or a small number of SNPs is inappropriate. Therefore, genetic epidemiology typically adjust for covariates associated with the outcome but not the exposure to add precision to the effect estimate. This added precision is particularly helpful in the case of GWAS because the effect magnitude for true associations with individual genetic variants is anticipated to be small354; 403; 444; 591. The covariates described below will be adjusted for in all analyses.

91 4.3.3.1 Age and sex

Age and sex are significantly associated with RBC traits in the PAGE study population.

In keeping with field standards for GWAS analyses, age and sex terms will be included in all models.

4.3.3.2 Self-reported race/ethnicity

Reports in the hematology literature vary on whether and how much race and ethnicity affect RBC traits112; 180; 181; 592. However, adjusting for self-reported race/ethnicity in GWAS adds precision to the results as well as reducing issues involving population stratification, which can lead to false-positive association signals that actually identify ancestry593-596. We will include self-reported race/ethnicity in four categories: African ancestry or Black (African-descent individuals living in the United States or Europe); Hispanic/Latino (non-overlapping with

African Americans: HCHS/SOL participants will all be considered Hispanic/Latino); Native

American; and White/European American. No PAGE II-participating study populations with individuals of Native Hawaiian or Asian descent reported RBC trait phenotypes, therefore no participants identifying with either of those groups will be included in this work.

4.3.3.3 Study and study center

While genotyping and complete blood count sample collection and processing have been relatively standardized for decades, each cohort study follows its own protocol and uses machinery specific to that protocol. Therefore, slight differences caused by sample processing by individual lab technicians or minor differences in how/when machines are calibrated can affect the outcome measurements due solely to the cohort study and not the true value of the RBC traits measured. Adjustment for participating study is the standard established in the field in genetic consortia unless analyses are stratified by study.

92 Similar to the concerns related to cohort study, samples collected for CBC assays are often processed separately by study center. This can lead to batch effects, therefore adjustment for study center in the model can reduce statistical noise. For studies with multiple centers reported (including WHI, which uses region rather than center), we will adjust for study center, as is the field standard for studies incorporating multiple centers.

4.3.3.4 Ancestral admixture and principal components

Haplotype recombination events in admixed populations can lead to narrower LD blocks than might be observed in either founder population. Differences in recombination peaks across populations with different recent continental ancestry mean that inclusion of ancestrally diverse study populations—particularly those exhibiting a high degree of admixture—may lead to overlapping haplotyplp0u7de blocks, which can be utilized to narrow association signals in fine- mapping (see below)595; 597; 598. For example, among PAGE Hispanics/Latinos, individuals identifying as of Mexican ancestry have primarily Native American and European haplotypes, ranging from nearly 100% European to 100% Native American ancestry (Figure 3). In contrast, individuals self-reporting Dominican descent have ancestral haplotypes primarily from Africa and Europe, with most individuals having only a small proportion of Native American ancestry.

Ancestry-specific variants can lead to spurious associations in GWAS if ancestry is inappropriately adjusted for, which is a primary reason why principal components must be calculated within study populations and adjusted for in GWAS599 600; 601. To avoid identifying genetic associations due to population stratification rather than associations with the outcome, ancestral principal components (PCs) are used as covariates. Based on PAGE analyses of the appropriate number of PCs to adjust for in the transethnic study population, we will adjust for ten

93 PCs as that is sufficient to remove population stratification concerns with the exception of several known loci that are already known in the field455.

Traditionally, GWAS were performed by stratifying study populations based on self- reported race/ethnicity and then meta-analyzing all groups to report trans-ethnic results.

However, rather than properly adjusting for potential confounding by ancestry, this form of adjustment fails to account for the fact that ancestry represents a spectrum rather than definitive categories. See Figure 4 for population stratification of ancestry groups by principal components, demonstrating the continuum of ancestral relationships in PAGE participants which justifies analysis of participants grouped only by genotyping assay rather than race-ethnic specific analysis for Aim 1 (see below) (Wojcik 2017, personal communications).

4.4. Specific Aim 1: Combined-phenotype GWAS of red blood cell traits

This aim will leverage trait correlation to identify previously unreported RBC trait genetic associations in a large, ancestrally diverse study population. These associations will then be evaluated using conditional analysis and fine-mapping techniques that will allow for designation of candidate functional variants.

4.4.1. Quality control for genotypes and imputed data and statistical threshold

Genotype QC has already been performed by the PAGE coordinating center. Imputation to 1000 Genomes Phase 3 was also performed by the coordinating center. Rather than using a simple allele frequency threshold to exclude rare variants, we will implement an effective heterozygosity requirement for variants included in single-variant analyses. This allows for the inclusion of a larger number of variants in small study populations, while excluding low- frequency imputed variants with poor confidence scores. Effective heterozygosity (effN) refers

94 to the number of expected heterozygotes with a particular variant, based on the following equation: effN = 2 * n * MAF * (1-MAF) * oevar. SNPs with an effective heterozygosity (effN)

> 30 will be evaluated in univariate analysis. Rare-variant gene-based analyses will be performed to identify previously unreported RBC trait associations in Specific Aim 2.

For the purposes of discovery, we will be using a GWS threshold of  = 5x10-9 to account for the fact that we are evaluating approximately 30 million variants. While a strict

Bonferroni-corrected p-value would be 1.6x10-9, it is expected that some variants will be in very strong LD and hence not independent, therefore we are adjusting for a test of 10 million variants.

In the context of GWAS, effect heterogeneity refers to a difference in magnitude of effect by study population, typically in reference to the potential for difference by ancestry450; 602; 603. We will consider effect heterogeneity by race/ethnic group by evaluating Cochran's Q statistic.

Once known associations have been assessed, all variants outside of previously described regions (using a +/-1Mb window) and uncorrelated with a known variant (using European LD r2<0.2 as a conservative metric) that exceed the genome-wide-significance will be examined as potential discovery association signals.

4.4.2. Univariate analyses

Univariate analyses will be performed with generalized estimating equations (GEE), which estimate model parameters averaged over the study population, assuming additive inheritance. GEE models are beneficial in that they do not assume normality or independence of the outcome measurements across individuals, improve computational efficiency, and allow for correction for family structure. The GEE method will use the relationship matrix K to divide the sample up into discrete groups of closely related participants, each group assumed to be independent of other groups. The GEE method replaces the usual estimate of the information

95 matrix with an empirically estimated “sandwich” estimate that accounts for within-group relationships. Clusters of relatedness for studies sampling related individuals (in this case only

HCHS/SOL is expected to have large numbers of related individuals, but will be relevant for all

MEGA-genotyped individuals since we are doing a trans-ethnic analysis) will be defined using a relationship matrix to account or family structure. Robust standard errors are calculated for those instances in which a relationship matrix is required.

We will only be performing trait associations on the autosomal chromosomes, i.e., chromosomes 1 through 22. Study participants will be analyzed by genotyping chip—MEGA- typed participants will be analyzed trans-ethnically, and remaining participants will be analyzed by study and genotyping platform using previously calculated study-specific PCs. Heterogeneity of variance by self-identified race/ethnicity will be included for all models stratified by genotype.

Allowing for the presence of such heterogeneity within GEE models using SUGEN can improve detection of genetic associations in populations in which ancestry-specific variants may contribute differentially to the total effect estimate of that locus represented by a tag SNP.

4.4.2.1 Inverse-variance-weighted meta-analysis of study-specific results

Meta-analysis within each RBC trait will be performed using METAL604. Briefly, inverse-variance weighting will be used for meta-analysis of the summary results for genotype assay-specific results. A z-statistic is calculated for each study-specific association, which incorporates effect magnitude and direction of effect. In order to account for the difference in study population size, as well as the difference in effect size across populations (which could be due to lack of precision for not adjusting for covariates that affect the outcome), results will be weighted by the inverse-variance of each association. All study sub-populations of the same self-

96 reported race/ethnicity will be meta-analyzed first; subsequently, all race/ethnic groups will be meta-analyzed together

4.4.3. Comparison of available combined-phenotype methods

We evaluated eight methods that are generally representative of the wide array of available statistical methods to estimate combined-phenotype effects with regard to computational expense, type of statistical test, and use of summary vs. individual-level data.

Specifically, we compared aSPU with S=Z; aSPU_R with S=R-1Z, which is most powerful in the presence of homogeneity of marginal association effects (as opposed to homogeneity of joint association effects, for which aSPU with S=Z is more powerful); our score-based chi-squared test; minimum P-value across phenotypes (minP); a test for heterogeneous (SHet) and homogeneous (SHom) effects; an adaptive association test based on mixed models (MPAT- mixAda); and a unified score-based association test (metaUSAT)422; 425; 605-607. We simulated

2,500 participants, 16 correlated phenotypes, and one variant with MAF=1%. Under the null hypothesis of no effect, uniformity of p-values was assessed by plotting the ratio of observed to expected p-values versus expected p-values, with both quantities on a -log10 scale; the proportion of p-values<1x10-5 was also assessed, with values of this proportion>1x10-5 representing excess type I error.

When the sample size varied across the phenotypes, all methods except aSPU and minP showed evidence of inflated type I error (Figure 5); similar evidence of inflation was demonstrated in empirical tests. While performance of alternate methods improved with increaing MAF, aSPU’s preservation of type I error at low MAF ensures validity for uncommon variants, as previously suggested425. The newer adaptive tests (MPAT-mixAda and metaUSAT) show comparable power gains, but without comparable validity at low MAF in all scenarios.

97 These results support the use of aSPU to identify possibly pleiotropic loci. However, neither multi-phenotype methods nor multivariate methods utilizing individual-level data can determine the nature of the association(s) between a given locus and phenotypes of interest (i.e., biologic vs. mediated pleiotropy). Association signals also most likely remain broad in size, necessitating fine-mapping. Thus, additional statistical dissection is needed to characterize causal relationships.

Our aSPU implementation offers a large increase in speed and simultaneous decrease in memory requirement compared to the standard implementation. Multi-phenotype tests are computationally demanding, particularly when assessing levels of significance used in GWAS settings evaluating millions of variants (in our case, p<5x10-9). After observing the low efficiency of existing software when applied to TopMED-imputed data, which contain >30M variants, we developed an efficient implementation of aSPU in Julia—an open source, high- performance scientific computing language. Our aSPU implementation reduced computational time when using 50 CPUs from 25 days to 13 hours using a significance threshold of 5x10-8; decreasing the threshold to 5x10-9 will be more computationally expensive using our methods but would not even be feasible without our implementation. Given these preliminary findings, we will proceed with aSPU for combined-phenotype analysis.

4.4.4. Combined-phenotype analysis using aSPU

aSPU refers to an adaptive sum of powered scores statistical test which combines summary data from univariate analyses of multiple traits. aSPU combines the sum of Z-scores from univariate GWAS raised to a power, γ, which can take multiple values which allows the program to detect a multi-phenotype effect with maximum efficiency (see below). A higher γ indicates a stronger influence of that trait on the final value. The procedure involved estimates

98 the distribution of null z-scores for each trait's univariate statistics (Nk[0,Σ]) and samples from that distribution utilizing Monte Carlo simulations (with the number of simulations indicating the minimum p-value that can be detected by the procedure, in our case 109). Each SNP has k z- scores, where k is the number of traits for which summary statistics are available. The association of each trait with a given SNP from the trans-ethnic univariate meta-analysis will be adaptively selected based on the maximally powerful score SPU(γ) using the following model, where γ = 0, 1, …, 8, ∞:

γ γ γ 푆푃푈γ = 푧1 + 푧2 + ⋯ + 푧푛,

An overall Monte-Carlo p-value, paSPU, is then calculated for the selected score, and this is reported for each SNP. Squaring the gamma removes concerns about SNPs exhibiting opposite directions of effect by trait inappropriately minimizing the summary effect. A second Monte-

Carlo procedure is then used to simulate the null behavior of the aggregate test-statistic, yielding a p-value for G, taken to quantify evidence for aggregate effects of G across the k traits.

4.4.4.1 Power

Statistical power for RBC traits is high because they are consistently and accurately measured. We assumed a cross-sectional design, a two-sided α=5x10-9, an additive genetic model, MAF 1-50%, and n=30,000-70,000 participants (the number of phenotypes available varies by study). The correlation matrix used for power calculation was generated in the MEGA- genotyped PAGE-participating study population, using inverse-normal transformed residuals of each trait as the outcome to match the proposed univariate analyses. While correlation structure is expected to differ across populations, the differences are slight and should not affect the accuracy of the predicted power. Seven RBC traits and their multivariate distribution were simulated (10,000 iterations), using SD estimates in PAGE (all standard deviations were equal to

99 1 due to the nature of the inverse normalization and subsequent residual calculation as described above). For HGB, MCV, and RDW, we simulated one pleiotropic variant using LMM with per- allele effects of 0.1 SD, 0.04 SD, and 0.05 SD, respectively, based on the median effect size for significant findings in the largest available RBC trait GWAS301. We also examined per-allele effects 50% smaller in magnitude.

Statistical power to detect multi-phenotype effects was excellent for effects of the magnitude observed in our preliminary RBC trait findings were simulated for MAF>1% (Figure

6, panel A). For univariate associations, 80% statistical power was achieved for HGB at

MAF=4%, and the maximum power achieved for MCV and RDW at MAF=50% was 27% and

54%, respectively. When effect sizes 50% smaller in magnitude were examined (Figure 6, panel

B), only the multi-phenotype approach was well powered to detect effects >5% MAF.

4.4.4.2 Independent association-signal identification

Every genome-wide-significant association signal in both univariate and multivariate analyses—both discovery and previously reported—will be evaluated for independent signals using conditional analysis. These analyses will be performed iteratively in SUGEN, adjusting for all lead SNPs from primary GWS association signals (see below)435. A lead SNP will be defined as the variant with the most significant p-value among all SNPs within a 1Mb window in which at least one SNP exceeds GWS; all lead SNPs within a 1Mb window exceeding GWS after conditional analysis will be defined as representing independent association signals.

4.4.4.3 Bioinformatic analysis

Discovery association signals will be analyzed using publicly available databases and prediction algorithms. Examples of various resources for bioinformatics analysis are described above in Table 9. For this dissertation, we will consider but will not be limited to (in the event

100 that new resources become available within the timeline of this dissertation): GTex and the

ENSEMBL Variant Effect Predictor429; 435; 608-611.

4.4.4.3.1 Resources for bioinformatic characterization of GWS association signals

In order to characterize the effect a particular gene has on an associated trait, in the presence of likely allelic heterogeneity all potentially functional variants—particularly noncoding variants in regulatory regions—must be considered. We will evaluate candidate functional variants, and all findings will be reported in the supplement with results that are biologically plausible highlighted in a main table. The Variant Effect Predictor (VEP) is a vast, publicly available database of both sequence and structural genomic variants in all species with reported genome assemblies, with extensive data available for humans608. VEP data include known or suspected effects on transcripts, proteins, and regulatory regions, as well as allele frequencies by 1000 Genomes populations. GTEx is a database of gene expression (determined by RNA sequencing) across a large number of human tissue types as well as several dozen immortalized cell lines, and expression data is accompanied by genotyping information

(https://www.gtexportal.org)429.

4.5. Specific Aim 2: Gene-based rare variant analysis of red blood cell traits

Identification of functional variants within a genomic region associated with a phenotype of interest is difficult. Even more challenging can be identifying genetic associations with variants that are ancestry-specific (monomorphic in most populations) or present but rare (MAF

< 0.01) in all populations. The trade-off for a low false-positive rate often reduces the power to detect a true association when the effect size is small or the variant is rare491; 612. For these reasons, methods have been developed over the past decade to address the wealth of information available if rare-variants are accurately genotyped or imputed in the appropriate populations491;

101 499; 613; 614. This proposal will perform gene-based testing using a combined analysis (jointly employing burden and variance-component tests), as these methods are currently the best suited for our traits of interest and study population, although no currently available combination of study population and methodology can provide a power exceeding the 80% preferred for GWAS

(see below).

4.5.1. Gene-based testing: combined VC and burden test implemented in MASS

As briefly introduced above, combining the methods of burden and variance components tests improves power to detect loci with different distributions of causal variants. In genetic transcripts containing few variants with large effects which all act in the same direction, a burden test is better powered to detect an effect. However, in genomic loci for which a large number of non-causal variants are present alongside functional variants, or for which causal variants exhibit opposite directions of effect—both completely reasonable scenarios given current knowledge of human evolution—SKAT (Sequence Kernel Association Test) is more powerful than burden tests. A combined test which allows for both of these scenarios uses the effect magnitude and standard error values for all rare variants within a defined region to identify genes that exhibit association with the trait of interest615.

First, we will generate variant annotation set files in R for deleterious and CADD-filtered transcripts (see below). Defining each annotation set as the respective group files, we will generate score statistics by transcript group for each annotation set using SUGEN, which is well- suited to produce input files for the combined test because it automatically outputs a mass- compatible results file (https://github.com/dragontaoran/SUGEN)616. We will then use the “VC-

O” test flag in MASS to perform the combined test with a minor allele count (MAC) minimum of 5 using the default one million Monte Carlo simulations for all variants. Because this setting

102 can only simulate a minimum p-value of 1E-6, which is higher than our genome-wide- significance threshold, we will increase the number of simulations to 25 million (minimum p~4E-08) for variants with a p-value lower than 1E-5 in the first run. Due to computational limitations, we currently do not expect to simulate p-values lower than 1E-9.

MASS automatically outputs both fixed-effect and random-effect p-values, as it functions primarily as a meta-analysis program, as well as the Het-SKAT-O p-value using the method described by Lee, et al615. However, our study does not meet the criteria for the Lee method, which was designed for binary outcomes with small study populations, and we are only evaluating one study population, therefore we will be using the fixed-effects p-value. SKAT-O generates a weighted average of the burden test and SKAT statistics into one framework, resulting in the following test statistic:

푄푘 (휌) = (1 − 휌)푄푘,푆퐾퐴푇 + 휌푄푘,푏푢푟푑푒푛

This statistic follows a chi-square distribution, which allows for the combination of effects with opposite directions of effect. Because we are under-powered to evaluate ancestry- specific effects genome-wide, we will perform sensitivity analyses for transcripts significant for each trait within each mask in African American- and Hispanic/Latino-only subpopulations.

4.5.1.1 Variant inclusion, genic unit definition, and adjustment for known variants

In order to perform gene-based testing, we will generate an inclusive list of all variants with a MAF >1% in the MEGA-genotyped study population. Parameters will be defined based on previously described standard field practices. A maximum threshold of 1% minor allele frequency will be used for incorporating variants, a strategy which is supported by both field standards and the fact that common and low-frequency variants will be analyzed individually in

Specific Aim 1a.

103 Two groups of variants will be used to define the respective annotation sets: deleterious coding variants only (frameshift, stop-gain, stop-loss, and nonsynonymous coding variants), and a set additionally allowing for filtered regulatory variants (employed the recommended PHRED- scaled CADD score filter>10 to restrict to variants expected to have modest or higher effects)617

Using whole-genome-sequence annotation files for all MEGA-imputed variants generated by the

PAGE coordinating center, we will compile both annotation sets defining all transcripts meeting the aforementioned criteria and comprising >2 variants. All genes with ENSEMBL-defined transcripts will be included in the analysis, resulting in a GWS alpha threshold of 2.87x10-7, or a significance threshold Bonferroni-corrected for 25,000 genes and seven traits. Variants with an imputation quality below 0.4 were excluded during imputation for VCF genotype files by the

PAGE coordinating center. Genic unit size (i.e., the number of variants included per gene) is expected to range from very small (~3 variants, particularly for the deleterious mask) to very large (several hundred for protein-coding genes with a large number of CADD-significant regulatory variants).

4.5.1.2 Power

As an approximation, PAGE investigators recently performed power simulations for

SKAT testing in 10,000 individuals under the assumption of 1% variance explained per region.

In SKAT and related tests, power in a region depends on the proportion of phenotype variance that is explained by the linear effects of the full set of variants being included in the test (the narrow-sense heritability, h2), the number of variants being tested (varying from a handful to many thousands), and the sample size (scaled by the imputation info). Finally, there is a

Bonferroni correction for the total number of regions tested. Figure 7 shows approximate power curves for selected values of sample size, heritability, and number of variants tested per region,

104 at alpha= 0.05/number of regions. With a sample size of 10,000 and 1,000 variants tested per region, there is 80% power to detect heritability greater than 3 percent in any one region; with a sample size of 100,000 individuals, the detectable heritability would be approximately 10 times lower, or 0.3%. These simulations provide a reasonable framework for the power that can be expected from the proposed work.

4.5.2. Interpretation of results

While the association of specific genes with phenotypes is difficult in traditional GWAS, the nature of gene-based testing for rare-variant analysis increases the likelihood that a particular gene will causally affect the trait of interest. Individual variants within GWS associations will be evaluated to identify likely causal candidate variants (see below). Confirmation of individual- variant causal associations using molecular methods is outside the scope of this work.

4.5.2.1 GO term overrepresentation

As described in Aim 2b, pathway analysis will be performed for genes that are GWS

(p≤2x10-6) for each trait, and pathways which are represented in multiple traits will be highlighted in a results table. Terms will be obtained from the String Consortium’s online resource, which is part of the ELIXIR core infrastructure and is updated frequency618. This resource provides assembled information about known gene functions that can be used to identify the cellular pathways that are most strongly represented in a particular study’s findings, specifically by identifying enriched gene ontology terms (e.g., cell-cell signaling or ion transport)619; 620.

4.5.2.2 Protein-protein interaction network generation

Protein-protein interactions can be beneficial to evaluating the context of genomic findings for a particular phenotype. Therefore, we will generate 1st degree networks of all GWS

105 genes (discovery and known) for each phenotype using String-DB to determine whether there are overlapping interactions between genes that may indicate a core gene function for RBC traits433;

621.

4.5.2.3 Ancestry specificity of lead variants

Unlike univariate analyses, gene-based testing does not produce an obvious index SNP within each identified gene based on p-values from the results. Therefore, for each GWS transcript identified significantly associated with one or more traits, we will evaluate the most significant variant by trait for evidence of ancestry specificity by allele frequency. We will also evaluate each top variant for potential functional relevance using publicly available resources.

106 4.6. Supporting Tables & Figures

Table 8. The PAGE Study: descriptive statistics for PAGE participants and collaborators by study and self-reported race/ethnicity.

Age, years PAGE II Study Population N Race/ethnicity (mean, SD) % Female Atherosclerosis Risk in Communities 2,825 African American 53 (6) 63 Atherosclerosis Risk in Communities 9,266 European American 54 (6) 53 Coronary Artery Risk Development in 2,636 African American 24 (4) 56 Young Adulthood Coronary Artery Risk Development in 2,479 European American 25 (3) 53 Young Adulthood Hispanic Community Health 11,936 Hispanic/Latino 46 (14) 59 Study/Study of Latinos BioME Biobank program 7,138 African American 50 (15) 63 BioME Biobank program 10,152 Hispanic/Latino 52 (16) 63 BioME Biobank program 2,729 European American 63 (12) 45 BioME Biobank program 800 Asian American 44 (15) 64 BioME Biobank program 1,094 Other 48 (16) 50 Women’s Health Initiative 10,035 African American 62 (7) 100 Women’s Health Initiative 4,705 Hispanic/Latino 60 (7) 100 Women’s Health Initiative 594 American Indian 61 (7) 100 Women’s Health Initiative 516 Asian American 67 (6) 100 Women’s Health Initiative 18,180 European American 67 (7) 100 TOTAL 85,085 CBC = complete blood count; N = number of study participants with genetic and RBC trait data; SD = standard deviation. This table represents the total number of participant from each study; individuals without phenotype and/or genotype data have not been excluded, nor have exclusion criteria been applied to these study population numbers.

107 Table 9. Representative sample of bioinformatics analyses for identifying candidate functional variants. Program Type Year Description Transfac473 Regulatory effects 2008 Database of information on transcription factor consensus sequences and known binding sites, and genes regulated by TFs. Specific to eukaryotic TFs. Blueprint478 Hematopoietic 2012 Database including 100 reference epigenomes for multiple healthy epigenome hematopoietic cell types and diseased cells from the same individuals. database Begun in 2011, eQTLs available in 2012. PolyPhen-2472 Pathogenicity 2013 Program predicting possible effects of amino-acid substitution; prediction compares structural differences of amino acids. Specific to nonsynonymous coding mutations in humans. CADD476 Database 2014 Predicts pathogenicity based on metrics including prediction algorithms (PolyPhen) and functional data. Uses a release-specific version of the VEP. VEP609 Function 2015 Provides annotation data for coding and noncoding polymorphisms and prediction structural variants. PrediXcan610 Expression 2015 Uses RNA-seq data from tissues of interest to generate predicted prediction genome-wide transcription levels based on genotype data. FastPAINTOR435 Fine-mapping 2016 High-resolution causal-variant prediction using functional annotation data and results from multiple correlated traits. ClinVar622 Database 2016 Database combining variants with known or suspected functional effect in humans. bloodmIRs623 Blood miR 2017 High-throughput sequencing of miRs in 7 blood cell lines from a expression library of 450 samples. Cataloged online by the Institut für Klinische database Molekularbiologie in Germany (all European ancestry): http://134.245.63.235/ikmb-tools/bloodmiRs. CADD: combined annotation dependent depletion; eQTL: expression quantitative trait locus; miR: microRNA; PIPS: pathogenicity island prediction software; VEP: variant effect predictor

108 Figure 3. PAGE II participants self-identifying as Hispanic/Latino stratified by ethnicity

*adapted from Wojcik, 2017 (personal comm)

109 Figure 4. Ancestral principal components demonstrate the continental-ancestry continuum among participants of the PAGE study.

Scatter plot of PCs for PAGE racial/ethnic groups. Each point represents an individual, color- coded by self-identified race/ethnicity. (Left panel) Global variation (PC1 vs PC2) (Right panel) HL variation (PC7 vs PC8).

110 Figure 5. Comparison of multi-phenotype methods under the assumption of no genetic effect to examine systematic evidence of type I error inflation. Includes proportion of p <1x10-5 for each method.

111 Figure 6. Statistical power for univariate and multi-phenotype associations by effect size. Gray shading indicates 80% statistical power.

A. Simulated RBC trait effects B. Effects 50% smaller in magnitude

112 Figure 7. Selected approximate power curves for 20,000 genes by number of variants per region X-axis represents simulated heritability attributable to the region being tested. Simulated sample size is 10,000.

113

MANUSCRIPT A: ANCESTRY-SPECIFIC ASSOCIATIONS IDENTIFIED IN GENOME-WIDE COMBINED-PHENOTYPE STUDY OF RED BLOOD CELL TRAITS EMPHASIZE BENEFITS OF DIVERSITY IN GENOMICS

5.1. Overview

Quantitative red blood cell (RBC) traits are highly polygenic clinically relevant traits, with approximately 500 reported GWAS loci. The majority of RBC trait GWAS have been performed in European- or East Asian-ancestry populations, despite evidence that rare or ancestry-specific variation contributes substantially to RBC trait heritability. Recently developed combined-phenotype methods which leverage genetic trait correlation to improve statistical power have not yet been applied to these traits. Here we leveraged correlation of seven quantitative RBC traits in performing a combined-phenotype analysis in a multi-ethnic study population.

We used the adaptive sum of powered scores (aSPU) test to assess combined-phenotype associations between ~21 million SNPs and seven RBC traits in 68,634 participants (24%

African American, 30% Hispanic/Latino, 43% European American, 76% female).

Thirty-nine loci contained at least one significant association signal (p<5E-9), with lead

SNPs at five loci significantly associated with three or more RBC traits. A majority of the lead

SNPs were common (MAF>5%) across all ancestral populations. 19 additional independent association signals were identified at seven known loci (HFE, KIT, HBS1L/MYB,

CITED2/FILNC1, ABO, HBA1/2, and PLIN4/5). For example, the HBA1/2 locus contained 14

114 conditionally independent association signals, 11 of which were novel and specific to African and Amerindian ancestries. One variant in this region was common in all ancestries, but exhibited a narrower LD block in African Americans than European Americans or

Hispanics/Latinos. GTEx eQTL analysis of all independent lead SNPs yielded 31 significant associations in relevant tissues, over half of which were not at the gene immediately proximal to the lead SNP.

The complex genetic architecture notable at the HBA1/2 locus—which was only revealed by the inclusion of ancestrally diverse populations—underscores the continued importance of expanding existing GWAS of European and East Asian populations to include other race/ethnic groups when attempting to identify and characterize complex trait loci.

5.2. Introduction

In the average adult, 200 billion red blood cells (RBCs) are generated daily from hematopoietic stem cells in the bone marrow. The most commonly assessed traits for mature

RBCs are hematocrit (HCT), hemoglobin concentration (HGB), mean corpuscular hemoglobin

(MCH), MCH concentration (MCHC), mean corpuscular volume (MCV), RBC count (RBCC), and red cell distribution width (RDW); together, these traits are used to characterize RBC development and function, diagnose anemic disorders, and identify risk factors for complex chronic diseases5; 107; 175; 273; 624; 625. RBC traits also are moderately to highly heritable, making these complex quantitative traits excellent candidates for genomic interrogation157; 497; 626.

Improved characterization of RBC molecular pathways has benefitted both disease diagnosis and pharmaceutical development, as has been demonstrated by recent successes in a BCL11A- silencing gene therapy clinical trial for individuals with sickle cell disease (SCD)146; 326.

115 Genetic association studies have reported over 500 independent loci for RBC traits16-27; 29; 301; 309;

310; 314; 318; 348; 584. However, several research gaps remain which may be addressed via recently developed methods and broadly representative study populations. First, previously published

RBC trait genome-wide association study (GWAS) populations have mostly been ancestrally homogeneous32; 343; 348; 435; 456; 459; 470; 594; 627. Utilization of diverse study populations can improve identification of rare or ancestry-specific variants located in biological pathways that affect phenotypes in global populations. Relatedly, gaps between estimated heritability and the proportion of variance explained by GWAS findings suggest that additional associations remain to be identified, including rare variants and independent secondary associations at known loci that are both more likely to be ancestrally specific301; 340; 496. Finally, RBC traits exhibit modest to high correlation, and several dozen loci have been reported for two or more RBC traits, although few studies have leveraged this shared genetic architecture to increase statistical power to map novel RBC trait loci 20; 26; 28; 301; 405; 419.

In this work, we examined the individual and shared genetic architecture of seven RBC traits in participants of the ancestrally diverse Population Architecture using Genomics and

Epidemiology (PAGE) study342. Our findings reinforce the necessity of incorporating multi- ethnic study populations in genomics in order to accurately characterize RBC trait loci and encourage equitable application of the results to translational work627. The complexity of association signals at loci previously characterized in European- and East Asian-ancestry populations also demonstrates improved power to perform conditional analysis using a combined-phenotype model614.

116 5.3. Methods

5.3.1. Study population

The PAGE study comprises ancestrally-diverse study populations from United States cohorts and biobanks evaluating common complex diseases and accompanying risk factors (see online supplement for more information). This study used data from self-reported African

American, Asian American, European American, Hispanic/Latino, and Native American participants from the Atherosclerosis Risk in Communities Study (ARIC); the Coronary Artery

Risk Development in Young Adults Study (CARDIA); the Hispanic Community Health

Study/Study of Latinos (HCHC/SOL); the Icahn Mt. Sinai School of Medicine BioME Biobank

(BioME); and the Women’s Health Initiative (WHI, described above). Participants were excluded if they met any of the criteria described in Section 4.3.2.

5.3.2. RBC Trait Measurement

RBC traits were measured with hemanalyzers following standardized laboratory protocols from blood draws at the earliest available visit (see online supplement) for the three primary (HCT, HGB, and RBCC) and four derived (MCH, MCHC, MCV, and RDW) RBC traits

(Table 3). RBC trait values that exceeded four standard deviations from the mean of the trait in the overall study population were excluded, mirroring protocols established by prior GWAS17; 18.

5.3.3. Genotyping and Imputation

Study-specific genotyping information is described in the supplemental methods. All studies were imputed to the 1000 Genomes phase 3 reference panel after study-specific quality control criteria were applied (Table 11). We further excluded SNPs on a sub-study-specific basis which had poor imputation quality (< 0.4) or an effective heterozygosity < 35 (calculated as 2 x

117 CAF x (1-CAF) x N x imputation quality, where CAF is coded allele frequency and N is sample size).

5.3.4. Statistical Methods

5.3.4.1 Overview

Briefly, we used an adaptive sum of powered scores (aSPU) simulation-based method to perform a combined-phenotype analysis incorporating univariate results from seven RBC traits in sixteen analytic subgroups that were combined using inverse-variance-weighted meta- analysis. Fifteen of the sixteen analytic subgroups were identified by study and self-reported race/ethnicity (Tables 12-14). The sixteenth subgroup was a pooled sample of self-reported

African American, Asian American, Hispanic/Latino, Native American, and “Other” MEGA- genotyped individuals from BioMe, HCHS/SOL, and WHI. Sensitivity analyses by trait and race/ethnicity were also performed. Complete summary level results are available through dbGaP

(phs000356).

5.3.4.2 Reporting

Previously-reported SNPs for the seven RBC traits evaluated in this study were identified through review of the NHGRI-EB GWAS Catalog21 as of January 1, 2019, supplemented by a

PubMed search. Multi-ethnic combined-phenotype results were presented as the primary findings, employing Bonferroni correction assuming 10M independent tests (i.e., paSPU<5E-9).

We defined a locus using physical proximity (+/-500kb from the lead SNP), and we defined an association signal as the lead (most significant) SNP and proxy SNPs in local LD based on conditional independence within ten megabases. Discovery loci were defined as ≥500kb from and conditionally independent of a variant previously reported to satisfy the field standard p<5E-

118 8 for any of the seven RBC traits. Ancestry-specific and trait-specific analyses were performed as sensitivity analyses to improve interpretation of results.

5.3.4.3 Univariate analysis

Univariate associations for the seven RBC traits were estimated assuming an additive genetic model of inheritance and adjusting for linear effects of age at blood draw, sex, study site or region, and ancestral principal components23. The total MEGA-genotyped subgroup was analyzed using generalized estimating equations allowing correlated errors for first or second-degree relatives, and independent error distributions by self-reported ancestry group24. Linear regression was implemented in SUGEN for the other 15 analytic subgroups24. For each RBC trait, METAL software was used to perform inverse-variance-weighted meta-analysis across all sub-studies604.

SNP effect heterogeneity was measured with the Cochran’s Q test. SNP meta-analysis p-values were assessed by RBC trait by calculating genomic inflation factors (λ) and plotting the expected distribution against observed results.

5.3.4.4 Combined-phenotype analyses

To evaluate evidence for shared genetic effects across all seven RBC traits, we combined meta-analyzed univariate results with aSPU to generate a combined-phenotype p-value for each

SNP17. In comparison with other available methods, we chose aSPU because it exhibited low type 1 error rate in simulations (data not shown); accommodated direction of effect; and was computationally scalable to the millions of SNPs measured using 1000 Genomes Phase 3 imputed data628. We implemented aSPU using Julia 1.0 to optimize efficiency

(https://github.com/kaskarn/aspu_julia).

aSPU incorporated univariate summary z-scores, calculated for each SNP across all 7 traits, to yield a single p-value evaluating whether one or more of the traits were associated with

119 a given SNP. Briefly, the procedure estimates Σ, the 7 × 7 correlation of null z-scores across

11 univariate results and draws 10 Monte-Carlo samples from the multivariate 푁7(0, Σ̂) distribution. For each SNP j, the results for all 7 traits 푧푗1, … , 푧푗7 are used to form a sequence of

훾 훾 sums of powered scores: 푆푃푈(훾) = 푧1 + … + 푧7 , where γ = 0, 1, …, 8, plus 푆푃푈(∞) =

11 max |푆7|. Each powered score is compared to the distribution of the 10 powered scores calculated using simulated null values with the same γ to calculate a Monte-Carlo p-value. An

11 overall SNP- p-value (paSPU), ranging between 1/(1+10 ) and 1, is calculated by comparing the minimum p-value across the sequence of powered scores to the reference distribution of minimum p-values across the sequence of powered scores computed using the simulated null data. The adaptive aspect of the test lies in the potential for different γ values to yield the maximal SPU across SNPs, maintaining power compared to a test with only a single possible alternative hypothesis.

5.3.4.5 Sensitivity Analyses

Sensitivity analyses were performed for combined-trait results by self-reported race/ethnicity among populations with greater than 1,000 participants (African Americans,

Hispanics/Latinos, and European Americans). We also examined whether there was evidence of significant loci that were not identified in combined phenotype analyses. Finally, in an attempt to examine the influence of the previously identified 3.7 kb structural variant esv3637548 in the

HBA1/2 region of , we also adjusted for esv3637548 dosage (r2=0.86) in the

MEGA-genotyped subgroup18. Combined phenotype p-values were then compared before and after conditioning on esv3637548.

120 5.3.4.6 Generalization

All SNPs located within 500kb of a variant previously reported for any RBC trait were evaluated for evidence of association in the combined phenotype analysis as well as each individual trait analysis. A generalization significance threshold of 1.07E-4 was calculated using

Bonferroni correction for the previous number of one-megabase genomic regions for which one or more genome-wide-significant variants were reported for one or more RBC traits (n=466, representing 1,308 index SNPs previously reported for one or more of the seven RBC traits we evaluated). We first report trait-specific associations—i.e., index variants that have been reported by trait. We did not report SNPs that exceed genome-wide-significance for the first time in one

RBC trait but have been reported for another trait as discovery associations; therefore, we also used the aforementioned significance threshold to evaluate generalization of association signals in each trait across all known loci.

5.3.4.7 Identification of conditionally independent association signals

Iterative conditional analysis was performed to identify all independent, genome-wide- significant combined-phenotype lead SNPs as described above. To avoid identifying SNPs as independent that were in long-range LD, we began by conditioning on the top SNP within ten megabase windows on each chromosome. To identify conditionally independent SNPs, linear models were extended to include all PAGE combined-phenotype lead SNPs on shared chromosomes using the same methods described above for univariate analysis, with an added covariate to include the dosage information for each participant at each lead SNP. Following each round of conditioning, aSPU was re-run on conditioned results. Additional rounds of conditional analyses were performed as an iterative process until no genome-wide-significant

SNPs remained in the combined phenotype analysis.

121 5.3.5. Publicly available expression quantitative trait locus (eQTL) analysis

To help prioritize causal genes at identified loci, we attempted to evaluate all lead SNPs within significant loci in relevant available tissues (whole blood, liver, spleen, and thyroid) for evidence of association with gene expression using the Genotype Tissue Expression (GTEx) portal429.

5.4. Results

In total, 68,634 participants met study inclusion criteria, with the number of participants by trait ranging from 33,526 (RDW) to 67,856 (MCHC) (Tables 13,14). Seventy-eight percent of participants were female and the average participant was 57 years old at time of blood collection. Self-reported race/ethnicity in the total study population was approximately 20%

African American, 30% Hispanic/Latino, and 40% European American (Table 12). Partial correlation by RBC trait pair ranged from HCT-MCHC (partial correlation ρ = -0.02) to HCT-

HGB (ρ = 0.94, Figure 8A). Approximately 21M SNPs met our inclusion criteria (Table 11).

5.4.1. Combined-phenotype analyses

SNP associations with the combined RBC phenotype exceeded genome-wide significance at 39 loci (Figure 10), all of which were identified previously. Lead SNPs at five loci (SLC17A2/TRIM38, HFE, HBS1L/MYB, HBA1/2, and TMPRSS6) were associated with three or more RBC traits at genome-wide significance (Tables 15-22). HCT, HGB, and MCHC exhibited the fewest genome-wide-significant associations with lead SNPs (eleven, ten, and two, respectively), whereas MCH and MCV had the largest number of associations (nineteen and twenty respectively, Tables 15-22). Consistent with other GWAS of quantitative complex traits, effect size was inversely correlated to allele frequency across all phenotypes (Figure 8B).

122 Trait-specific directions of effect were largely consistent with pairwise correlations, as expected given the methodological design of aSPU. Of 106 associations which were genome-wide significant for two traits, in 98 instances (93%) direction of effect matched the direction of the pairwise correlation (Figure 8C, Tables 15-22). All eight trait-pair associations with directions of effect opposite of expectation were instances in which MCH or MCV drove the lead SNP association, and HCT or HGB had a different lead SNP in high LD with the combined-phenotype lead SNP (r2>0.8 in the combined MEGA-genotyped study population). Notably, of the eight associations with unexpected directions of effect by trait pair, none of the pairs were more than modestly correlated (0.2<ρ <0.4).

5.4.1.1 Evidence of independent associations at established loci

We identified 20 independent association signals at seven loci (HFE, CCND3,

HBS1L/MYB, CITED2, ABO, HBA1/2, and PLIN4/5, Table 10, Figure 8C). The majority of lead

SNPs were common to all ancestries (MAF>0.01); evidence of association was most significant in European Americans at HFE and HBS1L/MYB loci, whereas Hispanics/Latinos had the most significant association at both CITED2 lead SNPs. In two instances, known causal variants accounted for the entire association signal after conditioning. At the HFE locus, both rs1800562

(C282Y) and rs1799945 (H63D, r2~0.99 with lead SNP rs2032451) are known coding hemochromatosis variants and accounted for all significant associations within +/-3Mb of the lead SNP. Similarly, rs2519093 and rs10901252 are in moderate to high LD with variants that affect RBC traits but also determine an individual’s ABO blood type, and adjusting for these two variants accounted for the entire association at this locus.

Of note, the HBA1/2 locus demonstrated ancestry specificity (i.e., the lead SNP was monomorphic in one or more ancestries) at 11 of 14 conditionally independent SNPs (Figure

123 9A). With the exception of rs60125383 (frequency of the A allele: 0.43 in African Americans,

0.55 in European Americans, 0.62 in Hispanics/Latinos), located in a nonsense-mediated-decay transcript for NPRL3, no lead SNP at this locus was common to all ancestries. The LD block for rs60125383 contained fewer variants in African Americans (Figure 9B, no SNPs r2>0.4) compared to Hispanics/Latinos (Figure 9C, 10 SNPs r2>0.6) and European Americans (Figure

9D, 13 SNPs r2>0.6).

5.4.1.2 Sensitivity Analyses

Trait-specific sensitivity analyses identified two previously-unreported variants exceeded genome-wide significance for a single RBC trait in the univariate analyses, yet did not meet genome-wide significance in the combined phenotype. Rs6573766 was specific to RBCC

(p=1.1E-9) and is common to all ancestries but was poorly captured by earlier genotyping arrays and is not represented in 1000 genomes phase 3 data (Figure 11, Table 44). Rs145548796 was significant for MCV (p=4.6E-9) and is rare (< 1%) in all populations, only meeting the inclusion criteria in the MEGA pooled sample and one study sub-population (Figure 12, Table 45). When adjusting for esv3637548 deletion dosage in the MEGA-genotyped subgroup, we observed evidence of both attenuation and strengthening of effect at otherwise conditionally independent lead SNPs at the HBA1/2 locus (Table 46). Specifically, eight lead SNPs lost more than two orders of magnitude p-value after conditioning on esv3637548, one increased in significance, and five remained unchanged.

5.4.1.3 Generalization of previously reported associations

Generalization of previously identified association signals varied for trait-specific loci

(p<1.07E-4, Tables 47-51), ranging from 50 of 143 (35%) for MCHC to 93 of 121 (77%) for

HGB. Ancestry-specific generalization varied by trait, with the highest proportion of

124 generalization occurring in the European-ancestry sub-population and the lowest occurring in

African Americans, which may be due to power differences to detect associations by ancestry.

5.4.2. eQTL function of index SNPs

To assess the potential regulatory roles of lead SNPs, we evaluated cis-eQTL (<500kb) associations for all lead SNPs in GTEx as available429. Thirty-three of 51 SNPs were low- frequency or common (MAF>1%) in the European-ancestry GTEx population and had available information in whole blood, liver, spleen, and/or thyroid tissues. Fourteen SNPs exhibited significant associations in RBC-relevant tissues; seven SNPs were eQTLs for multiple genes

(Table 52). Although approximately 40 genes were within 500kb of each of the chromosome 16 lead SNPs, none of the lead SNPs in this region exceeded a MAF>1% in the GTEx study population and hence could not be evaluated for cis-eQTLs.

5.5. Discussion

RBC traits are complex quantitative phenotypes that have been broadly examined in

GWAS of European- and East Asian-ancestry study populations. Here, we examine the benefits of identifying and characterizing RBC trait associations in the ancestrally diverse PAGE study population using a combined-phenotype approach. We demonstrate that ancestral diversity improves characterization of loci containing both ancestry-specific and common variants.

Overall, the complex genetic architecture at several RBC loci, which was revealed by inclusion of ancestrally diverse populations, underscores the continued importance of expanding existing

RBC trait GWAS of European and East Asian populations when attempting to identify and characterize complex trait loci.

125 With regard to regions exhibiting multiple independent significant associations, our results demonstrate allelic heterogeneity at known RBC trait loci, the characterization of which was enabled by an inclusive study design. Of particular note was our identification of eleven variants specific to African and/or Amerindian ancestries within the first megabase of chromosome16. The chromosome 16 region includes hemoglobin protein-coding genes HBA1,

HBA2, HBM, and HBZ as well as fifty other protein-coding genes with plausible roles in RBC trait biology, for example genes harboring monogenic causal variants for blood disorders (e.g., rs33987053, an HBA2 variant causing alpha-thalassemia). Decades of research have demonstrated selective pressure in this region occurring over millennia in malaria-endemic regions of the world, but as with many other complex quantitative traits, HBA1/2 has been primarily analyzed in Eurocentric study populations. Given the high polygenicity and complexity of quantitative RBC traits, our identification of over a dozen independent association signals suggests a highly-transcribed region with either complementary or redundant regulatory mechanisms that may affect multiple genes. Future work should extend our efforts by examining other populations in malaria-endemic regions, as well as previously identified and highly influential structural variants, including a previously identified 3.7kb copy number variant, which was not included in our analysis because we did not evaluate insertion/deletion variants18.

A combined-phenotype method was selected due to its purported ability to increase statistical power to identify novel loci with modest effects across multiple correlated traits.

However, sample sizes of previous RBC trait GWAS suggest that many loci with modest effects and lead SNPs in the low to common allele frequency range in European or East Asian populations already have been identified. Power was also lacking to detect loci that might be specific to other race/ethnic groups—although African Americans and Hispanics/Latinos were

126 proportionally well-represented in this study, sample sizes similar to European populations will not be proportionately representative of genetic diversity, particularly for variants that are low- frequency or difficult to impute. This should encourage an increase in representation of African

Americans and Hispanics/Latinos, as the narrower (on average) LD blocks in populations exhibiting ancestral admixture also improve fine-mapping for prioritizing candidate variants for functional characterization.

The possibility that combined-phenotype methods could benefit the study of other correlated polygenic traits still merits further investigation, particularly with groups of traits that may overlap in genetic architecture, but have not been previously examined in concert. Over the past three decades, RBC traits have been associated with cardiovascular disease outcomes like heart failure and stroke, highlighting the potential for identifying novel pleiotropic loci2; 5; 261; 275;

402; 405; 629. Indeed, combined-phenotype approaches that examine the shared genetic architecture underlying intermediate phenotypes and clinical events may be particularly powerful for outcomes like stroke and heart failure, given that phenotypic heterogeneity of these phenotypes has complicated locus identification and characterization.

Our evaluation of lead SNPs’ effects on expression in RBC-relevant tissues faced known constraints that limited interpretation and contextualization of identified variants. Crucially, the vast majority of publicly available functional data were collected from European-ancestry individuals, precluding the use of these databases for interpreting potential effects of ancestry- specific or low-frequency SNPs on gene expression. For example, rs8051004 is one of two less frequent variants that were detected in European-ancestry populations at the HBA1/2 locus

(CAF=0.02). However, rs8051004 was reported as “monoallelic” in spleen tissue in GTEx, despite having a 10% allele frequency in PAGE African Americans and 12% and 11% in the

127 1000G African and East Asian superpopulations, respectively. The exclusion of populations with

African, Amerindian, and Asian ancestry continues to hamper the potential benefits of these resources. Additionally, while the GTex consortium has made extensive efforts to characterize a wide array of tissue types, bone marrow was not included429. RBCs enucleate in the bone marrow prior to entering circulation, with no nuclear transcription and extremely limited translation occurring in mature RBCs. Therefore, bone marrow is the only tissue for which eQTL data characterizing the effects of genetic variation on gene expression for RBCs directly— eQTLs in any other tissue only represent biological processes in other cell types that affect RBC traits post-enucleation.

As with other genetic association studies, we faced several limitations. First, sample sizes for RBC trait GWAS have ballooned to nearly 200,000 participants and we were restricted to a smaller study population. However, the PAGE study has recently demonstrated that modest- sized studies that are more ancestrally diverse improve detection of novel and independent signals compared to simply increasing the number of European-ancestry individuals455. Second, while this study did improve on previous studies in terms of representation from African and

Amerindian continental ancestries, we were unable to evaluate associations in several populations, particularly South Asians, Pacific Islanders, Native Americans, and Native

Hawaiians. Native Americans and Native Hawaiians are represented in PAGE, but RBC phenotypes were not measured in contributing studies. South Asian study populations have been included in several previous RBC trait GWAS; and Native Americans and Pacific Islanders remain underrepresented in GWAS of all complex traits26; 314; 627; 630. Nonetheless, efforts like these underscore the need for ancestral diversity, especially as clinical applications of RBC trait

GWAS findings become more apparent631. Third, we were unable to evaluate structural

128 variations, which have traditionally been difficult to impute, and re-calling all structural variants within significant loci was outside the scope of this work. A sensitivity analysis accounting for the effect of esv3637548 in MEGA-genotyped study participants suggests that further evaluation is required to determine whether true causal variants overlap the position of this 3.7kb structural variant on other ancestral haplotypes. However, it is expected that some structural variants will be adequately represented by proxy SNPs, and future sequencing-based studies will be able to characterize these rare variants. Finally, expression quantitative trait locus data could not be comprehensively interpreted given the limitations of publicly available databases as described above. It is imperative that these resources focus their efforts on becoming more inclusive over the next several years to keep abreast of expected increased representativeness in association studies.

5.6. Conclusion

In conclusion, we identified over 50 association signals within 39 loci in a combined- phenotype analysis of seven RBC traits. We did not observe large improvement in discover signal detection by using the combined-phenotype methods, although further work is required to fully test the utility of these approaches. However, our work demonstrates the benefits of diverse study populations for highly polygenic traits, in spite of the fact that as global populations are increasing in genetic diversity, genetic research has become less diverse. As genomics tools become more broadly available, our results underscore the critical importance of including diverse global populations so the benefits of genomics research can be equitably applied.

129

5.7. Main Figures & Tables

Figure 8. Identification and characterization of 39 loci in a multi-ethnic study population. A. RBC trait pair partial correlations among MEGA-genotyped participants adjusted for linear regression model covariates (N=29,090 for HCT, HGB, and MCHC measurements; N=22,330 for MCH, MCV, and RBCC; and N=19,573 for RDW had all required phenotype and covariate information). B. Low- frequency and rare alleles exhibit larger magnitude of effect across RBC traits in the total multi-ethnic study population. x-axis represents minor allele frequency; y-axis represents trait-specific effect size (z-score). Green dots are present in all ancestry sub-populations; purple dots are monomorphic in one or more ancestries; only variants exceeding nominal significance (p<0.05) are included. C. Lead SNPs from combined-phenotype analysis show shared genetic architecture. Shading intensity corresponds to t-value significance (scaled to a maximum of |15|).

A B

13 130

C

130 Figure 9. Multiple independent associations with MCH demonstrate complex genetic architecture at HBA1/HBA2 locus. All plots: x-axis represents position on chromosome 16, y-axis represents -log10(pvalue) of the association with MCH. A. Regional association plot of 14 independent associations in unadjusted analysis of multi-ethnic study population. Large circles represent conditionally independent lead SNPs; colored SNPs represent variants in high LD (r2>0.8 in LD in pooled MEGA subpopulation) with the corresponding lead SNP. B-D. Locus-Zoom plots of MCH association with rs60125383 (11th round of conditioning, purple diamond) in African Americans on an African American LD background (B), Hispanics/Latinos on a Hispanic/Latino LD background (C), and European Americans on a European LD background (D). A. Multi-ethnic Population

131

B. African Americans C. Hispanics/Latinos D. European Americans

131 Figure 10. Evidence of genetic associations shared across correlated RBC traits. X-axis: chromosome and position (top) and rsid (bottom) for each combined-phenotype lead SNP. Y- axis: trait-specific –log10(p-values), with increased intensity representing higher significance, for each combined-phenotype lead SNP. P-values scaled to a maximum –log10 value of 25 for improved interpretation.

16:314780 16:314780 6:109614844 3:195830276 18:43854259 6:135419042 6:135419042 22:37462936 11:5248233 6:26092170 4:55408999 6:139844429 6:41915871 7:50428445 20:55991637 9:4855858 19:13002400 7:100235970 10:46080590 5:1107428 11:5860096 22:32879617 22:50962782 3:24341268 19:4498157 14:65472241 12:4331647 10:71094504 19:33759240 16:88849421 11:61589481 12:111973358 14:23494277 15:78535437 12:112840766 2:46372781 7:151415256 5:127350549 7:80319938 9:136141870 19:2177925 MCV MCH RBCC

132 HGB

HCT MCHC

RDW

rs28456

rs855791 rs218265 rs590856 rs140523 rs919797 rs958483 rs597808

rs7853365 rs9924561 rs2032451 rs6014993 rs1799918 rs7385804 rs4535497 rs4820079 rs2361016 rs4890634 rs2608604 rs8013143 rs4887023 rs3749748 rs2519093

rs11621325 rs35786788 rs33930165 rs71714426 rs12718598 rs13203155 rs34793991 rs73210009 rs12320749 rs17476364 rs11066283 rs61523591 rs10253736 rs28832309 rs71176524 rs138242149

132

Figure 11. Locus-Zoom plots of the association between rs6573766 and RBCC in PAGE African Americans (A), Hispanics/Latinos (B), and European Americans (C)

A. RBCC-rs6573766 association in PAGE B. RBCC-rs6573766 association in PAGE African American study participants Hispanic/Latino study participants

C. RBCC-rs6573766 association in PAGE European American study participants

133 Figure 12. Locus-Zoom plot of the association between MCH (A) and MCV (B) and rs145548796 in the total MEGA study population

A. MCH-rs145548796 association in B. MCV-rs145548796 association in MEGA-genotyped study participants MEGA-genotyped study participants

134

Table 10. RBC trait loci with evidence of multiple independent signals among PAGE study participants. p values CAF* Multi-ethnic RBC trait-specific Combined phenotype by race/ethnicity* Signal Chr:pos Ref/Alt AA HL EU HCT HGB MCH MCHC MCV RBCC RDW AA HL EU

HFE locus rs2032451 1 6:26092170 T/G 0.96 0.88 0.85 4.0E-16 1.4E-30 8.2E-38 1.8E-22 2.3E-27 0.01 3.4E-25 2.0E-4 2.3E-3 1.0E-11 rs1800562 2 6:26093141 A/G 0.98 0.98 0.93 1.4E-4 2.4E-5 7.7E-30 3.1E-3 1.2E-3 0.78 0.05 1.0E-5 1.0E-10 1.0E-11

CCND3 locus rs1410492 1 6:41907855 C/G 0.94 0.86 0.75 0.24 0.21 1.0E-14 0.59 3.5E-19 1.5E-14 1.9E-6 0.04 1.4E-7 1.0E-11 rs11964516 2 6:41860252 T/C 0.15 0.16 0.17 0.04 0.01 5.6E-13 0.12 3.0E-12 8.8E-4 0.03 0.02 1.6E-6 1.0E-11

HBS1L/MYB locus rs35786788 1 6:135419042 A/G 0.92 0.85 0.74 8.8E-20 7.0E-11 3.0E-66 6.2E-07 1.1E-59 3.3E-60 1.6E-16 8.0E-7 1.0E-11 1.0E-11 rs12664956 2 6:135384188 T/C 0.22 0.26 0.37 2.6E-4 2.3E-3 1.1E-14 7.9E-3 2.4E-12 2.0E-9 0.52 0.03 8.1E-5 1.0E-6

CITED2 locus rs590856 1 6:139844429 A/G 0.63 0.41 0.45 1.2E-4 2.1E-4 7.7E-16 0.44 1.4E-23 2.2E-8 0.79 1.3E-4 1E-11 5.1E-7 rs607203 2 6:139841653 T/C 0.79 0.93 0.96 0.02 0.04 7.9E-11 0.12 2.1E-13 8.8E-5 0.13 5.3E-3 6.2E-6 2.9E-7 135 ABO locus

rs2519093 1 9:136141870 T/C 0.89 0.85 0.80 5.7E-16 8.7E-18 0.32 0.05 0.85 1.1E-7 0.09 2.1E-4 1.1E-8 1.0E-8 rs10901252 2 9:136128000 C/G 0.84 0.93 0.92 0.02 3.4E-4 4.2E-5 5.5E-4 7.9E-8 8.0E-8 0.82 5.3E-5 2.4E-4 5.0E-6

HBA locus rs9924561 1 16:314780 G/T 0.91 0.99 -- 1.4E-4 3.6E-26 5.8E-158 9.7E-87 2.4E-135 8.6E-49 4.9E-6 1.0E-11 -- -- rs76613236 2 16:230724 C/G 0.98 0.995 -- 0.79 0.03 6.2E-20 2.0E-5 3.0E-18 5.3E-14 2.7E-5 1.0E-11 -- -- rs142154093 3 16:366048 C/G 0.98 0.99 -- 0.02 4.0E-7 1.3E-32 2.2E-12 4.9E-33 6.5E-13 2.1E-5 1.0E-11 -- rs186066503 4 16:405483 T/C -- 0.994 -- 0.34 8.5E-3 3.6E-30 4.4E-18 2.8E-19 2.7E-15 3.5E-6 -- 1.0E-11 -- rs530159671 5 16:250184 A/G -- 0.991 0.997 4.5E-3 7.5E-8 1.7E-24 7.2E-12 2.0E-16 2.3E-7 2.2E-4 -- 1.0E-11 -- rs8058016 6 16:228786 A/C 0.99 0.992 -- 2.2E-3 2.8E-5 1.8E-22 1.7E-5 3.5E-21 1.0E-4 8.0E-3 1E-11 2.6E-8 -- rs60616598 7 16:297264 A/G 0.83 0.96 -- 0.04 1.0E-5 8.1E-23 2.2E-11 1.8E-16 9.7E-9 2.3E-9 1.0E-11 2.8E-9 -- rs145752042 8 16:267208 A/G 0.993 -- -- 1.0E-4 2.4E-7 1.2E-18 1.2E-4 3.1E-19 3.0E-4 0.04 3.2E-10 -- -- rs145546625 9 16:220583 T/C 0.98 0.93 -- 0.19 0.08 2.5E-14 0.17 8.8E-15 1.4E-5 5.9E-3 0.22 1.0E-11 -- rs115415087 10 16:205132 T/C 0.018 0.004 -- 0.80 0.24 1.2E-12 1.2E-4 3.5E-14 3.2E-10 0.17 1.4E-10 1.3E-3 -- rs61743947 11 16:240000 T/C 0.995 -- -- 0.04 7.0E-5 2.4E-13 1.4E-11 8.7E-12 0.13 2.3E-6 3.1E-8 -- -- rs60125383 12 16:176446 A/T 0.43 0.62 0.55 0.11 0.001 3.1E-13 2.1E-6 4.5E-8 1.3E-4 0.91 6.1E-7 0.01 2.0E-5 rs55932218 13 16:221151 T/C 0.05 0.01 -- 0.52 0.03 6.3E-10 1.2E-4 2.0E-8 2.1E-3 3.2E-5 5.0E-8 0.01 -- rs8051004 14 16:198835 T/C 0.10 0.05 0.02 0.81 0.04 7.2E-11 3.2E-8 8.1E-8 0.06 0.04 0.01 5.1E-7 0.04

135 132 PLIN4/PLIN5 locus rs919797 1 19:4498157 A/G 0.68 0.49 0.46 5.1E-2 5.5E-4 1.9E-11 1.8E-5 6.5E-9 4.4E-1 8.8E-1 0.35 9.5E-3 1.1E-8 rs12459922 2 19:4455862 A/G 0.12 0.25 0.26 0.003 0.003 4.3E-8 0.34 1.4E-9 0.09 0.11 0.17 4.9E-5 1.0E-4 Bold font for combined-phenotype analysis indicates that the index SNP also had the lowest reported p-value for that particular trait. Variants not meeting effective heterozygosity criterion of 35 excluded. *Restricted to populations with >1000 participants.

136

136

Table 11. Genotyping platforms and QC measures used by participating study populations.

WHI- WHI- WHI- WHI- WHI-- WHI- ARIC BioME CARDIA GARNET GECCO HIPFX WHI-LLS MOPMAP SHARe WHIMS PAGE* N 12,050 3,244 2,534 4,248 1,567 3,195 1,115 2,285 3,086 5,685 29,223 genotyped Illumina Affymetrix Human Gene Human Infinium Omni1-Quad Illumina Illumina Titan, Affymetrix Illumina Affymetrix OmniExpress Expanded v1-0 B Affymetrix Human Illumina Human Axiom Genotyping GeneChip 610 and GeneChip Exome- Multi- (N=3,085); GeneChip Omni1- 50K and Omni1- Genome- Platform SNP Array Cytochip SNP Array 8v1_B Ethnic Affymetrix SNP Array 6.0 Quad v1-0 610K Quad v1-0 Wide, 6.0 370K 6.0 Genome- Genotypin GeneChip B B Human Wide Human g Array SNP Array CEU I 6.0 (N=170) Array Plate Sample call rate QC 90% 98% 90% 98% 98% 98% 98% 90% 90% 98% 98% Filter

137 ARIC: Atherosclerosis Risk in Communities Study, MESA: Multi-Ethnic Study of Atherosclerosis, PAGE: Population Architecture using Genetic Epidemiology study, WHI: Women’s Health Initiative. *PAGE includes all participants of the BioME Biobank, Hispanic Community Study/Study of Latinos, and WHI that were genotyped on the MEGA

array. All genotypes imputed to the 1000 Genomes phase 3 v 5 reference panel; Build 37 of the Genome Reference Consortium was used to assign chromosomal positions and rsid numbers (dbSNP v150).

132137

Table 12. Ancestry representation by self-reported race/ethnicity by RBC trait.

Number of study participants by self-reported race/ethnicity

HCT HGB MCH MCHC MCV RBCC RDW African American 16,256 16,258 8,707 16,301 8,690 8,692 6,153 Hispanic/Latino 20,796 20,784 17,426 20,811 17,409 17,439 17,305 European American 29,514 29,509 14,715 29,494 14,710 14,712 9,630 Asian 634 633 127 633 127 127 123 Native American 604 604 20 20 603 20 20 Other 383 383 383 383 383 383 383 Total 68,187 68,171 41,378 67,642 41,922 41,373 33,614

138 Proportion of study participants by self-reported race/ethnicity

HCT HGB MCH MCHC MCV RBCC RDW African American 23.8% 23.8% 21.0% 24.1% 20.7% 21.0% 18.3% Hispanic/Latino 30.5% 30.5% 42.1% 30.8% 41.5% 42.2% 51.5% European American 43.3% 43.3% 35.6% 43.6% 35.1% 35.6% 28.6% Asian 0.9% 0.9% 0.3% 0.9% 0.3% 0.3% 0.4% Native American 0.9% 0.9% 0.0% 0.0% 1.4% 0.0% 0.1% Other 0.6% 0.6% 0.9% 0.6% 0.9% 0.9% 1.1%

132138

Table 13. Descriptive statistics for PAGE participants by study and race/ethnicity for ARIC, BioME, and CARDIA participants. ARIC BioME CARDIA African European European African Hispanic/ Asian Native African European Other American American American American Latino American American American American Women, 1,763 4,884 232 1715 2,676 159 538 884 66 (52%) 10 (50%) N(%) (63%) (53%) (39%) (60%) (61%) (42%) (61%) (54%) Age (yrs), 53.4 54.2 63.6 51.3 54.5 47.8 52.8 51.2 24.4 25.6 SD (5.78) (5.68) (11.68) (14.1) (14.2) (16.4) (14.9) (14.1) (3.83) (3.35) 40.33 42.18 39.41 37.96 38.78 39.78 39.14 39.26 41.06 42.91 HCT (4.18) (3.71) (4.49) (4.76) (4.62) (4.65) (5.33) (4.58) (4.53) (4.10) 13.23 14.13 13.35 12.62 13.00 13.41 13.06 13.15 13.73 14.56 HGB (1.42) (1.31) (1.59) (1.65) (1.62) (1.64) (1.97) (1.62) (1.54) (1.35) 28.94 30.58 31.12 24.42 30.04 30.49 30.22 29.46 29.08 30.30 MCH (2.45) (1.69) (2.47) (3.06) (2.68) (3.33) (3.26) (2.93) (2.43) (1.62) 139 32.79 33.50 33.86 33.18 33.48 33.69 33.24 33.46 33.46 33.95

MCHC (1.14) (0.94) (0.74) (0.89) (0.85) (0.78) (1.09) (0.94) (0.89) (0.81) 86.67 90.58 91.97 88.52 89.55 90.25 90.60 87.81 86.90 89.25 MCV (6.04) (4.18) (6.23) (7.89) (6.79) (8.11) (8.03) (7.36) (6.24) (4.31) 4.53 4.57 4.30 4.29 4.33 4.41 4.37 4.48 4.73 4.82 RBCC (0.48) (0.42) (0.53) (0.59) (0.54) (0.59) (0.51) (0.56) (0.55) (0.49) 14.71 13.67 14.14 14.68 14.22 13.92 14.31 14.27 RDW N/A N/A (1.15) (1.05) (1.43) (1.60) (1.47) (2.14) (2.03) (1.81) ARIC: Atherosclerosis Risk in Communities Study. *Note: N represents the maximum number of individuals with any phenotype at the earliest available visit.

132139 Table 14. Descriptive statistics for PAGE participants by study and race/ethnicity for WHI and HCHS/SOL participants. HCHS/SOL WHI African Asian European Hispanic/ Native Hispanic/ American American American Latino American Latino Women, 100% 100% 100% 100% 100% 6,929 (59%) N(%) Age (yrs), 61.7 (7.05) 66.7 (6.43) 66.96 (6.56) 60.5 (6.75) 61.7 (7.31) 46.0 (13.83) SD 40.27 HCT 39.06 (3.02) 40.23 (3.00) 40.47 (2.87) 39.94 (2.76) 42.12 (4.04) (2.99) 13.49 HGB 12.92 (1.00) 13.53 (0.97) 13.65 (0.96) 13.44 (0.94) 13.81 (1.49) (1.03) MCH 28.71 (2.27) N/A 30.26 (1.82) 29.98 (1.88) N/A 29.22 (2.11)

140 33.52 MCHC 32.95 (1.17) 33.62 (1.05) 33.59 (1.02) 33.54 (1.01) 32.76 (1.38) (1.02) MCV 88.21 (6.51) N/A 91.80 (4.97) 90.31 (4.99) N/A 89.31 (5.72)

RBCC 4.43 (0.47) N/A 4.41 (0.42) 4.42 (0.40) N/A 4.73 (0.44)

RDW 14.48 (1.33) N/A 13.95 (1.14) 13.89 (1.07) N/A 13.67 (1.14) HCHS/SOL: Hispanic Community Health Study/Study of Latinos, PAGE: Population Architecture using Genetic Epidemiology study, WHI: Women’s Health Initiative. *Note: N represents the maximum number of individuals with any phenotype at the earliest available visit.

132140

Table 15. Univariate allele information and frequencies for combined-phenotype lead SNPs in combined multi-ethnic study population. Ref Alt AfAm Euro Hisp rsid Chr:Pos Allele Allele CAF CAF CAF CAF rs61523591 2:46372781 t c 0.160 0.207 0.146 0.146 rs2361016 3:24341268 a t 0.719 0.803 0.680 0.751 rs73210009 3:195830276 a g 0.623 0.786 0.554 0.659 rs218265 4:55408999 t c 0.231 0.242 0.158 0.319 rs4535497 5:1107428 a c 0.526 0.642 0.432 0.607 rs3749748 5:127350549 t c 0.795 0.944 0.757 0.838 rs2032451 6:26092170 t g 0.870 0.964 0.848 0.882 rs607203 6:139841653 t c 0.886 0.789 9.957 0.933 rs1800562 6:26093141 a g 0.947 0.985 0.938 0.981 rs11964516 6:41860252 t c 0.169 0.147 0.178 0.165 rs13203155 6:109614844 t c 0.598 0.609 0.553 0.686 rs35786788 6:135419042 a g 0.784 0.924 0.739 0.848 rs12664956 6:135384188 t c 0.320 0.216 0.375 0.259 rs590856 6:139844429 a g 0.480 0.635 0.449 0.413 rs12718598 7:50428445 t c 0.520 0.530 0.493 0.573 rs28832309 7:80319938 c g 0.947 0.915 0.371 0.986 rs7385804 7:100235970 a c 0.346 0.341 0.710 0.296 rs10253736 7:151415256 t c 0.757 0.859 0.202 0.803 rs7853365 9:4855858 a c 0.188 0.140 0.803 0.185 rs2519093 9:136141870 t c 0.829 0.891 0.064 0.850 rs10901252 9:136128000 c g 0.898 0.839 0.920 0.927 rs34793991 10:46080590 a c 0.055 0.024 0.095 0.040 rs17476364 10:71094504 t c 0.080 0.020 0.095 0.065 rs33930165 11:5248233 t c 0.993 0.988 -- 0.998 rs28456 11:61589481 a g 0.322 0.170 0.300 0.480 rs12320749 12:4331647 c g 0.225 0.248 0.210 0.228 rs597808 12:111973358 a g 0.604 0.905 0.507 0.710 rs8013143 14:23494277 a g 0.380 0.750 0.290 0.304 rs11621325 14:65472241 t c 0.428 0.405 0.442 0.421 rs4887023 15:78535437 c g 0.374 0.394 0.380 0.337 rs9924561 16:314780 t g 0.941 0.903 -- 0.987 rs186066503 16:405483 t c 0.996 -- -- 0.989 rs530159671 16:250184 a g 0.995 -- -- 0.991 rs8058016 16:228786 a c 0.994 0.990 -- 0.996 rs60616598 16:297264 a g 0.898 0.834 -- 0.963 rs145752042 16:267208 a g 0.996 0.993 -- --

141 rs145546625 16:220583 t c 0.958 0.975 -- 0.931 rs115415087 16:205132 t c 0.011 0.018 -- 0.004 rs61743947 16:240000 t c 0.998 0.995 -- -- rs60125383 16:176446 a t 0.544 0.434 0.552 0.619 rs55932218 16:221151 t c 0.030 0.052 -- 0.010 rs8051004 16:198835 t c 0.064 0.103 0.016 0.048 rs2608604 16:88849421 a g 0.619 0.660 0.629 0.566 rs4890634 18:43854259 t c 0.306 0.402 0.253 0.312 rs71176524 19:2177925 a at 0.548 0.556 0.535 0.568 rs919797 19:4498157 a g 0.515 0.676 0.462 0.491 rs1799918 19:13002400 c g 0.662 0.783 0.636 0.619 rs958483 19:33759240 t c 0.939 0.983 0.933 0.927 rs6014993 20:55991637 a g 0.576 0.709 0.495 0.652 rs4820079 22:32879617 t g 0.502 0.542 0.504 0.467 rs855791 22:37462936 a g 0.610 0.831 0.567 0.562 rs140523 22:50962782 c g 0.638 0.645 0.613 0.659

142

Table 16. HCT univariate results for combined-phenotype lead SNPs in combined multi- ethnic study population. Het p- rsid Chr:Pos Beta SE p-value value effN rs61523591 2:46372781 -0.2117 0.0295 6.93E-13 0.402 10,384 rs2361016 3:24341268 0.0164 0.0201 4.14E-01 0.02638 25,423 rs73210009 3:195830276 -0.0465 0.0186 1.23E-02 0.289 29,952 rs218265 4:55408999 -0.0852 0.0219 9.88E-05 0.7663 21,985 rs4535497 5:1107428 0.0062 0.0184 7.37E-01 0.6698 29,771 rs3749748 5:127350549 0.011 0.0227 6.28E-01 0.01551 21,907 rs2032451 6:26092170 -0.2195 0.027 3.97E-16 0.9999 15,327 rs1800562 6:26093141 -0.3803 0.0444 1.04E-17 0.1412 6,576 rs11964516 6:41860252 0.0416 0.0231 7.11E-02 0.08621 18,581 rs13203155 6:109614844 0.0594 0.0178 8.23E-04 0.1056 32,578 rs35786788 6:135419042 0.2029 0.0223 8.81E-20 0.486 22,457 rs590856 6:139844429 0.0672 0.0175 1.24E-04 0.7537 33,426 rs12718598 7:50428445 -0.0153 0.0175 3.82E-01 0.8567 32,359 rs28832309 7:80319938 0.046 0.065 4.79E-01 0.3201 3,483 rs7385804 7:100235970 0.117 0.0185 2.35E-10 0.4761 28,584 rs10253736 7:151415256 0.1546 0.0213 3.91E-13 0.5274 22,262 rs7853365 9:4855858 8.00E-04 0.0221 9.71E-01 0.6919 20,608 rs2519093 9:136141870 0.1893 0.0234 5.71E-16 0.8805 18,507 rs34793991 10:46080590 0.0068 0.0399 8.65E-01 0.6702 6,791 rs17476364 10:71094504 0.3706 0.0351 4.37E-26 0.2504 8,662 rs33930165 11:5248233 0.7231 0.1891 1.31E-04 0.2443 411 rs28456 11:61589481 0.1099 0.0197 2.40E-08 0.6768 27,718 rs12320749 12:4331647 -0.0814 0.021 1.00E-04 0.5262 22,243 rs597808 12:111973358 -0.1402 0.0196 9.34E-13 0.3607 31,312 rs8013143 14:23494277 -0.0203 0.0197 3.20E-01 0.8497 30,270 rs11621325 14:65472241 6.00E-04 0.0177 9.71E-01 0.9126 31,612 rs4887023 15:78535437 -0.0483 0.0178 6.73E-03 0.7273 31,532 rs2608604 16:88849421 -0.0513 0.0185 5.58E-03 0.4054 28,550 rs4890634 18:43854259 0.0581 0.0189 2.10E-03 0.1049 28,463 rs71176524 19:2177925 -0.1052 0.018 4.93E-09 0.2478 30,693 rs919797 19:4498157 -0.0356 0.0183 5.12E-02 0.6535 30,484 rs1799918 19:13002400 0.0479 0.0187 1.05E-02 0.774 28,946 rs958483 19:33759240 -0.1161 0.0409 4.57E-03 0.5942 5,603 rs6014993 20:55991637 0.0206 0.0182 2.57E-01 0.9732 30,025 rs4820079 22:32879617 0.0543 0.0171 1.55E-03 0.4961 33,785

143 rs855791 22:37462936 0.2256 0.0185 4.02E-34 0.4115 30,495 rs140523 22:50962782 0.0252 0.021 2.30E-01 0.7152 16,684 rs9924561 16:314780 0.1812 0.0599 2.49E-03 0.1862 3,926 rs186066503 16:405483 -0.2596 0.275 3.45E-01 0.0245 246 rs530159671 16:250184 0.5115 0.1805 4.59E-03 0.4598 355 rs8058016 16:228786 0.5625 0.1834 2.16E-03 0.6508 369 rs60616598 16:297264 0.0985 0.0488 4.36E-02 0.332 6,715 rs145752042 16:267208 -1.136 0.2927 1.04E-04 0.08684 215 rs145546625 16:220583 0.0916 0.0692 1.86E-01 0.2643 3,474 rs115415087 16:205132 0.0353 0.1406 8.02E-01 0.8619 732 rs61743947 16:240000 0.6984 0.3376 3.86E-02 1 108 rs60125383 16:176446 0.0294 0.0186 1.14E-01 0.7982 31,927 rs55932218 16:221151 -0.0533 0.0835 5.24E-01 0.2245 2,204 rs8051004 16:198835 0.0117 0.0486 8.10E-01 0.5986 7,246 rs1800562 6:26093141 -0.3803 0.0444 1.04E-17 0.1412 66022 rs11964516 6:41860252 0.0416 0.0231 7.11E-02 0.08621 18580 rs12664956 6:135384188 -0.0341 0.0189 7.19E-02 0.3611 29049 rs607203 6:139841653 -0.1132 0.0323 4.61E-04 0.2249 13333 rs10901252 9:136128000 -0.1052 3.00E-02 0.0004562 0.6959 11,670

144 Table 17. HGB univariate results for combined-phenotype lead SNPs in combined multi- ethnic study population. Het p- rsid Chr:Pos Beta SE p-value value effN rs61523591 2:46372781 -0.0696 0.008 4.26E-18 0.1845 18,218 rs2361016 3:24341268 0.0094 0.0068 1.71E-01 0.03582 25,467 rs73210009 3:195830276 -0.0277 0.0063 1.19E-05 0.4922 29,980 rs218265 4:55408999 -0.0208 0.0074 5.14E-03 0.884 21,927 rs4535497 5:1107428 0.0098 0.0063 1.16E-01 0.2285 29,771 rs3749748 5:127350549 0.0022 0.0077 7.80E-01 0.1099 21,990 rs2032451 6:26092170 -0.1056 0.0092 1.41E-30 0.9758 15,394 rs1800562 6:26093141 -0.1679 0.015 4.26E-29 0.1487 6,586 rs11964516 6:41860252 0.0187 0.0079 1.74E-02 0.1666 18,586 rs13203155 6:109614844 0.0067 0.006 2.69E-01 0.1852 32,602 rs35786788 6:135419042 0.0494 0.0076 7.01E-11 0.5427 22,519 rs590856 6:139844429 0.0221 0.006 2.10E-04 0.6879 33,419 rs12718598 7:50428445 -0.0025 0.0059 6.79E-01 0.6516 32,353 rs28832309 7:80319938 0.0479 0.0221 3.02E-02 0.4646 3,482 rs7385804 7:100235970 0.0296 0.0063 2.51E-06 0.4949 28,612 rs10253736 7:151415256 0.0547 0.0073 4.72E-14 0.6186 22,313 rs7853365 9:4855858 -0.0034 0.0075 6.54E-01 0.6038 20,612 rs2519093 9:136141870 0.0684 0.008 8.67E-18 0.8338 18,528 rs34793991 10:46080590 0.0148 0.0136 2.74E-01 0.3951 6,801 rs17476364 10:71094504 0.1301 0.012 1.46E-27 0.2726 8,719 rs33930165 11:5248233 -0.0544 0.0645 3.99E-01 0.9468 411 rs28456 11:61589481 0.0419 0.0067 4.50E-10 0.4198 27,658 rs12320749 12:4331647 -0.0271 0.0071 1.51E-04 0.3309 22,224 rs597808 12:111973358 -0.0531 0.0067 2.11E-15 0.2253 31,383 rs8013143 14:23494277 -0.0145 0.0067 3.09E-02 0.9065 30,245 rs11621325 14:65472241 0.0059 0.006 3.28E-01 0.8688 31,611 rs4887023 15:78535437 -0.0114 0.0061 6.13E-02 0.7968 31,535 rs2608604 16:88849421 -0.0368 0.0063 5.70E-09 0.6424 28,535 rs4890634 18:43854259 0.023 0.0065 3.74E-04 0.03955 28,415 rs71176524 19:2177925 -0.0315 0.0061 2.76E-07 0.1145 30,688 rs919797 19:4498157 -0.0216 0.0062 5.47E-04 0.5471 30,479 rs1799918 19:13002400 0.012 0.0064 5.94E-02 0.8099 28,931 rs958483 19:33759240 -0.0382 0.014 6.37E-03 0.5518 5,618 rs6014993 20:55991637 0.013 0.0062 3.63E-02 0.9039 30,056 rs4820079 22:32879617 0.0214 0.0058 2.45E-04 0.5727 33,778 rs855791 22:37462936 0.1019 0.0063 1.98E-58 0.3393 30,505

145 rs140523 22:50962782 0.011 0.0072 1.25E-01 0.8818 16,680 0.00013 rs9924561 16:314780 0.1968 0.0203 2.60E-22 09 3,925 rs186066503 16:405483 0.2585 0.0982 8.51E-03 0.08275 246 rs530159671 16:250184 0.3409 0.0634 7.46E-08 0.4964 349 rs8058016 16:228786 0.2687 0.0642 2.80E-05 0.3386 369 rs60616598 16:297264 0.0774 0.0175 1.00E-05 0.05922 6,393 rs145752042 16:267208 -0.5187 0.1005 2.45E-07 0.3859 215 rs145546625 16:220583 0.0412 0.0238 8.38E-02 0.1134 3,465 rs115415087 16:205132 -0.0569 0.0487 2.43E-01 0.801 752 rs61743947 16:240000 0.4709 0.1184 6.97E-05 1 108 rs60125383 16:176446 0.0202 0.0063 1.37E-03 0.5925 31,924 rs55932218 16:221151 -0.0607 0.0286 3.35E-02 0.2664 2,225 rs8051004 16:198835 -0.0341 0.0166 3.96E-02 0.6503 7,245 rs1800562 6:26093141 -0.1679 0.015 4.26E-29 0.1487 6,586 rs11964516 6:41860252 0.0187 0.0079 1.74E-02 0.1666 18,585 rs12664956 6:135384188 -0.0061 0.0065 3.46E-01 0.3891 29,095 rs607203 6:139841653 -0.0347 0.011 1.58E-03 0.6092 13,361 rs10901252 9:136128000 -0.0482 0.0102 2.26E-06 0.6584 11,688

146 Table 18. MCH univariate results for combined-phenotype lead SNPs in combined multi- ethnic study population. Het p- rsid Chr:Pos Beta SE p-value value effN rs61523591 2:46372781 0.0059 0.0199 7.67E-01 0.1719 11,007 rs2361016 3:24341268 0.106 0.0168 2.87E-10 0.9028 15,469 rs73210009 3:195830276 -0.1044 0.0155 1.58E-11 0.4444 18,035 rs218265 4:55408999 0.1932 0.0181 1.22E-26 0.5851 13,721 rs4535497 5:1107428 -0.0569 0.0154 2.10E-04 0.07997 18,356 rs3749748 5:127350549 -0.0156 0.0187 4.06E-01 0.1507 13,310 rs2032451 6:26092170 -0.2825 0.022 8.22E-38 0.05654 9,352 rs1800562 6:26093141 -0.4537 0.0369 1.16E-34 0.4352 3,876 rs11964516 6:41860252 0.1687 0.0192 1.71E-18 0.1094 11,321 rs13203155 6:109614844 -0.1128 0.0148 2.83E-14 0.389 19,753 rs35786788 6:135419042 -0.3173 0.0185 3.00E-66 0.1533 13,609 rs590856 6:139844429 0.1173 0.0146 7.71E-16 0.4587 20,365 rs12718598 7:50428445 0.1119 0.0146 2.00E-14 0.06985 19,532 rs28832309 7:80319938 0.0487 0.0598 4.16E-01 0.4092 2,483 rs7385804 7:100235970 -0.1191 0.0155 1.44E-14 0.09905 17,036 rs10253736 7:151415256 0.0295 0.0176 9.34E-02 0.916 13,593 rs7853365 9:4855858 -0.1399 0.0184 2.81E-14 0.6896 12,569 rs2519093 9:136141870 0.0192 0.0196 3.29E-01 0.4778 11,212 rs34793991 10:46080590 0.2094 0.0334 3.70E-10 0.6701 4,018 rs17476364 10:71094504 0.1254 0.029 1.55E-05 0.7716 5,170 rs33930165 11:5248233 1.3127 0.165 1.80E-15 0.1668 324 rs28456 11:61589481 -0.016 0.0163 3.26E-01 0.4805 17,123 rs12320749 12:4331647 0.0655 0.0174 1.64E-04 0.6853 13,550 rs597808 12:111973358 -0.0302 0.0163 6.36E-02 0.0146 18,903 rs8013143 14:23494277 -0.0391 0.0165 1.82E-02 0.3599 17,894 rs11621325 14:65472241 0.0967 0.0147 5.40E-11 0.5608 19,185 rs4887023 15:78535437 0.0333 0.0149 2.57E-02 0.74 19,088 rs2608604 16:88849421 -0.016 0.0155 3.03E-01 0.8609 17,690 rs4890634 18:43854259 0.0872 0.0158 3.79E-08 0.07243 17,224 rs71176524 19:2177925 0.0172 0.0151 2.52E-01 0.05867 18,993 rs919797 19:4498157 -0.1034 0.0154 1.90E-11 0.6489 18,147 rs1799918 19:13002400 -0.1187 0.0155 2.10E-14 0.6512 17,652 rs958483 19:33759240 0.1408 0.0331 2.11E-05 0.4684 3,631 rs6014993 20:55991637 0.1083 0.0151 8.40E-13 0.9958 18,398 rs4820079 22:32879617 0.0822 0.0143 9.03E-09 0.5197 20,563 rs855791 22:37462936 0.2884 0.0151 7.73E-81 0.5178 18,800

147 rs140523 22:50962782 0.0975 0.0162 1.76E-09 0.199 10,875 3.90E- rs9924561 16:314780 1.5441 0.0559 1.02E-167 11 2,842 rs186066503 16:405483 2.201 0.1929 3.61E-30 0.1758 205 1.94E- rs530159671 16:250184 1.4072 0.1378 1.71E-24 05 275 rs8058016 16:228786 1.7363 0.178 1.80E-22 0.02998 233 rs60616598 16:297264 0.4482 0.0456 8.12E-23 0.2428 4,802 rs145752042 16:267208 -2.2842 0.2591 1.20E-18 0.9663 171 rs145546625 16:220583 0.4016 0.0527 2.48E-14 0.1236 2,964 rs115415087 16:205132 -0.9756 0.1373 1.19E-12 0.03722 523 rs61743947 16:240000 2.2678 0.3096 2.40E-13 1 59 rs60125383 16:176446 0.1122 0.0154 3.14E-13 0.01865 19,377 rs55932218 16:221151 -0.4601 0.0744 6.32E-10 0.07002 1,494 rs8051004 16:198835 -0.2613 0.0401 7.20E-11 0.8688 4,271 rs1800562 6:26093141 -0.4537 0.0369 1.16E-34 0.4352 3,876 rs11964516 6:41860252 0.1687 0.0192 1.71E-18 0.1094 11,321 rs12664956 6:135384188 0.049 0.0157 1.85E-03 0.06695 17,620 rs607203 6:139841653 -0.241 0.0284 1.94E-17 0.2514 7,608 rs10901252 9:136128000 0.1053 0.0261 5.61E-05 0.1763 6,798

148 Table 19. MCHC univariate results for combined-phenotype lead SNPs in combined multi- ethnic study population. Het p- rsid Chr:Pos Beta SE p-value value effN rs61523591 2:46372781 0.003 0.0076 6.96E-01 0.3383 18,288 rs2361016 3:24341268 0.0091 0.0065 1.63E-01 0.7151 25,533 rs73210009 3:195830276 -0.0277 0.0061 5.18E-06 0.8902 30,027 rs218265 4:55408999 0.0187 0.0072 9.34E-03 0.9052 21,744 rs4535497 5:1107428 0.0219 0.006 2.87E-04 0.1073 29,807 rs3749748 5:127350549 -0.0105 0.0074 1.53E-01 0.08491 22,080 rs2032451 6:26092170 -0.0863 0.0088 1.76E-22 0.009454 15,507 rs1800562 6:26093141 -0.1196 0.0144 1.23E-16 0.5233 6,615 rs11964516 6:41860252 0.011 0.0075 1.44E-01 0.7896 18,601 rs13203155 6:109614844 -0.0326 0.0058 1.55E-08 0.4052 32,625 rs35786788 6:135419042 -0.045 0.0073 5.52E-10 0.9977 22,584 rs590856 6:139844429 -0.0044 0.0057 4.45E-01 0.02882 33,441 rs12718598 7:50428445 0.0075 0.0057 1.90E-01 0.3087 32,384 rs28832309 7:80319938 0.0588 0.0206 4.31E-03 0.06603 3,532 rs7385804 7:100235970 -0.0224 0.006 2.02E-04 0.6981 28,574 rs10253736 7:151415256 0.0074 0.0069 2.86E-01 0.9849 22,431 rs7853365 9:4855858 -0.0085 0.0072 2.39E-01 0.5081 20,596 rs2519093 9:136141870 0.0151 0.0076 4.82E-02 0.508 18,587 rs34793991 10:46080590 0.0303 0.013 1.94E-02 0.7964 6,841 rs17476364 10:71094504 0.0159 0.0115 1.69E-01 0.7174 8,736 rs33930165 11:5248233 -0.9275 0.063 5.53E-49 2.24E-07 428 rs28456 11:61589481 0.0066 0.0065 3.04E-01 0.7948 27,613 -2.00E- rs12320749 12:4331647 04 0.0068 9.75E-01 0.7072 22,208 rs597808 12:111973358 -0.01 0.0064 1.22E-01 0.1628 31,550 rs8013143 14:23494277 -0.0132 0.0064 3.97E-02 0.6124 30,222 rs11621325 14:65472241 0.0146 0.0058 1.13E-02 0.5484 31,613 rs4887023 15:78535437 0.0079 0.0058 1.72E-01 0.2675 31,545 rs2608604 16:88849421 -0.0529 0.0061 2.99E-18 0.4051 28,516 rs4890634 18:43854259 0.0138 0.0062 2.59E-02 0.2619 28,356 rs71176524 19:2177925 0.0095 0.0059 1.06E-01 0.9195 30,717 rs919797 19:4498157 -0.0275 0.006 4.36E-06 0.3908 30,513 rs1799918 19:13002400 -0.0147 0.0061 1.68E-02 0.7336 28,964 rs958483 19:33759240 0.0177 0.0136 1.93E-01 0.4297 5,666 rs6014993 20:55991637 0.0018 0.0059 7.67E-01 0.8094 30,090 rs4820079 22:32879617 0.01 0.0056 7.39E-02 0.8647 33,807 rs855791 22:37462936 0.0688 0.006 4.76E-30 0.04163 30,565

149 rs140523 22:50962782 0.0062 0.0069 3.69E-01 0.582 16,735 rs9924561 16:314780 0.3738 0.0189 7.57E-87 3.43E-08 3,983 rs186066503 16:405483 0.7504 0.0866 4.41E-18 0.4525 360 rs530159671 16:250184 0.4093 0.0597 7.19E-12 0.1162 349 rs8058016 16:228786 0.2783 0.0646 1.67E-05 0.2908 364 rs60616598 16:297264 0.1076 0.0161 2.22E-11 0.001721 6,814 rs145752042 16:267208 -0.3695 0.0961 1.21E-04 0.1249 209 rs145546625 16:220583 0.0319 0.0233 1.70E-01 0.4442 3,431 rs115415087 16:205132 -0.1834 0.0477 1.22E-04 0.04448 720 rs61743947 16:240000 0.7279 0.1078 1.44E-11 1 113 rs60125383 16:176446 0.0306 0.0064 2.09E-06 0.08923 28,977 rs55932218 16:221151 -0.104 0.027 1.15E-04 0.5502 2,251 rs8051004 16:198835 -0.0872 0.0158 3.17E-08 0.8564 7,286 rs1800562 6:26093141 -0.1196 0.0144 1.23E-16 0.5233 6,614 rs11964516 6:41860252 0.011 0.0075 1.44E-01 0.7896 18,601 rs12664956 6:135384188 0.0063 0.0062 3.04E-01 0.9766 29,153 rs607203 6:139841653 0.0195 0.0104 6.24E-02 0.01405 13,280 rs10901252 9:136128000 -0.0356 0.0097 2.50E-04 0.1218 11,728

150 Table 20. MCV univariate results for combined-phenotype lead SNPs in combined multi- ethnic study population.

rsid Chr:Pos Beta SE p-value Het p-value effN rs61523591 2:46372781 0.0474 0.052 3.63E-01 0.2967 11,012 rs2361016 3:24341268 0.2602 0.0438 2.74E-09 0.8001 15,504 rs73210009 3:195830276 -0.2453 0.0403 1.13E-09 0.3526 18,079 rs218265 4:55408999 0.5089 0.0474 7.47E-27 0.7617 13,557 rs4535497 5:1107428 -0.2471 0.04 6.82E-10 0.03222 18,357 rs3749748 5:127350549 -0.0253 0.0485 6.02E-01 0.2001 13,403 rs2032451 6:26092170 -0.6189 0.0571 2.28E-27 0.1273 9,410 rs1800562 6:26093141 -0.9309 0.0955 1.95E-22 0.5983 3,928 rs11964516 6:41860252 0.4471 0.0502 5.20E-19 0.4012 11,336 rs13203155 6:109614844 -0.2713 0.0386 2.18E-12 0.5726 19,783 rs35786788 6:135419042 -0.7814 0.048 1.13E-59 0.06354 13,745 rs590856 6:139844429 0.3791 0.0379 1.43E-23 0.2855 20,347 rs12718598 7:50428445 0.3065 0.0382 9.72E-16 0.07093 19,520 rs28832309 7:80319938 0.0319 0.1565 8.39E-01 0.6698 2,569 rs7385804 7:100235970 -0.2714 0.0403 1.69E-11 0.04316 17,050 rs10253736 7:151415256 0.0402 0.0457 3.79E-01 0.6778 13,689 rs7853365 9:4855858 -0.3747 0.0479 5.32E-15 0.758 12,588 rs2519093 9:136141870 -0.01 0.0512 8.46E-01 0.7421 11,320 rs34793991 10:46080590 0.5708 0.0866 4.26E-11 0.7771 4,063 rs17476364 10:71094504 0.3633 0.0752 1.35E-06 0.6988 5,220 rs33930165 11:5248233 6.0733 0.4307 3.75E-45 5.54E-07 339 rs28456 11:61589481 -0.1092 0.0426 1.04E-02 0.6303 17,018 rs12320749 12:4331647 0.2248 0.0454 7.21E-07 0.4018 13,520 rs597808 12:111973358 -0.0874 0.0422 3.86E-02 0.1134 18,985 rs8013143 14:23494277 -0.0742 0.043 8.46E-02 0.1555 17,839 rs11621325 14:65472241 0.207 0.0384 6.98E-08 0.4844 19,173 rs4887023 15:78535437 0.0541 0.0388 1.64E-01 0.872 19,089 rs2608604 16:88849421 0.1512 0.0405 1.88E-04 0.5188 17,645 rs4890634 18:43854259 0.2101 0.0414 3.99E-07 0.08241 17,151 rs71176524 19:2177925 0.0039 0.0393 9.21E-01 0.06443 18,980 rs919797 19:4498157 -0.2325 0.0401 6.45E-09 0.5538 18,130 rs1799918 19:13002400 -0.2862 0.0408 2.28E-12 0.3417 17,649 rs958483 19:33759240 0.3301 0.0868 1.42E-04 0.6988 3,628 rs6014993 20:55991637 0.311 0.0394 2.82E-15 0.9769 18,419 rs4820079 22:32879617 0.2244 0.0375 2.25E-09 0.09652 20,542

151 rs855791 22:37462936 0.5954 0.0395 3.06E-51 0.7942 18,793 rs140523 22:50962782 0.2524 0.0423 2.34E-09 0.05907 10,889 rs9924561 16:314780 3.7073 0.146 2.87E-142 3.65E-07 2,925 rs186066503 16:405483 4.5701 0.5091 2.79E-19 0.06771 209 rs530159671 16:250184 3.0586 0.372 1.99E-16 0.0004187 265 rs8058016 16:228786 4.3502 0.4605 3.50E-21 0.07391 240 rs60616598 16:297264 0.9611 0.1167 1.82E-16 0.3492 4,884 rs145752042 16:267208 -5.9748 0.6665 3.13E-19 0.4115 183 rs145546625 16:220583 1.0984 0.1416 8.84E-15 0.2418 2,955 rs115415087 16:205132 -2.641 0.3485 3.49E-14 0.2764 518 rs61743947 16:240000 5.6057 0.8212 8.72E-12 1 59 rs60125383 16:176446 0.2222 0.0406 4.55E-08 0.005679 19,370 rs55932218 16:221151 -1.0939 0.1948 1.97E-08 0.07124 1,552 rs8051004 16:198835 -0.569 0.1061 8.10E-08 0.7759 4,316 rs1800562 6:26093141 -0.9309 0.0955 1.95E-22 0.5983 38,729 rs11964516 6:41860252 0.4471 0.0502 5.20E-19 0.4012 41,253 rs12664956 6:135384188 0.1304 0.041 1.45E-03 0.002992413 17,689 rs607203 6:139841653 -0.7377 0.0743 3.09E-23 0.3559 7,613 rs10901252 9:136128000 0.3696 0.0677 4.86E-08 0.03921 6,779

152

Table 21. RBCC univariate results for combined-phenotype lead SNPs in combined multi- ethnic study population. Het p- rsid Chr:Pos Beta SE p-value value effN rs61523591 2:46372781 -0.0225 0.0039 1.20E-08 0.6916 11,117 rs2361016 3:24341268 -0.0134 0.0034 7.18E-05 0.01339 15,292 rs73210009 3:195830276 0.0042 0.0031 1.74E-01 0.2226 17,849 rs218265 4:55408999 -0.0382 0.0035 5.06E-27 0.4253 13,997 rs4535497 5:1107428 0.0127 0.0031 3.36E-05 0.7516 18,299 rs3749748 5:127350549 -0.0037 0.0038 3.28E-01 0.03913 13,041 rs2032451 6:26092170 0.0111 0.0045 1.27E-02 0.3875 9,161 rs1800562 6:26093141 0.0032 0.0077 6.73E-01 0.7533 3,759 rs11964516 6:41860252 -0.0188 0.0038 9.86E-07 0.4553 11,272 rs13203155 6:109614844 0.0223 0.003 6.56E-14 0.5968 19,652 rs35786788 6:135419042 0.0611 0.0037 3.34E-60 0.9665 13,223 rs590856 6:139844429 -0.0163 0.0029 2.16E-08 0.7455 20,378 rs12718598 7:50428445 -0.02 0.0029 7.89E-12 0.3873 19,516 rs28832309 7:80319938 -0.0021 0.0113 8.51E-01 0.01732 2,615 rs7385804 7:100235970 0.025 0.0031 7.09E-16 0.05119 16,941 rs10253736 7:151415256 0.0177 0.0036 6.34E-07 0.1959 13,344 rs7853365 9:4855858 0.0175 0.0037 2.12E-06 0.8254 12,470 rs2519093 9:136141870 0.0209 0.0039 1.12E-07 0.3303 11,028 rs34793991 10:46080590 -0.0384 0.0068 1.51E-08 0.5006 3,907 rs17476364 10:71094504 0.0232 0.0059 8.56E-05 0.6616 5,024 rs33930165 11:5248233 -0.2182 0.0314 3.94E-12 0.08639 340 rs28456 11:61589481 0.0192 0.0033 3.92E-09 0.1064 17,197 rs12320749 12:4331647 -0.0219 0.0035 2.36E-10 0.7261 13,591 rs597808 12:111973358 -0.0123 0.0033 1.98E-04 0.5381 18,639 rs8013143 14:23494277 0.0016 0.0033 6.32E-01 0.785 18,108 rs11621325 14:65472241 -0.0101 0.003 6.52E-04 0.7282 19,160 rs4887023 15:78535437 -0.009 0.003 2.57E-03 0.4397 19,068 rs2608604 16:88849421 -0.0142 0.0031 4.49E-06 0.5997 17,701 rs4890634 18:43854259 -0.006 0.0032 5.62E-02 0.4185 17,405 rs71176524 19:2177925 -0.0158 0.003 1.48E-07 0.8274 18,978 rs919797 19:4498157 0.0024 0.0031 4.44E-01 0.99 18,138 rs1799918 19:13002400 0.022 0.0031 1.59E-12 0.8491 17,583 rs958483 19:33759240 -0.044 0.0066 2.45E-11 0.7654 3,600 rs6014993 20:55991637 -0.0108 0.003 3.69E-04 0.5721 18,273 rs4820079 22:32879617 -0.007 0.0029 1.48E-02 0.5697 20,559

153 rs855791 22:37462936 -0.0067 0.0031 2.87E-02 0.8674 18,718 rs140523 22:50962782 -0.0085 0.0032 8.22E-03 0.1238 10,847 rs9924561 16:314780 -0.1645 0.0107 3.46E-53 0.04545 2,967 rs186066503 16:405483 -0.2684 0.034 2.77E-15 0.4032 201 rs530159671 16:250184 -0.1424 0.0275 2.26E-07 0.003154 280 rs8058016 16:228786 -0.1294 0.0334 1.05E-04 0.07766 233 rs60616598 16:297264 -0.0503 0.0088 9.70E-09 0.3888 4,029 rs145752042 16:267208 0.1861 0.0515 2.99E-04 0.8732 175 rs145546625 16:220583 -0.0446 0.0103 1.39E-05 0.932 2,981 rs115415087 16:205132 0.1524 0.0242 3.21E-10 0.1171 482 rs61743947 16:240000 -0.0911 0.0597 1.27E-01 1 59 rs60125383 16:176446 -0.0122 0.0032 1.31E-04 0.3929 19,381 rs55932218 16:221151 0.044 0.0143 2.09E-03 0.2602 1,498 rs8051004 16:198835 0.0153 0.008 5.71E-02 0.8478 4,395 rs1800562 6:26093141 0.0032 0.0077 6.73E-01 0.7533 3,759 rs11964516 6:41860252 -0.0188 0.0038 9.86E-07 0.4553 11,271 rs12664956 6:135384188 -0.0121 0.0032 1.32E-04 0.2538 17,397 rs607203 6:139841653 0.0276 0.0055 5.54E-07 0.5485 7,993 rs10901252 9:136128000 -0.0314 0.0052 1.16E-09 0.5296 6,950

154 Table 22. RDW univariate results for combined-phenotype lead SNPs in combined multi- ethnic study population. Het p- rsid Chr:Pos Beta SE p-value value effN rs61523591 2:46372781 0.0022 0.0105 8.34E-01 0.3971 8,967 rs2361016 3:24341268 -0.0241 0.009 7.60E-03 0.9458 12,248 rs73210009 3:195830276 0.0417 0.0081 2.96E-07 0.08424 15,032 rs218265 4:55408999 -0.0152 0.0092 9.99E-02 0.6041 11,851 rs4535497 5:1107428 0.0631 0.0079 1.53E-15 0.2214 15,915 rs3749748 5:127350549 0.1218 0.0101 1.80E-33 0.03663 10,187 rs2032451 6:26092170 0.1245 0.012 3.40E-25 0.4457 6,998 rs1800562 6:26093141 0.1666 0.0215 1.04E-14 0.5427 2,824 rs11964516 6:41860252 -0.0402 0.0102 7.72E-05 0.5712 9,260 rs13203155 6:109614844 0.0358 0.008 7.52E-06 0.933 15,760 rs35786788 6:135419042 0.0827 0.01 1.57E-16 0.04951 10,450 rs590856 6:139844429 -0.0021 0.0077 7.90E-01 0.07623 16,523 rs12718598 7:50428445 -0.0077 0.0077 3.17E-01 0.4873 16,203 rs28832309 7:80319938 -0.2072 0.0305 1.12E-11 0.1689 1,825 rs7385804 7:100235970 0.0305 0.0082 1.86E-04 0.8961 14,135 rs10253736 7:151415256 0.0035 0.0094 7.12E-01 0.1159 11,222 rs7853365 9:4855858 0.0291 0.0098 3.14E-03 0.1242 9,994 rs2519093 9:136141870 -0.0178 0.0104 8.55E-02 0.459 9,234 rs34793991 10:46080590 0.0094 0.0183 6.06E-01 0.9347 3,107 rs17476364 10:71094504 0.0061 0.0155 6.95E-01 0.4403 4,074 rs33930165 11:5248233 -0.5348 0.095 1.79E-08 1 140 rs28456 11:61589481 -0.0288 0.0085 6.98E-04 0.9971 14,730 rs12320749 12:4331647 -0.006 0.0091 5.09E-01 0.1015 11,390 rs597808 12:111973358 0.0013 0.0087 8.78E-01 0.9252 15,008 rs8013143 14:23494277 0.0923 0.0086 1.17E-26 0.9374 15,251 rs11621325 14:65472241 -0.0157 0.0079 4.61E-02 0.9749 15,630 rs4887023 15:78535437 -0.0573 0.0079 5.45E-13 0.4464 15,406 rs2608604 16:88849421 0.0265 0.0079 8.08E-04 0.1786 15,227 rs4890634 18:43854259 -0.0625 0.0084 7.90E-14 0.1874 14,176 rs71176524 19:2177925 -0.0144 0.0077 6.28E-02 0.2381 16,062 rs919797 19:4498157 -0.0012 0.0079 8.78E-01 0.971 15,605 rs1799918 19:13002400 0.0158 0.0081 5.06E-02 0.141 15,011 rs958483 19:33759240 -0.0116 0.0167 4.87E-01 0.5987 3,164 rs6014993 20:55991637 -0.0101 0.008 2.07E-01 0.7561 15,074 rs4820079 22:32879617 -0.0213 0.0076 5.10E-03 0.9492 16,680 rs855791 22:37462936 -0.104 0.008 1.70E-38 0.7104 15,400

155 rs140523 22:50962782 -0.0417 0.0084 6.21E-07 0.3139 9,648 rs9924561 16:314780 -0.2675 0.0293 7.79E-20 0.0001387 1,875 rs186066503 16:405483 -0.427 0.092 3.50E-06 0.5336 203 rs530159671 16:250184 -0.2442 0.0662 2.23E-04 0.01148 261 rs8058016 16:228786 -0.2406 0.0907 8.03E-03 1 161 rs60616598 16:297264 -0.1366 0.0229 2.30E-09 0.4977 3,466 rs145752042 16:267208 0.3255 0.1609 4.31E-02 1 50 rs145546625 16:220583 -0.0696 0.0253 5.95E-03 0.1456 2,158 rs115415087 16:205132 0.0922 0.0672 1.70E-01 0.9596 345 rs61743947 16:240000 -0.7066 0.1497 2.34E-06 1 55 rs60125383 16:176446 0.001 0.0086 9.06E-01 0.8255 15,682 rs55932218 16:221151 0.1604 0.0386 3.23E-05 0.3911 1,162 rs8051004 16:198835 0.0452 0.0214 3.50E-02 0.6202 3,255 rs1800562 6:26093141 0.1666 0.0215 1.04E-14 0.5427 2,824 rs11964516 6:41860252 -0.0402 0.0102 7.72E-05 0.5712 9,260 rs12664956 6:135384188 -0.0297 0.0085 4.80E-04 0.2719 13,874 rs607203 6:139841653 0.0259 0.0148 7.97E-02 0.002358 6,048 rs10901252 9:136128000 0.0264 0.0137 5.49E-02 0.9157 5,669

156 Table 23. HCT univariate results for combined-phenotype lead SNPs in African Americans.

Het p- rsid Chr:Pos Beta SE p-value value effN rs61523591 2:46372781 -0.1999 0.0445 6.89E-06 7.69E-01 5264 rs73210009 3:195830276 -0.0621 0.0453 1.70E-01 9.27E-01 5194 rs2361016 3:24341268 0.0033 0.0467 9.44E-01 5.24E-02 4796 rs218265 4:55408999 -0.0742 0.043 8.48E-02 2.74E-01 5558 rs4535497 5:1107428 -0.013 0.0387 7.37E-01 7.90E-01 6689 rs3749748 5:127350549 -0.087 0.0811 2.84E-01 1.80E-02 1705 rs2032451 6:26092170 -0.3344 0.099 7.28E-04 7.27E-01 1112 rs1800562 6:26093141 -0.6347 0.165 1.20E-04 2.88E-02 419 rs11964516 6:41860252 -0.0688 0.0512 1.79E-01 1.94E-01 4023 rs13203155 6:109614844 -0.0482 0.037 1.93E-01 8.70E-01 7646 rs35786788 6:135419042 0.1909 0.0695 6.03E-03 8.58E-01 2194 rs590856 6:139844429 0.0878 0.038 2.10E-02 5.78E-01 7376 rs7385804 7:100235970 0.1122 0.0402 5.29E-03 1.85E-02 6546 rs10253736 7:151415256 0.0612 0.054 2.57E-01 7.73E-01 3542 rs12718598 7:50428445 -0.0544 0.0372 1.44E-01 3.97E-01 7672 rs28832309 7:80319938 0.0352 0.0686 6.08E-01 3.85E-01 2274 rs2519093 9:136141870 0.2275 0.0578 8.32E-05 7.61E-01 3085 rs7853365 9:4855858 3.00E-04 0.052 9.96E-01 4.19E-01 3856 rs34793991 10:46080590 0.1782 0.1223 1.45E-01 7.45E-01 695 rs17476364 10:71094504 0.3177 0.1431 2.64E-02 3.69E-01 522 rs33930165 11:5248233 0.5522 0.2055 7.21E-03 4.98E-01 262 rs28456 11:61589481 0.0632 0.051 2.15E-01 9.77E-01 4189 rs12320749 12:4331647 -0.0397 0.0437 3.64E-01 1.44E-01 5657 rs597808 12:111973358 -0.0554 0.0653 3.96E-01 8.65E-01 2722 rs8013143 14:23494277 -0.025 0.0442 5.71E-01 9.55E-01 5710 rs11621325 14:65472241 0.011 0.0381 7.73E-01 7.81E-01 7211 rs4887023 15:78535437 -0.0148 0.0369 6.88E-01 4.81E-01 7651 rs60125383 16:176446 0.0403 0.0394 3.06E-01 4.37E-01 7525 rs8051004 16:198835 0.0338 0.0686 6.23E-01 4.15E-01 2717 rs115415087 16:205132 -0.0087 0.1485 9.54E-01 9.49E-01 514 rs145546625 16:220583 -0.353 0.2699 1.91E-01 1.00E+00 145 rs55932218 16:221151 -0.0088 0.0896 9.22E-01 5.25E-02 1578 rs8058016 16:228786 0.5558 0.2031 6.20E-03 1.81E-01 272 rs61743947 16:240000 0.2996 0.381 4.32E-01 1.00E+00 65 rs145752042 16:267208 -1.0134 0.3114 1.14E-03 4.63E-02 115 rs60616598 16:297264 0.0968 0.0532 6.88E-02 4.17E-01 4278

157 rs9924561 16:314780 0.1548 0.0623 1.30E-02 2.47E-01 2674 rs2608604 16:88849421 -0.05 0.0391 2.02E-01 8.73E-01 6783 rs4890634 18:43854259 0.1136 0.0371 2.19E-03 3.14E-01 7614 rs71176524 19:2177925 -0.0619 0.037 9.48E-02 9.06E-02 7615 rs919797 19:4498157 0.05 0.0414 2.26E-01 9.30E-01 6162 rs1799918 19:13002400 0.0735 0.0445 9.85E-02 8.21E-01 5409 rs958483 19:33759240 0.1051 0.1657 5.26E-01 2.46E-01 380 rs6014993 20:55991637 -0.0698 0.0414 9.16E-02 9.16E-01 6082 rs4820079 22:32879617 0.0482 0.0364 1.85E-01 9.36E-01 7933 rs855791 22:37462936 0.1654 0.05 9.28E-04 2.90E-01 4297 rs140523 22:50962782 0.0111 0.0417 7.91E-01 8.73E-01 5213

158 Table 24. HGB univariate results for combined-phenotype lead SNPs in African Americans.

rsid Chr:Pos Beta SE p-value Het p-value effN rs61523591 2:46372781 -0.0626 0.0149 2.53E-05 4.88E-01 5265 rs73210009 3:195830276 -0.0376 0.0151 1.29E-02 8.13E-01 5193 rs2361016 3:24341268 0.0014 0.0156 9.26E-01 9.90E-02 4799 rs218265 4:55408999 -0.0145 0.0144 3.14E-01 5.84E-01 5557 rs4535497 5:1107428 -0.0051 0.0129 6.93E-01 2.79E-01 6689 rs3749748 5:127350549 -0.0202 0.0271 4.55E-01 6.94E-02 1708 rs2032451 6:26092170 -0.139 0.033 2.52E-05 7.40E-01 1118 rs1800562 6:26093141 -0.2436 0.055 9.49E-06 1.31E-02 424 rs11964516 6:41860252 0.0014 0.0171 9.34E-01 2.90E-01 4021 rs13203155 6:109614844 -0.0225 0.0124 6.92E-02 9.30E-01 7648 rs35786788 6:135419042 0.0371 0.0232 1.10E-01 8.13E-01 2199 rs590856 6:139844429 0.0257 0.0127 4.35E-02 8.48E-01 7380 rs7385804 7:100235970 0.0312 0.0135 2.03E-02 6.93E-03 6554 rs10253736 7:151415256 0.0172 0.018 3.39E-01 7.18E-01 3545 rs12718598 7:50428445 -0.0167 0.0125 1.81E-01 3.00E-01 7671 rs28832309 7:80319938 0.0388 0.023 9.11E-02 5.44E-01 2270 rs2519093 9:136141870 0.0724 0.0193 1.78E-04 3.85E-01 3087 rs7853365 9:4855858 -0.0088 0.0174 6.11E-01 4.08E-01 3857 rs34793991 10:46080590 0.0874 0.0408 3.21E-02 3.15E-01 695 rs17476364 10:71094504 0.0918 0.0478 5.48E-02 4.75E-01 522 rs33930165 11:5248233 -0.0594 0.0689 3.88E-01 3.34E-01 260 rs28456 11:61589481 0.0247 0.017 1.47E-01 9.78E-01 4197 rs12320749 12:4331647 -0.0022 0.0146 8.80E-01 1.83E-01 5662 rs597808 12:111973358 -0.0379 0.0217 8.12E-02 9.18E-01 2732 rs8013143 14:23494277 -0.0205 0.0148 1.64E-01 9.57E-01 5718 rs11621325 14:65472241 -0.0027 0.0127 8.30E-01 9.03E-01 7212 rs4887023 15:78535437 0.0027 0.0123 8.29E-01 7.80E-01 7650 rs60125383 16:176446 0.0274 0.0131 3.70E-02 8.46E-02 7528 rs8051004 16:198835 -0.018 0.0229 4.32E-01 4.42E-01 2715 rs115415087 16:205132 -0.0613 0.0494 2.14E-01 8.83E-01 514 rs145546625 16:220583 -0.1382 0.0885 1.18E-01 1.00E+00 146 rs55932218 16:221151 -0.0431 0.0298 1.48E-01 9.78E-02 1578 rs8058016 16:228786 0.2657 0.0683 1.00E-04 4.06E-01 267 rs61743947 16:240000 0.4404 0.1236 3.69E-04 1.00E+00 65 rs145752042 16:267208 -0.4828 0.1032 2.91E-06 2.78E-01 115 rs60616598 16:297264 0.0607 0.0178 6.58E-04 1.46E-01 4267

159 rs9924561 16:314780 0.1814 0.0208 2.60E-18 5.80E-04 2667 rs2608604 16:88849421 -0.0224 0.0131 8.73E-02 6.75E-01 6788 rs4890634 18:43854259 0.0322 0.0124 9.42E-03 3.21E-01 7615 rs71176524 19:2177925 -0.0203 0.0124 1.02E-01 4.82E-02 7615 rs919797 19:4498157 0.0095 0.0138 4.91E-01 7.64E-01 6159 rs1799918 19:13002400 0.0213 0.0149 1.52E-01 8.45E-01 5413 rs958483 19:33759240 0.0276 0.0552 6.17E-01 6.00E-01 380 rs6014993 20:55991637 -0.0161 0.0138 2.44E-01 9.56E-01 6083 rs4820079 22:32879617 0.0182 0.0122 1.35E-01 7.92E-01 7934 rs855791 22:37462936 0.0755 0.0167 6.00E-06 3.44E-01 4300 rs140523 22:50962782 0.0073 0.0139 5.99E-01 9.79E-01 5216

160 Table 25. MCH univariate results for combined-phenotype lead SNPs in African Americans. rsid Chr:Pos Beta SE p-value Het p-value effN rs61523591 2:46372781 0.0566 0.0464 2.23E-01 3.78E-01 2888 rs73210009 3:195830276 -0.1646 0.0481 6.25E-04 3.03E-01 2767 rs2361016 3:24341268 0.0572 0.0494 2.47E-01 5.77E-01 2580 rs218265 4:55408999 0.1789 0.0448 6.57E-05 2.46E-01 3031 rs4535497 5:1107428 -0.0468 0.0409 2.52E-01 3.01E-01 3611 rs3749748 5:127350549 0.0378 0.0883 6.69E-01 3.97E-03 869 rs2032451 6:26092170 -0.1077 0.1086 3.21E-01 6.50E-01 560 rs1800562 6:26093141 -0.5163 0.1995 9.66E-03 2.94E-01 159 rs11964516 6:41860252 0.1168 0.0534 2.87E-02 3.29E-02 2207 rs13203155 6:109614844 -0.0504 0.0392 1.99E-01 4.11E-01 4132 rs35786788 6:135419042 -0.2935 0.0749 8.95E-05 5.29E-01 1144 rs590856 6:139844429 0.1514 0.0401 1.58E-04 8.70E-01 3943 rs7385804 7:100235970 -0.1044 0.0424 1.38E-02 3.30E-01 3462 rs10253736 7:151415256 0.0079 0.0576 8.91E-01 1.53E-01 1896 rs12718598 7:50428445 0.0801 0.0387 3.84E-02 5.90E-02 4147 rs28832309 7:80319938 0.039 0.0702 5.78E-01 5.69E-01 1297 rs2519093 9:136141870 -0.04 0.0614 5.15E-01 7.60E-01 1639 rs7853365 9:4855858 -0.1307 0.0549 1.71E-02 8.50E-01 2079 rs34793991 10:46080590 0.1633 0.1369 2.33E-01 3.03E-02 334 rs17476364 10:71094504 0.1551 0.1614 3.36E-01 4.48E-01 254 rs33930165 11:5248233 1.2985 0.2005 9.32E-11 2.59E-01 164 rs28456 11:61589481 -0.1014 0.0543 6.18E-02 8.06E-01 2113 rs12320749 12:4331647 0.1023 0.0452 2.38E-02 3.81E-01 3007 rs597808 12:111973358 -0.1152 0.0712 1.06E-01 4.90E-01 1413 rs8013143 14:23494277 -0.0835 0.047 7.58E-02 2.35E-01 3051 rs11621325 14:65472241 0.0679 0.0399 8.89E-02 2.78E-01 3868 rs4887023 15:78535437 0.0255 0.039 5.14E-01 6.03E-01 4141 rs60125383 16:176446 0.1999 0.0388 2.55E-07 1.90E-01 4031 rs8051004 16:198835 -0.1828 0.0676 6.84E-03 9.09E-01 1486 rs115415087 16:205132 -0.8952 0.1455 7.62E-10 7.45E-02 289 rs145546625 16:220583 -0.3128 0.3939 4.27E-01 1.00E+00 35 rs55932218 16:221151 -0.5021 0.0893 1.88E-08 9.42E-02 863 rs8058016 16:228786 1.8882 0.264 8.50E-13 3.07E-02 95 rs61743947 16:240000 ------rs145752042 16:267208 -2.2974 0.4062 1.55E-08 1.00E+00 39 rs60616598 16:297264 0.4365 0.054 6.47E-16 3.37E-01 2396

161 rs9924561 16:314780 1.4824 0.0624 7.95E-125 3.29E-11 1487 rs2608604 16:88849421 0.0413 0.0413 3.17E-01 6.49E-01 3630 rs4890634 18:43854259 0.083 0.0389 3.27E-02 2.43E-02 4141 rs71176524 19:2177925 0.0774 0.039 4.73E-02 3.64E-01 4110 rs919797 19:4498157 -0.0589 0.0433 1.74E-01 8.11E-01 3371 rs1799918 19:13002400 -0.1185 0.047 1.18E-02 6.00E-01 2911 rs958483 19:33759240 0.2607 0.203 1.99E-01 9.00E-01 169 rs6014993 20:55991637 0.101 0.044 2.16E-02 3.05E-01 3276 rs4820079 22:32879617 0.0812 0.0385 3.46E-02 8.76E-01 4270 rs855791 22:37462936 0.1912 0.0531 3.20E-04 5.96E-01 2266 rs140523 22:50962782 0.0221 0.0435 6.12E-01 7.38E-01 2827

162 Table 26. MCHC univariate results for combined-phenotype lead SNPs in African Americans. rsid Chr:Pos Beta SE p-value Het p-value effN rs61523591 2:46372781 0.0082 0.0145 5.71E-01 3.70E-01 5328 rs73210009 3:195830276 -0.0359 0.015 1.64E-02 9.74E-01 5151 rs2361016 3:24341268 0.0101 0.0154 5.11E-01 3.57E-01 4776 rs218265 4:55408999 0.0366 0.014 8.80E-03 7.02E-01 5614 rs4535497 5:1107428 0.0073 0.0128 5.69E-01 7.02E-02 6685 rs3749748 5:127350549 0.0479 0.0271 7.77E-02 3.79E-02 1647 rs2032451 6:26092170 -0.0664 0.0335 4.72E-02 6.37E-01 1056 rs1800562 6:26093141 -0.1359 0.056 1.51E-02 5.89E-01 403 rs11964516 6:41860252 0.0349 0.0168 3.79E-02 3.68E-01 4032 rs13203155 6:109614844 -0.0156 0.0122 2.01E-01 8.78E-01 7667 rs35786788 6:135419042 -0.0531 0.0233 2.28E-02 4.26E-01 2109 rs590856 6:139844429 -0.0056 0.0125 6.52E-01 3.36E-01 7368 rs7385804 7:100235970 -0.0191 0.0131 1.46E-01 8.50E-01 6484 rs10253736 7:151415256 -0.0016 0.0178 9.29E-01 8.37E-01 3510 rs12718598 7:50428445 0.0099 0.0121 4.13E-01 6.39E-01 7700 rs28832309 7:80319938 0.0419 0.0222 5.93E-02 1.74E-02 2334 rs2519093 9:136141870 -0.0041 0.019 8.29E-01 5.88E-01 3073 rs7853365 9:4855858 -0.0266 0.0171 1.21E-01 3.87E-01 3835 rs34793991 10:46080590 0.0725 0.0411 7.81E-02 5.73E-01 694 rs17476364 10:71094504 -0.0362 0.0488 4.58E-01 9.29E-01 502 rs33930165 11:5248233 -0.7731 0.0663 1.98E-31 3.78E-10 269 rs28456 11:61589481 0.0116 0.0169 4.94E-01 8.24E-01 4070 rs12320749 12:4331647 0.0262 0.0142 6.47E-02 8.51E-01 5628 rs597808 12:111973358 -0.0417 0.0222 5.99E-02 7.42E-01 2590 rs8013143 14:23494277 -0.0195 0.0146 1.84E-01 4.34E-01 5632 rs11621325 14:65472241 0.0029 0.0125 8.13E-01 5.09E-01 7227 rs4887023 15:78535437 0.0073 0.0121 5.45E-01 3.94E-01 7692 rs60125383 16:176446 0.0402 0.0128 1.66E-03 2.79E-02 7523 rs8051004 16:198835 -0.0723 0.0218 9.13E-04 4.32E-01 2774 rs115415087 16:205132 -0.158 0.0491 1.29E-03 1.32E-01 544 rs145546625 16:220583 -0.1102 0.1047 2.92E-01 1.00E+00 145 rs55932218 16:221151 -0.1049 0.0282 1.95E-04 3.56E-01 1687 rs8058016 16:228786 0.2459 0.0778 1.57E-03 5.11E-01 291 rs61743947 16:240000 0.8707 0.1429 1.10E-09 1.00E+00 65 rs145752042 16:267208 -0.3627 0.1127 1.28E-03 1.13E-01 116 rs60616598 16:297264 0.0843 0.0173 1.10E-06 1.70E-02 4408

163 rs9924561 16:314780 0.3569 0.0199 1.19E-71 4.76E-07 2731 rs2608604 16:88849421 -0.0222 0.0128 8.40E-02 9.70E-01 6763 rs4890634 18:43854259 -0.0059 0.0122 6.26E-01 5.04E-01 7653 rs71176524 19:2177925 0.0108 0.0122 3.73E-01 6.07E-01 7637 rs919797 19:4498157 -0.0127 0.0134 3.45E-01 3.18E-01 6225 rs1799918 19:13002400 -0.0237 0.0146 1.04E-01 4.29E-01 5409 rs958483 19:33759240 0.0191 0.0575 7.39E-01 7.17E-01 366 rs6014993 20:55991637 -4.0E-04 0.0136 9.79E-01 9.21E-01 6088 rs4820079 22:32879617 0.0091 0.012 4.46E-01 3.54E-01 7951 rs855791 22:37462936 0.0494 0.0166 2.85E-03 7.30E-01 4227 rs140523 22:50962782 0.0025 0.0137 8.56E-01 4.05E-01 5225

164

Table 27. MCV univariate results for combined-phenotype lead SNPs in African Americans. rsid Chr:Pos Beta SE p-value Het p-value effN rs61523591 2:46372781 0.1542 0.1224 2.08E-01 6.40E-01 2890 rs73210009 3:195830276 -0.403 0.1269 1.50E-03 2.66E-01 2750 rs2361016 3:24341268 0.1562 0.1299 2.29E-01 5.05E-01 2570 rs218265 4:55408999 0.4793 0.1178 4.70E-05 3.29E-01 3035 rs4535497 5:1107428 -0.1628 0.1077 1.31E-01 4.56E-01 3599 rs3749748 5:127350549 0.1053 0.2349 6.54E-01 5.40E-03 852 rs2032451 6:26092170 -0.2839 0.2891 3.26E-01 2.69E-01 548 rs1800562 6:26093141 -1.0343 0.5235 4.82E-02 3.58E-01 159 rs11964516 6:41860252 0.2896 0.1406 3.94E-02 5.27E-02 2207 rs13203155 6:109614844 -0.1597 0.1033 1.22E-01 4.98E-01 4124 rs35786788 6:135419042 -0.8157 0.199 4.15E-05 5.81E-01 1126 rs590856 6:139844429 0.462 0.1056 1.22E-05 7.72E-01 3930 rs7385804 7:100235970 -0.2341 0.1117 3.62E-02 2.97E-01 3438 rs10253736 7:151415256 0.0513 0.1523 7.36E-01 1.37E-01 1882 rs12718598 7:50428445 0.2487 0.1019 1.46E-02 1.44E-01 4140 rs28832309 7:80319938 0.0983 0.1838 5.93E-01 6.65E-01 1307 rs2519093 9:136141870 -0.0763 0.1625 6.39E-01 8.57E-01 1628 rs7853365 9:4855858 -0.3284 0.1449 2.34E-02 7.74E-01 2070 rs34793991 10:46080590 0.5288 0.361 1.43E-01 2.08E-02 328 rs17476364 10:71094504 0.7095 0.4279 9.73E-02 1.78E-01 245 rs33930165 11:5248233 5.4646 0.521 9.64E-26 1.36E-08 166 rs28456 11:61589481 -0.2596 0.144 7.13E-02 9.51E-01 2093 rs12320749 12:4331647 0.3063 0.1185 9.75E-03 3.07E-01 2996 rs597808 12:111973358 -0.3308 0.19 8.16E-02 4.17E-01 1372 rs8013143 14:23494277 -0.1989 0.1244 1.10E-01 5.82E-02 3020 rs11621325 14:65472241 0.1495 0.1051 1.55E-01 1.66E-01 3862 rs4887023 15:78535437 0.078 0.1024 4.46E-01 6.23E-01 4133 rs60125383 16:176446 0.5133 0.1023 5.19E-07 2.37E-01 4020 rs8051004 16:198835 -0.3683 0.1776 3.81E-02 7.18E-01 1498 rs115415087 16:205132 -2.6558 0.3868 6.61E-12 2.49E-01 289 rs145546625 16:220583 -0.7876 1.1431 4.91E-01 1.00E+00 35 rs55932218 16:221151 -1.2094 0.2346 2.54E-07 1.21E-01 872 rs8058016 16:228786 4.5317 0.6691 1.26E-11 4.19E-02 94 rs61743947 16:240000 ------rs145752042 16:267208 -5.3769 0.9871 5.12E-08 1.00E+00 38

165 rs60616598 16:297264 0.9493 0.1413 1.84E-11 5.32E-01 2409 rs9924561 16:314780 3.6686 0.1643 1.80E-110 1.27E-07 1493 rs2608604 16:88849421 0.2453 0.1088 2.42E-02 5.90E-01 3620 rs4890634 18:43854259 0.1892 0.1025 6.48E-02 8.32E-02 4135 rs71176524 19:2177925 0.1844 0.1027 7.26E-02 3.11E-01 4102 rs919797 19:4498157 -0.1647 0.1145 1.50E-01 8.15E-01 3365 rs1799918 19:13002400 -0.2717 0.1237 2.81E-02 3.95E-01 2902 rs958483 19:33759240 0.9266 0.5418 8.72E-02 9.45E-01 164 rs6014993 20:55991637 0.3728 0.116 1.31E-03 1.79E-01 3267 rs4820079 22:32879617 0.201 0.1015 4.76E-02 8.48E-01 4261 rs855791 22:37462936 0.427 0.1406 2.39E-03 7.96E-01 2246 rs140523 22:50962782 0.072 0.1138 5.27E-01 5.99E-01 2819

166 Table 28. RBCC univariate results for combined-phenotype lead SNPs in African Americans. Het p- rsid Chr:Pos Beta SE p-value value effN rs61523591 2:46372781 -0.0305 0.0088 5.41E-04 6.56E-01 2887 rs73210009 3:195830276 0.0131 0.0091 1.53E-01 4.09E-01 2751 rs2361016 3:24341268 -0.0148 0.0094 1.18E-01 9.73E-02 2567 rs218265 4:55408999 -0.0282 0.0085 9.15E-04 1.46E-01 3034 rs4535497 5:1107428 0.0072 0.0078 3.53E-01 9.49E-01 3603 rs3749748 5:127350549 -0.003 0.0169 8.57E-01 3.81E-02 854 rs2032451 6:26092170 -0.0325 0.0208 1.17E-01 5.41E-01 546 rs1800562 6:26093141 -0.0697 0.0383 6.90E-02 1.28E-01 160 rs11964516 6:41860252 -0.0109 0.0101 2.81E-01 3.86E-01 2212 rs13203155 6:109614844 0.0189 0.0074 1.11E-02 5.11E-01 4124 rs35786788 6:135419042 0.0578 0.0143 5.47E-05 9.98E-01 1129 rs590856 6:139844429 -0.0133 0.0076 8.07E-02 5.53E-01 3931 rs7385804 7:100235970 0.0183 0.0081 2.33E-02 2.55E-01 3434 rs10253736 7:151415256 0.0154 0.011 1.60E-01 4.85E-01 1878 rs12718598 7:50428445 -0.0154 0.0073 3.60E-02 2.54E-01 4140 rs28832309 7:80319938 -0.006 0.0132 6.52E-01 2.31E-01 1311 rs2519093 9:136141870 0.0343 0.0117 3.49E-03 6.73E-01 1627 rs7853365 9:4855858 0.0113 0.0104 2.79E-01 1.85E-01 2074 rs34793991 10:46080590 0.0051 0.0262 8.47E-01 8.38E-02 328 rs17476364 10:71094504 0.0147 0.0311 6.37E-01 6.26E-01 246 rs33930165 11:5248233 -0.1743 0.0384 5.79E-06 2.39E-01 168 rs28456 11:61589481 0.0217 0.0104 3.65E-02 9.80E-01 2102 rs12320749 12:4331647 -0.018 0.0085 3.47E-02 1.47E-01 2995 rs597808 12:111973358 0.004 0.0137 7.70E-01 4.74E-01 1365 rs8013143 14:23494277 0.0036 0.009 6.88E-01 9.42E-01 3023 rs11621325 14:65472241 4.0E-04 0.0076 9.57E-01 9.17E-01 3864 rs4887023 15:78535437 -0.0046 0.0074 5.35E-01 6.92E-01 4135 rs60125383 16:176446 -0.0183 0.0078 1.96E-02 2.31E-01 4023 rs8051004 16:198835 0.0037 0.0135 7.86E-01 7.66E-01 1495 rs115415087 16:205132 0.1389 0.0296 2.74E-06 2.48E-01 288 rs145546625 16:220583 -0.0964 0.081 2.34E-01 1.00E+00 35 rs55932218 16:221151 0.0499 0.018 5.51E-03 3.20E-02 867 rs8058016 16:228786 -0.1065 0.0513 3.78E-02 4.43E-03 94 rs61743947 16:240000 ------rs145752042 16:267208 0.1766 0.0788 2.51E-02 1.00E+00 39 rs60616598 16:297264 -0.0445 0.0107 3.06E-05 9.27E-01 2399

167 rs9924561 16:314780 -0.1622 0.0121 6.98E-41 4.42E-02 1490 rs2608604 16:88849421 -0.0093 0.0079 2.35E-01 3.66E-01 3615 rs4890634 18:43854259 0.0014 0.0074 8.50E-01 5.23E-01 4134 rs71176524 19:2177925 -0.0181 0.0074 1.44E-02 7.69E-01 4103 rs919797 19:4498157 0.01 0.0082 2.27E-01 9.73E-01 3381 rs1799918 19:13002400 0.0314 0.0089 4.27E-04 9.52E-01 2905 rs958483 19:33759240 -0.0345 0.0396 3.84E-01 3.33E-01 163 rs6014993 20:55991637 -0.0228 0.0083 6.21E-03 1.90E-01 3277 rs4820079 22:32879617 -0.0133 0.0073 6.84E-02 8.46E-01 4263 rs855791 22:37462936 -0.0163 0.0102 1.09E-01 3.04E-01 2248 rs140523 22:50962782 7.0E-04 0.0083 9.29E-01 7.46E-01 2822

168

Table 29. RDW univariate results for combined-phenotype lead SNPs in African Americans. Het p- rsid Chr:Pos Beta SE p-value value effN rs61523591 2:46372781 -0.0107 0.0257 6.78E-01 2.01E-01 2039 rs73210009 3:195830276 0.0175 0.0267 5.12E-01 1.58E-01 1983 rs2361016 3:24341268 -0.0211 0.0273 4.39E-01 7.00E-01 1818 rs218265 4:55408999 -0.0086 0.0246 7.27E-01 2.64E-01 2143 rs4535497 5:1107428 0.0042 0.0225 8.52E-01 4.33E-01 2710 rs3749748 5:127350549 0.1124 0.0481 1.93E-02 4.53E-02 641 rs2032451 6:26092170 0.0167 0.0585 7.75E-01 5.89E-01 416 rs1800562 6:26093141 0.3059 0.1278 1.66E-02 6.14E-01 94 rs11964516 6:41860252 -0.0731 0.0296 1.36E-02 2.08E-01 1540 rs13203155 6:109614844 0.011 0.0217 6.13E-01 9.15E-01 2908 rs35786788 6:135419042 0.1422 0.0408 4.88E-04 9.50E-01 827 rs590856 6:139844429 -0.0267 0.0223 2.31E-01 2.05E-01 2798 rs7385804 7:100235970 0.0206 0.0232 3.76E-01 2.56E-01 2506 rs10253736 7:151415256 0.0749 0.0317 1.82E-02 4.12E-01 1360 rs12718598 7:50428445 0.0031 0.0216 8.85E-01 2.22E-01 2917 rs28832309 7:80319938 -0.1954 0.0391 5.74E-07 1.91E-01 895 rs2519093 9:136141870 -0.0114 0.0339 7.36E-01 8.45E-01 1158 rs7853365 9:4855858 -0.0162 0.0304 5.94E-01 2.22E-01 1458 rs34793991 10:46080590 0.0579 0.0808 4.74E-01 8.20E-01 224 rs17476364 10:71094504 -0.1145 0.1092 2.94E-01 2.53E-01 148 rs33930165 11:5248233 -0.4732 0.1397 7.04E-04 4.24E-01 77 rs28456 11:61589481 -0.0397 0.0305 1.92E-01 2.61E-01 1489 rs12320749 12:4331647 -0.0124 0.0253 6.26E-01 1.12E-02 2109 rs597808 12:111973358 0.0602 0.039 1.23E-01 9.75E-01 1042 rs8013143 14:23494277 0.0911 0.0258 4.19E-04 8.63E-01 2216 rs11621325 14:65472241 -0.0079 0.022 7.21E-01 9.36E-01 2733 rs4887023 15:78535437 -0.0577 0.0216 7.55E-03 4.46E-01 2918 rs60125383 16:176446 -0.0508 0.0219 2.03E-02 4.43E-01 2817 rs8051004 16:198835 0.0237 0.0373 5.24E-01 3.85E-01 1006 rs115415087 16:205132 0.0301 0.1029 7.70E-01 6.01E-01 135 rs145546625 16:220583 0.2264 0.1798 0.208 1 31.7 rs55932218 16:221151 0.1392 0.0481 3.79E-03 3.01E-01 597 rs8058016 16:228786 -0.0124 0.0253 6.26E-01 1.12E-02 2109 rs61743947 16:240000 ------rs145752042 16:267208 ------

169 rs60616598 16:297264 0.0911 0.0258 4.19E-04 8.63E-01 2216 rs9924561 16:314780 -0.216 0.0367 4.07E-09 1.83E-03 1006 rs2608604 16:88849421 0.0081 0.0226 7.18E-01 2.40E-01 2638 rs4890634 18:43854259 -0.071 0.0216 9.84E-04 8.93E-02 2912 rs71176524 19:2177925 -0.0229 0.0217 2.92E-01 9.35E-01 2899 rs919797 19:4498157 0.0332 0.024 1.66E-01 5.54E-01 2394 rs1799918 19:13002400 -0.0114 0.026 6.62E-01 1.19E-01 2072 rs958483 19:33759240 0.016 0.1183 8.93E-01 8.22E-01 115 rs6014993 20:55991637 -0.0369 0.0244 1.30E-01 9.38E-01 2307 rs4820079 22:32879617 -0.0033 0.0214 8.76E-01 7.16E-01 3008 rs855791 22:37462936 -0.0985 0.0292 7.57E-04 3.29E-01 1648 rs140523 22:50962782 -0.0491 0.0237 3.83E-02 4.31E-01 2167

170 Table 30. HCT univariate results for combined-phenotype lead SNPs in Hispanics/Latinos. Het p- rsid Chr:Pos Beta SE p-value value effN rs61523591 2:46372781 -0.2471 0.0457 6.54E-08 9.42E-02 5141 rs73210009 3:195830276 -0.085 0.0348 1.46E-02 5.09E-01 9127 rs2361016 3:24341268 0.0331 0.0397 4.04E-01 6.42E-01 7212 rs218265 4:55408999 -0.0583 0.0364 1.09E-01 9.94E-01 8845 rs4535497 5:1107428 0.0065 0.0339 8.48E-01 7.87E-01 9409 rs3749748 5:127350549 -0.0581 0.0441 1.88E-01 1.14E-01 5611 rs2032451 6:26092170 -0.2331 0.0507 4.19E-06 8.49E-01 4314 rs1800562 6:26093141 -0.1798 0.123 1.44E-01 7.97E-01 772 rs11964516 6:41860252 0.0959 0.0434 2.73E-02 2.05E-01 5706 rs13203155 6:109614844 0.0989 0.0357 5.66E-03 7.17E-01 8903 rs35786788 6:135419042 0.2122 0.0455 3.04E-06 1.94E-01 5315 rs590856 6:139844429 0.1066 0.0331 1.29E-03 8.45E-01 9990 rs12718598 7:50428445 -0.0186 0.033 5.72E-01 9.75E-01 10072 rs28832309 7:80319938 0.154 0.1563 3.25E-01 3.71E-01 534 rs7385804 7:100235970 0.1541 0.0352 1.20E-05 3.84E-01 8629 rs10253736 7:151415256 0.2062 0.0411 5.30E-07 6.44E-02 6385 rs7853365 9:4855858 0.0185 0.0418 6.59E-01 3.81E-01 6226 rs2519093 9:136141870 0.2441 0.0455 8.27E-08 5.44E-01 5290 rs34793991 10:46080590 -0.0931 0.0829 2.61E-01 3.81E-01 1573 rs17476364 10:71094504 0.3816 0.0658 6.52E-09 7.65E-01 2482 rs33930165 11:5248233 1.7854 0.482 2.12E-04 1.00E+00 45 rs28456 11:61589481 0.1288 0.0347 2.03E-04 2.96E-01 10257 rs12320749 12:4331647 -0.051 0.0386 1.87E-01 6.24E-02 7255 rs597808 12:111973358 -0.0812 0.0368 2.72E-02 1.09E-01 8386 rs8013143 14:23494277 0.031 0.037 4.02E-01 8.93E-01 8645 rs11621325 14:65472241 0.0277 0.0335 4.09E-01 6.60E-01 9500 rs4887023 15:78535437 -0.0645 0.0345 6.14E-02 6.70E-01 9217 rs60125383 16:176446 0.018 0.0367 6.24E-01 7.27E-01 9258 rs115415087 16:205132 -0.0209 0.4044 9.59E-01 4.26E-01 109 rs145546625 16:220583 0.1657 0.069 1.64E-02 7.88E-03 2600 rs55932218 16:221151 0.0619 0.1991 7.56E-01 5.96E-01 394 rs8058016 16:228786 0.1547 0.408 7.05E-01 3.56E-01 110 rs530159671 16:250184 0.4649 0.1949 1.71E-02 1.76E-01 282 rs60616598 16:297264 0.0544 0.1044 6.02E-01 3.33E-01 1411 rs9924561 16:314780 0.2802 0.155 7.06E-02 7.43E-01 530 rs2608604 16:88849421 -0.0253 0.033 4.43E-01 2.25E-01 10168

171 rs4890634 18:43854259 0.0297 0.0349 3.94E-01 5.76E-02 8832 rs71176524 19:2177925 -0.1495 0.0332 6.63E-06 5.59E-01 9885 rs919797 19:4498157 -0.0803 0.0326 1.37E-02 4.58E-01 10045 rs1799918 19:13002400 0.0323 0.0337 3.38E-01 5.89E-01 9710 rs958483 19:33759240 -0.1993 0.0637 1.75E-03 9.16E-01 2566 rs6014993 20:55991637 0.039 0.0342 2.54E-01 1.26E-01 9042 rs4820079 22:32879617 0.056 0.0323 8.27E-02 1.92E-01 10226 rs855791 22:37462936 0.2407 0.0331 3.68E-13 2.68E-01 10066 rs140523 22:50962782 -0.0181 0.0345 6.00E-01 2.36E-01 8203

172

Table 31. HGB univariate results for combined-phenotype lead SNPs in Hispanics/Latinos. Het p- rsid Chr:Pos Beta SE p-value value effN rs61523591 2:46372781 -0.0774 0.0159 1.20E-06 4.21E-02 5136 rs73210009 3:195830276 -0.0371 0.012 1.93E-03 2.07E-01 9129 rs2361016 3:24341268 0.023 0.0138 9.44E-02 7.19E-01 7212 rs218265 4:55408999 -0.0117 0.0124 3.45E-01 9.17E-01 8848 rs4535497 5:1107428 0.011 0.0116 3.46E-01 5.29E-01 9405 rs3749748 5:127350549 -0.0185 0.0153 2.26E-01 2.39E-01 5616 rs2032451 6:26092170 -0.1217 0.0175 3.32E-12 9.62E-01 4312 rs1800562 6:26093141 -0.0987 0.0414 1.72E-02 4.64E-01 771 rs11964516 6:41860252 0.0385 0.015 1.02E-02 3.03E-01 5711 rs13203155 6:109614844 0.0156 0.0123 2.06E-01 7.12E-01 8899 rs35786788 6:135419042 0.0515 0.0155 9.18E-04 1.47E-01 5318 rs590856 6:139844429 0.0329 0.0114 3.94E-03 7.57E-01 9981 rs12718598 7:50428445 -0.0101 0.0114 3.75E-01 9.97E-01 10066 rs28832309 7:80319938 0.1424 0.0544 8.86E-03 1.17E-01 530 rs7385804 7:100235970 0.0338 0.0121 5.20E-03 6.07E-01 8619 rs10253736 7:151415256 0.0786 0.0141 2.70E-08 9.27E-02 6381 rs7853365 9:4855858 0.009 0.0144 5.30E-01 4.23E-01 6225 rs2519093 9:136141870 0.0973 0.0157 6.56E-10 8.98E-01 5295 rs34793991 10:46080590 -0.0131 0.0285 6.46E-01 3.88E-01 1576 rs17476364 10:71094504 0.14 0.0227 6.45E-10 9.56E-01 2487 rs33930165 11:5248233 -0.1972 0.1689 2.43E-01 1.00E+00 47 rs28456 11:61589481 0.0504 0.012 2.64E-05 3.22E-01 10252 rs12320749 12:4331647 -0.03 0.0133 2.40E-02 3.68E-02 7246 rs597808 12:111973358 -0.0268 0.0127 3.50E-02 4.46E-02 8386 rs8013143 14:23494277 0.0061 0.0128 6.35E-01 7.38E-01 8623 rs11621325 14:65472241 0.0238 0.0116 3.97E-02 7.70E-01 9496 rs4887023 15:78535437 -0.0087 0.0119 4.66E-01 6.18E-01 9217 rs60125383 16:176446 0.021 0.0128 1.01E-01 5.49E-01 9250 rs115415087 16:205132 -0.1625 0.1506 2.81E-01 3.66E-01 112 rs145546625 16:220583 0.0715 0.024 2.89E-03 3.72E-02 2595 rs55932218 16:221151 -0.0432 0.0701 5.38E-01 6.15E-01 389 rs8058016 16:228786 0.2219 0.1474 1.32E-01 4.73E-01 115 rs530159671 16:250184 0.3677 0.0689 9.46E-08 4.90E-01 285 rs60616598 16:297264 0.1003 0.0378 7.88E-03 1.42E-01 1411 rs9924561 16:314780 0.3025 0.0533 1.41E-08 5.60E-01 526 rs2608604 16:88849421 -0.0504 0.0114 9.83E-06 2.09E-01 10162

173 rs4890634 18:43854259 0.0106 0.0121 3.83E-01 1.50E-01 8827 rs71176524 19:2177925 -0.0445 0.0114 9.04E-05 7.88E-01 9879 rs919797 19:4498157 -0.0342 0.0113 2.46E-03 3.10E-01 10039 rs1799918 19:13002400 0.0044 0.0116 7.02E-01 7.68E-01 9706 rs958483 19:33759240 -0.065 0.0219 2.98E-03 9.35E-01 2561 rs6014993 20:55991637 0.0233 0.0118 4.91E-02 6.94E-02 9043 rs4820079 22:32879617 0.0198 0.0112 7.60E-02 1.89E-01 10220 rs855791 22:37462936 0.12 0.0115 1.51E-25 2.73E-02 10061 rs140523 22:50962782 0.0025 0.0118 8.30E-01 3.62E-01 8200

174 Table 32. MCH univariate results for combined-phenotype lead SNPs in Hispanics/ Latinos. Het p- rsid Chr:Pos Beta SE p-value value effN rs61523591 2:46372781 -0.0647 0.0334 5.26E-02 9.62E-01 4340 rs73210009 3:195830276 -0.1151 0.0249 4.00E-06 3.61E-01 7572 rs2361016 3:24341268 0.1089 0.0285 1.31E-04 3.37E-01 6034 rs218265 4:55408999 0.2169 0.0262 1.23E-16 8.85E-01 7405 rs4535497 5:1107428 -0.0279 0.0246 2.56E-01 6.23E-01 7887 rs3749748 5:127350549 0.0059 0.032 8.53E-01 7.85E-01 4655 rs2032451 6:26092170 -0.3049 0.0366 8.41E-17 4.58E-01 3575 rs1800562 6:26093141 -0.3156 0.085 2.05E-04 3.08E-01 641 rs11964516 6:41860252 0.2021 0.0306 3.75E-11 2.71E-02 4773 rs13203155 6:109614844 -0.1263 0.0252 5.13E-07 6.45E-01 7454 rs35786788 6:135419042 -0.2348 0.0332 1.45E-12 1.88E-01 4359 rs590856 6:139844429 0.1649 0.024 6.20E-12 5.50E-02 8408 rs12718598 7:50428445 0.1351 0.0235 8.75E-09 3.66E-01 8451 rs28832309 7:80319938 0.1074 0.1066 3.14E-01 2.14E-01 467 rs7385804 7:100235970 -0.0994 0.0251 7.69E-05 3.13E-01 7279 rs10253736 7:151415256 0.0165 0.0295 5.75E-01 4.14E-01 5296 rs7853365 9:4855858 -0.1547 0.0302 3.15E-07 6.76E-01 5151 rs2519093 9:136141870 0.0273 0.0331 4.10E-01 2.00E-01 4349 rs34793991 10:46080590 0.2005 0.0615 1.11E-03 4.04E-01 1261 rs17476364 10:71094504 0.11 0.0472 1.98E-02 4.05E-01 2021 rs33930165 11:5248233 1.1857 0.3004 7.92E-05 1.00E+00 47 rs28456 11:61589481 -0.0255 0.0252 3.11E-01 4.57E-02 8592 rs12320749 12:4331647 0.0661 0.0275 1.61E-02 8.78E-01 6147 rs597808 12:111973358 0.0389 0.0267 1.45E-01 9.30E-01 6884 rs8013143 14:23494277 -0.0459 0.0261 7.80E-02 8.21E-01 7421 rs11621325 14:65472241 0.1174 0.0242 1.18E-06 3.05E-02 7961 rs4887023 15:78535437 0.0279 0.0242 2.48E-01 8.23E-01 7686 rs60125383 16:176446 0.0842 0.0258 1.10E-03 1.09E-01 7822 rs115415087 16:205132 -1.2073 0.3275 2.27E-04 5.86E-01 120 rs145546625 16:220583 4.31E-01 0.0481 3.18E-19 9.26E-01 2207 rs55932218 16:221151 -0.4085 0.1406 3.67E-03 2.86E-01 348 rs8058016 16:228786 1.6928 0.2823 2.03E-09 5.61E-01 125 rs530159671 16:250184 1.8413 0.1684 8.22E-28 9.88E-02 209 rs60616598 16:297264 5.02E-01 0.0795 2.73E-10 2.23E-01 1326 rs9924561 16:314780 1.6572 0.1138 4.70E-48 8.41E-01 469 rs2608604 16:88849421 -0.0424 0.0235 7.13E-02 6.85E-01 8533

175 rs4890634 18:43854259 0.0754 0.0252 2.75E-03 6.30E-01 7412 rs71176524 19:2177925 0.004 0.0237 8.65E-01 8.17E-01 8301 rs919797 19:4498157 -0.0792 0.0236 7.73E-04 5.32E-01 8435 rs1799918 19:13002400 -0.1041 0.0239 1.39E-05 4.82E-01 8115 rs958483 19:33759240 0.1217 0.0457 7.79E-03 9.29E-01 2163 rs6014993 20:55991637 0.124 0.0242 2.99E-07 6.91E-01 7566 rs4820079 22:32879617 0.0464 0.0233 4.65E-02 6.37E-01 8588 rs855791 22:37462936 0.334 0.0238 9.06E-45 6.24E-01 8433 rs140523 22:50962782 0.082 0.0245 8.21E-04 3.05E-01 6882

176

Table 33. MCHC univariate results for combined-phenotype lead SNPs in Hispanics/ Latinos. Het p- rsid Chr:Pos Beta SE p-value value effN rs61523591 2:46372781 0.0093 0.015 5.35E-01 7.63E-01 5313 rs73210009 3:195830276 -0.0288 0.0116 1.28E-02 5.27E-01 9077 rs2361016 3:24341268 0.0155 0.0129 2.30E-01 8.62E-01 7384 rs218265 4:55408999 0.0197 0.0122 1.08E-01 1.37E-01 8606 rs4535497 5:1107428 0.0211 0.0112 5.97E-02 6.70E-01 9442 rs3749748 5:127350549 -0.0106 0.0145 4.66E-01 1.14E-01 5615 rs2032451 6:26092170 -0.1154 0.0172 1.80E-11 9.13E-01 4194 rs1800562 6:26093141 -0.0907 0.0407 2.57E-02 5.20E-02 744 rs11964516 6:41860252 0.0091 0.0144 5.28E-01 2.51E-01 5732 rs13203155 6:109614844 -0.0524 0.0116 6.34E-06 6.44E-01 9074 rs35786788 6:135419042 -0.0548 0.015 2.53E-04 3.31E-01 5282 rs590856 6:139844429 -0.0132 0.0111 2.32E-01 2.73E-01 10066 rs12718598 7:50428445 0.0024 0.0109 8.28E-01 3.63E-01 10111 rs28832309 7:80319938 0.1232 0.0466 8.19E-03 3.19E-03 674 rs7385804 7:100235970 -0.032 0.0116 5.97E-03 2.55E-01 8704 rs10253736 7:151415256 0.0134 0.0136 3.25E-01 8.52E-01 6484 rs7853365 9:4855858 -0.0108 0.0138 4.35E-01 8.92E-01 6152 rs2519093 9:136141870 0.0097 0.0152 5.24E-01 1.61E-01 5235 rs34793991 10:46080590 0.0612 0.027 2.32E-02 8.24E-01 1582 rs17476364 10:71094504 0.024 0.022 2.74E-01 1.14E-01 2426 rs33930165 11:5248233 -1.9802 0.1917 5.06E-25 1.00E+00 47 rs28456 11:61589481 -8.00E-04 0.0116 9.47E-01 3.09E-02 10202 rs12320749 12:4331647 -0.0256 0.0127 4.30E-02 7.61E-01 7289 rs597808 12:111973358 -0.0058 0.0123 6.39E-01 5.78E-01 8397 rs8013143 14:23494277 -0.0129 0.0121 2.87E-01 2.24E-01 9050 rs11621325 14:65472241 0.0234 0.0111 3.56E-02 6.26E-01 9465 rs4887023 15:78535437 0.0219 0.0113 5.20E-02 5.02E-01 9209 rs60125383 16:176446 0.0274 0.0117 1.96E-02 1.43E-01 9402 rs115415087 16:205132 -0.3316 0.1199 5.70E-03 9.07E-01 147 rs145546625 16:220583 0.0508 0.0234 2.98E-02 6.94E-01 2414 rs55932218 16:221151 -0.1332 0.0587 2.32E-02 4.16E-01 507 rs8058016 16:228786 0.304 0.1113 6.30E-03 9.68E-01 164 rs530159671 16:250184 0.4165 0.0658 2.45E-10 4.51E-02 292 rs60616598 16:297264 1.67E-01 0.0314 1.02E-07 6.45E-01 1790 rs9924561 16:314780 0.4657 0.047 3.95E-23 2.79E-01 671 rs2608604 16:88849421 -0.0788 0.011 6.91E-13 6.95E-04 10115

177 rs4890634 18:43854259 0.0024 0.0117 8.35E-01 9.20E-02 8849 rs71176524 19:2177925 0.0123 0.0109 2.59E-01 9.64E-01 9907 rs919797 19:4498157 -0.0186 0.0109 8.85E-02 7.23E-01 10056 rs1799918 19:13002400 -0.0105 0.0113 3.51E-01 7.97E-01 9582 rs958483 19:33759240 0.0026 0.0213 9.03E-01 9.16E-01 2532 rs6014993 20:55991637 0.0025 0.0114 8.29E-01 2.85E-01 9049 rs4820079 22:32879617 0.0048 0.0107 6.52E-01 2.78E-01 10241 rs855791 22:37462936 0.0969 0.0109 8.10E-19 4.43E-02 10025 rs140523 22:50962782 0.0099 0.0118 3.99E-01 3.06E-01 8221

178 Table 34. MCV univariate results for combined-phenotype lead SNPs in Hispanics/ Latinos. Het p- rsid Chr:Pos Beta SE p-value value effN rs61523591 2:46372781 -0.1176 0.0914 1.98E-01 3.65E-01 4355 rs73210009 3:195830276 -0.266 0.0677 8.60E-05 8.66E-01 7560 rs2361016 3:24341268 0.249 0.0772 1.26E-03 2.64E-01 6046 rs218265 4:55408999 0.5696 0.0711 1.10E-15 5.80E-01 7378 rs4535497 5:1107428 -0.1723 0.0673 1.05E-02 9.03E-01 7884 rs3749748 5:127350549 0.0243 0.0861 7.78E-01 8.84E-01 4651 rs2032451 6:26092170 -0.6137 0.0986 4.92E-10 3.30E-01 3564 rs1800562 6:26093141 -0.6912 0.2359 3.39E-03 8.72E-01 641 rs11964516 6:41860252 0.5113 0.0848 1.65E-09 5.26E-02 4770 rs13203155 6:109614844 -0.2387 0.0688 5.22E-04 9.48E-01 7468 rs35786788 6:135419042 -0.5334 0.0904 3.68E-09 8.17E-02 4355 rs590856 6:139844429 0.5344 0.0648 1.70E-16 2.01E-02 8406 rs12718598 7:50428445 0.3519 0.0641 3.96E-08 4.29E-01 8447 rs28832309 7:80319938 -0.0531 0.2855 8.53E-01 6.42E-01 476 rs7385804 7:100235970 -0.1934 0.0682 4.54E-03 1.96E-01 7279 rs10253736 7:151415256 -0.0149 0.0803 8.52E-01 3.96E-01 5301 rs7853365 9:4855858 -0.4638 0.0819 1.52E-08 7.80E-01 5139 rs2519093 9:136141870 -0.0341 0.0909 7.08E-01 4.30E-01 4345 rs34793991 10:46080590 0.5509 0.1662 9.21E-04 4.04E-01 1262 rs17476364 10:71094504 0.2795 0.1283 2.94E-02 9.96E-01 2016 rs33930165 11:5248233 8.3913 0.8275 3.64E-24 1.00E+00 47 rs28456 11:61589481 -0.165 0.0684 1.58E-02 9.98E-02 8579 rs12320749 12:4331647 0.2975 0.0748 7.04E-05 7.74E-01 6144 rs597808 12:111973358 0.1141 0.0726 1.16E-01 9.90E-01 6882 rs8013143 14:23494277 -0.1229 0.0711 8.39E-02 5.63E-01 7448 rs11621325 14:65472241 0.2378 0.0654 2.76E-04 1.98E-01 7949 rs4887023 15:78535437 -0.0331 0.0661 6.16E-01 9.91E-01 7682 rs60125383 16:176446 0.1691 0.0713 1.78E-02 1.52E-01 7829 rs115415087 16:205132 -2.5876 0.8153 1.51E-03 5.34E-01 117 rs145546625 16:220583 1.15E+00 0.1308 1.15E-18 8.69E-01 2193 rs55932218 16:221151 -0.9119 0.3806 1.66E-02 2.09E-01 360 rs8058016 16:228786 4.2379 0.7427 1.16E-08 5.89E-01 128 rs530159671 16:250184 3.9345 0.4835 4.06E-16 7.06E-02 208 rs60616598 16:297264 1.08E+00 0.2002 6.78E-08 6.48E-01 1316 rs9924561 16:314780 3.6573 0.2989 1.99E-34 9.54E-01 475 rs2608604 16:88849421 0.1988 0.0642 1.95E-03 3.29E-01 8520

179 rs4890634 18:43854259 0.1496 0.0687 2.94E-02 7.53E-01 7406 rs71176524 19:2177925 -0.0408 0.0644 5.27E-01 6.91E-01 8295 rs919797 19:4498157 -0.1943 0.0634 2.18E-03 7.07E-01 8427 rs1799918 19:13002400 -0.3003 0.0665 6.37E-06 2.99E-01 8089 rs958483 19:33759240 0.2661 0.124 3.19E-02 9.95E-01 2152 rs6014993 20:55991637 0.3289 0.0661 6.41E-07 6.97E-01 7561 rs4820079 22:32879617 0.0965 0.0644 1.34E-01 2.38E-01 8580 rs855791 22:37462936 0.6533 0.0644 3.55E-24 7.11E-01 8420 rs140523 22:50962782 0.1582 0.068 2.01E-02 3.48E-01 6879

180 Table 35. RBCC univariate results for combined-phenotype lead SNPs in Hispanics/ Latinos. Het p- rsid Chr:Pos Beta SE p-value value effN rs61523591 2:46372781 -0.0166 0.0062 7.00E-03 1.85E-01 4335 rs73210009 3:195830276 9.00E-04 0.0046 8.49E-01 8.56E-01 7570 rs2361016 3:24341268 -0.0151 0.0053 4.32E-03 9.07E-01 6023 rs218265 4:55408999 -0.035 0.0048 4.24E-13 6.68E-01 7429 rs4535497 5:1107428 0.012 0.0045 7.32E-03 7.40E-01 7889 rs3749748 5:127350549 -0.0112 0.0058 5.48E-02 4.67E-01 4649 rs2032451 6:26092170 0.0086 0.0067 1.95E-01 3.14E-01 3578 rs1800562 6:26093141 0.0176 0.0163 2.80E-01 8.27E-01 642 rs11964516 6:41860252 -0.0175 0.0057 2.09E-03 3.90E-02 4765 rs13203155 6:109614844 0.0234 0.0048 9.96E-07 8.83E-01 7447 rs35786788 6:135419042 0.059 0.006 1.32E-22 8.44E-01 4357 rs590856 6:139844429 -0.0141 0.0044 1.45E-03 5.96E-01 8410 rs12718598 7:50428445 -0.0219 0.0044 8.26E-07 8.03E-01 8454 rs28832309 7:80319938 0.0148 0.0202 4.63E-01 4.26E-02 462 rs7385804 7:100235970 0.0271 0.0047 6.61E-09 2.36E-02 7287 rs10253736 7:151415256 0.0176 0.0054 1.22E-03 1.79E-01 5285 rs7853365 9:4855858 0.0223 0.0057 8.32E-05 3.84E-01 5150 rs2519093 9:136141870 0.0233 0.0061 1.31E-04 4.27E-01 4355 rs34793991 10:46080590 -0.0429 0.0113 1.40E-04 6.55E-01 1255 rs17476364 10:71094504 0.0291 0.0088 9.10E-04 5.52E-01 2014 rs33930165 11:5248233 -0.2773 0.0544 3.38E-07 1.00E+00 45 rs28456 11:61589481 0.0192 0.0046 3.09E-05 9.02E-01 8603 rs12320749 12:4331647 -0.0253 0.0051 8.90E-07 1.70E-01 6155 rs597808 12:111973358 -0.0168 0.005 7.98E-04 7.20E-02 6881 rs8013143 14:23494277 0.0072 0.0049 1.44E-01 9.39E-01 7420 rs11621325 14:65472241 -0.0106 0.0044 1.61E-02 2.36E-01 7970 rs4887023 15:78535437 -0.0046 0.0046 3.20E-01 8.74E-01 7688 rs60125383 16:176446 -0.0097 0.0047 3.92E-02 7.98E-01 7824 rs115415087 16:205132 0.1533 0.0471 1.13E-03 1.23E-01 112 rs145546625 16:220583 -3.97E-02 0.0092 1.48E-05 5.67E-02 2214 rs55932218 16:221151 0.0596 0.0262 2.27E-02 9.08E-01 345 rs8058016 16:228786 -0.205 0.046 8.49E-06 3.33E-02 107 rs530159671 16:250184 -0.2008 0.0321 3.98E-10 6.71E-01 206 rs60616598 16:297264 -5.34E-02 0.0125 1.98E-05 1.19E-01 1263 rs9924561 16:314780 -0.1622 0.021 9.93E-15 5.46E-01 461 rs2608604 16:88849421 -0.0125 0.0044 4.12E-03 7.28E-01 8543

181 rs4890634 18:43854259 -0.0055 0.0046 2.38E-01 1.36E-01 7421 rs71176524 19:2177925 -0.0175 0.0044 7.18E-05 6.98E-01 8307 rs919797 19:4498157 -0.0014 0.0044 7.47E-01 6.13E-01 8442 rs1799918 19:13002400 0.0182 0.0045 4.33E-05 5.97E-01 8128 rs958483 19:33759240 -0.0423 0.0084 5.40E-07 9.79E-01 2170 rs6014993 20:55991637 -0.009 0.0046 5.10E-02 3.13E-01 7566 rs4820079 22:32879617 -0.0013 0.0043 7.60E-01 1.96E-02 8593 rs855791 22:37462936 -0.0062 0.0044 1.64E-01 9.34E-01 8443 rs140523 22:50962782 -0.0088 0.0046 5.40E-02 7.30E-03 6882

182 Table 36. RDW univariate results for combined-phenotype lead SNPs in Hispanics/Latinos. Het p- rsid Chr:Pos Beta SE p-value value effN rs61523591 2:46372781 0.0029 0.015 8.48E-01 3.17E-01 4329 rs73210009 3:195830276 0.058 0.0114 3.34E-07 6.80E-01 7509 rs2361016 3:24341268 -0.0333 0.0129 9.70E-03 8.91E-01 6002 rs218265 4:55408999 -0.0156 0.0117 1.80E-01 5.74E-01 7348 rs4535497 5:1107428 0.0763 0.0112 8.49E-12 3.59E-01 7836 rs3749748 5:127350549 0.1173 0.0143 2.65E-16 2.29E-02 4623 rs2032451 6:26092170 0.1119 0.0165 1.20E-11 8.53E-02 3532 rs1800562 6:26093141 0.0725 0.0393 6.52E-02 5.22E-01 630 rs11964516 6:41860252 -0.0313 0.0141 2.60E-02 8.91E-02 4735 rs13203155 6:109614844 0.0379 0.0115 9.72E-04 2.68E-01 7414 rs35786788 6:135419042 0.0685 0.0149 4.08E-06 3.71E-02 4324 rs590856 6:139844429 0.0027 0.0108 8.01E-01 2.91E-01 8352 rs12718598 7:50428445 -0.0133 0.0106 2.10E-01 1.90E-01 8393 rs28832309 7:80319938 -0.2112 0.0471 7.17E-06 7.03E-01 467 rs7385804 7:100235970 0.034 0.0115 3.02E-03 6.35E-01 7240 rs10253736 7:151415256 0.0026 0.0134 8.44E-01 4.21E-01 5266 rs7853365 9:4855858 0.0374 0.0136 5.81E-03 6.24E-01 5104 rs2519093 9:136141870 -0.0252 0.0149 9.01E-02 4.63E-01 4314 rs34793991 10:46080590 -0.0065 0.0277 8.14E-01 7.75E-01 1252 rs17476364 10:71094504 0.0337 0.0213 1.14E-01 3.01E-01 2001 rs33930165 11:5248233 -0.5124 0.1381 2.07E-04 1.00E+00 45 rs28456 11:61589481 -0.0258 0.0113 2.17E-02 2.56E-01 8530 rs12320749 12:4331647 -0.0078 0.0124 5.28E-01 6.34E-02 6106 rs597808 12:111973358 0.0014 0.012 9.07E-01 2.30E-01 6827 rs8013143 14:23494277 0.0961 0.0118 3.03E-16 4.88E-01 7397 rs11621325 14:65472241 -0.0157 0.011 1.53E-01 3.54E-01 7903 rs4887023 15:78535437 -0.0531 0.011 1.44E-06 1.84E-01 7629 rs60125383 16:176446 -0.0025 0.0122 8.38E-01 4.32E-01 7769 rs115415087 16:205132 0.172 0.1247 1.68E-01 3.76E-01 116 rs145546625 16:220583 -0.0789 0.0226 4.83E-04 5.01E-01 2195 rs55932218 16:221151 0.1668 0.063 8.16E-03 7.52E-01 342 rs8058016 16:228786 -0.1555 0.0339 4.47E-06 9.09E-01 1317 rs530159671 16:250184 -0.3628 0.0719 4.59E-07 4.04E-01 205 rs60616598 16:297264 -0.1599 0.0338 2.17E-06 8.97E-01 1314 rs9924561 16:314780 -0.349 0.0473 1.65E-13 4.34E-01 454 rs2608604 16:88849421 0.0357 0.0107 7.92E-04 2.08E-01 8471

183 rs4890634 18:43854259 -0.0626 0.0113 2.97E-08 2.28E-01 7359 rs71176524 19:2177925 -0.0119 0.0106 2.64E-01 9.16E-01 8244 rs919797 19:4498157 -0.0081 0.0108 4.53E-01 4.05E-01 8378 rs1799918 19:13002400 0.0023 0.0109 8.33E-01 8.50E-01 8050 rs958483 19:33759240 0.0136 0.0201 5.00E-01 2.64E-01 2153 rs6014993 20:55991637 -0.0172 0.0111 1.23E-01 6.63E-01 7511 rs4820079 22:32879617 -0.022 0.0104 3.48E-02 2.44E-01 8528 rs855791 22:37462936 -0.0981 0.0107 6.95E-20 4.06E-01 8374 rs140523 22:50962782 -0.0329 0.0112 3.29E-03 7.82E-02 6827

184 Table 37. HCT univariate results for combined-phenotype lead SNPs in European Americans. Het p- rsid Chr:Pos Beta SE p-value value effN rs61523591 2:46372781 -0.2237 0.0334 2.1E-11 2.6E-01 7339 rs2361016 3:24341268 0.0191 0.0262 4.6E-01 4.4E-02 11798 rs73210009 3:195830276 -0.0254 0.0243 3.0E-01 7.5E-02 13552 rs218265 4:55408999 -0.1088 0.0337 1.2E-03 8.5E-01 7085 rs4535497 5:1107428 0.0074 0.0256 7.7E-01 3.4E-01 12414 rs3749748 5:127350549 0.0453 0.0275 9.9E-02 1.1E-01 10754 rs2032451 6:26092170 -0.2134 0.0328 7.4E-11 1.0E+00 7610 rs1800562 6:26093141 -0.3819 0.0491 7.4E-15 6.0E-02 3361 rs11964516 6:41860252 0.0513 0.0312 1.0E-01 1.7E-01 8289 rs13203155 6:109614844 0.0861 0.0237 2.8E-04 8.5E-02 14583 rs35786788 6:135419042 0.2012 0.027 9.4E-14 4.4E-01 11170 rs590856 6:139844429 0.0423 0.0238 7.5E-02 7.8E-01 14429 rs12718598 7:50428445 0.0038 0.0239 8.7E-01 9.5E-01 14121 rs7385804 7:100235970 0.1201 0.0249 1.4E-06 6.5E-01 13046 rs10253736 7:151415256 0.1593 0.0272 4.7E-09 6.3E-01 10738 rs7853365 9:4855858 -0.0061 0.0294 8.4E-01 6.4E-01 9469 rs2519093 9:136141870 0.1718 0.0301 1.1E-08 7.3E-01 8826 rs34793991 10:46080590 0.0056 0.0481 9.1E-01 5.0E-01 3517 rs17476364 10:71094504 0.361 0.0422 1.1E-17 1.6E-01 4433 rs28456 11:61589481 0.1126 0.0262 1.7E-05 2.4E-01 11738 rs12320749 12:4331647 -0.1008 0.0294 6.0E-04 9.6E-01 9247 rs597808 12:111973358 -0.1679 0.0241 3.1E-12 5.6E-01 14096 rs8013143 14:23494277 -0.0353 0.0264 1.8E-01 5.7E-01 11487 rs11621325 14:65472241 -0.0141 0.0241 5.6E-01 7.6E-01 14041 rs4887023 15:78535437 -0.0557 0.0244 2.3E-02 5.7E-01 13799 rs2608604 16:88849421 -0.0786 0.0262 2.7E-03 4.4E-01 11677 rs4890634 18:43854259 0.0408 0.0272 1.3E-01 2.5E-01 11039 rs71176524 19:2177925 -0.1057 0.0252 2.8E-05 1.3E-01 12760 rs919797 19:4498157 -0.033 0.0251 1.9E-01 3.9E-01 13216 rs1799918 19:13002400 0.0662 0.0251 8.4E-03 5.5E-01 12735 rs958483 19:33759240 -0.078 0.0545 1.5E-01 7.6E-01 2692 rs6014993 20:55991637 0.0336 0.0244 1.7E-01 1.0E+00 13367 rs4820079 22:32879617 0.0465 0.0235 4.8E-02 2.1E-01 14738 rs855791 22:37462936 0.2288 0.0243 4.7E-21 3.8E-01 13718 rs140523 22:50962782 0.0591 0.0324 6.8E-02 7.7E-01 6272

185 Table 38. HGB univariate results for combined-phenotype lead SNPs in European Americans. Het p- rsid Chr:Pos Beta SE p-value value effN rs61523591 2:46372781 -0.0781 0.0113 4.0E-12 1.4E-01 7333 rs2361016 3:24341268 0.0079 0.0088 3.7E-01 5.7E-02 11802 rs73210009 3:195830276 -0.0213 0.0082 9.4E-03 2.1E-01 13549 rs218265 4:55408999 -0.0301 0.0114 8.0E-03 8.5E-01 7088 rs4535497 5:1107428 0.0129 0.0087 1.4E-01 1.3E-01 12413 rs3749748 5:127350549 0.0114 0.0093 2.2E-01 2.4E-01 10752 rs2032451 6:26092170 -0.1002 0.011 1.2E-19 8.6E-01 7609 rs1800562 6:26093141 -0.1697 0.0166 1.2E-24 9.6E-02 3360 rs11964516 6:41860252 0.0159 0.0105 1.3E-01 1.7E-01 8284 rs13203155 6:109614844 0.015 0.008 6.0E-02 8.1E-02 14581 rs35786788 6:135419042 0.0499 0.0091 4.3E-08 5.4E-01 11168 rs590856 6:139844429 0.0157 0.008 5.0E-02 4.4E-01 14427 rs12718598 7:50428445 0.0062 0.0081 4.4E-01 8.6E-01 14118 rs7385804 7:100235970 0.0336 0.0084 6.5E-05 7.6E-01 13043 rs10253736 7:151415256 0.0562 0.0092 1.0E-09 6.7E-01 10736 rs7853365 9:4855858 -0.0048 0.0099 6.3E-01 5.3E-01 9460 rs2519093 9:136141870 0.0611 0.0102 1.9E-09 9.2E-01 8811 rs34793991 10:46080590 0.0117 0.0162 4.7E-01 2.8E-01 3511 rs17476364 10:71094504 0.126 0.0142 8.7E-19 1.8E-01 4433 rs28456 11:61589481 0.0399 0.0088 6.4E-06 1.0E-01 11741 rs12320749 12:4331647 -0.032 0.0099 1.3E-03 6.5E-01 9239 rs597808 12:111973358 -0.0621 0.0081 2.1E-14 3.6E-01 14093 rs8013143 14:23494277 -0.0201 0.0089 2.4E-02 7.4E-01 11493 rs11621325 14:65472241 0.0021 0.0081 8.0E-01 6.2E-01 14039 rs4887023 15:78535437 -0.0185 0.0082 2.5E-02 6.4E-01 13797 rs2608604 16:88849421 -0.0386 0.0089 1.3E-05 4.1E-01 11677 rs4890634 18:43854259 0.0202 0.0092 2.7E-02 4.1E-02 11037 rs71176524 19:2177925 -0.0325 0.0085 1.4E-04 9.6E-02 12758 rs919797 19:4498157 -0.0219 0.0085 9.9E-03 4.8E-01 13214 rs1799918 19:13002400 0.0179 0.0085 3.5E-02 6.4E-01 12727 rs958483 19:33759240 -0.0269 0.0184 1.4E-01 3.9E-01 2696 rs6014993 20:55991637 0.0161 0.0082 5.0E-02 9.6E-01 13364 rs4820079 22:32879617 0.0203 0.0079 1.1E-02 3.5E-01 14736 rs855791 22:37462936 0.0993 0.0082 8.5E-34 4.9E-01 13716 rs140523 22:50962782 0.0188 0.011 8.8E-02 6.9E-01 6269

186 Table 39. MCH univariate results for combined-phenotype lead SNPs in European Americans. Het p- rsid Chr:Pos Beta SE p-value value effN rs61523591 2:46372781 0.0375 0.0282 1.8E-01 2.2E-01 3691 rs2361016 3:24341268 0.1096 0.0223 8.7E-07 9.7E-01 5838 rs73210009 3:195830276 -0.0899 0.0209 1.7E-05 6.7E-01 6583 rs218265 4:55408999 0.1677 0.0285 4.0E-09 7.0E-01 3515 rs4535497 5:1107428 -0.0795 0.0216 2.3E-04 8.0E-02 6173 rs3749748 5:127350549 -0.0379 0.0234 1.1E-01 5.0E-01 5376 rs2032451 6:26092170 -0.2824 0.0276 1.4E-24 6.3E-03 3789 rs1800562 6:26093141 -0.4866 0.041 1.9E-32 6.5E-01 1685 rs11964516 6:41860252 0.1818 0.0266 8.3E-12 2.4E-01 4095 rs13203155 6:109614844 -0.1176 0.02 3.9E-09 4.1E-01 7272 rs35786788 6:135419042 -0.3501 0.0227 1.4E-53 4.8E-01 5567 rs590856 6:139844429 0.0799 0.0199 5.8E-05 8.2E-01 7236 rs12718598 7:50428445 0.1096 0.0205 9.1E-08 5.7E-02 6939 rs7385804 7:100235970 -0.1347 0.0214 3.0E-10 1.3E-01 6385 rs10253736 7:151415256 0.0334 0.0229 1.4E-01 9.7E-01 5360 rs7853365 9:4855858 -0.1314 0.0248 1.2E-07 2.9E-01 4752 rs2519093 9:136141870 0.0324 0.0258 2.1E-01 5.5E-01 4341 rs34793991 10:46080590 0.2159 0.0407 1.1E-07 5.3E-01 1753 rs17476364 10:71094504 0.1356 0.0362 1.8E-04 4.1E-01 2191 rs28456 11:61589481 -0.0025 0.0224 9.1E-01 7.1E-01 5782 rs12320749 12:4331647 0.0489 0.0249 4.9E-02 6.2E-01 4661 rs597808 12:111973358 -0.0585 0.0207 4.6E-03 1.6E-02 6937 rs8013143 14:23494277 -0.0232 0.0229 3.1E-01 5.1E-01 5562 rs11621325 14:65472241 0.0851 0.0203 2.8E-05 7.8E-01 7002 rs4887023 15:78535437 0.0441 0.0207 3.3E-02 4.9E-01 6875 rs2608604 16:88849421 -0.0077 0.0227 7.3E-01 9.6E-01 5737 rs4890634 18:43854259 0.1034 0.0229 6.3E-06 2.0E-01 5510 rs71176524 19:2177925 0.0022 0.0216 9.2E-01 1.8E-02 6335 rs919797 19:4498157 -0.1342 0.0221 1.2E-09 5.5E-01 6208 rs1799918 19:13002400 -0.1251 0.0217 7.7E-09 1.5E-01 6180 rs958483 19:33759240 0.1523 0.0472 1.3E-03 2.1E-01 1378 rs6014993 20:55991637 0.1063 0.0206 2.6E-07 9.8E-01 6770 rs4820079 22:32879617 0.1072 0.0198 6.4E-08 4.0E-01 7343 rs855791 22:37462936 0.2731 0.0205 1.6E-40 9.5E-01 6883 rs140523 22:50962782 0.1398 0.0236 3.3E-09 4.7E-01 3438

187

Table 40. MCHC univariate results for combined-phenotype lead SNPs in European Americans. Het p- rsid Chr:Pos Beta SE p-value value effN rs61523591 2:46372781 -0.0023 0.0108 8.3E-01 1.3E-01 7350 rs2361016 3:24341268 0.0059 0.0085 4.9E-01 3.4E-01 11779 rs73210009 3:195830276 -0.0219 0.0079 5.4E-03 6.7E-01 13535 rs218265 4:55408999 0.0147 0.0109 1.8E-01 8.5E-01 7088 rs4535497 5:1107428 0.0267 0.0083 1.3E-03 2.3E-01 12422 rs3749748 5:127350549 -0.019 0.0089 3.2E-02 4.5E-01 10738 rs2032451 6:26092170 -0.0798 0.0106 4.1E-14 9.1E-04 7601 rs1800562 6:26093141 -0.1247 0.0159 4.5E-15 7.8E-01 3348 rs11964516 6:41860252 0.0032 0.0101 7.5E-01 9.5E-01 8276 rs13203155 6:109614844 -0.0322 0.0076 2.5E-05 2.4E-01 14568 rs35786788 6:135419042 -0.0442 0.0087 4.3E-07 9.9E-01 11156 rs590856 6:139844429 -0.0038 0.0077 6.2E-01 6.3E-03 14422 rs12718598 7:50428445 0.0088 0.0077 2.6E-01 1.1E-01 14112 rs7385804 7:100235970 -0.0168 0.0081 3.7E-02 4.9E-01 13032 rs10253736 7:151415256 0.0083 0.0088 3.4E-01 9.2E-01 10731 rs7853365 9:4855858 0.0013 0.0095 8.9E-01 4.1E-01 9445 rs2519093 9:136141870 0.0205 0.0098 3.5E-02 6.7E-01 8820 rs34793991 10:46080590 0.0179 0.0156 2.5E-01 7.9E-01 3509 rs17476364 10:71094504 0.0127 0.0137 3.6E-01 5.3E-01 4405 rs28456 11:61589481 0.0082 0.0085 3.3E-01 9.7E-01 11701 rs12320749 12:4331647 0.0033 0.0095 7.3E-01 7.3E-01 9251 rs597808 12:111973358 -0.0109 0.0078 1.6E-01 4.1E-02 14087 rs8013143 14:23494277 -0.0139 0.0085 1.0E-01 7.1E-01 11496 rs11621325 14:65472241 0.0144 0.0078 6.5E-02 2.1E-01 14022 rs4887023 15:78535437 0.0028 0.0079 7.2E-01 2.2E-01 13793 rs2608604 16:88849421 -0.0436 0.0085 3.2E-07 6.3E-01 11654 rs4890634 18:43854259 0.0225 0.0088 1.0E-02 3.1E-01 11023 rs71176524 19:2177925 0.0068 0.0082 4.1E-01 7.3E-01 12750 rs919797 19:4498157 -0.0372 0.0082 5.2E-06 6.1E-01 13211 rs1799918 19:13002400 -0.0143 0.0082 8.0E-02 6.5E-01 12713 rs958483 19:33759240 0.0308 0.0178 8.5E-02 2.8E-01 2672 rs6014993 20:55991637 0.0024 0.0079 7.6E-01 6.0E-01 13358 rs4820079 22:32879617 0.0113 0.0076 1.4E-01 7.1E-01 14728 rs855791 22:37462936 0.0578 0.0079 1.9E-13 2.3E-01 13709 rs140523 22:50962782 0.0027 0.0103 7.9E-01 7.4E-01 6273

188 Table 41. MCV univariate results for combined-phenotype lead SNPs in European Americans. Het p- rsid Chr:Pos Beta SE p-value value effN rs61523591 2:46372781 0.1305 0.072 7.0E-02 3.2E-01 3696 rs2361016 3:24341268 0.2767 0.0569 1.2E-06 9.5E-01 5831 rs73210009 3:195830276 -0.2064 0.0531 1.0E-04 5.7E-01 6581 rs218265 4:55408999 0.4577 0.0728 3.1E-10 9.5E-01 3507 rs4535497 5:1107428 -0.3103 0.0547 1.4E-08 1.5E-02 6171 rs3749748 5:127350549 -0.0686 0.0597 2.5E-01 6.4E-01 5378 rs2032451 6:26092170 -0.6301 0.0706 4.2E-19 7.0E-02 3788 rs1800562 6:26093141 -0.9763 0.1049 1.3E-20 4.3E-01 1685 rs11964516 6:41860252 0.4834 0.0678 1.0E-12 6.8E-01 4099 rs13203155 6:109614844 -0.3076 0.051 1.7E-09 5.3E-01 7269 rs35786788 6:135419042 -0.8698 0.058 7.8E-51 5.1E-01 5568 rs590856 6:139844429 0.2761 0.0507 5.0E-08 8.3E-01 7234 rs12718598 7:50428445 0.3098 0.0523 3.2E-09 1.6E-02 6937 rs7385804 7:100235970 -0.3308 0.0545 1.3E-09 9.0E-02 6382 rs10253736 7:151415256 0.058 0.0582 3.2E-01 7.8E-01 5360 rs7853365 9:4855858 -0.332 0.0633 1.6E-07 3.8E-01 4756 rs2519093 9:136141870 0.0233 0.0656 7.2E-01 6.8E-01 4351 rs34793991 10:46080590 0.5839 0.1037 1.8E-08 6.7E-01 1755 rs17476364 10:71094504 0.3964 0.0922 1.7E-05 7.2E-01 2190 rs28456 11:61589481 -0.0638 0.0571 2.6E-01 4.3E-01 5777 rs12320749 12:4331647 0.1462 0.0633 2.1E-02 5.5E-01 4659 rs597808 12:111973358 -0.1636 0.0527 1.9E-03 2.6E-01 6934 rs8013143 14:23494277 -0.007 0.0584 9.0E-01 6.8E-01 5556 rs11621325 14:65472241 0.1908 0.0518 2.3E-04 6.3E-01 7000 rs4887023 15:78535437 0.1024 0.0529 5.3E-02 8.3E-01 6875 rs2608604 16:88849421 0.1092 0.0576 5.8E-02 7.1E-01 5733 rs4890634 18:43854259 0.2733 0.0585 3.0E-06 1.1E-01 5508 rs71176524 19:2177925 -0.0276 0.055 6.2E-01 5.5E-02 6333 rs919797 19:4498157 -0.2847 0.0561 3.8E-07 2.8E-01 6206 rs1799918 19:13002400 -0.2722 0.0551 7.8E-07 4.4E-02 6182 rs958483 19:33759240 0.3658 0.1206 2.4E-03 4.6E-01 1375 rs6014993 20:55991637 0.1462 0.0633 2.1E-02 5.5E-01 4659 rs4820079 22:32879617 -0.007 0.0584 9.0E-01 6.8E-01 5556 rs855791 22:37462936 0.5782 0.0524 2.5E-28 7.5E-01 6880 rs140523 22:50962782 0.3845 0.0594 9.4E-11 3.5E-01 3437

189 Table 42. RBCC univariate results for combined-phenotype lead SNPs in European Americans. Het p- rsid Chr:Pos Beta SE p-value value effN rs61523591 2:46372781 -0.0266 0.006 8.3E-06 9.5E-01 3699 rs2361016 3:24341268 -0.0126 0.0047 7.7E-03 2.5E-03 5832 rs73210009 3:195830276 0.0059 0.0044 1.8E-01 1.6E-01 6580 rs218265 4:55408999 -0.0446 0.006 1.7E-13 9.4E-01 3513 rs4535497 5:1107428 0.0148 0.0046 1.2E-03 3.0E-01 6173 rs3749748 5:127350549 0.0012 0.005 8.1E-01 3.0E-01 5379 rs2032451 6:26092170 0.0153 0.0059 9.2E-03 3.1E-01 3793 rs1800562 6:26093141 0.003 0.0087 7.3E-01 8.0E-01 1687 rs11964516 6:41860252 -0.0255 0.0057 6.6E-06 6.4E-01 4100 rs13203155 6:109614844 0.0217 0.0042 3.1E-07 4.8E-01 7269 rs35786788 6:135419042 0.0625 0.0048 3.3E-38 6.3E-01 5565 rs590856 6:139844429 -0.0188 0.0042 8.1E-06 5.8E-01 7235 rs12718598 7:50428445 -0.0202 0.0044 3.6E-06 3.1E-01 6938 rs7385804 7:100235970 0.027 0.0045 3.0E-09 2.8E-02 6383 rs10253736 7:151415256 0.0189 0.0049 9.9E-05 1.4E-01 5355 rs7853365 9:4855858 0.014 0.0053 8.0E-03 9.5E-01 4758 rs2519093 9:136141870 0.0162 0.0055 3.1E-03 2.4E-01 4345 rs34793991 10:46080590 -0.0394 0.0086 4.8E-06 4.8E-01 1755 rs17476364 10:71094504 0.0161 0.0077 3.7E-02 5.3E-01 2186 rs28456 11:61589481 0.0184 0.0048 1.1E-04 5.9E-03 5775 rs12320749 12:4331647 -0.0199 0.0053 1.6E-04 7.4E-01 4658 rs597808 12:111973358 -0.0103 0.0044 1.9E-02 9.5E-01 6935 rs8013143 14:23494277 -0.0029 0.0049 5.5E-01 4.6E-01 5562 rs11621325 14:65472241 -0.0114 0.0043 7.9E-03 3.9E-01 7001 rs4887023 15:78535437 -0.0136 0.0044 2.0E-03 2.8E-01 6877 rs2608604 16:88849421 -0.0197 0.0048 4.4E-05 4.7E-01 5733 rs4890634 18:43854259 -0.0105 0.0049 3.0E-02 4.6E-01 5510 rs71176524 19:2177925 -0.0132 0.0046 4.3E-03 6.0E-01 6334 rs919797 19:4498157 0.0046 0.0047 3.2E-01 9.8E-01 6207 rs1799918 19:13002400 0.0238 0.0046 2.4E-07 3.8E-01 6180 rs958483 19:33759240 -0.0465 0.0101 4.1E-06 5.8E-01 1372 rs6014993 20:55991637 -0.0116 0.0044 8.1E-03 2.9E-01 6769 rs4820079 22:32879617 -0.0128 0.0042 2.4E-03 8.9E-01 7341 rs855791 22:37462936 -0.0055 0.0044 2.0E-01 9.4E-01 6882 rs140523 22:50962782 -0.0143 0.005 4.1E-03 1.9E-01 3437

190 Table 43. RDW univariate results for combined-phenotype lead SNPs in European Americans. Het p- rsid Chr:Pos Beta SE p-value value effN rs61523591 2:46372781 0.0291 0.0302 3.4E-01 2.5E-01 1024 rs2361016 3:24341268 -0.0132 0.0234 5.7E-01 9.8E-01 1678 rs73210009 3:195830276 0.0473 0.0213 2.7E-02 4.2E-01 2023 rs218265 4:55408999 0.0286 0.0294 3.3E-01 9.6E-01 1047 rs4535497 5:1107428 0.1012 0.0216 2.8E-06 8.5E-01 1962 rs3749748 5:127350549 0.1244 0.0247 4.9E-07 2.9E-01 1505 rs2032451 6:26092170 0.1544 0.0297 2.0E-07 5.5E-01 1044 rs1800562 6:26093141 0.2114 0.0447 2.3E-06 8.0E-01 451 rs11964516 6:41860252 -0.0969 0.0278 4.8E-04 6.1E-01 1174 rs13203155 6:109614844 0.0352 0.0213 9.8E-02 7.3E-01 2066 rs35786788 6:135419042 0.1366 0.0236 7.2E-09 1.4E-01 1615 rs590856 6:139844429 0 0.0211 1.0E+00 1.4E-01 2051 rs12718598 7:50428445 -0.0121 0.0208 5.6E-01 5.8E-01 2065 rs7385804 7:100235970 0.0489 0.022 2.6E-02 7.2E-01 1917 rs10253736 7:151415256 0.0252 0.0236 2.8E-01 1.2E-01 1661 rs7853365 9:4855858 0.0388 0.0268 1.5E-01 2.1E-01 1300 rs2519093 9:136141870 -0.0605 0.0267 2.4E-02 4.5E-01 1275 rs34793991 10:46080590 0.032 0.0434 4.6E-01 5.9E-01 476 rs17476364 10:71094504 -0.0283 0.0362 4.4E-01 4.6E-01 684 rs28456 11:61589481 -0.0286 0.023 2.1E-01 9.8E-01 1761 rs12320749 12:4331647 -2.00E-04 0.0254 9.9E-01 7.3E-01 1405 rs597808 12:111973358 -8.00E-04 0.0214 9.7E-01 7.7E-01 2044 rs8013143 14:23494277 0.1114 0.0232 1.6E-06 7.6E-01 1711 rs11621325 14:65472241 -0.0208 0.0217 3.4E-01 9.7E-01 1995 rs4887023 15:78535437 -0.099 0.0218 5.6E-06 7.2E-01 1943 rs2608604 16:88849421 -0.0017 0.0219 9.4E-01 1.3E-01 1842 rs4890634 18:43854259 -0.0608 0.0245 1.3E-02 2.1E-01 1551 rs71176524 19:2177925 -0.0485 0.021 2.1E-02 1.2E-01 2026 rs919797 19:4498157 0.0037 0.0214 8.6E-01 8.7E-01 1997 rs1799918 19:13002400 0.0486 0.022 2.7E-02 7.9E-02 1895 rs958483 19:33759240 -0.0818 0.0457 7.4E-02 9.6E-01 446 rs6014993 20:55991637 0.0032 0.021 8.8E-01 3.9E-01 2015 rs4820079 22:32879617 -0.0092 0.021 6.6E-01 9.9E-01 2082 rs855791 22:37462936 -0.1285 0.0214 1.9E-09 3.5E-01 2001 rs140523 22:50962782 -0.1007 0.0317 1.5E-03 6.1E-01 906

191

Table 44. LD proxies by ancestry for rs6573766, associated with RBCC in univariate sensitivity analysis.

LD proxy SNPs (r2>0.6)

Coded allele p-value in total African frequency study population American Hispanic/ Latino European American Pooled MEGA African 0.24 HCT 0.06 rs11848279, rs11848279, rs8008583 rs7158837, rs11158663 rs11848279, rs8008583 American rs8008583 (r2=0.76) (r2=0.69) (r2=0.67) (r2=0.76) Hispanic/ 0.34 HGB 0.06 rs35372434 (r2=0.72) rs10131172, rs9788516 rs77151393 (r2=0.6) Latino (r2=0.61) European 0.2 MCH 0.06 rs57193392 (r2=0.69) American rs77151393 (r2=0.67) rs7158837, rs77151393, CHB 0.28 MCHC 0.23 rs11158663 (r2=0.66) (HapMap)

GIH 0.6 MCV 1.80E-03 rs8003102 (r2=0.62) (HapMap)

192 RBCC 1.07E-09

RDW 0.4

r2 refers to correlation between lead SNP and reported SNP; CHB = Han Chinese; GIH = Gujarati Indian in Houston, TX.

192

Table 45. LD proxies by ancestry for rs145548796*, associated with MCV in univariate sensitivity analysis.

LD proxy SNPs (r2>0.6) Coded allele frequency (1000G p-value in total Hispanic/ European Phase 3) study population African American Latino American Pooled MEGA AFR 0.002 HCT 0.94 rs372843462, rs531243911, N/A N/A rs531243911, rs372843462, rs368935853, rs374651220, rs368935853, rs374651220,

AMR 0.001 HGB 0.75 rs377045600, rs534189319 rs377045600, rs534189319 (r2=0.75) (r2=0.64)

EAS 0 MCH 1.20E-07 rs570678968 (r2=0.73) rs570678968 (r2=0.63)

EUR 0.005 MCHC 0.63 rs553008644, rs551058836, rs529107624, rs368442057, rs373865052, rs367680848, SAS 0.001 MCV 4.60E-09 rs527330687, rs554440375 (r2=0.69)

193 RBCC 0.06 2

rs189451984 (r =0.67)

RDW 0.79 rs531451290 (r2=0.62)

rs543776773 (r2=0.61)

*coded allele was not detectable in ancestry-specific analyses, values reported are for the ARIC European American sub-population; the variant had a CAF of 0.0015 in the pooled MEGA study population and a CAF of 0.0034 in ARIC European Americans. r2 = correlation between lead SNP and reported SNP.

193

Table 46. Comparison of unadjusted and esv3637548-adjusted trait-specific p-values at HBA1/2 locus on chromosome 16 in MEGA-genotyped individuals. rsid HCT un HCT adj HGB un HGB adj MCH un MCH adj MCHC un MCHC adj aSPU un aSPU adj rs186066503 6.0E-02 7.0E-02 1.0E-01 7.0E-02 5.3E-26 1.5E-28 2.2E-13 4.1E-15 1.0E-11 1.0E-11 rs145546625 6.0E-02 5.0E-02 2.0E-02 1.0E-02 1.1E-15 1.5E-17 1.8E-01 7.0E-02 1.0E-11 1.0E-11 rs61743947 4.0E-02 3.0E-02 2.2E-04 4.6E-01 3.9E-10 1.3E-14 2.4E-09 5.7E-11 1.0E-11 1.0E-11 rs55932218 1.0E-01 6.0E-02 5.0E-03 7.0E-04 5.5E-07 1.6E-11 2.0E-03 3.6E-05 4.5E-06 1.0E-11 rs142154093 2.0E-02 8.9E-01 3.3E-06 7.0E-03 2.0E-19 3.4E-07 1.5E-07 7.0E-02 1.0E-11 1.3E-06 rs9924561 1.4E-04 6.9E-01 3.6E-26 4.0E-01 5.8E-158 9.8E-07 9.7E-87 1.1E-04 1.0E-11 1.6E-05 rs60616598 2.2E-01 3.5E-01 5.9E-04 9.0E-03 2.3E-11 4.0E-05 1.7E-06 4.4E-04 1.0E-11 2.7E-05 rs76613236 5.5E-01 3.6E-01 3.5E-05 8.8E-01 1.3E-41 3.4E-04 9.6E-14 8.5E-01 1.0E-11 4.5E-04 rs60125383 4.9E-01 9.3E-01 1.0E-03 1.2E-01 3.8E-09 1.0E-03 4.5E-10 3.0E-04 3.0E-09 3.0E-03 rs8051004 5.2E-01 6.5E-01 7.5E-01 4.0E-01 5.0E-03 5.0E-03 7.0E-02 6.0E-03 5.0E-02 2.0E-02 rs8058016 9.0E-03 5.0E-02 2.3E-04 1.0E-01 1.1E-12 1.0E-02 1.2E-04 6.5E-01 1.0E-11 1.2E-01 rs115415087 4.7E-01 9.0E-02 5.6E-01 3.9E-01 1.4E-09 7.6E-01 1.8E-04 6.1E-01 3.0E-09 1.2E-01 rs145752042 6.0E-02 7.0E-02 6.0E-02 1.2E-01 1.3E-01 4.6E-01 8.8E-01 4.7E-01 4.0E-02 1.3E-01 rs530159671 2.0E-02 4.8E-01 1.0E-07 4.6E-01 1.4E-34 8.0E-02 3.7E-11 3.0E-01 1.0E-11 2.6E-01 194 rsid MCV un MCV adj RBCC un RBCC adj RDW un RDW adj aSPU un aSPU adj

rs186066503 8.0E-14 2.6E-15 8.0E-15 3.4E-14 1.5E-06 8.4E-07 1.0E-11 1.0E-11 rs145546625 3.2E-16 1.3E-17 2.1E-05 1.4E-05 7.0E-03 3.0E-03 1.0E-11 1.0E-11 rs61743947 2.7E-09 4.3E-13 2.0E-01 1.2E-01 1.5E-05 5.2E-07 1.0E-11 1.0E-11 rs55932218 2.2E-05 1.2E-08 7.0E-02 3.0E-02 6.2E-06 1.2E-08 4.5E-06 1.0E-11 rs142154093 2.1E-18 2.4E-07 5.9E-08 5.0E-03 6.1E-07 2.0E-03 1.0E-11 1.3E-06 rs9924561 2.4E-135 1.1E-04 8.6E-49 2.1E-01 4.9E-06 9.0E-03 1.0E-11 1.6E-05 rs60616598 2.4E-07 7.0E-03 2.6E-04 4.0E-02 9.3E-09 2.6E-06 1.0E-11 2.7E-05 rs76613236 1.3E-33 1.9E-04 1.4E-18 2.2E-04 2.2E-12 6.0E-03 1.0E-11 4.5E-04 rs60125383 8.6E-06 3.0E-02 6.8E-04 7.0E-02 5.0E-02 3.3E-01 3.0E-09 3.0E-03 rs8051004 3.0E-02 2.0E-02 9.7E-01 9.4E-01 7.2E-01 5.4E-01 5.0E-02 2.0E-02 rs8058016 1.3E-11 8.0E-03 5.0E-03 4.7E-01 2.0E-02 7.3E-01 1.0E-11 1.2E-01 rs115415087 6.4E-09 5.8E-01 7.4E-09 2.0E-02 1.6E-01 7.0E-02 3.0E-09 1.2E-01 rs145752042 1.0E-02 5.0E-02 6.8E-01 8.5E-01 6.1E-01 5.5E-01 4.0E-02 1.3E-01 rs530159671 2.1E-21 4.0E-02 6.0E-11 3.3E-01 9.9E-08 3.3E-01 1.0E-11 2.6E-01 “un” refers to trait-specific or aSPU unconditioned p-value; “adj” refers to p-value for respective analysis conditioned on esv3637548

194

Table 47. Shared generalization at previously published index SNPs across seven RBC traits.

HCT HGB MCH MCHC MCV RBCC RDW

HCT: HCT 71% 21% 30% 21% 38% 35% 5.8% HGB: HGB 86% 27% 48% 35% 43% 39% 4.8% MCH: MCH 48% 53% 67% 88% 67% 66% 9.1% MCHC: MCHC 32% 42% 30% 32% 24% 60% 19% MCV: MCV 48% 68% 88% 69% 73% 69% 7.0% RBCC: RBCC 57% 54% 43% 33% 47% 38% 7.7% RDW: RDW 38% 36% 31% 61% 33% 28% 27%

*Generalization significance threshold is p<1.07E-04. Top-right side of matrix represents the number of variants significant for both traits (numerator) as a proportion of the total number of significant variants for the trait in that row (denominator, green/yellow). Bottom-left size of the matrix represents the number of variants significant for both traits (numerator) as a proportion of the total number of significant variants for the trait in that column (denominator, blue/orange).

195 Table 48. Generalization of previously reported association signals and index SNPs to the PAGE trans-ethnic study population.

Number HCT HGB MCH MCHC MCV RBCC RDW (67,813) (67,798) (41,294) (67,856) (41,253) (41,287) (33,526) All loci 250 251 223 197 222 126 188 Trait loci 82 93 129 50 142 73 95 All index SNPs 104 125 243 111 243 155 114 Trait index SNPs 48 62 137 40 136 77 47

Proportion HCT HGB MCH MCHC MCV RBCC RDW

Trait loci 73 77 60 35 58 41 53 Trait index SNPs 29 32 33 28 29 26 13 All loci 53 53 47 42 47 27 40 All index SNPs 8 10 19 8 19 12 9

196 Table 49. Generalization of previously reported association signals and index SNPs to the PAGE African American population.

Number HCT HGB MCH MCHC MCV RBCC RDW (16,082) (16,084) (8,680) (16,127) (8,663) (8,665) (6,124) All loci 171 153 149 143 144 119 101 Trait loci 47 53 69 29 83 41 43 All index SNPs 2 19 23 12 28 18 6 Trait index 2 11 14 7 13 8 0 SNPs

Proportion HCT HGB MCH MCHC MCV RBCC RDW Trait loci 42 44 32 20 34 23 24 Trait index SNPs 1 6 3 5 3 3 0 All loci 36 33 32 30 31 25 21 All index SNPs 0 1 2 1 2 1 0

197 Table 50. Generalization of previously reported association signals and index SNPs to the PAGE Hispanic/Latino population.

Number HCT HGB MCH MCHC MCV RBCC RDW (20,697) (20,685) (17,380) (20,712) (17,363) (17,393) (17,260) All loci 169 180 171 128 193 185 142 Trait loci 63 66 93 31 118 89 77 All index SNPs 44 43 113 40 103 67 65 Trait index SNPs 18 29 65 20 64 38 24

Proportion HCT HGB MCH MCHC MCV RBCC RDW Trait loci 56 55 43 22 48 49 43 Trait index SNPs 11 15 15 14 14 13 7 All loci 36 38 36 27 41 39 30 All index SNPs 3 3 9 3 8 5 5

198 Table 51. Generalization of previously reported association signals and index SNPs to the PAGE European-ancestry population.

Number HCT HGB MCH MCHC MCV RBCC RDW (29,513) (29,508) (14,707) (29,493) (14,702) (14,704) (4,172) All loci 125 136 119 98 124 119 48 Trait loci 51 67 84 28 84 62 25 All index SNPs 92 84 165 50 159 81 42 Trait index SNPs 39 44 95 18 92 42 15

Proportion HCT HGB MCH MCHC MCV RBCC RDW Trait loci 45 55 39 20 34 34 14 Trait index SNPs 23 22 23 13 20 14 4 All loci 27 29 25 21 26 25 10 All index SNPs 7 6 13 4 12 6 3

199 Table 52. Lead SNPs that are significant eQTLs for genes within 500kb for RBC-relevant tissues in GTEx. Gene GTEx Symbol Gencode Id SNP p-value alpha Tissue ABO ENSG00000175164.9 rs10901252 1.5E-49 4.6E-05 Thyroid ABO ENSG00000175164.9 rs10901252 3.1E-16 2.2E-05 Whole Blood ABO ENSG00000175164.9 rs10901252 2.8E-07 1.2E-05 Spleen CHAF1A ENSG00000167670.11 rs12459922 1.1E-36 3.5E-05 Thyroid CHAF1A ENSG00000167670.11 rs12459922 1.6E-08 2.0E-05 Whole Blood CHAF1A ENSG00000167670.11 rs12459922 7.1E-07 7.1E-06 Liver UBXN6 ENSG00000167671.7 rs12459922 2.8E-44 3.7E-05 Thyroid HBS1L ENSG00000112339.10 rs12664956 1.4E-16 7.0E-05 Thyroid HBS1L ENSG00000112339.10 rs12664956 9.2E-08 1.2E-05 Liver HBS1L ENSG00000112339.10 rs12664956 4.6E-06 3.7E-05 Whole Blood CPT1B ENSG00000205560.8 rs140523 1.4E-06 8.3E-05 Thyroid ODF3B ENSG00000177989.9 rs140523 4.0E-10 4.6E-05 Whole Blood SCO2 ENSG00000130489.8 rs140523 2.5E-06 8.0E-05 Thyroid SCO2 ENSG00000130489.8 rs140523 8.9E-06 1.7E-05 Liver SCO2 ENSG00000130489.8 rs140523 2.2E-05 4.3E-05 Whole Blood TYMP ENSG00000025708.8 rs140523 8.9E-19 4.8E-05 Whole Blood FARSA ENSG00000179115.6 rs1799918 7.6E-06 3.8E-05 Whole Blood GCDH ENSG00000105607.8 rs1799918 7.6E-12 6.9E-05 Thyroid GCDH ENSG00000105607.8 rs1799918 6.6E-08 3.8E-05 Whole Blood GCDH ENSG00000105607.8 rs1799918 2.7E-06 1.3E-05 Liver SYCE2 ENSG00000161860.7 rs1799918 5.7E-09 5.5E-05 Thyroid SYCE2 ENSG00000161860.7 rs1799918 2.8E-06 1.3E-05 Liver HFE ENSG00000010704.14 rs2032451 3.9E-10 1.3E-04 Thyroid ABO ENSG00000175164.9 rs2519093 4.0E-15 2.2E-05 Whole Blood FADS1 ENSG00000149485.12 rs28456 4.2E-13 8.0E-05 Thyroid FADS2 ENSG00000134824.9 rs28456 2.1E-27 5.1E-05 Whole Blood FADS2 ENSG00000134824.9 rs28456 8.1E-14 7.8E-05 Thyroid FADS2 ENSG00000134824.9 rs28456 6.7E-11 2.3E-05 Spleen TMEM258 ENSG00000134825.9 rs28456 7.1E-06 4.0E-05 Whole Blood MARCH8 ENSG00000165406.11 rs34793991 4.0E-06 1.3E-04 Thyroid HBS1L ENSG00000112339.10 rs35786788 4.8E-07 7.0E-05 Thyroid HBS1L ENSG00000112339.10 rs35786788 4.4E-06 1.2E-05 Liver C18orf25 ENSG00000152242.6 rs4890634 5.1E-06 3.5E-05 Whole Blood MOSPD3 ENSG00000106330.7 rs7385804 3.5E-19 5.1E-05 Whole Blood TFR2 ENSG00000106327.8 rs7385804 9.7E-11 1.9E-05 Liver HAUS4 ENSG00000092036.12 rs8013143 5.0E-06 2.2E-05 Whole Blood

200 PRMT5 ENSG00000100462.11 rs8013143 1.1E-08 2.4E-05 Whole Blood MPST ENSG00000128309.12 rs855791 8.6E-07 4.6E-05 Thyroid TST ENSG00000128311.9 rs855791 2.0E-05 3.9E-05 Thyroid

201

MANUSCRIPT B. GENE-BASED TESTING IN AN ANCESTRALLY DIVERSE STUDY POPULATION SUGGESTS COMPLICATED REGULATORY MECHANISMS FOR RED BLOOD CELL TRAITS: THE PAGE STUDY

6.1. Overview

Red blood cell (RBC) traits are highly polygenic and moderately heritable traits with a broad array of both clinical and research applications. Genetic characterization of these traits remain incomplete despite hundreds of reported GWAS loci. Populations of non-European ancestry also remain underrepresented in blood trait genetics literature. Therefore, we conducted gene-based testing of two variant annotation sets in genotyped and imputed data from participants of the multi-ethnic PAGE study (n=29,070). We identified four genes in loci previously unreported in the RBC trait association study literature: two specific to MCH

(IGHV1-3, p=1.2E-07, and CATSPERB, p=4E-08), one specific to RBCC (PPHLN1, p=2.4E-07), and a non-coding RNA (RP11-687M24.7) significant for HCT, HGB, MCH, MCHC, RBCC, and

RDW. The most significant variant within each transcript exhibited a gradient of allele frequency by ancestry, being the most common in the African American subpopulation. We additionally reported 45 significantly associated genetic transcripts within known RBC trait GWAS loci, all but six of which were significantly associated with multiple traits. Our findings suggest that rare variants do play a role in RBC development and function, and the improvement in power provided by gene-based methods can increase detection of discovery loci driven by rare variation. We additionally demonstrate that ancestrally diverse study populations will allow for

202 the detection of loci driven by variants that are monomorphic or extremely rare in European- ancestry study populations.

6.2. Introduction

Red blood cell (RBC) traits are highly polygenic quantitative phenotypes that describe the function and maintenance of RBCs. Hematocrit, hemoglobin, RBC count, and their derived traits are utilized clinically to diagnose anemic states, and several RBC traits have been associated as prognostic indicators of other chronic diseases8; 107; 260; 632. These traits exhibit moderate heritability (20-80%), and insights into the molecular function of genes affecting RBC biology has led directly to the development of treatments for hereditary anemias143; 156; 157; 322.

Identification of genes likely to be affecting RBC trait biology is important to maximize understanding of these traits for both research and clinical utilization.

RBC traits have been extensively evaluated using traditional GWAS methods, identifying over 500 loci 31; 301; 348. However, heritability estimates suggest that additional loci remain to be found, with the expectation that some heritability may be attributable to rare variants, which are more likely to be population-specific than common variants340; 470; 627; 633; 634. Restriction of the best-powered GWAS to European- and East Asian-ancestry populations has limited the detection of such potentially causal rare variants (minor allele frequency <1%) segregating in African- and

Amerindian-ancestry populations. In addition to low representation of global ancestries, RBC trait GWAS examining rare variants have limited statistical power, particularly when attempting to identify loci with modest to moderate effects that are present in a small subset of the study population. Gene-based testing improves statistical power to detect such effects by incorporating the effects of multiple variants into one test statistic, which can also leverage the benefits of a

203 multi-ethnic study population with a large number of previously unexamined rare variants491; 519.

To address the limitations outlined above, we performed gene-based testing of seven RBC traits in over 29,000 participants of the Population Architecture using Genomics and Epidemiology

(PAGE) study. Our findings emphasize the importance of inclusive study design and evaluation of the spectrum of allelic variation to understand the genetic architecture underlying complex traits.

6.3. Methods

6.3.1. Study population

The PAGE study comprises ancestrally-diverse study populations from United States cohorts and biobanks evaluating common complex diseases and accompanying risk factors

(http://pagestudy.org). This study used data from all Multi-ethnic Genotyping Array (MEGA)- genotyped (see above) self-reported African American, Asian American, Hispanic/Latino, Native

American, and “Other” participants from the Hispanic Community Health Study/Study of

Latinos (HCHC/SOL); the Icahn Mt. Sinai School of Medicine BioME Biobank (BioME); and the Women’s Health Initiative (WHI) with phenotype data for one or more RBC traits495; 589; 635.

Participants were excluded if they met any of the criteria described in Section 4.3.2.

6.3.2. RBC trait measurement

RBC traits were measured with hemanalyzers following standardized laboratory protocols from blood draws at the earliest available visit for all seven RBC traits (hematocrit

[HCT], hemoglobin concentration [HGB], mean corpuscular hemoglobin [MCH], MCH concentration [MCHC], mean corpuscular volume [MCV], RBC count [RBCC], and red cell

204 distribution width [RDW]; Table 3). RBC trait values exceeding four standard deviations from the total study population trait mean were excluded.

6.3.3. Genotyping and Imputation

All study participants were genotyped on the Illumina Multi-ethnic genotyping array

(MEGA) and imputed to the 1000 Genomes phase 3 reference panel after application of filtering and quality control criteria, as previously described455; 495. Variants with an imputation quality score below 0.4 were excluded from analysis. In line with previously published gene-based testing efforts, we excluded SNPs which had a minor allele count (MAC) <5—because the association may be driven by an ancestral haplotype or outlying phenotypic value rather than a genetic association. We also excluded variants with a minor allele frequency (MAF) >0.01 because variants above that allele frequency are powered to be detected using GWAS methods and have already been evaluated in our study population28; 636. Minor allele frequencies (MAFs) used for reporting were calculated in MEGA-genotyped WHI African Americans and

HCHS/SOL Hispanics/Latinos. European MAFs were obtained from the 1000G CEU European- ancestry population470.

6.3.4. Statistical Methods

6.3.4.1 Transcript-based variant annotation file design

Variant annotation files were generated for all autosomal variants with a PAGE MAF

<0.01 for two different definitions of a genetic transcript using variant-transcript affiliations as defined in Ensembl (version 25 GRCh37/hg19), downloaded from BioMart (accessed July 1,

2019)637. Briefly, transcript start, end, and splice sites were predicted and compared to previous versions, which were combined with manual annotation to define all transcripts in both protein-

205 coding and noncoding genes (annotation is described in more detail at http://ensembl.org/Homo_sapiens/Info/Annotation and http://.ensembl.org/info/genome/genebuild/ manual_havana.html). For the first annotation file

(referred to henceforth as “deleterious”), we restricted to variants annotated as stop-gained, stop- lost, non-synonymous, and splicing variants. For the second annotation file (henceforth referred to as “CADD” in reference to the Combined Annotation Dependent Depletion database), we included all variants in the deleterious annotation file plus all synonymous coding variants and potential regulatory variants with a scale CADD score >10 classified as upstream, downstream,

5’UTR, 3’UTR, or intronic variants for the relevant transcript617. CADD scores have been calculated and published for 8.6 billion nucleotides in the context of the current reference genome (including 1000 Genomes, to which our genotypes are imputed), correlate with relevant metrics such as ClinVar pathogenicity, and demonstrate an improvement in predicting deleteriousness over other available prediction algorithms such as PolyPhen476. CADD scores are particularly useful as a filtering method for rare variants, because functional rare variants are expected to have higher effect magnitudes than most common variants, but the vast majority have not been molecularly characterized43; 638. Since the raw or unscaled CADD score cannot be compared across datasets, we used the scaled PHRED CADD score (representing the rank-order by effect magnitude) and filtered on the recommended score of 10 to obtain the top ten percent of all variants in the genome. Of note, scores are currently not tissue- or transcript-specific, meaning a highly-scored variant may affect transcripts differentially

(https://cadd.gs.washington.edu, accessed July 5, 2019). Additionally, we note that while non- coding-RNA (ncRNA)-specific variants were excluded, our annotation file design did not preclude ncRNA transcripts comprising other types of variants from being included.

206 6.3.4.2 Gene-based analyses

Score statistics for individual variants by group were generated in SUGEN in a MASS- accepted format (see below)616. An additive genetic model of inheritance was assumed, and we adjusted for linear effects of age at blood draw, sex, study site or region, and ten ancestral principal components23. To identify gene transcripts significantly associated with the seven examined RBC traits, we implemented an adaptive method combining variance components and burden tests using MASS meta-analysis software 639. The burden test represents the combined effect of variant scores within a gene—i.e., the test statistic represents only one genetic variable per transcript in the annotation file. The variance components test represents the contribution of each participant’s genotype to the proportion of variance of the trait explained by genetic variation. In this case, the covariances are calculated for variants within each transcript, for which the corresponding test statistic represents a weighted sum of the score statistics for each variant499. Within-study weights are applied to the covariance matrix using the beta density function Beta(1,25,MAF).

Transcripts represented by fewer than three variants were excluded because they essentially represent single-variant tests rather than group tests44; 636. The maximum number of

Monte Carlo simulations run in MASS was set to the default of one million; simulations were increased to 25 million (simulating a minimum p-value of 4E-08) for transcripts exceeding 1E-5 to ascertain statistical significance. Gene-based results were presented as the primary findings, employing Bonferroni correction assuming 25,000 independent genes for seven traits (p<2.86E-

7). Transcripts were reported as discovery genes if a significant transcript with start and end sites

≥500kb from a variant previously reported to exceed p<5E-8 for any of the seven RBC traits.

Previously-reported SNPs for the seven RBC traits evaluated in this study were identified

207 through review of the NHGRI-EB GWAS Catalog21 as of May 1, 2019. Previously reported genes were identified through a PubMed search for exome-chip, exome-sequencing, and whole- genome sequencing which employed gene-based or sliding-window burden tests.

6.3.4.3 Gene ontology data

To help identify shared biological mechanisms underlying significantly associated genes, we compared results from both variant annotation sets across all RBC traits using the publicly available StringDB resource from Elixir618. StringDB is regularly updated and pulls from molecular experimental and GWAS results databases as well as text-mining from the literature.

Using the standard false discovery rate (FDR) significance threshold of 0.05, we searched for biological processes, molecular functions, and cellular components that were overrepresented in our set of significantly associated genes (updated Jan 2019, accessed July 1, 2019). Additionally, for previously unreported genes we generated networks with modest (0.2) or higher interaction scores to visualize potential clusters of proteins significantly associated with one or more traits.

6.4. Results

The study population comprised 29,070 participants who were predominantly African

Americans (30%) or Hispanics/Latino (65%) (Table 53). The study population was broadly representative with regard to age (ranging from 18 to 90; average age 53 years) and majority female (76%). We evaluated a maximum of 108,331 transcripts representing up to 12,399,104 rare variants in the total study population. The number of variants per transcript ranged from the designated minimum of three to 207 (median = 4) for the deleterious variant annotation set and three to 1,103 (median = 11) for the CADD variant annotation set (Figure 13).

208 6.4.1. Genes previously unreported in RBC trait association studies

In total, 49 transcripts representing 42 genes were significant for one or more RBC traits for one or both annotation sets (Figure 14, Tables 54, 55). Fourteen transcripts were associated with only one RBC trait, and MCH (n=37) and MCV (n=36) harbored the largest number of associations across all transcripts. Significantly associated transcripts from four genes were outside loci previously associated with RBC traits in large-scale genetic studies: CATSPERB,

IGHV1-3, PPHLN1, and RP11-687M24 (Tables 54, 55). Three of four newly identified transcripts exceeded significance for only one trait (RP11-687M24 was significantly associated with HCT, HGB, MCV, MCHC, RBCC, and RDW). Three were represented by five or fewer variants in their respective masks (the CATSPERB transcript significantly associated with RBCC included 38 variants).

The most significant (“top”) annotated variant had a minor allele frequency >0.1% in all significant trait-transcript associations (Tables 56-59). Several of the top variants have been reported as index SNPs in RBC trait GWAS, although the transcript containing the variant was not always reported as the gene associated with the index SNP in the literature. For the genes with multiple significantly associated transcripts in the deleterious annotation set, the top variant was the same for all transcripts within the gene in 100% of cases, suggesting that the same variants may be functional for multiple transcripts (Tables 58, 59). Approximately 60% of top

SNPs exceeded traditional genome-wide significance (5E-08), indicating those variants could be accounting for most of the association for that trait-transcript combination (Tables 56, 57).

However, for transcripts significant using the deleterious annotation set, fewer than 10% of associations exhibited a decrease over one order of magnitude after conditioning on the top variant (Table 62). For transcripts significantly associated with more than one trait, the top

209 variant was the same across all traits approximately half of the time (e.g., n=15/31 for CADD annotation set, Tables 56, 57). Of the transcripts representing 38 genes which overlapped with known RBC trait GWAS loci, the majority were located in gene-dense regions, particularly the regions on chromosomes 11 and 16 referred in which HBB and HBA1/2 are coded, respectively.

Ancestry differences were also evident in allele frequency differences for top variants

(Tables 58, 59). On average, African Americans had the highest minor allele frequency in top variants (average MAF = 0.0077; approximately one third of variants would been excluded due to MAF>1% in an African American-only study population), with Hispanics/Latinos having a slightly lower MAF at the same variants (average MAF = 0.0028). Of 75 unique top variants across all trait-transcript associations, over 75% were monomorphic in 1000 Genomes European or Asian ancestries (EAS, n=68; EUR, n=56; SAS, n=69; Tables 58, 59).

6.4.2. Clustering of significant genes

Consistent with both the molecular and GWAS literature, the first megabase of chromosome

16—which contains genes coding for multiple hemoglobin subunits as well as FAM234A and

LUC7L—was highly represented, with 17 genes significant for one or more RBC traits. Of these genes, a cluster including LUC7L, ITFG3/FAM234A, RHBDF1, TMEM8A, PDIA2, and RGS11 exhibited evidence supporting a shared role in RBC physiology (Figure 15). For instance,

PDIA2, located in a locus recently implicated in alpha-thalassemia, showed evidence of association with the adjacent globin cluster on chr16p13.3640. Given the relative proximity of these genes and a complicated long-range LD structure regulating alpha-hemoglobin expression in this region, additional characterization would be required to separate independent biological actions from correlation due to physical proximity. Twenty-two of the 42 genes either did not

210 exhibit a moderate or strong association with any other significant gene or were not available in the StringDB gene list (Table 60).

6.4.3. Overrepresentation of gene ontology or molecular pathways

Gene ontology and interpretation of biological mechanisms of all significant genes was evaluated to determine overlapping pathways. G-protein-coupled receptor activity and oxygen transport were overrepresented in our dataset (FDR<0.05, Table 61). None of the previously unreported genes were indexed in the database and hence could not be evaluated for protein-protein associations.

6.5. Discussion

This work reported a gene-based study of seven RBC traits in a multi-ethnic study population. We identified four genes which were physically distant from any previously reported

RBC trait–associated locus. We further reported significant associations with multiple genes in known RBC trait loci which differed from the gene reported in the GWAS literature, likely due to proximity to a GWAS index SNP. Consistent with RBC trait GWAS and exome-sequencing findings, we also found a large overlap in associations across RBC traits. Our results reinforced the complexity of RBC traits, as well as highlighting the crucial role ancestrally diverse study populations will continue to have in genetic discovery and characterization. Overall, this study emphasizes the need to consider the entire spectrum of allelic diversity in global populations when attempting to understand the genetic architecture underlying complex traits.

Notably, we identified four previously unreported genes significantly associated with one or more RBC traits, all of which demonstrated contribution from more than one variant to exceed significance. While the relevance of each gene to RBC biology is not immediately apparent,

211 more characterization is merited to determine potential underlying contributions to RBC phenotypes. For instance, PPHLN1 encodes for periphilin, which has recently been implicated as part of the HUSH complex for involvement with epigenetic transcriptional silencing leading to heterochromatin formation, a process which is crucial to RBC development in the bone marrow as hemoglobin expression is upregulated and nearly all other transcripts are downregulated641-643.

Of particular relevance to translational research, we found that nearly half of significant transcripts in genomic regions previously reported in GWAS were not from the reported gene

(typically the gene most physically proximal to the index SNP). While canon dictates that the index SNP in a GWAS in not necessarily causal and the association may affect a more distant gene, this reporting tendency has been sustained over the past decade. GWAS have revealed multiple independent associations in both the HBB and HBA1/2 regions, which represent the nearly 75% of the associations we identified, but characterization of the functions of most genes neighboring the hemoglobin genes has yet to be performed. Our results—especially for smaller genes with upstream and downstream variants included in the mask file—cannot be conclusively determined to represent the protein-coding gene associated with the associated transcript, particularly in areas expected to have multiple overlapping regulatory mechanisms. However, the difference between our significant genes and reported GWAS genes reinforces the idea that genes other than the most proximal to an index SNP should be considered for functional work.

This work benefits the RBC trait genetics literature in several ways. First, early gene- based testing of rare variants was inhibited by poor capture of rare or non-European variants in early array design, although sequencing is beginning to address this issue. We were able to include millions of rare variants in our annotation sets due in part to use of the MEGA array, which improved imputation of rare variants in multi-ethnic populations and allowed us to extend

212 our variant annotation sets into regulatory regions not captured by standard exome chips495. We also benefitted greatly from the diverse ancestral representation of the PAGE study participants, as demonstrated by the inclusion of 38 variants that contributed to significant associations and were monomorphic in the 1000 Genomes CEU European-ancestry population. Second, implementation of the optimal SKAT method is reportedly well powered to identify genes for which all variants act in same direction as well as those with variants having both positive and negative effects, which is particularly relevant for transcripts including regulatory variants.

Third, genetic transcripts rather than individual variants as the primary results allows for a more precise comparison of results with regard to shared function or physiologic pathway.

Our study also faced several limitations common to both RBC trait genomics research and gene-based testing in general. First, our sample size was limited, although our study population was broadly representative of the ancestry groups living in the United States.

Relatedly, because the number of participants varied by trait, differences in strength of association across traits may actually be due to differences in power to detect associations.

Females were also overrepresented (75%) in our study population—while biological sex is associated with the RBC traits we analyzed, we did not have a sufficient number of males to perform a sensitivity analysis determining whether sex may be affecting any significant associations. Secondly, while the identification of functional variants should be a priority, eQTL data for low-frequency or rare SNPs is limited generally, and the limitations become more severe for non-European-ancestry variants. Therefore, prioritizing variants identified in our study population for additional characterization will be difficult until publicly available functional databases are able to be more inclusive. Third, while the capture of rare variants (both genotyped and imputed) was improved compared to previous efforts, we acknowledge that with the right

213 methods, whole-genome sequencing will more accurately measure rare variants644. However, this effort demonstrates the utility of gene-based testing in diverse populations, providing a platform for future testing in larger study populations with sequencing data. Finally, although our transcript annotations are designed to represent specific gene isoforms, they are defined by physical location within or immediately proximal to the gene of interest. Without additional characterization, we cannot be certain that the variants representing any particular transcript are not in fact significantly affecting a neighboring gene via a regulatory mechanism. This was particularly evident on the short arm of chromosome 16, which contained 214 transcripts in the first megabase alone, 26 of which exceeded significance but may be representing complementary regulatory effects on the expression of the hemoglobin subunit genes in that region.

6.6. Conclusion

In conclusion, implementation of gene-based testing enabled the identification of multiple new transcripts associated with RBC traits as well as generalizing the associations of over three dozen loci previously identified in GWAS. Our work emphasizes the importance of inclusive study population design to maximize the benefits of methods developed specifically to incorporate rare variants into genomic analyses.

214 6.7. Main Figures and Tables

Figure 13. Number of variants per transcript in deleterious and CADD annotations. Green = deleterious annotation set; purple = CADD annotation set. Truncated at x=150 variants (maximum value for CADD annotation set = 1,103; maximum value for deleterious annotation set = 207).

215

Figure 14. Genome-wide gene-based results using deleterious (top) and CADD (bottom) masks identifies multiple significant transcripts across seven RBC traits. Each color represents a p-value < 0.001 for one trait-transcript association as follows: HCT (dark green), HGB (turquoise), MCH (light green), MCHC (orange), MCV (dark blue), RBCC (dark yellow), and RDW (pink). Labels indicate each gene with one or more trait-transcript associations exceeding genome-wide significance (p<2.8E-07).

216 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 Chromosome

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22

Chromosome

132216

Figure 15. Evidence of shared pathways in a cluster of CADD-significant genes. Network generated by StringDB for 12 genes with transcripts significantly associated with one or more RBC traits. Nodes = significant genes within two degrees of association of a significant globin gene. Intensity of edge = strength of confidence of association (minimum=0.2).

217

Table 53. Descriptive information for 29,070 MEGA-genotyped study participants.

Female Age, HCT HGB MCH MCHC MCV RBCC RDW (%) years (%) (g/dL) (pg) (g/dL) (fL) (103/µL) (CV) African American 90.6 58 38.7 12.8 29.0 33.0 88.2 4.36 14.6 (9.3) (3.6) (1.2) (2.7) (1.1) (7.2) (0.55) (1.4) Hispanic/ Latino 69.0 51 41.1 13.6 29.3 33.0 89.3 4.64 13.8 (14) (4.1) (1.4) (2.2) (1.3) (5.9) (0.48) (1.2) Total* 76.2 53 40.4 13.4 29.3 33.0 89.1 4.58 14.0 (13) (4.1) (1.4) (2.3) (1.2) (6.2) (0.51) (1.4) *The total study population includes self-identified Native American (604), Asian American (541), and “Other” (382) study participants. All variables aside from sex presented as the mean (standard deviation). g/dL = grams per decliter; pg = picograms; fL = femtoliters; µL = microliters; CV = coefficient of variation.

218

Table 54. CADD annotation set transcripts significant in one or more RBC traits in 29,070 PAGE participants.

Gene Transcript chr:pos vars HCT HGB MCH MCHC MCV RBCC RDW RP11- ENST00000605886 1:156611458 4 -- 1.0E- 4E-08 4E-08 -- -- 4E-08 284F21.10 07 ATP2B2 ENST00000397077 3:10365707 129 4E-08 4E-08 4E-08 -- 4E-08 4E-08 -- OSBPL5 ENST00000263650 11:3108346 31 4E-08 4E-08 4E-08 4E-08 -- 4E-08 -- OR51E1 ENST00000396952 11:4664650 8 ------4E-08 -- -- MMP26 ENST00000380390 11:4726157 74 4E-08 ------4E-08 -- -- OR52R1 ENST00000356069 11:4824663 10 -- -- 4E-08 4E-08 4E-08 -- -- OR51S1 ENST00000322101 11:4869427 6 ------4E-08 -- -- OR51A4 ENST00000380373 11:4967355 8 4E-08 ------4E-08 -- -- OR51A2 ENST00000380371 11:4976002 10 ------4E-08 4E-08 -- -- OR52J1P ENST00000366371 11:5125383 6 -- -- 4E-08 -- 4E-08 -- -- HBB ENST00000335295 11:5246694 5 4E-08 4E-08 4E-08 4E-08 4E-08 4E-08 4E-08 HBB ENST00000380315 11:5246694 4 4E-08 -- 4E-08 4E-08 4E-08 4E-08 4E-08 219 *RP11- ENST00000560744 11:125058011 5 4E-08 4E-08 4E-08 4E-08 -- 4E-08 4E-08

687M24.7 *CATSPERB ENST00000557036 14:92047040 38 -- -- 4E-08 ------RHBDF1 ENST00000262316 16:108058 15 -- -- 4E-08 -- 4E-08 -- -- RHBDF1 ENST00000428730 16:108058 7 -- -- 4E-08 -- 4E-08 -- -- RHBDF1 ENST00000450643 16:108058 7 -- -- 4E-08 -- 4E-08 -- -- HBQ1 ENST00000199708 16:230452 7 4E-08 4E-08 4E-08 4E-08 4E-08 4E-08 4E-08 LUC7L ENST00000293872 16:238968 13 -- -- 4E-08 ------LUC7L ENST00000429378 16:238968 8 -- -- 1.5E-07 ------LUC7L ENST00000442701 16:238968 3 -- -- 4E-08 -- 4E-08 -- -- ITFG3 ENST00000301678 16:284545 12 -- -- 4E-08 4E-08 4E-08 4E-08 -- ITFG3 ENST00000399932 16:284545 12 -- -- 4E-08 2.0E-07 4E-08 4E-08 -- RGS11 ENST00000168869 16:318300 7 4E-08 4E-08 4E-08 4E-08 4E-08 4E-08 4E-08 RGS11 ENST00000316163 16:318300 20 -- -- 4E-08 -- 4E-08 -- -- RGS11 ENST00000359740 16:318300 19 ------4E-08 -- -- RGS11 ENST00000397770 16:318300 19 -- -- 2.5E-07 -- 1.5E- -- -- 07

219 215 PDIA2 ENST00000219406 16:333152 41 -- -- 4E-08 4E-08 4E-08 4E-08 -- PDIA2 ENST00000404312 16:333152 41 -- -- 4E-08 4E-08 4E-08 4E-08 -- PDIA2 ENST00000435833 16:333152 10 -- -- 4E-08 4E-08 4E-08 4E-08 4E-08 TMEM8A ENST00000250930 16:420773 14 -- -- 4E-08 -- 4E-08 -- -- CAPN15 ENST00000219611 16:577717 27 -- -- 4E-08 -- 4E-08 4E-08 -- PIGQ ENST00000026218 16:616995 17 4E-08 4E-08 4E-08 4E-08 -- 4E-08 -- PIGQ ENST00000293874 16:616995 6 4E-08 4E-08 4E-08 4E-08 4E-08 4E-08 -- NHLRC4 ENST00000424439 16:616996 6 -- -- 4E-08 4E-08 4E-08 4E-08 -- NHLRC4 ENST00000540585 16:616996 6 -- -- 4E-08 4E-08 4E-08 4E-08 -- WDR90 ENST00000293879 16:699311 39 -- -- 4E-08 ------BAIAP3 ENST00000397488 16:1383602 23 -- -- 4E-08 4E-08 -- 4E-08 4E-08 MEIOB ENST00000325962 16:1883984 11 4E-08 4E-08 4E-08 4E-08 4E-08 4E-08 4E-08 TBL3 ENST00000561907 16:2022038 6 -- -- 4E-08 ------Chr:pos = chromosome and start site for respective transcript; vars = number of variants included in the transcript in its respective annotation file. * indicates gene with transcript start and end sites >500kb from a previously reported RBC trait GWAS index variant.

220

216220

Table 55. Deleterious annotation set transcripts significant in one or more RBC traits in 29,070 PAGE participants. Gene Transcript chr:pos vars HCT HGB MCH MCHC MCV RBCC RDW OR52R1 ENST00000356069 11:4824663 7 -- -- 4.0E-08 4.0E-08 4.0E-08 -- -- OR52R1 ENST00000380382 11:4824663 10 -- -- 1.2E-07 4.0E-08 4.0E-08 -- -- OR51A2 ENST00000380371 11:4976002 8 ------4.0E-08 4.0E-08 -- -- *PPHLN1 ENST00000317560 12:42632249 4 ------2.4E-07 -- *IGHV1-3 ENST00000390595 14:106471246 4 -- -- 1.2E-07 ------HBQ1 ENST00000199708 16:230452 3 -- -- 4.0E-08 4.0E-08 4.0E-08 4.0E-08 4.0E-08 LUC7L ENST00000293872 16:238968 5 -- -- 4.0E-08 -- 1.6E-07 -- -- LUC7L ENST00000337351 16:238968 5 -- -- 4.0E-08 -- 4.0E-08 -- -- LUC7L ENST00000397780 16:238968 5 -- -- 4.0E-08 -- 4.0E-08 -- -- LUC7L ENST00000397783 16:238968 5 -- -- 4.0E-08 -- 4.0E-08 -- -- LUC7L ENST00000429378 16:238968 3 ------4.0E-08 -- --

221 ITFG3 ENST00000301678 16:284545 5 -- -- 4.0E-08 4.0E-08 4.0E-08 4.0E-08 4.0E-08 ITFG3 ENST00000399932 16:284545 5 -- -- 4.0E-08 4.0E-08 4.0E-08 4.0E-08 4.0E-08

RGS11 ENST00000168869 16:318300 3 -- -- 4.0E-08 -- 4.0E-08 -- -- RGS11 ENST00000316163 16:318300 15 -- -- 2.4E-07 -- 4.0E-08 -- -- RGS11 ENST00000359740 16:318300 15 ------4.0E-08 -- -- RGS11 ENST00000397770 16:318300 16 ------4.0E-08 -- -- PDIA2 ENST00000219406 16:333152 22 -- -- 4.0E-08 ------PDIA2 ENST00000404312 16:333152 22 -- -- 4.0E-08 ------PDIA2 ENST00000435833 16:333152 7 ------4.0E-08 4.0E-08 -- NHLRC4 ENST00000424439 16:616996 4 ------4.0E-08 4.0E-08 4.0E-08 -- NHLRC4 ENST00000540585 16:616996 4 ------4.0E-08 -- 4.0E-08 -- Chr:pos = chromosome and start site for respective transcript; vars = number of variants included in the transcript in its respective annotation file. * indicates gene with transcript start and end sites >500kb from a previously reported RBC trait GWAS index variant.

215221

Table 56. Top variants and p-values by trait and transcript for CADD variant annotation set significant transcripts. HCT HGB MCH MCHC MCV RBCC RDW p- p- p- p- p- p- p- Transcript variant value variant value variant value variant value variant value variant value variant value ENST00000605886 rs115518359 7.0E- rs115518359 4.9E- rs115920478 6.9E- rs115518359 1.5E- rs59938816 9.7E- rs115920478 3.3E rs115518359 4.6E- 01 01 02 01 02 -01 02 ENST00000397077 rs139071992 3.1E- rs139071992 2.4E- rs534601167 4.1E- rs35855296 1.4E- rs145951267 8.2E- rs115367728 3.4E rs191008721 6.0E- 04 03 03 02 03 -04 03 ENST00000263650 rs142437471 2.3E- rs142437471 2.3E- rs6578323 8.7E- rs77195727 1.9E- rs73413562 5.5E- rs73413562 7.2E rs6578323 3.1E- 04 03 04 02 05 -03 02 ENST00000396952 rs149616162 7.1E- rs17223816 7.4E- rs142765915 7.2E- rs142765915 7.0E- rs142765915 1.4E- rs142765915 2.8E rs142765915 9.4E- 02 02 06 06 12 -04 04 ENST00000356069 rs75407816 7.5E- rs73403015 2.6E- rs61742567 1.2E- rs61742567 1.9E- rs61742567 1.6E- rs528043612 1.8E rs61742567 6.6E- 07 04 09 33 34 -02 07 ENST00000322101 rs143546045 1.5E- rs149126468 2.2E- rs143546045 1.4E- rs143546045 3.7E- rs143546045 5.6E- rs61742567 7.1E rs143546045 1.2E- 07 05 05 04 12 -09 04 ENST00000380390 rs146130616 4.3E- rs143546045 1.1E- rs146130616 8.0E- rs140983779 3.6E- rs140983779 2.2E- rs115882083 2.8E rs146130616 1.6E- 09 04 07 06 15 -01 06 ENST00000380373 rs140983779 8.8E- rs140983779 8.2E- rs140983779 8.5E- rs80228945 2.0E- rs80228945 5.0E- rs141671097 5.6E rs140983779 2.2E- 09 05 07 21 21 -03 06

222 ENST00000380371 rs148063081 2.0E- rs148063081 5.1E- rs80228945 1.9E- rs115909008 2.0E- rs115909008 9.6E- rs80228945 2.5E rs80228945 7.3E- 02 02 06 06 16 -06 03

ENST00000366371 rs114819139 5.2E- rs114819139 1.4E- rs114819139 1.5E- rs114819139 9.6E- rs114819139 1.9E- rs114819139 1.3E rs114819139 3.9E- 02 03 15 05 13 -06 05 ENST00000335295 rs33930165 5.4E- rs113082294 1.6E- rs33930165 3.8E- rs33930165 2.4E- rs33930165 5.3E- rs33930165 3.4E rs33930165 1.5E- 05 01 16 56 53 -12 09 ENST00000380315 rs33930165 5.4E- rs33930165 4.2E- rs33930165 3.8E- rs33930165 2.4E- rs33930165 5.3E- rs33930165 3.4E rs33930165 1.5E- 05 01 16 56 53 -12 09 ENST00000560744 rs11219958 8.2E- rs143258223 1.0E- rs112487047 4.3E- rs112487047 8.1E- rs112487047 3.6E- rs11219958 5.7E rs143258223 5.1E- 02 01 02 02 02 -03 02 ENST00000557036 rs562329741 7.3E- rs562329741 2.1E- rs148993089 1.7E- rs536626080 2.5E- rs114250512 8.9E- rs148993089 2.7E rs536626080 3.2E- 03 02 09 02 07 -03 02 ENST00000262316 rs147906988 3.8E- rs147906988 1.9E- rs114597905 1.3E- rs114597905 2.5E- rs114597905 1.2E- rs114597905 2.3E rs114597905 8.0E- 03 03 11 04 09 -05 02 ENST00000428730 rs147906988 3.8E- rs147906988 1.9E- rs114597905 1.3E- rs114597905 2.5E- rs114597905 1.2E- rs114597905 2.3E rs114597905 8.0E- 03 03 11 04 09 -05 02 ENST00000450643 rs147906988 3.8E- rs147906988 1.9E- rs114597905 1.3E- rs114597905 2.5E- rs114597905 1.2E- rs114597905 2.3E rs114597905 8.0E- 03 03 11 04 09 -05 02 ENST00000199708 rs144961211 8.2E- rs76613236 3.0E- rs76613236 9.2E- rs76613236 3.1E- rs76613236 5.9E- rs76613236 1.6E rs76613236 7.5E- 02 05 43 13 37 -21 13 ENST00000293872 rs61743947 3.7E- rs61743947 1.2E- rs61743947 2.8E- rs61743947 3.0E- rs61743947 9.8E- rs144911375 1.4E rs61743947 2.7E- 02 04 12 09 11 -01 06

ENST00000429378 rs61743947 3.7E- rs61743947 1.2E- rs61743947 2.8E- rs61743947 3.0E- rs61743947 9.8E- rs144911375 1.4E rs61743947 2.7E-

222

218 02 04 12 09 11 -01 06 ENST00000442701 rs61743947 3.7E- rs61743947 1.2E- rs61743947 2.8E- rs61743947 3.0E- rs61743947 9.8E- rs144911375 1.4E rs61743947 2.7E- 02 04 12 09 11 -01 06 ENST00000301678 rs78001048 2.7E- rs144091859 3.4E- rs144091859 1.0E- rs144091859 3.5E- rs144091859 2.5E- rs144091859 5.3E rs144091859 5.3E- 01 05 41 13 35 -22 13 ENST00000399932 rs78001048 2.7E- rs144091859 3.4E- rs144091859 1.0E- rs144091859 3.5E- rs144091859 2.5E- rs144091859 5.3E rs144091859 5.3E- 01 05 41 13 35 -22 13 ENST00000168869 rs145112941 1.0E- rs149201684 5.8E- rs149201684 1.6E- rs149201684 3.8E- rs149201684 2.3E- rs149201684 2.4E rs149201684 7.3E- 01 03 09 04 09 -04 02 ENST00000316163 rs77654034 2.2E- rs149201684 5.8E- rs149201684 1.6E- rs149201684 3.8E- rs149201684 2.3E- rs149201684 2.4E rs200125295 6.3E- 02 03 09 04 09 -04 03 ENST00000359740 rs77654034 2.2E- rs149201684 5.8E- rs149201684 1.6E- rs149201684 3.8E- rs149201684 2.3E- rs149201684 2.4E rs200125295 6.3E- 02 03 09 04 09 -04 03 ENST00000397770 rs77654034 2.2E- rs149201684 5.8E- rs149201684 1.6E- rs149201684 3.8E- rs149201684 2.3E- rs149201684 2.4E rs200125295 6.3E- 02 03 09 04 09 -04 03 ENST00000219406 rs116969376 3.1E- rs45512400 1.0E- rs45512400 2.7E- rs199589252 1.3E- rs45512400 1.2E- rs45512400 5.1E rs45512400 1.3E- 02 04 24 11 21 -10 08 ENST00000404312 rs116969376 3.1E- rs45512400 1.0E- rs45512400 2.7E- rs199589252 1.3E- rs45512400 1.2E- rs45512400 5.1E rs45512400 1.3E- 02 04 24 11 21 -10 08 ENST00000435833 rs116969376 3.1E- rs45512400 1.0E- rs45512400 2.7E- rs199589252 1.3E- rs45512400 1.2E- rs45512400 5.1E rs45512400 1.3E- 02 04 24 11 21 -10 08 ENST00000250930 rs201414656 3.8E- rs144557907 1.3E- rs148970260 1.3E- rs144557907 2.4E- rs148970260 2.3E- rs148970260 1.2E rs144557907 1.0E-

223 02 04 13 07 12 -09 04

ENST00000219611 rs141298471 4.7E- rs74003979 3.4E- rs199859872 6.6E- rs74003979 2.5E- rs199859872 1.8E- rs199859872 4.8E rs199859872 2.5E- 02 03 20 09 19 -14 05 ENST00000026218 rs11864607 3.4E- rs56293456 2.5E- rs56293456 3.3E- rs56293456 6.7E- rs56293456 5.5E- rs56293456 1.8E rs56293456 2.2E- 02 02 05 07 03 -02 01 ENST00000293874 rs11864607 3.4E- rs56293456 2.5E- rs56293456 3.3E- rs56293456 6.7E- rs56293456 5.5E- rs56293456 1.8E rs56293456 2.2E- 02 02 05 07 03 -02 01 ENST00000424439 rs200060251 5.4E- rs200060251 1.5E- rs200060251 2.5E- rs200060251 4.0E- rs200060251 3.3E- rs200060251 6.3E rs200060251 4.7E- 02 01 20 13 13 -12 05 ENST00000540585 rs200060251 5.4E- rs200060251 1.5E- rs200060251 2.5E- rs200060251 4.0E- rs200060251 3.3E- rs200060251 6.3E rs200060251 4.7E- 02 01 20 13 13 -12 05 ENST00000293879 rs114236636 3.9E- rs138126179 1.4E- rs185839646 5.6E- rs138126179 1.9E- rs185839646 1.4E- rs185839646 2.0E rs185839646 2.6E- 02 03 07 08 07 -02 04 ENST00000397488 rs146826887 4.8E- rs528548462 1.5E- rs146826887 4.2E- rs146826887 1.5E- rs146826887 2.2E- rs146826887 1.9E rs574931188 5.6E- 02 02 08 07 05 -05 03 ENST00000325962 rs143362679 1.4E- rs567425297 2.1E- rs143362679 3.8E- rs143362679 4.3E- rs143362679 8.8E- rs143362679 1.2E rs143362679 5.8E- 01 01 09 07 06 -05 03 ENST00000561907 rs146622601 2.3E- rs115518359 4.9E- rs144902773 3.1E- rs115518359 1.5E- rs59938816 9.7E- rs115920478 3.3E rs144902773 6.5E- 01 01 08 01 02 -01 03

Grayed out values represent top variants in transcripts that did not exceed significance for the respective RBC trait.

223

218

Table 57. Top variants and p-values by trait and transcript for deleterious variant annotation set significant transcripts. HCT HGB MCH MCHC MCV RBCC RDW p- p- p- p- p- p- p- Transcript variant value variant value variant value variant value variant value variant value variant value ENST00000356069 rs73403015 8.3E- rs73403015 2.6E- rs61742567 1.2E- rs61742567 1.9E- rs61742567 1.6E- rs61742567 7.1E- rs61742567 6.6E- 07 04 09 33 34 09 07 ENST00000380382 rs73403017 7.3E- rs73403015 2.6E- rs61742567 1.2E- rs61742567 1.9E- rs61742567 1.6E- rs61742567 7.1E- rs61742567 6.6E- 07 04 09 33 34 09 07 ENST00000380371 rs80228945 1.3E- rs80228945 2.3E- rs80228945 1.9E- rs80228945 2.0E- rs80228945 5.0E- rs80228945 2.5E- rs80228945 7.3E- 01 01 06 21 21 06 03 ENST00000317560 rs191890467 3.9E- rs191890467 2.2E- rs187632148 1.6E- rs18763214 2.1E- rs187632148 3.1E- rs187632148 2.3E- rs141677892 1.8E- 03 02 02 8 01 02 06 01 ENST00000390595 rs566505899 5.6E- rs74091631 3.6E- rs566505899 2.3E- rs56650589 4.8E- rs566505899 6.0E- rs566505899 7.0E- rs192340916 1.3E- 01 01 06 9 02 05 02 01 ENST00000199708 rs144961211 8.2E- rs76613236 3.0E- rs76613236 9.2E- rs76613236 3.1E- rs76613236 5.9E- rs76613236 1.6E- rs76613236 7.5E- 02 05 43 13 37 21 13 ENST00000293872 rs61743947 3.7E- rs61743947 1.2E- rs61743947 2.8E- rs61743947 3.0E- rs61743947 9.8E- rs144911375 1.4E- rs61743947 2.7E- 02 04 12 09 11 01 06 ENST00000337351 rs61743947 3.7E- rs61743947 1.2E- rs61743947 2.8E- rs61743947 3.0E- rs61743947 9.8E- rs144911375 1.4E- rs61743947 2.7E- 224 02 04 12 09 11 01 06

ENST00000397780 rs61743947 3.7E- rs61743947 1.2E- rs61743947 2.8E- rs61743947 3.0E- rs61743947 9.8E- rs144911375 1.4E- rs61743947 2.7E- 02 04 12 09 11 01 06 ENST00000397783 rs61743947 3.7E- rs61743947 1.2E- rs61743947 2.8E- rs61743947 3.0E- rs61743947 9.8E- rs144911375 1.4E- rs61743947 2.7E- 02 04 12 09 11 01 06 ENST00000429378 rs61743947 3.7E- rs61743947 1.2E- rs61743947 2.8E- rs61743947 3.0E- rs61743947 9.8E- rs144911375 1.4E- rs61743947 2.7E- 02 04 12 09 11 01 06 ENST00000301678 rs78001048 2.7E- rs78001048 3.4E- rs144091859 1.0E- rs14409185 3.5E- rs144091859 2.5E- rs144091859 5.3E- rs144091859 5.3E- 01 05 41 9 13 35 22 13 ENST00000399932 rs78001048 2.7E- rs144091859 3.4E- rs144091859 1.0E- rs14409185 3.5E- rs144091859 2.5E- rs144091859 5.3E- rs144091859 5.3E- 01 05 41 9 13 35 22 13 ENST00000316163 rs77654034 2.2E- rs149201684 5.8E- rs149201684 1.6E- rs14920168 3.8E- rs149201684 2.3E- rs149201684 2.4E- rs200125295 6.3E- 02 03 09 4 04 09 04 03 ENST00000359740 rs77654034 2.2E- rs149201684 5.8E- rs149201684 1.6E- rs14920168 3.8E- rs149201684 2.3E- rs149201684 2.4E- rs200125295 6.3E- 02 03 09 4 04 09 04 03 ENST00000397770 rs77654034 2.2E- rs149201684 5.8E- rs149201684 1.6E- rs14920168 3.8E- rs149201684 2.3E- rs149201684 2.4E- rs200125295 6.3E- 02 03 09 4 04 09 04 03 ENST00000168869 rs149201684 3.1E- rs149201684 5.8E- rs149201684 1.6E- rs14920168 3.8E- rs149201684 2.3E- rs149201684 2.4E- rs149201684 7.3E- 01 03 09 4 04 09 04 02 ENST00000219406 rs116969376 3.1E- rs116969376 8.3E- rs199589252 4.7E- rs19958925 1.3E- rs199589252 2.4E- rs199589252 2.3E- rs199589252 1.2E- 02 04 16 2 11 10 09 04 ENST00000404312 rs116969376 3.1E- rs116969376 8.3E- rs199589252 4.7E- rs19958925 1.3E- rs199589252 2.4E- rs199589252 2.3E- rs199589252 1.2E-

224 221 02 04 16 2 11 10 09 04 ENST00000435833 rs116969376 3.1E- rs116969376 8.3E- rs199589252 4.7E- rs19958925 1.3E- rs199589252 2.4E- rs199589252 2.3E- rs199589252 1.2E- 02 04 16 2 11 10 09 04 ENST00000424439 rs200060251 5.4E- rs200060251 1.5E- rs200060251 2.5E- rs20006025 4.0E- rs200060251 3.3E- rs200060251 6.3E- rs200060251 4.7E- 02 01 20 1 13 13 12 05 ENST00000540585 rs200060251 5.4E- rs200060251 1.5E- rs200060251 2.5E- rs20006025 4.0E- rs200060251 3.3E- rs200060251 6.3E- rs200060251 4.7E- 02 01 20 1 13 13 12 05 ENST00000356069 rs73403015 8.3E- rs73403015 2.6E- rs61742567 1.2E- rs61742567 1.9E- rs61742567 1.6E- rs61742567 7.1E- rs61742567 6.6E- 07 04 09 33 34 09 07 Grayed out values represent top variants in transcripts that did not exceed significance for the respective RBC trait.

225

225

218 Table 58. Allele frequencies for top variants within CADD annotation set significant transcripts. Coded allele frequency Chr:Position rsid WHI AfAm HCHS/SOL 1000G 1000G 1000G EAS EUR SAS 1:156606605 rs115920478 1.16E-02 1.86E-03 0 0 0 1:156611030 rs115518359 2.04E-02 1.85E-03 0 0 0 1:156611125 rs59938816 6.08E-03 7.64E-04 0 0 0 3:10371218 rs191008721 1.82E-05 3.80E-03 0 0 0 3:10428129 rs35855296 9.34E-04 1.75E-03 0 0.003 0.0112 3:10448852 rs534601167 2.80E-04 3.37E-03 0 0.003 0 3:10456310 rs145951267 3.30E-03 3.25E-04 0 0 0 3:10525459 rs139071992 1.03E-02 1.63E-03 0 0 0 3:10528178 rs115367728 1.11E-02 1.84E-03 0 0 0 11:3115015 rs73413562 1.24E-02 1.66E-03 0 0.001 0

226 11:3128508 rs142437471 3.07E-03 9.51E-04 0 0 0 11:3128539 rs77195727 1.98E-02 5.25E-03 0 0 0

11:3143609 rs6578323 1.98E-02 3.14E-03 0 0 0 11:4665190 rs17223816 1.38E-03 3.13E-03 0 0.007 0 11:4674652 rs149616162 1.87E-03 5.85E-03 0 0.0129 0 11:4675880 rs142765915 1.79E-02 2.28E-03 0.001 0 0 11:4799877 rs528043612 5.45E-03 1.23E-03 0 0 0 11:4824752 rs61742567 8.66E-03 1.61E-03 0 0 0 11:4825022 rs73403015 1.36E-02 5.55E-03 0.001 0 0 11:4825233 rs75407816 1.36E-02 5.38E-03 0 0 0 11:4840505 rs149126468 1.33E-02 6.23E-03 0 0.001 0 11:4869718 rs115882083 8.81E-03 1.14E-03 0 0 0 11:4870064 rs143546045 1.30E-02 5.76E-03 0 0 0 11:4958925 rs146130616 7.51E-03 6.44E-03 0 0.001 0 11:4967409 rs140983779 7.36E-03 5.36E-03 0 0 0

226 224 11:4968058 rs141671097 6.34E-03 1.75E-03 0 0.003 0 11:4976897 rs80228945 2.09E-02 3.74E-03 0 0 0 11:4976935 rs148063081 7.50E-03 4.73E-03 0 0 0 11:4994513 rs115909008 7.50E-03 5.35E-03 0 0 0 11:5131056 rs114819139 6.94E-03 9.74E-04 0 0 0 11:5246870 rs113082294 9.88E-04 2.31E-03 0 0.007 0 11:5248233 rs33930165 1.07E-02 1.95E-03 0 0 0 11:125058682 rs11219958 2.20E-02 3.94E-03 0 0.001 0.002 11:125059476 rs143258223 1.83E-03 4.31E-03 0 0.008 0 11:125060810 rs112487047 2.74E-03 2.97E-04 0 0 0 14:92075078 rs148993089 8.08E-03 1.18E-03 0 0 0 16:113648 rs114597905 1.74E-02 2.54E-03 0.001 0 0 16:114773 rs147906988 1.17E-02 2.46E-03 0 0 0 16:230724 rs76613236 1.59E-02 3.99E-03 0 0 0

227 16:231021 rs144961211 3.21E-03 6.36E-03 0 0.007 0

16:240000 rs61743947 5.49E-03 6.59E-04 0 0 0 16:240003 rs144911375 7.09E-03 9.32E-04 0 0 0 16:309998 rs78001048 4.48E-04 9.24E-03 0.126 0 0.002 16:314962 rs144091859 1.56E-02 3.90E-03 0 0 0 16:319299 rs200125295 3.03E-03 2.97E-04 0 0 0 16:320629 rs77654034 4.48E-04 1.19E-03 0 0.008 0 16:324075 rs149201684 5.75E-03 9.32E-04 0 0 0 16:324252 rs145112941 1.64E-03 4.66E-04 0 0 0 16:336731 rs116969376 6.40E-03 4.45E-03 0 0.0089 0 16:336898 rs199589252 1.90E-04 4.79E-03 0 0.001 0 16:336909 rs45512400 1.55E-02 4.77E-03 0 0 0 16:424251 rs148970260 5.61E-03 7.80E-04 0 0 0 16:425238 rs144557907 3.36E-03 6.78E-04 0 0 0 16:425378 rs201414656 4.48E-04 2.29E-03 0 0.002 0

227 224 16:597150 rs74003979 1.04E-02 2.13E-03 0 0 0 16:601533 rs141298471 8.28E-03 1.85E-03 0.001 0 0 16:602364 rs199859872 1.25E-02 2.72E-03 0 0 0 16:618214 rs200060251 0 3.18E-03 0 0 0 16:624108 rs11864607 9.78E-03 2.16E-03 0 0 0 16:624149 rs56293456 1.97E-02 5.51E-03 0.0079 0.001 0.0266 16:706763 rs114236636 1.60E-02 2.55E-03 0 0 0 16:709086 rs138126179 1.84E-02 3.83E-03 0 0 0 16:712043 rs185839646 8.00E-03 1.08E-03 0 0 0 16:1381905 rs528548462 4.19E-03 5.56E-04 0 0 0 16:1392632 rs146826887 1.52E-04 4.31E-03 0 0 0 16:1397810 rs574931188 1.22E-04 2.16E-03 0 0 0 16:1906980 rs567425297 4.42E-03 5.59E-04 0 0 0 16:1910418 rs143362679 1.49E-04 4.07E-03 0 0.003 0

228 16:2024200 rs144902773 3.27E-05 3.02E-03 0 0 0

16:2024263 rs146622601 3.26E-05 3.02E-03 0 0 0 Chr:Position=chromosome and position in hg19; WHI AfAm = MEGA-genotyped WHI African Americans with RBC trait data; HCHSSOL = MEGA-genotyped HCHSSOL Hispanics/Latinos with RBC trait data; 1000G EAS = 1000 Genomes phase 3 East Asian super population; 1000G EUR = 1000 Genomes phase 3 European super population; 1000G SAS = 1000 Genomes phase 3 South Asian super population.

228 224 Table 59. Allele frequencies for top variants within deleterious annotation set significant transcripts. Coded allele frequency Chr:Position rsid WHI HCHS/SOL 1000G 1000G 1000G AfAm EAS EUR SAS 11:4824752 rs61742567 8.66E-03 1.61E-03 0 0 0 11:4976897 rs80228945 2.09E-02 3.74E-03 0 0 0 12:42853128 rs187632148 3.29E-03 5.51E-04 0.001 0 0 14:106471448 rs566505899 6.34E-03 4.54E-03 0.001 0.001 0.0031 16:230724 rs76613236 1.59E-02 3.99E-03 0 0 0 16:240000 rs61743947 5.49E-03 6.59E-04 0 0 0 16:314962 rs144091859 1.56E-02 3.90E-03 0 0 0 16:324075 rs149201684 5.75E-03 9.32E-04 0 0 0 16:336898 rs199589252 1.90E-04 4.79E-03 0 0.001 0 16:618214 rs200060251 0 3.18E-03 0 0 0

229 Chr:position= chromosome and position in hg19; WHI AfAm = MEGA-genotyped WHI African Americans with RBC trait data; HCHSSOL = MEGA-genotyped HCHSSOL Hispanics/Latinos with RBC trait data; 1000G EAS =

1000 Genomes phase 3 East Asian super population; 1000G EUR = 1000 Genomes phase 3 European super population; 1000G SAS = 1000 Genomes phase 3 South Asian super population.

229 224 Table 60. Association network of all genes significant for CADD or deleterious annotation sets. Gene Associated genes OR51E1 CATSPERB PIGQ HBB HBQ1 HBB, ITFG3, LUC7L FAM234A/ ITFG3 HBQ1, PDIA2, RHBDF1, RGS11, TMEM8A, LUC7L HBB HBQ1, RHBDF1 RHBDF1 ITFG3, LUC7L RGS11, PDIA2, HBB, TMEM8A RGS11 ITFG3, RHBDF1, PDIA2, LUC7L, TMEM8A TMEM8A ITFG3, RHBDF1, RGS11, LUC7L LUC7L ITFG3, RHBDF1, RGS11, PDIA2, HBQ1, TMEM8A, OR51S1 OR51A2, OR51A4, OR52R1 OR51A4 OR51A2, OR52R1, OR51S1 OR51A2 OR51A4, OR51S1

230 OR52R1 OR51A2, OR51A4, OR51S1 ATP2B2 PDIA2 BAIAP3 PDIA2 WDR90 PDIA2, CEP152, RAI1 PDIA2 RHBDF1, LUC7L, ITFG3, RGS11, WDR90, BAIAP3, ATP2B2 CATSPERB OR51E1

MEIOB

CAPN15

IGHV1-3

MMP26

NHLRC4

OR52J1P

OSBPL5

PPHLN1

RP11-687M24

230224

Table 61. Gene ontology enrichment for genes significantly associated with one or more RBC traits.

Cellular Term observed background FDR matching proteins in component description count count your network (IDs) enrichment term matching proteins in your network (labels) GO:0005833 hemoglobin 2 12 0.0331 ENSP00000199708, HBB,HBQ1 complex ENSP00000333994

Molecular function enrichment Term observed background matching proteins in term description count count FDR your network (IDs) matching proteins in your network (labels) GO:0004984 olfactory 6 385 0.0072 ENSP00000322754, OR51A2,OR51A4,OR51E1,OR51S1,OR52R1 receptor ENSP00000348368,

2

31 activity ENSP00000369729,

ENSP00000369731, ENSP00000380153, ENSP00000380155 GO:0004930 G protein- 7 824 0.0273 ENSP00000293897, OR51A2,OR51A4,OR51E1,OR51S1,OR52R1 coupled ENSP00000322754, receptor ENSP00000348368, activity ENSP00000369729, ENSP00000369731, ENSP00000380153, ENSP00000380155 GO:0005344 oxygen 2 14 0.0273 ENSP00000199708, HBB,HBQ1 carrier ENSP00000333994 activity

231 224

Biological process enrichment Term observed background matching proteins in term description count count FDR your network (IDs) matching proteins in your network (labels) GO:0007186 G protein- 9 1247 0.0305 ENSP00000293897, BAIAP3,OR51A2,OR51A4,OR51E1,OR51S1, coupled ENSP00000322754, OR52R1,RGS11 receptor ENSP00000324510, signaling ENSP00000348368, pathway ENSP00000369729, ENSP00000369731, ENSP00000380153, ENSP00000380155, ENSP00000380876 GO:0050911 detection of 6 385 0.0305 ENSP00000322754, OR51A2,OR51A4,OR51E1,OR51S1,OR52R1 chemical ENSP00000348368, stimulus ENSP00000369729, involved in ENSP00000369731,

2 sensory ENSP00000380153,

32 perception of ENSP00000380155

smell GO:0015671 oxygen 2 15 0.0329 ENSP00000199708, HBB,HBQ1 transport ENSP00000333994 GO:0007600 sensory 7 901 0.0456 ENSP00000322754, ATP2B2,OR51A2,OR51A4,OR51E1,OR51S1,OR52R1 perception ENSP00000348368, ENSP00000353414, ENSP00000369729, ENSP00000369731, ENSP00000380153, ENSP00000380155

232228

Table 62. P-values that increased after conditioning on a single variant for transcripts significantly associated with one or more RBC traits in the deleterious annotation set. Conditioned Gene Transcript variant MCH MCHC MCV RBCC RDW PPHLN1 ENST00000317560 12:42853128 ------3.2E-07 -- PPHLN1 ENST00000317560 12:42792671 ------3.2E-07 -- LUC7L ENST00000429378 16:230724 -- -- 2.0E-07 -- -- LUC7L ENST00000429378 16:314962 -- -- 2.0E-07 -- -- LUC7L ENST00000429378 16:618214 -- -- 2.0E-07 -- -- RGS11 ENST00000316163 16:230724 4.0E-07 -- N/A -- -- RGS11 ENST00000316163 16:314962 4.0E-07 -- N/A -- -- RGS11 ENST00000316163 16:618214 4.0E-07 -- N/A -- -- RGS11 ENST00000316163 16:240000 4.0E-07 -- N/A -- -- RGS11 ENST00000359740 16:230724 -- -- 1.6E-07 -- --

2 RGS11 ENST00000359740 16:314962 -- -- 1.6E-07 -- --

33

RGS11 ENST00000359740 16:618214 -- -- 1.6E-07 -- -- RGS11 ENST00000359740 16:240000 -- -- 1.6E-07 -- -- RGS11 ENST00000397770 16:230724 -- -- 2.0E-07 -- -- RGS11 ENST00000397770 16:314962 -- -- 2.0E-07 -- -- RGS11 ENST00000397770 16:618214 -- -- 2.0E-07 -- -- RGS11 ENST00000397770 16:240000 -- -- 2.0E-07 -- -- N/A = p-value unchanged after conditioning for the respective trait. Note: HCT and HGB excluded from conditional analysis because no transcript exceeded genome-wide significance for either trait.

233 224

CONCLUSIONS

7.1. Recapitulations of Specific Aims

Red blood cell traits are highly complex, clinically relevant, and standardly measured, making them ideal for genetic association studies. Molecular characterization and experimentation in known red blood cell loci has already been productive in terms of clinical application, specifically with regard to anemias with known genetic causes143; 146; 147; 645-647.

Although RBC biology is moderately well characterized, heritability estimates and suggest there are additional sources of genomic contribution to these traits31.

Although over 500 loci have been convincingly identified for one or more RBC traits, previous studies faced multiple limitations. First, previous study populations have been primarily of European and East Asian ancestry; such studies are inherently unable to detect African- and

Amerindian-specific variants unless the study populations are extremely large. Additionally, the design of early genotyping arrays made assessment of low-frequency and rare variants difficult, which limited the potential for fine-mapping loci of interest435; 495; 597; 648. Finally, while several published RBC trait GWAS have been well powered to detect genetic associations, heritability estimates suggest that there remain undiscovered genetic loci, motivating additional studies employing non-traditional GWAS methods.

Therefore, the purpose of this doctoral research was to examine RBC traits in a sample more representative of the US population, with the aim of identifying previously unreported loci as well as more completely characterizing known loci. To accomplish this objective, we

234 employed two different methodological approaches to analyze genotyped and imputed genetic variants. First, we assessed whether an anticipated increase in power from studies that leveraged correlation in RBC traits allowed for the detection of new RBC trait loci or improved fine- mapping of known loci using a combined-phenotype approach. Next, we evaluated rare variants which traditional methods are underpowered to detect by using a gene-based method for two different variant annotation sets.

We performed this research using data from participants of the Populating Architecture using Genomics and Epidemiology (PAGE) study consortium, which incorporates phenotype data and imputed genotyped data from participating biobanks and large national cohort studies of common complex diseases. RBC trait data was collected and calculated using hemanalyzers and standardized complete blood count methods in study-specific labs and harmonized within PAGE, and relevant lifestyle and clinical risk factors were collected via interview at the same study visit as recorded blood draws. Genotype data were cleaned using standardized quality control procedures and imputed to 1000 Genomes Phase III data by the PAGE study coordinating center.

7.2. Main Findings

We performed these study objectives in up to 68,000 participants of the PAGE study, although the number of participants varied by trait. In the first manuscript, we reported 39 loci with over 50 independent association signals, with evidence that our findings in conditional analyses were due in part to our ancestrally diverse study population. In the second manuscript, we identified four discovery loci, demonstrating the relevance of allowing for multiple functional-group scenarios to maximize findings in gene-based testing. Both studies showed strong evidence of overlapping genetic effects among RBC traits, with over half of all findings

235 being significant for more than one trait. Our primary findings emphasize the complexity of RBC traits by revealing multiple independent signals at several known loci which cannot be comprehensively evaluated in European populations. This work also strongly supports the expectation, based on previous molecular and population-based studies, that RBC traits share an underlying genetic architecture.

With regard to individual common or low-frequency variants, we found evidence of allelic heterogeneity at 19 previously reported loci, confirming independent associations at multiple loci that have been previously reported in European-ancestry populations. Within the

HBA1/2 region, we found 14 conditionally independent associations, 11 of which were undetectable in European-ancestry populations, and four of which exhibited independence from a known 3.7kb deletion in sensitivity analyses18.

Our gene-based results also found a large number of transcripts within previously reported RBC trait GWAS loci, unsurprisingly given the overlap in study population. Notably, we identified four previously unreported genes across all seven traits. We found moderate overlap between transcript associations for annotations representing only deleterious coding variants versus including regulatory variants, suggesting the benefit of testing multiple types of groups for genome-wide gene-based testing. Of 75 lead variants for one or more RBC traits within 49 significantly associated transcripts, over 75% were monomorphic in the 1000G EAS,

EUR, and SAS populations, exemplifying the benefits of study populations including African- and Amerindian-ancestry participants.

236 7.3. Strengths

The PAGE study provided an appreciable improvement in ancestry representation compared to previous RBC trait genetics studies. The benefits of an inclusive study population involve both improved equity in application of results and improved detection of rare and ancestry-specific variants. The ability to measure and evaluate variants that are present in

African- or Amerindian-ancestry populations but extremely rare or monomorphic in Europeans allowed us to detect multiple conditionally independent variants at the HBA1/2 locus in our first study and previously unreported genes with ancestry-specific top variants in the gene-based study495.

7.4. Limitations

Our study faced limitations common to genetic analyses of complex traits as well as several unique to our analyses. First, several RBC trait GWAS, while not benefitting from the methods employed herein, exceeded our study population by tens of thousands of participants, which provides an increase in statistical power that we could not match, particularly with the gene-based study. Additionally, while our study population improved representation of global ancestries compared with previous work, we could not include proportionally representative numbers of Native Hawaiians, Pacific Islanders, and Native Americans due to limited phenotype data availability, preventing the generalizability of our findings to those groups. We could not evaluate low-frequency insertion/deletion variants in our first study, which precluded determining whether our lead variants are truly independent of known in/dels. Finally, comprehensive characterization of genetic associations requires information on the likely molecular function of potential causal variants, which we were unable to evaluate due to publicly

237 available databases having limited availability of relevant data with regard to both tissue type and non-European ancestry genetic variation429; 521.

7.5. Overall Conclusions

Our study results reinforce the benefits of inclusive study design and considering methods outside of the arena of traditional GWAS. Over 50 previously reported loci met genome-wide significance criteria between the results of both studies, reaffirming the association of these loci with RBC traits and calling for additional characterization where it has not yet been performed. Particularly regarding loci affected by evolutionary selection in malaria-endemic regions with variants known to cause monogenic anemias such as the HBA1/2 locus, our results suggest further work remains to successfully identify functional regulatory variants in this region. Additionally, our identification of multiple genes supports further examination in a molecular context which can elucidate any potential biological mechanisms underlying RBC traits.

Overall, our findings demonstrate that complex traits known to be highly polygenic can continue to be characterized via a suite of methods that account for the complete range of genetic variation, and that diverse study populations are required for a comprehensive study. Given the recent advances in gene-therapy with regard to monogenic blood disorders, all approaches that improve characterization of known loci or can identify new candidate functional genes or variants should be considered. Findings from both of our studies strongly support the continued call for increased representation in publicly available expression and functional databases.

Additional work is merited to characterize our discovery genes and ancestry-specific variants to determine potential function within RBC development and maintenance.

238

APPENDIX: AN INTRODUCTION TO BASIC GENETIC PROCESSES

All of the processes discussed in this section are portrayed with relevant causal relationships in Figure S1. DNA is the beginning molecule for all known coded proteins, making it an obvious source of differences in gene products due to genetic variation. Genetic variation occurs naturally in every individual from conception through death. Mutations occur during meiosis at an average rate of approximately 1x10-8 (approximately 320 per genomic replication), whereas somatic mutations have been suggested to occur more frequently649; 650. For a mutation to become established within a population, it must:

1. occur initially as a germline mutation (in order to be heritable);

2. not preclude reproduction (i.e., the variant cannot cause fatal disease at a prepubescent age); and

3. spread through the population over generations via sexual reproduction.

Evolutionary selective pressure can shorten the timeframe for a particular variant (as can be seen at several mutations at the receptor site in duffy antigen receptor gene, or DARC, which is associated with malarial resistance in heterozygotes, which are only found in individuals of recent African ancestry)149; 651. This becomes significant when the locus affected by selective pressure also affects a physiologic state, either in heterozygotes or (more commonly for deleterious phenotypes) homozygotes62; 65; 122; 125; 126; 548; 652.

Several types of variants affect gene sequences, and these variants may be inherited through the germline or occur spontaneously during meiosis after conception. The primary forms of genetic variation discussed in this proposal include simple-nucleotide polymorphisms (SNPs,

also sometimes referred to in other publications as "single nucleotide polymorphisms") and insertion/deletion of multiple nucleotides (AKA "in/dels"). Other types of variation that can

239 affect proper expression and maintenance of gene products include chromatin structural alterations and epigenetic changes (histone modification or DNAm) that may cause expression changes in transcripts with identical genetic sequences (see below). Here we provide a broad overview of an extremely complex system with literally hundreds of thousands of processes constantly being modulated in each cell in the body.

8.1. Transcription

In order for a segment of genomic sequence to eventually become a protein, it must first be transcribed from the DNA sequence into messenger RNA (mRNA), which contains the nucleotides from which the final protein product can be translated. Note: mitochondrial DNA sequence changes can also effect changes in RBC traits, but that is outside the scope of this work. Perhaps the best-characterized disease-related genetic changes are coding variants (or genetic variants falling in exons), which change the mRNA sequence. Coding changes lead to either the same amino acid sequence (synonymous), a changed amino acid sequence

(nonsynonymous), or an early stop codon (or "premature termination codon", PTC) for the resulting protein product. Synonymous coding changes were long considered "silent mutations," but evidence has recently come to light that such changes can affect translation by changing the post-transcriptional or "secondary" structure of the mRNA653-655. Nonsynonymous changes can cause structural changes in the final protein, affect the binding-site affinity at receptor sites, or prevent localization of the protein to the appropriate cell structure. PTCs lead to the protein product being terminated prematurely during translation, which in turn can cause the product to function poorly or be targeted for degradation in the proteasome. Coding variants primarily affect the gene in which the variant occurs (exceptions to this general rule have been seen in opposite-strand enhancer elements for a different gene located within coding regions)653; 654; 656-

240 658. The vast majority of genetic disorders that were characterized before the GWAS era were identified via candidate-gene studies which identified coding mutations, primarily for monogenic disorders but also more recently for highly polygenic diseases such as cancers320; 321.

8.2. Intronic/intergenic variation

A large number of cellular processes are also affected by non-coding genetic variants involved in regulation due to the disruption of transcription factor binding or other indirect regulatory mechanisms469. Work in HaploReg and the GTex consortium has led to estimates that a large proportion of variability for complex traits in humans is attributable to regulatory mechanisms of gene expression429; 469; 481; 659-661. Noncoding variants in enhancer-, promoter-, intronic-, and intergenic regions can affect transcription-factor binding capacity, DNA methylation (described below), or chromatin accessibility. These types of variation affect the rate and volume of transcription, and can vary by cell type if the gene is differentially expressed by tissue307. For example, constitutively expressed tissue-specific transcription factors (TFs) are frequently involved in the promotion or repression of gene transcription, and disruption of TF binding will likely affect the efficiency of mRNA transcription. Consensus sequence specificity varies by TF, with binding by TFs requiring high specificity being more dramatically affected by changes in relevant consensus sequences660; 662; 663.

Splice-site variation in particular merits discussion as a type of noncoding mutation that can exhibit a large effect on the final gene product. Variation within or proximal to the splice nucleotide can lead to exon skipping; inappropriate inclusion of introns in the mRNA (which may subsequently cause the mRNA to be targeted for NMD); or isoform switching, all of which can be deleterious to the function of the protein664. Even if the inappropriate exclusion or

241 inclusion of nucleotides does not lead to the final mRNA being out of frame, the resulting protein product is likely to suffer from misfolding in the tertiary structure. Of note, PTCs may be introduced by exon skipping or intron inclusion, leading to truncated protein products.

Spliceosome variation can contribute to dramatic changes in the level of transcript available for translation in any cell type. However, RBCs in particular are vulnerable to the effects of splice variants because post-enucleation translation is minimal, meaning defective transcripts cannot be replaced in mature RBCs665.

8.3. Epigenetic modification

While nucleotide variation is relatively straightforward to measure and observe, accessibility of the DNA is required for transcription, and this is tightly controlled by epigenetic processes in every tissue type. Some of the recently characterized epigenetic marks are heritable, and all of them can affect gene expression, primarily by affecting accessibility to coding regions.

Several of these processes are described briefly below, primarily to demonstrate how epigenetic mechanisms may contribute to the concept of “missing heritability”, or the idea that the observed proportion of population-level variability is lower than expected based on heritability estimates339-341; 354.

8.3.1. DNA methylation

DNA methylation refers to the addition of a methyl group to a cytosine preceding a guanine residue (cytosine-phosphodiester-guanine or CpG) by DNA methyltransferase. This chemical bond is heritable (i.e., it is maintained through DNA replication during mitosis), and is most commonly associated with repression of transcription by altering the consensus sequence of a transcription factor binding site or RNA polymerase666. Methyl groups must be available for

242 DNA methylation to occur (obtained via methyl donors such as choline or S-Adenosyl methionine), which can be obtained readily through diet, particularly in countries with foods fortified with iron and folate118; 227; 667; 668. DNA methylation is responsive to environmental exposures such as cigarette smoking, pollution, diet, and even high altitudes, risk factors which are also strongly associated with RBC traits 669-672. CpG sites occur throughout the genome, and mechanisms by which methylation sites are selected have not been fully elucidated673; 674. This modifiable form of structural alteration, which does not change the underlying DNA sequence, can change the effect magnitude of a SNP and therefore play a role in allelic heterogeneity, particularly when the variant changes from an A/T to C/G pair675. DNA methylation can therefore be important for highly regulated gene transcripts—slight reduction in binding capacity for a tissue-specific transcription factor can dramatically affect the amount of protein end- product675-677.

8.3.2. 3D chromatin structure

Unraveled, the human chromosomes lined up would be approximately six feet long. The genetic sequence must be maintained across thousands of mitotic divisions over the life course, and environmental stresses can lead to double-stranded DNA breaks which are repaired at an impressive but imperfect rate. Gene expression necessarily varies by cell type, often dramatically with the exception of a suite of constitutively highly expressed "housekeeping" genes.

Euchromatin refers to open stretches of chromatin, which are more accessible for the cellular machinery required for transcription, whereas heterochromatin, is inaccessible or has limited accessibility for transcription, usually because it is more densely packed678. Similar to genetic variants that ablate or constitutively activate transcription of a particular gene, abnormal changes in accessibility to genes which are required to be expressed or repressed can lead to abnormal

243 cell function, premature apoptosis, or tumorigenesis679; 680. Related to overall 3D structure of the chromatin at a given genomic region are DNAseI-hypersensitivity regions (also referred to as

DHS sites), or short, often cell-type-specific, sequences which are particularly susceptible to cleavage by the enzyme DNAseI681. These sites indicate of euchromatin areas that are highly transcribed678. Given the dramatic increase in chromatin condensation with each stage of RBC differentiation during erythropoiesis, culminating in nuclear- and organellar extrusion directly prior to terminal differentiation, genetic variation that affects chromatin structure is particularly relevant to RBC traits94; 682; 683.

8.3.3. Histone modification

Nucleosomes are octameric complexes comprising four histone dimers (named H2A,

H2B, H3, and H4), which are proteins coded in the cell, around which 145- to 147-nucleotide segments of DNA are wrapped to condense chromatin that is not being transcribed. Maintenance of the DNA surrounding nucleosomes is controlled in part by the molecules binding to the tails of the histones, which—unlike the bodies of the histones—are accessible for modification in the nucleoplasm679; 684; 685. Methylation, acetylation, ubiquitination, and other molecular modifications of amino acids (frequently lysine resides) in the histone tails can affect repression or promotion of gene expression in the proximal DNA sequence684. Modification can be measured by cross-linking proteins bound to DNA from the tissue type of interest and performing genome-wide chromatin-immunoprecipitation and sequencing, or ChIP-seq, assays686-689.

244 8.4. Other genetic processes

The molecular processes occurring between RNA transcription and the folded protein also play a large role in the amount of final protein product. Post-transcriptional regulation, mRNA secondary structure, nonsense-mediated decay, and microRNA binding have all been demonstrated to play important roles in successful as well as aberrant erythropoiesis. The complex interaction of all these mechanisms highlighting the importance of considering how genetic variation may alter gene function in the context of epigenetic marks as well as the direct consequence of a nucleotide change on gene transcription690.

8.4.1. MicroRNA

First identified in the early 1990’s, microRNAs (22- to 24-nucleotide single-stranded

RNAs which are also coded from DNA) have been found to be highly involved in post- transcriptional regulation by binding to a 3’UTR consensus sequence of one or multiple mRNAs691. Correct functioning of this regulatory mechanism helps explain why the amount of transcription does not correlate 1:1 to protein translation. For example, while microRNAs are typically consider to function as translational repressors, recent efforts have demonstrated cases in which they act to stabilize mRNAs, which can then be translated multiple times prior to degradation692-696. Conversely, translation of mRNAs that are constitutively transcribed may depend heavily on post-transcriptional mechanisms such as microRNA binding and targeting to nonsense-mediated decay to avoid excess protein product83; 697; 698. To further complicate this regulatory mechanism, recent work demonstrates that microRNAs coded in one cell can be transferred to another cell via microvesicles, suggesting intercellular translational regulation across tissue types699.

245 Over the past decade, increasing interest has been shown in the function of microRNAs— particularly with regard to exchange via extracellular vesicles—in erythropoiesis and homeostasis623. MicroRNAs are highly differentially expressed by stage of erythryopoiesis. In

HSCs, a group of miRNAs repress erythropoietic genes (and hence continue to replenish the pool of HSCs in bone marrow). In committed but immature erythrocytes, miRNAs repress proliferation, which allows for increased transcription and translation of hemoglobin700. During the last stage of differentiation, one group has demonstrated the downregulation of miR191 in mice as playing an important role in the nuclear extrusion process701. While a reference database of quantitative microRNA sequencing data in different tissue types is not currently publicly available, the role these molecules play in gene regulation in erythropoiesis and during the life of the RBC is undeniable. For this reason, the coding sequences for microRNAs should therefore be under consideration during bioinformatic analysis of primary GWAS results.

8.4.2. Post-translational modification

Post-translational modification and mechanisms involved in protein maintenance can also affect the amount of functional protein in a cell. Faulty protein products—for example, membrane proteins that fail to be targeted to the membrane or shortened proteins that don't successfully fold into the required tertiary or quaternary structure—have been shown to be cleared with haste from the cytoplasm702; 703. Efficient clearance is also noteworthy because normal protein turnover is required for cell health and function, and genetic variants affecting the ability of proteins to be degraded can lead to poor performance of related pathways704-707. For example, “wobble bases” in the mRNA sequence are partially responsible for the secondary structure of the RNA and hence stability of the transcript for translation, but prediction of these bases from the DNA sequence alone is difficult656. Pathology caused by this type of variation

246 will not be detectable in a population-based genotyping analysis without further molecular analysis, but implications for failure to undergo correct post-translational modification can be predicted computationally in cases in which the relevant variant is coding.

247 8.5. Supporting Figure

Figure 16. Molecular processes involved in gene transcription, mRNA translation, and detected amounts of products from those processes (prepared by Hodonsky, 2017)

histone secondary modification heterochromatin structure spliceosome DNA methylation variation secondary & tertiary tRNA structure

DNA mRNA protein nonsense-mediated decay microRNA ubiquitination & proteasome targeting

Boxes outlined in blue indicate processes that can be directly affected by a genetic variant and would therefore be detectable via GWAS. n.b., not all variant-specific associations can be predicted with currently available prediction algorithms or bioinformatics databases.

248 WORKS CITED

1. Al-Kindi, S.G., Refaat, M., Jayyousi, A., Asaad, N., Al Suwaidi, J., and Abi Khalil, C. (2017). Red Cell Distribution Width Is Associated with All-Cause and Cardiovascular Mortality in Patients with Diabetes. Biomed Res Int 2017, 5843702.

2. Hsieh, Y.P., Chang, C.C., Kor, C.T., Yang, Y., Wen, Y.K., and Chiu, P.F. (2016). The Predictive Role of Red Cell Distribution Width in Mortality among Chronic Kidney Disease Patients. PloS one 11, e0162025.

3. Pereira, C.A., Roscani, M.G., Zanati, S.G., and Matsubara, B.B. (2013). Anemia, heart failure and evidence-based clinical management. Arq Bras Cardiol 101, 87-92.

4. Allport, L.E., Parsons, M.W., Butcher, K.S., MacGregor, L., Desmond, P.M., Tress, B.M., and Davis, S.M. (2005). Elevated hematocrit is associated with reduced reperfusion and tissue survival in acute stroke. Neurology 65, 1382-1387.

5. Barlas, R.S., Honney, K., Loke, Y.K., McCall, S.J., Bettencourt-Silva, J.H., Clark, A.B., Bowles, K.M., Metcalf, A.K., Mamas, M.A., Potter, J.F., et al. (2016). Impact of Hemoglobin Levels and Anemia on Mortality in Acute Stroke: Analysis of UK Regional Registry Data, Systematic Review, and Meta-Analysis. J Am Heart Assoc 5.

6. Cherng, Y.G., Lin, C.S., Shih, C.C., Hsu, Y.H., Yeh, C.C., Hu, C.J., Chen, T.L., and Liao, C.C. (2018). Stroke risk and outcomes in patients with chronic kidney disease or end- stage renal disease: Two nationwide studies. PloS one 13, e0191155.

7. Danese, E., Lippi, G., and Montagnana, M. (2015). Red blood cell distribution width and cardiovascular diseases. J Thorac Dis 7, E402-411.

8. Gillum, R.F., and Sempos, C.T. (1996). Hemoglobin, hematocrit, and stroke incidence and mortality in women and men. Stroke 27, 1910.

9. Jia, H., Li, H., Zhang, Y., Li, C., Hu, Y., and Xia, C. (2015). Association between red blood cell distribution width (RDW) and carotid artery atherosclerosis (CAS) in patients with primary ischemic stroke. Arch Gerontol Geriatr 61, 72-75.

10. Mozos, I. (2015). Mechanisms linking red blood cell disorders and cardiovascular diseases. Biomed Res Int 2015, 682054.

11. Ponari, O., Squeri, M., Mombelloni, A., Catamo, A., Tardio, S., and Moretti, G. (1984). Hematocrit and stroke. A case-control study. Ital J Neurol Sci 5, 409-411.

249 12. Toro-Dominguez, D., Carmona-Saez, P., and Alarcon-Riquelme, M.E. (2014). Shared signatures between rheumatoid arthritis, systemic lupus erythematosus and Sjogren's syndrome uncovered through gene expression meta-analysis. Arthritis Res Ther 16, 489.

13. Yunchun, L., Yue, W., Jun, F.Z., Qizhu, S., and Liumei, D. (2016). Clinical Significance of Red Blood Cell Distribution Width and Inflammatory Factors for the Disease Activity in Rheumatoid Arthritis. Clin Lab 62, 2327-2331.

14. Cotsapas, C., Voight, B.F., Rossin, E., Lage, K., Neale, B.M., Wallace, C., Abecasis, G.R., Barrett, J.C., Behrens, T., Cho, J., et al. (2011). Pervasive sharing of genetic effects in autoimmune disease. PLoS Genet 7, e1002254.

15. Visscher, P.M., Brown, M.A., McCarthy, M.I., and Yang, J. (2012). Five years of GWAS discovery. Am J Hum Genet 90, 7-24.

16. Ding, K., de Andrade, M., Manolio, T.A., Crawford, D.C., Rasmussen-Torvik, L.J., Ritchie, M.D., Denny, J.C., Masys, D.R., Jouni, H., Pachecho, J.A., et al. (2013). Genetic variants that confer resistance to malaria are associated with red blood cell traits in African- Americans: an electronic medical record-based genome-wide association study. G3 (Bethesda) 3, 1061-1068.

17. Ganesh, S.K., Zakai, N.A., van Rooij, F.J., Soranzo, N., Smith, A.V., Nalls, M.A., Chen, M.H., Kottgen, A., Glazer, N.L., Dehghan, A., et al. (2009). Multiple loci influence erythrocyte phenotypes in the CHARGE Consortium. Nature genetics 41, 1191-1198.

18. Hodonsky, C.J., Jain, D., Schick, U.M., Morrison, J.V., Brown, L., McHugh, C.P., Schurmann, C., Chen, D.D., Liu, Y.M., Auer, P.L., et al. (2017). Genome-wide association study of red blood cell traits in Hispanics/Latinos: The Hispanic Community Health Study/Study of Latinos. PLoS Genet 13, e1006760.

19. Iotchkova, V., Huang, J., Morris, J.A., Jain, D., Barbieri, C., Walter, K., Min, J.L., Chen, L., Astle, W., Cocca, M., et al. (2016). Discovery and refinement of genetic loci associated with cardiometabolic risk using dense imputation maps. Nature genetics 48, 1303-1312.

20. Kamatani, Y., Matsuda, K., Okada, Y., Kubo, M., Hosono, N., Daigo, Y., Nakamura, Y., and Kamatani, N. (2010). Genome-wide association study of hematological and biochemical traits in a Japanese population. Nature genetics 42, 210-215.

21. Kullo, I.J., Ding, K., Jouni, H., Smith, C.Y., and Chute, C.G. (2010). A genome-wide association study of red blood cell traits using the electronic medical record. PloS one 5.

250 22. Li, J., Glessner, J.T., Zhang, H., Hou, C., Wei, Z., Bradfield, J.P., Mentch, F.D., Guo, Y., Kim, C., Xia, Q., et al. (2013). GWAS of blood cell traits identifies novel associated loci and epistatic interactions in Caucasian and African-American children. Human molecular genetics 22, 1457-1464.

23. Okada, Y., and Kamatani, Y. (2012). Common genetic factors for hematological traits in humans. J Hum Genet 57, 161-169.

24. Pistis, G., Okonkwo, S.U., Traglia, M., Sala, C., Shin, S.Y., Masciullo, C., Buetti, I., Massacane, R., Mangino, M., Thein, S.L., et al. (2013). Genome wide association analysis of a founder population identified TAF3 as a gene for MCHC in humans. PloS one 8, e69206.

25. Soranzo, N., Spector, T.D., Mangino, M., Kuhnel, B., Rendon, A., Teumer, A., Willenborg, C., Wright, B., Chen, L., Li, M., et al. (2009). A genome-wide meta-analysis identifies 22 loci associated with eight hematological parameters in the HaemGen consortium. Nature genetics 41, 1182-1190.

26. van der Harst, P., Zhang, W., Mateo Leach, I., Rendon, A., Verweij, N., Sehmi, J., Paul, D.S., Elling, U., Allayee, H., Li, X., et al. (2012). Seventy-five genetic loci influencing the human red blood cell. Nature 492, 369-375.

27. van Rooij, F.J., Qayyum, R., Smith, A.V., Zhou, Y., Trompet, S., Tanaka, T., Keller, M.F., Chang, L.C., Schmidt, H., Yang, M.L., et al. (2017). Genome-wide Trans-ethnic Meta- analysis Identifies Seven Genetic Loci Influencing Erythrocyte Traits and a Role for RBPMS in Erythropoiesis. Am J Hum Genet 100, 51-63.

28. Chami, N., Chen, M.H., Slater, A.J., Eicher, J.D., Evangelou, E., Tajuddin, S.M., Love- Gregory, L., Kacprowski, T., Schick, U.M., Nomura, A., et al. (2016). Exome Genotyping Identifies Pleiotropic Variants Associated with Red Blood Cell Traits. Am J Hum Genet 99, 8-21.

29. Chen, Z., Tang, H., Qayyum, R., Schick, U.M., Nalls, M.A., Handsaker, R., Li, J., Lu, Y., Yanek, L.R., Keating, B., et al. (2013). Genome-wide association analysis of red blood cell traits in African Americans: the COGENT Network. Human molecular genetics 22, 2529-2538.

30. Polfus, L.M., Khajuria, R.K., Schick, U.M., Pankratz, N., Pazoki, R., Brody, J.A., Chen, M.H., Auer, P.L., Floyd, J.S., Huang, J., et al. (2016). Whole-Exome Sequencing Identifies Loci Associated with Blood Cell Traits and Reveals a Role for Alternative GFI1B Splice Variants in Human Hematopoiesis. Am J Hum Genet 99, 785.

251 31. Chami, N., and Lettre, G. (2014). Lessons and Implications from Genome-Wide Association Studies (GWAS) Findings of Blood Cell Phenotypes. Genes 5, 51-64.

32. Bustamante, C.D., Burchard, E.G., and De la Vega, F.M. (2011). Genomics for the world. Nature 475, 163-165.

33. Bustamante, C.D., and Henn, B.M. (2010). Human origins: Shadows of early migrations. Nature 468, 1044-1045.

34. Baharian, S., Barakatt, M., Gignoux, C.R., Shringarpure, S., Errington, J., Blot, W.J., Bustamante, C.D., Kenny, E.E., Williams, S.M., Aldrich, M.C., et al. (2016). The Great Migration and African-American Genomic Diversity. PLoS Genet 12, e1006059.

35. Basu, A., Tang, H., Zhu, X., Gu, C.C., Hanis, C., Boerwinkle, E., and Risch, N. (2008). Genome-wide distribution of ancestry in Mexican Americans. Human genetics 124, 207- 214.

36. Conomos, M.P., Laurie, C.A., Stilp, A.M., Gogarten, S.M., McHugh, C.P., Nelson, S.C., Sofer, T., Fernandez-Rhodes, L., Justice, A.E., Graff, M., et al. (2016). Genetic Diversity and Association Studies in US Hispanic/Latino Populations: Applications in the Hispanic Community Health Study/Study of Latinos. Am J Hum Genet 98, 165-184.

37. Obasogie, O.K., Headen, I., and Mujahid, M.S. (2017). Race, Law, and Health Disparities: Toward a Critical Race Intervention. Annu Rev Law Soc Sci 13, 313-329.

38. Popejoy, A.B., and Fullerton, S.M. (2016). Genomics is failing on diversity. Nature 538, 161-164.

39. Sirisena, N.D., and Dissanayake, V.H.W. (2017). Focusing attention on ancestral diversity within genomics research: a potential means for promoting equity in the provision of genomics based healthcare services in developing countries. J Community Genet 8, 275- 281.

40. Rotimi, C.N. (2012). Health disparities in the genomic era: the case for diversifying ethnic representation. Genome Med 4, 65.

41. Erves, J.C., Mayo-Gamble, T.L., Malin-Fair, A., Boyer, A., Joosten, Y., Vaughn, Y.C., Sherden, L., Luther, P., Miller, S., and Wilkins, C.H. (2017). Needs, Priorities, and Recommendations for Engaging Underrepresented Populations in Clinical Research: A Community Perspective. J Community Health 42, 472-480.

252 42. Auton, A., Bryc, K., Boyko, A.R., Lohmueller, K.E., Novembre, J., Reynolds, A., Indap, A., Wright, M.H., Degenhardt, J.D., Gutenkunst, R.N., et al. (2009). Global distribution of genomic diversity underscores rich complex history of continental human populations. Genome Res 19, 795-803.

43. Kosmicki, J.A., Churchhouse, C.L., Rivas, M.A., and Neale, B.M. (2016). Discovery of rare variants for complex phenotypes. Human genetics 135, 625-634.

44. Nandakumar, P., Lee, D., Richard, M.A., Tekola-Ayele, F., Tayo, B.O., Ware, E., Sung, Y.J., Salako, B., Ogunniyi, A., Gu, C.C., et al. (2017). Rare coding variants associated with blood pressure variation in 15914 individuals of African ancestry. Journal of Hypertension 35, 1381-1389.

45. de Vries, P.S., Chasman, D.I., Sabater-Lleal, M., Chen, M.H., Huffman, J.E., Steri, M., Tang, W., Teumer, A., Marioni, R.E., Grossmann, V., et al. (2016). A meta-analysis of 120 246 individuals identifies 18 new loci for fibrinogen concentration. Human molecular genetics 25, 358-370.

46. Sabo, A., Mishra, P., Dugan-Perez, S., Voruganti, V.S., Kent, J.W., Jr., Kalra, D., Cole, S.A., Comuzzie, A.G., Muzny, D.M., Gibbs, R.A., et al. (2017). Exome sequencing reveals novel genetic loci influencing obesity-related traits in Hispanic children. Obesity (Silver Spring) 25, 1270-1276.

47. Grarup, N., Moltke, I., Andersen, M.K., Dalby, M., Vitting-Seerup, K., Kern, T., Mahendran, Y., Jorsboe, E., Larsen, C.V.L., Dahl-Petersen, I.K., et al. (2018). Loss-of-function variants in ADCY3 increase risk of obesity and type 2 diabetes. Nature genetics 50, 172- 174.

48. Chen, M.H., Yanek, L.R., Backman, J.D., Eicher, J.D., Huffman, J.E., Ben-Shlomo, Y., Beswick, A.D., Yerges-Armstrong, L.M., Shuldiner, A.R., O'Connell, J.R., et al. (2017). Exome-chip meta-analysis identifies association between variation in ANKRD26 and platelet aggregation. Platelets, 1-10.

49. Feldman, R.P., and Goodrich, J.T. (1999). The Edwin Smith Surgical Papyrus. Childs Nerv Syst 15, 281-284.

50. Jex, H.S. (1951). The Edwin Smith Surgical Papyrus: first milestone in the march of medicine. Merck Rep 60, 20-22.

51. Sykes, P. (2009). The Edwin Smith papyrus (ca. 16th century BC). Ann Plast Surg 62, 3-4.

253 52. Hippocrates, Par©*, A., Harvey, W., Jenner, E., Holmes, O.W., Lister, J., Pasteur, L., Lyell, C., Paget, S., Willis, R., et al. (1910). Scientific papers: physiology, medicine, surgery, geology, with introductions and notes.(New York,: P. F. Collier).

53. Culpeper, N. (1653). The English physitian enlarged : with three hundred, sixty, and nine medicines made of English herbs that were not in any impression until this ... Being an astrologo-physical discourse of the vulgar herbs of this nation: containing a compleat method of physick, whereby man may preserve his body in health, or cure himself, being sick, for three pence charge, with such things only as grow in England, they being most fit for English bodies.(London: printed by Peter Cole in Leaden-Hall, and are to be sold at his shop at the sign of the Printing-press in Cornhil, neer the Royal Exchange).

54. Debru, A., and Von Staden, H. (1991). [Herophilus, or the art of medicine in ancient Alexandria]. Rev Hist Sci Paris 44, 435-445.

55. Jarcho, S. (1970). Galen's six non-naturals: a bibliographic note and translation. Bull Hist Med 44, 372-377.

56. Bryan, L.S., Jr. (1964). Blood-Letting in American Medicine, 1830-1892. Bull Hist Med 38, 516-529.

57. Markham, W.O. (1864). Gulstonian Lectures on the Uses of Blood-Letting in Disease. Br Med J 1, 411-412.

58. Semple, R.H. (1852). On General Blood-letting. Lond J Med 4, 310-319.

59. Gest, H. (2004). The discovery of microorganisms by Robert Hooke and Antoni Van Leeuwenhoek, fellows of the Royal Society. Notes Rec R Soc Lond 58, 187-201.

60. Hooke, R. (1665). Micrographia, or, Some physiological descriptions of minute bodies made by magnifying glasses with observations and inquiries thereupon.(London: Printed by Jo. Martyn and Ja. Allestry ... and are to be sold at their shop ...).

61. Lamson, P.D. (1916). The Processes Taking Place in the Body by Which the Number of Erythrocytes Per Unit Volume of Blood is Increased in Acute Experimental Polycythaemia. Proc Natl Acad Sci U S A 2, 365-369.

62. Taliaferro, W.H., and Huck, J.G. (1923). The Inheritance of Sickle-Cell Anaemia in Man. Genetics 8, 594-598.

254 63. Waldenström, J. (1937). Studien über Porphyrie. boktryck P A Norstedt och söner, 254.

64. Williamson, G.R., and Crawford, R. (1945). Fatal Mediterranean (Cooley's) anemia. New Orleans Med Surg J 98, 280-284.

65. Neel, J.V., and Valentine, W.N. (1947). Further Studies on the Genetics of Thalassemia. Genetics 32, 38-63.

66. Groenveld, H.F., Januzzi, J.L., Damman, K., van Wijngaarden, J., Hillege, H.L., van Veldhuisen, D.J., and van der Meer, P. (2008). Anemia and mortality in heart failure patients a systematic review and meta-analysis. J Am Coll Cardiol 52, 818-827.

67. Guglin, M.E., and Koul, D. (2006). Cardiovascular effects of erythropoietin: anemia and beyond. Cardiol Rev 14, 200-204.

68. Ickx, B.E., Rigolet, M., and Van Der Linden, P.J. (2000). Cardiovascular and metabolic response to acute normovolemic anemia. Effects of anesthesia. Anesthesiology 93, 1011- 1016.

69. Kajimoto, K., Minami, Y., Otsubo, S., Sato, N., and investigators of the Acute Decompensated Heart Failure Syndromes, r. (2017). Association of admission and discharge anemia status with outcomes in patients hospitalized for acute decompensated heart failure: Differences between patients with preserved and reduced ejection fraction. Eur Heart J Acute Cardiovasc Care, 2048872617730039.

70. Kajimoto, K., Sato, N., Takano, T., and investigators of the Acute Decompensated Heart Failure Syndromes, r. (2016). Association of anemia and renal dysfunction with in- hospital mortality among patients hospitalized for acute heart failure syndromes with preserved or reduced ejection fraction. Eur Heart J Acute Cardiovasc Care 5, 89-99.

71. Montero, D., Lundby, C., Ruschitzka, F., and Flammer, A.J. (2017). True Anemia-Red Blood Cell Volume Deficit-in Heart Failure: A Systematic Review. Circ Heart Fail 10.

72. Nuis, R.J., Sinning, J.M., Rodes-Cabau, J., Gotzmann, M., van Garsse, L., Kefer, J., Bosmans, J., Yong, G., Dager, A.E., Revilla-Orodea, A., et al. (2013). Prevalence, factors associated with, and prognostic effects of preoperative anemia on short- and long-term mortality in patients undergoing transcatheter aortic valve implantation. Circ Cardiovasc Interv 6, 625-634.

255 73. Oster, H.S., Benderly, M., Hoffman, M., Cohen, E., Shotan, A., and Mittelman, M. (2013). Mortality in heart failure with worsening anemia: a national study. Isr Med Assoc J 15, 368-372.

74. Powell, D.J., and Achebe, M.O. (2016). Anemia for the Primary Care Physician. Prim Care 43, 527-542.

75. Jacobsen, R.N., Perkins, A.C., and Levesque, J.P. (2015). Macrophages and regulation of erythropoiesis. Curr Opin Hematol 22, 212-219.

76. Kaushansky, K. (2008). Historical review: megakaryopoiesis and thrombopoiesis. Blood 111, 981-986.

77. Nandakumar, S.K., Ulirsch, J.C., and Sankaran, V.G. (2016). Advances in understanding erythropoiesis: evolving perspectives. Br J Haematol 173, 206-218.

78. Ema, H., and Nakauchi, H. (2000). Expansion of hematopoietic stem cells in the developing liver of a mouse embryo. Blood 95, 2284-2288.

79. Perie, L., and Duffy, K.R. (2016). Retracing the in vivo haematopoietic tree using single-cell methods. FEBS Lett 590, 4068-4083.

80. Mead, A.J., and Mullally, A. (2017). Myeloproliferative neoplasm stem cells. Blood 129, 1607-1616.

81. Kleiler, K.R. (1979). Lymphocytic leukemia: a review of recent literature. Am J Med Technol 45, 590-599.

82. Turner, A., and Kjeldsberg, C.R. (1978). Hairy cell leukemia: a review. Medicine (Baltimore) 57, 477-499.

83. Johnson, E.L., Robinson, D.G., and Coller, H.A. (2017). Widespread changes in mRNA stability contribute to quiescence-specific gene expression patterns in a fibroblast model of quiescence. BMC Genomics 18, 123.

84. Huenefeld, F.L. (1840). Der Chemismus in der thierischen Organisation : ein Beitrag zur Physiologie und Heilmittellehre.(Leipzig: F.A. Brockhaus).

85. Engelhart, J.F. (1825). Commentatio de vera materiae sanguini purpureum colorem impertientis natura.(Göttingen: Dieterich).

256 86. Goodman, M., Moore, G.W., and Matsuda, G. (1975). Darwinian evolution in the genealogy of haemoglobin. Nature 253, 603-608.

87. An, X., Schulz, V.P., Mohandas, N., and Gallagher, P.G. (2015). Human and murine erythropoiesis. Curr Opin Hematol 22, 206-211.

88. Alam, M.Z., Devalaraja, S., and Haldar, M. (2017). The Heme Connection: Linking Erythrocytes and Macrophage Biology. Front Immunol 8, 33.

89. de Back, D.Z., Kostova, E.B., van Kraaij, M., van den Berg, T.K., and van Bruggen, R. (2014). Of macrophages and red blood cells; a complex love story. Front Physiol 5, 9.

90. Koulnis, M., Porpiglia, E., Hidalgo, D., and Socolovsky, M. (2014). Erythropoiesis: from molecular pathways to system properties. Adv Exp Med Biol 844, 37-58.

91. Pimentel, H., Parra, M., Gee, S.L., Mohandas, N., Pachter, L., and Conboy, J.G. (2016). A dynamic intron retention program enriched in RNA processing genes regulates gene expression during terminal erythropoiesis. Nucleic Acids Res 44, 838-851.

92. Thiadens, K.A., and von Lindern, M. (2015). Selective mRNA translation in erythropoiesis. Biochem Soc Trans 43, 343-347.

93. Nogueira-Pedro, A., dos Santos, G.G., Oliveira, D.C., Hastreiter, A.A., and Fock, R.A. (2016). Erythropoiesis in vertebrates: from ontogeny to clinical relevance. Front Biosci (Elite Ed) 8, 100-112.

94. Ji, P., Murata-Hori, M., and Lodish, H.F. (2011). Formation of mammalian erythrocytes: chromatin condensation and enucleation. Trends Cell Biol 21, 409-415.

95. Camaschella, C., Pagani, A., Nai, A., and Silvestri, L. (2016). The mutual control of iron and erythropoiesis. Int J Lab Hematol 38 Suppl 1, 20-26.

96. Kautz, L., and Nemeth, E. (2014). Molecular liaisons between erythropoiesis and iron metabolism. Blood 124, 479-482.

97. Papanikolaou, G., and Pantopoulos, K. (2017). Systemic iron homeostasis and erythropoiesis. IUBMB Life 69, 399-413.

98. Hentze, M.W., Muckenthaler, M.U., Galy, B., and Camaschella, C. (2010). Two to tango: regulation of Mammalian iron metabolism. Cell 142, 24-38.

257 99. Siddappa, A.M., Rao, R., Long, J.D., Widness, J.A., and Georgieff, M.K. (2007). The assessment of newborn iron stores at birth: a review of the literature and standards for ferritin concentrations. Neonatology 92, 73-82.

100. Hughes, A.E., Bridgett, S., Meng, W., Li, M., Curcio, C.A., Stambolian, D., and Bradley, D.T. (2016). Sequence and Expression of Complement Factor H Gene Cluster Variants and Their Roles in Age-Related Macular Degeneration Risk. Invest Ophthalmol Vis Sci 57, 2763-2769.

101. Noris, M., Bresin, E., Mele, C., and Remuzzi, G. (1993). Genetic Atypical Hemolytic- Uremic Syndrome. In GeneReviews((R)), M.P. Adam, H.H. Ardinger, R.A. Pagon, S.E. Wallace, L.J.H. Bean, K. Stephens, andA. Amemiya, eds. (Seattle (WA).

102. D'Angelo, G. (2013). Role of hepcidin in the pathophysiology and diagnosis of anemia. Blood Res 48, 10-15.

103. Stavropoulos, K., Imprialos, K.P., Bouloukou, S., Boutari, C., and Doumas, M. (2017). Hematocrit and Stroke: A Forgotten and Neglected Link? Semin Thromb Hemost 43, 591-598.

104. Sandes, A.F., Goncalves, M.V., and Chauffaille, M.L. (2017). Frequency of polycythemia in individuals with normal complete blood cell counts according to the new 2016 WHO classification of myeloid neoplasms. Int J Lab Hematol 39, 528-531.

105. Helman, N., and Rubenstein, L.S. (1975). The effects of age, sex, and smoking on erythrocytes and leukocytes. Am J Clin Pathol 63, 35-44.

106. Hollowell, J.G., van Assendelft, O.W., Gunter, E.W., Lewis, B.G., Najjar, M., Pfeiffer, C., Centers for Disease, C., and Prevention, N.C.f.H.S. (2005). Hematological and iron- related analytes--reference data for persons aged 1 year and over: United States, 1988-94. Vital Health Stat 11, 1-156.

107. Buttarello, M. (2016). Laboratory diagnosis of anemia: are the old and new red cell parameters useful in classification and treatment, how? Int J Lab Hematol 38 Suppl 1, 123-132.

108. Cappellini, M.D., and Motta, I. (2015). Anemia in Clinical Practice-Definition and Classification: Does Hemoglobin Change With Aging? Semin Hematol 52, 261-269.

109. Tumburu, L., and Thein, S.L. (2017). Genetic control of erythropoiesis. Curr Opin Hematol 24, 173-182.

258 110. Weed, R.I., Reed, C.F., and Berg, G. (1963). Is hemoglobin an essential structural component of human erythrocyte membranes? J Clin Invest 42, 581-588.

111. Malka, R., Delgado, F.F., Manalis, S.R., and Higgins, J.M. (2014). In vivo volume and hemoglobin dynamics of human red blood cells. PLoS Comput Biol 10, e1003839.

112. (1982). Hematological and nutritional biochemistry reference data for persons 6 months-74 years of age: United States, 1976-80. Vital Health Stat 11, i-vi, 1-173.

113. Massey, A.C. (1992). Microcytic anemia. Differential diagnosis and management of iron deficiency anemia. Med Clin North Am 76, 549-566.

114. Aslinia, F., Mazza, J.J., and Yale, S.H. (2006). Megaloblastic anemia and other causes of macrocytosis. Clin Med Res 4, 236-241.

115. Yokoyama, A., Yokoyama, T., Brooks, P.J., Mizukami, T., Matsui, T., Kimura, M., Matsushita, S., Higuchi, S., and Maruyama, K. (2014). Macrocytosis, macrocytic anemia, and genetic polymorphisms of alcohol dehydrogenase-1B and aldehyde dehydrogenase-2 in Japanese alcoholic men. Alcohol Clin Exp Res 38, 1237-1246.

116. Patel, H.H., Patel, H.R., and Higgins, J.M. (2015). Modulation of red blood cell population dynamics is a fundamental homeostatic response to disease. American journal of hematology 90, 422-428.

117. Le, C.H. (2016). The Prevalence of Anemia and Moderate-Severe Anemia in the US Population (NHANES 2003-2012). PloS one 11, e0166635.

118. Fairweather-Tait, S.J., and Teucher, B. (2002). Iron and calcium bioavailability of fortified foods and dietary supplements. Nutr Rev 60, 360-367.

119. Barkley, J.S., Wheeler, K.S., and Pachon, H. (2015). Anaemia prevalence may be reduced among countries that fortify flour. Br J Nutr 114, 265-273.

120. Ganji, V., and Kafai, M.R. (2009). Hemoglobin and hematocrit values are higher and prevalence of anemia is lower in the post-folic acid fortification period than in the pre- folic acid fortification period in US adults. Am J Clin Nutr 89, 363-371.

259 121. Stevens, G.A., Finucane, M.M., De-Regil, L.M., Paciorek, C.J., Flaxman, S.R., Branca, F., Pena-Rosas, J.P., Bhutta, Z.A., Ezzati, M., and Nutrition Impact Model Study, G. (2013). Global, regional, and national trends in haemoglobin concentration and prevalence of total and severe anaemia in children and pregnant and non-pregnant women for 1995- 2011: a systematic analysis of population-representative data. Lancet Glob Health 1, e16- 25.

122. Beutler, E., Dern, R.J., and Flanagan, C.L. (1955). Effect of sickle-cell trait on resistance to malaria. Br Med J 1, 1189-1191.

123. Cela, E., Bellon, J.M., de la Cruz, M., Belendez, C., Berrueco, R., Ruiz, A., Elorza, I., Diaz de Heredia, C., Cervera, A., Valles, G., et al. (2016). National registry of hemoglobinopathies in Spain (REPHem). Pediatr Blood Cancer.

124. Modell, B., Darlison, M., Birgens, H., Cario, H., Faustino, P., Giordano, P.C., Gulbis, B., Hopmeier, P., Lena-Russo, D., Romao, L., et al. (2007). Epidemiology of haemoglobin disorders in Europe: an overview. Scand J Clin Lab Invest 67, 39-69.

125. Bezon, A. (1955). [Possible resistance of subjects with sickle cell trait to endemic falciparum malaria]. Med Trop (Mars) 15, 423-427.

126. De Sanctis, V., Kattamis, C., Canatan, D., Soliman, A.T., Elsedfy, H., Karimi, M., Daar, S., Wali, Y., Yassin, M., Soliman, N., et al. (2017). beta-Thalassemia Distribution in the Old World: an Ancient Disease Seen from a Historical Standpoint. Mediterr J Hematol Infect Dis 9, e2017018.

127. Patrinos, G.P., Kollia, P., and Papadakis, M.N. (2005). Molecular diagnosis of inherited disorders: lessons from hemoglobinopathies. Hum Mutat 26, 399-412.

128. Souza-Silva, F.A., Torres, L.M., Santos-Alves, J.R., Tang, M.L., Sanchez, B.A., Sousa, T.N., Fontes, C.J., Nogueira, P.A., Rocha, R.S., Brito, C.F., et al. (2014). Duffy antigen receptor for chemokine (DARC) polymorphisms and its involvement in acquisition of inhibitory anti-duffy binding protein II (DBPII) immunity. PloS one 9, e93782.

129. Kwiatkowski, D.P. (2005). How malaria has affected the and what human genetics can teach us about malaria. Am J Hum Genet 77, 171-192.

130. Ware, R.E., de Montalembert, M., Tshilolo, L., and Abboud, M.R. (2017). Sickle cell disease. Lancet.

260 131. Piel, F.B., Tewari, S., Brousse, V., Analitis, A., Font, A., Menzel, S., Chakravorty, S., Thein, S.L., Inusa, B., Telfer, P., et al. (2016). Associations between environmental factors and hospital admissions for sickle cell disease. Haematologica.

132. Key, N.S., and Derebail, V.K. (2010). Sickle-cell trait: novel clinical significance. Hematology Am Soc Hematol Educ Program 2010, 418-422.

133. Dueker, N.D., Della-Morte, D., Rundek, T., Sacco, R.L., and Blanton, S.H. (2017). Sickle Cell Trait and Renal Function in Hispanics in the United States: The Northern Manhattan Study. Ethn Dis 27, 11-14.

134. Naik, R.P., Irvin, M.R., Judd, S., Gutierrez, O.M., Zakai, N.A., Derebail, V.K., Peralta, C., Lewis, M.R., Zhi, D., Arnett, D., et al. (2017). Sickle Cell Trait and the Risk of ESRD in Blacks. J Am Soc Nephrol 28, 2180-2187.

135. Caughey, M.C., Loehr, L.R., Key, N.S., Derebail, V.K., Gottesman, R.F., Kshirsagar, A.V., Grove, M.L., and Heiss, G. (2014). Sickle cell trait and incident ischemic stroke in the Atherosclerosis Risk in Communities study. Stroke 45, 2863-2867.

136. Austin, H., Key, N.S., Benson, J.M., Lally, C., Dowling, N.F., Whitsett, C., and Hooper, W.C. (2007). Sickle cell trait and the risk of venous thromboembolism among blacks. Blood 110, 908-912.

137. Mathias, L.A., Fisher, T.C., Zeng, L., Meiselman, H.J., Weinberg, K.I., Hiti, A.L., and Malik, P. (2000). Ineffective erythropoiesis in beta-thalassemia major is due to apoptosis at the polychromatophilic normoblast stage. Exp Hematol 28, 1343-1353.

138. Beutler, E., and West, C. (2005). Hematologic differences between African-Americans and whites: the roles of iron deficiency and alpha-thalassemia on hemoglobin levels and mean corpuscular volume. Blood 106, 740-745.

139. Cao, A., and Galanello, R. (2010). Beta-thalassemia. Genet Med 12, 61-76.

140. de Montalembert, M., Dumont, M.D., Heilbronner, C., Brousse, V., Charrara, O., Pellegrino, B., Piguet, C., Soussan, V., and Noizat-Pirenne, F. (2011). Delayed hemolytic transfusion reaction in children with sickle cell disease. Haematologica 96, 801-807.

141. Talano, J.A., Hillery, C.A., Gottschall, J.L., Baylerian, D.M., and Scott, J.P. (2003). Delayed hemolytic transfusion reaction/hyperhemolysis syndrome in children with sickle cell disease. Pediatrics 111, e661-665.

261 142. Campbell-Lee, S.A., and Kittles, R.A. (2014). Red blood cell alloimmunization in sickle cell disease: listen to your ancestors. Transfus Med Hemother 41, 431-435.

143. Mansilla-Soto, J., Riviere, I., Boulad, F., and Sadelain, M. (2016). Cell and Gene Therapy for the Beta-Thalassemias: Advances and Prospects. Hum Gene Ther 27, 295-304.

144. Zaidman, I., Rowe, J.M., Khalil, A., Ben-Arush, M., and Elhasid, R. (2016). Allogeneic Stem Cell Transplantation in Congenital Hemoglobinopathies Using a Tailored Busulfan- Based Conditioning Regimen: Single-Center Experience. Biol Blood Marrow Transplant 22, 1043-1048.

145. Arnold, S.D., Bhatia, M., Horan, J., and Krishnamurti, L. (2016). Haematopoietic stem cell transplantation for sickle cell disease - current practice and new approaches. Br J Haematol 174, 515-525.

146. Brendel, C., Guda, S., Renella, R., Bauer, D.E., Canver, M.C., Kim, Y.J., Heeney, M.M., Klatt, D., Fogel, J., Milsom, M.D., et al. (2016). Lineage-specific BCL11A knockdown circumvents toxicities and reverses sickle phenotype. J Clin Invest 126, 3868-3878.

147. Cottle, R.N., Lee, C.M., and Bao, G. (2016). Treating hemoglobinopathies using gene- correction approaches: promises and challenges. Human genetics 135, 993-1010.

148. Huang, H.M., Bauer, D.C., Lelliott, P.M., Dixon, M.W.A., Tilley, L., McMorran, B.J., Foote, S.J., and Burgio, G. (2017). Ankyrin-1 Gene Exhibits Allelic Heterogeneity in Conferring Protection Against Malaria. G3 (Bethesda) 7, 3133-3144.

149. Ntumngia, F.B., Thomson-Luque, R., Pires, C.V., and Adams, J.H. (2016). The role of the human Duffy antigen receptor for chemokines in malaria susceptibility: current opinions and future treatment prospects. J Receptor Ligand Channel Res 9, 1-11.

150. Da Costa, L., Galimand, J., Fenneteau, O., and Mohandas, N. (2013). Hereditary spherocytosis, elliptocytosis, and other red cell membrane disorders. Blood Rev 27, 167- 178.

151. Narla, J., and Mohandas, N. (2017). Red cell membrane disorders. Int J Lab Hematol 39 Suppl 1, 47-52.

152. Nakajo, A. (1939). Secondary hypertrophic osteoperiostosis with pernio (Japanese). J Derm Urol 45, 9.

262 153. Mackinney, A.A., Jr., Mortonne, Kosower, N.S., and Schilling, R.F. (1962). Ascertaining genetic carriers of hereditary spherocytosis by statistical analysis of multiple laboratory tests. J Clin Invest 41, 554-567.

154. Rumi, E., Passamonti, F., Pietra, D., Della Porta, M.G., Arcaini, L., Boggi, S., Elena, C., Boveri, E., Pascutto, C., Lazzarino, M., et al. (2006). JAK2 (V617F) as an acquired somatic mutation and a secondary genetic event associated with disease progression in familial myeloproliferative disorders. Cancer 107, 2206-2211.

155. Zakai, N.A., McClure, L.A., Prineas, R., Howard, G., McClellan, W., Holmes, C.E., Newsome, B.B., Warnock, D.G., Audhya, P., and Cushman, M. (2009). Correlates of anemia in American blacks and whites: the REGARDS Renal Ancillary Study. American journal of epidemiology 169, 355-364.

156. Evans, D.M., Frazer, I.H., and Martin, N.G. (1999). Genetic and environmental causes of variation in basal levels of blood cells. Twin research : the official journal of the International Society for Twin Studies 2, 250-257.

157. Whitfield, J.B., and Martin, N.G. (1985). Genetic and environmental influences on the size and number of cells in the blood. Genet Epidemiol 2, 133-144.

158. Liang, R., and Ghaffari, S. (2016). Advances in understanding the mechanisms of erythropoiesis in homeostasis and disease. Br J Haematol 174, 661-673.

159. Ruiz-Arguelles, G.J. (2006). Altitude above sea level as a variable for definition of anemia. Blood 108, 2131; author reply 2131-2132.

160. Clauss, K., Popp, A.P., Schulze, L., Hettich, J., Reisser, M., Escoter Torres, L., Uhlenhaut, N.H., and Gebhardt, J.C.M. (2017). DNA residence time is a regulatory factor of transcription repression. Nucleic Acids Res 45, 11121-11130.

161. Woo, S.K., Nahm, O., and Kwon, H.M. (2000). How salt regulates genes: function of a Rel- like transcription factor TonEBP. Biochem Biophys Res Commun 278, 269-271.

162. Huang, W., Liu, H., Wang, T., Zhang, T., Kuang, J., Luo, Y., Chung, S.S., Yuan, L., and Yang, J.Y. (2011). Tonicity-responsive microRNAs contribute to the maximal induction of osmoregulatory transcription factor OREBP in response to high-NaCl hypertonicity. Nucleic Acids Res 39, 475-485.

163. Hong, S.Y., Roze, L.V., and Linz, J.E. (2013). Oxidative stress-related transcription factors in the regulation of secondary metabolism. Toxins (Basel) 5, 683-702.

263 164. Wright, F.L., Gamboni, F., Moore, E.E., Nydam, T.L., Mitra, S., Silliman, C.C., and Banerjee, A. (2014). Hyperosmolarity invokes distinct anti-inflammatory mechanisms in pulmonary epithelial cells: evidence from signaling and transcription layers. PloS one 9, e114129.

165. Samanta, S., Raghunathan, D., and Mukherjee, S. (2016). Effect of temperature on the structure and hydration layer of TATA-box DNA: A molecular dynamics simulation study. J Mol Graph Model 66, 9-19.

166. Paredes, U., Radersma, R., Cannell, N., While, G.M., and Uller, T. (2016). Low Incubation Temperature Induces DNA Hypomethylation in Lizard Brains. J Exp Zool A Ecol Genet Physiol 325, 390-395.

167. Fang, W., Chen, J., Rossi, M., Feng, Y., Li, X.Z., and Michaelides, A. (2016). Inverse Temperature Dependence of Nuclear Quantum Effects in DNA Base Pairs. J Phys Chem Lett 7, 2125-2131.

168. Burgerhout, E., Mommens, M., Johnsen, H., Aunsmo, A., Santi, N., and Andersen, O. (2017). Genetic background and embryonic temperature affect DNA methylation and expression of myogenin and muscle development in Atlantic salmon (Salmo salar). PloS one 12, e0179918.

169. Metzger, D.C.H., and Schulte, P.M. (2017). Persistent and plastic effects of temperature on DNA methylation across the genome of threespine stickleback (Gasterosteus aculeatus). Proc Biol Sci 284.

170. Machha, V.R., Mikek, C.G., Wellman, S., and Lewis, E.A. (2017). Temperature and osmotic stress dependence of the thermodynamics for binding linker histone H1(0), Its carboxyl domain (H1(0)-C) or globular domain (H1(0)-G) to B-DNA. Biochem Biophys Rep 12, 158-165.

171. Inoue, H., and Oka, T. (1980). The effect of inhibitors of ornithine decarboxylase on DNA synthesis in mouse mammary gland in culture. The importance of osmolarity of the medium and of the initial intracellular level of putrescine. J Biol Chem 255, 3308-3312.

172. Cagliero, C., and Jin, D.J. (2013). Dissociation and re-association of RNA polymerase with DNA during osmotic stress response in Escherichia coli. Nucleic Acids Res 41, 315-326.

173. Koch, C.G., Li, L., Sun, Z., Hixson, E.D., Tang, A., Phillips, S.C., Blackstone, E.H., and Henderson, J.M. (2013). Hospital-acquired anemia: prevalence, outcomes, and healthcare implications. J Hosp Med 8, 506-512.

264 174. Lam, A.P., Gundabolu, K., Sridharan, A., Jain, R., Msaouel, P., Chrysofakis, G., Yu, Y., Friedman, E., Price, E., Schrier, S., et al. (2013). Multiplicative interaction between mean corpuscular volume and red cell distribution width in predicting mortality of elderly patients with and without anemia. American journal of hematology 88, E245-249.

175. Migone de Amicis, M., Chivite, D., Corbella, X., Cappellini, M.D., and Formiga, F. (2017). Anemia is a mortality prognostic factor in patients initially hospitalized for acute heart failure. Intern Emerg Med.

176. Waldum, B., Westheim, A.S., Sandvik, L., Flonaes, B., Grundtvig, M., Gullestad, L., Hole, T., and Os, I. (2012). Baseline anemia is not a predictor of all-cause mortality in outpatients with advanced heart failure or severe renal dysfunction. Results from the Norwegian Heart Failure Registry. J Am Coll Cardiol 59, 371-378.

177. Martinez-Vea, A., Marcas, L., Bardaji, A., Romeu, M., Gutierrez, C., Garcia, C., Compte, T., Nogues, R., Peralta, C., and Giralt, M. (2012). Role of oxidative stress in cardiovascular effects of anemia treatment with erythropoietin in predialysis patients with chronic kidney disease. Clin Nephrol 77, 171-181.

178. Solak, Y., Yilmaz, M.I., Saglam, M., Demirbas, S., Verim, S., Unal, H.U., Gaipov, A., Oguz, Y., Kayrak, M., Caglar, K., et al. (2013). Mean corpuscular volume is associated with endothelial dysfunction and predicts composite cardiovascular events in patients with chronic kidney disease. Nephrology (Carlton) 18, 728-735.

179. Gotoh, S., Hata, J., Ninomiya, T., Hirakawa, Y., Nagata, M., Mukai, N., Fukuhara, M., Ikeda, F., Ago, T., Kitazono, T., et al. (2015). Hematocrit and the risk of cardiovascular disease in a Japanese community: The Hisayama Study. Atherosclerosis 242, 199-204.

180. Haddy, T.B., Castro, O.L., and Rana, S.R. (1987). Hemoglobin and MCV values in 4,074 healthy black children and adolescents. J Natl Med Assoc 79, 75-78.

181. Castro, O.L., Haddy, T.B., and Rana, S.R. (1987). Age- and sex-related blood cell values in healthy black Americans. Public Health Rep 102, 232-237.

182. Wildman, R.P., Colvin, A.B., Powell, L.H., Matthews, K.A., Everson-Rose, S.A., Hollenberg, S., Johnston, J.M., and Sutton-Tyrrell, K. (2008). Associations of endogenous sex hormones with the vasculature in menopausal women: the Study of Women's Health Across the Nation (SWAN). Menopause 15, 414-421.

265 183. Hoffmann, J.J., Nabbe, K.C., and van den Broek, N.M. (2015). Effect of age and gender on reference intervals of red blood cell distribution width (RDW) and mean red cell volume (MCV). Clin Chem Lab Med 53, 2015-2019.

184. Lacher, D.A., Barletta, J., and Hughes, J.P. (2012). Biological variation of hematology tests based on the 1999-2002 National Health and Nutrition Examination Survey. Natl Health Stat Report, 1-10.

185. Lippi, G., and Plebani, M. (2014). Red blood cell distribution width (RDW) and human pathology. One size fits all. Clin Chem Lab Med 52, 1247-1249.

186. Milman, N., Bergholt, T., Byg, K.E., Eriksen, L., and Hvas, A.M. (2007). Reference intervals for haematological variables during normal pregnancy and postpartum in 434 healthy Danish women. Eur J Haematol 79, 39-46.

187. Jackson, R.T. (1990). Separate hemoglobin standards for blacks and whites: a critical review of the case for separate and unequal hemoglobin standards. Med Hypotheses 32, 181-189.

188. Zalawadiya, S.K., Veeranna, V., Panaich, S.S., Afonso, L., and Ghali, J.K. (2012). Gender and ethnic differences in red cell distribution width and its association with mortality among low risk healthy United state adults. Am J Cardiol 109, 1664-1670.

189. Caballero, A.E. (2011). Understanding the Hispanic/Latino patient. Am J Med 124, S10-15.

190. Torres, J.B., and Kittles, R.A. (2007). The relationship between "race" and genetics in biomedical research. Curr Hypertens Rep 9, 196-201.

191. Caulfield, T., Fullerton, S.M., Ali-Khan, S.E., Arbour, L., Burchard, E.G., Cooper, R.S., Hardy, B.J., Harry, S., Hyde-Lay, R., Kahn, J., et al. (2009). Race and ancestry in biomedical research: exploring the challenges. Genome Med 1, 8.

192. Scholl, T.O. (2011). Maternal iron status: relation to fetal growth, length of gestation, and iron endowment of the neonate. Nutr Rev 69 Suppl 1, S23-29.

193. DeMaeyer, E., and Adiels-Tegman, M. (1985). The prevalence of anaemia in the world. World Health Stat Q 38, 302-316.

194. Reveiz, L., Gyte, G.M., Cuervo, L.G., and Casasbuenas, A. (2011). Treatments for iron- deficiency anaemia in pregnancy. Cochrane Database Syst Rev, CD003094.

266 195. Eisen, M.E., and Hammond, E.C. (1956). The effect of smoking on packed cell volume, red blood cell counts, haemoglobin and platelet counts. Can Med Assoc J 75, 520-523.

196. Eschwege, E., Papoz, L., Lellouch, J., Claude, J.R., Cubeau, J., Pequignot, G., Richard, J.L., and Schwartz, D. (1978). Blood cells and alcohol consumption with special reference to smoking habits. J Clin Pathol 31, 654-658.

197. Chalmers, D.M., Levi, A.J., Chanarin, I., North, W.R., and Meade, T.W. (1979). Mean cell volume in a working population: the effects of age, smoking, alcohol and oral contraception. Br J Haematol 43, 631-636.

198. Malenica, M., Prnjavorac, B., Bego, T., Dujic, T., Semiz, S., Skrbo, S., Gusic, A., Hadzic, A., and Causevic, A. (2017). Effect of Cigarette Smoking on Haematological Parameters in Healthy Population. Med Arch 71, 132-136.

199. Whitehead, T.P., Robinson, D., Allaway, S.L., and Hale, A.C. (1995). The effects of cigarette smoking and alcohol consumption on blood haemoglobin, erythrocytes and leucocytes: a dose related study on male subjects. Clin Lab Haematol 17, 131-138.

200. Cossum, P.A. (1988). Role of the red blood cell in drug metabolism. Biopharm Drug Dispos 9, 321-336.

201. Goldstein, J.L., Chan, F.K., Lanas, A., Wilcox, C.M., Peura, D., Sands, G.H., Berger, M.F., Nguyen, H., and Scheiman, J.M. (2011). Haemoglobin decreases in NSAID users over time: an analysis of two large outcome trials. Aliment Pharmacol Ther 34, 808-816.

202. Aronson, J.K. (2007). Concentration-effect and dose-response relations in clinical pharmacology. Br J Clin Pharmacol 63, 255-257.

203. Hinderling, P.H. (1997). Red blood cells: a neglected compartment in pharmacokinetics and pharmacodynamics. Pharmacol Rev 49, 279-295.

204. Grabowski, G.A., Kacena, K., Cole, J.A., Hollak, C.E., Zhang, L., Yee, J., Mistry, P.K., Zimran, A., Charrow, J., and vom Dahl, S. (2009). Dose-response relationships for enzyme replacement therapy with imiglucerase/alglucerase in patients with Gaucher disease type 1. Genet Med 11, 92-100.

205. Cotter, D., Zhang, Y., Thamer, M., Kaufman, J., and Hernan, M.A. (2008). The effect of epoetin dose on hematocrit. Kidney Int 73, 347-353.

267 206. Coviello, A.D., Kaplan, B., Lakshman, K.M., Chen, T., Singh, A.B., and Bhasin, S. (2008). Effects of graded doses of testosterone on erythropoiesis in healthy young and older men. J Clin Endocrinol Metab 93, 914-919.

207. Critchlow, C.W., Bradbury, B.D., Acquavella, J.F., Ruixo, J.J., Krishnan, M., Chow, A.T., and Brenner, R.M. (2008). The effect of epoetin dose on hematocrit. Kidney Int 74, 827; author reply 827-828.

208. Zhou, Y., Boudreau, D.M., and Freedman, A.N. (2014). Trends in the use of aspirin and nonsteroidal anti-inflammatory drugs in the general U.S. population. Pharmacoepidemiol Drug Saf 23, 43-50.

209. Graham, D.Y. (2000). NSAID ulcers: prevalence and prevention. Mod Rheumatol 10, 2-7.

210. Graham, D.Y., Opekun, A.R., Willingham, F.F., and Qureshi, W.A. (2005). Visible small- intestinal mucosal injury in chronic NSAID users. Clin Gastroenterol Hepatol 3, 55-59.

211. Haslock, I. (1990). Prevalence of NSAID-induced gastrointestinal morbidity and mortality. J Rheumatol Suppl 20, 2-6.

212. Cavagna, L., Caporali, R., Trifiro, G., Arcoraci, V., Rossi, S., and Montecucco, C. (2013). Overuse of prescription and OTC non-steroidal anti-inflammatory drugs in patients with rheumatoid arthritis and osteoarthritis. Int J Immunopathol Pharmacol 26, 279-281.

213. Maino, M., Mantovani, N., Merli, R., Cavestro, G.M., Leandro, G., Cavallaro, L.G., Corrente, V., Iori, V., Pilotto, A., Franze, A., et al. (2006). Effects of chronic therapy with non-steroidal antiinflammatory drugs on gastric permeability of sucrose: a study on 71 patients with rheumatoid arthritis. World J Gastroenterol 12, 5017-5020.

214. Mouysset, J.L., Freier, B., van den Bosch, J., Levache, C.B., Bols, A., Tessen, H.W., Belton, L., Bohac, G.C., Terwey, J.H., and Tonini, G. (2016). Hemoglobin levels and quality of life in patients with symptomatic chemotherapy-induced anemia: the eAQUA study. Cancer Manag Res 8, 1-10.

215. Whitehead, L. (2017). Managing Chemotherapy-Induced Anemia with Erythropoiesis- Stimulating Agents Plus Iron. Am J Nurs 117, 67.

216. Khan, U., Ali, F., Khurram, M.S., Zaka, A., and Hadid, T. (2017). Immunotherapy- associated autoimmune hemolytic anemia. J Immunother Cancer 5, 15.

268 217. Visweshwar, N., Jaglal, M., Sokol, L., and Zuckerman, K. (2017). Chemotherapy-related anemia. Ann Hematol.

218. Family, L., Xu, L., Xu, H., Cannavale, K., Sattayapiwat, O., Page, J.H., Bohac, C., and Chao, C. (2016). The effect of chemotherapy-induced anemia on dose reduction and dose delay. Support Care Cancer 24, 4263-4271.

219. Baccarelli, A., and Ghosh, S. (2012). Environmental exposures, epigenetics and cardiovascular disease. Curr Opin Clin Nutr Metab Care 15, 323-329.

220. Phu, P.V., Hoan, N.V., Salvignol, B., Treche, S., Wieringa, F.T., Khan, N.C., Tuong, P.D., and Berger, J. (2010). Complementary foods fortified with micronutrients prevent iron deficiency and anemia in Vietnamese infants. J Nutr 140, 2241-2247.

221. Baker, S.J., and DeMaeyer, E.M. (1979). Nutritional anemia: its understanding and control with special reference to the work of the World Health Organization. Am J Clin Nutr 32, 368-417.

222. Centers for Disease, C., and Prevention. (2002). Iron deficiency--United States, 1999-2000. MMWR Morb Mortal Wkly Rep 51, 897-899.

223. Demeyer, D., De Smet, S., and Ulens, M. (2014). The near equivalence of haem and non- haem iron bioavailability and the need for reconsidering dietary iron recommendations. Eur J Clin Nutr 68, 750-751.

224. Hurrell, R., and Egli, I. (2010). Iron bioavailability and dietary reference values. Am J Clin Nutr 91, 1461S-1467S.

225. Tuntawiroon, M., Sritongkul, N., Brune, M., Rossander-Hulten, L., Pleehachinda, R., Suwanik, R., and Hallberg, L. (1991). Dose-dependent inhibitory effect of phenolic compounds in foods on nonheme-iron absorption in men. Am J Clin Nutr 53, 554-557.

226. Hurrell, R.F., Reddy, M., and Cook, J.D. (1999). Inhibition of non-haem iron absorption in man by polyphenolic-containing beverages. Br J Nutr 81, 289-295.

227. Hoppe, M., Hulthen, L., and Hallberg, L. (2008). The importance of bioavailability of dietary iron in relation to the expected effect from iron fortification. Eur J Clin Nutr 62, 761-769.

269 228. Mairbaurl, H. (2013). Red blood cells in sports: effects of exercise and training on oxygen supply by red blood cells. Front Physiol 4, 332.

229. Smith, J.A. (1995). Exercise, training and red blood cell turnover. Sports Med 19, 9-31.

230. Honda, T., Pun, V.C., Manjourides, J., and Suh, H. (2017). Anemia prevalence and hemoglobin levels are associated with long-term exposure to air pollution in an older population. Environ Int 101, 125-132.

231. Behrman, R.E., Fisher, D.E., and Paton, J. (1971). Air pollution in nurseries: correlation with a decrease in oxygen-carrying capacity of hemoglobin. J Pediatr 78, 1050-1054.

232. Sharma, D.C., Seervi, N., and Rawtani, J. (2000). Effect of environmental lead pollution on haemoglobin and erythrocyte ALAD activity. Indian J Physiol Pharmacol 44, 117-118.

233. Mudd, J.B., Dawson, P.J., Adams, J.R., Wingo, J., and Santrock, J. (1996). Reaction of ozone with enzymes of erythrocyte membranes. Arch Biochem Biophys 335, 145-151.

234. Lippi, G., and Banfi, G. (2006). Blood transfusions in athletes. Old dogmas, new tricks. Clin Chem Lab Med 44, 1395-1402.

235. Malm, C.B., Khoo, N.S., Granlund, I., Lindstedt, E., and Hult, A. (2016). Autologous Doping with Cryopreserved Red Blood Cells - Effects on Physical Performance and Detection by Multivariate Statistics. PloS one 11, e0156157.

236. Scheinfeldt, L.B., Soi, S., Thompson, S., Ranciaro, A., Woldemeskel, D., Beggs, W., Lambert, C., Jarvis, J.P., Abate, D., Belay, G., et al. (2012). Genetic adaptation to high altitude in the Ethiopian highlands. Genome biology 13, R1.

237. Daves, M., Zagler, E.M., Cemin, R., Gnech, F., Joos, A., Platzgummer, S., and Lippi, G. (2015). Sample stability for complete blood cell count using the Sysmex XN haematological analyser. Blood Transfus 13, 576-582.

238. Petersen, P.H., Sandberg, S., Fraser, C.G., and Goldschmidt, H. (2001). Influence of index of individuality on false positives in repeated sampling from healthy individuals. Clin Chem Lab Med 39, 160-165.

239. Ram, N., and Gerstorf, D. (2009). Time-structured and net intraindividual variability: tools for examining the development of dynamic characteristics and processes. Psychol Aging 24, 778-791.

270 240. Varat, M.A., Adolph, R.J., and Fowler, N.O. (1972). Cardiovascular effects of anemia. Am Heart J 83, 415-426.

241. Jurgensen, J.S., Grimm, R., Benz, K., Philipp, S., Eckardt, K.U., and Amann, K. (2010). Effects of anemia and uremia and a combination of both on cardiovascular structures. Kidney Blood Press Res 33, 274-281.

242. Huang, Y.L., and Hu, Z.D. (2016). Lower mean corpuscular hemoglobin concentration is associated with poorer outcomes in intensive care unit admitted patients with acute myocardial infarction. Ann Transl Med 4, 190.

243. Cavusoglu, Y., Altay, H., Cetiner, M., Guvenc, T.S., Temizhan, A., Ural, D., Yesilbursa, D., Yildirim, N., and Yilmaz, M.B. (2017). Iron deficiency and anemia in heart failure. Turk Kardiyol Dern Ars 45, 1-38.

244. Deepak, B., Balaji, A., Pramod, A., Sujit, K., Savani, F., Bapu, K., Manish, P., Yogesh, B., Shreedhar, J., and Antony, G. (2016). The Prevalence and Impact of Preoperative Anemia in Patients Undergoing Cardiac Surgery for Rheumatic Heart Disease. J Cardiothorac Vasc Anesth 30, 896-900.

245. Moe, G.W., Ezekowitz, J.A., O'Meara, E., Lepage, S., Howlett, J.G., Fremes, S., Al- Hesayen, A., Heckman, G.A., Abrams, H., Ducharme, A., et al. (2015). The 2014 Canadian Cardiovascular Society Heart Failure Management Guidelines Focus Update: anemia, biomarkers, and recent therapeutic trial implications. Can J Cardiol 31, 3-16.

246. Paul, S., and Paul, R.V. (2004). Anemia in heart failure: implications, management, and outcomes. J Cardiovasc Nurs 19, S57-66.

247. Diseases, N.I.o.D.a.D.a.K. (2016). Kidney Disease Satistics for the United States.

248. Lora, C.M., Gordon, E.J., Sharp, L.K., Fischer, M.J., Gerber, B.S., and Lash, J.P. (2011). Progression of CKD in Hispanics: potential roles of health literacy, acculturation, and social support. Am J Kidney Dis 58, 282-290.

249. Vashistha, T., Streja, E., Molnar, M.Z., Rhee, C.M., Moradi, H., Soohoo, M., Kovesdy, C.P., and Kalantar-Zadeh, K. (2016). Red Cell Distribution Width and Mortality in Hemodialysis Patients. Am J Kidney Dis 68, 110-121.

250. Peng, F., Li, Z., Zhong, Z., Luo, Q., Guo, Q., Huang, F., Yu, X., and Yang, X. (2014). An increasing of red blood cell distribution width was associated with cardiovascular mortality in patients on peritoneal dialysis. Int J Cardiol 176, 1379-1381.

271 251. Sicaja, M., Pehar, M., Derek, L., Starcevic, B., Vuletic, V., Romic, Z., and Bozikov, V. (2013). Red blood cell distribution width as a prognostic marker of mortality in patients on chronic dialysis: a single center, prospective longitudinal study. Croat Med J 54, 25- 32.

252. National Heart, L., and Blood Institute. (2017). Heart Failure.

253. Patel, K.V. (2008). Epidemiology of anemia in older adults. Semin Hematol 45, 210-217.

254. Cheng, Y.L., Cheng, H.M., Huang, W.M., Lu, D.Y., Hsu, P.F., Guo, C.Y., Yu, W.C., Chen, C.H., and Sung, S.H. (2016). Red Cell Distribution Width and the Risk of Mortality in Patients With Acute Heart Failure With or Without Cardiorenal Anemia Syndrome. Am J Cardiol 117, 399-403.

255. Felker, G.M., Allen, L.A., Pocock, S.J., Shaw, L.K., McMurray, J.J., Pfeffer, M.A., Swedberg, K., Wang, D., Yusuf, S., Michelson, E.L., et al. (2007). Red cell distribution width as a novel prognostic marker in heart failure: data from the CHARM Program and the Duke Databank. J Am Coll Cardiol 50, 40-47.

256. Nunez, J., Nunez, E., Rizopoulos, D., Minana, G., Bodi, V., Bondanza, L., Husser, O., Merlos, P., Santas, E., Pascual-Figal, D., et al. (2014). Red blood cell distribution width is longitudinally associated with mortality and anemia in heart failure patients. Circ J 78, 410-418.

257. Sargento, L., Simoes, A.V., Longo, S., Lousada, N., and Palma Dos Reis, R. (2017). Red blood cell distribution width is a survival predictor beyond anemia and Nt-ProBNP in stable optimally medicated heart failure with reduced ejection fraction outpatients. Clin Hemorheol Microcirc 65, 185-194.

258. Sotiropoulos, K., Yerly, P., Monney, P., Garnier, A., Regamey, J., Hugli, O., Martin, D., Metrich, M., Antonietti, J.P., and Hullin, R. (2016). Red cell distribution width and mortality in acute heart failure patients with preserved and reduced ejection fraction. ESC Heart Fail 3, 198-204.

259. Loprinzi, P.D. (2016). Comparative evaluation of red blood cell distribution width and high sensitivity C-reactive protein in predicting all-cause mortality and coronary heart disease mortality. Int J Cardiol 223, 72-73.

260. Senthong, V., Hudec, T., Neale, S., Wu, Y., Hazen, S.L., and Tang, W.H. (2017). Relation of Red Cell Distribution Width to Left Ventricular End-Diastolic Pressure and Mortality in Patients With and Without Heart Failure. Am J Cardiol 119, 1421-1427.

272 261. Tseliou, E., Terrovitis, J.V., Kaldara, E.E., Ntalianis, A.S., Repasos, E., Katsaros, L., Margari, Z.J., Matsouka, C., Toumanidis, S., Nanas, S.N., et al. (2014). Red blood cell distribution width is a significant prognostic marker in advanced heart failure, independent of hemoglobin levels. Hellenic J Cardiol 55, 457-461.

262. Xu, J., Murphy, S.L., Kochanek, K.D., and Arias, E. (2016). Mortality in the United States, 2015. NCHS Data Brief, 1-8.

263. Howard, V.J., Cushman, M., Pulley, L., Gomez, C.R., Go, R.C., Prineas, R.J., Graham, A., Moy, C.S., and Howard, G. (2005). The reasons for geographic and racial differences in stroke study: objectives and design. Neuroepidemiology 25, 135-143.

264. Howard, G., Anderson, R., Sorlie, P., Andrews, V., Backlund, E., and Burke, G.L. (1994). Ethnic differences in stroke mortality between non-Hispanic whites, Hispanic whites, and blacks. The National Longitudinal Mortality Study. Stroke 25, 2120-2125.

265. Kara, H., Degirmenci, S., Bayir, A., Ak, A., Akinci, M., Dogru, A., Akyurek, F., and Kayis, S.A. (2015). Red cell distribution width and neurological scoring systems in acute stroke patients. Neuropsychiatr Dis Treat 11, 733-739.

266. Turcato, G., Cervellin, G., Cappellari, M., Bonora, A., Zannoni, M., Bovi, P., Ricci, G., and Lippi, G. (2017). Early function decline after ischemic stroke can be predicted by a nomogram based on age, use of thrombolysis, RDW and NIHSS score at admission. J Thromb Thrombolysis 43, 394-400.

267. Chang, W.H., Sohn, M.K., Lee, J., Kim, D.Y., Lee, S.G., Shin, Y.I., Oh, G.J., Lee, Y.S., Joo, M.C., Han, E.Y., et al. (2017). Long-term functional outcomes of patients with very mild stroke: does a NIHSS score of 0 mean no disability? An interim analysis of the KOSCO study. Disabil Rehabil 39, 904-910.

268. Kwah, L.K., and Diong, J. (2014). National Institutes of Health Stroke Scale (NIHSS). J Physiother 60, 61.

269. Martin-Schild, S., Siegler, J.E., Kumar, A.D., and Lyden, P. (2015). Troubleshooting the NIHSS: question-and-answer session with one of the designers. Int J Stroke 10, 1284- 1286.

270. Tan, W.S., Sexton, S., and Mulcahy, R. (2015). National Institutes of Health Stroke Scale (NIHSS): Are Hospital Doctors Up To Date? Ir Med J 108, 253-254.

273 271. Sacco, S., Marini, C., Olivieri, L., Pistoia, F., and Carolei, A. (2007). Contribution of hematocrit to early mortality after ischemic stroke. Eur Neurol 58, 233-238.

272. Longstreth, W.T., Jr. (2006). The REasons for Geographic And Racial Differences in Stroke (REGARDS) Study and the National Institute of Neurological Disorders and Stroke (NINDS). Stroke 37, 1147.

273. Dai, Y., Konishi, H., Takagi, A., Miyauchi, K., and Daida, H. (2014). Red cell distribution width predicts short- and long-term outcomes of acute congestive heart failure more effectively than hemoglobin. Exp Ther Med 8, 600-606.

274. Casuso, R.A., Martinez-Amat, A., Martinez-Romero, R., Camiletti-Moiron, D., Hita- Contreras, F., and Martinez-Lopez, E. (2013). Plasmatic nitric oxide correlates with weight and red cell distribution width in exercised rats supplemented with quercetin. Int J Food Sci Nutr 64, 830-835.

275. Panwar, B., Judd, S.E., Warnock, D.G., McClellan, W.M., Booth, J.N., 3rd, Muntner, P., and Gutierrez, O.M. (2016). Hemoglobin Concentration and Risk of Incident Stroke in Community-Living Adults. Stroke 47, 2017-2024.

276. Demir, R., Saritemur, M., Atis, O., Ozel, L., Kocaturk, I., Emet, M., and Ulvi, H. (2015). Can we distinguish stroke and stroke mimics via red cell distribution width in young patients? Arch Med Sci 11, 958-963.

277. Mendel, G. (1950). Gregor Mendel's letters to Carl Nageli, 1866-1873. Genetics 35, 1-29.

278. Sorsby, A. (1965). Gregor Mendel. Br Med J 1, 333-338.

279. Oakland, G.B. (1960). The statistical analysis of genetic linkage data. Acta Genet Stat Med 10, 191-236.

280. Cui, Y., Li, G., Li, S., and Wu, R. (2010). Designs for linkage analysis and association studies of complex diseases. Methods Mol Biol 620, 219-242.

281. Bush, W.S., and Haines, J. (2010). Overview of linkage analysis in complex traits. Curr Protoc Hum Genet Chapter 1, Unit 1 9 1-18.

282. Hall, J.M., Lee, M.K., Newman, B., Morrow, J.E., Anderson, L.A., Huey, B., and King, M.C. (1990). Linkage of early-onset familial breast cancer to chromosome 17q21. Science 250, 1684-1689.

274 283. Lin, J.P., O'Donnell, C.J., Jin, L., Fox, C., Yang, Q., and Cupples, L.A. (2007). Evidence for linkage of red blood cell size and count: genome-wide scans in the Framingham Heart Study. American journal of hematology 82, 605-610.

284. Altmuller, J., Palmer, L.J., Fischer, G., Scherb, H., and Wjst, M. (2001). Genomewide scans of complex human diseases: true linkage is hard to find. Am J Hum Genet 69, 936-950.

285. Boehnke, M. (2000). A look at linkage disequilibrium. Nature genetics 25, 246-247.

286. Gustavsson, P., Willing, T.N., van Haeringen, A., Tchernia, G., Dianzani, I., Donner, M., Elinder, G., Henter, J.I., Nilsson, P.G., Gordon, L., et al. (1997). Diamond-Blackfan anaemia: genetic homogeneity for a gene on chromosome 19q13 restricted to 1.8 Mb. Nature genetics 16, 368-371.

287. Gasparini, P., Miraglia del Giudice, E., Delaunay, J., Totaro, A., Granatiero, M., Melchionda, S., Zelante, L., and Iolascon, A. (1997). Localization of the congenital dyserythropoietic anemia II locus to chromosome 20q11.2 by genomewide search. Am J Hum Genet 61, 1112-1116.

288. Easton, D., Ford, D., and Peto, J. (1993). Inherited susceptibility to breast cancer. Cancer Surv 18, 95-113.

289. Lindpaintner, K. (1992). Genetic linkage analysis in hypertension: principles and practice. J Hypertens 10, 121-124.

290. West, M.J., Summers, K.M., and Huggard, P.R. (1992). Polymorphisms of candidate genes in essential hypertension. Clin Exp Pharmacol Physiol 19, 315-318.

291. Bouchard, C. (1995). Genetics of obesity: an update on molecular markers. Int J Obes Relat Metab Disord 19 Suppl 3, S10-13.

292. Mansour-Chemaly, M., Haddy, N., Siest, G., and Visvikis, S. (2002). Family studies: their role in the evaluation of genetic cardiovascular risk factors. Clin Chem Lab Med 40, 1085-1096.

293. Kontham, V., von Holst, S., and Lindblom, A. (2013). Linkage analysis in familial non- Lynch syndrome colorectal cancer families from Sweden. PloS one 8, e83936.

275 294. Iliadou, A., Evans, D.M., Zhu, G., Duffy, D.L., Frazer, I.H., Montgomery, G.W., and Martin, N.G. (2007). Genomewide scans of red cell indices suggest linkage on chromosome 6q23. J Med Genet 44, 24-30.

295. Menzel, S., Jiang, J., Silver, N., Gallagher, J., Cunningham, J., Surdulescu, G., Lathrop, M., Farrall, M., Spector, T.D., and Thein, S.L. (2007). The HBS1L-MYB intergenic region on chromosome 6q23.3 influences erythrocyte, platelet, and monocyte counts in humans. Blood 110, 3624-3626.

296. Lamoril, J., Theou-Anton, N., and Tchernitchko, D. (2015). A coding polymorphism in the BMP2 gene is associated with iron overload in non-HFE haemochromatosis patients. Blood Cells Mol Dis 55, 318-319.

297. Milet, J., Le Gac, G., Scotet, V., Gourlaouen, I., Theze, C., Mosser, J., Bourgain, C., Deugnier, Y., and Ferec, C. (2010). A common SNP near BMP2 is associated with severity of the iron burden in HFE p.C282Y homozygous patients: a follow-up study. Blood Cells Mol Dis 44, 34-37.

298. Truksa, J., Peng, H., Lee, P., and Beutler, E. (2006). Bone morphogenetic proteins 2, 4, and 9 stimulate murine hepcidin 1 expression independently of Hfe, transferrin receptor 2 (Tfr2), and IL-6. Proc Natl Acad Sci U S A 103, 10289-10293.

299. Abbott, D.L. (2009). Conditional linkage methods--searching for modifier genes in a large Amish pedigree with known von Willebrand disease major gene modification. PhD thesis, University of Iowa.

300. Boria, I., Garelli, E., Gazda, H.T., Aspesi, A., Quarello, P., Pavesi, E., Ferrante, D., Meerpohl, J.J., Kartal, M., Da Costa, L., et al. (2010). The ribosomal basis of Diamond- Blackfan Anemia: mutation and database update. Hum Mutat 31, 1269-1279.

301. Astle, W.J., Elding, H., Jiang, T., Allen, D., Ruklisa, D., Mann, A.L., Mead, D., Bouman, H., Riveros-Mckay, F., Kostadima, M.A., et al. (2016). The Allelic Landscape of Human Blood Cell Trait Variation and Links to Common Complex Disease. Cell 167, 1415-1429 e1419.

302. Casanovas, G., Vujic Spasic, M., Casu, C., Rivella, S., Strelau, J., Unsicker, K., and Muckenthaler, M.U. (2013). The murine growth differentiation factor 15 is not essential for systemic iron homeostasis in phlebotomized mice. Haematologica 98, 444-447.

276 303. Lang, E., Bissinger, R., Fajol, A., Salker, M.S., Singh, Y., Zelenak, C., Ghashghaeinia, M., Gu, S., Jilani, K., Lupescu, A., et al. (2015). Accelerated apoptotic death and in vivo turnover of erythrocytes in mice lacking functional mitogen- and stress-activated kinase MSK1/2. Sci Rep 5, 17316.

304. Nai, A., Rubio, A., Campanella, A., Gourbeyre, O., Artuso, I., Bordini, J., Gineste, A., Latour, C., Besson-Fournier, C., Lin, H.Y., et al. (2016). Limiting hepatic Bmp-Smad signaling by matriptase-2 is required for erythropoietin-mediated hepcidin suppression in mice. Blood 127, 2327-2336.

305. Basu, P., Morris, P.E., Haar, J.L., Wani, M.A., Lingrel, J.B., Gaensler, K.M., and Lloyd, J.A. (2005). KLF2 is essential for primitive erythropoiesis and regulates the human and murine embryonic beta-like globin genes in vivo. Blood 106, 2566-2571.

306. Ulyanova, T., Padilla, S.M., and Papayannopoulou, T. (2014). Stage-specific functional roles of integrins in murine erythropoiesis. Exp Hematol 42, 404-409 e404.

307. Paul, D.S., Albers, C.A., Rendon, A., Voss, K., Stephens, J., HaemGen, C., van der Harst, P., Chambers, J.C., Soranzo, N., Ouwehand, W.H., et al. (2013). Maps of open chromatin highlight cell type-restricted patterns of regulatory sequence variation at hematological trait loci. Genome Res 23, 1130-1141.

308. Lin, J.P., O'Donnell, C.J., Levy, D., and Cupples, L.A. (2005). Evidence for a gene influencing haematocrit on chromosome 6q23-24: genomewide scan in the Framingham Heart Study. J Med Genet 42, 75-79.

309. Ferreira, M.A., Hottenga, J.J., Warrington, N.M., Medland, S.E., Willemsen, G., Lawrence, R.W., Gordon, S., de Geus, E.J., Henders, A.K., Smit, J.H., et al. (2009). Sequence variants in three loci influence monocyte counts and erythrocyte volume. Am J Hum Genet 85, 745-749.

310. Yang, Q., Kathiresan, S., Lin, J.P., Tofler, G.H., and O'Donnell, C.J. (2007). Genome-wide association and linkage analyses of hemostatic factors and hematological phenotypes in the Framingham Heart Study. BMC Med Genet 8 Suppl 1, S12.

311. Bertin, A., Mahaney, M.C., Cox, L.A., Rogers, J., VandeBerg, J.L., Brugnara, C., and Platt, O.S. (2007). Quantitative trait loci for peripheral blood cell counts: a study in baboons. Mamm Genome 18, 361-372.

277 312. Hinckley, J.D., Abbott, D., Burns, T.L., Heiman, M., Shapiro, A.D., Wang, K., and Di Paola, J. (2013). Quantitative trait locus linkage analysis in a large Amish pedigree identifies novel candidate loci for erythrocyte traits. Mol Genet Genomic Med 1, 131- 141.

313. Peters, L.L., Lambert, A.J., Zhang, W., Churchill, G.A., Brugnara, C., and Platt, O.S. (2006). Quantitative trait loci for baseline erythroid traits. Mamm Genome 17, 298-309.

314. Chambers, J.C., Zhang, W., Li, Y., Sehmi, J., Wass, M.N., Zabaneh, D., Hoggart, C., Bayele, H., McCarthy, M.I., Peltonen, L., et al. (2009). Genome-wide association study identifies variants in TMPRSS6 associated with hemoglobin levels. Nature genetics 41, 1170-1172.

315. Benyamin, B., Ferreira, M.A., Willemsen, G., Gordon, S., Middelberg, R.P., McEvoy, B.P., Hottenga, J.J., Henders, A.K., Campbell, M.J., Wallace, L., et al. (2009). Common variants in TMPRSS6 are associated with iron status and erythrocyte volume. Nature genetics 41, 1173-1175.

316. Zeng, S.M., Yankowitz, J., Widness, J.A., and Strauss, R.G. (2001). Etiology of differences in hematocrit between males and females: sequence-based polymorphisms in erythropoietin and its receptor. J Gend Specif Med 4, 35-40.

317. Beall, C.M., Cavalleri, G.L., Deng, L., Elston, R.C., Gao, Y., Knight, J., Li, C., Li, J.C., Liang, Y., McCormack, M., et al. (2010). Natural selection on EPAS1 (HIF2alpha) associated with low hemoglobin concentration in Tibetan highlanders. Proc Natl Acad Sci U S A 107, 11459-11464.

318. Lo, K.S., Wilson, J.G., Lange, L.A., Folsom, A.R., Galarneau, G., Ganesh, S.K., Grant, S.F., Keating, B.J., McCarroll, S.A., Mohler, E.R., 3rd, et al. (2011). Genetic association analysis highlights new loci that modulate hematological trait variation in Caucasians and African Americans. Human genetics 129, 307-317.

319. Puck, J.M., and Willard, H.F. (1998). X inactivation in females with X-linked disease. N Engl J Med 338, 325-328.

320. Dattani, M.T. (2009). The candidate gene approach to the diagnosis of monogenic disorders. Horm Res 71 Suppl 2, 14-21.

321. Fachal, L., and Dunning, A.M. (2015). From candidate gene studies to GWAS and post- GWAS analyses in breast cancer. Curr Opin Genet Dev 30, 32-41.

278 322. Bauer, D.E., and Orkin, S.H. (2015). Hemoglobin switching's surprise: the versatile transcription factor BCL11A is a master repressor of fetal hemoglobin. Curr Opin Genet Dev 33, 62-70.

323. Craig, J.E., Rochette, J., Sampietro, M., Wilkie, A.O., Barnetson, R., Hatton, C.S., Demenais, F., and Thein, S.L. (1997). Genetic heterogeneity in heterocellular hereditary persistence of fetal hemoglobin. Blood 90, 428-434.

324. Menzel, S., Garner, C., Gut, I., Matsuda, F., Yamaguchi, M., Heath, S., Foglio, M., Zelenika, D., Boland, A., Rooks, H., et al. (2007). A QTL influencing F cell production maps to a gene encoding a zinc-finger protein on chromosome 2p15. Nature genetics 39, 1197-1199.

325. Garner, C., Tatu, T., Reittie, J.E., Littlewood, T., Darley, J., Cervino, S., Farrall, M., Kelly, P., Spector, T.D., and Thein, S.L. (2000). Genetic influences on F cells and other hematologic variables: a twin heritability study. Blood 95, 342-346.

326. Trakarnsanga, K., Wilson, M.C., Lau, W., Singleton, B.K., Parsons, S.F., Sakuntanaga, P., Kurita, R., Nakamura, Y., Anstee, D.J., and Frayne, J. (2014). Induction of adult levels of beta-globin in human erythroid cells that intrinsically express embryonic or fetal globin by transduction with KLF1 and BCL11A-XL. Haematologica 99, 1677-1685.

327. Wilber, A., Hargrove, P.W., Kim, Y.S., Riberdy, J.M., Sankaran, V.G., Papanikolaou, E., Georgomanoli, M., Anagnou, N.P., Orkin, S.H., Nienhuis, A.W., et al. (2011). Therapeutic levels of fetal hemoglobin in erythroid progeny of beta-thalassemic CD34+ cells after lentiviral vector-mediated gene transfer. Blood 117, 2817-2826.

328. Khosravi, M.A., Abbasalipour, M., Concordet, J.P., Berg, J.V., Zeinali, S., Arashkia, A., Azadmanesh, K., Buch, T., and Karimipoor, M. (2019). Targeted deletion of BCL11A gene by CRISPR-Cas9 system for fetal hemoglobin reactivation: A promising approach for gene therapy of beta thalassemia disease. Eur J Pharmacol 854, 398-405.

329. Orkin, S.H., and Bauer, D.E. (2019). Emerging Genetic Therapy for Sickle Cell Disease. Annu Rev Med 70, 257-271.

330. Perkins, A., Xu, X., Higgs, D.R., Patrinos, G.P., Arnaud, L., Bieker, J.J., Philipsen, S., and Workgroup, K.L.F.C. (2016). Kruppeling erythropoiesis: an unexpected broad spectrum of human red blood cell disorders due to KLF1 variants. Blood 127, 1856-1862.

331. Kulczynska, K., and Siatecka, M. (2016). A regulatory function of long non-coding RNAs in red blood cell development. Acta Biochim Pol 63, 675-680.

279 332. Borg, J., Papadopoulos, P., Georgitsi, M., Gutierrez, L., Grech, G., Fanis, P., Phylactides, M., Verkerk, A.J., van der Spek, P.J., Scerri, C.A., et al. (2010). Haploinsufficiency for the erythroid transcription factor KLF1 causes hereditary persistence of fetal hemoglobin. Nature genetics 42, 801-805.

333. Venter, J.C., Adams, M.D., Myers, E.W., Li, P.W., Mural, R.J., Sutton, G.G., Smith, H.O., Yandell, M., Evans, C.A., Holt, R.A., et al. (2001). The sequence of the human genome. Science 291, 1304-1351.

334. Wolfsberg, T.G., McEntyre, J., and Schuler, G.D. (2001). Guide to the draft human genome. Nature 409, 824-826.

335. Palmer, C., and Pe'er, I. (2017). Statistical correction of the Winner's Curse explains replication variability in quantitative trait genome-wide association studies. PLoS Genet 13, e1006916.

336. Grinde, K.E., Arbet, J., Green, A., O'Connell, M., Valcarcel, A., Westra, J., and Tintle, N. (2017). Illustrating, Quantifying, and Correcting for Bias in Post-hoc Analysis of Gene- Based Rare Variant Tests of Association. Front Genet 8, 117.

337. Poirier, J.G., Faye, L.L., Dimitromanolakis, A., Paterson, A.D., Sun, L., and Bull, S.B. (2015). Resampling to Address the Winner's Curse in Genetic Association Analysis of Time to Event. Genet Epidemiol 39, 518-528.

338. Xiao, R., and Boehnke, M. (2011). Quantifying and correcting for the winner's curse in quantitative-trait association studies. Genet Epidemiol 35, 133-138.

339. Trerotola, M., Relli, V., Simeone, P., and Alberti, S. (2015). Epigenetic inheritance and the missing heritability. Hum Genomics 9, 17.

340. Girirajan, S. (2017). Missing heritability and where to find it. Genome biology 18, 89.

341. Bourrat, P., Lu, Q., and Jablonka, E. (2017). Why the missing heritability might not be in the DNA. Bioessays 39.

342. Matise, T.C., Ambite, J.L., Buyske, S., Carlson, C.S., Cole, S.A., Crawford, D.C., Haiman, C.A., Heiss, G., Kooperberg, C., Marchand, L.L., et al. (2011). The Next PAGE in understanding complex traits: design for the analysis of Population Architecture Using Genetics and Epidemiology (PAGE) Study. American journal of epidemiology 174, 849- 859.

280 343. Replication, D.I.G., Meta-analysis, C., Asian Genetic Epidemiology Network Type 2 Diabetes, C., South Asian Type 2 Diabetes, C., Mexican American Type 2 Diabetes, C., Type 2 Diabetes Genetic Exploration by Nex-generation sequencing in muylti-Ethnic Samples, C., Mahajan, A., Go, M.J., Zhang, W., Below, J.E., et al. (2014). Genome-wide trans-ancestry meta-analysis provides insight into the genetic architecture of type 2 diabetes susceptibility. Nature genetics 46, 234-244.

344. Pulit, S.L., Voight, B.F., and de Bakker, P.I. (2010). Multiethnic genetic association studies improve power for locus discovery. PloS one 5, e12600.

345. Zaitlen, N., Pasaniuc, B., Gur, T., Ziv, E., and Halperin, E. (2010). Leveraging genetic variability across populations for the identification of causal variants. Am J Hum Genet 86, 23-33.

346. Rosenberg, N.A., Huang, L., Jewett, E.M., Szpiech, Z.A., Jankovic, I., and Boehnke, M. (2010). Genome-wide association studies in diverse populations. Nat Rev Genet 11, 356- 366.

347. Pasaniuc, B., and Price, A.L. (2017). Dissecting the genetics of complex traits using summary association statistics. Nat Rev Genet 18, 117-127.

348. Kanai, M., Akiyama, M., Takahashi, A., Matoba, N., Momozawa, Y., Ikeda, M., Iwata, N., Ikegawa, S., Hirata, M., Matsuda, K., et al. (2018). Genetic analysis of quantitative traits in the Japanese population links cell types to complex human diseases. Nature genetics 50, 390-400.

349. Nsengimana, J., and Bishop, D.T. (2017). Design Considerations for Genetic Linkage and Association Studies. Methods Mol Biol 1666, 257-281.

350. Spencer, C.C., Su, Z., Donnelly, P., and Marchini, J. (2009). Designing genome-wide association studies: sample size, power, imputation, and the choice of genotyping chip. PLoS Genet 5, e1000477.

351. Voight, B.F., Kang, H.M., Ding, J., Palmer, C.D., Sidore, C., Chines, P.S., Burtt, N.P., Fuchsberger, C., Li, Y., Erdmann, J., et al. (2012). The metabochip, a custom genotyping array for genetic studies of metabolic, cardiovascular, and anthropometric traits. PLoS Genet 8, e1002793.

281 352. Smith, M.W., Patterson, N., Lautenberger, J.A., Truelove, A.L., McDonald, G.J., Waliszewska, A., Kessing, B.D., Malasky, M.J., Scafe, C., Le, E., et al. (2004). A high- density admixture map for disease gene discovery in african americans. Am J Hum Genet 74, 1001-1013.

353. Blanco-Gomez, A., Castillo-Lluva, S., Del Mar Saez-Freire, M., Hontecillas-Prieto, L., Mao, J.H., Castellanos-Martin, A., and Perez-Losada, J. (2016). Missing heritability of complex diseases: Enlightenment by genetic variants from intermediate phenotypes. Bioessays 38, 664-673.

354. Eichler, E.E., Flint, J., Gibson, G., Kong, A., Leal, S.M., Moore, J.H., and Nadeau, J.H. (2010). Missing heritability and strategies for finding the underlying causes of complex disease. Nat Rev Genet 11, 446-450.

355. Gusev, A., Bhatia, G., Zaitlen, N., Vilhjalmsson, B.J., Diogo, D., Stahl, E.A., Gregersen, P.K., Worthington, J., Klareskog, L., Raychaudhuri, S., et al. (2013). Quantifying missing heritability at known GWAS loci. PLoS Genet 9, e1003993.

356. Visscher, P.M., Wray, N.R., Zhang, Q., Sklar, P., McCarthy, M.I., Brown, M.A., and Yang, J. (2017). 10 Years of GWAS Discovery: Biology, Function, and Translation. Am J Hum Genet 101, 5-22.

357. Manolio, T.A., Collins, F.S., Cox, N.J., Goldstein, D.B., Hindorff, L.A., Hunter, D.J., McCarthy, M.I., Ramos, E.M., Cardon, L.R., Chakravarti, A., et al. (2009). Finding the missing heritability of complex diseases. Nature 461, 747-753.

358. Maher, B. (2008). Personal genomes: The case of the missing heritability. Nature 456, 18- 21.

359. Yang, J., Benyamin, B., McEvoy, B.P., Gordon, S., Henders, A.K., Nyholt, D.R., Madden, P.A., Heath, A.C., Martin, N.G., Montgomery, G.W., et al. (2010). Common SNPs explain a large proportion of the heritability for human height. Nat Genet 42, 565-569.

360. Morris, A.P., Voight, B.F., Teslovich, T.M., Ferreira, T., Segre, A.V., Steinthorsdottir, V., Strawbridge, R.J., Khan, H., Grallert, H., Mahajan, A., et al. (2012). Large-scale association analysis provides insights into the genetic architecture and pathophysiology of type 2 diabetes. Nat Genet 44, 981-990.

361. Lee, S.H., Wray, N.R., Goddard, M.E., and Visscher, P.M. (2011). Estimating missing heritability for disease from genome-wide association studies. American journal of human genetics 88, 294-305.

282 362. Berndt, S.I., Gustafsson, S., Magi, R., Ganna, A., Wheeler, E., Feitosa, M.F., Justice, A.E., Monda, K.L., Croteau-Chonka, D.C., Day, F.R., et al. (2013). Genome-wide meta- analysis identifies 11 new loci for anthropometric traits and provides insights into genetic architecture. Nat Genet 45, 501-512.

363. Vattikuti, S., Guo, J., and Chow, C.C. (2012). Heritability and genetic correlations explained by common SNPs for metabolic syndrome traits. PLoS genetics 8, e1002637.

364. Locke, A.E., Kahali, B., Berndt, S.I., Justice, A.E., Pers, T.H., Day, F.R., Powell, C., Vedantam, S., Buchkovich, M.L., Yang, J., et al. (2015). Genetic studies of body mass index yield new insights for obesity biology. Nature 518, 197-206.

365. Shungin, D., Winkler, T.W., Croteau-Chonka, D.C., Ferreira, T., Locke, A.E., Magi, R., Strawbridge, R.J., Pers, T.H., Fischer, K., Justice, A.E., et al. (2015). New genetic loci link adipose and insulin biology to body fat distribution. Nature 518, 187-196.

366. Global Lipids Genetics, C., Willer, C.J., Schmidt, E.M., Sengupta, S., Peloso, G.M., Gustafsson, S., Kanoni, S., Ganna, A., Chen, J., Buchkovich, M.L., et al. (2013). Discovery and refinement of loci associated with lipid levels. Nat Genet 45, 1274-1283.

367. Surakka, I., Horikoshi, M., Magi, R., Sarin, A.P., Mahajan, A., Lagou, V., Marullo, L., Ferreira, T., Miraglio, B., Timonen, S., et al. (2015). The impact of low-frequency and rare variants on lipid levels. Nat Genet 47, 589-597.

368. Teslovich, T.M., Musunuru, K., Smith, A.V., Edmondson, A.C., Stylianou, I.M., Koseki, M., Pirruccello, J.P., Ripatti, S., Chasman, D.I., Willer, C.J., et al. (2010). Biological, clinical and population relevance of 95 loci for blood lipids. Nature 466, 707-713.

369. Arking, D.E., Pulit, S.L., Crotti, L., van der Harst, P., Munroe, P.B., Koopmann, T.T., Sotoodehnia, N., Rossin, E.J., Morley, M., Wang, X., et al. (2014). Genetic association study of QT interval highlights role for calcium signaling pathways in myocardial repolarization. Nat Genet 46, 826-836.

370. International Consortium for Blood Pressure Genome-Wide Association, S., Ehret, G.B., Munroe, P.B., Rice, K.M., Bochud, M., Johnson, A.D., Chasman, D.I., Smith, A.V., Tobin, M.D., Verwoert, G.C., et al. (2011). Genetic variants in novel pathways influence blood pressure and cardiovascular disease risk. Nature 478, 103-109.

371. Kathiresan, S., Willer, C.J., Peloso, G.M., Demissie, S., Musunuru, K., Schadt, E.E., Kaplan, L., Bennett, D., Li, Y., Tanaka, T., et al. (2009). Common variants at 30 loci contribute to polygenic dyslipidemia. Nat Genet 41, 56-65.

283 372. Lu, X., Huang, J., Mo, Z., He, J., Wang, L., Yang, X., Tan, A., Chen, S., Chen, J., Gu, C.C., et al. (2016). Genetic Susceptibility to Lipid Levels and Lipid Change Over Time and Risk of Incident Hyperlipidemia in Chinese Populations. Circ Cardiovasc Genet 9, 37-44.

373. Speliotes, E.K., Willer, C.J., Berndt, S.I., Monda, K.L., Thorleifsson, G., Jackson, A.U., Lango Allen, H., Lindgren, C.M., Luan, J., Magi, R., et al. (2010). Association analyses of 249,796 individuals reveal 18 new loci associated with body mass index. Nat Genet 42, 937-948.

374. Aulchenko, Y.S., Ripatti, S., Lindqvist, I., Boomsma, D., Heid, I.M., Pramstaller, P.P., Penninx, B.W., Janssens, A.C., Wilson, J.F., Spector, T., et al. (2009). Loci influencing lipid levels and coronary heart disease risk in 16 European population cohorts. Nat Genet 41, 47-55.

375. Wain, L.V., Verwoert, G.C., O'Reilly, P.F., Shi, G., Johnson, T., Johnson, A.D., Bochud, M., Rice, K.M., Henneman, P., Smith, A.V., et al. (2011). Genome-wide association study identifies six new loci influencing pulse pressure and mean arterial pressure. Nat Genet 43, 1005-1011.

376. Waterworth, D.M., Ricketts, S.L., Song, K., Chen, L., Zhao, J.H., Ripatti, S., Aulchenko, Y.S., Zhang, W., Yuan, X., Lim, N., et al. (2010). Genetic variants influencing circulating lipid levels and risk of coronary artery disease. Arteriosclerosis, thrombosis, and vascular biology 30, 2264-2276.

377. Dupuis, J., Langenberg, C., Prokopenko, I., Saxena, R., Soranzo, N., Jackson, A.U., Wheeler, E., Glazer, N.L., Bouatia-Naji, N., Gloyn, A.L., et al. (2010). New genetic loci implicated in fasting glucose homeostasis and their impact on type 2 diabetes risk. Nat Genet 42, 105-116.

378. Kathiresan, S., Melander, O., Guiducci, C., Surti, A., Burtt, N.P., Rieder, M.J., Cooper, G.M., Roos, C., Voight, B.F., Havulinna, A.S., et al. (2008). Six new loci associated with blood low-density lipoprotein cholesterol, high-density lipoprotein cholesterol or triglycerides in humans. Nat Genet 40, 189-197.

379. Willer, C.J., Sanna, S., Jackson, A.U., Scuteri, A., Bonnycastle, L.L., Clarke, R., Heath, S.C., Timpson, N.J., Najjar, S.S., Stringham, H.M., et al. (2008). Newly identified loci that influence lipid concentrations and risk of coronary artery disease. Nat Genet 40, 161- 169.

380. Lu, X., Wang, L., Lin, X., Huang, J., Charles Gu, C., He, M., Shen, H., He, J., Zhu, J., Li, H., et al. (2015). Genome-wide association study in Chinese identifies novel loci for blood pressure and hypertension. Human molecular genetics 24, 865-874.

284 381. den Hoed, M., Eijgelsheim, M., Esko, T., Brundel, B.J., Peal, D.S., Evans, D.M., Nolte, I.M., Segre, A.V., Holm, H., Handsaker, R.E., et al. (2013). Identification of heart rate- associated loci and their effects on cardiac conduction and rhythm disorders. Nat Genet 45, 621-631.

382. Voight, B.F., Scott, L.J., Steinthorsdottir, V., Morris, A.P., Dina, C., Welch, R.P., Zeggini, E., Huth, C., Aulchenko, Y.S., Thorleifsson, G., et al. (2010). Twelve type 2 diabetes susceptibility loci identified through large-scale association analysis. Nat Genet 42, 579- 589.

383. Below, J.E., Parra, E.J., Gamazon, E.R., Torres, J., Krithika, S., Candille, S., Lu, Y., Manichakul, A., Peralta-Romero, J., Duan, Q., et al. (2016). Meta-analysis of lipid-traits in Hispanics identifies novel loci, population-specific effects, and tissue-specific enrichment of eQTLs. Scientific reports 6, 19429.

384. Kato, N., Loh, M., Takeuchi, F., Verweij, N., Wang, X., Zhang, W., Kelly, T.N., Saleheen, D., Lehne, B., Mateo Leach, I., et al. (2015). Trans-ancestry genome-wide association study identifies 12 genetic loci influencing blood pressure and implicates a role for DNA methylation. Nat Genet 47, 1282-1293.

385. Newton-Cheh, C., Eijgelsheim, M., Rice, K.M., de Bakker, P.I., Yin, X., Estrada, K., Bis, J.C., Marciante, K., Rivadeneira, F., Noseworthy, P.A., et al. (2009). Common variants at ten loci influence QT interval duration in the QTGEN Study. Nat Genet 41, 399-406.

386. Hwang, J.Y., Sim, X., Wu, Y., Liang, J., Tabara, Y., Hu, C., Hara, K., Tam, C.H., Cai, Q., Zhao, Q., et al. (2015). Genome-wide association meta-analysis identifies novel variants associated with fasting plasma glucose in East Asians. Diabetes 64, 291-298.

387. Wen, W., Zheng, W., Okada, Y., Takeuchi, F., Tabara, Y., Hwang, J.Y., Dorajoo, R., Li, H., Tsai, F.J., Yang, X., et al. (2014). Meta-analysis of genome-wide association studies in East Asian-ancestry populations identifies four new loci for body mass index. Human molecular genetics 23, 5492-5504.

388. Coram, M.A., Duan, Q., Hoffmann, T.J., Thornton, T., Knowles, J.W., Johnson, N.A., Ochs-Balcom, H.M., Donlon, T.A., Martin, L.W., Eaton, C.B., et al. (2013). Genome- wide characterization of shared and distinct genetic components that influence blood lipid levels in ethnically diverse human populations. American journal of human genetics 92, 904-916.

389. Whiting, P.W. (1950). Blood group symbols and genes, some thoughts on allelism, pleiotropy, and dominance. J Hered 41, 55-56.

285 390. Hadorn, E. (1956). Patterns of biochemical and developmental pleiotropy. Cold Spring Harb Symp Quant Biol 21, 363-373.

391. Parsons, P.A., and Green, M.M. (1959). Pleiotropy and Competition at the Vermilion Locus in Drosophila Melanogaster. Proc Natl Acad Sci U S A 45, 993-996.

392. Stern, C., and Tokunaga, C. (1968). Autonomous pleiotropy in Drosophilia. Proc Natl Acad Sci U S A 60, 1252-1259.

393. Hartsfield, J.K., Jr., Hall, B.D., Grix, A.W., Kousseff, B.G., Salazar, J.F., and Haufe, S.M. (1993). Pleiotropy in Coffin-Lowry syndrome: sensorineural hearing deficit and premature tooth loss as early manifestations. Am J Med Genet 45, 552-557.

394. Malfait, F., Kariminejad, A., Van Damme, T., Gauche, C., Syx, D., Merhi-Soussi, F., Gulberti, S., Symoens, S., Vanhauwaert, S., Willaert, A., et al. (2013). Defective initiation of glycosaminoglycan synthesis due to B3GALT6 mutations causes a pleiotropic Ehlers-Danlos-syndrome-like connective tissue disorder. Am J Hum Genet 92, 935-945.

395. Driscoll, D.A. (1994). Genetic basis of DiGeorge and velocardiofacial syndromes. Curr Opin Pediatr 6, 702-706.

396. Chang, L., Blain, D., Bertuzzi, S., and Brooks, B.P. (2006). Uveal coloboma: clinical and basic science update. Curr Opin Ophthalmol 17, 447-470.

397. Rice, T., Tremblay, A., Deriaz, O., Perusse, L., Rao, D.C., and Bouchard, C. (1996). Genetic pleiotropy for resting metabolic rate with fat-free mass and fat mass: the Quebec Family Study. Obes Res 4, 125-131.

398. Hokanson, J.E., Langefeld, C.D., Mitchell, B.D., Lange, L.A., Goff, D.C., Jr., Haffner, S.M., Saad, M.F., and Rotter, J.I. (2003). Pleiotropy and heterogeneity in the expression of atherogenic lipoproteins: the IRAS Family Study. Hum Hered 55, 46-50.

399. Schork, N.J., Weder, A.B., Trevisan, M., and Laurenzi, M. (1994). The contribution of pleiotropy to blood pressure and body-mass index variation: the Gubbio Study. Am J Hum Genet 54, 361-373.

400. Paaby, A.B., and Rockman, M.V. (2013). The many faces of pleiotropy. Trends Genet 29, 66-73.

286 401. Sivakumaran, S., Agakov, F., Theodoratou, E., Prendergast, J.G., Zgaga, L., Manolio, T., Rudan, I., McKeigue, P., Wilson, J.F., and Campbell, H. (2011). Abundant pleiotropy in human complex diseases and traits. Am J Hum Genet 89, 607-618.

402. Solovieff, N., Cotsapas, C., Lee, P.H., Purcell, S.M., and Smoller, J.W. (2013). Pleiotropy in complex traits: challenges and strategies. Nat Rev Genet 14, 483-495.

403. Visscher, P.M., and Yang, J. (2016). A plethora of pleiotropy across complex traits. Nature genetics 48, 707-708.

404. Zhou, Y., Kaiser, T., Monteiro, P., Zhang, X., Van der Goes, M.S., Wang, D., Barak, B., Zeng, M., Li, C., Lu, C., et al. (2016). Mice with Shank3 Mutations Associated with ASD and Schizophrenia Display Both Shared and Distinct Defects. Neuron 89, 147-162.

405. Chesmore, K., Bartlett, J., and Williams, S.M. (2018). The ubiquity of pleiotropy in human disease. Human genetics 137, 39-44.

406. Ramos, P.S., Criswell, L.A., Moser, K.L., Comeau, M.E., Williams, A.H., Pajewski, N.M., Chung, S.A., Graham, R.R., Zidovetzki, R., Kelly, J.A., et al. (2011). A comprehensive analysis of shared loci between systemic lupus erythematosus (SLE) and sixteen autoimmune diseases reveals limited genetic overlap. PLoS Genet 7, e1002406.

407. Pernis, A.B., Ricker, E., Weng, C.H., Rozo, C., and Yi, W. (2016). Rho Kinases in Autoimmune Diseases. Annu Rev Med 67, 355-374.

408. Alarcon-Segovia, D., Alarcon-Riquelme, M.E., Cardiel, M.H., Caeiro, F., Massardo, L., Villa, A.R., Pons-Estel, B.A., and Grupo Latinoamericano de Estudio del Lupus, E. (2005). Familial aggregation of systemic lupus erythematosus, rheumatoid arthritis, and other autoimmune diseases in 1,177 lupus patients from the GLADEL cohort. Arthritis Rheum 52, 1138-1147.

409. Criswell, L.A., Pfeiffer, K.A., Lum, R.F., Gonzales, B., Novitzke, J., Kern, M., Moser, K.L., Begovich, A.B., Carlton, V.E., Li, W., et al. (2005). Analysis of families in the multiple autoimmune disease genetics consortium (MADGC) collection: the PTPN22 620W allele associates with multiple autoimmune phenotypes. Am J Hum Genet 76, 561- 571.

410. Reveille, J.D. (2005). Genetic studies in the rheumatic diseases: present status and implications for the future. J Rheumatol Suppl 72, 10-13.

287 411. Kuo, C.F., Grainge, M.J., Valdes, A.M., See, L.C., Luo, S.F., Yu, K.H., Zhang, W., and Doherty, M. (2015). Familial Risk of Sjogren's Syndrome and Co-aggregation of Autoimmune Diseases in Affected Families: A Nationwide Population Study. Arthritis Rheumatol 67, 1904-1912.

412. Luan, M., Shang, Z., Teng, Y., Chen, X., Zhang, M., Lv, H., and Zhang, R. (2017). The shared and specific mechanism of four autoimmune diseases. Oncotarget 8, 108355- 108374.

413. O'Donovan, M.C., and Owen, M.J. (2016). The implications of the shared genetics of psychiatric disorders. Nat Med 22, 1214-1219.

414. Whitfield, J.B. (2014). Genetic insights into cardiometabolic risk factors. Clin Biochem Rev 35, 15-36.

415. Park, H., Li, X., Song, Y.E., He, K.Y., and Zhu, X. (2016). Multivariate Analysis of Anthropometric Traits Using Summary Statistics of Genome-Wide Association Studies from GIANT Consortium. PloS one 11, e0163912.

416. Oh, S., Huh, I., Lee, S.Y., and Park, T. (2016). Analysis of multiple related phenotypes in genome-wide association studies. J Bioinform Comput Biol 14, 1644005.

417. Smeland, O.B., Wang, Y., Frei, O., Li, W., Hibar, D.P., Franke, B., Bettella, F., Witoelar, A., Djurovic, S., Chen, C.H., et al. (2017). Genetic Overlap Between Schizophrenia and Volumes of Hippocampus, Putamen, and Intracranial Volume Indicates Shared Molecular Genetic Mechanisms. Schizophr Bull.

418. Salinas, Y.D., Wang, Z., and DeWan, A.T. (2017). Statistical Analysis of Multiple Phenotypes in Genetic Epidemiological Studies:From Cross-Phenotype Associations to Pleiotropy. American journal of epidemiology.

419. Pickrell, J.K., Berisa, T., Liu, J.Z., Segurel, L., Tung, J.Y., and Hinds, D.A. (2016). Detection and interpretation of shared genetic influences on 42 human traits. Nature genetics 48, 709-717.

420. Shen, X., Klaric, L., Sharapov, S., Mangino, M., Ning, Z., Wu, D., Trbojevic-Akmacic, I., Pucic-Bakovic, M., Rudan, I., Polasek, O., et al. (2017). Multivariate discovery and replication of five novel loci associated with Immunoglobulin G N-glycosylation. Nat Commun 8, 447.

288 421. Schifano, E.D., Li, L., Christiani, D.C., and Lin, X. (2013). Genome-wide association analysis for multiple continuous secondary phenotypes. Am J Hum Genet 92, 744-759.

422. Ray, D., and Boehnke, M. (2018). Methods for meta-analysis of multiple traits using GWAS summary statistics. Genet Epidemiol 42, 134-145.

423. Pickrell, J.K. (2014). Joint analysis of functional genomic data and genome-wide association studies of 18 human traits. Am J Hum Genet 94, 559-573.

424. Manning, A.K., Hivert, M.F., Scott, R.A., Grimsby, J.L., Bouatia-Naji, N., Chen, H., Rybin, D., Liu, C.T., Bielak, L.F., Prokopenko, I., et al. (2012). A genome-wide approach accounting for body mass index identifies genetic variants influencing fasting glycemic traits and insulin resistance. Nat Genet 44, 659-669.

425. Kim, J., Bai, Y., and Pan, W. (2015). An Adaptive Association Test for Multiple Phenotypes with GWAS Summary Statistics. Genet Epidemiol 39, 651-663.

426. Highland, H.M., Sitlani, C., Seyerle, A.A., Gondalia, R., Graff, M., Avery, C.L., and North, K.E. (2016). Pleiotropic effects on BMI, WHR, fasting glucose, and fasting insulin levels. Presented at the 2016 American Society for Human Genetics meeting.

427. Gratten, J., and Visscher, P.M. (2016). Genetic pleiotropy in complex traits and diseases: implications for genomic medicine. Genome Med 8, 78.

428. Li, Y.I., van de Geijn, B., Raj, A., Knowles, D.A., Petti, A.A., Golan, D., Gilad, Y., and Pritchard, J.K. (2016). RNA splicing is a primary link between genetic variation and disease. Science 352, 600-604.

429. Consortium, G.T. (2015). Human genomics. The Genotype-Tissue Expression (GTEx) pilot analysis: multitissue gene regulation in humans. Science 348, 648-660.

430. Hackinger, S., and Zeggini, E. (2017). Statistical methods to detect pleiotropy in human complex traits. Open Biol 7.

431. Chiu, C.Y., Jung, J., Chen, W., Weeks, D.E., Ren, H., Boehnke, M., Amos, C.I., Liu, A., Mills, J.L., Ting Lee, M.L., et al. (2017). Meta-analysis of quantitative pleiotropic traits for next-generation sequencing with multivariate functional linear models. Eur J Hum Genet 25, 350-359.

289 432. Porter, H.F., and O'Reilly, P.F. (2017). Multivariate simulation framework reveals performance of multi-trait GWAS methods. Sci Rep 7, 38837.

433. Boyle, E.A., Li, Y.I., and Pritchard, J.K. (2017). An Expanded View of Complex Traits: From Polygenic to Omnigenic. Cell 169, 1177-1186.

434. MacArthur, J., Bowler, E., Cerezo, M., Gil, L., Hall, P., Hastings, E., Junkins, H., McMahon, A., Milano, A., Morales, J., et al. (2017). The new NHGRI-EBI Catalog of published genome-wide association studies (GWAS Catalog). Nucleic Acids Res 45, D896-D901.

435. Kichaev, G., Roytman, M., Johnson, R., Eskin, E., Lindstrom, S., Kraft, P., and Pasaniuc, B. (2017). Improved methods for multi-trait fine mapping of pleiotropic risk loci. Bioinformatics 33, 248-255.

436. Chen, G.B., Lee, S.H., Robinson, M.R., Trzaskowski, M., Zhu, Z.X., Winkler, T.W., Day, F.R., Croteau-Chonka, D.C., Wood, A.R., Locke, A.E., et al. (2016). Across-cohort QC analyses of GWAS summary statistics from complex traits. Eur J Hum Genet 25, 137- 146.

437. van de Bunt, M., Cortes, A., Consortium, I., Brown, M.A., Morris, A.P., and McCarthy, M.I. (2015). Evaluating the Performance of Fine-Mapping Strategies at Common Variant GWAS Loci. PLoS Genet 11, e1005535.

438. Finucane, H.K., Bulik-Sullivan, B., Gusev, A., Trynka, G., Reshef, Y., Loh, P.R., Anttila, V., Xu, H., Zang, C., Farh, K., et al. (2015). Partitioning heritability by functional annotation using genome-wide association summary statistics. Nature genetics 47, 1228- 1235.

439. Cieply, B., and Carstens, R.P. (2015). Functional roles of alternative splicing factors in human disease. Wiley Interdiscip Rev RNA 6, 311-326.

440. da Costa, P.J., Menezes, J., and Romao, L. (2017). The role of alternative splicing coupled to nonsense-mediated mRNA decay in human disease. Int J Biochem Cell Biol 91, 168- 175.

441. Horne, W.C., Huang, S.C., Becker, P.S., Tang, T.K., and Benz, E.J., Jr. (1993). Tissue- specific alternative splicing of protein 4.1 inserts an exon necessary for formation of the ternary complex with erythrocyte spectrin and F-actin. Blood 82, 2558-2563.

290 442. Yang, X., Coulombe-Huntington, J., Kang, S., Sheynkman, G.M., Hao, T., Richardson, A., Sun, S., Yang, F., Shen, Y.A., Murray, R.R., et al. (2016). Widespread Expansion of Protein Interaction Capabilities by Alternative Splicing. Cell 164, 805-817.

443. MacRae, C.A., and Seidman, C.E. (2017). Closing the Genotype-Phenotype Loop for Precision Medicine. Circulation 136, 1492-1494.

444. Hormozdiari, F., Zhu, A., Kichaev, G., Ju, C.J., Segre, A.V., Joo, J.W.J., Won, H., Sankararaman, S., Pasaniuc, B., Shifman, S., et al. (2017). Widespread Allelic Heterogeneity in Complex Traits. Am J Hum Genet 100, 789-802.

445. Avery, C.L., Wassel, C.L., Richard, M.A., Highland, H.M., Bien, S., Zubair, N., Soliman, E.Z., Fornage, M., Bielinski, S.J., Tao, R., et al. (2017). Fine mapping of QT interval regions in global populations refines previously identified QT interval loci and identifies signals unique to African and Hispanic descent populations. Heart Rhythm 14, 572-580.

446. Pan, Q., Shai, O., Lee, L.J., Frey, B.J., and Blencowe, B.J. (2008). Deep surveying of alternative splicing complexity in the human transcriptome by high-throughput sequencing. Nature genetics 40, 1413-1415.

447. Wang, E.T., Sandberg, R., Luo, S., Khrebtukova, I., Zhang, L., Mayr, C., Kingsmore, S.F., Schroth, G.P., and Burge, C.B. (2008). Alternative isoform regulation in human tissue transcriptomes. Nature 456, 470-476.

448. Uhlen, M., Fagerberg, L., Hallstrom, B.M., Lindskog, C., Oksvold, P., Mardinoglu, A., Sivertsson, A., Kampf, C., Sjostedt, E., Asplund, A., et al. (2015). Proteomics. Tissue- based map of the human proteome. Science 347, 1260419.

449. Jansen, R., Hottenga, J.J., Nivard, M.G., Abdellaoui, A., Laport, B., de Geus, E.J., Wright, F.A., Penninx, B., and Boomsma, D.I. (2017). Conditional eQTL analysis reveals allelic heterogeneity of gene expression. Human molecular genetics 26, 1444-1451.

450. Wu, Y., Waite, L.L., Jackson, A.U., Sheu, W.H., Buyske, S., Absher, D., Arnett, D.K., Boerwinkle, E., Bonnycastle, L.L., Carty, C.L., et al. (2013). Trans-ethnic fine-mapping of lipid loci identifies population-specific signals and allelic heterogeneity that increases the trait variance explained. PLoS Genet 9, e1003379.

451. Patrinos, G.P., Giardine, B., Riemer, C., Miller, W., Chui, D.H., Anagnou, N.P., Wajcman, H., and Hardison, R.C. (2004). Improvements in the HbVar database of human hemoglobin variants and thalassemia mutations for population and sequence variation studies. Nucleic Acids Res 32, D537-541.

291 452. Hardison, R.C., Chui, D.H., Giardine, B., Riemer, C., Patrinos, G.P., Anagnou, N., Miller, W., and Wajcman, H. (2002). HbVar: A relational database of human hemoglobin variants and thalassemia mutations at the globin gene server. Hum Mutat 19, 225-233.

453. Hormozdiari, F., Kostem, E., Kang, E.Y., Pasaniuc, B., and Eskin, E. (2014). Identifying causal variants at loci with multiple signals of association. Genetics 198, 497-508.

454. Chen, W., Larrabee, B.R., Ovsyannikova, I.G., Kennedy, R.B., Haralambieva, I.H., Poland, G.A., and Schaid, D.J. (2015). Fine Mapping Causal Variants with an Approximate Bayesian Method Using Marginal Test Statistics. Genetics 200, 719-736.

455. Wojcik, G.L., Graff, M., Nishimura, K.K., Tao, R., Haessler, J., Gignoux, C.R., Highland, H.M., Patel, Y.M., Sorokin, E.P., Avery, C.L., et al. (2019). Genetic analyses of diverse populations improves discovery for complex traits. Nature 570, 514-518.

456. Schick, U.M., Jain, D., Hodonsky, C.J., Morrison, J.V., Davis, J.P., Brown, L., Sofer, T., Conomos, M.P., Schurmann, C., McHugh, C.P., et al. (2016). Genome-wide Association Study of Platelet Count Identifies Ancestry-Specific Loci in Hispanic/Latino Americans. Am J Hum Genet 98, 229-242.

457. Genomes Project, C., Abecasis, G.R., Auton, A., Brooks, L.D., DePristo, M.A., Durbin, R.M., Handsaker, R.E., Kang, H.M., Marth, G.T., and McVean, G.A. (2012). An integrated map of genetic variation from 1,092 human genomes. Nature 491, 56-65.

458. Horikoshi, M., Pasquali, L., Wiltshire, S., Huyghe, J.R., Mahajan, A., Asimit, J.L., Ferreira, T., Locke, A.E., Robertson, N.R., Wang, X., et al. (2016). Transancestral fine-mapping of four type 2 diabetes susceptibility loci highlights potential causal regulatory mechanisms. Human molecular genetics 25, 2070-2081.

459. Parra, E.J., Mazurek, A., Gignoux, C.R., Sockell, A., Agostino, M., Morris, A.P., Petty, L.E., Hanis, C.L., Cox, N.J., Valladares-Salgado, A., et al. (2017). Admixture mapping in two Mexican samples identifies significant associations of locus ancestry with triglyceride levels in the BUD13/ZNF259/APOA5 region and fine mapping points to rs964184 as the main driver of the association signal. PloS one 12, e0172880.

460. Stone, J.R., and Wray, G.A. (2001). Rapid evolution of cis-regulatory sequences via local point mutations. Mol Biol Evol 18, 1764-1770.

461. Payne, J.L., and Wagner, A. (2015). Mechanisms of mutational robustness in transcriptional regulation. Front Genet 6, 322.

292 462. Asimit, J.L., Hatzikotoulas, K., McCarthy, M., Morris, A.P., and Zeggini, E. (2016). Trans- ethnic study design approaches for fine-mapping. European journal of human genetics : EJHG 24, 1330-1336.

463. Kichaev, G., and Pasaniuc, B. (2015). Leveraging Functional-Annotation Data in Trans- ethnic Fine-Mapping Studies. Am J Hum Genet 97, 260-271.

464. Liu, C.T., Buchkovich, M.L., Winkler, T.W., Heid, I.M., African Ancestry Anthropometry Genetics, C., Consortium, G., Borecki, I.B., Fox, C.S., Mohlke, K.L., North, K.E., et al. (2014). Multi-ethnic fine-mapping of 14 central adiposity loci. Human molecular genetics 23, 4738-4744.

465. Zubair, N., Graff, M., Luis Ambite, J., Bush, W.S., Kichaev, G., Lu, Y., Manichaikul, A., Sheu, W.H., Absher, D., Assimes, T.L., et al. (2016). Fine-mapping of lipid regions in global populations discovers ethnic-specific signals and refines previously identified lipid loci. Human molecular genetics 25, 5500-5512.

466. Hodonsky, C.J., Kleinbrink, E.L., Charney, K.N., Prasad, M., Bessling, S.L., Jones, E.A., Srinivasan, R., Svaren, J., McCallion, A.S., and Antonellis, A. (2012). SOX10 regulates expression of the SH3-domain kinase binding protein 1 (Sh3kbp1) locus in Schwann cells via an alternative promoter. Mol Cell Neurosci 49, 85-96.

467. Vaquerizas, J.M., Kummerfeld, S.K., Teichmann, S.A., and Luscombe, N.M. (2009). A census of human transcription factors: function, expression and evolution. Nat Rev Genet 10, 252-263.

468. Lee, T.I., and Young, R.A. (2013). Transcriptional regulation and its misregulation in disease. Cell 152, 1237-1251.

469. Maurano, M.T., Humbert, R., Rynes, E., Thurman, R.E., Haugen, E., Wang, H., Reynolds, A.P., Sandstrom, R., Qu, H., Brody, J., et al. (2012). Systematic localization of common disease-associated variation in regulatory DNA. Science 337, 1190-1195.

470. Genomes Project, C., Auton, A., Brooks, L.D., Durbin, R.M., Garrison, E.P., Kang, H.M., Korbel, J.O., Marchini, J.L., McCarthy, S., McVean, G.A., et al. (2015). A global reference for human genetic variation. Nature 526, 68-74.

471. Song, L., Zhang, Z., Grasfeder, L.L., Boyle, A.P., Giresi, P.G., Lee, B.K., Sheffield, N.C., Graf, S., Huss, M., Keefe, D., et al. (2011). Open chromatin defined by DNaseI and FAIRE identifies regulatory elements that shape cell-type identity. Genome Res 21, 1757-1767.

293 472. Adzhubei, I., Jordan, D.M., and Sunyaev, S.R. (2013). Predicting functional effect of human missense mutations using PolyPhen-2. Curr Protoc Hum Genet Chapter 7, Unit7 20.

473. Wingender, E. (2008). The TRANSFAC project as an example of framework technology that supports the analysis of genomic regulation. Brief Bioinform 9, 326-332.

474. Soares, S.C., Abreu, V.A., Ramos, R.T., Cerdeira, L., Silva, A., Baumbach, J., Trost, E., Tauch, A., Hirata, R., Jr., Mattos-Guaraldi, A.L., et al. (2012). PIPS: pathogenicity island prediction software. PloS one 7, e30848.

475. Thusberg, J., Olatubosun, A., and Vihinen, M. (2011). Performance of mutation pathogenicity prediction methods on missense variants. Hum Mutat 32, 358-368.

476. Kircher, M., Witten, D.M., Jain, P., O'Roak, B.J., Cooper, G.M., and Shendure, J. (2014). A general framework for estimating the relative pathogenicity of human genetic variants. Nature genetics 46, 310-315.

477. Ward, L.D., and Kellis, M. (2016). HaploReg v4: systematic mining of putative causal variants, cell types, regulators and target genes for human complex traits and disease. Nucleic Acids Research 44, D877-D881.

478. Adams, D., Altucci, L., Antonarakis, S.E., Ballesteros, J., Beck, S., Bird, A., Bock, C., Boehm, B., Campo, E., Caricasole, A., et al. (2012). BLUEPRINT to decode the epigenetic signature written in blood. Nature biotechnology 30, 224-226.

479. Consortium, E.P. (2012). An integrated encyclopedia of DNA elements in the human genome. Nature 489, 57-74.

480. Pradel, L.C., Vanhille, L., and Spicuglia, S. (2015). [The European Blueprint project: towards a full epigenome characterization of the immune system]. Med Sci (Paris) 31, 236-238.

481. Boyle, A.P., Hong, E.L., Hariharan, M., Cheng, Y., Schaub, M.A., Kasowski, M., Karczewski, K.J., Park, J., Hitz, B.C., Weng, S., et al. (2012). Annotation of functional variation in personal genomes using RegulomeDB. Genome Res 22, 1790-1797.

482. Manrai, A.K., Ioannidis, J.P., and Kohane, I.S. (2016). Clinical Genomics: From Pathogenicity Claims to Quantitative Risk Estimates. JAMA 315, 1233-1234.

294 483. Morris, A.P. (2011). Transethnic meta-analysis of genomewide association studies. Genet Epidemiol 35, 809-822.

484. Florez, J.C. (2017). Mining the Genome for Therapeutic Targets. Diabetes 66, 1770-1778.

485. Nandakumar, P., Lee, D., Richard, M.A., Tekola-Ayele, F., Tayo, B.O., Ware, E., Sung, Y.J., Salako, B., Ogunniyi, A., Gu, C.C., et al. (2017). Rare coding variants associated with blood pressure variation in 15 914 individuals of African ancestry. J Hypertens 35, 1381-1389.

486. Kilaru, V., Iyer, S.V., Almli, L.M., Stevens, J.S., Lori, A., Jovanovic, T., Ely, T.D., Bradley, B., Binder, E.B., Koen, N., et al. (2016). Genome-wide gene-based analysis suggests an association between Neuroligin 1 (NLGN1) and post-traumatic stress disorder. Transl Psychiatry 6, e820.

487. Lee, H.S., Kim, Y., and Park, T. (2018). New Common and Rare Variants Influencing Metabolic Syndrome and Its Individual Components in a Korean Population. Sci Rep 8, 5701.

488. Teslovich, T.M., Kim, D.S., Yin, X., Stancakova, A., Jackson, A.U., Wielscher, M., Naj, A., Perry, J.R.B., Huyghe, J.R., Stringham, H.M., et al. (2018). Identification of seven novel loci associated with amino acid levels using single-variant and gene-based tests in 8545 Finnish men from the METSIM study. Human molecular genetics 27, 1664-1674.

489. Andaleon, A., Mogil, L.S., and Wheeler, H.E. (2018). Gene-based association study for lipid traits in diverse cohorts implicates BACE1 and SIDT2 regulation in triglyceride levels. PeerJ 6, e4314.

490. Mishra, A., Ferrari, R., Heutink, P., Hardy, J., Pijnenburg, Y., Posthuma, D., and International, F.T.D.G.C. (2017). Gene-based association studies report genetic links for clinical subtypes of frontotemporal dementia. Brain 140, 1437-1446.

491. Moutsianas, L., Agarwala, V., Fuchsberger, C., Flannick, J., Rivas, M.A., Gaulton, K.J., Albers, P.K., Go, T.D.C., McVean, G., Boehnke, M., et al. (2015). The power of gene- based rare variant methods to detect disease-associated variation and test hypotheses about complex disease. PLoS Genet 11, e1005165.

492. Lee, S., Abecasis, G.R., Boehnke, M., and Lin, X. (2014). Rare-variant association analysis: study designs and statistical tests. Am J Hum Genet 95, 5-23.

295 493. Raffield, L.M., Zakai, N.A., Duan, Q., Laurie, C., Smith, J.D., Irvin, M.R., Doyle, M.F., Naik, R.P., Song, C., Manichaikul, A.W., et al. (2017). D-Dimer in African Americans: Whole Genome Sequence Analysis and Relationship to Cardiovascular Disease Risk in the Jackson Heart Study. Arterioscler Thromb Vasc Biol 37, 2220-2227.

494. Igo, R.P., Jr., Cooke Bailey, J.N., Romm, J., Haines, J.L., and Wiggs, J.L. (2016). Quality Control for the Illumina HumanExome BeadChip. Curr Protoc Hum Genet 90, 2 14 11- 12 14 16.

495. Bien, S.A., Wojcik, G.L., Zubair, N., Gignoux, C.R., Martin, A.R., Kocarnik, J.M., Martin, L.W., Buyske, S., Haessler, J., Walker, R.W., et al. (2016). Strategies for Enriching Variant Coverage in Candidate Disease Loci on a Multiethnic Genotyping Array. PloS one 11, e0167758.

496. Kim, H., Grueneberg, A., Vazquez, A.I., Hsu, S., and de Los Campos, G. (2017). Will Big Data Close the Missing Heritability Gap? Genetics.

497. Wright, F.A., Sullivan, P.F., Brooks, A.I., Zou, F., Sun, W., Xia, K., Madar, V., Jansen, R., Chung, W., Zhou, Y.H., et al. (2014). Heritability and genomics of gene expression in peripheral blood. Nature genetics 46, 430-437.

498. Mo, X.B., Lu, X., Zhang, Y.H., Zhang, Z.L., Deng, F.Y., and Lei, S.F. (2015). Gene-based association analysis identified novel genes associated with bone mineral density. PloS one 10, e0121811.

499. Wu, M.C., Lee, S., Cai, T., Li, Y., Boehnke, M., and Lin, X. (2011). Rare-variant association testing for sequencing data with the sequence kernel association test. Am J Hum Genet 89, 82-93.

500. Bryc, K., Durand, E.Y., Macpherson, J.M., Reich, D., and Mountain, J.L. (2015). The genetic ancestry of African Americans, Latinos, and European Americans across the United States. Am J Hum Genet 96, 37-53.

501. International HapMap, C., Altshuler, D.M., Gibbs, R.A., Peltonen, L., Altshuler, D.M., Gibbs, R.A., Peltonen, L., Dermitzakis, E., Schaffner, S.F., Yu, F., et al. (2010). Integrating common and rare genetic variation in diverse human populations. Nature 467, 52-58.

502. Mathieson, I., and McVean, G. (2014). Demography and the age of rare variants. PLoS Genet 10, e1004528.

296 503. Zhao, G., Yu, D., and Weiss, M.J. (2010). MicroRNAs in erythropoiesis. Curr Opin Hematol 17, 155-162.

504. Imerslund, O., and Bjornstad, P. (1963). Familial Vitamin B12 Malabsorption. Acta Haematol 30, 1-7.

505. Aminoff, M., Tahvanainen, E., Grasbeck, R., Weissenbach, J., Broch, H., and de la Chapelle, A. (1995). Selective intestinal malabsorption of vitamin B12 displays recessive mendelian inheritance: assignment of a locus to chromosome 10 by linkage. Am J Hum Genet 57, 824-831.

506. Oppenheimer, S.J., Higgs, D.R., Weatherall, D.J., Barker, J., and Spark, R.A. (1984). Alpha thalassaemia in Papua New Guinea. Lancet 1, 424-426.

507. O'Donnell, A., Raiko, A., Clegg, J.B., Weatherall, D.J., and Allen, S.J. (2006). Alpha+ - thalassaemia and pregnancy in a malaria endemic region of Papua New Guinea. Br J Haematol 135, 235-241.

508. Werling, D.M., Brand, H., An, J.Y., Stone, M.R., Zhu, L., Glessner, J.T., Collins, R.L., Dong, S., Layer, R.M., Markenscoff-Papadimitriou, E., et al. (2018). An analytical framework for whole-genome sequence association studies and its implications for autism spectrum disorder. Nature genetics 50, 727-736.

509. Bomba, L., Walter, K., and Soranzo, N. (2017). The impact of rare and low-frequency genetic variants in common disease. Genome biology 18, 77.

510. Dewey, F.E., Murray, M.F., Overton, J.D., Habegger, L., Leader, J.B., Fetterolf, S.N., O'Dushlaine, C., Van Hout, C.V., Staples, J., Gonzaga-Jauregui, C., et al. (2016). Distribution and clinical impact of functional variants in 50,726 whole-exome sequences from the DiscovEHR study. Science 354.

511. Peloso, G.M., Lange, L.A., Varga, T.V., Nickerson, D.A., Smith, J.D., Griswold, M.E., Musani, S., Polfus, L.M., Mei, H., Gabriel, S., et al. (2016). Association of Exome Sequences With Cardiovascular Traits Among Blacks in the Jackson Heart Study. Circ Cardiovasc Genet 9, 368-374.

512. Chou, W.C., Zheng, H.F., Cheng, C.H., Yan, H., Wang, L., Han, F., Richards, J.B., Karasik, D., Kiel, D.P., and Hsu, Y.H. (2016). A combined reference panel from the 1000 Genomes and UK10K projects improved rare variant imputation in European and Chinese samples. Sci Rep 6, 39313.

297 513. Fabbri, C., Tansey, K.E., Perlis, R.H., Hauser, J., Henigsberg, N., Maier, W., Mors, O., Placentino, A., Rietschel, M., Souery, D., et al. (2018). New insights into the pharmacogenomics of antidepressant response from the GENDEP and STAR*D studies: rare variant analysis and high-density imputation. Pharmacogenomics J 18, 413-421.

514. Do, A.N., Zhao, W., Srinivasasainagendra, V., Aslibekyan, S., Tiwari, H.K., Limdi, N., Shah, S.J., Zhi, D., Broeckel, U., Gu, C.C., et al. (2017). Whole exome analyses to examine the impact of rare variants on left ventricular traits in African American participants from the HyperGEN and GENOA studies. J Hypertens Manag 3.

515. Auer, P.L., Johnsen, J.M., Johnson, A.D., Logsdon, B.A., Lange, L.A., Nalls, M.A., Zhang, G., Franceschini, N., Fox, K., Lange, E.M., et al. (2012). Imputation of exome sequence variants into population- based samples and blood-cell-trait-associated loci in African Americans: NHLBI GO Exome Sequencing Project. Am J Hum Genet 91, 794-808.

516. Benn-Torres, J., Bonilla, C., Robbins, C.M., Waterman, L., Moses, T.Y., Hernandez, W., Santos, E.R., Bennett, F., Aiken, W., Tullock, T., et al. (2008). Admixture and population stratification in African Caribbean populations. Ann Hum Genet 72, 90-98.

517. Zakharia, F., Basu, A., Absher, D., Assimes, T.L., Go, A.S., Hlatky, M.A., Iribarren, C., Knowles, J.W., Li, J., Narasimhan, B., et al. (2009). Characterizing the admixed African ancestry of African Americans. Genome biology 10, R141.

518. Hinch, A.G., Tandon, A., Patterson, N., Song, Y., Rohland, N., Palmer, C.D., Chen, G.K., Wang, K., Buxbaum, S.G., Akylbekova, E.L., et al. (2011). The landscape of recombination in African Americans. Nature 476, 170-175.

519. Ma, C., Boehnke, M., Lee, S., and Go, T.D.I. (2015). Evaluating the Calibration and Power of Three Gene-Based Association Tests of Rare Variants for the X Chromosome. Genet Epidemiol 39, 499-508.

520. Sazonovs, A., and Barrett, J.C. (2018). Rare-Variant Studies to Complement Genome-Wide Association Studies. Annu Rev Genomics Hum Genet.

521. Li, X., Kim, Y., Tsang, E.K., Davis, J.R., Damani, F.N., Chiang, C., Hess, G.T., Zappala, Z., Strober, B.J., Scott, A.J., et al. (2017). The impact of rare variation on gene expression across tissues. Nature 550, 239-243.

522. Castro-Sanchez, S., Alvarez-Satta, M., Tohamy, M.A., Beltran, S., Derdak, S., and Valverde, D. (2017). Whole exome sequencing as a diagnostic tool for patients with ciliopathy-like phenotypes. PloS one 12, e0183081.

298 523. James, R.A., Campbell, I.M., Chen, E.S., Boone, P.M., Rao, M.A., Bainbridge, M.N., Lupski, J.R., Yang, Y., Eng, C.M., Posey, J.E., et al. (2016). A visual and curatorial approach to clinical variant prioritization and disease gene discovery in genome-wide diagnostics. Genome Med 8, 13.

524. Global Burden of Disease, C., Adolescent Health, C., Kassebaum, N., Kyu, H.H., Zoeckler, L., Olsen, H.E., Thomas, K., Pinho, C., Bhutta, Z.A., Dandona, L., et al. (2017). Child and Adolescent Health From 1990 to 2015: Findings From the Global Burden of Diseases, Injuries, and Risk Factors 2015 Study. JAMA Pediatr 171, 573-592.

525. Stoltzfus, R.J. (2003). Iron deficiency: global prevalence and consequences. Food Nutr Bull 24, S99-103.

526. Kassebaum, N.J., and Collaborators, G.B.D.A. (2016). The Global Burden of Anemia. Hematol Oncol Clin North Am 30, 247-308.

527. He, Z., Bian, J., Carretta, H.J., Lee, J., Hogan, W.R., Shenkman, E., and Charness, N. (2018). Prevalence of Multiple Chronic Conditions Among Older Adults in Florida and the United States: Comparative Analysis of the OneFlorida Data Trust and National Inpatient Sample. J Med Internet Res 20, e137.

528. Xu, R.B., Jin, D.Y., Song, Y., Wang, X.J., Dong, Y.H., Yang, Z.G., Chen, Y.J., and Ma, J. (2017). [Study on the disease burden of Chinese adolescent in 2015]. Zhonghua Yu Fang Yi Xue Za Zhi 51, 910-914.

529. Rodriguez-Carrio, J., Alperi-Lopez, M., Lopez, P., Alonso-Castro, S., Ballina-Garcia, F.J., and Suarez, A. (2015). Red cell distribution width is associated with cardiovascular risk and disease parameters in rheumatoid arthritis. Rheumatology (Oxford) 54, 641-646.

530. Anker, S.D., Voors, A., Okonko, D., Clark, A.L., James, M.K., von Haehling, S., Kjekshus, J., Ponikowski, P., Dickstein, K., and Investigators, O. (2009). Prevalence, incidence, and prognostic value of anaemia in patients after an acute myocardial infarction: data from the OPTIMAAL trial. Eur Heart J 30, 1331-1339.

531. Bal, Z., Bal, U., Okyay, K., Yilmaz, M., Balcioglu, S., Turgay, O., Hasirci, S., Aydinalp, A., Yildirir, A., Sezer, S., et al. (2015). Hematological parameters can predict the extent of coronary artery disease in patients with end-stage renal disease. Int Urol Nephrol 47, 1719-1725.

299 532. Hu, Z.D., Chen, Y., Zhang, L., Sun, Y., Huang, Y.L., Wang, Q.Q., Xu, Y.L., Chen, S.X., Qin, Q., and Deng, A.M. (2013). Red blood cell distribution width is a potential index to assess the disease activity of systemic lupus erythematosus. Clin Chim Acta 425, 202- 205.

533. Turcato, G., Cappellari, M., Follador, L., Dilda, A., Bonora, A., Zannoni, M., Bovo, C., Ricci, G., Bovi, P., and Lippi, G. (2017). Red Blood Cell Distribution Width Is an Independent Predictor of Outcome in Patients Undergoing Thrombolysis for Ischemic Stroke. Semin Thromb Hemost 43, 30-35.

534. Kanno, H., Fujii, H., Hirono, A., and Miwa, S. (1991). cDNA cloning of human R-type pyruvate kinase and identification of a single amino acid substitution (Thr384----Met) affecting enzymatic stability in a pyruvate kinase variant (PK Tokyo) associated with hereditary hemolytic anemia. Proc Natl Acad Sci U S A 88, 8218-8221.

535. Nuinoon, M., Makarasara, W., Mushiroda, T., Setianingsih, I., Wahidiyat, P.A., Sripichai, O., Kumasaka, N., Takahashi, A., Svasti, S., Munkongdee, T., et al. (2010). A genome- wide association identified the common genetic variants influence disease severity in beta0-thalassemia/hemoglobin E. Human genetics 127, 303-314.

536. Kichaev, G., Yang, W.Y., Lindstrom, S., Hormozdiari, F., Eskin, E., Price, A.L., Kraft, P., and Pasaniuc, B. (2014). Integrating functional data to prioritize causal variants in statistical fine-mapping studies. PLoS Genet 10, e1004722.

537. Manrai, A.K., Funke, B.H., Rehm, H.L., Olesen, M.S., Maron, B.A., Szolovits, P., Margulies, D.M., Loscalzo, J., and Kohane, I.S. (2016). Genetic Misdiagnoses and the Potential for Health Disparities. N Engl J Med 375, 655-665.

538. Deisseroth, A., Velez, R., and Nienhuis, A.W. (1976). Hemoglobin synthesis in somatic cell hybrids: independent segregation of the human alpha- and beta-globin genes. Science 191, 1262-1264.

539. Hagge, W.W., Holley, K.E., Burke, E.C., and Stickler, G.B. (1967). Hemolytic-uremic syndrome in two siblings. N Engl J Med 277, 138-139.

540. Warwicker, P., Goodship, T.H., Donne, R.L., Pirson, Y., Nicholls, A., Ward, R.M., Turnpenny, P., and Goodship, J.A. (1998). Genetic studies into inherited and sporadic hemolytic uremic syndrome. Kidney Int 53, 836-844.

300 541. Baughan, M.A., Valentine, W.N., Paglia, D.E., Ways, P.O., Simons, E.R., and DeMarsh, Q.B. (1968). Hereditary hemolytic anemia associated with glucosephosphate isomerase (GPI) deficiency--a new enzyme defect of human erythrocytes. Blood 32, 236-249.

542. Wendt, F., and Heimpel, H. (1967). [Congenital dyserythropoietic anemia in a pair of dizygotic twins]. Med Klin 62, 172-177.

543. Tamary, H., Shalmon, L., Shalev, H., Halil, A., Dobrushin, D., Ashkenazi, N., Zoldan, M., Resnitzky, P., Korostishevsky, M., Bonne-Tamir, B., et al. (1998). Localization of the gene for congenital dyserythropoietic anemia type I to a <1-cM interval on chromosome 15q15.1-15.3. Am J Hum Genet 62, 1062-1069.

544. Glader, B.E., Fortier, N., Albala, M.M., and Nathan, D.G. (1974). Congenital hemolytic anemia associated with dehydrated erythrocytes and increased potassium loss. N Engl J Med 291, 491-496.

545. Glogowska, E., Lezon-Geyda, K., Maksimova, Y., Schulz, V.P., and Gallagher, P.G. (2015). Mutations in the Gardos channel (KCNN4) are associated with hereditary xerocytosis. Blood 126, 1281-1284.

546. Rapetti-Mauss, R., Lacoste, C., Picard, V., Guitton, C., Lombard, E., Loosveld, M., Nivaggioni, V., Dasilva, N., Salgado, D., Desvignes, J.P., et al. (2015). A mutation in the Gardos channel is associated with hereditary xerocytosis. Blood 126, 1273-1280.

547. Thompson, A.R., Wood, W.G., and Stamatoyannopoulos, G. (1977). X-linked syndrome of platelet dysfunction, thrombocytopenia, and imbalanced globin chain synthesis with hemolysis. Blood 50, 303-316.

548. Yu, C., Niakan, K.K., Matsushita, M., Stamatoyannopoulos, G., Orkin, S.H., and Raskind, W.H. (2002). X-linked thrombocytopenia with thalassemia from a mutation in the amino finger of GATA-1 affecting DNA binding rather than FOG-1 interaction. Blood 100, 2040-2045.

549. Morton, N.E. (1956). The detection and estimation of linkage between the genes for elliptocytosis and the Rh blood type. Am J Hum Genet 8, 80-96.

550. Alloisio, N., Dorleac, E., Girot, R., and Delaunay, J. (1981). Analysis of the red cell membrane in a family with hereditary elliptocytosis--total or partial of protein 4.1. Human genetics 59, 68-71.

301 551. Farquhar, J.W., and Claireaux, A.E. (1952). Familial haemophagocytic reticulosis. Arch Dis Child 27, 519-525.

552. Ohadi, M., Lalloz, M.R., Sham, P., Zhao, J., Dearlove, A.M., Shiach, C., Kinsey, S., Rhodes, M., and Layton, D.M. (1999). Localization of a gene for familial hemophagocytic lymphohistiocytosis at chromosome 9q21.3-22 by homozygosity mapping. Am J Hum Genet 64, 165-171.

553. Dufourcq-Lagelouse, R., Jabado, N., Le Deist, F., Stephan, J.L., Souillet, G., Bruin, M., Vilmer, E., Schneider, M., Janka, G., Fischer, A., et al. (1999). Linkage of familial hemophagocytic lymphohistiocytosis to 10q21-22 and evidence for heterogeneity. Am J Hum Genet 64, 172-179.

554. Alving, A.S., Carson, P.E., Flanagan, C.L., and Ickes, C.E. (1956). Enzymatic deficiency in primaquine-sensitive erythrocytes. Science 124, 484-485.

555. Szeinberg, A., Gavendo, S., and Cahane, D. (1969). Erythrocyte adenylate-kinase deficiency. Lancet 1, 315-316.

556. Matsuura, S., Igarashi, M., Tanizawa, Y., Yamada, M., Kishi, F., Kajii, T., Fujii, H., Miwa, S., Sakurai, M., and Nakazawa, A. (1989). Human adenylate kinase deficiency associated with hemolytic anemia. A single base substitution affecting solubility and catalytic activity of the cytosolic adenylate kinase. J Biol Chem 264, 10148-10155.

557. Zarkowsky, H.S., Mohandas, N., Speaker, C.B., and Shohet, S.B. (1975). A congenital haemolytic anaemia with thermal sensitivity of the erythrocyte membrane. Br J Haematol 29, 537-543.

558. Gallagher, P.G., Tse, W.T., Coetzer, T., Lecomte, M.C., Garbarz, M., Zarkowsky, H.S., Baruchel, A., Ballas, S.K., Dhermy, D., Palek, J., et al. (1992). A common type of the spectrin alpha I 46-50a-kD peptide abnormality in hereditary elliptocytosis and pyropoikilocytosis is associated with a mutation distant from the proteolytic cleavage site. Evidence for the functional importance of the triple helical model of spectrin. J Clin Invest 89, 892-898.

559. Morton, N.E., Mackinney, A.A., Kosower, N., Schilling, R.F., and Gray, M.P. (1962). Genetics of spherocytosis. Am J Hum Genet 14, 170-184.

560. Kimberling, W.J., Fulbeck, T., Dixon, L., and Lubs, H.A. (1975). Localization of spherocytosis to chromosome 8 or 12 and report of a family with spherocytosis and a reciprocal translocation. Am J Hum Genet 27, 586-594.

302 561. Kannengiesser, C., Sanchez, M., Sweeney, M., Hetet, G., Kerr, B., Moran, E., Fuster Soler, J.L., Maloum, K., Matthes, T., Oudot, C., et al. (2011). Missense SLC25A38 variations play an important role in autosomal recessive inherited sideroblastic anemia. Haematologica 96, 808-813.

562. Cox, T.C., Kozman, H.M., Raskind, W.H., May, B.K., and Mulley, J.C. (1992). Identification of a highly polymorphic marker within intron 7 of the ALAS2 gene and suggestion of at least two loci for X-linked sideroblastic anemia. Human molecular genetics 1, 639-641.

563. Shimada, Y., Okuno, S., Kawai, A., Shinomiya, H., Saito, A., Suzuki, M., Omori, Y., Nishino, N., Kanemoto, N., Fujiwara, T., et al. (1998). Cloning and chromosomal mapping of a novel ABC transporter gene (hABC7), a candidate for X-linked sideroblastic anemia with spinocerebellar ataxia. J Hum Genet 43, 115-122.

564. Shahidi, N.T., Nathan, D.G., and Diamond, L.K. (1964). Iron Deficiency Anemia Associated with an Error of Iron Metabolism in Two Siblings. J Clin Invest 43, 510-521.

565. Mims, M.P., Guan, Y., Pospisilova, D., Priwitzerova, M., Indrak, K., Ponka, P., Divoky, V., and Prchal, J.T. (2005). Identification of a human mutation of DMT1 in a patient with microcytic anemia and iron overload. Blood 105, 1337-1342.

566. Buchanan, G.R., and Sheehan, R.G. (1981). Malabsorption and defective utilization of iron in three siblings. J Pediatr 98, 723-728.

567. Finberg, K.E., Heeney, M.M., Campagna, D.R., Aydinok, Y., Pearson, H.A., Hartman, K.R., Mayo, M.M., Samuel, S.M., Strouse, J.J., Markianos, K., et al. (2008). Mutations in TMPRSS6 cause iron-refractory iron deficiency anemia (IRIDA). Nature genetics 40, 569-571.

568. El-Shanti, H., and Ferguson, P. (1993). Majeed Syndrome. In GeneReviews(R), R.A. Pagon, M.P. Adam, H.H. Ardinger, S.E. Wallace, A. Amemiya, L.J.H. Bean, T.D. Bird, N. Ledbetter, H.C. Mefford, R.J.H. Smith, et al., eds. (Seattle (WA).

569. Grasbeck, R. (1960). [Familiar selective vitamin B12 malabsorption with proteinuria. A pernicious anemia-like syndrome]. Nord Med 63, 322-323.

570. Porter, F.S., Rogers, L.E., and Sidbury, J.B., Jr. (1969). Thiamine-responsive megaloblastic anemia. J Pediatr 74, 494-504.

303 571. Neufeld, E.J., Mandel, H., Raz, T., Szargel, R., Yandava, C.N., Stagg, A., Faure, S., Barrett, T., Buist, N., and Cohen, N. (1997). Localization of the gene for thiamine-responsive megaloblastic anemia syndrome, on the long arm of chromosome 1, by homozygosity mapping. Am J Hum Genet 61, 1335-1341.

572. Nishimura, N., Deki, T., Kato, S. (1950). Hypertrophic pulmonary osteo-arthropathy with pernio-like eruption in the two families: report of the three cases. (Japanese). Jpn J Derm Venereol 60, 6.

573. Garg, A., Hernandez, M.D., Sousa, A.B., Subramanyam, L., Martinez de Villarreal, L., dos Santos, H.G., and Barboza, O. (2010). An autosomal recessive syndrome of joint contractures, muscular atrophy, microcytic anemia, and panniculitis-associated lipodystrophy. J Clin Endocrinol Metab 95, E58-63.

574. Agarwal, A.K., Xing, C., DeMartino, G.N., Mizrachi, D., Hernandez, M.D., Sousa, A.B., Martinez de Villarreal, L., dos Santos, H.G., and Garg, A. (2010). PSMB8 encoding the beta5i proteasome subunit is mutated in joint contractures, muscle atrophy, microcytic anemia, and panniculitis-induced lipodystrophy syndrome. Am J Hum Genet 87, 866- 872.

575. Valentine, W.N., Tanaka, K.R., and Miwa, S. (1961). A specific erythrocyte glycolytic enzyme defect (pyruvate kinase) in three subjects with congenital non-spherocytic hemolytic anemia. Trans Assoc Am Physicians 74, 100-110.

576. Larochelle, A., Magny, P., Tremblay, S., and De Medicis, E. (1999). Pyruvate Kinase Deficiency which Causes Nonspherocytic Hemolytic Anemia: The Gene and its Mutations. Hematology 4, 77-87.

577. Kraus, A.P., Langston, M.F., Jr., and Lynch, B.L. (1968). Red cell phosphoglycerate kinase deficiency. A new cause of non-spherocytic hemolytic anemia. Biochem Biophys Res Commun 30, 173-177.

578. Fujii, H., and Yoshida, A. (1980). Molecular abnormality of phosphoglycerate kinase- Uppsala associated with chronic nonspherocytic hemolytic anemia. Proc Natl Acad Sci U S A 77, 5461-5465.

579. Upshaw, J.D., Jr. (1978). Congenital deficiency of a factor in normal plasma that reverses microangiopathic hemolysis and thrombocytopenia. N Engl J Med 298, 1350-1352.

304 580. Levy, G.G., Nichols, W.C., Lian, E.C., Foroud, T., McClintick, J.N., McGee, B.M., Yang, A.Y., Siemieniak, D.R., Stark, K.R., Gruppo, R., et al. (2001). Mutations in a member of the ADAMTS gene family cause thrombotic thrombocytopenic purpura. Nature 413, 488- 494.

581. Vaziri, N.D. (2001). Cardiovascular effects of erythropoietin and anemia correction. Curr Opin Nephrol Hypertens 10, 633-637.

582. Robles, N.R. (2016). The Safety of Erythropoiesis-Stimulating Agents for the Treatment of Anemia Resulting from Chronic Kidney Disease. Clin Drug Investig 36, 421-431.

583. Piloto, N., Teixeira, H.M., Teixeira-Lemos, E., Parada, B., Garrido, P., Sereno, J., Pinto, R., Carvalho, L., Costa, E., Belo, L., et al. (2009). Erythropoietin promotes deleterious cardiovascular effects and mortality risk in a rat model of chronic sports doping. Cardiovasc Toxicol 9, 201-210.

584. Yasukochi, Y., Sakuma, J., Takeuchi, I., Kato, K., Oguri, M., Fujimaki, T., Horibe, H., and Yamada, Y. (2018). Identification of nine novel loci related to hematological traits in a Japanese population. Physiol Genomics 50, 758-769.

585. (1989). The Atherosclerosis Risk in Communities (ARIC) Study: design and objectives. The ARIC investigators. American journal of epidemiology 129, 687-702.

586. McCarty, C.A., Chisholm, R.L., Chute, C.G., Kullo, I.J., Jarvik, G.P., Larson, E.B., Li, R., Masys, D.R., Ritchie, M.D., Roden, D.M., et al. (2011). The eMERGE Network: a consortium of biorepositories linked to electronic medical records data for conducting genomic studies. BMC medical genomics 4, 13.

587. Preuss, M., Konig, I.R., Thompson, J.R., Erdmann, J., Absher, D., Assimes, T.L., Blankenberg, S., Boerwinkle, E., Chen, L., Cupples, L.A., et al. (2010). Design of the Coronary ARtery DIsease Genome-Wide Replication And Meta-Analysis (CARDIoGRAM) Study: A Genome-wide association meta-analysis involving more than 22 000 cases and 60 000 controls. Circ Cardiovasc Genet 3, 475-483.

588. Lavange, L.M., Kalsbeek, W.D., Sorlie, P.D., Aviles-Santa, L.M., Kaplan, R.C., Barnhart, J., Liu, K., Giachello, A., Lee, D.J., Ryan, J., et al. (2010). Sample design and cohort selection in the Hispanic Community Health Study/Study of Latinos. Ann Epidemiol 20, 642-649.

589. (1998). Design of the Women's Health Initiative clinical trial and observational study. The Women's Health Initiative Study Group. Controlled clinical trials 19, 61-109.

305 590. Curb, J.D., McTiernan, A., Heckbert, S.R., Kooperberg, C., Stanford, J., Nevitt, M., Johnson, K.C., Proulx-Burns, L., Pastore, L., Criqui, M., et al. (2003). Outcomes ascertainment and adjudication methods in the Women's Health Initiative. Ann Epidemiol 13, S122-128.

591. Battle, A., Khan, Z., Wang, S.H., Mitrano, A., Ford, M.J., Pritchard, J.K., and Gilad, Y. (2015). Genomic variation. Impact of regulatory variation from RNA to protein. Science 347, 664-667.

592. Liu, B., and Taioli, E. (2015). Seasonal Variations of Complete Blood Count and Inflammatory Biomarkers in the US Population - Analysis of NHANES Data. PloS one 10, e0142382.

593. Conomos, M.P., Miller, M.B., and Thornton, T.A. (2015). Robust inference of population structure for ancestry prediction and correction of stratification in the presence of relatedness. Genet Epidemiol 39, 276-293.

594. Li, Y.R., and Keating, B.J. (2014). Trans-ethnic genome-wide association studies: advantages and challenges of mapping in diverse populations. Genome Med 6, 91.

595. Thornton, T., Conomos, M.P., Sverdlov, S., Blue, E.M., Cheung, C.Y., Glazner, C.G., Lewis, S.M., and Wijsman, E.M. (2014). Estimating and adjusting for ancestry admixture in statistical methods for relatedness inference, heritability estimation, and association testing. BMC Proc 8, S5.

596. Marchini, J., Cardon, L.R., Phillips, M.S., and Donnelly, P. (2004). The effects of human population structure on large genetic association studies. Nature genetics 36, 512-517.

597. Buyske, S., Wu, Y., Carty, C.L., Cheng, I., Assimes, T.L., Dumitrescu, L., Hindorff, L.A., Mitchell, S., Ambite, J.L., Boerwinkle, E., et al. (2012). Evaluation of the metabochip genotyping array in African Americans and implications for fine mapping of GWAS- identified loci: the PAGE study. PloS one 7, e35651.

598. Jeff, J.M., Armstrong, L.L., Ritchie, M.D., Denny, J.C., Kho, A.N., Basford, M.A., Wolf, W.A., Pacheco, J.A., Li, R., Chisholm, R.L., et al. (2014). Admixture mapping and subsequent fine-mapping suggests a biologically relevant and novel association on chromosome 11 for type 2 diabetes in African Americans. PloS one 9, e86931.

306 599. Browning, S.R., Grinde, K., Plantinga, A., Gogarten, S.M., Stilp, A.M., Kaplan, R.C., Aviles-Santa, M.L., Browning, B.L., and Laurie, C.C. (2016). Local Ancestry Inference in a Large US-Based Hispanic/Latino Study: Hispanic Community Health Study/Study of Latinos (HCHS/SOL). G3 (Bethesda) 6, 1525-1534.

600. Conomos, M.P., Reiner, A.P., Weir, B.S., and Thornton, T.A. (2016). Model-free Estimation of Recent Genetic Relatedness. Am J Hum Genet 98, 127-148.

601. Patterson, N., Moorjani, P., Luo, Y., Mallick, S., Rohland, N., Zhan, Y., Genschoreck, T., Webster, T., and Reich, D. (2012). Ancient admixture in human history. Genetics 192, 1065-1093.

602. Pei, Y.F., Tian, Q., Zhang, L., and Deng, H.W. (2016). Exploring the Major Sources and Extent of Heterogeneity in a Genome-Wide Association Meta-Analysis. Ann Hum Genet 80, 113-122.

603. Shi, J., and Lee, S. (2016). A novel random effect model for GWAS meta-analysis and its application to trans-ethnic meta-analysis. Biometrics 72, 945-954.

604. Willer, C.J., Li, Y., and Abecasis, G.R. (2010). METAL: fast and efficient meta-analysis of genomewide association scans. Bioinformatics 26, 2190-2191.

605. He, Q., Avery, C.L., and Lin, D.Y. (2013). A general framework for association tests with multivariate traits in large-scale genomics studies. Genetic epidemiology 37, 759-767.

606. Zhu, X., Feng, T., Tayo, B.O., Liang, J., Young, J.H., Franceschini, N., Smith, J.A., Yanek, L.R., Sun, Y.V., Edwards, T.L., et al. (2015). Meta-analysis of correlated traits via summary statistics from GWASs with an application in hypertension. American journal of human genetics 96, 21-36.

607. Liu, Z., and Lin, X. (2018). Multiple phenotype association tests using summary statistics in genome-wide association studies. Biometrics 74, 165-175.

608. McLaren, W., Gil, L., Hunt, S.E., Riat, H.S., Ritchie, G.R., Thormann, A., Flicek, P., and Cunningham, F. (2016). The Ensembl Variant Effect Predictor. Genome biology 17, 122.

609. Yourshaw, M., Taylor, S.P., Rao, A.R., Martin, M.G., and Nelson, S.F. (2015). Rich annotation of DNA sequencing variants by leveraging the Ensembl Variant Effect Predictor with plugins. Brief Bioinform 16, 255-264.

307 610. Gamazon, E.R., Wheeler, H.E., Shah, K.P., Mozaffari, S.V., Aquino-Michaels, K., Carroll, R.J., Eyler, A.E., Denny, J.C., Consortium, G.T., Nicolae, D.L., et al. (2015). A gene- based association method for mapping traits using reference transcriptome data. Nature genetics 47, 1091-1098.

611. Martens, J.H., and Stunnenberg, H.G. (2013). BLUEPRINT: mapping human blood cell epigenomes. Haematologica 98, 1487-1489.

612. Asimit, J., and Zeggini, E. (2011). Testing for rare variant associations in complex diseases. Genome Medicine 3.

613. Liu, D.J., and Leal, S.M. (2010). A novel adaptive method for the analysis of next- generation sequencing data to detect complex trait associations with rare variants due to gene main effects and interactions. PLoS Genet 6, e1001156.

614. Wei, P., Cao, Y., Zhang, Y., Xu, Z., Kwak, I.Y., Boerwinkle, E., and Pan, W. (2016). On Robust Association Testing for Quantitative Traits and Rare Variants. G3 (Bethesda) 6, 3941-3950.

615. Lee, S., Emond, M.J., Bamshad, M.J., Barnes, K.C., Rieder, M.J., Nickerson, D.A., Team, N.G.E.S.P.-E.L.P., Christiani, D.C., Wurfel, M.M., and Lin, X. (2012). Optimal unified approach for rare-variant association testing with application to small-sample case- control whole-exome sequencing studies. Am J Hum Genet 91, 224-237.

616. Lin, D.Y., Tao, R., Kalsbeek, W.D., Zeng, D., Gonzalez, F., 2nd, Fernandez-Rhodes, L., Graff, M., Koch, G.G., North, K.E., and Heiss, G. (2014). Genetic association analysis under complex survey sampling: the Hispanic Community Health Study/Study of Latinos. Am J Hum Genet 95, 675-688.

617. Rentzsch, P., Witten, D., Cooper, G.M., Shendure, J., and Kircher, M. (2019). CADD: predicting the deleteriousness of variants throughout the human genome. Nucleic Acids Res 47, D886-D894.

618. Szklarczyk, D., Gable, A.L., Lyon, D., Junge, A., Wyder, S., Huerta-Cepas, J., Simonovic, M., Doncheva, N.T., Morris, J.H., Bork, P., et al. (2019). STRING v11: protein-protein association networks with increased coverage, supporting functional discovery in genome-wide experimental datasets. Nucleic Acids Res 47, D607-D613.

619. Huang da, W., Sherman, B.T., and Lempicki, R.A. (2009). Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists. Nucleic Acids Res 37, 1-13.

308 620. Huang da, W., Sherman, B.T., and Lempicki, R.A. (2009). Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat Protoc 4, 44-57.

621. Szklarczyk, D., Morris, J.H., Cook, H., Kuhn, M., Wyder, S., Simonovic, M., Santos, A., Doncheva, N.T., Roth, A., Bork, P., et al. (2017). The STRING database in 2017: quality- controlled protein-protein association networks, made broadly accessible. Nucleic Acids Res 45, D362-D368.

622. Manrai, A.K., Wang, B.L., Patel, C.J., and Kohane, I.S. (2016). Reproducible and Shareable Quantifications of Pathogenicity. Pac Symp Biocomput 21, 231-242.

623. Juzenas, S., Venkatesh, G., Hubenthal, M., Hoeppner, M.P., Du, Z.G., Paulsen, M., Rosenstiel, P., Senger, P., Hofmann-Apitius, M., Keller, A., et al. (2017). A comprehensive, cell specific microRNA catalogue of human peripheral blood. Nucleic Acids Res 45, 9290-9301.

624. Kellert, L., Martin, E., Sykora, M., Bauer, H., Gussmann, P., Diedler, J., Herweh, C., Ringleb, P.A., Hacke, W., Steiner, T., et al. (2011). Cerebral oxygen transport failure?: decreasing hemoglobin and hematocrit levels after ischemic stroke predict poor outcome and mortality: STroke: RelevAnt Impact of hemoGlobin, Hematocrit and Transfusion (STRAIGHT)--an observational study. Stroke 42, 2832-2837.

625. Dzierzak, E., and Philipsen, S. (2013). Erythropoiesis: development and differentiation. Cold Spring Harb Perspect Med 3, a011601.

626. Patel, K.V. (2008). Variability and heritability of hemoglobin concentration: an opportunity to improve understanding of anemia in older adults. Haematologica 93, 1281-1283.

627. Petrovski, S., and Goldstein, D.B. (2016). Unequal representation of genetic variation across ancestry groups creates healthcare inequality in the application of precision medicine. Genome biology 17, 157.

628. Sitlani, C., Baldassari, A.R., Highland, H.M., Hodonsky, C.J., McKnight, B., Avery, C.L. (2019). Comparison of multiple phenotype association tests using summary statistics in genome-wide association studies. In American Society for Human Genetics 2019. (Houston, TX.

629. Gaggin, H.K., and Dec, G.W. (2014). The Role of Treatment for Anemia as a Therapeutic Target in the Management of Chronic Heart Failure: Insights After RED-HF. Curr Treat Options Cardiovasc Med 16, 279.

309 630. Morales, J., Welter, D., Bowler, E.H., Cerezo, M., Harris, L.W., McMahon, A.C., Hall, P., Junkins, H.A., Milano, A., Hastings, E., et al. (2018). A standardized framework for representation of ancestry data in genomics studies, with application to the NHGRI-EBI GWAS Catalog. Genome biology 19, 21.

631. Wheeler, E., Leong, A., Liu, C.T., Hivert, M.F., Strawbridge, R.J., Podmore, C., Li, M., Yao, J., Sim, X., Hong, J., et al. (2017). Impact of common genetic determinants of Hemoglobin A1c on type 2 diabetes risk and diagnosis in ancestrally diverse populations: A transethnic genome-wide meta-analysis. PLoS Med 14, e1002383.

632. Duchnowski, P., Szymanski, P., Orlowska-Baranowska, E., Kusmierczyk, M., and Hryniewiecki, T. (2016). Raised red cell distribution width as a prognostic marker in aortic valve replacement surgery. Kardiol Pol 74, 547-552.

633. Nolte, I.M., van der Most, P.J., Alizadeh, B.Z., de Bakker, P.I., Boezen, H.M., Bruinenberg, M., Franke, L., van der Harst, P., Navis, G., Postma, D.S., et al. (2017). Missing heritability: is the gap closing? An analysis of 32 complex traits in the Lifelines Cohort Study. Eur J Hum Genet 25, 877-885.

634. Gravel, S., Henn, B.M., Gutenkunst, R.N., Indap, A.R., Marth, G.T., Clark, A.G., Yu, F., Gibbs, R.A., Genomes, P., and Bustamante, C.D. (2011). Demographic history and rare allele sharing among human populations. Proc Natl Acad Sci U S A 108, 11983-11988.

635. Sorlie, P.D., Aviles-Santa, L.M., Wassertheil-Smoller, S., Kaplan, R.C., Daviglus, M.L., Giachello, A.L., Schneiderman, N., Raij, L., Talavera, G., Allison, M., et al. (2010). Design and implementation of the Hispanic Community Health Study/Study of Latinos. Ann Epidemiol 20, 629-641.

636. Auer, P.L., Teumer, A., Schick, U., O'Shaughnessy, A., Lo, K.S., Chami, N., Carlson, C., de Denus, S., Dube, M.P., Haessler, J., et al. (2014). Rare and low-frequency coding variants in CXCR2 and other genes are associated with hematological traits. Nature genetics 46, 629-634.

637. Hunt, S.E., McLaren, W., Gil, L., Thormann, A., Schuilenburg, H., Sheppard, D., Parton, A., Armean, I.M., Trevanion, S.J., Flicek, P., et al. (2018). Ensembl variation resources. Database (Oxford) 2018.

638. Tennessen, J.A., Bigham, A.W., O'Connor, T.D., Fu, W., Kenny, E.E., Gravel, S., McGee, S., Do, R., Liu, X., Jun, G., et al. (2012). Evolution and functional impact of rare coding variation from deep sequencing of human exomes. Science 337, 64-69.

310 639. Tang, Z.Z., and Lin, D.Y. (2013). MASS: meta-analysis of score statistics for sequencing studies. Bioinformatics 29, 1803-1805.

640. Heireman, L., Luyckx, A., Schynkel, K., Dheedene, A., Delaunoy, M., Adam, A.S., Gulbis, B., and Dierick, J. (2019). Detection of a Large Novel alpha-Thalassemia Deletion in an Autochthonous Belgian Family. Hemoglobin, 1-4.

641. Tchasovnikarova, I.A., Timms, R.T., Matheson, N.J., Wals, K., Antrobus, R., Gottgens, B., Dougan, G., Dawson, M.A., and Lehner, P.J. (2015). GENE SILENCING. Epigenetic silencing by the HUSH complex mediates position-effect variegation in human cells. Science 348, 1481-1485.

642. Huh, J.W., Kim, T.H., Yi, J.M., Park, E.S., Kim, W.Y., Sin, H.S., Kim, D.S., Min, D.S., Kim, S.S., Kim, C.B., et al. (2006). Molecular evolution of the periphilin gene in relation to human endogenous retrovirus m element. J Mol Evol 62, 730-737.

643. Timms, R.T., Tchasovnikarova, I.A., and Lehner, P.J. (2016). Position-effect variegation revisited: HUSHing up heterochromatin in human cells. Bioessays 38, 333-343.

644. (2018). The 100 000 Genomes Project: bringing whole genome sequencing to the NHS. BMJ 361, k1952.

645. Lis, R., Karrasch, C.C., Poulos, M.G., Kunar, B., Redmond, D., Duran, J.G.B., Badwe, C.R., Schachterle, W., Ginsberg, M., Xiang, J., et al. (2017). Conversion of adult endothelium to immunocompetent haematopoietic stem cells. Nature 545, 439-445.

646. Sugimura, R., Jha, D.K., Han, A., Soria-Valles, C., da Rocha, E.L., Lu, Y.F., Goettel, J.A., Serrao, E., Rowe, R.G., Malleshaiah, M., et al. (2017). Haematopoietic stem and progenitor cells from human pluripotent stem cells. Nature 545, 432-438.

647. Martyn, G.E., Wienert, B., Kurita, R., Nakamura, Y., Quinlan, K.G.R., and Crossley, M. (2019). A natural regulatory mutation in the proximal promoter elevates fetal globin expression by creating a de novo GATA1 site. Blood.

648. Asimit, J.L., and Zeggini, E. (2012). Imputation of Rare Variants in Next-Generation Association Studies. Human Heredity 74, 196-204.

649. Conrad, D.F., Keebler, J.E., DePristo, M.A., Lindsay, S.J., Zhang, Y., Casals, F., Idaghdour, Y., Hartl, C.L., Torroja, C., Garimella, K.V., et al. (2011). Variation in genome-wide mutation rates within and between human families. Nature genetics 43, 712-714.

311 650. Milholland, B., Dong, X., Zhang, L., Hao, X., Suh, Y., and Vijg, J. (2017). Differences between germline and somatic mutation rates in humans and mice. Nat Commun 8, 15183.

651. Tournamille, C., Colin, Y., Cartron, J.P., and Le Van Kim, C. (1995). Disruption of a GATA motif in the Duffy gene promoter abolishes erythroid gene expression in Duffy- negative individuals. Nature genetics 10, 224-228.

652. de Souza, R.A., Carlos, A.M., de Souza, B.M., Rodrigues, C.V., Pereira Gde, A., and Moraes-Souza, H. (2015). Alpha-Thalassemia: Genotypic Profile Associated with Ethnicity and Hematological Differentiation of Iron Deficiency Anemia in the Region of Uberaba, Minas Gerais, Brazil. Hemoglobin 39, 264-269.

653. Edwards, N.C., Hing, Z.A., Perry, A., Blaisdell, A., Kopelman, D.B., Fathke, R., Plum, W., Newell, J., Allen, C.E., S, G., et al. (2012). Characterization of coding synonymous and non-synonymous variants in ADAMTS13 using ex vivo and in silico approaches. PloS one 7, e38864.

654. Meyer, I.M., and Miklos, I. (2005). Statistical evidence for conserved, local secondary structure in the coding regions of eukaryotic mRNAs and pre-mRNAs. Nucleic Acids Res 33, 6338-6348.

655. Wan, Y., Qu, K., Zhang, Q.C., Flynn, R.A., Manor, O., Ouyang, Z., Zhang, J., Spitale, R.C., Snyder, M.P., Segal, E., et al. (2014). Landscape and variation of RNA secondary structure across the human transcriptome. Nature 505, 706-709.

656. Biro, J.C. (2008). Correlation between nucleotide composition and folding energy of coding sequences with special attention to wobble bases. Theor Biol Med Model 5, 14.

657. Huber, F., Bunina, D., Gupta, I., Khmelinskii, A., Meurer, M., Theer, P., Steinmetz, L.M., and Knop, M. (2016). Protein Abundance Control by Non-coding Antisense Transcription. Cell Rep 15, 2625-2636.

658. Villegas, V.E., and Zaphiropoulos, P.G. (2015). Neighboring gene regulation by antisense long non-coding RNAs. Int J Mol Sci 16, 3251-3266.

659. Ward, L.D., and Kellis, M. (2012). HaploReg: a resource for exploring chromatin states, conservation, and regulatory motif alterations within sets of genetically linked variants. Nucleic Acids Res 40, D930-934.

312 660. Barrera, L.A., Vedenko, A., Kurland, J.V., Rogers, J.M., Gisselbrecht, S.S., Rossin, E.J., Woodard, J., Mariani, L., Kock, K.H., Inukai, S., et al. (2016). Survey of variation in human transcription factors reveals prevalent DNA binding changes. Science 351, 1450- 1454.

661. Hindorff, L.A., Sethupathy, P., Junkins, H.A., Ramos, E.M., Mehta, J.P., Collins, F.S., and Manolio, T.A. (2009). Potential etiologic and functional implications of genome-wide association loci for human diseases and traits. Proc Natl Acad Sci U S A 106, 9362-9367.

662. Jolma, A., Yin, Y., Nitta, K.R., Dave, K., Popov, A., Taipale, M., Enge, M., Kivioja, T., Morgunova, E., and Taipale, J. (2015). DNA-dependent formation of transcription factor pairs alters their binding specificity. Nature 527, 384-388.

663. Smith, N.C., and Matthews, J.M. (2016). Mechanisms of DNA-binding specificity and functional gene regulation by transcription factors. Curr Opin Struct Biol 38, 68-74.

664. Chabot, B., and Shkreta, L. (2016). Defective control of pre-messenger RNA splicing in human disease. J Cell Biol 212, 13-27.

665. Conboy, J.G. (2017). RNA splicing during terminal erythropoiesis. Curr Opin Hematol 24, 215-221.

666. Bheemanaik, S., Reddy, Y.V., and Rao, D.N. (2006). Structure, function and mechanism of exocyclic DNA methyltransferases. Biochem J 399, 177-190.

667. Carlin, J., George, R., and Reyes, T.M. (2013). Methyl donor supplementation blocks the adverse effects of maternal high fat diet on offspring physiology. PloS one 8, e63549.

668. Obeid, R. (2013). The metabolic burden of methyl donor deficiency with focus on the betaine homocysteine methyltransferase pathway. Nutrients 5, 3481-3495.

669. Julian, C.G. (2017). Epigenomics and Human Adaptation to High Altitude. J Appl Physiol (1985), jap 00351 02017.

670. Alkorta-Aranburu, G., Beall, C.M., Witonsky, D.B., Gebremedhin, A., Pritchard, J.K., and Di Rienzo, A. (2012). The genetic architecture of adaptations to high altitude in Ethiopia. PLoS Genet 8, e1003110.

671. Alegria-Torres, J.A., Baccarelli, A., and Bollati, V. (2011). Epigenetics and lifestyle. Epigenomics 3, 267-277.

313 672. van der Harst, P., de Windt, L.J., and Chambers, J.C. (2017). Translational Perspective on Epigenetics in Cardiovascular Disease. J Am Coll Cardiol 70, 590-606.

673. Hashimoto, H., Liu, Y., Upadhyay, A.K., Chang, Y., Howerton, S.B., Vertino, P.M., Zhang, X., and Cheng, X. (2012). Recognition and potential mechanisms for replication and erasure of cytosine hydroxymethylation. Nucleic Acids Res 40, 4841-4849.

674. Cheng, X. (1995). Structure and function of DNA methyltransferases. Annu Rev Biophys Biomol Struct 24, 293-318.

675. Saito, D., and Suyama, M. (2015). Linkage disequilibrium analysis of allelic heterogeneity in DNA methylation. Epigenetics 10, 1093-1098.

676. Neph, S., Vierstra, J., Stergachis, A.B., Reynolds, A.P., Haugen, E., Vernot, B., Thurman, R.E., John, S., Sandstrom, R., Johnson, A.K., et al. (2012). An expansive human regulatory lexicon encoded in transcription factor footprints. Nature 489, 83-90.

677. Rockman, M.V., and Wray, G.A. (2002). Abundant raw material for cis-regulatory evolution in humans. Mol Biol Evol 19, 1991-2004.

678. Thurman, R.E., Rynes, E., Humbert, R., Vierstra, J., Maurano, M.T., Haugen, E., Sheffield, N.C., Stergachis, A.B., Wang, H., Vernot, B., et al. (2012). The accessible chromatin landscape of the human genome. Nature 489, 75-82.

679. Bannister, A.J., and Kouzarides, T. (2011). Regulation of chromatin by histone modifications. Cell Res 21, 381-395.

680. Kremsky, I., Bellora, N., and Eyras, E. (2015). A Quantitative Profiling Tool for Diverse Genomic Data Types Reveals Potential Associations between Chromatin and Pre-mRNA Processing. PloS one 10, e0132448.

681. Lee, D., Gorkin, D.U., Baker, M., Strober, B.J., Asoni, A.L., McCallion, A.S., and Beer, M.A. (2015). A method to predict the impact of regulatory variants from DNA sequence. Nature genetics 47, 955-961.

682. Hattangadi, S.M., Wong, P., Zhang, L., Flygare, J., and Lodish, H.F. (2011). From stem cell to red cell: regulation of erythropoiesis at multiple levels by multiple proteins, RNAs, and chromatin modifications. Blood 118, 6258-6268.

314 683. Zhao, B., Yang, J., and Ji, P. (2016). Chromatin condensation during terminal erythropoiesis. Nucleus 7, 425-429.

684. Lawrence, M., Daujat, S., and Schneider, R. (2016). Lateral Thinking: How Histone Modifications Regulate Gene Expression. Trends Genet 32, 42-56.

685. Allis, C.D., and Jenuwein, T. (2016). The molecular hallmarks of epigenetic control. Nat Rev Genet 17, 487-500.

686. Milne, T.A., Zhao, K., and Hess, J.L. (2009). Chromatin immunoprecipitation (ChIP) for analysis of histone modifications and chromatin-associated proteins. Methods Mol Biol 538, 409-423.

687. Pillai, S., Dasgupta, P., and Chellappan, S.P. (2009). Chromatin immunoprecipitation assays: analyzing transcription factor binding and histone modifications in vivo. Methods Mol Biol 523, 323-339.

688. Hoang, S.A., Xu, X., and Bekiranov, S. (2011). Quantification of histone modification ChIP-seq enrichment for data mining and machine learning applications. BMC Res Notes 4, 288.

689. O'Geen, H., Echipare, L., and Farnham, P.J. (2011). Using ChIP-seq technology to generate high-resolution profiles of histone modifications. Methods Mol Biol 791, 265-286.

690. Udali, S., Guarini, P., Moruzzi, S., Choi, S.W., and Friso, S. (2013). Cardiovascular epigenetics: from DNA methylation to microRNAs. Mol Aspects Med 34, 883-901.

691. Okada, Y., Muramatsu, T., Suita, N., Kanai, M., Kawakami, E., Iotchkova, V., Soranzo, N., Inazawa, J., and Tanaka, T. (2016). Significant impact of miRNA-target gene networks on genetics of human complex traits. Sci Rep 6, 22223.

692. Fabian, M.R., Sonenberg, N., and Filipowicz, W. (2010). Regulation of mRNA translation and stability by microRNAs. Annu Rev Biochem 79, 351-379.

693. Azzouzi, I., Moest, H., Wollscheid, B., Schmugge, M., Eekels, J.J.M., and Speer, O. (2015). Deep sequencing and proteomic analysis of the microRNA-induced silencing complex in human red blood cells. Exp Hematol 43, 382-392.

694. Orom, U.A., Nielsen, F.C., and Lund, A.H. (2008). MicroRNA-10a binds the 5'UTR of ribosomal protein mRNAs and enhances their translation. Mol Cell 30, 460-471.

315 695. Saraiya, A.A., Li, W., and Wang, C.C. (2013). Transition of a microRNA from repressing to activating translation depending on the extent of base pairing with the target. PloS one 8, e55672.

696. Vasudevan, S., Tong, Y., and Steitz, J.A. (2007). Switching from repression to activation: microRNAs can up-regulate translation. Science 318, 1931-1934.

697. Listowski, M.A., Heger, E., Boguslawska, D.M., Machnicka, B., Kuliczkowski, K., Leluk, J., and Sikorski, A.F. (2013). microRNAs: fine tuning of erythropoiesis. Cell Mol Biol Lett 18, 34-46.

698. Undi, R.B., Kandi, R., and Gutti, R.K. (2013). MicroRNAs as Haematopoiesis Regulators. Adv Hematol 2013, 695754.

699. Diehl, P., Fricke, A., Sander, L., Stamm, J., Bassler, N., Htun, N., Ziemann, M., Helbing, T., El-Osta, A., Jowett, J.B., et al. (2012). Microparticles: major transport vehicles for distinct microRNAs in circulation. Cardiovasc Res 93, 633-644.

700. Azzouzi, I., Schmugge, M., and Speer, O. (2012). MicroRNAs as components of regulatory networks controlling erythropoiesis. Eur J Haematol 89, 1-9.

701. Zhang, L., Flygare, J., Wong, P., Lim, B., and Lodish, H.F. (2011). miR-191 regulates mouse erythroblast enucleation by down-regulating Riok3 and Mxi1. Genes Dev 25, 119- 124.

702. Ecroyd, H., and Carver, J.A. (2008). Unraveling the mysteries of protein folding and misfolding. IUBMB Life 60, 769-774.

703. Englander, S.W., Mayne, L., and Krishna, M.M. (2007). Protein folding and misfolding: mechanism and principles. Q Rev Biophys 40, 287-326.

704. Hammarstrom, P. (2009). Protein folding, misfolding and disease. FEBS Lett 583, 2579- 2580.

705. Papsdorf, K., and Richter, K. (2014). Protein folding, misfolding and quality control: the role of molecular chaperones. Essays Biochem 56, 53-68.

706. Leandro, P., and Gomes, C.M. (2008). Protein misfolding in conformational disorders: rescue of folding defects and chemical chaperoning. Mini Rev Med Chem 8, 901-911.

316 707. Miyazaki, Y., Chen, L.C., Chu, B.W., Swigut, T., and Wandless, T.J. (2015). Distinct transcriptional responses elicited by unfolded nuclear or cytoplasmic protein in mammalian cells. Elife 4.

317