Genetic instrumental variable regression: Explaining socioeconomic and health outcomes in nonexperimental data

Thomas A. DiPrete^a,1,2, Casper A. P. Burik^b,1, and Philipp D. Koellinger^b,1,2

^a Department of Sociology, Columbia University, New York, NY 10027; and ^b Department of Economics, Vrije Universiteit Amsterdam, 1081 HV Amsterdam, The Netherlands

Edited by Kenneth W. Wachter, University of California, Berkeley, CA, and approved March 21, 2018 (received for review May 3, 2017)

Identifying causal effects in nonexperimental data is an enduring challenge. One proposed solution that recently gained popularity is the idea to use genes as instrumental variables [i.e., Mendelian randomization (MR)]. However, this approach is problematic because many variables of interest are genetically correlated, which implies the possibility that many genes could affect both the exposure and the outcome directly or via unobserved confounding factors. Thus, pleiotropic effects of genes are themselves a source of bias in nonexperimental data that would also undermine the ability of MR to correct for endogeneity bias from nongenetic sources. Here, we propose an alternative approach, genetic instrumental variable (GIV) regression, that provides estimates for the effect of an exposure on an outcome in the presence of pleiotropy. As a valuable byproduct, GIV regression also provides accurate estimates of the chip heritability of the outcome variable. GIV regression uses polygenic scores (PGSs) for the outcome of interest which can be constructed from genome-wide association study (GWAS) results. By splitting the GWAS sample for the outcome into nonoverlapping subsamples, we obtain multiple indicators of the outcome PGSs that can be used as instruments for each other and, in combination with other methods such as sibling fixed effects, can address endogeneity bias from both pleiotropy and the environment. In two empirical applications, we demonstrate that our approach produces reasonable estimates of the chip heritability of educational attainment (EA) and show that standard regression and MR provide upwardly biased estimates of the effect of body height on EA.

causal effects | genetic instrumental variables | polygenic scores | genome-wide association studies | pleiotropy

Significance

We propose genetic instrumental variable (GIV) regression, a method that controls for pleiotropic effects of genes on two variables. GIV regression is broadly applicable to study outcomes for which polygenic scores from large-scale genome-wide association studies are available. We explore the performance of GIV regression in the presence of pleiotropy across a range of scenarios and find that it yields more accurate estimates than alternative approaches such as ordinary least-squares regression or Mendelian randomization. When GIV regression is combined with proper controls for purely environmental sources of bias (e.g., using control variables and sibling fixed effects), it improves our understanding of the causal relationships between genetically correlated variables.

Author contributions: T.A.D., C.A.P.B., and P.D.K. designed research, performed research, contributed new analytic tools, analyzed data, and wrote the paper.
The authors declare no conflict of interest.
This article is a PNAS Direct Submission.
This open access article is distributed under Creative Commons Attribution-NonCommercial-NoDerivatives License 4.0 (CC BY-NC-ND).
See Commentary on page 5626.
1 T.A.D., C.A.P.B., and P.D.K. contributed equally to this work.
2 To whom correspondence may be addressed. Email: [email protected] or p.d.[email protected].
This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1707388115/-/DCSupplemental.
Published online April 23, 2018.

A major challenge in the social sciences and in epidemiology is the identification of causal effects in nonexperimental data. In these disciplines, ethical and legal considerations along with practical constraints often preclude the use of experiments to randomize the assignment of observations between treatment and control groups or to carry out such experiments in samples that represent the relevant population (1). Instead, many important questions are studied in field data which make it difficult to discern between causal effects and (spurious) correlations that are induced by unobserved factors (2). Obviously, confusing correlation with causation is not only a conceptual error; it can also lead to ineffective or even harmful recommendations, treatments, and policies, as well as a significant waste of resources (e.g., as in ref. 3).

One important source of bias in field data stems from genetic effects: Twin studies (4) as well as methods based on molecular genetic data (5, 6) allow estimation of the proportion of variance in a trait that is due to linear genetic effects (so-called narrow-sense heritability). Using these and related methods, an overwhelming body of literature demonstrates that almost all important human characteristics, behaviors, and health outcomes are influenced both by genetic predisposition and by environmental factors (7–9). Most of these traits are “genetically complex,” which means that the observed heritability is due to the accumulation of effects from a very large number of genes that each have a small, often statistically insignificant, influence (10).

Furthermore, genes often influence several seemingly unrelated traits, a phenomenon called direct or vertical pleiotropy (11, 12). For example, a mutation of a single gene that causes the disease phenylketonuria is responsible for mental retardation and also for abnormally light hair and skin color (13). Pleiotropy is not restricted to diseases. All genes involved in healthy cell metabolism and cell division can be expected to directly influence a broad range of traits such as body height, cognitive ability, and longevity, even if the effect on each of these traits may be tiny. Similarly, any gene involved in neurodevelopment and brain function is likely to contribute to human behavior and mental health in some way (14).

In addition to direct pleiotropic effects, genes can also have indirect or horizontal pleiotropic effects, where a genetic variant influences one trait, which in turn influences another trait (11). The similarity of the genetic architecture of two traits is estimated by their genetic correlation (i.e., the correlation of the “true” effect sizes of all genetic variants on both traits) (15), which captures both direct and indirect pleiotropic effects (16–18). Genetic correlations exist between many traits and often

exceed their phenotypic correlations (19), giving rise to the concern that direct pleiotropy may substantially bias studies that do not control for genetic effects (20).

If an experimental design is not possible, the gold standard in the presence of genetic confounds is to compare outcomes for monozygotic (MZ) twins (21, 22), who are (almost) genetically identical by definition (23). In addition, this approach also controls for effects that arise from the shared parental environment. However, the practical challenge is that such twin studies require very large sample sizes of MZ twin pairs because the differences within MZ pairs tend to be small or nonexistent. Furthermore, unobserved environmental differences between the twins or reverse causation can still lead to wrong conclusions in this study design.

Another popular strategy to isolate causal effects in nonexperimental data is to use instrumental variables (IVs) (24). Valid IVs are conceptually similar to natural experiments: They provide an exogenous “shock” on the exposure of interest to isolate the effect of that exposure on an outcome. Valid IVs need to satisfy two important conditions. First, they need to be correlated with the exposure conditional on the other control variables in the regression (i.e., IVs need to be “relevant”). Second, they need to be independent of the error term of the regression conditional on the other control variables and produce their correlation with the outcome solely through their effect on the exposure (the so-called exclusion restriction). In practice, finding valid IVs that satisfy both requirements is difficult. In particular, satisfying the exclusion restriction is challenging. [Two other conditions that valid IVs need to satisfy are monotonicity (everyone who is affected by the IV is affected in the same direction) and the stable unit treatment value assumption (SUTVA): The “treatment” of one unit does not affect the outcome for other units.]

Epidemiologists have proposed to use genetic information to construct IVs and termed this approach Mendelian randomization (MR) (25–28). The idea is in principle appealing because the genotypes are randomized in the production of gametes by the process of meiosis. Thus, conditional on the genotype of the parents, the genotype of the offspring is the result of a random draw. So if it could be known which genes affect the exposure, it may be possible to use them as IVs to identify the causal influence of the exposure on some outcome of interest. However, there are four challenges to this idea. First, we need to know which genes affect the exposure and isolate true genetic effects from environmental confounds that are correlated with ancestry. Second, if the exposure is a genetically complex trait, any gene by itself will capture only a very small part of the variance in the trait, which leads to the well-known problem of weak instruments (29, 30). Third, genotypes are randomly assigned only conditional on the genotype of the parents. Unless it is possible to control for the genotype of the parents, the genotype of the offspring is not random and correlates with everything that the genotypes of the parents correlate with (e.g., parental environment, personality, and habits) (31). Fourth, if direct pleiotropic effects of genes are the source of the confound, genes could obviously not be used as IVs. One could try to isolate a subset of genes that influence only the exposure, but such attempts are still hindered by our limited knowledge of the function of most genes (27, 32, 33).

Recent advances in complex trait genetics make it possible to address the first two challenges of MR. Array-based genotyping technologies have made the collection of genetic data fast and cheap. As a result, very large datasets are now available to study the genetic architecture of many human traits and a plethora of robust, replicable genetic associations have recently been reported in large-scale genome-wide association studies (GWASs) (34). These results begin to shed light on the genetic architecture that is driving the heritability of traits such as body height (35), body mass index (BMI) (36), schizophrenia (37), Alzheimer’s disease (38), depression (39), and educational attainment (EA) (15).

High-quality GWASs use several strategies to control for genetic structure in the population, and empirical evidence suggests that the vast majority of the reported genetic associations for many traits is not confounded by ancestry (40–43). Polygenic scores (PGSs) have become the favored tool for summarizing the genetic predispositions for genetically complex traits (15, 39, 44, 45). PGSs are linear indexes that aggregate the estimated effects of all currently measured genetic variants [typically single-nucleotide polymorphisms (SNPs)]. The effects of each SNP on an outcome are estimated in large-scale GWASs that exclude the prediction sample. Recent studies demonstrate that this approach yields PGSs that begin to predict genetically complex outcomes such as height, BMI, schizophrenia, and EA (35–37, 39, 46). Although PGSs still capture substantially less of the variation in traits than suggested by their heritability (47) (an issue we return to below), PGSs capture a much larger share of the variance of genetically complex traits than individual genetic markers. The third challenge to MR in the list above could in principle be addressed if the genotypes of the parents and the offspring are observed (e.g., in a large sample of parent–offspring trios) or by using large samples of dizygotic twins or siblings where the genetic differences between siblings are random draws from the parent’s genotypes. However, the fourth challenge (i.e., pleiotropy) remains a serious obstacle despite recent efforts to relax the exogeneity assumptions in MR (48–50).

Here, we address the implications of pleiotropy for modeling causal relationships using nonexperimental data. We demonstrate that pleiotropy is a serious source of bias in ordinary least-squares (OLS) regression and MR. We propose alternative estimation strategies that use PGSs for the outcome of interest to reduce bias arising from pleiotropy. In particular, we propose an approach that we call genetic instrumental variables (GIV) regression that can be implemented using widely available statistical software. GIV regression estimates practically useful upper and lower bounds for the causal effect of an exposure on an outcome even in the presence of substantial direct pleiotropy.

We begin by providing intuition and laying out the assumptions of our approach. We go on to show that GIV regression produces accurate estimates for the effect of the PGSs on the outcome variable when the other covariates in the model are exogenous, when the true PGS is uncorrelated with the error term net of the included covariates, and when the GWAS sample sizes are sufficiently large relative to the number of SNPs. We then turn to the more complex case when a regressor of interest (T) is potentially correlated with unobserved variables in the error term because of pleiotropy, and we show evidence with a comprehensive set of simulations that the bias with GIV regression is generally smaller than with OLS, MR, or what we term an enhanced version of MR (EMR) under these assumptions.

Next, we demonstrate the practical usefulness of our approach in empirical applications using the publicly available Health and Retirement Study (HRS) (51). First, we demonstrate that a consistent estimate of the so-called chip heritability (47) of EA can be obtained with our method. Then, we estimate the effects of body height on EA. As a “negative control,” we check whether our method finds a causal effect of EA on body height (it should not). (Note that a clean experimental design which randomizes people into groups based on body height or EA is not possible. Thus, any attempt to study the causal relationship between the two variables must rely on observational data like the naturally occurring experiments in the genetic endowment of individuals, which we exploit here.)

Formal derivations and technical details are contained in SI Appendix, sections 1–5.

E4970–E4979 | PNAS | vol. 115 | no. 22 | www.pnas.org/cgi/doi/10.1073/pnas.1707388115
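The IV logic summarized in the introduction above, and the way direct pleiotropy undermines it, can be illustrated with a small simulation. The sketch below is not taken from the paper: the data-generating process, the coefficient values, and the helper names (`wald_iv`, `ols_slope`, `simulate_mr`) are illustrative assumptions, and a single normally distributed score stands in for the genetic instrument.

```python
import numpy as np


def wald_iv(y, t, z):
    """Single-instrument IV (Wald ratio) estimate: cov(z, y) / cov(z, t)."""
    return np.cov(z, y)[0, 1] / np.cov(z, t)[0, 1]


def ols_slope(y, t):
    """Slope from a simple OLS regression of y on t."""
    return np.cov(t, y)[0, 1] / np.var(t, ddof=1)


def simulate_mr(n=200_000, delta=0.5, pleiotropy=0.0, seed=0):
    """Toy Mendelian-randomization setup (all parameters are assumptions).

    g          : genetic score used as the instrument for the exposure t
    u          : unobserved confounder that affects both t and y
    pleiotropy : direct effect of g on y that bypasses t; any nonzero value
                 violates the exclusion restriction
    """
    rng = np.random.default_rng(seed)
    g = rng.normal(size=n)
    u = rng.normal(size=n)
    t = 0.5 * g + u + rng.normal(size=n)
    y = delta * t + pleiotropy * g + u + rng.normal(size=n)
    return {"ols": ols_slope(y, t), "iv": wald_iv(y, t, g)}


if __name__ == "__main__":
    print("no pleiotropy:  ", simulate_mr(pleiotropy=0.0))
    print("with pleiotropy:", simulate_mr(pleiotropy=0.2))
```

Under these assumed parameters, OLS is pushed upward by the unobserved confounder, the IV estimate is close to the true effect when the score affects the outcome only through the exposure, and it moves away from the truth once the score also has a direct (pleiotropic) effect on the outcome.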

Theory

Intuition. To build intuition for our approach, we introduce the concept of the true PGS for y which would be constructed using the true effects of each SNP on y. In theory, the true SNP effects could be estimated in a GWAS on y in an infinitely large sample that is drawn from the same population as the prediction sample. The true PGS would capture the narrow-sense heritability of y. Of course, the true PGS is unknown. All one can practically obtain is a PGS from a finite GWAS sample that will capture a part, but not all, of the genetic influence on y because the effect of each SNP is estimated with noise. The attenuated predictive accuracy of practically available PGSs (45, 47, 52) is conceptually similar to the well-known problem of measurement error in regression analysis. It has long been understood that multiple indicators can, under certain conditions, provide a strategy to correct regression estimates for attenuation from measurement error (53–55). We show below that by splitting the GWAS sample into independent subsamples, one can obtain several PGSs (i.e., multiple indicators) in the prediction sample. Each will have even lower predictive accuracy than the original score due to the smaller GWAS subsamples used in their construction, but these multiple indicators can be used as instrumental variables for each other, and the instruments will satisfy the assumptions of IV regression to the extent that the measurement errors (the difference between the true and calculated PGSs) are uncorrelated. Standard two-stage least-squares (2SLS) regression (24) (readily available in statistics software packages) using at least one valid IV for the PGS of y can then be used to back out an unbiased estimate of the heritability of y.

Next, presume the matter of interest is not heritability, but the causal effect of some treatment T on y, where T is also heritable and some genes have direct pleiotropic effects on both. If these genes are not known and not controlled for, regressing y on T would result in omitted variable bias. (Unfortunately, that is the reality in most social scientific and epidemiological studies that use nonexperimental data.) Suppose the effects of all genes that influence y through channels other than T could be known. Theoretically, one could estimate these effects in a GWAS on y that controls for T in an infinitely large sample. That information could be used to construct a “true conditional” PGS in a prediction sample. Adding the true conditional PGS to a regression of y on T in the prediction sample would effectively eliminate bias arising from direct pleiotropy. However, the true conditional PGS is also unknown and all we can practically obtain is a noisy proxy of it from a finite GWAS sample. While it is not guaranteed, the general conclusion of the literature is that the use of proxy variables is an improvement over omitting the variable being proxied (56, 57). Furthermore, having a valid IV for the conditional score would potentially correct for its noise and get us closer to estimating the true causal effect of T on y. As before, a valid IV can be practically obtained by splitting the GWAS sample into independent parts and standard IV estimation techniques such as two-stage least squares can be used. We refer to this approach as conditional genetic IV regression (GIV-C).

If conditional GWAS results are not available, one can still add the unconditional PGS for y as a control variable and use IV regression with multiple indicators for this score to correct for measurement error. We refer to this as unconditional GIV regression (GIV-U). GIV-U still corrects for bias arising from direct pleiotropy, but this strategy will overcontrol and result in estimates for T that are biased toward zero because the unconditional PGS also includes indirect pleiotropic effects of genes that affect y only because they affect T. However, extensive simulations show that the combination of GIV-C and GIV-U turns out to produce reasonable upper and lower bounds for the effect of T on y across a broad range of scenarios if the only sources of bias are pleiotropic genes.

The GIV strategy starts to break down when bias arises from unobserved nongenetic factors as well as from pleiotropic effects. We show below that both GIV and MR produce biased estimates in this case. However, we demonstrate that the combination of GIV-C and GIV-U still outperforms OLS and MR. Furthermore, the GIV approach has additional utility because it can be combined with other strategies to reduce the effects of environmental endogeneity (e.g., additional control variables or family fixed effects). We demonstrate that these combined strategies can potentially provide accurate information about the effects of an exposure in situations with both genetic and nongenetic sources of endogeneity. In contrast, the problems for MR that are produced by pleiotropy bias are not fixable in a similar manner.

Assumptions. GIV regression builds on the standard identifying assumptions of IV regression (24). In the context of our approach, this implies six specific conditions:

i) Polygenicity: The outcome is a genetically complex trait that is influenced by many genetic variants, each with a very small effect.
ii) Complete genetic information: The available genetic data include all variants that influence the variable(s) of interest.
iii) Genetic effects are linear: All genetic variants influence the variable(s) of interest via additive linear effects. Thus, there are no genetic interactions (i.e., epistasis) or dominant alleles.
iv) Unbiased GWAS results: The available GWAS results are not systematically biased by omitted environmental variables. For example, failure to control for population structure can lead to spurious genetic associations (31).
v) Nonoverlapping samples: It is possible to divide GWAS samples into nonoverlapping subsamples drawn from the same population.
vi) The genetic effects on y are the same in the GWAS and the prediction samples; i.e., the genetic correlation between samples is one.

Estimating Narrow-Sense SNP Heritability from Polygenic Scores. Under these assumptions, consistent estimates of the chip heritability of a trait (i.e., the proportion of variance in a trait that is due to linear effects of currently measurable SNPs) can be obtained from polygenic scores (for full details, see SI Appendix, section 2). If $y$ is the outcome variable, $X$ is a vector of exogenous control variables, and $S^{*}_{y|X}$ is a summary measure of genetic tendency for $y$ in the presence of controls for $X$, then one can write

$$y = \alpha + X\beta + \gamma S^{*}_{y|X} + \epsilon = \alpha + X\beta + \gamma (G\zeta_{y|X}) + \epsilon, \qquad [1]$$

where $G$ is an $n \times m$ matrix of genetic markers, and $\zeta_{y|X}$ is the $m \times 1$ vector of SNP effect sizes, where the number of SNPs is typically in the millions. If the true effects of each SNP on the outcome were known, the entire genetic tendency for $y$ would be captured by the true unconditional score $S^{*}_{y|X}$, and the marginal $R^{2}$ of $S^{*}_{y|X}$ in Eq. 1 would be the chip heritability of the trait. We refer to the estimate of the PGS from actually available GWAS data in the presence of controls for $X$ as $S_{y|X}$, where

$$S_{y|X} = S^{*}_{y|X} + v_{1} = G\zeta_{y|X} + Gu_{y|X} \qquad [2]$$

and $u_{y|X}$ is the estimation error in $\zeta_{y|X}$ and $S_{y|X}$ is substituted for $S^{*}_{y|X}$ in Eq. 1. The variance of a trait that is captured by its available PGS increases with the available GWAS sample size to estimate $\zeta_{y|X}$ and converges to the SNP-based narrow-sense heritability of the trait at the limit if all relevant genetic markers

E4972 | www.pnas.org/cgi/doi/10.1073/pnas.1707388115 DiPrete et al. Downloaded by guest on September 26, 2021 Downloaded by guest on September 26, 2021 irt tal. et DiPrete This SNPs. of of estimates sets noisier overlapping produces partially splitting least by at PGS for with the subsamples sample of discovery indicators multiple GWAS obtain the to is bias tion currently datasets the with ignor- questions strate- to available. important alternative down for address calls attenuation to large therefore, gies this having situation, bring This from levels. away to able time sizes long a sample are enough we esti- size, sample the GWAS in of attenuation mate substantial of demonstrated estimates consistently obtain to assumptions use S statistical that of methods combination estimation a nonconventional requires cases to were size sample GWAS 52). the 47, if (45, and large sufficiently GWAS the in included were erse on regressed estimate heritability the of heritability that γ genotype chip assuming with the further correlated of and not one, are of or in SD of contained a level variables and sufficient zero a mean to means satisfied which be independent, can the mutually accuracy. in assumption of separation essentially this set spatial be that total iso- sufficient the to have the to from that on DNA possible SNPs markers other currently of thousand each millions is markers hundred to it close several genetic located late However, general, are chromosome. In they same other. if each correlated are of independent are already are rea- that with sizes estimated sample effects effects obtainable. discovery their large using and with precision identified sonable markers for easily the need be no because be could also method, would proposed there here However, the proposed well. method work the not then would markers, If of which number effect. 
of small small each tively markers, very genetic a of number has large a by influenced for ment for assumption important additional each 60). for (55, assumptions IVs IV as standard used using be regression S can 2SLS indicators The multiple other. the but racy, estimate to available in cases variables of of number number the S while the questionable millions when is the context, result in current mathematical this the of in utility practical probabil- the in of convergence ity include that assumptions if reasonable needed) estimates are Eq. consistent of errors provide parameters will the OLS of established, has 59) (58, (ζ coefficients and that in regressor” hc ed oaba qa to equal of bias a variance to the leads which to respect with dardize praht orcigatnainba ae nteueof use the on based bias attenuation correcting to approach for estimate between correlation the ˆ y 2 miia tde sn Gsfravreyo rishave traits of variety a for PGSs using studies Empirical . ∗ h otsrihfradslto otepolmo attenua- of problem the to solution straightforward most The Eq. suigta h aibe nEq. in variables the that Assuming markers genetic the that is assumption required Another in technically more discussed As 2|X ρ(S sfrsalrta ht h moe ai fcoefficients of ratio imposed The that. than smaller far is y ilte eoe ossetetmt of estimate consistent a recover then will ζ ˆ 1 1|X y |X otiswa scle neooerc “generated a econometrics in called is what contains γ S , y S to 4,4,5) n,wieteba iiihswith diminishes bias the while and, 52), 47, (45, 2|X y h ˆ 2|X S y 2 ζ y y sthat is |X .(o nalternative an (For 2.1). section Appendix, (SI |X where ), y = S |X stesml iegoslre.However, larger. grows size sample the as ∗ h S ˆ rmaohrmdl speiu work previous As model. 
another from ) y 2 y ∗ S safnto fasto variables of set a of function a is y |X X snteulto equal not is S eacmlxtat enn hti is it that meaning trait, complex a be ssbtttdfor substituted is y + 1 ρ oto o ouainstratification population for control 1|X y y atog orcin o standard for corrections (although v stecreaincefiin.The coefficient. correlation the is spiaiyiflecdb rela- a by influenced primarily is 1/Var a o eotie from obtained be now can S nta of instead and y ∗ |X y , S notomtal exclusive mutually two into S (S ihlwrpeitv accu- predictive lower with an 1, section Appendix, SI y 1 S y 1|X G 2|X y y ∗ r tnadzdt have to standardized are |X 1|X ossetestimate consistent a , γ ˆ S Multiplying ). S sa ntuetfor instrument an as 2 eoesaconsistent a recovers y y ob ai instru- valid a be to ∗ |X |X S ipybcuewe because simply ∗ hs estan- we Thus, . nta of instead γ ne e of set a under nEq. in 1 γ ˆ 2 under S h ˆ G with y (G ∗ y 2 |X = is ) , rannadmzdtetetdet oiyo eia inter- medical or interest Eq. policy of rewrite to can due We treatment ventions). nonrandomized chip a the or not on is exposure interest ized of question the of heritability where situations in role Exposure Outcome. Between and Correlation Genetic from Arising Bias Reducing 61.) framework, ref. modeling see equation structural a in indicators multiple purge (T interest of behavior on effects pleiotropic for control effectively would on effect their of combination T the from stemming effect indirect on that Given tion. score conditional vari- true with control the exogenous uncorrelated the for is coefficients drop ables we the term Also, on later. subscript assumption disturbance the this drop the (We variables. that genetic assume We where oeta isi h siaeof estimate the and size in sample bias GWAS potential finite to for due error proxy measurement a contains is instead obtainable is coefficients, the specifically, effect for EA). 
the propensity of on study a height in (e.g., of available not often time), is over strategy vary this not would but does times (which multiple pleiotropy for observed control is effectively individual regres- of same fixed-effects the would example, where outcome pleiotropy For sion estimates. regression, the coefficient the bias in and not controlled directly exposure be could the on pleiotropic gene direct that exist.) the argue interest to of difficult it effects makes genes function biological most the of about that restric- knowledge exclusion limited variants the practice, violate In genetic biological to tion. well-understood unlikely single it via instru- make use exposure that as the mechanisms to affect PGSs is to use known idea not are the does Instead, typically MR ments. direct that if (Classic a factors or EA. environmental metabolism) have affect unmeasured and also with growth height correlated cell restric- are healthy affecting exclusion they via a genes (e.g., the in EA the satisfy on height to of) effect is fail for (some approach will instrument this if height with an tion for problem as PGS The the height height. that on for EA PGS of a regression of use the assumption the under term that disturbance the in variables servable * o litoy edsusteesrtge tgetrlnt below. length greater control at not strategies would these discuss models) We fixed-effects pleiotropy. neighborhood for for or sibling indicators (e.g., genetic in strategies and changes other to time, relationship over weak vary a not have does IV the because ie-fet siainwt ae aawudas rcueM-yestrategies MR-type preclude also would data panel with estimation Fixed-effects nsadr R esr fgntctnec ( tendency genetic of measure a MR, standard In ftetu odtoa ntof (net conditional true the If (ξ y S ) X ilgnrlyivleadrc e fetof effect net direct a involve generally will T δ ˆ n h fetof effect the and |X fba htaie rmcreainbetween correlation from arises that bias of eo hni ol o edt ofso. 
enwuse now We confusion.) to lead not would it when below T seoeos(0 2.Oesc xml ol be would example such One 62). (60, exogenous is uhthat such , y y oyei crsas lyaptnilyimportant potentially a play also scores Polygenic e e u ahrteefc fsm nonrandom- some of effect the rather but se, per T s fcus,ntpsil,because possible, not course, of is, y y = = * si h oe,teefc fidvda SNPs individual of effect the model, the in is T eg,abhvoa revrnetlvariable environmental or behavioral a (e.g., ietcnrlfrtecniinlgenetic conditional the for control Direct δ δ = = T T nEq. in G αS + + T ξ X X T S T ∗ on |X y ζ |X ∗ β β |XT y T y y |XT sue sa Vi nefr to effort an in IV an as used is 3) y 1 + + vrtm.Fxdefcsrgesosbsdon based regressions Fixed-effects time. over + + T . yadn ramn variable treatment a adding by PNAS X X Having ahrthan rather γ G nteGWAS. the in nEq. in T S β S β ζ y T eei rpniyfor propensity genetic ) y ∗ T y ∗ |XT |XT |XT + | + o.115 vol.   namely , sntkon What known. not is 3) S + + T T y ∗ |XT  .  y y S , y ∗ nteequation the in |X | T T T o 22 no. S S T nteequa- the in ol generally would y S y ∗ (ζ and |XT |XT T n unob- and ) |X which , n an and | y o a for ) (more . E4973 [4] [3] y
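The split-sample heritability idea (two noisy PGS indicators used as instruments for each other) can be sketched in a short simulation. This is a toy illustration under assumptions that are not from the paper: scores are simulated directly rather than built from SNP data, both noisy indicators have the same error variance, all variables are standardized, and the names (`simulate_giv_heritability`, `noise_var`) are hypothetical. In this setup, rescaling the squared IV coefficient by the correlation between the two indicators is the step that converts the coefficient on a standardized noisy score into a variance share.

```python
import numpy as np


def simulate_giv_heritability(n=200_000, h2=0.4, noise_var=0.4, seed=0):
    """Compare a naive OLS heritability estimate with a split-sample IV estimate.

    s_true : simulated "true" polygenic score with variance h2
    s1, s2 : two noisy indicators of s_true with independent errors, standing
             in for PGSs built from nonoverlapping GWAS subsamples
    """
    rng = np.random.default_rng(seed)
    s_true = rng.normal(0.0, np.sqrt(h2), n)
    y = s_true + rng.normal(0.0, np.sqrt(1.0 - h2), n)
    s1 = s_true + rng.normal(0.0, np.sqrt(noise_var), n)
    s2 = s_true + rng.normal(0.0, np.sqrt(noise_var), n)

    def standardize(v):
        return (v - v.mean()) / v.std()

    y, s1, s2 = standardize(y), standardize(s1), standardize(s2)

    # OLS of y on one noisy score: the squared coefficient is attenuated.
    gamma_ols = np.mean(y * s1)

    # IV with a single instrument: cov(y, s2) / cov(s1, s2).
    rho = np.mean(s1 * s2)  # correlation between the two indicators
    gamma_iv = np.mean(y * s2) / rho

    return {
        "h2_ols": gamma_ols ** 2,
        "h2_giv": gamma_iv ** 2 * rho,
        "rho": rho,
    }


if __name__ == "__main__":
    print(simulate_giv_heritability())
```

With these assumed values, the squared OLS coefficient on one noisy score recovers only about half of the simulated heritability, while the IV-based estimate is close to the value used to generate the data.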

SOCIAL SCIENCES SEE COMMENTARY

We refer to the combined use of $S_{y|XT}$ as a control and $S_{T|X}$ as an IV for $T$ as EMR. However, controlling for $S_{y|XT}$ as a proxy for $S^*_{y|XT}$ is not a perfect solution to pleiotropy because it leaves a component of $S^*_{y|XT}$ in the error term which is correlated with $S_T$ due to pleiotropy. As a result, the EMR estimate for $\delta$ will be biased. The practical question, then, is whether alternative strategies that split the GWAS sample for $y$ to obtain multiple indicators of $S^*_{y|XT}$ that can be used as IVs for each other (e.g., $S_{y1|XT}$ and $S_{y2|XT}$) are sufficient to rescue $S_T$ as a practically useful IV for $T$.† This is a practical question beyond the reach of formal mathematics and best answered by simulation analyses. Unfortunately, and as we show in SI Appendix, section 2, the pleiotropy-induced violation of the exclusion restriction when using genetic IVs for $T$ is sufficient to produce serious bias in the estimated effect of $T$ even if one attempts to control for pleiotropy using such strategies. The magnitude of bias clearly depends on the quality of the proxies for $S^*_{y|XT}$. However, we find that pleiotropy leads to considerable bias in MR in virtually all scenarios we investigated (SI Appendix, sections 2, 3, and 4 and Tables S2–S15).

The problems posed by pleiotropy cannot be completely eliminated without knowledge of $S^*_{y|XT}$ (SI Appendix, section 2). However, this situation does not mean that the estimation problems introduced by pleiotropy are intractable. When endogeneity bias is driven by genetic correlation, the PGS for $y$ can still be used to obtain more accurate estimates of the effect of $T$ than can be obtained with MR or, for that matter, with OLS that lacks controls for the direct effect of genetic markers on $y$. To gain insight into the best strategy, we consider the reasons for the pleiotropy bias. Regardless of whether the estimation strategy is OLS or the second stage of IV regression involving OLS on predicted variables from the first stage, the coefficient bias comes from the extent to which the expected estimate of the OLS coefficients differs from the true coefficients;‡ namely,

$$E[\hat{\beta} \mid X] = \beta + E[(X'X)^{-1} X'\epsilon \mid X]. \qquad [5]$$

In other words, the coefficient bias from OLS is the expected regression coefficient of the error on the included variables in the regression. If $\epsilon$ is the sum of an omitted variable, $z$, which is correlated with the regressors, and additional variables that are uncorrelated with the regressors, then the bias for each coefficient $\beta_k$ in Eq. 5 becomes the product of the regression coefficient for $x_k$ in the regression of $z$ on all of the included variables multiplied by the effect of $z$ on the outcome. For simplicity, we assume that the only variables in the regression are $T$ and a potential proxy for $S^*_{y|XT}$, which we call $\tilde{S}_{y|XT}$. For any given proxy, $\tilde{S}_{y|XT}$, the bias in the estimate of $\delta$ (the coefficient for $T$ in Eq. 3) comes from the expected coefficient of $T$ from a regression of $\gamma S^*_{y|XT} - \tilde{\gamma}\tilde{S}_{y|XT}$ on $T$ and $\tilde{S}_{y|XT}$. We consider three alternative approaches for the proxy $\tilde{S}_{y|XT}$, which we call simple OLS, GIV-C, and GIV-U. First, we use $S_{y|XT}$ as a proxy for $S^*_{y|XT}$ in a simple OLS regression. Second, we observe that $S_{y|XT}$ is correlated with $T$; its inclusion in the error (by virtue of its being controlled) may affect the bias in $\hat{\delta}$. So we construct an estimate for $S_{y1|XT}$, namely $\hat{S}_{y1|XT}$, by using $S_{y2|X}$ (the unconditional PGS from the second GWAS sample) as its IV. We call this approach, where $\hat{S}_{y1|XT}$ is used as the regressor in the second stage, GIV-C. We also use a third estimator that uses the same IV as in GIV-C (i.e., $S_{y2|X}$), but that substitutes the unconditional PGS for $y$ (i.e., substitutes $S_{y1|X}$ for the conditional PGS $S_{y1|XT}$) in the structural model in Eq. 3. We then use $S_{y2|X}$ to predict $S_{y1|X}$, obtaining $\hat{S}_{y1|X}$ as the regressor in the second stage. We call this third approach GIV-U.

We generally expect the use of GIV-C to perform better than the use of the proxy $S_{y|XT}$ in simple OLS. If the true effect of $T$ on $y$ is positive and positive pleiotropy is present, the estimated effect of $T$ on $y$ will have positive bias. This follows from the positive correlation between $S^*_{y|XT}$ and $T$ and from the positive effect of $S^*_{y|XT}$ on $y$. The presence of the proxy $S_{y|XT}$ in the first approach (simple OLS) adds a partially offsetting negative bias, because the correlation between $S_{y|XT}$ and $T$ is positive and the effect $\hat{\gamma}_{OLS}$ is also positive, but $\hat{\gamma}_{OLS} S_{y|XT}$ is being subtracted, which causes the offsetting bias to be negative. The net bias is expected to be positive, but we would expect it to be smaller with the inclusion of the proxy than with no proxy at all, both because the correlation between $T$ and $S_{y|XT}$ would be lower than between $T$ and $S^*_{y|XT}$ and because we expect $\hat{\gamma}_{OLS}$ to be attenuated relative to $\gamma$. When GIV-C is used instead, the term in the error becomes $\gamma S^*_{y|XT} - \hat{\gamma}_{IVc}\hat{S}_{y1|XT}$. The presence in the first stage of GIV-C of $T$, which is correlated with $S^*_{y|XT}$, prevents the IV strategy from obtaining a consistent estimate of $\gamma$. Nonetheless, we would generally expect $\hat{\gamma}_{IVc} > \hat{\gamma}_{OLS}$, and therefore we expect the positive bias for the estimate of $\delta$ to be smaller when using GIV-C than when estimating $\delta$ using OLS and the proxy $S_{y1|X}$. We confirm this in the simulations in SI Appendix, Tables S2–S7.

With GIV-U, the problem term in the error is $\gamma S^*_{y|XT} - \hat{\gamma}_{IVu}\hat{S}_{y1|X}$. As before, the presence of the first term produces a positive bias in the estimate of $\delta$, while the second term produces an offsetting negative bias. The offset will be stronger when the unconditional PGS for $y$ is the regressor in the structural model, because the coefficients of the genetic markers in $S_{y1}$ are $\hat{\delta}\hat{\xi} + \hat{\zeta}$, where $\xi$ is the effect of the genetic marker on $T$. The presence of $\hat{\delta}G\hat{\xi}$ in the second endogenous term in the error (i.e., the second term in $\gamma S^*_{y|XT} - \tilde{\gamma}\tilde{S}_{y|XT}$) produces a stronger downward bias. This downward bias is made still stronger by the use of $\hat{\gamma}_{IVu}$ instead of $\hat{\gamma}_{OLS}$ as the coefficient, because we expect the first-stage regression to reduce the downward bias of $\hat{\gamma}_{OLS}$. In other words, we expect these three proxies to behave differently in the simulations, and, as we will see, this expectation is met in practice. We establish via a comprehensive set of simulations that GIV-C and GIV-U provide upper and lower bounds for the effect of $T$ across a range of plausible scenarios for pleiotropy and for heritability (SI Appendix, Tables S2–S7). We further establish through simulation analyses that GIV-C and GIV-U perform similarly in the case when endogeneity arises from pleiotropy and when it arises from pleiotropy in combination with genetic confounds for reasons other than pleiotropy [epistasis, effects from rare alleles, or genetic nurturing effects, where the environment of ego is shaped by genetically related individuals to ego (63)] (SI Appendix, section 3 and Tables S8–S10).

†An earlier version of our paper pursued this approach and called it GIV regression. However, we later found that controlling for $T$ in a GWAS for $y$ induces a correlation of $S_{y1|XT}$ with $S_T$ that invalidates the latter as an IV. The version of GIV-U and GIV-C regression we describe below does not have this problem because it does not use a genetic instrument for $T$ anymore. Instead, GIV-U and GIV-C both rely on a proxy-control strategy that uses only an instrument for $S_{y1|XT}$ or $S_{y1|X}$ to correct for measurement error in these proxies for $S^*_{y|XT}$ and $S^*_{y|X}$.

‡It is possible to have finite-sample bias that disappears asymptotically, in which case the estimator is consistent. We use the expectation formula instead because it is arguably more straightforward to understand.

Simulations

We explored the performance of GIV regression in finite sample sizes using three sets of simulation scenarios (SI Appendix, sections 2–4). The simulations generated genetic and phenotypic data at the individual level from a set of known models in

E4974 | www.pnas.org/cgi/doi/10.1073/pnas.1707388115 DiPrete et al.

a training sample and a holdout sample using parameters that are realistic for genetically complex traits. We then estimated genetic effects on y and T using GWAS in the training sample and constructed polygenic scores with the estimated parameters for each SNP in the holdout sample. Thus, the polygenic scores in our simulations have the realistic property that their predictive accuracy increases with the sample size of the training (i.e., GWAS) sample and the average effect size of each SNP (45, 52). Finally, we analyzed the extent to which various estimation strategies recover the effect of T on y in the holdout sample. We produced these estimates using OLS, MR, EMR, proxy OLS, GIV-C, and GIV-U regression, and we compared these results with the true answer across a range of parameter values. We ran 20 simulations with different random seeds for each set of parameters to obtain a distribution of estimated effects. (The computer code for these simulations is available at https://github.com/cburik/GIVsim.)

The simulations specify that the true PGS scores for T and y covary as a result of genetic correlation. We made the conservative assumption that the entire genetic correlation between T and y is due to direct pleiotropy; i.e., all genes that are associated with both phenotypes have direct effects on both. In practice, this is unlikely to be the case, but it is equally unlikely that one can put a credible upper bound on (or completely rule out) direct pleiotropy.

In the first set of simulations, we assumed that the entire endogeneity problem arises from genetic confounds (SI Appendix, section 2).

In the second set of simulations, we allowed endogeneity to arise from sources that are both correlated with genes and that cause the disturbance term in the structural equation for T to be correlated with the disturbance term in the structural model for y even if the true conditional PGS for y is in the structural equation (SI Appendix, section 3). This situation would occur if rare alleles were missing from the true PGS for T and y and if the effect of these alleles was correlated with the true scores for T and the true conditional PGS for y. It would also occur under conditions of epistasis where nonlinear effects of genes were in the PGS error and were correlated with the linear effects of genes in the PGS for T and the conditional PGS for y. Third, this situation would occur if the genetic factors that affect T and y are correlated with the environmental factors in the disturbance term for y that are caused or selected by parental genes, which are correlated with the genes of sample members and therefore also with variables (like T) that are affected by the genes of sample members, i.e., by genetic nurturing (63).

In the third set of simulations, we specified the presence of a correlation between the error terms in the models for T and y that was not itself correlated with genetic variables (SI Appendix, section 4). This would occur in a situation where some environmental or behavioral factor that is unrelated to genetics both produces an effect on T and an effect on y.

A summary of these results is in Table 1 for the case where the effect of T on y is set to 1.0 (see SI Appendix, Table S16 for details on the standardized effect size). Scenarios A–D in Table 1 refer to situations where pleiotropy is the only source of bias, scenario E contains pleiotropy plus other sources of genetic confounds, while scenarios F and G also include endogeneity from nongenetic (i.e., environmental) sources. The results provide considerable reason to be skeptical of estimates from MR. When pleiotropy is present, the MR strategy is undermined by the violation of the exclusion restriction for genetic IVs. Our results find that MR performs poorly even when nongenetic endogeneity is present along with pleiotropy. In contrast, GIV regression provides reasonable upper and lower bounds of the true effect of T on y if the source of endogeneity is only from pleiotropy or other genetic confounds (i.e., unobserved genetic variants, epistasis, or genetic nurturing) and the heritability is not extreme. GIV-C generally overestimates the effect of T on y, but the overestimation is modest at low to moderate levels of pleiotropy and heritability and is more accurate than OLS without proxies, MR, or EMR. GIV-U generally underestimates the effect of T on y but provides an estimate that is reasonably close to the true answer under conditions of low to moderate levels of pleiotropy and heritability. Even at higher levels of pleiotropy and heritability, the combination of GIV-C and GIV-U provides useful information about whether T actually has a causal effect on y and what the upper bound of this effect is likely to be.

Table 1. Illustrative results from simulations, estimated coefficient for T
Columns: Model, Parameter, OLS, MR, GIV-C, GIV-U. Rows: A–D, pleiotropy only; E, genetic confounds; F, nongenetic confounds; G, nongenetic confounds with control.
Shown are the mean estimated coefficient for T and the SE (within parentheses) of 20 simulations using different estimation methods for several models. For all models the genetic correlation (ρ) was 0.5 and the coefficient (δ) was 1. These results are a selection from SI Appendix, Tables S2–S4, S7, S9, S11, and S14; see SI Appendix, sections 2–4 for more details on the parameters, variance, and standardized effect size. [Table body not recoverable from the extracted text.]
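The simulation design and estimators described above can be illustrated with a minimal Python sketch. It is not the authors' simulation code (which is available at https://github.com/cburik/GIVsim) and rests on strong simplifying assumptions that are ours: independent standardized markers instead of real SNP genotypes, 200 markers, equal sample sizes of 5,000, direct pleiotropy as the only source of endogeneity, and hypothetical helper names (`simulate`, `gwas`, `ols`, `tsls`). The true effect of T is set to 1 and the genetic correlation to 0.5, as in Table 1.

```python
import numpy as np

rng = np.random.default_rng(0)
M, N, DELTA, RHO = 200, 5000, 1.0, 0.5   # markers, sample size, true effect of T, genetic correlation

# Marker effects on T (xi) and direct effects on y (zeta); their correlation mimics pleiotropy.
cov = (0.5 / M) * np.array([[1.0, RHO], [RHO, 1.0]])
xi, zeta = rng.multivariate_normal([0.0, 0.0], cov, size=M).T

def simulate(n):
    """Draw genotypes, exposure T, and outcome y for n individuals."""
    G = rng.standard_normal((n, M))      # assumption: independent standardized markers
    T = G @ xi + rng.normal(0.0, np.sqrt(0.5), n)
    y = DELTA * T + G @ zeta + rng.normal(0.0, np.sqrt(0.5), n)
    return G, T, y

def gwas(G, y, T=None):
    """Per-marker regression of y on each marker, optionally conditional on T."""
    Z = np.ones((len(y), 1)) if T is None else np.column_stack([np.ones(len(y)), T])
    resid = lambda A: A - Z @ np.linalg.lstsq(Z, A, rcond=None)[0]   # partial out covariates
    Gr, yr = resid(G), resid(y)
    return (Gr.T @ yr) / np.einsum("ij,ij->j", Gr, Gr)

def ols(y, X):
    """Ordinary least squares coefficients."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

def tsls(y, X, Z):
    """Two-stage least squares with regressors X and instruments Z."""
    X_hat = Z @ np.linalg.lstsq(Z, X, rcond=None)[0]   # first stage: project X on Z
    return ols(y, X_hat)                               # second stage

# Two nonoverlapping GWAS (training) samples and one holdout sample.
G1, T1, y1 = simulate(N)
G2, T2, y2 = simulate(N)
G, T, y = simulate(N)

S_T = G @ gwas(G1, T1)           # PGS for T, used as the MR instrument
S_y1c = G @ gwas(G1, y1, T1)     # PGS for y from sample 1, conditional on T
S_y1u = G @ gwas(G1, y1)         # PGS for y from sample 1, unconditional
S_y2u = G @ gwas(G2, y2)         # PGS for y from sample 2, unconditional (the IV)

one = np.ones(N)
Z = np.column_stack([one, T, S_y2u])   # T instruments itself; S_y2u instruments the sample-1 PGS
est_ols = ols(y, np.column_stack([one, T]))[1]
est_mr = tsls(y, np.column_stack([one, T]), np.column_stack([one, S_T]))[1]
est_giv_c = tsls(y, np.column_stack([one, T, S_y1c]), Z)[1]
est_giv_u = tsls(y, np.column_stack([one, T, S_y1u]), Z)[1]

print(f"true={DELTA:.2f} OLS={est_ols:.3f} MR={est_mr:.3f} "
      f"GIV-C={est_giv_c:.3f} GIV-U={est_giv_u:.3f}")
```

In this toy setup one would expect the OLS and MR estimates to lie above the true value of 1 and the GIV-U estimate to lie below the GIV-C estimate, in line with the bounding behavior described in the text. The magnitudes depend entirely on the assumed parameters and should not be read as a replication of Table 1.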

In the case where the true value of T is zero (i.e., where the true model is Eq. 1), we expect that GIV-U will produce an estimate that is close to zero as long as the endogeneity comes either from pleiotropy or from other genetic confounds (see SI Appendix, sections 2–4, for details). The simulations in SI Appendix, section 2 show that when the endogeneity is only from genetic sources, GIV-U estimates the effect of T to be close to zero regardless of the level of pleiotropy or inheritance that is specified in the simulations.

When the source of endogeneity is nongenetic in origin, we find that neither MR nor proxy controls for pleiotropy provide a satisfactory method for determining the effect of T on y. In this scenario, the pleiotropy creates endogeneity bias for genetic IVs that defeats the ability of MR to solve the problem of nongenetic endogeneity via an IV strategy. Nongenetic endogeneity can cause even GIV-U to overpredict the effect of T on y, although in our simulations it is clearly the most accurate of all of the estimators that we have surveyed when the effect of T is zero. Indeed, GIV-U always provides the most conservative estimate across the entire range of scenarios that we have surveyed, both for the case where an effect of T on y exists and when the effect of T is zero.

Inference with the GIV can be further strengthened in cases where nongenetic endogeneity can be controlled either through observable variables or through strategies such as family fixed effects that reduce or eliminate the impact of nongenetic forms of environmental endogeneity. Indeed, environmental endogeneity is a concern in most applied-research questions that use nonexperimental data. Reassuringly, our simulations show that GIV regression is a good estimation strategy in the presence of both direct pleiotropy and environmental endogeneity if control variables are available that manage to absorb a substantial share of the nongenetic confounds (SI Appendix, Tables S13–S15). Therefore, we recommend always using GIV regression in combination with control variables that capture possible environmental confounds, ideally in datasets that allow controlling for family fixed effects (e.g., using siblings or dizygotic twins). In SI Appendix, section 6, we provide additional practical guidelines for GIV regression.

The simulations described in this paper certainly do not cover all conceivable data-generating processes, but they are nonetheless of considerable utility.

Empirical Applications

We illustrate the practical use of GIV regression in two empirical applications using data from the Health and Retirement Study (HRS) for 2,751 unrelated individuals of northwest European descent who were born between 1935 and 1945 (SI Appendix, section 5).

The Narrow-Sense SNP Heritability of EA. First, we demonstrate that GIV regression can recover the unbiased genomic-relatedness-matrix restricted maximum-likelihood (GREML) estimate of the chip heritability of EA. Specifically, we follow the common practice in GREML estimates of heritability and analyze the residual of EA from a regression of EA on birth year, birth year squared, gender, and the first 20 principal components from the genetic data (64). Next, we standardize the residual and regress it on a standardized PGS for EA using OLS or GIV. The results are displayed in Table 2. The OLS estimate of the PGS accounts for 6.8% of the variance in EA (β² = ΔR² = 6.8%), which is substantially lower than the 17.3% (95% CI ± 4%) estimate of chip heritability reported by ref. 64 in the same data using GREML. [GREML yields unbiased estimates of SNP-based heritability that are not affected by attenuation (65).] Instead, the GIV regression results in columns 2 and 3 of Table 2 imply a chip heritability of 13.4% (CI ± 3.9%) and 13.8% (±4.0%), respectively. Thus, the 95% CIs of the GREML estimate and the two GIV estimates overlap, demonstrating that GIV regression can recover the chip heritability of EA from polygenic scores.

Table 2. Effects of the polygenic score for educational attainment (PGS EA) on (residualized) educational attainment in the Health and Retirement Study (HRS)

Variable        OLS         IV1         IV2
PGS EA UKB      0.259***    0.523***
                (0.0183)    (0.0385)
PGS EA SSGAC                            0.530***
                                        (0.0389)
ĥ²              NA          0.134       0.138
                NA          (0.0197)    (0.0202)
N               2,751       2,751       2,751

*P < 0.05, **P < 0.01, ***P < 0.001. We regress the residual of EA on the different PGSs and calculate the implied heritability estimates. SEs are in parentheses. All variables have been standardized. EA is measured in years of schooling needed to obtain the highest achieved educational degree according to International Standard Classification of Education (ISCED) classifications. We use the residual of EA after a regression on birth year, birth year squared, gender, and the first 20 principal components in the genetic data. PGS EA SSGAC, PGS for EA using meta-analysis from ref. 15, excluding data from 23andMe, UK Biobank (UKB), and HRS; PGS EA UKB, PGS for EA using UKB data. IV1 uses PGS EA SSGAC as instrument and IV2 uses PGS EA UKB as instrument. NA, not applicable.

The Relationship Between Body Height and Educational Attainment. Previous studies using both OLS and sibling or twin fixed-effects methods have found that taller people generally have higher levels of EA (66–68). They are also more likely to perform well in various other life domains, including earnings, higher marriage rates for men (although with higher probabilities of divorce), and higher fertility (69–74). The question is what drives these results. Can they be attributed to genetic effects that jointly influence these outcomes? Are there social mechanisms that systematically favor taller or penalize shorter individuals? Or are there nongenetic factors (e.g., the uterine and postbirth environments especially related to nutrition or disease) that affect both height and these life-course outcomes? The literature on the relationship between height and EA has found evidence that the association arises largely through the relationship between height and cognitive ability, which may suggest that the height–EA association is driven largely by genetic association between height and cognitive ability. We use GIV regression with individual-level data from the HRS to clarify the influence of height on EA, and we compare these results with those obtained from OLS and from MR. In addition, we conduct a negative control experiment that estimates the causal effect of EA on body height (which should be zero). A complete description of the materials and methods is available in SI Appendix, section 5.

GWAS summary statistics for height were obtained from the Genetic Investigation of Anthropometric Traits (GIANT) consortium (35) and by running a GWAS on height (conditional on EA and unconditional on EA) in the UKB (75). The UKB was not part of the GIANT sample. GWAS summary statistics for EA were obtained from the Social Science Genetic Association Consortium (SSGAC) for the unconditional PGSs. The most recent study of the SSGAC on EA used a meta-analysis of 64 cohorts for genetic discovery (15). We obtained meta-analysis results from this study with the HRS, UKB, and 23andMe cohorts excluded and we refer to the PGS constructed from these results as EA SSGAC. Furthermore, we obtained GWAS estimates for EA in the full UKB release (N = 442,183) from ref. 76. We refer to this PGS as PGS EA unc. UKB. We also created a PGS for EA

conditional on height by running a GWAS on EA in the same UKB sample (PGS EA cond. UKB). There is sample overlap between Height GIANT and EA SSGAC. Therefore, whenever one of the two was used as a regressor, we excluded the other PGS as instrument and used PGS from the UK Biobank data instead to ensure independence of measurement errors in the PGS.

In Table 3 we report the estimated standardized effect of height on EA. The OLS results show that height appears to have a strong positive effect on EA, with 2.5 additional centimeters in height generating one additional month of schooling. MR appears to confirm the causal interpretation of the OLS result; indeed, the point estimate from MR is even slightly larger than from OLS. As discussed above, MR suffers from probable violations of the exclusion restriction due to pleiotropy. These violations could stem from the possibility that some genes have direct effects on both height and EA. (Results from refs. 15 and 18 suggest a genetic correlation between height and EA of about 0.15.) They could also stem from the possibility that the PGS for height by itself is correlated with the genetic tendency for parents to have higher EA and higher income, which enables higher parental investments into their children who may therefore be more likely to reach their full cognitive potential and have higher EA. Controlling for the PGS is an imperfect strategy for eliminating this source of endogeneity because the bias in the estimated effect of the PGS also biases the estimated effect of height.

Table 3. Estimates of the effect of height on educational attainment (EA)

Variable              OLS         MR          GIV-C       GIV-U
Height                0.136***    0.160***    0.168***    0.110***
                      (0.0262)    (0.0481)    (0.0264)    (0.0262)
PGS EA cond. UKB                              0.396***
                                              (0.0367)
PGS EA uncond. UKB                                        0.384***
                                                          (0.0354)
N                     2,751       2,751       2,751       2,751

*P < 0.05, **P < 0.01, ***P < 0.001. Standardized effect sizes, with SEs in parentheses. Birth year, birth year squared, gender, EA mother, EA father, and the first 20 principal components are included as control variables. For MR, a PGS for height from UK Biobank (UKB) data was used as instrument. For GIV-C and GIV-U, PGS EA SSGAC was used as an instrument.

If all of the bias in EA came from positive pleiotropy, then we would expect GIV-C and GIV-U to provide upper and lower bounds for the true effect of height on EA, respectively. Thus, if the only source of endogeneity is pleiotropy, the results in Table 3 would suggest that the true standardized effect is between 0.11 (from GIV-U) and 0.17 (from GIV-C).

However, the negative control regression results in Table 4 provide substantial evidence of bias from environmental endogeneity sources. Given that the true effect of EA on height should be zero, GIV-U would accurately estimate this effect to be zero in the absence of environmental endogeneity. Instead, GIV-U reports a significant positive effect of EA on height. MR also reports a positive and statistically significant effect of EA on height. This upward bias in the MR estimate is strong evidence of pleiotropy bias that invalidates the IV in MR. The guidance from the simulations is that the GIV-U estimate points to a true effect of height on EA which is as small as or smaller than the GIV-U estimate, which is 25% smaller than that of MR. The extent of upward bias in the GIV-U estimate depends on the strength of the environmental variables that simultaneously affected the height of HRS respondents and also affected their EA.

Table 4. Estimates of the effect of educational attainment (EA) on height

Variable                OLS         MR          GIV-C       GIV-U
EA                      0.072***    0.179**     0.050***    0.040***
                        (0.0138)    (0.0543)    (0.0119)    (0.0120)
PGS height cond. UKB                            0.448***
                                                (0.0174)
PGS height unc. UKB                                         0.446***
                                                            (0.0175)
N                       2,751       2,751       2,751       2,751

*P < 0.05, **P < 0.01, ***P < 0.001. Standardized effect sizes, with SEs in parentheses. Birth year, birth year squared, gender, EA mother, EA father, and the first 20 principal components are included as control variables. For MR, a PGS for EA from UK Biobank (UKB) data was used as an instrument. For GIV-C and GIV-U, PGS height GIANT was used as an instrument.

These results also point toward a productive strategy for learning more about the true effect of height on EA. The demonstration that pleiotropy as well as environmental bias is affecting the estimates in Table 3 implies that there is no effective fix for MR; its genetic instruments are contaminated by pleiotropy and therefore cannot be used to adjust for environmental endogeneity. Comparisons between monozygotic twins would effectively control for pleiotropy, but unobserved environmental factors affecting height and EA could still bias the results from MZ twin data. The most productive strategy is arguably to use GIV-C and GIV-U as proxy strategies to address bias from pleiotropy while also correcting for environmental confounds either by controlling for the relevant environmental factors directly or by using data on siblings that allow controlling for shared environmental variables via family fixed effects. With such data, the GIV-C and GIV-U estimates would provide approximate bounds as long as the GIV-U estimate of the effect of EA on height was close to zero. If the negative control regression suggests bias from environmental sources, the GIV-U estimate will be more conservative than all of the other estimators considered in this paper, and it will underpredict the true effect unless both the positive bias from pleiotropy and this genetic-unrelated endogeneity are quite strong.

These results do not provide as clean and neat a conclusion as might be desired, although uncertainty is inevitable in the absence of experimental data or a valid IV. At the same time, the empirical example provides considerable insight into the implications of the available estimates. Our results strongly imply that OLS provides an upwardly biased estimate of the effect of height on EA. They also strongly imply that the MR estimate suffers from pleiotropy bias and that MR is not an effective strategy for determining whether and to what extent height affects EA. Other studies suggest that pleiotropy between EA and height is not extremely high (18). Our simulation results therefore suggest that GIV-U is either a plausible estimate for the effect of height on EA if genetic-unrelated endogeneity is relatively weak or underpredicts this effect if the endogeneity is rather strong. If the estimate of GIV-U continued to be statistically significant in a sibling fixed-effects analysis, we would be confident that the effect is real and not an artifact either of positive bias from pleiotropy or from genetic-unrelated endogeneity. In other words, these results represent progress toward the goal of understanding how large the social advantage provided by height is in the process of EA, given the existence of confounds. (Obviously, none of the estimation strategies we discussed here address possible bias from nonrandom selection into genetic samples.)

Conclusion

Accurate estimation of causal relationships with observational data is one of the biggest and most important challenges in epidemiology and the social sciences, two fields of inquiry where

many questions of interest cannot be adequately addressed with properly designed experiments due to practical or ethical constraints. One important confound in nonexperimental data comes from direct pleiotropic effects of genes on the exposure and the outcome of interest. Both OLS and MR yield biased results in this case. We proposed GIV regression as an empirical strategy that controls for such pleiotropic effects using polygenic scores. GIV regression uses standard IV estimation algorithms such as two-stage least squares that are widely available in existing statistical software packages. Our approach provides reasonable upper and lower bounds of causal effects in situations when pleiotropic effects of genes are the only source of bias. We showed that OLS, MR, and GIV regression yield biased estimates if both genetic and environmental sources of endogeneity are present. However, GIV regression still outperforms OLS and MR in this scenario. Furthermore, GIV regression can (and should) be combined with additional strategies that allow controlling for bias from purely environmental or behavioral factors, such as using covariates or family fixed effects. Together, these approaches can provide reasonable estimates of causal effects across a broad range of scenarios.

GIV regression is called for whenever an experimental design, a valid IV strategy, or a large-enough sample of MZ twins is not available and when pleiotropy is a potential problem, a situation that is frequently encountered in practice. The main requirements for GIV regression are a prediction sample that has been comprehensively genotyped and large-scale GWAS results for the outcome of interest from two nonoverlapping samples. Due to rapidly falling genotyping costs that enable a growing availability of genetic data and large GWAS samples for many traits, these requirements have become increasingly feasible for many applications. Indeed, the combination of new estimation tools and continued rapid advancements in genetics should provide a significant improvement in our understanding of the effects of behavioral and environmental variables on important socioeconomic and medical outcomes.

ACKNOWLEDGMENTS. We are very grateful to Richard Karlsson Linnér for help with the GWAS analyses in the UK Biobank, to Aysu Okbay for providing us with subsets of the GWAS meta-analysis on educational attainment, and to S. Fleur Meddens for constructing genetic principal components in the HRS data. We thank Daniel J. Benjamin, Jonathan P. Beauchamp, Benjamin Domingue, Lisbeth Trille Loft, Niels Rietveld, Eric Slob, Patrick Turley, Hans van Kippersluis, and Tian Zheng for productive discussions and comments on earlier versions of the manuscript. Furthermore, we thank Elliot Tucker-Drob for directing us to the necessary correction of the heritability estimate in our model. This research was facilitated by the SSGAC and by the research group on genetic and social causes of life chances at the Zentrum für interdisciplinäre Forschung Bielefeld. Data analyses made use of the UK Biobank resource under Application 11425. We acknowledge data access from the GIANT Consortium. We used data from the HRS, which is supported by the National Institute on Aging (NIA U01AG009740, RC2 AG036495, and RC4 AG039029). This study was supported by funding from a European Research Council Consolidator Grant (647648 EdGe) (to P.D.K.). HRS genotype data can be accessed via the database of Genotypes and Phenotypes (dbGaP, accession no. phs000428.v1.p1). Researchers who wish to link genetic data with other HRS measures that are not in dbGaP, such as educational attainment, must apply for access from HRS.

www.pnas.org/cgi/doi/10.1073/pnas.1707388115 | DiPrete et al. | PNAS | vol. 115 | no. 22
