
Structural variants caused by Alu insertions are associated with risks for many diseases

Lindsay M. Payer (a,1), Jared P. Steranka (a), Wan Rou Yang (a), Maria Kryatova (a), Sibyl Medabalimi (a), Daniel Ardeljan (a,b), Chunhong Liu (a), Jef D. Boeke (b,c,d,e,f,1), Dimitri Avramopoulos (b,g), and Kathleen H. Burns (a,b,d,f,1)

(a) Department of Pathology, Johns Hopkins University School of Medicine, Baltimore, MD 21205; (b) McKusick–Nathans Institute of Genetic Medicine, Johns Hopkins University School of Medicine, Baltimore, MD 21205; (c) Institute for Systems Genetics, New York University Langone School of Medicine, New York, NY 10016; (d) Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins University School of Medicine, Baltimore, MD 21205; (e) Department of Molecular Biology & Genetics, Johns Hopkins University School of Medicine, Baltimore, MD 21205; (f) High Throughput Biology Center, Johns Hopkins University School of Medicine, Baltimore, MD 21205; and (g) Psychiatry and Behavioral Sciences, Johns Hopkins University School of Medicine, Baltimore, MD 21205

Contributed by Jef D. Boeke, March 22, 2017 (sent for review November 4, 2016; reviewed by Mark A. Batzer and Jeffrey Han)

Interspersed repeat sequences comprise much of our DNA, although their functional effects are poorly understood. The most commonly occurring repeat is the Alu short interspersed element. New Alu insertions occur in human populations, and have been responsible for several instances of genetic disease. In this study, we sought to determine if there are instances of polymorphic Alu insertion variants that function in a common variant, common disease paradigm. We cataloged 809 polymorphic Alu elements mapping to 1,159 loci implicated in disease risk by genome-wide association study (GWAS) (P < 10⁻⁸). We found that Alu insertion variants occur disproportionately at GWAS loci (P = 0.013). Moreover, we identified 44 of these Alu elements in linkage disequilibrium (r² > 0.7) with the trait-associated SNP. This figure represents a >20-fold increase in the number of polymorphic Alu elements associated with human phenotypes. This work provides a broader perspective on how structural variants in repetitive DNAs may contribute to human disease.

Alu | structural variant | GWAS | causative variant | interspersed repeats

Significance

Repetitive sequences comprise a large portion of the genome and are often thought of as “junk” DNA. They are a significant source of genetic variation, particularly Alu elements. Their functional consequence is frequently dismissed. Here, we test the hypothesis that Alu polymorphisms contribute to phenotypic differences between individuals. We identified an enrichment of Alu polymorphisms in regions of the genome associated with human disease risk. Further, we find 44 instances where the trait-associated SNP is a surrogate for presence or absence of an Alu insertion. This finding indicates that the Alu may be the variant effecting disease risk, an intriguing possibility given its size and regulatory potential. This work emphasizes the importance of considering repeat polymorphisms in functional variant analysis.

Author contributions: L.M.P., J.D.B., and K.H.B. designed research; L.M.P., J.P.S., W.R.Y., M.K., and S.M. performed research; D. Avramopoulos contributed new reagents/analytic tools; L.M.P., J.P.S., W.R.Y., M.K., D. Ardeljan, C.L., D. Avramopoulos, and K.H.B. analyzed data; and L.M.P., D. Ardeljan, C.L., J.D.B., and K.H.B. wrote the paper.

Reviewers: M.A.B., Louisiana State University; and J.H., Tulane University School of Medicine.

J.H. was a PhD trainee of J.D.B. (graduation date 2005). J.H. was a middle author on a single paper from J.D.B. in the past 4 years, on an unrelated project; its publication was delayed until 2014. All other authors declare no conflict of interest.

1. To whom correspondence may be addressed. Email: [email protected], lhorvat1@jhmi.edu, or [email protected].

This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1704117114/-/DCSupplemental.

We understand the function of only a small fraction of our genome. Protein-coding exons account for about 1% of our DNA (1). Although transacting factors and modified histone positions have been identified for other regions of DNA (e.g., ref. 2), researchers have identified the function of few noncoding sequences.

The most frequently occurring sequences are perhaps the least well understood. Interspersed repeats derived from mobile DNAs comprise ∼45% of our genome (1, 3). These sequences are often presumed to be nonfunctional “junk” DNA, with rare exceptions causing genetic disease. The first known mobile element insertion causing disease was a long interspersed element-1 (LINE-1) insertion interrupting a coding exon of the coagulation factor VIII gene (FVIII) to cause hemophilia A (4). Insertions of other transposons have also been reported in FVIII in patients with hemophilia A (e.g., refs. 5–7). In all, 124 cases of genetic disease have been attributed to de novo interspersed repeat insertions, including 76 cases caused by Alu short interspersed elements (SINEs) and 30 cases attributed to LINE-1 insertions (8). These insertions compromise gene function by disrupting an exon or interfering with mRNA splicing. In each instance, the mobile element insertion severely affects function and leads to an overt phenotype.

It has remained an open question as to whether common interspersed repeat variants also affect human health. Large numbers of interspersed repeat insertion variants that would be candidates for such an effect have been identified in recent years (e.g., refs. 9–11). Polymorphic interspersed repeats include LINE-1 sequences that copy and insert into new genomic loci using LINE-1–encoded proteins (12, 13). LINE-1 also mediates the retrotransposition of other interspersed repeats, including Alu SINE (14) and SINE-variable number tandem repeat (VNTR)-Alu elements (15, 16). These interspersed repeat variants reflect a major source of genetic diversity in humans. Over 18,000 common polymorphic transposable elements have been mapped (e.g., ref. 9), and it is estimated that >60,000 exist (17, 18).

To test the hypothesis that a subset of common transposable element polymorphisms has implications for human health, we focused on genomic intervals related to disease risk by genome-wide association study (GWAS). GWAS uses large numbers of cases and controls to approximate the locations of genetic variants that predispose to disease. We found unexpected numbers of Alu insertion polymorphisms residing within these GWAS intervals. At 44 loci, we demonstrate linkage disequilibrium (LD) between the Alu variant and trait-associated SNPs (TASs) identified by GWAS, thus providing genetic evidence that the Alu insertion is one of the candidate causative variants at each locus.

Results

Polymorphic Alu Elements Are Common Structural Variants. Alu insertions are ∼300-bp bipartite, primate-specific interspersed repeats derived from 7SL RNA (19). There are over 1.1 million Alu copies in the genome (1, 3). Of these elements, a small subset is polymorphic in the population such that both insertion and preinsertion (empty) alleles are present (e.g., refs. 9, 20–22). These elements are of special interest because they represent insertions that occurred relatively recently.

E3984–E3992 | PNAS | Published online May 2, 2017 | www.pnas.org/cgi/doi/10.1073/pnas.1704117114

To identify polymorphic copies that could affect disease risk, we focused on those copies mapping to locations already associated with disease phenotypes by GWAS. As a pilot study, we set out to catalog polymorphic Alu elements near GWAS signals. We mapped retrotransposon insertion variants using ligation-mediated PCR to amplify insertion sites followed by microarray analysis, transposon insertion profiling by microarray (TIP-chip) (23). By using conditions specific for detecting the AluYa5/8 and AluYb8/9 subfamilies, we avoided the evolutionarily older Alu subfamilies that are largely invariant in the genome (i.e., homozygously present in all people). These PCR products were fluorescently labeled and hybridized to custom genomic tiling microarrays encompassing 160-kb regions around 825 TASs (Dataset S1). We performed TIP-chip on genomic DNA samples from nine individuals from the Centre d’Étude du Polymorphisme Humain collection of Utah residents with ancestry from Northern and Western Europe (CEU) and compared their retrotransposon complements. This analysis revealed 80 polymorphic Alu elements mapping near GWAS-identified TASs; of these elements, 16 mapped within the LD block defined by the TAS and proxy SNPs with a correlation coefficient (r²) > 0.8 (Dataset S2). Thus, common Alu insertion polymorphisms could be readily identified in positions of the genome with importance in disease.

In parallel with these studies, more Alu structural variants were defined by the 1000 Genomes Project and others, obviating the need for us to profile common variants in small numbers of samples (e.g., refs. 9, 11). Due to the pace of these discovery efforts, no comprehensive list of all reported polymorphisms has been developed for Alu variants. Therefore, we collated reports of previously identified polymorphic Alu elements (10, 11, 21, 22, 24–27) to generate a catalog of 13,572 Alu variants distributed throughout the genome. These reports underscore that Alu polymorphisms are a major source of genetic diversity in humans.

Polymorphic Alu Elements Are Enriched at GWAS Loci. To identify those Alu variants potentially involved in disease, we determined intervals of the genome where variants could be tagged by TAS; each TAS marks a region in LD that contains other variants for which the TAS could serve as a proxy. We determined the LD block around each TAS based on the physical position of SNPs with pairwise correlation coefficients with a TAS ≥ 0.8. We identified polymorphic Alu elements contained within LD blocks defined by all GWAS TASs with P < 10⁻⁹. We excluded all GWAS signals and Alu elements at the HLA locus, given the extensive haplotypes in this region. We found 625 polymorphic Alu elements that map to 899 GWAS signals (Fig. 1); ∼17% of GWAS signals included in our analysis have a polymorphic Alu element within the identified LD block.

We next wanted to look at the distribution of these 625 polymorphic Alu elements. Relative to all polymorphic Alu elements, those elements mapping to GWAS signals are enriched within genes [odds ratio (OR) = 1.93, 95% confidence interval (CI) = 1.67–2.24; P = 5.91 × 10⁻¹⁹] and within 10 kb of genes (OR = 2.54, 95% CI = 1.98–2.70; P = 1.82 × 10⁻¹⁰). This finding is expected, because TASs are also enriched in genes (OR = 1.79, 95% CI = 1.65–1.93; P < 1 × 10⁻⁴⁷) and within 10 kb of genes (OR = 4.44, 95% CI = 3.93–5.01; P = 1.32 × 10⁻¹⁰⁶).

If polymorphic Alu insertions often contribute to altered disease risk, we might expect an enrichment of these elements in the TAS LD blocks. To investigate this possibility, we compared

[Fig. 1 image: chromosomes 1–22 with GWAS loci drawn as colored circles; * marks the HLA locus. Legend (phenotype categories): Digestive system disease; Cardiovascular disease; Metabolic disease; Immune system disease; Nervous system disease; Cancer; Other disease; Liver enzyme measurement; Lipid or lipoprotein measurement; Inflammatory measurement; Hematological measurement; Body weights and measures; Cardiovascular measurement; Other measurement; Response to drug; Biological process; Other trait.]

Fig. 1. Polymorphic Alu elements map to GWAS loci. GWAS signals (P ≤ 10⁻⁹) with at least one polymorphic Alu element mapping to the TAS-defined LD block are represented by a circle, with color based on the phenotype reported in the GWAS. Figure modified from ref. 75. Elements mapping to sex chromosomes and the HLA locus (*) were excluded from the diagram. Details of GWAS loci and polymorphic Alu elements are provided in Dataset S3.
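The odds ratios and 95% confidence intervals quoted for enrichment within and near genes have the form of a 2×2 table comparison. The sketch below is a minimal illustration, assuming a Wald interval on the log odds ratio; the text does not state which interval method was used, and the function name `odds_ratio_ci` and the cell counts are hypothetical placeholders rather than values from the study.

```python
import math


def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio and Wald confidence interval for a 2x2 table.

    Hypothetical cell layout for illustration:
      a: Alu elements in GWAS LD blocks that lie within genes
      b: Alu elements in GWAS LD blocks that lie outside genes
      c: all other polymorphic Alu elements within genes
      d: all other polymorphic Alu elements outside genes
    """
    if min(a, b, c, d) <= 0:
        raise ValueError("all four cells must be positive")
    odds_ratio = (a * d) / (b * c)
    # Standard error of log(OR) under the usual large-sample approximation.
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Made-up counts, chosen only to exercise the function (not from the paper).
or_value, ci_low, ci_high = odds_ratio_ci(300, 325, 4200, 8747)
print(f"OR = {or_value:.2f}, 95% CI = {ci_low:.2f}-{ci_high:.2f}")
```

An interval that excludes 1 corresponds to enrichment at roughly the 5% level; an exact (Fisher-type) interval would be a reasonable alternative when any cell count is small.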

the number of polymorphic Alu elements contained within TAS LD blocks with the number expected by chance by considering 1,000 random LD blocks with characteristics that mirror the TAS LD blocks. All reported GWAS signals (P ≤ 10⁻⁹) were reduced to a list of 3,242 nonoverlapping TAS LD blocks, with 625 polymorphic Alu elements occurring in these regions (Figs. 1 and 2A). To determine the rate of occurrence expected by chance, 1,000 random sets of 3,242 LD blocks each were generated. To control for nonrandom distribution of GWAS signals in the genome, the random LD blocks were based on random SNPs selected to mirror TASs. Because TASs are common variants enriched near genes, we considered both the minor allele frequency (MAF) and distance from the nearest gene when selecting the random SNPs. Each random SNP selected matched a TAS in MAF (within bins of 5%) and distance to the nearest gene (within gene, <10 kb, or >10 kb). These random SNPs were used to define random LD blocks (SI Materials and Methods) that we confirmed are of similar size to TAS LD blocks (Fig. 2B). Thus, although Alu elements modestly influence recombination rates and local LD structure (28), we did not need to correct for this effect by imposing selection on LD block size. Each set of 3,242 blocks was considered to be a single iteration, and the number of times polymorphic Alu elements mapped to these random LD blocks in the iteration was recorded. We needed to perform 1,000 iterations to see several instances of >625 Alu insertion variants corresponding to these intervals by chance (Fig. 2A). Thus, the observed number of Alu variants in proximity to GWAS signals is significantly greater than expected (P = 0.013; Fig. 2A). This finding supports the hypothesis that these elements may have functional impacts detected at multiple GWAS loci.

[Fig. 2 image: (A) density versus number of overlaps (x axis 500–650), annotated p = 0.013; (B) density versus distance (bp) (x axis 0–1,000,000).]

Fig. 2. Alu variants are enriched at GWAS signals. (A) Alu variants map to TAS LD blocks 625 times (red). To determine if this number is attributable to chance, 1,000 iterations of random LD blocks mirroring TAS LD blocks were generated. The distribution of the number of times Alu variants overlap with these random blocks is shown (black). (B) Random LD blocks (gray) closely mirror TAS LD blocks (red) in size.

Alu Insertions Are Common Variants at Many GWAS Loci. GWAS depends on a causative variant that is a frequent allele (i.e., a common variant). Because allele frequencies for many Alu insertion polymorphisms in our catalog were unknown, we developed PCR assays to genotype them in reference DNA samples.

We designed PCR primers to amplify across the reported insertion sites of 576 Alu variants near a TAS. To identify common variants, we first screened pools of genomic DNA samples from a panel of unrelated CEU individuals. We electrophoresed PCR products on agarose gels to identify loci producing two fragments indicating the filled (Alu-containing) and empty (preinsertion) alleles. Of the 576 reported polymorphic Alu elements, we were able to detect 174 in the pooled DNA samples.

To determine the allele frequencies of the Alu insertions at these 174 loci, we genotyped individual samples from the CEU HapMap reference panel (Dataset S4). The detected Alu insertion allele frequency ranged from 0.005 to 0.994, with an average of 0.351 (Fig. S1).

Some Alu Insertions Are in LD with TASs. We next evaluated the genetic relationship between each polymorphic Alu element and its corresponding TAS(s). We expected Alu insertions responsible for human phenotypes to exist in LD with the TAS, with the two genotypes highly correlated (i.e., high correlation coefficient). To measure LD between the Alu and TAS, we used SNP genotypes for the same panel of 90 CEU samples from the HapMap project (29). We found little to no correlation between the two variants (r² < 0.4) at 115 of the polymorphic Alu loci. For many, the correlation coefficient approaches zero, indicating that the TAS and polymorphic Alu element alleles are essentially independently segregating (Fig. 3A). The GWAS signal at these loci is apparently unrelated to the Alu insertion variant.

Of the remaining 58 polymorphic Alu elements, some LD was observed, with r² > 0.4. These elements included 18 Alu variants that are in moderate LD, as defined by r² values between 0.4 and 0.7, with at least one TAS. Correlation coefficients in this range indicate that the SNP and Alu are not perfect proxies for each other. However, high normalized coefficient of linkage disequilibrium (D′) values (>0.8) at 16 of these loci indicate that their correlations are weakened primarily by differences in allele frequency. In other words, the less frequent variant (Alu or TAS) tends to be on the same strand as one version of the more common variant (Fig. 3B).

In some cases (n = 6), we could identify a better proxy SNP than the TAS for the Alu that was directly genotyped in the GWAS. These SNPs did not show phenotype association as strong as the TAS, and thus we infer that the Alu is not a strong candidate for the causative variant. An example is age-related macular degeneration risk, which has been mapped to 1q31.3 by the TAS rs1831282 (P = 9 × 10⁻²⁴) (30) (Fig. S2). Differences in the MAF of the TAS and Alu reduce the ability of the TAS to serve as a proxy for effects of the polymorphic Alu element; the TAS MAF is 0.42, and the Alu MAF is 0.23. However, the variants are in some LD (r² = 0.4, D′ = 0.81). When present, the Alu is most often on the same strand as the C allele of rs1831282 (Fig. S2). Using individual-level genotyping data for these GWAS patients, we imputed, or inferred, the Alu genotype in each person and repeated the association analysis for this locus. The Alu variant at 1q31.3 is not as highly associated with age-related macular degeneration as rs1831282 (Fig. S2) and, instead, behaves similar to its proxy SNP.

In other cases, a single GWAS genotyping platform was not clearly identified (n = 10) or did not incorporate a strong proxy SNP for the Alu (n = 2); therefore, these cases may merit further follow-up. An example occurs at 4p16.1, where GWAS identified a haplotype associated with urate levels (P = 1 × 10⁻⁹) (31) (Fig. 3B). The TAS has a MAF of 0.34, whereas the polymorphic Alu element has a MAF of 0.49. Higher urate levels associate with the major haplotype (allele frequency = 0.66). When the Alu is present, it is consistently part of the major haplotype, but the preinsertion allele is found with both SNP haplotypes (Fig. 3B).

Identification of a Known Alu Variant Associated with Angiotensin-Converting Enzyme Levels. There is one polymorphic Alu element with a well-established functional association. It maps to the angiotensin-converting enzyme gene (ACE) locus (32, 33). The

287-bp Alu variant is located between exons 15 and 16 in an antisense orientation with respect to ACE (Fig. 3C). The Alu genotype is strongly associated with serum-immunoreactive ACE concentrations (32, 33) and detected in a GWAS of ACE activity (34) by the proxy TAS rs4343, which is in complete LD with the Alu (r² = 1) (Fig. 3C).

[Fig. 3 image. (A) No LD, 6p22.1 (ROS1, DCBLD1): r² = 0.01, D′ = 0.11; haplotypes Alu–C 0.267*, Alu–A 0.273, Empty–C 0.223*, Empty–A 0.237. (B) LD supported, 4p16.1 (SLC2A9): r² = 0.50, D′ = 0.84; haplotypes Empty–G 0.340, Empty–C 0.170, Alu–C 0.490. (C) Complete LD, 17q23.3 (CYB561, ACE, KCNH6): r² = 1.00, D′ = 1.00; haplotypes G–Alu–A 0.472, A–Empty–G 0.528*. Tracks: SINEs, polymorphic Alu elements, SNPs; scale bars 10 kb.]

Fig. 3. Genetic relationships between TASs and polymorphic Alu elements distinguish functional variant candidates. (A) There is no LD between the polymorphic Alu element and the TAS, rs9387478 (P = 10⁻¹⁰) at 6p22.1. The Alu element is just as frequently found with the risk haplotype (*) as with the protective haplotype. (B) Moderate LD between a polymorphic Alu element and a TAS associated with urate levels (P = 3 × 10⁻⁹) makes this variant a potential functional variant. Although the MAF differs between the Alu variant and the TAS, when present, Alu is consistently on the major haplotype strand. (C) Good functional variant candidate. There are many SINEs across the locus, but only one polymorphic Alu (red) has been identified at this locus; the polymorphic Alu element occurs in ACE. The LD structure is shown and generated by pairwise comparison between variants (SNPs and Alu element), where red indicates LD. The GWAS LD block associated with the TASs (blue) and Alu variant (red) is bracketed and shown by a red horizontal line. There is complete LD (r² = 1) between the Alu element and rs4343, the SNP associated with human serum ACE levels in a GWAS (P = 3 × 10⁻²⁵). The empty allele is on the risk (*) haplotype.

Alu Elements Are Tagged by TASs. In total, we identified 44 polymorphic Alu elements at 77 GWAS loci highly correlated with the TAS (r² > 0.7; Fig. 4). In these cases, the TAS serves as a perfect, or near-perfect, proxy for the polymorphic Alu element; thus, the Alu insertion could be considered a candidate causative variant. We note that these insertions and associated SNPs are all common variants, and their impact on disease risk is small.

These 44 loci have been associated with a diverse group of phenotypes through some of the most highly significant GWASs conducted to date (Table 1). Our results indicate that polymorphic Alu elements are candidate causative variants for a wide range of conditions with significant impacts on human health, including multiple sclerosis (1p13.1: CD58), obesity (2p25), acute lymphoblastic leukemia (ALL) (10q21.2: ARID5B), psoriasis (12q13.3: STAT2), and four breast cancer risk loci (8q24.21: CASC8, 12p11.22: PTHLH, 2q35: TCF4, and 6p23: RANBP9), among others (Table 1). We found the insertion allele associated with both protective and risk haplotypes, without significant bias (P = 0.275; Dataset S3).

Of the 44 Alu polymorphisms in strong LD with TASs, 23 map to intronic regions, with half in each orientation relative to the gene (sense vs. antisense). The remaining 21 insertions in LD with TASs are intergenic. Of these insertions, 16 Alu elements are upstream of the nearest gene, whereas five are downstream of the nearest gene (location bias binomial test, P = 0.027). This bias is likely influenced by the enrichment of TASs upstream of genes (35) and is also evident when we consider all intergenic polymorphic Alu elements mapping to TAS LD blocks (OR = 1.28, 95% CI = 1.03–1.59; P = 0.02). The range of distances between upstream Alu variants in LD with TAS and the nearest protein coding gene is broad (1–683 kb) without clustering in the proximal sequence.

Alu variants that are in strong LD with TASs mirror the characteristics of all polymorphic Alu elements in the genome. These Alu elements are all full-length, ∼300 bp (Dataset S5). We compared the sequence content of Alu elements by considering their subfamily assignments. The same subfamilies are represented in similar proportions for all polymorphic Alu elements (36), those Alu elements that map to GWAS signals, and the Alu variants in LD with TASs (Fig. S3).

One of the strongest GWAS signals we associated with an Alu insertion variant is at 2p25.3 (Fig. 5A). This region is an important genetic determinant for weight, obesity, and body mass index (best P = 3 × 10⁻⁴⁹) (e.g., ref. 37). Although the effect has been attributed to the transmembrane protein 18 (TMEM18), a transcriptional repressor that sequesters sequences at the membrane, there is debate concerning whether TMEM18 is involved in obesity regulation through expression in adipose tissue or the central nervous system (38). The causative variant(s) remain unclear even after targeted resequencing efforts that focused mainly on the TMEM18 coding sequence (39–41); it is presumed that the risk haplotype contains a noncoding regulatory variant. Seven TASs occur over a 23.5-kb interval, mapping ∼15–70 kb downstream of TMEM18 (e.g., refs. 37, 42–46). The risk haplotype is defined by TASs rs2867125-C, rs6711012-C, rs2903492-A, rs12463617-C, rs6548238-C, rs7561317-G, and rs10189761-A, with an overall allele frequency of ∼0.87 in the GWAS populations. We found a 306-bp polymorphic AluYa5 that maps

[Fig. 4 image: correlation coefficient for TAS and polymorphic Alu (y axis, 0.00–1.00) plotted by chromosome (x axis). Annotations: Strong LD, 44 different Alu variants; Moderate LD, 18 different Alu variants.]

Fig. 4. Forty-four Alu variants are in strong LD with TAS(s). LD results for all pairwise comparisons between polymorphic Alu elements and their corresponding TASs mapped across the genome are shown. Strong LD (r² > 0.7) indicates the best functional candidates, and falls above the red line. There are 44 of these insertions, also shown in Table 1. We also defined a subset of polymorphic Alu elements imperfectly correlated with nearby TASs (0.4 < r² < 0.7; n = 16) (above the gray line).
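The location-bias result quoted in the text (16 intergenic Alu elements upstream of the nearest gene versus five downstream, binomial test P = 0.027) can be checked with an exact binomial calculation. The sketch below is a minimal illustration that assumes a two-sided test against a null probability of 0.5; the text does not state the null or the sidedness, and the function name `binomial_two_sided_p` is a hypothetical helper.

```python
from math import comb


def binomial_two_sided_p(k, n):
    """Exact two-sided binomial P value for k successes in n trials.

    Assumes a symmetric null (success probability 0.5), so the two-sided
    value is twice the smaller tail, capped at 1.
    """
    total = 2 ** n
    upper_tail = sum(comb(n, i) for i in range(k, n + 1)) / total
    lower_tail = sum(comb(n, i) for i in range(0, k + 1)) / total
    return min(1.0, 2 * min(upper_tail, lower_tail))


# 16 of the 21 intergenic insertions are upstream of the nearest gene.
p_value = binomial_two_sided_p(16, 21)
print(round(p_value, 3))  # prints 0.027
```

Under these assumptions the calculation agrees with the reported value, which suggests (but does not prove) that the test was two-sided with equal upstream and downstream probabilities.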

Table 1. Polymorphic Alu elements in strong LD (r² > 0.7) with GWAS TAS

Region    Disease or trait                   SNP         P value  OR     r²
1p13.1    Multiple sclerosis*                rs1335532   3E-16    1.22   1.000
1q23.1    Red blood traits                   rs857684    4E-16    —      0.811
1q31.3    Meningococcal disease              rs426736    5E-13    1.59   1.000
2p16.1    Venous thromboembolism             rs1367228   2E-09    1.49   0.769
2p25.3    Obesity*                           rs2867125   3E-49    —      1.000
2q24.2    Bilirubin levels*                  rs2667011   2E-13    —      0.804
2q33.1    Crohn’s disease                    rs6738825   4E-09    1.06   0.920
2q35                                         rs16857609  1E-15    1.08   0.953
3p21.1    Osteoarthritis                     rs11177     5E-09    1.09   0.914
          Major mood disorders               rs2251219   2E-09    1.14   0.957
3q21.3    Monocyte count                     rs2712381   2E-16    —      0.877
3q25.32   Height                             rs2362965   2E-09    1.12   1.000
3q28      Alzheimer’s disease biomarkers     rs9877502   5E-09    —      1.000
4q25      Myopia (pathological)              rs10034228  8E-13    1.23   0.959
          Metabolic traits                   rs2087160   7E-13    —      0.723
          Blood pressure                     rs6825911   9E-09    —      0.839
5p15.33   Myocardial infarction              rs11748327  5E-13    1.25   0.816
6p21.1    Metabolic traits                   rs9472155   2E-26    —      0.945
6p22.2    Platelet counts                    rs441460    9E-18    3.08   0.961
6p23      Breast cancer                      rs204247    8E-09    1.05   1.000
6q16.1    Migraine*                          rs11759769  2E-12    1.18   0.887
7q22.1    Ulcerative colitis                 rs7809799   9E-11    1.56   1.000
7q31.31   Bone mineral density               rs4609139   1E-10    —      0.852
8q23.3    HDL cholesterol level              rs2293889   6E-11    —      0.795
8q24.21   Breast cancer*                     rs13281615  1E-27    1.09   0.765
          Prostate cancer*                   rs10505483  7E-15    1.73   1.000
9q22.32   Height                             rs10512248  4E-11    —      0.957
10p11.23  Dental caries                      rs399593    9E-09    —      1.000
10q21.2   ALL*                               rs10821936  6E-46    1.86   0.830
12p11.22  Breast cancer*                     rs10771399  8E-31    1.16   0.918
          Height                             rs2638953   7E-17    —      0.887
12q13.3   Psoriasis                          rs2066808   1E-09    1.34   1.000
          Height                             rs2066807   1E-13    —      1.000
12q21.2   Myopia (pathological)              rs17788937  4E-15    —      0.829
13q22.3   Hair color                         rs975739    2E-14    —      0.874
14q32.2   Type 1                             rs4900384   4E-09    1.09   0.776
          Graves’ disease                    rs1456988   5E-09    1.12   0.822
15q21.2   Thyroid hormone levels*            rs10519227  1E-11    —      0.763
15q22.2   Height                             rs7178424   6E-09    —      0.781
15q24.1   Liver enzyme levels                rs8038465   1E-09    2.40   0.825
16p13.13  Age at menopause                   rs10852344  1E-11    —      0.752
16q22.1   Coronary heart disease             rs3729639   2E-11    —      1.000
17q23.3   Height*                            rs2665838   5E-25    —      0.916
          Metabolic traits* (including ACE)  rs4343      3E-25    16.20  1.000
17q25.3   Eye color traits                   rs9894429   9E-14    —      0.813
20q13.32  Blood pressure*                    rs6015450   4E-23    —      1.000

ORs are given when they were reported in the GWAS catalog (75). Dashes indicate values not reported. *When there are multiple reports of the same phenotype at a locus, the table includes the strongest GWAS signal.
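The r² values in Table 1, and the D′ values quoted in the text, are pairwise LD statistics between an Alu insertion and a TAS. The sketch below is a minimal illustration of the standard definitions, assuming phased haplotype frequencies are already available; the function name `ld_stats` is hypothetical, and the way the study derived its values from genotype data is not specified here. The first example uses the ACE haplotype frequencies shown in Fig. 3C (Alu-bearing haplotype 0.472, empty haplotype 0.528); the second uses the MAFs quoted for the 1q31.3 macular degeneration example under the added assumption that the rarer allele always travels with one SNP allele.

```python
def ld_stats(p_ab, p_a, p_b):
    """Return (D, |D'|, r^2) for two biallelic variants.

    p_ab: frequency of the haplotype carrying allele A (e.g. Alu present)
          together with allele B (e.g. one SNP allele)
    p_a, p_b: marginal frequencies of alleles A and B
    """
    d = p_ab - p_a * p_b
    # D' normalizes D by its largest attainable magnitude given the margins.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    r2 = d * d / denom if denom > 0 else 0.0
    return d, d_prime, r2


# ACE locus (Fig. 3C): the Alu and the linked SNP allele share one haplotype.
_, d_prime, r2 = ld_stats(0.472, 0.472, 0.472)
print(round(d_prime, 3), round(r2, 3))  # prints 1.0 1.0

# Assumed scenario: allele frequencies 0.23 and 0.42, with the rarer allele
# found only on haplotypes carrying the 0.42-frequency allele.
_, d_prime, r2 = ld_stats(0.23, 0.23, 0.42)
print(round(d_prime, 3), round(r2, 2))
```

The second call shows why the text treats a high D′ with a modest r² as a frequency mismatch: when the two alleles differ in frequency, r² stays well below 1 even if the rarer allele is never separated from its partner.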

between the TASs and TMEM18 and is antisense relative to TMEM18. We determined the allele frequency of the Alu insertion to be 0.12. It is in perfect (r² = 1, D′ = 1) or near-perfect (r² = 0.907) LD with each of the TASs and corresponds to the protective haplotype (Fig. 5A). Therefore, the polymorphic Alu element or a variant traveling on the same strand reduces obesity risk by an unknown mechanism.

We observed a similar scenario at a meningococcal disease risk locus on 1q31.3 (P = 5 × 10⁻¹³) (47) (Fig. 5B). The causative bacterium, Neisseria meningitidis, colonizes a portion of the population asymptomatically, but can cause sepsis and meningitis with high mortality rates in susceptible individuals (e.g., refs. 48, 49). Genes at 1q31.3 are involved in complement activation, a pathway that has been associated with meningococcal disease susceptibility (e.g., refs. 50, 51). However, the causative variant at this locus is unknown. The TAS (47) maps to an intron of complement factor H-related 3 (CFHR3) and ∼27 kb upstream of the start site for CFHR1 (Fig. 5B). We identified a 314-bp AluYb8 on the same strand as CFHR3, located 953 bp from the TAS. The TAS rs426736 and Alu element are in complete LD (r² = 1, D′ = 1). The TAS risk allele, T, and the Alu-containing allele are part of the same haplotype; the Alu insertion is on the risk haplotype.

Imputed Alu Variants Are Highly Associated with Disease Risk. To confirm that Alu tagged by TASs associate with disease phenotypes, we used available individual-level genotyping data from

[Fig. 5 image. (A) 2p25.3 (TMEM18): r² = 1.00 and r² = 0.91; haplotypes CCACCGA–Empty 0.87*, TGGATAT–Alu 0.12. (B) 1q31.3 (CFH, CFHR3, CFHR1): r² = 1.00; haplotypes T–Alu 0.17*, G–Empty 0.83. (C) 10q21.2 (ARID5B): r² = 0.83; haplotypes C–Alu–G 0.21*, T–Alu–T 0.02, T–Empty–T 0.74. (D) 8q24.21 (PCAT1, PCAT2, PRNCR1, CASC19, CASC21, CASC8, CCAT1, CCAT2, POU5F1B): r² = 1.00; haplotypes C–Alu–AT 0.03*, A–Empty–CC 0.97. Lower plots for C and D: log disease association (y axis) versus chromosome position (x axis; 63.2–63.6 Mb at 10q21.2 and 127.85–128.51 Mb at 8q24.21). Scale bars 10 kb.]

Fig. 5. Loci where polymorphic Alu elements are potential causative variants. LD plots show Alu insertions and neighboring SNPs with pairwise comparisons indicating variants in LD (red). (A) The 2p25.3 locus is associated with obesity (best P = 3 × 10⁻⁴⁹) (37). The Alu insertion variant (red) and TASs (blue) are annotated within the LD block (red horizontal line) downstream of the TMEM18 gene. The r² values between the Alu and TASs are shown to the lower left; phased haplotypes are shown to the lower right. Here, the preinsertion (empty) allele is the risk allele (*), and the Alu insertion segregates with the protective haplotype. (B) The 1q31.3 locus associated with meningococcal disease (P = 5 × 10⁻¹³) (47); the Alu is on the risk haplotype at CFHR3. (C) The 10q21.2 locus for precursor B-cell ALL; the Alu is on the risk haplotype. (D) The 8q24 locus for prostate cancer; the Alu is on the risk haplotype. For C and D, we imputed the Alu variant genotype for patients in the GWAS and controls to test the association between Alu genotype and disease. Graphs show the −log of the P value for disease association on the y axis; the genomic coordinate is plotted on the x axis. The polymorphic Alu in each case is highly associated with the disease (red diamond), comparable to proximal TASs (blue triangles).

two cancer GWASs. We imputed the Alu genotype of each study participant, and tested the genotype–phenotype association. For these two cancers, precursor B-cell (pre-B) ALL and prostate cancer, we could more directly ask how the Alu element behaves in relation to disease risk. Specifically, does the presence or absence of the Alu occur disproportionately in patients relative to controls?

Pre-B ALL is the most common childhood cancer. The inherited risk of this disease has been mapped to several loci, including the ARID5B locus (P < 10⁻¹⁹) (52–54). ARID5B encodes the AT-rich interaction domain (ARID) 5B (MRF1-like) protein that is a transcriptional regulator highly expressed in developing B cells. The TAS LD block is limited to an ∼34-kb region of the ARID5B gene, but the causative variant mapping to this region has not been identified. We identified a 168-bp antisense polymorphic AluYb8 element in the third intron of ARID5B (Fig. 5C). The Alu is in strong LD with previously identified TASs (rs7089424 and rs10821936; r² = 0.83, D′ = 1) and is on the risk haplotype (Fig. 5C). We obtained genotyping data for patients with pre-B ALL (54) and controls from the Framingham cohort (55). We imputed the Alu genotype for patients and controls and performed the association analysis. SNPs at the 5′ end of the gene are not associated with disease risk, whereas SNPs ∼20 kb upstream of the Alu are associated with disease, and the relationship with disease risk deteriorates quickly downstream. The Alu variant is at the pinnacle of the Manhattan plot, indicating that it is highly associated with disease risk and compares favorably with other previously reported TASs.

We also obtained individual-level genotyping data from prostate cancer GWAS samples to investigate the signal at 8q24 (Fig. 5D). This region has long been implicated in epithelial cancer risk, with the GWAS results for prostate, breast, ovarian, and mapping to five separate GWAS peaks (e.g., refs. 56–63). Four of these peaks are associated with prostate cancer (Fig. S4). Causative genes at this locus are unknown. Although several long intergenic noncoding RNAs at this locus may be up-regulated in prostate cancer (64), much emphasis has remained on this region regulating protein-coding genes (e.g., refs. 65, 66). To appreciate fully the mechanism that alters disease risk, functional variant(s) associated with each of these four LD blocks must be identified. At one of these regions, we found a 301-bp polymorphic AluYb8 element that is in perfect LD (r² = 1) with the TASs (57, 67, 68) that mark this location (Fig. 5D). The Alu has an allele frequency of 0.03 and is on the risk haplotype.

Payer et al. PNAS | Published online May 2, 2017 | E3989

Although we were unable to access all study components of the metaanalysis, we were able to recapitulate the signal to a lower significance level with publicly available data. As expected, given the complete LD between the Alu element and the TASs, the Alu and TASs localize to the peak of the Manhattan plot (Fig. 5D). The Alu is therefore as good a candidate as the previously identified SNP variants and remains a candidate for the functional variant at this locus.

Discussion

Although polymorphic interspersed repeats are a major source of structural variation in the genome, the functional effects of these sequences have not been systematically investigated. We leveraged the accelerated polymorphic retrotransposon discovery of recent years (e.g., refs. 9, 10, 21, 22, 36) to identify common polymorphic Alu elements that may alter disease risk, albeit with the modest effect sizes identified in GWAS. We focused on polymorphic elements near GWAS signals, where functional variants remain elusive. We provide an important resource of all reported polymorphic Alu elements mapping in these intervals (Dataset S3).

We identified numerous loci where the polymorphic Alu element is a candidate causative variant by association. Specifically, we found 44 Alu insertion polymorphisms in strong LD (r2 > 0.7) with the SNP(s) most associated with a disease phenotype (Table 1 and Dataset S3). Thus, at these loci, we find genetic evidence to associate these Alu variants with functional effects. Although Alu insertions are well recognized to cause genetic disease as rare mutations interrupting coding exons or disrupting splicing, our study indicates that they may also regularly operate as common variants affecting risk for common diseases. Only two common Alu insertion alleles have been previously implicated in human phenotypes (9, 32–34). We now report 20-fold more candidates.

Are these Alu insertions truly the functional variant at each of these loci? We do not want to overstate the functional role of any specific Alu insertion identified in this study. Experimental systems relevant to model effects detected in GWAS are challenging to develop and should rigorously assess multiple candidate causative variants at each locus after extensive variant discovery and phenotype association fine-mapping. We make the case that these efforts should be designed to consider Alu insertion variants. Alu variants may be inherently more likely than a typical SNP to have a functional consequence, given that each insertion creates a structural feature of about 300 bp. Perhaps the strongest evidence that Alu polymorphisms deserve special consideration as causal variants is the disproportionate co-occurrence of Alu variants at GWAS loci (P = 0.013). In total, we identified an unexpectedly high number (n = 625) of Alu variants mapping within TAS LD blocks (GWAS, P < 10−9) (Fig. 2). This enrichment of polymorphic Alu elements suggests that some are likely functional.

Further studies will be required to determine the extent of polymorphic Alu involvement in GWAS signals and the functional mechanism(s) responsible. De novo Alu insertions cause single-gene disease by interrupting coding sequences or disrupting splicing (8). For polymorphic Alu elements acting as common variants, molecular effects are expected to be more subtle and have less pronounced phenotypic consequence. Most of the Alu insertions in our study do not map to known coding or regulatory sequences or highly conserved sequences. As with GWAS intervals generally, the location of these variants does not immediately imply a mechanism. Interestingly, transposable elements may themselves carry regulatory sequences and distribute these sequences in the genome as “plug-and-play” gene regulators (e.g., refs. 69, 70). Although this model has not been proven for polymorphic Alu elements, it is supported for evolutionarily older elements that are fixed in the genome, including Alu elements that act as tissue-specific enhancers (e.g., refs. 71, 72). Fixed Alu can also provide alternatively used exons (e.g., refs. 73, 74).

The 44 potentially causal polymorphic Alu elements we describe here likely underrepresent the number of insertions in this category. Although we have assembled the most inclusive list of common polymorphic Alu elements to date, many more common variants are predicted to exist than have been mapped and reported (17, 18). Also, our phasing of these Alu insertions with surrounding SNPs to discern haplotypes focused on common Alu insertion alleles in the European population. Inclusion of more diverse human populations in discovery and targeted discoveries in patient populations will likely increase the number of candidate functional variants.

Materials and Methods

The Johns Hopkins University School of Medicine Institutional Review Board reviewed and approved our application for Framingham Heart Study data access (NA 00092855), and access was approved by the study data access committee.

Transposable Elements Mapping to GWAS Signals. TIP-chip (23) was used to map insertions (SI Materials and Methods). Previously reported polymorphic Alu elements were collected (10, 11, 21, 22, 24–27). The LD block for each GWAS signal (P ≤ 10−8) was defined by proxy SNPs to the TAS (r2 > 0.8) (SI Materials and Methods). Overlaps are reported in Dataset S3.

Enrichment of polymorphic Alu elements near GWAS signals was calculated by comparing the number of overlaps of polymorphic Alu elements and GWAS LD blocks with the number of overlaps of polymorphic Alu elements and 1,000 randomized sets of LD blocks. All reported GWAS signals (P ≤ 10−9) were reduced to a list of 3,242 nonoverlapping TAS LD blocks. To generate 1,000 sets of random LD blocks (3,242 blocks per set) with similar characteristics to these GWAS LD blocks, the characteristics of the TASs anchoring GWAS LD blocks were used to select random SNPs. Specifically, the random SNPs had allele frequencies (within bins of 5%) and distances to the nearest gene (within gene, <10 kb, or >10 kb) matching those parameters of the TAS. These sets of 3,242 random SNPs were used to define 1,000 new sets of LD blocks (3,242 LD blocks per iteration). We recorded the number of times that known polymorphic Alu elements map to each set of 3,242 random LD blocks and compared the observed value (the number of times that the same elements map to TAS LD blocks) with these expected values. Polymorphic Alu elements and LD blocks mapping to the sex chromosomes and HLA locus were excluded to eliminate any bias owing to unequal ascertainment at these regions or the large intervals of LD at HLA.

Genotyping of Alu Elements and LD Analysis. To focus on Alu elements that are common polymorphisms, we conducted an initial screen in pooled DNA samples. Detected polymorphic Alu elements were genotyped in a 30-trio reference panel of CEU HapMap samples by PCR (SI Materials and Methods). LD between the Alu variant and TAS was defined as r2 ≥ 0.4 and D′ ≥ 0.8, with particular emphasis placed on those variants in stronger LD with r2 ≥ 0.7 (SI Materials and Methods and Dataset S3).

ACKNOWLEDGMENTS. We thank Drs. Wenjian Yang and Mary Relling (St. Jude Children’s Research Hospital) for genotyping data; Mary Relling, John Moran (University of Michigan), David Valle, Aravinda Chakravarti, Haig Kazazian, and members of the K.H.B. laboratory for helpful discussion of this project and review of the manuscript; and Tim Babatz, Emily Robinson, Allison Moyer, Hannah Bogen, Nicholas Frisco, Reona Kimura, and Tianqi (Nina) Luo for technical assistance. This work was funded by National Heart, Lung, and Blood Institute Grant T32HL007525; a Burroughs Wellcome Fund Career Award for Biomedical Scientists Program (to K.H.B.); US NIH Awards R01CA163705 (to K.H.B.) and R01GM103999 (to K.H.B.), as well as Center for Systems Biology of Retrotransposition Grant P50GM107632 (to K.H.B. and J.D.B.).
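The enrichment procedure described in Materials and Methods (matching random SNPs to each TAS by allele-frequency bin and gene-distance class, counting Alu overlaps with each random set of LD blocks, and comparing with the observed count) can be sketched as follows. This is an illustration under stated assumptions rather than the analysis code used in the study: all function names, the dictionary keys `af` and `gene_dist`, the tuple layouts, and the +1 correction in the empirical P value are hypothetical choices, and the step that expands each random SNP into an LD block via proxy SNPs is not shown.

```python
import random


def af_bin(allele_freq):
    """Index of the 5% allele-frequency bin (0-5% -> 0, 5-10% -> 1, ...)."""
    # Small epsilon guards against float error at bin edges (e.g., 0.15 * 100).
    return int(allele_freq * 100 + 1e-9) // 5


def dist_class(gene_dist_bp):
    """Distance-to-nearest-gene class; a distance of 0 is taken to mean 'within gene'."""
    if gene_dist_bp == 0:
        return "within gene"
    # Assumption: exactly 10 kb is assigned to the '>10 kb' class.
    return "<10 kb" if gene_dist_bp < 10_000 else ">10 kb"


def match_key(snp):
    """Characteristics on which a random SNP must match a TAS."""
    return (af_bin(snp["af"]), dist_class(snp["gene_dist"]))


def sample_matched_snps(tas_list, snp_pool, rng):
    """Draw one random SNP per TAS with the same frequency bin and distance class.

    Raises KeyError if the pool has no SNP matching a given TAS.
    """
    by_key = {}
    for snp in snp_pool:
        by_key.setdefault(match_key(snp), []).append(snp)
    return [rng.choice(by_key[match_key(tas)]) for tas in tas_list]


def count_overlaps(insertions, blocks):
    """Count insertions (chrom, pos) falling inside any block (chrom, start, end)."""
    return sum(
        any(c == chrom and start <= pos <= end for c, start, end in blocks)
        for chrom, pos in insertions
    )


def empirical_p(observed, null_counts):
    """One-sided empirical P: share of random sets with at least the observed overlap.

    The +1 in numerator and denominator is a common convention, assumed here.
    """
    exceed = sum(1 for c in null_counts if c >= observed)
    return (exceed + 1) / (len(null_counts) + 1)
```

In use, `sample_matched_snps` would be called once per iteration (1,000 times in the design described above) with a seeded `random.Random` instance, each sampled set would be converted to LD blocks, and the resulting overlap counts would be passed to `empirical_p` together with the observed count. The linear scan in `count_overlaps` keeps the sketch short; an interval index would be preferable at genome scale.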

1. Lander ES, et al.; International Human Genome Sequencing Consortium (2001) Initial sequencing and analysis of the human genome. Nature 409:860–921.
2. Kellis M, et al. (2014) Defining functional DNA elements in the human genome. Proc Natl Acad Sci USA 111:6131–6138.
3. Smit AFA, Hubley R, Green P (2015) RepeatMasker Open-4.0. Available at www.repeatmasker.org. Accessed April 26, 2017.
4. Kazazian HH, Jr, et al. (1988) Haemophilia A resulting from de novo insertion of L1 sequences represents a novel mechanism for mutation in man. Nature 332:164–166.

E3990 | www.pnas.org/cgi/doi/10.1073/pnas.1704117114 Payer et al.

5. Sukarova E, Dimovski AJ, Tchacarova P, Petkov GH, Efremov GD (2001) An Alu insert as the cause of a severe form of hemophilia A. Acta Haematol 106:126–129.
6. Ganguly A, Dunbar T, Chen P, Godmilow L, Ganguly T (2003) Exon skipping caused by an intronic insertion of a young Alu Yb9 element leads to severe hemophilia A. Hum Genet 113:348–352.
7. Green PM, Bagnall RD, Waseem NH, Giannelli F (2008) Haemophilia A mutations in the UK: Results of screening one-third of the population. Br J Haematol 143:115–128.
8. Hancks DC, Kazazian HH, Jr (2016) Roles for retrotransposon insertions in human disease. Mob DNA 7:9.
9. Sudmant PH, et al.; 1000 Genomes Project Consortium (2015) An integrated map of structural variation in 2,504 human genomes. Nature 526:75–81.
10. Hormozdiari F, et al. (2011) Alu repeat discovery and characterization within human genomes. Genome Res 21:840–849.
11. Stewart C, et al.; 1000 Genomes Project (2011) A comprehensive map of mobile element insertion polymorphisms in humans. PLoS Genet 7:e1002236.
12. Mathias SL, Scott AF, Kazazian HH, Jr, Boeke JD, Gabriel A (1991) Reverse transcriptase encoded by a human transposable element. Science 254:1808–1810.
13. Moran JV, et al. (1996) High frequency retrotransposition in cultured mammalian cells. Cell 87:917–927.
14. Dewannieux M, Esnault C, Heidmann T (2003) LINE-mediated retrotransposition of marked Alu sequences. Nat Genet 35:41–48.
15. Raiz J, et al. (2012) The non-autonomous retrotransposon SVA is trans-mobilized by the human LINE-1 protein machinery. Nucleic Acids Res 40:1666–1683.
16. Hancks DC, Goodier JL, Mandal PK, Cheung LE, Kazazian HH, Jr (2011) Retrotransposition of marked SVA elements by human L1s in cultured cells. Hum Mol Genet 20:3386–3400.
17. Ewing AD, Kazazian HH, Jr (2010) High-throughput sequencing reveals extensive variation in human-specific L1 content in individual human genomes. Genome Res 20:1262–1270.
18. Watterson GA (1975) On the number of segregating sites in genetical models without recombination. Theor Popul Biol 7:256–276.
19. Deininger PL, Moran JV, Batzer MA, Kazazian HH, Jr (2003) Mobile elements and mammalian genome evolution. Curr Opin Genet Dev 13:651–658.
20. Xing J, et al. (2009) Mobile elements create structural variation: Analysis of a complete human genome. Genome Res 19:1516–1526.
21. Wang J, et al. (2006) dbRIP: A highly integrated database of retrotransposon insertion polymorphisms in humans. Hum Mutat 27:323–329.
22. Witherspoon DJ, et al. (2013) Mobile element scanning (ME-Scan) identifies thousands of novel Alu insertions in diverse human populations. Genome Res 23:1170–1181.
23. Huang CR, et al. (2010) Mobile interspersed repeats are major structural variants in the human genome. Cell 141:1171–1182.
24. Shukla R, et al. (2013) Endogenous retrotransposition activates oncogenic pathways in hepatocellular carcinoma. Cell 153:101–111.
25. Lee E, et al.; Cancer Genome Atlas Research Network (2012) Landscape of somatic retrotransposition in human cancers. Science 337:967–971.
26. Witherspoon DJ, et al. (2010) Mobile element scanning (ME-Scan) by targeted high-throughput sequencing. BMC Genomics 11:410.
27. Iskow RC, et al. (2010) Natural mutagenesis of human genomes by endogenous retrotransposons. Cell 141:1253–1261.
28. Witherspoon DJ, et al. (2009) Alu repeats increase local recombination rates. BMC Genomics 10:530.
29. Frazer KA, et al.; International HapMap Consortium (2007) A second generation human haplotype map of over 3.1 million SNPs. Nature 449:851–861.
30. Naj AC, et al. (2013) Genetic factors in nonsmokers with age-related macular degeneration revealed through genome-wide gene-environment interaction analysis. Ann Hum Genet 77:215–231.
31. Charles BA, et al. (2011) A genome-wide association study of serum uric acid in African Americans. BMC Med Genomics 4:17.
32. Rigat B, et al. (1990) An insertion/deletion polymorphism in the angiotensin I-converting enzyme gene accounting for half the variance of serum enzyme levels. J Clin Invest 86:1343–1346.
33. Tiret L, et al. (1992) Evidence, from combined segregation and linkage analysis, that a variant of the angiotensin I-converting enzyme (ACE) gene controls plasma ACE levels. Am J Hum Genet 51:197–205.
34. Chung CM, et al. (2010) A genome-wide association study identifies new loci for ACE activity: Potential implications for response to ACE inhibitor. Pharmacogenomics J 10:537–544.
35. Hindorff LA, et al. (2009) Potential etiologic and functional implications of genome-wide association loci for human diseases and traits. Proc Natl Acad Sci USA 106:9362–9367.
36. Konkel MK, et al.; 1000 Genomes Consortium (2015) Sequence analysis and characterization of active human Alu subfamilies based on the 1000 Genomes Pilot Project. Genome Biol Evol 7:2608–2622.
37. Speliotes EK, et al.; MAGIC; Procardis Consortium (2010) Association analyses of 249,796 individuals reveal 18 new loci associated with body mass index. Nat Genet 42:937–948.
38. Speakman JR (2013) Functional analysis of seven genes linked to body mass index and adiposity by genome-wide association studies: A review. Hum Hered 75:57–79.
39. Volckmar AL, et al. (2016) Analysis of genes involved in body weight regulation by targeted re-sequencing. PLoS One 11:e0147904.
40. Rask-Andersen M, et al. (2015) Determination of obesity associated gene variants related to TMEM18 through ultra-deep targeted re-sequencing in a case-control cohort for pediatric obesity. Genet Res 97:e16.
41. Liu CT, et al. (2014) Sequence variation in TMEM18 in association with body mass index: Cohorts for Heart and Aging Research in Genomic Epidemiology (CHARGE) Consortium Targeted Sequencing Study. Circ Cardiovasc Genet 7:344–349.
42. Berndt SI, et al. (2013) Genome-wide meta-analysis identifies 11 new loci for anthropometric traits and provides insights into genetic architecture. Nat Genet 45:501–512.
43. Graff M, et al.; GIANT Consortium (2013) Genome-wide analysis of BMI in adolescents and young adults reveals additional insight into the effects of genetic loci over the life course. Hum Mol Genet 22:3597–3607.
44. Wheeler E, et al. (2013) Genome-wide SNP and CNV analysis identifies common and low-frequency variants associated with severe early-onset obesity. Nat Genet 45:513–517.
45. Willer CJ, et al.; Wellcome Trust Case Control Consortium; Genetic Investigation of Anthropometric Traits Consortium (2009) Six new loci associated with body mass index highlight a neuronal influence on body weight regulation. Nat Genet 41:25–34.
46. Thorleifsson G, et al. (2009) Genome-wide association yields new sequence variants at seven loci that associate with measures of obesity. Nat Genet 41:18–24.
47. Davila S, et al.; International Meningococcal Genetics Consortium (2010) Genome-wide association study identifies variants in the CFH region associated with host susceptibility to meningococcal disease. Nat Genet 42:772–776.
48. Emonts M, Hazelzet JA, de Groot R, Hermans PW (2003) Host genetic determinants of Neisseria meningitidis infections. Lancet Infect Dis 3:565–577.
49. Haralambous E, et al. (2003) Sibling familial risk ratio of meningococcal disease in UK Caucasians. Epidemiol Infect 130:413–418.
50. Schneider MC, et al. (2009) Neisseria meningitidis recruits factor H using protein mimicry of host carbohydrates. Nature 458:890–893.
51. Brouwer MC, et al. (2009) Host genetic susceptibility to pneumococcal and meningococcal disease: A systematic review and meta-analysis. Lancet Infect Dis 9:31–44.
52. Xu H, et al. (2013) Novel susceptibility variants at 10p12.31-12.2 for childhood acute lymphoblastic leukemia in ethnically diverse populations. J Natl Cancer Inst 105:733–742.
53. Papaemmanuil E, et al. (2009) Loci on 7p12.2, 10q21.2 and 14q11.2 are associated with risk of childhood acute lymphoblastic leukemia. Nat Genet 41:1006–1010.
54. Treviño LR, et al. (2009) Germline genomic variants associated with childhood acute lymphoblastic leukemia. Nat Genet 41:1001–1005.
55. Dawber TR, Meadors GF, Moore FE, Jr (1951) Epidemiological approaches to heart disease: The Framingham Study. Am J Public Health Nations Health 41:279–281.
56. Easton DF, et al.; SEARCH collaborators; kConFab; AOCS Management Group (2007) Genome-wide association study identifies novel breast cancer susceptibility loci. Nature 447:1087–1093.
57. Gudmundsson J, et al. (2007) Genome-wide association study identifies a second prostate cancer susceptibility variant at 8q24. Nat Genet 39:631–637.
58. Haiman CA, et al. (2007) Multiple regions within 8q24 independently affect risk for prostate cancer. Nat Genet 39:638–644.
59. Haiman CA, et al. (2007) A common genetic risk factor for colorectal and prostate cancer. Nat Genet 39:954–956.
60. Schumacher FR, et al. (2007) A common 8q24 variant in prostate and breast cancer from a large nested case-control study. Cancer Res 67:2951–2956.
61. Tomlinson IP, et al.; CORGI Consortium; EPICOLON Consortium (2008) A genome-wide association study identifies colorectal cancer susceptibility loci on chromosomes 10p14 and 8q23.3. Nat Genet 40:623–630.
62. Yeager M, et al. (2007) Genome-wide association study of prostate cancer identifies a second risk locus at 8q24. Nat Genet 39:645–649.
63. Ghoussaini M, et al.; UK Genetic Prostate Cancer Study Collaborators/British Association of Urological Surgeons’ Section of Oncology; UK Protect Study Collaborators (2008) Multiple loci with different cancer specificities within the 8q24 gene desert. J Natl Cancer Inst 100:962–966.
64. Bawa P, et al. (2015) Integrative analysis of normal long intergenic non-coding RNAs in prostate cancer. PLoS One 10:e0122143.
65. Pomerantz MM, et al. (2009) The 8q24 cancer risk variant rs6983267 shows long-range interaction with MYC in colorectal cancer. Nat Genet 41:882–884.
66. Meyer KB, et al. (2011) A functional variant at a prostate cancer predisposition locus at 8q24 is associated with PVT1 expression. PLoS Genet 7:e1002165.
67. Gudmundsson J, et al. (2009) Genome-wide association and replication studies identify four variants associated with prostate cancer susceptibility. Nat Genet 41:1122–1126.
68. Cheng I, et al. (2012) Evaluating genetic risk for prostate cancer among Japanese and Latinos. Cancer Epidemiol Biomarkers Prev 21:2048–2058.
69. Feschotte C (2008) Transposable elements and the evolution of regulatory networks. Nat Rev Genet 9:397–405.
70. Sundaram V, et al. (2014) Widespread contribution of transposable elements to the innovation of gene regulatory networks. Genome Res 24:1963–1976.
71. Jacobsen BM, Jambal P, Schittone SA, Horwitz KB (2009) ALU repeats in promoters are position-dependent co-response elements (coRE) that enhance or repress transcription by dimeric and monomeric progesterone receptors. Mol Endocrinol 23:989–1000.
72. Romanish MT, Nakamura H, Lai CB, Wang Y, Mager DL (2009) A novel protein isoform of the multicopy human NAIP gene derives from intragenic Alu SINE promoters. PLoS One 4:e5761.
73. Gal-Mark N, Schwartz S, Ast G (2008) Alternative splicing of Alu exons–two arms are better than one. Nucleic Acids Res 36:2012–2023.
74. Sorek R, Ast G, Graur D (2002) Alu-containing exons are alternatively spliced. Genome Res 12:1060–1067.
75. Welter D, et al. (2014) The NHGRI GWAS Catalog, a curated resource of SNP-trait associations. Nucleic Acids Res 42:D1001–D1006.

76. Carroll ML, et al. (2001) Large-scale analysis of the Alu Ya5 and Yb8 subfamilies and their contribution to human genomic diversity. J Mol Biol 311:17–40.
77. Johnson AD, et al. (2008) SNAP: A web-based tool for identification and annotation of proxy SNPs using HapMap. Bioinformatics 24:2938–2939.
78. Afgan E, et al. (2016) The Galaxy platform for accessible, reproducible and collaborative biomedical analyses: 2016 update. Nucleic Acids Res 44:W3–W10.
79. Pruitt KD, et al. (2014) RefSeq: An update on mammalian reference sequences. Nucleic Acids Res 42:D756–D763.
80. Quinlan AR, Hall IM (2010) BEDTools: A flexible suite of utilities for comparing genomic features. Bioinformatics 26:841–842.
81. Benjamini Y, Hochberg Y (1995) Controlling the false discovery rate: A practical and powerful approach to multiple testing. J R Statist Soc B 57:289–300.
82. Barrett JC, Fry B, Maller J, Daly MJ (2005) Haploview: Analysis and visualization of LD and haplotype maps. Bioinformatics 21:263–265.
83. Purcell S, et al. (2007) PLINK: A tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet 81:559–575.
84. Delaneau O, Marchini J, Zagury JF (2011) A linear complexity phasing method for thousands of genomes. Nat Methods 9:179–181.
85. Howie BN, Donnelly P, Marchini J (2009) A flexible and accurate genotype imputation method for the next generation of genome-wide association studies. PLoS Genet 5:e1000529.
