bioRxiv preprint doi: https://doi.org/10.1101/495226; this version posted August 9, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

1 Multi-tissue analysis reveals short tandem repeats as ubiquitous 2 regulators of expression and complex traits 3 4 Stephanie Feupe Fotsing1,2, Jonathan Margoliash3, Catherine Wang4, Shubham Saini3, Richard 5 5 5,# 3,5,# 5 Yanicky , Sharona Shleizer-Burko , Alon Goren & Melissa Gymrek 6

1 7 Biomedical Informatics and Systems Biology, University of California, San Diego, La Jolla, CA, 8 USA 9 2 Bioinformatics and Systems Biology Program, University of California San Diego, La Jolla, CA 10 USA 11 3 Department of Computer Science and Engineering, University of California San Diego, La Jolla, 12 CA USA 13 4 Department of Bioengineering, University of California San Diego, La Jolla, CA USA 14 5 Department of Medicine, University of California San Diego, La Jolla, CA USA

15 16 # Correspondence should be addressed to [email protected] or [email protected]. 17 18 bioRxiv preprint doi: https://doi.org/10.1101/495226; this version posted August 9, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

19 Abstract 20 Short tandem repeats (STRs) have been implicated in a variety of complex traits in humans. 21 However, genome-wide studies of the effects of STRs on gene expression thus far have had 22 limited power to detect associations and provide insights into putative mechanisms. Here, we 23 leverage whole genome sequencing and expression data for 17 tissues from the Genotype-Tissue 24 Expression Project (GTEx) to identify STRs for which repeat number is associated with 25 expression of nearby (eSTRs). Our analysis reveals more than 28,000 eSTRs. We employ 26 fine-mapping to quantify the probability that each eSTR is causal and characterize a group of the 27 top 1,400 fine-mapped eSTRs. We identify hundreds of eSTRs linked with published GWAS 28 signals and implicate specific eSTRs in complex traits including height and schizophrenia, 29 inflammatory bowel disease, and intelligence. Overall, our results support the hypothesis that 30 eSTRs contribute to a range of human phenotypes and will serve as a valuable resource for future 31 studies of complex traits. bioRxiv preprint doi: https://doi.org/10.1101/495226; this version posted August 9, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

32 Introduction 33 Genome-wide association studies (GWAS) have identified thousands of genetic loci associated 34 with complex traits1, but determining the causal variants, target genes, and biological mechanisms 35 responsible for each signal has proven challenging. The vast majority of GWAS signals lie in non- 36 coding regions2 which are difficult to interpret. Expression quantitative trait loci (eQTL) studies 37 attempt to link regulatory genetic variation to gene expression changes as a potential molecular 38 intermediate that drives biological aberrations leading to disease3. Indeed, recent studies have 39 utilized eQTL catalogs to pinpoint causal genes and relevant tissues for a variety of traits4-6. 40 41 An additional major challenge in interpreting GWAS is that lead variants are rarely causal 42 themselves, but rather tag a set of candidate variants in linkage disequilibrium (LD). A variety of 43 statistical fine-mapping techniques have been developed to identify the most likely causal variant7- 44 9 considering factors such as summary association statistics, LD information, and functional 45 annotations. However, these methods have been limited by focusing on bi-allelic single nucleotide 46 polymorphisms (SNPs) or short indels. On the other hand, multiple recent studies to dissect 47 GWAS loci have found complex repetitive6,10 and structural variants11-13 to be the underlying 48 causal variants, highlighting the need to consider additional variant classes. 49 50 Short tandem repeats (STRs), consisting of consecutively repeated units of 1-6bp, represent a 51 large source of genetic variation. STR mutation rates are orders of magnitude higher than SNPs14 52 and short indels15 and each individual is estimated to harbor around 100 de novo mutations in 53 STRs16. Expansions at several dozen STRs have been known for decades to cause Mendelian 54 disorders17 including Huntington’s Disease and hereditary ataxias. Importantly, these pathogenic 55 STRs represent a small minority of the more than 1.5 million STRs in the human genome18. Due 56 to bioinformatics challenges of analyzing repetitive regions, many STRs are often filtered from 57 genome-wide studies19. However, increasing evidence supports a widespread role of common 58 variation at STRs in complex traits such as gene expression20-23. 59 60 STRs may regulate gene expression through a variety of mechanisms24. For example, the CCG 61 repeat implicated in Fragile X Syndrome was shown to disrupt DNA methylation, altering 62 expression of FMR125. Yeast studies have demonstrated that homopolymer repeats act as 63 nucleosome positioning signals with downstream regulatory effects26,27. Dinucleotide repeats may 64 alter affinity of nearby DNA binding sites28. Furthermore, certain STR repeat units may form non- bioRxiv preprint doi: https://doi.org/10.1101/495226; this version posted August 9, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

65 canonical DNA and RNA secondary structures such as G-quadruplexes29, R-loops30, and Z- 66 DNA31. 67 68 We previously performed a genome-wide analysis to identify more than 2,000 STRs for which the 69 number of repeats were associated with expression of nearby genes20, termed expression STRs 70 (eSTRs). However, the quality of the datasets available for this study reduced power to detect 71 associations and prevented accurate fine-mapping of individual eSTRs. First, STR genotypes 72 were based on low coverage (4-6x) whole genome sequencing data performed using short reads 73 (50-100bp) which are unable to span across many STRs. As a result, individual STR genotype 74 calls exhibited poor quality with less than 50% genotyping accuracy18. Second, the study was 75 based on a single cell-type (lymphoblastoid cell lines; LCLs) with potentially limited relevance to 76 most complex traits32. While our and other studies20,22 demonstrated that eSTRs explain a 77 significant portion (10-15%) of the cis heritability of gene expression, the resulting eSTR catalogs 78 were not powered to causally implicate eSTRs over other nearby variants. 79 80 Here, we leverage deep whole genome sequencing (WGS) and expression data collected by the 81 Genotype-Tissue Expression Project (GTEx)33 to map eSTRs in 17 tissues. Our analysis reveals 82 more than 28,000 unique eSTRs. We employ fine-mapping to quantify the probability of causality 83 of each eSTR and characterize the top 1,400 (top 5%) fine-mapped eSTRs. We additionally 84 identify hundreds of eSTRs that are in strong LD with published GWAS signals and implicate 85 specific eSTRs in multiple complex traits including height, schizophrenia inflammatory bowel 86 disease, and intelligence. To further validate our findings, we employ available GWAS data to 87 demonstrate evidence of a causal link between an eSTR for RFT1 and height and use a reporter 88 assay to experimentally validate an effect of this STR on expression. Finally, our eSTR catalog is 89 publicly available and can provide a valuable resource for future studies of complex traits. 90 91 Results 92 93 Profiling expression STRs across 17 human tissues 94 We performed a genome-wide analysis to identify associations between the number of repeats in 95 each STR and expression of nearby genes (expression STRs, or “eSTRs”, which we use to refer 96 to a unique STR by gene association). We focused on 652 individuals from the GTEx33 dataset 97 for which both high coverage WGS and RNA-sequencing of multiple tissues were available (Fig. bioRxiv preprint doi: https://doi.org/10.1101/495226; this version posted August 9, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

98 1a). We used HipSTR34 to genotype STRs in each sample. After filtering low quality calls 99 (Methods), 175,226 STRs remained for downstream analysis. To identify eSTRs, we performed 100 a linear regression between average STR length and normalized gene expression for each 101 individual at each STR within 100kb of a gene, controlling for sex, population structure, and 102 technical covariates (Methods, Supplementary Figs. 1-3). Analysis was restricted to 17 tissues 103 where we had data for at least 100 samples (Supplementary Table 1, Methods) and to genes 104 with median RPKM greater than 0. Altogether, we performed an average of 262,593 STR-gene 105 tests across 15,840 -coding genes per tissue. 106 107 Using this approach, we identified 28,375 unique eSTRs associated with 12,494 genes in at least 108 one tissue at a gene-level FDR of 10% (Fig. 1b, Supplementary Table 1, Supplementary 109 Dataset 1). The number of eSTRs detected per tissue correlated with sample size as expected 110 (Pearson r=0.75; p=0.00059; n=17), with the smallest number of eSTRs detected in the two brain 111 tissues presumably due to their low sample sizes (Supplementary Fig. 4). Notably, although 112 whole blood and skeletal muscle had the highest number of samples, we identified fewer eSTRs 113 in those tissues than in others with lower sample sizes. This finding is concordant with previous 114 results for SNPs in this cohort33 and may reflect higher cell-type heterogeneity in these tissue 115 samples. eSTR effect sizes previously measured in LCLs were significantly correlated with effect 116 sizes in all GTEx tissues (p<0.01 for all tissues, mean Pearson r=0.45). We additionally examined 117 previously reported eSTRs35-42 that were mostly identified using in vitro constructs. Six of eight 118 examples were significant eSTRs in GTEx (p<0.01) in at least one tissue analyzed 119 (Supplementary Table 2). 120 121 eSTRs identified above could potentially be explained by tagging nearby causal variants such as 122 single nucleotide polymorphisms (SNPs). To identify potentially causal eSTRs we employed 123 CAVIAR7, a statistical fine-mapping framework for identifying causal variants. CAVIAR models 124 the relationship between LD-structure and association scores of local variants to quantify the 125 posterior probability of causality for each variant (which we refer to as the CAVIAR score). We 126 used CAVIAR to fine-map eSTRs against all SNPs nominally associated (p<0.05) with each gene 127 under our model (Methods, Fig. 1a). On average across tissues, 12.2% of eSTRs had the highest 128 causality scores of all variants tested. 129 130 We ranked eSTRs by the best CAVIAR score across tissues and chose the top 5% (best CAVIAR 131 score>0.3) for downstream analysis. We hereby refer to this group as fine-mapped eSTRs (FM- bioRxiv preprint doi: https://doi.org/10.1101/495226; this version posted August 9, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

132 eSTRs) (Supplementary Table 1, Supplementary Dataset 2). Expected gene annotations are 133 more strongly enriched in this subset of eSTRs compared to the entire set (Supplementary Fig. 134 5), and stricter thresholds reduced power to detect eSTR-enriched features described below. Of 135 FM-eSTRs in each tissue, on average 78% explained more gene expression variation beyond 136 that of the best SNP (ANOVA q<0.1). Furthermore, on average each FM-eSTR had CAVIAR 137 score 0.41 higher (41% higher posterior probability) than the top-scoring eSNP (Supplementary 138 Fig. 6). Multiple STRs with known disease implications were captured by this list (Fig. 1c). In 139 many cases, FM-eSTRs show clear relationships between the number of repeats and gene 140 expression across a wide range of repeat lengths (Supplementary Fig. 7). 141 142 We next analyzed sharing of eSTRs (defined by a unique STR-gene pair) across tissues. To 143 minimize power differences across tissues and enable cross-tissue comparisons of eSTR effects, 144 we applied multivariate adaptive shrinkage (mash43) (Fig. 1a). Mash takes as input effect sizes 145 and standard errors as computed above and recomputes posterior estimates of each while 146 considering global correlations of effect sizes across tissues. We computed correlations of mash 147 effect sizes for FM-eSTRs across all pairs of tissues (Fig. 1d) and recovered previously observed 148 relationships43. For example, tissues with similar origins (e.g., Adipose-Visceral/Adipose- 149 Subcutaneous) are highly concordant, whereas Whole Blood effects are less correlated with other 150 tissues. These results are also supported by replication between single-tissue eSTRs using 151 unadjusted effect sizes (Supplementary Fig. 8). We further examined tissue sharing of FM- 152 eSTRs by counting the number of tissues for which mash computed a posterior Z-score with 153 absolute value >4. Most eSTRs are either shared across all tissues analyzed or are shared by 154 only a small number of tissues (Supplementary Fig. 9), again similar to previously reported eSNP 155 analyses in this cohort33. 156 157 FM-eSTRs demonstrate unique genomic characteristics and sequence features 158 We next sought to characterize properties of STRs that might provide insights into their biological 159 function. We reasoned that characteristics such as genomic localization, sequence features, and 160 direction of effects that distinguish FM-eSTRs from all analyzed STRs would support the 161 hypothesis that a subset of them are acting as causal variants. We first considered whether the 162 localization of FM-eSTRs differed from that of STRs overall (Fig. 2a-b, Supplementary Fig. 10). 163 Overall, the majority of FM-eSTRs occur in intronic or intergenic regions, and only 11 FM-eSTRs 164 fall in coding exons (Supplementary Table 3). However, compared to all STRs, those closest to 165 TSSs and near DNAseI HS sites were more likely to act as FM-eSTRs (Fig. 2c-d). FM-eSTRs bioRxiv preprint doi: https://doi.org/10.1101/495226; this version posted August 9, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

166 are strongly enriched at 5’ UTRs (OR=5.0; Fisher’s two-sided p=4.9e-13), 3’ UTRs (OR=2.78; 167 p=5.85e-10), and within 3kb of transcription start sites (OR=3.39; p=1.10e-46). These 168 enrichments are considerably stronger for FM-eSTRs compared to all eSTRs (Supplementary 169 Table 4), suggesting as expected that FM-eSTRs are more likely to be causal. 170 171 To further explore characteristic of FM-eSTRs in regulatory regions, we examined nucleosome 172 occupancy in the lymphoblastoid cell line GM12878 and DNA accessibility measured by DNAseI- 173 seq in a variety of cell types within 500bp of FM-eSTRs (Supplementary Fig. 11). As expected 174 from previous studies44, regions near homopolymer repeats are strongly nucleosome-depleted. 175 Notably, STRs with other repeat lengths showed distinct patterns of nucleosome positioning 176 (Supplementary Fig. 11a-c). Nucleosome occupancy is broadly similar for FM-eSTRs compared 177 to all STRs. FM-eSTRs are generally located in regions with higher DNAseI-seq read count 178 compared to non-eSTRs (Mann-Whitney [MW] two-sided p=3.9e-37 in GM12878; 179 Supplementary Fig. 11d-e). DNAseI hypersensitivity around homopolymer FM-eSTRs shows a 180 periodic pattern in GM12878 and other tissue types, with notable peaks located at multiples of 181 147bp upstream of downstream from the STR (Supplementary Fig. 11d). Given that 147bp is 182 the length of DNA typically wrapped around a single nucleosome44, we hypothesize that a subset 183 of homopolymer FM-eSTRs may act by shifting nucleosome positions and thus modulating 184 openness of adjacent hypersensitive sites. 185 186 We next examined the sequence characteristics of FM-eSTRs compared to all STRs. We tested 187 FM-eSTRs combined across all tissues for enrichment of each canonical STR repeat unit (defined 188 lexicographically, see Methods). FM-eSTRs are most strongly enriched for repeats with GC-rich 189 repeat units (Fig. 2e, Supplementary Table 5). For example, the canonical repeat units 190 CCCCGG, CCCCCG, and CCG are 22, 13, and 7-fold enriched in FM-eSTRs compared to all 191 STRs respectively. Notably, the total lengths of FM-eSTRs are significantly higher compared to 192 all STRs analyzed (MW two-sided p=0.00032 and p=2.4e-10 when comparing total repeat number 193 and total length in bp in hg19, respectively). 194 195 We next examined effect sizes biases in FM-eSTR associations. Overall, FM-eSTRs are equally 196 likely to show positive vs. negative correlations between repeat length and expression 197 (Supplementary Fig. 12; two-sided binomial p=0.94). We additionally observed that FM-eSTRs

198 with repeat units of the form (AnC/GnT) show strand-specific effects when in or near transcribed 199 regions. Transcribed FM-eSTRs are more likely to have the T-rich version of the repeat unit on bioRxiv preprint doi: https://doi.org/10.1101/495226; this version posted August 9, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

200 the template strand (two-sided binomial p=0.0015). Further, compared to A-rich repeat units on 201 the template strand, T-rich FM-eSTRs tend to have more positive effect sizes, with the most 202 notable differences for AC vs. GT repeats. These patterns are observed in transcribed regions 203 across multiple distinct repeat types (A/T, AC/GT, AAC/GTT, AAAC/GGGT) but are not present 204 in intergenic regions (Fig. 2f). 205 206 Finally, we wondered whether eSTRs might exhibit distinct characteristics in different tissues. We 207 clustered tissue-specific Z-scores (absolute value) for each FM-eSTR calculated jointly across 208 tissues by mash (Methods) to identify eight categories of FM-eSTR (Supplementary Fig. 13, 209 14). These include two clusters of FM-eSTRs present across many tissues (Clusters 2 and 8) as 210 well as several more tissue-specific clusters (e.g., Thyroid for Cluster 1). Notably, clusters do not 211 necessarily imply tissue specificity, but rather enrich for FM-eSTRs with particularly strong effects 212 in one or more tissues compared to others (Supplementary Fig. 14). More than 50% of all genes 213 in each cluster are expressed in all 17 tissues analyzed, and 88% of FM-eSTRs are shared by 214 more than one tissue (Supplementary Fig. 9). Clusters show similar repeat unit enrichment to 215 all FM-eSTRs and do not exhibit distinct enriched repeat units (Supplementary Fig. 15). We 216 further tested whether repeat units of FM-eSTRs are distributed uniformly across clusters. Only 217 one repeat unit (AAAAT) shows a suggestive non-uniform distribution across clusters (Chi- 218 squared p=0.018) with highest prevalence in the thyroid cluster. Similar results were achieved 219 using different numbers of clusters. Overall, our results suggest the majority of eSTRs act by 220 global mechanisms and do not implicate tissue-specific characteristics of FM-eSTRs. However, 221 low numbers of tissue-specific effects limit power to detect differences. Future work may provide 222 insights into potential tissue-specific mechanisms. 223 224 GC-rich eSTRs are predicted to modulate DNA and RNA secondary structure 225 FM-eSTRs are most strongly enriched for repeats with high GC content (e.g., canonical repeat

226 units CCG, CCCCG, CCCCCG, AGGGC) (Fig. 2e, Supplementary Table 5) which are found 227 almost exclusively in promoter regions (Supplementary Fig. 10). These GC-rich repeat units 228 have been shown to form highly stable secondary structures during transcription such as G4 229 quadruplexes in single-stranded DNA45 or RNA46 that may regulate gene expression. We 230 hypothesized that the effects of GC-rich eSTRs may be in part due to formation of non-canonical 231 nucleic acid secondary structures that modulate DNA or RNA stability as a function of repeat 232 number. We considered properties of two classes of GC-rich FM-eSTRs: (i) those following the 47 233 standard G4 motif (G3N1-7G3N1-7G3N1-7G3) and (ii) repeats with canonical repeat unit CCG which bioRxiv preprint doi: https://doi.org/10.1101/495226; this version posted August 9, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

234 does not meet the standard G4 definition. Notably, the majority of CCG FM-eSTRs (79%) occur 235 in 5’ UTRs compared to only 11% for G4 repeats. We observed that both classes of GC-rich 236 repeats are associated with higher RNAPII (Fig. 3a) and lower nucleosome occupancy (Fig. 3b) 237 compared to all STRs. The relationship with RNAPII was observed across a diverse range of cell 238 and tissue types (Supplementary Fig. 16). 239 240 To evaluate whether GC-rich repeats could be modulating DNA or RNA secondary structure, we 241 used mfold48 to calculate the free energy of each STR and 50bp of its surrounding context in 242 single stranded DNA or RNA. We considered all common allele lengths (number of repeats) 243 observed at each STR (Methods) and computed energies for both the template and non-template 244 strands. We then computed the correlation between the number of repeats and free energy at 245 each STR region. Overall, both G4 and CGG STRs have lower mean free energy (greater stability) 246 and more negative correlations between repeat number and free energy compared to all STRs 247 (Fig. 3c-f, Supplementary Fig. 17; adjusted MW one-sided p<0.05). Compared to all STRs, FM- 248 eSTRs tend to have lower free energy and more negative correlations with repeat number (MW 249 p<0.05 in all categories except for CGG STRs). Notably, both metrics (mean free energy and 250 correlation of repeat number vs. free energy) are significantly correlated with the total length of 251 STR in all cases (Pearson correlation p<0.01). FM-eSTRs tend to be longer than STRs overall 252 (see above), which may partially explain the secondary structure trends observed. 253 254 Based on previous observations49, we predicted that higher repeat numbers at GC-rich eSTRs 255 would result in greater DNA or RNA stability and in turn would increase expression of nearby 256 genes. To that end, we tested whether FM-eSTRs were biased toward negative vs. positive effect 257 sizes. As described above, overall FM-eSTRs show no bias in effect direction. However, when 258 considering only repeats in promoter regions (TSS +/- 3kb), 59% of FM-eSTRs have positive 259 effect sizes, significantly more than the 50% expected by chance (binomial two-sided p=0.04; 260 n=137). This effect was stronger when considering only G4 FM-eSTRs (87% positive effect sizes; 261 p=0.0074; n=15) but not significant for CCG FM-eSTRs (62% positive; n=13; p=0.58; Fig. 3g). 262 For multiple G4 FM-eSTRs, expression levels across allele lengths follow an inverse relationship 263 with free energy (Fig. 3h-j). Altogether, these results support a model in which higher repeat 264 numbers at GC-rich eSTRs in promoter regions stabilize DNA secondary structures which 265 promote transcription. Lastly, the contradictory results for CCG STRs may indicate that those 266 repeats could act by distinct mechanisms compared to G4 STRs, but also may be due in part to 267 limited power from a smaller sample size. bioRxiv preprint doi: https://doi.org/10.1101/495226; this version posted August 9, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

268 eSTRs are potential drivers of published GWAS signals 269 Finally, we wondered whether our eSTR catalog could identify STRs affecting complex traits in 270 humans. We first leveraged the NHGRI/EBI GWAS catalog50 to identify FM-eSTRs that are nearby 271 and in LD with published GWAS signals. Overall, 1,381 unique FM-eSTRs are within 1Mb of 272 GWAS hits (Methods, Supplementary Dataset 3). Of these, 847 are in moderate LD (r2>0.1) 273 and 65 are in strong LD (r2>0.8) with the lead SNP. For 7 loci in at least moderate LD, the lead 274 GWAS variant is within the STR itself (Supplementary Table 6). 275 276 We next sought to determine whether specific published GWAS signals could be driven by 277 changes in expression due to an underlying but previously unobserved FM-eSTR. We reasoned 278 that such loci would exhibit the following properties: (i) strong similarity in association statistics 279 across variants for both the GWAS trait and expression of a particular gene, indicating the signals 280 may be co-localized, i.e., driven by the same causal variant; and (ii) strong evidence that the FM- 281 eSTR causes variation in expression of that gene (Fig. 4a). Co-localization analysis requires high- 282 resolution summary statistic data. Thus, we focused on several example complex traits (height51, 283 schizophrenia52, inflammatory bowel disease (IBD)53, and intelligence54) for which detailed 284 summary statistics computed on cohorts of tens of thousands or more individuals are publicly 285 available (Methods). 286 287 For each trait, we identified FM-eSTRs within 1Mb of published GWAS signals from 288 Supplementary Dataset 3. We then used coloc55 to compute the probability that the FM-eSTR 289 signals we derived from GTEx and the GWAS signals derived from other cohorts are co-localized. 290 The coloc tool compares association statistics at each SNP in a region for expression and the 291 trait of interest and returns a posterior probability that the signals are co-localized. We used coloc 292 to test a total of 276 gene´trait pairs (138, 45, 29, and 64 for height, intelligence, IBD, and 293 schizophrenia respectively). In total, we identified 28 GWAS loci with (1) an FM-eSTR in at least 294 moderate LD (r2>0.1) with a nearby SNP for that trait in the GWAS catalog and (2) co-localization 295 probability between the target gene and the trait >90% (Supplementary Table 7, 296 Supplementary Fig. 18-19). 297 298 A top example in our analysis was an FM-eSTR for RFT1, an enzyme involved in the pathway of 299 N-glycosylation of proteins56, that has 97.8% co-localization probability with a GWAS signal for 300 height (Fig. 4b-c). The lead SNP in the NHGRI catalog (rs2336725) is in high LD (r2=0.85) with 301 an AC repeat that is a significant eSTR in 15 tissues. This STR falls in a cluster of transcription bioRxiv preprint doi: https://doi.org/10.1101/495226; this version posted August 9, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

302 factor and chromatin regulator binding regions identified by ENCODE near the 3’ end of the gene 303 (Fig. 4d) and exhibits a positive correlation with expression across a range of repeat numbers. 304 305 To more directly test for association between this FM-eSTR and height, we used our recently 306 developed STR-SNP reference haplotype panel57 to impute STR genotypes into available GWAS 307 data. We focused on the eMERGE cohort (Methods) for which imputed genotype array data and 308 height measurements are available. We tested for association between height and SNPs as well 309 as for AC repeat number after excluding samples with low STR imputation quality (Methods). 310 Imputed AC repeat number is significantly associated with height in the eMERGE cohort 311 (p=0.00328; beta=0.010; n=6,393), although with a slightly weaker p-value compared to the top 312 SNP (Fig. 4e). Notably, even in the case that the STR is the causal variant, power is likely reduced 313 due to the lower quality of imputed STR genotypes. Encouragingly, AC repeat number shows a 314 strong positive relationship with height across a range of repeat lengths (Fig. 4f), similar to the 315 relationship between repeat number and RFT1 expression. 316 317 To further investigate whether the FM-eSTR for RFT1 could be a causal driver of gene expression 318 variation, we devised a dual reporter assay in HEK293T cells to test for an effect of the number 319 of repeats on gene expression (0, 5, 10, or 12 repeats plus approximately 170bp of genomic 320 sequence context on either side (Supplementary Table 8, Methods). We observed a positive 321 linear relationship between the number of AC repeats and reporter expression as predicted (Fig. 322 4g) (Pearson r=0.97; p=0.013). Furthermore, all pairs of constructs with consecutive repeat 323 numbers showed significantly different expression (one-sided t-test p<0.01) with the exception of 324 10 vs. 12 repeats. Overall, these results further support the hypothesis that eSTRs may act as 325 causal drivers of gene expression. 326 327 Discussion 328 Here we present the most comprehensive resource of eSTRs to date, which reveals more than 329 28,000 associations between the number of repeats at STRs and expression of nearby genes 330 across 17 tissues. We performed fine-mapping to quantify the probability that each eSTR causally 331 effects gene expression and characterize top fine-mapped eSTRs. eSTRs analyzed here consist 332 of a large spectrum of repeat classes with a variety of repeat unit lengths and sequences, ranging 333 from homopolymers to hexanucleotide repeats. It is probable that each type induces distinct 334 regulatory effects (Fig. 5). While we explored several potential mechanisms, including bioRxiv preprint doi: https://doi.org/10.1101/495226; this version posted August 9, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

335 nucleosome positioning and the formation of non-canonical DNA or RNA secondary structures, 336 our results do not rule out other potential mechanisms for these eSTRs. 337 338 We leveraged our resource to provide evidence that FM-eSTRs may drive a subset of published 339 GWAS associations for a variety of complex traits. Altogether, our complex trait analysis 340 demonstrates that STRs may represent an important class of variation that is largely missed by 341 current GWAS. STRs have a unique ability compared to bi-allelic variants such as SNPs or small 342 indels to drive phenotypic variation along a spectrum of multiple alleles, each with different 343 numbers of repeats. In multiple examples, the eSTR shows a linear trend between repeat length 344 and expression across a range of repeat numbers, a signal that cannot be easily explained by 345 tagging nearby bi-allelic variants. Importantly, the cases identified here likely represent a minority 346 of eSTRs driving complex traits. Our analysis is based only on signals that could be detected by 347 standard SNP-based GWAS, which are underpowered to detect underlying multi-allelic 348 associations from STRs57. Further work to directly test for associations between STRs and 349 phenotypes is likely to reveal a widespread role for repeat number variation in complex traits. 350 351 Our study faced several limitations. (i) eSTR discovery was restricted to linear associations 352 between repeat number and expression. Our analysis did not consider non-linear effects or the 353 effect of sequence imperfections, such as SNPs or small indels, within the STR sequence itself. 354 (ii) While we applied stringent fine-mapping approaches to find eSTRs whose signals are likely 355 not explained by nearby SNPs in LD, some signals could plausibly be explained by other variant 356 classes such as structural variants58 or Alu elements59 that were not considered. Furthermore, 357 our fine-mapping procedure may be vulnerable to false negatives for STRs in strong or perfect 358 LD with nearby SNPs or false positives due to noise present with small sample sizes. (iii) Our 359 study was limited to tissues available from GTEx with sufficient sample sizes. While this greatly 360 expanded on the single tissue used in our previous eSTR analysis, some tissues such as brain 361 were not well represented and had low power for eSTR detection. Further, while we analyzed 17 362 distinct tissues, due to overwhelming sharing of eSTRs across tissues, we were unable to identify 363 tissue-specific characteristics of eSTRs (iv) Despite strong evidence that the FM-eSTRs for RFT1 364 and other genes may drive published GWAS signals, we have not definitively proved causality. 365 Additional work is needed to validate effects on expression and evaluate the impact of these STRs 366 in trait-relevant cell types. Nevertheless, most FM-eSTRs have broad effects across many tissue 367 types and in many cases are the most plausible causal variants identified by fine-mapping. This bioRxiv preprint doi: https://doi.org/10.1101/495226; this version posted August 9, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

368 suggests our list of FM-eSTRs colocalized with GWAS signals (Supplementary Table 7) will be 369 useful for identifying candidate eSTRs driving complex traits to be explored in future studies. 370 371 In summary, our eSTR catalog provides a valuable resource for both obtaining deeper insights 372 into biological roles of eSTRs in regulating gene expression and for identifying potential causal 373 variants underlying a variety of complex traits. 374 375 Materials and Methods 376 Dataset and preprocessing 377 Next-generation sequencing data was obtained from the Genotype-Tissue Expression (GTEx) 378 through dbGaP under phs000424.v7.p2. This included high coverage (30x) Illumina whole 379 genome sequencing (WGS) data and expression data from 652 unrelated individuals 380 (Supplementary Fig. 1). The WGS cohort consisted of 561 individuals with reported European 381 ancestry, 75 of African ancestry, and 8, 3, and 5 of Asian, Amerindian, and Unknown ancestry, 382 respectively. For each sample, we downloaded BAM files containing read alignments to the hg19 383 reference genome and VCFs containing SNP genotype calls. 384 385 STRs were genotyped using HipSTR34, which returns the maximum likelihood diploid STR allele 386 sequences for each sample based on aligned reads as input. Samples were genotyped separately 387 with non-default parameters --min-reads 5 and --def-stutter-model. VCFs were filtered using the 388 filter_vcf.py script available from HipSTR using recommended settings for high coverage data 389 (min-call-qual 0.9, max-call-flank-indel 0.15, and max-call-stutter 0.15). VCFs were merged 390 across all samples and further filtered to exclude STRs meeting the following criteria: call rate 391 <80%; STRs overlapping segmental duplications (UCSC Genome Browser60 392 hg19.genomicSuperDups table); penta- and hexamer STRs containing homopolymer runs of at 393 least 5 or 6 nucleotides, respectively in the hg19 reference genome, since we previously found 394 these STRs to have high error rates due to indels in homopolymer regions57; and STRs whose 395 frequencies did not meet the percentage of homozygous vs. heterozygous calls based on 396 expected under Hardy-Weinberg Equilibrium (binomial two-sided p<0.05). Additionally, to restrict 397 to polymorphic STRs we filtered STRs with heterozygosity <0.1. Altogether, 175,226 STRs 398 remained for downstream analysis. 399 bioRxiv preprint doi: https://doi.org/10.1101/495226; this version posted August 9, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

400 We additionally obtained gene-level RPKM values for each tissue from dbGaP project 401 phs000424.v7.p2. We focused on 15 tissues with at least 200 samples, and included two brain 402 tissues with slightly more than 100 samples available (Supplementary Table 1). Genes with 403 median RPKM of 0 were excluded and expression values for remaining genes were quantile 404 normalized separately per tissue to a standard normal distribution. Analysis was restricted to 405 protein-coding genes based on GENCODE version 19 (Ensembl 74) annotation. 406 407 Prior to downstream analyses, expression values were adjusted separately for each tissue to 408 control for sex, population structure, and technical variation in expression as covariates. For 409 population structure, we used the top 10 principal components resulting from performing principal 410 components analysis (PCA) on the matrix of SNP genotypes from each sample. PCA was 411 performed jointly on GTEx samples and 1000 Genomes Project61 samples genotyped using Omni 412 2.5 genotyping arrays (see URLs). Analysis was restricted to bi-allelic SNPs present in the Omni 413 2.5 data and resulting loci were LD-pruned using plink62 with option --indep 50 5 2. PCA on 414 resulting SNP genotypes was performed using smartpca63,64. To control for technical variation in 415 expression, we applied PEER factor correction65. Based on an analysis of number of PEER 416 factors vs. number of eSTRs identified per tissue (Supplementary Fig. 2), we determined an 417 optimal number of N/10 PEER factors as covariates for each tissue, where N is the sample size. 418 PEER factors were correlated with covariates reported previously for GTEx samples 419 (Supplementary Fig. 3) such as ischemic time. 420 421 eSTR and eSNP identification 422 For each STR within 100kb of a gene, we performed a linear regression between STR lengths 423 and adjusted expression values: 424 !′ = $% + ' 425 Where % denotes STR genotypes, !′ denotes expression values adjusted for the covariates 426 described above, $ denotes the effect size, and ε is the error term. A separate regression analysis 427 was performed for each STR-gene pair in each tissue. For STR genotypes, we used the average 428 repeat length of the two alleles for each individual, where repeat length was computed as a length 429 difference from the hg19 reference, with 0 representing the reference allele. Linear regressions 430 were performed using the OLS function from the Python statsmodels.api module66. As a control, 431 for each STR-gene pair we performed a permutation analysis in which sample identifiers were 432 shuffled. 433 bioRxiv preprint doi: https://doi.org/10.1101/495226; this version posted August 9, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

434 Samples with missing genotypes or expression values were removed from each regression 435 analysis. To reduce the effect of outlier STR genotypes, we removed samples with genotypes 436 observed in less than 3 samples. If after filtering samples there were less than three unique 437 genotypes, the STR was excluded from analysis. Adjusted expression values and STR genotypes 438 for remaining samples were then Z-scaled to have mean 0 and variance 1 before performing each 439 regression. This step forces resulting effect sizes to be between -1 and 1. 440 441 We used a gene-level FDR threshold (described previously20) of 10% to identify significant STR- 442 gene pairs. We assume most genes have at most a single causal eSTR. For each gene, we 443 determined the STR association with the strongest P-value. This P-value was adjusted using a 444 Bonferroni correction for the number of STRs tested per gene to give a P-value for observing a 445 single eSTR association for each gene. We then used the list of adjusted P-values (one per gene) 446 as input to the fdrcorrection function in the statsmodels.stats.multitest module to obtain a q-value 447 for the best eSTR for each gene. FDR analysis was performed separately for each tissue. 448 449 eSNPs were identified using the same model covariates, and normalization procedures but using 450 SNP dosages (0, 1, or 2) rather than STR lengths. Similar to the STR analysis, we removed 451 samples with genotypes occurring in fewer than 3 samples and removed SNPs with less than 3 452 unique genotypes remaining after filtering. On average, we tested 17 STRs and 533 SNPs per 453 gene. 454 455 Fine-mapping eSTRs 456 We used model comparison as an orthogonal validation to CAVIAR findings to determine whether 457 the best eSTR for each gene explained variation in gene expression beyond a model consisting 458 of the best eSNP. For each gene with an eSTR we determined the eSNP with the strongest p- 459 value. We then compared two linear models: Y’~eSNP (SNP-only model) vs. Y’~eSNP+eSTR 460 (SNP+STR model) using the anova_lm function in the python statsmodels.api.stats module. Q- 461 values were obtained using the fdrcorrection function in the statsmodels.stats.multitest module. 462 On average across tissues, 17.4% of eSTRs tested improved the model over the best eSNP for 463 the target gene (10% FDR). When restricting to FM-eSTRs, 78% improved the model (10% FDR). 464 465 We used CAVIAR7 v2.2 to further fine-map eSTR signals against the all nominally significant 466 eSNPs (p<0.05) within 100kb of each gene. On average, 121 SNPs per gene passed this 467 threshold and were included in CAVIAR analysis. Pairwise-LD between the eSTR and eSNPs bioRxiv preprint doi: https://doi.org/10.1101/495226; this version posted August 9, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

468 was estimated using the Pearson correlation between SNP dosages (0, 1, or 2) and STR 469 genotypes (average of the two STR allele lengths) across all samples. CAVIAR was run with 470 parameters -f 1 -c 2 to model up to two independent causal variants per . In some cases, 471 initial association statistics for SNPs and STRs might have been computed using different sets of 472 samples if some were filtered due to outlier genotypes. To provide a fair comparison between 473 eSTRs and eSNPs, for each CAVIAR analysis we recomputed Z-scores for eSTRs and eSNPs 474 using the same set of samples prior to running CAVIAR. 475 476 Multi-tissue eSTR analysis 477 We used an R implementation of mash43 (mashR) v0.2.21 to compute posterior estimates of eSTR 478 effect sizes and standard errors across tissues. Briefly, mashR takes as input effect sizes and 479 standard error measurements per-tissue, learns various covariance matrices of effect sizes 480 between tissues, and outputs posterior estimates of effect sizes and standard errors accounting 481 for global patterns of effect size sharing. We used all eSTRs with a nominal p-value of <1e-5 in 482 at least one tissue as a set of strong signals to compute covariance matrices. eSTRs that were 483 not analyzed in all tissues were excluded from this step. We included “canonical” covariance 484 matrices (identity matrix and matrices representing condition-specific effects) and matrices 485 learned by extreme deconvolution initialized using PCA with 5 components as suggested by 486 mashR documentation. After learning covariance matrices, we applied mashR to estimate 487 posterior effect sizes and standard errors for each eSTR in each tissue. For eSTRs that were 488 filtered from one or more tissues in the initial regression analysis, we set input effect sizes to 0 489 and standard errors to 10 in those tissues to reflect high uncertainty in effect size estimates at 490 those eSTRs. For Fig. 1d, rows and columns of the effect size correlation matrix were clustered 491 using default parameters from the clustermap function in the Python seaborn library (see URLs). 492 493 Canonical repeat units 494 For each STR, we defined the canonical repeat unit as the lexicographically first repeat unit when 495 considering all rotations and strand orientations of the repeat sequence. For example, the 496 canonical repeat unit for the repeat sequence CAGCAGCAGCAG would be AGC. 497 498 Enrichment analyses 499 Enrichment analyses were performed using a two-sided Fisher’s exact test as implemented in the 500 fisher_exact function of the python package scipy.stats (see URLs). Overlapping STRs with each 501 annotation was performed using the intersectBed tool of the BEDTools67 suite. Genomic bioRxiv preprint doi: https://doi.org/10.1101/495226; this version posted August 9, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

502 annotations were obtained by downloading custom tables using the UCSC Genome Browser60 503 table browser tool to select either coding regions, introns, 5’UTRs, 3’UTRs, or regions upstream 504 of TSSs. An STR could be assigned to more than one category in the case of overlapping 505 transcripts. STRs not assigned to one of those categories were labeled as intergenic. ENCODE 506 DNAseI HS clusters were downloaded from the UCSC Genome Browser (see URLs). Analysis 507 was restricted to DNAseI HS clusters annotated in at least 20 cell types. The distance between 508 each STR and the center of the nearest DNAseI HS cluster was computed using the closestBed 509 tool from the BEDTools suite. 510 511 Analysis of DNAseI-seq, ChIP-seq, and Nucleosome occupancy 512 Genome-wide nucleosome occupancy signal in GM12878 was downloaded from the UCSC 513 Genome Browser (see URLs). ChIP-seq reads for RNAPII and DNAseI-seq reads were 514 downloaded from the ENCODE Project website (see URLs) (Accessions GM12878 RNAPII: 515 ENCFF775ZJX, heart RNAPII: ENCFF643EGO, lung RNAPII: ENCSR033NHF, tibial nerve 516 RNAPII: ENCFF750HDH, human embryonic stem cells RNAPII: ENCFF526YGE; GM12878 517 DNAseI: ENCFF775ZJX, fat DNAseI: ENCFF880CAD, tibial nerve DNAseI: ENCFF226ZCG, skin 518 DNAseI: ENCFF238BRB). Histograms of aggregate read densities and heatmaps for individual 519 STR regions were generated using the annoatePeaks.pl tool of Homer68. For nucleosome 520 occupancy and DNAseI analyses on all STRs, we used parameters -size 1000 -hist 1. For analysis 521 of GC-rich repeats in promoters, we used parameters -size 10000 -hist 5. 522 523 Characterization of tissue-specific eSTRs 524 We clustered FM-eSTRs based on effect sizes Z-scores computed by mash for each eSTR in 525 each tissue. We first created a tissue by FM-eSTR matrix of the absolute value of the Z-scores. 526 We then Z-normalized Z-scores for each FM-eSTR to have mean 0 and variance 1. We used the 527 KMeans class from the Python sklearn.cluster module to perform K-means clustering with K=8. 528 The number of clusters was chosen by visualizing the sum of squared distances from centroids 529 for values of K ranging from 1 to 20 and choosing a value of K based on the “elbow method”. 530 Using different values of K produced similar groups. We tested for non-uniform distributions of 531 FM-eSTR repeat units across clusters using a chi-squared test implemented in the scipy.stats 532 chi2_contingency function. 533 534 Analysis of DNA and RNA secondary structure bioRxiv preprint doi: https://doi.org/10.1101/495226; this version posted August 9, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

535 For each STR, we extracted the repeat plus 50bp flanking sequencing from the hg19 reference 536 genome. We additionally created sequences containing each common allele for each STR. 537 Common alleles were defined as those seen at least 5 times in a previously generated deep 538 catalog of STR variation in 1,916 samples57. For each sequence and its reverse complement, we 539 ran mfold48 on the DNA and corresponding RNA sequences with mfold arguments NA=DNA and 540 NA=RNA, respectively, and otherwise default parameters to estimate the free energy of each 541 single-stranded sequence. Mann-Whitney tests were performed using the mannwhitneyu function 542 of the scipy.stats python package (see URLs). 543 544 Co-localization of FM-eSTRs with published GWAS signals 545 Published GWAS associations were obtained from the NHGRI/EBI GWAS catalog available from 546 the UCSC Genome Browser Table Browser (table hg19.gwasCatalog) downloaded on July 24, 547 2019. Height GWAS summary statistics were downloaded from the GIANT Consortium website 548 (see URLs). Schizophrenia GWAS summary statistics were downloaded from the Psychiatric 549 Genomics Consortium website (see URLs). IBD summary statistics were downloaded from the 550 International Inflammatory Bowel Disease Genetics Consortium (IIBDGC) website. We used the 551 file EUR.IBD.gwas_info03_filtered.assoc with summary statistics in Europeans (see URLs). 552 Intelligence summary statistics were downloaded from the Complex Trait Genomics lab website 553 (see URLs). LD between STRs and SNPs was computed by taking the squared Pearson 554 correlation between STR lengths and SNP dosages in GTEx samples for each STR-SNP pair. 555 STR genotypes seen less than 3 times were filtered from LD calculations. 556 557 Co-localization analysis of eQTL and GWAS signals was performed using the coloc.abf function 558 of the coloc55 package. For all traits, dataset 1 was specified as type=”quant” and consisted of 559 eSNP effect sizes and their variances as input. We specified sdY=1 since expression was quantile 560 normalized to a standard normal distribution. Dataset 2 was specified differently for height and 561 schizophrenia to reflect quantitative vs. case-control analyses. For height and intelligence, we 562 specified type=”quant” and used effect sizes and their variances as input. We additionally 563 specified minor allele frequencies listed in the published summary statistics file and the total 564 sample size of N=695,647 and N=269,720 for height and intelligence, respectively. For 565 schizophrenia and IBD, we specified type=”CC” and used effect sizes and their variances as input. 566 We additionally specified the fraction of cases as 33%. 567 bioRxiv preprint doi: https://doi.org/10.1101/495226; this version posted August 9, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

568 Capture Hi-C interactions (Supplementary Fig. 19) were visualized using the 3D Genome 569 Browser69. The visualization depicts interactions profiled in GM1287870 and only shows 570 interactions overlapping the STR of interest. 571 572 Association analysis in the eMERGE cohort 573 We obtained SNP genotype array data and imputed genotypes from dbGaP accessions 574 phs000360.v3.p1 and phs000888.v1.p1 from consent groups c1 (Health/Medical/Biomedical), c3 575 (Health/Medical/Biomedical - Genetic Studies Only - No Insurance Companies), and c4 576 (Health/Medical/Biomedical - Genetic Studies Only). Height data was available for samples in 577 cohorts c1 (phs000888.v1.pht004680.v1.p1.c1), c3 (phs000888.v1.pht004680.v1.p1.c3), and c4 578 (phs000888.v1.pht004680.v1.p1.c4). We removed samples without age information listed. If 579 height was collected at multiple times for the same sample, we used the first data point listed. 580 581 Genotype data was available for 7,190, 6100, and 3,755 samples from the c1, c3, and c4 cohorts 582 respectively (dbGaP study phs000360.v3.p1). We performed PCA on the genotypes to infer 583 ancestry of each individual. We used plink to restrict to SNPs with minor allele frequency at least 584 10% and with genotype frequencies expected under Hardy-Weinberg Equilibrium (p>1e-4). We 585 performed LD pruning using the plink option --indep 50 5 1.5 and used pruned SNPs as input to 586 PCA analysis. We visualized the top two PCs and identified a cluster of 14,147 individuals 587 overlapping samples with annotated European ancestry. We performed a separate PCA using 588 only the identified European samples and used the top 10 PCs as covariates in association tests. 589 590 A total of 11,587 individuals with inferred European ancestry had both imputed SNP genotypes 591 and height and age data available. Samples originated from cohorts at Marshfield Clinic, Group 592 Health Cooperative, Northwestern University, Vanderbilt University, and the Mayo Clinic. We 593 adjusted height values by regressing on top 10 ancestry PCs, age, and cohort. Residuals were 594 inverse normalized to a standard normal distribution. Adjustment was performed separately for 595 males and females. 596 597 Imputed genotypes (from dbGaP study phs000888.v1.p1) were converted from IMPUTE271 to 598 plink’s binary format using plink, which marks calls with uncertainty >0.1 (score<0.9) as missing. 599 SNP associations were performed using plink with imputed genotypes as input and with the 600 “linear” option with analysis restricted to the region chr3:53022501-53264470. bioRxiv preprint doi: https://doi.org/10.1101/495226; this version posted August 9, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

601 602 The RFT1 FM-eSTR was imputed into the imputed SNP genotypes using Beagle 572 with option 603 gp=true and using our SNP-STR reference haplotype panel57. We previously estimated 604 imputation concordance of 97% at this STR in a separate European cohort. Samples with imputed 605 genotype probabilities of less than 0.9 were removed from the STR analysis. We additionally 606 restricted analysis to STR genotypes present in at least 100 samples to minimize the effect of 607 outlier genotypes. We regressed STR genotype (defined above as the average of an individual’s 608 two repeat lengths) on residualized height values for the remaining 6,393 samples using the 609 Python statsmodels.regression.linear_model.OLS function (see URLs). 610 611 Dual luciferase reporter assay 612 Constructs for 0, 5, or 10 copies of AC at the FM-eSTR for RFT1 (chr3:53128363-53128413 plus 613 approximately 170bp genomic context on either side (RFT1_0rpt, RFT1_5rpt, RFT1_10rpt in 614 Supplementary Table 8) were ordered as gBlocks from Integrated DNA technologies (IDT). Each 615 construct additionally contained homology arms for cloning into pGL4.27 (below). We additionally 616 PCR amplified the region from genomic DNA for sample NA12878 with 12 copies of AC (NIGMS 617 Human Genetic Repository, Coriell) using PrimeSTAR max DNA Polymerase (Clontech R045B) 618 and primers RFT1eSTR_F and RFT1eSTR_R (Supplementary Table 8) which included the 619 same homology arms. 620 621 Constructs were cloned into plasmid pGL4.27 (Promega, E8451), which contains the firefly 622 luciferase coding sequence and a minimal promoter. The plasmid was linearized using EcoRV 623 (New England Biolabs, R3195) and purified from agarose gel (Zymo Research, D4001). 624 Constructs were cloned into the linearized vector using In-Fusion (Clontech, 638910). Sanger 625 sequencing of isolated clones for each plasmid validated expected repeat numbers in each 626 construct. 627 628 Plasmids were transfected into the human embryonic kidney 293 cell line (HEK293T; ATCC CRL- 629 3216) and grown in DMEM media (Gibco, 10566-016), supplemented with 10% fetal bovine serum 630 (Gibco, 10438-026), 2 mM glutamine (Gibco, A2916801), 100 units/mL of penicillin, 100 µg/mL of 631 streptomycin, and 0.25 µg/mL Amphotericin B (Anti-Anti Gibco, 15240062). Cells were maintained 5 632 at 37°C in a 5% CO2 incubator. 2×10 HEK293T cells were plated onto each well of a 25 ug/ml 633 poly-D lysine (EMD Millipore, A-003-E) coated 24-well plate, the day prior to transfection. On the 634 day of the transfection medium was changed to Opti-MEM. We conducted co-transfection bioRxiv preprint doi: https://doi.org/10.1101/495226; this version posted August 9, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

635 experiments to test expression of each construct. 100ng of the empty pGL4.27 vector (Promega, 636 E8451) or 100 ng of each one of the pGL4.27 derivatives, were mixed with 5ng of the reference 637 plasmid, pGL4.73 (Promega, E6911), harboring SV40 promoter upstream of Renilla luciferase, 638 and added to the cells in the presence of Lipofectamine™ 3000 (Invitrogen, L3000015), according 639 to the manufacturer's instructions. Cells were incubated for 24 hr at 37°C, washed once with 640 phosphate-buffered saline, and then incubated in fresh completed medium for an additional 24 641 hr. 642 643 48 hours after transfection the HEK293T cells were washed 3 times with PBS and lysed in 100μl 644 of Passive Lysis Buffer (Promega, E1910). Firefly luciferase and Renilla luciferase activities were 645 measured in 10μl of HEK293T cell lysate using the Dual-Luciferase Reporter assay system 646 (Promega, E1910) in a Veritas™ Microplate Luminometer. Relative activity was defined as the 647 ratio of firefly luciferase activity to Renilla luciferase activity. For each plasmid, transfection and 648 the expression assay were done in triplicates using three wells of cultured cells that were 649 independently transfected (biological repeats), and three individually prepared aliquots of each 650 transfection reaction (technical repeats). Values from each technical replicate were averaged to 651 get one ratio for each biological repeat. Values shown in Fig. 4g represent the mean and standard 652 deviation across the three biological replicates for each construct. 653 654 Data Availability 655 All eSTR summary statistics are available for download on WebSTR 656 http://webstr.ucsd.edu/downloads. 657 658 Code Availability 659 Code for performing analyses and generated figures is available at 660 http://github.com/gymreklab/gtex-estrs-paper.

661 URLs 662 1000 Genomes phased Omni2.5 SNP data, 663 ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/supporting/shapeit2_scaffolds/hd_chip_s 664 caffolds/ bioRxiv preprint doi: https://doi.org/10.1101/495226; this version posted August 9, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

665 ENCODE DNAseI HS clusters, 666 http://hgdownload.cse.ucsc.edu/goldenpath/hg19/encodeDCC/wgEncodeRegDnaseClustered/w 667 gEncodeRegDnaseClusteredV3.bed.gz 668 ChIP-seq and DNAseI-seq, https://www.encodeproject.org 669 Nucleosome occupancy in GM12878, 670 http://hgdownload.cse.ucsc.edu/goldenpath/hg19/encodeDCC/wgEncodeSydhNsome/wgEncod 671 eSydhNsomeGm12878Sig.bigWig 672 Height GWAS summary statistics, 673 https://portals.broadinstitute.org/collaboration/giant/images/0/0f/Meta- 674 analysis_Locke_et_al%2BUKBiobank_2018.txt.gz 675 Schizophrenia GWAS summary statistics, https://www.med.unc.edu/pgc/results-and-downloads 676 IBD summary statistics, ftp://ftp.sanger.ac.uk/pub/consortia/ibdgenetics/iibdgc-trans-ancestry- 677 filtered-summary-stats.tgz 678 Intelligence summary statistics, 679 https://ctg.cncr.nl/documents/p1651/SavageJansen_IntMeta_sumstats.zip 680 Python scipy.stats package, https://docs.scipy.org/doc/scipy/reference/stats.html 681 Python statsmodels package, https://www.statsmodels.org 682 Python scikit-learn package, https://scikit-learn.org/stable/ 683 Python seaborn package, https://seaborn.pydata.org/ 684 MashR vignettes, https://stephenslab.github.io/mashr/articles/intro_mash_dd.html#data-driven- 685 covariances 686 687 Acknowledgements 688 Research reported in this publication was supported in part by the Office Of The Director, National 689 Institutes of Health under Award Number DP5OD024577. We thank Vineet Bafna, Eric 690 Mendenhall, Joseph Gleeson, and Yu-Ting Liu for helpful comments.

691 GTEx Project 692 The Genotype-Tissue Expression (GTEx) Project was supported by the Common Fund of the 693 Office of the Director of the National Institutes of Health (commonfund.nih.gov/GTEx). Additional 694 funds were provided by the NCI, NHGRI, NHLBI, NIDA, NIMH, and NINDS. Donors were enrolled 695 at Biospecimen Source Sites funded by NCI\Leidos Biomedical Research, Inc. subcontracts to 696 the National Disease Research Interchange (10XS170), GTEx Project March 5, 2014 version 697 Page 5 of 8 Roswell Park Cancer Institute (10XS171), and Science Care, Inc. (X10S172). The bioRxiv preprint doi: https://doi.org/10.1101/495226; this version posted August 9, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

698 Laboratory, Data Analysis, and Coordinating Center (LDACC) was funded through a contract 699 (HHSN268201000029C) to the The Broad Institute, Inc. Biorepository operations were funded 700 through a Leidos Biomedical Research, Inc. subcontract to Van Andel Research Institute 701 (10ST1035). Additional data repository and project management were provided by Leidos 702 Biomedical Research, Inc. (HHSN261200800001E). The Brain Bank was supported supplements 703 to University of Miami grant DA006227. Statistical Methods development grants were made to 704 the University of Geneva (MH090941 & MH101814), the University of Chicago 705 (MH090951,MH090937, MH101825, & MH101820), the University of North Carolina - Chapel Hill 706 (MH090936), North Carolina State University (MH101819),Harvard University (MH090948), 707 Stanford University (MH101782), Washington University (MH101810), and to the University of 708 Pennsylvania (MH101822). The datasets used for the analyses described in this manuscript were 709 obtained from dbGaP at http://www.ncbi.nlm.nih.gov/gap through dbGaP accession number 710 phs000424.v7.p2.

711 eMERGE Project - Mayo Clinic 712 Samples and associated genotype and phenotype data used in this study were provided by the 713 Mayo Clinic. Funding support for the Mayo Clinic was provided through a cooperative agreement 714 with the National Research Institute (NHGRI), Grant #: UOIHG004599; and by 715 grant HL75794 from the National Heart Lung and Blood Institute (NHLBI). Funding support for 716 genotyping, which was performed at The Broad Institute, was provided by the NIH 717 (U01HG004424). The datasets used for analyses described in this manuscript were obtained from 718 dbGaP at http://www.ncbi.nlm.nih.gov/gap through dbGaP accession numbers phs000360.v3.p1 719 and phs000888.v1.p1. 720 721 eMERGE Project - Marshfield Clinic 722 Funding support for the Personalized Medicine Research Project (PMRP) was provided through 723 a cooperative agreement (U01HG004608) with the National Human Genome Research Institute 724 (NHGRI), with additional funding from the National Institute for General Medical Sciences 725 (NIGMS) The samples used for PMRP analyses were obtained with funding from Marshfield 726 Clinic, Health Resources Service Administration Office of Rural Health Policy grant number D1A 727 RH00025, and Wisconsin Department of Commerce Technology Development Fund contract 728 number TDF FYO10718. Funding support for genotyping, which was performed at Johns Hopkins 729 University, was provided by the NIH (U01HG004438). The datasets used for analyses described bioRxiv preprint doi: https://doi.org/10.1101/495226; this version posted August 9, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

730 in this manuscript were obtained from dbGaP at http://www.ncbi.nlm.nih.gov/gap through dbGaP 731 accession numbers phs000360.v3.p1 and phs000888.v1.p1. 732 733 eMERGE Project - Northwestern University 734 Samples and data used in this study were provided by the NUgene Project (www.nugene.org). 735 Funding support for the NUgene Project was provided by the Northwestern University’s Center 736 for Genetic Medicine, Northwestern University, and Northwestern Memorial Hospital. Assistance 737 with phenotype harmonization was provided by the eMERGE Coordinating Center (Grant number 738 U01HG04603). This study was funded through the NIH, NHGRI eMERGE Network 739 (U01HG004609). Funding support for genotyping, which was performed at The Broad Institute, 740 was provided by the NIH (U01HG004424). The datasets used for analyses described in this 741 manuscript were obtained from dbGaP at http://www.ncbi.nlm.nih.gov/gap through dbGaP 742 accession numbers phs000360.v3.p1 and phs000888.v1.p1. 743 744 eMERGE Project - Vanderbilt University 745 Funding support for the Vanderbilt Genome-Electronic Records (VGER) project was provided 746 through a cooperative agreement (U01HG004603) with the National Human Genome Research 747 Institute (NHGRI) with additional funding from the National Institute of General Medical Sciences 748 (NIGMS). The dataset and samples used for the VGER analyses were obtained from Vanderbilt 749 University Medical Center's BioVU, which is supported by institutional funding and by the 750 Vanderbilt CTSA grant UL1RR024975 from NCRR/NIH. Funding support for genotyping, which 751 was performed at The Broad Institute, was provided by the NIH (U01HG004424). The datasets 752 used for analyses described in this manuscript were obtained from dbGaP at 753 http://www.ncbi.nlm.nih.gov/gap through dbGaP accession numbers phs000360.v3.p1 and 754 phs000888.v1.p1. 755 756 eMERGE Project - Group Health Cooperative 757 Funding support for Alzheimer's Disease Patient Registry (ADPR) and Adult Changes in Thought 758 (ACT) study was provided by a U01 from the National Institute on Aging (Eric B. Larson, PI, 759 U01AG006781). A gift from the 3M Corporation was used to expand the ACT cohort. DNA aliquots 760 sufficient for GWAS from ADPR Probable AD cases, who had been enrolled in Genetic 761 Differences in Alzheimer's Cases and Controls (Walter Kukull, PI, R01 AG007584) and obtained 762 under that grant, were made available to eMERGE without charge. Funding support for 763 genotyping, which was performed at Johns Hopkins University, was provided by the NIH bioRxiv preprint doi: https://doi.org/10.1101/495226; this version posted August 9, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

764 (U01HG004438). Genome-wide association analyses were supported through a Cooperative 765 Agreement from the National Human Genome Research Institute, U01HG004610 (Eric B. Larson, 766 PI). The datasets used for analyses described in this manuscript were obtained from dbGaP at 767 http://www.ncbi.nlm.nih.gov/gap through dbGaP accession numbers phs000360.v3.p1 and 768 phs000888.v1.p1. 769 770 Author Contributions 771 S.F.F. performed all eSTR and eSNP mapping, helped perform downstream analyses and helped 772 draft the manuscript. J.M. performed multi-tissue analysis using mashR and helped revise the 773 manuscript. C.W. optimized and performed the reporter assay. S.S. participated in design of the 774 STR imputation analysis. S.S.-B. lead, designed, and analyzed data from the reporter assay. R.Y. 775 implemented the WebSTR web application. A.G. conceived and planned analyses and validation 776 experiments of regulatory effects of eSTRs and wrote the manuscript. M.G. conceived the study, 777 designed and performed analyses, and drafted the initial manuscript. All authors have read and 778 approved the final manuscript.

779 Competing interests 780 The authors have no competing financial interests to disclose.

781

782 bioRxiv preprint doi: https://doi.org/10.1101/495226; this version posted August 9, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

783 Figure Legends 784 785 Figure 1: Multi-tissue identification of eSTRs. 786 (a) Schematic of eSTR discovery pipeline. We analyzed RNA-seq from 17 tissues and STR 787 genotypes obtained from deep WGS for 652 individuals from the GTEx Project. For each STR 788 within 100kb of a gene, we tested for association between length of the STR and expression of 789 the gene in each tissue. For each gene, CAVIAR was used to fine-map the effects of eSTRs vs. 790 nominally significant cis SNPs on gene expression. CAVIAR takes as input pairwise variant LD 791 and effect sizes (Z-scores) and outputs a posterior probability of causality for each variant. For 792 multi-tissue analysis, per-tissue effect sizes and standard errors were used as input to mashR, 793 which computes posterior effect size estimates in each tissue based on potential sharing of 794 eSTRs across tissues. 795 (b) eSTR association results. The quantile-quantile plot compares observed p-values for each 796 STR-gene test vs. the expected uniform distribution. Gray dots denote permutation controls. The 797 black line shows the diagonal. Colored dots show observed p-values for each tissue. 798 (c) Example eSTRs previously implicated in disease. Left: a CG-rich FM-eSTR upstream of 799 CSTB was previously implicated in myoclonus epilepsy73. Middle: a multi-allelic intronic CCTGGG 800 FM-eSTR in NOP56 was implicated in spinocerebellar ataxia 3674. Right: A CGGGGG FM-eSTR 801 in the promoter of ALOX5 was previously shown to regulate ALOX5 expression in leukocytes42 802 and is associated with reduced lung function75 and cardiovascular disease76. Each black point 803 represents a single individual. For each plot, the x-axis represents the mean number of repeats 804 in each individual and the y-axis represents normalized expression in a representative tissue. 805 Boxplots summarize the distribution of expression values for each genotype. Horizontal lines 806 show median values, boxes span from the 25th percentile (Q1) to the 75th percentile (Q3). 807 Whiskers extend to Q1-1.5*IQR (bottom) and Q3+1.5*IQR (top), where IQR gives the interquartile 808 range (Q3-Q1). The red line shows the mean expression for each x-axis value. Gene diagrams 809 are not drawn to scale. 810 (d) eSTR effect sizes are correlated across tissues and studies. Each cell in the matrix shows 811 the Spearman correlation between mashR FM-eSTR effect sizes (β’) for each pair of tissues. Only 812 eSTRs with CAVIAR score >0.3 in at least one tissue (FM-eSTRs) were included in each 813 correlation analysis. Rows and columns were clustered using hierarchical clustering (Methods). 814 815 Figure 2: Characterization of FM-eSTRs bioRxiv preprint doi: https://doi.org/10.1101/495226; this version posted August 9, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

816 (a) Density of all STRs around transcription start sites. The y-axis shows the number of STRs 817 in each 100bp bin around the TSS relative to the average across all bins. Negative x-axis numbers 818 denote upstream regions and positive numbers denote downstream regions. 819 (b) Density of all STRs around DNAseI HS sites. Plots are centered at ENCODE DNAseI HS 820 clusters and represent the relative number of STRs in each 50bp bin. For (a) and (b) the black 821 line denotes all STRs and colored lines denote repeats with different repeat unit lengths 822 (gray=homopolymers, red=dinucleotides, gold=trinucleotides, blue=tetranucleotides, 823 green=pentanucleotides, purple=hexanucleotides). 824 (c) Relative probability to be an FM-eSTR around TSSs. The black lines represent the 825 probability of an STR in each bin to be an FM-eSTR. Values were scaled relative to the genome- 826 wide average. 827 (d) Relative probability to be an FM-eSTR around DNAseI HS clusters. Axes are similar to 828 those in (c) except centered around DNAseI HS clusters. For a-d, values were smoothed by 829 taking a sliding average of each four consecutive bins. 830 (e) Repeat unit enrichment at FM-eSTRs across all tissues. The x-axis shows all repeat units

831 for which there are at least 3 FM-eSTRs across all tissues. The y-axis denotes the log2 odds ratios 832 (OR) from performing a Fisher’s exact test comparing FM-eSTRs to all STRs. Enrichments for all 833 eSTRs are given in Supplementary Table 5. Single asterisks denote repeat units nominally 834 enriched or depleted (two-sided Fisher exact test p<0.05). Double asterisks denote repeat units 835 significantly enriched after controlling for the number of repeat units tested (Bonferroni adjusted 836 p<0.05). 837 (f) Strand-biased characteristics of FM-eSTRs. Top panel: the y-axis shows the number of FM- 838 eSTRs with each repeat unit on the template strand. Bottom panel: the y-axis shows the 839 percentage of FM-eSTRs with each repeat unit on the template strand that have positive effect 840 sizes. Gray bars denote A-rich repeat units (A/AC/AAC/AAAC) and red bars denote T-rich repeat 841 units (T/GT/GTT/GTTT). Single asterisks denote repeat units nominally enriched or depleted (two- 842 sided binomial p<0.05). Double asterisks denote repeat units significantly enriched after 843 controlling for multiple hypothesis testing (Bonferroni adjusted p<0.05). Asterisks above brackets 844 show significant differences between repeat unit pairs. Asterisks on x-axis labels denote 845 departure from the 50% positive effect sizes expected by chance. Error bars give 95% confidence 846 intervals. 847 848 Figure 3: GC-rich eSTRs are predicted to modulate DNA secondary structure. bioRxiv preprint doi: https://doi.org/10.1101/495226; this version posted August 9, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

849 (a) Density of RNAPII localization around STRs. The y-axis denotes the average number of 850 ChIP-seq reads for RNA Polymerase II in GM12878 in 5bp bins centered at STRs. 851 (b) Nucleosome occupancy around STRs. The y-axis denotes the average nucleosome 852 occupancy in 5bp bins centered at STRs in GM12878. For (a) and (b), black lines denote all STRs 853 found within 5kb of TSSs, blue lines denote CCG STRs, and red lines denote STRs matching the 854 canonical G4 motif. Dashed lines represent all STRs of each class and solid lines represent FM- 855 eSTRs. Only STRs within 5,000bp of a TSS are included. 856 (c-d) Free energy of STR regions. Boxplots denote the distribution of free energy for each STR 857 +/- 50bp of context sequence, computed as the average across all alleles at each STR. (c) and 858 (d) show results computed using the template strand for DNA and RNA, respectively. 859 (e-f) Pearson correlation between STR length and free energy. Correlations were computed 860 separately for each STR, and plots show the distribution of correlation coefficients across all 861 STRs. The dashed horizontal line denotes 0 correlation as expected by chance. (e) and (f) show 862 results computed using the template strand for DNA and RNA, respectively. 863 For c-f, horizontal purple lines show medians and boxes span from the 25th percentile (Q1) to 864 the 75th percentile (Q3). Whiskers extend to Q1-1.5*IQR (bottom) and Q3+1.5*IQR (top), where 865 IQR represents the interquartile range (Q3-Q1). White boxes show all STRs in each category and 866 black boxes show FM-eSTRs. Upper brackets denote significant differences for all STRs (white) 867 across categories or for significant differences within each category between all STRs (white) and 868 FM-eSTRs (black). Numbers below (c) denote the number of data points (unique STRs) included 869 in each box. Nominally significant differences (Mann Whitney one-sided p<0.05) between 870 distributions are denoted with a single asterisk. Differences significant after controlling for multiple 871 hypothesis correction are denoted with double asterisks. For each category (free energy and 872 Pearson correlation), we used a Bonferroni correction to control for 20 total comparisons: 873 comparing all vs. FM-eSTRs separately in each category, comparing CCG vs. all STRs, and 874 comparing G4 vs. all STRs, in four conditions (DNA +/- and RNA +/-). Results for non-template 875 strands are shown in Supplementary Fig. 17. 876 (g) Bias in the direction of eSTR effect sizes. The y-axis shows the percentage of FM-eSTRs 877 in each category with positive effect sizes, meaning a positive correlation between STR length 878 and expression. White bars denote all STRs in each category. Gray bars denote STRs falling 879 within 3kb of TSSs. Error bars give 95% confidence intervals. 880 (h-j) Examples of G4 FM-eSTRs in promoter regions predicted to modulate secondary 881 structure. For each example, top plots show the mean expression across all individuals with each 882 mean STR length. Vertical bars represent +/- 1 s.d. Bottom plots show the free energy computed bioRxiv preprint doi: https://doi.org/10.1101/495226; this version posted August 9, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

883 by mfold for each number of repeats for the STR. Note, expression plots (top) have additional 884 points to represent heterozygous genotypes, whereas energies (bottom) were computed per- 885 allele rather than per-genotype. Solid and dashed gray lines show energies for alleles on the 886 template and non-template strands, respectively. The x-axis shows STR lengths relative to the 887 hg19 reference genome in bp. Gene diagrams are not drawn to scale. 888 889 Figure 4: FM-eSTRs co-localize with published GWAS signals. 890 (a) Overview of analyses to identify FM-eSTRs involved in complex traits. In cases where 891 FM-eSTRs may drive complex phenotypes, we assumed a model where variation in STR repeat 892 number (red; left) alters gene expression (purple; middle), which in turn affects the value of a 893 particular complex trait (right). Even when the STR is the causal variant, nearby SNPs (gray; left) 894 in LD may be associated with both gene expression through eQTL analysis and the trait of interest 895 through GWAS. Black arrows indicate assumed causal relationships. Dashed arrows indicate 896 analysis approaches (1-3) used to test each relationship. (1) eQTL analysis tests for associations 897 between SNP genotype or STR repeat number and gene expression. (2) GWAS tests for 898 associations between SNP genotype and trait value. (3) coloc analysis tests whether the 899 association signals for expression and the trait are driven by the same underlying causal variant. 900 (b) eSTR association for RFT1. The x-axis shows STR genotype at an AC repeat 901 (chr3:53128363) as the mean number of repeats and the y-axis gives normalized RFT1 902 expression in a representative tissue (Esophagus-Muscularis). Each point represents a single 903 individual. Boxplots summarize expression distributions for each genotype as described in Fig. 904 1d. Red lines show the mean expression for each x-axis value. 905 (c) Summary statistics for RFT1 expression and height. The top panel shows genes in the

906 region around RFT1. The middle panel shows the -log10 p-values of association between each 907 variant and RFT1 expression in aortic artery. The FM-eSTR is denoted by a red star. The bottom

908 panel shows the -log10 p-values of association for each variant with height based on available 909 summary statistics51. The dashed gray horizontal line shows genome-wide significance 910 threshold. 911 (d) Detailed view of the RFT1 locus. A UCSC genome browser60 screenshot is shown for the 912 region in the gray box in (b). The FM-eSTR is shown in red. The bottom track shows transcription 913 factor (TF) binding clusters profiled by ENCODE. 914 (e) eSTR and SNP associations with height at the RFT1 locus in the eMERGE cohort. The 915 x-axis shows the same genomic region as in (b). The y-axis denotes association p-values for 916 each variant in the subset of eMERGE cohort samples analyzed here. Black dots represent SNPs. bioRxiv preprint doi: https://doi.org/10.1101/495226; this version posted August 9, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

917 The blue star denotes the top variant in the region identified by Yengo, et al.51 (rs2581830), which 918 was also the top SNP in eMERGE. The red star represents the imputed FM-eSTR. 919 (f) Imputed RFT1 repeat number is positively correlated with height. The x-axis shows the 920 mean number of AC repeats. The y-axis shows the mean normalized height for all samples 921 included in the analysis with a given genotype. Vertical black lines show +/- 1 s.d. 922 (g) Reporter assay shows expected positive trend between FM-eSTR repeat number and 923 luciferase expression. Top: schematic of the experimental design (not to scale). A variable 924 number of AC repeats plus surrounding genomic context (hg19 chr3:53128201-53128577) were 925 introduced upstream of a minimal promoter driving reporter expression. Bottom: white bar shows 926 results from the unmodified plasmid (empty). Gray bars show expression results for constructs 927 with each number of repeats (0, 5, 10, and 12). Reporter expression results normalized to a 928 Renilla control. Error bars show +/- 1 s.d. Asterisks represent significant differences between 929 conditions (one-sided t-test p<0.01). 930 931 Figure 5: Summary of FM-eSTRs classes and potential regulatory mechanisms 932 (a) Distribution of FM-eSTR classes across genomic annotations. Each bar shows the 933 fraction of FM-eSTRs falling in each annotation consisting of homopolymer (gray), dinucleotide 934 (red), trinucleotide (orange), tetranucleotide (blue), pentanucleotide (green) or hexanucleotide 935 (purple) repeats. The total number of FM-eSTRs and the top five most common repeat units in 936 each category are shown on the right. Note, FM-eSTRs may be counted in more than one 937 category. 938 (b) Homopolymer A/T STRs are predicted to modulate nucleosome positioning. 939 Homopolymer repeats are depleted of nucleosomes (gray circles) and may modulate expression 940 changes in nearby genes through altering nucleosome positioning. 941 (c) GC-rich STRs form DNA and RNA secondary structures during transcription. Highly 942 stable secondary structures such as G4 quadruplexes may act by expelling nucleosomes (gray 943 circle) or stabilizing RNAPII (light green circle). These structures may form in DNA (black) or RNA 944 (purple) The stability of the structure can depend on the number of repeats. 945 (d) Dinucleotide STRs can alter transcription factor binding. Dinucleotides are prevalent in 946 putative enhancer regions. They may potentially alter transcription factor binding by forming 947 binding sites themselves (top), changing affinity of nearby binding sites (middle), or modulating 948 spacing between nearby binding sites (bottom). 949 For (b)-(d), text and arrows in the white boxes provide a summary of the predicted eSTR 950 mechanism depicted in each panel. bioRxiv preprint doi: https://doi.org/10.1101/495226; this version posted August 9, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

951 References 952 1 Visscher, P. M. et al. 10 Years of GWAS Discovery: Biology, Function, and Translation. 953 Am J Hum Genet 101, 5-22, doi:10.1016/j.ajhg.2017.06.005 (2017). 954 2 Edwards, S. L., Beesley, J., French, J. D. & Dunning, A. M. Beyond GWASs: illuminating 955 the dark road from association to function. Am J Hum Genet 93, 779-797, 956 doi:10.1016/j.ajhg.2013.10.012 (2013). 957 3 Lappalainen, T. et al. Transcriptome and genome sequencing uncovers functional 958 variation in humans. Nature 501, 506-511, doi:10.1038/nature12531 (2013). 959 4 Gamazon, E. R. et al. A gene-based association method for mapping traits using 960 reference transcriptome data. Nat Genet 47, 1091-1098, doi:10.1038/ng.3367 (2015). 961 5 Gusev, A. et al. Integrative approaches for large-scale transcriptome-wide association 962 studies. Nat Genet 48, 245-252, doi:10.1038/ng.3506 (2016). 963 6 Zhu, Z. et al. Integration of summary data from GWAS and eQTL studies predicts 964 complex trait gene targets. Nat Genet 48, 481-487, doi:10.1038/ng.3538 (2016). 965 7 Hormozdiari, F., Kostem, E., Kang, E. Y., Pasaniuc, B. & Eskin, E. Identifying causal 966 variants at loci with multiple signals of association. Genetics 198, 497-508, 967 doi:10.1534/genetics.114.167908 (2014). 968 8 Kichaev, G. et al. Integrating functional data to prioritize causal variants in statistical 969 fine-mapping studies. PLoS Genet 10, e1004722, doi:10.1371/journal.pgen.1004722 970 (2014). 971 9 Pickrell, J. K. Joint analysis of functional genomic data and genome-wide association 972 studies of 18 human traits. Am. J. Hum. Genet. 94, 559-573, 973 doi:10.1016/j.ajhg.2014.03.004 (2014). 974 10 Grünewald, T. G. P. et al. Chimeric EWSR1-FLI1 regulates the Ewing sarcoma 975 susceptibility gene EGR2 via a GGAA microsatellite. Nat. Genet. 47, 1073-1078, 976 doi:10.1038/ng.3363 (2015). 977 11 Boettger, L. M. et al. Recurring exon deletions in the HP (haptoglobin) gene contribute 978 to lower blood cholesterol levels. Nat Genet 48, 359-366, doi:10.1038/ng.3510 (2016). 979 12 Leffler, E. M. et al. Resistance to malaria through structural variation of red blood cell 980 invasion receptors. Science 356, doi:10.1126/science.aam6393 (2017). 981 13 Sekar, A. et al. Schizophrenia risk from complex variation of complement component 4. 982 Nature 530, 177-183, doi:10.1038/nature16549 (2016). 983 14 Sun, J. X. et al. A direct characterization of human mutation based on microsatellites. 984 Nat. Genet. 44, 1161-1165, doi:10.1038/ng.2398 (2012). 985 15 Lynch, M. Rate, molecular spectrum, and consequences of human mutation. Proc Natl 986 Acad Sci U S A 107, 961-968, doi:10.1073/pnas.0912629107 (2010). 987 16 Willems, T. et al. Population-Scale Sequencing Data Enable Precise Estimates of Y-STR 988 Mutation Rates. Am. J. Hum. Genet. 98, 919-933, doi:10.1016/j.ajhg.2016.04.001 (2016). 989 17 Mirkin, S. M. Expandable DNA repeats and human disease. Nature 447, 932-940, 990 doi:10.1038/nature05977 (2007). 991 18 Willems, T. et al. The landscape of human STR variation. Genome Res. 24, 1894-1904, 992 doi:10.1101/gr.177774.114 (2014). bioRxiv preprint doi: https://doi.org/10.1101/495226; this version posted August 9, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

993 19 Li, H. Towards Better Understanding of Artifacts in Variant Calling from High-Coverage 994 Samples. arXiv [q-bio.GN] (2014). 995 20 Gymrek, M. et al. Abundant contribution of short tandem repeats to gene expression 996 variation in humans. Nat Genet 48, 22-29, doi:10.1038/ng.3461 (2016). 997 21 Nasrallah, M. P. et al. Differential effects of a polyalanine tract expansion in Arx on 998 neural development and gene expression. Hum Mol Genet 21, 1090-1098, 999 doi:10.1093/hmg/ddr538 (2012). 1000 22 Quilez, J. et al. Polymorphic tandem repeats within gene promoters act as modifiers of 1001 gene expression and DNA methylation in humans. Nucleic Acids Res. 44, 3750-3762, 1002 doi:10.1093/nar/gkw219 (2016). 1003 23 Vinces, M. D., Legendre, M., Caldara, M., Hagihara, M. & Verstrepen, K. J. Unstable 1004 tandem repeats in promoters confer transcriptional evolvability. Science 324, 1213- 1005 1216, doi:10.1126/science.1170097 (2009). 1006 24 Gemayel, R., Vinces, M. D., Legendre, M. & Verstrepen, K. J. Variable tandem repeats 1007 accelerate evolution of coding and regulatory sequences. Annu Rev Genet 44, 445-477, 1008 doi:10.1146/annurev-genet-072610-155046 (2010). 1009 25 Liu, X. S. et al. Rescue of Fragile X Syndrome Neurons by DNA Methylation Editing of the 1010 FMR1 Gene. Cell 172, 979-992 e976, doi:10.1016/j.cell.2018.01.012 (2018). 1011 26 Raveh-Sadka, T. et al. Manipulating nucleosome disfavoring sequences allows fine-tune 1012 regulation of gene expression in yeast. Nat Genet 44, 743-750, doi:10.1038/ng.2305 1013 (2012). 1014 27 Suter, B., Schnappauf, G. & Thoma, F. Poly(dA.dT) sequences exist as rigid DNA 1015 structures in nucleosome-free yeast promoters in vivo. Nucleic Acids Res 28, 4083-4089, 1016 doi:10.1093/nar/28.21.4083 (2000). 1017 28 Afek, A., Schipper, J. L., Horton, J., Gordan, R. & Lukatsky, D. B. Protein-DNA binding in 1018 the absence of specific base-pair recognition. Proc Natl Acad Sci U S A 111, 17140- 1019 17145, doi:10.1073/pnas.1410569111 (2014). 1020 29 Conlon, E. G. et al. The C9ORF72 GGGGCC expansion forms RNA G-quadruplex inclusions 1021 and sequesters hnRNP H to disrupt splicing in ALS brains. Elife 5, 1022 doi:10.7554/eLife.17820 (2016). 1023 30 Lin, Y., Dent, S. Y., Wilson, J. H., Wells, R. D. & Napierala, M. R loops stimulate genetic 1024 instability of CTG.CAG repeats. Proc Natl Acad Sci U S A 107, 692-697, 1025 doi:10.1073/pnas.0909740107 (2010). 1026 31 Rothenburg, S., Koch-Nolte, F., Rich, A. & Haag, F. A polymorphic dinucleotide repeat in 1027 the rat nucleolin gene forms Z-DNA and inhibits promoter activity. Proc. Natl. Acad. Sci. 1028 U. S. A. 98, 8985-8990, doi:10.1073/pnas.121176998 (2001). 1029 32 Min, J. L. et al. The use of genome-wide eQTL associations in lymphoblastoid cell lines to 1030 identify novel genetic pathways involved in complex traits. PLoS One 6, e22070, 1031 doi:10.1371/journal.pone.0022070 (2011). 1032 33 Consortium, G. et al. Genetic effects on gene expression across human tissues. Nature 1033 550, 204-213, doi:10.1038/nature24277 (2017). 1034 34 Willems, T. et al. Genome-wide profiling of heritable and de novo STR variations. Nat. 1035 Methods, doi:10.1038/nmeth.4267 (2017). bioRxiv preprint doi: https://doi.org/10.1101/495226; this version posted August 9, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

1036 35 Borel, C. et al. Tandem repeat sequence variation as causative cis-eQTLs for protein- 1037 coding gene expression variation: the case of CSTB. Hum. Mutat. 33, 1302-1309, 1038 doi:10.1002/humu.22115 (2012). 1039 36 Contente, A., Dittmer, A., Koch, M. C., Roth, J. & Dobbelstein, M. A polymorphic 1040 microsatellite that mediates induction of PIG3 by p53. Nat. Genet. 30, 315-320, 1041 doi:10.1038/ng836 (2002). 1042 37 Gebhardt, F., Zänker, K. S. & Brandt, B. Modulation of epidermal growth factor receptor 1043 gene transcription by a polymorphic dinucleotide repeat in intron 1. J. Biol. Chem. 274, 1044 13176-13180 (1999). 1045 38 Johnson, A. D. et al. Genome-wide association meta-analysis for total serum bilirubin 1046 levels. Hum Mol Genet 18, 2700-2710, doi:10.1093/hmg/ddp202 (2009). 1047 39 Matsuzono, K. et al. Antisense Oligonucleotides Reduce RNA Foci in Spinocerebellar 1048 Ataxia 36 Patient iPSCs. Mol Ther Nucleic Acids 8, 211-219, 1049 doi:10.1016/j.omtn.2017.06.017 (2017). 1050 40 Saha, A. et al. Functional IFNG polymorphism in intron 1 in association with an increased 1051 risk to promote sporadic breast cancer. Immunogenetics 57, 165-171, 1052 doi:10.1007/s00251-005-0783-5 (2005). 1053 41 Shimajiri, S. et al. Shortened microsatellite d(CA)21 sequence down-regulates promoter 1054 activity of matrix metalloproteinase 9 gene. FEBS Lett. 455, 70-74 (1999). 1055 42 Vikman, S. et al. Functional analysis of 5-lipoxygenase promoter repeat variants. Hum 1056 Mol Genet 18, 4521-4529, doi:10.1093/hmg/ddp414 (2009). 1057 43 Urbut, S. M., Wang, G., Carbonetto, P. & Stephens, M. Flexible statistical methods for 1058 estimating and testing effects in genomic studies with multiple conditions. Nat Genet 1059 51, 187-195, doi:10.1038/s41588-018-0268-8 (2019). 1060 44 Jiang, C. & Pugh, B. F. Nucleosome positioning and gene regulation: advances through 1061 genomics. Nat Rev Genet 10, 161-172, doi:10.1038/nrg2522 (2009). 1062 45 Bochman, M. L., Paeschke, K. & Zakian, V. A. DNA secondary structures: stability and 1063 function of G-quadruplex structures. Nat Rev Genet 13, 770-780, doi:10.1038/nrg3296 1064 (2012). 1065 46 Sawaya, S. et al. Microsatellite tandem repeats are abundant in human promoters and 1066 are associated with regulatory elements. PLoS One 8, e54710, 1067 doi:10.1371/journal.pone.0054710 (2013). 1068 47 Todd, A. K., Johnston, M. & Neidle, S. Highly prevalent putative quadruplex sequence 1069 motifs in human DNA. Nucleic Acids Res 33, 2901-2907, doi:10.1093/nar/gki553 (2005). 1070 48 Zuker, M. Mfold web server for nucleic acid folding and hybridization prediction. Nucleic 1071 Acids Res 31, 3406-3415, doi:10.1093/nar/gkg595 (2003). 1072 49 Hansel-Hertsch, R. et al. G-quadruplex structures mark human regulatory chromatin. 1073 Nat Genet 48, 1267-1272, doi:10.1038/ng.3662 (2016). 1074 50 MacArthur, J. et al. The new NHGRI-EBI Catalog of published genome-wide association 1075 studies (GWAS Catalog). Nucleic Acids Res 45, D896-D901, doi:10.1093/nar/gkw1133 1076 (2017). 1077 51 Yengo, L. et al. Meta-analysis of genome-wide association studies for height and body 1078 mass index in ~700,000 individuals of European ancestry. bioRxiv, doi:10.1101/274654 1079 (2018). bioRxiv preprint doi: https://doi.org/10.1101/495226; this version posted August 9, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

1080 52 Schizophrenia Working Group of the Psychiatric Genomics, C. Biological insights from 1081 108 schizophrenia-associated genetic loci. Nature 511, 421-427, 1082 doi:10.1038/nature13595 (2014). 1083 53 Huang, H. et al. Fine-mapping inflammatory bowel disease loci to single-variant 1084 resolution. Nature 547, 173-178, doi:10.1038/nature22969 (2017). 1085 54 Savage, J. E. et al. Genome-wide association meta-analysis in 269,867 individuals 1086 identifies new genetic and functional links to intelligence. Nat Genet 50, 912-919, 1087 doi:10.1038/s41588-018-0152-6 (2018). 1088 55 Guo, H. et al. Integration of disease association and eQTL data using a Bayesian 1089 colocalisation approach highlights six candidate causal genes in immune-mediated 1090 diseases. Hum Mol Genet 24, 3305-3313, doi:10.1093/hmg/ddv077 (2015). 1091 56 Haeuptle, M. A. et al. Human RFT1 deficiency leads to a disorder of N-linked 1092 glycosylation. Am J Hum Genet 82, 600-606, doi:10.1016/j.ajhg.2007.12.021 (2008). 1093 57 Saini, S., Mitra, I., Mousavi, N., Fotsing, S. F. & Gymrek, M. A reference haplotype panel 1094 for genome-wide imputation of short tandem repeats. Nat. Commun. 9, 4397, 1095 doi:10.1038/s41467-018-06694-0 (2018). 1096 58 Chiang, C. et al. The impact of structural variation on human gene expression. Nat Genet 1097 49, 692-699, doi:10.1038/ng.3834 (2017). 1098 59 Hasler, J. & Strub, K. Alu elements as regulators of gene expression. Nucleic Acids Res 34, 1099 5491-5497, doi:10.1093/nar/gkl706 (2006). 1100 60 Kent, W. J. et al. The human genome browser at UCSC. Genome Res. 12, 996-1006, 1101 doi:10.1101/gr.229102. Article published online before print in May 2002 (2002). 1102 61 Genomes Project, C. et al. A global reference for human genetic variation. Nature 526, 1103 68-74, doi:10.1038/nature15393 (2015). 1104 62 Purcell, S. et al. PLINK: a tool set for whole-genome association and population-based 1105 linkage analyses. Am J Hum Genet 81, 559-575, doi:10.1086/519795 (2007). 1106 63 Patterson, N., Price, A. L. & Reich, D. Population structure and eigenanalysis. PLoS Genet 1107 2, e190, doi:10.1371/journal.pgen.0020190 (2006). 1108 64 Price, A. L. et al. Principal components analysis corrects for stratification in genome- 1109 wide association studies. Nat Genet 38, 904-909, doi:10.1038/ng1847 (2006). 1110 65 Stegle, O., Parts, L., Piipari, M., Winn, J. & Durbin, R. Using probabilistic estimation of 1111 expression residuals (PEER) to obtain increased power and interpretability of gene 1112 expression analyses. Nat Protoc 7, 500-507, doi:10.1038/nprot.2011.457 (2012). 1113 66 Seabold, S. P., Josef. Statsmodels: Econometric and statistical modeling with python. 1114 Proceedings of the 9th Python in Science Conference (2010). 1115 67 Quinlan, A. R. & Hall, I. M. BEDTools: a flexible suite of utilities for comparing genomic 1116 features. Bioinformatics 26, 841-842, doi:10.1093/bioinformatics/btq033 (2010). 1117 68 Heinz, S. et al. Simple combinations of lineage-determining transcription factors prime 1118 cis-regulatory elements required for macrophage and B cell identities. Mol Cell 38, 576- 1119 589, doi:10.1016/j.molcel.2010.05.004 (2010). 1120 69 Wang, Y. et al. The 3D Genome Browser: a web-based browser for visualizing 3D 1121 genome organization and long-range chromatin interactions. Genome Biol 19, 151, 1122 doi:10.1186/s13059-018-1519-9 (2018). bioRxiv preprint doi: https://doi.org/10.1101/495226; this version posted August 9, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

1123 70 Mifsud, B. et al. Mapping long-range promoter contacts in human cells with high- 1124 resolution capture Hi-C. Nat Genet 47, 598-606, doi:10.1038/ng.3286 (2015). 1125 71 Howie, B. N., Donnelly, P. & Marchini, J. A flexible and accurate genotype imputation 1126 method for the next generation of genome-wide association studies. PLoS Genet. 5, 1127 e1000529, doi:10.1371/journal.pgen.1000529 (2009). 1128 72 Browning, B. L., Zhou, Y. & Browning, S. R. A One-Penny Imputed Genome from Next- 1129 Generation Reference Panels. Am J Hum Genet 103, 338-348, 1130 doi:10.1016/j.ajhg.2018.07.015 (2018). 1131 73 Lalioti, M. D. et al. Dodecamer repeat expansion in cystatin B gene in progressive 1132 myoclonus epilepsy. Nature 386, 847-851, doi:10.1038/386847a0 (1997). 1133 74 Kobayashi, H. et al. Expansion of intronic GGCCTG hexanucleotide repeat in NOP56 1134 causes SCA36, a type of spinocerebellar ataxia accompanied by motor neuron 1135 involvement. Am J Hum Genet 89, 121-130, doi:10.1016/j.ajhg.2011.05.015 (2011). 1136 75 Mougey, E. et al. ALOX5 polymorphism associates with increased leukotriene production 1137 and reduced lung function and asthma control in children with poorly controlled 1138 asthma. Clin Exp Allergy 43, 512-520, doi:10.1111/cea.12076 (2013). 1139 76 Stephensen, C. B. et al. ALOX5 gene variants affect eicosanoid production and response 1140 to fish oil supplementation. J Lipid Res 52, 991-1003, doi:10.1194/jlr.P012864 (2011). 1141 bioRxiv preprint doi: https://doi.org/10.1101/495226; this version posted August 9, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a licenseeSTR to identificationdisplay the preprint in perpetuity. It is madeFine-mapping available under a aCC-BY-NC 4.0 International license.

tissues1-17 Pairwise LD Z-scores STR Z Genotype-Tissue Expression (GTEx) Y~β X+e STR STR tissue SNP 1 SNP Z WGS 1 SNP1 Tissue-specific RNA-seq SNP2 652 individuals Z 17 tissues (e.g., Whole Blood, SNP3 SNP2 SNP2 1 3 30-fold coverage Skin, Nerve, Artery, Lung) 2 SNP3 ZSNP3 STR SNP SNP SNP Gene Expression (Y)

5xAC 6xAC 7xAC 8xAC 9xAC HipSTR Avg. STR length (X) [β1,β2,β3...β17]

STR SNP mashR CAVIAR (AC)n Gene

[β’1,β’2,β’3...β’17] eSTR causality score b d WholeBlood Brain-Caudate Brain-Cerebellum 70.0 Adipose S. Ventricle Cells-Transformedfibroblasts Adipose V. Lung Esophagus-Mucosa Artery A. Muscle Skin-NotSunExposed 60.0 Artery T. Nerve Caudate Skin Unexposed Skin-SunExposed Skin Leg 50.0 Cerebellum Thyroid

[-log10(p)] Thyroid Fibroblast Lung l E. Mucosa Blood Nerve-Tibial 40.0 E Muscularis Permuted

p-va Artery-Aorta

d 30.0 Artery-Tibial Esophagus-Muscularis 20.0 Adipose-Subcutaneous Adipose-Visceral Observe 10.0 Muscle-Skeletal Heart-LeftVentricle d ris rta

0.0 a o oo osed osed Lung ellum eletal p p 1.00

0.0 1.0 2.0 3.0 4.0 5.0 6.0 k b r entricle Thyroid roblasts

V 0.95 b

Expected p-val [-log10(p)] cutaneous 0.90 b Artery-A Artery-Tibial Nerve-Tibial WholeBl ose-Visceral

p 0.85 Brain-Caudate rt-Left rmedfi a Muscle-S

o 0.80 Adi Brain-Cere Skin-SunEx Spearman He ose-Su 0.75 Esophagus-Mucosa p Skin-NotSunEx ransf T Esophagus-Muscul Adi c Cells- CSTB NOP56 ALOX5 (CGGGGG) (CGCGGGGCGGGG) (CCTGGG) n n n 2.0 n n n 1.0 0.5 1.5 1.0

ressio 0.5 ressio ressio p p p eletal eletal 0.0 0.5 k k

Ex 0.0 Ex Ex 0.0 d d -0.5 d -0.5 -0.5 -1.0 -1.0 Muscle-S Muscle-S

rmalize -1.0 rmalize rmalize -1.5 o o o -1.5 Esophagus-Mucosa N N N -1.5 -2.0 2.0 2.5 3.0 6.5 7.0 7.5 8.0 8.5 9.0 9.5 10.0 10.5 11.0 11.5 4.5 5.0 5.5 6.0 6.5 7.0 Mean num. rpts. Mean num. rpts. Mean num. rpts. bioRxiv preprint doi: https://doi.org/10.1101/495226; this version posted August 9, 2019. The copyright holder for this preprint (which was not a certified by peer review) is the author/funder, who has granted bioRxiv a blicense to display the preprint in perpetuity. It is made available under 175 Upstream Downstream 6 Homopolymers aCC-BY-NC 4.0 International license. Dinucleotides 150 Dinucleotides Trinucleotides Trinucleotides 5 125 Tetranucleotides Hexanucleotides Pentanucleotides 4 100 Hexanucleotides ALL STRs 3 75 Pentanucleotides 50 2

25 1 Tetranucleotides Relative abundance Relative abundance 0 -5 -3 c ×10 d ×10 7 2.25

6 2.00

5 1.75

4 1.50

3 1.25

2 1.00

1 0.75 Relative p(FM-eSTR) Relative p(FM-eSTR) 0 0.50 10000 −7500 −5000 −2500 0 2500 5000 7500 10000 -4000 -2000 0 2000 4000 Distance to TSS (bp) Distance to DNAseI cluster center (bp) e ** 5.0 * * 4.0 *

3.0 ** OR

2 2.0 *

log * 1.0

0.0 * Nominal p<0.05 -1.0 ** Adjusted p<0.05 *

CCCCGGCCCCCGA CCCCGCCGAGCC AGGAAAGAAATAGATACATAATGATCCAC AAAACA AAACA AAGGAG AACATCAT AAAATAAAAACAAT GGGC GGG f 160 T-rich 140 A-rich ** Nominal p<0.05 120 * ** Adjusted p<0.05 100 80 60 40 20 Number of FM-eSTRs 0 1.0 STR length * * Expression 0.8

0.6

0.4 with β>0

0.2 STR length

Fraction of FM-eSTRs Expression 0.0 AAAC A T AC GT GTT GTTT A T AC GT A T AC GT A T AC GT AAC GTT GTTT Promoter **Promoter Intergenic Introns upstream downstream bioRxiv preprint doi: https://doi.org/10.1101/495226; this version posted August 9, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a aCC-BY-NC 4.0 International license. 35 STRs; n=2,233 30 FM-eSTRs; n=228 g ** CCG; n=52 1.0 All 25 CCG FM-eSTRs; n=13 Promoter (TSS +/- 3kb) G4; n=95 20 Nom. p<0.05 G4 FM-eSTRs; n=20 * 0.8 15 ** Adj. p<0.05 10 (GM12878) * 5

RNAPII Coverage 0.6 0 b 1.50 0.4 1.25

1.00 0.2 0.75

0.50 Fraction of eSTRs with β>0

(GM12878) 0.25 0.0

Nucleosome occ. CCG G4 0.00 FM-eSTRs FM-eSTRs FM-eSTRs − 0.25 (n) 1,420 137 14 13 53 15 −4000 −2000 0 2000 4000 Distance to STR (bp)

DNA (+ strand) RNA (+ strand) DNA (+ strand) RNA (+ strand) c ** d ** e ** f ** ** ** ** ** ** ** ** ** * * * * 1.00 0 0.75 −10 0.50

−20 0.25 0.00 −30 −0.25 −40 All STRs FM-eSTRs −0.50 * Nom. p<0.05 −50 ** Adj. p<0.05 −0.75 Free energy (kcal/mol)

Pearson r(length, energy) −1.00 −60 ALL CCG G4 ALL CCG G4 ALL CCG G4 ALL CCG G4 (n) 123,836 1,389 191 14 1,480 52

h CSTB i ALOX5 j DPYSL4 (CGCGGGGCGGGG) (CGGGGG) (CCCGG) n n n

0.60 Skeletal muscle Esophagus mucosa 2.00 Transformed fibroblasts 0.50 0.40 1.50 0.20 0.00 1.00 0.00 -0.50 0.50

-0.20 ression ression ression 0.00 p p p -0.40 -1.00 -0.50 Ex Ex Ex -0.60 -1.50 -1.00 Mean expr. (+/- 1 s.d.) -0.80 -2.00

-32.0 -22.0 -40.0 -34.0 -22.5 -45.0 -23.0 -36.0 -23.5 -50.0 -38.0 -24.0 -55.0 -40.0 -24.5 -60.0

-12.0 -6.0 0.0 -12.0 -9.0 -6.0 -3.0 0.0 3.0 -33.0 -28.0 -23.0 -18.0 -13.0 -8.0 -3.0 2.0 Free energy (kcal/mol) STR length (bp rel. to hg19) STR length (bp rel. to hg19) STR length (bp rel. to hg19) bioRxiv preprint doi: https://doi.org/10.1101/495226; this version posted August 9, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

a STR length Gene expression Trait (1) eQTL analysis

AA (1) eQTL (3) coloc AT TT A analysis Gene analysis

AC AC AC Gene Expr. A C Gene Expr. SNP genotype STR Rpt. Num.

G A C AC AC AC AC Gene (2) GWAS AA AT trait P 10

Trait TT Phenotype G T C AC AC AC AC AC Gene -log

( e.g. , disease risk) SNP genotype SNP position (3) coloc analysis G T T AC AC AC AC AC AC Gene trait P

SNPS in LD 10 -log expr P

(2) GWAS 10 -log SNP position b c Genes SFMBT1 RFT1 PRKCD TKT (Gencode) 1.5 30 RFT1 (Esophagus- chr3:53128363 (ACn) 1.0 Muscularis) 20

0.5 (P) 10 10

0.0 -log

Expression 0

100 -0.5 Height (GIANT Consortium) rs2581830 75

-1.0 (P)

Normalized Expression Esophagus-Muscularis 50 10

-1.5 25 -log

Height GWAS Height 0 12.5 13.0 13.5 14.0 14.5 15.0 15.5 16.0 chr3 5.305 5.310 5.315 5.320 5.325 ×107 Mean num. rpts.

d Scale 10 kb hg19 chr3: 53,130,000 53,135,000 53,140,000 53,145,000 53,150,000 53,155,000 53,160,000 (AC) n RFT1

Txn Factor ChIP

MYBL2, HDAC2, ARID3A, FOXA2, EP300, SP1, FOXA1

e g (AC)0-12 2 P Luciferase 10 1 FM-eSTR construct Minimal -log p<0.01 promoter (eMERGE) 0 * 5.305 5.310 5.315 5.320 5.325 * × 10 7

7.0 f * 6.0 0.15 * 5.0 0.10

(firefly/renilla) 4.0 0.05 3.0 0.00 Mean height (normalized) 2.0 -0.05 1.0

12.5 13.0 13.5 14.0 14.5 15.0 15.5 Relative expr. Mean repeat number 0.0

empty (AC)0 (AC)5 (AC)10 (AC)12 bioRxiv preprint doi: https://doi.org/10.1101/495226; this version posted August 9, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

a Fraction of category 0.0 0.2 0.4 0.6 0.8 1.0

Promoter (TSS +/-3kb) n=234 (A:61, AC:59, AAAT:12, AAAC:11, C:8) Promoter (TSS +/-1kb) n=85 (AC:21, A:15, C:5, AT:5, CCG:4) 5’ UTR n=34 (CCG:11, AC:8, AGCGGC:1, CCCGG:1, CCCCGG:1) Coding n=11 (AGC:3, CCG:2, AAGGCC:1, ACCCCC:1, AGCGCC:1) 3’ UTR n=55 (AC:20, A:11, AAAC:4, AAAT:3, AT:2) Intron n=819 (A:284, AC:232, AAAT:67, AAAC:44, AT:23) Intergenic n=422 (A:170, AC:120, AAAT:23, AG:13, AT:12) DNAseI HS (+/- 500bp) n=228 (AC:73, A:50, CCG:13, AG:9, AAAT:8)

All FM-eSTRs n=1,420 (A:492, AC:399, AAAT:95, AAAC:62, AT:38)

Homopolymers Dinucleotides Trinucleotides

Tetranucleotides Pentanucleotides Hexanucleotides

b Homopolymers (e.g., A/T) c GC-rich (e.g., CCCG/CGGGG) d Dinucleotides (e.g., AC/GT)

TF TF ACACACAC

(CCCG)n 5’ AAAAAAA 3’ TF Nuc. RNAPII 3’ TTTTTTT 5’ ACACACAC TF motif 5’ 5’ Nuc. 3’ 3’ 3’ # AC/TG TF binding

Repeats affinity 5’ TTTTTTT 3’ (CCG)n Nuc. 3’ AAAAAAA 5’ 5’ TF1 TF2 TF1 motif ACACACAC TF2 motif

# A/T Directional nucleosome # GC-rich ssDNA/ssRNA # AC/TG Spacing between TF Repeats displacement Repeats stability Repeats binding motifs bioRxiv preprint doi: https://doi.org/10.1101/495226; this version posted August 9, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

Supplementary Figure 1: Analysis of GTEx population structure

Principal component analysis was performed using SNP genotypes from the GTEx and 1000 Genomes cohorts. Samples from the 1000 Genomes project are shown in gray and GTEx samples are shown as colored dots based on ethnicity provided for each sample (yellow=African American; red=Amerindian; blue=Asian; green=European, black=Unknown). Related to Fig. 1.

1 bioRxiv preprint doi: https://doi.org/10.1101/495226; this version posted August 9, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

Supplementary Figure 2: Effect of varying numbers of PEER factors on power to detect eSTRs

eSTRs (gene-level FDR 10%) were computed for each tissue after adjusting for a number of PEER factors ranging from 0 to 50. The x-axis shows the number of PEER factors adjusted for. The y-axis shows the number of significant eSTRs. Dashed vertical lines show the number of PEER factors equal to N/10, where N is the number of samples analyzed for each tissue. Purple=whole blood, yellow=Brain-Cerebellum, gold=Nerve-Tibial. Related to Fig. 1.

2 bioRxiv preprint doi: https://doi.org/10.1101/495226; this version posted August 9, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

Supplementary Figure 3: Correlation of sample metadata with PEER factors

Each cell in the matrix shows the Spearman correlation of each PEER factor with data processing covariates. The x-axis represents each variable as defined for the GTEx cohort in dbGaP study phs000424.v7.p2. For example, covariates most strongly associated with PEER factors included DTHHRDY (Hardy scale for death classification) and TRISCHD (ischemic time). The y-axis represents factors obtained from PEER analysis of gene expression from Adipose-subcutaneous tissue. Similar correlations were observed for other tissues. Related to Fig. 1.

3 bioRxiv preprint doi: https://doi.org/10.1101/495226; this version posted August 9, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

Supplementary Figure 4: Relationship between sample size and number of eSTRs detected

The x-axis shows the number of samples per tissue. The y-axis shows the number of eSTRs (gene-level FDR<10%) detected in each tissue. Each dot represents a single tissue, using the same colors as shown in Fig. 1 in the main text (box on the right). Related to Fig. 1. ​ ​

4 bioRxiv preprint doi: https://doi.org/10.1101/495226; this version posted August 9, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

Supplementary Figure 5: Enrichment of genomic annotations as a function of CAVIAR threshold

The x-axis represents CAVIAR threshold in terms of the percentile across all eSTRs. The y-axis represents the odds ratio for enrichment in eSTRs above each percentile threshold in each of these categories: a. 5’UTRs (purple); b. 3’UTRs (blue); c. promoters (orange; within 3kb of a transcription ​ ​ ​ ​ ​ start site); d. Coding regions (red) and e. Intergenic regions (green). Related to Fig. 1. ​ ​ ​ ​

5 bioRxiv preprint doi: https://doi.org/10.1101/495226; this version posted August 9, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

Supplementary Figure 6: Difference in CAVIAR score between the top eSTR and top eSNP for each gene

For each tissue, the boxplot shows the distribution of differences between the CAVIAR posterior score for the best STR vs. the best SNP for each gene. Data is only shown for genes with FM-eSTRs . The colors of each box correspond to the different tissues (see legend on the right and the same as in Fig. 1b). Boxplots as in Fig. 1c. Outlier points are not shown. Related to Fig. 1. ​ ​ ​ ​

6 bioRxiv preprint doi: https://doi.org/10.1101/495226; this version posted August 9, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

Supplementary Figure 7: Example multi-allelic FM-eSTRs

For each plot, the x-axis represents the mean number of repeats in each individual and the y-axis represents normalized expression in the tissue for which the eSTR was most significant. Boxplots summarize the distribution of expression values for each genotype. Boxplots as in Fig. 1c.The red ​ ​ line shows the mean expression for each x-axis value. Related to Fig. 1.

7 bioRxiv preprint doi: https://doi.org/10.1101/495226; this version posted August 9, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

Supplementary Figure 8: Pairwise sharing of effect sizes across tissues

For each discovery tissue (rows), all eSTRs with gene-level FDR<10% were tested for association in each other (replication) tissue (columns). The value in each cell gives the percent of eSTRs that were

replicated with p<0.05 (π1). Related to Fig. 1. ​ ​ ​

8 bioRxiv preprint doi: https://doi.org/10.1101/495226; this version posted August 9, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

Supplementary Figure 9: Sharing of eSTRs across tissues

The x-axis represents the number of tissues that share a given eSTR (absolute value of mashR Z-score >4). The y-axis represents the number of eSTRs shared across a given number of tissues. Related to Fig. 1.

9 bioRxiv preprint doi: https://doi.org/10.1101/495226; this version posted August 9, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

Supplementary Figure 10: Localization of all STRs around putative regulatory regions

Left and right plots show localization around transcription start sites and DNAseI HS clusters, respectively. The y-axis denotes the relative number of STRs of each type in each bin. For promoters, the x-axis is divided into 100bp bins. For DNAseI HS sites, the x-axis is divided into 50bp bins. In each plot, values were smoothed by taking a sliding average of each four consecutive bins. Only STR-gene pairs included in our analysis are considered. Each plot compares localization of the two possible sequences of a given repeat unit on the coding strand. i.e. top plots compare repeat units of ​ the form CnG vs. their reverse complement on the opposite strand, middle plots compare AC vs. GT ​ ​ repeats, and bottom plots compare A vs. T repeats. The strand of each STR was determined based on the coding strand of each target gene. Related to Fig. 2.

10 bioRxiv preprint doi: https://doi.org/10.1101/495226; this version posted August 9, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

Supplementary Figure 11: Nucleosome occupancy and DNAseI hypersensitivity show distinct patterns around eSTRs

a-c. Nucleosome density around STRs with different repeat unit lengths. Nucleosome density in ​ GM12878 in 5bp windows is averaged across all STRs analyzed (dashed) and FM-eSTRs (solid) relative to the center of the STR. b. DNAseI HS density around STRs with different repeat unit ​ lengths. The number of DNAseI HS reads in GM12878 (gray), fat (red), tibial nerve (yellow), and skin ​ (cyan) is averaged across all STRs in each category. Solid lines show FM-eSTRs. Dashed lines show all STRs. Left=homopolymers, middle=dinucleotides, right=tetranucleotides. Other repeat unit lengths were excluded since they have low numbers of FM-eSTRs (see Fig. 5a). Dashed vertical lines in (d) ​ ​ ​ show the STR position +/- 147bp. Related to Fig. 2.

12 bioRxiv preprint doi: https://doi.org/10.1101/495226; this version posted August 9, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

Supplementary Figure 12: Bias in the direction of eSTR effect sizes

The y-axis shows the percentage of FM-eSTRs in each category with positive effect sizes, meaning a positive correlation between STR length and expression. Colored bars represent different repeat unit lengths (black=all FM-eSTRs; gray=homopolymers; red=dinucleotides; gold=trinucleotides; blue=tetranucleotides; purple=pentanucleotides; green=hexanucleotides). Error bars show 95% confidence intervals. Asterisks denote categories that are nominally significant (binomial two-sided p<0.05) for having significantly more or less positive effect sizes than expected by chance (50%). No category was significant after accounting for multiple hypothesis testing. Related to Fig. 2.

13 bioRxiv preprint doi: https://doi.org/10.1101/495226; this version posted August 9, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

Supplementary Figure 13: Characterization of tissue-specific FM-eSTRs

FM-eSTRs were clustered by absolute Z-scores computed by mashR using K-means (Methods). The ​ ​ heatmap shows absolute values of Z-scores in each tissue, Z-normalized by row. (Number of genes in each cluster: Cluster 1=22, Cluster 2=497, Cluster 3=253; Cluster 4=122, Cluster 5=90, Cluster 6=126, Cluster 7=220, Cluster 8=336). Related to Fig. 2.

14 bioRxiv preprint doi: https://doi.org/10.1101/495226; this version posted August 9, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

Supplementary Figure 14: Characterization of tissue-specific eSTRs

Each panel shows the distribution of the absolute value of posterior effect sizes computed by mashR in each tissue for the set of FM-eSTRs in each cluster (see Supp Fig. 13 above). Horizontal lines ​ ​ show median values, boxes span from the 25th percentile (Q1) to the 75th percentile (Q3). Whiskers extend to Q1-1.5*IQR (bottom) and Q3+1.5*IQR (top), where IQR gives the interquartile range (Q3-Q1). The red line shows the mean expression for each x-axis value. Related to Fig. 2.

15 bioRxiv preprint doi: https://doi.org/10.1101/495226; this version posted August 9, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

Supplementary Figure 15: eSTR repeat unit enrichment

We evaluated repeat unit enrichment in multiple FM-eSTR groups: all FM-eSTRs combined across tissues (similar to Fig. 2e), FM-eSTRs identified per-tissue, and FM-eSTRs belonging to each cluster ​ ​ (see Supplementary Fig. 13). For each group of FM-eSTRs, the heatmap shows the log2 of the ​ ​ odds ratio computed using a Fisher’s Exact test (scipy.stats.fisher_exact). Columns are sorted from highest to lowest enrichment in all FM-eSTRs. Bold boxes indicate enrichments statistically significant (adjusted p<0.05, adjusted separately per row for the number of motifs tested). Related to Fig. 2.

16 bioRxiv preprint doi: https://doi.org/10.1101/495226; this version posted August 9, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

Supplementary Figure 16: Density of RNAPII localization around STRs ​

The y-axis shows the average number of ChIP-seq reads for RNA Polymerase II in 5bp bins centered at STRs within 5KB of TSSs. Black lines denote all STRs, blue lines denote CCG STRs, and red lines denote STRs matching the canonical G4 motif. Dashed lines represent all STRs of each class and solid lines represent FM-eSTRs. Plots show read counts in different cell types. From top to bottom: human embryonic stem cells, heart, lung, and tibial nerve. Related to Fig. 3.

17 bioRxiv preprint doi: https://doi.org/10.1101/495226; this version posted August 9, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

Supplementary Figure 17: GC-rich eSTRs modulate DNA or RNA secondary structure

(a-b) Free energy of STR regions. Boxplots denote the distribution of free energy for each STR +/- ​ ​ 50bp of context sequence, computed as the average across all alleles at each STR. (a) and (b) show ​ ​ ​ ​ results computed using the non-template strand for DNA and RNA respectively. (c-d) Pearson ​ correlation between STR length and free energy. Correlations were computed separately for each ​ STR, and plots show the distribution of correlation coefficients across all STRs. The dashed horizontal line denotes 0 correlation as expected by chance. (c) and (d) show results computed using ​ ​ ​ ​ the non-template strand for DNA and RNA respectively. Nominally significant (Mann Whitney one-sided p<0.05) differences between distributions are denoted with a single asterisk. Differences significant after controlling for multiple hypothesis correction are denoted with double asterisks. For each category (free energy and Pearson correlation), we used a Bonferroni correction to control for 20 total comparisons: comparing all vs. FM-eSTRs separately in each category, comparing CCG vs. all STRs, and comparing G4 vs. all STRs, in four conditions (DNA +/- and RNA +/-). Related to Fig. 3.

18 bioRxiv preprint doi: https://doi.org/10.1101/495226; this version posted August 9, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

Supplementary Figure 18: Example GWAS signals co-localized with FM-eSTRs

19 bioRxiv preprint doi: https://doi.org/10.1101/495226; this version posted August 9, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

Left: For each plot, the x-axis represents the mean number of repeats in each individual and the ​ ​ y-axis represents normalized expression in the tissue with the most significant eSTR signal at each ​ locus. Boxplots summarize the distribution of expression values for each genotype. Box plots as in Fig. 1c. The red line shows the mean expression for each x-axis value. Right: Top panels give genes ​ ​ in each region. The target gene for the eQTL associations is shown in black. Middle panels give the -log p-values of association of the effect-size between each SNP (black points) and the expression ​10 of the target gene. The FM-eSTR is denoted by a red star. Bottom panels give the -log p-values of ​10 association between each SNP and the trait based on published GWAS summary statistics. Dashed gray horizontal lines give the genome-wide significance threshold of 5E-8. Related to Fig. 4. ​

20 bioRxiv preprint doi: https://doi.org/10.1101/495226; this version posted August 9, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

Supplementary Figure 19: Example GWAS signal for schizophrenia potentially driven by an eSTR for MED19 ​

a. eSTR association for MED19. The x-axis shows STR genotypes at an AC repeat ​ ​ ​ (chr11:57523883) as the mean number of repeats in each individual and the y-axis shows normalized MED19 expression in subcutaneous adipose. Each point represents a single individual. Red lines ​ show the mean expression for each x-axis value. Boxplots as in Fig. 1c. b. Summary statistics for ​ ​ ​ ​ MED19 expression and schizophrenia. The top panel shows genes in the region around MED19. ​ ​ ​ ​ The middle panel shows the -log p-values of association between each variant and MED19 ​10 ​ expression in subcutaneous adipose tissue in the GTEx cohort. The FM-eSTR is denoted by a red star. The bottom panel shows the -log p-values of association for each variant with schizophrenia ​10 from Ripke, et al.. The dashed gray horizontal line shows genome-wide significance threshold of ​ ​ 5E-8. c. Detailed view of the MED19 locus. A UCSC genome browser screenshot is shown for the ​ ​ ​ ​ region in the gray box in (b). The FM-eSTR is shown in red [(AC)n]. The bottom track shows ​ ​ transcription factor (TF) and chromatin regulator binding sites profiled by ENCODE. The bottom panel shows long-range interactions reported by Mifsud, et al. using Capture Hi-C on GM12878. ​ Interactions shown in black include MED19. Interactions to loci outside of the window depicted are not ​ ​ shown. Related to Fig. 4.

See file eSTRGtex_SuppTables.xlsx for Supplementary Tables 1-8.

21