Identification and Analysis of Splicing Quantitative Trait Loci Across Multiple

ARTICLE https://doi.org/10.1038/s41467-020-20578-2 OPEN Identification and analysis of splicing quantitative trait loci across multiple tissues in the human genome ✉ ✉ Diego Garrido-Martín 1 , Beatrice Borsari 1, Miquel Calvo 2, Ferran Reverter 2 & Roderic Guigó 1,3 Alternative splicing (AS) is a fundamental step in eukaryotic mRNA biogenesis. Here, we develop an efficient and reproducible pipeline for the discovery of genetic variants that affect 1234567890():,; AS (splicing QTLs, sQTLs). We use it to analyze the GTEx dataset, generating a comprehensive catalog of sQTLs in the human genome. Downstream analysis of this catalog pro- vides insight into the mechanisms underlying splicing regulation. We report that a core set of sQTLs is shared across multiple tissues. sQTLs often target the global splicing pattern of genes, rather than individual splicing events. Many also affect the expression of the same or other genes, uncovering regulatory loci that act through different mechanisms. sQTLs tend to be located in post-transcriptionally spliced introns, which would function as hotspots for splicing regulation. While many variants affect splicing patterns by altering the sequence of splice sites, many more modify the binding sites of RNA-binding proteins. Genetic variants affecting splicing can have a stronger phenotypic impact than those affecting gene expression. 1 Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Dr. Aiguader 88, Barcelona 08003 Catalonia, Spain. 2 Section of Statistics, Faculty of Biology, Universitat de Barcelona (UB), Av. Diagonal 643, Barcelona 08028, Spain. 3 Universitat Pompeu Fabra (UPF), Barcelona, ✉ Catalonia, Spain. email: [email protected]; [email protected] NATURE COMMUNICATIONS | (2021) 12:727 | https://doi.org/10.1038/s41467-020-20578-2 | www.nature.com/naturecommunications 1 ARTICLE NATURE COMMUNICATIONS | https://doi.org/10.1038/s41467-020-20578-2 lternative splicing (AS) is the process through which sQTLseekeR20, which identifies genetic variants associated with multiple transcript isoforms are produced from a single changes in the relative abundances of the transcript isoforms of a A 1 gene . It is a key mechanism that increases functional given gene. sQTLseekeR uses the Hellinger distance to estimate complexity in higher eukaryotes2. Often, its alteration leads to the variability of isoform abundances across observations, and pathological conditions3. AS is subject to a tight regulation, Anderson’s method21,22, a non-parametric analog to multivariate usually tissue-, cell type-, or condition-specific, that involves a analysis of variance, to assess the significance of the associations wide range of cis and trans regulatory elements4,5. Since AS is (see Methods and Supplementary Note 1). Among other generally coupled with transcription, transcription factors and enhancements, sQTLseekeR2 improves the accuracy and speed of chromatin structure also play a role in its regulation6. the p-value calculation, and allows to account for additional In recent years, transcriptome profiling of large cohorts of covariates before testing for association with the genotype, while genotyped individuals by RNA-seq has allowed the identification maintaining the multivariate statistical test in sQTLseekeR. It also of genetic variants affecting AS, i.e. splicing quantitative trait loci implements a multiple testing correction scheme that empirically or sQTLs7–12. sQTL analyses in a variety of experimental settings characterizes, for each gene, the distribution of p-values expected have helped to gain insight into the mechanisms underlying under the null hypothesis of no association (see Methods and GWAS associations for a number of traits, such as adipose-related Supplementary Note 1). To ensure highly parallel, portable and traits13, Alzheimer’s disease10, schizophrenia9 or breast cancer14, reproducible sQTL mapping, we embedded sQTLseekeR2 in a among others. sQTLs might actually contribute to complex traits Nextflow23 (plus Docker, https://www.docker.com) computa- and diseases at a similar or even larger degree than variants tional workflow named sqtlseeker2-nf, available at https://github. affecting gene expression15. com/guigolab/sqtlseeker2-nf. The vast majority of methods commonly used for sQTL Here we extensively analyze the sQTLs identified by sqtlseeker2- mapping treat splicing as a univariate phenotype. They assess nf, using the expression and genotype data produced by the GTEx association between genetic variants and the abundance of indi- Consortium. For most of the analyses, we employed isoform vidual transcripts7,16, or the splicing of individual exons9,17 or quantifications obtained from the V7 release (dbGaP accession introns12,15. However, this approach ignores the strongly corre- phs000424.v7.p2), corresponding to 10,361 samples from 53 tissues lated structure of AS measurements (e.g. at constant gene of 620 deceased donors. 48 tissues with sample size ≥ 70 were expression level, higher levels of a splicing isoform correspond selected for sQTL analyses. Under the assumption that most necessarily to lower levels of other isoforms). In contrast, we variants with cis effects on alternative splicing are likely to be carried propose an approach that takes into account the intrinsically on the sequence of the primary transcript or its close vicinity, we multivariate nature of alternative splicing: variants are tested for tested variants in a cis window defined as the gene body plus 5 Kb association with a vector of AS phenotypes, such as the relative upstream and downstream the gene boundaries. In addition, to abundances of the transcript isoforms of a gene or the intron demonstrate that the statistical framework of sQTLseekeR2 is not excision ratios of an intron cluster obtained by LeafCutter18. restricted to the analysis of transcript abundances, but it can Based on this approach, we have developed a pipeline for leverage other splicing-related multivariate phenotypes, we have efficient and reproducible sQTL mapping. Here we employ it to also computed the sQTLs based on the intron excision ratios leverage the multi-tissue transcriptome data generated by the obtained by LeafCutter18 from the GTEx RNA-seq data (Supple- Genotype-Tissue Expression (GTEx) Consortium, producing a mentary Note 2). Finally, we also provide the sQTLs identified by comprehensive catalog of genetic variants affecting splicing in the sqtlseeker2-nf in GTEx V8, which we compare to the sQTLs human genome. Downstream analyses of this catalog uncover a produced by the GTEx Consortium, described in a recent number of relevant features regarding splicing regulation. Thus, publication12. We show that the two sets of sQTLs differ indeed consistent with the multivariate nature of splicing, we observe in the nature of the AS events captured. Our approach detects more that sQTLs tend to involve multiple splicing events. A substantial events affecting the gene termini and intron retention, while the fraction of sQTLs also affects gene expression, a reflection of the approach by the GTEx Consortium tends to detect more events intimate relationship between splicing and transcription. We find, involving internal exons. The two sQTL sets also differ regarding however, many cases in which the expression of a gene other than several biological features at variant and gene level. For instance, the sQTL target is affected by the same variant. In these cases, the our approach seems to have more power to identify sQTLs affecting pleiotropic effect of the regulatory locus is not mediated by the genes with lower expression levels and shorter introns, as well as interplay between the splicing and transcription processes, but it involving variants with lower minor allele frequency (MAF). In is exerted through different mechanisms, acting upon different contrast, the GTEx Consortium’sapproachhaslargerpowerto genes that otherwise may not appear to be directly interacting. identify sQTLs affecting genes with higher expression and longer We also find that sQTLs tend to be preferentially located in introns (Supplementary Note 3, Supplementary Data 7). introns that are post-transcriptionally spliced: these introns At a 0.05 false discovery rate (FDR), we found in GTEx V7 a would be consequently acting as hotspots for splicing regulation. total of 210,485 cis sQTLs affecting 6,963 genes (i.e. sGenes: 6,685 While many variants affect splicing patterns by directly altering protein coding genes and 278 long intergenic non-coding RNAs, the sequence of splice sites, many more modify the binding of lincRNAs). On average, per tissue, we identified 1,158 sGenes RNA-binding proteins (RBPs) to target sequences within the (Supplementary Table 1). 44% and 34% of all tested protein transcripts. We observe that sQTLs often impact GWAS traits coding genes and lincRNAs, respectively, were found to be and diseases more than variants affecting only gene expression, sGenes. In an analogous experimental setting, the GTEx confirming earlier reports which suggest that splicing mutations Consortium reported genetic variants affecting gene expression underlie many hereditary diseases15,19. For several conditions, (expression QTLs, eQTLs) for 95% and 71% of all tested protein GWAS associations are particularly strong for sQTLs altering coding genes and lincRNAs, respectively24. To illustrate the RBP binding sites. nature of the sQTLs identified with sqtlseeker2-nf, in Fig. 1 we show the example of the SNP rs2295682, an sQTL for the gene RBM23 shared

Identification and Analysis of Splicing Quantitative Trait Loci Across Multiple

Functional Annotation of Exon Skipping Event in Human Pora Kim1,,†, Mengyuan Yang1,†,Keyiya2, Weiling Zhao1 and Xiaobo Zhou1,3,4,

Triplet Repeat Length Bias and Variation in the Human Transcriptome

(P -Value<0.05, Fold Change≥1.4), 4 Vs. 0 Gy Irradiation

Human Induced Pluripotent Stem Cell–Derived Podocytes Mature Into Vascularized Glomeruli Upon Experimental Transplantation

The Structural Basis of Indisulam-Mediated Recruitment of RBM39 to The

Genome-Wide Association of Endophenotypes for Schizophrenia from the Consortium on the Genetics of Schizophrenia (COGS) Study

Sequence, Structure and Context Preferences of Human RNA

Genome-Wide Association Study and Pathway Analysis for Female Fertility Traits in Iranian Holstein Cattle

(12) United States Patent (10) Patent No.: US 8,841,436 B2 Gorodeski Et Al

W O 2019/067145 Al 04 April 2019 (04.04.2019) W IPO I PCT

The MHC to 2-Microglobulin Β Primordial Linkage Of

Homo Sapiens, Homo Neanderthalensis and the Denisova Specimen: New Insights on Their Evolutionary Histories Using Whole-Genome Comparisons