Mapping Pathogenic Regulatory Regions and Genes
Total Page:16
File Type:pdf, Size:1020Kb
Mapping pathogenic regulatory regions and genes Chris Cotsapas Yale/Broad Mapping pathogenic regulatory regions Chris Cotsapas Yale/Broad Mapping pathogenic regulatory regions Chris Cotsapas Yale/Broad cotsapaslab.info/positions http://biorxiv.org/content/early/2016/05/19/054361 Common risk variants localize to DHS Maurano et. al, Science 2012 Gusev et. al, AJHG 2014 Trynka et al, AJHG 2015 over loci shown for 9 AID). and adjusted forIdentifying the correlation specific structure regulatory in the gene elements expression driving data (see risk supplementary to the disease, and the materials forgenes more that details). they control: We called any DHS cluster harboring a CI SNP (⇢d > 0) a We convertedpathogenic P values DHS of correlationcluster. For to chi-squareeach lead test SNP, statistics we defined as an a intermediate 2 Mbp region step centered on it, required toand measure identified posterior all pathogenic of association DHS transmitted clusters and from genes a pathogenic overlapping DHS the cluster region to using intersect a gene. Asfunction shown in of Supplementary BEDTools [45]. Figure We reasoned 7 34, P values that if of a correlationDHS cluster follow controls a uniform expression of a gene, distribution;its therefore, presence/absence this conversion should is be valid. correlated We showed to the the expression proportion of the of associationtarget gene over matched posterior transmittedsamples. We from therefore, DHS cluster measuredd to P gene valueg ofby correlationβd,g, and computed between all it pairs using of the pathogenic DHS following formula.clusters and genes in the locus using paired expression and DHS data over 22 cell types from 2 2 χ /2 χd,g /2 Roadmap Epigenomeβ Projectd,g = e d,g (see/ supplementarye i materials), and asked whether expression g of any gene in the locus changed conditionalXi on presence of a pathogenic DHS cluster. 2 We converted P values of correlation to chi-square test statistics as an intermediate step where χd,gi is correlation chi-squared test statistics between DHS cluster d and gene gi. In the above formula,required genes to measure with no posteriorcorrelations of to association DHS cluster transmittedd get non-zero from low aβ pathogenicd,g values; DHS cluster however, into order a gene. to compute As shownβd,g inof theSupplementary correlated genes Figure with 4, higher P values accuracy, of correlationβd,g of non- follow a uniform computedcorrelated posterior genesdistribution; probability should be of therefore, associationset zero. this To of decideeach conversion SNP on in a set reasonable isS valid.as cut-oWe showed↵ threshold the forproportionβd,g,we of association considered posteriorβd,g of all pairstransmitted of pathogenic from2 DHS DHS2 cluster cluster andd to genes Gene ing allby loci,βd,g and, and sorted computed them it using the χs/2 χi /2 in a decreasingcomputedfollowing posterior order formula. (See probability supplementaryPPs = ofe association/ figuree of 8, each in SNP which in theset S blueas vertical line separates i S 2 χ2 /2 X2 χd,g/2 d,g 25% top βd,g values from the rest). As shown2 β in this= e figure,2 / aftere thisi cut-o↵ value, the βd,g PosteriorRegulatory Probability of Association fine-mapping modelχs/2 d,g χ /2 i 0.0 0.1 0.2 0.3 0.42 PPs = e / e wherecurveχi is is the flat. association This cut-o chi-square↵ value test is equivalent statistics of to SNP thei gene-DHScalculatedg cluster byi converting correlation as- P value of i S sociation P value from Immunochip data into chi-square test2 statistics. WeX sorted SNPs by 6 ● 0.25. We therefore assigned2 0 to βd,g if Pd,g > 0.25.X If a DHS cluster had zero βd,g values for Posterior(Probability( where χd,gi is correlation chi-squared test statistics between DHS cluster d and Gene gi. Posterior'Probability'of'their posterior probability2 values (i.e. PPs)andselectedthen top SNPs with the highest as- ●● ● all geneswhere inχ theis region, the association it was considered chi-square test to be statistics not correlated of SNP i tocalculated any gene by in converting the region as- and ●● ● Associa/on((PPAssocia%on'(PP )'s)( i ● sociations posteriorsThe such genes that with lowPP correlationss 0.99 and to a DHSPP clusters < 0.99. (equivalent The selected to low βd,g)introducenoise ● ● ● sociation P value froms Immunochip=1..n data into chi-squares=1..(n 1) test statistics. We sorted SNPs by ●●●●●● it was removed from downstream analysis. For each remaining DHS cluster d in the locus, ●● ●● ● ≥ − 4 Posterior Probability●● of● ●Association● ●● ● ●● ● ● n top SNPs formedto theour 99% calculations; credible interval therefore, set for their the risk corresponding locus of interest.β35d,g need to be zeroed out to avoid this ● 0" 0.1" 0.2" 0.3" 0.4" their posterior probability values (i.e. PPs)andselectedthen top SNPs with the highest as- (P) ● ● 0.0 0.1 0.2we0.3 0.4 identified all genes with Pd,g 0.25, and put them in set Gd. ● ● ● P P 10 ● ● ●● ●●● ●● ● ● e↵ect. However, we first needed to decide on a reasonable cut-o↵ threshold for βd,g. In log ●● ● sociation posteriors such that s=1..n PPs 0.99 and s=1..(n 1) PPs < 0.99. The selected − ●●● ●● ● We updated our formula for calculating βd,g as follows. 6 ● ≥ − ● ●●●● Estimating regulatory potential of disease loci: In order to identify pathogenic 2 ●●●●●● ●● n top SNPsSupplementary formed our 99% Figure credible 5, we interval plotted set sorted for theβ riskd,g values locus of (before interest. zero-ing out). The blue vertical ● ● ● ● ● ● ●● ● ●●● ● ●● ● ●● ● ● P 2 P ●● ●● ● ●● regulatory elements, we overlapped credible2 interval (CI) SNPs with our DHS clusters using ● ●● ●● ●●●● ●● ● χ /2 χ /2 ●● ●●●●● ● ● ● ● ● line separates 25% topd,gβd,g values fromd,g the rest. As shown in this figure, after this cut-o↵ ●●●● ● ● i ● ● ● ●●●●●●●●●●●●● ● e / e if g Gd ● ●●●●●●●●●●●● ● ●●●●●●●●●● ● ● the● intersect● function of BedTools software suite [49].gi WeGd computed posterior probability of ●● ● ●●●●● ●●●●●●●●●●●●● ● Summary%P%values%from%gene>cs% 4 ● ●● ●●●●●●●● ●●● ●●●●●●● ● Estimating regulatoryβ = potential of2 disease loci: In order to identify pathogenic ● ● ● ●●●●●●●●●●●●●●●●●●●●●● ● ●●● ●● d,g 2 0 ●●●●●●● ● ●●●●● ●● ● ● ●●●●● ●● ● ● ●● ● value, the βd,g curve remains almost flat with small values. This cut-o↵ value is equivalent (P) ● ● ● ● ● association attributableassocia>on%data% to each DHS cluster0otherwised as the sum of per-SNP posteriors over all CI 10 ● ● ( ●● ●●● regulatory elements, we overlapped credible interval (CI) SNPs with our DHS clusters using ●● ● ● P log 90.50 90.75 ●● 91.00● 91.25 91.50 to the gene-DHS cluster correlation P value of 0.25. Given this, we assigned 0 to βd,g if − ●●● ●● SNPs overlapping the DHS. That is, Posi%on'on'Chromosome'(Position● on ChromosomeMbp (Mbp))' ● ●●●● the intersect function of BedTools software suite [49]. We computed posterior probability of 2 ●●●●● ●● ● ● ● ● We computed the total posterior of association transmitted from DHS cluster d to gene ●● ●● Pd,g > 0.25. If a DHS cluster had zero βd,g values for all genes in the region, it was considered ● ● ● ● ● ● ●●● ● ● association attributable to each DHS cluster d as the sum of per-SNP posteriors over all CI ●● ● ●●●● ● ● ● ●●●●● ● ● SNPs% ● ●●●●●● ●● ● ● g (shownPer3SNP%posterior%probability% as γd,g) by multiplying⇢ = itsPP posteriorO (s) of association (⇢d)andtheproportionof ● ● ●●●●●●●●●●● ● to be not correlatedd to anys gened in the region and it was removed from downstream analysis. ● ●●●●●●●●●●●● ● ●●● ● ● ● ● ●● ● ●●●●●● ●● ● ●●●●●● ● SNPs overlapping the DHS. That is, ● ●● ●● ●●●●●● ● ●●●●●● ● ⇥ ● ● ● ● ●●●●●● ●●● ●●●●●●● ● ●●● ● ●●●●●● ● ●●●●●●●●●● ● ●●●●●●●● ● ● ● ●transmitted association posteriors CI (β ). That is, 0 ● ● ● ● ● ● ● ●● For each remaining DHSX2 d,g cluster d in the locus, we identified all genes with Pd,g 0.25, and 90.50 90.75 91.00 91.25 91.50 PositionDHS1% on Chromosome (Mbp) DHS2% Per3DHS%posterior%probability%put them in set G . We⇢ updatedd = PP ours formulaOd(s) for calculating β as follows. DHS%Clusters% where PPs is the posterior probabilityd of association for SNP⇥ s. Od(s) is equal to one if SNPd,g γd,gs =CI ⇢d βd,g s is located on DHS cluster d or the 100 bp flankingX2 region⇥ each side of DHS cluster d,andit χ2 /2 χ2 /2 is zero otherwise. If one SNP overlapped two or more DHSe d,g clusters/ or theire d,g 100i bpif flankingg G By summingwhere PPγs isover the posterior all pathogenic probability DHS of clusters, association we for computed SNPgi Gsd. O overalld(s) is equal posterior to oned probability if SNP Normalized%correla>on%d,g β = 2 2 regions, thens is we located split on its DHS posterior cluster probabilityd or the 100d,gPPs bpbetween flanking those region DHS each clusters side of equally. DHS cluster d,andit of associationcoefficient% attributable to each gene (γg)intheregion.Thatis,(0otherwise We measuredis zero otherwise. regulatory If potential one SNP of overlapped each disease two risk or more locus DHSP by summing clusters per-DHSor their 100 pos- bp flanking teriors over all DHSsWe in the computed locus. That the is, total posterior of association transmitted from DHS cluster d to Gene regions, then we split its posterior probability PPs between those DHS clusters equally. Genes% Gene1% Gene2% Gene3% Per3gene%posterior%probability% γg = ⇢d βd,g Weg measured(shown regulatory as γd,g)bymultiplyingitsposteriorofassociation( potential of each disease⇥ risk locus by summing per-DHS⇢d)andtheproportionof pos- ⇢ = ⇢d d D teriorstransmitted over all DHSs association in the locus. posterior That is,X2 (βd,g). That is, d D X2 where D is the set of all pathogenic DHS clusters in the region. We used γg to prioritize where D is the set of all DHS clusters in the region,⇢ and= ⇢ represents⇢d γd,g = ⇢ overalld βd,g posterior prob- genes in the disease risk locus. The higher γg, the higher per-gene⇥ posterior of association, d D ability of association attributable to all DHS clusters in2 the region.