eSCAN: Scan Regulatory Regions for Aggregate Association Testing

bioRxiv preprint doi: https://doi.org/10.1101/2020.11.30.405266; this version posted December 2, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

eSCAN: Scan Regulatory Regions for Aggregate Association Testing using Whole Genome Sequencing Data

Yingxi Yang,1* Yuchen Yang,2,3,4* Le Huang,2 Jai G. Broome,5,6 Adolfo Correa,7 Alexander Reiner,8,9 NHLBI Trans-Omics for Precision Medicine (TOPMed) Consortium, Laura M. Raffield,2* Yun Li2,10,11*†

1 Department of Statistics, Sun Yat-Sen University, Guangzhou, Guangdong, 510275, China
2 Department of Genetics, University of North Carolina at Chapel Hill, Chapel Hill, NC, 27599, USA
3 Department of Pathology and Laboratory Medicine, University of North Carolina at Chapel Hill, Chapel Hill, NC, 27599, USA
4 McAllister Heart Institute, University of North Carolina at Chapel Hill, Chapel Hill, NC, 27599, USA
5 Department of Biostatistics, University of Washington, Seattle, WA 98195, USA
6 Department of Medicine, Division of Medical Genetics, University of Washington, Seattle, WA 98195, USA
7 Department of Medicine and Population Health Science, University of Mississippi Medical Center, Jackson, MS, 39216, USA
8 Department of Epidemiology, University of Washington, Seattle, WA, 98195, USA
9 Fred Hutchinson Cancer Research Center, University of Washington, Seattle, WA, 98195, USA
10 Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, NC, 27599, USA
11 Department of Computer Science, University of North Carolina at Chapel Hill, Chapel Hill, NC, 27599, USA

* These authors contributed equally to this work.
† Corresponding author: Yun Li. E-mail: [email protected]

Abstract

With advances in whole genome sequencing (WGS) technology, multiple statistical methods for aggregate association testing have been developed. Many common approaches aggregate variants in a given genomic window of a fixed/varying size and are not reliant on existing knowledge to define appropriate test units, resulting in most identified regions not being clearly linked to genes, limiting biological understanding. Functional information from new technologies (such as Hi-C and its derivatives), which can help link enhancers to the genes they affect, can be leveraged to predefine variant sets for aggregate testing in WGS. Therefore, in this paper we propose the eSCAN (Scan the Enhancers) method for genome-wide assessment of enhancer regions in sequencing studies, combining the advantages of dynamic window selection in SCANG with the advantages of increased incorporation of genomic annotation. eSCAN searches biologically meaningful windows, increasing power and aiding biological interpretation, as demonstrated by simulation studies under a wide range of scenarios. We also apply eSCAN for association analysis of blood cell traits using TOPMed WGS data from the Women’s Health Initiative (WHI) and the Jackson Heart Study (JHS). Results from this real data example show that eSCAN is able to capture more significant signals, and these signals are of shorter length and drive association of larger regions detected by other methods.

Main text

In genome-wide association studies (GWAS), most significantly associated variants are located outside coding regions of genes, making it difficult to interpret the biological function of associated variants. Statistical power to detect rare variant associations in noncoding regions, which is of increasing importance with the advent of large-scale whole genome sequencing (WGS) studies, is also limited with a standard single variant GWAS approach. Aggregate testing is necessary to increase statistical power to detect rare variant associations; linking noncoding variants to their likely effector genes is necessary for interpretation of identified aggregate signals. Many standard methods for aggregate analysis of the noncoding genome are agnostic to regulatory and functional annotation. One example is standard sliding window analysis, in which all variants in a given location bin (for example, a 5 kb or 10 kb window) are analyzed, followed by analysis of a subsequent partially overlapping window, until each chromosome is assessed in full [1-3]. SCANG has recently been proposed as an improvement on conventional sliding-window procedures, with the ability to detect the existence and locations of association regions with increased statistical power [4]. SCANG allows sliding windows to have different sizes within a pre-specified range and then searches all the possible windows across the genome, increasing statistical power. However, since SCANG tests all possible windows, it can "randomly" identify some regions across the genome regardless of their biological functions. Identified regions could often cross multiple enhancer regions with distinct functions, thus impeding the identification of biologically important enhancers and their target genes.
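As a rough illustration of the window schemes described above, the sketch below enumerates fixed-size overlapping windows, enumerates windows of several sizes, and counts how many annotated enhancers each window overlaps. It is a toy example, not the SCANG or eSCAN implementation: the function names, the window sizes and steps, and the enhancer coordinates are all hypothetical and chosen only for illustration.

```python
from typing import List, Sequence, Tuple

Interval = Tuple[int, int]  # half-open [start, end), in base pairs


def sliding_windows(region_start: int, region_end: int,
                    size: int, step: int) -> List[Interval]:
    """Fixed-size, partially overlapping windows covering a region.

    Windows near the end of the region are truncated at region_end.
    """
    windows = []
    start = region_start
    while start < region_end:
        windows.append((start, min(start + size, region_end)))
        start += step
    return windows


def variable_windows(region_start: int, region_end: int,
                     sizes: Sequence[int], step: int) -> List[Interval]:
    """Windows of every size in a pre-specified list (variable-size scan)."""
    windows = []
    for size in sizes:
        windows.extend(sliding_windows(region_start, region_end, size, step))
    return windows


def enhancers_crossed(window: Interval, enhancers: Sequence[Interval]) -> int:
    """Number of annotated enhancers that overlap the window."""
    w_start, w_end = window
    return sum(1 for e_start, e_end in enhancers
               if e_start < w_end and w_start < e_end)


# Hypothetical enhancer annotation on a 20 kb toy region.
enhancers = [(1_000, 3_000), (6_000, 8_000), (12_000, 13_500)]

# 5 kb windows that overlap by half, as in a standard sliding-window scan.
fixed = sliding_windows(0, 20_000, size=5_000, step=2_500)

# Annotation-agnostic windows that span more than one enhancer.
crossing = [w for w in fixed if enhancers_crossed(w, enhancers) > 1]

# A variable-size scan tests many more candidate windows.
scan = variable_windows(0, 20_000, sizes=(2_000, 5_000), step=1_000)
```

In this toy setting, two of the fixed windows span two distinct enhancers each, so a significant result for either window could not be attributed to a single enhancer.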
This cross-boundary issue may also lead to a higher false positive rate in a fine-mapping sense. The whole region/chromosome in which the detected regions are located may not be a false positive, but locations of the detected regions will not match the true association regions. Moreover, SCANG applies SKAT to all candidate windows, but computing p-values in SKAT requires eigen decomposition [5]. This analysis method is therefore very time-consuming and has high computational costs, which may not be feasible for increasingly large genome-wide studies.

In addition to sliding window approaches, many analyses of WGS data rely on aggregate tests of predefined variant sets, attempting to link the most likely regulatory variants (as defined by tissue specific histone marks, open chromatin data, sequence conservation, etc.) to genes prior to association testing, with variants assigned to genes based on either physical proximity or chromatin conformation [1,3]. There is increasing data available to define these tissue specific regulatory regions, which are known to show enrichment for GWAS identified noncoding variant signals [6-8]. Recent biotechnological advances based on Chromatin Conformation Capture (3C), such as promoter capture Hi-C data, can also better link gene promoters to enhancers based on their physical interactions in 3D space [9]. We here propose an extension of SCANG which combines the advantages of both scanning and fixed variant set methods (see Fig. 1 for illustration).
Our eSCAN (or “scan the enhancers”, with “enhancers” as a shorthand for any potential regulatory regions in the genome) method can integrate various types of functional information, including chromatin accessibility, histone markers, and 3D chromatin conformation. There can be a significant distance between a gene and its regulatory regions; simply expanding the size of the window to include kilobases of genomic data around each gene will include too many non-causal SNPs, giving rise to power loss as well as difficulties in interpreting results [9]. Our proposed framework can enhance statistical power for identifying new regions of association in the noncoding genome. We particularly focus on integration of 3D spatial information, which has not yet been fully exploited in most WGS association testing studies. Our method allows users to input broadly defined regulatory/enhancer regions and then select those which are most likely relevant to a given phenotype, in a statistically powerful framework.

Given our incomplete understanding of chromatin conformation and enhancer annotation, an annotation-agnostic approach such as SCANG does have some advantages, in that no prior information is needed for rare variant testing. However, our simulations and the real data example presented here demonstrate the advantages of our eSCAN method, which can flexibly accommodate multiple types of annotation information and shows significant power gains over SCANG, as well as a lower false positive rate in different scenarios for both continuous and dichotomous traits.
These advantages are demonstrated in our application of eSCAN to TOPMed WGS analyses of four blood cell traits in the Women’s Health Initiative (WHI) study, with replication in Jackson Heart Study (JHS).

The eSCAN procedure can be split into two steps: a p-value computing step and a decision-making step (using a p-value threshold). First, for each enhancer, set-based p-values are calculated by fastSKAT, which applies randomized singular value decomposition (SVD) to rapidly analyze much larger regions than standard SKAT, and then p-values are "averaged" by the Cauchy method via ACAT [4]. Second, eSCAN calculates two types of significance threshold.
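The Cauchy "averaging" used in the first step can be sketched as follows. The code implements the standard Cauchy combination (ACAT) rule: each p-value is mapped to a Cauchy variate, the variates are averaged with weights, and the average is mapped back to a p-value. It is a minimal sketch rather than the eSCAN package code; the function name `acat`, the default of equal weights, and the example p-values are assumptions made for illustration, and no special handling of extremely small p-values is included.

```python
import math
from typing import Optional, Sequence


def acat(p_values: Sequence[float],
         weights: Optional[Sequence[float]] = None) -> float:
    """Combine p-values with the Cauchy combination rule.

    T = sum_i w_i * tan((0.5 - p_i) * pi), with weights normalized to sum to 1;
    the combined p-value is 0.5 - arctan(T) / pi.
    """
    n = len(p_values)
    if n == 0:
        raise ValueError("at least one p-value is required")
    if weights is None:
        weights = [1.0] * n  # assumption: equal weights by default
    if len(weights) != n:
        raise ValueError("weights and p_values must have the same length")
    if any(not (0.0 < p < 1.0) for p in p_values):
        raise ValueError("p-values must lie strictly between 0 and 1")

    total_weight = sum(weights)
    # Weighted average of Cauchy-transformed p-values.
    t_stat = sum((w / total_weight) * math.tan((0.5 - p) * math.pi)
                 for w, p in zip(weights, p_values))
    # Upper tail of a standard Cauchy distribution evaluated at t_stat.
    return 0.5 - math.atan(t_stat) / math.pi


# Hypothetical p-values for one enhancer (illustrative numbers only).
combined = acat([1e-6, 0.5, 0.9])
```

With equal weights the combined p-value is dominated by the smallest input, so one strong signal is not diluted by the other tests; a single p-value is returned unchanged.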
