Giggle: a Search Engine for Large-Scale Integrated Genome
BRIEF COMMUNICATIONS

GIGGLE: a search engine for large-scale integrated genome analysis

Ryan M Layer1,2, Brent S Pedersen1,2, Tonya DiSera1,2, Gabor T Marth1,2, Jason Gertz3 & Aaron R Quinlan1,2,4

1Department of Human Genetics, University of Utah, Salt Lake City, Utah, USA. 2USTAR Center for Genetic Discovery, University of Utah, Salt Lake City, Utah, USA. 3Department of Oncological Sciences, University of Utah, Huntsman Cancer Institute, Salt Lake City, Utah, USA. 4Department of Biomedical Informatics, University of Utah, Salt Lake City, Utah, USA. Correspondence should be addressed to R.M.L. ([email protected]) or A.R.Q. ([email protected]).

RECEIVED 5 JULY 2017; ACCEPTED 6 DECEMBER 2017; PUBLISHED ONLINE 8 JANUARY 2018; DOI:10.1038/NMETH.4556

NATURE METHODS | ADVANCE ONLINE PUBLICATION. © 2018 Nature America, Inc., part of Springer Nature. All rights reserved.

GIGGLE is a genomics search engine that identifies and ranks the significance of genomic loci shared between query features and thousands of genome interval files. GIGGLE (https://github.com/ryanlayer/giggle) scales to billions of intervals and is over three orders of magnitude faster than existing methods. Its speed extends the accessibility and utility of resources such as ENCODE, Roadmap Epigenomics, and GTEx by facilitating data integration and hypothesis generation.

The results from genome-wide assays such as ChIP-seq, RNA-seq, and variant calling are often interpreted by comparing experimentally identified genomic loci to other known genomic features such as open chromatin, enhancers, and transcribed regions. Large-scale functional genomics projects have greatly empowered this type of analysis by characterizing the genomic regions associated with a wide range of genomic processes. However, interpretation is complicated by the size of these data set collections, which consist of thousands of results that span hundreds of different tissue types, assays, and biological conditions. Effectively integrating these large, complex, and heterogeneous resources requires the ability to rapidly search the full data set and identify the most statistically relevant features. While existing software such as BEDTOOLS1 and TABIX2 identify regions that are common to genome interval files, these methods were designed to investigate a limited number of files. More recent methods3,4 describe improved statistical measures, yet they do not scale to the vast amount of data that is now available.

We introduce GIGGLE, a fast and highly scalable genomic interval searching strategy that, much like web search engines did for the Internet, provides users with the ability to conduct large-scale comparisons of their results with thousands of reference data sets and genome annotations in seconds. GIGGLE enables the identification of novel and unexpected relationships among local data sets as well as the vast amount of publicly available genomics data. It works through command line and web interfaces, as well as APIs in the C, Go, and Python programming languages.

GIGGLE is based on a temporal indexing scheme5 that uses a B+ tree to create a single index of the genome intervals from thousands of annotations and genomic data files (Fig. 1a). Each interval in an indexed file is represented by two keys in the tree that correspond to the interval’s bounds (start and end + 1). Each key in a leaf node contains a list of intervals that either start at a chromosomal position (indicated by a “+”) or have ended (indicated by a “−”) just before that position. We give an example (Fig. 1a) in which position 7 corresponds to a key in the second leaf node with the list [+T2, −B2]. This indicates that at chromosomal position 7, the second interval in the “Transcripts” file (T2) has started, and the second interval in the “TF binding sites” file (B2) has ended. To find the intervals in the index that intersect a query interval (e.g., [1,5] in Fig. 1a), the tree is searched for the query start and end, the keys within that range are scanned, and intervals in the lists of those keys are identified as intersecting the query interval (see Supplementary Fig. 1 and Online Methods for complete algorithmic details).

GIGGLE’s potential for high scalability is based on two factors. First, identifying the number of overlaps between a query and any given annotation file is determined entirely within the unified index, thus eliminating the inefficiencies of existing methods, which must instead open and inspect the underlying data files. Second, the B+ tree structure minimizes disk reads; this is vital to performance since databases of this scale will grow beyond the capacity of main memory and must be stored on disk. To measure GIGGLE’s query performance (Supplementary Software), we created an index of the ChromHMM6 annotations curated by the Roadmap Epigenomics Project (Roadmap) from 127 tissues and cell lines. Each genome was segmented into 15 genomic states, yielding over 55 million intervals in the resulting GIGGLE index (2.2 GB index, indexed in 80 s). When testing query performance with a range of 10 to 1,000,000 query intervals, GIGGLE was 2,336× faster than TABIX and 25× faster than BEDTOOLS (Fig. 1b; see Supplementary Data 1 for the data used to create Fig. 1) for the largest comparison. Similarly, using an index of 5,603 annotation files for the human genome (GRCh37, a total of 6.9 billion intervals) from the UCSC Genome browser (554 GB index, indexed in 269 min), GIGGLE was up to 345× faster than TABIX and 8× faster than BEDTOOLS (Fig. 1c).

[Figure 1, panels a to c. (a) Interval tracks for TF binding sites (B1, B2), Promoters (P1, P2), and Transcripts (T1, T2), shown with the B+ tree leaf keys and their +/− lists; Query: Search(1,5) = [P1,B1,T1,B2,P2]. (b, c) Runtime (s) versus number of query intervals for GIGGLE, BEDTOOLS, and TABIX.]

Speed is essential for searching data of this scale, but, as with internet searches, it is arguably more important to rank results by their relevance to the set of query intervals. Ranking requires a metric that quantifies the degree of similarity between the query intervals and each interval file in the GIGGLE index. Monte Carlo (MC) simulations7,8 are commonly used in genomics analyses to compare the observed number of intersections to a null distribution of intersections obtained by randomly shuffling intervals thousands of times and testing the number of intersections in each trial. While MC simulations are an effective method for pairs of interval sets, they are computationally intractable for large-scale data sets since thousands of permutations are required for each interval file.

GIGGLE eliminates this complexity by estimating the significance and enrichment between the query intervals and each indexed interval file with a Fisher’s Exact two-tailed test and the odds ratio of a 2 × 2 contingency table containing the number of intervals that are in (i) both the query and indexed file, (ii) solely the query file, (iii) solely the indexed file, and (iv) neither the query file nor the indexed file. The first three values are directly computed with a GIGGLE search, and the last value is estimated by the difference between the union of the two sets and the quotient of the mean interval size of both sets and the genome size. These estimates are well correlated with the MC results (Fig. 1d,e) and have the favorable property of near-instant computation.
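As a rough illustration of the 2 × 2 contingency-table test described above, the following Python sketch computes a two-tailed Fisher's exact P value and an odds ratio from four counts. The function names are hypothetical and this is not GIGGLE's own code or API. The four counts are assumed to be supplied by the caller, so the estimate of the fourth cell is not reproduced here, and the brute-force sum is only practical for small counts.

```python
from math import comb, inf


def fisher_exact_two_tailed(a: int, b: int, c: int, d: int) -> float:
    """Two-tailed Fisher's exact P value for the table [[a, b], [c, d]].

    a: intervals in both the query and the indexed file
    b: intervals solely in the query file
    c: intervals solely in the indexed file
    d: intervals in neither file (assumed to be given, not estimated here)
    """
    row1, row2, col1 = a + b, c + d, a + c
    total_tables = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        # Hypergeometric probability of x in the top-left cell
        # when all row and column totals are held fixed.
        return comb(row1, x) * comb(row2, col1 - x) / total_tables

    p_observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Sum the probabilities of all tables that are no more likely than
    # the observed one; the small tolerance absorbs rounding error.
    p_value = sum(
        p for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= p_observed * (1 + 1e-7)
    )
    return min(1.0, p_value)


def odds_ratio(a: int, b: int, c: int, d: int) -> float:
    """Sample odds ratio (a * d) / (b * c); infinite if b * c is zero."""
    if b * c == 0:
        return inf
    return (a * d) / (b * c)
```

For the small table with counts 3, 1, 1, 3, this gives an odds ratio of 9 and a two-tailed P value of 34/70, which shows that a large odds ratio on few observations is not necessarily significant.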
[Figure 1, panels d and e. (d) MC P value versus Fisher’s exact P value (GIGGLE). (e) MC observed/expected versus odds ratio (GIGGLE).]

Figure 1 | Indexing, searching, performance, and score calibration. (a) A set of three genomic intervals files (transcription factor (TF) binding sites, promoters, and transcripts) (left, black) is indexed using a single (simplified) B+ tree (right). Intervals among the annotations overlapping a query interval (left, red) are found by searching the tree for the query start and end (right, boxed red) and scanning the keys between these positions (right, boxed red). (b) Runtimes for GIGGLE, BEDTOOLS, and TABIX considering random query sets with between 10 and 1 million random 100-base-pair intervals against the ChromHMM

GIGGLE ranks query results by a composite of the product of −log10(P value) and log2(odds ratio). This ‘GIGGLE score’ avoids some of the issues that arise when using only P values to select top hits9. In MC simulations, the proportion of values that are more extreme than the observation (i.e., the P value) is highly dependent on the variance of the trials. When the variance of the MC distribution is low, observations that are only marginally larger than the expected value may be significant, yet not interesting biologically. For example, one result from a search of MyoD (a muscle differentiation transcription factor) ChIP-seq peaks against Roadmap had a low enrichment (1.7×), but the variance of the MC simulations was also low, making the observation significant (P < 0.001). Similarly, when the MC distribution variance is high, large enrichments may not reach significance.
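The score described above, the product of −log10(P value) and log2(odds ratio), can be sketched in a few lines of Python. This is a minimal illustration, not GIGGLE's implementation: the function name, the floor applied to very small P values, and the rejection of non-positive odds ratios are all assumptions made here.

```python
from math import log10, log2


def giggle_score(p_value: float, odds_ratio: float) -> float:
    """Product of -log10(P value) and log2(odds ratio).

    Assumptions of this sketch: P values are floored at 1e-300 so that
    an underflowed P of 0 still yields a finite score, and the odds
    ratio must be strictly positive for the logarithm to exist.
    """
    if odds_ratio <= 0:
        raise ValueError("odds ratio must be positive")
    floored_p = max(p_value, 1e-300)
    return -log10(floored_p) * log2(odds_ratio)


# Two hypothetical hits with the same P value but different effect sizes.
weak_hit = giggle_score(0.001, 1.7)    # significant, small effect
strong_hit = giggle_score(0.001, 8.0)  # significant, large effect
```

With P = 0.001 and an odds ratio of 8, both factors equal 3, so the score is 9. The same P value with an odds ratio of 1.7 scores far lower, which is the behavior the text motivates: a significant but weak enrichment is ranked below a significant and strong one. An odds ratio below 1 gives a negative score under this formula.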