BRIEF COMMUNICATIONS

GIGGLE: a search engine for large-scale integrated genome analysis

Ryan M Layer, Brent S Pedersen, Tonya DiSera, Gabor T Marth, Jason Gertz & Aaron R Quinlan

GIGGLE is a genomics search engine that identifies and ranks the significance of genomic loci shared between query features and thousands of genome interval files. GIGGLE (https://github.com/ryanlayer/giggle) scales to billions of intervals and is over three orders of magnitude faster than existing methods. Its speed extends the accessibility and utility of resources such as ENCODE, Roadmap Epigenomics, and GTEx by facilitating data integration and hypothesis generation.

The results from genome-wide assays such as ChIP-seq, RNA-seq, and variant calling are often interpreted by comparing experimentally identified genomic loci to other known genomic features such as open chromatin, enhancers, and transcribed regions. Large-scale functional genomics projects have greatly empowered this type of analysis by characterizing the genomic regions associated with a wide range of genomic processes. However, interpretation is complicated by the size of these data sets, which consist of thousands of results that span hundreds of different tissue types, assays, and biological conditions. Effectively integrating these large, complex, and heterogeneous resources requires the ability to rapidly search the full data set and identify the most statistically relevant features. While existing software such as BEDTOOLS¹ and TABIX² identify regions that are common to genome interval files, these methods were designed to investigate a limited number of files. More recent methods³,⁴ describe improved statistical measures, yet they do not scale to the vast amount of data that is now available.

We introduce GIGGLE (Supplementary Software), a fast and highly scalable genomic interval searching strategy that, much like web search engines did for the Internet, provides users with the ability to conduct large-scale comparisons of their results with thousands of reference data sets and genome annotations in seconds. GIGGLE enables the identification of novel and unexpected relationships among local data sets as well as the vast amount of publicly available genomics data. It works through command line and web interfaces, as well as APIs in the C, Go, and Python programming languages.

GIGGLE is based on a temporal indexing scheme⁵ that uses a B+ tree to create a single index of the genome intervals from thousands of annotations and genomic data files (Fig. 1a; see Supplementary Fig. 1 and Online Methods for complete algorithmic details). Each interval in an indexed file is represented by two keys in the tree that correspond to the interval's bounds (start and end + 1). Each key in a leaf node contains a list of intervals that either start at a chromosomal position (indicated by a “+”) or have ended (indicated by a “−”) just before that position. We give an example (Fig. 1a) in which position 7 corresponds to a key in the second leaf node with the list [+T2, −B2]. This indicates that at chromosomal position 7, the second interval in the “Transcripts” file (T2) has started, and the second interval in the “TF binding sites” file (B2) has ended. To find the intervals in the index that intersect a query interval (e.g., [1,5] in Fig. 1a), the tree is searched for the query start and end, the keys within that range are scanned, and intervals in the lists of those keys are identified as intersecting the query interval (see Supplementary Fig. 1).

GIGGLE's potential for high scalability is based on two factors. First, the number of overlaps between a query and any given annotation file is determined entirely within the unified index, thus eliminating the inefficiencies of existing methods, which must instead open and inspect the underlying data files. Second, the B+ tree structure minimizes disk reads; this is vital to performance since databases of this scale will grow beyond the capacity of main memory and must be stored on disk. To measure GIGGLE's query performance (Fig. 1b; see Supplementary Data 1 for the data used to create Fig. 1), we created an index of the ChromHMM⁶ annotations curated by the Roadmap Epigenomics Project (Roadmap) from 127 tissues and cell lines. Each genome was segmented into 15 genomic states, yielding over 55 million intervals in the resulting GIGGLE index (2.2 GB index, indexed in 80 s). When testing query performance with a range of 10 to 1,000,000 query intervals, GIGGLE was 2,336× faster than TABIX and 25× faster than BEDTOOLS (Fig. 1b) for the largest comparison. Similarly, using an index of 5,603 annotation files for the human genome (GRCh37, a total of 6.9 billion intervals) from the UCSC Genome browser (554 GB index, indexed in 269 min), GIGGLE was up to 345× faster than TABIX and 8× faster than BEDTOOLS (Fig. 1c).

Speed is essential for searching data of this scale, but, as with internet searches, it is arguably more important to rank results by their relevance to the set of query intervals. Ranking requires a metric that quantifies the degree of similarity between the query intervals and each interval file in the GIGGLE index.

Department of Human Genetics, University of Utah, Salt Lake City, Utah, USA. USTAR Center for Genetic Discovery, University of Utah, Salt Lake City, Utah, USA. Department of Oncological Sciences, University of Utah, Huntsman Cancer Institute, Salt Lake City, Utah, USA. Department of Biomedical Informatics, University of Utah, Salt Lake City, Utah, USA. Correspondence should be addressed to R.M.L. ([email protected]) or A.R.Q. ([email protected]).

Received 5 July 2017; accepted 6 December 2017; published online 8 January 2018; doi:10.1038/nmeth.4556

NATURE METHODS | ADVANCE ONLINE PUBLICATION

© 2018 Nature America, Inc., part of Springer Nature. All rights reserved.

Monte Carlo (MC) simulations are commonly used in genomics analyses⁷,⁸ to compare the observed number of intersections to a null distribution of intersections obtained by randomly shuffling intervals thousands of times and testing the number of intersections in each trial. While MC simulations are an effective method for pairs of interval sets, they are computationally intractable for large-scale data sets since thousands of permutations are required for each interval file.

GIGGLE eliminates this complexity by estimating the significance and enrichment between the query intervals and each indexed interval file with a Fisher's Exact two-tailed test and the odds ratio of a 2 × 2 contingency table containing the number of intervals that are (i) in both the query file and indexed file, (ii) solely in the query file, (iii) solely in the indexed file, and (iv) in neither the query file nor the indexed file. The first three values are directly computed with a GIGGLE search, and the last value is estimated as the difference between the quotient of the genome size and the mean interval size of both sets, and the size of the union of the two sets. These estimates are well correlated with the MC results (Fig. 1d,e) and have the favorable property of near-instant computation.

GIGGLE ranks query results by a composite of the product of −log10(P value) and log2(odds ratio). This ‘GIGGLE score’ avoids some of the issues that arise when only using P values to select top hits⁹. In MC simulations, the proportion of values that are more extreme than the observation (i.e., the P value) is highly dependent on the variance of the trials. When the variance of the MC distribution is low, observations that are only marginally larger than the expected value may be significant, yet not interesting biologically. For example, one result from a search of MyoD (a muscle differentiation transcription factor) ChIP-seq peaks against Roadmap had a low enrichment (1.7×), but the variance of the MC simulations was also low, making the observation significant (P < 0.001). Similarly, when the MC distribution variance is high, large enrichments may not reach significance. These effects are mitigated by combining significance and enrichment into the GIGGLE score.

While the GIGGLE score can be used to rank results, it is also insightful to use all scores to visualize the full spectrum of relationships between a query set and all indexed interval sets (see Supplementary Data 2 for the data used to create Fig. 2). For example, a heatmap of GIGGLE scores from a search of MyoD ChIP-seq peaks against the GIGGLE index of Roadmap (Fig. 2a) illustrates MyoD's important and specific role in muscle tissues¹⁰. Similarly, a search of GWAS SNPs associated with Crohn's disease¹¹ (Fig. 2b) shows that variants cluster in immune cell enhancers. While these individual results illuminate the dynamics of these features, the speed of GIGGLE (<0.3 s for Crohn's SNPs) allows researchers to conduct exploratory research on a massive scale. For example, GIGGLE can quickly (3.5 s) query sets of GWAS SNPs for 39 different traits¹¹ against Roadmap to not only confirm that the enrichment of SNPs in immune cell enhancers is present in other autoimmune diseases and absent in nonautoimmune traits (Fig. 2c, left), but also to show that there is no cell-specific pattern in transcribed regions for either set of traits (Fig. 2c, right).
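The contingency-table estimate and the GIGGLE score described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not GIGGLE's own implementation: the function names are hypothetical, the mean interval size is passed in directly rather than derived from the two sets, a 0.5 pseudocount is assumed whenever a cell of the table is zero, and the exact-test routine is only practical for small counts.

```python
import math
import sys


def fisher_exact_two_tailed(n11, n12, n21, n22):
    """Two-tailed Fisher's exact P value for the table [[n11, n12], [n21, n22]]."""
    row1, row2 = n11 + n12, n21 + n22
    col1, total = n11 + n21, n11 + n12 + n21 + n22
    denom = math.comb(total, col1)

    def prob(x):
        # Hypergeometric probability that the top-left cell equals x.
        return math.comb(row1, x) * math.comb(row2, col1 - x) / denom

    observed = prob(n11)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum every possible table that is no more likely than the observed one.
    p = sum(px for px in (prob(x) for x in range(lo, hi + 1))
            if px <= observed * (1 + 1e-9))
    return min(1.0, p)


def giggle_score(overlaps, query_n, indexed_n, genome_size, mean_interval_size):
    """Return (P value, odds ratio, score) for one query/indexed-file pair."""
    n11 = overlaps                 # (i) in both files
    n12 = query_n - overlaps       # (ii) solely in the query file
    n21 = indexed_n - overlaps     # (iii) solely in the indexed file
    # (iv) estimated: possible intervals in the genome minus the union.
    n22 = max(0, round(genome_size / mean_interval_size) - (n11 + n12 + n21))
    p = fisher_exact_two_tailed(n11, n12, n21, n22)
    cells = [n11, n12, n21, n22]
    if 0 in cells:
        # Assumption of this sketch: pseudocount to keep the odds ratio finite.
        cells = [c + 0.5 for c in cells]
    odds = (cells[0] * cells[3]) / (cells[1] * cells[2])
    # Floor P at the smallest positive float so log10 is always defined.
    score = -math.log10(max(p, sys.float_info.min)) * math.log2(odds)
    return p, odds, score


if __name__ == "__main__":
    print(giggle_score(overlaps=3, query_n=4, indexed_n=4,
                       genome_size=800, mean_interval_size=100))
```

Because the score multiplies significance by the log of the enrichment, a result with a tiny P value but an odds ratio near 1 scores close to zero, and depleted pairs (odds ratio below 1) receive negative scores.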

We emphasize that GIGGLE is a completely general framework that enables researchers to efficiently explore any collection of interval sets for any species. For example, using a GIGGLE index of all ChIP-seq data sets available from Cistrome¹² (5,992 files; 8,716,024 intervals; 521 MB index; indexed in 17 s), we quickly (<3 min) performed a full pair-wise comparison of the 270 factors available (734,249 intervals) for the MCF-7 breast cancer cell line. From this comparison, distinct subsets become clear, including coordinated genomic binding of CTCF, RAD21, and STAG1, indicative of regions involved in long-range interactions¹³⁻¹⁵, and estrogen receptor α (ER) co-occurrence with other transcription factors (Group 3, and Group 1 and Group 2 in Supplementary Fig. 2, respectively). Focusing specifically on ER (Fig. 2d) uncovers sequence-specific transcription factors known to play important roles in ER genomic binding (FOXA1¹⁶, GATA3¹⁷, and PR¹⁸) and cofactors that are involved in estrogen-induced gene regulation (EP300¹⁹ and NCAPG²⁰). One unexpected finding from this large-scale analysis of MCF-7 ChIP-seq data is the strong co-occurrence of histone variant H2AFX²¹ and ER cofactor GREB1²² (Supplementary Fig. 2), suggesting a potential physical interaction between these factors.

[Figure 1 image. Panel a: three interval files (TF binding sites, Promoters, Transcripts) with a query interval, and the corresponding B+ tree annotated “Search(1,5) = [P1, B1, T1, B2, P2]”. Panels b,c: runtime (s) versus number of query intervals for GIGGLE, BEDTOOLS, and TABIX. Panel d: MC P value versus Fisher's exact P value (GIGGLE). Panel e: MC observed/expected versus odds ratio (GIGGLE).]

Figure 1 | Indexing, searching, performance, and score calibration. (a) A set of three genomic intervals files (transcription factor (TF) binding sites, promoters, and transcripts) (left, black) is indexed using a single (simplified) B+ tree (right). Intervals among the annotations overlapping a query interval (left, red) are found by searching the tree for the query start and end (right, boxed red) and scanning the keys between these positions (right, boxed red). (b) Runtimes for GIGGLE, BEDTOOLS, and TABIX considering random query sets with between 10 and 1 million random 100-base-pair intervals against the ChromHMM processing of Roadmap Epigenomics (1,905 files and over 55 million intervals). (c) Runtimes for the same method and queries against UCSC genome browser annotations (5,603 files and over 6.9 billion intervals). While GIGGLE and BEDTOOLS runtimes converge for query sizes exceeding hundreds of millions of intervals, this scenario far exceeds typical query set sizes. (d,e) A comparison between GIGGLE's relationship estimates using a contingency table and Monte-Carlo-based methods for (d) significance (Fisher's Exact two-tailed test) and (e) enrichment considering a search of MyoD ChIP-seq peaks (631 intervals) against ChromHMM predictions from Roadmap.

[Figure 2 image. Panels a–c: heatmaps with rows grouped by tissue or cell type (ESC, iPSC, ES deriv, Blood and T-cell, HSC and B cell, Mesench, Epithelial, Neurosph, Thymus, Brain, Muscle, Sm muscle, Heart, Digestive, Lung, Cancer cell line, Other). In panels a,b the columns are the 15 ChromHMM states (TssA, TssAFlnk, TxFlnk, Tx, TxWk, EnhG, Enh, ZNF/Rpts, Het, TssBiv, BivFlnk, EnhBiv, ReprPC, ReprPCWk, Quies); in panel c the columns are autoimmune and non-autoimmune traits for the Enhancers (Enh) and Strong transcription (Tx) tracks. Panel d: ESR1 experiments (rows) versus other MCF-7 ChIP-seq experiments (columns, AHR through TFAP2C, with treatments such as +E2, +Oestrogen, +Progesterone, and +R5020). Each panel has its own color scale of GIGGLE scores.]

Figure 2 | Visualization of GIGGLE scores from various searches. (a,b) The relationships between 15 genomic states across 127 different cell types and tissues predicted by ChromHMM for Roadmap and (a) MyoD ChIP-seq peaks and (b) genome-wide significant SNPs for Crohn's disease. Black boxes within panels highlight (a) muscle and (b) immune tissues and cell types. (c) Results from the enhancer and strong transcription tracks from ChromHMM for Roadmap data when considering GWAS SNPs for 21 autoimmune disorder and 18 nonautoimmune traits. The black boxes highlight immune tissues and cell types. (d) The relationships between ESR1 ChIP-seq binding sites from 53 different experiments and the binding sites from 105 other ChIP-seq experiments (38 different unique factors) in MCF-7 cells. Higher GIGGLE scores indicate more enrichment. Black boxes highlight the relationships between ESR1 and FOXA1, GATA3, PR, EP300, and NCAPG. Color lookup tables indicate GIGGLE scores.

These examples illustrate GIGGLE's ability to confirm previously characterized associations and demonstrate the discovery potential afforded by GIGGLE's rapid, prioritized searches.

GIGGLE also provides the infrastructure to integrate data sources. For example, we developed a web interface that allows users to further investigate interesting results (e.g., MyoD ChIP-seq and Myoblast enhancers from Roadmap data) by querying a GIGGLE index of the UCSC genome browser (Supplementary Fig. 3). Those results are visualized in the genome browser as a dynamic ‘smartview’, where only the tracks with at least one overlap are visible. Other GIGGLE indices can also be used to verify results. For example, we recapitulated the top hits from the GIGGLE search of both MyoD ChIP-seq peaks and Crohn's disease GWAS SNPs against Roadmap with similar searches against an index of the FANTOM5 data²³ (1,825 files; 11,284,790 intervals) (Supplementary Tables 1–4). This is especially promising since FANTOM5 and Roadmap are based on fundamentally different assays and therefore provide orthogonal corroboration of these biological relationships.

The exploratory power of a single interface from which many data sets can be searched has the possibility to dramatically advance large-scale, integrative analyses. GIGGLE is capable of powering a single access point that will inform researchers and clinicians of all known experiments and curated annotations that are associated with a particular genomic region. In summary, GIGGLE provides a new engine with which to conduct large-scale, in silico ‘screens’ of multidimensional genomics data sets in search of insights into genome biology in diverse experimental contexts.

METHODS
Methods, including statements of data availability and any associated accession codes and references, are available in the online version of the paper.

Note: Any Supplementary Information and Source Data files are available in the online version of the paper.

ACKNOWLEDGMENTS
We are grateful to the anonymous reviewers for their suggestions and comments. This research was funded by US National Institutes of Health awards to R.M.L. (K99HG009532) and A.R.Q. (R01HG006693, R01GM124355, U24CA209999).

AUTHOR CONTRIBUTIONS
R.M.L. conceived and designed the study, developed GIGGLE, and wrote the manuscript. B.S.P. developed the GIGGLE score and the PYTHON and GO APIs. T.D. developed the web interface. G.T.M. provided input in the development of the web interface. J.G. conceived and designed the ChIP-seq experiment. A.R.Q. conceived and designed the study and wrote the manuscript.

COMPETING FINANCIAL INTERESTS
The authors declare no competing financial interests.

Reprints and permissions information is available online at http://www.nature.com/reprints/index.html. Publisher's note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

1. Quinlan, A.R. & Hall, I.M. Bioinformatics 26, 841–842 (2010).
2. Li, H. Bioinformatics 27, 718–719 (2011).
3. Sheffield, N.C. & Bock, C. Bioinformatics 32, 587–589 (2016).
4. Favorov, A. et al. PLOS Comput. Biol. 8, e1002529 (2012).
5. Elmasri, R., Wuu, G.T.J. & Kim, Y.-J. The time index: an access structure for temporal data. in Proceedings of the 16th International Conference on Very Large Data Bases (VLDB '90) (eds. McLeod, D., Sacks-Davis, R. & Schek, H.-J.) 1–12 (Morgan Kaufmann, San Francisco, California, USA, 1990).
6. Ernst, J. & Kellis, M. Nat. Methods 9, 215–216 (2012).
7. Layer, R.M., Skadron, K., Robins, G., Hall, I.M. & Quinlan, A.R. Bioinformatics 29, 1–7 (2013).
8. De, S., Pedersen, B.S. & Kechris, K. Brief. Bioinform. 15, 919–928 (2014).
9. Xiao, Y. et al. Bioinformatics 30, 801–807 (2014).
10. MacQuarrie, K.L. et al. Mol. Cell. Biol. 33, 773–784 (2013).
11. Farh, K.K.-H. et al. Nature 518, 337–343 (2015).
12. Mei, S. et al. Nucleic Acids Res. 45, D658–D662 (2017).
13. Splinter, E. et al. Genes Dev. 20, 2349–2354 (2006).
14. Nativio, R. et al. PLoS Genet. 5, e1000739 (2009).
15. Xu, Y. et al. PLoS Genet. 12, e1005992 (2016).
16. Carroll, J.S. et al. Cell 122, 33–43 (2005).
17. Theodorou, V., Stark, R., Menon, S. & Carroll, J.S. Genome Res. 23, 12–22 (2013).
18. Mohammed, H. et al. Nature 523, 313–317 (2015).
19. Hanstein, B. et al. Proc. Natl. Acad. Sci. USA 93, 11540–11545 (1996).
20. Li, W. et al. Mol. Cell 59, 188–202 (2015).
21. Periyasamy, M. et al. Cell Rep. 13, 108–121 (2015).
22. Mohammed, H. et al. Cell Rep. 3, 342–349 (2013).
23. Lizio, M. et al. Genome Biol. 16, 22 (2015).

ONLINE METHODS
The GIGGLE index. The GIGGLE index is based on a previously described⁵ temporal indexing method and consists of a set of B+ Trees, one for each chromosome, represented among the database interval files. A B+ Tree is a generalization of a binary tree where each node can have multiple keys, internal nodes contain only keys and facilitate tree searching, leaf nodes contain the key-value pairs, and adjacent leaf nodes are linked. Each key in an internal node is linked to two child nodes. The ‘left’ link points to a node with keys that are less than the current key, and the ‘right’ link points to a node with keys that are greater than or equal to the current key. In the GIGGLE index, keys represent chromosomal positions, and the values associated with keys are lists of intervals that either start at that position (indicated by a “+” in Fig. 1a) or have ended just before that position (indicated by a “−” in Fig. 1a). Leaf nodes also contain a ‘leading’ key (“L” in Supplementary Fig. 1, omitted from Fig. 1a) that stores intervals that start before the first key in the leaf, but have not ended by the last key in the prior leaf node (e.g., interval T1 in Supplementary Fig. 1b). While the leading values contain redundant information and require extra storage, they improve performance by preventing queries from having to load and scan other leaf nodes for intervals that start in some earlier leaf node (e.g., interval T1).

Bulk indexing. To improve indexing efficiency, GIGGLE performs ‘bulk’ indexing across many presorted interval files (see Supplementary Fig. 1a and Supplementary Fig. 1b). In bulk indexing, a priority queue is used to select the interval with the next lowest start position among the full set of files. The queue is loaded with one interval (the first) from each file; and, as intervals are removed from the queue, the next interval from the corresponding file is added to the queue. For example, in Supplementary Fig. 1b, step 1 considers interval P1 from the Promoters file, and step 2 considers interval B1 from the TF binding sites file. After P1 is considered, the next interval in the Promoters file (P2) is added to the queue. Similarly, in step 2, B2 from the TF binding sites file is added after B1 is considered.

Each interval is inserted into both the B+ Tree based on its start position and an auxiliary priority queue that is keyed by the end coordinate (plus one). This priority queue is used to add the leading values and to insert end positions into the B+ Tree. If the start position of the current interval has been observed previously, then the interval start is added to the list of the existing value (intervals B2, P2, and T2 in steps 4, 5, and 6 in Supplementary Fig. 1b). Otherwise, a new key is added to the current leaf for the start position (assuming the current leaf has not reached its maximum number of keys, which is set to 100 by default), and the interval start (“+”) is added to the new list that is associated with that key (intervals P1, B1, and T1 in steps 1, 2, and 3 in Supplementary Fig. 1b). Before a key is added for a new start position, all intervals with end values less than or equal to the start value are removed from the priority queue, and the interval ends (“−”) are either added to the lists of existing keys (interval B1 in step 4 in Supplementary Fig. 1b), or new keys are created (intervals P1, T1, B2, P2, and T2 in steps 4 and 6 in Supplementary Fig. 1b). If at any point the current node becomes full, then a new leaf node is created, all intervals in the priority queue are added to the leading value of the new node, and the new key is added to the new node (step 4 in Supplementary Fig. 1b). Once all files have been processed and leaf node construction is complete, internal nodes are added by promoting the first key in each leaf node (other than the leftmost node) to a parent node (step 7 in Supplementary Fig. 1b). This process continues one level at a time until there is only one parent node.
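The bulk indexing procedure above can be illustrated with a short Python sketch. It is not GIGGLE's C implementation and makes several simplifying assumptions that are ours alone: a single chromosome, in-memory lists instead of files, interval names formed from a hypothetical file prefix plus a counter, and leaf nodes only (internal nodes are not built). The interval coordinates are invented for the example. The set of open intervals is tracked explicitly so that each leading value matches the definition given in “The GIGGLE index”.

```python
import heapq


def bulk_index(files, max_keys=100):
    """Build the ordered leaf nodes of the index from presorted interval lists.

    files: {prefix: [(start, end), ...]}, each list sorted by start, ends inclusive.
    Returns a list of leaves, each {"leading": [...], "keys": [...], "values": [...]}.
    """
    leaves = [{"leading": [], "keys": [], "values": []}]
    active = set()   # intervals started but not ended as of the last key written
    open_ends = []   # auxiliary priority queue keyed by end + 1

    def add_event(pos, sign, name):
        leaf = leaves[-1]
        if not leaf["keys"] or leaf["keys"][-1] != pos:
            if len(leaf["keys"]) == max_keys:
                # Leaf is full: open a new one and record the intervals that
                # were still open after the last key of the previous leaf.
                leaf = {"leading": sorted(active), "keys": [], "values": []}
                leaves.append(leaf)
            leaf["keys"].append(pos)
            leaf["values"].append([])
        leaf["values"][-1].append(sign + name)
        if sign == "+":
            active.add(name)
        else:
            active.discard(name)

    named = [[(s, e, f"{prefix}{i}") for i, (s, e) in enumerate(ivs, 1)]
             for prefix, ivs in files.items()]
    # heapq.merge keeps one pending interval per file in a heap, which plays
    # the role of the start-position priority queue described in the text.
    for start, end, name in heapq.merge(*named):
        # Emit "-" events for every interval that ended at or before this start.
        while open_ends and open_ends[0][0] <= start:
            pos, ended = heapq.heappop(open_ends)
            add_event(pos, "-", ended)
        add_event(start, "+", name)
        heapq.heappush(open_ends, (end + 1, name))
    while open_ends:  # flush the remaining ends after the last start
        pos, ended = heapq.heappop(open_ends)
        add_event(pos, "-", ended)
    return leaves


if __name__ == "__main__":
    toy = {"P": [(1, 2), (5, 9)], "B": [(2, 3), (5, 6)], "T": [(3, 4), (7, 9)]}
    for leaf in bulk_index(toy, max_keys=3):
        print(leaf)
```

With `max_keys=3` the toy input splits into three leaves, and the second leaf carries a leading value for the intervals that began in the first leaf and were still open when it filled up.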

the first key in each leaf node (other than the leftmost node) to a parent node (step 7 in Supplementary Fig. 1b). This process continues one level at a time until there is only one parent node.

Searching the GIGGLE index. A B+ Tree search starts at the root node, proceeds down internal nodes, and terminates at a key in a leaf node. At each node in the search path, an internal search is performed among the keys in the node. While the current node is not a leaf node, the result of that search determines the next node in the path. If the matching key is less than or equal to the query, the path will follow the key's 'right' link; otherwise it will follow the key's 'left' link. When a leaf node has been searched, the leaf and matching key are returned.

For a given query interval, GIGGLE performs a specialized range query to find the overlapping intervals (the intersecting set) across all indexed files. First, the B+ Tree is searched for the query interval's start and end values, which gives the start leaf node and start key and the end leaf node and end key, respectively (step 1 in Supplementary Fig. 1c). Next, the intervals in the leading value of the start leaf node are added to the intersecting set (step 2 in Supplementary Fig. 1c). Then the keys in the leaf node are scanned from the first value up to and including the start key. At each key, starting intervals ("+") are added to the intersecting set, and ending intervals ("-") are removed (steps 3 and 4 in Supplementary Fig. 1c). Last, the remaining keys up to and including the end key are scanned, and the starting intervals are added at each key (steps 5 and 6 in Supplementary Fig. 1c). If the start key does not equal the end key, then the search will use the links between leaf nodes.

The main advantage of a GIGGLE index is in minimizing disk accesses. This benefit is most apparent when the database contains thousands of files, and query intervals overlap only a small fraction of the database. Both BEDTOOLS and TABIX are efficient algorithms that may be faster than GIGGLE when considering small databases. BEDTOOLS does not take advantage of an index and must, in general, perform a full scan of both the query file and the database file. In cases where the query intervals intersect only a small proportion of the database intervals (the most common use case), BEDTOOLS must read and parse most of the database. In contrast, GIGGLE only considers the intervals that either intersect or are immediately adjacent to the query intervals. In the unlikely case where a query intersects nearly every database interval, BEDTOOLS may be more efficient than GIGGLE because it does not have the overhead of the index. TABIX uses an index that is based on an R-Tree but has fixed bin sizes. Like GIGGLE, TABIX uses the index only to consider the database intervals near queries. However, TABIX is less efficient for two reasons. First, TABIX is optimized for small index files, and most queries require opening, decompressing, and parsing the source data files. Second, TABIX creates one index for each source data file. When taken together, TABIX is less efficient than GIGGLE because it must perform disk accesses on both the index and the source data file.

Several search options are available to increase GIGGLE's usability. First, a 'verbose' mode (-v) prints all overlapping intervals along with the source file so that users can filter results. Second, a 'per query' mode (-o) lists each query interval followed by the reference hits so that users can recover specific hits. Third, searches can consider only a subset of reference files by providing (-f) a comma-separated list of regular expressions. Results are given for only those files with names that match one of those expressions.
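The range query described under "Searching the GIGGLE index" can be illustrated with a small, self-contained Python sketch. It is not GIGGLE's implementation: the names build_leaves, locate, and query are hypothetical, intervals are assumed to be half-open [start, end), the leaf level is held in a plain list, and a binary search over key positions stands in for the B+ Tree descent. It only mirrors the order of operations given in the text: take the leading value of the start leaf, scan up to the start key adding "+" and removing "-" intervals, then scan on to the end key adding "+" intervals only.

```python
from bisect import bisect_right
from collections import defaultdict


def build_leaves(intervals, keys_per_leaf=2):
    """Build a toy leaf level from half-open (start, end) intervals.

    Each key is (position, starting_ids, ending_ids). Each leaf also keeps a
    'leading' set: ids of intervals already open before the leaf's first key.
    """
    starts, ends = defaultdict(list), defaultdict(list)
    for idx, (start, end) in enumerate(intervals):
        starts[start].append(idx)
        ends[end].append(idx)
    positions = sorted(set(starts) | set(ends))
    leaves, open_ids = [], set()
    for i in range(0, len(positions), keys_per_leaf):
        leaf = {"leading": set(open_ids), "keys": []}
        for pos in positions[i:i + keys_per_leaf]:
            leaf["keys"].append((pos, starts[pos], ends[pos]))
            open_ids.update(starts[pos])
            open_ids.difference_update(ends[pos])
        leaves.append(leaf)
    return leaves


def locate(leaves, value):
    """Return (leaf index, key index) of the last key <= value, or None.

    Assumption: this binary search stands in for the B+ Tree descent.
    """
    flat = [(li, ki) for li, leaf in enumerate(leaves)
            for ki in range(len(leaf["keys"]))]
    positions = [leaves[li]["keys"][ki][0] for li, ki in flat]
    i = bisect_right(positions, value) - 1
    return flat[i] if i >= 0 else None


def query(leaves, q_start, q_end):
    """Ids of intervals [s, e) with s <= q_end and e > q_start."""
    end_at = locate(leaves, q_end)
    if end_at is None:
        return set()  # the query lies entirely before the first key
    start_at = locate(leaves, q_start)
    if start_at is None:
        start_at, hits = (0, -1), set()  # nothing is open before the first key
    else:
        leaf = leaves[start_at[0]]
        hits = set(leaf["leading"])  # intervals open on entry to the start leaf
        for _, plus, minus in leaf["keys"][:start_at[1] + 1]:
            hits.update(plus)              # "+" intervals are added
            hits.difference_update(minus)  # "-" intervals are removed
    # Remaining keys up to and including the end key: only "+" intervals are
    # added; moving to the next leaf plays the role of the leaf-to-leaf links.
    for li in range(start_at[0], end_at[0] + 1):
        keys = leaves[li]["keys"]
        lo = start_at[1] + 1 if li == start_at[0] else 0
        hi = end_at[1] + 1 if li == end_at[0] else len(keys)
        for _, plus, _ in keys[lo:hi]:
            hits.update(plus)
    return hits


if __name__ == "__main__":
    toy = build_leaves([(1, 5), (3, 8), (10, 12), (6, 7)])
    print(sorted(query(toy, 4, 6)))
```

Only the leading value of the start leaf and the keys between the start and end keys are touched, which is why the work done scales with the neighbourhood of the query rather than with the size of the database.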

Specific examples of each of these search options are given at https://github.com/ryanlayer/giggle/blob/master/README.md#examples.

Data format and sorting requirements. For indexing, GIGGLE supports VCF files (https://samtools.github.io/hts-specs/VCFv4.3.pdf) and BED files (https://genome.ucsc.edu/FAQ/FAQformat.html#format1) that have been sorted and bgzipped (https://github.com/samtools/htslib). Sorting is ascending lexicographical by chromosome, then ascending numerically by start and then by end. For searching, GIGGLE supports bgzipped VCF and BED files. These files need not be sorted, but sorted files are likely to perform better because of cache performance. We provide a script in the GIGGLE repository (https://github.com/ryanlayer/giggle/blob/master/scripts/sort_bed) that can sort and bgzip full directories using multiple processors. Otherwise, the following command can be used on individual, uncompressed BED files:

    LC_ALL=C sort --buffer-size 2G -k1,1 -k2,2n -k3,3n track.bed | bgzip -c > track.bed.gz

Data sources. CHROMHMM, roadmap epigenomics data source. Tissue-based annotations were downloaded from: http://egg2.wustl.edu/roadmap/data/byFileType/chromhmmSegmentations/ChmmModels/coreMarks/jointModel/final/all.mnemonics.bedFiles.tgz. These files were subsequently split and renamed into tissue/state-based files (e.g., Spleen/Enhancers). Detailed methods underlying this process can be found at: https://github.com/ryanlayer/giggle/blob/master/examples/rme/README.md.

UCSC Genome Browser data source. The full set of hg19 annotations was downloaded from: http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database. The files with identifiable chromosome, start, and end values are converted to BED files. Detailed methods underlying this process can be found at: https://github.com/ryanlayer/giggle/blob/master/examples/ucsc/README.md.

MyoD ChIP-seq data source. ChIP-seq peaks from GSM1218850 are downloaded from: ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM1218nnn/GSM1218850/suppl/GSM1218850_MB135DMMD.peak.txt.gz. Peaks with a q-value greater than or equal to 100 are retained; detailed methods underlying this process can be found at: https://github.com/ryanlayer/giggle/blob/master/examples/myod/README.md.

GWAS data source. GWAS variants for 39 autoimmune and non-autoimmune traits. A spreadsheet with a list of traits, chromosome, start, end, and other fields was downloaded from: https://www.nature.com/nature/journal/v518/n7539/extref/nature13835-s1.xls. Detailed methods underlying this process can be found at: https://github.com/ryanlayer/giggle/blob/master/examples/gwas/README.md.

Fantom5 data source. The enhancer expression matrix and associated metadata were downloaded from: http://fantom.gsc.riken.jp/5/datafiles/latest/extra/Enhancers/human_permissive_enhancers_phase_1_and_2_expression_count_matrix.txt.gz and http://fantom.gsc.riken.jp/5/datafiles/latest/extra/Enhancers/Human.sample_name2library_id.txt.

Values were extracted from the matrix and placed in tissue-specific files. Detailed methods underlying this process can be found at: https://github.com/ryanlayer/giggle/blob/master/examples/fantom/README.md.

Cistrome data source. Reanalyzed ChIP-seq narrow peaks from raw GEO data were downloaded from http://cistrome.org/db/interface.html with the following fields selected: Human_TF, Human_histone, Human_chromatin_accessibility, Human_other. Only peaks with a q-value greater than 100 were retained. For Figure 2d, all files had to pass two Cistrome QC metrics: fraction of reads in peaks, and at least 500 peaks with a ten-fold enrichment. We also removed files with fewer than 100 peaks with a q-value greater than or equal to 100. Detailed methods underlying this process can be found at: https://github.com/ryanlayer/giggle/blob/master/examples/cistrome/README.md.

Experiments. Speed tests. Runtimes were for counting the number of intersections between a query set and a database set for GIGGLE (https://github.com/ryanlayer/giggle), BEDTOOLS (https://github.com/arq5x/bedtools2), and TABIX (https://github.com/samtools/htslib). The query sets had between 10 and 1 million 100 base pair intervals, and the databases were the ChromHMM predictions from Roadmap Epigenomics and the hg19 annotations from the UCSC genome browser. All tests were performed using a single core on a 2.4 GHz Intel Xeon processor (E5-2680 v4) with 25 MB of cache and an SSD drive (SM863a) with 510 MB/s read and 485 MB/s write speeds. Detailed methods underlying this process can be found at: https://github.com/ryanlayer/giggle/blob/master/experiments/speed_test/README.md.

Relationship comparison. Two methods for quantifying the relationship between a query interval set and a database interval set were compared: the Fisher's exact two-tailed test P value and the odds ratio of a 2 × 2 contingency table versus a Monte Carlo-based enrichment. The query set was the GWAS variants associated with Crohn's disease, and the database was the ChromHMM predictions from Roadmap Epigenomics. Monte Carlo simulations were performed using BITS (https://github.com/arq5x/bits); a simulation was performed for the intersection of the GWAS variants and each tissue/genomic state interval set, and each simulation consisted of 1,000 rounds. Detailed methods underlying this process can be found at: https://github.com/ryanlayer/giggle/blob/master/experiments/mc_vs_table/README.md.

MyoD heat map. The GIGGLE scores for MyoD ChIP-seq peaks searched against the ChromHMM predictions from Roadmap Epigenomics. Only the peaks with a q-value greater than 100 were used. The cell line and tissue names are in Supplementary File 1. Detailed methods underlying this process can be found at: https://github.com/ryanlayer/giggle/blob/master/experiments/chipseq/README.md.

Crohn's disease heat map and autoimmune/nonautoimmune heat map. The GIGGLE scores for the sets of GWAS variants associated with Crohn's disease and other traits were searched against the ChromHMM predictions from Roadmap Epigenomics. For Figure 2c, the left two columns correspond to the scores from the Enhancer state from ChromHMM for each tissue, and the right two columns correspond to the scores from the Strong Transcription state. Within these major columns, the left minor column corresponds to the autoimmune disorders, and the right minor column to the nonautoimmune traits. The categorization of these traits was retained from Farh et al.1. The cell line and tissue names are in Supplementary Data 3. The autoimmune disorders and the non-autoimmune traits are listed in Supplementary Data 3. Detailed methods underlying this process can be found at: https://github.com/ryanlayer/giggle/blob/master/experiments/gwas/README.md.

Cistrome MCF-7. The GIGGLE scores for all ChIP-seq peak files from the MCF-7 cell line that passed quality control were searched against themselves. Only the peaks with a q-value greater than 100 were used. The full set of accession numbers is in Supplementary Data 3. Detailed methods underlying this process can be found at: https://github.com/ryanlayer/giggle/blob/master/experiments/cistrome/README.md.

Cistrome ER. The GIGGLE scores for the ChIP-seq peak files of ESR1 from the MCF-7 cell line that passed quality control (described above) were searched against all other MCF-7 cell line results that also passed quality control. Only the peaks with a q-value greater than 100 were used. The full set of accession numbers is in Supplementary Data 3. Detailed methods underlying this process can be found at: https://github.com/ryanlayer/giggle/blob/master/experiments/cistrome/README.md.

GIGGLE command line and programming interfaces

Command line interface.

Indexing:

    giggle index \
        -i "intervals/*.bed.gz" \
        -o interval_index -s

Searching:

    giggle search \
        -i interval_index \
        -r chr1:1000000-2000000

    giggle search \
        -i interval_index \
        -q query.bed.gz

C interface

Indexing:

    #include "giggle_index.h"

    int main(int argc, char **argv) {
        uint64_t num_intervals =
            giggle_bulk_insert(
                "intervals/*.bed.gz",
                "interval_index",
                1);
        return 0;
    }

Searching:

    #include "giggle_index.h"

    int main(int argc, char **argv) {
        struct giggle_index *gi =
            giggle_load(
                "interval_index",
                block_store_giggle_set_data_handler);

        struct giggle_query_result *gqr =
            giggle_query(
                gi, "chr1", 1000000, 2000000, NULL);

        uint32_t i;
        for (i = 0; i < gqr->num_files; i++) {
            struct file_data *fd =
                file_index_get(gi->file_idx, i);
            /* only report files with at least one hit */
            if (giggle_get_query_len(gqr, i) > 0) {
                char *result;
                struct giggle_query_iter *gqi =
                    giggle_get_query_itr(gqr, i);
                while (giggle_query_next(gqi, &result) == 0)
                    printf("%s\t%s\n", result, fd->file_name);
                giggle_iter_destroy(&gqi);
            }
        }

        giggle_query_result_destroy(&gqr);
        giggle_index_destroy(&gi);
        return 0;
    }

Go interface: https://github.com/brentp/go-giggle

Indexing:

    import (
        giggle "github.com/brentp/go-giggle"
        "fmt"
    )

    func main() {
        index := giggle.New("interval_index", "intervals/*.bed.gz")
    }

Searching:

    import (
        giggle "github.com/brentp/go-giggle"
        "fmt"
        "strings"
    )

    func main() {
        index := giggle.Open("interval_index")
        res := index.Query("1", 565657, 567999)

        // all files in the index
        index.Files()

        // int showing total count
        res.TotalHits()

        // []uint32 giving number of hits for each file
        res.Hits()

        var lines []string
        // access results by index of file.
        lines = res.Of(0)
        fmt.Println(strings.Join(lines, "\n"))
        lines = res.Of(1)
    }

Python interface: https://github.com/brentp/python-giggle

Indexing:

    from giggle import Giggle

    index = Giggle.create('interval_index', 'intervals/*.bed.gz')

Searching:

    from giggle import Giggle

    index = Giggle('interval_index')
    print(index.files)

    result = index.query('chr1', 9999, 20000)
    print(result.n_files)
    print(result.n_total_hits)
    print(result.n_hits(0))

    for hit in result[0]:
        print(hit)  # hit is a string

Life Sciences Reporting Summary. Further information regarding the experimental design may be found in the Life Sciences Reporting Summary.

Code availability. All source code is available at https://github.com/ryanlayer/giggle.

Data availability. URLs for Roadmap Epigenomics, the UCSC Genome browser, and Fantom5 indices and a hosted interactive heatmap are available at https://github.com/ryanlayer/giggle/blob/master/README.md#hosted-data-and-services.
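As a worked illustration of the contingency-table statistics named under "Relationship comparison", the sketch below computes a two-tailed Fisher's exact test P value and an odds ratio for a single 2 × 2 table in plain Python. It makes no claim about how GIGGLE fills in the four cells from a query set and a database set; the function names and the example counts are hypothetical, and the two-tailed rule used here (summing the probabilities of all tables with the same margins that are no more likely than the observed one) is one common convention, assumed rather than taken from the text.

```python
from math import comb


def fisher_two_tailed(a, b, c, d):
    """Two-tailed Fisher's exact P value for the table [[a, b], [c, d]].

    With the margins fixed, the top-left cell follows a hypergeometric
    distribution. The P value sums the probabilities of every table that is
    no more likely than the observed one. Integer weights keep the
    comparison exact.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)

    def weight(x):
        # number of ways to obtain x in the top-left cell
        return comb(row1, x) * comb(row2, col1 - x)

    observed = weight(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    extreme = sum(weight(x) for x in range(lowest, highest + 1)
                  if weight(x) <= observed)
    return extreme / total


def odds_ratio(a, b, c, d):
    """Sample odds ratio (a*d)/(b*c); infinite when b*c is zero."""
    if b * c == 0:
        return float("inf")
    return (a * d) / (b * c)


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    print(fisher_two_tailed(3, 1, 1, 3), odds_ratio(3, 1, 1, 3))
```

For real analyses an established routine such as scipy.stats.fisher_exact would normally be preferred; the pure-Python version is given only so that the arithmetic is visible.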

Life Sciences Reporting Summary

Corresponding author(s): Ryan Layer, Aaron Quinlan

Nature Research wishes to improve the reproducibility of the work that we publish. This form is intended for publication with all accepted life science papers and provides structure for consistency and transparency in reporting. Every life science submission will use this form; some list items might not apply to an individual manuscript, but all fields must be completed for clarity. For further information on the points included in this form, see Reporting Life Sciences Research. For further information on Nature Research policies, including our data availability policy, see Authors & Referees and the Editorial Policy Checklist.

Experimental design

1. Sample size. Describe how sample size was determined. N/A
2. Data exclusions. Describe any data exclusions. N/A
3. Replication. Describe whether the experimental findings were reliably reproduced. N/A
4. Randomization. Describe how samples/organisms/participants were allocated into experimental groups. N/A
5. Blinding. Describe whether the investigators were blinded to group allocation during data collection and/or analysis. N/A

Note: all studies involving animals and/or human research participants must disclose whether blinding and randomization were used.

6. Statistical parameters For all figures and tables that use statistical methods, confirm that the following items are present in relevant figure legends (or in the Methods section if additional space is needed).

n/a Confirmed

The exact sample size (n) for each experimental group/condition, given as a discrete number and unit of measurement (animals, litters, cultures, etc.)
A description of how samples were collected, noting whether measurements were taken from distinct samples or whether the same sample was measured repeatedly
A statement indicating how many times each experiment was replicated
The statistical test(s) used and whether they are one- or two-sided (note: only common tests should be described solely by name; more complex techniques should be described in the Methods section)
A description of any assumptions or corrections, such as an adjustment for multiple comparisons
The test results (e.g. P values) given as exact values whenever possible and with confidence intervals noted
A clear description of statistics including central tendency (e.g. median, mean) and variation (e.g. standard deviation, interquartile range)
Clearly defined error bars

See the web collection on statistics for biologists for further resources and guidance.

Nature Methods: doi:10.1038/nmeth.4556

Software

Policy information about availability of computer code

7. Software. Describe the software used to analyze the data in this study. All source code for all analysis is available at https://github.com/ryanlayer/giggle

For manuscripts utilizing custom algorithms or software that are central to the paper but not yet described in the published literature, software must be made available to editors and reviewers upon request. We strongly encourage code deposition in a community repository (e.g. GitHub). Nature Methods guidance for providing algorithms and software for publication provides further information on this topic.

Materials and reagents

Policy information about availability of materials

8. Materials availability. Indicate whether there are restrictions on availability of unique materials or if these materials are only available for distribution by a for-profit company. None
9. Antibodies. Describe the antibodies used and how they were validated for use in the system under study (i.e. assay and species). N/A
10. Eukaryotic cell lines
a. State the source of each eukaryotic cell line used. N/A
b. Describe the method of cell line authentication used. N/A
c. Report whether the cell lines were tested for mycoplasma contamination. N/A
d. If any of the cell lines used are listed in the database of commonly misidentified cell lines maintained by ICLAC, provide a scientific rationale for their use. N/A

Animals and human research participants

Policy information about studies involving animals; when reporting animal research, follow the ARRIVE guidelines

11. Description of research animals. Provide details on animals and/or animal-derived materials used in the study. N/A

Policy information about studies involving human research participants

12. Description of human research participants. Describe the covariate-relevant population characteristics of the human research participants. N/A
