<<

© 2012 America, Inc. All rights reserved. genetic variation genetic human of view population-scale extensive data an the create helped project, the of phase pilot the In ulations. pop multiple in variation human characterize sively comprehen to effort an Project, 1000 the gies were at the of heart in the decision 2007 to launch technolo These project. human the in used technology capillary gel the over costs reduced cally whole-genome and at at scale an dramati unprecedented enable (SOLiD), Technologies and (454) Life Diagnostics Roche Illumina, from those including technologies, sequencing High-throughput enable widespread data access. have developed and deployed several tools to M project makes all of its data publicly available. variation using new sequencing technologies, the methods to accurately discover and characterize catalog of and extensive the primary scientificgoals ofcreating both a deep projects ever undertaken in biology. the largest distributed data collection and analysis T Stephen Sherry Iliana Toneva Clarke Laura management and community access T Published Published online 27 a ([email protected]). in the Information, National ofLibrary , US National Institutes of Maryland, Health,USA. Bethesda, 1 high-coverage, plus individuals ~2,500 for sequence low- and coverage, ~4× ~20× whole-genome whole-exome collecting of scale project current the to plans of these revisions to repeated led capacity sequencing Increasing total. in of sequence (Tbp) pairs tera–base ~6 and individual per of sequence pairs giga–base ~6 representing individuals, 1,000 for coverage genome 2× whole was to collect Project Genomes for 1000 the plan initial The methods. data-distribution and lysis new for requirements ana substantial , created technologies sequencing high-throughput of European Bioinformatics Institute, Genome Campus, Hinxton, Cambridge, UK. he he embers of the project data coordination center The larger data volumes and shorter read lengths lengths read shorter and volumes data larger The he he 1 Supplementary Supplementary Note 000 000 Genomes 1 000 000 Genomes 1 1 , , Xiangqun Zheng-Bradley , , Brendan Vaughan 2 p , , Paul Flicek ril ril 2012; P roject was launched as one of . . Correspondence should be addressed to P.F. or ([email protected]) The Consortium 1 . d o i : 1 0 . 1 1 0 & The 1000 Genomes Project Consortium 3 8 / I n n n addition to m 1 e , , Don Preuss t h . 1 9 7

4 1 P , , Richard Smith roject: data - - - - ­

2 , , Rasko Leinonen 38,000 files and over 12 terabytes of data being avail being data of terabytes 12 over and files 38,000 in resulting data, sequence of Tbp 5 collected Project Pilot Genomes 1000 the fact, In estimates). original over generation sequence in increase (~25-fold total in individuals 500 for sequence whole-genome ~40× most important initial challenges are (i) collating all all collating (i) are challenges groups initial important analysis most major dozen two than more Table1 ( DCC the for challenge bioinformatics fundamental the is community wider that the data are available within the project and to the Managing data flow in the 1000 Genomes Project such D of components key workflows. the complex demonstrate to methods exa to data sequence raw pr from community the to ces resour data provide to members Project Genomes browser. genome and and to manage community access through the FTP site data flow, data to deposition ensure sequence archival project-specific to manage (NCBI) for Biotechnology Bioinformatics Institute (EBI) and the National Center Center (DCC) Coordination was set up jointlyData between the the European Therefore, frame. time able in a reason to community the were available data the ensure to and productively forward move to critical be would that data coordination recognized members files. accessible publicly 250,000 than more in data of more 260 terabytes than include resources ing project able to the community ata flow ata Here we describe the methods used by the 1000 1000 the by used methods the describe we Here As in previous efforts o m ject results that can be browsed. We provide provide We browsed. be can that results ject ples drawn from the project’s data-processing data-processing project’s the from drawn ples ). With nine different sequencing centers and and centers sequencing different nine With ). nature methods 1 , , Eugene Kulesha 2 National Center for Biotechnology 1 , , Martin Shumway 1 . . In March 2012 the still-grow 2– 3 A full A list full of members is provided 4 Fig. Fig.

, , the Project 1000 Genomes | O. N. | MAY | NO.5 VOL.9 3 1 1 and and perspective , , Chunlin Xiao Supplementary Supplementary 2 ,

2012 1 , the the , 2 | ,

 - - - ­

© 2012 Nature America, Inc. All rights reserved. to either the EBI or the NCBI, whereas whereas NCBI, the or EBI the either to oped automated data-submission methods project, the major sequencing centers devel and the NCBI SRA (ENA), Archive Nucleotide European the the SRA at the EBI, provided as a of service archives (SRAs): sequence-read two to the ble for the first multi-terabase submissions sequencing capacity global increases. as grow to poised sites both with day, approachesterabytesper ­currently 30 NCBI and EBI the of capacity submission times faster than FTP in typical usage. Using Aspera, the Aspera, a UDP-based method that achieves data members transfer project company the solutiontransferfrominternetan rates rely on to elected the 20–30 Instead, justified. not was infrastructure network dedicated a astronomy, building and physics so in tered requirementsencoun remaindatathose belowfor well sequence data-transfer time, same the At labor-intensive. very is way this data sequence with physicalharddrives sending to resorted havegroups some capacity.response, tion In cols such as FTP have not withscaled increased sequence produc community. the to resources these providing (v) and metadata; associated vari their and and files ant alignment sequence, to groups; access easy analysis the maintaining (iv) to results analysis intermediate and data sequencing both of availability rapid ensuring (iii) institutions; participating between data the exchanging (ii) standardization; and control quality necessary for centrally data sequencing the Illumina. IL, Institute; Sanger Trust Wellcome SC, ; Molecular for Institute Planck Max MPI, Technologies; Life AB, Roche; 454, University; Washington WU, Institute; Broad BI, of Medicine; College Baylor BCM, possible. as soon as released publically are data the All 4). (arrow DCC the to back provided are files variant and files alignment the Both variants. call to alignments the uses and genome the to data sequence the aligns 3), (arrow DCC the from data access group analysis The data. the on steps quality-control performs and 2) (arrow SRA the from files FASTQ retrieves DCC The data. exchange which 1), (arrow databases SRA of one two the to data raw their submit Figure  both SRA databases developed generalized

perspective | The 1000 Genomes Project was responsi In recent years, data transfer speeds using TCP/IP–based proto O. N. | MAY | NO.5 VOL.9 Production group 454 andAB SC andIL BGI, MPI BCM, BI, 1 exchange WU | Data s Data flow in the 1000 Genomes Project. The sequencing centers centers sequencing The Project. Genomes 1000 the in flow Data , Data flow 1 1 6 . Over the course of the

2012 NCB SRA EBI I | Community nature methods 2 Additional qualit Uniform qualit Uniform qualit Sample chec Sample chec Initial qualit File integrit File integrit control control DCC 5 , although handling datahandling although , y y y - - k k y y y VCF SAM or BAM FASTQ SRF Format T able able 1 5 4 3 9 immediate releas | interpretation Publications Population Alignment

Biological Analysi

genetic File File formats used in the 1000 Genomes Project variant combined calling group 7 members) consortium by (designed calls those for genotypes individual and calls variant storing for format text A column-based genome (designed by consortium members) of short-read data with respect to a reference A compact alignment format for placement quality values Text-based format for sequence and machines based on ZTR Container format for data from sequencing D escription s s

,

e - - - -

FTP site, which is, as of March 2012, more than 260 terabytes. terabytes. 260 than more 2012, March of as is, which site, FTP the it to approach obtain is of contents the to the mirror logical most and available, is set data Project Genomes 1000 entire The D ( EBI the at sites download mirrored has project The possible. as publicly releases prepublication data as described below as quickly data to exclusivelyfocus on and calls base quality scores. a decision to stop archiving raw intensity measurements from read understanding of the needs of the project-analysis group, leading to formats ( to the more compact Binary Alignment/Map (BAM) format read files (SRF) expansivesequence from the evolved also accepted and distributed by both the archives and the project have to search methods and the archivedaccess data. The data formats metadata consistency checking ( checking consistency metadata tity. For data, alignment quality control includes and integrity file and quality of the raw sequence data and confirming sample iden syntax checking includes this data, For control process. sequence VCF files. and BAM both for provided Genomes FTP site within 48–72 h after the EBI SRA has processed master. EBI the from faster download will world the in elsewhere and in users whereas mirror, Generally users process. in the Americas will Aspera access data most automatic quickly from the nightly NCBI a via h 24 within mirrored usually is copy NCBI the and EBI, the at DCC the by updated directly is copy master The capacity. the download overall increase efficiently and simultaneously access community t (VCF) format format call variant in distributed are results analysis and format, via the FTP site in and BAMthe project distributed within duced group. analysis the of requirements the and centers production the of output depending the on varies frequency release the project full the for and phase, pilot the during months two every approximately duced dated ( file a sequence.index with associated freezes data periodic though managed are data Project at EBI. the mirrored be must SRA first NCBI the to submitted originally data that requires processing This them. r f ata access ata t a As As a ‘community project’ resource All All data on the FTP site have been through an extensive quality- The raw sequence data, as FASTQ files, appear on the 1000 1000 the on appear files, FASTQ as data, sequence raw The Alignments based on a specific sequence.index file are pro are file sequence.index specific a on based Alignments p c : e / . / n f t c p b . Table i 1 . 0 n 0 i h 0 9 . g . . Index files created by the Tabix software g 1 e o ). ). This format shift was made possible by a better n v o / m 1

Supplementary Note Supplementary 0 e 0 s 0 . e g b e i n . a http://vcftools.sourceforge.net http://samtools.sourceforge.net/ http://en.wikipedia.org/wiki/FASTQ_format http://srf.sourceforge.net/ Website o c m . u e k s Supplementary Note Supplementary / / v f t o p l 8 1 / , the 1000 Genomes Project Project , Genomes the 1000 ) that provide project and and project provide that ) / f t p / ). These files were ). files These pro ) and NCBI ( NCBI ) and 7 and FASTQ / ). 1 f 0 t are also p : / / f t p - - - - © 2012 Nature America, Inc. All rights reserved. of both 1000 Genomes Project and other web-accessible indexed indexed web-accessible and other Project Genomes of 1000 both ( variants (SIFT) (VEP) Effect Variant Predictor the including tools variation Ensembl provides also browser Genomes browser. 1000 genome The (UCSC) Cruz Santa of University the or Ensembl as such genome resources in appear or dbSNP by processed are they before ants 1000 b dedicated the though on browser based infrastructure the Genomes Ensembl information regulatory whole- genome and genes -coding as such annotation, genome slicer. data the using tion Note ( tool data-slicing Tabix web-based a with via or of BAM and can subsections directly VCFobtain they either files genomic ­specific regions without downloading the complete files, ( to produce a large number of such results as FASTQ or BAM files likely types file to exclude filters site and FTP supports NCBI the aid searching. The search returns full file paths to either the EBI or to convention a strict follow which names, file data our in found using any sampleuser-specified identifier(s) or other information o mation. We developed a web interface ( infor integrity file and update last of time including directories enable to designed the FTP mirroring site and contains a was complete list of and all files file This site. the searching in assist to difficult. extremely be can data project by structure browsing the specific FTP directory accessing and locating files, available of thousands of hundreds ( on based are they date freeze sequence.index the for named ries site in FTP directo the via are distributed files analysis the the Indeed, set. than data entire rather genome the of regions specific from slices or alignment data raw targeted and results analysis in interested Our experience is that most users are more updated. be can the database before new to data rapid access allows files from data view to for users The ability (iv). context the genomic showing Ensembl from annotation and gene (iii); yellow in one variant with track a as shown database release the 20101123 with two variants in (ii); yellow variants from release 20110521 VCF file shown as a track reads by the lower arrow (i); variants from sequence noted by the upper arrow and sequence BAM file from EBI FTP site with consensus based on Ensembl version 63 are an NA12878 in the image from our October 2011 browser files to be displayed in ‘Location’ view. The tracks remote files to allow accessible BAM and VCF of the attachment enables Browser Genomes Figure BAM BAM and VCF in context genomic files ( Supplementary Note Supplementary r r o g One can view 1000 Genomes data in the context of extensive extensive of context the in data Genomes 1000 view can One from alignments or variants discovered wanting users For site FTP of the root at the provided is current.tree called A file / w f t ). VCF files can also be divided by sample name or popula or name sample by divided be also can VCF files ). s p e 2 1 s r | 3 . e 1 Remote file viewing. The 1000 The 1000 viewing. file Remote n PolyPhen and a Supplementary Note Supplementary 0 r 0 c 0 h g / ) to provide direct access to the current.tree file file current.tree the to access direct provide to ) e n 1 o 2 m as well as ‘sorting tolerant from intolerant’ intolerant’ from tolerant ‘sorting as well as e s . ). o 1 r Supplementary Note Supplementary 4 g predictions for all nonsynonymous nonsynonymous all for predictions / ). The browser displays project vari project displays browser The ). ). The browser supports viewing viewing supports browser The ). h t Fig. Fig. t - p : / / w 2 ). ). A archival stable w ). However, with with However, ). w Supplementary Supplementary (iv) (iii) (ii (i) . 1 ) 0 0 0 1 g 1 e ( n h o t m t p e : / s - - - / . cess takes advantage of advantage cess takes the copies of two synched the SRA, which projects and the wider community. The archival streamlined pro sequencing to offer Project 1000 all benefits large-scale Genomes the to support developed and of access Methods data submission D ([email protected]; r ( rss via available at h via set data public a as available are files VCF and BAM project submitted. similarly be will variants of the Database VariantsGenomic archive (DGVa) to dbSNP and have small polymorphisms submitted been nucleotide single- project Pilot “1000GENOMES”. handle the using tories ( format displayed VCF in downloaded be FASTA the likewise or can FASTQ Genotypes SAM, format. covering BAM, in downloaded individuals be can region selected for genome. the Sequence of region any for genotypes individual and reads t NCBI data browser at at browser data NCBI ( appropriate the using version of the Ensembl Application Programming directly Interface programmatically (API) be accessed can or these and queried available publicly also are browser at available is data h project pilot the containing and 60 code release Ensembl on based browser Genomes 1000 the of ­version o Supplementary Note Supplementary Supplementary Note Supplementary s t t iscussion s t t o Finally, all links and announcements of the available project currently data can all be found Services, Web Amazon of users For The project submits all called variants to the appropriate reposi the using data project download and explore also may Users project the support that databases MySQL underlying The h p . p x l t : : s m t / / / p / / 1 1 p : l / ), Twitter (@1000genomes) and also via an email list list email an via also and (@1000genomes) Twitter ), 0 0 i / l 0 0 w o 1 0 0 t 5 w g g b , and data have variation to submitted , been and structural e w e r n o n . 1 w o o 0 m m s 0 e h e 0 e r t s . g s t 1 . p e / s 0 nature methods . The browser displays both sequence sequence both displays browser The . 3 n : / 0 . ). o / ). a 0 w h m m g t w e t e a p n s w z . : o o o / . 1 m / n r w 0 g a e 0 / w w s , and announcements are made made are announcements and , 0 . s w o g . c e r . Supplementary Supplementary Note n o g n

c m / | o . O. N. | MAY | NO.5 VOL.9 b m / i ( . e n Supplementary NoteSupplementary s l . m perspective o r . g n / i a h n . g n 1 6 o o . . Full project v u / n v c a

2012 e r ). m i a t e i n o | t n

s ).  ­ - / /

© 2012 Nature America, Inc. All rights reserved. The authors declare no competing financial interests. COMPETIN Program of the US National Institutes of Health National Library of Medicine. Biology Laboratory. This research was supported in part by the Intramural Research is provided by the Wellcome Trust (grant WT085532) and the European Molecular J. Barker, V. Silventoinen, G. Kellman and P. Jokinen. Funding support at theG. Cochrane.EBI For maintenance of the EBI computer infrastructure, we acknowledge J. Paschall, Z. Belaia, R. Sanders, C. O’Sullivan, S. Keenan, G. Ritchie and F. Cunningham, Y. Chen, W. McLaren, V. Zalunin, R. Radhakrishnan, D. Smirnov,For early work and support to the DCC, we thank Z. Iqbal, H. Khouri, A Note: information Supplementary is available on the data. the for understanding tools effective provide to effort community the of indicative is methods processing and analysis volumes The rapid of data encountered. much than larger previously of challenge the part in and technologies new to systems bioinformatics existing of adopting difficulty the part in services. tion annota variation centralized and cloud, WebServices Amazon to dbVar/DGVa, provisioning of alignment and variant files in the and variant data to sion sets of final dbSNP and preliminary both ( Browser Genomes the consortium analysis group are created. These include the 1000 data management ensures that resources targeted to users outside alignments. of completion the and data sequence of collection the at including flow, data the in points specific at simultaneously public the and consortium the of bers ( quality and integrity data data ensure to tests extensive multiple includes flow established the analysis, the Project supporting In Genome 1000 possible. as widely and rapidly as available made are data that ensuring while project the to infrastructure and support necessary are provide to activities of such goals The activity data-management and from an centralized organized DCC. by the learned lessons as quickly as possible and allowed the archives to benefit from the to data are community the available made Genomes 1000 that all ensures SRA the to DCC the of proximity close the addition, In processing. submission of task resource-intensive the distribute  Fig. ckno

perspective | The experiences of data management used for this project reflect Beyond directly supporting the needs of the project, centralized benefit can projects analysis and generation data Large-scale O. N. | MAY | NO.5 VOL.9

1 w ). As part of this process, data are made available to mem available are made data process, of this ). As part led G G F g INANCIAL ments

INTERESTS

h 2012 t t p : / / |

b nature methods r o w s e r . 1 0 0 0 g N e a n t u o r e m

M e e s t . h o o r d g s website.

/ ), submis ),

2–

4 - - - . 7. 6. 5. 4. 3. 2. 1. visit ShareAlike 3.0 Unported (CC BY-NC-SA) license. To view a copy of this license, This work is licensed under a Creative Commons Attribution-NonCommercial- com/reprints/index.html. R P 8. 16. 15. 14. 13. 12. 11. 10. 9. ublished ublished online at http://www.nature.com/naturemethods/. eprints and permissions information is available online at http://www.nature.

in harvesting comprehensive experimental details.experimental comprehensive harvesting in Browser. site. Web Project HapMap sequencing. population-scale from the data the 1 sharing. data prepublication T sharing. Bioinformatics data. sequencing Methods Nat. 2 Nat. Genet. Nat. 2007). Press, Laboratory Harbor Spring (Cold 41–61 J.C.) Stephens, & S.B. in .missense Protoc. Nat. algorithm. SIFT the using function protein on variantsnon-synonymous features genomic key other and regions regulatory genes, impact variants such how about information provides and variants discovered newly all annotate to method updated T (2010). Predictor.Effect SNP and API Ensembl (2011). files. (2011). 2156–2158 Li, H. Li, generationnext Archiving H. Sugawara, & G. Cochrane,Shumway, M., overload. data to adjusting Next-generationsequencing: Baker,M. N.L. Washington, K.R. Rosenbloom, InternationalThe L.D. Stein, & L. A.V.,Krishnan, Smith, G.A., Thorisson, variationgenome human of map A Consortium. Project Genomes 1000 Toronto International Data Release Workshop Authors. Prepublication data PrepublicationAuthors. Workshop Release Data TorontoInternational Church, D.M. Church, searching. and content database: dbSNP NCBI Sherry,S.T. & M.L. Foelo, I.A. Adzhubei, coding of effects the P.C.Predicting Ng, & S. Kumar,P.,Henikoff, W.McLaren, P. Flicek, TAB-delimitedgeneric from features sequence of retrieval fast Tabix: H. Li, P.Danecek, http://creativecommons.org/licenses/by-nc-sa/3.0 he he 000 Genomes 000 0 11 Genetic Variation: A Laboratory Manual Laboratory A Variation: Genetic T E , bar023 (2011). bar023 , oronto Bioinformatics nsembl et al. et

. Nature Nucleic Acids Res. Acids Nucleic et al. et

The /Map format and SAMtools. and formatAlignment/Map Sequence The 42 et al. et 4 et al. et

A et al. et

7 , 1073–1081 (2009). 1073–1081 , V greement describes a set of best practices for practices best of set a describes greement 2 , 813–814 (2010). 813–814 , , 495–499 (2010). 495–499 ,

et al. et ariant Ensembl 2011. 2011. Ensembl 5 4 P , 2078–2079 (2009). 2078–2079 , 6 The and VCFtools. and format call variant The Nucleic Acids Res. Acids Nucleic roject and have helped drive the widespread use of use widespread the drive helped have and roject et al. et Deriving the consequences of genomic variants with the with variantsgenomic of consequences the Deriving et al. et 1 Public data archives for genomic structural variation. structuralgenomic for archives data Public

, 168–170 (2009). 168–170 , 2 A method and server for predicting damagingpredicting for server and method A Nat. Methods Nat. 7 E , 718–719 (2011). 718–719 , The modENCODE Data Coordination Center: lessons Center: Coordination Data modENCODE The ffect ENCODE whole-genome data in the UCSC Genome UCSC the in datawhole-genome ENCODE Genome Res. Genome

3 8 P , D620–D625 (2010). D620–D625 , redictor ( redictor T hese practices have been adopted by the by adopted been have practices hese Nucleic Acids Res. Acids Nucleic

7

Nature 3 , 248–249 (2010). 248–249 , 8

, D870–D871 (2010). D870–D871 , Bioinformatics

1 VEP 5 (eds., Weiner,M.P.,Gabriel, (eds.,

, 1592–1593 (2005). 1592–1593 , 4 67 ) is a flexible and regularly and flexible a is ) , 1061–1073 (2010). 1061–1073 ,

3 / Database (Oxford) Database . 9

, D800–D806 D800–D806 , . 2 6 Bioinformatics , 2069–2070 ,

2 7

,