The 1000 Genomes Project

PERSPECTIVE

The 1000 Genomes Project: data management and community access

Laura Clarke¹, Xiangqun Zheng-Bradley¹, Richard Smith¹, Eugene Kulesha¹, Chunlin Xiao², Iliana Toneva¹, Brendan Vaughan¹, Don Preuss², Rasko Leinonen¹, Martin Shumway², Stephen Sherry², Paul Flicek¹ & The 1000 Genomes Project Consortium³

¹European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, UK. ²National Center for Biotechnology Information, National Library of Medicine, US National Institutes of Health, Bethesda, Maryland, USA. ³A full list of members is provided in the Supplementary Note. Correspondence should be addressed to P.F. ([email protected]) or The 1000 Genomes Project Consortium ([email protected]).

PUBLISHED ONLINE 27 April 2012; DOI:10.1038/NMETH.1974
NATURE METHODS | VOL.9 NO.5 | MAY 2012 | PERSPECTIVE
© 2012 Nature America, Inc. All rights reserved.

The 1000 Genomes Project was launched as one of the largest distributed data collection and analysis projects ever undertaken in biology. In addition to the primary scientific goals of creating both a deep catalog of human genetic variation and extensive methods to accurately discover and characterize variation using new sequencing technologies, the project makes all of its data publicly available. Members of the project data coordination center have developed and deployed several tools to enable widespread data access.

High-throughput sequencing technologies, including those from Illumina, Roche Diagnostics (454) and Life Technologies (SOLiD), enable whole-genome sequencing at an unprecedented scale and at dramatically reduced costs over the gel capillary technology used in the human genome project. These technologies were at the heart of the decision in 2007 to launch the 1000 Genomes Project, an effort to comprehensively characterize human variation in multiple populations. In the pilot phase of the project, the data helped create an extensive population-scale view of human genetic variation¹.

The larger data volumes and shorter read lengths of high-throughput sequencing technologies created substantial new requirements for bioinformatics, analysis and data-distribution methods. The initial plan for the 1000 Genomes Project was to collect 2× whole-genome coverage for 1,000 individuals, representing ~6 giga-base pairs of sequence per individual and ~6 tera-base pairs (Tbp) of sequence in total. Increasing sequencing capacity led to repeated revisions of these plans to the current project scale of collecting low-coverage, ~4× whole-genome and ~20× whole-exome sequence for ~2,500 individuals plus high-coverage, ~40× whole-genome sequence for 500 individuals in total (~25-fold increase in sequence generation over original estimates). In fact, the 1000 Genomes Pilot Project collected 5 Tbp of sequence data, resulting in 38,000 files and over 12 terabytes of data being available to the community¹. In March 2012 the still-growing project resources include more than 260 terabytes of data in more than 250,000 publicly accessible files.

As in previous efforts²⁻⁴, the 1000 Genomes Project members recognized that data coordination would be critical to move forward productively and to ensure the data were available to the community in a reasonable time frame. Therefore, the Data Coordination Center (DCC) was set up jointly between the European Bioinformatics Institute (EBI) and the National Center for Biotechnology Information (NCBI) to manage project-specific data flow, to ensure archival sequence data deposition and to manage community access through the FTP site and genome browser.

Here we describe the methods used by the 1000 Genomes Project members to provide data resources to the community from raw sequence data to project results that can be browsed. We provide examples drawn from the project's data-processing methods to demonstrate the key components of complex workflows.

Data flow

Managing data flow in the 1000 Genomes Project such that the data are available within the project and to the wider community is the fundamental bioinformatics challenge for the DCC (Fig. 1 and Supplementary Table 1). With nine different sequencing centers and more than two dozen major analysis groups¹, the most important initial challenges are (i) collating all the sequencing data centrally for necessary quality control and standardization; (ii) exchanging the data between participating institutions; (iii) ensuring rapid availability of both sequencing data and intermediate analysis results to the analysis groups; (iv) maintaining easy access to sequence, alignment and variant files and their associated metadata; and (v) providing these resources to the community.

Figure 1 | Data flow in the 1000 Genomes Project. The sequencing centers submit their raw data to one of the two SRA databases (arrow 1), which exchange data. The DCC retrieves FASTQ files from the SRA (arrow 2) and performs quality-control steps on the data. The analysis group accesses data from the DCC (arrow 3), aligns the sequence data to the genome and uses the alignments to call variants. Both the alignment files and variant files are provided back to the DCC (arrow 4). All the data are publicly released as soon as possible. BCM, Baylor College of Medicine; BI, Broad Institute; WU, Washington University; 454, Roche; AB, Life Technologies; MPI, Max Planck Institute for Molecular Genetics; SC, Wellcome Trust Sanger Institute; IL, Illumina.

In recent years, data transfer speeds using TCP/IP-based protocols such as FTP have not scaled with increased sequence production capacity. In response, some groups have resorted to sending physical hard drives with sequence data⁵, although handling data this way is very labor-intensive. At the same time, data-transfer requirements for sequence data remain well below those encountered in physics and astronomy, so building a dedicated network infrastructure was not justified. Instead, the project members elected to rely on an internet transfer solution from the company Aspera, a UDP-based method that achieves data transfer rates 20–30 times faster than FTP in typical usage. Using Aspera, the combined submission capacity of the EBI and NCBI currently approaches 30 terabytes per day, with both sites poised to grow as global sequencing capacity increases.

The 1000 Genomes Project was responsi[...] methods to search and access the archived data. The data formats accepted and distributed by both the archives and the project have also evolved from the expansive sequence read format (SRF) files to the more compact Binary Alignment/Map (BAM) and FASTQ formats (Table 1). This format shift was made possible by a better understanding of the needs of the project-analysis group, leading to a decision to stop archiving raw intensity measurements from read data to focus exclusively on base calls and quality scores.

As a 'community resource project'⁸, the 1000 Genomes Project publicly releases prepublication data as described below as quickly as possible. The project has mirrored download sites at the EBI (ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/) and NCBI (ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/) that provide project and community access simultaneously and efficiently increase the overall download capacity. The master copy is directly updated by the DCC at the EBI, and the NCBI copy is usually mirrored within 24 h via a nightly automatic Aspera process. Generally, users in the Americas will access data most quickly from the NCBI mirror, whereas users in Europe and elsewhere in the world will download faster from the EBI master.

The raw sequence data, as FASTQ files, appear on the 1000 Genomes FTP site within 48–72 h after the EBI SRA has processed them. This processing requires that data originally submitted to the NCBI SRA must first be mirrored at the EBI. Project data are managed through periodic data freezes associated with a dated sequence.index file (Supplementary Note). These files were produced approximately every two months during the pilot phase, and for the full project the release frequency varies depending on the output of the production centers and the requirements of the analysis group.

Alignments based on a specific sequence.index file are produced within the project and distributed via the FTP site in BAM format, and analysis results are distributed in variant call format (VCF)⁹. Index files created by the Tabix software¹⁰ are also provided for both BAM and VCF files.

All data on the FTP site have been through an extensive quality-control process. For sequence data, this includes checking syntax and quality of the raw sequence data and confirming sample identity. For alignment data, quality control includes file integrity and metadata consistency checking (Supplementary Note).

Data access

The entire 1000 Genomes Project data set is available, and the most logical approach to obtain it is to mirror the contents of the FTP site, which is, as of March 2012, more than 260 terabytes.

Table 1 | File formats used in the 1000 Genomes Project
Format | Description | Website
SRF | Container format for data from sequencing | http://srf.sourceforge.net/
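To make the "checking syntax" step of sequence-data quality control more concrete, the following is a minimal sketch of a FASTQ syntax check. It is not the DCC's actual quality-control code, which the text does not describe; the function name check_fastq is hypothetical, and the sketch assumes the common four-line FASTQ record layout (header, sequence, separator, qualities) with no wrapped sequence lines.

```python
import io


def check_fastq(handle):
    """Return a list of (record_number, problem) tuples for a FASTQ stream.

    Assumes four-line records: '@' header, sequence, '+' separator, qualities.
    """
    problems = []
    record = 0
    while True:
        header = handle.readline()
        if not header:
            break  # end of input
        if not header.strip():
            continue  # tolerate blank lines between records
        record += 1
        seq = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        qual = handle.readline().rstrip("\n")
        if not header.startswith("@"):
            problems.append((record, "header does not start with '@'"))
        if not plus.startswith("+"):
            problems.append((record, "separator line does not start with '+'"))
        if not seq:
            problems.append((record, "empty sequence"))
        if len(seq) != len(qual):
            problems.append((record, "sequence and quality lengths differ"))
    return problems


if __name__ == "__main__":
    example = "@read1\nACGT\n+\nIIII\n@read2\nACGT\n+\nIII\n"
    for number, problem in check_fastq(io.StringIO(example)):
        print(f"record {number}: {problem}")
```

A check of this kind only covers syntax; the quality and sample-identity checks mentioned in the text would need additional information such as quality-score distributions and genotype data.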
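As a rough illustration of what mirroring the full FTP site involves, the sketch below estimates transfer time for the more than 260 terabytes cited in the text. The baseline FTP throughput of 10 MB/s is a purely hypothetical figure chosen for illustration, and the 20x and 30x cases simply apply the speed-up range that the text attributes to Aspera in typical usage; real rates depend on the network path and the mirror chosen.

```python
def days_to_transfer(size_terabytes, rate_megabytes_per_s):
    """Days needed to move size_terabytes at a sustained rate (decimal units)."""
    size_bytes = size_terabytes * 1e12
    seconds = size_bytes / (rate_megabytes_per_s * 1e6)
    return seconds / 86400  # seconds per day


SITE_SIZE_TB = 260   # size of the FTP site as of March 2012, from the text
FTP_RATE_MB_S = 10   # hypothetical sustained FTP throughput (assumption)

if __name__ == "__main__":
    # Speed-up of 1 is plain FTP; 20 and 30 bracket the range quoted for Aspera.
    for speedup in (1, 20, 30):
        days = days_to_transfer(SITE_SIZE_TB, FTP_RATE_MB_S * speedup)
        print(f"{speedup:>2}x baseline: about {days:.0f} days")
```

Under these assumptions the difference is between most of a year and roughly two weeks, which is consistent with the text's point that FTP alone has not scaled with sequence production.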
