The International Genome Sample Resource (IGSR)

Nucleic Acids Research, 2017, Vol. 45, Database issue, D854–D859
Published online 15 September 2016
doi: 10.1093/nar/gkw829

The international Genome sample resource (IGSR): A worldwide collection of genome variation incorporating the 1000 Genomes Project data

Laura Clarke1, Susan Fairley1, Xiangqun Zheng-Bradley1, Ian Streeter1, Emily Perry1, Ernesto Lowy1, Anne-Marie Tassé2 and Paul Flicek1,*

1European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK and 2Public Population Project in Genomics and Society, McGill University and Genome Quebec Innovation Centre, Montreal, Quebec, Canada

Received August 16, 2016; Accepted September 08, 2016

*To whom correspondence should be addressed. Tel: +44 1223 492581; Fax: +44 1223 494494; Email: [email protected]

© The Author(s) 2016. Published by Oxford University Press on behalf of Nucleic Acids Research. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

ABSTRACT

The International Genome Sample Resource (IGSR; http://www.internationalgenome.org) expands in data type and population diversity the resources from the 1000 Genomes Project. IGSR represents the largest open collection of human variation data and provides easy access to these resources. IGSR was established in 2015 to maintain and extend the 1000 Genomes Project data, which has been widely used as a reference set of human variation and by researchers developing analysis methods. IGSR has mapped all of the 1000 Genomes sequence to the newest human reference (GRCh38), and will release updated variant calls to ensure maximal usefulness of the existing data. IGSR is collecting new structural variation data on the 1000 Genomes samples from long read sequencing and other technologies, and will collect relevant functional data into a single comprehensive resource. IGSR is extending coverage with new populations sequenced by collaborating groups. Here, we present the new data and analysis that IGSR has made available. We have also introduced a new data portal that increases discoverability of our data (previously only browseable through our FTP site) by focusing on particular samples, populations or data sets of interest.

INTRODUCTION

The 1000 Genomes Project cataloged human genetic variation by generating and analyzing whole genome sequencing data from more than 2500 individuals across 26 populations from five continental groups (1). All 1000 Genomes data were generated from samples with broad consent for open, public release of de-identified genetic data (2). The open nature of the data has led to its widespread use for various applications, ranging from reference panels for genotype imputation to prioritizing variants for further study to serving as a test bed for methods development (3). In order to ensure the continued usability of this valuable data collection for these and other purposes, we established the International Genome Sample Resource (IGSR) in 2015 to sustain, improve and expand the resources generated by the 1000 Genomes Project.

IGSR exists for multiple reasons. We maintain the value of the 1000 Genomes data by updating the fundamental results to be consistent with the newest versions of the human genome assembly and in the light of new analysis technology. We are collecting new types of data and new data sets generated on the 1000 Genomes samples to expand the overall resource. We also add data into IGSR from new samples with similarly open consent for release of de-identified genetic data to both enlarge the existing populations and to include populations that were not represented in the original 1000 Genomes Project.

We support several different activities within IGSR including (i) data coordination and ethical review for groups collecting data on new open samples; (ii) analysis pipelines to align sequence reads and call variants; (iii) data discovery and distribution through the IGSR data portal and FTP site; and (iv) user support and training. To collect, process and distribute the IGSR data, we leverage and have extended the infrastructure created for the 1000 Genomes Project (4). In the last year, we have added new sequence data on existing 1000 Genomes samples, incorporated the initial data for samples not sequenced in the 1000 Genomes Project, and calculated and released new alignments of the entirety of the 1000 Genomes phase 3 data to the updated human reference assembly, GRCh38.

In the following sections, we describe our data structure, how to browse and access our data and the support we provide for our users.

Data structure

Over the course of the 1000 Genomes Project, ∼500 000 data files requiring 750 TB were added to and hosted on the project FTP site in a manner and structure designed to support the needs of the 1000 Genomes Consortium. While maintaining all of the 1000 Genomes data, we have made changes to the FTP site structure to support the expanded scope of IGSR and improve the discoverability of the data. To ensure clarity as we add new data sets, we have designed a structure and nomenclature for the elements within our structure that describes the source of the data, what type of data it is and what other similar data exist (Table 1). More information about these data elements and how they can be used to filter and discover IGSR data is below and in the data portal section.

Data collections. Data collections are large-scale sets of related data designed to be useful for answering a host of linked questions. During IGSR’s first year, the number of data collections doubled (Table 2). In addition to the 1000 Genomes Project data, we have sequence data from four additional sources and are in the process of producing alignments and, eventually, variant calls using these new collections. We anticipate that further data collections will be added to the resource in the near future.

Analysis groups. The analysis group concept stems from the start of phase 1 of the 1000 Genomes Project, when it became clear that a simple study structure would be inadequate to distinguish between the low coverage and exome sequencing efforts. As IGSR takes on different data types, the analysis group designation allows us to effectively subdivide data to facilitate the discovery and use of specific types of data. This enables discovery of coherent data sets that can be processed together. Table 3 describes the existing analysis groups.

Data type. The data types represent both unprocessed submitted data in the form of raw sequence data files and processed data in the form of sequence alignments to either the GRCh37 or GRCh38 human reference assembly. Variant data includes sites of variation and genotype data arising from either sequencing or array-based assays. Data that do not fit into the existing main data categories, and which currently do not exist in sufficient qualities to warrant their

Other data collections have generated new information about existing samples. We will retain information from the studies that generated data, while, where samples are included in multiple projects, making it clear that this is the case. For example, the Simons Diversity Project uses GBR samples that were also part of the 1000 Genomes Project, but it has included more precise information about the origin of the donor: some of the GBR samples in this collection are labeled as English. In cases such as this, we will use the new information to extend our descriptions, but we will also maintain the GBR label to ensure that all the data we have for these populations can be found in one place.

Data access

We have several access methods for the IGSR data that take advantage of the data structure outlined above to organize and filter the data and which are described in detail below.

Data portal. We have developed a data portal to make it as easy as possible to explore the variety of data in IGSR. The portal is accessible at http://www.internationalgenome.org/data-portal/ or by clicking on Portal in the horizontal bar at the top of the IGSR website (Figure 1A). The default entry to the portal is a summary table listing the samples, their sex and population, and what data are available for them. This table can also be directly accessed via http://www.internationalgenome.org/data-portal/sample. Samples and data can be located using the filters at the top or by using the search to find specific samples or populations (Figure 1B). The summary entry for each sample includes links for detailed information about the sample (Figure 1C) and its corresponding population (Figure 1D), which are each described in the following paragraphs.

Clicking on a sample reference number leads to a dedicated sample page (Figure 1C). This can also be accessed by searching with the sample reference to jump straight to it. The top section of the sample page lists details regarding the sample including the population, sex and if there are any related samples also found in IGSR. Clicking on the population name brings up the population page described below. There is also a direct link to the sample in the BioSamples database (5) and to the relevant source, such as the Coriell Institute, where the lymphoblastoid cell line can be purchased. The lower section of the sample page lists all the files available for the sample on the IGSR FTP site. This list includes both files that are specific to the sample, such as FASTQ sequences and BAM alignments with their indexes, as well as files that include the sample among other data, such as VCF files containing genotypes from the sample.
