The International Genome Sample Resource (IGSR)

Total Page:16

File Type:pdf, Size:1020Kb

Load more

D854–D859 Nucleic Acids Research, 2017, Vol. 45, Database issue Published online 15 September 2016 doi: 10.1093/nar/gkw829 The international Genome sample resource (IGSR): A worldwide collection of genome variation incorporating the 1000 Genomes Project data Laura Clarke1, Susan Fairley1, Xiangqun Zheng-Bradley1, Ian Streeter1,EmilyPerry1, Ernesto Lowy1, Anne-Marie Tasse´ 2 and Paul Flicek1,* 1European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK and 2Public Population Project in Genomics and Society, McGill University and Genome Quebec Innovation Centre, Montreal, Quebec, Canada Received August 16, 2016; Accepted September 08, 2016 ABSTRACT open, public release of de-identified genetic data2 ( ). The open nature of the data has led to its widespread use for var- The International Genome Sample Resource (IGSR; ious applications, ranging from reference panels for geno- http://www.internationalgenome.org) expands in type imputation to prioritizing variants for further study to data type and population diversity the resources serving as a test bed for methods development (3). In or- from the 1000 Genomes Project. IGSR represents der to ensure the continued usability of this valuable data the largest open collection of human variation data collection for these and other purposes, we established the and provides easy access to these resources. IGSR International Genome Sample Resource (IGSR) in 2015 to was established in 2015 to maintain and extend the sustain, improve and expand the resources generated by the 1000 Genomes Project data, which has been widely 1000 Genomes Project. used as a reference set of human variation and by IGSR exists for multiple reasons. We maintain the value researchers developing analysis methods. IGSR of the 1000 Genomes data by updating the fundamental re- sults to be consistent with the newest versions of the hu- has mapped all of the 1000 Genomes sequence to man genome assembly and in the light of new analysis tech- the newest human reference (GRCh38), and will nology. We are collecting new types of data and new data release updated variant calls to ensure maximal sets generated on the 1000 Genomes samples to expand usefulness of the existing data. IGSR is collecting the overall resource. We also add data into IGSR from new structural variation data on the 1000 Genomes new samples with similarly open consent for release of de- samples from long read sequencing and other tech- identified genetic data to both enlarge the existing popula- nologies, and will collect relevant functional data tions and to include populations that were not represented into a single comprehensive resource. IGSR is ex- in the original 1000 Genomes Project. tending coverage with new populations sequenced We support several different activities within IGSR in- by collaborating groups. Here, we present the new cluding (i) data coordination and ethical review for groups data and analysis that IGSR has made available. collecting data on new open samples; (ii) analysis pipelines to align sequence reads and call variants; (iii) data discov- We have also introduced a new data portal that ery and distribution through the IGSR data portal and FTP increases discoverability of our data––previously site; and (iv) user support and training. To collect, pro- only browseable through our FTP site––by focusing cess and distribute the IGSR data, we leverage and have on particular samples, populations or data sets of extended the infrastructure created for the 1000 Genomes interest. Project (4). In the last year, we have added new sequence data on existing 1000 Genomes samples, incorporated the INTRODUCTION initial data for samples not sequenced in the 1000 Genomes Project, and calculated and released new alignments of the The 1000 Genomes Project cataloged human genetic vari- entirety of the 1000 Genomes phase 3 data to the updated ation by generating and analyzing whole genome sequenc- human reference assembly, GRCh38. ing data from more than 2500 individuals across 26 popu- lations from five continental groups (1). All 1000 Genomes data were generated from samples with broad consent for *To whom correspondence should be addressed. Tel: +44 1223 492581; Fax: +44 1223 494494; Email: [email protected] C The Author(s) 2016. Published by Oxford University Press on behalf of Nucleic Acids Research. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited. Nucleic Acids Research, 2017, Vol. 45, Database issue D855 In the following sections, we describe our data structure, Other data collections have generated new information how to browse and access our data and the support we pro- about existing samples. We will retain information from the vide for our users. studies that generated data, while, where samples are in- cluded in multiple projects, making it clear that this is the case. For example, the Simons Diversity Project uses GBR Data structure samples that were also part of the 1000 Genomes Project, Over the course of the 1000 Genomes Project, ∼500 000 but it has included more precise information about the ori- data files requiring 750 TB were added to and hosted on gin of the donor: some of the GBR samples in this collection the project FTP site in a manner and structure designed to are labeled as English. In cases such as this, we will use the support the needs of the 1000 Genomes Consortium. While new information to extend our descriptions, but we will also maintaining all of the 1000 Genomes data, we have made maintain the GBR label to ensure that all the data we have changes to the FTP site structure to support the expanded for these populations can be found in one place. scope of IGSR and improve the discoverability of the data. To ensure clarity as we add new data sets, we have designed Data access a structure and nomenclature for the elements within our structure that describes the source of the data, what type of We have several access methods for the IGSR data that take data it is and what other similar data exist (Table 1). More advantage of the data structure outlined above to organize information about these data elements and how they can be and filter the data and which are described in detail below. used to filter and discover IGSR data is below and inthe data portal section. Data portal. We have developed a data portal to make it as easy as possible to explore the variety of data in IGSR. The Data collections. Data collections are large-scale sets of portal is accessible at http://www.internationalgenome.org/ related data designed to be useful for answering a host of data-portal/ or by clicking on Portal in the horizontal bar at linked questions. During IGSR’s first year, the number of the top of the IGSR website (Figure 1A). The default entry data collections doubled (Table 2). In addition to the 1000 to the portal is a summary table listing the samples, their Genomes Project data, we have sequence data from four ad- sex and population, and what data are available for them. ditional sources and are in the process of producing align- This table can also be directly accessed via http://www. ments and, eventually, variant calls using these new col- internationalgenome.org/data-portal/sample. Samples and lections. We anticipate that further data collections will be data can be located using the filters at the top or by using added to the resource in the near future. the search to find specific samples or populations (Figure 1B). The summary entry for each sample includes links for Analysis groups. The analysis group concept stems from detailed information about the sample (Figure 1C) and its the start of phase 1 of the 1000 Genomes Project, when it corresponding population (Figure 1D), which are each de- became clear that a simple study structure would be inad- scribed in the following paragraphs. equate to distinguish between the low coverage and exome Clicking on a sample reference number leads to a dedi- sequencing efforts. As IGSR takes on different data types, cated sample page (Figure 1C). This can also be accessed the analysis group designation allows us to effectively sub- by searching with the sample reference to jump straight to divide data to facilitate the discovery and use of specific it. The top section of the sample page lists details regarding types of data. This enables discovery of coherent data sets the sample including the population, sex and if there are any that can be processed together. Table 3 describes the existing related samples also found in IGSR. Clicking on the popu- analysis groups. lation name brings up the population page described below. There is also a direct link to the sample in the BioSamples Data type. The data types represent both unprocessed database (5) and to the relevant source, such as the Coriell submitted data in the form of raw sequence data files and Institute, where the lymphoblastoid cell line can be pur- processed data in the form of sequence alignments to either chased. The lower section of the sample page lists all the the GRCh37 or GRCh38 human reference assembly. Vari- files available for the sample on the IGSR FTP site. This list ant data includes sites of variation and genotype data aris- includes both files that are specific to the sample, such as ing from either sequencing or array-based assays. Data that FASTQ sequences and BAM alignments with their indexes, do not fit into the existing main data categories, and which as well as files that include the sample among other data, currently do not exist in sufficient qualities to warrant their such as VCF files containing genotypes from the sample.
Recommended publications
  • Ensembl Genomes: Extending Ensembl Across the Taxonomic Space P

    Ensembl Genomes: Extending Ensembl Across the Taxonomic Space P

    Published online 1 November 2009 Nucleic Acids Research, 2010, Vol. 38, Database issue D563–D569 doi:10.1093/nar/gkp871 Ensembl Genomes: Extending Ensembl across the taxonomic space P. J. Kersey*, D. Lawson, E. Birney, P. S. Derwent, M. Haimel, J. Herrero, S. Keenan, A. Kerhornou, G. Koscielny, A. Ka¨ ha¨ ri, R. J. Kinsella, E. Kulesha, U. Maheswari, K. Megy, M. Nuhn, G. Proctor, D. Staines, F. Valentin, A. J. Vilella and A. Yates EMBL-European Bioinformatics Institute, Wellcome Trust Genome Campus, Cambridge CB10 1SD, UK Received August 14, 2009; Revised September 28, 2009; Accepted September 29, 2009 ABSTRACT nucleotide archives; numerous other genomes exist in states of partial assembly and annotation; thousands of Ensembl Genomes (http://www.ensemblgenomes viral genomes sequences have also been generated. .org) is a new portal offering integrated access to Moreover, the increasing use of high-throughput genome-scale data from non-vertebrate species sequencing technologies is rapidly reducing the cost of of scientific interest, developed using the Ensembl genome sequencing, leading to an accelerating rate of genome annotation and visualisation platform. data production. This not only makes it likely that in Ensembl Genomes consists of five sub-portals (for the near future, the genomes of all species of scientific bacteria, protists, fungi, plants and invertebrate interest will be sequenced; but also the genomes of many metazoa) designed to complement the availability individuals, with the possibility of providing accurate and of vertebrate genomes in Ensembl. Many of the sophisticated annotation through the similarly low-cost databases supporting the portal have been built in application of functional assays.
  • Rare Variant Contribution to Human Disease in 281,104 UK Biobank Exomes W ­ 1,19 1,19 2,19 2 2 Quanli Wang , Ryan S

    Rare Variant Contribution to Human Disease in 281,104 UK Biobank Exomes W ­ 1,19 1,19 2,19 2 2 Quanli Wang , Ryan S

    https://doi.org/10.1038/s41586-021-03855-y Accelerated Article Preview Rare variant contribution to human disease W in 281,104 UK Biobank exomes E VI Received: 3 November 2020 Quanli Wang, Ryan S. Dhindsa, Keren Carss, Andrew R. Harper, Abhishek N ag­­, I oa nn a Tachmazidou, Dimitrios Vitsios, Sri V. V. Deevi, Alex Mackay, EDaniel Muthas, Accepted: 28 July 2021 Michael Hühn, Sue Monkley, Henric O ls so n , S eb astian Wasilewski, Katherine R. Smith, Accelerated Article Preview Published Ruth March, Adam Platt, Carolina Haefliger & Slavé PetrovskiR online 10 August 2021 P Cite this article as: Wang, Q. et al. Rare variant This is a PDF fle of a peer-reviewed paper that has been accepted for publication. contribution to human disease in 281,104 UK Biobank exomes. Nature https:// Although unedited, the content has been subjectedE to preliminary formatting. Nature doi.org/10.1038/s41586-021-03855-y (2021). is providing this early version of the typeset paper as a service to our authors and Open access readers. The text and fgures will undergoL copyediting and a proof review before the paper is published in its fnal form. Please note that during the production process errors may be discovered which Ccould afect the content, and all legal disclaimers apply. TI R A D E T A R E L E C C A Nature | www.nature.com Article Rare variant contribution to human disease in 281,104 UK Biobank exomes W 1,19 1,19 2,19 2 2 https://doi.org/10.1038/s41586-021-03855-y Quanli Wang , Ryan S.
  • (DDD) Project: What a Genomic Approach Can Achieve

    (DDD) Project: What a Genomic Approach Can Achieve

    The Deciphering Development Disorders (DDD) project: What a genomic approach can achieve RCP ADVANCED MEDICINE, LONDON FEB 5TH 2018 HELEN FIRTH DM FRCP DCH, SANGER INSTITUTE 3,000,000,000 bases in each human genome Disease & developmental Health & development disorders Fascinating facts about your genome! –~20,000 protein-coding genes –~30% of genes have a known role in disease or developmental disorders –~10,000 protein altering variants –~100 protein truncating variants –~70 de novo mutations (~1-2 coding ie. In exons of genes) Rare Disease affects 1 in 17 people •Prior to DDD, diagnostic success in patients with rare paediatric disease was poor •Not possible to diagnose many patients with current methodology in routine use– maximum benefit in this group •DDD recruited patients with severe/extreme clinical features present from early childhood with high expectation of genetic basis •Recruitment was primarily of trios (ie The Doctor Sir Luke Fildes (1887) child and both parents) ~ 90% Making a genomic diagnosis of a rare disease improves care •Accurate diagnosis is the cornerstone of good medical practice – informing management, treatment, prognosis and prevention •Enables risk to other family members to be determined enabling predictive testing with potential for surveillance and therapy in some disorders February 28th 2018 •Reduces sense of isolation, enabling better access to support and information •Curtails the diagnostic odyssey •Not just a descriptive label; identifies the fundamental cause of disease A genomic diagnosis can be a gateway to better treatment •Not just a descriptive label; identifies the fundamental cause of disease •Biallelic mutations in the CFTR gene cause Cystic Fibrosis • CFTR protein is an epithelial ion channel regulating absorption/ secretion of salt and water in the lung, sweat glands, pancreas & GI tract.
  • Different Evolutionary Patterns of Snps Between Domains and Unassigned Regions in Human Protein‑Coding Sequences

    Different Evolutionary Patterns of Snps Between Domains and Unassigned Regions in Human Protein‑Coding Sequences

    View metadata, citation and similar papers at core.ac.uk brought to you by CORE provided by Springer - Publisher Connector Mol Genet Genomics (2016) 291:1127–1136 DOI 10.1007/s00438-016-1170-7 ORIGINAL ARTICLE Different evolutionary patterns of SNPs between domains and unassigned regions in human protein‑coding sequences Erli Pang1 · Xiaomei Wu2 · Kui Lin1 Received: 14 September 2015 / Accepted: 18 January 2016 / Published online: 30 January 2016 © The Author(s) 2016. This article is published with open access at Springerlink.com Abstract Protein evolution plays an important role in Furthermore, the selective strength on domains is signifi- the evolution of each genome. Because of their functional cantly greater than that on unassigned regions. In addition, nature, in general, most of their parts or sites are differently among all of the human protein sequences, there are 117 constrained selectively, particularly by purifying selection. PfamA domains in which no SNPs are found. Our results Most previous studies on protein evolution considered indi- highlight an important aspect of protein domains and may vidual proteins in their entirety or compared protein-coding contribute to our understanding of protein evolution. sequences with non-coding sequences. Less attention has been paid to the evolution of different parts within each pro- Keywords Human genome · Protein-coding sequence · tein of a given genome. To this end, based on PfamA anno- Protein domain · SNPs · Natural selection tation of all human proteins, each protein sequence can be split into two parts: domains or unassigned regions. Using this rationale, single nucleotide polymorphisms (SNPs) in Introduction protein-coding sequences from the 1000 Genomes Project were mapped according to two classifications: SNPs occur- Studying protein evolution is crucial for understanding ring within protein domains and those within unassigned the evolution of speciation and adaptation, senescence and regions.
  • Annual Scientific Report 2013 on the Cover Structure 3Fof in the Protein Data Bank, Determined by Laponogov, I

    Annual Scientific Report 2013 on the Cover Structure 3Fof in the Protein Data Bank, Determined by Laponogov, I

    EMBL-European Bioinformatics Institute Annual Scientific Report 2013 On the cover Structure 3fof in the Protein Data Bank, determined by Laponogov, I. et al. (2009) Structural insight into the quinolone-DNA cleavage complex of type IIA topoisomerases. Nature Structural & Molecular Biology 16, 667-669. © 2014 European Molecular Biology Laboratory This publication was produced by the External Relations team at the European Bioinformatics Institute (EMBL-EBI) A digital version of the brochure can be found at www.ebi.ac.uk/about/brochures For more information about EMBL-EBI please contact: [email protected] Contents Introduction & overview 3 Services 8 Genes, genomes and variation 8 Molecular atlas 12 Proteins and protein families 14 Molecular and cellular structures 18 Chemical biology 20 Molecular systems 22 Cross-domain tools and resources 24 Research 26 Support 32 ELIXIR 36 Facts and figures 38 Funding & resource allocation 38 Growth of core resources 40 Collaborations 42 Our staff in 2013 44 Scientific advisory committees 46 Major database collaborations 50 Publications 52 Organisation of EMBL-EBI leadership 61 2013 EMBL-EBI Annual Scientific Report 1 Foreword Welcome to EMBL-EBI’s 2013 Annual Scientific Report. Here we look back on our major achievements during the year, reflecting on the delivery of our world-class services, research, training, industry collaboration and European coordination of life-science data. The past year has been one full of exciting changes, both scientifically and organisationally. We unveiled a new website that helps users explore our resources more seamlessly, saw the publication of ground-breaking work in data storage and synthetic biology, joined the global alliance for global health, built important new relationships with our partners in industry and celebrated the launch of ELIXIR.
  • 1 Constructing the Scientific Population in the Human Genome Diversity and 1000 Genome Projects Joseph Vitti I. Introduction: P

    1 Constructing the Scientific Population in the Human Genome Diversity and 1000 Genome Projects Joseph Vitti I. Introduction: P

    Constructing the Scientific Population in the Human Genome Diversity and 1000 Genome Projects Joseph Vitti I. Introduction: Populations Coming into Focus In November 2012, some eleven years after the publication of the first draft sequence of a human genome, an article published in Nature reported a new ‘map’ of the human genome – created from not one, but 1,092 individuals. For many researchers, however, what was compelling was not the number of individuals sequenced, but rather the fourteen worldwide populations they represented. Comparisons that could be made within and among these populations represented new possibilities for the scientific study of human genetic variation. The paper – which has been cited over 400 times in the subsequent year – was the output of the first phase of the 1000 Genomes Project, one of several international research consortia launched with the intent of identifying and cataloguing such variation. With the project’s phase three data release, anticipated in early spring 2014, the sample size will rise to over 2500 individuals representing twenty-six populations. Each individual’s full sequence data is made publicly available online, and is also preserved through the establishment of immortal cell lines, from which DNA can be extracted and distributed. With these developments, population-based science has been made genomic, and scientific conceptions of human populations have begun to crystallize (see appendix). Such extensive biobanking and databasing of human populations is remarkable for a number of reasons, not least among them the socially charged terrain that such an enterprise inevitably must navigate. While the 1000 Genomes Project (1000G) has been relatively uncontroversial in its reception, predecessors such as the Human Genome Diversity Project (HGDP), first conceived in 1991, faced greater difficulty.
  • NIH-GDS: Genomic Data Sharing

    NIH-GDS: Genomic Data Sharing

    NIH-GDS: Genomic Data Sharing National Institutes of Health Data type Explain whether the research being considered for funding involves human data, non- human data, or both. Information to be included in this section: • Type of data being collected: human, non-human, or both human & non-human. • Type of genomic data to be shared: sequence, transcriptomic, epigenomic, and/or gene expression. • Level of the genomic data to be shared: Individual-level, aggregate-level, or both. • Relevant associated data to be shared: phenotype or exposure. • Information needed to interpret the data: study protocols, survey tools, data collection instruments, data dictionary, software (including version), codebook, pipeline metadata, etc. This information should be provided with unrestricted access for all data levels. Data repository Identify the data repositories to which the data will be submitted, and for human data, whether the data will be available through unrestricted or controlled-access. For human genomic data, investigators are expected to register all studies in the database of Genotypes and Phenotypes (dbGaP) by the time data cleaning and quality control measures begin in addition to submitting the data to the relevant NIH-designated data repository (e.g., dbGaP, Gene Expression Omnibus (GEO), Sequence Read Archive (SRA), the Cancer Genomics Hub) after registration. Non-human data may be made available through any widely used data repository, whether NIH- funded or not, such as GEO, SRA, Trace Archive, Array Express, Mouse Genome Informatics, WormBase, the Zebrafish Model Organism Database, GenBank, European Nucleotide Archive, or DNA Data Bank of Japan. Data in unrestricted-access repositories (e.g., The 1000 Genomes Project) are publicly available to anyone.
  • The 1000 Genomes Project

    The 1000 Genomes Project

    The 1000 Genomes Project: obtaining a deep catalogue of human genetic variation with new sequencing technology 2007First quarterfirstsecondfourththird20062005 quarter quarter quarter 2008 quarter Manolio, Brooks, Collins, J. Clin. Invest., May 2008 Chromosome 9p21: diabetes, coronary heart disease. Three genes, multiple SNPs 500,000 basepairs of Chr 9 (total length 109M bp) Zeggini et al, Science 2007; 316:1336-1341. After GWAS “hit”, what next? (remember, these are associations, not causes) One region (~Mb), multiple genes, or sometimes no genes (!), multiple SNPs to sort through Which is the right gene? What is the “causal” variant? The current SNP catalog is not complete – may not have the causal variant After a GWAS “hit”, what next? • One could get lucky (gene is a likely candidate based on previously known function*; a known associated SNP is a variant that prevents any gene function) • Gene expression correlates with believed function (e.g. tissue specific, disease specific) • Conservation of sequence between genomes of many mammals • Get a complete list of variants in the region, and one of them will be right. Need to sequence the associated region in many people. *CDKN: evidence for a role in islet cell growth. Also a tumor suppressor. Chromosome 9p21: diabetes, coronary heart disease. Three genes, multiple SNPs 500,000 basepairs of Chr 9 (total length 109M bp) Good bet on the gene, but what is the cause? 1000 Genomes Project: A resource for aiding human genetics studies • An essentially complete list of all variants in human
  • A Variant-Centric Perspective on Geographic Patterns of Human Allele Frequency Variation Arjun Biddanda, Daniel P Rice, John Novembre*

    A Variant-Centric Perspective on Geographic Patterns of Human Allele Frequency Variation Arjun Biddanda, Daniel P Rice, John Novembre*

    TOOLS AND RESOURCES A variant-centric perspective on geographic patterns of human allele frequency variation Arjun Biddanda, Daniel P Rice, John Novembre* Department of Human Genetics, University of Chicago, Chicago, United States Abstract A key challenge in human genetics is to understand the geographic distribution of human genetic variation. Often genetic variation is described by showing relationships among populations or individuals, drawing inferences over many variants. Here, we introduce an alternative representation of genetic variation that reveals the relative abundance of different allele frequency patterns. This approach allows viewers to easily see several features of human genetic structure: (1) most variants are rare and geographically localized, (2) variants that are common in a single geographic region are more likely to be shared across the globe than to be private to that region, and (3) where two individuals differ, it is most often due to variants that are found globally, regardless of whether the individuals are from the same region or different regions. Our variant- centric visualization clarifies the geographic patterns of human variation and can help address misconceptions about genetic differentiation among populations. Introduction Understanding human genetic variation, including its origins and its consequences, is one of the long-standing challenges of human biology. A first step is to learn the fundamental aspects of how human genomes vary within and between populations. For example, how often do variants have an *For correspondence: allele at high frequency in one narrow region of the world that is absent everywhere else? For [email protected] answering many applied questions, we need to know how many variants show any particular geo- graphic pattern in their allele frequencies.
  • Industry Programme EMBL-EBI and Industry

    Industry Programme EMBL-EBI and Industry

    The European Bioinformatics Institute . Cambridge Industry Programme EMBL-EBI and Industry Our Industry Programme is unique. It is a forum for interaction and knowledge exchange for those working at the forefront of applied bioinformatics, in over 20 major companies with global R&D activities. The programme focuses on precompetitive collaboration, open-source software and informatics standards, which have become essential to improving efficiency and reducing costs for the world’s bioindustries. The European Bioinformatics Institute (EMBL-EBI) is a global leader in the storage, annotation, interrogation and dissemination of large datasets of relevance to the bioindustries. We help companies realise the potential of ‘big data’ by combining our unique expertise with their own R&D knowledge, significantly enhancing their ability to exploit high-dimensional data to create value for their business. We see data as a critical tool that can accelerate research and development. Our mission is to provide opportunities for scientists across sectors to make the best possible use of public and proprietary data. This can help companies reduce costs, enhance product selection and validation and streamline their decision-making processes. Companies with large R&D capacity must ensure high data quality and integrate licensed information with both public and proprietary data. At EMBL-EBI, we help companies build publicly available data into their local infrastructure so they can add proprietary and licensed information in a secure way. Going forward, we see our interactions with our industry partners growing stronger, as the quality of data continues to rise. Through our programme and efforts such as the Innovative Medicines Initiative and the Pistoia Alliance, we support pre- competitive research collaborations, promote the uptake and utility of open-source software, and steer the development of data standards.
  • 1000 Genomes Project Consortiapedia.Fastercures.Org

    1000 Genomes Project Consortiapedia.Fastercures.Org

    1000 Genomes Project consortiapedia.fastercures.org 1000 Genomes Project consortiapedia.fastercures.org/consortia/1gp/ Research Areas At a Glance Status: Active Consortium Tool Development Year Launched: 2008 Initiating Organization: National Biomarker Research Institutes of Health/National Human Diagnostic, Genomic Biomarker Genome Research Institute Data-Sharing Enabler Initiator Type: Government Location: International Abstract The 1000 Genomes Project is a consortium focused on developing methods to collect, share, and integrate genomic data generated from multiple sources in multiple countries, in an effort to provide a foundation for investigating the relationship between genotype and phenotype. The goal is to use these methods to create a catalog of common genetic variation across human populations, which will be made accessible for the broader scientific community. Mission The aims of the 1000 Genomes Project are to genotype and provide accurate haplotype information on all forms of human DNA polymorphism in multiple human populations. To achieve these aims, the consortium worked across its partners to develop and evaluate methods for high-throughput sequencing. This process included the development of robust protocols for generating whole-genome shotgun and targeted sequence data, as well as algorithms to detect variants. In addition to a data repository, the study results should serve as a template for future genome-wide sequencing studies on larger sample sets. 1000 Genomes Project - consortiapedia.fastercures.org Page 1/9 1000 Genomes Project consortiapedia.fastercures.org The consortium identified the following populations for DNA sequencing: Yoruba in Ibadan, Nigeria; Japanese in Tokyo; Chinese in Beijing; Utah residents with ancestry from northern and western Europe; Luhya in Webuye, Kenya; Maasai in Kinyawa, Kenya; Toscani in Italy; Gujarati Indians in Houston; Chinese in metropolitan Denver; people of Mexican ancestry in Los Angeles; and people of African ancestry in the southwestern United States.
  • A National Database of Polish Variant Allele Frequencies

    A National Database of Polish Variant Allele Frequencies

    bioRxiv preprint doi: https://doi.org/10.1101/2021.07.07.451425; this version posted July 9, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-ND 4.0 International license. ‘The Thousand Polish Genomes Project’ - a national database of Polish variant allele frequencies *Elżbieta Kaja1,5, *Adrian Lejman1,5, Dawid Sielski1, Mateusz Sypniewski1,10, Tomasz Gambin3, Tomasz Suchocki2,13, Mateusz Dawidziuk9, Paweł Golik4, Marzena Wojtaszewska1,5, Maria Stępień1,11, Joanna Szyda2,13, Karolina Lisiak-Teodorczyk1, Filip Wolbach1, Daria Kołodziejska1, Katarzyna Ferdyn1,12, Alicja Woźna1,8, Marcin Żytkiewicz7, Anna Bodora-Troińska7, Waldemar Elikowski7, Zbigniew Król5, Artur Zaczyński5, Agnieszka Pawlak5,14, Robert Gil5, Waldemar Wierzba5, Paula Dobosz1,6, Katarzyna Zawadzka1 , Paweł Zawadzki1,8, Paweł Sztromwasser1 1 MNM Diagnostics Sp. z o.o., ul. Macieja Rataja 64, Poznań, 61-695, Poland 2 Biostatistics Group, Wrocław University of Environmental and Life Sciences, Wrocław, Poland 3 Institute of Computer Science, Warsaw University of Technology, Warsaw, 00-665, Poland 4 Institute of Genetics and Biotechnology, Faculty of Biology, University of Warsaw, Warsaw, 02-106, Poland 5 Central Clinical Hospital of Ministry of the Interior and Administration in Warsaw, Warsaw, 02-507, Poland 6 Department of Hematology, Transplantation and Internal Medicine University Clinical Center of the Medical