The NCBI Dbgap Database of Genotypes and Phenotypes

COMMENTARY The NCBI dbGaP database of genotypes and phenotypes Matthew D Mailman, Michael Feolo, Yumi Jin, Masato Kimura, Kimberly Tryka, Rinat Bagoutdinov, Luning Hao, Anne Kiang, Justin Paschall, Lon Phan, Natalia Popova, Stephanie Pretel, Lora Ziyabari, Moira Lee, Yu Shao, Zhen Y Wang, Karl Sirotkin, Minghong Ward, Michael Kholodov, Kerry Zbicz, Jeffrey Beck, Michael Kimelman, Sergey Shevelev, Don Preuss, Eugene Yaschenko, Alan Graeff, James Ostell & Stephen T Sherry The National Center for Biotechnology Information has created the dbGaP public repository for individual-level http://www.nature.com/naturegenetics phenotype, exposure, genotype and sequence data and the associations between them. dbGaP assigns stable, unique identifiers to studies and subsets of information from those studies, including documents, individual phenotypic variables, tables of trait data, sets of genotype data, computed phenotype-genotype associations, and groups of study subjects who have given similar consents for use of their data. The technical advances and declining costs of The purposes of this description of dbGaP are collaborating researchers—a model that sup- high-throughput genotyping afford investi- threefold: (i) to describe dbGaP’s functionality ports documentation of protocols, phenotypes, gators fresh opportunities to do increasingly for users and submitters; (ii) to describe dbGaP’s variables and analysis through paper-based complex analyses of genetic associations with design and operational processes for database manuals and forms, personal communication phenotypic and disease characteristics. The methodologists to emulate or improve upon; or, more recently, electronic access privileges to Nature Publishing Group Group Nature Publishing 7 leading candidates for such genome-wide and (iii) to reassure the lay and scientific public a project data coordinating center. A central- association studies (GWAS) are existing large- that individual-level phenotype and genotype ized resource model, designed specifically for 200 © scale cohort and clinical studies that have data are securely and responsibly managed. broad data distribution, requires electronically collected rich sets of phenotype data. To sup- dbGaP accommodates studies of varying accessible explanations of variables tied closely port investigator access to data from these design. It contains four basic types of data: (i) with the data tables. dbGaP addresses that goal initiatives at the National Institutes of Health study documentation, including study descrip- by compiling all available descriptive metadata (NIH) and elsewhere, the National Center for tions, protocol documents, and data collection from participating clinical studies—docu- Biotechnology Information (NCBI) has cre- instruments, such as questionnaires; (ii) phe- mentation of protocols, data collection forms, ated a database of Genotypes and Phenotypes notypic data for each variable assessed, both manuals of procedure and the corpus of sta- (dbGaP) with stable identifiers that make it at an individual level and in summary form; tistical analyses—putting it in electronic form, possible for published studies to discuss or (iii) genetic data, including study subjects’ indi- and creating explicit links between measured cite the primary data in a specific and uniform vidual genotypes, pedigree information, fine phenotypic variables and related documenta- way. dbGaP provides unprecedented access to mapping results and resequencing traces; and tion (for example, the measurement for blood the large-scale genetic and phenotypic datasets (iv) statistical results, including association and pressure at time point 1 is linked to protocol required for GWAS designs, including public linkage analyses, when available. instructions on how and when to take the mea- access to study documents linked to summary To protect the confidentiality of study sub- surement). data on specific phenotype variables, statistical jects, dbGaP accepts only de-identified data overviews of the genetic information, position and requires investigators to go through an Tiered access for public and authorized of published associations on the genome, and authorization process in order to access indi- users authorized access to individual-level data (see vidual-level phenotype and genotype datasets. dbGaP is divided into public and authorized- Box 1 for summary of dbGaP features). Summary phenotype and genotype data, as access sections for aggregate and individual- well as study documents, are available without level data, respectively. The public dbGaP The authors are at the National Center for restriction. website provides a rich set of descriptive sta- Biotechnology Information, National Library tistics for each phenotype variable, allowing of Medicine, National Institutes of Health, Phenotypes are described in a structured users to assess whether the individual-level Bethesda, Maryland 20892-6510, USA. narrative framework data may be relevant to their research. Users Correspondence should be addressed to S.T.S. Traditionally, clinical phenotype data have with required bona fides may download indi- ([email protected]). only been shared among a limited group of vidual-level phenotypic data and genotype NATURE GENETICS | VOLUME 39 | NUMBER 10 | OCTOBER 2007 1181 COMMENTARY http://www.nature.com/naturegenetics Figure 1 Accession numbers in dbGaP are created separately for a study and its subordinate objects — phenotype variables, phenotype trait tables, documents and genotype datasets – prefixed phs, phv, pht, phd and phg, respectively. Accession numbers have suffixes “v” for data version, “p” for participant set version and “c” for consent group version. Phenotype and genotype data are distributed as both public summary records and individual-level data that require authorization to use. Associations between genotypes and phenotype traits of interest are not subordinate objects of a study, as an analysis Nature Publishing Group Group Nature Publishing may include data components from several studies simultaneously. 7 200 © files, as well as genotype chip intensity files the search interface to query text words against Committee (DAC) for review. DAC review cri- and pedigree information, for use in approved phenotypic variable names and descriptions or teria vary by program; however, all reviews seek research. Authorization for access to clini- against protocols and data collection forms, to ensure that the stated research purposes are cal data obligates the investigator and his or which contain explicit links to phenotypes. compatible with participant consent and that her institution to obey data use restrictions Authorized access. Authorization for access the PI and his or her institution will abide by dictated by participant informed consent to individual-level data files is obtained via the the study’s data use certification and terms of agreements and to comply with requirements dbGaP authorized access portal at http://view. use. Once access has been granted, researchers detailed in a governing Data Use Certification ncbi.nlm.nih.gov/dbgap-controlled. Entry can log into the system and download the de- (DUC) document. Policy details for the GAIN into the system requires that investigators have identified individual-level data files for which project1 nicely illustrate the process for GWAS either an NIH eRA Commons Account (extra- they have approval. studies in general. mural researchers) or an NIH login (intramural Public access. The public interface at http:// researchers) and that they be classified as prin- Publication view.ncbi.nlm.nih.gov/dbgap allows users to cipal investigators (PIs). Non-PIs cannot inde- For many studies in dbGaP, the investigators browse and search study metadata, phenotype pendently apply for access to individual-level who contributed the data will retain the exclu- variable summary information, documenta- data; however, they can be approved for local sive right to publish analyses of the dataset tion, and those association analyses that are in access to downloaded data files within the PI’s for a defined period of time after its release in the public domain. It is hoped that these data lab if they are listed as collaborating investiga- dbGaP, typically 9 or 12 months. The date that will stimulate new hypotheses and help inves- tors on a PI’s application. Although the NCBI the exclusive publication rights expire is called tigators identify those datasets that are suitable provides the interface for requests and down- the embargo release date in dbGaP. During this for their research. Users can quickly navigate loads, data authorization decisions are made period of exclusivity, other investigators may to any study and browse or search its com- by the NIH institute that sponsors each study be granted access to download and analyze the ponents; those interested in specific diseases in dbGaP. The dbGaP authorized access portal data, but they are expected not to seek publica- can browse dbGaP using disease terms linked authenticates users requesting access against tion of their analyses or conclusions until the to the National Library of Medicine (NLM) the NIH eRA Commons system, provides embargo release date. Embargo release dates Medical Subject Headings vocabulary. Users the necessary forms and forwards completed are provided in several places: (i) the public interested in a particular phenotype can use requests to the appropriate NIH Data Access dbGaP home page’s list of studies (ii) the pub- 1182 VOLUME 39 | NUMBER

The NCBI Dbgap Database of Genotypes and Phenotypes

Identifying Novel Disease-Associated Variants and Understanding The

The Struggle to Find Reliable Results in Exome Sequencing Data

BIOINFORMATICS Pages 1–9

The Clark Phase-Able Sample Size Problem: Long-Range Phasing and Loss of Heterozygosity in GWAS

Integration and Missing Data Handling in Multiple Omics Studies

Erciyes Medical Genetics Days 2019

Joint Variant and De Novo Mutation Identification on Pedigrees from High-Throughput Sequencing Data

Genetic Investigation of Kidney Disease

Exploration of Interaction Between Common and Rare Variants in Genetic Susceptibility to Holoprosencephaly Artem Kim

Estimation of Genotype Error Rate Using Samples with Pedigree Information—An Application on the Genechip Mapping 10K Array

Mendelian Error Detection in Complex Pedigrees Using Weighted Constraint Satisfaction Techniques Marti Sanchez, Simon De Givry, Thomas Schiex

Using Mendelian Inheritance to Improve High-Throughput SNP Discovery