The NCBI Dbgap Database of Genotypes and Phenotypes

Total Page:16

File Type:pdf, Size:1020Kb

The NCBI Dbgap Database of Genotypes and Phenotypes COMMENTARY The NCBI dbGaP database of genotypes and phenotypes Matthew D Mailman, Michael Feolo, Yumi Jin, Masato Kimura, Kimberly Tryka, Rinat Bagoutdinov, Luning Hao, Anne Kiang, Justin Paschall, Lon Phan, Natalia Popova, Stephanie Pretel, Lora Ziyabari, Moira Lee, Yu Shao, Zhen Y Wang, Karl Sirotkin, Minghong Ward, Michael Kholodov, Kerry Zbicz, Jeffrey Beck, Michael Kimelman, Sergey Shevelev, Don Preuss, Eugene Yaschenko, Alan Graeff, James Ostell & Stephen T Sherry The National Center for Biotechnology Information has created the dbGaP public repository for individual-level http://www.nature.com/naturegenetics phenotype, exposure, genotype and sequence data and the associations between them. dbGaP assigns stable, unique identifiers to studies and subsets of information from those studies, including documents, individual phenotypic variables, tables of trait data, sets of genotype data, computed phenotype-genotype associations, and groups of study subjects who have given similar consents for use of their data. The technical advances and declining costs of The purposes of this description of dbGaP are collaborating researchers—a model that sup- high-throughput genotyping afford investi- threefold: (i) to describe dbGaP’s functionality ports documentation of protocols, phenotypes, gators fresh opportunities to do increasingly for users and submitters; (ii) to describe dbGaP’s variables and analysis through paper-based complex analyses of genetic associations with design and operational processes for database manuals and forms, personal communication phenotypic and disease characteristics. The methodologists to emulate or improve upon; or, more recently, electronic access privileges to Nature Publishing Group Group Nature Publishing 7 leading candidates for such genome-wide and (iii) to reassure the lay and scientific public a project data coordinating center. A central- association studies (GWAS) are existing large- that individual-level phenotype and genotype ized resource model, designed specifically for 200 © scale cohort and clinical studies that have data are securely and responsibly managed. broad data distribution, requires electronically collected rich sets of phenotype data. To sup- dbGaP accommodates studies of varying accessible explanations of variables tied closely port investigator access to data from these design. It contains four basic types of data: (i) with the data tables. dbGaP addresses that goal initiatives at the National Institutes of Health study documentation, including study descrip- by compiling all available descriptive metadata (NIH) and elsewhere, the National Center for tions, protocol documents, and data collection from participating clinical studies—docu- Biotechnology Information (NCBI) has cre- instruments, such as questionnaires; (ii) phe- mentation of protocols, data collection forms, ated a database of Genotypes and Phenotypes notypic data for each variable assessed, both manuals of procedure and the corpus of sta- (dbGaP) with stable identifiers that make it at an individual level and in summary form; tistical analyses—putting it in electronic form, possible for published studies to discuss or (iii) genetic data, including study subjects’ indi- and creating explicit links between measured cite the primary data in a specific and uniform vidual genotypes, pedigree information, fine phenotypic variables and related documenta- way. dbGaP provides unprecedented access to mapping results and resequencing traces; and tion (for example, the measurement for blood the large-scale genetic and phenotypic datasets (iv) statistical results, including association and pressure at time point 1 is linked to protocol required for GWAS designs, including public linkage analyses, when available. instructions on how and when to take the mea- access to study documents linked to summary To protect the confidentiality of study sub- surement). data on specific phenotype variables, statistical jects, dbGaP accepts only de-identified data overviews of the genetic information, position and requires investigators to go through an Tiered access for public and authorized of published associations on the genome, and authorization process in order to access indi- users authorized access to individual-level data (see vidual-level phenotype and genotype datasets. dbGaP is divided into public and authorized- Box 1 for summary of dbGaP features). Summary phenotype and genotype data, as access sections for aggregate and individual- well as study documents, are available without level data, respectively. The public dbGaP The authors are at the National Center for restriction. website provides a rich set of descriptive sta- Biotechnology Information, National Library tistics for each phenotype variable, allowing of Medicine, National Institutes of Health, Phenotypes are described in a structured users to assess whether the individual-level Bethesda, Maryland 20892-6510, USA. narrative framework data may be relevant to their research. Users Correspondence should be addressed to S.T.S. Traditionally, clinical phenotype data have with required bona fides may download indi- ([email protected]). only been shared among a limited group of vidual-level phenotypic data and genotype NATURE GENETICS | VOLUME 39 | NUMBER 10 | OCTOBER 2007 1181 COMMENTARY http://www.nature.com/naturegenetics Figure 1 Accession numbers in dbGaP are created separately for a study and its subordinate objects — phenotype variables, phenotype trait tables, documents and genotype datasets – prefixed phs, phv, pht, phd and phg, respectively. Accession numbers have suffixes “v” for data version, “p” for participant set version and “c” for consent group version. Phenotype and genotype data are distributed as both public summary records and individual-level data that require authorization to use. Associations between genotypes and phenotype traits of interest are not subordinate objects of a study, as an analysis Nature Publishing Group Group Nature Publishing may include data components from several studies simultaneously. 7 200 © files, as well as genotype chip intensity files the search interface to query text words against Committee (DAC) for review. DAC review cri- and pedigree information, for use in approved phenotypic variable names and descriptions or teria vary by program; however, all reviews seek research. Authorization for access to clini- against protocols and data collection forms, to ensure that the stated research purposes are cal data obligates the investigator and his or which contain explicit links to phenotypes. compatible with participant consent and that her institution to obey data use restrictions Authorized access. Authorization for access the PI and his or her institution will abide by dictated by participant informed consent to individual-level data files is obtained via the the study’s data use certification and terms of agreements and to comply with requirements dbGaP authorized access portal at http://view. use. Once access has been granted, researchers detailed in a governing Data Use Certification ncbi.nlm.nih.gov/dbgap-controlled. Entry can log into the system and download the de- (DUC) document. Policy details for the GAIN into the system requires that investigators have identified individual-level data files for which project1 nicely illustrate the process for GWAS either an NIH eRA Commons Account (extra- they have approval. studies in general. mural researchers) or an NIH login (intramural Public access. The public interface at http:// researchers) and that they be classified as prin- Publication view.ncbi.nlm.nih.gov/dbgap allows users to cipal investigators (PIs). Non-PIs cannot inde- For many studies in dbGaP, the investigators browse and search study metadata, phenotype pendently apply for access to individual-level who contributed the data will retain the exclu- variable summary information, documenta- data; however, they can be approved for local sive right to publish analyses of the dataset tion, and those association analyses that are in access to downloaded data files within the PI’s for a defined period of time after its release in the public domain. It is hoped that these data lab if they are listed as collaborating investiga- dbGaP, typically 9 or 12 months. The date that will stimulate new hypotheses and help inves- tors on a PI’s application. Although the NCBI the exclusive publication rights expire is called tigators identify those datasets that are suitable provides the interface for requests and down- the embargo release date in dbGaP. During this for their research. Users can quickly navigate loads, data authorization decisions are made period of exclusivity, other investigators may to any study and browse or search its com- by the NIH institute that sponsors each study be granted access to download and analyze the ponents; those interested in specific diseases in dbGaP. The dbGaP authorized access portal data, but they are expected not to seek publica- can browse dbGaP using disease terms linked authenticates users requesting access against tion of their analyses or conclusions until the to the National Library of Medicine (NLM) the NIH eRA Commons system, provides embargo release date. Embargo release dates Medical Subject Headings vocabulary. Users the necessary forms and forwards completed are provided in several places: (i) the public interested in a particular phenotype can use requests to the appropriate NIH Data Access dbGaP home page’s list of studies (ii) the pub- 1182 VOLUME 39 | NUMBER
Recommended publications
  • Identifying Novel Disease-Associated Variants and Understanding The
    Identifying Novel Disease-variants and Understanding the Role of the STAT1-STAT4 Locus in SLE A dissertation submitted to the Graduate School of University of Cincinnati In partial fulfillment of the requirements for the degree of Doctor of Philosophy in the Immunology Graduate Program of the College of Medicine by Zubin H. Patel B.S., Worcester Polytechnic Institute, 2009 John B. Harley, M.D., Ph.D. Committee Chair Gurjit Khurana Hershey, M.D., Ph.D Leah C. Kottyan, Ph.D. Harinder Singh, Ph.D. Matthew T. Weirauch, Ph.D. Abstract Systemic Lupus Erythematosus (SLE) or lupus is an autoimmune disorder caused by an overactive immune system with dysregulation of both innate and adaptive immune pathways. It can affect all major organ systems and may lead to inflammation of the serosal and mucosal surfaces. The pathogenesis of lupus is driven by genetic factors, environmental factors, and gene-environment interactions. Heredity accounts for a substantial proportion of SLE risk, and the role of specific genetic risk loci has been well established. Identifying the specific causal genetic variants and the underlying molecular mechanisms has been a major area of investigation. This thesis describes efforts to develop an analytical approach to identify candidate rare variants from trio analyses and a fine-mapping analysis at the STAT1-STAT4 locus, a well-replicated SLE-risk locus. For the STAT1-STAT4 locus, subsequent functional biological studies demonstrated genotype dependent gene expression, transcription factor binding, and DNA regulatory activity. Rare variants are classified as variants across the genome with an allele-frequency less than 1% in ancestral populations.
    [Show full text]
  • The Struggle to Find Reliable Results in Exome Sequencing Data
    View metadata, citation and similar papers at core.ac.uk brought to you by CORE METHODS ARTICLEprovided by Frontiers - Publisher Connector published: 12 February 2014 doi: 10.3389/fgene.2014.00016 The struggle to find reliable results in exome sequencing data: filtering out Mendelian errors Zubin H. Patel 1,2 †, Leah C. Kottyan1,3 †, Sara Lazaro1,3 , Marc S. Williams 4 , David H. Ledbetter 4 , GerardTromp 4 , Andrew Rupert 5 , Mojtaba Kohram 5 , Michael Wagner 5 , Ammar Husami 6 , Yaping Qian 6 , C. Alexander Valencia 6 , Kejian Zhang 6 , Margaret K. Hostetter 7 , John B. Harley 1,3 and Kenneth M. Kaufman1,3 * 1 Division of Rheumatology, Center for Autoimmune Genomics and Etiology, Cincinnati Children’s Hospital Medical Center, Cincinnati, OH, USA 2 Medical Scientist Training Program, University of Cincinnati College of Medicine, Cincinnati, OH, USA 3 Department of Veterans Affairs, Veterans Affairs Medical Center – Cincinnati, Cincinnati, OH, USA 4 Genomic Medicine Institute, Geisinger Health System, Danville, PA, USA 5 Division of Biomedical Informatics, Cincinnati Children’s Hospital Medical Center, Cincinnati, OH, USA 6 Division of Human Genetics, Cincinnati Children’s Hospital Medical Center, Cincinnati, OH, USA 7 Division of Infectious Disease, Cincinnati Children’s Hospital Medical Center, Cincinnati, OH, USA Edited by: Next Generation Sequencing studies generate a large quantity of genetic data in a Helena Kuivaniemi, Geisinger Health relatively cost and time efficient manner and provide an unprecedented opportunity to System, USA identify candidate causative variants that lead to disease phenotypes. A challenge to these Reviewed by: studies is the generation of sequencing artifacts by current technologies. To identify and Tatiana Foroud, Indiana University School of Medicine, USA characterize the properties that distinguish false positive variants from true variants, we Goo Jun, University of Michigan, USA sequenced a child and both parents (one trio) using DNA isolated from three sources *Correspondence: (blood, buccal cells, and saliva).
    [Show full text]
  • BIOINFORMATICS Pages 1–9
    Vol. 00 no. 00 2012 BIOINFORMATICS Pages 1–9 DELISHUS: An Efficient and Exact Algorithm for Genome-Wide Detection of Deletion Polymorphism in Autism Derek Aguiar 1;2, Bjarni V. Halldorsson´ 3, Eric M. Morrow ∗;4;5, Sorin Istrail ∗;1;2 1Department of Computer Science, Brown University, Providence, RI, USA 2Center for Computational Molecular Biology, Brown University, Providence, RI, USA 3School of Science and Engineering, Reykjavik University, Reykjavik, Iceland 4Department of Molecular Biology, Cell Biology and Biochemistry, Brown University, Providence, RI, USA 5Department of Psychiatry and Human Behavior, Brown University, Providence, RI, USA Received on XXXXX; revised on XXXXX; accepted on XXXXX Associate Editor: XXXXXXX ABSTRACT mutations with severe effects is more commonly being viewed as a Motivation: The understanding of the genetic determinants of com- major component of disease (McClellan and King, 2010). Phenoty- plex disease is undergoing a paradigm shift. Genetic heterogeneity pic heterogeneity – a large collection of individually rare or personal of rare mutations with deleterious effects is more commonly being conditions – is associated with a higher genetic heterogeneity than viewed as a major component of disease. Autism is an excellent previously assumed. This heterogeneity spectrum can be summari- example where research is active in identifying matches between the zed as follows: (i) individually rare mutations collectively explain phenotypic and genomic heterogeneities. A considerable portion of a large portion of complex disease; (ii) a single gene may contain autism appears to be correlated with copy number variation, which many severe but rare mutations in unrelated individuals; (iii) the is not directly probed by single nucleotide polymorphism (SNP) array same mutation may lead to different clinical conditions in different or sequencing technologies.
    [Show full text]
  • The Clark Phase-Able Sample Size Problem: Long-Range Phasing and Loss of Heterozygosity in GWAS
    The Clark Phase-able Sample Size Problem: Long-Range Phasing and Loss of Heterozygosity in GWAS Bjarni V. Halld´orsson1,3,4,,, Derek Aguiar1,2,,, Ryan Tarpine1,2,,, and Sorin Istrail1,2,, 1 Center for Computational Molecular Biology, Brown University [email protected] 2 Department of Computer Science, Brown University [email protected] 3 School of Science and Engineering, Reykjavik University 4 deCODE genetics Abstract. A phase transition is taking place today. The amount of data generated by genome resequencing technologies is so large that in some cases it is now less expensive to repeat the experiment than to store the information generated by the experiment. In the next few years it is quite possible that millions of Americans will have been genotyped. The question then arises of how to make the best use of this information and jointly estimate the haplotypes of all these individuals. The premise of the paper is that long shared genomic regions (or tracts) are unlikely unless the haplotypes are identical by descent (IBD), in contrast to short shared tracts which may be identical by state (IBS). Here we estimate for populations, using the US as a model, what sample size of genotyped individuals would be necessary to have sufficiently long shared haplotype regions (tracts) that are identical by descent (IBD), at a statistically significant level. These tracts can then be used as input for a Clark-like phasing method to obtain a complete phasing solution of the sample. We estimate in this paper that for a population like the US and about 1% of the people genotyped (approximately 2 million), tracts of about 200 SNPs long are shared between pairs of individuals IBD with high probability which assures the Clark method phasing success.
    [Show full text]
  • Integration and Missing Data Handling in Multiple Omics Studies
    INTEGRATION AND MISSING DATA HANDLING IN MULTIPLE OMICS STUDIES by Zhou Fang MS in Industrial Engineering, University of Pittsburgh, 2014 BS in Theoretical and Applied Mechanics, Fudan University, China, 2012 Submitted to the Graduate Faculty of the Department of Biostatistics Graduate school of Public health in partial fulfillment of the requirements for the degree of Doctor of Philosophy University of Pittsburgh 2018 UNIVERSITY OF PITTSBURGH DEPARTMENT OF BIOSTATISTICS GRADUATE SCHOOL OF PUBLIC HEALTH This dissertation was presented by Zhou Fang It was defended on May 3rd 2018 and approved by George C. Tseng, ScD, Professor, Department of Biostatistics, Graduate School of Public Health, University of Pittsburgh Wei Chen, PhD, Associate Professor, Department of Pediatrics, School of Medicine, University of Pittsburgh Gong Tang, PhD, Associate Professor, Department of Biostatistics, Graduate School of Public Health, University of Pittsburgh Ming Hu, PhD, Assistant Staff, Department of Quantitative Health Sciences, Lerner Research Institute, Cleveland Clinic Foundation Ying Ding, PhD, Assistant Professor, Department of Biostatistics, Graduate School of Public Health, University of Pittsburgh ii Dissertation Advisors: George C. Tseng, ScD, Professor, Department of Biostatistics, Graduate School of Public Health, University of Pittsburgh, Wei Chen, PhD, Associate Professor, Department of Pediatrics, School of Medicine, University of Pittsburgh iii Copyright c by Zhou Fang 2018 iv INTEGRATION AND MISSING DATA HANDLING IN MULTIPLE OMICS STUDIES Zhou Fang, PhD University of Pittsburgh, 2018 ABSTRACT In modern multiple omics high-throughput data analysis, data integration and missing- ness data handling are common problems in discovering regulatory mechanisms associated with complex diseases and boosting power and accuracy. Moreover, in genotyping problem, the integration of linkage disequilibrium (LD) and identity-by-descent (IBD) information be- comes essential to reach universal superior performance.
    [Show full text]
  • Erciyes Medical Genetics Days 2019
    ERCIYES MEDICAL JOURNAL International Participated Erciyes Medical Genetics Days 2019 Uluslararası Katılımlı Erciyes Tıp Genetik Günleri 2019 DOI: 10.14744/etd.2019.55631 KARE ERCIYES MEDICAL JOURNAL International Participated Erciyes Medical Genetics Days 2019 21–23 February 2019, Erciyes University, Kayseri, Turkey Honorary Presidents of Congress Mustafa Calis M. Hakan Poyrazoglu Guest Editor Munis Dundar President of the Congress Munis Dundar Organizing Committee Owner Munis Dundar On Behalf of Faculty of Yusuf Ozkul Medicine, Erciyes University Cetin Saatci Dr. M. Hakan POYRAZOĞLU Muhammet E. Dogan Executive Committee of Turkish Society of Medical Genetics Mehmet Ali Ergun KARE Oya Uyguner Ayca Aykut Publisher Mehmet Alikasifoglu Kare Publishing Beyhan Durak Aras Project Assistants Taha Bahsi Eymen İSKENDER Altug Koc Graphics Neslihan ÇAKIR Contact Address: Dumlupınar Mah., Cihan Sok., No: 15, Concord İstanbul, B Blok 162, Kadıköy İstanbul, Turkey Phone: +90 216 550 61 11 Fax: +90 216 550 61 12 E-mail: [email protected] A-II ERCIYES MEDICAL JOURNAL Editor-in-Chief Jordi RELLO Mehmet DOĞANAY Centro de Investigación Biomédica en Red (CIBERES), Department of Infectious Diseases and Clinical Microbiology, Vall d’Hebron Hospital Campus, Barcelona, Spain Faculty of Medicine, Erciyes University, Kayseri, Turkey Jafar SOLTANİ Editors Infectious Diseases Unit, Department of Pediatrics, Niyazi ACER Faculty of Medicine, Kurdistan University of Medical Department of Anatomy, Faculty of Medicine, Sciences, Sanandaj, Iran Erciyes University, Kayseri,
    [Show full text]
  • Joint Variant and De Novo Mutation Identification on Pedigrees from High-Throughput Sequencing Data
    bioRxiv preprint doi: https://doi.org/10.1101/001958; this version posted January 24, 2014. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license. JOINT VARIANT AND DE NOVO MUTATION IDENTIFICATION ON PEDIGREES FROM HIGH-THROUGHPUT SEQUENCING DATA JOHN G. CLEARY1,*, ROSS BRAITHWAITE1, KURT GAASTRA1, BRIAN S. HILBUSH2, STUART INGLIS1, SEAN A. IRVINE1, ALAN JACKSON1, RICHARD LITTIN1, SAHAR NOHZADEH-MALAKSHAH2, MINITA SHAH2, MEHUL RATHOD1, DAVID WARE1, LEN TRIGG1,†, AND FRANCISCO M. DE LA VEGA2,‡, † 1Real Time Genomics, Ltd., Hamilton 3240, New Zealand. 2Real Time Genomics, Inc. San Bruno, CA 94066, USA. The analysis of whole-genome or exome sequencing data from trios and pedigrees has being successfully applied to the identification of disease-causing mutations. However, most methods used to identify and genotype genetic variants from next- generation sequencing data ignore the relationships between samples, resulting in significant Mendelian errors, false positives and negatives. Here we present a Bayesian network framework that jointly analyses data from all members of a pedigree simultaneously using Mendelian segregation priors, yet providing the ability to detect de novo mutations in offspring, and is scalable to large pedigrees. We evaluated our method by simulations and analysis of WGS data from a 17 individual, 3- generation CEPH pedigree sequenced to 50X average depth. Compared to singleton calling, our family caller produced more high quality variants and eliminated spurious calls as judged by common quality metrics such as Ti/Tv, Het/Hom ratios, and dbSNP/SNP array data concordance.
    [Show full text]
  • Genetic Investigation of Kidney Disease
    Genetic Investigation of Kidney Disease A thesis submitted for the Degree of Doctor of Philosophy 2010 Daniel P. Gale UCL Division of Medicine Declaration I, Daniel Philip Gale, confirm that the work presented in this thesis is my own. Where information has been derived from other sources I confirm that this has been indicated in the thesis. 2 Acknowledgements I would like to thank the patients and families who participated in this study and from whom I have learned so much. Without their trust, hospitality and generosity none of this would have been possible. I am especially glad to have the opportunity to thank my supervisor, Professor Patrick Maxwell whose support, encouragement and wisdom have been so valuable in both the laboratory and the hospital. I would also like to thank the Medical Research Council for funding this work. I also owe a debt of gratitude to my colleagues in the laboratories and hospitals in which I worked – especially Drs Sarah Harten, Tapan Bhattacharyya, Alkis Pierides, Margaret Ashcroft and Deepa Shukla; and to my collaborators, in particular Professors Terry Cook, Ted Tuddenham and Peter Robbins, and Drs Matthew Pickering, Nick Talbot and Elena Goicoechea de Jorge. I would also like to thank Guy Neild, Tom Connor, Tony Segal and Oliver Staples for many stimulating discussions, and my family and friends for their unwavering support and interest. 3 Abstract Kidney disease is an important contributor to the burden of ill-health worldwide. Genetic factors have an important role in determining which people are affected by kidney disease, and this study aimed to identify the genes responsible for disease in families in which unusual kidney diseases were transmitted in a pattern suggesting autosomal dominant inheritance.
    [Show full text]
  • Exploration of Interaction Between Common and Rare Variants in Genetic Susceptibility to Holoprosencephaly Artem Kim
    Exploration of interaction between common and rare variants in genetic susceptibility to holoprosencephaly Artem Kim To cite this version: Artem Kim. Exploration of interaction between common and rare variants in genetic susceptibility to holoprosencephaly. Human health and pathology. Université Rennes 1, 2020. English. NNT : 2020REN1B019. tel-03199534 HAL Id: tel-03199534 https://tel.archives-ouvertes.fr/tel-03199534 Submitted on 15 Apr 2021 HAL is a multi-disciplinary open access L’archive ouverte pluridisciplinaire HAL, est archive for the deposit and dissemination of sci- destinée au dépôt et à la diffusion de documents entific research documents, whether they are pub- scientifiques de niveau recherche, publiés ou non, lished or not. The documents may come from émanant des établissements d’enseignement et de teaching and research institutions in France or recherche français ou étrangers, des laboratoires abroad, or from public or private research centers. publics ou privés. THESE DE DOCTORAT DE L'UNIVERSITE DE RENNES 1 ECOLE DOCTORALE N° 605 Biologie Santé Spécialité : Génétique, génomique et bio-informatique Par Artem KIM « Prénom NOM » Exploration de l’interaction entre variants rares et communs dans la susceptibilité génétique à holoprosencéphalie Thèse présentée et soutenue à Rennes, le 30/09/2020 Unité de recherche : Institut de Génétique et Développement de Rennes (IGDR), UMR6290 CNRS, Université Rennes 1 Rapporteurs avant soutenance : Composition du Jury : Thomas BOURGERON Thomas BOURGERON Professeur des universités Professeur
    [Show full text]
  • Estimation of Genotype Error Rate Using Samples with Pedigree Information—An Application on the Genechip Mapping 10K Array
    Genomics 84 (2004) 623–630 www.elsevier.com/locate/ygeno Estimation of genotype error rate using samples with pedigree information—an application on the GeneChip Mapping 10K array Ke Hao,a Cheng Li,a,b Carsten Rosenow,c and Wing Hung Wonga,d,* a Department of Biostatistics, Harvard School of Public Health, 655 Huntington Avenue, Boston, MA 02115, USA b Department of Biostatistics, Dana Farber Cancer Institute, Boston, MA, USA c Genomics Collaboration, Affymetrix, Santa Clara, CA, USA d Department of Statistics, Harvard University, Cambridge, MA, USA Received 16 January 2004; accepted 13 May 2004 Available online 13 July 2004 Abstract Currently, most analytical methods assume all observed genotypes are correct; however, it is clear that errors may reduce statistical power or bias inference in genetic studies. We propose procedures for estimating error rate in genetic analysis and apply them to study the GeneChip Mapping 10K array, which is a technology that has recently become available and allows researchers to survey over 10,000 SNPs in a single assay. We employed a strategy to estimate the genotype error rate in pedigree data. First, the ‘‘dose–response’’ reference curve between error rate and the observable error number were derived by simulation, conditional on given pedigree structures and genotypes. Second, the error rate was estimated by calibrating the number of observed errors in real data to the reference curve. We evaluated the performance of this method by simulation study and applied it to a data set of 30 pedigrees genotyped using the GeneChip Mapping 10K array. This method performed favorably in all scenarios we surveyed.
    [Show full text]
  • Mendelian Error Detection in Complex Pedigrees Using Weighted Constraint Satisfaction Techniques Marti Sanchez, Simon De Givry, Thomas Schiex
    Mendelian error detection in complex pedigrees using weighted constraint satisfaction techniques Marti Sanchez, Simon de Givry, Thomas Schiex To cite this version: Marti Sanchez, Simon de Givry, Thomas Schiex. Mendelian error detection in complex pedigrees using weighted constraint satisfaction techniques. Artificial Intelligence Research and Development, 163, IOS Press, pp.22, 2007, Frontiers in Artificial Intelligence and Applications, 978-1-60750-285-2 978-1-58603-798-7. hal-02814634 HAL Id: hal-02814634 https://hal.inrae.fr/hal-02814634 Submitted on 6 Jun 2020 HAL is a multi-disciplinary open access L’archive ouverte pluridisciplinaire HAL, est archive for the deposit and dissemination of sci- destinée au dépôt et à la diffusion de documents entific research documents, whether they are pub- scientifiques de niveau recherche, publiés ou non, lished or not. The documents may come from émanant des établissements d’enseignement et de teaching and research institutions in France or recherche français ou étrangers, des laboratoires abroad, or from public or private research centers. publics ou privés. Mendelian error detection in complex pedigrees using weighted constraint satisfaction techniques Marti Sanchez [email protected] Simon de Givry [email protected] Thomas Schiex [email protected] INRA - UBIA Toulouse, France October 12, 2007 Abstract With the arrival of high throughput genotyping techniques, the detection of likely genotyping errors is becoming an increasingly important problem. In this paper we are interested in errors that violate Mendelian laws. The problem of deciding if a Mendelian error exists in a pedigree is NP-complete [1]. Existing tools dedicated to this problem may offer different level of services: detect simple inconsistencies using local reasoning, prove inconsistency, detect the source of error, propose an optimal correc- tion for the error.
    [Show full text]
  • Using Mendelian Inheritance to Improve High-Throughput SNP Discovery
    INVESTIGATION Using Mendelian Inheritance To Improve High-Throughput SNP Discovery Nancy Chen,*,†,1 Cristopher V. Van Hout,‡ Srikanth Gottipati,‡ and Andrew G. Clark‡ *Department of Ecology and Evolutionary Biology, †Cornell Laboratory of Ornithology, and ‡Department of Molecular Biology and Genetics, Cornell University, Ithaca, New York 14853 ORCID IDs: 0000-0001-8966-3449 (N.C.); 0000-0001-9689-5344 (C.V.V.H.). ABSTRACT Restriction site-associated DNA sequencing or genotyping-by-sequencing (GBS) approaches allow for rapid and cost- effective discovery and genotyping of thousands of single-nucleotide polymorphisms (SNPs) in multiple individuals. However, rigorous quality control practices are needed to avoid high levels of error and bias with these reduced representation methods. We developed a formal statistical framework for filtering spurious loci, using Mendelian inheritance patterns in nuclear families, that accommodates variable-quality genotype calls and missing data—both rampant issues with GBS data—and for identifying sex-linked SNPs. Simulations predict excellent performance of both the Mendelian filter and the sex-linkage assignment under a variety of conditions. We further evaluate our method by applying it to real GBS data and validating a subset of high-quality SNPs. These results demonstrate that our metric of Mendelian inheritance is a powerful quality filter for GBS loci that is complementary to standard coverage and Hardy– Weinberg filters. The described method, implemented in the software MendelChecker, will improve quality control during SNP discovery in nonmodel as well as model organisms. HE advent of next-generation sequencing technologies discovery and genotyping of thousands of single-nucleotide Thas revolutionized biological research by allowing the polymorphisms (SNPs) distributed across the genome.
    [Show full text]