Standard Sequencing Service Data File Formats File Format V2.0 Software V2.0 March 2012

Total Page:16

File Type:pdf, Size:1020Kb

Standard Sequencing Service Data File Formats File Format V2.0 Software V2.0 March 2012 Standard Sequencing Service Data File Formats File format v2.0 Software v2.0 March 2012 CGA Tools, cPAL, and DNB are trademarks of Complete Genomics, Inc. in the US and certain other countries. All other trademarks are the property of their respective owners. Disclaimer of Warranties. COMPLETE GENOMICS, INC. PROVIDES THESE DATA IN GOOD FAITH TO THE RECIPIENT “AS IS.” COMPLETE GENOMICS, INC. MAKES NO REPRESENTATION OR WARRANTY, EXPRESS OR IMPLIED, INCLUDING WITHOUT LIMITATION ANY IMPLIED WARRANTY OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE OR USE, OR ANY OTHER STATUTORY WARRANTY. COMPLETE GENOMICS, INC. ASSUMES NO LEGAL LIABILITY OR RESPONSIBILITY FOR ANY PURPOSE FOR WHICH THE DATA ARE USED. Any permitted redistribution of the data should carry the Disclaimer of Warranties provided above. Data file formats are expected to evolve over time. Backward compatibility of any new file format is not guaranteed. Complete Genomics data is for Research Use Only and not for use in the treatment or diagnosis of any human subject. Information, descriptions and specifications in this publication are subject to change without notice. Copyright © 2011-2012 Complete Genomics Incorporated. All rights reserved. RM_DFFSS_2.0-02 Table of Contents Table of Contents Preface ...........................................................................................................................................................................................6 Conventions .................................................................................................................................................................................................. 6 Analysis Tools .............................................................................................................................................................................................. 6 References ..................................................................................................................................................................................................... 6 Introduction ................................................................................................................................................................................9 Sequencing Approach ............................................................................................................................................................................... 9 Mapping Reads and Calling Variations ............................................................................................................................................. 9 Read Data Format....................................................................................................................................................................................... 9 Data Delivery ..............................................................................................................................................................................................10 Data File Formats and Conventions ................................................................................................................................. 11 Data File Structure ...................................................................................................................................................................................11 Header Format...........................................................................................................................................................................................11 Sequence Coordinate System ..............................................................................................................................................................14 Data File Content and Organization .................................................................................................................................................15 ASM Results .............................................................................................................................................................................. 17 Small Variations and Annotations Files..........................................................................................................................................17 Variations .....................................................................................................................................................................................................20 ASM/var-[ASM-ID].tsv.bz2 ..............................................................................................................................................................20 Master Variations .....................................................................................................................................................................................25 ASM/masterVarBeta-[ASM-ID].tsv.bz2 .....................................................................................................................................25 Annotated Variants within Genes .....................................................................................................................................................31 ASM/gene-[ASM-ID].tsv.bz2 ...........................................................................................................................................................31 Annotated Variants within Non-coding RNAs .............................................................................................................................36 ASM/ncRNA-[ASM-ID].tsv.bz2 ......................................................................................................................................................36 Count of Variations by Gene ................................................................................................................................................................38 ASM/geneVarSummary-[ASM-ID].tsv ........................................................................................................................................38 Variations at Known dbSNP Loci .......................................................................................................................................................40 ASM/dbSNPAnnotated-[ASM-ID].tsv.bz2 .................................................................................................................................40 Sequencing Metrics and Variations Summary .............................................................................................................................44 ASM/summary-[ASM-ID].tsv .........................................................................................................................................................44 Copy Number Variation Files ..............................................................................................................................................................48 Copy Number Segmentation ...............................................................................................................................................................50 ASM/CNV/cnvSegmentsDiploidBeta-[ASM-ID].tsv .............................................................................................................50 Detailed Ploidy and Coverage Information ...................................................................................................................................53 ASM/CNV/cnvDetailsDiploidBeta-[ASM-ID].tsv.bz2 ..........................................................................................................53 Genomic Copy Number Analysis of Non-Diploid Samples Files ..........................................................................................56 Non-diploid CNV Segments ..................................................................................................................................................................57 ASM/CNV/cnvSegmentsNondiploidBeta-[ASM-ID].tsv.bz2 .............................................................................................57 Detailed Non-Diploid Coverage Level Information ...................................................................................................................59 ASM/CNV/cnvDetailsNondiploidBeta-[ASM-ID].tsv.bz2 ..................................................................................................59 © Complete Genomics, Inc. Standard Sequencing Service Data File Formats — ii Table of Contents Depth of Coverage Report ....................................................................................................................................................................62 ASM/CNV/depthOfCoverage_100000-[ASM-ID].tsv ...........................................................................................................62 Structural Variation Files ......................................................................................................................................................................65 Detected Junctions and Associated Annotations ........................................................................................................................67 ASM/SV/allJunctionsBeta-[ASM-ID].tsv ...................................................................................................................................67 High-confidence
Recommended publications
  • Duplications and Copy Number Variants of 8P23.1 Are Cytogenetically Indistinguishable but Distinct at the Molecular Level
    European Journal of Human Genetics (2005) 13, 1131–1136 & 2005 Nature Publishing Group All rights reserved 1018-4813/05 $30.00 www.nature.com/ejhg ARTICLE Duplications and copy number variants of 8p23.1 are cytogenetically indistinguishable but distinct at the molecular level John CK Barber*,1,2,3, Viv Maloney2, Edward J Hollox4, Annegret Stuke-Sontheimer5, Gabi du Bois6, Eva Daumiller6, Ute Klein-Vogler7, Andreas Dufke7, John AL Armour4 and Thomas Liehr8 1Wessex Regional Genetics Laboratory, Salisbury Hospital NHS Trust, Salisbury, Wiltshire, UK; 2National Genetics Reference Laboratory (Wessex), Salisbury Hospital NHS Trust, Salisbury, Wiltshire, UK; 3Human Genetics Division, Southampton University School of Medicine, Southampton General Hospital, Southampton, UK; 4Institute of Genetics, University of Nottingham, Queen’s Medical Centre, Nottingham, UK; 5Genetics Clinic, Wernigerode, Germany; 6Institute for Chromosome Diagnostics and Genetic Counselling, Boeblingen, Germany; 7Department of Medical Genetics, Eberhard-Karls University, Tuebingen, Germany; 8Institute for Human Genetics and Anthropology, Friedrich-Schiller University, Jena, Germany It has been proposed that duplications of 8p23.1 are either euchromatic variants of the 8p23.1 defensin domain with no phenotypic consequences or true duplications associated with developmental delay and heart defects. Here, we provide evidence for both alternatives in two new families. A duplication of most of band 8p23.1 (circa 5 Mb) was found in a girl of 8 years with pulmonary stenosis and mild language delay. BAC fluorescence in situ hybridisation (FISH) and multiplex amplifiable probe hybridisation (MAPH) showed that the two copies of the duplicated segment were sited, in an alternating fashion, between three copies of a circa 300–450 kb segment from 8p23.1 distal to REPD.
    [Show full text]
  • 35Th International Society for Animal Genetics Conference 7
    35th INTERNATIONAL SOCIETY FOR ANIMAL GENETICS CONFERENCE 7. 23.16 – 7.27. 2016 Salt Lake City, Utah ABSTRACT BOOK https://www.asas.org/meetings/isag2016 INVITED SPEAKERS S0100 – S0124 https://www.asas.org/meetings/isag2016 epigenetic modifications, such as DNA methylation, and measuring different proteins and cellular metab- INVITED SPEAKERS: FUNCTIONAL olites. These advancements provide unprecedented ANNOTATION OF ANIMAL opportunities to uncover the genetic architecture GENOMES (FAANG) ASAS-ISAG underlying phenotypic variation. In this context, the JOINT SYMPOSIUM main challenge is to decipher the flow of biological information that lies between the genotypes and phe- notypes under study. In other words, the new challenge S0100 Important lessons from complex genomes. is to integrate multiple sources of molecular infor- T. R. Gingeras* (Cold Spring Harbor Laboratory, mation (i.e., multiple layers of omics data to reveal Functional Genomics, Cold Spring Harbor, NY) the causal biological networks that underlie complex traits). It is important to note that knowledge regarding The ~3 billion base pairs of the human DNA rep- causal relationships among genes and phenotypes can resent a storage devise encoding information for be used to predict the behavior of complex systems, as hundreds of thousands of processes that can go on well as optimize management practices and selection within and outside a human cell. This information is strategies. Here, we describe a multi-step procedure revealed in the RNAs that are composed of 12 billion for inferring causal gene-phenotype networks underly- nucleotides, considering the strandedness and allelic ing complex phenotypes integrating multi-omics data. content of each of the diploid copies of the genome.
    [Show full text]
  • Cancer Sequencing Service Data File Formats File Format V2.4 Software V2.4 December 2012
    Cancer Sequencing Service Data File Formats File format v2.4 Software v2.4 December 2012 CGA Tools, cPAL, and DNB are trademarks of Complete Genomics, Inc. in the US and certain other countries. All other trademarks are the property of their respective owners. Disclaimer of Warranties. COMPLETE GENOMICS, INC. PROVIDES THESE DATA IN GOOD FAITH TO THE RECIPIENT “AS IS.” COMPLETE GENOMICS, INC. MAKES NO REPRESENTATION OR WARRANTY, EXPRESS OR IMPLIED, INCLUDING WITHOUT LIMITATION ANY IMPLIED WARRANTY OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE OR USE, OR ANY OTHER STATUTORY WARRANTY. COMPLETE GENOMICS, INC. ASSUMES NO LEGAL LIABILITY OR RESPONSIBILITY FOR ANY PURPOSE FOR WHICH THE DATA ARE USED. Any permitted redistribution of the data should carry the Disclaimer of Warranties provided above. Data file formats are expected to evolve over time. Backward compatibility of any new file format is not guaranteed. Complete Genomics data is for Research Use Only and not for use in the treatment or diagnosis of any human subject. Information, descriptions and specifications in this publication are subject to change without notice. Copyright © 2011-2012 Complete Genomics Incorporated. All rights reserved. RM_DFFCS_2.4-01 Table of Contents Table of Contents Preface ...........................................................................................................................................................................................1 Conventions .................................................................................................................................................................................................
    [Show full text]
  • Meta-Analysis of Gene Expression in Individuals with Autism Spectrum Disorders
    Meta-analysis of Gene Expression in Individuals with Autism Spectrum Disorders by Carolyn Lin Wei Ch’ng BSc., University of Michigan Ann Arbor, 2011 A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF Master of Science in THE FACULTY OF GRADUATE AND POSTDOCTORAL STUDIES (Bioinformatics) The University of British Columbia (Vancouver) August 2013 c Carolyn Lin Wei Ch’ng, 2013 Abstract Autism spectrum disorders (ASD) are clinically heterogeneous and biologically complex. State of the art genetics research has unveiled a large number of variants linked to ASD. But in general it remains unclear, what biological factors lead to changes in the brains of autistic individuals. We build on the premise that these heterogeneous genetic or genomic aberra- tions will converge towards a common impact downstream, which might be reflected in the transcriptomes of individuals with ASD. Similarly, a considerable number of transcriptome analyses have been performed in attempts to address this question, but their findings lack a clear consensus. As a result, each of these individual studies has not led to any significant advance in understanding the autistic phenotype as a whole. The goal of this research is to comprehensively re-evaluate these expression profiling studies by conducting a systematic meta-analysis. Here, we report a meta-analysis of over 1000 microarrays across twelve independent studies on expression changes in ASD compared to unaffected individuals, in blood and brain. We identified a number of genes that are consistently differentially expressed across studies of the brain, suggestive of effects on mitochondrial function. In blood, consistent changes were more difficult to identify, despite individual studies tending to exhibit larger effects than the brain studies.
    [Show full text]
  • Transmitted Duplication of 8P23.1–8P23.2 Associated with Speech Delay, Autism and Learning Difficulties
    European Journal of Human Genetics (2009) 17, 37–43 & 2009 Macmillan Publishers Limited All rights reserved 1018-4813/09 $32.00 www.nature.com/ejhg ARTICLE Transmitted duplication of 8p23.1–8p23.2 associated with speech delay, autism and learning difficulties Mary Glancy*,1, Angela Barnicoat2, Rajan Vijeratnam3, Sharon de Souza4, Joanne Gilmore1, Shuwen Huang5, Viv K Maloney5, N Simon Thomas6, David J Bunyan5, Ann Jackson1 and John CK Barber5,6,7 1NE London Regional Cytogenetics Laboratory, Great Ormond Street Hospital NHS Trust, London, UK; 2NE Thames Regional Clinical Genetics Service, Institute of Child Health, London, UK; 3Cedar House, St Michael’s Centre for Primary Care, Middlesex, UK; 4Paediatric Department, Chase Farm Hospital, Enfield, Middlesex, UK; 5National Genetics Reference Laboratory (Wessex), Salisbury NHS Foundation Trust, Salisbury, Wiltshire, UK; 6Wessex Regional Genetics Laboratory, Salisbury NHS Foundation Trust, Salisbury, Wiltshire, UK; 7Human Genetics Division, Southampton University School of Medicine, Southampton General Hospital, Southampton, UK Duplications of distal 8p with and without significant clinical phenotypes have been reported and are often associated with an unusual degree of structural complexity. Here, we present a duplication of 8p23.1– 8p23.2 ascertained in a child with speech delay and a diagnosis of ICD-10 autism. The same duplication was found in his mother who had epilepsy and learning problems. A combination of cytogenetic, FISH, microsatellite, MLPA and oaCGH analysis was used to show that the duplication extended over a minimum of 6.8 Mb between 3 539 893 and 10 323 426 bp. This interval contains 32 novel and 41 known genes, of which only microcephalin (MCPH1) is a plausible candidate gene for autism at present.
    [Show full text]
  • Variation in Protein Coding Genes Identifies Information Flow
    bioRxiv preprint doi: https://doi.org/10.1101/679456; this version posted June 21, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license. Animal complexity and information flow 1 1 2 3 4 5 Variation in protein coding genes identifies information flow as a contributor to 6 animal complexity 7 8 Jack Dean, Daniela Lopes Cardoso and Colin Sharpe* 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 Institute of Biological and Biomedical Sciences 25 School of Biological Science 26 University of Portsmouth, 27 Portsmouth, UK 28 PO16 7YH 29 30 * Author for correspondence 31 [email protected] 32 33 Orcid numbers: 34 DLC: 0000-0003-2683-1745 35 CS: 0000-0002-5022-0840 36 37 38 39 40 41 42 43 44 45 46 47 48 49 Abstract bioRxiv preprint doi: https://doi.org/10.1101/679456; this version posted June 21, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license. Animal complexity and information flow 2 1 Across the metazoans there is a trend towards greater organismal complexity. How 2 complexity is generated, however, is uncertain. Since C.elegans and humans have 3 approximately the same number of genes, the explanation will depend on how genes are 4 used, rather than their absolute number.
    [Show full text]
  • Content Based Search in Gene Expression Databases and a Meta-Analysis of Host Responses to Infection
    Content Based Search in Gene Expression Databases and a Meta-analysis of Host Responses to Infection A Thesis Submitted to the Faculty of Drexel University by Francis X. Bell in partial fulfillment of the requirements for the degree of Doctor of Philosophy November 2015 c Copyright 2015 Francis X. Bell. All Rights Reserved. ii Acknowledgments I would like to acknowledge and thank my advisor, Dr. Ahmet Sacan. Without his advice, support, and patience I would not have been able to accomplish all that I have. I would also like to thank my committee members and the Biomed Faculty that have guided me. I would like to give a special thanks for the members of the bioinformatics lab, in particular the members of the Sacan lab: Rehman Qureshi, Daisy Heng Yang, April Chunyu Zhao, and Yiqian Zhou. Thank you for creating a pleasant and friendly environment in the lab. I give the members of my family my sincerest gratitude for all that they have done for me. I cannot begin to repay my parents for their sacrifices. I am eternally grateful for everything they have done. The support of my sisters and their encouragement gave me the strength to persevere to the end. iii Table of Contents LIST OF TABLES.......................................................................... vii LIST OF FIGURES ........................................................................ xiv ABSTRACT ................................................................................ xvii 1. A BRIEF INTRODUCTION TO GENE EXPRESSION............................. 1 1.1 Central Dogma of Molecular Biology........................................... 1 1.1.1 Basic Transfers .......................................................... 1 1.1.2 Uncommon Transfers ................................................... 3 1.2 Gene Expression ................................................................. 4 1.2.1 Estimating Gene Expression ............................................ 4 1.2.2 DNA Microarrays ......................................................
    [Show full text]
  • Searching for Rare Variants Associated with Osahs-Related Phenotypes
    SEARCHING FOR RARE VARIANTS ASSOCIATED WITH OSAHS-RELATED PHENOTYPES THROUGH PEDIGREES by JINGJING LIANG Dissertation Advisor: Dr. Xiaofeng Zhu Department of Population and Quantitative Health Sciences CASE WESTERN RESERVE UNIVERSITY May 29, 2019 CASE WESTERN RESERVE UNIVERSITY SCHOOL OF GRADUATE STUDIES We hereby approve the thesis/dissertation of Jingjing Liang candidate for the degree of Ph.D Committee Chair Scott M. Williams Committee Member Jonathan L. Haines Committee Member Xiaofeng Zhu Committee Member Rong Xu Committee Member Curtis M. Tatsuoka Date of Defense January 29, 2019 *We also certify that written approval has been obtained for any proprietary material contained therein. 1 Table of Contents CHAPTER 1: LITERATURE REVIEW AND SPECIFIC AIMS ………………14 1.1 Obstructive sleep apnea-hypopnea syndrome …………………………………..14 1.2 AHI and SpO2 …………………………………………………………………...16 1.3 Rare variants and missing heritability …………………………………………..22 1.4 Rare variant association analysis ...………………………………………………24 1.5 Rare variant test using pedigree………………………………………………….27 1.6 Annotating variants in genetic regions ………………………………………….29 1.7 Mendelian randomization………………………………………………………..32 1.8 Specific aims ……………………………………………………………………36 CHAPTER 2: IDENTIFYING LOW FREQUENCY AND RARE VARIANTS ASSOCIATED WITH AVSPO2S USING PEDIGREES .…….………38 2.1 Introduction ……………………………………………………………………...38 2.2 Material and methods……………………………………………………………..42 2.2.1 Description of study samples…..……………………………………………..42 2.2.2 Overview of the method………………………………………………………45 2.2.3 Primary
    [Show full text]
  • Data-Driven and Knowledge-Driven Computational Models of Angiogenesis in Application to Peripheral Arterial Disease
    DATA-DRIVEN AND KNOWLEDGE-DRIVEN COMPUTATIONAL MODELS OF ANGIOGENESIS IN APPLICATION TO PERIPHERAL ARTERIAL DISEASE by Liang-Hui Chu A dissertation submitted to Johns Hopkins University in conformity with the requirements for the degree of Doctor of Philosophy Baltimore, Maryland March, 2015 © 2015 Liang-Hui Chu All Rights Reserved Abstract Angiogenesis, the formation of new blood vessels from pre-existing vessels, is involved in both physiological conditions (e.g. development, wound healing and exercise) and diseases (e.g. cancer, age-related macular degeneration, and ischemic diseases such as coronary artery disease and peripheral arterial disease). Peripheral arterial disease (PAD) affects approximately 8 to 12 million people in United States, especially those over the age of 50 and its prevalence is now comparable to that of coronary artery disease. To date, all clinical trials that includes stimulation of VEGF (vascular endothelial growth factor) and FGF (fibroblast growth factor) have failed. There is an unmet need to find novel genes and drug targets and predict potential therapeutics in PAD. We use the data-driven bioinformatic approach to identify angiogenesis-associated genes and predict new targets and repositioned drugs in PAD. We also formulate a mechanistic three- compartment model that includes the anti-angiogenic isoform VEGF165b. The thesis can serve as a framework for computational and experimental validations of novel drug targets and drugs in PAD. ii Acknowledgements I appreciate my advisor Dr. Aleksander S. Popel to guide my PhD studies for the five years at Johns Hopkins University. I also appreciate several professors on my thesis committee, Dr. Joel S. Bader, Dr.
    [Show full text]
  • Parental Micronutrient Deficiency Distorts Liver DNA Methylation And
    www.nature.com/scientificreports OPEN Parental micronutrient defciency distorts liver DNA methylation and expression of lipid genes associated Received: 24 October 2017 Accepted: 31 January 2018 with a fatty-liver-like phenotype in Published: xx xx xxxx ofspring Kaja H. Skjærven1, Lars Martin Jakt2, Jorge M. O. Fernandes2, John Arne Dahl 3, Anne-Catrin Adam1, Johanna Klughammer4, Christoph Bock 4 & Marit Espe1 Micronutrient status of parents can afect long term health of their progeny. Around 2 billion humans are afected by chronic micronutrient defciency. In this study we use zebrafsh as a model system to examine morphological, molecular and epigenetic changes in mature ofspring of parents that experienced a one-carbon (1-C) micronutrient defciency. Zebrafsh were fed a diet sufcient, or marginally defcient in 1-C nutrients (folate, vitamin B12, vitamin B6, methionine, choline), and then mated. Ofspring livers underwent histological examination, RNA sequencing and genome-wide DNA methylation analysis. Parental 1-C micronutrient defciency resulted in increased lipid inclusion and we identifed 686 diferentially expressed genes in ofspring liver, the majority of which were downregulated. Downregulated genes were enriched for functional categories related to sterol, steroid and lipid biosynthesis, as well as mitochondrial protein synthesis. Diferential DNA methylation was found at 2869 CpG sites, enriched in promoter regions and permutation analyses confrmed the association with parental feed. Our data indicate that parental 1-C nutrient status can persist as locus specifc DNA methylation marks in descendants and suggest an efect on lipid utilization and mitochondrial protein translation in F1 livers. This points toward parental micronutrients status as an important factor for ofspring health and welfare.
    [Show full text]
  • I GENETIC VARIATION in the DOMESTICATED DOG AS A
    GENETIC VARIATION IN THE DOMESTICATED DOG AS A MODEL OF HUMAN DISEASE DISSERTATION Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the Graduate School of The Ohio State University By Jennie Lynn Rowell, M.S. Graduate Program in Nursing The Ohio State University 2012 Dissertation Committee: Donna O. McCarthy, Adviser Carlos E. Alvarez, Cognate Adviser Kim McBride Jodi McDaniel i ii Copyright by Jennie Lynn Rowell 2012 iii ABSTRACT One of the greatest challenges facing clinical scientists is a developed understanding of the genetic basis for complex human diseases such as cancer. Despite many technological advances in genetics, progress has been slow. This is owed, in part, to intricate gene-gene interactions as well as poorly understood environmental influences on gene expression and genetic traits. The high level of heterogeneity of the human genome makes identification of these interactions and environmental influences difficult. Recently, the domesticated dog (Canis Lupis Familaris), has demonstrated its powerful applicability to human disease as a new model of genetic variation. Dogs are an excellent model for the study of complex disease in humans for a variety of reasons, including the extensive level of health care they receive, and their phenotypic diversity. Approximately 400 inherited diseases similar to those of humans are characterized in dogs, including complex disorders such as cancers, heart disease, and neurological disorders. The purpose of this dissertation project is to elucidate the role of genetic variation in the domesticated dog as a model for understanding human disease and is comprised of four manuscripts. I establish 1) the utility of the canine model for the study of human disease, 2) the theoretical and empirical establishment of a novel method for genomewide genetic analysis, 3) identification of germline loci associated with the risk for development of Osteosarcoma.
    [Show full text]