Data Formats Document Table of Contents

Data File Formats
File format v1.4
Software v1.9.0

Copyright © 2010 Complete Genomics Incorporated. All rights reserved. cPAL and DNB are trademarks of Complete Genomics, Inc. in the US and certain other countries. All other trademarks are the property of their respective owners.

Disclaimer of Warranties. COMPLETE GENOMICS, INC. PROVIDES THESE DATA IN GOOD FAITH TO THE RECIPIENT “AS IS.” COMPLETE GENOMICS, INC. MAKES NO REPRESENTATION OR WARRANTY, EXPRESS OR IMPLIED, INCLUDING WITHOUT LIMITATION ANY IMPLIED WARRANTY OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE OR USE, OR ANY OTHER STATUTORY WARRANTY. COMPLETE GENOMICS, INC. ASSUMES NO LEGAL LIABILITY OR RESPONSIBILITY FOR ANY PURPOSE FOR WHICH THE DATA ARE USED.

Any permitted redistribution of the data should carry the Disclaimer of Warranties provided above.

Data file formats are expected to evolve over time. Backward compatibility of any new file format is not guaranteed.

Complete Genomics data is for Research Use Only and not for use in the treatment or diagnosis of any human subject.

Information, descriptions and specifications in this publication are subject to change without notice.

Table of Contents

Preface
    Conventions
    Analysis Tools
    References
Introduction
    Sequencing Approach
    Mapping Reads and Calling Variations
    Read Data Format
    Data Delivery
Data File Formats and Conventions
    Data File Structure
    Header Format
    Sequence Coordinate System
Data File Content and Organization
    ASM Results
        Small Variations and Annotations
        Assemblies Underlying Called Variants
        Coverage and Reference Scores
        Quality and Characteristics of Sequenced Genome
    Library information
        Architecture of Reads and Gaps
        Empirically Observed Mate Gap Distribution
        Empirical Intraread Gap Distribution
        Sequence-dependent Empirical Intraread Gap Distribution
    Reads and Mapping Data
        Reads and Quality Scores
        Initial Mappings
        Association between Initial Mappings and Reads Data
Glossary

List of Tables

Table 1: Header Metadata
Table 2: Sequence Coordinate System (Build 36)
Table 3: Sequence Coordinate System (Build 37)
Table 4: Header of Variations File
Table 5: Variations File Description
Table 6: Header for Gene Annotation File
Table 7: Gene Annotation File Format Description
Table 8: Header for ncRNA File
Table 9: ncRNA File Format Description
Table 10: Header for Gene Variation Summary File
Table 11: Gene Variation Summary File Format Description
Table 12: Header for dbSNP Annotation File
Table 13: Annotated dbSNP File Format Description
Table 14: Header of Summary File
Table 15: Summary File Description
Table 16: Alignment CIGAR Format Modifiers in evidenceDnbs-[CHROMOSOME-ID]-[ASM-ID].tsv.bz2
Table 17: Alignment

