A Long-Read Assembly of the Dog Reference Genome

bioRxiv preprint doi: https://doi.org/10.1101/2021.05.05.442772; this version posted May 5, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. Dog10K_Boxer_Tasha_1.0: A long-read assembly of the dog reference genome Vidhya Jagannathan1, Christophe Hitte2, Jeffrey M. Kidd3,4, Patrick Masterson5, Terence D. Murphy5, Sarah Emery3, Brian Davis6, Reuben M. Buckley7, Yanhu Liu8,9, Xiangquan Zhang8,9, Tosso Leeb1, Ya-ping Zhang8,9, Elaine A. Os- trander7 and Guo-dong Wang8,9,*. 1 Institute of Genetics, Vetsuisse Faculty, University of Bern, 3001 Bern, Switzerland; [email protected] (V.J.); [email protected] (T.L.) 2 University of Rennes, CNRS, IGDR - UMR 6290, F-35000 Rennes, France; [email protected] (C.H.) 3 Department of Human Genetics, University of Michigan, Ann Arbor, Michigan, 48109 USA; [email protected] (J.K.); [email protected] (S.E) 4 Department of Computational Medicine and Bioinformatics, University of Michigan Medical School, USA; 5 National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, USA; [email protected] (P.M.); [email protected] (T.D.M) 7 National Human Genome Research Institute, National Institutes of Health, USA; [email protected] (E.A.O.); [email protected] (R.B) 6 Department of Integrative Biological Sciences, Texas A and M University, College Station, TX, 77840, USA; [email protected] (B.D.) 8 State Key Laboratory of Genetic Resources and Evolution, Kunming Institute of Zoology, Chinese Academy of Sciences, China; [email protected] (G.W); [email protected] (Y.Z); [email protected] (Y.L); [email protected] (X.Z) 9 Center for Excellence in Animal Evolution and Genetics, Chinese Academy of Sciences, China *Correspondence: [email protected]; Abstract: The domestic dog has evolved to be an important biomedical model for studies regarding the genetic basis of disease, morphology and behavior. Genetic studies in the dog have relied on a draft reference genome of a pure- bred female boxer dog named “Tasha” initially published in 2005. Derived from a Sanger whole genome shotgun sequencing approach coupled with limited clone-based sequencing, the initial assembly and subsequent updates have served as the predominant resource for canine genetics for 15 years. While the initial assembly produced a good quality draft, as with all assemblies produced at the time it contained gaps, assembly errors and missing sequences, par- ticularly in GC-rich regions, which are found at many promoters and in the first exons of protein coding genes. Here we present Dog10K_Boxer_Tasha_1.0, an improved chromosome-level highly contiguous genome assembly of Tasha created with long-read technologies, that increases sequence contiguity >100-fold, closes >23,000 gaps of the Canfam3.1 reference assembly and improves gene annotation by identifying >1200 new protein-coding transcripts. The assembly and annotation are available at NCBI under the accession GCF_000002285.5. Keywords: Canis lupus familiaris, high-quality, contiguity, Pacific Biosciences, annotation, resource 1. Introduction High quality reference genomes are fundamental assets for the study of genetic variation in any species. The ability to link genotype to phenotype and the subsequent identification of functional variants relies on high fidelity assessment of variants throughout the genome. This reliance is well illustrated by the domestic dog, which offers specific challenges for any genetic study. Featuring over 350 pure breeding populations, each breed is a mosaic of ancient and modern variants, and each reflects a complex history linking it to other related breeds. As a result of bottlenecks associated with domestication (15,000-30,000 years before present) and more recent individual breed formation (50 – 250 years before present) dog genomes contain long and frequent stretches of linkage disequilibrium (LD). While helpful for identifying loci of interest, long LD makes the necessary fine mapping for moving from marker to gene to variant both la- bor intensive and error prone. In 2005, the first high-quality draft (7.5×) sequence of a Boxer dog, named Tasha, was made publicly available [1]. The reference sequence has proven useful in discoveries of canine associated molecular variants [2,3], including single-nucleotide variants (SNVs) and small indels, regulatory sequences [4,5], bioRxiv preprint doi: https://doi.org/10.1101/2021.05.05.442772; this version posted May 5, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. large rearrangements and copy number variants [6][7] associated with both inter and intra gene variation. The resulting SNV arrays, designed based on variation relative to the Tasha-derived assembly, have led to the success of hundreds of genome wide association studies (GWAS), advancing the dog as a system for studies of disease susceptibility and molecular pathomechanisms, evolution, and behaviour. However, for the dog system to advance further, a long read high quality assembly of a reference genome is needed. This will greatly improve the sensitivity of variant detection, especially for large structural variation. Further- more, a high-quality assembly is an essential pre-requisite for accurate annotation, which is required to assay the potential functional effects of detected variants. Using the same dog as used for the initial assembly offers specific advantages, including the ability to integrate new findings with previous observations. A high quality genome assembly from the boxer Tasha will mean that the value of existing resources, such as existing bacterial artificial chromosome (BAC) libraries, and the wealth of experience and knowledge gained using previous versions of this dog’s genome, will be preserved for future research efforts. The dog genome assembly reported here was built using a combination of Pacific Biosciences (PacBio) continuous long read (CLR) sequencing technology, 10x Chromium linked reads, BAC pair-end sequences and the draft reference genome sequence CanFam3.1. 2. Materials and Methods 2.1. Whole Genome Sequencing. A single blood draw from which genomic DNA was isolated from blood leukocytes of a female Boxer, Tasha, which also was used to generate the previous CanFam 1, CanFam 2 and CanFam 3 genome assemblies was utilized here. Continuous long-read (CLR) sequencing was carried out at Novogene Bioin- formatics Technology Co., Ltd (Beijing, China) with a PacBio Sequel sequencer (Pacific Biosciences, Menlo Park, CA, USA). Approximately 100 µg of genomic DNA were used for sequencing. SMRTbell libraries were prepared using a DNA Template Prep Kit 1.0 (PacBio), and 56 20-kb SMRTbell libraries were constructed. A total of 252 Gb of sequence data was collected. High molecular weight DNA from Tasha was also sequenced with Chromium libraries (10x Genomics, Pleasanton, CA, USA) on Illumina (San Diego, CA, USA) HiSeq X (2x150 bp), generating 589,824,390 read pairs or 176 Gb of data. 2.2. Genome Assembly Workflow We assembled the genome using the Canu (v1.6) [8] and wtdbg2 [9] assembly algorithms. Briefly, the pipeline was composed of assembly, scaffolding and a final polishing step. PacBio reads had a mean read length of 8.5 kb and were used for the de novo assembly. The reads were corrected using the Canu error correction module which generates a consensus sequence for each read using its best set of long read overlaps. The corrected consensus reads were then assembled using the wtdbg2 algorithm[9], which is designed for assembly of long reads produced by the PacBio or Nanopore technologies. The assembled contigs were polished with raw PacBio reads using the WTPOA-CNS tool of the WTDBG2 package. This was followed by misassembly detection and correction with TIGMINT [10]. End sequences from BAC clones were extracted from the TraceDB of NCBI and used for scaffolding corrected contigs using the BESST algorithm (v2.2.8 ) [11]. Gap filling was done using the PacBio subreads with PBjelly (from PBSuite v15.8.24) [12] and one additional round of genome polishing was carried out using Pilon v1.23. [13] with the 10x Chromium reads. Finally, RaGOO (v1.1) [14] was used for reference guided scaffolding using CanFam3.1 as the reference. The draft scaffolds were subjected to additional gap closure using PBJelly. bioRxiv preprint doi: https://doi.org/10.1101/2021.05.05.442772; this version posted May 5, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. 2.3. Assembly Quality Control The scaffold order and orientation of the assembly was assessed by aligning it to an existing radiation hybrid map (RH-map) comprising 10,000 markers [15]. A chromosome-wide review of scaffold discrepancies was determined visually, and those that were incorrectly ordered were corrected. The assembly was also assessed for completeness using BUSCO [34] which provides a summary of genome completeness using a database of expected gene content based on near-universal single-copy orthologs from mammalian species with genomic sequence data. This includes 4,104 single copy genes that are evolutionarily conserved between mammals. 2.3.1 Fosmid end sequence alignment End sequences from previously constructed fosmid libraries from Tasha were aligned to the assembly as previously described [1]. Concordant clones were considered to be those with an inward read orientation and a size between 35,328 and 43,453 bp. Using bedtools [16], the physical coverage of concordant clones in 5 kb windows along the genome was determined. Segments of the primary chromosome assemblies that were not supported by any concordant fosmids were also identified. Analysis was limited to the primary chromosome assemblies (chr1-chr38, chrX) and any interval that intersected with chromosome ends was discarded.

A Long-Read Assembly of the Dog Reference Genome

Structural Forms of the Human Amylase Locus and Their Relationships to Snps, Haplotypes, and Obesity

Chromosome 1 (Human Genome/Inkae) A

Characterization of Genomic Copy Number Variation in Mus Musculus Associated with the Germline of Inbred and Wild Mouse Populations, Normal Development, and Cancer

Role of Amylase in Ovarian Cancer Mai Mohamed University of South Florida, [email protected]

Improved Detection of Gene Fusions by Applying Statistical Methods Reveals New Oncogenic RNA Cancer Drivers

Supplementary Note 1 - Copy Number Estimation with Fastcn

The Wayward Dog: Is the Australian Native Dog Or Dingo a Distinct Species?

Mus Musculus) Was Subject to Repeated Introgression Including the Rescue of a Pseudogene Miriam Linnenbrink†, Kristian K

Human Social Genomics in the Multi-Ethnic Study of Atherosclerosis

Cancer Sequencing Service Data File Formats File Format V2.4 Software V2.4 December 2012

Alpha-Amylase 2B Antibody / AMY2B (F54556)

The Australian Dingo: Untamed Or Feral? J