Coverage-Versus-Length Plots, a Simple Quality Control Step for De Novo Yeast Genome

bioRxiv preprint doi: https://doi.org/10.1101/421347; this version posted September 19, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license. 1 2 Coverage-versus-Length plots, a simple quality control step for de novo yeast genome 3 sequence assemblies 4 † 5 Alexander P. Douglass*, Caoimhe E. O’Brien , Benjamin Offei*, Aisling Y. Coughlan*, † ,1 6 Raúl A. Ortiz-Merino*, Geraldine Butler , Kevin P. Byrne*, Kenneth H. Wolfe* † 7 *School of Medicine and School of Biomolecular and Biomedical Sciences, UCD Conway 8 Institute, University College Dublin, Dublin 4, Ireland. 1 9 Corresponding author: [email protected] 10 11 Abstract 12 Illumina sequencing has revolutionized yeast genomics, with prices for commercial draft 13 genome sequencing now below $200. The popular SPAdes assembler makes it simple to 14 generate a de novo genome assembly for any yeast species. However, whereas making 15 genome assemblies has become routine, understanding what they contain is still 16 challenging. Here, we show how graphing the information that SPAdes provides about the 17 length and coverage of each scaffold can be used to investigate the nature of an assembly, 18 and to diagnose possible problems. Scaffolds derived from mitochondrial DNA, ribosomal 19 DNA, and yeast plasmids can be identified by their high coverage. Contaminating data, such 20 as cross-contamination from other samples in a multiplex sequencing run, can be identified 21 by its low coverage. Scaffolds derived from the bacteriophage PhiX174 and Lambda DNAs 22 that are frequently used as molecular standards in Illumina protocols can also be detected. 23 Assemblies of yeast genomes with high heterozygosity, such as interspecies hybrids, often 24 contain two types of scaffold: regions of the genome where the two alleles assembled into 25 two separate scaffolds and each has a coverage level C, and regions where the two alleles 26 co-assembled (collapsed) into a single scaffold that has a coverage level 2C. Visualizing the 27 data with Coverage-versus-Length (CVL) plots, which can be done using Microsoft Excel or 28 Google Sheets, provides a simple method to understand the structure of a genome assembly 29 and detect aberrant scaffolds or contigs. We provide a Python script that allows assemblies 30 to be filtered to remove contaminants identified in CVL plots. 31 1 bioRxiv preprint doi: https://doi.org/10.1101/421347; this version posted September 19, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license. 32 100-word article summary 33 We describe a simple new method, Coverage-versus-Length plots, for examining de novo 34 genome sequence assemblies. These plots enable researchers to detect scaffolds that have 35 unusually high or unusually low coverage, which allows contaminants, and scaffolds that 36 come from atypical parts of the organism’s DNA complement, to be detected. We show that 37 contaminants are common in yeast genomes sequenced in multiplex Illumina runs. We 38 provide instructions for making plots using Microsoft Excel or Google Sheets, and software 39 for filtering assemblies to remove contaminants. Contaminants can be detected and 40 removed, even without knowing their source. 41 42 Keywords: Genomics, Genome assembly, Bioinformatics, Yeast 43 44 45 Introduction 46 It is now easy and cheap to sequence a yeast genome using the Illumina platform. 47 Genome sequencing projects are usually described as either resequencing projects or de 48 novo assembly projects. In resequencing projects, Illumina reads from the strain to be 49 studied are mapped onto a reference genome from the same species (e.g. Saccharomyces 50 cerevisiae S288C), in order to find nucleotide polymorphisms or mutations. In de novo 51 assembly projects, the Illumina reads from the strain to be studied are assembled without 52 using a reference, to produce a set of contigs or scaffolds for the strain. De novo assembly is 53 used in several specific situations: (i), where no suitable reference genome sequence exists 54 (e.g., if the species has not been sequenced before); (ii), where the researcher chooses to 55 ignore the reference (e.g., to explore the pan-genome in natural isolates of S. cerevisiae); 56 (iii), where the strain being studied comes from an unknown species; or (iv), for mixed 57 samples, such as metagenomics projects. For any newly-discovered yeast species, in order 58 to publish a formal taxonomic description in the International Journal of Systematic and 59 Evolutionary Microbiology, authors are now requested to include a (de novo) sequence 1 60 assembly of its genome . 1 http://ijs.microbiologyresearch.org/content/journal/ijsem/about 2 bioRxiv preprint doi: https://doi.org/10.1101/421347; this version posted September 19, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license. 61 One of the most widely-used assembly programs for de novo yeast genome 62 sequences is SPAdes (Bankevich et al. 2012). SPAdes assembles genomes into contigs or 63 scaffolds, depending on the type of Illumina data that is input (single reads, paired-end 64 reads, or mate-pair reads). Contigs are contiguous stretches of assembled genome 65 sequence. Scaffolds are groups of contigs, whose relative order and orientation is known, but 66 which are separated by short gaps of unknown sequence (represented by poly-N regions in 67 the scaffold sequence). A typical yeast genome sequencing project will generate about 1.5 68 gigabases of raw data (e.g., 10 million Illumina reads of 150 bp each), which can be 69 assembled by SPAdes in a few hours on a standard laptop, giving more than 50x coverage 70 of the genome. 71 As the capacity of Illumina machines has grown, it has become increasingly common 72 that samples are sequenced in multiplex, i.e. genomic libraries are constructed from pooled 73 samples, using a different index (or pair of indexes) for each sample, and then sequenced 74 together in the same flowcell. For example, an Illumina HiSeq 4000 machine can sequence 75 96 yeast samples in multiplex in a single flowcell. It should be noted that the Illumina 76 multiplex methodology uses up to four separate sequencing reactions to obtain data from a 77 single flowcell spot containing a genomic DNA fragment: the library index(es) are sequenced 78 in 1-2 reactions using different primers than are used to sequence one or both ends of the 2 79 genomic DNA fragment . There is therefore some potential for mix-ups in which genomic 80 DNA reads are assigned to the wrong index and hence to the wrong sample. 81 The drawback to our ability to sequence genomes faster and cheaper is that the 82 amount of time and care spent on human examination of each sequence must inevitably 83 decrease (Lu and Salzberg 2018). Gross errors such as the assignment of a genome 84 sequence to the wrong species (Stavrou et al. 2018; Watanabe et al. 2018; Shen et al. 2016; 85 Pavlov et al. 2018), or yeast genome sequences that include many contigs from 86 contaminating bacteria (Donovan et al. 2018), are increasingly present in public databases. 87 In addition, the genome sequences of some yeast isolates have proven difficult to assemble 88 because they are highly heterozygous, in some cases because they are interspecies hybrids 89 (Pryszcz et al. 2015; Pryszcz and Gabaldon 2016; Schröder et al. 2016; Braun-Galleani et al. 90 2018). In this report, we present a simple method that can be used for quality control of an 91 Illumina genome assembly, by making use of the information that SPAdes produces about 92 the length and coverage of each contig or scaffold. 2 https://support.illumina.com/downloads/indexed-sequencing-overview-15057455.html 3 bioRxiv preprint doi: https://doi.org/10.1101/421347; this version posted September 19, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license. 93 94 Materials and Methods 95 All the example assemblies we present were made using SPAdes version 3.11.1 96 (Bankevich et al. 2012), downloaded from http://cab.spbu.ru/software/spades/. Illumina 97 FASTQ files were obtained by our laboratory using commercial DNA sequencing services, 98 except for Hanseniaspora osmophila NCYC58 for which we downloaded FASTQ files from 99 http://opendata.ifr.ac.uk/NCYC/ and assembled them (Table 1). We did not carry out any 100 data preprocessing steps. 101 Instructions for how to make a CVL plot from a SPAdes output file using common 102 spreadsheet programs are given in Box 1. 103 104 Results 105 SPAdes (Bankevich et al. 2012) is a popular assembly program because it is simple 106 to install, runs in a few hours on a laptop, and makes good assemblies ‘out of the box’ with 107 its default settings. It produces two main output files, called contigs.fasta and 108 scaffolds.fasta. The scaffolds.fasta file may include some poly-N regions where 109 two contigs have been joined with a gap between them, because their relative order and 110 orientation is known.

Load more