Heap: a Highly Sensitive and Accurate SNP Detection Tool for Low-Coverage High-Throughput Sequencing Data
Total Page:16
File Type:pdf, Size:1020Kb
Heap: a highly sensitive and accurate SNP detection tool for low-coverage high-throughput sequencing data Item Type Article Authors Kobayashi, Masaaki; Ohyanagi, Hajime; Takanashi, Hideki; Asano, Satomi; Kudo, Toru; Kajiya-Kanegae, Hiromi; Nagano, Atsushi J.; Tainaka, Hitoshi; Tokunaga, Tsuyoshi; Sazuka, Takashi; Iwata, Hiroyoshi; Tsutsumi, Nobuhiro; Yano, Kentaro Citation Kobayashi M, Ohyanagi H, Takanashi H, Asano S, Kudo T, et al. (2017) Heap: a highly sensitive and accurate SNP detection tool for low-coverage high-throughput sequencing data. DNA Research. Available: http://dx.doi.org/10.1093/dnares/dsx012. Eprint version Publisher's Version/PDF DOI 10.1093/dnares/dsx012 Publisher Oxford University Press (OUP) Journal DNA Research Rights This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact [email protected] Download date 24/09/2021 06:06:16 Item License http://creativecommons.org/licenses/by-nc/4.0/ Link to Item http://hdl.handle.net/10754/623635 Dna Research, 2017, 0(0), 1–9 doi: 10.1093/dnares/dsx012 Full paper Full paper Heap: a highly sensitive and accurate SNP detection tool for low-coverage high-throughput sequencing data Masaaki Kobayashi1, Hajime Ohyanagi1,2, Hideki Takanashi3, Satomi Asano1, Toru Kudo1, Hiromi Kajiya-Kanegae3, Atsushi J. Nagano4,5,6, Hitoshi Tainaka3, Tsuyoshi Tokunaga7, Takashi Sazuka8, Hiroyoshi Iwata3, Nobuhiro Tsutsumi3, and Kentaro Yano1* 1Bioinformatics Laboratory, Department of Life Sciences, School of Agriculture, Meiji University, Kanagawa 214-8571, Japan, 2King Abdullah University of Science and Technology (KAUST), Computational Bioscience Research Center (CBRC), Thuwal 23955-6900, Kingdom of Saudi Arabia, 3Graduate School of Agricultural and Life Sciences, The University of Tokyo, Tokyo 113-8657, Japan, 4Faculty of Agriculture, Ryukoku University, Shiga 520-2194, Japan, 5PRESTO, Japan Science and Technology Agency, Japan, 6Center for Ecological Research, Kyoto University, Shiga 520-2113, Japan, 7EARTHNOTE Co. Ltd., Nago, Okinawa 905-1152, Japan, and 8Bioscience and Biotechnology Center, Nagoya University, Aichi 464-8601, Japan *To whom correspondence should be addressed. Tel. þ81 44 934 7046. Fax. þ81 44 934 7046. Email: [email protected] Edited by Prof. Hiroyuki Toh Received 15 June 2016; Editorial decision 15 March 2017; Accepted 16 March 2017 Abstract Recent availability of large-scale genomic resources enables us to conduct so called genome- wide association studies (GWAS) and genomic prediction (GP) studies, particularly with next- generation sequencing (NGS) data. The effectiveness of GWAS and GP depends on not only their mathematical models, but the quality and quantity of variants employed in the analysis. In NGS single nucleotide polymorphism (SNP) calling, conventional tools ideally require more reads for higher SNP sensitivity and accuracy. In this study, we aimed to develop a tool, Heap, that enables robustly sensitive and accurate calling of SNPs, particularly with a low coverage NGS data, which must be aligned to the reference genome sequences in advance. To reduce false positive SNPs, Heap determines genotypes and calls SNPs at each site except for sites at the both ends of reads or containing a minor allele supported by only one read. Performance comparison with existing tools showed that Heap achieved the highest F-scores with low coverage (7X) restriction-site as- sociated DNA sequencing reads of sorghum and rice individuals. This will facilitate cost-effective GWAS and GP studies in this NGS era. Code and documentation of Heap are freely available from https://github.com/meiji-bioinf/heap (29 March 2017, date last accessed) and our web site (http://bioinf.mind.meiji.ac.jp/lab/en/tools.html (29 March 2017, date last accessed)). Key words: single nucleotide polymorphism (SNP), next-generation sequencing (NGS), restriction-site associated DNA sequencing (RAD-seq), genome-wide association studies (GWAS), genomic prediction (GP) VC The Author 2017. Published by Oxford University Press on behalf of Kazusa DNA Research Institute. This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact [email protected] 1 2 Heap 1. Introduction from even reads with a low coverage aligned to the reference genome DNA polymorphisms including single nucleotide polymorphisms sequences. To confirm the performance of Heap, we identified SNPs (SNPs), insertions and deletions (INDELs), among inbred lines or from RAD-seq/GBS reads and calculated an accuracy index F-score, collected individuals have been employed to identify loci associated which is the harmonic mean between sensitivity and PPV, of SNP with traits or to predict phenotypes by genotypes. To identify genes calling in 17 inbred sorghum lines and 4 inbred rice lines. associated with phenotypes and implement high-throughput breed- ing, genome-wide association studies (GWAS) and genomic predic- tion (GP) studies have been performed with next-generation 2. Materials and methods 1–4 sequencing (NGS) data . In GWAS and GP studies, SNPs are 2.1. Software details widely used as genetic markers. SNPs are determined by mapping 2.1.1. Algorithms of Heap NGS reads and consequent variant calls with conventional tools. We designed Heap to identify SNPs from short read sequences of Widely used software, such as Burrows-Wheeler Aligner (BWA) or diploid species. In Heap analysis, short read sequences must Bowtie 2 performs alignments of NGS reads to reference genomes5,6, be aligned to reference genome sequences in advance, and informa- whereas SAMtools/BCFtools or Genome Analysis Toolkit (GATK) tion on aligned reads, which is stored in either Sequence Alignment/ call variants based on alignments produced during the mapping MAP (SAM) format files or the binary version of SAM (BAM) for- steps7–11. The accuracy of both mapping and variant calling should mat files7, must be employed (Fig. 1A). be key factors for the effectiveness of GWAS and GP. After importing information on read alignments from a SAM or SAMtools/BCFtools and GATK accurately call SNPs if provided with enough number of reads. When read coverage (depth) is 20X or BAM file, Heap performs read filtering to obtain high quality reads more, SNPs have been called with sufficient sensitivity in human ge- (Fig. 1B). By default setting of Heap, reads with a phred scaled map- nome resequencing12. It is also reported that SNP calling becomes ping quality score (MAPQ) below 20 are removed. Bases with a difficult under low read coverage (7X or lower)12. In large-scale phred scaled quality score in base calling below 13 are also elimi- GWAS or GP studies, the number of NGS reads in each individual nated from the search scope of valid SNP sites. Heap also trims both tends to be small in order to genotype as many individuals as possi- ends of each read before mining SNPs, because it is empirically ob- ble with a limited budget. In these cases, methodologies that accu- served that these regions contain many base calling errors. By default rately detect large number of SNPs based on a relatively low number setting, both ends with n ¼ 2bp are not counted for SNP searches. of reads are paramount for an efficient implementation of GWAS Moreover, flanking regions within i ¼ 5bp of each INDEL site are and GP studies. also removed in SNP calling. To determine the default setting of the Restriction-site associated DNA sequencing, or genotype-by- n and i values, we examined the empirical distribution of F-scores sequencing (RAD-seq/GBS), is an economical strategy to identify ge- under multiple conditions (for F-scores, see Performance comparison nome wide polymorphisms13. While whole genome shotgun (WGS) among SNP calling tools with RAD-seq reads in sorghum in results sequencing methods provide sequencing information from the whole and discussions) (Supplementary Fig. S1). The highest F-score was genome, the RAD-seq/GBS methods confine sequencing regions to obtained under around n ¼ 2–3, while F-score reached a plateau un- the fragments neighbouring particular restriction sites. To be precise, der around i ¼ 4–5. Therefore, we determined that n ¼ 2 and i ¼ 5as in RAD-seq/GBS, regions neighbouring particular restriction sites are default values. sequenced in combination with high-throughput sequencing methods Next, Heap determines each sample’s genotype in every site that and with subsequent mapping and variant calling steps on reference passes quality filtering (Fig. 1C). On each nucleotide site, the allele genome sequences to identify genomic polymorphisms. For example, frequency is calculated from the number of nucleotide bases (A, T, C RAD-seq/GBS reads sets of HindIII which covers only 0.24% of the or G) aligned on it. Heap ignores any sites with three or more allele chicken genome on average (i.e. coverage ¼ 0.0024x) could show 3Â variants, since any allele possibilities above 2 on diploid organisms or more depth of coverage on the RAD fragments14. Compared with are likely due to sequencing errors. To determine the zygosity of each WGS sequencing methods, the RAD-seq/GBS strategy under the allele, a binomial test with allele frequency is performed (the null hy- same sequencing cost generally assures a relatively higher coverage pothesis H0: the allele is heterozygous). When the P value of the bi- at each sequencing region, even when the total number of reads is nomial test is <0.05, the null hypothesis is rejected thus the zygosity lower in each sample (e.g. inbred lines, individuals, or libraries) with on the site is determined as homozygous. Conversely, when the P the RAD-seq/GBS strategy.