Porcine Single Nucleotide Polymorphisms and Their Functional Effect: an Update
Total Page:16
File Type:pdf, Size:1020Kb
Keel et al. BMC Res Notes (2018) 11:860 https://doi.org/10.1186/s13104-018-3973-6 BMC Research Notes RESEARCH NOTE Open Access Porcine single nucleotide polymorphisms and their functional efect: an update B. N. Keel*, D. J. Nonneman, A. K. Lindholm‑Perry, W. T. Oliver and G. A. Rohrer Abstract Objective: To aid in the development of a comprehensive list of functional variants in the swine genome, single nucleotide polymorphisms (SNP) were identifed from whole genome sequence of 240 pigs. Interim data from 72 animals in this study was published in 2017. This communication extends our previous work not only by utilizing genomic sequence from additional animals, but also by the use of the newly released Sscrofa 11.1 reference genome. Results: A total of 26,850,263 high confdence SNP were identifed, including 19,015,267 reported in our previously published results. Variation was detected in the coding sequence or untranslated regions (UTR) of 78% of the genes in the porcine genome: 1729 loss-of-function variants were predicted in 1162 genes, 12,686 genes contained 64,232 nonsynonymous variants, 250,403 variants were present in UTR of 15,739 genes, and 15,284 genes contained 90,939 synonymous variants. In total, approximately 316,000 SNP were classifed as being of high to moderate impact (i.e. loss-of-function, nonsynonymous, or regulatory). These high to moderate impact SNP will be the focus of future genome-wide association studies. Keywords: Swine, Genome sequence, Functional variation, Loss-of-function Introduction variants that alter the amino acid structure of a protein One of the key aims of livestock genomics research is and those that regulate protein production, detected to identify genetic variation underlying economically from whole-genome sequencing would be of consider- important traits such as reproductive performance, feed able interest in swine genomic studies, particularly those efciency, disease resistance/susceptibility, and product targeting fertility and production traits. quality. Until recently, association studies and genomic Te porcine variation currently reported in National predictions have been performed using commercial sin- Center for Biotechnology Information (NCBI) genetic gle nucleotide polymorphism (SNP) arrays, which con- variation database (dbSNP) and the European Vari- tain markers evenly spaced across the genome. Accuracy ation Archive (EVA) represents several diverse pig of genomic predictions can be improved by using more breeds and wild boars from diferent regions of the informative variants, including variants located near or world. Terefore it is likely that many of the variants in within genes, predicted to afect gene function, or known these databases do not segregate in commercial swine to be causal. Genetic variants detected from whole- germplasm, a consequence of domestication and selec- genome sequence have been used to successfully identify tion for lean meat production. To provide information causal variants and to map complex traits in domestic on variants predicted to afect gene function in com- cattle [1–4]. In particular, it was shown that loss-of-func- mercial swine germplasm, a study was conducted to tion (LOF) variants, those expected to disrupt the protein identify single nucleotide polymorphisms (SNP) from coded by a gene, in the homozygous state can compro- whole-genome sequence of 240 members of an experi- mise fertility in cattle by causing embryonic lethality [1]. mental swine herd at the U.S. Meat Animal Research Hence, a comprehensive list of LOF variants, as well as Center (USMARC). Tese animals included all 24 of the founding boars (12 Duroc and 12 Landrace), 48 of *Correspondence: [email protected] the founding Yorkshire-Landrace composite sows, 109 USDA, ARS, U.S. Meat Animal Research Center, Clay Center, NE 68933, USA composite animals from generations 4 through 9, 29 © The Author(s) 2018. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/ publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. Keel et al. BMC Res Notes (2018) 11:860 Page 2 of 6 composite animals from generation 15, and 30 pure- Sequence data processing bred industry boars (15 Landrace and 15 Yorkshire) Te Trimmomatic software (Version 0.35) [6] was used used as sires in generations 10 through 15. Interim to trim Illumina adaptor sequences and low quality bases results from the 72 founding animals have been previ- from the reads. Te remaining reads were mapped to ously published [5]. Te objective of this report is to the Sscrofa 11.1 genome assembly (NCBI Accession present the updated data following sequencing of an AEMK00000000.2) using Burrows–Wheeler Alignment additional 168 animals and the release of the new and (BWA, Version 0.7.12) [7] with the default parameters. improved swine reference genome build, Sscrofa 11.1. All output SAM fles were converted to sorted BAM fles using SortSam from Picard (Version 1.1; http://broad insti tute.githu b.io/picar d/), and duplicates in the BAM fles Main text were marked by applying MarkDuplicates from Picard. Materials and methods Genomic coverage for each of the BAM fles was com- Te DNA samples sequenced for this study were puted using Samtools (Version 1.3) [8]. extracted from semen collected by commercial artif- cial insemination services and from blood and tail tissue archived under standard operating procedures for the Variant calling and fltering USMARC tissue repository. Te research did not involve Best practices established for the Genome Analysis experimentation on animals requiring Institutional Ani- Toolkit (GATK, Version 3.7) [9] were used to identify mal Care and Use Committee approval. SNP. Briefy, RealignerTargetCreator and IndelRealigner from GATK were applied for local realignment of indels. Base quality recalibration was then performed using Library preparation and sequencing BaseRecalibrator from GATK, where the recalibration Blood, semen, or tail tissue samples were obtained from report was formed using the default setting for covariates 240 members of a USMARC composite population, 72 and the NCBI dbSNP database (Build 150) as the data- founding animals (generation 0), 109 animals from gen- base for known sites. erations 4 through 9, 29 animals from generation 15, Multi-sample variant calling and genotyping was per- and 30 industry sires. DNA extraction, library prepa- formed with the GATK UnifedGenotyper, taking input ration, and sequencing for the 72 founding animals has from each of the 240 BAM fles. been previously described in [5]. For the 36 animals in Te UnifedGenotyper output provides several met- generations 4 and 5, genomic DNA was extracted from rics that assess the quality of detected variation, includ- tail tissue using standard DNA extraction protocols, ing quality (QUAL), quality by depth (QD), RMS sheared to 300–500 bp using a Covaris S220 ultrasoni- mapping quality (MQ), Fisher strand (FS), haplotype cator (Woburn, MA, USA), and libraries prepared using score (HaplotypeScore), mapping quality rank sum test the TruSeq DNA sample prep kit, version 2 (Illumina, (MQRankSum), and read position rank sum test (Read- San Diego, CA, USA) were paired-end sequenced (100 bp PosRankSum). Te QUAL metric is a phred-scaled prob- read length) on an Illumina HiSeq 2500 (Illumina Inc., ability of the SNP being homozygous for the reference, San Diego, CA, USA) at DNA LandMarks (St.-Jean-sur- where higher values indicate higher confdence. QD is Richelieu, QC, Canada). Genomic DNA for the 30 AI computed by dividing the variant confdence (QUAL) sires, the 73 animals from generations 6 through 9, and by the unfltered depth of all non-reference samples. FS the 29 animals from generation 15 was extracted using a is a phred-scaled P-value using Fisher’s Exact Test to Wizard SV96 genomic DNA purifcation kit according to identify strand bias. Higher FS values indicate stronger the manufacturer’s protocol (Promega Corp., Madison, strand bias, i.e. likely false positives. Te HaplotypeScore WI, USA). Genomic DNA was sheared to 350 bp (gen- is a measure of how well the data in a 10-base window eration 15 animals) or 550 bp (AI sires and generation 6 around the variant can be explained by at most 2 haplo- through 9 animals) using a Covaris S220 ultrasonicator types. MQRankSum is a Wilcoxon Rank Sum Test that (Woburn, MA, USA), and libraries prepared using the tests the hypothesis that the reads carrying the variant TruSeq DNA PCR-Free prep kit (Illumina, San Diego, allele have a consistently lower mapping quality than the CA, USA). Libraries were paired-end sequenced (150 bp reads with the reference allele, while ReadPosRankSum is read length) on an Illumina NextSeq 500 (Illumina, San a Mann–Whitney Rank Sum Test that tests the hypoth- Diego, CA, USA) at USMARC. Bases of the paired-end esis that instead of being randomly distributed over the reads for all sequenced genomes were identifed with the read, the variant allele is consistently found more often at Illumina BaseCaller, and FASTQ fles were produced for the beginning or the end of a sequencing read. In order to downstream analysis of the sequence data. reduce the false discovery rate, variants were hard fltered Keel et al. BMC Res Notes (2018) 11:860 Page 3 of 6 to meet the following criteria, suggested by GATK scrofa GO annotation as background for the enrichment documentation: QUAL > 30.0, QD > 2.0, MQ > 40.0, analysis. Data were considered statistically signifcant at FS < 60.0, HaplotypeScore < 13.0, MQRankSum > − 12.5, Bonferroni corrected P-value < 0.05.