bioRxiv preprint doi: https://doi.org/10.1101/2021.01.25.428118; this version posted January 26, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. This article is a US Government work. It is not subject to copyright under 17 USC 105 and is also made available for use under a CC0 license.

INVESTIGATIONS

The Genetic Diversity of Two Subpopulations of Shenxian Pigs as Revealed by Genome Resequencing

Liu Diao∗,1, Cao Hongzhan∗,1,2, Lu Chunlian∗,3, Li Shang∗,4, Jia Mengyu∗,5, Li Sai† and Ren Liqin‡ ∗College of Animal Science and Technology, Agricultural University, 071000, Hebei, †Hebei Zhengnong Animal Husbandry Co., Ltd., Xinji 052360, Hebei, ‡Fishery Technology Extension Station, Wanquan District, City, Zhangjiakou, Hebei 076250

1

2 ABSTRACT Shenxian pigs can be divided into two main strains from their shape and appearance: Huang- KEYWORDS

3 guazui and Wuhuatou. There are significant differences in the phenotypic characters between the two Genome rese-

4 subpopulations. The Shenxian pig, as the only local pig breed listed in Pig Breeds in Hebei Province, quencing

5 has the excellent traits of Chinese local pigs. In order to explore the genetic diversity and genetic distance of Shenxian pig

6 the different subpopulations of Shenxian pig, as well as understand their evolutionary process, whole genome genetic diversity

7 resequencing and genetic structure analyses were performed for the two sub-population types of Shenxian analysis

8 pigs.The results showed that a total of 14,509,223 SNP sites were detected in the Huangguazui type and

9 13,660,201 SNP sites were detected in the Wuhuatou type. This study’s principal component analysis results

10 showed that the genetic differentiation of the Shenxian pig population was serious, and there was obvious

11 stratification in the population. The phylogenetic tree analysis results indicated that there was a certain genetic

12 distance between the Huangguazui and Wuhuatou types.

13

information in the whole genome was accurately obtained, and the 18 two populations were discovered to the greatest extent. Genetic 19 1 INTRODUCTION variation(Bi et al.2020). 20 2 Shenxian pig is the only outstanding local pig breed listed in the 3 "Chinese Pig Breeds" in Hebei Province. Compared with other MATERIALS AND METHODS 21 4 imported pig breeds, Shenxian pigs have an earlier sexual maturity,

5 short estrus interval, and a large number of litters. They have good Materials 22 6 research significance. There are two main strains of Shenxian The pig farm in Shenxian County was selected from Hebei 23 7 pigs: Cucumber beak and Wuhuatou. "Cucumber Mouth": The Zhengnong Animal Husbandry Co., Ltd. and selected healthy, 24 8 mouth is long and pointed at the end, hence the name. The nose is weight, and age-similar cucumber beaks and 5 pigs with five- 25 9 straight and flat, with few hairs, shallow wrinkles on the forehead, flower head each. Cucumber beaks: B1-1, B1-2, B1-3, B1 -4, B1-5, 26 10 slender neck, narrow forequarters, long and concave back and five flower head: B2-1, B2-2, B2-3, B2-4, B2-5. Use ear tag tongs 27 11 waist, slightly drooping abdomen, inclined buttocks, weak hind to collect ear tissue, shear pig ear tissue and disinfect with 75 per- 28 12 limbs, and thick tail. "Five-flowered head": The head is rough, cent alcohol, take about 100mg sample, store the sample in an EP 29 13 the hair is thick, the forehead is wide and wrinkled, and it gets tube containing 95 percent alcohol, transport it back to the labo- 30 14 its name. Based on the whole-genome resequencing technology, ratory with an ice pack, and store it in a refrigerator at -20°C It is 31 15 the genetic structure of the two sub-populations of Shenxian pigs reserved for use in sampling, transportation and storage to ensure 32 16 was discussed, and the evolutionary mechanism between the two that the samples are not contaminated for subsequent extraction 33 17 populations was revealed at the molecular level, and all genetic of genomic DNA. 34

Manuscript compiled: Saturday 23rd January, 2021 Whole genome resequencing 35 1These authors contributed equally to this work. Whole genome resequencing is process used to sequence the 36 2These authors contributed equally to this work. 3Corresponding author: College of Animal Science and Technology, Hebei Agricultural genomes of different individuals of species with known genome se- 37 University, Baoding 071000.E-mail:[email protected] quences, and then analyze the differences of individuals or groups 38

1 bioRxiv preprint doi: https://doi.org/10.1101/2021.01.25.428118; this version posted January 26, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. This article is a US Government work. It is not subject to copyright under 17 USC 105 and is also made available for use under a CC0 license.

1 on that basis(Ley et al.2008). It has been found that based on InDel -R ref.fa -T UnifiedGenotyper -I test.sorted.repeatmark.bam 61 2 whole genome resequencing technology, researchers have been -o test.raw.vcf. The detection conditions are consistent with SNP. 62 3 able to quickly carry out resource surveys and screening processes The detected SNP and InDel mutation sites also need quality con- 63 4 in order to determine large numbers of genetic variations, and trol, remove untrusted data sites, correct the quality value, and out- 64 5 realize genetic evolution analyses and predictions of important put the file in VCF format after correction. Use R language(Gao et 65 6 candidate genes. In the current study, the two subpopulations of al.2014;Sudhaka.2018)and ANNOVAR software (Kai et al.2014)to 66 7 Shenxian pigs were re-sequenced for the purpose of obtaining the perform mutation information Sort and comment. 67 8 genomic information. A large number of high accuracy SNPs, In-

9 Del, and other variation information were obtained by comparing RESULTS AND DISCUSSION 68 10 the results with the reference genome. Subsequently, the variation Results 69 11 information was successfully detected, annotated, and counted. BWA software (Version: 0.7.15-r1140) was used in this research 70 71 12 Detection methods based on next generation sequencing tech- investigation to compare the clean sequences of the two subpopu- 72 13 nology lations of Shenxian pigs after quality control with the latest version of reference genome of pigs was implemented. The results are 73 14 At present, the second-generation sequencing technology is widely detailed in Table 2-1. It was found that on average, more than 86 74 15 used(Levy et al.2016;Slatko et al.2018).The development of the percent of the sequencing data could be compared to the reference 75 16 next generation sequencing (NGS) technology has resulted in rev- genome. 76 17 olutionary changes in the detections of genetic variations. Re- There are a large number of SNPs, and there may be one SNP 77 18 searchers can now comprehensively and accurately detect vari- variant site every 500-1000bp(Group et al.2001).Count the gene 78 19 ous types and the sizes of genetic variations from the genomic region and type of SNP to get Table 1.A total of 14509223 SNPs were 79 20 level(Mardis.2008) For the detection of genetic variations based detected in Huangguazui pigs (B1), of which there were 88176.4 80 21 on the next generation sequencing technology, the first step is (0.61%) and 101948.2 (0.70%) in the upstream and downstream 1kb 81 22 to compare the reading segments of thousands of sequences to region of the gene, and 7865381 (54.21%) in the intergenic region. 82 23 the corresponding positions on the reference genome. Then, the There are an average of 99217.4 (0.68%) and 6151792 (42.40%) of 83 24 possible variation information can be inferred using simple mathe- exons and introns. A total of 13660201 SNPs were detected in 84 25 matical models(Andersson.2009;Mckenna et al.2010;Montgomery Wuhuatou pig (B2), of which 84855.2 (0.62%) and 97797.2 (0.72%) 85 26 et al.2013). These methods take the number of reading segments were found in the upstream 1kb region and the downstream 1kb 86 27 supporting the variations and reference sequences; quality of the region of the gene, and the intergenic region contained 7407500 87 28 reading segment comparison; environmental conditions of the (54.23%). There are 96530.8 (0.71%) and 5782023 (42.33%) of exons 88 29 genome sequences; and the possible noise (such as sequencing er- and introns respectively. 89 30 rors) as prior information. Then, the variation types and genotypes In the current study, the SNP was identified using GATK soft- 90 31 at certain positions can be determined according to naive Bayesian ware, and the SNP data were subsequently analyzed. The original 91 32 theory, and the genotype with the highest posterior probability can data contained a total of ten samples and 31,658,350 SNP sites. 92 33 be successfully obtained(Montgomery et al.2013;Durbin et al.2008). Effective data was obtained after a secondary quality control of the 93 34 Such methods have been found to be effective for SNP and small original data, with ten samples and 23,088,823 SNP sites remaining. 94 35 InDel detections(Depristo et al.2011;Albers et al.2011). The total samples contained the individuals of the two strains of 95 Shenxian pigs.A total of 580 high-quality polymorphic SNPs were 96 36 Data processing obtained by screening 23,088,823 SNP sites. The marker covered 97 37 This test sample uses the whole genome resequencing data (se- eleven chromosomes, and its PIC index ranged from 0.1 to 0.529, 98 38 quencing depth 10X) of 10 Shenxian pigs (5 cucumber-mouthed with an average value of 0.305. There were 119 highly polymor- 99 39 pigs and 5 Wuhuatou pigs) . The output of the genome data com- phic SNPs and 191 moderately polymorphic SNPs identified.In 100 40 parison is the sam file. Since the sorting method of the sam file this study, the quantity and density distributions of the SNPs on 101 41 cannot be used for subsequent analysis, the Samtools software(Li chromosomes were further analyzed. The results are shown in 102 42 et al.2009)is used to convert the sam file into a file in chromosome Figure 1. The distributions of the quantities and densities of the 103 43 sorting format, that is, the bam format. Chromosome rearrange- SNPs on the chromosomes were observed to not be uniform. For 104 44 ment is to arrange the entries of the same chromosome in the file example, Chromosome 15 was short in length but had the largest 105 45 in ascending order, and merge the two files of the same sample. number of SNPs, which was far more than the other examined 106 46 SNP detection and annotation: Use GATK and VarScan soft- chromosomes. These findings indicated that the densities of the 107 47 ware(Koboldt et al.2009)to perform SNP detection at the same SNPs on that particular chromosome were the highest. There- 108 48 time, so as to ensure that the obtained SNP site information fore, it was considered that there was a relatively high density of 109 49 will not be affected by the deviation of base misalignment variation on Chromosome 15. 110 50 caused by InDel mutation. GATK detection SNP code: java -jar In order to infer the genetic relationship between the Huang- 111 51 GenomeAnalysisTK.jar glm SNP -R ref.fa -T UnifiedGenotyper guazui and Wuhuatou pig subpopulations, a genetic distance ma- 112 52 -I test.sorted.repeatmark.bam -o test.raw.vcf. In order to ensure trix of the IBS was constructed using Plink (V1.9) software(Purcell 113 53 the accuracy of SNP information, strict testing conditions are set et al.2007). Then, based on the results, a phylogenetic tree was 114 54 when detecting SNP site information: <1>The minimum num- constructed using a neighbor joining method (NJ), as detailed in 115 55 ber of end reads greater or equal to 4; <2>The minimum quality Figure 2. The analysis results showed that the Huangguazui and 116 56 value Q20 greater or equal to 90; <3>Minimum coverage greater or Wuhuatou subpopulations had originated from the same ancestor 117 57 equal to 6; <4>P value Less than or equal to 0.01. InDel detection and belonged to the same variety. However, they could clearly be 118 58 and annotation: The same use of GATK and VarScan software divided into two distinct groups, which indicated that there were 119 59 for InDel detection can eliminate errors caused by SNP variation. indeed differences in the genetic backgrounds of the Huangguazui 120 60 GATK detection InDel code: java -jar GenomeAnalysisTK.jar glm and Wuhuatou pigs. It was confirmed that there was a certain 121

2 | FirstAuthorLastname et al. bioRxiv preprint doi: https://doi.org/10.1101/2021.01.25.428118; this version posted January 26, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. This article is a US Government work. It is not subject to copyright under 17 USC 105 and is also made available for use under a CC0 license.

I Table 1 Statistical table of gene region and type of SNP Sample-name Downstream Exonic Intergenic Intronic Upstream Totala Huangguazui(B1) 101948 0.70% 99217 0.68% 7865381 54.21% 6151792 42.40% 88176 0.61% 14509223. Wuhuatou(B2) 97797 0.72% 96530 0.71% 7407500 54.23% 5782023 42.33% 84855 0.62% 13660201.

a There is little difference in the number of SNPs contained in the cucumber mouth and five-flower head pig gene regions. Introns and intergenic regions contain the most SNPs, indicating that the two regions have the most variation.

Figure 1 The abscissa represents the chromosome number, and the ordinate represents the number of SNPs.The distributions of the Figure 2 The analysis results showed that the Huangguazui and quantities and densities of the SNPs on the chromosomes were ob- Wuhuatou subpopulations had originated from the same ancestor served to not be uniform. For example, Chromosome 15 was short and belonged to the same variety. However, they could clearly be in length but had the largest number of SNPs, which was far more divided into two distinct groups, which indicated that there were in- than the other examined chromosomes. These findings indicated deed differences in the genetic backgrounds of the Huangguazui that the densities of the SNPs on that particular chromosome were and Wuhuatou pigs. It was confirmed that there was a certain dis- the highest. Therefore, it was considered that there was a relatively tance between the two strains in their genetic relationship. high density of variation on Chromosome 15.

than those of the Huangguazui strain. 27

1 distance between the two strains in their genetic relationship.It 2 was also determined in the phylogenetic tree analyses of the Shenx- Discussion 28 3 ian pigs, large white pigs representative of introduced pigs, and For DNA extraction and detection, the complete DNA in the 29 4 Meishan pigs representative of local south China pigs, that the agarose gel should be a band. At times, there will be three bands 30 5 three populations had originated from the same species, and there observed, which may have been formed by DNA degradation. 31 6 was a distant genetic relationship between the Shenxian pigs and Generally speaking, the foremost band is of supercoiled configu- 32 7 the other two examined populations. ration; middle band is linear; and the last band is of single-chain 33 8 The PCA analysis of the effective data obtained after quality defective configuration(Martina et al.2020). If mechanical dam- 34 9 controls were implemented were conducted using Plink software, ages occur during the extraction process, the DNA can become 35 10 and then the PCA analysis results were visualized using R software. degraded, which may be presented as a diffuse state on the gel. 36 11 In order to reveal the population specificity of the Huangguazui SNP mainly refers to the DNA sequence polymorphism caused 37 12 and Wuhuatou pigs, the SNP information of the two populations by a single nucleotide variation at the genomic level. In this study, 38 13 was analyzed using a principal component analysis (PCA) method. the SNP information obtained via statistical variation detection 39 14 The top three eigenvectors were selected for the analysis, and the showed that the quantity and density distributions of the SNPs 40 15 analysis results were visualized by R language, as shown in Fig. in the two subpopulations were uneven, and they were mainly 41 16 2-3. The analysis results of the first and second eigenvectors are concentrated on Chromosome 15. There have been many studies 42 17 detailed in Figure 3. It can be seen in the figure that the first conducted regarding the QTL of meat quality traits on Chromo- 43 18 eigenvector had accurately distinguished the Huangguazui and some 15(Liu.2015;Jia.2004), which may be related to the excellent 44 19 Wuhuatou into two strains, which confirmed that the differentia- meat quality traits of the Shenxian pig species(Hu et al.2015). Previ- 45 20 tion among the Shenxian pig population was serious. The second ous studies have successfully mapped the QTLs of the meat quality 46 21 eigenvector revealed that both the Huangguazui and Wuhuatou traits and acquired linked SNPs for labelling(Yan et al.2017;Li et 47 22 subpopulations were stratified within the total population. The al.2010;Science.2016b). However, the current SNP chip technology 48 23 analysis results of the first and third eigenvectors showed that the is not adequately developed for mapping the quantitative traits in 49 24 stratification in Huangguazui population was the most obvious pigs(Chen et al.2011). Therefore, the SNP and InDel information 50 25 (Figure 4). The second and third eigenvectors showed (Figure 5) on Chromosome 15 will be detected, annotated, and analyzed in 51 26 that the degrees of stratification in the Wuhuatou strain were less follow-up experimental research. 52

3 bioRxiv preprint doi: https://doi.org/10.1101/2021.01.25.428118; this version posted January 26, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. This article is a US Government work. It is not subject to copyright under 17 USC 105 and is also made available for use under a CC0 license.

Figure 5 The abscissa represents the second vector, and the ordi- nate represents the third vector; pink represents cucumber-mouthed pigs, blue represents five-flowered pigs. Figure 3 The abscissa represents the first vector, and the ordinate represents the second vector; pink represents cucumber-mouthed pigs, blue represents five-flowered pigs. Furthermore, functional gene annotations will also be carried 1 out in order to explore the differential genes, along with the 2 quality traits regulated by the differential genes, in the two sub- 3 populations of Shenxian pigs. Another area to be explored in the 4 future will be the main traits controlled by Chromosome 15 in 5 order to provide a research basis for the follow-up breeding of 6 Shenxian pigs. 7 The whole genome population structure analysis processes are 8 utilized to re-sequence different geographic locations and popu- 9 lations, as well as to obtain the SNP and other information after 10 comparisons are made with the reference genome. Subsequently, 11 the principal components and population structures among popu- 12 lations are analyzed, and appropriate phylogenetic trees are con- 13 structed. In the present study, based on the analysis results of the 14 population structures of the examined species, the differences in 15 the members of the examined species could be more clearly under- 16 stood. It was found that the genome-wide population structure 17 analysis results had inferred the characteristics, population struc- 18 ture, and development history of the varieties from the genome. 19 The obvious stratification and severe phenotypic differentiation 20 observed in the population structures of the Shenxian pigs indi- 21 cated that the phenotypic traits of Huangguazui and Wuhuatou 22 may have been less affected by artificial selection processes. 23

LITERATURE CITED 24

Changwei Bi, Na Lu, Zhen Huang, et al.,2020 Whole-genome rese- 25 quencing reveals the pleistocene temporal dynamics of branchios- 26 toma belcheri and branchiostoma floridae populations. Ecology 27 and Evolution 10(15), 1-15. 28 Figure 4 The abscissa represents the first vector, and the ordinate Ley,T.J.,Mardis,E.R.,Ding,L.,et al.,2008 DNA sequencing of a 29 represents the third vector; pink represents cucumber-mouthed pigs, cytogenetically normal acute myeloid leukaemia genome.Nature 30 blue represents five-flowered pigs. 456(7218):66-72. 31 Levy,S. E.,Myers, R. M.,2016 Advancements in next-generation 32 sequencing. Annu Rev Genomics Hum Genet 17(1), 95-115. 33 Slatko Barton E, Gardner Andrew F, Ausubel Frederick M.,2018 34

4 | FirstAuthorLastname et al. bioRxiv preprint doi: https://doi.org/10.1101/2021.01.25.428118; this version posted January 26, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. This article is a US Government work. It is not subject to copyright under 17 USC 105 and is also made available for use under a CC0 license.

1 Overview of Next-Generation Sequencing Technologies122(1):e59. Chen Huiyong, Luo Saiqun, Wu Zhenfang.,2011 Current sta- 62 2 Mardis,E.R.,2008 The impact of next-generation sequencing tus of pig genome-wide SNP research. China High-tech Indus- 63 3 technology on genetics.Trends Genet 24:133-141. trialization Research Association. Cultivating bio-industry and 64 4 Andersson,L.,2009 Genome-wide association analysis in do- developing green economy——The Fifth China Bio-industry Con- 65 5 mestic animals:a powerful approach for genetic dissection of trait ference·2011 Gene Science and Industrial Development Forum Pro- 66 6 loci.Genetica 136:341-349. ceedings. China High-Tech Industrialization Research Association: 67 68 7 Mckenna,A., M.Hanna, E.Banks, et al.,2010 The Genome Analy- China High-Tech Industrialization Research Association:4. 8 sis Toolkit: a MapReduce framework for analyzing next-generation 9 DNA sequencing data.Genome Res 20:1297-1303. 10 Montgomery,S.B., D.L.Goode, E.Kvikstad, el al.,2013 The origin, 11 evolution, and functional impact of short insertion-deletion vari- 12 ants identified in 179 human genomes. Genome Res 23:749-761. 13 Li.H.,J.Ruan,R.Durbin.,2008 Mapping short DNA sequence 14 reads and calling variants using mapping quality scores.Genome 15 Res 18:1851-1858. 16 Depristo,M.A.,Banks,E.,Poplin,R.et al.,2011 A framework for 17 variation discovery and genotyping using next-generation dna 18 sequencing data. Nature Genetics 43(5), 491-498. 19 Albers,C.A.,Lunter,G.,Macarthur,D. G.,2011 Dindel: accurate 20 indel calls from short-read data. Genome Research 21(6), 10-11. 21 Li Heng, Handsaker Bob, Wysoker Alec, et al.,2009 The Se- 22 quence Alignment/Map format and SAMtools.. 25(16):2078-9. 23 Daniel C. Koboldt, Ken Chen, Todd Wylie, et al.,2009 VarScan: 24 variant detection in massively parallel sequencing of individual 25 and pooled samples 25(17):2283-2285. 26 Shan Gao, Jianhong Ou, Kai Xiao.,2014b R language and Bio- 27 conductor bioinformatics application. : Tianjin Science and 28 Technology Translation Publishing Company, 2014. 29 Kalyan Sudhaka.,2018b Python vs. R Programming Language. 30 8(8):70-79. 31 Kai W,Mingyao L,Hakon H.,2014 ANNOVAR:functional an- 32 notation of genetic variants from high-throughput sequencing 33 data[J].Nucleic Acid Research.16(38):e164. 34 Group,T.I.S.M.W.,2001 A map of human genome sequence vari- 35 ation containing 1.42 million single nucleotide polymorphism. Na- 36 ture 15. 37 Purcell S, Neale B, Todd-Brown K, et al.,2007 PLINK: a tool 38 set for whole-genome association and population-based linkage 39 analyses[J].Am J Hum Genet81(3): 559–575. 40 Martina Leonardi, Giulia La Marca, Barbara Pajola, et al.,2020 41 Assessment of real-time PCR for Helicobacter pylori DNA detec- 42 tion in stool with co-infection of intestinal parasites: a comparative 43 study of DNA extraction methods. 20(1):193-213. 44 Liu Xianxian.,2015 QTL analysis on pig chromosome 15 affect- 45 ing glycogen content and drip loss of longissimus dorsi muscle. 46 Jiangxi Agricultural University. 47 Jia Chao.,2004 Study on the correlation between the quality 48 traits of Sutai pork and microsatellite markers on chromosome 49 15. Agricultural University. 50 Yingxin Hu,Suqiao Chu,Chunlian Lu,et al.,2015 Study on the 51 characteristics of pig germplasm in Shenxian County.Animal Hus- 52 bandry and Veterinary Medicine47(07):58-60. 53 Yan Guorong, Qiao Ruimin, Zhang Feng, et al.,2017 Imputation- 54 Based Whole-Genome Sequence Association Study Rediscovered 55 the Missing QTL for Lumbar Number in Sutai Pigs.7(1):615. 56 Li HD, Lund MS, Chritensen OF, et al.,2010 Quantitaive trait 57 loci analysis of swine meat quality traits.J Anim Sci 88(9): 2904-12. 58 Science,2016 Reports from China Agricultural University Ad- 59 vance Knowledge in Science (Identification of genes for controlling 60 swine adipose deposition by integrating transcriptome, whole- 61 genome resequencing, and quantitative trait loci data).

5