Supplementary Information for

Genomic analyses identify distinct patterns of selection in domesticated pigs and Tibetan wild boars

Mingzhou Li1,2,13, Shilin Tian3,13, Long Jin1,13, Guangyu Zhou3,13, Ying Li1,13, Yuan Zhang3,13, Tao Wang1, Carol KL Yeung3, Lei Chen4, Jideng Ma1, Jinbo Zhang3, Anan Jiang1, Ji Li3, Chaowei Zhou1, Jie Zhang1, Yingkai Liu1, Xiaoqing Sun3, Hongwei Zhao3, Zexiong Niu3, Pinger Lou1, Linjin Xian1, Xiaoyong Shen3, Shaoqing Liu3, Shunhua Zhang1, Mingwang Zhang1, Li Zhu1, Surong Shuai1, Lin Bai1, Guoqing Tang1, Haifeng Liu1, Yanzhi Jiang1, Miaomiao Mai1, Jian Xiao1, Xun Wang1, Qi Zhou5, Zhiquan Wang6, Paul Stothard6, Ming Xue7, Xiaolian Gao8, Zonggang Luo9, Yiren Gu10, Hongmei Zhu3, Xiaoxiang Hu11, Yaofeng Zhao11, Graham S. Plastow6, Jinyong Wang4, Zhi Jiang3, Kui Li12, Ning Li11, Xuewei Li1 & Ruiqiang Li2,3

1 Institute of Animal Genetics and Breeding, College of Animal Science and Technology, Sichuan Agricultural University, Ya’an, China.

2 Biodynamic Optical Imaging Center (BIOPIC), Peking-Tsinghua Center for Life Sciences, and School of Life Sciences, Peking University, Beijing, China.

3 Novogene Bioinformatics Institute, Beijing, China.

4 Chongqing Academy of Animal Science, Chongqing, China.

5 Ya’an Vocational College, Ya’an, China.

6 Department of Agricultural, Food and Nutritional Science, University of Alberta, Edmonton, Canada.

7 National Animal Husbandry Service, Ministry of Agriculture of China, Beijing, China.

8 Department of Biology and Biochemistry, University of Houston, Houston, USA.

9 Department of Animal Science, Southwest University at Rongchang, Chongqing, China.

10 Sichuan Animal Science Academy, Chengdu, China.

11 State Key Laboratory for Agrobiotechnology, College of Biological Sciences, National Engineering Laboratory for Animal Breeding, China Agricultural University, Beijing, China.

12 Institute of Animal Science, Chinese Academy of Agricultural Sciences, Beijing, China.

13 These authors contributed equally to this work.

Correspondence should be addressed to X.L. (email: [email protected]) or to R.L. (email: [email protected]).

Nature Genetics: doi:10.1038/ng.2811

Table of contents

Supplementary Figs. 1-36

Supplementary Fig. 1. The distribution areas of the original Tibetan wild boar in China.
Supplementary Fig. 2. Comparison of Tibetan wild boar and domestic Duroc pig.
Supplementary Fig. 3. Synteny between the Tibetan wild boar and Duroc pig genomes.
Supplementary Fig. 4. Distribution of 19-mer frequency.
Supplementary Fig. 5. The GC content and CpG frequency for 10 kb, non-overlapping sliding windows across the Tibetan wild boar genome and five other mammalian genomes.
Supplementary Fig. 6. GC content against the sequencing depth of Tibetan wild boar genome.
Supplementary Fig. 7. Depth distribution of fraction bases.
Supplementary Fig. 8. Distribution of heterozygosity density in the Tibetan wild boar diploid genome.
Supplementary Fig. 9. Comparison of gene parameters among the Tibetan wild boar and five other mammalian genomes.
Supplementary Fig. 10. Divergence distribution of classified families of transposable elements.
Supplementary Fig. 11. Length distribution of InDels in the Tibetan wild boar whole genome and in coding sequence (CDS) regions.
Supplementary Fig. 12. Orthology assignment of the Tibetan wild boar, Duroc pig and human genomes.
Supplementary Fig. 13. Sequence depth distribution between single- and multi-copy genes in the Tibetan wild boar genome.
Supplementary Fig. 14. Orthology delineation among the protein-coding gene family repertoires of the Tibetan wild boar and five other mammals.
Supplementary Fig. 15. Venn diagrams showing the distribution of shared and unique gene families.
Supplementary Fig. 16. Distribution of pairwise amino acid identity of orthologs between the Tibetan wild boar and five other mammals.
Supplementary Fig. 17. Venn diagram showing the distribution of olfactory-related gene repertoires among six mammals.
Supplementary Fig. 18. Identification and comparison of olfactory receptor genes among six mammals using conserved olfactory receptor-specific motifs.
Supplementary Fig. 19. Phylogenetic analysis of the olfactory-related gene repertoires.
Supplementary Fig. 20. Amino acid identity of olfactory-related genes between Duroc pig, Tibetan wild boar and four other mammals.
Supplementary Fig. 21. Average protein similarity of olfactory-related genes and total genes between Duroc pig, Tibetan wild boar and four other mammals.
Supplementary Fig. 22. Comparison of ω values between PSGs in Tibetan wild boar and Duroc pig.
Supplementary Fig. 23. Tibetan wild boar and Duroc pig KA/KS (ω) in functional gene categories.
Supplementary Fig. 24. PSGs in Tibetan wild boar involved in the pathway ‘mTOR signaling’ and ‘vascular smooth muscle contraction’.
Supplementary Fig. 25. Comparison of the proportions of PSGs in Tibetan wild boar and Duroc pig.
Supplementary Fig. 26. PSGs in Duroc pig involved in the pathway of ‘extracellular matrix (ECM)-receptor interaction’.
Supplementary Fig. 27. Inactivation events of six identified pseudogenes related to ‘response to drug’ in the Tibetan wild boar genome.
Supplementary Fig. 28. Genetic structure analysis for 103 sequenced individuals using FRAPPE with K = 2 to 9.
Supplementary Fig. 29. Genome-wide distribution of SNPs.
Supplementary Fig. 30. Box plot of θπ ratio (θπ, domestic / θπ, Tibetan) and FST values for regions of Tibetan wild boars and Chinese domestic pigs that have undergone positive selection versus the whole genome.
Supplementary Fig. 31. Distribution of selection statistics (Tajima’s D).
Supplementary Fig. 32. LD patterns between the selected regions and whole genome of Tibetan wild boars and Chinese domestic pigs.
Supplementary Fig. 33. Analysis of the phylogenetic relationship of Tibetan wild boars (n = 30) and neighboring domestic pigs (n = 15) using SNPs in regions with strong selective sweep signals.
Supplementary Fig. 34. Genes embedded in naturally selected regions in Tibetan wild boars related to ‘vitamin B6 binding’ and ‘response to hypoxia’.
Supplementary Fig. 35. Genes examined in the ‘saliva secretion’ functional category (GO-BP: 0046541) showed signatures of selective sweeps in Chinese domestic pigs.
Supplementary Fig. 36. Vacuum chewing (Domestic Duroc pig).

Supplementary Tables 1-8, 10-16, 18-22, 24-27 and 29-36

Supplementary Table 1. Genome sequencing strategy for the Tibetan wild boar.
Supplementary Table 2. Estimation of the Tibetan wild boar genome size using K-mer analysis.
Supplementary Table 3. Summary of the Tibetan wild boar genome assembly.
Supplementary Table 4. Summary of mapping and coverage depth.
Supplementary Table 5. Transposon element families in the Tibetan wild boar genome based on various methods.
Supplementary Table 6. Transposon element families in the Tibetan wild boar genome based on homolog alignment.
Supplementary Table 7. Summary of InDels in the Tibetan wild boar genome.
Supplementary Table 8. Summary of syntenic regions between the Tibetan wild boar and Duroc pig genomes.
Supplementary Table 10. Summary of non-coding RNA distribution and annotation in the Tibetan wild boar genome.
Supplementary Table 11. Characteristics of the Tibetan wild boar and Duroc pig genome assemblies.
Supplementary Table 12. Summary of RNA-seq mapping results.
Supplementary Table 13. Summary of evidence for the EVidenceModeler (EVM) gene models in the Tibetan wild boar genome.
Supplementary Table 14. Assessment of sequence coverage of the Tibetan wild boar genome assembly using the CDS regions of the Duroc pig genome.
Supplementary Table 15. Summary of predicted protein-coding genes in the Tibetan wild boar genome compared with other representative mammalian genomes.
Supplementary Table 16. Number of Tibetan wild boar genes with functional classification by various methods.
Supplementary Table 18. Functional gene categories enriched for the Tibetan wild boar- and Duroc pig-specific families.
Supplementary Table 19. Summary of gene families in six mammals.
Supplementary Table 20. Functional gene categories enriched for the Tibetan wild boar- and Duroc pig-specific expansion families.
Supplementary Table 21. Positively selected genes (PSGs) identified in the Tibetan wild boar and Duroc pig genomes.
Supplementary Table 22. Functional gene categories enriched for the 215 PSGs in the Tibetan wild boar and 182 PSGs in the Duroc pig.
Supplementary Table 24. List of a priori functional candidate genes related to ‘response to hypoxia’, ‘response to UV’ and ‘energy metabolism’.
Supplementary Table 25. Functional candidate genes related to ‘response to hypoxia’ under positive selection in the Tibetan wild boar (21 PSGs) and Duroc pig (1 PSG).
Supplementary Table 26. Functional candidate genes related to ‘response to UV’ under positive selection in the Tibetan wild boar (6 PSGs).
Supplementary Table 27. Functional candidate genes related to ‘energy metabolism’ under positive selection in the Tibetan wild boar (17 PSGs) and Duroc pig (21 PSGs).
Supplementary Table 29. Functional gene categories enriched for Tibetan wild boar pseudogenes.
Supplementary Table 30. Drug response genes that appear inactive in the Tibetan wild boar genome.
Supplementary Table 31. Summary and mapping statistics of sampled pig populations/breeds.
Supplementary Table 32. Summary and mapping statistics of the downloaded pig genome re-sequencing data.
Supplementary Table 33. Summary of SNP calling on a population-scale.
Supplementary Table 34. Tracy-Widom (TW) statistics for the first ten eigenvalues from PCA analysis of pig breeds.
Supplementary Table 35. Summary of SNPs in Tibetan wild boars and Chinese domestic pigs.
Supplementary Table 36. Functional gene categories enriched for genes affected by natural and artificial selection.

Supplementary Note

1 De novo sequencing, assembly and annotation of Tibetan wild boar genome
1.1 Sequencing strategy and data generation
1.2 Sequence quality checking and filtering
1.3 Estimation of genome size using K-mer method
1.4 De novo assembly
1.5 Detections of heterozygous SNPs and deletion or insertion polymorphisms (InDels)
1.6 Repeat annotation
1.7 Structural annotation of genes
1.8 Functional annotation of genes
1.9 non-coding RNA (ncRNA) annotations
2 Lineage-specific genes
2.1 Gene family cluster and orthology relationships
2.2 Evidence of transcription for the Tibetan wild boar-specific genes
3 Functional enrichment analyses for genes
4 Identification of pseudogenes
5 Population-based re-sequencing and SNP calling
5.1 Re-sequencing strategy and read mapping
5.2 SNP calling
6 Demographic history reconstruction
7 Linkage-disequilibrium (LD) analysis

Supplementary URLs

Supplementary References

Supplementary Figs. 1-36

Supplementary Fig. 1. The distribution areas of the original Tibetan wild boar in China. Tibetan wild boars are primarily distributed in the mountainous grassland, low bulrush meadows and valley zones of a large high-altitude area in Southwest China (yellow regions). These areas mainly include: (a) the southeast of the Tibet Autonomous Region: Milin (3,700 m altitude), Nyingchi (3,000 m), Gongbujiangda (3,600 m), Langxian (3,200 m), Bomi (2,700 m), Mangkang (3,870 m), Zuogong (3,750 m), Bianba (3,500 m), Chaya (3,500 m), Jiangda (3,650 m), Gongjue (3,640 m), and Jiali (4,400 m); (b) the northwest of Sichuan province: Heishui (3,544 m), (2,633 m), Xiaojin (2,367 m), Litang (4,014 m), Xiangcheng (2,856 m), Daocheng (3,750 m), Xinlong (3,500 m), and Dege (3,500 m); (c) the northwest of Yunnan province: Shangri-La (3,280 m), Diqing (4,270 m), and Weixi (2,340 m); and (d) the southwest of Gansu province: Hezuo (3,000 m), Luqu (3,500 m), and Zhuoni (2,500 m). Data are from the survey report ‘Area coverage planning of Chinese specific agricultural product, 2006–2015’, Chinese Ministry of Agriculture, 2007.

Tibetan wild boar

Appearance: (image not reproduced)

Breed history:
○ Indigenous to the Tibetan plateau of China with an average altitude of 4,268 m above sea level, living in the forest and valley zone.
○ Tibetan wild boar has not undergone artificial selection.

Characteristics:
○ Black color.
○ Small body size. Under plateau conditions, the average adult body weight is about 50 kg (female is 46 kg, male is 56 kg), and the body length is 71.37 ± 0.73 cm and body height is 45.75 ± 0.52 cm for 13 months (n = 17).
○ Slow growth. During the period of 2 to 6 months of age, average daily gain is less than 100 g (99.87 ± 12.11 g, n = 27).
○ High deposition of fat. The lean percent is 43.58 ± 5.39% at 6 months of age, and 39.72 ± 2.75% at 12 months. The intramuscular fat content is 3.82 ± 0.21% for 6 months, and 10.15 ± 0.15% for 12 months (n = 17).
○ Poor meat production. The loin eye area is 12.30 ± 2.18 cm^2 for 6 months and 15.15 ± 3.43 cm^2 for 12 months (n = 19); the dressing percent is 51.00 ± 1.26% for 6 months and 74.19 ± 0.52% for 12 months (n = 17).
○ Adapted to the high altitude-induced extremely harsh conditions, such as: hypoxia, low temperature, high solar radiation, and lack of food resources.
○ Well-developed blood circulation system, strong limbs, long and rigid bristles, presence of down under the hair.
○ Large lungs and hearts. Ratio of lung weight versus body weight = 1.36 ± 0.18% (n = 17); ratio of heart weight versus body weight = 0.48 ± 0.08% (n = 17).
○ High energy metabolism. The average feed:gain ratio is 4.89 ± 0.04 (n = 17).

Reproductions:
○ The average litter size is 4 to 8. The total number of born is 4.00 ± 0.20 for the first parity and 7.25 ± 0.98 for the 2nd to 3rd parity (n = 25).
○ The new born piglet is relatively big. The average new born weight is 1.28 ± 0.12 kg (n = 15).

Current distribution:
○ Currently, the Tibetan wild boar is mainly distributed in an important natural conservation zone of Southwest China, and the breed is facing the danger of extinction.

Duroc pig

Appearance: (image not reproduced)

Breed history:
○ The breed originated in America, one of several red pig strains which developed around 1800 in New England.
○ Duroc has been intensively artificially selected for fast growth, and efficient accumulation of lean meat (muscle).

Characteristics:
○ Red color.
○ Large body size, the average adult body weight is more than 300 kg (female is 350 kg, male is 380 kg).
○ Fast growth performance. During the period of 30 to 100 kg, average daily gain is about 900 g (936 ± 33.4 g, n = 120).
○ High carcass production. At 6 months, the lean percent is about 63.50 ± 4.29%; the intramuscular fat content is 3.04 ± 0.33%; the loin eye area is 44.87 ± 1.92 cm^2; the dressing percent is 74.23 ± 0.88% (n = 121).
○ Bad maternal instincts.
○ Late maturing type.
○ Ratio of lung weight versus body weight = 0.83 ± 0.07% (n = 110); ratio of heart weight versus body weight = 0.35 ± 0.04% (n = 110).
○ The average feed:gain ratio is 2.38 ± 0.02 (n = 131).

Reproductions:
○ The average litter size is 8 to 10. The total number of born is 8.42 ± 0.87 for the first parity and 10.74 ± 1.10 for the 2nd to 3rd parity (n = 171).
○ The average new born piglet weight is 1.7 ± 0.23 kg (n = 142).

Current distribution:
○ Internationally used breed (93 countries).

Supplementary Fig. 2. Comparison of Tibetan wild boar and domestic Duroc pig. Values are means ± s.d.

Supplementary Fig. 3. Synteny between the Tibetan wild boar and Duroc pig genomes. GC content, density of repeats and density of genes were calculated using a 1 Mb sliding window. The mitochondrial genome and Y chromosome were excluded. The number of contiguous syntenic blocks was determined by pairwise comparisons between the Tibetan wild boar and Duroc pig genomes. A total of 2,458 regions of inverted orientation covering more than 186.61 Mb were identified using Breakdancer (parameter –q=20) (Supplementary URLs), somewhat more than the 1,576 inversions covering more than 154 Mb identified between the human and chimpanzee genomes1. A complete list of inversions is provided in Supplementary Table 9.

Supplementary Fig. 4. Distribution of 19-mer frequency. In total, 130.05 Gb of high-quality short-insert reads (180 bp) were used to generate the 19-mer depth-frequency distribution curve.
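
As a rough illustration of how a k-mer depth distribution of this kind can be tabulated, the following minimal sketch counts every k-mer in a set of reads and then summarises how many distinct k-mers occur at each depth. It is not the pipeline used in this study: the function name, the toy reads and the decision to skip k-mers containing N are assumptions made for illustration, and strand-aware (canonical) k-mer counting is deliberately omitted.

```python
from collections import Counter


def kmer_depth_histogram(reads, k=19):
    """Return {depth: number of distinct k-mers observed exactly `depth` times}.

    Hypothetical helper for illustration; reverse complements are NOT merged.
    """
    kmer_counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if "N" not in kmer:  # assumption: ignore k-mers with ambiguous bases
                kmer_counts[kmer] += 1
    # Collapse per-k-mer counts into a depth -> frequency histogram.
    return Counter(kmer_counts.values())


# Toy usage with a small k so the result is easy to inspect by hand.
toy_reads = ["ACGTA", "ACGTA"]  # hypothetical reads
print(dict(kmer_depth_histogram(toy_reads, k=4)))
```

For real sequencing data a dedicated k-mer counter would be needed, since an in-memory dictionary does not scale to 130 Gb of reads.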

Supplementary Fig. 5. The GC content (a) and CpG frequency (b) for 10 kb, non-overlapping sliding windows across the Tibetan wild boar genome and five other mammalian genomes.
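
The windowed statistics in this figure can be sketched as follows. This is a minimal example under stated assumptions, not the authors' code: the function name is hypothetical, "CpG frequency" is taken here to mean the number of CG dinucleotides divided by the number of dinucleotide positions in the window, ambiguous bases stay in the denominator, and a trailing partial window is dropped.

```python
def gc_and_cpg_by_window(sequence, window=10_000):
    """Return [(start, gc_fraction, cpg_frequency)] for non-overlapping windows.

    Assumed definitions (not taken from the source):
      gc_fraction   = (G + C) / window length
      cpg_frequency = count of "CG" dinucleotides / (window length - 1)
    """
    sequence = sequence.upper()
    results = []
    # Step by the window size so that windows never overlap.
    for start in range(0, len(sequence) - window + 1, window):
        chunk = sequence[start:start + window]
        gc_fraction = (chunk.count("G") + chunk.count("C")) / window
        cpg_frequency = chunk.count("CG") / (window - 1)
        results.append((start, gc_fraction, cpg_frequency))
    return results


# Toy usage with an 8 bp window instead of 10 kb.
for row in gc_and_cpg_by_window("ACGTCGCGAAAATTTT", window=8):
    print(row)
```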

Supplementary Fig. 6. GC content against the sequencing depth of Tibetan wild boar genome. We used 100 kb non-overlapping sliding windows along the assembled sequence to calculate GC content and average sequencing depth using short reads.

Supplementary Fig. 7. Depth distribution of fraction bases. The x-axis represents the sequencing depth, and the y-axis the fraction of bases. The high-quality short-insert reads (180 bp and 500 bp) were mapped to the Tibetan wild boar genome assembly with an average depth of 70.8, and ~94.8% of the genome was covered by more than 20 reads.

Supplementary Fig. 8. Distribution of heterozygosity density in the Tibetan wild boar diploid genome. A total of 4.4 M heterozygous SNPs were identified between the two sets of chromosomes of the Tibetan wild boar diploid genome. Non-overlapping 50 kb windows were chosen and the heterozygosity density in each window was calculated.

Supplementary Fig. 9. Comparison of gene parameters among the Tibetan wild boar and five other mammalian genomes. a, mRNA length; b, CDS length; c, exon length; d, exon number; and e, intron length. The similarity of gene parameters between the Tibetan wild boar and the other mammals indicates high-quality gene structure annotation in the Tibetan wild boar genome.

Supplementary Fig. 10. Divergence distribution of classified families of transposable elements. The classified transposon families in a, Tibetan wild boar, b, Duroc pig, c, human and d, cattle genomes were aligned onto the consensus in Repbase. The divergence rate was calculated based on the alignment between the RepeatMasker annotated repeat copies and the consensus sequence in the repeat library. Notably, although transposable elements comprise ~39.47% of the Tibetan wild boar genome, which is similar to that of the Duroc pig genome (40.55%), the length of long interspersed elements (LINEs) with a lower divergence rate (≤ 10%) was shorter in Tibetan wild boar repeat families (~12.96 Mb) than that in Duroc pigs (~34.89 Mb). This implies that the Duroc pig genome has experienced considerable recent transposable element activity, which is a highly effective mechanism for generating genetic and epigenetic variation that may be acted on by selection.

Supplementary Fig. 11. Length distribution of InDels in the Tibetan wild boar whole genome and in coding sequence (CDS) regions. Consistent with previous reports, short InDels are detected more frequently than long InDels, although CDS regions are enriched for InDels that are expected to preserve the reading frame2,3.

Supplementary Fig. 12. Orthology assignment of the Tibetan wild boar, Duroc pig and human genomes. Bars are subdivided to represent different types of orthology relationships. ‘1:1:1’ indicates single-copy orthologs in each genome. ‘N:N:N’, ‘N in 1’, and ‘N in 2’ indicate multi-copy orthologs in all three, one or two genomes, respectively. ‘X:X:0’, ‘X:0:X’, and ‘0:X:X’ indicate single- or multi-copy groups with genes in only two of the genomes. The lineage-specific genes exhibit no orthology with genes in the other two genomes. For genes with alternative splicing variants, we chose the longest transcripts (≥ 30 amino acids) to represent the genes. Mitochondrial genes and unclustered genes are excluded. Most of the 21,806 predicted protein-coding genes in the Tibetan wild boar genome have a homologue in either the Duroc pig (14,427; 66.16%) or human (12,133; 55.64%), with a core set of 10,190 (46.73%) being shared by these three mammals. There are 7,917 single-copy genes that have reciprocal best-match orthologs (1:1:1) among these three mammalian genomes. Out of 3,074 Tibetan wild boar-specific genes (1,178 families), 1,752 Duroc pig-specific genes (1,343 families) and 3,832 human-specific genes (2,333 families), 1,979 (64.38%), 1,365 (77.91%) and 2,610 (68.11%) have known InterPro domain annotations, respectively.
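
The percentages quoted in this legend follow directly from the gene counts, and the short check below reproduces them. The counts and stated percentages are copied from the legend; the dictionary labels and the helper name are only illustrative.

```python
TOTAL_TIBETAN_GENES = 21_806

# label: (count, denominator, percentage stated in the legend)
reported = {
    "homologue in Duroc pig": (14_427, TOTAL_TIBETAN_GENES, 66.16),
    "homologue in human": (12_133, TOTAL_TIBETAN_GENES, 55.64),
    "core set shared by all three": (10_190, TOTAL_TIBETAN_GENES, 46.73),
    "Tibetan-specific with InterPro domain": (1_979, 3_074, 64.38),
    "Duroc-specific with InterPro domain": (1_365, 1_752, 77.91),
    "human-specific with InterPro domain": (2_610, 3_832, 68.11),
}


def percent(part, whole):
    """Percentage of `whole` represented by `part`."""
    return 100 * part / whole


for label, (part, whole, stated) in reported.items():
    computed = percent(part, whole)
    # Each stated value should agree with the recomputed one to two decimals.
    print(f"{label}: computed {computed:.2f}%, stated {stated:.2f}%")
```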

Supplementary Fig. 13. Sequence depth distribution between single- and multi-copy genes in the Tibetan wild boar genome. Orthologous genes shared with the Duroc pig and human (a) and six mammalian genomes (b). Boxes denote the interquartile range (IQR) between the first and third quartiles (25th and 75th percentiles, respectively) and the line inside denotes the median. Whiskers denote the lowest and highest values within 1.5 times IQR from the first and third quartiles, respectively. Outliers beyond the whiskers are shown as black dots. The sequence depth of multiple-copy genes was in the same range as for single-copy ortholog genes, indicating that the calculation of gene copy numbers was accurate.
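
The box-plot convention described here (box from the first to the third quartile, whiskers at the most extreme values within 1.5 times the IQR, remaining points shown as outliers) can be written out as a small sketch. The quartile interpolation rule is an assumption, because the legend does not say which one was used; the function name and the toy depths are hypothetical.

```python
import statistics


def boxplot_summary(values):
    """Summarise values using the 1.5 x IQR whisker convention."""
    # Assumption: 'inclusive' quartile interpolation; other rules shift q1/q3 slightly.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    inside = [v for v in values if low_fence <= v <= high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": min(inside),    # lowest value still inside the fence
        "whisker_high": max(inside),   # highest value still inside the fence
        "outliers": sorted(v for v in values if v < low_fence or v > high_fence),
    }


# Toy usage: nine ordinary depths and one extreme value.
toy_depths = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]  # hypothetical data
print(boxplot_summary(toy_depths))
```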

Supplementary Fig. 14. Orthology delineation among the protein-coding gene family repertoires of the Tibetan wild boar and five other mammals. The red dashed horizontal line represents 1,141 single-copy orthologous genes shared within six mammalian genomes. For genes with alternative splicing variants, we chose the longest transcripts (≥ 30 amino acids) to represent the genes.

Supplementary Fig. 15. Venn diagrams showing the distribution of shared and unique gene families. a, Among Tibetan wild boar, cattle, dog, human and mouse. b, Among Duroc pig, cattle, dog, human and mouse. c, Between Tibetan wild boar and Duroc pig. The Venn diagrams were created with web tools provided by the Bioinformatics and Systems Biology of Gent (Supplementary URLs). For genes with multiple alternative transcripts, the transcript with the best alignment was selected. InParanoid (Supplementary URLs) was used to identify orthologous gene pairs, and MultiParanoid (Supplementary URLs) was then used to merge them into multi-species orthologous groups. The mouse has more lineage-specific families than any of the five other mammals.

Supplementary Fig. 16. Distribution of pairwise amino acid identity of orthologs between the Tibetan wild boar and five other mammals. The Tibetan wild boar exhibited the highest protein identity with Duroc pigs (mean protein similarity: 94.19%; diverged 6.9 Mya), compared with cattle (88.85%, 63.6 Mya), dog (87.05%, 90.8 Mya), human (86.83%, 99.3 Mya) and mouse (82.94%, 99.3 Mya).

Supplementary Fig. 17. Venn diagram showing the distribution of olfactory-related gene repertoires among six mammals. Sequences with more than 60% amino acid sequence identity were clustered together.

Supplementary Fig. 18. Identification and comparison of olfactory receptor genes among six mammals using conserved olfactory receptor-specific motifs. a, Schematic of the three olfactory receptor-specific motifs in mammals. The numbers indicate the positions of amino acids. TM: transmembrane domain. b, Distribution of the olfactory-related genes by the olfactory receptor motifs they contain. Motifs within parentheses were absent. A TBLASTN search was performed to identify genes containing the following conserved motifs: MAYDRYAIC (TMIII), KAFSTCASH (TMVI), and PMLNPFIY (TMVII)4,5, or variants with less than 50% sequence difference from the conserved motif, within a predicted protein of at least 300 amino acids in length. The Duroc pig has the highest proportion (79.09%) of sequences containing all three mammalian-specific conserved olfactory receptor motifs, which can be regarded as bona fide functional olfactory receptors. c, Variable amino acids among the three conserved motifs. All the amino acid sequences of the olfactory-related genes that had all three conserved motifs were aligned to determine the level of variability at each motif. The Duroc pig has the highest level of divergence (1.35 variable amino acids per motif).
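
The motif criterion in panel b can be illustrated with a simplified sketch. It replaces the TBLASTN search used in the study with an ungapped scan of a protein sequence, so it is an approximation rather than the authors' method: a motif counts as present when some window differs from the conserved motif at fewer than 50% of positions, and proteins shorter than 300 amino acids are rejected. The three motif strings come from the legend; the function names and thresholds as parameters are illustrative.

```python
OR_MOTIFS = {
    "TMIII": "MAYDRYAIC",
    "TMVI": "KAFSTCASH",
    "TMVII": "PMLNPFIY",
}


def best_motif_difference(protein, motif):
    """Smallest fraction of mismatching positions over all ungapped windows."""
    best = None
    for i in range(len(protein) - len(motif) + 1):
        window = protein[i:i + len(motif)]
        mismatches = sum(a != b for a, b in zip(window, motif))
        difference = mismatches / len(motif)
        if best is None or difference < best:
            best = difference
    return best  # None when the protein is shorter than the motif


def motifs_present(protein, max_difference=0.5, min_length=300):
    """Names of the motifs found in `protein` under the simplified criterion."""
    if len(protein) < min_length:
        return set()
    found = set()
    for name, motif in OR_MOTIFS.items():
        difference = best_motif_difference(protein, motif)
        if difference is not None and difference < max_difference:
            found.add(name)
    return found


# Toy usage: a hypothetical 326-residue sequence carrying all three motifs.
toy_protein = "A" * 100 + "MAYDRYAIC" + "G" * 100 + "KAFSTCASH" + "G" * 100 + "PMLNPFIY"
print(sorted(motifs_present(toy_protein)))
```

A sequence for which all three names are returned would correspond to the category described in the legend as containing all three conserved motifs.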

Supplementary Fig. 19. Phylogenetic analysis of the olfactory-related gene repertoires. a, Six mammalian genomes; b, Duroc pig and Tibetan wild boar genomes. The neighbor-joining phylogenetic tree was generated using MEGA 5.15 (Supplementary URLs). The Bootstrap values are from 1,000 trials.

Supplementary Fig. 20. Amino acid identity of olfactory-related genes between Duroc pig, Tibetan wild boar and four other mammals.

Supplementary Fig. 21. Average protein similarity of olfactory-related genes and total genes between Duroc pig, Tibetan wild boar and four other mammals.

Supplementary Fig. 22. Comparison of ω values between PSGs in Tibetan wild boar (a) and Duroc pig (b). Orthologous genes with KS > 3 or ω > 5 were filtered6,7, resulting in 5,398 orthologs shared between Tibetan wild boar and Duroc pig. Top panels: Boxes denote the interquartile range (IQR) between the first and third quartiles (25th and 75th percentiles, respectively) and the line inside denotes the median. Whiskers denote the lowest and highest values within 1.5 times IQR from the first and third quartiles, respectively. Outliers beyond the whiskers are shown as black dots. The PSGs (P < 0.05, likelihood ratio test) in Tibetan wild boar (or Duroc pig) have significantly higher ω values than their orthologs in Duroc pig (or Tibetan wild boar) and the genome background (Mann-Whitney U test, P < 10^-16). Lower panels: Bootstrapping was performed by randomly resampling 105 genes from the 5,398 orthologs and PSGs. Distribution of genes in the different ω bins confirms the elevated ω values of PSGs.

Supplementary Fig. 23. Tibetan wild boar and Duroc pig KA/KS (ω) in functional gene categories. Points represent pairs of mean ω in Tibetan wild boar and Duroc pig for genes significantly enriched (P < 0.05) in various KEGG-pathway, Gene Ontology (GO) biological process (BP) and molecular function (MF) categories. Dashed lines mark fold changes in mean ω (Tibetan wild boar versus Duroc pig) of > 2 (lower line) or < 0.5 (upper line). A complete list of categories is provided in Supplementary Table 23.

Supplementary Fig. 24. PSGs in Tibetan wild boar involved in the pathway ‘mTOR signaling’ (a) and ‘vascular smooth muscle contraction’ (b). Solid lines represent direct relationships between PSGs (grey boxes) and metabolites (circular nodes), dashed lines represent indirect relationships, and arrowheads denote directionality (adapted from KEGG pathway: map04150 and map04270). The ω values of PSGs are also shown.

Supplementary Fig. 25. Comparison of the proportions of PSGs in Tibetan wild boar and Duroc pig. The numbers of PSGs are given in parentheses. Dashed horizontal lines represent the proportion of a priori functional candidate genes in the genome (i.e. among the 7,917 single-copy orthologs shared among Tibetan wild boar, Duroc pig and human). UV, ultraviolet.

Supplementary Fig. 26. PSGs in Duroc pig involved in the pathway of ‘extracellular matrix (ECM)-receptor interaction’. Lines represent direct relationships between PSGs (light yellow boxes), the downstream signaling effectors of PSGs (blue boxes) and metabolites (circular nodes) (adapted from KEGG pathway: map04512). The ω values of 11 PSGs in Duroc pig (red bar) and their orthologs in Tibetan wild boar (green bar) and human (white bar) are also shown.

Supplementary Fig. 27. Inactivation events of six identified pseudogenes related to ‘response to drug’ in the Tibetan wild boar genome. Boxes and lines indicate exons and introns, respectively. Red arrows show inactivation events and are labeled with the nature of the change.

Supplementary Fig. 28. Genetic structure analysis for 103 sequenced individuals using FRAPPE with K = 2 to 9. In total, 55 individuals were added from the EMBL-EBI database7-9 (shown in blue). The different symbols correspond to the different geographic locations in Fig. 2a. Each individual is represented by a stacked column, which is partitioned into 2 to 9 colored segments, with the length of each segment representing the proportion of the individual’s genome assigned to each of the K = 2 to 9 ancestral populations. The samples were sorted by region/population only after the analysis. The population names and geographic locations are at the top of the figure. The first level of clustering (K = 2) reflects the primary geographical isolation between Asia-Africa (most samples are from China) and Europe. At K = 3, four other species of genus Sus from islands of Southeast Asia and an African warthog species become separated from the Asian-African individuals. At K = 4, the Tibetan wild boars and Asian wild boars are separated.

Supplementary Fig. 29. Genome-wide distribution of SNPs. Out of 252,121 windows of 100 kb in length sliding in 10 kb steps across the Tibetan wild boar genome, 73,197 windows contain < 100 SNPs (red bars) and cover 29.03% of the genome (dashed lines). The remaining 178,924 windows contain ≥ 100 SNPs (blue bars) and cover 70.97% of the genome; these were used to detect signatures of selective sweeps. The cumulative percentage of whole-genome length (black line) is also charted.
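
The windowing scheme described here (100 kb windows advancing in 10 kb steps, with windows holding fewer than 100 SNPs set aside) can be sketched as below. The window size, step and SNP threshold are taken from the legend; the function names, the 0-based half-open coordinates and the toy data are assumptions, and the sketch handles a single sequence rather than a whole genome. The two reported window counts are consistent with the total (73,197 + 178,924 = 252,121).

```python
from bisect import bisect_left


def count_snps_in_windows(snp_positions, seq_length, window=100_000, step=10_000):
    """Return [(start, end, snp_count)] for sliding windows on one sequence.

    Coordinates are 0-based and half-open, [start, end); this is an assumed
    convention, not one stated in the source.
    """
    positions = sorted(snp_positions)
    windows = []
    for start in range(0, seq_length - window + 1, step):
        end = start + window
        # Number of sorted positions p with start <= p < end.
        snp_count = bisect_left(positions, end) - bisect_left(positions, start)
        windows.append((start, end, snp_count))
    return windows


def split_by_snp_count(windows, min_snps=100):
    """Separate windows kept for sweep detection from those with too few SNPs."""
    kept = [w for w in windows if w[2] >= min_snps]
    excluded = [w for w in windows if w[2] < min_snps]
    return kept, excluded


# Toy usage with tiny numbers in place of 100 kb / 10 kb / 100 SNPs.
toy_snps = [1, 2, 3, 12, 13, 27]  # hypothetical SNP coordinates
toy_windows = count_snps_in_windows(toy_snps, seq_length=30, window=10, step=5)
kept, excluded = split_by_snp_count(toy_windows, min_snps=2)
print(kept)
print(excluded)
```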

Supplementary Fig. 30. Box plot of θπ ratio (θπ, domestic / θπ, Tibetan) (a) and FST values (b) for regions of Tibetan wild boars and Chinese domestic pigs that have undergone positive selection versus the whole genome. Boxes denote the interquartile range (IQR) between the first and third quartiles (25th and 75th percentiles, respectively) and the line inside denotes the median. Whiskers denote the lowest and highest values within 1.5 times IQR from the first and third quartiles, respectively. Outliers beyond the whiskers are shown as black dots. The statistical significance was calculated by the Mann-Whitney U test.

Supplementary Fig. 31. Distribution of selection statistics (Tajima’s D). a, |Tajima’s D_domestic – Tajima’s D_Tibetan| against θπ ratio (θπ, domestic / θπ, Tibetan). b, |Tajima’s D_domestic – Tajima’s D_Tibetan| against FST value. Out of 178,924 windows of length 100 kb across the Tibetan wild boar genome, 2,802 and 1,076 windows were picked out as regions with strong selective sweep signals for Tibetan wild boars (green points) and Chinese domestic pigs (blue points). c, Boxplot of |Tajima’s D_domestic – Tajima’s D_Tibetan| in genomic regions with strong selective sweep signals for Tibetan wild boars and Chinese domestic pigs versus the whole genome. Boxes denote the interquartile range (IQR) between the first and third quartiles (25th and 75th percentiles, respectively) and the line inside denotes the median. Whiskers denote the lowest and highest values within 1.5 times IQR from the first and third quartiles, respectively. Outliers beyond the whiskers are shown as black dots. The statistical significance was calculated by the Mann-Whitney U test.


Supplementary Fig. 32. LD patterns between the selected regions and whole genome of Tibetan wild boars and Chinese domestic pigs. Selected regions had significantly higher LD than the whole genome background across the range of distances separating loci for Tibetan wild boars and Chinese domestic pigs (P < 10⁻¹⁶, Mann-Whitney U test). LD decays much more slowly in selected regions than in the whole genome. The LD decay rate was measured as the distance at which the average squared correlation of allele frequencies (r²) dropped to half its maximum value. For Tibetan wild boars, the LD decay rates of selected regions (black line) and whole genomes (gray line) were estimated at ~11.4 kb and ~5.9 kb, respectively, where r² drops to 0.18. For Chinese domestic pigs, LD decay rates of selected regions (red line) and whole genomes (purple line) were estimated at ~17.8 kb and ~8.1 kb, respectively, where r² drops to 0.20.
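The half-maximum definition of LD decay used in this legend can be sketched as below, assuming the decay curve is available as mean r² values binned by physical distance. The linear interpolation between bins, the function name and the toy curve are assumptions made for illustration only.

```python
def ld_half_decay_distance(distances_kb, mean_r2):
    """Distance at which mean r2 first falls to half of its maximum value.

    Returns (distance_kb, half_max_r2); distance_kb is None if the curve
    never reaches the half-maximum within the supplied range.
    """
    pairs = sorted(zip(distances_kb, mean_r2))
    i_max = max(range(len(pairs)), key=lambda i: pairs[i][1])
    half_max = pairs[i_max][1] / 2
    # Walk outward from the maximum and interpolate linearly in the first
    # interval that crosses the half-maximum level.
    for (d0, r0), (d1, r1) in zip(pairs[i_max:], pairs[i_max + 1:]):
        if r0 > half_max >= r1:
            fraction = (r0 - half_max) / (r0 - r1)
            return d0 + fraction * (d1 - d0), half_max
    return None, half_max

# Toy decay curve (hypothetical values, not read from the figure).
distance, half_max = ld_half_decay_distance([0, 5, 10, 20], [0.4, 0.3, 0.2, 0.1])
```

Under this definition, the r² levels of 0.18 and 0.20 quoted in the legend correspond to maximum mean r² values of about 0.36 and 0.40, respectively.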


Supplementary Fig. 33. Analysis of the phylogenetic relationship of Tibetan wild boars (n = 30) and neighboring domestic pigs (n = 15) using SNPs in regions with strong selective sweep signals. a, A neighbor-joining phylogenetic tree. The scale bar represents p distance. b, Two-way PCA plot. The fraction of the variance explained is 18.21% for eigenvector 1 (P = 7.08 × 10⁻⁴, Tracy-Widom test) and 8.57% for eigenvector 2 (P = 1.95 × 10⁻⁵, Tracy-Widom test). Out of 9.49 M SNPs in the whole genome, only the 8.59% (0.81 M) of SNPs located in the selected regions of Tibetan wild boars and Chinese domestic pigs were used.


Supplementary Fig. 34. Genes embedded in naturally selected regions in Tibetan wild boars related to ‘vitamin B6 binding’ and ‘response to hypoxia’. Ratio of sequence diversity level (θπ ratio, black line), diversity between two populations (FST values, red line), and selection statistics (Tajima’s D, blue and green lines for Chinese domestic pigs and Tibetan wild boars, respectively) are plotted using a 10 kb sliding window. Genomic regions located above the horizontal dashed line (corresponding to a 5% significance level of θπ ratio, where θπ ratio = 1.10; and a 5% significance level of FST, where FST = 0.361) were termed regions with strong selective sweep signals for Tibetan wild boars (gray regions). Genome annotations are shown at the bottom (black bar: coding sequence, blue bar: gene). Three genes (ALB, GLDC and SPTLC2) related to ‘vitamin B6 binding’, and four genes (ALB, ECE1, GNG2 and PIK3C2G) related to ‘response to hypoxia’ are marked in red.


Supplementary Fig. 35. Genes examined in the ‘saliva secretion’ functional category (GO-BP: 0046541) showed signatures of selective sweeps in Chinese domestic pigs. Nine genes exhibited a lower θπ ratio, higher FST and |Tajima’s Ddomestic – Tajima’s DTibetan| compared with the genome background. a, Two genes (KCNMA1 and TRPC1) embedded in regions with significant signatures of selective sweeps are marked in red. KCNMA1 (also known as KCa1.1) encodes the maxi-K channel in the acinar cells of parotid and submandibular exocrine glands¹⁰. TRPC1, as a critical component of the store-operated Ca²⁺ channel in acinar cells, is essential for neurotransmitter-regulation of fluid secretion¹¹. If a gene crossed multiple windows, its θπ ratio, FST and |Tajima’s Ddomestic – Tajima’s DTibetan| values were averaged over these overlapping windows. b, Box plot of θπ ratio, FST and |Tajima’s Ddomestic – Tajima’s DTibetan| values for 9 genes in the ‘saliva secretion’ category of Chinese domestic pigs versus the whole genome. Bootstrapping was performed by randomly resampling 178,924 genes from the 9 genes. The statistical significance was calculated by the Mann-Whitney U test.


Supplementary Fig. 36. Vacuum chewing (Domestic Duroc pig). Vacuum chewing is defined as oral activities with saliva, but no food in the mouth, which is accompanied by copious production of saliva seen as ‘froth’ around the mouth: it is one of the most frequently observed stereotypies in housed pigs in the pig industry.


Supplementary Tables 1-8, 10-16, 18-22, 24-27 and 29-36

Supplementary Table 1. Genome sequencing strategy for the Tibetan wild boar.

                                                  High-quality data
Pair-end libraries  Insert size  Raw data (Gb)    Data (Gb)  Proportion of Q20 (%)  Proportion of Q30 (%)  Proportion of GC (%)  Read length (bp)
Illumina reads      180 bp       136.57           130.05     96.80                  91.42                  39.45                 101
                    500 bp       88.64            86.19      96.20                  91.01                  39.56                 101
                    2 Kb         27.13            20.84      94.44                  88.06                  44.14                 51/101
                    5 Kb         33.72            13.08      95.58                  90.62                  43.78                 101
                    10 Kb        33.23            28.07      96.71                  91.16                  45.84                 75

In total 319.29 Gb of sequence data were obtained for de novo assembly. After filtering reads based on quality, 278.23 Gb of high-quality data were retained for subsequent analysis.


Supplementary Table 2. Estimation of the Tibetan wild boar genome size using K-mer analysis.

K mer  K mer number  K mer depth  Genome size (Mb)  Revised genome size* (Mb)  Heterozygous rate (%)  Repetition rate (%)†  Used bases (Gb)  Sequence depth (×)
19     1.02E+11      41.94        2,427.87          2,379.31                   0.85                   38.86                 128.4            53.97

The estimated size of the Tibetan wild boar genome is ~2.38 Gb. * ‘Revised genome size’ is the accurate estimation without error K-mers. † ‘Repetition rate’ is the proportion of the same K-mer fragments in all K-mers.
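The relation between the columns of this table can be checked with a short worked example. It assumes the commonly used K-mer relation (genome size ≈ total K-mer number / K-mer depth) and that sequence depth is the used bases divided by the revised genome size; whether the original estimate was computed exactly this way is an assumption. Because the K-mer number is rounded to 1.02E+11 in the table, the recomputed genome size only approximates the tabulated 2,427.87 Mb.

```python
def kmer_genome_size_mb(kmer_number, kmer_depth):
    """Genome size estimate in Mb from total K-mer count and K-mer depth."""
    return kmer_number / kmer_depth / 1e6

def sequence_depth(used_bases_gb, genome_size_mb):
    """Average sequencing depth given used bases (Gb) and genome size (Mb)."""
    return used_bases_gb * 1e3 / genome_size_mb

# Values copied from Supplementary Table 2.
estimated_size_mb = kmer_genome_size_mb(1.02e11, 41.94)
depth = sequence_depth(128.4, 2379.31)
```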

Supplementary Table 3. Summary of the Tibetan wild boar genome assembly.

                            Calculated using the fragments > 100 bp    Calculated using the fragments > 500 bp
Category                    Contigs             Scaffolds              Contigs             Scaffolds
Total length (bp)           2,426,282,217       2,501,667,227          2,400,295,503       2,475,602,644
Max length (bp)             278,361             6,123,902              278,361             6,123,902
Average length (bp)         6,490               15,321                 10,177              87,980
N50 length (bp) | Number    20,411 | 32,634     1,049,950 | 714        20,688 | 32,002     1,062,107 | 701
N60 length (bp) | Number    15,751 | 46,177     817,959 | 984          16,022 | 45,196     826,816 | 965
N70 length (bp) | Number    11,775 | 63,968     616,452 | 1,334        12,059 | 62,441     634,339 | 1,305
N80 length (bp) | Number    8,062 | 88,736      421,873 | 1,815        8,368 | 86,205      442,560 | 1,767
N90 length (bp) | Number    4,605 | 128,040     227,167 | 2,599        4,942 | 123,139     247,789 | 2,501


Supplementary Table 4. Summary of mapping and coverage depth.

Category                         Value
Average sequencing depth (×)     70.8
Mismatch rate (%)                0.5
Mapping rate (%)                 90.3
Coverage (%)                     98.7
Coverage at least 4 × (%)        98.0
Coverage at least 10 × (%)       97.0
Coverage at least 20 × (%)       94.8

To evaluate the single-base accuracy of the assembled Tibetan wild boar genome, the high-quality short-insert reads (180 bp and 500 bp) were realigned onto the assembly scaffolds. An average depth of 70.8 was obtained and approximately 94.8% of the genome was covered by 20 or more reads.
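The depth and coverage categories in this table can be derived from per-base read depths, as in the minimal sketch below. It assumes a list of integer depths (one per assembled base) is available and that ‘Coverage’ means depth of at least 1; the function name and the toy depths are hypothetical.

```python
def coverage_summary(depths, thresholds=(1, 4, 10, 20)):
    """Mean depth and percentage of positions covered at each depth threshold."""
    total = len(depths)
    mean_depth = sum(depths) / total
    percent_covered = {
        t: 100 * sum(1 for d in depths if d >= t) / total for t in thresholds
    }
    return mean_depth, percent_covered

# Toy per-base depths (hypothetical).
mean_depth, percent_covered = coverage_summary([0, 5, 10, 25])
```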

Supplementary Table 5. Transposon element families in the Tibetan wild boar genome based on various methods.

Type            Repeat size (bp)   % of genome
Proteinmask     202,408,765        8.25
Repeatmasker    903,922,135        36.85
Trf             37,346,250         1.52
De novo         605,241,890        24.68
Total           968,058,934        39.47

Transposable elements comprised ~39.47% of the Tibetan wild boar genome, which is similar to the value obtained for the Duroc pig genome (40.55%).


Supplementary Table 6. Transposon element families in the Tibetan wild boar genome based on homolog alignment.

                       Repbase TEs                 TE proteins                 RepeatModeler               Combined TEs*
Repeat type            Length (kb)  % in genome    Length (kb)  % in genome    Length (kb)  % in genome    Length (kb)  % in genome
DNA transposon         62,355       2.54           4,350        0.18           23,551       0.96           63,921       2.61
LINE                   416,309      16.97          190,852      7.78           202,588      8.26           442,644      18.05
LTR retrotransposon    110,510      4.51           7,227        0.29           66,794       2.72           120,730      4.92
SINE                   320,011      13.05          0            0.00           310,469      12.66          336,061      13.70
Other†                 5            0.00           0            0.00           0            0.00           5            0.00
Unknown‡               880          0.04           0            0.00           0            0.00           880          0.04
Total                  903,922      36.85          202,408      8.25           602,302      24.56          949,776      38.72

*Combined: the non-redundant consensus of all repeat prediction/classification methods employed. †Other: the repeats classified by RepeatMasker, which are not included in the other groups; ‡Unknown: the predicted repeats that cannot be classified by RepeatMasker; LINE, long interspersed nuclear elements; LTR, long terminal repeat; SINE, short interspersed nuclear elements.


Supplementary Table 7. Summary of InDels in the Tibetan wild boar genome.

Category               Number of InDels
Upstream               6,571
CDS                    982
Intron                 291,414
Splicing               20
Downstream             6,790
Upstream/Downstream    82
Intergenic             678,425
Total                  984,284

‘Upstream’ refers to a variant that overlaps with the 1 kb region upstream of the gene start site. ‘Downstream’ refers to a variant that overlaps with the 1 kb region downstream of the gene end site. ‘Upstream/Downstream’ indicates that a variant is located in downstream and upstream regions (possibly for two different genes). ‘Splicing’ refers to a variant that is within 2 bp of a splice junction.

Supplementary Table 8. Summary of syntenic regions between the Tibetan wild boar and Duroc pig genomes.

Breed                Scaffold / Genome size*        Aligned nucleotides            Syntenic proportion (%)   Number of blocks†
Tibetan wild boar    2,501,667,227 bp (2.50 Gb)     2,336,696,950 bp (2.34 Gb)     93.41                     37,544
Duroc pig‡           2,806,871,662 bp (2.81 Gb)     2,715,263,667 bp (2.72 Gb)     96.74                     37,544

To detect synteny blocks between Tibetan wild boar and Duroc pig genomes, after repeat masking, pairwise whole-genome alignment was performed using LASTZ with the parameters T = 2 (no transition), Y (ydrop) = 15,000, L (gappedthresh) = 3,000 and K (hspthresh) = 4,500 (Supplementary URLs). The raw alignments were combined into larger blocks using the ChainNet algorithm. *The size of Scaffold/genome included the gaps, i.e. ‘N’ (unidentified nucleotides), whose content in the Tibetan wild boar genome (3.01%) is lower than that in the Duroc pig genome (10.31%). †Number of contiguous syntenic blocks determined by pairwise comparisons between Tibetan wild boar and Duroc pig genomes. ‡Excludes mitochondrial genome and Y chromosome.

Supplementary Table 9. List of inversion regions between the Tibetan wild boar and Duroc pig genomes. (see Excel file ‘Supplementary Table 9.xls’)


Supplementary Table 10. Summary of non-coding RNA distribution and annotation in the Tibetan wild boar genome.

Type               Number   Average length (bp)   Total length (bp)   % of genome
miRNA              381      88                    33,339              0.00136
tRNA               531      75                    39,594              0.00161
rRNA               304      114                   34,507              0.00141
  18S              26       226                   5,886               0.00024
  28S              118      139                   16,418              0.00067
  5.8S             4        96                    383                 0.00002
  5S               156      76                    11,820              0.00048
snRNA              890      113                   100,406             0.00409
  CD-box           221      93                    20,568              0.00084
  HACA-box         189      138                   26,107              0.00106
  splicing         458      111                   50,865              0.00207

microRNA (miRNA), small nuclear RNA (snRNA) and tRNA located in repeat or gap regions were filtered. rRNA (< 50 bp) with identity less than 85% were also filtered. The average length and total length were calculated using the integrated data.


Supplementary Table 11. Characteristics of the Tibetan wild boar and Duroc pig genome assemblies.

Genomic features                           Tibetan wild boar   Duroc pig*
Assembled genome size (Gb)†                2.43                2.52
Number of N (unidentified nucleotides)     75,385,010          289,538,800
N content of whole genome (%)              3.01                10.31
Number of Contigs                          370,587             73,524 (placed) | 168,358 (unplaced)
Contig N50 (bp)‡                           20,688              69,669
Average contig length (bp)                 10,177              11,611
Largest contig length (bp)                 278,361             1,598,650
Number of Scaffolds                        163,276             5,343 (placed) | 4,562 (unplaced)
Scaffold N50 (bp)‡                         1,062,107           576,008
Average scaffold length (bp)               87,980              283,544
Largest scaffold length (bp)               6,123,902           3,862,550
GC content (%)                             41.82               41.70
Number of base A                           705,040,222         733,853,103
% of genome base A                         29.06               29.13
Number of base T                           706,487,877         734,661,583
% of genome base T                         28.12               29.16
Number of base C                           507,683,217         525,183,301
% of genome base C                         20.92               20.85
Number of base G                           507,070,901         525,289,361
% of genome base G                         20.90               20.85
Repeat rate (%)                            39.47               40.55
Number of putative coding genes            21,806              21,640
Number of exons                            188,336             197,675
Average gene model length (bp)             32,117              26,781
Average CDS length (bp)                    1,582               1,370
Average gene exon length (bp)              183                 162
Average exon number per gene               8.64                8.44
Average gene intron length (bp)            3,998               3,444
Number of miRNA                            381                 374
Number of tRNA                             531                 819
Number of rRNA                             304                 185
Number of snRNA                            890                 1,030

* From Groenen et al. (2012)⁷.

† The fragments of the ungapped genome assembly.

‡ N50 (50% of the genome is in fragments of this length or longer) of genome assembly was calculated using the fragments longer than 500 bp.


Supplementary Table 12. Summary of RNA-seq mapping results

                                              Mapping to the Tibetan wild boar genome        Mapping to the Duroc pig genome
Tissue   Read types                           Number of reads            % of reads          Number of reads            % of reads
Heart    Total reads                          104,723,266                                    104,723,266
         Mapped reads                         83,979,755                 80.19               74,893,632                 71.52
         Multiple- | Uniquely-mapped reads    3,937,595 | 80,042,160     3.76 | 76.43        6,220,562 | 68,673,070     5.94 | 65.58
         Read-1 | Read-2                      39,047,371 | 37,853,776    37.29 | 36.15       36,532,082 | 35,287,352    34.88 | 33.70
         Reads map to '+' | to '-'            38,711,826 | 38,189,321    36.97 | 36.47       35,852,834 | 35,966,600    34.24 | 34.34
         Non-splice reads | Splice reads      58,162,158 | 18,738,989    55.54 | 17.89       49,640,490 | 22,178,944    47.40 | 21.18
Kidney   Total reads                          30,460,082                                     30,460,082
         Mapped reads                         22,830,732                 74.95               22,669,607                 74.42
         Multiple- | Uniquely-mapped reads    763,398 | 22,067,334       2.51 | 72.45        2,162,136 | 20,507,471     7.10 | 67.33
         Read-1 | Read-2                      11,134,500 | 10,932,834    36.55 | 35.89       10,346,021 | 10,161,450    33.97 | 33.36
         Reads map to '+' | to '-'            11,040,124 | 11,027,210    36.24 | 36.20       10,292,010 | 10,215,461    33.79 | 33.54
         Non-splice reads | Splice reads      15,959,027 | 6,108,307     52.39 | 20.05       15,390,368 | 5,117,103     50.53 | 16.80
Liver    Total reads                          20,257,918                                     20,257,918
         Mapped reads                         14,757,764                 72.85               14,200,850                 70.10
         Multiple- | Uniquely-mapped reads    523,069 | 14,234,695       2.58 | 70.27        1,811,792 | 12,389,058     8.94 | 61.16
         Read-1 | Read-2                      7,173,634 | 7,061,061      35.41 | 34.86       6,244,772 | 6,144,286      30.83 | 30.33
         Reads map to '+' | to '-'            7,132,602 | 7,102,093      35.21 | 35.06       6,202,752 | 6,186,306      30.62 | 30.54
         Non-splice reads | Splice reads      9,488,360 | 4,746,335      46.84 | 23.43       8,423,595 | 3,965,463      41.58 | 19.57
Lung     Total reads                          35,255,828                                     35,255,828
         Mapped reads                         25,001,818                 70.92               22,684,760                 64.34
         Multiple- | Uniquely-mapped reads    814,419 | 24,187,399       2.31 | 68.61        2,424,339 | 20,260,421     6.88 | 57.47
         Read-1 | Read-2                      12,301,199 | 11,886,200    34.89 | 33.71       10,311,043 | 9,949,378     29.25 | 28.22
         Reads map to '+' | to '-'            12,109,760 | 12,077,639    34.35 | 34.26       10,143,933 | 10,116,488    28.77 | 28.69
         Non-splice reads | Splice reads      16,876,361 | 7,311,038     47.87 | 20.74       14,324,210 | 5,936,211     40.63 | 16.84

RNA-seq reads were aligned to the Tibetan wild boar and Duroc pig genomes using TopHat (v2.0.7) with default parameters. ‘Splice reads’ refers to reads where part of the read was not mapped contiguously to the reference genome. The mapping rate of RNA-seq reads against the Tibetan wild boar genome (74.73%) is higher than against the Duroc pig genome (70.10%) across four Tibetan wild boar tissues. Out of 21,806 predicted protein-coding genes in the Tibetan wild boar genome, 18,366 (84.23%) show evidence of transcription based on RNA-seq.
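The two summary mapping rates quoted in this note can be reproduced from the ‘Mapped reads’ rows of the table as unweighted means over the four tissues. This is an inference from the arithmetic, not a statement of how the original figures were derived; the variable and function names below are hypothetical.

```python
# Percent of reads mapped per tissue (Heart, Kidney, Liver, Lung),
# copied from the 'Mapped reads' rows of Supplementary Table 12.
mapped_percent_tibetan = [80.19, 74.95, 72.85, 70.92]
mapped_percent_duroc = [71.52, 74.42, 70.10, 64.34]

def unweighted_mean(values):
    """Arithmetic mean that gives every tissue the same weight."""
    return sum(values) / len(values)

mean_tibetan = unweighted_mean(mapped_percent_tibetan)  # close to 74.73
mean_duroc = unweighted_mean(mapped_percent_duroc)      # close to 70.10
```

A read-weighted rate (total mapped reads divided by total reads over all tissues) would give a different value, because the heart library is several times larger than the others.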


Supplementary Table 13. Summary of evidence for the EVidenceModeler (EVM) gene models in the Tibetan wild boar genome.

               ≥20% overlap            ≥50% overlap            ≥80% overlap
Category       Number   % of total     Number   % of total     Number   % of total
P (single)     34       0.14           463      1.84           2,439    9.69
P (more)       1,789    7.11           2,328    9.25           3,145    12.49
H (single)     18       0.07           27       0.11           101      0.40
H (more)       5        0.02           58       0.23           530      2.11
C (single)     1        0.00           2        0.01           80       0.32
C (more)       0        0.00           4        0.02           37       0.15
P + H          12       0.05           136      0.54           849      3.37
P + C          402      1.60           888      3.53           1,290    5.12
H + C          5,569    22.12          6,584    26.15          6,575    26.11
P + H + C      17,347   68.90          14,677   58.29          9,642    38.30

P, ab initio prediction; H, homology-based; C, cDNA/EST/transcript expressed genes. Genes were further separated into “single” and “more” categories based on the number of sources supporting their existence.

Supplementary Table 14. Assessment of sequence coverage of the Tibetan wild boar genome assembly using the CDS regions of the Duroc pig genome.

                                                                                   with >90% sequence      with >50% sequence
                                                                                   in one scaffold         in one scaffold
Length of unigene   Number   Total length (bp)   Covered by the draft genome (%)   Number    %             Number    %
All                 21,619   29,614,875          99.94                             19,567    90.51         21,277    98.42
>200 bp             21,276   29,558,865          99.95                             19,258    90.51         20,938    98.41
>500 bp             17,710   28,275,129          99.95                             15,927    89.93         17,394    98.22
>1,000 bp           10,926   23,033,892          99.96                             9,876     90.39         10,816    98.99

The CDS sequences of the Duroc pig genome were downloaded from Ensembl release 67, and mapped to the Tibetan wild boar genome assembly. Out of 21,806 predicted protein-coding genes in the Tibetan wild boar genome, 21,619 (99.94%) were covered by CDS regions of the Duroc pig genome.


Supplementary Table 15. Summary of predicted protein-coding genes in the Tibetan wild boar genome compared with other representative mammalian genomes.

Gene set             Number   Average gene model length (bp)   Average CDS length (bp)   Average exons number per gene   Average exon length (bp)   Average intron length (bp)
Tibetan wild boar    21,806   32,117                           1,582                     8.64                            183                        3,998
Duroc pig            21,619   26,987                           1,370                     8.44                            162                        3,444
Human                20,207   49,011                           1,580                     9.31                            169                        5,708
Cattle               19,970   35,523                           1,598                     9.59                            167                        3,949
Dog                  19,281   30,994                           1,577                     9.90                            160                        3,305
Mouse                22,838   36,688                           1,516                     8.56                            177                        4,651

Genes with alternative splicing-induced premature termination and defective codon events were not considered.

Supplementary Table 16. Number of Tibetan wild boar genes with functional classification by various methods.

Category                              Number   Percent (%)
Total                                 21,806   100
Annotated (20,157 genes, 92.44%)
  Swissprot                           19,754   90.59
  TrEMBL                              20,128   92.30
  KEGG                                14,297   65.56
  InterPro                            16,137   74.00
  GO                                  12,888   59.10
Unannotated                           1,649    7.56

Out of 21,806 predicted protein-coding genes in the Tibetan wild boar genome, 20,157 (92.44%) have protein homologues in the other mammalian genomes.

Supplementary Table 17. Tibetan wild boar-specific genes with evidence of transcription. (see Excel file ‘Supplementary Table 17.xls’)


Supplementary Table 18. Functional gene categories enriched for the Tibetan wild boar- and Duroc pig-specific families.

Functional category   Term ID      Term description                               P values    Involved gene number
Tibetan wild boar
GO-MF                 GO:0003964   RNA-directed DNA polymerase activity           0.00E+00    507
GO-BP                 GO:0006278   RNA-dependent DNA replication                  0.00E+00    507
GO-BP                 GO:0006260   DNA replication                                0.00E+00    508
InterProScan          IPR004244    Transposase, L1                                0.00E+00    253
GO-MF                 GO:0016779   Nucleotidyltransferase activity                0.00E+00    509
InterProScan          IPR005135    Endonuclease/exonuclease/phosphatase           3.18E-278   206
GO-BP                 GO:0090304   Nucleic acid metabolic process                 8.81E-255   571
InterProScan          IPR003036    Core shell protein Gag P30                     4.44E-13    21
KEGG-pathway          map05130     Pathogenic Escherichia coli infection          8.54E-11    17
KEGG-pathway          map04270     Vascular smooth muscle contraction             2.07E-09    23
KEGG-pathway          map04810     Regulation of actin cytoskeleton               2.93E-09    20
KEGG-pathway          map04350     TGF-beta signaling pathway                     4.52E-09    19
KEGG-pathway          map04670     Leukocyte transendothelial migration           4.52E-09    19
KEGG-pathway          map04062     Chemokine signaling pathway                    7.15E-09    20
InterProScan          IPR004875    DDE superfamily endonuclease, CENP-B-like      1.08E-04    13
InterProScan          IPR001063    Ribosomal protein L22/L17                      1.25E-02    6
InterProScan          IPR003308    Integrase, N-terminal zinc-binding domain      1.25E-02    4
GO-BP                 GO:0015074   DNA integration                                2.03E-02    4
GO-MF                 GO:0004523   Ribonuclease H activity                        2.77E-02    3
KEGG-pathway          map04150     mTOR signaling pathway                         3.43E-02    6
KEGG-pathway          map04010     MAPK signaling pathway                         3.91E-02    14
KEGG-pathway          map04914     Progesterone-mediated oocyte maturation        3.99E-02    8
Duroc pig
KEGG-pathway          ssc04740     Olfactory transduction                         1.53E-04    35
InterProScan          IPR009311    Interferon-induced 6-16                        6.78E-03    8
GO-BP                 GO:0006508   Proteolysis                                    3.08E-02    8
GO-BP                 GO:0051605   Protein maturation by peptide bond cleavage    4.27E-02    3
GO-BP                 GO:0016485   Protein processing                             4.27E-02    3
GO-BP                 GO:0051604   Protein maturation                             4.27E-02    3
GO-MF                 GO:0008233   Peptidase activity                             4.38E-02    7
InterProScan          IPR011360    Complement B/C2                                4.68E-02    4

P values (i.e. EASE scores), indicating significance of the overlap between various gene sets, were calculated using a Benjamini-corrected modified Fisher’s exact test. Only GO-BP (biological process), GO-MF (molecular function), KEGG-pathway and InterPro domain terms with a P value less than 0.05 were considered as significant and listed.
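As a rough illustration of the statistic named in this note, the sketch below computes an EASE-style score as a one-sided Fisher’s exact (hypergeometric) tail probability after removing one gene from the overlap, followed by a Benjamini-Hochberg adjustment. This is a sketch under stated assumptions: that the ‘modified’ test subtracts one from the overlap count and that the ‘Benjamini’ correction is the Benjamini-Hochberg procedure. The function names and the toy counts are hypothetical and none of the numbers come from this table.

```python
from math import comb

def hypergeom_upper_tail(k, n, K, N):
    """P(X >= k) when n genes are drawn from N genes, K of which carry the term."""
    total = comb(N, n)
    upper = min(n, K)
    favourable = sum(
        comb(K, i) * comb(N - K, n - i) for i in range(max(k, 0), upper + 1)
    )
    return favourable / total

def ease_score(k, n, K, N):
    """Modified Fisher's exact P value: the overlap count k is reduced by one."""
    return hypergeom_upper_tail(k - 1, n, K, N)

def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted P values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Toy counts (hypothetical): 3 of 300 listed genes carry a term that
# annotates 40 of 30,000 background genes.
plain_p = hypergeom_upper_tail(3, 300, 40, 30_000)
ease_p = ease_score(3, 300, 40, 30_000)
adjusted = benjamini_hochberg([0.01, 0.04, 0.03])
```

Subtracting one from the overlap makes the score more conservative for terms supported by only a few genes.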


Supplementary Table 19. Summary of gene families in six mammals.

                                            Tibetan wild boar   Duroc pig   Human    Cattle   Dog      Mouse
Number of genes*                            19,444              19,753      17,558   19,767   18,742   17,592
Number of gene families                     16,203              16,356      15,506   17,401   16,935   10,907
Number of genes per family                  1.20                1.21        1.13     1.14     1.11     1.61
Number of lineage-specific genes            1,264               271         536      39       49       3,473
Number of lineage-specific gene families    189                 124         191      9        18       1,036

* Excludes mitochondrial genes and unclustered genes. Similar to the Duroc pig (number of genes per family: 1.21, lineage-specific gene families: 124) and human (1.13 and 191), the Tibetan wild boar (1.20 and 189) exhibited a moderate rate of evolution relative to other mammals, which is higher than the rate in cattle (1.14 and 9) and in dog (1.11 and 18), but lower than in mouse (1.61 and 1,036).


Supplementary Table 20. Functional gene categories enriched for the Tibetan wild boar- and Duroc pig-specific expansion families.

Functional category   Term ID      Term description                                                  P values   Involved gene number
Tibetan wild boar
InterProScan          IPR008331    Ferritin/DPS protein domain                                       8.64E-13   9
InterProScan          IPR009040    Ferritin-like diiron domain                                       8.64E-13   9
GO-MF                 GO:0008199   Ferric iron binding                                               7.18E-12   9
KEGG-pathway          map05130     Pathogenic Escherichia coli infection                             8.48E-06   6
InterProScan          IPR002190    MAGE protein                                                      1.14E-05   6
GO-MF                 GO:0016705   Oxidoreductase activity, acting on paired donors, with incorporation or reduction of molecular oxygen   1.47E-05   4
KEGG-pathway          map04270     Vascular smooth muscle contraction                                5.71E-05   6
KEGG-pathway          map04350     TGF-beta signaling pathway                                        1.94E-04   6
KEGG-pathway          map04670     Leukocyte transendothelial migration                              1.94E-04   6
KEGG-pathway          map00601     Glycosphingolipid biosynthesis - lacto and neolacto series        5.46E-04   4
KEGG-pathway          map04310     Wnt signaling pathway                                             5.50E-04   6
KEGG-pathway          map04810     Regulation of actin cytoskeleton                                  1.97E-03   6
KEGG-pathway          map04062     Chemokine signaling pathway                                       2.46E-03   6
InterProScan          IPR007087    Zinc finger, C2H2                                                 3.89E-03   17
InterProScan          IPR015880    Zinc finger, C2H2-like                                            1.05E-02   16
KEGG-pathway          map00980     Metabolism of xenobiotics by cytochrome P450                      1.16E-02   4
Duroc pig
KEGG-pathway          ssc04740     Olfactory transduction                                            8.46E-23   30
InterProScan          IPR001039    MHC class I, alpha chain, alpha1 and alpha2                       8.50E-03   5
GO-MF                 GO:0046872   Metal ion binding                                                 1.62E-02   6
GO-MF                 GO:0043169   Cation binding                                                    1.73E-02   6
InterProScan          IPR011161    MHC class I-like antigen recognition                              1.73E-02   7
GO-MF                 GO:0043167   Ion binding                                                       1.77E-02   5
InterProScan          IPR003006    Immunoglobulin/major histocompatibility complex, conserved site   2.68E-02   5
InterProScan          IPR003597    Immunoglobulin C1-set                                             3.03E-02   6

There are 92 families (390 genes) and 232 families (950 genes) that were substantially expanded in the Tibetan wild boar and Duroc pig compared to other mammals, respectively.


Supplementary Table 21. Positively selected genes (PSGs) identified in the Tibetan wild boar and Duroc pig genomes.

ID   Gene symbol  Gene name  P value
Tibetan wild boar
1    ABLIM1     Actin binding LIM protein 1  1.97E-05
2    ACR        Acrosin  2.58E-14
3    ACTR5      ARP5 actin-related protein 5 homolog (yeast)  3.55E-14
4    ACVR1B     Activin A receptor, type IB  0.00E+00
5    ADAMTS15   ADAM metallopeptidase with thrombospondin type 1 motif, 15  4.06E-14
6    ADAMTS9    ADAM metallopeptidase with thrombospondin type 1 motif, 9  5.46E-14
7    ADAMTSL3   ADAMTS-like 3  6.46E-14
8    ADCY1      Adenylate cyclase 1 (brain)  0.00E+00
9    ADCY2      Adenylate cyclase 2 (brain)  0.00E+00
10   ADCY4      Adenylate cyclase 4  1.33E-06
11   ADORA2B    Adenosine A2b receptor  7.33E-09
12   ADRA1B     Adrenergic, alpha-1B-, receptor  9.14E-14
13   AEBP1      AE binding protein 1  9.87E-14
14   AGA        Aspartylglucosaminidase  1.11E-06
15   AKTIP      AKT interacting protein; similar to AKT interacting protein  0.00E+00
16   ALDH2      Aldehyde dehydrogenase 2 family (mitochondrial)  1.42E-10
17   ALPK2      Alpha-kinase 2  1.41E-13
18   ANKAR      Ankyrin and armadillo repeat containing  0.00E+00
19   ANKRD27    Ankyrin repeat domain 27 (VPS9 domain)  1.57E-13
20   ANO5       Anoctamin 5  1.67E-13
21   ANTXR2     Anthrax toxin receptor 2  1.97E-13
22   AP4E1      Adaptor-related protein complex 4, epsilon 1 subunit  2.13E-13
23   APIP       APAF1 interacting protein; similar to APAF1 interacting protein  0.00E+00
24   APOBEC1    Apolipoprotein B mRNA editing enzyme, catalytic polypeptide 1  2.17E-13
25   APOE       Hypothetical LOC100129500; apolipoprotein E  5.19E-07
26   ARAP3      ArfGAP with RhoGAP domain, ankyrin repeat and PH domain 3  2.49E-13
27   ARG2       Arginase, type II  3.51E-13
28   ARHGEF11   Rho guanine nucleotide exchange factor (GEF) 11  2.17E-05
29   ARHGEF12   Rho guanine nucleotide exchange factor (GEF) 12  0.00E+00
30   ARNT       Aryl hydrocarbon receptor nuclear translocator  0.00E+00
31   ASNSD1     Asparagine synthetase domain containing 1  3.95E-13
32   ASTL       Astacin-like metallo-endopeptidase (M12 family)  2.45E-07


33   ATAD2      ATPase family, AAA domain containing 2  0.00E+00
34   ATXN7      Ataxin 7  1.39E-10
35   BBS7       Bardet-Biedl syndrome 7  0.00E+00
36   BCL3       B-cell CLL/lymphoma 3  4.65E-11
37   BIRC2      Baculoviral IAP repeat-containing 2  4.09E-13
38   C8B        Complement component 8, beta polypeptide  4.31E-13
39   C8ORF76    Chromosome 8 open reading frame 76  4.49E-13
40   CA6        Carbonic anhydrase VI  4.65E-13
41   CA9        Carbonic anhydrase IX  5.26E-13
42   CABLES2    Cdk5 and Abl enzyme substrate 2  5.28E-13
43   CALCRL     Calcitonin receptor-like  5.48E-03
44   CAMK2G     Calcium/calmodulin-dependent protein kinase II gamma  3.33E-16
45   CBL        Cas-Br-M (murine) ecotropic retroviral transforming sequence  5.28E-13
46   CCHCR1     Coiled-coil alpha-helical rod protein 1  5.30E-13
47   CCNE2      Cyclin E2  5.70E-13
48   CDK12      Cdc2-related kinase, arginine/serine-rich  0.00E+00
49   CELF5      Bruno-like 5, RNA binding protein (Drosophila)  1.01E-10
50   CHD3       Chromodomain helicase DNA binding protein 3  1.06E-10
51   COL11A1    Collagen, type XI, alpha 1  0.00E+00
52   COL14A1    Collagen, type XIV, alpha 1  5.78E-13
53   COPZ2      Coatomer protein complex, subunit zeta 2  4.55E-10
54   CPEB4      Cytoplasmic polyadenylation element binding protein 4  0.00E+00
55   CPXM2      Carboxypeptidase X (M14 family), member 2  6.52E-13
56   CTSZ       Cathepsin Z  1.48E-10
57   DGAT1      Diacylglycerol O-acyltransferase homolog 1 (mouse)  6.61E-08
58   DGUOK      Deoxyguanosine kinase  1.31E-08
59   DNAJC7     DnaJ (Hsp40) homolog, subfamily C, member 7  6.20E-09
60   DPP4       Dipeptidyl-peptidase 4  7.47E-13
61   DPYSL4     Dihydropyrimidinase-like 4  7.75E-13
62   DPYSL5     Dihydropyrimidinase-like 5  8.99E-11
63   DUSP3      Dual specificity phosphatase 3  1.93E-08
64   EBPL       Emopamil binding protein-like  7.92E-13
65   EEA1       Early endosome antigen 1  8.08E-13
66   EGLN2      Egl nine homolog 2 (C. elegans)  8.74E-13
67   EIF4E1B    Eukaryotic translation initiation factor 4E family member 1B  1.99E-10
68   EIF4E2     Eukaryotic translation initiation factor 4E family member 2  2.69E-06
69   ERCC4      Excision repair cross-complementing rodent repair deficiency, complementation group 4  5.07E-07
70   ERCC6      Excision repair cross-complementing rodent repair deficiency, complementation group 6  1.01E-12
71   EREG       Epiregulin  3.13E-09


72   ERGIC1     Endoplasmic reticulum-golgi intermediate compartment (ERGIC) 1  1.50E-07
73   ESCO1      Establishment of cohesion 1 homolog 1 (S. cerevisiae)  1.11E-16
74   ETFA       Electron-transfer-flavoprotein, alpha polypeptide  2.12E-08
75   FABP2      Fatty acid binding protein 2, intestinal  4.19E-08
76   FBXL4      F-box and leucine-rich repeat protein 4  0.00E+00
77   FBXO30     F-box protein 30  5.55E-16
78   FGF10      Fibroblast growth factor 10  1.05E-12
79   FIGF       C-fos induced growth factor (vascular endothelial growth factor D)  1.35E-12
80   FLAD1      FAD1 flavin adenine dinucleotide synthetase homolog (S. cerevisiae)  0.00E+00
81   FNBP1      Formin binding protein 1  2.49E-10
82   FNBP1L     Formin binding protein 1-like  3.76E-10
83   FOXL2      Forkhead box L2  6.66E-16
84   GHRHR      Growth hormone releasing hormone receptor  1.36E-12
85   GIN1       Gypsy retrotransposon integrase 1  5.65E-11
86   GPD2       Glycerol-3-phosphate dehydrogenase 2 (mitochondrial)  0.00E+00
87   GPR182     G protein-coupled receptor 182  1.56E-12
88   GRAMD1C    GRAM domain containing 1C  1.74E-12
89   GRIA2      Glutamate receptor, ionotropic, AMPA 2  2.13E-12
90   GTPBP8     GTP-binding protein 8 (putative)  2.31E-12
91   GUF1       GUF1 GTPase homolog (S. cerevisiae)  2.67E-12
92   GUSB       Glucuronidase, beta  3.53E-12
93   HELB       Helicase (DNA) B  3.62E-12
94   HHAT       Hedgehog acyltransferase  2.02E-06
95   HIF1A      Hypoxia inducible factor 1, alpha subunit (basic helix-loop-helix transcription factor)  3.96E-12
96   HLTF       Helicase-like transcription factor  2.22E-16
97   HMGCL      3-hydroxymethyl-3-methylglutaryl-Coenzyme A lyase  3.96E-12
98   HPS5       Hermansky-Pudlak syndrome 5  4.08E-12
99   HSF1       Heat shock transcription factor 1  4.09E-12
100  HSPA9      Heat shock 70kDa protein 9 (mortalin)  4.44E-16
101  ID2        Inhibitor of DNA binding 2, dominant negative helix-loop-helix protein  4.09E-12
102  IDH1       Isocitrate dehydrogenase 1 (NADP+), soluble  6.66E-16
103  IDH3G      Isocitrate dehydrogenase 3 (NAD+) gamma  4.34E-12
104  IFIH1      Interferon induced with helicase C domain 1  4.58E-12
105  IFNG       Interferon, gamma  4.86E-12
106  IGF1       Insulin-like growth factor 1 (somatomedin C)  0.00E+00
107  IGF2R      Insulin-like growth factor 2 receptor  5.26E-12
108  IHH        Indian hedgehog homolog (Drosophila)  5.97E-06
109  IL4I1      Interleukin 4 induced 1  5.11E-07
110  IL5RA      Interleukin 5 receptor, alpha  7.07E-07
111  KCNA3      Potassium voltage-gated channel, shaker-related subfamily, member 3  6.61E-12


112  KCNH4      Potassium voltage-gated channel, subfamily H (eag-related), member 4  6.93E-12
113  KLHL2      Kelch-like 2, Mayven (Drosophila)  0.00E+00
114  LDLRAP1    Low density lipoprotein receptor adaptor protein 1  6.27E-08
115  LEF1       Lymphoid enhancer-binding factor 1  2.47E-10
116  LEPR       Leptin receptor  2.68E-07
117  LHX2       LIM homeobox 2  5.56E-10
118  LMTK2      Lemur tyrosine kinase 2  1.12E-07
119  LPCAT4     Lysophosphatidylcholine acyltransferase 4  4.06E-10
120  MAP1LC3C   Microtubule-associated protein 1 light chain 3 gamma  3.85E-11
121  MAP2K2     Mitogen-activated protein kinase kinase 2 pseudogene; mitogen-activated protein kinase kinase 2  9.45E-13
122  MAPK8IP3   Mitogen-activated protein kinase 8 interacting protein 3  0.00E+00
123  MAPKAPK2   Mitogen-activated protein kinase-activated protein kinase 2  7.04E-12
124  MAT2A      Methionine adenosyltransferase II, alpha  2.78E-06
125  MINPP1     Multiple inositol polyphosphate histidine phosphatase, 1  1.85E-03
126  MIXL1      Mix1 homeobox-like 1 (Xenopus laevis)  9.57E-06
127  MMP11      Matrix metallopeptidase 11 (stromelysin 3)  7.94E-12
128  MYO1H      Myosin IH  2.19E-10
129  MYO5C      Myosin VC  3.93E-07
130  MYT1L      Myelin transcription factor 1-like  0.00E+00
131  NARS       Asparaginyl-tRNA synthetase  0.00E+00
132  NDUFS2     NADH dehydrogenase (ubiquinone) Fe-S protein 2, 49kDa (NADH-coenzyme Q reductase)  5.22E-08
133  NPR1       Natriuretic peptide receptor A/guanylate cyclase A (atrionatriuretic peptide receptor A)  3.87E-13
134  NPY1R      Neuropeptide Y receptor Y1  3.31E-06
135  ODAM       Odontogenic, ameloblast associated  4.71E-08
136  PAFAH2     Platelet-activating factor acetylhydrolase 2, 40kDa  0.00E+00
137  PAICS      Phosphoribosylaminoimidazole carboxylase, phosphoribosylaminoimidazole succinocarboxamide synthetase  8.34E-12
138  PAK7       P21 protein (Cdc42/Rac)-activated kinase 7  8.57E-12
139  PANK3      Pantothenate kinase 3  1.07E-11
140  PCSK7      Proprotein convertase subtilisin/kexin type 7 pseudogene; proprotein convertase subtilisin/kexin type 7  7.89E-07
141  PDGFRA     Platelet-derived growth factor receptor, alpha polypeptide  0.00E+00
142  PEX3       Peroxisomal biogenesis factor 3  6.66E-16
143  PGF        Placental growth factor  4.64E-08
144  PIK3C2G    Phosphoinositide-3-kinase, class 2, gamma polypeptide  1.20E-11
145  PIP5K1C    Phosphatidylinositol-4-phosphate 5-kinase, type I, gamma  4.52E-07

146 | PLA2G2A | Phospholipase A2, group IIA (platelets, synovial fluid) | 6.61E-03
147 | PLAU | Plasminogen activator, urokinase | 3.33E-16
148 | PLCB3 | Phospholipase C, beta 3 (phosphatidylinositol-specific) | 2.85E-05
149 | PLCG1 | Phospholipase C, gamma 1 | 0.00E+00
150 | PLK3 | Polo-like kinase 3 (Drosophila) | 2.58E-07
151 | PLOD2 | Procollagen-lysine, 2-oxoglutarate 5-dioxygenase 2 | 0.00E+00
152 | PMCH | Pro-melanin-concentrating hormone | 7.07E-11
153 | PPA1 | Pyrophosphatase (inorganic) 1 | 2.68E-08
154 | PPID | Peptidylprolyl isomerase D | 0.00E+00
155 | PPP1R12B | Protein phosphatase 1, regulatory (inhibitor) subunit 12B | 9.03E-08
156 | PPP1R15B | Protein phosphatase 1, regulatory (inhibitor) subunit 15B | 8.61E-03
157 | PRKAA2 | Protein kinase, AMP-activated, alpha 2 catalytic subunit | 2.63E-06
158 | PRKACA | Protein kinase, cAMP-dependent, catalytic, alpha | 0.00E+00
159 | PSMB6 | Proteasome (prosome, macropain) subunit, beta type, 6 | 3.83E-09
160 | PSMD9 | Proteasome (prosome, macropain) 26S subunit, non-ATPase, 9 | 6.66E-16
161 | PSME4 | Proteasome (prosome, macropain) activator subunit 4 | 0.00E+00
162 | PSPH | Phosphoserine phosphatase-like; phosphoserine phosphatase | 4.52E-11
163 | PTGIR | Prostaglandin I2 (prostacyclin) receptor (IP) | 0.00E+00
164 | PTPN1 | Protein tyrosine phosphatase, non-receptor type 1 | 7.56E-10
165 | PYGO1 | Pygopus homolog 1 (Drosophila) | 1.37E-10
166 | RABEPK | Rab9 effector protein with kelch motifs | 1.90E-09
167 | RAD51AP1 | RAD51 associated protein 1 | 0.00E+00
168 | RAMP1 | Receptor (G protein-coupled) activity modifying protein 1 | 4.53E-09
169 | RANBP3L | RAN binding protein 3-like | 0.00E+00
170 | RAPGEF2 | Rap guanine nucleotide exchange factor (GEF) 2; similar to RAPGEF2 protein | 0.00E+00
171 | RARS2 | Arginyl-tRNA synthetase 2, mitochondrial | 0.00E+00
172 | REV1 | REV1 homolog (S. cerevisiae) | 0.00E+00
173 | RICTOR | RPTOR independent companion of MTOR, complex 2 | 1.78E-04
174 | RIOK1 | RIO kinase 1 (yeast) | 0.00E+00
175 | RNASET2 | Ribonuclease T2 | 5.72E-08
176 | RNF111 | Ring finger protein 111 | 0.00E+00
177 | RNF151 | Ring finger protein 151 | 3.75E-06
178 | RNF214 | Ring finger protein 214 | 0.00E+00
179 | RPS6KB2 | Ribosomal protein S6 kinase, 70kDa, polypeptide 2 | 0.00E+00
180 | RSPRY1 | Ring finger and SPRY domain containing 1 | 6.10E-08
181 | SDHAF2 | Chromosome 11 open reading frame 79 | 5.17E-06

182 | SEC14L5 | SEC14-like 5 (S. cerevisiae) | 7.11E-15
183 | SERGEF | Secretion regulating guanine nucleotide exchange factor | 4.11E-11
184 | SERPINE1 | Serpin peptidase inhibitor, clade E (nexin, plasminogen activator inhibitor type 1), member 1 | 1.18E-05
185 | SGTB | Small glutamine-rich tetratricopeptide repeat (TPR)-containing, beta | 9.88E-15
186 | SHH | Sonic hedgehog homolog (Drosophila) | 1.33E-11
187 | SP8 | Sp8 transcription factor | 9.99E-15
188 | SPHK1 | Sphingosine kinase 1 | 1.31E-07
189 | SRGN | Serglycin | 3.33E-02
190 | STX3 | Syntaxin 3 | 4.27E-06
191 | SYT13 | Synaptotagmin XIII | 7.91E-10
192 | TBCD | Tubulin folding cofactor D | 2.30E-11
193 | TDO2 | Tryptophan 2,3-dioxygenase | 2.45E-11
194 | TDRD1 | Tudor domain containing 1 | 2.49E-11
195 | TGDS | TDP-glucose 4,6-dehydratase | 0.00E+00
196 | TMED10 | Transmembrane emp24-like trafficking protein 10 (yeast) | 6.33E-09
197 | TMTC4 | Transmembrane and tetratricopeptide repeat containing 4 | 0.00E+00
198 | TRIM37 | Tripartite motif-containing 37 | 0.00E+00
199 | TRIM44 | Tripartite motif-containing 44 | 4.02E-11
200 | TRNAU1AP | tRNA selenocysteine 1 associated protein 1 | 3.83E-06
201 | TRPM7 | Transient receptor potential cation channel, subfamily M, member 7 | 0.00E+00
202 | TTC13 | Tetratricopeptide repeat domain 13 | 3.33E-16
203 | TTC9 | Tetratricopeptide repeat domain 9 | 9.46E-11
204 | USF1 | Upstream transcription factor 1 | 2.61E-11
205 | VEGFC | Vascular endothelial growth factor C | 4.44E-16
206 | WWP1 | WW domain containing E3 ubiquitin protein ligase 1 | 2.61E-10
207 | XRCC1 | X-ray repair complementing defective repair in Chinese hamster cells 1 | 5.18E-10
208 | ZC3H12D | Zinc finger CCCH-type containing 12D | 1.67E-14
209 | ZNF451 | Zinc finger protein 451 | 2.11E-08
210 | ZNF558 | Zinc finger protein 558 | 0.00E+00
211 | ZNF567 | Zinc finger protein 567 | 5.88E-09
212 | ZNF606 | Zinc finger protein 606 | 3.40E-06
213 | ZNRF4 | Zinc and ring finger 4 | 1.15E-06
214 | ZPBP | Zona pellucida binding protein | 2.73E-11
215 | ZRANB3 | Zinc finger, RAN-binding domain containing 3 | 0.00E+00

Duroc pig
1 | ABLIM1 | Actin binding LIM protein 1 | 2.41E-03
2 | ACVR1C | Activin A receptor, type IC | 0.00E+00
3 | ADAMTS12 | ADAM metallopeptidase with thrombospondin type 1 motif, 12 | 1.05E-12
4 | ADCY1 | Adenylate cyclase 1 (brain) | 0.00E+00
5 | ADCY4 | Adenylate cyclase 4 | 0.00E+00

6 | ADRB3 | Adrenergic, beta-3-, receptor | 2.83E-04
7 | AGA | Aspartylglucosaminidase | 3.78E-02
8 | AGPAT2 | 1-acylglycerol-3-phosphate O-acyltransferase 2 (lysophosphatidic acid acyltransferase, beta) | 2.48E-03
9 | ALOX5 | Arachidonate 5-lipoxygenase | 2.37E-06
10 | ALS2CL | ALS2 C-terminal like | 1.42E-12
11 | ANLN | Anillin, actin binding protein | 3.77E-13
12 | APBA1 | Amyloid beta (A4) precursor protein-binding, family A, member 1 | 0.00E+00
13 | APBA2 | Amyloid beta (A4) precursor protein-binding, family A, member 2 | 6.17E-14
14 | APOO | Apolipoprotein O | 1.91E-03
15 | ARHGAP11A | Rho GTPase activating protein 11B; Rho GTPase activating protein 11A | 1.83E-12
16 | ARHGAP25 | Rho GTPase activating protein 25 | 2.51E-12
17 | B4GALNT1 | Beta-1,4-N-acetyl-galactosaminyl transferase 1 | 4.78E-12
18 | BARX2 | BARX homeobox 2 | 2.33E-03
19 | BTC | Betacellulin | 9.50E-12
20 | BTG4 | B-cell translocation gene 4 | 2.58E-03
21 | BYSL | Bystin-like | 1.23E-11
22 | C9ORF89 | Chromosome 9 open reading frame 89 | 2.48E-03
23 | CDC16 | Cell division cycle 16 homolog (S. cerevisiae) | 2.26E-03
24 | CDC26 | Cell division cycle 26 homolog (S. cerevisiae); cell division cycle 26 homolog (S. cerevisiae) pseudogene | 7.06E-04
25 | CDC45 | CDC45 cell division cycle 45-like (S. cerevisiae) | 2.42E-03
26 | CDCA7L | Cell division cycle associated 7-like | 8.32E-04
27 | CEP164 | Centrosomal protein 164kDa | 9.93E-06
28 | CHKB | Choline kinase beta; carnitine palmitoyltransferase 1B (muscle) | 2.56E-11
29 | CILP | Cartilage intermediate layer protein, nucleotide pyrophosphohydrolase | 2.37E-03
30 | CLDN18 | Claudin 18 | 2.38E-04
31 | CNGA3 | Cyclic nucleotide gated channel alpha 3 | 3.03E-11
32 | CNTNAP5 | Contactin associated protein-like 5 | 0.00E+00
33 | COL11A1 | Collagen, type XI, alpha 1 | 0.00E+00
34 | COL17A1 | Collagen, type XVII, alpha 1 | 4.38E-11
35 | COL4A4 | Collagen, type IV, alpha 4 | 8.77E-15
36 | COL5A3 | Collagen, type V, alpha 3 | 4.83E-03
37 | COL6A2 | Collagen, type VI, alpha 2 | 4.65E-11
38 | CPT1B | Choline kinase beta; carnitine palmitoyltransferase 1B (muscle) | 7.97E-11
39 | CRISPLD2 | Cysteine-rich secretory protein LCCL domain containing 2 | 1.13E-10
40 | CSF3R | Colony stimulating factor 3 receptor (granulocyte) | 1.18E-10
41 | CXADR | Coxsackie virus and adenovirus receptor pseudogene 2; coxsackie virus and adenovirus receptor | 1.88E-10

42 | DNAJB5 | DnaJ (Hsp40) homolog, subfamily B, member 5 | 0.00E+00
43 | DSCAM | Down syndrome cell adhesion molecule | 0.00E+00
44 | ELF3 | E74-like factor 3 (ets domain transcription factor, epithelial-specific) | 3.96E-06
45 | EML4 | Echinoderm microtubule associated protein like 4 | 0.00E+00
46 | EMX2 | Empty spiracles homeobox 2 | 1.80E-05
47 | ENO2 | Enolase 2 (gamma, neuronal) | 2.00E-04
48 | EVI5L | Ecotropic viral integration site 5-like | 2.70E-10
49 | FANCD2 | Fanconi anemia, complementation group D2 | 3.22E-10
50 | FNDC3A | Fibronectin type III domain containing 3A | 8.78E-06
51 | FREM2 | FRAS1 related extracellular matrix protein 2 | 6.02E-14
52 | GDF3 | Growth differentiation factor 3 | 6.21E-04
53 | GHSR | Growth hormone secretagogue receptor | 7.05E-14
54 | GPLD1 | Glycosylphosphatidylinositol specific phospholipase D1 | 2.00E-04
55 | GRHPR | Glyoxylate reductase/hydroxypyruvate reductase | 4.54E-10
56 | HIATL1 | Hippocampus abundant transcript-like 1 | 1.06E-05
57 | IGF2BP2 | Insulin-like growth factor 2 mRNA binding protein 2 | 1.55E-03
58 | IGFALS | Insulin-like growth factor binding protein, acid labile subunit | 1.64E-04
59 | IGFBP2 | Insulin-like growth factor binding protein 2, 36kDa | 2.22E-03
60 | IL6R | Interleukin 6 receptor | 5.32E-10
61 | ITGA3 | Integrin, alpha 3 (antigen CD49C, alpha 3 subunit of VLA-3 receptor) | 2.00E-15
62 | ITGA8 | Integrin, alpha 8 | 8.53E-04
63 | ITGB6 | Integrin, beta 6 | 8.60E-03
64 | JMY | Junction mediating and regulatory protein, p53 cofactor | 7.17E-10
65 | JUNB | Jun B proto-oncogene | 1.11E-15
66 | KCNT2 | Potassium channel, subfamily T, member 2 | 1.92E-03
67 | KEL | Kell blood group, metallo-endopeptidase | 9.19E-10
68 | KLC1 | Kinesin light chain 1 | 0.00E+00
69 | KLHL2 | Kelch-like 2, Mayven (Drosophila) | 1.96E-03
70 | LAMA4 | Laminin, alpha 4 | 3.45E-03
71 | LAMB3 | Laminin, beta 3 | 1.45E-09
72 | LCAT | Lecithin-cholesterol acyltransferase | 9.03E-04
73 | LEF1 | Lymphoid enhancer-binding factor 1 | 7.19E-04
74 | LIMK2 | LIM domain kinase 2 | 3.77E-13
75 | LYN | V-yes-1 Yamaguchi sarcoma viral related oncogene homolog | 1.47E-09
76 | LYST | Lysosomal trafficking regulator | 1.67E-09
77 | MAPK8IP3 | Mitogen-activated protein kinase 8 interacting protein 3 | 4.69E-05
78 | MBTPS1 | Membrane-bound transcription factor peptidase, site 1 | 1.71E-09
79 | MCF2L | MCF.2 cell line derived transforming sequence-like | 1.71E-09
80 | MCM4 | Minichromosome maintenance complex component 4 | 3.06E-09
81 | MEF2B | Myocyte enhancer factor 2B | 0.00E+00
82 | MEF2C | Myocyte enhancer factor 2C | 2.38E-04
83 | MGRN1 | Mahogunin, ring finger 1 | 2.15E-04
84 | MINPP1 | Multiple inositol polyphosphate histidine phosphatase, 1 | 1.85E-03
85 | MYBPC1 | Myosin binding protein C, slow type | 4.15E-09
86 | MYH13 | Myosin, heavy chain 13, skeletal muscle | 0.00E+00
87 | MYO10 | Myosin X | 2.36E-04
88 | MYO18B | Myosin XVIIIB | 2.43E-13
89 | MYO1D | Myosin ID | 0.00E+00
90 | MYO1F | Myosin IF | 2.58E-03
91 | NARS | Asparaginyl-tRNA synthetase | 5.16E-06
92 | NCAPD3 | Non-SMC condensin II complex, subunit D3 | 5.52E-09
93 | NDE1 | NudE nuclear distribution gene E homolog 1 (A. nidulans) | 7.28E-09
94 | NDRG1 | N-myc downstream regulated 1 | 5.64E-14
95 | NDUFB7 | NADH dehydrogenase (ubiquinone) 1 beta subcomplex, 7, 18kDa | 1.39E-04
96 | NFE2L2 | Nuclear factor (erythroid-derived 2)-like 2 | 9.60E-09
97 | NFKB2 | Nuclear factor of kappa light polypeptide gene enhancer in B-cells 2 (p49/p100) | 1.38E-08
98 | NIPBL | Nipped-B homolog (Drosophila) | 1.45E-08
99 | NMUR2 | Neuromedin U receptor 2 | 2.18E-04
100 | NNT | Nicotinamide nucleotide transhydrogenase | 5.66E-15
101 | NOTCH2 | Notch homolog 2 (Drosophila) | 1.52E-08
102 | OSBPL7 | Oxysterol binding protein-like 7 | 1.52E-08
103 | PAG1 | Phosphoprotein associated with glycosphingolipid microdomains 1 | 1.99E-08
104 | PANX1 | Pannexin 1 | 5.84E-04
105 | PARVA | Parvin, alpha | 2.01E-08
106 | PDGFC | Platelet derived growth factor C | 2.63E-03
107 | PEX11G | Peroxisomal biogenesis factor 11 gamma | 1.78E-04
108 | PGF | Placental growth factor | 1.06E-04
109 | PIP5K1C | Phosphatidylinositol-4-phosphate 5-kinase, type I, gamma | 0.00E+00
110 | PKHD1 | Polycystic kidney and hepatic disease 1 (autosomal recessive) | 2.62E-08
111 | PLSCR1 | Phospholipid scramblase 1 | 3.60E-13
112 | PLXNC1 | Plexin C1 | 3.61E-08
113 | PNPO | Pyridoxamine 5'-phosphate oxidase | 4.22E-08
114 | POSTN | Periostin, osteoblast specific factor | 6.10E-04
115 | PPAP2B | Phosphatidic acid phosphatase type 2B | 7.54E-04
116 | PPARGC1A | Peroxisome proliferator-activated receptor gamma, coactivator 1 alpha | 1.16E-05
117 | PPFIBP1 | PTPRF interacting protein, binding protein 1 (liprin beta 1) | 3.80E-14
118 | PPP1R15B | Protein phosphatase 1, regulatory (inhibitor) subunit 15B | 3.61E-05
119 | PSAT1 | Chromosome 8 open reading frame 62; phosphoserine aminotransferase 1 | 0.00E+00
120 | PSMD5 | Proteasome (prosome, macropain) 26S subunit, non-ATPase, 5 | 5.77E-15
121 | PSRC1 | Proline/serine-rich coiled-coil 1 | 6.48E-13
122 | PTPRR | Protein tyrosine phosphatase, receptor type, R | 4.97E-08
123 | QKI | Quaking homolog, KH domain RNA binding (mouse) | 5.40E-08
124 | RAD51AP1 | RAD51 associated protein 1 | 9.72E-03
125 | RAP1GAP | RAP1 GTPase activating protein | 5.57E-08
126 | RBL1 | Retinoblastoma-like 1 (p107) | 1.67E-15
127 | RCC2 | Regulator of chromosome condensation 2 | 2.53E-03
128 | RECK | Reversion-inducing-cysteine-rich protein with kazal motifs | 0.00E+00
129 | RELB | V-rel reticuloendotheliosis viral oncogene homolog B | 7.87E-13
130 | RTN4 | Reticulon 4 | 1.28E-03
131 | SBNO2 | Strawberry notch homolog 2 (Drosophila) | 5.94E-08
132 | SCARB1 | Scavenger receptor class B, member 1 | 6.37E-08
133 | SEMA5A | Sema domain, seven thrombospondin repeats (type 1 and type 1-like), transmembrane domain (TM) and short cytoplasmic domain, (semaphorin) 5A | 9.25E-08
134 | SERPINE1 | Serpin peptidase inhibitor, clade E (nexin, plasminogen activator inhibitor type 1), member 1 | 3.35E-06
135 | SERPINF1 | Serpin peptidase inhibitor, clade F (alpha-2 antiplasmin, pigment epithelium derived factor), member 1 | 1.03E-07
136 | SESN1 | Sestrin 1 | 7.95E-04
137 | SESN3 | Sestrin 3 | 0.00E+00
138 | SGSM1 | Small G protein signaling modulator 1 | 1.08E-07
139 | SH3PXD2A | SH3 and PX domains 2A | 2.22E-16
140 | SIPA1L2 | Signal-induced proliferation-associated 1 like 2 | 3.08E-04
141 | SLC15A5 | Solute carrier family 15, member 5 | 2.05E-03
142 | SLC16A14 | Solute carrier family 16, member 14 (monocarboxylic acid transporter 14) | 1.90E-05
143 | SLC16A6 | Solute carrier family 16, member 6 (monocarboxylic acid transporter 7); similar to solute carrier family 16, member 6 | 2.71E-04
144 | SLC1A7 | Solute carrier family 1 (glutamate transporter), member 7 | 2.44E-05
145 | SLC27A1 | Solute carrier family 27 (fatty acid transporter), member 1 | 1.09E-05
146 | SLC2A2 | Solute carrier family 2 (facilitated glucose transporter), member 2 | 2.27E-04
147 | SLC6A14 | Solute carrier family 6 (amino acid transporter), member 14 | 1.08E-07
148 | SLC6A3 | Solute carrier family 6 (neurotransmitter transporter, dopamine), member 3 | 0.00E+00
149 | SNX32 | Sorting nexin 32 | 1.67E-07

150 | SNX5 | Sorting nexin 5 | 1.75E-07
151 | SOS1 | Son of sevenless homolog 1 (Drosophila) | 2.05E-07
152 | SPHK1 | Sphingosine kinase 1 | 2.70E-03
153 | SREBF2 | Sterol regulatory element binding transcription factor 2 | 5.57E-07
154 | SRGN | Serglycin | 1.02E-04
155 | SYCE1L | Hypothetical protein LOC100130958 | 1.03E-05
156 | SYDE2 | Synapse defective 1, Rho GTPase, homolog 2 (C. elegans) | 0.00E+00
157 | TBC1D13 | TBC1 domain family, member 13 | 7.07E-07
158 | TBC1D15 | TBC1 domain family, member 15 | 2.17E-05
159 | TBC1D2 | TBC1 domain family, member 2 | 7.33E-07
160 | TCF21 | Transcription factor 21 | 7.79E-14
161 | TFAP2A | Transcription factor AP-2 alpha (activating enhancer binding protein 2 alpha) | 8.40E-04
162 | TFDP1 | Transcription factor Dp-1 | 5.95E-14
163 | TGFBI | Transforming growth factor, beta-induced, 68kDa | 9.00E-07
164 | TGFBR3 | Transforming growth factor, beta receptor III | 1.36E-03
165 | THBS4 | Thrombospondin 4 | 2.54E-14
166 | TNFRSF1B | Tumor necrosis factor receptor superfamily, member 1B | 1.19E-06
167 | TNN | Tenascin N | 5.53E-14
168 | TOM1L1 | Target of myb1 (chicken)-like 1 | 1.23E-06
169 | TRHDE | Thyrotropin-releasing hormone degrading enzyme | 0.00E+00
170 | TRHR | Thyrotropin-releasing hormone receptor | 8.92E-13
171 | TRPV1 | Transient receptor potential cation channel, subfamily V, member 1 | 5.58E-05
172 | TSTA3 | Tissue specific transplantation antigen P35B | 1.24E-06
173 | TYR | Tyrosinase-like (pseudogene); tyrosinase (oculocutaneous albinism IA) | 2.79E-13
174 | UBR1 | Ubiquitin protein ligase E3 component n-recognin 1 | 0.00E+00
175 | UGGT1 | UDP-glucose ceramide glucosyltransferase-like 1 | 5.77E-15
176 | UGGT2 | UDP-glucose ceramide glucosyltransferase-like 2 | 2.11E-06
177 | USH1C | Usher syndrome 1C (autosomal recessive, severe) | 7.50E-04
178 | USHBP1 | Usher syndrome 1C binding protein 1 | 2.23E-06
179 | VPS16 | Vacuolar protein sorting 16 homolog A (S. cerevisiae) | 1.01E-04
180 | WWP2 | WW domain containing E3 ubiquitin protein ligase 2 | 9.10E-15
181 | ZBTB40 | Zinc finger and BTB domain containing 40 | 0.00E+00
182 | ZWILCH | Zwilch, kinetochore associated, homolog (Drosophila) | 0.00E+00

In total, 215 and 182 PSGs were identified for the Tibetan wild boar and Duroc pig, respectively, using the likelihood ratio test (LRT) based on the branch-site model (P < 0.05).
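
As a rough illustration of how an LRT P value of this kind can be obtained, the sketch below compares the log-likelihoods of a null and an alternative branch-site model. The function name, the example log-likelihoods and the use of a chi-square distribution with one degree of freedom are assumptions made for illustration; the source states only that an LRT based on the branch-site model was used with P < 0.05.

```python
import math


def lrt_p_value(lnl_null: float, lnl_alt: float) -> float:
    """P value of a likelihood ratio test, assuming a chi-square null with 1 d.f."""
    # Test statistic: twice the gain in log-likelihood under the alternative model.
    stat = 2.0 * (lnl_alt - lnl_null)
    if stat <= 0.0:
        return 1.0  # the alternative model fits no better than the null
    # For 1 d.f., the chi-square survival function equals erfc(sqrt(stat / 2)).
    return math.erfc(math.sqrt(stat / 2.0))


# Hypothetical log-likelihoods for one gene (not taken from the study).
p = lrt_p_value(lnl_null=-1000.0, lnl_alt=-998.0)
is_psg = p < 0.05  # threshold quoted in the table footnote
```

Under these assumptions a gene would be called a PSG whenever `p` falls below 0.05, which is the cut-off given in the footnote above.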

Supplementary Table 22. Functional gene categories enriched for the 215 PSGs in the Tibetan wild boar and 182 PSGs in the Duroc pig.

Functional category | Term ID | Term description | Involved gene number | P value

Tibetan wild boar
KEGG-pathway | hsa04270 | Vascular smooth muscle contraction | 16 | 9.66E-07
GO-BP | GO:0070482 | Response to oxygen levels | 15 | 1.85E-05
KEGG-pathway | hsa04150 | mTOR signaling pathway | 10 | 6.39E-05
GO-BP | GO:0001666 | Response to hypoxia | 13 | 3.40E-04
GO-MF | GO:0030554 | Adenyl nucleotide binding | 42 | 1.25E-03
GO-BP | GO:0032870 | Cellular response to hormone stimulus | 10 | 1.27E-03
GO-MF | GO:0032559 | Adenyl ribonucleotide binding | 41 | 1.28E-03
GO-BP | GO:0031331 | Positive regulation of cellular catabolic process | 6 | 1.40E-03
GO-BP | GO:0048514 | Blood vessel morphogenesis | 12 | 1.49E-03
GO-BP | GO:0031329 | Regulation of cellular catabolic process | 7 | 1.51E-03
GO-BP | GO:0001525 | Angiogenesis | 10 | 1.53E-03
GO-BP | GO:0009725 | Response to hormone stimulus | 19 | 1.75E-03
GO-BP | GO:0045761 | Regulation of adenylate cyclase activity | 8 | 2.48E-03
GO-BP | GO:0009894 | Regulation of catabolic process | 8 | 2.48E-03
GO-BP | GO:0051240 | Positive regulation of multicellular organismal process | 15 | 2.53E-03
GO-BP | GO:0030817 | Regulation of cAMP biosynthetic process | 8 | 3.15E-03
GO-BP | GO:0051339 | Regulation of lyase activity | 8 | 3.15E-03
KEGG-pathway | hsa04020 | Calcium signaling pathway | 12 | 3.38E-03
GO-BP | GO:0001568 | Blood vessel development | 12 | 3.65E-03
GO-BP | GO:0030808 | Regulation of nucleotide biosynthetic process | 10 | 5.35E-03
GO-BP | GO:0030802 | Regulation of cyclic nucleotide biosynthetic process | 10 | 5.35E-03
GO-BP | GO:0006140 | Regulation of nucleotide metabolic process | 10 | 5.95E-03
GO-BP | GO:0001944 | Vasculature development | 12 | 1.98E-02
GO-MF | GO:0032555 | Purine ribonucleotide binding | 44 | 2.09E-02
GO-MF | GO:0003684 | Damaged DNA binding | 4 | 2.42E-02
InterProScan | IPR001126 | DNA-repair protein, UmuC-like | 2 | 4.00E-02
GO-BP | GO:0045740 | Positive regulation of DNA replication | 3 | 4.28E-02
GO-BP | GO:0043085 | Positive regulation of catalytic activity | 18 | 4.70E-02
GO-BP | GO:0006468 | Protein amino acid phosphorylation | 21 | 4.80E-02

Duroc pig
GO-BP | GO:0022610 | Biological adhesion | 33 | 2.09E-07
GO-BP | GO:0007155 | Cell adhesion | 33 | 4.04E-07
KEGG-pathway | hsa04512 | ECM-receptor interaction | 11 | 2.17E-05
KEGG-pathway | hsa04510 | Focal adhesion | 16 | 2.53E-05
GO-BP | GO:0002021 | Response to dietary excess | 5 | 1.76E-04
GO-BP | GO:0022402 | Cell cycle process | 19 | 3.33E-04
GO-BP | GO:0010033 | Response to organic substance | 22 | 3.54E-04
GO-MF | GO:0008047 | Enzyme activator activity | 13 | 4.25E-04
GO-MF | GO:0005099 | Ras GTPase activator activity | 7 | 5.75E-04
InterProScan | IPR001609 | Myosin head, motor region | 5 | 5.89E-04
GO-BP | GO:0048285 | Organelle fission | 11 | 7.26E-04
GO-BP | GO:0010876 | Lipid localization | 9 | 9.24E-04
GO-MF | GO:0003779 | Actin binding | 12 | 1.20E-03
GO-BP | GO:0040008 | Regulation of growth | 13 | 1.46E-03
GO-MF | GO:0030695 | GTPase regulator activity | 13 | 2.14E-03
GO-BP | GO:0002274 | Myeloid leukocyte activation | 5 | 2.77E-03
GO-BP | GO:0032483 | Regulation of Rab protein signal transduction | 5 | 3.00E-03
GO-BP | GO:0050873 | Brown fat cell differentiation | 4 | 3.41E-03
GO-BP | GO:0030198 | Extracellular matrix organization | 10 | 3.85E-03
GO-BP | GO:0042493 | Response to drug | 9 | 6.63E-03
GO-BP | GO:0043567 | Regulation of insulin-like growth factor receptor signaling pathway | 3 | 6.84E-03
GO-BP | GO:0002263 | Cell activation during immune response | 4 | 1.08E-02
GO-BP | GO:0002366 | Leukocyte activation during immune response | 4 | 1.08E-02
GO-BP | GO:0006869 | Lipid transport | 7 | 1.08E-02
GO-MF | GO:0005096 | GTPase activator activity | 12 | 1.58E-02
GO-BP | GO:0045444 | Fat cell differentiation | 4 | 3.02E-02
GO-BP | GO:0007049 | Cell cycle | 24 | 3.71E-02
GO-BP | GO:0040014 | Regulation of multicellular organism growth | 7 | 3.92E-02

Supplementary Table 23. List of KA/KS (ω) for functional gene categories in Tibetan wild boar and Duroc pig. The mean ω in Tibetan wild boar and Duroc pig is provided for each GO-MF term, GO-BP term and KEGG pathway that is significantly enriched (P < 0.05, Benjamini-corrected modified Fisher’s exact test). Fold changes in mean ω between Tibetan wild boar and Duroc pig that are > 2 or < 0.5 are marked in bold. (see Excel file ‘Supplementary Table 23.xls’)
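
The fold-change rule described in this caption can be sketched as follows. This is a minimal illustration only: the function name and the per-gene ω values are hypothetical, and taking the ratio as Tibetan wild boar over Duroc pig is an assumption about the direction of the comparison.

```python
from statistics import mean


def flag_fold_change(omega_tibetan, omega_duroc, upper=2.0, lower=0.5):
    """Return (fold change of mean omega, True if it is > upper or < lower)."""
    # Assumed direction: mean omega in Tibetan wild boar divided by mean omega in Duroc pig.
    fold = mean(omega_tibetan) / mean(omega_duroc)
    return fold, (fold > upper or fold < lower)


# Hypothetical per-gene omega values for one functional category.
tibetan = [0.60, 0.80, 0.70]
duroc = [0.20, 0.30, 0.10]
fold, marked = flag_fold_change(tibetan, duroc)
```

A category whose mean ω in the Duroc pig is zero would need separate handling, since the ratio is undefined in that case.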

Supplementary Table 24. List of a priori functional candidate genes related to ‘response to hypoxia’, ‘response to UV’ and ‘energy metabolism’.

Response to hypoxia (122 genes)*
ABAT ATP1B1 CXCR4 ENG HSD11B2 L1CAM PDGFA PLOD1 SOCS5 UBQLN1
ACVR1B BCL2 CYB5R4 EP300 HSP90B1 LATS1 PDGFB PLOD2 SOD1 UCP3
ADM BIRC2 CYP17A1 EPAS1 IFNG LRRC3B PDGFRA PML SOD3 USF1
ADORA1 BNIP3 CYP1A2 EPHX2 IL10 MMP2 PDIA2 PSME2 TDO2 VAV3
ADORA2A C1QTNF7 CYP2E1 ERCC3 INSR NAGLU PDLIM1 PYGM TGFB1 XRCC1
ADORA2B CA9 CYP2F1 FANCA ITGA1 NARFL PGF RORC TGFB2
AGTR1 CAMK2D CYP2U1 FLT1 ITGA2 NPR1 PIK3C2A RPS6KA1 TGFB3
ALDH2 CAPN2 DDAH1 FRMD6 ITPR1 OR6Y1 PIK3C2B RYR1 TICAM1
ALG12 CENPM DISC1 GPR182 JAG2 OTX1 PIK3C2G RYR2 TMEM206
ANGPT1 CFTR DPP4 GUCY1A3 JAK2 OXTR PIK3CB SCNN1G TNF
APOE CHMP4B EGFR HBE1 KATNA1 P2RX3 PIK3R1 SHH TRH
ARG2 CHRNB2 EGLN1 HIF1A KCNA5 P2RX4 PIK3R2 SMAD4 TXN
ARNT CLDN3 EGLN2 HMOX2 KCNJ8 PDE5A PLAU SOCS3 TXN2

Response to UV (38 genes)†
AURKB BRCA2 CDKN2D ERCC5 IL12A MME POLD1 TIPIN USP28 XPC
BAK1 CASP9 EGFR ERCC6 IL12B MYC REV1 TP73 USP47 ZRANB3
BCL2 CAT ERCC3 FEN1 MC1R PIK3R1 RUVBL2 USF1 WRN
BCL3 CCND1 ERCC4 HUS1 MEN1 PML SPRTN USP1 XPA

Energy metabolism (151 genes)‡
ABCA7 APOA4 CHM FAIM2 GYS1 LEPR NHLH2 PPARG SERPINE1 TXNIP
ABCC8 APOA5 CPE FANCL HEXB LIPE NMUR2 PPARGC1A SFRP1 UBR1
ACACB APOC3 CPEB4 FASN HSD11B1 LMNA NPY PPARGC1B SLC2A2 UCP2
ACP1 APOE CPT1A FGF21 HSD11B2 LRPAP1 NPY1R PPP1R3A SLC6A1 UCP3
ACVR1C AQP7 CRH FOXA2 HTR1B MAGEL2 NPY2R PPY SLC6A14 VSX1
ADAMTS9 ARID5B CYB5R4 GAD2 IDE MAOA NPY5R PRKAA2 SLC6A3 WT1
ADRA1B ATP1B1 DBH GAMT IDH1 MC3R NR0B2 PRKAR1A SNRPN ZNF608
ADRA2A BBS2 DGAT1 GDF3 IFRD1 MC4R PCSK1 PROX1 SOAT2
ADRA2B BBS4 DHCR24 GHRHR IGF1 MC5R PCSK1N PTPN1 SOCS3
ADRB3 BBS7 DLK1 GHSR IL15 MED12 PGD PTTG1 SREBF1
AEBP1 BRS3 DPT GIPR IL6R MEN1 PHF6 RASGRF1 TBX3
AGPAT2 BSCL2 EIF4EBP1 GNPDA2 INSR MEST PIK3R1 RETN TGFB1
AGRP CBL ENPP1 GPAM IRS1 MKKS PLA2G1B RSC1A1 TMEM160
AMACR CCKAR EREG GPC4 KCNA3 MMP11 PLSCR1 RSPO3 TNF
ANGPTL6 CEBPA FABP1 GPD2 KEL MYC PMCH SCARB1 TNFRSF1B
APOA2 CEBPD FABP2 GSK3B LEP NCOA3 PNMT SDC3 TRPV1

* A total of 122 functional candidate genes related to ‘response to hypoxia’ were merged from the reports of Beall et al. (2010)12, Bigham et al. (2010)13, Simonson et al. (2010)14, Yi et al. (2010)15, Peng et al. (2011)16, Xu et al. (2011)17, Ji et al. (2012)18 and Scheinfeldt et al. (2012)19.

† A total of 38 functional candidate genes related to ‘response to UV’ were taken from the GO-Biological Process category ‘response to UV’ (GO:0009411), which represents any process that results in a change in state or activity of a cell or an organism (in terms of movement, secretion, enzyme production, gene expression, etc.) as a result of an ultraviolet radiation (UV light) stimulus.

‡ A total of 151 functional candidate genes related to ‘energy metabolism’ are merged from the reports of Rankinen et al. (2006)20, MacDougald et al. (2007)21, Heid et al. (2010)22, Speliotes et al. (2010)23 and Li et al. (2012)24, which are mainly involved in energy homeostasis, muscle growth and adipose deposition, as well as adipokines, myokines, neurokines and hormones in regulating food intake.

Only the functional candidate genes that are also included in the 7,917 single-copy orthologs shared among Tibetan wild boar, Duroc pig and human are listed.
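
The filtering step described above amounts to intersecting each candidate list with the set of single-copy orthologs. A minimal sketch, assuming gene symbols are used as identifiers; the function name and the toy ortholog set are hypothetical and do not reproduce the study's 7,917 orthologs:

```python
def filter_candidates(candidates, single_copy_orthologs):
    """Keep candidate genes present in the ortholog set, preserving input order."""
    ortholog_set = set(single_copy_orthologs)
    return [gene for gene in candidates if gene in ortholog_set]


# Toy inputs: membership in the ortholog set is invented for illustration.
candidates = ["HIF1A", "EPAS1", "EGLN1", "HYPOTHETICAL1"]
orthologs = {"HIF1A", "EPAS1", "EGLN1", "PGF"}
kept = filter_candidates(candidates, orthologs)
```

Here `kept` retains only the candidates that also appear in the ortholog set, in their original order.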

Supplementary Table 25. Functional candidate genes related to ‘response to hypoxia’ under positive selection in the Tibetan wild boar (21 PSGs) and Duroc pig (1 PSG).

Gene symbol | Gene name | ω (Tibetan) | P value (Tibetan) | ω (Duroc) | P value (Duroc)
ACVR1B | Activin A receptor, type IB | 0.385 | 0.00E+00 | 0.000 | 6.87E-01
ALDH2 | Aldehyde dehydrogenase 2 family (mitochondrial) | 0.627 | 1.42E-10 | 0.219 | 9.98E-01
APOE | Apolipoprotein E | 0.296 | 5.19E-07 | 0.216 | 9.99E-01
ARG2 | Arginase, type II | 0.593 | 3.51E-13 | 0.107 | 9.81E-01
ARNT | Aryl hydrocarbon receptor nuclear translocator | 0.852 | 0.00E+00 | 0.033 | 6.27E-01
BIRC2 | Baculoviral IAP repeat-containing 2 | 0.383 | 4.09E-13 | 0.326 | 9.83E-01
CA9 | Carbonic anhydrase IX | 0.685 | 5.26E-13 | 0.091 | 9.88E-01
DPP4 | Dipeptidyl-peptidase 4 | 0.093 | 7.47E-13 | 0.065 | 9.91E-01
EGLN2 | Egl nine homolog 2 | 0.537 | 8.74E-13 | 0.100 | 9.91E-01
GPR182 | G protein-coupled receptor 182 | 0.554 | 1.56E-12 | 0.218 | 9.93E-01
HIF1A | Hypoxia inducible factor 1, alpha subunit | 0.636 | 3.96E-12 | 0.313 | 9.94E-01
IFNG | Interferon, gamma | 0.768 | 4.86E-12 | 0.115 | 9.95E-01
PDGFRA | Platelet-derived growth factor receptor, alpha polypeptide | 0.422 | 0.00E+00 | 0.569 | 7.52E-02
PGF | Placental growth factor | 0.813 | 4.64E-08 | 0.778 | 1.06E-04
PIK3C2G | Phosphoinositide-3-kinase, class 2, gamma polypeptide | 1.006 | 1.20E-11 | 0.026 | 9.96E-01
PLAU | Plasminogen activator, urokinase | 0.612 | 3.33E-16 | 0.143 | 7.30E-01
PLOD2 | Procollagen-lysine, 2-oxoglutarate 5-dioxygenase 2 | 0.703 | 0.00E+00 | 0.085 | 5.95E-02
SHH | Sonic hedgehog homolog | 0.366 | 1.33E-11 | 0.047 | 9.96E-01
TDO2 | Tryptophan 2,3-dioxygenase | 0.935 | 2.45E-11 | 0.029 | 9.97E-01
USF1 | Upstream transcription factor 1 | 1.105 | 2.61E-11 | 0.150 | 9.97E-01
XRCC1 | X-ray repair complementing defective repair in Chinese hamster cells 1 | 0.708 | 5.18E-10 | 0.260 | 9.98E-01

The ω ratio of non-synonymous to synonymous substitutions (i.e. KA/KS) was calculated with the PAML package25 for the Tibetan wild boar and Duroc pig, taking the human ortholog as an outgroup. The P value was determined using the likelihood ratio test (LRT) based on the branch-site model. P values less than 0.05 are shown in bold.
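
Because bold type does not survive in plain text, the P < 0.05 rule can be applied directly to the tabulated values. The sketch below uses four rows copied from Supplementary Table 25; the variable names are illustrative assumptions.

```python
# (gene symbol, P value in Tibetan wild boar, P value in Duroc pig),
# copied from four rows of Supplementary Table 25.
rows = [
    ("HIF1A", 3.96e-12, 9.94e-01),
    ("PDGFRA", 0.0, 7.52e-02),
    ("PGF", 4.64e-08, 1.06e-04),
    ("PLOD2", 0.0, 5.95e-02),
]

ALPHA = 0.05  # significance threshold stated in the footnote

# A gene counts as a PSG in a lineage when its LRT P value is below ALPHA.
tibetan_psgs = [gene for gene, p_tib, _ in rows if p_tib < ALPHA]
duroc_psgs = [gene for gene, _, p_dur in rows if p_dur < ALPHA]
```

Among these four rows, only PGF passes the threshold in the Duroc pig, consistent with the single Duroc PSG named in the table title.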

Supplementary Table 26. Functional candidate genes related to ‘response to UV’ under positive selection in the Tibetan wild boar (6 PSGs).

Gene symbol | Gene name | ω (Tibetan) | P value (Tibetan) | ω (Duroc) | P value (Duroc) | Functional description
BCL3 | B-cell CLL/lymphoma 3 | 0.584 | 4.65E-11 | 0.110 | 9.98E-01 | UV-induced BCL3 activation directly suppresses the activity of the epigenetic factor CTCF, which is a master keeper of global chromatin structure26,27.
ERCC4 | Excision repair cross-complementing rodent repair deficiency, complementation group 4 | 0.521 | 5.07E-07 | 0.000 | 9.99E-01 | ERCC4 is a specific endonuclease in DNA cross-linking repair; its hypomorphic mutations cause the UV-sensitive disorder xeroderma pigmentosum28,29.
ERCC6 | Excision repair cross-complementing rodent repair deficiency, complementation group 6 | 0.764 | 1.01E-12 | 0.149 | 9.93E-01 | ERCC6 is a DNA-binding protein that is important in transcription-coupled excision repair and is involved in the preferential repair of active genes30.
REV1 | REV1 homolog | 1.104 | 0.00E+00 | 0.150 | 5.00E-01 | REV1 is essential for the induction of mutations through replication processes that directly copy the damaged DNA template during DNA replication31,32.
USF1 | Upstream transcription factor 1 | 1.105 | 2.61E-11 | 0.150 | 9.97E-01 | UV-activated USF-1 can directly upregulate a variety of pigmentation genes implicated in protection from UV radiation33,34 (particularly MC1R, a major determinant of coat color variation in mammals35, including pig36).
ZRANB3 | Zinc finger, RAN-binding domain containing 3 | 0.870 | 0.00E+00 | 0.000 | 4.13E-01 | ZRANB3 maintains genomic stability by facilitating fork restart and limiting inappropriate recombination37,38.

The ω ratio of non-synonymous to synonymous substitutions (i.e. KA/KS) was calculated with the PAML package25 for the Tibetan wild boar and Duroc pig, taking the human ortholog as an outgroup. The P value was determined using the likelihood ratio test (LRT) based on the branch-site model. P values less than 0.05 are shown in bold.

Supplementary Table 27. Functional candidate genes related to ‘energy metabolism’ under positive selection in the Tibetan wild boar (17 PSGs) and Duroc pig (21 PSGs).

Gene ω P value ω P value Gene name Functional description symbol (Tibetan) (Tibetan) (Duroc) (Duroc) ACVR1C (also known as ALK7) is a type I receptor for the TGFB family of signaling molecules. Growth/differentiation ACVR1C Activin A receptor, type IC 0.221 6.83E-01 0.627 0.00E+00 factor 3 regulates adipose-tissue homeostasis and energy balance under nutrient overload in part by signaling through the ALK7 receptor39. ADRB3 is a member of the adrenergic receptor group of ADRB3 Adrenergic, beta-3-, receptor 0.091 1.00E+00 0.361 2.83E-04 G-protein-coupled receptors, which is involved in the regulation of lipolysis and thermogenesis40,41. AGPAT2 is a key intermediate in the biosynthesis of 1-acylglycerol-3-phosphate triacylglycerol and glycerophospholipids, which catalyzes AGPAT2 0.000 1.00E+00 0.133 2.48E-03 O-acyltransferase 2 the acylation of lysophosphatidic acid to form phosphatidic acid42,43. GDF3 is a member of the TGFβ superfamily, which regulates adipose-tissue homeostasis and energy balance GDF3 Growth differentiation factor 3 0.286 7.18E-01 0.534 6.21E-04 under nutrient overload in part by signaling through the ALK7 receptor.39,44 GHSR is a component of the ghrelin signaling pathway and Growth hormone secretagogue is involved in mediating the pleiotropic effects of ghrelin, GHSR 0.096 8.47E-01 0.408 7.05E-14 receptor which play a role in energy homeostasis and regulation of body weight 45,46 IL6R is a key mediator of inflammatory response, which is IL6R Interleukin 6 receptor 0.392 9.90E-010.799 5.32E-10 also involved in the modulation of metabolic traits and the etiology of metabolic syndrome 47,48 Kell blood group, KEL is a type II transmembrane glycoprotein that is the KEL 0.623 9.91E-01 1.175 9.19E-10 metallo-endopeptidase highly polymorphic Kell blood group antigen49. NMUR2 is a receptor for neuromedin U, which is widely NMUR2 Neuromedin U receptor 2 0.247 1.00E+00 0.377 2.18E-04 distributed in the gut and central nervous system and plays

64

Nature Genetics: doi:10.1038/ng.2811

an important role in the regulation of food intake and body weight50,51. PLSCR1 is a member of PLSCR gene family, which plays a central role in receptor signaling and transactivation and contributes to cytokine-regulated cell proliferation and PLSCR1 Phospholipid scramblase 1 0.170 9.69E-01 0.773 3.60E-13 differentiation, and appears to influence the lipid accumulation and the risk for acquiring the metabolic syndrome52. PPARGC1A is a transcriptional coactivator which interacts Peroxisome PPARGC1 with PPARγ and regulates muscle fiber type determination, proliferator-activated receptor 0.001 6.24E-01 0.636 1.16E-05 A cellular cholesterol homoeostasis and the development of gamma, coactivator 1 alpha obesity53,54. SCARB1 is a plasma membrane receptor for high density lipoprotein cholesterol (HDL), which is involved in the Scavenger receptor class B, SCARB1 0.219 9.95E-01 0.537 6.37E-08 regulation of plasma HDL levels through reverse member 1 cholesterol transport, cardioprotection, steroidogenesis, and reproduction55,56. SLC2A2 is an integral plasma membrane glycoprotein Solute carrier family 2, member SLC2A2 0.702 6.04E-01 1.270 2.27E-04 which mediates facilitated bidirectional glucose transport 2 and influences serum HDL57. SLC6A14 is a member of the solute carrier family 6 which potentially regulates tryptophan availability for serotonin Solute carrier family 6, member SLC6A14 0.000 9.95E-01 0.867 1.08E-07 synthesis and thus possibly affects appetite control. 14 Mutations in this gene may be associated with X-linked obesity58,59. SLC6A3 is a dopamine transporter. 
The polymorphisms involving a variable number of tandem repeats in the 3' Solute carrier family 6, member SLC6A3 0.215 4.28E-01 0.272 0.00E+00 UTR of SLC6A3 are associated with idiopathic epilepsy, 3 dependence on alcohol and cocaine, and obesity in smokers60,61, TNFRSF1B is a member of the TNF-receptor superfamily, TNFRSF1 Tumor necrosis factor receptor 0.415 9.96E-01 0.478 1.19E-06 which is associated with obesity-induced peripheral B superfamily, member 1B neuropathy, hypertension and inflammation, and has been

Nature Genetics: doi:10.1038/ng.2811

termed as a major contributing factor of type 2 diabetes62,63.
TRPV1 | Transient receptor potential cation channel, subfamily V, member 1 | 0.130 | 9.97E-01 | 0.434 | 5.58E-05 | TRPV1 is an ion channel which is highly expressed on sensory nerve fibers innervating the pancreas and involved in the regulation of energy and fat metabolism64-66.
UBR1 | Ubiquitin protein ligase E3 component n-recognin 1 | 0.767 | 4.71E-01 | 0.686 | 0.00E+00 | UBR1 is a component of the N-end rule pathway. UBR1-induced degradation of the low-density lipoprotein (LDL) receptor is essential for clearing circulating LDL67,68.
ADAMTS9 | ADAM metallopeptidase with thrombospondin type 1 motif, 9 | 0.298 | 5.46E-14 | 0.400 | 9.72E-01 | ADAMTS9, an endogenous angiogenesis inhibitor, controls organ shape during development69,70.
ADRA1B | Adrenergic, alpha-1B-, receptor | 0.279 | 9.14E-14 | 0.000 | 9.74E-01 | ADRA1B, an α-adrenergic receptor, is required for normal postnatal growth of cardiac myocytes71.
AEBP1 | AE binding protein 1 | 0.365 | 9.87E-14 | 0.257 | 9.75E-01 | AEBP1, a transcriptional repressor, positively regulates the enhancement of adipocyte proliferation and reduction of adipocyte differentiation72.
APOE | Apolipoprotein E | 0.296 | 5.19E-07 | 0.216 | 9.99E-01 | APOE, a transport apolipoprotein, is essential for lipoprotein metabolism and cardiovascular disease73,74.
BBS7 | Bardet-Biedl syndrome 7 | 0.773 | 0.00E+00 | 0.000 | 7.21E-01 | BBS7 is a member of the BBSome complex which is required for ciliogenesis. Mutations in this gene are associated with Bardet-Biedl syndrome75, which is characterized principally by obesity, retinitis pigmentosa, polydactyly, and hypogonadism76,77.
CBL | Cas-Br-M ecotropic retroviral transforming sequence | 0.288 | 5.28E-13 | 0.987 | 9.88E-01 | CBL accepts ubiquitin from specific E2 ubiquitin conjugating enzymes, and transfers it to substrates, which regulate various cellular signaling events, including the insulin/insulin-like growth factor 1 and epidermal growth factor pathways78-80.
CPEB4 | Cytoplasmic polyadenylation element binding protein 4 | 0.688 | 0.00E+00 | 0.000 | 6.40E-01 | CPEB4 is a sequence-specific RNA-binding protein that promotes polyadenylation-induced translation in oocytes and neurons81 and is related to the modulation of body fat distribution22.
DGAT1 | Diacylglycerol O-acyltransferase homolog 1 | 1.381 | 6.61E-08 | 0.253 | 9.99E-01 | DGAT1 catalyzes the linkage of a sn-1,2-diacylglycerol with a fatty acyl CoA to form a triglyceride molecule82. Mice lacking DGAT1 have increased energy expenditure and insulin sensitivity and are protected against diet-induced obesity and glucose intolerance83.
EREG | Epiregulin | 0.688 | 3.13E-09 | 0.096 | 6.77E-01 | EREG is a member of the epidermal growth factor family, which is related to weight loss with dextran sulfate sodium exposure84.
FABP2 | Fatty acid binding protein 2, intestinal | 1.367 | 4.19E-08 | 0.075 | 9.99E-01 | FABP2 is a lipid sensor in triglyceride-rich lipoprotein synthesis that maintains energy homeostasis85,86.
GHRHR | Growth hormone releasing hormone receptor | 0.636 | 1.36E-12 | 0.195 | 9.93E-01 | GHRHR is a receptor for growth hormone-releasing hormone, which stimulates somatotroph cell growth, synthesis and release of growth hormone87,88.
GPD2 | Glycerol-3-phosphate dehydrogenase 2 | 0.632 | 0.00E+00 | 0.542 | 4.34E-01 | GPD2 catalyzes conversion of glycerol-3-phosphate to dihydroxyacetone phosphate, and is a very important enzyme of the integration of glycolysis, oxidative phosphorylation and fatty acid metabolism89.
IDH1 | Isocitrate dehydrogenase 1 (NADP+), soluble | 0.916 | 6.66E-16 | 0.000 | 8.33E-01 | IDH1 catalyzes the oxidative decarboxylation of isocitrate to 2-oxoglutarate. The presence of IDH1 in peroxisomes suggests roles in the regeneration of NADPH for intraperoxisomal reductions90,91.
IGF1 | Insulin-like growth factor 1 | 0.671 | 0.00E+00 | 0.385 | 6.86E-01 | IGF1, a hormone similar to insulin, has been recognized as a major determinant of body size in mammals92,93.
KCNA3 | Potassium voltage-gated channel, shaker-related subfamily, member 3 | 0.430 | 6.61E-12 | 0.162 | 9.95E-01 | KCNA3 (also known as Kv1.3) is a subunit of a heteromeric potassium channel and considered a therapeutic target for the treatment of obesity and for enhancing peripheral insulin sensitivity in patients with type-2 diabetes mellitus94,95.
LEPR | Leptin receptor | 1.177 | 2.68E-07 | 0.290 | 9.99E-01 | LEPR, a major receptor for the well-known adipocyte-specific hormone leptin96,97.
MMP11 | Matrix metallopeptidase 11 | 0.449 | 7.94E-12 | 0.250 | 9.96E-01 | MMP11 (also known as stromelysin 3) is a member of the matrix metalloproteinase family, which negatively regulates adipogenesis by reducing pre-adipocyte differentiation and reversing mature adipocyte differentiation65,66.
NPY1R | Neuropeptide Y receptor Y1 | 0.000 | 3.31E-06 | 0.000 | 1.00E+00 | NPY1R is one of the most abundant neuropeptides in the mammalian nervous system and is associated with effects on food intake and regulation of central endocrine secretion98,99.
PMCH | Pro-melanin-concentrating hormone | 0.406 | 7.07E-11 | 0.494 | 9.98E-01 | PMCH is a cyclic neuropeptide that plays an important role in energy homeostasis and a number of neuronal functions such as food intake100,101.
PRKAA2 | Protein kinase, AMP-activated, alpha 2 catalytic subunit | 0.204 | 2.63E-06 | 0.074 | 1.00E+00 | PRKAA2, a monitor of cellular energy status, is necessary for maintaining myocardial energy homeostasis during ischemia102,103.
PTPN1 | Protein tyrosine phosphatase, non-receptor type 1 | 0.687 | 7.56E-10 | 0.117 | 9.98E-01 | PTPN1 is a negative regulator of insulin and leptin signaling that modulates glucose homeostasis and energy expenditure104,105.

The ω ratio of non-synonymous to synonymous substitutions (i.e. KA/KS) was calculated by the PAML package25 for the Tibetan and Duroc pigs, taking the human ortholog as an outgroup. The P value was determined using the likelihood ratio test (LRT) based on the branch-site model. The P values less than 0.05 are shown in bold.
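The likelihood ratio test mentioned in the table note can be illustrated with a minimal sketch. It assumes a chi-square null distribution with one degree of freedom, which is a common convention for comparing nested codon models but is not stated in the note; the function name and the log-likelihood values are hypothetical.

```python
import math


def lrt_p_value(lnl_null: float, lnl_alt: float) -> float:
    """P value of a likelihood ratio test, assuming chi-square with 1 d.f."""
    # Test statistic: twice the gain in log-likelihood of the alternative model.
    stat = 2.0 * (lnl_alt - lnl_null)
    if stat <= 0.0:
        return 1.0
    # For 1 d.f., P(X > s) = erfc(sqrt(s / 2)).
    return math.erfc(math.sqrt(stat / 2.0))


# Hypothetical log-likelihoods for a null and an alternative model.
p = lrt_p_value(lnl_null=-1000.0, lnl_alt=-998.0795)
print(f"2*dlnL = {2 * (1000.0 - 998.0795):.3f}, P = {p:.4f}")
```

A statistic near 3.84 corresponds to P close to 0.05 under this assumption, which is the threshold used for bold type in the table.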

Supplementary Table 28. Tibetan wild boar pseudogenes. A total of 188 pseudogenes containing 137 frameshift and 60 premature termination events were identified in the Tibetan wild boar genome based on the use of in silico filters and further manual examination. (see Excel file “Supplementary Table 28.xls”)


Supplementary Table 29. Functional gene categories enriched for Tibetan wild boar pseudogenes.

Functional category | Term ID | Term description | Involved gene number | P value | Gene symbol
GO-BP | GO:0042493 | Response to drug | 6 | 0.013 | CAV2, BCHE, LCK, SMPD1, DDIT3, HTR2A
GO-MF | GO:0042169 | SH2 domain binding | 3 | 0.027 | SQSTM1, LCK, CRK
GO-MF | GO:0019900 | Kinase binding | 5 | 0.042 | CAV2, SQSTM1, LCK, AXIN2, RPS3
GO-BP | GO:0008219 | Cell death | 11 | 0.045 | TMEM85, SQSTM1, ARHGEF18, LCK, RYBP, CGB7, AXIN2, BCL2L12, C3ORF38, RPS3, HTR2A
GO-BP | GO:0016265 | Death | 11 | 0.047 | TMEM85, SQSTM1, ARHGEF18, LCK, RYBP, CGB7, AXIN2, BCL2L12, C3ORF38, RPS3, HTR2A


Supplementary Table 30. Drug response genes that appear inactive in the Tibetan wild boar genome.

Gene symbol | Gene name | Inactivation event | ω0 (average) | ω1 (other) | ω2 (Tibetan) | Functional description | Related disease
BCHE | Butyrylcholinesterase | Frameshift | 0.208 | 0.208 | 1.048 | BCHE encodes a non-specific cholinesterase enzyme that hydrolyses many different choline esters106-108. | Delayed metabolism of succinylcholine, mivacurium, procaine, and cocaine / Postanesthetic apnea / Organophosphate toxicity / Alzheimer's disease drug hypersensitivity / Post succinylcholine apnea / Dementia
CAV2 | Caveolin 2 | Premature stop codon | 0.405 | 0.374 | ∞ | CAV2 is a major component of the inner surface of caveolae, small invaginations of the plasma membrane, and is involved in essential cellular functions, including signal transduction, lipid metabolism, cellular growth control and apoptosis109,110. | Disturbance of cholesterol binding drug / Prostate cancer / Breast cancer / Pulmonary dysfunction / Esophageal and bladder carcinomas
DDIT3 | DNA damage inducible transcript 3 | Frameshift | 0.125 | 0.125 | 1.394 | DDIT3 is a member of the C/EBP family of transcription factors, which are implicated in adipogenesis and erythropoiesis, and is activated by endoplasmic reticulum stress and promotes apoptosis111,112. | Myxoid liposarcoma / Ewing sarcoma / Myeloid leukemia
HTR2A | 5 hydroxytryptamine (serotonin) receptor 2A | Frameshift | 0.181 | 0.139 | 1.791 | HTR2A encodes one of the receptors for 5-hydroxytryptamine (serotonin), a biogenic hormone that functions as a neurotransmitter, a hormone, and a mitogen113,114. | Dependence of alcohol, nicotine, heroin and cotinine / Schizophrenia / Anorexia nervosa / Obsessive compulsive disorder / Citalopram induced depressive disorder / Seasonal affective disorder / Weight gain, antipsychotic drug induced / Depression drug hypersensitivity / Antidepressant medication intolerance
LCK | Lymphocyte specific protein tyrosine kinase | Frameshift | 0.032 | 0.031 | 0.137 | LCK is a member of the Src family of protein tyrosine kinases which play an important role in the selection and maturation of developing T-cells115,116. | Severe combined immunodeficiency / Type 1 diabetes / Alzheimer's disease
SMPD1 | Sphingomyelin phosphodiesterase 1, acid lysosomal | Frameshift | 0.084 | 0.082 | ∞ | SMPD1 encodes a lysosomal acid sphingomyelinase that converts sphingomyelin to ceramide117,118. | Niemann-Pick disease type A and B (also known as acid sphingomyelinase deficiency)

Note: ‘∞’ indicates that no synonymous mutation has been identified in this gene. The nonsynonymous to synonymous substitution ratio (KA/KS, i.e. ω) was estimated for Duroc pig, human and Tibetan wild boar sequences using the Codeml program with the free-ratio model as implemented in the PAML package25. ω0 is the average ratio in all branches, ω1 is the average ratio in human and Duroc pig branches, and ω2 is the ratio in the Tibetan wild boar branch.


Supplementary Table 31. Summary and mapping statistics of sampled pig populations/breeds.

Latitude, PE Raw Coverage Coverage Population/ High-quality Mapping Depth Pig Location longitude, average Individual length base at least 1 at least 4 Breed rate (%) rate (%) (×) altitude (m) (bp) (Gb) × (%) × (%) 1 101 12.18 98.47 91.76 4.41 94.4 60.1 Ganzi Tibetan 2 100 12.18 99.8 91.14 4.41 95.63 61.04 30.05ºN, 100.30ºE, Ganzi , 3 100 10.66 99.8 91.75 3.88 93.91 52.06 3,774m Sichuan province, China 4 100 14.32 99.81 91.82 5.22 96.26 71.49 5 100 14.26 99.77 91.51 5.18 96.43 70.91 1 100 16.08 99.79 91.98 5.79 96.43 75.54 Diqing Tibetan 2 101 12.3 99.03 91.21 4.45 95.59 61.59 27.82ºN, 99.70ºE, Diqing autonomous prefecture, 3 101 11.99 98.27 91.33 4.31 95.00 58.96 3,281m Yunnan province, China 4 100 17.66 99.75 92.85 6.54 96.83 80.30 5 101 11.74 99.20 92.57 4.35 94.73 58.41 1 100 9.93 98.50 91.91 3.24 89.76 39.89 Tibetan Nyingchi prefecture, 2 100 19.08 99.81 91.79 6.96 97.01 83.14 29.65ºN, 93.98ºE, wild boar Nyingchi Tibetan autonomous 3 100 13.43 99.81 91.68 4.86 94.74 64.20 3,526m (female) region, China 4 100 12.18 99.78 92.09 4.41 93.98 58.65 5 100 17.91 99.76 92.63 6.56 96.04 78.00 1 100 14.74 99.75 92.07 5.36 94.31 67.09 Shigatse prefecture, 2 100 11.51 99.77 91.69 4.20 92.47 54.85 29.27ºN, 89.60ºE, Shigatse Tibetan autonomous 3 100 15.09 99.76 91.74 5.41 94.70 67.73 4,023m region, China 4 100 12.44 99.72 92.50 4.58 94.36 61.02 5 100 14.90 99.75 92.46 5.45 95.15 68.73 1 100 15.60 99.76 92.32 5.72 95.78 72.42 Gannan Tibetan 2 100 12.07 99.75 92.85 4.42 92.85 58.63 34.98ºN, 102.91ºE, Gannan autonomous prefecture, 3 100 12.98 99.70 91.86 4.68 93.18 59.88 2,881m Gansu province, China 4 101 12.70 98.30 91.21 4.58 95.13 63.14 5 101 11.81 98.89 91.19 4.26 93.66 57.75


1 100 11.50 99.73 92.28 4.19 93.57 56.90 A'ba Tibetan autonomous 2 100 18.63 99.75 92.86 6.84 96.47 81.10 31.54ºN,102.96ºE, A'ba prefecture, Sichuan 3 100 14.49 99.74 92.16 5.29 95.15 69.36 3,441m province, China 4 100 18.58 99.69 92.48 6.38 95.45 76.79 5 100 15.14 99.65 92.26 5.36 94.43 68.25 1 101 12.05 98.17 93.27 4.42 94.15 60.24 city, Sichuan 30.65ºN, 105.81ºE, Penzhou 2 101 12.02 98.46 93.29 4.41 92.32 57.68 province, China 515m 3 100 14.10 99.74 91.33 5.08 95.75 68.91 Liangshan Yi autonomous 1 100 15.94 99.65 90.73 5.60 95.37 72.13 27.88ºN, 103.55ºE, Wujin prefecture, Sichuan 2 100 14.27 99.66 92.88 5.12 93.9 67.00 541m province, China 3 100 12.11 99.23 92.59 4.38 93.94 59.24 Chinese 1 100 12.15 99.71 91.6 4.37 93.92 58.27 domestic Chengdu city, Sichuan 30.65ºN, 103.46ºE, Ya'nan 2 101 11.18 99.16 91.39 4.11 94.15 56.39 pig province, China 504m 3 101 13.30 98.36 92.99 4.92 94.93 66.92 (female) 1 100 15.80 99.56 91.58 5.09 94.22 66.25 city, Sichuan 30.65ºN, 105.06ºE, Neijiang 2 100 17.31 99.79 91.25 6.02 94.89 71.22 province, China 335m 3 101 11.52 99.11 92.41 4.25 92.92 56.50 1 101 11.68 99.37 93.31 4.39 94.64 60.62 Jinhua city, Zhejiang 30.27ºN, 119.65ºE, Jinhua 2 100 12.42 99.8 93.33 4.62 93.77 60.56 province, China 42m 3 100 10.62 99.85 92.60 3.90 93.34 51.01 Wild 1 100 12.13 98.98 88.69 4.17 93.84 56.12 29.56ºN, 109.87ºE, boar Wild boar Southwest China 2 100 16.36 99.64 91.58 5.78 96.38 76.69 368m (female) 3 100 16.35 99.62 90.88 5.70 96.22 74.54


Supplementary Table 32. Summary and mapping statistics of the downloaded pig genome re-sequencing data.

Coverage Coverage High-quality Mapping Accession Breed Pig name Land of origin Individual Depth (×) at least 1 at least 4 base (Gb)* rate (%) No. × (%) × (%) 1 21.01 97.41 5.95 81.79 68.52 ERS177302 Denmark, North 2 22.69 97.95 6.96 81.34 69.86 ERS177303 Duroc American 3 11.74 97.96 4.56 80.21 59.00 ERS177304 4 14.76 98.04 5.77 80.68 64.31 ERS177305 England, North 1 22.51 98.00 6.77 81.88 71.31 ERS177306 Hampshire American 2 19.72 97.54 6.09 81.42 66.08 ERS177307 Jiangsu province, Jiangquhai 1 20.50 98.14 8.09 81.34 71.05 ERS177311 China 1 18.34 98.21 7.21 81.24 69.40 ERS177312 2 27.01 97.59 7.99 82.12 74.29 ERS177313 Domestic Landrace Denmark 3 17.56 97.56 5.32 80.99 63.21 ERS177314 pig 4 14.48 98.07 5.64 81.12 66.54 ERS177315 5 14.87 98.03 5.86 81.25 68.16 ERS177316 1 10.89 97.20 4.33 77.25 51.83 ERS177317 2 19.98 98.04 7.55 82.29 74.48 ERS177318 3 19.98 98.09 7.57 82.19 74.33 ERS177319 4 19.96 98.13 7.68 82.28 74.52 ERS177320 Large White England 5 18.47 97.90 7.06 82.15 73.42 ERS177321 6 22.72 97.90 6.58 81.65 70.65 ERS177322 7 18.57 98.15 7.20 81.58 68.93 ERS177323 8 18.99 97.64 4.66 79.13 57.90 ERS177324 9 19.44 98.02 7.55 82.33 74.59 ERS177325


10 16.65 98.05 6.04 81.54 69.70 ERS177326 11 17.38 98.11 6.15 81.56 69.96 ERS177327 12 18.52 98.21 6.72 81.64 71.43 ERS177328 13 13.59 98.10 4.92 80.77 63.14 ERS177329 14 17.02 98.08 6.20 81.62 70.31 ERS177330 1 18.03 97.98 6.85 82.01 72.78 ERS177331 Jiangsu province, 2 17.92 98.09 6.74 81.76 70.73 ERS177332 China 3 17.17 97.11 6.07 80.56 65.81 ERS177333 4 19.76 98.12 7.79 81.24 70.06 ERS177334 1 20.68 97.98 4.95 81.05 64.29 ERS177336 2 20.91 97.93 8.2 81.84 73.33 ERS177337 Pietrain Belgium 3 16.45 96.71 6.22 79.83 62.04 ERS177338 4 10.88 96.51 4.28 76.35 49.97 ERS177339 5 21.44 97.78 4.92 80.33 60.87 ERS177340 Guangxi province, 1 17.66 98.23 6.41 81.27 70.02 ERS177355 Xiang China 2 17.37 98.04 6.26 81.28 69.64 ERS177356 France France 1 18.54 97.94 7.32 81.28 70.39 ERS177349 Japan Japan 1 21.55 97.91 8.44 81.19 71.03 ERS177344 Meinweg, the Meinweg, the 1 10.56 96.90 4.17 76.87 50.82 ERS177347 Netherlands Netherlands 2 15.70 97.89 6.08 81.28 68.48 ERS177348 1 9.31 96.15 3.64 72.50 41.06 ERS177353 Wild North China North China boar 2 19.29 97.55 7.55 81.24 70.16 ERS177354 1 9.83 97.07 3.91 75.04 46.47 ERS177351 South China South China 2 19.83 98.13 7.78 81.57 72.04 ERS177352 1 21.56 98.02 8.33 80.82 70.55 ERS177308 Sumatran Sumatra, Indonesia 2 20.98 98.22 8.30 80.70 69.69 ERS177310


Switzerland Switzerland 1 28.39 97.53 6.29 81.73 70.51 ERS177350 Veluwe, the Veluwe, the 1 18.18 97.88 7.15 81.59 71.46 ERS177345 Netherlands Netherlands 2 22.56 97.63 7.33 81.97 72.58 ERS177346 African Phacochoerus Tanzania 1 23.13 97.91 8.45 78.09 66.44 ERS177335 warthog africanus Sus barbatus Sumatra, Indonesia 1 12.73 97.53 4.93 77.56 55.92 ERS177309 Sus cebifrons Philippines 1 19.05 96.67 7.42 80.43 70.52 ERS177341 Genus Sus Sulawesi, Indonesia 1 46.06 97.88 17.88 82.37 77.39 ERS177342 Sus celebensis Sus Java, Indonesia 1 24.04 97.74 9.5 80.92 71.84 ERS177343 verrucosus * The criteria used for sequence read filtering are slightly different between our sequenced data (see ‘1.2 Sequence quality checking and filtering’) and the downloaded genome data (phred quality ≤ 20)7-9.


Supplementary Table 33. Summary of SNP calling on a population-scale.

Category | Tibetan wild boar | Domestic pig | Wild boar, genus Sus and warthog | Total
Sample size | n = 30 | n = 52 | n = 21 | n = 103
Number of total SNPs | 8,390,501 | 9,173,377 | 7,780,578 | 14,637,670
Number of shared SNPs | 3,020,386 (shared across all groups)

Supplementary Table 34. Tracy-Widom (TW) statistics for the first ten eigenvalues from PCA analysis of pig breeds.

Number | Eigenvalue | TW | P value
1 | 28.318 | 34.685 | 4.18 × 10^-61
2 | 14.368 | 48.295 | 3.58 × 10^-99
3 | 5.626 | 17.219 | 1.42 × 10^-22
4 | 5.514 | 21.185 | 3.86 × 10^-30
5 | 4.239 | 8.921 | 1.58 × 10^-9
6 | 4.076 | 9.063 | 1.02 × 10^-9
7 | 3.992 | 10.426 | 1.41 × 10^-11
8 | 3.858 | 11.107 | 1.48 × 10^-12
9 | 3.475 | 6.935 | 1.62 × 10^-7
10 | 3.182 | 3.305 | 9.37 × 10^-4


Supplementary Table 35. Summary of SNPs in Tibetan wild boars and Chinese domestic pigs.

Category | Tibetan wild boar | Chinese domestic pig | Total
Sample size | n = 30 | n = 15 | n = 45
Number of total SNP | 8,390,501 | 6,011,186 | 9,492,123
Number of shared SNP | 4,909,564 (shared between both groups)
Upstream | 55,163 | 38,265 | 62,906
Exonic: Nonsynonymous | 18,326 | 12,515 | 21,062
Exonic: Synonymous | 27,142 | 17,223 | 30,804
Exonic: Nonsyn/Syn ratio (ω) | 0.67 | 0.73 | 0.68
Exonic: Stop gain | 332 | 217 | 389
Exonic: Stop loss | 91 | 67 | 99
Exonic: Unknown | 3,879 | 2,883 | 4,584
Intronic | 2,232,946 | 1,577,151 | 2,519,351
Splicing | 160 | 108 | 182
Downstream | 55,794 | 39,246 | 63,798
Upstream/Downstream | 607 | 437 | 725
Intergenic | 5,996,061 | 4,323,074 | 6,788,223

The package ANNOVAR119 was used to identify whether SNPs cause protein coding changes and the amino acids that are affected. ‘Upstream’ refers to a variant that overlaps with the 1 kb region upstream of the gene start site. ‘Stop gain’ means that a nonsynonymous SNP leads to the creation of a stop codon at the variant site. ‘Stop loss’ means that a nonsynonymous SNP leads to the elimination of a stop codon at the variant site. ‘Unknown’ means unknown function (due to various errors in the gene structure definition in the database file). ‘Splicing’ means that a variant is within 2 bp of a splice junction. ‘Downstream’ means that a variant overlaps with the 1 kb region downstream of the gene end site. ‘Upstream/Downstream’ means that a variant is located in downstream and upstream regions (possibly for two different genes).
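The Nonsyn/Syn ratio row of Supplementary Table 35 can be checked directly from the nonsynonymous and synonymous SNP counts in the same table. The short sketch below does only that arithmetic; the variable and function names are illustrative, and the ratio here is a simple count ratio rather than a codon-model estimate.

```python
# (nonsynonymous, synonymous) SNP counts taken from Supplementary Table 35.
counts = {
    "Tibetan wild boar": (18_326, 27_142),
    "Chinese domestic pig": (12_515, 17_223),
    "Total": (21_062, 30_804),
}


def nonsyn_syn_ratio(nonsyn: int, syn: int) -> float:
    """Ratio of nonsynonymous to synonymous SNP counts."""
    return nonsyn / syn


ratios = {name: nonsyn_syn_ratio(*pair) for name, pair in counts.items()}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.3f}")
```

The computed values agree with the tabulated 0.67, 0.73 and 0.68 to within 0.01.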


Supplementary Table 36. Functional gene categories enriched for genes affected by natural and artificial selection.

Functional category | Term ID | Term description | P value | Involved gene number

Tibetan wild boar
GO-BP | GO:0006281 | DNA repair | 9.11E-03 | 2
InterProScan | IPR007237 | CD20-like | 1.08E-02 | 2
InterProScan | IPR021072 | Melanoma associated antigen, MAGE, N-terminal | 1.25E-02 | 2
GO-MF | GO:0015276 | Ligand-gated ion channel activity | 1.27E-02 | 4
GO-MF | GO:0016779 | Nucleotidyltransferase activity | 1.39E-02 | 15
GO-MF | GO:0034061 | DNA polymerase activity | 1.48E-02 | 14
InterProScan | IPR000477 | Reverse transcriptase | 2.17E-02 | 13
InterProScan | IPR005135 | Endonuclease/exonuclease/phosphatase | 2.47E-02 | 7
GO-MF | GO:0005230 | Extracellular ligand-gated ion channel activity | 2.84E-02 | 3
GO-BP | GO:0006278 | RNA-dependent DNA replication | 2.87E-02 | 13
GO-MF | GO:0003964 | RNA-directed DNA polymerase activity | 2.87E-02 | 13
InterProScan | IPR000980 | SH2 domain | 2.90E-02 | 4
GO-MF | GO:0003723 | RNA binding | 2.98E-02 | 17
GO-BP | GO:0006259 | DNA metabolic process | 3.94E-02 | 16
InterProScan | IPR003036 | Core shell protein Gag P30 | 4.05E-02 | 2
GO-MF | GO:0003777 | Microtubule motor activity | 4.09E-02 | 3
GO-MF | GO:0070279 | Vitamin B6 binding | 4.56E-02 | 3
GO-BP | GO:0007017 | Microtubule-based process | 4.63E-02 | 4
GO-MF | GO:0003774 | Motor activity | 4.90E-02 | 4
InterProScan | IPR002190 | MAGE protein | 4.94E-02 | 2

Domestic pig
GO-MF | GO:0004888 | Transmembrane signaling receptor activity | 4.21E-04 | 36
GO-MF | GO:0005149 | Interleukin-1 receptor binding | 5.01E-04 | 2
InterProScan | IPR003502 | Interleukin-1 propeptide | 5.50E-04 | 2
InterProScan | IPR003294 | Interleukin-1, alpha/beta | 5.50E-04 | 2
InterProScan | IPR000048 | IQ calmodulin-binding region | 8.28E-04 | 7
GO-BP | GO:0050671 | Positive regulation of lymphocyte proliferation | 5.09E-03 | 5
GO-BP | GO:0070665 | Positive regulation of leukocyte proliferation | 5.43E-03 | 5
GO-BP | GO:0032946 | Positive regulation of mononuclear cell proliferation | 5.43E-03 | 5
InterProScan | IPR000975 | Interleukin-1 | 7.75E-03 | 2
GO-BP | GO:0050878 | Regulation of body fluid levels | 9.01E-03 | 7
GO-BP | GO:0009968 | Negative regulation of signal transduction | 9.70E-03 | 3
GO-BP | GO:0043407 | Negative regulation of MAP kinase activity | 1.04E-02 | 4
GO-MF | GO:0004984 | Olfactory receptor activity | 1.08E-02 | 22
GO-BP | GO:0007166 | Cell surface receptor signaling pathway | 1.09E-02 | 38
GO-MF | GO:0016772 | Transferase activity, transferring phosphorus-containing groups | 1.22E-02 | 40
GO-BP | GO:0007186 | G-protein coupled receptor signaling pathway | 1.26E-02 | 35
GO-MF | GO:0016503 | Pheromone receptor activity | 1.28E-02 | 2
InterProScan | IPR004072 | Vomeronasal receptor, type 1 | 1.40E-02 | 2
KEGG pathway | map04914 | Progesterone-mediated oocyte maturation | 1.42E-02 | 4
GO-BP | GO:0006720 | Isoprenoid metabolic process | 1.80E-02 | 4
GO-BP | GO:0046541 | Saliva secretion | 1.94E-02 | 2
GO-BP | GO:0006662 | Glycerol ether metabolic process | 2.00E-02 | 2
GO-BP | GO:0006955 | Immune response | 2.04E-02 | 6
KEGG pathway | hsa04730 | Long-term depression | 2.09E-02 | 5
InterProScan | IPR000725 | Olfactory receptor | 2.09E-02 | 22
GO-BP | GO:0050670 | Regulation of lymphocyte proliferation | 2.09E-02 | 5
GO-BP | GO:0070663 | Regulation of leukocyte proliferation | 2.18E-02 | 5
GO-BP | GO:0032944 | Regulation of mononuclear cell proliferation | 2.18E-02 | 5
GO-BP | GO:0008299 | Isoprenoid biosynthetic process | 2.60E-02 | 3
GO-BP | GO:0000188 | Inactivation of MAPK activity | 2.60E-02 | 3
GO-BP | GO:0042102 | Positive regulation of T cell proliferation | 2.69E-02 | 3
InterProScan | IPR017452 | GPCR, rhodopsin-like superfamily | 3.31E-02 | 27
GO-BP | GO:0006954 | Inflammatory response | 3.32E-02 | 2
GO-BP | GO:0043405 | Regulation of MAP kinase activity | 3.33E-02 | 6
GO-BP | GO:0051251 | Positive regulation of lymphocyte activation | 3.45E-02 | 5
GO-BP | GO:0050777 | Negative regulation of immune response | 3.94E-02 | 3
InterProScan | IPR006201 | Neurotransmitter-gated ion-channel | 4.50E-02 | 3
GO-BP | GO:0002696 | Positive regulation of leukocyte activation | 4.54E-02 | 5


Supplementary Note 1 De novo sequencing, assembly and annotation of Tibetan wild boar genome

1.1 Sequencing strategy and data generation

We used a whole genome shotgun strategy and next-generation sequencing technologies on the Illumina HiSeq 2000 platform to sequence the genome of the Tibetan wild boar. DNA was extracted from a female Tibetan wild boar from Daocheng County (~3,750 m altitude) in the Tibetan plateau of China. All the animals and samples used in this study were collected according to the guidelines for the care and use of experimental animals established by the Ministry of Agriculture of China. Short-insert (180 bp and 500 bp) and long-insert (2 kb, 5 kb and 10 kb) DNA libraries were constructed according to the manufacturer’s specifications (Illumina), and read lengths were 101 bp, 75 bp and 51 bp (Supplementary Table 1). In total, we generated ~319.3 Gb of sequence.

1.2 Sequence quality checking and filtering

To avoid reads with artificial bias (i.e. low quality paired reads, which mainly result from base-calling duplicates and adapter contamination), we removed the following types of reads: (a) reads with ≥ 10% unidentified nucleotides (N); (b) reads with > 10 nt aligned to the adapter, allowing ≤ 10% mismatches; (c) reads with > 50% of bases having phred quality < 5; and (d) putative PCR duplicates generated by PCR amplification in the library construction process (i.e. read 1 and read 2 of two paired-end reads that were completely identical). Consequently, 278.2 Gb (114.5× coverage) was retained for assembly, of which 95% and 90% of the bases had quality ≥ Q20 and ≥ Q30, respectively (Supplementary Table 1).
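Criteria (a), (c) and (d) above can be sketched as a simple filter over read pairs. This is an illustration under stated assumptions, not the pipeline used in the study: the data layout, the function names, and the choice to discard a pair when either read fails are assumptions, and the adapter check (b) is omitted because it requires an alignment step. The last line reproduces the coverage figure, assuming the 2.43 Gb ungapped assembly length as the denominator.

```python
from typing import List, Sequence, Tuple

Read = Tuple[str, Sequence[int]]  # (bases, per-base phred scores)
ReadPair = Tuple[Read, Read]


def fails_quality(read: Read) -> bool:
    """True if the read has >= 10% N bases or > 50% of bases below Q5."""
    seq, quals = read
    if not seq or len(seq) != len(quals):
        return True  # malformed records are dropped (an assumption)
    n_fraction = seq.upper().count("N") / len(seq)
    low_fraction = sum(q < 5 for q in quals) / len(quals)
    return n_fraction >= 0.10 or low_fraction > 0.50


def filter_pairs(pairs: Sequence[ReadPair]) -> List[ReadPair]:
    """Drop pairs failing criteria (a) or (c), then exact duplicate pairs (d)."""
    seen = set()
    kept = []
    for read1, read2 in pairs:
        if fails_quality(read1) or fails_quality(read2):
            continue
        key = (read1[0], read2[0])  # identical read 1 and read 2 sequences
        if key in seen:
            continue
        seen.add(key)
        kept.append((read1, read2))
    return kept


# Toy reads, hypothetical values only.
good = ("ACGTACGTAC", [30] * 10)
too_many_n = ("NACGTACGTA", [30] * 10)            # 10% N
low_quality = ("ACGTACGTAC", [2] * 6 + [30] * 4)  # 60% of bases below Q5
pairs = [(good, good), (too_many_n, good), (good, low_quality), (good, good)]
kept = filter_pairs(pairs)
print(len(kept), "pair(s) retained")

# Coverage check: retained bases divided by the assumed genome length.
coverage = 278.2 / 2.43
print(f"coverage = {coverage:.1f}x")
```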

1.3 Estimation of genome size using K-mer method

To estimate the genome size of the Tibetan wild boar, we selected 130.05 Gb of high-quality reads from the short-insert library (180 bp), and generated 19-mer frequency information based on the K-mer analysis as implemented in the software Meryl120,121. The estimated size of the Tibetan wild boar genome is 2,379.31 Mb (~2.38 Gb) (Supplementary Fig. 4 and Supplementary Table 2).
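K-mer based genome size estimates are commonly obtained by dividing the total number of K-mers by the depth at the main peak of the K-mer frequency distribution. The sketch below illustrates that general relation on a toy histogram; it is not the Meryl implementation, the function name and the min_depth cutoff for excluding likely error K-mers are assumptions, and the numbers are hypothetical rather than those of Supplementary Table 2.

```python
from typing import Dict


def estimate_genome_size(histogram: Dict[int, int], min_depth: int = 3) -> float:
    """Estimate genome size as (total k-mers) / (peak k-mer depth).

    histogram maps a k-mer depth to the number of distinct k-mers seen at
    that depth. Depths below min_depth are treated as sequencing errors.
    """
    usable = {d: c for d, c in histogram.items() if d >= min_depth}
    if not usable:
        raise ValueError("no k-mers at or above min_depth")
    total_kmers = sum(depth * count for depth, count in usable.items())
    peak_depth = max(usable, key=lambda depth: usable[depth])
    return total_kmers / peak_depth


# Toy histogram: a low-depth error tail and a main peak at depth 21.
toy_histogram = {1: 5000, 20: 100, 21: 150, 22: 100}
size = estimate_genome_size(toy_histogram)
print(size)
```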

1.4 De novo assembly

The paired-end reads of the 180 bp, 500 bp and 2 kb DNA libraries were processed using the error-correction module of ALLPATHS-LG122. We assembled the Tibetan wild boar genome using SOAPdenovo, a de novo genome assembler based on the de Bruijn graph algorithm123. Firstly, the corrected reads of the 180 bp and 500 bp DNA libraries were used to construct the contig sequences employing 27-mers. Consequently, we obtained a contig N50 size of 1,124 bp and a contig N90 size of 252 bp, based on fragments longer than 100 bp. Secondly, we realigned all the reads, including those from the short-insert libraries (180 bp and 500 bp) and the long-insert libraries (2 kb, 5 kb and 10 kb), onto the contig sequences, with 83.60% of the paired-end reads aligned. Thirdly, we constructed scaffolds using adjacent contigs identified by paired-end information that had at least four consistent read pairs. Consequently, the contig N50 and N90 sizes (based on fragments longer than 500 bp) within these scaffolds were improved to 10,830 bp and 2,411 bp, respectively. The scaffold N50 and N90 sizes were also enhanced to 1,068,344 bp and 231,601 bp. Fourthly, to close the gaps within the constructed scaffolds (caused mainly by the presence of repeats that were masked during scaffold construction), we used the paired-end information to retrieve the read pairs that had one read well-aligned on the contigs and the other read located in the gap region, and then performed a local assembly for these collected reads using the package Gapcloser (version 1.12)123. This last step improved the contig N50 and N90 sizes to 20,411 bp and 4,605 bp, and the scaffold N50 and N90 sizes to 1,049,950 bp and 227,167 bp, respectively, based on fragments longer than 100 bp (Supplementary Table 3). Consequently, a total length of ungapped sequence of 2.43 Gb was generated for the Tibetan wild boar genome, similar to the amount generated for the Duroc pig genome (2.52 Gb) (Table 1 and Supplementary Table 11).
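The N50 and N90 statistics quoted above follow the usual definition: the length of the fragment at which the cumulative length, taken from longest to shortest, first reaches 50% or 90% of the total. A minimal sketch is given below; the function name is illustrative, the min_length argument mirrors the "fragments longer than" cutoffs in the text, and the lengths are toy values.

```python
from typing import Iterable


def nx(lengths: Iterable[int], fraction: float, min_length: int = 0) -> int:
    """Return the Nx statistic (e.g. fraction=0.5 for N50).

    Only fragments strictly longer than min_length are considered.
    """
    kept = sorted((n for n in lengths if n > min_length), reverse=True)
    if not kept:
        raise ValueError("no fragments longer than min_length")
    threshold = fraction * sum(kept)
    running = 0
    for length in kept:
        running += length
        if running >= threshold:
            return length
    return kept[-1]


# Toy contig lengths in bp.
toy_lengths = [80, 70, 50, 40, 30, 20]
print("N50 =", nx(toy_lengths, 0.5))
print("N90 =", nx(toy_lengths, 0.9))
```

Because the cutoff changes the set of fragments counted, N50 values computed with different min_length settings (100 bp versus 500 bp in the text) are not directly comparable.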

1.5 Detections of heterozygous SNPs and deletion or insertion polymorphisms (InDels)

To evaluate the heterozygosity rate for the Tibetan wild boar genome, we realigned the ~216.2 Gb of high-quality reads from short-insert libraries (180 bp and 500 bp) onto the genome assembly using the package BWA124 (Supplementary Fig. 7 and Supplementary Table 4). Then we performed SNP calling using the package SOAPsnp125, and finally obtained ~4.4 M high-confidence heterozygous SNPs for the Tibetan wild boar genome (i.e. the coverage depth ≥ 4 and ≤ 150, the genotype quality ≥ 20, copy number ≤ 2 and the distance of adjacent SNPs ≥ 5) (Supplementary Fig. 8), which represents a heterozygous SNP rate in the Tibetan wild boar of 1.82 × 10^-3. In addition, we performed InDel calling for the Tibetan wild boar genome using a Bayesian approach implemented in the package SAMtools. The ‘mpileup’ command was used to identify InDels with the parameters ‘-m 2 -F 0.002 -d 1,000’. A total of 984,284 InDels were identified, ranging from 1 bp to 30 bp in length, of which 982 (0.10%) were in coding regions (Supplementary Fig. 11 and Supplementary Table 7).
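The high-confidence criteria listed above can be written as a small filter, followed by the rate calculation. This is a sketch under assumptions: the SnpCall record and function name are hypothetical, all calls are taken to lie on one sequence, adjacency is measured against all candidate calls, and the denominator of the rate is assumed to be the 2.43 Gb ungapped assembly length, which the text does not state.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SnpCall:
    position: int          # 1-based coordinate on a single sequence
    depth: int             # coverage depth at the site
    genotype_quality: int  # phred-scaled genotype quality
    copy_number: float     # estimated copy number of the region


def high_confidence(calls: List[SnpCall]) -> List[SnpCall]:
    """Keep calls with depth 4..150, quality >= 20, copy number <= 2,
    and at least 5 bp from the nearest neighbouring candidate SNP."""
    ordered = sorted(calls, key=lambda c: c.position)
    kept = []
    for i, call in enumerate(ordered):
        passes = (
            4 <= call.depth <= 150
            and call.genotype_quality >= 20
            and call.copy_number <= 2
        )
        crowded_left = i > 0 and call.position - ordered[i - 1].position < 5
        crowded_right = (
            i + 1 < len(ordered) and ordered[i + 1].position - call.position < 5
        )
        if passes and not (crowded_left or crowded_right):
            kept.append(call)
    return kept


# Toy calls, hypothetical values only.
calls = [
    SnpCall(100, 30, 40, 1.0),  # too close to the call at 103
    SnpCall(103, 30, 40, 1.0),  # too close to the call at 100
    SnpCall(200, 3, 40, 1.0),   # depth below 4
    SnpCall(300, 30, 40, 1.0),  # passes
    SnpCall(400, 30, 10, 1.0),  # genotype quality below 20
]
kept_positions = [c.position for c in high_confidence(calls)]
print(kept_positions)

# Rate from the rounded figures in the text (assumed denominator: 2.43 Gb).
rate = 4.4e6 / 2.43e9
print(f"heterozygous SNP rate = {rate:.2e}")
```

With these rounded inputs the rate comes out near 1.81 × 10^-3, close to the reported 1.82 × 10^-3; the small difference is expected given that ~4.4 M is itself a rounded count.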

1.6 Repeat annotation

After the genome assembly, we performed repeat annotation for the Tibetan wild boar genome.

(a) Identification of known transposable elements (TEs)

We used RepeatMasker version 3.3.0 (Supplementary URLs) against the Repbase TE library (RM database version 20110920)126, and RepeatProteinMask (Supplementary URLs) performing WU-BLASTX against the TE protein database.

(b) De novo repeat prediction


We built a de novo repeat library for the Tibetan wild boar using RepeatModeler version 1.0.5 (Supplementary URLs), which uses two core programs, i.e. RECON127 and RepeatScout128, to generate the TE families.

(c) Identification of tandem repeats

We identified non-interspersed repeat sequences using RepeatMasker with the “-nolow” option, including simple repeats, satellites and low complexity repeats. We also predicted tandem repeats using the package Tandem Repeat Finder129, with parameters set to “Match=2, Mismatch=7, Delta=7, PM=80, PI=10, Minscore=50, and MaxPeriod=12”. In addition, to compare the TE characteristics among different genomes, we performed repeat annotation for the Duroc pig, human and cattle genomes based on the same pipeline used for the Tibetan wild boar (Supplementary Fig. 10 and Supplementary Tables 5, 6).

1.7 Structural annotation of genes

The genes in the Tibetan wild boar genome were predicted using ab initio and homology-based methods, and by incorporating evidence of transcription from the RNA-seq data.

(a) Ab initio prediction

We used the ab initio prediction packages Augustus130, Geneid131, Genscan132, GlimmerHMM133 and SNAP134 with the parameters trained from a set of high-quality homologous prediction proteins.

(b) Homology-based prediction

The protein repertoires of human, mouse, cattle, dog and the Duroc pig were downloaded from Ensembl release 67 and mapped onto the repeat-masked Tibetan wild boar genome using TBLASTn135. Then, homologous genome sequences were aligned against the matching proteins using Genewise136 to define gene models. Moreover, we aligned the porcine cDNA and EST sequences onto the Tibetan wild boar genome, which provided the evidence for the homology-based prediction.

(c) RNA-seq data

To optimize the genome annotation, four tissue RNA libraries (i.e. heart, liver, lung and kidney) were constructed using the Illumina mRNA-Seq Prep Kit and about 27.9 Gb of sequence was generated (100 bp at each end). RNA-seq reads were aligned to both the Tibetan wild boar and Duroc pig reference assemblies using TopHat (v2.0.7)137 with default parameters to identify exon regions and splice positions (Supplementary Table 12). The alignment results were then used as input for Cufflinks (v2.0.2)138 with default parameters for genome-based transcript assembly.

The final non-redundant reference gene set was generated by merging the genes predicted by the three methods using EvidenceModeler (EVM)139, and genes with ≤ 50 amino acids, or with only de novo predictive support, were removed (Supplementary Table 13). The final reference gene set of the Tibetan wild boar comprised 21,806 genes, which is comparable with the gene repertoire of the Duroc pig genome (21,640 genes) (Supplementary Table 15).

1.8 Functional annotation of genes

Gene functions were assigned according to the best match of the alignment to the SwissProt and TrEMBL databases140, using BLASTP135. We annotated motifs and domains using InterPro141 by searching against publicly available databases, including Pfam142, PRINTS, PROSITE, ProDom, and SMART using InterProScan141. Gene Ontology (GO) terms143 for each gene were retrieved from the corresponding InterPro descriptions (Supplementary Table 16). Furthermore, we also mapped these Tibetan wild boar genes to the KEGG pathway144 to identify the best match category for each gene.

1.9 Non-coding RNA (ncRNA) annotation

The tRNA genes were predicted by tRNAscan-SE145 with eukaryote parameters. The rRNA, microRNA (miRNA) and small nuclear RNA (snRNA) genes were identified using the Infernal software146 by searching against the Rfam database147 with default parameters (Supplementary Table 10). In addition, we filtered out the miRNAs, snRNAs and tRNAs located in repeat or gap regions, as well as rRNAs of short length (≤ 50 bp) and low identity (≤ 85%).

2 Lineage-specific genes

2.1 Gene family cluster and orthology relationships

All DNA and protein data for the Duroc pig, human, mouse, cattle and dog were downloaded from Ensembl release 67. For genes with alternative splicing variants, we chose the longest transcript (≥ 30 amino acids) to represent the gene. We used the Treefam methodology148 to define a gene family as a group of genes descended from a single gene in the last common ancestor of the species considered. An all-against-all BLASTP135 search with an e-value cutoff of 1e-7 was applied to determine the similarities between genes in three (Tibetan wild boar, Duroc pig and human) or six (Tibetan wild boar, Duroc pig, cattle, dog, mouse and human) mammalian genomes, and fragmented alignments for each gene pair were joined using Solar (Supplementary Figs. 12, 14 and Supplementary URLs).
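The choice of one representative transcript per gene can be sketched in Python as follows. This is only an illustration under stated assumptions: the input is taken to be an iterable of hypothetical `(gene_id, transcript_id, protein_sequence)` tuples, and ties in length are resolved in favour of the transcript seen first, which the text does not specify.

```python
def longest_transcripts(transcripts, min_length=30):
    """Pick the longest protein (>= min_length aa) for each gene.

    transcripts: iterable of (gene_id, transcript_id, protein_sequence).
    Returns a dict mapping gene_id -> (transcript_id, protein_sequence).
    """
    best = {}
    for gene_id, transcript_id, protein in transcripts:
        # Transcripts shorter than the minimum length cannot represent a gene.
        if len(protein) < min_length:
            continue
        current = best.get(gene_id)
        # Keep the first transcript seen, replace it only by a strictly longer one.
        if current is None or len(protein) > len(current[1]):
            best[gene_id] = (transcript_id, protein)
    return best
```

Genes whose transcripts are all shorter than 30 amino acids simply do not appear in the result.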

We assigned a connection (edge) between two nodes (genes) if more than 1/3 of the region was aligned in both genes. An edge weight ranging from 0 to 100 was used to weigh the similarity (edge). To cluster protein-coding genes into gene families, we used average-distance hierarchical clustering as implemented in Hcluster_sg, requiring a minimum edge weight ≥ 10 and a minimum edge density (total number of edges/theoretical number of edges) ≥ 0.34.
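The edge rule and the edge-density threshold can be made concrete with a short Python sketch. It does not reproduce Hcluster_sg; the function names are hypothetical, and the 1/3 rule is read here as "the aligned region covers more than one third of each gene", which is one plausible interpretation of the text. The theoretical number of edges is taken as n(n-1)/2 for a cluster of n genes.

```python
def has_edge(aligned_length, length_a, length_b):
    """True if the aligned region covers more than 1/3 of both genes.

    Integer arithmetic (3 * aligned > length) avoids rounding issues.
    """
    return 3 * aligned_length > length_a and 3 * aligned_length > length_b


def edge_density(n_genes, n_edges):
    """Observed edges divided by the n(n-1)/2 possible edges."""
    theoretical = n_genes * (n_genes - 1) // 2
    if theoretical == 0:
        return 0.0
    return n_edges / theoretical


def passes_density(n_genes, n_edges, min_density=0.34):
    """Check a candidate cluster against the stated density threshold."""
    return edge_density(n_genes, n_edges) >= min_density
```

For example, a candidate cluster of four genes has six possible edges, so it needs at least three observed edges (density 0.5) to pass, because two edges give a density of about 0.33.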

2.2 Evidence of transcription for the Tibetan wild boar-specific genes

A total of 27.9 Gb of RNA-seq sequence generated from the four libraries was mapped to the Tibetan wild boar genome using TopHat137. Gene expression levels were quantified as normalized RPKM values (reads per kilobase per million mapped reads) (Supplementary Table 17).
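For reference, RPKM follows directly from its definition: the read count of a gene is divided by the gene length in kilobases and by the total number of mapped reads in millions. The sketch below uses hypothetical argument names and shows the arithmetic only, not how read counts were obtained in this study.

```python
def rpkm(gene_reads, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads.

    RPKM = gene_reads * 1e9 / (gene_length_bp * total_mapped_reads)
    """
    if gene_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("gene length and total mapped reads must be positive")
    return gene_reads * 1e9 / (gene_length_bp * total_mapped_reads)
```

As a worked example, 500 reads on a 2,000 bp gene in a library of 10 million mapped reads give 500 / 2 kb / 10 M = 25 RPKM.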

3 Functional enrichment analyses for genes

Functional enrichment analysis of Gene Ontology (GO) terms and pathways was performed using the DAVID (Database for Annotation, Visualization and Integrated Discovery) web server149,150. Genes were submitted to DAVID to test for significant overrepresentation of GO biological process (GO-BP) and molecular function (GO-MF) terms, InterPro domains and KEGG pathways. In all tests, the whole set of known genes was used as the background, and P values (i.e. EASE scores), indicating the significance of the overlap between gene sets, were calculated using a Benjamini-corrected modified Fisher’s exact test. Only GO-BP, GO-MF, KEGG-pathway or InterPro domain terms with a P value less than 0.05 were considered significant and listed.
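The statistics named above can be sketched in plain Python. This is not DAVID's code: it assumes the EASE score is the one-sided Fisher's exact (hypergeometric) P value computed after removing one gene from the list hits, and that the Benjamini correction is the Benjamini-Hochberg step-up adjustment. All function and argument names are hypothetical.

```python
from math import comb


def hypergeom_upper_tail(k, pop_total, pop_hits, list_total):
    """P(X >= k) when list_total genes are drawn from pop_total genes,
    of which pop_hits carry the annotation term."""
    k = max(k, 0)
    upper = min(pop_hits, list_total)
    numerator = sum(
        comb(pop_hits, i) * comb(pop_total - pop_hits, list_total - i)
        for i in range(k, upper + 1)
    )
    return numerator / comb(pop_total, list_total)


def ease_score(list_hits, list_total, pop_hits, pop_total):
    """Modified Fisher's exact P value: one list hit is removed first."""
    return hypergeom_upper_tail(list_hits - 1, pop_total, pop_hits, list_total)


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted P values, returned in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted
```

Because one hit is removed before testing, the EASE score is never smaller than the ordinary one-sided Fisher P value, which makes terms supported by very few genes harder to call significant.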

4 Identification of pseudogenes

We identified 188 pseudogenes in the Tibetan wild boar genome, containing 137 frameshift and 60 premature termination events, based on in silico filters and further manual examination (Supplementary Table 28). We first aligned all human protein sequences from Ensembl release 67 onto the Tibetan wild boar genome using TBLASTn135. The best-matching regions of each gene were then reduced and re-aligned using GeneWise136 to help define the exon-intron structure. To avoid splicing errors near the frameshift or premature termination events, we also aligned human genes onto the human genome with the same pipeline. Candidates with high mapping quality (read coverage ≥ 10 and supported by matching transcript reads) that contained a frameshift or premature termination event, but no splicing errors, SNPs or InDels, were considered pseudogenes. In addition, we aligned the re-sequencing data sets of 30 Tibetan wild boars to the Tibetan wild boar genome assembly to further evaluate the candidate pseudogenes.

5 Population-based re-sequencing and SNP calling

5.1 Re-sequencing strategy and read mapping

We sampled a total of 48 pigs, including 30 Tibetan wild boars, 15 domestic pigs in China and three wild boars in Southwest China (Fig. 2a and Supplementary Table 31). Sequencing was performed on the Illumina HiSeq 2000 platform and generated a total of 659.4 Gb of paired-end DNA sequence. The same criteria for sequence quality checking and filtering were applied (see ‘1.2 Sequence quality checking and filtering’).

Consequently, 655.9 Gb (99.5% of 659.4 Gb) of high-quality paired-end reads were mapped to the Tibetan wild boar genome assembly using the BWA software124. First, the reference was indexed. Second, the command ‘aln -o 1 -e 10 -t 4 -l 32 -i 15 -q 10’ was used to find the suffix array coordinates of good matches for each read. Third, the best alignments for the paired-end reads were generated in SAM format with the command ‘sampe’.
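The three mapping steps can be laid out as command lines. The Python sketch below only assembles the argument lists and the file each command's standard output would be redirected to; it does not run BWA. The `aln` options are those quoted in the text, while the function name, file names and the `.sai`/`.sam` naming scheme are hypothetical.

```python
def bwa_commands(reference, fastq1, fastq2, prefix):
    """Return (argv, stdout_file) pairs for index, aln (x2) and sampe."""
    # Options quoted in the text for 'bwa aln'.
    aln_options = ["-o", "1", "-e", "10", "-t", "4",
                   "-l", "32", "-i", "15", "-q", "10"]
    sai1 = f"{prefix}_1.sai"
    sai2 = f"{prefix}_2.sai"
    return [
        # Step 1: index the reference assembly.
        (["bwa", "index", reference], None),
        # Step 2: find suffix array coordinates for each read end.
        (["bwa", "aln", *aln_options, reference, fastq1], sai1),
        (["bwa", "aln", *aln_options, reference, fastq2], sai2),
        # Step 3: pair the ends and write alignments in SAM format.
        (["bwa", "sampe", reference, sai1, sai2, fastq1, fastq2],
         f"{prefix}.sam"),
    ]


if __name__ == "__main__":
    for argv, out in bwa_commands("ref.fa", "s_1.fq", "s_2.fq", "s"):
        print(" ".join(argv) + (f" > {out}" if out else ""))
```

Keeping the commands as data makes it easy to log exactly what would be run for each of the re-sequenced individuals before handing the lists to a job scheduler.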

Next, we improved the alignment results in the following three steps:
(a) Filter the aligned reads with mismatches ≤ 5 and mapping quality = 0.
(b) Correct the alignment results using the Picard package (Supplementary URLs) with two core commands. The ‘AddOrReplaceReadGroups’ command was used to replace all read groups in the INPUT file with a new read group and to assign all reads to this read group in the OUTPUT BAM. The ‘FixMateInformation’ command was used to ensure that all mate-pair information was in sync between each read and its mate.
(c) Remove potential PCR duplicates. If multiple read pairs had identical external coordinates, only the pair with the highest mapping quality was retained.
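The duplicate-removal rule in step (c) can be sketched in Python. The sketch assumes each read pair is a dictionary with hypothetical keys `chrom`, `start`, `end` (the external coordinates of the pair) and `mapq`; it illustrates the stated rule only and is not the implementation used by any particular tool. Ties in mapping quality keep the pair encountered first, which the text does not specify.

```python
def remove_duplicates(pairs):
    """Keep, for each set of identical external coordinates,
    the read pair with the highest mapping quality."""
    best = {}
    for pair in pairs:
        key = (pair["chrom"], pair["start"], pair["end"])
        kept = best.get(key)
        # Replace the stored pair only if this one maps strictly better.
        if kept is None or pair["mapq"] > kept["mapq"]:
            best[key] = pair
    # Dict insertion order follows the first time each key was seen.
    return list(best.values())
```

Pairs that differ in either external coordinate are treated as independent fragments and are all retained.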

Finally, for each individual, ~91.99% of reads were mapped, covering 94.63% (at least 1×) or 64.55% (at least 4×) of the Tibetan wild boar reference genome assembly at 4.95-fold average depth (Supplementary Table 31). In addition, we downloaded genome data for 55 individuals from across the world (1,037 Gb in total) from the EMBL-EBI database (accession number ERP001813), including 30 European domestic pigs, 7 domestic pigs in Southeast China, 7 Asian wild boars, 6 European wild boars, 4 other species in the genus Sus, and an African warthog. These data had 6.72-fold average depth, a 97.77% mapping rate and ~80.69% (at least 1×) or ~67.16% (at least 4×) coverage of the Tibetan wild boar genome (Fig. 2a and Supplementary Table 32). The lower mapping rate of the Tibetan wild boar re-sequencing reads (see ‘1.2 Sequence quality checking and filtering’) compared with the reads of other pigs is likely due to the more stringent filtering criteria used in other pig genome studies (e.g. phred quality ≤ 20)7-9. When reads with phred quality ≤ 20 were filtered out, the mapping rate of the Tibetan wild boars to the Tibetan wild boar genome increased to 98.90%, which is higher than the mapping rate of any downloaded pig genome data set to the Tibetan wild boar genome.

5.2 SNP calling

After alignment, we performed SNP calling on a population scale for three groups (30 Tibetan wild boars, 52 domestic pigs, and 21 wild boars and wild individuals of the genus Sus) using a Bayesian approach as implemented in the package SAMtools151. The genotype likelihoods from reads for each individual at each genomic location were calculated, and the allele frequencies were also estimated. The ‘mpileup’ command was used to identify SNPs with the parameters ‘-q 1 -C 50 -S -D -m 2 -F 0.002 -u’. Then, only the high-quality SNPs (coverage depth ≥ 4 and ≤ 1,000, RMS mapping quality ≥ 20, distance between adjacent SNPs ≥ 5 bp and missing ratio of samples within each group < 50%) were kept for subsequent analysis. In total, we identified 14,637,670 (14.64 M) SNPs from 103 individuals (Supplementary Table 33). We then obtained separate SNP sets for each of the three groups: 8,390,501 (8.39 M) from the 30 Tibetan wild boars, 9,173,377 (9.17 M) from the 52 domestic pigs, and 7,780,578 (7.78 M) from the 21 wild boars and wild individuals of the genus Sus (Supplementary Tables 33 and 35). Only a small proportion of SNPs (3.02 M of 14.64 M, 20.63%) was shared among the three groups, indicating substantial differences in genomic background among them.
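The SNP quality filters can be expressed as a short Python sketch. It assumes each candidate SNP is a dictionary with hypothetical keys `chrom`, `pos`, `depth`, `rms_mapq` and `missing_ratio`, sorted by scaffold and position. How the 5 bp spacing rule was applied is not stated; here a SNP is dropped if any neighbouring raw call on the same scaffold lies closer than 5 bp, which is only one possible reading.

```python
MIN_DEPTH = 4        # coverage depth >= 4
MAX_DEPTH = 1000     # coverage depth <= 1,000
MIN_RMS_MAPQ = 20    # RMS mapping quality >= 20
MIN_SPACING = 5      # distance between adjacent SNPs >= 5 bp
MAX_MISSING = 0.5    # missing ratio within the group < 50%


def filter_snps(snps):
    """Return the SNPs that satisfy all four stated criteria."""
    kept = []
    for i, snp in enumerate(snps):
        if not MIN_DEPTH <= snp["depth"] <= MAX_DEPTH:
            continue
        if snp["rms_mapq"] < MIN_RMS_MAPQ:
            continue
        if snp["missing_ratio"] >= MAX_MISSING:
            continue
        # Compare with the previous and next raw calls on the same scaffold.
        neighbours = [snps[j] for j in (i - 1, i + 1) if 0 <= j < len(snps)]
        too_close = any(
            n["chrom"] == snp["chrom"]
            and abs(n["pos"] - snp["pos"]) < MIN_SPACING
            for n in neighbours
        )
        if too_close:
            continue
        kept.append(snp)
    return kept
```

The reported sharing figure can be checked the same way: 3.02 / 14.64 ≈ 0.2063, i.e. 20.63% of all SNPs.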

6 Demographic history reconstruction

The demographic history of seven wild boars (three in Europe and four in Asia) and six Tibetan wild boars from six geographically diverse populations was inferred using a hidden Markov model (HMM) approach, as implemented in the pairwise sequentially Markovian coalescent (PSMC) model, based on SNP distribution152 (Fig. 2e). To improve the accuracy of the inferred historical recombination events, we used only scaffolds larger than 50 kb (~93.85% of all scaffolds), and ~7.6 M heterozygous SNPs per individual were used to reconstruct the demographic history. The program ‘fq2psmcfa’ was used to transform the consensus sequence into a fasta-like format in which the i-th character of the output sequence indicates whether there is at least one heterozygote in the bin [100i, 100i+100). Parameters were set as follows: ‘-N30 -t15 -r5 -p "4+25*2+4+6"’. The porcine generation time (g) of 5 years and the neutral mutation rate per generation (μ) of 2.5 × 10^-8 were based on previous reports7,9.
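The binning performed before PSMC and the meaning of the time-interval pattern can be illustrated with a small Python sketch. It is not the fq2psmcfa program: missing data are ignored, heterozygous sites are passed in as 0-based positions, and the letters 'K' (bin with at least one heterozygote) and 'T' (bin without) are assumed to follow the usual psmcfa convention. The function names are hypothetical.

```python
def bin_heterozygotes(seq_length, het_positions, bin_size=100):
    """Character i is 'K' if any heterozygote lies in
    [bin_size * i, bin_size * i + bin_size), else 'T'."""
    n_bins = (seq_length + bin_size - 1) // bin_size
    bins = ["T"] * n_bins
    for pos in het_positions:
        if 0 <= pos < seq_length:
            bins[pos // bin_size] = "K"
    return "".join(bins)


def count_atomic_intervals(pattern):
    """Total number of atomic time intervals in a pattern such as
    '4+25*2+4+6', where 'a*b' means a parameters spanning b intervals each."""
    total = 0
    for term in pattern.split("+"):
        if "*" in term:
            repeats, width = term.split("*")
            total += int(repeats) * int(width)
        else:
            total += int(term)
    return total
```

Under this reading, the pattern ‘4+25*2+4+6’ divides time into 4 + 50 + 4 + 6 = 64 atomic intervals grouped into 28 free population-size parameters.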

In addition, climate change and migration are two important factors influencing population size. We therefore obtained atmospheric surface air temperature (℃) and global relative sea level (10 m) data for the past 1 million years from the National Climatic Data Center (NCDC) (Supplementary URLs) and combined them with the demographic data in a single plot. Note that PSMC cannot detect population changes more recent than 10,000 years ago.

7 Linkage-disequilibrium (LD) analysis

To compare the LD patterns of Tibetan wild boars and Chinese domestic pigs, we merged 6.01 M SNPs from 15 Chinese domestic pigs with the SNPs of the Tibetan wild boars, resulting in 9.49 M SNPs in total. To evaluate LD decay, the coefficient of determination (r²) between any two loci was calculated using Haploview153 (Fig. 3a). Parameters were set as follows: ‘-n -dprime -minGeno 0 -missingCutoff 1 -minMAF 0.01’. Average r² was calculated for pairwise markers within a 500 kb window and averaged across the whole genome.
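For reference, the pairwise LD statistic can be computed from haplotype frequencies as the squared disequilibrium coefficient D divided by the product of the four allele frequencies. The Python sketch below assumes phased haplotypes coded as (0, 1) allele pairs, which is a simplification: Haploview estimates haplotype frequencies from unphased genotypes, and that estimation step is not reproduced here. The function name is hypothetical.

```python
def r_squared(haplotypes):
    """r^2 between two biallelic loci.

    haplotypes: list of (allele_at_locus_a, allele_at_locus_b),
    each allele coded 0 or 1.
    """
    n = len(haplotypes)
    if n == 0:
        raise ValueError("at least one haplotype is required")
    p_a = sum(a for a, _ in haplotypes) / n           # freq of allele 1 at A
    p_b = sum(b for _, b in haplotypes) / n           # freq of allele 1 at B
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    denominator = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denominator == 0:
        raise ValueError("both loci must be polymorphic")
    d = p_ab - p_a * p_b                              # disequilibrium D
    return d * d / denominator
```

Two loci whose alleles always co-occur give a value of 1, whereas loci whose alleles combine independently give 0; LD decay curves plot the average of this statistic against the distance between markers.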

Supplementary URLs

Breakdancer, http://gmt.genome.wustl.edu/breakdancer/1.2/index.html
Bioinformatics and Systems Biology of Gent, http://bioinformatics.psb.ugent.be/webtools/Venn/
InParanoid, http://inparanoid.sbc.su.se/cgi-bin/index.cgi
MultiParanoid, http://multiparanoid.sbc.su.se/
MEGA 5.15, http://www.megasoftware.net/
LASTZ, http://www.bx.psu.edu/miller_lab/
RepeatMasker, RepeatProteinMask and RepeatModeler, http://www.RepeatMasker.org
Solar, http://treesoft.svn.sourceforge.net/viewrc/treesoft/
Picard, http://sourceforge.net/projects/picard/
National Climatic Data Center (NCDC), http://www.ncdc.noaa.gov/

Supplementary References

1 Feuk, L. et al. Discovery of human inversion polymorphisms by comparative analysis of human and chimpanzee DNA sequence assemblies. PLoS Genet. 1, e56 (2005).
2 Lai, J. et al. Genome-wide patterns of genetic variation among elite maize inbred lines. Nat. Genet. 42, 1027-1030 (2010).
3 Xu, X. et al. Resequencing 50 accessions of cultivated and wild rice yields markers for identifying agronomically important genes. Nat. Biotechnol. 30, 105-111 (2012).
4 Nguyen, D. T. et al. The complete swine olfactory subgenome: expansion of the olfactory gene repertoire in the pig genome. BMC Genomics 13, 584 (2012).
5 Quignon, P. et al. The dog and rat olfactory receptor repertoires. Genome Biol. 6, R83 (2005).
6 Castillo-Davis, et al. The functional genomic distribution of protein divergence in two animal phyla: coevolution, genomic conflict, and constraint. Genome Res. 14, 802-811 (2004).
7 Groenen, M. A. et al. Analyses of pig genomes provide insight into porcine demography and evolution. Nature 491, 393-398 (2012).
8 Rubin, C. J. et al. Strong signatures of selection in the domestic pig genome. Proc. Natl. Acad. Sci. USA 109, 19529-19536 (2012).
9 Bosse, M. et al. Regions of homozygosity in the porcine genome: consequence of demography and the recombination landscape. PLoS Genet. 8, e1003100 (2012).
10 Romanenko, V., Nakamoto, T., Srivastava, A., Melvin, J. E. & Begenisich, T. Molecular identification and physiological roles of parotid acinar cell maxi-K channels. J. Biol. Chem. 281, 27964-27972 (2006).
11 Liu, X. et al. Attenuation of store-operated Ca2+ current impairs salivary gland fluid secretion in TRPC1(-/-) mice. Proc. Natl. Acad. Sci. USA 104, 17542-17547 (2007).
12 Beall, C. M. et al. Natural selection on EPAS1 (HIF2α) associated with low hemoglobin concentration in Tibetan highlanders. Proc. Natl. Acad. Sci. USA 107, 11459-11464 (2010).
13 Bigham, A. et al. Identifying signatures of natural selection in Tibetan and Andean populations using dense genome scan data. PLoS Genet. 6 (2010).
14 Simonson, T. S. et al. Genetic evidence for high-altitude adaptation in Tibet. Science 329, 72-75 (2010).
15 Yi, X. et al. Sequencing of 50 human exomes reveals adaptation to high altitude. Science 329, 75-78 (2010).
16 Peng, Y. et al. Genetic variations in Tibetan populations and high-altitude adaptation at the Himalayas. Mol. Biol. Evol. 28, 1075-1081 (2011).
17 Xu, S. et al. A genome-wide search for signals of high-altitude adaptation in Tibetans. Mol. Biol. Evol. 28, 1003-1011 (2011).
18 Ji, L. D. et al. Genetic adaptation of the hypoxia-inducible factor pathway to oxygen pressure among eurasian human populations. Mol. Biol. Evol. 29, 3359-3370 (2012).
19 Scheinfeldt, L. B. et al. Genetic adaptation to high altitude in the Ethiopian highlands. Genome Biol. 13, R1 (2012).
20 Rankinen, T. et al. The human obesity gene map: the 2005 update. Obesity 14, 529-644 (2006).
21 MacDougald, O. A. & Burant, C. F. The rapidly expanding family of adipokines. Cell. Metab. 6, 159-161 (2007).
22 Heid, I. M. et al. Meta-analysis identifies 13 new loci associated with waist-hip ratio and reveals sexual dimorphism in the genetic basis of fat distribution. Nat. Genet. 42, 949-960 (2010).
23 Speliotes, E. K. et al. Association analyses of 249,796 individuals reveal 18 new loci associated with body mass index. Nat. Genet. 42, 937-948 (2010).
24 Li, M. et al. An atlas of DNA methylomes in porcine adipose and muscle tissues. Nat. Commun. 3, 850 (2012).
25 Yang, Z. PAML: a program package for phylogenetic analysis by maximum likelihood. Comput. Appl. Biosci. 13, 555-556 (1997).
26 Lace, B. et al. BCL3 gene role in facial morphology. Birth. Defects Res. A Clin. Mol. Teratol. 94, 918-924 (2012).
27 Wang, Y. & Lu, L. Activation of oxidative stress-regulated Bcl-3 suppresses CTCF in corneal epithelial cells. PLoS One 6, e23984 (2011).
28 Yu, H. et al. Association between single nucleotide polymorphisms in ERCC4 and risk of squamous cell carcinoma of the head and neck. PLoS One 7, e41853 (2012).
29 Krupa, R. et al. Polymorphisms of the DNA repair genes XRCC1 and ERCC4 are not associated with smoking- and drinking-dependent larynx cancer in a Polish population. Exp. Oncol. 33, 55-56 (2011).
30 Muftuoglu, M. et al. Cockayne syndrome group B protein stimulates repair of formamidopyrimidines by NEIL1 DNA glycosylase. J. Biol. Chem. 284, 9270-9279 (2009).
31 Kim, H., Yang, K., Dejsuphong, D. & D'Andrea, A. D. Regulation of Rev1 by the Fanconi anemia core complex. Nat. Struct. Mol. Biol. 19, 164-170 (2012).
32 Kuang, L. et al. A non-catalytic function of Rev1 in translesion DNA synthesis and mutagenesis is mediated by its stable interaction with Rad5. DNA repair 12, 27-37 (2013).
33 Pajukanta, P. et al. Familial combined hyperlipidemia is associated with upstream transcription factor 1 (USF1). Nat. Genet. 36, 371-376 (2004).
34 Corre, S. et al. In vivo and ex vivo UV-induced analysis of pigmentation gene expressions. J. Invest. Dermatol. 126, 916-918 (2006).
35 Majerus, M. E. & Mundy, N. I. Mammalian melanism: natural selection in black and white. Trends Genet. 19, 585-588 (2003).
36 Fang, M., Larson, G., Ribeiro, H. S., Li, N. & Andersson, L. Contrasting mode of evolution at a coat color locus in wild and domestic pigs. PLoS Genet. 5, e1000341 (2009).
37 Yuan, J., Ghosal, G. & Chen, J. The HARP-like domain-containing protein AH2/ZRANB3 binds to PCNA and participates in cellular response to replication stress. Mol. Cell 47, 410-421 (2012).
38 Ciccia, A. et al. Polyubiquitinated PCNA recruits the ZRANB3 translocase to maintain genomic integrity after replication stress. Mol. Cell 47, 396-409 (2012).
39 Andersson, O., Korach-Andre, M., Reissmann, E., Ibanez, C. F. & Bertolino, P. Growth/differentiation factor 3 signals through ALK7 and regulates accumulation of adipose tissue and diet-induced obesity. Proc. Natl. Acad. Sci. USA 105, 7252-7256 (2008).
40 Malik, S. G. et al. Association of β3-adrenergic receptor (ADRB3) Trp64Arg gene polymorphism with obesity and metabolic syndrome in the Balinese: a pilot study. BMC Res. Notes 4, 167 (2011).
41 Zawodniak-Szalapska, M. et al. Association of Trp64Arg polymorphism of β3-adrenergic receptor with insulin resistance in Polish children with obesity. J. Pediatr. Endocrinol. Metab. 21, 147-154 (2008).
42 Subauste, A. R. et al. Alterations in lipid signaling underlie lipodystrophy secondary to AGPAT2 mutations. Diabetes 61, 2922-2931 (2012).
43 Agarwal, A. K. et al. AGPAT2 is mutated in congenital generalized lipodystrophy linked to chromosome 9q34. Nat. Genet. 31, 21-23 (2002).
44 Shen, J. J. et al. Deficiency of growth differentiation factor 3 protects against diet-induced obesity by selectively acting on white adipose. Mol. Endocrinol. 23, 113-123 (2009).
45 Laviano, A., Molfino, A., Rianda, S. & Rossi Fanelli, F. The growth hormone secretagogue receptor (ghs-R). Curr. Pharm. Des. 18, 4749-4754 (2012).
46 Gauna, C. et al. Unacylated ghrelin is not a functional antagonist but a full agonist of the type 1a growth hormone secretagogue receptor (GHS-R). Mol. Cell Endocrinol. 274, 30-34 (2007).
47 Gottardo, L. et al. A polymorphism at the IL6ST (gp130) locus is associated with traits of the metabolic syndrome. Obesity 16, 205-210 (2012).
48 Lin, F. H., Chu, N. F., Lee, C. H., Hung, Y. J. & Wu, D. M. Combined effect of C-reactive protein gene SNP +2147 A/G and interleukin-6 receptor gene SNP rs2229238 C/T on anthropometric characteristics among school children in Taiwan. Int. J. Obes. 35, 587-594 (2011).
49 Camara-Clayette, V. et al. Transcriptional regulation of the KEL gene and Kell protein expression in erythroid and non-erythroid cells. Biochem. J. 356, 171-180 (2001).
50 Ingallinella, P. et al. PEGylation of neuromedin U yields a promising candidate for the treatment of obesity and diabetes. Bioorgan. Med. Chem. 20, 4751-4759 (2012).
51 Malendowicz, L. K., Ziolkowska, A. & Rucinski, M. Neuromedins U and S involvement in the regulation of the hypothalamo-pituitary-adrenal axis. Front. Endocrinol. 3, 156 (2012).
52 Lu, B. et al. Expression of the phospholipid scramblase (PLSCR) gene family during the acute phase response. Biochim. Biophys. Acta. 1771, 1177-1185 (2007).
53 Charos, A. E. et al. A highly integrated and complex PPARGC1A transcription factor binding network in HepG2 cells. Genome Res. 22, 1668-1679 (2012).
54 Gemma, C. et al. Maternal pregestational BMI is associated with methylation of the PPARGC1A promoter in newborns. Obesity 17, 1032-1039 (2009).
55 Connelly, M. A. & Williams, D. L. Scavenger receptor BI: a scavenger receptor with a mission to transport high density lipoprotein lipids. Curr. Opin. Lipidol. 15, 287-295 (2004).
56 Jeyakumar, S. M., Vajreswari, A. & Giridharan, N. V. Impact of vitamin A on high-density lipoprotein-cholesterol and scavenger receptor class BI in the obese rat. Obesity 15, 322-329 (2007).
57 Le, M. T. et al. Impact of Genetic Polymorphisms of SLC2A2, SLC2A5, and KHK on Metabolic Phenotypes in Hypertensive Individuals. PLoS One 8, e52062 (2013).
58 Suviolahti, E. et al. The SLC6A14 gene shows evidence of association with obesity. J. Clin. Invest. 112, 1762 (2003).
59 Walley, A. J., Asher, J. E. & Froguel, P. The genetic contribution to non-syndromic human obesity. Nat. Rev. Genet. 10, 431-442 (2009).
60 Epstein, L. H. et al. Dopamine transporter genotype as a risk factor for obesity in African-American smokers. Obesity Res. 10, 1232-1240 (2002).
61 van Dyck, C. H. et al. Increased dopamine transporter availability associated with the 9-repeat allele of the SLC6A3 gene. J. Nucl. Med. 46, 745-751 (2005).
62 Benjafield, A. V., Glenn, C. L., Wang, X. L., Colagiuri, S. & Morris, B. J. TNFRSF1B in genetic predisposition to clinical neuropathy and effect on HDL cholesterol and glycosylated hemoglobin in type 2 diabetes. Diabetes Care 24, 753-757 (2001).
63 Tabassum, R. et al. Association analysis of TNFRSF1B polymorphisms with type 2 diabetes and its related traits in North India. Genomic Medicine 2, 93-100 (2008).
64 Motter, A. L. & Ahern, G. P. TRPV1-null mice are protected from diet-induced obesity. FEBS Lett. 582, 2257-2262 (2008).
65 Garami, A. et al. Thermoregulatory phenotype of the Trpv1 knockout mouse: thermoeffector dysbalance with hyperkinesis. J. Neurosci. 31, 1721-1733 (2011).
66 Suri, A. & Szallasi, A. The emerging role of TRPV1 in diabetes and obesity. Trends Pharmacol. Sci. 29, 29-36 (2008).
67 Qi, L. et al. TRB3 links the E3 ubiquitin ligase COP1 to lipid metabolism. Science 312, 1763-1766 (2006). 68 Sorrentino, V. & Zelcer, N. Post-transcriptional regulation of lipoprotein receptors by the E3-ubiquitin ligase inducible degrader of the low-density lipoprotein receptor. Curr. Opin. Lipidol. 23, 213-219 (2012). 69 Tortorella, M. D., Malfait, F., Barve, R. A., Shieh, H. S. & Malfait, A. M. A review of the ADAMTS family, pharmaceutical targets of the future. Curr. Pharm. Des. 15, 2359-2374 (2009). 70 Wagstaff, L., Kelwick, R., Decock, J. & Edwards, D. R. The roles of ADAMTS metalloproteinases in tumorigenesis and metastasis. Front. Biosci. 16, 1861-1872 (2011). 71 Reder, N. P. et al. Adrenergic α-1 pathway is associated with hypertension among Nigerians in a pathway-focused analysis. PloS One 7, e37145 (2012). 72 Ro, H. S. et al. Adipocyte enhancer-binding protein 1 modulates adiposity and energy homeostasis. Obesity 15, 288-302 (2007). 73 Elosua, R. et al. Obesity modulates the association among APOE genotype, insulin, and glucose in men. Obesity Res. 11, 1502-1508 (2012). 74 Wang, J. et al. ApoE and the role of very low density lipoproteins in adipose tissue inflammation: ApoE and adipose tissue inflammation. Atherosclerosis (2012). 75 Badano, J. L. et al. Identification of a novel Bardet-Biedl syndrome protein, BBS7, that shares structural features with BBS1 and BBS2. Am. J. Hum. Genet. 72, 650-658 (2003). 76 Nachury, M. V. et al. A core complex of BBS proteins cooperates with the GTPase Rab8 to promote ciliary membrane biogenesis. Cell 129, 1201-1213 (2007). 77 Katsanis, N. et al. Triallelic inheritance in Bardet-Biedl syndrome, a Mendelian recessive disorder. Science 293, 2256-2259 (2001). 78 Thirone, A. C., Carvalheira, J. B., Hirata, A. E., Velloso, L. A. & Saad, M. J. Regulation of Cbl-associated protein/Cbl pathway in muscle and adipose tissues of two animal models of insulin resistance. Endocrinology 145, 281-293 (2004). 79 Taniguchi, C. 
M., Emanuelli, B. & Kahn, C. R. Critical nodes in signalling pathways: insights into insulin action. Nat. Rev. Mol. Cell Bio. 7, 85-96 (2006). 80 Yu, Y. et al. Neuronal Cbl controls biosynthesis of insulin-like peptides in Drosophila melanogaster. Mol. Cell Biol. 32, 3610-3623 (2012). 81 Huang, Y. S., Kan, M. C., Lin, C. L. & Richter, J. D. CPEB3 and CPEB4 in neurons: analysis of RNA-binding specificity and translational control of AMPA receptor GluR2 mRNA. EMBO J. 25, 4865-4876 (2006). 82 Harris, C. A. et al. DGAT enzymes are required for triacylglycerol synthesis and lipid droplets in adipocytes. J. Lipid. Res. 52, 657-667 (2011). 83 Chen, H. C. Enhancing energy and glucose metabolism by disrupting triglyceride synthesis: Lessons from mice lacking DGAT1. Nutr. Metab. 3, 10 (2006). 84 Lee, D. et al. Epiregulin is not essential for development of intestinal tumors but is

94

Nature Genetics: doi:10.1038/ng.2811

required for protection from intestinal damage. Mol. Cell. Biol. 24, 8907-8916 (2004). 85 Bohme, M. et al. Association between functional FABP2 promoter haplotypes and body mass index: analyses of 8,072 participants of the KORA cohort study. Mol. Nutr. Food. Res. 53, 681-685 (2009). 86 Martinez-Lopez, E. et al. Effect of Ala54Thr polymorphism of FABP2 on anthropometric and biochemical variables in response to a moderate-fat diet. Nutrition 29, 46-51 (2013). 87 Camats, N. et al. Contribution of human growth hormone-releasing hormone receptor (GHRHR) gene sequence variation to isolated severe growth hormone deficiency (ISGHD) and normal adult height. Clin. Endocrinol. 77, 564-574 (2012). 88 Lee, L. T. et al. Discovery of growth hormone-releasing hormones and receptors in nonmammalian vertebrates. Proc. Natl. Acad. Sci. USA 104, 2133-2138 (2007). 89 Mracek, T., Drahota, Z. & Houstek, J. The function and the role of the mitochondrial glycerol-3-phosphate dehydrogenase in mammalian tissues. Biochim. Biophys. Acta. 1827, 401-410 (2012). 90 Muoio, D. M. & Newgard, C. B. Obesity-related derangements in metabolic regulation. Annu. Rev. Biochem. 75, 367-401 (2006). 91 Koh, H. J. et al. Cytosolic NADP+ dependent isocitrate dehydrogenase plays a key role in lipid metabolism. J. Biol. Chem. 279, 39968-39974 (2004). 92 Sutter, N. B. et al. A single IGF1 allele is a major determinant of small size in dogs. Science 316, 112-115 (2007). 93 Boucher, J. et al. Impaired thermogenesis and adipose tissue development in mice with fat-specific disruption of insulin and IGF-1 signalling. Nat. Commun.3, 902 (2012). 94 Xu, J. et al. The voltage-gated potassium channel Kv1.3 regulates peripheral insulin sensitivity. Proc. Natl. Acad. Sci. USA 101, 3112-3117 (2004). 95 Tucker, K., Overton, J. M. & Fadool, D. A. Kv1.3 gene-targeted deletion alters longevity and reduces adiposity by increasing locomotion and metabolism in melanocortin 4 receptor-null mice. Int. J. Obes. 32, 1222-1232 (2008). 
96 Sadagurski, M. et al. IRS2 signaling in LepR-b neurons suppresses FoxO1 to control energy balance independently of leptin action. Cell. Metab. 15 (2012). 97 Myers, M. G., Jr. & Olson, D. P. Central nervous system control of metabolism. Nature 491, 357-363 (2012). 98 Macia, L. et al. Neuropeptide Y1 receptor in immune cells regulates inflammation and insulin resistance associated with diet-induced obesity. Diabetes 61, 3228-3238 (2012). 99 Rojas, J. M. et al. Central nervous system neuropeptide Y signaling via the Y1 receptor partially dissociates feeding behavior from lipoprotein metabolism in lean rats. Am. J. Physiol. Endocrinol. Metab. 303, E1479-1488 (2012). 100 Mul, J. D. et al. Pmch expression during early development is critical for normal

95

Nature Genetics: doi:10.1038/ng.2811

energy homeostasis. Am. J. Physiol. Endocrinol. Metab. 298, 477-488 (2010). 101 Kokkotou, E. et al. Melanin-concentrating hormone as a mediator of intestinal inflammation. Proc. Natl. Acad. Sci. USA 105, 10613-10618 (2008). 102 Wang, S. et al. Activation of AMP-activated protein kinase α2 by nicotine instigates formation of abdominal aortic aneurysms in mice in vivo. Nat. Med. 18, 902-910 (2012). 103 Lee-Young, R. S. et al. Obesity impairs skeletal muscle AMPK signaling during exercise: role of AMPK α2 in the regulation of exercise capacity in vivo. Int. J. Obes. 35, 982-989 (2011). 104 Tiganis, T. PTP1B and TCPTP - nonredundant phosphatases in insulin signaling and glucose homeostasis. FEBS J. (2012). 105 Tonks, N. K. Protein tyrosine phosphatases: from genes, to function, to disease. Nat. Rev. Mol. Cell Bio.7, 833-846 (2006). 106 Huang, Y. J. et al. Recombinant human butyrylcholinesterase from milk of transgenic animals to protect against organophosphate poisoning. Proc. Natl. Acad. Sci. USA 104, 13603-13608 (2007). 107 Ilyushin, D. G. et al. Chemical polysialylation of human recombinant butyrylcholinesterase delivers a long-acting bioscavenger for nerve agents in vivo. Proc. Natl. Acad. Sci. USA 110, 1243-1248 (2013). 108 Geyer, B. C. et al. Plant-derived human butyrylcholinesterase, but not an organophosphorous-compound hydrolyzing variant thereof, protects rodents against nerve agents. Proc. Natl. Acad. Sci. USA 107, 20251-20256 (2010). 109 De Boer, A., Van der Sandt, I. & Gaillard, P. The role of drug transporters at the blood-brain barrier. Annu. Rev. Pharmacol. 43, 629-656 (2003). 110 Das, M. & Das, D. K. Caveolae, caveolin, and cavins: potential targets for the treatment of cardiac disease. Ann. Med. 44, 530-541 (2012). 111 Narendra, S., Valente, A., Tull, J. & Zhang, S. DDIT3 gene break-apart as a molecular marker for diagnosis of myxoid liposarcoma assay validation and clinical experience. Diagn. Mol. Pathol. 20, 218-224 (2011). 112 Nemoto, K. et al. 
Characteristics of nobiletin-mediated alteration of gene expression in cultured cell lines. Biochem. Biophys. Res. Commun., doi:10.1016/j.bbrc.2013.01.024 (2013). 113 Wilkie, M. J. et al. Polymorphisms in the SLC6A4 and HTR2A genes influence treatment outcome following antidepressant therapy. Pharmacogenomics J. 9, 61-70 (2009). 114 Wrzosek, M. et al. Serotonin 2A receptor gene (HTR2A) polymorphism in alcohol-dependent patients. Pharmacol. Rep. 64, 449-453 (2012). 115 Kim, E. J. et al. Alzheimer's disease risk factor lymphocyte-specific protein tyrosine kinase regulates long-term synaptic strengthening, spatial learning and memory. Cell Mol. Life Sci., doi:10.1007/s00018-012-1168-1 (2013). 116 Venkitachalam, S., Chueh, F. Y., Leong, K. F., Pabich, S. & Yu, C. L. Suppressor of

96

Nature Genetics: doi:10.1038/ng.2811

cytokine signaling 1 interacts with oncogenic lymphocyte-specific protein tyrosine kinase. Oncol. Rep. 25, 677-683 (2011).

117 Simonaro, C. M. et al. Imprinting at the SMPD1 locus: implications for acid sphingomyelinase-deficient Niemann-Pick disease. Am. J. Hum. Genet. 78, 865-870 (2006).

118 Kirkegaard, T. et al. Hsp70 stabilizes lysosomes and reverts Niemann-Pick disease-associated lysosomal pathology. Nature 463, 549-553 (2010).

119 Wang, K., Li, M. & Hakonarson, H. ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data. Nucleic Acids Res. 38, e164 (2010).

120 Myers, E. W. et al. A whole-genome assembly of Drosophila. Science 287, 2196-2204 (2000).

121 Li, R. et al. The sequence and de novo assembly of the giant panda genome. Nature 463, 311-317 (2010).

122 Butler, J. et al. ALLPATHS: De novo assembly of whole-genome shotgun microreads. Genome Res. 18, 810-820 (2008).

123 Li, R. et al. De novo assembly of human genomes with massively parallel short read sequencing. Genome Res. 20, 265-272 (2010).

124 Li, H. & Durbin, R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25, 1754-1760 (2009).

125 Li, R. et al. SNP detection for massively parallel whole-genome resequencing. Genome Res. 19, 1124-1132 (2009).

126 Jurka, J. et al. Repbase Update, a database of eukaryotic repetitive elements. Cytogenet. Genome Res. 110, 462-467 (2005).

127 Bao, Z. & Eddy, S. R. Automated de novo identification of repeat sequence families in sequenced genomes. Genome Res. 12, 1269-1276 (2002).

128 Price, A. L., Jones, N. C. & Pevzner, P. A. De novo identification of repeat families in large genomes. Bioinformatics 21 Suppl 1, 351-358 (2005).

129 Benson, G. Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Res. 27, 573-580 (1999).

130 Stanke, M. & Waack, S. Gene prediction with a hidden Markov model and a new intron submodel. Bioinformatics 19 Suppl 2, 215-225 (2003).

131 Parra, G., Blanco, E. & Guigo, R. GeneID in Drosophila. Genome Res. 10, 511-515 (2000).

132 Salamov, A. A. & Solovyev, V. V. Ab initio gene finding in Drosophila genomic DNA. Genome Res. 10, 516-522 (2000).

133 Majoros, W. H., Pertea, M. & Salzberg, S. L. TigrScan and GlimmerHMM: two open source ab initio eukaryotic gene-finders. Bioinformatics 20, 2878-2879 (2004).

134 Korf, I. Gene finding in novel genomes. BMC bioinformatics 5, 59 (2004).

135 Kent, W. J. BLAT--the BLAST-like alignment tool. Genome Res. 12, 656-664 (2002).

136 Birney, E., Clamp, M. & Durbin, R. GeneWise and Genomewise. Genome Res. 14, 988-995 (2004).

137 Trapnell, C., Pachter, L. & Salzberg, S. L. TopHat: discovering splice junctions with RNA-Seq. Bioinformatics 25, 1105-1111 (2009).

138 Trapnell, C. et al. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat. Biotechnol. 28, 511-515 (2010).

139 Haas, B. J. et al. Automated eukaryotic gene structure annotation using EVidenceModeler and the program to assemble spliced alignments. Genome Biol. 9, R7 (2008).

140 Bairoch, A. & Apweiler, R. The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000. Nucleic Acids Res. 28, 45-48 (2000).

141 Mulder, N. & Apweiler, R. InterPro and InterProScan: tools for protein sequence classification and comparison. Methods Mol. Biol. 396, 59-70 (2007).

142 Punta, M. et al. The Pfam protein families database. Nucleic Acids Res. 40, D290-301 (2012).

143 Ashburner, M. et al. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat. Genet. 25, 25-29 (2000).

144 Kanehisa, M. & Goto, S. KEGG: kyoto encyclopedia of genes and genomes. Nucleic Acids Res. 28, 27-30 (2000).

145 Lowe, T. M. & Eddy, S. R. tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Res. 25, 955-964 (1997).

146 Nawrocki, E. P., Kolbe, D. L. & Eddy, S. R. Infernal 1.0: inference of RNA alignments. Bioinformatics 25, 1335-1337 (2009).

147 Griffiths-Jones, S. et al. Rfam: annotating non-coding RNAs in complete genomes. Nucleic Acids Res. 33, D121-124 (2005).

148 Li, H. et al. TreeFam: a curated database of phylogenetic trees of animal gene families. Nucleic Acids Res. 34, D572-580 (2006).

149 Huang da, W., Sherman, B. T. & Lempicki, R. A. Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat. Protoc. 4, 44-57 (2009).

150 Huang da, W. et al. DAVID Bioinformatics Resources: expanded annotation database and novel algorithms to better extract biology from large gene lists. Nucleic Acids Res. 35, W169-175 (2007).

151 Li, H. et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 2078-2079 (2009).

152 Li, H. & Durbin, R. Inference of human population history from individual whole-genome sequences. Nature 475, 493-496 (2011).

153 Barrett, J. C., Fry, B., Maller, J. & Daly, M. J. Haploview: analysis and visualization of LD and haplotype maps. Bioinformatics 21, 263-265 (2005).
