Purifying Selection Shapes the Coincident SNP Distribution of Primate Coding Sequences

Chia-Ying Chen1, Li-Yuan Hung1, Chan-Shuo Wu1, and Trees-Juen Chuang1,*

1Genomics Research Center, Academia Sinica, Taipei 11529, Taiwan

*Corresponding Author

Trees-Juen Chuang, Ph.D.

Genomics Research Center, Academia Sinica, Taipei 11529, Taiwan
E-mail: [email protected]
Tel: +886 2 27871244
Fax: +886 2 27899923

Supplemental Table S1. Summary of SOLiD sequencing reads for each chimpanzee individual.

Individual ID  Sex     NGS platform  No. of wells  No. of total reads  No. of mapped reads   No. of uniquely mapped reads  Coverage depth of analyzed reads
-------------  ------  ------------  ------------  ------------------  --------------------  ----------------------------  --------------------------------
20050256B10*   male    SOLiD 4       4             171,146,830         109,019,594 (63.70%)  105,965,672 (61.92%)          100.37
20050256B10*   male    SOLiD 3+      2             78,480,998          45,451,670 (57.91%)   44,225,477 (56.35%)           49.69
20040256B10    male    SOLiD 4       2             102,904,080         67,213,850 (65.32%)   65,060,227 (63.22%)           55.08
20060308B10    male    SOLiD 4       2             103,568,155         69,288,535 (66.90%)   67,145,295 (64.83%)           45.89
20060199B10    female  SOLiD 4       2             105,797,061         59,030,081 (55.80%)   56,788,418 (53.68%)           56.16
20060235B10    female  SOLiD 4       2             106,354,100         60,386,029 (56.78%)   58,404,216 (54.91%)           60.38
20060387B10    female  SOLiD 4       2             103,364,130         65,869,658 (63.73%)   63,643,894 (61.57%)           50.76

*The sequencing was technically repeated on both the SOLiD 4 and SOLiD 3+ platforms.

Supplemental Table S2. Sequence coverage across six chimpanzee exomes. Each cell gives the covered length (bp) and the coverage percentage*.

Individual ID  ≥1X                  ≥5X                  ≥8X
-------------  -------------------  -------------------  -------------------
20050256B10    27,705,434; 93.86%   26,884,629; 91.08%   26,380,633; 89.37%
20040256B10    27,169,921; 92.05%   25,709,407; 87.10%   24,666,645; 83.57%
20060308B10    27,125,941; 91.90%   25,415,754; 86.11%   24,151,701; 81.82%
20060199B10    27,135,512; 91.93%   25,741,636; 87.21%   24,748,050; 83.84%
20060253B10    27,135,654; 91.93%   25,857,042; 87.60%   24,955,018; 84.55%
20060387B10    26,810,651; 90.83%   24,983,981; 84.64%   23,743,209; 80.44%

*Coverage percentage = covered length × 100 / length of the captured exome region. The captured exome region is 29,516,842 bp in length.
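As a worked check of the coverage-percentage definition in the footnote of Supplemental Table S2, the following minimal Python sketch recomputes two tabulated values from the covered lengths. The function and constant names are illustrative and are not taken from the original analysis.

```python
# Length of the captured exome region, as stated in the table footnote.
CAPTURED_EXOME_BP = 29_516_842


def coverage_percentage(covered_bp: int) -> float:
    """Coverage percentage = covered length x 100 / captured exome length."""
    return covered_bp * 100 / CAPTURED_EXOME_BP


# Covered lengths at >=1X for individuals 20050256B10 and 20060387B10.
first = round(coverage_percentage(27_705_434), 2)
last = round(coverage_percentage(26_810_651), 2)
print(first, last)
```

Both values agree with the percentages listed in the first column of the table (93.86% and 90.83%).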

Supplemental Table S3. The SNP datasets of gorilla, orangutan, and rhesus macaque used in this study.

Primate species  Description (Ref)                                          No. of coding SNPs examined
---------------  ---------------------------------------------------------  ---------------------------
Gorilla          31 individuals from Rwanda, Cameroon, and Congo (ref. 1)   105,394
Orangutan        dbSNP136 (www.ncbi.nlm.nih.gov/projects/SNP/);             139,797
                 5 Sumatran and 5 Bornean individuals (ref. 1);
                 5 Sumatran individuals (ref. 2)
Rhesus macaque   dbSNP136 (www.ncbi.nlm.nih.gov/projects/SNP/);             82,752
                 5 individuals (ref. 2)

Supplemental Table S4. The patterns of coSNPs and the observed-to-expected ratio for each type of coSNP pattern at zero-fold (i=0), two-/three-fold (i=2 or 3), and four-fold (i=4) degenerate sites (related to Fig. 3A). Rows give the chimpanzee SNP type and columns give the human SNP type; each cell lists the values for i = 0 / 2 or 3 / 4.

Observed (i = 0 / 2 or 3 / 4)

Chimp \ Human  C-G              T-G              T-C              A-G              A-C              A-T
-------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------
C-G            222/99/84        43/19/24         112/87/73        137/74/71        38/20/29         0/0/0
T-G            16/11/11         199/77/53        51/18/11         93/59/64         4/2/2            14/6/11
T-C            106/84/36        49/45/34         1011/872/485     29/19/4          120/84/52        61/49/42
A-G            83/96/40         88/83/76         36/19/7          1001/884/460     45/36/15         64/50/30
A-C            30/18/11         8/3/2            81/59/67         48/19/14         216/87/64        20/9/13
A-T            0/1/0            4/2/3            39/20/12         41/9/14          7/6/5            123/55/38

Observed/expected (i = 0 / 2 or 3 / 4)

Chimp \ Human  C-G              T-G              T-C              A-G              A-C              A-T
-------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------
C-G            3.7/3.3/3.2      0.84/0.85/0.87   0.64/0.83/0.78   0.78/0.72/0.79   0.68/0.88/1.21   0/0/0
T-G            0.39/0.63/0.78   5.72/5.99/3.55   0.43/0.30/0.22   0.78/0.99/1.31   0.10/0.15/0.15   0.56/0.63/1.06
T-C            0.71/0.72/0.59   0.39/0.53/0.53   2.34/2.17/2.22   0.07/0.05/0.02   0.86/0.96/0.93   0.67/0.77/0.94
A-G            0.58/0.82/0.68   0.72/0.96/1.23   0.09/0.05/0.03   2.39/2.19/2.29   0.33/0.40/0.28   0.73/0.78/0.70
A-C            0.69/0.92/0.69   0.22/0.21/0.12   0.64/0.87/1.17   0.37/0.28/0.26   5.28/5.85/4.39   0.75/0.84/1.11
A-T            0/0.11/0         0.20/0.29/0.42   0.58/0.62/0.50   0.60/0.28/0.61   0.32/0.85/0.81   8.64/10.78/7.71

Note. The expected frequency and number of each coSNP pattern at i-fold degenerate sites were estimated on the basis of the corresponding 6×6 contingency table. The six coSNP patterns with the same two alleles in both species (the diagonal cells) were shaded gray in the original table.
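The Note to Supplemental Table S4 states that expected numbers were estimated from the 6×6 contingency table. A minimal sketch of that calculation follows, assuming the usual independence expectation (row total × column total / grand total); the text does not spell out the formula, so this is an assumption, although it reproduces the tabulated ratios for i = 0 (for example 3.7, 2.34, and 8.64 on the diagonal). The counts are the i = 0 values from the "Observed" block; the variable and function names are illustrative.

```python
# SNP types in the order used by Supplemental Table S4.
TYPES = ["C-G", "T-G", "T-C", "A-G", "A-C", "A-T"]

# Observed coSNP counts at zero-fold degenerate sites (i = 0),
# copied row by row from the "Observed" block of the table.
OBSERVED_I0 = [
    [222, 43, 112, 137, 38, 0],
    [16, 199, 51, 93, 4, 14],
    [106, 49, 1011, 29, 120, 61],
    [83, 88, 36, 1001, 45, 64],
    [30, 8, 81, 48, 216, 20],
    [0, 4, 39, 41, 7, 123],
]


def observed_to_expected(table):
    """Return O/E per cell, with E = row total * column total / grand total.

    Assumes every row and column total is non-zero, as is the case here.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [
        [obs * grand_total / (row_totals[i] * col_totals[j])
         for j, obs in enumerate(row)]
        for i, row in enumerate(table)
    ]


ratios = observed_to_expected(OBSERVED_I0)
for i, name in enumerate(TYPES):
    # Diagonal cells: the same two alleles segregate in both species.
    print(name, round(ratios[i][i], 2))
```

The diagonal ratios are all well above 1, which is the excess of identical-allele coSNPs that the table is meant to show.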

Supplemental Table S5. The two-way ANOVA analysis of the effect of the two factors (strength of selective constraints and mutation rate) on coSNPO/E.

Source of variation         df      Sum of squares  Mean square  F value   P value
--------------------------  ------  --------------  -----------  --------  ----------
Selection (s)               3       6183            2061         1335.368  < 2×10^-16
Mutation (μ)                2       6834            3417         2213.951  < 2×10^-16
Selection × Mutation (sμ)   6       14              2            1.555     0.156
Error (ε)                   11988   18501           2

Note. The two-way ANOVA model is $y_{ijk} = m + s_i + \mu_j + (s\mu)_{ij} + \varepsilon_{ijk}$, where $y_{ijk}$, $m$, $s_i$, $\mu_j$, $(s\mu)_{ij}$, and $\varepsilon_{ijk}$, respectively, represent coSNPO/E, the average of coSNPO/E, the selection effect as a four-level factor ($s_1 = -0.15$; $s_2 = -0.1$; $s_3 = -0.05$; $s_4 = -0.01$), the mutation effect as a three-level factor ($\mu_1 = 1 \times 10^{-8}$; $\mu_2 = 5 \times 10^{-8}$; $\mu_3 = 1 \times 10^{-7}$), the interaction effect between selection and mutation, and the error, with $k = 1, \ldots, 1000$ replicates.

Supplemental Table S6. Functional annotation clustering analysis of genes with coSNPi=0. The top two annotation clusters ranked by enrichment scores are listed.

Annotation       Term                                                        P-value   Corrected P-value*
---------------  ----------------------------------------------------------  --------  ------------------

Annotation cluster 1 (enrichment score: 11.58)

PIR_SUPERFAMILY  PIRSF003152:G protein-coupled olfactory receptor, class II  1.8E-20   1.7E-17
SP_PIR_KEYWORDS  olfaction                                                   1.4E-19   3.7E-17
INTERPRO         olfactory receptor                                          2.0E-19   5.3E-16
GOTERM_MF_FAT    Olfactory receptor activity                                 1.5E-18   2.2E-15
KEGG_PATHWAY     Olfactory transduction                                      9.5E-18   1.8E-15
GOTERM_BP_FAT    Sensory perception of smell                                 1.6E-17   6.7E-14
SP_PIR_KEYWORDS  Sensory transduction                                        2.5E-16   3.5E-14
GOTERM_BP_FAT    Sensory perception of chemical stimulus                     1.7E-15   3.4E-12
PIR_SUPERFAMILY  PIRSF800006:rhodopsin-like G protein-coupled receptors      4.7E-12   2.2E-9
GOTERM_BP_FAT    Sensory perception                                          1.0E-11   1.4E-8
SP_PIR_KEYWORDS  g-protein coupled receptor                                  1.4E-10   1.4E-8
INTERPRO         GPCR, rhodopsin-like superfamily                            1.5E-9    2.0E-6
INTERPRO         7TM GPCR, rhodopsin-like                                    2.0E-9    1.8E-6
GOTERM_BP_FAT    cognition                                                   4.8E-9    4.9E-6
SP_PIR_KEYWORDS  transducer                                                  7.5E-9    6.5E-7
SP_PIR_KEYWORDS  receptor                                                    8.3E-9    6.5E-7
GOTERM_BP_FAT    G-protein coupled receptor protein signaling pathway        2.8E-6    1.7E-3
GOTERM_BP_FAT    Neurological system process                                 6.6E-6    3.4E-3
SP_PIR_KEYWORDS  Cell membrane                                               6.9E-6    3.8E-4

Annotation cluster 2 (enrichment score: 8.9)

UP_SEQ_FEATURE   glycosylation site: N-linked (GlcNAc)                       7.3E-22   2.7E-18
SP_PIR_KEYWORDS  glycoprotein                                                1.2E-21   4.8E-19
UP_SEQ_FEATURE   topological domain: Extracellular                           6.2E-13   1.1E-9
UP_SEQ_FEATURE   Topological domain: Cytoplasmic                             1.6E-11   2.4E-8
GOTERM_CC_FAT    Plasma membrane                                             6.2E-6    2.1E-3
UP_SEQ_FEATURE   Transmembrane region                                        1.5E-5    3.4E-3
SP_PIR_KEYWORDS  transmembrane                                               1.8E-5    9.3E-4
GOTERM_CC_FAT    Integral to membrane                                        1.4E-4    1.8E-2
GOTERM_CC_FAT    Intrinsic to membrane                                       6.9E-4    4.1E-2

*The P values were adjusted using the Benjamini-Hochberg correction.
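For readers unfamiliar with the Benjamini-Hochberg correction named in the footnote of Supplemental Table S6, a minimal sketch of the standard step-up adjustment is given below. It is a generic illustration and not the annotation tool's own implementation; the corrected values in the table cannot be reproduced from it because they depend on the full set of terms tested in each annotation category, which is not listed here. The input P values in the example are hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted P values in the input order.

    Each P value is scaled by m / rank (rank 1 = smallest P value), and
    monotonicity is enforced by taking a running minimum from the largest
    P value downwards.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical raw P values, for illustration only.
example = [0.01, 0.04, 0.03, 0.005]
print(benjamini_hochberg(example))
```

An adjusted value is never smaller than its raw P value, and the ranking of the terms is preserved.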

Inputs: exon-captured reads from six chimpanzees; the chimpanzee genome (PanTro3)

1. Read-to-genome alignment (Novoalign):
   - excluding reads with multiple hits
   - excluding matched reads not located within human-chimpanzee orthologous CCDSs

2. Post-alignment process:
   - excluding matched reads located within CNV or repetitive regions
   - excluding matched regions with low coverage (depth < 8)

3. SNP calling by SAMtools (QV ≥ 30): 21,306 SNPs

4. SNP authenticity examination:
   - there must be at least one homozygous individual whose two alleles are the same as the reference genomic sequence
   - the variants must be supported by both left- and right-half parts of reads
   - the variants must be simultaneously supported by variant calling with different gap penalties

Result: 11,868 SNPs (11,171 SNPs in coding regions)

Supplemental Figure S1. The procedure of identifying chimpanzee SNPs in CE6.
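The SNP authenticity examination in Supplemental Figure S1 amounts to three conditions that every candidate SNP must satisfy together. The sketch below illustrates that logic only; the record layout, field names, and function name are hypothetical and do not come from the original pipeline, which operated on Novoalign alignments and SAMtools calls.

```python
from dataclasses import dataclass


@dataclass
class CandidateSNP:
    # All fields are hypothetical summaries of per-site evidence.
    position: str
    has_reference_homozygote: bool   # >= 1 individual homozygous for the reference allele
    left_half_support: int           # reads supporting the variant in their left half
    right_half_support: int          # reads supporting the variant in their right half
    called_with_all_gap_penalties: bool  # called under every gap-penalty setting tried


def passes_authenticity(snp: CandidateSNP) -> bool:
    """Apply the three authenticity criteria listed in the figure."""
    return (
        snp.has_reference_homozygote
        and snp.left_half_support > 0
        and snp.right_half_support > 0
        and snp.called_with_all_gap_penalties
    )


# Hypothetical candidates, for illustration only.
candidates = [
    CandidateSNP("site_a", True, 5, 4, True),
    CandidateSNP("site_b", True, 6, 0, True),   # no right-half support
    CandidateSNP("site_c", False, 3, 3, True),  # no reference homozygote
]
retained = [s.position for s in candidates if passes_authenticity(s)]
print(retained)
```

In the study, this step reduced the 21,306 called SNPs to 11,868.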

[Figure: coSNPO/E (y-axis) for 0-fold, 2/3-fold, 4-fold, and intron sites (x-axis).]

Supplemental Figure S2. The coSNPO/E ratios of zero-fold, two-/three-fold, and four-fold degenerate nucleotides are determined by the comparisons between human SNPs derived from nine individuals (ref. 1) and the chimpanzee SNPs analyzed in this study (excluding SNPs located within CpG dinucleotides). The full list of identified coSNPs is publicly available at http://treeslab1.genomics.sinica.edu.tw/coSNPs.html.

[Figure, panel (A): observed numbers of chimpanzee SNPs and of Homo-Pan coSNPs (y-axes) against the number of sampling individuals (x-axis, 2 to 25), plotted separately for zero-fold, two-/three-fold, and four-fold degenerate sites and for introns. Series: CE6, CE12, CW5, CW10, and CW25 means, each with a fitted curve of the form y = a·ln(x) + b and its R².]

[Figure, panel (B): projected numbers of chimpanzee SNPs and of Homo-Pan coSNPs (y-axes) against the number of sampling individuals (x-axis, 0 to 1,000) for the same four site classes. Series: CE6, CE12, CW5, CW10, and CW25 predictions.]

[Figure, panel (C): coSNPO/E (y-axis) for 0-fold, 2/3-fold, 4-fold, and intron sites (x-axis), for CE6, CE12, CW5, CW10, and CW25, with significance marks (**, *, NS).]

Supplemental Figure S3. Estimation of the coSNPO/E ratios based on the coSNPs determined by comparing human SNPs (dbSNP138) with each of the five chimpanzee SNP datasets used (CE6, CE12, CW5, CW10, and CW25 SNPs), when the number of chimpanzee individuals exceeded 1,000. (A) Observed numbers, with fitted log-linear models and coefficients of determination (R²). (B) Projected numbers of chimpanzee SNPs and Homo-Pan coSNPs at the indicated sample sizes, using the fitted models in (A). (C) The coSNPO/E ratios for different types of i-fold degenerate sites and intronic sequences, according to the simulated chimpanzee SNPs and Homo-Pan coSNPs.
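The fitting and projection steps described for panels (A) and (B) can be sketched as an ordinary least-squares fit of y = a·ln(x) + b followed by evaluation of the fitted curve at a larger sample size. The sketch below assumes plain least squares on ln(x), which the caption does not state explicitly. The input data are hypothetical, generated from arbitrary coefficients (a = 1000, b = 200) rather than taken from the study, and the function names are illustrative.

```python
import math


def fit_log_linear(xs, ys):
    """Least-squares fit of y = a * ln(x) + b; returns (a, b, r_squared)."""
    t = [math.log(x) for x in xs]
    n = len(xs)
    t_mean = sum(t) / n
    y_mean = sum(ys) / n
    s_tt = sum((ti - t_mean) ** 2 for ti in t)
    s_ty = sum((ti - t_mean) * (yi - y_mean) for ti, yi in zip(t, ys))
    a = s_ty / s_tt
    b = y_mean - a * t_mean
    ss_res = sum((yi - (a * ti + b)) ** 2 for ti, yi in zip(t, ys))
    ss_tot = sum((yi - y_mean) ** 2 for yi in ys)
    return a, b, 1 - ss_res / ss_tot


def project(a, b, n_individuals):
    """Evaluate the fitted curve at a given number of sampled individuals."""
    return a * math.log(n_individuals) + b


# Hypothetical mean SNP counts for 2 to 25 sampled individuals.
sizes = list(range(2, 26))
counts = [1000 * math.log(n) + 200 for n in sizes]

a, b, r2 = fit_log_linear(sizes, counts)
print(round(a, 2), round(b, 2), round(r2, 4))
print(round(project(a, b, 1000), 1))
```

Projected SNP and coSNP counts obtained in this way for each site class would then feed the coSNPO/E calculation summarized in panel (C).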

[Figure, panel (A): sequence logos (y-axis: bits) for i = 0, i = 2 or 3, and i = 4.]

[Figure, panel (B): MEME motifs, with the following support.]

i = 0: 1714/4327 sequences, E-value 1.3×10^-535; 691/4327 sequences, E-value 2.2×10^-331
i = 2 or 3: 1126/3341 sequences, E-value 1.6×10^-532
i = 4: 835/2479 sequences, E-value 1.2×10^-219; 266/2479 sequences, E-value 4.4×10^-109

Supplemental Figure S4. Motif analysis based on the (A) Weblogo3 and (B) MEME predictions around the identified coSNPs at zero-fold (i=0), two-/three-fold (i=2 or 3), and four-fold (i=4) degenerate sites. (A) Motif logos showing the frequencies scaled according to the information content at each position (from -3 to +3) relative to the coSNPs with 95% confidence intervals. (B) The MEME motifs around the coSNPs (within -50 nucleotides to +50 nucleotides of the examined sites). Only the motifs supported by >100 sequences were listed. Bit values range from 0 to 2, with higher values indicating higher degrees of conservation.

[Figure, panels (A) and (B): coSNP density (10^-6, y-axis) for 0-fold (i = 0), 2/3-fold (i = 2 or 3), 4-fold (i = 4), and coding sites, against (A) SNP density (10^-4) and (B) average recombination rate.]

Supplemental Figure S5. Distribution of coSNP density of zero-fold (i=0), two-/three-fold (i=2 or 3), and four-fold (i=4) degenerate nucleotides in the 1M-bp windows (see the text) of different levels of (A) SNP density and (B) average recombination rate.

Supplemental Figure S6. Pearson's correlation between read depths of SOLiD 3+ and SOLiD 4 data derived from the same sample (chimpanzee individual ID: 20050256B10; see Supplemental Table S1).
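The comparison summarized in Supplemental Figure S6 is a Pearson correlation between paired per-site read depths from the two platforms. A minimal sketch of the coefficient is shown below; the depth values are hypothetical and the function name is illustrative, since the underlying per-site depths are not listed in this supplement.

```python
import math


def pearson_r(xs, ys):
    """Pearson's correlation coefficient for two equal-length sequences."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    s_xx = sum((x - x_mean) ** 2 for x in xs)
    s_yy = sum((y - y_mean) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)


# Hypothetical read depths at the same sites on the two platforms.
depth_solid4 = [100, 80, 120, 60, 140]
depth_solid3plus = [52, 41, 63, 30, 69]
print(round(pearson_r(depth_solid4, depth_solid3plus), 3))
```

A coefficient close to 1 indicates that the two runs cover the captured regions in proportion to each other, even though their absolute depths differ.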

Supplemental Figure S7. Examples of false positive SNPs arising from equivocal read-to-genome alignments: (A) the called variant is not supported by any right-half parts of reads; and (B) none of the right-half parts of reads supporting the called variant are qualified, as each part contains both the called variant and other variant(s)/mismatch(es). The called variant is highlighted in blue. The reads supporting the variant are highlighted in yellow; the other reads around the called variant are highlighted in gray. For each read, dots and "N"s denote the nucleotides of exact matches and uncertain mismatches, respectively.

[Figure: proportion (y-axis, 0 to 0.45) against the number of derived variants (x-axis, 1 to 12).]

Supplemental Figure S8. Derived allele frequency distribution of the identified nonsynonymous SNPs in CE6.

Supplemental References

1. Prado-Martinez, J. et al. Great ape genetic diversity and population history. Nature 499, 471-475 (2013).
2. Gokcumen, O. et al. Primate genome architecture influences structural variation mechanisms and functional consequences. Proceedings of the National Academy of Sciences of the United States of America 110, 15764-15769 (2013).