bioRxiv preprint doi: https://doi.org/10.1101/367482; this version posted July 11, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Identifying Lineage-specific Targets of Darwinian Selection by a Bayesian Analysis of Genomic Polymorphisms and Divergence from Multiple Species

Shilei Zhaoa,c,1, Tao Zhangb,c,1, Qi Liua,c, Yongming Liua,c, Hao Wua,c, Bing Sub,c, Peng Shib,c, Hua Chena,c,∗

aCAS Key Laboratory of Genomic and Precision Medicine, Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing 100101, China bState Key Laboratory of Evolution and Genetic Resources, Kunming Institute of Zoology, Chinese Academy of Sciences, Kunming 650223, China cUniversity of Chinese Academy of Sciences, Beijing 100049, China

Abstract

We present a method that jointly analyzes the polymorphism and divergence sites in ge- nomic sequences of multiple species to identify the under positive or negative selection and pinpoints the occurrence time of selection to a specific lineage of the species phylogeny. This method integrates population genetics models using the Bayesian Poisson random field framework and combines information over all loci to boost the power to detect selec- tion. The method provides posterior distributions of the fitness effects of each gene along with parameters associated with the evolutionary history, including the species divergence times and effective population sizes of external species. A simulation is performed, and the results demonstrate that our method provides accurate estimates of these population genetic parameters. The proposed method is applied to genomic sequences of , , and , and a spatial and temporal map is constructed of the natural selection that occurred during the evolutionary history of the four Hominidae species. In addition to FOXP2 and other known genes, we identify a new list of lineage-specific targets of Darwinian selection. The positively selected genes in the lineage are enriched in pathways of gene expression regulation, immune system, metabolism etc. Interestingly, some pathways, such as gene expression, are significantly enriched with positively selected genes, whereas other

∗Corresponding author: [email protected] 1These two authors contributed equally.

1 bioRxiv preprint doi: https://doi.org/10.1101/367482; this version posted July 11, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

pathways, such as metabolism, are enriched with both positively and negatively selected genes. Our analysis provides insights into Darwinian evolution in the coding regions of humans and great apes and thus serves as a basis for further molecular and functional studies. Keywords: positive selection, adaptive evolution, dN/dS ratio, MK test, Bayesian, Poisson random field.

1 1. Introduction

2 Comparing the genomic sequences of multiple species is useful for studying evolutionary

3 mechanism and identifying functionally important genes (Fay et al., 2002; Smith and Eyre-

4 Walker, 2002; Clark et al., 2003; Rogers and Gibbs, 2014). Codon comparison across different

5 species is perhaps the most widely used approach in comparative genomics (Li et al., 1985;

6 Hughes and Nei, 1988; Yang, 1998). This method estimates the ratio of the number of

7 replacement sites to the number of synonymous sites for different species (dN/dS ratio) to

8 detect the genes under recurrent directional selection. The method was further developed

9 to identify ratio changes on a specific branch along the phylogenetic tree (Yang, 1998; Yang

10 and Nielsen, 2002; Yang, 2007; Zhang et al., 2005). These methods were designed to analyze

11 single gene locus with one sequence per species without making use of population genetic

12 samples, and they have been extensively used to identify targets of natural selection in

13 various species (e.g., Clark et al. (2003)).

14 The McDonald-Kreitman test (MK test) is another method that uses codon replacement

15 to detect Darwinian selection. The MK test compares coding sequences from two species

16 and requires that at least one of the two species has multiple sequences. Polymorphism sites

17 (within each species) and divergence sites (between species) of coding sequences are identified

18 in the sample and further classified as synonymous or replacement (non-synonymous) sites.

19 A contingency table is constructed based on the above four site types, and the chi-square test

20 is used to examine the equality of the within-species ratios and the between-species ratios of

21 replacements over synonymous sites. The underlying assumptions of the MK test are that

22 (1) synonymous sites are under neutrality while non-synonymous sites are potentially under

23 positive or negative selection; and (2) if one of the two species is under long-term recurrent

2 bioRxiv preprint doi: https://doi.org/10.1101/367482; this version posted July 11, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

1 selection since their divergence, then the ratio of the non-synonymous site number over the

2 synonymous site number for the between-species divergence sites will be significantly greater

3 than that of the within-species sites, and this situation is reversed for negative selection

4 (McDonald and Kreitman, 1991).

5 Comparing within-species polymorphism to between-species divergence improves the

6 power of detecting selection and enables the ability to control for the inter-locus hetero-

7 geneity of rates, and it can also distinguish between positive selection and the

8 relaxation of negative selection (Wyckoff et al., 2000). However, the MK test analyzes each

9 gene individually and has limited power to detect loci under weak or moderate selection,

10 especially when the numbers of these sites are small. Moreover, the chi-square test is not

11 model based and cannot provide the strength and direction of selection acting on

12 in the . To tackle these problems, Bustamante et al. (2001) extended the MK test to

13 embrace the population genetic model using the Poisson random field theory developed by

14 Sawyer and Hartl (1992). The McDonald-Kreitman Poisson random field method (MKPRF)

15 created by Bustamante et al. (2001) assumes that the number of sites in the MK table fol-

16 lows Poisson distributions, and their means are parameterized as functions of the population

17 history, mutation rates and selection intensity. The method then uses a Bayesian approach

18 to combine information across different gene loci and obtain the posterior distributions of

19 selection intensity for non-synonymous sites of each gene.

20 Bustamante et al. (2005) applied the MKPRF Bayesian approach to analyze 11, 624 cod-

21 ing regions of 39 human and 1 sequences, and they identified 304 genes under

22 rapid evolution since the divergence of the two species. However, the MKPRF approach

23 applies only to directionless comparisons of two species; therefore, the method cannot deter-

24 mine whether selection occurred in the human or the chimpanzee lineage.

25 With the development of sequencing technology, abundant population genomic data are

26 now available for multiple species. However, few efficient methods are available that can

27 simultaneously analyze genomic polymorphisms and divergence from multiple species. The

28 method developed in this paper is designed to satisfy a growing request for such methods, and

29 it is an extension of the MKPRF method that simultaneously analyzes polymorphism sites

30 and divergence sites from multiple species (aka, high-dimensional MKPRF, HDMKPRF).

3 bioRxiv preprint doi: https://doi.org/10.1101/367482; this version posted July 11, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

1 Based on the joint pattern of divergence and polymorphism sites from multiple species, the

2 proposed method not only identifies genes under selection in any of the analyzed species but

3 also pinpoints the occurrence time of selection to a specific lineage of the species phylogeny.

4 Compared with codon-based phylogenetic methods, such as PAML (phylogenetic analysis

5 by maximum likelihood), the new method gains power by jointly analyzing both between-

6 species divergence and within-species polymorphism sites and combining the information

7 from all gene loci with a Bayesian approach. Furthermore, the method provides estimates of

8 phylogenetic and population genetic parameters, such as the divergence times of species, mu-

9 tation rates of each gene loci and effective population sizes of different species. The method

10 constructs the spatial and temporal landscape of natural selection during the evolutionary

11 history of the species. We applied the method to the genomic sequences of four species of

12 Hominidae: humans, chimpanzees, gorillas and orangutans. We identified a set of genes

13 under rapid adaptation in the human and chimpanzee lineages as well as genes under weak

14 purifying selection. The positively selected genes are enriched in pathways related to gene

15 expression regulation, immunity, metabolism, neurological diseases, etc, thereby providing

16 a natural selection map for further investigation of the molecular mechanism in Hominidae

17 evolution.

18 Methods

19 Poisson random field model (PRF) for three species

20 In the scenario involving three species (Figure 1), we analyze n1, n2 and n3 aligned

21 sequences for a from the three species. If we assume an infinite-sites mutation

22 model with no introgression among species, the polymorphism and divergence sites can be

23 classified into 14 types according to their joint patterns across the three species (see Table 1).

24 For example, polymorphism sites that are segregating within species 1 and fixed in species

25 2 and 3 are denoted as P1∼(2,3); divergence sites with one allele that is fixed in species 1

26 and another that is fixed in species 2 and 3 are denoted as D1∼(2,3) etc. Following the MK

27 test, a chi-square test with six degrees of freedom can be immediately constructed based on

28 Table 1 to test the deviation of entries of the contingency table from randomness. In the

4 bioRxiv preprint doi: https://doi.org/10.1101/367482; this version posted July 11, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

(A) (B)

Figure 1: An illustration of the genealogies for three species and four species and the parameters.

Synonymous Replacement 1 2γ1 1 Divergence D ∼ θs, (T − T + M(n )) θr, − γ (T − T + G(n , γ )) 12 ∼ ( , 3) 1 (2,3) 1 12 2 23 1 1 1−e( 2 1 ) 12 2 23 1 1 2γ1 Polymorphism P ∼ θ L(n ) θ − F (n , γ ) 1 (2,3) s,1 1 r,1 1−e( 2γ1 ) 1 1 1 2γ2 1 Divergence D ∼ θs, ( v T + M(n )) θr, − γ ( v T + G(n , γ )) 21 ∼ ( , 3) 2 (1,3) 2 2 3 23 2 2 1−e( 2 2 ) 2 2 23 2 2 2γ2 Polymorphism P ∼ θ L(n ) θ − F (n , γ ) 2 (1,3) s,2 2 r,2 1−e( 2γ2 ) 2 2 1 2γ3 1 Divergence D ∼ θs, ( v T + M(n )) θr, − ( v T + G(n , γ )) 31 ∼ ( , 2) 3 (1,2) 3 2 3 23 3 3 1−e( 2γ3 ) 2 3 23 3 3 2γ3 Polymorphism P ∼ θ L(n ) θ − F (n , γ ) 3 (1,2) s,3 3 r,3 1−e( 2γ3 ) 3 3

2γ4 ∼ θ − I n , n , γ , γ , γ (2) , 3 ∼ 1 Polymorphism P(2,3) 1 θs,4H(n2, n3,N2,N3,T23) r,4 1−e 2γ4 ( 2 3 2 3 4)

Table 1: Three-species McDonald-Kreitman table.

1 following sections, we set up the model using the more efficient Bayesian Poisson random

2 field framework. 3 In the Poisson random field modeling assumption, each of the 14 entries in Table 1 follows

4 a Poisson distribution, and the means are parameterized with population genetic models

5 (Sawyer and Hartl, 1992). The population genetic parameters Γ include: the divergence

6 time between species 1 and the common ancestor of species 2 and 3 (T123), the divergence

7 time between species 2 and 3 (T23), and the effective haploid population sizes of the three i 8 species (N1, N2, and N3). For each gene locus i, mutation rate µs is observed for synonymous i i i 9 sites, mutation rate µr is observed for replacement sites, and selection intensities γ1, γ2, and i 10 γ3 are observed in the three species.

11 s Synonymous polymorphism sites in species 3, which are denoted by P3∼(1,2), are neutral

12 mutations that have occurred in species 3 since the divergence of species 2 and 3. If the

13 s divergence time is sufficiently large, then the population allele frequency x of P3∼(1,2) follows

5 bioRxiv preprint doi: https://doi.org/10.1101/367482; this version posted July 11, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

1 a stationary distribution (Sawyer and Hartl, 1992)

2Nµ f(x) = s , (1) x

2 where N = N3. Thus, the expected number in a sample with n3 sequences is:

1 n3 n3 s 1 − x − (1 − x) EP3∼(1,2) = 2N3µs dx Z0 x n3−1 1 = 2N µ = θ L(n ), (2) 3 s i s,3 3 Xi=1

n−1 3 1 with L(n) = 1 i . P 4 s The synonymous divergence sites, D3∼(1,2), include those sites segregating in the popula-

5 tion but fixed in the sample:

1 n3 dx 2N3µsx (3) Z0 x 1 = 2N3µs = θs,3M(n3). (4) n3

6 The synonymous divergence sites also include those sites fixed in the population of species

7 3: µsT23. If we scale time T23 in units of N1, then

N1T23 1 µsT23 = 1/2θ3 ∗ = θs,3T23v3, (5) N3 2

1 8 N and v3 = N3 .

9 For non-synonymous (replacement) sites, we know that the stationary population allele

10 frequency under selection is as follows:

2Nµ 1 − e−2γ(1−x) g(x) = r , (6) 1x( − x) 1 − e−2γ

11 where x is the population allele frequency and γ = Ns, with s representing the selection

6 bioRxiv preprint doi: https://doi.org/10.1101/367482; this version posted July 11, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

1 intensity (Sawyer and Hartl, 1992). The fixation rate is:

2γ µ , (7) r 1 − e−2γ

2 Thus, the number of replacement sites along branch leading to species 3 is:

1 2γ3 θr,3v3T23µr , (8) 2 1 − e−2γ3

3 where T23 is in units of N1 generations.

4 The number of replacement sites fixed in the sample n3 is as follows:

1 2N µ 1 − e−2γ3(1−x) xn3 3 r dx −2γ3 Z0 1x( − x) 1 − e 2γ3 = 2N3µr G(n3, γ3) (9) 1 − e−2γ3

5 with 1 1 − e−2γ(1−x) G(n, γ) = xn−1 dx, (10) Z0 21γ( − x)

6 r Similarly, the expected number of segregating replacement sites, Ps∼(1,2), is:

2γ3 2N3µr F (n3, γ3), (11) 1 − e−2γ3

7 with 1 1 − xn − (1 − x)n 1 − e−2γ(1−x) F (n, γ) = dx. (12) Z0 1x( − x) 2γ

8 Through similar logic, we can obtain the expected number of divergence and polymor-

9 phism sites for P2∼(1,3), D2∼(1,3), P1∼(2,3) and D1∼(2,3) (see Table 1 for details).

10 In the aforementioned paragraphs, we assume that the divergence time T23 is sufficiently

11 large; thus, the chance of observing polymorphism sites shared by species 2 and 3, P(2,3)∼1,

12 has a very low probability. However, for closely-related species with T23 ≤ TMRCA of n2,

13 the expected number of neutral polymorphic sites segregating in species 2 and 3, P(2,3)∼1,

14 cannot be ignored and should be calculated from the joint allele frequency of species 2 and

7 bioRxiv preprint doi: https://doi.org/10.1101/367482; this version posted July 11, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

1 3 f(y,z|T23,N2,N3) (see AppendixA for details):

1 1 (1 − yn2 − (1 − y)n2 )(1 − zn3 − (1 − z)n3 ) Z0 Z0 ×f(y, z|T23,N2,N3,N4, µs)dydz

= θs,4H(n2, n3,N2,N3,T23), (13)

2 where θs,4 = 2N4µs stands for scaled mutation rate in species 4 and y and z are the allele

3 frequencies in species 2 and 3, respectively. Additionally,

1 1 1 n2 n2 n3 n3 H(n2, n3,N2,N3,T23) = (1 − y − (1 − y) )(1 − z − (1 − z) ) Z0 Z0 Z0 1 × φ(y|x,T ,N )φ(z|x, T ,N )dxdydz, (14) x 23 2 23 3

4 with φ(y|x, T, N) representing the transient allele frequency distribution conditional on an

5 initial frequency x, population size N and time T (see Equation 25 for the detailed form).

6 The expected number of selected polymorphic sites segregating in species 2 and 3 is

1 1 (1 − yn2 − (1 − y)n2 )(1 − zn3 − (1 − z)n3 ) Z0 Z0 ×g(y, z|T23,N2,N3, γ4)dydz

2γ4 = θr,4 I(n2, n3, γ2, γ3, γ4), (15) 1 − e−2γ4

7 where γ2, γ3, and γ4 are the selection coefficients in species 2, 3 and, 4 respectively. And,

1 1 1 (1 − yn2 − (1 − y)n2 )(1 − zn3 − (1 − z)n3 ) I(n2, n3, γ2, γ3, γ4) = Z0 Z0 Z0 21x( − x)γ4 ×ψ(y|x, T23,N2, γ2)ψ(z|x, T23,N3, γ3)dxdydz, (16)

8 with ψ(y|x, T, N, γ) representing the transient allele frequency distribution conditional on

9 an initial frequency x, population size N, time T and selection intensity γ (see Appendix A

10 for details).

11 s Accordingly, when under the assumption T23 ≤ TMRCA of n2, P3∼(1,2) and other entries

12 in Table 1 should be derived in a new form, which can be found in Appendix A.

8 bioRxiv preprint doi: https://doi.org/10.1101/367482; this version posted July 11, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

1 Assuming a Poisson random field model, the joint probability of the data given parameter

l l l 2 Γ = {θs,i, θr,i,T123,T23, γi, i ∈ {1, 2, 3}, 1 ≤ l ≤ L} is the product of the individual entries of

3 Table 1 for all the L gene loci:

Pr(P, D|γ, θ, T) L l l l l l l l l l = Pr(Dc,i∼(j,k)|T, γc, θc) Pr(Pc,i∼(j,k)|T, γc, θc)Pr(Pc,(2,3)∼1|T, γc, θc(17)), Yl=1 c∈{Ys,r} i,j,k∈{Y1,2,3}

4 where Pr(·) denotes the Poisson distributions.

5 Poisson random field model (PRF) for four species

6 The phylogeny of four species is shown in Figure 1 (B). Similar to the three-species

7 case, the four-species data contains branch-specific divergence and polymorphism sites (e.g.,

8 D1∼(2,3,4), P1∼(2,3,4), etc.). In addition, there are some unique site patterns corresponding to

9 internal branches connecting species 5 and 6. Mutations that occur on this branch could

10 have generated multiple site patterns in the modern four-species samples. These patterns

11 could be created by polymorphism sites shared by species 3 and 4, which is denoted by

12 P(3,4)∼(1,2); by sites with one allele type fixed in species 3 and 4, and the other allele type

13 fixed in species 1 and 2, which is denoted by D(3,4)∼(1,2); or by sites with one allele type that

14 is fixed in species 3 and another fixed in species 1, 2 and 4 (or perhaps the reverse). Since we

15 assume no migration or introgression between species and an infinite-sites mutation model,

16 common mutations shared between species 3 and 4 can only be descended from ancestral

17 mutations existing in N5. Similar to Ps,(2,3)∼1 in the three-species scenario, the expected

18 number of neutral polymorphic sites segregating in species 3 and 4 is as follows:

1 1 (1 − yn3 − (1 − y)n3 )(1 − zn4 − (1 − z)n4 ) Z0 Z0 ×f(y, z|T34,N3,N4)dydz

= θs,5H(n3, n4,N3,N4,T34), (18)

19 where y and z now represent the allele frequencies in species 3 and 4, respectively (see

20 Appendix A for details). Similarly, the expected number of polymorphic replacement sites

9 bioRxiv preprint doi: https://doi.org/10.1101/367482; this version posted July 11, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

1 segregating in species 3 and 4 is

1 1 (1 − yn3 − (1 − y)n3 )(1 − zn4 − (1 − z)n4 ) Z0 Z0 ×g(y, z|T34,N3,N4, γ5)dydz

2γ5 = θr,5 I(n3, n4, γ3, γ4, γ5). (19) 1 − e−2γ5

2 If the divergence time between species 3 and 4 is sufficiently large, then these common

3 ancestral polymorphic sites are mostly lost or fixed, the values of H(n3, n4) and I(n3, n4)

4 become negligible, and P(3,4)∼(1,2) collapses into P3∼(1,2,4), P4∼(1,2,3) and D(3,4)∼(1,2). Thus,

5 although P(3,4)∼(1,2) exists, the number could be only represent a small proportion and pro-

6 vides limited information for inference. This is consistent with the genomic data we observed

7 in the four Hominidae species (see the result section), with P(n3,n4)∼(1,2) only representing

8 0.2896% of the total number of segregating sites (197, 878).

9 s The expected number of fixed synonymous sites D(3,4)∼(1,2) includes two components:

1 1 n3 n4 y z × f(y, z|T,N3,N4)dydz Z0 Z0 = θs,5J(n3, n4,N3,N4,T34), (20)

10 and 1 θ (T − T )v , (21) 2 s,5 234 34 5

11 with v5 = N1/N5.

12 r Similarly, the expected number of fixed nonsynonymous sites D(3,4)∼(1,2) includes two

13 parts:

1 1 n3 n4 y z × g(y, z|T,N3,N4, γ5)dydz Z0 Z0 2γ5 = θr,5 K(n3, n4, γ3, γ4, γ5). (22) 1 − e−2γ5

14 and

10 bioRxiv preprint doi: https://doi.org/10.1101/367482; this version posted July 11, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Synonymous Replacement 1 2γ1 1 Divergence D ∼ θ (T − T + M(n )) θ − (T − T + G(n , γ )) 12 ∼ ( , 3, 4) 1 (2,3,4) s,1 1234 2 234 1 r,1 1−e 2γ1 1234 2 234 1 1 2γ1 P ∼ θ L n θ − F n , γ Polymorphism 1 (2,3,4) s,1 ( 1) r,1 1−e 2γ1 ( 1 1) 1 2γ2 1 Divergence D ∼ θ ( v T + M(n )) θ − ( v T + G(n , γ )) 21 ∼ ( , 3, 4) 2 (1,3,4) s,2 2 2 234 2 r,2 1−e 2γ2 2 2 234 2 2 2γ2 P ∼ θ L n θ − F n , γ Polymorphism 2 (1,3,4) s,2 ( 2) r,2 1−e 2γ2 ( 2 2) 1 2γ3 1 Divergence D ∼ θ ( v T + M(n )) θ − ( v T + G(n , γ )) 31 ∼ ( , 2, 4) 3 (1,2,4) s,3 2 3 34 3 r,3 1−e 2γ3 2 3 34 3 3 2γ3 P ∼ θ L n θ − F n , γ Polymorphism 3 (1,2,4) s,3 ( 3) r,3 1−e 2γ3 ( 3 3) 1 2γ4 1 Divergence D ∼ θ ( v T + M(n )) θ − ( v T + G(n , γ )) 41 ∼ ( , 2, 3) 4 (1,2,3) s,4 2 4 34 4 r,4 1−e 2γ4 2 4 34 4 4 2γ4 P ∼ θ L n θ − F n , γ Polymorphism 4 (1,2,3) s,4 ( 4) s,4 1−e 2γ4 ( 4 4) 1 2γ5 1 D ∼ θ v T T θ − v T T Divergence (3,4) (1,2) s,5( 2 5( 234 − 34) s,5 1−e 2γ5 ( 2 5( 234 − 34) (3) , 4 ∼ (1, 2) +n J(n3, 4,N3,N4,T34)) +K(n3, n4, γ3, γ4, γ5)) 2γ5 P ∼ θ H n , n ,N ,N ,T θ − I n , n , γ , γ , γ Polymorphism (3,4) (1,2) s,5 ( 3 4 3 4 34) r,5 1−e 2γ5 ( 3 4 3 4 5)

Table 2: Four-species McDonald-Kreitman table.

1 2γ5 θr,5v5(T234 − T34) . (23) 2 1 − e−2γ5

1 The first part contains a small proportion when the divergence time T34 is large.

2 Results

3 Simulation

4 We tested the performance of the method using simulation data. We simulated 20, 000

5 genes for the four-species scenario in Figure 1, and the population history parameters were

6 set to approximate the evolutionary history of humans, chimpanzees, gorillas and orangutans

7 inferred in previous studies (Prado-Martinez et al., 2013), and the sample includes n4 = 10,

8 n3 = 24, n2 = 20, and n1 = 20 haplotypes from the four species, respectively. We set the

9 effective haploid population sizes at N2 = N1, N3 = 1.2N1, and N4 = 0.8N1. The divergence

10 times were T34 = 4, T234 = 6 and T1234 = 12 in units of 2N1. The scaled mutation rate

11 for synonymous sites for each gene locus, θs = 2N1µs, and the scaled mutation rate for

12 replacement sites, θr = 2N1µr, were chosen from several fixed values 1, 2, ... , 5. Among the

13 20, 000 genes, 1400 were under selection in species 4 or 5 (the common ancestor of humans and

14 chimpanzees), and the other 18, 600 genes were neutral. The selection intensities γi, i = 4, 5

15 of every 100 genes were chosen from fixed values in the range of (−6, −4, −2, 0, 2, 4, 6). Given

11 bioRxiv preprint doi: https://doi.org/10.1101/367482; this version posted July 11, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

1 the values of these parameters, simulated data were generated from Poisson distributions,

2 and the means were calculated based on the formulae in Table 2. We then applied the method

3 to the simulated data, and the maximum a posteriori (MAP) estimates of the parameters

4 were recorded. The above simulations were repeated for 100 times. The boxplots of the

5 inferred selection intensity, scaled mutation rates and divergence times are shown in Figure

6 2. For the global parameters, such as the divergence times T1234, T234 and T34, the inferred

7 values are accurate and unbiased since sufficient information for these parameters is derived

8 from the Bayesian joint analysis of all 20, 000 gene loci. The other locus-specific parameters,

9 including the selection intensity and mutation rates of different branches, are generally also

10 unbiased. However, we note that for selection intensity, when under strong negative selection,

11 the inferred values become biased toward smaller values, because few or zero divergence and

12 polymorphism sites are observed when the gene is under strong negative selection, which

13 provides limited information for parameter inference (Bustamante et al., 2005). Overall, we

14 evaluated the performance of HDMKPRF with simulated data, and the boxplots shown in

15 Figure 2 demonstrate that the inferred parameter values well match the true values.

16 Selection in Hominidae

17 We applied the HDMKPRF method to multiple genomic sequences of humans, chim-

18 panzees, gorillas and orangutans from Prado-Martinez et al. (2013). The data were generated

19 via NGS technology with an average sequencing depth of 25, and details of the SNP call-

20 ing pipeline and filtering criteria can be found in the original paper (Prado-Martinez et al.,

21 2013). After excluding several individuals based on further criteria described in Cagan et al.

22 (2016), the final dataset in our analysis includes Pongo pygmaeus (5), Gorrilla (12),

23 Pan troglodytes ellioti (9) and Homo sapiens (9). We aligned the sequences of 23, 362 genes

24 for these samples, from which 5, 429 genes with no coding information were excluded

25 from the analysis. For the remaining 17, 933 genes (17, 234 from autosomes and 699 from X

26 ), we used ANNOVAR (Wang et al., 2010) to annotate the SNPs in the coding

27 regions. Synonymous and replacement polymorphism and divergence sites were identified

28 to construct the four-species MK tables for each gene (see entries of Table 2). In total,

29 133 genes with no divergence and segregating sites in all lineages were excluded from the

12 bioRxiv preprint doi: https://doi.org/10.1101/367482; this version posted July 11, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

1 four-species MK tables.

2 We applied our method to the HDMK tables of the 17, 800 genes. After 200, 000 burn-in

3 steps, we then ran the Markov chain Monte Carlo process for 200, 000 steps to achieve

4 the posterior distributions of the parameters (see Appendix 2 for details). The maxi-

5 mum a posterior (MAP) estimate of the divergence time between orangurans and the com-

6 mon ancestor of humans, chimpanzees and gorillas (T1234) is 11.6444 (posterior interval:

7 (11.4125, 11.9126)), between gorillas and the common ancestor of humans and chimpanzees

8 (T234) is 4.9340 (4.6461, 5.2420), between humans and chimpanzees (T34) is 3.7038 (3.6529, 3.7784),

9 with all values in units of 2N1. The MAP estimates of the effective population sizes of gorillas,

10 chimpanzees and humans are: N2 = 0.9665 (0.9458, 0.9878),N3 = 1.1610 (1.1351, 1.1877),

11 and N4 = 0.8306 (0.8118, 0.8501), with all values in units of N1. Our estimates of the de-

12 mographic parameters are consistent overall with previous studies (Prado-Martinez et al.,

13 2013).

14 By using orangutans and gorillas as outgroups, we identified 27, 144 fixed synonymous

15 sites and 19, 123 fixed non-synonymous sites in the human lineage (D4∼(1,2,3) in Table 2). −4 16 The average genomic synonymous and non-synonymous divergence are 4.5197 × 10 and

−4 17 3.1842 × 10 (per nucleotide site). We also identified 22, 926 synonymous and 21, 243 non-

18 synonymous segregating sites in humans (P4∼(1,2,3) in Table 2). The average synonymous −4 −4 19 and non-synonymous densities (per nucleotide site) are 3.8174 × 10 and 3.5372 × 10 .

20 The ratio of non-synonymous to synonymous divergence sites is smaller than the ratio of

21 non-synonymous to synonymous polymorphisms sites, which is consistent with the fact that

22 the majority of variations in the genome are deleterious. Table 3: Top 30 positively selected genes in the three lineages

23 Gene Function Dn Ds Pn Ps Selection coefficient p(γ > name 0)

Human lineage

24 SLC5A1 Primary mediator of dietary glucose and 7 1 0 0 11.15 (4.07, 20.88) 1 galactose FAM166B Unknown 4 2 0 1 10.77 (3.18, 21.48) 0.999925 PDXDC1 Pyridoxal phosphate binding and carboxy- 5 2 0 2 10.71 (3.26, 20.84) 0.999975 lyase activity continued on next page

13 bioRxiv preprint doi: https://doi.org/10.1101/367482; this version posted July 11, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

continued from previous page Gene Function Dn Ds Pn Ps Selection coefficient p(γ > name 0) LPIN2 Controlling the metabolism of fatty acids 3 9 0 0 10.56 (2.75, 21.68) 0.999775 PKN2 Regulation of transcription activation signal- 3 1 0 3 10.49 (2.66, 21.32) 0.99985 ing processes C20orf30 Involved in trafficking and recycling of 3 1 0 1 10.44 (2.66, 21.37) 0.99965 synaptic vesicles LOC646498 Unknown 4 1 0 0 10.40 (3.09, 20.62) 1 ZNF605 Transcriptional regulation 8 8 0 1 10.32 (2.86, 21.19) 0.999975 SLC26A3 Transporting chloride ions across the cell 9 7 1 0 10.28 (4.05, 19.16) 1 membrane CASP10 Activation cascade of caspases responsible 7 1 0 2 10.26 (3.49, 19.77) 1 for apoptosis PRKD1 A serine/threonine protein kinase 3 2 0 1 10.13 (2.55, 20.78) 0.999775 ATXN3L Unknown (Paralog of ATXN3) 5 3 0 1 10.10 (3.06, 20.03) 0.999975 NCR1 Cytotoxicity-activating receptor that may 4 0 0 1 10.07 (2.88, 20.18) 0.999825 contribute to the increased efficiency of ac- tivated natural killer (NK) cells CDH24 Mediating strong cell-cell adhesion 5 4 0 1 9.94 (2.84, 20.20) 0.999975 NHEJ1 DNA repair factor 4 0 1 0 9.86 (2.77, 20.00) 0.99985 LAMP2 Chaperone-mediated autophagy 3 1 0 1 9.79 (2.38, 20.27) 0.9995 TBCCD1 Regulation of centrosome and Golgi appara- 3 1 0 0 9.79 (2.49, 20.23) 0.999675 tus positioning

1 DAGLB Required for axonal growth during develop- 3 2 0 3 9.75 (2.25, 20.54) 0.9995 ment and for retrograde synaptic signaling at mature synapses TKTL1 Oxidoreductase activity and transketolase 7 2 0 0 9.72 (3.35, 19.04) 1 activity ZDHHC4 Cytochrome-c oxidase activity 5 3 1 0 9.71 (2.67, 19.75) 0.999875 SLC25A25 Calcium-dependent mitochondrial solute 4 1 0 2 9.70 (2.03, 20.76) 0.999125 carrier SLC6A12 Transporting betaine and GABA 6 2 0 2 9.70 (2.36, 20.39) 0.9999 SYT10 Protein-protein interactions at synapses 2 4 0 3 9.65 (1.70, 20.86) 0.997175 BBC3 Essential mediator of /TP53-dependent 2 0 0 0 9.59 (1.57, 20.81) 0.99725 and p53/TP53-independent apoptosis FOXP2 Involved in neural mechanisms mediating the 2 1 0 1 9.58 (1.53, 20.80) 0.996625 development of speech and language P4HTM Catalyzing the post-translational formation 3 1 0 1 9.57 (2.21, 20.10) 0.999325 of 4-hydroxyproline RALY RNA-binding protein that acts as a tran- 2 1 0 0 9.56 (1.77, 20.47) 0.997825 scriptional cofactor for cholesterol biosyn- thetic genes in the liver EPM2AIP1 Interacting with EPM2A, related to adoles- 3 1 0 1 9.54 (1.91, 20.61) 0.99885 cent progressive myoclonus epilepsy MRPL14 Encoding a protein of the mitochondrial ri- 2 0 0 2 9.53 (1.40, 20.99) 0.996475 bosome UBE2CBP Ubiquitin-conjugating enzyme E2C-binding 6 1 1 0 9.52 (2.63, 19.52) 0.99995 protein continued on next page

14 bioRxiv preprint doi: https://doi.org/10.1101/367482; this version posted July 11, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

continued from previous page Gene Function Dn Ds Pn Ps Selection coefficient p(γ > name 0)

Chimpanzee lineage

SCML1 Maintaining the transcriptionally repressive 18 1 0 2 14.99 (6.81, 25.33) 1 state of homeotic genes throughout develop- ment ZNF280C Transcription factor 16 3 0 0 14.85 (6.80, 24.90) 1 TKTL1 Oxidoreductase activity and transketolase 10 0 0 3 12.55 (4.99, 22.48) 1 activity AIMP2 Required for assembly and stability of the 6 3 0 1 11.93 (3.78, 22.91) 1 aminoacyl-tRNA synthase complex PNMA5 Paraneoplastic Ma antigen protein 8 3 0 1 11.69 (4.57, 21.28) 1 CCDC154 Unknown 9 12 0 3 11.56 (3.47, 22.60) 1 TXLNG Involving in intracellular vesicle traffic 8 1 0 1 11.55 (3.31, 22.91) 1 OTUD6A Deubiquitinating enzyme 7 0 0 2 11.47 (4.04, 21.64) 1 GSDMD Regulation of epithelial proliferation 5 5 0 1 11.33 (3.04, 22.30) 0.9999 TIPARP T-cell function 4 0 0 0 11.20 (3.25, 21.88) 0.99995 THEM4 Acyl-CoA thioesterase 4 1 0 0 11.15 (3.41, 21.88) 0.999925 DOT1L Histone methyltransferase 4 8 0 6 11.11 (2.68, 22.61) 0.9998 ELF4 Transcription factor 4 5 0 1 11.03 (2.86, 22.05) 0.99975 VRK3 Serine/threonine protein kinases 7 2 0 1 11.01 (3.68, 21.18) 1 DDX53 ATP-dependent RNA unwinding 16 1 0 1 10.97 (4.02, 20.46) 1 1 APOL2 Apolipoprotein L gene family that may affect 8 3 0 1 10.95 (3.15, 22.09) 0.999975 the movement of lipids or allow the binding of lipids to organelles HIST1H4G Histone H4;Core component of nucleosome 4 0 0 0 10.92 (3.45, 21.23) 0.999975 TAC4 Regulating peripheral endocrine and 4 1 0 0 10.90 (3.39, 21.41) 1 paracrine functions DCAF12L1 Involving in progression, signal 8 0 1 0 10.84 (3.19, 21.63) 0.99995 transduction, apoptosis, and gene regulation YY2 Transcription factor 6 2 0 1 10.77 (3.81, 20.39) 0.999975 SPC25 Involving in kinetochore-microtubule inter- 3 0 0 0 10.70 (2.53, 22.04) 0.99955 action and spindle checkpoint activity USP18 Ubiquitin-specific proteases 3 0 0 0 10.67 (2.74, 21.85) 0.999625 PTGDR Guanine nucleotide-binding protein (G 4 1 0 2 10.66 (2.76, 21.72) 0.9997 protein)-coupled receptor (GPCR) GYPC Regulating the mechanical stability of red 4 2 0 0 10.61 (2.66, 21.70) 0.999775 cells RSPH1 Male meiotic metaphase chromosome- 4 2 1 1 10.55 (2.80, 21.50) 0.999875 associated acidic protein VGLL1 Specific coactivator for the mammalian 3 2 0 0 10.51 (2.63, 21.44) 0.99965 TEFs PRDM9 Zinc finger protein with histone methyltrans- 4 3 0 0 10.51 (2.57, 21.46) 0.999725 ferase activity LCN12 Binding all-trans retinoic acid and may act 4 3 0 0 10.50 (2.57, 21.85) 0.9996 as a retinoid carrier protein within the epi- didymis continued on next page

15 bioRxiv preprint doi: https://doi.org/10.1101/367482; this version posted July 11, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

continued from previous page Gene Function Dn Ds Pn Ps Selection coefficient p(γ > name 0) CUTA Part of a complex of membrane at- 3 0 0 0 10.49 (2.05, 22.26) 0.9988 tached to acetylcholinesterase (AChE) ASPHD1 Dioxygenase 3 0 0 0 10.46 (2.21, 22.00) 0.99885

Human-Chimpanzee lineage

TAS2R8 Taste receptors 14 9 0 0 18.14 (7.42, 30.03) 1 ULBP3 Binding and activating the KLRK1/NKG2D 6 1 0 0 17.77 (8.17, 29.24) 1 receptor, mediating natural killer cell cyto- toxicity BEND2 Participating in protein and DNA interac- 10 0 0 0 16.71 (8.46, 26.66) 1 tions TFDP3 The DP family of transcription factors 11 4 0 0 16.37 (7.72, 26.82) 1 ASB9 The ankyrin repeat and suppressor of cy- 4 2 0 0 15.62 (6.48, 26.95) 1 tokine signaling (SOCS) box protein family DDX25 A gonadotropin-regulated and developmen- 4 1 0 0 15.57 (6.44, 27.04) 1 tally expressed testicular RNA helicase CTSF Cysteine protease 5 0 0 0 15.54 (6.49, 26.58) 1 YY2 Transcription factor 5 1 0 0 15.40 (6.57, 26.03) 1 PASD1 Transcription factor 8 1 0 0 15.00 (7.12, 24.75) 1 C16orf78 Unknown 7 0 0 0 14.85 (4.66, 26.82) 1 1 FAM122C Unknown 5 1 0 0 14.37 (5.45, 25.31) 1 ZNF182 Involving in cell proliferation, differentiation, 3 1 0 0 14.30 (5.10, 25.86) 1 and apoptosis OR4D11 8 1 0 0 14.22 (5.51, 24.97) 1 OR9Q1 Olfactory receptor 5 2 0 0 14.21 (5.96, 24.57) 1 COX7A2 Cytochrome c oxidase 3 0 0 0 14.00 (5.00, 25.14) 1 C12orf54 Unknown 3 0 0 0 13.71 (4.78, 25.15) 1 OR4X2 Olfactory receptor 6 1 0 0 13.31 (4.64, 24.11) 1 OR2C1 Olfactory receptor 5 4 0 0 13.23 (4.52, 23.91) 1 PNMA5 Development of paraneoplastic disorders re- 4 1 0 0 13.20 (5.04, 23.55) 1 sulting from an immune response directed against them ESX1 Involving in placental development and sper- 4 0 0 0 13.05 (4.73, 23.55) 1 matogenesis RBMX2 Unknown 3 2 0 0 13.03 (4.44, 24.18) 1 B3GALT4 Involved in GM1/GD1B/GA1 ganglioside 2 2 0 0 12.76 (3.69, 24.68) 0.99985 biosynthesis ZNF649 Transcriptional repressor 4 1 0 0 12.66 (4.57, 23.00) 1 DDX26B Unknown 3 2 0 0 12.59 (3.98, 23.66) 1 FAM46D Non-canonical poly(A) RNA polymerase 4 0 0 0 12.55 (4.59, 22.85) 1 UBA7 The E1 ubiquitin-activating enzyme family 5 4 0 0 12.46 (4.66, 22.69) 1 that is involved in the conjugation of the ubiquitin-like interferon-stimulated gene 15 protein continued on next page

16 bioRxiv preprint doi: https://doi.org/10.1101/367482; this version posted July 11, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

continued from previous page Gene Function Dn Ds Pn Ps Selection coefficient p(γ > name 0) 1 HUWE1 E3 ubiquitin ligase 4 4 0 0 12.35 (3.39, 23.69) 0.99995 ZMYND10 Involving in assembly of the dynein motor 3 0 0 0 12.24 (3.91, 23.12) 0.999975 IL13RA2 Playing role in internalization of IL13 3 0 0 0 12.20 (4.04, 23.01) 0.99995 C10orf131 Unknown 3 0 0 0 12.08 (3.66, 23.18) 0.999975 2

3 Genes under selection in the human lineage

4 The histogram in Figure 3(A) shows the posterior distribution of the selection intensity

5 γ4 for genes under selection in the human lineage. Among these genes, 1211 genes with

6 a 95% confidence interval above 0 were identified as targets under positive selection and

7 1349 genes with a 95% confidence interval (CI) below 0 are under negative selection. Genes

8 under positive selection in the human-lineage should attract particular attention because

9 they potentially confer to the emergence of human-specific phenotypes and functionality.

10 In the following sections, we focused our analysis on genes showing evidence of positive

11 selection.

12 We first investigated the biological functions of those genes under positive selection by

13 performing gene enrichment analyses with the package DAVID (Huang et al., 2009). Gene

14 sets were defined using the KOBAS pathway and disease categories along with

15 (GO) categories (see Table 4).

16 Several top pathways in the human lineage are related to the immune systems (P <

−12 17 2.16×10 , Table 4), which is similar to the findings of previous studies (Bustamante et al.,

18 2005; Clark et al., 2003; Nielsen et al., 2005). Several genes, such as NCR1, are among

19 the top genes with the highest selection intensity (Table 3): γNCR1 = 10.07 (2.88, 20.18).

20 Interestingly, these genes are under strong negative selection in the chimpanzee lineage and

21 the human-chimpanzee common ancestor lineage (HC lineage), indicating that accelerated

22 evolution of these genes is likely caused by human-specific resistance to pathogens. These

23 genes belong to different parts of the immune system, i.e., CASP10, DEFB110, HSP90AA1,

24 and INSR, and others are found in the innate immune system; NCR1, UBE2CBP, TRAIP,

25 and SEC24C and other 51 genes are from the adaptive immune system; and another 14 genes,

26 including SAMHD1, NUP107, and UBA7, are related to interferon signaling (see Table 4).

17 bioRxiv preprint doi: https://doi.org/10.1101/367482; this version posted July 11, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

−7 1 Another extremely significantly enriched pathway is gene expression (P < 2.79 × 10 ).

2 This gene set includes 84 genes that fall into four main categories:

3 (1) RNA polymerase II transcription. This category includes many zinc-finger genes, in-

4 cluding ZNF549, ZNF510, ZNF500, ZNF684, and ZNF750, along with others. (2) Transcrip-

5 tional regulation by small RNAs, including NUP88, NUP133 and NUP107. (3) Regulation

6 of TP53 activity, including AURKB, DNA2, TPX2, RMI1, RBBP8 (through phosphoryla-

7 tion), KAT6A, and EHMT1 (through or methylation). (4) Regulation of histone

8 methylation, including PHF1, BCOR, CTCFL, MLL2, and BRCA1. Among these genes,

9 JARID2, MLL, and BRCA1 are only under strong lineage-specific positive selection in hu-

10 mans.

11 The large number of genes from gene expression regulation pathways under positive

12 selection indicates that evolutionary changes at the level of gene regulation played an essential

13 role in the divergence of humans and chimpanzees, which is consistent with the hypothesis

14 of King and Wilson (1975). The evolutionary roles of these genes in forming human-specific

15 phenotypes have yet to be properly recognized, and our results could provide inspiration for

16 further investigations.

17 Our study also identified genes involved in spermatogenesis under positive selection

18 (Nielsen et al., 2005). High selection intensities were inferred according to the MK table

19 patterns of these genes, including SLC26A3, SEPT4, PRM1, NPHP1 and OVOL1. Among

20 these genes, PRM1 is known for its potential influence on sperm morphology and ability to

21 fertilize eggs (Rooney and Zhang, 1999; Wyckoff et al., 2000; Nielsen et al., 2005).

−5 22 Metabolism is another large gene category under selection (P < 7.83×10 ) and includes

23 the metabolism of proteins, lipids and lipoproteins, purines, and carbohydrates. Certain

24 genes, such as SLC5A1, show an extremely strong selection signal (γ = 11.15 (4.07, 20.88)) in

25 humans, with 7 nonsynomymous divergence sites and 1 synonymous divergence site combined

26 with zero within-species polymorphism sites. This gene is a sodium/glucose cotransporter

27 that mediates the transport of glucose and related substances across cellular membranes, and

28 it is involved in the absorption of glucose. Interestingly, Pontremoli et al. (2015) identified

29 positive selection on the regulatory elements of SLC2A5 and SLCA2, which are fructose or

30 glucose transporters and functionally similar to SLC5A1. However, their coding regions are

18 bioRxiv preprint doi: https://doi.org/10.1101/367482; this version posted July 11, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

1 under neutrality or negative selection in our analysis, which implies that the evolution of

2 carbohydrate consumption pathways may involve both coding regions and regulatory regions.

3 Strong selection on these genes may reflect the importance of starch metabolism and dietary

4 transitions during human evolution. In addition, other genes, such as PKN2, are involved in

5 pathways regulating glucose and lipid metabolism in different tissues, such as skeletal muscle

6 (Ruby et al., 2017).

7 We also analyze the association between positively selected genes and common diseases

8 (Table 4). Immune system diseases and metabolic diseases, such as diabetes and obesity, are

9 significantly enriched for genes under positive selection in the human lineage. This finding

10 is consistent with the pathway enrichment results. Interestingly, our findings indicate that

11 positive selection of certain genes is correlated with the development of schizophrenia (cor-

12 rected P = 0.00011). The adaptive evolution of human cognitive abilities was hypothesized

13 to be achieved by pushing the limits of metabolic capabilities, which created side effect such

14 as schizophrenia (Khaitovich et al., 2008).

15 We then investigated the posterior mean and CIs of genes in these significantly enriched

16 gene categories in further detail. In Figure 4, we present the results for four pathways showing

17 distinct patterns. The genes in each pathway were ranked according to their posterior

18 mean of the selection intensity, with bars representing the 95% CIs. Red and blue colors

19 denote CIs completely above or below the line γ = 0, respectively, which correspond to the

20 identification of being positively or negatively selected. Pathways for gene expression and

21 schizophrenia are over-dominantly enriched for positively selected genes. These pathways are

22 putatively under accelerated evolution and thus have played essential roles in the emergence

23 of new functionality. In contrast, certain pathways, such as olfactory signaling, are enriched

24 for negative selection, indicating their conservative evolution in the human lineage. Other

25 pathways, such as metabolism, are significantly enriched for both positive selection and

26 negative selection, thereby showing manifold functionality in evolution.

27 In addition to performing a categorical analysis of pathways and genes, we also investi-

28 gated the top 30 genes under positive selection in humans, chimpanzees, and the common

29 ancestor of humans and chimpanzees (see Table 3). The genes were ranked according to

30 their MAP selection intensity. All positively and negatively selected genes are also shown in

19 bioRxiv preprint doi: https://doi.org/10.1101/367482; this version posted July 11, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

1 Figures 5 and 6 as a selection map of the human and chimpanzee . The genes iden-

2 tified as targets of positive or negative selection are marked on each chromosome according

3 to their physical positions and labelled with red and blue, respectively.

4 Among the genes from the top list, FOXP2 is well known for its putative functionality

5 in the evolution of language (Enard et al., 2002). Our analysis demonstrates that FOXP2

6 has been under strong positive selection in humans since the divergence of humans and

7 chimpanzees and presents a selection intensity of γ = 9.58 (95% CI. (1.53, 20.80)). FOXP2

8 is quite conservative in chimpanzees (γ = −2.70 (−16.49, 11.43)) and the common ancestor of

9 humans and chimpanzees (γ = −1.31 (−16.13, 13.41), which indicates that positive selection

10 on FOXP2 contributes to the emergence of human-specific phenotypes and functionality.

11 Interestingly, a number of genes on the list are related to neurological systems. RBFOX2

12 is a conserved RNA binding protein serving as a key regulator of in

13 the nervous system and may play an important role in neuromuscular functions (Raj and

14 Blencowe, 2015). SYT10 is found in pathways of protein-protein interactions at synapses

15 and transmission across chemical synapses. This gene was identified to be important for

16 epileptogenesis in previous studies (Glavan et al., 2009; Woitecki et al., 2016). ATXN3L is

17 a paralog of ATXN3 that has undergone limited study, although GO annotations related to

18 the gene include ubiquitin protein ligase binding and obsolete ubiquitin thiolesterase activ-

19 ity. Mutations of this gene were identified to be correlated with neurodegenerative disease

20 (do Carmo Costa and Paulson, 2012). SLC6A12 is a transporter of betaine and GABA,

21 which may have a role in the regulation of GABAergic transmission in the brain through the

22 reuptake of GABA into presynaptic terminals. EPM2AIP1 interacts with EPM2A, which

23 produces a protein called laforin, and is related to epilepsy (Lafora disease).

Table 4: Kobas enrichment of positively selected genes in the Human lineage

Term Input(Background) P-Value Corrected P-Value Pathway Immune system 94(1583) 3.522e-14 2.158e-12 24 Generic transcription pathway 56(822) 5.851e-11 2.991e-09 Adaptive immune system 51(787) 2.088e-09 8.629e-08 Gene expression 84(1719) 7.44e-09 2.79e-07 Signal transduction 103(2448) 2.028e-07 6.026e-06 continued on next page

20 bioRxiv preprint doi: https://doi.org/10.1101/367482; this version posted July 11, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

continued from previous page Term Input(Background) P-Value Corrected P-Value Complement and coagulation cascades 12(79) 1.938e-06 4.836e-05 Metabolic pathways 59(1243) 3.231e-06 7.828e-05 Regulation of TP53 activity 16(152) 3.4e-06 8.179e-05 Innate immune system 41(769) 8.341e-06 0.000185 Chromosome maintenance 13(110) 8.957e-06 0.0001956 Lysine degradation 9(52) 1.47e-05 0.0003077 Purine metabolism 16(176) 1.856e-05 0.000381 Complement cascade 8(44) 3.211e-05 0.0006189 Regulation of TP53 activity through phos- 11(91) 3.586e-05 0.0006804 phorylation Metabolism of proteins 59(1378) 5.648e-05 0.00101 Metabolism 77(1975) 8.72e-05 0.001458 Immunoregulatory interactions between a 12(120) 8.737e-05 0.001459 Lymphoid and a non-Lymphoid cell Hemostasis 32(605) 9.097e-05 0.001518 Cell cycle 32(607) 9.635e-05 0.001595 Transcriptional regulation by TP53 22(355) 0.0001427 0.002278 Extension of telomeres 6(30) 0.0002005 0.003111 Activation of the pre-replicative complex 6(32) 0.0002728 0.004053 GPCR downstream signaling 42(940) 0.0002819 0.004168

1 Disease Immune system diseases 24(243) 4.066e-08 1.353e-06 Primary immunodeficiency 17(138) 2.302e-07 6.725e-06 Other congenital disorders 31(462) 1.518e-06 3.905e-05 Schizophrenia 29(440) 4.525e-06 0.0001067 Obesity-related traits 39(696) 4.68e-06 0.0001095 Diabetes 8(63) 0.0003039 0.004448

GO: biological process Spermatogenesis 42(449) 0.0006161 - Male gamete generation 42(450) 0.000643 - Lymphocyte mediated immunity 23(202) 0.001257 - Regulation of viral life cycle 20(169) 0.001817 - Protein activation cascade 12(74) 0.001868 - Lymphocyte activation involved in im- 17(133) 0.002029 - mune response Negative regulation of multi-organism pro- 18(147) 0.002289 - cess Modification of morphology or physiology 13(88) 0.002511 - of other organism involved in symbiotic in- teraction Sexual reproduction 56(697) 0.00262 - B cell mediated immunity 14(101) 0.002853 Cellular aromatic compound metabolic 334(5508) 0.002969 - process continued on next page

21 bioRxiv preprint doi: https://doi.org/10.1101/367482; this version posted July 11, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

continued from previous page Term Input(Background) P-Value Corrected P-Value Complement activation 9(47) 0.003233 - Multi-organism reproductive process 65(846) 0.003266 - Adaptive immune response based on so- 22(207) 0.003785 - matic recombination of immune receptors Negative regulation of viral life cycle 12(81) 0.00386 - Nucleic acid metabolic process 291(4752) 0.003885 -

1 Nucleobase-containing compound 324(5348) 0.003908 - metabolic process Regulation of transcription, DNA- 210(3311) 0.004105 - templated Heterocycle metabolic process 330(5465) 0.004212 - Cellular nitrogen compound metabolic 359(6000) 0.004583 - process Complement activation, classical pathway 7(30) 0.004587 - Organic cyclic compound metabolic pro- 341(5675) 0.004848 - cess Terms with Benjamini corrected P -value<0.005 are shown in table

2 Genes under selection in the chimpanzee lineage

3 Among the top 30 signals under strong positive selection in chimpanzees (Table 3),

4 SCML1 is the one with the highest inferred selection intensity at 14.99 (6.81, 25.33). SCML1

5 was identified as a target of repeated positive selection by Nielsen et al. (2005) because it has

6 15 nonsynonymous and one synonymous substitutions between humans and chimpanzees, al-

7 though zero polymorphisms are observed in humans. SCML1 is an expression repressor of

8 the Hox genes and important for the developmental differences between humans and chim-

9 panzees. Wu and Su (2008) also identified strong positive selection on SCML1 in multiple

10 species and noticed that the gene is expressed in testes, implying its role in testis

11 development and spermatogenesis.

12 PNMA5 shows a strong signal for positive selection in chimpanzees (γ = 11.69 (4.57, 21.28)).

13 The exact molecular function of PNMA5 is still unclear, although previous research indi-

14 cated that PNMA5 is highly expressed in the neocortex of the brain and may be involved in

15 primate brain evolution. Interestingly, we found that the gene is also under strong positive

16 selection in the human-chimpanzee common ancestor lineage but is under neutral evolution

17 in humans, implying that it once played an important role in the brain evolution of ancient

18 but that its effect likely terminated in the human lineage. AIMP2 is another gene

22 bioRxiv preprint doi: https://doi.org/10.1101/367482; this version posted July 11, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

1 known to be related to human neurodegenerative diseases, such as Parkinson’s disease.

2 Genes under selection in the common ancestor of humans and chimpanzees

3 TAS2R8 shows the strongest positive selection signal in the common ancestor of humans

4 and chimpanzees, although it is under strong negative selection in human and under neutral-

5 ity in chimpanzee (γHC = 18.142 (7.4216, 30.0282); γhuman = −4.32549 (−15.0859, 2.04698);

6 and γchimp = 2.13833 (−0.79666, 7.64481)). TAS2R8 is one of the bitter taste receptor

7 genes and may reflect a change in diet or toxin avoidance during different stages of primate

8 evolution.

9 Several genes from the top list are olfactory receptors, such as OR4D11, OR9Q1, OR4X2,

10 OR2C1, and OR4F6. Olfactory receptor genes are a large group of genes that show a strong

11 tendency for positive selection (Clark et al., 2003), and our results demonstrate that these

12 olfactory receptor genes have been important for sensory perception ever since the the time

13 of the common ancestor of humans and chimpanzees. Among these genes, some are still

14 actively evolving and under positive selection in humans, such as QR9R1, whereas the rest

15 are conservative and under neutrality or negative selection in humans and/or chimpanzees.

16 A number of positively selected genes in the common ancestor lineage of humans and

17 chimpanzees are related to transcription regulation, including TFDP3, YY2, PASD1, ZNF182,

18 ESX1, ZNF649, and ZMYND10. In particular, some of these genes are involved in sper-

19 matogenesis. ZMYND15, for example, encodes a histone deacetylase-dependent transcrip-

20 tional repressor that is important for spermatogenesis and male infertility (Yan et al., 2010).

21 ZMYND10 is a zinc finger gene functioning in assembly of the dynein motor, and it is highly

22 expressed in sperm and important for sperm movement.

23 ASB9, PNMA5, ICAM1 and IL13RA2 are found in the immune system. ICAM1 is a

24 receptor that mediates the binding of Plasmodium falciparum to erythrocytes. Mutations on

25 ICAM1 were under very recent positive selection in human populations (Kun et al., 1999).

26 We found that the gene is under strong selection in the common ancestor of human and

27 chimpanzee (γ = 9.81(3.22, 19.27)), which is consistent with the long history of malaria

28 as a pathogen of primates and may reflect the gene’s role in resistance to malaria since

29 the early stages of primate evolution. We also noticed that ICAM1 is under selection in

23 bioRxiv preprint doi: https://doi.org/10.1101/367482; this version posted July 11, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

1 both chimpanzees (γ = 8.71(3.15, 16.95) and humans (γ = 4.95(1.19, 11.25)). Humans and

2 chimpanzees are affected by different malaria species (P. falciparum and P. reichenowi,

3 Martin et al. (2005)). ICAM1 is likely undergoing parallel evolution for resistance to the two

4 parasites. Further investigation of the divergence sites among the three species may provide

5 insights into the mechanism of malaria resistance.

6 Overall, in the above study, we discussed the genes and pathways under strong positive

7 selection in humans, chimpanzees and their common ancestral lineage. Nielsen et al. (2005)

8 used a likelihood-ratio test to compare the dN/dS ratios for coding regions of one human and

9 one chimpanzee genome and provided a list of the 50 top genes under positive selection, and

10 we found that 12 of their genes are also identified as significant in our analysis. However,

11 only 7 out of the 12 genes are under positive selection in the human lineage; three are

12 under positive selection in chimpanzees (CD72, SLC22A4 and DFFA), and the other two

13 genes (RBM23 and C16orf3) are under selection only in the common ancestor of human

14 and chimpanzee. Some other genes, such as SPATA3, are not significantly under positive

15 selection. The MK table entry for SPATA3 in the human lineage is (1, 1, 3, 1), indicating

16 that positive selection on SPATA3 is not likely. The difference between our results and

17 those of Nielsen et al. (2005) demonstrates that by extending the MK-like or dN/dS tests to

18 multiple species, our method can gain information for precisely distinguishing selection that

19 occurred in the human lineage, in the chimpanzee lineage, or in the common ancestor of the

20 two species, thereby providing additional insights into understanding the selection process.

21 Discussion

22 We present the computational method HDMKPRF for detecting lineage-specific selection

23 by jointly analyzing polymorphism and divergence data from multiple species. HDMKPRF

24 is an extension of the MKPRF method and can be used from two species to multiple species

25 (Bustamante et al., 2001). Our method has several advantages over existing approaches. The

26 pairwise comparison of two species in both the MK test and MKPRF is directionless, and it

27 only identifies genes that are divergent between the two species or populations but provides

28 no further conclusions as to the lineage under selection (Akey et al., 2002; Bustamante

29 et al., 2005; Clark et al., 2003; Chen et al., 2010). However, the HDMKPRF method can

24 bioRxiv preprint doi: https://doi.org/10.1101/367482; this version posted July 11, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

12 8 10

6 8

4 6

4 2 2 0 0

-2 Estimated selection intensity Estimated selection intensity -2

-4 -4

-6 -6 -4 -2 0 2 4 6 -2 0 2 4 6 True selection intensity True selection intensity (A) (B) 8 6

7 5 6 4 s a

5 θ θ

4 3 Estimated 3 Estimated 2 2 1 1

0 0 1 2 3 4 5 1 2 3 4 5 True θ True θ a s (C) (D) 14

12

10

8

6 Estimated divergence time

4

2 4 6 12 True divergence time (E)

Figure 2: Comparison of the true and inferred selection coefficients for the four species in the simulation of 20, 000 gene loci: (A) selection coefficient γ4; (B)selection coefficient γ5; (C) scaled mutation rate; (D) scaled synonymous mutation rate; and (E) divergence times T34, T234 and T1234.

25 bioRxiv preprint doi: https://doi.org/10.1101/367482; this version posted July 11, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

1000 1200

900 95% CI below 0 95% CI below 0 95% CI include 0 1000 95% CI include 0 800 95% CI above 0 95% CI above 0

700 800 600

500 600

400 Number of genes Number of genes 400 300

200 200 100

0 0 -7 -6 -5 -4 -3 -2 -1 0 1 2 3 4 5 6 7 8 9 ≥10 -9 -8 -7 -6 -5 -4 -3 -2 -1 0 1 2 3 4 5 6 7 8 9 ≥10 γ γ (A) (B) 1200 2000

95% CI below 0 1800 95% CI below 0 95% CI include 0 95% CI include 0 1000 95% CI above 0 1600 95% CI above 0

1400 800 1200

600 1000

800 Number of genes Number of genes 400 600

400 200 200

0 0 -9 -8 -7 -6 -5 -4 -3 -2 -1 0 1 2 3 4 5 6 7 8 9 ≥10 -7 -6 -5 -4 -3 -2 -1 0 1 2 3 4 5 6 7 8 9 ≥10 γ γ (C) (D) 400

95% CI below 0 350 95% CI include 0 95% CI above 0 300

250

200

150 Number of genes

100

50

0 -2 -1 0 1 2 3 4 5 6 7 8 9 ≥10 γ (E)

Figure 3: Posterior distribution of γ for humans and chimpanzees.

26 bioRxiv preprint not certifiedbypeerreview)istheauthor/funder.Allrightsreserved.Noreuseallowedwithoutpermission. doi:

Schizophrenia https://doi.org/10.1101/367482

20

10 (a) γ 0

-10 ; this versionpostedJuly11,2018.

-20 0 20 40 60 80 100 120 140 160 180 Rank 2 7 The copyrightholderforthispreprint(whichwas (b) bioRxiv preprint doi: https://doi.org/10.1101/367482; this version posted July 11, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. 200 150 Rank 100 Olfactory signaling pathway 50 0 0

20 10 -10 -20 γ

(c) (d)

Figure 4: Distribution of γ for four pathways or disease-related gene sets.

28 bioRxiv preprint doi: https://doi.org/10.1101/367482; this version posted July 11, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

1 pinpoint the occurrence of selection to a specific lineage, including the internal lineages.

2 As we demonstrated in the analysis of the Hominidae data, HDMKPRF identifies specific

3 targets of selection in human and chimpanzee lineages as well as the common ancestor of

4 humans and chimpanzees.

5 Similar to the MK test and MKPRF, the HDMKPRF method analyzes both polymor-

6 phism and divergence sites, which increases the power for detecting selection and helps

7 distinguish positive selection from the relaxation of negative selection (Wyckoff et al., 2000).

8 By using the Bayesian approach to combine information from all gene loci, the method is

9 more powerful than single-locus methods, such as the MK test and dN/dS ratio tests. We

10 apply the HDMKPRF method to the genomic sequences of four species of Hominidae and

11 identify gene loci under lineage-specific positive and negative selection. Cagan et al. (2016)

12 recently analyzed the same data set using the MK test and only identified a limited num-

13 ber of genes under positive selection since the divergence of humans and chimapnzees. A

14 comparison of our results with former studies demonstrates that the HDMKPRF method

15 outperforms alternative methods with higher power and provides additional insights into

16 the distribution of selection effects and the temporal and spatial occurrence of Darwinian

17 selection over evolutionary history.

18 With the development of sequencing technologies, genomic population data for multiple

19 species are abundant, thus necessitating methods that can efficiently analyze both within-

20 species and between-species data. The HDMKPRF method presented in this paper satisfies

21 such a need, and we expect that its application potential will be extensive in comparative

22 genomic studies.

23 Appendices

24 A1. Segregating site pattern P(2,3)∼1, P(3,4)∼(1,2) and D(3,4)∼(1,2)

25 For the three-species scenario, we assume that the allele frequency x of the common

26 ancestor of species 2 and 3 (species 4) at time T23 follows the stationary distribution under

27 neutrality for synonymous sites (f(x), Equation 1) and distribution under selection for re-

28 placement sites (g(x), Equation 6). After the population split and with evolution over time,

29 the joint allele frequency distribution of species 2 (y) and species 3 (z) of synonymous sites

29 bioRxiv preprint doi: https://doi.org/10.1101/367482; this version posted July 11, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

1 is as follows:

f(y, z|T23,N2,N3) 1 = φ(y|x, T23,N2)φ(z|x, T23,N3)f(x)dx Z0 1 1 = 2N4µs φ(y|x,T23,N2)φ(z|x, T23,N3)dx, (24) Z0 x

2 where φ(y|x, T, N) represents the transient allele frequency distribution y given its initial

3 frequency x at time T and population size N (Chen et al., 2007). Kimura (1955a) found

4 that for neutral evolution

∞ (2i + 1)(1 − (1 − 2x)2) φ(y|x, T, N) = C3/2(1 − 2x) i(i + 1) i−1 Xi=1 3/2 −1/2i(i+1)T ·Ci−1(1 − 2y)e , (25)

3/2 5 where Ci (x) is the Gegenbauer polynomial with λ = 3/2.

6 Kimura (1955b) provides the transient distribution for alleles under selection, which is

7 in complicated form and is difficult to calculate. Williamson et al. (2005) and Evans et al.

8 (2007) adopted the Crank-Nicolson finite difference method to approximate the transient dis-

9 tribution as the solution of a forward diffusion equation, and several recent studies provided

10 analytical solutions using perturbation methods (Schraiber, 2014; Zivkovi´cetˇ al., 2015). Sim-

11 ilar to the neutral case, the joint allele frequency distribution of nonsynonymous sites under

12 selection in species 2 and 3 is:

g(y, z|T23,N2,N3, γ) 1 = ψ(y|x, T23,N2, γ2)ψ(z|x, T23,N3, γ3)g(x)dx Z0 2γ 1 ψ(y|x,T ,N , γ )ψ(z|x, T ,N , γ ) = 2N µ 4 23 2 2 23 3 3 dx, (26) 4 r −2γ4 1 − e Z0 1x( − x) × 2γ4

13 where ψ(y|x, T, N, γ) represents the transient allele frequency distribution for y with an

14 initial allele frequency x, time T , population size N and selection intensity γ.

15 The expected number of polymorphism synonymous sites segregating in species 2 and 3

30 bioRxiv preprint doi: https://doi.org/10.1101/367482; this version posted July 11, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

1 is

1 1 (1 − yn2 − (1 − y)n2 )(1 − zn3 − (1 − z)n3 ) Z0 Z0 ×f(y, z|T23,N2,N3)dydz

= θs,4H(n2, n3). (27)

2 The expected number of polymorphism replacement sites segregating in species 2 and 3 is

3 as follows:

1 1 (1 − yn2 − (1 − y)n2 )(1 − zn3 − (1 − z)n3 ) Z0 Z0 ×g(y, z|T23,N2,N3, γ4)dydz

2γ4 = θr,4 I(n2, n3, γ2, γ3, γ4). (28) 1 − e−2γ4

4 The other entries of the three-species MK table are now different from that of Table

5 1. For example, the expected number of P2∼(1,3) includes two components. The first part

6 consists of sites fixed in sample 3 but still segregating in sample 2:

1 1 (1 − yn2 − (1 − y)n2 )((1 − z)n3 ) Z0 Z0 ×f(y, z|T23,N2,N3)dydz. (29)

7 The second part consists of the new mutations occurring in species 2 since T23:

1 T23 n2 n2 θs,2 1 (1 − y − (1 − y) ) × φ(y| , t, N2)dtdy, (30) Z0 2 Z0 N2

θs,2 8 where 2 is the number of new neutral mutations that enter the population every generation

9 with the initial frequency of 1/N2.

10 Note that the above formula for Ps,2∼(1,3) is different from that in Table 1 and is applica-

11 ble in different situations. An empirical criterion for choosing the two fomulae is based on

12 the distribution of TMRCA of n2. According to Griffiths (1984), the TMRCA asymptoti-

13 cally follows a normal distribution, with the mean and variance determined by N2 and the

31 bioRxiv preprint doi: https://doi.org/10.1101/367482; this version posted July 11, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

1 population history (see Griffiths (1984) and Chen and Chen (2013) for detailed formulae);

2 thus P r(TMRCA < T23). When T23 is sufficiently large and P r(TMRCA < T23) > 0.95, we

3 adopt the formulae in Table 1; otherwise, we use Equation 29 and 30.

4 For the four-species scenario, we can obtain the expected values for Ps,(3,4)∼(1,2) and

5 Pr,(3,4)∼(1,2) via a similar method used for Ps,(2,3)∼1 and Pr,(2,3)∼1. Synonymous divergence

6 sites D(3,4)∼(1,2) in the four-species scenario include two components: the sites fixed on the

7 branch between species 5 and 6 and the sites fixed in the sample of n3 and n4. Following

8 Equation 5, the first component is as follows:

1 θ (T − T )v , (31) 2 s,5 234 34 5

9 with v5 = N1/N5. The second component includes those sites fixed in the sample of n3 and

10 n4:

1 1 n3 n4 y z × f(y, z|T,N3,N4)dydz Z0 Z0 = θs,5J(n3, n4). (32)

11 Similarly for replacement sites, the two components are:

1 2γ5 θ5v5(T234 − T34) , (33) 2 1 − e−2γ5

12 and

1 1 n3 n4 y z × g(y, z|T,N3,N4, γ5)dydz Z0 Z0 2γ5 = θr,5 K(n3, n4, γ3, γ4, γ5). (34) 1 − e−2γ5

13 Therefore, we have P(3,4)∼(1,2) and D(3,4)∼(1,2).

32 bioRxiv preprint doi: https://doi.org/10.1101/367482; this version posted July 11, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

1 A2. MCMC steps for parameter optimization for the 4-species McDonald-

2 Kreitman tables

3 In the Bayesian Poisson random field model of four species, the parameters include

l l Γ = {θs,i, θr,i,T1234,T234,T34, γi,l, i ∈ {1, 2, 3, 4, 5}, 1 ≤ l ≤ L}, (35)

4 where i is the index of lineages (species, see Figure 1B), l is the index of genes. D and P are

5 four-species data containing lineage-specific divergence and polymorphism sites (1 ∼ (2, 3, 4)

6 etc.). Since the posterior distributions of Γ are analytically intractable, we apply the Markov

7 chain Monte Carlo method (MCMC) to achieve them. The details of MCMC steps are as

8 follows.

9 Initialization

10 The initial values of divergence time T1234, T234 and T34 are chosen according to prior

11 knowledge or with arbitrary positive values satisfying T1234 > T234 > T34. The selection

12 parameters γi, i = 1, 2, 3, 4, 5 are generated from the normal distribution N(0, 8). We denote

13 vi, i = 2, 3, 4, 5 as the ratio of effective population sizes N1/Ni. In the four-species McDonald-

14 Kreitman table, the number of polymorphism synonymous mutations occurring in gene l in

l l 15 lineage i, with i = 1, 2, 3, 4 (as is shown in Figure 1B), is Ps,i = θs,iL(ni) = Niµs,lL(ni).

16 Therefore, for i = 2, 3, 4, vi can be estimated as follows:

l L(ni) P v ≈ l s,1 . (36) i P l L(n1) l Ps,i P 17 MCMC iteration

18 Locus-specific mutation parameters for gene l in lineage 1 ∼ (2, 3, 4) are updated via

l l 19 Gibbs sampling. For each gene l, given values of γi,l, i = 1, 2, 3, 4, 5, a value of θr,1 (θr,1 =

20 2N1µr,l) is generated from a gamma distribution

l θr,1 ∼ Γ(par1,l, par2,l) (37)

33 bioRxiv preprint doi: https://doi.org/10.1101/367482; this version posted July 11, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

1 with 5 5 l l par1,l = α + Dr,i + Pr,i (38) Xi=1 Xi=1

2 and

2γ1,l par2,l = β + (2T1234 − T234 + G(n1, γ1,l) + F (n1, γ1,l)) 1 − e−2γ1,l 2γ2,l 1 + (T234 + (G(n2, γ2,l) + F (n2, γ2,l))) −2γ2,l 1 − e v2 2γ3,l 1 + (T34 + (G(n3, γ3,l) + F (n3, γ3,l))) (39) −2γ3,l 1 − e v3 2γ4,l 1 + (T34 + (G(n4, γ4,l) + F (n4, γ4,l))) −2γ4,l 1 − e v4 2γ5,l 1 + ((T234 − T34) + (K(n3, n4, γ3,l, γ4,l, γ5,l) + I(n3, n4, γ3,l, γ4,l, γ5,l))). −2γ5,l 1 − e v5

3 Since D(3,4)∼(1,2) and P(3,4)∼(1,2) are usually of low information with high volatility in our l 4 example, we approximate the posterior distribution of θr,1:

l [ [ θr,1 ∼ Γ(par1,l, par2,l) (40)

5 with 4 4 [ l l par1,l = α + Dr,i + Pr,i (41) Xi=1 Xi=1

6 and

2γ1,l par[2,l = β + (2T1234 − T234 + G(n1, γ1,l) + F (n1, γ1,l)) 1 − e−2γ1,l 2γ2,l 1 + (T234 + (G(n2, γ2,l) + F (n2, γ2,l))) 1 − e−2γ2,l v 2 (42) 2γ3,l 1 + (T34 + (G(n3, γ3,l) + F (n3, γ3,l))) −2γ3,l 1 − e v3 2γ4,l 1 + (T34 + (G(n4, γ4,l) + F (n4, γ4,l))) −2γ4,l 1 − e v4

34 bioRxiv preprint doi: https://doi.org/10.1101/367482; this version posted July 11, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

l 1 Similarly, a value for θs,1 was generated from the gamma distribution

l θs,1 ∼ Γ(par3,l, par4,l) (43)

2 with 5 5 l l par3,l = α + Ds,i + Ps,i, (44) Xi=1 Xi=1

3 and

par4,l = β + (2T1234 − T234 + M(n1) + L(n1)) 1 +(T234 + (M(n2) + L(n2))) v2 1 +(T34 + (M(n3) + L(n3))) (45) v3 1 +(T34 + (M(n4) + L(n4))) v4 1 +(T234 − T34 + (J(n3, n4) + H(n3, n4))). v5

4 We also approximate the gamma distribution:

l [ [ θs,1 ∼ Γ(par3,l, par4,l), (46)

5 where 4 4 [ l l par3,l = α + Ds,i + Ps,i (47) Xi=1 Xi=1

6 and

par[4,l = β + (2T1234 − T234 + M(n1) + L(n1)) 1 +(T234 + (M(n2) + L(n2))) v2 1 (48) +(T34 + (M(n3) + L(n3))) v3 1 +(T34 + (M(n4) + L(n4))) v4

7 Note that in the above gamma distributions, α and β are uninformative small values

8 close to 0.

35 bioRxiv preprint doi: https://doi.org/10.1101/367482; this version posted July 11, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

l l l l l 1 Once θs,1 and θr,1 are updated, θs,i, θr,i, i = 2, 3, 4 can be calculated by θs,i = Niµs,l = l l l 2 θs,1/vi and θr,i = θr,1/vi.

3 Then, we update the selection parameters using Metropolis sampling as shown in Algo- rithem 1.

Algorithm 1: Updating selection parameters by Metropolis sampling

For selection coefficient γi,l of gene l of lineage i ′ ′ Sampling γ , γ ∼ Uniform[γi,l − ε, γi,l + ε] ′ i,l i,l If p(γ |D,P ) > p(γi,l|D,P ) i,l ′ γi,l = γi,l Else Sampling u, u ∼ Uniform[0, 1] ′ p(γ |D,P ) If u < i,l p(γi,l|D,P ) ′ γi,l = γi,l Else γi,l = γi,l End End End

4

5 In Algorithem 1, p(γi,l|D,P ) is the posterior probability for γi,l.

p(γi,l|D,P ) = p(γi,l)L(γi,l|D,P ) (49)

6 with

p(γi,l) ∼ Normal(µ, σ) (50)

7 and

L(γ|D,P ) = p(D,P |γi,l). (51)

8 Considering that selection coefficient γi,l only affects the non-synonymous mutation num-

9 bers, we have l l p(D,P |γi,l) = p(Dr,i,Pr,i|γi,l). (52)

10 Since the expected K and I in the four-species McDonald-Kreitman table are difficult

11 to calculate and only have a weak influence on the estimation of γ5,l, we approximate the

12 likelihood of γ5,l as follows:

36 bioRxiv preprint doi: https://doi.org/10.1101/367482; this version posted July 11, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

l p,P(D |γ5,l) = p(Pr,5|γ5,l), (53)

l 1 E(Pr,5) is simplified as follows:

l l 2γ5,l E(Pr,5) ≈ θr,5 (v5(T234 − T34)) 1 − e−2γ5,l (54) l 2γ5,l = θ (T234 − T34) r,1 1 − e−2γ5,l

2 Once all the selection and mutation parameters have been updated, the divergence times

3 T1234, T234 and T34 are updated by Metropolis sampling in a manner analogous to the up-

4 dating of the γ values. The steps for updating divergence time are shown in Algorithm 2

5 below.

Algorithm 2: Updating divergence time by Metropolis sampling

For divergence times T in {T1234,T234,T34} in turn ′ ′ Sampling T , T ∼ Uniform[T − ε, T + ε] ′ If p(T |D,P ) > p(T |D,P ) ′ T = T Else Sampling u, u ∼ Uniform[0, 1] ′ p(T |D,P ) If u < p,P(T |D ) ′ T = T Else T = T End End End

6 In Algorithm 2, divergence time T1234 affects divergence sites D1∼(2,3,4); T234 affects

7 D1∼(2,3,4), D2∼(1,3,4) and D(3,4)∼(1,2); and T34 affects D3∼(1,2,4), D4∼(1,2,3) and D(3,4)∼(1,2):

p(T1234|D,P ) = p(T1234|D1∼(2,3,4)) (55) 8

p(T234|D,P ) = p(T1234|D1∼(2,3,4),D2∼(1,3,4),D(3,4)∼(1,2)) (56) 9

p(T34|D,P ) = p(T1234|D3∼(1,2,4),D4∼(1,2,3),D(3,4)∼(1,2)) (57)

37 bioRxiv preprint doi: https://doi.org/10.1101/367482; this version posted July 11, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

1 Note that in Algorithm 2, T has to satisfy T34 < T234 < T1234. In our algorithm, we only

2 use the synonymous divergence mutant numbers in the MH sampling step to update T .

3 The simulation results show that this processing method can improve the performance of

4 divergence time estimation.

5 The total number of MCMC iterations was 400,000, and the first 200,000 iterations were

6 treated as burn-in and disregarded; subsequently, we sampled data points every 10 iterations,

7 which yielded a total of 40,000 sample points.

8 Acknowledgements

9 This project was supported by the “Strategic Priority Research Program” of the Chinese

10 Academy of Sciences (Grant No. XDB13020400), the National Natural Science Foundation of

11 China (Grant No. 91731302, 31571370 and 91631106), and the “Hundred Talents Program”

12 of the Chinese Academy of Sciences.

13 References

14 Akey, J. M., Zhang, G., Zhang, K., Jin, L., and Shriver, M. D., 2002. Interrogating a

15 high-density snp map for signatures of natural selection. Genome Res., 12:1805–1814.

16 Bustamante, C., Wakeley, J., Sawyer, S., and Hartl, D., 2001. Directional selection and the

17 site-frequency spectrum. Genetics, 159(4):1779–1788.

18 Bustamante, C. D., Fledel-Alon, A., Williamson, S., Nielsen, R., Hubisz, M. T., Glanowski,

19 S., Tanenbaum, D. M., White, T. J., Sninsky, J. J., Hernandez, R. D., et al., 2005. Natural

20 selection on protein-coding genes in the . Nature, 437(7062):1153–1157.

21 Cagan, A., Theunert, C., Laayouni, H., Santpere, G., Pybus, M., Casals, F., Pr¨ufer, K.,

22 Navarro, A., Marques-Bonet, T., Bertranpetit, J., et al., 2016. Natural selection in the

23 great apes. Mol Biol Evol, 33(12):3268–3283.

24 Chen, H. and Chen, K., 2013. Asymptotic distributions of coalescence times and ancestral

25 lineage numbers for populations with temporally varying size. Genetics, 194(3):721–736.

38 bioRxiv preprint doi: https://doi.org/10.1101/367482; this version posted July 11, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

1 Chen, H., Green, R. E., P¨a¨abo, S., and Slatkin, M., 2007. The joint allele-frequency spectrum

2 in closely related species. Genetics, 177(1):387–398.

3 Chen, H., Patterson, N., and Reich, D., 2010. Population differentiation as a test for selective

4 sweeps. Genome Res., 20(3):393–402.

5 Clark, A. G., Glanowski, S., Nielsen, R., Thomas, P. D., Kejariwal, A., Todd, M. A., Tanen-

6 baum, D. M., Civello, D., Lu, F., Murphy, B., et al., 2003. Inferring nonneutral evolution

7 from human-chimp-mouse orthologous gene trios. Science, 302(5652):1960–1963.

8 do Carmo Costa, M. and Paulson, H. L., 2012. Toward understanding machado–joseph

9 disease. Prog Neurobiol, 97(2):239–257.

10 Enard, W., Przeworski, M., Fisher, S., Lai, C., Wiebe, V., Kitano, T., Monaco, A., and

11 P¨a¨abo, S., 2002. Molecular evolution of FOXP2, a gene involved in speech and language.

12 Nature, 418(6900):869–872.

13 Evans, S., Shvets, Y., and Slatkin, M., 2007. Non-equilibrium theory of the allele frequency

14 spectrum. Theor. Popul. Biol., 71(1):109–119.

15 Fay, J. C., Wyckoff, G. J., and Wu, C.-I., 2002. Testing the neutral theory of molecular

16 evolution with genomic data from drosophila. Nature, 415(6875):1024–1026.

17 Glavan, G., Schliebs, R., and Zivin,ˇ M., 2009. Synaptotagmins in neurodegeneration. The

18 Anatomical Record, 292(12):1849–1862.

19 Griffiths, R. C., 1984. Asymptotic line of descent distributions. J. Math. Biol., 21(1):67–75.

20 Huang, D. W., Sherman, B. T., and Lempicki, R. A., 2009. Systematic and integrative

21 analysis of large gene lists using david bioinformatics resources. Nat Protoc, 4(1):44–57.

22 Hughes, A. L. and Nei, M., 1988. Pattern of nucleotide substitution at major histocompati-

23 bility complex class i loci reveals overdominant selection. Nature, 335(6186):167–170.

24 Khaitovich, P., Lockstone, H. E., Wayland, M. T., Tsang, T. M., Jayatilaka, S. D., Guo,

25 A. J., Zhou, J., Somel, M., Harris, L. W., Holmes, E., et al., 2008. Metabolic changes in

26 schizophrenia and human brain evolution. Genome Biol, 9(8):R124.

39 bioRxiv preprint doi: https://doi.org/10.1101/367482; this version posted July 11, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

1 Kimura, M., 1955a. Solution of a process of random genetic drift with a continuous model.

2 Proc Natl Acad Sci U S A., 41(3):144–50.

3 Kimura, M., 1955b. Stochastic processes and distribution of gene frequencies under natural

4 selection. Cold Spring Harb Symp Quant Biol, (20):33–53.

5 King, M.-C. and Wilson, A. C., 1975. Evolution at two levels in humans and chimpanzees.

6 Science, 188(4184):107–116.

7 Kun, J., Klabunde, J., Lell, B., Luckner, D., Alpers, M., May, J., Meyer, C., and Kremsner,

8 P. G., 1999. Association of the icam-1kilifi mutation with protection against severe malaria

9 in lambar´en´e, gabon. Am J Trop Med Hyg, 61(5):776–779.

10 Li, W.-H., Wu, C.-I., and Luo, C.-C., 1985. A new method for estimating synonymous

11 and nonsynonymous rates of nucleotide substitution considering the relative likelihood of

12 nucleotide and codon changes. Mol Biol Evol, 2(2):150–174.

13 Martin, M. J., Rayner, J. C., Gagneux, P., Barnwell, J. W., and Varki, A., 2005. Evolution

14 of human-chimpanzee differences in malaria susceptibility: relationship to human genetic

15 loss of n-glycolylneuraminic acid. Proc Natl Acad Sci U.S.A., 102(36):12819–12824.

16 McDonald, J. H. and Kreitman, M., 1991. Adaptive protein evolution at the adh locus in

17 drosophila. Nature, 351(6328):652.

18 Nielsen, R., Bustamante, C., Clark, A. G., Glanowski, S., Sackton, T. B., Hubisz, M. J.,

19 Fledel-Alon, A., Tanenbaum, D. M., Civello, D., White, T. J., et al., 2005. A scan for

20 positively selected genes in the genomes of humans and chimpanzees. PLoS Biol, 3(6):e170.

21 Pontremoli, C., Mozzi, A., Forni, D., Cagliani, R., Pozzoli, U., Menozzi, G., Vertemara, J.,

22 Bresolin, N., Clerici, M., and Sironi, M., et al., 2015. Natural selection at the brush-

23 border: adaptations to carbohydrate diets in humans and other . Genome Biol

24 Evol, 7(9):2569–2584.

25 Prado-Martinez, J., Sudmant, P. H., Kidd, J. M., Li, H., Kelley, J. L., Lorente-Galdos, B.,

26 Veeramah, K. R., Woerner, A. E., OConnor, T. D., Santpere, G., et al., 2013. Great ape

27 genetic diversity and population history. Nature, 499(7459):471.

40 bioRxiv preprint doi: https://doi.org/10.1101/367482; this version posted July 11, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

1 Raj, B. and Blencowe, B. J., 2015. Alternative splicing in the mammalian nervous system:

2 recent insights into mechanisms and functional roles. Neuron, 87(1):14–27.

3 Rogers, J. and Gibbs, R. A., 2014. Comparative primate genomics: emerging patterns of

4 genome content and dynamics. Nat Rev Genet, 15(5):347.

5 Rooney, A. P. and Zhang, J., 1999. Rapid evolution of a primate sperm protein: relaxation

6 of functional constraint or positive darwinian selection? Mol Biol Evol, 16(5):706–710.

7 Ruby, M. A., Riedl, I., Massart, J., Ahlin,˚ M., and Zierath, J. R., 2017. Protein kinase n2

8 regulates amp kinase signaling and insulin responsiveness of glucose metabolism in skeletal

9 muscle. Am J Physiol Endocrinol Metab., 313(4):E483–E491.

10 Sawyer, S. A. and Hartl, D. L., 1992. Population genetics of polymorphism and divergence.

11 Genetics, 132(4):1161–1176.

12 Schraiber, J. G., 2014. A path integral formulation of the wright–fisher process with genic

13 selection. Theor Popul Biol, 92:30–35.

14 Smith, N. G. and Eyre-Walker, A., 2002. Adaptive protein evolution in drosophila. Nature,

15 415(6875):1022–1024.

16 Wang, K., Li, M., and Hakonarson, H., 2010. Annovar: functional annotation of genetic

17 variants from high-throughput sequencing data. Nucleic Acids Res, 38(16):e164–e164.

18 Williamson, S. H., Hernandez, R., Fledel-Alon, A., Zhu, L., Nielsen, R., and Bustamante,

19 C. D., 2005. Simultaneous inference of selection and population growth from patterns of

20 variation in the human genome. Proc Natl Acad Sci U.S.A., 102(22):7882–7887.

21 Woitecki, A. M., M¨uller, J. A., van Loo, K. M., Sowade, R. F., Becker, A. J., and Schoch,

22 S., 2016. Identification of synaptotagmin 10 as effector of npas4-mediated protection from

23 excitotoxic neurodegeneration. J Neurosci, 36(9):2561–2570.

24 Wu, H. and Su, B., 2008. Adaptive evolution of scml1 in primates, a gene involved in male

25 reproduction. BMC Evol Bio, 8(1):192.

41 bioRxiv preprint doi: https://doi.org/10.1101/367482; this version posted July 11, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

1 Wyckoff, G. J., Wang, W., Chung, W., et al., 2000. Rapid evolution of male reproductive

2 genes in the descent of man. Nature, 403(6767):304.

3 Yan, W., Si, Y., Slaymaker, S., Li, J., Zheng, H., Young, D. L., Aslanian, A., Saunders,

4 L., Verdin, E., and Charo, I. F., et al., 2010. Zmynd15 encodes a histone deacetylase-

5 dependent transcriptional repressor essential for spermiogenesis and male fertility. J Biol

6 Chem., 285(41):31418–31426.

7 Yang, Z., 1998. Likelihood ratio tests for detecting positive selection and application to

8 primate lysozyme evolution. Mol Biol Evol, 15(5):568–573.

9 Yang, Z., 2007. Paml 4: phylogenetic analysis by maximum likelihood. Mol Biol Evol,

10 24(8):1586–1591.

11 Yang, Z. and Nielsen, R., 2002. Codon-substitution models for detecting molecular adapta-

12 tion at individual sites along specific lineages. Mol Biol Evol, 19(6):908–917.

13 Zhang, J., Nielsen, R., and Yang, Z., 2005. Evaluation of an improved branch-site likelihood

14 method for detecting positive selection at the molecular level. Mol Biol Evol, 22(12):2472–

15 2479.

16 Zivkovi´c,ˇ D., Steinr¨ucken, M., Song, Y. S., and Stephan, W., 2015. Transition densities and

17 sample frequency spectra of diffusion processes with selection and variable population size.

18 Genetics, 200:601–617.

42 bioRxiv preprint doi: https://doi.org/10.1101/367482; this version posted July 11, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Chrom01 Chrom02 Chrom03 Chrom04 Chrom05 Chrom06

NOC2L ZFYVE28 COLEC11 DOK7 ADAMTS16 TNFRSF8 KLF11 LYAR CYTL1 PADI3 MTHFR ADAM17 SETD5 CRELD1 MAN2B2 GEN1 HRH1 ATG7 SH3TC1 TMCO4 C1orf64 KCNS3 CLNK BASP1 JARID2 UBXN10 ALDH4A1 WDR35 FGD5 CDH12 EPHA8 PLA2G2C ASXL2 OTOF SLC17A1 ALDH5A1 LUZP1 DNAJC5G TRMT61B NKAPL ZBTB40 TMPPE TARS OR2B6 NR0B2 C1QB CAPN14 DTHD1 OR2H1 OR12D2 EIF2AK2 HEATR5B TRANK1 EPM2AIP1 FAM114A1 VARS2 CSMD2 FUCA1 CDKL4 VILL CYP8B1 CARD6 TRIM26 MRPS15 COL16A1 SLC3A1 C6orf15 MUC22 EPAS1 SNRK ZNF502 HLA-C ZNF684 ZMYM4 CCR3 SETD2 CDSN PTPRF CSF3R ATRIP BTNL2 ATF6B TOE1 MTIF2 CCDC85A P4HTM HLA-DQA1 PPCS PUS10 AHSA2 LAMB2 CCDC36 HLA-DRB5 CYP4B1 B4GALT2 C3orf62 C5orf64 HLA-DPA1 HLA-DQB1 ZFYVE9 SERTAD2 CDHR4 TAF11 MKNK1 UBA7 TRAIP CABS1 UGT2A2 CENPH HLA-DPB1 HHLA3 STIL RBM6 MAP1B MDGA1 C6orf222 IFI44 C2orf78 DUSP11 HEMK1 ART3 ALB MOCS1 TTC22 HK2 TTC31 ARHGEF3 FHIT STBD1 F2RL2 KIF6 PKN2 CTH CMYA5 TRERF1 NCR2 BCAR3 DNAH6 ZNF318 CLCA1 GGCX ST3GAL5 ARHGAP24 PTCRA ALG14 GBP4 IMMT MUT MRPL14 PROK1 F3 ZNF2 ZNF514 KIAA0825 DEFB112 DEFB110 MTMR11 OR5K4 FBXO9 DPYD RFX8 ASTL SENP7 ICK C1orf54 MAN1A2 TGFBRAP1 GIN1 PHF3 HMGCLL1 CGN CA14 NPHP1 OGFRL1 EYS TCHHL1 SOWAHC LRIT3 PJA2 SPACA1 RFX5 ZC3H6 PSD4 TMPRSS7 SLC35A5 APC IRAK1BP1 LCE3B GTPBP8 B4GALT4 LAMA4 THEM4 DDX18 FAM170A FBXL4 CRTC2 RPTN GLI2 PLA1A NR1I2 EXOSC9 KIAA1109 ZNF474 ROS1 DSE PBXIP1 SMCP GTF2E1 PARP15 ZNF608 GPATCH4 MYO7B CREB3L4 PTPN18 UMPS OSBPL11 RAPGEF6 TAAR8 AKAP7 ETV3 ARHGEF2 AMOTL2 GFRA3 TAAR6 OR6P1 ESYT3 NME5 HBS1L NES XRN1 PCDHGB1 PHACTR2 MYB OR10J3 CD1C FREM3 SLC23A1 ARHGAP26 SLAMF6 PCDH1 AIM2 TSC22D2 NR3C2 SYNPO NIT1 APCS TIGD4 FGA SPINK9 ESR1 CENPL LOC100144595 PLCH1 FAT2 CD244 RBM46 TMEM144 ADAM19 FNDC1 FAM20B ITGB6 TRIM59 FNIP2 GPA33 SCN9A LMOD1 ANGPTL1 SCN3A UNC93A IL24 LRP2 GOLIM4 PALLD FOXI1 LCP2 NAV1 SPATA16 CD34 SLC26A9 SYT14 PIGR MFN1 TRAF5 DIEXF TBCCD1 SORBS2 DNAH14 HHAT OSGEPL1 LPP OBSCN SLC30A10 ATP13A3 RBM34 EPHX1 PGAP1 ANKRD44 MUC4 ZP4 CAPN9 MARS2 CFLAR OR6F1 RYR2 CASP10 ALS2 OR2AK2 OR2B11 ICA1L OR1C1 ABCA12 BARD1 NHEJ1 PAX3

ECEL1 SP140L RBM44 COL6A3 KIF1A PER2 FARP2 C2orf54

Chrom07 Chrom_X Chrom08 Chrom09 Chrom10 Chrom11

ERICH1 DMRT3 BET1L SIGIRR XG GYG2 MYOM2 GLIS3 DAGLB ZDHHC4 PINX1 INSL6 FAM208B ANKRD16 MUC5B TSPAN32 ARSD ARSF RP1L1 TRPM5 OSBPL5 COL28A1 ICA1 MXRA5 STS NFIB THSD7A VWDE OR51F1 OR52J3 FAM9A ATXN3L CSGALNACT1 OR51B2 OR51B6 SCIN AHR ASB11 ACE2 PIWIL2 NUDT18 DNAH11 STK31 PPP3CC TEK OR51M1 UBQLN3 SCML1 BEND2 ADRA1A WAC SVIL OR52N4 CNGA4 SCRN1 CDKL5 ZNF645 CLU UNC13B C9orf131 OR10A4 OR2D2 POLA1 MAGEB10 NLRP14 NLRP10 MAGEB3 FAM47A HRCT1 FAM166B URGCP FRMPD1 RET STK33 SBF2 TNS3 PKD1L1 FAM47C UBA1 RBP3 MSMB ZDHHC13 CD44 GPKOW FOXP3 SNTG1 OGDHL C11orf74 C11orf49 ZNF713 SHROOM4 PHF8 MYBPC3 OR4S1 FOXR2 OR4C11 OR4S2 MTMR8 ZNF117 ZNF273 ASB12 C8orf44 OR5W2 OR8K1 HEPH VSIG4 FXN FAM189A2 DNAJC12 AIFM2 OR5R1 OR5AP2 TEX11 DGAT2L6 TERF1 MAMDC2 ADAMTS14 UNC5B OR9Q1 MS4A2 STYXL1 MAGEE2 FAM149B1 SEC24C ANXA7 SCGB2A2 EHBP1L1 RMI1 NDST2 OVOL1 CTSF DAPK1 SH2D4B ZNF804B OSGIN2 CCDC87 SSH3 FAM133A TMEM67 MYOF TCIRG1 IGHMBP2 ZNF782 CYP2C8 TRIM14 PYROXD2 TCTN3 NADSYN1 RELT RELN NAPEPLD TCEAL2 TAF7L DCAF13 TBC1D2 NOLC1 RSF1 DLG2 GLRA4 TCEAL4 ELOVL3 CCDC89 FAT3 SLC26A3 PKHD1L1 AKAP2 TMEM168 FOXP2 MUM1L1 IL1RAPL2 MUSK HEPHL1 MTMR2 TEX13B RIPPLY1 KIAA0368 ZNF883 NHLRC2 DCLRE1A BIRC3 EXPH5 ALG13 VSIG1 ZFP37 PAPPA DDX10 PPP2R1B RNF148 POT1 DOCK11 LUZP4 SQLE ZNF572 TLR4 C5 PSTK C11orf57 RNF214 FAM71F1 CCDC136 XPNPEP2 LAMP2 OR1J2 OR1N2 NLRX1 TMEM225 TFDP3 ZNF280C OR1Q1 PDCL KNDC1 ADAM8 OR8D4 ACRV1 SAGE1 GPC3 NEK6 SLC25A25 ECHS1 SLC37A3 DENND2A DENND3 ARHGEF6 ZNF623 PKN3 C9orf106 TAS2R38 TAS2R60 FMR1NB MAPK15 C9orf50 PTGES TAS2R41 OR2F2 FMR1 PLEC BOP1 PASD1 MAGEA5 ZNF251 SURF6 REXO4 SSPO ZNF862 TKTL1 ZNF34 COL5A1 MAMDC4 ATP6V0E2 MAGEA1 PLXNA3 SLC10A3 EHMT1 CACNA1B F8

Chrom12 Chrom13 Chrom14 Chrom15 Chrom16 Chrom17

SLC6A12 RAD52 TMEM8A WDR90 INPP5K OVCA2 AKAP3 LTBR MSLNL ECI1 SMG6 SRR ACRBP CD163 OR1F1 TIGD7 GP1BA PITPNM3 CLEC12A PRR4 ZNF500 C16orf89 DNAH2 AURKB GSG1 GUCY2C OR4K17 TEP1 PMM2 GRIN2A SLC25A35 DHRS7C ARNTL2 PARP4 OR5AU1 CDH24 CLEC16A TNP2 MYH8 MYH4 ERGIC2 PRM1 PDXDC1 SYT10 RNF31 IRF9 SPAG5 KIAA0100 CTSG PRKD1 C16orf45 IQCK SUPT6H ERAL1 ERI2 USP31 AKAP11 KIAA0391 LRFN5 SEZ6 EFCAB5 MTRF1 C14orf28 PLCB2 DISP2 SCNN1B ARHGAP17 SLC6A4 TMIGD1 RAPGEF3 RPAP3 LCMT2 ELL3 HIRIP3 PYGL NIN ITGAL CPD ADAP2 DNAJC22 CCNT1 SPG11 PRR14 ADCY7 C17orf78 GPR179 BCDIN3D FAM186B CGNL1 MLLT6 GSDMA KRT75 CERS5 KRT26 KRT27 SGPP1 IGDCC4 KRT3 KRT76 FAM71D RAD51B CMTM3 C16orf86 KRTAP4-5 KRTAP4-3 PDE1B GPR84 CENPT VRTN PROX2 CD276 HYDIN KRTAP4-1 KRTAP9-1 OR6C75 OR6C3 ESRRB SAMD15 PMFBP1 ZNF385C DBF4B PMEL OR6C4 MTHFS PKD1L2 C14orf178 TSHR WHAMM C16orf46 TTLL6 CALCOCO2 ZBTB39 SDR9C7 HSDL1 SDR42E1 NGFR FLJ45513 TSFM INHBC KCNK10 CCDC88C PLIN1 ZNF469 DNAAF1 SGCA ACSF2 WIF1 AVIL TC2N MCTP2 TCF25 LOC100287036 ABCC3 KIF2B MDM2 NUP107 SERPINA5 SERPINA11 SLC15A1 HSP90AA1 GAS8 BRIP1 SCN4A CCDC59 PTPRQ KIF26A RGS9 CEP112 PLEKHG7 LUM DAOA GPR132 INF2 TEX22 HELZ NOL11 MYBPC1 TMPO BPTF ABCA10 APPL2 PMCH SDK2 GPR142 DAO SSH1 EVPL SLC26A11 FAM216A UBE3B RNF213 SLC38A10 NAA25 TCTN1 CCDC137 P2RX7 OAS1 ZNF605

Chrom18 Chrom19 Chrom20 Chrom21 Chrom22

THOC1 CETN1 POLRMT ABCA7 C20orf96 TGM6 LPIN2 LAMA1 TCF3 ZNF556 TMC2 NOP56 APCDD1 LRG1 C3 C20orf194 SH2D3A CLEC4G BTG3 CTAGE1 CD209 PRAM1 CST9L CDC45 ARVCF MUC16 OR7G2 IGLL5 MMP11 TTR ZNF560 ICAM1 DEFB118 TTLL9 KRTAP6-3 USP16 SLC5A1 ZNF396 DTNA PDE4A TSPAN16 CDK5RAP1 RALY DOPEY2 DNAJC28 APOL3 RBFOX2 SLC39A6 CCDC151 ZNF20 CPNE1 SAMHD1 UMODL1 MX1 MYH9 APOL1 ZNF442 RNASEH2AARHGAP40 LPIN3 KRTAP10-8 KRTAP10-5 TRIOBP SH3BP1 MYO5B OR7A17 CPAMD8 MYBL2 ADA ADSL APOBEC3G TCF4 STARD6 SUGP2 ZNF85 SEMG2 SLC2A10 ARFGAP3 CYB5R3 WDR7 CHST8 ZNF599 SULF2 CTCFL PARVB SCUBE1 CD22 ZNF566 LAMA5 SRMS CELSR1 FAM118A ZNF567 ZNF568 CPT1B PPP6R2 RTTN ZNF585B ZNF540 ECH1 BLVRB CEACAM4 PHLDB3 BBC3 SULT2A1 SYNGR4 DHDH PRRG2 VRK3 SIGLEC7 NKG7 SIGLEC6 FPR1 ZNF766 VN1R2 NCR1 NLRP7 RPL28 ISOC2 CCDC106 NLRP8 ZIM2 USP29 ZNF549 ZIK1 ZNF814 ZNF135 A1BG

Figure 5: A map of Darwin selection of the human genome. 43 bioRxiv preprint doi: https://doi.org/10.1101/367482; this version posted July 11, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Chrom01 Chrom2A Chrom2B Chrom03 Chrom04 Chrom05

NADK PRDM16 ZNF721 FAM193A EXOC3 NOL9 DDX18 IL5RA NOP14 ADAMTS16 DFFA CMPK2 INHBB EPB41L5 CRELD1 ZNF518B MFAP2 PADI3 MBOAT2 GYPC SETD5 ALDH4A1 MYO7B CD38 TMCO4 KCNS3 NT5C1B UGGT1 GPR17 CLRN2 KIF17 LDLRAD2 PTPN18 GPR150 ZBTB40 GDF7 ADCY3 MYOM3 SLC5A6 DNAJC5G OXSM MAP3K6 AHDC1 IFI6 GALNT14 CAPN14 ZBTB8A FAM98A EIF2AK2 CMYA5 TMCO2 MED8 MYD88 SCN11A RFC1 IQGAP2 PTCH2 CEBPZ CDKL4 RBM43 NEB PDS5A F2RL2 TOE1 ABCG8 LRPPRC CX3CR1 ZNF662 N4BP2 POLK MKNK1 CYP4B1 EPAS1 PRSS42 USP19 MAST4 ZFYVE9 SCP2 NRXN1 LAMB2 CCDC36 STBD1 FRAS1 ELOVL7 DMRTB1 TTC22 RTN4 CCDC85A SCN3A GALNT3 AMIGO3 CDHR4 ART3 C8B HHLA3 SPC25 LRP2 CXCL6 GZMK CCNO UBA7 RBM6 CABS1 ALB TNNI3K IFI44 SLC38A3 IFRD2 UGT2A3 CLCA1 GBP4 NAT6 TLR9 BCAR3 F3 ASPRV1 DYSF DNAH1 CCDC142 PHF7 ALG14 DPYD TTC31 ITIH3 SFMBT1 ERVMER34-1 SPEF2 PALMD SLC30A7 HK2 CHDH FOXP1 TARS COL11A1 OCIAD2 GSTM4 DNAH6 HECW2 PGAP1 EBLN2 CNTN3 MAN1A2 ST7L AFF1 HERC5 PRDM9 CTTNBP2NL MARS2 ORC2 TIGD2 CHIA ITPRIPL1 CASP10 ALS2 ERAP1 ERAP2 PROK1 MTMR11 CA14 NPAS2 RAPH1 PARD3B CTSS NRP2 BARD1 COL8A1 SENP7 EMCN LYSMD1 CGN SH3RF3 NFKBIZ AIMP1 TUFT1 ATIC USP37 ARHGEF38 PJA2 TMEM232 RIIAD1 SEC24B EPB41L4A THEM4 TCHHL1 GTPBP8 TMPRSS7 TSSK1B KPRP SPICE1 LCE1B SP140 ARHGAP31 ZNF474 LELP1 NUP210L CHRND ECEL1 FBXO40 PLA1A PBXIP1 FLAD1 COL6A3 SH3BP4 ILDR1 GOLGB1 SLC12A2 RUSC1 ARHGEF2 MLPH PARP15 PARP9 RAPGEF6 SEC24A PER2 TXNDC15 TTC24 GPATCH4 AQP12B PRR21 PDIA5 PARP14 GFRA3 INSRR CD1C ANO7 H1FOO DNAJB8 CD14 PCDHGA8 OR10K1 OR6P1 NMNAT3 PCDHGB6 ARHGAP26 APCS ARHGEF37 LY9 FAT2 CD244 ITLN1 TM4SF1 P2RY14 SH3D19 LRBA ADCY10 SERPINC1 SUCNR1 PLCH1 FGA DCHS2 FAM20B RNASEL TIPARP VEPH1 EDEM3 RARRES1 NAV1 OTOL1 TRIM60 SLC45A3 FCAMR YOD1 CD55 LRRIQ4 PLD1 CD34 PLXNA2 SPATA16 NAALADL2 TRAF3IP3 RRP15 MFN1 IARS2 HHIPL2 MAP3K13 ANKRD37 TAF1A FAM177B CRYGS SRP9 LPP HRG GUK1 TFRC ATP13A4 NUP133 CAPN9 PIGZ HEATR1 EXO1 OR13G1 OR14C36

Chrom06 Chrom07 Chrom_X Chrom08 Chrom09 Chrom10 ADAP1 SDK1 GYG2 ARSD MYOM2 DMRT3 KCNV2 LARP4B PAPOLB GLIS3 UCN3 AKR1E2 CAGE1 VWDE THSD7A ARSE ARSF KIAA2026 TUBAL3 SCIN MXRA5 FAM9A MTMR9 FREM1 FAM9C MOSPD2 PSD3 OLAH HDGFL1 FAM8A1 DNAH11 ACE2 CTPS2 PIWIL2 KIAA0319 ALDH5A1 TAX1BP1 HOXA1 TXLNG SCML1 TRIM38 FAM65B BEND2 SCML2 MAS1L ZNF391 YY2 ZNF645 B4GALT1 RNF39 EPDR1 KCNU1 FAM166B CCDC107 TRIM26 VARS2 MRPL32 AEBP1 DDX53 DCAF8L2 KAT6A OR13J1 MUC21 IGFBP1 MAGEB10 MAGEB2 HRCT1 MUC22 TNS3 FRMPD1 PGM5 RASSF4 CDSN ZPBP MAGEB3 MAGEB4 TJP2 RBP3 MICB CCHCR1 MAGEB1 FTHL17 MAMDC2 OGDHL FRMPD2 VARS SUMF2 TMEM2 GDA ZBTB12 BTNL2 FAM47A FAM47C PHF1 CUTA BCOR MAOA PNPLA1 PXT1 RGN ZNF630 MDGA1 EBP GLOD5 HK1 DNAH8 FBP2 PTCH1 DAAM2 PTCRA GPKOW MAGIX ZNF318 STYXL1 CACNA1F PAGE1 TRIM14 COL15A1 POLH INVS DLG5 DEFB114 SHROOM4 MAGED1 TMEM38B CDHR1 SH2D4B PKHD1 PCLO ACTL7B SVEP1 IL17F HCRTR2 APEX2 FOXR2 COL21A1 FAAH2 LAS1L NBN MUSK FKBP15 LIPA ZNF451 TNC MYOF NOC3L COL19A1 OGFRL1 PEX1 CALCR PJA1 OTUD6A KIAA1429 DDX43 DGAT2L6 RAB41 TSPYL5 MTDH C5 MEGF9 PIK3AP1 PYROXD2 MB21D1 SNX31 OR1N2 LZTS2 PPRC1 CGA SPACA1 ZCWPW1 MEPCE TEX11 MED12 DCAF13 RALGPS1 FBXL4 MUC17 MOGAT3 ZMYM3 ZCCHC13 ODF2 ZNF79 ELOVL3 SH3PXD2A RELN SLC26A3 MAGEE2 FAM46D PKHD1L1 GBGT1 SETX SORCS3 RFPL4B TRAF3IP2 TMEM168 APOOL DACH2 COL5A1 DCLRE1A DSE FCN1 TAF7L ARMCX2 LCN12 LCN1 ZMAT1 GPRASP1 ATE1 TACC2 BHLHB9 NXF3 PSTK CTBP2 CCDC136 FAM71F1 BEX2 ZCCHC18 BCCIP CLRN3 TAAR8 TAAR6 AGBL3 ESX1 TEX13A WISP1 TG PRAP1 HBS1L PEX7 MUM1L1 RIPPLY1 ZFAT CCDC28A ECT2L ZC3HAV1 SLC37A3 DENND3 TSNARE1 DENND2A VSIG1 ATG4A GSDMD AGK ALG13 AMOT BOP1 TAS2R5 TAS2R60 ZNF517 SYNE1 OR2A2 IL13RA2 LUZP4 TPK1 DOCK11 LONRF3 SSPO ATP6V0E2 PLG RHOXF1 CUL4B UNC93A CT47B1 GLUD2 CCR6 WDR27 DCAF12L1 UTP14A THBS2 ELF4 ZNF280C USP26 TFDP3 PLAC1 SAGE1 VGLL1 MCF2 MAGEC2 FMR1NB MAMLD1 CD99L2 PASD1 MAGEA4 PNMA5 TKTL1 ATP6AP1

Chrom11 Chrom12 Chrom13 Chrom14 Chrom15 Chrom16

SCGB1C1 SIGIRR RAD52 CCDC78 CCDC154 CDHR5 TMEM80 FOXM1 OR5AU1 MKRN3 SPSB3 GALNT8 LTBR SALL2 PRSS27 TALDO1 CTSD CDCA3 TNFRSF19 CDH24 REC8 APBA2 PRSS21 FLYWCH1 TNNI2 TSPAN32 CLEC4D ADCY4 ZNF213 MFAP5 PZP CMA1 TMCO5A ZNF200 KCNQ1 SLC22A18 CLECL1 RFC3 BRCA2 GZMH BUB1B ZNF597 NLRC3 OR51F2 OR52J3 CLEC12A STOML3 FREM2 PPP1R14D ZNF500 CLEC1B STYK1 MAP1A SPTBN5 NAGPA OR51V1 OR51B6 TAS2R13 WDR76 ATF7IP2 TNP2 OR52N4 CNGA4 GUCY2C ZC3H13 DUOX1 PRM2 PDE3A HELB TRIM13 CKAP2 TMX1 NIN WDR72 DUT PRM1 APBB1 NLRP10 TBK1 NID2 UNC13C ABCC6 IQCK MICAL2 CD44 AVIL PTGDR AQP9 GP2 INHBC SHMT2 VPS13C CCNB2 EEF2K TTC17 EXT2 RDH16 ZWILCH SCNN1B RBBP6 HARBI1 ARHGAP1 KRT3 SYNE2 SGPP1 SH2B1 KRT76 FIGNL2 FAM71D ASPHD1 MYBPC3 FAM180B FAM186A TMEM202 TBX6 ITGAL NUP160 OR4S2 DNAJC22 ACOT6 ENTPD5 NEIL1 FUS SENP1 RAPGEF3 LMO7 PGF PYCARD OR5W2 OR5M9 RPAP3 PROX2 CPEB1 WHAMM CCDC113 GINS3 OR5M1 OR10V1 BEAN1 CKLF NUP107 MS4A3 MS4A13 KCNK10 ZC3H14 CMTM1 NAE1 MS4A10 SYT7 GLIPR1L2 CATSPERB TRIP11 CDH16 B3GNT9 BEST1 GANAB SERPINA5 HSF4 KIAA0895L SLC3A2 SLC22A6 TSNAXIP1 CCDC59 NALCN CENPT SPDYC EHBP1L1 RASSF9 GPR132 AHNAK2 NFATC3 PLA2G15 MUS81 RIN1 DAOA PRMT7 ZFP90 SLC29A2 CCDC87 EEA1 PACS2 HYDIN NR2C1 TMPO PROZ SPACA7 ZNF23 FADD LRTOMT DEPDC4 LAMP1 ATXN1L PKD1L2 CLPB ARHGEF17 PMCH GAS2L3 TAF1C PABPN1L MYO7A TMEM126B EID3 HCFC2 CPNE7 GAS8 CCDC89 SYTL2 OAS1 SSH1 MTMR2 CCDC82 OAS2 DYNC2H1 CASP1 ARHGAP20 CD3D P2RX7 HPD OR8D4 ACRV1 IL31 ARL6IP4 NTM SPATA19 PITPNM2 MPHOSPH9 DHX37 AACS NCAPD3 FBRSL1

Chrom17 Chrom18 Chrom19 Chrom20 Chrom21 Chrom22

FAM57A CDC34 BSG DEFB126 TMC2 LIPI ANKFY1 OR3A2 PTPN2 KISS1R SBNO2 NOP56 MICAL3 GGT6 APCDD1 TMEM239 USP18 CXCL16 LAMA1 ADAMTSL5 PLK5 PTPRA SLC4A11 TSSK2 CLTCL1 FAM64A GP1BA METTL4 TCF3 REXO1 PROKR2 TRMT2A PITPNM3 KRTAP26-1 USP16 TMEM211 SLC13A5 RBBP8 DOT1L ZNF554 IFNAR1 LRP5L MYO18B ACADVL CABYR GZF1 KRTAP27-1 TTLL6 OSBPL1A ZNF556 TJP3 B3GALT5 DONSON CRYBB1 MTMR3 DBF4B FMNL1 DSG4 ATCAY ANKRD24 SEC14L3 CCDC43 UMODL1 MX1 MORC2 KRTAP9-1 ZNRF4 PRR22 DEFB118 CDK5RAP1 PDXK RSPH1 SMTN BPIFC KRTAP4-4 KRTAP4-3 DUS3L C3 TP53INP2 APOL3 KRT27 PROCR ITGB2 AIRE APOL2 DHRS11 CAMSAP3 CD320 GDF5 SCAND1 MYH9 SUN2 TAF15 RDM1 KATNAL2 OR7D2 ZNF561 DLGAP4 APOBEC3G SLFN14 MAPK4 MYO5B SLA2 APOBEC3H FNDC8 DCC ICAM1 ZNF442 TOX2 MATN4 ADSL SREBF2 CPD ATAD5 FECH PRDX2 MAST1 SPATA25 FAM109B SUPT6H ATP8B1 SLC2A10 PARVB KIAA0100 CASP14 CPAMD8 PREX1 ADNP SMC1B TBC1D22A SEBOX FOXN1 HAUS8 GTPBP3 ZNF217 ALG12 SLC47A1 BCAS1 NCAPH2 DNAH9 SLC5A5 KIAA1683 CTCFL SYCP2 KLHDC7B ACR ALOX15B MYH4 GDF15 SUGP2 BIRC7 TAC4 ZNF516 SRMS SGCA ATP9B WDR88 LRP3 OPRL1 MYT1 ABCC3 CACNA1G CHST8 GPI RNF43 KIF2B ZNF599 CD22 PPM1D RAD51C GAPDHS ZBTB32 MILR1 ICAM2 U2AF1L4 NPHS1 NOL11 RGS9 SDHAF1 ZNF568 ABCA10 ABCA8 ZNF850 GGN KIF19 SDK2 CNTD2 DMRTC2 LLGL2 SLC25A19 CEACAM1 IRGQ EVPL RECQL5 ZNF180 BCL3 EXOC7 GALR2 APOE ZNF296 RNF213 METTL23 CD3EAP PPM1N CCDC137 AATK IGFL1 ZNF541 CD7 BSPH1 PLEKHA4 CD37 TSKS VRK3 FPR1 VN1R2 LILRB5 PTPRH SBK2 EPN1 NLRP11 ZNF471 USP29 ZNF304 ZNF256

44 Figure 6: A map of Darwin selection of the chimpanzee genome.