
G3: Genes|Genomes|Genetics Early Online, published on May 13, 2019 as doi:10.1534/g3.119.400223

Humans and chimpanzees display opposite patterns of diversity in arylamine N-acetyltransferase genes

Christelle Vangenot*, Pascal Gagneux†, Natasja G. de Groot‡, Adrian Baumeyer§, Médéric Mouterde*, Brigitte Crouau-Roy**, Pierre Darlu††, Alicia Sanchez-Mazas*,‡‡, Audrey Sabbagh§§, Estella S. Poloni*,‡‡

Affiliations

* Department of Genetics and Evolution, Anthropology Unit, University of Geneva, Switzerland

† Departments of Pathology and Anthropology, CARTA (Center for Academic Research and Training in Anthropogeny), University of California San Diego, La Jolla, USA

‡ Department of Comparative Genetics and Refinement, Biomedical Primate Research Centre, the Netherlands

§ Zoologischer Garten Basel AG, Basel, Switzerland

** CNRS, Université Toulouse 3 UPS, ENFA, UMR 5174, Toulouse, France

†† CNRS/Muséum National d'Histoire Naturelle, UMR 7206 Paris, France

‡‡ Institute of Genetics and Genomics in Geneva (IGE3), Switzerland

§§ UMR 216 MERIT, IRD, Université Paris Descartes, Sorbonne Paris Cité, Paris, France


© The Author(s) 2013. Published by the Genetics Society of America.

Running title

Diversity in hominid NAT genes

Keywords

Arylamine N-acetyltransferases, multigenic family, drug metabolism, great apes, natural selection

Corresponding authors

Estella S. Poloni

Department of Genetics and Evolution (GENEV), Anthropology Unit

University of Geneva

30, quai Ernest-Ansermet

1211 Genève 4

Switzerland

Phone +41 (0)22 379 69 77

e-mail [email protected]

Christelle Vangenot

Department of Genetics and Evolution (GENEV), Anthropology Unit

University of Geneva

30, quai Ernest-Ansermet

1211 Genève 4

Switzerland

e-mail [email protected]

ABSTRACT

Among the many genes involved in the metabolism of therapeutic drugs, arylamine N-acetyltransferase (NAT) genes have been extensively studied, due to their medical importance both in pharmacogenetics and disease epidemiology. One member of this small gene family, NAT2, is established as the locus of the classic human acetylation polymorphism in drug metabolism. Current hypotheses hold that selective processes favoring haplotypes conferring lower NAT2 activity have been operating in modern humans’ recent history as an adaptation to local chemical and dietary environments. To shed new light on such hypotheses, we investigated the genetic diversity of the three members of the NAT gene family in seven hominid species, including modern humans, Neanderthals and Denisovans. Little polymorphism sharing was found among hominids, yet all species displayed high NAT diversity, but distributed in an opposite fashion in chimpanzees and bonobos (Pan genus) compared to modern humans, with higher diversity in Pan species at NAT1 and lower at NAT2, while the reverse is observed in humans. This pattern was also reflected in the results returned by selective neutrality tests, which suggest, in agreement with the predicted functional impact of mutations detected in non-human primates, stronger directional selection, presumably purifying selection, at NAT1 in modern humans, and at NAT2 in chimpanzees. Overall, the results point to the evolution of divergent functions of these highly homologous genes in the different primate species, possibly related to their specific chemical/dietary environment (exposome), and we hypothesize that this is likely linked to the emergence of controlled fire use in the human lineage.

INTRODUCTION

Hominid species, the so-called “great apes”, share a recent common history that makes the genomic diversity of our closest relatives highly informative in evolutionary studies on our own species and more broadly on all great apes. This potential to answer evolutionary questions regarding all hominid species, including our own, is extensively exploited, and notably so in the last half decade thanks to the accelerated pace at which whole genome sequences are generated (Kuhlwilm et al. 2016). As foreseen by Olson and Varki (2003), analyzing genetic and genomic diversity in the great apes is also providing novel insights of medical interest (e.g. Enard et al. 2002; Bergfeld et al. 2017; Solis-Moruno et al. 2017; Vangenot et al. 2017). Indeed, physiological differences between species, whether they emerged through demographic processes or as adaptive responses, might offer insights into present-day questions regarding human health and disease (Olson and Varki 2003; O'Bleness et al. 2012).

Arylamine N-acetyltransferase (NAT) genes are members of a small multigene family coding for enzymes that biotransform numerous compounds. In humans, its two functional members, NAT1 and NAT2, show differences in expression patterns and substrate specificity, while the third member (NATP) is a pseudogene. Whilst the NAT2 isoenzyme has a major role in the metabolism of xenobiotics, including therapeutic drugs and carcinogens, growing evidence supports an additional role for NAT1 in physiological processes (notably folate and methionine metabolism) and cancer cell biology. The three NAT genes reside in a 200 kb region on the short arm of chromosome 8, and each of the two functional genes has a single, uninterrupted, 870 bp-long coding exon that produces a protein of 290 amino acids (Blömeke and Lichter 2018; Sim and Laurieri 2018). Phylogenetic analyses of NAT sequences point to multiple episodes of NAT gene duplication or gene loss during vertebrate evolution (Sabbagh et al. 2013). However, in Simiiformes (monkeys and apes, including humans) the evidence suggests the occurrence of a single duplication event prior to their divergence, leading to the NAT1 and NAT2 paralogs, whereas a subsequent duplication of NAT2, which probably occurred in the common ancestor to Catarrhini (African and Eurasian monkeys and apes, including humans), gave rise to the NATP pseudogene. Nucleotide sequence identity between the three NAT paralogs in humans is high: 81% between the NAT1 and NAT2 coding exons, and 79% with the NATP pseudogene, while protein sequence identity between the two NAT enzymes is 87% (Sabbagh et al. 2013). Sequence identity between orthologous nucleotide sequences of humans, chimpanzees and gorillas is also high, about 98-99% for NAT1 and NAT2, and 96-97% for NATP, according to the human (GRCh37/hg19; Lander et al. 2001), chimpanzee (panTro4; The Chimpanzee Sequencing and Analysis Consortium 2005) and gorilla (gorGor4; Scally et al. 2012) reference genomes, with no clear distinction between these three great ape species. Indeed, identity at NAT1 between the human and chimpanzee reference sequences is 98.5%, compared to 99% between human and gorilla, and 98.5% between chimpanzee and gorilla; at NAT2 it is 98.6% between human and chimpanzee, 98.8% between human and gorilla, and 98.9% between chimpanzee and gorilla; and at NATP it is 96.8% between human and chimpanzee, 95.8% between human and gorilla, and 96.1% between chimpanzee and gorilla. These occurrences of apparent incomplete lineage sorting (although based on very slight differences) are not surprising, since it is estimated that about a third of the gorilla genome is more similar to that of humans or chimpanzees than the human and chimpanzee genomes are to each other (Scally et al. 2012; Kronenberg et al. 2018). This highlights the complex speciation history of great apes, which possibly included several admixture events, such as those recently evidenced between bonobos and the ancestors of Central and Eastern chimpanzees (de Manuel et al. 2016).
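As an illustration of how pairwise identity values such as those quoted above can be obtained from aligned sequences, the following is a minimal sketch. It assumes the orthologous sequences have already been aligned to the same length, and it ignores columns containing a gap. The function name and the toy sequences are hypothetical and are not taken from the study.

```python
def percent_identity(seq_a: str, seq_b: str) -> float:
    """Percent identity over aligned columns in which neither sequence has a gap."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    matches = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == "-" or b == "-":
            continue  # skip alignment gaps
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        raise ValueError("no comparable columns")
    return 100.0 * matches / compared


# Toy aligned fragments, for illustration only (not real NAT sequences).
toy_alignment = {
    "species_A": "ATGGACATTGAAGCATATTT",
    "species_B": "ATGGACATCGAAGCATATTT",
    "species_C": "ATGGACATTGAAGCGTATCT",
}

names = sorted(toy_alignment)
for i, first in enumerate(names):
    for second in names[i + 1:]:
        identity = percent_identity(toy_alignment[first], toy_alignment[second])
        print(f"{first} vs {second}: {identity:.1f}%")
```

With real data, the three pairwise values per locus would be computed from the exon sequences extracted from each reference assembly.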

Due to their involvement in inter-individual variation in response to therapeutic treatments, the molecular diversity of NAT genes in human populations has been intensively studied (Sabbagh et al. 2018). In particular, the NAT2 encoded enzyme, mainly expressed in the liver, small intestine and colon, is involved in the metabolic breakdown of several clinically relevant compounds (Hein 2002), including isoniazid, a first-line antibiotic included in anti-tuberculosis therapies since the 1950s (Zumla et al. 2013). Yet, it is safe to assume that none of these compounds has had any impact on the evolution of human NAT2. The single coding exon of NAT2 is highly polymorphic in humans, and it has been shown that the efficacy and/or toxicity of several clinically important drugs are associated with variation in enzymatic activity conferred by different NAT2 variants (Meyer 2004; Agundez 2008a; McDonagh et al. 2014). This, together with the involvement of the enzyme in the detoxification (N-, O- or N,O-acetylation) of numerous carcinogens, motivated numerous studies on the evolution of the diversity of NAT2 in human populations. Current hypotheses hold that geographically- or culturally-restricted selective processes favoring one or several haplotypes conferring lower NAT2 activity (as adaptations to specific chemical environments or dietary habits) have been operating in the recent history of human populations (Patin et al. 2006a; Patin et al. 2006b; Fuselli et al. 2007; Luca et al. 2008; Magalon et al. 2008; Sabbagh et al. 2008; Mortensen et al. 2011; Sabbagh et al. 2011; Patillon et al. 2014; Podgorna et al. 2015; Valente et al. 2015; Bisso-Machado et al. 2016; see also Sabbagh et al. 2018 for a review). Interestingly, the discovery of a strong association of decreased insulin sensitivity with a non-synonymous polymorphism in NAT2 (Knowles et al. 2016) brings indirect support to the hypothesis of dietary-linked selective pressures exerted on the evolution of this gene. NAT1 is also involved in N-acetylation reactions of numerous compounds, and it is mainly expressed in the liver. Moreover, as shown in the AceView browser (Thierry-Mieg and Thierry-Mieg 2006), it is also expressed in kidneys, lung, and blood cells, particularly in humans, and less so in chimpanzees. However, in contrast to NAT2, NAT1 is substantially less polymorphic in humans, and it is currently held that the accumulation of molecular variation in the NAT1 coding exon, in particular non-synonymous variation, has been hampered by relatively strong purifying selection acting on the gene (Patin et al. 2006a; Mortensen et al. 2011; Sabbagh et al. 2018). NATP, the third identified member of this family of genes in the human genome, bears several loss-of-function mutations, and transcriptome studies indicate that it is barely if ever transcribed (as shown in the EBI Gene Expression Atlas, entry ENSG00000253937 at www.ebi.ac.uk/gxa/home).

One approach to examine hypotheses of recent adaptation in acetylation activity in humans is to investigate the diversity of NAT genes in humans’ closest relatives. To this aim, we Sanger-sequenced the three members of the NAT gene family in 84 great ape DNA samples, including 68 chimpanzees of the Pan troglodytes verus sub-species (Western chimpanzees). We completed this dataset with 231 NAT sequence genotypes produced with next-generation sequencing (NGS) by the Great Ape Genome Project (GAGP) (Prado-Martinez et al. 2013). We also examined NAT sequence variation in reference genomes of ancient hominins (i.e. Homo sapiens from Ust’-Ishim, Neanderthal and Denisova). Similarly to humans, we found high levels of nucleotide and haplotype diversity in non-human primates, but distributed differently in Pan, such that diversity is higher at NAT1 and lower at NAT2, whereas the opposite is observed in humans. Hence we hypothesize that the highly homologous NAT1 and NAT2 genes evolved some divergence in functionality between species in the course of hominid history, and we discuss this hypothesis in relation to changes in the chemical or dietary environment, i.e. the exposome (Wild 2012), in which humans and chimpanzees have evolved.

MATERIALS AND METHODS

Great ape DNA samples

Eighty-four DNA samples of great apes were analysed in this study. These comprised DNA from 68 Pan troglodytes verus (Western chimpanzee) individuals, one P. t. troglodytes (Central chimpanzee) female, one P. t. schweinfurthii (Eastern chimpanzee) female, one P. paniscus (bonobo) male, five Gorilla gorilla and eight Pongo abelii (Sumatran orangutan) individuals. Among the 68 P. t. verus samples, 40 were from members of the Biomedical Primate Research Centre (BPRC) colony, 26 from the Center for Academic Research and Training in Anthropogeny (CARTA) and two from the Basel zoo (Supplementary Figure S1). These samples and their corresponding collections are described in Supplementary File S1.

Sanger sequencing of NAT genes

We sequenced the three segments in the NAT region that include the single coding exons of NAT1 and NAT2 and the homologous DNA stretch of NATP in the 84 DNA samples of great apes available for this study. Supplementary Table S1 lists the primers used in this study both for PCR amplification and forward and reverse sequencing of each of the three NAT loci in great apes. PCR conditions are provided in Supplementary File S1. Sanger sequencing was outsourced to Retrogen (San Diego, California, USA) and Macrogen (Seoul, South Korea). Supplementary Table S2 lists the available information on all non-human samples considered in this study, including those retrieved from the GAGP (see below).

Alignment of NAT sequences

All NAT sequences obtained were aligned on the homologs from the reference or draft reference assemblies of Homo sapiens (GRCh37/hg19, February 2009), Pan troglodytes verus (panTro4, February 2011), Pan paniscus (panPan1, May 2012), Gorilla gorilla gorilla (gorGor4, December 2014, as well as gorGor3 for verification, since these two reference sequences are not identical) and Pongo pygmaeus abelii (ponAbe2, July 2007), respectively, all downloaded from the UCSC Genome Browser (Kent et al. 2002; genome.ucsc.edu). Alignment was performed blind of the known relationships between individuals.

Retrieval of NAT sequences or polymorphic positions from public repositories

For comparison purposes, we retrieved NAT genotypes from the published unphased genomes of 79 great apes generated in the GAGP by NGS (Prado-Martinez et al. 2013), namely from 25 chimpanzees (including a hybrid P. t. verus/troglodytes individual), 13 bonobos, 31 gorillas and 10 orangutans. Two individuals, Harriet (P. t. schweinfurthii) and Boscoe (P. t. verus, Supplementary Figure S1), were in common in the GAGP and CARTA datasets, thus allowing a control of Sanger and NGS sequencing results. We extracted the relevant part of the available VCF files. Further details are provided in Supplementary File S1. All polymorphic positions detected in the Sanger-sequenced samples of this study or retrieved from the GAGP VCF files are detailed in Supplementary Tables S3, S4 and S5, for NAT1, NAT2 and NATP, respectively.

To allow comparison of segregating sites at the NAT genes among all hominids (i.e. all great apes, including humans), we recorded all human polymorphisms reported by the consensus gene nomenclature of human NAT alleles (http://nat.mbg.duth.gr/, accessed in August 2015), complemented with haplotype data from 1000 Genomes Phase 1 (The 1000 Genomes Project Consortium 2012) and published data (Patin et al. 2006a; Sabbagh et al. 2008; Mortensen et al. 2011; Podgorna et al. 2015). These are also reported in Supplementary Tables S3, S4 and S5.

Inference of NAT haplotypes in the genus Pan

Diploid haplotypes were inferred for all Pan individuals (but not for the other great ape samples, see Supplementary File S1) using PHASE version 2.1.1 (Stephens et al. 2001; Stephens and Scheet 2005). The program uses a Bayesian statistical method based on an approximate coalescent model for reconstructing haplotypes from genotype data. PHASE implements a recombination method (the -MR option), which allows specifying the relative physical location of each SNP and accounts for the decay in linkage disequilibrium with distance.

Constitution of two samples of Pan troglodytes verus unrelated individuals

We considered separately two samples of unrelated individuals from the Western (P. t. verus) chimpanzee sub-species, BPRC and San Diego (Supplementary File S1), notably due to significant differentiation between these two samples at the NATP pseudogene (Supplementary Table S6, see results).

NAT haplotypes in humans

A dataset of NAT haplotype frequency distributions in human population samples was assembled from published NAT sequences of the same length as those of Pan, obtained through a comprehensive literature search at the time of the study. Only populations represented by samples including at least 15 individuals (30 chromosomes) were considered. We thus used published data samples from Mortensen et al. (2011) and from Sabbagh et al. (2018). We also extracted NAT1, NAT2 and NATP phased genotypes from the 1000 Genomes Phase 1 dataset (The 1000 Genomes Project Consortium 2012; see Supplementary File S1). In total, the human dataset consists of 20, 18 and 18 samples of unrelated individuals, from human populations distributed on four continents (Sub-Saharan Africa, Europe, East Asia and America), for NAT1, NAT2 and NATP, respectively (Supplementary Table S7).

NAT polymorphisms in ancient genomes of hominins

Variant calls in the homologous NAT sequences of ancient genomes of the genus Homo, namely Neanderthal (Altai and the composite genome of three individuals from Vindija; Green et al. 2010; Prufer et al. 2014), Denisova (Meyer et al. 2012), and the most ancient sequenced modern human genome, Ust’-Ishim (Fu et al. 2014), were examined in both the UCSC Genome Browser (https://genome.ucsc.edu/) and the ancient genome browser at the Max Planck Institute for Evolutionary Anthropology (http://www.eva.mpg.de/neandertal/index.html), the latter including the recently published high-coverage Vindija Neanderthal genome (Prufer et al. 2017). For each of the two functional genes, we screened 2 kb of the reference human genome sequence (Hg19/GRCh37) encompassing the coding exon (positions 18’079’000 to 18’081’000 for NAT1, and 18’257’000 to 18’259’000 for NAT2). For the NATP pseudogene, we screened 2 kb of homologous sequence (18’227’600 to 18’229’600).
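The screening of fixed genomic windows described above can be sketched as a simple filter over VCF-formatted records. This is an illustrative sketch only, not the procedure used in the study (which relied on genome browsers): the window coordinates are those given in the text, whereas the function name, the chromosome naming convention and the toy records are assumptions.

```python
# Windows on chromosome 8 (Hg19/GRCh37), as given in the text.
NAT_WINDOWS = {
    "NAT1": (18_079_000, 18_081_000),
    "NATP": (18_227_600, 18_229_600),
    "NAT2": (18_257_000, 18_259_000),
}


def variants_in_windows(vcf_lines, chrom="8", windows=NAT_WINDOWS):
    """Group (pos, ref, alt) records from VCF-formatted lines by NAT window."""
    hits = {locus: [] for locus in windows}
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and VCF header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:
            continue  # not a complete VCF data line
        contig = fields[0]
        if contig.startswith("chr"):
            contig = contig[3:]  # accept both "8" and "chr8"
        if contig != chrom:
            continue
        pos = int(fields[1])
        for locus, (start, end) in windows.items():
            if start <= pos <= end:
                hits[locus].append((pos, fields[3], fields[4]))
    return hits


# Toy records for illustration only; these are not real variant calls.
toy_vcf = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "8\t18079500\t.\tA\tG\t.\t.\t.",
    "chr8\t18228000\t.\tC\tT\t.\t.\t.",
    "8\t18300000\t.\tG\tA\t.\t.\t.",
    "7\t18257500\t.\tT\tC\t.\t.\t.",
]

for locus, records in variants_in_windows(toy_vcf).items():
    print(locus, records)
```

Coordinates in a real VCF file must refer to the same assembly as the windows; files mapped to another assembly would first need coordinate conversion.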

Analysis of diversity of NAT genes in the genus Pan and comparison with humans and other great apes

Frequency distributions of NAT haplotypes in the Pan and human population samples were used to test for possible deviations from Hardy-Weinberg equilibrium, to estimate expected heterozygosity (h, equivalent to Nei’s gene diversity; Nei 1987) and nucleotide diversity (π), to estimate levels of population differentiation (ΦST statistics), and to test for possible departure from selective neutrality and demographic equilibrium (Ewens-Watterson homozygosity test, Tajima’s D test and Fu’s FS test), with the program Arlequin ver. 3.5 (Excoffier and Lischer 2010). Each of these three latter tests relies on different summaries of diversity (homozygosity for the Ewens-Watterson test, number of polymorphic sites and nucleotide diversity for Tajima’s D, and number of different haplotypes and nucleotide diversity for Fu’s FS), and only Tajima’s D and Fu’s FS tests explicitly account for the mutational events distinguishing haplotypes. Statistical significance was assessed by generating 100’000 random samples under the null conditions of no selection and constant population size. For all tests that revealed at least one significant departure from the null hypothesis in one population (species, sub-species, or collection in the case of P. t. verus), the Holm correction method implemented in R (R Core Team 2013) was applied to control for type I error rate (Holm 1979), so as to obtain adjusted P-values. Arlequin was also used to infer population pairwise ΦST values (between species, sub-species, or collections) under the AMOVA framework, and their statistical significance was assessed with 100’000 permutations. The parameters used to estimate ΦST values were obtained with MEGA ver. 7 (Kumar et al. 2016). These parameters are: the molecular model (Tamura’s distance for all three NAT loci), the gamma parameter (no gamma correction for NAT1 and NAT2, gamma = 0.05 for NATP) and the transition to transversion ratio (2.67, 2.0 and 2.5 for NAT1, NAT2 and NATP, respectively). The program Network ver. 5.0 (Bandelt et al. 1999) was used to construct median-joining networks of NAT haplotypes in Pan.
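To make the summary statistics named above concrete, the following minimal sketch computes gene diversity (h), nucleotide diversity (π) and Tajima's D from a list of equal-length haplotype strings, using the standard textbook formulas. It is not Arlequin's implementation: differences are counted as raw mismatches rather than with Tamura's distance, and no simulation is run to assess significance. All function names and the toy haplotypes are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def gene_diversity(haplotypes):
    """Unbiased expected heterozygosity: n/(n-1) * (1 - sum of squared frequencies)."""
    n = len(haplotypes)
    if n < 2:
        raise ValueError("at least two haplotypes are required")
    sum_sq = sum((count / n) ** 2 for count in Counter(haplotypes).values())
    return n / (n - 1) * (1.0 - sum_sq)


def mean_pairwise_differences(haplotypes):
    """Average number of mismatching sites over all pairs of haplotypes."""
    pairs = list(combinations(haplotypes, 2))
    total = sum(sum(a != b for a, b in zip(x, y)) for x, y in pairs)
    return total / len(pairs)


def nucleotide_diversity(haplotypes):
    """Per-site nucleotide diversity (pi), using raw mismatch counts."""
    return mean_pairwise_differences(haplotypes) / len(haplotypes[0])


def tajimas_d(haplotypes):
    """Tajima's D; returns NaN when the sample has no segregating site."""
    n = len(haplotypes)
    s = sum(len(set(column)) > 1 for column in zip(*haplotypes))
    if s == 0:
        return float("nan")
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    k = mean_pairwise_differences(haplotypes)
    return (k - s / a1) / sqrt(e1 * s + e2 * s * (s - 1))


# Toy haplotypes for illustration only (not data from this study).
toy_haplotypes = ["AAAA", "AAAT", "AATT", "AAAA"]
print("h  =", round(gene_diversity(toy_haplotypes), 4))
print("pi =", round(nucleotide_diversity(toy_haplotypes), 4))
print("D  =", round(tajimas_d(toy_haplotypes), 4))
```

In practice, a negative D indicates an excess of rare variants and a positive D an excess of intermediate-frequency variants, but the value must be compared against a null distribution, as done in the study with 100’000 simulated samples.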

Prediction of functional impact of specific mutations in Pan haplotypes at the NAT1 and NAT2 loci

Phenotypic predictions of the functional impact of specific mutations in Pan NAT1 and NAT2 haplotypes were performed with three online software tools (analysis done in May 2017): PolyPhen (Adzhubei et al. 2010), SIFT (Sim et al. 2012) and the PANTHER cSNP Scoring tool (Tang and Thomas 2016). These three tools are able to predict the effect of a single nonsynonymous substitution on a protein sequence (Supplementary File S1). To investigate and compare the results returned by these methods, we first applied the three tools on human haplotypes of known phenotypes (Supplementary File S1). For the analysis of Pan haplotypes, we ran all three prediction tools with the default options, searching the UniProtKB/TrEMBL protein database (release 2010_09) with SIFT, and specifying Pan troglodytes as reference organism in PANTHER cSNP Scoring.

Data availability

File S1 contains detailed descriptions of the protocols used for DNA amplification, PCR product purification and sequencing of great ape samples, retrieval of unphased NAT polymorphic positions from the Great Ape Genome Project (GAGP), inference of Pan NAT haplotypes, and retrieval of phased human NAT haplotypes from the 1000 Genomes Project. Table S1 lists the PCR and sequencing primers used for Sanger sequencing of NAT genes in non-human great ape DNA samples, and available information on these samples is provided in Table S2. Hominid polymorphic positions detected in NAT1, NAT2, and NATP are listed in Tables S3, S4, and S5, respectively. Table S6 reports estimates of differentiation levels between Pan (sub-)species. Table S7 lists the human population samples included in the modern human dataset, with associated diversity estimates and results of Hardy-Weinberg equilibrium tests. Supplementary information on results is also provided in File S1. All supplementary material has been uploaded to figshare. The 247 NAT sequenced genotypes obtained in this study are available in GenBank with accession numbers MK244999-MK245288 and MK245291-MK245459.

RESULTS

In this study, we Sanger-sequenced approximately 1 kb of homologous sequence in each of the three members of the arylamine N-acetyltransferase (NAT) gene family, NAT1, NAT2 and the NATP pseudogene, in 84 great ape samples, of which 68 are chimpanzees of the Pan troglodytes verus sub-species. Out of the 84 DNA samples of great apes available, we obtained 247 NAT genotypes (83, 81 and 83 NAT1, NAT2 and NATP genotypes, respectively, Supplementary File S1). We extended our dataset with 231 NAT genotypes from 79 individuals belonging to six great ape (sub-)species retrieved from the GAGP (Prado-Martinez et al. 2013). As reported in Tables 1 and 2, the total dataset assembled for analysis thus included 93 genotypes from four Pan troglodytes (common chimpanzee) sub-species for NAT1 and NAT2 (96 for NATP), 14 genotypes from Pan paniscus (bonobo) for each of the three NAT genes, 35 genotypes from two Gorilla species for NAT2 and NATP (36 for NAT1), and 17 genotypes from two Pongo (orangutan) species for NAT1 and NAT2 (18 for NATP). For comparative analysis purposes, we also assembled a human NAT dataset totaling 1’159 to 1’240 unrelated individuals from 18 to 20 populations distributed on four continents.

In spite of the high level of known homology between the three NAT genes, approximately 8% of nucleotide positions in NAT1 (76 out of 903 bp), 10% in NAT2 (117 out of 1’115 bp), and 15% in NATP (151 out of 1’002 bp) are segregating positions in hominids, i.e. positions found to be divergent between species, polymorphic within a species, or both (Supplementary Tables S3, S4, and S5). Among them, 21 (28%) substitutions in NAT1, 38 (32%) in NAT2, and 77 (51%) in NATP correspond to interspecies divergence.
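The proportions quoted above follow directly from the reported counts, as the short check below shows. The counts are those given in the text; the variable names are only illustrative.

```python
# (segregating positions, sequence length in bp) per locus, as reported in the text.
segregating = {"NAT1": (76, 903), "NAT2": (117, 1115), "NATP": (151, 1002)}
# Segregating positions that correspond to interspecies divergence.
divergent = {"NAT1": 21, "NAT2": 38, "NATP": 77}

summary = {}
for locus, (n_seg, length) in segregating.items():
    pct_segregating = round(100 * n_seg / length)
    pct_divergent = round(100 * divergent[locus] / n_seg)
    summary[locus] = (pct_segregating, pct_divergent)
    print(f"{locus}: {pct_segregating}% segregating, "
          f"{pct_divergent}% of these are interspecies divergence")
```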

NAT polymorphisms in hominids and polymorphism sharing among species

Despite the high numbers of segregating sites detected at the three NAT genes in hominids, little polymorphism was found to be shared between humans, gorillas, orangutans and Pan: two SNPs at NAT1, four at NAT2 and six at the NATP pseudogene (Table 1).

No polymorphic position was found shared between the genus Pan and the other great ape species at the NAT1 gene (Table 1 and Supplementary Table S3). Nonetheless, the proportion of non-synonymous SNPs in Pan (60%) is similar to that observed in humans (59%) and orangutans (56%), while it is lower in gorillas (25%).

Humans and orangutans were found to share two NAT1 SNPs, i.e. non-synonymous A/G at human cds position 445 (rs4987076) and synonymous G/A at 459 (rs4986990), the former being also observed in gorillas. These two polymorphisms, found at low frequencies in humans today, along with a third one not observed in gorillas or orangutans (non-synonymous 640 G/T, rs4986783), were nevertheless detected in the ancient genome of the 45’000-year-old Homo sapiens from Ust’-Ishim, and were found to diverge between Neanderthals and Denisova (Table 1 and Supplementary Table S8). Interestingly, ancient Neanderthal genomes apparently carry the same allele (A) as Pan at rs4987076 whereas the alternative allele (G) was found for Denisova, and the opposite pattern is observed at rs4986990 (both Pan and Denisova carry G, while ancient Neanderthal genomes carry A). Thus, both the 445 A/G (rs4987076) and the 459 G/A (rs4986990) polymorphisms could potentially either pre-date the divergence among hominids or represent a case of independent parallel mutation(s) in hominins and in other great ape lineages. In turn, at position 640 G/T (rs4986783), all great apes and Neanderthals were found to carry G, whereas the alternative allele (T) was only observed in Denisova. In humans today, the derived alleles (A, A and G, respectively) at these three SNPs (rs4987076, rs4986990 and rs4986783, respectively) characterize human haplotypes NAT1*11A and NAT1*11B. A previous study estimated that the coalescence of NAT1*11A with other major human NAT1 haplotypes (NAT1*3, NAT1*4, and NAT1*10) dates back to 2 million years ago, leading the authors to suggest that some NAT1 diversity in the genome of modern humans may have persisted from a structured ancestral population (Patin et al. 2006a). Since SNPs 445 A/G and 459 G/A are shared between humans and orangutans (SNP 445 A/G being also shared with gorillas), their coalescence times could be even older and likely pre-date the divergence among hominids. When considered together, the three SNPs define three major combinations in hominids, GGT, AGG, and AAG. The most frequent combination in humans, GGT, which characterizes most human haplotypes (except those of the NAT1*11 series and NAT1*30), was not observed in the other great apes. Instead, all haplotypes in chimpanzees and bonobos carry the AGG combination, which has an estimated frequency of 70% in gorillas, and at least 55% in orangutans, but has not been reported so far in human populations. Such observations could suggest that the AGG haplotype was present in the ancestors of hominids (the reference sequences of rhesus (Macaca mulatta) and cynomolgus (M. fascicularis) macaques, rheMac8 and macFas5, also carry the AGG combination), and was lost at some point in the human lineage, possibly after the divergence from Denisova and Neanderthal. Indeed, Denisova’s ancient genome is defined as GGT, while the Altai and Vindija Neanderthal genomes are reported with the AAG combination, and the genome of the 45’000-year-old Homo sapiens from Ust’-Ishim is heterozygous at the three positions (Supplementary Table S8). Nowadays, the human reference haplotype NAT1*4, together with other haplotypes carrying GGT at the three SNPs (e.g. human NAT1*3 and NAT1*10), have an average cumulated frequency of 95% in all human populations studied so far, whereas haplotypes with AAG (the human NAT1*11 series) are observed at very low frequencies in populations from Africa, Asia, Europe and New (Patin et al. 2006a; Mortensen et al. 2011).

NAT2 stands in sharp contrast with NAT1, mainly because of the high levels of NAT2 polymorphism in humans (55 polymorphic positions are recorded for humans at NAT2, twice as many as at NAT1; Supplementary Table S4). The proportion of NAT2 non-synonymous SNPs in humans (75%) seems higher than that found in Pan (56%), orangutans (58%), and gorillas (38%).

It is noticeable that among the four NAT2 polymorphisms shared between hominid species (or five if considering modern humans and Neanderthals as distinct species), all are shared with humans. Two of these are non-synonymous polymorphisms in humans shared with the Pan genus: G/A at human cds position 191 (rs1801279) observed in bonobos, and C/T at 578 (rs79050330) observed in Western chimpanzees (Table 1). Humans were also found to share two SNPs with gorillas: synonymous C/T at 345 (rs45532639) and non-synonymous A/G at 506 (rs200585149). These shared SNPs are rare polymorphisms in humans (Sabbagh et al. 2008; The 1000 Genomes Project Consortium 2012), with the exception of 191 G/A (rs1801279), the signature mutation of human haplotypes of the NAT2*14 series (NAT2*14A/B/E/H/L), observed mostly among African populations (Patin et al. 2006a; Mortensen et al. 2011; Podgorna et al. 2015). None of the polymorphisms detected in orangutans are shared with other great apes.

Remarkably, the common human synonymous C/T SNP at 282 (rs1041983) was found polymorphic in one Neanderthal ancient genome (Altai, Supplementary Table S8). This SNP is, up to now, the sole NAT exonic position recorded as polymorphic in a non-anatomically modern human ancient genome. Note that due to an apparent inconsistency between the ancient genome browser at the Max Planck Institute for Evolutionary Anthropology and the UCSC genome browser, we could not conclude whether the NAT2 SNP 803 A/G (rs1208) differentiates the Denisova genome from Neanderthals and all other hominids (Supplementary Table S8). This mutation is common in most modern human populations except in East Asia, and despite being non-synonymous it does not alter the enzyme’s activity (Zang et al. 2007).

Sequence diversity at the NATP pseudogene in hominids differs from that of its functional paralogs by the presence of several insertions and deletions (InDels) that could be evidenced in the Sanger-sequenced samples of this study, most of which mark divergence between species (Supplementary Table S5). In addition to InDels, high levels of single nucleotide polymorphism also characterize this pseudogene in hominids (45 SNPs detected in humans, 13 in gorillas, 19 in orangutans, and at least 17 in the genus Pan), but similarly to the functional NAT1 and NAT2 genes, little sharing between species was found (Table 1). Ancient Homo genomes also reflect the high level of hominid polymorphism at NATP (Supplementary Table S8). Besides three modern human SNPs that were also found heterozygous in the ancient Ust’-Ishim Homo sapiens sample, at least three positions apparently differ between Neanderthals and Denisova, a proportion similar to that of NAT1. We also note that an A/G polymorphism is reported in Denisova (at position 18’228’182) which, according to our dataset, has not been detected in any other hominid species.

Within the Pan genus, the largest group of genotyped samples in this study (and in particular the Western chimpanzee sub-species, Pan troglodytes verus), 40 segregating sites were observed in total for the three NAT genes. Among these 40 sites, only 15 were found in Western chimpanzees (Table 2; see also Supplementary File S1, and Supplementary Tables S3, S4 and S5). We detected 32 polymorphic positions in common chimpanzees (P. troglodytes), but only five are shared by all Pan troglodytes sub-species (Table 2).

None of the polymorphic positions shared between all common chimpanzees is observed in bonobos (P. paniscus). In this latter species, only five polymorphic sites were identified in the three NAT genes, one of which (at 18’079’925 in NAT1) is shared with Western chimpanzees and another (at 18’228’659 in NATP) with Central and Eastern chimpanzees. Six positions were apparently fixed on the derived allele in bonobos, one of them (at 18’228’285 in NATP) being polymorphic in common chimpanzees.

NAT haplotypes in Pan

The Arylamine N-acetyltransferase gene nomenclature committee (Hein et al. 2008) currently lists 28 human NAT1 haplotypes encoding a dozen variant NAT1 proteins, some of which are associated with lower enzymatic activity than the reference human NAT1*4 haplotype, while the number of documented human NAT2 haplotypes listed is 108 and the number of encoded variant NAT2 proteins comes close to 70, about 15 of which are involved in a “slow” acetylation phenotype (http://nat.mbg.duth.gr). In humans, as reviewed in Sabbagh et al. (2018), frequency distributions of haplotypes encoding NAT2 proteins are highly variable among populations (see for instance Figure 2 in Sabbagh et al. (2018)), whereas for NAT1, frequency distributions in most human populations are dominated by only two major haplotypes (with cumulated frequencies varying between 85% and 100%) that differ by two SNPs in the 3’UTR region and whose relative phenotypic effect is considered moderate (haplotype NAT1*10 probably enhancing protein expression relative to the most common haplotype, NAT1*4; Hein et al. 2018).

Knowledge of NAT diversity within other great ape species is at present lacking. We thus inferred haplotypes at the three homologous NAT genes for individuals belonging to the genus Pan (due to the small size of available samples, such inference was not possible for gorillas and orangutans), so as to characterize Pan NAT diversity, and predict the functional consequences of non-synonymous mutations. The PHASE analysis of the 108 Pan genotypes (94 chimpanzees and 14 bonobos) led to the inference of 12 Pan haplotypes for NAT1, 10 for NAT2, and 19 for NATP (Table 3).

We found dissimilar frequency distributions between the three NAT genes in all Pan species (Table 4 and Figure 1; haplotype counts in the total samples including related individuals are provided in Supplementary Table S9). As reported in Table 4, none of these frequency distributions were found to deviate from Hardy-Weinberg equilibrium, after correction for multiple testing.

At NAT1, haplotype frequency distributions in all chimpanzee sub-species are characterized by a single major haplotype (NAT1*1 occurring at frequencies between 65% and 80%) along with three to four other haplotypes at lower frequencies (Table 4 and Figure 1). By contrast, NAT1*6 is the most frequent haplotype in bonobos (54%), followed by NAT1*1 and NAT1*12 (25% and 21%, respectively). When measured by ΦST fixation indices, which take into consideration the molecular diversity of haplotypes, significant differentiation was only found between bonobos and the other Pan (Supplementary Table S6). On the other hand, it is noteworthy that in bonobos, the two haplotypes NAT1*6 and NAT1*12 observed besides NAT1*1 differ from it only by a single synonymous mutation (Table 3 and Supplementary Figure S2). In other words, 100% of proteins expressed by NAT1 in bonobos are identical to the product of the NAT1*1 haplotype predominant in all chimpanzee sub-species. In turn, most of the other NAT1 haplotypes detected in these sub-species differ from NAT1*1 by one non-synonymous change, resulting in the existence of seven distinct NAT1 proteins: at least four in each of the Western (P. t. verus), Central (P. t. troglodytes), and Eastern (P. t. schweinfurthii) sub-species, and three in the Nigeria-Cameroon (P. t. ellioti) chimpanzees, according to our observations (Supplementary Table S10).

At NAT2, a single frequent haplotype is observed in all Pan species, along with one to three less frequent haplotypes (Table 4 and Figure 1). In contrast to NAT1, the frequency of the most prevalent NAT2 haplotype is higher (80% to 92.5%), and it differs between species and sub-species: it is NAT2*1 in the Western (P. t. verus), NAT2*4 in the Central (P. t. troglodytes) and Eastern (P. t. schweinfurthii), and NAT2*6 in the Nigeria-Cameroon (P. t. ellioti) chimpanzee sub-species, and NAT2*7 in the bonobos (P. paniscus). Also in contrast to NAT1, no NAT2 haplotype was found shared between chimpanzees and bonobos. Thus, unsurprisingly, significant differentiation was found not only between bonobos and the other Pan, but also between all chimpanzee sub-species, except between Central (P. t. troglodytes) and Eastern (P. t. schweinfurthii) chimpanzees (Supplementary Table S6). On the other hand, although haplotype NAT2*4 is the predominant haplotype in Central and Eastern chimpanzees, it only differs by one synonymous mutation from the most prevalent NAT2*1 haplotype in Western (P. t. verus) chimpanzees (Table 3 and Supplementary Figure S3). This suggests little differentiation between these three Western, Central, and Eastern sub-species at the level of NAT2 gene products. Indeed, the 10 NAT2 haplotypes detected in Pan translate into six distinct NAT2 proteins, of which we observed two in each of the chimpanzee sub-species, and three in bonobos (Supplementary Table S10). In fact, both the commonest haplotype in Nigeria-Cameroon chimpanzees (NAT2*6 in P. t. ellioti) and all haplotypes in bonobos differ from most other Pan NAT2 haplotypes by at least one non-synonymous change.

Finally, the NATP pseudogene haplotypes are more evenly distributed than those of the two functional genes in all the chimpanzee sub-species, with two or more frequent NATP haplotypes (Table 4 and Figure 1). By contrast, in bonobos, only two haplotypes were found, one of which has a frequency of 96% (NATP*18), and none is shared with chimpanzees. However, substantial haplotype sharing is observed between chimpanzee sub-species, notably for haplotypes NATP*2 and NATP*8. Analyses of molecular variance detected a significant level of genetic differentiation only between bonobos and all chimpanzees, and among the latter, between Western chimpanzees (P. t. verus) and the other sub-species (Supplementary Table S6). A high level of sequence diversity at NATP may explain these complex results. Indeed, the median-joining network of NATP haplotypes (Supplementary Figure S4) displays two reticulations and a few rather divergent haplotypes (e.g. NATP*14, NATP*15, and NATP*16), which raises the possibility that some Pan haplotypes were unsampled.

Predicted functional differences among NAT1 and NAT2 haplotypes in Pan

We chose Pan NAT1*1 and Pan NAT2*4, the basal haplotypes in the median-joining networks of NAT1 and NAT2 haplotypes, respectively (Supplementary Figures S2 and S3), as reference sequences to predict the functional impact of NAT mutations in Pan using PolyPhen, SIFT and PANTHER cSNP Scoring, three online software tools.

Four of the 11 Pan NAT1 haplotypes derived from NAT1*1 were predicted to be damaging with more or less confidence (Table 5): NAT1*4 and NAT1*7, each predicted by two tools, and NAT1*3 and NAT1*8 by one tool only. The other seven haplotypes were either predicted as not damaging or only differ from the others by synonymous mutations. The cumulated frequencies of potentially damaging haplotypes could thus reach the values of 10% to 13% in Western (P. t. verus) chimpanzees, 8% to 10% in Eastern (P. t. schweinfurthii) and Nigeria-Cameroon (P. t. ellioti) chimpanzees, and up to 20% in Central (P. t. troglodytes) chimpanzees, at best (Table 4, Figure 1, and Supplementary Figure S2). If confirmed, these results suggest that a sizeable proportion of chimpanzees may have a moderately reduced NAT1 acetylation capacity. Nevertheless, the frequency profiles of NAT1 in the Pan species and sub-species are characterized by a majority of haplotypes that are not damaging, and thus translate into a similar enzymatic activity.

As displayed in the Pan NAT2 network (Supplementary Figure S3), nine haplotypes derive from the NAT2*4 basal haplotype, three of which reach high frequencies, i.e. NAT2*1, NAT2*6, and NAT2*7 (Table 4, Figure 1 and Supplementary Figure S3). Three low frequency haplotypes out of the nine stemming from the basal haplotype were predicted as damaging with high confidence by the three tools: NAT2*2, only observed in P. t. verus, and NAT2*8 and NAT2*9, the two haplotypes observed only in P. paniscus. Interestingly, the mutations that define NAT2*2 and NAT2*8 both occur at positions that are also polymorphic in humans (positions 578, rs79050330 and 191, rs1801279, respectively). The effect of SNP 191 G/A on NAT2 enzymatic activity in humans is known; it defines the human NAT2*14 haplotype series that is associated with a slow acetylation phenotype (see Supplementary File S1), which is thus consistent with our prediction results. That of SNP 578 T/C (which is associated in humans with haplotypes NAT2*5P, NAT2*12E, and NAT2*13B) is unknown. The prediction tools do not return clear-cut results for the mutations defining NAT2*6 (derived from NAT2*1, and predominant in P. t. ellioti) and NAT2*7 (derived from NAT2*4, and predominant in P. paniscus) (namely A/G at position 514 and G/A at 145, respectively), predicted as possibly damaging by PANTHER cSNP Scoring only, and with a low confidence. This suggests that their potential effects are either less damaging than those of mutations defining NAT2*2, NAT2*8, and NAT2*9 (i.e. rs79050330 C/T at 578, rs1801279 G/A at 191, and SNP A/C at 72), or that they depend on the substrate, as is known for some human polymorphisms (Hein et al. 2006) (see Supplementary File S1). In contrast to Western (P. t. verus) chimpanzees, where the cumulated frequencies of NAT2*2, NAT2*8 and NAT2*9 are lower than 5%, they reach 10% in bonobos (Table 4 and Figure 1). If confirmed, our prediction results could thus indicate that a significant proportion of bonobos (but a smaller proportion of Western chimpanzees) could have a slow NAT2 acetylation phenotype.

NAT genetic diversity and tests of selective neutrality in Pan

We found that the Western (P. t. verus) chimpanzee sub-species has the lowest diversity among the Pan genus for NAT1 (Table 6 and Figure 2). Indeed, expected heterozygosity (h) ranges from 0.34 in Western chimpanzees (BPRC sample) to 0.63 in bonobos (P. paniscus), and nucleotide diversity (reported as π × 10⁻³) from 0.58 in Western chimpanzees to 1.08 in Eastern (P. t. schweinfurthii) chimpanzees. A significant deviation from selective neutrality and demographic equilibrium due to homozygosity excess was found for the Central (P. t. troglodytes) and Eastern (P. t. schweinfurthii) chimpanzee sub-species with one test of selective neutrality, i.e. the Ewens-Watterson test. However, although highly significant, these results must be considered with caution as they may represent artifacts due to the low sample sizes for these two sub-species (only five and six individuals, respectively, Table 6). No other neutrality or demographic equilibrium test returned any significant result.
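As an illustration of the two diversity indices reported above, the following minimal sketch computes expected heterozygosity (h) and nucleotide diversity (π) from a sample of phased haplotypes. It assumes the standard unbiased estimators (h = n/(n-1) × (1 - Σp²), and π as the mean number of pairwise differences per site); this section does not state which estimators or software settings were used, and the function names and toy sequences are hypothetical.

```python
from collections import Counter
from itertools import combinations


def expected_heterozygosity(haplotypes):
    """Unbiased gene diversity: h = n/(n-1) * (1 - sum of squared frequencies)."""
    n = len(haplotypes)
    if n < 2:
        raise ValueError("need at least two sampled chromosomes")
    counts = Counter(haplotypes)
    sum_sq = sum((c / n) ** 2 for c in counts.values())
    return n / (n - 1) * (1.0 - sum_sq)


def nucleotide_diversity(sequences):
    """Mean number of pairwise differences per site (pi)."""
    n = len(sequences)
    if n < 2:
        raise ValueError("need at least two sequences")
    length = len(sequences[0])
    if length == 0 or any(len(s) != length for s in sequences):
        raise ValueError("sequences must be aligned and non-empty")
    total_diffs = sum(
        sum(a != b for a, b in zip(s1, s2))
        for s1, s2 in combinations(sequences, 2)
    )
    n_pairs = n * (n - 1) // 2
    return total_diffs / n_pairs / length


# Hypothetical toy sample (not data from this study):
# four chromosomes, two haplotypes, one variable site out of four.
toy_labels = ["hapA", "hapA", "hapA", "hapB"]
toy_seqs = ["ACGT", "ACGT", "ACGT", "ACGA"]
h = expected_heterozygosity(toy_labels)
pi = nucleotide_diversity(toy_seqs)
```

In this toy sample, one haplotype at a frequency of 75% gives h = 0.5, whereas a single variable site among four gives π = 0.125, showing how the two indices capture different aspects of diversity.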

Similarly to NAT1, Western (P. t. verus) chimpanzees also have the lowest diversity among the Pan genus for NAT2, with h varying from 0.15 in Western (San Diego sample) to 0.38 in Central (P. t. troglodytes) chimpanzees, and π from 0.13 in Western to 0.50 in Central chimpanzees (Table 6 and Figure 2). However, in contrast to NAT1, a significant departure from the expected diversity under selective neutrality and demographic equilibrium was found with at least one test in all species and sub-species before adjustment for multiple testing, with the exception of Nigeria-Cameroon (P. t. ellioti). All observed deviations were either due to an excess of homozygosity (Western-verus, Central-troglodytes and Eastern-schweinfurthii sub-species), and/or to an excess of rare (recent) haplotypes, yielding negative values of Tajima’s D (BPRC sample of Western-verus chimpanzees, and Eastern-schweinfurthii chimpanzees) or Fu’s FS (Western chimpanzees and bonobos). For Western chimpanzees, all three tests were significant in the BPRC sample and Fu’s FS test remained so after correction for multiple testing. The results were less clear-cut in the 122 sub-samples of the San Diego sample, as deviations from neutrality were mainly observed with Fu’s FS test (16%, 0% and 87% significant deviations with the Ewens-Watterson, Tajima’s D and Fu’s FS tests, respectively).

Similarly to NAT1, due to small sample size, we consider with caution the Ewens-Watterson test results for Central (P. t. troglodytes) and Eastern (P. t. schweinfurthii) chimpanzees (and the Tajima’s D test result for the latter), although they were still significant after accounting for type I error rate. Nevertheless, taken all together, these results suggest a possible action of directional (positive or purifying) selection at NAT2, at least in Western (P. t. verus) chimpanzees.
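To make explicit why an excess of rare haplotypes yields negative values of Tajima’s D, the sketch below computes the statistic from aligned sequences using the constants of Tajima (1989). It is an illustrative implementation under our own assumptions (the function name and toy alignments are hypothetical, and no significance test is performed); it is not the software used for the analyses reported here.

```python
from itertools import combinations
from math import sqrt


def tajimas_d(sequences):
    """Tajima's D from aligned sequences (statistic only, no P-value)."""
    n = len(sequences)
    if n < 4:
        # For n < 4 the variance term is zero, so D is undefined.
        raise ValueError("need at least four sequences")
    length = len(sequences[0])
    if any(len(s) != length for s in sequences):
        raise ValueError("sequences must be aligned")

    # S: number of segregating sites.
    seg_sites = sum(1 for column in zip(*sequences) if len(set(column)) > 1)
    if seg_sites == 0:
        raise ValueError("no segregating sites")

    # k: mean number of pairwise differences.
    n_pairs = n * (n - 1) // 2
    k_hat = sum(
        sum(a != b for a, b in zip(s1, s2))
        for s1, s2 in combinations(sequences, 2)
    ) / n_pairs

    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)

    variance = e1 * seg_sites + e2 * seg_sites * (seg_sites - 1)
    return (k_hat - seg_sites / a1) / sqrt(variance)


# Hypothetical toy alignments (not data from this study).
d_rare = tajimas_d(["AAAA", "AAAA", "AAAA", "AAAT"])    # variant seen once
d_common = tajimas_d(["AAAA", "AAAA", "AAAT", "AAAT"])  # variant at 50%
```

With the variant carried by a single chromosome, D is negative; with the same site at intermediate frequency, D is positive. This is the contrast exploited when negative D values are read as an excess of rare (recent) haplotypes.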

At NATP, bonobos (P. paniscus) have a particularly low diversity and Western (P. t. verus) chimpanzees have the second lowest diversity among the Pan genus. Indeed, expected heterozygosity ranges from 0.07 in bonobos, and 0.64 in Western chimpanzees, to 0.89 in Central (P. t. troglodytes) chimpanzees, and nucleotide diversity from 0.07 in bonobos, and 0.77 in Western chimpanzees, to 2.75 in Central chimpanzees (Table 6 and Figure 2). While the null hypothesis of selective neutrality and demographic equilibrium was not rejected for any of the Pan troglodytes sub-species, deviations due to a significant excess of homozygotes (Ewens-Watterson test) and of rare alleles (Fu’s FS test) were found for bonobos (P. paniscus), that remained even after correction for multiple testing. Considering that the locus is a pseudogene, its extremely low diversity in bonobos (only two NATP haplotypes observed, one of which with a frequency over 96%, Table 4 and Figure 1) and the associated significant results returned by the neutrality tests are surprising and call for further investigation.

Figure 2 highlights a marked difference in diversity levels between the three NAT genes, particularly between NAT1 and NAT2. We found only a marginally significant difference in Pan expected heterozygosity (h) between the two functional genes, NAT1 and NAT2, after correcting for multiple testing (P = 0.039), whereas nucleotide diversity (π) is significantly higher at NAT1 than at NAT2 (P = 0.0065). Since the comparisons of the two functional paralogs with the pseudogene are influenced by the very low diversity of NATP in bonobos (P. paniscus), we compared again diversity levels between genes considering only chimpanzee sub-species. In Pan troglodytes indeed, nucleotide diversity at NATP is significantly higher than at each of the two functional genes (P = 0.018 for NAT1 versus NATP, and P = 0.018 for NAT2 versus NATP, respectively), and expected heterozygosity at NATP is significantly higher than at NAT2 (P = 0.012).

Comparison of NAT genetic diversity between Pan and humans

The total number of human haplotypes represented in the dataset of human populations analysed here (Supplementary Table S7) is 21 for NAT1 (3.5 per sample, on average), 44 for NAT2 (10.7), and 58 for NATP (10.6). No deviation from Hardy-Weinberg equilibrium was found for any of the human samples at any of the NAT genes, after correction for multiple testing (Supplementary Table S7).

When compared to chimpanzees and bonobos, humans appear to have about five times less diversity than Pan at the homologous NAT1 gene, but four to nine times more diversity than Pan at NAT2 (Figure 2, Table 6 and Supplementary Table S7). In Figure 2, the documented high level of diversity of human NAT2 is clearly illustrated by its similarity to NATP (Sabbagh et al. 2018). Both the Mann-Whitney U and Student t tests, adjusted for multiple testing, confirm that expected heterozygosity and nucleotide diversity are significantly higher in Pan species and sub-species than in human populations at NAT1, and significantly lower at NAT2 (Supplementary Table S11). By contrast, both expected heterozygosity and nucleotide diversity of human populations at the NATP pseudogene fall within the range of those observed in chimpanzees (P. troglodytes), whereas both estimates were found to be, as expected, extremely low in bonobos. However, differences in NATP diversity levels between humans and the Pan species and sub-species were not significant, even before multiple testing adjustment, and despite the extremely low estimates of bonobos (all P-values > 0.05).
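The rank-based comparison of diversity estimates between two groups of samples can be sketched as follows. The code computes only the Mann-Whitney U statistics (with midranks for ties), not the associated P-values, and the input values are hypothetical numbers chosen for illustration rather than estimates from Table 6 or Supplementary Table S7.

```python
def mann_whitney_u(sample_x, sample_y):
    """Return (U_x, U_y) for two independent samples, using midranks for ties."""
    if not sample_x or not sample_y:
        raise ValueError("both samples must be non-empty")

    # Pool the observations, remembering which group each one came from.
    pooled = sorted(
        [(v, 0) for v in sample_x] + [(v, 1) for v in sample_y],
        key=lambda t: t[0],
    )

    # Assign ranks 1..N, averaging the ranks within each run of tied values.
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        midrank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[k] = midrank
        i = j + 1

    rank_sum_x = sum(r for r, (_, group) in zip(ranks, pooled) if group == 0)
    n_x, n_y = len(sample_x), len(sample_y)
    u_x = rank_sum_x - n_x * (n_x + 1) / 2.0
    u_y = n_x * n_y - u_x
    return u_x, u_y


# Hypothetical diversity estimates for two groups of samples.
group_a = [0.9, 1.0, 1.1]
group_b = [0.1, 0.2, 0.3]
u_a, u_b = mann_whitney_u(group_a, group_b)
```

When every value in one group exceeds every value in the other, as in this toy input, one U equals the product of the two sample sizes and the other equals zero, which is the most extreme configuration the test can observe.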

At NAT1, while no significant rejection was observed in Pan, selective neutrality and demographic equilibrium are rejected in many human samples (Table 6 and Supplementary Table S12). Indeed, each of the three tests of selective neutrality used rejected the null hypothesis at NAT1 in at least one population, even after correction for multiple testing. In the Ewens-Watterson homozygosity test, the observed homozygosity (Fo) was always found to be higher than the expected (Fe), and significantly so in nine of the 19 tested population samples (eight after correction for multiple testing). Similarly, Tajima’s D and FS values are significantly negative in 11 (one after correction) and eight (three after correction) population samples, respectively. Conversely, while several tests rejected the null neutral equilibrium model at NAT2 in Pan, no rejection was observed for this gene in human populations with any of the three tests, after correction for multiple testing. At NATP, a single rejection in humans was observed with Fu’s FS test after correction (in the Dinka of Sudan, Supplementary Table S12). Thus, rejection of the neutral equilibrium model is more consistent at NAT2 in Pan than in humans, whereas it is more consistent at NAT1 in humans than in Pan.
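The drop in the number of significant tests after correction for multiple testing can be illustrated with a short sketch. This section does not name the correction procedure, so the code below uses Holm’s step-down method purely as an assumption; the function name and the P-values are hypothetical.

```python
def holm_adjust(p_values):
    """Holm step-down adjusted P-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The smallest P-value is multiplied by m, the next by m - 1, and so on.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity: adjusted values never decrease with rank.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Hypothetical raw P-values from three tests on the same sample.
raw = [0.01, 0.04, 0.03]
adj = holm_adjust(raw)
n_significant_before = sum(p <= 0.05 for p in raw)
n_significant_after = sum(p <= 0.05 for p in adj)
```

In this toy case all three tests are significant at the 5% level before adjustment, but only one remains so afterwards, which mirrors the kind of reduction reported above (e.g. 11 samples before and one after correction for Tajima’s D at NAT1).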

Comparison with NAT genetic diversity in other hominids

Haplotype inference for gorillas and orangutans could not be achieved with high enough confidence, because several unknown positions in the genotypes retrieved from GAGP overlapped with variants detected in the sequenced samples of this study (Supplementary Tables S3, S4 and S5). Thus, only nucleotide diversity was estimated for gorillas and orangutans (Supplementary Table S13). As shown in Supplementary Figure S5, the relative levels of diversity at the three NAT genes differ markedly between the two gorilla species available for this study (Western and Eastern gorillas, G. gorilla and G. beringei, respectively), and between the two orangutan species (Sumatran and Bornean orangutans, P. abelii and P. pygmaeus, respectively). The results indicate a level of diversity at NAT1 similar to that of Pan for both species of gorillas, thus also higher than in humans, whereas the highest values among all great apes were observed in orangutans, albeit differing markedly between the two orangutan species (Supplementary File S1). At NAT2, Western gorillas and both species of orangutans display a diversity level comparable to that of humans, thus higher than Pan, whereas the diversity level of Eastern gorillas is comparable to that of Pan. Finally, the highest values of diversity among all great apes, including humans, were observed for the two orangutan and one of the gorilla (G. gorilla) species at NATP, while no diversity was detected in the sample of the other gorilla species (G. beringei). These contrasting results between both the two gorilla and the two orangutan species should however be considered with caution in view of the extremely small sample sizes of Eastern gorillas (n=3) and Bornean orangutans (n=6).

DISCUSSION

The human acetylation polymorphism, discovered in the early 1950s and shown to be responsible for inter-individual variation in drug biotransformation, is considered as the best-known example of a pharmacogenetic trait (Meyer 2004; Agundez 2008a; McDonagh et al. 2014). As reviewed in Sabbagh et al. (2018), most of its inter-individual variation, further reflected in inter-population variation, is due to a high number of non-synonymous polymorphisms in the coding region of the NAT2 gene, which ranks among the most polymorphic human drug metabolizing genes. By contrast, human NAT1, its functional paralog located some 200 Kb upstream, displays up to 13 times less diversity than NAT2 in its coding region. Several observations suggest that NAT1 and NAT2 evolved under distinct selective regimes in humans. For human NAT1, it is generally held that its product is functionally constrained, and thus subjected to purifying selection, probably because of the expression of the NAT1 enzyme in many tissues early during development (including reproductive tissues), and its implication in the metabolism of folates (Patin et al. 2006a; Butcher and Minchin 2012). For human NAT2 however, due to the role of its product in the detoxification of exogenous substances, it is proposed that the mode of subsistence and/or the chemical environment in which past populations have been living induced positive population-specific selective pressures on the gene, thereby explaining the documented differential distribution of acetylation prevalence among subsistence strategies and exploited biomes (Patin et al. 2006a; Luca et al. 2008; Magalon et al. 2008; Sabbagh et al. 2008; Mortensen et al. 2011; Podgorna et al. 2015). To gain further insights on these evolutionary hypotheses, we performed here a comprehensive analysis of the diversity of NAT genes in hominids (i.e. in chimpanzees, gorillas, orangutans, and hominins – humans, including Neanderthal and Denisova). To the best of our knowledge, this is the first study investigating NAT intra-species polymorphism in hominids.

Levels and patterns of diversity of the functional arylamine N-acetyltransferase genes in the Pan genus point to different selective pressures acting on NAT1 and NAT2

Several lines of evidence have led to the current view that, in humans, the diversity of the NATP pseudogene can be largely ascribed to our demographic history, but that of NAT2 was further shaped by population-specific selective pressures, whereas that of NAT1 was constrained by relatively strong purifying selection (Sabbagh et al. 2018). Indeed, a marked geographic structure characterizes human NATP diversity, with less diversity in populations from Europe, Asia and the Americas than in African populations (as illustrated in Supplementary Figure S6 for the human dataset assembled in this study). For NAT2, the observations are of an excess of non-synonymous relative to synonymous SNPs, a lack of correlation of genetic and geographic distances at the worldwide scale, and an unusual pattern of population structure that differentiates Asian populations from Africans and Europeans more than expected on neutral grounds. Finally, NAT1 is much less polymorphic in humans, and is associated with a significant excess of homozygotes in many populations.

In Pan, the frequency distributions of inferred haplotypes suggest that NAT genes have evolved under different selective regimes in these species too. Similarly to humans, these distributions are characterized by marked differences between the three NAT genes (Figure 1 and Table 4). At the NATP pseudogene, we found extensive variation between Pan species and sub-species, and the shapes of these frequency distributions resemble an expected L-shaped neutral distribution (i.e. a haplotype at intermediate to high frequency, other haplotypes at increasingly low frequencies), except in bonobos (P. paniscus). At NAT1, the commonest haplotype in each sub-species is more frequent than those at NATP. The pattern is even more skewed towards a single haplotype at very high frequency at NAT2. Moreover, while the same NAT1 haplotype (Pan NAT1*1) is predominant in each chimpanzee sub-species, at the NAT2 gene instead different haplotypes are prevalent in each sub-species, except for Eastern (P. t. schweinfurthii) and Central (P. t. troglodytes) chimpanzees, which is not surprising knowing the related demographic history of these two sub-species (Gagneux et al. 1999; Hey 2010; Bjork et al. 2011; Prado-Martinez et al. 2013; Funfstuck et al. 2015; de Manuel et al. 2016; Lobon et al. 2016).

In line with these observations, our analyses failed to reveal any significant level of genetic differentiation between chimpanzee sub-species at NAT1, while we do find significant differentiation at NAT2, except between Eastern and Central chimpanzees (Supplementary Table S6). At the NATP pseudogene, Western (P. t. verus) chimpanzees are significantly differentiated from all the other P. troglodytes sub-species (and even between the two verus samples), while no genetic differentiation was found among any of the latter. Given the high levels of NATP polymorphism in chimpanzees (Figure 1 and Tables 3 and 5) and the significant differentiation between the two verus samples, we believe that this lack of genetic differentiation could result from a lack of power due to the small sample sizes available for Central (P. t. troglodytes), Eastern (P. t. schweinfurthii) and Nigeria-Cameroon (P. t. ellioti) chimpanzees. In turn, significant differentiation levels were estimated between bonobos and all other Pan at all three NAT genes.

In terms of gene and molecular diversity, the different Pan species and sub-species display a similar NAT1 - NAT2 pattern of variation, i.e. high diversity at NAT1 and lower at NAT2, in spite of differences in diversity levels (Figure 2). In humans instead, the NAT1 - NAT2 diversity pattern is reversed, with less diversity than Pan at NAT1 and more at NAT2, and these differences between humans and apes are significant (Figure 2 and Supplementary Table S11). Conversely, diversity was not found significantly different between humans and Pan at NATP. A reversed pattern between Homo sapiens and Pan was also highlighted by the results of the tests of selective neutrality and demographic equilibrium. While at NAT1, only two rejections of the null hypothesis (with one test only) were observed in chimpanzees and bonobos (Table 6), deviations from expectations were found with the three tests in several human populations (Supplementary Table S12). At NAT2 instead, rejections were observed at least with one test in all chimpanzees and bonobos, except in the P. t. ellioti sub-species, while very few rejections were observed in humans (none after correction for multiple testing).

Altogether, these results suggest that diversity at NAT1 in humans could have been influenced by selective pressures (either purifying or recent directional selection), but not in chimpanzees or bonobos, whereas similar selective pressures could have influenced the diversity of the NAT2 gene in bonobos and at least some chimpanzee sub-species, but not in humans.

However, it is important to consider the possible confounding effects of population demographic history when interpreting any significant departure of nucleotide polymorphism from equilibrium-neutral predictions. Demographic and selective events tend to leave very similar traces over sequences. For example, a selective sweep, i.e. an episode of recent directional selection, and a population bottleneck followed by expansion are both expected to generate a frequency spectrum skewed towards rare alleles. Indeed, genomic comparisons across large numbers of individual hominids have revealed that modern humans are genetically less variable than most other great apes, notably gorillas and orangutans, and most chimpanzee sub-species, although the difference is less pronounced when compared to Western chimpanzees, Eastern lowland gorillas and bonobos (Gagneux et al. 1999; Kaessmann et al. 2001; Prado-Martinez et al. 2013; Xue et al. 2015). This pattern is ascribed to the evolutionary history of our species, which is marked by a series of recent demographic expansions, such as the one following the founder event driving modern humans out of Africa (Li et al. 2008; Barbujani and Colonna 2010; Veeramah and Hammer 2014). However, human and non-human hominids do not seem to differ in the genome-wide accumulation of loss-of-function mutations and pseudogenes, suggesting that neither human demographic expansions nor the potential buffering role of human culture apparently led to human genomes tolerating a higher mutational load (Prado-Martinez et al. 2013). Additionally, according to Prado-Martinez et al. (2013), all chimpanzees and bonobos experienced a population bottleneck more than 2 million years ago, while Western chimpanzees experienced a second, more recent bottleneck some 500,000 years ago, followed by re-expansion. Moreover, studies on the major histocompatibility complex (MHC) region of chimpanzees suggest that a strong selective sweep within this region, owing to the action of a particular viral pathogen, was likely coupled to the first bottleneck (de Groot et al. 2002; de Groot et al. 2008), and a comparative study of the MHC region in humans and Western chimpanzees suggests that this first bottleneck accounts for observed differences of MHC diversity between the two species (Vangenot et al. 2017). The lower diversity observed in Western chimpanzees compared to the other P. troglodytes sub-species (Figure 2 and Table 6) could indeed reflect footprints of the second, more recent bottleneck event, thus implying that the P. t. verus sub-species (Western chimpanzees) experienced greater genetic drift than the others. Indeed, it is in Western chimpanzees, among all sub-species of chimpanzees, that the lowest or one of the lowest estimations of diversity was found, including at the NATP pseudogene, in agreement with a well-known result at the genome level, i.e. with Central chimpanzees being generally the more diverse and Western chimpanzees the least (Prado-Martinez et al. 2013; Bataillon et al. 2015; de Manuel et al. 2016).
Despite these complex demographic histories, the neutrality tests used are known to have no statistical power to detect such ancient demographic events (Tajima 1989b; Tajima 1989a; Simonsen et al. 1995; Ramirez-Soriano et al. 2008). Consequently, the chimpanzee and bonobo populations investigated can be reasonably assumed to be at demographic equilibrium and null values of Tajima’s D and Fu’s FS are expected at putatively neutral genomic regions (e.g. non-genic and intronic regions, or pseudogenes). Previous studies investigating such regions have consistently produced null D values for Western chimpanzees (Deinard and Kidd 1999; Kaessmann et al. 1999; Fischer et al. 2004; Fischer et al. 2006; Fischer et al. 2011; Sugawara et al. 2011). Therefore, the significant deviations from the standard, neutral equilibrium model at NAT2 in chimpanzees and bonobos can be interpreted, in our view, as evidence for some kind of selection, most likely purifying selection, which is how such patterns are interpreted for NAT1 in humans (Sabbagh et al. 2018).

On the other hand, the few rejections of the null hypothesis at NAT1 in Pan rather support a lack of selective pressure at this gene, but this conclusion fails to explain the lack of NAT1 genetic differentiation among chimpanzees, with the same predominant haplotype in all P. troglodytes (i.e. NAT1*1), as opposed to the patterns characterizing NAT2 and NATP. In view of the observed similar level of diversity between humans and chimpanzees at the NATP pseudogene, the differences found between the two functional NAT paralogs, both among Pan, and between Pan and humans rather argue for some contribution of selective pressures acting on both these genes. Neither the diversity indices that we estimated nor the selective neutrality tests that we performed account for differences in the functional impact of mutations. Considering such functional effects could help to identify the kind of selective pressures at work, and their likely magnitude.

At the NAT1 gene, haplotype NAT1*1 is the most frequent in all chimpanzee sub-species, whereas it is NAT1*6 in bonobos (Figure 1). These two haplotypes are nevertheless likely to confer a similar enzymatic profile since they only differ by a synonymous SNP (T369C, Table 3). Therefore, it is expected that a majority of chimpanzees and bonobos have a similar NAT1 acetylation capacity. Moreover, at each of the NAT1 positions that are polymorphic in Pan, all other great apes, including humans, are apparently fixed on the nucleotides of the major Pan haplotype NAT1*1. Thus, the maintenance of NAT1*1 for approximately half a million years (TMRCA of Pan troglodytes) and of a likely similar acetylation capacity (NAT1*6 differs from NAT1*1 by a single synonymous substitution) for about 2 million years (TMRCA of Pan) points to the effect of purifying selection acting on this gene to preserve such acetylation capacity. At NAT2, as already stated, frequency distributions in Pan were found to differ markedly from those at NAT1, not only because they are more skewed towards a high frequency of the major haplotype, but also because this prevalent haplotype is different between sub-species. Pan NAT2 is also characterized by a low level of diversity compared to NAT1 and NATP, and to human NAT2 as well (Figure 2). Theoretically, such a pattern could be due either to rapid genetic drift in populations of chimpanzees and bonobos, or to directional selection, possibly purifying selection, acting to preserve specific haplotypes, depending on the sub-species. Both hypotheses are supported by the tests of selective neutrality and demographic equilibrium, which revealed several rejections in favour of directional selection or a population expansion after a bottleneck. Moreover, if only polymorphic positions within the Pan coding exon are considered (Table 3 and Supplementary Figure S2), only three main NAT2 haplotypes are observed, i.e. NAT2*1/NAT2*4 (most frequent in P. t. verus, P. t. troglodytes and P. t. schweinfurthii), NAT2*6 (P. t. ellioti), and NAT2*7 (P. paniscus). Thus, although haplotype variation between chimpanzee sub-species is apparently higher at NAT2 than at NAT1, it is likely that the evolution of NAT2 in Pan is nevertheless functionally constrained. This idea finds support in the comparison of NAT2 with the highly variable NATP pseudogene. In spite of this, similarly to observations in humans (Patin et al. 2006a; Mortensen et al. 2011), significant linkage disequilibrium among the three NAT gene family members is mainly detected between NAT2 and NATP (Supplementary Table S14, tested only in Western chimpanzees). It is thus reasonable to assume that the pseudogene has been freer to accumulate recent mutations, while selection acts to remove harmful mutations in NAT2. In this context, the extremely low NATP diversity in bonobos is intriguing, the more so as their diversity pattern at NAT1 and NAT2 is similar to that of chimpanzees (Figure 2). We verified that this depleted polymorphism in the pseudogene could not be ascribed to a lower quality of the SNP calling process or lower coverage for bonobos in the GAGP data. Note that in the GAGP whole genome study, bonobos and also Eastern gorillas (G. beringei) showed the lowest genetic diversity among all great apes, and displayed distributions of homozygosity tracts similar to those of human populations having experienced strong genetic bottlenecks (Prado-Martinez et al. 2013; Xue et al. 2015).

Prediction of the existence of different profiles of acetylation in chimpanzees and bonobos

In humans, polymorphisms identified in NAT1 and NAT2 led to the definition of haplotypes with a known acetylation profile when an association between a mutation and the functionality of the protein was observed in vivo or in vitro (Walraven et al. 2008; Zhu and Hein 2008; Zhu et al. 2011). In other primates, for the time being, functional knowledge only exists for the rhesus macaque NAT2 gene. Indeed, a recent study (Tsirka et al. 2014) demonstrated that the function of the NAT2 enzyme in the human and rhesus macaque species diverges in substrate selectivity, shifting substrate affinity of the enzyme between bulkier NAT2 substrates and smaller NAT1 substrates, and that this shift is due to a single substitution that does not otherwise alter the stability and the overall activity of the protein. Such knowledge does not exist yet for chimpanzees, and we thus used three in silico tools to predict the potential consequence of a substitution on the function of Pan NAT proteins (Table 5).

We therefore suggest that, compared to the Pan basal haplotype NAT1*1, nine derived haplotypes might translate into an equivalent phenotype, whereas two might have a moderate damaging effect. Indeed, none of the SNPs detected in NAT1 was predicted as damaging by the three tools concomitantly, but the comparison of our results for Pan haplotypes with the outcomes of predictions for human haplotypes with known effects raises the possibility that two Pan NAT1 haplotypes, NAT1*4 and NAT1*7, could have a moderate “slowing” effect on enzymatic activity (similar to that of human haplotype NAT1*14B, judging by the similarity of the results returned by the prediction tools).

At the NAT2 gene, non-synonymous mutations defining three Pan derived haplotypes, NAT2*2, NAT2*8, and NAT2*9, were predicted as damaging, with good confidence, by the three tools. A slower enzymatic activity associated with these haplotypes could thus be expected. Two of these three haplotypes are rather uncommon, each being observed at a frequency below 5% in one species only: NAT2*2 in Western chimpanzees (P. t. verus) and NAT2*8 in bonobos (P. paniscus) (Table 4, Figure 1, Supplementary Table S4, and Supplementary Figure S3). Haplotype NAT2*9 was also detected in bonobos only, but it has an estimated frequency over 7%. Hence the cumulated frequencies of NAT2 haplotypes potentially conferring slower NAT2 enzymatic activity in P. paniscus could reach 10%. Assuming Hardy-Weinberg proportions, this translates into an expected frequency of carriers of two potentially slow haplotypes of 1%, versus 18% of heterozygous slow/rapid carriers (observed frequencies were 0% and 21.4%, respectively). By comparison, the lowest frequencies of NAT2 slow acetylators in human populations (either directly documented by phenotypic studies, or indirectly estimated by the proportion of carriers of two slow haplotypes) are around 10% (Sabbagh et al. 2011). Such low frequencies are encountered both in hunter-gatherer populations around the world (with the lowest values among some hunter-gatherer populations of the American continent; Fuselli et al. 2007) and in some North East Asian populations (in China, Korea and Japan). For the Pan species, we tentatively speculate that two additional haplotypes (NAT2*6 and NAT2*7) could be associated with a substrate-dependent reduction in acetylation activity, given the similarity of results returned by the prediction tools with those of human haplotypes for which such substrate-dependent activity has been proposed (i.e. human NAT2*7A, NAT2*7B, and NAT2*10, Supplementary File S1 and Supplementary Table S15). Although we acknowledge that such an assertion is based on very limited evidence, it nevertheless opens up the possibility that, compared to chimpanzees, NAT2 activity in bonobos could be globally reduced given that NAT2*7 is the most prevalent haplotype in this species (Figure 1 and Supplementary Figure S3).
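The expected carrier frequencies quoted above (1% and 18% for a cumulated slow-haplotype frequency of 10%) match what Hardy-Weinberg proportions give. The short sketch below reproduces that arithmetic; the function name `hardy_weinberg` and the two-class simplification (all potentially slow haplotypes pooled against all others) are assumptions made here for illustration.

```python
def hardy_weinberg(q_slow):
    """Expected genotype-class frequencies for a pooled 'slow' haplotype frequency q_slow."""
    if not 0.0 <= q_slow <= 1.0:
        raise ValueError("frequency must lie between 0 and 1")
    p_rapid = 1.0 - q_slow
    return {
        "slow/slow": q_slow ** 2,             # carriers of two slow haplotypes
        "slow/rapid": 2 * p_rapid * q_slow,   # heterozygous carriers
        "rapid/rapid": p_rapid ** 2,          # carriers of no slow haplotype
    }


# Cumulated frequency of potentially slow NAT2 haplotypes in bonobos (10%, from the text).
expected = hardy_weinberg(0.10)
for genotype, frequency in expected.items():
    print(f"{genotype}: {frequency:.2%}")
```

With q = 0.10 the expected proportions are q² = 0.01 and 2pq = 0.18, which can be set against the observed 0% and 21.4% reported above.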

In summary, in the present state of knowledge, our results suggest the existence among chimpanzees and bonobos, as in humans, of diversified acetylation profiles for both NAT1 and NAT2 genes. While for NAT1 only two infrequent Pan haplotypes might confer a slower enzymatic activity than the reference, for NAT2, two of the three haplotypes observed in bonobos are predicted to do so, as well as one in chimpanzees. In terms of frequencies, however, the majority of NAT2 haplotypes observed in chimpanzees are likely associated with a “normal” acetylation capacity that would be the Pan equivalent of human rapid acetylation (Supplementary File S1 and Supplementary Table S16). It is thus likely that, in chimpanzees, mutations modifying the functionality of the NAT1 and NAT2 enzymes in a fashion that slows acetylation could be subject to purifying selection. Purifying selection acting on the chimpanzee NAT2 gene is consistent with its low diversity within sub-species as compared to NATP, and its high molecular differentiation among Pan sub-species. However, the greater molecular diversity found at NAT1 within sub-species is more compatible with a mechanism of (weak) positive directional selection favouring the prevalent P. troglodytes NAT1*1 haplotype in all sub-species. Although constrained by the sample sizes available, our results raise the possibility that in bonobos functional constraints at NAT2 against slower acetylation could be less stringent than in chimpanzees, allowing a predicted frequency of NAT2 slow acetylation haplotypes of 10% in this species.

These evolutionary hypotheses need to be tested with functional studies of the in vitro and/or in vivo activities of NAT1 and NAT2 enzymes in Pan species. Moreover, such hypotheses will also need to address further complexities, such as the NAT2 metabolic adaptations (i.e. shifts in NAT2 enzymatic activity) recently reported in a study comparing a local human population with a population of first-generation emigrants (Aklillu et al. 2018). Finally, we acknowledge that our analyses are based on only a few segregating sites, which is expected given the short length of the NATs open-reading frames. Hence larger sample sizes than the ones available in this study are required to make robust assertions on the prevalence of distinct acetylator profiles, and to confirm the patterns of molecular diversity of NAT genes found in chimpanzees and bonobos.

Divergent selective pressures acting on the evolution of NAT genes in humans and chimpanzees

The results of the three tests of selective neutrality used on the human dataset analysed here support the current view that human NAT1 diversity is constrained by purifying selection (Table 6 and Supplementary Table S12). On the other hand, the finding of higher frequencies of human NAT2 slow acetylator phenotypes in food-producing populations compared to hunter-gatherers supports the idea that the gene was impacted by selective pressures induced by the modes of subsistence adopted by past populations, and in particular their diets (Patin et al. 2006a; Luca et al. 2008; Magalon et al. 2008; Sabbagh et al. 2008; Mortensen et al. 2011; Podgorna et al. 2015). However, this selective hypothesis fails to be consistently supported, with tests of selective neutrality on human NAT2 variation producing very few significant results (including those from this study, although see Patillon et al. 2014). Several explanations for the observed weak and inconsistent signals of selection have been suggested, such as the action of adaptive mechanisms difficult to detect through the standard frequency spectrum tests that we and others have used (i.e. selection on standing variation, selection favouring heterozygotes carrying a slow and a rapid haplotype, ancient balancing selection masked by directional selection on specific haplotypes, or recent relaxation of functional constraints). Moreover, consistent evidence for genetic and genomic signatures of demographic expansions in humans in the past (Li et al. 2008; Barbujani and Colonna 2010; Veeramah and Hammer 2014) could have mitigated molecular signals of selective mechanisms such as balancing selection and/or selection on standing variation. Interestingly, in human populations from the African Sahel and surrounding regions, higher proportions of NAT2 rapid acetylators were found not only among hunter-gatherers, as opposed to food-producers, but also among populations living in humid tropical environments, as opposed to those living in more arid zones, independently of their mode of subsistence. This raises the possibility that selective pressures on NAT2 could be exerted not only by shifts to new dietary habits, but by the natural chemical environment as well (Podgorna et al. 2015).

Our results suggest that the diversity of NAT genes in chimpanzees could result from evolutionary forces that differ from those operating in humans. Today, chimpanzees live mainly in humid tropical environments (according to the classification of the United Nations Environment Program) even if the limits of their habitat are also located in more arid zones (such as in , Guinea, , , , and the Republic of Congo). Besides the differences in the overall efficiency of purifying and positive selection acting on the genomes of great ape species, which was shown to be correlated with their long-term population sizes (Cagan et al. 2016), differences in intensity of selective pressures exerted by the environment could also be invoked.

Chimpanzees and bonobos are raw food “hunter-gatherers” that feed mainly on plant matter, but also eat uncooked , birds, eggs and small- to middle-sized .

Moreover, it is generally assumed that chimpanzee and bonobo diets have not substantially changed over time, in contrast to humans, whose diets have done so, probably several times, over the last 200 thousand years.

In humans, besides their influence on the effectiveness of prescribed medications, polymorphisms at NAT2, and also at NAT1, have been associated with differential susceptibility to various cancers linked to arylamine exposure (Hein 2002; Agundez 2008b; Ladero 2008; Selinski et al. 2015; Hein 2018; Laurieri et al. 2018). Such exposures occur with cigarette smoke, gases and pollutants produced by various chemical industries, as well as diets including meat or fish cooked (or fried) at high temperatures (Weisburger 2002; Chung 2015; Fahrer 2016). Patin et al. (2006a) stressed the idea that changes in exposure to xenobiotics with carcinogenic risk or other toxicities associated with the new diets introduced by the transition to food-producing life-styles, as opposed to hunter-gatherer subsistence modes, might have led to changes in the selective pressures acting on drug-metabolizing enzymes such as the NAT enzymes. Here we speculate that the entire phenomenon of food processing, which is intimately linked to the handling of fire, and thus represents a major distinctive human feature common to all modes of subsistence, may have exposed our species to new, food-borne carcinogens and other toxic molecules seldom encountered in the diets of other primate species. Human biological adaptation stemming from the controlled use of fire is an idea that dates back at least to Charles Darwin (Wrangham and Carmody 2010). It is supported, notably, by studies on the influences of a cooked diet on gene expression in the liver (Carmody et al. 2016), and the hypothesis of a specific genetic adaptation to fire use is advocated to explain the fixation, in modern humans, of a single nucleotide substitution in the aryl hydrocarbon receptor (AHR) gene, which results in a lowered sensitivity of the receptor to toxic exogenous AHR-ligands, such as polycyclic aromatic hydrocarbons (PAHs) that are contained in fire smoke and cooked and smoked foods (Hubbard et al. 2016).

Turning back to chimpanzees, we could thus speculate that, for a species living in a more limited environment and having experienced little change in diet, selective pressures such as those affecting humans have been less intense or even non-existent. Instead, the hypothesis of purifying selection acting to maintain an acetylation activity sufficiently adapted to the chimpanzees’ environment and diet is consistent with their low diversity at NAT2, the significant rejections of neutrality for this gene, at least for Western chimpanzees, and the low frequencies of those mutations that were predicted to be damaging. While NAT2 mutations leading to a slower acetylation phenotype are hypothesised to have recently become advantageous in many human populations as they settled in new environments and/or adopted new subsistence strategies, including the consumption of cooked and roasted foods, such mutations are likely to have been deleterious in chimpanzees, and thus negatively selected.

Although it is likely that chimpanzees and bonobos did not experience similar shifts in diets as humans did, the hypothesis of a less stable chemical environment for humans than for other great apes is challenged by major climatic changes, such as the drier conditions developing in the Last Glacial Maximum, which have been proposed to be associated with the speciation process leading to Eastern and Western gorillas, and the demographic decline of the latter species (Roy et al. 2014; Xue et al. 2015). Moreover, our analyses did not produce systematically matching results between Western chimpanzees and the other sub-species, in particular with regard to the selective neutrality tests. A priori, these discrepancies could result from the smaller sample sizes of Central, Eastern and Nigeria-Cameroon chimpanzees, making misestimation of haplotype frequencies a possible issue. Note that the sample sizes of Western chimpanzees (18 and 23 individuals for San Diego and BPRC, respectively, Table 4) are smaller than the average of the human population samples analysed here (between 60 and 70 individuals, Supplementary Table S7), although several human samples from African populations are represented by fewer than 20 individuals. Moreover, one cannot exclude the possibility of sequencing errors, particularly for those haplotypes defined by singleton SNPs and observed only once (e.g. Eastern chimpanzee Andromeda, whose data was retrieved from the GAGP, is the only carrier of NAT2*10). Finally, sub-species determination, very often exclusively based on mitochondrial DNA, could be erroneous, especially so in the presence of hybrids (Becquet et al. 2007; see also Supplementary File S1).

Sample sizes at least comparable to those of Western chimpanzees in the present study are thus needed to confirm the apparent differences between sub-species that we detected. The recent publication of new genomes for these sub-species (de Manuel et al. 2016) is likely to allow an evaluation of our findings in the near future. These findings also call for studies on chimpanzees living in different environments and under different chemical exposure. For instance, numerous Eastern chimpanzees of the Sebitoli community present congenital anomalies as well as cleft palates (Krief et al. 2014; Krief et al. 2015; Krief et al. 2017). In humans, deficiency in folates, in which NAT1 is implicated, is often associated with many congenital malformations including cleft palates (Wahl et al. 2015), and some mutations in both NAT1 and NAT2 have been associated with this condition (Song et al. 2013; Santos et al. 2015).

A hypothetical shift in function between NAT1 and NAT2 during hominid evolution

The differences in diversity levels between NAT1 and NAT2 among great ape species suggest that, as in humans, a differentiated function for the two enzymes exists also in other hominid species. In humans, the NAT1 and NAT2 isoenzymes have a different expression profile and acetylate different substrates (Jensen et al. 2006; Wakefield et al. 2007; Butcher and Minchin 2012; Laurieri et al. 2014; Sim et al. 2014). Human NAT1 is expressed in most tissues from very early during development and is assumed to play a role in the metabolism of folates, while NAT2 is mostly expressed in the liver and intestines and its substrates are not supposed to influence its regulation, in contrast to NAT1. As reviewed by Butcher and Minchin (2012), the transcriptional and post-transcriptional regulation of the NAT1 enzyme could also have a greater effect on NAT1 activity than the genotype, in contrast to what is known for NAT2. Notably, it is suggested that epigenetic regulation depending on the concentration of some substrates could affect the activity of NAT1 in cells (Wakefield et al. 2010). Whether differences in diversity levels between the two genes are linked to differences in factors contributing to the inter-individual variation in enzymatic activity between NAT1 (high variation in expression regulation) and NAT2 (high variation in protein sequence) is an open question. It would be tempting to assume that the function of the two enzymes is different in chimpanzees, compared to humans, in view of the reversed NAT1-NAT2 diversity pattern of chimpanzees and other great ape species (bonobos and Sumatran orangutans). For instance, one could consider the possibility of a shift in substrate affinity between the two isoenzymes during hominid evolution, as suggested by the expression study of the rhesus macaque NAT2 gene (Tsirka et al. 2014), leading to a divergence in function between Pan and Homo. At present, however, neither temporal (i.e. early or later in development), nor spatial (e.g. ubiquitous versus tissue-specific) differences in expression of the two NAT isoenzymes are known in other great ape species besides humans, so we can only speculate on possible environmental factors that could exert a selective pressure on NAT genes in our closest relatives.

In conclusion, we have found high levels of diversity of NAT genes in chimpanzees and bonobos, similar to those found in humans. However, the diversity is distributed in the opposite way in chimpanzees and bonobos, such that there is higher diversity in Pan NAT1 and lower diversity in Pan NAT2, whereas the opposite is observed in humans. A reversed pattern between Pan and humans was also returned by the tests of selective neutrality and demographic equilibrium. Rejections of the model in chimpanzees were found mostly associated with NAT2, and likely due to directional selection, whereas in human populations this appears to be the case for NAT1, thus suggesting distinct selective pressures acting on Pan NAT1 and NAT2 compared to humans. Our analyses of the predicted functional impact of mutations detected in non-human primates suggest that a non-negligible proportion of chimpanzees could have a moderately reduced NAT1 acetylation capacity, in sharp contrast with most human populations. In turn, reduced NAT2 acetylation capacity is known to be frequent in many human populations, and our analyses predicted that this could also be the case for a significant proportion of bonobos, but less so of chimpanzees. Altogether, our results raise the possibility that humans and chimpanzees evolved some divergence in functionality at the NAT genes in the course of hominid history, such as divergence in substrate affinity/specificity/selectivity for each of the two enzymes. Such a hypothetical shift in function could be due to fixed substitutions between humans and Pan NAT genes, as has been shown for macaques compared to humans. On the basis of the known role of NAT2 in the metabolism of smoke-contained aromatic amines, we postulate that a functional divergence of NATs between Pan and humans could have been driven by the development of fire handling and food processing in humans, a hypothesis that could be addressed in the future by functional studies of NAT1 and NAT2 enzymatic activities in great apes.

Acknowledgments

The authors wish to thank Lydie Brunet, Eliška Podgorná, Ilham Bahechar and Juan Montoya for technical assistance, Ronald Bontrop and Gaby Doxiadis for help with the BPRC samples, and Gaby Doxiadis, José Manuel Nunes, and André Langaney for helpful discussions. We are also very grateful to the anonymous reviewer for providing constructive comments and suggestions on a former version of the manuscript.

Supported by Swiss National Science Foundation grant no. 320030 to ESP.

Figure Legends

Figure 1

Haplotype frequency distributions at the three NAT genes in Pan species and sub-species.

Figure 2

Expected heterozygosity (h) and nucleotide diversity (π × 10⁻³) at the three NAT genes in Pan species and sub-species (left panels) and in human populations (right panels). The variation of values among the 122 San Diego P. t. verus sub-samples (left panels) and among human population samples (right panels) is shown by boxplots. The dotted lines were added to the graphs to highlight inter-locus variation. For Pan, P-values of Wilcoxon rank-sum tests after adjustment for multiple testing (and using only the average value for the San Diego sample) were 0.039 for NAT1 versus NAT2, 0.065 for NAT1 versus NATP, and 0.065 for NAT2 versus NATP, respectively, for differences in expected heterozygosity (h), and 0.0065 for NAT1 versus NAT2, 0.5887 for NAT1 versus NATP, and 0.0974 for NAT2 versus NATP, respectively, for differences in nucleotide diversity (π). When restricting the tests to the chimpanzee (P. troglodytes) data only, P-values were 0.012 for NAT1 versus NAT2, 0.222 for NAT1 versus NATP, and 0.012 for NAT2 versus NATP, respectively, for expected heterozygosity, and 0.036 for NAT1 versus NAT2, 0.018 for NAT1 versus NATP, and 0.018 for NAT2 versus NATP, respectively, for nucleotide diversity. For human populations, adjusted Wilcoxon rank-sum test P-values for differences in both expected heterozygosity and nucleotide diversity were all < 0.0001 in the comparisons of NAT1 versus NAT2 and NAT1 versus NATP, and were 0.58 and 0.048 for expected heterozygosity and nucleotide diversity, respectively, in the NAT2 versus NATP comparison.
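For readers unfamiliar with the first index plotted in Figure 2, the sketch below shows one common way to compute expected heterozygosity (gene diversity) from a sample of haplotypes, using Nei's unbiased estimator h = n/(n − 1) × (1 − Σ pᵢ²). That this exact estimator underlies the values in the figure is an assumption, and the function name and the haplotype counts in the example are hypothetical, not data from this study.

```python
from collections import Counter


def expected_heterozygosity(haplotypes):
    """Unbiased gene diversity from a list of haplotype labels (one per sampled chromosome)."""
    n = len(haplotypes)
    if n < 2:
        raise ValueError("need at least two sampled chromosomes")
    # Relative frequency of each distinct haplotype in the sample.
    frequencies = [count / n for count in Counter(haplotypes).values()]
    # Probability that two chromosomes drawn without replacement differ.
    return n / (n - 1) * (1.0 - sum(f * f for f in frequencies))


# Hypothetical sample of 10 chromosomes carrying two haplotypes in equal numbers.
sample = ["NAT1*1"] * 5 + ["NAT1*6"] * 5
print(round(expected_heterozygosity(sample), 4))
```

A sample carrying a single haplotype gives h = 0, and h approaches 1 as many haplotypes occur at similar frequencies.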

References

Adzhubei I. A., Schmidt S., Peshkin L., Ramensky V. E., Gerasimova A. et al., 2010 A method and

server for predicting damaging missense mutations. Nat Methods. 7(4):248-249. doi:

10.1038/nmeth0410-248

Agundez J. A., 2008a N-acetyltransferases: lessons learned from eighty years of research. Current

drug metabolism. 9(6):463-464

Agundez J. A., 2008b Polymorphisms of human N-acetyltransferases and cancer risk. Current

drug metabolism. 9(6):520-531

Aklillu E., Carrillo J. A., Makonnen E., Bertilsson L., and Djordjevic N., 2018 N-Acetyltransferase-2

(NAT2) phenotype is influenced by genotype-environment interaction in Ethiopians.

European Journal of Clinical Pharmacology. doi: 10.1007/s00228-018-2448-y

Bandelt H. J., Forster P., and Rohl A., 1999 Median-joining networks for inferring intraspecific

phylogenies. Molecular biology and evolution. 16(1):37-48

Barbujani G., and Colonna V., 2010 Human genome diversity: frequently asked questions. Trends

Genet. 26(7):285-295. doi: 10.1016/j.tig.2010.04.002

Bataillon T., Duan J., Hvilsom C., Jin X., Li Y. et al., 2015 Inference of purifying and positive

selection in three subspecies of chimpanzees (Pan troglodytes) from exome sequencing.

Genome Biol Evol. 7(4):1122-1132. doi: 10.1093/gbe/evv058

Becquet C., Patterson N., Stone A. C., Przeworski M., and Reich D., 2007 Genetic structure of

chimpanzee populations. PLoS Genet. 3(4):e66. doi: 10.1371/journal.pgen.0030066

Bergfeld A. K., Lawrence R., Diaz S. L., Pearce O. M. T., Ghaderi D. et al., 2017 N-glycolyl groups of

nonhuman chondroitin sulfates survive in ancient fossils. Proc Natl Acad Sci U S A.

114(39):E8155-e8164. doi: 10.1073/pnas.1706306114

Bisso-Machado R., Ramallo V., Paixao-Cortes V. R., Acuna-Alonzo V., Demarchi D. A. et al., 2016

NAT2 gene diversity and its evolutionary trajectory in the Americas. Pharmacogenomics

J. 16(6):559-565. doi: 10.1038/tpj.2015.72

Bjork A., Liu W., Wertheim J. O., Hahn B. H., and Worobey M., 2011 Evolutionary history of

chimpanzees inferred from complete mitochondrial genomes. Molecular biology and

evolution. 28(1):615-623. doi: 10.1093/molbev/msq227

Blömeke B., and Lichter J., 2018 Expression and Activity of Arylamine N-Acetyltransferases in

Organs: Implications on Aromatic Amine Toxicity. In: Laurieri N., and Sim E., editors.

Arylamine N-Acetyltransferases in Health and Disease. World Scientific Publishing. p

133-164. doi: 10.1142/9789813232013_0006

Butcher N. J., and Minchin R. F., 2012 Arylamine N-acetyltransferase 1: a novel drug target in

cancer development. Pharmacol Rev. 64(1):147-165. doi: 10.1124/pr.110.004275

Cagan A., Theunert C., Laayouni H., Santpere G., Pybus M. et al., 2016 Natural Selection in the

Great Apes. Molecular biology and evolution. 33(12):3268-3283. doi:

10.1093/molbev/msw215

Carmody R. N., Dannemann M., Briggs A. W., Nickel B., Groopman E. E. et al., 2016 Genetic

Evidence of Human Adaptation to a Cooked Diet. Genome Biol Evol. 8(4):1091-1103. doi:

10.1093/gbe/evw059

Chung K. T., 2015 Occurrence, uses, and carcinogenicity of arylamines. Frontiers in bioscience

(Elite edition). 7:322-345

de Groot N. G., Heijmans C. M. C., De Groot N., Otting N., De Vos-Rouweler A. J. M. et al., 2008

Pinpointing a selective sweep to the chimpanzee MHC class I region by comparative

genomics. Molecular Ecology. 17(8):2074-2088. doi: 10.1111/j.1365-294X.2008.03716.x

de Groot N. G., Otting N., Doxiadis G. M., Balla-Jhagjhoorsingh S., Heeney J. L. et al., 2002 Evidence

for an ancient selective sweep in the MHC class I gene repertoire of chimpanzees.

Proceedings of the National Academy of Sciences. 99(18):11748-11753. doi:

10.1073/pnas.182420799

de Manuel M., Kuhlwilm M., Frandsen P., Sousa V. C., Desai T. et al., 2016 Chimpanzee genomic

diversity reveals ancient admixture with bonobos. Science (New York, NY).

354(6311):477-481. doi: 10.1126/science.aag2602

Deinard A., and Kidd K., 1999 Evolution of a HOXB6 intergenic region within the great apes and

humans. J Hum Evol. 36(6):687-703. doi: 10.1006/jhev.1999.0298

Enard W., Khaitovich P., Klose J., Zollner S., Heissig F. et al., 2002 Intra- and interspecific

variation in primate gene expression patterns. Science (New York, NY). 296(5566):340-343.

doi: 10.1126/science.1068996

Excoffier L., and Lischer H. E., 2010 Arlequin suite ver 3.5: a new series of programs to perform

population genetics analyses under Linux and Windows. Molecular ecology resources.

10(3):564-567. doi: 10.1111/j.1755-0998.2010.02847.x

Fahrer J., 2016 Food-Borne Carcinogens. In: Schwab M., editor. Encyclopedia of Cancer. Berlin,

Heidelberg: Springer. p 1-5. doi: 10.1007/978-3-642-27841-9_7235-1

Fischer A., Pollack J., Thalmann O., Nickel B., and Paabo S., 2006 Demographic history and genetic

differentiation in apes. Current biology : CB. 16(11):1133-1138. doi:

10.1016/j.cub.2006.04.033

Fischer A., Prufer K., Good J. M., Halbwax M., Wiebe V. et al., 2011 Bonobos fall within the

genomic variation of chimpanzees. PloS one. 6(6):e21605. doi:

10.1371/journal.pone.0021605

Fischer A., Wiebe V., Paabo S., and Przeworski M., 2004 Evidence for a complex demographic

history of chimpanzees. Molecular biology and evolution. 21(5):799-808. doi:

10.1093/molbev/msh083

Fu Q., Li H., Moorjani P., Jay F., Slepchenko S. M. et al., 2014 Genome sequence of a 45,000-year-old

modern human from western Siberia. Nature. 514(7523):445-449. doi:

10.1038/nature13810

Funfstuck T., Arandjelovic M., Morgan D. B., Sanz C., Reed P. et al., 2015 The sampling scheme

matters: Pan troglodytes troglodytes and P. t. schweinfurthii are characterized by clinal

genetic variation rather than a strong subspecies break. Am J Phys Anthropol.

156(2):181-191. doi: 10.1002/ajpa.22638

Fuselli S., Gilman R. H., Chanock S. J., Bonatto S. L., De Stefano G. et al., 2007 Analysis of

nucleotide diversity of NAT2 coding region reveals homogeneity across Native American

populations and high intra-population diversity. Pharmacogenomics J. 7(2):144-152. doi:

10.1038/sj.tpj.6500407

Gagneux P., Wills C., Gerloff U., Tautz D., Morin P. A. et al., 1999 Mitochondrial sequences show

diverse evolutionary histories of African hominoids. Proc Natl Acad Sci U S A.

96(9):5077-5082

Green R. E., Krause J., Briggs A. W., Maricic T., Stenzel U. et al., 2010 A draft sequence of the

Neandertal genome. Science (New York, NY). 328(5979):710-722. doi:

10.1126/science.1188021

Hein D. W., 2002 Molecular genetics and function of NAT1 and NAT2: role in aromatic amine

metabolism and carcinogenesis. Mutat Res. 506-507:65-77

Hein D. W., 2018 Arylamine N-Acetyltransferase Type 2 Polymorphism and Human Urinary

Bladder and Breast Cancer Risks. In: Laurieri N., and Sim E., editors. Arylamine N-

Acetyltransferases in Health and Disease. World Scientific Publishing. p 327-349. doi:

10.1142/9789813232013_0013

Hein D. W., Boukouvala S., Grant D. M., Minchin R. F., and Sim E., 2008 Changes in consensus

arylamine N-acetyltransferase gene nomenclature. Pharmacogenetics and genomics.

18(4):367-368. doi: 10.1097/FPC.0b013e3282f60db0

Hein D. W., Fakis G., and Boukouvala S., 2018 Functional expression of human arylamine N-

acetyltransferase NAT1*10 and NAT1*11 alleles: a mini review. Pharmacogenetics and

genomics. 28(10):238-244. doi: 10.1097/fpc.0000000000000350

Hein D. W., Fretland A. J., and Doll M. A., 2006 Effects of single nucleotide polymorphisms in

human N-acetyltransferase 2 on metabolic activation (O-acetylation) of heterocyclic

amine carcinogens. Int J Cancer. 119(5):1208-1211. doi: 10.1002/ijc.21957

Hey J., 2010 The divergence of chimpanzee species and subspecies as revealed in

multipopulation isolation-with-migration analyses. Molecular biology and evolution.

27(4):921-933. doi: 10.1093/molbev/msp298

Holm S., 1979 A simple sequentially rejective multiple test procedure. Scandinavian Journal of

Statistics. 6:65-70

Hubbard T. D., Murray I. A., Bisson W. H., Sullivan A. P., Sebastian A. et al., 2016 Divergent Ah

Receptor Ligand Selectivity during Hominin Evolution. Molecular biology and evolution.

33(10):2648-2658. doi: 10.1093/molbev/msw143

Jensen L. E., Hoess K., Mitchell L. E., and Whitehead A. S., 2006 Loss of function polymorphisms in

NAT1 protect against spina bifida. Human Genetics. 120(1):52-57. doi: 10.1007/s00439-

006-0181-6

Kaessmann H., Wiebe V., and Paabo S., 1999 Extensive nuclear DNA sequence diversity among

chimpanzees. Science (New York, NY). 286(5442):1159-1162

Kaessmann H., Wiebe V., Weiss G., and Paabo S., 2001 Great ape DNA sequences reveal a reduced

diversity and an expansion in humans. Nat Genet. 27(2):155-156. doi: 10.1038/84773

Kent W. J., Sugnet C. W., Furey T. S., Roskin K. M., Pringle T. H. et al., 2002 The human genome

browser at UCSC. Genome research. 12(6):996-1006. doi: 10.1101/gr.229102.

Knowles J. W., Xie W., Zhang Z., Chennamsetty I., Assimes T. L. et al., 2016 Identification and

validation of N-acetyltransferase 2 as an insulin sensitivity gene. J Clin Invest.

126(1):403. doi: 10.1172/jci85921

Krief S., Berny P., Gumisiriza F., Gross R., Demeneix B. et al., 2017 Agricultural expansion as risk

to endangered wildlife: Pesticide exposure in wild chimpanzees and baboons displaying

facial dysplasia. The Science of the total environment. 598:647-656. doi:

10.1016/j.scitotenv.2017.04.113

Krief S., Krief J. M., Seguya A., Couly G., and Levi G., 2014 Facial dysplasia in wild chimpanzees. J

Med Primatol. 43(4):280-283. doi: 10.1111/jmp.12129

Krief S., Watts D. P., Mitani J. C., Krief J. M., Cibot M. et al., 2015 Two Cases of Cleft Lip and Other

Congenital Anomalies in Wild Chimpanzees Living in Kibale National Park, Uganda. Cleft

Palate Craniofac J. 52(6):743-750. doi: 10.1597/14-188

Kronenberg Z. N., Fiddes I. T., Gordon D., Murali S., Cantsilieris S. et al., 2018 High-resolution

comparative analysis of great ape genomes. Science (New York, NY). 360(6393). doi:

10.1126/science.aar6343

Kuhlwilm M., de Manuel M., Nater A., Greminger M. P., Krützen M. et al., 2016 Evolution and

demography of the great apes. Current Opinion in Genetics & Development. 41:124-129.

doi: 10.1016/j.gde.2016.09.005

Kumar S., Stecher G., and Tamura K., 2016 MEGA7: Molecular Evolutionary Genetics Analysis

Version 7.0 for Bigger Datasets. Molecular biology and evolution. 33(7):1870-1874. doi:

10.1093/molbev/msw054

Ladero J. M., 2008 Influence of polymorphic N-acetyltransferases on non-malignant spontaneous

disorders and on response to drugs. Current drug metabolism. 9(6):532-537

Lander E. S., Linton L. M., Birren B., Nusbaum C., Zody M. C. et al., 2001 Initial sequencing and

analysis of the human genome. Nature. 409(6822):860-921. doi: 10.1038/35057062

Laurieri N., Dairou J., Egleton J. E., Stanley L. A., Russell A. J. et al., 2014 From arylamine N-

acetyltransferase to folate-dependent acetyl CoA hydrolase: impact of folic acid on the

activity of (HUMAN)NAT1 and its homologue (MOUSE)NAT2. PloS one. 9(5):e96370. doi:

10.1371/journal.pone.0096370

Laurieri N., Egleton J. E., and Russell A. J., 2018 Human Arylamine N-Acetyltransferase Type 1

and Breast Cancer. In: Laurieri N., and Sim E., editors. Arylamine N-Acetyltransferases in

Health and Disease. World Scientific Publishing. p 351-384. doi:

10.1142/9789813232013_0014

Li J. Z., Absher D. M., Tang H., Southwick A. M., Casto A. M. et al., 2008 Worldwide human

relationships inferred from genome-wide patterns of variation. Science (New York, NY).

319(5866):1100-1104. doi: 10.1126/science.1153717

Lobon I., Tucci S., de Manuel M., Ghirotto S., Benazzo A. et al., 2016 Demographic History of the

Genus Pan Inferred from Whole Mitochondrial Genome Reconstructions. Genome Biol

Evol. 8(6):2020-2030. doi: 10.1093/gbe/evw124

Luca F., Bubba G., Basile M., Brdicka R., Michalodimitrakis E. et al., 2008 Multiple advantageous

amino acid variants in the NAT2 gene in human populations. PloS one. 3(9):e3136. doi:

10.1371/journal.pone.0003136

Magalon H., Patin E., Austerlitz F., Hegay T., Aldashev A. et al., 2008 Population genetic diversity

of the NAT2 gene supports a role of acetylation in human adaptation to farming in

Central Asia. Eur J Hum Genet. 16(2):243-251. doi: 10.1038/sj.ejhg.5201963

McDonagh E. M., Boukouvala S., Aklillu E., Hein D. W., Altman R. B. et al., 2014 PharmGKB

summary: very important pharmacogene information for N-acetyltransferase 2.

Pharmacogenetics and genomics. 24(8):409-425. doi: 10.1097/fpc.0000000000000062

Meyer M., Kircher M., Gansauge M. T., Li H., Racimo F. et al., 2012 A high-coverage genome

sequence from an archaic Denisovan individual. Science (New York, NY). 338(6104):222-

226. doi: 10.1126/science.1224344

Meyer U. A., 2004 Pharmacogenetics - five decades of therapeutic lessons from genetic diversity.

Nat Rev Genet. 5(9):669-676. doi: 10.1038/nrg1428

Mortensen H. M., Froment A., Lema G., Bodo J. M., Ibrahim M. et al., 2011 Characterization of

genetic variation and natural selection at the arylamine N-acetyltransferase genes in

global human populations. Pharmacogenomics. 12(11):1545-1558. doi:

10.2217/pgs.11.88

Nei M., 1987 Molecular Evolutionary Genetics: Columbia University Press

O'Bleness M., Searles V. B., Varki A., Gagneux P., and Sikela J. M., 2012 Evolution of genetic and

genomic features unique to the human lineage. Nat Rev Genet. 13(12):853-866. doi:

10.1038/nrg3336

Olson M. V., and Varki A., 2003 Sequencing the chimpanzee genome: insights into human

evolution and disease. Nat Rev Genet. 4(1):20-28. doi: 10.1038/nrg981

Patillon B., Luisi P., Poloni E. S., Boukouvala S., Darlu P. et al., 2014 A homogenizing process of

selection has maintained an "ultra-slow" acetylation NAT2 variant in humans. Hum Biol.

86(3):185-214. doi: 10.13110/humanbiology.86.3.0185

Patin E., Barreiro L. B., Sabeti P. C., Austerlitz F., Luca F. et al., 2006a Deciphering the ancient and

complex evolutionary history of human arylamine N-acetyltransferase genes. American

journal of human genetics. 78(3):423-436. doi: 10.1086/500614

Patin E., Harmant C., Kidd K. K., Kidd J., Froment A. et al., 2006b Sub-Saharan African coding

sequence variation and haplotype diversity at the NAT2 gene. Hum Mutat. 27(7):720.

doi: 10.1002/humu.9438

Podgorna E., Diallo I., Vangenot C., Sanchez-Mazas A., Sabbagh A. et al., 2015 Variation in NAT2

acetylation phenotypes is associated with differences in food-producing subsistence

modes and ecoregions in Africa. BMC Evol Biol. 15:263. doi: 10.1186/s12862-015-0543-

6

Prado-Martinez J., Sudmant P. H., Kidd J. M., Li H., Kelley J. L. et al., 2013 Great ape genetic

diversity and population history. Nature. 499(7459):471-475. doi:

10.1038/nature12228

Prufer K., de Filippo C., Grote S., Mafessoni F., Korlevic P. et al., 2017 A high-coverage Neandertal

genome from Vindija Cave in Croatia. Science (New York, NY). 358(6363):655-658. doi:

10.1126/science.aao1887

Prufer K., Racimo F., Patterson N., Jay F., Sankararaman S. et al., 2014 The complete genome

sequence of a Neanderthal from the Altai Mountains. Nature. 505(7481):43-49. doi:

10.1038/nature12886

R Core Team, 2013 R: a language and environment for statistical computing. Vienna, Austria: R

Foundation for Statistical Computing

Ramirez-Soriano A., Ramos-Onsins S. E., Rozas J., Calafell F., and Navarro A., 2008 Statistical

power analysis of neutrality tests under demographic expansions, contractions and

bottlenecks with recombination. Genetics. 179(1):555-567. doi:

10.1534/genetics.107.083006

Roy J., Arandjelovic M., Bradley B. J., Guschanski K., Stephens C. R. et al., 2014 Recent divergences

and size decreases of eastern gorilla populations. Biology letters. 10(11):20140811. doi:

10.1098/rsbl.2014.0811

Sabbagh A., Darlu P., Crouau-Roy B., and Poloni E. S., 2011 Arylamine N-acetyltransferase 2

(NAT2) genetic diversity and traditional subsistence: a worldwide population survey.

PloS one. 6(4):e18507. doi: 10.1371/journal.pone.0018507

Sabbagh A., Darlu P., Vangenot C., and Poloni E. S., 2018 Arylamine N-Acetyltransferases in

Anthropology. In: Laurieri N., and Sim E., editors. Arylamine N-Acetyltransferases in

Health and Disease. World Scientific Publishing. p 165-193. doi:

10.1142/9789813232013_0007

Sabbagh A., Langaney A., Darlu P., Gerard N., Krishnamoorthy R. et al., 2008 Worldwide

distribution of NAT2 diversity: implications for NAT2 evolutionary history. BMC Genet.

9:21. doi: 10.1186/1471-2156-9-21

Sabbagh A., Marin J., Veyssiere C., Lecompte E., Boukouvala S. et al., 2013 Rapid birth-and-death

evolution of the xenobiotic metabolizing NAT gene family in vertebrates with evidence of

adaptive selection. BMC Evol Biol. 13:62. doi: 10.1186/1471-2148-13-62

Santos M. R., Ramallo V., Muzzio M., Camelo J. S., and Bailliet G., 2015 Association between NAT2

polymorphisms with non-syndromic cleft lip with or without cleft palate in Argentina.

Rev Med Chil. 143(4):444-450. doi: 10.4067/s0034-98872015000400005

Scally A., Dutheil J. Y., Hillier L. W., Jordan G. E., Goodhead I. et al., 2012 Insights into hominid

evolution from the gorilla genome sequence. Nature. 483(7388):169-175. doi:

10.1038/nature10842

Selinski S., Blaszkewicz M., Getzmann S., and Golka K., 2015 N-Acetyltransferase 2: ultra-slow

acetylators enter the stage. Archives of toxicology. 89(12):2445-2447. doi:

10.1007/s00204-015-1650-2

Sim E., Abuhammad A., and Ryan A., 2014 Arylamine N-acetyltransferases: from drug

metabolism and pharmacogenetics to drug discovery. Br J Pharmacol. 171(11):2705-

2725. doi: 10.1111/bph.12598

Sim E., and Laurieri N., 2018 Drug Metabolism and Pharmacogenetics Then and Now. In: Laurieri

N., and Sim E., editors. Arylamine N-Acetyltransferases in Health and Disease. World

Scientific Publishing. p 3-41. doi: 10.1142/9789813232013_0001

Sim N.-L., Kumar P., Hu J., Henikoff S., Schneider G. et al., 2012 SIFT web server: predicting effects

of amino acid substitutions on proteins. Nucleic Acids Res. 40(W1):W452-W457. doi:

10.1093/nar/gks539

Simonsen K. L., Churchill G. A., and Aquadro C. F., 1995 Properties of statistical tests of neutrality

for DNA polymorphism data. Genetics. 141(1):413-429

Solis-Moruno M., de Manuel M., Hernandez-Rodriguez J., Fontsere C., Gomara-Castano A. et al.,

2017 Potential damaging mutation in LRP5 from genome sequencing of the first

reported chimpanzee with the Chiari malformation. Scientific reports. 7(1):15224. doi:

10.1038/s41598-017-15544-w

Song T., Wu D., Wang Y., Li H., Yin N. et al., 2013 Association of NAT1 and NAT2 genes with

nonsyndromic cleft lip and palate. Mol Med Rep. 8(1):211-216. doi:

10.3892/mmr.2013.1467

Stephens M., and Scheet P., 2005 Accounting for decay of linkage disequilibrium in haplotype

inference and missing-data imputation. American journal of human genetics. 76(3):449-

462. doi: 10.1086/428594

Stephens M., Smith N. J., and Donnelly P., 2001 A new statistical method for haplotype

reconstruction from population data. American journal of human genetics. 68(4):978-

989. doi: 10.1086/319501

Sugawara T., Go Y., Udono T., Morimura N., Tomonaga M. et al., 2011 Diversification of bitter

taste receptor gene family in western chimpanzees. Molecular biology and evolution.

28(2):921-931. doi: 10.1093/molbev/msq279

Tajima F., 1989a The effect of change in population size on DNA polymorphism. Genetics.

123(3):597-601

Tajima F., 1989b Statistical method for testing the neutral mutation hypothesis by DNA

polymorphism. Genetics. 123(3):585-595

Tang H., and Thomas P. D., 2016 PANTHER-PSEP: predicting disease-causing genetic variants

using position-specific evolutionary preservation. Bioinformatics. 32(14):2230-2232.

doi: 10.1093/bioinformatics/btw222

The Chimpanzee Sequencing and Analysis Consortium, 2005 Initial sequence of the chimpanzee

genome and comparison with the human genome. Nature. 437(7055):69-87. doi:

10.1038/nature04072

The 1000 Genomes Project Consortium, 2012 An integrated map of genetic variation from 1,092

human genomes. Nature. 491:56. doi: 10.1038/nature11632

Thierry-Mieg D., and Thierry-Mieg J., 2006 AceView: a comprehensive cDNA-supported gene and

transcripts annotation. Genome biology. 7 Suppl 1:S12.11-14. doi: 10.1186/gb-2006-7-

s1-s12

Tsirka T., Boukouvala S., Agianian B., and Fakis G., 2014 Polymorphism p.Val231Ile alters

substrate selectivity of drug-metabolizing arylamine N-acetyltransferase 2 (NAT2)

isoenzyme of rhesus macaque and human. Gene. 536(1):65-73. doi:

10.1016/j.gene.2013.11.085

Valente C., Alvarez L., Marks S. J., Lopez-Parra A. M., Parson W. et al., 2015 Exploring the

relationship between lifestyles, diets and genetic adaptations in humans. BMC Genet.

16:55. doi: 10.1186/s12863-015-0212-1

Vangenot C., Nunes J. M., de Groot N., Poloni E. S., Bontrop R. et al., 2017 (under revision) Similar

patterns of genetic diversity in Western chimpanzees (Pan troglodytes verus) and

humans suggest conserved mechanisms of MHC molecular evolution. BMC Evolutionary

Biology.

Veeramah K. R., and Hammer M. F., 2014 The impact of whole-genome sequencing on the

reconstruction of human population history. Nat Rev Genet. 15(3):149-162. doi:

10.1038/nrg3625

Wahl S. E., Kennedy A. E., Wyatt B. H., Moore A. D., Pridgen D. E. et al., 2015 The role of folate

metabolism in orofacial development and clefting. Developmental Biology. 405(1):108-

122. doi: 10.1016/j.ydbio.2015.07.001

Wakefield L., Boukouvala S., and Sim E., 2010 Characterisation of CpG methylation in the

upstream control region of mouse Nat2: evidence for a gene-environment interaction in

a polymorphic gene implicated in folate metabolism. Gene. 452(1):16-21. doi:

10.1016/j.gene.2009.12.002

Wakefield L., Cornish V., Long H., Griffiths W. J., and Sim E., 2007 Deletion of a xenobiotic

metabolizing gene in mice affects folate metabolism. Biochemical and Biophysical

Research Communications. 364(3-2):556-560. doi: 10.1016/j.bbrc.2007.10.026

Walraven J. M., Zang Y., Trent J. O., and Hein D. W., 2008 Structure/function evaluations of single

nucleotide polymorphisms in human N-acetyltransferase 2. Current drug metabolism.

9(6):471-486

Weisburger J. H., 2002 Comments on the history and importance of aromatic and heterocyclic

amines in public health. Mutat Res. 506-507:9-20

Wild C. P., 2012 The exposome: from concept to utility. International journal of epidemiology.

41(1):24-32. doi: 10.1093/ije/dyr236

Wrangham R., and Carmody R., 2010 Human adaptation to the control of fire. Evolutionary

Anthropology: Issues, News, and Reviews. 19(5):187-199. doi: 10.1002/evan.20275

Xue Y., Prado-Martinez J., Sudmant P. H., Narasimhan V., Ayub Q. et al., 2015 Mountain gorilla

genomes reveal the impact of long-term population decline and inbreeding. Science

(New York, NY). 348(6231):242-245. doi: 10.1126/science.aaa3952

Zang Y., Doll M. A., Zhao S., States J. C., and Hein D. W., 2007 Functional characterization of single-

nucleotide polymorphisms and haplotypes of human N-acetyltransferase 2.

Carcinogenesis. 28(8):1665-1671. doi: 10.1093/carcin/bgm085

Zhu Y., and Hein D. W., 2008 Functional effects of single nucleotide polymorphisms in the coding

region of human N-acetyltransferase 1. Pharmacogenomics J. 8(5):339-348. doi:

10.1038/sj.tpj.6500483

Zhu Y., States J. C., Wang Y., and Hein D. W., 2011 Functional effects of genetic polymorphisms in

the N-acetyltransferase 1 coding and 3' untranslated regions. Birth defects research Part

A, Clinical and molecular teratology. 91(2):77-84. doi: 10.1002/bdra.20763

Zumla A., Nahid P., and Cole S. T., 2013 Advances in the development of new tuberculosis drugs

and treatment regimens. Nature reviews Drug discovery. 12(5):388-404. doi:

10.1038/nrd4001

Figure 1

Haplotype frequency distributions at the three NAT genes in Pan species and sub-species.

Figure 2

Expected heterozygosity (h) and nucleotide diversity (π × 10⁻³) at the three NAT genes in Pan species and sub-species (left panels) and in human populations (right panels). The variation of values among the 122 San Diego P. t. verus sub-samples (left panels) and among human population samples (right panels) is shown by boxplots. The dotted lines were added to the graphs to highlight inter-locus variation.

For Pan, P-values of Wilcoxon rank-sum tests after adjustment for multiple testing (and using only the average value for the San Diego sample) were 0.039 for NAT1 versus NAT2, 0.065 for NAT1 versus NATP, and 0.065 for NAT2 versus NATP for differences in expected heterozygosity (h), and 0.0065 for NAT1 versus NAT2, 0.5887 for NAT1 versus NATP, and 0.0974 for NAT2 versus NATP for differences in nucleotide diversity (π). When the tests were restricted to the chimpanzee (P. troglodytes) data only, P-values were 0.012 for NAT1 versus NAT2, 0.222 for NAT1 versus NATP, and 0.012 for NAT2 versus NATP for expected heterozygosity, and 0.036 for NAT1 versus NAT2, 0.018 for NAT1 versus NATP, and 0.018 for NAT2 versus NATP for nucleotide diversity.

For human populations, adjusted Wilcoxon rank-sum test P-values for differences in both expected heterozygosity and nucleotide diversity were all < 0.0001 in the comparisons of NAT1 versus NAT2 and NAT1 versus NATP, and were 0.58 and 0.048 for expected heterozygosity and nucleotide diversity, respectively, in the NAT2 versus NATP comparison.

Table 1. Segregating sites identified in the three NAT gene paralogs shared among different (sub-)species of hominids (a).

Species columns: Homo (Homo sapiens; ancient genomes: Ust'Ishim, Altai, Vindija, Denisova); Gorilla (c) (Gorilla gorilla, Gorilla beringei); Pongo (d) (Pongo abelii, Pongo pygmaeus); Pan (e) (P. t. verus, P. t. ellioti, P. t. troglodytes, P. t. schweinfurthii, Pan paniscus).

Gene   Position in human          Human cds   SNP rs         Alleles (amino acid change   Alleles indicated
       reference sequence (b)     position    identifier     if non-synonymous)           in boxes (a)

NAT1   Total sample size (f): 33 3 12 5 72 10 5 6 14
       18080001                   445         rs4987076      A/G (I149V)
       18080015                   459         rs4986990      G/A
       18080196                   640         rs4986783      G/T (A214S)

NAT2   Total sample size (f): 32 3 12 5 72 10 5 6 14
       18257704                   191         rs1801279      G/A (R64Q)                   T T
       18257795                   282         rs1041983      C/T
       18257858                   345         rs45532639     C/T                          A A A A A
       18258019                   506         rs200585149    A/G (Y169C)
       18258091                   578         rs79050330     C/T (T193M)

NATP   Total sample size (f): 32 3 13 5 75 10 5 6 14
       18228246                   -           rs73590295     T/C
       18228285                   -           -              T/A
       18228458                   -           rs372738250    G/A
       18228616                   -           rs35548819     T/C
       18228661                   -           rs546009408    G/A
       18228673                   -           rs115350875    T/C
       18228727                   -           rs530022558    G/A
       18228959                   -           -              T/C
       18229104                   -           rs74444655     T/C

(a) Boxes shaded in light grey indicate the presence of the polymorphism in the relevant species/sub-species and those shaded in dark grey indicate fixation (detection for ancient genomes) of the derived allele in the species (if different from the human derived allele, the allele is indicated in the box).
(b) The screened segments for the NAT1, NAT2 and NATP homologous sequences span from 18'079'545 to 18'080'447 (903 bp including the NAT1 coding exon), 18'257'489 to 18'258'603 (1,115 bp including the NAT2 coding exon) and 18'228'116 to 18'229'117 (1,002 bp including the NATP pseudogene) respectively, on chromosome 8 in the human reference sequence GRCh37/hg19. Non-synonymous mutations are shown in bold type.
(c) Based on the individuals of this study, the gorillas of Prado-Martinez et al. (2013) and the Gorilla gorilla gorilla draft assembly reference sequence (gorGor4, December 2014).
(d) Based on the individuals of this study, the orangutans of Prado-Martinez et al. (2013) and the Pongo pygmaeus abelii draft assembly reference sequence (ponAbe2, July 2007).
(e) Polymorphism recording is based on the chimpanzee and bonobo individuals of the present study and those of Prado-Martinez et al. (2013), cross-checked with the Pan troglodytes verus assembly reference sequence (panTro4, February 2011) and the Pan paniscus draft assembly reference sequence (panPan1, May 2012).
(f) Total number of genotypes, including genotypes of individuals deduced from their descendants (see Supplementary Figure S1A and Supplementary File S1).

Table 2. Segregating sites identified in the three NAT gene paralogs in the different (sub-)species of the genus Pan, with paralogous positions in humans shown in the three last columns (a).

Pan columns: P. troglodytes (common chimpanzee) (c): P. t. verus (Western), P. t. ellioti (Nigeria-Cameroon), P. t. troglodytes (Central), P. t. schweinfurthii (Eastern), Hybrid (d); P. paniscus (bonobo) (e).

Pan total sample size (g), in the column order above: NAT1 72, 10, 5, 6, 1, 14; NAT2 72, 10, 5, 6, 1, 14; NATP 75, 10, 5, 6, 1, 14.

Gene   Position in human          Alleles in Pan (amino acid       Human cds   Homo sapiens: fixed position, or variable
       reference sequence (b)     change if non-synonymous)        position    position (SNP rs identifier in Ensembl) (f)

NAT1   18079632                   G/A (D26N)                       76          G
       18079703                   C/T                              147         C
       18079859                   C/T                              303         C
       18079897                   T/C (I114T)                      341         T/A/C (rs145975713)
       18079925                   T/C                              369         T
       18080014                   C/T                              458         C/T (rs374226986)
       18080074                   A/C (E173A)                      518         A
       18080153                   T/G (I199M)                      597         T
       18080316                   G/C (E254Q)                      760         G
       18080345                   A/G (I263M)                      789         A

NAT2   18257549                   T/C                              36          T
       18257585                   A/C (L24F)                       72          A
       18257658                   G/A (E49K)                       145         G
       18257704                   G/A (R64Q)                       191         G/A (rs1801279)
       18258027                   A/G (N172D)                      514         A
       18258091                   C/T (T193M)                      578         C/T (rs79050330)
       18258302                   G/T                              789         T
       18258447                   G/A                              934 (h)     G
       18258462                   C/G                              949 (h)     C

NATP   18228146                   G/T                              -           G
       18228189                   C/T                              -           C
       18228238                   A/G                              -           A
       18228242                   A/G                              -           A
       18228285 (i)               T/A (N N)                        -           T/A
       18228304                   C/T                              -           C
       18228368                   T/C                              -           T
       18228404                   C/T                              -           C
       18228501                   C/T                              -           C
       18228543                   A/G                              -           A
       18228560                   C/A                              -           C
       18228582                   G/T                              -           G
       18228614                   G/T                              -           G
       18228659                   G/A                              -           G
       18228660                   C/T                              -           G
       18228661                   G/A                              -           G/A (rs546009408)
       18228748                   C/T                              -           T
       18228771                   C/T                              -           T
       18228959                   T/C                              -           T
       18229057                   G/A                              -           C
       18229103                   A/T                              -           A

(a) Boxes shaded in light grey indicate the presence of the polymorphism in the relevant Pan species/sub-species and those shaded in dark grey indicate a fixation of the derived allele in the bonobo species.
(b) The screened segments for the NAT1, NAT2 and NATP homologous sequences span from 18'079'545 to 18'080'447 (903 bp including the NAT1 coding exon), 18'257'489 to 18'258'603 (1,115 bp including the NAT2 coding exon) and 18'228'116 to 18'229'117 (1,002 bp including the NATP pseudogene) respectively, on chromosome 8 in the human reference sequence GRCh37/hg19. Non-synonymous mutations are shown in bold type.
(c) Polymorphism recording is based on the individuals of the present study and the chimpanzees of Prado-Martinez et al. (2013), cross-checked with the Pan troglodytes verus assembly reference sequence (panTro4, February 2011).
(d) Hybrid Western (P. t. verus) / Central (P. t. troglodytes) individual.
(e) Based on the individual of this study (Bonobo), the bonobos of Prado-Martinez et al. (2013) and the Pan paniscus draft assembly reference sequence (panPan1, May 2012).
(f) With the exception of human NAT2 polymorphisms rs1801279 (G/A at cds position 191) and rs79050330 (C/T at cds position 578), which are common variants in humans, all other human polymorphisms, including those with a SNP rs identifier, are rare variants, detected with a highest population MAF < 0.01 in Ensembl (http://www.ensembl.org/Homo_sapiens/Info/Index).
(g) Total number of genotypes, including genotypes of individuals deduced from their descendants (see Supplementary Figure S1A and Supplementary File S1).
(h) Non-coding positions downstream of the NAT2 coding exon (3'UTR region).
(i) Undefined position in some Nigeria-Cameroon and Eastern chimpanzee samples (indicated by N, see Supplementary Table S5). The human T/A SNP at position 18'228'285, at present only described in Mortensen et al. (2011) at very low frequency, is not reported in Ensembl; we note that Ensembl reports another rare T/A SNP, with associated SNP identifier rs546046491, in the next contiguous position (18'228'286), and both variants are embedded in a stretch of T nucleotides on the human reference sequence (from 18'228'278 to 18'228'286).

Table 3. Haplotypes of the three NAT gene paralogs in the genus Pan

NAT1

Position (a)       79'632  79'703  79'859  79'897  79'925  80'014  80'074  80'153  80'316  80'345
SNP (b)            G76A    C147T   C303T   T341C   T369C   C458T   A518C   T597G   G760C   A789G
Amino acid change  D26N    -       -       I114T   -       -       E173A   I199M   E254Q   I263M
NAT1*1             G       C       C       T       T       C       A       T       G       A
NAT1*2             .       T       .       .       .       .       .       .       .       .
NAT1*3             .       .       .       .       .       .       .       .       .       G
NAT1*4             .       T       .       .       .       .       .       G       .       .
NAT1*5             A       .       .       .       .       .       .       .       .       .
NAT1*6             .       .       .       .       C       .       .       .       .       .
NAT1*7             .       .       .       .       .       .       .       .       C       .
NAT1*8             .       .       .       C       .       .       .       .       .       .
NAT1*9             .       .       .       .       .       T       C       .       .       .
NAT1*10            A       T       .       .       .       .       .       .       .       .
NAT1*11            .       .       .       .       .       .       C       .       .       .
NAT1*12            .       .       T       .       .       .       .       .       .       .

NAT2

Position (c)       57'549  57'585  57'658  57'704  58'027  58'091  58'302  58'447  58'462
SNP (b)            T36C    A72C    G145A   G191A   A514G   C578T   T789G   A934G   C949G
Amino acid change  -       L24F    E49K    R64Q    N172D   T193M   -       -       -
NAT2*1             T       A       G       G       A       C       T       A       C
NAT2*2             .       .       .       .       .       T       .       .       .
NAT2*3             .       .       .       .       .       .       G       .       .
NAT2*4             .       .       .       .       .       .       .       G       .
NAT2*5             .       .       .       .       .       .       .       .       G
NAT2*6             .       .       .       .       G       .       .       .       .
NAT2*7             .       .       A       .       .       .       .       G       .
NAT2*8             .       .       A       A       .       .       .       G       .
NAT2*9             .       C       A       .       .       .       .       G       .
NAT2*10            C       .       .       .       G       .       .       .       .

NATP

Position (c)       28'146  28'189  28'238  28'242  28'279  28'285  28'304  28'368  28'404  28'501
SNP (b)            31      74      123     127     164     170     189     253     289     386
NATP*1             G       C       A       A       T       T       C       C       C       T
NATP*2             .       .       .       .       .       .       .       T       .       .
NATP*3             .       .       .       .       .       .       .       T       .       .
NATP*4             .       .       .       .       .       .       .       T       .       .
NATP*5             .       .       .       .       .       .       .       T       .       .
NATP*6             .       .       .       .       .       .       .       T       .       C
NATP*7             .       .       G       .       .       .       .       T       .       .
NATP*8             .       .       .       .       .       .       .       T       .       C
NATP*9             .       .       .       .       .       .       .       T       .       C
NATP*10            .       .       .       .       .       .       .       T       .       C
NATP*11            .       .       .       .       .       A       .       T       .       C
NATP*12            .       .       .       .       .       A       .       T       .       C
NATP*13            .       .       .       .       .       A       .       T       T       C
NATP*14            .       .       .       .       .       A       .       T       T       C
NATP*15            .       .       .       .       .       A       .       T       T       C
NATP*16            .       T       .       .       .       A       .       T       .       C
NATP*17            T       .       .       .       .       A       .       T       .       C
NATP*18            .       .       .       G       C       A       T       T       .       C
NATP*19            .       .       .       G       C       A       T       T       .       C

NATP (continued)

Position (c)       28'543  28'560  28'575  28'576  28'582  28'614  28'659  28'660  28'661  28'748
SNP (b)            428     445     460     461     467     499     544     545     546     633
NATP*1             A       C       A       G       G       T       G       C       G       T
NATP*2             .       .       .       .       .       .       .       .       .       .
NATP*3             .       .       .       .       .       .       .       .       A       .
NATP*4             .       .       .       .       T       .       .       .       .       .
NATP*5             .       A       .       .       .       .       .       .       .       .
NATP*6             .       .       .       .       .       .       .       .       .       .
NATP*7             .       .       .       .       .       .       .       .       .       .
NATP*8             .       .       .       .       .       G       .       .       .       .
NATP*9             .       .       .       .       .       G       .       .       .       .
NATP*10            .       .       .       .       .       G       A       .       .       .
NATP*11            .       .       .       .       .       .       .       .       .       .
NATP*12            .       .       .       .       .       .       .       .       .       .
NATP*13            .       .       .       .       .       G       .       .       .       C
NATP*14            G       .       .       .       .       .       .       .       .       .
NATP*15            G       .       .       .       .       G       .       .       .       .
NATP*16            .       .       .       .       .       G       .       .       .       .
NATP*17            .       .       .       .       .       .       .       .       .       .
NATP*18            .       .       G       A       .       G       .       T       .       .
NATP*19            .       .       G       A       .       G       A       T       .       .

NATP (continued)

Position (c)       28'771  28'959  29'057  29'103
SNP (b)            656     844     942     988
NATP*1             C       T       G       A
NATP*2             .       .       .       .
NATP*3             .       .       .       .
NATP*4             .       .       .       .
NATP*5             .       .       .       .
NATP*6             .       .       .       .
NATP*7             .       .       .       .
NATP*8             .       .       .       .
NATP*9             T       .       .       .
NATP*10            T       .       .       .
NATP*11            .       .       .       .
NATP*12            .       C       .       .
NATP*13            T       .       .       .
NATP*14            T       .       .       .
NATP*15            T       .       .       .
NATP*16            T       .       .       T
NATP*17            .       .       .       .
NATP*18            .       .       A       .
NATP*19            .       .       A       .

(a) Position (+18'000'000) on GRCh37/hg19.
(b) SNP position relative to the coding exon of NAT1, NAT2, or its paralog sequence on NATP (starts at position 1).
(c) Position (+18'200'000) on GRCh37/hg19.

Table 4. NAT haplotype frequencies (%) estimated in the different species and sub-species of the genus Pan and results of Hardy-Weinberg equilibrium tests.

Pan species and sub-species

                            P. t. verus (Western) P. t. verus (Western) P. t. ellioti         P. t. troglodytes     P. t. schweinfurthii  P. paniscus
                            San Diego sample (a)  BPRC sample           (Nigeria-Cameroon)    (Central)             (Eastern)             (Bonobo)

Pan NAT1 haplotypes
NAT1*1                      79.33 (1.71)          80.43                 65.00                 70.00                 66.70                 25.00
NAT1*2                      7.97 (0.94)           4.35                  5.00                  0                     0                     0
NAT1*3                      1.59 (1.38)           0                     0                     10.00                 8.33                  0
NAT1*4                      8.33 (0)              13.04                 0                     0                     0                     0
NAT1*5                      2.78 (0)              0                     0                     0                     0                     0
NAT1*6                      0                     2.17                  0                     0                     0                     53.60
NAT1*7                      0                     0                     10.00                 0                     0                     0
NAT1*8                      0                     0                     0                     10.00                 0                     0
NAT1*9                      0                     0                     0                     0                     8.33                  0
NAT1*10                     0                     0                     20.00                 10.00                 8.33                  0
NAT1*11                     0                     0                     0                     0                     8.33                  0
NAT1*12                     0                     0                     0                     0                     0                     21.40
Total (2n chromosomes)      36                    46                    20                    10                    12                    28
Hardy-Weinberg test (b):
  Ho                        0.36 (0.03)           0.26                  0.50                  0.60                  0.50                  0.71
  He                        0.37 (0.03)           0.34                  0.55                  0.53                  0.58                  0.63
  P-value                   [0.39 ; 0.64]         0.20                  0.35                  > 0.99                0.51                  0.45

Pan NAT2 haplotypes
NAT2*1                      92.49 (1.27)          91.30                 5.00                  10.00                 0                     0
NAT2*2                      2.41 (0.94)           4.35                  0                     0                     0                     0
NAT2*3                      0                     2.17                  0                     0                     0                     0
NAT2*4                      5.10 (1.03)           0                     10.00                 80.00                 91.70                 0
NAT2*5                      0                     2.17                  0                     0                     0                     0
NAT2*6                      0                     0                     85.00                 10.00                 0                     0
NAT2*7                      0                     0                     0                     0                     0                     89.3
NAT2*8                      0                     0                     0                     0                     0                     3.57
NAT2*9                      0                     0                     0                     0                     0                     7.14
NAT2*10                     0                     0                     0                     0                     8.33                  0
Total (2n chromosomes)      36                    46                    20                    10                    12                    28
Hardy-Weinberg test (b):
  Ho                        0.15 (0.03)           0.17                  0.30                  0.20                  0.17                  0.21
  He                        0.15 (0.02)           0.17                  0.28                  0.38                  0.17                  0.20
  P-value                   [0.08 ; > 0.99]       > 0.99                > 0.99                0.11                  > 0.99                > 0.99

Pan NATP haplotypes
NATP*1                      24.61 (2.46)          44.00                 0                     10.00                 0                     0
NATP*2                      52.14 (1.7)           42.00                 40.00                 10.00                 58.30                 0
NATP*3                      0                     0                     0                     20.00                 0                     0
NATP*4                      0                     6.00                  0                     0                     0                     0
NATP*5                      0                     2.00                  0                     0                     0                     0
NATP*6                      2.32 (1.03)           0                     0                     0                     0                     0
NATP*7                      19.33 (2.21)          6.00                  0                     0                     0                     0
NATP*8                      1.59 (1.38)           0                     25.00                 20.00                 8.33                  0
NATP*9                      0                     0                     0                     30.00                 0                     0
NATP*10                     0                     0                     0                     0                     8.33                  0
NATP*11                     0                     0                     10.00                 0                     16.67                 0
NATP*12                     0                     0                     5.00                  0                     0                     0
NATP*13 (c)                 0                     0                     0                     0                     0                     0
NATP*14                     0                     0                     5.00                  0                     0                     0
NATP*15                     0                     0                     15.00                 0                     0                     0
NATP*16                     0                     0                     0                     10.00                 0                     0
NATP*17                     0                     0                     0                     0                     8.33                  0
NATP*18                     0                     0                     0                     0                     0                     96.43
NATP*19                     0                     0                     0                     0                     0                     3.57
Total (2n chromosomes)      36                    50                    20                    10                    12                    28
Hardy-Weinberg test (b):
  Ho                        0.64 (0.04)           0.60                  0.90                  0.60                  0.67                  0.07
  He                        0.65 (0.01)           0.64                  0.78                  0.89                  0.67                  0.07
  P-value                   [0.09 ; > 0.99]       0.06                  0.03 (d)              0.15                  0.76                  > 0.99

(a) Average over the 122 sub-samples (see text), standard deviation in brackets.
(b) Test for departure from Hardy-Weinberg equilibrium; Ho: observed heterozygosity, He: expected heterozygosity (equivalent to gene diversity). The only significant deviation from equilibrium (heterozygote excess at NATP in P. t. ellioti) is shown in bold.
(c) Haplotype NATP*13, which combines SNPs at 170, 253, 289, 386, 499, 633, and 656 (Table 3), was inferred only for the genotype of the hybrid P. t. verus/troglodytes individual.
(d) P-value > 0.05 after correction for multiple testing.
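The expected heterozygosity (He, or h) in Table 4 and the nucleotide diversity (π) reported for the same samples can be recomputed from haplotype counts and haplotype sequences. The Python sketch below does this for the BPRC NAT1 sample, assuming the usual sample-size-corrected estimators (Nei 1987). The haplotype counts are not listed in the table: they are derived here from the published percentages and 2n = 46, and the function names are illustrative.

```python
from itertools import combinations

# Haplotype counts implied by the BPRC NAT1 frequencies in Table 4
# (percentage x 2n = 46, rounded); derived here, not given in the table.
counts = {"NAT1*1": 37, "NAT1*2": 2, "NAT1*4": 6, "NAT1*6": 1}

# Substitutions carried by each haplotype relative to NAT1*1 (Table 3).
variants = {
    "NAT1*1": set(),
    "NAT1*2": {"C147T"},
    "NAT1*4": {"C147T", "T597G"},
    "NAT1*6": {"T369C"},
}


def expected_heterozygosity(counts):
    """Gene diversity: n / (n - 1) * (1 - sum of squared frequencies)."""
    n = sum(counts.values())
    homozygosity = sum((c / n) ** 2 for c in counts.values())
    return n / (n - 1) * (1.0 - homozygosity)


def nucleotide_diversity(counts, variants, n_sites):
    """Mean number of pairwise differences per site over all chromosome pairs."""
    n = sum(counts.values())
    total_diff = 0
    for a, b in combinations(counts, 2):
        # The symmetric difference counts the sites at which a and b differ.
        total_diff += counts[a] * counts[b] * len(variants[a] ^ variants[b])
    n_pairs = n * (n - 1) / 2
    return total_diff / n_pairs / n_sites


h = expected_heterozygosity(counts)
pi = nucleotide_diversity(counts, variants, n_sites=903)  # 903 usable positions
print(f"h = {h:.2f}, pi x 10^-3 = {pi * 1e3:.2f}")
```

Under these assumptions the sketch gives h ≈ 0.34 and π ≈ 0.63 × 10⁻³, which agree with the BPRC NAT1 values reported in Tables 4 and 6.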

Table 5. Predictions of the effect of mutations between Pan NAT1 and NAT2 coding sequences according to PolyPhen, SIFT and PANTHER cSNP Scoring.

                                 PolyPhen                              SIFT                              PANTHER cSNP Scoring
Haplotype   cDNA    Protein      Score (a)           Prediction (b)    Score (c)         Prediction (d)  PSEP (e)   Prediction (f)

Pan NAT1 (reference NAT1*1) (g)
NAT1*3      A789G   I263M        0.279 (0.91-0.88)   B                 0.08 (3.08, 80)   T               220        POD
NAT1*4      T597G   I199M        0.369 (0.9-0.89)    B                 0.01 (3.07, 81)   A               220        POD
NAT1*5      G76A    D26N         0.377 (0.9-0.89)    B                 0.1 (3.08, 76)    T               91         B
NAT1*7      G760C   E254Q        0.892 (0.82-0.94)   POD               0.07 (3.08, 80)   T               455        PRD
NAT1*8      T341C   I114T        0.099 (0.93-0.85)   B                 0.06 (3.07, 81)   T               220        POD
NAT1*11     A518C   E173A        0.013 (0.96-0.78)   B                 0.17 (3.07, 81)   T               30         B

Pan NAT2 (reference NAT2*4) (h)
NAT2*2      C578T   T193M        1 (0.00-1.00)       PRD               0 (3.07, 51)      A               456        PRD
NAT2*6      A514G   N172D        0.001 (0.99-0.15)   B                 0.26 (3.07, 51)   T               220        POD
NAT2*7      G145A   E49K         0.002 (0.99-0.3)    B                 0.5 (3.07, 50)    T               324        POD
NAT2*8 (i)  G191A   R64Q         1.00 (0.00-1)       PRD               0 (3.07, 50)      A               4200       PRD
NAT2*9 (i)  A72C    L24F         1 (0.00-1)          PRD               0 (3.07, 50)      A               4200       PRD

(a) PolyPhen score: probability that a substitution is damaging; sensitivity and specificity in brackets.
(b) PolyPhen prediction: "benign" (B), "possibly damaging" (POD), "probably damaging" (PRD).
(c) SIFT score: probability that a substitution is tolerated; median sequence information and number of sequences used for the prediction in brackets.
(d) SIFT prediction: "tolerated" (T), "affect protein function" (A).
(e) PANTHER cSNP Scoring PSEP (position-specific evolutionary preservation): length of time (in millions of years) of preservation of a position.
(f) PANTHER cSNP Scoring prediction: "probably damaging" (PRD), "possibly damaging" (POD), "probably benign" (B).
(g) The reference Pan NAT1 haplotype used is the basal haplotype in the network of NAT1 sequences (Supplementary Figure S1).
(h) The reference Pan NAT2 haplotype used is the basal haplotype in the network of NAT2 sequences (Supplementary Figure S2). Since NAT2*1 differs from NAT2*4 at a single position located 61 bp downstream of the coding exon relative to the stop codon (A934G, Table 3), the two haplotypes likely translate into a similar gene product, so that haplotypes deriving from NAT2*1 could be predicted using NAT2*4 as a reference. In contrast, haplotypes NAT2*8 and NAT2*9 both derive from NAT2*7, which differs from the basal haplotype at SNP G145A (E49K, Table 3). Thus, for the non-synonymous mutations defining NAT2*8 and NAT2*9, predictions were performed using NAT2*7 as a reference.
(i) Haplotypes NAT2*8 and NAT2*9 both bear the G145A mutation defining haplotype NAT2*7. Since the prediction tools do not allow the simultaneous specification of two substitutions, we ran the prediction tools for G191A and A72C against NAT2*7 as a reference, instead of NAT2*4.

Table 6. Genetic diversity and results of selective neutrality tests for the three NAT gene paralogs in the different species and sub-species of the genus Pan, with equivalent estimates in human populations shown in the last column.

Pan species and sub-species (first six columns) and Homo sapiens (last column)

Statistic | P. t. verus (Western chimpanzee), San Diego samplea | P. t. verus (Western chimpanzee), BPRC sample | P. t. ellioti (Nigeria-Cameroon) | P. t. troglodytes (Central) | P. t. schweinfurthii (Eastern) | P. paniscus (Bonobo) | Homo sapiens (human populations average)b

Pan NAT1
Total (2N chromosomes) | 36 | 46 | 20 | 10 | 12 | 28 | 119.7
Number of usable positions | 903 | 903 | 898 | 898 | 898 | 898 | 903
Number of segregating sites (S) | 3.57 (0.5) | 3 | 3 | 4 | 5 | 2 | 3.75
Number of haplotypes (k) | 4.57 (0.5) | 4 | 4 | 4 | 5 | 3 | 3.5
Expected heterozygosity (h) | 0.37 (0.03) | 0.34 | 0.55 | 0.53 | 0.58 | 0.63 | 0.095
Nucleotide diversity (π) x 10−3 | 0.58 (0.03) | 0.63 | 1.02 | 0.89 | 1.08 | 0.96 | 0.187
Ewens-Watterson testc:
Fo | 0.64 (0.026) | 0.67 | 0.48 | 0.52 | 0.47 | 0.40 | 0.902
Fe | 0.45 (0.041) | 0.52 | 0.44 | 0.37 | 0.30 | 0.59 | 0.618
P-value | [0.84 ; 0.95] | 0.82 (0.18) | 0.70 (0.10) | > 0.99 (>0.99) | > 0.99 (>0.99) | 0.09 (0.25) | 9 (8) / 19
Tajima’s D testd:
D | -0.91 (0.17) | -0.35 | 0.24 | -1.67 | -1.53 | 1.43 | -1.475
P-value | [0.112 ; 0.282] | 0.398 (0.642) | 0.642 (0.099) | 0.031 (0.061) | 0.056 (0.056) | 0.917 (0.752) | 11 (1) / 18
Fu’s FS teste:
FS | -1.64 (0.47) | -0.59 | -0.20 | -1.35 | -1.98 | 1.19 | -2.849
P-value | [0.038 ; 0.197] | 0.321 (0.642) | 0.399 (0.901) | 0.043 (0.061) | 0.024 (0.048) | 0.742 (0.742) | 8 (3) / 18

Pan NAT2
Total (2N chromosomes) | 36 | 46 | 20 | 10 | 12 | 28 | 137.8
Number of usable positions | 1,115 | 1,115 | 1,091 | 1,091 | 1,091 | 1,112 | 1,115
Number of segregating sites (S) | 1.87 (0.34) | 3 | 2 | 2 | 3 | 2 | 9.78
Number of haplotypes (k) | 2.87 (0.34) | 4 | 3 | 3 | 2 | 3 | 10.7
Expected heterozygosity (h) | 0.15 (0.02) | 0.17 | 0.28 | 0.38 | 0.17 | 0.20 | 0.761
Nucleotide diversity (π) x 10−3 | 0.13 (0.02) | 0.15 | 0.41 | 0.50 | 0.45 | 0.19 | 2.041
Ewens-Watterson testc:
Fo | 0.86 (0.02) | 0.84 | 0.74 | 0.66 | 0.85 | 0.80 | 0.249
Fe | 0.63 (0.05) | 0.52 | 0.56 | 0.49 | 0.70 | 0.59 | 0.291
P-value | [0.75 ; > 0.99]f | 0.98 (0.96) | 0.90 (0.69) | > 0.99 (>0.99) | > 0.99 (>0.99) | 0.92 (0.92) | 0 / 18
Tajima’s D testd:
D | -1.26 (0.19) | -1.58 | -0.44 | -0.69 | -1.63 | -1.24 | 0.639
P-value | [0.034 ; 0.158] | 0.020 (0.040) | 0.337 (0.521) | 0.237 (0.246) | 0.019 (0.039) | 0.076 (0.076) | 2 (0) / 18
Fu’s FS teste:
FS | -1.84 (0.55) | -3.43 | -0.377 | -0.59 | 1.054 | -1.59 | -0.765
P-value | [0.004 ; 0.129]g | 0.001 (0.003) | 0.260 (0.521) | 0.123 (0.246) | 0.595 (0.595) | 0.015 (0.015) | 2 (0) / 18

Pan NATP
Total (2N chromosomes) | 36 | 50 | 20 | 10 | 12 | 28 | 128.8
Number of usable positions | 1,000 | 1,000 | 936 | 937 | 936 | 975 | 1002.61
Number of segregating sites (S) | 3.51 (0.62) | 4 | 7 | 8 | 6 | 1 | 9.22
Number of haplotypes (k) | 4.41 (0.61) | 5 | 6 | 6 | 5 | 2 | 10.6
Expected heterozygosity (h) | 0.65 (0.01) | 0.64 | 0.78 | 0.89 | 0.67 | 0.07 | 0.755
Nucleotide diversity (π) x 10−3 | 0.81 (0.05) | 0.77 | 2.60 | 2.75 | 1.74 | 0.07 | 1.804
Ewens-Watterson testc:
Fo | 0.37 (0.01) | 0.38 | 0.26 | 0.20 | 0.39 | 0.93 | 0.255
Fe | 0.46 (0.06) | 0.44 | 0.29 | 0.22 | 0.30 | 0.75 | 0.269
P-value | [0.07 ; 0.51] | 0.40 (0.93) | 0.43 (0.86) | 0.48 (0.95) | 0.95 (0.83) | > 0.99 (>0.99) | 1 (0) / 18
Tajima’s D testd:
D | -0.06 (0.36) | -0.30 | 1.04 | -0.11 | -0.47 | -1.15 | 0.018
P-value | [0.383 ; 0.849] | 0.424 (0.935) | 0.865 (0.594) | 0.483 (0.947) | 0.338 (0.637) | 0.138 (0.138) | 0 / 18
Fu’s FS teste:
FS | -0.37 (0.55) | -0.81 | 0.48 | -1.02 | -0.55 | -1.15 | -1.771
P-value | [0.245 ; 0.706] | 0.312 (0.935) | 0.623 (0.863) | 0.222 (0.667) | 0.318 (0.637) | 0.005 (0.010) | 3 (1) / 18

a Average over the 122 samples, standard deviation in brackets.
b Average values for 18 to 20 human populations from four continents; single population values are reported in Supplementary Tables S7 and S12.
c Ewens-Watterson test for departure from selective neutrality and demographic equilibrium; Fo: observed homozygosity, Fe: expected homozygosity; the P-value is given as the proportion of random Fe values generated under the neutral equilibrium model that are smaller than, or equal to the observed Fo value. Significant deviations (P-value < 0.025 or > 0.975) are shown in bold, and after correction for multiple testing in brackets; for humans, we report the number of population samples associated with a significant deviation (before slash, and significant after correction for multiple testing in bold and brackets) on the total number of population samples tested (after slash, see Supplementary Table S12).
d Tajima’s D test for departure from selective neutrality and demographic equilibrium; the P-value is given as the proportion of random D values generated under the neutral equilibrium model that are smaller than, or equal to the observed D value. Significant deviations (P-value < 0.025 or > 0.975) are shown in bold, and after correction for multiple testing in brackets; for humans, we report the number of population samples associated with a significant deviation (before slash, and significant after correction for multiple testing in bold and brackets) on the total number of population samples tested (after slash, see Supplementary Table S12).
e Fu’s FS test for departure from selective neutrality and demographic equilibrium; the P-value is given as the proportion of random FS values generated under the neutral equilibrium model that are smaller than, or equal to the observed FS value. Significant deviations (P-value < 0.02) are shown in bold, and after correction for multiple testing in brackets; for humans, we report the number of population samples associated with a significant deviation (before slash, and significant after correction for multiple testing in bold and brackets) on the total number of population samples tested (after slash, see Supplementary Table S12).
f Twenty tests out of 122 (16%) indicated significant homozygosity excess (P-value > 0.975), thus exceeding by 13 tests the expected proportion of 5% (6.1 out of 122) false positives.
g One hundred and six tests out of 122 (87%) indicated significant deviation from neutral expectation (P-value < 0.02), thus exceeding by 103 tests the expected proportion of 2% (2.44 out of 122) false positives.
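
As an illustration of how the descriptive statistics of Table 6 (S, k, Fo, h and π) relate to a set of phased haplotypes, the following minimal Python sketch computes them with standard textbook estimators. It is not necessarily the software or the exact estimators used in the study; the function names and the toy haplotypes are hypothetical.

```python
from collections import Counter
from itertools import combinations


def diversity_summary(haplotypes):
    """Return S, k, Fo, h and pi for a list of equal-length haplotype strings."""
    n = len(haplotypes)
    length = len(haplotypes[0])
    counts = Counter(haplotypes)

    # S: alignment columns showing more than one nucleotide
    s = sum(1 for i in range(length) if len({hap[i] for hap in haplotypes}) > 1)
    # k: number of distinct haplotypes
    k = len(counts)
    # Fo: observed homozygosity, the sum of squared haplotype frequencies
    f_obs = sum((c / n) ** 2 for c in counts.values())
    # h: expected heterozygosity with the usual n / (n - 1) sample-size correction
    h = n / (n - 1) * (1.0 - f_obs)
    # pi: mean number of pairwise differences, expressed per site
    diffs = sum(
        sum(a != b for a, b in zip(x, y)) for x, y in combinations(haplotypes, 2)
    )
    pi = diffs / (n * (n - 1) / 2) / length
    return {"S": s, "k": k, "Fo": f_obs, "h": h, "pi": pi}


def heterozygosity_from_fo(f_obs, n):
    """Expected heterozygosity implied by an observed homozygosity Fo and 2N = n."""
    return n / (n - 1) * (1.0 - f_obs)


# Hypothetical toy haplotypes (assumption for illustration, not real data)
toy = ["AAAA", "AAAA", "AAAT", "AACT"]
stats = diversity_summary(toy)
```

With the rounded values reported for the BPRC sample at NAT1 (Fo = 0.67, 2N = 46), `heterozygosity_from_fo` gives about 0.34, in line with the tabulated h.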

Vangenot et al. (2019)

Humans and chimpanzees display opposite patterns of diversity in arylamine N-acetyltransferase genes

Supplementary File S1, containing :

Supplementary information on Materials and Methods

1. Description of the great apes DNA samples sequenced in this study

2. DNA amplification, PCR product purification and sequencing of great ape samples

3. Retrieval of unphased NAT polymorphic positions from the Great Ape Genome Project (GAGP)

4. Inference of Pan NAT haplotypes

5. Retrieval of phased human NAT haplotypes from the 1000 Genomes Project

6. Constitution of Western chimpanzee (P. troglodytes verus) samples of unrelated individuals

Supplementary information on Results

7. NAT genotypes obtained by Sanger sequencing

8. NAT polymorphisms in Pan

9. Potential hybrid individual in the CARTA collection

10. In silico prediction tools of the functional impact of mutations in coding sequences

11. Evaluation of prediction adequacy of functional consequence for human haplotypes at the NAT1 and NAT2 loci

12. Prediction of enzymatic activity of Pan NAT1 and NAT2 basal haplotypes

13. Comparison of nucleotide diversity in gorillas and orangutans at the three NAT loci with Pan and humans

14. References


Supplementary information on Materials and Methods

1. Description of the great apes DNA samples sequenced in this study

Among the individuals of the Biomedical Primate Research Centre (BPRC) colony (Supplementary Figure S1A), 24 were the founders of the colony (in yellow, orange and blue); although their relatedness is unknown, mitochondrial genetic data and family study suggest that they are unrelated (de Groot et al. 2005). The 20 other individuals (in green, grey and beige in Supplementary Figure S1A) are first-level descendants related through five genealogies.

The Center for Academic Research and Training in Anthropogeny (CARTA) Western chimpanzees consist of 10 unrelated individuals and 16 individuals related through two separate genealogies (in purple in Supplementary Figure S1B). The two Basel zoo Western chimpanzees (Supplementary Figure S1C) are unrelated. Among these 68 Western chimpanzees, 21 are wild, 22 were born in captivity and the status of the remaining 25 is unknown. Other Pan samples from the CARTA collection include the bonobo male, and the Central (individual C327) and Eastern (individual Harriet) chimpanzee females, the latter also being part of the Great Ape Genome Project (see below). Seven of the orangutan and four of the gorilla samples are from the Basel zoo, and the other samples are from the CARTA collection.

2. DNA amplification, PCR product purification and sequencing of great ape samples

DNA amplification, PCR product purification and sequencing of samples from the CARTA research center were performed by Retrogen (San Diego, California), using their 3730 sequencers with ABI BD3.1 sequencing chemistry. Amplifications were done on 50 ng of genomic DNA, with 10 pmol of each primer, 1 unit HoTaq DNA Polymerase (MCLAB), the 10X PCR buffer supplied with the polymerase, and dNTP mix, for a total final volume of 20 μl. For NAT1 and NATP, 1 μl of additional 50 mM MgSO4 was added. For NAT2, 2 μl of PCRx Enhancer Solution (Invitrogen) was added. For NAT1 and NAT2, samples were denatured at 94°C for 10 min, followed by 40 cycles of 94°C for 30 s, 55°C for 1 min, and 72°C for 2 min. For NATP, samples were denatured at 95°C for 10 min, followed by 38 cycles of 94°C for 1 min, 57°C for 1 min, and 72°C for 2 min, followed by a final elongation phase at 72°C for 10 min.

DNA amplification, PCR product purification and sequencing of samples from the BPRC research center and from the Basel zoo were performed by Macrogen (Seoul, South Korea), using their Standard Seq platform with the same protocol as used by Retrogen, except for gorillas and orangutans for which specific primers were developed (Supplementary Table S1) from the reference genomes of gorilla (GorGor3.1, May 2011) and orangutan (ponAbe2, July 2007). Because of unsatisfactory results for some chimpanzee samples, PCR conditions were slightly modified in a second round: samples were denatured at 94°C for 5 min, followed by 40 cycles of 94°C for 30 s, 55°C for 1 min, and 72°C for 2 min, followed by a final elongation phase for 10 min. For two DNA samples (Oscar and Gerda) new primers were designed for NAT2 (Supplementary Table S1) and PCR conditions were adapted as follows: samples were denatured at 95°C for 5 min, followed by 35 cycles of 95°C for 1 min, 45-68°C for 1 min, and 72°C for 1 min, followed by a final elongation phase for 10 min.

Note that for the sequencing of NATP in the orangutan samples, both primer pairs Locus3Forward-C/13405 and Locus3Forward-D/Locus3Reverse-D were used but provided less satisfactory results, due to little overlap between the forward and reverse fragments, which could be due to the quality of the reference sequence (ponAbe2, July 2007) from which they were defined.

3. Retrieval of unphased NAT polymorphic positions from the Great Ape Genome Project (GAGP)

We retrieved unphased NAT genotypes of 79 great apes from the Great Ape Genome Project (GAGP, Prado-Martinez et al. 2013, data downloaded in March 2014, at https://eichlerlab.gs.washington.edu/greatape/data/VCFs/SNPs/). The sequencing coverage of genomes in GAGP varies between 7.40 and 49.81 (average = 27.24, sd = 11.17). We extracted, from the available unphased VCF files, the sections corresponding to the coding exon of each of the two functional NAT genes, and the corresponding homologous section of the NAT pseudogene, namely (numbering of positions according to the human chromosome 8 reference sequence hg19/GRCh37):
- for NAT1, the homologous stretch spanning from positions 18'079'545 to 18'080'447 (coding exon spans 18'079'557 to 18'080'426),
- for NAT2, the homologous stretch spanning from positions 18'257'489 to 18'258'603 (coding exon spans 18'257'514 to 18'258'383),
- for NATP, the homologous stretch spanning from positions 18'228'116 to 18'229'117 (the stretch of homology with the coding exons of NAT1 and NAT2 spans 18'228'116 to 18'228'986).
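
The extraction step described above can be sketched as a simple position filter. The snippet below is a minimal illustration, assuming that the VCF records carry positions in the same hg19/GRCh37 coordinate system as the intervals quoted in the text and that chromosome 8 is labelled either `8` or `chr8`; the names `REGIONS`, `assign_locus` and `filter_vcf_lines` are hypothetical and do not come from the study.

```python
# Chromosome 8 intervals quoted in the text (1-based, inclusive, hg19/GRCh37)
REGIONS = {
    "NAT1": (18079545, 18080447),
    "NAT2": (18257489, 18258603),
    "NATP": (18228116, 18229117),
}


def assign_locus(chrom, pos, regions=REGIONS):
    """Return the NAT locus containing this position, or None."""
    if chrom not in ("8", "chr8"):  # chromosome naming is an assumption
        return None
    for name, (start, end) in regions.items():
        if start <= pos <= end:
            return name
    return None


def filter_vcf_lines(lines):
    """Yield (locus, line) for VCF data lines falling in one of the regions."""
    for line in lines:
        if line.startswith("#"):  # skip meta-information and header lines
            continue
        fields = line.rstrip("\n").split("\t")
        locus = assign_locus(fields[0], int(fields[1]))  # CHROM, POS columns
        if locus is not None:
            yield locus, line


# Length of each interval, in base pairs
region_lengths = {name: end - start + 1 for name, (start, end) in REGIONS.items()}
```

The interval lengths obtained this way (903 bp for NAT1, 1,115 bp for NAT2 and 1,002 bp for NATP) can be compared with the numbers of usable positions reported for the three loci.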

The GAGP data retrieved represent 25 chimpanzees (4 P. t. troglodytes, 6 P. t. schweinfurthii, 10 P. t. ellioti and 4 P. t. verus, and one individual that is a P. t. verus/troglodytes hybrid), 13 bonobos, 31 gorillas (27 Gorilla gorilla gorilla, 3 Gorilla beringei graueri, 1 Gorilla gorilla diehli) and 10 orangutans (5 Pongo abelii and 5 Pongo pygmaeus).

All detected polymorphic positions in the Sanger sequenced samples of this study and retrieved from the GAGP VCF files are detailed in Supplementary Tables S3, S4 and S5.

For the data retrieved from GAGP, we checked for missing positions in the bed files containing the regions that did not pass the quality filters applied to the SNP data (bed files downloaded in March 2014 at https://eichlerlab.gs.washington.edu/greatape/data/VCFs/SNPs/Callable_regions/):
- for P. troglodytes, 0.55%, 2.2% and 6.5% of positions were uncallable for NAT1, NAT2 and NATP, respectively,
- for P. paniscus, 0.55%, 0.28% and 2.8% of positions were uncallable for NAT1, NAT2 and NATP, respectively,
- for gorillas, 1.33%, 2.94% and 5.5% of positions were uncallable for NAT1, NAT2 and NATP, respectively,
- and for orangutans, 0.33%, 0.46% and 6.8% of positions were uncallable for NAT1, NAT2 and NATP, respectively.
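
A percentage of uncallable positions of this kind can be obtained by intersecting BED intervals with a locus. The sketch below assumes that the BED intervals list uncallable stretches, that they follow the usual BED convention (0-based start, end exclusive), and that the locus is given in 1-based inclusive coordinates; the function name and the example interval are hypothetical.

```python
def uncallable_fraction(bed_intervals, region_start, region_end):
    """Fraction of a 1-based inclusive region covered by BED intervals.

    bed_intervals: iterable of (start, end) pairs, 0-based and end-exclusive.
    Overlapping or nested intervals are counted only once.
    """
    lo, hi = region_start - 1, region_end  # region as 0-based, end-exclusive
    covered = 0
    last_end = lo
    for start, end in sorted(bed_intervals):
        start = max(start, last_end)  # clip to the region and to what is already counted
        end = min(end, hi)
        if end > start:
            covered += end - start
            last_end = end
    return covered / (hi - lo)


# Hypothetical example: five uncallable positions at the start of the NAT1 stretch
example = uncallable_fraction([(18079544, 18079549)], 18079545, 18080447)
```

In this made-up example, five uncallable positions out of 903 amount to roughly 0.55% of the stretch.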

Therefore, any potential variant overlapping with these positions would not have been present in the VCF SNP files. However, according to our Sanger sequenced chimpanzee and bonobo samples, none of the uncallable positions in the Pan genomes from GAGP were located in known polymorphic positions for this genus. We thus tentatively considered those positions as identical to the panTro4 and panPan1 reference sequences, respectively, unless specified otherwise in Supplementary Tables S3, S4 and S5. We were unable to do so for gorillas and orangutans, as variants were overlapping with some unknown positions. For this reason, as well as due to the small number of gorillas (five individuals) and orangutans (eight individuals) that were Sanger sequenced in this study (see Materials and Methods in main text), haplotype inference at each of the three NAT genes was performed only for individuals from the Pan genus.

4. Inference of Pan NAT haplotypes

Diploid haplotypes were inferred for all Pan individuals using PHASE version 2.1.1 (Stephens et al. 2001; Stephens and Scheet 2005). For each of the three NAT genes, three PHASE runs were performed with individuals grouped differently: once considering all Pan individuals together; once separating Pan troglodytes and Pan paniscus individuals into two groups; and once additionally separating Western chimpanzees (Pan troglodytes verus) from the other Pan troglodytes sub-species, since the former sample outnumbers the latter, thus considering three groups: Pan troglodytes verus, other Pan troglodytes sub-species and Pan paniscus. For NAT1 and NAT2, the same haplotypes were inferred in the three runs. For NATP, different sets of haplotypes were returned for each run. When considering all Pan together, 19 haplotypes were given, five of which were different from the 19 returned when separating Pan troglodytes and Pan paniscus (number of pairwise differences = 5.29 and 5.08, respectively). When the program was run with three groups, 23 haplotypes were given (number of pairwise differences = 4.55), six and four being different from those returned with the other two groupings (all Pan together and separating Pan troglodytes and Pan paniscus, respectively). The difference between the grouping considering all Pan together and that separating Pan troglodytes and Pan paniscus was explained by some undefined positions in Pan troglodytes from the GAGP at otherwise fixed positions in all other Pan (on different bases in Pan troglodytes and Pan paniscus). Consequently, we chose the set of haplotypes inferred when Pan troglodytes and Pan paniscus individuals were considered separately.

5. Retrieval of phased human NAT haplotypes from the 1000 Genomes Project

The human NAT sequences dataset assembled in this study was completed with NAT1, NAT2 and NATP phased genotypes retrieved from the 1000 Genomes Phase 1 dataset (Abecasis et al. 2012). We verified that no known related individuals were present in the dataset.

Similarly to the treatment of data from the Great Ape Genome Project, we extracted the relevant part of the available VCF files from the 1000 Genomes Project (positions 18'079'557-18'080'426, 18'257'514-18'258'383, and 18'228'116-18'228'986 in the chromosome 8 human reference sequence GRCh37/hg19 for, respectively, NAT1, NAT2 and NATP). We checked for missing positions in the pilot mask (ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase1/analysis_results/supporting/accessible_genome_masks/) and no uncallable positions were found.

6. Constitution of two Western chimpanzee (P. troglodytes verus) samples of unrelated individuals

We separated the Western (P. t. verus) chimpanzees according to their provenance to form two samples of unrelated individuals.

The first one, called BPRC, includes the 20 founders of the BPRC colony for which DNA was available (in yellow in Supplementary Figure S1A), as well as four additional founders whose genotypes could be unambiguously deduced from those of their children: Izaak’s genotypes, deduced for all three NAT genes (in blue in Supplementary Figure S1A), and Gerrit’s genotype, deduced for NATP only (in orange, in Supplementary Figure S1A). We also included two unrelated first level descendants in the BPRC sample of genotypes (Oscar, in beige in Supplementary Figure S1A, successfully genotyped for the three NAT genes, and Annaclara, in gray for NAT1 and NAT2 only).

The second sample, called San Diego, comprises individuals from the CARTA collection and the two chimpanzees from the Basel zoo. The CARTA P. t. verus sample includes a large genealogy of 14 inter-related individuals (Supplementary Figure S1B). Estimation of frequency distributions from the San Diego sample was done through a re-sampling procedure leading to 122 sub-samples of 18 unrelated individuals, so as to account for the multiple possible ways of choosing unrelated individuals among those of the CARTA collection. Indeed, the maximum number of unrelated individuals that can be chosen from this large genealogy is 6, and 10 distinct sub-samples of 6 unrelated individuals can thus be formed. However, individuals C316 (Tamblo) and C318 (Reuuh) are not included in any of these 10 sub-samples. In turn, 61 distinct sub-samples of 5 unrelated individuals can be chosen from the large genealogy, and all 14 inter-related individuals from the large genealogy are included in at least one sub-sample. We have thus chosen to use these 61 possible sub-samples of 5 unrelated individuals to generate a collection of non-independent P. t. verus sub-samples. We further added to each sub-sample one individual chosen among the mother-son pair C320-C319 (Supplementary Figure S1B), thus doubling the number of possible sub-samples. Finally, we completed each sub-sample with the remaining 10 unrelated individuals from the CARTA collection and with the two unrelated individuals from the Basel zoo. The San Diego cohort is thus made up of 122 non-independent sub-samples of 18 individuals (36 chromosomes).
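
The logic of this re-sampling procedure can be sketched as follows. The pedigree itself is not reproduced here, so the snippet uses a hypothetical toy genealogy; the function names, individual labels and related pairs are assumptions for illustration only. Note that `related_pairs` must list every pair of related individuals, not only parent-offspring pairs.

```python
from itertools import combinations, product


def unrelated_subsets(individuals, related_pairs, size):
    """All subsets of the given size that contain no pair of related individuals."""
    related = {frozenset(pair) for pair in related_pairs}
    return [
        subset
        for subset in combinations(individuals, size)
        if not any(frozenset(pair) in related for pair in combinations(subset, 2))
    ]


def build_subsamples(core_sets, alternates, fixed):
    """Combine each core set with one alternate individual and the fixed individuals."""
    return [
        tuple(core) + (alt,) + tuple(fixed)
        for core, alt in product(core_sets, alternates)
    ]


# Hypothetical toy genealogy: A-B, B-C and C-D are related pairs
toy_cores = unrelated_subsets(["A", "B", "C", "D"], [("A", "B"), ("B", "C"), ("C", "D")], 2)
# One of a (hypothetical) mother-son pair, plus two always-included unrelated individuals
toy_subsamples = build_subsamples(toy_cores, ["mother", "son"], ["U1", "U2"])

# Counts stated in the text: 61 core sets x 2 choices, and 5 + 1 + 10 + 2 individuals
n_subsamples = 61 * 2
subsample_size = 5 + 1 + 10 + 2
n_chromosomes = 2 * subsample_size
```

With the counts given in the text, this yields 122 sub-samples of 18 individuals, i.e. 36 chromosomes each.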

When pooling together the BPRC sample with each of the 122 San Diego sub-samples, Hardy-Weinberg equilibrium was rejected for the NATP pseudogene in 45 of those pooled samples (37%). Moreover, although the two P. t. verus samples were not found to differentiate significantly at NAT1 and NAT2 (see Results), they did so at NATP in 75% of the sub-samples (Supplementary Table S6). We thus decided to analyse the two samples separately.

Supplementary information on Results

7. NAT genotypes obtained by Sanger sequencing

We obtained 247 NAT genotypes out of the 84 DNA samples of great apes available for Sanger sequencing in this study. For NAT1, DNA samples from 83 individuals were successfully sequenced over a segment of 903 bp (i.e. all but one orangutan), for NAT2, 81 DNA samples over 1,115 bp (all but one Eastern chimpanzee, one gorilla and one orangutan), and for NATP, 83 DNA samples over 1,000 bp (all but one gorilla). Because of restricted DNA availability, we were unable to further adapt the conditions so as to repeat unsuccessful amplification or sequencing.

8. NAT polymorphisms in Pan

Both polymorphic positions and fixed segregating sites between hominids, including all the Pan species and sub-species, at the three NAT genes are reported in Supplementary Tables S3, S4 and S5. Segregating sites in the Pan genus are also shown in Table 2.

Ten polymorphic positions in total were observed for NAT1 in Pan, all located in the exon open reading frame, six of them being non-synonymous (Table 2, Supplementary Table S3 and main text). In the P. t. verus subspecies, only five SNPs were found, three of which are non-synonymous: G76A (D26N), C147T, T369C, T597G (I199M) and A789G (I263M). The polymorphism at position 597 was only observed in P. t. verus whereas the two SNPs G76A and C147T were shared with all other Pan troglodytes sub-species; SNP A789G was also found in Central (P. t. troglodytes) and Eastern (P. t. schweinfurthii) chimpanzees, and SNP A369G was also shared with bonobos (P. paniscus). Besides the five SNPs observed in Western (P. t. verus) chimpanzees, four of which were also shared with other Pan species or sub-species, five additional SNPs were only observed among other Pan individuals: C303T only detected in bonobos, T341C only in Central (P. t. troglodytes), C458T and A518C only in Eastern (P. t. schweinfurthii), and G760C only in Nigeria-Cameroun (P. t. ellioti) chimpanzees. Three of these five additional polymorphisms involve a modification of the protein: T341C (I114T), A518C (E173A), and G760C (E254Q).

For NAT2, nine segregating sites were observed (seven in the coding exon and two in the 3’UTR), one of which, G145A (E49K), was apparently fixed for the derived amino acid (K) in P. paniscus (bonobos) (Table 2 and main text). Two out of the four polymorphisms observed among the 68 P. t. verus chimpanzees occur within the NAT2 open reading frame and one involves a protein modification: C578T (T193M). Three out of the four polymorphisms detected among the 68 Western (P. t. verus) chimpanzees were not observed in other Pan individuals (i.e. polymorphisms at positions 578, 789 and 949), whereas SNP 934, outside the coding exon, was found among all Pan troglodytes but not in bonobos (P. paniscus). In fact, G934A is the only NAT2 polymorphism found to be shared among all chimpanzee (P. troglodytes) sub-species. Four additional SNPs, all situated within the coding exon, were observed in the other Pan individuals: the non-synonymous SNP A514G (N172D) was observed among Eastern (P. t. schweinfurthii), Nigeria-Cameroun (P. t. ellioti) and Central (P. t. troglodytes) chimpanzees, the synonymous SNP T36C was only detected in Eastern chimpanzees, and the two non-synonymous SNPs A72C (L24F) and G191A (R64Q) only in bonobos.

Among the 24 NATP segregating sites observed in Pan, 17 were observed to be polymorphic within at least one chimpanzee sub-species, but only a few were shared among species and sub-species (Table 2 and main text). However, three of these 24 segregating sites had undefined nucleotides in some genotypes retrieved from the GAGP (Supplementary Table S5). Thus, while four positions (18’228’242, 18’228’304, 18’228’660, and 18’229’057) were apparently fixed differences between P. troglodytes (chimpanzees) and P. paniscus (bonobos), divergence between the two species at three other positions (18’228’279, 18’228’575, and 18’228’576) needs to be confirmed. The 17 other segregating sites were observed to be polymorphic within at least one chimpanzee sub-species, but only a few were shared among species and sub-species. Two SNPs (T/C at position 18’228’501 and T/G at 18’228’614) were shared by all chimpanzee (P. troglodytes) sub-species but were not observed in bonobos (P. paniscus), and two SNPs (T/A at 18’228’285 and C/T at 18’228’771) were shared by all chimpanzee (P. troglodytes) sub-species but Western (P. t. verus) chimpanzees. Four other polymorphisms were found in Western chimpanzees (A/G at 18’228’238, C/T at 18’228’368, C/A at 18’228’560, and G/T at 18’228’582), three of which were not observed in the other Pan, whereas the SNP at 18’228’368 was also observed in Central (P. t. troglodytes) chimpanzees. One position was found polymorphic both in Eastern (P. t. schweinfurthii) chimpanzees and bonobos (G/A at 18’228’659); it was the only polymorphic position observed in bonobos. Other polymorphisms that were observed only in one chimpanzee sub-species include: SNPs C/T at 18’228’404, A/G at 18’228’543, and T/C at 18’228’959 in Nigeria-Cameroun (P. t. ellioti), SNPs C/T at 18’228’189, G/A at 18’228’661, and A/T at 18’229’103 in Central (P. t. troglodytes), and SNP G/T at position 18’228’146 in Eastern (P. t. schweinfurthii) chimpanzees. One of the polymorphic positions in chimpanzees was found apparently fixed on the derived allele in bonobos (SNP 18’228’285). Finally, the C/T SNPs at positions 18’228’404 (polymorphic in Nigeria-Cameroun chimpanzees) and 18’228’771 (polymorphic in Eastern, Central and Nigeria-Cameroun chimpanzees) were also observed in the single verus/troglodytes hybrid individual, whereas an additional SNP, C/T at position 18’228’748, was only detected in this individual.

9. Potential hybrid individual in the CARTA collection

Besides the individual identified as a Western/Central (P. t. verus/troglodytes) chimpanzee hybrid in the GAGP collection, we identified a potentially hybrid individual also in the CARTA collection. Indeed, female C327, classified as P. t. troglodytes in the CARTA collection, is the only carrier of NAT2*1 and NAT2*6 among its sub-species. Considering the relatively recent discovery of the Nigeria-Cameroun (P. t. ellioti) sub-species (Gonder et al. 1997; Gonder et al. 2006), and the hybridisation zone between this sub-species and Central (P. t. troglodytes) chimpanzees near the , female C327 could be a Nigeria-Cameroun chimpanzee, a Central/Nigeria-Cameroun hybrid, or a descendant of such a hybrid. At present, in the absence of other information, we kept this individual classified as a Central chimpanzee (P. t. troglodytes).

10. In silico prediction tools of the functional impact of mutations in coding sequences

Phenotypic predictions of the functional impact of specific mutations in NAT1 and NAT2 haplotypes were performed with three online software tools (analysis done in May 2017): PolyPhen (Adzhubei et al. 2010), SIFT (Sim et al. 2012) and the PANTHER cSNP Scoring tool (Tang and Thomas 2016). To evaluate the confidence in the results returned by the three prediction software tools, we first applied them to human haplotypes of known enzymatic activity. For that, we selected human haplotypes with a single substitution compared to the reference haplotypes NAT1*4 and NAT2*4 (GenBank accessions X17059 and X14672) following the standards of the official nomenclature of human NAT alleles (McDonagh et al. 2014).

We ran PolyPhen with the default query options. PolyPhen returns three results, namely the score, which is the probability that a substitution is damaging, as well as values of sensitivity and specificity, and an associated prediction that can be “benign”, “possibly damaging” or “probably damaging”. For SIFT, we used the default parameters and searched the Uniprot-SwissProt + TrEMBL 2010_09 database. SIFT returns a score, which is the probability that a substitution is tolerated, and which translates into a prediction (either “tolerated”, or “affects protein function” if SIFT score < 0.05). It also returns the number of sequences used for the prediction at the substituted position (not counting sequences with gaps at that position) and the median sequence information used to measure the diversity of the sequences used for prediction (a median sequence information between 2.75 and 3.5 is recommended). When using the PANTHER cSNP Scoring tool, we selected Homo sapiens as reference organism. The PANTHER cSNP Scoring tool returns the position-specific evolutionary preservation (PSEP) index and an associated prediction. The PSEP measures the length of time a position in the current protein has been preserved (in millions of years) by tracing it back to its reconstructed direct ancestors. The associated prediction can be "probably damaging" (PSEP > 450my, corresponding to a false positive rate of ~0.2 as tested on HumVar), "possibly damaging" (450my > PSEP > 200my, corresponding to a false positive rate of ~0.4) or "probably benign" (PSEP < 200my).
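
The SIFT and PANTHER cut-offs quoted above can be written as two small decision rules. The sketch below only restates those thresholds and does not call the tools themselves; the function names are hypothetical, and the handling of values falling exactly on a threshold (PSEP = 450 or 200) is an assumption, since the text does not specify it.

```python
def sift_prediction(score):
    """'A' (affects protein function) if the SIFT score is below 0.05, else 'T' (tolerated)."""
    return "A" if score < 0.05 else "T"


def panther_prediction(psep_my):
    """Map a PSEP value (millions of years) to the PANTHER cSNP Scoring categories."""
    if psep_my > 450:
        return "PRD"  # probably damaging
    if psep_my > 200:  # assumption: exactly 450 is treated as possibly damaging
        return "POD"  # possibly damaging
    return "B"  # probably benign (assumption: exactly 200 is treated as benign)


# Scores and PSEP values as listed for Pan haplotypes in Table 5
examples = {
    "NAT1*4": (sift_prediction(0.01), panther_prediction(220)),
    "NAT1*5": (sift_prediction(0.1), panther_prediction(91)),
    "NAT1*7": (sift_prediction(0.07), panther_prediction(455)),
    "NAT2*2": (sift_prediction(0), panther_prediction(456)),
}
```

Applied to the values listed in Table 5, these rules return the same SIFT and PANTHER categories as those tabulated for the four haplotypes shown.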

11. Evaluation of prediction adequacy of functional consequence for human haplotypes at the NAT1 and NAT2 loci

The phenotypic predictions returned by the three online software tools (PolyPhen, SIFT, and PANTHER cSNP Scoring) are provided in Supplementary Table S15.

For NAT1, the three tools appear to predict adequately the effect on enzymatic expression/activity of those substitutions whose effect is well established. Indeed, haplotypes NAT1*17 and NAT1*22, two well-defined and agreed decreased-activity haplotypes whose defining substitutions (C190T and A752T, respectively) reduce both protein level and N- and O-acetylation activity below detection (Hein et al. 2006), are predicted as damaging by the three tools (“probably damaging” for PolyPhen and PANTHER cSNP Scoring and “affect protein function” for SIFT). By contrast, another substitution, G560A, characterising the decreased-activity haplotype NAT1*14B, is predicted as “possibly damaging” by PolyPhen, “probably damaging” by PANTHER cSNP Scoring (but with a low PSEP) and as “tolerated” by SIFT. However, Zhu and Hein (2008) showed that G560A reduces protein levels more moderately compared to C190T and A752T (4-fold compared to 50-fold and 40-fold for NAT1*17 and NAT1*22, respectively), which is thus consistent with the prediction results. Three NAT1 haplotypes, NAT1*21, NAT1*24 and NAT1*25, although differing from the reference haplotype NAT1*4 by a non-synonymous substitution, are described as “equivalent to NAT1*4” in terms of phenotype assignment by the official nomenclature of human NAT alleles. Among these three haplotypes, only substitution A613G, defining NAT1*21, is well predicted by the three tools as “benign” or “tolerated”. Substitution A787G of NAT1*25 is predicted as “benign” and “tolerated” by PolyPhen and SIFT, respectively, but as “possibly damaging” by PANTHER cSNP Scoring. However, for the latter, the PSEP value of 220 is close to the limit (PSEP = 200) between “possibly damaging” and “probably benign”. Finally, substitution G781A of NAT1*24 is only predicted as “benign” by PolyPhen. These results are nevertheless congruent with the contradictory results returned by functional studies, such as those of Lin et al. (1998), cited in Zhu and Hein (2008).

The adequacy of predictions is also straightforward for human NAT2 when considering those substitutions whose effect on enzymatic function is well established. Indeed, among the seven haplotypes considered as associated with a slow phenotype, only the substitutions of four of them, C190T (defining NAT2*19), G191A (NAT2*14A), A434C (NAT2*17) and G590A (NAT2*6B), are well predicted as damaging by the three tools, and an additional one, T341C (NAT2*5D), by two of them (SIFT and PANTHER cSNP Scoring). However, while all five substitutions reduce N- and O-acetyltransferase activity (between 80% and 90% lower than NAT2*4 for C190T, G191A, A434C and T341C, and between 36 and 70% for G590A, reviewed in Hein et al. 2006), and the protein level of expression as well (between 0 and 35% of NAT2*4 for C190T, G590A, A434C and T341C, and between 36 and 70% for G191A, Hein et al. 2006), C190T, G191A and G590A also have an effect on the thermostability of the protein. This could thus explain why T341C is predicted as damaging by only two out of the three tools, and A434C with less confidence by PANTHER cSNP Scoring (PSEP=456 compared to PSEP=4200 for C190T and G191A). Two other substitutions, G857A (defining NAT2*7A) and G499A (NAT2*10), were not predicted by any tool as damaging. However, the phenotypes associated with these two substitutions are reported as being potentially substrate-dependent in the official nomenclature of human NAT alleles, and this could explain the prediction results. Indeed, it was shown that G857A also reduces N-acetyltransferase and O-acetyltransferase activity compared to NAT2*4 but to a lesser extent and only for some substrates. Moreover, it does not reduce the protein expression level, but rather affects the stability of the protein as much as C190T, G191A and G590A do (reviewed in Hein et al. 2006). The tools thus appear to detect a substitution well only when it has an effect on all three characteristics, i.e. acetylation, protein expression level and thermostability.

We also tested the prediction tools on two NAT2 non-synonymous substitutions associated with a rapid acetylator phenotype (i.e. equivalent to NAT2*4). The A803G substitution (defining haplotype NAT2*12A) is well predicted as benign by the three tools. On the contrary, A845C (NAT2*18) is predicted as damaging by the three tools ("possibly damaging" by PANTHER cSNP Scoring). However, contradictory results exist for this latter substitution; while haplotype NAT2*18 (A845C) is reported as rapid in the official nomenclature, it has also been suggested to be slow in publications (Hein et al. 2006).

Although for NAT2 only two or three acetylation phenotypes have been defined (rapid, slow and intermediate, corresponding to the bi- or tri-modal distribution of responses to medication (Meyer 2004)), acetylation capacity is a continuous quantitative variable, thus allowing a wider number of different phenotypic responses to be recognized (Ruiz et al. 2012; Selinski et al. 2013; Selinski et al. 2015). A similar observation applies to NAT1, for which only two phenotypes are defined (rapid and slow, as well as absence of activity associated with substitution-induced truncated proteins). Moreover, for NAT1 the correspondence between the phenotype and the genotype is less clear, and probably more complex, than for NAT2. Therefore, the results of the prediction tools, as described above (and reported in Supplementary Table S15), reflect this heterogeneity of phenotypes for both genes.

Substitutions in NAT1 and NAT2 that strongly impact acetylation activity are well detected as damaging by the three tools, and observed discordances between the prediction and the described phenotypes all correspond to documented variations of known phenotypes. We thus concluded that the three tools used in conjunction adequately predict the effect of single substitutions on acetylation activity.
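The way the three tools are read in conjunction can be illustrated with a short Python sketch that tallies, for a few of the NAT1 substitutions discussed above, how many tools return a damaging-type call. The labels are those quoted in the text; the PANTHER cSNP Scoring label for A613G is assumed to be “benign” (the text only states that all three tools return “benign” or “tolerated”), and the data layout, the set of labels counted as damaging and the function name are illustrative assumptions rather than part of the original analysis.

```python
# Calls quoted in the text for selected human NAT1 substitutions.
# Tuple order: (PolyPhen, SIFT, PANTHER cSNP Scoring).
# ASSUMPTION: the PANTHER label for A613G is taken to be "benign".
PREDICTIONS = {
    "C190T": ("probably damaging", "affect protein function", "probably damaging"),
    "A752T": ("probably damaging", "affect protein function", "probably damaging"),
    "G560A": ("possibly damaging", "tolerated", "probably damaging"),
    "A613G": ("benign", "tolerated", "benign"),
    "A787G": ("benign", "tolerated", "possibly damaging"),
}

# Labels treated as a damaging-type call (an assumption of this sketch).
DAMAGING_LABELS = {
    "probably damaging",
    "possibly damaging",
    "affect protein function",
}


def damaging_votes(substitution):
    """Return how many of the three tools give a damaging-type call."""
    return sum(label in DAMAGING_LABELS for label in PREDICTIONS[substitution])


votes = {sub: damaging_votes(sub) for sub in PREDICTIONS}
for sub, n in votes.items():
    print(f"{sub}: {n}/3 tools predict a damaging effect")
```

A simple vote count of this kind discards the confidence information carried by each tool (for instance the PSEP value), which is why the borderline cases are discussed individually in the text.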

12. Prediction of enzymatic activity of Pan NAT1 and NAT2 basal haplotypes

A recent study (Tsirka et al. 2014) has demonstrated that the function of the NAT2 enzyme in the human and rhesus macaque species diverges in substrate selectivity, due to a G691A (V231I) substitution. This substitution mediates a shift in substrate affinity from substrates more NAT2-specific to substrates more NAT1-specific, but the study also demonstrated that neither the stability nor the overall activity of the 231V or 231I proteins were significantly altered. To the best of our knowledge, however, functional studies of chimpanzee NAT enzymes compared to humans have not yet been published. We thus used the three prediction tools to investigate whether the Pan basal NAT1 (NAT1*1) and NAT2 (NAT2*4) haplotypes could be considered functionally equivalent to the human reference haplotypes, i.e. human NAT1*4 and NAT2*4. We first assumed that the common ancestral haplotype to humans and chimpanzees had an enzymatic activity equivalent to that of the human references NAT1*4 and NAT2*4, i.e. that it conferred the rapid acetylation phenotype. Under this assumption, we used the predictions returned by the three online tools to infer the probable acetylation profile of each of the two Pan basal NAT haplotypes.

As deduced from the NAT1 haplotypes network (Supplementary Figure S2), a single non-synonymous substitution (C529A, H177N) occurred on the lineage leading to the Pan NAT1 haplotypes. The ancestral haplotype is embedded in a reticulation that includes another non-synonymous substitution (A583C, Q195K). Both C529A and A583C substitutions have been predicted as not affecting the protein function (Supplementary Table S16), and thus the Pan basal NAT1*1 haplotype should also lead to an enzymatic activity equivalent to that of the ancestral protein. Note that only the results of PolyPhen and SIFT could be considered here, as PANTHER cSNP Scoring failed to test these substitutions.

For NAT2, the network of haplotypes (Supplementary Figure S3) indicates that the ancestral human-chimpanzee haplotype is differentiated from the Pan basal NAT2*4 haplotype by five non-synonymous substitutions, i.e. T293C (V98A), T664C (F222L), C345A (D115E), G443C (C148S), and A595G (I199V), the first two being shared with the lineage of Gorilla NAT2 haplotypes. Since the sequence of mutational events leading from the ancestral haplotype to the Pan basal NAT2*4 haplotype is unknown, we could not test the functional impact of these substitutions in a sequential fashion, and opted to test them independently. Despite this limitation, we note that none of the five substitutions is predicted as affecting the protein function (Supplementary Table S16). We thus tentatively assume that the Pan basal NAT2*4 haplotype should also confer an enzymatic activity equivalent to that of the ancestral protein.
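The amino acid positions quoted alongside the nucleotide substitutions follow from the coding-sequence numbering, since each codon spans three nucleotides. A minimal Python sketch of this conversion is given here as a consistency check; the function names and the parsing of labels such as “T293C” are illustrative assumptions, and the check concerns positions only, not amino acid identities.

```python
import re


def codon_number(cds_position):
    """1-based codon (amino acid) index for a 1-based CDS nucleotide position."""
    return (cds_position - 1) // 3 + 1


def positions_agree(nucleotide_label, protein_label):
    """True if e.g. 'T293C' falls within the codon named by e.g. 'V98A'."""
    nt_pos = int(re.search(r"\d+", nucleotide_label).group())
    aa_pos = int(re.search(r"\d+", protein_label).group())
    return codon_number(nt_pos) == aa_pos


# Substitutions quoted in the text: (nucleotide change, amino acid change).
PAIRS = [
    ("C529A", "H177N"), ("A583C", "Q195K"),                      # Pan NAT1
    ("T293C", "V98A"), ("T664C", "F222L"), ("C345A", "D115E"),   # Pan NAT2
    ("G443C", "C148S"), ("A595G", "I199V"),                      # Pan NAT2
]

results = {nt: positions_agree(nt, aa) for nt, aa in PAIRS}
print(results)
```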

Then, we investigated whether assuming that the activity of the theoretical common human-chimpanzee ancestral NAT enzymes is equivalent to the rapid acetylation profile imparted by human reference haplotypes NAT1*4 and NAT2*4 is a plausible hypothesis. To this end, we used the three online tools to predict the impact of substitutions between the latter and the ancestral proteins.

For NAT1, as reported in Supplementary Table S16, the prediction tools did not return clear-cut results on the functional impact of the two non-synonymous substitutions differentiating the ancestral hominid haplotype (ancestral_2) and human NAT1*4. Indeed, the results of PolyPhen contradict those of SIFT and PANTHER cSNP Scoring for substitution A138T (E46D), whereas the output of PANTHER cSNP Scoring contradicts those of the other two tools for substitution G826C (E276Q). This raises the possibility that both mutations could have a moderately damaging effect on the protein activity, thus suggesting that the “normal” (i.e. reference) NAT1 activity measured in humans could actually be decreased compared to that of the common ancestral human-chimpanzee NAT1 enzyme, and by extension to that conferred by the Pan basal NAT1 haplotype. We are aware of the highly speculative nature of this reasoning, but note nevertheless that the assumption of an equivalent enzymatic activity associated with the human reference haplotype (NAT1*4) and the chimpanzee basal haplotype (NAT1*1) is, in the light of the prediction results, a conservative hypothesis.

The results for NAT2 (Supplementary Table S16) show that two out of the four non-synonymous substitutions differentiating the theoretical ancestral hominid haplotype and the human reference haplotype (NAT2*4) are predicted as damaging by the three tools, i.e. C451T (R151C) and G834T (K278N). Thus, as for NAT1, we reason that assuming equivalent enzymatic activity for the human reference (NAT2*4) and Pan basal (NAT2*4) haplotypes is again a conservative hypothesis. Note that 451C is a rare variant in humans (rs747336755, from the Exome Aggregation Consortium), and the position was also found polymorphic in P. abelii.

13. Comparison of nucleotide diversity in gorillas and orangutans at the three NAT genes with Pan and humans

Supplementary Table S13 and Figure S5 report nucleotide diversity (π × 10⁻³) estimated in the two gorilla (western and eastern gorillas, Gorilla gorilla and Gorilla beringei, respectively) and the two orangutan (Sumatran and Bornean orangutans, Pongo abelii and Pongo pygmaeus, respectively) species. At the NAT1 gene, a similar level of diversity to Pan was found in both species of gorillas (0.72 on average), thus also higher than in humans, whereas the highest value among all great apes was observed in orangutans. Actually, NAT1 diversity markedly differs between the two orangutan species: π (× 10⁻³) in P. abelii (3.94) is almost three times higher than in P. pygmaeus (1.36), in keeping with results from genomic studies (Prado-Martinez et al. 2013; Nater et al. 2015; Kuhlwilm et al. 2016; Nater et al. 2017). In turn, at NAT2, both species of orangutans have a comparable level of nucleotide diversity (1.52 on average) that is more similar to humans than to chimpanzees, whereas it differs markedly between the two gorilla species (1.72 and 0.31, in G. gorilla and G. beringei, respectively). Thus, while eastern gorillas (G. beringei) display, like Pan, a lower diversity for NAT2 compared to NAT1, the reverse pattern is observed for western gorillas (G. gorilla). Contrasting results were also found at the NATP locus: while no nucleotide diversity was detected in G. beringei (genotypes of all three eastern gorillas were identical), that of western gorillas (G. gorilla, 2.77) and Bornean orangutans (P. pygmaeus, 2.29) is similar to the values found for Central (Pan troglodytes troglodytes) and Nigeria-Cameroon (Pan troglodytes ellioti) chimpanzees, while Sumatran orangutans (P. abelii, 3.76) display two to three times more diversity than chimpanzees and humans. We consider these contrasting results, both between the two gorilla species and between the two orangutan species, as preliminary and needing confirmation through the analysis of larger sample sizes, and possibly also through a better representation of species and sub-species levels. Future investigations will also benefit from including extensive inter-gene sequencing within the NAT region on chromosome 8. Such a perspective is becoming more topical every day with new whole genome hominid sequences being produced at an unprecedented high pace (Nater et al. 2015; Xue et al. 2015; de Manuel et al. 2016; Nater et al. 2017).
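For readers who wish to reproduce this kind of comparison, nucleotide diversity in its standard definition can be sketched as the average number of pairwise differences per site among aligned sequences. The short Python example below uses made-up toy sequences, not data from this study, and the function name is an illustrative assumption; it ignores gaps and missing data, which a real analysis would need to handle.

```python
from itertools import combinations


def nucleotide_diversity(sequences):
    """Average proportion of differing sites over all pairs of aligned sequences."""
    if len(sequences) < 2:
        raise ValueError("at least two sequences are required")
    length = len(sequences[0])
    if any(len(s) != length for s in sequences):
        raise ValueError("sequences must be aligned to the same length")
    pairs = list(combinations(sequences, 2))
    # Count mismatching sites for every pair of sequences.
    total_diffs = sum(
        sum(a != b for a, b in zip(s1, s2)) for s1, s2 in pairs
    )
    return total_diffs / (len(pairs) * length)


# Toy alignment (hypothetical, for illustration only).
toy = ["ACGTACGT", "ACGTACGA", "ACGTTCGA", "ACGTACGT"]
pi = nucleotide_diversity(toy)
print(f"pi = {pi:.4f}")
```

With the values reported above, 3.94 / 1.36 is approximately 2.9, which corresponds to the "almost three times" difference quoted for NAT1 between the two orangutan species.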

14. References

Abecasis GR, Auton A, Brooks LD, DePristo MA, Durbin RM, Handsaker RE, Kang HM, Marth GT, and McVean GA. 2012. An integrated map of genetic variation from 1,092 human genomes. Nature. 491(7422):56-65. doi: 10.1038/nature11632
Adzhubei IA, Schmidt S, Peshkin L, Ramensky VE, Gerasimova A, Bork P, Kondrashov AS, and Sunyaev SR. 2010. A method and server for predicting damaging missense mutations. Nat Methods. 7(4):248-249. doi: 10.1038/nmeth0410-248
de Groot NG, Garcia CA, Verschoor EJ, Doxiadis GG, Marsh SG, Otting N, and Bontrop RE. 2005. Reduced MIC gene repertoire variation in West African chimpanzees as compared to humans. Molecular biology and evolution. 22(6):1375-1385. doi: 10.1093/molbev/msi127
de Manuel M, Kuhlwilm M, Frandsen P, Sousa VC, Desai T, Prado-Martinez J, Hernandez-Rodriguez J, Dupanloup I, Lao O, Hallast P et al. 2016. Chimpanzee genomic diversity reveals ancient admixture with bonobos. Science (New York, NY). 354(6311):477-481. doi: 10.1126/science.aag2602
Gonder MK, Disotell T, and Oates J. 2006. New Genetic Evidence on the Evolution of Chimpanzee Populations and Implications for . Int J Primatol. 27(4):1103-1127. doi: 10.1007/s10764-006-9063-y
Gonder MK, Oates JF, Disotell TR, Forstner MR, Morales JC, and Melnick DJ. 1997. A new west African chimpanzee subspecies? Nature. 388(6640):337. doi: 10.1038/41005
Hein DW, Fretland AJ, and Doll MA. 2006. Effects of single nucleotide polymorphisms in human N-acetyltransferase 2 on metabolic activation (O-acetylation) of heterocyclic amine carcinogens. Int J Cancer. 119(5):1208-1211. doi: 10.1002/ijc.21957
Kuhlwilm M, de Manuel M, Nater A, Greminger MP, Krützen M, and Marques-Bonet T. 2016. Evolution and demography of the great apes. Current Opinion in Genetics & Development. 41:124-129. doi: 10.1016/j.gde.2016.09.005
Lin HJ, Probst-Hensch NM, Hughes NC, Sakamoto GT, Louie AD, Kau IH, Lin BK, Lee DB, Lin J, Frankl HD et al. 1998. Variants of N-acetyltransferase NAT1 and a case-control study of colorectal adenomas. Pharmacogenetics. 8(3):269-281
McDonagh EM, Boukouvala S, Aklillu E, Hein DW, Altman RB, and Klein TE. 2014. PharmGKB summary: very important pharmacogene information for N-acetyltransferase 2. Pharmacogenetics and genomics. 24(8):409-425. doi: 10.1097/fpc.0000000000000062
Meyer UA. 2004. Pharmacogenetics - five decades of therapeutic lessons from genetic diversity. Nat Rev Genet. 5(9):669-676. doi: 10.1038/nrg1428
Nater A, Greminger MP, Arora N, van Schaik CP, Goossens B, Singleton I, Verschoor EJ, Warren KS, and Krutzen M. 2015. Reconstructing the demographic history of orang-utans using Approximate Bayesian Computation. Mol Ecol. 24(2):310-327. doi: 10.1111/mec.13027
Nater A, Mattle-Greminger MP, Nurcahyo A, Nowak MG, de Manuel M, Desai T, Groves C, Pybus M, Sonay TB, Roos C et al. 2017. Morphometric, Behavioral, and Genomic Evidence for a New Orangutan Species. Current biology : CB. 27(22):3487-3498.e3410. doi: 10.1016/j.cub.2017.09.047
Prado-Martinez J, Sudmant PH, Kidd JM, Li H, Kelley JL, Lorente-Galdos B, Veeramah KR, Woerner AE, O'Connor TD, Santpere G et al. 2013. Great ape genetic diversity and population history. Nature. 499(7459):471-475. doi: 10.1038/nature12228
Ruiz JD, Martinez C, Anderson K, Gross M, Lang NP, Garcia-Martin E, and Agundez JA. 2012. The differential effect of NAT2 variant alleles permits refinement in phenotype inference and identifies a very slow acetylation genotype. PloS one. 7(9):e44629. doi: 10.1371/journal.pone.0044629
Selinski S, Blaszkewicz M, Getzmann S, and Golka K. 2015. N-Acetyltransferase 2: ultra-slow acetylators enter the stage. Archives of toxicology. 89(12):2445-2447. doi: 10.1007/s00204-015-1650-2
Selinski S, Blaszkewicz M, Ickstadt K, Hengstler JG, and Golka K. 2013. Refinement of the prediction of N-acetyltransferase 2 (NAT2) phenotypes with respect to enzyme activity and urinary bladder cancer risk. Archives of toxicology. 87(12):2129-2139. doi: 10.1007/s00204-013-1157-7
Sim N-L, Kumar P, Hu J, Henikoff S, Schneider G, and Ng PC. 2012. SIFT web server: predicting effects of amino acid substitutions on proteins. Nucleic Acids Res. 40(W1):W452-W457. doi: 10.1093/nar/gks539
Stephens M, and Scheet P. 2005. Accounting for decay of linkage disequilibrium in haplotype inference and missing-data imputation. American journal of human genetics. 76(3):449-462. doi: 10.1086/428594
Stephens M, Smith NJ, and Donnelly P. 2001. A new statistical method for haplotype reconstruction from population data. American journal of human genetics. 68(4):978-989. doi: 10.1086/319501
Tang H, and Thomas PD. 2016. PANTHER-PSEP: predicting disease-causing genetic variants using position-specific evolutionary preservation. Bioinformatics. 32(14):2230-2232. doi: 10.1093/bioinformatics/btw222
Tsirka T, Boukouvala S, Agianian B, and Fakis G. 2014. Polymorphism p.Val231Ile alters substrate selectivity of drug-metabolizing arylamine N-acetyltransferase 2 (NAT2) isoenzyme of rhesus macaque and human. Gene. 536(1):65-73. doi: 10.1016/j.gene.2013.11.085
Xue Y, Prado-Martinez J, Sudmant PH, Narasimhan V, Ayub Q, Szpak M, Frandsen P, Chen Y, Yngvadottir B, Cooper DN et al. 2015. Mountain gorilla genomes reveal the impact of long-term population decline and inbreeding. Science (New York, NY). 348(6231):242-245. doi: 10.1126/science.aaa3952
Zhu Y, and Hein DW. 2008. Functional effects of single nucleotide polymorphisms in the coding region of human N-acetyltransferase 1. Pharmacogenomics J. 8(5):339-348. doi: 10.1038/sj.tpj.6500483


Supplementary Figure S1

Known relationships among Pan troglodytes verus individuals sequenced in this study. (A) Relationships among individuals of the BPRC (Biomedical Primate Research Centre) sample, constituted of 15 isolated individuals (all females), and 35 individuals related through five separate genealogies. All individuals were Sanger sequenced for the three NAT genes (NAT1, NAT2, and NATP), except those shown in white, orange and blue, due to DNA unavailability; genotypes of individuals that were unambiguously deduced from their descendants are shown in blue (inferred for the three NAT genes) or orange (inferred only for NATP), while those in white could not be inferred. Genotypes used in the analyses of diversity included those of the founders, shown in yellow (genotypes of the three NAT genes sequenced), in blue, or in orange, and those of two descendants, Oscar (in beige, genotypes of the three NAT genes sequenced), and Annaclara (in gray, only included for NAT1 and NAT2 as the corresponding genotype of Gerrit at NAT1 and NAT2 could not be inferred from his descendants). NAT genotypes of individuals in green (sequenced) were only used to infer their parents’ genotypes and were not included in the analyses of diversity. (B) Relationships among individuals of the CARTA (Center for Academic Research and Training in Anthropogeny) sample. The CARTA complete sample is constituted of 10 unrelated individuals (6 females, 2 males, and 2 individuals of unknown sex, represented by polygons), and 16 individuals related through two separate genealogies. All individuals in purple were Sanger sequenced for the three NAT genes. (C) The two unrelated individuals from the Basel zoo, that were also Sanger sequenced for the three NAT genes, were included in the analyses in the San Diego samples (Supplementary File S1).

Supplementary Figure S2

Median-joining network of NAT1 haplotypes in the Pan genus and divergence from other hominoid species (human, gorilla and orangutan), represented by their reference sequence (GRC37/hg19, gorGor4 and ponAbe2). Positions are numbered according to the human cds (Supplementary Table S3); non-synonymous mutations are shown in red bold type, synonymous mutations in blue. A black arrow points to the inferred recurrent substitution in the median joining network at position 18’080’085 (position 529 in cds), which is characterized by a C in humans and gorillas and an A in chimpanzees, bonobos and orangutans.

Supplementary Figure S3

Median-joining network of NAT2 haplotypes in the Pan genus and divergence from other hominoid species (human, gorilla and orangutan), represented by their reference sequence (GRC37/hg19, gorGor4 and ponAbe2). Positions are numbered according to the human cds (Supplementary Table S4); non-synonymous mutations are shown in red bold type, synonymous mutations in blue. Positions that mark substitution divergence between some species but are polymorphic in another species are indicated by a star. The red arrow points to a shared non-synonymous polymorphism (C/T) between humans and Western common chimpanzees at position 18’258’091 (578 in cds). Black arrows point to inferred multiple or recurrent substitutions in the median joining network.

Supplementary Figure S4

Median-joining network of NATP haplotypes in the Pan genus and divergence from other hominoid species (human, gorilla and orangutan), represented by their reference sequence (GRC37/hg19, gorGor4 and ponAbe2). Positions (+ 18’220’000) are numbered according to the human reference sequence (Supplementary Table S5). Due to undefined states, several sequence stretches were not considered for median-joining network construction (Supplementary Table S5). Positions that mark substitution divergence between some species but are polymorphic in another species are indicated by a star. Blue arrows point to shared polymorphisms between species: a shared T/A polymorphism between humans and chimpanzees (all sub-species but Western common chimpanzees) at position 18’228’285; a shared G/A polymorphism between bonobos and P.t. schweinfurthii at position 18’228’659; a shared G/A polymorphism between humans and P.t. troglodytes at position 18’228’661; a shared T/C polymorphism between P.t. ellioti and Gorilla gorilla at position 18’228’959. Black arrows point to inferred multiple, recurrent or coincidental substitutions in different species or sub-species at a position in the network.

Supplementary Figure S5

Nucleotide diversity (π × 10⁻³) at the three NAT genes in the two Gorilla and the two Pongo species. The dotted lines were added to the graphs to highlight inter-locus variation.

Supplementary Figure S6

Expected heterozygosity (h) and nucleotide diversity (π × 10⁻³) at the three NAT genes in modern human populations. Each dot represents a population (listed in Supplementary Table S7), color-coded according to continental origin.

Supplementary Table S1. PCR and sequencing primers used for the Sanger sequencing of the three NAT genes in great apes

Gene    Primer             Sequence 5’-3’                Direction   Note

Chimpanzees and bonobos
NAT1    12540              CCTACTCAAATCCAAGTGTAAAAGT     FORWARD     1
NAT1    12355              ACATGATAGGTCGTCAGTTGA         REVERSE     1
NAT2    12485              GTCACACGAGGAAATCAAGTGC        FORWARD     1
NAT2    12486              GTTTTCTAGCATGAATCATTCTGC      REVERSE     1
NATP    13404              GAATCTAAGGGCAAAAGTAAT         FORWARD     2
NATP    13405              TGAAAATGTGTGGTTATCATT         REVERSE     3

Additional PCR and Sequencing Primers for gene NAT2 of two BPRC chimpanzees (Oscar and Gerda)
NAT2    Locus2FB           ACAGGGTTCTGAGACTACTAAGAGA     FORWARD     4
NAT2    Locus2RB           GGAGACGTCTGCAGGTATGT          REVERSE     4
NAT2    Locus2FC           GGAAGGATCAGCCTCAGGT           FORWARD     4
NAT2    Locus2RC           GATTCACAATATATTAGCCAACAATG    REVERSE     4

Gorillas
NAT1    12540              CCTACTCAAATCCAAGTGTAAAAGT     FORWARD
NAT1    12355              ACATGATAGGTCGTCAGTTGA         REVERSE
NAT2    12485              GTCACACGAGGAAATCAAGTGC        FORWARD
NAT2    Locus2Reverse-B    GTTTTCTAGCATGAATCACTCTGC      REVERSE     5
NATP    Locus3Forward-B    GAATCTAAGGGCAAAGGTAAT         FORWARD     5
NATP    13405              TGAAAATGTGTGGTTATCATT         REVERSE     5

Orangutans
NAT1    12540              CCTACTCAAATCCAAGTGTAAAAGT     FORWARD
NAT1    Locus1Reverse-B    ACATGATAGGTCATTAGCTGA         REVERSE     6
NAT2    Locus2Forward-B    GCCACACAAGGATATCAAGTGC        FORWARD     6
NAT2    Locus2Reverse-B    GTTTTCTAGCATGAATCACTCTGC      REVERSE     6
NATP    Locus3Forward-C    GAATCTAAGGGCAGAAGTAAT         FORWARD     6
NATP    13405              TGAAAATGTGTGGTTATCATT         REVERSE
NATP    Locus3Forward-D    GCCTACTCAAGGGAATCTAAGGG       FORWARD     6
NATP    Locus3Reverse-D    GCACAAATCACTATGGTTCCCAA       REVERSE     6

1 Newly designed on panTro2 reference sequence.
2 Slightly modified from forward primer Np-L2 in Sabbagh et al. (2013).
3 Slightly modified from reverse primer Np-R2 in Sabbagh et al. (2013).
4 Newly designed on panTro4 (February 2011) reference sequence.
5 Newly designed on gorGor3.1 (May 2011) reference sequence.
6 Newly designed on ponAbe2 (July 2007) reference sequence.
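To make the relationship between the species-specific primers and the corresponding chimpanzee primers explicit, a small Python sketch can list the positions at which two primers of equal length differ. The sequences are copied from the table above; the helper name and the choice of primer pairs are illustrative assumptions, and no claim is made about how the primers were actually designed.

```python
def mismatches(primer_a, primer_b):
    """Return the 1-based positions at which two equal-length primers differ."""
    if len(primer_a) != len(primer_b):
        raise ValueError("primers must have the same length")
    return [
        i for i, (a, b) in enumerate(zip(primer_a, primer_b), start=1) if a != b
    ]


# Primer sequences (5'-3') as listed in Supplementary Table S1.
PRIMERS = {
    "12355": "ACATGATAGGTCGTCAGTTGA",
    "Locus1Reverse-B": "ACATGATAGGTCATTAGCTGA",
    "12486": "GTTTTCTAGCATGAATCATTCTGC",
    "Locus2Reverse-B": "GTTTTCTAGCATGAATCACTCTGC",
    "13404": "GAATCTAAGGGCAAAAGTAAT",
    "Locus3Forward-B": "GAATCTAAGGGCAAAGGTAAT",
}

# Pairs compared here: chimpanzee primer vs. gorilla/orangutan counterpart.
PAIRS = [
    ("12355", "Locus1Reverse-B"),
    ("12486", "Locus2Reverse-B"),
    ("13404", "Locus3Forward-B"),
]

diffs = {(a, b): mismatches(PRIMERS[a], PRIMERS[b]) for a, b in PAIRS}
for (a, b), positions in diffs.items():
    print(f"{a} vs {b}: {len(positions)} mismatch(es) at {positions}")
```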

References in Supplementary Table S1

Mortensen HM, Froment A, Lema G, Bodo JM, Ibrahim M, Nyambo TB, Omar SA, and Tishkoff SA. 2011. Characterization of genetic variation and natural selection at the arylamine N-acetyltransferase genes in global human populations. Pharmacogenomics. 12(11):1545-1558. doi: 10.2217/pgs.11.88
Patin E, Barreiro LB, Sabeti PC, Austerlitz F, Luca F, Sajantila A, Behar DM, Semino O, Sakuntabhai A, Guiso N et al. 2006a. Deciphering the ancient and complex evolutionary history of human arylamine N-acetyltransferase genes. American journal of human genetics. 78(3):423-436. doi: 10.1086/500614
Patin E, Harmant C, Kidd KK, Kidd J, Froment A, Mehdi SQ, Sica L, Heyer E, and Quintana-Murci L. 2006b. Sub-Saharan African coding sequence variation and haplotype diversity at the NAT2 gene. Hum Mutat. 27(7):720. doi: 10.1002/humu.9438
Podgorna E, Diallo I, Vangenot C, Sanchez-Mazas A, Sabbagh A, Cerny V, and Poloni ES. 2015. Variation in NAT2 acetylation phenotypes is associated with differences in food-producing subsistence modes and ecoregions in Africa. BMC Evol Biol. 15:263. doi: 10.1186/s12862-015-0543-6
Prado-Martinez J, Sudmant PH, Kidd JM, Li H, Kelley JL, Lorente-Galdos B, Veeramah KR, Woerner AE, O'Connor TD, Santpere G et al. 2013. Great ape genetic diversity and population history. Nature. 499(7459):471-475. doi: 10.1038/nature12228
Sabbagh A, Langaney A, Darlu P, Gerard N, Krishnamoorthy R, and Poloni ES. 2008. Worldwide distribution of NAT2 diversity: implications for NAT2 evolutionary history. BMC Genet. 9:21. doi: 10.1186/1471-2156-9-21
Sabbagh A, Marin J, Veyssiere C, Lecompte E, Boukouvala S, Poloni ES, Darlu P, and Crouau-Roy B. 2013. Rapid birth-and-death evolution of the xenobiotic metabolizing NAT gene family in vertebrates with evidence of adaptive selection. BMC Evol Biol. 13:62. doi: 10.1186/1471-2148-13-62
The Genomes Project Consortium. 2012. An integrated map of genetic variation from 1,092 human genomes. Nature. 491:56. doi: 10.1038/nature11632
Yates A, Akanni W, Amode MR, Barrell D, Billis K, Carvalho-Silva D, Cummins C, Clapham P, Fitzgerald S, Gil L et al. 2016. Ensembl 2016. Nucleic Acids Res. 44(D1):D710-D716. doi: 10.1093/nar/gkv1157

Supplementary Table S2. Available information on the non-human great apes samples considered in the study.

Columns: Species; Code of individual (or studbook ID as reported in GAGP) (1); Name of individual (1); Sex (1); Origin (1); DNA sample collection (2); NAT1 phased genotype (3); NAT1 GenBank accession number (4); NAT2 phased genotype (3); NAT2 GenBank accession number (4); NATP phased genotype (3); NATP GenBank accession number (4)

Pan troglodytes sub-species

P. t. verus C311 Judumi M - CARTA *1/*1 MK245079/MK245080 *1/*4 MK245233/MK245234 *2/*6 MK245386/MK245387 P. t. verus C312 Akimel M - CARTA *1/*2 MK245081/MK245082 *1/*1 MK245235/MK245236 *1/*2 MK245388/MK245389 P. t. verus C313 Tanya F - CARTA *1/*1 MK245083/MK245084 *1/*1 MK245237/MK245238 *2/*7 MK245390/MK245391 P. t. verus C314 Mohg M - CARTA *1/*3 MK245085/MK245086 *1/*1 MK245239/MK245240 *7/*8 MK245392/MK245393 P. t. verus C315 Tohbi M - CARTA *1/*2 MK245087/MK245088 *1/*1 MK245241/MK245242 *2/*7 MK245394/MK245395 P. t. verus C316 Tamblo M - CARTA *1/*3 MK245089/MK245090 *1/*1 MK245243/MK245244 *1/*8 MK245396/MK245397 P. t. verus C317 Angie F - CARTA *1/*1 MK245091/MK245092 *1/*2 MK245245/MK245246 *1/*1 MK245398/MK245399 P. t. verus C318 Reeuh M - CARTA *1/*1 MK245093/MK245094 *2/*4 MK245247/MK245248 *1/*6 MK245400/MK245401 P. t. verus C319 Papa M - CARTA *1/*1 MK245095/MK245096 *1/*1 MK245249/MK245250 *1/*1 MK245402/MK245403 P. t. verus C320 Patti F - CARTA *1/*1 MK245097/MK245098 *1/*1 MK245251/MK245252 *1/*7 MK245404/MK245405 P. t. verus C321 Mahsho M - CARTA *1/*1 MK245099/MK245100 *1/*1 MK245253/MK245254 *2/*7 MK245406/MK245407 P. t. verus C322 Miloni F - CARTA *1/*1 MK245101/MK245102 *1/*1 MK245255/MK245256 *2/*7 MK245408/MK245409 P. t. verus C323 Pehgia F - CARTA *1/*1 MK245103/MK245104 *1/*2 MK245257/MK245258 *1/*1 MK245410/MK245411 P. t. verus C324 Mahi F - CARTA *1/*1 MK245105/MK245106 *1/*1 MK245259/MK245260 *7/*7 MK245412/MK245413 P. t. verus C325 Tang F - CARTA *1/*1 MK245107/MK245108 *1/*1 MK245261/MK245262 *2/*7 MK245414/MK245415 P. t. verus C326 Tody F - CARTA *1/*1 MK245109/MK245110 *1/*1 MK245263/MK245264 *1/*2 MK245416/MK245417 P. t. verus C328 Kakaichu F - CARTA *1/*4 MK245111/MK245112 *1/*1 MK245265/MK245266 *2/*2 MK245418/MK245419 P. t. verus C329 Quincy F - CARTA *1/*1 MK245113/MK245114 *1/*1 MK245267/MK245268 *1/*2 MK245420/MK245421 P. t. 
verus C330 Geronimo M - CARTA *1/*3 MK245115/MK245116 *1/*1 MK245269/MK245270 *2/*8 MK245422/MK245423 P. t. verus CYN92-49 - F - CARTA *2/*4 MK245121/MK245122 *1/*1 MK245271/MK245272 *1/*2 MK245424/MK245425 P. t. verus CYN94_67 - F - CARTA *1/*1 MK245119/MK245120 *1/*1 MK245273/MK245274 *2/*2 MK245426/MK245427 P. t. verus CYN98-355 - M - CARTA *1/*2 MK245127/MK245128 *1/*4 MK245275/MK245276 *2/*7 MK245428/MK245429 P. t. verus C13605 - - - CARTA *1/*1 MK245117/MK245118 *1/*1 MK245277/MK245278 *2/*2 MK245430/MK245431 P. t. verus - Fox - - CARTA *1/*5 MK245125/MK245126 *1/*1 MK245279/MK245280 *2/*2 MK245432/MK245433 P. t. verus - Susie F - CARTA *1/*4 MK245123/MK245124 *1/*1 MK245281/MK245282 *2/*7 MK245358/MK245359 P. t. verus 215? Boscoe M - CARTA (& GAGP) *1/*1 MK245129/MK245130 *1/*1 MK245283/MK245284 *7/*7 MK245436/MK245437 P. t. verus 691711 Eros M - Basel zoo *1/*1 MK245131/MK245132 *1/*1 MK245285/MK245286 *1/*2 MK245438/MK245439

P. t. verus 741715 Jacky M - Basel zoo *1/*1 MK245133/MK245134 *1/*1 MK245287/MK245288 *1/*2 MK245440/MK245441
P. t. verus bprc01 Annaclara F - BPRC *1/*1 MK244999/MK245000 *1/*1 MK245153/MK245154 *1/*1 MK245306/MK245307
P. t. verus bprc02 Agnetta F - BPRC *1/*4 MK245001/MK245002 *1/*1 MK245155/MK245156 *1/*2 MK245308/MK245309
P. t. verus bprc03 Bram M - BPRC *2/*4 MK245003/MK245004 *1/*1 MK245157/MK245158 *1/*7 MK245310/MK245311
P. t. verus bprc04 Carolina F BPRC *1/*1 MK245005/MK245006 *1/*1 MK245159/MK245160 *1/*2 MK245312/MK245313
P. t. verus bprc05 Centa F - BPRC *1/*4 MK245007/MK245008 *1/*1 MK245161/MK245162 *2/*5 MK245314/MK245315
P. t. verus bprc06 Debbie F Sierra Leone BPRC *1/*4 MK245009/MK245010 *1/*1 MK245163/MK245164 *1/*2 MK245316/MK245317
P. t. verus bprc07 Diana F Sierra Leone BPRC *2/*4 MK245011/MK245012 *1/*5 MK245165/MK245166 *2/*7 MK245318/MK245319
P. t. verus bprc08 Dirk1 M - BPRC *1/*1 MK245013/MK245014 *1/*2 MK245167/MK245168 *2/*4 MK245320/MK245321
P. t. verus bprc09 Erik M - BPRC *1/*1 MK245015/MK245016 *1/*1 MK245169/MK245170 *2/*2 MK245322/MK245323
P. t. verus bprc10 Femma F - BPRC *1/*1 MK245017/MK245018 *1/*1 MK245171/MK245172 *1/*2 MK245324/MK245325
P. t. verus bprc11 Frits M Sierra Leone BPRC *1/*1 MK245019/MK245020 *1/*1 MK245173/MK245174 *2/*5 MK245326/MK245327
P. t. verus bprc12 Gina F Sierra Leone BPRC *4/*4 MK245021/MK245022 *1/*1 MK245175/MK245176 *2/*2 MK245328/MK245329
P. t. verus bprc13 Jolanda F Sierra Leone BPRC *1/*1 MK245023/MK245024 *1/*1 MK245177/MK245178 *1/*1 MK245330/MK245331
P. t. verus bprc14 Karin F Sierra Leone BPRC *1/*4 MK245025/MK245026 *1/*1 MK245181/MK245182 *2/*2 MK245332/MK245333
P. t. verus bprc15 Lady F Sierra Leone BPRC *1/*1 MK245027/MK245028 *1/*1 MK245179/MK245180 *1/*2 MK245334/MK245335
P. t. verus bprc16 Liesbeth F Sierra Leone BPRC *1/*1 MK245029/MK245030 *1/*1 MK245183/MK245184 *1/*2 MK245336/MK245337
P. t. verus bprc17 Louise F Sierra Leone BPRC *1/*1 MK245031/MK245032 *1/*1 MK245185/MK245186 *1/*1 MK245338/MK245339
P. t. verus bprc18 Marco M Sierra Leone BPRC *1/*1 MK245033/MK245034 *1/*2 MK245187/MK245188 *2/*4 MK245340/MK245341
P. t. verus bprc19 Olga F - BPRC *1/*1 MK245035/MK245036 *1/*1 MK245189/MK245190 *1/*1 MK245342/MK245343
P. t. verus bprc20 Oscar M - BPRC *1/*1 MK245037/MK245038 *1/*1 MK245191/MK245192 *1/*2 MK245344/MK245345
P. t. verus bprc21 Pearl F Sierra Leone BPRC *1/*1 MK245039/MK245040 *1/*3 MK245193/MK245194 *2/*2 MK245346/MK245347
P. t. verus bprc22 Renee F Sierra Leone BPRC *1/*6 MK245041/MK245042 *1/*1 MK245195/MK245196 *1/*1 MK245348/MK245349
P. t. verus bprc23 Robert M - BPRC *1/*4 MK245043/MK245044 *1/*1 MK245197/MK245198 *1/*2 MK245350/MK245351
P. t. verus bprc24 Sonja F Sierra Leone BPRC *1/*1 MK245045/MK245046 *1/*1 MK245199/MK245200 *1/*2 MK245352/MK245353
P. t. verus bprc25 Sherry F Sierra Leone BPRC *1/*1 MK245047/MK245048 *1/*1 MK245201/MK245202 *1/*1 MK245354/MK245355
P. t. verus bprc26 Socrates M - BPRC *1/*4 MK245049/MK245050 *1/*1 MK245203/MK245204 *1/*2 MK245356/MK245357
P. t. verus bprc27 Susie F Sierra Leone BPRC *1/*1 MK245051/MK245052 *1/*1 MK245205/MK245206 *1/*2 MK245434/MK245435
P. t. verus bprc28 Toetie F Sierra Leone BPRC *1/*1 MK245053/MK245054 *1/*1 MK245207/MK245208 *2/*2 MK245360/MK245361
P. t. verus bprc29 Yoko F Sierra Leone BPRC *1/*2 MK245055/MK245056 *1/*1 MK245209/MK245210 *1/*1 MK245362/MK245363
P. t. verus bprc30 Yvonne F Sierra Leone BPRC *1/*1 MK245057/MK245058 *1/*1 MK245211/MK245212 *1/*1 MK245364/MK245365
P. t. verus bprc31 Cindy F - BPRC *1/*1 MK245059/MK245060 *2/*2 MK245213/MK245214 *4/*4 MK245366/MK245367
P. t. verus bprc32 Christa F - BPRC *1/*4 MK245061/MK245062 *1/*1 MK245215/MK245216 *2/*5 MK245368/MK245369
P. t. verus bprc33 Lenny M - BPRC *1/*1 MK245063/MK245064 *1/*2 MK245217/MK245218 *1/*1 MK245370/MK245371
P. t. verus bprc34 Zorro M - BPRC *1/*1 MK245065/MK245066 *1/*2 MK245219/MK245220 *4/*7 MK245372/MK245373
P. t. verus bprc35 Marti M - BPRC *1/*4 MK245067/MK245068 *1/*1 MK245221/MK245222 *1/*2 MK245374/MK245375
P. t. verus bprc36 Regina F Sierra Leone BPRC *1/*1 MK245069/MK245070 *1/*1 MK245223/MK245224 *1/*2 MK245376/MK245377
P. t. verus bprc37 Yoran M - BPRC *1/*1 MK245071/MK245072 *1/*2 MK245225/MK245226 *1/*4 MK245378/MK245379
P. t. verus bprc38 Joachim M - BPRC *1/*1 MK245073/MK245074 *1/*1 MK245227/MK245228 *2/*2 MK245380/MK245381
P. t. verus bprc39 Bob M - BPRC *1/*1 MK245075/MK245076 *1/*1 MK245229/MK245230 *1/*5 MK245382/MK245383
P. t. verus bprc40 Gerda F - BPRC *1/*4 MK245077/MK245078 *1/*1 MK245231/MK245232 *1/*7 MK245384/MK245385
P. t. verus bprc41 Izaak M Sierra Leone BPRC *1/*4 *1/*2 *4/*7
P. t. verus bprc42 Gerrit M Sierra Leone BPRC n.d. n.d. *1/*2
P. t. verus bprc43 Indira F Sierra Leone BPRC n.d. n.d. n.d.
P. t. verus bprc44 Jacob M Sierra Leone BPRC n.d. n.d. n.d.
P. t. verus bprc45 Marga F Sierra Leone BPRC n.d. n.d. *4/*7
P. t. verus bprc46 Nina F Sierra Leone BPRC n.d. n.d. n.d.
P. t. verus bprc47 Tasja F Sierra Leone BPRC n.d. n.d. *2/*1
P. t. verus bprc48 Tineke F Sierra Leone BPRC n.d. n.d. n.d.
P. t. verus bprc49 Wodka F Sierra Leone BPRC n.d. n.d. n.d.
P. t. verus 10784 Jimmie F GAGP *1/*1 *1/*1 *2/*2
P. t. verus 650? Koby M - GAGP *1/*1 *1/*2 *2/*2
P. t. verus C0471 Clint M - GAGP *1/*4 *1/*1 *2/*2

P. t. ellioti LWC038 Paquita F Cameroon GAGP *1/*1 *6/*6 *2/*15
P. t. ellioti LWC046 Tobi F Cameroon GAGP *1/*7 *6/*6 *2/*8
P. t. ellioti LWC12 Damian M Cameroon GAGP *1/*10 *6/*6 *2/*8
P. t. ellioti LWC2 Akwaya-Jean M Cameroon GAGP *1/*1 *4/*6 *2/*15
P. t. ellioti LWC21 Julie F Cameroon GAGP *7/*10 *1/*6 *8/*12
P. t. ellioti LWC23 Kopongo F Cameroon GAGP *1/*1 *6/*6 *2/*15
P. t. ellioti LWC24 Koto M Cameroon GAGP *1/*1 *6/*6 *2/*8
P. t. ellioti LWC43 Taweh M Cameroon GAGP *2/*10 *4/*6 *2/*8
P. t. ellioti LWC7 Banyo F Cameroon GAGP *1/*1 *6/*6 *11/*11
P. t. ellioti LWC8 Basho M Cameroon GAGP *1/*10 *6/*6 *2/*14

P. t. troglodytes C327 Bischk F - CARTA *1/*3 MK245137/MK245138 *1/*6 MK245291/MK245292 *1/*8 MK245444/MK245445
P. t. troglodytes 10306? Julie F GAGP *1/*1 *4/*4 *2/*16
P. t. troglodytes - Vaillant M Gabon GAGP *1/*10 *4/*4 *8/*9
P. t. troglodytes - Doris F Gabon GAGP *1/*8 *4/*4 *3/*3
P. t. troglodytes - Clara F Gabon GAGP *1/*1 *4/*4 *9/*9

P. t. schweinfurthii ISIS #925 Harriet F Uganda CARTA (& GAGP) *1/*1 MK245135/MK245136 *4/*4 - *2/*2 MK245442/MK245443
P. t. schweinfurthii BB-042a Andromeda F Tanzania (Gombe Natl. Park) GAGP *1/*3 *4/*10 *2/*8
P. t. schweinfurthii Ch-045 Vincent M Tanzania (Gombe Natl. Park) GAGP *1/*9 *4/*4 *2/*10
P. t. schweinfurthii - Bwambale M Uganda GAGP *1/*1 *4/*4 *2/*11
P. t. schweinfurthii - Kidongo F Democratic Republic of Congo GAGP *10/*11 *4/*4 *11/*17
P. t. schweinfurthii - Nakuu F Democratic Republic of Congo GAGP *1/*1 *4/*4 *2/*2
P. t. verus/troglodytes C0551 Donald M - GAGP *1/*1 *1/*4 *2/*13

Pan paniscus species

P. paniscus - Bonobo - - CARTA *1/*6 MK245139/MK245140 *7/*9 MK245293/MK245294 *18/*18 MK245446/MK245447
P. paniscus LB502 LB502 F - GAGP *6/*12 *7/*7 *18/*18
P. paniscus 52 Salonga F - GAGP *1/*6 *7/*7 *18/*18
P. paniscus 260 Kumbuka F - GAGP *1/*6 *7/*7 *18/*18
P. paniscus 91 Hortense F - GAGP *1/*6 *7/*7 *18/*18
P. paniscus 166 Kosana F - GAGP *6/*12 *7/*7 *18/*19
P. paniscus 67 Dzeeta F - GAGP *1/*12 *7/*7 *18/*18
P. paniscus 88 Hermien F - GAGP *12/*12 *7/*7 *18/*18
P. paniscus 57 Desmond M - GAGP *1/*6 *7/*7 *18/*18
P. paniscus 55 Catherine F - GAGP *6/*6 *7/*7 *18/*18
P. paniscus 56 Kombote F - GAGP *6/*6 *7/*9 *18/*18
P. paniscus 220 Chipita F - GAGP *1/*6 *7/*8 *18/*18
P. paniscus 102 Bono M - GAGP *6/*6 *7/*7 *18/*18
P. paniscus 46 Natalie F - GAGP *6/*12 *7/*7 *18/*18

Gorilla gorilla sub-species

G. g. (undefined) - GORYN00-51 M - CARTA unphased MK245148 unphased - unphased -

G. g. gorilla 960298 Kisoro M - Basel zoo unphased MK245149 unphased MK245295 unphased MK245448
G. g. gorilla 960297 Joas F - Basel zoo unphased MK245150 unphased MK245296 unphased MK245449
G. g. gorilla 830293 Faddama F - Basel zoo unphased MK245151 unphased MK245297 unphased MK245450
G. g. gorilla 120305 Enea F - Basel zoo unphased MK245152 unphased MK245298 unphased MK245451
G. g. gorilla 9749 Kowali - - GAGP unphased unphased unphased
G. g. gorilla 9750 Azizi - - GAGP unphased unphased unphased
G. g. gorilla 9751 Bulera - - GAGP unphased unphased unphased
G. g. gorilla 9752 Suzie - - GAGP unphased unphased unphased
G. g. gorilla 9753 Kokomo - - GAGP unphased unphased unphased
G. g. gorilla A930 Sandra F - GAGP unphased unphased unphased
G. g. gorilla A931 Banjo F - GAGP unphased unphased unphased
G. g. gorilla A932 Mimi F - GAGP unphased unphased unphased
G. g. gorilla A933 Dian F - GAGP unphased unphased unphased
G. g. gorilla A934 Delphi F - GAGP unphased unphased unphased
G. g. gorilla A936 Coco F - GAGP unphased unphased unphased
G. g. gorilla A937 Kolo F - GAGP unphased unphased unphased
G. g. gorilla A962 Amani F - GAGP unphased unphased unphased
G. g. gorilla B642 Akiba Beri F - GAGP unphased unphased unphased
G. g. gorilla B643 Choomba F - GAGP unphased unphased unphased
G. g. gorilla B644 Paki F - GAGP unphased unphased unphased
G. g. gorilla B647 Anthal F - GAGP unphased unphased unphased
G. g. gorilla B650 Katie F - GAGP unphased unphased unphased
G. g. gorilla KB3782 Vila F - GAGP unphased unphased unphased
G. g. gorilla KB3784 Dolly F - GAGP unphased unphased unphased
G. g. gorilla KB4986 Katie F - GAGP unphased unphased unphased
G. g. gorilla KB5792 Carolyn F - GAGP unphased unphased unphased
G. g. gorilla KB5852 Helen F - GAGP unphased unphased unphased
G. g. gorilla KB6039 Oko F - GAGP unphased unphased unphased
G. g. gorilla KB7973 Porta F - GAGP unphased unphased unphased
G. g. gorilla X00108 Abe M - GAGP unphased unphased unphased
G. g. gorilla X00109 Tzambo M - GAGP unphased unphased unphased

G. g. diehli B646 Nyango F - GAGP unphased unphased unphased

Gorilla beringei species

G. b. graueri 9732 Mkubwa M - GAGP unphased unphased unphased
G. b. graueri A929 Kaisi M - GAGP unphased unphased unphased
G. b. graueri - Victoria F - GAGP unphased unphased unphased

Pongo species

Pongo (undefined) - ORANG M - CARTA unphased - unphased - unphased MK245452

P. abelii 133060 Budi M - Basel zoo unphased MK245141 unphased MK245299 unphased MK245453
P. abelii 123057 Vendel M - Basel zoo unphased MK245142 unphased MK245300 unphased MK245454
P. abelii 123058 Revital F - Basel zoo unphased MK245143 unphased MK245301 unphased MK245455
P. abelii 820344 Patpat/Schubbi M - Basel zoo unphased MK245144 unphased MK245302 unphased MK245456
P. abelii 920353 Farida F - Basel zoo unphased MK245145 unphased MK245303 unphased MK245457
P. abelii 760341 Sexta F - Basel zoo unphased MK245146 unphased MK245304 unphased MK245458
P. abelii 620332 Kasih F - Basel zoo unphased MK245147 unphased MK245305 unphased MK245459
P. abelii A947 Elsi F - GAGP unphased unphased unphased
P. abelii A948 Kiki F - GAGP unphased unphased unphased
P. abelii A949 Dunja F - GAGP unphased unphased unphased
P. abelii A950 Babu F - GAGP unphased unphased unphased
P. abelii A952 Buschi M - GAGP unphased unphased unphased

P. pygmaeus A939 Nonja F - GAGP unphased unphased unphased
P. pygmaeus A940 Temmy F - GAGP unphased unphased unphased
P. pygmaeus A941 Sari F - GAGP unphased unphased unphased
P. pygmaeus A943 Tilda F - GAGP unphased unphased unphased
P. pygmaeus A944 Napoleon M - GAGP unphased unphased unphased

1 An empty table entry (-) indicates that the information was unavailable; in the Origin column, an empty entry can also mean that the individual was born in captivity.
2 CARTA: Center for Academic Research and Training in Anthropogeny; BPRC: Biomedical Primate Research Center; Basel zoo: Zoologischer Garten Basel AG, Switzerland; GAGP: Great Ape Genome Project.
3 Genotypes deduced from relatives are underlined; genotypes that could not be unambiguously inferred are indicated as not defined (n.d.); only genotypes of the Pan genus could be phased with confidence (see text).
4 Only for genotypes from the CARTA, BPRC and Basel zoo collections, generated through Sanger sequencing in this study.
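The phased genotypes listed above use the notation haplotype/haplotype (for example *1/*4), so haplotype counts and frequencies can be tallied directly from a column of the table. The following is a minimal sketch, not the authors' procedure: the function name `count_haplotypes` is hypothetical, and the example input is the first genotype column of the ten P. t. ellioti rows as listed in the table (the table header lies outside this section, so the locus it corresponds to is not asserted here).

```python
from collections import Counter

# Entries that carry no phased genotype in the table.
UNPHASED_MARKERS = {"n.d.", "unphased", "-"}

def count_haplotypes(genotypes):
    """Count haplotype copies in a list of 'hapA/hapB' genotype strings."""
    counts = Counter()
    for genotype in genotypes:
        if genotype in UNPHASED_MARKERS:
            continue  # skip undefined or unphased entries
        first, second = genotype.split("/")
        counts[first] += 1
        counts[second] += 1
    return counts

# First genotype column of the ten P. t. ellioti individuals, in table order.
ellioti_first_column = [
    "*1/*1", "*1/*7", "*1/*10", "*1/*1", "*7/*10",
    "*1/*1", "*1/*1", "*2/*10", "*1/*1", "*1/*10",
]

counts = count_haplotypes(ellioti_first_column)
total = sum(counts.values())  # two haplotype copies per individual
frequencies = {hap: n / total for hap, n in counts.items()}

for hap, n in counts.most_common():
    print(hap, n, round(frequencies[hap], 2))
```

Entries marked n.d. or unphased are skipped rather than guessed, so the totals reflect only the individuals whose genotypes could be phased.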

Supplementary Table S3. Polymorphism and divergence in hominids at homologous positions in the 903 bp sequence including the 870 bp NAT1 coding exon.

The screened sequence spans from 18'079'545 to 18'080'447 on chromosome 8 in the human reference sequence GRCh37/hg19. Positions upstream or downstream of the NAT1 coding exon are reported in italic font. Non-synonymous mutations are shown in bold type. The 21 substitutions corresponding to inter-species divergence (between at least two species) are marked with a star in the leftmost column of the table. Polymorphisms shared between species or sub-species are highlighted in blue font. (N) indicates that undefined positions were also observed in some individuals. Variants also recorded in sequenced ancient genomes of hominins (Denisova, Neanderthal or the 45'000-year-old Homo sapiens from Ust'-Ishim) are boxed (see Supplementary Table S8).
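The two coordinate systems used in this table (position in the human reference sequence and position in the cds) are related by a fixed offset. The sketch below checks the 903 bp length of the screened segment and converts cds positions to GRCh37/hg19 coordinates. It rests on two assumptions that are inferred from the table rows rather than stated in the legend: cds position 1 is taken to lie at 18'079'557, and upstream positions are taken to be numbered without a position 0 (so that -1 immediately precedes 1). The function names are hypothetical.

```python
# Limits of the screened NAT1 segment on chromosome 8 (GRCh37/hg19), from the legend.
SEGMENT_START = 18_079_545
SEGMENT_END = 18_080_447

# Assumption inferred from the table rows (e.g. cds 21 <-> 18079577):
# the first base of the coding exon lies at this genomic position.
CDS_START = 18_079_557
CDS_LENGTH = 870  # length of the NAT1 coding exon, from the table title

def cds_to_genomic(cds_pos):
    """Convert a cds position to a genomic position (no cds position 0 assumed)."""
    if cds_pos == 0:
        raise ValueError("cds numbering is assumed to skip position 0")
    offset = cds_pos - 1 if cds_pos > 0 else cds_pos
    return CDS_START + offset

def region(cds_pos):
    """Classify a cds position relative to the coding exon."""
    if cds_pos < 0:
        return "upstream"
    if cds_pos > CDS_LENGTH:
        return "downstream"
    return "coding"

segment_length = SEGMENT_END - SEGMENT_START + 1
print(segment_length)                   # length of the screened segment in bp
print(cds_to_genomic(-9), region(-9))   # first row of the table
print(cds_to_genomic(884), region(884)) # last row of the table
```

Under these assumptions the first and last rows of the table (cds -9 and cds 884) fall upstream and downstream of the coding exon, respectively, which is consistent with the legend's statement that such positions are set in italic.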

Columns, left to right: row number (preceded by * for inter-species divergence); position in the human reference sequence; position in the cds; SNP ID in Ensembl (where one exists); Human(1); Common chimpanzee(2): Western, Nigeria-Cameroon, Eastern, Central; Bonobo(3); Gorilla(4): Gorilla gorilla, Gorilla beringei; Orangutan(5): Pongo abelii, Pongo pygmaeus. Parenthesized numbers refer to the footnotes below the table.

* 1 18079548 -9 T C C C C C T T T T
2 18079577 21 rs4986992 T/G T T T T T T T T T
* 3 18079595 39 G A A A A A G G G G
4 18079631 75 rs201603160 T/C T T T T T T T T T
5 18079632 76 G G/A G/A G/A G/A G G G G G
* 6 18079643 87 A A A A A A A A G G
7 18079653 97 rs56318881 C/T C C C C C C C C C
* 8 18079664 108 C C C C C C C C A A
9 18079688 132 T T T T T T T T C C/T(N)
10 18079689 133 rs181298696 G/T G G G G G G G G/A G
* 11 18079694 138 T A A A A A A A A A
12 18079703 147 C C/T C/T C/T C/T C C C C C
13 18079737 181 A A A A A A A A/C(N) A A
14 18079746 190 rs56379106 C/T C C C C C C C C C
15 18079757 201 rs200617057 G/C G G G G G G G G G
16 18079775 219 rs765390690 T T T T T T T/C(N) T T T
17 18079793 237 rs751824544 G/C G G G G G G G G G
18 18079797 241 A A A A A A A A T/A T
* 19 18079809 253 G G G G G G C C G G
* 20 18079817 261 rs368205381 G A A A A A G G G G
* 21 18079821 265 T T T T T T C C T T
* 22 18079829 273 G T T T T T G G G G
23 18079859 303 C C C C C C/T C C C C
24 18079883 327 C C C C C C C C T/C T
25 18079892 336 G G G G G G G G G/C G/C(N)
26 18079897 341 rs145975713 T T T T T/C T T T T T
27 18079898 342 rs371752075 T/C T T T T T T T T T
28 18079901 345 rs768399724 T T T T T T T T C/T(N) C
29 18079902 346 G G G G G G G G G/A(N) G
30(6) 18079906-18079907 350-351 rs72554606 G/C G G G G G G G G G
31 18079924 368 rs771202638 C C C C C C C C C/G(N) C
32 18079925 369 T T/C T T T T/C T T T T
33 18079943 387 C C C C C C C/T(N) C C C
* 34 18079946 390 rs376036465 G G G G G G G G A A
35 18079958 402 rs146727732 T/C T T T T T T T T T
36 18079976 420 G G G G G G G G G/C G
* 37 18079994 438 G G G G G G T T G G
* 38 18080000 444 T T T T T T T T C C
39 18080001 445 rs4987076 A/G A A A A A A/G(N) A/G A/G A/G
40 18080007 451 rs760970670 C C C C C C T T C/T C
* 41 18080009 453 T T T T T T T T C C
42 18080014 458 rs374226986 C C C C/T C C C C C C
43 18080015 459 rs4986990 G/A G G G G G G G G/A G
44 18080024 468 rs141809132 T/C T T T T T T T T T
45(7) 18080053-18080055 497-499 rs72554608/rs56328478/rs745892749 G/C G G G G G G G G G
46 18080074 518 A A A A/C A A A A A A
* 47 18080085 529 C A A A A A C C A A
48 18080089 533 C C C C C C C C G/C G
* 49 18080099 543 A G G G G G A A A A
50 18080107 551 G G G G G G G G G/A(N) G
51 18080115 559 rs5030839 C/T C C C C C C C C C
52 18080116 560 rs4986782 G/A G G G G G G G G G
53 18080128 572 rs141552883 C/T C C C C C C C C C
* 54 18080139 583 A C C C C C G G G G
55 18080153 597 T T/G T T T T T T T T
56 18080169 613 rs72554609 A/G A A A A A A A A A
57 18080196 640 rs4986783 G/T G G G G G G G G G
58 18080219 663 A/G A A A A A A A A A
59 18080247 691 rs201916736 G/T G G G G G G G G G
* 60 18080277 721 C C C C C C C C T T
61 18080293 737 rs201129487 A/G A A A A A A A A A
62 18080308 752 rs56172717 A/T A A A A A A A A A
63 18080312 756 rs112846752 A/T A A A A A A A G G
64 18080316 760 G G G/C G G G G G G G
* 65 18080321 765 C T T T T T C C C C
* 66 18080332 776 G G G G G G G G A A
67 18080333 777 rs4986991 T/C T T T T T T T T T
68 18080336 780 G G G G G G G G G/T G
69 18080337 781 rs72554610 G/A G G G G G G G G G
70 18080343 787 rs72554611 A/G A A A A A A(N) A(N) A A
71 18080345 789 A A/G A A/G A/G A A A A A
72 18080363 807 rs201670530 T/C T T T T T T T T T
* 73 18080375 819 rs746370481 T A A A A A T T T T
* 74 18080382 826 C G G G G G G G G G
75 18080432 876 T T T T T T T T C/T C
76 18080440 884 rs55793712 A/G A A A A A A A A A

1 Human polymorphisms are those recorded by the consensus gene nomenclature of human NAT alleles (http://nat.mbg.duth.gr/), complemented with haplotype data from 1000 Genomes Phase 1 (The 1000 Genomes Project Consortium 2012), Patin et al. (2006a) and Mortensen et al. (2011). Human positions that are not recorded as polymorphic but are associated with a SNP identifier are reported in Ensembl (Yates et al. 2016) (http://www.ensembl.org/Homo_sapiens/Info/Index) with a highest population MAF < 0.01.

2 Western, Nigeria-Cameroon, Eastern and Central chimpanzees are Pan troglodytes verus, P. t. ellioti, P. t. schweinfurthii and P. t. troglodytes, respectively. Polymorphism recording is based on the individuals of the present study (Supplementary Figure S1) and the chimpanzees of Prado-Martinez et al. (2013), cross-checked with the P. t. verus assembly reference sequence (panTro4, February 2011).
3 Based on the individual of this study (Bonobo), the bonobos of Prado-Martinez et al. (2013) and the Pan paniscus draft assembly reference sequence (panPan1, May 2012).
4 Based on the individuals of this study, the gorillas of Prado-Martinez et al. (2013) and the Gorilla gorilla gorilla draft assembly reference sequence (gorGor4, December 2014).
5 Based on the individuals of this study, the orangutans of Prado-Martinez et al. (2013) and the Pongo pygmaeus abelii draft assembly reference sequence (ponAbe2, July 2007).
6 Polymorphism (rs72554606) reported by the consensus gene nomenclature of human NAT alleles as occurring on haplotype NAT1*5, at one of these two positions: 18’079’906 or 18’079’907.
7 Polymorphism (rs72554608) reported by the consensus gene nomenclature of human NAT alleles as also occurring on haplotype NAT1*5, at one of these three positions: 18’080’053, 18’080’054 or 18’080’055. Two additional SNP identifiers are reported in Ensembl (rs56328478 and rs745892749).

References
Mortensen HM, Froment A, Lema G, Bodo JM, Ibrahim M, Nyambo TB, Omar SA, and Tishkoff SA. 2011. Characterization of genetic variation and natural selection at the arylamine N-acetyltransferase genes in global human populations. Pharmacogenomics 12(11):1545-1558.
Patin E, Barreiro LB, Sabeti PC, Austerlitz F, Luca F, Sajantila A, Behar DM, Semino O, Sakuntabhai A, Guiso N et al. 2006. Deciphering the ancient and complex evolutionary history of human arylamine N-acetyltransferase genes. American Journal of Human Genetics 78(3):423-436.
Prado-Martinez J, Sudmant PH, Kidd JM, Li H, Kelley JL, Lorente-Galdos B, Veeramah KR, Woerner AE, O'Connor TD, Santpere G et al. 2013. Great ape genetic diversity and population history. Nature 499(7459):471-475.
The 1000 Genomes Project Consortium. 2012. An integrated map of genetic variation from 1,092 human genomes. Nature 491(7422):56-65.
Yates A, Akanni W, Amode MR, Barrell D, Billis K, Carvalho-Silva D, Cummins C, Clapham P, Fitzgerald S, Gil L et al. 2016. Ensembl 2016. Nucleic Acids Res 44(D1):D710-D716.

Supplementary Table S4. Polymorphism and divergence in hominids at homologous positions in the 1,115 bp sequence including the 870 bp NAT2 coding exon.

The screened sequence spans from 18'257'489 to 18'258'603 on chromosome 8 in the human reference sequence GRCh37/hg19. Positions downstream of the NAT2 coding exon are reported in italic font. Non-synonymous mutations are shown in bold type. The 38 substitutions corresponding to inter-species divergence (between at least two species) are marked with a star in the leftmost column of the table. Polymorphisms shared between species or sub-species are highlighted in blue font. (N) indicates that undefined positions were also observed in some individuals. Variants also recorded in sequenced ancient genomes of hominins (Denisova, Neanderthal or the 45'000-year-old Homo sapiens from Ust'-Ishim) are boxed (see Supplementary Table S8).

Columns, left to right: row number (preceded by * for inter-species divergence); position in the human reference sequence; position in the cds; SNP ID in Ensembl (where one exists); Human(1); Common chimpanzee(2): Western, Nigeria-Cameroon, Eastern, Central; Bonobo(3); Gorilla(4): Gorilla gorilla, Gorilla beringei; Orangutan(5): Pongo abelii, Pongo pygmaeus. Parenthesized numbers refer to the footnotes below the table.

1 18257520 7 rs200893121 A/G A A A A A A A A A
2 18257521 8 rs186884477 T/A T T T T T T T T T
3 18257542 29 rs72466456 T/C T T T T T T T T T
4 18257549 36 T T T T/C T T T T T T
5 18257576 63 G G G G G G G G G G/T
6 18257583 70 rs45477599 T/A T T T T T T T T T
7 18257585 72 A A A A A A/C A A A A
8 18257612 99 G G G G G G G/A(N) A G G
9 18257617 104 rs199780526 T/C T T T T T T T T T
10 18257621 108 C C C C C C C C C/T C/T
11 18257624 111 rs72554615 T/C T T T T T T T T T
12 18257630 117 C C C C C C C C C/T(N) C(N)
13 18257634 121 rs149283608 A/T A A A A A A A A A
* 14 18257654 141 C C C C C C C C T T
* 15 18257658 145 G G G G G A G G G G
16 18257665 152 rs72466457 G/T G G G G G G G G G
17 18257670 157 G G G G G G G G G/A(N) G/A(N)
18 18257703 190 rs1805158 C/T C C C C C C C C C
* 19 18257704 191 rs1801279 G/A G G G G G/A T T G G
20 18257712 199 rs201345576 T/C T T T T T T T T T
21 18257716 203 rs72466458 G/A G G G G G G G G G
22 18257732 219 rs750089838 A A A A A A A/G(N) A A A
23 18257738 225 rs200775849 G/C G G G G G G G G G
24 18257741 228 rs72466459 C/T C C C C C C C C C
25 18257795 282 rs1041983 C/T C C C C C C C C C
* 26 18257806 293 T C C C C C C C T T
27 18257821 308 rs749903527 C/T C C C C C C C C C
28 18257854 341 rs1801280 T/C T T T T T T T T T
* 29(6) 18257858 345 rs45532639 C/T A A A A A C/T C/T T T
30 18257859 346 rs183409091 G/A G G G G G G G G G
31 18257867 354 rs146405047 T/C T T T T T T T T T
* 32 18257876 363 rs372383623 C C C C C C C C T T
33 18257877 364 rs4986996 G/A G G G G G G G G G
34 18257916 403 rs12720065 C/G C C C C C C C C C
* 35 18257921 408 A G G G G G G G G G
36 18257924 411 rs4986997 A/T A A A A A A A A A
37 18257947 434 rs72554616 A/C A A A A A A A A A
38 18257951 438 G G G G G G G/A A G G
* 39 18257956 443 G C C C C C G G G G
* 40 18257964 451 rs747336755 T C C C C C C C T/C C
41 18257965 452 G G G G G G G G G/A G
42 18257971 458 rs72466460 C/T C C C C C C C C C
* 43 18257978 465 rs368704280 G G G G G G G G C C
* 44 18257984 471 A A A A A A A A C C
45 18257985 472 rs139351995 A/C A A A A A A A A A
46 18257994 481 rs1799929 C/T C C C C C C C C C
47 18258004 491 rs139512288 T/C T T T T T T T T T
48 18258012 499 rs72554617 G/A G G G G G G G G G
49 18258019 506 rs200585149 A/G A A A A A A/G A A A
* 50 18258024 511 A C C C C C C C C C
51 18258027 514 A A A/G A/G A/G A A A A A
52 18258029 516 C C C C C C C C C C/T
53 18258030 517 A A A A A A A A A/G A/G
54 18258031 518 rs369500066 A/G A A A A A A A A A
55 18258033 520 rs753856548 G G G G G G G/T G G G
* 56 18258044 531 rs757090914 T T T T T T T T C C
* 57 18258047 534 T T T T T T T T A A
58 18258049 536 rs572750517 A A A A A A A A A/G A
* 59 18258054 541 C C C C C C C C G G
* 60 18258085 572 T T T T T T T T G G
* 61 18258086 573 A C C C C C C C C C
62 18258091 578 rs79050330 C/T C/T C C C C C C C C
63 18258092 579 rs144176822 G/T G G G G G G G G G
* 64 18258096 583 rs759059031 G G G G G G G G C C
* 65 18258101 588 T T T T T T T T C C
66 18258102 589 rs375746304 C/T C C C C C C C C C
67 18258103 590 rs1799930 G/A G G G G G G G G G
* 68 18258107 594 rs563563855 A G G G G G A A A A
* 69 18258108 595 A G G G G G A A A A
70 18258113 600 rs72466461 A/G A A A A A A A A A
71 18258122 609 rs45618543 G/T G G G G G G G G G
72 18258127 614 T T T T T T T/C T T T
73 18258135 622 rs56387565 T/C T T T T T T T T T
74 18258146 633 G/A G G G G G G G G G
75 18258151 638 rs138707146 C/T C C C C C C C C C
76 18258154 641 C/T C C C C C C C C C
77 18258161 648 A A A A A A A/G G A A
* 78 18258177 664 T C C C C C C C T T/C
79 18258178 665 T/G T T T T T T T T T
80 18258196 683 rs45518335 C/T C C C C C C C C C
* 81 18258207 694 T T T T T T T T C C
* 82 18258241 728 A A A A A A A A T T
83 18258272 759 rs56011192 C/T C C C C C C C C C
84 18258279 766 rs55700793 A/G A A A A A A A A A
* 85 18258287 774 C G G G G G G G G G
86 18258301 788 T T T T T T T/C T T T
87 18258302 789 T G/T T T T T T T T T
88(7) 18258316 803 rs1208 A/G A A A A A A A A A
89 18258322 809 T/C T T T T T T T T T
* 90 18258347 834 T G G G G G G G G G
* 91(8) 18258351 838 rs56393504 G/A G G G G G G G C C
92 18258358 845 rs56054745 A/C A A A A A A A A A
93 18258370 857 rs1799931 G/A G G G G G G G G G
94 18258372 859 rs72554618 T/del T T T T T T T T T
* 95 18258390 877 A A A A A A A A T T
96(9) 18258407 894 C C C C C C C/CC C C C
97 18258415 902 T T T T T T T T T T/A
* 98 18258441 928 A G G G G G A A A A
* 99 18258444 931 T T T T T T T T G G
100 18258447 934 G G/A G/A G/A G/A G G G G G
* 101 18258456 943 G G G G G G G(N) G(N) C C
102 18258462 949 C C/G C C C C C C C C
103 18258469 956 C C C C C C C C C/T C
104 18258475 962 rs754854317 C C C C C C C/T(N) C C C
* 105 18258476 963 rs539028538 G G G G G G G G A A
106 18258480 967 T T T T T T T/C T T T
* 107 18258491 978 A A A A A A A A C C
* 108 18258497 984 A A A A A A A A C C
109 18258502 989 C/G C C C C C C C C C
* 110 18258508 995 C C C C C C C C G G
* 111 18258518 1005 T C C C C C T T T T
112 18258524 1011 T T T T T T T/C C T T
113 18258534 1021 T/C T T T T T T T T T
* 114 18258538 1025 A A A A A A A A G G
115 18258539 1026 rs185257970 A/G A A A A A A A A A
* 116 18258551 1038 T T T T T T T T C C
117 18258598 1085 G/A G G G G G G G G G

1 Human polymorphisms are those recorded by the consensus gene nomenclature of human NAT alleles (http://nat.mbg.duth.gr/), complemented with haplotype data from 1000 Genomes Phase 1 (The 1000 Genomes Project Consortium 2012), Sabbagh et al. (2008), Patin et al. (2006a), Mortensen et al. (2011) and Podgorna et al. (2015). Human positions that are not recorded as polymorphic but are associated with a SNP identifier are reported in Ensembl (Yates et al. 2016) (http://www.ensembl.org/Homo_sapiens/Info/Index) with a highest population MAF < 0.01.
2 Western, Nigeria-Cameroon, Eastern and Central chimpanzees are Pan troglodytes verus, P. t. ellioti, P. t. schweinfurthii and P. t. troglodytes, respectively. Polymorphism recording is based on the individuals of the present study (Supplementary Figure S1) and the chimpanzees of Prado-Martinez et al. (2013), cross-checked with the P. t. verus assembly reference sequence (panTro4, February 2011). For Eastern chimpanzees (P. t. schweinfurthii), NAT2 Sanger sequencing of the single DNA sample available in this study (CHarriet) was unsuccessful, so NAT2 polymorphism recording for this individual is based solely on the NGS sequence from Prado-Martinez et al. (2013).
3 Based on the Pan paniscus individual of this study (Bonobo), the bonobos of Prado-Martinez et al. (2013) and the Pan paniscus draft assembly reference sequence (panPan1, May 2012).
4 Based on the individuals of this study, the gorillas of Prado-Martinez et al. (2013) and the Gorilla gorilla gorilla draft assembly reference sequence (gorGor4, December 2014).
5 Based on the individuals of this study, the orangutans of Prado-Martinez et al. (2013) and the Pongo pygmaeus abelii draft assembly reference sequence (ponAbe2, July 2007).
6 At this position (18’257’858), a fixed non-synonymous mutation (A) differentiates the Pan genus from the other hominids. Among the latter, a synonymous C/T polymorphism is observed in humans and gorillas, whereas a T is observed in the orangutans.
7 At this position (18’258’316), a synonymous A/G polymorphism is observed in humans, whereas all other hominids are apparently fixed on A. An A is also observed in the Neanderthal sequences. For Denisova, there seems to be an inconsistency at this position between the high-coverage sequence reads of the Denisova genome in the UCSC Genome Browser (reporting A) and the ancient genome browser of the Max Planck Institute for Evolutionary Anthropology, which apparently reports G (see Supplementary Table S8).
8 At this position (18’258’351), a non-synonymous G/A polymorphism is observed in humans, and a non-synonymous mutation (C) differentiates the orangutans from all other hominids (G).
9 At this position (18’258’407), immediately downstream of the coding exon, we hypothesize that a C insertion (C/CC) polymorphism could be present in gorillas. The four gorillas Sanger-sequenced in this study have the C insertion, and so does the gorGor3 reference sequence (gorGor3, May 2011). However, the newer gorGor4 reference sequence does not have this insertion, nor is it reported in the gorilla NGS sequences of the GAGP from Prado-Martinez et al. (2013).

References
Mortensen HM, Froment A, Lema G, Bodo JM, Ibrahim M, Nyambo TB, Omar SA, and Tishkoff SA. 2011. Characterization of genetic variation and natural selection at the arylamine N-acetyltransferase genes in global human populations. Pharmacogenomics 12(11):1545-1558.
Patin E, Barreiro LB, Sabeti PC, Austerlitz F, Luca F, Sajantila A, Behar DM, Semino O, Sakuntabhai A, Guiso N et al. 2006. Deciphering the ancient and complex evolutionary history of human arylamine N-acetyltransferase genes. American Journal of Human Genetics 78(3):423-436.
Podgorna E, Diallo I, Vangenot C, Sanchez-Mazas A, Sabbagh A, Cerny V, and Poloni ES. 2015. Variation in NAT2 acetylation phenotypes is associated with differences in food-producing subsistence modes and ecoregions in Africa. BMC Evol Biol 15:263.
Prado-Martinez J, Sudmant PH, Kidd JM, Li H, Kelley JL, Lorente-Galdos B, Veeramah KR, Woerner AE, O'Connor TD, Santpere G et al. 2013. Great ape genetic diversity and population history. Nature 499(7459):471-475.
Sabbagh A, Langaney A, Darlu P, Gerard N, Krishnamoorthy R, and Poloni ES. 2008. Worldwide distribution of NAT2 diversity: implications for NAT2 evolutionary history. BMC Genet 9:21.
The 1000 Genomes Project Consortium. 2012. An integrated map of genetic variation from 1,092 human genomes. Nature 491(7422):56-65.
Yates A, Akanni W, Amode MR, Barrell D, Billis K, Carvalho-Silva D, Cummins C, Clapham P, Fitzgerald S, Gil L et al. 2016. Ensembl 2016. Nucleic Acids Res 44(D1):D710-D716.

Supplementary Table S5. Polymorphism and divergence in hominids at homologous positions in the 1’002 bp sequence including the NAT pseudogene (NATP).

The screened sequence spans from 18'228'116 to 18'229'117 on chromosome 8 in the human reference sequence GRCh37/hg19. The 77 substitutions corresponding to inter-species divergence (between at least two species) are marked with a star in the leftmost column of the table. Polymorphisms shared between species or sub-species are highlighted in blue font. Deletions are indicated by del; (N) indicates that the position is undefined in some samples. Variants also recorded in sequenced ancient genomes of hominins (Denisova, Neanderthal or the 45'000-year-old Homo sapiens from Ust'-Ishim) are boxed (see Supplementary Table S8).

Columns, left to right: row number (preceded by * for inter-species divergence); position in the human reference sequence; SNP ID in Ensembl (where one exists); Human(1); Common chimpanzee(2): Western, Nigeria-Cameroon, Eastern, Central; Bonobo(3); Gorilla(4): Gorilla gorilla, Gorilla beringei; Orangutan(5): Pongo abelii, Pongo pygmaeus. Parenthesized numbers refer to the footnotes below the table.

1 18228116 rs10088180 A/G A A A A A A A A A
2 18228117 T T T T T T T T T/A(N) T/A
* 3 18228126 C C C C C C C(N) N T T
4 18228146 G G G G/T G G G G G G
5 18228151 rs190040562 T/G T T T T T T T T T
6 18228159 C C C C C C C C C C/A
7 18228162 rs34107584 G/del G G G G G G G G G
8 18228172 rs143635794 A/G G G G G G G G G G
9 18228189 C C C C C/T C C C C C
10 18228213 A A A A A A A A A/T N
* 11 18228214 G G G G G G C C G G
* 12 18228224 C C C C C C C C T T
13 18228238 A A/G A A A A A A A A
* 14 18228242 A A A A A G A A A A
15 18228246 rs73590295 T/C T T T T T T T T T
* 16 18228256 A A A A A A A A G G
17 18228267 rs181628748 C/T C C C C C C C C C
18(6) 18228277 rs549576704 A/AT A N A(N) A(N) A(N) A(N) A(N) A(N) N
19(7) 18228278 T del(N) N del(N) del(N) del(N) del(N) N del(N) N
* 20 18228279 T T(N) N T(N) T(N) C(N) T(N) N T(N) N
* 21 18228285 T/A T T/A(N) T/A(N) T/A A T T T T
* 22 18228286 rs546046491 T A A A A A A/T T A/G(N) A
23 18228291 rs2898473 C/A A A A A A A A A A
24 18228294 rs527820494 T/C T T T T T T T T T
* 25 18228297 A A A A A A A A G G
26 18228303 A A A A A A A A A/T A/T
* 27 18228304 C C C C C T C C C C
28 18228306 C C C C C C C C C/T C
* 29 18228315 G G G G G G G G T T
* 30 18228320 T G G G G G T T T T
* 31 18228327 G G G G G G G G C C
32 18228329 rs547876229 G/A G G G G G G G G G
* 33 18228332 G G G G G G G G A A
* 34 18228335 C C C C C C C C G G
* 35 18228336 G A A A A A A A A A
* 36 18228367 C T T T T T C C C C
37 18228368 T T/C T T T/C T T T T T
38 18228373 G G G G G G G G G/T(N) G/T
* 39 18228382 A G G G G G G G G G
* 40 18228385 G G G G G G G G A A
41 18228389 rs561082251 G/A G G G G G G G G G
* 42 18228392 A T T T T T T T T T
* 43 18228395 G G G G G G A A G G
44 18228404 C C C/T C C C C C C C
45 18228405 rs530061116 C/T C C C C C C C C C
46 18228409 G/A G G G G G G G G G
* 47 18228418 A A A A A A A A G G
* 48 18228419 T C C C C C T T T T
* 49 18228420 A A A A A A A A G G
* 50 18228424 rs184770517 C/T T T T T T C C T T
51 18228425 rs570049254 G/A G G G G G G G G G
* 52 18228437 C C C C C C A A C C
53 18228442 rs538826711 T/A T T T T T T T T T
* 54 18228458 rs372738250 G/A A A A A A G/A G G G
* 55 18228463 C C C C C C T T T T
* 56 18228470 T T T T T T A A T T
* 57 18228480 C C C C C C C C T T
58 18228483 rs188523053 A/T A A A A A A A A A
59 18228501 C C/T C/T C/T C/T C C C C C
60 18228502 rs534373318 C/T C C C C C C C C C
* 61 18228519 C T T T T T T T T T
62 18228543 A A A/G A A A A A A A
63 18228545 rs554671092 C/T C C C C C C C C C
* 64 18228552 A A A A A A A A G G
* 65 18228560 C C/A C C C C C C T T
66(8) 18228571 rs180998104 C/T C(N) C(N) C(N) C(N) C C C C(N) C
* 67 18228574 G G(N) G(N) G(N) G(N) G G A G(N) G
* 68 18228575 A A(N) A(N) A(N) A(N) G A A A(N) A
* 69 18228576 G G(N) G(N) G(N) G(N) A G G G(N) G
* 70 18228580 G G(N) G(N) G(N) G(N) G C C C/G(N) G
71 18228582 G G/T G G G G G G G G
* 72 18228606 C C C C C C T T T T
73 18228608 rs537077588 T/G T T T T T T T T T
74 18228612 A A A A A A A A A/T A
75 18228614 G G/T G/T G/T G/T G G G G G
76 18228616 rs35548819 T/C T T T T T T T T T
* 77 18228622 C C C C C C T T T(N) T
* 78 18228628 T T T T T T G G T T
79 18228636 rs577255154 A/C A A A A A A A A A
* 80 18228639 T A A A A A A A A A
* 81 18228658 C C C C C C C C T T
82 18228659 G G G G/A G G/A G G G G
* 83 18228660 G C C C C T G G G G
84 18228661 rs546009408 G/A G G G G/A G G G G G
* 85 18228666 A C C C C C T T C C
86 18228669 rs73590297 G/C G G G G G G(N) N G G
87(9) 18228673 rs115350875 T/C T(N) N T(N) T(N) T(N) T/C(N) T(N) T(N) N
88 18228675 A/AA A(N) N A(N) A(N) A(N) A(N) N A(N) N
89 18228676 rs200058745 C/CA C(N) N C(N) C(N) C(N) C(N) N C(N) N
* 90 18228678 A A(N) N A(N) A(N) A(N) ACAACAAA(N) N ACA/AAA(N) N
* 91 18228679 A del(N) N del(N) del(N) del(N) A(N) N A/del(N) N
* 92 18228685 A C(N) N C(N) C(N) C C(N) N A A
* 93 18228686 T C(N) N C(N) C(N) C C(N) N T T
* 94 18228701 C T(N) N T(N) T(N) T T T T(N) T
95 18228713 C C C C C C C C C/T C/T
96 18228716 rs112556119 T/C T T T T T T T T T
97 18228721 rs190996104 G/T G G G G G G G G G
98 18228727 rs530022558 G/A G G G G G G G G/A G
* 99 18228732 T T T T T T T T C C
100 18228736 A A A A A A A A A/G A
101 18228738 rs543440451 T/G T T T T T T T T T
102 18228746 T T T T T T T T T/C T
* 103 18228771 T C C/T C/T C/T C T T T T
104 18228783 A A A A A A A A A/G(N) A
105 18228788 G G G G G G G/C G G G
* 106 18228810 T T T T T T T T G G
107 18228829 T/C T T T T T T T T T
* 108(11) 18228838 A A A A A A A/G(N) G(N) A A
* 109(11) 18228841 T T T T T T T/G(N) G(N) T T
110 18228851 rs73666630 T/C T T T T T T T T T
* 111(11) 18228852 C C C C C C C/T(N) T(N) C C
* 112 18228853 T T T T T T T T C C
* 113 18228854 G A A A A A A A A A
* 114 18228857 A T T T T T A A A A
* 115 18228867 T T T T T T C C(N) C C
116 18228869 T/G T T T T T T T T T
* 117(12) 18228876 C C C C C C G G(N) G(N) N
118 18228878 T T T T T T A/T(N) T T(N) N
119 18228880 rs532708242 G/C G G G G G G G G(N) N
* 120 18228886 T T T T T T T T del(N) N
* 121 18228889 A A A A A A A A G G
* 122 18228917 G G G G G G T T T T
* 123 18228927 T T T T T T T T C C
* 124 18228939 A T T T T T T T T T
* 125 18228943 T T T T T T T T C C
* 126 18228944 T C C C C C C C T T
* 127 18228949 G G G G G G A A A A
128 18228959 T T T/C T T T T/C(N) T T T
* 129 18228960 G A A A A A A A G G
130 18228967 C C C C C C C C C C/T
131 18228971 G G G G G G G/A(N) G G G
132(13) 18228999 rs145709578 G/C G G G G G G G G(N) N
* 133(11) 18229005 A A A A A A A/C(N) C A A
* 134 18229011 A A A A A A G G A A
* 135 18229016 T T T T T T C C T T
136 18229020 rs2172426 C/T T T T T T T T T T
137 18229034 rs528699467 C/A C C C C C C C C C
138 18229048 A A A A A A A/G(N) A A A
* 139 18229050 C T T T T T T T T T
140 18229051 rs565728206 G/A G G G G G G G G G
* 141 18229057 C G G G G A C C C C
* 142 18229076 C C C C C C C C T T
143 18229077 C C C C C C C C C(N) C/T
144 18229085 T T T T T T T/A T T T
* 145 18229087 T C C C C C T T T T
146 18229098 A/G A A A A A A A A A
147 18229103 A A A A A/T A A A A A
148 18229104 rs74444655 T/C T T T T T T T T(N) T
149 18229105 A/C A A A A A A A A(N) A
* 150 18229106 T T T T T T T T C C
151 18229111 A A A A A A A A A/G A

1 Human polymorphisms are those recorded by the consensus gene nomenclature of human NAT alleles (http://nat.mbg.duth.gr/), complemented with haplotype data from 1000 Genomes Phase 1 data (The 1000 Genomes Project Consortium 2012), Patin et al. (2006a) and Mortensen et al. (2011), as well as 1000 Genomes SNPs and indels reported for GRCh37.p13 (hg19) in Ensembl (Yates et al. 2016) (http://grch37.ensembl.org/Homo_sapiens/Info/Index, accessed October 2016). Human positions not recorded as polymorphic but associated with a SNP identifier are reported with a highest population MAF < 0.01 in Ensembl (http://www.ensembl.org/Homo_sapiens/Info/Index). Note that the list is not exhaustive: a polymorphic position (A/G) at 18’228’182 is reported for Denisova in the ancient genome browser of the Max Planck Institute for Evolutionary Anthropology (see Supplementary Table S8); it is not recorded in the table since all humans and other hominoids are fixed on A. An additional polymorphic position (T/C) at 18’228’748, which defines the Pan NATP*13 haplotype, is not recorded in the table either, as it was only observed in the hybrid P. t. verus/troglodytes individual (all humans and other hominids are fixed on T).
2 Western, Niger-Cameroon, Eastern and Central chimpanzees are Pan troglodytes verus, P. t. ellioti, P. t. schweinfurthii and P. t. troglodytes, respectively. Polymorphism recording is based on the individuals of the present study (Supplementary Figure S1) and the chimpanzees of Prado-Martinez et al. (2013), cross-checked with the P. t. verus assembly reference sequence (panTro4, February 2011).
3 Based on the individual of this study (Bonobo), the bonobos of Prado-Martinez et al. (2013) and the Pan paniscus draft assembly reference sequence (panPan1, May 2012).
4 Based on the individuals of this study, the gorillas of Prado-Martinez et al. (2013) and the Gorilla gorilla gorilla draft assembly reference sequence (gorGor4, December 2014).
5 Based on the individuals of this study, the orangutans of Prado-Martinez et al. (2013) and the Pongo pygmaeus abelii draft assembly reference sequence (ponAbe2, July 2007).
6 All individuals from the Pan (troglodytes and paniscus), Gorilla and Pongo species and subspecies sequenced by Prado-Martinez et al. (2013) have undefined positions in the stretch between 18’228’271 and 18’228’282 (included). For this reason, positions in the stretch 18’228’271-18’228’282 were not considered for median-joining network construction.
7 We speculate that a T insertion at position 18'228'278 probably occurred on the human lineage, as it was absent in all Sanger-sequenced samples of chimpanzees, bonobos, gorillas and orangutans in this study (not represented in the median-joining network, see footnote 6).
8 All individuals from the Pan troglodytes and Pongo abelii species and subspecies sequenced by Prado-Martinez et al. (2013) have undefined positions in the stretch between 18’228’567 and 18’228’581 (included). For this reason, positions in the stretch 18’228’567-18’228’581 were not considered for median-joining network construction.
9 From position 18’228’670 onwards, all individuals from the Pan (troglodytes and paniscus) and Pongo abelii species and subspecies sequenced by Prado-Martinez et al. (2013) have undefined positions: up to 18’228’701 (included) for P. troglodytes, up to 18’228’681 (included) for P. paniscus and up to 18’228’682 (included) for Pongo abelii. For Gorilla, positions 18’228’668 and 18’228’669 are undefined, as well as all positions between 18’228’672 and 18’228’689 (included). For P. pygmaeus, all positions between 18’228’672 and 18’228’682 (included) are undefined. Polymorphisms in this region were detected for the individuals sequenced in this study. However, these are all indel polymorphisms of repetitive motifs (CA, CAA, etc.) and we chose not to consider the stretch running from 18’228’670 to 18’228’701 for median-joining network construction.
10 At position 18’228’678, an insertion of a CAACAAA motif was observed in all Sanger-sequenced gorillas, and an insertion of a CA/AA polymorphic motif in all Sanger-sequenced orangutans. In the latter, a T deletion was also found in all Sanger-sequenced individuals at position 18’228’886 (not represented in the median-joining network, see footnote 9).
11 Position 18’228’838 is polymorphic (A/G) in Gorilla gorilla, but not in Gorilla beringei, for which only nucleotide G (or N) is reported. At this position, the gorilla reference sequence (gorGor4) indicates nucleotide A, similarly to the human, chimpanzee, bonobo and orangutan reference sequences (hg19, panTro4, panPan1, ponAbe2). The same ambiguous pattern is found at positions 18’228’841, 18’228’852 and 18’229’005, although for the latter two, the gorilla reference sequence differs from that of the other species (at position 18’228’852, gorGor4 is T, whereas it is C in all other reference sequences, and at position 18’229’005, gorGor4 is C, whereas it is A in all other reference sequences). Thus none of these four positions, 18’228’838, 18’228’841, 18’228’852 and 18’229’005, were considered for median-joining network construction.
12 All Pongo individuals sequenced by Prado-Martinez et al. (2013) have undefined positions between 18’228’876 and 18’228’887 (included). For this reason, positions in the stretch 18’228’876-18’228’887 were not considered for median-joining network construction.
13 All Pongo individuals sequenced by Prado-Martinez et al. (2013) have undefined positions between 18’228’993 and 18’229’003 (included). For this reason, positions in the stretch 18’228’993-18’229’003 were not considered for median-joining network construction.

References
Mortensen HM, Froment A, Lema G, Bodo JM, Ibrahim M, Nyambo TB, Omar SA, and Tishkoff SA. 2011. Characterization of genetic variation and natural selection at the arylamine N-acetyltransferase genes in global human populations. Pharmacogenomics 12(11):1545-1558.
Patin E, Barreiro LB, Sabeti PC, Austerlitz F, Luca F, Sajantila A, Behar DM, Semino O, Sakuntabhai A, Guiso N et al. 2006a. Deciphering the ancient and complex evolutionary history of human arylamine N-acetyltransferase genes. American Journal of Human Genetics 78(3):423-436.
Prado-Martinez J, Sudmant PH, Kidd JM, Li H, Kelley JL, Lorente-Galdos B, Veeramah KR, Woerner AE, O'Connor TD, Santpere G et al. 2013. Great ape genetic diversity and population history. Nature 499(7459):471-475.
The 1000 Genomes Project Consortium. 2012. An integrated map of genetic variation from 1,092 human genomes. Nature 491:56.
Yates A, Akanni W, Amode MR, Barrell D, Billis K, Carvalho-Silva D, Cummins C, Clapham P, Fitzgerald S, Gil L et al. 2016. Ensembl 2016. Nucleic Acids Res 44(D1):D710-D716.

Supplementary Table S6. AMOVA ΦST estimates for the three NAT genes between pairs of Pan species and sub-species.

Except for the analyses involving the San Diego sample, all results are shown as ΦST values, with associated P-values in brackets (adjusted P-values after slash). For analyses involving the San Diego sample, the proportion of significant P-values associated with the 122 sub-samples is reported. Significant values are shown in bold.

| Gene | Sample | P. t. verus (San Diego sample) | P. t. ellioti | P. t. troglodytes | P. t. schweinfurthii | P. paniscus |
|---|---|---|---|---|---|---|
| NAT1 | P. t. verus (BPRC sample) | 0 % | 0.068 (0.04 / 0.24) | 0.026 (0.26 / 0.78) | 0.059 (0.06 / 0.30) | 0.332 (<0.00001 / 0) |
| NAT1 | P. t. verus (San Diego sample) | | 0 % | 0 % | 0 % | 100 % |
| NAT1 | P. t. ellioti | | | 0 (0.35 / 0.78) | 0.032 (0.19 / 0.76) | 0.315 (<0.00001 / 0) |
| NAT1 | P. t. troglodytes | | | | 0 (0.76 / 0.78) | 0.271 (0.001 / 0) |
| NAT1 | P. t. schweinfurthii | | | | | 0.270 (0.0004 / 0) |
| NAT2 | P. t. verus (BPRC sample) | 0 % | 0.739 (<0.00001 / 0) | 0.730 (<0.00001 / 0) | 0.781 (<0.00001 / 0) | 0.915 (<0.00001 / 0) |
| NAT2 | P. t. verus (San Diego sample) | | 100 % | 100 % | 100 % | 100 % |
| NAT2 | P. t. ellioti | | | 0.675 (<0.00001 / 0) | 0.722 (<0.00001 / 0) | 0.890 (<0.00001 / 0) |
| NAT2 | P. t. troglodytes | | | | 0 (0.71 / 0.71) | 0.778 (<0.00001 / 0) |
| NAT2 | P. t. schweinfurthii | | | | | 0.775 (<0.00001 / 0) |
| NATP | P. t. verus (BPRC sample) | 75.41 % | 0.418 (<0.00001 / 0) | 0.483 (<0.00001 / 0) | 0.315 (<0.00001 / 0) | 0.951 (<0.00001 / 0) |
| NATP | P. t. verus (San Diego sample) | | 100 % | 100 % | 100 % | 100 % |
| NATP | P. t. ellioti | | | 0.035 (0.20 / 0.40) | 0.024 (0.24 / 0.40) | 0.879 (<0.00001 / 0) |
| NATP | P. t. troglodytes | | | | 0.093 (0.07 / 0.21) | 0.919 (<0.00001 / 0) |
| NATP | P. t. schweinfurthii | | | | | 0.940 (<0.00001 / 0) |

Supplementary Table S7. Diversity of NAT genes in human populations and results of Hardy-Weinberg equilibrium tests.

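NAT1

| Region | ID | Population | n¹ | k¹ | S¹ | h¹ | π (× 10⁻³)¹ | HW test²: Ho | HW test²: P-value |
|---|---|---|---|---|---|---|---|---|---|
| Sub-Saharan Africa | 2 | Yoruba in Ibadan, Nigeria³ | 88 | 3 | 3 | 0.034 | 0.063 | 0.034 | >0.999 |
| Sub-Saharan Africa | 4 | Luhya in Webuye, Kenya³ | 97 | 4 | 5 | 0.061 | 0.147 | 0.062 | >0.999 |
| Sub-Saharan Africa | 5 | Maasai of Tanzania⁴ | 16 | 3 | 4 | 0.123 | 0.277 | 0.125 | >0.999 |
| Sub-Saharan Africa | 6 | Turu of Tanzania⁴ | 16 | 3 | 3 | 0.123 | 0.208 | 0.125 | >0.999 |
| Sub-Saharan Africa | 7 | Burunge of Tanzania⁴ | 17 | 2 | 1 | 0.059 | 0.065 | 0.059 | >0.999 |
| Sub-Saharan Africa | 8 | Hadza of Tanzania⁴ | 16 | 3 | 2 | 0.280 | 0.551 | 0.188 | 0.305 |
| Sub-Saharan Africa | 9 | Sandawe of Tanzania⁴ | 19 | 5 | 6 | 0.292 | 0.567 | 0.316 | >0.999 |
| Sub-Saharan Africa | 10 | Biaka Pygmy of C.A.R. (CEPH)⁴ | 15 | 3 | 2 | 0.131 | 0.148 | 0.133 | >0.999 |
| East Asia | 11 | Han Chinese in Beijing, China³ | 97 | 6 | 6 | 0.051 | 0.080 | 0.052 | >0.999 |
| East Asia | 12 | Southern Han Chinese³ | 100 | 2 | 1 | 0.010 | 0.011 | 0.010 | >0.999 |
| East Asia | 13 | Japanese in Tokyo, Japan³ | 89 | 2 | 3 | 0.022 | 0.074 | 0.022 | >0.999 |
| East Asia | 14 | Cambodian (CEPH)⁴ | 20 | 1 | 0 | 0.000 | 0.000 | n.a.⁶ | n.a.⁶ |
| Europe | 15 | Toscani in Italia³ | 98 | 5 | 6 | 0.127 | 0.318 | 0.122 | 0.116 |
| Europe | 16 | Utah Residents (CEPH) with N&W European Ancestry³ | 85 | 5 | 6 | 0.124 | 0.292 | 0.129 | >0.999 |
| Europe | 17 | British in England and Scotland³ | 89 | 5 | 6 | 0.139 | 0.349 | 0.146 | >0.999 |
| Europe | 18 | Finnish in Finland³ | 93 | 5 | 6 | 0.104 | 0.141 | 0.108 | >0.999 |
| America | 19 | Mexican Ancestry from Los Angeles USA³ | 66 | 3 | 4 | 0.045 | 0.117 | 0.045 | >0.999 |
| America | 20 | Puerto Ricans from Puerto Rico³ | 55 | 3 | 4 | 0.036 | 0.081 | 0.036 | >0.999 |
| America | 21 | Colombians from Medellin, Colombia³ | 60 | 3 | 4 | 0.081 | 0.200 | 0.083 | >0.999 |
| America | 22 | Americans of African Ancestry in SW USA³ | 61 | 4 | 3 | 0.049 | 0.054 | 0.049 | >0.999 |
| | | World average for NAT1 | 59.85 | 3.50 | 3.75 | 0.095 | 0.187 | | |
| | | (s.d.)⁷ | (34.78) | (1.32) | (1.92) | (0.079) | (0.162) | | |

NAT2

| Region | ID | Population | n¹ | k¹ | S¹ | h¹ | π (× 10⁻³)¹ | HW test²: Ho | HW test²: P-value |
|---|---|---|---|---|---|---|---|---|---|
| Sub-Saharan Africa | 2 | Yoruba in Ibadan, Nigeria³ | 88 | 20 | 16 | 0.890 | 2.200 | 0.864 | 0.304 |
| Sub-Saharan Africa | 3 | Mandenka of Senegal⁵ | 97 | 16 | 12 | 0.837 | 2.316 | 0.814 | 0.526 |
| Sub-Saharan Africa | 4 | Luhya in Webuye, Kenya³ | 97 | 19 | 15 | 0.829 | 2.451 | 0.887 | 0.444 |
| Sub-Saharan Africa | 6 | Turu of Tanzania⁴ | 15 | 13 | 11 | 0.881 | 2.435 | 0.933 | 0.965 |
| Sub-Saharan Africa | 7 | Burunge of Tanzania⁴ | 17 | 10 | 10 | 0.831 | 2.582 | 0.882 | 0.284 |
| Sub-Saharan Africa | 9 | Sandawe of Tanzania⁴ | 18 | 10 | 9 | 0.824 | 2.308 | 0.722 | 0.263 |
| Sub-Saharan Africa | 10 | Biaka Pygmy of C.A.R. (CEPH)⁴ | 15 | 11 | 10 | 0.821 | 1.563 | 0.867 | 0.993 |
| East Asia | 11 | Han Chinese in Beijing, China³ | 97 | 6 | 7 | 0.576 | 1.087 | 0.588 | 0.763 |
| East Asia | 12 | Southern Han Chinese³ | 100 | 7 | 7 | 0.652 | 1.275 | 0.660 | 0.197 |
| East Asia | 13 | Japanese in Tokyo, Japan³ | 89 | 5 | 7 | 0.538 | 1.030 | 0.461 | 0.250 |
| Europe | 15 | Toscani in Italia³ | 98 | 9 | 8 | 0.707 | 2.172 | 0.643 | 0.058 |
| Europe | 16 | Utah Residents (CEPH) with N&W European Ancestry³ | 85 | 8 | 8 | 0.723 | 2.156 | 0.659 | 0.471 |
| Europe | 17 | British in England and Scotland³ | 89 | 7 | 7 | 0.728 | 2.218 | 0.640 | 0.042 (0.720) |
| Europe | 18 | Finnish in Finland³ | 93 | 7 | 7 | 0.725 | 2.182 | 0.753 | 0.920 |
| America | 19 | Mexican Ancestry from Los Angeles USA³ | 66 | 10 | 10 | 0.762 | 2.104 | 0.742 | 0.489 |
| America | 20 | Puerto Ricans from Puerto Rico³ | 55 | 10 | 9 | 0.785 | 2.234 | 0.727 | 0.342 |
| America | 21 | Colombians from Medellin, Colombia³ | 60 | 8 | 8 | 0.725 | 2.063 | 0.750 | 0.689 |
| America | 22 | Americans of African Ancestry in SW USA³ | 61 | 17 | 15 | 0.861 | 2.357 | 0.885 | 0.654 |
| | | World average for NAT2 | 68.89 | 10.72 | 9.78 | 0.761 | 2.041 | | |
| | | (s.d.)⁷ | (32.21) | (4.50) | (2.96) | (0.100) | (0.470) | | |

NATP

| Region | ID | Population | n¹ | k¹ | S¹ | h¹ | π (× 10⁻³)¹ | HW test²: Ho | HW test²: P-value |
|---|---|---|---|---|---|---|---|---|---|
| Sub-Saharan Africa | 1 | Dinka of Sudan⁴ | 15 | 11 | 7 | 0.818 | 1.528 | 0.867 | 0.904 |
| Sub-Saharan Africa | 2 | Yoruba in Ibadan, Nigeria³ | 88 | 19 | 14 | 0.846 | 2.540 | 0.841 | 0.220 |
| Sub-Saharan Africa | 4 | Luhya in Webuye, Kenya³ | 97 | 19 | 14 | 0.822 | 2.416 | 0.887 | 0.642 |
| Sub-Saharan Africa | 6 | Turu of Tanzania⁴ | 15 | 7 | 7 | 0.816 | 1.773 | 0.533 | 0.011 (0.211) |
| Sub-Saharan Africa | 7 | Burunge of Tanzania⁴ | 18 | 7 | 6 | 0.781 | 1.286 | 0.778 | 0.792 |
| Sub-Saharan Africa | 9 | Sandawe of Tanzania⁴ | 18 | 10 | 7 | 0.676 | 1.331 | 0.833 | >0.999 |
| Sub-Saharan Africa | 10 | Biaka Pygmy of C.A.R. (CEPH)⁴ | 15 | 10 | 7 | 0.841 | 1.565 | 0.800 | 0.268 |
| East Asia | 11 | Han Chinese in Beijing, China³ | 97 | 8 | 9 | 0.652 | 1.862 | 0.732 | 0.374 |
| East Asia | 12 | Southern Han Chinese³ | 100 | 9 | 10 | 0.665 | 1.914 | 0.630 | 0.216 |
| East Asia | 13 | Japanese in Tokyo, Japan³ | 89 | 10 | 9 | 0.685 | 2.020 | 0.663 | 0.579 |
| Europe | 15 | Toscani in Italia³ | 98 | 10 | 8 | 0.738 | 1.601 | 0.663 | 0.540 |
| Europe | 16 | Utah Residents (CEPH) with N&W European Ancestry³ | 85 | 10 | 10 | 0.735 | 1.740 | 0.706 | 0.413 |
| Europe | 17 | British in England and Scotland³ | 89 | 9 | 8 | 0.707 | 1.590 | 0.708 | 0.165 |
| Europe | 18 | Finnish in Finland³ | 93 | 10 | 9 | 0.752 | 1.839 | 0.806 | 0.665 |
| America | 19 | Mexican Ancestry from Los Angeles USA³ | 66 | 11 | 11 | 0.747 | 1.737 | 0.773 | 0.874 |
| America | 20 | Puerto Ricans from Puerto Rico³ | 55 | 10 | 10 | 0.746 | 1.645 | 0.727 | 0.083 |
| America | 21 | Colombians from Medellin, Colombia³ | 60 | 9 | 8 | 0.768 | 1.862 | 0.783 | 0.620 |
| America | 22 | Americans of African Ancestry in SW USA³ | 61 | 12 | 12 | 0.800 | 2.214 | 0.852 | 0.814 |
| | | World average for NATP | 64.39 | 10.61 | 9.22 | 0.755 | 1.804 | | |
| | | (s.d.)⁷ | (33.68) | (3.31) | (2.34) | (0.061) | (0.334) | | |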

1 Sample size (n); number of haplotypes (k); number of segregating sites (S); expected heterozygosity, equivalent to Nei’s gene diversity (h); nucleotide diversity (π) × 10⁻³.
2 Test for departure from Hardy-Weinberg equilibrium; Ho: observed heterozygosity. Significant departures (P-value < 0.05) from equilibrium are underlined. No significant departure remained after Holm’s correction for multiple testing (P-values in brackets).
3 The 1000 Genomes Project Consortium (2012).
4 Mortensen et al. (2011).
5 Sabbagh et al. (2008).
6 Not applicable (n.a.).
7 Standard deviation (s.d.).
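
The two diversity indices defined in footnote 1 can be illustrated with a minimal sketch. The source does not state which estimators were used, so the code below assumes the usual unbiased forms: gene diversity h = n/(n−1) · (1 − Σ pᵢ²) over gene copies, and nucleotide diversity π as the mean number of pairwise differences per site. The function names, haplotype counts and sequences are hypothetical and are not data from this study.

```python
from itertools import combinations


def gene_diversity(haplotype_counts):
    """Unbiased gene diversity h (expected heterozygosity) from haplotype counts.

    haplotype_counts: number of gene copies observed for each haplotype.
    Assumed estimator: h = n / (n - 1) * (1 - sum(p_i ** 2)).
    """
    n = sum(haplotype_counts)
    sum_sq = sum((c / n) ** 2 for c in haplotype_counts)
    return n / (n - 1) * (1.0 - sum_sq)


def nucleotide_diversity(sequences):
    """Mean number of pairwise differences per site among aligned sequences."""
    length = len(sequences[0])
    n_pairs = len(sequences) * (len(sequences) - 1) // 2
    total_diff = sum(
        sum(a != b for a, b in zip(s1, s2))
        for s1, s2 in combinations(sequences, 2)
    )
    return total_diff / n_pairs / length


# Hypothetical sample: 10 gene copies carrying three haplotypes.
h = gene_diversity([6, 3, 1])
# Hypothetical alignment of four 4-bp sequences.
pi = nucleotide_diversity(["ACGT", "ACGA", "ACGT", "TCGT"])
print(round(h, 3), round(pi, 3))  # 0.6 0.25
```

Multiplying π by 1000 gives values on the × 10⁻³ scale used in the table.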

References
Mortensen HM, Froment A, Lema G, Bodo JM, Ibrahim M, Nyambo TB, Omar SA, and Tishkoff SA. 2011. Characterization of genetic variation and natural selection at the arylamine N-acetyltransferase genes in global human populations. Pharmacogenomics 12(11):1545-1558.
Sabbagh A, Langaney A, Darlu P, Gerard N, Krishnamoorthy R, and Poloni ES. 2008. Worldwide distribution of NAT2 diversity: implications for NAT2 evolutionary history. BMC Genet 9:21.
The 1000 Genomes Project Consortium. 2012. An integrated map of genetic variation from 1,092 human genomes. Nature 491:56.

Supplementary Table S8. NAT genes variant calls in sequenced ancient genomes of the genus Homo.

Variant calls are recorded as compared to the human reference sequence (GRCh37/hg19). For NAT1 and NAT2, we screened 2 kb of sequence in the region on chromosome 8 encompassing the coding exon (positions 18’079’000 to 18’081’000 for NAT1, and 18’257’000 to 18’259’000 for NAT2). For the NATP pseudogene, we screened 2 kb of homologous sequence (18’227’600 to 18’229’600). Non-synonymous changes are shown in bold type. Variants located beyond the stretch sequenced in individuals from this study are indicated in italics.

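NAT1 region (+18’000’000)

| Position | 79124 | 79213 | 79517 | 80001 | 80015 | 80196 | 80644 | 80833 | 80901 | 80923 | 80933 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| hg19 | T | C | A | G | G | T | A | A | C | G | C |
| SNP ID | rs8190856 | rs4986988 | rs4986989 | rs4987076 (A445G) | rs4986990 (G459A) | rs4986783 (G640T) | rs1057126 | rs28359534 | rs8190862 | rs6982949 | rs8190863 |
| Ust’-Ishim | T / C | C / T | A / T | G / A | G / A | T / G | A | A | C / G | A | C / T |
| Altai | C | T | A | A | A | G | A | A | G | A | T |
| Vindija | C | T | T | A | A | G | T¹ | G¹ | G | A | T |
| Denisova | T | C | A | G | G | T | A | A | C | A | C |

NAT2 region (+18’200’000)

| Position | 57795 | 58316 | 58522 |
|---|---|---|---|
| hg19 | C | G | A |
| SNP ID | rs1041983 (C282T) | rs1208 (A803G) | |
| Ust’-Ishim | C | A | A |
| Altai | C / T | A | A |
| Vindija | C | A | A |
| Denisova | C | A² | A/T³ |

NATP region (+18’200’000)

| Position | 27844 | 28182 | 28246 | 28291 | 28616 | 28669 | 29020 | 29104 | 29212 | 29503 | 29533 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| hg19 | C | A | T | C | T | G | C | T | G | C | G |
| SNP ID | rs4487818 | | rs73590295 | rs2898473 | rs35548819 | rs73590297 | rs2172426 | rs74444655 | rs57181121 | rs17126565 | rs17126568 |
| Ust’-Ishim | C | A | T / C | A | T / C | G | T | T / C | G / A | C / G | G / C |
| Altai | C | A | C | A | T | G | T | C | A | G | C |
| Vindija | T¹ | A | C | A | T | C¹ | T | C | A | G | C |
| Denisova | T⁴ | A / G | T | A | T | G | C | T | G | C | G |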

1 Only reported in the high-coverage alignments of the Vindija Neandertal genome but not called as variant in the ancient genome browser from the Max Planck Institute for Evolutionary Anthropology.
2 Apparently reported as invariant (G) with respect to the human reference sequence in the ancient genome browser from the Max Planck Institute for Evolutionary Anthropology, but reported as A in the high-coverage sequence reads of the Denisova genome in the UCSC Genome Browser.
3 Only reported in the high-coverage sequence reads of the Denisova genome in the UCSC Genome Browser.
4 Apparently reported as invariant (C) with respect to the human reference sequence in the ancient genome browser from the Max Planck Institute for Evolutionary Anthropology, but reported as T in the high-coverage sequence reads of the Denisova genome in the UCSC Genome Browser.

Supplementary Table S9. Haplotype counts of the three NAT gene paralogs in the total samples of the genus Pan.

P. t. verus Samples1 CARTA Basel zoo BPRC GAGP P. t. ellioti P. t. troglodytes P. t. schweinfurthii Hybrid P. paniscus Pan NAT1 NAT1*1 41 4 64 5 13 7 8 2 7 NAT1*2 4 3 1 NAT1*3 3 1 1 NAT1*4 3 14 1 NAT1*5 1 NAT1*6 1 15 NAT1*7 2 NAT1*8 1 NAT1*9 1 NAT1*10 4 1 1 NAT1*11 1 NAT1*12 6 Total sample size2 52 4 823 6 20 10 12 2 28 Pan NAT2 NAT2*1 46 4 72 5 1 1 1 NAT2*2 3 8 1 NAT2*3 1 NAT2*4 3 2 8 11 1 NAT2*5 1 NAT2*6 17 1 NAT2*7 25 NAT2*8 1 NAT2*9 2 NAT2*10 1 Total sample size2 52 4 823 6 20 10 12 2 28 Pan NATP NATP*1 13 2 37 1 NATP*2 21 2 33 6 8 1 7 1 NATP*3 2 NATP*4 8 NATP*5 4 NATP*6 2 NATP*7 13 6 NATP*8 3 5 2 1 NATP*9 3 NATP*10 1 NATP*11 2 2 NATP*12 1 NATP*13 1 NATP*14 1 NATP*15 3 NATP*16 1 NATP*17 1 NATP*18 27 NATP*19 1 Total sample size2 52 4 884 6 20 10 12 2 28

1 CARTA: includes the sequences from all 26 DNA samples from the CARTA research center (including those of related individuals); BPRC: includes the sequences from all 40 individuals from the Biomedical Primate Research Centre (BPRC) colony, plus the genotypes deduced from descendants (see footnotes 3 and 4); GAGP: includes all P. t. verus genotypes retrieved from the Great Ape Genome Project (GAGP), except that of Boscoe, already included in San Diego (CARTA); the samples of P. t. troglodytes, P. t. schweinfurthii, and P. paniscus include genotypes retrieved from the GAGP and one DNA sample each that was Sanger sequenced in this study (of which one P. t. schweinfurthii genotype, Harriet, is common to the GAGP and San Diego datasets), whereas the P. t. ellioti sample only includes genotypes retrieved from the GAGP; Hybrid: hybrid P. t. troglodytes and P. t. verus individual (see text).
2 Total sample size in number of chromosomes (2n).
3 Includes the genotype of Izaak, which has been deduced from his children’s genotypes (Bram, Gerda and Yoran); see Figure 1A.
4 Includes the genotypes of Izaak, Gerrit, Tasja and Marga, all deduced from their children’s genotypes; see Figure 1A.

Supplementary Table S10. Gene product counts translated from haplotype counts of the two NAT coding genes in the total samples of the genus Pan.

P. t. verus Samples1 CARTA Basel zoo BPRC GAGP P. t. ellioti P. t. troglodytes P. t. schweinfurthii Hybrid P. paniscus Pan NAT1 *1/*2/*6/*12 45 4 68 5 14 7 8 2 28 *3 3 1 1 *4 3 14 1 *5/*10 1 4 1 1 *7 2 *8 1 *9/*11 2 Total sample size2 52 4 82 6 20 10 12 2 28 Pan NAT2 *1/*3/*4/*5 49 4 74 5 3 9 11 2 *2 3 8 1 *6/*10 17 1 1 *7 25 *8 1 *9 2 Total sample size2 52 4 82 6 20 10 12 2 28

1 Samples as described in Supplementary Table S9. 2 Total sample size in number of chromosomes (2n).

Supplementary Table S11. Mann-Whitney U and Student t tests for equal distributions of diversity levels in NAT genes between humans and Pan.

| Gene | Diversity index | P-value (Mann-Whitney U)¹ | P-value (Student t)¹ |
|---|---|---|---|
| NAT1 | Expected heterozygosity (h) | 0.00090 | 0.00060 |
| NAT1 | Nucleotide diversity (π) | 0.00004 | 0.00060 |
| NAT2 | Expected heterozygosity (h) | 0.00090 | < 0.00001 |
| NAT2 | Nucleotide diversity (π) | 0.00007 | < 0.00001 |
| NATP | Expected heterozygosity (h)² | 0.07 | 0.14 |
| NATP | Nucleotide diversity (π)² | 0.29 | 0.24 |

1 Adjusted P-values of Mann-Whitney U and Student t tests for equality of diversity levels in humans and Pan (including all chimpanzee sub-species and bonobos).
2 Adjusted P-values of Mann-Whitney U and Student t tests for equality of diversity levels in humans and common chimpanzees (P. troglodytes, i.e. not including bonobos): Mann-Whitney U P-value = 0.16 and Student t P-value = 0.28 for h, and Mann-Whitney U P-value = 0.51 and Student t P-value = 0.44 for π, respectively.
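
As an illustration of the rank-based comparison reported in this table, the following is a minimal sketch of a Mann-Whitney U test. It assumes a two-sided P-value from the normal approximation with continuity correction and no correction for ties; the source does not say whether exact or asymptotic P-values were computed. The diversity values and function names are hypothetical.

```python
import math


def mann_whitney_u(x, y):
    """Return (U_x, U_y): U_x counts pairs with x_i > y_j, ties counting 0.5."""
    u_x = sum(
        1.0 if a > b else 0.5 if a == b else 0.0
        for a in x
        for b in y
    )
    u_y = len(x) * len(y) - u_x
    return u_x, u_y


def two_sided_p_normal(u, n1, n2):
    """Two-sided P-value from the normal approximation (no tie correction)."""
    mean = n1 * n2 / 2.0
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = max(abs(u - mean) - 0.5, 0.0) / sd  # continuity correction
    return math.erfc(z / math.sqrt(2.0))


# Hypothetical diversity estimates for two groups of population samples.
group_a = [0.10, 0.20, 0.30]
group_b = [0.40, 0.50, 0.60]
u_a, u_b = mann_whitney_u(group_a, group_b)
p = two_sided_p_normal(min(u_a, u_b), len(group_a), len(group_b))
print(u_a, u_b, round(p, 3))
```

With samples this small an exact test would normally be preferred; the approximation is used here only to keep the sketch short.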

Supplementary Table S12. Results of selective neutrality tests for the three NAT gene paralogs in human populations.

NAT1

| Region | ID | Population | Ewens-Watterson test¹: Fo | Fe | P-value | Tajima’s D test²: D | P-value | Fu’s FS test³: FS | P-value |
|---|---|---|---|---|---|---|---|---|---|
| Sub-Saharan Africa | 2 | Yoruba in Ibadan, Nigeria⁴ | 0.966 | 0.701 | 0.968 | -1.457 | 0.022 (0.210) | -3.173 | 0.005 (0.080) |
| Sub-Saharan Africa | 4 | Luhya in Webuye, Kenya⁴ | 0.939 | 0.606 | 0.970 | -1.627 | 0.010 (0.140) | -3.241 | 0.018 (0.216) |
| Sub-Saharan Africa | 5 | Maasai of Tanzania⁵ | 0.881 | 0.597 | >0.999 (>0.999) | -1.889 | 0.007 (0.112) | -1.192 | 0.141 |
| Sub-Saharan Africa | 6 | Turu of Tanzania⁵ | 0.881 | 0.597 | >0.999 (>0.999) | -1.730 | 0.016 (0.180) | -1.708 | 0.021 |
| Sub-Saharan Africa | 7 | Burunge of Tanzania⁵ | 0.943 | 0.763 | >0.999 (>0.999) | -1.138 | 0.034 | -1.315 | 0.055 |
| Sub-Saharan Africa | 8 | Hadza of Tanzania⁵ | 0.729 | 0.598 | 0.782 | 0.006 | 0.611 | 0.096 | 0.420 |
| Sub-Saharan Africa | 9 | Sandawe of Tanzania⁵ | 0.716 | 0.416 | 0.972 | -1.731 | 0.015 (0.180) | -2.189 | 0.045 |
| Sub-Saharan Africa | 10 | Biaka Pygmy of C.A.R. (CEPH)⁵ | 0.873 | 0.595 | >0.999 (>0.999) | -1.507 | 0.038 | -2.355 | 0.007 (0.098) |
| East Asia | 11 | Han Chinese in Beijing, China⁴ | 0.949 | 0.472 | >0.999 (>0.999) | -1.896 | 0.001 (0.019) | -9.910 | <0.001 (<0.001) |
| East Asia | 12 | Southern Han Chinese⁴ | 0.990 | 0.831 | >0.999 (>0.999) | -0.952 | 0.051 | -2.807 | 0.008 (0.104) |
| East Asia | 13 | Japanese in Tokyo, Japan⁴ | 0.978 | 0.830 | 0.817 | -1.421 | 0.027 | -0.764 | 0.135 |
| East Asia | 14 | Cambodian (CEPH)⁵ | n.a.⁷ | n.a.⁷ | n.a.⁷ | n.a.⁷ | n.a.⁷ | n.a.⁷ | n.a.⁷ |
| Europe | 15 | Toscani in Italia⁴ | 0.874 | 0.530 | 0.959 | -1.466 | 0.035 | -2.614 | 0.131 |
| Europe | 16 | Utah Residents (CEPH) with N&W European Ancestry⁴ | 0.876 | 0.524 | 0.965 | -1.554 | 0.021 (0.210) | -2.953 | 0.052 |
| Europe | 17 | British in England and Scotland⁴ | 0.862 | 0.524 | 0.959 | -1.440 | 0.043 | -2.400 | 0.142 |
| Europe | 18 | Finnish in Finland⁴ | 0.897 | 0.529 | 0.977 (0.747) | -1.799 | 0.003 (0.054) | -5.262 | 0.001 (0.017) |
| America | 19 | Mexican Ancestry from Los Angeles USA⁴ | 0.955 | 0.684 | 0.966 | -1.621 | 0.012 (0.156) | -2.132 | 0.078 |
| America | 20 | Puerto Ricans from Puerto Rico⁴ | 0.964 | 0.680 | >0.999 (>0.999) | -1.764 | 0.003 (0.054) | -2.878 | 0.006 (0.090) |
| America | 21 | Colombians from Medellin, Colombia⁴ | 0.919 | 0.677 | 0.874 | -1.458 | 0.034 | -1.187 | 0.196 |
| America | 22 | Americans of African Ancestry in SW USA⁴ | 0.952 | 0.583 | >0.999 (>0.999) | -1.580 | 0.009 (0.135) | -6.153 | <0.001 (<0.001) |

NAT2

| Region | ID | Population | Ewens-Watterson test¹: Fo | Fe | P-value | Tajima’s D test²: D | P-value | Fu’s FS test³: FS | P-value |
|---|---|---|---|---|---|---|---|---|---|
| Sub-Saharan Africa | 2 | Yoruba in Ibadan, Nigeria⁴ | 0.115 | 0.150 | 0.233 | -0.312 | 0.441 | -5.777 | 0.045 |
| Sub-Saharan Africa | 3 | Mandenka of Senegal⁶ | 0.167 | 0.197 | 0.403 | 0.628 | 0.780 | -2.178 | 0.265 |
| Sub-Saharan Africa | 4 | Luhya in Webuye, Kenya⁴ | 0.175 | 0.164 | 0.688 | 0.165 | 0.635 | -3.752 | 0.134 |
| Sub-Saharan Africa | 6 | Turu of Tanzania⁵ | 0.149 | 0.132 | 0.809 | -0.071 | 0.522 | -4.848 | 0.012 (0.204) |
| Sub-Saharan Africa | 7 | Burunge of Tanzania⁵ | 0.194 | 0.196 | 0.596 | 0.546 | 0.745 | -1.303 | 0.290 |
| Sub-Saharan Africa | 9 | Sandawe of Tanzania⁵ | 0.199 | 0.200 | 0.597 | 0.554 | 0.749 | -1.609 | 0.237 |
| Sub-Saharan Africa | 10 | Biaka Pygmy of C.A.R. (CEPH)⁵ | 0.207 | 0.165 | 0.873 | -0.981 | 0.175 | -5.033 | 0.003 (0.054) |
| East Asia | 11 | Han Chinese in Beijing, China⁴ | 0.427 | 0.469 | 0.480 | 0.024 | 0.573 | 0.822 | 0.689 |
| East Asia | 12 | Southern Han Chinese⁴ | 0.351 | 0.421 | 0.392 | 0.410 | 0.706 | 0.677 | 0.668 |
| East Asia | 13 | Japanese in Tokyo, Japan⁴ | 0.465 | 0.525 | 0.444 | -0.121 | 0.512 | 1.415 | 0.773 |
| Europe | 15 | Toscani in Italia⁴ | 0.297 | 0.342 | 0.441 | 1.705 | 0.951 | 1.557 | 0.759 |
| Europe | 16 | Utah Residents (CEPH) with N&W European Ancestry⁴ | 0.282 | 0.369 | 0.293 | 1.612 | 0.943 | 2.043 | 0.809 |
| Europe | 17 | British in England and Scotland⁴ | 0.276 | 0.414 | 0.162 | 2.229 | 0.982 (0.760) | 3.108 | 0.875 |
| Europe | 18 | Finnish in Finland⁴ | 0.279 | 0.417 | 0.165 | 2.179 | 0.979 (0.760) | 3.083 | 0.872 |
| America | 19 | Mexican Ancestry from Los Angeles USA⁴ | 0.244 | 0.289 | 0.405 | 0.688 | 0.791 | 0.244 | 0.597 |
| America | 20 | Puerto Ricans from Puerto Rico⁴ | 0.222 | 0.278 | 0.329 | 1.131 | 0.884 | 0.243 | 0.594 |
| America | 21 | Colombians from Medellin, Colombia⁴ | 0.281 | 0.347 | 0.351 | 1.279 | 0.907 | 1.343 | 0.744 |
| America | 22 | Americans of African Ancestry in SW USA⁴ | 0.146 | 0.162 | 0.463 | -0.156 | 0.504 | -3.804 | 0.107 |

NATP

| Region | ID | Population | Ewens-Watterson test¹: Fo | Fe | P-value | Tajima’s D test²: D | P-value | Fu’s FS test³: FS | P-value |
|---|---|---|---|---|---|---|---|---|---|
| Sub-Saharan Africa | 1 | Dinka of Sudan⁵ | 0.209 | 0.165 | 0.877 | -0.394 | 0.394 | -5.805 | 0.001 (0.018) |
| Sub-Saharan Africa | 2 | Yoruba in Ibadan, Nigeria⁴ | 0.159 | 0.159 | 0.605 | -0.027 | 0.556 | -4.590 | 0.083 |
| Sub-Saharan Africa | 4 | Luhya in Webuye, Kenya⁴ | 0.183 | 0.164 | 0.730 | -0.163 | 0.503 | -4.741 | 0.077 |
| Sub-Saharan Africa | 6 | Turu of Tanzania⁵ | 0.211 | 0.283 | 0.171 | 0.017 | 0.556 | -0.820 | 0.344 |
| Sub-Saharan Africa | 7 | Burunge of Tanzania⁵ | 0.241 | 0.298 | 0.296 | -0.298 | 0.432 | -1.573 | 0.171 |
| Sub-Saharan Africa | 9 | Sandawe of Tanzania⁵ | 0.343 | 0.200 | 0.975 (0.880) | -0.595 | 0.313 | -4.665 | 0.005 (0.085) |
| Sub-Saharan Africa | 10 | Biaka Pygmy of C.A.R. (CEPH)⁵ | 0.187 | 0.186 | 0.608 | -0.333 | 0.415 | -4.349 | 0.006 (0.096) |
| East Asia | 11 | Han Chinese in Beijing, China⁴ | 0.351 | 0.378 | 0.516 | 0.480 | 0.728 | 1.036 | 0.712 |
| East Asia | 12 | Southern Han Chinese⁴ | 0.338 | 0.344 | 0.581 | 0.237 | 0.651 | 0.526 | 0.642 |
| East Asia | 13 | Japanese in Tokyo, Japan⁴ | 0.319 | 0.306 | 0.646 | 0.724 | 0.797 | 0.002 | 0.559 |
| Europe | 15 | Toscani in Italia⁴ | 0.265 | 0.312 | 0.420 | 0.417 | 0.707 | -0.866 | 0.413 |
| Europe | 16 | Utah Residents (CEPH) with N&W European Ancestry⁴ | 0.270 | 0.304 | 0.464 | -0.314 | 0.437 | -0.686 | 0.442 |
| Europe | 17 | British in England and Scotland⁴ | 0.297 | 0.336 | 0.463 | 0.416 | 0.703 | -0.357 | 0.496 |
| Europe | 18 | Finnish in Finland⁴ | 0.252 | 0.309 | 0.365 | 0.347 | 0.690 | -0.344 | 0.504 |
| America | 19 | Mexican Ancestry from Los Angeles USA⁴ | 0.258 | 0.264 | 0.582 | -0.190 | 0.490 | -1.671 | 0.279 |
| America | 20 | Puerto Ricans from Puerto Rico⁴ | 0.261 | 0.277 | 0.534 | -0.311 | 0.436 | -1.470 | 0.294 |
| America | 21 | Colombians from Medellin, Colombia⁴ | 0.238 | 0.313 | 0.278 | 0.431 | 0.712 | -0.189 | 0.523 |
| America | 22 | Americans of African Ancestry in SW USA⁴ | 0.206 | 0.237 | 0.436 | -0.125 | 0.514 | -1.323 | 0.342 |

1 Ewens-Watterson test for departure from selective neutrality and demographic equilibrium; Fo: observed homozygosity, Fe: expected homozygosity; the P-value is given as the proportion of random Fe values generated under the neutral equilibrium model that are smaller than, or equal to the observed Fo value. Significant departures (P-value < 0.025 or > 0.975) are underlined. Significant departures after Holm’s correction for multiple testing (P-values in brackets) are shown in bold.
2 Tajima’s D test for departure from selective neutrality and demographic equilibrium; the P-value is given as the proportion of random D values generated under the neutral equilibrium model that are smaller than, or equal to the observed D value. Significant departures (P-value < 0.025 or > 0.975) are underlined. Significant departures after Holm’s correction for multiple testing (P-values in brackets) are shown in bold.
3 Fu’s FS test for departure from selective neutrality and demographic equilibrium; the P-value is given as the proportion of random FS values generated under the neutral equilibrium model that are smaller than, or equal to the observed FS value. Significant departures (P-value < 0.02) are underlined. Significant departures after Holm’s correction for multiple testing (P-values in brackets) are shown in bold.
4 The 1000 Genomes Project Consortium (2012).
5 Mortensen et al. (2011).
6 Sabbagh et al. (2008).
7 Not applicable (n.a.).

References
Mortensen HM, Froment A, Lema G, Bodo JM, Ibrahim M, Nyambo TB, Omar SA, and Tishkoff SA. 2011. Characterization of genetic variation and natural selection at the arylamine N-acetyltransferase genes in global human populations. Pharmacogenomics 12(11):1545-1558.
Sabbagh A, Langaney A, Darlu P, Gerard N, Krishnamoorthy R, and Poloni ES. 2008. Worldwide distribution of NAT2 diversity: implications for NAT2 evolutionary history. BMC Genet 9:21.
The 1000 Genomes Project Consortium. 2012. An integrated map of genetic variation from 1,092 human genomes. Nature 491:56.
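
The bracketed P-values in Supplementary Table S12 are described as Holm-corrected. The sketch below assumes the standard Holm step-down procedure applied to the 19 Tajima’s D P-values reported for NAT1 (all populations except the Cambodian sample, for which the test is not applicable); how the tests were grouped is an assumption, not something the source states. The function name is hypothetical. Under this assumption the adjusted values match those printed in brackets in the table.

```python
def holm_adjust(p_values):
    """Holm step-down adjusted P-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The k-th smallest P-value is multiplied by (m - k + 1), capped at 1.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity: adjusted values never decrease with rank.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Tajima's D P-values for NAT1, in the population order of Table S12.
nat1_tajima_p = [
    0.022, 0.010, 0.007, 0.016, 0.034, 0.611, 0.015, 0.038,  # Sub-Saharan Africa
    0.001, 0.051, 0.027,                                      # East Asia
    0.035, 0.021, 0.043, 0.003,                               # Europe
    0.012, 0.003, 0.034, 0.009,                               # America
]

for p, adj in zip(nat1_tajima_p, holm_adjust(nat1_tajima_p)):
    print(f"{p:.3f} -> {adj:.3f}")
```

For instance, the smallest P-value (0.001, Han Chinese in Beijing) is multiplied by 19 to give 0.019, and the two P-values of 0.003 both receive 0.054 because of the monotonicity step.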

Supplementary Table S13. Nucleotide diversity of the three NAT gene paralogs in gorillas and orangutans.

| Gene | Statistic | Gorillas: Gorilla gorilla | Gorillas: Gorilla beringei | Orangutans: Pongo abelii | Orangutans: Pongo pygmaeus |
|---|---|---|---|---|---|
| NAT1 | Total (2N chromosomes) | 66 | 6 | 24 | 10 |
| NAT1 | Number of used positions¹ | 903 | 891 | 903 | 902 |
| NAT1 | Number of segregating sites (S) | 3 | 2 | 15 | 3 |
| NAT1 | Nucleotide diversity (π) × 10⁻³ | 0.61 | 0.82 | 3.94 | 1.36 |
| NAT2 | Total (2N chromosomes) | 64 | 6 | 24 | 10 |
| NAT2 | Number of used positions¹ | 1’116 | 1’083 | 1’115 | 1’114 |
| NAT2 | Number of segregating sites (S) | 12 (+1 InDel)² | 1 | 8 | 7 |
| NAT2 | Nucleotide diversity (π) × 10⁻³ | 1.72 | 0.31 | 1.65 | 1.40 |
| NATP | Total (2N chromosomes) | 64 | 6 | 26 | 10 |
| NATP | Number of used positions¹ | 1’008 | 950 | 1’002 | 953 |
| NATP | Number of segregating sites (S) | 13 | 0 | 15 (+1 InDel)² | 7 |
| NATP | Nucleotide diversity (π) × 10⁻³ | 2.77 | 0 | 3.76 | 2.29 |

1 Number of successfully sequenced positions in common between the GAGP data and our Sanger sequencing results for G. gorilla and P. pygmaeus (only GAGP data available for G. beringei and P. abelii).
2 Insertions and deletions (InDels) were not considered in the computation of nucleotide diversity.

Supplementary Table S14. Global linkage disequilibrium (GLD) between pairs of NAT loci estimated in the two Western chimpanzee (P. t. verus) samples and in human population samples.

| Pair of NAT loci | BPRC: N¹ | BPRC: P-value | San Diego: N¹ | San Diego: Proportion of sub-samples in significant GLD² | San Diego: Variation of P-values² | Humans: N³ | Humans: Number of human samples in significant GLD⁴ |
|---|---|---|---|---|---|---|---|
| NAT1~NAT2 | 23 | 1 | 18 | 0 % | [0.24 – 0.8] | 17 | 1 (1) |
| NAT1~NATP | 23 | 0.051 | 18 | 0 % | [0.24 – 0.92] | 17 | 2 (0) |
| NATP~NAT2 | 23 | 0.004 | 18 | 34.42 % | [0.026 – 0.27] | 17 | 14 (13) |

1 Number of individuals in the sample. 2 Among the 122 sub-samples of the San Diego sample. 3 Number of human population samples tested. 4 In brackets, number of samples in significant GLD after correction for multiple testing.

Supplementary Table S15. Predictions of the effect of substitutions in human NAT1 and NAT2 coding sequences according to PolyPhen, SIFT and PANTHER cSNP Scoring compared to the phenotype assignments of the official nomenclature of human NAT alleles (http://nat.mbg.duth.gr).

A description of the scores and predictions returned by each of the three online software tools is provided in Supplementary File S1.

| Haplotype | Substitution: cDNA | Substitution: protein | Phenotype assignment (http://nat.mbg.duth.gr) | PolyPhen: Score¹ | PolyPhen: Prediction² | SIFT: Score³ | SIFT: Prediction⁴ | PANTHER cSNP Scoring: PSEP⁵ | PANTHER cSNP Scoring: Prediction⁶ |
|---|---|---|---|---|---|---|---|---|---|
| NAT1*4 (reference) | | | | | | | | | |
| NAT1*17 | C190T | R64W | Lower than NAT1*4 | 1 (0.00-1.00) | PRD | 0 (3.07, 81) | A | 4200 | PRD |
| NAT1*22 | A752T | D251V | Lower than NAT1*4 | 1 (0.00-1.00) | PRD | 0 (3.07, 80) | A | 455 | PRD |
| NAT1*14B | G560A | R187Q | Lower than NAT1*4 | 0.772 (0.85-0.92) | POD | 0.54 (3.07, 81) | T | 455 | PRD |
| NAT1*21 | A613G | M205V | Equivalent to NAT1*4 | 0 (1.00-0.00) | B | 1 (3.07, 81) | T | 91 | B |
| NAT1*24 | G781A | E261K | Equivalent to NAT1*4 | 0.008 (0.96-0.76) | B | 0.03 (3.07, 80) | A | 455 | PRD |
| NAT1*25 | A787G | I263V | Equivalent to NAT1*4 | 0 (1.00-0.00) | B | 0.95 (3.07, 80) | T | 220 | POD |
| NAT1*30 | G445A | V149I | Unknown | 0 (1.00-0.00) | B | 1 (3.07, 81) | T | 6 | B |
| NAT2*4 (reference) | | | Rapid | | | | | | |
| NAT2*12A | A803G | K268R | Rapid | 0.011 (0.96-0.78) | B | 0.23 (3.24, 51) | T | 176 | B |
| NAT2*19 | C190T | R64W | Slow | 1 (0.00-1.00) | PRD | 0 (3.08, 51) | A | 4200 | PRD |
| NAT2*14A | G191A | R64Q | Slow | 1 (0.00-1.00) | PRD | 0 (3.18, 53) | A | 4200 | PRD |
| NAT2*5D | T341C | I114T | Slow | 0.359 (0.90-0.89) | B | 0.04 (3.18, 53) | A | 220 | POD |
| NAT2*17 | A434C | Q145P | Slow | 1 (0.00-1.00) | PRD | 0 (3.18, 54) | A | 456 | PRD |
| NAT2*6B | G590A | R197Q | Slow | 1 (0.00-1.00) | PRD | 0.01 (3.18, 54) | A | 842 | PRD |
| NAT2*18 | A845C | K282T | Rapid | 0.999 (0.14-0.99) | PRD | 0.03 (3.39, 46) | A | 220 | POD |
| NAT2*7A | G857A | G286E | Slow⁷ | 0.317 (0.90-0.89) | B | 0.62 (3.69, 42) | T | 31 | B |
| NAT2*10 | G499A | E167K | Slow⁷ | 0.001 (0.99-0.15) | B | 1 (3.18, 54) | T | 176 | B |
| NAT2*13B | C578T (C282T)⁸ | T193M (Y94Y)⁸ | Unknown⁹ | 1 (0.00-1.00) | PRD | 0 (3.18, 54) | A | 456 | PRD |

¹ PolyPhen score: probability that a substitution is damaging; sensitivity and specificity in brackets.
² PolyPhen prediction: “benign” (B), “possibly damaging” (POD), “probably damaging” (PRD).
³ SIFT score: probability that a substitution is tolerated; median sequence information and number of sequences used for the prediction in brackets.
⁴ SIFT prediction: “tolerated” (T), “affect protein function” (A).
⁵ PANTHER cSNP Scoring PSEP (position-specific evolutionary preservation): length of time (in millions of years) over which a position has been preserved.
⁶ PANTHER cSNP Scoring prediction: “probably damaging” (PRD), “possibly damaging” (POD), “probably benign” (B).
⁷ The official nomenclature of human NAT2 alleles reports the phenotype as “Slow, Substrate dependent ?”.
⁸ The official nomenclature of human NAT2 alleles lists the synonymous SNP C282T (Y94Y) as the signature mutation of all NAT2*13 haplotypes.
⁹ No phenotype assignment is provided in the official nomenclature of human NAT2 alleles; the substitution has been predicted as damaging with in-silico tools by Patin et al. (2006b).

Reference

Patin E, Harmant C, Kidd KK, Kidd J, Froment A, Mehdi SQ, Sica L, Heyer E, and Quintana-Murci L. 2006. Sub-Saharan African coding sequence variation and haplotype diversity at the NAT2 gene. Hum Mutat 27(7):720.

Supplementary Table S16. Predictions of the effect of substitutions between inferred hominid ancestral NAT1 and NAT2 coding sequences and Pan basal and human reference haplotypes according to PolyPhen, SIFT and PANTHER cSNP Scoring.

A description of the scores and predictions returned by each of the three online software tools is provided in Supplementary File S1. Inferred ancestral haplotypes are those reconstructed for the theoretical hominid common ancestor at the center of the NAT1 and NAT2 networks (Supplementary Figures S1 and S2 for NAT1 and NAT2, respectively). For NAT1, two ancestral haplotypes were considered: ancestral_1, bearing C at position 583, and ancestral_2, bearing A at position 583.

| Haplotypes | cDNA substitution | Protein substitution | PolyPhen score¹ | PolyPhen prediction² | SIFT score³ | SIFT prediction⁴ | PANTHER cSNP Scoring PSEP⁵ | PANTHER cSNP Scoring prediction⁶ |
|---|---|---|---|---|---|---|---|---|
| NAT1 | | | | | | | | |
| NAT1 ancestral_1 to Pan troglodytes | C529A | H177N | 0 (1.00-0.00) | B | 1.00 (3.07, 81) | T | Not scored (invalid substitution)⁷ | |
| NAT1 ancestral_1 to NAT1 ancestral_2 | C583A | Q195K | 0.001 (0.99-0.15) | B | 0.06 (3.07, 81) | T | 30 | B |
| NAT1 ancestral_2 to NAT1 ancestral_1 | A583C | K195Q | 0 (1.00-0.00) | B | 0.17 (3.07, 81) | T | Not scored (invalid substitution)⁷ | |
| NAT1 ancestral_2 to Homo sapiens | A138T | E46D | 0.177 (0.92-0.87) | B | 0.01 (3.07, 81) | A | 908 | PRD |
| | G826C | E276Q | 0.046 (0.94-0.83) | B | 0.24 (3.07, 78) | T | 908 | PRD |
| NAT2 | | | | | | | | |
| NAT2 ancestral to Pan troglodytes | T293C | V98A | 0 (1.00-0.00) | B | 0.91 (3.07, 50) | T | 6 | B |
| | T664C | F222L | 0.009 (0.96-0.77) | B | 0.46 (3.07, 51) | T | 176 | B |
| | C345A | D115E | 0.001 (0.99-0.15) | B | 0.2 (3.07, 50) | T | 6 | B |
| | G443C | C148S | 0 (1.00-0.00) | B | 0.34 (3.07, 51) | T | 6 | B |
| | A595G | I199V | 0.019 (0.95-0.80) | B | 0.45 (3.07, 51) | T | 6 | B |
| NAT2 ancestral to Homo sapiens | C451T | R151C | 0.980 (0.75-0.96) | PRD | 0 (3.07, 51) | A | Not scored (invalid substitution)⁷ | |
| | C511A | P171T | 0.188 (0.92-0.87) | B | 0.27 (3.07, 51) | T | Not scored (invalid substitution)⁷ | |
| | C573A | F191L | 0 (1.00-0.00) | B | 0.47 (3.07, 51) | T | Not scored (invalid substitution)⁷ | |
| | G834T | K278N | 1.00 (0.00-1.00) | PRD | 0 (3.08, 49) | A | Not scored (invalid substitution)⁷ | |

¹ PolyPhen score: probability that a substitution is damaging; sensitivity and specificity in brackets.
² PolyPhen prediction: “benign” (B), “possibly damaging” (POD), “probably damaging” (PRD).
³ SIFT score: probability that a substitution is tolerated; median sequence information and number of sequences used for the prediction in brackets.
⁴ SIFT prediction: “tolerated” (T), “affect protein function” (A).
⁵ PANTHER cSNP Scoring PSEP (position-specific evolutionary preservation): length of time (in millions of years) over which a position has been preserved.
⁶ PANTHER cSNP Scoring prediction: “probably damaging” (PRD), “possibly damaging” (POD), “probably benign” (B).
⁷ PANTHER cSNP Scoring returns the message “Not scored (invalid substitution)” when the input reference amino acid (e.g. H in H177N) does not match the amino acid at that site in the best-match sequence in the PANTHER library; this is likely due to a partial mismatch between the user-input sequence and the best match in PANTHER.
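
As an illustration of how the SIFT and PANTHER prediction codes in Supplementary Tables S15 and S16 relate to the numerical scores, the following minimal Python sketch maps a score to its code. The cutoffs are assumptions based on the thresholds commonly documented for these tools (SIFT score of 0.05 or less taken as "affect protein function"; PSEP above 450 million years as "probably damaging", between 200 and 450 as "possibly damaging", below 200 as "probably benign"); they are not stated in this supplement, and the function names are hypothetical. PolyPhen is left out because its class boundaries depend on the classifier model used.

```python
# Sketch only: map SIFT scores and PANTHER PSEP values to the codes used in
# Tables S15 and S16. All cutoffs below are ASSUMED from the tools' commonly
# documented thresholds; they are not given in this supplement.

SIFT_CUTOFF = 0.05     # assumed: score <= 0.05 -> "affect protein function" (A)
PSEP_PROBABLY = 450.0  # assumed: PSEP > 450 My -> "probably damaging" (PRD)
PSEP_POSSIBLY = 200.0  # assumed: 200 < PSEP <= 450 My -> "possibly damaging" (POD)


def sift_prediction(score: float) -> str:
    """Return 'A' (affect protein function) or 'T' (tolerated)."""
    return "A" if score <= SIFT_CUTOFF else "T"


def panther_prediction(psep_my: float) -> str:
    """Return 'PRD', 'POD' or 'B' from a PSEP value in millions of years."""
    if psep_my > PSEP_PROBABLY:
        return "PRD"
    if psep_my > PSEP_POSSIBLY:
        return "POD"
    return "B"


if __name__ == "__main__":
    # (haplotype, SIFT score, PSEP) taken from Supplementary Table S15
    rows = [("NAT1*24", 0.03, 455), ("NAT1*25", 0.95, 220), ("NAT2*12A", 0.23, 176)]
    for name, sift, psep in rows:
        print(name, sift_prediction(sift), panther_prediction(psep))
```

With these assumed cutoffs, the three example rows yield the same SIFT and PANTHER codes as reported in Supplementary Table S15. How values falling exactly on a boundary (PSEP of 200 or 450) are handled is a choice made here for illustration.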