
UNIVERSITY OF COPENHAGEN

PhD Thesis

Ye Yin

Evolution and Adaptation of Baboon and Mandrill Revealed by Genome Sequencing

Academic advisor: Karsten Kristiansen, University of Copenhagen, Denmark

This thesis has been submitted to the PhD School of The Faculty of Science, University of Copenhagen

Submitted: March 2018

Dissertation for the degree of philosophiae doctor (PhD)

Department of Biology, University of Copenhagen, Copenhagen, Denmark

and

BGI-Research, BGI-Shenzhen, Shenzhen, China

March 2018

Author: Ye Yin

Title: Evolution and Adaptation of Baboon and Mandrill Revealed by Genome Sequencing

Academic advisor: Karsten Kristiansen, Department of Biology, University of Copenhagen, Denmark

Submitted March 2018

Preface

This PhD project started in 2015 as a collaboration between the Department of Biology, University of Copenhagen, and BGI-Shenzhen. The work presented here was performed at both institutions under the supervision of Professor Karsten Kristiansen.

Acknowledgements

I would like to thank my supervisor, Professor Karsten Kristiansen, for introducing me to the cutting-edge research areas of genomics and bioinformatics and for his kind guidance on conducting academic research.

I would also like to thank Chenglin Zhang, Deputy Director of Beijing , and Professor Rasmus from the University of Copenhagen for kindly providing the baboon and mandrill samples used in this study.

Additionally, I would like to thank those who participated in the crowdfunding for the baboon and mandrill sequencing projects.

Abstract

Baboon (genus Papio) and mandrill (Mandrillus sphinx) are phylogenetically closely related to human beings and can serve as unique models for evolutionary studies as well as for research on human diseases. However, genetic research on baboon and mandrill and the genomic resources available for them are limited, especially compared to chimpanzee and gorilla. Genome sequencing of baboon and mandrill was therefore carried out here to construct reference genomes of these remarkable Old World monkeys.

Following sampling, DNA extraction and sequencing, 414 Gb and 426 Gb of raw sequencing data from different libraries were generated for baboon and mandrill, respectively, using a second-generation sequencing platform. Genome assembly was then carried out based on the sequencing data of both species. The genome assembly of baboon was 3.11 Gb with a contig N50 of 21,659 bp and a scaffold N50 of 1,070,645 bp, and the genome assembly of mandrill was 2.88 Gb with a contig N50 of 20,483 bp and a scaffold N50 of 3,564,730 bp. With the assembled genomes, repeat contents were first annotated as 42.3% and 40.4% for baboon and mandrill, respectively. After masking the repeats in the genomes, evidence-based and ab initio gene annotation were combined to predict 23,867 genes in baboon and 21,906 genes in mandrill. By searching 3,023 BUSCO (Benchmarking Universal Single-Copy Orthologs) genes against the predicted genes, the completeness of the gene sets was estimated to be 97% (baboon) and 98% (mandrill). These comprehensive assemblies and complete gene sets provide new biological insight into genetic diversity, structural variation and behavioral characteristics. Comparative genomic analysis among primates was conducted to reveal the synteny between primates and the evolution of gene families by contraction and expansion, especially in baboon and mandrill. There were 9,930, 11,418 and 14,318 gene pairs between baboon and mandrill, and baboon, human and macaque. In the baboon and mandrill lineage, there were 545 expanded and 618 contracted gene families; the expanded genes were significantly enriched in biosynthetic process, structural constituent of ribosome, nucleosomal DNA binding, G-protein coupled receptor activity, olfactory receptor activity, glucose catabolic process, peptidyl-prolyl isomerization, as well as the carbon fixation in photosynthetic organisms and electron transport chain pathways. Molecular mechanisms of adaptation in baboon and mandrill, including immunity, language competence and olfaction, were also investigated through comparative genomics. This study provides genomic resources for primate species and comprehensive insights into the adaptation and evolution of baboon and mandrill.

Table of Contents

Preface

Acknowledgements

Abstract

Table of Contents

Abbreviations

List of Tables

List of Figures

1. Introduction

1.1 Baboon and its biology

1.2 Mandrill and its biology

1.3 Genomic studies on baboon and mandrill

1.3.1 Genomics of primates

1.3.2 Genomics of baboon

1.3.3 Genomics of mandrill

1.3.4 Comparative genomics in primates

1.4 Objectives

2. Materials and Methods

2.1 Sampling and sample preparation

2.2 Genome sequencing

2.2.1 Library construction and sequencing

2.2.2 Data filtering

2.2.3 Overlapping library data merging

2.2.4 K-mer analysis

2.3 Genome assembly and annotation

2.3.1 Genome assembly

2.3.2 Genome annotation

2.4 Evolutionary analysis

2.4.1 Gene family cluster

2.4.2 Phylogenetic analysis

2.4.3 Positively gene selection analysis

2.5 Comparative genomics

2.5.1 Synteny analysis of human, macaque, baboon and mandrill

2.5.2 Gene family contraction and expansion

2.5.3 Segmental duplications

2.6 Investigating molecular mechanisms of adaptation/phenotype

2.6.1 Immune character

2.6.2 Language competence

2.6.3 Olfactory character

2.6.4 Predicting binding sites of transcription factors

3. Results

3.1 Landscapes of baboon and mandrill genomes

3.1.1 Sequencing data

3.1.2 K-mer analysis

3.1.3 Genome assembly

3.1.4 Annotation results

3.2 Evolution of baboon and mandrill

3.2.1 Gene families

3.2.2 Phylogenetic analysis

3.3 Synteny among primates

3.3.1 Synteny analysis of human, macaque, baboon and mandrill

3.3.2 Gene family contraction and expansion

3.3.2 Segmental duplications

3.4 MHC comparison between human and baboon/mandrill

3.5 Language related genomic features

3.6 Olfactory receptor genes analysis

3.7 Positively selected genes

4. Discussion

5. Conclusions

6. Future perspectives

7. References

8. Appendix

Abbreviations

4D        Fourfold Degenerate
AIDS      Acquired Immune Deficiency Syndrome
BP        Biological Process
BUSCO     Benchmarking Universal Single-Copy Orthologs
CC        Cellular Component
CEGMA     Core Eukaryotic Genes Mapping Approach
ChIP-seq  Chromatin Immunoprecipitation Sequencing
CMV       Cytomegalovirus
DBG       De Bruijn Graph
EBV       Epstein-Barr Virus
EC        Enzyme Commission
EST       Expressed Sequence Tag
GO        Gene Ontology
HAV       Hepatitis A Virus
HGNC      HUGO Gene Nomenclature Committee
HIV       Human Immunodeficiency Virus
HLA       Human Leukocyte Antigen
IPEX      Immunodysregulation Polyendocrinopathy Enteropathy X-Linked
KEGG      Kyoto Encyclopedia of Genes and Genomes
LINE      Long Interspersed Nuclear Element
LTR       Long Terminal Repeat
MF        Molecular Function
MHC       Major Histocompatibility Complex
MYA       Million Years Ago
NGS       Next Generation Sequencing
OR        Olfactory Receptor
PCR       Polymerase Chain Reaction
PGC       Primordial Germ Cells
PPIA      Peptidylprolyl Isomerase A
PSMC      Pairwise Sequentially Markovian Coalescent
QTL       Quantitative Trait Loci
ROS       Reactive Oxygen Species
SD        Segmental Duplication
SINE      Short Interspersed Nuclear Element
SIV       Simian Immunodeficiency Virus
SMRT      Single-Molecule Real-Time Sequencing
SNPRC     Southwest National Primate Research Center
TE        Transposable Element
WSSD      Whole-Genome Shotgun Sequence Detection

List of Tables

Table 1.1 Baboons as animal models for studies of human diseases and vaccines.

Table 1.2 Summary of mandrill as models in human diseases and vaccines studies/tests.

Table 1.3 Published primate genome sequences.

Table 3.1 Statistics of baboon and mandrill raw sequencing data.

Table 3.2 The information of 17-mer statistics.

Table 3.3 Statistics of the genome assemblies.

Table 3.4 Repeat contents of baboon, mandrill, human and mouse.

Table 3.5 Summary of gene annotation in baboon genome.

Table 3.6 Summary of gene annotation in mandrill genome.

Table 3.7 Assessment of gene sets using BUSCO.

Table 3.8 Function annotation of the final gene sets.

Table 3.9 Gene family clustering in the seven species.

Table 3.10 Olfactory receptor gene copy number in five species.

Table 8.1 Statistics of baboon and mandrill clean/filtered sequencing data.

Table 8.2 Prediction of the repeats in baboon genome.

Table 8.3 General statistics of repeats in mandrill genome.

Table 8.4 Categories of TEs in baboon genome.

Table 8.5 Categories of TEs in mandrill genome.

Table 8.6 Non-coding RNA genes in baboon genome.

Table 8.7 Non-coding RNA genes in mandrill genome.

Table 8.8 GO enrichment of unique gene families in baboon.

Table 8.9 GO enrichment of unique gene families in mandrill.

Table 8.10 GO enrichment result of unique gene families for mandrill.

Table 8.11 Repeat content of MHC class I region for mandrill and human.

Table 8.12 GO and KEGG enrichment of the positively selected genes (PSGs).

List of Figures

Figure 1.1 Geographical distribution of baboons.

Figure 2.1 Photos of the samples selected for sequencing.

Figure 2.2 Overall process of genome annotation.

Figure 3.1 Orthologous gene clusters in the five related species.

Figure 3.2 Comparison of orthologous genes among 12 primates and mouse.

Figure 3.3 Phylogenetic tree based on single copy gene families in the 13 species.

Figure 3.5 Synteny relationship of human, macaque, baboon and mandrill.

Figure 3.6 Gene family contraction and expansion for 12 primates and mouse.

Figure 3.7 Segmental duplications in seven primate species.

Figure 3.8 Synteny between human and mandrill MHC regions.

Figure 3.9 Alignment of HLA genes with amino acid sequence for human, baboon and mandrill.

Figure 3.10 Structure of MICA and MICB gene for human and mandrill.

Figure 3.11 Amino acid sequence alignment of FOXP2 gene from human, chimpanzees, mouse, baboon and mandrill.

Figure 3.12 Expansion of the olfactory receptor gene family in baboon and mandrill.

Figure 3.13 Interaction between innate immunity for positively selected genes in mandrill.

Figure 8.1 The distribution of 17-mer frequency of baboon and mandrill.

Figure 8.2 Colinearity analysis of chr 3 for mandrill.

Figure 8.3 Sequencing depth and the location relationships of pair-end reads on MHC class I region for mandrill.

Figure 8.4 OR7E24 genes on chromosome 19 in mandrill.

1. Introduction

There are two suborders of primates, the Strepsirrhini and the Haplorhini. Haplorhines are further split into tarsiers and simians. Simians comprise two groups, one of which is the catarrhines. Catarrhines are further split into the Old World monkeys (Cercopithecoidea) and the apes (Hominoidea).

Baboon (genus Papio) and mandrill (Mandrillus sphinx) are primates of the Old World monkey family and are widely distributed in Africa. Like chimpanzees and gorillas, which belong to Hominidae, baboon and mandrill are closely related to human beings [1]. Papio, Mandrillus and Macaca used to be clustered in a single tribe, Cercocebini [2], but currently Papio is grouped with Lophocebus while Mandrillus is grouped with Cercocebus, based on evidence that includes the postcranial skeleton [3].

The Papionini tribe diverged around 11.5 million years ago (Mya) and comprises the subtribe Papionina, with the genera Papio and Mandrillus, and the subtribe Macacina, with the genus Macaca [4]. Complete mtDNA genome sequences also support similar phylogenetic relationships between Macaca and Mandrillus [5].

1.1 Baboon and its biology

Baboons are primates of the Old World monkeys belonging to Papio. They have close-set eyes, powerful jaws, short tails, long muzzles, thick fur, and rough spots on their protruding buttocks.

Baboon species also show sexual dimorphism, usually in size, but sometimes also in color or canine development [6]. Baboons can live for more than 40 years: captive baboons are known to reach 45 years, whereas life expectancy in the wild is about 30 years. They live in open savannah, woodland and hills across Africa, and they occasionally eat fish. There are five species in Papio: P. ursinus (chacma baboon), P. papio (Guinea baboon), P. hamadryas (hamadryas baboon), P. anubis (olive baboon) and P. cynocephalus (yellow baboon), which are predominantly found in southern Africa, western Africa, southwestern Arabia, north-central Africa and eastern Africa, respectively [7] (Figure 1.1).

Figure 1.1 Geographical distribution of baboons. Distribution based on the map in Kingdom [8] (modified from a figure in a previous study [9]).

Baboons live in hierarchical troops of 5 to more than 200 individuals, considerably larger than most chimpanzee groups. Troop size varies greatly among baboon species and over time. The social structure of hamadryas baboons is remarkably different from that of the other baboon species, which are collectively termed savanna baboons. For example, hamadryas baboons always form very large troops composed of many small harems, while other baboons often have a more promiscuous structure in which the hierarchy is determined by the matriline. In the hamadryas harems, the males jealously guard their females, and some of them also raid other harems for females, which causes fights among the males. Visual threats such as quick flashing of the eyelids and showing the teeth are usually used during the fights, and in some species infants are taken as hostages.

Baboons can determine the dominance relations between individuals from vocal exchanges. In savanna baboons, each male can mate with any female, and the mating order among the males depends partly on their rank in the social structure. Individuals of higher rank have advantages in health and reproduction. High-ranking males have higher levels of testosterone and lower levels of glucocorticoid than other males, and the top-ranking males have higher levels of both testosterone and glucocorticoid than the second-ranking males [10]. Females also prefer friendly males as mates. A male baboon may therefore also gain the chance to mate with a female by exhibiting friendly behaviors such as grooming her. Gestation lasts about six months and usually produces a single infant. The mother is the primary caretaker, but other females share the duties of taking care of all the offspring. Young baboons are weaned at about one year of age, and male baboons have to leave their group before they reach sexual maturity, at about five or six years old. Females, on the other hand, stay in the same group.

Studies of human complex diseases are difficult because it is very challenging to control human pedigree structure and environmental conditions. Limited access to tissues also greatly hampers such studies. To overcome these limitations, nonhuman primates are often used as valuable resources. Baboons share many genetic, biochemical, physiologic and anatomic characteristics with human beings [11] and are naturally infected with numerous human pathogens. They therefore have the potential to be used as animal models for research on physiology and pathophysiology [12], including cardiovascular disease, obesity, hypertension, age-related skeletal disease, epilepsy, infectious disease and intrauterine research [13, 14] (Table 1.1). Transplantation and drug therapy studies have also been conducted in baboons [15-17].

Table 1.1 Baboons as animal models for studies of human diseases and vaccines.

| Disease or pathogen | Experimental objective | Reference |
| Viral diseases | | |
| | Studies in pathogenesis | [18] |
| Encephalo-myocarditis virus | Vaccine | [19] |
| HIV-1 | HIV-1 vaccine candidates | [20] |
| Hepatitis A virus (HAV), cytomegalovirus (CMV), Epstein-Barr virus (EBV) | Infections transmissible between baboons and human beings | [21] |
| Bacterial infections | | |
| Bacillus anthracis | Infections in nonhuman primate model | [22] |
| Francisella tularensis | Outer membrane live, attenuated LPS-17 | [23] |
| Angioinvasive aspergillus | Baboon-to-human liver transplantation infection | [15] |
| Parasite infections | | |
| Leishmania major | Infection model | [24] |
| | Live, irradiated cercariae vaccine | [25, 26] |
| Zoonotic gastrointestinal parasites | Baboon as zoonotic reservoirs | [27] |

In the meantime, some characteristics of baboons are obviously different from those of humans, including language ability and sensory capabilities [28]. Based on comparative studies of articulatory anatomy, many researchers [29] have claimed that nonhuman primates are incapable of producing systems of vowel-like sounds because of their high larynx position, but recent discoveries have begun to challenge this view for three reasons. First, some animal species with a low larynx have no documented ability to produce systems of vowel-like sounds [30]. Second, human infants, whose larynx is still high, produce the same range of vowel qualities as adults [31]. Third, modeling suggests that the production of vocalic sounds does not depend on the position of the larynx, but rather on the control of the tongue muscles and lips [32].


1.2 Mandrill and its biology

Mandrill (here referring specifically to Mandrillus sphinx) is a primate of the Old World monkey (Cercopithecidae) family. Along with the drill, the mandrill was once classified as a baboon (Papio) because they are superficially similar [33]. Mandrills live in tropical rainforests, rocky, riparian, flooded or gallery forests, as well as cultivated areas and stream beds across Africa, and are usually found in southern Cameroon, Gabon, Equatorial Guinea and Congo [34]. The distribution is generally separated by the Sanaga River and the Ogooué and White Rivers into two populations with remarkable genetic differences, which have therefore been classified as different subspecies.

The mandrill is omnivorous. Its diet is generally composed of fruits (50.7%), seeds (26.0%), leaves (8.2%), pith (6.8%), flowers (2.7%) and animal matter (4.1%), with other foods making up the remaining 1.4%. Mandrills usually consume plants from more than a hundred species, and fruits are preferred. They also eat mushrooms and soil. Besides plants, they eat animals, mostly invertebrates such as ants, beetles, termites, crickets, snails and scorpions. Their diet also contains eggs and small vertebrates such as frogs, rats and shrews, as well as juveniles of larger vertebrates such as bay duikers [35]. The life expectancy of mandrills in captivity can be up to 31 years, shorter than that of baboons.

Mandrills also exhibit strong sexual dimorphism. The species has experienced long and strong sexual selection; as a result, male mandrills are larger and more strongly colored than females. The mandrill's face is generally hairless with an elongated muzzle. Mandrills also have distinctive features, such as protruding blue ridges on the sides of the muzzle and red nostrils and lips. The areas around the genitals are multi-colored, and dominant males have particularly pronounced coloration. The mandrill is the largest and heaviest monkey in the world. Typically, male mandrills weigh 19–37 kg, with an average of 32.3 kg, while the females weigh roughly half as much, at 10–15 kg with an average of 12.4 kg. Males are typically 75–95 cm long and females 55–66 cm. The shoulder height ranges from 45–50 cm in females to 55–65 cm in males. These sizes and weights surpass even those of the largest baboons. Furthermore, the mandrill is more ape-like than the baboons in body structure, with a muscular and compact build, shorter, thicker limbs that are longer in the front, and almost no tail.

Mandrills are mostly terrestrial, but they are more arboreal than baboons [36]. Mandrills live in large, stable groups that can reach hundreds of individuals [34]. The largest horde that has been verifiably observed contained more than 1,300 mandrills, which is the largest aggregation of nonhuman primates ever documented. Mandrills are diurnal and sleep in trees at night. They use tools and have been observed using sticks in captivity.

Mandrills breed every two years and the mating season extends from June to October. Sometimes, male mandrills fight for mating rights. The testicular volume increases along with the gaining of dominance (alpha male) and decreases if the dominance is lost. Similar changes are also observed in the color of sexual skin on the face and genitalia, which becomes red in alpha male mandrills.

Physiologically, the secretion of the sternal cutaneous gland also increases accordingly [37, 38]. A hierarchy among females also exists [38].

The way monkeys select their mates can be attributed to smell rather than color, and mainly involves genes of the major histocompatibility complex (MHC) [39]. The MHC is a cluster of genes that determines a mandrill's individual scent: it helps build proteins involved in the body's immune system and affects body odour by interacting with bacteria on the skin. A series of experiments found that particular odour types were consistent with particular MHC gene patterns, suggesting that mandrills use odour as an indication of genetic compatibility [40]. The mandrill is one of two species in Papionini that possess a sternal gland, a triangular area in the middle of the chest that provides the structural basis for scent [41]. The mandrill is also widely used in immune system research and is a natural host for SIV strains (Table 1.2).

Table 1.2 Summary of mandrill as models in human diseases and vaccines studies/tests.

| Disease or pathogen | Objectives | Reference |
| Viral diseases | | |
| Simian Immunodeficiency Virus (SIV) | Studies in pathogenesis | [42] |
| Simian T-lymphotropic Virus Type 1 (STLV-1) | Transmission modes | [43, 44] |
| Bacterial infections | | |
| Paratuberculosis | Infections in mandrill | [45] |
| Helicobacter heilmannii | Bacterial pathogen model | [46] |
| Parasite infections | | |
| Amebic, ciliate, nematodes | Parasite prevalence in mandrill | [47] |
| Loa loa | Irradiated vaccine | [48] |

1.3 Genomic studies on baboon and mandrill

1.3.1 Genomics of primates

Since the completion of the human genome assembly [49], and with the reduced cost of genome sequencing and the greatly increased throughput of new sequencers, more and more genomic data from primates are becoming available, covering both Old World primates, such as chimpanzee [50], and New World monkeys, such as marmoset (Callithrix jacchus). The genomics of non-human primates has attracted wide interest for two reasons: their application as models for the analysis of human disease, and the study of genetic conservation and divergence over evolutionary history through comparative genomics [51]. Generally, the species selected for genome sequencing meet one of two criteria: 1) an important evolutionary position within the phylogeny (e.g. chimpanzee and orangutan); 2) biomedical relevance to human. For example, macaque and baboon, although the genome sequencing of the latter has not been completed yet, were selected because they are often used to study the genetic basis of numerous human diseases [13], and squirrel monkey is used for studies of neurobiology and infectious disease. The size of primate genomes varies little, ranging from 2.7 Gb in bonobo (Pan paniscus) [52] to 3.4 Gb in tarsier (Tarsius syrichta) [53]. Repetitive regions occupy about 50% of human, ape and monkey genomes, but the number of species-specific insertions varies substantially: ~5,000 in human, ~2,300 in chimpanzee and only 250 in orang-utan [54]. Genomic studies have also been conducted on extinct hominins, the Neanderthals (Green et al., 2010) and the Denisovans [55]. Together, these genomic resources make it possible to compare human with other primates, or primates with other species.

The sequencing and assembly of non-human primate genomes went through different stages in pace with the development of sequencing technologies. The genomes of chimpanzee (Pan troglodytes) and rhesus macaque (Macaca mulatta) were sequenced by shotgun sequencing that relied exclusively on Sanger sequencing methods, at considerable cost and effort [56, 57]. Next generation sequencing (NGS) was then widely adopted and accelerated progress in genomics; many primates were sequenced and assembled (Table 1.3), which improved our understanding of genome content, evolution and diversity [51]. Since 2013, the development and application of single-molecule real-time (SMRT) sequencing technology has considerably improved assemblies of human and other genomes. Compared with the NGS assembly gorGor3, the gorilla (Gorilla gorilla) assembly based on SMRT sequences shows a significant decrease in fragmentation: the contig N50 increased >819-fold (from 11.8 kb to 9.6 Mb, Table 1.3), and 94% of the gorGor3 gaps were closed [58, 59].
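Contig N50 and scaffold N50 are used throughout this thesis to compare assemblies. As a minimal sketch, assuming the standard definition (the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the total assembly length), N50 can be computed as below. The function names `n50` and `fold_change` are illustrative and are not taken from any assembler or from the pipeline used in this work.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths (in bp).

    Sequences are sorted from longest to shortest and accumulated until
    the running total reaches half of the summed length; the length of
    the sequence that crosses this threshold is the N50.
    """
    if not lengths:
        raise ValueError("at least one sequence length is required")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


def fold_change(old_n50, new_n50):
    """Ratio between two N50 values given in the same unit."""
    return new_n50 / old_n50


if __name__ == "__main__":
    # Toy contig lengths (hypothetical data, for illustration only).
    toy_contigs = [2, 2, 2, 3, 3, 4, 8, 8]
    print("N50 of toy contigs:", n50(toy_contigs))

    # Rounded gorilla contig N50 values quoted in the text (11.8 kb, 9.6 Mb).
    print("Fold change:", round(fold_change(11_800, 9_600_000), 1))
```

With the rounded values quoted above the ratio is about 814; the slightly higher figure in the text presumably reflects unrounded N50 values.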

Understanding the origin of the human genome is one of the most important purposes of sequencing primates closely related to human. Inter- and intra-species comparisons have provided insights into gene exchange among the early human and chimpanzee ancestors, and have allowed the identification of genes or regions that were positively selected during the evolution of human and other primates. These genes often indicate genetic or phenotypic changes that are critical for the adaptation of human or non-human primates, as well as hominins. It has been clearly shown that genes involved in the immune system and resistance to pathogens, as well as those involved in reproductive biology, were commonly positively selected in many non-human primates [58, 60, 61]. This might be a result of long-term exposure to various pathogens in the wild, while genes related to gametogenesis are beneficial in competition within species. On the other hand, within individual species, signals of positive selection have been found in genes related to a wide range of phenotypes. Positively selected genes shared among human, chimpanzee and gorilla are related to neural and brain development; genes related to glycolipid metabolism and hearing are positively selected in orangutans [54]; and in marmosets and other callitrichine primates, genes involved in the phyletic reduction of body size were positively selected [62]. The common ancestor of human and chimpanzee dates back to 12-5 million years ago [63]. Reciprocal gene flow between those lineages lasted for ~3 million years [63], suggesting that the divergence was a long process with extensive gene flow rather than a short event. Similar evidence of gene exchange was also detected in the genomes of Bornean and Sumatran orang-utans [54, 63].

Long-read sequencing indeed improved the completeness and accuracy of assemblies, but the scaffolds were still at the Mb level. Recently, a technology named Hi-C has helped assemble contigs into chromosome-scale scaffolds [64-66]. Hi-C and related technologies were developed to detect the three-dimensional folding of chromosomes within the nucleus [67], and this information was then used to assist assembly. The results indicated that combining shotgun fragments and mate-pair sequences with Hi-C data could generate chromosome-scale assemblies with 98% accuracy in assigning scaffolds to chromosome groups for human [64]. As far as we know, there is no Hi-C-assisted assembly for non-human primates.

Table 1.3 Published primate genome sequences (modified based on [51]).

| Common name | Species name | Bases in contigs | Contig N50 | Scaffold N50 | Reference |
| Chimpanzee | Pan troglodytes | 2.7 Gb | 15.7 kb | 8.6 Mb | [56] |
| Chimpanzee (updated) | P. troglodytes | 2.9 Gb | 50.7 kb | 8.9 Mb | [68] |
| Bonobo | Pan paniscus | 2.7 Gb | 67 kb | 9.6 Mb | [69] |
| Gorilla | Gorilla gorilla | 2.7 Gb | 11.8 kb | 914 kb | [58] |
| Gorilla (updated) | Gorilla gorilla | 2.8 Gb | 9.6 Mb | 23.1 Mb | [59] |
| Orang-utan | Pongo abelii | 3.1 Gb | 15.5 kb | 739 kb | [54] |
| Indian rhesus macaque | Macaca mulatta | 2.9 Gb | 25.7 kb | 24.3 Mb | [57] |
| Indian rhesus macaque (updated) | M. mulatta | 3.1 Gb | 107.2 kb | 4.2 Mb | [70] https://www.ncbi.nlm.nih.gov/assembly/GCA_000772875.3 |
| Chinese rhesus macaque | M. mulatta | 2.8 Gb | 11.9 kb | 891 kb | [71] |
| Vietnamese cynomolgus macaque | M. fascicularis | 2.9 Gb | 12.5 kb | 652 kb | [71] |
| Aye-aye | D. madagascariensis | 3.0 Gb | NA | 13.6 kb | [72] |
| Vervet | C. aethiops | 2.8 Gb | 90.4 kb | 81.8 Mb | [73] |
| Olive baboon | P. anubis | 2.9 Gb | 149.8 kb | 585.7 kb | https://www.ncbi.nlm.nih.gov/assembly/GCF_000264685.3/ |
| Gibbon | Nomascus leucogenys | 2.8 Gb | 35.1 kb | 22.7 Mb | [74] |
| Marmoset | Callithrix jacchus | 2.3 Gb | 29 kb | 6.7 Mb | [75] |
| Mouse lemur | Microcebus murinus | 2.4 Gb | 210.7 kb | 108.2 Mb | https://www.ncbi.nlm.nih.gov/assembly/GCF_000165445.2 |
| Pig-tailed macaque | Macaca nemestrina | 2.8 Gb | 106.9 kb | 15.2 Mb | https://www.ncbi.nlm.nih.gov/assembly/GCF_000956065.1/#/st |
| Sifaka | Propithecus coquereli | 2.1 Gb | 28.1 kb | 5.6 Mb | https://www.ncbi.nlm.nih.gov/assembly/GCF_000956105.1/#/st |
| Sooty mangabey | Cercocebus atys | 2.8 Gb | 112.9 kb | 12.8 Mb | https://www.ncbi.nlm.nih.gov/assembly/GCF_000955945.1/ |
| Squirrel monkey | Saimiri boliviensis | 2.5 Gb | 38.8 kb | 18.7 Mb | https://www.ncbi.nlm.nih.gov/assembly/GCF_000235385.1/#/def |
| Bushbaby | Otolemur garnettii | 2.4 Gb | 27.1 kb | 13.9 Mb | https://www.ncbi.nlm.nih.gov/assembly/GCF_000181295.1/ |
| Mouse lemur | Microcebus murinus | 2.4 Gb | 182.9 kb | 3.7 Mb | https://www.ncbi.nlm.nih.gov/assembly/GCF_000165445.1/ |
| Tarsier | Tarsius syrichta | 3.4 Gb | 38.2 kb | 401 Mb | [76] |

1.3.2 Genomics of baboon

Baboons (Papio) shared a common ancestor with humans ~30 million years ago; they are genetically closer to human than New World monkeys are, but less closely related than the African apes. Although a genome assembly of baboon was lacking before our study, a comparison between a short region (~1.5 Mb) of the baboon and human genomes showed very limited substitutions, most of which are relatively enriched in exons [77].

Baboon has been commonly used as an ideal primate model for genetic studies of complex traits and human diseases because of the high similarity between human and baboon in transcriptome, physiology and genetics [78]. The Southwest National Primate Research Center (SNPRC) at the Texas Biomedical Research Institute maintains ~2,000 baboons for biomedical research. The pedigree contains over 16,000 individuals across seven generations, with 384 founders of P. h. anubis, P. h. cynocephalus and their hybrid progeny. Tissues and blood clots from over 8,000 individuals have been well stored, and DNA, serum and buffy coats from ~4,000 members have been banked [13].

Among these pedigreed baboons, more than 2,000 individuals have been genotyped using microsatellite markers, followed by the construction of a whole-genome linkage map, which contains 294 ordered loci with an average interval between markers of 7.2 cM [79, 80].

Together with the genotypes of these baboons, several hundred quantitative traits have also been phenotyped, and these were further used to localize genomic regions containing genes that control the traits. These data have been applied in studies of atherosclerosis, hypertension, obesity, the craniofacial complex and other traits. Taking advantage of this genetic map, scientists have scanned the genome for quantitative trait loci (QTL) associated with over 200 traits related to cardiovascular diseases. Several important QTL were found using this approach. For example, Kammerer et al. (2002) found a QTL on chromosome 6 influencing the response of low-density lipoprotein cholesterol to dietary cholesterol; the following year, Rainwater et al. (2003) found several lipid- and lipoprotein-related QTL, including three for low-density lipoprotein cholesterol size fractions located on chromosomes 5, 10q and 17, respectively. A region on chromosome 17 was also found to be associated with cholecystokinin. This region is known to harbor the genes for a glucose transporter, glucagon-like peptide 2 receptor and sterol regulatory element binding transcription factor 1, which are related to adiposity [81].

Transcriptomic study is also an efficient approach to advance our understanding of many human diseases and traits. Northcott et al. (2012) developed a cross-species array (rat and baboon) targeting 328 genes possibly related to blood pressure. Among these genes, they found that 74 were expressed in both rat and baboon kidney, while 41 were specifically expressed in rat and 34 were specific to baboon. This study provided evidence of similarities and differences in gene expression profiles between primate and rodent, and therefore highlighted the importance of an appropriate primate model in studies of human complex diseases as well as other traits, such as neurology and social behaviors.

Several investigators have also combined the transcriptome and the linkage map in their studies and found that, for example, the mRNA level of adiponectin, which is correlated with body weight, serum triglycerides, adipocyte volume and glucose levels, is significantly heritable, and the heritability is associated with a region on chromosome 4p [82]. Similarly, the abundance of resistin mRNA is also heritable, and the QTL is located on chromosome 19p [83].

1.3.3 Genomics of mandrill

As natural hosts of HIV and SIV, mandrills (M. sphinx) are on the list of species whose genomes are to be sequenced. In some cases, mandrills are able to tolerate SIV infection for long periods of time, and their responses to viral infections are sometimes quite different from those of other hosts, such as mangabey and African green monkey. Additionally, mandrills are adapted to two different SIV strains: SIVmnd1, which possibly originated from a virus in Cercopithecus lhoesti, and SIVmnd2, from a virus in M. leucophaeus. Furthermore, a correlation between the low rate of vertical transmission and the expression of CCR5 has been found in mandrill [84]. Heterozygous individuals have greater reproductive success, that is, more offspring, regardless of sex. However, this advantage has only been observed in alpha males and not in beta males. This correlation between heterozygosity and reproductive success and tenure has been fairly well explained by multi-locus effects [85].

1.3.4 Comparative genomics in primates

With genomic data available across different primate species, genome regions or elements common to several species or specific to one species (e.g. human) can be identified and analyzed in detail by systematic comparisons between these genomes. Sequence alignment and comparison show a strong correlation between pairwise sequence differences and divergence times inferred from other sources of information. The divergence between human and chimpanzee sequences is 1.1-1.4% [58]. The difference between human and rhesus macaque is larger, ~6.5% [61], which is consistent with the longer divergence time between these two species (28-35 million years ago). The alignments also reveal indels among species. Indels are preferentially located in intronic and intergenic regions, which are more tolerant of small indels than protein-coding regions.

Besides substitutions and small indels, insertions of fragments and larger segmental rearrangements were also detected in primate genomes. The most extensively investigated process is the insertion of retrotransposons, such as Alu, which is ongoing in primate genomes and has played a very important role in shaping genome structure. For example, Alu insertions are a major driver of genome change [86]. Retroposition has broader effects on genome evolution because of its potential to induce segmental duplication or deletion [87].

Segmental duplication is vital for genome evolution. Segmental duplications collectively make up ~5% of the human and chimpanzee genomes and ~3.8% of the orang-utan genome [54, 88]. Segmental duplications are not randomly distributed along the chromosomes: they are enriched on human chromosome 22 (11.9%) and the non-recombining chromosome Y (50.4%) but depleted on chromosome 3 (1.7%) [89]. Segmental duplications in primate genomes are categorized into three groups according to their locations: pericentromeric, subtelomeric and interstitial duplications. Duplications in these three classes differ in type and frequency [90]. Pericentromeric duplicates make up about 47.6 Mb, a third of all duplicates in the human genome [89]. The ratio of inter- to intra-chromosomal duplication in this class is about 6:1 [89]. Furthermore, more than 30% of the pericentromeric sequences are occupied by duplicons from other chromosomes. A two-step model has been proposed to explain the process of segmental duplication in pericentromeric regions [91]. Subtelomeric regions similarly contain many duplicates from other chromosomes, although the total amount of duplicates is much smaller (2.6 Mb). Inter-chromosomal segmental duplicates are present in 30 of 42 subtelomeric regions [92]. Subtelomeric segmental duplicates are typically 50 to 100 kb long and, in contrast to the origin of pericentromeric duplicates, their formation involves exchanges between subtelomeric regions, and a large part of the relative orientation between non-homologous chromosomes has been retained [89, 93]. Interstitial segmental duplicates are located in euchromatin.

In contrast to the predominance of tandem duplicate clusters found in most genomes, primate genomes contain a large number of interstitial duplicates. Although interspersed duplicates are located along the euchromatin, their locations are not randomly distributed either [89]. Comparative analysis of genomes across primates shows that many interspersed intra-chromosomal duplicates can be dated to the evolution of the great apes. Their formation is always associated with chromosome rearrangements [94].

1.4 Objectives

Given the background described above, whole genome sequences will facilitate current biological research on baboon and mandrill. I therefore proposed to use second generation sequencing to construct reference genomes for both baboon and mandrill, and to conduct repeat and gene annotation, gene family clustering, evolutionary analyses and comparative genomic analyses for both species. The study objectives are to:

i) Provide genomic resources for future studies on baboon and mandrill;

ii) Describe detailed genomic features of baboon and mandrill in repeat content, protein coding genes, gene families, etc.;

iii) Compare the genomic and genetic features of baboon and mandrill to those of human and other primates to provide further insights for primate integration;

iv) Identify possible genetic mechanisms for Old World monkey adaptation;

v) Provide additional insights for human diseases and health.

2. Materials and Methods

2.1 Sampling and sample preparation

One baboon (olive baboon, Papio anubis) and one mandrill (Mandrillus sphinx) were selected for sampling (Figure 2.1). With the assistance of zoologists, veterinarians drew 5 mL of whole blood from the left jugular vein of each animal, a twenty-year-old male baboon and an eighteen-year-old male mandrill, and collected it in a plastic collection tube with 4% (w/v) sodium citrate. The blood samples were then snap frozen in liquid nitrogen and stored at -80˚C until further processing. Genomic DNA was extracted from the whole blood samples with the AXYGEN Blood and Tissue Extraction Kit (Corning, USA) according to the manufacturer’s instructions. The extracted DNA was subjected to electrophoresis in a 2% agarose gel and stained with ethidium bromide to assess the overall quality. DNA concentration was determined with the Quant-iT™ PicoGreen® dsDNA Reagent and Kits (Thermo Fisher Scientific, USA) according to the manufacturer’s instructions.

Figure 2.1 Photos of the samples selected for sequencing. The baboon (a) and the mandrill (b) are both from Beijing Zoo.

2.2 Genome sequencing

2.2.1 Library construction and sequencing

Genomic DNA of baboon and mandrill was used for library construction, following protocols described in a previous publication [95]. A total of 12 libraries were constructed for each of the two species, and sequencing was carried out on an Illumina HiSeq 2000 sequencer. For each species, 6 libraries were designed in paired-end configuration, comprising 2 libraries with 100 bp reads and a mean target insert size of 250 bp, and 2 libraries with 100 bp reads and insert sizes of 500 bp and 800 bp, respectively. The other 6 libraries were designed and processed in mate-pair configuration, all with 100 bp reads: 1 library each with insert sizes of 2 kbp, 4 kbp and 5 kbp, and 1 library each with insert sizes of 10 kbp and 20 kbp.

2.2.2 Data filtering

De novo assembly requires high-quality input, so data filtering was carried out to obtain high-quality reads for assembly. During sample preparation, adapters were ligated and amplification was conducted. Thus adapter-contaminated reads, duplicated reads introduced during amplification and reads with high sequencing error rates (low sequencing quality) need to be filtered, following a previous study [96]. Here, raw reads from the sequencer were filtered using SOAPnuke (v1.5.6; https://github.com/BGI-flexlab/SOAPnuke). The filtering criteria were as follows: i) reads with >10% of bases being N (uncertain/ambiguous bases) were removed; ii) reads with >40% of low-quality bases (quality score <=10) were removed; iii) reads contaminated by adapter (adapter matched 50%, allowing one base mismatch) or produced by PCR duplication (identical reads at both ends) were removed.

SOAPnuke parameters:

SOAPnuke filter -f adapter1.list -r adapter2.list -1 reads1.fq.gz -2 reads2.fq.gz -l 10 -q 0.4 -n 0.1 -M 1 -o ./
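Filtering criteria i) and ii) can be illustrated with a minimal Python sketch. This is not SOAPnuke's implementation; the function names `passes_filter` and `keep_pair` are hypothetical, and the rule that a pair is discarded when either mate fails is an assumption of the sketch. Adapter and duplicate removal (criterion iii) are not covered.

```python
def passes_filter(seq, quals, max_n_rate=0.1, low_qual=10, max_low_qual_rate=0.4):
    """Return True if a read survives criteria i) and ii).

    seq   -- read sequence as a string
    quals -- per-base Phred quality scores as a list of ints
    """
    if not seq or len(seq) != len(quals):
        return False
    # Criterion i): fraction of ambiguous bases (N) must not exceed 10%.
    n_rate = seq.upper().count("N") / len(seq)
    # Criterion ii): fraction of bases with quality <= 10 must not exceed 40%.
    low_rate = sum(1 for q in quals if q <= low_qual) / len(quals)
    return n_rate <= max_n_rate and low_rate <= max_low_qual_rate


def keep_pair(read1, read2):
    """Keep a read pair only if both mates pass (assumption of this sketch)."""
    return passes_filter(*read1) and passes_filter(*read2)
```

The default thresholds mirror the values passed to `-n 0.1`, `-l 10` and `-q 0.4` in the command above.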

2.2.3 Overlapping library data merging

Overlapping libraries are designed so that the two reads of a pair overlap each other, meaning that the fragment is sequenced through. For an overlapping library, the insert size (Si) should be shorter than the combined length of the two reads (2 Lr, where Lr is the length of each read), and the expected overlap length can be calculated as (2 Lr – Si). The overlap information can be used to merge the paired reads into one longer sequence. Merging reads benefits downstream analysis by providing longer sequences with fewer sequencing errors. Here, merging of the overlapping reads was performed using FLASh [97] v1.2.10 with default parameters.
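As a small worked example of the overlap calculation, the sketch below takes Lr as the length of a single read. The function name `expected_overlap` is hypothetical, and the example values (150 bp reads, 250 bp insert size) are those listed for the shortest-insert library in Table 3.1.

```python
def expected_overlap(read_length, insert_size):
    """Expected overlap (bp) between the two reads of a pair: 2*Lr - Si.

    Returns 0 when the insert is longer than both reads combined,
    i.e. when the reads do not overlap at all.
    """
    overlap = 2 * read_length - insert_size
    return max(overlap, 0)


# 150 bp reads on a 250 bp insert overlap by 2*150 - 250 = 50 bp.
print(expected_overlap(150, 250))
# 100 bp reads on a 500 bp insert do not overlap.
print(expected_overlap(100, 500))
```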

2.2.4 K-mer analysis

In order to estimate genome features including genome size, repeat content and heterozygosity, K-mer analysis was first performed. A K-mer is a sub-sequence of a read with length k. Formula 2.1 was used to estimate the genome size G. In this formula, k_num is the total number of K-mers, k_depth is the expected depth of K-mers, b_num is the total number of bases and b_depth is the expected depth of bases. According to Formula 2.2, k_depth follows a Poisson distribution with mean λ. Thus, the peak of the K-mer depth distribution was used as the expected K-mer depth λ.

In this analysis, k was 17, with the command: kmerfreq -k 17 -m 1 -o 1 -l fq.list [98].

$$G = \frac{k_{num}}{k_{depth}} = \frac{b_{num}}{b_{depth}} \qquad (2.1)$$

$$P(k_{depth} = x) = \frac{\lambda^{x} e^{-\lambda}}{x!} \qquad (2.2)$$
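The genome size estimate of Formula 2.1 can be sketched in a few lines of Python. The function name `estimate_genome_size` is hypothetical; the input values are the 17-mer counts and depth peaks reported in Table 3.2, and rounding down to a whole number of bases is an assumption of this sketch.

```python
def estimate_genome_size(kmer_count, peak_depth):
    """Genome size G = k_num / k_depth, rounded down to whole bases."""
    if peak_depth <= 0:
        raise ValueError("peak depth must be positive")
    return kmer_count // peak_depth


# 17-mer statistics as reported in Table 3.2.
baboon = estimate_genome_size(82_117_298_803, 28)
mandrill = estimate_genome_size(89_967_169_490, 31)
print(f"baboon:   {baboon:,} bp")
print(f"mandrill: {mandrill:,} bp")
```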

2.3 Genome assembly and annotation

2.3.1 Genome assembly

The baboon and mandrill genomes were assembled from the filtered data using the short-read assembler SOAPdenovo2 [98]. SOAPdenovo was developed for short-read assembly based on a de Bruijn graph algorithm and has been widely applied in genome assembly.

Four major steps were conducted to complete the preliminary assembly:

i) Building the de Bruijn graph

All reads from the small insert size (<1000 bp) libraries were used to build the de Bruijn graph (DBG). The initial DBG was composed of 57-mers as nodes, and the edges connecting the nodes were derived from read paths. To simplify the DBG, erroneous connections were removed and repeats were resolved in the following four ways:

a) Clipping short tips. Tips shorter than 114 bp (twice the 57-mer length) in the DBG were clipped.

b) Removing low-coverage links.

c) Solving tiny repeats by read path.

d) Merging bubbles. Bubbles were generally caused by repeats or heterozygosity.

ii) Contig construction

On the simplified DBG, connections were broken at repeat boundaries and the unambiguous sequence fragments were output as contigs.

iii) Scaffold construction

Reads were realigned onto the contigs, and the paired-end information was used to join unique contigs into scaffolds.

iv) Gap closure

Intra-scaffold gaps were filled using the mapped reads. Most of the remaining gaps probably occur in repetitive regions. Paired-end reads with one end mapped to a unique contig and the other end located in a gap region were extracted for local assembly, so that the unmapped ends could be used to fill the gaps within scaffolds. Gap filling was performed with GapCloser [98].
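Step i) can be illustrated with a toy de Bruijn graph in Python. This is a simplified sketch, not SOAPdenovo2's implementation: it ignores reverse complements, uses a tiny k instead of 57, and covers only graph construction and the removal of low-coverage links (step b). The function names `build_dbg` and `drop_low_coverage` are hypothetical.

```python
from collections import defaultdict


def build_dbg(reads, k):
    """Build a toy de Bruijn graph.

    Nodes are k-mers; an edge joins two k-mers that are adjacent in a read.
    The returned dict maps (kmer, next_kmer) to the number of supporting reads.
    """
    edges = defaultdict(int)
    for read in reads:
        for i in range(len(read) - k):
            edges[(read[i:i + k], read[i + 1:i + k + 1])] += 1
    return dict(edges)


def drop_low_coverage(edges, min_cov):
    """Remove links supported by fewer than min_cov read paths."""
    return {edge: cov for edge, cov in edges.items() if cov >= min_cov}


reads = ["ACGTA", "ACGTA", "ACGTT"]  # the third read carries a likely error
graph = build_dbg(reads, k=3)
clean = drop_low_coverage(graph, min_cov=2)
for (src, dst), cov in sorted(clean.items()):
    print(src, "->", dst, "coverage", cov)
```

In this toy input the link supported by a single read is dropped, while the two well-supported links remain.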

Genome assembly was run with the following commands:

SOAPdenovo all -s config -K 49

GapCloser_v1.10_gz -a scaff.fa -b lib.cfg -o baboon_gapClosed.fill -t 16

2.3.2 Genome annotation

Repeat sequences can be classified into tandem repeats, including microsatellite and minisatellite sequences, and interspersed repeats, including DNA transposons and retrotransposons (LTRs, LINEs and SINEs). Repeat elements were first annotated using both homology searching and de novo prediction. Similarly, genes were annotated by combining homology searching and prediction based on gene structure (Figure 2.2).

[Figure 2.2 flowchart: genome sequence; repeat annotation (de novo, homolog); ncRNA annotation (tRNA, rRNA, miRNA/snRNA); gene annotation (cDNA/EST, de novo, homolog, RNA-seq data, GLEAN set, gene set); function annotation (InterPro, KEGG, UniProt); statistics results]

Figure 2.2. Overall process of genome annotation. Genome annotation includes three major parts: repeat annotation, gene annotation and ncRNA annotation.

2.3.2.1 Repeat annotation

To predict transposable elements (TEs) in the genome, RepeatMasker [99] (version 4.0.5) and RepeatProteinMask were used to scan the whole genome against the RepBase library [100] (version 20.04) for known repeats. RepeatMasker was then used again to identify de novo repeats based on a custom TE library constructed by combining the results of RepeatModeler [101] (version 1.0.8) and LTR_FINDER [102] (version 1.0.6). Tandem repeats were also predicted using Tandem Repeats Finder [103] (version 4.0.7). Finally, all repeat predictions were combined into the final repeat annotation.

LTR_FINDER parameters:

LTR_FINDER.x86_64-1.0.5/ltr_finder -w 2 -s tRNAdb/dm3-tRNAs.fa

RepeatMasker parameters:

RepeatMasker -nolow -no_is -norna -parallel 1 -lib RepBase16.10/RepeatMaskerLib.embl.lib

RepeatProteinMask parameters:

RepeatProteinMask -noLowSimple -pvalue 0.0001

2.3.2.2 RNA annotation

To identify transfer RNAs (tRNAs), tRNAscan [104] was used. For ribosomal RNA (rRNA) identification, 757,441 rRNA sequences from the public domain were searched against the genome with blastn (-p blastn -e 1e-5). To identify other non-coding RNA (ncRNA) genes, the Rfam database [105] was searched against the genome with the Rfam program rfam_scan.pl (ftp://ftp.hgc.jp/pub/mirror/sanger/Rfam/tools/rfam_scan.pl).

rRNA parameters:

blastall -p blastn -e 1e-5 -i Human_rRNA.fa

ncRNA parameters:

rfam_scan.pl -d Rfam.fasta

2.3.2.3 Gene annotation

Genes were predicted using three categories of methods: homology-based, evidence-based and ab initio prediction. For homology-based annotation, protein sequences of Macaca mulatta, Pan troglodytes, Nomascus leucogenys, Pongo abelii, Gorilla gorilla and Homo sapiens were downloaded from the Ensembl database (Release 73) and aligned to the genome using BLAT [106]. GeneWise [107] (version 2.2.0) was then used for more precise alignment and gene structure prediction. For evidence-based prediction, EST sequences were downloaded from NCBI and aligned to the genome using PASA [108] for spliced alignment and assembly to detect gene structures. For ab initio prediction, AUGUSTUS [109] (version 3.1) was used to predict gene models in the repeat-masked genome. Finally, these gene prediction results were combined using GLEAN [110] to obtain the final non-redundant gene set.

Homology parameters:

blat -q=prot -t=dnax

genewise -sum -genesf

Ab initio prediction parameters:

denovo-predict.pl --augustus human

GLEAN parameters:

run.Glean.pl --YAML parameter.yaml --genome **.fa --maxintron 100000 --cds 150 --homolog **.gff --EST **.pasa.gff

2.3.2.4 Gene function annotation

In order to provide possible gene function information, predicted genes were compared against protein databases with protein function information. The Blast2GO program [111] was used to assign gene ontology (GO) terms and enzyme commission (EC) numbers. InterProScan [112], which searches Pfam domains [112] and several other protein signature databases, was used to predict protein domains. InterProScan results were finally passed to Blast2GO for further GO term assignment.

Function annotation parameters:

run_iprscan51-55.pl --cpu 100 --cuts 100 --appl ProDom --appl ProSiteProfiles --appl SMART --appl PANTHER --appl PRINTS --appl Pfam --appl PIRSF --appl ProSitePatterns **.pep

blast -b 100 -v 100 -p blastp -e 1e-5 -F F -d database (databases including Swiss-Prot and TrEMBL)

2.3.2.5 Completeness of gene content with CEGMA and BUSCO

CEGMA [113] and BUSCO [114] were used to assess the completeness of the genome and the quality of the gene predictions. Both programs search the genome for universal, conserved single-copy genes that are expected to be present, and thereby estimate the completeness of the genome and of the gene annotation. Completeness of the gene sets was assessed with default settings for both programs and, in the case of BUSCO, with the vertebrate-specific reference profiles.

BUSCO parameters:

BUSCO_v1.2.py -o run_glean -m OGS -l vertebrata database -in **.pep -c 16

2.4 Evolutionary analysis

2.4.1 Gene family cluster

Protein sequences of 11 species, namely Callithrix jacchus, Gorilla gorilla, Homo sapiens, Macaca mulatta, Microcebus murinus, Nomascus leucogenys, Otolemur garnettii, Pan troglodytes, Pongo abelii, Tarsius syrichta and Mus musculus, were used together with the predicted genes of the two species for gene family clustering. Proteins were filtered as follows: i) proteins whose coding sequence was shorter than 90 bp were removed; ii) sequences whose first or last amino acid was marked as “X”, indicating an ambiguous amino acid caused by “N” in the gene sequence, were removed; iii) only one transcript was retained when multiple transcripts existed for a gene. Then TreeFam (http://www.treefam.org/) was used to define gene families in Mandrillus sphinx and Papio anubis. First, all-vs.-all blastp with an e-value cut-off of 1e-7 was conducted for the protein sequences of the 13 species. Second, the blast matches were joined together by an in-house program. Third, genes with an aligned proportion of less than 0.33 were removed and bit scores were converted to percent scores. Finally, hcluster_sg (version 0.5.0, https://pypi.python.org/pypi/hcluster) was used to cluster genes into gene families.

2.4.2 Phylogenetic analysis

With gene family clusters defined, the fourfold degenerate (4D) sites of 5,133 single-copy orthologs among the 13 species were extracted for phylogenetic tree construction. The PhyML package [115] was used to build the phylogenetic tree with the maximum-likelihood method and GTR+gamma as the substitution model (1,000 rapid bootstrap replicates). Based on the phylogenetic tree, divergence times of these species were estimated using MCMCTree (http://abacus.gene.ucl.ac.uk/software/paml.html) with default parameters. To calibrate the divergence times in the tree, six dates collected from the TimeTree database (http://www.timetree.org/) were used, including the divergence time between Mus musculus and human of 85-93 million years ago (MYA) [116], and the divergence times between human and chimpanzee and between human and gorilla of 6 MYA (range, 5–7) [117] and 9 MYA (range, 8-10) [118], respectively.

2.4.3 Positive selection analysis

The selection pressure on protein-coding genes in mandrill and baboon was measured by comparing nonsynonymous (dN) and synonymous (dS) substitution rates. The dN/dS ratio (ω) equals 1 if the whole coding sequence evolves neutrally. When dN/dS < 1, the sequence is under selective constraint, and when dN/dS > 1 it is likely to be under positive selection. I calculated the dN/dS ratio using models in the program package PAML version 3.14. From the gene family clustering, I obtained the genes that are single-copy in every species. Subsequently, I used neutral (M1 and M7) and selection (M2 and M8) models to identify codons under positive selection. Models M1 and M7 assume different distributions of ω values that do not exceed 1, whereas models M2 and M8 additionally allow a class of sites with ω larger than 1 (ω2), thereby distinguishing positive selection (ω > 1) from purifying selection (ω < 1) and neutral evolution (ω = 1). The fit of the model pairs M1-M2 and M7-M8 can be compared with a likelihood ratio test using a χ2 distribution with 2 degrees of freedom.
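The model comparison can be sketched as a likelihood ratio test in Python. The log-likelihood values below are made-up placeholders, not results from this study, and the function name `lrt_pvalue_df2` is hypothetical. For 2 degrees of freedom the χ2 survival function has the closed form exp(-x/2), so no statistics library is needed.

```python
import math


def lrt_pvalue_df2(lnl_null, lnl_alt):
    """P-value of a likelihood ratio test with 2 degrees of freedom.

    lnl_null -- log-likelihood of the neutral model (M1 or M7)
    lnl_alt  -- log-likelihood of the selection model (M2 or M8)
    """
    stat = 2.0 * (lnl_alt - lnl_null)
    if stat <= 0:
        return 1.0  # the selection model does not fit better
    # Survival function of the chi-square distribution with df = 2.
    return math.exp(-stat / 2.0)


# Placeholder log-likelihoods for one gene (illustrative only).
p = lrt_pvalue_df2(lnl_null=-1005.0, lnl_alt=-1000.0)
print(f"LRT p-value: {p:.4f}")
```

A gene would then be reported as a candidate for positive selection when the p-value falls below the chosen significance threshold, usually after correction for multiple testing.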

2.5 Comparative genomics

2.5.1 Synteny analysis of human, macaque, baboon and mandrill

For the comparative genomic analysis, syntenic blocks among primate species were first identified, to allow further identification of genomic rearrangement events such as inversions, insertions and deletions among these species. Proteins of human, macaque, baboon and mandrill were aligned against each other using blastp (version 2.2.26), and the blast results were then filtered using the criteria of coverage greater than 85% and identity greater than 85%. Finally, the best match of every gene was taken as the gene pair in synteny.
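The filtering and best-match selection can be sketched as follows. This is a minimal illustration, not the pipeline used in the study: the tuple layout of a hit, the function name `best_hits` and the use of the bit score to rank matches are all assumptions, and the gene identifiers in the example are invented.

```python
def best_hits(hits, min_cov=85.0, min_ident=85.0):
    """Select the best-scoring match for every query gene.

    hits -- iterable of (query, target, identity_pct, coverage_pct, bit_score)
    Hits must exceed both thresholds; ranking by bit score is an assumption.
    """
    best = {}
    for query, target, ident, cov, score in hits:
        if ident <= min_ident or cov <= min_cov:
            continue  # fails the coverage/identity criteria
        if query not in best or score > best[query][1]:
            best[query] = (target, score)
    return {query: target for query, (target, _) in best.items()}


hits = [
    ("g1", "h1", 90.0, 95.0, 200.0),
    ("g1", "h2", 99.0, 99.0, 250.0),  # better-scoring match for g1
    ("g2", "h3", 80.0, 99.0, 500.0),  # identity too low, discarded
]
print(best_hits(hits))
```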

2.5.2 Gene family contraction and expansion

With the gene family clustering result, gene family contraction and expansion can be detected to reveal the dynamic evolutionary changes of gene families along the phylogenetic tree. Based on the phylogenetic tree and divergence times, CAFE [119] was used for gene family contraction and expansion analysis. First, a global parameter λ was estimated by maximum likelihood under a random birth and death model. Then a conditional p-value was calculated for each family, and families with a p-value of less than 0.05 were marked as significantly changed, meaning that these families underwent contraction or expansion during evolution.

2.5.3 Segmental duplications

Segmental duplications are duplicated blocks of genomic DNA typically ranging in size from 1 to 200 kb. They often contain high-copy repeats or intron-exon structures. The whole-genome shotgun sequence detection (WSSD) method was used to identify segmental duplications [120]. Whether a sequence is duplicated was determined according to its overrepresentation and average sequence identity. After excluding TEs in the genome, clean reads were mapped to the genome using BWA with the parameters “-m 200000 -l 20 -k 2 -t 30”, and samtools was then used to obtain coverage and depth.

2.6 Investigating molecular mechanisms of adaptation/phenotype

2.6.1 Immune character

The major histocompatibility complex (MHC) is a set of genes encoding cell surface proteins that help cells recognize foreign substances. It is part of the immune system and has been demonstrated to be associated with many diseases. The main function of MHC molecules is to bind peptides derived from pathogens and present them on the cell surface to facilitate T-cell recognition and trigger a series of immune functions. The MHC has been shown to be highly polymorphic in most primate species, including macaque. The MHC class I region was therefore identified in the mandrill genome by searching the human sequence against it with RepeatMasker (version 4.0.5, with parameters “-nolow -no_is -norna -engine ncbi”).

2.6.2 Language competence

Language is a special ability for communication within species, particularly in human. Some genes have previously been found to be involved in language, and exploring the status of these genes in animals can help to understand the origin of language. FOXP2 was the first gene found to be related to human language development, and a heterozygous missense mutation was thought to cause an inherited language disorder based on a case study of a family known as the KE family. FOXP2 is expressed in many tissues including the basal ganglia and inferior frontal cortex [121], where it is essential for brain maturation and for speech and language development. Here, FOXP2 protein sequences of human, chimpanzee and mouse were downloaded from NCBI. These FOXP2 protein sequences were mapped to the baboon and mandrill genomes using blat with the parameters “-q=prot -t=dnax”. Blat results were filtered using the following criteria: i) hits other than the best five were removed; ii) hits covering less than 30% of the query protein were removed; iii) hits with a difference greater than 20% were removed. After the blat alignment, GeneWise was used for fine mapping with default parameters.

2.6.3 Olfactory character

Olfaction, or the sense of smell, is one of the important senses for animals. Chemical communication is least well understood in Old World species, and the olfactory sense is underappreciated [122]. Protein sequences of human olfactory receptor (OR) genes from HGNC (http://www.genenames.org/genefamilies/OR) were searched against the genomes of mandrill and baboon to identify OR genes.

2.6.4 Predicting binding sites of transcription factors

Transcription factors (TFs) are key regulators which bind to specific DNA sequences to activate or repress gene expression. Each TF has at least one DNA-binding domain (DBD), which is always conserved. Based on their DBDs, TFs can be classified into 70 families in the AnimalTFDB 2.0 database [123]. In order to identify TFs and explore their functions, BLAST was used to search the protein sequences against the TFs in the database. The 1,691 human protein sequences in the AnimalTFDB 2.0 database were selected as the BLAST database, with the conditions set as e-value <= 1e-5, coverage >= 30% and identity >= 20%. The prediction yielded 68 TF families in total, comprising 3,438, 3,714 and 4,272 genes in Hsa (human), Msp (mandrill) and Oba (olive baboon), respectively.

Transcription factor binding sites (TFBS) are sequence motifs; a motif may correspond to the active site of an enzyme or to a structural unit necessary for proper expression of genes. Sequence motifs are thus among the basic functional units of molecular evolution. Consequently, identifying and understanding these motifs is fundamental to building models of cellular processes at the molecular scale and to understanding the mechanisms of human disease. In this study, we used the MEME Suite, which comprises an integrated set of tools and databases, to perform motif-based sequence analysis.

We used the built-in motifs and the DREME algorithm (e-value <= 1e-10) to identify human genomic sequences that may contain the discovered motifs, and to determine whether the motifs are similar to previously studied motifs. In total, 19,024 TFBS were predicted in Hsa (human), and we then examined whether there were variations near the binding sites, using an extension of 50 bp.

3. Results

3.1 Landscapes of baboon and mandrill genomes

3.1.1 Sequencing data

For de novo genome assembly, 12 libraries were constructed and sequenced for each of the two species, and the sequencing data are summarized in Table 3.1. In total, for the two species combined, 512 Gb (~170× assuming a genome size of 3 Gb) of raw paired-end data and 328 Gb (~109× assuming a genome size of 3 Gb) of raw mate-pair data were obtained.

Table 3.1 Statistics of baboon and mandrill raw sequencing data.

Species    Insert size (bp)   Average read length (bp)   Raw data (Gb)   Sequence depth (×)
Baboon     250                150                        109.48          36.49
           500                100                        80.91           26.97
           800                100                        60.11           20.37
           4,000              90                         68.24           22.75
           10,000             90                         95.72           31.91
           Total              -                          414.46          138.15
Mandrill   250                150                        113.296         37.77
           500                100                        83.054          27.68
           800                100                        65.328          21.78
           2,000              90                         34.561          11.52
           5,000              90                         32.967          10.99
           10,000             90                         65.377          21.79
           20,000             90                         32.141          10.71
           Total              -                          426.724         142.2

After data filtering, 284 Gb and 289 Gb clean data were obtained (Table 8.1).


3.1.2 K-mer analysis

To assess the genome features, 17-mers (17 bp sub-sequences) were extracted and subjected to K-mer analysis. The reads from the short-insert libraries were used for this analysis (baboon: libraries with insert sizes of 250 bp, 500 bp and 800 bp, ~202 Gb in total; mandrill: libraries with insert sizes of 250 bp, 500 bp and 800 bp, ~212 Gb in total). In the depth-frequency distributions (Figure 8.1), the peaks were at ~28× and ~31×, respectively. Thus the genome sizes of baboon and mandrill were estimated to be 2.93 Gb and 2.90 Gb, respectively (Table 3.2).

Table 3.2 The information of 17-mer statistics.

Species    K    Number of K-mers   Depth peak   Genome size (bp)   Sequencing depth (×)
Baboon     17   82,117,298,803     28           2,932,760,671      33
Mandrill   17   89,967,169,490     31           2,902,166,757      37

The distribution of K-mer frequencies of reads from a second generation sequencing dataset can also reflect the heterozygosity of the genome [124]. For a genome without heterozygosity or repeats and with no sequencing errors, the K-mer frequency distribution should be a Poisson distribution. In a real dataset, sequencing errors produce an excess of low-frequency K-mers. Heterozygous regions produce two sets of K-mers at half of the major sequencing depth (K-mer frequency), so the higher the heterozygosity, the more obvious the secondary peak at half the major frequency. Repeat sequences yield multiple copies of identical K-mers, so secondary peaks can be found at twice the major K-mer frequency or at higher multiples. As indicated in Figure 8.1, for baboon there was no obvious secondary peak at half of the major K-mer frequency (~28), indicating low heterozygosity in the sequenced baboon individual. For mandrill, however, an obvious secondary peak was found at a K-mer frequency of ~16, half of the major K-mer frequency (~31), so the mandrill individual should have relatively high heterozygosity. For both genomes, there were also noticeable peaks at twice the major K-mer frequency, indicating high repeat content. Thus, both sequenced genomes are highly repetitive, and the mandrill genome is in addition clearly heterozygous.

3.1.3 Genome assembly

With the genome features estimated, genome assembly was conducted for both species. The final baboon genome assembly was 3.12 Gb with ~80 Mb of gaps, similar to the overall genome length estimated in the K-mer analysis (Table 3.3). The contig N50 was 21.7 kb and the longest contig was 238.9 kb, indicating good continuity of the genome and adequate quality for gene annotation. The scaffold N50 was 1.1 Mb and the longest scaffold was 8.8 Mb. The 2,308 longest scaffolds comprised more than 80% of the whole genome. Similarly, for mandrill, the total assembled length was 2.88 Gb, with ~80 Mb of gaps. The contig N50 was 20.5 kb and the longest contig was 211 kb. The scaffold N50 was 3.6 Mb and the longest scaffold was 19.1 Mb. The 634 longest scaffolds comprised more than 80% of the whole genome. The genome assemblies are of good quality for downstream analysis, with good coverage and continuity.

Table 3.3 Statistics of the genome assemblies.

                                  Contig*1                       Scaffold
                                  Size (bp)        Number        Size (bp)        Number
Baboon
  N90*2                           2,315            171,662       52,973           4,209
  N80                             7,938            108,096       332,809          2,308
  N70                             12,413           77,767        559,903          1,593
  N60                             16,868           56,789        798,728          1,128
  N50                             21,659           40,868        1,070,645        792
  Longest                         238,945          ----          8,793,459        ----
  Total size                      3,044,016,568    ----          3,116,777,842    ----
  Total number (>=100 bp)         ----             1,831,592     ----             1,610,583
  Total number (>=2 kb)           ----             177,569       ----             11,097
Mandrill
  N90                             5,266            141,475       638,217          936
  N80                             9,025            101,618       1,303,160        634
  N70                             12,638           75,505        1,962,294        457
  N60                             16,336           56,061        2,730,696        332
  N50                             20,483           40,751        3,564,730        241
  Longest                         211,017          ----          19,105,867       ----
  Total size                      2,798,997,503    ----          2,882,689,325    ----
  Total number (>=100 bp)         ----             455,069       ----             215,140
  Total number (>=2 kb)           ----             194,923       ----             4,742

*1. Contigs are the first assembled sequences without gaps, while scaffolds are the sequences generated by linking contigs with gaps filled in.

*2. N90 means the length of the contig/scaffold for which all the contigs/scaffolds longer than it accumulate to 90% of the total length. Similarly, N(P), in which P ranges from 50 to 90 in this table, indicates the length of the contig/scaffold for which all the contigs/scaffolds longer than it accumulate to P% of the total length.
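The N(P) statistic defined in footnote *2 can be computed with a short Python sketch. The function name `nx` and the toy lengths are illustrative only; integer arithmetic is used to avoid floating-point issues at the threshold.

```python
def nx(lengths, p):
    """Return N(P): the length L such that sequences of length >= L
    together account for at least P% of the total assembly length."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 100 >= total * p:
            return length
    raise ValueError("no sequence lengths given")


# Toy assembly of eight contigs with a total length of 32.
contigs = [2, 2, 2, 3, 3, 4, 8, 8]
print("N50 =", nx(contigs, 50))
print("N90 =", nx(contigs, 90))
```

In the toy example the two longest contigs already cover half of the assembly, so N50 equals their length, whereas N90 is reached only after including some of the shortest contigs.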

3.1.4 Annotation results

3.1.4.1 Repeat annotation

Repeats are widespread in the genome and may have important functions. Repeats were annotated and categorized in both genomes (Table 8.2 - Table 8.5). For baboon, repeats took up ~50% of the whole genome, with 47% being transposable elements (TEs). Compared with human (Table 3.4), Long Interspersed Nuclear Elements (LINEs) were less abundant in the baboon and mandrill genomes (~17%) than in the human genome (~21%), while Short Interspersed Nuclear Elements (SINEs) were similar across these genomes (~12%). In particular, Alu elements had very similar proportions (10%~11%), reflecting that Alu elements are conserved within primate genomes, as previously described [125].

Table 3.4 Repeat contents of baboon, mandrill, human and mouse.

                     Percentage coverage of genome
Group                Baboon    Mandrill    Human    Mouse
LINE                 16.76     16.61       20.99    19.2
  L1                 15.6      15.05       17.37    18.78
  L2                 1.05      1.39        3.3      0.38
  LINE/other         0.11      0.17        0.32     0.04
SINE                 11.26     12.10       13.64    8.22
  Alu                10.14     10.47       10.74    2.66
  MIR                0.92      1.37        2.9      0.57
  B4                 0.14      0.20        --       2.36
  SINE/other         0.07      0.06        --       2.64
LTR                  7.88      8.36        8.55     9.87
  MaLRs              3.12      3.40        3.78     4.82
  Other ERVs         4.68      4.85        4.77     4.4
  LTR/other          0.08      0.11        --       0.65
DNA transposons      2.7       3.27        3.03     0.88
Other                3.70      0.06        0.53     0.74
Total                42.3      40.40       46.74    38.91

3.1.4.2 RNA annotation

The non-coding RNAs (ncRNAs) are RNA molecules that are not translated into a protein. Four types of ncRNAs were annotated in baboon and mandrill genomes, including transfer RNAs (tRNAs), ribosomal RNAs (rRNAs), and small nuclear RNAs (snRNAs) (Table 8.6 and Table 8.7).

3.1.4.3 Gene annotation

After masking repeats, protein-coding genes were predicted using ESTs, homologous proteins and ab initio prediction, yielding final sets of 23,867 (baboon) and 21,906 (mandrill) protein-coding genes (Table 3.5 and Table 3.6). In the mandrill genome, the average number of exons per gene is slightly lower than in the baboon genome (7.52 versus 8.20), while the average exon length is slightly longer (185 bp versus 178 bp). In addition, the average intron length in mandrill is about 800 bp longer than in baboon (5,785 bp versus 4,972 bp).

Table 3.5 Summary of gene annotation in baboon genome.

Gene set                         Number    Average       Average    Average     Average    Average
                                           transcript    CDS        exon per    exon       intron
                                           length (bp)   length     gene        length     length
                                                         (bp)                   (bp)       (bp)
De novo   AUGUSTUS               22,528    45,907        1,371      8.10        169        6,272
Homolog   Nomascus leucogenys    21,278    36,106        1,467      8.31        176        4,741
          Pongo abelii           23,806    32,996        1,341      7.59        176        4,801
          Pan troglodytes        21,245    35,899        1,468      8.22        178        4,771
          Macaca mulatta         25,930    32,844        1,267      7.14        177        5,145
          Gorilla gorilla        24,402    30,755        1,377      7.65        179        4,415
          Homo sapiens           25,350    35,308        1,481      8.18        181        4,710
EST                              39,294    6,538         775        2.21        350        5,763
Final set                        23,867    37,246        1,459      8.20        178        4,972

Table 3.6 Summary of gene annotation in mandrill genome.

Gene set                         Number    Average       Average    Average     Average    Average
                                           transcript    CDS        exon per    exon       intron
                                           length (bp)   length     gene        length     length
                                                         (bp)                   (bp)       (bp)
De novo   AUGUSTUS               18,460    54,148        1,429      8.68        164.65     6,863
Homolog   Nomascus leucogenys    20,874    39,863        1,499      8.56        175.07     5,072
          Pongo abelii           23,330    37,371        1,373      7.82        175.53     52,757
          Pan troglodytes        20,866    40,317        1,502      8.46        177.62     5,204
          Macaca mulatta         25,460    38,089        1,294      7.36        175.96     5,787
          Gorilla gorilla        23,791    34,748        1,413      7.92        178.47     4,816
          Homo sapiens           25,161    39,338        1,513      8.42        179.77     5,098
EST                              38,021    7,365         781        2.33        335.00     4,935
Final set                        21,906    39,087        1,390      7.52        184.95     5,785

3.1.4.4 Gene evaluation and function annotation

To evaluate the quality of the annotated protein-coding genes, 3,023 BUSCO (Benchmarking Universal Single-Copy Orthologs) groups were searched against the predicted gene sets. Complete matches were found in the final gene sets for 97.1% (baboon) and 98.6% (mandrill) of the groups (Table 3.7). Besides, 99.24% (baboon) and 98.70% (mandrill) of the predicted genes had a biological function supported by at least one of the functional databases (Table 3.8).

Table 3.7 Assessment of gene sets using BUSCO.

                                    Baboon    Mandrill
Total BUSCO groups                  3,023     3,023
Complete BUSCOs                     2,936     2,981
Complete and single-copy BUSCOs     2,772     2,811
Complete and duplicated BUSCOs      164       170
Fragmented BUSCOs                   63        28
Missing BUSCOs                      24        14
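The completeness percentages follow directly from the counts in Table 3.7. The short sketch below recomputes them; the function name `busco_percentages` is hypothetical, and the code is only a consistency check on the table, not part of the BUSCO software.

```python
def busco_percentages(total, single_copy, duplicated, fragmented, missing):
    """Convert BUSCO category counts into percentages of all searched groups."""
    complete = single_copy + duplicated
    if complete + fragmented + missing != total:
        raise ValueError("category counts do not add up to the total")
    counts = {
        "complete": complete,
        "single_copy": single_copy,
        "duplicated": duplicated,
        "fragmented": fragmented,
        "missing": missing,
    }
    return {name: round(100.0 * n / total, 2) for name, n in counts.items()}


# Counts taken from Table 3.7.
baboon = busco_percentages(3023, 2772, 164, 63, 24)
mandrill = busco_percentages(3023, 2811, 170, 28, 14)
print(baboon["complete"], mandrill["complete"])
```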

Table 3.8 Function annotation of the final gene sets.

                              Baboon                   Mandrill
                              Gene number    %         Gene number    %
Total                         23,867         100.00    21,906         100.00
Annotated     InterPro        20,310         85.10     18,139         82.80
              GO              15,818         66.27     14,160         64.64
              KEGG            19,733         82.68     18,022         82.27
              Swissprot       22,547         94.47     20,547         93.80
              TrEMBL          23,661         99.14     21,529         98.28
              All databases   23,686         99.24     21,622         98.70
Unannotated                   181            0.74      284            1.30

3.2 Evolution of baboon and mandrill

3.2.1 Gene families

To analyze gene family evolution in baboon and mandrill, gene family clustering was conducted, identifying 17,947 and 15,368 gene families respectively, with 668 and 1,387 genes left unclustered (Table 3.9). Compared with human (Homo sapiens), macaque (Macaca mulatta) and chimpanzee (Pan troglodytes), 489 and 342 gene families, containing 598 and 515 genes, were unique to baboon and mandrill respectively (Figure 3.1). These unique gene families were significantly enriched in the gene ontology (GO) terms GO:0042773, ATP synthesis coupled electron transport (GO level: biological process, BP, P=1.28e-12), and GO:0016651, oxidoreductase activity, acting on NADH or NADPH (GO level: molecular function, MF, P=1.68e-09), for baboon (Table 8.8), and GO:0006412, translation (GO level: BP, P=6.29e-33), and GO:0003735, structural constituent of ribosome (GO level: MF, P=6.29e-33), for mandrill (Table 8.9). In addition, 5,133 single-copy orthologous genes were shared among all 13 species (Figure 3.2).
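How single-copy orthologous families and species-specific families can be read off a clustering result is sketched below. The data structure (a mapping from family identifier to per-species gene lists), the function names and the toy gene identifiers are all hypothetical; the clustering itself is assumed to have been done by a separate tool.

```python
def single_copy_families(families, species):
    """Return ids of families with exactly one gene in every listed species."""
    return sorted(
        fam_id
        for fam_id, members in families.items()
        if all(len(members.get(sp, [])) == 1 for sp in species)
    )


def species_specific_families(families, focal):
    """Return ids of families whose genes all come from the focal species."""
    return sorted(
        fam_id
        for fam_id, members in families.items()
        if members.get(focal) and all(
            not genes for sp, genes in members.items() if sp != focal
        )
    )


# Hypothetical toy clustering result: family id -> species -> gene ids.
toy = {
    "fam1": {"baboon": ["b1"], "mandrill": ["m1"], "human": ["h1"]},
    "fam2": {"baboon": ["b2", "b3"], "mandrill": ["m2"], "human": ["h2"]},
    "fam3": {"baboon": ["b4", "b5"], "mandrill": [], "human": []},
}
print(single_copy_families(toy, ["baboon", "mandrill", "human"]))
print(species_specific_families(toy, "baboon"))
```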

Table 3.9 Gene family clustering in the seven species.

Species               Genes     Un-clustered   Family    Unique     Average genes
                      number    genes          number    families   per family
Callithrix jacchus    20,585    445            16,858    12         1.19
Gorilla gorilla       20,478    313            17,495    8          1.15
Homo sapiens          19,513    105            17,367    2          1.12
Macaca mulatta        20,627    912            16,391    38         1.2
Mandrillus sphinx     21,906    1,387          15,368    87         1.34
Microcebus murinus    17,853    310            15,414    9          1.14
Mus musculus          22,190    864            17,778    209        1.2

Note: Un-clustered genes refer to unique genes in the species; Unique families refer to unique gene families of the species.

Figure 3.1 Orthologous gene clusters in the five related species. The Venn diagram of unique and shared gene families in the human, mandrill, gorilla, macaque and baboon genomes.


Figure 3.2 Comparison of orthologous genes among 13 primates and mouse.

3.2.2 Phylogenetic analysis

To analyze species evolution, a phylogenetic tree of baboon, mandrill and other sequenced animal genomes was constructed based on single-copy orthologous genes. The molecular clock (neutral substitution rate per year) was estimated from 4-fold degenerate sites in the single-copy orthologous genes, and divergence times were then estimated. The maximum-likelihood phylogenetic tree (Figure 3.3) indicates that baboon and mandrill are located in the same clade as macaque and that they diverged from the human clade about 28.5 (27.5–30.4) million years ago (MYA), while the divergence time between Cercopithecoidea and Hominoidea was previously estimated at 26.66 (24.29–28.95) MYA using mitochondrial genome sequences [126]. Baboon and mandrill were estimated to have split from macaque about 7.9 (6.9–9.2) MYA, somewhat earlier than the previous estimate of 6.6 (6.0–8.0) MYA [127]. Baboon and mandrill split from each other ~5.8 (5.0–6.8) MYA, reflecting their close evolutionary relationship.
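A 4-fold degenerate site is a third codon position at which any of the four nucleotides encodes the same amino acid. The sketch below shows one way to locate such sites in a single coding sequence under the standard genetic code. The function names and the example sequence are hypothetical; a real analysis would additionally require the site to be 4-fold degenerate in the aligned codons of all species compared.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, listed in TCAG order for the first, second and third base.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def is_fourfold_degenerate(codon):
    """True if all four bases at the third position encode the same amino acid."""
    codon = codon.upper()
    if codon not in CODON_TABLE:
        return False  # ambiguous bases or gaps are skipped
    return len({CODON_TABLE[codon[:2] + base] for base in BASES}) == 1


def fourfold_sites(cds):
    """Return 0-based positions of 4-fold degenerate third codon positions."""
    usable = len(cds) - len(cds) % 3
    return [
        start + 2
        for start in range(0, usable, 3)
        if is_fourfold_degenerate(cds[start:start + 3])
    ]


# Hypothetical coding sequence: ATG GCT TTT GGA
print(fourfold_sites("ATGGCTTTTGGA"))
```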

Figure 3.3 Phylogenetic tree based on single copy gene families in the 13 species. The calibration time marked as red dot is derived from previous publications [89-91].

The demographic history of a species reflects historical population changes and can be inferred from the genome. We inferred a noticeable population bottleneck in the demographic history of baboon and mandrill using the pairwise sequentially Markovian coalescent (PSMC) model (Figure 3.4). The two species went through similar population size changes between 100 and 10,000 thousand years (kyr) ago. Around 28 kyr ago, the effective population size increased sharply, followed by a noticeable bottleneck around 17 kyr ago, from peaks of 61,000 and 47,000 to ~6,500 in both the baboon and mandrill populations. The increase coincided with an increase of the human population, probably indicating climate change favorable to the expansion of mammals, whereas the recent bottleneck of the baboon and mandrill populations contrasts with the recent increase of the human population.
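PSMC reports time and population size in scaled units, which are converted to years and effective population size with an assumed mutation rate and generation time. The sketch below follows the scaling convention commonly described for PSMC output (N0 = theta0 / (4 * mu * s), time = 2 * N0 * t_k generations, Ne = N0 * lambda_k). The function name and all parameter values here are illustrative assumptions, not the settings used in this thesis.

```python
def scale_psmc(t_k, lambda_k, theta0, mu, generation_years, bin_size=100):
    """Convert one scaled PSMC interval to (years ago, effective population size).

    t_k, lambda_k    scaled time and relative population size of the interval
    theta0           scaled mutation rate per bin estimated by PSMC
    mu               assumed mutation rate per site per generation
    generation_years assumed generation time in years
    bin_size         number of bases per bin in the PSMC input (assumed 100)
    """
    n0 = theta0 / (4.0 * mu * bin_size)
    years = 2.0 * n0 * t_k * generation_years
    ne = n0 * lambda_k
    return years, ne


# Illustrative values only; mu and generation time are assumptions.
years, ne = scale_psmc(t_k=0.5, lambda_k=2.0, theta0=0.04,
                       mu=1e-8, generation_years=10)
print(round(years), round(ne))
```

Because both axes of Figure 3.4 scale with the assumed mutation rate and generation time, the absolute dates and population sizes shift if different values are chosen, while the shape of the curve does not.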

Figure 3.4 The demographic change of baboon and mandrill. The population size change over time was estimated by PSMC model. The x-axis indicates the time, from left to right to be from recent to ancient, while the y-axis indicates the effective population size.

3.3 Synteny among primates

3.3.1 Synteny analysis of human, macaque, baboon and mandrill

By comparing the genomes of human, macaque, baboon and mandrill, syntenic regions can be identified, and with them historical genome rearrangement events such as inversions, insertions and deletions. These events may result in the loss, duplication or functional change of genes. In total, 9,930, 11,418 and 14,318 gene pairs were identified between baboon and mandrill, macaque and baboon, and human and macaque, respectively. Human retained 24 chromosomes (22+X+Y) while the baboon clade had only 22 chromosomes (20+X+Y) (Figure 3.5) after ~27.5–30.4 million years of evolution. I found that chromosomes 13 and 14 of the baboon branch went through chromosome fusion and that chromosomes 7 and 10 experienced chromosome breaks after the clade formed. Moreover, several inversion events occurred in comparison with human, including paracentric inversions (on chromosomes 1, 6, 9 and others) and a pericentric inversion (on chromosome 2). In more detail, I analyzed the genes located in five inversions longer than 37 Mb on chromosomes 1, 2, 3, 4 and 9 and found them significantly enriched in terms including GO:0006412: translation (GO level: BP, P=4.60e-03) and GO:0004950: chemokine receptor activity (GO level: MF, P=8.96E-06) (Table 8.10). I also found that the FOXP2 gene, which is vital for the formation of voice and language, was located at chromosome 3: 135,344,867–135,605,165 in mandrill (Figure 8.3).

Figure 3.5 Synteny relationship of human, macaque, baboon and mandrill.

3.3.2 Gene family contraction and expansion

In the baboon and mandrill lineage, there were 545 expanded and 618 contracted gene families (Figure 3.6). Expanded gene families were significantly enriched in the functions of biosynthetic process, structural constituent of ribosome, nucleosomal DNA binding, G-protein coupled receptor activity, olfactory receptor activity, glucose catabolic process and peptidyl-prolyl isomerization, as well as in the pathways of carbon fixation in photosynthetic organisms and electron transport chain. In baboon and mandrill, the peptidylprolyl isomerase A (PPIA) family was significantly expanded (GO:0003755, P=3.60E-89, Fisher’s exact test, 40 baboon genes and 53 mandrill genes). PPIA belongs to the peptidyl-prolyl cis-trans isomerase (PPIase) family, which catalyzes cis-trans isomerization and the folding of newly synthesized proteins, interacts with several transcription factors and regulates many biological processes including inflammation and apoptosis, and even acts in cerebral hypoxia-ischemia. Under stress, in the presence of reactive oxygen species (ROS), cells secrete PPIA to induce an inflammatory response and mitigate tissue injury. Baboons have been used as models of embryo infections and various bacterial infections and were found to respond rapidly during the early innate immune response, which may be related to PPIA functions.

The peroxiredoxin-6 (PRDX6) family, which reduces peroxides and protects against oxidative injury during metabolism, was also significantly expanded (GO:0051920, P = 0.000641, Fisher’s exact test, 4 baboon genes, 5 mandrill genes).
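The enrichment P-values quoted above come from Fisher's exact test. As a minimal sketch of how a one-sided test on a 2x2 table works, the function below sums hypergeometric probabilities with the standard library only. The function name and the example table are hypothetical; the actual contingency tables behind the PPIA and PRDX6 results are not given in the text and are not reproduced here.

```python
from math import comb


def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Returns P(X >= a) for the hypergeometric distribution with the same
    row and column totals, i.e. the probability of seeing at least `a`
    annotated genes in the focal set by chance.
    """
    row1 = a + b          # size of the focal gene set
    col1 = a + c          # genes carrying the annotation
    n = a + b + c + d     # all genes considered
    denominator = comb(n, row1)
    numerator = sum(
        comb(col1, x) * comb(n - col1, row1 - x)
        for x in range(a, min(row1, col1) + 1)
    )
    return numerator / denominator


# Hypothetical 2x2 table: rows = focal / background genes,
# columns = with / without the annotation.
p_value = fisher_exact_greater(3, 1, 1, 3)
print(round(p_value, 4))
```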


Figure 3.6 Gene family contraction and expansion for 12 primates and mouse.

3.3.3 Segmental duplications

Segmental duplications (SDs) are widespread and may be functionally important; SDs were therefore identified in seven species including baboon and mandrill (Figure 3.7). Long segmental duplications were similar in baboon and mandrill, and fewer than in human.


Figure 3.7 Segmental duplications in seven primate species.

3.4 MHC comparison between human and baboon/mandrill

The major histocompatibility complex (MHC) contains a series of genes encoding cell-surface proteins that help cells recognize foreign substances; the MHC is therefore important for the immune system and has been associated with many diseases. The proteins encoded by MHC genes mainly bind peptides derived from pathogens and present them on the cell surface to facilitate T-cell recognition and trigger a series of immune functions. The MHC region is highly polymorphic in most primate species studied so far. A previous study of the macaque MHC region revealed the diversity of this region, whereas for Old World monkeys other than macaque the MHC regions remain largely unknown. In the assembled genomes of baboon and mandrill, a relatively complete assembly of the MHC region was found only in mandrill and not in baboon, because of the complexity caused by the high repeat content. The MHC region of mandrill was found on chromosome 4. To confirm that the assembled MHC region of mandrill was of high quality, reads were mapped back to the assembled genome, showing good coverage and consistent paired-end/mate-pair relationships (Figure 8.3), which supports the assembly of the MHC region in mandrill. Since the MHC region is highly repetitive, a detailed repeat annotation was carried out for both the mandrill and human MHC class I regions (from gene GABBR1 to gene MICB, in the direction from the telomere to the centromere) with the same parameters, revealing similar repeat content in the two species (48.27% in mandrill compared with 51.03% in human) (Table 8.11). In addition, the genes of the two species in this region were in good synteny (Figure 3.8). Only 54 insertions and deletions (indels) longer than 100 bp were found between the MHC class I regions of the two species, and most of them overlapped repetitive elements such as SINEs, LINEs and LTRs, indicating the influence of repeat content on MHC diversity.

HLA genes are important for immune recognition, so HLA genes were further examined and compared with human. The human MHC class I region contains 50 genes in total, including 6 HLA genes, while only 4 HLA genes were identified in the mandrill MHC class I region. Searching the rest of the genome outside the MHC region identified another 4 HLA genes, bringing the total number of HLA genes in mandrill to 8. However, further inspection showed that 5 of the 8 HLA genes in mandrill harbored start or stop codon changes, premature termination or frameshift mutations (Figure 3.9), which may reflect genetic mechanisms underlying differences in immune response between mandrill and human.

Only one MIC gene was found in chimpanzee [128], compared with the two genes MICA and MICB in human, which resulted from a genomic duplication that occurred ~33-44 million years ago [129, 130]; MIC genes were therefore further identified in mandrill (Figure 3.10). Both MICA and MICB genes or gene fragments were found in mandrill, but the gene structure of MICA in mandrill was incomplete because of the loss of the first exon. Again, this may reflect genetic mechanisms underlying differences between human and mandrill immune responses.

Figure 3.8 Synteny between human and mandrill MHC regions.


Figure 3.9 Alignment of HLA genes with amino acid sequence for human, baboon and mandrill.

Figure 3.10 Structure of MICA and MICB gene for human and mandrill. The red, orange, yellow and purple boxes represent exons, LINEs, SINEs and LTRs, respectively.

3.5 Language related genomic features

Language is a special ability for communication, used particularly by humans. Several genes have been identified as involved in language formation in humans. FOXP2 was the first gene identified as relevant to human language development; a heterozygous missense mutation in this gene was proven to cause an inherited language disorder in a case study of a family known as the KE family. FOXP2 is expressed in many tissues, including the basal ganglia and inferior frontal cortex [121], which are essential for brain maturation and speech/language development. Two human-specific amino acid substitutions, relative to chimpanzee, affect the neural functions of FOXP2 and lead to differential transcriptional regulation in vivo, and 111 genes were found to change expression significantly [131] as a result of these two substitutions. Similarly, using ChIP-seq with an antibody designed against a FOXP2 peptide, researchers found 175 target genes [132]. With the baboon and mandrill genomes available, FOXP2 gene evolution in primates was further investigated here to shed light on language-related genomic features.

FOXP2 genes in baboon and mandrill were identified and compared with those in human, chimpanzee and mouse (Figure 3.11). In baboon, the FOXP2 gene (which aligns well to human ENSP00000386200) was found on scaffold1015 from 492,895 to 753,085 bp, and in mandrill, the FOXP2 gene was found on scaffold103 from 4,983,301 to 5,243,599 bp. Both had 18 exons.

In human, the FOXP2 gene has 22 different transcripts and encodes several motifs, including the FOXP coiled-coil domain and the Fork head domain. The FOXP coiled-coil domain modulates the dimeric associations of FOXP transcription factors, and mutations in this domain may cause diseases such as IPEX (immunodysregulation polyendocrinopathy enteropathy X-linked) syndrome. The Fork head domain is found in several different transcription factors and is involved in a variety of biological processes including early embryogenesis, organogenesis, tumorigenesis and signal transduction.

Comparing the human and baboon FOXP2 proteins (715 amino acids in full length), only two amino acid differences were found, both encoded in the seventh exon. Moreover, no substitution was found in the coiled-coil or Fork head domains, which is concordant with previous studies. The two amino acid substitutions may affect the function of FOXP2.

Figure 3.11 Amino acid sequence alignment of FOXP2 from human, chimpanzee, mouse, baboon and mandrill. Dots represent residues identical to the human sequence.
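The dot notation used in such alignment figures can be expanded and the differing positions listed with a few lines of code. This is a minimal sketch: the function names are hypothetical and the short sequences are made up for illustration, not taken from FOXP2. Positions are 1-based alignment columns.

```python
def expand_dots(reference, row):
    """Replace '.' in an alignment row with the residue of the reference row."""
    if len(reference) != len(row):
        raise ValueError("aligned rows must have the same length")
    return "".join(ref if res == "." else res for ref, res in zip(reference, row))


def differences(reference, row):
    """List (column, reference residue, row residue) for every mismatch."""
    expanded = expand_dots(reference, row)
    return [
        (column, ref, res)
        for column, (ref, res) in enumerate(zip(reference, expanded), start=1)
        if ref != res
    ]


# Made-up aligned rows for illustration only.
human_row = "MTNSQAL"
other_row = "..T..S."
print(expand_dots(human_row, other_row))
print(differences(human_row, other_row))
```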

3.6 Olfactory receptor genes analysis

Olfaction, or the sense of smell, is one of the important senses of animals. However, chemical communication such as olfaction is not well understood in Old World species [122]. Most mammals possess two distinct sets of chemosensory neurons, located in the main olfactory epithelium (MOE) and in the vomeronasal organ (VNO), while Old World primates are generally considered to lack a functional VNO [133]. Previous studies indicated that olfactory communication plays a vital role in information acquisition during social foraging in both mandrill and baboon [134]. Compared with chimpanzee and macaque, almost all the olfactory receptor (OR) gene families were substantially expanded in baboon and mandrill (Table 3.10), and several families, including Family 52, were also expanded compared with human (Figure 3.12). In particular, Family 7, subfamily E member 24 ORs (OR7E24) are notably overrepresented in the mandrill and baboon genomes (6 copies in mandrill, distributed on chromosome 19 (Figure 8.4); 5 in baboon, distributed on chromosomes 14 and 19; 1 in human; 1 in macaque; and 0 in chimpanzee). Intriguingly, OR7E24 was confirmed to be preferentially and specifically expressed in human testis cells and is thought to play an important role in the migratory phase of the germ cell life cycle [135]. These ORs may be functionally important during the life cycle of mandrill and baboon, and further research can be conducted to explore the mechanisms.

Table 3.10 Olfactory receptor gene copy number in five species.

Families     human    macaque    mandrill    baboon    chimpanzee
Family 1     26       12         23          25        18
Family 2     67       47         94          95        49
Family 3     3        1          3           3         3
Family 4     56       35         68          62        33
Family 5     56       23         56          60        39
Family 6     31       24         38          36        23
Family 7     11       2          14          13        9
Family 8     23       11         22          23        15
Family 9     8        4          12          11        7
Family 10    37       23         46          42        23
Family 11    9        4          11          12        4
Family 12    3        1          3           3         0
Family 13    12       5          13          13        10
Family 14    1        0          1           1         1
Family 51    24       18         35          27        21
Family 52    26       16         48          39        22
Family 56    6        3          6           5         5
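The comparison described for Table 3.10 can be made explicit by listing the families in which both mandrill and baboon carry more copies than a chosen reference species. The sketch below uses the copy numbers from the table; the function name is hypothetical, and "larger in both species" is only one simple way to read "expanded".

```python
SPECIES = ("human", "macaque", "mandrill", "baboon", "chimpanzee")

# Copy numbers from Table 3.10: family -> counts in the order of SPECIES.
OR_COUNTS = {
    1: (26, 12, 23, 25, 18), 2: (67, 47, 94, 95, 49), 3: (3, 1, 3, 3, 3),
    4: (56, 35, 68, 62, 33), 5: (56, 23, 56, 60, 39), 6: (31, 24, 38, 36, 23),
    7: (11, 2, 14, 13, 9), 8: (23, 11, 22, 23, 15), 9: (8, 4, 12, 11, 7),
    10: (37, 23, 46, 42, 23), 11: (9, 4, 11, 12, 4), 12: (3, 1, 3, 3, 0),
    13: (12, 5, 13, 13, 10), 14: (1, 0, 1, 1, 1), 51: (24, 18, 35, 27, 21),
    52: (26, 16, 48, 39, 22), 56: (6, 3, 6, 5, 5),
}


def families_larger_than(reference, focal=("mandrill", "baboon")):
    """Families where every focal species has more copies than the reference."""
    ref_index = SPECIES.index(reference)
    focal_indices = [SPECIES.index(name) for name in focal]
    return [
        family
        for family, counts in sorted(OR_COUNTS.items())
        if all(counts[i] > counts[ref_index] for i in focal_indices)
    ]


print(families_larger_than("human"))
print(len(families_larger_than("chimpanzee")), "of", len(OR_COUNTS))
```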

Figure 3.12 Expansion of the olfactory receptor gene family in baboon and mandrill. The red, blue, green, yellow are olfactory receptor genes in the baboon, mandrill, human and chimpanzee.

3.7 Positively selected genes

In addition to gene family expansion and contraction, genes under selection during evolution are also functionally important; I therefore identified positively selected genes (PSGs) to reveal the evolution of baboon and mandrill and to depict possible functional changes in these species. The 5,133 single-copy orthologous genes shared among the 13 species, obtained in the gene family clustering, were used for detecting PSGs. In total, 657 PSGs were identified, with significant enrichment in molecular function terms for several activities (Table 8.12). Further investigation of these PSGs showed that 34 genes are innate immunity response genes according to InnateDB. Interactions among these genes were predicted with STRING: functional protein association networks (http://string-db.org/cgi). As shown in Figure 3.13, the STAT1, IL5, IL1R1, ATG5, CREB1, DICER1 and PIK3R1 genes may have important roles in the immune system, which is strongly associated with stress resistance and wound healing.

Finally, GO and KEGG pathway enrichment analysis showed that PSGs were enriched in the terms GO:0080134: regulation of response to stress (GO level: BP, P=1.59e-05), GO:0006955: immune response (GO level: BP, P=4.11e-05) and KEGG:4640: Hematopoietic cell lineage (P=6.83e-05).

Figure 3.13 Interactions among innate immunity-related positively selected genes in mandrill.

3.8 Disease related genomic features

To gain insight into disease-related mutations in baboon and mandrill, we collected the mutations in the HGMD database and checked the corresponding positions of these genes in baboon and mandrill. With this method, we found 17 genes carrying disease-associated amino acid changes in the two species (Table 3.11). Moreover, some of the mutations are located in functional domains and could strongly affect the function of these genes (Table 3.12). In human, these mutations are associated with disease phenotypes such as lung cancer, altered cranial volume and atopic asthma.

Furthermore, we searched for disease-related genes carrying mutations unique to baboon and mandrill. To our surprise, we found only one gene (LRRK2) with a unique amino acid change in baboon and mandrill, at position 1210 (Figure 3.14). At this site, all other species have a tyrosine, whereas baboon and mandrill have a cysteine; this change reduces the hydrophobicity, which could affect the function of the gene.

Table 3.11 Disease-related genes and their mutations in baboon and mandrill.

Gene name    Position    Wild type AA    Mutation AA    Disease Description
ALAD         59          K               N              Amyotrophic lateral sclerosis
CIITA        500         G               A              Multiple sclerosis
CRB1         959         G               S              Retinitis pigmentosa
IL4R         75          I               V              Asthma, atopic
MCPH1        761         A               V              Cranial volume
NPHS2        192         I               V              Nephrotic syndrome
TP53BP1      353         D               E              Lung cancer

Table 3.12 Mutation effect of some disease related genes.

PROTEIN    UNIPROT_ID    REF    ALT    POS    VAR      SIFT    Domain
ALAD       P13716        K      N      59     K59N     1       ALAD
CIITA      P33076        G      A      500    G500A    1       NACHT domain
CRB1       P82279        G      S      959    G959S    0.66    PFAM NO/PROSITE(EGF-like 14)
IL4R       J9JII2        I      V      75     I75V     0.82    Interleukin-4 receptor
MCPH1      Q8NEM0        A      V      761    A761V    0.52    BRCT domain
NPHS2      Q9NP85        I      V      192    I192V    1       SPFH domain / Band 7 family
TP53BP1    Q12888        D      E      353    D353E    1       not included

Figure 3.14 The unique mutation on baboon and mandrill with Y1210C.

4. Discussion

Primates are well-studied mammals because of their evolutionary importance as well as their close relationship to humans. Many primate genomes are available and the genomic features of primates have already been comprehensively studied. Despite this progress, genomic data for more primate species are necessary to improve our understanding of primate evolution and to support applications.

Here, applying second generation sequencing technologies, I established draft genomes for baboon and mandrill, which are valuable resources for primate and disease studies. The contig N50 of both genomes was longer than 20 kb and the scaffold N50 exceeded 1 Mb (3.56 Mb for mandrill), indicating good quality of the assembled genomes. To further improve the genome assemblies, long reads may be applied to fill the gaps and improve contig continuity, while genetic maps are necessary for anchoring the scaffolds onto chromosomes. However, the lack of genetic maps has usually impeded the construction of chromosome-level genome assemblies of primates. With further development of technologies such as Hi-C sequencing (formaldehyde cross-linking and sequencing), the assembled scaffolds may be anchored to chromosomes even without genetic maps.

Secondly, genomic features of baboon and mandrill were comprehensively explored with the draft genomes. The repeat content and gene content were similar to those of other primate species. According to the phylogenetic tree constructed from single-copy gene families, baboon and mandrill are located in the same clade; they diverged from the human clade about 28.5 million years ago (MYA) and split from each other about 5.8 MYA. Evolutionary changes, including chromosome-level changes, gene family changes (expanded, contracted and specific gene families) and positively selected genes, were identified here to reflect the genetic differences between these two primate species and others. For example, chromosome fusion events (fusion of human chromosome 13 and 14) were identified even with the scaffold-level genome assembly. Thus, with further improvement of the genome assembly, especially to the chromosome level, genomic changes can be investigated further to comprehensively reveal evolutionary changes.

Thirdly, since baboon is commonly used as a model in human disease research and both species have some specific features, the genetic mechanisms underpinning immunity, language ability and olfaction were investigated. For immunity, the MHC region was analyzed specifically in the mandrill genome, because only the mandrill assembly of this region was relatively complete. Very good synteny was found between the mandrill and human MHC regions, with only 54 insertions and deletions (longer than 100 bp). For homologs of human leukocyte antigen (HLA) genes, I found fewer HLA genes in both baboon and mandrill than in human (8 genes in total, five of them harboring deleterious mutations). Unlike in chimpanzee, two MIC genes were found in baboon and mandrill, although one of them has probably become a pseudogene. The similarity of the MHC region and the lack of HLA gene families are probably related to the success of cross-species transplantation cases. For baboon, improving the assembly of the MHC region would be valuable for future studies. For language ability, I explored the FOXP2 genes in the two species and found two amino acid changes compared with the human FOXP2 gene; further validation should be carried out to illustrate the influence of these substitutions. Substantially expanded olfactory receptor (OR) gene families were found in baboon and mandrill compared with other species, indicating specific olfactory systems in these two species, which also await further study. We also found mutations that cause disease in human but do not produce a diseased phenotype in baboon and mandrill. For example, MCPH1 is a gene identified as responsible for the neurodevelopmental disorder primary microcephaly type 1, which is characterized by a smaller-than-normal brain size and mental retardation (Liu, 2016). We found the same amino acid at this site in all primates except Homo sapiens (Figure 4.1). Among the primates compared, Homo sapiens has the largest brain volume. We infer that this site may be under positive selection in humans and may have contributed to the evolution of human intelligence. Moreover, some disease mutations in baboon and mandrill make them better medical models. CRISPR technology could be used to edit the baboon genome and observe the resulting phenotype, and new medicines could then be tested on these models to select the best treatments, which may facilitate drug development.

Figure 4.1 The volume of cranial capacity in primates.

Finally, the assembly and analysis of the two draft genomes of baboon and mandrill also demonstrate the feasibility of establishing genomes for more primate species. Primates are an order of mammals with ~16 families and ~500 species, all of which are highly evolved animals with special physiological and behavioral characteristics. Despite their evolutionary importance and relatively simple genome content, reference genomes have been established for only ~20 species. Moreover, the existing genome assemblies differ considerably in quality and continuity, which complicates further analysis and applications. Thus, establishing draft genomes using second generation sequencing for all primate species would be invaluable for evolutionary research, conservation/preservation, and human genetic/disease research and applications. A plan to sequence all primate species in the near future, using either second generation sequencing technologies combined with 10X or Hi-C library construction methods, or third generation long-read sequencing, should be feasible. With genome sequences available, repeat and gene annotation, as well as comparative genomics among the primate species, can also be conducted.

5. Conclusions

Firstly, draft genomes of baboon and mandrill have been established in this study, which can serve as reference datasets for future genome sequencing and comparative genomic studies. With more than 100× second generation sequencing data from different sequencing libraries, whole genome shotgun (WGS) assemblies of both species were completed, with genome sizes of 3.12 Gb and 2.88 Gb respectively. The genome assemblies reached high continuity, reflected by contig N50 values of more than 20 kb and scaffold N50 values longer than 1 Mb. The longest scaffold was about 8.8 Mb in baboon and 19.1 Mb in mandrill. About 40% of each genome was annotated as repeat sequences, and 23,867 and 21,906 protein-coding genes were annotated respectively. BUSCO assessment indicated high quality of both the genome assembly and gene annotation, with high coverage of the conserved genes (97.1% and 98.6% complete BUSCOs).

Secondly, with the draft genome sequences available, basic genomic features were investigated and compared with related species, revealing that repeat content, protein-coding gene numbers and gene families in baboon and mandrill are similar to those of other primates. Only 489/342 gene families with 598/515 genes were specific to baboon/mandrill, and fewer segmental duplications (SDs) were found in baboon and mandrill than in human.

Thirdly, the evolution of the two species was comprehensively analyzed, covering demographic changes, chromosome-level changes, gene family expansion and contraction, and positively selected genes. Baboon and mandrill are located in the same clade as macaque and diverged from the human clade about 28.5 (27.5–30.4) million years ago (MYA), while the divergence time between Cercopithecoidea and Hominoidea was previously estimated to be 26.66 (24.29–28.95) MYA. Baboon and mandrill split from each other ~5.8 (5.0–6.8) MYA. For both baboon and mandrill, the demographic history showed a sharp population increase ~28 thousand years ago followed by a noticeable bottleneck. Synteny among baboon, mandrill and human was established to identify chromosomal rearrangements (fusion of chromosomes 13 and 14 and breaks of chromosomes 7 and 10). For gene family evolution, the lineage of baboon and mandrill had 545 expanded and 618 contracted gene families; expanded families with important functions include PPIA, which can induce an inflammatory response and mitigate tissue injury, and the PRDX6 family, which can reduce peroxides and protect against oxidative injury during metabolism. In addition, 657 positively selected genes were identified for the lineage of baboon and mandrill, some of which are related to immune responses.

Finally, the genetic mechanisms underlying the immune system, language and olfaction were investigated. The MHC regions were highly consistent but contained fewer HLA genes, the FOXP2 gene carried two amino acid mutations, and olfactory gene families were notably expanded in baboon and mandrill. Good synteny was found between mandrill and human in the MHC region, with only 54 insertions and deletions (indels) longer than 100 bp in MHC region I. Fewer HLA genes were found in baboon and mandrill than in human (8 in total, 5 of which have become pseudogenes).

6. Future perspectives

i) Further improving the genome assemblies. By applying third generation sequencing and Hi-C sequencing in particular, chromosome-level genome assemblies with fewer gaps can be achieved. For highly repetitive regions, including the MHC regions, better assemblies would benefit future functional and comparative genomic studies.

ii) Constructing a genome database for these species. A database can be established to share the genome data effectively.

iii) Further functional/molecular studies of selected genomic features. Genomic features including specific gene families, mutations in functionally important genes (for example, the FOXP2 gene) and expanded gene families (for example, olfactory receptor genes) were identified in this study, but validation through functional studies is required to illustrate the mechanisms related to these functions.

iv) Large-scale genome sequencing of primates. With the experience obtained in this study, large-scale genome sequencing aimed at establishing draft genomes for all primate species can be considered further.

7. References

1. Wilson DE and Reeder DM. Mammal species of the world: a taxonomic and geographic reference. JHU Press; 2005.
2. Jolly C. Introduction to the Cercopithecoidea, with notes on their use as laboratory animals. In: Symp Zool Soc Lond 1966, pp.427-57.
3. Fleagle JG and McGraw WS. Skeletal and dental morphology supports diphyletic origin of baboons and mandrills. Proceedings of the National Academy of Sciences. 1999;96 3:1157-61.
4. Perelman P, Johnson WE, Roos C, Seuánez HN, Horvath JE, Moreira MA, et al. A molecular phylogeny of living primates. PLoS genetics. 2011;7 3:e1001342.
5. Liedigk R, Roos C, Brameier M and Zinner D. Mitogenomics of the Old World monkey tribe Papionini. BMC evolutionary biology. 2014;14 1:176.
6. Sigg H, Stolba A, Abegglen J-J and Dasser V. Life history of hamadryas baboons: physical development, infant mortality, reproductive parameters and family relationships. Primates. 1982;23 4:473-87.
7. Groves CP. Primate . 2001.
8. Kingdon J. The Kingdon field guide to African mammals. Bloomsbury Publishing; 2015.
9. Zinner D, Groeneveld LF, Keller C and Roos C. Mitochondrial phylogeography of baboons (Papio spp.)–Indication for introgressive hybridization? BMC evolutionary biology. 2009;9 1:83.
10. Gesquiere LR, Learn NH, Simao MCM, Onyango PO, Alberts SC and Altmann J. Life at the top: rank and stress in wild male baboons. Science. 2011;333 6040:357-60.
11. Rogers J and Hixson JE. Baboons as an animal model for genetic studies of common human disease. The American Journal of Human Genetics. 1997;61 3:489-93.
12. Chai D, Cuneo S, Falconer H, Mwenda J and D'Hooghe T. Olive baboon (Papio anubis anubis) as a model for intrauterine research. Journal of medical primatology. 2007;36 6:365-9.
13. Cox LA, Comuzzie AG, Havill LM, Karere GM, Spradling KD, Mahaney MC, et al. Baboons as a model to study genetics and epigenetics of human disease. ILAR journal. 2013;54 2:106-21.
14. Locher CP, Witt SA, Herndier BG, Tenner-Racz K, Racz P and Levy JA. Baboons as an animal model for human immunodeficiency virus pathogenesis and vaccine development. Immunological reviews. 2001;183 1:127-40.
15. Starzl TE, Fung J, Tzakis A, Todo S, Demetris A, Marino I, et al. Baboon-to-human liver transplantation. The lancet. 1993;341 8837:65-71.
16. Taylor Jr F, Chang A, Esmon C, D'angelo A, Vigano-D'Angelo S and Blick K. Protein C prevents the coagulopathic and lethal effects of Escherichia coli infusion in the baboon. Journal of Clinical Investigation. 1987;79 3:918.
17. Hanson SR, Powell JS, Dodson T, Lumsden A, Kelly AB, Anderson JS, et al. Effects of angiotensin converting enzyme inhibition with cilazapril on intimal hyperplasia in injured arteries and vascular grafts in the baboon. Hypertension. 1991;18 4 Suppl:II70.
18. Ryabchikova EI, Kolesnikova LV and Luchko SV. An analysis of features of pathogenesis in two animal models of Ebola virus infection. The Journal of infectious diseases. 1999;179 Supplement_1:S199-S202.
19. Huneke RB, Michaels MG, Kaufman CL and Ildstad ST. Antibody response in baboons (Papio cynocephalus anubis) to a commercially available encephalomyocarditis virus vaccine. Comparative Medicine. 1998;48 5:526-8.
20. VanCott TC, Mascola JR, Loomis-Price LD, Sinangil F, Zitomersky N, McNeil J, et al. Cross-subtype neutralizing antibodies induced in baboons by a subtype E gp120 immunogen based on an R5 primary human immunodeficiency virus type 1 envelope. Journal of virology. 1999;73 6:4640-50.
21. Drewe JA, O’Riain MJ, Beamish E, Currie H and Parsons S. Survey of infections transmissible between baboons and humans, Cape Town, . Emerging infectious diseases. 2012;18 2:298.
22. Stearns-Kurosawa DJ, Lupu F, Taylor FB, Kinasewitz G and Kurosawa S. Sepsis and pathophysiology of anthrax in a nonhuman primate model. The American journal of pathology. 2006;169 2:433-44.
23. Khlebnikov V, Golovlev I, Zhemchugov V, Chugunov A, Averin S, Afanas'ev S, et al. The immunological efficacy of Francisella tularensis outer membranes for hamadryas baboons. Zhurnal mikrobiologii, epidemiologii, i immunobiologii. 1993; 3:61-4.
24. Githure JI, Reid GD, Binhazim AA, Anjili CO, Shatry AM and Hendricks LD. Leishmania major: the suitability of East African nonhuman primates as animal models for cutaneous leishmaniasis. Experimental parasitology. 1987;64 3:438-47.
25. Yole D, Pemberton R, Reid G and Wilson R. Protective immunity to Schistosoma mansoni induced in the olive baboon Papio anubis by the irradiated cercaria vaccine. Parasitology. 1996;112 1:37-46.
26. Nyindo M and Farah I. The baboon as a non-human primate model of human schistosome infection. Parasitology Today. 1999;15 12:478-82.
27. Mafuyai H, Barshep Y, Audu B, Kumbak D and Ojobe T. Baboons as potential reservoirs of zoonotic gastrointestinal parasite infections at Yankari National Park, Nigeria. African health sciences. 2013;13 2:252-4.
28. Prescott M. Primate sensory capabilities and communication signals: implications for care and use in the laboratory. National Centre for the Replacement, Refinement and Reduction of Animals in Research; 2006.
29. Boë L-J, Berthommier F, Legou T, Captier G, Kemp C, Sawallis TR, et al. Evidence of a Vocalic Proto-System in the Baboon (Papio papio) Suggests Pre-Hominin Speech Precursors. PloS one. 2017;12 1:e0169321.
30. Nishimura T, Mikami A, Suzuki J and Matsuzawa T. Descent of the hyoid in chimpanzees: evolution of face flattening and speech. Journal of Human Evolution. 2006;51 3:244-54.
31. Kuhl PK and Meltzoff AN. Infant vocalizations in response to speech: Vocal and developmental change. The journal of the Acoustical Society of America. 1996;100 4:2425-38.
32. Boë L-J, Badin P, Ménard L, Captier G, Davis B, MacNeilage P, et al. Anatomy and control of the developing human vocal tract: A response to Lieberman. Journal of Phonetics. 2013;41 5:379-92.
33. Nowak RM. Walker's mammals of the world. JHU Press; 1999.
34. Harrison MJ. The mandrill in Gabon's rain forest—ecology, distribution and status. Oryx. 1988;22 4:218-28.
35. Hoshino J. Feeding ecology of mandrills (Mandrillus sphinx) in Campo animal reserve, Cameroon. Primates. 1985;26 3:248-73.
36. Leigh SR, Setchell JM, Charpentier M, Knapp LA and Wickings EJ. Canine tooth size and fitness in male mandrills (Mandrillus sphinx). Journal of Human Evolution. 2008;55 1:75-85.
37. Setchell JM and Dixson AF. Changes in the secondary sexual adornments of male mandrills (Mandrillus sphinx) are associated with gain and loss of alpha status. Hormones and Behavior. 2001;39 3:177-84.
38. Setchell JM and Dixson AF. Developmental variables and dominance rank in adolescent male mandrills (Mandrillus sphinx). American journal of primatology. 2002;56 1:9-25.
39. Setchell JM, Vaglio S, Abbott KM, Moggi-Cecchi J, Boscaro F, Pieraccini G, et al. Odour signals major histocompatibility complex genotype in an Old World monkey. Proceedings of the Royal Society of London B: Biological Sciences. 2010:rspb20100571.
40. Setchell JM, Richards SA, Abbott KM and Knapp LA. Mate-guarding by male mandrills (Mandrillus sphinx) is associated with female MHC genotype. Behavioral Ecology. 2016:arw106.
41. Feistner AT. Scent marking in mandrills, Mandrillus sphinx. Folia Primatologica. 1991;57 1:42-7.
42. Pandrea I, Apetrei C, Dufour J, Dillon N, Barbercheck J, Metzger M, et al. Simian immunodeficiency virus SIVagm.sab infection of Caribbean African green monkeys: a new model for the study of SIV pathogenesis in natural hosts. Journal of virology. 2006;80 10:4858-67.
43. Roussel M, Pontier D, Ngoubangoye B, Kazanji M, Verrier D and Fouchet D. Modes of transmission of Simian T-lymphotropic Virus Type 1 in semi-captive mandrills (Mandrillus sphinx). Veterinary microbiology. 2015;179 3:155-61.
44. Nerrienet E, Amouretti X, Müller-Trutwin M, Poaty-Mavoungou V, Bedjebaga I, Nguyen HT, et al. Phylogenetic analysis of SIV and STLV type I in mandrills (Mandrillus sphinx): indications that intracolony transmissions are predominantly the result of male-to-male aggressive contacts. AIDS research and human retroviruses. 1998;14 9:785-96.
45. Zwick LS, Walsh TF, Barbiers R, Collins MT, Kinsel MJ and Murnane RD. Paratuberculosis in a mandrill (Papio sphinx). Journal of Veterinary Diagnostic Investigation. 2002;14 4:326-8.
46. O'Rourke J, Dixon M, Jack A, Enno A and Lee A. Gastric B-cell mucosa-associated lymphoid tissue (MALT) lymphoma in an animal model of ‘Helicobacter heilmannii’ infection. The Journal of pathology. 2004;203 4:896-903.
47. Setchell JM, Bedjabaga I-B, Goossens B, Reed P, Wickings EJ and Knapp LA. Parasite prevalence, abundance, and diversity in a semi-free-ranging colony of Mandrillus sphinx. International Journal of Primatology. 2007;28 6:1345-62.
48. Ungeheuer M, Elissa N, Morelli A, Georges A, Deloron P, Debre P, et al. Cellular responses to Loa loa experimental infection in mandrills (Mandrillus sphinx) vaccinated with irradiated infective larvae. Parasite immunology. 2000;22 4:173-84.
49. International Human Genome Sequencing C. Initial sequencing and analysis of the human genome. Nature. 2001;409:860. doi:10.1038/35057062 https://www.nature.com/articles/35057062#supplementary-information.
50. Initial sequence of the chimpanzee genome and comparison with the human genome. Nature. 2005;437 7055:69-87. doi:10.1038/nature04072.
51. Rogers J and Gibbs RA. Comparative primate genomics: emerging patterns of genome content and dynamics. Nat Rev Genet. 2014;15 5:347-59. doi:10.1038/nrg3707 http://www.nature.com/nrg/journal/v15/n5/abs/nrg3707.html#supplementary-information.
52. Prufer K, Munch K, Hellmann I, Akagi K, Miller JR, Walenz B, et al. The bonobo genome compared with the chimpanzee and human genomes. Nature. 2012;486 7404:527-31. doi:10.1038/nature11128.
53. Lindblad-Toh K, Garber M, Zuk O, Lin MF, Parker BJ, Washietl S, et al. A high-resolution map of human evolutionary constraint using 29 mammals. Nature. 2011;478:476. doi:10.1038/nature10530 https://www.nature.com/articles/nature10530#supplementary-information.
54. Locke DP, Hillier LW, Warren WC, Worley KC, Nazareth LV, Muzny DM, et al. Comparative and demographic analysis of orang-utan genomes. Nature. 2011;469 7331:529-33. doi:http://www.nature.com/nature/journal/v469/n7331/abs/10.1038-nature09687-unlocked.html#supplementary-information.
55. Meyer M, Kircher M, Gansauge MT, Li H, Racimo F, Mallick S, et al. A high-coverage genome sequence from an archaic Denisovan individual. Science. 2012;338 6104:222-6. doi:10.1126/science.1224344.
56. The Chimpanzee Sequencing and Analysis Consortium. Initial sequence of the chimpanzee genome and comparison with the human genome. Nature. 2005;437 7055:69-87. doi:http://www.nature.com/nature/journal/v437/n7055/suppinfo/nature04072_S1.html.
57. Gibbs RA, Rogers J, Katze MG, Bumgarner R, Weinstock GM, Mardis ER, et al. Evolutionary and Biomedical Insights from the Rhesus Macaque Genome. Science. 2007;316 5822:222-34. doi:10.1126/science.1139247.
58. Scally A, Dutheil JY, Hillier LW, Jordan GE, Goodhead I, Herrero J, et al. Insights into hominid evolution from the gorilla genome sequence. Nature. 2012;483 7388:169-75. doi:http://www.nature.com/nature/journal/v483/n7388/abs/nature10842.html#supplementary-information.
59. Gordon D, Huddleston J, Chaisson MJP, Hill CM, Kronenberg ZN, Munson KM, et al. Long-read sequence assembly of the gorilla genome. Science. 2016;352 6281 doi:10.1126/science.aae0344.
60. Johnson ME, Viggiano L, Bailey JA, Abdul-Rauf M, Goodwin G, Rocchi M, et al. Positive selection of a gene family during the emergence of humans and African apes. Nature. 2001;413 6855:514-9. doi:10.1038/35097067.
61. Gibbs RA, Rogers J, Katze MG, Bumgarner R, Weinstock GM, Mardis ER, et al. Evolutionary and biomedical insights from the rhesus macaque genome. Science. 2007;316 5822:222-34.
62. Harris RA, Tardif SD, Vinar T, Wildman DE, Rutherford JN, Rogers J, et al. Evolutionary genetics and implications of small size and twinning in callitrichine primates. Proceedings of the National Academy of Sciences of the United States of America. 2014;111 4:1467-72. doi:10.1073/pnas.1316037111.
63. Mailund T, Halager AE, Westergaard M, Dutheil JY, Munch K, Andersen LN, et al. A New Isolation with Migration Model along Complete Genomes Infers Very Different Divergence Processes among Closely Related Great Ape Species. PLoS Genetics. 2012;8 12:e1003125. doi:10.1371/journal.pgen.1003125.
64. Burton JN, Adey A, Patwardhan RP, Qiu R, Kitzman JO and Shendure J. Chromosome-scale scaffolding of de novo genome assemblies based on chromatin interactions. Nat Biotechnol. 2013;31 12:1119-25. doi:10.1038/nbt.2727.
65. Kaplan N and Dekker J. High-throughput genome scaffolding from in vivo DNA interaction frequency. Nat Biotech. 2013;31 12:1143-7. doi:10.1038/nbt.2768 http://www.nature.com/nbt/journal/v31/n12/abs/nbt.2768.html#supplementary-information.
66. Chaisson MJP, Wilson RK and Eichler EE. Genetic variation and the de novo assembly of human genomes. Nat Rev Genet. 2015;16 11:627-40. doi:10.1038/nrg3933.
67. Lieberman-Aiden E, van Berkum NL, Williams L, Imakaev M, Ragoczy T, Telling A, et al. Comprehensive Mapping of Long-Range Interactions Reveals Folding Principles of the Human Genome. Science. 2009;326 5950:289-93. doi:10.1126/science.1181369.
68. Pan_troglodytes-2.1.4 assembly. National Center for Biotechnology Information [online], 2011.
69. Prufer K, Munch K, Hellmann I, Akagi K, Miller JR, Walenz B, et al. The bonobo genome compared with the chimpanzee and human genomes. Nature. 2012;486 7404:527-31. doi:http://www.nature.com/nature/journal/v486/n7404/abs/nature11128.html#supplementary-information.
70. Zimin AV, Cornish AS, Maudhoo MD, Gibbs RM, Zhang X, Pandey S, et al. A new rhesus macaque assembly and annotation for next-generation sequencing analyses. Biol Direct. 2014;9 1:20. doi:10.1186/1745-6150-9-20.
71. Yan G, Zhang G, Fang X, Zhang Y, Li C, Ling F, et al. Genome sequencing and comparison of two nonhuman primate animal models, the cynomolgus and Chinese rhesus . Nat Biotech. 2011;29 11:1019-23. doi:http://www.nature.com/nbt/journal/v29/n11/abs/nbt.1992.html#supplementary-information.
72. Perry GH, Reeves D, Melsted P, Ratan A, Miller W, Michelini K, et al. A Genome Sequence Resource for the Aye-Aye (Daubentonia madagascariensis), a Nocturnal Lemur from Madagascar. Genome Biol Evol. 2012;4 2:126-35. doi:10.1093/gbe/evr132.
73. Warren WC, Jasinska AJ, Garcia-perez R, Svardal H, Tomlinson C, Rocchi M, et al. The genome of the vervet ( aethiops sabaeus). Genome Res. 2015; doi:10.1101/gr.192922.115.
74. Carbone L, Alan Harris R, Gnerre S, Veeramah KR, Lorente-Galdos B, Huddleston J, et al. Gibbon genome and the fast karyotype evolution of small apes. Nature. 2014;513 7517:195-201. doi:10.1038/nature13679 http://www.nature.com/nature/journal/v513/n7517/abs/nature13679.html#supplementary-information.
75. The Marmoset Genome S and Analysis C. The common marmoset genome provides insight into primate biology and evolution. Nat Genet. 2014;46 8:850-7. doi:10.1038/ng.3042 http://www.nature.com/ng/journal/v46/n8/abs/ng.3042.html#supplementary-information.
76. Schmitz J, Noll A, Raabe CA, Churakov G, Voss R, Kiefmann M, et al. Genome sequence of the basal haplorrhine primate Tarsius syrichta reveals unusual insertions. Nature Communications. 2016;7:12997. doi:10.1038/ncomms12997.
77. Silva JC and Kondrashov AS. Patterns in spontaneous mutation revealed by human–baboon sequence comparison. TRENDS in Genetics. 2002;18 11:544-7.
78. VandeBerg JL, Williams-Blangero S and Tardif SD. The baboon in biomedical research. New York: Springer; 2009.
79. Cox LA, Mahaney MC, VandeBerg JL and Rogers J. A second-generation genetic linkage map of the baboon (Papio hamadryas) genome. Genomics. 2006;88 3:274-81. doi:https://doi.org/10.1016/j.ygeno.2006.03.020.
80. Rogers J, Mahaney MC, Witte SM, Nair S, Newman D, Wedel S, et al. A genetic linkage map of the baboon (Papio hamadryas) genome based on human microsatellite polymorphisms. Genomics. 2000;67 3:237-47.
81. Voruganti VS, Tejero ME, Proffitt JM, Cole SA, Freeland-Graves JH and Comuzzie AG. Genome-wide Scan of Plasma Cholecystokinin in Baboons Shows Linkage to Human Chromosome 17. Obesity. 2007;15 8:2043-50. doi:10.1038/oby.2007.243.
82. Tejero ME, Voruganti VS, Proffitt JM, Curran JE, Goring HH, Johnson MP, et al. Cross-species replication of a resistin mRNA QTL, but not QTLs for circulating levels of resistin, in human and baboon. Heredity. 2008;101 1:60-6. doi:10.1038/hdy.2008.28.
83. Tejero ME, Cole SA, Cai G, Peebles KW, Freeland-Graves JH, Cox LA, et al. Genome-wide scan of resistin mRNA expression in omental adipose tissue of baboons. International Journal Of Obesity. 2004;29:406. doi:10.1038/sj.ijo.0802699.
84. Pandrea I, Onanga R, Souquiere S, Mouinga-Ondéme A, Bourry O, Makuwa M, et al. Paucity of CD4(+) CCR5(+) T Cells May Prevent Transmission of Simian Immunodeficiency Virus in Natural Nonhuman Primate Hosts by Breast-Feeding. Journal of Virology. 2008;82 11:5501-9. doi:10.1128/JVI.02555-07.
85. Charpentier M, Setchell J, Prugnolle F, Knapp L, Wickings E, Peignot P, et al. Genetic diversity and reproductive success in mandrills (Mandrillus sphinx). Proceedings of the National Academy of Sciences of the United States of America. 2005;102 46:16723-8.
86. Gokcumen O, Tischler V, Tica J, Zhu Q, Iskow RC, Lee E, et al. Primate genome architecture influences structural variation mechanisms and functional consequences. Proceedings of the National Academy of Sciences. 2013;110 39:15764.
87. Cordaux R and Batzer MA. The impact of retrotransposons on human genome evolution. Nat Rev Genet. 2009;10 10:691-703. doi:10.1038/nrg2640.
88. Marques-Bonet T, Ryder OA and Eichler EE. Sequencing primate genomes: what have we learned? Annual review of genomics and human genetics. 2009;10:355-86. doi:10.1146/annurev.genom.9.081307.164420.
89. She X, Horvath JE, Jiang Z, Liu G, Furey TS, Christ L, et al. The structure and evolution of centromeric transition regions within the human genome. Nature. 2004;430:857. doi:10.1038/nature02806 https://www.nature.com/articles/nature02806#supplementary-information.
90. Bailey JA and Eichler EE. Primate segmental duplications: crucibles of evolution, diversity and disease. Nat Rev Genet. 2006;7 7:552-64. doi:10.1038/nrg1895.
91. Eichler EE, Budarf ML, Rocchi M, Deaven LL, Doggett NA, Baldini A, et al. Interchromosomal duplications of the adrenoleukodystrophy locus: a phenomenon of pericentromeric plasticity. Human molecular genetics. 1997;6 7:991-1002.
92. Riethman HC, Xiang Z, Paul S, Morse E, Hu XL, Flint J, et al. Integration of telomere sequences with the draft human genome sequence. Nature. 2001;409 6822:948-51.
93. Linardopoulou EV, Williams EM, Fan Y, Friedman C, Young JM and Trask BJ. Human subtelomeres are hot spots of interchromosomal recombination and segmental duplication. Nature. 2005;437 7055:94-100.
94. Antonell A, De LORX and Perez Jurado LA. Evolutionary mechanisms shaping the genomic structure of the Williams-Beuren syndrome chromosomal region at human 7q11.23. Genome Research. 2005;15 9:1179.
95. Li R, Fan W, Tian G, Zhu H, He L, Cai J, et al. The sequence and de novo assembly of the giant panda genome. Nature. 2010;463 7279:311.
96. Minoche AE, Dohm JC and Himmelbauer H. Evaluation of genomic high-throughput sequencing data generated on Illumina HiSeq and genome analyzer systems. Genome biology. 2011;12 11:R112.
97. Magoč T and Salzberg SL. FLASH: fast length adjustment of short reads to improve genome assemblies. Bioinformatics. 2011;27 21:2957-63.
98. Luo R, Liu B, Xie Y, Li Z, Huang W, Yuan J, et al. SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler. Gigascience. 2012;1 1:18.
99. Tarailo-Graovac M and Chen N. Using RepeatMasker to identify repetitive elements in genomic sequences. Current protocols in bioinformatics. 2009:4.10. 1-4.. 4.
100. Jurka J, Kapitonov VV, Pavlicek A, Klonowski P, Kohany O and Walichiewicz J. Repbase Update, a database of eukaryotic repetitive elements. Cytogenet Genome Res. 2005;110 1-4:462-7.
101. Smit A and Hubley R. RepeatModeler Open-1.0. Repeat Masker Website. 2010.
102. Xu Z and Wang H. LTR_FINDER: an efficient tool for the prediction of full-length LTR retrotransposons. Nucleic Acids Res. 2007;35 suppl 2:W265-W8.
103. Benson G. Tandem repeats finder: a program to analyze DNA sequences. Nucleic acids research. 1999;27 2:573.
104. Lowe TM and Eddy SR. tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence. Nucleic acids research. 1997;25 5:955-64.
105. Gardner PP, Daub J, Tate JG, Nawrocki EP, Kolbe DL, Lindgreen S, et al. Rfam: updates to the RNA families database. Nucleic acids research. 2008;37 suppl_1:D136-D40.
106. Kent WJ. BLAT—the BLAST-like alignment tool. Genome Res. 2002;12 4:656-64.
107. Birney E, Clamp M and Durbin R. GeneWise and genomewise. Genome Res. 2004;14 5:988-95.
108. Haas BJ, Delcher AL, Mount SM, Wortman JR, Smith Jr RK, Hannick LI, et al. Improving the Arabidopsis genome annotation using maximal transcript alignment assemblies. Nucleic Acids Res. 2003;31 19:5654-66.
109. Stanke M, Keller O, Gunduz I, Hayes A, Waack S and Morgenstern B. AUGUSTUS: ab initio prediction of alternative transcripts. Nucleic Acids Res. 2006;34 suppl 2:W435-W9.
110. Elsik CG, Mackey AJ, Reese JT, Milshina NV, Roos DS and Weinstock GM. Creating a honey bee consensus gene set. Genome biology. 2007;8 1:R13.
111. Götz S, García-Gómez JM, Terol J, Williams TD, Nagaraj SH, Nueda MJ, et al. High-throughput functional annotation and data mining with the Blast2GO suite. Nucleic acids research. 2008;36 10:3420-35.
112. Bateman A, Coin L, Durbin R, Finn RD, Hollich V, Griffiths-Jones S, et al. The Pfam protein families database. Nucleic acids research. 2004;32 suppl_1:D138-D41.
113. Parra G, Bradnam K and Korf I. CEGMA: a pipeline to accurately annotate core genes in eukaryotic genomes. Bioinformatics. 2007;23 9:1061-7.
114. Simão FA, Waterhouse RM, Ioannidis P, Kriventseva EV and Zdobnov EM. BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics. 2015;31 19:3210-2.
115. Guindon S, Delsuc F, Dufayard J-F and Gascuel O. Estimating maximum likelihood phylogenies with PhyML. Bioinformatics for DNA sequence analysis. 2009:113-37.
116. Huchon D, Chevret P, Jordan U, Kilpatrick CW, Ranwez V, Jenkins PD, et al. Multiple molecular evidences for a living mammalian fossil. Proceedings of the National Academy of Sciences. 2007;104 18:7495-9.
117. Glazko GV and Nei M. Estimation of divergence times for major lineages of primate species. Molecular biology and evolution. 2003;20 3:424-34.
118. Schrago C and Voloch C. The precision of the hominid timescale estimated by relaxed clock methods. Journal of evolutionary biology. 2013;26 4:746-55.
119. De Bie T, Cristianini N, Demuth JP and Hahn MW. CAFE: a computational tool for the study of gene family evolution. Bioinformatics. 2006;22 10:1269-71.
120. Bailey JA, Gu Z, Clark RA, Reinert K, Samonte RV, Schwartz S, et al. Recent segmental duplications in the human genome. Science. 2002;297 5583:1003-7.
121. Takahashi H, Takahashi K and Liu F-C. FOXP genes, neural development, speech and language disorders. Forkhead Transcription Factors. Springer; 2009. p. 117-29.
122. Heymann EW. The neglected sense–olfaction in primate behavior, ecology, and evolution. American journal of primatology. 2006;68 6:519-24.
123. Bailey TL, Boden M, Buske FA, Frith M, Grant CE, Clementi L, et al. MEME SUITE: tools for motif discovery and searching. Nucleic Acids Res. 2009;37 Web Server issue:20.
124. Pettersson E, Lundeberg J and Ahmadian A. Generations of sequencing technologies. Genomics. 2009;93 2:105-11.
125. Kriegs JO, Churakov G, Jurka J, Brosius J and Schmitz J. Evolutionary history of 7SL RNA-derived SINEs in Supraprimates. Trends Genet. 2007;23 4:158-61.
126. Raaum RL, Sterner KN, Noviello CM, Stewart C-B and Disotell TR. Catarrhine primate divergence dates estimated from complete mitochondrial genomes: concordance with fossil and nuclear DNA evidence. Journal of Human Evolution. 2005;48 3:237-57.
127. Steiper ME and Young NM. Primate molecular divergence dates. Molecular phylogenetics and evolution. 2006;41 2:384-94.
128. Anzai T, Shiina T, Kimura N, Yanagiya K, Kohara S, Shigenari A, et al. Comparative sequencing of human and chimpanzee MHC class I regions unveils insertions/deletions as the major path to genomic divergence. Proceedings of the National Academy of Sciences. 2003;100 13:7708-13.
129. Gaudieri S, Giles KM, Kulski JK and Dawkins RL. Duplication and polymorphism in the MHC: Alu generated diversity and polymorphism within the PERB11 gene family. Hereditas. 1997;127 1-2:37-46.
130. Yamazaki M, Tateno Y and Inoko H. Genomic organization around the centromeric end of the HLA class I region: Large-scale sequence analysis. Journal of molecular evolution. 1999;48 3:317-27.
131. Konopka G, Bomar JM, Winden K, Coppola G, Jonsson ZO, Gao F, et al. Human-specific transcriptional regulation of CNS development genes by FOXP2. Nature. 2009;462 7270:213-7.
132. Spiteri E, Konopka G, Coppola G, Bomar J, Oldham M, Ou J, et al. Identification of the transcriptional targets of FOXP2, a gene linked to speech and language, in developing human brain. The American Journal of Human Genetics. 2007;81 6:1144-57.
133. Burrows AM. Primate Anatomy: An Introduction. JSTOR, 2001.
134. Laidre ME. Informative breath: olfactory cues sought during social foraging among Old World monkeys (Mandrillus sphinx, M. leucophaeus, and Papio anubis). Journal of Comparative Psychology. 2009;123 1:34.
135. Goto T, Salpekar A and Monk M. Expression of a testis-specific member of the olfactory receptor gene family in human primordial germ cells. Molecular human reproduction. 2001;7 6:553-8.
136. Price AL, Jones NC and Pevzner PA. De novo identification of repeat families in large genomes. Bioinformatics. 2005;21 suppl_1:i351-i8.

8. Appendix

Table 8.1 Statistics of baboon and mandrill clean/filtered sequencing data.

Species    Pair-end library     Average read    Clean data    Sequencing
           insert size (bp)     length (bp)     (Gb)          depth (×)
Baboon     250                  150             88.37         29.46
           500                  100             62.69         20.9
           800                  100             52.74         17.58
           4000                 90              36.79         12.26
           10000                90              43.82         14.61
  Total    -                    -               284.41        94.8
Mandrill   250                  150             91.18         30.39
           500                  100             67.38         22.46
           800                  100             54.28         18.09
           2000                 90              18.71         6.24
           5000                 90              16.3          5.43
           10000                90              31.35         10.45
           20000                90              10.34         3.45
  Total    -                    -               289.55        96.52
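The depth column of Table 8.1 can be related to the clean data column by dividing the data volume by a genome size. The sketch below is an illustration under one loudly stated assumption: a genome size of 3.0 Gb, which is not given in the table but reproduces the tabulated baboon depths to two decimals. The names `ASSUMED_GENOME_SIZE_GB` and `depth` are hypothetical.

```python
# Assumption: a 3.0 Gb genome size, inferred because it reproduces the
# depths listed in Table 8.1; the table itself does not state it.
ASSUMED_GENOME_SIZE_GB = 3.0

# Clean data (Gb) per baboon library, keyed by insert size (bp), from Table 8.1.
baboon_clean_gb = {250: 88.37, 500: 62.69, 800: 52.74, 4000: 36.79, 10000: 43.82}


def depth(clean_gb, genome_gb=ASSUMED_GENOME_SIZE_GB):
    """Sequencing depth (fold coverage) = sequenced bases / genome size."""
    return clean_gb / genome_gb


per_library = {insert: round(depth(gb), 2) for insert, gb in baboon_clean_gb.items()}
total_depth = round(depth(sum(baboon_clean_gb.values())), 2)

for insert, fold in per_library.items():
    print(f"{insert} bp library: {fold}x")
print(f"total: {total_depth}x")
```

Substituting the assembled genome sizes (3.12 Gb for baboon, 2.88 Gb for mandrill) for the assumed 3.0 Gb would give slightly different depths.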

Table 8.2 Prediction of the repeats in baboon genome.

Prediction method        Repeat size (bp)    Percentage in the genome (%)
TRF [103]                88,638,882          2.84
RepeatMasker [99]        1,317,851,115       42.28
RepeatProteinMask [72]   325,375,043         10.43
De novo [136]            1,353,584,808       43.42
Total                    1,558,442,757       50.00
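The Total row of Table 8.2 is smaller than the sum of the four method rows, which is what one would expect if the same genomic positions are annotated by more than one method and overlapping bases are counted once. The thesis does not describe how the total was computed, so the following is only a sketch of that idea: it merges overlapping intervals and sums the covered bases. The function name `union_length` and the toy coordinates are hypothetical.

```python
def union_length(intervals):
    """Bases covered by half-open (start, end) intervals, counting overlaps once."""
    covered = 0
    current_start = current_end = None
    for start, end in sorted(intervals):
        if current_end is None or start > current_end:
            # A new, disjoint block begins: close the previous one.
            if current_end is not None:
                covered += current_end - current_start
            current_start, current_end = start, end
        else:
            # Overlapping or adjacent: extend the current block.
            current_end = max(current_end, end)
    if current_end is not None:
        covered += current_end - current_start
    return covered


# Hypothetical repeat annotations on one scaffold, one list per method.
trf = [(0, 10)]
repeatmasker = [(5, 20), (30, 40)]
de_novo = [(15, 35)]

all_hits = trf + repeatmasker + de_novo
naive_sum = sum(end - start for start, end in all_hits)
print(naive_sum, union_length(all_hits))
```

In this toy case the per-method lengths add up to 55 bp, while the merged annotation covers only 40 bp.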

Table 8.3 General statistics of repeats in mandrill genome.

Type                     Repeat size (bp)    Percentage in the genome (%)
TRF                      87,221,621          3.03
RepeatMasker             936,130,281         32.47
RepeatProteinMask        281,888,845         9.77
De novo                  1,139,310,255       39.52
Total                    1,263,424,029       43.83

Table 8.4 Categories of TEs in baboon genome.

Type      RepBase TEs             TE proteins             De novo                 Combined TEs
          Length (bp)     %       Length (bp)     %       Length (bp)     %       Length (bp)     %
DNA       85,821,216      2.75    13,073,15…      0.42    23,596,137      0.76    102,653,65…     3.29
LINE      524,428,336     16.8    267,916,84…     8.60    728,970,010     23.39   907,729,49…     29.12
SINE      365,541,149     11.73   --              --      488,745,536     15.68   629,613,40…     20.20
LTR       246,841,510     7.92    44,428,21…      1.42    358,098,539     11.49   522,321,82…     16.76
Other     979             --      --              --      --              0       979             0
Unknown   1,296,802       0.04    --              --      495,265         0.02    1,791,826       0.06
Total     1,317,851,115   42.28   325,375,043     10.44   1,281,034,875   41.10   1,465,054,716   47.01

Note: RepBase TEs, the result of RepeatMasker based on Repbase; TE proteins, the result of RepeatProteinMask based on Repbase; De novo, the result of RepeatMasker using a library built by de novo prediction; Combined, the combined results of RepBase TEs, TE proteins and de novo.

Table 8.5 Categories of TEs in mandrill genome.

Type      RepBase TEs             TE proteins             De novo                 Combined TEs
          Length (bp)     %       Length (bp)     %       Length (bp)     %       Length (bp)     %
DNA       47,923,460      1.66    13,264,158      0.46    27,821,997      0.96    68,516,869      2.37
LINE      401,922,498     13.9    229,014,482     7.94    725,287,701     25.1    815,296,990     28.28
SINE      319,811,862     11.0    --              --      481,314,186     16.6    576,217,301     19.99
LTR       169,184,719     5.87    39,705,383      1.38    80,223,826      2.78    200,629,837     6.96
Other     81              --      --              --      3,210           0       3,291           0
Unknown   --              --      --              --      2,897,396       0.1     2,897,396       0.1
Total     936,130,281     32.4    281,888,845     9.78    1,117,858,141   38.78   1,216,950,29…   42.2

Note: RepBase TEs, the result of RepeatMasker based on Repbase; TE proteins, the result of RepeatProteinMask based on Repbase; De novo, the result of RepeatMasker using a library built by de novo prediction; Combined, the combined results of RepBase TEs, TE proteins and de novo.

Table 8.6 Non-coding RNA genes in baboon genome.

Type          Copy     Average length (bp)   Total length (bp)   % of genome
tRNA          510      75.26                 38,384              0.12
rRNA          1,200    101.38                121,666             0.39
  18S         136      136.05                18,503              0.06
  28S         288      155.67                44,833              0.14
  5.8S        17       89.94                 1,529               0.005
  5S          759      74.84                 56,801              0.18
snRNA         2,812    110.58                310,963             0.99
  CD-box      900      102.03                91,824              0.29
  HACA-box    324      135.44                43,881              0.14
  splicing    1,322    118.04                156,045             0.50

Table 8.7 Non-coding RNA genes in mandrill genome.

Type          Copy     Average length (bp)   Total length (bp)   % of genome
tRNA          466      75.36                 35,118              0.12
rRNA          982      97.05                 95,301              0.33
  18S         20       252.6                 5,052               0.02
  28S         205      160.49                32,902              0.11
  5.8S        8        103.87                831                 0.00
  5S          749      75.45                 56,516              0.19
snRNA         2,716    110.76                300,830             1.04
  CD-box      880      101.76                89,547              0.31
  HACA-box    314      136.82                42,963              0.15
  splicing    1,261    118.27                149,146             0.52

Table 8.8 GO enrichment of unique gene families in baboon.
GO ID GO term GO class P value
GO:0090266 regulation of mitotic cell cycle spindle assembly checkpoint BP 1.60E-03
GO:0048478 replication fork protection BP 7.78E-03
GO:0007416 synapse assembly BP 1.38E-02
GO:0006749 glutathione metabolic process BP 2.34E-02
GO:0007018 microtubule-based movement BP 2.37E-02
GO:0006694 steroid biosynthetic process BP 3.92E-02
GO:0042773 ATP synthesis coupled electron transport BP 1.28E-12
GO:0055114 oxidation-reduction process BP 7.42E-06
GO:0005680 anaphase-promoting complex CC 1.05E-03
GO:0004957 prostaglandin E receptor activity MF 1.15E-02
GO:0003840 gamma-glutamyltransferase activity MF 1.55E-02
GO:0003854 3-beta-hydroxy-delta5-steroid dehydrogenase activity MF 2.06E-02
GO:0003777 microtubule motor activity MF 2.09E-02
GO:0016491 oxidoreductase activity MF 1.57E-06
GO:0016651 oxidoreductase activity, acting on NADH or NADPH MF 1.68E-09
Note: BP stands for biological process, CC stands for cellular component, MF stands for molecular function.

Table 8.9 GO enrichment of unique gene families in mandrill.
GO ID GO term GO class P value
GO:0044260 cellular macromolecule metabolic process BP 2.02E-04
GO:0043170 macromolecule metabolic process BP 6.18E-04
GO:0009987 cellular process BP 1.32E-02
GO:0008152 metabolic process BP 1.88E-02
GO:0044238 primary metabolic process BP 2.69E-02
GO:0044237 cellular metabolic process BP 3.97E-02
GO:0034645 cellular macromolecule biosynthetic process BP 1.02E-10
GO:0019538 protein metabolic process BP 1.86E-08
GO:0010467 gene expression BP 3.62E-11
GO:0044267 cellular protein metabolic process BP 4.79E-10
GO:0006412 translation BP 6.29E-33
GO:0007186 G-protein coupled receptor signaling pathway BP 9.02E-06
GO:0043229 intracellular organelle CC 1.49E-04
GO:0005622 intracellular CC 3.44E-04
GO:0044391 ribosomal subunit CC 4.00E-04
GO:0005912 adherens junction CC 2.86E-03
GO:0044464 cell part CC 2.92E-03
GO:0044424 intracellular part CC 5.80E-03
GO:0015934 large ribosomal subunit CC 1.70E-02
GO:0015935 small ribosomal subunit CC 4.19E-02
GO:0005840 ribosome CC 1.57E-35
GO:0005737 cytoplasm CC 1.73E-11
GO:0044444 cytoplasmic part CC 2.34E-15
GO:0032991 macromolecular complex CC 3.29E-10
GO:0043232 intracellular non-membrane-bounded organelle CC 6.11E-19
GO:0004888 transmembrane signaling receptor activity MF 1.11E-04
GO:0004871 signal transducer activity MF 1.49E-04
GO:0004930 G-protein coupled receptor activity MF 1.49E-04
GO:0045296 cadherin binding MF 9.17E-04
GO:0004807 triose-phosphate isomerase activity MF 1.55E-02
GO:0003735 structural constituent of ribosome MF 1.57E-35
GO:0005198 structural molecule activity MF 4.23E-29
GO:0004984 olfactory receptor activity MF 9.21E-08
Note: BP stands for biological process, CC stands for cellular component, MF stands for molecular function.

Table 8.10 GO enrichment result of unique gene families for mandrill.
GO ID GO term GO class P value
GO:0006412 translation BP 4.60E-03
GO:0006935 chemotaxis BP 1.84E-02
GO:0040011 locomotion BP 1.84E-02
GO:0009605 response to external stimulus BP 4.71E-02
GO:0005840 ribosome CC 4.60E-03
GO:0005737 cytoplasm CC 1.12E-02
GO:0044444 cytoplasmic part CC 1.84E-02
GO:0030529 ribonucleoprotein complex CC 2.36E-02
GO:0001594 trace-amine receptor activity MF 2.75E-04
GO:0016493 C-C chemokine receptor activity MF 7.93E-04
GO:0003735 structural constituent of ribosome MF 4.60E-03
GO:0004896 cytokine receptor activity MF 4.60E-03
GO:0008528 G-protein coupled peptide receptor activity MF 4.60E-03
GO:0005198 structural molecule activity MF 1.84E-02
GO:0004950 chemokine receptor activity MF 8.96E-06
Note: BP stands for biological process, CC stands for cellular component, MF stands for molecular function.

Table 8.11 Repeat content of MHC class I region for mandrill and human.
Type               mandrill                                   human
                   Copy Number   Length (bp)   Percent (%)    Copy Number   Length (bp)   Percent (%)
DNA/Crypton-V      1             65            0.00           0             0             0.00
DNA/DNA            3             179           0.01           2             126           0.01
DNA/Helitron       1             363           0.02           1             322           0.02
DNA/Maverick       0             0             0.00           1             44            0.00
DNA/Sola           0             0             0.00           1             69            0.00
DNA/MULE-MuDR      2             141           0.01           0             0             0.00
DNA/TcMar-Tc1      1             187           0.01           1             183           0.01
DNA/TcMar-Tigge    12            3,673         0.20           0             0             0.00
DNA/TcMar-Tigger   26            10,177        0.56           27            11,980        0.63
DNA/hAT            2             184           0.01           1             174           0.01
DNA/hAT-Charlie    38            9,586         0.52           46            9,941         0.52
DNA/hAT-Tip100     9             2,016         0.11           6             839           0.04
LINE/CR1           4             772           0.04           4             771           0.04
LINE/Jockey        0             0             0.00           1             57            0.00
LINE/L1            754           340,504       18.62          759           406,157       21.26
LINE/L1-Tx1        1             142           0.01           0             0             0.00
LINE/L2            38            10,351        0.57           32            8,906         0.47
LINE/RTE-X         2             290           0.02           2             302           0.02
LTR/Copia          1             92            0.01           0             0             0.00
LTR/ERV1           146           80,428        4.40           126           77,703        4.07
LTR/ERVK           22            10,368        0.57           34            29,679        1.55
LTR/ERVL           171           91,775        5.02           207           123,209       6.45
LTR/ERVL-MaLR      109           35,671        1.95           81            27,654        1.45
LTR/Gypsy          2             170           0.01           1             67            0.00
LTR/LTR            3             728           0.04           1             170           0.01
SINE/7SL           5             338           0.02           9             399           0.02
SINE/Alu           947           276,532       15.12          806           267,594       14.01
SINE/B4            27            1,505         0.08           24            895           0.05
SINE/MIR           43            5,880         0.32           44            6,508         0.34
SINE/tRNA-7SL      10            637           0.03           8             836           0.04
SINE/tRNA-RTE      1             121           0.01           1             121           0.01
All                2,381         882,875       48.27          2,226         974,706       51.03

Table 8.12 GO and KEGG enrichment of the positively selected genes (PSGs).
GO ID GO Term GO Class Adjusted P-value

GO:0016301 kinase activity MF 6.62E-10
GO:0016772 transferase activity, transferring phosphorus-containing groups MF 1.18E-09
GO:0016773 phosphotransferase activity, alcohol group as acceptor MF 1.18E-09
GO:0003824 catalytic activity MF 2.03E-09
GO:0005524 ATP binding MF 3.42E-09
GO:0004672 protein kinase activity MF 3.42E-09
GO:0032559 adenyl ribonucleotide binding MF 3.80E-09
GO:0030554 adenyl nucleotide binding MF 4.73E-09
GO:0016740 transferase activity MF 1.76E-08
GO:0005515 protein binding MF 3.48E-08
GO:0004713 protein tyrosine kinase activity MF 3.64E-08
GO:0016310 phosphorylation BP 3.85E-08
GO:0006468 protein phosphorylation BP 7.96E-08
GO:0035639 purine ribonucleoside triphosphate binding MF 8.16E-07
GO:0036094 small molecule binding MF 8.22E-07
GO:0032553 ribonucleotide binding MF 9.04E-07
GO:0032555 purine ribonucleotide binding MF 9.04E-07
GO:0017076 purine nucleotide binding MF 1.25E-06
GO:0000166 nucleotide binding MF 1.40E-06
GO:0006793 phosphorus metabolic process BP 1.24E-05
GO:0006796 phosphate-containing compound metabolic process BP 1.24E-05
GO:0009452 RNA capping BP 2.66E-05
GO:0007626 locomotory behavior BP 4.05E-05
GO:0005488 binding MF 5.19E-05
GO:0007155 cell adhesion BP 5.68E-05
GO:0022610 biological adhesion BP 5.68E-05
GO:0008374 O-acyltransferase activity MF 8.68E-05
GO:0043412 macromolecule modification BP 8.73E-05
GO:0006464 protein modification process BP 0.000113
GO:0004525 ribonuclease III activity MF 0.00015
GO:0000123 histone acetyltransferase complex CC 0.000275
GO:0004252 serine-type endopeptidase activity MF 0.000396
GO:0030507 spectrin binding MF 0.000399
GO:0006508 proteolysis BP 0.000485
GO:0070011 peptidase activity, acting on L-amino acid peptides MF 0.000639
GO:0004177 aminopeptidase activity MF 0.000665

GO:0008233 peptidase activity MF 0.000665
GO:0046777 protein autophosphorylation BP 0.000767
GO:0005802 trans-Golgi network CC 0.000848
GO:0005768 endosome CC 0.000848
GO:0005516 calmodulin binding MF 0.001068
GO:0004842 ubiquitin-protein ligase activity MF 0.001317
GO:0017016 Ras GTPase binding MF 0.001317
GO:0031267 small GTPase binding MF 0.001433
GO:0051020 GTPase binding MF 0.001433
GO:0016881 acid-amino acid ligase activity MF 0.002103
GO:0016747 transferase activity, transferring acyl groups other than amino-acyl groups MF 0.003245
GO:0004175 endopeptidase activity MF 0.003245
GO:0008236 serine-type peptidase activity MF 0.003313
GO:0017171 serine hydrolase activity MF 0.003313
GO:0019787 small conjugating protein ligase activity MF 0.003335
GO:0016787 hydrolase activity MF 0.003374
GO:0008238 exopeptidase activity MF 0.003374
GO:0070461 SAGA-type complex CC 0.003374
GO:0070566 adenylyltransferase activity MF 0.003374
GO:0042558 pteridine-containing compound metabolic process BP 0.004351
GO:0050660 flavin adenine dinucleotide binding MF 0.004532
GO:0007610 behavior BP 0.004535
GO:0004402 histone acetyltransferase activity MF 0.00476
GO:0006370 mRNA capping BP 0.005009
GO:0008174 mRNA methyltransferase activity MF 0.005009
GO:0009057 macromolecule catabolic process BP 0.005217
GO:0019199 transmembrane receptor protein kinase activity MF 0.005929
GO:0015291 secondary active transmembrane transporter activity MF 0.006141
GO:0008217 regulation of blood pressure BP 0.006384
GO:0014706 striated muscle tissue development BP 0.006384
GO:0060537 muscle tissue development BP 0.006384
GO:0005887 integral to plasma membrane CC 0.006872
GO:0031226 intrinsic to plasma membrane CC 0.006872
GO:0051345 positive regulation of hydrolase activity BP 0.00701
GO:0000910 cytokinesis BP 0.008254
GO:0004568 chitinase activity MF 0.009118

GO:0006032 chitin catabolic process BP 0.009118
GO:0045335 phagocytic vesicle CC 0.009118
GO:0055037 recycling endosome CC 0.009118
GO:0030318 melanocyte differentiation BP 0.009118
GO:0017049 GTP-Rho binding MF 0.009118
GO:2000114 regulation of establishment of cell polarity BP 0.009118
GO:0008344 adult locomotory behavior BP 0.009118
GO:0043966 histone H3 acetylation BP 0.009118
GO:0017034 Rap guanyl-nucleotide exchange factor activity MF 0.009118
GO:0004534 5'-3' exoribonuclease activity MF 0.009118
GO:0030914 STAGA complex CC 0.009118
GO:0008460 dTDP-glucose 4,6-dehydratase activity MF 0.009118
GO:0004909 interleukin-1, Type I, activating receptor activity MF 0.009118
GO:0004334 fumarylacetoacetase activity MF 0.009118
GO:0004349 glutamate 5-kinase activity MF 0.009118
GO:0004350 glutamate-5-semialdehyde dehydrogenase activity MF 0.009118
GO:0043550 regulation of lipid kinase activity BP 0.009118
GO:0070772 PAS complex CC 0.009118
GO:0003919 FMN adenylyltransferase activity MF 0.009118
GO:0006747 FAD biosynthetic process BP 0.009118
GO:0008609 alkylglycerone-phosphate synthase activity MF 0.009118
GO:0004336 galactosylceramidase activity MF 0.009118
GO:0006683 galactosylceramide catabolic process BP 0.009118
GO:0008611 ether lipid biosynthetic process BP 0.009118
GO:0016287 glycerone-phosphate O-acyltransferase activity MF 0.009118
GO:0006516 glycoprotein catabolic process BP 0.009118
GO:0008705 methionine synthase activity MF 0.009118
GO:0008898 homocysteine S-methyltransferase activity MF 0.009118
GO:0010739 positive regulation of protein kinase A signaling cascade BP 0.009118
GO:0090036 regulation of protein kinase C signaling cascade BP 0.009118
GO:0005137 interleukin-5 receptor binding MF 0.009118
GO:0048280 vesicle fusion with Golgi apparatus BP 0.009118
GO:0008488 gamma-glutamyl carboxylase activity MF 0.009118
GO:0017187 peptidyl-glutamic acid carboxylation BP 0.009118
GO:0006348 chromatin silencing at telomere BP 0.009118
GO:0004375 glycine dehydrogenase (decarboxylating) activity MF 0.009118

GO:0006546 glycine catabolic process BP 0.009118
GO:0004483 mRNA (nucleoside-2'-O-)-methyltransferase activity MF 0.009118
GO:0080009 mRNA methylation BP 0.009118
GO:0050902 leukocyte adhesive activation BP 0.009118
GO:0048066 developmental pigmentation BP 0.009118
GO:0050931 pigment cell differentiation BP 0.009118
GO:0032878 regulation of establishment or maintenance of cell polarity BP 0.009118
GO:0019202 amino acid kinase activity MF 0.009118
GO:0046443 FAD metabolic process BP 0.009118
GO:0072387 flavin adenine dinucleotide metabolic process BP 0.009118
GO:0072388 flavin adenine dinucleotide biosynthetic process BP 0.009118
GO:0006681 galactosylceramide metabolic process BP 0.009118
GO:0019374 galactolipid metabolic process BP 0.009118
GO:0019376 galactolipid catabolic process BP 0.009118
GO:0046485 ether lipid metabolic process BP 0.009118
GO:0016413 O-acetyltransferase activity MF 0.009118
GO:0042084 5-methyltetrahydrofolate-dependent methyltransferase activity MF 0.009118
GO:0070528 protein kinase C signaling cascade BP 0.009118
GO:0018214 protein carboxylation BP 0.009118
GO:0016642 oxidoreductase activity, acting on the CH-NH2 group of donors, disulfide as acceptor MF 0.009118
GO:0009071 serine family amino acid catabolic process BP 0.009118
GO:0016556 mRNA modification BP 0.009118
GO:0045123 cellular extravasation BP 0.009118
GO:0017137 Rab GTPase binding MF 0.009356
GO:0006030 chitin metabolic process BP 0.009521
GO:0016891 endoribonuclease activity, producing 5'-phosphomonoesters MF 0.009521
GO:0015103 inorganic anion transmembrane transporter activity MF 0.010851
GO:0007605 sensory perception of sound BP 0.012081
GO:0003714 transcription corepressor activity MF 0.012081
GO:0050954 sensory perception of mechanical stimulus BP 0.012081
GO:0007067 mitosis BP 0.016642
GO:0000280 nuclear division BP 0.016642
GO:0044431 Golgi apparatus part CC 0.017318
GO:0006725 cellular aromatic compound metabolic process BP 0.017468

GO:0005452 inorganic anion exchanger activity MF 0.018637
GO:0016055 Wnt receptor signaling pathway BP 0.020776
GO:0070588 calcium ion transmembrane transport BP 0.020776
GO:0004540 ribonuclease activity MF 0.020776
GO:0000226 microtubule cytoskeleton organization BP 0.020776
GO:0008271 secondary active sulfate transmembrane transporter activity MF 0.020776
GO:0008272 sulfate transport BP 0.020776
GO:0015116 sulfate transmembrane transporter activity MF 0.020776
GO:0042813 Wnt-activated receptor activity MF 0.020776
GO:0016573 histone acetylation BP 0.020776
GO:0048193 Golgi vesicle transport BP 0.020776
GO:0030574 collagen catabolic process BP 0.020776
GO:0090382 phagosome maturation BP 0.020776
GO:0045670 regulation of osteoclast differentiation BP 0.020776
GO:0046920 alpha-(1->3)-fucosyltransferase activity MF 0.020776
GO:0034450 ubiquitin-ubiquitin ligase activity MF 0.020776
GO:0008124 4-alpha-hydroxytetrahydrobiopterin dehydratase activity MF 0.020776
GO:0034435 cholesterol esterification BP 0.020776
GO:0034736 cholesterol O-acyltransferase activity MF 0.020776
GO:0006919 activation of cysteine-type endopeptidase activity involved in apoptotic process BP 0.020776
GO:0032963 collagen metabolic process BP 0.020776
GO:0044236 multicellular organismal metabolic process BP 0.020776
GO:0044243 multicellular organismal catabolic process BP 0.020776
GO:0044259 multicellular organismal macromolecule metabolic process BP 0.020776
GO:0002761 regulation of myeloid leukocyte differentiation BP 0.020776
GO:0030316 osteoclast differentiation BP 0.020776
GO:0045637 regulation of myeloid cell differentiation BP 0.020776
GO:0034433 steroid esterification BP 0.020776
GO:0034434 sterol esterification BP 0.020776
GO:0004772 sterol O-acyltransferase activity MF 0.020776
GO:0010950 positive regulation of endopeptidase activity BP 0.020776
GO:0010952 positive regulation of peptidase activity BP 0.020776
GO:0043280 positive regulation of cysteine-type endopeptidase activity involved in apoptotic process BP 0.020776

GO:0097202 activation of cysteine-type endopeptidase activity BP 0.020776
GO:2001056 positive regulation of cysteine-type endopeptidase activity BP 0.020776
GO:0008305 integrin complex CC 0.020935
GO:0007167 enzyme linked receptor protein signaling pathway BP 0.021323
GO:0004675 transmembrane receptor protein serine/threonine kinase activity MF 0.022843
GO:0016050 vesicle organization BP 0.022843
GO:0016337 cell-cell adhesion BP 0.023783
GO:0000087 M phase of mitotic cell cycle BP 0.023909
GO:0051301 cell division BP 0.025184
GO:0004553 hydrolase activity, hydrolyzing O-glycosyl compounds MF 0.025409
GO:0048037 cofactor binding MF 0.026148
GO:0048856 anatomical structure development BP 0.026148
GO:0030097 hemopoiesis BP 0.026288
GO:0006475 internal protein amino acid acetylation BP 0.026288
GO:0018393 internal peptidyl-lysine acetylation BP 0.026288
GO:0018394 peptidyl-lysine acetylation BP 0.026288
GO:0008237 metallopeptidase activity MF 0.028813
GO:0048285 organelle fission BP 0.028832
GO:0015301 anion:anion antiporter activity MF 0.030496
GO:0043085 positive regulation of catalytic activity BP 0.030496
GO:0001510 RNA methylation BP 0.030496
GO:0048534 hemopoietic or lymphoid organ development BP 0.030496
GO:0006473 protein acetylation BP 0.030496
GO:0004521 endoribonuclease activity MF 0.030496
GO:0004712 protein serine/threonine/tyrosine kinase activity MF 0.030496
GO:0043473 pigmentation BP 0.030496
GO:0017080 sodium channel regulator activity MF 0.030496
GO:0004948 calcitonin receptor activity MF 0.030496
GO:0046373 L-arabinose metabolic process BP 0.030496
GO:0046556 alpha-N-arabinofuranosidase activity MF 0.030496
GO:0004962 endothelin receptor activity MF 0.030496
GO:0048484 enteric nervous system development BP 0.030496
GO:0070776 MOZ/MORF histone acetyltransferase complex CC 0.030496
GO:0042577 lipid phosphatase activity MF 0.030496
GO:0004822 isoleucine-tRNA ligase activity MF 0.030496

GO:0006428 isoleucyl-tRNA aminoacylation BP 0.030496
GO:0019236 response to pheromone BP 0.030496
GO:0080025 phosphatidylinositol-3,5-bisphosphate binding MF 0.030496
GO:0032777 Piccolo NuA4 histone acetyltransferase complex CC 0.030496
GO:0000103 sulfate assimilation BP 0.030496
GO:0004020 adenylylsulfate kinase activity MF 0.030496
GO:0004781 sulfate adenylyltransferase (ATP) activity MF 0.030496
GO:0051018 protein kinase A binding MF 0.030496
GO:0017025 TBP-class protein binding MF 0.030496
GO:0034454 microtubule anchoring at centrosome BP 0.030496
GO:0008250 oligosaccharyltransferase complex CC 0.030496
GO:0005315 inorganic phosphate transmembrane transporter activity MF 0.030496
GO:0034599 cellular response to oxidative stress BP 0.030496
GO:0090307 spindle assembly involved in mitosis BP 0.030496
GO:0004666 prostaglandin-endoperoxide synthase activity MF 0.030496
GO:0019371 cyclooxygenase pathway BP 0.030496
GO:0043141 ATP-dependent 5'-3' DNA helicase activity MF 0.030496
GO:0030139 endocytic vesicle CC 0.030496
GO:0002573 myeloid leukocyte differentiation BP 0.030496
GO:0030010 establishment of cell polarity BP 0.030496
GO:0030534 adult behavior BP 0.030496
GO:0019566 arabinose metabolic process BP 0.030496
GO:0048483 autonomic nervous system development BP 0.030496
GO:0070775 H3 histone acetyltransferase complex CC 0.030496
GO:0004779 sulfate adenylyltransferase activity MF 0.030496
GO:0006677 glycosylceramide metabolic process BP 0.030496
GO:0046477 glycosylceramide catabolic process BP 0.030496
GO:0046514 ceramide catabolic process BP 0.030496
GO:0046521 sphingoid catabolic process BP 0.030496
GO:0010737 protein kinase A signaling cascade BP 0.030496
GO:0010738 regulation of protein kinase A signaling cascade BP 0.030496
GO:0072393 microtubule anchoring at microtubule organizing center BP 0.030496
GO:0019369 arachidonic acid metabolic process BP 0.030496
GO:0006342 chromatin silencing BP 0.030496
GO:0045814 negative regulation of gene expression, epigenetic BP 0.030496
GO:0003684 damaged DNA binding MF 0.032579
GO:0030163 protein catabolic process BP 0.0328
GO:0007596 blood coagulation BP 0.03529
GO:0007599 hemostasis BP 0.03529
GO:0006629 lipid metabolic process BP 0.036449
GO:0016569 covalent chromatin modification BP 0.037639
GO:0016570 histone modification BP 0.037639
GO:0043547 positive regulation of GTPase activity BP 0.039103
GO:0050817 coagulation BP 0.041062
GO:0002520 immune system development BP 0.043864
GO:0006026 aminoglycan catabolic process BP 0.043864
GO:0004702 receptor signaling protein serine/threonine kinase activity MF 0.043864
GO:0008235 metalloexopeptidase activity MF 0.043864
GO:0043235 receptor complex CC 0.043864
GO:0016407 acetyltransferase activity MF 0.043864
GO:0043414 macromolecule methylation BP 0.043937
GO:0048731 system development BP 0.044356
GO:0006895 Golgi to endosome transport BP 0.044356
GO:0000186 activation of MAPKK activity BP 0.044356
GO:0006729 tetrahydrobiopterin biosynthetic process BP 0.044356
GO:0030099 myeloid cell differentiation BP 0.044356
GO:0016822 hydrolase activity, acting on acid carbon-carbon bonds MF 0.044356
GO:0016823 hydrolase activity, acting on acid carbon-carbon bonds, in ketonic substances MF 0.044356
GO:0046146 tetrahydrobiopterin metabolic process BP 0.044356
GO:0006687 glycosphingolipid metabolic process BP 0.044356
GO:0019377 glycolipid catabolic process BP 0.044356
GO:0046479 glycosphingolipid catabolic process BP 0.044356
GO:0046504 glycerol ether biosynthetic process BP 0.044356
GO:0006906 vesicle fusion BP 0.044356
GO:0043281 regulation of cysteine-type endopeptidase activity involved in apoptotic process BP 0.044356
GO:2000116 regulation of cysteine-type endopeptidase activity BP 0.044356
GO:0005975 carbohydrate metabolic process BP 0.046346

Map ID     Map Title                                      Adjusted P-value   Gene IDs
map00630   Glyoxylate and dicarboxylate metabolism        0.025008           Masph06057 Paanu10824 Masph03104 Paanu04268 Paanu04371 Masph01783 Paanu09927 Masph17981 Paanu10721 Masph08262
map00525   NA                                             0.025008           Masph19208 Paanu12892
map01055   Biosynthesis of vancomycin group antibiotics   0.025008           Masph19208 Paanu12892
map00523   Polyketide sugar unit biosynthesis             0.025008           Masph19208 Paanu12892
map04113   Meiosis - yeast                                0.025008           Masph15120 Paanu15224
map04141   NA                                             0.025008           Paanu10424 Masph02116 Paanu18656 Masph17762 Masph17390 Paanu15832

Figure 8.1 The distribution of 17-mer frequency of baboon and mandrill. Major and secondary peaks of the frequency distribution were indicated by arrows for baboon (a) and mandrill (b).
[Figure: panels a and b; x-axis: K-mer frequency (0 to 100); y-axis: Percentage (%) (0 to 4).]

Figure 8.2 Colinearity analysis of chr 3 for mandrill. The orange lines represent gene pairs.

Figure 8.3 Sequencing depth and positional relationships of paired-end reads across the MHC class I region in mandrill.


Figure 8.4 OR7E24 genes on chromosome 19 in mandrill.
