
POSITIVE SELECTION IN TRANSCRIPTION FACTOR GENES

ALONG THE HUMAN LINEAGE

by

GABRIELLE CELESTE NICKEL

Submitted in partial fulfillment of the requirements

For the degree of Doctor of Philosophy

Thesis Adviser: Dr. Mark D. Adams

Department of Genetics

CASE WESTERN RESERVE UNIVERSITY

January 2009

CASE WESTERN RESERVE UNIVERSITY

SCHOOL OF GRADUATE STUDIES

We hereby approve the thesis/dissertation of

Gabrielle Nickel______candidate for the _Ph.D.______degree *.

Helen Salz______(chair of the committee)

Mark Adams______

Radhika Atit______

Peter Harte______

Joe Nadeau______

______

(date) August 28, 2008______

*We also certify that written approval has been obtained for any proprietary material

contained therein.

TABLE OF CONTENTS

Table of Contents……………………………………………………………………………………………………….2

List of Tables……………………………………………………………………………………………………………...6

List of Figures……………………………………………………………………………………………………………..8

List of Abbreviations…………………………………………………………………………………………………10

Glossary………………………………………………………………………………………………..………………….12

Abstract..………………………………………………………………………………………………………………….16

Chapter 1: Introduction and Background………………………………………….……………………...17

Origin of Modern Humans………………………………………………………………………………..19

Human‐chimpanzee morphological divergence…………………………………….23

Human‐chimpanzee comparative genomics………………………………………….24

Human Molecular Evolution…………………………………………………………………..27

Adaptive evolution……………………………………………………………………31

Changes in regulatory sequence……………………………………………………32

“Less‐is‐More” hypothesis of gene loss……………………………………………..….36

Gene duplication in human evolution……………………………………………………37

Epigenetic regulation of ………..…………………………………….38

Gene Expression Differences…………………………………………………………………………….40

Human‐chimpanzee expression differences………………………………………….41

Selective pressures on gene expression………………………………………………..43

Positive Selection……………………………………………………………………………………………..44

Detecting positive selection through population genetic analysis………………………………………………………………………………………………….46

Detecting positive selection by phylogenetic analysis……………………………52

Transcription Factors………………………………………………………………………………………..57

Conclusion………………………………………………………………………………………………………..60

Chapter 2: An empirical test for branch‐specific positive selection…………. ……………….62

Abstract…………………………………………………………………………………………………………….63

Introduction……………………………………………………………………………………………………..64

Materials and Methods…………………………………………………………………………………….68

Selection of genes and DNA sequencing………………………………………………..68

Phylogenetic analysis…………………………………………………………………………….73

Simulations testing the performance of codeml…………………………………….76

Empirical tests of positive selection using sequences simulated under a model of neutral evolution……………………………………………………….77

Results………………………………………………………………………………………………………………79

Sequencing of genes and tests of selection……………79

Effect of phylogenetic breadth on predictions of positive selection……..83

Simulations to assess sensitivity of the strict branch+site test……………..86

Alternative null model using an empirical test……………………………………..91

Comparison with previous predictions of positive selection…………………98

Discussion……………………………………………………………………………………………………….103

Chapter 3: Human PAML Browser: A database of positive selection on human genes using phylogenetic analysis……………………………………………………………….110

Abstract………………………………………………………………………………………………………….111

Introduction……………………………………………………………………………………………………112

Data Sources and Processing…………………………………………………………………………..115

Input data…………………………………………………………………………………………..115

Statistical analysis……………………………………………………………………………….118

Results…………………………………………………………………………………………………………….121

User interface……………………………………………………………………………………..125

Database organization…………………………………………………………………………126

Discussion……………………………………………………………………………………………………….132

Conclusion………………………………………………………………………………………………………134

Chapter 4: Demonstrating functional divergence using genome wide expression arrays

in the positively selected gene CDX4……………………………………………………………………...135

Abstract………………………………………………………………………………………………….………136

Introduction……………………………………………………………………………………………………137

Materials and Methods…………………………………………………………………….…………….140

DNA sequencing………………………………………………………………….………………140

Phylogenetic analysis..………………………………………………………………………..141

cDNA clone construction…………………………………………………………………….142

Microarray………………………………………………………………………………………….143

Results………………………………………………………………………….………………………………..144

Discussion………………………………………………………………………………………………………161

Chapter 5: Conclusion and Future Directions…………………….……...... ……………………..165

The Evolution of Modern Traits……………………………………………………………..………167

Future Directions…………………………………………………………………………………………..170

Picking candidate genes………………………………………….………………………….171

In vivo studies……………………………………………………….…………………………….172

Transgenic mice……………………………………….……………………………..174

Gene targeted mice……………………………….………………………………..176

Target gene discovery……………………………………….…….………………………….178

ChIP‐chip……………………………………………………….………………………..179

Yeast‐2‐hybrid…………………………………………………………………………181

Luciferase gene expression studies…………………………………………………….183

MicroRNA screens………………………………………………………………………………191

Investigating selection around a ……………………………………….194

Positive Selection Along the Human Lineage…………………..…………………………….195

Appendix………………………………………………………………………………………………………………..198

Literature Cited………………………………………………………………………………………………………212

LIST OF TABLES

Table 1.1 The dominant hypothesis of human and important

references…………………………………………………………………………………………….30

Table 2.1 The percent of primer pairs that produced high quality sequence from

human templates and macaque and chimpanzee masks……………………….69

Table 2.2 Sequence coverage of transcription factor genes and data sources………72

Table 2.3 Evolutionary models and parameter sets for codeml analysis……………….74

Table 2.4 Branch+site test site classes………………………………………………………………….75

Table 2.5 codeml results for transcription factor genes with significant results in the

strict branch+site test of positive selection…………………………………………..83

Table 2.6 Comparison of the results from the strict branch+site and empirical

tests...... 94

Table 2.7 Comparison of predictions of positive selection by different methods..100

Table 2.8 Tests for positive selection on other primate branches that exhibited

dN/dS > 1…………………………………………………………………………………………….104

Table 2.9 Effect of synonymous and nonsynonymous substitution on predictions of

positive selection…………………………………………………………………………………105

Table 3.1 Evolutionary models used by codeml ………………………………………………….122

Table 3.2 Summary of results of tests of selection on human genes…………………..123

Table 3.3 categories over‐represented among genes with p<0.05 in

the strict branch+site test……………………………………………………………………124

Table 4.1 Statistical analysis of CDX4 using different methods to predict positive

selection……………………………………………………………………………………………..147

Table 4.2 The results from the strict branch+site model for CDX4………………………147

Table 4.3 123 genes differentially regulated by human and chimpanzee CDX4…..156

Table 4.4 lines chosen for CDX4 gene expression assays………………………………163

Table 5.1 Luciferase promoter‐reporter tests with controls………………………………189

Table 5.2 List of experiments and controls associated with luciferase promoter‐

reporter analysis…………………………………………………………………………………190

Table A.1 codeml results from 175 Transcription Factors……………………………………198

Table A.2 Sensitivity analysis: Factors that affect predictions of positive selection...... 208

LIST OF FIGURES

Figure 1.1 Phylogenetic tree of the order Primates……………………………………………….20

Figure 1.2 Mechanisms of human molecular evolution………………………………………….29

Figure 2.1 Probability of accurate ancestral sequence reconstruction……………………78

Figure 2.2 Phylogenetic tree of the primate used in phylogenetic

analyses………………………………………………………………………………………………..81

Figure 2.3 Comparison of the results for predictions of positive selection using

codeml for the full primate+rodent alignment set to the minimal

alignment set…………………………………………………………………………….………….85

Figure 2.4 Distribution of likelihood ratio statistic (LRS) values from the strict

branch+site test of simulated gene sets………………………………………………..88

Figure 2.5 Prediction of positive selection across a range of branch lengths

representative of human genes……………………………………………………………89

Figure 2.6 Contribution of dN/dS values to predictions of positive selection…………90

Figure 2.7 Comparison of the strict branch+site test, 50:50 mixture test, and

empirical test on simulated sequences…………………………………………………93

Figure 2.8 Comparison of the strict branch+site test and empirical test on human

genes…………………………………………………………………………………………………….97

Figure 3.1 Unrooted mammalian species tree of the organisms used in the

construction of the database and the phylogenetic tree used in PAML

analysis……………………………………………………………………………………………….117

Figure 3.2 PAML database summary results for Iroquois 3 (IRX3)………128

Figure 3.3 Multispecies protein alignment for IRX3…………………………………………….129

Figure 3.4 Likelihood test data and results for IRX3……………………………………………..131

Figure 4.1 The step by step process of fusion PCR……………………………………………….143

Figure 4.2 The protein multispecies sequence alignment for CDX4………………………145

Figure 4.3 Number of matches of chimpanzee and macaque genome to

the human specific microarray……………………………………………………………149

Figure 4.4 Comparison of the gene expression profiles of a human and chimpanzee

fibroblast cell line………………………………………………………………………………..150

Figure 4.5 Scatter plots of BeadChip microarray gene expression data ……………….152

Figure 4.6 Chimpanzee fibroblast cell lines transfected with human or chimpanzee

CDX4 cDNA………………………………………………………………………………………….161

LIST OF ABBREVIATIONS

α Significance threshold

AMH Anatomically modern humans

BAC Bacterial artificial chromosome

cds Coding sequence

ChIP Chromatin immunoprecipitation

CNGs Conserved non‐genic sequence

dN The rate of nonsynonymous substitution

dS The rate of synonymous substitution

EHH Extended haplotype homozygosity

FST Wright’s fixation index

LRT Likelihood ratio test

Mb Megabase

ML Maximum likelihood

mya Million years ago

N The absolute number of nonsynonymous substitutions

PAML Phylogenetic Analysis using Maximum Likelihood

PCR Polymerase chain reaction

qPCR Quantitative polymerase chain reaction

S The absolute number of synonymous substitutions

SNP Single nucleotide polymorphism

UTR Untranslated region

ω The ratio of the rate of nonsynonymous substitution to the rate of synonymous substitution. Also written as dN/dS.

Y2H Yeast two‐hybrid

GLOSSARY

Anthropoid‐ Simian; the “higher primates” including New and Old World monkeys and apes

Bayesian analysis‐ An approach to inference in which probability distributions of model parameters represent both what we believe about the distributions before looking at data and the likelihood of the parameters given the observed data.

Bonferroni correction‐ A conservative statistical multiple test correction used to correct the significance threshold

Branch length‐ The number of nucleotide substitutions per codon.

Coancestry coefficient (FST)‐ The correlation of genes or a measure of relatedness of different individuals in the same population.

Codon bias‐ The non‐random usage of synonymous codons for the same amino acid.

Effective population size (Ne)‐ The number of individuals in a population that contributes to the next generation. It is generally much smaller than the number of individuals in the population, and is influenced by population substructure, sex ratio, mating systems, and age distribution.

Epigenetic‐ Inherited without involving a change to the DNA sequence.

Fitness‐ The ability of an individual to survive and reproduce relative to the rest of the population.

Fixation‐ The increase in frequency of a genetic variant in a population to 100%.

Founder effects‐ Loss of genetic variation when a new colony is established by a very small number of individuals from a larger population.

Fourfold degenerate site‐ Any nucleotide at this position in a codon specifies the same amino acid

Genetic drift‐ The random fluctuations of allele frequencies over time due to chance alone.

Goodness‐of‐fit‐ A class of statistical methods used to assess competing models on the basis of their fit to empirical data.

Haplotype‐ A continuous DNA sequence of arbitrary length along a chromosome that has a primary structure that is distinct from that of other homologous regions in a given population

Hitchhiking‐ The increase in frequency of a neutral allele as a result of positive selection for a linked allele. See Selective sweep.

Hominid‐ A member of the family Hominidae: all of the great apes.

Hominin‐ A member of the tribe Hominini: chimpanzees and humans.

Hominan‐ A member of the sub‐tribe Hominina: modern humans and their extinct relatives.

Homologous‐ Phenotypic or genotypic characters that share a common ancestor.

Indel‐ A mutation that involves the insertion or deletion of DNA sequence.

Likelihood ratio test‐ A method for comparing the likelihood of two different hypotheses.

Linkage disequilibrium (LD)‐ The non‐random association of polymorphisms at two linked loci. LD is created by mutation and broken down over time by crossing over between the two loci.

Maximum likelihood analysis‐ A statistical method that calculates the probability of the observed data under varying hypotheses, in order to estimate model parameters that best explain the observed data and determine the observed strengths of alternative hypotheses.

Molecular clock‐ The hypothesis that nucleotide (or amino acid) substitutions occur at a more or less fixed rate over evolutionary time. Used to calculate the amount of sequence divergence and the time elapsed since two molecules diverged.

Negative selection‐ The removal of a deleterious genetic variant from the population owing to the reduced reproductive success of its carriers. Also known as purifying selection.

Neutrality‐ The state of being free from the effects of natural selection.

Nondegenerate site‐ Any mutation at this nucleotide position in a codon results in amino acid substitution.

Nonsynonymous substitution‐ A nucleotide substitution that results in an amino acid replacement.

Outgroup‐ Species that are more distantly related to two or more species studied and can therefore be used to estimate the ancestral state of a trait.

Orthologous‐ Genes or that evolved from a common ancestral gene through speciation.

Paracentric inversion‐ An inversion of a region of a chromosome that does not include the centromere.

Paralog‐ Highly similar non‐allelic sequences resulting from a duplication event.

Pleistocene‐ An epoch of the Quaternary period beginning 1.8 million years ago and transitioning to the Holocene epoch roughly 10,000 years ago.

Population bottleneck‐ The transient reduction in the abundance of a population

Positive selection‐ The accelerated spread of a beneficial genetic variant in a population owing to the increased reproductive success of its carriers.

Prosimian‐ The “lesser primates”; the most ancestral extant primates that include lemurs, lorises, aye‐ayes, and bushbabies.

Quantitative PCR‐ Polymerase chain reaction which simultaneously amplifies and quantifies the amount of a targeted DNA or cDNA molecule.

Selective sweep‐ The rapid fixation of an allele and all the alleles surrounding it.

Site frequency spectrum‐ The frequency distribution within the population of polymorphic nucleotide sites from a given sequence.

Stabilizing selection‐ When negative selection acts on a phenotypic trait.

Synonymous substitution‐ A nucleotide substitution that does not result in an amino acid replacement.

Transcriptome‐ The combined variability of all transcripts produced by the genome

Transition‐ A base exchange in which a pyrimidine base (C or T) is exchanged for another pyrimidine base, or a purine base (G or A) is exchanged for another purine base.

Transversion‐ A base exchange in which a pyrimidine base (C or T) is exchanged for a purine base, or a purine base (G or A) is exchanged for a pyrimidine base.

Twofold degenerate site‐ Only two of four possible nucleotides at this position in a codon specify the same amino acid.

Positive Selection in Transcription Factor Genes Along The Human Lineage

Abstract

by

GABRIELLE CELESTE NICKEL

The genetic changes that led to the phenotypic divergence of modern Homo sapiens are unknown, as are the molecular mechanisms through which these changes transpired. A predominant theory in the field of human molecular evolution is that these changes resulted from adaptive protein evolution, also known as positive selection. This theory posits that amino acid changes occurred along the human lineage that were beneficial to modern man and became fixed in the population, thereby contributing to divergence. Transcription factors are proteins that regulate the expression of other genes, and they have a major role in embryogenesis and development. In this thesis, I have investigated transcription factor genes for human‐specific positive selection that contributed to the phenotypic divergence of humans and chimpanzees. Over the course of this work, I have developed a novel statistical approach for predicting positive selection that is particularly useful in closely related species, as I have shown that traditional methods lack power to detect selection along short branch lengths. I have also used genome‐wide expression arrays to test for differential activity and functional divergence of a human candidate transcription factor predicted to be positively selected.

CHAPTER 1

INTRODUCTION AND BACKGROUND

Nearly one hundred and fifty years ago, Charles Darwin declared that through his theory of natural selection “much light will be thrown on the origin of man and his history” (DARWIN 1859). While now considered a mild proclamation, the implications of this statement in mid‐nineteenth century England were monumental. In a society that considered man’s natural history to be confined to Genesis 1:26, the suggestion that man may have descended from such a primitive beast as an ape was at the very least controversial, and at best the beginning of a new era of human reflection on man’s place in nature. Thomas Henry Huxley’s tome Evidence as to Man’s Place in Nature and Darwin’s later publication The Descent of Man are arguably the most influential texts on human evolution and the seed of a new field of evolutionary biology.

Darwin and his contemporaries were certainly not the first to acknowledge man’s similarity to the apes. In fact, the word chimpanzee is derived from kivili‐chimpenze, meaning “mock‐man” in Tshiluba, a Bantu language of the Congo. Just prior to the publication of On the Origin of Species, skeletons of strange humanoid creatures were being unearthed across the globe, the most famous of which was a Neanderthal skull discovered in a quarry near Düsseldorf, Germany in 1856. The discovery of these archaic human skeletons, alongside the growing understanding of ape anatomy and the theory of natural selection, came together in an explosive fashion to create the discipline of human evolution that still captivates scientists today. The notion that man has descended from the apes is as humbling as it is magnificent.

ORIGINS OF MODERN HUMANS

Homo sapiens (Latin: "wise man") are members of the biological order Primates, a group that includes lemurs, monkeys, and apes. Carolus Linnaeus, the father of modern taxonomy, first described the order Primates, organizing species according to morphology (LINNAEUS 1735). His classifications have held up unexpectedly well since the advent of modern techniques for systematics. Members of this order first appeared in the fossil record approximately 65 million years ago (mya) as small tree‐dwelling prosimians similar to the extant lemurs and lorises. Monkeys emerged 40 mya and diverged into New and Old World species as early as 30 mya, inhabiting exclusively the Americas or Asia and Africa, respectively.

Morphological and sequence data have now provided a clear classification of humans among the apes. There are two recognized families (GROVES 2005): the Hylobatidae or lesser apes (gibbons and siamangs) and the Hominidae (orangutan, gorilla, chimpanzee, and human). Historically, humans were placed in their own family, largely due to an anthropocentric view of human evolution. Goodman (GOODMAN 1999) actually suggested that humans, chimpanzees, and bonobos all be classified under the genus Homo. Morphological and behavioral data along with a paucity of fossil apes made it difficult to decipher the ancestral relationships among the apes predating the emergence of archaic humans. With the advent of sequence data, the relationships between these animals have been established (Figure 1.1) (YODER and YANG 2000). The divergence of lesser and great apes occurred roughly 17 mya, followed by the orangutan divergence around 12 mya. Gorillas were next to diverge, around 7‐9 mya. Only the chimpanzee lineage has split into two surviving species, common chimpanzees and bonobos (often called pygmy chimpanzees); this split occurred around 3 mya. Fossil data suggested that humans and chimpanzees diverged around 5‐7 mya, which has been supported by sequence data (PATTERSON et al. 2006).

Figure 1.1: Phylogenetic tree of the order Primates. This order is informally divided into three groups: prosimians, monkeys, and apes, with special attention paid to the subgroup of great apes. Genus names are given in italics. Suborders, parvorders, and superfamilies are displayed on the branches. Tree construction is based on consensus data from several analogous molecular studies and reviews (GOODMAN et al. 1990; GROVES 2005; GROVES 2001; YODER and YANG 2000).

The earliest fossil records of archaic human populations, also known as hominans, date to around 6 mya (SENUT et al. 2001). Members of this subtribe include the genera (in order of appearance) Orrorin, Ardipithecus, Australopithecus, Paranthropus, Kenyanthropus, and Homo, the only genus with surviving members. No fossil chimpanzee skeletons have been located, with the exception of the Chadian specimen Sahelanthropus tchadensis, believed to have existed 7 million years ago (BRUNET et al. 2005). This specimen is thought to have postdated the most recent common ancestor of humans and chimpanzees. The genus Homo is presumed to have arisen around 2 mya in the same regions of Africa (ANTON and SWISHER 2004).

Multiple competing models describe the origins of anatomically modern humans (AMH); however, one model, often referred to as the ‘single origin’ model, is best supported by both fossil and molecular data (see GARRIGAN and HAMMER 2006 for a review). The model postulates that AMH arose during the late Pleistocene in Africa from a small isolated population, which radiated into Eurasia, Oceania, and the Americas roughly 15 mya, replacing all other archaic Homo populations (CANN et al. 1987; EXCOFFIER 2002). The earliest fossil specimen of an AMH was discovered in Ethiopia in volcanic pumice that was dated to 195,000 years ago (DAY 1969; MCDOUGALL et al. 2005). Even this model has variants, as researchers continue to debate whether there was gene flow between AMH populations and archaic Homo species (GREEN et al. 2006; HODGSON and DISOTELL 2008; RELETHFORD 2008; SERRE et al. 2004).

Fossil data have been an excellent starting point for unraveling the secrets of human evolution; however, they have generated as many new questions as they have answered. Fossils have limited ability to answer questions about demography, such as population size and migration patterns, nor do they give any insight into how Homo sapiens have become such a morphologically divergent species. Molecular data, particularly DNA

polymorphism data from mtDNA, the , and many , has become the main tool with which researchers are investigating human origins.

One of the more interesting findings genomic evidence has provided is the calculation of the effective population size (Ne), the number of individuals in a population that contribute offspring to the next generation, for the founders of modern humans. For example, we now know there were approximately 10,000 fewer human individuals in the Pleistocene than in the previous era (HARPENDING et al. 1998; HARPENDING et al. 1993). The chimpanzee lineage does not show the same reduction in population numbers as humans during the Pleistocene (STONE et al. 2001), which suggests that environmental catastrophe was not the primary cause of the human‐specific population reduction; however, species‐specific infectious illness may have contributed to this population bottleneck. A more likely scenario is one in which there were a number of extinctions in small, dispersed populations that were later recolonized with founder effects (ELLER 2002). The small effective population size as well as the recent emergence of AMH serve to explain the dearth of genetic diversity among human populations, particularly when compared to chimpanzee populations (HARPENDING et al. 1998). The heterozygosity rate within a chimpanzee population varies with geography and population size, and is between 8.0 × 10⁻⁴ and 17.6 × 10⁻⁴. Strikingly, the heterozygosity level between chimpanzee populations is 19.0 × 10⁻⁴ (CONSORTIUM 2005c). This is more than twice that seen within the human population, which has an average heterozygosity of 7.5 × 10⁻⁴ (SACHIDANANDAM et al. 2001).
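The comparison of heterozygosity levels can be checked with simple arithmetic. The following Python sketch uses only the values quoted above; the constant and function names are illustrative assumptions and do not come from the cited studies.

```python
# Heterozygosity values quoted in the text (per nucleotide site).
CHIMP_WITHIN_POPULATION = (8.0e-4, 17.6e-4)  # range within chimpanzee populations
CHIMP_BETWEEN_POPULATIONS = 19.0e-4          # between chimpanzee populations
HUMAN_AVERAGE = 7.5e-4                       # average within the human population


def fold_difference(larger: float, smaller: float) -> float:
    """Return how many times greater the first value is than the second."""
    if smaller <= 0:
        raise ValueError("the reference value must be positive")
    return larger / smaller


chimp_to_human = fold_difference(CHIMP_BETWEEN_POPULATIONS, HUMAN_AVERAGE)
print(f"Chimpanzee (between populations) vs. human: {chimp_to_human:.2f}-fold")
```

Note that even the lower bound of the within‐population chimpanzee range (8.0 × 10⁻⁴) exceeds the human average.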

Human‐chimpanzee morphological differences

The morphological differences between humans and the other apes are numerous, from dentition to foot anatomy to fingernails (AIELLO and DEAN 1990; WOOD and COLLARD 1999). Humans have a disproportionately longer ontogeny and lifespan than other apes. Hair cover has been drastically reduced. The craniofacial shape is also quite diverged, including the shape of the brain case, a reduction in the size of the canine teeth, and the presence of a chin. Many of these differences are related to the upright gait that is unique to humans, such as the general shape of the body and thorax, the relative limb lengths, the dimensions of the pelvis, an S‐shaped spine, differences in the number and types of vertebrae, and a skull that is balanced upright on the vertebral column. Lastly, there are the vast intellectual differences between man and the apes, marked by a larger brain relative to body size and differences in brain topology, which led to the development of language and advanced tool‐making skills.

These major morphological differences are the likely result of what is known as ‘quantum evolution’ or macroevolution: changes that have occurred at the species level, rather than within a population (SIMPSON 1965). Since many of these phenotypic differences are not traceable in the fossil record, they may represent major adaptive shifts (see HONEYCUTT 2008 for a review). The root cause of these morphological changes is not known; they may reflect the cumulative effects of small changes within a large number of genes, or perhaps large changes in a small number of genes.

Human‐chimpanzee comparative genomics

Even though humans are closely related to chimpanzees, man is the most distinct of the apes. The morphological differences are obvious; however, chromosomal organization is also quite divergent. As the result of a fusion of two small great ape chromosomes to form human chromosome 2, humans have 46 chromosomes, while chimpanzees, bonobos, gorillas, and orangutans have 48 chromosomes. Many of the genes surrounding the region of the fusion have been pseudogenized or duplicated elsewhere in the genome, indicating that there may be functional relevance to this fusion (FAN et al. 2002a; FAN et al. 2002b). Humans have also undergone 9 pericentric inversions relative to the other great apes (YUNIS and PRAKASH 1982). Not surprisingly, the synteny between human and great ape chromosomes is well preserved; even the mutation rates in syntenic blocks of non‐coding sequence remain constant (WEBSTER et al. 2004). Models for human speciation based on these chromosomal

rearrangements have been proposed (NAVARRO and BARTON 2003) stating these rearrangements would have led to because of hybrid sterility;

however these theories have not held up under scrutiny, as the previous study lacked an appropriate outgroup to predict selection that was theorized to have taken place in the rearranged segments (LU et al. 2003; RIESEBERG and LIVINGSTONE 2003).

Comparative genomic analysis has been performed on the chimpanzee and human genomes in many studies (CASWELL et al. 2008; CHEN and LI 2001; CONSORTIUM 2005c; CONSORTIUM 2004; EBERSBERGER et al. 2002; GLAZKO et al. 2005; KING and WILSON 1975; SHI et al. 2003a). Since these two genomes are so similar, the focus is not on what is the same, but on what is different. Based on the analysis of 2,400 megabases (Mb) of human‐chimpanzee alignments, the genome‐wide nucleotide difference between humans and chimpanzees is 1.23%, although when corrected for polymorphism, this estimate drops to 1.06% (CONSORTIUM 2005c). In fact, species‐specific polymorphism is estimated to account for between 11 and 22% of all the sequence differences between the two species (EBERSBERGER et al. 2002; ELANGO et al. 2006). The level of nucleotide diversity among human populations is less than half that among the Pan lineages (7.5 × 10⁻⁴ for humans vs. 19.0 × 10⁻⁴ for chimpanzees) (CONSORTIUM 2005c; SACHIDANANDAM et al. 2001), which indicates a recent population expansion among humans (KAESSMANN et al. 2001; YU et al. 2003).
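The divergence figures quoted above can be related to one another with a short calculation. The Python sketch below is illustrative only: it assumes that the gap between the uncorrected and corrected estimates can be read directly as the share of differences due to polymorphism. That is a simplification introduced here, not the method of the cited studies, and all names are hypothetical.

```python
# Values quoted in the text (CONSORTIUM 2005c).
ALIGNED_BASES = 2_400 * 10**6      # 2,400 Mb of human-chimpanzee alignment
UNCORRECTED_DIVERGENCE = 0.0123    # 1.23% genome-wide nucleotide difference
CORRECTED_DIVERGENCE = 0.0106      # 1.06% after correcting for polymorphism


def polymorphism_share(uncorrected: float, corrected: float) -> float:
    """Fraction of the uncorrected divergence removed by the correction."""
    return (uncorrected - corrected) / uncorrected


share = polymorphism_share(UNCORRECTED_DIVERGENCE, CORRECTED_DIVERGENCE)
differences_in_alignment = UNCORRECTED_DIVERGENCE * ALIGNED_BASES

print(f"Share attributed to polymorphism: {share:.1%}")
print(f"Differences in the aligned sequence: {differences_in_alignment / 1e6:.1f} million")
```

Under this simplification the share comes out near 14%, which falls inside the 11 to 22% range reported above.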

There is regional variation in sequence divergence, presumably because of regional differences in mutation and recombination rate (PATTERSON et al. 2006). The chromosome with the most sequence conservation is the X chromosome (0.94% divergence), and the chromosome with the highest observed divergence is the Y chromosome (1.78%) (CONSORTIUM 2005c). There is also variation among autosomes, with chromosome 21 having a 1.44% divergence rate, which is higher than the average value for the entire genome (CONSORTIUM 2004; WATANABE et al. 2004). The degree of sequence similarity also depends on the function of the sequence, with the highest conservation seen among protein‐coding sequence. The sequence of the 3’ untranslated region (UTR) has diverged twice as much as coding sequence, and the divergence rate is even higher within the 5’ UTR (HELLMANN et al. 2003). The average ratio of the rate of nonsynonymous substitution to the rate of synonymous substitution (dN/dS) between human and chimpanzee sequence is 0.23, which is significantly higher than the dN/dS ratio of 0.13 observed between mouse and rat genomes.
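To make the dN/dS ratio concrete, the Python sketch below computes ω from a pair of rates and applies the conventional reading of a point estimate (ω > 1 suggests positive selection, ω < 1 negative selection, ω = 1 neutrality). The function names are hypothetical, and this is not the likelihood ratio testing performed with codeml elsewhere in this thesis; a point estimate alone carries no statement of statistical significance.

```python
def omega(dn: float, ds: float) -> float:
    """Return the dN/dS ratio for given nonsynonymous and synonymous rates."""
    if ds <= 0:
        raise ValueError("dS must be positive for the ratio to be defined")
    return dn / ds


def selection_regime(w: float) -> str:
    """Give the conventional reading of a dN/dS point estimate."""
    if w > 1:
        return "positive selection"
    if w < 1:
        return "negative (purifying) selection"
    return "neutral evolution"


# Genome-wide averages quoted in the text.
for label, w in [("human-chimpanzee", 0.23), ("mouse-rat", 0.13)]:
    print(f"{label}: dN/dS = {w:.2f} -> {selection_regime(w)}")
```

Both genome‐wide averages fall below 1, consistent with negative selection acting on most coding sequence; the higher human‐chimpanzee value indicates weaker constraint on average, not positive selection on any particular gene.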

Unexpectedly, insertions, deletions, and duplications are common in both genomes, accounting for roughly 3% of sequence differences (CONSORTIUM 2005c). A large fraction of the nucleotide differences, insertions, and deletions resulted from repeat sequence (one third) and transposable elements (one quarter) that are active in both species. There are 53 human genes that appear to have been deleted or partially disrupted in chimpanzees, although in reality this number is probably much smaller due to the quality of the chimpanzee draft genome sequence. Hahn and Lee (HAHN and LEE 2005; HAHN and LEE 2006) have developed methods to look for human‐specific gene disruptions, a difficult task in light of the draft chimpanzee genome. A full analysis of genes deleted or disrupted in humans has not yet been performed. Currently, only a small handful of genes are known to be deleted in or made into pseudogenes in humans (CONSORTIUM 2005c).

Between 70‐80% of human and chimpanzee orthologous proteins have amino

acid sequence differences (CONSORTIUM 2005c; GLAZKO et al. 2005), although the lineages

on which the sequence divergences occurred are only recently being discerned. The role of splice variant differences and amounts is as yet unclear. As more ape and monkey genomes are sequenced, ancestral sequence reconstruction can begin and the directionality of sequence divergence can be determined. The median ratio of replacement to silent substitutions is 2:3. In a comparison of human and chimpanzee cDNAs, Hellmann et al. (HELLMANN et al. 2003) estimated that 70% of the amino acid‐changing (nonsynonymous) substitutions that have occurred are deleterious.

Even though the percent differences between the two species are small, these differences still account for about 35 million nucleotide differences and 5 million insertions and deletions, both of which contain valuable information on the molecular evolution of modern man.

Given this massive amount of nucleotide difference, most of this genomic change is likely to simply be “the noise of neutral substitutions” (CARROLL 2003) whose fixation resulted from random genetic drift.

HUMAN MOLECULAR EVOLUTION

The types of genetic changes that have made us uniquely human remain unknown. It appears that the changes that led to the emergence of modern humans

were not a linear additive process, but rather were the result of a string of adaptive radiations during which many branches were formed and then died out ((CARROLL 2003;

GARRIGAN and HAMMER 2006; PEARSON 2004) for reviews). This view is largely based on fossil data, in which many different traits are observed in different combinations.

However, this evolution of traits sheds no light on the number and types of genes involved. Developmental geneticists have suggested that morphologic change is ultimately a consequence of changes in the amount, timing, and spatial patterning of gene expression and in gene regulatory networks (CARROLL 2000; DAVIDSON 2001).

According to their view, changes in the human lineage are likely to be associated with genes that are involved in development and have therefore shaped human anatomy, physiology, and behavior. One of the great remaining questions is whether these differences are primarily the collective effect of small changes in a large number of genes or are due to changes in a small number of genes that have a disproportionately large

effect.

The mechanisms through which evolution has shaped the human genome

remain hotly contested. There are currently two major hypotheses in the field (LI and

SAUNDERS 2005): the adaptive evolution of proteins and changes in non‐coding regulatory sequence directing gene expression. To a lesser extent, gene loss (OLSON

1999), copy number variation (NAHON 2003; OHTA 2000), and chromatin modification

(GARCIA et al. 2003) may also play a role in functional diversification of gene expression contributing to human phenotypic difference, but the evidence is far less apparent. It is important to note that these hypotheses are not mutually exclusive and it is most likely

a combination of all these genome‐altering events that has led to the divergence of

the human species (Figure 2). Table 1 lists important publications, both technical reviews

and specific examples, for each of the main models of human molecular evolution.

Figure 2: Mechanisms of human molecular evolution. The cooperative effect of different mechanisms of molecular evolution led to the emergence of modern humans.

Table 1: The dominant hypotheses of human molecular evolution and important references

Molecular mechanism: Adaptive protein evolution
Technical references / reviews: (AAGAARD and PHILLIPS 2005; ANISIMOVA et al. 2002; ARBIZA et al. 2006a; BAMSHAD and WOODING 2003; BISWAS and AKEY 2006; ELLEGREN 2005; FAY and WU 2000; FAY et al. 2001; FELSENSTEIN 1993; GOLDMAN and YANG 1994; KONDRASHOV et al. 2002a; KREITMAN 2000; LI et al. 1985; NEI 2005; NIELSEN 2005; PAL et al. 2006; SABETI et al. 2006; TAJIMA 1993; TANG et al. 2007; WONG and NIELSEN 2004; YANG 1998; YANG 2002; YANG and NIELSEN 2002b; YANG et al. 1998; YANG et al. 2005b; ZHANG et al. 2005)
Examples: (AKEY et al. 2004; ARBIZA et al. 2006b; BAKEWELL et al. 2007a; BERSAGLIERI et al. 2004; BUSTAMANTE et al. 2005b; CLARK et al. 2003b; GILAD et al. 2003a; GILAD et al. 2002; HAHN et al. 2004; HAMBLIN and DI RIENZO 2000; HELLMANN et al. 2003; HUTTLEY et al. 2000; JOHNSON et al. 2001; KELLEY et al. 2006; KITANO et al. 2004; LAMASON et al. 2005; LU and WU 2005; MEKEL‐BOBROV et al. 2005; NICKEL et al. 2008a; NIELSEN et al. 2005b; NORTON et al. 2007; PREUSS et al. 2004; RONALD and AKEY 2005; SABETI et al. 2002b; SABETI et al. 2007; SAUNDERS et al. 2006; SAUNDERS et al. 2002; SAWYER et al. 2005; SHI et al. 2003a; TISHKOFF et al. 2007; VOIGHT et al. 2006; WALSH et al. 2006; WANG et al. 2006a; WANG and SU 2004a; WILDMAN et al. 2003; WYCKOFF et al. 2000; ZHANG 2003)

Molecular mechanism: Regulatory / non‐coding sequence changes
Technical references / reviews: (AKASHI 2001; BUSH and LAHN 2005; CARROLL 2000; DERMITZAKIS et al. 2003; KHAITOVICH et al. 2004; MARGULIES et al. 2003; MARGULIES et al. 2005)
Examples: (BEJERANO et al. 2004; BOFFELLI et al. 2003; CACERES et al. 2003; CHABOT et al. 2007; DONALDSON and GOTTGENS 2006; DORUS et al. 2004b; ENARD et al. 2002a; GILAD et al. 2006; GILAD et al. 2005; GU and GU 2003; HEISSIG et al. 2005; HUBY et al. 2001; KARAMAN et al. 2003; KHAITOVICH et al. 2006a; KHAITOVICH et al. 2005a; KHAITOVICH et al. 2005b; KHAITOVICH et al. 2006b; KING and WILSON 1975; LEMOS et al. 2005; MAGNESS et al. 2005; POLLARD et al. 2006; PRABHAKAR et al. 2006; PREUSS et al. 2004; ROCKMAN et al. 2005; WEBSTER et al. 2004)

Molecular mechanism: “Less‐Is‐More” hypothesis
Technical references / reviews: (FORTNA et al. 2004; OLSON 1999; OLSON and VARKI 2003)
Examples: (GILAD et al. 2003b; HAHN and LEE 2005; HAHN and LEE 2006; KEIGHTLEY et al. 2005; KEIGHTLEY et al. 2005a; STEDMAN et al. 2004; WANG et al. 2006b)

Molecular mechanism: Epigenetic regulation
Technical references / reviews: (GARCIA et al. 2003; JAENISCH and BIRD 2003)
Examples: (ENARD et al. 2004; PASTINEN et al. 2004; WEBER et al. 2007)

Molecular mechanism: Copy number variation
Technical references / reviews: (FORTNA et al. 2004; HANCOCK 2005; HURLES 2004; KONDRASHOV et al. 2002b; LI et al. 2005; OHNO 1970; OHTA 2000; PRINCE and PICKETT 2002; ROTH et al. 2007)
Examples: (BAILEY and EICHLER 2006; CHENG et al. 2005; JOHNSON et al. 2001; MAKOVA and LI 2003; NAHON 2003; NGUYEN et al. 2006; PERRY et al. 2006; SHI et al. 2003b)

Adaptive protein evolution

Adaptive evolution, also known as positive or Darwinian selection, can be an extremely effective way of maintaining advantageous protein changes that can increase the fitness of an organism, as well as being involved in morphological divergence and speciation. In fact, it is believed that 20‐45% of all amino acid‐changing substitutions

have been influenced by positive selection (BIERNE and EYRE‐WALKER 2004; FAY et al.

2002). Even a single amino acid substitution can have an effect on the function of a protein and therefore also affect larger protein networks (BRIDGHAM et al. 2006; KNIGHT et al. 2006).

For example, amino acid replacements can alter DNA‐binding (TRELSMAN et al. 1989)

which is particularly true for transcription factors because it can affect the downstream

cascade of gene regulation. It is also a way to increase morphological diversity both within and between species ((CARROLL 2000) for a review).

Positive selection has been demonstrated to actively alter immunity and defense genes in humans and other primate taxa such as the major histocompatibility complex

(MHC) and leukocyte antigens (FAN et al. 1989; FILIP and MUNDY 2004), malarial defense

systems (SAUNDERS et al. 2002), reproduction such as sperm protamines (WYCKOFF et al.

2000), sensory perception such as taste receptors (SHI et al. 2003b), and dietary adaptation such as lysozyme (MESSIER and STEWART 1997) and lactase (BERSAGLIERI et al.

2004; TISHKOFF et al. 2007). Unfortunately, these types of genes would not promote morphological and behavioral changes associated with human and chimpanzee divergence. It has long been known that genes involved in immunity, reproduction, and diet rapidly evolve, as an organism is continually being exposed to new environmental pressures. Genes involved in morphology and physiology, however, are often involved in very complex genetic pathways, and positive selection is more difficult to uncover in

these genes because these changes may have been cooperative and involved compensatory mutations.

Changes in gene regulatory sequence

King and Wilson (1975) first suggested that regulatory mutations might be responsible for the human‐chimpanzee phenotypic differences. Until that time it was

assumed that humans and chimpanzees must have very different sets of genes

responsible for such great phenotypic difference. With the development of new techniques in molecular biology, however, it became clear that human and chimpanzee proteins were far more similar than previously expected. King and Wilson used sequencing, immunological comparisons, and electrophoresis to compare over 40 human and chimpanzee orthologous proteins. They concluded that human and chimpanzee proteins are more than 99% identical, and that accelerated protein evolution could not account for the human‐chimpanzee divergence. Rather, they proposed that small differences in the cis‐regulatory sequences controlling gene expression during embryonic development are the likely causes of the anatomical and behavioral differences between the two species. Divergence of gene regulatory systems remains the dominant theory in human molecular evolution; yet it is notoriously difficult to investigate because one can no longer rely on the rates of synonymous and nonsynonymous substitution. Techniques designed to investigate positive selection in non‐coding sequence have been developed and are beginning to be applied to the study

of human molecular evolution (DERMITZAKIS et al. 2003; MARGULIES et al. 2003; MARGULIES et al. 2005; POLLARD et al. 2006; WONG and NIELSEN 2004).

A variety of studies have demonstrated gene expression differences between humans and other apes (Enard et al. 2002; Caceres et al. 2003; Karaman et al. 2003;

Fortna et al. 2004; Khaitovich et al. 2004), but follow‐up techniques for linking cis‐regulatory differences to gene expression changes are weak. The importance of non‐coding regulatory sequence in vertebrate development is highlighted in a study by

Woolfe et al., who performed a whole‐genome comparison between human and the

pufferfish Fugu rubripes looking for conserved non‐genic sequence (CNGs) (WOOLFE et al.

2005). They found that many CNGs surrounding genes involved in transcription

regulation and development show significant enhancer activity.

Transcription factor binding sites are small and are often found at large distances away from the gene they are regulating, making their prediction problematic.

Humans and chimpanzees are also so closely related that it is hard to find functional

divergence in the face of nearly identical sequence. Typically the most important sequence is the most highly conserved, and the close evolutionary distance between these two species provides almost no information. This conservation, however, can be used to detect putative functional elements, either by looking at identical sequence in closely related species (BEJERANO et al. 2004) or in highly divergent species (THOMAS et al. 2003a). Phylogenetic shadowing methods developed by Boffelli et

al. (BOFFELLI et al. 2003) have overcome some of the technical barriers by using multiple sequence alignments of mammalian species to find functionally important regions specific to primates or humans. A second method for characterizing conserved sequence has been developed by Margulies et al. (MARGULIES et al. 2003; MARGULIES et al.

2005), which was initially used to detect clusters of transcription factor binding sites. It must be considered though, that conservation is not synonymous with function, as many deletions of large scale conserved sequence known as gene deserts can be tolerated by an organism without any obvious phenotypic effect (NOBREGA et al. 2004).

Several studies have performed genome scans with the intent of finding ultra‐conserved, functional non‐genic sequence in the human genome (BEJERANO et al. 2004;

DERMITZAKIS et al. 2003; THOMAS et al. 2003a; WOOLFE et al. 2005) that may have contributed to human phenotypic divergence. Typically, selection is predicted when the human sequence has diverged relative to the other species in highly conserved sequence blocks. However, it is difficult to distinguish a model of selection from one of neutral evolution which could also account for the nucleotide differences. Wong and

Nielsen have developed a maximum likelihood method for predicting positive selection in CNGs using substitution rates in both genic and non‐genic sequence (WONG and

NIELSEN 2004). The true test lies in linking these predictions with actual functional

divergence.

These techniques were recently used to identify non‐coding sequence subject to

selection known as ‘human accelerated regions’ (HARs) (POLLARD et al. 2006). Instead of

finding regulatory sequence, Pollard et al. identified a novel RNA gene expressed in neurons of both the developing and adult brain in both human and non‐human primates. Functional analysis has yet to be performed on this RNA. These non‐coding

RNAs, including miRNAs, are a new mechanism to explore in human evolution, as these elements have the potential to regulate the expression of large numbers of genes.

Keightley et al. (2005) examined CNGs in humans and came to a different conclusion about the selective pressures affecting these sequences. This group examined CNGs along the long arm of human chromosome 21 using multiple species

alignments (human‐chimp and mouse‐rat), correcting for variation. They calculated the amount of selective constraint in the regions between the species pairs by estimating the fraction of mutations removed by natural selection. The selective constraint along the hominid lineage is about half as strong as along the murid lineage.

There are also equal numbers of conserved nucleotides in non‐coding sequence as in protein‐coding sequence in humans. In mice, there are twice as many conserved nucleotides in non‐coding sequence as in protein‐coding sequence. They propose that the reason for the lack of selective constraint could be a function of a general degradation in non‐coding sequence, which had been suggested earlier by this group

(Keightley et al. 2005a) or because of the lower effective population size of humans leading to a reduction in the effectiveness of selection.
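The notion of selective constraint used above, the fraction of mutations removed by natural selection, is commonly expressed as one minus the ratio of observed substitutions to the number expected under neutrality. The following is a minimal sketch under that assumption; the function name and the counts are hypothetical, chosen only to mirror a hominid constraint roughly half that of murids, and are not data from Keightley et al.

```python
def selective_constraint(observed_subs: float, expected_subs: float) -> float:
    """Fraction of mutations removed by selection: C = 1 - observed / expected."""
    if expected_subs <= 0:
        raise ValueError("expected_subs must be positive")
    return 1.0 - observed_subs / expected_subs


# Hypothetical counts for illustration only (not taken from any study).
hominid = selective_constraint(observed_subs=90, expected_subs=100)
murid = selective_constraint(observed_subs=80, expected_subs=100)
print(f"hominid constraint = {hominid:.2f}, murid constraint = {murid:.2f}")
```

In practice the expected number of substitutions would be estimated from a putatively neutral reference class of sites, which is the step where corrections for mutation rate variation enter.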

“Less‐is‐More” hypothesis of gene loss

The “Less‐is‐More” hypothesis was first described by Maynard Olson as the primary mechanism for adaptive gene loss in an organism when faced with novel selective pressures (OLSON 1999). He postulates that gene loss occurs more often than functional mutation and it has the ability to spread rapidly through small populations, such as the types present in the recently evolved archaic human populations (OLSON and

VARKI 2003). The small Ne of early Homo sapiens is an ideal size for homozygous gene

loss among randomly‐mating individuals (VOGEL and MOTULSKY 1996). This type of purifying selection occurs when a species ‘breaks free’ from the selective constraints

once applied to its genome. It is believed that this sort of gene loss is responsible for many of the degenerative traits seen in modern human populations, such as delayed postnatal development, and loss of hair and muscle strength.

The initial sequencing of the chimpanzee genome revealed that 53 known human genes were either deleted or disrupted in the chimpanzee (CONSORTIUM 2005c).

More recently, Wang et al. investigated the rate of pseudogenization in the human genome by comparing it to chimpanzee sequence, and identified 80 non‐processed pseudogenes (WANG et al. 2006b) that have occurred since the divergence of humans and chimpanzees. The human specific loss of olfactory genes is an excellent example of pseudogenization acting on an entire gene family (GILAD et al. 2003b) as humans began to rely more heavily on sight than olfaction for survival. Likewise, the inactivation of the predominant myosin heavy chain gene is credited with the reduction in mandibular and masticatory muscles since the divergence of modern humans from the archaic humans, Australopithecus and Paranthropus (STEDMAN et al. 2004).

Gene duplication in human evolution

Susumu Ohno first suggested that genes can evolve new functions through duplication, as they are free from the selective constraints of the source genes (OHNO

1970). A gene may undergo neofunctionalization, which is the acquisition of a novel function, or subfunctionalization, in which the duplicated gene takes on a part of the role of the source gene ((HANCOCK 2005; HURLES 2004) for a review). A gene may also

evolve a subtly different function from the parent gene (microfunctionalization) which is often the mechanism through which gene families evolve (HANCOCK 2005). Lastly,

duplicated genes may evolve different expression patterns (LI et al. 2005). Positive selection upon a new protein function may rapidly fix the duplicated gene or segment in a population. This is particularly true along the human lineage among copy‐number variants involved in olfaction and immune function (NGUYEN et al. 2006), and several copy‐number variation ‘hotspots’ have been located in human and chimpanzee in locations of ancient duplications (PERRY et al. 2006).

Through genome comparisons of human and chimpanzee aligned sequence, it has been estimated that at least 180 genes have become duplicated along the human lineage, and 94 have duplicated along the chimpanzee lineage (CHENG et al. 2005). A

study comparing the rates of gene duplications among the great apes found that humans possess a larger number of lineage specific duplications, and identified 134 gene families that have gained human specific duplications (FORTNA et al. 2004). One such gene family is the human‐specific morpheus gene family, which underwent positive selection after a segmental duplication (JOHNSON et al. 2001); however, there is currently no functional information available for this gene family.

Epigenetic regulation of gene expression

A mechanism for human molecular evolution that is gaining interest is epigenetic regulation and the effects it can have on gene expression. DNA methylation and histone modification are mechanisms that stably affect gene expression without involving mutation in DNA nucleotide sequence. Epigenetic regulation is required for embryonic

development, X‐chromosome inactivation, and imprinting, and many show alterations in epigenetic regulation. These types of effects are particularly important during development, and they also allow an organism to respond to environmental stimuli by altering gene expression (JAENISCH and BIRD 2003). This mechanism, therefore, may have had important contributions to human evolution, particularly in the promoter region and cis‐regulatory sequence (WEBER et al. 2007).

Enard et al. examined the methylation patterns in 36 genes in the brain, liver, and lymphocytes of humans and chimpanzees using microarray‐based methylation analysis (ENARD et al. 2004). Twenty‐two of the 36 genes showed significant differences in methylation. Their most interesting finding was that methylation patterns were significantly different in 15 of 18 genes expressed in the brain, and that of these 15 genes, 14 were hypermethylated in humans. This led to the conclusion that methylation patterns have diverged over the course of human evolution, providing a mechanism for changes in gene regulation without altering the underlying DNA sequence; however, further analysis needs to be performed on a more extensive data set.

GENE EXPRESSION DIFFERENCES

Over three decades ago, Mary‐Claire King and Alan Wilson (KING and WILSON

1975) examined the great paradox of human evolution: “The genetic distance between humans and the chimpanzee is probably too small to account for their substantial organismal differences.” They instead suggested that small changes in the machinery that controls gene expression during development were the driving force of human phenotypic divergence. Indeed, many studies have linked differences in gene

expression to differences in morphology, such as the expression of the growth factor

Bmp4 and the size and shape of the beak in Darwin’s finches (ABZHANOV et al. 2004).

These types of expression differences are often located in the regulatory sequence of genes or are due to structural changes in proteins that modulate gene expression; however, epigenetic effects are also known to affect gene expression, particularly on the X‐chromosome and in imprinted regions (PASTINEN et al. 2004).

The study of gene expression differences is complicated by many factors because expression levels are constantly changing across different tissues, during different developmental stages, from environmental effects, and because transcription is a complex event often relying on other transcripts or proteins for proper gene regulation to occur (KHAITOVICH et al. 2006a). Cross‐species comparisons are even more difficult

because tissue specificity and developmental timing need to be synchronized and the environment cannot be controlled. Many have turned to the use of microarrays for

cross‐species comparisons; however, the technique suffers from changes caused by sequence divergence and probe‐binding sensitivity (GILAD et al. 2005).

The measurement of gene expression differences is most often performed using mRNA levels; however, this comes with caveats of its own. Predicting protein levels based on mRNA level is not possible because gene expression may also be affected by codon bias and the downstream effect of tRNA availability and binding‐preference

(AKASHI 2001). Also, prediction of protein abundance from mRNA level is not feasible because of post‐translational modification and the highly variable stability of the mRNA and protein. Fu et al. examined this issue by comparing mRNA levels of genes known to be differentially expressed between humans and chimps to steady‐state protein levels (FU et al. 2007). In this case, they found a significant and reproducible positive correlation between mRNA and protein levels.

Human‐chimpanzee expression differences

A number of studies have examined differences in human and chimpanzee gene expression using mRNA levels (ENARD et al. 2002a; FU et al. 2007; GILAD et al. 2006; GU and GU 2003; KARAMAN et al. 2003). Karaman et al. examined the expression profiles of human and bonobo cultured fibroblast cells using gorilla as an outgroup to predict the ancestral state. Using microarray analysis coupled with Northern blots, they found 30 genes that had been upregulated in humans compared to bonobo, and 22 upregulated in bonobo compared to chimpanzee. This study, however, suffers from the fact that

cultured cells are an artificial system, and the gene expression changes observed may not be biologically relevant, and that fibroblasts may not be the subject of positive selection during human evolution.

Gene expression differences in tissue samples were examined for humans and chimpanzees, with orangutan as an outgroup (ENARD et al. 2002a). Blood leukocyte, brain, and liver mRNA were measured using microarray technology, and protein differences were examined through two‐dimensional electrophoresis. The most surprising finding was that intraspecies (among individuals of a species) gene expression was the most divergent within a tissue sample. The tissue that showed the least amount of divergence was the brain; however, the rate ratio of the change was highest in brain on the chimpanzee to human lineages even though there were fewer genes diverging. Subsequent studies supported the results of Enard et al. with regard to the findings of gene expression differences in brain tissue (CACERES et al. 2003; GU and GU

2003; PREUSS et al. 2004).

Other studies have taken a more direct approach by performing functional analysis on human and chimpanzee promoter sequences (CHABOT et al. 2007; HEISSIG et

al. 2005; HUBY et al. 2001). The most recent study by Chabot et al. examined the promoter regions of genes previously shown to be differentially expressed between the two species. The sequences of 10 human and chimpanzee promoters were cloned upstream of a reporter gene, and differential expression was found for three promoters.

Site‐directed mutagenesis was used to pinpoint the functional nucleotide changes.

Selective pressures on gene expression

Khaitovich et al. have suggested a neutral model of transcriptome evolution based on human and chimpanzee gene expression data (KHAITOVICH et al. 2005a; KHAITOVICH et al. 2005b; KHAITOVICH et al. 2004). They propose that most changes that have occurred are selectively neutral, based on a number of conclusions from the data. First, gene expression changes increase with evolutionary divergence (time) and have occurred in a linear fashion. They also note that intraspecies gene expression differences positively correlate with interspecies differences. Lastly, the rates of interspecies divergence are the same between intact genes and pseudogenes. In support of the previous brain‐specific gene expression data, it was found that gene expression divergence within a species has also accumulated in a linear fashion. These conclusions were also supported in a study by Lemos et al. that examined differences in the

expression profiles of primates and found that most mRNAs are evolutionarily stable,

most likely due to the effects of stabilizing selection (LEMOS et al. 2005).

As with protein evolution, gene expression divergence can be affected by positive selection. Gilad et al. used a multi‐species array (serving to increase both the sensitivity and the specificity) to study gene expression differences in the liver between four primates: human, chimpanzee, orangutan, and rhesus macaque. They found that disease related genes and genes with the Gene Ontology classification of “regulation of cellular properties” were constrained by the effects of stabilizing selection. However,

the expression of one category of genes, transcription factors, was evolving rapidly specifically along the human lineage, and these changes are not simply the tail end of the neutral distribution (GILAD et al. 2006). Based on their ability to regulate gene expression and their own rapid evolution of gene expression, transcription factors were studied in this thesis as elements of evolutionary divergence leading to the phenotypic differences between humans and chimpanzees.

POSITIVE SELECTION

The unit of evolution in Darwin’s theory of natural selection is a phenotypic trait;

however, his theories can be easily applied to genomes, with the gene as the unit of

evolution. Advantageous mutations are fixed in a population, and deleterious ones are

eliminated. However, such a scenario imposes a heavy genetic load on an organism: there are cumulative costs associated with selection. This idea led Kimura (KIMURA 1968)

to propose a neutral theory of evolution that states that the majority of mutations

(polymorphisms and substitutions) are not the products of selection; rather they have no influence on the fitness of an organism. These selectively neutral mutations are transient and will either be fixed or eliminated from a population by random genetic drift. This theory also posits the presence of a molecular clock regulating sequence evolution at a constant rate over time in all organisms, and population diversity is largely a consequence of genetic drift. This neutral theory has been challenged often,

especially the molecular clock, but it highlights many important tenets about sequence evolution.

Today it is accepted that most genes are evolving neutrally or under purifying selection, preserving essential function; however, natural selection does play a role in shaping genomes, both by promoting fixation of advantageous mutations and eliminating harmful ones. Out of this idea arose Ohta’s ‘nearly neutral theory of molecular evolution’ in which both slightly advantageous and deleterious mutations were considered (KIMURA and OTA 1974; OHTA 1973; OHTA 1992). Methods to investigate the selective forces acting on a gene based on the rates of synonymous (dS) and nonsynonymous (dN) substitutions within protein‐coding DNA sequence have since been developed (KIMURA 1983; NEI 1975). Synonymous substitutions are nucleotide changes that change the codon but do not alter the amino acid. Nonsynonymous substitutions are nucleotide changes that do alter the amino acid and are most often found at the first and second positions within a codon. Despite the degeneracy of the genetic code, there are far more possible nonsynonymous substitutions than synonymous ones. Therefore, the absolute number of substitutions must be corrected by the number of potential sites of substitution, giving dN and dS. The dS is defined as the rate of synonymous nucleotide substitution per synonymous site. Likewise, dN is the rate of nonsynonymous nucleotide substitution per nonsynonymous site.
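The imbalance between possible synonymous and nonsynonymous changes can be illustrated by counting potential sites codon by codon, in the spirit of simple counting methods. The sketch below is not the procedure used in this thesis; it assumes the standard genetic code, treats every one of the nine single‐nucleotide changes to a codon as equally likely, and counts changes to a stop codon as nonsynonymous, which is a simplifying assumption. All names are hypothetical.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... ("*" = stop).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def site_counts(codon):
    """Return (synonymous_sites, nonsynonymous_sites) for one sense codon."""
    original = CODON_TABLE[codon]
    if original == "*":
        raise ValueError("stop codons are not counted")
    syn_changes = 0
    for pos in range(3):
        for base in BASES:
            if base == codon[pos]:
                continue
            mutant = codon[:pos] + base + codon[pos + 1:]
            if CODON_TABLE[mutant] == original:
                syn_changes += 1
    syn_sites = syn_changes / 3  # each position has three possible changes
    return syn_sites, 3.0 - syn_sites


sense_codons = [c for c, aa in CODON_TABLE.items() if aa != "*"]
total_syn = sum(site_counts(c)[0] for c in sense_codons)
total_nonsyn = sum(site_counts(c)[1] for c in sense_codons)
print(f"synonymous sites: {total_syn:.1f}, nonsynonymous sites: {total_nonsyn:.1f}")
```

Summed over the sense codons, nonsynonymous sites outnumber synonymous sites several‐fold, which is why raw substitution counts are normalized per site before dN and dS are compared.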

The ratio of the rates at which these different types of substitutions are being deposited can lend clues to the selective pressures influencing the gene. A dN/dS < 1 is indicative of purifying or negative selection. Nonsynonymous substitutions are most often deleterious and are quickly weeded out of a population, which is reflected in an under‐representation of these types of mutations in a gene. A dN/dS = 1 indicates neutral evolution since synonymous and nonsynonymous substitutions are being deposited at the same rate, with no bias for or against replacement substitutions. A dN/dS > 1 is strong evidence for adaptive or positive selection. This indicates that these amino acid‐changing mutations have offered a fitness advantage to the organism and are fixed in a population at a higher rate than synonymous substitutions. It must be noted that not all amino acid changes have been selected by evolution; rather, the fixation of many was the result of random genetic drift. The challenge is to be able to distinguish between selection and drift, and several statistical methods have been developed to detect these functionally determined changes and are discussed below.
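The three regimes described above can be sketched in a few lines. This is an illustration only, assuming per‐site proportions of differences are corrected for multiple hits with the Jukes‐Cantor formula; the function names, the tolerance around 1, and the counts are hypothetical, and a real analysis would use a statistical test rather than a fixed cutoff.

```python
import math


def jukes_cantor(p):
    """Correct a proportion of differing sites for multiple substitutions."""
    if not 0 <= p < 0.75:
        raise ValueError("proportion must be in [0, 0.75)")
    return -0.75 * math.log(1 - 4 * p / 3)


def dn_ds(nonsyn_diffs, nonsyn_sites, syn_diffs, syn_sites):
    """Return (dN, dS, dN/dS) from difference counts and site counts."""
    dn = jukes_cantor(nonsyn_diffs / nonsyn_sites)
    ds = jukes_cantor(syn_diffs / syn_sites)
    if ds == 0:
        raise ValueError("dS is zero; the ratio is undefined")
    return dn, ds, dn / ds


def interpret(omega, tolerance=0.05):
    """Label a dN/dS ratio; the tolerance is an arbitrary illustrative choice."""
    if omega > 1 + tolerance:
        return "positive selection"
    if omega < 1 - tolerance:
        return "purifying selection"
    return "neutral evolution"


# Hypothetical gene: 6 nonsynonymous differences over 300 nonsynonymous sites,
# 10 synonymous differences over 100 synonymous sites.
dn, ds, omega = dn_ds(6, 300, 10, 100)
print(f"dN = {dn:.4f}, dS = {ds:.4f}, dN/dS = {omega:.2f} -> {interpret(omega)}")
```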

Detecting positive selection through population genetic analysis

The use of population genetics to detect positive selection is best for predicting either very recent human selection or selection that has occurred between human populations. Positive selection leaves a fingerprint on a genome and the discovery of these patterns can lead to predictions about evolutionary direction. Such signatures

are: high proportion of nonsynonymous substitution, reduction in genetic diversity, high

frequency derived alleles, differences between populations, and long range haplotypes.

Since positive selection leads to the fixation of advantageous alleles, one can compare the proportion of amino acid‐altering changes both between species and within a species to determine whether between‐species variation is due to neutral or adaptive evolution (HUGHES and NEI 1988; LI et al. 1985). The genetic divergence between species is mutation that has been fixed by genetic drift, selection, or genetic hitchhiking, which is a result of a selective sweep. Polymorphism is the mutation within a species whose ultimate fate of fixation or loss has yet to be determined. This approach has limitations that restrict it to detecting ongoing or very recent selection, because multiple changes are needed to distinguish selection from the noise of neutral substitution.

The use of polymorphism and divergence was first suggested by Hudson et al.

(the HKA test) in 1987 as a way to test Kimura’s controversial theory of neutral evolution by examining the Adh locus in two Drosophila species (HUDSON et al. 1987). This goodness of fit statistic is restricted to comparing only two loci, however, and is best suited for examining balancing selection or recent selective sweeps. A modified version of the HKA test and the one to be used in this study is the McDonald‐Kreitman test

(MCDONALD and KREITMAN 1991b). This method is heavily rooted in the phenomenon of the selective sweep. In a population at any point in time, there is neutral variation surrounding a given locus. If a mutation occurs at that locus that adds an advantage to

47

an individual’s fitness, this mutation will be pulled very rapidly through the population.

During fixation of this mutation, surrounding (neutral) alleles will also be “swept” with the mutation and fixed. This has the immediate effect of eliminating variation around the selected site, and only with time will new neutral mutations be gradually reintroduced to this area (LI 1997). In this way a selective sweep will leave its fingerprint on a genome.
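The comparison of polymorphism and divergence that underlies the McDonald‐Kreitman test can be sketched as a 2x2 contingency table. The following is a minimal illustration only: the function names, the use of Fisher's exact test, the neutrality index summary, and the counts in the usage note are assumptions made for this sketch and are not taken from the analysis in this study.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2

    def prob(x: int) -> float:
        # Hypergeometric probability of observing x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum every possible table that is no more probable than the observed one.
    p = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))
    return min(p, 1.0)


def mk_test(dn: int, ds: int, pn: int, ps: int) -> tuple[float, float]:
    """Return (neutrality index, p-value) for a McDonald-Kreitman style table.

    dn, ds: fixed nonsynonymous / synonymous differences between species.
    pn, ps: nonsynonymous / synonymous polymorphisms within a species.
    A neutrality index below 1 indicates an excess of fixed nonsynonymous
    differences, the pattern expected under positive selection.
    """
    if dn == 0 or ps == 0:
        raise ValueError("neutrality index is undefined when dn or ps is zero")
    neutrality_index = (pn * ds) / (ps * dn)
    return neutrality_index, fisher_exact_two_sided(dn, ds, pn, ps)
```

With hypothetical counts such as `mk_test(dn=7, ds=17, pn=2, ps=42)`, the neutrality index falls below 1 and the table departs significantly from the neutral expectation of equal nonsynonymous to synonymous ratios within and between species.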

Bustamante et al. (BUSTAMANTE et al. 2005b) analyzed the data used in Clark et al. (CLARK et al. 2003b) with the inclusion of single nucleotide polymorphism (SNP) data for 39 humans using a modified version of the McDonald‐Kreitman test, known as the mkprf (McDonald‐Kreitman analysis using Poisson Random Field). The Clark et al. study used phylogenetic analysis on multispecies sequence alignments of human, chimpanzee, and mouse to predict positive selection in more than 11,400 genes. Bustamante et al. analyzed 3,323 genes that had at least 4 variable nonsynonymous sites in the alignment and found that 306 genes show evidence of positive selection with P < 0.05. Unsurprisingly, genes involved in immunity and defense, gametogenesis, and sensory perception were found to be overrepresented in this list, a phenomenon seen in the other large scale scans for positive selection. One intriguing result from this analysis is that transcription factors are also overrepresented on this list of genes undergoing rapid adaptive evolution (P < 0.0001).

Use of the reduction in genetic diversity to predict positive selection was proposed by Tajima and takes advantage of the excess of rare alleles that a selective sweep deposits on a genome (TAJIMA 1989). A selective sweep reduces or eliminates variation among the nucleotides in the DNA neighboring a mutation that has been favored by positive selection. Only with time will variation be reintroduced into the ‘swept’ area, which produces an excess of rare alleles. The most common way to differentiate selection from genetic drift is through the calculation of Tajima’s D, which is based on the number of segregating sites and nucleotide diversity. A negative Tajima's D signifies an excess of rare alleles, indicating population growth or positive selection. A positive Tajima's D signifies low levels of both low and high frequency alleles (that is, an excess of intermediate frequency alleles), indicating a decrease in population size or balancing selection. One important consideration is that demography, such as a rapidly expanding population, can often explain a strongly negative D. Because a selective sweep usually encompasses a large region, pinpointing the causal variant can be quite difficult.
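The calculation of Tajima's D from the number of segregating sites and nucleotide diversity can be sketched as follows. This is a minimal illustration using the standard constants from Tajima (1989); the function name and the representation of the sample as equal-length haplotype strings are assumptions made for the example.

```python
from itertools import combinations
from math import sqrt


def tajimas_d(haplotypes: list[str]) -> float:
    """Tajima's D for a sample of aligned, equal-length haplotype strings."""
    n = len(haplotypes)
    if n < 4:
        raise ValueError("at least four sequences are needed")

    # S: number of segregating (polymorphic) sites in the sample.
    seg_sites = sum(1 for column in zip(*haplotypes) if len(set(column)) > 1)
    if seg_sites == 0:
        raise ValueError("D is undefined without segregating sites")

    # pi: average number of pairwise differences (nucleotide diversity).
    pairs = list(combinations(haplotypes, 2))
    pi = sum(sum(x != y for x, y in zip(h1, h2)) for h1, h2 in pairs) / len(pairs)

    # Constants that depend only on the sample size.
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i**2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n**2 + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1**2
    e1 = c1 / a1
    e2 = c2 / (a1**2 + a2)

    # Difference between the two diversity estimators, scaled by its
    # estimated standard deviation.
    variance = e1 * seg_sites + e2 * seg_sites * (seg_sites - 1)
    return (pi - seg_sites / a1) / sqrt(variance)
```

A sample in which every variant is a singleton gives a negative D (excess of rare alleles), while a sample split evenly between two haplotypes gives a positive D.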

High frequency derived (non‐ancestral) alleles can also be used to detect selection, once again through the detection of a selective sweep (FAY and WU 2000). These derived alleles will hitchhike with the advantageous allele, but not become fixed, generally due to recombination. Recombination facilitates the movement of mutations between genetic backgrounds each generation, and since there is no selective pressure on the variant, there is no need to maintain or purge the polymorphism (RICE and CHIPPINDALE 2001). The commonly used test for detection of high frequency derived alleles is Fay and Wu’s H, which is based on the number of derived variants found in a sample of chromosomes. A classic example of this mechanism is the selective sweep that occurred in African populations around the Duffy red cell antigen (FY) locus, presumably because it conferred resistance to malaria (HAMBLIN and DI RIENZO 2000). This test also suffers from demographic confounds, as the results may also be the effects of population subdivision (see SABETI et al. 2006 for a review).

Signatures of positive selection are also hidden in the differences between populations. Selection can affect one population and not another because of differing environmental pressures, which may drive one allele to high frequency in only one of the populations. In this case it is necessary to know the ancestral allele (to infer the direction of selection), which can be determined through phylogenetic analysis. This test requires the calculation of the FST statistic, which is essentially a measure of the genetic variability both between and within populations (LEWONTIN and KRAKAUER 1973). In a neutrally evolving gene, loci across populations will evolve in a similar and predictable manner, in which case FST is determined by genetic drift. However, positive selection can cause deviations in FST values between populations at the allele of interest as well as in surrounding nucleotide blocks.
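The idea of FST as a comparison of variability between and within populations can be sketched for a single biallelic SNP. This is a simplified illustration based on Wright's definition, FST = (HT - HS) / HT, and it assumes equally sized populations; the function name is hypothetical, and the estimators used in published scans include sample size corrections that are omitted here.

```python
def fst_biallelic(allele_freqs: list[float]) -> float:
    """Wright's FST for one biallelic SNP.

    allele_freqs holds the frequency of the same allele in each population.
    Populations are weighted equally (a simplifying assumption).
    """
    if len(allele_freqs) < 2:
        raise ValueError("at least two populations are needed")
    mean_p = sum(allele_freqs) / len(allele_freqs)
    # HT: expected heterozygosity if all populations were pooled.
    h_total = 2 * mean_p * (1 - mean_p)
    # HS: average expected heterozygosity within each population.
    h_within = sum(2 * p * (1 - p) for p in allele_freqs) / len(allele_freqs)
    if h_total == 0:
        # The SNP is monomorphic everywhere; FST is undefined and is
        # reported as 0 in this sketch.
        return 0.0
    return (h_total - h_within) / h_total
```

Identical allele frequencies give an FST of 0, while an allele fixed in one population and absent from the other gives an FST of 1.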

Akey et al. (AKEY et al. 2004) calculated FST values in three populations, African‐American, East Asian, and European‐American, by analyzing the allele frequencies of 26,530 SNPs. They identified 174 genes with significant FST values that show signatures of positive selection, leading them to conclude that selection has influenced the variation seen in extant human populations.

In 2002, one of the most powerful tools for detecting recent human positive selection, known as the Extended Haplotype Homozygosity, was developed (SABETI et al. 2002b). This test essentially measures the breakdown of linkage disequilibrium in large haplotype blocks around a selected allele. During a selective sweep, the advantageous allele brings with it long range associations with other alleles, or a haplotype block. Large haplotype blocks that have remained undisturbed by recombination are indicative of strong, recent positive selection, or may lie in recombination “cold spots” in which the recombination rate is low for as yet unknown reasons. This method can be used to detect positive selection both between species and between populations within a species; however, there is significantly greater power associated with the latter. With the release of the data from the International HapMap Project (CONSORTIUM 2005b), this method was applied to a much larger and more detailed data set (SABETI et al. 2007). This analysis revealed more than 300 candidate regions that were subject to a selective sweep in one of the three populations studied, either African, European, or East Asian. Upon closer examination of some of the regions, several genes with previous predictions of positive selection were highlighted. Within the African population, the genes LARGE and DMD were predicted; both genes are related to infection by the Lassa virus. The genes SLC24A5 and SLC45A2, previously associated with skin pigmentation (LAMASON et al. 2005), were implicated in the European populations. Lastly, genes involved in hair follicle development, EDAR and EDA2R, had evidence of selection in the Asian populations.
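The extended haplotype homozygosity measure described above can be sketched as the probability that two randomly chosen chromosomes carrying the same core allele are identical over the whole interval from the core to a given marker. This is a minimal illustration only; the function name and the encoding of phased haplotypes as strings of marker alleles are assumptions made for the example.

```python
from collections import Counter
from math import comb


def ehh(haplotypes: list[str], core: int, core_allele: str, marker: int) -> float:
    """Extended haplotype homozygosity from the core position out to `marker`.

    haplotypes: phased haplotypes, one character per marker.
    core: index of the core (putatively selected) marker.
    core_allele: allele at the core that defines the carrier chromosomes.
    marker: index of the marker out to which homozygosity is measured.
    """
    carriers = [h for h in haplotypes if h[core] == core_allele]
    n = len(carriers)
    if n < 2:
        raise ValueError("at least two carrier chromosomes are needed")
    lo, hi = sorted((core, marker))
    # Group carriers by their extended haplotype over the interval.
    groups = Counter(h[lo:hi + 1] for h in carriers)
    # Fraction of chromosome pairs that are identical over the interval.
    return sum(comb(size, 2) for size in groups.values()) / comb(n, 2)
```

EHH starts at 1 at the core and decays with distance as recombination and mutation break up the haplotype; a slow decay around a common allele is the signature sought by this test.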

A number of genome scans for positive selection using a variety of population genetic methods have been performed (AKEY et al. 2004; KELLEY et al. 2006; VOIGHT et al. 2006; WALSH et al. 2006), each study providing candidate genes for follow‐up study of positive selection. However, these studies all focus on very recent positive selection, and cannot answer questions about the ancient and fundamental changes that led to the divergence of Homo sapiens. To find these types of answers one must look to phylogenetic analysis for predictions of ancient positive selection.

Detecting positive selection by phylogenetic analysis

Phylogenetic analysis has proven to be very useful in detecting selective pressure on a gene (YANG and NIELSEN 2002b). Some of the phylogenetic methods that use multiple sequence alignments to predict positive selection have been reviewed in Yang and Bielawski (YANG and BIELAWSKI 2000), Yang (YANG 2002), and Bamshad and Wooding (BAMSHAD and WOODING 2003). The methods can be applied to find the signatures of positive selection along specific branches of a phylogenetic tree or at specific codons within a protein coding gene. At their simplest, these methods look at fixed synonymous and nonsynonymous differences in protein coding genes between species, given a specific phylogeny. An estimation of the rate at which these differences are accruing can be used to infer positive selection, negative selection, or neutral evolution. I have chosen to use the maximum likelihood method of Goldman and Yang (GOLDMAN and YANG 1994) to investigate selection that is specific to the human lineage.

The advantage of the above method is that it is based on a codon substitution model, in which a sense codon (not a nucleotide) is the unit of evolution. Previous methods have used nucleotides or amino acids of protein‐coding sequence as the unit of evolution, which results in the loss of an enormous amount of information by not taking synonymous and nonsynonymous substitutions into account (GOLDMAN and YANG 1994). The dN/dS ratios are calculated according to Li et al. (LI et al. 1985). Li was the first to classify nucleotide sites as nondegenerate, twofold degenerate, or fourfold degenerate depending on how often a nucleotide change will result in an amino acid change. In his likelihood based method, Li also incorporates a classification of a nucleotide change as a transition or a transversion, the former being the most common type of substitution and more likely to be a silent substitution. Each of these nucleotide changes, based on its classification, is scored according to its probability based on the frequency of the type of substitution in mammalian genes. Ancestral sequences at internal nodes are predicted using maximum likelihood techniques based largely on the work of Felsenstein (FELSENSTEIN 1973) for a better estimation of dN and dS. The method that Felsenstein developed parses each nucleotide in a multispecies alignment into discrete character states, and evaluates each substitution according to a probabilistic model of evolution. The maximum likelihood method of predicting selection assumes that synonymous nucleotide substitutions are selectively neutral; however, it must be noted that this method cannot tell if a synonymous substitution was driven to fixation by random genetic drift or by selection (YANG and BIELAWSKI 2000). An added benefit of calculating the synonymous substitution rate for each gene (rather than estimating the neutral rate of mutation from other genes or non‐coding regions) is that it accounts for the heavy association of mutation with GC content. In this way the neutral mutation rate is tailored for each gene (KREITMAN 2000).
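The classification of nucleotide sites as nondegenerate, twofold degenerate, or fourfold degenerate can be sketched with the standard genetic code. This is a minimal illustration only; the function name is hypothetical, and sites where two of the three possible changes are silent (the third position of the isoleucine codons) are grouped with the twofold class as a simplifying assumption.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def site_degeneracy(codon: str) -> list[int]:
    """Return the degeneracy class (0, 2, or 4) of each codon position.

    0 = nondegenerate (every change alters the amino acid),
    2 = twofold degenerate (some, but not all, changes are silent),
    4 = fourfold degenerate (every change is silent).
    """
    classes = []
    for pos in range(3):
        silent = 0
        for base in BASES:
            if base == codon[pos]:
                continue
            mutant = codon[:pos] + base + codon[pos + 1:]
            if CODON_TABLE[mutant] == CODON_TABLE[codon]:
                silent += 1
        if silent == 0:
            classes.append(0)
        elif silent == 3:
            classes.append(4)
        else:
            classes.append(2)
    return classes
```

For example, the glycine codon GGT has two nondegenerate positions and a fourfold degenerate third position, while no change to the methionine codon ATG is silent.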


There are other parameters that a codon substitution model considers. Transition‐transversion rate ratios are modeled in this likelihood approach, a factor often ignored by other models. This is important to consider because transitions at the third position of a codon are more likely to be synonymous than transversions, and therefore, ignoring this rate is likely to lead to an overestimation of dS and an underestimation of the dN/dS ratio. This model also corrects for multiple hits and back substitutions. Another often overlooked parameter is that of codon bias, that is, the largely unexplained unequal codon frequencies seen within genes. All these parameters are calculated specifically for each gene in an alignment, rather than considered equal for all genes. Yang et al. (1998) demonstrate a simplified equation for the calculation of the rate of substitution from codon u to codon v:

$$
q_{uv} =
\begin{cases}
0, & \text{if } u \text{ and } v \text{ differ at more than one position} \\
\pi_v, & \text{for a synonymous transversion} \\
\kappa \pi_v, & \text{for a synonymous transition} \\
\omega \pi_v, & \text{for a nonsynonymous transversion} \\
\omega \kappa \pi_v, & \text{for a nonsynonymous transition}
\end{cases}
$$

where π is codon frequency (π_v being the frequency of codon v), κ is the transition/transversion rate ratio, and ω is the nonsynonymous/synonymous rate ratio. This equation leaves the user with fewer assumptions about their data, and provides a customized framework for evolutionary analysis.
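The roles of π, κ, and ω in the codon substitution rate can be sketched in a few lines. This is a minimal illustration of the standard form of the model rather than the implementation used in any particular software package; the function names are hypothetical, and the rate is set to zero whenever a stop codon is involved because only sense codons are modeled.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
PURINES = {"A", "G"}


def is_transition(x: str, y: str) -> bool:
    """True for purine<->purine or pyrimidine<->pyrimidine changes (x != y)."""
    return (x in PURINES) == (y in PURINES)


def substitution_rate(u: str, v: str, kappa: float, omega: float, pi_v: float) -> float:
    """Instantaneous rate of substitution from codon u to codon v.

    kappa: transition/transversion rate ratio.
    omega: nonsynonymous/synonymous rate ratio (dN/dS).
    pi_v: equilibrium frequency of the target codon v.
    """
    if CODON_TABLE[u] == "*" or CODON_TABLE[v] == "*":
        return 0.0  # stop codons are not part of the model
    changes = [(x, y) for x, y in zip(u, v) if x != y]
    if len(changes) != 1:
        return 0.0  # only single-nucleotide changes occur instantaneously
    rate = pi_v
    if is_transition(*changes[0]):
        rate *= kappa
    if CODON_TABLE[u] != CODON_TABLE[v]:
        rate *= omega
    return rate
```

Because ω multiplies only the nonsynonymous rates, an estimate of ω greater than 1 for a gene or lineage means that amino acid changing substitutions are accumulating faster than silent ones.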

Another advantage of the likelihood method is the ability to test “goodness‐of‐fit” models (GOLDMAN and YANG 1994). Maximum likelihood analysis provides statistical tests that consider the suitability of different models (or scenarios) of evolution against a particular set of data. This comparison is done by randomly simulating data sets according to the given phylogeny and using the parameters (codon frequency, transition/transversion ratio) calculated from the original data to determine how often one would observe the real data. This is transformed into a likelihood score, which can be compared between different models of evolution, to find a scenario that most accurately represents the evolutionary history of the gene.
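The step of asking how often simulated data would look like the real data can be sketched as an empirical p-value. This is a minimal illustration only: it assumes that a test statistic has already been computed for the observed alignment and for each simulated data set, and how those simulations are generated is outside the scope of the sketch. The function name is hypothetical.

```python
def empirical_p_value(observed: float, simulated: list[float]) -> float:
    """Fraction of simulated statistics at least as extreme as the observed one.

    The +1 terms count the observed data set as one draw from the null
    model, so the estimate can never be exactly zero.
    """
    as_extreme = sum(1 for value in simulated if value >= observed)
    return (as_extreme + 1) / (len(simulated) + 1)
```

A small value indicates that data like the real alignment are rarely produced under the simulated model, which argues against that model as a description of the gene's evolutionary history.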

Several large scale scans have been performed using maximum likelihood methods and phylogenetic analysis in an attempt to better understand the way selection is shaping the human genome. Clark et al. (CLARK et al. 2003b) compared the multiple sequence alignments of 7,645 orthologous genes between humans and chimpanzees, using mice to infer the ancestral state, and found that 1,547 genes had a dN/dS > 1. Of this number, a hypothesis of neutral evolution was rejected for only 125 genes with p < 0.01. In an effort to understand what sort of biological processes might be the target of selective pressure, Clark et al. used the Panther classification system (THOMAS et al. 2003b) to see if any categories of genes were overrepresented on this list compared to the normal distribution. Genes involved in amino acid catabolism and olfaction showed a significant trend to be under positive selection, as did genes with Online Mendelian Inheritance in Man (OMIM) entries. Genes involved in developmental processes show a low p‐value, but as a class their representation on the list is not significant. Those developmental genes with low p‐values primarily fall into three categories: skeletal development, neurogenesis, and genes that are homeotic transcription factors.

A second scan was performed by Nielsen et al. (NIELSEN et al. 2005b) using a greater number of genes and a smaller number of species. This group set out to find genes that were targets of positive selection along the human or chimpanzee lineage during any point of these species’ evolution. The comparison of 13,731 human and chimpanzee orthologous genes reveals that 733 genes are evolving with a dN/dS > 1; however, only 35 of these have a p‐value < 0.05 that rejects a hypothesis of neutral evolution. Several groups of genes involved in particular biological processes (using the Panther classification system) are overrepresented in the list of 733 genes. Topping the list are genes involved in immunity and defense, chemosensory perception, spermatogenesis, and those with an unknown biological process. Transcription factors also appear to have prominent representation on this list.

Both of these studies have limited power to detect positive selection. This is in part due to the small number of species in each data set; the confident reconstruction of ancestral nodes is nearly impossible with only 2‐3 species. A particular problem facing the Nielsen group is the short evolutionary distance between the two species. An even bigger challenge for these groups is that most genes are well conserved, and they evolve with a dN/dS << 1. Positive selection is unlikely to affect all sites over time. By looking at a gene as an entire unit, the signatures of positive selection can be overlooked, as selection often acts upon specific sites or specific domains of a gene. Therefore, many genes that have been targeted by positive selection will remain undetected by these methods (KREITMAN 2000).

Interestingly, three additional genome scans were performed to detect positive selection; this time, however, selection along the chimpanzee lineage was added into the analysis (ARBIZA et al. 2006a; ARBIZA et al. 2006b; BAKEWELL et al. 2007a). Arbiza et al. (ARBIZA et al. 2006b) found that chimpanzee genes were evolving faster than human genes under both positive selection and relaxation of selective constraint. They also concluded that many genes with previous predictions of positive selection were actually undergoing a relaxation of selective constraint, and care must be taken to differentiate between these two possibilities. Bakewell et al. (BAKEWELL et al. 2007a) came to a similar conclusion, that even though human genes had a higher nonsynonymous substitution rate, significantly more genes were undergoing positive selection along the chimpanzee lineage. This led them to suggest that natural selection was not as efficient along the human lineage because of its small effective population size.

TRANSCRIPTION FACTORS

Transcription factors are proteins that are prominent regulators of gene expression. These proteins regulate transcription initiation, thereby affecting the timing, amount, and spatial patterning of the expression of a given gene. Altering the amino acid sequence of one of these proteins can potentially change its activity, thereby having dramatic or subtle downstream gene expression effects. This may happen by altering the transcription factor’s DNA‐binding ability (either enhancing or weakening the protein‐DNA bond) or by affecting the binding of other cofactor proteins, thus causing a modest change in transcription factor function or possibly even the acquisition of new function (see HSIA and MCGINNIS 2003b for a review). This effect would be particularly striking were it specific to the development of the organism, as is the case with many transcription factors.

Proteins that regulate transcription are thought to encompass 5‐10% of all protein coding genes; however, this figure also includes genes that make up the basal transcription machinery and those involved in chromatin modification (LEVINE and TJIAN 2003). For the purpose of this project, the analysis will be restricted to the canonical DNA‐binding transcription factors, with the hypothesis that the positively selected amino acid changes in these proteins have led to functional divergence in activity that led to some of the phenotypic changes between humans and chimpanzees. These transcription factors are known to regulate transcription through protein‐protein interactions. These interactions can be with other transcription factors, transcription cofactors that do not bind DNA, chromatin‐remodeling complexes, or the basal transcription machinery. It is these interactions that cooperatively act to promote or inhibit transcription of a given gene (see WRAY et al. 2003a for a review).


Transcription factors are composed of an array of functional domains, arranged in different combinations. These may include a DNA‐binding domain(s), a protein‐binding domain(s), a nuclear localization signal, and ligand‐binding domain(s). It is believed that the evolution of transcription factor gene families arose from either “domain shuffling” or the deletion of domains (WRAY et al. 2003a). A disruption in the amino acid sequence in any one of these domains could potentially affect the binding affinity of that domain. This could then alter downstream events, such as the timing, amount, spatial patterning, and tissue‐specificity of the expression of target genes.

The bulk of the work done on transcription factor functional divergence and specificity has been performed in invertebrates. Treisman et al. (TREISMAN et al. 1989) demonstrated that changing a single amino acid of the DNA‐binding recognition motif of the homeodomain protein Paired (Prd) of Drosophila was enough to change this protein’s DNA‐binding specificity to that of two other homeodomain proteins, Fushi tarazu (Ftz) and Bicoid (Bcd). Changes outside of the DNA‐binding domain can also contribute to functional change. Grenier and Carroll (GRENIER and CARROLL 2000b), through functional complementation experiments, showed that functional divergence between Ultrabithorax (Ubx) in Drosophila and the orthologous gene (OUbx) in an onychophoran (the velvet worm, considered a proto‐arthropod) was due to sequences outside the DNA‐binding homeodomain. They propose that the interaction of Ubx with cofactors and other activity modifiers is the reason that OUbx can continue to regulate many of the same target genes as Ubx, but cannot transform segmental identity.


For many years it was suggested that changes in gene expression were primarily due to changes in the cis regulatory sequence to which a transcription factor binds. This is in large part due to the belief that since a transcription factor regulates many genes, a change in this transcription factor would have a disproportionately large effect on gene expression and therefore a detrimental effect on the development of an organism (see CARROLL 2000 for a review). Given the sequence similarity between humans and chimpanzees, it is likely that the small number of amino acid changes would result in subtle alteration of transcription factor function, affecting only a small subset of target genes (or perhaps only a single one). Transcription factors may therefore be ideal targets for the effects of positive selection acting on morphological diversity of an organism.

CONCLUSION

In this doctoral dissertation, I have chosen to examine the role that adaptive protein evolution played in the shaping of the human genome in order to better understand the molecular mechanisms behind the evolution of modern humans. In an effort to merge the two main hypotheses of human molecular evolution, natural Darwinian selection and changes in gene regulation and expression, I have chosen to study the effects of positive selection on transcription factor genes, which are genes that regulate the expression of other genes. Transcription factors were chosen for further study based on their involvement with embryogenesis and development in order to narrow the search for genes involved in the morphological and physiological divergence between humans and chimpanzees. The hypothesis is that the human protein has functionally diverged from its chimpanzee ortholog, and has contributed to the phenotypic changes along the human lineage. In the process of investigating positive selection, I have developed a new test for prediction that shows greater sensitivity in the face of recent divergence such as between humans and chimpanzees (Chapter 2 and database of findings in Chapter 3). This work has been published in the journals Genetics and Nucleic Acids Research, respectively. Follow up studies demonstrating functional divergence of the human transcription factor are necessary, and methods for such experiments are discussed in Chapters 4 and 5. Through my investigations, I have uncovered several transcription factors that are candidates for positive selection and that may have contributed to the phenotypic changes that occurred along the human lineage that make humans so unique among the great apes. I have also provided a new method for detection of positive selection in closely related species that will aid researchers in better understanding the genetic changes that led to the emergence of the human species.


CHAPTER 2

AN EMPIRICAL TEST FOR BRANCH‐SPECIFIC

POSITIVE SELECTION

Gabrielle C. Nickel1, David L. Tefft2, Karrie Goglin3, and Mark D. Adams1

1Department of Genetics, Case Western Reserve University School of Medicine,

Cleveland, OH 44106

2The Broad Institute, Cambridge, MA 02141

3The Scripps Research Institute, La Jolla, CA 92037

Sequence data from this article have been deposited with the EMBL/GenBank Data Libraries under accession nos. DQ976435‐ DQ977607 and EF185319‐ EF185336.

Note: Portions of this chapter are published as Nickel GC, Tefft DL, Goglin K, and Adams MD. 2008. An empirical test for branch‐specific positive selection. Genetics 179 (4). D.L.T provided computer support with data analysis and database construction. K.G. assisted in sequencing. M.D.A. conducted phylogenetic analysis and sequence reconstruction.


ABSTRACT

Prediction of positive selection specific to human genes is complicated by the very close evolutionary relationship with our nearest extant primate relatives, chimpanzees. To assess the power and limitations inherent in use of maximum likelihood (ML) analysis of codon substitution patterns in such recently diverged species, a series of simulations was performed to assess the impact of several parameters of the

evolutionary model on prediction of human‐specific positive selection including branch

length and dN/dS ratio. Parameters were varied across a range of values observed in alignments of 175 transcription factor (TF) genes that were sequenced in twelve primate species. The ML method largely lacks the power to detect positive selection that has

occurred since the most recent common ancestor between humans and chimpanzees.

An alternative null model was developed based on gene‐specific evaluation of the empirical distribution of ML results using simulated neutrally evolving sequences. This empirical test provides greater sensitivity to detect lineage‐specific positive selection in the context of recent evolutionary divergence.


INTRODUCTION

A key question facing modern science is, “What makes humans different from

the other great apes?” Sequencing of the human genome has heightened the interest of scientists in understanding the origins of our species and the genetic basis for traits that distinguish us from our closest living relatives, chimpanzees. Sequencing of the chimpanzee genome revealed about 1% fixed single‐nucleotide differences with the human genome, and about 1.5% lineage‐specific DNA in each species (CHENG et al. 2005;

CSAAC, 2005a). Even though the percent divergence is small, there are about 35 million single nucleotide changes and 5 million insertion/deletion differences between human and chimpanzee, most of which are presumed to be evolutionarily neutral (CSAAC,

2005a). Thus the complete catalog of human‐chimpanzee genome differences is insufficient to reveal the genetic basis for phenotypic divergence between the species.

Several approaches have been presented to the task of identifying alleles that have been subject to positive selection – that is, that have increased in frequency in a population as a result of adaptive evolutionary advantage. These include methods based on differences among human populations (TAJIMA 1993),

extended haplotype homozygosity indicative of a recent selective sweep (SABETI et al.

2002a), a discrepancy between rates of polymorphism and divergence (MCDONALD and

KREITMAN 1991a), and phylogenetic methods that compare the rates of evolution on different branches of the tree (FELSENSTEIN 1981; GOLDMAN and YANG 1994). Several


studies have characterized positive selection in human genes with a goal of identifying the genetic basis of human‐specific features such as a larger brain size, bipedalism, and skeletal and dental differences (VALLENDER and LAHN 2004). Likewise, genome‐wide surveys have identified genes, gene families, and genome regions with a signature of positive selection (ARBIZA et al. 2006b; BAKEWELL et al. 2007b; BUSTAMANTE et al. 2005a; CLARK et al. 2003a; CSAAC, 2005a; DORUS et al. 2004a; GIBBS et al. 2007; VOIGHT et al. 2006). We are most interested in identifying genetic differences associated with

fundamental and ancient divergence between humans and the non‐human primates, and so have focused on the phylogenetic approach to identify positive selection that is unique to the human lineage.

The nature of selection acting on a protein‐coding sequence is characterized by comparison of the rate of nonsynonymous (dN) and synonymous (dS) substitutions.

Under neutral evolution, the rate of accumulation of synonymous and nonsynonymous substitutions should be equal, and a dN/dS value of one is expected. A dN/dS value greater than one is indicative of positive selection, while a value less than one indicates negative selection. About one‐third of chimpanzee protein sequences are identical to

the sequence of the human ortholog, with the remainder having typically one or two amino acid changes (CSAAC, 2005a). Clearly, positive selection cannot be inferred when two protein‐coding sequences are identical. But how different must two sequences be for the currently available methods to detect statistically significant positive selection?
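The dependence of dN/dS on the number of observed differences can be illustrated with a simple counting approach. This sketch is not the maximum likelihood method used in this study: it assumes that the numbers of synonymous and nonsynonymous sites and differences have already been counted for a pair of sequences, and it applies a Jukes‐Cantor correction for multiple hits. The function names and the counts in the usage note are hypothetical.

```python
from math import log
from typing import Optional


def jukes_cantor(p: float) -> float:
    """Correct a proportion of differing sites for multiple substitutions."""
    if p >= 0.75:
        raise ValueError("sequences are too divergent for this correction")
    return -0.75 * log(1 - 4 * p / 3)


def dn_ds(nonsyn_diffs: float, nonsyn_sites: float,
          syn_diffs: float, syn_sites: float
          ) -> tuple[float, float, Optional[float]]:
    """Return (dN, dS, dN/dS); the ratio is None when dS is zero."""
    dn = jukes_cantor(nonsyn_diffs / nonsyn_sites)
    ds = jukes_cantor(syn_diffs / syn_sites)
    ratio = dn / ds if ds > 0 else None
    return dn, ds, ratio
```

For two identical sequences, `dn_ds(0, 300, 0, 100)` returns zero for both rates and no ratio at all, and with only a handful of differences, as in a hypothetical `dn_ds(6, 300, 1, 100)`, the ratio rests on so few substitutions that a single additional synonymous change would substantially alter it.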


Statistical methods based on models of codon substitution patterns have been developed to identify positive selection specific to one branch in a phylogeny, even when only a subset of sites have experienced selection (YANG and NIELSEN 2002a).

Simulation studies using the maximum likelihood methods implemented in Phylogenetic

Analysis by Maximum Likelihood (PAML) (YANG 1997b) have shown that the approach is

conservative with few false positives, but also demonstrated a lack of power,

particularly in analyzing closely related sequences (WONG et al. 2004; ZHANG et al. 2005).

We were particularly interested in how the maximum likelihood codon substitution models would perform for analysis of selection on human proteins in comparison to chimpanzee, where there are few divergent positions.

Genome‐wide studies to date have been based on data from only two (ARBIZA et

al. 2006b; CSAAC, 2005a; NIELSEN et al. 2005a) or three species (BAKEWELL et al. 2007b;

CLARK et al. 2003a; GIBBS et al. 2007). In the context of human‐chimpanzee‐macaque gene alignments, only a handful of genes were predicted to be positively selected in humans after multiple test correction (BAKEWELL et al. 2007b; GIBBS et al. 2007). Analysis of sequence from a broader set of primate species would place human‐chimpanzee differences into a more robust phylogenetic framework and enable more specific analysis of evolutionary rates in each primate lineage. By improving our understanding of the factors that contribute to detection of positive selection, it should be possible to optimize the design of computational experiments to test for positive selection, and potentially the selection of additional organisms for genome sequencing. Therefore, we

collected sequence data from twelve primate species to provide a more comprehensive


view of the phylogenetic breadth that is required to detect positive selection, explore the patterns of codon substitution that accompany predictions of positive selection, study the factors that affect a prediction of positive selection, and examine the evolutionary characteristics of a group of important genes.

Predictions of positive selection in recently diverged species, such as humans and chimpanzees, are often difficult using traditional phylogenetic methods. Simulation

studies were performed to gauge the factors that most affect predictions of selection;

even under the best circumstances, the extent of divergence found between human and chimpanzee genes was not sufficient to meet the strict criteria for prediction of selection in the ML test. Through these studies we developed an empirical test that is

more sensitive to the often subtle changes that have occurred along the human lineage, and that provides the opportunity to identify additional candidates for positive selection that would not be predicted using currently available methods.

Developmentally regulated transcription factor genes were chosen for study

because of the fundamental role that many of these genes play in a broad array of developmental processes, the ability of small changes in sequence to cause broad changes in downstream gene expression patterns (e.g., GRENIER and CARROLL 2000a; TING

et al. 1998), and because TFs are over‐represented among genes predicted to be positively selected in previous genome‐wide studies of selection (ARBIZA et al. 2006b;

BUSTAMANTE et al. 2005a; CSAAC, 2005a).


MATERIALS AND METHODS

Selection of genes and DNA sequencing: Transcription factor genes were

selected for sequencing based on a prior prediction of positive selection (BUSTAMANTE et

al. 2005a; CLARK et al. 2003a; NIELSEN et al. 2005a) or involvement in developmental processes (Gene Ontology term GO:0032502) having an excess of nonsynonymous substitution relative to synonymous substitution in a comparison of human and chimpanzee genome annotations. Primate DNA samples were obtained from the NIA

Aging Cell Repository Phylogenetic Primate Panel (PRP0001). Human DNA samples were from the NIGMS Human Genetic Cell Repository Human Variation Collection (NA00946,

NA00131, NA01814, NA14665, and NA03715). Both panels were obtained through

Coriell Cell Repositories. Hylobates and Papio samples were kindly provided by Dr. Evan

Eichler.

A major hurdle was sequencing the distantly related primate species. Primates that diverged many millions of years ago, such as the lemur and the New World monkeys, are likely to have quite divergent sequence. This is especially true because primers were designed in intronic regions adjacent to the coding exons whose sequence is of interest. Without a template sequence for these primates, primers were initially designed exclusively from the human sequence. Human genomic sequence was used as input into primer3 (ROZEN and SKALETSKY 2000), and the primers were designed around coding exons. As can be expected, the success rate for amplifying other primate species was quite low (Table 1, Human Design). The next approach we took was to apply a macaque mask to the human sequence by BLASTing the human genomic sequence against the macaque draft sequence. Mismatched basepairs were converted to Ns in the human sequence. Since the average sequence divergence between human and macaque is 4.9% for noncoding

sequence (MAGNESS et al. 2005), the macaque is an ideal organism for this masking. It is

diverged enough to be able to reveal regions of conservation, but not too diverged to be unable to pick primers. This masked sequence was then used as input into primer3. By masking off mismatches in the human cDNA, sometimes there is either no sequence left

for primer picking or what is left is suboptimal (i.e. high GC content, repetitive

sequence). In cases where no suitable primer could be designed, a chimpanzee mask was applied to the human sequence. The success of the various masks can be seen in

Table 1.
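The masking step described above can be illustrated with a minimal sketch. It assumes the human and comparison (macaque or chimpanzee) sequences have already been aligned to equal length; the alignment step itself is not shown. The function name `mask_mismatches` and the handling of gaps are illustrative assumptions, not the pipeline used in this study.

```python
def mask_mismatches(human: str, other: str, gap: str = "-") -> str:
    """Return the human sequence with mismatched positions replaced by 'N'.

    Both inputs are assumed to be aligned strings of equal length.
    Columns where the human sequence has a gap are dropped, so the
    result is ungapped human sequence that could be passed to a
    primer-design program.
    """
    if len(human) != len(other):
        raise ValueError("aligned sequences must have equal length")
    masked = []
    for h, o in zip(human.upper(), other.upper()):
        if h == gap:
            continue  # no human base in this alignment column
        # Keep the human base only where the two species agree.
        masked.append(h if h == o else "N")
    return "".join(masked)
```

Primers chosen from the masked sequence can only fall in stretches free of Ns, which is why a heavily masked region may leave no acceptable primer site.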

Table 1: The percent of primer pairs that produced high quality sequence from human templates and macaque and chimpanzee masks

Species               Human Design   Macaque Mask   Chimp Mask
Human                 82%            78%            85%
Chimpanzee            41%            78%            84%
Bonobo                42%            78%            82%
Gorilla               59%            78%            87%
Orangutan             53%            78%            75%
Hamadryas Baboon      ND             65%            48%
Pigtailed Macaque     24%            65%            51%
Rhesus Macaque        41%            65%            54%
Spider Monkey         24%            49%            51%
Woolly Monkey         24%            51%            39%
Red‐bellied Tamarin   18%            53%            44%
Ring‐tailed Lemur     0              14%            21%

The chimpanzee mask has the highest success for being able to amplify sequence in the great apes, but it is less effective when amplifying product from Old and New

World monkeys. The success in the apes is likely due to optimality of the primers chosen, and the failure in monkeys because of sequence divergence. The success rate in

monkeys, however, increases when the macaque mask is applied because primers are designed in conserved regions. Less success in the apes can be tolerated because some of the sequence is already available. When alternative transcripts were present, the longest isoform was chosen for sequencing. Each forward and reverse primer was tagged with a 16‐ or 17‐mer M13 primer in order to hasten the sequencing process.

Each sequencing reaction was run on an Applied Biosystems 3730xl DNA sequencer.

Sequence was obtained from both strands of each PCR product and trace files were processed by phred and phrap (EWING and GREEN 1998; EWING et al. 1998; GREEN),

with each from each organism assembled independently. Low quality bases


(Q<25) in the consensus were converted to Ns. Sequence from each species was aligned to the human genomic DNA using matcher (an implementation of the Pearson lalign algorithm from the EMBOSS package) and the location of exons inferred from coordinates in the reference human sequence. Mouse and rat orthologs were obtained from Mouse Genome Informatics (http://informatics.jax.org), and aligned to the human sequence using transAlign (BININDA‐EMONDS 2005). Rhesus macaque and orangutan

alignments were supplemented with sequences extracted from the NCBI Trace Archive when our data did not cover the entire . Chimpanzee alignments were supplemented with orthologous segments extracted from the chimpanzee draft genome assembly (CSSAC, 2005a), Build 1. Multiple sequence alignments of exonic regions were then constructed for each exon, again in reference to the human sequence, using matcher and sim4 (FLOREA et al. 1998), and exons were assembled to form the entire

coding sequence of the gene. The resulting alignments were reformatted to serve as input to the phylogenetic analysis programs. The total sequence from each source

(generated for this study and previously publicly available) and the average gene coverage for each organism are given in Table 2.


Table 2: Sequence coverage of transcription factor genes and data sources

Species                Common Name           Percent      Percent Lab   Percent w/    Percent      # Genes with
                                             Coverageᵃ    Coverageᵇ     Public Data   Identityᶜ    Coverage
Pan troglodytes        Chimpanzee            98.1         81.9          16.2          99.4         175
Pan paniscus           Bonobo                79.2         79.2                        99.5         164
Gorilla gorilla        Gorilla               76.8         76.8                        99.4         155
Pongo pygmaeus         Orangutan             96.9         69.8          27.1          98.7         175
Hylobates klossi       Kloss's Gibbon        59.6         59.6                        98.1         41
Macaca mulatta         Rhesus monkey         96.4         54.2          42.2          97.6         171
Macaca nemestrina      Pigtailed macaque     59.8         59.8                        97.8         112
Papio hamadryas        Hamadryas baboon      44.5         44.5                        97.2         35
Ateles geoffroyi       Spider monkey         50.3         50.3                        97.1         112
Lagothrix lagotricha   Woolly monkey         50.6         50.6                        96.4         107
Saguinus labiatus      Red‐bellied tamarin   48.5         48.5                        96.1         125
Lemur catta            Ring‐tailed lemur     36.7         36.7                        95.3         43
Mus musculus           Mouse                 97.6*                      97.6          85.2         132
Rattus norvegicus      Rat                   94.6*                      94.6          85.3         108

ᵃPercent coverage is total percent of the human sequence spanned by high‐quality data from the species.
ᵇPercent lab coverage is the percentage of the human sequence spanned by high‐quality sequence generated for this study.
ᶜPercent identity is the average nucleotide identity of the protein‐coding alignment used for ML analysis.
*Includes only genes for which unambiguous orthologs could be assigned.


Phylogenetic analysis: Codeml from the PAML package (v3.14b, YANG 1997b)

was run on an Apple OS10.5 computer cluster and results were stored in a custom‐built

MySQL database. The codeml models used and the tests performed are summarized in

Tables 3 and 4.


Table 3: Evolutionary models and parameter sets for codeml analysis

Feature/Purpose‡                                                Formal Nameᵃ    Name in Text
One ω for all branches & sites                                  M0
Nearly neutral: 2 site categories; 0<ω<1, ω=1                   M1a
ω allowed to differ on the human branch                         MH
ω fixed at 1 on the human branch                                MH‐null
Branch‐site: sites on the human branch allowed to differ        MA
Branch‐site null: sites on the human branch fixed at ω=1        MA‐null
Positive selection at sites on all lineages                     M8
Neutral evolution at sites on all lineages                      M8a

Likelihood ratio tests of significance                          Compare         Name in Text
Positive selection on branch (ωH>ωO)                            MH, M0          Branch test
Positive selection on branch (ωH>1 and ωH>ωO)                   MH, MH‐null     Strict branch test
Branch‐site test 1 (Positive selection or relaxed constraint)   MA, M1aᵇ        Relaxed branch+site test
Branch‐site test 2 (Strict test of positive selection)          MA, MA‐nullᶜ    Strict branch+site test

‡ω = dN/dS.
ᵃModels are as specified in (YANG 1997).
ᵇThis test is equivalent to Test 1 in (YANG and NIELSEN 1997a).
ᶜThis is equivalent to Test 2 in (YANG and NIELSEN 1997a).


Table 4: Branch+site test site classesᵃ

Site class   Proportion                   Background    Foreground    Foreground    Site class p for
                                                        (MA)          (MA‐null)     simulations
0            p0                           0 < ω0 < 1    0 < ω0 < 1    0 < ω0 < 1    0.6
1            p1                           ω1 = 1        ω1 = 1        ω1 = 1        0.1
2a           (1 ‐ p0 ‐ p1) p0 / (p0 + p1)   0 < ω0 < 1    ω2 > 1        ω2 = 1        0.26
2b           (1 ‐ p0 ‐ p1) p1 / (p0 + p1)   ω1 = 1        ω2 > 1        ω2 = 1        0.04

ᵃModified from (ZHANG et al. 2005); ω=dN/dS
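As a worked check of the proportions in Table 4, the following sketch computes the four site‐class proportions of the branch‐site model from p0 and p1, using the formulas of Zhang et al. (2005). The function name is hypothetical and the code is only an illustration of the arithmetic.

```python
def site_class_proportions(p0: float, p1: float) -> dict:
    """Proportions of site classes 0, 1, 2a and 2b in branch-site model A.

    Classes 2a and 2b split the remaining mass (1 - p0 - p1) in the
    same ratio as p0 : p1.
    """
    if p0 < 0 or p1 < 0 or p0 + p1 <= 0 or p0 + p1 > 1:
        raise ValueError("p0 and p1 must be non-negative with 0 < p0 + p1 <= 1")
    rest = 1.0 - p0 - p1
    return {
        "0": p0,
        "1": p1,
        "2a": rest * p0 / (p0 + p1),
        "2b": rest * p1 / (p0 + p1),
    }
```

With p0 = 0.6 and p1 = 0.1, classes 2a and 2b come out at about 0.257 and 0.043, which round to the 0.26 and 0.04 used for the simulations.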

Each gene was first tested by asking whether a model in which the human lineage was allowed to have a different dN/dS ratio than the rest of the tree was a better fit to the data than a model with a single dN/dS ratio throughout the tree

(“branch test”). Whether the human‐specific dN/dS ratio was significantly greater than 1 was tested by comparison to a model with the dN/dS ratio fixed at 1 on the human branch (“strict branch test”). The optimized branch‐site method of Yang and Nielsen

(YANG and NIELSEN 2002a; YANG et al. 2005a; ZHANG et al. 2005) was used with nested null models to test whether positive selection might be occurring at a subset of sites

(codons) on the human lineage by comparison of results from model A vs. model A‐null

(“strict branch+site test”) and model A vs. model 1a (“relaxed branch+site test”). The


strict branch+site test requires dN/dS>1 at a subset of sites, while the relaxed branch+site test requires only that a subset of sites on the human lineage have a significantly elevated dN/dS ratio compared to those sites on the remainder of the tree and can thus be significant in the presence of either positive selection or a relaxation of

selective constraint (ZHANG et al. 2005). A consistent species tree was used for each analysis (Fig 2). To estimate significance in the form of a p‐value, a likelihood ratio test

(LRT) was used. Two times the difference in log likelihood values between two models was compared to a chi‐square distribution with degrees of freedom equal to the

difference in number of parameters for the two models being compared. All results with p<0.05 in the LRT were repeated to ensure that the result was robust to variation in the initial ML conditions or failure of the ML estimation to converge to the global optimum. A complete listing of genes with selected results is given in the Appendix

(Table A1). All primary sequence data has been submitted to GenBank; alignments and

codeml results are available at http://mendel.gene.cwru.edu/adamslab/pbrowser.py.
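The likelihood ratio test described above can be sketched in a few lines. This is an illustration only: it uses closed‐form chi‐square tail probabilities for one and two degrees of freedom, which are assumed here to be the relevant cases (one for model A vs. model A‐null, two for model A vs. model 1a). The function names are hypothetical, and the log likelihoods in the usage note are made‐up inputs, not codeml output.

```python
import math


def chi2_sf(x: float, df: int) -> float:
    """Upper-tail probability of a chi-square distribution (df = 1 or 2)."""
    if x < 0:
        raise ValueError("the statistic must be non-negative")
    if df == 1:
        return math.erfc(math.sqrt(x / 2.0))
    if df == 2:
        return math.exp(-x / 2.0)
    raise ValueError("only df = 1 or 2 are handled in this sketch")


def lrt_p_value(lnl_alt: float, lnl_null: float, df: int):
    """Return (LRS, p) where LRS = 2 * (lnL_alt - lnL_null)."""
    # Small negative values can arise from incomplete optimisation;
    # they are treated as zero here.
    lrs = max(0.0, 2.0 * (lnl_alt - lnl_null))
    return lrs, chi2_sf(lrs, df)
```

For example, `lrt_p_value(-1000.0, -1002.31, 1)` gives an LRS of about 4.62 and a p‐value of about 0.032.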

Simulations testing the performance of codeml: The sensitivity of the strict branch+site test to branch length, proportion of sites within the positively selected site classes, background dN/dS, and the magnitude of the foreground dN/dS value were examined by using the NSbranchsites function in the evolver program contained in the

PAML software package to create simulated sequences under a specified evolutionary

model. Codon frequencies were derived from a set of 11,485 human genes with high


quality orthologs in several mammalian species (NICKEL et al. 2008a). These genes were divided into quartiles of G+C content and codon frequencies derived for each quartile.

The four quartiles were: <45.2%, 45.2‐52.19%, 52.2‐58.79%, and >=58.8%. It has recently been shown that variations in substitution pattern as a result of variance in the recombination rate can essentially be accounted for by G+C content (BULLAUGHEY et al.

2008). The specifications of each simulation can be found in the Appendix (Table A2).

Branch lengths were calculated by concatenating the alignments for the 175 transcription factors and evaluating the alignment using DNAml of the PHYLIP program package (FELSENSTEIN 1993). Evolver was used to create alignment sets of coding sequences with 400 codons from each G+C quartile separately. 1,000 simulated sequence sets were generated for models of neutral evolution, while 200 simulated sequences were created for models reflecting positive selection along the human lineage. The resulting multispecies sequence alignments were analyzed using codeml with model A and model A‐null.

Empirical tests of positive selection using sequences simulated under a model of neutral evolution: An alternative null model was developed to assess the likelihood of positive selection by comparing the actual results from the strict branch+site test

(codeml model A vs. model Anull) to results obtained using sequences simulated under a model of neutral evolution. For each gene tested, the sequence of the human‐chimpanzee ancestor was inferred using the RateAncestor function of codeml. More than 99% of the codons were inferred with at least 95% probability, according to codeml (Fig 1).

[Figure 1 plot. x‐axis: Probability (>0.5, >0.8, >0.9, >0.95, >0.99, 1); y‐axis: % of Codons (0 to 100); series: 176TF %AAchange and 176TF %Allchanges.]

Figure 1: Probability of accurate ancestral sequence reconstruction. Accuracy of amino acid changing substitutions in the 176 transcription factor set, as well as all substitutions, synonymous or nonsynonymous.

These ancestral sequences were input as the root sequence in evolver, and at least 500 sequences were simulated under a model of neutral evolution (foreground dN/dS=1 in site classes 2a and 2b). The background branch lengths were calculated as described above, and the human branch length used was that reported by codeml using the branch model H. The proportion of sites within site classes 0, 1, 2a and 2b was held

constant at 0.6, 0.1, 0.26, 0.04, respectively, based on averages among genes with dN/dS>1 (see Table 4). The transition/transversion ratio, κ, was also held constant at 2.6. The average κ for the 175 gene set was 4.1±1.5. A higher κ means more transitions, and therefore more synonymous changes; the smaller κ used here is therefore more conservative. Codon substitution frequencies were derived from the appropriate G+C quartile for the specific gene.

Gene‐specific empirical p‐values were calculated by running codeml with model

A and model A‐null on each of the simulated neutral human sequences using the original primate+rodent multiple sequence alignments for each gene and calculating the

likelihood ratio statistic (LRS). The empirical p‐value is defined as the fraction of sequences simulated under a model of neutral evolution that had a LRS greater than or equal to the LRS obtained with the actual data. The gene‐specific significance threshold is based on the empirical distribution of LRS values, using a parametric bootstrap to specify a significance threshold of α = 5% (AAGAARD and PHILLIPS 2005; GOLDMAN 1993).
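The definition of the empirical p‐value given above translates directly into code. The sketch below is illustrative; the function name and the example values are hypothetical, and the neutral LRS values would in practice come from codeml runs on the simulated sequences.

```python
def empirical_p_value(observed_lrs: float, neutral_lrs: list) -> float:
    """Fraction of neutral simulations with an LRS >= the observed LRS."""
    if not neutral_lrs:
        raise ValueError("at least one simulated LRS value is required")
    hits = sum(1 for value in neutral_lrs if value >= observed_lrs)
    return hits / len(neutral_lrs)
```

Because the p‐value is a simple fraction, its resolution is limited by the number of simulations: with 500 neutral replicates the smallest nonzero value that can be observed is 1/500.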

RESULTS

Sequencing of transcription factor genes and tests of selection

We sought to identify evidence of adaptive protein evolution specific to the human lineage with the hypothesis that genes under positive selection have contributed

to the phenotypic changes that occurred since the most recent common ancestor of humans and chimpanzees. A total of 175 transcription factor genes with prior predictions of positive selection (BUSTAMANTE et al. 2005a; CLARK et al. 2003a; CSSAC, 2005a; NIELSEN et al. 2005a) or an excess of nonsynonymous substitution in human relative to chimpanzee were selected for phylogenetic sequencing. Exonic sequence data was obtained from human, chimpanzee, bonobo, gorilla, orangutan, gibbon, three Old World monkeys, three New World monkeys, and lemur (Fig. 2). Primary sequence read data from chimpanzee, macaque, and orangutan were obtained from the NCBI Trace

Repository and supplemented the alignments where possible, and annotated mouse

and rat orthologs (http://www.informatics.jax.org) were added to the alignments (Table

2). Coverage is excellent for the great apes, rodents, and rhesus macaque, but lower for the monkeys, reflecting a lower rate of PCR success in more divergent species.


Figure 2: Phylogenetic tree of the primate species used in phylogenetic analyses. Primates were chosen so as to include a representative from all the major subdivisions of primates. These include Parvorder Platyrrhini (New World monkeys: red‐bellied tamarin, spider monkey, woolly monkey), Parvorder Catarrhini (Old World monkeys: pig‐tailed macaque, rhesus macaque), Family Hylobatidae (lesser apes: gibbon), and Family Hominidae (great apes: orangutan, gorilla, chimpanzee, bonobo, and human). Suborder Strepsirrhini (prosimians: lemur) is not depicted in the tree. Mouse and rat were used as an outgroup (not shown). Branch lengths were calculated from concatenated alignments of the 175 transcription factor genes using the DNAML program in PHYLIP (FELSENSTEIN 1993). *Branch length < 0.0015.

Multi‐species alignments were constructed of the protein‐coding DNA sequence of each gene and the codon substitution pattern in the alignments was evaluated under several evolutionary models as implemented in the codeml program from the PAML


package (YANG 1997a). In each case, results from a null model were compared with results from a test model. The null models assumed neutrality (dN = dS) or negative selection (dN/dS <1) and the test models assumed dN/dS >1 on the human lineage

(strict branch test) or at a subset of codon sites in the gene along the human branch

(strict branch+site test). A list of the models and tests of selection is presented in Tables

3 and 4.

Only three genes, ISGF3G, SOHLH2, and ZFP37, have p <0.05 in the strict branch+site test (Table 5) and contain amino acid sites with a Bayes Empirical Bayes posterior probability of positive selection > 95%. Seventeen genes have p <0.05 in the relaxed

branch+site test alone, indicating that they may have experienced a relaxation of selective constraint or simply show weak signals of positive selection (Table A1). None of the nominal p‐values survive Bonferroni correction, and a false discovery rate analysis

(BENJAMINI and HOCHBERG 1995) also predicts that the likelihood of observing true positives is small.
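The two multiple‐testing corrections mentioned above can be sketched as follows. This is a generic illustration of the Bonferroni and Benjamini‐Hochberg adjustments, not the analysis code used in this study; the function names are hypothetical. Assuming one test per gene for the 175 genes, the Bonferroni threshold would be 0.05/175, or about 0.0003, well below the smallest nominal p‐value in Table 5 (0.013).

```python
def bonferroni(pvals: list) -> list:
    """Bonferroni-adjusted p-values: p * m, capped at 1."""
    m = len(pvals)
    return [min(1.0, p * m) for p in pvals]


def benjamini_hochberg(pvals: list) -> list:
    """Benjamini-Hochberg adjusted p-values, returned in input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that adjusted values
    # are monotone in the rank of the raw p-value.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

A gene would be retained at a false discovery rate of 5% only if its adjusted value from `benjamini_hochberg` were at or below 0.05.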


Table 5. codeml results for transcription factor genes with significant results in the strict branch+site test of positive selection

          Branch+Site Tests                      Branch Tests                            dN/dS             Empirical Test
          Strict             Relaxed             Strict              Relaxed
          (MA vs. MAnull)    (MA vs. M1a)        (MH vs. MH‐null)    (MH vs. M0)
Gene      LRS‡    p          LRS     p           LRS     p           LRS     p           Human    Other    p
ZFP37     4.62    0.032      8.88    0.012       4.70    0.030       14.4    <0.001      999†     0.24     <0.001
ISGF3G    6.24    0.013      6.50    0.039       0.18    0.67        0.48    0.48        0.69     0.38     <0.001
SOHLH2    5.16    0.023      9.08    0.011       5.18    0.023       9.62    0.002       999      0.48     <0.001

†Values of 999 for dN/dS indicate dS=0, so dN/dS is undefined. Values in bold reflect p<0.05.
‡LRS = 2*(lnL1‐lnL2)

Effect of phylogenetic breadth on prediction of positive selection

Overall phylogenetic breadth is expected to contribute to the accuracy of a prediction of positive selection (ANISIMOVA et al. 2001). Intuitively, interpretation of the significance of lineage‐specific substitutions becomes more straightforward as the number and divergence of species increases, resulting in a more accurate reconstruction of the ancestral sequence and therefore of the direction of selection. However, data collection and analysis of many species can be costly and time consuming. We therefore decided to investigate how well a minimal phylogenetic set performs at prediction of positive selection. There are four well‐annotated and publicly available mammalian genomes that are ideal for this analysis: human, chimpanzee, rhesus macaque, and mouse as an outgroup. Genes were simulated under a model of positive selection with ω = 2 and ω = 10 for the full primate+rodent set and evaluated using codeml. The tree was then trimmed to contain only the four aforementioned mammalian species and evaluated again using codeml. The results from the minimal alignment set were then compared to those obtained from the full primate+rodent alignments.

Remarkably, the reduced alignment set is reasonably effective in identifying the set of genes that are predicted to be positively selected when the full primate+mouse alignment is used (Fig. 3). In each case, however, a small number of simulated genes had discordant results, which is likely the effect of the use of a too distant outgroup that has been unable to accurately resolve the direction of selection. In general, sensitivity

is higher than specificity, reflecting the fact that the omission of species tended to result in false positives, rather than false negatives.

[Figure 3 plot. x‐axis: Full alignment p‐value (MA v. MAnull), ω = 2; y‐axis: Minimal alignment p‐value (MA v. MAnull), ω = 2; both axes range from 0 to 1.]

Figure 3: Comparison of the results for predictions of positive selection simulated using codeml for the full primate+rodent alignment set to the minimal alignment set containing only human, chimpanzee, rhesus macaque, and mouse. Red data points indicate discrepant results between the two alignment sets.

Taken together, these results suggest that increased phylogenetic breadth cannot overcome the limitations inherent in comparing closely related species. On the other hand, only minimal phylogenetic diversity is necessary to maximize the power that does exist to detect positive selection.


Simulations to assess the sensitivity of the strict branch+site test

The failure to observe significant positive selection was surprising, given the method for choosing genes and the fact that more than one‐third have a dN/dS ratio >1 on the human branch. This prompted us to ask whether the genes we chose were truly not positively selected or whether the ML method lacks power to detect positive selection for these closely related species. We therefore embarked on a series of simulations to study the performance of codeml using branch lengths and substitution patterns that are typical of the divergence time of humans from non‐human primates.

In presenting the strict branch+site test, Zhang et al. (ZHANG et al. 2005) performed an extensive series of simulations of codeml that demonstrated a low false‐positive rate in the absence of dN>dS and a reasonable sensitivity to detect positive selection across a range of branch lengths and dN/dS ratios. However, they primarily focused on selection acting at an internal branch (thereby improving detection of selected sites) and on trees with branch lengths longer than those encountered in human genes.

We used evolver, which generates alignment sets that correspond to a user‐defined evolutionary model, to create simulated primate gene alignments for use in evaluating the performance of the strict branch+site test of codeml. Several scenarios were examined by varying parameters that reflect the range of branch lengths, dN/dS ratios and codon frequencies found in human genes. These included 'gold standard' sets simulating positive selection with dN/dS>1 at a subset of codons on the human lineage in each simulated sequence and neutrally evolving control sets with dN/dS=1.

The complete list of tests, parameters, and results can be found in Table A2.

The branch length of the human (foreground) lineage had the greatest influence on the power to detect positive selection, which falls as branch length decreases

(Figures 4, 5, and Table A2). Due to the recent divergence of humans and chimpanzees, human‐specific branch lengths are quite short (Fig 5). Under a scenario of positive selection, with dN/dS=10 on the foreground (human) lineage and dN/dS<=1 on the other lineages, only a small fraction of simulated sequences had a p‐value <0.05 in the strict branch+site test (Fig 4A). Across most of the range of observed human branch lengths, few sequences simulated under a model of positive selection met the threshold for significance in the strict branch+site test (Fig 5). For the median branch length of human genes with substitutions (0.007), less than 10% of sequences simulated under the model of positive selection are correctly classified. At longer branch lengths, more simulated positively selected sequences are correctly classified; however, <1% of human genes have branch lengths >0.035, and positive selection is not completely predicted even at this relatively long branch length.

Figure 4: Distribution of likelihood ratio statistic (LRS) values from the strict branch+site test of simulated gene sets. Sequences were simulated with different branch lengths for the human branch under models of positive selection with foreground dN/dS = 10 (A) and under models of neutral evolution with foreground dN/dS = 1 (B,C). In (A), the LRS corresponding to p=0.05 is shown. The expected distribution of LRS values according to the chi‐square distribution with one degree of freedom (χ²₁, red diamonds) and a 50:50 mixture distribution of the χ²₁ distribution and a point mass of zero (green circles) is shown in (C) with a subset of the distribution of LRS values from neutral simulations from (B). The red line marks the 5% empirical significance threshold.

Figure 5: Prediction of positive selection across a range of branch lengths representative of human genes. The branch lengths of 175 transcription factor genes are shown (columns). The secondary axis represents the percentage of alignment sets simulated with a model of positive selection (foreground dN/dS=10) that were correctly classified as positively selected in the strict branch+site test (triangles).

The power to predict positive selection is also related to the magnitude of dN/dS

(Fig. 6a). With weak positive selection (e.g. dN/dS = 2), there is little power to detect positive selection regardless of branch length, while at longer branch lengths, larger dN/dS results in a higher fraction of simulated sequences that meet the threshold for significance in the strict branch+site test, although this is insufficient to overcome the limitations of shorter branch lengths. Of much less influence is the dN/dS value of the background lineages (Figure 6B). The proportion of significant alignment sets remains consistent over a variety of background dN/dS values, ranging from low (dN/dS = 0.01)

to nearly neutral (dN/dS = 0.75). The proportion of foreground (positively selected) sites also has only modest impact on predictions of positive selection (Table A2).

Figure 6: Contribution of dN/dS values to predictions of positive selection. (A) The ability to detect positive selection using a range of dN/dS values for the foreground (human) lineage for different branch lengths was evaluated. (B) The effect of the dN/dS on the background lineages was assessed under models of positive selection (dN/dS =10; black bars) and neutral evolution (dN/dS=1; gray bars). The human branch length was 0.01.

We also examined the performance of evolver‐generated sequences with dN/dS=1 on the foreground lineage to assess how well these sequences match the neutral expectation. The fraction of simulated sequences with p<0.05 in the strict branch+site test was ≤3% across a range of branch lengths (Fig 4B), demonstrating that under a neutral model, there are few false predictions of positive selection, confirming results of Zhang et al (ZHANG et al. 2005).

In the strict branch+site test, the LRS is compared to a chi‐square distribution with one degree of freedom (χ²₁) to obtain a p‐value. In principle, since we are only testing deviation from neutrality in one direction, the appropriate comparison is to a 50:50 mixture of χ²₁ and a point mass at 0 (CLARK and SWANSON 2005; SELF and LIANG 1987), and the distribution of LRS values with the neutral models fits the 50:50 mixture distribution more closely (Fig 4C), suggesting that evolver‐simulated sequences do in fact represent a reasonable approximation of the null distribution.
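The difference between the two reference distributions can be made concrete with a short sketch. Under the 50:50 mixture, half of the null probability sits at an LRS of exactly zero, so the tail probability for any positive LRS is half that of the chi‐square distribution with one degree of freedom. The function names are hypothetical, and the code is an illustration rather than part of the analysis pipeline.

```python
import math


def chi2_1_p_value(lrs: float) -> float:
    """Upper-tail probability of chi-square with one degree of freedom."""
    return math.erfc(math.sqrt(max(lrs, 0.0) / 2.0))


def mixture_p_value(lrs: float) -> float:
    """Tail probability under a 50:50 mixture of a point mass at 0
    and chi-square with one degree of freedom."""
    if lrs <= 0.0:
        return 1.0  # every null outcome is at least this extreme
    return 0.5 * chi2_1_p_value(lrs)
```

The 5% critical value therefore drops from about 3.84 under the chi‐square distribution to about 2.71 under the mixture.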

Alternative null model using an empirical test

The simulations showed that for any given set of parameters, increasing dN/dS

above 1 resulted in a shift in the distribution of LRS values in the comparison of model A and model A‐null (Fig. 4). As an alternative to determining significance using a likelihood ratio test as in the strict branch+site test, we evaluated use of the distribution of LRS values from simulations with dN/dS=1 on the foreground branch to define an empirical threshold of significance. In this method, the 5% significance threshold is defined as the


LRS value such that 5% of the simulated neutral sequences have a LRS value above the

threshold. In evaluating sequences simulated under a model of positive selection, if the

LRS exceeds the 5% threshold defined in the matched neutral simulation, we would consider that there is less than a 5% probability that the pattern of selection can be

accounted for by the neutral model. As shown in Fig 7, more sequences simulated

under models of positive selection were correctly classified when using the empirically defined 5% significance threshold, compared to the LRT. The increase in sensitivity was particularly evident at shorter branch lengths.
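A minimal sketch of the empirically defined threshold follows. It takes the LRS values from a matched neutral simulation, finds the smallest simulated value that no more than 5% of the neutral values exceed, and then reports the fraction of sequences simulated under positive selection whose LRS exceeds it. The function names are hypothetical and tie handling is an assumption of this sketch.

```python
from bisect import bisect_right


def empirical_threshold(neutral_lrs: list, alpha: float = 0.05) -> float:
    """Smallest simulated LRS exceeded by at most alpha of neutral values."""
    ordered = sorted(neutral_lrs)
    n = len(ordered)
    if n == 0:
        raise ValueError("at least one neutral LRS value is required")
    for value in ordered:
        n_above = n - bisect_right(ordered, value)
        if n_above <= alpha * n:
            return value
    return ordered[-1]  # not reached: nothing exceeds the maximum


def fraction_detected(selected_lrs: list, threshold: float) -> float:
    """Fraction of positively selected simulations with LRS > threshold."""
    if not selected_lrs:
        raise ValueError("at least one LRS value is required")
    return sum(1 for v in selected_lrs if v > threshold) / len(selected_lrs)
```

Because the threshold is taken from simulations matched for branch length, it adapts to short branches, where the neutral LRS distribution is concentrated near zero.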


Figure 7: Comparison of the strict branch+site test, 50:50 mixture test, and empirical test on simulated sequences. 215 sequences were simulated under a model of positive selection using a dN/dS value for the foreground lineage of 2 (yellow, orange, and red bars) or 10 (green, blue, and black bars). The percent correctly classified was calculated using three different methods of assessing significance: 1) the strict branch+site test, which uses a χ²₁ distribution, 2) a 50:50 mixture of a point mass of 0 and the χ²₁ distribution, and 3) the empirical test. For the empirical test, the threshold for each branch length was defined based on the distribution of LRS values for the paired neutral simulation (foreground dN/dS=1).

We then developed an approach, based on the empirical distribution of LRS values of neutrally evolving sequences, to defining the probability of positive selection for real human genes. We used evolver to generate neutrally evolved sequences along the human branch based on the inferred human‐chimpanzee ancestor for each tested gene (see Methods). Each neutrally simulated 'human' sequence was then analyzed by

codeml using model A and model A‐null in the context of the primate plus rodent

alignment. In this method, an empirically determined significance threshold (at α = 0.05) was used; if the LRS of the real alignment was greater than that for at least 95% of the neutrally evolved sequences, positive selection was inferred.

Empirical p‐values were obtained for each gene that had a p‐value <0.5 in the strict branch+site test, several with p>0.5 (Table 6), and two additional control genes

(see below). Figure 8 is a comparison of the nominal p‐values from the strict branch+site test with p‐values obtained from the empirical test using the alternative null model. Eight genes had p <0.05 in the empirical test: CDX4, DMRT3, ELK4, PGR, RBM9,

SOHLH2, TAF11, ZFP37. The control gene ASPM was also significant in the empirical test. These genes have a nonsynonymous substitution rate greater than can be accounted for based on a model of neutral evolution. There is only a modest overlap

(Table 6) in genes with p<0.05 in the empirical test and p<0.05 in the relaxed

branch+site test of selection, indicating that the empirical test is not simply identifying genes undergoing a relaxation of selective constraint.

Table 6: Comparison of the results from the strict branch+site and empirical tests

Gene      Length     N*    S*    Background   Foreground      Relaxed       Strict        Empirical
          (codons)               dN/dS        (Human) dN/dS   branch+site   branch+site   p
                                                              test p        test p
ASPM      3477       17    7     0.109        5.455           0.553         0.473         0.027
BLZF1     400        3     0     0.146        999†            0.084         0.260         0.105
CDX1      265        2     0     0.104        999             0.292         0.307         0.125
CDX4      284        4     0     0.358        999             0.527         0.286         0.031
CITED2    270        2     0     0.038        999             0.028         0.426         0.124
DMRT3     472        11    3     0.084        0.859           0.005         0.489         0.037
EHF       300        1     0     0.048        313             0.139         0.471         0.102
ELK4      431        2     0     0.270        999             0.092         0.212         0.042
FOXI1     378        5     3     0.083        0.332           0.481         0.811         0.277
FOXJ2     574        2     0     0.099        75.8            0.025         0.239         0.085
GBX2      348        1     0     0.014        999             0.050         0.531         0.255
HESX1     185        1     0     0.269        999             0.186         0.429         0.180
HOXA5     270        2     1     0.029        0.445           0.095         1.000         0.833
HOXD4     255        2     0     0.082        999             0.021         0.288         0.076
KLF10     390        2     0     0.164        999             0.039         0.229         0.090
MCPH1     740        9     4     0.112        1.000           0.315         1.000         0.748
MEOX1     254        2     0     0.127        999             0.075         0.298         0.185
NR2C1     603        4     0     0.089        999             0.008         0.192         0.309
NR3C2     984        5     1     0.102        0.770           0.026         0.318         0.069
PEG3      1590       8     3     0.232        0.944           0.067         0.803         0.177
PGR       933        9     4     0.197        0.893           0.072         0.469         0.029
RBM9      367        2     0     0.400        999             0.410         0.184         0.041
SCML1     208        2     0     1.279        999             0.446         0.231         0.075
SOHLH2    425        10    0     0.484        999             0.011         0.023         <0.001
T         435        2     0     0.052        999             0.068         0.318         0.124
TAF11     211        2     0     0.138        999             0.031         0.234         0.034
ZFP37     630        9     0     0.237        999             0.011         0.032         <0.001
ZNF192    578        4     1     0.210        1.605           0.270         0.411         0.101

*Substitutions between the inferred human‐chimpanzee ancestor and the human sequence. N=nonsynonymous, S=synonymous.
†Values of 999 for dN/dS indicate dS=0, so dN/dS is undefined. Values in bold reflect p<0.05.

ISGF3G was significant in the strict branch+site test, but not in the empirical test.

This gene has a rare 'double‐hit' codon substitution in which two of the three nucleotides at one codon have changed since the human‐chimpanzee ancestor. An ancestral serine codon (TCA) is changed to valine (GTA) at position 129 in human.

Apparently codeml weights this highly unlikely substitution quite strongly. Introducing a

single double‐hit substitution into other genes resulted in a shift to p<0.05 in the strict branch+site test (data not shown). The selection status of ISGF3G is thus unclear.
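The 'double‐hit' notion above can be made concrete with a small sketch. The helper names below are hypothetical and are not part of codeml; the code only counts nucleotide differences between an ancestral and a derived codon, as in the TCA to GTA change described for ISGF3G.

```python
def nucleotide_differences(ancestral: str, derived: str) -> int:
    """Count the positions at which two codons differ."""
    if len(ancestral) != 3 or len(derived) != 3:
        raise ValueError("codons must be exactly three nucleotides long")
    return sum(a != d for a, d in zip(ancestral.upper(), derived.upper()))


def is_double_hit(ancestral: str, derived: str) -> bool:
    """Return True when exactly two of the three codon positions have changed."""
    return nucleotide_differences(ancestral, derived) == 2


# The ISGF3G codon 129 change described in the text: TCA (Ser) to GTA (Val).
print(is_double_hit("TCA", "GTA"))
```

Screening an alignment for such codons before running the strict branch+site test would flag genes whose result may hinge on a single unusual substitution.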


[Figure 8 plot: empirical p‐value (y‐axis, 0 to 1) versus nominal p‐value from the strict branch+site test (x‐axis, 0 to 1).]

Figure 8. Comparison of the strict branch+site test and empirical test on human genes. An empirical probability of positive selection was determined for 26 human transcription factor genes based on codeml analysis of at least 500 alignments with the human sequence substituted with a simulated sequence derived from the human‐chimpanzee ancestor using a model of neutral evolution. The empirical probability represents the proportion of simulated neutral sequences that gave an LRS greater than that observed for the real gene in the strict branch+site test.

To guard against the possibility that some genes with low p values scored well due to the presence of rare nonsynonymous polymorphisms in the human sequence used for the alignment, the single nucleotide polymorphism (SNP) content of each was examined in dbSNP and by comparison to sequences in dbEST. CDX4 and ZFP37 were also examined by complete exon sequencing in twelve humans. In ZFP37, one high frequency nonsynonymous polymorphism was found at codon 7 (rs2282076). Use of the ancestral allele at this position had a negligible impact on the empirical test. A nonsynonymous substitution in BNC1 at residue 274 (rs1130108) was determined to be

a rare polymorphism. Use of the major allele at this position resulted in loss of significance; results presented here are for the sequence with the major allele.

Comparison with previous predictions of positive selection

Three genes that have been previously suggested to demonstrate positive selection in humans were also examined: FOXP2 (ENARD et al. 2002b), ASPM (EVANS et al.

2004b), and MCPH1 (EVANS et al. 2004a). FOXP2 is significant in the relaxed branch+site test, but not in the strict branch+site test, in accord with the results of Enard et al., whereas neither ASPM nor MCPH1 reached significance in either traditional test of human‐specific positive selection. This agrees with the results of several other studies indicating that both proteins experienced rapid adaptive evolution along both the

earlier hominoid and later hominid lineages, preceding the human‐chimpanzee split

(EVANS et al. 2004a; EVANS et al. 2004b; KOUPRINA et al. 2004; WANG and SU 2004b). In the

empirical tests, only ASPM reached significance.

Among the 175 genes analyzed by phylogenetic sequencing here, 73 overlap with genes analyzed by Clark et al. (CLARK et al. 2003a) using alignments of human, chimpanzee, and mouse only; 27 overlap with Nielsen et al. (NIELSEN 2005) using only human and chimpanzee; and 57 overlap with Bustamante et al. (BUSTAMANTE et al.

2005b) who used a modified version of the McDonald‐Kreitman test for positive selection (Table 7). Only one (SCML1) of the 27 overlapping genes in the Nielsen set was predicted to be positively selected, and this gene did not have a significant p‐value


(p<0.05) in our study. Fifteen of the overlapping genes were predicted to be positively

selected in the Clark et al. study (p≤0.05 using the M2 method). Of these 15, none had nominally significant p‐values in the strict branch+site test using the more extensive phylogenetic sequence set described here and one (CDX4) had an empirical p‐value

<0.05. Several factors may explain the higher number of predictions in the Clark analysis including a lack of accuracy in reconstruction of the ancestral state in alignments involving only human, chimpanzee, and mouse sequences and differences in the details of the ML methods used. There is a better concordance between M2 p‐

values and p‐values from the relaxed branch+site test, suggesting that improvements in codeml and use of the more stringent test are primarily responsible for fewer predictions of positive selection here. Only 3 of the 57 overlapping genes in the

Bustamante study were predicted to be positively selected. Two of these genes were not significant in either the strict branch+site test or the empirical test. The third gene,

CDX4, retained its significance in the empirical test. Bustamante used population genetic analysis to scan for selection in the human genome, which is a method that is particularly useful in detecting recent or population‐specific positive selection. The

methods implemented in phylogenetic analysis are more powerful at predicting very ancient positive selection, so the results of the two approaches are not directly comparable. However, analysis of a gene by multiple methods may lend credence to predictions of positive selection, as evidenced by CDX4, which was predicted in the Clark, Bustamante, and current studies.


Table 7. Comparison of predictions of positive selection by different methods

Gene Symbol | Nielsen p^a | Clark p^b | Bustamante Pr(g>0)^c | Strict branch+site p^d | Empirical p^d

APEX1 1 0.07 0.00084 1

BLZF1 0.03 0.72114 0.261651 0.105

CDX1 1.00 0.298698 0.125

CDX4 0.1534 0.00 0.9866 0.285652 0.031

CREM 0.328894 0.01 0.0936 1

DLX3 0.39 1

DLX4 1.00 1

DMBX1 0.46654 1

DSCR1 0.07 0.86238 0.841481

EHF 0.02 0.7256 0.470842 0.102

ELF5 1 0.08 0.94178 0.113846

ELK4 1.00 0.4664 0.211665 0.042

ETV3 0.0914 0.887537

FMNL2 0.01632 1

FOXN1 1 0.00 0.86434 1

GTF2H1 0.08568 0.41656

HESX1 0.328819 0.08 0.83504 0.423711 0.18

HIF1A 0.47 0.47222 1

HMG20A 1.00 0.8605 1

HOXA3 0.86064 0.583882

HOXA5 1 0.02 0.93888 1 0.833

HOXA7 0.71574 1


HOXB1 1 0.20 0.00292 0.431047

HOXB7 0.1375 0.708281

HOXC10 0.49 1

HOXD10 1.00 1

HOXD4 1 0.00 0.289918 0.076

HTATIP2 0.05 0.46962 1

IL10 1 1.00 0.86438 0.431047

IVNS1ABP 0.38008 1

KLF10 0.86504 0.22693

KLF3 0.46598 1

LHX3 0.42 1

MEF2B 1 0.85546 1

MEOX1 0.32892 0.10 0.96612 0.298698 0.185

MLL 1 0.25 0.9548 1

MYB 1 1.00 0.96654 0.423711

NFE2L2 1.00 0.00356 1

NFKBIA 1.00 1

NR1I3 1.00 1

NR3C1 1 0.34 0.98002 1

NR5A2 1.00 1

OASL 0.71928 1

PEG3 0.30512 1 0.177

POU2AF1 0.4622 0.396144

PQBP1 1.00 0.09568 1

PTTG1 1 0.28 0.86084 1


RBM9 1.00 0.86316 0.220671 0.041

REL 1 0.08 1

RELA 0.44 1

RING1 0.35 0.583882

RORC 0.46674 1

RREB1 1 0.0079 1

SALF 0.37394 1

SAP30 0.46 1

SCAND1 0.72238 1

SCML1 0.016125 0.99968 0.230139 0.075

SIM1 1 0.35 0.56168 1

SIX1 0.40 0.8608 1

SLC38A2 1 0.37664 1

SPIC 0.86086 1

SRF 1.00 0.624206

T 1 0.01 0.0059 0.327187 0.124

TAF11 0.45936 0.236724 0.034

TBX6 1.00 1

TCEAL1 1.00 0.09366 1

TFEC 1 0.29378 1

UBP1 0.326312 0.00 0.69572 0.887537

XBP1 1.00 1

ZFP37 0.177376 0.92056 0.033112 <0.001

ZHX1 1 0.07 0.72204 0.470842

ZNF154 0.0476 0.654721


ZNF174 1 0.33 0.8336 1

ZNF449 0.4615 1

ZNRD1 1.00 1

a (NIELSEN 2005); b (CLARK et al. 2003a); c (BUSTAMANTE et al. 2005a); d This study.

DISCUSSION

Inference of positive selection relies on observation of an excess rate of nonsynonymous codon substitutions relative to synonymous substitutions. In closely related species, with few total substitutions, these rates are low and the uncertainty in defining them is high, limiting the ability to confidently predict positive selection. We have shown that the strict branch+site test of codeml is largely unable to detect positive selection when it exists across a broad range of parameters of branch length and dN/dS ratios that are found in human genes in the context of alignments with sequences from several non‐human primates. The two factors with the largest effect were branch length and magnitude of dN/dS on the human branch. Predictions of selection were most efficient on genes with large dN/dS values and on genes with branch lengths >0.03

(Figure 2 and Table A2). The 175 transcription factor genes used in this study have a

median branch length of 0.007, which is well below the optimum for prediction of

positive selection. In fact, a branch length of >0.035 is necessary to predict positive selection in a 400 codon gene 50% of the time, and fewer than 1% of human genes match these criteria. The standard likelihood ratio test (strict branch+site test) used


with codeml model A and model A‐null therefore lacks the power to detect positive selection given the phylogenetic history of most human genes.

The tests for positive selection used here were designed to detect an alteration in evolutionary rate that is specific to the human lineage. An implicit assumption is that the rate is constant (or less variable) along other branches. The site models M8 and

M8a can be used to assess whether there are subsets of codons that are evolving rapidly across the tree (YANG et al. 2000). None of the eight genes with significant predictions of positive selection in the empirical test are significant with the site models. Codeml can also be used with a ‘free‐ratio’ model, in which dN/dS is estimated individually on each branch of the tree. This model is very parameter‐rich and cannot be effectively compared to a suitable null model, so branch‐specific rates must be considered approximations. Each branch with dN/dS >1 in the free‐ratio model was tested individually (Table 8), and no branches other than human exhibited a significant LRT in the branch or branch+site tests for the three genes in Table 8.

Table 8: Tests for positive selection on other primate branches that exhibited dN/dS > 1

Gene     Lineage      Free ratio ω   p (Model MA vs. MAnull)
ZFP37    Pan          97.009         0.998404232
ZFP37    Hominid      1.0667         0.253983351
ZFP37    Anthropoid   519.1324       0.926723952
SOHLH2   Pan          916.47         0.189448231
SOHLH2   Hominin      1.9679         0.189679233
ISGF3G   Macaque      3.3423         0.064771524


After collection of additional sequence data, 65 of the 175 human TF genes encoded proteins that are identical in chimpanzee, so sequence errors may have contributed to previous indications of positive selection. The importance of comparing accurate sequence is highlighted by the impact of single substitutions on the codeml results. A concern about accuracy was also highlighted by Bakewell et al. who used the

available base quality values to restrict analysis to the highest confidence regions of the

chimpanzee genome (BAKEWELL et al. 2007b). Two genes predicted to be positively selected were re‐evaluated by codeml after the addition of a single synonymous substitution. The presence of even a single synonymous substitution was enough to raise the p‐value to > 0.05. The removal of individual nonsynonymous substitutions had similar effects.

These small changes within the alignments resulted in a loss of significance in the strict branch+site test, but not in the empirical test (Table 9) indicating that this method can accommodate synonymous substitution, the presence of which is a more realistic evolutionary scenario.

Table 9: Effect of synonymous and nonsynonymous substitution on predictions of positive selection

Gene    Human      Codon /     N   S   Strict         Empirical
        Residue    Amino Acid          branch+site p  p

ZFP37   No Change              9   0   0.032          < 0.001
ZFP37   7          V→D         8   0   0.045          0.001
ZFP37   23         E→G         8   0   0.078          0.001
ZFP37   54         S→R         8   0   0.075          0.008
ZFP37   66         C→S         8   0   0.207          0.008
ZFP37   109        P→S         8   0   0.044          0.015
ZFP37   165        K→T         8   0   0.079          0.001
ZFP37   180        C→R         8   0   0.046          0.008
ZFP37   235        S→N         8   0   0.083          0.001
SOHLH2  No Change              10  0   0.023          < 0.001
SOHLH2  146        T→I         9   0   0.102          < 0.001
SOHLH2  178        S→P         9   0   0.102          < 0.001
SOHLH2  211        N→K         9   0   0.099          < 0.001
SOHLH2  260        V→I         9   0   0.101          < 0.001
SOHLH2  312        A→T         9   0   0.101          < 0.001
SOHLH2  339        A→T         9   0   0.103          < 0.001
SOHLH2  353        S→P         9   0   0.102          < 0.001
SOHLH2  393        M→V         9   0   0.100          < 0.001
SOHLH2  403        Y→H         9   0   0.101          < 0.001
SOHLH2  146        T→I         9   0   0.102          < 0.001
ZFP37   78         CCA→CCG     9   1   0.208          0.02
ZFP37   96         AAA→AAG     9   1   0.183          0.02
ZFP37   125        CTT→CTA     9   1   0.186          0.02
ZFP37   237        AGT→AGC     9   1   0.208          0.02
ZFP37   286        CCT→CCC     9   1   0.185          0.02
ZFP37   326        TGT→TGC     9   1   0.185          0.02
ZFP37   356        GCC→GCT     9   1   0.185          0.02
ZFP37   405        TAT→TAC     9   1   0.185          0.02
ZFP37   418        AAC→AAT     9   1   0.186          0.013
ZFP37   514        GGT→GGG     9   1   0.184          0.013
SOHLH2  26         GAC→GAT     10  1   0.202          0.003
SOHLH2  90         GGT→GGC     10  1   0.203          0.003
SOHLH2  94         AAC→AAT     10  1   0.202          0.003
SOHLH2  118        ATC→ATT     10  1   0.202          0.003
SOHLH2  127        GAG→GAA     10  1   0.202          0.003
SOHLH2  237        GAC→GAT     10  1   0.195          0.003
SOHLH2  288        GGT→GGC     10  1   0.201          0.003
SOHLH2  319        TGT→TGC     10  1   0.203          0.003
SOHLH2  356        GCC→GCG     10  1   0.204          0.003
SOHLH2  373        CCC→CCT     10  1   0.199          0.003

The empirical test of selection described here is an alternative null test that seems to have greater sensitivity to detect positive selection than the strict branch+site test at short branch lengths. The test essentially assesses whether the observed likelihood of positive selection can be accounted for by a neutral model of evolution. By modeling neutral evolution specifically on the human branch, while keeping the rest of

the tree constant, the test allows us to assess the potential for positive selection on a gene‐specific basis. This new method may therefore serve as a useful adjunct to the phylogenetic methods employed in codeml. Evolver‐generated neutral sequences seem to perform quite consistently across a range of branch lengths, codon frequencies, and proportion of selected sites (Fig 5, Table A2). Nonetheless, the neutral model cannot


fully describe the true evolutionary history and so caution is advised in interpretation of

results of the empirical test as for any computational inference of positive selection.

The empirical test is also impractical for large‐scale use due to the computational cost:

codeml analysis for 1,000 simulated ZFP37 sequences required more than 700 CPU‐hours.
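As a rough illustration of the empirical test, the following sketch computes the proportion of simulated neutral replicates whose likelihood ratio statistic (LRS) exceeds the observed one. The function name and the example values are hypothetical; in practice the statistics would come from codeml runs on the real alignment and on the Evolver‐simulated alignments, which are not reproduced here.

```python
from typing import Sequence


def empirical_p_value(observed_lrs: float, simulated_lrs: Sequence[float]) -> float:
    """Proportion of neutral simulations with an LRS greater than the observed LRS."""
    if len(simulated_lrs) == 0:
        raise ValueError("at least one simulated statistic is required")
    exceeding = sum(1 for value in simulated_lrs if value > observed_lrs)
    return exceeding / len(simulated_lrs)


# Hypothetical values for illustration only (not results from this study).
simulated = [0.0, 0.4, 1.1, 2.7, 5.2]
print(empirical_p_value(4.0, simulated))  # one of the five replicates exceeds 4.0
```

With 1,000 replicates, a gene for which no simulated statistic exceeds the observed value can only be reported as p < 0.001, which is how the smallest empirical p‐values appear in the tables.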

The genes analyzed here are not a random or representative sampling of TF genes, but were selected specifically to be biased toward those more likely to show evidence of positive selection. Despite this emphasis, none of the 175 genes analyzed met the strict criteria for positive selection in the branch+site tests after multiple test

correction, and in the empirical test, 8 genes were predicted to be positively selected. A

similar dearth of evidence for human‐specific positive selection was seen in analyses of

over 13,000 human‐chimpanzee‐macaque gene alignments (BAKEWELL et al. 2007b; GIBBS et al. 2007), although both studies reported a somewhat larger number of genes with a prediction of positive selection on the chimpanzee branch.

We identified a subset of transcription factor genes that contain a significant excess of nonsynonymous substitutions, an excess that is unlikely to have arisen through neutral evolution of the gene. There are several well‐described examples of evolution of transcription factors that contribute to morphological divergence (GALANT and CARROLL

2002; HITTINGER et al. 2005; LOHR and PICK 2005; LOHR et al. 2001; RONSHAUGEN et al. 2002)

(see (HSIA and MCGINNIS 2003a) for a review). For example, the Hox protein

Ultrabithorax (Ubx) represses limb development in insects, but not in crustaceans


(RONSHAUGEN et al. 2002), and this difference has been mapped to an interaction domain with pleiotropic functions (HITTINGER et al. 2005). Further work will be required to determine whether the predictions of positive selection in the genes presented here are indicative of functional divergence in the encoded TF proteins and to link that divergence to phenotypic distinctions between humans and chimpanzees.

Alignments and full results are available at http://mendel.gene.cwru.edu/adamslab/pbrowser.py


CHAPTER 3

HUMAN PAML BROWSER: A DATABASE OF POSITIVE SELECTION ON HUMAN GENES USING PHYLOGENETIC METHODS

Gabrielle C. Nickel1, David L. Tefft2, and Mark D. Adams1

1Department of Genetics, Case Western Reserve University School of Medicine,

Cleveland, OH 44106

2The Broad Institute, Cambridge, MA 02141

Note: This chapter is published as Nickel GC, Tefft DL, and Adams MD. 2008. Human PAML browser: a database of positive selection on human genes using phylogenetic methods. Nucleic Acids Res. 36: D800‐8. DLT and MDA performed phylogenetic analysis and sequence construction. DLT constructed the database.


ABSTRACT

With the recent increase in the number of mammalian genomes being sequenced, large scale genome scans for human specific positive selection are now possible. Selection can be inferred through phylogenetic analysis by comparing the rates of silent and replacement substitution between related species. Maximum likelihood (ML) analysis of codon substitution models can be used to identify genes with

an accelerated pattern of amino acid substitution on a particular lineage. However, the

ML methods are computationally intensive and awkward to configure. We have created a database that contains the results of tests for positive selection along the human lineage in 13,721 genes with orthologs in the UCSC multispecies genome alignments.

The Human PAML Browser is a resource through which researchers can search for a

gene of interest or groups of genes by Gene Ontology category, and obtain coding sequence alignments for the gene as well as results from tests of positive selection

from the software package Phylogenetic Analysis by Maximum Likelihood. The Human

PAML Browser is available at http://mendel.gene.cwru.edu/adamslab/pbrowser.py.


INTRODUCTION

How are humans so genetically similar to the great apes, yet so phenotypically divergent from the other members of this family? This question has intrigued researchers for decades (CARROLL 2003; KING and WILSON 1975). One approach to identifying genetic determinants of phenotypic divergence is to examine protein‐coding genes for evidence of positive selection (KIMURA 1983; LI et al. 1985). This type of selection is characterized by a new allele that offers a fitness advantage to an organism, and is rapidly pulled through the population until fixation of the beneficial allele. Genes that have been positively selected along the human lineage, yet remain constrained or selectively neutral in our closest living relatives may offer insight into the biologically significant genetic changes that have occurred since the Pan‐Homo split.

A variety of methods have been developed for the prediction of positive selection. Some take advantage of population‐specific genetic patterns. One such method uses allele frequency differences between populations (TAJIMA 1993) to uncover loci that have been affected by a genetic hitchhiking event. Similarly, the extended haplotype homozygosity method (SABETI et al. 2002b) measures linkage disequilibrium between two markers with the intent of uncovering a recent selective sweep and is particularly useful in uncovering population‐specific selection. Another technique compares the rate of polymorphism within a species to the rate of divergence (fixed differences) between species (MCDONALD and KREITMAN 1991b). While these methods are

extremely useful for ascertaining genes subject to recent positive selection, they are


unable to fully uncover those very ancient and fundamental changes that occurred around the time of human‐chimpanzee divergence and are shared by all human populations. For this purpose, analysis of evolutionary rates in a phylogenetic context

(FELSENSTEIN 1981; LI et al. 1985) can be used to identify genes that are evolving more

rapidly on a particular lineage compared to the rest of the tree (see (YANG 2002) for a review). The pattern of codon substitution across a phylogenetic tree can be inferred from a multiple sequence alignment using maximum likelihood methods (GOLDMAN and

YANG 1994). When the rate of nonsynonymous codon changes (dN) exceeds the rate of synonymous codon changes (dS), positive selection can be inferred. The codeml

program from the Phylogenetic Analysis by Maximum Likelihood (PAML) package can be used to test different codon substitution models and perform a likelihood ratio test of positive selection along specified lineages based on the dN/dS ratio (YANG 1997c).

Anisimova et al. (ANISIMOVA et al. 2001; ANISIMOVA et al. 2002) have performed large‐scale simulation studies to test the effect that parameters such as the number of species, branch lengths, sequence length, and sequence divergence have on the tests of positive selection as implemented in PAML. They concluded that predictions of positive selection are unreliable when the sequences being compared are highly similar and when only a small number of species is used. To increase the power and accuracy of the predictions, they suggest increasing the number of lineages used in the analysis.

Lastly, they conclude that multiple models should be used in any analysis of selection in order to ensure robustness in the predictions and to protect against spurious results. In accordance with these findings, our study was designed to include the maximum


number of mammalian sequences available to increase the power to detect selection in sequences as similar as human and chimpanzee. Multiple models have also been included in the database to help differentiate between positive selection and relaxation of selective constraint.

Essential to the prediction of selection by phylogenetic analysis is the availability of sequence data from a variety of species. Multispecies alignments for orthologous protein‐coding genes are used to infer the ancestral sequence at each internal node within a phylogenetic tree which is necessary in the calculation of the rate and direction of codon substitution. In this way, differences in selective constraint can be predicted on one or many lineages within a phylogeny. Currently, the National Human Genome

Research Institute (NHGRI) has approved 43 mammalian species as sequencing targets, many of which are currently underway (http://www.genome.gov/10002154). Five of these have been completed or are in genome refinement (2004; CONSORTIUM 2005c;

GIBBS et al. 2004; LANDER et al. 2001; LINDBLAD‐TOH et al. 2005; VENTER et al. 2001;

WATERSTON et al. 2002), 21 are slated for draft assembly, and the remaining 17 are to be

sequenced at low (~2X) coverage. The Genome Browser group at UC Santa Cruz has

produced full‐genome multisequence alignments using all of the available vertebrate genome assemblies (BLANCHETTE et al. 2004). With the increasing number of genomes being sequenced, phylogenetic analyses can now be performed on a genome‐wide scale.

We have collected alignments containing multiple mammalian species for 13,721 orthologous protein‐coding genes and examined each for evidence of human specific


positive selection using phylogenetic analysis. The multispecies alignments and the

results from the genome‐wide selection scan are housed in a web‐accessible database named the Human PAML Browser. Users can search by gene or gene family and obtain results from likelihood tests of positive selection on the gene of interest.

These types of analyses can be computationally intensive and time consuming, and the database offers an alternative for many researchers to investigate selection

without the difficulty of performing the analysis themselves. The wide variety of species represented in the database also avoids the need to sequence multiple organisms for a comprehensive analysis. The data presented in the Human PAML Browser provide the opportunity for researchers to easily examine their gene(s) of interest for human‐specific positive selection and may be a stepping stone for many future studies of selection and its effect on the human genome.

DATA SOURCES AND PROCESSING

Input Data

The availability of pre‐computed genome alignments for a diverse set of mammalian species represents an excellent starting point for phylogenetic analysis of

the pattern of selection operating on individual genes. An assumption of phylogenetic

analysis is that the aligned sequences are in fact orthologous. Several groups have constructed sets of orthologous genes (EPPIG et al. 2005; HUBBARD et al. 2007; LEE et al.


2002b; WHEELER et al. 2007), but the genome‐based alignments have certain advantages.

Orthology relationships in the multiple alignments are in the context of genomic segments, rather than inferred on the basis of protein alignments, and are thus expected to be reasonably robust. Studies comparing primary reads and finished sequence from a selection of mammals with the human genome suggest that >97% of alignable sequences match at orthologous locations (MARGULIES et al. 2005).

Furthermore, coding sequence information from incomplete (and incompletely annotated) genomes can be used that is not available in protein sequence databases.

The disadvantage of using genome alignments is that they do not completely account for lineage‐specific duplication and deletion, which can make inference of orthology difficult (OHNO 1970; PRINCE and PICKETT 2002; ROTH et al. 2007).

Alignments of 16 vertebrate species with the human reference sequence were downloaded from the UCSC Genome Browser

(http://hgdownload.cse.ucsc.edu/goldenPath/hg18/multiz17way/, (BLANCHETTE et al.

2004)). The genome assembly multiple alignments were constructed from draft genome assemblies from the following organisms: chimpanzee, rhesus macaque, mouse, rat, rabbit, dog, cow, armadillo, elephant, tenrec, opossum, chicken, frog, zebrafish, tetraodon, and fugu. Initial results using protein‐coding sequence (CDS) coordinates resulted in an unacceptable number of frame shifts, presumably due to small alignment errors near exon/intron junctions. We therefore determined CDS

regions by alignment of the longest representative RefSeq mRNA for each gene to the human genome. CDS alignments were then extracted representing all of the


mammalian species from the genomic alignment files, representing most mammalian orders including the subclass Metatheria (Fig. 1). For codeml analysis, alignments were extracted corresponding to each mammalian species with sequence data at a given position.

Figure 1: Unrooted mammalian species tree of the organisms used in the construction of the database and the phylogenetic tree used in PAML analysis. Orientation adapted from the UCSC Genome Browser.

13,721 genes were analyzed; 12,905 of these had at least a portion of the sequence represented in at least 8 species. At the time of the analysis, more than 5x coverage of the orangutan genome was available as shotgun sequence reads in the NCBI


Trace Archive, but these data had not yet been assembled. We felt that including all available primate sequence was important, so we added inferred protein coding sequence from orangutan to the mammalian CDS alignments by comparing human genomic sequence to the orangutan Trace Archive at NCBI, initially using BLAST, then assembling reads using phrap (www.phrap.org) and producing final alignments with the lalign algorithm implemented in matcher from the EMBOSS package (ALTSCHUL et al.

1990; PEARSON and LIPMAN 1988), and extracting CDS regions based on exon coordinates from the human genomic sequence. Inclusion of orangutan sequence in the alignments had only minimal impact on the human‐specific dN/dS, but improved prediction of human‐specific substitutions when using the branch+site models described below (data not shown). Columns with gaps in the human sequence were removed from the alignment to facilitate analysis.
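The gap‐removal step described above can be sketched as follows. This is a minimal illustration, assuming the alignment is held as a dictionary of equal‐length strings keyed by species name and that gaps are written as "-"; the function name and the toy sequences are hypothetical and do not reproduce the scripts used to build the database.

```python
from typing import Dict


def drop_reference_gap_columns(
    alignment: Dict[str, str], reference: str = "human", gap: str = "-"
) -> Dict[str, str]:
    """Remove every alignment column in which the reference sequence has a gap."""
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all aligned sequences must have the same length")
    # Indices of the columns to keep: those where the reference is not a gap.
    keep = [i for i, base in enumerate(alignment[reference]) if base != gap]
    return {
        name: "".join(seq[i] for i in keep) for name, seq in alignment.items()
    }


# Toy alignment for illustration only.
toy = {"human": "ATG---CCA", "chimp": "ATGAAACCA", "macaque": "ATGAAGCCA"}
print(drop_reference_gap_columns(toy))
```

Because only columns gapped in the reference are dropped, the human coding frame is preserved, while insertions specific to other species are discarded.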

Statistical analysis

Markov process models of codon substitution, as implemented in the PAML software package (YANG 1997c), were used to analyze the selective pressures affecting

each gene along the human lineage. The PAML software package is available at http://abacus.gene.ucl.ac.uk/software/paml.html. The program codeml v3.14b within the PAML package was used to analyze the data. Codeml allows specification of several different codon substitution models that allow testing of hypotheses related to selection along certain branches of the tree or at a subset of codons (sites).


For codeml analysis, a directory was made for each gene containing the multiple sequence alignment in PHYLIP format, codeml control files, and an appropriate

phylogenetic tree file for the species represented in the alignment. Species tree files were constructed from the consensus mammalian tree (Fig 1) (MURPHY et al. 2001) essentially by pruning branches that were not represented for a given gene. codeml was run on an Apple OS X compute cluster consisting of 20 dual‐processor nodes, each equipped with 2 GB of RAM. The compute time for codeml analyses under all five models for the 13,721 alignments was 19 days using an average of 30 CPUs concurrently. Following codeml analysis, a Python script parsed the codeml results and alignment files and loaded the information into a MySQL database. A web interface was developed using Python to provide interactive access to the codeml results on a gene‐by‐gene basis.
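Pruning a consensus tree down to the species present in a given alignment can be sketched as below. This is an illustration under simplifying assumptions: the tree is represented as nested tuples with species names as leaves, branch lengths are ignored, and the small example topology is hypothetical rather than the full tree of Fig 1.

```python
def prune(tree, keep):
    """Return the subtree containing only leaves in `keep`, or None if none remain."""
    if isinstance(tree, str):
        return tree if tree in keep else None
    children = [c for c in (prune(child, keep) for child in tree) if c is not None]
    if not children:
        return None
    if len(children) == 1:
        # Collapse internal nodes left with a single descendant.
        return children[0]
    return tuple(children)


def to_newick(tree) -> str:
    """Serialize a nested-tuple tree as a Newick string without branch lengths."""
    def fmt(node):
        if isinstance(node, str):
            return node
        return "(" + ",".join(fmt(child) for child in node) + ")"
    return fmt(tree) + ";"


# Hypothetical toy topology for illustration only.
consensus = ((("human", "chimp"), "macaque"), ("mouse", "rat"))
print(to_newick(prune(consensus, {"human", "chimp", "mouse"})))
```

Collapsing single‐child nodes keeps the pruned tree strictly bifurcating wherever the original was, which is the form expected for a tree file that accompanies a reduced alignment.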

For each gene, codeml was run under two test models and three null models,

and results from each were then compared as tests of positive selection on the human gene. Each run produced a maximized likelihood value, which is the probability of observing the data under the evolutionary conditions implemented in the model at the best‐fitting parameter values. A likelihood ratio test (LRT) was used to determine whether the test model is a significantly better fit to the data than the null model. A p‐value was calculated by comparing two times the difference in log likelihood values to a chi‐squared distribution, with the degrees of freedom equal to the difference in number of parameters between the pair of nested models. The branch model allows the dN/dS ratio on a specified branch of the tree to differ from the average dN/dS ratio across the rest


of the tree. The matching null model fixes dN/dS=1 on the specified branch. If dN/dS>1 in the branch model and the LRT is significant, positive selection can be inferred. For tests of selection on the human branch, we refer to these branch models as model H and model Hnull, respectively. A particular problem in the branch tests, which is due to

the short evolutionary time since the human‐chimpanzee divergence, occurs when there are only nonsynonymous substitutions on the branch between human and the human‐chimpanzee ancestor. In this situation, the rate of synonymous substitution dS =

0 and dN/dS is undefined; this is represented in the database as 999. The branch test also requires dN/dS to be elevated across the entire gene, and therefore is considered somewhat unrealistic, particularly for multi‐domain proteins where positive selection may be acting on only one domain.
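The likelihood ratio test described above can be sketched in a few lines. This is a minimal illustration using only the Python standard library, so it supports just one or two degrees of freedom through closed‐form chi‐squared tail probabilities; the function name and the log‐likelihood values are hypothetical and are not taken from the database.

```python
import math


def lrt_p_value(lnl_test: float, lnl_null: float, df: int = 1):
    """Return (statistic, p) for a likelihood ratio test of nested models."""
    # Twice the difference in log likelihoods; clamp tiny negative values
    # that can arise from incomplete numerical optimization.
    statistic = max(2.0 * (lnl_test - lnl_null), 0.0)
    if df == 1:
        # Chi-squared tail with 1 degree of freedom.
        p = math.erfc(math.sqrt(statistic / 2.0))
    elif df == 2:
        # Chi-squared tail with 2 degrees of freedom.
        p = math.exp(-statistic / 2.0)
    else:
        raise ValueError("this sketch supports only df = 1 or df = 2")
    return statistic, p


# Hypothetical log likelihoods for a test model and its nested null.
statistic, p = lrt_p_value(-1000.00, -1001.92, df=1)
print(round(statistic, 2), round(p, 3))
```

For a comparison with one free parameter, a statistic of about 3.84 corresponds to the nominal p = 0.05 threshold used throughout.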

The branch+site models were developed to address positive selection at a subset of sites (codons) on one branch of the tree (YANG and NIELSEN 2002b; YANG et al.

2005b). Model A defines four classes of sites, where two of the classes have dN/dS <=1 on all branches while two additional classes have dN/dS >1 on the lineage of interest

(human), but dN/dS <=1 on the other branches of the tree. Model Anull fixes dN/dS=1 for the latter two classes and thus comparison of model A with model Anull is a strict test of positive selection. Model 1a assumes dN/dS <= 1 at all sites across all branches.

A significant LRT in the comparison of model A with model 1a can be due to either positive selection or relaxation of selective constraint, since there is no formal test of

whether dN/dS is >1 on the human branch, just whether there is a subset of sites on the human branch with a dN/dS ratio that is elevated relative to the rest of the tree. A summary of the models and likelihood ratio tests performed in this analysis is found in Table 1, and more thorough information is available in the PAML documentation. A small number of genes have also been run under site models in which dN/dS is allowed to vary among sites instead of between lineages. This is helpful in interpreting whether positive selection is occurring specifically along the human lineage or whether the gene is rapidly evolving in multiple species.
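As a rough guide to how the five models map onto codeml control variables, the sketch below lists one plausible set of settings. These values follow the conventions given in the PAML documentation (model, NSsites, fix_omega, omega) and are an assumption; the control files actually used for the database are not reproduced here, and the dictionary and function names are hypothetical.

```python
# Assumed codeml settings for the models in this chapter. They follow the
# PAML documentation and are not copied from the control files used here.
CODEML_SETTINGS = {
    "H":     {"model": 2, "NSsites": 0, "fix_omega": 0, "omega": 1},
    "Hnull": {"model": 2, "NSsites": 0, "fix_omega": 1, "omega": 1},
    "A":     {"model": 2, "NSsites": 2, "fix_omega": 0, "omega": 1},
    "Anull": {"model": 2, "NSsites": 2, "fix_omega": 1, "omega": 1},
    "1a":    {"model": 0, "NSsites": 1, "fix_omega": 0, "omega": 1},
}

# Test model paired with its null model for each likelihood ratio test.
LRT_PAIRS = {
    "branch": ("H", "Hnull"),
    "relaxed branch+site": ("A", "1a"),
    "strict branch+site": ("A", "Anull"),
}

def control_lines(model_name):
    """Render the settings for one model as 'key = value' control file lines."""
    return [f"{key} = {value}" for key, value in CODEML_SETTINGS[model_name].items()]

for test, (test_model, null_model) in LRT_PAIRS.items():
    print(test, control_lines(test_model), control_lines(null_model))
```

With model = 2 the human branch must also be marked as the foreground lineage in the tree file; that step is omitted from this sketch.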

RESULTS

Table 2 presents a summary of the results of tests of positive selection for human genes. As expected, the branch test, which requires dN/dS to exceed 1 over the entire coding sequence, returned few significant results. In the strict branch+site test

that compares results from models A and Anull, 244 genes met the nominal threshold for significance of p<0.05. More than twice as many genes met the significance threshold in a comparison of Model A with Model 1a, which can indicate a relaxation of

selective constraint or positive selection. Gene Ontology categories were matched to the 244 genes with p<0.05 in the strict branch+site test to assess whether certain categories might be over‐represented among genes predicted to be positively selected

(Table 3). As has been reported previously (ARBIZA et al. 2006b; BUSTAMANTE et al. 2005b;

CONSORTIUM 2005c), transcription factors and olfactory receptors are over‐represented

among positively selected genes. Reflecting the abundance of transcription factors, the cellular component category most over‐represented is the nucleus.

Table 1. Evolutionary Models used by codeml

Model         Description

Model H       Non‐neutral model: Human ω* allowed to vary from other branches
              ωHUMAN ≠ ωOTHERS

Model Hnull   Neutral model: Human ω fixed at ω=1 and allowed to vary from other branches
              ωHUMAN = 1 and ωHUMAN ≠ ωOTHERS

Model A       Sites on the human branch allowed to differ
              Background lineages, 4 site classes:         0 < ω0 < 1   ω1 = 1   0 < ω2a < 1   ω2b = 1
              Foreground lineage (human), 4 site classes:  0 < ω0 < 1   ω1 = 1   ω2a >= 1      ω2b >= 1

Model Anull   Sites on the human branch fixed at ω=1
              Background lineages, 4 site classes:         0 < ω0 < 1   ω1 = 1   0 < ω2a < 1   ω2b = 1
              Foreground lineage (human), 4 site classes:  0 < ω0 < 1   ω1 = 1   ω2a = 1       ω2b = 1

Model 1a      Sites on all branches nearly neutral
              All lineages, 2 site classes:                0 < ω0 < 1   ω1 = 1

Compare                 Likelihood Ratio Tests of Significance

Model H, Model Hnull    Branch test: Non‐neutral evolution
MA, M1a                 Relaxed branch+site test: Positive selection or relaxation of selective constraint
MA, MAnull              Strict branch+site test: Positive selection

Table 2: Summary of results of tests of selection on human genes

Test                           Significance Threshold   #Genes†

Model A vs. Model Anull        0.05                     244
‐ Strict branch+site test      0.01                     152
                               0.001                    48

Model A vs. Model 1a           0.05                     611
‐ Relaxed branch+site test     0.01                     276
                               0.001                    114

Model H vs. Model Hnull        0.05                     16
‐ Strict branch test           0.01                     3
                               0.001                    2

†OR5B3 is the only gene with p<0.05 in both branch (Model H vs. Model Hnull) and branch+site (Model A vs. Model Anull) tests.


Table 3: Gene ontology categories over‐represented among genes with p<0.05 in the strict branch+site test.

Gene Ontology ID   Category Name                                Corrected p‐value

Biological Process
GO:0006350         Transcription                                0.003
GO:0006355         Regulation of transcription, DNA‐dependent   0.003
GO:0007608         Sensory perception of smell                  0.02
GO:0006814         Sodium ion transport                         0.02
GO:0007165         Signal transduction                          0.02

Molecular Function
GO:0005515         Protein binding                              0
GO:0046872         Metal ion binding                            0.00005
GO:0016740         Transferase activity                         0.00009
GO:0008270         Zinc ion binding                             0.001
GO:0003677         DNA binding                                  0.001
GO:0000166         Nucleotide binding                           0.001
GO:0003676         Nucleic acid binding                         0.006
GO:0004984         Olfactory receptor activity                  0.006
GO:0005524         ATP binding                                  0.01
GO:0031402         Sodium ion binding                           0.02

Cellular Component
GO:0005634         Nucleus                                      0
GO:0016020         Membrane                                     0.0004
GO:0005737         Cytoplasm                                    0.002

Gene Ontology classification was performed using Onto‐Express (DRAGHICI et al. 2003; KHATRI et al. 2004)


Several groups have performed genome scans for positive selection on the human lineage using divergence data (ARBIZA et al. 2006b; BAKEWELL et al. 2007a; CLARK et al. 2003b; CONSORTIUM 2005c). One feature that sets this study apart is the large number and wide variety of mammalian species used. When only a small number of species is used (e.g. human‐chimpanzee‐mouse or human‐chimpanzee‐macaque) or the outgroups are too divergent (ARBIZA et al. 2006b; CLARK et al. 2003b; GIBBS et al. 2007), there can be a high degree of uncertainty as to whether a substitution occurred on the human or chimpanzee lineage. This can lead both to an increase in the false positive rate and to a lack of power to detect selection if multiple substitutions have occurred at the same codon. The inclusion of a third great ape, the orangutan, the Old World monkey rhesus macaque, and several non‐primate mammals adds considerable power to model the substitution pattern. As recent studies have concluded (ARBIZA et al. 2006b; BAKEWELL et al. 2007a), there is a relative paucity of human‐specific positively selected genes and these events appear to be relatively rare in the genome. Bakewell et al. (BAKEWELL et al. 2007a) suggest that the small long‐term effective population size of humans as they migrated out of Africa may be masking positive selection, as many advantageous alleles have yet to become fixed in the human population.

User Interface

The Human PAML Browser can be accessed at http://mendel.gene.cwru.edu/adamslab/pbrowser.py and there are a number of ways to query the database. A gene of interest can be searched for by gene symbol, Gene ID, mRNA accession (RefSeq), or gene name. To examine genes by their biological process, molecular function, or cellular component, Gene Ontology IDs (http://www.geneontology.org/) may also be used to query the database. For sets of genes with similar names, a wildcard * can be inserted. A more recent addition is the ability to query the database by statistical significance. The user can set the threshold p‐value for one of the three likelihood ratio tests to extract genes by significance. The return screen from a query contains one or more genes related to the search, and the user has the ability to choose the alignment set used in the analysis, which is primarily the UCSC alignments plus orangutan.
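The effect of a wildcard query can be illustrated with shell‐style pattern matching. How the browser itself implements the wildcard is not described here, so the sketch below uses Python's fnmatch module as a stand‐in; the function name and the short list of gene symbols are chosen only for illustration.

```python
from fnmatch import fnmatchcase

def match_symbols(pattern, symbols):
    """Return the gene symbols matching a pattern in which * is a wildcard."""
    # Upper-case both sides so that the query is case-insensitive.
    return [s for s in symbols if fnmatchcase(s.upper(), pattern.upper())]

# A small illustrative set of gene symbols.
symbols = ["CDX1", "CDX2", "CDX4", "IRX3", "OR5B3"]
print(match_symbols("CDX*", symbols))
```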

Database organization

Upon gene choice the user is presented with a summary screen from the PAML analysis (Fig. 2). This screen contains direct links to the Entrez ID, the mRNA accession, and the GO terms for the gene. The organisms used in this analysis are ordered by evolutionary distance from humans, and information is given on the percent coverage and average percent identity of the aligned sequence from each organism as compared to the orthologous human sequence. The results from the likelihood ratio tests in the form of a p‐value are displayed for all three tests of positive selection: the relaxed branch+site test, the strict branch+site test, and the branch test. Also contained are links to the DNA and protein multispecies alignments (Fig. 3) as well as the sequences in FASTA format to facilitate subsequent analysis. It is of note that an asterisk in the protein alignments refers to an unknown amino acid or gap in the sequence, not a stop codon.

The bulk of the data is contained under the Branch and Site Models link (Fig. 4).

To the right is the table of results from the branch models H and Hnull (Figure 4A). The numbers under the branch heading refer to the relationship of the branch from an internal node to a terminal node or to another internal node. Terminal branches are labeled by organism. One dN/dS ratio is present across the entire tree, with the

exception of the terminal branch to human (Hsa). dN/dS on the human branch is calculated in model H and fixed at 1 in model Hnull.


Figure 2: PAML database summary results for Iroquois homeobox 3 (IRX3). The likelihood ratio tests are as follows: the relaxed branch+site test (model A vs. model 1a), the strict branch+site test (model A vs. model Anull), and the branch test (model H vs. model Hnull). The three letter species codes are Hsa (human), Ptr (chimpanzee), Mmu (macaque), Rno (rat), Mms (mouse), Ocu (rabbit), Cfa (dog), Bta (cow), Dno (armadillo), Laf (elephant), Ete (tenrec), and Mdo (opossum).


Figure 3: Multispecies protein alignment for IRX3. Coding sequence was extracted from whole‐ genome sequence alignments available from the UCSC Genome Browser and trimmed to include only coding regions based on matches with the human CDS. Alignment columns with gaps in the human sequence were excluded. Residues 1‐180 of 501 are shown. Residues 36 and 44 were predicted to be positively selected in human with Bayes Empirical Bayes posterior probability of 0.886 and 0.997, respectively (see Fig 4B).


Figure 4B contains the results from branch+site model A. Model A is run under the assumption of four site classes: class 0, sites under negative selection in all branches; class 1, sites evolving neutrally in all branches; class 2a, sites positively selected on the human branch, but negatively selected on the other branches; and class 2b, sites positively selected on the human branch, but neutrally evolving on the other branches. The proportion refers to the fraction of sites within the protein that fall into each of the four site classes. Foreground and background ω are the dN/dS ratios for the human lineage and all the other lineages, respectively. The version of codeml implemented uses a Bayes Empirical Bayes (BEB) method for calculating the posterior probability that each site is from a particular site class; sites with a high posterior probability of belonging to site class 2a or 2b are inferred to be positively selected (YANG et al. 2005b). A summary table shows the total number of sites predicted to be positively selected as well as the number with a BEB posterior probability >=95% and >=99%. Lastly, all sites predicted to be positively selected are listed along with the residue number, the affected amino acid, and the BEB probability specific to that residue. The results from the null model 1a can be retrieved through the Site Models link on the original summary page (Fig 2).
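The summary counts described above amount to thresholding the per‐site BEB posterior probabilities. A minimal sketch, using the two IRX3 residues reported in the Figure 3 caption; the function name and the data layout are hypothetical and do not reflect the browser's internal code.

```python
def summarize_beb(sites):
    """Count predicted sites overall and at the 95% and 99% BEB thresholds.

    sites: list of (residue_number, posterior_probability) pairs.
    """
    return {
        "total": len(sites),
        "ge_95": sum(1 for _, prob in sites if prob >= 0.95),
        "ge_99": sum(1 for _, prob in sites if prob >= 0.99),
    }

# IRX3 residues and BEB posterior probabilities from the Figure 3 caption.
irx3_sites = [(36, 0.886), (44, 0.997)]
print(summarize_beb(irx3_sites))
```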


Figure 4: Likelihood test data and results for IRX3. (A) Branch tests for selection, model H (test model) and model Hnull (neutral model). The variables in the table are as follows: t, the length of the branch; S and N, the number of synonymous and nonsynonymous sites, respectively; dN/dS, the ratio of the rates of nonsynonymous and synonymous substitution for the branch; dN and dS, the rates of nonsynonymous and synonymous substitution on the branch, respectively; S*dS and N*dN, a rough estimate of the absolute number of synonymous and nonsynonymous substitutions. (B) Branch+site tests for selection, including the proportion of codons


DISCUSSION

There are several limitations of the data and analyses presented in the Human PAML Browser that the user should consider. The results presented in the browser were obtained using an automated process at multiple steps from alignment creation to extraction of coding sequences, to maximum likelihood analysis, to database loading and web display. Most alignments and result sets have not been examined manually or curated in any way. Bad alignments lead to bad phylogenetic inferences. In particular, some genes with very low p‐values in tests of selection appear to have alignment problems with many human‐specific substitutions clustered at adjacent residues.

Incorrect assignment of orthology could also result in misleading p‐values. The whole‐genome alignments used here have the advantage of aligning coding sequences in a larger (often much larger) syntenic framework, but a gene family approach that accounts for gene duplication and deletion events would be a useful adjunct to interpretation of the information presented in the Human PAML Browser. Differential gene loss following duplication can lead to 1:1 paralogs being mistaken for 1:1 orthologs. This problem is particularly acute in organisms, such as yeast, that have experienced whole‐genome duplication events (SCANNELL et al. 2007), but could also affect vertebrate alignments, particularly given the extent of lineage‐specific segmental duplication in primates (BAILEY and EICHLER 2006). Gene trees have not been constructed for the alignments used here, but reconciliation of the gene tree with the species tree using a program such as NOTUNG (CHEN et al. 2000) would be a useful exercise in the context of follow‐up study.

The p‐values in the Human PAML Browser have not been corrected for multiple tests, and so care should be taken in interpretation of results. Another factor that potentially impacts the p‐values is that in some cases, the ML methods fail to converge to a global optimum, resulting in an inappropriately low p‐value. Finally, genome assemblies and assembly alignments are improving over time. It is strongly recommended that the user validate the results presented here for any gene of interest by re‐extracting gene sequences and repeating the codeml analysis. This will serve the dual purposes of incorporating new and improved genome assembly data and ensuring that the ML analysis is stable.
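Because the p‐values in the browser are uncorrected, a user who extracts a list of genes may wish to apply a multiple‐testing correction before interpreting it. The sketch below shows the Bonferroni and Benjamini‐Hochberg adjustments; neither was applied to the database, the choice of method is left to the user, and the p‐values in the example are hypothetical.

```python
def bonferroni(pvalues):
    """Bonferroni-adjusted p-values, capped at 1."""
    n = len(pvalues)
    return [min(1.0, p * n) for p in pvalues]

def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so that adjusted values stay monotone.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical uncorrected p-values for four genes.
raw = [0.01, 0.04, 0.03, 0.005]
print(bonferroni(raw))
print(benjamini_hochberg(raw))
```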

CONCLUSION

We have generated a database containing the PAML results for tests of human‐specific positive selection. The simple web interface makes access to PAML results readily available, compared to the tasks of preparing properly formatted alignment, phylogenetic tree, and codeml control files and parsing five different codeml output files. The multispecies alignments used in each analysis are readily available in FASTA file format for further analysis using other methods or to examine selection among other orders/families of mammals or in individual species. The Human PAML Browser is intended to aid other researchers as they search for the selective pressures that have affected their gene or gene family of interest, and can be a stepping stone for many future studies of positive selection and the human genome.


CHAPTER 4

DEMONSTRATING FUNCTIONAL DIVERGENCE USING

GENOME WIDE EXPRESSION ARRAYS IN THE POSITIVELY

SELECTED GENE CDX4


ABSTRACT

The prediction of positive selection is difficult; the proof is even harder. Much research has been performed on positive selection, both in fine tuning the methods used to predict selection and in using these methods to discover the evolutionary forces shaping individual genes. However, these predictions are simply inferences based on the statistical probability of a given model of evolution; they give no insight into mechanisms of functional change. Adaptive evolution of a protein implies amino acid change that has been selected for over the course of a species’ evolution and has led to functional divergence of the protein, yet there is a paucity of experiments that have been designed to test these predictions. In this study, the functional divergence of a human‐specific, positively selected transcription factor, CDX4, is examined through the use of genome wide expression arrays. The expression profiles of fibroblast cells transfected with human or chimpanzee CDX4 have been compared to look for genes that are differentially regulated by these orthologous proteins. A set of target genes differentially regulated by human and chimpanzee CDX4 has been identified, and will be validated by qPCR. Microarray analysis of the transcriptomes of cells transfected with a positively selected transcription factor will be a valuable tool for examining the expression differences of downstream target genes regulated by this transcription factor compared to the orthologous protein in a closely related species.


INTRODUCTION

Functional divergence in a human transcription factor posits that an evolutionary adaptation altering downstream gene expression has occurred on and been beneficial to the human lineage, and has been rapidly fixed in the population. There are a variety of assays that have been developed to test the activity of a transcription factor, and one of these cell‐culture based assays has been adapted here to test for functional divergence in the human Caudal type homeobox 4 (CDX4) protein. Global changes in gene expression patterns between the human and chimpanzee ortholog can be measured using microarray technology.

Microarrays provide a particularly useful way of examining global changes in the regulation of gene expression, such as finding the (often) many genomic targets of a transcription factor. The transcription profile of a transcription factor gene can be captured by transfecting it into a cell line and measuring which genes’ expression increases or decreases in response to the addition of the transcription factor. These downstream changes may be due to direct activation (or repression) of a target gene, or

a downstream response to changing one of the target genes. Either way, microarrays are useful in reconstructing transcription pathways.

Currently, no such studies have been performed to look at how humans and chimpanzees respond to a given stimulus. However, many studies have used human‐specific microarrays to examine global gene expression differences between humans and other primates (CACERES et al. 2003; ENARD et al. 2002a; FORTNA et al. 2004; GILAD et al. 2005; KARAMAN et al. 2003; KHAITOVICH et al. 2004). These studies were performed in hopes of finding genes whose differential regulation is contributing to the phenotypic changes seen between humans and chimpanzees.

The first such experiment was performed by Enard et al. (ENARD et al. 2002a). They generated expression profiles from blood leukocytes, liver and brain tissues of humans, chimpanzees, orangutans, and rhesus macaques using human Affymetrix arrays. Interspecies variability made individual (gene by gene) comparisons difficult between humans and chimps. However, when these samples were compared to orangutans, they fell into two mutually exclusive groups. When distance trees of the expression profiles are made, the human expression patterns are more similar to chimpanzees than to macaques in leukocytes and liver; however, in the brain samples, the chimpanzee and macaque samples cluster together. Enard et al. became the first to demonstrate that microarrays could be successfully used to detect quantitative changes in gene expression between these closely related species, laying the groundwork for other multispecies comparisons.

A second study that has particular relevance to the experimental design of this study was performed by Karaman et al. (KARAMAN et al. 2003). This group generated expression profiles for human, bonobo, and gorilla primary fibroblast cell lines. They identified a small number of genes that had significant expression differences between humans and the other great apes, such as genes involved in the extracellular matrix, metabolic pathways, signal transduction, and stress response. A particular problem this group encountered was in synchronizing the primary cell lines. Only early passage cell lines of approximately the same doubling time were included in their analysis because of known gene expression differences based on age and health of the cell.

CDX4 is an ideal candidate for follow‐up studies of functional divergence, as it is statistically significant in the empirical test for positive selection, as well as in other studies (BUSTAMANTE et al. 2005b; CLARK et al. 2003b). This 284 amino acid, X‐linked gene belongs to the caudal‐related homeobox protein family that mediates transcription activation. It is a transcription factor that contains a homeobox domain at the C‐terminus that mediates DNA‐binding, as well as a large caudal domain that mediates transcription activation and repression at the N‐terminus. There are three mammalian caudal genes, and this gene family is a homolog of the Drosophila gene caudal (DUPREY et al. 1988). These genes are relatives of the Hox gene cluster and are believed to have shared a common ancestor, the proto‐Hox cluster. Accordingly, the caudal genes, along with Gsh, are collectively known as Parahox genes, and both gene families play a role in the anterior‐posterior patterning of the developing embryo (BROOKE et al. 1998).

CDX4 is active very early in the developing embryo and thought to be involved in the regulation of HOX gene expression (CHARITE et al. 1998; DAVIDSON et al. 2003; VAN NES et al. 2006). Developmental defects observed in CDX4 zebrafish (kugelig) mutants include defects in haematopoiesis and anterior‐posterior patterning, and aberrant HOX gene expression (DAVIDSON et al. 2003). It has also been implicated as having a role in vertebral (CHARITE et al. 1998) and skeletal patterning through this HOX gene regulation (TABARIES et al. 2005). Interestingly, CDX genes both regulate and are regulated by FGF and the Wnt signaling pathway in amphibians (KEENAN et al. 2006) and in mice (PILON et al. 2006). A recent study by Skromne et al. in zebrafish showed that Cdx4 loss‐of‐function mutants exhibit a posteriorization of the hindbrain that impairs spinal cord formation (SKROMNE et al. 2007).

In mammals, the role of CDX genes in skeletal and vertebral patterning has only recently begun to be explored and is showing some very promising results. CDX4/CDX1 double knock‐out mutants have revealed a novel function for CDX4 in placental ontogenesis, as many of these mutants are embryonic lethal because the placenta is unable to fuse with the chorion (VAN NES et al. 2006). These animals also show extensive defects in vertebral extension, marked vertebral transformations, and changes in rib number (Figure 1). More recent studies suggest that over‐expression of CDX4 and the other CDX genes leads to anterior to posterior homeotic shifts in the vertebral column, particularly within the cervical region (GAUNT et al. 2008). Taking these findings into consideration, CDX4 may play an early embryonic role in the differences in the vertebral column and rib cage seen between humans and chimpanzees.

MATERIALS AND METHODS

DNA sequencing: CDX4 was sequenced in a panel of human and non‐human primates, including: chimpanzee, bonobo, gorilla, orangutan, Kloss’s gibbon, rhesus macaque, pig‐tailed macaque, hamadryas baboon, spider monkey, woolly monkey, red‐bellied tamarin, and ring‐tailed lemur. Primate DNA samples were obtained from the NIA Aging Cell Repository Phylogenetic Primate Panel (PRP0001). Human DNA samples were from the NIGMS Human Genetic Cell Repository Human Variation Collection (NA00946, NA00131, NA01814, NA14665, and NA03715). Both panels were obtained through Coriell Cell Repositories. Hylobates and Papio samples were kindly provided by Dr. Evan Eichler.

Sequencing was performed on an Applied Biosystems 3730xl DNA sequencer according to the protocol in Kidd et al. (KIDD et al. 2005). Primers for CDX4 were designed by hand in regions of conservation between human and mouse, in the intronic sequence surrounding the coding exons. Sequence was obtained from both strands of each PCR product and trace files were processed by phred and phrap (EWING and GREEN 1998; EWING et al. 1998; GREEN), with each exon from each organism assembled independently. Assembly and annotation of the CDX4 gene was performed according to Nickel et al. (NICKEL et al. 2008b).

Phylogenetic analysis: Codeml from the PAML package (v3.14b, YANG 1997b) was run on the multispecies sequence alignment for CDX4 on an Apple OS 10.5 system. The optimized branch‐site method of Yang and Nielsen (YANG and NIELSEN 2002a; YANG et al. 2005a; ZHANG et al. 2005) was used with nested null models to test whether positive selection might be occurring at a subset of sites (codons) on the human lineage. An empirical test for positive selection was used to assess the likelihood of selection by comparing the actual results from the codeml tests to results obtained using sequences simulated under a model of neutral evolution; it is described in Nickel et al. (NICKEL et al. 2008b).
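The general idea of an empirical test, comparing an observed statistic to a null distribution built from simulated neutral data, can be sketched as follows. The details of the published procedure are given in Nickel et al. (2008b) and may differ: the estimator used here, (r + 1)/(n + 1), is a common Monte Carlo convention adopted as an assumption, and the function name and simulated values are hypothetical.

```python
def empirical_pvalue(observed_stat, simulated_stats):
    """Fraction of neutral simulations with a statistic at least as extreme.

    One is added to the numerator and denominator so that the estimate
    can never be exactly zero with a finite number of simulations.
    """
    exceed = sum(1 for s in simulated_stats if s >= observed_stat)
    return (exceed + 1) / (len(simulated_stats) + 1)

# Hypothetical LRT statistics from 99 neutral simulations and one real gene.
simulated = [0.0] * 95 + [5.0] * 4
print(empirical_pvalue(3.0, simulated))
```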

cDNA clone construction: There are no available cDNA clones for CDX4, so the coding sequence for the human and chimpanzee genes was created in the lab and subcloned into the pGL4.74[hRluc/TK] (Promega) vector, which is under the control of a low expressing, constitutively active TK promoter. The coding sequence for CDX4 replaced the luciferase reporter gene in this vector. A method known as fusion PCR (WANG et al. 2002) was used to create CDX4 cDNAs (Figure 1). The technique involves piecing together exons of a gene to form the entire coding sequence and inserting this sequence into a plasmid vector. The PCR primers designed are chimeric: their 5’ ends contain sequence that is complementary to primers designed to amplify neighboring exons. The first and last exons have chimeric primers that introduce restriction sites into the sequence for subcloning.


Figure 1: Step by step process of fusion PCR. The technique involves piecing together exons of a gene to form the entire coding sequence and inserting this sequence into a plasmid vector. The PCR primers designed are chimeric: their 5’ ends contain sequence that is complementary to primers designed to amplify neighboring exons. The first and last exons have chimeric primers that introduce restriction sites into the sequence for subcloning into an expression vector.

Microarray: Human (GM02037) and chimpanzee (S003642) primary fibroblast cell lines were transfected with human or chimpanzee CDX4 or a control pGL4.74 luciferase vector using the Effectene Transfection Reagent (Qiagen), a non‐liposomal lipid formulation. Cell lines were obtained from the NIA Aging Cell Repository at Coriell Cell Repositories. Cells were harvested 12 hours post‐transfection, and RNA was isolated using the RNeasy Mini kit (Qiagen). RNA quantification was performed using the RiboGreen RNA Quantitation Reagent (Invitrogen). It is important that each of the harvested RNA samples be quantitated with the utmost sensitivity, particularly for the in vitro transcription (IVT) reaction, in which the RNA is first reverse transcribed with an oligo(dT) primer and then undergoes second strand synthesis. The resulting cDNA is then the template for IVT with T7 RNA Polymerase. The IVT reaction consists of labeling with biotinylated UTP, generating biotinylated, antisense RNA copies of each mRNA in a sample. This reaction was performed using Ambion’s Illumina RNA Amplification kit. The labeled cRNA is then hybridized to Illumina’s Sentrix HumanRef‐8 Expression BeadChip and run on the Illumina BeadStation.

RESULTS

In the CDX4 multiple sequence alignment, the caudal activation domain is quite diverse, while the DNA‐binding domain is extremely well conserved (Figure 2). Of the 159 amino acids in the caudal activation domain, 69 (43%) have amino acid substitutions. All four human‐specific substitutions fall within the caudal activation domain. In contrast, of the 61 amino acids in the homeobox DNA‐binding domain, 2 (3%) have diverged. This would seem to indicate that CDX4 is binding to similar target sequences, but may have undergone changes in its protein binding affinity and/or specificity.
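The percentages quoted above follow directly from the counts of substituted positions in each domain, as the short calculation below shows. The function name is hypothetical; the counts are those given in the text.

```python
def percent_substituted(n_substituted, n_total):
    """Percentage of alignment positions with at least one substitution."""
    return 100.0 * n_substituted / n_total

# Counts of substituted positions reported for the two CDX4 domains.
caudal = percent_substituted(69, 159)
homeobox = percent_substituted(2, 61)
print(f"caudal activation domain: {caudal:.0f}%")
print(f"homeobox DNA-binding domain: {homeobox:.0f}%")
```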


Figure 2: The protein multispecies sequence alignment for CDX4. The species symbols are as follows: Hsa, human; Ppa, bonobo; Ptr chimpanzee; Ggo, gorilla; Ppy, orangutan; Hla, Kloss’s gibbon; Mmu, rhesus macaque; Pha, hamadryas baboon; Mne, pigtailed macaque; Lla, woolly monkey; Age, spider monkey; and Sla, red‐bellied tamarin. The caudal activation domain is shaded in light grey. The homeobox domain is shaded in dark grey.

Hsa MYGSCLLEKE AGMYPGTLMS PGGDGTAGTG GTGGGGSPMP ASNFAAAPAF SHYMGYPHMP 60
Ppa ...... R...... 60
Ptr ...... R...... 60
Ggo ...... R...... 60
Ppy ...... R...... T...... V...... 60
Hla ...... R...... T...... I...... 60
Mmu ...... T...... RI ...G...... D....I...... V...... 60
Pha ...... T...... RI ...G...... D....I...... V...... 60
Mne ...... T...... RI ...G...... D....I...... V...... 60
Lla ..R...... A..R. .R.S...EA. ...D...L...... S...... 60
Age ...... A..R. .R.S...EA. ...D...L...... S...... 60
Sla ...... CS...R. .R.S...EA. .M.D...L...... L..S...... S 60

Hsa SMDPHWPSLG VWGSPYSPPR EDWSVYPGPS STMGTVPVND VTSSPAAFCS TDYSNLGPVG 120
Ppa ...... A. 120
Ptr ...... S.A. 120
Ggo ...... A. 120
Ppy ...... M...... I...... A. 120
Hla **...... ********** ********** ********** ******** 120
Mmu .....RQ... D...... Q...... M...... *...... A. 120
Pha .....RQ... D...... ******..Q. ..I...... M...... *. 120
Mne .....RQ... D...... Q...... M...... A. 120
Lla .I..PGST.. A...... L...... IG....V.S. 120
Age ....PGST.. A...... A. 120
Sla ....*GST.. A...... L....T.... P.....V.S. 120

Hsa GGTSGSSLPG QAGGSLVPTD AGAAKASSPS RSRHSPYAWM RKTVQVTGKT RTKEKYRVVY 180
Ppa ...... P...... SN...... 180
Ptr ...... P...... SY...... 180
Ggo ...... R... PT...... N...... 180
Ppy ....S..... P...... N...... 180
Hla ********** ********** ********** ********** *******...... 180
Mmu ...... P...... N...... 180
Pha ...... P...... *N...... *...... 180
Mne ...... P...... N.*...... 180
Lla ...... ST...... T..TNS...... H...... 180
Age ...... P...... SN...... 180
Sla ..I...... S.....A... T..TN...... H.G...... 180

Hsa TDHQRLELEK EFHCNRYITI QRKSELAVNL GLSERQVKIW FQNRRAKERK MIKKKISQFE 240
Ppa ...... 240
Ptr ...... 240
Ggo ...... 240
Ppy ...... 240
Hla ...... P. R...... 240
Mmu ...... T.... R...... 240
Pha ...... T.... R...... V.. 240
Mne ...... T.... R...... 240
Lla ...... R...... **** ********** ********** 240
Age ...... R...... **** ********** ********** 240
Sla ...... **** ********** ********** 240

Hsa NSGGSVQSDS DSISPGELPN TFFTTPSAVR GFQPIEIQQV IVSE
Ppa ......
Ptr ......
Ggo ......
Ppy ...... G......
Hla ...... G...... P...... ***** ****
Mmu ...... G......
Pha .T...... G......
Mne ...... G......
Lla ********** ********** ********** ********** ****
Age ********** ********** ********** ********** ****
Sla ********** ********** ********** ********** ****

The branch‐site method of Yang and Nielsen (YANG and NIELSEN 2002a; YANG et al. 2005a; ZHANG et al. 2005) was used to test whether positive selection might be occurring at a subset of sites (codons) in CDX4 on the human lineage (Table 1). This method compares the results from the strict branch+site test (model A vs. model Anull) and the relaxed branch+site test (model A vs. model 1a). The strict branch+site test requires dN/dS>1 at a subset of sites, while the relaxed branch+site test requires only that a subset of sites on the human lineage have a significantly elevated dN/dS ratio compared to those sites on the remainder of the tree. The strict branch+site test p‐value is 0.286 for CDX4, and the test predicts 4 amino acids to be positively selected (Table 2). The empirical test for positive selection is more sensitive for genes with short branch lengths, and therefore has more power to predict selection in CDX4 than the traditional methods implemented in codeml. The empirical test p‐value is 0.031. CDX4 is also predicted to be positively selected in previous studies using a modified maximum likelihood test using only human and chimpanzee sequence (CLARK et al. 2003b) or a population genetic based approach (BUSTAMANTE et al. 2005b) (Table 1).


Table 1: Statistical analysis of CDX4 using different methods to predict positive selection

Test    dN/dSa   Relaxed branch+site   Strict branch+site   Empirical   Clarkb    Bustamantec
                 p‐value               p‐value              p‐value     P         p‐value

CDX4    999      0.527                 0.286                0.031       < 0.001   0.0987

aFrom Model H vs. Model 0 (YANG 1997c)
bFrom (CLARK et al. 2003b)
cFrom (BUSTAMANTE et al. 2005b)

Table 2: The results from the strict branch+site model for CDX4

Amino acid      Human   Other lineages*    Domain              BEB Posterior
substitution                                                   Probability

19              M       R                  Caudal activation   0.988
119             V       A or S (0.76)      Caudal activation   0.943
131             Q       P or S (0.772)     Caudal activation   0.944
145             K       N or G             Caudal activation   0.988

*The statistic in parenthesis in this column represents the posterior probability that this amino acid position is under selection in all lineages

The experiments are designed to examine the differences in the downstream effects of a human transcription factor and its chimpanzee ortholog. The two transcription factors are transfected into an appropriate cell line, and the up‐ and down‐regulation of target genes is measured using genome‐wide expression arrays.

Any genome‐wide expression array can be used; however, for this project the genome scans were performed using the Sentrix HumanRef‐8 Expression BeadChip (Illumina). These arrays are particularly advantageous because the genes included on the array are drawn from well‐annotated and validated human genes from NCBI Reference Sequence. Around 24,000 genes are represented on the BeadChip. Each probe on the chip is a 50‐mer oligo, which greatly increases the sensitivity and specificity of hybridization. Figure 3 shows the percent match of the human‐specific array probes to the chimpanzee and macaque (for reference) genomes. 13,091 probes have 100% identity to the chimp genome, and around 5,500 have only one base pair mismatch. Gilad et al. (GILAD et al. 2005) ran chimpanzee, orangutan, and rhesus macaque samples on a human microarray, and then compared the results to species‐specific spotted arrays. They concluded that sequence divergence has a substantial effect on the measurement of expression levels. Therefore, the data are filtered to include only probes that have a perfect match to the chimp sequence.
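The perfect‐match filter can be sketched as follows. This is an illustrative sketch only: the probe identifiers and mismatch counts are hypothetical, and the mismatch count per 50‐mer probe is assumed to have been tabulated beforehand from the alignment of each probe to the chimpanzee genome.

```python
from typing import Dict, List


def perfect_match_probes(mismatches: Dict[str, int]) -> List[str]:
    """Return the IDs of probes whose chimpanzee hit has zero mismatches."""
    return sorted(pid for pid, count in mismatches.items() if count == 0)


# Hypothetical probe IDs mapped to mismatch counts against the chimp genome.
probe_mismatches = {"probe_A": 0, "probe_B": 1, "probe_C": 0, "probe_D": 3}

kept = perfect_match_probes(probe_mismatches)
print(kept)
```

Only the expression values for the retained probes would then be carried forward into the human versus chimpanzee comparisons.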

Figure 3: Number of base pair matches of the chimpanzee and macaque genomes to the human‐specific microarray. All the 50‐mer probes on the BeadChip were BLASTed against the chimpanzee and macaque genomes. Just over 13,000 probes have a perfect match to the chimp sequence, and around 5,500 probes have only one base pair change.

For each experiment a cell line was transfected with human CDX4 or with its chimpanzee ortholog. Figure 4 highlights the importance of comparing only within a given cell line: comparing between the two fibroblast lines is impossible because of the vastly different transcription profiles of each line. The same cell lines were mock transfected with a vector containing only the luciferase cDNA to serve as a control for any non‐specific gene expression changes that occur because of the cell’s response to the transfection process. The addition of the transfection reagents may produce physiological changes in the cell line, so all samples collected must reflect these changes. After harvesting, reverse transcription for the gene of interest was performed on all samples using primers that span splice sites, for three reasons. First, the endogenous levels of the transcription factor must be measured in the mock transfected line. Second, effective transfection and expression from the vector must be ascertained. Lastly, DNA contamination must be ruled out.

Figure 4: Comparison of the gene expression profiles of a human and chimpanzee fibroblast cell line. Scatter plot in log scale of the different transcription profiles of a human and chimpanzee cell line.

Figure 5a is a scatter plot in log scale of gene expression results from a human cell line that does not endogenously express CDX4 and has been transfected with either human CDX4 or the luciferase gene, which serves as a mock control. The plot has been normalized to reduce background noise. Only genes expressed in one or both samples are indicated on the plot. Of the 24,000 genes represented on the array, 9,402 are expressed, and CDX4 is not endogenously expressed in this cell line. The r² is 0.971, indicating an excellent correlation between the two samples. The red lines indicate a fold change greater than 2 between the two samples. Genes below the red lines are down‐regulated in the test sample compared to the mock sample, and genes above the red lines are up‐regulated in the test sample. These genes are presumably ones whose expression was changed (either directly or indirectly through downstream signaling cascades) by the expression of CDX4. Figure 5b is the scatter plot of the same cell line comparing gene expression from samples transfected with either human or chimpanzee CDX4, both cDNAs under the control of the low‐expressing thymidine kinase promoter. Again, the r² is extremely high at 0.992 (an r² greater than 0.99 is expected for hybridization replicate samples), and 9,137 genes are expressed by at least one sample.
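The 2‐fold boundaries drawn on the scatter plots correspond to a simple threshold on the log ratio of the two samples. The sketch below illustrates that calculation under stated assumptions: the gene names and intensities are hypothetical, all intensities are assumed to be positive and already normalized, and a gene exactly at 2‐fold is counted as changed.

```python
import math
from typing import Dict


def fold_change_calls(
    test: Dict[str, float], mock: Dict[str, float], threshold: float = 2.0
) -> Dict[str, str]:
    """Label genes whose test/mock intensity ratio meets the fold threshold."""
    cutoff = math.log2(threshold)
    calls = {}
    for gene in test.keys() & mock.keys():
        log2_ratio = math.log2(test[gene] / mock[gene])
        if log2_ratio >= cutoff:
            calls[gene] = "up"      # higher in the test sample
        elif log2_ratio <= -cutoff:
            calls[gene] = "down"    # lower in the test sample
    return calls


# Hypothetical normalized intensities for three genes.
test_sample = {"gene_1": 400.0, "gene_2": 50.0, "gene_3": 120.0}
mock_sample = {"gene_1": 100.0, "gene_2": 200.0, "gene_3": 100.0}
print(fold_change_calls(test_sample, mock_sample))
```

A fold‐change call of this kind is only a first screen and does not by itself establish statistical significance.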

A 2‐fold or greater change between two samples does not necessarily mean that a gene has been statistically significantly differentially regulated by CDX4. Differential expression is calculated based not only on the fold change but also on the standard deviation between replicate probes and the signal intensity of the probe, which together yield a probability statistic for a gene showing differential expression between two samples. Figure 5c is the scatter plot for a human fibroblast cell line transfected with either human or chimpanzee CDX4 and filtered to include only genes expressed by one or both samples; 9,634 genes are represented in this plot. The genes highlighted in blue are the 123 genes that are differentially expressed between human and chimpanzee CDX4 at a p‐value threshold of 0.05. Table 4 lists the 123 genes that are differentially regulated by human and chimpanzee CDX4, meaning that the expression of these genes is up‐ or down‐regulated compared both to the mock sample and to each other. Between the human and the mock sample, 433 genes are statistically significantly differentially expressed, which indicates that approximately 300 genes show no difference in expression between human and chimpanzee CDX4. These genes are over‐represented, in order of statistical significance, in the following Gene Ontology (GO) categories of molecular function: transferase activity (transferring pentosyl groups), transcription factor activity, activity, double‐stranded RNA binding, receptor signaling protein activity and Wnt receptor activity. These 123 genes are candidates for follow‐up study and validation using real‐time qPCR and luciferase reporter gene analysis.

Figure 5: Scatter plots of BeadChip microarray gene expression data a. Plot of a human fibroblast cell line transfected with luciferase cDNA or human CDX4 cDNA. Red lines indicate genes with 2‐fold or greater change in gene expression. b. Human fibroblast cell lines transfected with human and chimpanzee CDX4 cDNA. The probe representing CDX4 expression is circled in blue. c. Human fibroblast cell line transfected with human and chimpanzee CDX4 filtered for probes only expressed in one or both samples. Probes highlighted in blue are differentially expressed between the two samples (p<0.05).

Table 4: 123 genes differentially regulated by human and chimpanzee CDX4

Symbol †Differential Score CDX4 vs. mock Human CDX4 vs. Chimp CDX4

OASL 94.6286 ↑ CDX4 ↑ Human

EPSTI1 60.7151 ↑ CDX4 ↑ Human

IFIT4 58.2248 ↑ CDX4 ↑ Human

ISG20 57.2632 ↑ CDX4 ↑ Human

C1orf29 51.4009 ↑ CDX4 ↑ Human

MX2 48.5978 ↑ CDX4 ↑ Human

MDA5 48.4933 ↑ CDX4 ↑ Human

MX1 45.6254 ↑ CDX4 ↑ Human

STAT2 42.3349 ↑ CDX4 ↑ Human

RARRES3 42.2438 ↑ CDX4 ↑ Human

G1P3 41.0702 ↑ CDX4 ↑ Human

DUSP19 40.8664 ↑ CDX4 ↑ Human

WARS 40.3627 ↑ CDX4 ↑ Human

APOL2 38.3322 ↑ CDX4 ↑ Human

FLJ11000 36.4481 ↑ CDX4 ↑ Human

cig5 36.164 ↑ CDX4 ↑ Human

STAT1 35.0084 ↑ CDX4 ↑ Human

IL6 30.7833 ↑ CDX4 ↑ Human

Hes4 28.8488 ↑ CDX4 ↑ Human

APOL3 28.0653 ↑ CDX4 ↑ Human

SCN11A 27.1698 ↑ CDX4 ↑ Human

CEB1 26.35 ↑ CDX4 ↑ Human

IFI35 25.3573 ↑ CDX4 ↑ Human

SP110 25.2616 ↑ CDX4 ↑ Human

LOC375468 25.0472 ↑ CDX4 ↑ Human

ZC3HAV1 24.3856 ↑ CDX4 ↑ Human

FLJ36874 23.998 ↑ CDX4 ↑ Human

BAG1 22.6387 ↑ CDX4 ↑ Human

RIG‐I 21.7486 ↑ CDX4 ↑ Human

IFRG28 21.6431 ↑ CDX4 ↑ Human

RPS28 21.6122 ↑ CDX4 ↑ Human

BCL2L13 21.5828 ↑ CDX4 ↑ Human

PSMB9 21.43 ↑ CDX4 ↑ Human

TNFSF10 21.4004 ↑ CDX4 ↑ Human

IFI44 21.2539 ↑ CDX4 ↑ Human

STAT1 21.1494 ↑ CDX4 ↑ Human

CD44 20.9341 ↑ CDX4 ↑ Human

AP2M1 20.6951 ↑ CDX4 ↑ Human

SLC25A28 19.6275 ↑ CDX4 ↑ Human

KIAA1404 19.1137 ↑ CDX4 ↑ Human

MT2A 18.5867 ↑ CDX4 ↑ Human

LOC51064 18.5674 ↑ CDX4 ↑ Human

APOL1 18.4972 ↑ CDX4 ↑ Human

FLJ20073 18.2948 ↑ CDX4 ↑ Human

ZNF313 17.8261 ↑ CDX4 ↑ Human

OAS2 17.6991 ↑ CDX4 ↑ Human

ECGF1 17.675 ↑ CDX4 ↑ Human

RPL3 17.6697 ↑ CDX4 ↑ Human

CDX4 17.6284 ↑ CDX4 ↑ Human

BAL 17.34 ↑ CDX4 ↑ Human

IFIT2 17.2017 ↑ CDX4 ↑ Human

DDX23 16.9776 ↑ CDX4 ↑ Human

C20orf52 16.8227 ↑ CDX4 ↑ Human

FLJ11286 16.3407 ↑ CDX4 ↑ Human

MT1K 16.2837 ↑ CDX4 ↑ Human

ADAR 16.2407 ↑ CDX4 ↑ Human

APH‐1A 16.1496 ↑ CDX4 ↑ Human

AGTRAP 15.9882 ↑ CDX4 ↑ Human

FLJ20485 15.96 ↑ CDX4 ↑ Human

GMPR 15.8497 ↑ CDX4 ↑ Human

ZDHHC3 15.8278 ↑ CDX4 ↑ Human

IRF1 15.7232 ↑ CDX4 ↑ Human

LOC254531 15.6307 ↑ CDX4 ↑ Human

KIAA1228 15.5537 ↑ CDX4 ↑ Human

LOC400368 15.4595 ↑ CDX4 ↑ Human

IFITM3 15.4537 ↑ CDX4 ↑ Human

LOC134285 15.4166 ↑ CDX4 ↑ Human

CSNK2B 15.3935 ↑ CDX4 ↑ Human

G1P2 15.3627 ↑ CDX4 ↑ Human

MGC5508 15.285 ↑ CDX4 ↑ Human

C1orf2 15.1709 ↑ CDX4 ↑ Human

ARPC3 15.1475 ↑ CDX4 ↑ Human

ITGB4BP 15.1457 ↑ CDX4 ↑ Human

MAGED2 15.0668 ↑ CDX4 ↑ Human

GPR83 14.8385 ↑ CDX4 ↑ Human

COL6A1 14.7923 ↑ CDX4 ↑ Human

MT1X 14.7812 ↑ CDX4 ↑ Human

FZD4 14.5833 ↑ CDX4 ↑ Human

ZC3HAV1 14.4561 ↑ CDX4 ↑ Human

FLN29 14.4549 ↑ CDX4 ↑ Human

BAT5 14.394 ↑ CDX4 ↑ Human

COL4A1 14.187 ↑ CDX4 ↑ Human

PCTAIRE2BP 13.8579 ↑ CDX4 ↑ Human

LOC93349 13.829 ↑ CDX4 ↑ Human

LOC375752 13.7379 ↑ CDX4 ↑ Human

DTNB 13.6713 ↑ CDX4 ↑ Human

MAP4K4 13.6248 ↑ CDX4 ↑ Human

STAT1 13.5446 ↑ CDX4 ↑ Human

IFIT1 13.4507 ↑ CDX4 ↑ Human

C1orf38 13.4054 ↑ CDX4 ↑ Human

KIAA1618 13.3978 ↑ CDX4 ↑ Human

PRKD2 13.2917 ↑ CDX4 ↑ Human

TCBAP0758 13.2875 ↑ CDX4 ↑ Human

NOLA3 13.2471 ↑ CDX4 ↑ Human

ZBTB4 13.2324 ↑ CDX4 ↑ Human

SDHA 13.0807 ↑ CDX4 ↑ Human

FLJ12150 13.0227 ↑ CDX4 ↑ Human

GALNACT‐2 ‐13.0308 ↓ CDX4 ↓ Human

PPP2R5C ‐13.0941 ↓ CDX4 ↓ Human

LOC90799 ‐13.3483 ↓ CDX4 ↓ Human

FLJ32028 ‐13.4696 ↓ CDX4 ↓ Human

RNF126 ‐13.571 ↓ CDX4 ↓ Human

CAPZA2 ‐13.6123 ↓ CDX4 ↓ Human

PTPN4 ‐14.2005 ↓ CDX4 ↓ Human

LOC374876 ‐14.2058 ↓ CDX4 ↓ Human

ECT2 ‐15.1285 ↓ CDX4 ↓ Human

LGALS3 ‐15.3587 ↓ CDX4 ↓ Human

FLJ20522 ‐15.5156 ↓ CDX4 ↓ Human

ZBTB2 ‐15.5493 ↓ CDX4 ↓ Human

CSNK1E ‐15.5817 ↓ CDX4 ↓ Human

MAP4 ‐15.8737 ↓ CDX4 ↓ Human

KIAA1171 ‐15.8782 ↓ CDX4 ↓ Human

PCNA ‐16.3212 ↓ CDX4 ↓ Human

FACL2 ‐17.5251 ↓ CDX4 ↓ Human

LOC51315 ‐17.8185 ↓ CDX4 ↓ Human

HSPC135 ‐17.8347 ↓ CDX4 ↓ Human

CNOT7 ‐18.0864 ↓ CDX4 ↓ Human

XPC ‐18.4295 ↓ CDX4 ↓ Human

PLEKHA1 ‐18.6677 ↓ CDX4 ↓ Human

GTF3C1 ‐18.7737 ↓ CDX4 ↓ Human

PNAS‐4 ‐19.9233 ↓ CDX4 ↓ Human

NFIC ‐20.4256 ↓ CDX4 ↓ Human

CNN2 ‐21.1947 ↓ CDX4 ↓ Human

†Differential scores are measures of significance. |diff| = 13 corresponds to p = 0.05 and |diff| = 20 corresponds to p = 0.01
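The two reference points in the footnote to Table 4 are consistent with a differential score of the form 10 × sign × log10(p). Assuming that definition, which is not stated in the text, a score can be converted back to an approximate p‐value as sketched below; the function name is introduced here for illustration only.

```python
def diff_score_to_p(diff_score: float) -> float:
    """Approximate p-value, assuming diff = 10 * sign * log10(p)."""
    return 10.0 ** (-abs(diff_score) / 10.0)


# The two reference points given in the table footnote.
for score in (13.0, 20.0):
    print(f"|diff| = {score:g} -> p = {diff_score_to_p(score):.3f}")
```

Under this assumption the sign of the score carries only the direction of the change, so the same function applies to the up‐regulated and down‐regulated genes in the table.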

A primary chimpanzee fibroblast cell line was also transfected with human CDX4, chimp CDX4, and the mock control (Figure 6). The data were analyzed with a program known as Bayesian ANOVA for Microarrays (BAMarray) (ISHWARAN et al. 2006). BAMarray allows one sample group to be defined as the baseline, in our case the control sample that was transfected with the luciferase cDNA. Expression results from cells expressing human and chimpanzee CDX4 were each compared with control cells. BAMarray’s Scatter Plot shows differentially expressed genes that are shared between human and chimpanzee CDX4 (Pink) and that are unique to cells expressing human (Blue) or chimpanzee (Green) CDX4. The genes that fall into the blue and green categories will be further interrogated by qPCR to check for expression differences in cells transfected with the two transcription factors.
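The three color categories amount to a partition of the two lists of differentially expressed genes, each defined relative to the mock baseline. The following sketch shows only that set logic and does not reproduce the BAMarray analysis itself; the gene names are hypothetical placeholders.

```python
from typing import Dict, Set


def partition_targets(human: Set[str], chimp: Set[str]) -> Dict[str, Set[str]]:
    """Split genes changed versus mock into shared and lineage-unique sets."""
    return {
        "shared": human & chimp,       # changed by both orthologs (Pink)
        "human_only": human - chimp,   # changed only by human CDX4 (Blue)
        "chimp_only": chimp - human,   # changed only by chimp CDX4 (Green)
    }


# Hypothetical genes called as differentially expressed relative to mock.
human_vs_mock = {"gene_A", "gene_B", "gene_C"}
chimp_vs_mock = {"gene_B", "gene_C", "gene_D"}

groups = partition_targets(human_vs_mock, chimp_vs_mock)
print(sorted(groups["human_only"]), sorted(groups["chimp_only"]))
```

The genes in the two lineage‐unique sets are the ones that would be carried forward to qPCR.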

Figure 6: Chimpanzee fibroblast cell lines transfected with human or chimpanzee CDX4 cDNA. Scatter plot of differentially expressed genes that are shared for both human and chimpanzee CDX4 (Pink) and that are unique to cells expressing human (Blue) or chimpanzee (Green) CDX4.

DISCUSSION

The gene expression data for CDX4 from the genome‐wide expression array are quite promising; however, validation of target genes is necessary. A known target gene of CDX4 is HOXA5, which contains two binding sites for CDX4: one downstream, known as the mesodermal enhancer, which limits the regional expression of HOXA5, and one upstream, known as the brachial spinal cord enhancer (TABARIES et al. 2005). In the human cell line, HOXA5 expression was induced 1.4‐ to 1.5‐fold (p<0.05) by both human and chimpanzee CDX4, and there was no difference in HOXA5 expression between the two. Binding sites for CDX4 have also been predicted for HOXB5, so I examined the expression of this gene as well. In this case its expression was down‐regulated nearly two‐fold by both human and chimpanzee CDX4. Neither HOX gene is expressed in the chimpanzee fibroblast cell line, which highlights the importance of the cellular environment and the presence of cofactors for appropriate target gene expression.

Of greatest concern is the choice of an appropriate cell line for the gene expression analysis. Transcription factors rely on the binding of cofactor proteins in order to initiate or repress transcription. If the transcription factor of interest is transfected into a cell line where it is not endogenously expressed, the appropriate accessory proteins may also be unavailable and the downstream target genes will not be regulated. In addition, many of these genes, CDX4 included, are involved in early embryonic development, and their endogenous expression is limited to the embryo. The use of embryonic cells poses both technical and ethical issues. Mouse embryonic stem cells are a potentially viable solution; however, the problem of appropriate cofactor proteins exists there as well. If the protein binding partners of the transcription factor of interest are known, then a cell line in which expression of the cofactors has been verified can serve as an appropriate medium. However, in most cases cofactor proteins are not known, and a second‐best alternative is to choose a cell line that mirrors the phenotype of the transcription factor’s mutants (Table 5).

Table 5: Cell lines chosen for CDX4 gene expression assays

Cell Line    Type                                            Reason                                                   Endogenous CDX4

BeWo         Human choriocarcinoma                           Assess placental phenotype                               No
Caco‐2       Human colon adenocarcinoma                      CDX mutants produce colon tumors; Expresses receptors    No
MC3T3        Mouse embryonic calvaria preosteoblast cells    Assess embryonic skeletal phenotype                      No

Gene expression arrays will provide a wealth of candidate genes for follow‐up studies using qPCR. Since most target genes of CDX4 remain unknown, these arrays can also shed light on the downstream pathways in which this master regulator is involved, and the genes identified can be used in luciferase reporter‐gene experiments, thereby removing the need for ChIP‐chip assays. The CDX4 cDNA clones necessary for gene targeting have already been constructed, facilitating the construction of transgenic mice. The information gained about CDX4 pathway involvement can also be very useful in phenotyping transgenic animals, as it can indicate the types of changes that may take place or provide clues for the design of follow‐up biochemical and cellular assays to test the differential activity of CDX4.

Site‐directed mutagenesis can also be used both to create the transcription factor of the most recent common ancestor of humans and chimpanzees and to mutate the substituted residues back to their ancestral state. The former experiment indicates the direction of selection, that is, which lineage’s transcription factor has diverged in activity. The latter experiment gives an indication of which substitutions are the causative residues for the functional divergence.

As informative as the above experiments are in deducing the differential activity of two transcription factors, the assays all share one serious flaw: they may not be biologically relevant to human and chimpanzee evolution. Cell line systems are a non‐native environment, and functional divergence of the transcription factor cannot be directly related back to the organism. What these experiments may show, however, is that the amino acid residues that have been substituted along the human lineage have conferred the ability to differentially regulate a set of target genes.

CHAPTER 5

CONCLUSION AND FUTURE DIRECTIONS

“It is even harder for the average ape to believe that he has descended from man.”

H.L. Mencken

Between 5 and 7 million years ago, the ancestor of modern humans stood up on both legs and began the extraordinary descent of the human species. Chimpanzees and bonobos diverged around 2 million years ago to develop into their respective species, while modern humans did not emerge until around 150,000 to 200,000 years ago. Since the time of the anatomist Nicolaas Tulp, who authored the earliest known written description of an ape in 1641 and noted the similarities between chimpanzees and man, humans have been debating their place among the great apes. It was not until over two hundred years later, in 1863, that Thomas H. Huxley first suggested that man be placed in the same order as apes. More recently, Wildman et al. (WILDMAN et al. 2003) suggested that the genus Homo be expanded to include chimpanzees and bonobos, noting that “humans appear as only slightly remodeled apes.”

Since this time, humans have watched chimpanzees use and fashion tools, develop social customs and traditions, wage wars and then make peace, and develop a sense of self‐awareness, all characteristics thought of as belonging only to humans. We have taught them sign language to communicate with us and with each other and presented them with complex problems to solve, which they did quite successfully. We even dared to send them into space. Through all of this, Frans de Waal teases, “Humanity [has been] hard pressed to find ultimate proof of its uniqueness” (DE WAAL 2005).

The advent of the human and chimpanzee genome sequences has not been much kinder to the task of finding out what makes humans so unique among the great apes. The genomes of these animals are remarkably similar in both structure and content, and the nucleotide and amino acid substitutions, insertions and deletions, gene duplications, and rearrangements that do exist are mostly the noise of neutral substitution. The ultimate challenge for the modern biologist is sifting through all of this noise to find the molecular changes that contributed to the phenotypic divergence between humans and chimpanzees.

THE EVOLUTION OF MODERN TRAITS

“If evolution really works, how come mothers only have two hands?”

Milton Berle

There are certain tenets of molecular evolution that apply to the emergence of modern traits in Homo sapiens. The first is that trait evolution is non‐linear, that is, it is not a continuous process of change over time. The second is that trait evolution is non‐additive; rather, it is the combinatorial effect of many genes over a vast amount of time. The last is that modern traits are modifications of existing structures. When these tenets are applied to human evolution, what emerges is that character states evolved as a series of adaptive radiations, in which many species were born and many species died out, and the timing and lineage specificity of the appearance of modern traits are essentially unknown. In this way these changes were multiple, independent, and “superimposed” on top of each other. This makes the study of human molecular evolution more challenging, because the question asks not only what distinguishes humans from the great apes but also what distinguishes modern humans from archaic hominids. It is for this reason that I have decided to investigate only the most ancient and fundamental changes of archaic humans that distinguish them from the most recent common ancestor of humans and chimpanzees.

The causes of morphological diversity and speciation are being studied in the emerging field of Evo‐Devo, which fuses evolutionary and developmental biology. This field posits that morphological and physiological divergence are products of changes during embryogenesis and are therefore associated with genes involved in development and embryogenesis (CARROLL 2000). It is changes in the spatiotemporal regulation of expression of regulatory genes and in developmental networks that have led to morphological modification. The changes that led to the emergence of modern humans could be a small number of changes in genes with a disproportionately large effect on phenotype, or a large number of small changes in many genes.

All animals have a common “genetic toolkit” for development and maintenance of body plans (CARROLL 2000; DAVIDSON 2001). This toolkit includes many genes that are transcription factors or are involved in signaling pathways. It is also within this toolkit that one may find the origins of animal diversity, both between and within species, such as the polygenic array of genes involved in height and skin color in humans. While the transcription factors in this toolkit are conserved over huge evolutionary distances, changes in a transcription factor could potentially perturb many loci, creating a diverse range of transcriptional regulation and leading to subtle changes in animal body plans. Hsia and McGinnis (HSIA and MCGINNIS 2003b) suggest that transcription factor coding sequence substitutions would have modest effects and would not alter the entire downstream hierarchy of gene regulation.

Others, however, believe that altering a master transcription factor has potentially large and deleterious downstream effects and have suggested that it is instead changes in regulatory sequence that are responsible for morphological divergence (see CARROLL 2000 for a review). In a critical review of the Evo‐Devo literature, Hoekstra and Coyne (HOEKSTRA and COYNE 2007) caution that the emphasis on cis regulatory mutations is both unfounded and premature, with very little evidence (as yet) that these types of changes play a role in adaptation. Most likely, changes in both the structure and the regulation of genes have equally important parts in morphological adaptation. As members of the genetic toolkit and as genes that regulate the expression of other genes, transcription factors are an ideal starting point for studying the contribution of adaptive protein evolution to morphological diversity and the emergence of modern traits.

FUTURE DIRECTIONS

“If I knew what I was doing, it wouldn't be called research.”

Albert Einstein

Both technical and ethical issues arise when studying functional divergence in humans and chimpanzees. The use of primates in such a study is impractical: primates have a long generation time, and the techniques needed to test protein function are still rudimentary in this new model system. In addition, the ethics of the use of these animals is debatable. Since proteins have different activities, there is also no “gold standard” experiment to test functional changes. Researchers must therefore be creative in adapting well‐established techniques to determine functional divergence in positively selected genes.

One of the reasons transcription factors are so well studied as a gene family is their contribution to early embryonic development and (WRAY et al. 2003b). This also provides a readily assessable and biologically relevant phenotype. For this reason, their use in transgenic or gene‐targeted mice is an excellent in vivo method to examine differences between two orthologous genes. Transcription factors also have an advantage in this type of study because several assays are available that measure changes in gene expression. Global changes in gene expression can be readily measured with genome‐wide expression microarrays, as described in Chapter 4. Differences in the expression of individual genes can be assessed using reporter gene assays, such as the luciferase system, or through the use of quantitative PCR (qPCR) on a targeted cDNA molecule. Other methods, such as chromatin immunoprecipitation (ChIP) gene chips and yeast two‐hybrid screens, can be used to find the DNA binding sequence or the protein binding partner of a given transcription factor, respectively.

Picking Candidate Genes

The selection of genes for follow‐up study should be based on several criteria. The first, and most important, is that the gene must be predicted with statistical significance to be positively selected using the strict branch+site method or the empirical method described in Chapter 2. It should also maintain its significance under other parameter sets or evolutionary models. Additional predictions of positive selection using the methods described in Chapter 1 would be advantageous but are not necessary, as the signatures of selection are often obscured by evolutionary time. Second, the residues predicted to be positively selected should be in functionally relevant protein domains or motifs. Conservation of these domains (and of the specific residue) in a wider variety of species than used in the branch+site analysis is also a useful criterion for a candidate gene. The above conditions can be easily assessed using the (BATEMAN et al. 2004) models of protein domain structure. Third, the gene should have a biological function related to embryogenesis or development that may provide clues to human and chimpanzee phenotypic divergence. Genes with mutant should be given priority because they provide a known phenotype for further in vivo studies. Fourth, known DNA‐binding targets are an advantageous criterion, as they avoid the need for further downstream assays to discover the target promoters of the transcription factor. The last issue is purely technical and involves the availability of cDNA clones. Human clones are readily available for many genes; however, chimpanzee cDNA clones are scarce. Site‐directed mutagenesis may be used to construct the chimpanzee cDNA from the human clone if necessary.
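The criteria above can be summarized as one required filter followed by a ranking on the remaining, optional criteria. The sketch below is one possible formalization and not a procedure used in this work: the field names, the equal weighting of the optional criteria, and the second gene (GENE_X, with made‐up values) are assumptions for illustration. The CDX4 p‐values are those reported in Chapter 4.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    name: str
    strict_p: float                 # strict branch+site test p-value
    empirical_p: float              # empirical test p-value
    sites_in_functional_domain: bool
    developmental_function: bool
    known_dna_targets: bool
    cdna_available: bool


def passes_required(c: Candidate, alpha: float = 0.05) -> bool:
    """Required criterion: significant by the strict OR the empirical test."""
    return min(c.strict_p, c.empirical_p) < alpha


def priority(c: Candidate) -> int:
    """Count of optional criteria met (equal weights are an assumption)."""
    return sum([c.sites_in_functional_domain, c.developmental_function,
                c.known_dna_targets, c.cdna_available])


def rank_candidates(cands: List[Candidate], alpha: float = 0.05) -> List[Candidate]:
    kept = [c for c in cands if passes_required(c, alpha)]
    return sorted(kept, key=priority, reverse=True)


candidates = [
    Candidate("CDX4", 0.286, 0.031, True, True, True, True),
    Candidate("GENE_X", 0.40, 0.20, True, False, False, True),  # hypothetical
]
print([c.name for c in rank_candidates(candidates)])
```

In practice the optional criteria would probably not carry equal weight, so the ranking function is the part most in need of adjustment.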

In vivo assays

Transgenic mice have been invaluable tools for deciphering the function of genes (CAPECCHI 1980). Homologous recombination allows the investigator to knock out genes in mice, knock in new genes, or replace endogenous genes with an engineered gene or a homolog from a different species using a combination of knock‐out and knock‐in strategies. The computational and in vitro experiments previously described predict positive selection under the assumptions of adaptive and functional change. These changes may therefore be measurable in the transgenic mouse, because this system allows for the study of the much more complex gene expression networks of the entire organism, particularly during development. Currently, the mouse is the best available in vivo genetic system to study developmental change that may have led to the morphological diversity between humans and chimpanzees. The human or chimpanzee version of a positively selected transcription factor is introduced into the mouse genome or used to replace the orthologous gene in mice. The ability of these two genes to rescue the mouse null phenotype, and the extent of that rescue, will be measured to investigate functional divergence on a background of positive selection.

There is precedent for examining functional divergence using these methods. Two spermatid‐associated protein genes, Protamines 1 and 2 (Prm‐1 and Prm‐2), have been predicted to be evolving rapidly in the primates, particularly along the human and chimpanzee branches (WYCKOFF et al. 2000). Protamines are involved in the condensation of sperm chromatin. It is predicted that altered protamines affect sperm morphology and that this rapid evolution is being driven by sexual selection. Prior to that study, the avian protamine galline was knocked in to spermatids of transgenic mice (RHIM et al. 1995). The authors found that the DNA in the spermatids was not as tightly packed as in normal mice, indicating a disruption in chromatin condensation. The mice were functionally normal and fertile, although there was more rapid degradation of seminiferous tubules, leading to eventual infertility. This conservation of function with subtle differences presumably increased fitness for the organism and is the sort of change expected to be seen in this study.

Another example of functional divergence discovered through the use of transgenic mice is the DAZ (deleted in azoospermia) and DAZ‐like (DAZL) gene family, which is involved in germ cell differentiation (VOGEL et al. 2002). The DAZ family consists of Y‐linked genes present only in the apes and Old World monkeys. The ancestor of the DAZ genes is DAZL, a highly conserved autosomal gene present in vertebrates and worms. Vogel et al. used transgenic mice carrying either human DAZL or human DAZ on a mouse Dazl null background to investigate the functions of the human homologs. They found that DAZL and DAZ can only partially rescue a Dazl null phenotype, substituting for the early functions of the mouse Dazl, which results in the establishment of the germ cell population, but failing to support progression into meiosis. This partial rescue is likely due to functional divergence between the Y‐linked DAZ genes and the autosomal DAZL gene.

Transgenic mice: The term transgenic mouse is loosely used to refer to any genetically modified mouse; however, by the strictest definition, it refers specifically to the addition of a non‐endogenous gene to an animal. This can be done by inserting only the coding sequence (cDNA) of the gene to be added or by adding hundreds of kilobases of a human or chimpanzee bacterial artificial chromosome (BAC). The advantage of the BAC approach is that BACs are relatively easy to come by or construct, particularly in cases when cDNA clones are unavailable. Since many kilobases of sequence are inserted, the endogenous promoter and regulatory sequence of the inserted gene are also included, thereby increasing the likelihood of native timing and level of gene expression.

The accompanying regulatory sequence can also be this method’s potential downfall. Gene regulation relies on proper DNA and protein binding, and in a non‐native environment the appropriate cofactor proteins may not be available. Likewise, the transcription factors that are present in the mouse may bind sub‐optimally to the cis regulatory sequence in the BAC. Therefore, any phenotypic difference could originate from actual functional differences in the transcription factor or from improper gene regulation. In the same vein, BACs often include more than one gene because of their large size. In this case, it would be difficult to determine whether functional differences were due to the effects of the intended transcription factor or of a neighboring gene that was also included on the BAC.

One can also insert only the coding sequence of a gene, usually without its native promoter and regulatory sequence, into the genome of the mouse at an untargeted location. Transcription would be initiated only if the transgene inserts near a native mouse promoter (in this case, one active during embryonic development). The simplest experiment would be to add either the human or the orthologous chimpanzee cDNA to the mouse genome. The endogenous mouse transcription factor would be present to ensure that proper development occurs, and any differences seen between the two transgenic mice would be attributable to the inserted genes.

There are several disadvantages to this type of genetic modification. The largest is that it is not possible to control for insertion copy number or insertion site. The inserted gene may be introduced once or any number of times. In a study that necessitates a one‐to‐one comparison of the human and chimpanzee transcription factors, this fact alone makes this transgenic technique inappropriate for investigating functional divergence. Any phenotypic differences seen in the mouse could be due either to differences in the transcription factors’ activity or simply to gene dosage effects.

A second potential pitfall is that the insertion site for the non‐endogenous gene may actually disrupt the coding or regulatory sequence of an endogenous mouse gene. Again, it would be impossible to tell if the phenotypic changes were due to the addition of the human or chimpanzee transcription factor or the disruption (and potential inactivation) of an endogenous mouse gene. These potential problems lead to the consideration of additional methods of transgenic mouse construction, such as gene targeting.

Gene‐targeted mice: The method of gene targeting involves the replacement of the endogenous mouse gene by the human or orthologous chimpanzee transcription factor of interest. Homologous recombination will be used to ensure complete replacement of the mouse gene by the inserted gene, while also ensuring that the inserted gene is introduced as only a single copy in the mouse genome. This “knock‐out/knock‐in” strategy avoids many of the potential pitfalls of the basic “transgenic” strategy described above.

The technique uses homologous recombination to knock‐out only the coding sequence and introns of the endogenous mouse gene and replace them with a cDNA clone of the human or chimpanzee transcription factor. This technique avoids the problems of gene regulation described above because the gene is regulated by the mouse’s endogenous regulatory sequence. Some information may be lost due to the exclusion of introns, but the overall contribution is expected to be small.

The experimental plan should include four gene‐targeted constructs: a mouse knock‐out (null) only, a replacement of the null mouse by the reintroduction of the mouse cDNA, a replacement of the null mouse by the human transcription factor, and a replacement of the null mouse by the orthologous chimpanzee transcription factor. The null mouse serves as the negative control, and the mouse with the endogenous replacement serves as the positive control in its ability to completely rescue the null phenotype. The human and chimpanzee knock‐ins will be judged both on their ability to rescue the null phenotype and on any phenotypic differences witnessed between these two knock‐in mice. If the chimpanzee knock‐in resembles the mouse knock‐in, any changes in phenotype in the human knock‐in can be assumed to be due to functional divergence of the human transcription factor. However, if both replacements are different from the mouse replacement and from each other, the origin of the functional divergence may be unknown. One way to resolve this would be to construct a “knock‐out/knock‐in” mouse that contains the cDNA of the transcription factor of the most recent common ancestor of humans and chimpanzees.

One of the potential pitfalls of this method is that the gene‐targeted mouse may be inviable. The proposed transcription factors for gene targeting, such as CDX4, are early and important embryonic transcriptional regulators. Even subtle alterations in these kingpin transcription factors can have large downstream effects that may be incompatible with mouse development. However, CDX1 and CDX2 null mice are both viable and they have readily describable phenotypes such as segmental and vertebral transformations. Inviability would be an unlikely outcome, as heterologous transcription factor replacement between humans and mice, and between species as distant as yeast, has been successful in many other studies (METZGER et al. 1988; XIE et al. 2000; YANG et al. 2003).

A more likely limitation is that there is no functional difference between the human and chimpanzee transcription factors. Transcription factors are among the most well conserved proteins because of their essential functions in organismal development. Changes in activity that can be sustained would more likely have very subtle effects. It may be that the phenotypic differences are present but are too fine to discriminate by eye, and in‐depth cellular, biometric, and biochemical analysis of the transgenic mice may be necessary, such as examining tissue‐specific or temporal‐spatial global gene expression differences between the two animals or in situ hybridization of morphogen gradients in the developing mouse embryo.

Target gene discovery

Knowledge of DNA‐binding targets and protein cofactors is an important criterion for several of the in vitro methods used in follow‐up study. CDX4 is known to bind to enhancer regions both upstream and downstream of the homeobox gene HOXA5 and maintains the anterior boundary of this HOX gene, and there is a putative binding site for CDX4 upstream of HOXB5 (TABARIES et al. 2005). However, this is the extent of knowledge for CDX4 target promoters. How, then, can one discover other target promoters for these transcription factors? Similarly, the cell culture‐based assays in which CDX4 is exogenously expressed necessitate the presence of appropriate cofactor proteins so that CDX4 can properly regulate its downstream targets’ gene expression. Without knowledge of these cofactors, it is difficult to pick a cell line that is appropriate for CDX4 expression. How can one discover the cofactor proteins to which a transcription factor binds? There are two high‐throughput methods available for such studies: ChIP‐chip for discovery of DNA‐binding sites and yeast two‐hybrid assays for the detection of protein binding partners.

ChIP‐chip assays: Chromatin immunoprecipitation with microarray chips (ChIP‐chip) is a widely used tool for genome‐wide screens to find DNA‐binding targets for proteins. This technology can be applied to proteins involved in transcription, replication, recombination, and DNA repair. The first studies using ChIP‐chip technology identified individual binding sites for transcription factors (IYER et al. 2001; LIEB et al. 2001; REN et al. 2000). This assay was later successfully adapted to identify the target genes of several mammalian transcription factors (WEINMANN et al. 2002; WELLS et al. 2003).

In general, the process includes growing cells under the desired conditions and fixing them using formaldehyde, which cross‐links proteins to the DNA on which they are bound. The chromatin is then sonicated or sheared to sizes around 0.2 to 2 kb, and the protein‐DNA complexes are immunoprecipitated with antibodies specific to the protein of interest. An even better solution is to immunoprecipitate a tagged protein using an antibody specific to the tag. The cross‐links are then reversed, and the recovered DNA is amplified, fluorescently labeled, and applied to a microarray slide that contains DNA elements representing the entire genome (WU et al. 2006). This genome‐wide “reverse genetic” approach can be successfully used to isolate DNA fragments that have been specifically bound to the protein of interest.

One of the most important aspects of designing a ChIP‐chip experiment is the use of proper controls. It is important to control for experimental variation caused by sample handling, differential shearing sizes, differential labeling, and non‐specific antibody reactions (HANLON and LIEB 2004; WU et al. 2006). The best way to do this is to perform the immunoprecipitation in parallel in a cell lacking the protein of interest (if it is transfected in) or in a cell lacking the epitope‐tagged protein, so that there is no target for the antibody to bind to specifically. This control can be used to subtract the “background” noise from the test samples.

The ChIP‐chip technology is not quantitative, so it cannot be used to test if a human transcription factor and the orthologous chimpanzee protein have different binding specificities to a target DNA; however, if the assay is done simply for target discovery, this limitation is negligible. The biggest potential pitfall is the non‐specificity of DNA‐binding proteins (LEE et al. 2002a). In one study (KIRMIZIS et al. 2004), 50% of the pulled‐down targets (those that had 3‐fold enrichment compared to the control group) were false positives. Follow‐up study using ChIP‐PCR may help to tease out the true positives from the false positives.
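
The control‐based filtering described above can be illustrated with a short sketch. It is not the analysis pipeline of any of the cited studies: the probe names, the intensity values, and the function name enriched_probes are hypothetical, and the 3‐fold cutoff is simply borrowed from the enrichment threshold mentioned above. It assumes that each array probe has one signal value from the test immunoprecipitation and one from the control lacking the antibody target.

```python
def enriched_probes(ip_signal, control_signal, min_fold=3.0):
    """Return {probe: fold enrichment} for probes at or above min_fold.

    ip_signal and control_signal map hypothetical probe names to
    background-corrected intensities from the test IP and from the
    control sample that lacks the antibody target.
    """
    hits = {}
    for probe, ip in ip_signal.items():
        ctrl = control_signal.get(probe)
        if ctrl is None or ctrl <= 0:
            continue  # no usable control measurement for this probe
        fold = ip / ctrl
        if fold >= min_fold:
            hits[probe] = fold
    return hits


# Hypothetical intensities, for illustration only.
ip = {"probe_A": 900.0, "probe_B": 250.0, "probe_C": 120.0}
control = {"probe_A": 200.0, "probe_B": 240.0, "probe_C": 30.0}
hits = enriched_probes(ip, control)
print(hits)
```

Probes that pass such a filter are only candidates; as noted above, each would still need confirmation by an independent method such as ChIP‐PCR.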

Yeast two‐hybrid assays: Protein‐protein interactions are essential for virtually every transcriptional event, and they are dynamic, assembling and disassembling over time according to various stimuli. Many transcription factors have a DNA‐binding domain and an activation domain that mediates protein‐protein interactions for transcription regulation. Slight changes in this activation domain may alter its protein binding affinity. A given transcription factor may bind to a variety of different cofactors, and knowledge of these interacting proteins, particularly if they are in a well‐studied pathway, can shed light on the function of the transcription factor of interest. How, then, can one determine the cofactor proteins with which a transcription factor binds? A particularly useful and high‐throughput assay for protein‐protein interaction discovery is the yeast two‐hybrid (Y2H) system.

This system was pioneered by Fields and Song in 1989, by taking advantage of the binding and activation domains of a yeast transcription factor (FIELDS and SONG 1989). A “bait” protein is fused to the DNA‐binding domain of a yeast transcription factor, and the “prey” protein is fused to the activation domain of the same yeast transcription factor. If the bait and prey proteins bind together, the DNA‐binding domain and the activation domain of the yeast transcription factor will come into proximity of each other and activate transcription of a reporter gene. The strength of this system is that it is now possible to screen large prey libraries, making this a genome‐wide and high‐throughput assay.

If the function or transcriptional targets of a putatively positively selected gene are unknown, Y2H may be a valuable assay for functional discovery if the function or pathway of the prey protein is well‐studied. It may also be quite useful if the residues that are predicted to be positively selected lie within the activation domain, as is the case for CDX4, because these changes may alter its protein binding affinity when compared to the chimpanzee ortholog. If the human and chimpanzee transcription factors are found to bind to different proteins using Y2H, an interesting follow‐up assay would be to synthesize the transcription factor of the most recent common ancestor of the human and chimp genes to see the direction of selection.

As with any assay, Y2H comes with its own flaws. In a study of genes that have recently diverged, such as human and chimp genes, the possibility that one or the other has acquired new protein binding partners is unlikely. A more realistic scenario is one in which one of the genes has slightly more or less affinity for a set of cofactor proteins. Although one of the benefits of Y2H is that it is able to detect both weak and transient interactions, this assay is not quantitative and binding affinity cannot be measured. A second potential flaw is that fusion proteins, such as the bait and the prey, may often acquire an artificial function or fail to perform properly because of incorrect folding and conformation (MUKHERJEEM et al. 2001). Appropriate controls, such as performing the assay with a known binding target of the transcription factor, can help avoid such scenarios.

The largest criticism of Y2H is the possibility for a high number of false positives and false negatives. False positive rates have been estimated to be as high as 50% (DEANE et al. 2002). Often a mammalian protein that has been expressed in yeast has not undergone the appropriate post‐translational modifications necessary for its native interactions, therefore producing a false negative result. Conversely, these genes are also grossly over‐expressed in the yeast cell, which could lead to spurious binding. New systems that have been developed to use E. coli have recently shown much lower false positive rates (JOUNG et al. 2000). Lastly, as with any high‐throughput screen, each interaction must be verified independently by another biochemical assay such as co‐immunoprecipitation.

Luciferase Gene Expression Assays

Reporter gene assays have long been used to measure gene expression. They can measure such things as the tissue specificity of a given gene, the effect that a stimulus has on the expression of a gene, or the actions of a transcription factor. In a typical reporter gene assay, the coding sequence of the gene under study is replaced by a reporter gene, whose expression can be tracked by measuring its enzymatic, fluorescent, or bioluminescent activity. These experiments can be done both in vitro using cell lines and in vivo in whole animals.

The luciferase system is widely used in reporter assays measuring gene expression and regulation. It is very sensitive to slight changes and easily quantifiable, and exogenous expression of the luciferase gene in the host cell line is a negligible issue (WOOD 1998). The two most popular luciferase genes are from the firefly and Renilla (the sea pansy), and since the system is so prevalent, these genes have been genetically engineered for optimal use in mammalian cell lines. These two luciferase proteins use different substrates to produce light, so their expression can be assayed in the same cell.

The first study to use luciferase reporter gene assays to differentiate human and chimpanzee gene regulation was published by Huby et al. (HUBY et al. 2001). The focus of this study, however, was on cis‐regulatory sequence. Atherothrombogenic lipoprotein Lp(a) is present at much higher levels in chimpanzee plasma than in human plasma. The apo(a) gene locus primarily determines Lp(a) concentrations, so the authors examined the apo(a) promoter region to see if it was causing the varying Lp(a) levels between humans and chimps. The human and chimpanzee apo(a) promoter and 5’ flanking sequence were cloned in front of a promoterless luciferase reporter gene and transfected into a human hepatocellular liver carcinoma cell line. Huby et al. measured a 5‐fold greater activity for the chimpanzee apo(a) promoter than for the human promoter, and postulated that the sequence differences in the 5’ flanking region are responsible for this differential activity.

Heissig et al. (HEISSIG et al. 2005) conducted similar experiments, but with much less success. The promoters of twelve genes that had previously been shown by microarray analysis to be differentially expressed between humans and chimpanzees in the brain and liver were cloned in front of a promoterless luciferase gene. These vectors were then transfected into either a human neuroblastoma or a human cervical carcinoma cell line, and the ability of the human and chimp promoters to activate luciferase expression was measured. The authors found very little correlation between the in vitro promoter activity and mRNA expression levels in tissues, leading the group to suggest that numerous genetic differences between the two species are likely to be responsible for the expression difference of a single gene. It is possible that the execution of these experiments in immortalized human cell lines may account for much of the variation the authors observed. The same experiment on genes differentially expressed between human and chimp in the liver was reproduced by Chabot et al. (CHABOT et al. 2007) with much greater success. Three out of the seven genes tested recapitulated the microarray data and, for one gene, DDA3, the responsible cis regulatory changes were identified using site‐directed mutagenesis.

The aforementioned experiments use luciferase assays to test differential gene expression between humans and chimps, but they do not examine the differential ability of two transcription factors to activate a reporter gene, as is planned in this study. Geng and Vedeckis (GENG and VEDECKIS 2005) performed one such study in investigating the transcriptional regulation of the human glucocorticoid receptor (hGR) in T‐ and B‐lymphoblast cells. The hGR 1A promoter was cloned in front of a luciferase reporter gene. The ability of c‐Myb and c‐Ets family members (Ets‐1/2, PU.1, and Spi‐B) to differentially up‐ and down‐regulate transcription in lymphoblast cells was successfully measured using luciferase reporter gene assays. To date there have been no luciferase reporter assays performed to detect the differential activity of human and chimpanzee transcription factors.

For this experiment of functional divergence, transcription factors that are predicted to be positively selected will be co‐transfected into a cell with a luciferase reporter gene under the control of a promoter that is a target for the transcription factor of interest. This will be done for both the predicted human and chimpanzee genes, and the differential ability of the transcription factor to regulate reporter gene expression will be measured.

Transcription factors with a robust prediction of positive selection will be chosen for functional characterization based on two additional considerations: the availability of cDNA clones and the existence of known target genes (or sequences). These considerations are likely to be the limiting factors in these experiments. When possible, a commercially available human cDNA clone should be purchased. Depending on the size and the number of exons in a given transcription factor, it is possible to create these clones in the lab using the fusion PCR method described above. If no known targets are available from the literature, it may be possible to infer downstream targets from the genome‐wide expression array experiments previously discussed. The promoter regions of genes whose expression changes with the transfection of a given transcription factor can be scanned for putative transcription factor binding sites. Cis‐REgulatory Module Explorer 2.0 (CREME) is a web‐server at DCODE.org hosted by the Comparative Genomics Center at Lawrence Livermore National Laboratory that identifies cis‐regulatory modules, which are spatially clustered transcription factor binding sites in the promoter regions of genes (SHARAN et al. 2004). CREME uses a database of putative transcription factor binding sites that have been annotated across the human genome using evolutionary conservation with the mouse and rat genomes. This enables the discovery of binding sites that are not yet known.

A dual reporter assay should be used for functional study. This system uses firefly and Renilla luciferase reporter genes, which have two different substrates, allowing both to be measured in the same sample. The process is quite simple: the cells are co‐transfected with firefly and Renilla reporter vectors. After a 24‐hour incubation, the cells are lysed and transferred to a microwell plate. Firefly luciferase substrate (with ATP, O2 and Mg2+) is added to the lysate, and bioluminescence is measured using a plate reader. Light is measured as counts per second (CPS). Next, a reagent is added that simultaneously quenches the activity of firefly luciferase and provides the substrate for Renilla luciferase, which, in the presence of O2, oxidizes the coelenterate‐luciferin and produces light. Bioluminescence is then reread from the sample.

Transfecting two luciferase reporter genes, the experimental vector and the control vector, reduces the random variability that occurs between replicate samples and allows correction for transfection efficiency. Differences in cell density and passage number, as well as edge effects from microwell plates, can all affect the outcome and efficiency of the experiment. Likewise, the addition of a control vector can be used to monitor events that may have an effect on the health of the cell line. A reduction in expression of an experimental gene can be the result of cell death, rather than differential expression, and a control gene can distinguish between these possibilities. Lastly, the ratio of expression of the experimental gene to the control reporter in the presence and absence of a transcription factor will be used to measure differential activity.
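
The ratio calculation described above can be written out as a minimal sketch. It assumes that firefly luciferase is the experimental reporter and Renilla luciferase is the control reporter, and that each well yields one counts‐per‐second reading per reporter; the function names and the readings are hypothetical and serve only to illustrate the arithmetic.

```python
def normalized_activity(firefly_cps, renilla_cps):
    """Experimental (firefly) signal divided by control (Renilla) signal."""
    if renilla_cps <= 0:
        raise ValueError("control reporter signal must be positive")
    return firefly_cps / renilla_cps


def fold_change(with_tf, without_tf):
    """Ratio of normalized activity with and without the transcription factor.

    Each argument is a (firefly_cps, renilla_cps) pair from one sample.
    """
    return normalized_activity(*with_tf) / normalized_activity(*without_tf)


# Hypothetical plate-reader values (counts per second).
with_tf = (12000.0, 4000.0)     # reporter plus transcription factor
without_tf = (6000.0, 4000.0)   # reporter alone
print(fold_change(with_tf, without_tf))
```

Because both readings come from the same lysate, a drop in the control reading alongside the experimental reading points to cell death or poor transfection rather than differential regulation.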

Experimental design: The ultimate goal of these luciferase experiments is to demonstrate that a human transcription factor and its chimpanzee ortholog have differential abilities to regulate a given target gene, and that this differential regulation is due to adaptive protein evolution of the human transcription factor. In order for a transcription factor to be considered functionally divergent based on luciferase data, it must meet certain experimental criteria. First, target gene expression (the luciferase reporter gene) must be different in human and chimpanzee cells transfected with the human or chimpanzee transcription factor, respectively. Second, the observed differences in target gene expression must not be due to changes in the target gene’s cis‐regulatory sequence. Third, the differential expression must not be due to the availability (or lack) of species‐specific transcriptional cofactor proteins. Luciferase experiments testing for all three possibilities must be done in conjunction to determine the origin of the differences of target gene expression. Table 3 shows the necessary transfections and co‐transfections needed to perform the luciferase promoter‐reporter gene experiments with the proper controls.

Table 1: Luciferase promoter‐reporter tests with controls

Sample Number                    1  2  3  4  5  6  7  8  9
Human Transcription Factor       +  -  +  -  -  -  -  -  -
Human Target Promoter            +  +  -  -  -  -  +  +  -
Chimpanzee Transcription Factor  -  -  -  -  +  -  +  -  -
Chimpanzee Target Promoter       -  -  +  +  +  +  -  -  -

The experiment in Test A (Table 4) examines whether a target gene is differentially expressed in human or chimpanzee cells under the control of a species‐specific transcription factor. These expression differences could be caused by changes in the cis regulatory sequence of the target gene, which could alter the affinity of binding by the transcription factor. Test B investigates this possibility by swapping the human and chimpanzee target genes. If the human transcription factor and the chimpanzee target gene produce the same results as the chimpanzee transcription factor and chimpanzee target gene, then the difference seen in the initial experiment is likely due to changes in the regulatory sequence in the promoter region of the target gene.

Table 2: List of experiments and controls associated with luciferase promoter‐reporter analysis

Experimental Question                                                  Compare           Test Name
Does human set work?                                                   1 vs. 2           control
Does chimp set work?                                                   5 vs. 6           control
Does human transcription factor interact with chimp target?            3 vs. 4           control
Does chimp transcription factor interact with human target?            7 vs. 8           control
Do human and chimp differentially express the target promoter?         1 vs. 5           A
Is differential regulation due to changes in cis regulatory sequence?  1 vs. 3; 5 vs. 7  B
Human with no transcription factor                                     2, 4              control
Chimp with no transcription factor                                     6, 8              control
Background control                                                     9                 background
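
As a cross‐check of the design, the sample layout and the comparisons in the two tables above can be encoded and compared programmatically. The sketch below is only an illustration: the assignment of the two "Target Promoter" rows to the human and chimpanzee promoters is a reading of the sample table, and all identifiers (COMPONENTS, DESIGN, present, differing_components) are hypothetical names introduced here.

```python
# Components of each transfection, in the row order of the sample table.
COMPONENTS = ("human_tf", "human_promoter", "chimp_tf", "chimp_promoter")

# 1 = component present in the sample, 0 = absent (samples 1 to 9).
DESIGN = {
    1: (1, 1, 0, 0),
    2: (0, 1, 0, 0),
    3: (1, 0, 0, 1),
    4: (0, 0, 0, 1),
    5: (0, 0, 1, 1),
    6: (0, 0, 0, 1),
    7: (0, 1, 1, 0),
    8: (0, 1, 0, 0),
    9: (0, 0, 0, 0),
}


def present(sample):
    """Return the set of components transfected in a sample."""
    return {name for name, flag in zip(COMPONENTS, DESIGN[sample]) if flag}


def differing_components(a, b):
    """Return the components present in exactly one of the two samples."""
    return present(a) ^ present(b)


# Test A compares samples 1 and 5; Test B compares 1 vs. 3 and 5 vs. 7.
for a, b in [(1, 2), (5, 6), (3, 4), (7, 8), (1, 5), (1, 3), (5, 7)]:
    print(a, "vs.", b, sorted(differing_components(a, b)))
```

Listing the differing components makes the logic of each comparison explicit: the control pairs differ only in the transcription factor, whereas the Test B pairs differ only in the origin of the target promoter.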

Preliminary data: Human and chimpanzee CDX4 cDNA clones have been created using the fusion PCR method described above. This three‐exon gene has been ligated into a pCMV‐SPORT6 expression vector. Tabaries et al. (TABARIES et al. 2005) published a study implicating HOXA5 as a target gene for CDX4. A 2.1 kb mesodermal enhancer (MES) is located 3’ to HOXA5. A 164 bp DNA fragment within the MES binds CDX4 in vitro at two conserved domains as determined by electrophoretic mobility shift assays. Transgenic mice containing a lacZ‐expressing construct under the control of the MES demonstrate that CDX4 represses HOXA5 and determines its expression boundaries. A portion of the MES from humans and chimpanzees containing the CDX4 binding sites has been amplified by PCR. The human and chimp HOXA5 enhancer regions are 99% identical, and the 164 bp CDX4 binding site is perfectly conserved. Appropriate cell lines are essential for this experiment because the proper cofactor proteins are necessary for CDX4 to regulate transcription. CDX binding partners are currently unknown, so cell lines were chosen that mirrored phenotypes in some of the CDX4/CDX1 and CDX4/CDX2 double mutants (Table 5).

microRNA screens: Relatively new entrants into the complex mix of evolutionary mechanisms that regulate gene expression are microRNAs (miRNAs). These 21‐23 nucleotide long RNA molecules primarily serve to repress or down‐regulate gene expression through sequence‐specific base‐pairing. These non‐coding RNAs were first described in 1993 by Lee et al., who discovered these antisense RNAs in Caenorhabditis elegans, a roundworm, as regulators of postembryonic development (LEE et al. 1993). Mature miRNAs primarily regulate gene expression through repression of translation, cleavage of mRNAs, mRNA deadenylation, or by altering mRNA stability (reviewed in ENGELS and HUTVAGNER 2006; PILLAI et al. 2007).

MicroRNAs play a number of roles in animal morphogenesis and development as well as in the development of genetic disorders, such as cancer and cardiac disease. They are involved in a variety of processes such as embryogenesis and differentiation, tissue morphogenesis and organogenesis, regulation of developmental timing, body patterning, complex signaling cascades, apoptosis, and metabolism (reviewed in ALVAREZ‐GARCIA and MISKA 2005; KLOOSTERMAN and PLASTERK 2006). The conservation of miRNA families across highly divergent species supports the important role that these RNAs play in development. For instance, around 40% of the miRNA families of C. elegans are also conserved in humans (LIM et al. 2003). With regard to individual miRNAs, rather than miRNA families, the story appears to be different, as the individual miRNA genes are not always as well conserved. Berezikov et al. used phylogenetic shadowing to examine miRNA conservation in mammals and found that a large proportion of primate‐specific miRNAs have no mouse counterparts (BEREZIKOV et al. 2005). Likewise, Smalheiser and Torvik identified several miRNA genes in highly repetitive and rapidly evolving regions of the genome, such as within transposable elements (SMALHEISER and TORVIK 2005).

Chen and Rajewsky (CHEN and RAJEWSKY 2007) postulate that rather than having a fundamental role in the development of body plans, miRNAs serve as a back‐up mechanism for gene repression and down‐regulation, and that they are also necessary for “noise reduction” within an organism’s already differentiated state. Therefore, the effect of altering a miRNA gene is likely to be a subtle alteration of the timing, spatial patterning, or tissue specificity of gene expression. Interestingly, in a study of the targets of microRNAs in mammals, Lewis et al. found that the predicted regulatory targets of miRNAs were genes involved in the regulation of transcription (LEWIS et al. 2003).

These small non‐coding RNAs are also predicted to regulate the expression of hundreds of mammalian genes at a time, indicating that it is possible to have a large‐scale change in gene repression without many sequence changes. Krek et al. examined common targets for miRNAs and found that each mammalian miRNA targets around 200 different transcripts (KREK et al. 2005), and Xie et al. estimate that miRNAs regulate at least 20% of human genes (XIE et al. 2005). Taking into consideration that approximately 2% of known human genes encode microRNAs (ALVAREZ‐GARCIA and MISKA 2005), the potential downstream effects of altering a miRNA could be quite large, and miRNAs would therefore be an effective target of natural selection.

The identification of microRNAs relies largely on RNA‐folding algorithms that look for hairpin structures capable of being processed into miRNA (reviewed in BENTWICH 2005; BEREZIKOV et al. 2006). This search is then filtered to include regions of high conservation between species, as conservation is often an indication that an element is functionally important. The application of an additional filter, one in which the human non‐coding miRNA sequence has diverged relative to other primates, would be a first step in identifying miRNA elements that might have diverged along the human lineage. Follow‐up studies for functional divergence of the miRNA would be necessary to validate whether the changes were functionally important.
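
One simple way such a divergence filter might be implemented is sketched below, under the assumption that aligned sequences of equal length are available for human, chimpanzee, and an outgroup primate. A position is assigned to the human lineage by parsimony when the chimpanzee and outgroup bases agree with each other and differ from the human base. The function name and the toy sequences are hypothetical and are not real miRNA sequences.

```python
def human_specific_substitutions(human, chimp, outgroup):
    """Return 0-based alignment positions changed on the human lineage.

    A position qualifies when chimp and outgroup agree with each other
    and differ from human; columns containing a gap ('-') are skipped.
    """
    if not (len(human) == len(chimp) == len(outgroup)):
        raise ValueError("sequences must be aligned to the same length")
    positions = []
    for i, (h, c, o) in enumerate(zip(human, chimp, outgroup)):
        if "-" in (h, c, o):
            continue
        if c == o and h != c:
            positions.append(i)
    return positions


# Toy aligned sequences, for illustration only.
human = "ACGUUCGAAC"
chimp = "ACGUACGAAG"
outgroup = "ACGUACGAAC"
print(human_specific_substitutions(human, chimp, outgroup))
```

In this toy alignment only position 4 is assigned to the human lineage; the difference at the last position is attributed to the chimpanzee lineage because the human and outgroup bases agree.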

Investigating selection around a phenotype: Genome scans for positive selection are a useful exercise for gathering candidate genes that may have contributed to the phenotypic divergence of the human species. However, because of their size, an enormous amount of information is lost, and quality control must be rigorous to avoid a high false positive rate. One is also faced with the possibility that the results may not be sufficient to answer the big question about the sorts of genic changes that occurred along the human lineage.

Perhaps a better methodology would be to work from the phenotype up; that is, to choose a phenotypic difference between humans and chimpanzees, and devise a way to study the molecular mechanisms that account for that change. By collecting knowledge of the genes that control a given phenotype, one can ask specifically which genes have undergone positive selection for a specific trait, and directly apply both the branch+site test and the empirical test to a smaller cohort of genes.

A second way to proceed might be to choose a mutant phenotype and work out the identity and regulation of the genes controlling the phenotype. Genes whose mutants confer either a gain or a loss of function are ideal candidates for studies of selection, as it has already been shown that the gene has an effect on a particular trait of interest. The ability (or inability) of the wildtype protein, the chimpanzee protein, or the protein of the last common ancestor of humans and chimpanzees to rescue the mutant phenotype could be an important test of functional divergence of a positively selected gene.

POSITIVE SELECTION ALONG THE HUMAN LINEAGE

“I would not be ashamed to have a monkey for my ancestor, but I would be ashamed to be connected with a man who used great gifts to obscure the truth.”

Thomas Henry Huxley

During my investigations of transcription factor genes for evidence of positive selection, I came upon the unfortunate fact that predicting selection in species as closely related as humans and chimpanzees is a difficult process. Divergence times as recent as 5 to 7 million years ago produce short branch lengths between the two species; that is, the accumulation of nucleotide substitutions is small. This leaves very little signal for the currently available phylogenetic methods to detect the signatures of positive selection within the human genome.

The maximum likelihood methods implemented in PAML detect only the most extreme cases of positive selection; weaker selection remains undetected by these methods. Of the 175 transcription factors sequenced, only 3 garnered predictions with p<0.05. In an effort to increase the sensitivity of phylogenetic methods to detect the ancient and fundamental changes associated with human divergence, I have developed a new empirical method for detection of positive selection in closely related species. This method is based on gene‐specific evaluation of the empirical distribution of ML results using simulated neutrally evolving sequences, and it is particularly useful for examining genic sequences with short branch lengths. Using the empirical test, eight genes were predicted to be positively selected (p<0.05). These transcription factors are candidates for follow‐up study to demonstrate functional divergence of the human protein. This statistical test can be a powerful adjunct to traditional tests of positive selection along the human lineage, as it adds power to the predictions by accommodating recent divergence. Future studies of adaptive protein evolution in recently diverged organisms will benefit from this technique because it will aid in uncovering genes overlooked by traditional maximum likelihood methods.
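The idea of judging a gene against its own simulated neutral distribution can be sketched as follows. This is a minimal illustration, not the thesis's implementation: the function name `empirical_p_value`, the use of likelihood ratio statistics as the "ML results", and the add‐one Monte Carlo estimator are all assumptions made for the example.

```python
from typing import Sequence


def empirical_p_value(observed_stat: float, null_stats: Sequence[float]) -> float:
    """Estimate a p-value for one gene from its simulated neutral replicates.

    observed_stat: test statistic from the real alignment (assumed here to be
                   a likelihood ratio statistic).
    null_stats:    the same statistic computed on sequences simulated under
                   neutral evolution for this gene.

    Assumption: the add-one Monte Carlo estimator (k + 1) / (n + 1) is used,
    which may differ from the calculation used in the thesis.
    """
    if len(null_stats) == 0:
        raise ValueError("at least one simulated replicate is required")
    # Count neutral replicates at least as extreme as the observed value.
    exceed = sum(1 for s in null_stats if s >= observed_stat)
    return (exceed + 1) / (len(null_stats) + 1)


# Hypothetical example: four neutral replicates, one of which exceeds 5.0.
p = empirical_p_value(5.0, [1.0, 2.0, 3.0, 6.0])
```

Because the null distribution is built per gene, the threshold adapts to that gene's length, composition, and branch lengths rather than relying on a single asymptotic distribution.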

It is important to know which genes have undergone positive selection along the human lineage, not only to satisfy man’s inherent curiosity about our origins, but also because of the discovery of important genetic, biological, and medical similarities between humans and our closest relative. Many genes that have been predicted to be positively selected also have strong disease correlations, and these have led to insights into the molecular mechanisms of such diseases.

There are so many questions left to be answered in the field of human evolution. What physiological changes occurred that led to the development of language and the increased size and more complex architecture of the human brain? What were the molecular changes that led to the development of a larger brain? Did modern humans evolve out of a single population in Africa, or did modern humans evolve simultaneously within geographically separated populations? What role do microRNAs play in the evolution of a genome, particularly the human genome? Was there ancient admixture between proto‐humans and proto‐chimpanzees? Was there recent admixture between modern humans and archaic human populations such as the Neanderthals?

The techniques needed to answer these questions are themselves continually evolving to adapt to recent advances in data collection and mining. In this doctoral dissertation, a novel method was developed to investigate positive Darwinian selection along the human lineage, one that can handle the short branch lengths observed between human and chimpanzee genes. Through the combination of methods developed in such studies, many of these questions will one day be answered, and the mystery of the origins of modern Homo sapiens will begin to be unraveled.

APPENDIX

Table A1: codeml results from 175 Transcription Factors

Columns, left to right: Gene Symbol; p‐values (Empirical test; MA vs. MAnull; MA vs. M1a; MH vs. MHnull; MH vs. M0); ωH†; ωO†. Rows with only four p‐values have no empirical test entry.

ALX4 1 1 0.013518 0.423711 0.0001 0.0488

APEX1 1 0.386741 0.507122 0.247034 0.3777 0.0659

ASCL2 0.383329 0.606531 0.359397 0.708281 0.3314 0.2182

ATOH1 1 1 1 1 1.0046 0.1013

BAPX1 1 1 0.003591 0.431047 0.0002 0.1003

BGLAP 1 0.999 0.423711 0.423711 0.9998 0.3377

BLZF1 0.10526 0.265471 0.084585 0.182149 0.018375 999 0.1458

BNC1 0.52462 0.75824 0.78810 0.077827 1.3221 0.1694

CDX1 0.124579 0.303215 0.289384 0.303215 0.020137 999 0.1042

CDX4 0.031 0.285652 0.527292 0.119795 0.020369 999 0.3584

CITED2 0.124 0.423711 0.028439 0.423711 0.009971 999 0.0381

CNOT2 1 1 0.250592 0.841481 0.0001 0.0082

CREM 1 1 1 1 1.0034 0.1517

DLX3 1 1 1 1 1.0004 0.0245

DLX4 1 1 0.017757 0.285652 0.0001 0.1094

DLX6 1 1 1 1 0.9888 0.0047

DMBX1 1 0.771052 0.507122 0.470842 0.3785 0.1303

DMRT3 0.037255 0.488422 0.005042 0.806496 6.14E‐05 0.8593 0.084

DMRTA1 1 1 0.348202 0.777297 0.5145 0.421

DSCR1 1 0.103312 0.841481 0.042883 0.7612 0.0759

E4F1 0.841481 0.199888 0.208761 0.019017 0.4125 0.0903

EGR3 1 1 1 1 1.001 0.0102

EHF 0.10188 0.470842 0.139457 0.462433 0.046594 313.043 0.0478

ELF5 0.45426 0.17552 0.45426 0.062115 999 0.0675

ELK4 0.042 0.214618 0.092551 0.208761 0.046044 999 0.2698

EPAS1 1 1 0.000652 0.145387 0.0001 0.0869

ESR1 1 0.022148 0.059906 0.035106 0.293 0.0716

ETV3 0.887537 0.748264 0.371093 0.402784 0.4029 0.1721

EVX1 1 0.304221 0.165857 0.134481 0.235 0.0473

FLJ21616 1 1 1 1 0.9976 0.0462

FMNL2 1 1 0.000257 0.197603 0.0001 0.064

FOXI1 0.27666 0.806496 0.477114 0.187139 0.063636 0.3321 0.0832

FOXJ2 0.084639 0.240101 0.025223 0.236724 0.010198 75.7566 0.0986

FOXN1 1 1 0.537603 0.034698 0.6287 0.1288

FOXP2 1 0.014408 1 0.003591 1.0109 0.0172

FUBP3 1 1 0.438578 0.047715 999 0.1245

GATA3 1 0.802519 0.000336 0.548506 0.0395 0.0191

GATA4 1 1 3.58E‐06 0.082244 0.0001 0.0523

GBX2 0.25522 0.527089 0.049787 0.537603 0.019235 999 0.0143

GJA10 1 1 1 1 0.9999 0.4192

GSC 0.654721 0.394554 0.028792 0.197603 0.0709 0.0111

GTF2E1 0.806496 0.810584 0.06287 0.179712 0.2771 0.1053

GTF2H1 0.438578 0.109701 0.438578 0.048865 999 0.0585

HAND1 1 1 0.003551 0.269361 0.0001 0.0709

HDAC4 1 1 1.68E‐05 0.307821 0.0746 0.0355

HDAC9 1 1 0.00713 0.671373 0.0851 0.1334

HES4 1 1 0.023981 0.303672 0.0001 0.0739

HES6 1 0.57695 0.072744 0.488422 0.2143 0.1176

HES7 1 1 0.038112 0.689157 0.1085 0.1735

HESX1 0.18 0.423711 0.18452 0.423711 0.187139 999 0.2688

HIF1A 1 1 1 1 0.9957 0.1047

HMG20A 1 1 0.030873 0.402784 0.0001 0.0901

HOXA3 0.559829 0.29523 1 0.012349 0.9937 0.0687

HOXA5 0.83333 1 0.09442 0.527089 0.029469 0.4446 0.0288

HOXA7 1 0.625002 0.359397 0.462433 0.2603 0.0899

HOXB1 0.431047 0.08291 0.423711 0.104204 999 0.1346

HOXB3 1 0.364219 0.423711 0.22693 0.2854 0.0448

HOXB5 1 1 0.806496 0.61012 1.0166 0.0472

HOXB7 0.708281 0.802519 0.654721 0.194924 0.5623 0.1173

HOXC10 1 1 1 1 1.01 0.0781

HOXD10 1 1 1 1 1.0447 0.0802

HOXD4 0.076172 0.289918 0.021068 0.289918 0.012074 999 0.0819

HOXD8 0.537603 0.826959 0.303215 0.220671 999 0.1651

HTATIP2 1 1 0.122898 0.389661 0.0001 0.1959

IL10 0.389661 0.690734 0.488422 0.294266 999 0.3267

IL31RA 1 0.860708 0.105524 0.431047 0.3686 0.596

ILF2 1 1 1 1 1.0031 0.0119

IPF1 1 1 1 1 1.0003 0.0818

IRF7 1 1 0.205903 0.537603 0.3361 0.1981

IRX6 0.63904 0.256661 0.887537 0.08858 1.1721 0.2288

ISGF3G 0.01249 0.024478 0.671373 0.488422 0.6853 0.3812

IVNS1ABP 1 1 0.093096 0.462433 0.0001 0.0999

KLF10 0.089987 0.22693 0.038774 0.223775 0.022073 999 0.1637

KLF3 1 1 1 1 1.0014 0.0375

LHX1 1 1 0.236724 0.020842 0.1665 0.0026

LHX3 1 1 1 1 1.0882 0.1367

LHX6 1 1 1 1 1.0003 0.0253

LMO1 1 1 1 1 1.011 0.0465

MBD1 1 1 0.001565 0.383329 0.0001 0.1995

MEF2A 1 0.71177 0.22693 0.708281 0.1895 0.1165

MEF2B 1 1 0.129374 0.332278 0.0001 0.2754

MEF2D 0.654721 0.588605 0.161513 0.348202 0.1917 0.0516

MEOX1 0.185185 0.298698 0.074274 0.294266 0.026242 999 0.1268

MGC15631 1 1 0.076958 0.462433 0.0001 0.0838

MLL 1 0.818731 2.21E‐05 0.654721 0.1632 0.1402

MLLT10 1 1 0.043911 0.507122 0.1729 0.0946

MSX2 1 1 0.094264 0.507122 0.0001 0.0785

MYB 0.41656 0.170333 0.41656 0.079261 999 0.104

MYBL1 1 1 1 1 1.0049 0.1329

MYC 1 0.453845 0.337475 0.298698 0.2407 0.0515

MYOD1 1 1 1 1 1.0012 0.1014

NAB2 1 1 0.108222 0.61012 0.0001 0.0456

NANOGP1 0.507122 0.802519 0.247034 0.250592 999 0.9998

NFE2L2 1 1 1 1 1.0125 0.1907

NFKBIA 1 0.40657 0.409587 0.332278 0.3051 0.0733

NHLH1 1 1 1 1 1.7576 0.0421

NKX2‐4 0.61012 0.76338 0.583882 0.053124 999 0.0282

NKX6‐1 1 1 0.061369 0.559829 0.0001 0.036

NOTCH2 1 0.904837 4.62E‐05 1 0.0737 0.0861

NR0B2 0.887537 0.631284 0.806496 0.285652 0.7572 0.2147

NR1I3 1 1 1 1 1.0003 0.2311

NR2C1 0.308642 0.192288 0.007907 0.157299 0.001119 999 0.0891

NR2E1 1 1 0.022587 0.671373 0.0001 0.0171

NR2E3 0.624206 0.726149 0.777297 0.003253 0.7918 0.0725

NR3C1 1 0.571209 0.806496 0.348202 0.7703 0.2508

NR3C2 0.069091 0.322199 0.027052 0.507122 0.001071 1.9675 0.1021

NR5A2 1 1 1 1 1.0021 0.1816

NR6A1 0.516937 0.186374 0.470842 0.095448 0.353 0.0242

NT5C3 1 0.99005 1 1 1.0221 0.0528

OASL 1 1 0.211665 0.067615 999 0.3802

OVOL1 1 0.070651 0.729034 0.026548 0.5421 0.0292

PAX7 1 1 0.000239 0.806496 0.0516 0.0405

PEG3 0.1765 1 0.067206 1 0.02846 0.9442 0.2323

PGR 0.029 0.470842 0.072079 0.887537 0.010667 0.8928 0.1968

PITX3 1 1 0.06928 0.806496 0.0001 0.0082

POU2AF1 0.396144 0.154124 0.396144 0.061369 999 0.0908

PQBP1 1 1 0.094264 0.497624 0.0001 0.084

PRRX2 1 1 1 1 1.0662 0.048

PTTG1 1 0.968507 0.57768 1 0.4494 0.4485

Q86W11 (HUMAN) 1 0.810584 1 0.689157 1.0429 0.6601

RBM9 0.041147 0.184625 0.410656 0.214618 0.090808 999 0.3998

REL 1 0.960789 0.087488 0.887537 0.2108 0.2398

RELA 1 0.904837 1 0.559829 2.1228 0.0959

RFXAP 1 1 0.073638 0.289918 0.0001 0.1934

RING1 0.548506 0.802519 0.537603 0.109599 0.4165 0.0333

RREB1 1 1 0.00242 0.527089 0.1742 0.1214

RUNX1 1 0.733447 0.096648 0.337475 0.1708 0.0469

SALF 1 1 0.047715 0.438578 0.242 0.4269

SALL2 1 0.214381 0.285652 0.106864 0.4617 0.1407

SAP30 1 1 1 1 0.9974 0.0531

SCAND1 1 1 1 1 0.9999 0.3366

SCAND2 0.057089 0.163654 0.068442 0.104204 33.2821 1.189

SCML1 0.07457 0.230139 0.039558 0.223775 0.277356 999 1.2786

SCML2 1 0.214381 0.777297 0.431047 0.6874 0.2669

SIM1 1 1 0.105524 0.61012 0.0001 0.0485

SIX1 1 1 1 1 0.9997 0.0071

SLC26A10 1 0.755784 0.548506 0.559829 0.4274 0.1871

SLC26A3 0.182149 0.329559 0.4795 0.05778 0.7185 0.3003

SLC38A2 1 1 1 1 1.0021 0.1075

SMAD1 1 1 0.106864 0.777297 0.0001 0.0184

SOHLH2 <0.001 0.023113 0.010673 0.022848 0.001925 999 0.4838

SPIC 1 1 0.147299 0.671373 0.1799 0.3051

SRF 0.4795 0.160414 0.470842 0.06287 999 0.0622

T 0.124214 0.322199 0.068563 0.317311 0.006309 999 0.0515

TAF11 0.034364 0.236724 0.030807 0.236724 0.0188 999 0.1377

TBX10 1 1 0.027486 0.61012 0.099 0.1727

TBX15 1 1 0.02717 0.559829 0.0001 0.0375

TBX18 1 1 0.025642 0.402784 0.0001 0.0757

TBX6 1 0.869358 0.192288 0.729034 0.2076 0.1338

TCEAL1 1 1 1 1 1.0378 0.3589

TFEC 1 1 0.102901 0.285652 0.0001 0.2792

TGIF2LX 0.257899 0.527292 0.887537 0.841481 0.8666 1.0603

UBP1 0.887537 0.013705 0.887537 0.011672 1.1751 0.0789

XBP1 1 1 0.14924 0.281466 0.0001 0.4408

ZBTB25 1 1 0.516937 0.624206 0.3938 0.1924

ZBTB32 1 1 0.141645 0.671373 0.2636 0.3895

ZFP37 <0.001 0.031601 0.010889 0.030163 0.000145 999 0.2371

ZFP95 1 0.932394 0.172625 0.806496 0.2733 0.2127

ZHX1 0.470842 0.162026 0.462433 0.087488 999 0.0943

ZIC1 1 1 1 1 1.0129 0.0037

ZNF154 0.887537 0.951229 0.527089 0.777297 0.3905 0.254

ZNF16 1 1 0.026242 0.887537 0.1797 0.1691

ZNF174 1 1 0.047715 0.75183 0.1903 0.2517

ZNF192 0.100791 0.409587 0.26982 0.654721 0.03902 1.6048 0.2101

ZNF24 0.431047 0.657047 0.438578 0.040424 999 0.049

ZNF287 0.596701 0.869358 0.289918 0.033501 999 0.1538

ZNF38 1 1 1 1 0.9998 0.142

ZNF394 1 1 0.170334 0.034294 999 0.3282

ZNF397 1 1 0.022073 0.15939 0.0001 0.2377

ZNF449 1 1 1 1 0.9997 0.146

ZNF483 1 1 0.431047 1 0.3078 0.2786

ZNF96 1 1 0.161513 0.841481 0.287 0.3547

ZNRD1 1 1 1 1 0.9793 0.2192

ZSCAN4 1 1 1 0.61012 0.984 0.5594

† ω = dN/dS ratio; ωH = Human lineage dN/dS; ωO = Background lineage dN/dS
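The model comparisons in Table A1 (for example MA vs. MAnull) are likelihood ratio tests between nested codeml models. A minimal sketch of how such a p‐value is commonly obtained from two log‐likelihoods is given below, assuming the standard chi‐square approximation. The function names and the log‐likelihood values in the usage line are hypothetical, and the degrees of freedom appropriate to each comparison are an assumption the reader must supply.

```python
import math


def lrt_statistic(lnl_null: float, lnl_alt: float) -> float:
    """Twice the log-likelihood difference between nested models.

    Small negative differences from optimizer noise are clamped to zero.
    """
    return max(0.0, 2.0 * (lnl_alt - lnl_null))


def lrt_p_value(statistic: float, df: int) -> float:
    """Upper-tail chi-square probability for df = 1 or df = 2.

    Closed forms are used so that only the standard library is needed:
      df = 1: erfc(sqrt(x / 2))
      df = 2: exp(-x / 2)
    """
    if statistic < 0:
        raise ValueError("statistic must be non-negative")
    if df == 1:
        return math.erfc(math.sqrt(statistic / 2.0))
    if df == 2:
        return math.exp(-statistic / 2.0)
    raise ValueError("this sketch only handles df = 1 or df = 2")


# Hypothetical log-likelihoods for a null and an alternative model.
p = lrt_p_value(lrt_statistic(-1234.5, -1232.0), df=1)
```

With short human branches the statistic is often exactly zero, which yields p = 1 under this approximation regardless of the degrees of freedom.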

Table A2: Sensitivity analysis: Factors that affect predictions of positive selection

Columns, left to right: G+C Class*; Test†; Background dN/dS; Foreground dN/dS; Human Branch Length (BH); Percent p<0.05

1 Human branch length‐ neutral 0.05 1 BH = 0.005 0.6

1 Human branch length‐ positive selection 0.05 10 BH = 0.005 0.6

1 Human branch length‐ neutral 0.05 1 BH = 0.01 1.4

1 Human branch length‐ positive selection 0.05 10 BH = 0.01 7.6

1 Human branch length‐ neutral 0.05 1 BH = 0.0175 2.1

1 Human branch length‐ positive selection 0.05 10 BH = 0.0175 23

1 Human branch length‐ neutral 0.05 1 BH = 0.025 1.9

1 Human branch length‐ positive selection 0.05 10 BH = 0.025 32.88

1 Human branch length‐ neutral 0.05 1 BH = 0.0325 3.09

1 Human branch length‐ positive selection 0.05 10 BH = 0.0325 48.3

2 Human branch length‐ neutral 0.05 1 BH = 0.005 0.5

2 Human branch length‐ positive selection 0.05 10 BH = 0.005 1.8

2 Human branch length‐ neutral 0.05 1 BH = 0.01 0.7

2 Human branch length‐ positive selection 0.05 10 BH = 0.01 9.91

2 Human branch length‐ neutral 0.05 1 BH = 0.0175 2

2 Human branch length‐ positive selection 0.05 10 BH = 0.0175 29.73

2 Human branch length‐ neutral 0.05 1 BH = 0.025 2.9

2 Human branch length‐ positive selection 0.05 10 BH = 0.025 36.49

2 Human branch length‐ neutral 0.05 1 BH = 0.0325 2.77

2 Human branch length‐ positive selection 0.05 10 BH = 0.0325 46.85

3 Human branch length‐ neutral 0.05 1 BH = 0.005 0.1

3 Human branch length‐ positive selection 0.05 10 BH = 0.005 1.8

3 Human branch length‐ neutral 0.05 1 BH = 0.01 1.4

3 Human branch length‐ positive selection 0.05 10 BH = 0.01 7.21

3 Human branch length‐ neutral 0.05 1 BH = 0.0175 2.1

3 Human branch length‐ positive selection 0.05 10 BH = 0.0175 27.23

3 Human branch length‐ neutral 0.05 1 BH = 0.025 3.3

3 Human branch length‐ positive selection 0.05 10 BH = 0.025 43.24

3 Human branch length‐ neutral 0.05 1 BH = 0.0325 3.1

3 Human branch length‐ positive selection 0.05 10 BH = 0.0325 42.34

4 Human branch length‐ neutral 0.05 1 BH = 0.005 0.3

4 Human branch length‐ positive selection 0.05 10 BH = 0.005 0.5

4 Human branch length‐ neutral 0.05 1 BH = 0.01 0.9

4 Human branch length‐ positive selection 0.05 10 BH = 0.01 4.5

4 Human branch length‐ neutral 0.05 1 BH = 0.0175 3

4 Human branch length‐ positive selection 0.05 10 BH = 0.0175 21

4 Human branch length‐ neutral 0.05 1 BH = 0.025 2.6

4 Human branch length‐ positive selection 0.05 10 BH = 0.025 42

4 Human branch length‐ neutral 0.05 1 BH = 0.0325 1.7

4 Human branch length‐ positive selection 0.05 10 BH = 0.0325 43.5

1 Human ω value‐ positive selection 0.05 2 BH = 0.005 0

1 Human ω value‐ positive selection 0.05 2 BH = 0.0175 4

1 Human ω value‐ positive selection 0.05 2 BH = 0.0325 6.5

1 Human ω value‐ positive selection 0.05 5 BH = 0.005 1.5

1 Human ω value‐ positive selection 0.05 5 BH = 0.0175 13.2

1 Human ω value‐ positive selection 0.05 5 BH = 0.0325 24.5

1 Human ω value‐ positive selection 0.05 10 BH = 0.005 0.6

1 Human ω value‐ positive selection 0.05 10 BH = 0.0175 23

1 Human ω value‐ positive selection 0.05 10 BH = 0.0325 48.3

1 Human ω value‐ positive selection 0.05 25 BH = 0.005 0.94

1 Human ω value‐ positive selection 0.05 25 BH = 0.0175 37.89

1 Human ω value‐ positive selection 0.05 25 BH = 0.0325 69.35

1 Human ω value‐ positive selection 0.05 100 BH = 0.005 0.5

1 Human ω value‐ positive selection 0.05 100 BH = 0.0175 46

1 Human ω value‐ positive selection 0.05 100 BH = 0.0325 86

1 Background ω value‐ neutral 0.01 1 BH = 0.01 0

1 Background ω value‐ positive selection 0.01 10 BH = 0.01 7.5

1 Background ω value‐ neutral 0.05 1 BH = 0.01 1.4

1 Background ω value‐ positive selection 0.05 10 BH = 0.01 7.6

1 Background ω value‐ neutral 0.25 1 BH = 0.01 1.5

1 Background ω value‐ positive selection 0.25 10 BH = 0.01 6.5

1 Background ω value‐ neutral 0.75 1 BH = 0.01 2.5

1 Background ω value‐ positive selection 0.75 10 BH = 0.01 7

1 Background ω value‐ neutral 0.01 1 BH = 0.0175 4.0

1 Background ω value‐ positive selection 0.01 10 BH = 0.0175 37.5

1 Background ω value‐ neutral 0.05 1 BH = 0.0175 2.0

1 Background ω value‐ positive selection 0.05 10 BH = 0.0175 29.7

1 Background ω value‐ neutral 0.25 1 BH = 0.0175 4.0

1 Background ω value‐ positive selection 0.25 10 BH = 0.0175 38.0

1 Background ω value‐ neutral 0.75 1 BH = 0.0175 5.5

1 Background ω value‐ positive selection 0.75 10 BH = 0.0175 36.0

2 Site class proportion: 0.7_0.1_0.175_0.025 0.05 1 BH = 0.0175 2.5

2 Site class proportion: 0.7_0.1_0.175_0.025 0.05 10 BH = 0.0175 21

2 Site class proportion: 0.6_0.1_0.26_0.04 0.05 1 BH = 0.0175 2

2 Site class proportion: 0.6_0.1_0.26_0.04 0.05 10 BH = 0.0175 29.7

2 Site class proportion: 0.5_0.1_0.33_0.07 0.05 1 BH = 0.0175 3

2 Site class proportion: 0.5_0.1_0.33_0.07 0.05 10 BH = 0.0175 31.5

2 Site class proportion: 0.4_0.1_0.4_0.1 0.05 1 BH = 0.0175 4.5

2 Site class proportion: 0.4_0.1_0.4_0.1 0.05 10 BH = 0.0175 37.5

2 Site class proportion: 0.3_0.1_0.45_0.15 0.05 1 BH = 0.0175 4

2 Site class proportion: 0.3_0.1_0.45_0.15 0.05 10 BH = 0.0175 44

*The four G+C classes are: <45.2%, 45.2‐52.19%, 52.2‐58.79%, and >=58.8%.

†200 simulated sequences for positive selection, 1000 simulated sequences for neutral evolution
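The G+C classes used in Table A2 can be assigned to a coding sequence with a few lines of code. The sketch below is illustrative only: the function names are hypothetical, ambiguous bases are ignored by assumption, and the class boundaries are treated as continuous thresholds at 45.2%, 52.2%, and 58.8%.

```python
def gc_percent(sequence: str) -> float:
    """Percent G+C among unambiguous bases (A, C, G, T) of a DNA sequence."""
    bases = [b for b in sequence.upper() if b in "ACGT"]
    if not bases:
        raise ValueError("sequence contains no unambiguous bases")
    gc_count = sum(1 for b in bases if b in "GC")
    return 100.0 * gc_count / len(bases)


def gc_class(percent: float) -> int:
    """Map a G+C percentage to classes 1-4 as defined in the Table A2 footnote.

    Assumption: values falling between the printed bounds (for example
    52.195%) go to the lower class, i.e. thresholds are 45.2, 52.2, 58.8.
    """
    if percent < 45.2:
        return 1
    if percent < 52.2:
        return 2
    if percent < 58.8:
        return 3
    return 4


# Hypothetical usage on a short sequence.
cls = gc_class(gc_percent("ATGGCCGCGTAA"))
```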

LITERATURE CITED

AAGAARD, J. E., and P. PHILLIPS, 2005 Accuracy and power of the likelihood ratio test for

comparing evolutionary rates among genes. J Mol Evol 60: 426‐433.

ABZHANOV, A., M. PROTAS, B. R. GRANT, P. R. GRANT and C. J. TABIN, 2004 Bmp4 and

morphological variation of beaks in Darwin's finches. Science 305: 1462‐1465.

AIELLO, L., and C. DEAN, 1990 An Introduction to Human Evolutionary Anatomy. Elsevier

Academic Press, San Diego.

AKASHI, H., 2001 Gene expression and molecular evolution. Curr Opin Genet Dev 11: 660‐

666.

AKEY, J. M., M. A. EBERLE, M. J. RIEDER, C. S. CARLSON, M. D. SHRIVER et al., 2004 Population

history and natural selection shape patterns of genetic variation in 132 genes.

PLoS Biol 2: e286.

ALTSCHUL, S. F., W. GISH, W. MILLER, E. W. MYERS and D. J. LIPMAN, 1990 Basic local

alignment search tool. J Mol Biol 215: 403‐410.

ALVAREZ‐GARCIA, I., and E. A. MISKA, 2005 MicroRNA functions in animal development and

human disease. Development 132: 4653‐4662.

ANISIMOVA, M., J. P. BIELAWSKI and Z. YANG, 2001 Accuracy and power of the likelihood

ratio test in detecting adaptive molecular evolution. Mol Biol Evol 18: 1585‐1592.

ANISIMOVA, M., J. P. BIELAWSKI and Z. YANG, 2002 Accuracy and power of bayes prediction

of amino acid sites under positive selection. Mol Biol Evol 19: 950‐958.

212

ANTON, S. C., and I. I. I. C. C. SWISHER, 2004 EARLY DISPERSALS OF HOMO FROM AFRICA,

pp. 271‐296.

ARBIZA, L., J. DOPAZO and H. DOPAZO, 2006a Differentiating positive selection from

acceleration and relaxation in human and chimp. PLoS Comp Biol 2: e38.

ARBIZA, L., J. DOPAZO and H. DOPAZO, 2006b Positive selection, relaxation, and acceleration

in the evolution of the human and chimp genome. PLoS Comput Biol 2: e38.

BAILEY, J. A., and E. E. EICHLER, 2006 Primate segmental duplications: crucibles of

evolution, diversity and disease. Nat Rev Genet 7: 552‐564.

BAKEWELL, M. A., P. SHI and J. ZHANG, 2007a More genes underwent positive selection in

chimpanzee evolution than in human evolution. Proc Natl Acad Sci U S A 104:

7489‐7494.

BAKEWELL, M. A., P. SHI and J. ZHANG, 2007b More genes underwent positive selection in

chimpanzee evolution than in human evolution. Proc Natl Acad Sci U S A.

BAMSHAD, M., and S. P. WOODING, 2003 Signatures of natural selection in the human

genome. Nature Reviews Genetics

Nat Rev Genet 4: 99‐111.

BATEMAN, A., L. COIN, R. DURBIN, R. D. FINN, V. HOLLICH et al., 2004 The Pfam protein families

database. Nucleic Acids Res 32: D138‐141.

BEJERANO, G., M. PHEASANT, I. MAKUNIN, S. STEPHEN, W. J. KENT et al., 2004 Ultraconserved

elements in the human genome. Science 304: 1321‐1325.

BENJAMINI, Y., and Y. HOCHBERG, 1995 Controlling the false discovery rate: a practical and

powerful approach to multiple testing. J. Royal Stat. Soc. B 57: 289‐300.

213

BENTWICH, I., 2005 Prediction and validation of microRNAs and their targets. FEBS Lett

579: 5904‐5910.

BEREZIKOV, E., E. CUPPEN and R. H. PLASTERK, 2006 Approaches to microRNA discovery. Nat

Genet 38 Suppl: S2‐7.

BEREZIKOV, E., V. GURYEV, J. VAN DE BELT, E. WIENHOLDS, R. H. PLASTERK et al., 2005

Phylogenetic shadowing and computational identification of human microRNA

genes. Cell 120: 21‐24.

BERSAGLIERI, T., P. C. SABETI, N. PATTERSON, T. VANDERPLOEG, S. F. SCHAFFNER et al., 2004

Genetic Signatures of Strong Recent Positive Selection at the Lactase Gene.

American Journal of Human Genetics 74: 1111‐1120.

BIERNE, N., and A. EYRE‐WALKER, 2004 The genomic rate of adaptive amino acid

substitution in Drosophila. Mol Biol Evol 21: 1350‐1360.

BININDA‐EMONDS, O. R., 2005 transAlign: using amino acids to facilitate the multiple

alignment of protein‐coding DNA sequences. BMC Bioinformatics 6: 156.

BISWAS, S., and J. M. AKEY, 2006 Genomic insights into positive selection. Trends Genet

22: 437‐446.

BLANCHETTE, M., W. J. KENT, C. RIEMER, L. ELNITSKI, A. F. SMIT et al., 2004 Aligning multiple

genomic sequences with the threaded blockset aligner. Genome Res 14: 708‐

715.

BOFFELLI, D., J. MCAULIFFE, D. OVCHARENKO, K. D. LEWIS, I. OVCHARENKO et al., 2003

Phylogenetic Shadowing of Primate Sequences to Find Functional Regions of the

Human Genome. Science 299: 1391‐1394.

214

BRIDGHAM, J. T., S. M. CARROLL and J. W. THORNTON, 2006 Evolution of hormone‐receptor

complexity by molecular exploitation. Science 312: 97‐101.

BROOKE, N. M., J. GARCIA‐FERNANDEZ and P. W. HOLLAND, 1998 The ParaHox gene cluster is

an evolutionary sister of the Hox gene cluster. Nature 392: 920‐922.

BRUNET, M., F. GUY, D. PILBEAM, D. E. LIEBERMAN, A. LIKIUS et al., 2005 New material of the

earliest hominid from the Upper Miocene of Chad. Nature 434: 752‐755.

BULLAUGHEY, K. L., M. PRZEWORSKI and G. COOP, 2008 No effect of recombination on the

efficacy of natural selection in primates. Genome Res.: gr.071548.071107.

BUSH, E. C., and B. T. LAHN, 2005 Selective constraint on noncoding regions of hominid

genomes. PLoS Comput Biol 1: e73.

BUSTAMANTE, C. D., A. FLEDEL‐ALON, S. WILLIAMSON, R. NIELSEN, M. T. HUBISZ et al., 2005a

Natural selection on protein‐coding genes in the human genome. Nature 437:

1153‐1157.

BUSTAMANTE, C. D., A. FLEDEL‐ALON, S. WILLIAMSON, R. NIELSEN, M. TODD HUBISZ et al., 2005b

Natural selection on protein‐coding genes in the human genome. Nature 437:

1153‐1157.

CACERES, M., J. LACHUER, M. A. ZAPALA, J. C. REDMOND, L. KUDO et al., 2003 Elevated gene

expression levels distinguish human from non‐human primate brains. Proc Natl

Acad Sci U S A 100: 13030‐13035.

CANN, R. L., M. STONEKING and A. C. WILSON, 1987 Mitochondrial DNA and human

evolution. Nature 325: 31‐36.

215

CAPECCHI, M. R., 1980 High efficiency transformation by direct microinjection of DNA into

cultured mammalian cells. Cell 22: 479‐488.

CARROLL, S. B., 2000 Endless forms:The evolution of gene regulation and morphological

diversity. Cell 101: 577‐580.

CARROLL, S. B., 2003 Genetics and the making of Homo sapiens. Nature 422: 849‐857.

CASWELL, J. L., S. MALLICK, D. J. RICHTER, J. NEUBAUER, C. SCHIRMER et al., 2008 Analysis of

chimpanzee history based on genome sequence alignments. PLoS Genet 4:

e1000057.

CHABOT, A., R. A. SHRIT, R. BLEKHMAN and Y. GILAD, 2007 Using reporter gene assays to

identify cis regulatory differences between humans and chimpanzees. Genetics

176: 2069 ‐ 2076.

CHARITE, J., W. DE GRAAFF, D. CONSTEN, M. REIJNEN, J. KORVING et al., 1998 Transducing

positional information to the Hox genes: critical interaction of cdx gene products

with position‐sensitive regulatory elements. Development 125: 4349‐4358.

CHEN, F. C., and W. H. LI, 2001 Genomic divergences between humans and other

hominoids and the effective population size of the common ancestor of humans

and chimpanzees. Am J Hum Genet 68: 444‐456.

CHEN, K., D. DURAND and M. FARACH‐COLTON, 2000 NOTUNG: a program for dating gene

duplications and optimizing gene family trees. J Comput Biol 7: 429‐447.

CHEN, K., and N. RAJEWSKY, 2007 The evolution of gene regulation by transcription factors

and microRNAs. Nat Rev Genet 8: 93‐103.

216

CHENG, Z., M. VENTURA, X. SHE, P. KHAITOVICH, T. GRAVES et al., 2005 A genome‐wide

comparison of recent chimpanzee and human segmental duplications. Nature

437: 88‐93.

CLARK, A. G., S. GLANOWSKI, R. NIELSEN, P. D. THOMAS, A. KEJARIWAL et al., 2003a Inferring

nonneutral evolution from human‐chimp‐mouse orthologous gene trios. Science

302: 1960‐1963.

CLARK, A. G., S. GLANOWSKI, R. NIELSEN, P. D. THOMAS, A. KEJARIWAL et al., 2003b Inferring

nonneutral evolution from human‐chimp‐mouse orthologous gene trios. Science

302: 1960‐1963.

CLARK, N. L., and W. J. SWANSON, 2005 Pervasive adaptive evolution in primate seminal

proteins. PLoS Genet 1: e35.

CONSORTIUM, C. S. A. A., 2005a Initial sequence of the chimpanzee genome and

comparison with the human genome. Nature 437: 69‐87.

CONSORTIUM, I. H., 2005b A haplotype map of the human genome. Nature 437: 1299‐

1320.

CONSORTIUM, T. C. S. A. A., 2005c Initial sequence of the chimpanzee genome and

comparison with the human genome. Nature 437: 69‐87.

CONSORTIUM, T. I. C. C., 2004 DNA sequence and comparative analysis of chimpanzee

. Nature 429: 382‐388.

DARWIN, C., 1859 On the Origin of Species by Means of Natural Selection: or the

Preservation of Favoured Races in the Struggle for Life. John Murray, London.

217

DAVIDSON, A. J., P. ERNST, Y. WANG, M. P. S. DEKENS, P. D. KINGSLEY et al., 2003 mutants

fail to specify blood progenitors and can be rescued by multiple hox genes.

Nature 425: 300‐306.

DAVIDSON, E. H., 2001 Genomic Regulatory Systems: Development and Evolution.

Academic Press, San Diego.

DAY, M. H., 1969 Omo human skeletal remains. Nature 222: 1135‐1138.

DE WAAL, F. B., 2005 A century of getting to know the chimpanzee. Nature 437: 56‐59.

DEANE, C. M., L. SALWINSKI, I. XENARIOS and D. EISENBERG, 2002 Protein interactions: two

methods for assessment of the reliability of high throughput observations. Mol

Cell Proteomics 1: 349‐356.

DERMITZAKIS, E. T., A. REYMOND, N. SCAMUFFA, C. UCLA, E. KIRKNESS et al., 2003 Evolutionary

discrimination of mammalian conserved non‐genic sequences (CNGs). Science

302: 1033‐1035.

DONALDSON, I. J., and B. GOTTGENS, 2006 Evolution of candidate transcriptional regulatory

motifs since the human‐chimpanzee divergence. Genome Biol 7: R52.

DORUS, S., E. J. VALLENDER, P. D. EVANS, J. R. ANDERSON, S. L. GILBERT et al., 2004a Accelerated

evolution of nervous system genes in the origin of Homo sapiens. Cell 119: 1027‐

1040.

DORUS, S., E. J. VALLENDER, P. D. EVANS, J. R. ANDERSON, S. L. GILBERT et al., 2004b Accelerated

Evolution of Nervous System Genes in the Origin of Homo sapiens. Cell 119:

1027‐1040.

218

DRAGHICI, S., P. KHATRI, P. BHAVSAR, A. SHAH, S. A. KRAWETZ et al., 2003 Onto‐Tools, the

toolkit of the modern biologist: Onto‐Express, Onto‐Compare, Onto‐Design and

Onto‐Translate. Nucleic Acids Res 31: 3775‐3781.

DUPREY, P., K. CHOWDHURY, G. R. DRESSLER, R. BALLING, D. SIMON et al., 1988 A mouse gene

homologous to the Drosophila gene caudal is expressed in epithelial cells from

the embryonic intestine. Genes Dev 2: 1647‐1654.

EBERSBERGER, I., D. METZLER, C. SCHWARTZ and S. PAABO, 2002 Genomewide Comparison of

DNA Sequences between Humans and Chimpanzees. American Journal of Human

Genetics 70: 1490‐1497.

ELANGO, N., J. W. THOMAS and S. V. YI, 2006 Variable molecular clocks in hominoids. Proc

Natl Acad Sci U S A 103: 1370‐1375.

ELLEGREN, H., 2005 Evolution: natural selection in the evolution of humans and chimps.

Curr Biol 15: R919‐922.

ELLER, E., 2002 Population extinction and recolonization in human demographic history.

Mathematical Biosciences 177‐178: 1‐10.

ENARD, W., A. FASSBENDER, F. MODEL, P. ADORJAN, S. PAABO et al., 2004 Differences in DNA

methylation patterns between humans and chimpanzees. Curr Biol 14: R148‐

149.

ENARD, W., P. KHAITOVICH, J. KLOSE, S. ZOLLNER, F. HEISSIG et al., 2002a Intra‐ and Interspecific

Variation in Primate Gene Expression Patterns. Science 296: 340‐343.

ENARD, W., M. PRZEWORSKI, S. E. FISHER, C. S. LAI, V. WIEBE et al., 2002b Molecular evolution

of FOXP2, a gene involved in speech and language. Nature 418: 869‐872.

219

ENGELS, B. M., and G. HUTVAGNER, 2006 Principles and effects of microRNA‐mediated post‐

transcriptional gene regulation. 25: 6163‐6169.

EPPIG, J. T., C. J. BULT, J. A. KADIN, J. E. RICHARDSON, J. A. BLAKE et al., 2005 The Mouse

Genome Database (MGD): from genes to mice‐‐a community resource for mouse

biology. Nucleic Acids Res 33: D471‐475.

EVANS, P. D., J. R. ANDERSON, E. J. VALLENDER, S. S. CHOI and B. T. LAHN, 2004a Reconstructing

the evolutionary history of microcephalin, a gene controlling human brain size.

Hum Mol Genet 13: 1139‐1145.

EVANS, P. D., J. R. ANDERSON, E. J. VALLENDER, S. L. GILBERT, C. M. MALCOM et al., 2004b

Adaptive evolution of ASPM, a major determinant of cerebral cortical size in

humans. Hum Mol Genet 13: 489‐494.

EWING, B., and P. GREEN, 1998 Base‐calling of automated sequencer traces using phred. II.

Error probabilities. Genome Res 8: 186‐194.

EWING, B., L. HILLIER, M. C. WENDL and P. GREEN, 1998 Base‐calling of automated sequencer

traces using phred. I. Accuracy assessment. Genome Res 8: 175‐185.

EXCOFFIER, L., 2002 Human demographic history: refining the recent African origin model.

Current Opinion in Genetics & Development 12: 675‐682.

FAN, W., M. KASAHARA, J. GUTKNECHT, D. KLEIN, W. MAYER et al., 1989 Shared class II MHC

polymorphisms between humans and chimpanzees. Hum. Immunol. 26: 107‐121.

FAN, Y., E. LINARDOPOULOU, C. FRIEDMAN, E. WILLIAMS and B. J. TRASK, 2002a Genomic

structure and evolution of the ancestral chromosome fusion site in 2q13‐2q14.1

220

and paralogous regions on other human chromosomes. Genome Res 12: 1651‐

1662.

FAN, Y., T. NEWMAN, E. LINARDOPOULOU and B. J. TRASK, 2002b Gene content and function of

the ancestral chromosome fusion site in human chromosome 2q13‐2q14.1 and

paralogous regions. Genome Res 12: 1663‐1672.

FAY, J. C., and C. I. WU, 2000 Hitchhiking under positive Darwinian selection. Genetics

155: 1405‐1413.

FAY, J. C., G. J. WYCKOFF and C. I. WU, 2001 Positive and negative selection on the human

genome. Genetics 158: 1227‐1234.

FAY, J. C., G. J. WYCKOFF and C. I. WU, 2002 Testing the neutral theory of molecular

evolution with genomic data from Drosophila. Nature 415: 1024‐1026.

FELSENSTEIN, J., 1973 Maximum likelihood and minimum‐steps methods for estimating

evolutionary trees from data on discrete characters. Syst. Zool. 22: 240‐249.

FELSENSTEIN, J., 1981 Evolutionary trees from DNA sequences: a maximum likelihood

approach. J Mol Evol 17: 368‐376.

FELSENSTEIN, J., 1993 PHYLIP (Phylogeny Inference Package) version 3.5c., pp. Department

of Genetics, University of Washington, Seattle.

FIELDS, S., and O. SONG, 1989 A novel genetic system to detect protein‐protein

interactions. Nature 340: 245‐246.

FILIP, L. C., and N. I. MUNDY, 2004 Rapid Evolution by Positive Darwinian Selection in the

Extracellular Domain of the Abundant Lymphocyte Protein CD45 in Primates.

Mol Biol Evol 21: 1504‐1511.

221

FLOREA, L., G. HARTZELL, Z. ZHANG, G. M. RUBIN and W. MILLER, 1998 A computer program for

aligning a cDNA sequence with a genomic DNA sequence. Genome Res 8: 967‐

974.

FORTNA, A., Y. KIM, E. MACLAREN, K. MARSHALL, G. HAHN et al., 2004 Lineage‐Specific Gene

Duplication and Loss in Human and Great Ape Evolution. PLoS Biology 2: e207.

FU, N., I. DRINNENBERG, J. KELSO, J. R. WU, S. PAABO et al., 2007 Comparison of protein and

mRNA expression evolution in humans and chimpanzees. PLoS ONE 2: e216.

GALANT, R., and S. B. CARROLL, 2002 Evolution of a transcriptional repression domain in an

insect Hox protein. Nature 415: 910‐913.

GARCIA, F., M. GARCIA, L. MORA, L. ALARCON, J. EGOZCUE et al., 2003 Qualitative analysis of

constitutive heterochromatin and primate evolution. Biological Journal of the

Linnean Society 80: 107‐124.

GARRIGAN, D., and M. F. HAMMER, 2006 Reconstructing human origins in the genomic era.

Nat Rev Genet 7: 669‐680.

GAUNT, S. J., D. DRAGE and R. C. TRUBSHAW, 2008 Increased Cdx protein dose effects upon

axial patterning in transgenic lines of mice. Development.

GENG, C.‐D., and W. V. VEDECKIS, 2005 C‐MYB and members of the C‐ETS family of

transcription factors act as a molecular switch to mediate opposite steroid

regulation of human glucocorticoid receptor 1A promoter. J. Biol. Chem.:

M508245200.

222

GIBBS, R. A., J. ROGERS, M. G. KATZE, R. BUMGARNER, G. M. WEINSTOCK et al., 2007

Evolutionary and biomedical insights from the rhesus macaque genome. Science

316: 222‐234.

GIBBS, R. A., G. M. WEINSTOCK, M. L. METZKER, D. M. MUZNY, E. J. SODERGREN et al., 2004

Genome sequence of the Brown Norway rat yields insights into mammalian

evolution. Nature 428: 493‐521.

GILAD, Y., C. D. BUSTAMANTE, D. LANCET and S. PAABO, 2003a Natural selection on the

olfactory receptor gene family in humans and chimpanzees. Am J Hum Genet 73:

489‐501.

GILAD, Y., O. MAN, S. PAABO and D. LANCET, 2003b Human specific loss of olfactory receptor

genes. Proc Natl Acad Sci U S A 100: 3324‐3327.

GILAD, Y., A. OSHLACK, G. K. SMYTH, T. P. SPEED and K. P. WHITE, 2006 Expression profiling in

primates reveals a rapid evolution of human transcription factors. Nature 440:

242‐245.

GILAD, Y., S. A. RIFKIN, P. BERTONE, M. GERSTEIN and K. P. WHITE, 2005 Multi‐species

microarrays reveal the effect of sequence divergence on gene expression

profiles. Genome Res. 15: 674‐680.

GILAD, Y., S. ROSENBERG, M. PRZEWORSKI, D. LANCET and K. SKORECKI, 2002 Evidence for

positive selection and population structure at the human MAO‐A gene. Proc Natl

Acad Sci U S A 99: 862‐867.

GLAZKO, G., V. VEERAMACHANENI, M. NEI and W. MAKALOWSKI, 2005 Eighty percent of

proteins are different between humans and chimpanzees. Gene 346: 215‐219.

GOLDMAN, N., 1993 Statistical tests of models of DNA substitution. J Mol Evol 36: 182‐

198.

GOLDMAN, N., and Z. YANG, 1994 A codon‐based model of nucleotide substitution for

protein‐coding DNA sequences. Mol Biol Evol 11: 725‐736.

GOODMAN, M., 1999 Molecular evolution '99: the genomic record of humankind's

evolutionary roots. American Journal of Human Genetics 64: 31‐39.

GOODMAN, M., D. A. TAGLE, D. H. FITCH, W. BAILEY, J. CZELUSNIAK et al., 1990 Primate

evolution at the DNA level and a classification of hominoids. J Mol Evol 30: 260‐

266.

GREEN, P., www.phrap.org.

GREEN, R. E., J. KRAUSE, S. E. PTAK, A. W. BRIGGS, M. T. RONAN et al., 2006 Analysis of one

million base pairs of Neanderthal DNA. Nature 444: 330‐336.

GRENIER, J. K., and S. B. CARROLL, 2000a Functional evolution of the Ultrabithorax protein.

Proc Natl Acad Sci U S A 97: 704‐709.

GRENIER, J. K., and S. B. CARROLL, 2000b Functional evolution of the Ultrabithorax protein.

Proceedings of the National Academy of Sciences of the United States of America

97: 704‐709.

GROVES, C., 2005 Mammalian Species of the World. Johns Hopkins University Press,

Baltimore.

GROVES, C. P., 2001 Primate Taxonomy. Smithsonian Institution Press, Washington DC.

GU, J., and X. GU, 2003 Induced gene expression in human brain after the split from

chimpanzee. Trends Genet 19: 63‐65.

HAHN, M. W., M. V. ROCKMAN, N. SORANZO, D. B. GOLDSTEIN and G. A. WRAY, 2004 Population

genetic and phylogenetic evidence for positive selection on regulatory mutations

at the factor VII locus in humans. Genetics 167: 867‐877.

HAHN, Y., and B. LEE, 2005 Identification of nine human‐specific frameshift mutations by

comparative analysis of the human and the chimpanzee genome sequences.

Bioinformatics 21: i186‐i194.

HAHN, Y., and B. LEE, 2006 Human‐specific nonsense mutations identified by genome

sequence comparisons. Hum Genet 119: 169‐178.

HAMBLIN, M. T., and A. DI RIENZO, 2000 Detection of the signature of natural selection in

humans: evidence from the Duffy blood group locus. Am J Hum Genet 66: 1669‐

1679.

HANCOCK, J. M., 2005 Gene factories, microfunctionalization and the evolution of gene

families. Trends Genet 21: 591‐595.

HANLON, S. E., and J. D. LIEB, 2004 Progress and challenges in profiling the dynamics of

chromatin and transcription factor binding with DNA microarrays. Curr Opin

Genet Dev 14: 697‐705.

HARPENDING, H. C., M. A. BATZER, M. GURVEN, L. B. JORDE, A. R. ROGERS et al., 1998 Genetic

traces of ancient demography, pp. 1961‐1967.

HARPENDING, H. C., S. T. SHERRY, A. R. ROGERS and M. STONEKING, 1993 The Genetic Structure

of Ancient Human Populations, pp. 483.

HEISSIG, F., J. KRAUSE, J. BRYK, P. KHAITOVICH, W. ENARD et al., 2005 Functional analysis of

human and chimpanzee promoters. Genome Biology 6: R57.

HELLMANN, I., S. ZOLLNER, W. ENARD, I. EBERSBERGER, B. NICKEL et al., 2003 Selection on

Human Genes as Revealed by Comparisons to Chimpanzee cDNA. Genome Res.

13: 831‐837.

HITTINGER, C. T., D. L. STERN and S. B. CARROLL, 2005 Pleiotropic functions of a conserved

insect‐specific Hox peptide motif. Development 132: 5261‐5270.

HODGSON, J. A., and T. R. DISOTELL, 2008 No evidence of a Neanderthal contribution to

modern human diversity. Genome Biol 9: 206.

HOEKSTRA, H. E., and J. A. COYNE, 2007 The locus of evolution: evo devo and the genetics

of adaptation. Evolution 61: 995‐1016.

HONEYCUTT, R., 2008 Small changes, big results: evolution of morphological discontinuity

in mammals. Journal of Biology 7: 9.

HSIA, C. C., and W. MCGINNIS, 2003a Evolution of transcription factor function. Curr Opin

Genet Dev 13: 199‐206.

HSIA, C. C., and W. MCGINNIS, 2003b Evolution of transcription factor function. Current

Opinion in Genetics and Development 13: 199‐206.

HUBBARD, T. J., B. L. AKEN, K. BEAL, B. BALLESTER, M. CACCAMO et al., 2007 Ensembl 2007.

Nucleic Acids Res 35: D610‐617.

HUBY, T., C. DACHET, R. M. LAWN, J. WICKINGS, M. J. CHAPMAN et al., 2001 Functional Analysis

of the Chimpanzee and Human apo(a) Promoter Sequences. J. Biol. Chem. 276:

22209‐22214.

HUDSON, R. R., M. KREITMAN and M. AGUADE, 1987 A test of neutral molecular evolution

based on nucleotide data. Genetics 116: 153‐159.

HUGHES, A. L., and M. NEI, 1988 Pattern of nucleotide substitution at major

histocompatibility complex class I loci reveals overdominant selection. Nature

335: 167‐170.

HURLES, M., 2004 Gene duplication: the genomic trade in spare parts. PLoS Biol 2: E206.

HUTTLEY, G. A., S. EASTEAL, M. C. SOUTHEY, A. TESORIERO, G. G. GILES et al., 2000 Adaptive

evolution of the tumour suppressor BRCA1 in humans and chimpanzees. Nature

Genetics 25: 410‐413.

ISHWARAN, H., J. S. RAO and U. B. KOGALUR, 2006 BAMarray™: Java software for

Bayesian analysis of variance for microarray data. BMC Bioinformatics 7: 59.

IYER, V. R., C. E. HORAK, C. S. SCAFE, D. BOTSTEIN, M. SNYDER et al., 2001 Genomic binding

sites of the yeast cell‐cycle transcription factors SBF and MBF. Nature 409: 533‐

538.

JAENISCH, R., and A. BIRD, 2003 Epigenetic regulation of gene expression: how the genome

integrates intrinsic and environmental signals. Nat Genet 33 Suppl: 245‐254.

JOHNSON, M. E., L. VIGGIANO, J. A. BAILEY, M. ABDUL‐RAUF, G. GOODWIN et al., 2001 Positive

selection of a gene family during the emergence of humans and African apes.

Nature 413: 514‐519.

JOUNG, J. K., E. I. RAMM and C. O. PABO, 2000 A bacterial two‐hybrid selection system for

studying protein‐DNA and protein‐protein interactions. Proc Natl Acad Sci U S A

97: 7382‐7387.

KAESSMANN, H., V. WIEBE, G. WEISS and S. PAABO, 2001 Great ape DNA sequences reveal a

reduced diversity and an expansion in humans. Nat Genet 27: 155‐156.

KARAMAN, M. W., M. L. HOUCK, L. G. CHEMNICK, S. NAGPAL, D. CHAWANNAKUL et al., 2003

Comparative Analysis of Gene‐Expression Patterns in Human and African Great

Ape Cultured Fibroblasts. Genome Res. 13: 1619‐1630.

KEENAN, I. D., R. M. SHARRARD and H. V. ISAACS, 2006 FGF signal transduction and the

regulation of Cdx gene expression. Dev Biol 299: 478‐488.

KEIGHTLEY, P. D., G. V. KRYUKOV, S. SUNYAEV, D. L. HALLIGAN and D. J. GAFFNEY, 2005

Evolutionary constraints in conserved nongenic sequences of mammals. Genome

Res. 15: 1373‐1378.

KEIGHTLEY, P. D., M. J. LERCHER and A. EYRE‐WALKER, 2005a Evidence for Widespread

Degradation of Gene Control Regions in Hominid Genomes. PLoS Biology 3: 1‐7.

KELLEY, J. L., J. MADEOY, J. C. CALHOUN, W. SWANSON and J. M. AKEY, 2006 Genomic signatures

of positive selection in humans and the limits of outlier approaches. Genome Res

16: 980‐989.

KHAITOVICH, P., W. ENARD, M. LACHMANN and S. PAABO, 2006a Evolution of primate gene

expression. Nat Rev Genet 7: 693‐702.

KHAITOVICH, P., I. HELLMANN, W. ENARD, K. NOWICK, M. LEINWEBER et al., 2005a Parallel

patterns of evolution in the genomes and transcriptomes of humans and

chimpanzees. Science 309: 1850‐1854.

KHAITOVICH, P., S. PAABO and G. WEISS, 2005b Toward a neutral evolutionary model of gene

expression. Genetics 170: 929‐939.

KHAITOVICH, P., K. TANG, H. FRANZ, J. KELSO, I. HELLMANN et al., 2006b Positive selection on

gene expression in the human brain. Curr Biol 16: R356‐358.

KHAITOVICH, P., G. WEISS, M. LACHMANN, I. HELLMANN, W. ENARD et al., 2004 A Neutral Model

of Transcriptome Evolution. PLoS Biology 2: e132.

KHATRI, P., P. BHAVSAR, G. BAWA and S. DRAGHICI, 2004 Onto‐Tools: an ensemble of web‐

accessible, ontology‐based tools for the functional design and interpretation of

high‐throughput gene expression experiments. Nucleic Acids Res 32: W449‐456.

KIDD, J. M., K. C. TREVARTHEN, D. L. TEFFT, Z. CHENG, M. MOONEY et al., 2005 A catalog of

nonsynonymous polymorphism on mouse . Mamm Genome 16:

925‐933.

KIMURA, M., 1968 Evolutionary rate at the molecular level. Nature 217: 624‐626.

KIMURA, M., 1983 The Neutral Theory of Molecular Evolution. Cambridge University

Press, Cambridge.

KIMURA, M., and T. OTA, 1974 On some principles governing molecular evolution. Proc

Natl Acad Sci U S A 71: 2848‐2852.

KING, M.‐C., and A. C. WILSON, 1975 Evolution at two levels in humans and chimpanzees.

Science 188: 107‐116.

KIRMIZIS, A., S. M. BARTLEY, A. KUZMICHEV, R. MARGUERON, D. REINBERG et al., 2004 Silencing of

human polycomb target genes is associated with methylation of histone H3 Lys

27. Genes Dev 18: 1592‐1605.

KITANO, T., Y. H. LIU, S. UEDA and N. SAITOU, 2004 Human‐specific amino acid changes

found in 103 protein‐coding genes. Mol Biol Evol 21: 936‐944.

KLOOSTERMAN, W. P., and R. H. PLASTERK, 2006 The diverse functions of microRNAs in

animal development and disease. Dev Cell 11: 441‐450.

KNIGHT, C. G., N. ZITZMANN, S. PRABHAKAR, R. ANTROBUS, R. DWEK et al., 2006 Unraveling

adaptive evolution: how a single point mutation affects the protein coregulation

network. Nat Genet 38: 1015‐1022.

KONDRASHOV, A. S., S. SUNYAEV and F. A. KONDRASHOV, 2002a Dobzhansky‐Muller

incompatibilities in protein evolution. Proc Natl Acad Sci U S A 99: 14878‐14883.

KONDRASHOV, F. A., I. B. ROGOZIN, Y. I. WOLF and E. V. KOONIN, 2002b Selection in the

evolution of gene duplications. Genome Biol 3: RESEARCH0008.

KOUPRINA, N., A. PAVLICEK, G. H. MOCHIDA, G. SOLOMON, W. GERSCH et al., 2004 Accelerated

evolution of the ASPM gene controlling brain size begins prior to human brain

expansion. PLoS Biol 2: E126.

KREITMAN, M., 2000 Methods to detect selection in populations with applications to the

human. Annu. Rev. Genomics Hum. Genet. 1: 539‐559.

KREK, A., D. GRUN, M. N. POY, R. WOLF, L. ROSENBERG et al., 2005 Combinatorial microRNA

target predictions. Nat Genet 37: 495‐500.

LAMASON, R. L., M. A. MOHIDEEN, J. R. MEST, A. C. WONG, H. L. NORTON et al., 2005 SLC24A5,

a putative cation exchanger, affects pigmentation in zebrafish and humans.

Science 310: 1782‐1786.

LANDER, E. S., L. M. LINTON, B. BIRREN, C. NUSBAUM, M. C. ZODY et al., 2001 Initial sequencing

and analysis of the human genome. Nature 409: 860‐921.

LEE, R. C., R. L. FEINBAUM and V. AMBROS, 1993 The C. elegans heterochronic gene lin‐4

encodes small RNAs with antisense complementarity to lin‐14. Cell 75: 843‐854.

LEE, T. I., N. J. RINALDI, F. ROBERT, D. T. ODOM, Z. BAR‐JOSEPH et al., 2002a Transcriptional

regulatory networks in Saccharomyces cerevisiae. Science 298: 799‐804.

LEE, Y., R. SULTANA, G. PERTEA, J. CHO, S. KARAMYCHEVA et al., 2002b Cross‐referencing

eukaryotic genomes: TIGR Orthologous Gene Alignments (TOGA). Genome Res

12: 493‐502.

LEMOS, B., C. D. MEIKLEJOHN, M. CACERES and D. L. HARTL, 2005 Rates of divergence in gene

expression profiles of primates, mice, and flies: stabilizing selection and

variability among functional categories. Evolution Int J Org Evolution 59: 126‐

137.

LEVINE, M., and R. TJIAN, 2003 Transcription regulation and animal diversity. Nature 424:

147‐151.

LEWIS, B. P., I. H. SHIH, M. W. JONES‐RHOADES, D. P. BARTEL and C. B. BURGE, 2003 Prediction

of mammalian microRNA targets. Cell 115: 787‐798.

LEWONTIN, R. C., and J. KRAKAUER, 1973 Distribution of gene frequency as a test of the

theory of the selective neutrality of polymorphisms. Genetics 74: 175‐195.

LI, W.‐H., 1997 Molecular Evolution. Sinauer Associates, Inc., Sunderland,

Massachusetts.

LI, W.‐H., and M. A. SAUNDERS, 2005 The chimpanzee and us. Nature 437: 50‐51.

LI, W. H., C. I. WU and C. C. LUO, 1985 A new method for estimating synonymous and

nonsynonymous rates of nucleotide substitution considering the relative

likelihood of nucleotide and codon changes. Mol Biol Evol 2: 150‐174.

LI, W. H., J. YANG and X. GU, 2005 Expression divergence between duplicate genes. Trends

Genet 21: 602‐607.

LIEB, J. D., X. LIU, D. BOTSTEIN and P. O. BROWN, 2001 Promoter‐specific binding of Rap1

revealed by genome‐wide maps of protein‐DNA association. Nat Genet 28: 327‐

334.

LIM, L. P., N. C. LAU, E. G. WEINSTEIN, A. ABDELHAKIM, S. YEKTA et al., 2003 The microRNAs of

Caenorhabditis elegans. Genes Dev 17: 991‐1008.

LINDBLAD‐TOH, K., C. M. WADE, T. S. MIKKELSEN, E. K. KARLSSON, D. B. JAFFE et al., 2005

Genome sequence, comparative analysis and haplotype structure of the

domestic dog. Nature 438: 803‐819.

LINNAEUS, C., 1735 Systema naturae sive regna tria Naturae systematice proposita per

classes, ordines, genera, & species. Theodorum Haak, Lugduni Batavorum.

LOHR, U., and L. PICK, 2005 Cofactor‐interaction motifs and the cooption of a homeotic

Hox protein into the segmentation pathway of Drosophila melanogaster. Curr

Biol 15: 643‐649.

LOHR, U., M. YUSSA and L. PICK, 2001 Drosophila fushi tarazu: a gene on the border of

homeotic function. Curr Biol 11: 1403‐1412.

LU, J., W. H. LI and C. I. WU, 2003 Comment on "Chromosomal speciation and molecular

divergence‐accelerated evolution in rearranged chromosomes". Science 302:

988; author reply 988.

LU, J., and C. I. WU, 2005 Weak selection revealed by the whole‐genome comparison of

the X chromosome and autosomes of human and chimpanzee. Proc Natl Acad Sci

U S A 102: 4063‐4067.

MAGNESS, C., P. C. FELLIN, M. THOMAS, M. KORTH, M. AGY et al., 2005 Analysis of the Macaca

mulatta transcriptome and the sequence divergence between Macaca and

human. Genome Biology 6: R60.

MAKOVA, K. D., and W. H. LI, 2003 Divergence in the spatial pattern of gene expression

between human duplicate genes. Genome Res 13: 1638‐1645.

MARGULIES, E. H., M. BLANCHETTE, NISC COMPARATIVE SEQUENCING PROGRAM, D. HAUSSLER and E.

D. GREEN, 2003 Identification and Characterization of Multi‐Species Conserved

Sequences. Genome Res. 13: 2507‐2518.

MARGULIES, E. H., J. P. VINSON, W. MILLER, D. B. JAFFE, K. LINDBLAD‐TOH et al., 2005 An initial

strategy for the systematic identification of functional elements in the human

genome by low‐redundancy comparative sequencing. Proc Natl Acad Sci U S A

102: 4795‐4800.

MCDONALD, J. H., and M. KREITMAN, 1991a Adaptive protein evolution at the Adh locus in

Drosophila. Nature 351: 652‐654.

MCDONALD, J. H., and M. KREITMAN, 1991b Adaptive protein evolution at the Adh locus in

Drosophila. Nature 351: 652‐654.

MCDOUGALL, I., F. H. BROWN and J. G. FLEAGLE, 2005 Stratigraphic placement and age of

modern humans from Kibish, Ethiopia. Nature 433: 733‐736.

MEKEL‐BOBROV, N., S. L. GILBERT, P. D. EVANS, E. J. VALLENDER, J. R. ANDERSON et al., 2005

Ongoing Adaptive Evolution of ASPM, a Brain Size Determinant in Homo sapiens.

Science 309: 1720‐1722.

MESSIER, W., and C.‐B. STEWART, 1997 Episodic adaptive evolution of primate lysozymes.

Nature 385: 151‐154.

METZGER, D., J. H. WHITE and P. CHAMBON, 1988 The human oestrogen receptor functions

in yeast. Nature 334: 31‐36.

MUKHERJEE, S., S. BAL and P. SAHA, 2001 Protein interaction maps using yeast two‐hybrid

assay. Current Science 81: 458‐464.

MURPHY, W. J., E. EIZIRIK, S. J. O'BRIEN, O. MADSEN, M. SCALLY et al., 2001 Resolution of the

early placental mammal radiation using Bayesian phylogenetics. Science 294:

2348‐2351.

NAHON, J. L., 2003 Birth of 'human‐specific' genes during primate evolution. Genetica

118: 193‐208.

NAVARRO, A., and N. H. BARTON, 2003 Chromosomal speciation and molecular divergence‐

‐accelerated evolution in rearranged chromosomes. Science 300: 321‐324.

NEI, M., 1975 Molecular population genetics and evolution. North‐Holland, Amsterdam.

NEI, M., 2005 Selectionism and neutralism in molecular evolution. Mol Biol Evol 22:

2318‐2342.

NGUYEN, D. Q., C. WEBBER and C. P. PONTING, 2006 Bias of selection on human copy‐

number variants. PLoS Genet 2: e20.

NICKEL, G. C., D. TEFFT and M. D. ADAMS, 2008a Human PAML browser: a database of

positive selection on human genes using phylogenetic methods. Nucleic Acids

Res 36: D800‐808.

NICKEL, G. C., D. L. TEFFT, K. GOGLIN and M. D. ADAMS, 2008b An empirical test for branch‐

specific positive selection. Genetics 179: 2183‐2193.

NIELSEN, R., 2005 Molecular signatures of natural selection. Annu Rev Genet 39: 197‐218.

NIELSEN, R., C. BUSTAMANTE, A. G. CLARK, S. GLANOWSKI, T. B. SACKTON et al., 2005a A scan for

positively selected genes in the genomes of humans and chimpanzees. PLoS Biol

3: e170.

NIELSEN, R., C. BUSTAMANTE, A. G. CLARK, S. GLANOWSKI, T. B. SACKTON et al., 2005b A scan for

positively selected genes in the genomes of humans and chimpanzees. PLoS

Biology 3: e170.

NOBREGA, M. A., Y. ZHU, I. PLAJZER‐FRICK, V. AFZAL and E. M. RUBIN, 2004 Megabase deletions

of gene deserts result in viable mice. Nature 431: 988‐993.

NORTON, H. L., R. A. KITTLES, E. PARRA, P. MCKEIGUE, X. MAO et al., 2007 Genetic evidence for

the convergent evolution of light skin in Europeans and East Asians. Mol Biol Evol

24: 710‐722.

OHNO, S., 1970 Evolution by Gene Duplication. George Allen and Unwin, London.

OHTA, T., 1973 Slightly deleterious mutant substitutions in evolution. Nature 246: 96‐98.

OHTA, T., 1992 The nearly neutral theory of molecular evolution. Annu Rev Ecol Sys 23:

263‐286.

OHTA, T., 2000 Evolution of gene families. Gene 259: 45‐52.

OLSON, M. V., 1999 When less is more: gene loss as an engine of evolutionary change.

Am J Hum Genet 64: 18‐23.

OLSON, M. V., and A. VARKI, 2003 Sequencing the chimpanzee genome: insights into

human evolution and disease. Nat Rev Genet 4: 20‐28.

PAL, C., B. PAPP and M. J. LERCHER, 2006 An integrated view of protein evolution. Nat Rev

Genet 7: 337‐348.

PASTINEN, T., R. SLADEK, S. GURD, A. SAMMAK, B. GE et al., 2004 A survey of genetic and

epigenetic variation affecting human gene expression. Physiol Genomics 16: 184‐

193.

PATTERSON, N., D. J. RICHTER, S. GNERRE, E. S. LANDER and D. REICH, 2006 Genetic evidence for

complex speciation of humans and chimpanzees. Nature 441: 1103‐1108.

PEARSON, O. M., 2004 Has the combination of genetic and fossil evidence solved the

riddle of modern human origins? Evolutionary Anthropology: Issues, News, and

Reviews 13: 145‐159.

PEARSON, W. R., and D. J. LIPMAN, 1988 Improved tools for biological sequence

comparison. Proc Natl Acad Sci U S A 85: 2444‐2448.

PERRY, G. H., J. TCHINDA, S. D. MCGRATH, J. ZHANG, S. R. PICKER et al., 2006 Hotspots for copy

number variation in chimpanzees and humans. Proc Natl Acad Sci U S A 103:

8006‐8011.

PILLAI, R. S., S. N. BHATTACHARYYA and W. FILIPOWICZ, 2007 Repression of protein synthesis

by miRNAs: how many mechanisms? Trends Cell Biol 17: 118‐126.

PILON, N., K. OH, J. R. SYLVESTRE, N. BOUCHARD, J. SAVORY et al., 2006 Cdx4 is a direct target of

the canonical Wnt pathway. Dev Biol 289: 55‐63.

POLLARD, K. S., S. R. SALAMA, N. LAMBERT, M. A. LAMBOT, S. COPPENS et al., 2006 An RNA gene

expressed during cortical development evolved rapidly in humans. Nature 443:

167‐172.

PRABHAKAR, S., J. P. NOONAN, S. PAABO and E. M. RUBIN, 2006 Accelerated evolution of

conserved noncoding sequences in humans. Science 314: 786.

PREUSS, T. M., M. CACERES, M. C. OLDHAM and D. H. GESCHWIND, 2004 Human brain

evolution: insights from microarrays. Nat Rev Genet 5: 850‐860.

PRINCE, V. E., and F. B. PICKETT, 2002 Splitting pairs: the diverging fates of duplicated

genes. Nat Rev Genet 3: 827‐837.

RELETHFORD, J. H., 2008 Genetic evidence and the modern human origins debate.

Heredity 100: 555‐563.

REN, B., F. ROBERT, J. J. WYRICK, O. APARICIO, E. G. JENNINGS et al., 2000 Genome‐wide

location and function of DNA binding proteins. Science 290: 2306‐2309.

RHIM, J., W. CONNER, G. DIXON, C. HARENDZA, D. EVENSON et al., 1995 Expression of an avian

protamine in transgenic mice disrupts chromatin structure in spermatozoa.

Biology of Reproduction 52: 20‐32.

RICE, W. R., and A. K. CHIPPINDALE, 2001 Sexual recombination and the power of natural

selection. Science 294: 555‐559.

RIESEBERG, L. H., and K. LIVINGSTONE, 2003 Evolution. Chromosomal speciation in primates.

Science 300: 267‐268.

ROCKMAN, M. V., M. W. HAHN, N. SORANZO, F. ZIMPRICH, D. B. GOLDSTEIN et al., 2005 Ancient

and recent positive selection transformed opioid cis‐regulation in humans. PLoS

Biol 3: e387.

RONALD, J., and J. M. AKEY, 2005 Genome‐wide scans for loci under selection in humans.

Hum Genomics 2: 113‐125.

RONSHAUGEN, M., N. MCGINNIS and W. MCGINNIS, 2002 Hox protein mutation and

macroevolution of the insect body plan. Nature 415: 914‐917.

ROTH, C., S. RASTOGI, L. ARVESTAD, K. DITTMAR, S. LIGHT et al., 2007 Evolution after gene

duplication: models, mechanisms, sequences, systems, and organisms. J Exp

Zoolog B Mol Dev Evol 308: 58‐73.

ROZEN, S., and H. J. SKALETSKY, 2000 Primer3 on the WWW for general users and for

biologist programmers. Humana Press, Totowa, NJ.

SABETI, P. C., D. E. REICH, J. M. HIGGINS, H. Z. LEVINE, D. J. RICHTER et al., 2002a Detecting

recent positive selection in the human genome from haplotype structure. Nature

419: 832‐837.

SABETI, P. C., D. E. REICH, J. M. HIGGINS, H. Z. P. LEVINE, D. J. RICHTER et al., 2002b Detecting

recent positive selection in the human genome from haplotype structure. Nature

419: 832‐837.

SABETI, P. C., S. F. SCHAFFNER, B. FRY, J. LOHMUELLER, P. VARILLY et al., 2006 Positive natural

selection in the human lineage. Science 312: 1614‐1620.

SABETI, P. C., P. VARILLY, B. FRY, J. LOHMUELLER, E. HOSTETTER et al., 2007 Genome‐wide

detection and characterization of positive selection in human populations.

Nature 449: 913‐918.

SACHIDANANDAM, R., D. WEISSMAN, S. C. SCHMIDT, J. M. KAKOL, L. D. STEIN et al., 2001 A map of

human genome sequence variation containing 1.42 million single nucleotide

polymorphisms. Nature 409: 928‐933.

SAUNDERS, M. A., J. M. GOOD, E. C. LAWRENCE, R. E. FERRELL, W. H. LI et al., 2006 Human

adaptive evolution at Myostatin (GDF8), a regulator of muscle growth. Am J Hum

Genet 79: 1089‐1097.

SAUNDERS, M. A., M. F. HAMMER and M. W. NACHMAN, 2002 Nucleotide variability at G6pd

and the signature of malarial selection in humans. Genetics 162: 1849‐1861.

SAWYER, S. L., L. I. WU, M. EMERMAN and H. S. MALIK, 2005 Positive selection of primate

TRIM5alpha identifies a critical species‐specific retroviral restriction domain.

Proc Natl Acad Sci U S A 102: 2832‐2837.

SCANNELL, D. R., A. C. FRANK, G. C. CONANT, K. P. BYRNE, M. WOOLFIT et al., 2007 Independent

sorting‐out of thousands of duplicated gene pairs in two yeast species

descended from a whole‐genome duplication. Proc Natl Acad Sci U S A 104:

8397‐8402.

SELF, S. G., and K. LIANG, 1987 Asymptotic Properties of Maximum Likelihood Estimators

and Likelihood Ratio Tests Under Nonstandard Conditions. Journal of the

American Statistical Association 82: 605‐610.

SENUT, B., M. PICKFORD, D. GOMMERY, P. MEIN, K. CHEBOI et al., 2001 First hominid from the

Miocene (Lukeino Formation, Kenya) Premier hominidé du Miocène (formation

de Lukeino, Kenya). Comptes Rendus de l'Académie des Sciences ‐ Series IIA ‐

Earth and Planetary Science 332: 137‐144.

SERRE, D., A. LANGANEY, M. CHECH, M. TESCHLER‐NICOLA, M. PAUNOVIC et al., 2004 No evidence

of Neandertal mtDNA contribution to early modern humans. PLoS Biol 2: E57.

SHARAN, R., A. BEN‐HUR, G. G. LOOTS and I. OVCHARENKO, 2004 CREME: Cis‐Regulatory

Module Explorer for the human genome. Nucl. Acids Res. 32: W253‐256.

SHI, J., H. XI, Y. WANG, C. ZHANG, Z. JIANG et al., 2003a Divergence of the genes on human

chromosome 21 between human and other hominoids and variation of

substitution rates among transcription units. Proc Natl Acad Sci U S A 100: 8331‐

8336.

SHI, P., J. ZHANG, H. YANG and Y.‐P. ZHANG, 2003b Adaptive Diversification of Bitter Taste

Receptor Genes in Mammalian Evolution. Mol Biol Evol 20: 805‐814.

SIMPSON, G. G., 1965 Tempo and Mode in Evolution. Hafner Publishing Company, New

York.

SKROMNE, I., D. THORSEN, M. HALE, V. E. PRINCE and R. K. HO, 2007 Repression of the

hindbrain developmental program by Cdx factors is required for the specification

of the vertebrate spinal cord. Development 134: 2147‐2158.

SMALHEISER, N. R., and V. I. TORVIK, 2005 Mammalian microRNAs derived from genomic

repeats. Trends Genet 21: 322‐326.

STEDMAN, H. H., B. W. KOZYAK, A. NELSON, D. M. THESIER, L. T. SU et al., 2004 Myosin gene

mutation correlates with anatomical changes in the human lineage. Nature 428:

415‐418.

STONE, A. C., R. C. GRIFFITHS, S. L. ZEGURA and M. F. HAMMER, 2001 High levels of Y‐

chromosome nucleotide diversity in the genus Pan, pp. 012364999.

TABARIES, S., J. LAPOINTE, T. BESCH, M. CARTER, J. WOOLLARD et al., 2005 Cdx Protein

Interaction with Hoxa5 Regulatory Sequences Contributes to Hoxa5 Regional

Expression along the Axial Skeleton. Mol. Cell. Biol. 25: 1389‐1401.

TAJIMA, F., 1989 Statistical method for testing the neutral mutation hypothesis by DNA

polymorphism. Genetics 123: 585‐595.

TAJIMA, F., 1993 Statistical analysis of DNA polymorphism. Jpn J Genet 68: 567‐595.

TANG, K., K. R. THORNTON and M. STONEKING, 2007 A New Approach for Using Genome

Scans to Detect Recent Positive Selection in the Human Genome. PLoS Biol 5:

e171.

THOMAS, J. W., J. W. TOUCHMAN, R. W. BLAKESLEY, G. G. BOUFFARD, S. M. BECKSTROM‐STERNBERG

et al., 2003a Comparative analyses of multi‐species sequences from targeted

genomic regions. Nature 424: 788‐793.

THOMAS, P. D., A. KEJARIWAL, M. J. CAMPBELL, H. MI, K. DIEMER et al., 2003b PANTHER: a

browsable database of gene products organized by biological function, using

curated protein family and subfamily classification. Nucl. Acids Res. 31: 334.

TING, C. T., S. C. TSAUR, M. L. WU and C. I. WU, 1998 A rapidly evolving homeobox at the

site of a hybrid sterility gene. Science 282: 1501‐1504.

TISHKOFF, S. A., F. A. REED, A. RANCIARO, B. F. VOIGHT, C. C. BABBITT et al., 2007 Convergent

adaptation of human lactase persistence in Africa and Europe. Nat Genet 39: 31‐

40.

TRELSMAN, J., P. GÖNCZY, M. VASHISHTHA, E. HARRIS and C. DESPLAN, 1989 A single amino acid

can determine the DNA binding specificity of homeodomain proteins. Cell 59:

553‐562.

VALLENDER, E. J., and B. T. LAHN, 2004 Positive selection on the human genome. Hum Mol

Genet 13: R245‐R254.

VAN NES, J., W. DE GRAAFF, F. LEBRIN, M. GERHARD, F. BECK et al., 2006 The Cdx4 mutation

affects axial development and reveals an essential role of Cdx genes in the

ontogenesis of the placental labyrinth in mice. Development 133: 419‐428.

VENTER, J. C., M. D. ADAMS, E. W. MYERS, P. W. LI, R. J. MURAL et al., 2001 The sequence of

the human genome. Science 291: 1304‐1351.

VOGEL, F., and A. MOTULSKY, 1996 Human genetics: problems and approaches. Springer‐

Verlag, New York.

VOGEL, T., R. M. SPEED, A. ROSS and H. J. COOKE, 2002 Partial rescue of the Dazl knockout

mouse by the human DAZL gene. Mol. Hum. Reprod. 8: 797‐804.

VOIGHT, B. F., S. KUDARAVALLI, X. WEN and J. K. PRITCHARD, 2006 A map of recent positive

selection in the human genome. PLoS Biol 4: e72.

WALSH, E. C., P. SABETI, H. B. HUTCHESON, B. FRY, S. F. SCHAFFNER et al., 2006 Searching for

signals of evolutionary selection in 168 genes related to immune function. Hum

Genet 119: 92‐102.

WANG, E. T., G. KODAMA, P. BALDI and R. K. MOYZIS, 2006a Global landscape of recent

inferred Darwinian selection for Homo sapiens. Proc Natl Acad Sci U S A 103:

135‐140.

WANG, H., B. L. POSTIER and R. L. BURNAP, 2002 Optimization of Fusion PCR for In Vitro

Construction of Gene Knockout Fragments. Biotechniques 33: 26‐32.

WANG, X., W. E. GRUS and J. ZHANG, 2006b Gene losses during human origins. PLoS Biol 4:

e52.

WANG, Y.‐Q., and B. SU, 2004a Molecular evolution of microcephalin, a gene determining

human brain size. Human Molecular Genetics 13: 1131‐1137.

WANG, Y. Q., and B. SU, 2004b Molecular evolution of microcephalin, a gene determining

human brain size. Hum Mol Genet 13: 1131‐1137.

WATANABE, H., A. FUJIYAMA, M. HATTORI, T. D. TAYLOR, A. TOYODA et al., 2004 DNA sequence

and comparative analysis of chimpanzee chromosome 22. Nature 429: 382‐388.

WATERSTON, R. H., K. LINDBLAD‐TOH, E. BIRNEY, J. ROGERS, J. F. ABRIL et al., 2002 Initial

sequencing and comparative analysis of the mouse genome. Nature 420: 520‐

562.

WEBER, M., I. HELLMANN, M. B. STADLER, L. RAMOS, S. PAABO et al., 2007 Distribution,

silencing potential and evolutionary impact of promoter DNA methylation in the

human genome. Nat Genet 39: 457‐466.

WEBSTER, M. T., N. G. C. SMITH, M. J. LERCHER and H. ELLEGREN, 2004 Gene Expression,

Synteny, and Local Similarity in Human Noncoding Mutation Rates, pp. 1820‐

1830.

WEINMANN, A. S., P. S. YAN, M. J. OBERLEY, T. H. HUANG and P. J. FARNHAM, 2002 Isolating

human transcription factor targets by coupling chromatin immunoprecipitation

and CpG island microarray analysis. Genes Dev 16: 235‐244.

WELLS, J., P. S. YAN, M. CECHVALA, T. HUANG and P. J. FARNHAM, 2003 Identification of novel

pRb binding sites using CpG microarrays suggests that recruits pRb to

specific genomic sites during S phase. Oncogene 22: 1445‐1460.

WHEELER, D. L., T. BARRETT, D. A. BENSON, S. H. BRYANT, K. CANESE et al., 2007 Database

resources of the National Center for Biotechnology Information. Nucleic Acids

Res 35: D5‐12.

WILDMAN, D. E., M. UDDIN, G. LIU, L. I. GROSSMAN and M. GOODMAN, 2003 Implications of

natural selection in shaping 99.4% nonsynonymous DNA identity between

humans and chimpanzees: Enlarging genus Homo. PNAS 100: 7181‐7188.

WONG, W. S., and R. NIELSEN, 2004 Detecting selection in noncoding regions of nucleotide

sequences. Genetics 167: 949‐958.

WONG, W. S., Z. YANG, N. GOLDMAN and R. NIELSEN, 2004 Accuracy and power of statistical

methods for detecting adaptive evolution in protein coding sequences and for

identifying positively selected sites. Genetics 168: 1041‐1051.

WOOD, B., and M. COLLARD, 1999 The Human Genus. Science 284: 65‐71.

WOOD, K., 1998 The chemistry of bioluminescent reporter assays. Promega Notes 65: 14‐

20.

WOOLFE, A., M. GOODSON, D. K. GOODE, P. SNELL, G. K. MCEWEN et al., 2005 Highly conserved

non‐coding sequences are associated with vertebrate development. PLoS Biol 3:

e7.

WRAY, G. A., M. W. HAHN, E. ABOUHEIF, J. P. BALHOFF, M. PIZER et al., 2003a The evolution of

transcriptional regulation in eukaryotes. Mol Biol Evol 20: 1377‐1419.

WRAY, G. A., M. W. HAHN, E. ABOUHEIF, J. P. BALHOFF, M. PIZER et al., 2003b The evolution of

transcriptional regulation in eukaryotes. Mol Biol Evol 20: 1377‐1419.

WU, J., L. T. SMITH, C. PLASS and T. H. HUANG, 2006 ChIP‐chip comes of age for genome‐

wide functional analysis. Cancer Res 66: 6899‐6902.

WYCKOFF, G. J., W. WANG and C.‐I. WU, 2000 Rapid evolution of male reproductive genes

in the descent of man. Nature 403: 304‐309.

XIE, W., J. L. BARWICK, M. DOWNES, B. BLUMBERG, C. M. SIMON et al., 2000 Humanized

xenobiotic response in mice expressing SXR. Nature 406: 435‐

439.

XIE, X., J. LU, E. J. KULBOKAS, T. R. GOLUB, V. MOOTHA et al., 2005 Systematic discovery of

regulatory motifs in human promoters and 3' UTRs by comparison of several

mammals. Nature 434: 338‐345.

YANG, Y., S. SWAMINATHAN, B. K. MARTIN and S. K. SHARAN, 2003 Aberrant splicing induced by

missense mutations in BRCA1: clues from a humanized mouse model. Hum Mol

Genet 12: 2121‐2131.

YANG, Z., 1997a PAML: a program package for phylogenetic analysis by maximum

likelihood. Comput Appl Biosci 13: 555‐556.

YANG, Z., 1997b PAML: a program package for phylogenetic analysis by maximum

likelihood. Comput Appl Biosci 13: 555‐556.

YANG, Z., 1997c PAML: a program package for phylogenetic analysis by maximum

likelihood. Comput Appl Biosci 13: 555‐556.

YANG, Z., 1998 Likelihood ratio tests for detecting positive selection and application to

primate lysozyme evolution. Mol Biol Evol 15: 568‐573.

YANG, Z., 2002 Inference of selection from multiple species alignments. Current Opinion

in Genetics and Development 12: 688‐694.

YANG, Z., and J. P. BIELAWSKI, 2000 Statistical methods for detecting molecular adaptation.

Trends Ecol Evol 15: 496‐503.

YANG, Z., and R. NIELSEN, 2002a Codon‐substitution models for detecting molecular

adaptation at individual sites along specific lineages. Mol Biol Evol 19: 908‐917.

YANG, Z., and R. NIELSEN, 2002b Codon‐substitution models for detecting molecular

adaptation at individual sites along specific lineages. Mol Biol Evol 19: 908‐917.

YANG, Z., R. NIELSEN, N. GOLDMAN and A. M. PEDERSEN, 2000 Codon‐substitution models for

heterogeneous selection pressure at amino acid sites. Genetics 155: 431‐449.

YANG, Z., R. NIELSEN and M. HASEGAWA, 1998 Models of amino acid substitution and

applications to mitochondrial protein evolution. Molecular Biology and Evolution

15: 1600‐1611.

YANG, Z., W. S. WONG and R. NIELSEN, 2005a Bayes empirical bayes inference of amino acid

sites under positive selection. Mol Biol Evol 22: 1107‐1118.

YANG, Z., W. S. W. WONG and R. NIELSEN, 2005b Bayes Empirical Bayes Inference of Amino

Acid Sites Under Positive Selection. Mol Biol Evol 22: 1107‐1118.

YODER, A. D., and Z. YANG, 2000 Estimation of Primate Speciation Dates Using Local

Molecular Clocks. Mol Biol Evol 17: 1081‐1090.

YU, N., M. I. JENSEN‐SEAMAN, L. CHEMNICK, J. R. KIDD, A. S. DEINARD et al., 2003 Low

nucleotide diversity in chimpanzees and bonobos. Genetics 164: 1511‐1518.

YUNIS, J., and O. PRAKASH, 1982 The origin of man: a chromosomal pictorial legacy. Science

215: 1525‐1530.

ZHANG, J., 2003 Evolution of the Human ASPM Gene, a Major Determinant of Brain Size.

Genetics 165: 2063‐2070.

ZHANG, J., R. NIELSEN and Z. YANG, 2005 Evaluation of an Improved Branch‐Site Likelihood

Method for Detecting Positive Selection at the Molecular Level. Mol Biol Evol 22:

2472‐2479.
