POSITIVE SELECTION IN TRANSCRIPTION FACTOR GENES
ALONG THE HUMAN LINEAGE
by
GABRIELLE CELESTE NICKEL
Submitted in partial fulfillment of the requirements
For the degree of Doctor of Philosophy
Thesis Adviser: Dr. Mark D. Adams
Department of Genetics
CASE WESTERN RESERVE UNIVERSITY
January 2009
CASE WESTERN RESERVE UNIVERSITY
SCHOOL OF GRADUATE STUDIES
We hereby approve the thesis/dissertation of
Gabrielle Nickel______candidate for the _Ph.D.______degree *.
Helen Salz______(chair of the committee)
Mark Adams______
Radhika Atit______
Peter Harte______
Joe Nadeau______
______
(date) August 28, 2008______
*We also certify that written approval has been obtained for any proprietary material
contained therein.
TABLE OF CONTENTS
Table of Contents……………………………………………………………………………………………………….2
List of Tables……………………………………………………………………………………………………………...6
List of Figures……………………………………………………………………………………………………………..8
List of Abbreviations…………………………………………………………………………………………………10
Glossary………………………………………………………………………………………………..………………….12
Abstract..………………………………………………………………………………………………………………….16
Chapter 1: Introduction and Background………………………………………….……………………...17
Origin of Modern Humans………………………………………………………………………………..19
Human‐chimpanzee morphological divergence…………………………………….23
Human‐chimpanzee comparative genomics………………………………………….24
Human Molecular Evolution……………………………………………………………………………..27
Adaptive protein evolution……………………………………………………………………31
Changes in gene regulatory sequence……………………………………………………32
“Less‐is‐More” hypothesis of gene loss……………………………………………..….36
Gene duplication in human evolution……………………………………………………37
Epigenetic regulation of gene expression………..…………………………………….38
Gene Expression Differences…………………………………………………………………………….40
Human‐chimpanzee expression differences………………………………………….41
Selective pressures on gene expression………………………………………………..43
Positive Selection……………………………………………………………………………………………..44
Detecting positive selection through population genetic analysis………………………………………………………………………………………………….46
Detecting positive selection by phylogenetic analysis……………………………52
Transcription Factors………………………………………………………………………………………..57
Conclusion………………………………………………………………………………………………………..60
Chapter 2: An empirical test for branch‐specific positive selection…………. ……………….62
Abstract…………………………………………………………………………………………………………….63
Introduction……………………………………………………………………………………………………..64
Materials and Methods…………………………………………………………………………………….68
Selection of genes and DNA sequencing………………………………………………..68
Phylogenetic analysis…………………………………………………………………………….73
Simulations testing the performance of codeml…………………………………….76
Empirical tests of positive selection using sequences simulated under a model of neutral evolution……………………………………………………….77
Results………………………………………………………………………………………………………………79
Sequencing of transcription factor genes and tests of selection……………79
Effect of phylogenetic breadth on predictions of positive selection……..83
Simulations to assess sensitivity of the strict branch+site test……………..86
Alternative null model using an empirical test……………………………………..91
Comparison with previous predictions of positive selection…………………98
Discussion……………………………………………………………………………………………………….103
Chapter 3: Human PAML Browser: A database of positive selection on human genes using phylogenetic analysis……………………………………………………………….110
Abstract………………………………………………………………………………………………………….111
Introduction……………………………………………………………………………………………………112
Data Sources and Processing…………………………………………………………………………..115
Input data…………………………………………………………………………………………..115
Statistical analysis……………………………………………………………………………….118
Results…………………………………………………………………………………………………………….121
User interface……………………………………………………………………………………..125
Database organization…………………………………………………………………………126
Discussion……………………………………………………………………………………………………….132
Conclusion………………………………………………………………………………………………………134
Chapter 4: Demonstrating functional divergence using genome wide expression arrays
in the positively selected gene CDX4……………………………………………………………………...135
Abstract………………………………………………………………………………………………….………136
Introduction……………………………………………………………………………………………………137
Materials and Methods…………………………………………………………………….…………….140
DNA sequencing………………………………………………………………….………………140
Phylogenetic analysis..………………………………………………………………………..141
cDNA clone construction…………………………………………………………………….142
Microarray………………………………………………………………………………………….143
Results………………………………………………………………………….………………………………..144
Discussion………………………………………………………………………………………………………161
Chapter 5: Conclusion and Future Directions…………………….……...... ……………………..165
The Evolution of Modern Traits……………………………………………………………..………167
Future Directions…………………………………………………………………………………………..170
Picking candidate genes………………………………………….………………………….171
In vivo studies……………………………………………………….…………………………….172
Transgenic mice……………………………………….……………………………..174
Gene targeted mice……………………………….………………………………..176
Target gene discovery……………………………………….…….………………………….178
ChIP‐chip……………………………………………………….………………………..179
Yeast‐2‐hybrid…………………………………………………………………………181
Luciferase gene expression studies…………………………………………………….183
MicroRNA screens………………………………………………………………………………191
Investigating selection around a phenotype……………………………………….194
Positive Selection Along the Human Lineage…………………..…………………………….195
Appendix………………………………………………………………………………………………………………..198
Literature Cited………………………………………………………………………………………………………212
LIST OF TABLES
Table 1.1 The dominant hypothesis of human molecular evolution and important
references…………………………………………………………………………………………….30
Table 2.1 The percent of primer pairs that produced high quality sequence from
human templates and macaque and chimpanzee masks……………………….69
Table 2.2 Sequence coverage of transcription factor genes and data sources………72
Table 2.3 Evolutionary models and parameter sets for codeml analysis……………….74
Table 2.4 Branch+site test site classes………………………………………………………………….75
Table 2.5 codeml results for transcription factor genes with significant results in the
strict branch+site test of positive selection…………………………………………..83
Table 2.6 Comparison of the results from the strict branch+site and empirical
tests...... 94
Table 2.7 Comparison of predictions of positive selection by different methods..100
Table 2.8 Tests for positive selection on other primate branches that exhibited
dN/dS > 1…………………………………………………………………………………………….104
Table 2.9 Effect of synonymous and nonsynonymous substitution on predictions of
positive selection…………………………………………………………………………………105
Table 3.1 Evolutionary models used by codeml ………………………………………………….122
Table 3.2 Summary of results of tests of selection on human genes…………………..123
Table 3.3 Gene ontology categories over‐represented among genes with p<0.05 in
the strict branch+site test……………………………………………………………………124
Table 4.1 Statistical analysis of CDX4 using different methods to predict positive
selection……………………………………………………………………………………………..147
Table 4.2 The results from the strict branch+site model for CDX4………………………147
Table 4.3 123 genes differentially regulated by human and chimpanzee CDX4…..156
Table 4.4 Cell lines chosen for CDX4 gene expression assays………………………………163
Table 5.1 Luciferase promoter‐reporter tests with controls………………………………189
Table 5.2 List of experiments and controls associated with luciferase promoter‐
reporter analysis…………………………………………………………………………………190
Table A.1 codeml results from 175 Transcription Factors……………………………………198
Table A.2 Sensitivity analysis: Factors that affect predictions of positive selection...... 208
LIST OF FIGURES
Figure 1.1 Phylogenetic tree of the order Primates……………………………………………….20
Figure 1.2 Mechanisms of human molecular evolution………………………………………….29
Figure 2.1 Probability of accurate ancestral sequence reconstruction……………………78
Figure 2.2 Phylogenetic tree of the primate species used in phylogenetic
analyses………………………………………………………………………………………………..81
Figure 2.3 Comparison of the results for predictions of positive selection using
codeml for the full primate+rodent alignment set to the minimal
alignment set…………………………………………………………………………….………….85
Figure 2.4 Distribution of likelihood ratio statistic (LRS) values from the strict
branch+site test of simulated gene sets………………………………………………..88
Figure 2.5 Prediction of positive selection across a range of branch lengths
representative of human genes……………………………………………………………89
Figure 2.6 Contribution of dN/dS values to predictions of positive selection…………90
Figure 2.7 Comparison of the strict branch+site test, 50:50 mixture test, and
empirical test on simulated sequences…………………………………………………93
Figure 2.8 Comparison of the strict branch+site test and empirical test on human
genes…………………………………………………………………………………………………….97
Figure 3.1 Unrooted mammalian species tree of the organisms used in the
construction of the database and the phylogenetic tree used in PAML
analysis……………………………………………………………………………………………….117
Figure 3.2 PAML database summary results for Iroquois homeobox 3 (IRX3)………128
Figure 3.3 Multispecies protein alignment for IRX3…………………………………………….129
Figure 3.4 Likelihood test data and results for IRX3……………………………………………..131
Figure 4.1 The step by step process of fusion PCR……………………………………………….143
Figure 4.2 The protein multispecies sequence alignment for CDX4………………………145
Figure 4.3 Number of base pair matches of chimpanzee and macaque genome to
the human specific microarray……………………………………………………………149
Figure 4.4 Comparison of the gene expression profiles of a human and chimpanzee
fibroblast cell line………………………………………………………………………………..150
Figure 4.5 Scatter plots of BeadChip microarray gene expression data ……………….152
Figure 4.6 Chimpanzee fibroblast cell lines transfected with human or chimpanzee
CDX4 cDNA………………………………………………………………………………………….161
LIST OF ABBREVIATIONS
α Significance threshold
AMH Anatomically modern humans
BAC Bacterial artificial chromosome
cds Coding sequence
ChIP Chromatin immunoprecipitation
CNGs Conserved non‐genic sequence
dN The rate of nonsynonymous substitution
dS The rate of synonymous substitution
EHH Extended haplotype homozygosity
FST Wright’s fixation index
LRT Likelihood ratio test
Mb Megabase
ML Maximum likelihood
mya Million years ago
N The absolute number of nonsynonymous substitutions
PAML Phylogenetic Analysis using Maximum Likelihood
PCR Polymerase chain reaction
qPCR Quantitative polymerase chain reaction
S The absolute number of synonymous substitutions
SNP Single nucleotide polymorphism
UTR Untranslated region
ω The ratio of the rate of nonsynonymous substitution to the rate of synonymous
substitution. Also written as dN/dS.
Y2H Yeast two‐hybrid
GLOSSARY
Anthropoid‐ Simian; the “higher primates” including New and Old World monkeys and apes
Bayesian analysis‐ An approach to inference in which probability distributions of model parameters represent both what we believe about the distributions before looking at data and the likelihood of the parameters given the observed data.
Bonferroni correction‐ A conservative statistical multiple test correction used to correct the significance threshold
Branch length‐ The number of nucleotide substitutions per codon.
Coancestry coefficient (FST)‐ The correlation of genes or a measure of relatedness of different individuals in the same population.
Codon bias‐ The non‐random usage of synonymous codons for the same amino acid.
Effective population size (Ne)‐ The number of individuals in a population that contributes to the next generation. It is generally much smaller than the number of individuals in the population, and is influenced by population substructure, sex ratio, mating systems, and age distribution.
Epigenetic‐ Inherited without involving a change to the DNA sequence.
Fitness‐ The ability of an individual to survive and reproduce relative to the rest of the population.
Fixation‐ The increase in frequency of a genetic variant in a population to 100%.
Founder effects‐ Loss of genetic variation when a new colony is established by a very small number of individuals from a larger population.
Fourfold degenerate site‐ Any nucleotide at this position in a codon specifies the same amino acid
Genetic drift‐ The random fluctuations of allele frequencies over time due to chance alone.
Goodness‐of‐fit‐ A class of statistical methods used to assess competing models on the basis of their fit to empirical data.
Haplotype‐ A continuous DNA sequence of arbitrary length along a chromosome that has a primary structure that is distinct from that of other homologous regions in a given population
Hitchhiking‐ The increase in frequency of a neutral allele as a result of positive selection for a linked allele. See selective sweep.
Hominid‐ A member of the family Hominidae: all of the great apes.
Hominin‐ A member of the tribe Hominini: chimpanzees and humans.
Hominan‐ A member of the sub‐tribe Hominina: modern humans and their extinct relatives.
Homologous‐ Phenotypic or genotypic characters that share a common ancestor.
Indel‐ A mutation that involves the insertion or deletion of DNA sequence.
Likelihood ratio test‐ A method for comparing the likelihood of two different hypotheses.
Linkage disequilibrium (LD)‐ The non‐random association of polymorphisms at two linked loci. LD is created by mutation and broken down over time by crossing over between the two loci.
Maximum likelihood analysis‐ A statistical method that calculates the probability of the observed data under varying hypotheses, in order to estimate model parameters that
best explain the observed data and determine the observed strengths of alternative hypotheses.
Molecular clock‐ Nucleotide (or amino acid) substitutions occur at a more or less fixed rate over evolutionary time. Used to calculate the amount of sequence divergence and lapsed time since two molecules diverged.
Negative selection‐ The removal of a deleterious genetic variant from the population owing to the reduced reproductive success of its carriers. Also known as purifying selection.
Neutrality‐ The state of being free from the effects of natural selection.
Nondegenerate site‐ Any mutation at this nucleotide position in a codon results in amino acid substitution.
Nonsynonymous substitution‐ A nucleotide substitution that results in an amino acid replacement.
Outgroup‐ Species that are more distantly related to two or more species studied and can therefore be used to estimate the ancestral state of a trait.
Orthologous‐ Genes or proteins that evolved from a common ancestral gene through speciation.
Paracentric inversion‐ An inversion of a region of a chromosome that does not include the centromere.
Paralog‐ Highly similar non‐allelic sequences resulting from a duplication event.
Pleistocene‐ An epoch of the Quaternary period beginning 1.8 million years ago and transitioning to the Holocene epoch roughly 10,000 years ago.
Population bottleneck‐ The transient reduction in the abundance of a population
Positive selection‐ The accelerated spread of a beneficial genetic variant in a population owing to the increased reproductive success of its carriers.
Prosimian‐ The “lesser primates”; the most ancestral extant primates that include lemurs, lorises, aye‐ayes, and bushbabies.
Quantitative PCR‐ Polymerase chain reaction which simultaneously amplifies and quantifies the amount of a targeted DNA or cDNA molecule.
Selective sweep‐ The rapid fixation of an allele and all the alleles surrounding it.
Site frequency spectrum‐ The frequency distribution within the population of polymorphic nucleotide sites from a given sequence.
Stabilizing selection‐ When negative selection acts on a phenotypic trait.
Synonymous substitution‐ A nucleotide substitution that does not result in an amino acid replacement.
Transcriptome‐ The combined variability of all transcripts produced by the genome
Transition‐ A base exchange in which a pyrimidine base (C or T) is exchanged for another pyrimidine base, or a purine base (G or A) is exchanged for another purine base.
Transversion‐ A base exchange in which a pyrimidine base (C or T) is exchanged for a purine base, or a purine base (G or A) is exchanged for a pyrimidine base.
Twofold degenerate site‐ Only two of four possible nucleotides at this position in a codon specify the same amino acid.
Positive Selection in Transcription Factor Genes Along The Human Lineage
Abstract By
GABRIELLE CELESTE NICKEL
The genetic changes that led to the phenotypic divergence of modern Homo sapiens are unknown, as are the molecular mechanisms through which these changes transpired. A predominant theory in the field of human molecular evolution is that these changes resulted from adaptive protein evolution, also known as positive selection. This theory posits that amino acid changes occurred along the human lineage that were beneficial and became fixed in the population, thereby contributing to divergence. Transcription factors are proteins that regulate the expression of other genes, and they have a major role in embryogenesis and development. In this thesis, I have investigated transcription factor genes for human‐specific positive selection that contributed to the phenotypic divergence of humans and chimpanzees. Over the course of this work, I have developed a novel statistical approach for predicting positive selection that is particularly useful in closely related species, as I have shown that traditional methods lack power to detect selection along short branch lengths. I have also used genome‐wide expression arrays to test for differential activity and functional divergence of a human candidate transcription factor predicted to be positively selected.
CHAPTER 1
INTRODUCTION AND BACKGROUND
Nearly one hundred and fifty years ago, Charles Darwin declared that through his theory of natural selection “much light will be thrown on the origin of man and his history” (DARWIN 1859). While now considered a mild proclamation, the implications of this statement in mid‐nineteenth century England were monumental. In a society that considered man’s natural history to be confined to Genesis 1:26, the suggestion that man may have descended from such a primitive beast as an ape was at the very least controversial, and at best the beginning of a new era of human reflection on man’s place in nature. Thomas Henry Huxley’s tome Evidence as to Man’s Place in Nature and Darwin’s later publication The Descent of Man are arguably the most influential texts on human evolution and the seed of a new field of evolutionary biology.
Darwin and his contemporaries were certainly not the first to acknowledge man’s similarity to the apes. In fact, the word chimpanzee is derived from kivili‐chimpenze, meaning “mock‐man” in Tshiluba, the language of the Bantu people of the Congo. Just prior to the publication of On the Origin of Species, skeletons of strange humanoid creatures were being unearthed across the globe, the most famous of which was a Neanderthal skull discovered in a quarry near Düsseldorf, Germany in 1856. The discovery of these archaic human skeletons, the growing understanding of ape anatomy, and the theory of natural selection came together in an explosive fashion to create the discipline of human evolution that still captivates scientists today. The notion that man has descended from the apes is as humbling as it is magnificent.
ORIGINS OF MODERN HUMANS
Homo sapiens (Latin: "wise man") are members of the biological order Primates, a group that includes lemurs, monkeys, and apes. Carolus Linnaeus, the father of modern taxonomy, first described the order Primates, organizing species according to morphology (LINNAEUS 1735). His classifications have held up unexpectedly well since the advent of modern molecular biology techniques for systematics. Members of this order first appeared in the fossil record approximately 65 million years ago (mya) as small tree‐dwelling prosimians similar to the extant lemurs and lorises of today. Monkeys emerged 40 mya and diverged into New and Old World species as early as 30 mya, exclusively inhabiting the Americas or Asia and Africa, respectively.
Morphological and sequence data have now provided a clear classification of humans among the apes. There are two recognized families (GROVES 2005): the Hylobatidae or lesser apes (gibbons and siamangs) and the Hominidae (orangutan, gorilla, chimpanzee, and human). Historically, humans were classified in their own family, largely because of an anthropocentric view of human evolution. Goodman (GOODMAN 1999) suggested that humans, chimpanzees, and bonobos all be classified under the genus Homo. Morphological and behavioral data, along with a paucity of fossil apes, made it difficult to decipher the ancestral relationships among the apes predating the emergence of archaic humans. With the advent of sequence data, the relationships among these animals have been established (Figure 1) (YODER and YANG 2000). The divergence of lesser and great apes occurred roughly 17 mya, followed by the orangutan divergence around 12 mya. Gorillas were next to diverge, around 7‐9 mya. Only the chimpanzee lineage has split into two surviving species, common chimpanzees and bonobos (often called pygmy chimpanzees); this split occurred around 3 mya. Fossil data suggested that humans and chimpanzees diverged around 5‐7 mya, a date that has been supported by sequence data (PATTERSON et al. 2006).
Figure 1: Phylogenetic tree of the order Primates. This order is informally divided into three groups: prosimians, monkeys, and apes, with special attention paid to the subgroup of great apes. Genus names are given in italics. Suborders, parvorders, and superfamilies are displayed on the branches. Tree construction is based on consensus data from several analogous molecular studies and reviews (GOODMAN et al. 1990; GROVES 2005; GROVES 2001; YODER and YANG 2000).
The earliest fossil records of archaic human populations, also known as hominans, date to around 6 mya (SENUT et al. 2001). Members of this subtribe include the genera (in order of appearance) Orrorin, Ardipithecus, Australopithecus, Paranthropus, Kenyanthropus, and Homo, the only genus with surviving members. No fossil chimpanzee skeletons have been located, with the exception of the Chadian specimen Sahelanthropus tchadensis, believed to have existed 7 million years ago (BRUNET et al. 2005). This specimen is thought to have postdated the most recent common ancestor of humans and chimpanzees. The genus Homo is presumed to have arisen around 2 mya in the same regions of Africa (ANTON and SWISHER 2004).
Multiple competing models describe the origins of anatomically modern humans (AMH); however, one model, often referred to as the ‘single origin’ model, is best supported by both fossil and molecular data (see GARRIGAN and HAMMER 2006 for a review). The model postulates that AMH arose during the late Pleistocene in Africa from a small isolated population, which radiated into Eurasia, Oceania, and the
Americas roughly 15 mya, replacing all other archaic Homo populations (CANN et al.
1987; EXCOFFIER 2002). The earliest fossil specimen of an AMH was discovered in
Ethiopia in volcanic pumice that was dated to 195,000 years ago (DAY 1969; MCDOUGALL
et al. 2005). Even this model has variants as researchers continue to debate if there was gene flow between AMH populations and archaic Homo species (GREEN et al. 2006;
HODGSON and DISOTELL 2008; RELETHFORD 2008; SERRE et al. 2004).
Fossil data has been an excellent starting point for unraveling the secrets of human evolution; however, it has generated as many new questions as it has answered.
Fossils have limited ability to answer questions about demography, such as population size and migration patterns, nor do they give any insight into how Homo sapiens have become such a morphologically divergent species. Molecular data, particularly DNA
polymorphism data from mtDNA, the X chromosome, and many autosomes, has become the main tool with which researchers are investigating human origins.
One of the more interesting findings genomic evidence has provided is the calculation of the effective population size (Ne), the minimum number of individuals in a population that contribute offspring to the next generation, that acted as founders for modern humans. For example, we now know there were approximately 10,000 fewer human individuals in the Pleistocene than the previous era (HARPENDING et al. 1998;
HARPENDING et al. 1993). The chimpanzee lineage does not show the same reduction in
population numbers as humans during the Pleistocene (STONE et al. 2001), which suggests that environmental catastrophe was not the primary cause of the human‐specific population reduction; however, species‐specific infectious illness may have contributed to this population bottleneck. A more likely scenario is one in which there were a number of extinctions in small, dispersed populations that were later recolonized with founder effects (ELLER 2002). The small effective population size as well as the recent
emergence of AMH served to explain the dearth of genetic diversity among human populations, particularly when compared to chimpanzee populations (HARPENDING et al.
1998). The heterozygosity rate in chimpanzees within a population varies with geography and population size, and is between 8.0 x 10^-4 and 17.6 x 10^-4. Strikingly, the heterozygosity level between chimpanzee populations is 19.0 x 10^-4 (CONSORTIUM 2005c). This is more than twice that seen within the human population, which has an average heterozygosity of 7.5 x 10^-4 (SACHIDANANDAM et al. 2001).
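The comparison above can be checked with simple arithmetic. The following Python sketch uses only the heterozygosity values quoted in the text; the function names and the 1 kb window are illustrative choices and do not come from the cited studies.

```python
# Heterozygosity values quoted in the text (differences per nucleotide site).
CHIMP_BETWEEN_POPULATIONS = 19.0e-4   # CONSORTIUM 2005c
HUMAN_AVERAGE = 7.5e-4                # SACHIDANANDAM et al. 2001


def fold_difference(larger, smaller):
    """Return how many times greater one heterozygosity value is than another."""
    return larger / smaller


def expected_differences(heterozygosity, length_bp):
    """Expected number of differing sites between two random sequences of length_bp."""
    return heterozygosity * length_bp


ratio = fold_difference(CHIMP_BETWEEN_POPULATIONS, HUMAN_AVERAGE)
print(f"chimpanzee/human heterozygosity ratio: {ratio:.2f}")
print(f"human differences per kb: {expected_differences(HUMAN_AVERAGE, 1000):.2f}")
print(f"chimpanzee differences per kb: {expected_differences(CHIMP_BETWEEN_POPULATIONS, 1000):.2f}")
```

The ratio comes to roughly 2.5, consistent with the statement that chimpanzee heterozygosity is more than twice the human value.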
Human‐chimpanzee morphological differences
The morphological differences between humans and the other apes are numerous, from dentition to foot anatomy to fingernails (AIELLO and DEAN 1990; WOOD and COLLARD 1999). Humans have a disproportionately longer ontogeny and lifespan than other apes. Hair cover has been drastically reduced. The craniofacial shape is also quite diverged, including the shape of the brain case, a reduction in the size of the canine teeth and the presence of a chin. Many of these differences are related to the upright gait that is unique to humans, such as the general shape of the body and thorax, the relative limb lengths, the dimensions of the pelvis, an S‐shaped spine, differences in the number and types of vertebrae, and a skull that is balanced upright on the vertebral
column. Lastly, there are the vast intellectual differences between man and the apes,
marked by a larger brain relative to body size and differences in brain topology, which led to the development of language and advanced tool making skills.
These major morphological differences are the likely result of what is known as ‘quantum evolution’ or macroevolution: changes that occur at the species level, rather than within a population (SIMPSON 1965). Since many of these phenotypic differences are not traceable in the fossil record, they may represent major adaptive shifts (see HONEYCUTT 2008 for a review). The root cause of these morphological changes is not known; they could reflect the cumulative effects of small changes in a large number of genes, or large changes in a small number of genes.
Human‐chimpanzee comparative genomics
Even though humans are closely related to chimpanzees, man is the most distinct of the apes. The morphological differences are obvious; however, human genome organization is also quite divergent. Humans have 46 chromosomes, while chimpanzees, bonobos, gorillas, and orangutans have 48, the result of a fusion of two small great ape chromosomes to form human chromosome 2. Many of the genes surrounding the region of the fusion have been pseudogenized or duplicated elsewhere in the genome, indicating that there may be functional relevance to this fusion (FAN et al. 2002a; FAN et al. 2002b). Humans have also undergone 9 pericentric inversions relative to the other great apes (YUNIS and PRAKASH 1982). Not surprisingly, the synteny between human and great ape chromosomes is well preserved; even the mutation rates in syntenic blocks of non‐coding sequence remain constant (WEBSTER et al. 2004). Models for human speciation based on these chromosomal rearrangements have been proposed (NAVARRO and BARTON 2003), stating that the rearrangements would have led to reproductive isolation because of hybrid sterility; however, these theories have not held up under scrutiny, as the original study lacked an appropriate outgroup to predict the selection that was theorized to have taken place in the rearranged segments (LU et al. 2003; RIESEBERG and LIVINGSTONE 2003).
Comparative genomic analysis has been performed on the chimpanzee and human genomes in many studies (CASWELL et al. 2008; CHEN and LI 2001; CONSORTIUM
2005c; CONSORTIUM 2004; EBERSBERGER et al. 2002; GLAZKO et al. 2005; KING and WILSON
1975; SHI et al. 2003a). Since these two genomes are so similar, the focus is not on what is the same, but on what is different. Based on the analysis of 2,400 megabases (Mb) of human‐chimpanzee alignments, the genome‐wide nucleotide difference between humans and chimpanzees is 1.23%, although when corrected for polymorphism, this estimate drops to 1.06% (CONSORTIUM 2005c). In fact, species‐specific polymorphism is estimated to account for between 11 and 22% of all the sequence differences between the two species (EBERSBERGER et al. 2002; ELANGO et al. 2006). The level of nucleotide diversity among human populations is less than half that among the Pan lineages (7.5 x 10^-4 for humans vs. 19.0 x 10^-4 for chimpanzees) (CONSORTIUM 2005c; SACHIDANANDAM et al. 2001), which indicates a recent population expansion among humans (KAESSMANN et al. 2001;
YU et al. 2003).
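As a rough consistency check on the figures above, the share of the raw divergence removed by the polymorphism correction can be computed directly. This Python sketch assumes, for illustration only, that the entire gap between the uncorrected (1.23%) and corrected (1.06%) estimates is attributable to polymorphism; the names used are hypothetical.

```python
RAW_DIVERGENCE = 0.0123        # genome-wide nucleotide difference quoted in the text
CORRECTED_DIVERGENCE = 0.0106  # same estimate after correcting for polymorphism


def polymorphism_share(raw, corrected):
    """Fraction of the raw divergence that disappears after the correction."""
    return (raw - corrected) / raw


share = polymorphism_share(RAW_DIVERGENCE, CORRECTED_DIVERGENCE)
print(f"share of differences attributed to polymorphism: {share:.1%}")
```

Under that assumption the share is roughly 14%, which falls inside the 11 to 22% range reported by the independent studies cited above.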
There is regional variation in sequence divergence, presumably because of regional differences in mutation and recombination rate (PATTERSON et al. 2006). The chromosome with the most sequence conservation is the X chromosome (0.94% divergence), and the chromosome with the highest observed divergence rate is the Y chromosome (1.78%) (CONSORTIUM 2005c). There is also variation among autosomes, with chromosome 21 having a 1.44% divergence rate, which is higher than the average value of the entire genome (CONSORTIUM 2004; WATANABE et al. 2004). The degree of sequence similarity also depends on the function of the sequence, with the highest conservation seen among protein coding sequence. The sequence of the 3’ untranslated region (UTR) has diverged twice as much as coding sequence, and the divergence rate is even higher within the 5’ UTR (HELLMANN et al. 2003). The average ratio of the nonsynonymous to the synonymous substitution rate (dN/dS) between human and chimpanzee sequence is 0.23, which is significantly higher than the dN/dS ratio of 0.13 observed between mouse and rat genomes.
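To make the dN/dS ratio concrete, the following Python sketch estimates it for a pair of aligned coding sequences using a simple counting approach in the style of Nei and Gojobori with a Jukes‐Cantor correction. This is an illustrative approximation only, not the maximum likelihood method implemented in codeml that is used in this thesis. The function names, the toy sequences, and the restriction to codons that differ at no more than one position are all assumptions made for brevity.

```python
from itertools import product
from math import log

# Standard genetic code, built in TCAG order ('*' marks stop codons).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def synonymous_sites(codon):
    """Number of synonymous sites in a codon (0 to 3).

    Each position contributes the fraction of its three possible single-base
    changes that leave the amino acid unchanged. Changes to a stop codon are
    counted as nonsynonymous in this sketch.
    """
    total = 0.0
    for pos in range(3):
        for base in BASES:
            if base == codon[pos]:
                continue
            mutant = codon[:pos] + base + codon[pos + 1:]
            if CODON_TABLE[mutant] == CODON_TABLE[codon]:
                total += 1 / 3
    return total


def jukes_cantor(p):
    """Correct a proportion of differing sites for multiple hits."""
    return -0.75 * log(1 - 4 * p / 3)


def dn_ds(seq_a, seq_b):
    """Return (dN, dS, omega) for two aligned, gap-free coding sequences."""
    if len(seq_a) != len(seq_b) or len(seq_a) % 3 != 0:
        raise ValueError("sequences must be aligned and a multiple of 3 long")
    syn_sites = nonsyn_sites = syn_diffs = nonsyn_diffs = 0.0
    for i in range(0, len(seq_a), 3):
        a, b = seq_a[i:i + 3], seq_b[i:i + 3]
        s = (synonymous_sites(a) + synonymous_sites(b)) / 2
        syn_sites += s
        nonsyn_sites += 3 - s
        changed = sum(x != y for x, y in zip(a, b))
        if changed > 1:
            raise ValueError("this sketch handles at most one change per codon")
        if changed == 1:
            if CODON_TABLE[a] == CODON_TABLE[b]:
                syn_diffs += 1
            else:
                nonsyn_diffs += 1
    dn = jukes_cantor(nonsyn_diffs / nonsyn_sites)
    ds = jukes_cantor(syn_diffs / syn_sites)
    omega = dn / ds if ds > 0 else float("nan")  # undefined without synonymous change
    return dn, ds, omega


# Toy example: one synonymous (GGT->GGC) and one nonsynonymous (GGT->GCT) change.
dn, ds, omega = dn_ds("GGTGGTGGTGGT", "GGCGGTGCTGGT")
print(f"dN={dn:.3f} dS={ds:.3f} omega={omega:.3f}")
```

A ratio below 1, as in the toy example and in the genome‐wide averages quoted above, indicates that nonsynonymous changes accumulate more slowly than synonymous ones, which is the signature expected under predominantly negative selection.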
Unexpectedly, insertions, deletions, and duplications are common in both genomes, accounting for roughly 3% of sequence differences (CONSORTIUM 2005c). A
large fraction of the nucleotide differences, insertions, and deletions resulted from repeat sequence (one third) and transposable elements (one quarter) that are active in both species. There are 53 human genes that appear to have been deleted or partially disrupted in chimpanzees, although in reality this number is probably much smaller due to the quality of the chimpanzee draft genome sequence. Hahn and Lee (HAHN and LEE
2005; HAHN and LEE 2006) have developed methods to look for human‐specific gene
disruptions, a difficult task in light of the draft chimpanzee genome. A full analysis of genes deleted or disrupted in humans has not yet been performed. Currently, only a small handful of genes are known to be deleted in or made into pseudogenes in humans
(CONSORTIUM 2005c).
Between 70‐80% of human and chimpanzee orthologous proteins have amino
acid sequence differences (CONSORTIUM 2005c; GLAZKO et al. 2005), although the lineages
on which the sequence divergences occurred are only now being discerned. The role of differences in splice variants and their abundance is as yet unclear. As more ape and monkey genomes are sequenced, ancestral sequence reconstruction can begin and the directionality of sequence divergence can be determined. The median ratio of replacement to silent substitutions is 2:3. In a comparison of human and chimpanzee cDNAs, Hellmann et al. (HELLMANN et al. 2003) estimated that 70% of the amino acid changing mutations (nonsynonymous substitutions) that have occurred are deleterious.
Even though the percent differences between the two species are small, these differences still account for about 35 million nucleotide differences and 5 million indels, both of which contain valuable information on the molecular evolution of modern humans.
Given this massive amount of nucleotide difference, most of this genomic change is likely to simply be “the noise of neutral substitutions” (CARROLL 2003) whose fixation resulted from random genetic drift.
HUMAN MOLECULAR EVOLUTION
The types of genetic changes that have made us uniquely human remain unknown. It appears that the changes that led to the emergence of modern humans
were not a linear additive process, but rather were the result of a string of adaptive radiations during which many branches were formed and then died out (see CARROLL 2003;
GARRIGAN and HAMMER 2006; PEARSON 2004 for reviews). This view is largely based on fossil data in which many different traits are observed in different combinations.
However, this evolution of traits sheds no light on the number and types of genes involved. Developmental geneticists have suggested that morphologic change is ultimately a consequence of changes in the amount, timing, and spatial patterning of gene expression and in gene regulatory networks (CARROLL 2000; DAVIDSON 2001).
According to this view, changes in the human lineage are likely to be associated with genes that are involved in development and have therefore shaped human anatomy, physiology and behavior. One of the great remaining questions is whether these differences are primarily the collective effect of small changes in a large number of genes or due to changes in a small number of genes that have a disproportionately large
effect.
The mechanisms through which evolution has shaped the human genome
remain hotly contested. There are currently two major hypotheses in the field (LI and
SAUNDERS 2005): the adaptive evolution of proteins and changes in non‐coding regulatory sequence directing gene expression. To a lesser extent, gene loss (OLSON
1999), copy number variation (NAHON 2003; OHTA 2000), and chromatin modification
(GARCIA et al. 2003) may also play a role in functional diversification of gene expression contributing to human phenotypic difference, but the evidence is far less apparent. It is important to note that these hypotheses are not mutually exclusive and it is most likely
a combination of all these genome‐altering events that has led to the divergence of
the human species (Figure 2). Table 1 lists important publications, both technical reviews
and specific examples, for each of the main models of human molecular evolution.
Figure 2: Mechanisms of human molecular evolution. The cooperative effect of different mechanisms of molecular evolution led to the emergence of modern humans.
Table 1: The dominant hypotheses of human molecular evolution and important references
Molecular Mechanism | Technical References / Reviews | Examples
Adaptive protein evolution | (AAGAARD and PHILLIPS 2005; ANISIMOVA et al. 2002; ARBIZA et al. 2006a; BAMSHAD and WOODING 2003; BISWAS and AKEY 2006; ELLEGREN 2005; FAY and WU 2000; FAY et al. 2003a; FELSENSTEIN 1993; GOLDMAN and YANG 1994; KONDRASHOV et al. 2002a; KREITMAN 2000; LI et al. 1985; NEI 2005; NIELSEN 2005; PAL et al. 2006; SABETI et al. 2006; TAJIMA 1993; TANG et al. 2007; WONG and NIELSEN 2004; YANG 1998; YANG 2002; YANG and NIELSEN 2002b; YANG et al. 1998; YANG et al. 2005b; ZHANG et al. 2005) | (AKEY et al. 2004; ARBIZA et al. 2006b; BAKEWELL et al. 2007a; BERSAGLIERI et al. 2004; BUSTAMANTE et al. 2005b; CLARK et al. 2003b; GILAD et al. GILAD et al. 2002; HAHN et al. 2001; 2004; HAMBLIN and DI RIENZO 2000; HELLMANN et al. 2003; HUTTLEY et al. 2000; JOHNSON et al. 2001; KELLEY et al. 2006; KITANO et al. 2004; LAMASON et al. 2005; LU and WU 2005; MEKEL‐BOBROV et al. 2005; NICKEL et al. 2008a; NIELSEN et al. 2005b; NORTON et al. 2007; PREUSS et al. 2004; RONALD and AKEY 2005; SABETI et al. 2002b; SABETI et al. 2007; SAUNDERS et al. 2006; SAUNDERS et al. 2002; SAWYER et al. 2005; SHI et al. 2003a; TISHKOFF et al. 2007; VOIGHT et al. 2006; WALSH et al. 2006; WANG et al. 2006a; WANG and SU 2004a; WILDMAN et al. 2003; WYCKOFF et al. 2000; ZHANG 2003)
Regulatory / non‐coding sequence changes | (AKASHI 2001; BUSH and LAHN 2005; CARROLL 2000; DERMITZAKIS et al. 2003; KHAITOVICH et al. 2004; MARGULIES et al. 2003; MARGULIES et al. 2005) | (BEJERANO et al. 2004; BOFFELLI et al. 2003; CACERES et al. 2003; CHABOT et al. 2007; DONALDSON and GOTTGENS 2006; DORUS et al. 2004b; ENARD et al. 2002a; GILAD et al. 2006; GILAD et al. 2005; GU and GU 2003; HEISSIG et al. 2005; HUBY et al. 2001; KARAMAN et al. 2003; KHAITOVICH et al. 2006a; KHAITOVICH et al. 2005a; KHAITOVICH et al. 2005b; KHAITOVICH et al. 2006b; KING and WILSON 1975; LEMOS et al. 2005; MAGNESS et al. 2005; POLLARD et al. 2006; PRABHAKAR et al. 2006; PREUSS et al. 2004; ROCKMAN et al. 2005; WEBSTER et al. 2004)
“Less‐Is‐More” hypothesis | (FORTNA et al. 2004; OLSON 1999; OLSON and VARKI 2003) | (GILAD et al. 2003b; HAHN and LEE 2005; HAHN and LEE 2006; KEIGHTLEY et al. 2005; KEIGHTLEY et al. 2005a; STEDMAN et al. 2004; WANG et al. 2006b)
Epigenetic regulation | (GARCIA et al. 2003; JAENISCH and BIRD 2003) | (ENARD et al. 2004; PASTINEN et al. 2004; WEBER et al. 2007)
Copy number variation | (FORTNA et al. 2004; HANCOCK 2005; HURLES 2004; KONDRASHOV et al. 2002b; LI et al. 2005; OHNO 1970; OHTA 2000; PRINCE and PICKETT 2002; ROTH et al. 2007) | (BAILEY and EICHLER 2006; CHENG et al. 2005; JOHNSON et al. 2001; MAKOVA and LI 2003; NAHON 2003; NGUYEN et al. 2006; PERRY et al. 2006; SHI et al. 2003b)
Adaptive protein evolution
Adaptive evolution, also known as positive or directional selection, can be an extremely effective way of maintaining advantageous protein changes that can increase the fitness of an organism, as well as being involved in morphological divergence and speciation. In fact, it is believed that 20‐45% of all amino acid changing substitutions
have been influenced by positive selection (BIERNE and EYRE‐WALKER 2004; FAY et al.
2002). Even a single point mutation can have an effect on the function of a protein and therefore also affect larger protein networks (BRIDGHAM et al. 2006; KNIGHT et al. 2006).
For example, amino acid replacements can alter DNA‐binding (TRELSMAN et al. 1989),
a change that is particularly consequential for transcription factors because it can affect the downstream
cascade of gene regulation. It is also a way to increase morphological diversity both within and between species (see CARROLL 2000 for a review).
Positive selection has been demonstrated to actively alter immunity and defense genes in humans and other primate taxa such as the major histocompatibility complex
(MHC) and leukocyte antigens (FAN et al. 1989; FILIP and MUNDY 2004), malarial defense
systems (SAUNDERS et al. 2002), reproduction such as sperm protamines (WYCKOFF et al.
2000), sensory perception such as taste receptors (SHI et al. 2003b), and dietary adaptation such as lysozyme (MESSIER and STEWART 1997) and lactase (BERSAGLIERI et al.
2004; TISHKOFF et al. 2007). Unfortunately, these types of genes are unlikely to account for the morphological and behavioral changes associated with human and chimpanzee divergence. It has long been known that genes involved in immunity, reproduction, and diet evolve rapidly, as an organism is continually exposed to new environmental pressures. Genes involved in morphology and physiology, however, are often involved in very complex genetic pathways, and positive selection is more difficult to uncover in
these genes because these changes may have been cooperative and involved compensatory mutations.
Changes in gene regulatory sequence
King and Wilson (1975) first suggested that regulatory mutations might be responsible for the human‐chimpanzee phenotypic differences. Until that time it was
assumed that humans and chimpanzees must have very different sets of genes
responsible for such great phenotypic difference. With the development of new techniques in molecular biology, however, it became clear that human and chimpanzee proteins were far more similar than previously expected. King and Wilson used peptide sequencing, immunological comparisons, and electrophoresis to compare over 40 human and chimpanzee orthologous proteins. They concluded that the human and chimpanzee peptides are more than 99% identical, and that accelerated protein evolution could not account for the human‐chimpanzee divergence. Rather, they proposed that small differences in the cis‐regulatory sequences controlling gene expression during embryonic development are the likely causes of the anatomical and behavioral differences between the two species. Divergence of gene regulatory systems remains the dominant theory in human molecular evolution; yet it is notoriously difficult to investigate because one can no longer rely on the rates of synonymous and nonsynonymous substitution. Techniques designed to investigate positive selection in non‐coding sequence have been developed and are beginning to be applied to the study
of human molecular evolution (DERMITZAKIS et al. 2003; MARGULIES et al. 2003; MARGULIES et al. 2005; POLLARD et al. 2006; WONG and NIELSEN 2004).
A variety of studies have demonstrated gene expression differences between humans and other apes (Enard et al. 2002; Caceres et al. 2003; Karaman et al. 2003;
Fortna et al. 2004; Khaitovich et al. 2004), but follow‐up techniques for linking cis‐regulatory differences to gene expression changes are weak. The importance of non‐coding regulatory sequence in vertebrate development is highlighted in a study by
Woolfe et al., who performed a whole‐genome comparison between human and the
pufferfish Fugu rubripes looking for conserved non‐genic sequence (CNGs) (WOOLFE et al.
2005). They found that many CNGs surrounding genes involved in transcription
regulation and development show significant enhancer activity.
Transcription factor binding sites are small and are often found at large distances away from the gene they are regulating, making their prediction problematic.
Humans and chimpanzees are also so closely related that it is hard to find functional
divergence in the face of nearly identical sequence. Typically the most important sequence is the most highly conserved, and the close evolutionary distance between these two species provides almost no information. This conservation, however, can be used to detect putative functional elements, either by looking at identical sequence in closely related species (BEJERANO et al. 2004) or conserved sequence in highly divergent species (THOMAS et al. 2003a). Phylogenetic shadowing methods developed by Boffelli et
al. (BOFFELLI et al. 2003) have overcome some of the technical barriers by using multiple sequence alignments of mammalian species to find functionally important regions specific to primates or humans. A second method for characterizing conserved sequence has been developed by Margulies et al. (MARGULIES et al. 2003; MARGULIES et al.
2005), which was initially used to detect clusters of transcription factor binding sites. It must be considered, though, that conservation is not synonymous with function, as deletions of large stretches of conserved sequence in regions known as gene deserts can be tolerated by an organism without any obvious phenotypic effect (NOBREGA et al. 2004).
Several studies have performed genome scans with the intent of finding
ultra‐conserved, functional non‐genic sequence in the human genome (BEJERANO et al. 2004;
DERMITZAKIS et al. 2003; THOMAS et al. 2003a; WOOLFE et al. 2005) that may have contributed to human phenotypic divergence. Typically, selection is predicted when the human sequence has diverged relative to the other species in highly conserved sequence blocks. However, it is difficult to distinguish a model of selection from one of neutral evolution, which could also account for the nucleotide differences. Wong and
Nielsen have developed a maximum likelihood method for predicting positive selection in CNGs using substitution rates in both genic and non‐genic sequence (WONG and
NIELSEN 2004). The true test lies in linking these predictions with actual functional
divergence.
These techniques were recently used to identify non‐coding sequences subject to
selection, known as ‘human accelerated regions’ (HARs) (POLLARD et al. 2006). Instead of
finding regulatory sequence, Pollard et al. identified a novel RNA gene expressed in neurons of both the developing and adult brain in both human and non‐human primates. Functional analysis has yet to be performed on this RNA. These non‐coding
RNAs, including miRNAs, are a new mechanism to explore in human evolution, as these elements have the potential to regulate the expression of large numbers of genes.
Keightley et al. (2005) examined CNGs in humans and came to a different conclusion about the selective pressures affecting these sequences. This group examined CNGs along the long arm of human chromosome 21 using multiple species
alignments (human‐chimp and mouse‐rat), correcting for mutation rate variation. They calculated the amount of selective constraint in the regions between the species pairs by estimating the fraction of mutations removed by natural selection. The selective constraint along the hominid lineage is about half as strong as along the murid lineage.
There are also equal numbers of conserved nucleotides in non‐coding sequence as in protein‐coding sequence in humans. In mice, there are twice as many conserved nucleotides in non‐coding sequence as in protein‐coding sequence. They propose that the reason for the lack of selective constraint could be a function of a general degradation in non‐coding sequence, which had been suggested earlier by this group
(Keightley et al. 2005a) or because of the lower effective population size of humans leading to a reduction in the effectiveness of selection.
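The constraint estimate described above can be illustrated with a minimal sketch. It assumes the common formulation in which selective constraint is one minus the ratio of observed substitutions to the number expected under neutrality; the function name and all counts are hypothetical and are not taken from Keightley et al.

```python
def selective_constraint(observed_subs, expected_subs):
    """Estimate the fraction of mutations removed by selection.

    Assumed formulation: C = 1 - observed / expected, where the expected
    count comes from a neutrally evolving reference (hypothetical here).
    """
    if expected_subs <= 0:
        raise ValueError("expected_subs must be positive")
    return 1.0 - observed_subs / expected_subs


# Hypothetical counts, chosen only to illustrate a twofold difference
# in constraint between two lineages.
hominid_constraint = selective_constraint(observed_subs=90, expected_subs=100)
murid_constraint = selective_constraint(observed_subs=80, expected_subs=100)
```

With these illustrative inputs the hominid value is half the murid value, which mirrors the qualitative pattern reported in the text rather than any measured quantity.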
“Less‐is‐More” hypothesis of gene loss
The “Less‐is‐More” hypothesis was first described by Maynard Olson as the primary mechanism for adaptive gene loss in an organism when faced with novel selective pressures (OLSON 1999). He postulates that gene loss occurs more often than functional mutation and it has the ability to spread rapidly through small populations, such as the types present in the recently evolved archaic human populations (OLSON and
VARKI 2003). The small Ne of early Homo sapiens is an ideal size for homozygous gene
loss among randomly‐mating individuals (VOGEL and MOTULSKY 1996). This type of purifying selection occurs when a species ‘breaks free’ from the selective constraints
once applied to its genome. It is believed that this sort of gene loss is responsible for many of the degenerative traits seen in modern human populations, such as delayed postnatal development, and loss of hair and muscle strength.
The initial sequencing of the chimpanzee genome revealed that 53 known human genes were either deleted or disrupted in the chimpanzee (CONSORTIUM 2005c).
More recently, Wang et al. investigated the rate of pseudogenization in the human genome by comparing it to chimpanzee sequence, and identified 80 non‐processed pseudogenes (WANG et al. 2006b) that have occurred since the divergence of humans and chimpanzees. The human specific loss of olfactory receptor genes is an excellent example of pseudogenization acting on an entire gene family (GILAD et al. 2003b) as humans began to rely more heavily on sight than olfaction for survival. Likewise, the inactivation of the predominant myosin heavy chain gene is credited with the reduction in mandibular and masticatory muscles since the divergence of modern humans from the archaic humans, Australopithecus and Paranthropus (STEDMAN et al. 2004).
Gene duplication in human evolution
Susumu Ohno first suggested that genes can evolve new functions through duplication, as they are free from the selective constraints of the source genes (OHNO
1970). A gene may undergo neofunctionalization, which is the acquisition of a novel function, or subfunctionalization, in which the duplicated gene takes on a part of the role of the source gene (see HANCOCK 2005; HURLES 2004 for reviews). A gene may also
evolve a subtly different function from the parent gene (microfunctionalization) which is often the mechanism through which gene families evolve (HANCOCK 2005). Lastly,
duplicated genes may evolve different expression patterns (LI et al. 2005). Positive selection upon a new protein function may rapidly fix the duplicated gene or segment in a population. This is particularly true along the human lineage among copy‐number variants involved in olfaction and immune function (NGUYEN et al. 2006), and several copy‐number variation ‘hotspots’ have been located in human and chimpanzee in locations of ancient duplications (PERRY et al. 2006).
Through genome comparisons of human and chimpanzee aligned sequence, it has been estimated that at least 180 genes have become duplicated along the human lineage, and 94 have duplicated along the chimpanzee lineage (CHENG et al. 2005). A
study comparing the rates of gene duplications among the great apes found that humans possess a larger number of lineage specific duplications, and identified 134 gene families that have gained human specific duplications (FORTNA et al. 2004). One such gene family is the human‐specific morpheus gene family, which underwent positive selection after a segmental duplication (JOHNSON et al. 2001); however, there is currently no functional information available for this gene family.
Epigenetic regulation of gene expression
A mechanism for human molecular evolution that is gaining interest is epigenetic regulation and the effects it can have on gene expression. DNA methylation and histone
modification are mechanisms that stably affect gene expression without involving mutation in DNA nucleotide sequence. Epigenetic regulation is required for embryonic
development, X‐chromosome inactivation, and imprinting, and many cancers show alterations in epigenetic regulation. These types of effects are particularly important during development, and they also allow an organism to respond to environmental stimuli by altering gene expression (JAENISCH and BIRD 2003). This mechanism, therefore, may have had important contributions to human evolution, particularly in the promoter region and cis‐regulatory sequence (WEBER et al. 2007).
Enard et al. examined the methylation patterns in 36 genes in the brain, liver, and lymphocytes of humans and chimpanzees using microarray‐based methylation analysis (ENARD et al. 2004). Twenty‐two of the 36 genes showed significant difference in methylation. Their most interesting finding was that methylation patterns were significantly different in 15 of 18 genes expressed in the brain, and that of these 15 genes, 14 were hypermethylated in humans. This led to the conclusion that methylation patterns have diverged over the course of human evolution providing a mechanism for changes in gene regulation without altering the underlying DNA sequence; however, further analysis needs to be performed on a more extensive data set.
GENE EXPRESSION DIFFERENCES
Over three decades ago, Mary‐Claire King and Alan Wilson (KING and WILSON
1975) examined the great paradox of human evolution: “The genetic distance between humans and the chimpanzee is probably too small to account for their substantial organismal differences.” They instead suggested that small changes in the machinery that controls gene expression during development were the driving force of human phenotypic divergence. Indeed, many studies have linked differences in gene
expression to differences in morphology, such as the expression of the growth factor
Bmp4 and the size and shape of the beak in Darwin’s finches (ABZHANOV et al. 2004).
These types of expression differences often arise from changes in the regulatory sequence of genes or are due to structural changes in proteins that modulate gene expression; however, epigenetic effects are also known to affect gene expression, particularly on the X‐chromosome and in imprinted regions (PASTINEN et al. 2004).
The study of gene expression differences is complicated by many factors because expression levels are constantly changing across different tissues, during different developmental stages, from environmental effects, and because transcription is a complex event often relying on other transcripts or proteins for proper gene regulation to occur (KHAITOVICH et al. 2006a). Cross‐species comparisons are even more difficult
because tissue specificity and developmental timing need to be synchronized and the
environment cannot be controlled. Many have turned to the use of microarrays for
cross‐species comparisons; however, the technique suffers from artifacts caused by sequence divergence and probe‐binding sensitivity (GILAD et al. 2005).
The measurement of gene expression differences is most often performed using mRNA levels; however, this comes with caveats of its own. Predicting protein levels based on mRNA level is difficult because gene expression may also be affected by codon bias and the downstream effect of tRNA availability and binding‐preference
(AKASHI 2001), as well as by post‐translational modification and the highly variable stability of the mRNA and protein. Fu et al. examined this issue by comparing mRNA levels of genes known to be differentially expressed between humans and chimpanzees to steady‐state protein levels (FU et al. 2007). In this case, they found a significant and reproducible positive correlation between mRNA and protein levels.
Human‐chimpanzee expression differences
A number of studies have examined differences in human and chimpanzee gene expression using mRNA levels (ENARD et al. 2002a; FU et al. 2007; GILAD et al. 2006; GU and GU 2003; KARAMAN et al. 2003). Karaman et al. examined the expression profiles of human and bonobo cultured fibroblast cells using gorilla as an outgroup to predict the ancestral state. Using microarray analysis coupled with Northern blots, they found 30 genes that had been upregulated in humans compared to bonobo, and 22 upregulated in bonobo compared to chimpanzee. This study, however, suffers from the fact that
cultured cells are an artificial system, and the gene expression changes observed may not be biologically relevant, and that fibroblasts may not be the subject of positive selection during human evolution.
Gene expression differences in tissue samples were examined for humans and chimpanzees, with orangutan as an outgroup (ENARD et al. 2002a). Blood leukocyte, brain, and liver mRNA levels were measured using microarray technology and protein differences were examined through two‐dimensional electrophoresis. The most surprising finding was that intraspecies (among individuals of a species) gene expression was the most divergent within a tissue sample. The tissue that showed the least amount of divergence was the brain; however, the rate ratio of the change was highest in brain on the chimpanzee to human lineages even though there were fewer genes diverging. Subsequent studies supported the results of Enard et al. with regard to the findings of gene expression differences in brain tissue (CACERES et al. 2003; GU and GU
2003; PREUSS et al. 2004).
Other studies have taken a more direct approach by performing functional analysis on human and chimpanzee promoter sequences (CHABOT et al. 2007; HEISSIG et
al. 2005; HUBY et al. 2001). The most recent study by Chabot et al. examined the promoter regions of genes previously shown to be differentially expressed between the two species. The sequences of 10 human and chimpanzee promoters were cloned upstream of a reporter gene, and differential expression was found for three promoters.
Site‐directed mutagenesis was used to pinpoint the functional nucleotide changes.
Selective pressures on gene expression
Khaitovich et al. have suggested a neutral model of transcriptome evolution based on human and chimpanzee gene expression data (KHAITOVICH et al. 2005a; KHAITOVICH et al. 2005b; KHAITOVICH et al. 2004). They propose that most changes that have occurred are selectively neutral based on a number of conclusions from the data. First, gene expression changes increase with evolutionary divergence (time) and have occurred in a linear fashion. They also note that intraspecies gene expression differences positively correlate with interspecies differences. Lastly, the rates of interspecies divergence are the same between both intact genes and pseudogenes. In support of the previous brain‐specific gene expression data, it was found that gene expression divergence within a species has also accumulated in a linear fashion. These conclusions were also supported in a study by Lemos et al. that examined differences in the
expression profiles of primates and found that most mRNAs are evolutionarily stable,
most likely due to the effects of stabilizing selection (LEMOS et al. 2005).
As with protein evolution, gene expression divergence can be affected by positive selection. Gilad et al. used a multi‐species array (serving to increase both the sensitivity and the specificity) to study gene expression differences in the liver between four primates: human, chimpanzee, orangutan, and rhesus macaque. They found that disease related genes and genes with the Gene Ontology classification of “regulation of cellular properties” were constrained by the effects of stabilizing selection. However,
the expression of one category of genes, transcription factors, was evolving rapidly specifically along the human lineage, and these changes are not simply the tail end of the neutral distribution (GILAD et al. 2006). Based on their ability to regulate gene expression and their own rapid evolution of gene expression, transcription factors were studied in this thesis as elements of evolutionary divergence leading to the phenotypic differences between humans and chimpanzees.
POSITIVE SELECTION
The unit of evolution in Darwin’s theory of natural selection is a phenotypic trait;
however, his theories can be easily applied to genomes, with the gene as the unit of
evolution. Advantageous mutations are fixed in a population, and deleterious ones are
eliminated. However, such a scenario imposes a heavy genetic load on an organism: there are cumulative costs associated with selection. This idea led Kimura (KIMURA 1968)
to propose a neutral theory of evolution that states that the majority of mutations
(polymorphisms and substitutions) are not the products of selection; rather they have no influence on the fitness of an organism. These selectively neutral mutations are transient and will either be fixed or eliminated from a population by random genetic drift. This theory also posits the presence of a molecular clock regulating sequence evolution at a constant rate over time in all organisms, and population diversity is largely a consequence of genetic drift. This neutral theory has been challenged often,
especially the molecular clock, but it highlights many important tenets about sequence evolution.
Today it is accepted that most genes are evolving neutrally or under purifying selection, preserving essential function; however, natural selection does play a role in shaping genomes, both by promoting fixation of advantageous mutations and eliminating harmful ones. Out of this idea arose Ohta’s ‘nearly neutral theory of molecular evolution’ in which both slightly advantageous and deleterious mutations were considered (KIMURA and OTA 1974; OHTA 1973; OHTA 1992). Methods to investigate the selective forces acting on a gene based on the rates of synonymous (dS) and nonsynonymous substitutions (dN) within protein‐coding DNA sequence have since been developed (KIMURA 1983; NEI 1975). Synonymous substitutions are nucleotide changes that change the codon but do not alter the amino acid. Nonsynonymous substitutions are nucleotide changes that do alter the amino acid and are most often found at the first and second positions within a codon. Despite the degeneracy of the genetic code, there are far more possible nonsynonymous substitutions than synonymous ones. Therefore, the absolute number of substitutions of each type must be normalized by the number of potential sites of substitution, giving dN and dS. The dS is defined as the rate of synonymous nucleotide substitution per synonymous site. Likewise, dN is the rate of nonsynonymous nucleotide substitution per nonsynonymous site.
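The notion of synonymous and nonsynonymous sites can be made concrete with a short sketch. It assumes a simple site‐counting scheme in the style of Nei and Gojobori, in which each codon position contributes the fraction of its three possible single‐nucleotide changes that leave the amino acid unchanged; changes to stop codons are counted as nonsynonymous here for simplicity. The function names are hypothetical, and this is not necessarily the procedure used in this thesis.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code with codons enumerated in TCAG order; "*" marks stops.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def count_sites(codon):
    """Return (synonymous, nonsynonymous) site counts for one sense codon."""
    syn = 0.0
    for pos in range(3):
        for base in BASES:
            if base == codon[pos]:
                continue
            mutant = codon[:pos] + base + codon[pos + 1:]
            # Each of the three possible changes at a position carries weight 1/3.
            # Simplifying assumption: changes to stop codons count as nonsynonymous.
            if CODON_TABLE[mutant] == CODON_TABLE[codon]:
                syn += 1 / 3
    return syn, 3.0 - syn


def total_sites(sequence):
    """Sum site counts over the complete codons of an in-frame coding sequence."""
    syn_total = nonsyn_total = 0.0
    for i in range(0, len(sequence) - 2, 3):
        syn, nonsyn = count_sites(sequence[i:i + 3])
        syn_total += syn
        nonsyn_total += nonsyn
    return syn_total, nonsyn_total
```

For example, `count_sites("ATG")` yields no synonymous sites because every single‐nucleotide change to the methionine codon alters the amino acid, whereas a fourfold‐degenerate codon such as `GGG` contributes one full synonymous site.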
The ratio of the rates at which these different types of substitutions accumulate can lend clues to the selective pressures influencing the gene. A dN/dS < 1 is indicative of purifying or negative selection. Nonsynonymous substitutions are most often deleterious and are quickly weeded out of a population, which is reflected in an underrepresentation of these types of mutations in a gene. A dN/dS = 1 indicates neutral evolution since synonymous and nonsynonymous substitutions are being
deposited at the same rate, with no bias for or against replacement substitutions. A dN/dS > 1 is strong evidence for adaptive or positive selection. This indicates that these amino acid changing mutations have offered a fitness advantage to the organism and are fixed in a population at a higher rate than synonymous substitutions. It must be noted that not all amino acid changes have been selected for; rather, the fixation of many was the result of random genetic drift. The challenge is to distinguish between selection and drift, and several statistical methods that have been developed to detect these functionally determined changes are discussed below.
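The interpretation of the ratio described above can be sketched as follows. The example assumes that dN and dS are obtained from the proportions of differing sites with a Jukes‐Cantor correction for multiple hits, which is one common choice rather than the only one; the function names and the counts are hypothetical. A point estimate alone does not establish selection, which is why the statistical tests discussed in the text are needed.

```python
import math


def jukes_cantor(p):
    """Correct a proportion of differing sites for multiple substitutions."""
    if not 0.0 <= p < 0.75:
        raise ValueError("proportion must lie in [0, 0.75)")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def dn_ds(nonsyn_diffs, nonsyn_sites, syn_diffs, syn_sites):
    """Return (dN, dS, dN/dS) from difference counts and site counts."""
    dn = jukes_cantor(nonsyn_diffs / nonsyn_sites)
    ds = jukes_cantor(syn_diffs / syn_sites)
    if ds == 0.0:
        raise ValueError("dS is zero; the ratio is undefined")
    return dn, ds, dn / ds


def interpret(omega, tolerance=1e-9):
    """Map a dN/dS point estimate to the regime it is consistent with."""
    if abs(omega - 1.0) <= tolerance:
        return "neutral"
    return "purifying" if omega < 1.0 else "positive"


# Hypothetical counts for a single pairwise gene comparison.
dn, ds, omega = dn_ds(nonsyn_diffs=5, nonsyn_sites=700, syn_diffs=8, syn_sites=300)
regime = interpret(omega)
```

In this illustrative case the nonsynonymous rate per site is well below the synonymous rate, so the estimate falls in the purifying regime.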
Detecting positive selection through population genetics
The use of population genetics to detect positive selection is best for predicting either very recent human selection or selection that has occurred between human populations. Positive selection leaves a fingerprint on a genome and the discovery of these patterns can lead to predictions about evolutionary direction. Such signatures
are: a high proportion of nonsynonymous substitutions, a reduction in genetic diversity,
high‐frequency derived alleles, differences between populations, and long‐range haplotypes.
Since positive selection leads to the fixation of advantageous alleles, one can compare the proportion of amino acid altering changes both between species and within a species to determine whether between‐species variation is due to neutral mutation or adaptive evolution (HUGHES and NEI 1988; LI et al. 1985). The genetic divergence between species is mutation that has been fixed by genetic drift, selection, or genetic hitchhiking, which is a result of a selective sweep. Polymorphism is the mutation within a species whose ultimate fate of fixation or loss has yet to be determined. This approach has limitations: it can only detect ongoing or very recent selection because multiple changes are needed to distinguish selection from the noise of neutral substitution.
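The comparison of divergence and polymorphism described above reduces to a 2x2 contingency table of fixed versus polymorphic, nonsynonymous versus synonymous changes. The sketch below assumes that the table is evaluated with a two‐sided Fisher's exact test and summarized with a neutrality index; the function names are hypothetical and the counts are illustrative only, not data from this study.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2

    def prob(x):
        # Hypergeometric probability of x in the top-left cell with fixed margins.
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    p_observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Sum the probabilities of all tables no more likely than the observed one.
    return sum(
        p for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= p_observed * (1 + 1e-9)
    )


def mk_test(fixed_nonsyn, fixed_syn, poly_nonsyn, poly_syn):
    """Return (p-value, neutrality index) for a divergence/polymorphism table.

    Assumes poly_syn and fixed_nonsyn are nonzero so the index is defined.
    """
    p_value = fisher_exact_two_sided(fixed_nonsyn, fixed_syn, poly_nonsyn, poly_syn)
    neutrality_index = (poly_nonsyn / poly_syn) / (fixed_nonsyn / fixed_syn)
    return p_value, neutrality_index


# Illustrative counts: an excess of fixed nonsynonymous differences
# relative to nonsynonymous polymorphism.
p_value, neutrality_index = mk_test(fixed_nonsyn=7, fixed_syn=17, poly_nonsyn=2, poly_syn=42)
```

A neutrality index below 1 together with a small p‐value is the pattern expected when more amino acid changes have been fixed between species than neutral polymorphism within the species would predict.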
The use of polymorphism and divergence was first suggested by Hudson et al.
(the HKA test) in 1987 as a way to test Kimura’s controversial theory of neutral evolution by examining the Adh locus in two Drosophila species (HUDSON et al. 1987). This goodness of fit statistic is restricted to comparing only two loci, however, and is best suited for examining balancing selection or recent selective sweeps. A modified version of the HKA test and the one to be used in this study is the McDonald‐Kreitman test
(MCDONALD and KREITMAN 1991b). This method is heavily rooted in the phenomenon of the selective sweep. In a population at any point in time, there is neutral variation surrounding a given locus. If a mutation occurs at that locus that adds an advantage to
an individual’s fitness, this mutation will be pulled very rapidly through the population.
During fixation of this mutation, surrounding (neutral) alleles will also be “swept” with the mutation and fixed. This has the immediate effect of eliminating variation around the selected site, and only with time will new neutral mutations be gradually reintroduced to this area (LI 1997). In this way a selective sweep will leave its fingerprint on a genome.
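The comparison of polymorphism and divergence that underlies the McDonald‐Kreitman test can be illustrated with a 2×2 table of counts. The following is a minimal Python sketch, not the implementation used in any study cited here: the function name `mk_test` is hypothetical, all four counts are assumed to be positive, and significance is assessed with a two‐sided Fisher's exact test, which is one common choice among several.

```python
from math import comb


def mk_test(dn, ds, pn, ps):
    """Return (neutrality_index, two_sided_fisher_p) for an MK 2x2 table.

    dn, ds: fixed nonsynonymous / synonymous differences between species
    pn, ps: nonsynonymous / synonymous polymorphisms within a species
    All four counts are assumed to be positive in this sketch.
    """
    # Neutrality index: values below 1 indicate an excess of fixed
    # nonsynonymous differences relative to polymorphism.
    ni = (pn / ps) / (dn / ds)

    fixed, poly = dn + ds, pn + ps
    nonsyn, total = dn + pn, dn + ds + pn + ps

    def prob(x):
        # Hypergeometric probability of x fixed nonsynonymous differences
        # given the table margins.
        return comb(fixed, x) * comb(poly, nonsyn - x) / comb(total, nonsyn)

    p_obs = prob(dn)
    lo, hi = max(0, nonsyn - poly), min(fixed, nonsyn)
    # Sum the probabilities of all tables no more likely than the observed one.
    p_value = sum(prob(x) for x in range(lo, hi + 1)
                  if prob(x) <= p_obs * (1 + 1e-9))
    return ni, p_value
```

With illustrative counts such as `mk_test(7, 17, 2, 42)`, the neutrality index falls below 1 and the p‐value is small, the pattern expected when nonsynonymous changes have been fixed more often than neutrality predicts.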
Bustamante et al. (BUSTAMANTE et al. 2005b) analyzed the data used in Clark et al.
(CLARK et al. 2003b) with the inclusion of single nucleotide polymorphism (SNP) data for
39 humans using a modified version of the McDonald‐Kreitman test, known as the mkprf (McDonald‐Kreitman analysis using Poisson Random Field). The Clark et al. study used phylogenetic analysis on multispecies sequence alignments of human, chimpanzee, and mouse to predict positive selection in more than 11,400 genes. Bustamante et al.
analyzed 3,323 genes that had at least 4 variable nonsynonymous sites in the alignment and found that 306 genes show evidence of positive selection with P < 0.05.
Unsurprisingly, genes involved in immunity and defense, gametogenesis, and sensory perception were found to be overrepresented in this list, a phenomenon seen in the other large scale scans for positive selection. One intriguing result from this analysis is that transcription factors are also overrepresented on this list of genes undergoing rapid adaptive evolution (P < 0.0001).
Use of the reduction in genetic diversity to predict positive selection was proposed by Tajima and takes advantage of the excess of rare alleles that a selective
sweep deposits on a genome (TAJIMA 1989). A selective sweep reduces or eliminates
variation in the DNA neighboring a positively selected mutation. Only with time is variation reintroduced into the ‘swept’ area, which produces an excess of rare alleles. The most common way to differentiate selection from genetic drift is through the calculation of Tajima’s D, which is based on the number of segregating sites and nucleotide diversity. A negative Tajima's D signifies an excess of rare alleles, indicating population growth or positive selection. A positive
Tajima's D signifies low levels of both low and high frequency alleles (that is, an excess of intermediate frequency alleles), indicating a decrease in population size or balancing selection. One important consideration is that demography alone can often produce an extreme D; a rapidly expanding population, for example, yields a negative D in the absence of selection.
Because a selective sweep usually encompasses a large region, pinpointing the causal variant can be quite difficult.
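As an illustration of how Tajima's D combines the number of segregating sites with nucleotide diversity, the following Python sketch computes the statistic from a small set of aligned sequences using the standard constants from Tajima (1989). The function name `tajimas_d` and the input format (equal‐length, gap‐free strings) are assumptions made for this example only.

```python
from itertools import combinations
from math import sqrt


def tajimas_d(seqs):
    """Tajima's D for a list of aligned, equal-length, gap-free sequences."""
    n = len(seqs)
    # S: number of segregating (variable) alignment columns
    s = sum(1 for column in zip(*seqs) if len(set(column)) > 1)
    if n < 4 or s == 0:
        # The variance term is zero for n < 4, and D is undefined when S = 0.
        raise ValueError("need at least 4 sequences and 1 segregating site")

    # pi: mean number of pairwise differences (nucleotide diversity per locus)
    pairs = list(combinations(seqs, 2))
    pi = sum(sum(a != b for a, b in zip(x, y)) for x, y in pairs) / len(pairs)

    # Constants that depend only on the sample size n
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n ** 2 + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)

    # D: difference between the two diversity estimates, scaled by its
    # estimated standard deviation
    return (pi - s / a1) / sqrt(e1 * s + e2 * s * (s - 1))
```

A sample in which every variable site is a singleton gives a negative D, whereas a sample split evenly between two haplotypes gives a positive D.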
High frequency derived (non‐ancestral) alleles can also be used to detect
selection, once again through the detection of a selective sweep (FAY and WU 2000).
These derived alleles will hitchhike with the advantageous allele, but not become fixed,
generally due to recombination. Recombination facilitates the movements of mutations between genetic backgrounds each generation, and since there is no selective pressure on the variant, there is no need to maintain or purge the polymorphism (RICE and
CHIPPINDALE 2001). The commonly used test for detection of high frequency derived alleles is Fay and Wu’s H, which is based on the number of derived variants found in a sample of chromosomes. A classic example of this mechanism is the selective sweep that occurred in African populations around the Duffy red cell antigen (FY) locus,
presumably because it conferred resistance to malaria (HAMBLIN and DI RIENZO 2000).
This test is also confounded by demography, as the results may instead reflect population subdivision ((SABETI et al. 2006) for a review).
Signatures of positive selection are also hidden in the differences between populations. Selection can affect one population and not another because of differing environmental pressures, which may drive one allele to high frequency in only one of the populations. In this case it is necessary to know the ancestral allele (to infer the
direction of selection) which can be determined through phylogenetic analysis. This test requires the calculation of the FST statistic, which is essentially a measure of the genetic
variability both between and within populations (LEWONTIN and KRAKAUER 1973). In a
neutrally evolving gene, loci across populations will evolve in a similar and predictable manner, in which case FST is determined by genetic drift. However, positive selection can cause deviations in FST values between populations at the allele of interest as well as
in surrounding nucleotide blocks.
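The idea behind FST can be illustrated for a single biallelic SNP. The Python sketch below uses the textbook heterozygosity‐based definition, FST = (HT − HS) / HT, and assumes equal sample sizes across populations; it is not necessarily the estimator used in the studies cited here, and the function name `fst` is hypothetical.

```python
def fst(freqs):
    """F_ST for one biallelic SNP.

    freqs: frequency of one allele in each population
           (equal sample sizes are assumed in this sketch).
    """
    p_bar = sum(freqs) / len(freqs)
    # Expected heterozygosity of the pooled (total) population
    h_t = 2 * p_bar * (1 - p_bar)
    # Mean expected heterozygosity within the individual populations
    h_s = sum(2 * p * (1 - p) for p in freqs) / len(freqs)
    # A monomorphic site carries no information about differentiation
    return 0.0 if h_t == 0 else (h_t - h_s) / h_t
```

Identical allele frequencies in all populations give a value of 0, and alternately fixed alleles give a value of 1.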
Akey et al. (AKEY et al. 2004) calculated FST values in three populations, African‐
American, East Asian, and European‐American, by analyzing the allele frequencies of
26,530 SNPs. They identified 174 genes with significant FST values that show signatures of positive selection, leading them to conclude that selection has influenced the variation seen in extant human populations.
In 2002, one of the most powerful tools for detecting recent human positive selection, known as Extended Haplotype Homozygosity (EHH), was developed (SABETI et
al. 2002b). This test essentially measures the breakdown of linkage disequilibrium in
large haplotype blocks around a selected allele. During a selective sweep, the advantageous allele brings with it long range associations with other alleles, or a haplotype block. Large haplotype blocks that have remained undisturbed by recombination are indicative of strong, recent positive selection or may lie in recombination “cold spots” in which there is a low recombination rate for as yet unknown reasons. This method can be used to detect positive selection both between species and between populations within a species; however, there is significantly greater power associated with the latter. With the release of the data from the International
HapMap Project (CONSORTIUM 2005b), this method was applied to a much larger and more detailed data set (SABETI et al. 2007). This analysis revealed more than 300 candidate regions that were subject to a selective sweep in one of the three populations studied: African, European, or East Asian. Upon closer examination of some of the regions, several genes with previous predictions of positive selection were highlighted.
Within the African population, the genes LARGE and DMD were predicted; both genes are related to infection by the Lassa virus. The genes SLC24A5 and SLC45A2, previously associated with skin pigmentation, were implicated in the European populations (LAMASON
et al. 2005). Lastly, genes involved in hair follicle development, EDAR and EDA2R, had evidence of selection in the Asian populations.
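The haplotype homozygosity measured by this test can be sketched as the probability that two randomly chosen chromosomes carrying the core allele are identical across a window of markers. The Python example below is a simplified illustration under stated assumptions: haplotypes are phased strings of marker alleles, and the function name `ehh` and its parameters are hypothetical rather than taken from any published software.

```python
from collections import Counter
from math import comb


def ehh(haplotypes, core, allele, left, right):
    """Extended haplotype homozygosity for one core allele.

    haplotypes: phased haplotypes as equal-length strings of marker alleles
    core:       index of the core marker
    allele:     core allele of interest
    left/right: inclusive marker indices bounding the window examined
    """
    # Keep only chromosomes that carry the core allele, trimmed to the window
    carriers = [h[left:right + 1] for h in haplotypes if h[core] == allele]
    n = len(carriers)
    if n < 2:
        return 0.0
    # Count the pairs of carrier chromosomes that are identical over the window
    same_pairs = sum(comb(c, 2) for c in Counter(carriers).values())
    return same_pairs / comb(n, 2)
```

Widening the window can only keep the value the same or lower it; after a recent sweep the value stays high over an unusually long distance.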
A number of genome scans for positive selection using a variety of population genetic methods have been performed (AKEY et al. 2004; KELLEY et al. 2006; VOIGHT et al.
2006; WALSH et al. 2006), each study providing candidate genes for follow‐up study of
positive selection. However, these studies all focus on very recent positive selection, and cannot answer questions about the ancient and fundamental changes that have occurred that led to the divergence of Homo sapiens. To find these types of answers one must look to phylogenetic analysis for predictions of ancient positive selection.
Detecting positive selection by phylogenetic analysis
Phylogenetic analysis has proven to be very useful in detecting selective pressure on a gene (YANG and NIELSEN 2002b). Some of the phylogenetic methods that use multiple sequence alignments to predict positive selection have been reviewed in Yang and Bielawski (YANG and BIELAWSKI 2000), Yang (YANG 2002), and Bamshad and Wooding
(BAMSHAD and WOODING 2003). The methods can be applied to find the signatures of positive selection along specific branches of a phylogenetic tree or at specific codons within a protein coding gene. At their simplest, these methods look at fixed synonymous and nonsynonymous differences in protein coding genes between species, given a specific phylogeny. An estimate of the rate at which these differences accrue can be used to infer positive selection, negative selection, or neutral evolution. I have chosen to use the maximum likelihood method of Goldman and Yang (GOLDMAN and YANG 1994) to investigate selection that is specific to the human lineage.
The advantage of the above method is that it is based on a codon substitution model, in which a sense codon (not a nucleotide) is the unit of evolution. Previous methods have used nucleotides or amino acids of protein‐coding sequence as the unit
of evolution, which results in the loss of an enormous amount of information by not taking synonymous and nonsynonymous substitutions into account (GOLDMAN and YANG
1994). The dN/dS ratios are calculated according to Li et al. (LI et al. 1985). Li was the first to classify nucleotide sites as nondegenerate, twofold degenerate, or fourfold degenerate depending on how often a nucleotide change will result in an amino acid change. In his likelihood based method, Li also incorporates a classification of a nucleotide change as a transition or a transversion, the former being the more common type of substitution and the more likely to be silent. Each nucleotide change is scored, according to its classification, by a probability based on the frequency of that type of substitution in mammalian genes. Ancestral sequences at internal nodes are predicted using maximum likelihood techniques based largely on
the work of Felsenstein (FELSENSTEIN 1973) for a better estimation of dN and dS. The method that Felsenstein developed parses each nucleotide in a multispecies alignment into discrete character states, and evaluates each substitution according to a probabilistic model of evolution. The maximum likelihood method of predicting selection assumes that synonymous nucleotide substitutions are selectively neutral; however, it must be noted that this method cannot tell whether a synonymous substitution was driven to fixation by random genetic drift or by selection (YANG and BIELAWSKI 2000). An
added benefit of calculating the synonymous substitution rate for each gene (rather
than estimating the neutral rate of mutation from other genes or non‐coding regions) is
that it accounts for the heavy association of mutation with GC content. In this way the neutral mutation
rate is tailored for each gene (KREITMAN 2000).
There are other parameters that a codon substitution model considers.
Transition‐transversion rate ratios are modeled in this likelihood approach, a factor often ignored by other models. This is important to consider because transitions at the third position of a codon are more likely to be synonymous than transversions, and therefore, ignoring this rate is likely to lead to an overestimation of dS and an underestimation of the dN/dS ratio. This model also corrects for multiple hits and back substitutions. Another often overlooked parameter is that of codon bias, that is, the largely unexplainable unequal codon frequencies seen within genes. All these parameters are calculated specifically for each gene in an alignment, rather than considered equal for all genes. Yang et al. (1998) demonstrate a simplified equation for the calculation of the rate of substitution from codon u to codon v:
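\[
q_{uv} =
\begin{cases}
0, & \text{if } u \text{ and } v \text{ differ at more than one position} \\
\pi_v, & \text{for a synonymous transversion} \\
\kappa \pi_v, & \text{for a synonymous transition} \\
\omega \pi_v, & \text{for a nonsynonymous transversion} \\
\omega \kappa \pi_v, & \text{for a nonsynonymous transition}
\end{cases}
\]
where π is codon frequency (π_v being the frequency of codon v), κ is the transition/transversion rate ratio, and ω is the nonsynonymous/synonymous rate ratio. This equation leaves the user with fewer assumptions about their data, and provides a customized framework for evolutionary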
analysis.
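The codon substitution rate described above can be made concrete with a short Python sketch. It assumes the standard genetic code, treats changes to or from stop codons as disallowed, and takes the target codon frequency, κ, and ω as plain arguments; the function name `rate` and its signature are hypothetical and serve only as an illustration.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, listed in TCAG order for each codon position
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
PURINES = {"A", "G"}


def rate(u, v, pi_v, kappa, omega):
    """Relative substitution rate from codon u to codon v.

    pi_v:  frequency of the target codon v
    kappa: transition/transversion rate ratio
    omega: nonsynonymous/synonymous rate ratio (dN/dS)
    """
    diffs = [(a, b) for a, b in zip(u, v) if a != b]
    # Only single-nucleotide changes between sense codons are allowed
    if len(diffs) != 1 or CODE[u] == "*" or CODE[v] == "*":
        return 0.0
    a, b = diffs[0]
    q = pi_v
    # Transition: purine <-> purine or pyrimidine <-> pyrimidine
    if (a in PURINES) == (b in PURINES):
        q *= kappa
    # Nonsynonymous: the encoded amino acid changes
    if CODE[u] != CODE[v]:
        q *= omega
    return q
```

For instance, with `pi_v=0.02`, `kappa=2.0`, and `omega=0.5`, the synonymous transition TTT→TTC receives twice the rate of the nonsynonymous transition TTT→CTT.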
Another advantage of the likelihood method is the ability to test “goodness‐of‐fit” models (GOLDMAN and YANG 1994). Maximum likelihood analysis provides statistical
tests that consider the suitability of different models (or scenarios) of evolution against a particular set of data. This comparison is done by randomly simulating data sets according to the given phylogeny and using the parameters (codon frequency, transition/transversion ratio) calculated from the original data to determine how often one would observe the real data. This is transformed into a likelihood score, which can be compared between different models of evolution, to find a scenario that most accurately represents the evolutionary history of the gene.
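The simulation‐based comparison described above reduces, in its final step, to asking how often simulated data produce a test statistic at least as extreme as the one observed. The Python sketch below shows only that final step. It assumes that a statistic (for example, twice the difference in log likelihood between two models) has already been computed for the real data and for each simulated data set; the likelihood fitting itself is not shown, and the function name `empirical_p_value` is hypothetical.

```python
def empirical_p_value(observed, simulated):
    """Empirical p-value of an observed statistic against simulated values.

    observed:  test statistic computed from the real data
    simulated: statistics computed from data sets simulated under the
               null model

    The +1 in numerator and denominator counts the observed data set
    itself, so the estimate is never exactly zero.
    """
    exceed = sum(1 for s in simulated if s >= observed)
    return (exceed + 1) / (len(simulated) + 1)
```

With 999 simulated data sets, the smallest attainable value is 0.001, so the number of simulations sets the resolution of the test.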
Several large scale scans have been performed using maximum likelihood methods and phylogenetic analysis in an attempt to better understand the way selection is shaping the human genome. Clark et al. (CLARK et al. 2003b) compared the
multiple sequence alignments of 7,645 orthologous genes between humans and chimpanzees, using mice to infer the ancestral state, and found that 1,547 genes had a dN/dS > 1. Of this number, a hypothesis of neutral evolution was rejected for only 125 genes with p < 0.01. In an effort to understand what sort of biological processes might
be the target of selective pressure, Clark used the Panther classification system (THOMAS et al. 2003b) to see if any categories of genes were overrepresented on this list compared to the normal distribution. Genes involved in amino acid catabolism and olfaction showed a significant trend to be under positive selection, as did genes with
Online Mendelian Inheritance in Man (OMIM) entries. Genes involved in developmental processes show a low p‐value, but as a class their representation on the list is not significant. Those developmental genes with low p‐values primarily fall into three
categories: skeletal development, neurogenesis, and genes that are homeotic
transcription factors.
A second scan was performed by Nielsen et al. (NIELSEN et al. 2005b) using a greater number of genes and a smaller number of species. This group set out to find genes that were targets of positive selection along the human or chimpanzee lineage during any point of these species’ evolution. The comparison of 13,731 human and chimpanzee orthologous genes reveals that 733 genes are evolving with a dN/dS > 1; however, only 35 of these have a p‐value < 0.05 that rejects a hypothesis of neutral evolution. Several groups of genes involved in particular biological processes (using the
Panther classification system) are overrepresented in the list of 733 genes. Topping the
list are genes involved in immunity and defense, chemosensory perception,
spermatogenesis, and those with an unknown biological process. Transcription factors
also appear to have prominent representation on this list.
Both these studies lack considerable power to detect positive selection. This is in
part due to the small number of species in each data set; the confident reconstruction of ancestral nodes is nearly impossible with only 2‐3 species. A particular problem facing the Nielsen group is the short evolutionary distance between the two species. An even bigger challenge for these groups is that most genes are well conserved, and they evolve with a dN/dS << 1. Positive selection is unlikely to affect all sites over time. By looking at a gene as an entire unit, the signatures of positive selection can be overlooked, as selection often acts upon specific sites or specific domains of a gene.
Therefore, many genes that have been targeted by positive selection will remain undetected by these methods (KREITMAN 2000).
Interestingly, three additional genome scans were performed to detect positive selection; this time, however, selection along the chimpanzee lineage was added into the analysis (ARBIZA
et al. 2006a; ARBIZA et al. 2006b; BAKEWELL et al. 2007a). Arbiza et al. (ARBIZA et al. 2006b) found that chimpanzee genes were evolving faster than human genes under both positive selection and relaxation of selective constraint. They also concluded that many genes with previous predictions of positive selection were actually undergoing a relaxation of selective constraint, and care must be taken to differentiate between these two possibilities. Bakewell et al. (BAKEWELL et al. 2007a) came to a similar conclusion, that even though human genes had a higher nonsynonymous substitution rate, significantly more genes were undergoing positive selection along the chimpanzee lineage. This led them to suggest that natural selection was not as efficient along the human lineage because of the small effective population size of humans.
TRANSCRIPTION FACTORS
Transcription factors are proteins that are prominent regulators of gene expression. These proteins regulate transcription initiation, thereby affecting the timing, amount, and spatial patterning of the expression of a given gene. Altering the
amino acid sequence of one of these proteins can potentially change its activity, thereby having dramatic or subtle downstream gene expression effects. This may happen by altering the transcription factor’s DNA‐binding ability (either enhancing or weakening the protein‐DNA bond) or by affecting the binding of other cofactor proteins, thus causing a modest change in transcription factor function or possibly even the acquisition of new function ((HSIA and MCGINNIS 2003b) for a review). This effect would be particularly striking were it specific to the development of the organism, as is the case with many transcription factors.
Proteins that regulate transcription are thought to encompass 5‐10% of all protein coding genes; however, this figure also includes genes that make up the basal transcription machinery and those involved in chromatin modification (LEVINE and TJIAN
2003). For the purpose of this project, the analysis will be restricted to the canonical
DNA‐binding transcription factors with the hypothesis that the positively selected amino acid changes in these proteins have led to functional divergence in activity that led to some of the phenotypic changes between humans and chimpanzees. These transcription factors are known to regulate transcription through protein‐protein interactions. These
interactions can be with other transcription factors, transcription cofactors that do not bind DNA, chromatin‐remodeling enzymes, or the basal transcription machinery. It is these interactions that cooperatively act to promote or inhibit transcription of a given gene ((WRAY et al. 2003a) for a review).
Transcription factors are composed of an array of functional domains, arranged in different combinations. These may include a DNA‐binding domain(s), a protein‐binding domain(s), a nuclear localization signal, and ligand‐binding domain(s). It is believed that the evolution of transcription factor gene families arose from either
“domain shuffling” or the deletion of domains (WRAY et al. 2003a). A disruption in the amino acid sequence of any one of these domains could potentially affect the binding affinity of that domain. This could then alter downstream events, such as the timing, amount, spatial patterning, and tissue‐specificity of the expression of target genes.
The bulk of the work done on transcription factor functional divergence and specificity has been performed in invertebrates. Treisman et al. (TREISMAN et al. 1989) demonstrated that changing a single amino acid of the DNA‐binding recognition motif of the homeodomain protein Paired (Prd) of Drosophila was enough to change this protein’s DNA‐binding specificity to that of two other homeodomain proteins, Fushi tarazu (Ftz) and Bicoid (Bcd). Changes outside of the DNA‐binding domain can also contribute to functional change. Grenier and Carroll (GRENIER and CARROLL 2000b),
through functional complementation experiments showed that functional divergence in
the Hox gene Ultrabithorax (Ubx) in Drosophila and the orthologous gene (OUbx) in onychophoran (the velvet worm, considered a proto‐arthropod) was due to sequences outside the DNA‐binding homeodomain. They propose that the interaction of Ubx with cofactors and other activity modifiers is the reason that OUbx can continue to regulate many of the same target genes as Ubx, but cannot transform segmental identity.
For many years it was suggested that changes in gene expression were primarily due to changes in the cis regulatory sequence to which a transcription factor binds. This is in large part due to the belief that since a transcription factor regulates many genes, a change in this transcription factor would have a disproportionately large effect on gene expression and therefore a detrimental effect on the development of an organism
((CARROLL 2000) for a review). Given the sequence similarity between humans and chimpanzees, it is likely that the small number of amino acid changes would result in subtle alteration of transcription factor function, affecting only a small subset of target genes (or perhaps only a single one). Transcription factors may therefore be ideal targets for the effects of positive selection acting on morphological diversity of an organism.
CONCLUSION
In this doctoral dissertation, I have chosen to examine the role that adaptive protein evolution played in the shaping of the human genome in order to better understand the molecular mechanisms behind the evolution of modern humans. In an effort to merge the two main hypotheses of human molecular evolution, natural
Darwinian selection and changes in gene regulation and expression, I have chosen to study the effects of positive selection on transcription factor genes, which are genes that regulate the expression of other genes. Transcription factors were chosen for
further study based on their involvement with embryogenesis and development in order to narrow the search for genes involved in the morphological and physiological divergence between humans and chimpanzees. The hypothesis is that the human protein has functionally diverged from its chimpanzee ortholog, and has contributed to the phenotypic changes along the human lineage. In the process of investigating positive selection, I have developed a new test for prediction that shows greater
sensitivity in the face of recent divergence such as between humans and chimpanzees
(Chapter 2 and database of findings in Chapter 3). This work has been published in the journals Genetics and Nucleic Acids Research, respectively. Follow up studies
demonstrating functional divergence of the human transcription factor are necessary
and methods for such experiments are discussed in Chapters 4 and 5. Through my investigations, I have uncovered several transcription factors that are candidates for positive selection and that may have contributed to the phenotypic changes that
occurred along the human lineage that make humans so unique among the great apes, and I have also provided a new method for detection of positive selection in closely related species that will aid researchers in better understanding the genetic changes that led to the emergence of the human species.
CHAPTER 2
AN EMPIRICAL TEST FOR BRANCH‐SPECIFIC
POSITIVE SELECTION
Gabrielle C. Nickel1, David L. Tefft2, Karrie Goglin3, and Mark D. Adams1
1Department of Genetics, Case Western Reserve University School of Medicine,
Cleveland, OH 44106
2The Broad Institute, Cambridge, MA 02141
3The Scripps Research Institute, La Jolla, CA 92037
Sequence data from this article have been deposited with the EMBL/GenBank Data Libraries under accession nos. DQ976435‐DQ977607 and EF185319‐EF185336.
Note: Portions of this chapter are published as Nickel GC, Tefft DL, Goglin K, and Adams MD. 2008. An empirical test for branch‐specific positive selection. Genetics 179 (4). D.L.T provided computer support with data analysis and database construction. K.G. assisted in sequencing. M.D.A. conducted phylogenetic analysis and sequence reconstruction.
ABSTRACT
Prediction of positive selection specific to human genes is complicated by the very close evolutionary relationship with our nearest extant primate relatives, chimpanzees. To assess the power and limitations inherent in use of maximum likelihood (ML) analysis of codon substitution patterns in such recently diverged species, a series of simulations was performed to assess the impact of several parameters of the
evolutionary model on prediction of human‐specific positive selection including branch
length and dN/dS ratio. Parameters were varied across a range of values observed in alignments of 175 transcription factor (TF) genes that were sequenced in twelve primate species. The ML method largely lacks the power to detect positive selection that has
occurred since the most recent common ancestor between humans and chimpanzees.
An alternative null model was developed based on gene‐specific evaluation of the empirical distribution of ML results using simulated neutrally evolving sequences. This empirical test provides greater sensitivity to detect lineage‐specific positive selection in the context of recent evolutionary divergence.
INTRODUCTION
A key question facing modern science is, “What makes humans different from
the other great apes?” Sequencing of the human genome has heightened the interest of scientists in understanding the origins of our species and the genetic basis for traits that distinguish us from our closest living relatives, chimpanzees. Sequencing of the chimpanzee genome revealed about 1% fixed single‐nucleotide differences with the human genome, and about 1.5% lineage‐specific DNA in each species (CHENG et al. 2005;
CSAAC, 2005a). Even though the percent divergence is small, there are about 35 million single nucleotide changes and 5 million insertion/deletion differences between human and chimpanzee, most of which are presumed to be evolutionarily neutral (CSAAC,
2005a). Thus the complete catalog of human‐chimpanzee genome differences is insufficient to reveal the genetic basis for phenotypic divergence between the species.
Several approaches have been applied to the task of identifying alleles that have been subject to positive selection – that is, that have increased in frequency in a population as a result of adaptive evolutionary advantage. These include methods based on allele frequency differences among human populations (TAJIMA 1993),
extended haplotype homozygosity indicative of a recent selective sweep (SABETI et al.
2002a), a discrepancy between rates of polymorphism and divergence (MCDONALD and
KREITMAN 1991a), and phylogenetic methods that compare the rates of evolution on different branches of the tree (FELSENSTEIN 1981; GOLDMAN and YANG 1994). Several
studies have characterized positive selection in human genes with a goal of identifying the genetic basis of human‐specific features such as a larger brain size, bipedalism, and skeletal and dental differences (VALLENDER and LAHN 2004). Likewise, genome‐wide
surveys have identified genes, gene families, and genome regions with a signature of positive selection (ARBIZA et al. 2006b; BAKEWELL et al. 2007b; BUSTAMANTE et al.
2005a; CLARK et al. 2003a; CSAAC, 2005a; DORUS et al. 2004a; GIBBS et al. 2007; VOIGHT et al.
2006). We are most interested in identifying genetic differences associated with
fundamental and ancient divergence between humans and the non‐human primates, and so have focused on the phylogenetic approach to identify positive selection that is unique to the human lineage.
The nature of selection acting on a protein‐coding sequence is characterized by comparison of the rate of nonsynonymous (dN) and synonymous (dS) substitutions.
Under neutral evolution, the rate of accumulation of synonymous and nonsynonymous substitutions should be equal, and a dN/dS value of one is expected. A dN/dS value greater than one is indicative of positive selection, while a value less than one indicates negative selection. About one‐third of chimpanzee protein sequences are identical to
the sequence of the human ortholog, with the remainder having typically one or two amino acid changes (CSAAC, 2005a). Clearly, positive selection cannot be inferred when two protein‐coding sequences are identical. But how different must two sequences be for the currently available methods to detect statistically significant positive selection?
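To make the distinction between synonymous and nonsynonymous change concrete, the Python sketch below tallies both kinds of difference between two aligned coding sequences. It is a deliberately naive count, not an estimate of dN or dS: it assumes the standard genetic code, in‐frame uppercase sequences without gaps, and it skips codons that differ at more than one position. Proper rates additionally require normalizing by the number of synonymous and nonsynonymous sites and correcting for multiple substitutions. The function name `count_differences` is hypothetical.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, listed in TCAG order for each codon position
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def count_differences(seq_a, seq_b):
    """Return (synonymous, nonsynonymous) counts of single-site codon changes."""
    syn = nonsyn = 0
    for i in range(0, min(len(seq_a), len(seq_b)) - 2, 3):
        u, v = seq_a[i:i + 3], seq_b[i:i + 3]
        # Skip identical codons and codons with more than one change, whose
        # classification depends on the order of the substitutions
        if sum(a != b for a, b in zip(u, v)) != 1:
            continue
        if CODE[u] == CODE[v]:
            syn += 1
        else:
            nonsyn += 1
    return syn, nonsyn
```

Two identical sequences return `(0, 0)`, which illustrates why no inference about selection is possible in that case.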
Statistical methods based on models of codon substitution patterns have been developed to identify positive selection specific to one branch in a phylogeny, even when only a subset of sites have experienced selection (YANG and NIELSEN 2002a).
Simulation studies using the maximum likelihood methods implemented in Phylogenetic
Analysis by Maximum Likelihood (PAML) (YANG 1997b) have shown that the approach is
conservative with few false positives, but also demonstrated a lack of power,
particularly in analyzing closely related sequences (WONG et al. 2004; ZHANG et al. 2005).
We were particularly interested in how the maximum likelihood codon substitution models would perform for analysis of selection on human proteins in comparison to chimpanzee, where there are few divergent positions.
Genome‐wide studies to date have been based on data from only two (ARBIZA et
al. 2006b; CSAAC, 2005a; NIELSEN et al. 2005a) or three species (BAKEWELL et al. 2007b;
CLARK et al. 2003a; GIBBS et al. 2007). In the context of human‐chimpanzee‐macaque gene alignments, only a handful of genes were predicted to be positively selected in humans after multiple test correction (BAKEWELL et al. 2007b; GIBBS et al. 2007). Analysis of sequence from a broader set of primate species would place human‐chimpanzee differences into a more robust phylogenetic framework and enable more specific analysis of evolutionary rates in each primate lineage. By improving our understanding of the factors that contribute to detection of positive selection, it should be possible to optimize the design of computational experiments to test for positive selection, and potentially the selection of additional organisms for genome sequencing. Therefore, we
collected sequence data from twelve primate species to provide a more comprehensive
view of the phylogenetic breadth that is required to detect positive selection, explore the patterns of codon substitution that accompany predictions of positive selection, study the factors that affect a prediction of positive selection, and examine the evolutionary characteristics of a group of important genes.
Predictions of positive selection in recently diverged species, such as humans and chimpanzees, are often difficult using traditional phylogenetic methods. Simulation
studies were performed to gauge the factors that most affect predictions of selection;
even under the best circumstances, the extent of divergence found between human and chimpanzee genes was not sufficient to meet the strict criteria for prediction of selection in the ML test. Through these studies we developed an empirical test that is
more sensitive to the often subtle changes that have occurred along the human lineage, and that provides the opportunity to identify additional candidates for positive selection that would not be predicted using currently available methods.
Developmentally regulated transcription factor genes were chosen for study
because of the fundamental role that many of these genes play in a broad array of developmental processes, the ability of small changes in sequence to cause broad changes in downstream gene expression patterns (e.g., GRENIER and CARROLL 2000a; TING
et al. 1998), and because TFs are over‐represented among genes predicted to be positively selected in previous genome‐wide studies of selection (ARBIZA et al. 2006b;
BUSTAMANTE et al. 2005a; CSAAC, 2005a).
MATERIALS AND METHODS
Selection of genes and DNA sequencing: Transcription factor genes were
selected for sequencing based on a prior prediction of positive selection (BUSTAMANTE et
al. 2005a; CLARK et al. 2003a; NIELSEN et al. 2005a) or involvement in developmental processes (Gene Ontology term GO:0032502) having an excess of nonsynonymous substitution relative to synonymous substitution in a comparison of human and chimpanzee genome annotations. Primate DNA samples were obtained from the NIA
Aging Cell Repository Phylogenetic Primate Panel (PRP0001). Human DNA samples were from the NIGMS Human Genetic Cell Repository Human Variation Collection (NA00946,
NA00131, NA01814, NA14665, and NA03715). Both panels were obtained through
Coriell Cell Repositories. Hylobates and Papio samples were kindly provided by Dr. Evan
Eichler.
A major hurdle was sequencing the distantly related primate species. Primates that diverged many millions of years ago, such as the lemur and the New World monkeys, are likely to have quite divergent sequence. This is especially true because primers were designed in intronic regions adjacent to the coding exons whose sequence is of interest. Without a template sequence for these primates with which to design primers, primers were initially designed exclusively from the human sequence. Human genomic sequence was used as input into primer3 (ROZEN and SKALETSKY 2000), and the primers were designed around coding exons. As can be
expected, the success rate for amplifying other primate species was quite low (Table 3‐
Human Design). The next approach we took was to apply a macaque mask to the human sequence by BLASTing the human genomic sequence against the macaque draft sequence. Mismatched basepairs were converted to Ns in the human sequence. Since the average sequence divergence between human and macaque is 4.9% for noncoding
sequence (MAGNESS et al. 2005), the macaque is well suited for this masking. It is diverged enough to reveal regions of conservation, but not so diverged that primers cannot be picked. This masked sequence was then used as input into primer3. After mismatches in the human sequence are masked, sometimes either no sequence is left for primer picking or what is left is suboptimal (e.g. high GC content, repetitive sequence). In cases where no suitable primer could be designed, a chimpanzee mask was applied to the human sequence. The success of the various masks can be seen in
Table 1.
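The masking step described above can be sketched as follows. This is a minimal illustration, not the pipeline used in this study: it assumes the human and second-species sequences have already been aligned to equal length (the study obtained the correspondence by BLAST), with '-' marking gaps, and the function name mask_mismatches is hypothetical.

```python
def mask_mismatches(human: str, other: str) -> str:
    """Return the human sequence with every base that differs from the
    aligned second-species base replaced by 'N'.

    Assumption: both inputs are pre-aligned, equal-length strings in which
    '-' marks an alignment gap.
    """
    if len(human) != len(other):
        raise ValueError("aligned sequences must have equal length")
    masked = []
    for h, o in zip(human.upper(), other.upper()):
        if h == "-":
            # Gap in human: there is no human base at this column to keep.
            continue
        # Keep the human base only where the two species agree;
        # a gap in the other species counts as a mismatch.
        masked.append(h if h == o else "N")
    return "".join(masked)


if __name__ == "__main__":
    human = "ACGTTGCA-ACGT"
    macaque = "ACGATGCATAC-T"
    print(mask_mismatches(human, macaque))
```

The masked string can then be passed to a primer design program in place of the raw human sequence, so that candidate primers containing N positions are avoided.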
Table 1: The percent of primer pairs that produced high quality sequence from human templates and macaque and chimpanzee masks
Species               Human Design   Macaque Mask   Chimp Mask
Human                 82%            78%            85%
Chimpanzee            41%            78%            84%
Bonobo                42%            78%            82%
Gorilla               59%            78%            87%
Orangutan             53%            78%            75%
Hamadryas Baboon      ND             65%            48%
Pigtailed Macaque     24%            65%            51%
Rhesus Macaque        41%            65%            54%
Spider Monkey         24%            49%            51%
Woolly Monkey         24%            51%            39%
Red‐bellied Tamarin   18%            53%            44%
Ring‐tailed Lemur     0              14%            21%
The chimpanzee mask has the highest success for being able to amplify sequence in the great apes, but it is less effective when amplifying product from Old and New
World monkeys. The success in the apes is likely due to optimality of the primers chosen, and the failure in monkeys because of sequence divergence. The success rate in
monkeys, however, increases when the macaque mask is applied because primers are designed in conserved regions. Less success in the apes can be tolerated because some of the sequence is already available. When alternative transcripts were present, the longest isoform was chosen for sequencing. Each forward and reverse primer was tagged with a 16‐ or 17‐mer M13 primer in order to hasten the sequencing process.
Each sequencing reaction was run on an Applied Biosystems 3730xl DNA sequencer.
Sequence was obtained from both strands of each PCR product and trace files were processed by phred and phrap (EWING and GREEN 1998; EWING et al. 1998; GREEN),
with each exon from each organism assembled independently. Low quality bases
(Q<25) in the consensus were converted to Ns. Sequence from each species was aligned to the human genomic DNA using matcher (an implementation of the Pearson lalign algorithm from the EMBOSS package) and the location of exons inferred from coordinates in the reference human sequence. Mouse and rat orthologs were obtained from Mouse Genome Informatics (http://informatics.jax.org), and aligned to the human sequence using transAlign (BININDA‐EMONDS 2005). Rhesus macaque and orangutan
alignments were supplemented with sequences extracted from the NCBI Trace Archive when our data did not cover the entire coding region. Chimpanzee alignments were supplemented with orthologous segments extracted from the chimpanzee draft genome assembly (CSSAC, 2005a), Build 1. Multiple sequence alignments of exonic regions were then constructed for each exon, again in reference to the human sequence, using matcher and sim4 (FLOREA et al. 1998), and exons were assembled to form the entire
coding sequence of the gene. The resulting alignments were reformatted to serve as input to the phylogenetic analysis programs. The total sequence from each source
(generated for this study and previously publicly available) and the average gene coverage for each organism are given in Table 2.
Table 2: Sequence coverage of transcription factor genes and data sources
Species               Common Name          Percent      Percent Lab  Percent w/   Percent      # Genes with
                                           Coverage(a)  Coverage(b)  Public Data  Identity(c)  Coverage
Pan troglodytes       Chimpanzee           98.1         81.9         16.2         99.4         175
Pan paniscus          Bonobo               79.2         79.2                      99.5         164
Gorilla gorilla       Gorilla              76.8         76.8                      99.4         155
Pongo pygmaeus        Orangutan            96.9         69.8         27.1         98.7         175
Hylobates klossi      Kloss's Gibbon       59.6         59.6                      98.1         41
Macaca mulatta        Rhesus monkey        96.4         54.2         42.2         97.6         171
Macaca nemestrina     Pigtailed macaque    59.8         59.8                      97.8         112
Papio hamadryas       Hamadryas baboon     44.5         44.5                      97.2         35
Ateles geoffroyi      Spider monkey        50.3         50.3                      97.1         112
Lagothrix lagotricha  Woolly monkey        50.6         50.6                      96.4         107
Saguinus labiatus     Red‐bellied tamarin  48.5         48.5                      96.1         125
Lemur catta           Ring‐tailed lemur    36.7         36.7                      95.3         43
Mus musculus          Mouse                97.6*                     97.6         85.2         132
Rattus norvegicus     Rat                  94.6*                     94.6         85.3         108
(a) Percent coverage is total percent of the human sequence spanned by high‐quality data from the species.
(b) Percent lab coverage is the percentage of the human sequence spanned by high‐quality sequence generated for this study.
(c) Percent identity is the average nucleotide identity of the protein‐coding alignment used for ML analysis.
*Includes only genes for which unambiguous orthologs could be assigned.
Phylogenetic analysis: Codeml from the PAML package (v3.14b, YANG 1997b)
was run on an Apple OS10.5 computer cluster and results were stored in a custom‐built
MySQL database. The codeml models used and the tests performed are summarized in
Tables 3 and 4.
Table 3: Evolutionary models and parameter sets for codeml analysis
Feature/Purpose‡                                                Formal Name(a)    Name in Text
One ω for all branches & sites                                  M0
Nearly neutral: 2 site categories; 0<ω<1, ω=1                   M1a
ω allowed to differ on the human branch                         MH
ω fixed at 1 on the human branch                                MH‐null
Branch‐site: sites on the human branch allowed to differ        MA
Branch‐site null: sites on the human branch fixed at ω=1        MA‐null
Positive selection at sites on all lineages                     M8
Neutral evolution at sites on all lineages                      M8a

Likelihood ratio tests of significance                          Compare
Positive selection on branch (ωH>ωO)                            MH, M0            Branch test
Positive selection on branch (ωH>1 and ωH>ωO)                   MH, MH‐null       Strict branch test
Branch‐site test 1 (Positive selection or relaxed constraint)   MA, M1a(b)        Relaxed branch+site test
Branch‐site test 2 (Strict test of positive selection)          MA, MAnull(c)     Strict branch+site test
‡ω = dN/dS.
(a) Models are as specified in (YANG 1997).
(b) This test is equivalent to Test 1 in (YANG and NIELSEN 1997a).
(c) This is equivalent to Test 2 in (YANG and NIELSEN 1997a).
Table 4: Branch+site test site classes(a)
Site class   Proportion                      Background    Foreground (MA)   Foreground (MA‐null)   Site class p for simulations
0            p0                              0 < ω0 < 1    0 < ω0 < 1        0 < ω0 < 1             0.6
1            p1                              ω1 = 1        ω1 = 1            ω1 = 1                 0.1
2a           (1 ‐ p0 ‐ p1) p0 / (p0 + p1)    0 < ω0 < 1    ω2 > 1            ω1 = 1                 0.26
2b           (1 ‐ p0 ‐ p1) p1 / (p0 + p1)    ω1 = 1        ω2 > 1            ω1 = 1                 0.04
(a) Modified from (ZHANG et al. 2005); ω=dN/dS
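As a worked check of the site class proportions, the short sketch below computes the four class proportions from p0 and p1 under the branch‐site parameterization of ZHANG et al. (2005), in which class 2a has proportion (1 ‐ p0 ‐ p1) p0 / (p0 + p1) and class 2b has proportion (1 ‐ p0 ‐ p1) p1 / (p0 + p1). The function name is hypothetical, and treating the simulation values 0.6 and 0.1 as p0 and p1 is an assumption made for illustration.

```python
def site_class_proportions(p0: float, p1: float) -> dict:
    """Proportions of branch-site classes 0, 1, 2a and 2b given p0 and p1.

    Classes 2a and 2b split the remaining mass (1 - p0 - p1) in the same
    ratio as p0 : p1.
    """
    rest = 1.0 - p0 - p1
    if p0 < 0 or p1 < 0 or rest < 0 or (p0 + p1) == 0:
        raise ValueError("need p0, p1 >= 0, p0 + p1 > 0 and p0 + p1 <= 1")
    return {
        "0": p0,
        "1": p1,
        "2a": rest * p0 / (p0 + p1),
        "2b": rest * p1 / (p0 + p1),
    }


if __name__ == "__main__":
    props = site_class_proportions(0.6, 0.1)
    for name, value in props.items():
        print(f"class {name}: {value:.2f}")
```

With p0 = 0.6 and p1 = 0.1 the computed values for classes 2a and 2b round to 0.26 and 0.04, matching the simulation proportions in the last column, and the four proportions sum to 1.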
Each gene was first tested by asking whether a model in which the human lineage was allowed to have a different dN/dS ratio than the rest of the tree was a better fit to the data than a model with a single dN/dS ratio throughout the tree
(“branch test”). Whether the human‐specific dN/dS ratio was significantly greater than 1 was tested by comparison to a model with the dN/dS ratio fixed at 1 on the human branch (“strict branch test”). The optimized branch‐site method of Yang and Nielsen
(YANG and NIELSEN 2002a; YANG et al. 2005a; ZHANG et al. 2005) was used with nested null models to test whether positive selection might be occurring at a subset of sites
(codons) on the human lineage by comparison of results from model A vs. model A‐null
(“strict branch+site test”) and model A vs. model 1a (“relaxed branch+site test”). The
strict branch+site test requires dN/dS>1 at a subset of sites, while the relaxed branch+site test requires only that a subset of sites on the human lineage have a significantly elevated dN/dS ratio compared to those sites on the remainder of the tree and can thus be significant in the presence of either positive selection or a relaxation of
selective constraint (ZHANG et al. 2005). A consistent species tree was used for each analysis (Fig 2). To estimate significance in the form of a p‐value, a likelihood ratio test
(LRT) was used. Two times the difference in log likelihood values between two models was compared to a chi‐square distribution with degrees of freedom equal to the
difference in number of parameters for the two models being compared. All results with p<0.05 in the LRT were repeated to ensure that the result was robust to variation in the initial ML conditions or failure of the ML estimation to converge to the global optimum. A complete listing of genes with selected results is given in the Appendix
(Table A1). All primary sequence data has been submitted to GenBank; alignments and
codeml results are available at http://mendel.gene.cwru.edu/adamslab/pbrowser.py.
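The likelihood ratio test described above can be illustrated with a short sketch. It is not the code used in this study: the function names and the log‐likelihood values in the example are hypothetical, and the chi‐square tail probability is implemented only for one and two degrees of freedom, for which simple closed forms exist.

```python
import math


def chi2_sf(x: float, df: int) -> float:
    """Upper-tail probability of a chi-square variate (df = 1 or 2 only)."""
    if x < 0:
        raise ValueError("test statistic must be non-negative")
    if df == 1:
        return math.erfc(math.sqrt(x / 2.0))
    if df == 2:
        return math.exp(-x / 2.0)
    raise ValueError("this sketch supports only df = 1 or df = 2")


def likelihood_ratio_test(lnl_alt: float, lnl_null: float, df: int):
    """Return (LRS, p) where LRS = 2 * (lnL_alt - lnL_null)."""
    # A slightly negative LRS can arise from incomplete ML convergence;
    # it is truncated to zero here.
    lrs = max(2.0 * (lnl_alt - lnl_null), 0.0)
    return lrs, chi2_sf(lrs, df)


if __name__ == "__main__":
    # Hypothetical log-likelihoods chosen to give LRS = 4.62.
    lrs, p = likelihood_ratio_test(-2310.44, -2312.75, df=1)
    print(f"LRS = {lrs:.2f}, p = {p:.3f}")
```

As a consistency check, LRS values of 4.62 with one degree of freedom and 8.88 with two degrees of freedom give p‐values of about 0.032 and 0.012, which agree with the corresponding entries reported for ZFP37 in Table 5.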
Simulations testing the performance of codeml: The sensitivity of the strict branch+site test to branch length, proportion of sites within the positively selected site classes, background dN/dS, and the magnitude of the foreground dN/dS value were examined by using the NSbranchsites function in the evolver program contained in the
PAML software package to create simulated sequences under a specified evolutionary
model. Codon frequencies were derived from a set of 11,485 human genes with high
quality orthologs in several mammalian species (NICKEL et al. 2008a). These genes were divided into quartiles of G+C content and codon frequencies derived for each quartile.
The four quartiles were: <45.2%, 45.2‐52.19%, 52.2‐58.79%, and >=58.8%. It has recently been shown that variations in substitution pattern as a result of variance in the recombination rate can essentially be accounted for by G+C content (BULLAUGHEY et al.
2008). The specifications of each simulation can be found in the Appendix (Table A2).
Branch lengths were calculated by concatenating the alignments for the 175 transcription factors and evaluating the alignment using DNAml of the PHYLIP program package (FELSENSTEIN 1993). Evolver was used to create alignment sets of coding sequences with 400 codons from each G+C quartile separately. 1,000 simulated sequence sets were generated for models of neutral evolution, while 200 simulated sequences were created for models reflecting positive selection along the human lineage. The resulting multispecies sequence alignments were analyzed using codeml with model A and model A‐null.
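The assignment of a gene to a G+C quartile can be sketched as below, using the quartile boundaries given above. The function names are hypothetical, and ignoring ambiguous bases such as N when computing G+C content is an assumption of this sketch rather than a documented detail of the analysis.

```python
# Upper bounds (exclusive) of quartiles 1-3, in percent G+C, taken from the
# text: <45.2, 45.2-52.19, 52.2-58.79, >=58.8.
GC_QUARTILE_UPPER_BOUNDS = (45.2, 52.2, 58.8)


def gc_percent(seq: str) -> float:
    """Percent G+C among unambiguous bases (A, C, G, T) of a sequence."""
    bases = [b for b in seq.upper() if b in "ACGT"]
    if not bases:
        raise ValueError("sequence contains no unambiguous bases")
    gc = sum(1 for b in bases if b in "GC")
    return 100.0 * gc / len(bases)


def gc_quartile(seq: str) -> int:
    """Return the G+C quartile (1-4) used to select codon frequencies."""
    pct = gc_percent(seq)
    for quartile, upper in enumerate(GC_QUARTILE_UPPER_BOUNDS, start=1):
        if pct < upper:
            return quartile
    return 4


if __name__ == "__main__":
    for seq in ("ATATATATGC", "ACGT", "G" * 11 + "A" * 9, "GCGCGCGCAT"):
        print(f"{gc_percent(seq):5.1f}% G+C -> quartile {gc_quartile(seq)}")
```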
Empirical tests of positive selection using sequences simulated under a model of neutral evolution: An alternative null model was developed to assess the likelihood of positive selection by comparing the actual results from the strict branch+site test
(codeml model A vs. model A‐null) to results obtained using sequences simulated under a model of neutral evolution. For each gene tested, the sequence of the human‐chimpanzee ancestor was inferred using the RateAncestor function of codeml. More
than 99% of the codons were inferred with at least 95% probability, according to codeml
(Fig 1).
[Figure 1 plot: Confidence of Ancestral Reconstruction. x‐axis: Probability (>0.5, >0.8, >0.9, >0.95, >0.99, 1); y‐axis: % of Codons (0 to 100); series: 176TF %AAchange and 176TF %Allchanges.]
Figure 1: Probability of accurate ancestral sequence reconstruction. Accuracy of amino acid changing substitutions in the 176 transcription factor set, as well as all substitutions, synonymous or nonsynonymous.
These ancestral sequences were input as the root sequence in evolver, and at least 500 sequences were simulated under a model of neutral evolution (foreground dN/dS=1 in site classes 2a and 2b). The background branch lengths were calculated as described above, and the human branch length used was that reported by codeml using the branch model H. The proportion of sites within site classes 0, 1, 2a and 2b was held
constant at 0.6, 0.1, 0.26, and 0.04, respectively, based on averages among genes with dN/dS>1 (see Table 4). The transition/transversion ratio, κ, was also held constant at 2.6. The average κ for the 175 gene set was 4.1±1.5. A higher κ means more transitions and therefore more synonymous changes; a smaller κ is therefore more conservative. Codon substitution frequencies were derived from the appropriate G+C quartile for the specific gene.
Gene‐specific empirical p‐values were calculated by running codeml with model
A and model A‐null on each of the simulated neutral human sequences using the original primate+rodent multiple sequence alignments for each gene and calculating the
likelihood ratio statistic (LRS). The empirical p‐value is defined as the fraction of sequences simulated under a model of neutral evolution that had an LRS greater than or equal to the LRS obtained with the actual data. The gene‐specific significance threshold is based on the empirical distribution of LRS values, using a parametric bootstrap to specify a significance threshold of α = 5% (AAGAARD and PHILLIPS 2005; GOLDMAN 1993).
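The empirical p‐value defined above reduces to a simple count, sketched here. The function name and the LRS values in the example are hypothetical and chosen only to make the arithmetic easy to follow; in the analysis itself each gene had at least 500 neutral simulations.

```python
def empirical_p_value(observed_lrs: float, neutral_lrs: list) -> float:
    """Fraction of neutrally simulated LRS values >= the observed LRS."""
    if not neutral_lrs:
        raise ValueError("at least one simulated LRS value is required")
    hits = sum(1 for value in neutral_lrs if value >= observed_lrs)
    return hits / len(neutral_lrs)


if __name__ == "__main__":
    # Hypothetical LRS values from ten neutral simulations of one gene.
    neutral = [0.0, 0.0, 0.1, 0.4, 0.9, 1.3, 2.0, 2.8, 3.9, 5.1]
    print(empirical_p_value(3.9, neutral))  # two of ten values are >= 3.9
    print(empirical_p_value(6.0, neutral))  # no simulated value is >= 6.0
```

Because the estimate is a fraction of N simulations, its resolution is 1/N, so the number of simulations limits how small a nonzero empirical p‐value can be.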
RESULTS
Sequencing of transcription factor genes and tests of selection
We sought to identify evidence of adaptive protein evolution specific to the human lineage with the hypothesis that genes under positive selection have contributed
to the phenotypic changes that occurred since the most recent common ancestor of humans and chimpanzees. A total of 175 transcription factor genes with prior predictions of positive selection (BUSTAMANTE et al. 2005a; CLARK et al. 2003a; CSSAC, 2005a; NIELSEN et al. 2005a) or an excess of nonsynonymous substitution in human relative to chimpanzee were selected for phylogenetic sequencing. Exonic sequence data were obtained from human, chimpanzee, bonobo, gorilla, orangutan, gibbon, three Old World monkeys, three New World monkeys, and lemur (Fig. 2). Primary sequence read data from chimpanzee, macaque, and orangutan were obtained from the NCBI Trace Archive and supplemented the alignments where possible, and annotated mouse and rat orthologs (http://www.informatics.jax.org) were added to the alignments (Table 2). Coverage is excellent for the great apes, rodents, and rhesus macaque, but lower for the monkeys, reflecting a lower rate of PCR success in more divergent species.
Figure 2: Phylogenetic tree of the primate species used in phylogenetic analyses. Primates were chosen so as to include a representative from all the major subdivisions of primates. These include Parvorder Platyrrhini (New World monkeys: red‐bellied tamarin, spider monkey, woolly monkey), Parvorder Catarrhini (Old World monkeys: pig‐tailed macaque, rhesus macaque), Family Hylobatidae (lesser apes: gibbon), and Family Hominidae (great apes: orangutan, gorilla, chimpanzee, bonobo, and human). Suborder Strepsirrhini (prosimians: lemur) is not depicted in tree. Mouse and rat were used as an outgroup (not shown). Branch lengths were calculated from concatenated alignments of the 175 transcription factor genes using the DNAML program in PHYLIP (FELSENSTEIN 1993). *Branch length < 0.0015.
Multi‐species alignments were constructed of the protein‐coding DNA sequence of each gene and the codon substitution pattern in the alignments was evaluated under several evolutionary models as implemented in the codeml program from the PAML
package (YANG 1997a). In each case, results from a null model were compared with results from a test model. The null models assumed neutrality (dN = dS) or negative selection (dN/dS <1) and the test models assumed dN/dS >1 on the human lineage
(strict branch test) or at a subset of codon sites in the gene along the human branch
(strict branch+site test). A list of the models and tests of selection is presented in Tables
3 and 4.
Only three genes have p <0.05 in the strict branch+site test and also contain amino acids with a Bayes Empirical Bayes posterior probability of positive selection > 95%: ISGF3G, SOHLH2, and ZFP37 (Table 5). Seventeen genes have p <0.05 in the relaxed branch+site test alone, indicating that they may have experienced a relaxation of selective constraint or simply show weak signals of positive selection (Table A1). None of the nominal p‐values survive Bonferroni correction, and a false discovery rate analysis (BENJAMINI and HOCHBERG 1995) also predicts that the likelihood of observing true positives is small.
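The effect of multiple‐testing correction on these nominal p‐values can be illustrated with a small sketch. It assumes 175 tests (one per gene) and uses the three strict branch+site p‐values from Table 5; the remaining 172 p‐values are not listed here, so they are set to 1.0 purely as placeholders. The function names are hypothetical, and the step‐up procedure shown is the standard Benjamini‐Hochberg procedure, not necessarily the exact false discovery rate calculation performed in this study.

```python
def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance threshold under Bonferroni correction."""
    return alpha / n_tests


def benjamini_hochberg(p_values: list, q: float = 0.05) -> list:
    """Indices of hypotheses rejected by the Benjamini-Hochberg procedure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        # Step-up rule: find the largest rank k with p_(k) <= q * k / m.
        if p_values[idx] <= q * rank / m:
            cutoff_rank = rank
    return sorted(order[:cutoff_rank])


if __name__ == "__main__":
    table5_p = [0.032, 0.013, 0.023]        # ZFP37, ISGF3G, SOHLH2
    p_values = table5_p + [1.0] * 172       # placeholders for the other genes
    threshold = bonferroni_threshold(0.05, len(p_values))
    print(f"Bonferroni threshold: {threshold:.6f}")
    print("Survive Bonferroni:", [p for p in table5_p if p < threshold])
    print("Rejected by BH:", benjamini_hochberg(p_values))
```

Under these assumptions the Bonferroni threshold is 0.05/175, well below the smallest nominal p‐value of 0.013, and the step‐up procedure rejects no hypotheses.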
Table 5. codeml results for transcription factor genes with significant results in the strict branch+site test of positive selection
          Branch+Site Tests                    Branch Tests                           dN/dS            Empirical Test
          Strict            Relaxed            Strict              Relaxed
          (MA vs MAnull)    (MA vs. M1a)       (MH vs. MH‐null)    (MH vs. M0)
Gene      LRS‡    p         LRS     p          LRS     p           LRS     p          Human   Other    p
ZFP37     4.62    0.032     8.88    0.012      4.70    0.030       14.4    <0.001     999†    0.24     <0.001
ISGF3G    6.24    0.013     6.50    0.039      0.18    0.67        0.48    0.48       0.69    0.38     <0.001
SOHLH2    5.16    0.023     9.08    0.011      5.18    0.023       9.62    0.002      999     0.48     <0.001
†Values of 999 for dN/dS indicate dS=0, so dN/dS is undefined. Values in bold reflect p<0.05.
‡LRS = 2*(lnL1‐lnL2)
Effect of phylogenetic breadth on prediction of positive selection
Overall phylogenetic breadth is expected to contribute to the accuracy of a prediction of positive selection (ANISIMOVA et al. 2001). Intuitively, interpretation of the significance of lineage‐specific substitutions becomes more straightforward as the number and divergence of species increase, resulting in a more accurate reconstruction of the ancestral sequence and, therefore, of the direction of selection. However, data collection and analysis of many species can be costly and time consuming. We therefore decided to investigate how well a minimal phylogenetic set performs at
prediction of positive selection. There are four well‐annotated and publicly available mammalian genomes that are ideal for this analysis: human, chimpanzee, rhesus macaque, and mouse as an outgroup. Genes were simulated under a model of positive selection with ω = 2 and ω = 10 for the full primate+rodent set and evaluated using codeml. The tree was then trimmed to contain only the four aforementioned mammalian species and evaluated again using codeml. The results from the minimal alignment set were then compared to those obtained from the full primate+rodent alignments.
Remarkably, the reduced alignment set is reasonably effective in identifying the set of genes that are predicted to be positively selected when the full primate+rodent alignment is used (Fig. 3). In each case, however, a small number of simulated genes had discordant results, which is likely the effect of an outgroup that is too distant to accurately resolve the direction of selection. In general, sensitivity
is higher than specificity, reflecting the fact that the omission of species tended to result in false positives, rather than false negatives.
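The comparison of the two alignment sets can be expressed as a small calculation, sketched below. The full‐alignment result is treated as the reference call, which is an assumption of this sketch; the function name and the p‐values in the example are hypothetical and are not results from this study.

```python
def concordance(full_p: list, minimal_p: list, alpha: float = 0.05) -> dict:
    """Sensitivity and specificity of minimal-alignment calls, taking the
    full-alignment calls (p < alpha) as the reference."""
    if len(full_p) != len(minimal_p):
        raise ValueError("both lists must describe the same simulated genes")
    tp = fp = tn = fn = 0
    for pf, pm in zip(full_p, minimal_p):
        full_sig, min_sig = pf < alpha, pm < alpha
        if full_sig and min_sig:
            tp += 1
        elif full_sig and not min_sig:
            fn += 1
        elif min_sig:
            fp += 1
        else:
            tn += 1
    return {
        "tp": tp, "fp": fp, "tn": tn, "fn": fn,
        # None when the reference set contains no genes of that class.
        "sensitivity": tp / (tp + fn) if (tp + fn) else None,
        "specificity": tn / (tn + fp) if (tn + fp) else None,
    }


if __name__ == "__main__":
    full = [0.01, 0.03, 0.20, 0.60, 0.04, 0.70]      # hypothetical p-values
    minimal = [0.02, 0.04, 0.03, 0.55, 0.01, 0.80]   # hypothetical p-values
    print(concordance(full, minimal))
```

In this toy example one gene is significant only with the minimal alignment, so sensitivity is 1.0 while specificity is 2/3, the same direction of discordance described above.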
[Figure 3 plot: x‐axis: Full alignment p‐value (MA v. MAnull), ω = 2; y‐axis: Minimal alignment p‐value (MA v. MAnull), ω = 2; both axes range from 0 to 1.]
Figure 3: Comparison of the results for predictions of positive selection simulated using codeml for the full primate+rodent alignment set to the minimal alignment set containing only human, chimpanzee, rhesus macaque, and mouse. Red data points indicate discrepant results between the two alignment sets.
Taken together, these results suggest that increased phylogenetic breadth cannot overcome the limitations inherent in comparing closely related species. On the other hand, only minimal phylogenetic diversity is necessary to maximize the power that does exist to detect positive selection.
Simulations to assess the sensitivity of the strict branch+site test
The failure to observe significant positive selection was surprising, given the method for choosing genes and the fact that more than one‐third have a dN/dS ratio >1 on the human branch. This prompted us to ask whether the genes we chose were truly not positively selected or whether the ML method lacks power to detect positive selection for these closely related species. We therefore embarked on a series of simulations to study the performance of codeml using branch lengths and substitution patterns that are typical of the divergence time of humans from non‐human primates.
In presenting the strict branch+site test, Zhang et al. (ZHANG et al. 2005) performed an extensive series of simulations of codeml that demonstrated a low false‐positive rate in the absence of dN>dS and a reasonable sensitivity to detect positive selection across a range of branch lengths and dN/dS ratios. However, they primarily focused on selection acting at an internal branch (thereby improving detection of selected sites) and on trees with branch lengths longer than those encountered in human genes.
We used evolver, which generates alignment sets that correspond to a user‐defined evolutionary model, to create simulated primate gene alignments for use in evaluating the performance of the strict branch+site test of codeml. Several scenarios were examined by varying parameters that reflect the range of branch lengths, dN/dS ratios and codon frequencies found in human genes. These included 'gold standard' sets simulating positive selection with dN/dS>1 at a subset of codons on the human
lineage in each simulated sequence and neutrally evolving control sets with dN/dS=1.
The complete list of tests, parameters, and results can be found in Table A2.
The branch length of the human (foreground) lineage had the greatest influence on the power to detect positive selection, which falls as branch length decreases
(Figures 4, 5, and Table A2). Due to the recent divergence of humans and chimpanzees, human‐specific branch lengths are quite short (Fig 5). Under a scenario of positive selection, with dN/dS=10 on the foreground (human) lineage and dN/dS<=1 on the other lineages, only a small fraction of simulated sequences had a p‐value <0.05 in the strict branch+site test (Fig 4A). Across most of the range of observed human branch lengths, few sequences simulated under a model of positive selection met the threshold for significance in the strict branch+site test (Fig 5). For the median branch length of human genes with substitutions (0.007), less than 10% of sequences simulated under the model of positive selection are correctly classified. At longer branch lengths, more simulated positively selected sequences are correctly classified; however, <1% of human genes have branch lengths >0.035, and positive selection is not predicted in every case even at this relatively long branch length.
Figure 4: Distribution of likelihood ratio statistic (LRS) values from the strict branch+site test of simulated gene sets. Sequences were simulated with different branch lengths for the human branch under models of positive selection with foreground dN/dS = 10 (A) and under models of neutral evolution with foreground dN/dS = 1 (B,C). In (A), the LRS corresponding to p=0.05 is shown. The expected distribution of LRS values according to the chi‐square distribution with one degree of freedom (χ²₁, red diamonds) and a 50:50 mixture of the χ²₁ distribution and a point mass of zero (green circles) is shown in (C) with a subset of the distribution of LRS values from neutral simulations from (B). The red line marks the 5% empirical significance threshold.
Figure 5: Prediction of positive selection across a range of branch lengths representative of human genes. The branch lengths of 175 transcription factor genes are shown (columns). The secondary axis represents the percentage of alignment sets simulated with a model of positive selection (foreground dN/dS=10) that were correctly classified as positively selected in the strict branch+site test (triangles).
The power to predict positive selection is also related to the magnitude of dN/dS
(Fig. 6A). With weak positive selection (e.g. dN/dS = 2), there is little power to detect positive selection regardless of branch length, while at longer branch lengths, larger dN/dS results in a higher fraction of simulated sequences that meet the threshold for significance in the strict branch+site test, although this is insufficient to overcome the limitations of shorter branch lengths. Of much less influence is the dN/dS value of the background lineages (Fig. 6B). The proportion of significant alignment sets remains consistent over a variety of background dN/dS values, ranging from low (dN/dS = 0.01)
to nearly neutral (dN/dS = 0.75). The proportion of foreground (positively selected) sites also has only modest impact on predictions of positive selection (Table A2).
Figure 6: Contribution of dN/dS values to predictions of positive selection. (A) The ability to detect positive selection using a range of dN/dS values for the foreground (human) lineage for different branch lengths was evaluated. (B) The effect of the dN/dS on the background lineages was assessed under models of positive selection (dN/dS =10; black bars) and neutral evolution (dN/dS=1; gray bars). The human branch length was 0.01.
We also examined the performance of evolver‐generated sequences with dN/dS=1 on the foreground lineage to assess how well these sequences match the neutral expectation. The fraction of simulated sequences with p<0.05 in the strict branch+site test was ≤3% across a range of branch lengths (Fig 4B), demonstrating that under a neutral model, there are few false predictions of positive selection, confirming results of Zhang et al (ZHANG et al. 2005).
In the strict branch+site test, the LRS is compared to a chi‐square distribution