
Lump it or Loose it! Population genetic inference with the n-coalescent experiments graph within a phylogenomic lineage

Principal Investigator: Dr. Raazesh Sainudiin
Department of Mathematics and Statistics, University of Canterbury, Private Bag 4800, Christchurch, New Zealand
[email protected]
Telephone: +64 3 364 2987 ext 7691
Proposal Root Date: November 4, 2005 December 15, 2009

Summary of proposed research

Following the completion of the human and other genome projects, there are growing amounts of data which document genetic diversity, at the deoxyribonucleic acid (DNA) level, within human populations. Such data carries vital information about the history of the populations, about the genetic basis of disease, and about the evolutionary process itself. Because of the complex nature of the data, extracting this information is a very difficult statistical problem. Existing approaches rely heavily on sophisticated mathematical modeling and modern computing power, but they involve several overly simplistic assumptions. A major challenge is to assess the assumptions underpinning the models, and there is an urgent need to develop new methods which adequately capture the complexities of the real world, in order to solve pressing problems in conservation genetics, nutritional genetics, plant and animal breeding, and disease biology. The proposed research will exploit the genome sequencing of our close relatives amongst the primates, including the chimpanzee and the gorilla, to develop accurate models for the mutation process: the fundamental mechanism by which genetic variation is introduced into populations. The DNA sequences of these species are very similar to those of humans: for example, chimpanzee and human DNA agree at 99% of positions. The differences result from mutation, and in some cases subsequent natural selection. Crucially, the underlying similarity between the sequences allows an assessment of the way in which mutation works in different parts of the genome, in a setting in which the background context is very similar across the species, thus effectively eliminating one important source of variation. The use of several related species also allows the researcher to pin down the exact lineage on which mutations have occurred. New statistical methods will be developed for this part of the research.
Having learned about the underlying mutation processes, the second stage will be to use this information to give more powerful statistical methods for studying DNA sequence variation within a species. The tools from population genetics and phylogenetics are seldom used together in complementary ways to solve a problem. The fundamental objective of this proposal is to help bridge this divide by developing novel and powerful statistical methods that use tools from both disciplines, in order to integrate the information in phylogenomic data from ape genomes with the data from human populations. This integrative approach, using the recently developed theory of controlled lumped coalescents over a partially ordered graph of coalescent experiments, will allow for more robust statistical decisions to be made in real-world applications of genetics.

Background and Motivation

Following the completion of the human and other genome projects, there are growing amounts of data which document genetic diversity, at the DNA level, within human populations. Such data carries vital information about the population history, about the genetic basis of disease, and about the evolutionary process itself. Due to the complex nature of the data, extracting this information in order to make robust decisions in real-world applications of genetics is a challenging statistical problem. Patterns of diversity in molecular population genetic data are shaped by the interaction of two processes: (1) the genealogical process that captures the ancestral interrelatedness of the sampled DNA sequences and (2) the biochemical process that describes the molecular evolution of DNA. When population genetic inference is based on the likelihood function of a stochastic model for the two interacting processes, one can extract information in the data that can be captured by the model. Much progress has been made in computationally intensive likelihood-based inference from samples of DNA sequences using Markov chain Monte Carlo (MCMC) (e.g. [2, 3, 30, 36, 58]) and importance sampling (IS) techniques (e.g. [1, 14, 18, 19, 20, 21, 22, 23, 26, 27, 28, 34, 35, 51, 52]). However, this progress in population genetic inference is mostly in terms of sophisticated coalescent models of the generally unobservable genealogical process under a homogeneous and mathematically convenient model of the empirically observable biochemical process. The major source of DNA polymorphism caused by mutation and recombination events in natural populations is the result of the counteracting forces of DNA damage and DNA repair. Although our understanding of the molecular basis underlying this counteraction has markedly increased in the past decade [47, 48], relatively little effort has been made to incorporate biochemical realism into population genetic inference. On the other hand, significant progress has been made in phylogenetics to incorporate realism into models of the biochemical process [17, 39, 49]. However, this phylogenetic progress has mostly been at the expense of simplifying the genealogical process.
Since only the product of the mutation rate per locus per generation and the number of generations is identifiable, (1) mutational complexities and (2) genomic heterogeneities in the mechanisms of DNA damage and repair can confound the shape of the inferred coalescent trees describing the genealogical process of the sample. Therefore, population genetic inference can be affected by model mis-specification of the biochemical process. Realistic models of DNA evolution with several parameters informed by various empirical findings can be incorporated into existing inference methods with population genetic data. However, the statistical experiment tends to be over-parametrized with an extremely flat likelihood function. The large number of parameters tends to make the computational time of IS methods impractical and the convergence diagnostics of MCMC methods (with local proposal distributions) difficult and extremely heuristic. However, several genomes of our closest relatives are being sequenced. If we assume that the DNA damage and repair mechanisms are similar (not necessarily identical) between the great apes, then information in molecular divergence given by the phylogeny of the entire genomes of the great apes can shed light on the biochemical process of DNA evolution. Likelihood methods that use such phylogenomic information from at least two species, in the context of population genetic inference along a particular lineage, are now possible with the increasing availability of chimpanzee genome sequence [13, 40, 56]. By pooling phylogenomic and population genetic data together within a unified and novel phylopopulation model of the two interacting processes (genealogical and biochemical) we can hope to reduce the above computational burdens that arise in the presence of population genetic data alone. The proposed research will develop such data integrative methods to incorporate biochemical realism into population genetic inference.
In the process, a common platform for integrating tools from traditionally disparate disciplines of population genetics and phylogenetics is expected to emerge. Next we will see that population genetic inference can be affected when complexities in the biochemical process of a particular kind of repetitive DNA sequence are ignored. Microsatellites are simple sequence repeats in DNA, for example the motif AC repeated eleven times in a row. Microsatellites mutate by changing their repeat length, i.e. the number of their repeats. There is a difference of four to five orders of magnitude between the mutation rate of a DNA nucleotide and that of a tandem dinucleotide DNA repeat [12, 57]. Microsatellites are popular genetic markers because they are highly polymorphic, fairly inexpensive to genotype, and relatively dense in the genomes of many organisms. They are used in conservation genetic efforts, in genomic scans of natural selection, in breeding agricultural plants, and in inferring population history. To see that inference can be affected by ignoring the mutational complexities, let us focus on the most abundant microsatellites, those of motif AC, in humans and chimpanzees. Since there are motif-specific heterogeneities in the biochemical process of dinucleotide repeat evolution [7, 43], we focus on a particular motif to highlight the mutational complexities that can be assumed to be homogeneous across all loci of this motif. The simplest and most popular model to describe microsatellite evolution is the step-wise mutation model (SMM) [37], which allows all microsatellite alleles to mutate indiscriminately at the same rate. However, microsatellite mutations are caused by those primary slippage events during DNA replication that have escaped the mis-match repair machinery [11, 25]. Since there are more opportunities for a longer microsatellite to slip during replication, the mutation rate of a microsatellite increases with its repeat length (rate proportionality) [57].
The counteracting forces of slippage and repair along with possible selection against long repeats (mutational bias) are thought to drive the length of human AC repeats towards 21 from several lines of evidence [7, 8, 43, 57]. The SMM allows neither rate proportionality nor mutational bias and has been rejected, on the basis of a phylogenomic dataset of homologous AC microsatellites between humans and chimpanzees, in favor of a proportional-rate linear-biased one-phased model (PL1) that allows for rate proportionality and linear mutational bias [43].

Figure 1 (panels: Number of alleles, Heterozygosity, Variance): The cumulative distributions of each of the three summaries of the data differ significantly under the PL1 model (red) and the SMM (blue) by the Kolmogorov-Smirnov test.
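The contrast between the SMM and a model with rate proportionality and linear mutational bias can be illustrated with a small sketch. The second function below is a hypothetical parameterization written only for illustration; it is not the published PL1 model of [43]. The names `mu_per_repeat` and `slope` and their default values are assumptions, and only the focal length of 21 is taken from the text.

```python
def smm_rates(length, mu):
    """SMM: every allele mutates at total rate mu, split evenly up and down."""
    return mu / 2.0, mu / 2.0  # (expansion rate, contraction rate)


def proportional_linear_bias_rates(length, mu_per_repeat, focal=21, slope=0.02):
    """Hypothetical model with rate proportionality and linear mutational bias.

    This parameterization is an assumption made for illustration only.
    """
    # Rate proportionality: the total rate grows linearly with repeat length.
    total = mu_per_repeat * length
    # Linear bias: the chance that a mutation is an expansion falls linearly
    # with length, equals 1/2 at the focal length, and is clipped to [0, 1].
    p_up = min(1.0, max(0.0, 0.5 - slope * (length - focal)))
    return total * p_up, total * (1.0 - p_up)


if __name__ == "__main__":
    for length in (11, 21, 31):
        print(length, smm_rates(length, 1e-3),
              proportional_linear_bias_rates(length, 1e-5))
```

Under the SMM the pair of rates does not depend on the allele. Under the illustrative biased model, alleles shorter than the focal length tend to expand and longer alleles tend to contract, which is the qualitative behaviour that pulls repeat lengths towards a focal value.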

To assess the robustness of population genetic inference to model mis-specification of the biochemical process, we simulated samples at a pure AC microsatellite locus for 100 individuals in an idealized Wright-Fisher coalescent model through a genealogical urn scheme [21]. The population parameters were chosen to match those of humans. Both mutation models (SMM and PL1) were calibrated to have the observed mutation rate in humans [57]. Figure 1 shows the empirical cumulative distributions of three commonly used statistics of the data, namely, number of alleles, heterozygosity, and variance in repeat number. Transformations of these summaries, and others like them, under the SMM have been used to infer various population genetic phenomena of interest, including signals of natural selection in structured human populations [54]. The plots show that all three summaries are significantly reduced when the more realistic mutation model, that accounts for the counteracting forces of replication slippage and mismatch-repair, is used to describe the empirically observed biochemical process of neutral microsatellite evolution, even under one of the simplest models for the genealogical process. Similar results were obtained when SMM was compared to another sophisticated model that was estimated from a large human pedigree study that measured 118,866 parent-offspring transmissions of AC repeats and observed 53 mutations [57]. Thus, deviations from selective neutrality may be best detected when the neutral model of microsatellite evolution is empirically founded.
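The kind of simulation described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the code used for Figure 1: it draws a standard coalescent genealogy rather than the genealogical urn scheme of [21], uses only the SMM, places no lower bound on repeat length, and measures time so that each lineage mutates at rate theta/2. The function names, the parameter `theta`, and the starting allele are hypothetical.

```python
import random


def poisson(rng, lam):
    """Draw a Poisson(lam) variate by counting unit-rate exponential arrivals."""
    count, total = 0, rng.expovariate(1.0)
    while total < lam:
        count += 1
        total += rng.expovariate(1.0)
    return count


def simulate_smm_sample(n, theta, root_allele=20, seed=None):
    """Simulate n repeat lengths at one locus under the coalescent and the SMM."""
    rng = random.Random(seed)
    # Backward in time: merge random pairs of lineages until one remains.
    parent, node_time = {}, {i: 0.0 for i in range(n)}
    active, t, next_id = list(range(n)), 0.0, n
    while len(active) > 1:
        k = len(active)
        t += rng.expovariate(k * (k - 1) / 2.0)
        a, b = rng.sample(active, 2)
        active = [x for x in active if x not in (a, b)] + [next_id]
        parent[a] = parent[b] = next_id
        node_time[next_id] = t
        next_id += 1
    # Forward in time: a parent always has a larger id than its children,
    # so visiting ids in decreasing order assigns parents before children.
    allele = {next_id - 1: root_allele}
    for node in range(next_id - 2, -1, -1):
        branch = node_time[parent[node]] - node_time[node]
        steps = poisson(rng, 0.5 * theta * branch)
        allele[node] = allele[parent[node]] + sum(
            rng.choice((-1, 1)) for _ in range(steps))
    return [allele[i] for i in range(n)]


def summaries(sample):
    """Return (number of alleles, heterozygosity, variance in repeat number)."""
    n = len(sample)
    counts = {}
    for a in sample:
        counts[a] = counts.get(a, 0) + 1
    # Sample-size-corrected expected heterozygosity (gene diversity).
    het = n / (n - 1) * (1.0 - sum((c / n) ** 2 for c in counts.values()))
    mean = sum(sample) / n
    var = sum((a - mean) ** 2 for a in sample) / (n - 1)
    return len(counts), het, var


if __name__ == "__main__":
    print(summaries(simulate_smm_sample(100, theta=5.0, seed=1)))
```

Repeating the simulation many times under each mutation model and comparing the empirical distributions of the three summaries would give a comparison of the type shown in Figure 1.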

First Phase of Research: Mutational Complexities

In the first phase of research we will construct more realistic models of microsatellite evolution that take advantage of the complete genome sequences of the organism in question and its closest relatives. A recent sequential importance sampling approach gave a general way to construct proposal distributions for IS [26, 27]. Using an idealized model of microsatellite mutation (SMM), its authors computed the likelihood function for structured populations [28]. Their approach already allows for a class of more complex microsatellite mutation models with distribution of mutational jumps independent of the mutating allele. However, the Fourier series methods employed in their approach to obtain the sampling distribution may not work well for a model with mutational bias that depends on the mutating allele. We will first extend their samplers [27, 28] to compute the likelihood of a population sample at a microsatellite locus evolving under a more realistic mutation model under a genealogical model that is embedded within a particular lineage of a phylogeny of closely related genomes. The phylogenomic data for this extended genealogical model is given by homologous microsatellite alleles of the given motif at several additional loci in the compared genomes. We will accomplish this by extending the stochastic coalescent model of the human population to further allow the lineage at each site from each species to coalesce in the most recent common ancestral population. Samples of size one at thousands of microsatellite loci from the other species allow us to focus on the genealogical process for a smaller part of the genome in the target species (humans) while mining information about the biochemical process of microsatellite evolution from the phylogeny of primate genomes. The first realistic model of microsatellite evolution we will use is the empirical mutation model (EMM).
Under this model, a microsatellite in species $k$ has the instantaneous mutation rate from an allele of length $i$ to an allele of length $j$ given by $\mu^k \pi^k_j$, where $\mu^k$ is the constant mutation rate for all alleles, and $\pi^k_j$ is the stationary probability of allele $j$. We will estimate $\pi^k$ from the empirical distribution of repeat lengths from the genome of species $k$ and $\mu^k$ from the information in phylogenomic divergence. The entire spectrum for the probability matrix is known in closed form for EMM, since it is identical to an independent Metropolis-Hastings Markov chain on the microsatellite state space that is targeting $\pi^k$ under a uniform proposal distribution [9]. In fact, the entire spectrum of a more flexible class of EMM models is known for an arbitrary proposal distribution (not necessarily uniform) [9]. EMM has not been exploited in population genetic inference with microsatellite data despite its ability to incorporate allele-specific mutational bias. We will also attempt to further extend the sequential importance sampler to include more sophisticated microsatellite models that allow rate proportionality (e.g. PL1). These models are expected to pose further numerical challenges that may be overcome by rigorous eigenvalue routines using interval methods [31]. Next we will extend the sampler to compute the likelihood of population samples of blocks of DNA sequence data that contain both repetitive microsatellite DNA with repeat length polymorphisms (RLPs) and non-repetitive DNA with single nucleotide polymorphisms (SNPs). Such data are closer to the actual DNA sequences and have information about events that occur at two scales of time that can differ by five orders of magnitude within the same block of DNA. One can construct a unique perfect phylogeny from mutations on DNA sequences under the infinitely-many-sites model (ISM) of mutation and no recombination [24]. This phylogeny encodes information about the evolutionary topology or shape of the genealogical process. The ISM can approximate point mutations in non-recombining segments of non-repetitive human DNA.
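Returning to the EMM introduced earlier in this section, the sketch below shows why a mutation rate of the form "constant rate times stationary probability of the target allele" is convenient to work with. It assumes that this rate is taken literally as the off-diagonal entry of a continuous-time rate matrix; the function names and the toy repeat-length counts are hypothetical. Under that assumption a direct argument gives the transition probabilities in closed form: mutation events arrive at rate mu, and after the first event the allele is a draw from the stationary distribution whatever the starting allele was.

```python
import math


def stationary_from_counts(counts):
    """Estimate the stationary distribution from empirical repeat-length counts."""
    total = sum(counts.values())
    return {allele: c / total for allele, c in counts.items()}


def emm_rate(i, j, mu, pi):
    """Instantaneous rate from allele i to a different allele j: mu * pi[j]."""
    if i == j:
        raise ValueError("rate is defined for i != j only")
    return mu * pi[j]


def emm_transition_prob(i, j, t, mu, pi):
    """P(allele j at time t | allele i at time 0) under the rates above.

    With probability exp(-mu * t) no mutation event has occurred and the
    allele is unchanged; otherwise the allele is a draw from pi.
    """
    no_event = math.exp(-mu * t)
    return (no_event if i == j else 0.0) + (1.0 - no_event) * pi[j]


if __name__ == "__main__":
    # Toy counts of repeat lengths; real values would come from a genome scan.
    pi = stationary_from_counts({10: 5, 11: 3, 12: 2})
    for j in sorted(pi):
        print(j, emm_transition_prob(10, j, t=0.5, mu=1.0, pi=pi))
```

Because the transition probabilities are available without any matrix exponentiation, likelihood terms along a branch can be evaluated in constant time per pair of alleles, which is one reason such a model is attractive inside an importance sampler.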
Sophisticated methods exist to integrate over genealogies conditional on the genetree topology (e.g. [1]). One can integrate efficiently over the genealogies of the microsatellite loci by conditioning on the genetree topology of a small block of non-repetitive DNA around a microsatellite. We will use the interval methods developed by the applicant [41] to rigorously enclose the likelihood function of the EMM over a compact set of trees with a given genetree topology. Such enclosures can be naturally used to solve for the most likely tree [41] as well as to obtain importance sampling weights [42]. Next we will extend the genealogical process of such sequence data with RLPs (and SNPs) to allow recombination and infer using approximate likelihood methods developed at Oxford [14, 15]. Recently, such approximate methods have been used to infer recombinational hotspots in humans [32, 59]. Using the repetitive as well as non-repetitive sequence within a hypothesized block of hardly recombining DNA, that is demarcated by the inferred adjacent hotspots, one can formally test the extent to which our hypothesized block is truly non-recombining, with the aid of our novel IS method that can mine information simultaneously from two scales of time. Another advantage of the flexible empirical mutation model is that it can account for the empirical evidence that microsatellites of different motifs mutate differently [7, 43]. Accounting for genomic heterogeneities is the main subject of the second phase of the proposed research. Although, the above discussion has been limited to dinucleotide repeats for reasons of concreteness and pedagogy, the samplers that will be developed are applicable to other finite alphabet sequences: (1) empirical versions of DNA models commonly used in phylogenetics [17] that are even capable of modeling indels and (2) empirical models of other simple tandem repetitive elements, including minisatellites and satellites. 
Being able to compute the likelihood of polymorphism data that are not merely SNPs is particularly important in light of genomic findings about the great apes [5]. For example, by comparing the entire sequence of chimpanzee chromosome 22 with the human counterpart, it was found recently that 1.44% of the chromosome consists of single-base substitutions in addition to nearly 68,000 insertions or deletions [56].

Second Phase of Research: Genomic heterogeneities

In the second phase of research we will focus on other examples of complexities underlying DNA damage and repair that primarily involve heterogeneities in the biochemical process across the genome. Two such examples are (1) the differences in substrate specificity of DNA glycosylases [46] and the immense variation in base-excision repair efficiency [53] that can partly explain the observed 23-fold higher transition rate in CpG sites relative to the average rate in humans and chimpanzees [10], and (2) the differences between the global genome repair that is involved in the repair of non-transcribed DNA and a more efficient transcription coupled repair that is involved in the repair of the transcribed regions [55]. The DNA sequences of the primate species are very similar to those of humans: for example, chimpanzee and human DNA agree at 99% of positions. The differences result from mutations, and in some cases subsequent natural selection. Crucially, the underlying similarity between the sequences allows an assessment of the way in which mutations work in different parts of the genome, in a setting in which the background context is very similar across the species, thus effectively eliminating one important source of variation. For example, the differences in mutation rates between transcribed and non-transcribed DNA, or the elevated transitions at CpG sites can be accounted for by means of various context-dependent models developed in the field of phylogenetics. By context-dependent models, we mean mixtures of Markov models of DNA evolution on the state space of the four nucleotides [17], higher-order Markov models [50], hidden Markov models [39], and Markov random fields [4]. The second phase of research will begin by accurately modeling the observed heterogeneities in the biochemical process of DNA evolution across the primate genomes by means of mixtures of Markov models on the state space of four nucleotides. We will start with the earliest such Markov model due to Jukes and Cantor (JC69) [29]. For this model, the applicant has rigorously solved for the most likely tree [41] (with three or four primates) using interval methods by extending Felsenstein’s post-order traversal [16]. Such a tree estimation method amounts to a computer-assisted proof of the maximum likelihood (ML) estimate, unlike conventional numerical methods that rely on heuristic local searches with uncontrolled floating-point arithmetic [38]. Thus, in the second phase we will first extend this interval method to rigorously estimate a finite mixture of JC69 Markov chains. This extension uses the information in primate phylogenomic divergence to better model the genomic heterogeneity in mutation rates. Next, we will model mutational bias in the Markov chains [17] that are finitely mixed. Here, we will use the algebraic statistical tools for computational biology [39] developed by the group led by Dr. Bernd Sturmfels and Dr. Lior Pachter at the Berkeley department of mathematics. These tools provide the sufficient statistics for Markov models of DNA evolution on small phylogenetic trees.
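The JC69 likelihood computed by a post-order traversal, and its extension to a finite mixture of JC69 chains, can be sketched as follows. This is an ordinary floating-point illustration, not the interval-arithmetic method of [41]: it gives no rigorous enclosure. The tree encoding, the branch lengths, the function names, and the use of rate multipliers as mixture components are all assumptions made for the example.

```python
import math

NUCS = "ACGT"


def jc69_prob(a, b, t):
    """JC69 probability of nucleotide b after a branch of length t, given a.

    Branch length t is in expected substitutions per site.
    """
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if a == b else 0.25 - 0.25 * e


def partial_likelihoods(tree, site, rate=1.0):
    """Post-order traversal returning P(data below node | state at node).

    A tree is a leaf name, or a tuple of (subtree, branch_length) pairs.
    """
    if isinstance(tree, str):
        return {x: 1.0 if x == site[tree] else 0.0 for x in NUCS}
    result = {x: 1.0 for x in NUCS}
    for child, t in tree:
        below = partial_likelihoods(child, site, rate)
        for x in NUCS:
            result[x] *= sum(jc69_prob(x, y, rate * t) * below[y] for y in NUCS)
    return result


def site_likelihood(tree, site, rate=1.0):
    """Likelihood of one alignment column with a uniform root distribution."""
    root = partial_likelihoods(tree, site, rate)
    return sum(0.25 * root[x] for x in NUCS)


def mixture_site_likelihood(tree, site, components):
    """Finite mixture of JC69 chains; components is a list of (weight, rate)."""
    return sum(w * site_likelihood(tree, site, r) for w, r in components)


# Hypothetical three-primate tree with made-up branch lengths.
hc = (("human", 0.006), ("chimp", 0.006))
tree = ((hc, 0.002), ("gorilla", 0.008))

if __name__ == "__main__":
    column = {"human": "A", "chimp": "A", "gorilla": "G"}
    print(site_likelihood(tree, column))
    print(mixture_site_likelihood(tree, column, [(0.8, 0.5), (0.2, 3.0)]))
```

A mixture with a slow and a fast component is one simple way to let some sites, for example CpG sites, evolve faster than others while keeping a single tree for all sites.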
It has already been shown that by coupling phylogenetic interval methods with standard algebraic statistical techniques one can efficiently enclose the ML trees in a numerically rigorous manner [41, 45]. After having extended the interval methods to estimate the most likely mixtures of Markov models describing the biochemical process on the primate phylogeny, we will integrate the large human population dataset of SNPs from the Haplotype Map (HapMap) project with the genome sequences of our closest relatives amongst the primates, including the gorilla and chimpanzee, to obtain our phylopopulation genomic data. Using the stochastic model of the coalescent embedded within a phylogenomic lineage described in the first phase of research, we will conduct statistical inference with IS methods that allow heterogeneity in the biochemical process. Inference with such a model allows for the mining of information from both phylogenomic and population genetic scales simultaneously, while allowing for context-dependence in a simplified manner. Next, we will try to address the more complicated context-dependent models of the biochemical process within this phylopopulation setting. This phylopopulation approach has several phylogenomic advantages for population genetic inference. First, there is little genealogical ambiguity in the primate species tree: for example the species tree of chimpanzee, gorilla, and human has only one unrooted topology. Second, ample phylogenomic information is in the longer divergence times of the species tree relative to the genealogical depth of the population samples. Third, the use of several related species allows the researcher to pin down the exact lineage on which mutations have occurred. Fourth, the computational challenges that arise in the context of population genetic data alone are expected to be minimal.
Thus, this integrative statistical method will provide a more robust and powerful approach to population genetic inference with the new generation of phylopopulation genomic DNA sequence data.

Research significance and career objectives

Sequential IS methods developed in the first phase use information of microsatellite polymorphism within a small human haplotype block. Thus, they can be used to infer recent genealogical histories that cannot be obtained by looking at the SNP data alone. These methods also use more realistic microsatellite mutation models and can allow for the integration of repetitive and non-repetitive sequences in a cohesive statistical framework. They are suited for conservation genetic studies that need to carefully characterize the recent loss in diversity in a threatened but still extant species, genomic scans of selection, and also for genetic association studies. Phylopopulation models of heterogeneously evolving genomic DNA sequences developed in the second phase can provide more realistic null hypotheses of neutral molecular evolution in structured human populations. These selectively neutral models are particularly useful for tests of natural selection since deviations from selective neutrality can be best detected when the neutral model of DNA evolution is biologically realistic. The HapMap data in conjunction with the primate phylogenomic data also provides a unique opportunity to test empirically founded theories of DNA damage and repair, provided the more complex context-dependent models can be successfully analyzed. The mathematical genetics group at Oxford, which is leading the analysis of the HapMap project, is the ideal place to develop computationally intensive statistical methods for analyzing the new generation of phylopopulation data. Most of the applicant’s biological training at Cornell University has been limited to phylogenetic inference with complex models of microsatellite [6, 43] and codon [44] evolution.
He has also applied interval analysis [33] to solve for maximum likelihood estimates of simple phylogenetic models in a rigorous fashion [41]. The efficiency of this method was shown to increase when simple algebraic statistical techniques were first used to reduce the high dimensional data into sufficient statistics [45]. There is evidence that such interval phylogenetic methods can produce IS weights to accelerate and improve the Monte Carlo algorithms that sample from the tree space [42] with a fixed topology. He has been a research fellow of the Royal Commission for the Exhibition of 1851 at the University of Oxford from 2005 to 2007 under the sponsorship of Peter Donnelly. During this period he developed mathematical statistical formalisms for coalescent experiments and developed a novel theory of controlled lumped Markov chains for conducting exact likelihood computations of a popular population genetic statistic called the site frequency spectrum. The fundamental objective of the proposed research is to provide a statistical framework for integrating population genetic and phylogenomic data guided by the biological and chemical literature on DNA damage and repair. Currently, no such integrative framework across phylogenetics and population genetics exists. The applicant will have the best opportunity to facilitate this integration by working closely with the mathematical genetics group at Oxford, the algebraic statistics group at Berkeley, and biologists in New Zealand, Irvine and Cornell.

References

[1] M Bahlo and RC Griffiths. Inference from gene trees in a subdivided population. Theoret. Pop. Biol., 57:79–95, 1996.
[2] M Beaumont. Detecting population expansion and decline using microsatellites. Genetics, 153:2013–2029, 1999.
[3] P Beerli and J Felsenstein. Maximum likelihood estimation of migration rates and effective population numbers in two populations using a coalescent approach. Genetics, 152:763–773, 1999.
[4] P Brémaud. Markov Chains, Gibbs Fields, Monte Carlo Simulations and Queues. Springer-Verlag, New York, 1999.
[5] R Britten, L Rowen, J Williams, and RA Cameron. Majority of divergence between closely related DNA samples is due to indels. Proc. Natl. Acad. Sci. USA, 100:4661–4665, 2003.
[6] P Calabrese and R Sainudiin. Models of microsatellite evolution. In R Nielsen, editor, Statistical Methods in Molecular Evolution, Series: Statistics for Biology and Health. Springer, New York, 2004.
[7] PP Calabrese and RT Durrett. Dinucleotide repeats in the Drosophila and human genomes have complex, length-dependent mutation processes. Mol. Biol. Evol., 20:715–725, 2003.
[8] G Cooper, DC Rubinsztein, and W Amos. Ascertainment bias cannot entirely account for human microsatellites being longer than their chimpanzee homologues. Hum. Mol. Genet., 7:1425–1429, 1998.
[9] P Diaconis and L Saloff-Coste. What do we know about the Metropolis algorithm? Jnl. Comp. Sys. Sci., 57:20–36, 1998.
[10] I Ebersberger, D Metzler, and CW Schwarz. Genomewide comparison of DNA sequences between humans and chimpanzees. Am. J. Hum. Genet., 70:1490–1497, 2002.
[11] JA Eisen. Mechanistic basis for microsatellite instability. In DB Goldstein and C Schlötterer, editors, Microsatellites: Evolution and Applications. Oxford Univ. Press, Oxford, 1999.
[12] H Ellegren. Microsatellite mutations in the germline: implications for evolutionary inference. Trends in Genetics, 16:551–558, 2000.
[13] W Enard and S Pääbo. Comparative primate genomics. Annu. Rev. Genomics Hum. Genet., 5:351–378, 2004.
[14] P Fearnhead and P Donnelly. Estimating recombination rates from population genetics data. Genetics, 159:1299–1318, 2001.
[15] P Fearnhead and P Donnelly. Approximate likelihood methods for estimating local recombination rates. J. R. Statist. Soc. B, 64:657–680, 2002.

[16] J Felsenstein. Evolutionary trees from DNA sequences: a maximum likelihood approach. Jnl. Mol. Evol., 17:368–376, 1981.
[17] J Felsenstein. Inferring Phylogenies. Sinauer, Sunderland, 2003.
[18] RC Griffiths and P Marjoram. Ancestral inference from samples of DNA sequences with recombination. J. Comput. Biol., 3:479–502, 1996.
[19] RC Griffiths and S Tavare. Ancestral inference in population genetics. Stat. Sci., 9:307–319, 1994.
[20] RC Griffiths and S Tavare. Sampling theory for neutral alleles in a varying environment. Proc. R. Soc. London B, 344:403–410, 1994.
[21] RC Griffiths and S Tavare. Simulating probability distributions in the coalescent. Theoret. Pop. Biol., 46:131–159, 1994.
[22] RC Griffiths and S Tavare. Markov chain inference methods in population genetics. Math. Comput. Modelling, 23:141–158, 1996.
[23] RC Griffiths and S Tavare. The ages of mutations in gene trees. Ann. Appl. Prob., 9:567–590, 1999.
[24] D Gusfield. Efficient algorithms for inferring evolutionary trees. Networks, 21:19–28, 1991.
[25] B Harr, J Todorova, and C Schlötterer. Mismatch repair-driven mutational bias in D. melanogaster. Mol. Cell, 10:199–205, 2002.
[26] MD Iorio and RC Griffiths. Importance sampling on coalescent histories. I. Adv. Appl. Prob., 36:417–433, 2004.
[27] MD Iorio and RC Griffiths. Importance sampling on coalescent histories. II: subdivided population models. Adv. Appl. Prob., 36:434–454, 2004.
[28] MD Iorio, RC Griffiths, R Leblois, and F Rousset. Stepwise mutation likelihood computation by sequential importance sampling in subdivided population models. Technical report, Oxford University, 2004.
[29] TH Jukes and CR Cantor. Evolution of protein molecules. In HN Munro, editor, Mammalian Protein Metabolism, volume 3, pages 21–123. Academic Press, New York, 1969.
[30] MK Kuhner, J Yamato, and J Felsenstein. Estimating effective population size and mutation rate from sequence data using Metropolis-Hastings sampling. Genetics, 140:1421–1430, 1995.
[31] G Mayer. Result verification for eigenvectors and eigenvalues. In J Herzberger, editor, Topics in validated computations, volume 5 of Studies in computational mathematics, pages 209–276. North-Holland, New York, 1994.
[32] G McVean, SR Myers, S Hunt, P Deloukas, DR Bentley, and P Donnelly. The fine-scale structure of recombination rate variation in the human genome. Science, 304:581–584, 2004.
[33] RE Moore. Interval analysis. Prentice-Hall, Englewood Cliffs, New Jersey, 1967.
[34] M Nath and RC Griffiths. Estimation in an island model using simulation. Theoret. Pop. Biol., 3:227–253, 1996.
[35] R Nielsen. A likelihood approach to population samples of microsatellite alleles. Genetics, 146:711–716, 1997.
[36] R Nielsen. Estimation of population parameters and recombination rates from single nucleotide polymorphisms. Genetics, 154:931–942, 2000.
[37] T Ohta and M Kimura. A model of mutation appropriate to estimate the number of electrophoretically detectable alleles in a finite population. Genet. Res., 22:201–204, 1973.
[38] IEEE Task P754. ANSI/IEEE 754-1985, Standard for Binary Floating-Point Arithmetic. IEEE, New York, 1985.

[39] L Pachter and B Sturmfels, editors. Algebraic Statistics for Computational Biology. Cambridge University Press, 2005.
[40] M Ruvolo. Comparative primate genomics: the year of the chimpanzee. Curr. Opin. Genet. Devel., 14:650–656, 2004.
[41] R Sainudiin. Enclosing the maximum likelihood of the simplest DNA model evolving on fixed topologies: towards a rigorous framework for phylogenetic inference. BSCB Dept. technical report, BU-1653-M, Cornell University, Ithaca, 2004.
[42] R Sainudiin. Machine Interval Experiments. PhD dissertation, Cornell University, Ithaca, New York, 2005.
[43] R Sainudiin, R Durrett, C Aquadro, and R Nielsen. Microsatellite mutation models: Insights from a comparison of humans and chimpanzees. Genetics, 168:383–395, 2004a.
[44] R Sainudiin, W Wong, K Yogeeswaran, J Nasrallah, Z Yang, and R Nielsen. Detecting site-specific physico-chemical selective pressures: applications to the class-I HLA of the human major histocompatibility complex and the SRK of the plant sporophytic self-incompatibility system. Jnl. Mol. Evol., in press, 2004b.
[45] R Sainudiin and R Yoshida. Applications of interval methods to phylogenetic trees. In L Pachter and B Sturmfels, editors, Algebraic Statistics for Computational Biology. Cambridge University Press, 2005.
[46] OD Schärer. Recent progress in the biology, chemistry and structural biology of DNA glycosylases. Bioessays, 23:270, 2001.
[47] OD Schärer. Chemistry and biology of DNA repair. Angew. Chem. Int. Ed., 42:2946–2974, 2003.
[48] MJ Schofield and P Hsieh. DNA mismatch repair: Molecular mechanisms and biological function. Annu. Rev. Microbiol., 57:579–608, 2003.
[49] C Semple and M Steel. Phylogenetics. Oxford University Press, Oxford, UK, 2003.
[50] A Siepel and D Haussler. Phylogenetic estimation of context-dependent substitution rates by maximum likelihood. Mol. Biol. Evol., 21(3):468–488, 2004.
[51] M Slatkin. A vectorized method of importance sampling with applications to models of mutation and migration. Theoret. Pop. Biol., 62:339–348, 2002.
[52] M Stephens and P Donnelly. Inference in molecular population genetics. J. R. Statist. Soc. B, 62:605–655, 2000.
[53] JT Stivers and AC Drohat. Uracil DNA glycosylase: Insights from a master catalyst. Arch. Biochem. Biophys., 396:1, 2001.
[54] JF Storz, BA Payseur, and MW Nachman. Genome scans of DNA variability in humans reveal evidence for selective sweeps outside of Africa. Mol. Biol. Evol., 21:1800–1811, 2004.
[55] JQ Svejstrup. Transcription: Mechanisms of transcription-coupled DNA repair. Nat. Rev. Mol. Cell Biol., 3:21, 2002.
[56] The International Chimpanzee Chromosome 22 Consortium. DNA sequence and comparative analysis of chimpanzee chromosome 22. Nature, 429:382–388, 2004.
[57] JC Whittaker, RM Harbord, N Boxall, I Mackay, G Dawson, and M Sibly. Likelihood-based estimation of microsatellite mutation rates. Genetics, 164:781–787, 2003.
[58] IJ Wilson and DJ Balding. Genealogical inference from microsatellite data. Genetics, 150:499–510, 1998.
[59] W Winckler, SR Myers, DJ Richter, RC Onofrio, GJ McDonald, RE Bontrop, GAT McVean, SB Gabriel, D Reich, P Donnelly, and D Altshuler. Comparison of fine-scale recombination rates in humans and chimpanzees. Science, 2005.
