Lump It Or Loose It! Population Genetic Inference with the N-Coalescent Experiments Graph Within a Phylogenomic Lineage

Principal Investigator: Dr. Raazesh Sainudiin
Department of Mathematics and Statistics, University of Canterbury, Private Bag 4800, Christchurch, New Zealand
[email protected]
Telephone: +64 3 364 2987 ext 7691
Proposal Root Date: November 4, 2005
December 15, 2009

Summary of proposed research

Following the completion of the human and other genome projects, there are growing amounts of data which document genetic diversity, at the deoxyribonucleic acid (DNA) level, within human populations. Such data carry vital information about the history of the populations, about the genetic basis of disease, and about the evolutionary process itself. Because of the complex nature of the data, extracting this information is a very difficult statistical problem. Existing approaches rely heavily on sophisticated mathematical modeling and modern computing power, but they involve several overly simplistic assumptions. A major challenge is to assess the assumptions underpinning the models, and there is an urgent need to develop new methods which adequately capture the complexities of the real world, in order to solve pressing problems in conservation genetics, nutritional genetics, plant and animal breeding, and disease biology.

The proposed research will exploit the genome sequencing of our close relatives amongst the primates, including the chimpanzee and the gorilla, to develop accurate models for the mutation process: the fundamental mechanism by which genetic variation is introduced into populations. The DNA sequences of these species are very similar to those of humans: for example, chimpanzee and human DNA agree at 99% of positions. The differences result from mutation and, in some cases, subsequent natural selection.
Crucially, the underlying similarity between the sequences allows an assessment of the way in which mutation works in different parts of the genome, in a setting in which the background context is very similar across the species, thus effectively eliminating one important source of variation. The use of several related species also allows the researcher to pin down the exact lineage on which mutations have occurred. New statistical methods will be developed for this part of the research. Having learned about the underlying mutation processes, the second stage will be to use this information to give more powerful statistical methods for studying DNA sequence variation within a species.

The tools from phylogenetics and population genetics are seldom used together in complementary ways to solve a problem. The fundamental objective of this proposal is to help bridge this divide by developing novel and powerful statistical methods that use tools from both disciplines, in order to integrate the information in phylogenomic data from ape genomes with the polymorphism data from human populations. This integrative approach, using the recently developed theory of controlled lumped coalescents over a partially ordered graph of coalescent experiments, will allow for more robust statistical decisions to be made in real-world applications of genetics.

Background and Motivation

Following the completion of the human and other genome projects, there are growing amounts of data which document genetic diversity, at the DNA level, within human populations. Such data carry vital information about the population history, about the genetic basis of disease, and about the evolutionary process itself. Due to the complex nature of the data, extracting this information in order to make robust decisions in real-world applications of genetics is a challenging statistical problem.
Patterns of diversity in molecular population genetic data are shaped by the interaction of two processes: (1) the genealogical process that captures the ancestral interrelatedness of the sampled DNA sequences and (2) the biochemical process that describes the molecular evolution of DNA. When population genetic inference is based on the likelihood function of a stochastic model for the two interacting processes, one can extract information in the data that can be captured by the model. Much progress has been made in computationally intensive likelihood-based inference from samples of DNA sequences using Markov chain Monte Carlo (MCMC) (e.g. [2, 3, 30, 36, 58]) and importance sampling (IS) techniques (e.g. [1, 14, 18, 19, 20, 21, 22, 23, 26, 27, 28, 34, 35, 51, 52]). However, this progress in population genetic inference is mostly in terms of sophisticated coalescent models of the generally unobservable genealogical process under a homogeneous and mathematically convenient model of the empirically observable biochemical process.

The major source of DNA polymorphism caused by mutation and recombination events in natural populations is the result of the counteracting forces of DNA damage and DNA repair. Although our understanding of the molecular basis underlying this counteraction has markedly increased in the past decade [47, 48], relatively little effort has been made to incorporate biochemical realism into population genetic inference. On the other hand, significant progress has been made in phylogenetics to incorporate realism into models of the biochemical process [17, 39, 49]. However, this phylogenetic progress has mostly been at the expense of simplifying the genealogical process.
Since only the product of the mutation rate per locus per generation and the number of generations is identifiable, (1) mutational complexities and (2) genomic heterogeneities in the mechanisms of DNA damage and repair can confound the shape of the inferred coalescent trees describing the genealogical process of the sample. Therefore, population genetic inference can be affected by model mis-specification of the biochemical process. Realistic models of DNA evolution with several parameters informed by various empirical findings can be incorporated into existing inference methods with population genetic data. However, the statistical experiment tends to be over-parametrized with an extremely flat likelihood function. The large number of parameters tends to make the computational time of IS methods impractical and the convergence diagnostics of MCMC methods (with local proposal distributions) difficult and extremely heuristic.

However, several genomes of our closest relatives are being sequenced. If we assume that the DNA damage and repair mechanisms are similar (not necessarily identical) between the great apes, then information in molecular divergence given by the phylogeny of the entire genomes of the great apes can shed light on the biochemical process of DNA evolution. Likelihood methods that use such phylogenomic information from at least two species, in the context of population genetic inference along a particular lineage, are now possible with the growing amount of chimpanzee genome sequence [13, 40, 56]. By pooling phylogenomic and population genetic data together within a unified and novel phylopopulation model of the two interacting processes (genealogical and biochemical), we can hope to reduce the above computational burdens that arise in the presence of population genetic data alone. The proposed research will develop such data-integrative methods to incorporate biochemical realism into population genetic inference.
In the process, a common platform for integrating tools from the traditionally disparate disciplines of population genetics and phylogenetics is expected to emerge.

Next we will see that population genetic inference can be affected when complexities in the biochemical process of a particular kind of repetitive DNA sequence are ignored. Microsatellites are simple sequence repeats in DNA, for example the motif AC repeated eleven times in a row. Microsatellites mutate by changing their repeat length, i.e. the number of their repeats. There is a difference of four to five orders of magnitude between the mutation rate of a DNA nucleotide and that of a tandem dinucleotide DNA repeat [12, 57]. Microsatellites are popular genetic markers because they are highly polymorphic, fairly inexpensive to genotype, and relatively dense in the genomes of many organisms. They are used in conservation genetic efforts, in genomic scans of natural selection, in breeding agricultural plants, and in inferring population history.

To see that inference can be affected by ignoring the mutational complexities, let us focus on the most abundant microsatellites, those of motif AC, in humans and chimpanzees. Since there are motif-specific heterogeneities in the biochemical process of dinucleotide repeat evolution [7, 43], we focus on a particular motif to highlight the mutational complexities that can be assumed to be homogeneous across all loci of this motif. The simplest and most popular model to describe microsatellite evolution is the step-wise mutation model (SMM) [37], which allows all microsatellite alleles to mutate indiscriminately at the same rate. However, microsatellite mutations are caused by those primary slippage events during DNA replication that have escaped the mismatch repair machinery [11, 25]. Since there are more opportunities for a longer microsatellite to slip during replication, the mutation rate of a microsatellite increases with its repeat length (rate proportionality) [57].
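The contrast between the SMM and rate proportionality can be illustrated with a minimal forward simulation of a single allele. This is only a sketch under stated assumptions, not the model developed in this proposal: the function names, the linear dependence of the rate on repeat length, the symmetric one-unit steps, and the value of `base_rate` are all hypothetical choices made for illustration.

```python
import random


def mutation_rate(repeat_length, base_rate, proportional):
    """Per-generation mutation probability for one allele.

    proportional=False: SMM, the same rate for every repeat length.
    proportional=True: rate grows linearly with repeat length
    (the linear form is an assumption made for this sketch).
    """
    rate = base_rate * repeat_length if proportional else base_rate
    return min(rate, 1.0)  # keep it a valid probability


def simulate_allele(repeat_length, generations, base_rate, rng, proportional=False):
    """Evolve one allele forward in time; return (final_length, mutation_count)."""
    mutations = 0
    for _ in range(generations):
        if rng.random() < mutation_rate(repeat_length, base_rate, proportional):
            # Step-wise change: gain or lose one repeat unit (symmetric here).
            step = 1 if rng.random() < 0.5 else -1
            repeat_length = max(1, repeat_length + step)  # at least one repeat
            mutations += 1
    return repeat_length, mutations


if __name__ == "__main__":
    rng = random.Random(1)  # fixed seed so runs are repeatable
    for proportional in (False, True):
        # Start from an allele with 11 repeats; base_rate is illustrative only.
        final, count = simulate_allele(11, 100_000, 1e-4, rng, proportional)
        label = "rate proportional to length" if proportional else "SMM"
        print(f"{label}: final length {final}, {count} mutations")
```

Under the proportional setting, alleles with more repeats mutate more often per generation, so the same elapsed time leaves a different signature in the data than the SMM would predict.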
The counteracting forces of slippage and repair along with possible selection against long repeats (mutational bias) are thought to drive the length of human AC repeats towards 21 from several
