Evolutionary Population Genetics of Promoters: Predicting Binding Sites and Functional Phylogenies
Ville Mustonen and Michael Lässig*
Institut für Theoretische Physik, Universität zu Köln, Zülpicherstrasse 77, 50937 Cologne, Germany
Edited by Tomoko Ohta, National Institute of Genetics, Mishima, Japan, and approved August 8, 2005 (received for review June 30, 2005)
PNAS, November 1, 2005, vol. 102, no. 44, 15936–15941. www.pnas.org/cgi/doi/10.1073/pnas.0505537102. © 2005 by The National Academy of Sciences of the USA.
This paper was submitted directly (Track II) to the PNAS office.
Abbreviation: CRP, cAMP response protein.
*To whom correspondence should be addressed. E-mail: [email protected].

We study the evolution of transcription factor-binding sites in prokaryotes, using an empirically grounded model with point mutations and genetic drift. Selection acts on the site sequence via its binding affinity to the corresponding transcription factor. Calibrating the model with populations of functional binding sites, we verify this form of selection and show that typical sites are under substantial selection pressure for functionality: for cAMP response protein sites in Escherichia coli, the product of fitness difference and effective population size takes values 2NΔF of order 10. We apply this model to cross-species comparisons of binding sites in bacteria and obtain a prediction method for binding sites that uses evolutionary information in a quantitative way. At the same time, this method predicts the functional histories of orthologous sites in a phylogeny, evaluating the likelihood for conservation or loss or gain of function during evolution. We have performed, as an example, a cross-species analysis of E. coli, Salmonella typhimurium, and Yersinia pseudotuberculosis. Detailed lists of predicted sites and their functional phylogenies are available.

Regulatory interactions between genes are believed to provide an important mode of evolution, which accounts for a substantial part of the differentiation between species (1). This is reflected by the sequence variability of regulatory DNA: there is ample case evidence of compensatory evolution at conserved function but also of rapid functional changes even between closely related species (2). Lacking a quantitative model of regulatory evolution, however, alignments of regulatory sequences and predictions of their functionality have proven notoriously difficult.

A large body of existing work has focused on the identification of transcription factor-binding sites as the main functional elements of regulatory DNA. For factors with known binding specificity (given in the form of a position weight matrix), putative binding sites are identified from their conservation in cross-species comparisons. Different measures of conservation have been introduced, which involve, e.g., the sequence similarity of aligned loci or their independent high scoring in all species compared (3–7). These methods are powerful prediction tools for binding sites. From an evolutionary point of view, however, the conservation criteria are heuristic. Hence, it is difficult to quantify the statistical significance of the results, which depends on the number and evolutionary distance of the species compared. Sequence conservation tends to be too restrictive in cases where substantial sequence variation is compatible with the position weight matrix, in particular for distant species. Simple sequence similarity measures implicitly assume neutral evolution, whereas independent scoring of orthologous sites ignores the evolutionary link between the species altogether. Most importantly, none of these conservation measures allows a consistent statistical treatment of functional innovations in the evolution of binding sites.

A more explicit model for regulatory DNA should address two issues: How does the sequence divergence between species depend on their evolutionary distance, and how does the specific biological function of binding sites enter? Answering these questions is a considerable task for experiment and theory, which must link the biophysics of binding sites with their population dynamics. In particular, it involves quantifying the selection by which the sequence evolution at functional loci is distinguished from that of neutral background DNA. As an important step in this direction, the notion of a fitness landscape for binding-site sequences has been introduced, where the fitness of a site depends on the binding energy of the corresponding factor (8). The evolutionary importance of the binding energy has also been highlighted in ref. 9, where it was shown that nucleotide substitution rates within functional sites in Escherichia coli depend on the energy difference induced by the substitution as predicted from the position weight matrix. The biophysics of factor-DNA binding imposes stringent constraints on the form of the fitness landscape (10) and has important consequences for bioinformatic binding site searches (11). Using such fitness landscapes, we have introduced a stochastic evolution model for functional loci, which is based on Kimura–Ohta point substitutions with rates governed by the fitness difference between the corresponding sequence states (12, 13). This model demonstrates the possibility of rapid adaptive formation of binding sites under positive selection and provides evolutionary constraints on eukaryotic promoter architecture. A similar evolutionary model (14) underlies a recently introduced method to identify conserved binding sites in multiple alignments (15).

In this paper, we develop a quantitative evolutionary rationale for the cross-species analysis of regulatory sequences, which goes beyond the mere prediction of binding sites. For aligned regulatory DNA of orthologous genes, our method predicts sites together with their functional evolution. The method is based on the evolution model of refs. 12 and 13 and uses a bioinformatic measurement of selection pressures for functionality, which is obtained from sequence data of verified functional sites. Typical functional loci for pleiotropic factors, as exemplified by the cAMP response protein (CRP) family in E. coli, are found to be under substantial selection, in contrast to nonfunctional loci, which evolve neutrally. For families of aligned loci, our method assigns likelihood values to different modes of evolution and associates them with functional histories: (i) neutral evolution of nonfunctional loci, (ii) evolution of functional loci under time-independent selection, and (iii) evolution under time-dependent selection, corresponding to loss or gain of function along a given branch of the phylogeny.

Theory

Evolution Models for Nonfunctional Sequence and Functional Loci. We consider genomic loci a = (a_1, ..., a_l) consisting of l contiguous nucleotides and elementary substitution processes a → b, where a, b are any two sequence states differing by exactly one nucleotide. For nonfunctional (background) sequences, we use uniform nucleotide substitution rates μ_{a→b} depending on the nucleotide to be mutated and on its nearest sequence neighbors (16). Models of this type are neutral with respect to factor binding and have been shown to provide a good description of intergenic background DNA in E. coli (11). A locus is defined as functional if binding of the corresponding factor at that locus affects the regulation of a gene. Functional loci are assumed to be under selection. This is described by a (Malthusian) fitness function F(a), which measures the contribution of a genotype a to the growth rate of the number of individuals carrying that genotype (and is therefore defined only up to an additive constant, the genotype-independent fitness). Notice that this definition of a functional locus is weaker than that of a functional binding site, which is a functional locus with a sequence state a that is likely to actually bind the factor. A functional locus can lose its binding sequence due to deleterious mutations, and conversely, a nonfunctional locus can become a spurious binding site. According to the Kimura–Ohta theory (17–19), selection leads to modified substitution rates at functional loci,

u_{a \to b} = \mu_{a \to b} \, N \, \frac{1 - \exp[-2(F(b) - F(a))]}{1 - \exp[-2N(F(b) - F(a))]},   [1]

where N denotes the effective population size. In writing Eq. 1, we have assumed μN ≪ 1, so that subsequent substitution processes are well separated in time and can be assumed to be independent.

The binding probability, however, is a strongly nonlinear function of the energy. Hence, the fitness effect of a substitution at one position depends on the nucleotides present at all other positions. This induces correlations between nucleotide frequencies at any two positions within functional loci (13), in addition to the short-ranged correlations already present in background sequences. These correlations prevent the factorization of the distributions P_0(a) and Q(a). However, because the fitness F(a) depends on the sequence state a only via the binding energy E(a), we can project these distributions on the energy E as an independent continuous variable, summing over all sequence states a with an (approximately) equal value of E. Denoting the projected ensembles for simplicity with the same symbols P_0 and Q, Eq. 3 takes the form

Q(E) = P_0(E) \exp[2NF(E) + \text{const.}].   [6]

It is this simplification that enables us to infer the functional form of these distributions from bioinformatic