Statistical Significance of Clusters of Motifs Represented by Position

Nucleic Acids Research, 2002, Vol. 30 No. 14, 3214–3224. © 2002 Oxford University Press

Statistical significance of clusters of motifs represented by position specific scoring matrices in nucleotide sequences

Martin C. Frith1, John L. Spouge4, Ulla Hansen1,3 and Zhiping Weng1,2,*

1Bioinformatics Program and 2Department of Biomedical Engineering, Boston University, 44 Cummington Street, 3Department of Biology, Boston University, 5 Cummington Street, Boston MA 02215, USA and 4National Center for Biotechnology Information, National Library of Medicine, Building 38A, Room 8N805, Bethesda, MD 20894, USA

Received February 26, 2002; Revised and Accepted May 24, 2002

*To whom correspondence should be addressed at: Bioinformatics Program, Boston University, 44 Cummington Street, Boston, MA 02215, USA. Tel: +1 617 353 3509; Fax: +1 617 353 6766; Email: [email protected]

ABSTRACT

The human genome encodes the transcriptional control of its genes in clusters of cis-elements that constitute enhancers, silencers and promoter signals. The sequence motifs of individual cis-elements are usually too short and degenerate for confident detection. In most cases, the requirements for organization of cis-elements within these clusters are poorly understood. Therefore, we have developed a general method to detect local concentrations of cis-element motifs, using predetermined matrix representations of the cis-elements, and calculate the statistical significance of these motif clusters. The statistical significance calculation is highly accurate not only for idealized, pseudo-random DNA, but also for real human DNA. We use our method 'cluster of motifs E-value tool' (COMET) to make novel predictions concerning the regulation of genes by transcription factors associated with muscle. COMET performs comparably with two alternative state-of-the-art techniques, which are more complex and lack E-value calculations. Our statistical method enables us to clarify the major bottleneck in the hard problem of detecting cis-regulatory regions, which is that many known enhancers do not contain very significant clusters of the motif types that we search for. Thus, discovery of additional signals that belong to these regulatory regions will be the key to future progress.

INTRODUCTION

It is estimated that 1.5% of the human genome encodes proteins via the 'genetic code' (1). Equally important are the regulatory signals, within a more extended genetic code, that control the manner in which the proteins are synthesized. These regulatory signals, or cis-elements, are typically protein binding sites that possess characteristic sequence patterns (motifs); however, these patterns are typically too short and degenerate for accurate detection of cis-elements (2).

Fortunately, there is much evidence that cis-elements occur in clusters rather than in isolation. Although transcription factor binding sites may be located many tens of kilobases away from the transcription start site, they generally occur within enhancers or silencers extending over a few hundred base pairs that contain multiple cis-elements (3). There is further evidence that mRNA 3′-end processing (4), mRNA localization (5) and alternative splicing (6) are controlled by clusters of signals. Thus, it appears to be a widespread phenomenon for molecular biological processes to be regulated by clusters of signals that are individually weak but collectively strong.

A number of more or less ad hoc algorithms for detecting cis-element clusters have been proposed in the past (7–13). To assess predictions made by such a method, it is extremely valuable to know the statistical significance of each prediction, i.e. the probability of obtaining the result merely by chance. None of the previous methods includes an analytic calculation of statistical significance, with the exception of a technique by Wagner based on Poisson statistics (14). The major limitation of Wagner's method is that it represents cis-elements using degenerate consensus sequences, rather than the more general position specific scoring matrices (PSSMs) (15). A further undesirable feature of many previous methods is the use of arbitrary threshold parameters, such as a window size within which the cis-elements must occur.

Ockham's razor is a widely accepted criterion for choosing among alternative methods. With this principle in mind, we have designed a method to detect cis-element clusters that is about as simple and general as can be achieved. Our method finds the optimal motif cluster obtained by summing PSSM scores for the motifs, and subtracting a linear 'gap penalty' for the spacer sequences between motifs. In principle there is just one undetermined parameter: the gap penalty. Unlike most previous methods, our technique has a solid statistical foundation, being based on a log likelihood ratio of observing the data given a model of cis-element clusters versus a model of background DNA. The Neyman–Pearson lemma states that log likelihood ratios are the most powerful statistic for distinguishing between hypotheses. Our model of cis-element clusters is for cis-elements to occur with a uniform (Poisson) distribution of some intensity, a reasonable minimal assumption model. The intensity parameter of this distribution corresponds in a one-to-one fashion with the gap penalty. A DNA background model that adequately reflects the properties of natural DNA is more problematic to construct. We try three versions: an independent mononucleotide model estimated from the query sequence, a higher order Markov model estimated from genomic DNA, and a locally varying mononucleotide model estimated from a sliding window within the query sequence.

Our method incorporates an analytic calculation of statistical significance, extending the technique of Claverie and Audic for calculating significance of individual PSSM matches (16). Having identified a motif cluster with some raw score (sum of PSSM scores minus gap scores), we can calculate its E-value, i.e. the number of times we expect to observe a cluster with this score or greater by chance, in a random sequence of specified length and nucleotide composition. Thus, E-values <1 become increasingly significant. We demonstrate that, as expected, the E-values accurately reflect the number of motif clusters we observe in synthetic sequences generated by each of the three background models. In addition, using the sliding window background model, the E-values are astonishingly accurate for natural DNA sequences (after masking the most egregious tandem repeat and low complexity regions). We name our method COMET: cluster of motifs E-value tool.

We use COMET to study two types of regulatory region: promoters regulated by the transcription factor LSF and muscle specific regulatory regions. LSF regulates a diverse set of cellular and viral genes (17–26). One defined biological function of LSF is an essential role in mediating cell cycle progression, via regulation of thymidylate synthase expression (26). There is evidence that the transcription factors Sp1 and Ets-1 may co-regulate genes with LSF (27,28). In this study, we restrict attention to sites of LSF regulation close to transcription start sites, searching for these regulatory regions by detecting clusters of motifs for LSF, Sp1, Ets-1 and the TATA box. For muscle specific regulatory regions, we look for clusters of motifs for the transcription factors Mef-2, Myf, SRF, Tef and Sp1, as previously suggested by others (9).

MATERIALS AND METHODS

COMET's scoring scheme

We wish to find the segment within a query DNA sequence that has the maximum score according to the following log likelihood ratio formula:

$$\mathrm{score}(\mathrm{segment}) = \log\left\{\frac{\mathrm{prob}(\mathrm{segment} \mid \text{cluster model})}{\mathrm{prob}(\mathrm{segment} \mid \text{null model})}\right\} \tag{1}$$

Three different null models were tried: independent nucleotides with frequencies estimated from the entire query [...] cis-elements occur in a Poisson process of some intensity, embedded in random DNA generated by the null model. The cis-elements are modeled by mononucleotide frequency matrices that are specified by the user. If a mononucleotide null model is used, the cluster model is equivalent to the hidden Markov model shown in Figure 1, with the emission probabilities of the background state equal to those of the null model. α in Figure 1 is related to the expected average distance between motifs in a cluster, a, by α = 1/(a + 1). COMET allows the user to specify a value for a.

Figure 1. A hidden Markov model of cis-element clusters. The large circles represent states that emit single nucleotides. The small circles represent silent states that do not emit, and the arrows represent allowed transitions between states. The cis-element states emit nucleotides with probabilities obtained from the count matrix for this cis-element. The background state emits nucleotides with background probabilities. Non-palindromic cis-elements are duplicated so that they are represented once on each strand.

In the Markov case, the cluster model combines a mononucleotide model of cis-elements with a higher order model of the spacer sequences between them. Since there has been some controversy over how best to combine two such models (30), we give a careful description of our method. The cluster model generates the DNA segment in two stages, first generating the locations and sequences of the cis-elements, and then emitting the sequences of the spacer regions between them. In the first stage, the model generates a gap length (possibly zero) from a geometric (memoryless) distribution, then randomly selects a cis-element and generates its sequence from the mononucleotide model, generates another gap, and so on. In the second stage, the gaps are filled with spacer sequences. Two distinct probability distributions can be proposed for the generation of the spacer sequences:

$$\Pr(\text{spacer sequences} \mid \text{cluster model}) = \Pr(\text{spacer sequences} \mid \text{null model}) \tag{2}$$

$$\Pr(\text{spacer sequences} \mid \text{cluster model}) = \Pr(\mathrm{segment} \mid \text{null model};\ \text{cis-element sequences}) \tag{3}$$

In equation 2, each spacer sequence is generated as if it were not attached to the rest of the segment.
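The scoring idea stated in the Introduction (sum the PSSM scores of the motifs in a cluster and subtract a linear gap penalty for the spacers between them) can be illustrated with a small dynamic programming sketch. This is not COMET's implementation: the function names `log_odds_pssm` and `best_cluster`, the pseudocount, the uniform background and the toy TATA-like count matrix are all assumptions made for illustration. The sketch scans the forward strand only, accepts only the letters A, C, G and T, and ignores the transition probabilities of the hidden Markov model.

```python
import math

def log_odds_pssm(counts, background, pseudocount=1.0):
    """Turn a count matrix (one dict per motif position) into log-odds scores.

    Hypothetical helper: the pseudocount is an illustrative choice.
    """
    pssm = []
    for column in counts:
        total = sum(column.get(b, 0) + pseudocount for b in "ACGT")
        pssm.append({
            b: math.log(((column.get(b, 0) + pseudocount) / total) / background[b])
            for b in "ACGT"
        })
    return pssm

def best_cluster(seq, pssms, gap_penalty):
    """Return (score, start, end) of the highest scoring motif cluster.

    score[i] is the best score of a cluster whose last motif or spacer
    ends just before position i; spacers cost gap_penalty per nucleotide.
    If no cluster scores above zero, (0.0, None, None) is returned.
    """
    n = len(seq)
    neg_inf = float("-inf")
    score = [neg_inf] * (n + 1)
    start = [None] * (n + 1)
    best = (0.0, None, None)
    for i in range(1, n + 1):
        # Option 1: extend the spacer after an existing cluster by one base.
        if score[i - 1] > neg_inf:
            score[i] = score[i - 1] - gap_penalty
            start[i] = start[i - 1]
        # Option 2: end a motif match at position i.
        for pssm in pssms:
            w = len(pssm)
            if i < w:
                continue
            match = sum(col[base] for col, base in zip(pssm, seq[i - w:i]))
            prev, prev_start = score[i - w], start[i - w]
            if prev <= 0:
                # Nothing worth extending: this motif starts a new cluster.
                prev, prev_start = 0.0, i - w
            candidate = prev + match
            if candidate > score[i]:
                score[i], start[i] = candidate, prev_start
                if candidate > best[0]:
                    best = (candidate, prev_start, i)
    return best

# Toy usage: a TATA-like motif occurring twice, two bases apart.
background = {b: 0.25 for b in "ACGT"}
tata = log_odds_pssm([{"T": 96}, {"A": 96}, {"T": 96}, {"A": 96}], background)
cluster_score, cluster_start, cluster_end = best_cluster("GGTATACCTATAGG", [tata], 0.5)
```

With a small gap penalty the two matches are joined into a single cluster spanning both motifs. With a large penalty the best cluster is a single motif, which shows how the one free parameter controls how tightly motifs must be packed.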
