Identification of the Binding Sites of Regulatory Proteins in Bacterial Genomes
Total Page:16
File Type:pdf, Size:1020Kb
Identification of the binding sites of regulatory proteins in bacterial genomes Hao Li*†, Virgil Rhodius‡, Carol Gross‡, and Eric D. Siggia§ *Departments of Biochemistry and Biophysics, and ‡Department of Stomatology and Department of Microbiology and Immunology, University of California, 513 Parnassus Avenue, San Francisco, CA 94143; and §Center for Studies in Physics and Biology, The Rockefeller University, Box 25, 1230 York Avenue, New York, NY 10021 Contributed by Carol Gross, June 6, 2002 We present an algorithm that extracts the binding sites (repre- each gene set defines a particular context, and searching for all sented by position-specific weight matrices) for many different contingencies to which the cell can respond is daunting. Fur- transcription factors from the regulatory regions of a genome, thermore, observed expression patterns can result from a reg- without the need for delineating groups of coregulated genes. The ulatory cascade or from multiple factors acting simultaneously, algorithm uses the fact that many DNA-binding proteins in bacteria increasing the difficulty of identifying all of the relevant sites. bind to a bipartite motif with two short segments more conserved Interspecies comparison is limited by the availability of species than the intervening region. It identifies all statistically significant separated by proper evolutionary distances. In addition, multiple patterns of the form W1NxW2, where W1 and W2 are two short alignment algorithms do not yet respect the phylogenetic rela- oligonucleotides separated by x arbitrary bases, and groups them tionships. Finally, when the conserved sequence elements are into clusters of similar patterns. These clusters are then used to identified, it is a challenging task to group the potential sites for derive quantitative recognition profiles of putative regulatory each gene into regulons (7). proteins. For a given cluster, the algorithm finds the matching The computational algorithms used to extract common sites sequences plus the flanking regions in the genome and performs from a select group of genes can be categorized as either direct a multiple sequence alignment to derive position-specific weight search, i.e., count all sites in a certain class (8–10), or relax- matrices. We have analyzed the Escherichia coli genome with this ational, i.e., guess a pattern and improve it iteratively (11–14). algorithm and found Ϸ1,500 significant patterns, which give rise to The computational effort in the former approach grows expo- Ϸ160 distinct position-specific weight matrices. A fraction of these nentially in the length of the site but nothing is missed, whereas matrices match the binding sites of one-third of the Ϸ60 charac- the latter can find longer and more diffuse patterns but may not terized transcription factors with high statistical significance. Many converge to the global optimal. Algorithms also differ in how of the remaining matrices are likely to describe binding sites and they assess the statistical significance of a motif: either extrin- regulons of uncharacterized transcription factors. The significance sically, by contrasting the frequency of the motif in the gene of these matrices was evaluated by their specificity, the location of cluster with that in the rest of the genome (8); or intrinsically, by the predicted sites, and the biological functions of the correspond- determining how much the number of occurrences of the motif ing regulons, allowing us to suggest putative regulatory functions. deviates from that expected by chance (e.g., given the single base The algorithm is efficient for analyzing newly sequenced bacterial frequencies in the cluster). Applications to entire genomes are genomes for which little is known about transcriptional regulation. difficult because of the quantity of data involved, the absence of suitable comparisons for the extrinsic methods, the multiplicity algorithm ͉ position weight matrix ͉ DNA-binding site ͉ of patterns, and the limitations of simple background models for transcription factor ͉ E. coli the probabilities. Some of these limitations were overcome with the Mobydick algorithm (15), which searches for multiple motifs in parallel by sequence segmentation. It was run successfully on s more and more genomes are sequenced, organisms are the regulatory sequences upstream of all the genes in yeast. Aincreasingly represented by a list of genes; however, there is However, it does not allow for the discovery of new position- little knowledge as to how these genes are regulated. Even for specific weight matrices (PSWM; related to a table of the number Escherichia coli, the best understood bacteria, only about one- of bases at each position for aligned sites), which is the most fifth of the estimated 300–350 regulatory proteins (1) have fruitful way of describing bacterial regulatory sites. characterized binding sites. For newly sequenced bacteria, only Transcription factor-binding sites in bacterial genomes are those transcription factor-binding sites that happen to match usually long, Ϸ30 bases, and variable. However, often most of those already identified in E. coli or Bacillus subtilis can be used their sequence signal is carried in two conserved subregions, to infer regulatory properties of the organism. Consequently, it each about 6 bases in length (16), which contain the predominant is clearly important to develop genome-wide computational contacts with the transcription factor. This bipartite character tools to identify the binding sites of uncharacterized transcrip- results from the fact that most prokaryotic transcription factors tion factors. That new bacteria are sequenced almost weekly have two DNA-binding regions, because of either dimerization indicates that developing these computational tools is a high of the transcription factor or the presence of two DNA-binding priority. domains in a single protein as in the case of factors (17). A One commonly used approach to identify transcription factor- number of researchers have exploited this fact to search for binding sites is to delineate a group of coregulated genes [e.g., patterns of the form W1NxW2 (henceforth termed dimers), where by clustering genes on the basis of their expression profiles (2, 3), W1,2 are short oligonucleotides (henceforth called words) sep- or functional annotation] and search for common sequence arated by x arbitrary bases (9, 10, 15, 18, 19). patterns in their upstream regulatory regions. An alternative In this paper, we develop a new dimer-based algorithm that approach is to compare the regulatory regions of orthologous generates PSWMs and puts an intrinsic probabilistic score on the genes in different species to identify functionally conserved resulting motifs. When applied to E. coli, we identify the binding sequence motifs (4–6). These approaches have been successfully used to analyze bacterial genomes, but they both have limita- tions. For example, clustering genes on the basis of their Abbreviations: PSWM, position-specific weight matrices; TSP, transcription start point. expression profiles is far from an exact and objective process; †To whom reprint requests should be addressed. E-mail: [email protected]. 11772–11777 ͉ PNAS ͉ September 3, 2002 ͉ vol. 99 ͉ no. 18 www.pnas.org͞cgi͞doi͞10.1073͞pnas.112341999 Downloaded by guest on September 27, 2021 sites of one-third of the characterized transcription factors with two dimer patterns are related to the binding sites of LexA in high statistical significance. In addition, this algorithm predicts E. coli: binding sites for many uncharacterized transcription factors that potentially define new regulons. We evaluate the signif- CTGTANNNNNNTACAG icance of these predictions from their probability score, their CTGTNNNNNNNNNCAGT positional distribution, and the coherence of the biological function of the putative group of coregulated genes. Our To divide the significant dimer patterns we found in step 1 into distinct groups, we first score the best alignment for each pair of success in applying this algorithm to E. coli suggests it will be dimers as illustrated above. The score is the number of matches useful in identifying regulatory networks of newly sequenced minus the number of mismatches, and matches of N to any other genomes. base or overhangs (e.g., the terminal T in the second sequence) are ignored. We then create a similarity score between zero and Methods one by normalizing the pair scores by the maximum over all Our algorithm consists of three steps. In the first step, it possible pairs and then cluster the dimers by using the CAST tabulates the positions of all strings W up to some length algorithm developed by Ben-Dor et al. (20). We experimented (typically 5 for Ϸ1 Mb of sequence) in the data. This table is with various thresholds for the cluster score (the average of all then searched to count the number of occurrences of the dimer the pair scores of its members) and found that a threshold of 0.6 W1NxW2, where the spacing x varies typically from 0 to 30 bp. gave good compact clusters. This number is compared with that expected if W1,2 are A cluster obtained in step 2 will give us only the most uncorrelated, and Poisson statistics is used to assign a prob- conserved part of the binding sites of a putative factor but ability to the observation. The second step takes all statistically provides no information about the middle or the flanking significant dimers and clusters them on the basis of sequence regions. However, this information still resides in the genomic similarity. The final step takes the actual genomic sequences sequence. For a given cluster, we extract the actual sequences matched by any member of a cluster plus the flanking regions that are matched by any member of the cluster plus about 10 (with no double counting) and performs a multiple sequence flanking base pairs. We then perform a multiple sequence alignment to yield the PSWM. To search for putative sites, alignment by using CONSENSUS (11), which easily finds the standard information theory measures are used to score correct alignment because the extracted sequences are short and sequences using PSWMs.