A Database of Regulatory Networks in Gamma-Proteobacterial Genomes Abel D
Total Page:16
File Type:pdf, Size:1020Kb
View metadata, citation and similar papers at core.ac.uk brought to you by CORE provided by Open Repository and Bibliography - Luxembourg D98–D102 Nucleic Acids Research, 2005, Vol. 33, Database issue doi:10.1093/nar/gki054 TRACTOR_DB: a database of regulatory networks in gamma-proteobacterial genomes Abel D. Gonza´lez, Vladimir Espinosa, Ana T. Vasconcelos1, Ernesto Pe´rez-Rueda2 and Julio Collado-Vides3,* National Bioinformatics Center, Industria y San Jose´, Capitolio Nacional, CP. 10200, Habana Vieja, Habana, Cuba, 1National Laboratory for Scientific Computing, Avenue Getulio Vargas 333, Quitandinha, CEP 25651-075, Petropolis, Rio de Janeiro, Brazil, 2Depto. de Ingenierı`a Celular y Biocata´lisis, IBT-UNAM, Cuernavaca, Morelos, Mexico and 3Center of Genomics, UNAM, AP 565-A Cuernavaca, CP. 62100, Morelos, Mexico Received August 6, 2004; Revised and Accepted October 1, 2004 Downloaded from ABSTRACT out in this direction (1–3). The first step towards this goal is the recognition of all the genes regulated by a transcription factor Experimental data on the Escherichia coli (TF), i.e. its regulon. transcriptional regulatory system has been used in Computational approaches to recognizing the location of http://nar.oxfordjournals.org/ the past years to predict new regulatory elements regulatory sites in bacterial genomes include the use of weight (promoters, transcription factors (TFs), TFs’ binding matrices (4), phylogenetic footprinting (5), searching for sites and operons) within its genome. As more gen- statistical overrepresentation of oligonucleotides within a omes of gamma-proteobacteria are being sequenced, genome and clustering co-expressed genes in order to find the prediction of these elements in a growing number conserved patterns in their upstream regions (6,7), among of organisms has become more feasible, as a others (8,9). Recently, Tan et al. (10) proposed a new method- step towards the study of how different bacteria ology which brings together the advantages of both, the weight respond to environmental changes at the level of matrices and the phylogenetic footprinting approaches. at University of Luxembourg on May 12, 2014 In the past few years, a great amount of research has been transcriptional regulation. In this work, we present dedicated to computational prediction of important regulatory TRACTOR_DB (TRAnscription FaCTORs’ predicted elements in the Escherichia coli genome: promoters (11), binding sites in prokaryotic genomes), a relational data- operons (12), TFs (13) and TF binding sites (9). As more base that contains computational predictions of new bacterial genomes are sequenced, it is becoming more impor- members of 74 regulons in 17 gamma-proteobacterial tant to extend these efforts to other organisms, and decipher genomes. For these predictions we used a compar- their transcriptional regulatory networks by means of com- ative genomics approach regarding which several parative regulatory studies (10,14–19). proof-of-principle articles for large regulons have Our two major goals in this work were the production of a been published. TRACTOR_DB may be currently reliable set of binding site predictions for as many gamma- accessed at http://www.bioinfo.cu/Tractor_DB, http:// proteobacterial TFs as possible in 17 organisms of this division www.tractor.lncc.br/ or at http://www.cifn.unam.mx/ [E.coli K12 (NC_000913), Haemophilus influenzae (NC_000907), Salmonella typhi (NC_003198), Salmonella Computational_Genomics/tractorDB. Contact Email typhimurium LT2 (NC_003197), Shewanella oneidensis id is [email protected]. (NC_004347), Shigella flexneri 2a (NC_004337), Vibrio cholerae (NC_002505), Yersinia pestis KIM (NC_004088), Buchnera aphidicola (NC_004545), Pseudomonas aeruginosa (NC_002516), Pseudomonas syringae (NC_004578), Pasteur- INTRODUCTION ella multocida (NC_002663), Pseudomonas putida KT2440 One of the challenges of Functional Genomics is the identifi- (NC_002947), Vibrio parahaemolyticus (NC_004603), Vibrio cation of all the elements that take part in an organism’s vulnificus CMCP6 (NC_004459), Xanthomonas axonopodis transcriptional regulatory network. This is necessary to under- (NC_003919) and Xylella fastidiosa (NC_002488)], and stand how the cell reacts to environmental stimuli at the level the construction of a database (TRACTOR_DB, accessible of transcriptional regulation. Intense research is being carried at http://www.bioinfo.cu/Tractor_DB, http://www.tractor. *To whom correspondence should be addressed. Tel: +527 773 132063; Fax: +527 773 175581; Email: [email protected] The online version of this article has been published under an open access model. Users are entitled to use, reproduce, disseminate, or display the open access version of this article for non-commercial purposes provided that: the original authorship is properly and fully attributed; the Journal and Oxford University Press are attributed as the original place of publication with the correct citation details given; if an article is subsequently reproduced or disseminated not in its entirety but only in part or as a derivative work this must be clearly indicated. For commercial re-use permissions, please contact [email protected]. ª 2005, the authors Nucleic Acids Research, Vol. 33, Database issue ª Oxford University Press 2005; all rights reserved Nucleic Acids Research, 2005, Vol. 33, Database issue D99 lncc.br and http://www.cifn.unam.mx/Computational_ four putative sites. These sets were then used to build a specific Genomics/tractorDB) with a user-friendly navigation interface model for each organism using CONSENSUS, which were containing these computationally predicted sites. Several then used to re-calculate the cutoffs and to re-scan the reg- recent papers have addressed the first goal; however, to our ulatory regions of each organism. knowledge, the present work is the most complete in terms of Summing up, the main features included in the approach the number of regulons and organisms that it comprises. Most proposed by Tan et al. (10) in order to extend it to as many TFs other studies have been limited to E.coli and H.influenzae (10) as possible were as follows: (i) the use of two different algo- or to one or few regulons (18,19). In contrast, the work by rithms to build the statistical models of each TF binding site: McCue et al. (20) predicted a large number of putative reg- the CONSENSUS (4) and the Gibbs-SAMPLER (22); (ii) the ulatory sites in 10 gamma-proteobacterial genomes, but could inclusion, within the training set used to build the model of not associate many of those sites to a given TF. each TF, of non-coding regions upstream the TUs of the organisms other than E.coli that are orthologous to those that in E.coli are known as regulon members (orthologous non-coding regions), along with E.coli known binding METHODS sequences of the TF; and (iii) the reconstruction of the models Downloaded from Selecting organisms and regulons for each TF in each organism after the prediction process, which allows the refinement of the search for new members We used the main ideas of the methodology proposed by Tan of the regulon within each genome. et al. (10) to predict new regulon members in 17 gamma- All the sites found for a given TF—in the eight initial proteobacterial genomes. Orthology information is used in organisms—after the second scanning using the rebuilt models this methodology to assess the biological significance of were aligned, using CONSENSUS, to produce a Positional http://nar.oxfordjournals.org/ putative regulon members. The methodology requires that Weight Matrix (PWM). Those PWMs were then used to the organisms selected be phylogenetically close to E.coli, scan the genomes of the other nine organisms for new putative since E.coli known binding sequences are used to build a binding sites. The orthology filtering process was done as statistical model of a TF’s binding site, and this model is described in the Supplementary Material, Section I.5, using subsequently used to predict new members of that regulon only the first eight organisms at the centre of the analysis. in the other organisms within the study. Hence, we included For a thorough description of the predictive methodo- in our study a group of organisms from different subdivisions logy see Section I.(1–6) and Figure 1 of the Supplementary of the gamma-proteobacteria subclass whose genomes were Material. completely sequenced. We started working with all TFs with at University of Luxembourg on May 12, 2014 at least one binding site known in E.coli. Designing and building the database Outline of the predictive methodology The database was designed and built following the relational model, and installed on a MySQL server. The web interface is Original training sets contain E.coli TFs’ binding sequences managed by a cgi PERL script that does all the work, from extracted from RegulonDB, version 4.0 (21). The first eight querying the database at the user’s request, to generating the organism referenced in the list above (those sharing at least dynamic web pages that form the interface. The design that we 30% of orthologous genes with E.coli) were used to construct have adopted makes it very easy to incorporate new instances the original training sets. A training set was built for each TF to the database, which may be very easily accommodated into with at least one known binding sequence in E.coli (21) and it the interface. Figure 1 shows the main relationships between included also those of orthologous non-coding regions when the tables that compose TRACTOR_DB and the most impor- less than 25 binding sequences were known in E.coli. We built tant queries carried out by the interface program. weight matrices only for TFs with training sets larger than four Several links in the dynamically generated web pages that sequences. Two statistical models were built for each TF: form the interface make the navigation easy and user-friendly. one using CONSENSUS, and the second using the Gibbs- The dynamic pages are linked to other databases such SAMPLER. The training sets were filtered twice as proposed as RegulonDB (21), EcoCyc (23) and the NCBI database.