Gronostajski et al. Journal of Clinical Bioinformatics 2011, 1:4
http://www.jclinbioinformatics.com/content/1/1/4

DATABASE Open Access

The NFI-Regulome Database: A tool for annotation and analysis of control regions of genes regulated by Nuclear Factor I transcription factors

Richard M Gronostajski1,2*, Joseph Guaneri2,3, Dong Hyun Lee4, Steven M Gallo5

Abstract

Background: Genome annotation plays an essential role in the interpretation and use of genome sequence information. While great strides have been made in the annotation of coding regions of genes, less success has been achieved in the annotation of the regulatory regions of genes, including promoters, enhancers/silencers, and other regulatory elements. One reason for this disparity in annotated information is that coding regions can be assessed using high-throughput techniques such as EST sequencing, while annotation of regulatory regions often requires a gene-by-gene approach.

Results: The NFI-Regulome database (http://nfiregulome.ccr.buffalo.edu) was designed to promote easy annotation of the regulatory regions of genes that contain binding sites for the NFI (Nuclear Factor I) family of transcription factors, using data from the published literature. Binding sites are annotated together with the sequence of the gene, obtained from the UCSC Genome site, and the locations of all binding sites for multiple genes can be displayed in a number of formats designed to facilitate inter-gene comparisons. Classes of genes based on expression pattern, disease involvement, or types of binding sites present can be readily compared in order to assess common "architectural" structures in the regulatory regions.

Conclusions: The NFI-Regulome database allows rapid display of the relative locations and number of transcription factor binding sites of individual or defined sets of genes that contain binding sites for NFI transcription factors.
This database may in the future be expanded into a distributed database structure including other families of transcription factors. Such databases may be useful for identifying common regulatory structures in genes essential for organ development, tissue-specific gene expression or those genes related to specific diseases.

Background

Genome annotation, and the ability to extract and use information stored in genome databases, is an essential part of genomic and bioinformatic analysis [1-4]. While now primarily a basic research tool, analysis of genome annotation information is rapidly becoming an important part of Medical and Health Care informatics. As more patient genomes are determined, the ability to correlate changes in the regulatory regions of genes with specific disease states will become increasingly important for Personalized Medicine [5-7].

High-throughput sequencing techniques now allow human and other complex genomes to be sequenced relatively easily [8,9]. However determining the functional significance of sequence changes, particularly changes that affect regulatory elements in the genome, is still in its infancy. The annotation of regulatory elements in genomes has lagged behind the analysis of coding regions [1,2]. While coding regions can be readily assessed by comparing genome sequence with cDNA sequences, regulatory regions are still identified primarily on a gene-by-gene basis.
Even with the use of such powerful tools as whole genome ChIP-seq and the ENCODE project [4,10,11], the functional significance of binding sites found by large scale screening can only be definitively tested by mutational analysis of binding sites within genes and determining the effect of loss of binding on gene expression. Thus, the wealth of published data on the analysis of regulatory elements in genes remains an important asset to be mined by bioinformatic approaches.

* Correspondence: [email protected]
1Department of Biochemistry, State University of New York at Buffalo, 140 Farber Hall, Buffalo, NY, 14214, USA
Full list of author information is available at the end of the article

© 2011 Gronostajski et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Gene expression can be regulated at many levels including control of transcription rate, transcript transport and degradation, translation rate, protein folding and assembly into multi-subunit structures, and protein stability [12]. We have focused on the analysis of cis-acting regulatory elements as mediators of gene expression [13-17]. In particular, we have created a database for the annotation and analysis of binding sites for site-specific transcription factors in the promoter and enhancer regions of genes. To provide a focus for such a broad topic, we have considered only genes that contain binding sites for the Nuclear Factor I (NFI) family of transcription factors.

The NFI family of transcription factors is essential for the development of multiple organ systems including the brain, lung, muscle, hematopoietic cells, and teeth [18-21]. The NFI-Regulome database contains the control regions of genes that have been shown to be regulated by NFI transcription factors in the primary literature. These control regions are annotated with transcription start sites, translation start sites, NFI binding sites, and the location and identity of other known or unknown site-specific transcription factors. Since there are hundreds of known site-specific transcription factors [13,16,17], comprehensive coverage of all known cis-regulatory sites within a single database is daunting; restricting our analysis to regulatory regions that contain NFI sites therefore provides us with a defined starting point. Since this gene family has been shown to be essential for a number of developmental processes, the database should also provide useful information on the structures of regulatory regions of genes involved in development and disease.

Construction and content

Structure

The database is built using MySQL with the MyISAM engine. This provides the full text support needed for searches. The table structure is designed to be in first normal form, which requires that the attributes of the relation contain only atomic values [22]. While third normal form (3NF) can be achieved easily through the use of an algorithm [22], the decomposed tables are not practical for the queries utilized by the website pages. The tables are separated by major characteristics, and generally all fields of a given table are utilized by a given query. This enables a single query to retrieve all the information needed for a particular object such as a binding site or a gene. In this case performance concerns outweigh the concerns of anomalies [22] appearing in the relation scheme. Most situations where anomalies can possibly occur are handled through the software, because MyISAM provides neither transaction management nor foreign key management.

The overall structure of the tables (Figure 1) was developed specifically for this database. The tables can be separated into smaller groups which have a complete dependency on each other (Figure 2); two additional tables provide static information, and one table is used for holding user information. Each group is responsible for acting as the data warehouse for a specific set of information.

The AuthorDB and ArticleDB grouping is used for holding information related to PubMed articles. As the PubMed ID is a unique feature, it is used in this grouping as the primary key. The Author field is currently limited to 64 characters, as no current value even approaches that maximum. This limit can be adjusted if a value were to exceed this arbitrary default. The Abstract field is set to longtext due to the varying length of article abstracts. Utilizing the MyISAM engine allows this field to be searched for keywords at the lowest level, which is preferable to creating a piece of software to accomplish the same task.

GeneDB, GeneSynonymDB, SpeciesDB, BindingSiteDB, and TF_connect_TFBS_DB form the main grouping of tables used for holding gene and binding site information. The TF_connect_TFBS_DB table acts as a directory that allows a gene to know what binding sites it has and a binding site to know what gene it belongs to. The GeneDB table includes a Cell_memo field that is used to hold important keywords. The keywords are generally separated by a comma, but this is an in-house practice. The use of MyISAM here allows this field to be searched for string values, and the field can be changed to a text type if the varchar length, currently set to the maximum of 254, is exceeded. GeneSynonymDB provides a table that lists the alternative names for a particular gene. While this information could have been listed in the Cell_memo field, providing a separate table for it allows fast indexing and access. This feature can be expanded to other attributes located in the Cell_memo field.

The BindingSiteDB table houses all of the binding site information. This table also includes a TFBS_memo field which allows an annotator to list important keywords. In NFI-Regulome these are also separated by commas. Due to the short length of binding site sequences, the binding site sequence field uses the varchar type instead of the longtext used by GeneDB. The BindingSiteDB table also provides the link to the previous group through the inclusion of the Pubmed_id field. The TF_connect_TFBS_DB table is the central table of the entire database. It connects a particular gene for a particular species to a particular binding site and gives it transcription factor information. Most queries utilized by

Figure 1 Table structure of NFI-Regulome Database.
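
As a rough illustration of the table grouping described under Structure, the following sketch builds a reduced version of the schema and runs the kind of single join query the text describes. It is a minimal sketch under stated assumptions, not the authors' implementation: SQLite (through Python's sqlite3 module) stands in for MySQL with the MyISAM engine, the column names other than Pubmed_id, Cell_memo and TFBS_memo are hypothetical, AuthorDB, GeneSynonymDB and SpeciesDB are left out for brevity, and the rows are toy data. The helper functions link_site, sites_for_gene and genes_with_keyword are likewise invented for this example.

```python
import sqlite3

# SQLite stands in for MySQL/MyISAM. Only Pubmed_id, Cell_memo and
# TFBS_memo are field names taken from the text; the rest are hypothetical.
SCHEMA = """
CREATE TABLE ArticleDB (Pubmed_id INTEGER PRIMARY KEY, Abstract TEXT);
CREATE TABLE GeneDB (Gene_id INTEGER PRIMARY KEY, Name TEXT, Cell_memo TEXT);
CREATE TABLE BindingSiteDB (
    Site_id INTEGER PRIMARY KEY, Sequence TEXT, TFBS_memo TEXT, Pubmed_id INTEGER);
CREATE TABLE TF_connect_TFBS_DB (
    Gene_id INTEGER, Species TEXT, Site_id INTEGER, TF_name TEXT);
"""


def link_site(conn, gene_id, species, site_id, tf_name):
    """Add a directory row, checking both references in software.

    This mimics handling integrity in the application layer, as one
    would when the storage engine does not enforce foreign keys.
    """
    gene = conn.execute(
        "SELECT 1 FROM GeneDB WHERE Gene_id = ?", (gene_id,)).fetchone()
    site = conn.execute(
        "SELECT 1 FROM BindingSiteDB WHERE Site_id = ?", (site_id,)).fetchone()
    if gene is None or site is None:
        raise ValueError("unknown gene or binding site")
    conn.execute(
        "INSERT INTO TF_connect_TFBS_DB VALUES (?, ?, ?, ?)",
        (gene_id, species, site_id, tf_name))


def sites_for_gene(conn, gene_name):
    """Retrieve all binding sites of a gene with one join query."""
    rows = conn.execute(
        """SELECT b.Sequence, c.TF_name, b.Pubmed_id
           FROM GeneDB g
           JOIN TF_connect_TFBS_DB c ON c.Gene_id = g.Gene_id
           JOIN BindingSiteDB b ON b.Site_id = c.Site_id
           WHERE g.Name = ?
           ORDER BY b.Site_id""",
        (gene_name,))
    return rows.fetchall()


def genes_with_keyword(conn, keyword):
    """Match a whole keyword in the comma-separated Cell_memo field.

    Done in Python here; the database itself relies on MyISAM text search.
    """
    rows = conn.execute(
        "SELECT Name, Cell_memo FROM GeneDB ORDER BY Gene_id").fetchall()
    return [name for name, memo in rows
            if keyword in [k.strip() for k in (memo or "").split(",")]]


def demo():
    """Build an in-memory database holding a few toy rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO ArticleDB VALUES (1, 'placeholder abstract')")
    conn.execute(
        "INSERT INTO GeneDB VALUES (1, 'example_gene', 'keyword1,keyword2')")
    conn.execute(
        "INSERT INTO BindingSiteDB VALUES (1, 'TTGGCNNNNNGCCAA', 'keyword', 1)")
    link_site(conn, 1, "example_species", 1, "NFI")
    return conn
```

Calling sites_for_gene(demo(), "example_gene") returns one row that combines the site sequence, the transcription factor name and the supporting Pubmed_id, which is the single-query access pattern that the denormalized layout is meant to favor.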