Uniprobe: an Online Database of Protein Binding Microarray Data on Protein–DNA Interactions

UniPROBE: an online database of protein binding microarray data on protein–DNA interactions The MIT Faculty has made this article openly available. Please share how this access benefits you. Your story matters. Citation Newburger, D. E., and M. L. Bulyk. “UniPROBE: An Online Database of Protein Binding Microarray Data on protein-DNA Interactions.” Nucleic Acids Research 37.Database (2009): D77–D82. As Published http://dx.doi.org/10.1093/nar/gkn660 Publisher Oxford University Press (OUP) Version Final published version Citable link http://hdl.handle.net/1721.1/73095 Terms of Use Creative Commons Attribution Non-Commercial Detailed Terms http://creativecommons.org/licenses/by-nc/2.5 Published online 8 October 2008 Nucleic Acids Research, 2009, Vol. 37, Database issue D77–D82 doi:10.1093/nar/gkn660 UniPROBE: an online database of protein binding microarray data on protein–DNA interactions Daniel E. Newburger1 and Martha L. Bulyk1,2,3,* 1Division of Genetics, Department of Medicine, 2Department of Pathology, Brigham and Women’s Hospital and Harvard Medical School and 3Harvard-MIT Division of Health Sciences and Technology (HST), Harvard Medical School, Boston, MA 02115, USA Received August 15, 2008; Revised September 18, 2008; Accepted September 21, 2008 ABSTRACT stimuli, differentiation and development in an organism. Despite recent advances in this field, the vast majority The UniPROBE (Universal PBM Resource for of TFs in most major model organisms and pathogens Oligonucleotide Binding Evaluation) database remain either uncharacterized or poorly described (1). hosts data generated by universal protein binding The development of universal (2) protein binding micro- microarray (PBM) technology on the in vitro DNA- array (PBM) technology (3) (Figure 1) offers a new avenue binding specificities of proteins. This initial release for the exploration of protein–DNA binding specificity. of the UniPROBE database provides a centralized Universal PBMs provide an efficient and comprehensive resource for accessing comprehensive PBM data method for in vitro interrogation of DNA-binding pref- on the preferences of proteins for all possible erences. PBM technology complements other currently sequence variants (‘words’) of length k (‘k-mers’), available technologies, such as chromatin immunoprecipi- as well as position weight matrix (PWM) and graphi- tation coupled with either microarray readout (4–7) or cal sequence logo representations of the k-mer high-throughput sequencing (8–10) that identify genomic regions bound in vivo. data. In total, the database hosts DNA-binding Universal PBMs achieve comprehensive, high-resolu- data for over 175 nonredundant proteins from a tion determination of proteins’ DNA-binding preferences diverse collection of organisms, including the pro- by measuring the binding preferences of a protein over karyote Vibrio harveyi, the eukaryotic malarial para- all possible k-mers of a given length (2,11). Currently site Plasmodium falciparum, the parasitic employed custom array designs contain a set of 60-bp Apicomplexan Cryptosporidium parvum, the yeast DNA probes that encompass all possible permutations Saccharomyces cerevisiae, the worm Caenorhabdi- of either 9 (Bulyk Lab, unpublished data) or 10 bp (12), tis elegans, mouse and human. Current web tools depending upon the microarray design (2,12). In addition include a text-based search, a function for asses- to covering all contiguous 9-mers or 10-mers, these array sing motif similarity between user-entered data designs also offer an extensive set of gapped permuta- and database PWMs, and a function for locating tions that provide coverage of binding sites of greater putative binding sites along user-entered nucleotide length. Together, these data can be synthesized to produce high-confidence measurements of the relative preferences sequences. The UniPROBE database is available at of a protein for all possible sequence variants belonging http://thebrain.bwh.harvard.edu/uniprobe/. to a wide range of k-mer patterns typically found in TF binding site motifs (2,12). PBM enrichment scores from the PBM signal intensity data are typically calculated for INTRODUCTION each of the more than 2.3 million 8-mers (i.e. binding site The characterization of transcription factors’ (TFs’) ‘words’ with eight informative nucleotide positions, DNA-binding specificities represents a critical step including all contiguous 8-mers and a large collection of towards understanding the regulation of gene expression gapped 8-mers). These 8-mers encompass the full affinity and elucidating the biophysical properties governing range of DNA binding preferences, from the most prefer- protein–DNA interactions. The study of DNA-binding entially bound k-mers to low-affinity k-mers and nonspe- specificities therefore has profound implications for the cifically associated sequences (2). analysis and prediction of the regulatory networks The TRANSFAC (13) and JASPAR (14) databases con- that govern intracellular function, responses to external tain hundreds of matrices constructed from DNA binding *To whom correspondence should be addressed. Tel: +1 617 525 4725; Fax: +1 617 525 4705; Email: [email protected] ß 2008 The Author(s) This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/ by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited. D78 Nucleic Acids Research, 2009, Vol. 37, Database issue number of binding site sequences in a JASPAR SELEX dataset is just 28, while the median number of nonredundant binding site sequences is just 11. In contrast, the DNA binding profiles obtained from our PBM experiments provide information on the direct binding preferences of a given protein over all k-mer DNA binding sequence variants, measured in vitro using the universal PBM technology; the number of sequence variants examined is limited only by the number of features on the microarray. In addition to the k-mer binding profiles, these procedures also provide DNA binding sequence PWMs derived from the k- mer data using our Seed-and-Wobble algorithm (2). The UniPROBE database hosts the high-resolution DNA binding profiles obtained from PBM experiments on known and predicted TFs (2,3,18–21). The database currently contains DNA binding profiles for many proteins not included in similar databases such as JASPAR (14) and TRANSFAC (13), and it offers several tools for searching the database and analyzing user-defined binding profiles or DNA sequences. The resources and analysis tools offered by the UniPROBE database promise to facilitate previously untenable, downstream genomic analyses, Figure 1. Universal PBM schema. Universal PBMs containing all and we anticipate that it will represent an important geno- possible 10-mers within 60-mer probes are first synthesized as single- mic resource as additional PBM data are compiled. stranded oligonucleotide arrays, to which a common primer is annealed and extended in order to biochemically convert the single-stranded array to a double-stranded DNA (dsDNA) array (these steps are not shown in the figure) (2). The dsDNA array is then bound by protein, DATABASE CONSTRUCTION stained with a fluorophore-conjugated antibody, and scanned; the The UniPROBE database is managed by a MySQL rela- quantified array data are then normalized by the relative amounts of DNA in each spot, and used to calculate k-mer binding data (2). tional database that provides the back-end for user queries PWMs can be calculated either from the k-mer binding data using and facilitates the data retrieval necessary for the site’s our Seed-and-Wobble algorithm (2) or from the 60-mer probe data analysis tools. All HTML pages are dynamically gener- using other motif finding algorithms (34). ated by PHP scripts hosted on an Apache server, and several JavaScript libraries provide interactive interfaces that facilitate site navigation and form accessibility. The site data from various data types; indeed, a given position Apache server also hosts all downloadable data, available weight matrix (PWM) in TRANSFAC frequently is by HTTP connection. derived from binding sequence data compiled from mul- tiple experimental methods, which include lower throughput approaches such as gel retardation (i.e. electrophoretic DATABASE CONTENT mobility shift assays), DNase I footprinting, immuno- The UniPROBE database hosts the results of PBM experi- precipitation, supershift assays and methylation protec- ments, subsequent computational analyses performed tion and higher throughput approaches such as in vitro on these data, and protein annotations. The site currently selection (15) (SELEX). The PAZAR database (16) is a hosts PBM data for over 175 nonredundant proteins from meta-database that contains TF binding site data. It con- a wide range of organisms, including the prokaryote Vibrio tains PWMs from the JASPAR core database and in vivo harveyi, the eukaryotic malarial parasite Plasmodium TF binding site data and cis-regulatory module infor- falciparum, the parasitic Apicomplexan Cryptosporidium mation from various sources, including other databases. parvum, the yeast Saccharomyces cerevisiae, the worm A review of databases of cis-regulatory modules is Caenorhabditis elegans, mouse and human (2,18,20,21). beyond the scope of this article. These data already encompass the majority of mouse The universal PBM technology has several

Uniprobe: an Online Database of Protein Binding Microarray Data on Protein–DNA Interactions

Locating Mammalian Transcription Factor Binding Sites: a Survey of Computational and Experimental Techniques

Specificity Landscapes Unmask Submaximal Binding Site Preferences of Transcription Factors

DNA Binding Site of the Growth Factor-Inducible Protein Zif268

Direct Inhibition of the DNA-Binding Activity of POU Transcription Factors

Preferences in a Trait Decision Determined by Transcription Factor Variants

A STAT Protein Domain That Determines DNA Sequence Recognition Suggests a Novel DNA-Binding Domain

On the Prediction of DNA-Binding Preferences of C2H2-ZF Domains Using Structural Models: Application on Human CTCF

A-Activating Factor-1 and AP-1 Cooperation Between Serum

Probing DNA Shape and Methylation State on a Genomic Scale with Dnase I

Predicting Transcription Factor Binding Sites Using Structural Knowledge

Fos-Jun Interactions That Mediate Transcription Regulatory Speci®City

Identification of a Consensus DNA-Binding Site for The