JASPAR 2014: an Extensively Expanded and Updated Open-Access Database of Transcription Factor Binding Profiles
Anthony Mathelier1, Xiaobei Zhao2,3, Allen W. Zhang1, François Parcy4, Rebecca Worsley-Hunt1, David J. Arenillas1, Sorana Buchman2, Chih-yu Chen1, Alice Chou1, Hans Ienasescu2, Jonathan Lim1, Casper Shyr1, Ge Tan4, Michelle Zhou1, Boris Lenhard5,6,*, Albin Sandelin2,* and Wyeth W. Wasserman1,*

1Department of Medical Genetics, Centre for Molecular Medicine and Therapeutics at the Child and Family Research Institute, University of British Columbia, Vancouver, BC, Canada
2Department of Biology and Biotech Research and Innovation Centre, The Bioinformatics Centre, Copenhagen University, Ole Maaloes Vej 5, DK-2200, Denmark
3Lineberger Comprehensive Cancer Center, University of North Carolina, Chapel Hill, NC 27599, USA
4Laboratoire Physiologie Cellulaire & Végétale, Université Grenoble Alpes, CNRS, CEA, iRTSV, INRA, 38054 Grenoble, France
5Computational Regulatory Genomics, MRC Clinical Sciences Centre, Imperial College London, Du Cane Road, London W12 0NN, UK
6Department of Informatics, University of Bergen, Thormøhlensgate 55, N-5008 Bergen, Norway

*To whom correspondence should be addressed. Tel: +44 208 383 8353; Fax: +44 208 383 8577; Email: [email protected]
Correspondence may also be addressed to Albin Sandelin. Tel: +45 353 21285; Fax: +45 3532 5669; Email: [email protected]
Correspondence may also be addressed to Wyeth W. Wasserman. Tel: +1 604 875 3812; Fax: +1 604 875 3819; Email: [email protected]
The authors wish it to be known that, in their opinion, the first two authors should be regarded as Joint First Authors.

Nucleic Acids Research, 2014, Vol. 42, Database issue, D142–D147. doi:10.1093/nar/gkt997
Published online 4 November 2013. Received September 15, 2013; Accepted October 3, 2013.

To cite this version: Anthony Mathelier, Xiaobei Zhao, Allen W Zhang, François Parcy, Rebecca Worsley-Hunt, et al. JASPAR 2014: an extensively expanded and updated open-access database of transcription factor binding profiles. Nucleic Acids Research, Oxford University Press, 2014, 42 (1), pp.D142-7. 10.1093/nar/gkt997. hal-00943558

HAL Id: hal-00943558. https://hal.archives-ouvertes.fr/hal-00943558. Submitted on 28 May 2020. Distributed under a Creative Commons Attribution - NonCommercial 4.0 International License.

© The Author(s) 2013. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact [email protected]

ABSTRACT

JASPAR (http://jaspar.genereg.net) is the largest open-access database of matrix-based nucleotide profiles describing the binding preference of transcription factors from multiple species. The fifth major release greatly expands the heart of JASPAR (the JASPAR CORE subcollection, which contains curated, non-redundant profiles) with 135 new curated profiles (74 in vertebrates, 8 in Drosophila melanogaster, 10 in Caenorhabditis elegans and 43 in Arabidopsis thaliana; a 30% increase in total) and 43 older updated profiles (36 in vertebrates, 3 in D. melanogaster and 4 in A. thaliana; a 9% update in total). The new and updated profiles are mainly derived from published chromatin immunoprecipitation-seq experimental datasets. In addition, the web interface has been enhanced with advanced capabilities in browsing, searching and subsetting. Finally, the new JASPAR release is accompanied by a new BioPython package, a new R tool package and a new R/Bioconductor data package to facilitate access for both manual and automated methods.

INTRODUCTION

Transcription factors (TFs) influence gene expression by binding to specific cis-acting elements in a genomic sequence. Thus, accurate models for describing the binding properties of TFs are essential in modeling transcription. From a set of known transcription factor binding sites (TFBSs) for a given TF, the binding preference is generally represented in the form of a position weight matrix (PWM) (also called position-specific scoring matrix) derived from a position frequency matrix (PFM). A PFM is essentially an occurrence table, summarizing the number of each nucleotide observed at each position of a set of aligned TFBSs (1,2). Compared with simpler models like consensus sequences, PWMs allow for an additive probabilistic description of binding preferences (3).

The JASPAR database holds collections of PFM nucleotide profiles based on published experiments from diverse sources, and has grown gradually from its inception (4–7). The most widely used JASPAR collection is JASPAR CORE, which is a curated non-redundant set of TFBS profiles for multicellular eukaryotes, based on experimental evidence. The JASPAR database aims to provide the best canonical DNA binding profile per TF, as assessed by expert curators. Non-redundancy of TFBS profiles (i.e. one profile per TF) is intended with the exception of cases in which curators observe a clear difference in the sequence (e.g. Nkx2-5) or length (e.g. JUND) at the core of a profile. Other JASPAR motif collections, with different characteristics than the CORE database, are available (7).

Over the years, JASPAR has been equipped with functions aimed at casual and power users. The web-based graphical user interface functionality includes browsing, searching, subsetting and downloading, as well as basic sequence searching tools, dynamic clustering of matrices and generation of random PFMs by sampling selected profiles (4–7).

Historically, JASPAR was populated by PFMs generated by in vitro site selection assays or collections of in-depth characterized sites, limiting both the number of TFs with binding profiles and the number of sites contributing to the profiles. With the development of high-throughput techniques that can assess in vitro or in vivo binding (8–10), it is now possible to generate binding models for most regulators, in multiple species. To this end, we have, in this fifth release, expanded the JASPAR CORE collection substantially, as well as updated the profiles of several existing ones with new data from high-throughput experiments.

EXTENSIVE EXPANSION AND IMPROVEMENT OF JASPAR CORE

The JASPAR CORE database has been substantially expanded. In total, 135 new PFMs have been added (a 30% increase), and 43 older PFMs (9% of last release) have been updated with new data, from vertebrate, insect, nematode and plant species (Table 1).

[...] elegans. From these studies, we extracted the bound regions, identified over-represented motifs close to the ChIP-seq peak max position (corresponding to the region where the maximum number of ChIP-seq reads are mapped) using the MEME suite (18) and constructed PFMs describing the binding preferences of the TFs (see Supplementary Text for details).

As in previous JASPAR CORE additions, we manually curated the profiles. To confirm the putative binding patterns, we identified independent publications with TFBSs or profiles consistent with the candidates, as described in (7). To gain additional profiles, we considered bound regions derived from ChIP-chip experiments from modENCODE and (19) for D. melanogaster. A similar strategy as for ChIP-seq datasets was used to derive PFMs from ChIP-chip data (see Supplementary Text for details). In total, we obtained 45, 28, 8 and 10 high-quality PFMs in H. sapiens, M. musculus, D. melanogaster and C. elegans, respectively, for TFs that have never been described previously in JASPAR (see Supplementary Table S1). It represents a 57, 6 and 200% increase when compared with the previous release for vertebrates, insects and nematodes, respectively. The newly introduced vertebrate profiles are derived from 34 and 40 ChIP-seq experiments collected from PAZAR and ENCODE, respectively. The fact that almost 50% of the new PFMs are from individual studies collected in PAZAR highlights the importance of our manual retrieval of published ChIP-seq data. From ChIP-seq data sets of the vertebrate sequence-specific TFs not previously described in JASPAR, we obtained 71 (60%) canonical motifs satisfying our literature-based manual curation (see Supplementary Table S2). The rich data from ChIP experiments allowed replacement of 39 existing profiles for TFs in mammals (36 PFMs updated) and in D. melanogaster (3 PFMs updated).

As part of the curation of ChIP-seq data, and as introduced earlier, we computed a centrality score as described in (20), based on our expectation that