A Tool for Searching Putative Factors Regulating Gene
Total Page:16
File Type:pdf, Size:1020Kb
bioRxiv preprint doi: https://doi.org/10.1101/262923; this version posted February 9, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license. 1 TFmapper: A tool for searching putative 2 factors regulating gene expression using 3 ChIP-seq data 4 5 Jianming Zeng 1, Gang Li 1* 6 1Faculty of Health Sciences, University of Macau, Macau, China 7 8 Email: Zeng Jianming [email protected] 9 Gang Li [email protected] 10 11 * To whom correspondence should be addressed: [email protected] 1 bioRxiv preprint doi: https://doi.org/10.1101/262923; this version posted February 9, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license. 12 Abstract 13 Background: Next-generation sequencing coupled to chromatin immunoprecipitation 14 (ChIP-seq), DNase I hypersensitivity (DNase-seq) and the transposase-accessible chromatin 15 assay (ATAC-seq) has generated enormous amounts of data, markedly improved our 16 understanding of the transcriptional and epigenetic control of gene expression. To take 17 advantage of the availability of such datasets and provide clues on what factors, including 18 transcription factors, epigenetic regulators and histone modifications, potentially regulates the 19 expression of a gene of interest, a tool for simultaneous queries of multiple datasets using 20 symbols or genomic coordinates as search terms is needed. 21 Results: In this study, we annotated the peaks of thousands of ChIP-seq datasets generated 22 by ENCODE project, or ChIP-seq/DNase-seq/ATAC-seq datasets deposited in Gene 23 Expression Omnibus and curated by CistromeDB; We built a MySQL database called 24 TFmapper containing the annotations and associated metadata, allowing users without 25 bioinformatics expertise to search across thousands of datasets to identify factors targeting a 26 genomic region/gene of interest in a specified sample through a web interface. Users can also 27 visualize multiple peaks in genome browsers and download the corresponding sequences. 28 Conclusion: TFmapper will help users explore the vast amount of publicly available 29 ChIP-seq/DNase-seq/ATAC-seq data, and perform integrative analyses to understand the 30 regulation of a gene of interest. The web server is freely accessible at 31 http://www.tfmapper.org/. 2 bioRxiv preprint doi: https://doi.org/10.1101/262923; this version posted February 9, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license. 32 33 Keywords: 34 Chromatin immunoprecipitation; Next-generation sequencing; Regulation of gene expression 3 bioRxiv preprint doi: https://doi.org/10.1101/262923; this version posted February 9, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license. 35 Background 36 Next-generation sequencing (NGS) coupled to chromatin immunoprecipitation (ChIP-seq), 37 DNase I hypersensitivity (DNase-seq) and the transposase-accessible chromatin assay 38 (ATAC-seq) provide powerful tools for studying gene regulation by factors in cis and trans, 39 which includes components of the basal transcriptional machinery, transcription factors, 40 chromatin regulators, histone variants, histone modifications and others. A great amount of 41 data of ChIP-seq, DNase-seq and ATAC-seq has been accumulated by individual laboratories 42 and large-scale collaborative projects, including the ENCODE (Encyclopedia of DNA 43 Elements) Consortium [1], the Roadmap Epigenomics Mapping Consortium (Roadmap) [2] 44 and the International Human Epigenome Consortium (IHEC) projects [3]. The datasets are 45 usually publicly accessible through the Gene Expression Omnibus (GEO) of the National 46 Center for Biotechnology Information (NCBI) [4], and the data portals of large-scale projects. 47 Despite the easy access, mining and interpreting the ChIP-seq/DNase-seq/ATAC-seq 48 data is challenging for regular users, especially for bench biologists with limited bioinformatic 49 expertise. Adding to the complexity, the data qualities and data analysis pipelines are 50 remarkably varied, which hinder their direct use in further analyses [5]. To address the 51 challenges, multiple algorithms/metrics have been developed to evaluate the data quality 52 bioinformatically, such as ENCODE quality metrics, NGS-QC, ChiLin and others [6-9]. 53 NGS-QC developed by Gronemeyer and colleagues built a quality control (QC) indicator 54 database of the largest collection of publicly available NGS datasets [6, 8], which provides a 4 bioRxiv preprint doi: https://doi.org/10.1101/262923; this version posted February 9, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license. 55 solid start point for further analysis. CistromeDB developed by Liu and colleagues curated 56 and processed a huge collection of human and mouse ChIP-seq and chromatin accessibility 57 datasets from GEO with a standard analysis pipeline ChiLin [9], and further evaluated 58 individual data quality under several scoring metrics [10]. High-quality processed ChIP-seq 59 data generated by ENCODE consortium, including histone modification, chromatin regulator 60 and transcription factor binding data in a selected set of biological samples, are also available 61 through its data portal [11]. ENCODE and CistromeDB provide access to the processed data, 62 and the corresponding metadata including the sources and properties of biological samples, 63 experimental protocols, the antibody used and others, which offer opportunities for users to 64 re-analyze the data and identify the genome-wide targets of a transcription regulator in 65 different cell lines and tissues. Nonetheless, different questions are often asked, such as: 66 Which transcription factors are responsible for the regulation of a gene of interest, and what is 67 the epigenetic landscape of a gene of interest in a particular tissue or cell type? To address 68 these questions, a tool for simultaneous queries of multiple datasets using symbols or 69 genomic coordinates of target genes as search terms is needed. 70 In this study, we collected and annotated a large number of ChIP-seq, DNase-seq and 71 ATAC-seq datasets including the ENCODE datasets and the GEO datasets curated by 72 CistromeDB [10]. We built a Structured Query Language (SQL) database containing the 73 annotations and the associated metadata, allowing users to search across multiple datasets 74 and identify the putative factors which target a specified gene or genomic locus based on 5 bioRxiv preprint doi: https://doi.org/10.1101/262923; this version posted February 9, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license. 75 actual NGS data. We also provide links for users to visualize the peaks and download the 76 corresponding sequences. In addition, we included an example to demonstrate the utility of 77 TFmapper. 6 bioRxiv preprint doi: https://doi.org/10.1101/262923; this version posted February 9, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license. 78 Construction and content 79 Data collection 80 We downloaded the GEO datasets processed by Liu and colleagues from CistromeDB as 81 BED (Browser Extensible Data) files, which includes 6092 ChIP-seq datasets for trans-acting 82 factors and 6068 ChIP-seq datasets for histone modifications generated from human samples; 83 4786 ChIP-seq datasets for trans-acting factors and 5002 ChIP-seq datasets for histone 84 modifications generated from mouse samples; and 1371 DNase-seq/ATAC-seq datasets. For 85 ENCODE datasets, we downloaded the conservative IDR (Irreproducible Discovery Rate)- 86 thresholded peaks, which are called after combinational analysis of two replicates for each 87 ChIP-seq experiment. The ENCODE datasets selected in this study include 955 transcription 88 factor and 771 histone modification datasets for Homo sapiens, and 145 transcription factor 89 and 1252 histone modification datasets for Mus musculus. 90 91 Data processing 92 We used HOMER [12] to annotate the peaks in BED files per the newest reference genome, 93 GRCh38/hg38 for human and GRCm38/mm10 for mouse respectively. HOMER assigns each 94 peak to the nearest gene by calculating the distance between the middle of a peak and the 95 transcription start site (TSS) of a gene. Meanwhile, seven genomic features based on RefSeq 96 annotations were assigned to each peak, which are promoter (by default defined from -1kb to 97 +100bp of TSS), TTS region (transcription termination site, by default defined from -100 bp to 7 bioRxiv preprint doi: https://doi.org/10.1101/262923; this version posted February 9, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.