ShinyGO: a Web Application for In-Depth Analysis of Gene Sets
Supplementary document

ShinyGO: a web application for in-depth analysis of gene sets

Steven Xijin Ge 1,*, and Dongmin Jung 1,2

1 Department of Mathematics and Statistics, Box 2225, Brookings, SD, USA 57007
2 Avison Biomedical Research Center, Yonsei University, Seoul, South Korea

Table 1 compares ShinyGO (http://ge-lab.org/go/) with several other tools for enrichment analysis using gene lists.

Table 1. Comparison of selected enrichment analysis tools.

Methods       #Organisms   Custom Background   Reference
Enrichr       2            No                  [1]
GOrilla       8            No                  [3]
PlantGSEA     15           No                  [5]
Panther       112          No                  [7]
STRING        2031         No                  [8]
DAVID         65,000       Yes                 [10]
g:Profiler    208          No                  [12]
ShinyGO**     208          No                  Present study

** ShinyGO new features: visualization of overlapping gene sets, gene characteristics plots (gene length, GC content, chromosomal location), KEGG pathway diagram, protein-protein interaction (PPI) network.

To enable gene ID conversion, we downloaded all available gene ID mappings from Ensembl. The final mapping table for the current ShinyGO v0.4 release consists of 135,832,098 rows, mapping various gene IDs, including DNA microarray probe names, into Ensembl gene IDs.

Enrichment analysis is calculated based on the hypergeometric distribution followed by false discovery rate (FDR) correction. The background gene-set consists of all protein-coding genes in the genome. As many of the enriched GO terms are related or redundant (e.g., “cell cycle” and “cell cycle process”), we provide two plots to summarize such correlation following a method developed in [2]. We first measure the distance between two gene-sets by 1 − Ni/Nu, where Ni and Nu are the numbers of genes in the intersection and the union of the two sets, respectively. The distance matrix is used to construct a hierarchical clustering tree using average linkage, and to construct a network of GO terms using a cutoff of 0.05 overlap ratio [2].

To identify enriched TF binding motifs, transcript annotation and promoter sequences are retrieved from Ensembl.
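The gene-set distance 1 − Ni/Nu and the overlap cutoff for the term network, both described above, can be illustrated with a minimal Python sketch. This is not ShinyGO's own code. The function names and the toy gene sets are hypothetical, and reading the 0.05 cutoff as "connect two terms when Ni/Nu is at least 0.05" is an assumption.

```python
from itertools import combinations


def gene_set_distance(a, b):
    """Distance 1 - Ni/Nu, with Ni the intersection size and Nu the union size."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # convention chosen for this sketch: two empty sets are identical
    return 1.0 - len(a & b) / len(union)


def distance_matrix(gene_sets):
    """Return sorted term names and the symmetric matrix of pairwise distances."""
    names = sorted(gene_sets)
    matrix = [[gene_set_distance(gene_sets[r], gene_sets[c]) for c in names]
              for r in names]
    return names, matrix


def network_edges(gene_sets, min_overlap=0.05):
    """Connect two terms when their overlap ratio Ni/Nu reaches the cutoff.

    Assumption: this is one plausible reading of the 0.05 overlap-ratio cutoff.
    """
    edges = []
    for x, y in combinations(sorted(gene_sets), 2):
        overlap = 1.0 - gene_set_distance(gene_sets[x], gene_sets[y])
        if overlap >= min_overlap:
            edges.append((x, y, overlap))
    return edges


# Toy gene sets with placeholder gene names (hypothetical, for illustration only).
toy_sets = {
    "cell cycle": {"geneA", "geneB", "geneC", "geneD"},
    "cell cycle process": {"geneA", "geneB", "geneC"},
    "immune response": {"geneX", "geneY"},
}
names, dist = distance_matrix(toy_sets)
edges = network_edges(toy_sets)
```

The distance matrix returned here is the kind of input an average-linkage hierarchical clustering routine would take; the clustering step itself is left out of the sketch.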
For genes with multiple transcripts, the transcription start site (TSS) shared by the most transcripts is used. If multiple TSS locations have the same number of transcripts, then the most upstream TSS is used. Promoters are scanned using TF binding motifs in CIS-BP [4]. Instead of defining a binary outcome of binding or not binding, which depends on arbitrary cutoffs, we recorded the best score for each TF in every 300 bp and 600 bp promoter sequence. Student’s t-test is then used to compare the scores observed in a group of genes against those of the remaining genes. The P-values are corrected for multiple testing using the false discovery rate (FDR).

Use case

As an example, we analyzed a set of 149 genes (Table 2) up-regulated in lymphoblast cells (TK6, WTK1, and NH32) treated with ionizing radiation [6]. This gene list is available on the MSigDB [9] website [11]. The 149 human genes are mapped to 147 Ensembl gene IDs for enrichment analysis, as shown by the mapping information available in the “Genes” tab. Below, we use a large collection of human gene-sets (Table 3) to investigate this list.

GO Biological Process

Using Gene Ontology (GO) Biological Process gene-sets, we obtain the enrichment results in Table 4. The top terms are related to positive regulation of cellular metabolic process, response to stress, apoptosis, etc. These terms, many of which are related, are ranked by FDR in the table. A more organized presentation of these terms is shown in Figure 1, where related terms are grouped together. For example, several apoptosis-related terms are grouped in the branch at the bottom of Figure 1. The most significant groups of terms are related to nitrogen metabolism. Ionizing radiation induces reactive oxygen and nitrogen species, which might activate signaling pathways in response to DNA damage [13]. As shown in Figure 1, other groups of terms are related to regulation of biosynthesis, response to stimulus, and apoptosis.
Many of these processes are known to underpin the cellular response to ionizing radiation [14].

GO Cellular Component

Switching to GO Cellular Component, we find that this list is overrepresented with 61 (41%) nuclear proteins (Table 5 and Figure 2; FDR < 7.3×10⁻¹¹, hypergeometric test). Several small and highly specific functional categories are also identified. For example, 4 out of 7 proteins involved in the I-κB/NF-κB complex are included in the gene list (FDR < 2.6×10⁻⁶). As shown in Table 5, the 4 proteins are RELA, NFKB2, NFKBIA, and NFKB1; this information is available in the downloaded enrichment results. The I-κB/NF-κB complex plays important roles in immune response [15]. This list also contains 6 out of the 42 proteins in the Cyclin-dependent protein kinase holoenzyme complex (FDR < 9.4×10⁻⁶). Both proteins in the PCNA-p21 complex are included in the list (FDR < 7.5×10⁻⁴). Downstream of p53 signaling pathways, the interaction of p21 and PCNA plays a role in regulating the cell cycle after DNA damage [16]. Enriched terms can also be displayed as connected networks. For example, the network of enriched GO Cellular Component terms (Figure 4) shows a large cluster of interconnected terms related to chromatin and the nucleus, and a group of 3 terms related to membrane and cell surfaces.

GO Molecular Function

Using GO Molecular Function (Table 6 and Figure 3), ShinyGO reveals that 40 (27%) of the 147 genes have DNA binding transcription factor activity (FDR < 2.9×10⁻¹⁴). This list contains many transcription factors, such as JUNB, NFKB1, STAT1, and MYC, which give rise to many of the terms in the large branch in the lower part of Figure 3. Other, less significant terms include kinase binding and cytokine receptor binding.

KEGG pathways

Using KEGG pathways, we can detect overrepresentation of genes in cancer pathways with FDR < 8.5×10⁻²⁰ (Table 7). Thirty-two genes in the list are related to tumorigenesis (Figure 5).
Other significant pathways include the p53 signaling pathway, for DNA damage response, and the TNF and NF-κB signaling pathways, for immune response. ShinyGO retrieves the pathway diagrams from the KEGG web server and highlights the user’s genes (Figure 6).

Transcription Factor target genes

To investigate whether the 149 genes can be regulated by common transcription factors (TFs), we choose the “TF.Target” gene sets. These are verified or predicted TF target genes compiled from various sources, including RegNetwork [17], CircuitsDB [18], TRED [19], ENCODE [20], and TRRUST [21]. As shown in Table 8, the most significantly enriched TF is p53, which is represented by multiple gene-sets from different databases. Among the 147 genes, 27 (18%) are target genes of p53 (FDR < 7.08×10⁻²¹), which plays a critical role in the cellular response to ionizing radiation [16]. Other significant TFs are NFKB and RELA, which probably mediate the immune response via the Rela/NF-κB pathway [15]. Consistent with this enrichment, NFKB1, NFKB2, and RELA are included in the 147 query genes. Other TFs with enriched target genes include SP1 and BRCA1. These enrichment results are organized in Figure 7. The 4 gene-sets related to p53 target genes are grouped together. The 12 gene-sets of Rela/NF-κB also form a larger branch on the tree. Another group of TFs at the bottom of the tree includes SP1, FOS, and JUN. Taking advantage of a large collection of transcription factor target genes, ShinyGO can help generate hypotheses on gene regulation.

microRNA target genes

As shown in Table 10, 13 of the 147 genes are target genes of miR-145, with FDR < 9.23×10⁻⁶. Previous studies have shown that miR-145 is involved in DNA damage repair and is regulated by p53 [22]. As shown in Figure 8, two gene-sets are related to miR-145. miR-21 target genes are also overrepresented in the 147 genes (FDR < 9.23×10⁻⁶).
miR-21 has also been shown to be involved in DNA damage repair [23, 24], probably by targeting MSH2, a mismatch repair gene. Other microRNAs in Figure 8 are less significant but may also merit further investigation. For example, miR-146 might be regulated by p53 as well [25].

Gene characteristics

ShinyGO can also compare the list of 147 genes with the rest of the genes in several respects. Figure 9A shows that these genes are distributed randomly on the chromosomes (Chi-squared test, P = 0.97). A detailed genomic location map is shown in Figure 11. Figure 9C indicates that the genes are all protein-coding. These genes seem to have fewer exons (Chi-squared test, P = 0.023) and more transcript isoforms (Chi-squared test, P = 0.0086) than other protein-coding genes (Figure 9B and D). The distributions of the lengths of various gene features are shown in Figure 10. The 147 genes are similar to other genes in the lengths of their coding sequences, transcripts, genomic spans, and 3’ UTRs (untranslated regions). Their 5’ UTRs are slightly longer than those of the rest of the protein-coding genes (t-test, P = 0.026, Figure 10B). Plotting and t-tests on these gene lengths are done on a log scale, as the transformed data are closer to a normal distribution. The GC content of the genes is also similar to that of other genes.

STRING API

The STRING API recognized 138 (94%) of the 147 genes. In addition to enrichment analysis based on GO and KEGG, STRING also offers enrichment of protein domains using the Pfam and InterPro databases. As shown in Table 11, this list is overrepresented with 5 proteins with the Cyclin, N-terminal domain, and 5 proteins with the Helix-loop-helix DNA-binding domain.
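Overrepresentation results such as those reported throughout this use case rest on a hypergeometric tail probability followed by FDR correction. The following is a minimal Python sketch of that calculation, not ShinyGO's own code. The function names are hypothetical, the Benjamini-Hochberg procedure is assumed as the FDR method (the text says only "FDR correction"), and the background size of 20,000 protein-coding genes in the example call is an assumed round number.

```python
from math import comb


def hypergeom_pvalue(k, n, K, N):
    """P(X >= k) when n query genes are drawn from N background genes,
    K of which belong to the gene set, and k query genes fall in the set."""
    upper = min(n, K)
    total = comb(N, n)
    # math.comb returns 0 when the lower index exceeds the upper one,
    # so impossible configurations contribute nothing to the sum.
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted P-values (assumed FDR procedure)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down so that adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Example shaped like the I-kB/NF-kB case: 4 of the 7 complex members
# appear among 147 query genes. N = 20000 is an assumed background size.
p_example = hypergeom_pvalue(k=4, n=147, K=7, N=20000)

# FDR adjustment over a small, made-up list of raw P-values.
adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```

In practice one raw P-value is computed per gene set and the whole vector is adjusted at once, so the reported FDR depends on how many gene sets are tested.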