Discovery of a Robust Gene Regulatory Network with a Complex Transcription Factor Network on Organ Cancer Cell-Line RNA Sequence Data

Chem-Bio Informatics Journal, Vol.19, pp.32-55 (2019)


Bharata Kalbuaji1, Y-H. Taguchi2, and Akihiko Konagaya1*

1 Department of Computer Science, Tokyo Institute of Technology, 4259 Nagatsuta-cho, Midori-ku, Yokohama, Kanagawa, 226-8503, Tokyo, Japan 2 Department of Physics, Chuo University, 1-13-27 Kasuga, Bunkyo-ku, 112-8551, Tokyo, Japan *E-mail: [email protected]

(Received February 28, 2019; accepted July 19, 2019; published online October 3, 2019)

Abstract

Gene expression analysis for understanding cancer cell development is a basic but important step toward furthering our knowledge in cancer research. We may also be interested in understanding gene interactions that may lead to cancer development. One of the most important interactions is a regulatory interaction that involves transcription factor genes. In this research, we attempt to construct a new regulatory network that imitates the transcription and translation processes of mRNA. We construct this network from four different cancer types: bile-duct cancer (BDC), lung adenocarcinoma (LUAD), colorectal cancer (CRC), and hepatocellular carcinoma (HCC). We also integrate differential expression data to obtain the interactions among differentially expressed genes. We then try to find intersecting sub-networks that exist across all cancer types. We believe that the transcription factor genes found in intersection sub-networks may reveal an important mechanism that affects cancer cell growth. In this research, we found that genes such as TEAD4, IRX5, and HMGA1, as well as members of the E2F and SOX gene families, appear in the enrichment analysis of the intersection sub-network obtained from multiple cancer data-sets. These genes point us toward dysregulation of the cell cycle, cell division, and cell proliferation mechanisms in cancer cells. These genes may become new drug targets for cancer treatment.

Key Words: gene regulatory network, RNA-seq, transcription factor, differentially expressed genes

Area of Interest: Bioinformatics and its applications in medicine

Licensed under a Creative Commons Attribution 4.0 International (CC BY 4.0)

1. Introduction

With the advancement of Next Generation Sequencing, researchers can now carry out different types of experiments with genomics data [1]. For RNA-seq data, the most widely applied analysis workflow is differential gene expression analysis. This workflow has become a gold standard for analyzing RNA-seq data having two or more sample conditions. There are many tools available to perform differential gene expression analysis, especially in the R Bioconductor library [2], such as DESeq2 [3], edgeR [4], and limma/voom [5]. Many downstream analyses can be applied to the results of differential gene expression analysis, one of the most important being network analysis. Researchers have long recognized the importance of network analysis for gene expression data, because genes, proteins, and other factors form complex interactions. The interactions are so complex that it is hard to create a comprehensive network that shows the overall picture of what happens in the cell. Several types of networks have been created to show the complex interactions that occur within the cell. The first type is manually curated networks such as the KEGG pathway [6–8] or WikiPathway [9–11]. This network type provides high quality data, but the amount of data is limited and small compared with the actual network within the cell. The second type of network is a reverse-engineered network derived from gene expression data. It is often called a co-expression network due to the method of creating the interactions between nodes. Several popular methods to generate this network have been proposed; e.g., the WGCNA library [12] in R and Ingenuity Pathway Analysis [13]. 
One of the advantages of constructing gene networks from expression profiles is that we can see the inter-connectivity and relationships between genes; however, we cannot really show how genes are connected to one another. The third type is networks constructed from publicly available data. One such popular database is the STRING database (STRING DB) [14], which creates networks by using a text-mining approach. While STRING DB gives us a general idea of how genes are connected, detailed information on how this occurs cannot be obtained. Other than the networks mentioned above, there are several types of specially constructed networks. One of them is RegNetwork [15], which builds interaction networks from several sources and combines them to create a complex interaction network among several types of molecules, such as RNA, miRNA, and proteins. RegNetwork gives us useful information about interactions among regulatory molecules during both the transcription and post-transcription processes. Another database is Regulatory Circuit [16]. The networks obtained from the Regulatory Circuit database are tissue-specific networks. The Regulatory Circuit builds the network by matching transcription factor binding motifs, in the form of position weight matrices (PWMs), against the enhancer and tissue-specific promoter region annotation data available from the FANTOM5 database. The network is also constructed with the inclusion of GWAS data related to diseases. Construction of a regulatory network using PWM was also done by Neph et al. [17]. PWM is a very good method to build gene regulatory networks, because it imitates the transcription process. However, using PWM to build a network has a limitation: the network only shows gene-gene relationships. We cannot see gene-to-protein relationships during the translation process. 
In this research, we introduce another type of network for downstream analysis of cancer gene expression by creating a transcription factor network that reflects the transcription and translation processes. We try to create a network that shows gene-protein-gene interactions by simulating the actual processes of transcription and translation. The transcription process is reflected by using a binding motif from PWM and a promoter region sequence, similar to the Regulatory Circuit database method. The difference is that we approximate the cell-specific promoter region by using transcript-level gene expression from the RNA-seq experiment instead of the CAGE-seq experiment data, such as those in the FANTOM5 database, which are used by the Regulatory Circuit. We believe that transcript-level gene expression from RNA-seq can be used to determine which transcripts are expressed for a gene so that we can determine which promoter regions are active. Our network also has a special characteristic: it consists of only transcription factors, and it shows positive regulatory relationships. We assume that any changes affecting the cell state can be reflected by transcription factor expression, because transcription factors regulate other genes. Any change in the expression of a transcription factor can create a chain reaction that changes the expression of other genes. We determine positive regulatory relationships by using differential gene expression analysis. A target gene that is up-regulated would only be connected to an up-regulated transcription factor and vice versa. This way, the relationship would be more accurate and easier to interpret. With this new network, we analyze several cancer types. We hypothesize that although the cancers are different, there should be some common characteristics across cancer types. We can utilize these common characteristics as potential drug targets that work for multiple cancer types. In addition, we also try to find a positive regulatory loop among transcription factors. We hypothesize that such a loop among transcription factors can create feedback effects, which would change a normal cell into a cancer state and maintain that state.

2. Materials and Methods

Our transcription factor network is a directed network that is constructed to imitate the relationships between genes and proteins. Currently, most gene regulatory network generation methods generate gene-gene interactions; but, in reality, transcription factor genes cannot directly interact with their target genes. We try to improve the gene-gene interaction regulatory networks by constructing gene-protein-gene regulatory networks. Gene-protein-gene interaction can improve our understanding of the network, because it can closely imitate the actual relationships within the cell. With gene-protein-gene interactions, we can understand how genes are translated into proteins and then how those proteins regulate other genes. To further improve the accuracy of our network, we also add gene expression data from RNA-seq. We use transcript level gene expression to estimate which transcripts are expressed for each gene so that we can estimate which promoter regions are likely to be bound by transcription factors. We also use our differential gene expression analysis results to filter the network so as to obtain a positive regulatory relationship.

2.1 Transcription factor protein binding motif

We use a binding motif in the form of PWM. We obtained PWM data from the TRANSFAC Pro database, version 2017.2 [18]. PWM reflects the transcription factor protein binding motif. We also obtained PWM annotation from the same database. These PWM annotation data provide information about the gene components and protein complex type of each PWM. There are three types of protein complex. The first type is a specific protein. This type means that the PWM comes from a protein that is translated from a specific gene. Most of the PWMs of this type have only one gene product as their building block. The second type is a complex protein. This means that the protein is a hetero-dimer complex, having more than one gene product as its building block. The third type is a family. This means that more than one gene can produce the same protein. From TRANSFAC Pro version 2017.2, we obtained 1,265 transcription factor genes and 4,112 PWMs.

2.2 Cancer gene expression data-set

We use five data-sets from four different cancer types obtained from the NCBI GEO database. These data-sets are publicly available. Table 1 shows the list of the data-sets that we use. We selected the data-sets based on the cancer cell type characteristics, because all four cancers we studied share similar tissue characteristics. We also considered the sample size and the availability of raw RNA-seq data. Sample size indicates how many cancer-normal sample pairs are in the data-set. The pairing of cancer and normal samples is important for the differential expression analysis that we perform, because it lets us obtain the expression changes in the cancer condition compared to the normal condition.

Table 1. List of data-sets used for each cancer type

Cancer type                      Size      NCBI GEO data-set
Bile-duct cancer (BDC)           7 pairs   GSE63420
Lung adenocarcinoma (LUAD)       27 pairs  GSE87340
Colorectal cancer (CRC)          10 pairs  GSE104836
Colorectal cancer (CRC)          3 pairs   GSE104178
Hepatocellular carcinoma (HCC)   28 pairs  GSE77509

Raw data are needed to obtain transcript-level and gene-level expression and to perform differential gene expression analysis. For the CRC cancer type, we use two data-sets, GSE104178 and GSE104836. Data-set GSE104178 provides raw RNA-seq data but has a small sample size, so we also add data-set GSE104836. We use Salmon [19] to obtain transcript-level read counts and gene-level read counts. We use the limma/voom library to get counts per million (cpm) as a measurement of the gene expression level and to indicate whether a gene is expressed or not. Among the 1,265 transcription factor genes in our list, not all are expressed in each cancer type. Table 2 shows the number of transcription factors that are expressed in each cancer data-set. To perform differential gene expression analysis between the cancer condition and the normal condition, we use the DESeq2 package with default parameters.

Table 2. Number of expressed transcription factor genes for each cancer type

Cancer type                      No. of expressed TF genes
Bile-duct cancer (BDC)           1,016
Lung adenocarcinoma (LUAD)       1,080
Colorectal cancer (CRC)          1,092
Hepatocellular carcinoma (HCC)   945


2.3 Transcription factor network construction

We try to approximate the actual transcription and translation processes, although we understand that many factors are involved. For example, such factors may include an enhancer protein that binds to the DNA quite far from the actual promoter region of a gene, and a limitation on how many transcription factor proteins can bind at the same time in the promoter region. We therefore simplify the method to simulate only transcription factor binding in the promoter region, because it is hard to know the long-range effect of an enhancer protein on the target gene. Our transcription factor network has unique characteristics. There are two types of nodes: gene nodes and protein nodes. The network also has two types of edges: the “bind_to” edge and the “translate_to” edge. Each edge type imitates the respective transcription or translation process in the cell. A “translate_to” edge represents a translation process, whereby a gene or several genes are translated and form a protein. A “bind_to” edge represents a transcription process, whereby a transcription factor protein binds to the promoter region of its target gene. In this network, the edge type always alternates between a “translate_to” edge and a “bind_to” edge. Figure 1 illustrates this edge and node relationship. To determine the promoter region for a target gene, we need to determine which transcripts are expressed for each transcription factor gene. We use transcript-level gene expression to determine whether a transcript is expressed or not. We use a simple calculation of the proportion of a transcript relative to all transcripts from a gene. The formula is shown in equation (1), where R_{t_i} is the proportion of the read count from transcript t_i, r_{t_i} is the read count for transcript t_i, and I_G is the set of all transcripts from gene G. We then manually determine a threshold that gives the best result for obtaining all expressed transcripts for each transcription factor gene.

Figure 1. Illustration of node and edge types in the transcription factor network. A “bind_to” edge symbolizes that a protein in the source node regulates a gene in the target node. A “bind_to” edge has a circular edge end and a dashed line. A “translate_to” edge symbolizes that a gene in the source node is translated into a protein in the target node. A “translate_to” edge has an arrow edge end and a solid line. The protein node has a rounded rectangular shape and the gene node has a circular shape. A “translate_to” edge can have several meanings. If there is more than one gene, each gene node has one “translate_to” edge to the protein node. This means that the protein is either a complex protein or a hetero-dimer protein that is constructed from several protein sub-units translated from those genes, or that the protein comes from a gene family that produces a similar protein. The “translate_to” types can be distinguished by the name of the protein node.


$$
R_{t_i} = \frac{r_{t_i}}{\sum_{j \in I_G} r_{t_j}} \qquad (1)
$$
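The proportion in equation (1) can be sketched in Python as follows. This is a minimal illustration and not the authors' code: the function names, the transcript identifiers, and the threshold value of 0.1 are hypothetical, since the text only states that the threshold was determined manually.

```python
def transcript_proportions(read_counts):
    """Return R_t = r_t / (sum of r over all transcripts of one gene)."""
    total = sum(read_counts.values())
    if total == 0:
        # A gene without reads has no expressed transcript.
        return {t: 0.0 for t in read_counts}
    return {t: r / total for t, r in read_counts.items()}


def expressed_transcripts(read_counts, threshold=0.1):
    """Keep transcripts whose proportion reaches the (hypothetical) threshold."""
    props = transcript_proportions(read_counts)
    return sorted(t for t, p in props.items() if p >= threshold)


# Hypothetical read counts for three transcripts of one gene.
counts = {"tx1": 80, "tx2": 15, "tx3": 5}
print(transcript_proportions(counts))
print(expressed_transcripts(counts))
```

With these example counts, tx1 and tx2 pass the assumed threshold, so only their promoter regions would be scanned for binding motifs.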

From the list of expressed transcripts for each transcription factor gene, we extract the sequences with lengths of 1,000 nt, 2,000 nt, 3,000 nt, 4,000 nt, and 5,000 nt upstream of the first exon location. Exon locations are obtained from the Ensembl [20] database. An illustration of the promoter region location is shown in Figure 2. Using these sequences, we run the FIMO [21] software to find matches between each PWM motif and the sequence. We use the nucleotide frequency as the FIMO background parameter to generate the match score and p-value. We then use the Benjamini-Hochberg False Discovery Rate (FDR) method [22] to obtain a Q-value that corrects the raw p-value for multiple testing. Based on this Q-value, we filter results with thresholds of <0.1, <0.05, <0.01, <0.005, and <0.001. We use multiple promoter region lengths and Q-value thresholds to find a robust result.

We then apply additional filtering steps to the FIMO result. The first filter retains only positive regulatory relationships. We use the differential gene expression analysis result to find which PWMs are up-regulated or down-regulated. We determine whether a PWM is up-regulated or down-regulated based on its gene components. Up-regulated and down-regulated genes are determined by their log-fold changes (base 2), which are obtained by using the DESeq2 library in Bioconductor. We define a gene as up-regulated if its log-fold change is >= 0.5 and as down-regulated if its log-fold change is <= -0.5. Genes with log-fold changes > -0.5 and < 0.5 are considered not differentially expressed. From this definition, we then infer the expression change of each PWM. We use several rules to estimate whether a PWM is up-regulated, down-regulated, not changed in expression, or ambiguous. The rules are:

1. If all genes that form a PWM are up-regulated, the PWM will also be up-regulated and vice versa.
2. If one or more genes from a specific PWM is down-regulated, the PWM expression is also down-regulated. The rationale is that if one or more of the gene components is down-regulated, the protein will lack gene products as its components; therefore, the protein expression level will also be down-regulated.
3. If no gene components are down-regulated and some gene components are up-regulated, the PWM expression will be up-regulated.
4. If some gene components are down-regulated while other components are up-regulated, the PWM expression will be ambiguous.
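The rules above can be sketched as a small Python function. This is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, and because rules 2 and 4 overlap when a PWM has both up-regulated and down-regulated components, the sketch assumes that rule 4 (ambiguous) takes precedence in that case.

```python
def gene_direction(log2_fc, cutoff=0.5):
    """Classify one gene from its log2 fold change (cutoff 0.5 as in the text)."""
    if log2_fc >= cutoff:
        return "up"
    if log2_fc <= -cutoff:
        return "down"
    return "unchanged"


def pwm_direction(component_log2_fcs, cutoff=0.5):
    """Infer the expression change of a PWM from its gene components."""
    directions = {gene_direction(fc, cutoff) for fc in component_log2_fcs}
    has_up = "up" in directions
    has_down = "down" in directions
    if has_up and has_down:
        return "ambiguous"   # rule 4 (assumed to override rule 2)
    if has_down:
        return "down"        # rules 1 (vice versa) and 2
    if has_up:
        return "up"          # rules 1 and 3
    return "unchanged"       # no component is differentially expressed


# Hypothetical log2 fold changes of the gene components of four PWMs.
print(pwm_direction([1.2, 0.7]))    # all components up
print(pwm_direction([-0.8, 0.1]))   # one component down, none up
print(pwm_direction([0.9, 0.0]))    # some up, none down
print(pwm_direction([0.9, -0.9]))   # mixed directions
```

A "bind_to" edge would then be kept only when the inferred direction of the PWM equals the direction of its target gene.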

Figure 2. Illustration of how the promoter region is calculated. We define the promoter region as a region upstream of the first exon of a transcript of a gene. We define multiple lengths for the promoter region: 1,000 nt, 2,000 nt, 3,000 nt, 4,000 nt, and 5,000 nt. We use the DNA sequences extracted from this region as the promoter region for the transcript.
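The promoter region definition in Figure 2 can be sketched as a coordinate calculation. This is a hedged illustration only: the function name is hypothetical, and it assumes 1-based inclusive genomic coordinates with start <= end on both strands, so that "upstream" means lower coordinates on the forward strand and higher coordinates on the reverse strand. Clipping at the chromosome end is not handled.

```python
def promoter_interval(first_exon_start, first_exon_end, strand, length):
    """Return the (start, end) interval of `length` nt upstream of the first exon."""
    if strand == "+":
        end = first_exon_start - 1
        start = max(1, end - length + 1)   # do not run past the chromosome start
    elif strand == "-":
        start = first_exon_end + 1
        end = start + length - 1
    else:
        raise ValueError("strand must be '+' or '-'")
    return start, end


# Hypothetical first exon at positions 5001-5200, for every length in the text.
for length in (1000, 2000, 3000, 4000, 5000):
    print(length, promoter_interval(5001, 5200, "+", length))
```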


Based on the rules above, we assign each PWM its inferred expression change, and then filter out the PWM-target gene pairs whose expression changes do not match, except for ambiguous PWMs, which we exclude from the filter. After this step, the result contains only positive regulatory relationships. The next filter removes PWM match results that overlap. The reason is that multiple transcription factors cannot bind to the same or an overlapping location. If a transcription factor has bound a location in the promoter region, it prevents the binding of other transcription factors. Based on this, we decided to use a maximization algorithm to determine the best PWM that can bind to a location in the promoter region. To determine which PWMs are the best, we define a p-score for each PWM binding. This p-score formula is obtained from the MCAST software [23]. The formula to calculate the p-score is shown in equation (2).

$$
\text{p-score}(s) = -\log_{10}\left(\frac{P(s)}{p}\right) \qquad (2)
$$

For PWM s, P(s) is the Q-value of PWM s and p is the Q-value threshold (0.1, 0.05, 0.01, 0.005, or 0.001). Because we filter the results based on the Q-value threshold, P(s) < p, so the ratio is smaller than 1 and the p-score is positive. The p-score(s) is therefore a measure of the distance between the Q-value of a PWM and the Q-value threshold: the greater the distance, the larger the score. We then use a dynamic algorithm to find a combination of PWMs that maximizes the sum of the p-scores. Overall, the workflow that we use in our method is shown in Figure 3.
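The overlap filter can be sketched as follows. The text does not specify the algorithm in detail, so this sketch assumes a standard weighted interval scheduling formulation solved by dynamic programming; the function names and the example matches are hypothetical, the p-score is taken as -log10(Q-value / threshold) following the MCAST definition, and every Q-value is assumed to be greater than zero.

```python
import math
from bisect import bisect_right


def p_score(q_value, q_threshold):
    """Distance of a Q-value from the threshold on a -log10 scale."""
    return -math.log10(q_value / q_threshold)


def best_non_overlapping(matches, q_threshold):
    """Pick non-overlapping matches that maximize the summed p-score.

    matches: list of (start, end, pwm_id, q_value) with inclusive coordinates.
    Returns (total_score, [pwm_id, ...]) ordered by position.
    """
    scored = sorted(
        ((s, e, pwm, p_score(q, q_threshold)) for s, e, pwm, q in matches),
        key=lambda m: m[1],
    )
    ends = [m[1] for m in scored]
    best = [0.0] * (len(scored) + 1)   # best[i]: optimum over the first i matches
    take = [False] * len(scored)
    prev = [0] * len(scored)
    for i, (start, _end, _pwm, score) in enumerate(scored):
        # Number of earlier matches that end strictly before this one starts.
        prev[i] = bisect_right(ends, start - 1, 0, i)
        with_match = best[prev[i]] + score
        if with_match > best[i]:
            best[i + 1], take[i] = with_match, True
        else:
            best[i + 1] = best[i]
    chosen, i = [], len(scored)
    while i > 0:                        # trace back the chosen matches
        if take[i - 1]:
            chosen.append(scored[i - 1][2])
            i = prev[i - 1]
        else:
            i -= 1
    return best[-1], chosen[::-1]


# Hypothetical matches: A and B overlap, C is compatible with both.
hits = [(1, 10, "A", 0.001), (5, 15, "B", 0.0001), (16, 25, "C", 0.01)]
print(best_non_overlapping(hits, q_threshold=0.05))
```

In this example B is preferred over the overlapping A because its Q-value lies further below the threshold, and C is kept because it does not overlap B.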

2.4 Intersection transcription factor sub-network

We defined an intersection transcription factor sub-network as a sub-network of shared differentially expressed transcription factors in more than one cancer type. We found these sub-networks by overlapping the differentially expressed transcription factor network, iteratively, among the cancer types. We also considered the effects of the promoter region length and the Q-value threshold for the subnetwork construction. We focused on the transcription factor network for the following reason. We assumed that changes in gene expression that alter the cell state from the normal state to the cancer state could be traced back to the changes in transcription factor gene expression. Therefore, finding differentially expressed transcription factors, and knowing how all of those transcription factors are interconnected in the network, would be important for understanding the basic mechanism of cancer development. In addition, we also assumed that, although cancer cell gene expression might vary depending on the cell type, there would be a basic mechanism involved in the transcription factor change in gene expression that caused the changes in the cell state. Based on these assumptions, we tried to find the intersection transcription factor sub-network among all cancer data-sets.
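The intersection step can be sketched with plain set operations. This is a minimal sketch under assumptions: each per-cancer network is represented as a set of (source, target, edge type) tuples, an edge counts as shared only if the identical tuple occurs in every network, and the node names are hypothetical placeholders rather than results from the data-sets.

```python
def intersection_subnetwork(networks):
    """Return (shared_edges, shared_nodes) across all per-cancer networks.

    networks: dict mapping a cancer label to a set of
    (source, target, edge_type) tuples.
    """
    edge_sets = [set(edges) for edges in networks.values()]
    if not edge_sets:
        return set(), set()
    shared_edges = set.intersection(*edge_sets)
    shared_nodes = {node for s, t, _ in shared_edges for node in (s, t)}
    return shared_edges, shared_nodes


# Hypothetical two-cancer example with one cancer-specific edge.
common = {("geneA", "protA", "translate_to"), ("protA", "geneB", "bind_to")}
networks = {
    "BDC": common | {("geneX", "protX", "translate_to")},
    "HCC": set(common),
}
edges, nodes = intersection_subnetwork(networks)
print(sorted(edges))
print(sorted(nodes))
```

Repeating this for every combination of promoter region length and Q-value threshold gives the edge and node counts that the comparison across settings relies on.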


Figure 3. Workflow to construct a transcription factor network. To construct our transcription factor network, we used two different data types for each edge type (“bind_to” edge and “translate_to” edge). For the “translate_to” edge, we use annotation data of the corresponding gene component of each PWM from TRANSFAC Pro 2017.2. We assumed PWM as a protein representation. For the “bind_to” edge, we calculate the binding probability of each PWM to the promoter region of the target gene by using FIMO software and additional filtering. We then combine these two results into one network.


3. Results and Discussion

After we constructed the transcription factor network for each cancer data set, we then found the intersection of sub-networks from all cancer data-sets. This sub-network gave us information about how the transcription factor genes that were differentially expressed interact with each other in multiple cancer types. We then tried to find a feed-back loop in the intersection sub-network. Although it was hard to check whether the interactions were true or not, because there was no gold standard for gene regulatory networks, we compared the results with a publicly available regulatory network as a measurement.

3.1 Differentially expressed transcription factors among cancers

With regard to the 1,265 transcription factor genes that we had identified, each cancer data-set had a different number of differentially expressed transcription factor (DE TF) genes. We used a threshold of p-value <0.05 and log-fold change >= 0.5 for up-regulated genes and log-fold change <= -0.5 for down-regulated genes. Table 3 shows DE TF in all data-sets.

Table 3. Number of differentially expressed transcription factors (DE TF), up or down-regulated, for each cancer type

Cancer   Up-regulated TF genes   Down-regulated TF genes   Total DE TF genes
BDC      348                     143                       491
LUAD     250                     171                       421
CRC      171                     134                       305
HCC      290                     162                       452


Figure 4. Venn diagram showing the number of shared and exclusive differentially expressed transcription factors (DE TFs) in each cancer data-set.

Among these DE TF genes, some were exclusive to one cancer and some were shared across multiple cancers. Figure 4 shows a Venn diagram of the number of DE TFs that were shared or exclusive for each cancer. In total, 57 transcription factor genes were differentially expressed in all cancers. Detailed information on these TF genes is given in the supplementary data Table S1.

3.2 Intersection sub-network among cancers

An intersection transcription factor sub-network is a subnetwork of transcription factors that are differentially expressed in all cancer data sets. The number of edges, nodes, and loops in the subnetwork strongly depends on the promoter region length and Q-value threshold as given in Table 4. In general, the number of shared edges and nodes tends to be small when a small Q-value threshold is used. It is rare to find any loops in the intersection sub-network when the Q-value is less than 0.01. On the other hand, there might be an optimal promoter region length that maximizes the number of shared edges and nodes of the intersection sub-network. For example, the promoter region length 3000 results in a higher number of shared edges and nodes, than in the cases of 4000 and 5000 when the Q-value is 0.01. Although it might be mathematically possible to find the optimal promoter region length and the optimal Q-value that maximize the number of shared edges and nodes in the intersection sub-network, there is no guarantee that the obtained intersection sub-network becomes biologically meaningful. Therefore, we pursued other approaches to estimate a biologically meaningful Q-value and a promoter region length by means of comparing the number of shared edges with other publicly available databases.
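One way to count the edges that form a loop is sketched below. How the loop edges were actually counted is not described in the text, so this sketch assumes that a directed edge belongs to a loop when its source can be reached again from its target. The function name and node names are hypothetical.

```python
from collections import defaultdict


def edges_in_loops(edges):
    """Return the directed edges (source, target) that lie on at least one cycle."""
    adjacency = defaultdict(set)
    for source, target in edges:
        adjacency[source].add(target)

    def reachable(start, goal):
        # Iterative depth-first search from start, looking for goal.
        stack, seen = [start], {start}
        while stack:
            node = stack.pop()
            if node == goal:
                return True
            for nxt in adjacency.get(node, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return False

    # An edge s -> t closes a cycle exactly when s is reachable from t.
    return {(s, t) for s, t in edges if reachable(t, s)}


# Hypothetical network: geneA and geneB regulate each other, geneC is a dead end.
network = {
    ("geneA", "protA"), ("protA", "geneB"),
    ("geneB", "protB"), ("protB", "geneA"),
    ("protB", "geneC"),
}
print(sorted(edges_in_loops(network)))
```

The per-edge search is quadratic in the worst case, which is acceptable for intersection sub-networks with at most a few hundred edges; larger networks would call for a strongly connected component algorithm instead.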


3.3 Comparison with other databases

We compared the intersection sub-network with three other gene regulatory network databases: the HepG2 cell regulatory network from Neph et al., the Regulatory Circuit database, and the RegNetwork database. Because all of these regulatory network databases are in the form of gene-gene relationships, we transformed our network from a gene-protein-gene to a gene-gene relationship network.
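The transformation to a gene-gene network and the shared-edge count can be sketched as follows. This is an assumption-laden illustration: the text does not describe the transformation, so the sketch assumes that every gene with a "translate_to" edge into a protein inherits that protein's "bind_to" targets. Function and node names are hypothetical.

```python
from collections import defaultdict


def collapse_to_gene_gene(edges):
    """Collapse gene -> protein -> gene paths into direct gene -> gene edges.

    edges: iterable of (source, target, edge_type) tuples, where edge_type
    is "translate_to" (gene -> protein) or "bind_to" (protein -> gene).
    """
    edges = list(edges)                     # allow generators; we iterate twice
    protein_to_genes = defaultdict(set)
    for source, target, kind in edges:
        if kind == "translate_to":
            protein_to_genes[target].add(source)
    gene_gene = set()
    for source, target, kind in edges:
        if kind == "bind_to":
            for gene in protein_to_genes.get(source, ()):
                gene_gene.add((gene, target))
    return gene_gene


def shared_edges(ours, reference):
    """Directed gene-gene edges present in both networks."""
    return set(ours) & set(reference)


# Hypothetical hetero-dimer: geneA and geneB together form protAB, which binds geneC.
ours = collapse_to_gene_gene([
    ("geneA", "protAB", "translate_to"),
    ("geneB", "protAB", "translate_to"),
    ("protAB", "geneC", "bind_to"),
])
reference = {("geneA", "geneC"), ("geneC", "geneA")}
print(sorted(ours))
print(sorted(shared_edges(ours, reference)))
```

Because the comparison is directional, a reference edge pointing the opposite way does not count as shared.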

1. Comparison with the HepG2 cell network

Neph et al. [17] provide 41 cell-specific TF networks that were derived from a DNase I footprint experiment. These networks consist only of TF genes, which is similar to our network. The HepG2 network consists of 12,863 edges and 493 nodes, all of which are TFs. We chose the HepG2 cell network because it is the most relevant to our network. HepG2 is an immortalized cell line consisting of human liver carcinoma cells, a cancer type that is also present in our data-set (HCC). We hypothesized that our intersection sub-network would also appear in the HepG2 cell network because the cell type is the same, although the different TF PWM databases used would cause some discrepancy between the networks. The number of shared edges between our network and the HepG2 network is shown in Table 5a. According to the results, the intersection sub-network that uses a Q-value threshold of 0.05 and a 5,000 nt promoter region length had the most shared edges, with 18 edges being the same.

2. Comparison with the Regulatory Circuit gene regulatory networks

The Regulatory Circuit database provides 394 cell- and tissue-specific networks. These networks were constructed with enhancer and promoter activity data from the CAGE-seq experiments of the FANTOM5 project. This database used 662 TF genes to find the regulatory relationships between TFs and the target genes. For comparison with the Regulatory Circuit database, we first obtained four cell-specific networks from the database that corresponded to our data-sets, which comprised the following cell lines: colon carcinoma, bile duct carcinoma, lung adenocarcinoma, and hepatocellular carcinoma. We chose these networks because the cell types were the same as those in our data-sets. From the Regulatory Circuit networks, we extracted the intersection sub-network and compared it with our intersection sub-network. The intersection sub-network from the Regulatory Circuit database had 406,580 edges and 11,568 nodes. The network size was quite large because the network mapped all possible TF genes binding to all gene types. The result of the comparison using this intersection sub-network is shown in Table 5b. The intersection with a promoter region length of 3,000 nt and a Q-value threshold of 0.05 had the most shared edges: 16 edges were the same. The number of shared edges was quite small compared with the size of the Regulatory Circuit network, because our interaction network only included TF genes, and we also used the gene expression profile to obtain a positive regulatory relationship between a TF and its target gene.


Table 4. Comparison of intersection sub-networks using different promoter region lengths and Q-value thresholds

a) Number of edges in the intersection sub-network

Promoter region    Q-value threshold
length (nt)        0.001   0.005   0.01   0.05   0.1
1,000              4       27      59     72     76
2,000              0       42      55     129    140
3,000              4       39      64     149    174
4,000              8       28      60     143    210
5,000              5       20      45     178    208

b) Number of nodes in the intersection sub-network

Promoter region    Q-value threshold
length (nt)        0.001   0.005   0.01   0.05   0.1
1,000              4       27      51     63     63
2,000              0       39      53     94     94
3,000              5       40      59     97     111
4,000              9       28      59     86     123
5,000              6       22      43     103    111

c) Number of edges that form a loop

Promoter region    Q-value threshold
length (nt)        0.001   0.005   0.01   0.05   0.1
1,000              0       8       13     8      16
2,000              0       0       8      41     31
3,000              0       0       0      100    73
4,000              0       0       0      84     138
5,000              0       0       0      107    143

3. Comparison with RegNetwork gene regulatory networks

RegNetwork (http://www.regnetworkweb.org) is a comprehensive database of gene and miRNA interactions. It combines multiple published databases and supporting databases. The network contains all types of genes and is not limited to TF genes. Because of that, the network size is quite large, consisting of 372,774 edges and 23,336 nodes. This network is not cell- or tissue-specific. The result of the comparison is shown in Table 5c. The intersection with a promoter region length of 3,000 nt and a Q-value threshold of 0.05 led to the largest number of shared edges: 12 edges were the same. Based on these comparison results, we decided to use the intersection sub-network with a Q-value threshold of 0.05 and a promoter region length of 3,000 nt for further analysis, because it has the most shared edges that also exist in other databases. Edges that did not appear in other networks may represent novel relationships between TFs and target genes that are not covered in other databases. In total, the intersection sub-network contains 38 TF genes. This means that our intersection sub-network found 38 of the 57 TF genes that were differentially expressed in the four cancer types (shown in Figure 4), covering around 67% of the DE TFs shared by the four cancer types. Our intersection sub-network may help to explain the mechanism by which those TF genes become differentially expressed. The intersection sub-network is shown in Figure 5, and the sub-networks with gene expression data for each cancer data-set are available in supplementary figures (Figures S1–S5). The sub-network shows positive regulatory relationships, because all up-regulated genes are only connected to other up-regulated genes and vice versa.

3.4 Intersection sub-network result analysis

To further understand and analyze our intersection sub-network, we used the STRING DB and enrichment analysis. We decided to use STRING DB because this database uses text mining techniques to find gene or protein relationships from the published literature. This means that some of the gene or protein relationships that appear in STRING DB have been experimentally validated, depending on the score. In this research, we do not have the capabilities of designing a wet lab experiment, so we approach our experimental validation by using STRING DB. We also use the enrichment analyses provided by the WikiPathway 2016 and KEGG 2015 databases to further understand the biological pathways of genes that are in the intersection sub-network of the four cancer types.

1. STRING DB analysis

The chosen intersection sub-network (promoter region length 3,000 nt and Q-value threshold 0.05) contains 38 gene nodes (the list of gene names is available in supplementary data Table S1). We submitted these genes to STRING DB; the resulting network is shown in Figure 6. Some of the edges in the STRING DB network were consistent with our results. Although the STRING DB network is undirected, some of the directly connected genes in STRING DB were also connected in our intersection sub-network. In total, 13 gene-gene relationships in the STRING DB results matched those in our intersection sub-network. Table 6 shows these gene-gene relationships and their respective STRING DB scores. These results further support the gene-protein-gene relationships in our intersection sub-network: the matched relationships are backed by the published literature and, for some pairs, by experimental evidence. We therefore consider our method sufficiently robust and consistent with existing knowledge to use the intersection sub-network for analyzing the genetic characteristics shared across the four cancer types in our data-set.
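
The comparison between a directed network and the undirected STRING DB network can be sketched as follows. This is only an illustration of the matching idea, not the script used in this study: the function name is hypothetical, the two real gene pairs are taken from Table 6, and the pair involving GENE_A and GENE_B is a made-up placeholder. A directed edge counts as shared when STRING DB connects the same two genes in either order.

```python
def shared_relationships(directed_edges, undirected_pairs):
    """Return the directed edges whose two genes are also linked in the
    undirected reference network, ignoring edge direction."""
    reference = {frozenset(pair) for pair in undirected_pairs}
    return [(src, dst) for src, dst in directed_edges
            if frozenset((src, dst)) in reference]


# Directed edges: two pairs from Table 6 plus one hypothetical placeholder.
our_edges = [("FOXM1", "MYBL2"), ("E2F1", "E2F3"), ("GENE_A", "GENE_B")]

# Undirected reference pairs, deliberately written in the opposite order.
string_pairs = [("MYBL2", "FOXM1"), ("E2F3", "E2F1"), ("E2F7", "E2F8")]

matches = shared_relationships(our_edges, string_pairs)
print(matches)
```

Using unordered pairs (`frozenset`) for the reference network is what lets a directed TF-to-target edge match an undirected association regardless of which gene STRING DB lists first.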

Table 5. Comparison of shared edges between our intersection sub-network and three other databases

a) Shared edges between the HepG2 network and our intersection sub-network

Promoter region    Q-value threshold
length (nt)        0.001   0.005   0.01    0.05    0.1
1,000              1       3       9       9       7
2,000              0       6       8       12      12
3,000              1       3       5       16      12
4,000              1       2       3       15      11
5,000              1       1       1       18      11

b) Shared edges between the Regulatory Circuit network and our intersection sub-network

Promoter region    Q-value threshold
length (nt)        0.001   0.005   0.01    0.05    0.1
1,000              1       2       4       10      5
2,000              0       5       6       12      9
3,000              1       3       5       16      15
4,000              1       2       2       13      14
5,000              1       1       1       14      14

c) Shared edges between the RegNetwork network and our intersection sub-network

Promoter region    Q-value threshold
length (nt)        0.001   0.005   0.01    0.05    0.1
1,000              1       2       3       7       4
2,000              0       2       4       9       10
3,000              0       1       3       12      11
4,000              0       1       2       11      12
5,000              0       1       3       10      13
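
One way to read Table 5 when choosing a parameter setting is to add up the shared edges over the three reference networks for each combination of promoter region length and Q-value threshold. The sketch below does this with the counts copied from Tables 5a to 5c. Summing the three tables with equal weight is our assumption for illustration; the text does not state the exact selection rule that was applied.

```python
Q_THRESHOLDS = [0.001, 0.005, 0.01, 0.05, 0.1]
PROMOTER_LENGTHS = [1000, 2000, 3000, 4000, 5000]

# Rows follow PROMOTER_LENGTHS, columns follow Q_THRESHOLDS (Table 5a-5c).
SHARED_EDGES = {
    "HepG2": [
        [1, 3, 9, 9, 7],
        [0, 6, 8, 12, 12],
        [1, 3, 5, 16, 12],
        [1, 2, 3, 15, 11],
        [1, 1, 1, 18, 11],
    ],
    "Regulatory Circuit": [
        [1, 2, 4, 10, 5],
        [0, 5, 6, 12, 9],
        [1, 3, 5, 16, 15],
        [1, 2, 2, 13, 14],
        [1, 1, 1, 14, 14],
    ],
    "RegNetwork": [
        [1, 2, 3, 7, 4],
        [0, 2, 4, 9, 10],
        [0, 1, 3, 12, 11],
        [0, 1, 2, 11, 12],
        [0, 1, 3, 10, 13],
    ],
}


def total_shared_edges(tables):
    """Sum the shared-edge counts of all reference networks per setting."""
    totals = {}
    for i, length in enumerate(PROMOTER_LENGTHS):
        for j, q_value in enumerate(Q_THRESHOLDS):
            totals[(length, q_value)] = sum(t[i][j] for t in tables.values())
    return totals


totals = total_shared_edges(SHARED_EDGES)
best_setting = max(totals, key=totals.get)
print(best_setting, totals[best_setting])
```

Under this equal-weight reading, the setting with promoter region length 3,000 nt and Q-value threshold 0.05 has the highest total, even though it is not the single best cell in Table 5a or Table 5c.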

Figure 5. Intersection sub-network with promoter region length 3,000 nt and Q-value threshold 0.05. We obtained the intersection network from the individual networks generated for each cancer type. The intersection network splits into two separate networks, because we only extract positive regulatory relationships. Bold edges show relationships that form a feedback loop.

Figure 6. STRING DB results for the genes in the intersection sub-network. We queried STRING DB using the genes found in the intersection sub-network. In STRING DB, gene-gene relationships are undirected, and the same pair of genes can be connected by several relationship types, according to the STRING DB categories. The relationship types are shown by edge color. Several relationships are the same as in our intersection sub-network (Table 6). Green shows predicted gene neighborhood, red shows predicted gene fusions, dark blue shows predicted gene co-occurrence, yellow shows relationships from text mining, black shows co-expression, cyan shows known interactions from curated databases, and magenta shows experimentally determined interactions.

Table 6. STRING DB results that show the same gene-gene relationship in our intersection sub-network. STRING DB produces a weighted network whose edges have multiple scores. According to the explanation in STRING DB, each score is calculated based on evidence. Experiment Score is derived from primary sources based on experiments. Co-Expression Score is derived from the list of proteins whose genes are observed to be correlated in expression across a large number of experiments. DB Score is obtained from curated databases. Txt Mine Score is calculated from the frequencies with which genes/proteins are mentioned together in articles.

Source      Target    Experiment   CoExp    DB       Txt Mine   Overall
                      Score        Score    Score    Score      Score
RORA        TBX21     0            0        0        0.515      0.515
KLF15       NR3C1     0            0.111    0        0.473      0.511
ETV4        SOX9      0.091        0.068    0        0.349      0.4
ZBTB16      TBX21     0            0.063    0        0.585      0.595
ZBTB16      NR3C1     0            0.355    0        0.307      0.534
FOXM1       MYBL2     0.632        0.204    0.9      0.765      0.992
FOXM1       E2F7      0.418        0        0        0.385      0.627
E2F1        E2F3      0.077        0        0.9      0.083      0.911
E2F1        E2F7      0.229        0.32     0.9      0.245      0.956
E2F1        FOXM1     0.347        0        0        0.427      0.61
GTF2IRD1    DEAF1     0            0        0        0.495      0.495
E2F8        MYBL2     0.357        0.125    0        0.468      0.675
E2F8        E2F7      0.231        0.379    0.9      0.199      0.958
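
For readers who want to relate the Overall Score in Table 6 to the per-channel scores, the sketch below follows the combination scheme described in STRING's own documentation, as far as we understand it: each channel score is corrected for a prior probability, the corrected scores are combined as independent evidence, and the prior is added back once. The prior of 0.041 and the function name are assumptions taken from outside this paper. The sketch reproduces rows such as FOXM1-E2F7 and RORA-TBX21 after rounding, but other rows may differ slightly, because STRING may use channels or corrections that are not listed in Table 6.

```python
PRIOR = 0.041  # assumed prior from STRING's documentation, not from this paper


def combine_string_scores(channel_scores, prior=PRIOR):
    """Combine per-channel scores into one overall score (sketch only)."""
    remaining = 1.0
    for score in channel_scores:
        score = max(score, prior)  # scores below the prior add no evidence
        corrected = (score - prior) / (1.0 - prior)
        remaining *= 1.0 - corrected
    combined = 1.0 - remaining
    # Add the prior back once for the combined evidence.
    return combined + prior * (1.0 - combined)


# Channel order: Experiment, CoExp, DB, Txt Mine (values from Table 6).
foxm1_e2f7 = combine_string_scores([0.418, 0.0, 0.0, 0.385])
rora_tbx21 = combine_string_scores([0.0, 0.0, 0.0, 0.515])
print(round(foxm1_e2f7, 3), round(rora_tbx21, 3))
```

The practical point is that the Overall Score is not a simple sum or maximum: two moderate channel scores, such as 0.418 and 0.385, combine into a noticeably higher overall score.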

2. Enrichment analysis

We performed an enrichment analysis with the Enrichr [24] web application against the WikiPathway 2016 and KEGG 2015 databases. We chose these two databases to understand the biological processes and pathways that involve genes in the intersection sub-network. Table 7 and Table 8 show the top 10 results of each enrichment analysis (the complete tables are given in supplementary data Table S2 and Table S3). We believe that, among these results, the preimplantation embryo pathway and the cell cycle-related pathways are important for cancer development. We investigate the TF genes involved in these pathways further in the next section.
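
The statistical idea behind this kind of over-representation analysis can be sketched with a one-sided hypergeometric (Fisher-type) test followed by a Benjamini-Hochberg adjustment [22]. The code below is a simplified illustration and not a reimplementation of Enrichr, whose statistics may differ in detail. The list size of 38 genes and the overlap of 5 genes come from this study; the pathway size of 60 genes and the background of 20,000 genes are hypothetical values chosen only for the example.

```python
from math import comb


def hypergeom_tail(overlap, list_size, pathway_size, background_size):
    """P(X >= overlap) when list_size genes are drawn at random from a
    background that contains pathway_size pathway genes."""
    upper = min(list_size, pathway_size)
    favorable = sum(
        comb(pathway_size, i) * comb(background_size - pathway_size, list_size - i)
        for i in range(overlap, upper + 1)
    )
    return favorable / comb(background_size, list_size)


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# 38 query genes, 5 of them in a pathway (hypothetical pathway/background sizes).
p_five = hypergeom_tail(5, 38, 60, 20000)
p_one = hypergeom_tail(1, 38, 60, 20000)
adjusted = benjamini_hochberg([0.01, 0.04, 0.03])
print(p_five, p_one, adjusted)
```

With a small query list, even a handful of overlapping genes can give a very small p-value, which is why terms supported by only two or three genes still appear among the top results in Tables 7 and 8.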

3.5 Pathway related to cancer development

From the results of the intersection gene network analysis, five genes (TEAD4, IRX5, HMGA1, FOSB, and E2F5) in our intersection sub-network were involved in the preimplantation embryo pathway, which is derived from the experiments in [25]. In the data-sets that we used, IRX5, HMGA1, and E2F5 were up-regulated and FOSB was down-regulated in all cancer data-sets. TEAD4 was up-regulated in all cancer data-sets analyzed except LUAD. In our intersection sub-network, IRX5, HMGA1, and E2F5 were connected via the TCF3 gene, and TEAD4 was connected to E2F5 via the ONECUT2 gene. FOSB was not connected to the others. While preimplantation embryos may not be directly linked to cancer development, there are some similarities between embryonic development and cancer development [26]. We therefore believe that the expression profile during embryonic development may hint at the functions of these genes and at how they might promote cancer growth. We obtained the raw expression data from the experiments in [25] to analyze the expression of these genes during embryonic development. The raw data, shown in Table 9, give the average expression (in RPKM) at every cell stage during embryonic development.

Table 7. Enrichment analysis results using the WikiPathway 2016 database

Term                              Pathway ID   Adj. P-value   Genes
Preimplantation Embryo            WP3527       1e-05          TEAD4, IRX5, HMGA1, FOSB, E2F5
Adipogenesis                      WP236        7e-05          E2F1, HMGA1, RORA, NR3C1, KLF15
SIDS Susceptibility Pathways      WP706        0.00014        DEAF1, RORA, TCF3, NR3C1, FOXM1
G1 to S cell cycle control        WP45         0.00241        E2F1, E2F3, E2F5
Cell Cycle                        WP179        0.00633        E2F1, E2F3, E2F5
Spinal Cord Injury                WP2431       0.00785        E2F1, SOX9, E2F5
Gastric Cancer Network 1          WP2361       0.00785        MYBL2, E2F7
Nuclear Receptors                 WP170        0.01056        RORA, NR3C1
EGF/EGFR Signaling Pathway        WP437        0.01507        E2F1, FOSB, MYBL2
Circadian rhythm-related genes    WP3594       0.02245        NONO, KLF9, RORA

Table 8. Enrichment analysis results derived by using the KEGG 2015 database

Term                          Adj. P-value   Genes
Bladder cancer                0.0174         E2F1, E2F3
Non small cell lung cancer    0.0174         E2F1, E2F3
Glioma                        0.0174         E2F1, E2F3
Melanoma                      0.0174         E2F1, E2F3
Pancreatic cancer             0.0174         E2F1, E2F3
Chronic myeloid leukemia      0.0174         E2F1, E2F3
Prostate cancer               0.0174         E2F1, E2F3
Small cell lung cancer        0.0174         E2F1, E2F3
Cell cycle                    0.0222         E2F1, E2F3
Basal transcription factors   0.0731         GTF2IRD1

Table 9. Average expression (in RPKM) in each embryonic developmental stage for the TEAD4, IRX5, HMGA1, FOSB, and E2F5 genes

2-cell 4-cell 8-cell Troph-ect Primitive hESC hESC
Gene Oocyte Zygote Morulae epiblast embryo embryo embryo oderm endoderm passage#0 passage#10

IRX5     0        0        0        1.164    31.959   10.898   0         0         0         0.394     0.204
TEAD4    2.956    8.88     6.408    19.54    19.736   77.088   45.607    105.365   75.344    25.741    31.28
HMGA1    26.041   33.279   38.064   42.428   55.181   46.951   101.954   252.2     133.062   208.045   351.773
FOSB     6.126    6.727    5.706    11.362   3.823    0.603    0.572     2.257     0         50.578    10.882
E2F5     3.376    1.458    0.948    2.211    0.536    0.367    2.752     7.719     2.932     42.606    17.993

According to the pathway, IRX5 is involved in the 8-16-cell division stage and TEAD4 in the 16-32-cell division stage, while HMGA1, FOSB, and E2F5 are active in early- and late-passage human embryonic stem cells (hESC). The raw data from the experiments in [25] show that IRX5 expression is high at the 8-cell embryo stage and decreases afterwards, while TEAD4 expression increases from the oocyte stage until the primitive endoderm stage and decreases after that. From this, we infer that high expression levels of IRX5 and TEAD4 indicate increased cell division activity, and thus that the functions of these two genes are related to cell division. The raw data also show that FOSB and E2F5 expression increases until the early hESC passage, while HMGA1 expression increases until the late hESC passage. Although the functions of FOSB, E2F5, and HMGA1 at this stage are not clear to us, we think that their high expression in the hESC stage indicates that they are important for cell division and cell proliferation.

We hypothesize that cell division-related genes such as IRX5, TEAD4, HMGA1, and E2F5 are up-regulated, and therefore more active, in cancer cells than in normal cells. The promotion of cell division activity might lead to uncontrollable cell division in cancer cells. Further investigation is needed to understand how the down-regulation of FOSB plays a role in cancer cell development. Our hypothesis is supported by the results of others showing that TEAD4 [27–30], HMGA1 [31–34], and IRX5 [35–37] are related to cell proliferation, cell cycle mechanisms, and cancer development. Other genes in our intersection sub-network are also involved in cell proliferation mechanisms, such as FOXM1 [38], MYBL2 [39], and the SOX transcription factor family [40,41]. Cell proliferation mechanisms have also been linked to cancer development [42,43], which is consistent with our results.

E2F5 is also involved in the cell cycle-related pathways together with E2F1 and E2F3. The E2F gene family is known to be involved in the cell cycle pathway [44–46]. In our data-sets, E2F5, E2F1, and E2F3 are all up-regulated compared with normal cells. This suggests that the cell cycle pathway is disrupted in cancer cells, and this disruption may affect the cell division and cell proliferation mechanisms, leading to uncontrolled cell growth.

From this analysis, we conclude that most of the genes that are dysregulated in the multiple cancer-type data-sets that we used affect the cell cycle, cell proliferation, and cell division mechanisms. The inter-connectivity shown in our intersection sub-network from the multiple cancer-type data indicates that these genes may be interconnected with each other during the transcription process. Some of the connections even form positive feedback loops, which may, in turn, create a steady-state condition in cancer cells. Knowing how these genes are interconnected could facilitate the discovery of novel therapeutic targets for cancer treatment. Further experiments are needed to confirm the relationships between the genes in our intersection sub-network, but we believe that our results can provide hints for finding new cancer drug targets.
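
Feedback loops such as those mentioned above can be located in a directed network with a simple reachability test: an edge from A to B lies on a loop when A can be reached again by following edges starting at B. The sketch below is a minimal illustration with hypothetical node names and a hypothetical function name. It treats the network as a plain directed graph and does not model the gene-protein-gene structure of our sub-network.

```python
from collections import defaultdict


def feedback_edges(edges):
    """Return the edges that lie on at least one directed cycle."""
    graph = defaultdict(set)
    for src, dst in edges:
        graph[src].add(dst)

    def reachable(start, goal):
        # Iterative depth-first search from start, looking for goal.
        stack, seen = [start], set()
        while stack:
            node = stack.pop()
            if node == goal:
                return True
            if node in seen:
                continue
            seen.add(node)
            stack.extend(graph.get(node, ()))
        return False

    return [(src, dst) for src, dst in edges if reachable(dst, src)]


# Hypothetical toy network: TF_A -> TF_B -> TF_C -> TF_A forms a loop,
# while TF_C -> TF_D is a plain downstream edge.
toy_edges = [("TF_A", "TF_B"), ("TF_B", "TF_C"), ("TF_C", "TF_A"), ("TF_C", "TF_D")]
loop_edges = feedback_edges(toy_edges)
print(loop_edges)
```

Because every edge in the intersection sub-network is a positive regulatory relationship, any loop found this way is a positive feedback loop in the sense used in the text.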

3.6 Confirmation of results robustness

In our workflow, there is a concern that the results depend on the particular combination of cancer types used. The results might be a coincidence, or the method might not be applicable to other data-sets. To address this, we tried to further confirm the robustness of our method. Ideally, this would involve more, and more varied, cancer types. Unfortunately, the available data and processing time are limited, because we need the raw data from both the RNA-seq experiment and the cancer-normal comparison experiment. To work around this limitation, we used a modified leave-one-out cross-validation approach.

The basic idea is to apply our method to every combination of 3 of the 4 cancer types that we have. If the results from all combinations of 3 cancer types are consistent with the overall results, we can say that our method is robust. To confirm this, we ran the exact same workflow for the following combinations:

- BDC, HCC, and CRC
- BDC, CRC, and LUAD
- BDC, HCC, and LUAD
- HCC, CRC, and LUAD

This validation yields 4 intersection networks, each built from 3 cancer types. Figure 4 shows that each combination has a different set of shared DE TFs. We want to understand whether this difference in shared DE TFs leads to different pathway results. If the results remain consistent, changes in the DE TF set do not affect the results, and our method still identifies pathways that are important in cancer development. The summary of the intersection networks from the combinations of 3 cancer types is shown in Table 10, and the complete networks are shown in the supplementary figures (Figures S6-S9).

We then performed the same pathway enrichment analysis on these networks; the results are shown in supplementary data Tables S4-S10. In these analyses, the WikiPathway 2016 results contain more terms, whereas the KEGG 2015 results contain a similar number of terms, although no term was found to be significant for the BDC, CRC, and LUAD intersection network. The results of the enrichment analysis using the WikiPathway 2016 database are consistent with those of the overall intersection network. Important terms, such as preimplantation embryo, cell cycle, G1 to S cell cycle control, adipogenesis, and circadian rhythm, remain at the top of the results for each combination. In the enrichment analysis using the KEGG 2015 database, similar terms were also selected and significant. Thus, despite the different sets of DE TFs obtained from each combination, we still obtain consistent results. We therefore believe that our method is robust against changes in the data-set.

Table 10. Size of the intersection network obtained from the combination of the three cancer types.

Cancer types combination   Edges   Gene nodes   Protein nodes   Total nodes
BDC-HCC-CRC                451     76           163             239
BDC-CRC-LUAD               296     59           107             166
BDC-HCC-LUAD               684     102          227             329
HCC-CRC-LUAD               323     64           117             181
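
The leave-one-out procedure described above can be sketched as follows: each cancer type is left out in turn, and the edge sets of the remaining three networks are intersected. This is an illustrative sketch only. The function name and the toy edges (TF_A, GENE_X, and so on) are hypothetical and stand in for the per-cancer networks produced by our workflow.

```python
from itertools import combinations


def leave_one_out_intersections(networks):
    """Intersect the edge sets of every combination that omits one network."""
    results = {}
    for subset in combinations(sorted(networks), len(networks) - 1):
        results[subset] = set.intersection(*(networks[name] for name in subset))
    return results


# Hypothetical toy edge sets, one per cancer type.
networks = {
    "BDC": {("TF_A", "GENE_X"), ("TF_B", "GENE_Y"), ("TF_C", "GENE_Z")},
    "HCC": {("TF_A", "GENE_X"), ("TF_B", "GENE_Y")},
    "CRC": {("TF_A", "GENE_X"), ("TF_C", "GENE_Z")},
    "LUAD": {("TF_A", "GENE_X"), ("TF_B", "GENE_Y"), ("TF_C", "GENE_Z")},
}

full_intersection = set.intersection(*networks.values())
results = leave_one_out_intersections(networks)
for subset, shared in results.items():
    print("-".join(subset), len(shared))
```

Every three-way intersection necessarily contains the four-way intersection, which is consistent with the networks in Table 10 being larger than the overall intersection sub-network; the robustness question is whether the additional edges change the enriched pathways.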

3.7 Method limitations and future work

We have seen that our method can extract important gene-protein-gene relationships that might play important roles in cancer development. Nevertheless, we acknowledge that our method is still limited, because it uses only gene expression as a measurement. There are other potential drivers of cancer development, such as gene mutations and epigenetics. We believe that, despite this limitation, our method can still give useful information, because gene mutations and epigenetic factors can alter gene expression levels if they affect transcription factor genes.

A well-known example of gene mutations believed to drive cancer growth is TP53 mutations, which have been reported in various cancer types in many articles [47]. Our method is limited in its ability to show the relevance of TP53 mutations. In our data-set, TP53 is up-regulated only in LUAD and BDC, so it does not appear in the intersection network. One possible explanation for why TP53 is not significantly up-regulated in CRC and HCC is that TP53 mutations occur in individual patients rather than in all patients. We still believe that a TP53 mutation would have significant effects on its target genes. For example, if a TP53 mutation alters its ability to bind to the genome, TP53 may no longer act normally as a transcription factor. The expression levels of the TP53 target genes would then be disturbed, and our method would be able to include those target genes in the network. In the future, we can address this issue by developing a method to construct gene regulatory networks at the level of individuals. This will need a specially designed experiment to extract such information. With the increasing use of single-cell RNA sequencing, we believe that this will become possible at some point in the future.

4. Conclusion

We have shown that TF regulatory networks can be developed by using RNA-seq results to predict which promoter regions are active during the transcription process. We also included gene expression changes between cancer and normal cells to create a TF network that shows important TF genes that are dysregulated in cancer cells. By combining expression data from multiple cancer cell types, we extracted an intersection of TF sub-networks that shows TF genes playing important roles in cancer development in multiple cancer cells. From the intersection sub-network, we have shown that TF genes related to the cell cycle, cell division, and cell proliferation mechanisms are dysregulated in cancer. The dysregulation of these mechanisms may lead to uncontrollable cell growth in cancer cells. We hypothesize that these genes may become new drug targets for cancer treatment. We hope that further experiments can be conducted with regard to these genes, so that we can continue our research toward correcting the dysregulation of the cell cycle, cell division, and cell proliferation mechanisms in cancer cells.

Financial Support

This research is partially supported by KAKENHI Grant Number JP17H00769.

Acknowledgement

The authors acknowledge useful discussions with Konagaya laboratory members in TITECH.

References

[1] Wang, Z.; Gerstein, M.; Snyder, M. RNA-Seq: a revolutionary tool for transcriptomics. Nature Rev Genet. 2009, 10, 57–63. DOI: 10.1038/nrg2484
[2] Huber, W.; Carey, V. J.; Gentleman, R.; Anders, S.; Carlson, M.; Carvalho, B. S.; et al. Orchestrating high-throughput genomic analysis with Bioconductor. Nat Methods. 2015, 12(2), 115–121. DOI: 10.1038/nmeth.3252
[3] Love, M. I.; Huber, W.; Anders, S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 2014, 15(12), 550. DOI: 10.1186/s13059-014-0550-8
[4] Robinson, M. D.; McCarthy, D. J.; Smyth, G. K. edgeR: A Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics. 2009, 26(1), 139–140. DOI: 10.1093/bioinformatics/btp616
[5] Law, C. W.; Chen, Y.; Shi, W.; Smyth, G. K. voom: Precision weights unlock linear model analysis tools for RNA-seq read counts. Genome Biol. 2014, 15(2), R29. DOI: 10.1186/gb-2014-15-2-r29
[6] Ogata, H.; Goto, S.; Sato, K.; Fujibuchi, W.; Bono, H.; Kanehisa, M. KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Res. 1999, 27(1), 29–34. DOI: 10.1093/nar/27.1.29
[7] Kanehisa, M.; Sato, Y.; Kawashima, M.; Furumichi, M.; Tanabe, M. KEGG as a reference resource for gene and protein annotation. Nucleic Acids Res. 2016, 44(D1), D457–D462. DOI: 10.1093/nar/gkv1070
[8] Kanehisa, M.; Furumichi, M.; Tanabe, M.; Sato, Y.; Morishima, K. KEGG: New perspectives on genomes, pathways, diseases and drugs. Nucleic Acids Res. 2017, 45(D1), D353–D361. DOI: 10.1093/nar/gkw1092
[9] Kelder, T.; Van Iersel, M. P.; Hanspers, K.; Kutmon, M.; Conklin, B. R.; et al. WikiPathways: Building research communities on biological pathways. Nucleic Acids Res. 2012, 40, D1301–D1307. DOI: 10.1093/nar/gkr1074
[10] Kutmon, M.; Riutta, A.; Nunes, N.; Hanspers, K.; Willighagen, E. L.; et al. WikiPathways: Capturing the full diversity of pathway knowledge. Nucleic Acids Res. 2016, 44(D1), D488–94. DOI: 10.1093/nar/gkv1024
[11] Slenter, D. N.; Kutmon, M.; Hanspers, K.; Riutta, A.; Windsor, J.; et al. WikiPathways: A multifaceted pathway database bridging metabolomics to other omics research. Nucleic Acids Res. 2018, 46(D1), D661–7. DOI: 10.1093/nar/gkx1064
[12] Langfelder, P.; Horvath, S. WGCNA: an R package for weighted correlation network analysis. BMC Bioinformatics. 2008, 9, 559. DOI: 10.1186/1471-2105-9-559
[13] Krämer, A.; Green, J.; Pollard, J.; Tugendreich, S. Causal analysis approaches in ingenuity pathway analysis. Bioinformatics. 2014, 30(4), 523–530. DOI: 10.1093/bioinformatics/btt703
[14] Szklarczyk, D.; Morris, J. H.; Cook, H.; Kuhn, M.; Wyder, S.; et al. The STRING database in 2017: Quality-controlled protein-protein association networks, made broadly accessible. Nucleic Acids Res. 2017, 45, D362–D368. DOI: 10.1093/nar/gkw937
[15] Liu, Z. P.; Wu, C.; Miao, H.; Wu, H. RegNetwork: An integrated database of transcriptional and post-transcriptional regulatory networks in human and mouse. Database. 2015, 1–12. DOI: 10.1093/database/bav095
[16] Marbach, D.; Lamparter, D.; Quon, G.; Kellis, M.; Kutalik, Z.; et al. Tissue-specific regulatory circuits reveal variable modular perturbations across complex diseases. Nat Methods. 2016, 13(4), 366–370. DOI: 10.1038/nmeth.3799
[17] Neph, S.; Stergachis, A. B.; Reynolds, A.; Sandstrom, R.; Borenstein, E.; Stamatoyannopoulos, J. A. Circuitry and dynamics of human transcription factor regulatory networks. Cell. 2012, 150(6), 1274–1286. DOI: 10.1016/j.cell.2012.04.040
[18] Matys, V. TRANSFAC(R) and its module TRANSCompel(R): transcriptional gene regulation in eukaryotes. Nucleic Acids Res. 2006, 34 (Database issue), D108–D110. DOI: 10.1093/nar/gkj143
[19] Patro, R.; Duggal, G.; Love, M. I.; Irizarry, R. A.; Kingsford, C. Salmon provides fast and bias-aware quantification of transcript expression. Nat Methods. 2017, 14(4), 417–419. DOI: 10.1038/nmeth.4197
[20] Cunningham, F.; Achuthan, P.; Akanni, W.; Allen, J.; Amode, M. R.; et al. Ensembl 2019. Nucleic Acids Res. 2019, 47(D1), D745–D751. DOI: 10.1093/nar/gky1113
[21] Grant, C. E.; Bailey, T. L.; Noble, W. S. FIMO: Scanning for occurrences of a given motif. Bioinformatics. 2011, 27(7), 1017–1018. DOI: 10.1093/bioinformatics/btr064
[22] Benjamini, Y.; Hochberg, Y. Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. J. R. Statist. Soc. B (Methodological). 1995, 57(1), 289–300.
[23] Bailey, T. L.; Noble, W. S. Searching for statistically significant regulatory modules. Bioinformatics. 2003, 19 Suppl. 2, ii16–ii25. DOI: 10.1093/bioinformatics/btg1054
[24] Chen, E. Y.; Tan, C. M.; Kou, Y.; Duan, Q.; Wang, Z.; et al. Enrichr: Interactive and collaborative HTML5 gene list enrichment analysis tool. BMC Bioinformatics. 2013, 14, 128. DOI: 10.1186/1471-2105-14-128
[25] Yan, L.; Yang, M.; Guo, H.; Yang, L.; Wu, J.; et al. Single-cell RNA-Seq profiling of human preimplantation embryos and embryonic stem cells. Nat Struct and Mol Biol. 2013, 20(9), 1131–1139. DOI: 10.1038/nsmb.2660
[26] Ma, Y.; Zhang, P.; Wang, F.; Yang, J.; Yang, Z.; et al. The relationship between early embryo development and tumourigenesis. J. Cell. Mol. Med. 2010, 14(12), 2697–2701. DOI: 10.1111/j.1582-4934.2010.01191.x
[27] Li, Z.; Zhang, Y.; Li, S.; Zou, T.; Li, S. Role of TEAD4 in colorectal cancer cell proliferation and analysis of its mechanism. Precision Radiation Oncology. 2018, 2(3), 85–88. DOI: 10.1002/pro6.50
[28] Takeuchi, S.; Kasamatsu, A.; Yamatoji, M.; Nakashima, D.; Endo-Sakamoto, Y.; et al. TEAD4-YAP interaction regulates tumoral growth by controlling cell-cycle arrest at the G1 phase. Biochem Biophys Res Commun. 2017, 486(2), 385–390. DOI: 10.1016/j.bbrc.2017.03.050
[29] Wang, C.; Nie, Z.; Zhou, Z.; Zhang, H.; Liu, R.; et al. The interplay between TEAD4 and KLF5 promotes breast cancer partially through inhibiting the transcription of p27Kip1. Oncotarget. 2015, 6(19), 17685–17697. DOI: 10.18632/oncotarget.3779
[30] Tang, J. Y.; Yu, C. Y.; Bao, Y. J.; Chen, L.; Chen, J.; et al. TEAD4 promotes colorectal tumorigenesis via transcriptionally targeting YAP1. Cell Cycle. 2018, 17(1), 102–109. DOI: 10.1080/15384101.2017.1403687
[31] Fu, F.; Wang, T.; Wu, Z.; Feng, Y.; Wang, W.; et al. HMGA1 exacerbates tumor growth through regulating the cell cycle and accelerates migration/invasion via targeting miR-221/222 in cervical cancer article. Cell Death and Disease. 2018, 9: 594. DOI: 10.1038/s41419-018-0683-x
[32] Akaboshi, S. I.; Watanabe, S.; Hino, Y.; Sekita, Y.; Xi, Y.; et al. HMGA1 is induced by Wnt/β-catenin pathway and maintains cell proliferation in gastric cancer. Am. J. Pathol. 2009, 175(4), 1675–1685. DOI: 10.2353/ajpath.2009.090069
[33] Schuldenfrei, A.; Belton, A.; Kowalski, J.; Talbot, C. C.; Di Cello, F.; et al. HMGA1 drives stem cell, inflammatory pathway, and cell cycle progression genes during lymphoid tumorigenesis. BMC Genomics. 2011, 12: 549. DOI: 10.1186/1471-2164-12-549
[34] Conte, A.; Paladino, S.; Bianco, G.; Fasano, D.; Gerlini, R.; Tornincasa, M.; et al. High mobility group A1 protein modulates autophagy in cancer cells. Cell Death Differ. 2017, 24(11), 948–962. DOI: 10.1038/cdd.2017.117
[35] Liu, D.; Pattabiraman, V.; Bacanamwo, M.; Anderson, L. M. Iroquois homeobox transcription factor (Irx5) promotes G1/S-phase transition in vascular smooth muscle cells by CDK2-dependent activation. Am J Physiol-Cell Ph. 2016, 311(2), C179–C189. DOI: 10.1152/ajpcell.00293.2015
[36] Huang, L.; Song, F.; Sun, H.; Zhang, L.; Huang, C. IRX5 promotes NF-κB signalling to increase proliferation, migration and invasion via OPN in tongue squamous cell carcinoma. J Cell Mol Med. 2018, 22(8), 3899–3910. DOI: 10.1111/jcmm.13664
[37] Myrthue, A.; Rademacher, B. L. S.; Pittsenbarger, J.; Kutyba-Brooks, B.; Gantner, M.; et al. The iroquois homeobox gene 5 is regulated by 1,25-dihydroxyvitamin D3 in human prostate cancer and regulates apoptosis and the cell cycle in LNCaP prostate cancer cells. Clin Cancer Res. 2008, 14(11), 3562–3570. DOI: 10.1158/1078-0432.CCR-07-4649
[38] Chen, Y.; Liu, Y.; Ni, H.; Ding, C.; Zhang, X.; et al. FoxM1 overexpression promotes cell proliferation and migration and inhibits apoptosis in hypopharyngeal squamous cell carcinoma resulting in poor clinical prognosis. Int J Oncol. 2017, 51(4), 1045–1054. DOI: 10.3892/ijo.2017.4094
[39] Musa, J.; Aynaud, M. M.; Mirabeau, O.; Delattre, O.; Grünewald, T. G. MYBL2 (B-Myb): a central regulator of cell proliferation, cell survival and differentiation involved in tumorigenesis. Cell Death Dis. 2017, 8(6): e2895. DOI: 10.1038/cddis.2017.244
[40] Sarkar, A.; Hochedlinger, K. The Sox family of transcription factors: Versatile regulators of stem and progenitor cell fate. Cell Stem Cell. 2013, 12(1), 15–30. DOI: 10.1016/j.stem.2012.12.007
[41] Jo, A.; Denduluri, S.; Zhang, B.; Wang, Z.; Yin, L.; et al. The versatile functions of Sox9 in development, stem cells, and human diseases. Genes and Diseases. 2014, 1, 149–161. DOI: 10.1016/j.gendis.2014.09.004
[42] Evan, G. I.; Vousden, K. H. Proliferation, cell cycle and apoptosis in cancer. Nature. 2001, 411(6835), 342–348. DOI: 10.1038/35077213
[43] Feitelson, M. A.; Arzumanyan, A.; Kulathinal, R. J.; Blain, S. W.; Holcombe, R. F.; et al. Sustained proliferation in cancer: Mechanisms and novel therapeutic targets. Semin Cancer Biol. 2015, 35 Suppl: S25–S54. DOI: 10.1016/j.semcancer.2015.02.006
[44] Ren, B.; Cam, H.; Takahashi, Y.; Volkert, T.; Terragni, J.; et al. E2F integrates cell cycle progression with DNA repair, replication, and G2/M checkpoints. Genes Dev. 2002, 16(2), 245–56. DOI: 10.1101/gad.949802
[45] Wang, L.; Chen, H.; Wang, C.; Hu, Z.; Yan, S. Negative regulator of E2F transcription factors links cell cycle checkpoint and DNA damage repair. Proc Natl Acad Sci USA. 2018, 115(16), E3837–E3845. DOI: 10.1073/pnas.1720094115
[46] Lavia, P.; Jansen-Dürr, P. E2F target genes and cell-cycle checkpoint control. BioEssays. 1999, 21(3), 221–230. DOI: 10.1002/(SICI)1521-1878(199903)21:3<221::AID-BIES6>3.0.CO;2-J
[47] Cole, A. J.; Zhu, Y.; Dwight, T.; Yu, B.; Dickson, K. A.; et al. Comprehensive analyses of somatic TP53 mutation in tumors with variable mutant allele frequency. Sci Data. 2017, 4: 170120. DOI: 10.1038/sdata.2017.120