
BRIDGING INFLAMMATORY BOWEL DISEASES AND HEPATOBILIARY DISORDERS THROUGH PATHWAY ENRICHMENT AND MODULE-BASED APPROACH

Master Degree Project in bioinformatics One year 30 ECTS Spring term 2020

Alaa Saloum [email protected]

Supervisor: Zelmina Lubovac Email: [email protected]

Examiner: Björn Olsson Email: [email protected]

Abstract

Inflammatory bowel diseases (IBD), including Crohn’s disease (CD) and ulcerative colitis (UC), are associated with various hepatobiliary disorders. Two chronic hepatobiliary disorders that may coexist with IBD are primary biliary cholangitis (PBC) and primary sclerosing cholangitis (PSC). Previous studies have hypothesized that IBD, PBC, and PSC might share an underlying mechanism that contributes to the pathogenesis of the three conditions. In this study, module-based network analysis and pathway enrichment analysis were applied to the differentially expressed genes (DEGs) of IBD, PSC, and PBC. The sample data were obtained from the study by Ostrowski et al. (2019). A network module-based approach was applied to examine the generated results, from which additional information about biological processes, pathways, and molecular functions can be inferred. FunRich and Enrichr were used as functional enrichment tools. An interaction network was constructed for the three conditions using STRING. Functional modules and overlapping modules of IBD, PSC, and PBC were identified using different plug-ins in Cytoscape. Some of the results were consistent with the findings of Ostrowski et al. (2019), such as the ATP synthesis and signal transduction processes shared among the overlapping genes in IBD, PBC, and PSC. ModuLand highlighted nodes that have previously been reported to have a role in the pathogenesis of autoimmune diseases. The results demonstrated that the module-based approach yields biological processes and enriched pathways for the generated modules that are similar to those obtained by enrichment analysis of the DEGs. In addition, the use of the ModuLand plug-in to find hierarchical layers of disease genes is still poorly researched and would benefit from a more in-depth comparison with related tools for module discovery. Implementing the ModuLand plug-in can potentially support research in elucidating complex diseases.

Table of Contents

Abstract
Abbreviations
1. Introduction
  1.1. Background of inflammatory bowel disease and hepatobiliary disorders
  1.2. Hepatobiliary associations with inflammatory bowel disease
  1.3. Networks and module-based approach
2. Materials and Methods
  2.1. Analysis of differentially expressed genes (DEGs)
  2.2. Pathway enrichment analysis
  2.3. Protein-protein interaction network
  2.4. Functional modules
  2.5. Identification of shared modules
3. Alternative methods
  3.1. Analysis of differentially expressed genes (DEGs)
  3.2. Pathway enrichment analysis
  3.3. Protein-protein interaction network
  3.4. Functional modules
  3.5. Identification of shared modules
4. Implementation and Results
  4.1. Analysis of differentially expressed genes (DEGs)
  4.2. Pathway enrichment analysis
    4.2.1. PBC DEGs enrichment analysis
    4.2.2. PSC DEGs enrichment analysis
    4.2.3. IBD DEGs enrichment analysis
    4.2.4. Comparison between functional enrichment of PBC, PSC, and IBD DEGs
    4.2.5. Comparison between biological pathways in PBC, PSC and IBD using FunRich
    4.2.6. Comparison between biological processes in PBC, PSC and IBD using FunRich
    4.2.7. Enrichment analysis of overlapping genes between PBC, PSC and IBD using FunRich
  4.3. Protein-Protein Interaction Network
  4.4. Functional modules
    4.4.1. PSC, PBC, and IBD functional modules
    4.4.2. Enrichment analysis of the functional modules
  4.5. Identification of modules
5. Discussion
6. Ethical Aspects
7. Impact on Society and Future Directions
8. References
9. Appendix

Abbreviations

BioGRID   Biological General Repository for Interaction Datasets
CD        Crohn’s disease
DEG       Differentially expressed genes
DAVID     Database for Annotation, Visualization, and Integrated Discovery
FunRich   Functional Enrichment
GO        Gene Ontology
GEO       Gene Expression Omnibus
IBD       Inflammatory bowel disease
KEGG      Kyoto Encyclopedia of Genes and Genomes
MCODE     Molecular Complex Detection
MCL       Markov CLustering Algorithm
NCBI      National Center for Biotechnology Information
PPI       Protein-protein interaction
PBC       Primary biliary cholangitis
PSC       Primary sclerosing cholangitis
STRING    Search Tool for the Retrieval of Interacting Genes/Proteins
UC        Ulcerative colitis

1. Introduction

1.1. Background of inflammatory bowel disease and hepatobiliary disorders

Inflammatory bowel disease (IBD) is defined as a chronic intestinal inflammation that pursues a protracted relapsing and remitting course (Rubin et al., 2012). The disease encompasses two major types, ulcerative colitis (UC) and Crohn's disease (CD). Patients with either UC or CD present with severe diarrhea, fatigue, abdominal pain, and weight loss (Fakhoury et al., 2014). Despite the clinical and pathological features that UC and CD share, both diseases are heterogeneous, with marked differences in clinical presentation, underlying genetic factors, and response to treatment.

In terms of location, CD is characterized by inflammation that can occur anywhere in the gastrointestinal tract in a patchy distribution (Vasquez et al., 2007). UC, on the other hand, mainly presents with inflammation of the rectum and sigmoid colon (Vasquez et al., 2007). Both IBD subtypes are reported to be prevalent in highly developed nations. For example, CD has a prevalence of 30–50 per 100,000 people in western countries (Fakhoury et al., 2014).

IBD, including CD and UC, is associated with various hepatobiliary disorders, among them primary biliary cholangitis (PBC) and primary sclerosing cholangitis (PSC). PBC is a chronic autoimmune disease that results in damage to the biliary epithelial cells of small bile ducts. PSC, on the other hand, is characterized by chronic inflammation that affects both intrahepatic and extrahepatic bile ducts, which ultimately become strictured and fibrotic. These heterogeneous chronic autoimmune diseases, PBC, PSC, and IBD, are believed to share an underlying pathogenic mechanism (Marchioni et al., 2014).

1.2. Hepatobiliary associations with inflammatory bowel disease

Hepatobiliary disorders are common in patients with IBD: 20% to 30% of individuals with IBD exhibit persistently abnormal liver function. Most PSC cases appear concurrently with IBD, particularly UC. Previous research has hypothesized a strong link between the pathogenesis of PSC and that of IBD. Up to 80% of PSC cases are associated with IBD, while PSC is present in 3–8% of all patients with UC and 1–3% of patients with CD. PBC is less often associated with IBD, and the combination is reported to be rare. However, recent research has suggested that PBC prevalence is higher in patients with IBD than in the general population (Liberal et al., 2019) and that testing for PBC should be considered in individuals with IBD who manifest abnormal liver function. Individuals with both PSC and IBD are at greater risk of developing colorectal carcinoma and therefore have a lower chance of survival than IBD patients without PSC.

Ostrowski et al. (2019) analyzed blood transcriptomes from patients with PBC, PSC, and IBD and highlighted 1946 genes that were common to all three comparisons. Molecular alterations in PBC, PSC, and IBD were studied through functional analysis of white blood cell (WBC) gene expression profiles. The results showed that the shared molecular alterations were related to mitochondrial function, the vesicle endomembrane system, and GTPase-mediated processes. The study also showed that 86 differentially expressed genes were common to a set of 133 genes previously proposed by Peters et al. (2017) as key driver genes of IBD. A total of 37 of the 133 genes were shared among the diseases, and 15, 7, and 3 were unique to IBD, PBC, and PSC, respectively. This overlap indicates a functional link between the expression of IBD susceptibility genes and discrete systemic inflammation.


Genome-wide association studies (GWAS) have been used by researchers to reveal the pathogenesis of complex diseases. This is done by identifying associations between genetic regions (loci) and traits (including diseases). The genome is scanned for small variations, called single nucleotide polymorphisms (SNPs), that occur more frequently in affected individuals than in healthy ones. Paziewska et al. (2017) identified nineteen and twenty-one SNPs that were verified as associated with PBC and PSC, respectively. Another study (Qiu et al., 2017) highlighted that the IL21 signaling pathway and Tfh cells are involved in PBC pathogenesis, with the enhanced expression of IL21 and IL21R in PBC livers contributing to the disease mechanism. However, further functional studies are required to understand the development of PBC and PSC.

More than 200 risk loci have been identified for IBD using the genome-wide association approach (Momozawa et al., 2018). CD is one of the diseases that has benefitted the most from GWAS technology, through which key players in its pathogenesis were identified. Among these key players are the IL‐17/IL‐23 axis and innate lymphoid cells (Geremia et al., 2011). In the most recent analysis, 241 IBD susceptibility loci were reported, the majority of which are associated with an increased risk of both CD and UC (Verstockt et al., 2018). Many of the identified loci were also associated with other immune-mediated diseases such as PBC (Verstockt et al., 2018).

1.3. Networks and module-based approach

One of the crucial aims in biological research is to identify all molecules in a living cell, their potential roles, and the way these molecules interact with each other. Complex diseases such as autoimmune diseases involve altered interactions between thousands of genes (Sharma et al., 2015). Several computational tools have been developed to use data from high-throughput technologies, including microarray and RNA-sequencing data, to identify such genes and their products. However, the functions of many genes and their co-regulation are still not understood, which remains a formidable challenge.

Scientists have used network-based analyses of omics data to generate disease modules and their associated genes to elucidate a systems level and a molecular understanding of disease mechanisms (Gustafsson et al., 2014). For example, with the use of module-based approaches a module for allergy conditions was generated where it revealed a novel candidate gene that was later validated by functional and clinical studies (Gustafsson et al., 2014). In systems biology, networks are graph representations of an intracellular biological system. The network shows nodes that represent molecular components, such as genes and proteins, and their direct or indirect interactions as links (Ma'ayan, 2011).

Protein networks are useful resources for identifying putative pathways; by illustrating the physical interactions between proteins in the cell, they help scientists gain a deeper understanding of a disease. Previous studies have highlighted that the collective contribution of disease-associated genes is large compared to the effect of individual disease-associated genes. This is because many complex diseases develop through the interaction of multiple genes in a complex network (Gustafsson et al., 2014). Once disease-associated proteins and hubs have been identified, new disease candidates can also be suggested (Nazarieh and Helms, 2019).

Disease modules are defined as the local neighborhood of an interactome where localized perturbation takes place (Sharma et al., 2015). As disease module-based approaches are increasingly used in scientific research, they are becoming a prerequisite for a detailed explanation of a particular pathophenotype (Ghiassian et al., 2015). One successful study identified asthma disease modules using several computational tools, aiming to elucidate the molecular and physiological mechanisms through which disease-associated genes affect disease phenotypes (Sharma et al., 2015). Sharma et al. (2015) used the DIseAse MOdule Detection (DIAMOnD) method for disease module identification and validation. This algorithm is a topological method that identifies the network neighborhood of the seed genes, where seed genes are previously known disease genes (Sharma et al., 2015). The study also found that the asthma disease module was enriched with susceptibility variants, based on GWAS p-values evaluated against the background of random variation. Moreover, the study highlighted shared mechanisms between the asthma disease module and other immune-related disease modules. Sharma et al. (2015) also tested the performance of DIAMOnD against four network-based algorithms: random walk (RW), DaDa, PRINCE, and CIPHER. On asthma-specific data sets, DIAMOnD outperformed the four other tools. Finally, the identified asthma disease module highlighted the GAB1 signaling pathway as an important novel modulator in asthma. This suggests that module-based approaches can reveal novel pathways that are crucial in the pathogenesis of diseases.

The aim of this study is to investigate the use of network module-based methods to elucidate the link between IBD and the hepatobiliary disorders PBC and PSC through the analysis of shared biological processes and pathways. The procedure explores how a module-based approach can contribute to the knowledge about inter-disease connections, in addition to what is learned from differentially expressed genes. Moreover, this study is a continuation of the study by Ostrowski et al. (2019), which analyzed blood transcriptomes from patients with PBC, PSC, and IBD, and it aims to infer additional information about biological processes, pathways, and molecular functions from those results. It is hypothesized that identifying shared genes and functional modules in IBD, PBC, and PSC can reveal cellular mechanisms shared by all three diseases. In addition, identifying the disease modules for the three comparisons, IBD, PBC, and PSC, can validate their functional relevance through computational approaches.

The study begins by acquiring the data sets used by Ostrowski et al. (2019) from the Gene Expression Omnibus (GEO) database. Differentially expressed genes are identified for each data set (each disease) by comparing the disease samples with the healthy control samples. To understand the processes enriched in the diseases, enrichment analysis is carried out on the differentially expressed genes using the Enrichr and FunRich enrichment tools. A protein-protein interaction network is constructed by mapping the differentially expressed genes for IBD, PSC, and PBC using STRING. Functional modules are identified through module-based methods (MCODE and ClueGO), and topological analysis can reveal novel module-based biomarker candidates. Pathway enrichment analysis is then performed on the modules to elucidate the main functions of the genes within them (ClueGO). Finally, shared modules are identified using the ModuLand plug-in, which is integrated in the Cytoscape program.

2. Materials and Methods

The data set with the accession number GSE119600 was used in the project and was downloaded from the Gene Expression Omnibus database. Since this study is a continuation of the study by Ostrowski et al. (2019), the same data set is used here. In that study, the data set was used to investigate the relationship between autoimmune cholestatic liver disease and IBD (Ostrowski et al., 2019). Using the same data also allows the generated results to be compared with those of the Ostrowski study. The data set contains a total of 370 samples, of which 90 are PBC, 45 PSC, 95 CD, 93 UC, and 47 controls. All samples are blood-based. A summary of the samples can be found in Table 1. Blood is considered to be a rich and informative source of transcriptome information for many complex human diseases and pharmacogenomic studies (Vartanian et al., 2009). In addition, previous studies have assessed the use of whole blood samples in differentiating disease activity in CD patients and in distinguishing CD, UC, and non-inflammatory diarrheal conditions (Barnes et al., 2015).

Table 1 Microarray data set that is used in the study.

Data set    Platform                                               Number of samples                                     Source
GSE119600   GPL10558 Illumina HumanHT-12 v4.0 expression beadchip  370 (90 PBC, 45 PSC, 95 CD, 93 UC, and 47 controls)   Whole blood

2.1. Analysis of differentially expressed genes (DEGs)

The need to understand the biological differences between healthy and disease states has motivated the identification of differentially expressed genes (DEGs). A gene is declared differentially expressed if it is induced or repressed under a certain biological process or condition (Anjum et al., 2016). Once DEGs are identified, researchers can pinpoint biomarkers and potential therapeutic targets (Rodriguez-Esteban et al., 2017).

Processed data were selected through the GEO2R tool to generate and analyze the differentially expressed genes. GEO2R is a web tool that allows multiple groups of samples to be compared in order to identify genes that are differentially expressed. GEO2R analyzes the processed data using the GEOquery and limma R packages (Leiserson et al., 2015). Moreover, prior to analysis, the distribution of the sample values can be inspected through the value distribution option to ensure that the selected samples are comparable.

2.2. Pathway enrichment analysis

Pathway enrichment analysis helps researchers understand the biological pathways that are enriched in a gene list (Reimand et al., 2019). The summarized gene list provides mechanistic insight, as the data are interpreted in the context of biological processes, pathways, and networks. In addition, pathway enrichment analysis can reduce data dimensionality by arranging genes into pathways. Enriched pathways are those represented in the gene list more often than would be expected by chance, as assessed by a statistical test. Several enrichment analysis tools, such as Enrichr, most commonly use the Fisher exact test to compute the significance of the overlap and assign it a p-value. The p-value is the probability of observing an overlap at least this large by chance alone (Reimand et al., 2019).

As a statistical technique, pathway enrichment analysis provides a picture of the common roles of the differentially expressed genes across the conditions. In a previous study, Pan et al. (2017) aimed to identify breast cancer biomarkers and used FunRich to carry out functional enrichment analysis of genes involved in breast cancer. The study was able to highlight the biological processes and pathways of breast cancer genes.

One of the tools used for this step is the Functional Enrichment Analysis tool (FunRich). FunRich is open-access software that provides a bioinformatics analysis system for functional enrichment and interaction network analysis of genes and proteins (Pathan et al., 2015). FunRich integrates heterogeneous genomic and proteomic resources with >6.8 million annotations (Pathan et al., 2017). The FunRich background database hosts only human-specific data sets, in which gene and protein annotations were collated from publicly accessible gene and protein databases (Benito-Martin and Peinado, 2015).

The tool uses several databases, including HPRD, Entrez Gene, and UniProt, to gather gene ontology annotations that cover biological process, cellular component, and molecular function. In terms of statistical methods, FunRich uses the hypergeometric distribution test to report p-values and evaluate the statistical significance of enriched as well as depleted terms. In fact, several existing enrichment analysis tools, including FunRich, rely on the hypergeometric distribution test and its variants, such as Fisher’s exact test. FunRich also corrects for multiple testing using the Bonferroni and Benjamini-Hochberg (false discovery rate, FDR) methods (Benito-Martin and Peinado, 2015).
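The two multiple-testing corrections mentioned above can be illustrated with a minimal sketch. It is not FunRich's own implementation; the function names and the example p-values are hypothetical and chosen only for illustration.

```python
def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical p-values for four enrichment terms (illustration only).
p_values = [0.01, 0.04, 0.03, 0.005]
bonf = bonferroni(p_values)
bh = benjamini_hochberg(p_values)
```

Bonferroni controls the family-wise error rate and is the more conservative of the two, whereas Benjamini-Hochberg controls the false discovery rate and therefore usually retains more terms.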

Cao and Zhang (2014) reviewed the classical hypergeometric distribution test used for functional enrichment analysis, shown in Equation 1:

\[
P = \sum_{i=n}^{\min(d,\,g)} \frac{\binom{g}{i}\binom{f-g}{d-i}}{\binom{f}{d}} \tag{1}
\]

In this equation, g is the number of genes annotated to a certain GO term, f is the total number of genes evaluated, and d is the number of DEGs detected. The number of DEGs annotated to this GO term, denoted by n, indicates the representation of the GO term in the list of DEGs. Under the null hypothesis, n is modeled by a hypergeometric distribution. The p-value measuring the significance of enrichment is the tail probability of observing n or more DEGs annotated to the GO term, where \(\binom{a}{b}\) is the binomial coefficient (Cao and Zhang, 2014).
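The tail probability described above can be computed directly from binomial coefficients. The following is a minimal sketch using the symbols g, f, d, and n as defined in the text; the function name and the example counts are hypothetical.

```python
from math import comb


def hypergeometric_p_value(n, g, d, f):
    """Upper-tail probability of observing n or more annotated DEGs.

    n: number of DEGs annotated to the GO term
    g: number of genes annotated to the GO term
    d: number of DEGs detected
    f: total number of genes evaluated
    """
    total = comb(f, d)
    upper = min(d, g)
    # Sum the hypergeometric probabilities for i = n, ..., min(d, g).
    tail = sum(comb(g, i) * comb(f - g, d - i) for i in range(n, upper + 1))
    return tail / total


# Hypothetical counts, chosen only for illustration.
p = hypergeometric_p_value(n=3, g=5, d=4, f=20)
```

With n = 0 the sum covers every possible outcome, so the function returns 1, which is a quick sanity check on the implementation.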

A list of differentially expressed genes is supplied to FunRich, and the results of the analysis are depicted graphically. Two comparisons are performed and examined: a comparison between biological pathways in IBD, PBC, and PSC, and a comparison between biological processes in IBD, PBC, and PSC. Moreover, an overview of the differentially expressed genes that overlap between the conditions is obtained by generating a Venn diagram in FunRich. The overlapping genes are also analyzed in FunRich, which produces a figure of their biological processes.
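The overlap shown in a three-way Venn diagram amounts to a handful of set operations. The sketch below is not part of FunRich; the function name and the gene symbols are placeholders, not results from this study.

```python
def venn_regions(ibd, pbc, psc):
    """Split three gene lists into the seven regions of a Venn diagram."""
    ibd, pbc, psc = set(ibd), set(pbc), set(psc)
    return {
        "all_three": ibd & pbc & psc,
        "ibd_pbc_only": (ibd & pbc) - psc,
        "ibd_psc_only": (ibd & psc) - pbc,
        "pbc_psc_only": (pbc & psc) - ibd,
        "ibd_only": ibd - pbc - psc,
        "pbc_only": pbc - ibd - psc,
        "psc_only": psc - ibd - pbc,
    }


# Placeholder gene symbols (illustration only).
regions = venn_regions(
    ibd=["GENE_A", "GENE_B", "GENE_C"],
    pbc=["GENE_A", "GENE_B", "GENE_D"],
    psc=["GENE_A", "GENE_E"],
)
```

The "all_three" region corresponds to the overlapping genes that are passed on to the enrichment analysis.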

Another tool used in the enrichment analysis process is Enrichr. A list of the generated differentially expressed genes is uploaded to the web-based tool, and the categories analyzed are pathways and ontologies. Enrichr provides users with several types of visualization summaries of the collective functions of a mammalian gene list. The tool was developed by the Ma'ayan laboratory at Mount Sinai and includes a large collection of integrated gene set libraries (166 libraries) (Chen et al., 2013). The workflow of the tool consists of three basic steps. First, a list of human genes (Entrez gene symbols) is supplied. Next, Enrichr uses the gene set libraries to compute enrichment. Finally, the enrichment results are displayed in different forms, such as bar graphs with highlighted enrichment terms. Enrichr computes four types of enrichment scores to evaluate the significance of the overlap between the supplied gene list and the gene sets in each library: p-value, q-value, rank (z-score), and combined score. Fisher's exact test is used to generate the p-value; it assumes a binomial distribution and independence for the probability of any gene belonging to any set. The q-value is the p-value adjusted with the Benjamini-Hochberg method to correct for multiple hypothesis testing. The z-score is the deviation from the expected rank by the Fisher exact test. Finally, the combined score multiplies the log of the p-value computed with the Fisher exact test by the z-score. The formula is c = ln(p) * z, where c is the combined score, p is the p-value computed using Fisher's exact test, and z is the z-score (Kuleshov et al., 2016; Supragna Sandur, 2017). Kuleshov et al. (2016) compared Enrichr to several similar tools, including Fidea, DAVID, WebGestalt, g:Profiler, and GSEA, and pointed out the advantages of Enrichr over these tools (D'Andrea et al., 2013; Huang et al., 2007; Liao et al., 2019; Raudvere et al., 2019; Reimand et al., 2019). They concluded that Enrichr is much more comprehensive because it integrates larger gene set libraries, which accumulate biological knowledge for further biological discoveries (Kuleshov et al., 2016). One of the drawbacks of Enrichr is its lack of flexibility compared with Fidea, DAVID, WebGestalt, g:Profiler, and GSEA. In particular, Enrichr lacks an ID conversion tool, which would convert the user’s input gene IDs from one format to another (Kuleshov et al., 2016).
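The combined score formula c = ln(p) * z stated above can be illustrated with a short sketch. The function name and the example terms are hypothetical, and the negative z-scores are an illustrative assumption rather than values produced by Enrichr.

```python
from math import log


def combined_score(p_value, z_score):
    """Combined score as stated in the text: c = ln(p) * z."""
    if not 0.0 < p_value <= 1.0:
        raise ValueError("p_value must be in the interval (0, 1]")
    return log(p_value) * z_score


# Hypothetical (term, p-value, z-score) triples, for illustration only.
terms = [
    ("term_1", 0.001, -2.0),
    ("term_2", 0.05, -1.5),
    ("term_3", 0.5, -0.2),
]

# Rank terms from the highest to the lowest combined score.
ranked = sorted(terms, key=lambda t: combined_score(t[1], t[2]), reverse=True)
```

Because ln(p) is negative for any p below 1, a term with a small p-value and a strongly negative z-score receives a large positive combined score under this formula.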

An overview of the project’s methodology is found in the Appendix, Figure 13.

2.3. Protein-protein interaction network

Proteins are crucial agents that determine the molecular and cellular mechanisms within a cell, which means that they tend to control healthy and disease states. Studies have shown that proteins are not fully functional in isolation; instead, they tend to work together in networks. A protein-protein interaction network is a mathematical representation of various types of protein interactions (Ideker et al., 2018).

Physical interactions between proteins form a network that researchers can analyze further in order to understand biological phenomena. Analyses of protein interaction networks have shown that the structure and dynamics of protein networks are abnormal in cancer and several autoimmune diseases (Safari-Alighiarloo et al., 2014).

One of the most used databases in protein-protein interaction network construction is STRING. The principle behind this database is to integrate all publicly available sources of protein–protein interaction information and eventually generate computational predictions (Szklarczyk et al., 2019). STRING (version 11) is the chosen database for a protein–protein interaction analysis where differentially expressed genes are uploaded to the biological database. In addition, three networks are generated separately for the three conditions: IBD, PBC and PSC. STRING database computes a combined score which is the final measure when building networks. The combined score is generated by combining the probabilities from the different evidence channels, correcting for the probability of randomly observing an interaction (Szklarczyk et al., 2017).
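As a rough illustration of how probabilities from several evidence channels can be combined while correcting for the probability of a random interaction, consider the simplified sketch below. It is not STRING's actual code: the function name is hypothetical, the prior value of 0.041 is an assumption taken from STRING's published helper script, and channel-specific corrections that STRING applies are omitted.

```python
PRIOR = 0.041  # assumed prior probability of observing a random interaction


def combine_channels(channel_scores, prior=PRIOR):
    """Combine per-channel probabilities into one score (simplified)."""
    product = 1.0
    for score in channel_scores:
        # Remove the prior from each channel score, clamping at zero.
        corrected = max(0.0, (score - prior) / (1.0 - prior))
        product *= 1.0 - corrected
    # Probability that at least one channel supports the interaction.
    combined_without_prior = 1.0 - product
    # Add the prior back exactly once.
    return combined_without_prior * (1.0 - prior) + prior


# Hypothetical channel scores (e.g. experiments and text mining).
score = combine_channels([0.5, 0.5])
```

Under this scheme, two independent channels that each support an interaction yield a higher combined score than either channel alone, while the prior is counted only once.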

One of the main differences between interaction databases is the type of evidence or PPI source used in interactome construction (Szklarczyk et al., 2019). PPI data sources range from experimental data to computationally predicted data, as well as a blend of both. STRING is one of the major predictive PPI databases: it integrates PPI data from a diverse collection of sources and does not rely only on purely experimental PPI data. It therefore contains both experimentally verified and bioinformatically predicted PPIs, which contributes to the availability of a very large data set (Szklarczyk et al., 2019). Moreover, databases with PPI predictions, such as STRING, can expand the known interactome and bring interactomes closer to completeness. The integration of experimental and predicted PPI sources therefore provides users with a more comprehensive result (Szklarczyk et al., 2017).

A disadvantage of using predictive PPI databases such as STRING is that the generated data may not be fully reliable, for instance because false positive interactions are predicted. One of STRING’s protein-protein prediction channels is text mining, in which mentions of protein names are searched in all PubMed abstracts and a statistical co-citation analysis is performed (Szklarczyk et al., 2019). One drawback of this approach is that a protein-protein association is reported based only on the occurrence of the protein names in the same article, which can lead to misleading conclusions (Szklarczyk et al., 2017).

STRING is a rich database containing 9 643 763 proteins from 2031 organisms with 932 553 897 interactions in total (You et al., 2019). However, in this study only human proteins were analyzed. Three main features make the STRING database one of the leading tools in protein network analysis: (i) the availability of Homo sapiens as an organism, and hence of human proteins and their interactions; (ii) the hosting of experimental, predicted, and transferred interactions; and (iii) the provision of domains and protein structures (Franceschini et al., 2013). In terms of functional enrichment, the database draws on several functional classification systems such as GO, Pfam, and KEGG. When a network is visualized, nodes represent proteins and edges denote the interactions. Moreover, each protein node shows a preview of 3D structure information (Szklarczyk et al., 2019). One of the main reasons for using the STRING database in this project is that it combines experimental and predicted PPI data. STRING also extracts experimental data from a wide collection of databases, including BIND, DIP, GRID, HPRD, IntAct, MINT, and PID, which allows a wider collection of results. Protein association knowledge is also obtained from the curated databases Biocarta, BioCyc, GO, KEGG, and Reactome. Curated databases are highly reliable, as the data are collected with a great deal of human effort through the consultation, verification, and aggregation of existing sources (Sun et al., 2012). In addition, the resulting protein-protein interaction network can easily be saved in different formats and imported into Cytoscape for further analyses.

2.4. Functional modules

Cytoscape is one of the most widely used bioinformatics software platforms and provides a visual interface for importing, visually exploring, and analyzing graphical data (Shannon et al., 2003). The PPI network generated in the previous step is supplied to Cytoscape, where the plug-in ClusterMaker2 is used to construct functional modules by clustering nodes within the network. Protein complexes are groups of proteins that interact with each other to form a single multimolecular machine. Functional modules are groups of proteins that participate in a particular cellular process while binding each other at different times and places. ClusterMaker2 is a Cytoscape App that uses Gene Ontology terms and pathways from multiple ontologies to allow users to better understand biological phenomena by visualizing functionally organized networks (Shannon et al., 2003). The developers of ClusterMaker2 aimed to design a tool that extracts the non-redundant biological information for large clusters of genes (Bindea and Mlecnik, 2013). Using GO, KEGG, and BioCarta, ClusterMaker2 makes it possible in this study to extract non-redundant biological information for clusters of genes. Moreover, the tool has several beneficial features, including the automatic recognition of multiple gene and protein identifiers and the simultaneous analysis of multiple annotations, and its results are illustrated as a functionally grouped network. ClusterMaker2 hosts several network cluster algorithms, whose primary function is to detect natural groupings of nodes within a network using a numeric edge attribute. This attribute represents a similarity or distance metric between two nodes, and nodes that are closer together are more likely to be grouped together. The network cluster algorithm within ClusterMaker2 that is used in this project to construct the functional network is the Molecular Complex Detection (MCODE) algorithm.

MCODE is a Cytoscape plug-in and a relatively fast clustering method. It has other strong features, such as the fine-tuning of results through numerous node-scoring and cluster-finding parameters (Bader and Hogue, 2003). The main principle behind the tool is to detect densely connected regions in large protein-protein interaction networks. The MCODE algorithm is based on vertex weighting (based on the k-core) by local neighborhood density, followed by outward traversal from a locally dense seed protein to eventually isolate the dense regions (Bader and Hogue, 2003). The k-core is a measure that helps identify small, interlinked core areas of a network: a k-core is a graph of minimal degree k (Bader and Hogue, 2003).
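The k-core notion used in the vertex weighting can be made concrete with a small sketch that repeatedly removes nodes of degree below k. This illustrates the k-core definition only, not the MCODE algorithm itself; the function name and the toy network are hypothetical, and the adjacency mapping is assumed to be symmetric with every node present as a key.

```python
def k_core(adjacency, k):
    """Return the set of nodes in the k-core of an undirected graph.

    adjacency maps each node to the set of its neighbors.
    """
    remaining = {node: set(nbrs) for node, nbrs in adjacency.items()}
    changed = True
    while changed:
        changed = False
        for node in list(remaining):
            if len(remaining[node]) < k:
                # Remove the node and detach it from its neighbors.
                for neighbor in remaining.pop(node):
                    remaining[neighbor].discard(node)
                changed = True
    return set(remaining)


# Toy network: a triangle P1-P2-P3 with P4 attached only to P3.
network = {
    "P1": {"P2", "P3"},
    "P2": {"P1", "P3"},
    "P3": {"P1", "P2", "P4"},
    "P4": {"P3"},
}
core = k_core(network, 2)
```

In the toy network the 2-core is the triangle, because the pendant node has only one neighbor and is pruned first.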

The operation of the algorithm can therefore be summarized in three stages. First, vertices are weighted by their local network density. Next, molecular complexes are predicted by starting from the highest-weighted node and moving outward, adding nodes to the complex. Finally, filters are applied to improve cluster quality (Bader and Hogue, 2003). The main motivation for using MCODE is its high precision: it detects protein complexes of the highest quality in terms of the functional and localization similarity of their proteins. The drawback of the method is its low sensitivity, since it detects only a small number of proteins and therefore produces false negative predictions (Wu et al., 2008).

Bader and Hogue (2003) compared MCODE to other graph clustering methods and concluded that MCODE has a directed mode that allows fine-tuning of clusters of interest without considering the rest of the network. Another advantage of MCODE over similar tools is that it allows examination of cluster interconnectivity, which is relevant for protein networks (Bader and Hogue, 2003). Functional modules for the three conditions, IBD, PBC, and PSC, are generated using the MCODE algorithm. Every disease is represented by the clusters generated by the MCODE algorithm in the ClusterMaker2 app.

ClueGo is another useful plug-in that is used to improve the biological interpretation of gene lists. ClueGo creates a functionally organized network from the supplied gene lists by integrating Gene Ontology (GO) terms as well as KEGG/BioCarta pathways (Saito et al., 2012). ClueGo uses kappa statistics to reflect the relationships between the terms in the network, based on the similarity of their associated genes (Bindea and Mlecnik, 2013). To be able to use the ClueGo plug-in, a license key was requested.

2.5. Identification of shared modules

The ModuLand plug-in for Cytoscape provides users with an algorithm to determine overlapping network modules. The plug-in identifies several hierarchical layers of modules, where the meta-nodes of a higher layer are the modules of the layer below (Szalay-Beko et al., 2012). In this project, the ModuLand plug-in is used to identify hierarchical layers of modules.

The reason for identifying shared modules is the hypothesis that proteins interacting with disease genes have a role in the disease mechanism. ModuLand introduced the principle of the community landscape, in which the network is laid out in the x-y plane for a 2D visualization and the z-axis represents community centrality. The community centrality of an edge is the sum of the local influence zones of all network edges whose zones include the given edge. The hills of the community landscape correspond to the modules of the network. The next step is to identify the modular centers, which are the links at the local maxima of the community landscape, and to determine the membership of links in all network modules. The last step of the ModuLand framework is to generate the network hierarchy, in which a higher-level hierarchical representation of the network is created. Elements of the higher level correspond to modules of the original network, and links of the higher level correspond to overlaps between the respective modules (Kovács et al., 2010).

LinkLand and ProportionalHill are two previously published methods of the ModuLand family of network module determination methods. The ModuLand plug-in relies on these two methods because they provide an efficient trade-off between speed and accuracy (Szalay-Beko et al., 2012). The LinkLand algorithm is responsible for calculating influence zones and community centrality. The ProportionalHill module determination method consists of multiple rounds in which links are assigned to modules based on the assignment of previously assigned links (Szalay-Beko et al., 2012). The lack of studies that use the ModuLand plug-in to generate hierarchical layers of disease genes has led to the selection of this method in this project.

3. Alternative methods

3.1. Analysis of differentially expressed genes (DEGs)

Several methods have been proposed for the generation and analysis of differentially expressed genes from microarray data. Fold change is a well-known method for identifying differentially expressed genes. The method evaluates the log-ratio between two conditions. In simpler words, fold change is the mean expression of the gene in group 1 divided by the mean expression of the gene in group 2. A disadvantage of fold change is that it can be biased, so that genes are misclassified as differentially expressed (Cui and Churchill, 2003).
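As a small illustration of the fold change definition above, the following sketch computes the log2 ratio of the group means for a single gene. The function name and the expression values are hypothetical, and the input is assumed to be on a linear (not log-transformed) scale.

```python
import math
from statistics import mean


def log2_fold_change(group1, group2):
    """Log2 of mean expression in group1 divided by mean expression in group2.

    Assumes linear-scale expression values (an assumption of this sketch).
    """
    return math.log2(mean(group1) / mean(group2))


# Hypothetical expression values for one gene (not taken from the study data).
patients = [8.0, 8.0, 8.0]
controls = [2.0, 2.0, 2.0]
lfc = log2_fold_change(patients, controls)  # ratio of means is 4, log2 ratio is 2
```

Because only the means enter the calculation, two genes with very different within-group spread can receive the same fold change, which is one way to see the bias mentioned above.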

The t-test is a popular and simple statistical test used to derive differentially expressed genes. It works by testing for a significant difference between the means of two groups. Statistical significance is determined from the t-statistic, the t-distribution, and the degrees of freedom. Working with a t-test can be problematic because the variance is estimated separately for each gene, and the variance estimates can be skewed by genes having a very low variance (Jeanmougin et al., 2010).
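The quantities that enter a t-test can be sketched as follows. The text does not say which variant of the t-test is meant, so the unequal-variance (Welch) form is assumed here; the function name and the example values are hypothetical.

```python
from statistics import mean, variance


def welch_t(group1, group2):
    """Return the Welch t-statistic and its approximate degrees of freedom.

    Uses the unequal-variance form of the two-sample t-test, which is an
    assumption of this sketch.
    """
    n1, n2 = len(group1), len(group2)
    v1, v2 = variance(group1), variance(group2)  # sample variances
    se2 = v1 / n1 + v2 / n2                      # squared standard error
    t = (mean(group1) - mean(group2)) / se2 ** 0.5
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = se2 ** 2 / ((v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1))
    return t, df


# Hypothetical expression values for one gene in two groups.
t_stat, dof = welch_t([1.0, 2.0, 3.0, 4.0, 5.0], [2.0, 3.0, 4.0, 5.0, 6.0])
```

A gene whose sample variance happens to be very small yields a small standard error and therefore a large t-statistic even for a small difference in means, which illustrates the problem with low-variance genes noted above.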

Significance Analysis of Microarrays (SAM) is another alternative for the generation and analysis of differentially expressed genes from microarray data. SAM is implemented as the sam function in the siggenes package (Taguchi, 2018). The SAM technique was first described by Tusher et al. (2001) and is used to detect significant genes. SAM uses a method similar to the t-statistic and computes a statistic d_j for each gene j, which measures the strength of the relationship between gene expression and a response variable (Tusher et al., 2001). Moreover, since the data may not follow a normal distribution, SAM uses non-parametric statistics. The advantage of SAM is that it provides an estimate of the False Discovery Rate for multiple testing and therefore controls false positives. However, according to previous studies in which SAM was used, a major drawback of the method is that the estimate of the number of significant genes is biased, especially when this number is relatively large (Pan et al., 2003; Efron et al., 2000).

3.2. Pathway enrichment analysis

Several tools are available for mapping differentially expressed genes and proteins to their biological annotations. DAVID (Database for Annotation, Visualization and Integrated Discovery) is a bioinformatics resource that provides users with functional annotation tools to deliver a biological interpretation of a gene list (Ma et al., 2018). The database has five integrated functional annotation tools. For any given gene list, the database can provide several features, such as clustering annotation terms, listing interacting proteins, and showing genes on BioCarta and KEGG pathway maps.

Unfortunately, the database was temporarily inaccessible, which led to the selection of alternative tools for pathway enrichment (FunRich and Enrichr).

Gene Set Clustering based on Functional annotation (GeneSCF) is a real-time functional enrichment tool that takes a gene list as input. Real-time means that GeneSCF reads data from source databases such as KEGG and Reactome at run time and does not rely on pre-fetched data (Subhash and Kanduri, 2016). The most noticeable advantage of GeneSCF over other available tools is that users do not depend on the tool itself being updated. It also carries out enrichment analysis for multiple gene lists using multiple source databases (GO, KEGG, REACTOME and NCG) in a single run (Subhash and Kanduri, 2016). Just like Enrichr, GeneSCF calculates the significance of enrichment using Fisher’s Exact test. GeneSCF differs from FunRich in how the statistical significance of enriched terms is calculated, because FunRich uses a hypergeometric distribution test for enrichment significance (Lin et al., 2019).
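The kind of calculation behind such enrichment p-values can be sketched with the hypergeometric upper tail, which is also how a one-sided Fisher's Exact test for over-representation is computed. This is a generic illustration, not the code of GeneSCF, Enrichr, or FunRich; the function name, parameter names, and numbers are hypothetical.

```python
from math import comb


def enrichment_p_value(population, annotated, selected, overlap):
    """Probability of observing at least `overlap` annotated genes.

    population: number of genes in the background
    annotated:  background genes annotated with the term
    selected:   size of the submitted gene list
    overlap:    genes in the list that carry the annotation
    """
    upper = min(annotated, selected)
    total = comb(population, selected)
    # Sum the hypergeometric probabilities for overlap, overlap + 1, ...
    tail = sum(
        comb(annotated, k) * comb(population - annotated, selected - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


# Hypothetical numbers: 10 background genes, 4 annotated, list of 3, 2 hits.
p_value = enrichment_p_value(10, 4, 3, 2)  # (36 + 4) / 120
```

Real tools apply this per term and then correct the resulting p-values for multiple testing.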

Subhash and Kanduri (2016) compared the performance of GeneSCF to tools that follow a similar methodology, in which Fisher’s Exact test is used for enrichment. DAVID shares this methodology, since it also relies on Fisher’s Exact test. Subhash and Kanduri (2016) reported a difference in the number of genes covered by DAVID compared to GeneSCF. DAVID covered only 50% of the genes that were covered by GeneSCF, which indicates that the GeneSCF update mode provides better functional enrichment information (Subhash and Kanduri, 2016).

This tool is a sensible choice when the user needs to perform enrichment analysis on any number of gene lists for multiple organisms simultaneously using a simple bash script. Therefore, GeneSCF can be considered a time saver compared to other web-based functional tools such as DAVID, Enrichr and FunRich (Subhash and Kanduri, 2016).

3.3. Protein-protein interaction network

The Biological General Repository for Interaction Datasets (BioGRID) is one of the most detailed databases of protein and genetic interactions. Unlike STRING, which contains both experimental and predicted interactions, BioGRID is a database of experimentally derived protein-protein interactions (Chatr-Aryamontri et al., 2017). The primary difference between BioGRID and STRING is BioGRID's manual curation of experimentally validated protein interactions that are reported in peer-reviewed biomedical publications. In terms of curation strategy, all curation activity in BioGRID is carried out through the Interaction Management System, a dedicated internal database. The Interaction Management System tracks individual curator contributions and standardizes all aspects of the curation of experimental evidence (Oughtred et al., 2019). The main limitation of BioGRID is that it lacks information about multi-protein complexes. Bias is another challenge in the experimental procedures of PPI studies, because interactions requiring different cell types may not be found. Moreover, researchers apply experimental methods that focus on particular proteins, leaving out so-called interactome orphans. This results in research bias (Chatr-Aryamontri et al., 2017).

DIP is another curated database of interacting proteins. The database applies quality assessment methods to select subsets of the most dependable interactions (Sharma et al., 2018). In addition to primary sources such as GenBank, DIP derives its data from other sources such as the KEGG database. Just like BioGRID, DIP documents experimentally determined interactions between proteins, which is one of its main differences from the STRING database.

Although the BioGRID and DIP databases include thousands of human protein interactions, their coverage of the whole human interactome is still limited. A very limited overlap between the two databases was reported (Chatr-Aryamontri et al., 2017).

3.4. Functional modules

The affinity propagation algorithm, which is implemented in ClusterMaker2, is a convenient tool for generating clusters within protein-protein networks (Morris et al., 2011). Affinity propagation is based on message passing between data points. The algorithm initially considers all data points as potential cluster centers and groups them solely according to the degree of similarity among the data points. Availability and responsibility are the two messages exchanged among the data points. When the sum of the two messages (responsibilities and availabilities) is maximized, the message-passing process ends and a clear set of exemplars and clusters emerges (Zhao and Xu, 2015). The algorithm can be used to identify functionally related groups of proteins within large protein-protein similarity networks.

The Markov Clustering (MCL) algorithm is another network clustering algorithm available in the ClusterMaker2 plugin and can be considered an alternative to the MCODE algorithm. MCL simulates random walks in protein-protein interaction networks to detect protein complexes. There are two basic steps in the MCL operation: expansion and inflation. In the expansion phase the matrix is squared, whereas in the inflation phase each matrix entry is raised to a given power and the columns are renormalized. Ultimately, the two phases separate the PPI network into many segments, which are predicted to be protein complexes. Zhao and Xu (2015) compared MCODE to several other network clustering algorithms, including MCL, and concluded that the protein complexes predicted by MCODE have the highest quality.

3.5. Identification of shared modules

Specific betweenness (S2B) is a novel method for searching the overlap between network modules, which allows cross-disease analyses to be carried out (Garcia-Vaquero et al., 2018). Relying on the fact that disease-related genes form modules in protein interaction networks, S2B can effectively find genes associated with related diseases. Moreover, it is hypothesized that proteins interacting with disease genes have a role in the disease mechanism. S2B therefore prioritizes nodes (genes) that are frequently and specifically present in the shortest paths linking two disease modules. Nodes that occur more frequently in these specific shortest paths have a higher S2B score (Garcia-Vaquero et al., 2018).

Garcia-Vaquero et al. (2018) used the specific betweenness method to find genes associated with Amyotrophic Lateral Sclerosis and Spinal Muscular Atrophy. They found that the S2B candidates were enriched in biological processes that have a role in motor neuron degeneration. In addition, the study highlighted suggested common molecular mechanisms for the two diseases (Garcia-Vaquero et al., 2018).


4. Implementation and Results

4.1. Analysis of differentially expressed genes (DEGs)

Processed data were analyzed with the GEO2R tool to generate and analyze the differentially expressed genes. GEO2R is a web tool that allows the comparison of multiple groups of samples to identify genes that are differentially expressed. GEO2R processes the original processed data using the GEOquery and limma R packages (Leiserson et al., 2015). Before proceeding with the analyses, the distribution of the sample values was inspected through the value distribution option. This ensures that the selected samples are normalized and suitable for comparison (Leiserson et al., 2015). Each of the three conditions, IBD, PBC, and PSC, was compared separately against the control group. A total of 188 IBD samples (95 CD and 93 UC) were compared against 47 healthy controls. Similarly, 90 PBC samples were compared against 47 controls and 45 PSC samples against 47 healthy controls. The Benjamini & Hochberg (false discovery rate) adjustment was chosen for the analysis. Genes with an adjusted P-value below 0.0001 were selected as differentially expressed genes, and all genes with an adjusted P-value above 0.0001 were discarded. Moreover, genes lacking a gene name/symbol were filtered out. The stringent cutoff of 0.0001 yielded fewer DEGs, which is more practical when constructing a protein-protein interaction network in the STRING database, especially since STRING does not accept more than 2000 genes. The total number of differentially expressed genes was 583 for PBC, 146 for PSC, and 928 for IBD. Up-regulation and down-regulation of gene expression were also examined across the DEGs of the three conditions by setting a cutoff value of 1 on the LogFC data. All DEGs in PSC and PBC were downregulated (LogFC<1), whereas only one gene, matrix metallopeptidase 9, was upregulated in the IBD condition (LogFC>1).
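The Benjamini & Hochberg adjustment used for the selection above can be sketched in a few lines. This is a generic illustration of the procedure and not the GEO2R or limma code; the function name and the p-values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


# Hypothetical raw p-values for four genes.
raw = [0.01, 0.04, 0.03, 0.005]
adj = benjamini_hochberg(raw)
# Keep the indices of genes passing a hypothetical cutoff of 0.03.
selected = [i for i, p in enumerate(adj) if p < 0.03]
```

With the cutoff of 0.0001 used in this project, the same comparison would be applied to the adjusted P-value of each gene.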
The top 10 genes of PBC, PSC, and IBD are shown in Table 2. The first 10 overlapping genes in alphabetical order, as extracted from FunRich, are: ABCG1, AK6, AKAP17A, ATP5C1, C14orf159, C15orf39, CCDC25, CKS2, CRK and CUX1. The rest of the overlapping genes are listed in Table 6 in the Appendix.

Table 2. Top 10 differentially expressed genes with the lowest adjusted P-values for PBC, PSC, and IBD. All genes have a P<0.0001.

PBC        PSC        IBD
ABCG1      DBI        MED1
TRPC4AP    MRPL19     TAOK3
UBE2V2     MRPL47     RASGRP2
MVP        UQCRHL     SNRPC
STK11IP    CBX3       EIF5A
FLII       MED1       MEPCE
PRKX       AK6        IFFO2
INPPL1     IFI27L2    STAT5A
ZNHIT3     VPS8       RPS6KA1
AK6        ABCG1      DHX16

To visualize the number of differentially expressed transcripts and the number of overlapping genes between the three conditions, a Venn diagram was drawn using the FunRich program. To generate the Venn diagram, the three DEG lists (adjusted p-value < 0.0001) from PBC, PSC, and IBD were uploaded into FunRich. Figure 1 shows the Venn diagram of the number of differentially expressed transcripts in the three comparison groups. 49 genes overlapped between all three conditions, whereas 39 genes were common to PSC and IBD.

Figure 1. FunRich Venn diagram showing the number of DEGs from patients with PBC, PSC, and IBD compared with those from healthy controls as well as numbers of overlapping genes among the three conditions.
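The regions of such a three-set Venn diagram can also be obtained with plain set operations, as in the following sketch. It does not reproduce FunRich; the function name is hypothetical and the gene symbols are placeholders rather than the DEG lists of this study.

```python
def venn_regions(pbc, psc, ibd):
    """Split three gene sets into the seven exclusive Venn regions."""
    return {
        "all_three": pbc & psc & ibd,
        "pbc_psc_only": (pbc & psc) - ibd,
        "pbc_ibd_only": (pbc & ibd) - psc,
        "psc_ibd_only": (psc & ibd) - pbc,
        "pbc_only": pbc - psc - ibd,
        "psc_only": psc - pbc - ibd,
        "ibd_only": ibd - pbc - psc,
    }


# Placeholder gene symbols (hypothetical, for illustration only).
pbc_genes = {"G1", "G2", "G3", "G4"}
psc_genes = {"G1", "G2", "G5"}
ibd_genes = {"G1", "G3", "G5", "G6"}
regions = venn_regions(pbc_genes, psc_genes, ibd_genes)
counts = {name: len(genes) for name, genes in regions.items()}
```

Each gene falls into exactly one region, so the region counts sum to the size of the union of the three lists.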

4.2. Pathway enrichment analysis

The first approach to enrichment analysis used the Enrichr analysis tool. Lists of significant genes from the three conditions were supplied to Enrichr. The pathways and ontologies sections were analyzed closely, and the top 10 significant terms are shown. Pathways, biological processes, and their associated genes for the three conditions are supplied in Tables 7-15 of the Appendix (Chen et al., 2013; Kuleshov et al., 2016).

4.2.1. PBC DEGs enrichment analysis

The enriched pathways for PBC were generated from the differentially expressed genes of the PBC samples. Results are sorted according to P-value ranking, where a p-value < 0.1 is displayed as significant. Inspecting the KEGG 2019 Human pathways, the most significant cluster was indicating pathway. It was followed by: thyroid hormone signaling pathway, Fc gamma R-mediated phagocytosis, phospholipase D signaling pathway, phosphatidylinositol signaling system, regulation of actin cytoskeleton, non-alcoholic fatty liver disease (NAFLD), shigellosis, chemokine signaling pathway, and acute myeloid leukemia. In the GO Biological Process 2018, results are ranked according to p-value where the most significant term was . The second most significant term is gene expression, followed by phosphorylation and mitochondrial translational elongation. Looking at the GO Molecular Function 2018, the most significant GO term was activity. Next are RNA binding and phosphotransferase activity, which had very similar p-values.


4.2.2. PSC DEGs enrichment analysis

In the KEGG 2019 Human pathways, bars are sorted according to p-value ranking. The most significant pathway for the PSC genes was oxidative phosphorylation. The next enriched terms had P-values close to that of the first term: Parkinson disease, Alzheimer disease, non-alcoholic fatty liver disease (NAFLD), and Huntington disease.

Inspecting the GO Biological Process 2018, two biological processes were highlighted: mitochondrial ATP synthesis coupled electron transport and respiratory electron transport chain. The recorded p-values were 4.699e-10 and 1.286e-9, respectively. In the GO Molecular Function 2018 database, the most enriched terms were NADH dehydrogenase quinone activity and NADH dehydrogenase (ubiquinone) activity. Both terms had a P-value of 1.30E-05.

4.2.3. IBD DEGs enrichment analysis

The KEGG 2019 Human pathways showed the toll like receptor signaling pathway as the most enriched term, with a P-value < 0.01. The second most significant term is Fc gamma R-mediated phagocytosis, with a p-value of 2.8E-05.

In the ontologies, the biological process section showed significance for neutrophil activation involved in immune response, with a p-value of 2.801e-11. Other significant biological processes were neutrophil mediated immunity, followed by neutrophil degranulation. The remaining biological processes involved phosphorylation, gene expression, and protein transport. In the GO Molecular Function 2018, RNA binding was the most significant term, with a p-value of 1.001e-9.

4.2.4. Comparison between functional enrichment of PBC, PSC, and IBD DEGs

The next part of the study is the functional analysis, in which biological pathways and processes of PSC, PBC, and IBD were inspected in the FunRich program. FunRich version 3.1.3 was downloaded from http://www.FunRich.org/download. All three groups of DEGs (PBC, PSC, IBD) were uploaded into FunRich, and 10 items were chosen to be shown in the displayed graphs. Using the “compare” option, figures comparing biological pathways and processes were generated.

4.2.5. Comparison between biological pathways in PBC, PSC and IBD using FunRich

The comparison of biological pathways between PBC, PSC, and IBD showed poor significance in FunRich. However, the percentages of PBC, PSC, and IBD DEGs in each biological pathway are similar (see Figure 2). Some of the top enriched pathways are integrin family surface interactions, LKB1 signaling events, and glypican pathway. Among the displayed biological pathways, metabolism has the lowest percentage of IBD genes. PSC DEGs have the highest percentage of genes in the displayed biological pathways, as shown in Figure 2. IBD and PSC have similar numbers of genes in each of the enriched pathways shown.


Figure 2. Enriched biological pathways in PBC, PSC, and IBD generated by FunRich.


4.2.6. Comparison between biological processes in PBC, PSC and IBD using FunRich

In FunRich, the most enriched biological processes are displayed based on p-value. However, the result in Figure 3 shows poor significance, with a P-value equal to 1. Inspecting the percentage of genes in each biological process shows that signal transduction, cell communication, and regulation of nucleobase, nucleotide and nucleic acid metabolism are among the recorded biological processes of PBC, PSC, and IBD. It was also noticed that a greater percentage of PSC, PBC, and IBD genes participate in an unknown biological process.

Figure 3. Enriched biological processes in PBC, PSC, and IBD generated by FunRich.

4.2.7. Enrichment analysis of overlapping genes between PBC, PSC and IBD using FunRich

The Venn diagram in Figure 1 showed 49 overlapping genes. These genes were further analyzed using the “analysis” option in FunRich. Ten items were selected to be shown on the chart for enrichment analysis. Biological pathways and processes are shown in Figures 4 and 5, respectively. Among the biological pathways, CXCR4-mediated signaling events accounted for 21.1% of the overlapping PBC, PSC, and IBD genes. In terms of the percentage of overlapping genes per biological process, the two most enriched processes were signal transduction and cell communication (see Figure 5).

Figure 4. Enriched biological pathways of the 49 overlapping genes found in PBC, PSC, and IBD. Results are generated by FunRich.


Figure 5. Enriched biological processes of the 49 overlapping genes found in PBC, PSC, and IBD. Results are generated by FunRich.


4.3. Protein-Protein Interaction Network

Differentially expressed genes for PSC, PBC, and IBD with a p-value < 0.0001 were uploaded into the Search Tool for the Retrieval of Interacting Genes/Proteins (STRING V11.0). The multiple proteins section was chosen for submitting the list of genes, and Homo sapiens was selected as the organism. All default parameters were used, with all active interaction sources selected. This option allows the selection of the types of evidence that contribute to the prediction of the score. In this step, all seven evidence sources were selected: text mining, experiments, databases, co-expression, neighborhood, gene fusion, and co-occurrence. The text mining option looks for significant interaction groups in the abstracts of scientific literature. The experiments source looks for significant protein interaction data sets gathered from other protein-protein interaction databases, such as the DIP database. The databases option shows a list of significant protein interaction groups found in curated databases, such as the KEGG database. Curated databases require a great deal of human effort and are highly reliable (Chatr-Aryamontri et al., 2017). The co-expression evidence provides genes that are co-expressed in the same or in other species. When fusion evidence is selected, the individual gene fusion events per species are shown. The conserved neighborhood evidence displays runs of genes that occur repeatedly in close neighborhood in genomes. Finally, co-occurrence evidence shows the presence or absence of linked proteins across species (Szklarczyk et al., 2019).

Three separate networks were constructed; see Table 3 for the PPI network details. The details in Table 3 were retrieved from the analysis section, which gives some brief statistics of the inferred networks. The networks were downloaded as simple tabular text output for further analysis. Table 3 shows the average clustering coefficient for every protein network. This measurement describes the degree of connectivity in the neighborhood of a protein in a network. The higher the clustering coefficient, the more the protein is located in a densely connected part of the network (Lin et al., 2006). The nodes in the table below represent proteins, and the edges represent protein-protein associations. The analysis section reported that all three networks have significantly more interactions than expected.

Table 3. Details of PPI networks of the three diseases PBC, PSC and IBD generated from STRING database.

PPI Network   Number of nodes   Number of edges   Average clustering coefficient   PPI enrichment p-value
PBC           507               1908              0.321                            P< 1.0e-16
PSC           135               244               0.384                            P< 1.0e-16
IBD           829               4406              0.341                            P< 1.0e-16
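The average clustering coefficient reported in Table 3 can be illustrated with a minimal sketch. It assumes an unweighted, undirected network and assigns a coefficient of 0 to nodes with fewer than two neighbors; whether STRING uses exactly this convention is an assumption. The function names and the toy network are hypothetical.

```python
from itertools import combinations


def local_clustering(adjacency, node):
    """Fraction of pairs of neighbours of `node` that are themselves linked."""
    neighbours = adjacency[node]
    k = len(neighbours)
    if k < 2:
        return 0.0  # convention assumed here for nodes with < 2 neighbours
    links = sum(1 for u, v in combinations(neighbours, 2) if v in adjacency[u])
    return 2.0 * links / (k * (k - 1))


def average_clustering(adjacency):
    """Mean of the local clustering coefficients over all nodes."""
    return sum(local_clustering(adjacency, n) for n in adjacency) / len(adjacency)


# Toy network: triangle A-B-C with an extra node D attached to A.
toy_ppi = {
    "A": {"B", "C", "D"},
    "B": {"A", "C"},
    "C": {"A", "B"},
    "D": {"A"},
}
avg_cc = average_clustering(toy_ppi)  # (1/3 + 1 + 1 + 0) / 4
```

In the toy network, B and C sit in a fully connected neighborhood, while A has only one of three possible links among its neighbors.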

4.4. Functional modules

Cytoscape V3.8.0 was used to perform functional clustering and analysis. The software was downloaded from https://cytoscape.org/. To generate clusters, clusterMaker2, which is the Cytoscape 3 version of the clusterMaker plugin, was used. The clusterMaker2 app was downloaded from the Cytoscape App store: http://apps.cytoscape.org/apps/clustermaker2. The PPI network data for the three diseases PSC, PBC, and IBD were edited in Excel to produce files in SIF format that could be uploaded into Cytoscape. The network cluster algorithm MCODE was selected. A degree cutoff of 2 was chosen; this value is the minimum degree necessary for a node to be scored. In the cluster finding section, both haircut and fluff were selected. The haircut option drops all nodes from a cluster that have only a single connection. When the fluff option is checked (along with the haircut option), the size of the complex is increased: cluster cores are expanded by one step and the new nodes are added to the cluster. The default K-core value of 2 was chosen, which filters out clusters that do not contain a maximally inter-connected sub-cluster of at least k degrees. In the last section, visualization options, create new clustered network was checked (Pavlovic, 2009).

4.4.1. PSC, PBC, and IBD functional modules

After uploading the PPI network data for the three conditions, clusters were visualized in Cytoscape via the ClusterMaker2 plug-in. For PSC, five clusters were generated by the MCODE algorithm in the ClusterMaker2 app. For IBD, 23 clusters were derived, while the genes of PBC were grouped into 13 clusters. The PBC and PSC clusterings included clusters with a single gene, suggesting that these genes are unrelated.

4.4.2. Enrichment analysis of the functional modules

The Cytoscape App ClueGo was used to extract non-redundant biological information for clusters of genes. First, the ClueGo App was installed in Cytoscape and the license was supplied. In the Load Marker List section, the list of disease genes was pasted, and Homo sapiens was selected as the organism. GO Biological Process, GO Molecular Function, and KEGG were selected as the ontologies and pathways. The “All” evidence code was chosen, and “Use GO Term Fusion” and P-values less than 0.01 were applied.

For the PSC genes, three functional clusters were visualized, as shown in Figure 6. The results show functionally grouped networks with terms as nodes, linked based on their kappa score level (≥0.3), where only the label of the most significant term per group is shown. The majority of nodes were clustered in the oxidative phosphorylation biological process. Figure 6 shows that 80% of the biological terms belong to the oxidative phosphorylation process. In the ClueGo information table, the leading term of a group is the term with the highest significance in that group. The three significant terms were translational termination (GO biological process), ribosome (KEGG), and oxidative phosphorylation (GO biological process).

Figure 6. Overview chart of PSC terms generated by ClueGo. The percentage of genes associated with a specific term is also displayed.
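The kappa score used to link terms can be illustrated with Cohen's kappa computed on the gene membership of two terms. This is a generic sketch under the assumption that both gene sets are subsets of a common gene universe; it is not taken from the ClueGo source, and the function name and gene identifiers are hypothetical.

```python
def kappa_score(genes_a, genes_b, universe):
    """Cohen's kappa for the agreement of two terms on gene membership."""
    n = len(universe)
    both = len(genes_a & genes_b)
    only_a = len(genes_a - genes_b)
    only_b = len(genes_b - genes_a)
    neither = n - both - only_a - only_b
    observed = (both + neither) / n          # observed agreement
    p_a = (both + only_a) / n                # share of genes in term A
    p_b = (both + only_b) / n                # share of genes in term B
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)  # agreement by chance
    if expected == 1.0:
        return 1.0  # degenerate case: both terms empty or both cover all genes
    return (observed - expected) / (1 - expected)


# Hypothetical universe of ten genes and three hypothetical terms.
universe = {f"g{i}" for i in range(10)}
term_1 = {"g0", "g1", "g2", "g3"}
term_2 = {"g0", "g1", "g2", "g4"}
term_3 = {"g5", "g6"}
kappa_12 = kappa_score(term_1, term_2, universe)  # 0.28 / 0.48
kappa_13 = kappa_score(term_1, term_3, universe)
linked = kappa_12 >= 0.3  # threshold taken from the text above
```

Terms that share most of their genes exceed the threshold and are drawn as linked nodes, whereas terms with disjoint gene sets receive a negative score.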


PBC showed 17 functional clusters in ClueGo, where the largest share of genes, 23.19% (see Figure 7), is associated with translation. The ClueGo results table displayed the most significant term of each of the 17 groups. All 17 significant terms and their associated ontology sources are listed in Table 4.

Figure 7. Overview chart of PBC terms generated by ClueGo. The percentage of genes associated with a specific term is also displayed.

Table 4. PBC most significant terms per group generated via ClueGo. Ontology sources of the terms are also displayed.

GO term                                                   Ontology Source
viral process                                             GO_BiologicalProcess
Shigellosis                                               KEGG
Salmonella infection                                      KEGG
organelle organization                                    GO_BiologicalProcess
vesicle-mediated transport                                GO_BiologicalProcess
actin cytoskeleton organization                           GO_BiologicalProcess
regulation of Rab protein signal transduction             GO_BiologicalProcess
ribosomal large subunit biogenesis                        GO_BiologicalProcess
protein-containing complex subunit organization           GO_BiologicalProcess
Oxidative phosphorylation                                 KEGG
phosphorylation                                           GO_BiologicalProcess
regulation of small GTPase mediated signal transduction   GO_BiologicalProcess
binding                                                   GO_MolecularFunction
intracellular transport                                   GO_BiologicalProcess
phosphorus metabolic process                              GO_BiologicalProcess
GTPase binding                                            GO_MolecularFunction
Translation                                               GO_BiologicalProcess


The biological interpretation of IBD was obtained through the ClueGo functional network and charts. 17 groups were generated and visualized as a functional network. Terms are shown as nodes, and only the most significant term per group is labeled. The pie chart (Figure 8) summarizes the number of terms per group, where the biological process vesicle-mediated transport shows the largest percentage. The ontology sources of the most significant GO terms are shown in Table 5.

Table 5. IBD most significant GO terms per group and their ontology source.

GO term                                                   Ontology Source
autophagy                                                 GO_BiologicalProcess
Toll-like receptor signaling pathway                      KEGG
Salmonella infection                                      KEGG
enzyme binding                                            GO_MolecularFunction
regulation of localization                                GO_BiologicalProcess
intracellular signal transduction                         GO_BiologicalProcess
symbiotic process                                         GO_BiologicalProcess
cellular response to cytokine stimulus                    GO_BiologicalProcess
protein-containing complex disassembly                    GO_BiologicalProcess
Oxidative phosphorylation                                 KEGG
positive regulation of response to biotic stimulus        GO_BiologicalProcess
phosphorylation                                           GO_BiologicalProcess
gene expression                                           GO_BiologicalProcess
intracellular transport                                   GO_BiologicalProcess
regulation of GTPase activity                             GO_BiologicalProcess
RNA binding                                               GO_MolecularFunction
vesicle-mediated transport                                GO_BiologicalProcess

Figure 8. Overview chart of IBD biological terms generated by ClueGo. The percentage of genes associated with a specific term is also displayed.

4.5. Identification of modules

The final step in this research was to determine overlapping network modules. To explore Cytoscape and its available plug-ins, the ModuLand plug-in was used to identify shared modules. The plug-in was downloaded through the Apps section of Cytoscape v3.8.0.

First, the disease networks were imported into Cytoscape and ModuLand was run on them as unweighted networks. In the ModuLand dialog box, the option to use weights for module assignment was selected instead of the LinkLand centrality calculation, following the recommendation in the ModuLand plug-in documentation. LinkLand centrality calculates the influence zone of each edge in the network and then sums all influence zones (Kovács et al., 2010). With the selected option this calculation is skipped, and the overlapping modules are instead calculated directly from the network weights, which are used as community centrality values. Moreover, this option is applied at higher hierarchical levels rather than at the base level. The option to reduce the number of links was also selected. It allows ModuLand to omit very small overlaps (caused by nodes that belong to a given module by less than 1%) when a new hierarchical level is created, so that fewer weak links are generated (Kovács et al., 2010).
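The effect of omitting very small overlaps can be illustrated with a minimal Python sketch. It is not ModuLand's implementation: the input layout (`memberships`), the node and module names, the toy weights, and the overlap measure (the sum over shared nodes of the smaller membership fraction) are all assumptions made for illustration. Only the 1% threshold comes from the text above.

```python
from itertools import combinations


def prune_memberships(memberships, min_fraction=0.01):
    """Drop module assignments below min_fraction of a node's total weight.

    memberships: {node: {module: weight}} (hypothetical input layout).
    Returns the kept assignments as fractions of each node's total weight.
    """
    pruned = {}
    for node, weights in memberships.items():
        total = sum(weights.values())
        if total <= 0:
            continue
        kept = {m: w / total for m, w in weights.items() if w / total >= min_fraction}
        if kept:
            pruned[node] = kept
    return pruned


def module_overlaps(memberships):
    """Assumed overlap measure: for every module pair, sum over shared nodes
    the smaller of the two membership fractions."""
    overlaps = {}
    for weights in memberships.values():
        for a, b in combinations(sorted(weights), 2):
            overlaps[(a, b)] = overlaps.get((a, b), 0.0) + min(weights[a], weights[b])
    return overlaps


# Toy data (invented): n1 belongs to M2 by only 0.5%, so that link is omitted.
toy = {
    "n1": {"M1": 0.995, "M2": 0.005},
    "n2": {"M1": 0.6, "M2": 0.4},
    "n3": {"M2": 1.0},
}
pruned = prune_memberships(toy)
links = module_overlaps(pruned)
print(links)
```

In this toy case the meta-edge between M1 and M2 is supported only by n2 after pruning, which is how a reduced link count at the next hierarchical level can arise.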

For PSC, ModuLand found eight modules and counted 244 overlaps between module pairs at the first hierarchical level of the network. After reducing the number of links, level 2 showed one module and two links. The hierarchical modular structure is shown in Figure 12. ModuLand builds successively higher hierarchical levels until a single meta-node remains (see Figure 12B). Meta-nodes of a higher level represent modules of the level below. In terms of visualization, nodes are assigned the color of the module they belong to. When hierarchical layers are created, nodes refer to the modules of the original level, and edges are weighted according to the module overlaps.

The PBC gene network showed eight modules and 1908 links at the first level of hierarchy. Two modules were recorded at the second level with four overlaps as shown in Figure 13. Each node is represented by the color of the module it belongs to.

ModuLand found nine modules with 4386 links for IBD. The second hierarchical level contained three modules and eight links. For all diseases, fewer meta-nodes and meta-edges were recorded at higher levels of the hierarchy. The IBD hierarchical modular structure is shown in Figure 14.

Figure 9. PSC network module generated through ModuLand plug-in (A) level 1 hierarchy of the network with 8 modules. (B) level 2 hierarchy of the network and 1 displayed module. Each node is represented by the color of the module it belongs to. Edges represent associations.

Figure 10. PBC network module (A) level 1 hierarchy of the network generated via ModuLand where 8 modules are shown. (B) level 2 hierarchy of the network showing 2 modules. Each node is represented by the color of the module it belongs to. Edges represent associations.

Figure 11. IBD network module (A) level 1 hierarchy generated via ModuLand where 9 modules are displayed. (B) level 2 hierarchy with 3 modules. Each node is represented by the color of the module it belongs to. Edges represent associations.

5. Discussion

The purpose of this study was to apply a network module-based approach to the data of Ostrowski et al. (2019) and to examine whether additional information about biological processes, pathways, and molecular functions can be inferred from the results. Analyzing these results can help to link IBD with the hepatobiliary disorders PBC and PSC.

Hepatobiliary diseases are common manifestations of IBD, including CD and UC. The intimate crosstalk between the gut and the liver has highlighted the possibility of common mechanisms in the pathogenesis of immune liver diseases and IBD (Li et al., 2017). Several studies have pointed out that the gut-liver axis plays a role in the progression of chronic inflammatory gut and liver diseases, and that elucidating local and systemic regulatory mechanisms can potentially lead to the development of therapeutic targets (Atif et al., 2018). Primary sclerosing cholangitis (PSC) has been reported as the most frequent hepatobiliary disorder in IBD patients, particularly in UC. Primary biliary cholangitis (PBC), on the other hand, is considered a less frequent IBD-associated hepatobiliary disorder (Fousekis et al., 2018). To date, the etiology of PSC is unclear, and several hypotheses have been suggested, chiefly alterations in the gut microbiota, viral infections, and genetic abnormalities. A twin study has shown that UC and PSC share responsible genes and multiple genetic factors (Fousekis et al., 2018). In addition, three UC susceptibility loci, with the candidate genes REL, IL2, and CARD9, have been found to be associated with PSC. In some cases UC is also accompanied by PBC, which raises the question of whether the two conditions share a similar pathogenesis, since both are autoimmune diseases.

In this study, molecular alterations were evaluated through functional analysis and by relating the results to previously published studies. The first step was to derive differentially expressed genes in all three conditions: PBC, PSC, and IBD. This yielded 583 PBC, 146 PSC, and 928 IBD significantly differentially expressed genes, with 49 genes overlapping in all three conditions. These numbers differ from those of Ostrowski et al. (2019), who obtained larger gene lists: with a p-value cutoff of 0.05, they found 4026, 2650, and 4967 genes differentially expressed between healthy controls and patients with PBC, PSC, and IBD, respectively. In this study, a more stringent cutoff (adjusted p<0.0001) was applied, which led to fewer significant genes. The smaller gene lists made it feasible to use the STRING database to construct protein-protein interaction networks.
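The filtering and overlap step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the Benjamini-Hochberg procedure is assumed as the p-value adjustment (this section does not name the method), the gene names and p-values are invented, and only the 0.0001 cutoff is taken from the text.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that monotonicity is enforced.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def significant_genes(table, cutoff=1e-4):
    """table: {gene: raw p-value}. Returns genes with adjusted p below cutoff."""
    genes = list(table)
    adjusted = benjamini_hochberg([table[g] for g in genes])
    return {g for g, q in zip(genes, adjusted) if q < cutoff}


# Invented toy p-values for three disease-versus-control comparisons.
pbc = {"g1": 1e-6, "g2": 2e-5, "g3": 0.03, "g4": 0.5}
psc = {"g1": 1e-5, "g2": 0.2, "g5": 1e-7, "g6": 0.9}
ibd = {"g1": 3e-6, "g2": 1e-5, "g5": 2e-6, "g7": 0.4}

# Genes significant in all three comparisons.
common = significant_genes(pbc) & significant_genes(psc) & significant_genes(ibd)
print(sorted(common))
```

Raising or lowering `cutoff` shows directly how the stringency of the threshold changes both the per-disease lists and the size of their intersection.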

Genes common to all three conditions were analyzed further. The most enriched biological pathway was CXCR4-mediated signaling events: as shown in Figure 4, 21.1% of the common genes participate in this pathway, the highest share of any pathway. The chemokine receptor CXCR4 belongs to the large superfamily of G protein-coupled receptors and plays a major role in the immune response as well as in other biological processes (Busillo and Benovic, 2007). CXCR4 is part of the CXCR4/CXCR7/CXCL12 complex, in which CXCR4 and CXCR7 act as receptors for the chemokine CXCL12.

Werner et al. (2013) reviewed the role of the chemokine CXCL12 and its receptors, CXCR4 and CXCR7, in inflammation. The review focused on the intestinal inflammation occurring in IBD patients and described defects in chemokine and pattern recognition receptors expressed by epithelial cells. Moreover, alterations in the chemokine-mediated influx of inflammatory cells suggest a role in disease pathogenesis. Borchers et al. (2009) indicated that CXCR4 is among the most strongly upregulated genes in PBC and PSC liver specimens compared to normal liver. These findings might suggest a common role for the CXCR4 gene in the pathogenesis of the three compared diseases: PBC, PSC, and IBD.

Signal transduction is one of the over-represented biological processes for the common genes in this study. As shown in Figure 5, 28.3% of the genes play a role in signal transduction, which parallels the conclusions of Ostrowski et al. (2019). The second most enriched biological process is cell communication (see Figure 5). This term is too general and does not specify which sub-processes the genes belong to. Other enrichment tools might have specified which communication process is taking place, and those results could then have been compared with the findings of Ostrowski et al. (2019). Specific biological processes and pathways are needed to help researchers identify potential targets.
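A percentage of genes per term says little by itself about over-representation, because broad terms annotate many genes in any list. The sketch below shows a one-sided hypergeometric test of the kind commonly used for this purpose. It is a generic illustration, not the exact statistic computed by FunRich or Enrichr, and all counts in the usage example are hypothetical.

```python
from math import comb


def hypergeom_enrichment(k, n, K, N):
    """One-sided hypergeometric p-value P(X >= k).

    k: genes in the query list annotated with the term
    n: size of the query list
    K: genes in the background annotated with the term
    N: size of the background
    """
    upper = min(n, K)
    total = comb(N, n)
    # math.comb returns 0 when more items are chosen than are available,
    # so impossible configurations contribute nothing to the tail sum.
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total


# Hypothetical counts: 10 of 49 query genes carry a term that annotates
# 200 of 20000 background genes.
p_specific = hypergeom_enrichment(10, 49, 200, 20000)
# The same 10 of 49 genes for a broad term annotating 4000 background genes.
p_broad = hypergeom_enrichment(10, 49, 4000, 20000)
print(p_specific, p_broad)
```

With identical counts in the query list, the broad term yields a much larger p-value than the specific one, which is one reason umbrella terms such as cell communication are hard to interpret.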

Cell communication typically involves three basic types: surface membrane to surface membrane; exterior, which is between receptors on the cell; and direct communication, where signals are transmitted to a cell (Alberts et al., 2002). In the study by Ostrowski et al. (2019), processes related to ATP synthesis were reported as the most enriched biological processes among the genes common to PBC, PSC, and IBD. The umbrella term cell communication generated by FunRich (Figure 5) can be linked to the ATP-related processes that Ostrowski et al. (2019) reported for the studied diseases.

Another enriched biological pathway shared among the three diseases is the sphingosine-1-phosphate (S1P) pathway. As shown in Figure 2, 30.19% of the PSC genes, 28.64% of the PBC genes, and 27.56% of the IBD genes were involved in this pathway. S1P receptor (S1PR) signaling regulates numerous processes that are important for the immune system (Donoviel et al., 2015). Spinster 2 (Spns2) transports S1P out of cells, and its deletion in mice protected them from the development of experimental autoimmune disease. Tsai and Han (2016) suggest that the S1P pathway has a role in the pathogenesis of autoimmune diseases. The enrichment of this pathway is therefore consistent with previous studies linking the S1P pathway to autoimmune diseases. Since PBC, PSC, and IBD are autoimmune diseases, this pathway should be analyzed further in future research.

ClueGo provided an illustration of the functional network with the most significant terms per group, as well as an overview of pathways displayed in a pie chart. The three functional networks for the PBC, PSC, and IBD genes showed clusters belonging to the oxidative phosphorylation pathway (see Figures 9, 10, and 11). Oxidative phosphorylation is the process by which ATP is formed in mitochondria. Previous studies have reported that mitochondrial dysfunction, including oxidative stress and impaired ATP production, is strongly linked to IBD (Novak and Mollen, 2015). As shown in Figure 9, 80% of the PSC genes are involved in the oxidative phosphorylation pathway, making PSC the disease with the greatest participation in this pathway.

Based on enrichment of genes differentially expressed in pairwise comparisons between the disease groups and healthy controls, Ostrowski et al. (2019) found that GTPase-mediated processes are among the major enriched terms. The ClueGo results (Figures 10B and 11B) show that a large part of the PBC and IBD genes play a role in GTPase-mediated processes. However, this process was not reported for the PSC genes.

For the PSC significant genes, some of the biological processes and pathways derived from different databases were related. FunRich reported the energy pathways process (see Figure 3), and Enrichr reported ATP synthesis as the most enriched Gene Ontology process among the PSC genes, which matches the FunRich result.

Hierarchical layers of overlapping network modules were determined with the ModuLand plug-in in Cytoscape. This tool assigns module cores and predicts the function of the whole module. Key nodes linking two or more modules are also displayed in the resulting network (Szalay-Beko et al., 2012).

For the PSC module network, the UQCRH protein was presented as a meta-node and linked to the RPS15A protein (see Figure 12). UQCRH, ubiquinol-cytochrome c reductase hinge protein, is expressed in the mitochondrial membrane and can potentially induce mitochondrial reactive oxygen species (ROS). Ostrowski et al. (2019) mentioned that immune and inflammatory responses can lead to the accumulation of ROS, endoplasmic reticulum (ER) stress, and mitochondrial dysfunction. These alterations eventually initiate cholangiocyte cytotoxicity, which in turn is a major component of PBC and PSC.

RPS15, which is displayed in the middle of the connected modules, is a ribosomal protein that might have a connection with autoimmune diseases. Philip et al. (2010) proposed the hypothesis that ribosomal proteins targeted in autoimmune disease are also targets of anti-tumor T-cell responses. Since several studies classify PBC and PSC as autoimmune diseases, this hypothesis can be studied further for RPS15. The third node, GRB2, is a protein involved in signal transduction and cell communication. As mentioned earlier, signal transduction is one of the enriched biological processes in PSC, PBC, and IBD.

The PBC network module shows a linkage between three nodes at the first hierarchical level (Figure 13). One of the nodes is RPLP0, a ribosomal protein that is a component of the 60S subunit (Artero-Castro et al., 2011). Ostrowski et al. (2019) reported “structural constituent of ribosome” as one of the over-represented GO terms common to PSC, PBC, and IBD. In the generated network module, RPLP0 is linked to COPS5 and NHP2L1, which might suggest a network of communication between the three proteins.

The five yellow nodes shown in the IBD network module (Figure 14) form a network of functional linkage between the five proteins MTOR, CRK, TLR8, ACTB, and ALDOA. MTOR is a kinase involved in several cellular processes, such as autophagy and the regulation of cell growth and proliferation. The MTOR protein has been implicated in several autoimmune diseases in which autophagy is altered. It is still unclear how dysfunctional autophagy causes continuous inflammation in patients with autoimmune disease, and more efforts are needed to elucidate the underlying mechanism of the altered autophagy function (Yang et al., 2015). Ostrowski et al. (2019) mentioned that innate immune and inflammatory responses may contribute to mitochondrial autophagy when the bile duct and intestinal defense systems are defective.

FunRich provided terms that are too broad and can include multiple sub-processes. For example, signal transduction accounted for a large percentage of the PSC genes, but without further specificity. In contrast, the Gene Ontology and KEGG pathway results showed that ATP synthesis and oxidative phosphorylation, respectively, are the most enriched terms for the PSC genes. This might suggest that the signal transduction process reported in Figure 3 is related to the terms that Enrichr retrieved from the KEGG database and Gene Ontology. Furthermore, 27.91% of the PSC genes, 23.47% of the PBC genes, and 22.85% of the IBD genes were displayed as involved in an unknown biological process.

Using different enrichment tools with different background databases can generate different results with varying terminologies. For example, integrin family cell surface interactions was the most enriched biological pathway for the PBC genes based on the FunRich analysis, whereas the GO database in Enrichr highlighted translation as the most enriched biological process. In fact, integrins play a crucial role when proteins are being translated. In the ClueGo pie chart in Figure 10, the greatest number of genes participated in the translation process. This result was also based on the GO terms integrated in ClueGo.

Differences in results can also arise because several databases do not update their data regularly. For example, the FunRich database was last updated in 2018, whereas the KEGG database had a much more recent update in April 2020. This can also explain the poor significance generated by FunRich. An advantage of ClueGo is that it offers the user an option to update the terms, which ensures the retrieval of up-to-date information. Moreover, differences in the generated results could be partially due to gene identifiers that are "valid" in one tool but not in another.
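One way to check how much of a discrepancy is due to identifiers is to compare the query list against the identifiers that each tool recognizes before running the enrichment. The following is a minimal sketch: the tool names and background sets are placeholders, not the real contents of any database. NHP2L1 is used as the example because it is an older symbol for the gene now named SNU13, so a resource that only holds current symbols would not match it.

```python
def identifier_coverage(genes, backgrounds):
    """Report which query identifiers each tool recognizes.

    genes: iterable of gene symbols from the query list.
    backgrounds: {tool_name: set of identifiers known to that tool}
                 (hypothetical stand-ins for the tools' real databases).
    """
    query = {g.strip().upper() for g in genes if g.strip()}
    report = {}
    for tool, known in backgrounds.items():
        known_upper = {k.upper() for k in known}
        matched = query & known_upper
        report[tool] = {
            "matched": sorted(matched),
            "unmatched": sorted(query - known_upper),
            "coverage": len(matched) / len(query) if query else 0.0,
        }
    return report


# Placeholder backgrounds: tool_a still holds the older symbol NHP2L1,
# tool_b only holds the current symbol SNU13.
backgrounds = {
    "tool_a": {"NHP2L1", "COPS5", "RPLP0"},
    "tool_b": {"SNU13", "COPS5", "RPLP0"},
}
report = identifier_coverage(["NHP2L1", "COPS5", "RPLP0"], backgrounds)
for tool, stats in report.items():
    print(tool, stats["coverage"], stats["unmatched"])
```

Unmatched identifiers can then be mapped to current symbols before the analysis, so that differences between tools reflect their annotations and not silently dropped genes.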

A further limitation of this project is the use of a single module-based approach (ModuLand) to bridge IBD, PBC, and PSC. Relying on a single module-based tool might have missed key genes in the three comparisons. Future studies can therefore use several module-based methods to identify shared modules. Different tools rest on different assumptions, such as network topology or gene expression. Modules that are identified by several tools can be regarded with greater confidence and are less likely to be false positives.
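Comparing modules from several tools can be reduced to a set-overlap problem. The sketch below pairs modules from two methods by their Jaccard index and keeps the shared genes. The threshold of 0.5, the module names, and the gene labels are illustrative assumptions, and the second method is a placeholder for any other module-detection tool.

```python
def jaccard(a, b):
    """Jaccard index of two gene collections (0.0 if both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def consensus_modules(modules_a, modules_b, threshold=0.5):
    """Pair modules from two methods whose Jaccard index reaches threshold.

    modules_a, modules_b: {module_name: set of genes}.
    Returns (name_a, name_b, score, shared_genes) tuples, best match first.
    """
    matches = []
    for name_a, genes_a in modules_a.items():
        for name_b, genes_b in modules_b.items():
            score = jaccard(genes_a, genes_b)
            if score >= threshold:
                shared = sorted(set(genes_a) & set(genes_b))
                matches.append((name_a, name_b, score, shared))
    return sorted(matches, key=lambda m: -m[2])


# Invented modules from two hypothetical module-detection methods.
method_a = {"A1": {"g1", "g2", "g3", "g4"}, "A2": {"g5", "g6"}}
method_b = {"B1": {"g2", "g3", "g4", "g7"}, "B2": {"g8", "g9"}}
matches = consensus_modules(method_a, method_b)
print(matches)
```

Genes that appear in a matched pair are supported by both methods, whereas modules without a counterpart (A2 and B2 here) would need additional evidence.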

In conclusion, several autoimmune diseases, such as the chronic liver disease PBC, are not yet fully understood. Further efforts are needed to elucidate the common pathways and processes involved in the pathogenesis of the autoimmune diseases PBC, PSC, and IBD. Constructing functional modules and highlighting the modules shared by all three diseases can shed light on possible shared mechanisms and further support earlier discoveries. In addition, microarray data can support research into the molecular mechanisms underlying disease through the use of different enrichment tools and the generation of functional modules.

6. Ethical Aspects

The principle of data sharing has a long history in many areas of research. In science, data-sharing communities exist in diverse forms. Several data repositories, such as the Gene Expression Omnibus (GEO), are publicly available and require no authentication. Guidelines set by the NCBI, which houses a series of databases, state that there are no restrictions on the use or distribution of GEO data. Therefore, the use of the GEO data set GSE119600 required no prior approval or declaration (Gene Expression Omnibus [GEO], 2016).

Concerning copyright status, researchers who reproduce, redistribute, or make commercial use of data found on GEO are obliged to follow the terms and conditions asserted by the copyright holder (Gene Expression Omnibus [GEO], 2016). In addition, the submitters of the GEO data set GSE119600 have published the original article as open access (OA) (Ostrowski et al., 2019). Open access is a principle adopted by the research community that grants researchers and users in general free online access without barriers (PubMed Central [PMC], 2019). Furthermore, the study by Ostrowski et al. (2019) is licensed under a Creative Commons Attribution 4.0 International License, which allows the use, sharing, adaptation, distribution, and reproduction of the study with citation of the original authors. The research of Ostrowski et al. (2019) has been credited throughout this study.

Whether data should be re-used by the community for secondary analysis is a matter of debate (Frey et al., 2020). Supporters back the re-use of human data for several reasons. First, patients participating in studies have usually consented to the re-use of their data. Second, when human data are re-used, their value increases and the relative cost decreases. Moreover, the sharing of data sets leads to statistical robustness (Frey et al., 2020). Studies based on previously published data can achieve just as much impact as original projects (Tenopir et al., 2015).

One of the main reasons why researchers refrain from reusing publicly available data sets is that many data repositories do not re-examine submitted data. For example, the NCBI does not verify the validity, quality, or biological significance of submitted data (Gene Expression Omnibus [GEO], 2016). Appropriate data quality is crucial for users, especially if clinical trials are involved, where results could have a direct impact on the patients. It is important to note that the supportive findings discussed in this research are not intended to be direct biomarkers or drug targets for IBD, PBC, and PSC. It cannot be assured that the quality of the data retrieved from GEO has been reviewed and assessed by the NCBI. Moreover, the human data used by Ostrowski et al. (2019) are of European origin, which may affect the results when they are compared with previous findings based on other ethnicities.

7. Impact on Society and Future Directions

Concerning the scientific contribution, the supportive findings of this study can encourage other researchers to examine the results further and carry out new research on the analyzed diseases. For example, our results supporting earlier findings on mitochondrial and autophagy dysfunction can be studied further in IBD, PBC, and PSC. Moreover, rather than relying only on differentially expressed genes to draw conclusions, scientists can also apply module-based techniques to examine functional alterations in the studied diseases. In addition, this project evaluated the use of ModuLand for constructing hierarchical layers and highlighting overlapping modules. Using the ModuLand plug-in to understand complex diseases is still new and needs further evaluation. Finally, forming a solid conclusion relies on multiple hypothesis testing and method analysis. Multiple testing can also shed light on novel biomarkers.

It is now clear that the incidence and prevalence of IBD are increasing worldwide. To date, it is not fully understood how IBD progresses and co-exists with hepatobiliary diseases (Ng et al., 2018). Future validation of module-based approaches can potentially improve the quality of life of patients with IBD, PBC, and PSC through biomarker identification. Moreover, such approaches can help to predict comorbidity and thereby alert clinicians and patients. Early clinical tests could perhaps be run for patients once scientists have sufficient information about disease alterations and shared mechanisms. Future treatments for complex diseases can also be developed when disease pathologies and their co-existence are understood through module-based approaches. Together, these developments could ease the suffering of patients and improve their quality of life.

Larger human data sets should be obtained from different ethnicities to ensure the reliability of results. Using raw data can also improve the outcome of such studies, because different pre-processing stages, such as normalization algorithms, can then be examined. In addition, future studies should consider developing a pipeline that automates the process by connecting different computational tools. A good example is clinical decision support systems, which can assist clinical decision making about individual patients in the future.

8. References

1. Atif M., Warner S., and Oo YH. (2018) Linking the gut and liver: crosstalk between regulatory T cells and mucosa-associated invariant T cells. Hepatol Int. 12(4):305-314.
2. Anjum A., Jaggi S., Varghese E., Lall S., Bhowmik A., and Rai A. (2016). Identification of Differentially Expressed Genes in RNA-seq Data of Arabidopsis thaliana: A Compound Distribution Approach. Journal of computational biology: a journal of computational molecular cell biology. 23(4), 239–247.
3. Artero-Castro A., Castellvi J., García A., Hernández J., Ramón y Cajal S., and Lleonart ME. (2011) Expression of the ribosomal proteins Rplp0, Rplp1, and Rplp2 in gynecologic tumors. Hum Pathol. 42(2):194-203.
4. Alberts B., Johnson A., Lewis J., et al. Molecular Biology of the Cell. 4th edition. New York: Garland Science (2002). General Principles of Cell Communication. Available from: https://www.ncbi.nlm.nih.gov/books/NBK26813/.
5. Bindea G., Mlecnik B., Hackl H., Charoentong P., Tosolini M., Kirilovsky A., Fridman WH., Pages F., Trajanoski Z. and Galon J. (2009) ClueGO: a Cytoscape plug-in to decipher functionally grouped gene ontology annotation networks. Bioinformatics. 25(8):1091-1093.
6. Barnes, E. L., Liew, C. C., Chao, S., & Burakoff, R. (2015). Use of blood based biomarkers in the evaluation of Crohn's disease and ulcerative colitis. World journal of gastrointestinal endoscopy, 7(17), 1233–1237.
7. Bader GD., and Hogue CH. (2003) An automated method for finding molecular complexes in large protein interaction networks. BMC Bioinformatics. 4(2).
8. Busillo J. M., and Benovic J. L. (2007). Regulation of CXCR4 signaling. Biochimica et biophysica acta, 1768(4), 952–963.
9. Borchers, A. T., Shimoda, S., Bowlus, C., Keen, C. L., & Gershwin, M. E. (2009). Lymphocyte recruitment and homing to the liver in primary biliary cirrhosis and primary sclerosing cholangitis. Seminars in immunopathology, 31(3), 309–322.
10. Bindea G. and Mlecnik B. (2013) ClueGO Documentation. Retrieved 18 March 2020 from http://www.ici.upmc.fr/cluego/ClueGODocumentation2013.pdf.
11. Benito-Martin A., Peinado H. (2015) FunRich proteomics software analysis, let the fun begin! Proteomics. 15(15):2555-2556.
12. Chen EY, Tan CM, Kou Y, Duan Q, Wang Z, Meirelles GV, Clark NR, Ma'ayan A. (2013) Enrichr: interactive and collaborative HTML5 gene list enrichment analysis tool. BMC Bioinformatics. 14:128.
13. Chatr-Aryamontri A., Oughtred R., Boucher L., Rust J., Chang C., Kolas N. K., O'Donnell L., Oster S., Theesfeld C., Sellam A., Stark C., Breitkreutz B. J., Dolinski K., and Tyers M. (2017). The BioGRID interaction database: 2017 update. Nucleic acids research, 45(D1), D369–D379.
14. Cao J., and Zhang S. (2014). A Bayesian extension of the hypergeometric test for functional enrichment analysis. Biometrics, 70(1), 84–94.
15. Donoviel M. S., Hait N. C., Ramachandran S., Maceyka M., Takabe K., Milstien S., Oravecz T., and Spiegel S. (2015). Spinster 2, a sphingosine-1-phosphate transporter, plays a critical role in inflammatory and autoimmune diseases. FASEB journal: official publication of the Federation of American Societies for Experimental Biology, 29(12), 5018–5028.

16. D'Andrea D., Grassi L., Mazzapioda M., and Tramontano A. (2013). FIDEA: a server for the functional interpretation of differential expression analysis. Nucleic acids research, 41(Web Server issue), W84–W88.
17. Efron B., Tibshirani R., Goss V., and Chu G. (2000). Microarrays and their use in a comparative experiment. Technical Report 2000-37B/213 Stanford University.
18. Fousekis F. S., Theopistos V. I., Katsanos K. H., Tsianos E. V., and Christodoulou D. K. (2018). Hepatobiliary Manifestations and Complications in Inflammatory Bowel Disease: A Review. Gastroenterology research, 11(2), 83–94.
19. Frey K., Hafner A., and Pucker B. (2020) The reuse of public datasets in the life sciences: potential risks and rewards.
20. Franceschini A., Szklarczyk D., Frankild S., Kuhn M., Simonovic M., Roth A., Lin J., Minguez P., Bork P., von Mering C., and Jensen L. J. (2013). STRING v9.1: protein-protein interaction networks, with increased coverage and integration. Nucleic acids research, 41(Database issue), D808–D815.
21. Fakhoury M., Negrulj R., Mooranian A., and Al-Salami H. (2014) Inflammatory bowel disease: clinical aspects and treatments. J Inflamm Res. (7).
22. Gene Expression Omnibus. (2016) GEO Disclaimer. Retrieved data from https://www.ncbi.nlm.nih.gov/geo/info/disclaimer.html.
23. Geremia A., Arancibia-Cárcamo CV., Fleming MP., Rust N., Singh B., Mortensen NJ., Travis SP. and Powrie F. (2011) IL-23-responsive innate lymphoid cells are increased in inflammatory bowel disease. The Journal of Experimental Medicine. 208(6):1127-33.
24. Gustafsson M., Nestor C. E., Zhang H., Barabási A. L., Baranzini S., Brunak S., Chung K. F., Federoff H. J., Gavin A. C., Meehan R. R., Picotti P., Pujana M. À., Rajewsky N., Smith K. G., Sterk P. J., Villoslada P., and Benson M. (2014). Modules, networks and systems medicine for understanding disease and aiding diagnosis. Genome medicine, 6(10), 82.
25. Ghiassian S. D., Menche J., and Barabási A. L. (2015). A DIseAse MOdule Detection (DIAMOnD) algorithm derived from a systematic analysis of connectivity patterns of disease proteins in the human interactome. PLoS computational biology, 11(4), e1004120.
26. Garcia-Vaquero ML., Gama-Carvalho M., Rivas D., and Pinto FR. (2018) Searching the overlap between network modules with specific betweeness (S2B) and its application to cross-disease analysis. Scientific Reports.
27. Huang DW., Sherman BT., Tan Q., Collins JR., Alvord WG., Roayaei J., Stephens R., Baseler MW., Lane HC. and Lempicki RA. (2007) The DAVID Gene Functional Classification Tool: a novel biological module-centric algorithm to functionally analyze large gene lists. Genome Biol. 8(9): R183.
28. Ideker T., and Sharan R. (2008). Protein networks in disease. Genome research, 18(4), 644–652.
29. Jeanmougin M., de Reynies A., Marisa L., Paccard C., Nuel G., and Guedj M. (2010). Should we abandon the t-test in the analysis of gene expression microarray data: a comparison of variance modeling strategies. PloS one, 5(9), e12336.
30. Kuleshov MV, Jones MR, Rouillard AD, Fernandez NF, Duan Q, Wang Z, Koplev S, Jenkins SL, Jagodnik KM, Lachmann A, McDermott MG, Monteiro CD, Gundersen GW, Ma'ayan A. (2016) Enrichr: a comprehensive gene set enrichment analysis web server 2016 update. Nucleic Acids Research. gkw377.
31. Kovács I.A., Palotai R., Szalay M.S. and Csermely P. (2010) Community landscapes: a novel, integrative approach for the determination of overlapping network modules. PLoS ONE 5(9): e12528.

32. Leiserson M. D., Wu H. T., Vandin F., and Raphael B. J. (2015). CoMEt: a statistical approach to identify combinations of mutually exclusive alterations in cancer. Genome biology. 16(1), 160.
33. Li X., Shen J., and Ran Z. (2017) Crosstalk between the gut and the liver via susceptibility loci: Novel advances in inflammatory bowel disease and autoimmune liver disease. Clinical Immunology. (175).
34. Lin C, et al. Xiaohua H, Pan Y. (2006) Clustering methods in protein-protein interaction network, Knowledge Discovery in Bioinformatics: Techniques, Methods and Application, John Wiley & Sons.
35. Lin E. S., Chang W. A., Chen Y. Y., Wu L. Y., Chen Y. J., and Kuo P. L. (2019). Deduction of Novel Genes Potentially Involved in Keratinocytes of Type 2 Diabetes Using Next-Generation Sequencing and Bioinformatics Approaches. Journal of clinical medicine, 8(1), 73.
36. Liberal R., Gaspar R., Lopes S. and Macedo G. (2019) Primary biliary cholangitis in patients with inflammatory bowel disease. Clin Res Hepatol Gastroenterol. 44(1).
37. Lotia S., Montojo J., Dong Y., Bader GD., and Pico AR. (2013) Cytoscape App Store. Systems Biology. 29(10). pp: 1350–135.
38. Liao Y., Wang J., Jaehnig E. J., Shi Z., and Zhang B. (2019). WebGestalt 2019: gene set analysis toolkit with revamped UIs and APIs. Nucleic acids research, 47(W1), W199–W205.
39. Ma H., Gao G., and Weber G. M. (2018). Use of DAVID algorithms for clustering custom annotated gene lists in a non-model organism, rainbow trout. BMC research notes, 11(1), 63.
40. Morris H. Apeltsin L., Newman AM., Baumbach J., Wittkop T., Su W., Bader GD. and Ferrin TH. (2011) ClusterMaker: A multi-algorithm clustering plugin for Cytoscape. BMC Bioinformatics.
41. Ma'ayan A. (2011). Introduction to network analysis in systems biology. Science signaling, 4(190), tr5.
42. Momozawa Y., Dmitrieva J., Théâtre E., Deffontaine V., Rahmouni S., Charloteaux B., Crins F., Docampo E., Elansary M., Gori AS., Lecut C., Mariman R., Mni M., Oury C., Altukhov I., Alexeev D., Aulchenko Y., Amininejad L., Bouma G., Hoentjen F., Löwenberg M., Oldenburg B., Pierik MJ., Vander Meulen-de Jong AE., Janneke van der Woude C., Visschedijk MC., International IBD Genetics Consortium, Lathrop M., Hugot JP., Weersma RK., De Vos M., Franchimont D., Vermeire S., Kubo M., Louis E. and Georges M. (2018) IBD risk loci are enriched in multigenic regulatory modules encompassing putative causative genes. Nat Commun. 9(1):2427.
43. Marchioni Beery R., Vaziri H., and Forouhar F. (2014). Primary Biliary Cirrhosis and Primary Sclerosing Cholangitis: A Review Featuring a Women's Health Perspective. Journal of clinical and translational hepatology, 2(4), 266–284.
44. National Research Council (US) Board on Biology; Pool R, Esnayra J, editors. Bioinformatics: Converting Data to Knowledge: Workshop Summary. Washington (DC): National Academies Press (US); 2000. Barriers to the Use of Databases. Available from: https://www.ncbi.nlm.nih.gov/books/NBK44936/.
45. Novak E. A., and Mollen K. P. (2015). Mitochondrial dysfunction in inflammatory bowel disease. Frontiers in cell and developmental biology, 3, 62.
46. Nazarieh M. and Helms V. (2019) TopControl: A Tool to Prioritize Candidate Disease-associated Genes based on Topological Network Features. Sci Rep 9(19472).
47. Ng SC, Shi HY, Hamidi N, Underwood E.F., Tang W., Benchimol E.I., Pannacione R., Ghosh S., Wu J., Chan F., Sung J., and Kaplan J. (2018) Worldwide incidence and prevalence of inflammatory bowel disease in the 21st century: a systematic review of population-based studies. Lancet. 390(10114):2769-2778.


48. Ostrowski J., Goryca K., Lazowska I., Rogowska A., Paziewska A., Dabrowska M., Ambrozkiewicz F., Karczmarski J., Balabas A., Kluska A., Piatkowska M., Zeber-Lubecka N., Kulecka M., Habior A., Mikula M., Polis (2019) Common functional alterations identified in blood transcriptome of autoimmune cholestatic liver and inflammatory bowel diseases. Sci Rep. 9(1):7190.
49. Oughtred R., Stark C., Breitkreutz B. J., Rust J., Boucher L., Chang C., Kolas N., O'Donnell L., Leung G., McAdam R., Zhang F., Dolma S., Willems A., Coulombe-Huntington J., Chatr-Aryamontri A., Dolinski K., and Tyers M. (2019). The BioGRID interaction database: 2019 update. Nucleic acids research, 47(D1), D529–D541.
50. Peart M. J., Smyth G. K., van Laar R. K., Bowtell D. D., Richon V. M., Marks P. A., Holloway A. J., and Johnstone R. W. (2005). Identification and functional significance of genes regulated by structurally different histone deacetylase inhibitors. Proceedings of the National Academy of Sciences of the United States of America, 102(10), 3697–3702.
51. Paziewska, A., Habior, A., Rogowska, A., Zych, W., Goryca, K., Karczmarski, J., Dabrowska, M., Ambrozkiewicz, F., Walewska-Zielecka, B., Krawczyk, M., Cichoz-Lach, H., Milkiewicz, P., Kowalik, A., Mucha, K., Raczynska, J., Musialik, J., Boryczka, G., Wasilewicz, M., Ciecko-Michalska, I., Ferenc, M., Ostrowski, J. (2017). A novel approach to genome-wide association analysis identifies genetic associations with primary biliary cholangitis and primary sclerosing cholangitis in Polish patients. BMC medical genomics, 10(1), 2.
52. Pathan M., Keerthikumar S., Ang CS., Gangoda L., Quek CY., Williamson NA., Mouradov D., Sieber OM., Simpson RJ., Salim A., Bacic A., Hill AF., Stroud DA., Ryan MT., Agbinya JI., Mariadason JM., Burgess AW., and Mathivanan S. (2015) FunRich: An open access standalone functional enrichment and interaction network analysis tool. Proteomics. 15(15):2597-601.
53. Pathan M., Keerthikumar S., Chisanga D., Alessandro R., Ang C.S., Askenase P., Batagov A. O., Benito-Martin A., Camussi G., Clayton A., Collino F., Di Vizio D., Falcon-Perez J. M., Fonseca P., Fonseka P., Fontana S., Gho Y. S., Hendrix A., Hoen E. N., Iraci N., and Mathivanan S. (2017). A novel community driven software for functional enrichment analysis of extracellular vesicles data. Journal of extracellular vesicles, 6(1), 1321455.
54. Pan, Y., Liu, G., Yuan, Y., Zhao, J., Yang, Y., & Li, Y. (2017). Analysis of differential gene expression profile identifies novel biomarkers for breast cancer. Oncotarget, 8(70), 114613–114625.
55. Pan W., Lin J., and Le CT. (2003) A mixture model approach to detecting differentially expressed genes with microarray data. Funct Integr Genomics. 3(3):117–124.
56. Pavlovic, V. (on 28 07 2009). MCODE Documentation. Retrieved from baderlab.org: http://www.baderlab.org/Software/MCODE/UsersManual.
57. Philip, M., Schietinger, A., & Schreiber, H. (2010). Ribosomal versus non-ribosomal cellular antigens: factors determining efficiency of indirect presentation to CD4+ T cells. Immunology, 130(4), 494–503.
58. PubMed Central. (2019) Open Access Subset. Retrieved data from https://www.ncbi.nlm.nih.gov/pmc/tools/openftlist/
59. Peters L. A., Perrigoue J., Mortha A., Luga A., Song W. M., Neiman E. M., Llewellyn S. R., Di Narzo A., Kidd B. A., Telesco S. E., Zhao Y., Stojmirovic A., Sendecki J., Shameer K., Miotto R., Losic B., Shah H., Lee E., Wang M., Faith J. J., … Schadt E. E. (2017). A functional genomics predictive network model identifies regulators of inflammatory bowel disease. Nature genetics, 49(10), 1437–1449.


60. Qiu, F., Tang, R., Zuo, X. et al. A genome-wide association study identifies six novel risk loci for primary biliary cholangitis. Nat Commun 8, 14828 (2017).
61. Rodriguez-Esteban R., Jiang X. (2017). Differential gene expression in disease: a comparison between high-throughput studies and the literature. BMC Med Genomics 10, 59.
62. Ritchie ME., Phipson B., Wu D., Hu Y., Law CW., Shi W., and Smyth GK. (2015). limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Research 43 (7).
63. Reimand J., Isserlin R., Voisin V., Kucera M., Tannus-Lopes C., Rostamianfar A., Wadi L., Meyer M., Wong J., Xu CH., Merico D. and Bader GD. (2019) Pathway enrichment analysis and visualization of omics data using g:Profiler, GSEA, Cytoscape and EnrichmentMap. Nat Protoc 14, 482–517.
64. Rubin DC., Shaker A. and Levin MS. (2012) Chronic intestinal inflammation: inflammatory bowel disease and colitis-associated colon cancer. Front Immunol. 3(107).
65. Raudvere U., Kolberg L., Kuzmin I., Arak T., Adler P., Peterson H., and Vilo, J. (2019). g:Profiler: a web server for functional enrichment analysis and conversions of gene lists (2019 update). Nucleic acids research, 47(W1), W191–W198.
66. Supragna Sandur A. (2017) Enrichment Analysis for Gene Dataset. International Journal for Research in Applied Science & Engineering Technology (IJRASET). 5(5).
67. Safari-Alighiarloo N., Taghizadeh M., Rezaei-Tavirani M., Goliaei B., and Peyvandi, A. A. (2014) Protein-protein interaction networks (PPI) and complex diseases. Gastroenterology and hepatology from bed to bench, 7(1), 17–31.
68. Szklarczyk D., Gable A. L., Lyon D., Junge A., Wyder S., Huerta-Cepas J., Simonovic M., Doncheva N. T., Morris J. H., Bork P., Jensen L. J., and Mering, C. V. (2019). STRING v11: protein-protein association networks with increased coverage, supporting functional discovery in genome-wide experimental datasets. Nucleic acids research, 47(D1), D607–D613.
69. Szklarczyk D., Morris J. H., Cook H., Kuhn M., Wyder S., Simonovic M., Santos A., Doncheva N. T., Roth A., Bork P., Jensen L. J., and von Mering C. (2017). The STRING database in 2017: quality-controlled protein-protein association networks, made broadly accessible. Nucleic acids research, 45(D1), D362–D368.
70. Szalay-Beko M., Palotai R., Szappanos B., Kovács IA., Papp B., Csermely P. (2012) ModuLand plug-in for Cytoscape: determination of hierarchical layers of overlapping network modules and community centrality. Bioinformatics. 28(16):2202–2204.
71. Subhash S., Kanduri C. (2016) GeneSCF: a real-time based functional enrichment tool with support for multiple organisms. BMC Bioinformatics 17, 365.
72. Sean Davis. (2019) Package ‘GEOquery’. Retrieved 2020 March 15 from https://bioconductor.org/packages/devel/bioc/manuals/GEOquery/man/GEOquery.pdf
73. Shannon P., Markiel A., Ozier O., Baliga N. S., Wang J. T., Ramage D., Amin N., Schwikowski, B. and Ideker, T. (2003). Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome research, 13(11), 2498–2504.
74. Saito R., Smoot M. E., Ono K., Ruscheinski J., Wang P. L., Lotia S., Pico A. R., Bader G. D., and Ideker, T. (2012). A travel guide to Cytoscape plugins. Nature methods, 9(11), 1069–1076.
75. Subhash S., and Kanduri C. (2016) GeneSCF: a real-time based functional enrichment tool with support for multiple organisms. BMC Bioinformatics 17(365).


76. Sharma A., Menche J., Huang CC., Ort T., Zhou X., Kitsak M., Sahni N., Thibault D., Voung L., Guo F., Ghiassian SD., Gulbahce N., Baribaud F., Tocker J., Dobrin R., Barnathan E., Liu H., Panettieri RA Jr., Tantisira KG., Qiu W., Raby BA., Silverman EK., Vidal M., Weiss ST. and Barabási AL. (2015) A disease module in the interactome explains disease heterogeneity, drug response and captures novel pathways and genes in asthma. Human Molecular Genetics. 24(11):3005-20.
77. Sharma A., Virk R., Khurana M., and Kaur R. (2018) PROTEIN INTERACTION DATABASES: A REVIEW. Research Journal of Life Sciences, Bioinformatics, Pharmaceutical and Chemical Sciences.
78. Salazar GA., Meintjes A., Mazandu GK., Rapanoël HA., Akinola RO., and Mulder NJ. (2014) A web-based protein interaction network visualizer. BMC Bioinformatics.
79. Sun MA., Wang Y., Cheng H., Zhang Q., Ge W., and Guo D. (2012) RedoxDB--a curated database for experimentally verified protein oxidative modification. Bioinformatics. 28(19):2551–2552.
80. Taguchi Yh. (2018) Comparative Transcriptomics Analysis. Encyclopedia of Bioinformatics and Computational Biology.
81. Tusher, V. G., Tibshirani, R. & Chu, G. (2001), Significance analysis of microarrays applied to the ionizing radiation response, Proceedings of the National Academy of Sciences of the United States of America 98, 5116–5121.
82. Tsai H., and Han M.H. (2016) Sphingosine-1-Phosphate (S1P) and S1P Signaling Pathway: Therapeutic Targets in Autoimmunity and Inflammation. Drugs 76, 1067–1079.
83. Tenopir, C., Dalton, E. D., Allard, S., Frame, M., Pjesivac, I., Birch, B., Pollock, D., & Dorsett, K. (2015). Changes in Data Sharing and Data Reuse Practices and Perceptions among Scientists Worldwide. PloS one, 10(8), e0134826.
84. Tusher VG., Tibshirani R. and Chu G. (2001) Significance analysis of microarrays applied to the ionizing radiation response. Proc Natl Acad Sci U S A. 98(9):5116-21.
85. Vartanian K., Slottke R., Johnstone T., Casale A., Planck S. R., Choi D., Smith J. R., Rosenbaum J. T., and Harrington C. A. (2009). Gene expression profiling of whole blood: comparison of target preparation methods for accurate and reproducible microarray analysis. BMC genomics. 10(2).
86. Vasquez N., Mangin I., Lepage P., Seksik P., Duong JP., Blum S., Schiffrin E., Suau A., Allez M., Vernier G., Tréton X., Doré J., Marteau PH. and Pochart PH. (2007) Patchy distribution of mucosal lesions in ileal Crohn's disease is not linked to differences in the dominant mucosa-associated bacteria: A study using fluorescence in situ hybridization and temporal temperature gradient gel electrophoresis. Inflammatory Bowel Diseases. 13(6), pp: 684–692.
87. Verstockt, B., Smith, K. G., and Lee, J. C. (2018). Genome-wide association studies in Crohn's disease: Past, present and future. Clinical & translational immunology, 7(1), e1001.
88. Von Mering C., Jensen LJ., Snel B., Hooper SD., Krupp M., Foglierini M., Jouffre N., Huynen MA. and Bork P. (2005) STRING: known and predicted protein-protein associations, integrated and transferred across organisms. Nucleic Acids Res.: D433-7.
89. Werner L., Guzner-Gur H., and Dotan I. (2013). Involvement of CXCR4/CXCR7/CXCL12 Interactions in Inflammatory bowel disease. Theranostics, 3(1), 40–46.
90. Wu M., Li X., Kwoh CK. (2008) Algorithms for Detecting Protein Complexes in PPI Networks: An Evaluation Study. Singapore. Institute for Infocomm Research.
91. You R., Yao S., Xiong Y., Huang X., Sun F., Mamitsuka H., and Zhu S. (2019). NetGO: improving large-scale protein function prediction with massive network information. Nucleic acids research, 47(W1), W379–W387.


92. Yang Z., Goronzy J. J., and Weyand C. M. (2015). Autophagy in autoimmune disease. Journal of molecular medicine (Berlin, Germany), 93(7), 707–717.
93. Zhao X., and Xu W. (2015). An extended affinity propagation clustering method based on different data density types. Computational intelligence and neuroscience. (4):828057


9. Appendix

Table 6: Gene symbols of overlapping genes

ABCG1 | CYTH4 | MRPL19 | SEMA4D | UQCRBP1
AK6 | DBI | MRPL47 | SH2D3C | UQCRH
AKAP17A | DEF8 | NACAP1 | SYK | UQCRHL
ATP5C1 | DYNLRB1 | NDUFAF4 | TCEAL8 | VPS8
C14orf159 | ECI2 | NIN | TLK2 | WDR83OS
C15orf39 | EIF3M | PLCB2 | TPRKB | XPO6
CCDC25 | FAM160A2 | PRDX1 | TRPC4AP | ZBTB17
CKS2 | GAR1 | RASSF5 | UBE2V2 | ZNF226
CRK | GTF2A2 | RBM5 | UPF1 | ZNHIT3
CUX1 | MED1 | RDH11 | UQCRB |

Figure 12. Overview of the methodology.


Table 7: Top 10 significant terms of KEGG pathway for IBD genes

Term | P-value | Adjusted P-value | Genes
Toll-like receptor signaling pathway | 7.63E-06 | 2.35E-03 | CCL3L1; CCL4L1; CCL3L3; PIK3CD; PIK3CB; IRAK4; TICAM1; IRAK1; CCL5; CCL3; TLR8; RIPK1; TLR6; TLR5; TLR4; MYD88
Chemokine signaling pathway | 1.94E-05 | 2.99E-03 | STAT5B; CCL3L1; CCL4L1; PRKCB; SHC1; CCL3L3; PRKCD; PIK3CD; PIK3CB; RASGRP2; PIK3CG; VAV1; PIK3R5; RAP1B; HCK; GRK2; GNGT2; CCL5; CCL3; RAF1; CRK; PLCB2
Fc gamma R-mediated phagocytosis | 2.80E-05 | 2.87E-03 | GSN; SYK; PRKCB; LIMK2; PRKCD; INPPL1; PIK3CD; ASAP1; PIK3CB; VAV1; HCK; PLCG2; RAF1; CRK
Chagas disease (American trypanosomiasis) | 2.85E-05 | 2.20E-03 | CCL3L1; CCL3L3; PIK3CD; PIK3CB; IRAK4; TICAM1; TGFBR2; IRAK1; PPP2R2B; CCL5; CCL3; TLR6; TLR4; PLCB2; MYD88
Spliceosome | 5.30E-05 | 3.26E-03 | DDX5; SF3B6; DDX23; DDX42; SNU13; LSM3; LSM6; PHF5A; SNRNP70; SYF2; U2AF2; DHX38; DHX16; SNRPB2; SNRPD3; SNRPC; SF3B1
Salmonella infection | 6.47E-05 | 3.32E-03 | CCL3L1; CCL4L1; CCL3L3; KLC1; ACTB; ACTG1; CCL3; MYH9; PLEKHM2; PKN1; TLR5; TLR4; MYD88
Platelet activation | 7.10E-05 | 3.12E-03 | SYK; PIK3CD; PIK3CB; RASGRP2; ACTB; PIK3CG; ACTG1; PIK3R5; RAP1B; APBB1IP; GNA13; BTK; PLCG2; ARHGEF1; PLCB2; FERMT3
HIF-1 signaling pathway | 8.13E-05 | 3.13E-03 | PFKFB3; PRKCB; PIK3CD; PIK3CB; MTOR; HK3; PFKL; MKNK1; PLCG2; EP300; LTBR; ALDOA; IL6R; TLR4
Endocytosis | 1.18E-04 | 4.05E-03 | ARFGEF1; GIT2; IQSEC1; CLTB; ASAP1; VPS37B; ARAP1; ARFGAP3; IL2RG; HLA-G; AP2A2; RAB11A; ARFGAP1; TGFBR2; ACAP2; ACAP1; GRK2; CYTH4; CAPZB; HGS; KIF5B; RAB35; CHMP4A; CYTH1
Alzheimer disease | 1.22E-04 | 3.77E-03 | APP; NDUFB9; COX7B; UQCRB; NDUFA4; NDUFA3; ATP2A3; ATP2A2; PSEN1; UQCR10; COX6A1; HSD17B10; UQCRH; UQCRHL; COX6B1; CASP3; NDUFS4; CAPN1; PLCB2


Table 8: Top 10 IBD-GO Biological process 2018

Term | P-value | Adjusted P-value | Genes
neutrophil activation involved in immune response (GO:0002283) | 8.18E-11 | 4.17E-07 | ARHGAP9; ITGAM; SLC44A2; ITGB2; SLC2A3; TCIRG1; SRP14; AP2A2; STK10; HK3; MMP25; LAMP1; LAMP2; TIMP2; TOM1; ITGAX; TBC1D10C; COTL1; SIRPA; QSOX1; CAPN1; ACAA1; PGM1; KCMF1; ACTR2; DGAT1; SYK; PRKCD; ATP11B; DYNLL1; ARHGAP45; DNAJC5; ALDOA; PDXK; CANT1; TMEM63A; CAB39; PSMD14; MVP; MOSPD2; PRCP; PSEN1; RAP1B; CNN2; RAB24; CD59; TXNDC5; GSN; TRAPPC1; STK11IP; NBEAL2; PFKL; TSPAN14; IMPDH1
neutrophil mediated immunity (GO:0002446) | 1.12E-10 | 2.85E-07 | ARHGAP9; ITGAM; SLC44A2; ITGB2; SLC2A3; TCIRG1; SRP14; AP2A2; STK10; HK3; MMP25; LAMP1; LAMP2; TIMP2; TOM1; ITGAX; TBC1D10C; COTL1; SIRPA; QSOX1; CAPN1; ACAA1; PGM1; KCMF1; ACTR2; DGAT1; PRKCD; ATP11B; IRAK4; DYNLL1; ARHGAP45; DNAJC5; ALDOA; PDXK; CANT1; TMEM63A; CAB39; PSMD14; MVP; MOSPD2; PRCP; PSEN1; RAP1B; CNN2; RAB24; CD59; TXNDC5; GSN; TRAPPC1; STK11IP; NBEAL2; PFKL; TSPAN14; IMPDH1
neutrophil degranulation (GO:0043312) | 1.80E-10 | 3.07E-07 | ARHGAP9; ITGAM; SLC44A2; ITGB2; SLC2A3; TCIRG1; SRP14; AP2A2; STK10; HK3; MMP25; LAMP1; LAMP2; TIMP2; TOM1; ITGAX; TBC1D10C; COTL1; SIRPA; QSOX1; CAPN1; ACAA1; PGM1; KCMF1; ACTR2; DGAT1; PRKCD; ATP11B; DYNLL1; ARHGAP45; DNAJC5; ALDOA; PDXK; CANT1; TMEM63A; CAB39; PSMD14; MVP; MOSPD2; PRCP; PSEN1; RAP1B; CNN2; RAB24; CD59; TXNDC5; GSN; TRAPPC1; STK11IP; NBEAL2; PFKL; TSPAN14; IMPDH1
phosphorylation (GO:0016310) | 7.92E-09 | 1.01E-05 | APP; DGKD; UQCRB; LRRK2; PNKP; TESK2; ATP5C1; PIK3CD; PIK3CB; UQCRH; PIK3CG; STK10; NUAK1; IRAK1; MKNK1; STK38; RPS6KA1; TLK2; PIM3; PAK2; MAP4K4; SYK; PRKCB; IGFBP3; TNK2; LIMK2; PRKCD; LMTK2; PHKB; PHKA2; MTOR; TGFBR2; HCK; SNRK; TAOK3; FES; RARA; BTK; SGK3; PKN1; RAF1; HCST; MAP3K11
gene expression (GO:0010467) | 4.99E-08 | 5.09E-05 | SLBP; SMG1; SMARCD3; MRPL18; CELF2; SNU13; RRBP1; MRPL34; RPL10A; RPL9; MRPL13; MRPL32; MRPL42; ATXN1; U2AF2; TPR; RPL14; DHX38; DHX16; RPL17; POLR2K; RBM5; MEF2A; EIF5A; RPL39L; UPF1; RPL41; RPL21; CPSF2; TFIP11; THOC5; MRPL24; SORL1; THOC7; MRPS18C; THAP1; TSPAN14; HNRNPUL1; XRN2; NUP98; RPL26; RPS24; NUP58
protein transport (GO:0015031) | 4.75E-07 | 4.04E-04 | RABGAP1; TBC1D3B; UNC93B1; TBC1D3F; TBC1D3C; HIKESHI; TCIRG1; RAB43; RAB40C; LAMP2; RAB24; TBC1D10C; STX6; TOM1; TMED1; ATP6V1C1; SEC31B; SPTBN1; WLS; ATP6V1F; RABGAP1L; PRKCB; STX8; LMTK2; ANK3; ARFGAP3; ZDHHC17; RAB11A; TBC1D5; SGSM3; RAB35; RAB9A; MYH9; ARHGEF2; ABCG1
protein phosphorylation (GO:0006468) | 1.97E-06 | 1.43E-03 | APP; SMG1; USP15; CAB39; LRRK2; TESK2; PIK3CD; PIK3CB; PIK3CG; STK10; NUAK1; GRK2; IRAK1; MKNK1; STK38; RPS6KA1; TLK2; PDK3; PIM3; RIPK1; PAK2; MAP4K4; SYK; PRKCB; IGFBP3; PRKCD; LMTK2; PHKB; PHKA2; MTOR; TGFBR2; HCK; SNRK; TAOK3; FES; RARA; BTK; SGK3; PKN1; RAF1; HCST; TLR4; MAP3K11
vesicle-mediated transport (GO:0016192) | 2.31E-06 | 1.47E-03 | LRPAP1; APP; LRRK2; INPPL1; ASAP1; RAB43; CUX1; STX6; TOM1; TBC1D10C; DENND5A; TRIM27; VPS39; CYTH1; VPS16; ARFGEF1; UVRAG; GIT2; SYT3; STX8; LMTK2; COMMD1; ARAP1; ATG14; ARFGAP3; MAPK8IP3; ARFGAP1; RAB11A; FAM160A2; ACAP2; ACAP1; TBC1D5; SGSM3; RAB35; RAB9A; RIN3; CHMP4A; VAMP4; MON1B
histone mRNA metabolic process (GO:0008334) | 4.06E-06 | 2.30E-03 | LSM10; UPF1; SLBP; CPSF2; LSM1; PAPD4; SNRPD3; DCP2
positive regulation of I-kappaB kinase/NF-kappaB signaling (GO:0043123) | 5.72E-06 | 2.92E-03 | RNF31; APP; CANT1; PRKCB; SLC44A2; PELI2; IRAK4; TICAM1; RHOC; ZDHHC17; TRIM8; CD4; IRAK1; TRIM25; RIPK1; RBCK1; TLR6; LTBR; TLR4; MYD88; WLS


Table 9: Top 10 IBD-molecular function

Term | P-value | Adjusted P-value | Genes
RNA binding (GO:0003723) | 9.37E-10 | 1.08E-06 | ATP5C1; SRP14; RPL10A; RPL9; LSM10; HMGN5; KHSRP; TRIM25; SNRPD3; RPL21; EIF1AX; FNDC3B; LARP4B; THOC5; FAM133B; HNRNPUL1; EWSR1; XRN2; NUP98; RPL26; SNRPC; FKBP3; SLBP; DDX5; KMT2C; MEPCE; UTP23; RPN1; SNU13; AKAP17A; MRPL13; CORO1A; ERI1; PRDX1; TPR; GAR1; DHX38; LYAR; RBM10; SPTBN1; RPL39L; RPL41; FAM103A1; LSM1; MRPL21; LSM3; PRRC2A; GNL3L; LSM6; AGO4; TRIP6; EIF4G3; SUGP2; RPS24; DIDO1; SMG1; CELF2; DDX42; HSPB1; ADAR; MRPL32; MECP2; MRPL42; SNRNP70; SUMO2; DHX16; RBM34; TNPO1; RBM5; EIF5A; UPF1; GLRX3; PTBP3; PSMA6; SYF2; NFX1; CSDE1; FRG1; TLR8; MYH9; ARHGEF1; HNRNPH2; ALDOA; DCP2; RBM47; SF3B6; DDX23; NOL7; RRBP1; MTIF3; HSD17B10; U2AF2; EXOSC9; EIF4H; RPL14; ZNF106; RPL17; SF3B1; YTHDF3; UTP3; CCDC59; TPD52L2; PHF5A; ZYX; SSBP1; EIF3A; R3HDM2
phosphotransferase activity, alcohol group as acceptor (GO:0016773) | 4.16E-07 | 2.40E-04 | SMG1; PDXK; DGKD; LRRK2; TESK2; PIK3CD; PIK3CB; PIK3CG; HK3; GRK2; IRAK1; PI4KAP1; CCL5; PI4KAP2; CCL3; PDK3; RIPK1; IP6K1; PAK2; CHKA; SYK; PRKCD; IRAK4; COASY; MTOR; SGK3; PKN1; RAF1; MAP3K11; GNE
hydrogen ion transmembrane transporter activity (GO:0015078) | 1.08E-06 | 4.13E-04 | ATP5S; UQCRB; ATP2A3; ATP2A2; TCIRG1; UQCR10; ATP5H; UQCRH; ATP6V1C1; UQCRHL; ATP5L; ATP6V1F
kinase activity (GO:0016301) | 1.10E-06 | 3.18E-04 | SMG1; PDXK; DGKD; LRRK2; TESK2; PIK3CD; PIK3CB; PIK3CG; GRK2; IRAK1; PI4KAP1; CCL5; PI4KAP2; RPS6KA1; CCL3; PDK3; RIPK1; IP6K1; PAK2; MAP4K4; CHKA; SYK; PRKCD; IRAK4; COASY; MTOR; SGK3; COQ8B; PKN1; RAF1; MAP3K11
GTPase regulator activity (GO:0030695) | 2.23E-06 | 5.13E-04 | ARHGAP9; RABGAP1; RGS19; TBC1D3B; LRRK2; TBC1D3F; TBC1D3C; ASAP1; RASAL3; ARHGAP4; TBC1D10C; PDE6D; BNIP2; GIT2; RABGAP1L; TNK2; MYO9B; ARHGAP27; ARAP1; ARFGAP3; ARFGAP1; ARHGAP45; ACAP2; RALGAPB; ACAP1; ARHGAP30; TBC1D20; TBC1D5; SGSM3; PREB
GTPase activator activity (GO:0005096) | 2.67E-06 | 5.12E-04 | ARHGAP9; RABGAP1; RGS19; TBC1D3B; LRRK2; TBC1D3F; TBC1D3C; ASAP1; RASAL3; ARHGAP4; TBC1D10C; BNIP2; GIT2; RABGAP1L; MYO9B; ARHGAP27; ARAP1; ARFGAP3; ARFGAP1; ARHGAP45; ACAP2; RALGAPB; ACAP1; ARHGAP30; TBC1D20; TBC1D5; SGSM3; PREB
protein kinase activity (GO:0004672) | 3.82E-05 | 6.28E-03 | SMG1; EPHA10; CAB39; LRRK2; PELI2; TESK2; PIK3CG; STK10; NUAK1; GRK2; CUX1; IRAK1; CCL5; MKNK1; STK38; RPS6KA1; TLK2; CCL3; PDK3; PIM3; RIPK1; PAK2; MAP4K4; STAT5A; STAT5B; SYK; PRKCB; TNK2; LIMK2; PRKCD; LMTK2; IRAK4; MTOR; HCK; SNRK; TAOK3; FES; BTK; SGK3; PKN1; RAF1; MAP3K11
protein serine/threonine kinase activity (GO:0004674) | 4.48E-05 | 6.45E-03 | SMG1; CAB39; CCNH; LRRK2; PELI2; TESK2; STK10; NUAK1; GRK2; IRAK1; MKNK1; RPS6KA1; STK38; TLK2; PDK3; PIM3; RIPK1; PAK2; MAP4K4; SYK; PRKCB; LIMK2; PRKCD; LMTK2; IRAK4; MTOR; TGFBR2; SNRK; TAOK3; SGK3; PKN1; RAF1; MAP3K11
GTPase binding (GO:0051020) | 7.96E-05 | 1.02E-02 | FGD3; GIT2; ACAP2; ACAP1; PREB; ASAP1; ARAP1; ATG14; ARFGAP3; PAK2; ARFGAP1; SPTBN1
phosphotyrosine residue binding (GO:0001784) | 1.65E-04 | 1.90E-02 | HCK; SYK; SHC1; PLCG2; NCK2; CRK; VAV1


Table 10: Top 10 PBC KEGG pathway

Term | P-value | Adjusted P-value | Genes
Ribosome | 2.25E-08 | 6.94E-06 | RPL30; RPS4Y2; MRPL18; MRPL19; RPLP0; MRPL16; MRPL36; RPL23A; RPS4Y1; RPL9; MRPL13; MRPL24; MRPL35; RPL6; MRPL33; MRPL20; RPL17; RPS24; RPL26L1
Thyroid hormone signaling pathway | 5.66E-05 | 8.72E-03 | MED12; MED1; NOTCH1; RHEB; STAT1; AKT1; PIK3CD; PLCB2; ACTB; MTOR; PLCD1; SLC9A1
Fc gamma R-mediated phagocytosis | 1.41E-04 | 1.44E-02 | LYN; PAK1; SYK; INPP5D; INPPL1; PIK3CD; AKT1; PIP5K1C; DOCK2; CRK
Phospholipase D signaling pathway | 1.52E-04 | 1.17E-02 | DGKD; SYK; SHC1; PIK3CD; MTOR; PIK3R5; CYTH4; RHEB; DGKQ; AKT1; PIP5K1C; PLCB2; CYTH1
Phosphatidylinositol signaling system | 2.83E-04 | 1.74E-02 | DGKD; DGKQ; INPP5D; INPPL1; PIK3CD; PIP4K2C; PIP5K1C; IP6K1; PLCB2; PLCD1
Regulation of actin cytoskeleton | 5.78E-04 | 2.97E-02 | CYFIP2; ROCK2; ACTN1; MSN; PIK3CD; ITGAL; ACTB; SLC9A1; FGD3; PAK1; MYH9; NCKAP1L; PIP4K2C; PIP5K1C; CRK
Non-alcoholic fatty liver disease (NAFLD) | 5.97E-04 | 2.63E-02 | IKBKB; SREBF1; COX7A2L; UQCRB; NDUFS4; AKT1; PIK3CD; COX6C; UQCRH; MAP3K11; UQCRHL; MAP3K5
Shigellosis | 0.0015752 | 6.06E-02 | IKBKB; ROCK2; ELMO2; HCLS1; NOD1; CRK; ACTB
Chemokine signaling pathway | 0.0016359 | 5.60E-02 | LYN; SHC1; ROCK2; STAT1; PIK3CD; PIK3R5; IKBKB; PAK1; GRK2; AKT1; DOCK2; CRK; PLCB2
Acute myeloid leukemia | 0.0017229 | 5.31E-02 | IKBKB; STAT5A; PER2; RARA; PIK3CD; AKT1; MTOR


Table 11: Top 10 PBC-GO Biological process

Term | P-value | Adjusted P-value | Genes
translation (GO:0006412) | 2.74E-09 | 1.40E-05 | RPL30; RPS4Y2; MRPL18; MRPS33; RPLP0; MRPL16; SNU13; MRPL36; RRBP1; RPS4Y1; RPL9; MRPL13; MRPL35; RPL6; EEF1B2; RPL17; WARS; MRPL47; RPL23A; MRPL24; COPS5; PPA2; EEF1E1; RPS24; RPL26L1
gene expression (GO:0010467) | 1.55E-08 | 3.95E-05 | RPL30; RPS4Y2; DHX8; GTF3C6; POLDIP3; MRPL18; MRPS33; RPLP0; SNU13; MRPL36; RRBP1; RPS4Y1; GATA2; RPL9; MRPL13; MRPL35; RPL6; SMG5; NXF1; DHX38; RPL17; RBM5; UPF1; WARS; TFIP11; RPL23A; MRPL24; SORL1; SETX; TSPAN14; COPS5; RPS24; NUP58
phosphorylation (GO:0016310) | 4.24E-08 | 7.21E-05 | DGKD; UQCRB; ROCK2; ATP5C1; PIK3CD; UQCRH; IKBKB; STK10; PAK1; STK38; RPS6KA1; TLK2; AKT1; NADK; MARK2; MAP3K5; LYN; MAP4K2; SYK; TYK2; PHKA2; MTOR; TGFBR2; LATS2; FES; DGKQ; RARA; PRKD2; PKN1; CSNK1G2; MAP3K11
mitochondrial translational elongation (GO:0070125) | 6.07E-08 | 7.75E-05 | MRPS28; MRPS33; MRPL18; MRPL19; MRPL16; MRPL39; MRPL36; MRPL47; MRPL35; MRPL13; MRPL24; MRPL33; MRPL20; MRPL50
mitochondrial translational termination (GO:0070126) | 8.17E-08 | 8.34E-05 | MRPS28; MRPS33; MRPL18; MRPL19; MRPL16; MRPL39; MRPL36; MRPL47; MRPL35; MRPL13; MRPL24; MRPL33; MRPL20; MRPL50
translational elongation (GO:0006414) | 1.06E-07 | 9.00E-05 | MRPS28; MRPL18; MRPS33; MRPL19; MRPL16; MRPL39; MRPL36; MRPL47; MRPL13; MRPL24; MRPL35; MRPL33; EEF1B2; MRPL20; MRPL50
peptide biosynthetic process (GO:0043043) | 1.80E-07 | 1.31E-04 | RPL30; RPS4Y2; WARS; MRPL18; MRPS33; RPLP0; SNU13; MRPL36; RRBP1; RPL23A; RPS4Y1; RPL9; MRPL13; MRPL24; MRPL35; RPL6; COPS5; RPL17; RPS24
translational termination (GO:0006415) | 2.16E-07 | 1.38E-04 | MRPS28; MRPL18; MRPS33; MRPL19; MRPL16; MRPL39; MRPL36; MRPL47; MRPL35; MRPL13; MRPL24; MRPL33; MRPL20; MRPL50
peptidyl-serine phosphorylation (GO:0018105) | 2.87E-07 | 1.63E-04 | MAST3; ROCK2; PRKX; MTOR; TGFBR2; IKBKB; GRK2; LATS2; MAPKAPK3; DGKQ; STK38; TLK2; AKT1; PRKD2; RIPK1; PKN1; CSNK1G2
mitochondrial translation (GO:0032543) | 8.44E-07 | 4.31E-04 | MRPS28; MRPL18; MRPS33; MRPL19; MRPL16; MRPL39; MRPL36; MRPL47; MRPL13; MRPL24; MRPL35; MRPL33; MRPL20; MRPL50


Table 12: Top 10 PBC-molecular function

Term | P-value | Adjusted P-value | Genes
kinase activity (GO:0016301) | 4.48E-07 | 5.16E-04 | LYN; PDXK; DGKD; SYK; ADPGK; PIK3CD; COASY; MTOR; IKBKB; PAK1; RASSF2; GRK2; PI4KAP1; DGKQ; PI4KAP2; RPS6KA1; AKT1; PRKD2; RIPK1; PKN1; IP6K1; NADK; MAP3K11; MAP3K5
RNA binding (GO:0003723) | 1.06E-06 | 6.08E-04 | RPL30; RTCA; RPLP0; BCCIP; ATP5C1; MRPL39; GPATCH4; ADAR; AATF; RPL9; RPL6; SMG5; RBM5; SRRM2; MRPS28; UPF1; SNRPN; GLRX3; ACTN1; ALG13; LARP4B; SMC1A; RC3H2; NME1; PSMA6; PSMA1; MYH9; FKBP3; PLEC; ZCCHC6; RPS4Y2; DHX8; ROCK2; POLDIP3; SF3B6; ZCCHC9; DDX23; MRPL16; NOL7; SNU13; RRBP1; AKAP17A; RPS4Y1; RPF2; MRPL13; MRPL20; NXF1; RIDA; PRDX1; TRA2B; GAR1; SAFB2; DHX38; EXOSC3; RPL17; MARK2; SPEN; NOP16; MSN; RPL23A; HELZ; CWC15; LSM3; ZNFX1; UBA1; RPS24; RPL26L1
phosphotransferase activity, alcohol group as acceptor (GO:0016773) | 1.13E-06 | 4.35E-04 | PDXK; DGKD; SYK; ADPGK; PIK3CD; COASY; MTOR; IKBKB; PAK1; RASSF2; GRK2; PI4KAP1; DGKQ; PI4KAP2; AKT1; PRKD2; RIPK1; PKN1; IP6K1; NADK; MAP3K11; MAP3K5
GTPase activator activity (GO:0005096) | 3.03E-06 | 8.71E-04 | RABGAP1; RANBP3; ARHGAP19; TBCD; MYO9B; TBC1D2B; ARAP1; GMIP; RASAL3; ADAP1; ARHGAP25; ARFGAP1; ARHGAP4; ARHGAP45; RIC8A; ABR; ARHGAP30; TBC1D20; TBC1D5; TBC1D14; NCKAP1L
hydrogen ion transmembrane transporter activity (GO:0015078) | 6.50E-06 | 1.50E-03 | UQCRB; ATP6V1B2; TCIRG1; ATP5H; ATP5F1; UQCRH; UQCRHL; ATP5L; SLC9A1
GTPase regulator activity (GO:0030695) | 1.41E-05 | 2.70E-03 | RABGAP1; RANBP3; ARHGAP19; TBCD; MYO9B; TBC1D2B; ARAP1; GMIP; RASAL3; ADAP1; ARHGAP25; ARFGAP1; ARHGAP4; ARHGAP45; RIC8A; ABR; ARHGAP30; TBC1D20; TBC1D5; TBC1D14; NCKAP1L
Ras GTPase binding (GO:0017016) | 3.10E-05 | 5.09E-03 | DENND1C; RABGAP1; RANBP3; CLEC16A; TBC1D2B; MYO9B; EHD1; IFT20; AKAP13; TBC1D20; AP1G1; TBC1D5; ANKFY1; TBC1D14; ARHGEF2; RAB11FIP4
protein serine/threonine kinase activity (GO:0004674) | 1.28E-04 | 1.84E-02 | MAP4K2; SYK; MAST3; ROCK2; MTOR; TGFBR2; IKBKB; STK10; PAK1; GRK2; LATS2; MAPKAPK3; RPS6KA1; STK38; TLK2; AKT1; PRKD2; RIPK1; PKN1; CSNK1G2; MARK2; MAP3K11; MAP3K5
Rab GTPase binding (GO:0017137) | 1.70E-04 | 2.18E-02 | EHD1; IFT20; DENND1C; RABGAP1; TBC1D20; AP1G1; TBC1D5; CLEC16A; ANKFY1; TBC1D14; TBC1D2B; RAB11FIP4
protein kinase activity (GO:0004672) | 2.47E-04 | 2.84E-02 | ROCK2; MAST3; IKBKB; STK10; PAK1; RASSF2; GRK2; CUX1; STK38; RPS6KA1; TLK2; AKT1; RIPK1; MARK2; MAP3K5; LYN; STAT5A; MAP4K2; SYK; TYK2; MTOR; LATS2; MAPKAPK3; FES; PRKD2; PKN1; CSNK1G2; MAP3K11


Table 13: Top 10 PSC-KEGG pathway

Term | P-value | Adjusted P-value | Genes
Oxidative phosphorylation | 3.80E-08 | 1.17E-05 | NDUFB9; COX7B; NDUFB10; UQCRB; NDUFA4; NDUFA1; NDUFC1; COX7C; UQCRH; UQCRHL
Alzheimer disease | 3.92E-08 | 6.03E-06 | NDUFB9; COX7B; NDUFB10; UQCRB; NDUFA4; NDUFA1; NDUFC1; COX7C; PLCB2; UQCRH; UQCRHL
Parkinson disease | 7.08E-08 | 7.27E-06 | NDUFB9; COX7B; NDUFB10; UQCRB; NDUFA4; NDUFA1; NDUFC1; COX7C; UQCRH; UQCRHL
Thermogenesis | 9.83E-08 | 7.57E-06 | NDUFB9; COX7B; NDUFB10; UQCRB; NDUFA4; NDUFAF4; NDUFA1; GRB2; NDUFC1; COX7C; UQCRH; UQCRHL
Non-alcoholic fatty liver disease (NAFLD) | 1.12E-07 | 6.87E-06 | NDUFB9; COX7B; NDUFB10; UQCRB; NDUFA4; NDUFA1; NDUFC1; COX7C; UQCRH; UQCRHL
Huntington disease | 1.35E-07 | 6.93E-06 | NDUFB9; COX7B; NDUFB10; UQCRB; NDUFA4; NDUFA1; NDUFC1; COX7C; PLCB2; UQCRH; UQCRHL
Ribosome | 1.38E-05 | 6.06E-04 | RPS28; RPS27; RPS15A; RPL23; MRPL19; RPL14; MRPL36; MRPL33
Cardiac muscle contraction | 2.34E-04 | 9.01E-03 | COX7B; UQCRB; COX7C; UQCRH; UQCRHL
Retrograde endocannabinoid signaling | 6.65E-04 | 2.27E-02 | NDUFB9; NDUFB10; NDUFA4; NDUFA1; NDUFC1; PLCB2
Fc epsilon RI signaling pathway | 0.0125471 | 3.86E-01 | LYN; SYK; GRB2


Table 14: Top 10 PSC-biological process

Term | P-value | Adjusted P-value | Genes
mitochondrial ATP synthesis coupled electron transport (GO:0042775) | 4.70E-10 | 2.40E-06 | NDUFB9; COX7B; NDUFB10; UQCRB; NDUFA4; NDUFA1; NDUFC1; COX7C; UQCRH; UQCRHL
respiratory electron transport chain (GO:0022904) | 1.29E-09 | 3.28E-06 | NDUFB9; COX7B; NDUFB10; UQCRB; NDUFA4; NDUFA1; NDUFC1; COX7C; UQCRH; UQCRHL
mitochondrial electron transport, NADH to ubiquinone (GO:0006120) | 1.82E-05 | 3.10E-02 | NDUFB9; NDUFB10; NDUFA4; NDUFA1; NDUFC1
gene expression (GO:0010467) | 3.80E-05 | 4.85E-02 | UPF1; NXF1; RPS28; RPS27; RPS15A; GTF3C6; RPL23; DHX37; RPL14; MRPL36; RBM5; BPTF
oxidative phosphorylation (GO:0006119) | 5.55E-05 | 5.66E-02 | UQCRB; ATP5C1; UQCRH
translational termination (GO:0006415) | 6.32E-05 | 5.38E-02 | MRPS28; TRMT112; MRPL19; MRPL36; MRPL47; MRPL33
mitochondrial respiratory chain complex assembly (GO:0033108) | 6.70E-05 | 4.88E-02 | NDUFB9; NDUFB10; UQCRB; NDUFAF4; NDUFA1; NDUFC1
NADH dehydrogenase complex assembly (GO:0010257) | 9.15E-05 | 5.84E-02 | NDUFB9; NDUFB10; NDUFAF4; NDUFA1; NDUFC1
mitochondrial respiratory chain complex I biogenesis (GO:0097031) | 9.15E-05 | 5.19E-02 | NDUFB9; NDUFB10; NDUFAF4; NDUFA1; NDUFC1
mitochondrial respiratory chain complex I assembly (GO:0032981) | 9.15E-05 | 4.67E-02 | NDUFB9; NDUFB10; NDUFAF4; NDUFA1; NDUFC1


Table 15: Top 10 PSC-molecular function

Term | P-value | Adjusted P-value | Genes
NADH dehydrogenase (quinone) activity (GO:0050136) | 1.30E-05 | 1.50E-02 | NDUFB9; NDUFB10; NDUFA4; NDUFA1; NDUFC1
NADH dehydrogenase (ubiquinone) activity (GO:0008137) | 1.30E-05 | 7.50E-03 | NDUFB9; NDUFB10; NDUFA4; NDUFA1; NDUFC1
ubiquinol-cytochrome-c reductase activity (GO:0008121) | 1.91E-05 | 7.33E-03 | UQCRB; UQCRH; UQCRHL
oxidoreductase activity, acting on diphenols and related substances as donors, cytochrome as acceptor (GO:0016681) | 1.91E-05 | 5.50E-03 | UQCRB; UQCRH; UQCRHL
RNA binding (GO:0003723) | 4.16E-05 | 9.58E-03 | MRPS28; UPF1; MBNL2; UTP3; RPL23; ATP5C1; EDF1; AKAP17A; LSM5; NXF1; RPS28; RIDA; LSM6; RPS27; RPS15A; PRDX1; SUMO2; FRG1; DHX37; GAR1; RPL14; GRB2; SLIRP; RBM5
ephrin receptor binding (GO:0046875) | 3.12E-04 | 5.99E-02 | LYN; GRB2; CRK
transcription binding (GO:0001221) | 7.19E-04 | 1.18E-01 | ZBTB17; RPL23; SUMO2
phosphotyrosine residue binding (GO:0001784) | 0.0011184 | 1.61E-01 | SYK; GRB2; CRK
protein phosphorylated amino acid binding (GO:0045309) | 0.0017847 | 2.28E-01 | SYK; GRB2; CRK
non-membrane spanning protein tyrosine kinase activity (GO:0004715) | 0.0035133 | 4.04E-01 | LYN; SYK; GRB2
