bioRxiv preprint doi: https://doi.org/10.1101/698969; this version posted July 14, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

MetaOmGraph: a workbench for interactive exploratory data analysis of large expression datasets

Urminder Singh1,2, Manhoi Hur2, Karin Dorman1,2,3, and Eve Wurtele1,2*

1Bioinformatics and Computational Biology Program, Iowa State University, Ames, IA 50011, USA 2Department of Genetics Development and Cell Biology, Iowa State University, Ames, IA 50011, USA 3Department of Statistics, Iowa State University, Ames, IA 50011, USA

The diverse and growing omics data in public domains provide can be attained with bigger datasets and the wide variety researchers with a tremendous opportunity to extract hidden of biological conditions can reveal the diverse transcription knowledge. However, the challenge of providing domain ex- spectrum of . Yet, despite the availability of such vast perts with easy access to these big data has resulted in the vast data resources, most bioinformatic studies use only a limited majority of archived data remaining unused. Here, we present amount of the available data. MetaOmGraph (MOG), a free, open-source, standalone soft- A common goal of analyzing omics data is to infer func- ware for exploratory data analysis of massive datasets by sci- entific researchers. Using MOG, a researcher can interactively tional roles of particular features by investigating differen- visualize and statistically analyze the data, in the context of its tial expression and coexpression patterns. A wide variety metadata. Researchers can interactively hone-in on groups of of R-platform tools can provide specific analyses (9–16). experiments or genes based on attributes such as expression Such tools are based upon rigorous statistical frameworks and values, statistical results, metadata terms, and ontology anno- produce accurate results when the model assumptions hold. tations. MOG’s statistical tools include coexpression, differen- Some tools avoid the need to code by providing "shiny" in- tial expression, and differential correlation analysis, with per- terfaces (17) to various subsets of R’s functionalities (10, 18– mutation test-based options for significance assessments. Multi- 20). They can only apply selected R packages to the data, and threading and indexing enable efficient data analysis on a per- have the general limitations that they are not well suited for sonal computer, with no need for writing code. Data can be very large datasets, and have very limited interactivity. visualized as line charts, box plots, scatter plots, and volcano Presently, there are very limited options for researchers to in- plots. A researcher can create new MOG projects from any data or analyze an existing one. An R-wrapper lets a researcher teract with expression datasets using the fundamental prin- select and send smaller data subsets to R for additional anal- ciples of exploratory data analysis (21). Exploratory data yses. A researcher can save MOG projects with a history of analysis is a technique to gain insight into a dataset, often the exploratory progress and later reopen or share them. We using graphical methods. Exploratory data analysis can re- illustrate MOG by case studies of large curated datasets from veal complex associations, patterns or anomalies within data human cancer RNA-Seq, in which we assembled a list of novel at different resolutions. By adding interactivity for visual- putative biomarker genes in different tumors, and microarray izations and statistical analyses, the process of exploration and metabolomics from A. thaliana. is simplified by enabling researchers to directly explore the underlying, often complex and multidimensional data. Re- searchers in diverse domains (e.g., experts in Parkinson’s Introduction disease, malaria, or nitrogen metabolism) can mine and re- mine the same data, extracting information pertinent to their Public data repositories store petabytes of raw and pro- areas of expertise and deriving testable hypotheses. These cessed data produced using microarray (1), RNA-seq (2), hypotheses can inform the design of new laboratory experi- and mass spectrometry for small molecules (3) and ments. Being able to explore and interact with the data be- (4). These data represent multiple species, tissues, geno- comes even more critical as datasets become larger. The in- types, and conditions; some are the results of groundbreak- formation content inherent in the vast stores of public data is ing research. Buried in these data are biological relation- tremendous. Due to the sheer size and complexity of such big ships among molecules that have never been explored. In- data, there is a pressing requirement for effective interactive tegrative analysis of data from the multiple studies represent- analysis and visualization tools (22, 23). ing diverse biological conditions is key to fully exploit these Increasing the usability of the vast data resources by enabling vast data resources for scientific discovery (5, 6). Such anal- efficient exploratory analysis would provide a tremendous ysis allows efficient reuse and recycling of these available opportunity to probe the expression of transcripts, genes, pro- data and its metadata (1, 5, 7, 8). Higher statistical power teins, metabolites and other features across a variety of differ- ent conditions. Such exploration can generate novel hypothe- * To whom correspondence should be addressed. Email: ses for experimentation, and hence provide a fundamental un- [email protected] derstanding of the function of genes, proteins, and their roles

Singh et al. | bioRχiv | July 11, 2019 | 1–18 bioRxiv preprint doi: https://doi.org/10.1101/698969; this version posted July 14, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

in complex biological networks (6, 8, 24–28). searchers with a standard laptop or computer can easily run Here, we present MetaOmGraph (MOG), a software written MOG with data files containing thousands of samples and in Java, to interactively explore and visualize large datasets. fifty thousands of transcripts. MOG is able to perform effi- MOG overcomes the challenges posed by the size and com- ciently via two complementary approaches. First, MOG in- plexity of big datasets by efficient handling of the data files, dexes the data file, rather than storing the whole data in main using a combination of data indexing and buffering schemes. memory. This enables MOG to work with very large files us- Further, by incorporating metadata, MOG adds extra dimen- ing a minimal amount of memory. Second, MOG speeds up sions to the analyses and provides flexibility in data explo- computation using multi-threading to execute multiple tasks ration. Researchers can explore their own data on their local in parallel, optimizing the use of multi-core processors. machines. At any stage of the analysis, a researcher can save her/his progress. Saved MOG projects can be shared, reused, A.3. Exception handling. MOG is robust in its capability to and included in publications. MOG is user-centered software, cope with most of the many errors and exceptions (such as designed for all types of data, but specialized for expression due to missing data or wrong data type) that can occur when data. handling diverse data types. If bugs are encountered, users can submit a bug report with a single click, allowing MOG developers to debug the issue. MATERIALS AND METHODS A.4. Data-type agnostic. Although specifically created for A. Overview. MOG is a standalone program that can run the purpose of analysis of ‘omics data, which is the focus of on any operating system capable of running Java (Linux, this paper, MOG was designed to be flexible enough to gen- Mac and Windows). Access to MOG easy. MOG runs erally handle numerical data and its metadata. It can inter- on the researcher’s computer and thus the researcher does actively analyze and visualize voluminous data of any kind. not need to rely on internet accessibility, and is not slowed Examples could be: transmission of mosquito-borne infec- down by the latency of transfer of big data. Furthermore, the tious diseases world-wide; public tax return data for world data in a researcher’s project is secure, remaining on the re- leaders over the past 40 years; daily sales at Dimo’s Pizza searcher’s computer, particularly important for confidential over five years; player statistics across all Women’s National data such as human RNA-Seq. MOG’s modular structure has Basketball Association (WNBA) teams; climate projections been carefully designed to enable developers to easily im- and outcome since 2000. plement new statistical analyses and visualizations in future. MOG’s Graphical User Interface (GUI) is the central compo- A.5. Leverage of third party JAVA libraries. In addition to the nent through which all the functionality is accessed (Figure functionalities programmed into MOG by us, MOG borrows 1). some existing functionality from freely available and exten- sively tested third-party Java libraries (JFreeChart, Apache A.1. Interactive data exploration. MOG displays all data in Commons Math, Nitrite, and JDOM). This strategy results interactive tables and trees, providing a flexible and struc- in a highly modular system that is accessible to changes and tured view of the data. The user can interactively filter extensions. or select data for analysis. Multiple queries can be com- bined and applied to discover obscure relationships within A.6. Interface to R. Based on the utility and popularity of R the data. This ability is particularly important for an aggre- for analysis, we have implemented an interface to facilitate gated dataset, as users may wish to split it into groups of execution of R scripts through MOG. By using MOG as a studies, treatments, or tissues. A novel aspect of MOG is wrapper to R packages, the interface expands the types of its capability of producing interactive visualizations. The re- analyses that can be performed. For example, a user can searcher can visualize data via line charts, histograms, box write an R script for hierarchical clustering of genes based plots, volcano plots, scatter plots and bar charts, each of on the expression levels over selected samples. The interface which is programmed to allow real-time interaction with enables a user to interactively select/filter data using MOG the data points and the metadata. Users can group, sort, and execute R scripts on selected data. This avoids writing filter, change colors and shapes, zoom, and pan interac- R-code for selecting genes and samples for the analysis. tively, via the GUI. At any point in the exploration, the researcher can search external databases: GeneCards (29), B. Creating a new MOG project. Ensembl (30), EnsemblPlants (31), RefSeq (32), TAIR (33) and ATGeneSearch (http://metnetweb.gdcb. B.1. Expression data files for a new MOG project . The ex- iastate.edu/MetNet_atGeneSearch.htm) to look pression data file contains unique identifiers (IDs) for each up additional information about the genomic features in feature (e.g, ) followed by information about that fea- the dataset. Researchers can also access SRA and GEO ture and then numerical data quantifying each feature across databases using the accessions present in study metadata. multiple conditions (Figure 2). For RNA-Seq analyses, each row would include a unique gene ID followed by metadata A.2. Efficient, multithreaded and robust. A key advantage of about that gene and the quantified expression level of that MOG is its minimal memory usage, enabling datasets to be gene over multiple RNA-Seq runs (Figure 2). If the data is analyzed that are too large for other available tools. Re- from metabolomics studies, each row would provide a unique

2 | bioRχiv Singh et al. | bioRxiv preprint doi: https://doi.org/10.1101/698969; this version posted July 14, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. B Creating a new MOG project

Fig. 1. An overview of MOG’s modules. All functionality is accessed through MOG’s graphical user interface (GUI). First, the researcher selects an existing MOG project or creates a new MOG project (.mog) with input data files. Once the project is open in MOG, the workflow is non-linear. The GUI enables interactive exploration of data through various statistical analyses and visualizations. The researcher can export visualizations and results throughout the analysis, and can save and/or export her/his feature lists and statistical analyses in the MOG project file for future exploration. Saved MOG projects can be shared and further analyzed by new researchers.

ID for each metabolite, metadata about each metabolite, and to the sample IDs present in the data file. Other columns numerical data quantifying accumulation of each metabolite might contain the study ID, a description of the experimental over multiple GC-MS analyses. The data columns of expres- design, experimental protocol, biological material sampled, sion data files, which contain numerical expression values, instrument used, study abstract (Figure 2). Information can are headed by unique IDs for the corresponding samples. be added or modified by the user, simply by adding columns Metadata about each feature can be added or modified by to the file. For example, a researcher might wish to add fields the researcher, simply by adding or modifying columns of for references, dates, notes, or keywords. the file (Projects provided on the MOG website can also be A caveat is that leveraging archived metadata is only as good modified). For example, for proteomics studies, a researcher as the metadata provided; metadata about the samples may be might want metadata to include: function, pathway mem- incomplete or misleading, and the quality varies from study bership, length in amino acids, phylostratum (34, 35), ter- to study (7, 36). Despite this, metadata is key to understand- tiary structure, amino acid content, and PubMed references; ing the significance of the experiments (7, 36). for metabolomics studies, metadata might include: InChi The metadata for the features and samples adds extra dimen- key, PubChemID, systematic name, synonyms, and molec- sions to data exploration in MOG. For example, researchers ular weight. can visualize expression of selected genes for selected treat- ments as compared to the other treatments (e.g., the expres- B.2. Sample metadata files for a new MOG project. This sion of transcription factors of the NF-Y family in studies metadata is input as a delimited file (Figure2). The first col- about mutants in fatty acid biosynthesis enzymes compared umn of each row contains a unique sample ID corresponding to mutants in enzymes of other pathways). Researchers can

Singh et al. | bioRχiv | 3 bioRxiv preprint doi: https://doi.org/10.1101/698969; this version posted July 14, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Fig. 2. Structure of input files for a new MOG project. The expression data file (left table) is a matrix with F rows and (I + S) columns. Each row corresponds to a feature (in this example, the features are genes). The first I columns are feature metadata columns; the first column is a unique feature ID for each row (in this example, the gene name). The latter (S) columns are the expression values of F features over S samples. Each column is headed by a unique identifier for the sample. The sample metadata file (right table) is a matrix of S rows by M columns. Its first column contains the unique sample IDs. Thus, each row in the sample metadata file corresponds to a sample in the expression data file. The M columns are the metadata attributes of each analysis (e.g., each run for RNA-Seq data, each chip for microarray data). also filter metadata to select certain samples (e.g., samples data normalization across studies. Despite much research on containing the word "kidney" in the "abstract" field, and the how to merge large numbers of diverse biological studies to- words "stage IV" in the "tumor stage" field) for further ex- gether (44, 46–49), combining heterogeneous studies, partic- ploration. ularly of RNA-Seq data, is still a challenge because many latent unwanted factors induce variability in the data, which B.3. Challenges of integrating data from multiple studies. makes it hard to directly compare data from different samples MOG is designed for exploration of complex datasets com- (38, 39, 50–53). posed of multiple samples from different studies. The data These unwanted factors may be technical effects (e.g., library supplied to MOG should be comparable across these sam- size, hardware used in sequencing, the protocol used for ex- ples. Thus, appropriate normalization should be performed traction) or unknown biological effects (e.g., a temperature on the data and effects of unwanted factors should be re- drop in one of the growth rooms, or a study in which control moved from the data (37–39). Although the steps of data nor- participants were residing on a smoggy region). These un- malization and integration are not a part of MOG, because the wanted factors can render the data uninterpretable in relation data that is input to MOG is typically from aggregate studies, to its metadata, in the sense that they introduce additional co- we present several considerations. variates into the data (38, 39, 50). Two predominate frameworks for analysis of multiple ex- If these unwanted effects are not removed, estimated asso- pression studies are: 1) analyze each study independently ciations between features after pooling data from multiple and then combine results from independent studies (meta- studies may be fallacious (27, 43, 54). Thus, before pool- analysis) (5,9, 10, 27, 40–43), or 2) combine data from mul- ing expression data from heterogeneous studies, the effects tiple studies together to create a "pooled" dataset and analyze of unwanted factors must be removed, and samples/studies the pooled data (5, 26, 27). that have unwieldy noise must be eliminated from the analy- In meta-analysis, if individual studies show statistically- sig- sis. nificant results, then it is likely that the final result combined from the studies will also be significant (5). This approach B.4. Using an existing MOG project. MOG projects avoids the major challenge of normalizing data across studies described in this paper and other MOG projects from well- (5, 27). A caveat is that the number of samples in an individ- vetted datasets are available at http://metnetweb. ual study must be high enough to allow statistical inference gdcb.iastate.edu/MetNet_MetaOmGraph.htm. (5). The major drawback of this method is that it does not Ongoing MOG project results can be saved by a researcher, allow direct comparisons across studies. regardless of whether s/he created it or it was obtained from The pooling approach has the advantage that combining a our website. The correlations, lists, and other interactive large number of samples can result in better statistical power analyses and explorations by the researcher are saved with and inference; a second advantage is that data can be com- the project. The MOG project can later be re-explored, pared across studies (5, 44, 45). The major challenge is modified, or shared.

4 | bioRχiv Singh et al. | bioRxiv preprint doi: https://doi.org/10.1101/698969; this version posted July 14, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. C Detecting statistical association among data.

C. Detecting statistical association among data.. Cal- linear associations in many gene expression datasets (28, 62– culating statistical association between a pair of features 65). MOG computes MI using B-splines density estimation, quantifies the similarity in their expression patterns across as described in Daub et.al. 2004(62). the samples that comprise that dataset. A variety of statistical MOG can also determine the context likelihood of related- measures have been applied to estimate similarity in expres- ness (CLR) (58), estimating pairwise relatedness between sion patterns (55–58). Genes with significant association may features. The CLR algorithm aims to remove many false as- share common biological roles and pathways (26, 55, 59). sociations by comparing the MI value between a pair of fea- Genes with significant association only under specific con- tures to the background distribution of MI values that include ditions may reveal their functional significance under those either of the features (58). conditions (54, 60). MOG provides the researcher with several statistical mea- C.2. Meta-analysis of correlation coefficients. MOG can per- sures to estimate associations/coexpression between the fea- form meta-analysis on Pearson correlation coefficients calcu- tures/biomolecules. It can also compute association between lated from multiple studies. This is useful particularly when samples, which reflects similarity between the samples. It is samples from different studies are not directly comparable up to the user to choose an appropriate method to calculate and using pooled data for correlation is not ideal. the statistical association between features or samples and in- MOG can calculate the summary correlation coefficient from terpret the results accordingly. multiple correlation coefficients computed from different studies individually. Researchers can choose between a fixed C.1. Correlation, mutual information and relatedness. We effects model (FEM) or a random effects model (REM) have incorporated four key methods that determine associ- (66, 67). The FEM combines the estimated effects by assum- ations among features, each having its own advantages and ing that all studies probe the same correlation, i.e., studies disadvantages, depending on the types of relationships the re- are homogeneous, and bias is only due to study-specific vari- searcher wishes to examine, and characteristics of the dataset ation. In contrast, the REM considers studies to be potentially being explored. heterogeneous, assuming that each study measures a biased MOG can compute pairwise Pearson and Spearman corre- version of the true effect and combines the estimated effects lation measures for selected features across samples or be- accordingly (43, 68). To perform meta-analysis of correla- tween selected samples across features. Pearson correlation tion coefficients with MOG, the researcher can select an ap- detects the linear dependency between two random variables, propriate column in the study metadata which identifies the X and Y , whereas, Spearman correlation coefficient mea- different studies in the data. Then MOG computes Pearson sures monotonic relationships between two the variables, X correlation between features for samples within each study and Y . Pearson correlation is computed directly using the and reports the summary correlation. The researcher should data points; this is faster than computation of Spearman cor- carefully interpret the results, which depends on the selected relation, which requires converting numerical values to ranks model (FEM or REM) and the studies used. during which some information is lost. However, Spearman correlation is less sensitive to outliers than Pearson correla- D. Differential expression between groups. Deter- tion (61). Pearson or Spearman correlations values can only mining differentially-expressed features from aggregated reveal if there is a linear or a monotonic relationship between datasets provides direction for further data exploration. the two random variables, which doesn’t imply causation. We have incorporated three statistical methods to evaluate Pearson and Spearman correlations are often used to find co- differential expression between two independent groups of expressed genes and generate matrices used for gene expres- samples, and two statistical methods to compare differential sion networks (26, 54, 56). expression between two paired groups of samples. For MOG also computes pairwise mutual information (MI) be- groups with independent samples, we have implemented: tween selected features across samples. MI quantifies the Mann-Whitney U test (a non-parametric test that puts amount of information shared between two random variables. no assumption about data distribution); Student’s t-test MI for two discrete random variables X and Y , having the (assumes equal variance and a normal distribution of data); joint probability p(x,y) and marginal probabilities p(x) and and Welch’s t-test (does not assume equal variance, assumes p(y) respectively, is defined as: a normal distribution of data). For analysis of groups with paired samples, we have implemented in MOG: a Paired X X  p(x,y)  I(X;Y ) = p(x,y)log t-test (assumes normal distribution of data): and a Wilcoxon p(x)p(y) y∈Y x∈X signed-rank test (a non-parametric; no assumption of data distribution). Compared to correlation methods, MI is a more general ap- The differential expression analyses methods implemented proach that can detect complex non-linear associations. The in MOG can achieve greater statistical power if the num- interpretation of the MI value is different than that of corre- ber of samples is sufficiently large (69). Computation and lation values: an MI value of zero between two random vari- calculation of these methods via MOG is efficient and per- ables implies statistical independence between them, whereas mits interactivity that promotes data exploration. For differ- a correlation value of zero does not imply that there is no sta- ential expression analysis of RNA-Seq/microarray data with tistical dependence (61). MI has been applied to detect non- small sample sizes, analysis in R is a viable option; meth-

Singh et al. | bioRχiv | 5 bioRxiv preprint doi: https://doi.org/10.1101/698969; this version posted July 14, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

ods like edger (16), DESeq2 (13) and limma (12) require raw (76) and has been normalized and batch-corrected for study- counts as input and provide more reliable differential expres- specific biases. The resulting data is comparable across tis- sion analysis for small sample sizes (70, 71). However, inter- sues (organ samples) and conditions (tumor v.s. non-tumor activity is limited when using these R packages. samples) (24) (Supplementary Table S1). To create the MOG project, we excluded from the dataset E. Differential correlation between groups. Features any organ types in which the number of tumor or non-tumor whose correlation with other features is significantly different samples was < 30. To ensure statistical independence, we under one environmental, genetic or developmental condition included only one TCGA sample from a patient (Supple- than another are differentially correlated. Differential cor- mentary Table S2). We then compiled metadata for the relation analysis is complementary to differential expression studies/samples and for the genes in the resultant expres- analysis and together these can reveal important features for sion dataset. We downloaded the study and sample metadata further exploration. MOG can find the features whose Pear- (TCGA metadata from TCGAbiolinks (77); GTEX meta- son correlation to a user-selected feature differs significantly data from GTEX’s website (https://gtexportal.org/home). We in two groups of samples. To do this, MOG applies a Fisher were unable to locate metadata for 17 of the TCGA sam- transformation (72) and performs a hypothesis test. Differ- ples and excluded these samples from the dataset (Supple- ential correlations of genes reflect shifting biological inter- mentary Table S3). We extracted metadata about the genes actions among these genes or their regulators under different from the HGNC (https://www.genenames.org/), NCBI Gene conditions (60, 73), thus reflecting the context-dependency of (https://www.ncbi.nlm.nih.gov/gene), Ensembl (30), Cancer genetic networks. Gene Census (78) and OMIM (79) databases, and added these information to the gene metadata in our dataset. 1,870 genes whose expression was not reported in any of the sam- F. Statistical inference. MOG reports p-values for all the ples were removed from the dataset (Supplementary Table statistical tests it performs. For correlation and MI, MOG S4). uses a permutation test to compute the p-values. MOG speeds-up this computation using multithreading, and pro- The final dataset used for the MOG project (Hu- cessing the permuted datasets in parallel (Algorithm 1). man_cancer.mog) contained expression values for 18,212 MOG also provides two popular methods to adjust the p- genes, 30 fields of metadata detailing each gene, across 7,142 values for multiple comparisons: Bonferroni correction (74) samples representing 14 different cancer types and associated and Benjamini–Hochberg (BH) correction (75). non-tumor tissues (Table 1) integrated with 23 fields of meta- data describing each study and sample. We used MOG to log transform the data for subsequent analyses. Algorithm 1 Permutation test for significance 2 1: P ← number of permutations G.2. A. thaliana microarray dataset. We created a project 2: T ← number of threads based on the A. thaliana curated microarray dataset from 3: X ← expression values of first gene Mentzen and Wurtele, 2008 (26). This dataset was compiled 4: Y ← expression values of second gene using 963 Affymetrix ATH1 chips with 22,746 probes from 5: ρ ← association(X,Y) 70 diverse experiments encompassing different conditions in- 6: extremes ← 0 cluding development, stress, mutant, and other studies. All 7: Execute in parallel P times using T threads : chips in the dataset were individually normalized and scaled 8: X* ← permute (X) to a common mean using MAS 5.0 algorithm. Only chips 9: ρ* ← association(X*,Y) with good quality biological replicates were kept and all the 10: if |ρ*| ≥ |ρ| then biological replicates were averaged to yield 424 samples. Fi- 11: extremes ← extremes + 1 nally, median absolute deviation (MAD)-based normalization 12: end parallel (80) was applied to the data. Study and sample metadata was extremes 13: p-value ← P obtained from Mentzen and Wurtele, 2008 (26). We com- piled the metadata for the genes from TAIR (33) combined it with phylostrata inferences from phylostatr (35).

G. Datasets. We illustrate MOG’s capabilities by exploring G.3. A. thaliana metabolomics GC-MS dataset. Small three diverse datasets from varied technical platforms. These molecule composition (metabolomics) data and meta- are Human cancer RNA-Seq data, A. thaliana microarray data describing the effect of 50 mutations of genes of dataset and A. thaliana metabolomics GC-MS dataset. mostly unknown function (81) were downloaded from Plant/Eukaryotic and Microbial Resource (PMR) (82). G.1. Human cancer RNA-seq dataset. We created a MOG project from the human cancer RNA-Seq dataset of Wang et RESULTS al., 2018 (24). This highly-curated dataset combines RNA- Seq data from The Cancer Genome Atlas (TCGA, tumor To demonstrate MOG’s usability and flexibility, we illustrate and non-tumor samples) (https://cancergenome.nih.gov/) and MOG’s capabilities by exploring three diverse datasets from Genotype Tissue Expression (GTEX, non-tumor samples) different perspectives. In many cases, this exploration led

6 | bioRχiv Singh et al. | bioRxiv preprint doi: https://doi.org/10.1101/698969; this version posted July 14, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. H Preliminary exploration of cancer dataset

Corresponding TCGA disease #TCGA samples #GTEX samples Total samples GTEX tissue Breast invasive carcinoma (BRCA) Breast 965 89 1,054 Colon adenocarcinoma (COAD) Colon 277 339 616 Esophageal carcinoma (ESCA) Esophagus 182 659 841 Kidney Chromophobe (KICH) Kidney 60 32 92 Kidney renal clear cell carcinoma (KIRC) Kidney 470 32 502 Kidney renal papillary cell carcinoma (KIRP) Kidney 236 32 268 (LIHC) Liver 295 115 410 Lung adenocarcinoma (LUAD) Lung 491 313 804 Lung squamous cell carcinoma (LUSC) Lung 486 313 799 Prostate adenocarcinoma (PRAD) Prostate 426 106 532 Stomach adenocarcinoma (STAD) Stomach 380 192 572 Thyroid carcinoma (THCA) Thyroid 441 318 759 Uterine Corpus Endometrial Carcinoma (UCEC) Uterus 141 82 223 Uterine Carcinosarcoma (UCS) Uterus 47 82 129

Table 1. Tissue-wise summary of tumor and non-tumor samples in the human cancer RNA-Seq expression data. us to conclusions that reflect prior experimental or in silico I. Identifying a catalog of genes that are differentially results. In other cases, the exploration led us to novel predic- expressed in 14 cancers. Using MOG, we identified dif- tions that could be tested experimentally. The statistical anal- ferentially expressed genes, and then examined their patterns yses and visualizations presented in this section were gener- of expression across multiple samples. Initially we focused ated exclusively using MOG. on differentially expressed genes in each cancer type with respect to the corresponding non-tumor samples. We define a gene as differentially expressed between two groups if it H. Preliminary exploration of cancer dataset. We first meets both of the following criteria: used MOG to explore and assess whether the data is prop- 1. Expression change is by 2-fold or more. erly normalized and free of batch effects or not. We expect that samples from similar biological conditions should ex- 2. Mann–Whitney U test is significant between the two hibit similar expression patterns for all the genes. To exam- groups (BH corrected p-value < 10−3) ine this, we computed, using MOG, pairwise Pearson corre- In each type of cancer, between 2,000-5,000 of the 18,212 lation among samples from same conditions. All the sam- genes whose expression was monitored were differentially ples had high Pearson correlation (> 0.70) with other sam- expressed (either upregulated or downregulated) (Table2, ples from same biological condition, with one exception. The Supplementary Table S5-S7). Rather non-intuitively, only 16 sample TCGA-38-4625-01A-01R-1206-07 from lung adeno- genes were consistently upregulated and 19 genes were con- carcinoma (LUAD), had unusually low expression for most sistently downregulated in every one of the 14 cancer types. genes and showed very low Pearson correlation with other About one-third of the genes (5,748 genes) were not differ- LUAD samples (< 0.35) (Additional File 1). This sample entially expressed in any of the tumor types (Supplementary was not included in the dataset. Table S8-S10). Although, non-tumor samples from colon, esophagus and Several genes that are deeply implicated in cancer were not stomach have high Pearson correlation (> 0.70), we no- differentially expressed in any of the tumor types. One such ticed the distribution of Pearson correlation values is bimodal example, tumor suppressor 53 (TP53) (Figure 3 A (Supplementary Figure 1). We hypothesized that samples and 3 B), is thought to mediate many cancer types via mu- from these tissues were from different anatomical sites. For tations in its CDS that reduces its ability to suppress cancer example, esophagus samples were from gastroesophegeal (83); one colorectal cancer study described it as differentially junction, muscularis and mucosa and the distribution of Pear- expressed (fold change 1.76) (84); another did not detect dif- son correlation values among samples from these three cate- ferential expression (85). gories showed a unimodal shape (Supplementary Figure 1). The expression of 15 of the 16 genes that are upregulated in Similarly, colon samples were from sigmoid colon and trans- all tumor types is strongly positively correlated (Spearman verse colon and samples from sigmoid colon also showed a correlation > 0.60) across all tumor, non-tumor and pooled unimodal distribution of Pearson correlation values (Supple- samples (tumor and non-tumor samples) (Figure 3 C, Sup- mentary Figure 1). However, samples from transverse colon plementary Table S31). Cyclin dependent kinase inhibitor didn’t show a unimodal distribution of Pearson correlation (CDKN2A) was the outlier (Spearman correlation < 0.50) values (Supplementary Figure 1). For stomach samples, no (Figure 3 D; Supplementary Table S11). This may imply information about specific tissues assays was present in the that these 15 upregulated genes function together closely as metadata. a module in both tumor and non-tumor cells.

Singh et al. | bioRχiv | 7 bioRxiv preprint doi: https://doi.org/10.1101/698969; this version posted July 14, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

TCGA Disease #Upregulated #Downregulated BRCA 1,093 2,827 COAD 1,401 3,036 ESCA 1,989 2,229 KICH 986 4,214 KIRC 1,877 2,263 KIRP 1,152 2,737 LIHC 1,527 1,485 LUAD 1,361 2,753 LUSC 2,210 3,734 PRAD 577 1,633 STAD 1,527 1,631 THCA 993 1,525 UCEC 2,135 3,250 UCS 2,419 2,491

Table 2. Number of differentiallly expressed genes found for each cancer type.

In contrast, among the 19 genes that are downregulated across all cancer types, only 62 pairs are strongly correlated across all the samples (Spearman correlation >= 0.6) (Sup- plementary Table S12-S13). Of these, only 4 pairs showed strong correlation in tumor and non-tumor samples separately (Figure 3 E and 3 F; Supplementary Table S14). Seven pairs of genes are strongly correlated among tumor samples but not among the non-tumor samples (Supplementary Table S15) and 18 pairs are strongly correlated among non-tumor sam- ples but not among the tumor samples (Supplementary Table S16), potentially reflecting a context-dependent coordination.

I.1. Functional analysis of differentially expressed genes. We performed (GO) enrichment analysis us- ing GO::TermFinder (86) and Revigo (87) on the genes that were upregulated, downregulated or not significantly changed across all the cancer types. Upregulated genes were significantly overrepresented in GO terms related to cell pro- Fig. 3. MOG visualizations of selected genes, upregulated, downregulated and un- liferation: cell cycle, , organelle organization, changed across all tumor types, over all the tumor and non-tumor samples. The regulation of cellular component organization and regulation tumor samples are shown in black and red is used for the non-tumor samples. The plots were generated in MOG by interactively choosing the genes and the plot of cell cycle (Supplementary Table S17, Supplementary Fig- were split into the categories tumor and non-tumor using the sample metadata. ure 2). In contrast, downregulated genes showed no signifi- (A) Histogram showing distribution of TP53 expression (number of bins parame- cant GO term overrepresentation; the 5,784 genes that did not ter was set to 50). (B) Box plot summarizing the expression of TP53 over tumor and non-tumor samples. The horizontal line inside the box represents the median change expression were enriched in GO terms: RNA process- expression, which was 9.1 for non-tumor samples and 9.6 for tumor samples. (C) ing, mRNA metabolic process, nucleic acid metabolic pro- Scatter plot showing coexpression of genes BUB1 and KIF4A (both were upregu- cess, and gene expression (Supplementary Table S18, Sup- lated across all tumor types). BUB1 and KIF4A show strong coexpression under both tumor and non-tumor samples (Spearman correlation=0.92 and Spearman plementary Figure 3). correlation= 0.84 respectively). Spearman correlation over both tumor and non- tumor samples was 0.94. (D) Scatter plot showing coexpression of genes CDKN2A J. Gene-level exploration. and KIF4A (both were upregulated across all tumor types). CDKN2A and KIF4A doesn’t show strong coexpression. The Spearman correlations were 0.36, 0.28 and 0.48 under tumor, non-tumor and all samples respectively. (E) Scatter plot J.1. Expression patterns of GPC3. The 3 (GPC3) showing coexpression pattern of genes TGFBR3 and DCN (both were downregu- gene, encoding a glycosylphosphatidylinositol-linked hep- lated across all tumor types). TGFBR3 and DCN show strong coexpression patern aran sulfate proteoglycan, is located on the X . under non-tumor samples (Spearman correlation=0.64). The coexpression was weak (Spearman correlation=0.14) under the tumor samples. Spearman correla- GPC3 has been implicated as a critical regulator of tissue tion over both tumor and non-tumor samples combined was 0.48. (F) Scatter plot growth and morphogenesis (88). Mutations in GPC3 have showing coexpression pattern of genes DPT and DCN (both were downregulated been linked to Simpson-Golabi-Behmel syndrome (SGBS) across all tumor types). The Spearman correlations were 0.82, 0.69 and 0.84 under tumor, non-tumor and all samples respectively. and Wilm’s tumour (89–91). In embryonic development, GPC3 inhibits the cell proliferation and specialization hedge- hog signaling pathway (92). In tumors, GPC3’s role is complex and not well understood.

8 | bioRχiv Singh et al. | bioRxiv preprint doi: https://doi.org/10.1101/698969; this version posted July 14, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. K Stage-wise analysis of cancer data

It promotes or inhibits cell growth depending on the can- 3). Then, we imported each network into Cytoscape (102) cer type (93, 94). Hence, we explored expression patterns and identified the tightly connected modules using MCODE of GPC3 using MOG. Our differential expression analysis of (103). the 14 cancer types, using Mann-Whitney test, indicates that In the network built using non-tumor liver samples, GPC3 GPC3’s expression in the LIHC samples is 30 times higher was part of a module that was ranked the second most signif- than in the non-tumor liver samples, and 8 times higher in the icant by MCODE’s algorithm (73 nodes (genes); MCODE UCS samples compared to the non-tumor uterus samples. In score 30.7). GPC3 was directly connected with 21 genes contrast, GPC3 is downregulated in 9 tumor types (BRCA, (Supplementary Figure 4). This module was most enriched in COAD, ESCA, KIRC, KIRP, LUAD, LUSC, THCA, and GO terms: sulfur compound catabolic process, glycosamino- UCEC) and unchanged in 3 tumor types (KICH, STAD and glycan catabolic process, aminoglycan catabolic process, and PRAD) (Figure 4 A and 4 B; Supplementary Table S19). extracellular matrix organization (Supplementary Table S27; These results are consistent with what has been reported in Supplementary Figure 5). In contrast, GPC3 wasn’t signif- the literature, using different data. In liver cancer, GPC3’s icantly coexpressed with other genes in the LIHC samples, expression was highly upregulated (93, 95, 96), and it has and thus was absent from the LIHC network. However, this been suggested as a diagnostic biomarker and as a potential network contained a module with 114 genes (MCODE score target for cancer immunotherapy in hepatocellular carcinoma 94.3), 33 of which were in the GPC3-connected module iden- (95–97). GPC3 was previously found to to be downregulated tified from the network built using liver non-tumor samples in breast (98), lung (99) and ovarian cancers (100) and its role (17 of these genes were directly connected with GPC3) (Sup- as a tumor suppressor has been investigated in lung and renal plementary Figure 4). GO analysis of this module indicated cancer (100, 101). it was overrepresented in GO terms: extracellular matrix or- We then investigated coexpression patterns of GPC3 in the ganization, extracellular structure organization, blood vessel different tumor and non-tumor tissues (Additional File 3). development, and vasculature development (Supplementary We selected a cutoff of Spearman correlation >= 0.6 or <= Table S28; Supplementary Figure 6). −0.6 to identify genes coexpressed with GPC3. We found that GPC3’s coexpression patterns were very different across K. Stage-wise analysis of cancer data. all the tumor and non-tumor samples (Figure 4 C). Genes that are correlated with GPC3 vary greatly across different K.1. Identifying new candidate biomarkers for cancers. We tissues (Figure 4 D and 4 E; Supplementary Table S20). For used MOG to identify genes whose expression is associated example, 4,219 genes were coexpressed with GPC3 in esoph- with disease progression; these genes are potential biomark- agus non-tumor samples, whereas no gene was coexpressed ers. Specifically, we identified genes upregulated in the dif- with GPC3 in non-tumor samples from prostate and stomach ferent cancer types; then, we sorted samples from each organ- (Supplementary Table S20). Coexpressed genes also differed type into early stage (stage I or stage II) and late stage (stage according to whether disease was present. For 7 tissues, the III and later) based on the study metadata. Finally we per- number of coexpressed genes with GPC3 were lower in tu- formed Mann-Whitney test and identified those that were mor samples than those in non-tumor samples (Supplemen- also upregulated in late stage compared to early stage (2- tary Table S20). For example, although 192 genes were co- fold change or more, and B-H corrected p − value < 0.05) expressed with GPC3 in non-tumor liver samples, no genes from among the upregulated genes. These genes would likely were coexpressed with GPC3 in LIHC samples. show increasing expression with cancer progression. We sim- We analyzed GO term overrepresentation for GPC3- ilarly identified the genes that were downregulated in late coexpressed genes from colon, esophagus, kidney and liver tumor stages compared to early stage cancer. These genes samples. GPC3 coexpressed genes in other non-tumor tis- would likely show a decreasing pattern of expression with sues were less than 10. We found that biological adhesion cancer progression. and cell adhesion were two GO functions present in all the A total of 221 genes showed an increasing expression pattern sets of coexpressed genes from colon, esophagus, kidney and during tumor progression (ESCA:91, KIRP:89, THCA:25, liver samples. The GO terms cell development, extracellu- KIRC:24); 227 genes showed a decreasing expression pattern lar structure organization, extracellular matrix organization (ESCA:89, KIRP:68, LIHC:64, KIRC:13) (Supplementary and multicellular organism development were present in all Table S29; Additional File 4). We compared the results ob- sets except the one from lung. Multiple other GO terms were tained by MOG to data in The Human Protein Atlas (THPA) overrepresented in a tissue specific manner (Supplementary (104). Out of the 111 genes we identified as increasing dur- Table S21-S26). ing progression of KIRC or KIRP, 56 have been described as unfavourable prognostic for renal cancer by THPA (Sup- J.2. GPC3 associated modules in tumor versus non-tumor plementary Table S30). For example, Figure 5 A shows ex- liver samples. To explore potential interactions of GPC3 with pression of the Cluster of Differentiation 70 (CD70) gene in other genes in non-tumor and tumor liver samples, we used KICH, KIRC and KIRP tumors; CD70 expression increased MOG to build two gene coexpression networks from the with KIRC and KIRP progression; CD70 gene is marked by 3,012 genes that were differentially expressed in LIHC– one THPA as a prognostic unfavourable for renal cancer. network from non-tumor liver samples, and a second network Out of the 79 genes we identified as decreasing with cancer from LIHC samples (Supplementary File 3; Additional File progression in KIRC or KIRP, 39 were labeled as prognostic

Singh et al. | bioRχiv | 9 bioRxiv preprint doi: https://doi.org/10.1101/698969; this version posted July 14, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Fig. 4. MOG visualizations of glypican 3 (GPC3) expression pattern in tumor and non-tumor tissues. (A) Line chart of GPC3 generated by interactively filtering by study metadata to retain 3,184 samples from 5 tumor types and corresponding non-tumor tissues and grouping the chart by tissue/tumor type. (B) Box plot summary of data in (A). Generated by interactively splitting GPC3 box plot according to tissue/tumor type. (C) Scatter plot showing co-expression of GPC3 and Lumican (LUM) in liver non-tumor and LIHC samples. In non-tumor liver tissue (red), GPC3 and LUM expression are strongly correlated (Spearman correlation=0.70). In LIHC samples (black), GPC3 and LUM expression shows no association (Spearman correlation= −0.10). (D and E) Histograms of distribution of Spearman correlation coefficients of GPC3 expression with expression of all the other genes under: non-tumor liver samples (D) and LIHC samples (E). In non-tumor liver samples the distribution of Spearman correlation coefficients has a longer right tail indicating presence of higher Spearman correlation coefficients as compared to LIHC samples. In LIHC samples the distribution of Spearman correlation coefficients is centered around zero and the correlation coefficients roughly bounded between -0.5 and 0.5. favourable for renal cancer by THPA (Supplementary Table We propose that the other 327 genes identified in this study S30). Figure5 B and 5 C shows the expression pattern of not labeled as prognostic in THPA could be new potential Phosphoenolpyruvate Carboxykinase 1 (PCK1) and Chromo- prognostic biomarkers for these tumor types (Supplementary some 10 Open Reading Frame 99 (C10orf99) over different S30). For example, Table 3 lists genes identified by our stages of KIRC and KIRP. Further, 27 genes out of the 64 method that are not marked as prognostic in THPA but have that we identified as decreasing with cancer progression in been experimentally studied for their potential role as a prog- LIHC were labeled as prognostic favourable for liver cancer nostic biomarker in LIHC or THCA. in THPA (Supplementary Table S30). Out of 25 genes identi- For prognosis and personalized treatment, the exceptions are fied as having increasing pattern in THCA, none were labeled extremely important. By using MOG to explore very large as prognostic by THPA (Supplementary Table S30). sets of data, a researcher can visualize in more detail whether

10 | bioRχiv Singh et al. | bioRxiv preprint doi: https://doi.org/10.1101/698969; this version posted July 14, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. L Arabidopsis thaliana microarray data

Disease Gene Gene name Pattern Ref. LIHC ARG1 arginase 1 Decreasing (105) cytochrome P450 LIHC CYP2C8 family 2 subfamily Decreasing (106) C member 8 cytochrome P450 LIHC CYP3A4 family 3 subfamily Decreasing (107) A member 4 cytochrome P450 LIHC CYP3A7 family 3 subfamily Decreasing (107) A member 7 cytochrome P450 LIHC CYP4A11 family 4 subfamily Decreasing (108) A member 11 THCA CHI3L1 chitinase 3 like 1 Increasing (109) THCA SFTPB surfactant protein B Increasing (110) THCA CD207 CD207 molecule Increasing (111) mucin 21, cell THCA MUC21 Increasing (111) surface associated matrix THCA MMP7 Increasing (112, 113) metallopeptidase 7 IGF like family THCA IGFL2 Increasing (111) member 2 kallikrein related THCA KLK7 Increasing (114, 115) peptidase 7 THCA FN1 fibronectin 1 Increasing (116)

Table 3. Genes identified by MOG as showing changed expression in LIHC and THCA with progression of cancer (B-H corrected p−value < 0.05) that have also been considered in studies (Ref. column) for use as potential prognostic biomark- ers. These genes are not marked as prognostic in The Human Protein Atlas (THPA) (104) for the given cancer types. certain samples do not show changed expression of a prog- nostic marker; this could lead to additional experimental analysis to reveal whether something about these "no-show" samples might be biologically distinct. A researcher can also evaluate which cancers have a particular biomarker; how spe- cific is a given biomarker to a particular type and stage of cancer. For example, ARG1, CYP2C8, CYP3A4, CYP3A7 and CYP4A11, which we identified using MOG as decreas- ing expression with LIHC progression, have been recently studied as prognostic markers for hepatocellular carcinoma (105–108).

L. Arabidopsis thaliana microarray data. Our aim in the case study of A. thaliana microarray data was to explore pat- Fig. 5. MOG visualization of expression of selected genes during progression of terns of expression of genes that little or nothing is known renal cancers. (A) Box plots summarizing CD70 expression in non-tumor kidney about. The well-vetted dataset we used (26), encompasses and at different stages of KICH, KIRC, and KIRP kidney cancers. CD70 is desig- expression values for 22,746 genes across 423 A. thaliana nated as prognostic unfavourable for renal cancer in THPA. CD0 levels increases 90-fold in KIRC and 14-fold in KIRP, but it is decreased in KICH (log2FC = 1.56 microarray chips; the chips represent 71 diverse studies and adj p-value=0.004). (B and C) Line charts showing average expression of PCK1 a wide variety of environmental, genetic and developmental (green) and C10orf99 (red) over different stages of KIRC (B) and KIRP (C). The conditions (26). We updated the gene metadata to the current vertical lines are the error bars. THPA designates PCK1 as prognostic favourable and C10orf99 as prognostic unfavourable for renal cancer. TAIR annotations (33); to these we added phylostrata desig- nations (obtained from phylostratr (35)). First, we sought to identify genes of unknown or partially- known function that might be involved in regulation of photo- synthesis, the process that gave rise to the earths oxygenated atmosphere and the associated evolution of extant complex eukaryotic species. Assembly and disassembly of the photo- system I and II complexes are dynamic processes of photo- synthesis that respond sensitively to shifts in light and other environmental factors (117–120); Met1 (AT1G55480) is a 36

Singh et al. | bioRχiv | 11 bioRxiv preprint doi: https://doi.org/10.1101/698969; this version posted July 14, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Kda protein that regulates the assembly of the photosystem mosome organization and DNA repair (Additional File 5). II (PSII) complex (120). To explore genes that might be in- This reflects and emphasizes the critical role of these pro- volved in PSII assembly, we calculated Spearman correlation cesses in the development of male gametophyte development, of expression of Met1 with that of the 22,746 genes repre- particularly sperm biogenesis. Each angiosperm pollen grain sented on the Affymetrix chip (Figure6). This analysis shows must produce two viable sperm each used in the double fer- that expression of 104 genes is highly correlated (Spearman’s tilization of the ovule. Above all else, proper mitogenesis coefficient > 0.9) to Met1 gene expression across all condi- is essential to the function of a pollen grain. We visualized tions (Supplementary Table S31). the leaf vs pollen differential analysis by volcano plot (Fig- First, we examined whether genes of photosynthesis were ure 8; Supplementary Figure 8), this time to explore genes over-represented among the Met1 coexpressed cohort. upregulated in leaf. Among these is At1G67860, an Ara- Among the Met1 coexpressed genes, the Gene Ontol- bidopsis specific gene encoding a protein of unknown func- ogy (GO) Biological Functional terms most highly over- tion. We used MOG to correlate expression of this gene ver- represented (p-value < 10−5) are integral to the light re- sus all genes across all samples. One hundred sixteen genes, actions of photosynthesis: generation of precursor metabo- dispersed across all five , are coexpressed with lites and energy; photosynthetic electron transport in photo- At1G67860 (Spearman correlation 0.65-0.95) (Supplemen- system I (PSI); reductive pentose-phosphate cycle; response tary Table S31). The genes are expressed almost exclusively to cytokinin; and PS2 assembly (Supplementary Table S32). in mature leaf (Supplementary Figure 9). Most have no For example, the gene most highly correlated with Met1 is known function; a GO enrichment test indicates that GO bi- At2g04039, a gene encoding the NdhV protein, which is ological processes overrepresented among the genes are: de- thought to stabilize the nicotinamide dehydrogenase (NDH) fense response, response to stress, response to external biotic complex of PS1 (121); phylostratal analysis (35) indicates stimulus and response to other organism (p-value < 1.E-03) that NdhV has homologs across photosynthetic organisms, (Supplementary Table S33). streptophyta (land plants and most green algae). Eigh- teen of the Met1 coexpressed genes are designated as "un- M. Arabidopsis thaliana metabolomics data. known function" or "uncharacterized’; six are restricted to Metabolomics is providing a growing resource for un- Viridiplantae. These genes would be good candidates to eval- derstanding metabolic pathways and identifying the uate experimentally for a possible function in photosynthetic structural and regulatory genes that shape these pathways light reaction. and their interconnected lattice (82, 126, 127). Here, we use a metabolomics dataset that represents a comprehensive Our next aim was to use MOG to directly explore an or- study of 50 mutants with a normal morphological phenotype phan gene (a gene encoding a protein unrecognizable by ho- but altered metabolite levels, and 19 wild type control lines mology to those of other species) (34, 35, 122–124), and (81). There are 8-16 biological replicates for each genetic to determine potential processes that it might be involved line; data is corrected for batch effects. Data and metadata in. First, we filtered each gene’s target description to re- were retrieved from PMR (82). The aim of this case study tain "unknown". From these, we filtered to retain only the was to tease out coexpressed metabolites that are affected phylostratigraphic designation "Arabidopsis thaliana". From by genetic perturbations. We identified a group of four this gene list, we identified genes that had an expression value highly coexpressed metabolites (Pearson correlation > 0.8): greater than 100 in at least five samples. We selected the or- the amino acid arginine, its precursor L-ornithine, cyclic phan gene of unknown function, At2G04675, for exploratory ornithine (3-amino-piperidine-2-one), and one unidenti- analysis. At2G04675 encodes a predicted protein of 67 aa fied metabolite. Plots across the means of the biological with no known sequence domains (domains searched using replicates of each sample (Supplementary Figure 10), CDD (125)). A Pearson correlation analysis of the expression shows accumulation of these metabolites is upregulated pattern of At2G04675 with the other genes represented on the over four-fold in four mutant lines: mur9, mutants have Affymetrix chip showed 48 genes had a Pearson correlation altered cell wall constituents; vtc1, encodes GDP-mannose of higher than 0.95 (Supplementary Table S31); these genes pyrophosphorylase, required for synthesis of manose, are expressed almost exclusively in pollen (the male game- major constituent of cell walls, upregulated upon bacterial tophyes of flowering plants) (Figure 7). The exploration im- infection; cim13, gene of unknown function associated with plicates At2G04675 as a candidate for involvement in some disease resistance, eto1, negative regulator of biosynthesis of aspect of pollen biology. the plant hormone ethylene. An arginine-derived metabolite, Using MOG to further explore genes that are associated with nitrous oxide, has been widely implicated in signaling pollen, we identified sets of leaf and pollen samples (Sup- pathways in plants (128). MOG analysis might suggest to a plementary Figure 7; Additional File 5), and then calculated researcher a potential relationship between arginine and the genes that are differentially expressed in the leaf samples cell wall defense response to consider for experimentation. versus the pollen samples using a Mann-Whitney test (Ex- pression change is by 2-fold or more; B-H corrected p-value N. Comparison to other software. Few tools that do not < 10−3) (Additional File 5). The GO terms most highly en- require coding are available for on-the-fly exploration of ex- riched (p-value < 10−20) among genes upregulated in pollen pression data. Most are essentially interfaces to R pack- are processes of cell cycle, mitosis, organellar fission, chro- ages (18–20, 129), providing a "shiny" (17) web-app inter-

12 | bioRχiv Singh et al. | bioRxiv preprint doi: https://doi.org/10.1101/698969; this version posted July 14, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. N Comparison to other software

Fig. 6. Spearman correlation followed by line-plot visualization, using MOG, shows that Met1 (At1G55480) is highly expressed in photosynthetic tissues and highly correlated with several genes of unknown function. The "peaks" of expression are all leaf samples; the "troughs" of expression are predominantly root and cell culture samples. A. thaliana Affymetrix microarray dataset representing 71 diverse studies and a wide variety of environmental, genetic and developmental conditions (26). Several genes of unknown function are closely coexpressed in this cluster.

Fig. 7. MOG line chart visualization shows the orphan gene At2G04675, of no known function, and genes highly correlated with At2G04675 are expressed almost exclusively in pollen/male gametophyte samples. A. thaliana Affymetrix microarray dataset representing 71 diverse studies and a wide variety of environmental, genetic and developmental conditions (26). X-axis are samples, and Y-axis indicates their expression value. Each line represents a gene. (Lines in this visualization are for clarity and the connections from sample to sample do not imply a relationship)

Singh et al. | bioRχiv | 13 bioRxiv preprint doi: https://doi.org/10.1101/698969; this version posted July 14, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

genes over 7,142 samples) cancer dataset: RStudio reported the error message: "cannot allocate vector of size 992.4 Mb" (Additional File 6). We then reduced the test dataset to only the liver tumor and non-tumor samples (18,212 genes over 410 samples), and were able to load this dataset to PIVOT. As a benchmark comparison of MOG and PIVOT, we performed a Mann-Whitney test to identify differentially- expressed genes in liver tumor and non-tumor samples and measured the execution time of this step, i.e., the time taken to produce and display the output. A Mann-Whitney test completed in 21 minutes using PIVOT, whereas MOG took only 7 seconds (Figure 9). After the Mann-Whitney test com- pleted, we kept the program running idle until total runtime reached 30 minutes and compared memory and processor us- age of PIVOT and MOG over this 30 minute window, as re- ported by WPMT (Supplementary Table S35-S36). Average Fig. 8. Using MOG for differential expression analysis of leaf and pollen samples, memory usage of PIVOT (1,869 MB) was about twice that of followed by volcano plot visualization. Gene metadata, revealed upon hovering the mouse over a data point, shows At1G67860, an Arabidopsis specific gene with MOG (995 MB) (Supplementary Table S35-S36). The peak no known function, is 16-fold more highly accumulated in leaves relative to pollen % processor time of PIVOT’s when computing the Mann- −3 (Mann-Whitney U test; B-H corrected p-value< 10 ). Y axis, p-value Whitney test was about half that for MOG, indicating more CPU utilization by MOG; however, the average % processor face to a limited number of R packages for data normaliza- time over 30 minutes was 64% for PIVOT but only 2% for tion, differential expression analysis, PCA analysis (among MOG Figure 9 A). samples) and gene enrichment analysis. Because they are We benchmarked MOG’s performance on datasets of differ- written in R (18–20), these tools must rely on R’s somewhat ent sizes, created by splitting the cancer dataset by differ- limited capabilities for interactive applications. In contrast ent tissue types (tumor and non-tumor samples). For each to R, Java, MOG’s platform, has been used to develop nu- dataset, we performed Mann-Whitney test on all the genes merous software with interfaces that are interactive and user- for tumor and non-tumor groups and measured the execu- friendly (102, 130–133), and MOG provides the researcher tion time. MOG took only 31 seconds to compute a Mann- with specialized GUIs and methods for exploratory data anal- Whitney test on 18,212 genes over 1,054 samples (Figure 9 ysis. MOG’s GUI allows direct interactivity with the data and B; Additional File 6). the visualizations, so that a researcher can easily explore data We also measured the execution time for calculating the Pear- from different perspectives. Table4 compares four of the son correlation of a gene (BIRC5 was chosen at random) with most recent tools for exploratory analysis of expression data all others in a given dataset. MOG took only a couple of to MOG. (More details are provided in Supplementary Table seconds to compute a Pearson correlation over 1,000 sam- S34.) ples and 16 seconds to compute a Pearson correlation over all 7,142 samples (Figure 9 C). N.1. Benchmarking. We used the cancer dataset (18,212 genes over 7,142 samples) for benchmarking. Using a DELL Inspiron laptop with 64 bit Windows 10, 8 GB RAM and DISCUSSION Intel(R) Core(TM) i5-7300HQ CPU, we monitored the sys- An advantage of MOG is that large datasets that have tem’s resource utilization using Windows Performance Mon- been normalized and unwanted technical or biological ef- itor tool (WPMT) (134). During benchmarking, only the fects reduced (batch-corrected) based on different assump- software being tested was run on the system and no other tions can be compared quickly and interactively by statistical user program was running. and visual approaches, such that biologists can gain insight We compared MOGs efficiency to that of one of the "shiny" as to which best reflect experimentally-established "ground web-app (17) R software (choosing PIVOT, because it per- truths". This approach is complementary to the tacts bioinfor- mits loading processed data). R-based tools read all data di- maticists usually take (e.g., assessing GO term enrichment in rectly into the main memory. Thus, on a desktop computer, gene clusters). For the present, we have intentionally avoided analysis of a big dataset is slow (or crashes) if the available incorporating capabilities for data normalization and batch- memory is not sufficiently large. For example, a dataset of correction into MOG for two reasons. 100,000 human transcripts over 5,000 samples (500,000,000 First, the selection of appropriate methods depends on the expression values) requires at least 4GB (8 byte for each data distribution and the biological questions to be asked. value) of free memory to be loaded into memory at once. Different types of data have different characteristics and thus In contrast, MOG uses an indexing strategy to read data only require different normalization methods (37, 52, 135). Re- when it is needed, which drastically reduces the total memory moving/minimizing batch-effects before evaluating the data consumption of the system. also entails more assumptions and approaches. These statis- PIVOT repeatedly crashed and failed to load the full (18,212 tical methods can easily be misapplied, especially when the

14 | bioRχiv Singh et al. | bioRxiv preprint doi: https://doi.org/10.1101/698969; this version posted July 14, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. N Comparison to other software

MOG PIVOT iGEAK IRIS-EDA DEvis Reference This paper (18) (19) (20) (129) Year 2019 2018 2019 2019 2019 Platform/GUI Java/Swing R/Shiny R/Shiny R/Shiny R/None Interactivity ++++ + + + - Save progress Yes Yes No No No Use any R package Yes No No No No Supported data types Omics or other numerical data RNA-Seq/scRNA-Seq RNA-Seq/microarray RNA-Seq/scRNA-Seq RNA-Seq M-W U test (sec) 7 1,260 NA NA NA

Table 4. Comparison of MOG with existing tools for exploratory analysis of expression data. MOG’s GUI, designed with Java swing, is fully interactive. In contrast, other available tools are based on R and provide limited or no interactivity. With MOG, the user can execute any R package/script with interactively selected subsets of data if s/he wishes to perform additional analysis, whereas only a limited number of R-packages are available in the other packages. The last row compares the Mann-Whitney U test’s execution time for MOG and PIVOT using the liver tumor and non-tumor datasets (18,212 genes over 410 samples). A more detailed comparison of the tools is available in Supplementary Table S36.

Fig. 9. MOG performance benchmarks. MOG was benchmarked using the entire the cancer dataset (18,212 genes over 7142 samples) and using smaller-sized datasets derived by partitioning the full cancer dataset into tumor and non-tumor samples from various tissues.(A) Comparison of MOG to R-based (PIVOT). Dataset size was limited to the amount of data that could be loaded in PIVOT i.e., liver tumor and non-tumor samples (18,212 genes over 410 samples). The % processor time (% CPU utilization) for PIVOT and MOG was calculated over 30 minutes. The theoretical maximum value of % processor time is total processorsavailable in the computer x 100 (400 in this case). (B) Execution times for computing differentially-expressed genes using Mann-Whitney test with datasets of different sizes. Red dots, execution times for MOG; blue dot, execution time for PIVOT’s with 410 samples (as in panel A). Inset, same graph with expanded scale of Y-axis to display MOG execution times. (C) MOG execution times for pairwise computations of Pearson correlation of a gene (BIRC5) with all other genes in the datasets. (The other tools cannot perform this computation). Execution times are linear with data size, and analysis of the full dataset took just over 16 seconds. data is from multiple heterogeneous studies, resulting in a has many variations, investigating such markers in individ- misleading dataset. Much as if using R or MATLAB statisti- ual tumors would provide critical information for personal- cal software, a MOG user must consider these. Second, the ized medicine, as well as aid cancer research (136). GPC3 methodologies are very unsettled ((37, 51, 53, 135)) with new is known (93, 95, 96), and also was identified by MOG from approaches and variations being developed each year (for the data, as a marker gene for liver cancer. Gene-level resolu- RNA-Seq raw reads there were over 10,000 journal articles tion analysis revealed that the genes that are coexpressed with in first half of 2019 in GoogleScholar, searched on "RNA- GPC3 gene change drastically among the individual tissues, Seq normalization methods"). and between tumor samples and non-tumor samples. Using the A. thaliana microarray dataset, we explored ex- We demonstrated MOG’s functionality and detailed its appli- pression patterns of genes with unknown functions including cation by exploring three different datasets: a human cancer orphan genes, demonstrating a plant-restricted gene that was RNA-Seq dataset from non-tumor and tumor samples, an A. tightly coexpressed with a set of genes involved in in pho- thaliana microarray dataset, and an A. thaliana metabolomics tosystem assembly, and identifying an Arabidopsis specific dataset. Exploring the large human dataset, we created a cat- gene that is highly expressed in pollen development. alog of genes which are differentially expressed in different types of cancers, identifying only a handful of genes that are consistently upregulated or downregulated in every type of CONCLUSION cancer. Using the sample and study metadata, we identified MOG is a novel tool that enables interactive exploratory anal- genes that showed regulation with cancer progression. Com- ysis of big omics datasets; and indeed any normalized dataset paring our results with genes described in THPA, indicated can be input to MOG. Visualizations produced by MOG are that some of the genes we identified have been reported in fully interactive, and can easily allow researchers to detect literature to be potential prognostic biomarkers for different and mine interesting data points. The statistical methods cancers, however, many others of the genes are not marked implemented in MOG help to guide exploration of hidden as prognostic in THPA. These genes present potential new patterns in a user-friendly manner. By integrating metadata, biomarkers for disease progression. Because each tumor type MOG affords an opportunity to extract new insights about

Singh et al. | bioRχiv | 15 bioRxiv preprint doi: https://doi.org/10.1101/698969; this version posted July 14, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

the data based on information about each feature, or any of Bibliography the diverse information entered by experimenters about the 1. Alvis Brazma, Helen Parkinson, Ugis Sarkans, Mohammadreza Shojatalab, Jaak Vilo, biology of each sample and the experimental protocols and Niran Abeygunawardena, Ele Holloway, Misha Kapushesky, Patrick Kemmeren, Gon- parameters used to obtain its data. zalo Garcia Lara, et al. Arrayexpress—a public repository for microarray gene expression data at the ebi. Nucleic acids research, 31(1):68–71, 2003. 2. Yuichi Kodama, Martin Shumway, and Rasko Leinonen. The sequence read archive: ex- plosive growth of sequencing data. Nucleic acids research, 40(D1):D54–D56, 2011. DATA AVAILABILITY 3. Kenneth Haug, Reza M Salek, Pablo Conesa, Janna Hastings, Paula de Matos, Mark Rijnbeek, Tejasvi Mahendraker, Mark Williams, Steffen Neumann, Philippe Rocca-Serra, MOG is free and open source software published under the et al. Metabolights—an open-access general-purpose repository for metabolomics studies MIT License. MOG and all compiled datasets in this article and associated meta-data. Nucleic acids research, 41(D1):D781–D786, 2012. 4. Lennart Martens, Henning Hermjakob, Philip Jones, Marcin Adamski, Chris Taylor, Kris are freely downloadable from http://metnetweb. Gevaert, Joël Vandekerckhove, Rolf Apweiler, et al. Pride: the proteomics identifications gdcb.iastate.edu/MetNet_MetaOmGraph.htm. database. Proteomics, 5(13):3537–3545, 2005. 5. Cosmin Lazar, Stijn Meganck, Jonatan Taminau, David Steenhoff, Alain Coletta, Colin MOG’s source code and user guide is available at Molter, David Y Weiss-Solís, Robin Duque, Hugues Bersini, and Ann Nowé. Batch effect https://github.com/urmi-21/MetaOmGraph/. removal methods for microarray gene expression data integration: a survey. Briefings in bioinformatics, 14(4):469–490, 2012. 6. Daniel R Rhodes and Arul M Chinnaiyan. Integrative analysis of the cancer transcriptome. Nature genetics, 37(6s):S31, 2005. SUPPLEMENTARY DATA 7. Priyanka Bhandary, Arun S Seetharam, Zebulun W Arendsee, Manhoi Hur, and Eve Syrkin Wurtele. Raising orphans from a metadata morass: A researcher’s guide to re-use of Supplementary data are available at BioRxiv and https: public’omics data. Plant Science, 267:32–47, 2018. //github.com/urmi-21/MetaOmGraphMOG_ 8. Jing Li, Zebulun Arendsee, Urminder Singh, and Eve Syrkin Wurtele. Recycling rna-seq data to identify candidate orphan genes for experimental analysis. bioRxiv, page 671263, SupplementaryData. Additional files are available at 2019. https://github.com/urmi-21/MetaOmGraph/ 9. Andrea Rau, Guillemette Marot, and Florence Jaffrézic. Differential meta-analysis of rna- seq data from multiple studies. BMC bioinformatics, 15(1):91, 2014. MOG_SupplementaryData. 10. Tianzhou Ma, Zhiguang Huo, Anche Kuo, Li Zhu, Zhou Fang, Xiangrui Zeng, Chien-Wei Lin, Silvia Liu, Lin Wang, Peng Liu, et al. Metaomics: analysis pipeline and browser-based software suite for transcriptomic meta-analysis. Bioinformatics, 2018. ACKNOWLEDGEMENTS 11. SungHwan Kim, Dongwan Kang, Zhiguang Huo, Yongseok Park, and George C Tseng. Meta-analytic principal component analysis in integrative omics application. Bioinformat- We especially thank Nick Ransom for his key role in MOG’s ics, 34(8):1321–1328, 2017. 12. Matthew E Ritchie, Belinda Phipson, Di Wu, Yifang Hu, Charity W Law, Wei Shi, and early development. We are grateful to our collaborators, Gordon K Smyth. limma powers differential expression analyses for rna-sequencing and Kevin Bassler, Pramesh Singh and Ling Li, for their help microarray studies. Nucleic acids research, 43(7):e47–e47, 2015. 13. Michael I Love, Wolfgang Huber, and Simon Anders. Moderated estimation of fold change and feedback. We much appreciate the efforts of Jing Li, and dispersion for rna-seq data with deseq2. Genome biology, 15(12):550, 2014. Priyanka Bhandhary, Arun Seetharam, and the early-adopters 14. Sonia Tarazona, Fernando García, Alberto Ferrer, Joaquín Dopazo, and Ana Conesa. Noiseq: a rna-seq differential expression method robust for sequencing depth biases. EM- who beta-tested MOG and provided valuable feedback. Bnet. journal, 17(B):18–19, 2011. 15. Jun Li and Robert Tibshirani. Finding consistent patterns: a nonparametric approach for identifying differential expression in rna-seq data. Statistical methods in medical research, FUNDING 22(5):519–536, 2013. 16. Mark D Robinson, Davis J McCarthy, and Gordon K Smyth. edger: a bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics, 26(1): This work is funded in part by National Science Foundation 139–140, 2010. grant IOS 1546858, Orphan Genes: An Untapped Genetic 17. Winston Chang, Joe Cheng, JJ Allaire, Yihui Xie, and Jonathan McPherson. shiny: Web Application Framework for R, 2018. R package version 1.2.0. Reservoir of Novel Traits, and by the Center for Metabolic 18. Qin Zhu, Stephen A Fisher, Hannah Dueck, Sarah Middleton, Mugdha Khaladkar, and Biology, Iowa State University. Junhyong Kim. Pivot: platform for interactive analysis and visualization of transcriptomics data. BMC bioinformatics, 19(1):6, 2018. 19. Kwangmin Choi and Nancy Ratner. igeak: an interactive gene expression analysis kit for Conflict of interest statement.. None declared. seamless workflow using the r/shiny platform. BMC genomics, 20(1):177, 2019. 20. Brandon Monier, Adam McDermaid, Cankun Wang, Jing Zhao, Allison Miller, Anne Fen- nell, and Qin Ma. Iris-eda: An integrated rna-seq interpretation system for gene expression data analysis. PLoS computational biology, 15(2):e1006792, 2019. 21. John W Tukey. Exploratory data analysis, volume 2. Reading, Mass., 1977. 22. Thomas Kelder, Bruce R Conklin, Chris T Evelo, and Alexander R Pico. Finding the right questions: exploratory pathway analysis to enhance biological discovery in large datasets. PLoS biology, 8(8):e1000472, 2010. 23. Paul T Shannon, Mark Grimes, Burak Kutlu, Jan J Bot, and David J Galas. Rcytoscape: tools for exploratory network analysis. BMC bioinformatics, 14(1):217, 2013. 24. Qingguo Wang, Joshua Armenia, Chao Zhang, Alexander V Penson, Ed Reznik, Liguo Zhang, Thais Minet, Angelica Ochoa, Benjamin E Gross, Christine A Iacobuzio-Donahue, et al. Unifying cancer and normal rna sequencing data from different sources. Scientific data, 5:180061, 2018. 25. Alvis Brazma and Jaak Vilo. Gene expression data analysis. FEBS letters, 480(1):17–24, 2000. 26. Wieslawa I Mentzen and Eve Syrkin Wurtele. Regulon organization of arabidopsis. BMC plant biology, 8(1):99, 2008. 27. Márcia M. Almeida-de Macedo, Nick Ransom, Yaping Feng, Jonathan Hurst, and Eve Syrkin Wurtele. Comprehensive analysis of correlation coefficients estimated from pooling heterogeneous microarray data. BMC Bioinformatics, 14:214, 2013. 28. S. Trevino, Y. Sun, T. F. Cooper, and K. E. Bassler. Robust detection of hierarchical com- munities from Escherichia coli gene expression data. PLoS Comput. Biol., 8(2):e1002391, 2012. 29. Marilyn Safran, Irina Dalah, Justin Alexander, Naomi Rosen, Tsippi Iny Stein, Michael Shmoish, Noam Nativ, Iris Bahir, Tirza Doniger, Hagit Krug, et al. Genecards version 3: the human gene integrator. Database, 2010, 2010. 30. Tim Hubbard, Daniel Barker, Ewan Birney, Graham Cameron, Yuan Chen, L Clark, Tony Cox, J Cuff, Val Curwen, Thomas Down, et al. The ensembl genome database project. Nucleic acids research, 30(1):38–41, 2002. 31. Paul Julian Kersey, James E Allen, Irina Armean, Sanjay Boddu, Bruce J Bolt, Denise

16 | bioRχiv Singh et al. | bioRxiv preprint doi: https://doi.org/10.1101/698969; this version posted July 14, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. N Comparison to other software

Carvalho-Silva, Mikkel Christensen, Paul Davis, Lee J Falin, Christoph Grabmueller, et al. 57. Peter Langfelder and Steve Horvath. Wgcna: an r package for weighted correlation net- Ensembl genomes 2016: more genomes, more complexity. Nucleic acids research, 44 work analysis. BMC bioinformatics, 9(1):559, 2008. (D1):D574–D580, 2015. 58. Jeremiah J Faith, Boris Hayete, Joshua T Thaden, Ilaria Mogno, Jamey Wierzbowski, 32. Kim D Pruitt, Tatiana Tatusova, and Donna R Maglott. Ncbi reference sequences (ref- Guillaume Cottarel, Simon Kasif, James J Collins, and Timothy S Gardner. Large-scale seq): a curated non-redundant sequence database of genomes, transcripts and proteins. mapping and validation of escherichia coli transcriptional regulation from a compendium Nucleic acids research, 35(suppl_1):D61–D65, 2006. of expression profiles. PLoS biology, 5(1):e8, 2007. 33. Philippe Lamesch, Tanya Z Berardini, Donghui Li, David Swarbreck, Christopher Wilks, 59. Sipko van Dam, Urmo Vosa, Adriaan van der Graaf, Lude Franke, and Joao Pedro de Ma- Rajkumar Sasidharan, Robert Muller, Kate Dreher, Debbie L Alexander, Margarita Garcia- galhaes. Gene co-expression analysis for functional classification and gene–disease pre- Hernandez, et al. The arabidopsis information resource (tair): improved gene annotation dictions. Briefings in bioinformatics, 19(4):575–592, 2017. and new tools. Nucleic acids research, 40(D1):D1202–D1210, 2011. 60. Andrew T McKenzie, Igor Katsyv, Won-Min Song, Minghui Wang, and Bin Zhang. Dgca: a 34. Tomislav Domazet-Lošo, Josip Brajkovic,´ and Diethard Tautz. A phylostratigraphy ap- comprehensive r package for differential gene correlation analysis. BMC systems biology, proach to uncover the genomic history of major adaptations in metazoan lineages. Trends 10(1):106, 2016. in Genetics, 23(11):533–539, 2007. 61. YX Rachel Wang and Haiyan Huang. Review on statistical methods for gene network 35. Zebulun Arendsee, Jing Li, Urminder Singh, Arun Seetharam, Karin Dorman, and reconstruction using expression data. Journal of theoretical biology, 362:53–61, 2014. Eve Syrkin Wurtele. phylostratr: a framework for phylostratigraphy. Bioinformatics, 03 62. Carsten O Daub, Ralf Steuer, Joachim Selbig, and Sebastian Kloska. Estimating mutual 2019. ISSN 1367-4803. doi: 10.1093/bioinformatics/btz171. information using b-spline functions–an improved similarity measure for analysing gene 36. Matthew N Bernstein, AnHai Doan, and Colin N Dewey. Metasra: normalized human expression data. BMC bioinformatics, 5(1):118, 2004. sample-specific metadata for the sequence read archive. Bioinformatics, 33(18):2914– 63. Lin Song, Peter Langfelder, and Steve Horvath. Comparison of co-expression measures: 2923, 2017. mutual information, correlation, and model based indices. BMC bioinformatics, 13(1):328, 37. Ciaran Evans, Johanna Hardin, and Daniel M Stoebel. Selecting between-sample rna-seq 2012. normalization methods from the perspective of their assumptions. Briefings in bioinformat- 64. P. Singh, T. Chen, Z. Arendsee, E. S. Wurtele, and K. E. Bassler. A Regulatory Network ics, 19(5):776–792, 2017. Analysis of Orphan Genes in Arabidopsis Thaliana. In APS March Meeting Abstracts, 38. Jeffrey T Leek, Robert B Scharpf, Héctor Corrada Bravo, David Simcha, Benjamin Lang- page V6.005, 2017. mead, W Evan Johnson, Donald Geman, Keith Baggerly, and Rafael A Irizarry. Tackling the 65. Adam A Margolin, Ilya Nemenman, Katia Basso, Chris Wiggins, Gustavo Stolovitzky, Ric- widespread and critical impact of batch effects in high-throughput data. Nature Reviews cardo Dalla Favera, and Andrea Califano. Aracne: an algorithm for the reconstruction of Genetics, 11(10):733, 2010. gene regulatory networks in a mammalian cellular context. In BMC bioinformatics, vol- 39. Sara Mostafavi, Alexis Battle, Xiaowei Zhu, Alexander E Urban, Douglas Levinson, ume 7, page S7. BioMed Central, 2006. Stephen B Montgomery, and Daphne Koller. Normalizing rna-sequencing data by mod- 66. Larry V Hedges and Jack L Vevea. Fixed-and random-effects models in meta-analysis. eling hidden covariates with prior knowledge. PLoS One, 8(7):e68141, 2013. Psychological methods, 3(4):486, 1998. 40. Daniel R Rhodes, Jianjun Yu, K Shanker, Nandan Deshpande, Radhika Varambally, De- 67. Andy P Field. Meta-analysis of correlation coefficients: a monte carlo comparison of fixed- bashis Ghosh, Terrence Barrette, Akhilesh Pandey, and Arul M Chinnaiyan. Large-scale and random-effects methods. Psychological methods, 6(2):161, 2001. meta-analysis of cancer microarray data identifies common transcriptional profiles of neo- 68. Lun-Ching Chang, Hui-Min Lin, Etienne Sibille, and George C Tseng. Meta-analysis meth- plastic transformation and progression. Proceedings of the National Academy of Sciences, ods for combining multiple expression profiles: comparisons, statistical characterization 101(25):9309–9314, 2004. and an application guideline. BMC bioinformatics, 14(1):368, 2013. 41. Daniel Toro-Domínguez, Jordi Martorell-Marugán, Raúl López-Domínguez, Adrián García- 69. Xiangqin Cui and Gary A Churchill. Statistical tests for differential expression in cdna Moreno, Víctor González-Rumayor, Marta E Alarcón-Riquelme, and Pedro Carmona- microarray experiments. Genome biology, 4(4):210, 2003. Sáez. Imageo: integrative gene expression meta-analysis from geo database. Bioin- 70. Marine Jeanmougin, Aurélien De Reynies, Laetitia Marisa, Caroline Paccard, Gregory formatics, 2018. Nuel, and Mickael Guedj. Should we abandon the t-test in the analysis of gene expression 42. George C Tseng, Debashis Ghosh, and Eleanor Feingold. Comprehensive literature re- microarray data: a comparison of variance modeling strategies. PloS one, 5(9):e12336, view and statistical considerations for microarray meta-analysis. Nucleic acids research, 2010. 40(9):3785–3799, 2012. 71. Charlotte Soneson and Mauro Delorenzi. A comparison of methods for differential expres- 43. Vincenzo Lagani, Argyro D Karozou, David Gomez-Cabrero, Gilad Silberberg, and Ioannis sion analysis of rna-seq data. BMC bioinformatics, 14(1):91, 2013. Tsamardinos. A comparative evaluation of data-merging and meta-analysis methods for 72. Ronald A Fisher. Frequency distribution of the values of the correlation coefficient in sam- reconstructing gene-gene interactions. BMC bioinformatics, 17(5):S194, 2016. ples from an indefinitely large population. Biometrika, 10(4):507–521, 1915. 44. Jason Rudy and Faramarz Valafar. Empirical comparison of cross-platform normalization 73. Atsushi Fukushima. Diffcorr: an r package to analyze and visualize differential correlations methods for gene expression data. BMC bioinformatics, 12(1):467, 2011. in biological networks. Gene, 518(1):209–214, 2013. 45. Patrick Warnat, Roland Eils, and Benedikt Brors. Cross-platform analysis of cancer mi- 74. Eric W Weisstein. Bonferroni correction. 2004. croarray data improves gene expression based classification of phenotypes. BMC bioin- 75. Yoav Benjamini and Yosef Hochberg. Controlling the false discovery rate: a practical and formatics, 6(1):265, 2005. powerful approach to multiple testing. Journal of the Royal statistical society: series B 46. Mark D Robinson and Alicia Oshlack. A scaling normalization method for differential ex- (Methodological), 57(1):289–300, 1995. pression analysis of rna-seq data. Genome biology, 11(3):R25, 2010. 76. John Lonsdale, Jeffrey Thomas, Mike Salvatore, Rebecca Phillips, Edmund Lo, Saboor 47. Qin Cao, Christine Anyansi, Xihao Hu, Liangliang Xu, Lei Xiong, Wenshu Tang, Myth TS Shad, Richard Hasz, Gary Walters, Fernando Garcia, Nancy Young, et al. The genotype- Mok, Chao Cheng, Xiaodan Fan, Mark Gerstein, et al. Reconstruction of enhancer–target tissue expression (gtex) project. Nature genetics, 45(6):580, 2013. networks in 935 samples of human primary cells, tissues and cell lines. Nature genetics, 77. Antonio Colaprico, Tiago C Silva, Catharina Olsen, Luciano Garofano, Claudia Cava, Da- 49(10):1428, 2017. vide Garolini, Thais S Sabedot, Tathiane M Malta, Stefano M Pagnotta, Isabella Castiglioni, 48. Chongzhi Zang, Tao Wang, Ke Deng, Bo Li, Qian Qin, Tengfei Xiao, Shihua Zhang, Clif- et al. Tcgabiolinks: an r/bioconductor package for integrative analysis of tcga data. Nucleic ford A Meyer, Housheng Hansen He, Myles Brown, et al. High-dimensional genomic data acids research, 44(8):e71–e71, 2015. bias correction and data integration using mancie. Nature communications, 7:11305, 2016. 78. Zbyslaw Sondka, Sally Bamford, Charlotte G Cole, Sari A Ward, Ian Dunham, and Simon A 49. Jakob Willforss, Aakash Chawade, and Fredrik Levander. Normalyzerde: Online tool for Forbes. The cosmic cancer gene census: describing genetic dysfunction across all human improved normalization of omics expression data and high-sensitivity differential expres- cancers. Nature Reviews Cancer, page 1, 2018. sion analysis. Journal of proteome research, 2018. 79. Joanna S Amberger, Carol A Bocchini, François Schiettecatte, Alan F Scott, and Ada 50. Lucia Peixoto, Davide Risso, Shane G Poplawski, Mathieu E Wimmer, Terence P Speed, Hamosh. Omim. org: Online mendelian inheritance in man (omim®), an online catalog of Marcelo A Wood, and Ted Abel. How data analysis affects power, reproducibility and human genes and genetic disorders. Nucleic acids research, 43(D1):D789–D798, 2014. biological insight of rna-seq studies in complex datasets. Nucleic acids research, 43(16): 80. Yee Hwa Yang, Sandrine Dudoit, Percy Luu, David M Lin, Vivian Peng, John Ngai, and 7664–7674, 2015. Terence P Speed. Normalization for cdna microarray data: a robust composite method 51. Florian Schmidt, Markus List, Engin Cukuroglu, Sebastian Köhler, Jonathan Göke, and addressing single and multiple slide systematic variation. Nucleic acids research, 30(4): Marcel H Schulz. An ontology-based method for assessing batch effect adjustment ap- e15–e15, 2002. proaches in heterogeneous datasets. Bioinformatics, 34(17):i908–i916, 2018. 81. Atsushi Fukushima, Miyako Kusano, Ramon Francisco Mejia, Mami Iwasa, Makoto 52. Aakash Chawade, Erik Alexandersson, and Fredrik Levander. Normalyzer: a tool for rapid Kobayashi, Naomi Hayashi, Akiko Watanabe-Takahashi, Tomoko Narisawa, Takayuki To- evaluation of normalization methods for omics data sets. Journal of proteome research, hge, Manhoi Hur, et al. Metabolomic characterization of knockout mutants in arabidopsis: 13(6):3114–3120, 2014. development of a metabolite profiling database for knockout mutants in arabidopsis. Plant 53. Joseph N Paulson, Cho-Yi Chen, Camila M Lopes-Ramos, Marieke L Kuijjer, John Platig, physiology, 165(3):948–961, 2014. Abhijeet R Sonawane, Maud Fagny, Kimberly Glass, and John Quackenbush. Tissue- 82. Manhoi Hur, Alexis Ann Campbell, Marcia Almeida-de Macedo, Ling Li, Nick Ransom, aware rna-seq processing and normalization for heterogeneous and sparse data. BMC Adarsh Jose, Matt Crispin, Basil J Nikolau, and Eve Syrkin Wurtele. A global approach to bioinformatics, 18(1):437, 2017. analysis and interpretation of metabolic data for plant natural product discovery. Natural 54. Alexis Vandenbon, Viet H Dinh, Norihisa Mikami, Yohko Kitagawa, Shunsuke Teraguchi, product reports, 30(4):565–583, 2013. Naganari Ohkura, and Shimon Sakaguchi. Immuno-navigator, a batch-corrected coex- 83. Katherine Schon and Marc Tischkowitz. Clinical implications of germline mutations in pression database, reveals cell type-specific gene networks in the immune system. Pro- breast cancer: Tp53. Breast cancer research and treatment, 167(2):417–423, 2018. ceedings of the National Academy of Sciences, 113(17):E2393–E2402, 2016. 84. Martha L Slattery, Lila E Mullany, Roger K Wolff, Lori C Sakoda, Wade S Samowitz, and 55. Michael B Eisen, Paul T Spellman, Patrick O Brown, and David Botstein. Cluster analysis Jennifer S Herrick. The p53-signaling pathway and colorectal cancer: Interactions between and display of genome-wide expression patterns. Proceedings of the National Academy downstream p53 target genes and mirnas. Genomics, 2018. of Sciences, 95(25):14863–14868, 1998. 85. Ziad Kanaan, Shesh N Rai, M Robert Eichenberger, Christopher Barnes, Amy M Dworkin, 56. Sapna Kumari, Jeff Nie, Huann-Sheng Chen, Hao Ma, Ron Stewart, Xiang Li, Meng- Clayton Weller, Eric Cohen, Henry Roberts, Bobby Keskey, Robert E Petras, et al. Differ- Zhu Lu, William M Taylor, and Hairong Wei. Evaluation of gene association methods for ential microrna expression tracks neoplastic progression in inflammatory bowel disease- coexpression network construction and biological knowledge discovery. PloS one, 7(11): associated colorectal cancer. Human mutation, 33(3):551–560, 2012. e50411, 2012. 86. Elizabeth I Boyle, Shuai Weng, Jeremy Gollub, Heng Jin, David Botstein, J Michael Cherry,

Singh et al. | bioRχiv | 17 bioRxiv preprint doi: https://doi.org/10.1101/698969; this version posted July 14, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

and Gavin Sherlock. Go:: Termfinder—open source software for accessing gene ontology mir-129-5p/klk7 expression. Journal of cellular physiology, 233(10):6638–6648, 2018. information and finding significantly enriched gene ontology terms associated with a list of 115. Yayuan Zhang, Jintao Hu, Wenbing Zhou, and Hengyuan Gao. Lncrna foxd2-as1 acceler- genes. Bioinformatics, 20(18):3710–3715, 2004. ates the papillary thyroid cancer progression through regulating the mir-485-5p/klk7 axis. 87. Fran Supek, Matko Bošnjak, Nives Škunca, and Tomislav Šmuc. Revigo summarizes and Journal of cellular biochemistry, 2018. visualizes long lists of gene ontology terms. PloS one, 6(7):e21800, 2011. 116. Shaohua Zhan, Jinming Li, Tianxiao Wang, and Wei Ge. Quantitative proteomics analysis 88. Sukhneeraj P Kaur and Brian S Cummings. Role of in regulation of the tumor of sporadic medullary thyroid cancer reveals fn1 as a potential novel candidate prognostic microenvironment and cancer progression. Biochemical Pharmacology, 2019. biomarker. The oncologist, 23(12):1415–1425, 2018. 89. Fiona H Blackhall, Catherine LR Merry, EJ Davies, and Gordon C Jayson. Heparan sulfate 117. Yang-Er Chen, Yan-Qiu Su, Hao-Tian Mao, Wu Nan, Feng Zhu, Ming Yuan, Zhong-Wei proteoglycans and cancer. British journal of cancer, 85(8):1094, 2001. Zhang, Wen-Juan Liu, and Shu Yuan. Terrestrial plants evolve highly-assembled photo- 90. Gavin RM White, Anna M Kelsey, JM Varley, and Jillian M Birch. Somatic glypican 3 (gpc3) system complexes in adaptation to light shifts. Frontiers in plant science, 9:1811, 2018. mutations in wilms’ tumour. British journal of cancer, 86(12):1920, 2002. 118. Alexander V Ruban and Matthew P Johnson. Visualizing the dynamic structure of the plant 91. Jamshid Davoodi, John Kelly, Nathalie H Gendron, and Alex E MacKenzie. The simpson– photosynthetic membrane. Nature plants, 1(11):15161, 2015. golabi–behmel syndrome causative glypican-3, binds to and inhibits the dipeptidyl pepti- 119. Lukáš Nosek, Dmitry Semchonok, Egbert J Boekema, Petr Ilík, and Roman Kouril.ˇ Struc- dase activity of cd26. Proteomics, 7(13):2300–2310, 2007. tural variability of plant photosystem ii megacomplexes in thylakoid membranes. The Plant 92. Mariana I Capurro, Ping Xu, Wen Shi, Fuchuan Li, Angela Jia, and Jorge Filmus. Glypican- Journal, 89(1):104–111, 2017. 3 inhibits hedgehog signaling during development by competing with patched for hedgehog 120. Nazmul H Bhuiyan, Giulia Friso, Anton Poliakov, Lalit Ponnala, and Klaas J van Wijk. Met1 binding. Developmental cell, 14(5):700–711, 2008. is a thylakoid-associated tpr protein involved in photosystem ii supercomplex formation and 93. Wei Gao and Mitchell Ho. The role of glypican-3 in regulating wnt in hepatocellular carci- repair in arabidopsis. The Plant Cell, 27(1):262–285, 2015. nomas. Cancer reports, 1(1):14, 2011. 121. Xiangyuan Fan, Jiao Zhang, Wenjing Li, and Lianwei Peng. The ndhv subunit is required 94. Jorge Filmus and Mariana Capurro. The role of glypican-3 in the regulation of body size to stabilize the chloroplast nadh dehydrogenase-like complex in arabidopsis. The Plant and cancer. Cell Cycle, 7(18):2787–2790, 2008. Journal, 82(2):221–231, 2015. 95. Mitchell Ho and Heungnam Kim. Glypican-3: a new target for cancer immunotherapy. 122. Zebulun W Arendsee, Ling Li, and Eve Syrkin Wurtele. Coming of age: orphan genes in European journal of cancer, 47(3):333–338, 2011. plants. Trends in plant science, 19(11):698–708, 2014. 96. Florencia Anatelli, Shang-Tian Chuang, Ximing J Yang, and Hanlin L Wang. Value of 123. Anne-Ruxandra Carvunis, Thomas Rolland, Ilan Wapinski, Michael A Calderwood, glypican 3 immunostaining in the diagnosis of hepatocellular carcinoma on needle biopsy. Muhammed A Yildirim, Nicolas Simonis, Benoit Charloteaux, César A Hidalgo, Justin Bar- American journal of clinical pathology, 130(2):219–223, 2008. bette, Balaji Santhanam, et al. Proto-genes and de novo gene birth. Nature, 487(7407): 97. Mariana Capurro, Ian R Wanless, Morris Sherman, Gerrit Deboer, Wen Shi, Eiji Miyoshi, 370, 2012. and Jorge Filmus. Glypican-3: a novel serum and histochemical marker for hepatocellular 124. Martin Gollery, Jeff Harper, John Cushman, Taliah Mittler, Thomas Girke, Jian-Kang Zhu, carcinoma. Gastroenterology, 125(1):89–97, 2003. Julia Bailey-Serres, and Ron Mittler. What makes species unique? the contribution of 98. Yun-Yan Xiang, Virginia Ladeda, and Jorge Filmus. Glypican-3 expression is silenced in proteins with obscure features. Genome biology, 7(7):R57, 2006. human breast cancer. Oncogene, 20(50):7408, 2001. 125. Aron Marchler-Bauer, Myra K Derbyshire, Noreen R Gonzales, Shennan Lu, Farideh Chit- 99. Ram Sasisekharan, Zachary Shriver, Ganesh Venkataraman, and Uma Narayanasami. saz, Lewis Y Geer, Renata C Geer, Jane He, Marc Gwadz, David I Hurwitz, et al. Cdd: Roles of heparan-sulphate glycosaminoglycans in cancer. Nature Reviews Cancer, 2(7): Ncbi’s conserved domain database. Nucleic acids research, 43(D1):D222–D226, 2014. 521, 2002. 126. Lloyd W Sumner, Zhentian Lei, Basil J Nikolau, and Kazuki Saito. Modern plant 100. Han Kim, Guo-Liang Xu, Alain C Borczuk, Steve Busch, Jorge Filmus, Mariana Capurro, metabolomics: advanced natural product gene discoveries, improved technologies, and Jerome S Brody, Jennifer Lange, Jeanine M D’armiento, Paul B Rothman, et al. The future prospects. Natural product reports, 32(2):212–229, 2015. heparan sulfate proteoglycan gpc3 is a potential lung tumor suppressor. American journal 127. Stephanie Michelle Moon Quanbeck, Libuse Brachova, Alexis A Campbell, Xin Guan, of respiratory cell and molecular biology, 29(6):694–701, 2003. Ann Perera, Kun He, Seung Y Rhee, Preeti Bais, Julie Dickerson, Philip Dixon, et al. 101. Marina Curado Valsechi, Ana Beatriz Bortolozo Oliveira, André Luis Giacometti Con- Metabolomics as a hypothesis-generating functional genomics tool for the annotation of ceição, Bruna Stuqui, Natalia Maria Candido, Paola Jocelan Scarin Provazzi, Luiza Fer- arabidopsis thaliana genes of “unknown function”. Frontiers in plant science, 3:15, 2012. reira de Araújo, Wilson Araújo Silva, Marilia de Freitas Calmon, and Paula Rahal. Gpc3 128. Luis A del Rıo, F Javier Corpas, and Juan B Barroso. Nitric oxide and nitric oxide synthase reduces cell proliferation in renal carcinoma cell lines. BMC cancer, 14(1):631, 2014. activity in plants. Phytochemistry, 65(7):783–792, 2004. 102. Paul Shannon, Andrew Markiel, Owen Ozier, Nitin S Baliga, Jonathan T Wang, Daniel 129. Adam Price, Adrian Caciula, Cheng Guo, Bohyun Lee, Juliet Morrison, Angela Ras- Ramage, Nada Amin, Benno Schwikowski, and Trey Ideker. Cytoscape: a software envi- mussen, W Ian Lipkin, and Komal Jain. Devis: an r package for aggregation and visu- ronment for integrated models of biomolecular interaction networks. Genome research, 13 alization of differential expression data. BMC bioinformatics, 20(1):110, 2019. (11):2498–2504, 2003. 130. John W Nicol, Gregg A Helt, Steven G Blanchard Jr, Archana Raja, and Ann E Loraine. 103. Gary D Bader and Christopher WV Hogue. An automated method for finding molecular The integrated genome browser: free software for distribution and exploration of genome- complexes in large protein interaction networks. BMC bioinformatics, 4(1):2, 2003. scale datasets. Bioinformatics, 25(20):2730–2731, 2009. 104. Mathias Uhlen, Cheng Zhang, Sunjae Lee, Evelina Sjöstedt, Linn Fagerberg, Gholamreza 131. Hugo López-Fernández, Miguel Reboiro-Jato, Daniel Glez-Peña, Rosalía Laza, Reyes Bidkhori, Rui Benfeitas, Muhammad Arif, Zhengtao Liu, Fredrik Edfors, et al. A pathology Pavón, and Florentino Fdez-Riverola. Gc4s: A bioinformatics-oriented java software library atlas of the human cancer transcriptome. Science, 357(6352):eaan2507, 2017. of reusable graphical user interface components. PloS one, 13(9):e0204474, 2018. 105. Jia You, Wei Chen, Jing Chen, Qi Zheng, Jing Dong, and Yueyong Zhu. The oncogenic 132. Vladimir Ignatchenko, Alexandr Ignatchenko, Ankit Sinha, Paul C Boutros, and Thomas role of arg1 in progression and metastasis of hepatocellular carcinoma. BioMed research Kislinger. Venndis: A javafx-based venn and euler diagram software to generate publica- international, 2018, 2018. tion quality figures. Proteomics, 15(7):1239–1244, 2015. 106. Xiaojing Ren, Yuanyuan Ji, Xuhua Jiang, and Xun Qi. Downregulation of cyp2a6 and 133. Ilya Kirov, Ludmila Khrustaleva, Katrijn Van Laere, Alexander Soloviev, Sofie Meeus, cyp2c8 in tumor tissues is linked to worse overall survival and recurrence-free survival Dmitry Romanov, and Igor Fesenko. Drawid: user-friendly java software for chromosome from hepatocellular carcinoma. BioMed research international, 2018, 2018. measurements and idiogram drawing. Comparative cytogenetics, 11(4):747, 2017. 107. Tingdong Yu, Xiangkun Wang, Guangzhi Zhu, Chuangye Han, Hao Su, Xiwen Liao, 134. Overview of windows performance monitor: https://docs.microsoft.com/en-us/previous- Chengkun Yang, Wei Qin, Ketuan Huang, and Tao Peng. The prognostic value of differen- versions/windows/it-pro/windows-server-2008-r2-and-2008/cc749154(v=ws.11). Microsoft tially expressed cyp3a subfamily members for hepatocellular carcinoma. Cancer manage- Docs. ment and research, 10:1713, 2018. 135. Stephanie C Hicks, Kwame Okrah, Joseph N Paulson, John Quackenbush, Rafael A 108. Hyuk Soo Eun, Sang Yeon Cho, Byung Seok Lee, Sup Kim, In-Sang Song, Kwangsik Irizarry, and Héctor Corrada Bravo. Smooth quantile normalization. Biostatistics, 19(2): Chun, Cheong-Hae Oh, Min-Kyung Yeo, Seok Hyun Kim, and Kyung-Hee Kim. Cy- 185–198, 2017. tochrome p450 4a11 expression in tumor cells: A favorable prognostic factor for hepato- 136. Marcin Cieslik´ and Arul M Chinnaiyan. Cancer transcriptome profiling at the juncture of cellular carcinoma patients. Journal of Gastroenterology and Hepatology, 34(1):224–233, clinical translation. Nature Reviews Genetics, 19(2):93, 2018. 2019. 109. Dingyuan Luo, Haibo Chen, Penghui Lu, Xiaojuan Li, Miaoyun Long, Xinzhi Peng, Mingqing Huang, Kai Huang, Shaojian Lin, Langping Tan, et al. Chi3l1 overexpression is associated with metastasis and is an indicator of poor prognosis in papillary thyroid carcinoma. Cancer Biomarkers, 18(3):273–284, 2017. 110. Ying Huang, Manju Prasad, William J Lemon, Heather Hampel, Fred A Wright, Karl Kor- nacker, Virginia LiVolsi, Wendy Frankel, Richard T Kloos, Charis Eng, et al. Gene expres- sion in papillary thyroid carcinoma reveals highly consistent profiles. Proceedings of the National Academy of Sciences, 98(26):15044–15049, 2001. 111. Jie Qiu, Wenwei Zhang, Chuanshan Zang, Xiaomin Liu, Fuxue Liu, Ruifeng Ge, Yan Sun, and Qingsheng Xia. Identification of key genes and mirnas markers of papillary thyroid cancer. Biological research, 51(1):45, 2018. 112. Ito Ysuhiro, Yoshida Hiroshi, Kakudo Kennichi, Nakamura Yasushi, Kuma Kanji, and Miyauchi Akira. Inverse relationships between the expression of mmp-7 and mmp-11 and predictors of poor prognosis of papillary thyroid carcinoma. Pathology, 38(5):421–425, 2006. 113. Szu-Tah Chen, Dah-Wel Liu, Jen-Der Lin, Fang-Wu Chen, Yu-Yao Huang, and Brend Ray- Sea Hsu. Down-regulation of matrix metalloproteinase-7 inhibits metastasis of human anaplastic thyroid cancer cell line. Clinical & experimental metastasis, 29(1):71–82, 2012. 114. Hong Zhang, Yuechang Cai, Li Zheng, Zhanlei Zhang, Xiaofeng Lin, and Ningyi Jiang. Long noncoding rna neat1 regulate papillary thyroid cancer progression by modulating

18 | bioRχiv Singh et al. |