Multi-Scale Analysis and Clustering of Co-Expression Networks

Multi-scale analysis and clustering of co-expression networks Nuno R. Nené∗1 1Department of Genetics, University of Cambridge, Cambridge, UK September 14, 2018 Abstract 1 Introduction The increasing capacity of high-throughput genomic With the advent of technologies allowing the collec- technologies for generating time-course data has tion of time-series in cell biology, the rich structure stimulated a rich debate on the most appropriate of the paths that cells take in expression space be- methods to highlight crucial aspects of data struc- came amenable to processing. Several methodologies ture. In this work, we address the problem of sparse have been crucial to carefully organize the wealth of co-expression network representation of several time- information generated by experiments [1], including course stress responses in Saccharomyces cerevisiae. network-based approaches which constitute an excel- We quantify the information preserved from the origi- lent and flexible option for systems-level understand- nal datasets under a graph-theoretical framework and ing [1,2,3]. The use of graphs for expression anal- evaluate how cross-stress features can be identified. ysis is inherently an attractive proposition for rea- This is performed both from a node and a network sons related to sparsity, which simplifies the cumber- community organization point of view. Cluster anal- some analysis of large datasets, in addition to being ysis, here viewed as a problem of network partition- mathematically convenient (see examples in Fig.1). ing, is achieved under state-of-the-art algorithms re- The properties of co-expression networks might re- lying on the properties of stochastic processes on the veal a myriad of aspects pertaining to the impact of constructed graphs. Relative performance with re- genes [2], as well as group features, such as struc- spect to a metric-free Bayesian clustering analysis is tural clusters or communities, highlighting similar evaluated and possible extensions are discussed. We expression profiles. The structure of co-expression further cluster the stress-induced co-expression net- networks can also be compared with other types of works generated independently by using their com- networks, e.g. protein-protein interaction, genetic arXiv:1703.02872v2 [q-bio.QM] 2 Dec 2017 munity organization at multiple scales. This type of interaction or gene regulatory, by using simple and protocol allows for integration of multiple datasets scalable graph-theoretical methods such as the one that may not be immediately comparable, either due explored here. Additionally, the use of graphs may to diverse experimental variations or because they also help to shed light on the evolutionary differences represent different types of information about the or commonalities between cellular responses across same genes. species [4,5,6]. Clustering methodologies have been an invaluable tool in unravelling the structure of time-course data ∗[email protected], [email protected] [1,7]. Either by using standard approaches rely- 1 Figure 1: Co-expression networks for a selection of stresses for two dissimilarity functions. HS25 − 37: heat- shock from a temperature of 25◦C to 37◦C. HS29−33(I): mild heat-shock from 29◦C to 33◦C. Hyper−OS: hyper-osmotic stress. These networks were plotted via a standard spatial embedding force-based algorithm available in Gephi. See Methods for details. ing on cosine dissimilarities and hierarchical clus- ficient to eliminate possible inconsistencies between tering [8,9] or by adopting a Bayesian framework datasets. Here, we will resort to a graph-theoretical without choosing a priori a particular distance func- based protocol that is able to avoid the pitfalls of ex- tion [10, 11, 12, 13], most of the techniques rely on perimental variation. This protocol is based on ideas the assumption that the necessary standardization related to community detection on graphs [14, 15], methods that precede the clustering analysis are suf- which makes it possible to address the character- 2 ization of expression dynamics with methods that [23, 24]. take into account multi-level organization informa- The paper is organized as follows: we initially tion. Overall, the method explores aspects of proba- quantify how much information the network con- bility flow and containment across a network [16, 17]. struction algorithm used in our work retains from Although several other algorithms that rely on sim- the whole set of selected genes, under different co- ilar principles are available, the one we test here is expression metrics (section 2.1.1); this is achieved by very fast. This makes the analysis of large datasets evaluating both structural and diffusion properties of such as those used in the present work a feasible task. each network; following this analysis the constructed In conjunction with this approach, we also resort to networks serve as the basis for the study of diffusion methodologies stemming from the field of non-linear patterns across its structures, at different time-scales, dimensionality reduction. The additional treatment with the intent of identifying clusters of similar time- has the objective of successfully identifying relation- dependent behaviour (section 2.2); finally, all the ships between genes that better represent the under- partition solutions identified for each stress are clus- lying high-dimensional geometry of the data [18] and, tered once again, with respect to other stresses, by naturally, create sparse and manageable datasets. taking into account the geometry in partition space We test the protocol with two dissimilarity functions (section 2.3). This sequential analysis allowed us capturing different features of the expression dynam- to compare how genes are clustered across stresses ics. Each leads to differentiated network representa- and ultimately to identify a joint clustering solution. tions of the same patterns. This allow us to verify if We also compare the multi-resolution organization of additional information improves performance of the all the co-expression networks with those of protein- protocol under different metrics. protein and genetic interaction networks (section S8) The data used in this work is a compendium extracted from BioGRID (http : ==thebiogrid:org). of stress-induced expression microarrays for This overall analysis, under the graph-theoretical S.cerevisiae, originally published by Gasch and paradigm, leads to efficient multi-dataset integration. co-workers [8]. We chose this dataset due to the The work-flow underlying the protocol explored here variety of stresses applied and the fact that it is still is presented in Fig.7. one of the biggest datasets for time-course data, in addition to being a benchmark for testing inference of networks in general [3, 19]. The work published in 2 Results [8] includes both single and serial stress responses, and one experiment with a combinatorial-like stress. 2.1 Identification of stress-induced co- The subject of combinatorial stress in yeasts has expression networks been further explored in more recent work (see for example [20]). A gene co-expression network is an undirected graph The results of Gasch and co-workers [8] were a sem- where nodes represent genes and edges proximity be- inal contribution to the analysis of whole-genome ex- tween expression profiles [25, 26]. It does not include pression and identified crucial features of stress re- or attempt to explicitly represent regulatory or phys- sponse in yeast. Specifically, the presence of a Com- ical interactions, but simply highlights common fea- mon or Environmental Stress Response (ESR), also tures of the data pertaining to the nodes it connects. a hallmark in several other species [5, 21, 22], was Traditionally, co-expression networks have been used extracted by hierarchical clustering analysis and pos- to determine co-expression modules or sets of genes tulated to be fundamental in equipping S.cerevisiae that exhibit similar patterns under the considered ex- to combat serial and combinatorial stress. More re- perimental conditions. Usually, the full co-expression cently, the ESR feature has been further analysed matrix under a specific dissimilarity function is used and studies have concluded that there is more struc- or a significance threshold is imposed, either under ture to the signals measured than previously expected a specific density kernel or simply ad hoc. The lat- 3 ter leads to removal of links that under the chosen time-course data and it contributes to the core group model are not significant and increases speed and of graph-based approaches that have performed very performance in the clustering step [26]. Here, we re- well when applied to time-series analysis in biology sort to a method for sparsifying co-expression matri- (see, for example, several applications in [37, 38]). ces that avoids imposing statistical thresholds to at- Unlike some of the traditional methods for manifold tain edge significance. This method has been tested identification, the RMST algorithm avoids the prob- widely across multiple datasets and provides a robust lem of under-sampling that hinders other method- starting point [18, 27]. ologies (see for example [33] where this is thoroughly Throughout this work a co-expression network will addressed). be represented by G = fV; E; W g, where V repre- In the following section we will highlight several sents the set of vertices or nodes/genes, E represents structural properties of the generated co-expression the set of edges or links between genes in matrix graphs, for both distance functions mentioned above. form, in our case determined by a particular algo- We also evaluate the information that is preserved rithm [18], and W represents the set of weights as- from the original pair-wise distance matrix between sociated with each of the edges in E, again in ma- genes. We focus our analysis on a restricted set of trix form. The network G derived for each stress genes selected by appropriate filtering methods that will represent the overall intra-stress relationships in take into account the order of events as a contributing the expression time-series matrix, g(t), for all genes factor.

Multi-Scale Analysis and Clustering of Co-Expression Networks

Gene Discovery and Annotation Using LCM-454 Transcriptome Sequencing Scott J

EMBL-EBI Powerpoint Presentation

An Open-Sourced Bioinformatic Pipeline for the Processing of Next-Generation Sequencing Derived Nucleotide Reads

Sequence Alignment/Map Format Specification

NGS Raw Data Analysis

Sequence Data

BAM Alignment File — Output: Alignment Counts and RPKM Expression Measurements for Each Exon • Calculate Coverage Profiles Across the Genome with “Sam2wig”

Introduction to High-Throughput Sequencing File Formats

Magic-BLAST, an Accurate DNA and RNA-Seq Aligner for Long and Short Reads

Institute for Glycomics Annual Report 2019

Predicting False Positives of Protein-Protein Interaction Data by Semantic Similarity Measures§ George Montañez1 and Young-Rae Cho*,2

For the Identification of Forensically Important Diptera from Belgium and France