Population-Based Meta-Heuristic for Active Modules Identification Leandro Corrêa, Denis Pallez, Laurent Tichit, Olivier Soriani, Claude Pasquier
Total Page:16
File Type:pdf, Size:1020Kb
Population-based meta-heuristic for active modules identification Leandro Corrêa, Denis Pallez, Laurent Tichit, Olivier Soriani, Claude Pasquier To cite this version: Leandro Corrêa, Denis Pallez, Laurent Tichit, Olivier Soriani, Claude Pasquier. Population- based meta-heuristic for active modules identification. 10th International Conference on Com- putational Systems-Biology and Bioinformatics (CSBio 2019), Dec 2019, Nice, France. pp.1-8, 10.1145/3365953.3365957. hal-02366236 HAL Id: hal-02366236 https://hal.archives-ouvertes.fr/hal-02366236 Submitted on 18 Nov 2020 HAL is a multi-disciplinary open access L’archive ouverte pluridisciplinaire HAL, est archive for the deposit and dissemination of sci- destinée au dépôt et à la diffusion de documents entific research documents, whether they are pub- scientifiques de niveau recherche, publiés ou non, lished or not. The documents may come from émanant des établissements d’enseignement et de teaching and research institutions in France or recherche français ou étrangers, des laboratoires abroad, or from public or private research centers. publics ou privés. Population-based meta-heuristic for active modules identification Leandro Correa Denis Pallez Laurent Tichit [email protected] [email protected] [email protected] Université Côte d’Azur, CNRS, I3S, Université Côte d’Azur, CNRS, I3S, Aix Marseille Univ, CNRS, Centrale France France Marseille, I2M, France Olivier Soriani Claude Pasquier [email protected] [email protected] Université Côte d’Azur, CNRS, Université Côte d’Azur, CNRS, I3S, INSERM, iBV, France France ABSTRACT and Bioinformatics (CSBio 2019), December 4–7, 2019, Nice, France. ACM, The identification of condition specific gene sets from transcrip- New York, NY, USA, 8 pages. https://doi.org/10.1145/3365953.3365957 tomic experiments has important biological applications, ranging from the discovery of altered pathways between different phe- 1 INTRODUCTION notypes to the selection of disease-related biomarkers. Statistical High throughput technologies are now able to reliably quantify, on approaches using only gene expression data are based on an overly an organism-wide basis, molecular changes occurring in response simplistic assumption that the genes with the most altered expres- to disturbances or diseases. Usually, in a first step, these experimen- sions are the most important in the process under study. However, tal results are analyzed using statistical or data mining methods to a phenotype is rarely a direct consequence of the activity of a single generate scores expressing the degree of involvement of each gene gene, but rather reflects the interplay of several genes to perform in the process under study. In a second step, one tries to identify certain molecular processes. Many methods have been proposed active gene modules by testing whether genes known to co-operate to analyze gene activity in the light of our knowledge about their in some biological process belong to high scoring genes. The draw- molecular interactions. We propose, in this article, a population- back of this two-step procedure is that it may not identify genes based meta-heuristics based on new crossover and mutation op- whose combined action is essential in the process under study but erators. Our method achieves state of the art performance in an whose individual scores are too low. independent simulation experiment used in other studies. Applied Inspired by the seminal work from Ideker et al. [22], many meth- to a public transcriptomic dataset of patients afflicted with Hepa- ods have been proposed to analyze gene activity in the light of tocellular carcinoma, our method was able to identify significant our knowledge about their molecular interactions. These molecular modules of genes with meaningful biological relevance. interactions provide a convenient framework for understanding changes in gene expression and for integrating a wide variety of CCS CONCEPTS global state measurements. In the articles on this subject, these per- • Theory of computation → Evolutionary algorithms; • Ap- tinent sub-network are named "context-dependent active subnet- plied computing → Computational transcriptomics; Biolog- works" [20], "functional module" [5], "maximal scoring subgraph" ical networks. [14] or "altered subnetworks" [34]. As pinpointed by Rapaport et al. [33], "a small but coherent difference in the expression of all the KEYWORDS genes in a pathway should be more significant than a larger dif- active module identification, transcriptome analysis, protein-protein ference occurring in unrelated genes". To identify these groups of interaction, differential expression, NSGA-II genes considered relevant, it is therefore necessary to allow sensible trade-offs between the involvement of genes in the process and ACM Reference Format: their proximity. The underlying idea of active module detection Leandro Correa, Denis Pallez, Laurent Tichit, Olivier Soriani, and Claude methods is to simultaneously take into account these two criteria: Pasquier. 2019. Population-based meta-heuristic for active modules identifi- one based on a measurement of genes activity and the other one cation. In 10th International Conference on Computational Systems-Biology reflecting the proximity between genes composing the subnetwork. Ideker et al. [22] uses a simulated annealing approach that works Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed by randomly selecting a subnetwork and then gradually modifying for profit or commercial advantage and that copies bear this notice and the full citation it by adding or removing nodes until a subnetwork that includes on the first page. Copyrights for components of this work owned by others than ACM genes with a high activity level is reached. Inspired by this first must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a study, other meta-heuristic approaches have been developed, using fee. Request permissions from [email protected]. for example population-based approaches that maintain and im- CSBio 2019, December 4–7, 2019, Nice, France prove multiple candidate solutions instead of a single one [26, 29], © 2019 Association for Computing Machinery. ACM ISBN 978-1-4503-7215-2/19/12...$15.00 sometimes by integrating prior biological knowledge [9]. One of https://doi.org/10.1145/3365953.3365957 the major interest in using population-based meta-heuristics lie in CSBio 2019, December 4–7, 2019, Nice, France Leandro Correa, Denis Pallez, Laurent Tichit, Olivier Soriani, and Claude Pasquier the possibility to obtain several good solutions or, when dealing In this paper, we are considering two criteria: the ZA score as with multi-criteria, to approximate the Pareto front. Other meth- defined by Ideker et al. [22] and an aggregation of the essentiality ods adopt the concepts of network propagation. They assume that of genes, as defined by Jiang et al. [23]. the diffusion of information in a biological network is similar to A gene is considered essential when loss of its function com- the flow of fluid or heat in a pipeline network. Thus, in biological promises the viability or fitness of the system [3]. The measure systems, values measuring gene activity flow outward along the of gene essentiality, proposed by Jiang et al. is computed as the network and accumulate in some regions. Regions that accumulate sum of normalized expression level variation of its neighbors in the maximum flow are then detected as active modules [4, 14, 36]. the interactome network. Since our genetic operators are strongly Somewhat similar approaches consist in performing random walks linked to the neighborhood criterion, the essentiality of each gene on the network (or more precisely biased walks to encourage visits in the neighborhood is a good indicator for the selection of candi- to the most active nodes) with the idea that the walks are more dates that bolster the generation of better offsprings. In order to likely to remain in subnetworks with high interactions between its determine the essentiality of genes based on p-values, we use in members [19, 32]. Approaches based on greedy algorithms, start the formulae, the z-score associated to the p-value ω¹vº (eq. 1). from clusters composed of a single node and gradually extend them If we consider N ¹vº = fn1; n2; :::; np g as the set of neighbors of with neighboring nodes in order to maximize the score of the spe- one gene v 2 V of graph G, the essentiality of v, E¹vº, is defined as: cific subnetwork [18, 28]. Finally, some methods treat the problem 1 Õ as a clustering problem [40] or a prize-collecting Steiner problems E¹vº = z¹ni º (1) jN ¹vºj [41]. A very recent survey on active modules identification can be ni 2N ¹vº found in Nguyen et al. [30]. where z¹vº is the z-score associated to the p-value ω¹vº. We compute In this paper, we propose a population-based meta-heuristic z¹vº as follows: z¹vº = Φ−1¹1 − ω¹vºº where Φ−1 is the inverse based on NSGA-II [13] by integrating new genetic operators, i.e. normal cumulative distribution function [22]. specific crossover and mutation operators. The paper is organized By using E¹vº, our intention is to favor the choice of genes that as follows: next section formalizes active modules identification are connected to many hot spots in the network. Thus, we define in a biological network; section 3 presents in detail how genetic our first criteria f1 as the mean of genes’essentiality for all genes operators of the Evolutionary Algorithm (EA) have been adapted included in the active module. It is formalized in eq. (2): for modules identification; section 4 tests our method called AMI- NSGAII (Active Modules Identification with NSGA-II) on simulated 0 1 Õ f1¹V º = 0 E¹vi º (2) jV j 0 data in order to have a fair comparison with other state-of-the-art vi 2V methods.