Inferring Pairwise Interactions from Biological Data Using Maximum-Entropy Probability Models

Citation: Stein, Richard R., Debora S. Marks, and Chris Sander. 2015. “Inferring Pairwise Interactions from Biological Data Using Maximum-Entropy Probability Models.” PLoS Computational Biology 11 (7): e1004182. doi:10.1371/journal.pcbi.1004182. http://dx.doi.org/10.1371/journal.pcbi.1004182.
Published Version: doi:10.1371/journal.pcbi.1004182
Citable link: http://nrs.harvard.edu/urn-3:HUL.InstRepos:21462137
Terms of Use: This article was downloaded from Harvard University’s DASH repository, and is made available under the terms and conditions applicable to Other Posted Material, as set forth at http://nrs.harvard.edu/urn-3:HUL.InstRepos:dash.current.terms-of-use#LAA

REVIEW

Richard R. Stein1*, Debora S. Marks2, Chris Sander1*
1 Computational Biology Program, Sloan Kettering Institute, Memorial Sloan Kettering Cancer Center, New York, New York, United States of America
2 Department of Systems Biology, Harvard Medical School, Boston, Massachusetts, United States of America
* [email protected] (RRS); [email protected] (CS)

Abstract
Maximum entropy-based inference methods have been successfully used to infer direct interactions from biological datasets such as gene expression data or sequence ensembles. Here, we review undirected pairwise maximum-entropy probability models in two categories of data types, those with continuous and categorical random variables. As a concrete example, we present recently developed inference methods from the field of protein contact prediction and show that a basic set of assumptions leads to similar solution strategies for inferring the model parameters in both variable types.
These parameters reflect interactive couplings between observables, which can be used to predict global properties of the biological system. Such methods are applicable to the important problems of protein 3-D structure prediction and association of gene–gene networks, and they enable potential applications to the analysis of gene alteration patterns and to protein design.

Citation: Stein RR, Marks DS, Sander C (2015) Inferring Pairwise Interactions from Biological Data Using Maximum-Entropy Probability Models. PLoS Comput Biol 11(7): e1004182. doi:10.1371/journal.pcbi.1004182
Editor: Shi-Jie Chen, University of Missouri, UNITED STATES
Published: July 30, 2015
Copyright: © 2015 Stein et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Funding: This work was supported by NIH awards R01 GM106303 (DSM, CS, and RRS) and P41 GM103504 (CS). The funders had no role in the preparation of the manuscript.
Competing Interests: The authors have declared that no competing interests exist.

Introduction
Modern high-throughput techniques allow for the quantitative analysis of various components of the cell. This ability opens the door to analyzing and understanding complex interaction patterns of cellular regulation, organization, and evolution. In the last few years, undirected pairwise maximum-entropy probability models have been introduced to analyze biological data and have performed well, disentangling direct interactions from artifacts introduced by intermediates or spurious coupling effects. Their performance has been studied for diverse problems, such as gene network inference [1,2], analysis of neural populations [3,4], protein contact prediction [5–8], analysis of a text corpus [9], modeling of animal flocks [10], and prediction of multidrug effects [11]. Statistical inference methods using partial correlations in the context of graphical Gaussian models (GGMs) have led to similar results and provide a more intuitive understanding of direct versus indirect interactions by employing the concept of conditional independence [12,13].

Our goal here is to derive a unified framework for pairwise maximum-entropy probability models for continuous and categorical variables and to discuss some of the recent inference approaches presented in the field of protein contact prediction. The structure of the manuscript is as follows: (1) introduction and statement of the problem, (2) deriving the probabilistic model, (3) inference of interactions, (4) scoring functions for the pairwise interaction strengths, and (5) discussion of results, improvements and applications. Better knowledge of these methods, along with links to existing implementations in terms of software packages, may be helpful to improve the quality of biological data analysis compared to standard correlation-based methods and increase our ability to make predictions of interactions that define the properties of a biological system.

In the following, we highlight the power of inference methods based on the maximum-entropy assumption using two examples of biological problems: inferring networks from gene expression data and residue contacts in proteins from multiple sequence alignments. We compare solutions obtained using (1) correlation-based inference and (2) inference based on pairwise maximum-entropy probability models (or their incarnation in the continuous case, the multivariate Gaussian distribution).
Gene association networks

Pairwise associations between genes and proteins can be determined by a variety of data types, such as gene expression or protein abundance. Associations between entities in these data types are commonly estimated by the sample Pearson correlation coefficient computed for each pair of variables $x_i$ and $x_j$ from the set of random variables $x_1, \ldots, x_L$. In particular, for $M$ given samples in $L$ measured variables, $\mathbf{x}^1 = (x_1^1, \ldots, x_L^1)^T, \ldots, \mathbf{x}^M = (x_1^M, \ldots, x_L^M)^T \in \mathbb{R}^L$, it is defined as

$$r_{ij} := \frac{\hat{C}_{ij}}{\sqrt{\hat{C}_{ii}\,\hat{C}_{jj}}},$$

where $\hat{C}_{ij} := \frac{1}{M}\sum_{m=1}^{M}(x_i^m - \bar{x}_i)(x_j^m - \bar{x}_j)$ denotes the $(i, j)$-element of the empirical covariance matrix $\hat{C} = (\hat{C}_{ij})_{i,j=1,\ldots,L}$. The sample mean operator $\bar{\cdot}$ provides the empirical mean from the measured data and is defined as $\bar{x}_i := \frac{1}{M}\sum_{m=1}^{M} x_i^m$. A simple way to characterize dependencies in data is to classify two variables as being dependent if the absolute value of their correlation coefficient is above a certain threshold (and independent otherwise) and then use those pairs to draw a so-called relevance network [14]. However, the Pearson correlation is a misleading measure for direct dependence as it only reflects the association between two variables while ignoring the influence of the remaining ones. Therefore, the relevance network approach is not suitable to deduce direct interactions from a dataset [15–18]. The partial correlation between two variables removes the variational effect due to the influence of the remaining variables (Cramér [19], p. 306). To illustrate this, let’s take a simplified example with three random variables $x_A$, $x_B$, $x_C$. Without loss of generality, we can scale each of these variables to zero mean and unit standard deviation by $x_i \mapsto (x_i - \bar{x}_i)/\sqrt{\hat{C}_{ii}}$, which simplifies the correlation coefficient to $r_{ij} = \overline{x_i x_j}$.
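As a rough illustration of the definitions above, the following Python sketch computes the empirical covariance with the 1/M normalization, the Pearson correlation matrix, the standardized variables, and a thresholded relevance network. The function names, the simulated three-variable chain, the noise level, and the threshold of 0.5 are assumptions made for illustration only and are not taken from the article.

```python
import numpy as np

def empirical_covariance(X):
    """Empirical covariance (1/M normalization) of X with shape (M samples, L variables)."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)           # x_i^m minus the sample mean of x_i
    return centered.T @ centered / X.shape[0]

def pearson_matrix(X):
    """Sample Pearson correlations r_ij = C_ij / sqrt(C_ii * C_jj)."""
    C = empirical_covariance(X)
    d = np.sqrt(np.diag(C))
    return C / np.outer(d, d)

def standardize(X):
    """Scale every variable to zero mean and unit standard deviation."""
    X = np.asarray(X, dtype=float)
    d = np.sqrt(np.diag(empirical_covariance(X)))
    return (X - X.mean(axis=0)) / d

def relevance_network(R, threshold):
    """Pairs (i, j) with i < j whose absolute correlation exceeds the threshold."""
    L = R.shape[0]
    return [(i, j) for i in range(L) for j in range(i + 1, L)
            if abs(R[i, j]) > threshold]

# Hypothetical data: x_C depends on x_A only through x_B.
rng = np.random.default_rng(0)
M = 5000
x_a = rng.normal(size=M)
x_b = x_a + 0.5 * rng.normal(size=M)
x_c = x_b + 0.5 * rng.normal(size=M)
X = np.column_stack([x_a, x_b, x_c])

R = pearson_matrix(X)

# After standardization the correlation is just the sample mean of x_i * x_j.
Z = standardize(X)
R_standardized = Z.T @ Z / M

# Thresholding |r_ij| links all three pairs, including the indirect pair (A, C).
edges = relevance_network(R, threshold=0.5)
print(np.round(R, 2))
print(edges)
```

In this toy setting the relevance network contains an edge between the first and third variable even though they interact only through the second one, which is the limitation of correlation thresholding described in the text.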
The sample partial correlation coefficient of a three-variable system between $x_A$ and $x_B$ given $x_C$ is then defined as [19,20]

$$r_{AB\cdot C} = \frac{r_{AB} - r_{BC}\,r_{AC}}{\sqrt{1 - r_{AC}^2}\,\sqrt{1 - r_{BC}^2}} \equiv -\frac{(\hat{C}^{-1})_{AB}}{\sqrt{(\hat{C}^{-1})_{AA}\,(\hat{C}^{-1})_{BB}}}.$$

The latter equivalence by Cramer’s rule holds if the empirical covariance matrix, $\hat{C} = (\hat{C}_{ij})_{i,j \in \{A,B,C\}}$, is invertible. Krumsiek et al. [21] studied the Pearson correlations and partial correlations in data generated by an in silico reaction system consisting of three components A, B, C with reactions between A and B, and B and C (Fig 1A). A graphical comparison of Pearson’s correlations, $r_{AB}$, $r_{AC}$, $r_{BC}$, versus the corresponding partial correlations, $r_{AB\cdot C}$, $r_{AC\cdot B}$, $r_{BC\cdot A}$, shows that variables A and C appear to be correlated when using Pearson’s correlation as a dependency measure since both are highly correlated with variable B, which results in a false inferred reaction $r_{AC}$.

Fig 1. Reaction system reconstruction and protein contact prediction. Association results of correlation-based and maximum-entropy methods on biological data from an in silico reaction system (A) and protein contacts (B). (A) Analysis by Pearson’s correlation yields interactions associating all three compounds A, B, and C, in contrast to the partial correlation approach which omits the “false” link between A and C. (Fig 1A based on [21].) (B) Protein contact prediction for the human RAS protein using the correlation-based mutual information, MI, and the maximum-entropy based direct information, DI, (blue and red, respectively). The 150 highest scoring contacts from both methods are plotted on the protein contacts from experimentally determined structure in gray. (Fig 1B based on [6].) doi:10.1371/journal.pcbi.1004182.g001
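The contrast between Pearson and partial correlations in a three-component chain can be sketched numerically. The Python example below evaluates the three-variable formula and the inverse-covariance expression on simulated data and checks that they agree. It is a minimal sketch under stated assumptions: the linear Gaussian chain, its noise level, the sample size, and all function names are hypothetical and do not reproduce the in silico reaction system of [21].

```python
import numpy as np

def partial_corr_three(r_xy, r_xz, r_yz):
    """Partial correlation of x and y given z from the three pairwise correlations."""
    return (r_xy - r_xz * r_yz) / (np.sqrt(1.0 - r_xz**2) * np.sqrt(1.0 - r_yz**2))

def partial_corr_from_covariance(C):
    """All pairwise partial correlations from the inverse of an invertible covariance matrix."""
    P = np.linalg.inv(C)                  # inverse of the empirical covariance matrix
    d = np.sqrt(np.diag(P))
    rho = -P / np.outer(d, d)             # -(C^-1)_ij / sqrt((C^-1)_ii (C^-1)_jj)
    np.fill_diagonal(rho, 1.0)
    return rho

# Hypothetical chain A - B - C: C is driven by B, and B by A (assumed toy model).
rng = np.random.default_rng(1)
M = 5000
x_a = rng.normal(size=M)
x_b = x_a + 0.5 * rng.normal(size=M)
x_c = x_b + 0.5 * rng.normal(size=M)
X = np.column_stack([x_a, x_b, x_c])

C = np.cov(X, rowvar=False, bias=True)    # bias=True gives the 1/M normalization
R = np.corrcoef(X, rowvar=False)          # Pearson correlations
rho = partial_corr_from_covariance(C)

# Three-variable formula, applied to each pair given the remaining variable.
r_ab_c = partial_corr_three(R[0, 1], R[0, 2], R[1, 2])
r_ac_b = partial_corr_three(R[0, 2], R[0, 1], R[1, 2])
r_bc_a = partial_corr_three(R[1, 2], R[0, 1], R[0, 2])

print("Pearson r_AC:  ", round(R[0, 2], 3))
print("partial r_AC.B:", round(r_ac_b, 3))
```

In this simulated chain, the Pearson correlation between A and C is large, while the partial correlation given B is close to zero, so only the links A-B and B-C remain as candidate direct interactions.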
The strength of the incorrectly inferred interaction can be numerically large and therefore particularly misleading if there are multiple intermediate variables B [22]. The partial correlation analysis
