Inferring Pairwise Interactions from Biological Data Using Maximum-Entropy Probability Models
Richard R. Stein1*, Debora S. Marks2, Chris Sander1*
1 Computational Biology Program, Sloan Kettering Institute, Memorial Sloan Kettering Cancer Center, New York, New York, United States of America, 2 Department of Systems Biology, Harvard Medical School, Boston, Massachusetts, United States of America
* [email protected] (RRS); [email protected] (CS)
Abstract
Maximum entropy-based inference methods have been successfully used to infer direct interactions from biological datasets such as gene expression data or sequence ensembles. Here, we review undirected pairwise maximum-entropy probability models in two categories of data types, those with continuous and categorical random variables. As a concrete example, we present recently developed inference methods from the field of protein contact prediction and show that a basic set of assumptions leads to similar solution strategies for inferring the model parameters in both variable types. These parameters reflect interactive couplings between observables, which can be used to predict global properties of the biological system. Such methods are applicable to the important problems of protein 3-D structure prediction and association of gene–gene networks, and they enable potential applications to the analysis of gene alteration patterns and to protein design.

Introduction

Modern high-throughput techniques allow for the quantitative analysis of various components of the cell. This ability opens the door to analyzing and understanding complex interaction patterns of cellular regulation, organization, and evolution. In the last few years, undirected pairwise maximum-entropy probability models have been introduced to analyze biological data and have performed well, disentangling direct interactions from artifacts introduced by intermediates or spurious coupling effects. Their performance has been studied for diverse problems, such as gene network inference [1,2], analysis of neural populations [3,4], protein contact prediction [5–8], analysis of a text corpus [9], modeling of animal flocks [10], and prediction of multidrug effects [11]. Statistical inference methods using partial correlations in the context of graphical Gaussian models (GGMs) have led to similar results and provide a more intuitive understanding of direct versus indirect interactions by employing the concept of conditional independence [12,13].

Our goal here is to derive a unified framework for pairwise maximum-entropy probability models for continuous and categorical variables and to discuss some of the recent inference approaches presented in the field of protein contact prediction. The structure of the manuscript is as follows: (1) introduction and statement of the problem, (2) deriving the probabilistic model, (3) inference of interactions, (4) scoring functions for the pairwise interaction strengths, and (5) discussion of results, improvements, and applications. Better knowledge of these methods, along with links to existing implementations in terms of software packages, may help to improve the quality of biological data analysis compared to standard correlation-based methods and increase our ability to make predictions of the interactions that define the properties of a biological system. In the following, we highlight the power of inference methods based on the maximum-entropy assumption using two examples of biological problems: inferring networks from gene expression data and residue contacts in proteins from multiple sequence alignments. We compare solutions obtained using (1) correlation-based inference and (2) inference based on pairwise maximum-entropy probability models (or their incarnation in the continuous case, the multivariate Gaussian distribution).
Gene association networks

Pairwise associations between genes and proteins can be determined from a variety of data types, such as gene expression or protein abundance. Associations between entities in these data types are commonly estimated by the sample Pearson correlation coefficient computed for each pair of variables $x_i$ and $x_j$ from the set of random variables $x_1, \ldots, x_L$. In particular, for $M$ samples of $L$ measured variables, $\mathbf{x}^1 = (x_1^1, \ldots, x_L^1)^T, \ldots, \mathbf{x}^M = (x_1^M, \ldots, x_L^M)^T \in \mathbb{R}^L$, it is defined as
$$\rho_{ij} := \frac{\hat{C}_{ij}}{\sqrt{\hat{C}_{ii}\,\hat{C}_{jj}}},$$

where $\hat{C}_{ij} := \frac{1}{M}\sum_{m=1}^{M}\left(x_i^m - \overline{x_i}\right)\left(x_j^m - \overline{x_j}\right)$ denotes the $(i,j)$-element of the empirical covariance matrix $\hat{C} = (\hat{C}_{ij})_{i,j=1,\ldots,L}$. The sample mean operator provides the empirical mean from the measured data and is defined as $\overline{x_i} := \frac{1}{M}\sum_{m=1}^{M} x_i^m$. A simple way to characterize dependencies in data is to classify two variables as being dependent if the absolute value of their correlation coefficient is above a certain threshold (and independent otherwise) and then use those pairs to draw a so-called relevance network [14]. However, the Pearson correlation is a misleading measure for direct dependence as it only reflects the association between two variables while ignoring the influence of the remaining ones. Therefore, the relevance network approach is not suitable to deduce direct interactions from a dataset [15–18]. The partial correlation between two variables removes the variational effect due to the influence of the remaining variables (Cramér [19], p. 306). To illustrate this, let us take a simplified example with three random variables $x_A$, $x_B$, $x_C$. Without loss of generality, we can scale each of these variables to zero mean and unit standard deviation by $x_i \mapsto (x_i - \overline{x_i})/\sqrt{\hat{C}_{ii}}$, which simplifies the correlation coefficient to $\rho_{ij} = \overline{x_i x_j}$. The sample partial correlation coefficient of a three-variable system between $x_A$ and $x_B$ given $x_C$ is then defined as [19,20]

$$\rho_{AB \cdot C} = \frac{\rho_{AB} - \rho_{BC}\,\rho_{AC}}{\sqrt{1-\rho_{AC}^2}\,\sqrt{1-\rho_{BC}^2}} = -\frac{(\hat{C}^{-1})_{AB}}{\sqrt{(\hat{C}^{-1})_{AA}\,(\hat{C}^{-1})_{BB}}}.$$
The latter equivalence, by Cramér's rule, holds if the empirical covariance matrix, $\hat{C} = (\hat{C}_{ij})_{i,j \in \{A,B,C\}}$, is invertible. Krumsiek et al. [21] studied the Pearson correlations and partial correlations in data generated by an in silico reaction system consisting of three components A, B, C with reactions between A and B, and between B and C (Fig 1A). A graphical comparison of Pearson's correlations, $\rho_{AB}$, $\rho_{AC}$, $\rho_{BC}$, versus the corresponding partial correlations, $\rho_{AB\cdot C}$, $\rho_{AC\cdot B}$, $\rho_{BC\cdot A}$, shows that variables A and C appear to be correlated when using Pearson's correlation as a dependency measure, since both are highly correlated with variable B, which results in a falsely inferred reaction $\rho_{AC}$. The strength of the incorrectly inferred interaction can be numerically large and therefore particularly misleading if there are multiple intermediate variables B [22]. The partial correlation analysis removes the effect of the mediating variable(s) B and correctly recovers the underlying interaction structure. This is always true for variables following a multivariate Gaussian distribution, but it also seems to work empirically on realistic systems, as Krumsiek et al. [21] have shown for more complex reaction structures than the example presented here.

Fig 1. Reaction system reconstruction and protein contact prediction. Association results of correlation-based and maximum-entropy methods on biological data from an in silico reaction system (A) and protein contacts (B). (A) Analysis by Pearson's correlation yields interactions associating all three compounds A, B, and C, in contrast to the partial correlation approach, which omits the "false" link between A and C. (Fig 1A based on [21].) (B) Protein contact prediction for the human RAS protein using the correlation-based mutual information, MI, and the maximum-entropy-based direct information, DI (blue and red, respectively). The 150 highest-scoring contacts from both methods are plotted on the protein contacts from the experimentally determined structure in gray. (Fig 1B based on [6].)
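To make the contrast concrete, the following Python sketch computes Pearson correlations and partial correlations (via the inverse of the empirical covariance matrix) for a simulated three-variable chain in which A and C are linked only through B; the simulation parameters and variable names are illustrative assumptions, not values from [21]:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate a chain A -> B -> C: A and C are linked only through B.
M = 5000
A = rng.normal(size=M)
B = A + 0.5 * rng.normal(size=M)
C = B + 0.5 * rng.normal(size=M)
X = np.column_stack([A, B, C])  # M samples of L = 3 variables

# Pearson correlation matrix rho_ij.
rho = np.corrcoef(X, rowvar=False)

# Partial correlations from the precision matrix P = C^-1:
# rho_ij|rest = -P_ij / sqrt(P_ii * P_jj).
P = np.linalg.inv(np.cov(X, rowvar=False))
d = np.sqrt(np.diag(P))
partial = -P / np.outer(d, d)
np.fill_diagonal(partial, 1.0)

print("Pearson rho_AC: %.2f" % rho[0, 2])        # large transitive correlation
print("Partial rho_AC.B: %.2f" % partial[0, 2])  # near zero: no direct link
```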
Protein contact prediction The idea that protein contacts can be extracted from the evolutionary family record was formu- lated and tested some time ago [23–26]. The principle used here is that slightly deleterious mutations are compensated during evolution by mutations of residues in contact in order to maintain the function and, by implication, the shape of the protein. Protein residues that are close in space in the folded protein are often mutated in a correlated manner. The main prob- lem here is that one has to disentangle the directly co-evolving residues and remove transitive correlations from the large number of other co-variations in protein sequences that arise due to statistical noise or phylogenetic sampling bias in the sequence family. Interactions not internal to the protein are, for example, evolutionary constraints on residues involved in
oligomerization, protein–protein, and protein–substrate interactions [6,27,28]. In particular, the empirical single-site and pair frequency counts in residue $i$ and in residues $i$ and $j$ for elements $s, \omega$ of the 20-element amino acid alphabet plus gap, $f_i(s)$ and $f_{ij}(s,\omega)$, are extracted from a representative multiple sequence alignment, with reweighting applied to account for biases due to undersampling. Correlated evolution in these positions was analyzed, e.g., by [29], using the mutual information between residues $i$ and $j$,

$$MI_{ij} = \sum_{s,\omega} f_{ij}(s,\omega) \ln\!\left(\frac{f_{ij}(s,\omega)}{f_i(s)\,f_j(\omega)}\right).$$
Although these results showed promise, an important improvement was made years later by applying a maximum-entropy approach to the same setup [5–7,30]. In this framework, the direct information of residues $i$ and $j$ was introduced by replacing $f_{ij}$ in the mutual information by $P_{ij}^{\mathrm{dir}}$,

$$DI_{ij} = \sum_{s,\omega} P_{ij}^{\mathrm{dir}}(s,\omega) \ln\!\left(\frac{P_{ij}^{\mathrm{dir}}(s,\omega)}{f_i(s)\,f_j(\omega)}\right), \tag{1}$$

where $P_{ij}^{\mathrm{dir}}(s,\omega) = \frac{1}{Z_{ij}} \exp\left(e_{ij}(s,\omega) + \tilde{h}_i(s) + \tilde{h}_j(\omega)\right)$, and $\tilde{h}_i(s)$, $\tilde{h}_j(\omega)$, and $Z_{ij}$ are chosen such that $P_{ij}^{\mathrm{dir}}$, which is based on a pairwise probability model of an amino acid sequence compatible with the iso-structural sequence family, is consistent with the single-site frequency counts. In an approximative solution, [6,7] determined the contact strength between the amino acids $s$ and $\omega$ in positions $i$ and $j$, respectively, by

$$e_{ij}(s,\omega) \approx -\left(C^{-1}(s,\omega)\right)_{ij}. \tag{2}$$

Here, $(C^{-1}(s,\omega))_{ij}$ denotes the element of the inverse matrix corresponding to $C_{ij}(s,\omega) = f_{ij}(s,\omega) - f_i(s)\,f_j(\omega)$ for amino acids $s, \omega$ from a subset of 20 out of the 21 different states (the so-called gauge fixing, see below). The comparison of contact prediction results based on the MI and DI scores for the human RAS protein, plotted on top of the actual crystal structure, shows a much more accurate prediction when using the direct information instead of the mutual information (Fig 1B); a minimal numerical sketch of the mean-field computation in Eq 2 closes this subsection. The next section lays the foundation for deriving maximum-entropy models for the two data types: continuous, as used in the first example, and categorical, as used in the second one. Subsequently, we will present inference techniques to solve for their interaction parameters.
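The following Python sketch implements the mean-field approximation of Eq 2, assuming frequency arrays f_i and f_ij as defined above (sequence reweighting omitted); the function name, array layout, and pseudocount value lam are illustrative choices, not a reference implementation of [6,7]:

```python
import numpy as np

def mean_field_couplings(f_i, f_ij, lam=0.5):
    """Mean-field couplings (Eq 2) from the inverse connected-correlation
    matrix. f_i: (L, q) single-site frequencies; f_ij: (L, L, q, q) pair
    frequencies. Returns e_ij over q-1 states (last state dropped as
    gauge fixing)."""
    L, q = f_i.shape
    # Pseudocount regularization keeps C invertible for undersampled data.
    fi = (1 - lam) * f_i + lam / q
    fij = (1 - lam) * f_ij + lam / q**2
    for i in range(L):
        fij[i, i] = np.diag(fi[i])  # enforce f_ii(s, w) = f_i(s) if s == w
    # Connected correlations C_ij(s, w) = f_ij(s, w) - f_i(s) f_j(w),
    # restricted to the first q-1 amino acid states.
    C = (fij[:, :, :q - 1, :q - 1]
         - fi[:, None, :q - 1, None] * fi[None, :, None, :q - 1])
    C = C.transpose(0, 2, 1, 3).reshape(L * (q - 1), L * (q - 1))
    e = -np.linalg.inv(C)  # Eq 2: e_ij(s, w) ~ -(C^-1)_ij(s, w)
    return e.reshape(L, q - 1, L, q - 1).transpose(0, 2, 1, 3)
```

From these couplings, the published methods obtain $DI_{ij}$ (Eq 1) by fitting the fields $\tilde{h}_i$, $\tilde{h}_j$ so that $P_{ij}^{\mathrm{dir}}$ reproduces the single-site frequency counts; that fixed-point step is omitted here for brevity.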
Deriving the Probabilistic Model

Ideally, one would like to use a probabilistic model that is, on the one hand, able to capture all orders of interaction among the observables at play and, on the other hand, correctly reproduces the observed and to-be-predicted frequencies. However, this would require a prohibitively large number of observed data points. For this reason, we restrict ourselves to probabilistic models with terms up to second order, which we first derive for continuous, real-valued variables; in the next section, we extend this framework to models with categorical variables that are suitable, for example, for treating sequence information.
Model formulation for continuous random variables

We model the occurrence of sets of events in a particular biological system by a multivariate probability distribution $P(\mathbf{x})$ of $L$ random variables $\mathbf{x} = (x_1, \ldots, x_L)^T \in \mathbb{R}^L$ that is, on the one hand, consistent with the mean and covariance obtained from $M$ observed data values $\mathbf{x}^1, \ldots, \mathbf{x}^M$ and, on the other hand, maximizes the information entropy, $S$, to obtain the simplest possible probability model consistent with the data. At this point, each of the data's variables $x_i$ is continuously distributed on real values. In a biological example, these data originate from gene expression studies, and each variable $x_i$ corresponds to the normalized mRNA level of a gene measured in $M$ samples. As an example, a recent pan-cancer study of The Cancer Genome Atlas (TCGA) provided mRNA levels from M = 3,299 patient tumor samples from 12 cancer types [31]. The problem can be large; e.g., in the case of a gene–gene association study, one has $L \approx 20{,}000$ human genes.

The first constraint on the unknown probability distribution, $P: \mathbb{R}^L \to \mathbb{R}_{\geq 0}$, is that its integral normalizes to 1,

$$\int_{\mathbf{x}} P(\mathbf{x})\, d\mathbf{x} = 1, \tag{3}$$
which is a natural requirement on any probability distribution. Additionally, the first moment of variable $x_i$ is supposed to match the value of the corresponding sample mean over $M$ measurements for each $i = 1, \ldots, L$,

$$\langle x_i \rangle = \int_{\mathbf{x}} P(\mathbf{x})\, x_i\, d\mathbf{x} = \frac{1}{M}\sum_{m=1}^{M} x_i^m = \overline{x_i}, \tag{4}$$

where we define the $n$-th moment of the random variable $x_i$ distributed by the multivariate probability distribution $P$ as $\langle x_i^n \rangle := \int_{\mathbf{x}} P(\mathbf{x})\, x_i^n\, d\mathbf{x}$. Analogously, the second moment of the variables $x_i$ and $x_j$ is supposed to equal its corresponding empirical expectation,

$$\langle x_i x_j \rangle = \int_{\mathbf{x}} P(\mathbf{x})\, x_i x_j\, d\mathbf{x} = \frac{1}{M}\sum_{m=1}^{M} x_i^m x_j^m = \overline{x_i x_j} \tag{5}$$
for $i, j = 1, \ldots, L$. Taken together, Eqs 4 and 5 constrain the distribution's covariance matrix to be consistent with the empirical covariance matrix. Finally, the probability distribution should maximize the information entropy,

$$\text{maximize} \quad S = -\int_{\mathbf{x}} P(\mathbf{x}) \ln P(\mathbf{x})\, d\mathbf{x} \tag{6}$$
with the natural logarithm ln. A well-known analytical strategy to find functional extrema under equality constraints is the method of Lagrange multipliers [32], which converts a constrained optimization problem into an unconstrained one by means of the Lagrangian $\mathcal{L}$. In our case, the probability distribution maximizing the entropy (Eq 6) subject to Eqs 3–5 is found as the stationary point of the Lagrangian $\mathcal{L} = \mathcal{L}(P(\mathbf{x}); \alpha, \boldsymbol{\beta}, \boldsymbol{\gamma})$ [33,34],
$$\mathcal{L} = S + \alpha\left(\langle 1 \rangle - 1\right) + \sum_{i=1}^{L} \beta_i\left(\langle x_i \rangle - \overline{x_i}\right) + \sum_{i,j=1}^{L} \gamma_{ij}\left(\langle x_i x_j \rangle - \overline{x_i x_j}\right). \tag{7}$$
The real-valued Lagrange multipliers $\alpha$, $\boldsymbol{\beta} = (\beta_i)_{i=1,\ldots,L}$, and $\boldsymbol{\gamma} = (\gamma_{ij})_{i,j=1,\ldots,L}$ correspond to the constraints in Eqs 3, 4, and 5, respectively. The maximizing probability distribution is then found by setting the functional derivative of $\mathcal{L}$ with respect to the unknown density $P(\mathbf{x})$ to zero [33,35],
$$\frac{\delta \mathcal{L}}{\delta P(\mathbf{x})} = 0 \;\Longrightarrow\; -\ln P(\mathbf{x}) - 1 + \alpha + \sum_{i=1}^{L} \beta_i x_i + \sum_{i,j=1}^{L} \gamma_{ij} x_i x_j = 0.$$
Its solution is the pairwise maximum-entropy probability distribution,

$$P(\mathbf{x}; \boldsymbol{\beta}, \boldsymbol{\gamma}) = \exp\left(-1 + \alpha + \sum_{i=1}^{L} \beta_i x_i + \sum_{i,j=1}^{L} \gamma_{ij} x_i x_j\right) = \frac{1}{Z}\, e^{-H(\mathbf{x};\,\boldsymbol{\beta},\,\boldsymbol{\gamma})}, \tag{8}$$

which is contained in the family of exponential probability distributions and assigns a non-negative probability to any system configuration $\mathbf{x} = (x_1, \ldots, x_L)^T \in \mathbb{R}^L$. For the second identity, we introduced the partition function as normalization constant,

$$Z(\boldsymbol{\beta}, \boldsymbol{\gamma}) := \int_{\mathbf{x}} \exp\left(\sum_{i=1}^{L} \beta_i x_i + \sum_{i,j=1}^{L} \gamma_{ij} x_i x_j\right) d\mathbf{x} = \exp(1 - \alpha),$$

with the Hamiltonian $H(\mathbf{x}) := -\sum_{i=1}^{L} \beta_i x_i - \sum_{i,j=1}^{L} \gamma_{ij} x_i x_j$. It can be shown by means of the information inequality that Eq 8 is the unique maximum-entropy distribution satisfying the constraints of Eqs 3–5 (Cover and Thomas [35], p. 410). Note that $\alpha$ is fully determined for given $\boldsymbol{\beta} = (\beta_i)$ and $\boldsymbol{\gamma} = (\gamma_{ij})$ by the normalization constraint of Eq 3 and is therefore not a free parameter. The right-hand representation of Eq 8 is also referred to as the Boltzmann distribution. The matrix of Lagrange multipliers $\boldsymbol{\gamma} = (\gamma_{ij})$ has to have full rank in order to ensure a unique parametrization of $P(\mathbf{x})$; otherwise, one can eliminate dependent constraints [33,36]. In addition, for the integrals in Eqs 3–6 to converge with respect to the $L$-dimensional Lebesgue measure, we require $\boldsymbol{\gamma}$ to be negative definite, i.e., all of its eigenvalues to be negative, or $\sum_{i,j} \gamma_{ij}\, x_i x_j = \mathbf{x}^T \boldsymbol{\gamma}\, \mathbf{x} < 0$ for $\mathbf{x} \neq 0$.
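With $\boldsymbol{\gamma}$ negative definite, Eq 8 is exactly a multivariate Gaussian: completing the square shows that the moment constraints are satisfied by $\boldsymbol{\beta} = \hat{C}^{-1}\overline{\mathbf{x}}$ and $\boldsymbol{\gamma} = -\frac{1}{2}\hat{C}^{-1}$. A minimal Python sketch of this closed-form solution follows; the function name is ours, and regularization of $\hat{C}$ (needed when $M < L$) is omitted:

```python
import numpy as np

def gaussian_maxent_params(X):
    """Closed-form parameters of the continuous maximum-entropy model
    (Eq 8). Matching the exponent against a multivariate Gaussian gives
    beta = C^-1 mean and gamma = -1/2 C^-1, with mean and C the sample
    moments of Eqs 4 and 5. X: (M, L) data matrix."""
    mu = X.mean(axis=0)
    C = np.cov(X, rowvar=False, bias=True)  # empirical covariance, 1/M convention
    C_inv = np.linalg.inv(C)
    beta = C_inv @ mu
    gamma = -0.5 * C_inv  # negative definite whenever C is positive definite
    return beta, gamma
```

Note that the off-diagonal entries of $\boldsymbol{\gamma}$, i.e., of $\hat{C}^{-1}$ up to scaling, are proportional to the partial correlations used in the first example, which is why the continuous maximum-entropy model and the graphical Gaussian model lead to the same interaction estimates.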
Concept of entropy maximization

Shannon states in his seminal work that information and (information) entropy are linked: the more information is encoded in a system, the lower its entropy [37]. Jaynes introduced the entropy maximization principle, which selects the probability distribution that (1) is in agreement with the measured constraints and (2) otherwise contains the least possible information [38–40]. In particular, any unnecessary information would lower the entropy and thereby introduce biases and allow overfitting. As demonstrated in the section above, the assumption of entropy maximization under first- and second-moment constraints results in an exponential model or Markov random field (in log-linear form), and many of the properties shown here can be generalized to this model class [41]. On the other hand, there is some analogy between entropy as introduced by Shannon and the thermodynamic notion of entropy. Here, the second law of thermodynamics states that each isolated system monotonically evolves in time towards a state of maximum entropy, the equilibrium. A thorough discussion of this analogy and its limitations in non-equilibrium systems is beyond the scope of this review but can be found in [42,43]. Here, we exclusively use the notion of entropy maximization as the principle of minimal information content in a probability model consistent with the data.
Categorical random variables

In the following section, we derive the pairwise maximum-entropy probability distribution on categorical variables. For jointly distributed categorical variables $\mathbf{x} = (x_1, \ldots, x_L)^T \in \Omega^L$, each variable $x_i$ is defined on the finite set $\Omega = \{s_1, \ldots, s_q\}$ consisting of $q$ elements. In the concrete example of modeling protein co-evolution, this set contains the 20 amino acids represented by a 20-letter alphabet, from A standing for Alanine to Y for Tyrosine, plus one gap element; then $\Omega = \{A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y, -\}$ and $q = 21$. Our goal is to extract co-evolving residue pairs from the evolutionary record of a given protein family. As input data,
we use a so-called multiple sequence alignment, $\{\mathbf{x}^1, \ldots, \mathbf{x}^M\} \in \Omega^{L \times M}$, a collection of closely homologous protein sequences that is formatted such that it allows comparison of the evolution across each residue [44]. These alignments may stem from different hidden Markov model-derived resources, such as PFAM [45], hhblits [46], and Jackhmmer [47].

Fig 2. Illustration of binary embedding. The binary embedding maps each vector of $L$ categorical random variables, $\mathbf{x} \in \Omega^L$, here represented by a sequence of amino acids from the amino acid alphabet (containing the 20 amino acids and one gap element), $\Omega = \{A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y, -\}$, onto a unique binary representation, $\mathbf{x}(\sigma) \in \{0,1\}^{Lq}$.

To formalize the derivation of the pairwise maximum-entropy probability distribution on categorical variables, we use the approach of [8,30,48] and replace, as depicted in Fig 2, each variable $x_i$ defined on $\Omega$ by indicator functions of the amino acids $s \in \Omega$, $\mathbb{1}_s: \Omega \to \{0,1\}$,

$$x_i \mapsto x_i(s) := \mathbb{1}_s(x_i) = \begin{cases} 1 & \text{if } x_i = s, \\ 0 & \text{otherwise.} \end{cases}$$
This embedding specifies a unique representation of any $L$-vector of categorical random variables, $\mathbf{x}$, as a binary $Lq$-vector, $\mathbf{x}(\sigma)$, with a single non-zero entry in each binary $q$-subvector $x_i(\sigma) = (x_i(s_1), \ldots, x_i(s_q))^T \in \{0,1\}^q$,

$$\mathbf{x} = (x_1, \ldots, x_L)^T \in \Omega^L \;\mapsto\; \mathbf{x}(\sigma) = (x_1(s_1), \ldots, x_L(s_q))^T \in \{0,1\}^{Lq}.$$
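In code, this embedding is the familiar one-hot encoding. A minimal Python sketch (the sequence and helper name are illustrative):

```python
import numpy as np

ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"  # 20 amino acids + gap, q = 21

def one_hot(sequence, alphabet=ALPHABET):
    """Binary embedding of Fig 2: an L-residue sequence over q letters
    becomes an L*q binary vector with one non-zero entry per position."""
    index = {a: k for k, a in enumerate(alphabet)}
    x = np.zeros((len(sequence), len(alphabet)), dtype=np.int8)
    for i, a in enumerate(sequence):
        x[i, index[a]] = 1
    return x.ravel()

v = one_hot("MKV-A")     # hypothetical 5-residue fragment
print(v.shape, v.sum())  # (105,) 5 -> one non-zero entry per position
```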
Inserting this embedding into the first- and second-moment constraints, corresponding to Eqs 4 and 5 in the continuous-variable case, we find their embedded analogues, the single and pairwise marginal probabilities in positions $i$ and $j$ for amino acids $s, \omega \in \Omega$,

$$\langle x_i(s) \rangle = \sum_{\mathbf{x}(\sigma)} P(\mathbf{x}(\sigma))\, x_i(s) = P(x_i = s) = P_i(s),$$

$$\langle x_i(s)\, x_j(\omega) \rangle = \sum_{\mathbf{x}(\sigma)} P(\mathbf{x}(\sigma))\, x_i(s)\, x_j(\omega) = P(x_i = s,\, x_j = \omega) = P_{ij}(s,\omega),$$

including $P_{ii}(s,\omega) = P_i(s)\,\mathbb{1}_s(\omega)$, and with the distribution's first moment in each random variable defined as $\langle y_i \rangle = \sum_{\mathbf{y}} P(\mathbf{y})\, y_i$ for $\mathbf{y} = (y_1, \ldots, y_{Lq})^T \in \mathbb{R}^{Lq}$. The analogue of the covariance matrix then becomes a symmetric $Lq \times Lq$ matrix of connected correlations whose entries $C_{ij}(s,\omega) = P_{ij}(s,\omega) - P_i(s)\, P_j(\omega)$ characterize the dependencies between pairs of variables. In the same way, the
sample means translate to the single-site and pair frequency counts over the $m = 1, \ldots, M$ data vectors $\mathbf{x}^m = (x_1^m, \ldots, x_L^m)^T \in \Omega^L$,

$$\overline{x_i(s)} = \frac{1}{M}\sum_{m=1}^{M} x_i^m(s) = f_i(s),$$

$$\overline{x_i(s)\, x_j(\omega)} = \frac{1}{M}\sum_{m=1}^{M} x_i^m(s)\, x_j^m(\omega) = f_{ij}(s,\omega).$$
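These counts are simple averages over the one-hot-encoded alignment. A minimal sketch building on the one_hot helper above (sequence reweighting, used in practice to correct for sampling bias, is omitted):

```python
import numpy as np

def frequency_counts(msa_onehot):
    """Single-site and pair frequency counts f_i(s) and f_ij(s, w) from a
    one-hot encoded alignment of shape (M, L, q)."""
    msa = msa_onehot.astype(float)  # avoid integer overflow in the sums
    M = msa.shape[0]
    f_i = msa.mean(axis=0)                            # (L, q)
    f_ij = np.einsum('mis,mjw->ijsw', msa, msa) / M   # (L, L, q, q)
    return f_i, f_ij
```

The resulting arrays are exactly the inputs assumed by the mean-field coupling sketch in the protein contact prediction section above.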
The pairwise maximum-entropy probability distribution in categorical variables has to fulfill the normalization constraint,

$$\sum_{\mathbf{x}} P(\mathbf{x}) = \sum_{\mathbf{x}(\sigma)} P(\mathbf{x}(\sigma)) = 1. \tag{9}$$
Furthermore, the single and pair constraints, the analogues of Eqs 4 and 5, enforce the resulting probability distribution to be compatible with the measured single and pair frequency counts,

$$P_i(s) = f_i(s), \qquad P_{ij}(s,\omega) = f_{ij}(s,\omega), \tag{10}$$
for each $i, j = 1, \ldots, L$ and amino acids $s, \omega \in \Omega$. As before, we require the probability distribution to maximize the information entropy,

$$\text{maximize} \quad S = -\sum_{\mathbf{x}} P(\mathbf{x}) \ln P(\mathbf{x}) = -\sum_{\mathbf{x}(\sigma)} P(\mathbf{x}(\sigma)) \ln P(\mathbf{x}(\sigma)). \tag{11}$$
The corresponding Lagrangian, $\mathcal{L} = \mathcal{L}(P(\mathbf{x}(\sigma)); \alpha, \boldsymbol{\beta}(\sigma), \boldsymbol{\gamma}(\sigma,\omega))$, has the functional form

$$\mathcal{L} = S + \alpha\left(\langle 1 \rangle - 1\right) + \sum_{i=1}^{L}\sum_{s\in\Omega} \beta_i(s)\left(P_i(s) - f_i(s)\right) + \sum_{i,j=1}^{L}\sum_{s,\omega\in\Omega} \gamma_{ij}(s,\omega)\left(P_{ij}(s,\omega) - f_{ij}(s,\omega)\right).$$
For notational convenience, the Lagrange multipliers $\beta_i(s)$ and $\gamma_{ij}(s,\omega)$ are grouped into the $Lq$-vector $\boldsymbol{\beta}(\sigma) = (\beta_i(s))_{i=1,\ldots,L;\, s\in\Omega}$ and the $Lq \times Lq$ matrix $\boldsymbol{\gamma}(\sigma,\omega) = (\gamma_{ij}(s,\omega))_{i,j=1,\ldots,L;\, s,\omega\in\Omega}$, respectively. The Lagrangian's stationary point, found as the solution of $\frac{\partial \mathcal{L}}{\partial P(\mathbf{x}(\sigma))} = 0$, determines the pairwise maximum-entropy probability distribution in categorical variables [30,49],

$$P(\mathbf{x}(\sigma); \boldsymbol{\beta}, \boldsymbol{\gamma}) = \frac{1}{Z} \exp\left(\sum_{i=1}^{L}\sum_{s\in\Omega} \beta_i(s)\, x_i(s) + \sum_{i,j=1}^{L}\sum_{s,\omega\in\Omega} \gamma_{ij}(s,\omega)\, x_i(s)\, x_j(\omega)\right), \tag{12}$$
with normalization by the partition function, $Z = \exp(1-\alpha)$. Note that the distribution of Eq 12 is of the same functional form as Eq 8, but with binary random variables $\mathbf{x}(\sigma) \in \{0,1\}^{Lq}$ instead of continuous ones, $\mathbf{x} \in \mathbb{R}^L$. At this point, we introduce the reduced parameter set, $h_i(s) := \beta_i(s) + \gamma_{ii}(s,s)$ and $e_{ij}(s,\omega) := 2\gamma_{ij}(s,\omega)$ for $i < j$, using the symmetry of the Lagrange multipliers, $\gamma_{ij}(s,\omega) = \gamma_{ji}(\omega,s)$, and the fact that $x_i(s)\, x_i(\omega) = 1$ if and only if $s = \omega$. For a given sequence $(z_1, \ldots, z_L) \in \Omega^L$, summing over all non-zero elements, $(x_1(z_1) = 1, \ldots, x_L(z_L) = 1)$, or equivalently $(x_1 = z_1, \ldots, x_L = z_L)$, then yields the probability assigned to the sequence of interest,

$$P(z_1, \ldots, z_L) = \frac{1}{Z} \exp\left(\sum_{i=1}^{L} h_i(z_i) + \sum_{1 \leq i < j \leq L} e_{ij}(z_i, z_j)\right). \tag{13}$$
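Eq 13 is the form in which the model is typically evaluated: given fields $h_i$ and couplings $e_{ij}$, the unnormalized log-probability of any sequence is a simple sum. A minimal sketch, assuming parameter arrays h of shape (L, q) and e of shape (L, L, q, q); the partition function $Z$, a sum over $q^L$ sequences, is intractable to evaluate exactly for realistic $L$, but it cancels when comparing sequences of the same length:

```python
import numpy as np

ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"

def log_prob_unnormalized(sequence, h, e, alphabet=ALPHABET):
    """Exponent of Eq 13 for one sequence: sum_i h_i(z_i) plus
    sum_{i<j} e_ij(z_i, z_j). Z is omitted, so only differences between
    scores of equal-length sequences are meaningful."""
    z = [alphabet.index(a) for a in sequence]
    L = len(z)
    score = sum(h[i, z[i]] for i in range(L))
    score += sum(e[i, j, z[i], z[j]]
                 for i in range(L) for j in range(i + 1, L))
    return score
```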