
Communications for Statistical Applications and Methods, 2015, Vol. 22, No. 6, 665–674
DOI: http://dx.doi.org/10.5351/CSAM.2015.22.6.665
Print ISSN 2287-7843 / Online ISSN 2383-4757

ppcor: An R Package for a Fast Calculation to Semi-partial Correlation Coefficients

Seongho Kim1,a

aBiostatistics Core, Karmanos Cancer Institute, Wayne State University, USA

Abstract: Lack of a general matrix formula hampers implementation of the semi-partial correlation, also known as part correlation, for higher-order coefficients. This is because the higher-order semi-partial correlation calculation using a recursive formula requires an enormous number of recursive calculations to obtain the correlation coefficients. To resolve this difficulty, we derive a general matrix formula of the semi-partial correlation for fast computation. The semi-partial correlations are then implemented in an R package, ppcor, along with the partial correlation. Owing to the general matrix formulas, users can readily calculate the coefficients of both partial and semi-partial correlations without computational burden. The package ppcor further provides users with the level of statistical significance along with its test statistic.

Keywords: correlation, partial correlation, part correlation, ppcor, semi-partial correlation

1. Introduction

The partial and semi-partial (also known as part) correlations are used to express the specific portion of variance explained by eliminating the effect of other variables when assessing the correlation between two variables (James, 2002; Johnson and Wichern, 2002; Whittaker, 1990). A great number of studies have been published using either partial or semi-partial correlations in many areas, including cognitive psychology (e.g., Baum and Rude, 2013), genomics (e.g., Fang et al., 2009; Kim and Yi, 2007; Zhu and Zhang, 2009), medicine (e.g., Vanderlinden et al., 2013), and metabolomics (e.g., Kim et al., 2012; Kim and Zhang, 2013). The partial correlation can be explained as the association between two random variables after eliminating the effect of all other random variables, while the semi-partial correlation eliminates the effect of only a fraction of the other random variables, for instance, removing the effect of all other random variables from just one of the two random variables of interest. The rationale for the partial and semi-partial correlations is to estimate a direct relationship or association between two random variables.

A brief explanation follows to describe the main differences among the correlation, the partial correlation, and the semi-partial correlation. Suppose there are three random variables (or vectors), X, Y, and Z, and we are interested in the relationship (or association) between X and Y, for illustration purposes. Three situations are taken into consideration, as shown in Figure 1. The figure describes three cases: (a) Z is correlated with neither X nor Y, (b) only Y is correlated with Z, and (c) Z is correlated with both X and Y.

This work is partially supported by NSF grant DMS-1312603. The Core is supported, in part, by NIH Center Grant P30 CA022453 to the Karmanos Cancer Institute at Wayne State University. 1 Biostatistics Core, Karmanos Cancer Institute, Department of Oncology, School of Medicine, Wayne State University, 87 E. Canfield St., Detroit, MI 48201. E-mail: [email protected]

Published 30 November 2015 / journal homepage: http://csam.or.kr
© 2015 The Korean Statistical Society, and Korean International Statistical Society. All rights reserved.

Figure 1: Graphical illustration of partial and semi-partial correlations among the three random variables X, Y, and Z.

Because Z is independent of both X and Y in Figure 1(a), the correlation, partial correlation, and semi-partial correlation should all theoretically have the identical value. When only Y is correlated with Z, as shown in Figure 1(b), the partial correlation is exactly the same as the semi-partial correlation, but is different from the correlation. However, in the case of Figure 1(c), all three correlations are different from each other. For more details on the partial and semi-partial correlations, refer to James (2002) and Whittaker (1990).

Several R packages have been developed only for the partial correlation. The R package corpcor (Schafer and Strimmer, 2005a) provides the function cor2pcor() for computing the partial correlation from the correlation matrix and vice versa. The function ggm.estimate.pcor(), which is built in the package GeneNet (Schafer and Strimmer, 2005b), allows users to estimate the partial correlation for graphical Gaussian models. The package PCIT (Watson-Haigh et al., 2010) provides an algorithm for calculating the partial correlation coefficient with information theory. The function partial.cor() is included in the Rcmdr (Fox, 2005) package for computing partial correlations. The package parcor (Kramer et al., 2009) can be used for regularized estimation of partial correlation matrices. The qp (Castelo and Roverato, 2006) package provides users with a q-order partial correlation graph search algorithm. The R package space (Peng et al., 2009) can be used for sparse partial correlation estimation. However, none of these packages provide the level of significance for the partial correlation coefficient, such as the p-value and test statistic. Furthermore, to our knowledge, there exists no R package for semi-partial correlation calculation.

On the other hand, there has been no attempt to reduce the computational burden of the higher-order semi-partial correlation coefficients, while the higher-order partial correlation coefficients can be easily calculated using the inverse variance-covariance matrix. This means that a recursive formula (e.g., see Equation (2.2)) has to be used for the higher-order semi-partial correlation calculation, hampering the use of the semi-partial correlation for high-dimensional data, such as 'omics' data, due to its expensive computation. For these reasons, we derive a general matrix formula for the semi-partial correlation calculation (see Equation (2.6)). Using this general matrix formula, the semi-partial correlation coefficient can be calculated simply and quickly. In order for the partial and the semi-partial correlations to be used practically, an R package ppcor is further developed in the R system for statistical computing (R Development Core Team, 2015). It provides a means for fast computation of partial and semi-partial correlations as well as the level of statistical significance. The package ppcor is publicly available from CRAN at http://CRAN.r-project.org/package=ppcor and it is also available in the Supplementary Material at the CSAM homepage (http://csam.or.kr).

2. Partial and Semi-partial Correlations

Consider the random vector $X = (x_1, x_2, \ldots, x_i, \ldots, x_n)'$, where $|X| = n$. We denote the variance of a random variable $x_i$ and the covariance between two random variables $x_i$ and $x_j$ as $v_i$ $(= \mathrm{var}(x_i))$

and $c_{ij}$ $(= \mathrm{cov}(x_i, x_j))$, respectively. The variance-covariance matrices of the random vectors $X$ and $X_S$ ($S \subset \{1, 2, \ldots, n\}$ and $|S| < n$) are denoted by $C_X$ and $C_S$, respectively, where $X_S$ is a random sub-vector of the random vector $X$. The correlation between two random variables $x_i$ and $x_j$ is denoted by $r_{ij} = c_{ij}/(\sqrt{v_i}\,\sqrt{v_j})$ $(= \mathrm{corr}(x_i, x_j))$.

Definition 1. The partial correlation of $x_i$ and $x_j$ given $x_k$ is

$$ r_{ij|k} = \frac{r_{ij} - r_{ik}\, r_{jk}}{\sqrt{1 - r_{ik}^2}\,\sqrt{1 - r_{jk}^2}} \tag{2.1} $$

and the semi-partial correlation of $x_i$ with $x_j$ given $x_k$ is

$$ r_{i(j|k)} = \frac{r_{ij} - r_{ik}\, r_{jk}}{\sqrt{1 - r_{jk}^2}}. \tag{2.2} $$
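As a small numerical illustration (a sketch, not taken from the original text; the simulated variables x, y, and z are hypothetical), the first-order coefficients in Equations (2.1) and (2.2) can be computed directly from the three pairwise correlations:

R> set.seed(1)
R> z <- rnorm(50); x <- 0.5*z + rnorm(50); y <- 0.7*z + rnorm(50)
R> r.xy <- cor(x,y); r.xz <- cor(x,z); r.yz <- cor(y,z)
R> (r.xy - r.xz*r.yz)/(sqrt(1-r.xz^2)*sqrt(1-r.yz^2))   # partial correlation, Equation (2.1)
R> (r.xy - r.xz*r.yz)/sqrt(1-r.yz^2)                    # semi-partial correlation, Equation (2.2)

The same two values should be returned by pcor.test(x, y, z) and spcor.test(x, y, z) of Section 3 when the default Pearson method is used.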

Whittaker (1990) defined the partial correlation using the correlation between two residuals. In fact, we can easily see that the definition in Whittaker (1990) is equivalent to the definition in Equation (2.1). Using the above definition, we can readily obtain another version of the definitions of the partial and the semi-partial correlation coefficients as follows.

Corollary 1. The partial correlation of $x_i$ and $x_j$ given $x_k$, $r_{ij|k}$, is equal to

$$ \mathrm{corr}(\mathrm{resid}(i|k), \mathrm{resid}(j|k)) = \frac{c_{ij} - c_{ik}\, v_k^{-1}\, c_{kj}}{\sqrt{v_i - c_{ik}\, v_k^{-1}\, c_{ki}}\,\sqrt{v_j - c_{jk}\, v_k^{-1}\, c_{kj}}} \tag{2.3} $$

and the semi-partial correlation of $x_i$ with $x_j$ given $x_k$, $r_{i(j|k)}$, is equal to

$$ r_{ij|k}\, \frac{\sqrt{\mathrm{var}(\mathrm{resid}(i|k))}}{\sqrt{v_i}} = \frac{c_{ij} - c_{ik}\, v_k^{-1}\, c_{kj}}{\sqrt{v_i}\,\sqrt{v_j - c_{jk}\, v_k^{-1}\, c_{kj}}}, \tag{2.4} $$

where $\mathrm{resid}(i|k) = x_i - \hat{x}_i(x_k)$ and $\hat{x}_i(x_k) = c_{ik}\, v_k^{-1}\, x_k$. Note that the proof of Corollary 1 is omitted since it is straightforward. Corollary 1 can be further generalized to the case in which there are two or more given variables. In other words, it can be extended to the higher-order partial and higher-order semi-partial correlations. To do this, we need to consider the inverse variance-covariance matrix of $X$, $D_X = C_X^{-1}$. For simplicity's sake, we denote $D_X$ as $[d_{ij}]$ and $C_X$ as $[c_{ij}]$, where $d_{ij}$ and $c_{ij}$ are the $(i, j)$th cells of the matrices $D_X$ and $C_X$, respectively, and $1 \le i, j \le n$. Then the partial correlation between two variables given a set of variables can be calculated by the following lemma.

Lemma 1. (Whittaker, 1990) The partial correlation of $x_i$ and $x_j$ given a random vector $X_S$ $(= X_{(i,j)})$, $r_{ij|S}$, is equal to

$$ -\frac{d_{ij}}{\sqrt{d_{ii}}\,\sqrt{d_{jj}}}, \tag{2.5} $$

where $X_S$ $(= X_{(i,j)})$ is the random sub-vector of $X$ after removing the random variables $x_i$ and $x_j$, and its size is $|S|$ $(= |X| - 2)$.
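As a quick numerical check of the lemma (a sketch with a simulated three-column data matrix, not part of the original text), the matrix-based value agrees with the recursive formula in Equation (2.1) when only a single variable is controlled:

R> set.seed(2)
R> X <- matrix(rnorm(300), ncol=3)       # 100 samples, 3 variables
R> D <- solve(cov(X))                    # inverse variance-covariance matrix
R> -D[1,2]/(sqrt(D[1,1])*sqrt(D[2,2]))   # Equation (2.5)
R> r <- cor(X)
R> (r[1,2]-r[1,3]*r[2,3])/(sqrt(1-r[1,3]^2)*sqrt(1-r[2,3]^2))   # Equation (2.1), same value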

Most R packages for the calculation of the partial correlation use the matrix-based calculation based on Equation (2.5), since this method is much less computationally expensive than the method based on Equation (2.1). In this case, we have to calculate the inverse variance-covariance matrix $D_X$ in order to obtain the partial correlations for all pairs of the random variables of $X$. Fortunately, it can be easily obtained with a simple code in R. For example, the partial correlation of $x_i$ and $x_j$ given $X_{(i,j)}$ of $X$ is the $(i, j)$th cell of the following matrix:

R> -cov2cor(solve(cov(X)))

However, there is no matrix-based mathematical formula for the semi-partial correlation. Without a general matrix formula, users have to calculate the higher-order semi-partial correlation through the recursive formula in Equation (2.2), which is time-consuming for high-dimensional data. Therefore, it is highly desirable to have a general matrix formula for fast higher-order semi-partial correlation calculation. In the next theorem, we derive a matrix-based mathematical formula for the semi-partial correlation calculation.

Theorem 1. The semi-partial correlation of $x_i$ with $x_j$ given a random vector $X_S$ $(= X_{(i,j)})$, $r_{i(j|S)}$, is equal to

$$ r_{ij|S}\, \frac{1}{\sqrt{c_{ii}}}\, \frac{1}{\sqrt{d_{ii} - d_{ij}\, d_{jj}^{-1}\, d_{ji}}}. \tag{2.6} $$

Proof: By Equation (2.4), we have the following equation:

$$ r_{i(j|S)} = r_{ij|S}\, \frac{\sqrt{\mathrm{var}(\mathrm{resid}(i|S))}}{\sqrt{v_i}}. \tag{2.7} $$

Then, using Equation (2.7) and Lemma 1, we have

$$
\begin{aligned}
r_{i(j|S)} &= -\frac{d_{ij}}{\sqrt{d_{ii}}}\,\sqrt{\frac{|C_X|}{|C_S|}}\,\frac{1}{\sqrt{c_{ii}}} \\
&= -\frac{d_{ij}}{\sqrt{d_{ii}}\,\sqrt{d_{jj}}}\,\frac{1}{\sqrt{c_{ii}}}\,\frac{1}{\sqrt{d_{ii} - d_{ij}\, d_{jj}^{-1}\, d_{ji}}} \\
&= r_{ij|S}\,\frac{1}{\sqrt{c_{ii}}}\,\frac{1}{\sqrt{d_{ii} - d_{ij}\, d_{jj}^{-1}\, d_{ji}}}. \qquad \Box
\end{aligned}
$$

Using Theorem 1, we can readily calculate the semi-partial correlation using several lines of R code. For example, the semi-partial correlation of $x_i$ with $x_j$ given $X_{(i,j)}$ is the $(i, j)$th cell of the matrix obtained in the last line of the following code.

R> cx <- cov(X)
R> dx <- solve(cx)

R> pc <- -cov2cor(dx)
R> diag(pc) <- 1
R> pc/sqrt(diag(cx))/sqrt(abs(diag(dx)-t(t(dx^2)/diag(dx))))

The code above first calculates the variance-covariance matrix of X, which is cx, and its inverse variance-covariance matrix, which is dx. Then the semi-partial correlations are obtained using the partial correlations, pc. Note that, when the determinant of the variance-covariance matrix is numerically zero, the R package ppcor computes its pseudo-inverse using the Moore-Penrose generalized matrix inverse (Penrose, 1955). However, in this case, no statistics and p-values are provided if the number of variables is greater than or equal to the sample size.

While, to our knowledge, no other R packages provide the level of statistical significance for the partial correlation coefficient, the R package ppcor includes the calculation of statistics and p-values of each correlation coefficient for both partial and semi-partial correlations. Moreover, ppcor provides users with nonparametric partial and semi-partial correlation coefficients based on Kendall's and Spearman's rank correlations. The statistics $t_{ij|S}$ and $t_{i(j|S)}$ of the partial and semi-partial correlations of $x_i$ and (with) $x_j$ given $X_S$ $(= X_{(i,j)})$ are calculated, based on the works in Weatherburn (1968) and Sheskin (2003), by

$$ t_{ij|S} = r_{ij|S}\,\sqrt{\frac{N - 2 - g}{1 - r_{ij|S}^2}}; \qquad t_{i(j|S)} = r_{i(j|S)}\,\sqrt{\frac{N - 2 - g}{1 - r_{i(j|S)}^2}}, \tag{2.8} $$

where $N$ is the sample size and $g$ is the total number of given (or controlled) variables. Using Equation (2.8), their p-values are calculated by

$$ p_{ij|S} = 2\,\Phi_t\!\left(-\left|t_{ij|S}\right|,\, N - 2 - g\right); \qquad p_{i(j|S)} = 2\,\Phi_t\!\left(-\left|t_{i(j|S)}\right|,\, N - 2 - g\right), \tag{2.9} $$

where $\Phi_t(\cdot)$ is the cumulative distribution function of a Student's $t$ distribution with $N - 2 - g$ degrees of freedom. It is known that the standard error is $1/\sqrt{N - 2 - g}$ (Olkin and Finn, 1995; Stanley and Doucouliagos, 2012; Sharma, 2012). In the case of Kendall's correlation, the statistics are computed by (Abdi, 2007)

$$ z_{ij|S} = r_{ij|S}\,\sqrt{\frac{9(N - g - 2)(N - g + 1)}{2(2N - 2g + 1)}}; \qquad z_{i(j|S)} = r_{i(j|S)}\,\sqrt{\frac{9(N - g - 2)(N - g + 1)}{2(2N - 2g + 1)}}. \tag{2.10} $$

Using Equation (2.10), their p-values are calculated by

$$ p_{ij|S} = 2\,\Phi\!\left(-\left|z_{ij|S}\right|\right); \qquad p_{i(j|S)} = 2\,\Phi\!\left(-\left|z_{i(j|S)}\right|\right), \tag{2.11} $$

where $\Phi(\cdot)$ is the cumulative distribution function of a standard normal distribution. The standard error is $\sqrt{2(2N - 2g + 1)}/\sqrt{9(N - g - 2)(N - g + 1)}$ (Abdi, 2007).
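For illustration (a minimal sketch of Equations (2.8) and (2.9) only, not the internal ppcor code), the t-based statistic and p-value of a partial or semi-partial correlation coefficient can be reproduced from r, N, and g:

R> r <- -0.7647345; N <- 10; g <- 2      # e.g., the hl-disp partial correlation of Section 3
R> t.stat <- r*sqrt((N-2-g)/(1-r^2))     # Equation (2.8), approximately -2.907
R> 2*pt(-abs(t.stat), df=N-2-g)          # Equation (2.9), approximately 0.027

For a Kendall-based coefficient, the z statistic of Equation (2.10) and pnorm() would be used in the same way via Equation (2.11).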

3. Examples

The R package ppcor provides users with four functions: pcor(), pcor.test(), spcor(), and spcor.test(). The function pcor() (spcor()) calculates the partial (semi-partial) correlations of all pairs of random variables of a matrix or a data frame and provides the matrices of statistics and p-values of each pairwise partial (semi-partial) correlation. In order to compute the pairwise partial (semi-partial) correlation coefficient of a pair of two random variables given one or more random variables, pcor.test() (spcor.test()) can also be used instead. We can see how to use these functions through the following examples. First, the test data, y.data, need to be created after loading the package with the following R code.

R> library(ppcor)
R> y.data <- data.frame(
+   hl = c(7,15,19,15,21,22,57,15,20,18),
+   disp = c(0,0.964,0,0,0.921,0,0,1.006,0,1.011),
+   deg = c(9,2,3,4,1,3,1,3,6,1),
+   BC = c(1.78e-02,1.05e-06,1.37e-05,7.18e-03,0,0,0,4.48e-03,2.10e-06,0)
+ )

This test data set, y.data, consists of 10 samples of four variables: hl, disp, deg, and BC. The data set is available from Drummond et al. (2006) and Kim and Yi (2007). The original data cover the relationship between sequence and functional evolution in yeast proteins. Here we look at only part of the large data set for illustrative purposes. Note that hl, disp, deg, and BC stand for half-life, dispensability, degree, and betweenness-centrality, respectively. Please refer to Drummond et al. (2006) and Kim and Yi (2007) for more details. We can then calculate all pairwise partial correlations of each pair of two variables given the other variables with

R> pcor(x=y.data,method="spearman")

Then we obtain the following output:

$estimate
             hl       disp        deg         BC
hl    1.0000000 -0.7647345 -0.1367596 -0.7860646
disp -0.7647345  1.0000000 -0.4845966 -0.4506273
deg  -0.1367596 -0.4845966  1.0000000  0.4010940
BC   -0.7860646 -0.4506273  0.4010940  1.0000000

$p.value
             hl       disp       deg         BC
hl   0.00000000 0.02708081 0.7467551 0.02071908
disp 0.02708081 0.00000000 0.2236095 0.26248897
deg  0.74675508 0.22360945 0.0000000 0.32471409
BC   0.02071908 0.26248897 0.3247141 0.00000000

$statistic
             hl      disp        deg        BC
hl    0.0000000 -2.907150 -0.3381686 -3.114899
disp -2.9071501  0.000000 -1.3569947 -1.236464
deg  -0.3381686 -1.356995  0.0000000  1.072529
BC   -3.1148991 -1.236464  1.0725286  0.000000

$n

[1] 10

$gp
[1] 2

$method
[1] "spearman"

The output has six components: estimate, which is the partial correlation coefficient; p.value, which is the level of statistical significance; statistic, which is the test statistic for the p-value; n, which is the total number of samples; gp, which is the number of given or controlled variables; and method, which is the correlation method used among Pearson's, Kendall's, and Spearman's correlation methods. In case users are interested in the partial correlation between hl and disp given deg and BC, we can compute the partial correlation with

R> pcor.test(x=y.data$hl,y=y.data$disp,z=y.data[,c("deg","BC")]
+ ,method="spearman")

Then we obtain the following output:

    estimate    p.value statistic  n gp   Method
1 -0.7647345 0.02708081  -2.90715 10  2 spearman

Similarly, the semi-partial correlations can be calculated with

R> spcor(x=y.data,method="spearman")

Then we obtain the following output:

$estimate
              hl       disp         deg         BC
hl    1.00000000 -0.4254609 -0.04949092 -0.4558649
disp -0.59319449  1.0000000 -0.27689034 -0.2522965
deg  -0.06380762 -0.2560457  1.00000000  0.2023709
BC   -0.42262366 -0.1677612  0.14551866  1.0000000

$p.value
            hl      disp       deg        BC
hl   0.0000000 0.2933025 0.9073559 0.2562889
disp 0.1211334 0.0000000 0.5067562 0.5466351
deg  0.8806850 0.5404845 0.0000000 0.6307871
BC   0.2968811 0.6912998 0.7309799 0.0000000

$statistic
             hl       disp        deg         BC
hl    0.0000000 -1.1515898 -0.1213762 -1.2545787
disp -1.8048658  0.0000000 -0.7058372 -0.6386584
deg  -0.1566153 -0.6488095  0.0000000  0.5061789
BC   -1.1422336 -0.4168368  0.3602815  0.0000000

$n
[1] 10

$gp
[1] 2

$method
[1] "spearman"

The semi-partial correlation of hl with disp given deg and BC is calculated with

R> spcor.test(x=y.data$hl,y=y.data$disp,z=y.data[,c("deg","BC")]
+ ,method="spearman")

Then we obtain the following output:

    estimate   p.value statistic  n gp   Method
1 -0.4254609 0.2933025  -1.15159 10  2 spearman
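Note that, unlike the partial correlation, the semi-partial correlation is not symmetric in its two arguments. As a usage note (not part of the original example), swapping x and y in the call above gives the semi-partial correlation of disp with hl given deg and BC:

R> spcor.test(x=y.data$disp,y=y.data$hl,z=y.data[,c("deg","BC")]
+ ,method="spearman")

The estimate should now be about -0.593, matching the (disp, hl) cell of the spcor() output above rather than the (hl, disp) cell.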

It should be noted that, if a general matrix formula for the semi-partial correlation were not available, users would have to calculate all pairs of variables with the function spcor.test() using two loops. To see how fast the general matrix formula can compute the semi-partial correlation, we compared the computational time by generating a data matrix of size 500 × 100 (i.e., the number of variables is 100 and the number of samples is 500). When the function spcor() was used, the total amount of computation time was 0.02 seconds, while it took 135.33 seconds when the function spcor.test() was used with two loops. This demonstrates that the general matrix formula dramatically reduces the computational burden of the higher-order semi-partial correlation calculation. Note that this simulation was run on a desktop with an Intel Core 2 Duo CPU at 3.00 GHz.
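A comparison along these lines can be reproduced with code such as the following (a sketch; the seed and simulated data are arbitrary, and the exact timings are machine-dependent):

R> set.seed(123)
R> X <- data.frame(matrix(rnorm(500*100), nrow=500))   # 500 samples, 100 variables
R> system.time(sp.mat <- spcor(X))                     # matrix formula, Equation (2.6)
R> system.time({                                       # recursive approach with two loops
+   p <- ncol(X); sp.loop <- diag(p)
+   for (i in 1:p) for (j in (1:p)[-i])
+     sp.loop[i,j] <- spcor.test(X[,i], X[,j], X[,-c(i,j)])$estimate
+ })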

4. Conclusions

A general matrix formula for the semi-partial correlation is derived. Lack of this general matrix formula has hampered implementation of the higher-order semi-partial correlation for high-dimensional 'omics' data analysis because it requires an enormous number of recursive calculations to obtain the correlation coefficient when using the recursive formula in Equation (2.2). However, using the derived matrix formula in Theorem 1, we can clearly see that the higher-order semi-partial correlation coefficient can be calculated as simply and as quickly as the partial correlation coefficient. The developed R package ppcor further provides users not only with functions to readily calculate both the higher-order partial and semi-partial correlations but also with the statistics and p-values of the correlation coefficients.

5. Computational Details

The results in this paper were obtained using R 3.2.2 with the package ppcor. R and the ppcor package are available from CRAN at http://CRAN.R-Project.org/ and in the Supplementary Material at the CSAM homepage (http://csam.or.kr). Note that, in this latest version of the package ppcor, the p-values for Pearson's and Spearman's correlations are calculated based on the t-distribution, and the Moore-Penrose generalized inverse matrix will be used when the variance-covariance matrix is singular.
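For reference, the Moore-Penrose generalized inverse itself can be obtained with MASS::ginv(). The short sketch below uses a hypothetical singular case (more variables than samples) to mimic the fallback described above, although it is not necessarily identical to the package internals:

R> library(MASS)
R> set.seed(3)
R> X <- matrix(rnorm(5*10), nrow=5)   # 5 samples, 10 variables, so cov(X) is singular
R> cx <- cov(X)                       # det(cx) is numerically zero
R> dx <- ginv(cx)                     # Moore-Penrose generalized inverse
R> pc <- -cov2cor(dx)                 # partial correlations as in Section 2

As noted in Section 2, in this situation ppcor returns the coefficient estimates but no statistics or p-values.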

6. Acknowledgements

We appreciate the helpful suggestions and comments of the anonymous reviewers and the associate editor.

References

Abdi, H. (2007). Kendall rank correlation, In N. J. Salkind (Ed.), Encyclopedia of Measurement and Statistics, Thousand Oaks, CA: Sage, 508–510.
Baum, E. S. and Rude, S. S. (2013). Acceptance-enhanced expressive writing prevents symptoms in participants with low initial depression, Cognitive Therapy and Research, 37, 35–42.
Castelo, R. and Roverato, A. (2006). A robust procedure for Gaussian graphical model search from microarray data with p larger than n, Journal of Machine Learning Research, 7, 2621–2650.
Drummond, D. A., Raval, A. and Wilke, C. O. (2006). A single determinant dominates the rate of yeast protein evolution, Molecular Biology and Evolution, 23, 327–337.
Fang, X. Z., Luo, L., Reveille, J. D. and Xiong, M. (2009). Discussion: Why do we test multiple traits in genetic association studies?, Journal of the Korean Statistical Society, 38, 17–23.
Fox, J. (2005). The R Commander: A basic-statistics graphical user interface to R, Journal of Statistical Software, 14, 1–42.
James, S. (2002). Applied for the Social Sciences, Lawrence Erlbaum Associates, Inc., Mahwah, NJ.
Johnson, R. A. and Wichern, D. W. (2002). Applied Multivariate Statistical Analysis, Prentice Hall.
Kim, S., Koo, I., Jeong, J., Wu, S., Shi, X. and Zhang, X. (2012). Compound identification using partial and semipartial correlations for gas chromatography-mass spectrometry data, Analytical Chemistry, 12, 6477–6487.
Kim, S. and Yi, S. (2007). Understanding relationship between sequence and functional evolution in yeast proteins, Genetica, 131, 151–156.
Kim, S. and Zhang, X. (2013). Comparative analysis of mass spectral similarity measures on peak alignment for comprehensive two-dimensional gas chromatography mass spectrometry, Computational and Mathematical Methods in Medicine, 2013, 509761.
Kramer, N., Schafer, J. and Boulesteix, A. L. (2009). Regularized estimation of large scale gene association networks using Gaussian graphical models, BMC Bioinformatics, 10, 384.
Olkin, I. and Finn, J. D. (1995). Correlations redux, Psychological Bulletin, 118, 155–164.
Peng, J., Wang, P., Zhou, N. and Zhu, J. (2009). Partial correlation estimation by joint sparse regression models, Journal of the American Statistical Association, 104, 735–746.
Penrose, R. (1955). A generalized inverse for matrices, In Proceedings of the Cambridge Philosophical Society, 51, 406–413.
R Development Core Team (2015). R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0, URL: http://www.R-project.org/
Schafer, J. and Strimmer, K. (2005a). A shrinkage approach to large-scale covariance matrix estimation and implications for functional genomics, Statistical Applications in Genetics and Molecular Biology, 4, 32.
Schafer, J. and Strimmer, K. (2005b). An empirical Bayes approach to inferring large-scale gene association networks, Bioinformatics, 21, 754–764.
Sharma, J. K. (2012). Business Statistics, Pearson Education India.
Sheskin, D. J. (2003). Handbook of Parametric and Nonparametric Statistical Procedures: Third Edition, CRC Press.

Stanley, T. D. and Doucouliagos, H. (2012). Meta-Regression Analysis in Economics and Business, Routledge.
Vanderlinden, L. A., Saba, L. M., Kechris, K., Miles, M. F., Hoffman, P. L. and Tabakoff, B. (2013). Whole brain and brain regional coexpression network interactions associated with predisposition to alcohol consumption, PLoS ONE, 8, e68878.
Watson-Haigh, N. S., Kadarmideen, H. N. and Reverter, A. (2010). PCIT: An R package for weighted gene co-expression networks based on partial correlation and information theory approaches, Bioinformatics, 26, 411–413.
Weatherburn, C. E. (1968). A First Course in Mathematical Statistics, Cambridge University Press, Cambridge.
Whittaker, J. (1990). Graphical Models in Applied Multivariate Statistics, John Wiley & Sons, New York.
Zhu, W. S. and Zhang, H. P. (2009). Why do we test multiple traits in genetic association studies?, Journal of the Korean Statistical Society, 38, 1–10.

Received September 25, 2015; Revised November 20, 2015; Accepted November 20, 2015