
PsyArXiv

Why Overfitting is Not (Usually) a Problem in Partial Correlation Networks

Donald R. Williams & Josue E. Rodriguez
University of California, Davis

Network psychometrics is undergoing a time of methodological reflection. In part, this was spurred by the revelation that ℓ1-regularization does not reduce spurious associations in partial correlation networks. In this work, we address another motivation for the widespread use of regularized estimation: the thought that it is needed to mitigate overfitting. We first clarify important aspects of overfitting and the bias-variance tradeoff that are especially relevant for the network literature, where the number of nodes or items in a psychometric scale is not large compared to the number of observations (i.e., a low p/n ratio). This revealed that bias and especially variance are most problematic in p/n ratios rarely encountered. We then introduce a nonregularized method, based on classical hypothesis testing, that fulfills two desiderata: (1) reducing or controlling the false positive rate and (2) quelling concerns of overfitting by providing accurate predictions. These were the primary motivations for initially adopting the graphical lasso (glasso). In several simulation studies, our nonregularized method provided more than competitive predictive performance, and, in many cases, outperformed glasso. It appears to be nonregularized, as opposed to regularized, estimation that best satisfies these desiderata. We then provide insights into using our methodology. Here we discuss the multiple comparisons problem in relation to prediction: stringent alpha levels, resulting in a sparse network, can deteriorate predictive accuracy. We end by emphasizing key advantages of our approach that make it ideal for both inference and prediction in network analysis.

Keywords: partial correlation network, overfitting, prediction, frequentist inference, mean squared error

In the social-behavioral sciences, network theory has emerged as an increasingly popular framework for understanding psychological constructs (Borsboom, 2017; Jones, Heeren, & McNally, 2017). The underlying rationale is that a group of observed variables, say, self-reported symptoms, are a dynamic system that mutually influence and interact with one another (Borsboom & Cramer, 2013). The observed variables are "nodes" and the featured connections between nodes are "edges." This work focuses on partial correlation networks, wherein the edges represent conditionally dependent nodes—pairwise relations that have controlled for the other nodes in the network (Epskamp, Waldorp, Mottus, & Borsboom, 2018). This powerful approach has resulted in an explosion of research; for example, network analysis has been used to shed new light upon a variety of constructs including personality (Costantini et al., 2015), narcissism (Di Pierro, Costantini, Benzi, Madeddu, & Preti, 2019), and hypersexuality (Werner, Štulhofer, Waldorp, & Jurin, 2018).

Recently, the foundation of network psychometrics was improved upon when the default methodology was revisited (Williams & Rast, 2019; Williams, Rhemtulla, Wysocki, & Rast, 2019). In the network literature, ℓ1-regularization (a.k.a., the "least absolute shrinkage and selection operator" or "lasso") emerged as the default approach for detecting conditionally dependent relations. Initially, it was motivated by the thought that it reduces spurious relations. Paradoxically, the exact opposite holds true: lasso is known to not select the correct model (Zhao & Yu, 2006) and to have a relatively low false negative rate. For the latter, in structural equation models (SEM), it was noted that "lasso kept more variables in the model (more Type I and fewer Type II errors)" (p. 72, Jacobucci, Brandmaier, & Kievit, 2019). In network models, it was recently demonstrated that the inflated false positive rate inherent to lasso depends on many factors, including the sample size, edge size, sparsity, and the number of nodes (see Figure 6 in Williams et al., 2019). On the other hand, nonregularized estimation has a lower false positive rate that does not depend on those factors (Williams & Rast, 2019). Together, it is now clear that limiting false positives does not motivate the default status of ℓ1-regularization.

Author note: DRW was supported by a National Science Foundation Graduate Research Fellowship under Grant No. 1650042. All code to reproduce the simulations and figures is available online (https://osf.io/fm92b/).

In this work, we seek to further improve network analysis by revisiting another purported benefit of using ℓ1-regularization. Additional motivation for using lasso is the thought that it can mitigate overfitting. The underlying rationale is summarized in Fried and Cramer (2017):

    In the case of network models, overfitting is an especially severe challenge because we investigate relationships among a large number of variables, which means there is danger of overfitting a large number of parameters. One way to mitigate this problem somewhat is to regularize networks (p. 1011).

Although Fried and Cramer (2017) also stated "it is unclear at present to what degree regularization techniques increase generalizability" (p. 1011), it has nonetheless permeated the network literature. Indeed, the emerging consensus is that ℓ1-regularization provides "protection against overfitting" (p. 16, Christensen & Golino, 2019). We argue that such statements are not entirely clear, as, for example, it is all but guaranteed that the "fit to the data" (measured by predictive accuracy) will be better for training data than unseen data—some degree of overfitting is a foregone conclusion. In our opinion, it is also not readily apparent that overfitting is a "severe challenge" that motivates regularization compared to nonregularized estimation. Because regularization has serious ramifications for statistical inference, such as presenting issues for computing valid confidence intervals (see for example section 3.1 in Bühlmann, Kalisch, & Meier, 2014), the guiding idea behind this work is that the "challenge" of overfitting must warrant sacrificing gold-standard statistical approaches (e.g., ordinary least squares and p-values).

The network literature provides a unique opportunity to investigate predictive methods in the social-behavioral sciences. This is because nodewise predictability has an important place in both network theory and analysis (Haslbeck & Fried, 2017; Haslbeck & Waldorp, 2018). The basic idea is to see how well a given node is predicted by the other nodes in the network. The primary motivation of this approach is to determine "how self-determined the network is" (p. 860, Haslbeck & Waldorp, 2018). This is accomplished by computing variance explained for each node in the network, given the corresponding "neighborhood" of conditional relations (i.e., shared connections with other nodes, Meinshausen & Bühlmann, 2006). This approach is used extensively in the network literature, including as an outcome measure in clinical interventions (see for example Blanken et al., 2019). Therefore, it is important to develop methodology that brings to fruition the primary motivations for initially adopting ℓ1-regularization: (1) detecting conditional relations with a low false positive rate; and (2) mitigating the possibility of overfitting. As we show below, both goals can be achieved without employing regularization.

At first blush it may sound naive to suggest nonregularized methods can not only provide adequate but actually superior predictive performance compared to ℓ1-regularization. However, in certain situations, this is a robust finding in the model selection literature (see references in Bertsimas, King, & Mazumder, 2016; Mazumder, Radchenko, & Dedieu, 2017). This is summarized in Hastie, Tibshirani, and Tibshirani (2017):

    The lasso gives better accuracy results than [nonregularized] best subset selection in the low signal-to-noise ratio (SNR) range and worse accuracy than best subset in the high SNR range. The transition point—the SNR level past which best subset outperforms the lasso—varies depending on the problem dimensions (p. 17).

The transition point refers to the number of variables (p) relative to the number of observations (n). Hence, even in noisy situations, the benefits of lasso diminish or vanish altogether when n is much larger than p. This is the customary situation in the network literature. The remaining consideration is the SNR (i.e., R²/(1 − R²)). Nodes are usually items from a validated psychometric scale that, by construction, should not have a terribly low SNR. Together, this suggests that the oft-touted protective shield of regularization may be overstated.

The thought that regularization is advantageous for limiting overfitting extends beyond the network literature. In the social-behavioral sciences, this is typically presented with some combination of regularization and the bias-variance tradeoff (Jacobucci et al., 2019; Yarkoni & Westfall, 2017). It is often argued that biasing estimates with regularization can be advantageous, for example, "[lasso] introduces bias in the parameter estimates in order to avoid overfitting" (p. 6, Ryan, Bringmann, & Schuurman, 2019). However, it is important to note that parameter and prediction bias are not the same thing. Indeed, from the statistics literature, "the nonzero estimates from the lasso tend to be biased toward zero, so the debiasing [removing parameter bias]...can often improve the prediction error of the model" (p. 16, Hastie, Tibshirani, & Wainwright, 2015). In addition to proposing novel methodology, another major contribution of this work is clarifying possible confusion surrounding the bias-variance tradeoff and overfitting.

Major Contribution

Our major contribution is explicitly examining overfitting in partial correlation networks. In the literature, the general thought is that overfitting is a serious challenge that requires using regularization. However, as noted in Waldorp, Marsman, and Maris (2019), "Once the parameters are obtained it turns out that inference on network parameters is in general difficult with ℓ1-regularization" (p. 53). Accordingly, because inference is the primary goal in network analysis, overfitting must be a pressing challenge that warrants sacrificing statistical inference. If not, this would indicate that there is no reason to use regularization, given that "inference on network parameters" is straightforward with non-regularized methods.

This contribution is similar in spirit to Williams and Rast (2019) and Williams et al. (2019), both of which demonstrated that ℓ1-regularization has a relatively high false positive rate. In this work, however, the focus is explicitly on the purported benefits of ℓ1-regularization for overfitting in network analysis.

Overview

In what follows, we first introduce partial correlation networks and predictability in the context of a motivating example. Here we bring much needed nuance and clarity to customary examples, including the infamous polynomial regression and the bias-variance tradeoff, that are used to discuss regularization. To this end, we transport ideas stemming from econometrics and forecasting to the psychological literature. Second, we introduce a rather simple, nonregularized method for both detecting conditional relations with a low false positive rate and providing accurate predictions. We then compare this method to ℓ1-regularization using real datasets that span a wide range of psychological constructs. We conclude with recommendations for computing nodewise predictability in psychological networks.

The Gaussian Graphical Model

For multivariate normal data, the Gaussian graphical model (GGM) captures conditional relationships that are typically visualized to infer the underlying dependence structure (i.e., the partial correlation "network"; Højsgaard, Edwards, & Lauritzen, 2012; Lauritzen, 1996). There is an undirected graph that is denoted G = (V, E), which includes a vertex set V = {1, ..., p} and an edge set E ⊂ V × V. The former refers to "nodes" and represents, say, items in a questionnaire, whereas the latter set contains the estimated network structure. Let y = (y1, ..., yp)⊤ be a random vector indexed by the graph's vertices that is assumed to follow a multivariate normal distribution, y ∼ Np(μ, Σ), where Σ is a p × p positive definite covariance matrix. We use Y to denote the n × p data matrix, where each row corresponds to the observations from a given individual. Further, without loss of information, the data are considered centered with mean vector 0. The undirected graph is obtained by determining which off-diagonal elements of the precision matrix, Θ = Σ⁻¹, are non-zero. That is, (i, j) ∈ E when nodes i and j are determined to be conditionally dependent and set to zero otherwise. Note that the edges (or "connections") in a GGM are partial correlations ρij·z that are computed directly from Θ, that is,

\rho_{ij \cdot z} = \frac{-\theta_{ij}}{\sqrt{\theta_{ii}\,\theta_{jj}}}. \qquad (1)

Hence, estimating partial correlation networks can be accomplished by testing whether each relation in (1) is "significantly" different from zero. This is described in Drton and Perlman (2004) and Williams and Rast (2019). The method introduced below builds upon this general idea.

Multiple Regression

Network predictability takes advantage of the correspondence between the elements of Θ and multiple regression (Kwan, 2014; Stephens, 1998). Suppose that the jth column Y_j is predicted by the remaining (p − 1) nodes Y_{−j}. For nodes i and j, the resulting coefficients and error variances are defined as

\beta_{ij} = \frac{-\theta_{ij}}{\theta_{jj}} \quad \text{and} \quad \sigma^2_j = \frac{1}{\theta_{jj}}, \qquad (2)

where i and j denote the corresponding row and column of Θ, β_{ij} is the regression weight for the jth node (i ≠ j), and σ²_j is the residual variance. This allows for recovering all elements of Θ (and Σ) with p multiple regression models. In relation to (1), the regression coefficients also have a direct mapping to the partial correlations, that is,

\beta_{ij} = \rho_{ij \cdot z} \sqrt{\theta_{ii} / \theta_{jj}}. \qquad (3)

This relationship is often utilized in the GGM literature. For example, with the regression coefficients in hand, it is possible to compute variance explained (R²) for each node in the network (Haslbeck & Fried, 2017; Haslbeck & Waldorp, 2018). This is typically done after the network structure has been determined, such that some elements of Θ are zero. Hence, R² leads to a powerful inference in relation to the conditional relations for that node. Furthermore, there are a variety of approaches that use multiple regression to estimate Θ (Liu & Wang, 2017; Yuan, 2010) or that focus on the partial correlation matrix (Krämer, Schäfer, & Boulesteix, 2009).
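To make the correspondence in (1)-(3) concrete, the short R sketch below builds a small precision matrix, converts it into the partial correlations, regression weights, and residual variance for one node, and checks the mapping against an ordinary least-squares fit on simulated data. The three-node covariance matrix and the sample size are arbitrary choices for illustration and are not values used in this paper; MASS::mvrnorm is used only to simulate multivariate normal data.

    # Illustration of Equations (1)-(3): partial correlations and nodewise
    # regression weights are both simple functions of the precision matrix.
    set.seed(1)

    Sigma <- matrix(c(1.0, 0.4, 0.3,
                      0.4, 1.0, 0.2,
                      0.3, 0.2, 1.0), nrow = 3)   # hypothetical covariance matrix
    Theta <- solve(Sigma)                          # precision matrix

    # Equation (1): partial correlation between nodes i and j given the rest
    pcor_ij <- function(Theta, i, j) -Theta[i, j] / sqrt(Theta[i, i] * Theta[j, j])

    # Equations (2)-(3): weights and residual variance for predicting node j
    j        <- 1
    beta_j   <- -Theta[-j, j] / Theta[j, j]        # regression weights for node j
    sigma2_j <- 1 / Theta[j, j]                    # residual (irreducible) variance

    # Check against least squares on data generated from Sigma
    Y <- MASS::mvrnorm(n = 5000, mu = rep(0, 3), Sigma = Sigma)
    coef(lm(Y[, j] ~ Y[, -j] - 1))                 # approximately equal to beta_j

Because the weights for predicting node j are simply the jth column of Θ rescaled, any estimate of the precision matrix, whether regularized or not, immediately yields nodewise predictions and variance explained.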

Overfitting Revisited

At this point, it is important to define overfitting: a model is said to be overfit when it is overly complex, such that it has learned noise in the (training) data and its predictions generalize poorly to unseen (test) data. Notice that there is nothing directly related to having a model that includes spurious associations or parameter estimation. The explicit focus is out-of-sample prediction.

In psychological networks, there are many nodes and this has raised concerns about overfitting. For example, with p = 20 nodes there are 190 partial correlations in total. Although this sounds like an impressive number of parameters, the relation to multiple regression indicates that 19 regression coefficients are estimated (p − 1). Because the relation in (2) is exact, the dimensions of Y (n × p) are of primary interest when thinking about estimation accuracy in GGMs. This is known as the "small n large p" problem:

    When the matrix dimension p is larger than the number of observations available n, the sample covariance matrix is not even invertible. When the ratio p/n is less than one but not negligible, the sample covariance matrix is invertible, but numerically ill-conditioned, which means that inverting it amplifies estimation error dramatically. For large p, it is difficult to find enough observations to make p/n negligible (p. 366, Ledoit & Wolf, 2004).

Hence, there is an important distinction between the matrix dimensions and the number of parameters (see also p. 2 in Kuismin & Sillanpää, 2017). With 190 partial correlations, and, say, n = 200, this seems to warrant some form of regularization ("p/n" = 0.95). In fact, it is common to motivate regularized estimation by noting the number of partial correlations (see for example p. 308 in Hevey, 2018). On the other hand, the p/n ratio is 0.10, which does not seem problematic and might be "negligible." These insights (and more) can be gleaned from decomposing the mean squared prediction error (MSE).

Figure 1. Regression models fitted in the motivating example, including the true model, an underfit model (intercept only), and an overfit model (10th-degree polynomial). The polynomial example is often used to highlight the dangers of overfitting, but it can also lead to an exaggerated concern if the context is not fully appreciated: typically the number of observations (n) is fixed and model complexity is increased by adding polynomial terms (p). However, psychometric scales are commonly used in network analysis and they have a specific number of items. This suggests that it is more reasonable to fix p and then increase n to representative sample sizes. Overfitting from these distinct scenarios is juxtaposed in Figure 2 (panels A and B).

Out-of-Sample Prediction

First denote the training data Y^k and the test data Y^{−k}, respectively. Both are distinct subsets of Y. The training data is used to fit the initial regression model, say, predicting node j,

Y^{k}_{j} = Y^{k}_{-j}\beta^{k}_{j} + \varepsilon^{k}, \qquad \varepsilon^{k} \sim N(0, \sigma^2_j I_n), \qquad (4)

where β^k_j is a (p − 1) × 1 vector of regression coefficients and ε is an n-dimensional vector with mean zero and covariance matrix σ²_j I_n. Note that σ²_j is the error variance for the jth regression model (k is removed to simplify the notation), that is, Θ^k_{jj} = 1/σ²_j in a GGM. This relation is critical for understanding overfitting in partial correlation networks, as it defines the absolute lower bound of MSE for the jth regression model—the irreducible error. With the predicted values for the test data defined as

\hat{Y}^{-k}_{j} = Y^{-k}_{-j}\beta^{k}_{j}, \qquad (5)

the test error will typically be larger than σ²_j. Although σ²_j is unknown, by drawing many random subsets and taking the average, the lower bound for MSE approximately corresponds to the residual variance from regressing Y_j onto Y_{−j}. That is, nodewise predictability has an upper bound for accuracy that is defined by the diagonal elements of Θ. This allows for monitoring the reducible portion of MSE, that is indicative of overfitting, as n approaches sample sizes representative of the network literature (Tables 1 and 2 in Wysocki & Rhemtulla, 2019).

Theil MSE Decomposition

We follow the econometric literature and now define the out-of-sample predictions, Ŷ^{−k}_j, as P_j and the actual values for the test data, Y^{−k}_j, as A_j. MSE is the average squared distance, that is,

\text{MSE} = E\left[(A_j - P_j)^2\right] = \underbrace{\left(E[A_j] - E[P_j]\right)^2}_{\text{bias}} + \underbrace{\operatorname{Var}(A_j) + \operatorname{Var}(P_j) - 2\operatorname{Cov}(A_j, P_j)}_{\text{variance}}. \qquad (6)

Note that the variance term is specifically for the case of dependent variables, which accommodates the correlation between A_j and P_j. The expression in (6) is commonly used outside of psychology (Murphy, 1988; Solazzo, Hogrefe, Colette, Garcia-Vivanco, & Galmarini, 2017).
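The decomposition in (6) is easy to verify numerically. The sketch below uses arbitrary simulated "actual" and "predicted" values and population (divide-by-n) moments, so the two sides of the identity agree exactly; the numbers themselves carry no substantive meaning.

    # Numerical check of Equation (6): MSE equals squared bias plus a variance
    # term that allows the actual and predicted values to covary.
    set.seed(2)
    n <- 1e4
    A <- rnorm(n)                        # "actual" test values (hypothetical)
    P <- 0.6 * A + rnorm(n, sd = 0.8)    # "predicted" values, correlated with A

    pvar <- function(x)    mean((x - mean(x))^2)               # divide-by-n variance
    pcov <- function(x, y) mean((x - mean(x)) * (y - mean(y)))

    mse      <- mean((A - P)^2)
    bias_sq  <- (mean(A) - mean(P))^2
    variance <- pvar(A) + pvar(P) - 2 * pcov(A, P)

    c(direct = mse, decomposed = bias_sq + variance)           # identical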

Building upon this formulation, the lower bound for prediction error can be seen in the Theil decomposition for the sample-based MSE (Bento, Salvação, & Guedes Soares, 2018; Fitzgerald & Akintoye, 1995; Goodwin, 1997; Theil, 1961). This can be written as

\text{MSE}(A_j, P_j) = \underbrace{(\bar{P}_j - \bar{A}_j)^2}_{\text{bias}} + \underbrace{(S_{P_j} - r S_{A_j})^2}_{\text{var}} + \underbrace{(1 - r^2)\, S^2_{A_j}}_{\text{m.MSE}}. \qquad (7)

Here the first term is the bias, and is simply defined as the squared difference between the averages for P_j and A_j. This is much different than MSE for an estimator, as, in this case, the question of interest is the predicted and actual means for the test data. S_{P_j} and S_{A_j} denote the standard deviations for the predicted and actual values, respectively. The second term is the variance (Solazzo & Galmarini, 2016). Note that this formulation arises from regressing A_j onto P_j (pp. 1-11, Miner & Mincer, 1969), where r² is the coefficient of determination (R²) and the third term, "minimum achievable MSE," is the residual error. The advantage of (7) is that the first two terms will vanish with unbiased and efficient predictions (p. 65, Clements & Hendry, 1998). Hence, the m.MSE component corresponds to the irreducible error (the residual variance in Equation 2). It is then useful to define U statistics in order to examine the contribution of each component to MSE (p. 139, Teelucksingh, 2008). They are defined as

1 = U_b + U_v + U_m = \frac{(\bar{P}_j - \bar{A}_j)^2}{\text{MSE}} + \frac{(S_{P_j} - r S_{A_j})^2}{\text{MSE}} + \frac{(1 - r^2)\, S^2_{A_j}}{\text{MSE}}. \qquad (8)

Because they sum to one, each reflects a proportion of MSE. Furthermore, when U_b and U_v approach zero, U_m goes to unity such that predictive accuracy has reached its ceiling (i.e., MSE has been minimized, Clements & Hendry, 1998). These properties are ideal for delving into the bias-variance tradeoff and overfitting. For example, U_m can be used to infer how close our model is to minimizing MSE, and thus, whether concerns of overfitting have been sufficiently quelled.

Motivating Example

In this section, we present an example of the bias-variance tradeoff. Recall that the bias-variance tradeoff is often used to discuss the benefits of regularization, and one contribution of this work is to provide an alternative perspective. Typically, demonstrative examples fix the number of observations and then increase the number of predictors in a regression model. We instead fix the number of variables (p) and increase the number of observations (n). This reflects psychological networks in particular, because nodes are typically items from psychometric scales that have a fixed number and n is almost always much larger than p. This allows for monitoring U_m (minimum MSE) as n gradually increases in size.

We also take this opportunity to delve into additional aspects of the bias-variance tradeoff and overfitting. For example, Gronau and Wagenmakers (2018) discussed "overfitting" in the context of a model with two parameters that, in a predictive context, would be very unlikely to occur. By overfitting they meant the selected model included false positives. But as we show below, including too many variables in a model is not necessarily related to prediction error. Together, in addition to evaluating MSE in dimensions (n × p) representative of the network literature, we also tackle the relation between a model including too many variables and overfitting.

Because of the direct correspondence between multiple regression and the precision matrix (Equation 2), we followed the customary approach of using non-orthogonal polynomial terms to discuss the bias-variance tradeoff. This translates into a somewhat unrealistic design matrix that exhibits extreme multicollinearity between predictors (e.g., nearly infinite variance inflation factors). The true model is presented in Figure 1. The underlying data generating process is a linear relationship with β₀ = −5 and β₁ = 2. In these experiments, the training data (N_train) ranged from 11 to 250 (increments of 10), the test data size was fixed to 50 (N_test), the SNR varied (0.1 and 0.5)¹, and the predictors were scaled to have mean zero and unit variance. All results were averaged across 1,000 trials for each condition.

¹ R² ≈ {0.09, 0.33}.
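A compact version of the motivating example can be written in a few lines of R. The sketch below follows the description above (a linear data-generating process with β₀ = −5 and β₁ = 2, a 10th-degree polynomial fit, and a fixed test set) and computes the Theil proportions of Equation (8) for the out-of-sample predictions. The particular training size and residual standard deviation are illustrative choices, not the exact grid used in our simulations.

    # Motivating example: the true model is linear, the fitted model is a
    # 10th-degree polynomial; test error is compared with the irreducible error
    # and decomposed into the U statistics of Equation (8).
    set.seed(3)
    n_train <- 250; n_test <- 50; sigma <- 2       # illustrative values

    sim_data <- function(n) {
      x <- rnorm(n)
      data.frame(x = x, y = -5 + 2 * x + rnorm(n, sd = sigma))
    }
    train <- sim_data(n_train)
    test  <- sim_data(n_test)

    # Non-orthogonal (raw) polynomial terms, as described in the text
    fit <- lm(y ~ poly(x, degree = 10, raw = TRUE), data = train)

    A <- test$y                                    # actual test values
    P <- predict(fit, newdata = test)              # out-of-sample predictions

    psd <- function(x) sqrt(mean((x - mean(x))^2)) # divide-by-n standard deviation
    mse <- mean((A - P)^2)
    r   <- cor(A, P)
    S_A <- psd(A); S_P <- psd(P)

    U_b <- (mean(P) - mean(A))^2 / mse             # bias proportion
    U_v <- (S_P - r * S_A)^2 / mse                 # variance proportion
    U_m <- (1 - r^2) * S_A^2 / mse                 # minimum achievable MSE proportion

    round(c(test_MSE = mse, irreducible = sigma^2,
            U_b = U_b, U_v = U_v, U_m = U_m), 3)

With a training size of this order, U_m is typically close to one: nearly all of the reducible error has been removed even though nine irrelevant, highly collinear terms were estimated.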

Figure 2. Results from the motivating example. Panel A includes a customary example of overfitting: the number of observations is fixed and model complexity increases. The y-axis denotes the ratio of mean squared error (MSE) to irreducible error (i.e., σ² in Equation 2). This is presented on the logarithmic scale, such that MSE is minimized at zero. As more terms are added to the model, training error becomes smaller than the irreducible error and test error dramatically increases. This is a classic case of overfitting. Panel B provides an alternative perspective: the number of variables is fixed (10th-degree polynomial) and the number of observations increases. Here, even when not fitting the true model and dealing with extreme multicollinearity, test error is nearly minimized with 250 observations. Panel C decomposes the test error in Panel B with Equation (7). Note that U_m (minimum MSE) is equal to one when the test error is minimized. This reveals that variance (U_v) contributes most to test error when the p/n ratio is large, but this quickly dissipates and U_m rapidly approaches one. Panel D includes U_m for the 10th-degree polynomial compared to the true model (Figure 1). This reveals that the former provides comparable predictions to the latter (SNR: signal-to-noise ratio). Together, overfitting is not omnipresent when the sample size approaches those representative of the network literature. This does not require fitting the true or even a realistic model.

Results. Figure 2 (panel A) shows the "dangers" of overfitting when N_train is fixed to 50 and model complexity increases by including additional polynomial terms (p/n → 1). As a point of reference, the results are relative to the irreducible error (the residual variance) and on the logarithmic scale. Accordingly, the test (out-of-sample) error is minimized when it is closest to zero. The test error was minimized with the linear model that was used to generate the data (Figure 1). The test error then increased with model complexity, such that each additional predictor deteriorated prediction quality. On the other hand, the training (in-sample) error reduced with increasing model complexity. This is a classic case of overfitting: the model has "learned" noise in the training data and this then hinders generalizability to unseen data, say, that will be collected at some point in the future.

Figure 2 (panel B) includes an alternative perspective. In this case, the fitted model is always a 10th-degree polynomial (p is fixed) and N_train gradually increases in size (p/n → 0). Recall that the true model is the linear relation in Figure 1, and as such, the fitted model includes nine irrelevant variables that exhibit problematic multicollinearity. Here a much different pattern emerged than in panel A: both the training and test error converged to the irreducible error. It is striking how rapidly this occurred. Although the test error was (very) large with less than 100 observations, it approached the lower bound (i.e., the irreducible error) with only 150 observations. This is especially relevant for network analysis, where there are typically tens of nodes and hundreds of observations (Wysocki & Rhemtulla, 2019).

We then decomposed the test error in Figure 2 (panel B). The idea here is to gain insight into the bias-variance tradeoff as p/n → 0. These results are presented in panel C. As a point of reference, we subjectively defined 0.90 as an acceptable level of overfitting. That is, at least 90% of the systematic error due to bias and variance has been minimized. This can be inferred from monitoring U_m (minimum MSE), which is unity when the reducible portion of MSE has been completely minimized. In this example, U_m exceeds 0.90 with a sample size of around 100. Furthermore, with relatively few observations, it is clear that variance (U_v) dominates the test error. But this also reduces rapidly as N_train gradually increases in size. It is also informative that the bias portion (U_b) is not a key contributor to MSE. Hence, it does not appear all that difficult to predict the mean of the test data.

Because our fixed-N_train experiments used a model that included too many variables, an interesting next step is then to compare performance to the true model. To this end, we again monitored U_m as the training data increased in size (p/n → 0). These results are presented in Figure 2 (panel D) and they are fascinating to behold: even when we fit the true model, prediction error was nearly identical to a 10th-degree polynomial when N_train exceeded 150. This was somewhat influenced by R², in that, with relatively more signal (0.33 vs. 0.09), the error more quickly approached that of the true model.

Summary

Together, these results demonstrate that overfitting does not seem to be all that bothersome. A similar finding was also observed in Brusco, Voorhees, Calantone, Brady, and Steinley (2019), where, even with a 10th-degree polynomial (p = 20 and n = 900), there was superior classification accuracy compared to more sophisticated approaches (e.g., support vector machines). In our example, merely 250 observations was enough to mitigate extreme multicollinearity, practically minimize the reducible portion of MSE, and provide competitive performance compared to the true model. All the while we did not even select a model, which is often thought necessary to avoid overfitting. It is not. Of course, in network analysis the first objective is to estimate the underlying structure of conditional relations. Furthermore, perhaps overfitting can still be improved by employing ℓ1-regularization. These remaining questions are addressed in the next section.

Significance Testing with Re-Estimation

We propose nonregularized methodology specifically developed to fulfill the primary motivations for adopting ℓ1-regularization. First, our method not only limits spurious associations, but it preserves the relationship between specificity and the false positive rate (Williams & Rast, 2019). Second, overfitting is mitigated by estimating a constrained precision matrix, given the estimated structure of conditional relations. The importance of these methodological desiderata is reflected in the applied network literature—the following brings them to fruition.

The first step follows Drton and Perlman (2004) and Williams and Rast (2019), both of which employed the Fisher-z transformation for the sample partial correlations r_{ij·z}. This transformation can be written as

F(r_{ij\cdot z}) = 0.5 \log\!\left[\frac{1 + r_{ij\cdot z}}{1 - r_{ij\cdot z}}\right]. \qquad (9)

The resulting sampling distribution is (approximately) normal, which is one key advantage of employing (9). With F(r_{ij·z}) in hand, all that is needed to test

H_0\!: \rho_{ij\cdot z} = 0 \quad \text{versus} \quad H_1\!: \rho_{ij\cdot z} \neq 0 \qquad (10)

is the sample size (n) and the number of nodes (p). With the number of nodes conditioned on denoted c = p − 2 (i.e., those included in z), the standard error of F(r_{ij·z}) is defined as 1/√(n − 3 − c). The edge set, E, is then constructed by rejecting H_0 at a given α level. That is, for nodes i and j, (i, j) ∈ E if the test statistic, (n − c − 3)^{1/2} |F(r_{ij·z})|, is larger than Φ⁻¹(1 − α/2). Note that Φ denotes the cumulative distribution function of a standard normal distribution.

This straightforward approach has the advantage of being calibrated to the desired level of specificity (1 − α, Williams & Rast, 2019). Although this can be used to limit spurious associations, thus fulfilling the first desideratum, there are issues for computing predictability. There are two options for obtaining nodewise predictions: (1) use the full model; or (2) use only the selected edges. Both would be employed after transforming the partial correlations to the corresponding regression coefficients with (3). However, the former does not reflect the estimated conditional dependence structure and the latter still uses estimates from the full model. Both options are less than ideal for computing predictability in GGMs.
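The selection step in (9)-(11) is straightforward to express in R. The sketch below computes the sample partial correlations from the inverse of the sample covariance matrix, applies the Fisher-z test to each, and returns the adjacency matrix A. The function name and example data are only illustrative and are not taken from our implementation.

    # Equations (9)-(11): Fisher-z test for each sample partial correlation,
    # returning the adjacency matrix of edges that survive the test.
    select_graph <- function(Y, alpha = 0.10) {
      n <- nrow(Y); p <- ncol(Y)
      Theta_hat <- solve(cov(Y))                 # unregularized precision matrix
      pcors <- -cov2cor(Theta_hat)               # sample partial correlations, Equation (1)
      diag(pcors) <- 0
      z <- 0.5 * log((1 + pcors) / (1 - pcors))  # Fisher-z transform, Equation (9)
      c_nodes <- p - 2                           # number of nodes conditioned on
      test_stat <- sqrt(n - 3 - c_nodes) * abs(z)
      A <- ifelse(test_stat > qnorm(1 - alpha / 2), 1, 0)
      diag(A) <- 0
      A
    }

    # Example with a hypothetical data matrix
    set.seed(4)
    Y <- MASS::mvrnorm(n = 500, mu = rep(0, 10), Sigma = diag(10) * 0.8 + 0.2)
    A <- select_graph(Y, alpha = 0.10)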

To overcome these issues, we draw upon the wealth of methods from the GGM literature. These approaches are typically developed for high-dimensional data (n < p). In this case, however, we borrow certain ingredients to improve the significance testing based recipe described in Williams and Rast (2019). We compute the maximum likelihood estimate for Θ, given the edge set, E, that was determined by rejecting H_0 in (10). Let A denote the corresponding p × p adjacency matrix, that is,

A_{ij} = \begin{cases} 1 & \text{if } \rho_{ij\cdot z} \neq 0 \\ 0 & \text{otherwise,} \end{cases} \qquad (11)

where diag(A) = 0. The position of 1's in a given row corresponds to the neighborhood, ne_j, of conditional relations for node j (i.e., the partial correlations determined to be statistically significant). We then follow the approach described in Hastie, Tibshirani, and Friedman (p. 634, 2008). This allows for estimating Θ with a known structure of conditional relations. We refer to this approach as the "HTF" algorithm. Our innovation is to determine A, which stands in for the known structure, with the hypothesis testing strategy in (10). To our knowledge, merging the HTF algorithm with hypothesis testing is a novel contribution to the GGM literature. We refer interested readers to Chaudhuri, Drton, and Richardson (2007), where a similar approach was used to estimate a covariance matrix.

The basic idea is to maximize the likelihood, that is,

\ell(\Theta) = \log(\lvert\Theta\rvert) - \operatorname{trace}(S\Theta), \qquad (12)

but with respect to constraints, or a pattern of zeros, that are defined by the adjacency matrix in (11). Note that |Θ| denotes the determinant and S is the sample covariance matrix (n − 1)⁻¹Y′Y. Maximizing (12) is accomplished by further exploiting the relationship between Θ and multiple regression (Equation 2).² In this case, instead of working with the data matrix, Y, p subset regression models are fitted directly from the elements of S. The subset of predictors (i.e., the neighbors of node Y_j) is defined by ne_j in (11). This gives "the correct solution and solves the constrained maximum-likelihood problem exactly" (p. 632, Hastie et al., 2008). Further details can be found in Emmert-Streib, Tripathi, and Dehmer (2019). Our particular implementation is summarized in the Appendix (Algorithm 1).

² We explored the possibility of directly maximizing (12) with convex optimization, but it was much slower than the HTF algorithm.

There are some important aspects of our approach that are worth emphasizing. First, we are not determining conditional relations via automated selection of p multiple regression models. Rather, we are using significance testing directly on the partial correlations. This is not only computationally efficient, but it also avoids the known problems of making inference after performing model selection (e.g., Hurvich & Tsai, 1990; Leeb & Pötscher, 2005). Second, if predictability is of interest, estimating Θ with some elements constrained to zero provides regression coefficients that reflect the conditional dependence structure. Merging this into a hypothesis testing framework should allow for enjoying the predictive properties of nonregularized estimation (e.g., Figure 1 in Williams et al., 2019), all the while having a defined error rate. This fulfills both desiderata. Third, there is one caveat of estimating a precision matrix with a given conditional dependence structure: it can be used for prediction but it would be incorrect to make inference (e.g., computing p-values) from the estimates. This would be akin to computing, say, confidence intervals (CI) for a subset of variables that were selected in multiple regression. This is known to result in bias, given that both p-values and CIs assume the model was fixed in advance (it was not selected from the data). For an overview of this topic, we refer interested readers to Taylor and Tibshirani (2015) and Berk, Brown, and Zhao (2010).

Proposed Method

To summarize, our method is implemented with the following steps:

1. Compute the Fisher-z transformations of the p(p − 1)/2 partial correlation coefficients and test their significance at a desired α level.

2. Construct a symmetric p × p adjacency matrix, A, with a_{ij} = a_{ji} = 1 if the partial correlation between nodes i and j was statistically significant and a_{ij} = a_{ji} = 0 otherwise.

3. Apply the HTF algorithm in Appendix 1 to estimate Θ, given A and the sample covariance matrix S (a sketch of this step is given below).

Simulation Studies

This section includes three simulation studies. In each, real datasets are used from the network literature. The key difference between the experiments is the measure for test error. The first two assess prediction error from multiple regression (large holdout set and leave-one-out cross-validation), as when computing predictability, and the third investigates test error for the entire precision matrix. Recall that our method (herein referred to as Nonreg_NHST) provides an estimate of Θ, given the estimated graph (Equation 11), and hence experiment three provides key insights into the accuracy of our proposal.

For comparison, we included the default methodology in psychology that employs ℓ1-regularization in combination with the extended Bayesian information criterion (EBIC) for selecting the tuning parameter (herein referred to as Glasso_EBIC). As a point of reference, we also included the sample covariance matrix (a full model). For our method, we set α = 0.10, which corresponds to 90% specificity (see Figure 1 in Williams & Rast, 2019). Although this choice is arbitrary, it parallels our subjectively defined level of acceptable overfitting (i.e., 90% of the irreducible error). For Glasso_EBIC, we used the default settings in the R package qgraph (Epskamp, Cramer, Waldorp, Schmittmann, & Borsboom, 2012).

Figure 3. Simulation results for Experiment 1. In panel A, test error was averaged across each node in the respective networks. Each line denotes a network estimation method, the y-axis is MSE, and the ribbons denote ±1 SD. Across almost all training sizes there were not any notable differences between methods in predictive accuracy, the one exception being low sample sizes, wherein both nonregularized methods outperformed Glasso_EBIC (see text for explanation). In panel B, each line corresponds to the test error for each node; these were averaged across in panel A. The dotted line corresponds to a 10% increase from the irreducible error. The solid line is the best possible predictive performance. The methods again provide similar predictive performance and the vast majority of test error is minimized. Together, there is not a readily discernible advantage of Glasso_EBIC, and even the full model appears to mitigate overfitting in sample sizes representative of the network literature (Table 1 in Wysocki & Rhemtulla, 2019). BFI (p = 25), IRI (p = 28), and TAS (p = 20).

Illustrative Data

Dataset 1. This dataset comes from the R package psych (Revelle, 2019). There are 25 self-reported items that measure personality in 2,236 subjects. The items were taken from the International Personality Item Pool. There are five factors: agreeableness, conscientiousness, extraversion, neuroticism, and openness. The majority of subjects were between 20 and 60 years old (M = 29.5 years, SD = 10.6 years). 32% were males and 68% were females. The items are scored from 1 to 6, where "1" corresponds to "very inaccurate" and "6" corresponds to "very accurate."

Dataset 2. This dataset comes from a network study using the Interpersonal Reactivity Index (IRI) to measure empathy in 1,973 subjects (Briganti, Kempenaers, Braun, Fried, & Linkowski, 2018). Further details about the IRI can be found in Davis (1983). The subjects were between 17 and 25 years old (M = 19.6 years, SD = 1.6 years). 57% were females and 43% were males. This scale includes 28 items meant to assess four components of empathy: fantasy, perspective taking, empathic concern, and personal distress. The items are scored from 0 to 4, where "0" corresponds to "Doesn't describe me very well" and "4" corresponds to "Describes me very well."

Dataset 3. This dataset comes from a network study using the Toronto Alexithymia Scale (TAS) as assessment tool (Briganti & Linkowski, 2019). Further details about the TAS can be found in Bagby, Parker, and Taylor (1994). There are 1,925 subjects that range between 17 and 25 years old (M = 19 years, SD = 1.6 years). 58% of them were females and 42% were males. This scale includes 20 items that measure alexithymia in three domains: difficulty identifying feelings, difficulty describing feelings, and externally-oriented thinking. The items are scored from 1 to 5, where "1" corresponds to "I completely disagree" and "5" corresponds to "I completely agree."

Experiment 1

In this experiment, we followed the approach described in Leppä-aho, Pensar, Roos, and Corander (2017). We first randomly sampled non-overlapping training and test sets from the full datasets, and then estimated the network structure from the training data. After transforming the selected precision matrix into the corresponding regression coefficients (Equation 2), the test error (MSE) was computed for each node in the respective networks. The test data size was fixed to 50 (48 was used in Leppä-aho et al., 2017) and the training data ranged in size from 100 to 1000 (in increments of 100). This corresponds to maximum p/n ratios of 0.28 (IRI), 0.25 (BFI), and 0.20 (TAS). These ratios gradually diminish with larger training sets (i.e., p/n → 0). Note that we did want to include larger ratios, but with smaller n, Glasso_EBIC would select only empty networks (or intercept-only models). The results were averaged across 1,000 simulation trials.

Results. We first describe the results in Figure 3 (panel A), where the test error was averaged across nodes. The methods provided very similar performance (often within three decimal places). This can be inferred from the overlapping lines and large uncertainty (the ribbons correspond to ±1 SD). The one distinction between models emerged with the largest p/n ratios. This was paradoxical, however, in that both the full model and Nonreg_NHST provided more accurate predictions than Glasso_EBIC. This can be understood in reference to the motivating example, where, with no polynomial terms (i.e., an intercept-only model), the test error was large due to underfitting (Figure 1). In this case, the networks selected with Glasso_EBIC included too few edges. We return to the tradeoff between sparsity (or parsimony) and predictive accuracy below (see the Illustrative Example section).

These results also highlight a key point we emphasized above: the distinction between the dimensions of Y (n × p) and the number of partial correlations. Recall that the latter is often emphasized when motivating ℓ1-regularization. In the IRI data, although there are 378 partial correlations, the test error for Nonreg_NHST reduced from 0.789 (N_train = 400) to 0.754 (N_train = 1000). Of course, a difference of 0.035 could be quite important, depending on the application, but it highlights that the most problematic situations are large p/n ratios. This can be seen in the motivating example (Figure 2, panel C), where the variance proportion (U_v) of test error was largest with p/n ≈ 0.91. The test error was not much different for Glasso_EBIC (0.788 and 0.748). In fact, with N_train = 400 and 378 partial correlations (p = 28), the difference between the full model and Glasso_EBIC was merely 0.004!

Interestingly, as the training data increased in size (p/n → 0) the nonregularized methods did not have superior performance. Of course, our central message is that overfitting might not actually be all that problematic in psychological networks. In these results, it appears that Glasso_EBIC does not mitigate overfitting compared to nonregularized estimation. And note that, when n increases, Glasso_EBIC is known to have an inflated false positive rate (Williams & Rast, 2019; Williams et al., 2019).

We now describe the results in Figure 3 (panel B), where each line corresponds to the average test error for each node. Because we randomly sampled test and training sets, the irreducible error should be close to the residual variance from the full dataset. This was described above. The results are presented relative to the irreducible error. As a point of reference, the dotted line at 1.1 indicates the MSE is 10% larger than the irreducible error. These results reveal that, even with N_train = 400, most nodes are below 1.1. In other words, the vast majority of reducible error has been minimized. This pattern held for each model and dataset. When comparing models, the results were very similar.³ The most notable differences were again with the smallest sample sizes, where nonregularized approaches often had superior performance. On the other hand, with the larger samples, the performance of Glasso_EBIC improved to be comparable and sometimes better than the nonregularized methods.

³ We have provided all the simulation results online for researchers to explore and possibly find notable patterns that we overlooked.

Perhaps most striking was the performance of the full model. Recall that this corresponds to a fully connected network. The node-specific estimates were far less variable than Nonreg_NHST and Glasso_EBIC. This was likely due to model selection inducing variability in the estimates. Furthermore, at the level of the individual nodes, the full model regularly outperformed Glasso_EBIC (although, again, these differences were very minor). Together, overfitting hardly seems to be a "severe challenge" that warrants regularized estimation.

Experiment 2

In this experiment, we investigated test error with leave-one-out cross-validation (LOOCV). Rather than take a random set, as in Experiment 1, MSE was computed from all observations in the dataset. One advantage of this approach is that it more accurately reflects how predictability would be assessed in applied settings. Because we are still interested in the p/n ratio, we took subsets that ranged in size from 100 to 1000 (in increments of 100). For each subset, the network was first selected with row k removed and then the observations from row k were predicted. This was done after transforming the elements of the selected precision matrix to the corresponding regression coefficients (Equation 2). This was done for each row in the subset of data (k = 1, ..., N_subset), after which the test error was computed. Note that we sequentially moved through the rows, such that, with N_subset = 100 and N_subset = 200, the error was computed for rows 1 through 100 and 200, respectively. This was done to ensure the results were not susceptible to a single, random partition of the data.

Results. We first discuss Figure 4 (panel A). The test error was averaged across nodes. The results were similar to Experiment 1, in that the methods ultimately converged to be similar to one another. However, there were some differences with the largest p/n ratios. These again favored Nonreg_NHST. This was most apparent for the IRI dataset (N_subset = 100 to 400), where the nonregularized predictions were more accurate than Glasso_EBIC. Interestingly, the full model provided competitive performance for the vast majority of conditions. This was especially the case for sample sizes larger than 300. We emphasize that this lends further credence to the notion that reducing the number of parameters is not necessary to mitigate overfitting.

Panel B includes a different perspective on the same results. Rather than averaging across nodes, or noting what were typically small differences in test error, we computed the proportion of nodes in which the test error was minimized by each method. The basic idea here is to gain insight into what might be observed if using both methods in applied settings. The top row includes our proposed method. Again, there was not a uniformly superior method. However, Nonreg_NHST minimized the error in more than half of the nodes in a given network 18 times, whereas, for Glasso_EBIC, this occurred only 9 times in total. The largest disparity was for the BFI and IRI data. Our nonregularized method was favored in 70% (14/20) of the conditions (as indicated by minimizing the error in more than half of the nodes) and Glasso_EBIC was favored in 25% (5/20). Note that there was one tie, where each method minimized the test error for half of the nodes (IRI dataset with N_subset = 100).

The bottom row includes the results for the full model. Glasso_EBIC had notably better performance. For example, in 18 compared to 11 conditions the regularized method minimized the test error in more than half of the nodes. It is important to note that regularization is used for network analysis because overfitting is thought to be a severe challenge. However, these results further reveal that, even with models including hundreds of parameters, the advantage for Glasso_EBIC is not overwhelming. Indeed, if both the full model and Glasso_EBIC were computed side-by-side for a given dataset, some nodes would be better predicted by the full model—and in some cases the majority of nodes in a network! This is striking, because it is just these situations where Glasso_EBIC is thought to be advantageous.

Experiment 3

In this experiment, we investigated cross-validation error for the entire precision matrix. This can be understood as a global measure of predictive accuracy. This was accomplished by computing the log-likelihood for the selected precision matrix estimated from the training data and the covariance matrix estimated from the test data (Equation 12). Note that the likelihood is also known as the log predictive density (section 2, Gelman, Hwang, & Vehtari, 2014) and is commonly used to select the tuning parameter in glasso (Bien & Tibshirani, 2011; Gao, Pu, Wu, & Xu, 2009). In this case, we are using it to compare methods in terms of predictive accuracy. Furthermore, this experiment also provides key insights into the accuracy of using classical significance testing to determine the missing edges, that is, those set to zero in (11). For each of 1,000 simulation trials, we randomly sampled non-overlapping training and test sets from the full dataset. The test data size was fixed to 50 and the training data ranged in size from 100 to 1000 (increments of 100).
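For concreteness, the two test-error measures used across the experiments can be written as small helper functions: nodewise out-of-sample MSE via the regression weights in Equation (2) (Experiments 1 and 2), and the log-likelihood score of Equation (12) evaluated on the test covariance matrix (Experiment 3). The function names below are illustrative; the exact implementation is available in the online materials (https://osf.io/fm92b/).

    # Nodewise out-of-sample MSE (Experiments 1-2): convert an estimated
    # precision matrix into regression weights and predict each node in turn.
    # Columns of Y_test are assumed centered, as in the text.
    nodewise_mse <- function(Theta, Y_test) {
      p <- ncol(Y_test)
      sapply(seq_len(p), function(j) {
        beta_j <- -Theta[-j, j] / Theta[j, j]            # Equation (2)
        pred   <- Y_test[, -j, drop = FALSE] %*% beta_j
        mean((Y_test[, j] - pred)^2)
      })
    }

    # Global predictive score (Experiment 3): Equation (12) with the covariance
    # matrix of the test data in place of S.
    cv_loglik <- function(Theta, Y_test) {
      S_test <- cov(Y_test)
      log(det(Theta)) - sum(diag(S_test %*% Theta))
    }

    # Usage with a hypothetical train/test split and the estimators compared here:
    # Theta_full <- solve(cov(Y_train))                  # full (saturated) model
    # Theta_nhst <- constrained_mle(cov(Y_train), select_graph(Y_train))
    # rbind(full = nodewise_mse(Theta_full, Y_test),
    #       nhst = nodewise_mse(Theta_nhst, Y_test))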

Figure 4. Simulation results for Experiment 2. Panel A includes leave-one-out cross-validation (LOOCV) error averaged across nodes in the respective networks. Each line denotes a network estimation method, the y-axis is LOOCV MSE, and the ribbons denote ±1 SD. This reveals an advantage for the nonregularized methods with small training data sizes (see text for explanation). The methods quickly converged to have similar predictive accuracy. Panel B provides a different perspective on the same results. For each training size, we computed the proportion of nodes in which each method had the lowest test error. The grey region corresponds to ≥ 50% of the nodes. In the top row, Nonreg_NHST and Glasso_EBIC minimized MSE in more than half of the nodes 18 and 9 times, respectively. In the bottom row, Glasso_EBIC and the full model minimized MSE in more than half of the nodes 18 and 11 times, respectively. This indicates an advantage of Glasso_EBIC compared to the full model, but not when compared to our method (Nonreg_NHST). BFI (p = 25), IRI (p = 28), and TAS (p = 20).

Figure 5. Simulation results for Experiment 3. The y-axis is the cross-validated log-likelihood (Equation 12). This provides a measure of global predictive accuracy for the network. Larger values indicate superior performance (i.e., maximizing the likelihood). The nonregularized methods often provided more accurate predictions than Glasso_EBIC. This demonstrates: (1) the accuracy of our proposed method for estimating a constrained precision matrix (Algorithm 1); and (2) regularization is not needed to mitigate overfitting (e.g., even the full model provided more than competitive performance). BFI (p = 25), IRI (p = 28), and TAS (p = 20).

Results. Figure 5 includes these results. Higher scores indicate superior predictive performance (i.e., maximizing the likelihood). These results reveal that using nonregularized estimation in combination with null hypothesis significance testing provides an accurate estimate of the precision matrix with zeros. In fact, Nonreg_NHST often had superior predictive performance compared to Glasso_EBIC. Furthermore, it is also apparent that even the full model provided excellent performance compared to Glasso_EBIC. There was also large uncertainty for each method, but it is still informative that the averages were better for the nonregularized methods. The benefit of constraining the precision matrix is also readily apparent. For example, while the differences were small, our proposed method was more accurate than the full model with the largest p/n ratios. All three methods ended up converging to be quite similar as p/n → 0. Together, when considering the entire GGM, the nonregularized method provided more than competitive predictions.

Summary

Together, these results do not provide support for the thought that regularization is needed to mitigate overfitting in psychological networks. Recall that one motivation for adopting ℓ1-regularization was the worry of overfitting (due to estimating a large number of partial correlations). However, these results show that, even with more partial correlations than observations, even the full model provides more than competitive performance. And, in turn, our proposed method provides both an estimate of the conditional dependence structure and accurate predictions. These findings can be understood in relation to the correspondence between multiple regression and the precision matrix (Equation 2). In the IRI dataset, there are 28 items resulting in 378 partial correlations. This might seem to be an especially difficult situation with, say, a sample size of 400. However, with p defined in reference to the dimensions of Y (i.e., the number of nodes), the p/n ratio is merely 0.07. This provides a fresh perspective into overfitting, or lack thereof, in psychological networks.

Illustrative Example

In this section, we discuss strategies for using our proposed method in applied settings. The most pressing question relates to correcting for multiple comparisons, as determining the edge set E requires potentially hundreds of tests. Two approaches to address this issue would be to employ, say, a Bonferroni-corrected alpha level or simultaneous confidence intervals (e.g., Drton & Perlman, 2004). This would reduce spurious findings and provide a sparse network. However, the motivating example and simulation results point to a tradeoff between reducing spurious associations and predictive accuracy. For example, a stringent alpha level will necessarily reduce statistical power and potentially deteriorate predictive accuracy by detecting too few edges (i.e., underfitting). While using a data-driven approach for choosing α would violate the foundations of frequentist inference, it is still informative to assess the desired level of specificity (1 − α, Williams & Rast, 2019) in relation to prediction error. This can be used to guide the choice of a particular multiple testing strategy.

We simulated 1,000 trials with non-overlapping, randomly sampled training and test data. We used training sets of size N_train = 250, 500, and 1000. The test data size was fixed at N_test = 50. For each training size, the network structure was estimated using alpha levels of α = 0.05/t, 0.01, 0.05, 0.1, and 0.2, where t is the respective number of tests conducted for each dataset (i.e., the number of partial correlations tested) and 0.05/t is a Bonferroni-corrected alpha level.
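The tradeoff examined here amounts to sweeping the α level in the selection step and scoring each resulting network against the full model on held-out data. A minimal sketch, reusing the illustrative helper functions introduced above (select_graph, constrained_mle, nodewise_mse) with an arbitrary train/test split, is given below; the α grid matches the one just described.

    # Relative test error (MSE_alpha / MSE_full) across alpha levels.
    alpha_sweep <- function(Y_train, Y_test) {
      p      <- ncol(Y_train)
      t_     <- p * (p - 1) / 2                      # number of partial correlations tested
      alphas <- c(0.05 / t_, 0.01, 0.05, 0.10, 0.20) # Bonferroni-corrected level first
      S_train    <- cov(Y_train)
      Theta_full <- solve(S_train)
      mse_full   <- mean(nodewise_mse(Theta_full, Y_test))
      relative   <- sapply(alphas, function(a) {
        A     <- select_graph(Y_train, alpha = a)
        Theta <- constrained_mle(S_train, A)
        mean(nodewise_mse(Theta, Y_test)) / mse_full
      })
      names(relative) <- c("0.05/t", "0.01", "0.05", "0.10", "0.20")
      relative
    }

Ratios above one indicate worse predictions than the full model; averaging these ratios over repeated random splits gives a quantity analogous to the one plotted in Figure 6.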

Figure 6. Simulation results for the Illustrative Example. Each color indicates the sample size used to train the model (Ntrain = 250, 500, or 1000). The x-axis is the α level (0.05/t, 0.01, 0.05, 0.1, and 0.2), and the y-axis is the test error for each α level (MSEα) relative to that of the full model (MSEFull). Values less than one indicate superior predictive performance compared to the full model (i.e., the grey region). This reveals: (1) predictive accuracy was best at higher alphas across all conditions, that is, a network that included more spurious associations minimized test error; and (2) the full model routinely provided the best performance and this was always the case for alpha levels less than 0.05. This points towards a methodological quagmire: rigorous inferential standards can deteriorate predictive accuracy (due to reducing power and thus underfitting). BFI (p = 25), IRI (p = 28), and TAS (p = 20).

Results

Figure 6 displays the results, where the test error (MSE) was computed relative to the full model. Hence, values greater than one indicate superior predictive performance of the full model. A clear tradeoff emerged between alpha level and predictive accuracy. Across all conditions, predictive accuracy improved as alpha increased. However, this increase in prediction accuracy came at the cost of an inflated false positive rate; that is, correcting for multiple tests, or even just setting traditional alpha levels (i.e., 0.05), did not lead to the best predictions. Instead it was larger alpha levels that routinely minimized the test error. The tradeoff between a desired error rate and better predictive power is most evident in lower sample sizes. For example, the Bonferroni correction is overly stringent at Ntrain = 250. This is evidenced by it having the worst predictive accuracy among all alpha levels. On the other hand, predictive accuracy was maximized with more liberal alpha levels than traditionally used (e.g., 0.20 vs. 0.05). This points towards a predicament: rigorous inferential standards appear to come at the cost of inferior predictive accuracy.
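For illustration, the quantity being compared can be sketched in a few lines of R. This is a schematic sketch and not the authors' code: node-wise test error is computed from the regression weights implied by a precision matrix (cf. Equation 2), and Theta_alpha and Theta_full denote hypothetical estimates from an α-thresholded network and the full model. The sketch assumes the columns of the test data have been centered.

    network_test_mse <- function(Theta, Y_test) {
      p <- ncol(Y_test)
      mse_j <- numeric(p)
      for (j in 1:p) {
        beta_j   <- -Theta[j, -j] / Theta[j, j]        # regression weights implied by Theta
        y_hat    <- as.matrix(Y_test[, -j]) %*% beta_j
        mse_j[j] <- mean((Y_test[, j] - y_hat)^2)      # test error for node j
      }
      mean(mse_j)                                      # average across nodes
    }
    # relative error, as plotted in Figure 6:
    # network_test_mse(Theta_alpha, Y_test) / network_test_mse(Theta_full, Y_test)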
Discussion

In this work, we delved into the important topic of overfitting and predictability in psychological networks. We set out to investigate whether overfitting is a "serious challenge" that requires `1-regularization. We first presented a motivating example that was based upon a polynomial regression. Here important aspects of the bias-variance tradeoff were highlighted in low-dimensional data that are commonplace in the network literature (p < n). This included a mean squared error (MSE) decomposition, where it was apparent that bias and especially variance are most problematic with p/n ratios that are rarely encountered. Additionally, merely 250 observations was enough to overcome extreme multicollinearity and nearly minimize the reducible portion of MSE.

We then introduced novel methodology fulfilling two methodological desiderata: (1) reducing or controlling spurious associations and (2) mitigating concerns of overfitting by providing accurate predictions. These were the primary motivations for adopting `1-regularization in the network literature. In extensive simulations, we demonstrated that nonregularized estimation provides more than competitive predictive accuracy. This complements recent work showcasing the inflated false positive rate inherent to `1-regularization (Williams & Rast, 2019; Williams et al., 2019). Together, it is nonregularized, as opposed to regularized estimation, that best satisfies these desiderata.

An additional contribution of this work clarified overfitting. We tackled this important topic because the word "overfitting" is used far too loosely in the social-behavioral sciences. For example, while Babyak (2004) argued that "overfitting" can result in failing to reproduce findings, prediction was never actually examined! In Babyak (2004), the term "overfitting" was used to reference a model that includes spurious relations. Recall that overfitting does not actually refer to replicating, say, significant effects or spurious associations, but to predicting unseen data. The residue of this confusion impedes methodological literacy and practice: when focusing on reducing spurious relations, this can actually deteriorate prediction quality in situations common to the network literature (Figure 6). This is well known in, say, the forecasting literature, where selecting the true model (or the most parsimonious) is not an omnipresent goal: "even if there was a true underlying model, selecting that model will not necessarily give the best forecasts" (p. 165, Hyndman & Athanasopoulos, 2018). This is due to the inherent tension between model selection consistency (i.e., selecting the true model with probability 1 as n → ∞) and minimizing the mean squared error, that is, "for any model selection criterion to be consistent, it must behave suboptimally for estimating the regression function" (p. 937, Yang, 2005). The latter matters most for prediction, whereas consistent model selection limits spurious associations.

Furthermore, the worry of overfitting has ramifications that extend beyond prediction to inference. To date, `1-regularization is used for both inference and prediction in the network literature. This is unfortunate. Regularized estimation does not readily allow for inference, and, in fact, obtaining valid p-values and confidence intervals is still an active area of research (see references in Zhang, Ren, & Chen, 2018). Inherent to GlassoEBIC is data-driven model selection, and this presents further challenges for inference:

    We do not discuss statistical inference...(e.g., looking at p-values associated with each predictor). If you do wish to look at the statistical significance of the predictors, beware that any procedure involving selecting predictors first will invalidate the assumptions behind the p-values. The [data-driven] procedures we recommend for selecting predictors are helpful when the model is used for [prediction]; they are not helpful if you wish to study the effect of any predictor (p. 168, Hyndman & Athanasopoulos, 2018).

This induces model selection bias (Danilov & Magnus, 2004; Kabaila, 1995; Leeb & Pöscher, 2006; Leeb, Pötscher, & Ewald, 2015) and requires post-selection corrections (Berk, Brown, Buja, Zhang, & Zhao, 2013; Lee, Sun, Sun, & Taylor, 2016; Taylor & Tibshirani, 2017). Because GlassoEBIC employs a penalty and data-driven model selection, this results in an underappreciated double whammy: (1) penalization compromises the sampling distribution of the partial correlations by introducing bias and reducing variance (section 3.1, Bühlmann et al., 2014) and (2) data-driven model selection violates key tenets of statistical inference (e.g., Hurvich & Tsai, 1990; Leeb & Pötscher, 2005). This is especially relevant for network researchers because prediction is secondary to inference: liberal use of regularization, due to an exaggerated worry of overfitting, unnecessarily compromises statistical inference.

This work also fits within emerging traditions focused on improving methodological practice in psychological science. We offer two further thoughts. First, almost all applied papers justify using lasso by stating it reduces spurious associations and/or mitigates overfitting. The former is incorrect and the latter is vague, with neither providing sufficient justification. Because these statements are widespread and recycled between articles, this is perhaps indicative of cargo cult statistics, that is, "the ritualistic miming of statistics rather than conscientious practice" (p. 40, Stark & Saltelli, 2018). For possible solutions, we refer interested readers to Stark and Saltelli (2018). Second, we worry that there has been an unforeseen side effect of the "credibility revolution" (Vazire, 2018). With renewed attention to methodology, particular statistical approaches, and not their implementation, have been unfairly targeted (e.g., the p-value). This could contribute to eagerly adopting alternative approaches, say, `1-regularization, without fully considering their limitations (i.e., are we willing to forego inference?). But as this work demonstrated, the grass is not always greener on the other side of the methodological fence.
Explanation vs. Prediction?

Our results further suggest that explanation is not the antithesis to prediction. That is, at least in the network literature, there is no apparent need to choose one over the other (Yarkoni & Westfall, 2017). This has far reaching implications. Predictive models are typically selected based on data-driven procedures, which then, as we stated above, present challenges for inference. On the other hand, our simple approach is based on a valid use of significance testing. We did not select a model first and then seek to make inference. Rather, the network is constructed completely with hypothesis tests. Then, given that conditional dependence structure, a constrained precision matrix can be used for computing predictability if desired. For multiple regression, this was shown to provide more than competitive performance, and, when considering the entire precision matrix, the predictions were typically more accurate than GlassoEBIC. This demonstrates that it is not necessary to sacrifice inference to achieve adequate predictive accuracy or to mitigate overfitting.
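To illustrate the order of operations just described, the following base R sketch decides the edge set with classical hypothesis tests on the partial correlations (Fisher z, as in Williams & Rast, 2019). It is a simplified sketch rather than the GGMnonreg implementation, and it assumes Y is a complete n × p data matrix.

    fisher_z_edges <- function(Y, alpha = 0.05) {
      n <- nrow(Y)
      p <- ncol(Y)
      Theta <- solve(cov(Y))                     # unconstrained precision matrix
      pcors <- -cov2cor(Theta)                   # partial correlations (off-diagonals)
      diag(pcors) <- 0
      z <- atanh(pcors) * sqrt(n - p - 1)        # Fisher z; p - 2 variables are controlled for
      A <- (abs(z) > qnorm(1 - alpha / 2)) * 1   # adjacency matrix: 1 = edge retained
      diag(A) <- 0
      A
    }

The resulting adjacency matrix would then be passed, together with the sample covariance matrix, to a constrained estimation routine such as Algorithm 1 in Appendix A; only the hypothesis tests, not the constrained estimates, are used for inference.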
Limitations

There are some notable limitations. First, we only considered `1-regularization for directly estimating the precision matrix. There are alternative forms of regularization that could improve predictive performance (e.g., non-convex penalties, Kim, Lee, & Kwon, 2018). For large p/n ratios (with small n) it could be advantageous to employ `2-regularization (Ha & Sun, 2014; Kuismin, Kemppainen, & Sillanpää, 2017; Van Wieringen & Peeters, 2016). On the one hand, the ridge penalty does not push values all the way to zero and thus overcomes the issue of GlassoEBIC selecting empty networks (see Experiment 1). On the other hand, determining the edge set is a key aspect of GGMs and penalization again compromises inference. For completeness, we have provided an example of `2-regularization in the Appendix (`2-Regularization).

Second, there are also approaches that use multiple regression to estimate the network (e.g., Haslbeck & Waldorp, 2015). We did not consider these, because they do not directly provide an estimate of the precision matrix. But they could reduce test error. For example, one could use automated procedures to select a nonregularized or regularized model based on predictive accuracy, or even use the Akaike information criterion (AIC) in place of EBIC. There is a downside. This would compromise the first desideratum of a low error rate, as AIC, which is asymptotically equivalent to LOOCV (Stone, 1977), has a relatively high false positive rate (Dziak, Coffman, Lanza, Li, & Jermiin, 2019). Furthermore, the quandary of whether it is worth sacrificing inference (due to data-driven model selection) would resurface; there is no free lunch.

Third, we did not consider alternative motivations for using `1-regularization. In fact, in the SEM literature, the low false negative rate has been argued to be one key advantage of lasso (Jacobucci et al., 2019). However, it should be noted that reducing the false negative rate can also be achieved in a frequentist framework by simply increasing the alpha level. This seems to be much more straightforward than adopting lasso (e.g., Figure 6). Furthermore, when the rubber meets the road (i.e., time for inference) a valid p-value will need to be computed. And, in fact, most of the recent methods for lasso (1) avoid data-driven model selection by fixing the tuning parameter (as implemented in Belloni, Chernozhukov, & Wang, 2011; Liu & Wang, 2017; T. Wang et al., 2016) and (2) debias (or "de-sparsify") the estimates such that the model is no longer sparse (Janková & van de Geer, 2017; Javanmard & Montanari, 2013, 2015), and then compute confidence intervals for inference. These methods can offer benefits when, say, p = 100 and n = 200 (see Table 1 in Jankova & van de Geer, 2018). But note this is a rare situation and unnecessary otherwise.
Fourth, we did not consider parameter estimation. This is commonly used to discuss the bias-variance tradeoff (e.g., Figure 1 in Jacobucci et al., 2019). We think this is another source of confusion surrounding overfitting: reducing the variance of predictions and reducing the variance of an estimator are not the same thing. In our motivating example, the parameter estimates themselves were quite inaccurate (huge MSE). On the other hand, test error was nearly minimized and variance (Uv) also approached zero (Figure 2, panels B and C). Furthermore, trading in variance for bias is what comprises the sampling distribution of lasso-based estimates. More recent regularization approaches aim to have (1) unbiased estimates (Fan & Li, 2001; Kim et al., 2018) and (2) the same sampling variance as ordinary least squares estimators (p. 2, Y. Wang & Zhu, 2016). Investigating these approaches for low-dimensional settings is an interesting future direction.

Recommendations

We recommend that researchers use nonregularized methods with classical (or Bayesian, Williams & Mulder, 2019) hypothesis testing. This is implemented in the R packages qgraph (Epskamp et al., 2012), psychonetrics (Epskamp, 2020), and our package GGMnonreg. In our experience, researchers do in fact want to make inference about effects (i.e., edges) and this is not possible with data-driven approaches. Furthermore, because the advantage of those approaches is fleeting, it makes sense to use inferential methods from the start. This also allows for seamless integration of more focused theoretical predictions (Haslbeck, Ryan, Robinaugh, Waldorp, & Borsboom, 2019), say, testing hypothesized bridge symptoms in psychopathology networks (Castro et al., 2019; Jones, Ma, & McNally, 2019).

There are some important aspects of our method that must be followed to ensure inference is not compromised. First, choosing an alpha level based on predictive accuracy cannot be done. Rather, researchers must carefully consider the tradeoff that comes from choosing a desired error rate and predictive accuracy prior to estimating the network structure. Figure 6 can be used to inform this decision, for example, by noting the balance between minimizing prediction error and limiting spurious associations for different sample sizes. Second, we again note that the constrained precision matrix can be used for prediction, but it would be incorrect to make inference from the estimates. Statistical inference (e.g., computing p-values) must be made from the hypothesis tests used to construct the adjacency matrix in (11).

Conclusion

This work demonstrated that overfitting has been exaggerated in the network literature. We hope this eases methodological anxiety for researchers unfamiliar with the finer details of regularization, prediction, and inference. The ideas discussed in this work provide a foundation from which to begin using nonregularized estimation in combination with classical hypothesis testing. To facilitate this shift, we have implemented the nonregularized approach for predictability in the R package GGMnonreg.

References

Babyak, M. A. (2004). What you see may not be what you get: A brief, nontechnical introduction to overfitting in regression-

type models. Psychosomatic Medicine, 66(3), 411–421. doi: Ribeiro, J., & Ferreira, T. B. (2019). The Differential Role of 10.1097/01.psy.0000127692.23278.a9 Central and Bridge Symptoms in Deactivating Psychopatho- Bagby, R. M., Parker, J. D., & Taylor, G. J. (1994). The twenty- logical Networks. Frontiers in Psychology, 10, 2448. doi: item Toronto Alexithymia scale-I. Item selection and cross- 10.3389/fpsyg.2019.02448 validation of the factor structure. Journal of Psychosomatic Chaudhuri, S., Drton, M., & Richardson, T. (2007). Estimation of Research, 38(1), 23–32. doi: 10.1016/0022-3999(94)90005 a covariance matrix with zeros. Biometrika, 94(1). -1 Christensen, A. P., & Golino, H. (2019). Estimating the stability of Belloni, A., Chernozhukov, V., & Wang, L. (2011). Square-root the number of factors via Bootstrap Exploratory Graph Anal- lasso: Pivotal recovery of sparse signals via conic program- ysis: A tutorial. PsyArXiv. doi: 10.31234/OSF.IO/9DEAY ming. Biometrika, 98(4), 791–806. doi: 10.1093/biomet/ Clements, M., & Hendry, D. (1998). Forecasting economic time asr043 series. Cambridge: Cambridge University Press. Bento, A. R., Salvação, N., & Guedes Soares, C. (2018). Val- Costantini, G., Epskamp, S., Borsboom, D., Perugini, M., Mõttus, idation of a wave forecast system for Galway Bay. Jour- R., Waldorp, L. J., & Cramer, A. O. (2015). State of the nal of Operational Oceanography, 11(2), 112–124. doi: aRt personality research: A tutorial on network analysis of 10.1080/1755876X.2018.1470454 personality data in R. Journal of Research in Personality, 54, Berk, R., Brown, L., Buja, A., Zhang, K., & Zhao, L. (2013). Valid 13–29. doi: 10.1016/j.jrp.2014.07.003 post-selection inference. Annals of Statistics, 41(2), 802– Danilov, D., & Magnus, J. R. (2004). On the harm that ignoring 837. doi: 10.1214/12-AOS1077 pretesting can cause. Journal of Econometrics, 122(1), 27– Berk, R., Brown, L., & Zhao, L. (2010). Statistical inference af- 46. doi: 10.1016/j.jeconom.2003.10.018 ter model selection. Journal of Quantitative Criminology, Davis, M. H. (1983). A Multidimensional Approach to Individual 26(2), 217–236. Differences in Empathy. JSAS Catalog of Selected Docu- Bertsimas, D., King, A., & Mazumder, R. (2016). Best subset se- ments in Psychology. lection via a modern optimization lens. Annals of Statistics, Di Pierro, R., Costantini, G., Benzi, I. M. A., Madeddu, F., & 44(2), 813–852. doi: 10.1214/15-AOS1388 Preti, E. (2019). Grandiose and entitled, but still frag- Bien, J., & Tibshirani, R. J. (2011). Sparse estimation of a co- ile: A network analysis of pathological narcissistic traits. variance matrix. Biometrika, 98(4), 807–820. doi: 10.1093/ Personality and Individual Differences, 140, 15–20. doi: biomet/asr054 10.1016/j.paid.2018.04.003 Blanken, T. F., Van Der Zweerde, T., Van Straten, A., Van Someren, Drton, M., & Perlman, M. D. (2004). Model selection for Gaus- E. J., Borsboom, D., & Lancee, J. (2019). Introduc- sian concentration graphs. Biometrika, 91(3), 591–602. doi: ing Network Intervention Analysis to Investigate Sequential, 10.1093/biomet/91.3.591 Symptom-Specific Treatment Effects: A Demonstration in Dziak, J. J., Coffman, D. L., Lanza, S. T., Li, R., & Jermiin, Co-Occurring Insomnia and Depression. Psychotherapy and L. S. (2019). Sensitivity and specificity of information cri- Psychosomatics, 88(1), 52–54. doi: 10.1159/000495045 teria. Briefings in Bioinformatics, 00(March), 1–13. doi: Borsboom, D. (2017). A network theory of mental disorders. 
World 10.1093/bib/bbz016 Psychiatry, 16(1), 5–13. doi: 10.1002/wps.20375 Emmert-Streib, F., Tripathi, S., & Dehmer, M. (2019). Constrained Borsboom, D., & Cramer, A. O. (2013). Network Analysis: An Covariance Matrices With a Biologically Realistic Structure: Integrative Approach to the Structure of Psychopathology. Comparison of Methods for Generating High-Dimensional Annual Review of Clinical Psychology, 9(1), 91–121. doi: Gaussian Graphical Models. Frontiers in Applied Mathe- 10.1146/annurev-clinpsy-050212-185608 matics and Statistics, 5, 17. doi: 10.3389/fams.2019.00017 Briganti, G., Kempenaers, C., Braun, S., Fried, E. I., & Linkowski, Epskamp, S. (2020). psychonetrics: Structural Equa- P. (2018). Network analysis of empathy items from the in- tion Modeling and Confirmatory Network Analysis. Re- terpersonal reactivity index in 1973 young adults. Psychiatry trieved from https://cran.r-project.org/package= Research, 265, 87–92. doi: 10.1016/j.psychres.2018.03.082 psychonetrics Briganti, G., & Linkowski, P. (2019). Network Approach to Epskamp, S., Cramer, A. O. J., Waldorp, L. J., Schmittmann, V. D., Items and Domains From the Toronto Alexithymia Scale. & Borsboom, D. (2012). qgraph: Network Visualizations of Psychological Reports, 0033294119889586. doi: 10.1177/ Relationships in Psychometric Data. Journal of Statistical 0033294119889586 Software, 48(4). doi: 10.18637/jss.v048.i04 Brusco, M. J., Voorhees, C. M., Calantone, R. J., Brady, M. K., & Epskamp, S., Waldorp, L. J., Mottus, R., & Borsboom, D. (2018). Steinley, D. (2019). Integrating linear discriminant analy- The Gaussian Graphical Model in Cross-Sectional and Time- sis, polynomial basis expansion, and genetic search for two- Series Data. Multivariate Behavioral Research, 53(4), 453– group classification. Communications in Statistics: Simu- 480. doi: 10.1080/00273171.2018.1454823 lation and Computation, 48(6), 1623–1636. doi: 10.1080/ Fan, J., & Li, R. (2001). Variable Selection via Nonconcave Pe- 03610918.2017.1419262 nalized Likelihood and its Oracle Properties. Journal of the Bühlmann, P., Kalisch, M., & Meier, L. (2014). High-Dimensional American Statistical Association, 96(456), 1348–1360. doi: Statistics with a View Toward Applications in Biology. An- 10.1198/016214501753382273 nual Review of Statistics and Its Application, 1(1), 255–278. Fitzgerald, E., & Akintoye, A. (1995). The accuracy and opti- doi: 10.1146/annurev-statistics-022513-115545 mal linear correction of UK construction tender price index Castro, D., Ferreira, F., de Castro, I., Rodrigues, A. R., Correia, M., forecasts. Construction Management and Economics, 13(6), 18 DONALD R. WILLIAMS & JOSUE E. RODRIGUEZ

493–500. doi: 10.1080/01446199500000057 Practical Guide to Variable Selection in Structural Equa- Fried, E. I., & Cramer, A. O. (2017). Moving Forward: Challenges tion Modeling by Using Regularized Multiple-Indicators, and Directions for Psychopathological Network Theory and Multiple-Causes Models. Advances in Methods and Prac- Methodology. Perspectives on Psychological Science, 12(6), tices in Psychological Science, 2(1), 55–76. doi: 10.1177/ 999–1020. doi: 10.1177/1745691617705892 2515245919826527 Gao, X., Pu, D. Q., Wu, Y., & Xu, H. (2009). Tuning parameter Janková, J., & van de Geer, S. (2017). Honest confidence regions selection for penalized likelihood estimation of inverse co- and optimality in high-dimensional precision matrix estima- variance matrix. arXiv. doi: 10.5705/ss.2009.210 tion. Test, 26(1), 143–162. doi: 10.1007/s11749-016-0503 Gelman, A., Hwang, J., & Vehtari, A. (2014). Understanding pre- -5 dictive information criteria for Bayesian models. Statistics Jankova, J., & van de Geer, S. (2018). Inference in high- and Computing, 24(6), 997–1016. doi: 10.1007/s11222-013 dimensional graphical models. , 0, 1–29. -9416-2 Javanmard, A., & Montanari, A. (2013). Confidence Intervals and Goodwin, P. (1997). Adjusting judgemental extrapolations using Hypothesis Testing for High-Dimensional Regression. , 1– Theil’s method and discounted weighted regression. Jour- 21. nal of Forecasting, 16(1), 37–46. doi: 10.1002/(SICI)1099 Javanmard, A., & Montanari, A. (2015). De-biasing the Lasso: -131X(199701)16:1<37::AID-FOR647>3.0.CO;2-T Optimal Sample Size for Gaussian Designs. arXiv. Gronau, Q. F., & Wagenmakers, E.-J. (2018). Limitations of Jones, P. J., Heeren, A., & McNally, R. J. (2017). Commentary: A Bayesian Leave-One-Out Cross-Validation for Model Selec- network theory of mental disorders. Frontiers in Psychology, tion. PsyArXiv. doi: 10.17605/OSF.IO/AT7CX 8, 1305. doi: 10.3389/fpsyg.2017.01305 Ha, M. J., & Sun, W. (2014). Partial correlation matrix estima- Jones, P. J., Ma, R., & McNally, R. J. (2019). Bridge Cen- tion using ridge penalty followed by thresholding and re- trality: A Network Approach to Understanding Comorbid- estimation. Biometrics, 70(3), 765–773. doi: 10.1111/ ity. Multivariate Behavioral Research, 1–15. doi: 10.1080/ biom.12186 00273171.2019.1614898 Haslbeck, J. M., & Fried, E. I. (2017). How predictable are symp- Kabaila, P. (1995). The effect of model selection on confidence toms in psychopathological networks? A reanalysis of 18 regions and prediction regions. Econometric Theory, 11(3), published datasets. Psychological Medicine, 47(16), 2267– 537–549. doi: 10.1017/S0266466600009403 2276. doi: 10.1017/S0033291717001258 Kim, D., Lee, S., & Kwon, S. (2018). A unified algorithm for Haslbeck, J. M., Ryan, O., Robinaugh, D., Waldorp, L., & Bors- the non-convex penalized estimation: The ncpen package. boom, D. (2019). Modeling Psychopathology: From Data arxiv. Models to Formal Theories. PsyArXiv. doi: 10.31234/ Krämer, N., Schäfer, J., & Boulesteix, A.-L. (2009). Regular- OSF.IO/JGM7F ized estimation of large-scale gene association networks us- Haslbeck, J. M., & Waldorp, L. J. (2015). MGM: Esti- ing graphical Gaussian models. BMC Bioinformatics, 10(1), mating Time-Varying Mixed Graphical Models in High- 384. doi: 10.1186/1471-2105-10-384 Dimensional Data. Kuismin, M., Kemppainen, J., & Sillanpää, M. (2017). Pre- Haslbeck, J. M., & Waldorp, L. J. (2018). How well do net- cision Matrix Estimation With ROPE. Journal of Com- work models predict observations? 
On the importance of putational and Graphical Statistics, 26(3), 682–694. doi: predictability in network models. Behavior Research Meth- 10.1080/10618600.2016.1278002 ods, 50(2), 853–861. doi: 10.3758/s13428-017-0910-x Kuismin, M., & Sillanpää, M. (2017). Estimation of covariance Hastie, T., Tibshirani, R., & Friedman, J. (2008). The Elements of and precision matrix, network structure, and a view toward Statistical Learning: , Inference, and Prediction systems biology. Wiley Interdisciplinary Reviews: Compu- (2nd ed.). New York: Springer. tational Statistics, 9(6), 1–13. doi: 10.1002/wics.1415 Hastie, T., Tibshirani, R., & Tibshirani, R. J. (2017). Extended Kwan, C. C. Y. (2014). A Regression-Based Interpretation of the Comparisons of Best Subset Selection, Forward Stepwise Inverse of the Sample Covariance Matrix. Spreadsheets in Selection, and the Lasso. arXiv. Education, 7(1). Hastie, T., Tibshirani, R., & Wainwright, M. (2015). Statisti- Lauritzen, S. L. (1996). Graphical models (Vol. 17). Clarendon cal Learning with Sparsity: The Lasso and Generalizations. Press. Boca Raton: CRC Press. doi: 10.1201/b18401-1 Ledoit, O., & Wolf, M. (2004). A well-conditioned estimator Hevey, D. (2018). Network analysis: a brief overview and tutorial. for large-dimensional covariance matrices. Journal of Mul- Health Psychology and Behavioral Medicine, 6(1), 301–328. tivariate Analysis, 88(2), 365–411. doi: 10.1016/S0047 doi: 10.1080/21642850.2018.1521283 -259X(03)00096-4 Højsgaard, S., Edwards, D., & Lauritzen, S. (2012). Graphical Lee, J. D., Sun, D. L., Sun, Y., & Taylor, J. E. (2016). Exact post- Models with R. doi: 10.1007/978-1-4614-2299-0 selection inference, with application to the lasso. Annals of Hurvich, C. M., & Tsai, C.-L. (1990). The Impact of Model Se- Statistics, 44(3), 907–927. doi: 10.1214/15-AOS1371 lection on Inference in . The American Leeb, H., & Pöscher, B. M. (2006). Can one estimate the Statistician, 44(3), 214. doi: 10.2307/2685338 conditional distribution of post-model-selection estimators? Hyndman, R. J., & Athanasopoulos, G. (2018). Forecasting: prin- Annals of Statistics, 34(5), 2554–2591. doi: 10.1214/ ciples and practice (2nd ed.). Melbourne: OTexts. 009053606000000821 Jacobucci, R., Brandmaier, A. M., & Kievit, R. A. (2019). A Leeb, H., & Pötscher, B. M. (2005). Model selection and inference: OVERFITTING IS NOT (USUALLY) A PROBLEM 19

Facts and fiction. Econometric Theory, 21(1), 21–59. doi: Royal Statistical Society: Series B (Methodological), 39(1), 10.1017/S0266466605050036 44–47. doi: 10.1111/j.2517-6161.1977.tb01603.x Leeb, H., Pötscher, B. M., & Ewald, K. (2015). On Various Con- Taylor, J., & Tibshirani, R. (2017). Post-selection inference for fidence Intervals Post-Model-Selection. Statistical Science, `1-penalized likelihood models. Canadian Journal of Statis- 30(2), 216–227. doi: 10.1214/14-STS507 tics(2015), 1–26. doi: 10.1002/cjs.11313 Leppä-aho, J., Pensar, J., Roos, T., & Corander, J. (2017). Learning Taylor, J., & Tibshirani, R. J. (2015). Statistical learning and se- Gaussian graphical models with fractional marginal pseudo- lective inference. Proceedings of the National Academy of likelihood. International Journal of Approximate Reasoning, Sciences, 112(25), 7629–7634. 83, 21–42. doi: 10.1016/j.ijar.2017.01.001 Theil, H. (1961). Economic forecast and policy. North-Holland. Liu, H., & Wang, L. (2017). TIGER: A tuning-insensitive ap- Van Wieringen, W. N., & Peeters, C. F. (2016). Ridge estimation proach for optimally estimating gaussian graphical mod- of inverse covariance matrices from high-dimensional data. els. Electronic Journal of Statistics, 11(1), 241–294. doi: Computational Statistics and Data Analysis, 103, 284–303. 10.1214/16-EJS1195 doi: 10.1016/j.csda.2016.05.012 Mazumder, R., Radchenko, P., & Dedieu, A. (2017). Subset Selec- Vazire, S. (2018). Implications of the Credibility Revolution tion with Shrinkage: Sparse Linear Modeling when the SNR for Productivity, Creativity, and Progress. Perspectives on is low. arXiv. psychological science : a journal of the Association for Meinshausen, N., & Bühlmann, P. (2006). High-dimensional Psychological Science, 13(4), 411–417. doi: 10.1177/ graphs and variable selection with the Lasso. An- 1745691617751884 nals of Statistics, 34(3), 1436–1462. doi: 10.1214/ Waldorp, L., Marsman, M., & Maris, G. (2019). 009053606000000281 and Ising networks: prediction and estimation when violat- Miner, J., & Mincer, V. (1969). The Evaluation of Economic Fore- ing lasso assumptions. Behaviormetrika, 46(1), 49–72. doi: casts. In J. Mincer (Ed.), Economic forecasts and expecta- 10.1007/s41237-018-0061-0 tions: Analysis of forecasting behavior and performance (pp. Wang, T., Ren, Z., Ding, Y., Fang, Z., Sun, Z., MacDonald, M. L., 3–46). New York: National Bureau of Economic Research. . . . Chen, W. (2016). FastGGM: An Efficient Algorithm Murphy, A. H. (1988). Skill Scores Based on the Mean Square for the Inference of Gaussian Graphical Model in Biological Error and Their Relationships to the Correlation Coeffi- Networks. PLoS Computational Biology, 12(2), 1–16. doi: cient. Monthly Weather Review, 148(3). doi: 10.1175/ 10.1371/journal.pcbi.1004755 1520-0493(1988)116<2417:SSBOTM>2.0.CO;2 Wang, Y., & Zhu, L. (2016). Variable Selection and Parameter Revelle, W. (2019). psych: Procedures for Psychological, Psycho- Estimation with the Atan Regularization Method. Journal of metric, and Personality Research. Evanston, Illinois. Re- Probability and Statistics, 2016, 1–12. doi: 10.1155/2016/ trieved from https://cran.r-project.org/package= 6495417 psych Watson, P., & Teelucksignh. (2008). A Practical Introduction Ryan, O., Bringmann, L., & Schuurman, N. K. (2019). The to Econometric Methods: Classical and Modern (1st ed.). Challenge of Generating Causal Hypotheses Using Network Kingston: University of the West Indies Press. Models. Werner, M., Štulhofer, A., Waldorp, L., & Jurin, T. 
(2018). A doi: 10.31234/OSF.IO/RYG69 Network Approach to Hypersexuality: Insights and Clinical Schäfer, J., & Strimmer, K. (2005). A Shrinkage Approach to Implications. Journal of Sexual Medicine, 15(3), 410–415. Large-Scale Covariance Matrix Estimation and Implications doi: 10.1016/j.jsxm.2018.01.009 for Functional Genomics. Statistical Applications in Genet- Williams, D. R., & Mulder, J. (2019). Bayesian Hypothesis Testing ics and Molecular Biology, 4(1). doi: 10.2202/1544-6115 for Gaussian Graphical Models: Conditional Independence .1175 and Order Constraints. PsyArXiv. doi: 10.31234/osf.io/ Solazzo, E., & Galmarini, S. (2016). Error apportionment for ypxd8 atmospheric chemistry-transport models - A new approach Williams, D. R., & Rast, P. (2019). Back to the basics: Re- to model evaluation. Atmospheric Chemistry and Physics, thinking partial correlation network methodology. British 16(10), 6263–6283. doi: 10.5194/acp-16-6263-2016 Journal of Mathematical and Statistical Psychology. doi: Solazzo, E., Hogrefe, C., Colette, A., Garcia-Vivanco, M., & Gal- 10.1111/bmsp.12173 marini, S. (2017). Advanced error diagnostics of the CMAQ Williams, D. R., Rhemtulla, M., Wysocki, A. C., & Rast, P. and Chimere modelling systems within the AQMEII3 model (2019). On Nonregularized Estimation of Psychological Net- evaluation framework. Atmospheric Chemistry and Physics, works. Multivariate Behavioral Research, 54(5), 1–23. doi: 17(17), 10435–10465. doi: 10.5194/acp-17-10435-2017 10.1080/00273171.2019.1575716 Stark, P. B., & Saltelli, A. (2018). Cargo-cult statistics and sci- Wysocki, A. C., & Rhemtulla, M. (2019). On Penalty Parame- entific crisis. Significance, 15(4), 40–43. doi: 10.1111/ ter Selection for Estimating Network Models. Multivariate j.1740-9713.2018.01174.x Behavioral Research, 1–15. doi: 10.1080/00273171.2019 Stephens, G. (1998). On the Inverse of the Covariance Matrix in .1672516 Portfolio Analysis. The Journal of Finance, 53(5), 1821– Yang, Y. (2005). Can the strengths of AIC and BIC be shared? 1827. A conflict between model indentification and regression es- Stone, M. (1977). An Asymptotic Equivalence of Choice of Model timation. Biometrika, 92(4). by Cross-Validation and Akaike’s Criterion. Journal of the Yarkoni, T., & Westfall, J. (2017). Choosing Prediction Over Ex- 20 DONALD R. WILLIAMS & JOSUE E. RODRIGUEZ

planation in Psychology: Lessons From Machine Learning. Perspectives on Psychological Science, 12(6), 1100–1122. doi: 10.1177/1745691617693393
Yuan, M. (2010). High Dimensional Inverse Covariance Matrix Estimation via Linear Programming. Journal of Machine Learning Research, 11, 2261–2286.
Zhang, R., Ren, Z., & Chen, W. (2018). SILGGM: An extensive R package for efficient statistical inference in large-scale gene networks. PLoS Computational Biology, 14, e1006369. doi: 10.1371/journal.pcbi.1006369
Zhao, P., & Yu, B. (2006). On Model Selection Consistency of Lasso. The Journal of Machine Learning Research, 7, 2541–2563. doi: 10.1109/TIT.2006.883611

Appendix A
Estimating a Precision Matrix with Zeros

Algorithm 1: Description of the HTF algorithm (pp. 631–634, Hastie et al., 2008)

S is a p × p covariance matrix
A is a p × p adjacency matrix

Initialize W = S
repeat
  For nodes j = 1, ..., p
  (a) Partition the matrices: W_{11} = W_{-j,-j} and s_{12} = S_{j,-j}
  (b) Obtain the row indices that define the neighborhood, ne_j, in A_{j,-j} (Equation 11)
  (c) Subset the matrices according to the neighborhood, that is, W_{11}^{ne_j} and s_{12}^{ne_j}.

  Note: The rows and columns of W_{11} are subsetted according to ne_j, resulting in the sample covariance matrix with respect to A_{j,-j}.

  (d) Initialize a (p − 1) × 1 vector of zeroes, β, to gather the regression coefficients.
  (e) Compute the regression coefficients: β̂^{ne_j} = (W_{11}^{ne_j})^{-1} s_{12}^{ne_j}, where the coefficients are filled into the positions of the 1's that were obtained in step (b).

  Note: This corresponds to solving for β̂^{ne_j} with subsets of Y: β̂^{ne_j} = (Y'_{ne_j} Y_{ne_j})^{-1} Y'_{ne_j} Y_j, where W_{11}^{-1} = (Y'_{ne_j} Y_{ne_j})^{-1} · (n − 1) and s_{12}^{ne_j} = Y'_{ne_j} Y_j / (n − 1). Here Y_{ne_j} denotes the subset of columns in Y that correspond to the neighborhood of Y_j.

  (f) Update W: W_{j,-j} = W_{-j,j} = W_{11} β̂
  end for
  if max_{ij} |W − W_previous| < tolerance: end repeat and return Θ̂ = W^{-1}; else set W_previous = W

In this algorithm, W_{11} is the sample covariance matrix, S, but excluding the jth row and column. Further, s_{12} is the jth row of S, excluding the jth column. In step (c), an additional subset of each is then obtained according to the neighborhood of the jth node, corresponding to those relations determined to be significantly different from zero (i.e., the location of the 1's in A_{j,-j}).
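For readers who prefer code, the following R function is a minimal translation of Algorithm 1. It is a sketch rather than a maintained implementation (the function name and defaults are ours); S is the sample covariance matrix and A is the adjacency matrix obtained from the hypothesis tests.

    constrained_precision <- function(S, A, tol = 1e-8, max_iter = 100) {
      p <- ncol(S)
      W <- S                                         # initialize W = S
      for (iter in seq_len(max_iter)) {
        W_old <- W
        for (j in seq_len(p)) {
          ne   <- which(A[j, -j] == 1)               # step (b): neighborhood of node j
          beta <- rep(0, p - 1)                      # step (d): zeros for absent edges
          if (length(ne) > 0) {
            W11 <- W[-j, -j, drop = FALSE]           # step (a)
            s12 <- S[-j, j]
            beta[ne] <- solve(W11[ne, ne, drop = FALSE], s12[ne])   # steps (c) and (e)
          }
          w12 <- W[-j, -j, drop = FALSE] %*% beta    # step (f)
          W[j, -j] <- w12
          W[-j, j] <- w12
        }
        if (max(abs(W - W_old)) < tol) break         # convergence check
      }
      solve(W)                                       # Theta-hat = W^{-1}
    }

For example, constrained_precision(cov(Y), A) would return the constrained precision matrix used for the predictability computations in the main text; here Y and A are placeholders for a data matrix and an adjacency matrix determined beforehand.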

Appendix B
`2-Regularization

Rationale

In full disclosure, we were hoping to provide a clear picture of where `1-regularization provided notable benefits. We struggled to find such examples that would also be representative of the network literature. We thus repeated Experiment 3 with `2-regularization (i.e., ridge) instead of `1-regularization. The aim was to explore whether `2-regularization overcomes the issues of GlassoEBIC selecting empty networks and having large test error for the largest p/n ratios. In this example, the training data sizes ranged from 50 to 200 (in increments of 10). We estimated the ridge-based precision matrix using the approach outlined in Ha and Sun (2014), as implemented in the R package GGMridge. This particular method estimates the optimal shrinkage parameter analytically (see Equations 4-7 in Schäfer & Strimmer, 2005).

Results. The results (see Figure B1) show that the ridge approach not only overcomes the issues of the lasso at larger p/n ratios, but also provided more than competitive predictive performance when compared to NonregNHST and the full model. This is seen, for example, in the BFI dataset, where the ridge approach outperformed both nonregularized methods at all sample sizes. However, the methods ultimately converged to be quite similar. A similar pattern can be observed across the IRI and TAS datasets. Note that there are many partial correlations in each network (BFI: 300, IRI: 378, TAS: 190). This again makes clear that nonregularized estimation provides competitive performance even when there are more partial correlations than observations (the number of nodes is what matters). Together, while there are situations where regularization may be employed for a gain in predictive accuracy, one must be cognizant that this sacrifices two desirable goals in network analysis: (1) edge selection cannot be performed because the `2-penalty does not push any estimates to zero; and (2) inference is compromised by introducing bias from the shrinkage penalty.
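As a generic illustration of the idea (this is not the Ha and Sun, 2014, estimator used for Figure B1, nor GGMridge code), an `2-type precision matrix estimate can be written as Θ̂ = (S + λI)^{-1}; every entry remains nonzero, which is why the ridge penalty cannot perform edge selection.

    # Illustrative only: a generic ridge-type precision matrix estimate.
    ridge_precision <- function(Y, lambda = 0.1) {
      S <- cov(Y)
      solve(S + lambda * diag(ncol(S)))   # no entries are pushed to zero: no edge selection
    }

In practice, lambda would be chosen by cross-validation or, as in Schäfer and Strimmer (2005), analytically; the fixed value here is purely for illustration.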

Figure B1. Predictive accuracy of the full model, NonregNHST, and the ridge approach for the BFI, IRI, and TAS datasets, plotted against the training data size. The y-axis is the cross-validated log-likelihood (Equation 12). This provides a measure of global predictive accuracy for the network. Larger values indicate superior performance (i.e., maximizing the likelihood). BFI (p = 25), IRI (p = 28), and TAS (p = 20).
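Because both Figure 5 and Figure B1 score predictions with the cross-validated log-likelihood (Equation 12), a rough sketch of that computation is given below. It is written in our own notation and is not the authors' code; Theta denotes a precision matrix estimated from the training data, and mu the training means.

    cv_loglik <- function(Y_test, Theta, mu) {
      p <- ncol(Y_test)
      centered <- sweep(as.matrix(Y_test), 2, mu)               # center test data at training means
      quad <- rowSums((centered %*% Theta) * centered)          # (y - mu)' Theta (y - mu) per row
      logdet <- as.numeric(determinant(Theta, logarithm = TRUE)$modulus)
      sum(0.5 * (logdet - p * log(2 * pi) - quad))              # Gaussian log-likelihood of test rows
    }
    # e.g., cv_loglik(Y_test, Theta_hat, mu = colMeans(Y_train)) for a given train/test split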