Prediction Error in the Bennett and Stam 1996 Model

Revisiting Predictions of War Duration

Scott Bennett and Allan Stam

The Pennsylvania State University and University of Michigan

July 15, 2008

Abstract

We reexamine the fit of Bennett and Stam’s 1996 model of war duration, correcting errors in the reported estimates of prediction accuracy. We discuss how to assess fit in the absence of standard or widely accepted measures of fit in duration models. We introduce a proportional reduction in error measure for duration models, and report new estimates of model fit from the Bennett and Stam model. The model does significantly reduce prediction error relative to a naïve model.

Introduction

In Bennett and Stam (1996), we estimated a model of war duration incorporating measures of military attributes, domestic politics, and military strategy in the first analysis of war duration to employ current event-history/hazard model techniques. In discussing the results, we presented several statistics to assess the model’s fit to the data in terms of the accuracy of predictions. These statistics included the mean error of prediction from different models, the median error, and the prediction error as a percentage of war length. Unfortunately, there was an error in the computation and reporting of the last of these measures. In this paper, we revisit the prediction accuracy (“fit”) of the Bennett and Stam model in the original data. We also revisit the broader question of how model “fit” in duration models should be assessed. After discussing the measures we originally reported, we present an alternative proportional reduction in error (PRE) measure, and argue that it is a better single measure of fit for predictions in duration models. With the corrected error measure, the model represents a clear improvement over both a null model and models containing subsets of the independent variables.

Original Data and Method

Bennett and Stam developed a data set of 78 wars and assessed the independent effects associated with 17 independent variables on their duration. The original data were largely drawn from the Correlates of War project’s list of interstate wars, with updates from various other sources such as Clodfelter (1992) and Dupuy and Dupuy (1986), as implemented in Stam (1996). A Weibull duration model was estimated on a final dataset of 78 wars and 169 war-years. The paper was one of the first to use current event history methods (also known as hazard analysis, duration analysis, or survival analysis) in political science, and more specifically in international relations. It concluded that military strategy, domestic political factors, and realpolitik variables all independently influence the length of wars, and that war was not duration dependent once independent variables had been appropriately included in a statistical model. Results reported by Bennett and Stam (2006) with a data set updated to include the 1990/1991 Persian Gulf War were almost identical. Others have modified the data or reanalyzed it after incorporating other explanatory variables (e.g., Slantchev 2004), or taken the findings as informative for creating new theoretical models of war (e.g., Filson and Werner 2002, 2004; Langlois and Langlois 2008). In this paper, we reexamine the original data set and results, as this is where incorrect estimates of fit were reported.

Assessing Fit in the War Duration Model

Along with a review of their standard hypothesis testing results, Bennett and Stam (1996) discussed the overall fit of their model and its ability to accurately predict war durations using data that would be available ex ante. They noted that maximum likelihood models “do not have an overall measure of fit analogous to the R2 statistic in an OLS model” (pg. 250). To provide a general sense of how the model fit the data they provided three auxiliary indicators of prediction error. Their approach provided an intuitive feel for gauging the accuracy of their model’s predictions, explaining that their model typically was off by n months, or typically came within x % of the actual length. Such assessments of accuracy are relatively uncommon in analyses of duration data.

Bennett and Stam’s analysis of aggregate model fit began with the computation of the absolute value of the difference between the model’s prediction in each war and the actual war length. For example, the predicted duration of some individual war might be 2 months more or less than its actual duration. They then reported the mean absolute error, median absolute error, and mean absolute error as a percentage of war length for each of the models they estimated.

There are two good reasons for including multiple measures of overall model fit. First, there is no single best measure of model fit in duration models (or MLE models in general). Second, the data on war duration are highly skewed, with most wars being very short and only a few being long. Because of this skew, the mean and median errors differ significantly. As a result, a naïve model predicting just the mean or median war length might perform very well in terms of mean or median error, but would obviously be unable to identify any circumstances associated with either particularly long wars or wars that are unusually brief. Because of the skew in the data, if the model did make longer predictions, but the duration of those wars was especially long (e.g., the model predicts 24 months but the duration was 60 months), the absolute error in these sorts of cases could still appear very large. The idea of computing error as a percentage of war length is a plausible way to see how well a model performs relative to the specific cases, given the large variance and skewed distribution of the duration data.

Revisiting the original model provides an opportunity to consider how best to assess predictions of war length using an econometric model. There are several potential measures for assessing prediction errors. The obvious starting point is always a prediction of the length of each war in the data set, using the parameter estimates generated by the MLE estimation, compared to the war’s actual duration. The difference between the two (predicted duration – actual duration) may be positive or negative for any war, with a negative value indicating that the war’s length was under-predicted, and a positive value indicating that the length was over-predicted. The simplest assessment would be the mean error across all wars. Because some errors are positive and some negative, this mean could end up being positive or negative, with a negative mean suggesting that the model is somewhat under-predicting actual war lengths, and a positive mean suggesting net over-prediction. However, unless there is systematic bias in the predictions, the errors will sum to near zero, as under- and over-predictions balance each other out. Moreover, this would tell us nothing about the size of the typical error or the variance of the prediction errors. For example, if one model yielded prediction errors in two cases of +2 months and -2 months, the mean error would be 0; if a different model produced errors of +10 months and -10 months, the mean would still be 0. The variance in the distribution of errors is much higher in the second case, and this would be reflected in a higher standard error of the estimate. But here it would be critical to examine the combination of mean and variance.

A better measure, particularly as a single measure, might be the mean absolute value of the error for each war. Using absolute values means that errors do not average out to 0, and leads to intuitive statements such as “the typical estimate is off by x months.” Bennett and Stam (1996) reported this value, and its standard deviation.

Bennett and Stam further computed the absolute error in war length as a percentage of the length of the war. As noted above, the intuition behind computing and reporting this measure is that the magnitude of an error matters relative to the expected duration of the war. In other words, a three-month error should be seen as less important in a five-year war than in a one-month war. Unfortunately, this measure has its own pathology resulting from the cases where the error is greater than the length of the war. When dividing error by length, resulting values less than 1 indicate that the error was less than the war’s actual length, while values greater than 1 mean the error was more than the actual length. A value of 0.25, for example, indicates that the error is 25% of the war’s length (say, predicting 3 months when the war took 4 months), while a value of 2.0 indicates that the error is twice the war’s length (say, predicting 12 months when the war took 4 months). For wars where the prediction is less than the war’s duration, the values are bounded between 0 and 1. But when the prediction is greater than the actual duration, the ratio is (theoretically) unbounded. Prediction errors as a proportion of length are biased in that they will be largest for the shortest wars. For example, in a 1-month war, a 6-month error leads to a 600% error, while in a 12-month war, that same 6-month error is a 50% error. When averaging the computed errors as a proportion of war length, the average will tend to yield high values. In the example just given, the average error is 325% of war length, even though both predictions were off by 6 months, and the total error was 12 months out of 13 months of war. This measure actually creates a problem related to that which it was intended to avoid!
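The behavior of these three measures can be illustrated with a minimal Python sketch that reproduces the toy examples in the preceding paragraphs. The function names and the two-war data are hypothetical illustrations and are not taken from the replication files.

```python
def mean_error(predicted, actual):
    """Mean signed error (predicted minus actual); over- and under-predictions cancel."""
    return sum(p - a for p, a in zip(predicted, actual)) / len(actual)


def mean_abs_error(predicted, actual):
    """Mean absolute error, in the same units as the durations (months here)."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


def mean_abs_pct_error(predicted, actual):
    """Mean of |error| / actual duration, as a proportion (3.25 means 325%)."""
    return sum(abs(p - a) / a for p, a in zip(predicted, actual)) / len(actual)


# Two hypothetical models scored on the same two 12-month wars.
actual = [12, 12]
model_a = [14, 10]  # errors of +2 and -2 months
model_b = [22, 2]   # errors of +10 and -10 months

print(mean_error(model_a, actual), mean_error(model_b, actual))          # 0.0 0.0
print(mean_abs_error(model_a, actual), mean_abs_error(model_b, actual))  # 2.0 10.0

# A 1-month war and a 12-month war, each over-predicted by 6 months.
short_and_long = [1, 12]
off_by_six = [7, 18]
print(mean_abs_pct_error(off_by_six, short_and_long))  # 3.25

# Pooled alternative: total error divided by total months of war (12 / 13).
pooled = sum(abs(p - a) for p, a in zip(off_by_six, short_and_long)) / sum(short_and_long)
print(round(pooled, 3))  # 0.923
```

The signed mean cannot distinguish the two models, the absolute mean can, and the per-war percentage measure is dominated by the short war.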

A Proportional Reduction in Error (PRE) Measure of Prediction Fit

These considerations lead us to develop a new measure here for estimating how well econometric models fit duration data, following a proportional-reduction-in-error approach. Proportional Reduction in Error (PRE) estimates of model fit focus on how much a model’s total prediction error drops following fitting a subsequent model to the data. The “removed” or “reduced” errors are reported as a percentage of the total error produced by a naïve model fit to the data. In the case of the war duration data, we can conceive of an error in predicting each war, which is the absolute prediction error (abs[predicted duration – actual duration]). We can obtain the total possible prediction error by estimating a constant-only model on the data, and summing the resulting absolute prediction error across all wars. This error is the total error that we would have given a naïve or null understanding of the determinants of war duration. We then estimate an improved model with covariates, and sum the absolute prediction errors across all wars. This summed error constitutes the new error estimate. Comparing the aggregate errors for the null model to the errors from the model with covariates, we can then compute the proportional reduction in error as

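$$\text{Proportional Reduction in Error} = \frac{\text{total error}_{\text{naïve model}} - \text{total error}_{\text{model with covariates}}}{\text{total error}_{\text{naïve model}}}$$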

The statistic provides an estimate of the reduced error relative to the initial total error possible. This simple formulation is an appropriate way to assess error reduction, and eliminates the problem of discrepant scaling when assessing error as a proportion of the initial war length. Moreover, it is conceptually the same as and comparable to PRE measures reported in other contexts when many other types of estimators are used. In this case, given that error takes the form of months off from the estimated model vs. true duration, we have

$$\text{PRE} = \frac{\sum_i \left|\hat{t}_i^{\,\text{naïve}} - t_i\right| - \sum_i \left|\hat{t}_i^{\,\text{model}} - t_i\right|}{\sum_i \left|\hat{t}_i^{\,\text{naïve}} - t_i\right|}$$

where $t_i$ is the actual duration of war $i$ in months and $\hat{t}_i$ is the duration predicted for that war by the indicated model.

We can also compute PRE using [sum of absolute error/actual duration] to compute the reduction in the percentage error of the model. While the components of this estimation (error/duration) suffer from the same possible problem detailed above (of skew towards large values when error exceeds actual duration), a PRE measure based on these has the same interpretation as the PRE in absolute error, namely the reduction in total error when error is measured as the individual war error/duration. In our tables, we report the PRE in absolute prediction error, and PRE in absolute error/actual duration.

Assessing the accuracy of duration predictions via a PRE method is possible using any measure of error generated by any statistical duration model that produces point estimates. Point predictions rather than a distribution of durations are necessary because the predicted durations generated using the statistical models are compared to the observed durations. However, in the context of parametric or semi-parametric duration models, only certain models make such predictions. In particular, point predictions of duration can be made only for the family of parametric models, e.g., those that assume an exponential, Weibull, gamma, or other specific function for the baseline hazard. Point predictions cannot be made using the Cox proportional-hazard duration model. Generating a predicted duration of the process in question requires specification of the baseline hazard function. The Cox model specifically rejects making any assumptions about this attribute of the data generating process. Making out-of-sample predictions requires specification of the full functional form of the baseline hazard rate. Doing so defeats the purpose of estimating a Cox model. With a Cox model we can conduct standard hypothesis tests about the presence or absence of a variable’s associated effect on the duration data. We can also assess the independent variables’ effects on the relative hazard of the process in question ending without assuming any particular distribution for the baseline hazard. The tradeoff that comes with not assuming an underlying functional form is that assessment of prediction accuracy, whether via PRE or any other method, cannot be done with the Cox model.

Finally, we note that the PRE method we use can be used to assess either models that are structured to estimate the associated effects of time-varying covariates (TVCs), or those that have only time-invariant covariates. By time-varying covariates we mean data with independent variables whose values vary over time within an individual case composed of multiple observations. In the time-varying covariates case there are several observations or “lines of data” for each case or spell (e.g., each war) in the data set. In the time-invariant case, there is only one observation per spell. When using a data set with time-varying covariates, the estimation of the model’s coefficients assumes that when there are multiple observations, all observations within a particular case except the last observation are censored, with the final outcome unobserved.

With a set of parameter estimates in hand, all it takes to produce a point prediction for a hypothetical duration is a specified set of values for the full set of independent variables. In the case of a duration model, this string of “X” values is simply multiplied by the estimated B coefficients to produce the familiar XB, which is then used directly in $e^{XB}$ to predict a duration time. In the instance of data with TVCs, we could actually make a point prediction on the basis of any of the observations associated with a given case/war, obtaining several duration estimates that would change as the covariates changed over the life of a case. There is no good way to use information about the full “path” of the TVCs to make a prediction, so in order to avoid overcounting the multiple observations of spells with TVCs, we should base the prediction on just one of the observations when computing the PRE statistic or other measures of average values. Doing so ensures that we produce an average across wars and not observations. In the results for the TVC model we present here, we use the TVC values from the final observation of each war.
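The steps described in this section can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the Stata prediction file: the predicted duration is taken to be exp(XB) as described in the text, the naïve prediction is a single constant, and all function names, coefficients, and war records below are hypothetical.

```python
import math


def predict_duration(x, beta):
    """Point prediction exp(XB) for one observation (assumed form, following the text)."""
    xb = sum(x_k * b_k for x_k, b_k in zip(x, beta))
    return math.exp(xb)


def last_observation_per_war(rows):
    """Keep only the final observation of each war so that each war is counted once.

    rows: iterable of (war_id, period, x), where period orders observations within a war.
    """
    latest = {}
    for war_id, period, x in rows:
        if war_id not in latest or period > latest[war_id][0]:
            latest[war_id] = (period, x)
    return {war_id: x for war_id, (period, x) in latest.items()}


def total_abs_error(predicted, actual):
    """Sum of absolute prediction errors across wars."""
    return sum(abs(p - a) for p, a in zip(predicted, actual))


def pre(null_predicted, model_predicted, actual):
    """Proportional reduction in total absolute error relative to the naive model."""
    null_total = total_abs_error(null_predicted, actual)
    model_total = total_abs_error(model_predicted, actual)
    return (null_total - model_total) / null_total


# Hypothetical TVC data: war "A" has two war-year rows, war "B" has one.
rows = [("A", 1, [1.0, 0.0]), ("A", 2, [1.0, 1.0]), ("B", 1, [1.0, 0.5])]
beta = [1.0, 1.5]                  # hypothetical coefficients (constant, one covariate)
null_constant = 2.0                # hypothetical constant-only estimate
durations = {"A": 20.0, "B": 4.0}  # hypothetical actual durations in months

x_last = last_observation_per_war(rows)
wars = sorted(durations)
actual = [durations[w] for w in wars]
model_pred = [predict_duration(x_last[w], beta) for w in wars]
null_pred = [math.exp(null_constant)] * len(wars)

print(pre(null_pred, model_pred, actual))
```

Passing error/duration ratios instead of raw absolute errors into the same ratio would give the second PRE variant described above.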

Reanalysis

We performed our replication and reanalysis of the Bennett and Stam data using the original dataset from Bennett and Stam (1996). Here, we reanalyze the data using Stata (the original analysis was performed in Limdep) so that we can report robust standard errors. We present significantly more data about model fit in Table 1 than in the original paper. We include the mean error, mean and median absolute error, and the standard deviation of those errors. Importantly, we report corrected estimates of error as a percentage of war length. We report full information for a naïve prediction model, which is the basis for all likelihood-ratio tests of the various component models, and the basis for all PRE assessments. Importantly, the new table adds information on our newly computed “proportional reduction in error” values based on absolute errors and error/actual duration. A new Stata command file for prediction is available on the authors’ websites.[1]

So how does the final complete model actually perform? Substantially better than any of the baseline (naïve, constant-only), regime-only, or realpolitik-only models. Nevertheless, there is still room for substantial improvement in the models’ forecasts of war duration. The mean prediction error indicates that the model systematically underestimates the length of wars. Focusing on the complete model (model 4), on average, we under-predict the length of wars by about 3 months (-3.2). This appears much better than the average error from a naïve model (model N1, which predicts essentially the mean duration) of -9.6 months. This figure alone is somewhat misleading, however, as over- and under-estimates of duration cancel out in this mean. Focusing on the mean of the absolute error in each war, we find that the most complete model yields a mean absolute error of 11 months, with the median absolute error being 4.5 months. The naïve model makes predictions that are off on average by nearly 14 months. The standard deviation of the absolute errors is quite high, however, at almost 17 months in model 4, indicating significant skew in the error distribution. This mirrors the significant skew in the underlying data. A majority of the model’s predictions fall closer than 11 months to each case’s true duration, but the model makes some quite large errors.

When we look at the mean absolute error as a percentage of war length, we see errors larger than reported in the original article. These range from 326% in the complete TVC model to 997% in the naïve model. To provide some context, an error of 997% would mean that the absolute error is on average nearly ten times the true duration of the war. This could occur if the true duration of a war was 1 month and the prediction was 10 months, or if the true duration was 24 months but the model predicted 240. As discussed above, this measure is misleadingly high, because if the prediction was 2 months and the true duration was 3 months, the 1-month error would yield a proportional error of 33%, while a prediction of 3 months given a 2-month reality would yield a proportional error of 50%. But even given the misleadingly high values, we see that the complete models (TVC or non-TVC) yield much lower error rates as a percentage of war length than the naïve or component models. In the complete model, the prediction error is (on average) 3.3 times the actual length of the war. The complete model is clearly an improvement over both the baseline naïve model and the other simpler ones.

We can now examine just how much of an improvement it is, using our new PRE measures of duration error. Starting with the final predictions, the complete model estimated on the data with time-varying covariates has a PRE in absolute error terms of .201. This indicates a 20% reduction in total error from the complete model relative to a naïve model prediction. In detail, there are 1079 months of error from the naïve model. There are 861 months of error in the predictions made by the complete model. The reduction of 218 error-months is 20.1% of 1079. If we look at the reduction in error as a proportion of actual war duration, we reduce error by 67% in the complete model. If we also look at the separate component models, each reduces prediction error from the naïve model, but the regime model does so just barely. It yields an error reduction much less than 1% in absolute terms, and half of the improvement of the complete model in terms of error/duration. The realpolitik model (which has all but the regime variables) reduces absolute error by 13.6% (PRE .136), although the PRE in terms of error/duration of .646 is quite close to the .673 reduction of the complete model. The complete model yields a clearly superior reduction in absolute error at 20%. Note that this is the largest absolute PRE by far of the models, even though the improvement in the average error as a percentage of war length dropped only from 353% to 326% compared to the realpolitik model (that small reduction explains the closeness of PRE in terms of error/duration).

Clearly, there remains much variation in war duration to explain. But at the same time, the complete model with TVCs is clearly the best model of those explored here in terms of making predictions of duration. Not only does it yield a significant increase in likelihood (via likelihood-ratio tests), but it yields a sizable improvement in error reduction. The reductions are quite similar in the model without TVCs.

[1] Two other minor changes have improved the replication data code. First, the original article reported that the “repression” measure was multiplied by -1 in order to get the valence of the effect correct. However, the previously released replication data set had not actually multiplied repression by -1, and as a result, the straight coefficient generated from the replication data set had the incorrect sign. The new prediction command file now includes appropriate code to make this switch. Second, the replication data set had scaled total population and total military personnel differently than when the data were used to produce the original Table 1. The new prediction command file includes appropriate rescaling for these variables.

Conclusions

In this paper, we have revisited the predictions of war duration from Bennett and Stam (1996), correcting some prior errors and introducing a new proportional reduction in error measure. This measure allows us to better assess how well different econometric models are doing at predicting war duration. Clearly, the full statistical model is an improvement over the naïve model or any component model, and a 20% reduction in error is significant. But while individual hypothesis tests indicate that the individual parameters are statistically significant, and likelihood-ratio tests indicate that the overall model provides a statistically significant improvement in fit to the data versus the null, much of the variation in war duration clearly remains to be explained in ongoing work. A PRE measure that avoids some of the issues with the measures used previously is a new tool we can use as we seek better prediction and explanation when it comes to understanding war duration.

Bibliography

Bennett, D. Scott, and Allan Stam. 1996. “The Duration of Interstate Wars, 1816-1985.” American Political Science Review 90:239-257.

Bennett, D. Scott, and Allan C. Stam. 2006. “Predicting the Length of the 2003 US-Iraq War: A Postwar Assessment.” Foreign Policy Analysis 2 (April):101-115.

Clodfelter, Michael. 1992. Warfare and Armed Conflicts. 2 vols. Jefferson, NC: McFarland and Co.

Dupuy, R. Ernest, and Trevor N. Dupuy. 1986. The Encyclopedia of Military History From 3500 BC to the Present. 2nd revised edition. New York: Harper and Row.

Filson, Darren, and Suzanne Werner. 2002. “A Bargaining Model of War and Peace: Anticipating the Onset, Duration, and Outcome of War.” American Journal of Political Science 46 (4): 819-837.

Filson, Darren, and Suzanne Werner. 2004. “Bargaining and Fighting: The Impact of Regime Type on War Onset, Duration, and Outcomes.” American Journal of Political Science 48 (2): 296-313.

Langlois, Catherine, and Jean-Pierre Langlois. 2008. “Does Attrition Behavior Help Explain the Duration of Interstate Wars?” Manuscript.

Slantchev, Branislav L. 2004. “How Initiators End Their Wars: The Duration of Warfare and the Terms of Peace.” American Journal of Political Science 48 (4): 813-829.

Stam, Allan C. III. 1996. Win, Lose or Draw: Domestic Politics and the Crucible of War. Ann Arbor: University of Michigan Press.

Note: We would like to thank Professor Catherine Langlois, Georgetown University, for bringing to our attention the mistake in computing error as a percentage of war length, and for useful conversations concerning PRE.

Table 1. Revised and Expanded War Duration Hazard Model Coefficient and Prediction Estimates

| Variable | Model N1: Naïve model, TVC | Model 1: VT model, TVC | Model 2: Realpolitik, TVC | Model 3: Regime, TVC | Model 4: Complete, TVC | Model N2: Naïve model, non-TVC | Model 5: Complete, non-TVC |
|---|---|---|---|---|---|---|---|
| Constant | 2.39 (0.193)** | 2.48 (0.678) | 2.46 (1.10) | 1.75 (0.634) | 2.252 (1.165) | 2.358 (0.197) | 1.264 (1.202) |
| Realpolitik | | | | | | | |
| Strategy: OADM | -- | -- | 2.484 (0.531)** | -- | 2.759 (0.507)** | -- | 2.874 (0.588)** |
| Strategy: OADA | -- | -- | 3.254 (0.510)** | -- | 3.203 (0.496)** | -- | 3.330 (0.638)** |
| Strategy: OADP | -- | -- | 7.016 (1.418)** | -- | 6.285 (1.417)** | -- | 6.283 (1.771)** |
| Strategy: OPDA | -- | -- | 11.596 (2.454)** | -- | 10.991 (2.323)** | -- | 7.990 (2.587)** |
| Terrain | -- | -- | 6.618 (2.971)* | -- | 5.062 (2.865) | -- | 2.987 (3.552) |
| Terrain x Strategy | -- | -- | -2.026 (0.785)** | -- | -1.703 (0.746)* | -- | -1.242 (0.939) |
| Balance of Forces | -- | -- | -5.027 (1.276)** | -- | -4.756 (1.225)** | -- | -3.981 (1.233)** |
| Total Mil. Personnel | -- | -- | 0.061 (0.027)** | -- | 0.124 (0.037)** | -- | 0.273 (0.125)* |
| Total Population | -- | -- | 0.751 (0.625) | -- | 0.707 (0.540) | -- | 0.162 (0.791) |
| Population Ratio | -- | -0.024 (0.014) | 0.001 (0.011) | -- | 0.007 (0.012) | -- | 0.007 (0.015) |
| Quality Ratio | -- | -- | 0.013 (0.011) | -- | 0.010 (0.009) | -- | 0.001 (0.001) |
| Surprise | -- | -- | -0.123 (0.651) | -- | -0.203 (0.524) | -- | -0.219 (0.652) |
| Salience | -- | -- | 0.336 (0.231) | -- | 0.420 (0.201)* | -- | 0.427 (0.212)* |
| Regime | | | | | | | |
| Repression | -- | -- | -- | -0.281 (0.180)** | -0.200 (0.111) | -- | -0.246 (0.127) |
| Democracy | -- | -- | -- | -0.130 (0.080)** | -0.100 (0.053) | -- | -0.118 (0.059)* |
| Other Approaches | | | | | | | |
| Previous Disputes | -- | -- | -- | -- | 0.008 (0.053) | -- | 0.016 (0.059) |
| Number of States | -- | 0.064 (0.077) | -- | -- | -0.190 (0.092)* | -- | -0.135 (0.100) |
| Year | -- | -0.001 (0.004) | -- | -- | -- | -- | -- |
| p (duration param.) | 0.629 (0.044) | 0.629 (0.046) | 0.907 (0.074) | 0.629 (0.045) | 0.965 (0.082) | 0.621 (0.043) | 0.942 (0.074) |
| Log-Likelihood | -157.55 | -156.38 | -127.49 | -155.86 | -124.83 | -156.2 | -126.1 |
| Mean Error (months) | -9.6 | -9.5 | -3.1 | -9.4 | -3.2 | -9.7 | -4.2 |
| SD of Mean Error | 24.8 | 24.8 | 21.4 | 24.7 | 19.9 | 24.9 | 18.7 |
| Mean Abs. Error | 13.8 | 13.7 | 11.9 | 13.8 | 11.0 | 13.8 | 11.1 |
| SD of Abs. Error | 22.7 | 22.7 | 18.0 | 22.5 | 16.8 | 22.9 | 15.6 |
| Median Error | 1.1 | 0.9 | 0.04 | 0.48 | -0.3 | 0.9 | 0.0006 |
| Median Abs. Error | 5.1 | 4.8 | 4.6 | 4.1 | 4.5 | 4.9 | 4.4 |
| Mean Abs. Error as % of War Length | 997% | 879% | 353% | 667% | 326% | 929% | 286% |
| PRE (abs. error) | 0 | 0.007 | 0.136 | 0.0002 | 0.201 | 0 | 0.195 |
| PRE (abs. error as % length) | 0 | 0.118 | 0.646 | 0.331 | 0.673 | 0 | 0.692 |
| Number of Wars | 78 | 78 | 78 | 78 | 78 | 77 | 77 |
| No. of Data Points (War-Years) | 169 | 169 | 169 | 169 | 169 | 77 | 77 |

* p < 0.05 ** p < 0.01

Notes: Robust standard errors in parentheses. Significance tests are two-tailed. TVC indicates “time-varying covariate” model, vs. the non-TVC model with one observation per war.
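As a rough consistency check on the PRE (abs. error) row of Table 1, the Python sketch below recomputes PRE from the rounded figures reported in the paper. It relies on one observation that is not stated in the text: when the naïve model and the covariate model are scored on the same number of wars, the number of wars cancels and PRE equals 1 minus the ratio of mean absolute errors. The function names are hypothetical, and small differences from the reported values are expected because the inputs are rounded.

```python
def pre_from_totals(null_total, model_total):
    """PRE from summed absolute errors (error-months)."""
    return (null_total - model_total) / null_total


def pre_from_means(null_mean, model_mean):
    """PRE from mean absolute errors; valid when both means cover the same wars."""
    return 1.0 - model_mean / null_mean


# Totals quoted in the text for the naive and complete TVC models.
print(round(pre_from_totals(1079, 861), 3))  # 0.202

# (naive mean abs. error, model mean abs. error, reported PRE) from Table 1.
table_rows = {
    "Model 1": (13.8, 13.7, 0.007),
    "Model 2": (13.8, 11.9, 0.136),
    "Model 3": (13.8, 13.8, 0.0002),
    "Model 4": (13.8, 11.0, 0.201),
    "Model 5": (13.8, 11.1, 0.195),
}

for name, (null_mean, model_mean, reported) in table_rows.items():
    recomputed = pre_from_means(null_mean, model_mean)
    print(name, round(recomputed, 3), reported)
```

The recomputed values agree with the reported PRE figures to within rounding of the tabulated means.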
