Improving Macroeconomic Model Validity and Forecasting Performance via Pooled Country Data

Cameron Fen and Samir Undavia∗

August 29, 2021

We show that pooling countries across a panel dimension of macroeconomic data can significantly and statistically improve the generalization ability of structural and reduced-form models, as well as enable machine learning methods to produce state-of-the-art results and improve other macroeconomic forecasting models. Using GDP forecasts evaluated on an out-of-sample test set, this procedure reduces root mean squared error (RMSE) by 12% across horizons and models for certain reduced-form models and by 24% across horizons for dynamic stochastic general equilibrium (DSGE) models. Removing US data from the training set and forecasting out-of-sample country-wise, we show that both reduced-form and structural models become more policy invariant and outperform a baseline model that uses US data only. Finally, given the comparative advantage of "nonparametric" machine learning forecasting models in a data-rich regime, we demonstrate that our recurrent neural network (RNN) model and automated machine learning (AutoML) approach outperform all baseline economic models in this regime. Robustness checks indicate that the machine learning outperformance is reproducible, numerically stable, and generalizes across models.

JEL Codes: E27, C45, C53, C32
Keywords: Neural Networks, Policy Invariance, GDP Forecasting, Pooled Data

∗Corresponding author Cameron Fen is a PhD student at the University of Michigan, Ann Arbor, MI, 48104 (E-mail: [email protected]. Website: cameronfen.github.io.). Samir Undavia is an ML and NLP engineer (E-mail: [email protected]).
The authors thank Xavier Martin Bautista, Thorsten Drautzburg, Florian Gunsilius, Jeffrey Lin, Daniil Manaenkov, Matthew Shapiro, Eric Jang, and the member participants of the European Winter Meeting of the Econometric Society, the ASSA conference, the Midwestern Economic Association conference, the EconWorld conference, and the EcoMod conference for helpful discussion and/or comments. We would also like to thank Nimarta Kumari for research assistance assembling the DSGE panel data set and AI Capital Management for computational resources. All errors are our own.

I. Introduction

A central problem in both reduced-form and structural macroeconomics is the dearth of underlying data. For example, GDP is a quarterly series that only extends back to around the late 1940s, yielding roughly 300 timesteps. Generalization and external validity are therefore pertinent problems for these models. In forecasting, this problem is partially addressed by using simple linear models. In structural macroeconomics, the use of micro-founded parameters and Bayesian estimation attempts to improve generalization, with limited effect. More flexible and nonparametric models would likely produce more accurate forecasts, but with limited data this avenue is not available. However, pooling data across many different countries allows economists to forecast, and even to estimate larger structural models, with both better external validity and better forecasting performance, without compromising internal validity or model design.

We show that pooling US (or other single-country) data with data from other countries, in conjunction with large DSGE models and machine learning, leads to improvements in external validity and economic forecasting. This pooling approach adds more data rather than more covariates, and leads to consistent and significant improvements in external validity as measured by timestep out-of-sample forecasts and other metrics.
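The mechanics of the pooling approach can be sketched in a few lines. The sketch below fits an AR(2) by least squares on one short series versus on a stacked pool of series, then compares one-step-ahead RMSE out of sample. The synthetic AR(2) data, the 20-country pool, and every name here are illustrative stand-ins, not the paper's actual data or code:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_series(T, phi=(0.5, 0.2)):
    """Simulate a stationary AR(2) series as a stand-in for GDP growth."""
    y = np.zeros(T + 2)
    for t in range(2, T + 2):
        y[t] = phi[0] * y[t - 1] + phi[1] * y[t - 2] + rng.standard_normal()
    return y[2:]

def ar2_design(y):
    """Regressors (y_{t-1}, y_{t-2}) and targets y_t for an AR(2) fit."""
    return np.column_stack([y[1:-1], y[:-2]]), y[2:]

us = make_series(80)                         # one short "home country" series
pool = [make_series(80) for _ in range(20)]  # panel of 20 synthetic countries

# Single-country least-squares fit.
X_us, t_us = ar2_design(us)
beta_us, *_ = np.linalg.lstsq(X_us, t_us, rcond=None)

# Pooled fit: simply stack the design matrices of all countries.
X_pool = np.vstack([ar2_design(y)[0] for y in pool])
t_pool = np.concatenate([ar2_design(y)[1] for y in pool])
beta_pool, *_ = np.linalg.lstsq(X_pool, t_pool, rcond=None)

# Compare one-step-ahead RMSE on a fresh series from the same process.
X_te, t_te = ar2_design(make_series(200))
rmse = lambda b: np.sqrt(np.mean((X_te @ b - t_te) ** 2))
print(f"single-country RMSE {rmse(beta_us):.3f}, pooled RMSE {rmse(beta_pool):.3f}")
```

With a common data generating process, the pooled coefficients typically land much closer to the true (0.5, 0.2) than the short single-country fit; the paper's claim is that real country panels are similar enough for an analogous gain.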
For example, our data goes from 211 timesteps of US data for our AutoML model to 2581 timesteps over all countries in our pooled data. This not only leads to significant improvements in forecasting for standard models, but also allows more flexible models to be used without overfitting. Pooling a panel of countries also leads to parameters that better fit the underlying data generating process, almost for free: nothing changes in the equations governing the models, only the interpretation of the parameters, from single-country parameters to average world parameters. Even in this case, we show that the stability of parameters over space (across countries) may be better than the alternative of going back further in time for more single-country data.

A central theme throughout this paper is that more flexible models benefit more from pooling. We start with linear models as a baseline; even in this case, pooling improves performance. Estimating traditional reduced-form models, namely AR(2) (Walker, 1931), VAR(1), and VAR(4) models (Sims, 1980), we show that we can reduce RMSE by an average of 12% across horizons and models.

Outside of pure forecasting, analysis of our pooling procedure across models suggests that it improves external validity in other ways, making models more policy/regime invariant. To show this, we estimate both linear and nonlinear models on data from all countries except the target country being forecasted. Thus, our test data is not only timestep out-of-sample, which we implement in all our forecast evaluations, but also country out-of-sample, which we add to illustrate generalizability. Across most models and forecasting horizons, these out-of-sample forecasts outperform the typical procedure of estimating models only on the data of the country of interest. This time and country out-of-sample analysis recovers about 80% of the improvement gained from moving all the way to forecasting with the full data set.
We think this provides suggestive evidence that this data augmentation scheme can help make reduced-form as well as more nonlinear models more policy invariant. Moving to a more flexible model, we proceed to apply our panel data augmentation scheme to the Smets-Wouters structural 45+ parameter DSGE model (Smets and Wouters, 2007). Our augmentation statistically improves the generalization of this model by an average of 24% across horizons. We again test our model in a country out-of-sample manner and show improvements over single-country baselines. This suggests that even DSGE models are not immune to the Lucas critique, and that the use of country panel data can improve policy/country invariance. Given the consistent improvements across all horizons and reduced-form models, we are confident this approach will generalize to the estimation of other structural models. We advocate applying this approach to calibration, generalized method of moments (Hansen and Singleton, 1982), and Bayesian estimation (Fernández-Villaverde and Rubio-Ramírez, 2007), where the targets are moments from a cross-section of countries instead of just one region such as the United States.

Recognizing that this augmentation increases the effective number of observations by a factor of 10, we also demonstrate that pooling can overcome overfitting in "nonparametric" machine learning models, which can now outperform traditional forecasting models in this high-data regime. This is in line with the trend of larger models having a comparative advantage in terms of forecasting improvement given more data. We use two different algorithms. The first, AutoML, runs a horse race with hundreds of machine learning models to determine the best-performing model on the validation set, which is ultimately evaluated on the test set. We view AutoML as a proxy for strong machine learning performance, but expect that individual model tuning could improve performance even more.
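A horse race in the spirit of AutoML reduces to: fit each candidate on the training split, select the one with the lowest validation RMSE, and only then score it on the held-out test split. The sketch below races three AR(p) candidates on synthetic data; the candidates, split points, and `fit_eval` helper are illustrative assumptions, far simpler than an actual AutoML system:

```python
import numpy as np

rng = np.random.default_rng(2)
y = np.zeros(302)
for t in range(2, 302):
    y[t] = 0.5 * y[t - 1] + 0.2 * y[t - 2] + rng.standard_normal()
y = y[2:]                      # 300 synthetic "quarters"

TRAIN_END, VAL_END = 200, 250  # train | validation | test split points

def lag_matrix(y, p):
    """Regressors (y_{t-1}, ..., y_{t-p}) and targets y_t for an AR(p)."""
    X = np.column_stack([y[p - 1 - j : len(y) - 1 - j] for j in range(p)])
    return X, y[p:]

def fit_eval(p, eval_start, eval_end):
    """Fit AR(p) on the training split; report RMSE on [eval_start, eval_end)."""
    X, t = lag_matrix(y, p)    # row i of (X, t) corresponds to time p + i
    beta, *_ = np.linalg.lstsq(X[: TRAIN_END - p], t[: TRAIN_END - p], rcond=None)
    ev = slice(eval_start - p, eval_end - p)
    return np.sqrt(np.mean((X[ev] @ beta - t[ev]) ** 2))

candidates = {f"AR({p})": p for p in (1, 2, 4)}
val_rmse = {name: fit_eval(p, TRAIN_END, VAL_END) for name, p in candidates.items()}
best = min(val_rmse, key=val_rmse.get)                   # selection on validation only
test_rmse = fit_eval(candidates[best], VAL_END, len(y))  # touch the test set once
print(best, round(test_rmse, 3))
```

The design choice worth noting is that the test split is consulted exactly once, after selection, so the reported test RMSE is not contaminated by the model search itself.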
Because different models perform best under the low-data (US) regime and the high-data (panel) regime under AutoML, we also test an RNN to show the improvement of a single model under both data regimes. The model's improvement indicates that while these approaches are competitive in the low-data regime, machine learning methods consistently outperform baseline economic models (VAR(1), VAR(4), AR(2), Smets-Wouters, and factor models) in the high-data regime. Furthermore, while some of the baseline models use a cross-section of country data, we use only three covariates: GDP, consumption, and unemployment lags. In contrast, the DSGE model uses 11 different covariates, the factor model uses 248, and the Survey of Professional Forecasters (SPF; begun in 1968) uses just as many covariates along with real-time data, suggesting our machine learning models still have room for improvement. Over most horizons, our model approaches the SPF median forecast performance, albeit evaluated on 2020 vintage data (see Appendix C), and outperforms the SPF benchmark at 5 quarters ahead. Moreover, the outperformance of our model over the SPF benchmark is noteworthy because the SPF is an ensemble of both models and human decision makers, and many recessions, like the recent COVID recession, are not generally predicted by models.

The paper proceeds as follows: Section II. reviews the literature on forecasting and recurrent neural networks and describes how our paper merges these two fields; Section III. discusses feedforward neural networks, linear state space models, and gated recurrent units (Cho et al., 2014); Section V. describes the data; Section III.B. briefly mentions our model architecture; Section IV. discusses the benchmark economic models and the SPF that we compare our model to; Section VI. and Appendix G.2 provide the main results and robustness checks; and Section VII. concludes the paper.

II. Literature Review

As our pooling technique leads to larger datasets, this creates an opportunity to either increase the parameter count in