
The Effect of Online Advertising in a Digital World: Predicting Website Visits with Dynamic Regression

Martin Björklund & Felix Hasselblad

Bachelor’s thesis in

Advisors: Tatjana Pavlenko & Henrik Feldt

2021

Acknowledgements

We would like to thank Tatjana Pavlenko for her dedication and enthusiasm in helping us produce this thesis, as well as for her statistical advice.

Thank you to Henrik Feldt, for giving us the opportunity to write our thesis in collaboration with Logary Analytics and for helping us formulate a relevant research question.

We would also like to thank Maximillian Mantei for his input regarding the accuracy of the models created in the thesis.

Abstract

The goal of the thesis is to accurately predict future values of a company's website visits and to estimate the uncertainty of those predictions. To achieve this, a dynamic regression model with an ARIMA error term is considered, using advertisement spending with lags and dummy variables for Black Friday and weekdays as predictors.

After dividing the data into a training set and a test set, the order of the ARIMA error term is specified using the Box-Jenkins methodology. The initial model is then run through a backward elimination algorithm, which selects two models based on the Akaike Information Criterion and Bayes Information Criterion. As expected, the model selected using Bayes Information Criterion is more conservative in its choice of variables than the model specified using the Akaike Information Criterion. The forecasts made on the test set are complemented with normal and bootstrap-based intervals in order to estimate the uncertainty of the predictions. These are then compared to the forecasts made using a simpler model, consisting of only the ARIMA error term.

The thesis concludes that the dynamic regression models are twice as accurate as the simpler model and that they were on average off by 14% from the actual values. The prediction intervals for the dynamic regression models are slightly too pessimistic as they overstate the uncertainty of the model by about 10 percentage points in the 80% prediction interval and by 5 percentage points in the 95% prediction interval.

There is no practical discrepancy in prediction power between the model selected using the Akaike Information Criterion and the one using Bayes Information Criterion. The accuracy of the prediction intervals is higher than in the simpler model, even though both dynamic regression models have more residual autocorrelation.

Contents

Acknowledgements

Abstract

Abbreviations

List of Figures

List of Tables

1 Introduction

2 Theory and Methodology
  2.1 The ARIMA Error Term
    2.1.1 Stationarity of the Dependent Variable
    2.1.2 Choosing the Order of the ARIMA Error Term
    2.1.3 Estimation
    2.1.4 Residual Diagnostics
  2.2 Dynamic Regression
    2.2.1 Stationarity of the Predictors
    2.2.2 Model Specification with Stepwise Selection and Information Criteria
    2.2.3 Parameter Estimation
    2.2.4 Residual Diagnostics
  2.3 Forecasting
    2.3.1 Prediction Intervals
    2.3.2 Model Evaluation

3 Data
  3.1 Data Pre-Processing
    3.1.1 Data Partitioning
  3.2 Final Data
  3.3 Selecting the Number of Lags for the Predictors

4 Results
  4.1 Stationarity of All Variables
  4.2 Determining the ARIMA Error
  4.3 Model Selection with Backward Elimination and Information Criteria
  4.4 Model Evaluation Using Residuals
  4.5 Parameter Evaluation
  4.6 Forecasting

5 Discussion
  5.1 Suggestions for Future Studies
  5.2 Alternative Methods

References

A Appendix

Abbreviations

ACF    Autocorrelation function
ADF    Augmented Dickey-Fuller
AIC    Akaike information criterion
AICc   Corrected Akaike information criterion
ARIMA  Autoregressive integrated moving average
BIC    Bayes information criterion
MAE    Mean absolute error
MAPE   Mean absolute percentage error
PACF   Partial autocorrelation function
RMSE   Root mean square error

List of Figures

1 Time series plot and histogram of the sessions variable
2 Time series plots showing the amount of money spent on gAds and gAds1
3 Time series plot showing the amount of money spent on Facebook advertising
4 Time series plot showing the amount of money spent on Pinterest advertising
5 Time series graph and correlogram for AR(1) residuals
6 Values of AIC and AICc for the best models according to the backward elimination algorithm
7 Values of BIC for the best models according to the backward elimination algorithm
8 Histogram of the standardized residuals from the AIC model along with the PDF of a standard normal distribution
9 Histogram of the standardized residuals from the AIC model along with the PDF of a standard normal distribution
10 AIC model forecasts vs actual
11 BIC model forecasts vs actual
12 ACF for the residuals of the estimated models
13 Q-Q plot for AIC and BIC models
14 Standardized residuals from the AR(1) model
15 Forecast made using the AR(1) model

List of Tables

1 Descriptive statistics for the sessions variable (training data)
2 Results from ADF tests
3 Ljung-Box tests
4 Precision of prediction intervals
5 Error estimates
6 Parameter estimates and standard errors for the AIC and BIC models

1 Introduction

Logary Analytics is a company that specializes in evaluating the effects of advertising on revenue. Their customers are other companies that want to get the most out of their investments in advertising.

The purpose of this paper is to propose a dynamic regression model for predicting the number of website visits (referred to as sessions) over time, using the amount of money spent on advertising on different platforms and calendar dummies.

The data is partitioned into a training set and a test set. Model selection is performed on the training set using backward elimination based on the Akaike Information Criterion, the Corrected Akaike Information Criterion and Bayes Information Criterion. The models' predictive accuracy is then evaluated using the test set.

A dynamic regression model allows the inclusion of lagged versions of predictor variables while also controlling for time effects in the dependent variable. The ARIMA error term adjusts for any time-dependencies not already included in the model and corrects for possible omitted variables through autocorrelation correction with lags for the dependent variable. The reason that this is preferable to estimation based on a simple ARIMA model alone is that the joint analysis can explain more variation in the data, and the prediction error will therefore be smaller. This has been noted empirically by, for example, Kongchaoren & Kruanpradit (2013), Tsui et al. (2014) and Anners (2017), where a dynamic regression model outperformed a simpler ARIMA model in forecasting. The novelty of this study lies in applying the methods to a new data set.

The research questions are therefore:

• How well can website visits per day be predicted using a dynamic regression model with advertising predictor variables chosen via backward elimination?
• Which information criterion used during backward elimination provides the best model for prediction?
• How well can prediction uncertainty be estimated using normality and bootstrap-based prediction intervals?
• When evaluated on the test data, how do the error measurements and the prediction intervals of the dynamic regression models compare to those of an ARIMA model?

The paper is organized as follows: the next section covers the methodology of the thesis, as well as the theory behind the methods used. Section 3 describes the data. Section 4 presents the results. Finally, section 5 concludes the thesis, followed by the appendix.

2 Theory and Methodology

A dynamic regression model can be defined as a model on the form:

yt = β0 + β1x1,t + ··· + βgxg,t + ηt    (1)

where yt is the outcome variable at time t, β0, . . . , βg are coefficients, x1, . . . , xg are the predictor variables, g is the number of predictor variables, and ηt is an ARIMA error term for yt.

Section 2.1 describes how the ARIMA error term is selected for the dynamic regression model. Section 2.2 describes how the dynamic regression model is created, given an ARIMA error term and section 2.3 is concerned with how the results from the final model are evaluated and how forecasts and prediction intervals are constructed.

2.1 The ARIMA Error Term

This section describes the theory behind ARIMA models, a type of model used as an error term in dynamic regression.

ARIMA modeling is a type of statistical modeling dedicated to determining how the value of a variable relates to previous values of said variable. It consists of autoregressive and moving average terms and can also be differenced. ARIMA models were developed from the work of Norbert Wiener (1949).

An autoregressive time series model of order p is specified as (Hyndman and Athanasopoulos, 2018, Ch. 8.3):

yt = c + φ1yt−1 + φ2yt−2 + ··· + φpyt−p + εt    (2)

where p < t, c is a constant term and φ1, . . . , φp are coefficients. The process has a drift if c ≠ 0.

A moving average time series model specifies that the modelled variable is dependent on the past values of a stochastic error term. It can intuitively be thought of as a model where

future values are predicted based on how wrong the previous estimates were. Hyndman & Athanasopoulos (2018, Ch. 8.4) specify a moving average of order q as:

yt = c + εt + θ1εt−1 + ··· + θqεt−q    (3)

where q < t, εt is the true error of the model at time t and θ1, . . . , θq are the parameters of the MA(q) data generating process. The process has a drift if the constant c ≠ 0.

A model with these parts combined is an ARMA(p, q) model. If differencing of order d is applied, the model is known as an ARIMA(p, d, q) model where the I stands for integrated. Differencing is discussed under section 2.1.1.

This thesis uses the Box-Jenkins approach to time series modeling, which was introduced by Box & Jenkins (1970). It can be broken down into the following steps:

1. Transformation of the data to ensure it is stationary.
2. Model identification, using the ACF and PACF. The values of p and q are determined.
3. Estimation of the parameters in the model. In this thesis, maximum likelihood is used.
4. Residual checking to ensure that there is no autocorrelation left. The residuals should resemble white noise. A Ljung-Box test may be used to test for autocorrelation in the error term after model fitting.

Each step is described in more detail in the following sections.

2.1.1 Stationarity of the Dependent Variable

To create an ARMA model, stationarity is required. Stationarity can roughly be defined as when the properties of the time series are the same no matter when it is observed (Hyndman and Athanasopoulos, 2018, Ch. 8.1). This means that a seasonal time series, or one with a trend, is not stationary. In this thesis, stationarity refers to weak stationarity, which is defined as when a stochastic process holds the following properties (Cryer and Chan, 2008):

E(yt) = µ for all t, and
γt,t−k = γ0,k for all t and all lags k < t    (4)

where γt,t−k is the covariance between observations t and t − k. If a trend or unit root is present in the data, one can transform the process to make it stationary by differencing it. The first difference is obtained by subtracting the value of the previous observation from each observation, as in Equation 5.

y′t = yt − yt−1.    (5)

Augmented Dickey-Fuller Test

To test for stationarity, one can use the Dickey-Fuller test for unit roots, developed by Dickey and Fuller (1979). A variant of this is the Augmented Dickey-Fuller (ADF) test, developed by Dickey and Said (1984), which is used in this thesis. It is similar to the Dickey-Fuller test but is more flexible. The test procedure works as follows (Cryer and Chan, 2008, p. 128). Consider the process:

yt = αyt−1 + xt for t = 1, 2, . . . , T    (6)

where xt is a stationary process. yt is stationary if |α| < 1 and not stationary if |α| = 1. If xt is assumed to be an AR(p) process, xt = φ1xt−1 + ··· + φpxt−p + εt, then xt = yt − yt−1 under the null hypothesis that α = 1. If we then let a = α − 1:

yt − yt−1 = (α − 1)yt−1 + xt
          = ayt−1 + φ1xt−1 + ··· + φpxt−p + εt    (7)
          = ayt−1 + φ1(yt−1 − yt−2) + ··· + φp(yt−p − yt−p−1) + εt

so that a = 0 under the Augmented Dickey-Fuller test's null hypothesis that the process contains a unit root, and a < 0 under the alternative hypothesis that it does not contain a unit root.

The hypotheses of the ADF test are:

H0 : The process contains a unit root, a = 0.

H1 : The process does not contain a unit root, a < 0.

There are three versions of the test, which allow for a constant term and a trend, by adding a constant c and/or the term βt to the right hand side in Equation 7. To perform the test, the number of lags used must first be determined, using the method described in section 2.1.2.

If â is the estimated value of a, obtained using the method described in section 2.1.3, and σ̂â is its standard error, the test statistic is defined as:

ADF = â / σ̂â.    (8)

The statistic does not follow any known distribution under the null. However, critical values have been calculated and are incorporated into the R function adf.test. The null hypothesis is rejected if the observed ADF statistic is smaller than the critical value at the 5% significance level.

When performing multiple tests, the multiple testing problem can arise. This means that the probability of getting at least one type 1 error is inflated from α to 1 − (1 − α)^n, where n is the number of tests performed.
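The mechanics of the test can be sketched in a few lines. The following is a minimal illustration, not the thesis's R code (which relies on adf.test): it fits the regression Δyt = a·yt−1 + et without augmentation lags or a constant, and compares the resulting t-statistic for a simulated stationary AR(1) series and a simulated random walk.

```python
import random

def dickey_fuller_stat(y):
    """t-statistic of a-hat in the regression dy_t = a * y_{t-1} + e_t.
    Strongly negative values are evidence against a unit root (a = 0)."""
    dy = [y[t] - y[t - 1] for t in range(1, len(y))]
    ylag = y[:-1]
    sxx = sum(x * x for x in ylag)
    a_hat = sum(x * d for x, d in zip(ylag, dy)) / sxx
    resid = [d - a_hat * x for x, d in zip(ylag, dy)]
    s2 = sum(r * r for r in resid) / (len(dy) - 1)
    return a_hat / (s2 / sxx) ** 0.5

random.seed(1)
stationary, walk = [0.0], [0.0]
for _ in range(500):
    e = random.gauss(0, 1)
    stationary.append(0.5 * stationary[-1] + e)  # AR(1) with |phi| < 1
    walk.append(walk[-1] + e)                    # unit root process

print(dickey_fuller_stat(stationary))  # strongly negative
print(dickey_fuller_stat(walk))        # near zero
```

In practice the augmented version adds lagged differences and a constant (and possibly a trend) to this regression, and the statistic is compared against the tabulated Dickey-Fuller critical values rather than a t-distribution.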

2.1.2 Choosing the Order of the ARIMA Error Term

The values of p and q of the ARIMA error terms are chosen by looking at plots of the ACF (autocorrelation function) and the PACF (partial autocorrelation function). A plot of the sample autocorrelation function simply shows Corr(yt, yt−k).

The partial autocorrelation function, on the other hand, can be interpreted as the correlation between yt and yt−k after removing the effect of yt−1, yt−2, . . . , yt−k+1. Assuming that yt is a normally distributed time series, it is written formally as (Cryer and Chan, 2008):

ψkk = Corr(yt, yt−k | yt−1, . . . , yt−k+1).    (9)

For an AR(p) process without seasonality, ψkk = 0 when k > p. This means that one can use the PACF plot to determine the order of an autoregressive model. p is based on the number of consecutive significant lags in the PACF.

For an MA(q) process, the ACF is zero for lags k > q. This means that an ACF plot can be used to determine the order of a moving average model: q is based on the number of consecutive significant lags in the ACF.
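The cutoff pattern in the PACF can be made concrete with a short sketch. The sample PACF is computed from the sample ACF via the Durbin-Levinson recursion; this is a simplified stand-in for R's pacf function, applied to a simulated AR(1) series whose PACF should be large at lag 1 and near zero beyond it.

```python
import random

def acf(y, k):
    """Sample autocorrelation at lag k."""
    n = len(y)
    m = sum(y) / n
    c0 = sum((v - m) ** 2 for v in y) / n
    ck = sum((y[t] - m) * (y[t - k] - m) for t in range(k, n)) / n
    return ck / c0

def pacf(y, kmax):
    """Sample PACF for lags 1..kmax via the Durbin-Levinson recursion."""
    r = [acf(y, k) for k in range(kmax + 1)]
    phi = [[0.0] * (kmax + 1) for _ in range(kmax + 1)]
    phi[1][1] = r[1]
    for k in range(2, kmax + 1):
        num = r[k] - sum(phi[k - 1][j] * r[k - j] for j in range(1, k))
        den = 1 - sum(phi[k - 1][j] * r[j] for j in range(1, k))
        phi[k][k] = num / den
        for j in range(1, k):
            phi[k][j] = phi[k - 1][j] - phi[k][k] * phi[k - 1][k - j]
    return [phi[k][k] for k in range(1, kmax + 1)]

random.seed(2)
y = [0.0]
for _ in range(2000):
    y.append(0.7 * y[-1] + random.gauss(0, 1))

p = pacf(y, 5)  # large at lag 1, close to zero at lags 2..5
```

In a plot, lags whose sample PACF falls outside roughly ±2/√T would be flagged as significant, and p is read off from the last of the consecutive significant lags.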

2.1.3 Estimation

Estimation of the parameters in an ARIMA model is done using maximum likelihood. Maximum likelihood estimation returns the parameter estimates that maximize the likelihood given the data.

There are two distinct methods that are used for parameter estimation: the exact maximum likelihood, which is computed numerically, and the conditional maximum likelihood (Hamilton, 1994, p. 125). Estimation of the parameters in an ARIMA model, as well as in a dynamic regression model, is performed using conditional maximum likelihood estimation. Here, an AR(1) process is used as an example to explain conditional maximum likelihood estimation.

Consider a Gaussian AR(1) process, defined as:

yt = c + φyt−1 + εt    (10)

where the errors are assumed to be normally, independently and identically distributed (εt ∼ NIID(0, σ²)). We can let the vector of estimated parameters be A = (c, φ, σ²). Since the process is Gaussian, the density of the first observation can be written as:

fy1(y1; A) = fy1(y1; c, φ, σ²)
           = (1 / (√(2π) √(σ²/(1 − φ²)))) exp[ −(y1 − c/(1 − φ))² / (2σ²/(1 − φ²)) ].    (11)

If we let fyT ,yT −1,...,y2|y1 (yT , yT −1, . . . , y2|y1; A) represent the joint density of observations yT , yT −1, . . . , y2, given observation y1, their likelihood can be calculated as:

fyT,yT−1,...,y2|y1(yT, yT−1, . . . , y2 | y1; A) = ∏_{t=2}^{T} fyt|yt−1(yt | yt−1; A).    (12)

In practice, the natural logarithm of the likelihood, the log likelihood, is often used instead of the likelihood, as it is easier to differentiate. To find the maximum likelihood estimates of the parameters in A, one must maximize:

log fyT,yT−1,...,y2|y1(yT, yT−1, . . . , y2 | y1; A)
  = −((T − 1)/2) log(2π) − ((T − 1)/2) log(σ²) − ∑_{t=2}^{T} (yt − c − φyt−1)² / (2σ²).    (13)

The values of A that maximize Equation 13 are the conditional maximum likelihood estimates. This is later generalized to dynamic regression modeling by simply including predictors and parameters in the same way as φ and lagged versions of y in the expression.

Even if the process is non-Gaussian, the estimates that maximize the Gaussian log likelihood provide consistent estimates of the parameters (Hamilton, 1994, p. 126).
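Equation 13 is straightforward to evaluate numerically. The sketch below is an illustration rather than the thesis's estimation code: it computes the conditional log likelihood of a simulated Gaussian AR(1) series and checks that it is higher at the true parameter values than at an arbitrary wrong guess; an optimizer would search over (c, φ, σ²) for the maximum.

```python
import math
import random

def ar1_cond_loglik(y, c, phi, sigma2):
    """Conditional log likelihood of a Gaussian AR(1), as in Equation 13:
    condition on y_1 and sum the log densities of y_2, ..., y_T."""
    T = len(y)
    sse = sum((y[t] - c - phi * y[t - 1]) ** 2 for t in range(1, T))
    return (-(T - 1) / 2 * math.log(2 * math.pi)
            - (T - 1) / 2 * math.log(sigma2)
            - sse / (2 * sigma2))

random.seed(3)
y = [0.0]
for _ in range(300):
    y.append(1.0 + 0.6 * y[-1] + random.gauss(0, 1))

ll_true = ar1_cond_loglik(y, c=1.0, phi=0.6, sigma2=1.0)
ll_wrong = ar1_cond_loglik(y, c=0.0, phi=0.0, sigma2=1.0)
```

Because σ² enters only through the last two terms, for fixed c and φ the maximizing σ² is simply SSE/(T − 1), which is why conditional maximum likelihood for an AR model closely resembles least squares.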

2.1.4 Residual Diagnostics

The final step of the Box-Jenkins methodology is to determine whether or not the residuals resemble a white noise process. A white noise process contains no autocorrelation, is stationary and is centered around zero. This is tested using a Ljung-Box test.

The hypotheses of the Ljung-Box test are:

H0 : the residuals are independently distributed.

H1 : the residuals are not independently distributed.

The test statistic is formulated as follows (Ljung and Box, 1978):

Q = T(T + 2) ∑_{k=1}^{h} r̂k² / (T − k)    (14)

where T is the sample size, h is the number of lags being tested and r̂k is the sample autocorrelation at lag k. Under the null hypothesis, Q is approximately χ²_{h−(p+q)} distributed, where p + q is the total number of parameters in the model. We reject the null hypothesis that the residuals contain no autocorrelation if and only if the observed Q exceeds χ²_{α,h−(p+q)}, the 1 − α quantile of the central χ² distribution with h − (p + q) degrees of freedom.
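Computed directly, the statistic looks as follows. This is a sketch of the formula itself (in practice one would use R's Box.test, as the thesis's analysis is done in R); white noise residuals give a moderate Q, while autocorrelated residuals give a very large one.

```python
import random

def ljung_box_q(resid, h):
    """Ljung-Box Q = T(T+2) * sum_{k=1}^{h} r_k^2 / (T - k), as in Equation 14."""
    T = len(resid)
    m = sum(resid) / T
    c0 = sum((e - m) ** 2 for e in resid)
    def r(k):
        return sum((resid[t] - m) * (resid[t - k] - m) for t in range(k, T)) / c0
    return T * (T + 2) * sum(r(k) ** 2 / (T - k) for k in range(1, h + 1))

random.seed(4)
white = [random.gauss(0, 1) for _ in range(400)]  # behaves like white noise
ar = [0.0]
for _ in range(399):
    ar.append(0.8 * ar[-1] + random.gauss(0, 1))  # strongly autocorrelated

q_white = ljung_box_q(white, 10)  # moderate, chi-square-like magnitude
q_ar = ljung_box_q(ar, 10)        # far out in the tail
```

The degrees-of-freedom correction h − (p + q) matters when the series being tested is a set of model residuals: the fitted ARMA parameters absorb some autocorrelation, so the reference distribution must be adjusted accordingly.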

2.2 Dynamic Regression

A dynamic regression model, sometimes known as a transfer function model, is a type of regression model where the error term is allowed to contain autocorrelation (Hyndman and Athanasopoulos, 2018, Ch. 9). The general dynamic regression model can be formulated as:

yt = β0 + β1x1,t + ··· + βkxk,t + ηt    (15)

where ηt is an ARIMA process, explained in detail in section 2.1. The dynamic regression model also allows for lagged predictor variables. A model with one predictor variable, using the predictor both for the current and previous time periods, can according to Hyndman and Athanasopoulos (2018, Ch. 9.6) be written as:

yt = β0 + γ0xt + γ1xt−1 + ··· + γhxt−h + ηt    (16)

where h is the number of lags applied to the predictor and γ0, . . . , γh are its coefficients. Lags are included for the advertising variables because a model without lags could not appropriately capture the relationship between advertising and the number of sessions: advertising likely affects not only sessions the same day, but also future sessions.

2.2.1 Stationarity of the Predictors

When constructing a dynamic regression model, it is required that all the variables used in the model are stationary. If they are not, the parameter estimates will be inconsistent. If one variable is found to be nonstationary, the same amount of differencing should be applied to all variables in the model in order to maintain the relationship between the variables (Hyndman and Athanasopoulos, 2018, Ch. 9.1). Evaluation of predictor stationarity is done in the same way as described in section 2.1.1.

2.2.2 Model Specification with Stepwise Selection and Information Criteria

The subject of interest in model selection is the model that minimizes the expected test error. However, the training error is a poor estimate of the test error, as the training error generally decreases as the complexity of the model increases. Complexity is in this case expressed as the number of predictor variables in the model. The reason for this is the bias-variance trade-off: as complexity increases, the model can adapt to more complex nuances in the training data and fit it well, but it will generalize worse to the test data due to larger variance.

This error decomposition for a test sample xtest with squared error loss can be described as follows (Hastie, Tibshirani and Friedman, 2017, p. 223):

Errout = E[(Y − f̂(x))² | X = xtest]
       = σ²ε + [E(f̂(xtest)) − f(xtest)]² + E[f̂(xtest) − E f̂(xtest)]²    (17)
       = Irreducible error + Bias² + Variance

where Y = f(x) + ε is the "true" data generating process.

Since bias typically decreases and variance increases as a function of model complexity, it is necessary to find a compromise that minimises the sum of these two errors. One way is to estimate the test error directly by holding out parts of the training data, and another is to estimate the difference between the training error and the test error. The first is best done with cross validation; the second can be estimated with Akaike's or Bayes Information Criterion.

Akaike's Information Criterion

Akaike's information criterion, AIC, was developed by Akaike (1974) and measures the relative quality of a model, which makes it useful in model selection. The AIC captures relative model quality by modeling the optimism of the training error estimate. The optimism (OP) is defined as the difference between the estimated training error, Errin, and the true error of the model, Errout.

OP = Errout − Errin. (18)

The estimated training error can be formulated as T · log(SSE/T), where T is the number of observations used for estimation and SSE is the sum of squared errors. The optimism can be modelled as 2(v + 2), where v is the number of estimated model parameters, which yields an estimate of the relative test error. Formally, the AIC is defined as (Hyndman and Athanasopoulos, 2018, Ch. 5.5):

AIC = T log(SSE/T) + 2(v + 2).    (19)

For small values of T, however, the AIC tends to overfit the data (Hyndman and Athanasopoulos, 2018, Ch. 5.5). Therefore the corrected AIC, AICc, can be used, as it penalizes the model more strictly for having many parameters. It is defined as:

AICc = AIC + 2(v + 2)(v + 3) / (T − v − 3)    (20)

where T is the number of observations and v is the number of estimated parameters.

Bayes Information Criterion

The BIC works similarly to the AIC but adds a larger penalty term. This means that it will select a model with fewer variables than the AIC (Hyndman and Athanasopoulos, 2018, Ch. 5.5). It is defined as:

BIC = T log(SSE/T) + (v + 2) log(T).    (21)
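The three criteria share the same T log(SSE/T) fit term and differ only in the penalty, which a short sketch makes concrete. The v and "+2" conventions follow Hyndman & Athanasopoulos; the SSE values below are made up for illustration.

```python
import math

def criteria(sse, T, v):
    """AIC, AICc and BIC as in Equations 19-21; v = number of estimated parameters."""
    aic = T * math.log(sse / T) + 2 * (v + 2)
    aicc = aic + 2 * (v + 2) * (v + 3) / (T - v - 3)
    bic = T * math.log(sse / T) + (v + 2) * math.log(T)
    return aic, aicc, bic

# a bigger model fits slightly better (lower SSE) but pays a larger penalty
small = criteria(sse=120.0, T=100, v=3)
big = criteria(sse=118.0, T=100, v=8)
# BIC punishes the five extra parameters harder than AIC does
```

Since log(T) > 2 whenever T > 7, the BIC penalty per parameter exceeds the AIC penalty for any realistic sample size, which is exactly why BIC tends to pick the smaller model.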

Stepwise Selection

Stepwise selection is an iterative method used to decide which variables to use in a regression model. The main approaches are forward selection and backward elimination, which accomplish the same thing but start from different directions. Backward elimination starts with the full model and at each step removes the variable that performs the worst according to the model assessment criterion. Information criteria are less computationally expensive than cross validation and lead to effective model selection (Hastie, Tibshirani and Friedman, 2017, p. 230). Therefore the Akaike Information Criterion and the Bayes Information Criterion are used for model assessment.

All procedures involving predictor selection invalidate the assumptions underlying the p-values (Hyndman and Athanasopoulos, 2018, Ch. 5.5). This is however not a problem here, as the focus of this thesis is on prediction.

According to James et al. (2017, p. 209), backward elimination can be described with the following algorithm:

Backward Elimination Algorithm

1. Let Mg denote the initial model, containing all g predictors.
2. For j = g, g − 1, . . . , 1:
   a. Create j models that each contain all but one of the predictors in Mj, i.e. j − 1 predictors.
   b. Choose the best among these j models, where best is defined as having the lowest information criterion. Call it Mj−1.
3. Select the single best of the models Mg, Mg−1, . . . , M0 using the information criterion.

2.2.3 Parameter Estimation

Since we are using the Arima function in R to estimate the dynamic regression model, estimation of the parameters works in the same way as described in section 2.1.3.

2.2.4 Residual Diagnostics

Here, just like in section 2.1.4, a Ljung-Box test is performed. However, the ARIMA error term was specified so that a model containing only the ARIMA error would have white noise residuals; since the predictors introduce multiple lags, some autocorrelation is therefore likely to remain in the residuals of the dynamic regression model.

The analysis of the residual distribution is done by standardizing the residuals and comparing them to a normal distribution, as well as by constructing a QQ plot.

If the distribution of the residuals is non-normal, or if the effects of autocorrelation are large, then the assumptions behind the normal prediction intervals are invalidated, which motivates the use of the bootstrap. This is discussed further in section 2.3.1.

2.3 Forecasting

When all relevant models have been examined and the best model is selected, it needs to be evaluated on previously unseen data. In order to estimate how well the model can predict new observations, a test data set is used. The appropriate split of observations between the training and test sets depends on the signal-to-noise ratio in the data and the training sample size (Hastie, Tibshirani and Friedman, 2017, p. 222). This is discussed in more depth in section 3.

Since predictor variables are used in the forecast, values for these predictors are needed in order to compute the predictions. The forecasts in this thesis are produced by applying the data in the test set to the model trained on the training set. This yields the forecasted values of the sessions variable.

2.3.1 Prediction Intervals

To estimate the uncertainty of the forecasts, prediction intervals are created based on the training data and then evaluated using the test data. The width of a prediction interval is a function of the forecast error variance, which grows with the forecast horizon because the errors compound.

Taking an AR(1) model as an example, the properties of the errors are displayed below:

yt = φyt−1 + εt,  εt ∼ NIID(0, σ²)    (22)

where σ² is the variance of the innovation term εt, estimated using bias-adjusted maximum likelihood estimation.

The prediction error f steps ahead is the model's predicted value minus the expected value of the future observation:

e(1) = ŷT+1 − E(yT+1) = εT+1
e(2) = ŷT+2 − E(yT+2) = φεT+1 + εT+2
e(3) = ŷT+3 − E(yT+3) = φ²εT+1 + φεT+2 + εT+3    (23)
...
e(f) = φ^(f−1)εT+1 + ··· + φεT+f−1 + εT+f.

The variance of e(f) in a stationary process can be written as (Hyndman and Athanasopoulos, 2018, Ch. 8.8):

Var(e(1)) = σ²
Var(e(2)) = σ²(1 + φ²)
Var(e(3)) = σ²(1 + φ² + φ⁴)    (24)
...
Var(e(f)) = σ² ∑_{i=0}^{f−1} φ^(2i)

where i indexes the forecasted days ahead and f is the total number of forecasted days ahead.

The rate of increase of the variance from the AR(1) model decreases with f as |φ| < 1, so the variance approaches a fixed limit.

The prediction interval of a given forecast is calculated using the standard error of the specific forecast; this is then multiplied by the z-score given by the desired significance level of the prediction interval.

The z-score is the quantile of the standard normal distribution such that (1 − α) · 100% of observations fall within ± that value, where α is the specified significance level.

A 95% prediction interval for ŷT+3, using a z-score of 1.96, is exemplified in Equation 25:

PI : ŷT+3 ± 1.96 √(σ̂²(1 + φ̂² + φ̂⁴)).    (25)
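For a mean-zero AR(1), this construction translates directly into code. The sketch below uses illustrative parameter values, not estimates from the thesis's data:

```python
def ar1_prediction_interval(y_last, phi, sigma2, f, z=1.96):
    """Normal prediction interval f steps ahead for a mean-zero AR(1),
    using Var(e(f)) = sigma2 * sum(phi^(2i) for i = 0..f-1)."""
    point = (phi ** f) * y_last  # point forecast
    var = sigma2 * sum(phi ** (2 * i) for i in range(f))
    half = z * var ** 0.5
    return point - half, point + half

lo1, hi1 = ar1_prediction_interval(2.0, phi=0.6, sigma2=1.0, f=1)
lo3, hi3 = ar1_prediction_interval(2.0, phi=0.6, sigma2=1.0, f=3)
# the interval widens with the horizon but its width stays bounded,
# since the variance converges to sigma2 / (1 - phi^2)
```

The bounded width is the AR(1) counterpart of the remark above: because |φ| < 1, the forecast error variance converges, so long-horizon intervals stop growing.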

In order for the prediction intervals to be accurate, the residuals need to be normally distributed and free of autocorrelation. If this is not the case, the prediction interval may contain too many or too few observations due to faulty estimates of σ² and a lack of convergence of ŷT+f to the normal distribution.

Prediction intervals are generally too narrow. According to Hyndman & Athanasopoulos (2018, Ch. 11.4), this is because some sources of uncertainty are not accounted for.

If the normality assumption does not hold, bootstrap-based prediction intervals are applied as complements to the normal-based prediction intervals and evaluated.

The bootstrap procedure only assumes that the error terms are uncorrelated (Hyndman and Athanasopoulos, 2018, Ch. 3.5). This allows εT+1 to be estimated by drawing from the residuals of previous observations. Repeating this for all f future values yields simulated values of yT+1, . . . , yT+f that should behave similarly to the actual values. Iterating this process yields many possible future paths, from which percentiles can be calculated; a (1 − α) · 100% prediction interval is one that contains (1 − α) · 100% of the simulated observations, where α is the proportion of observations allowed to fall outside the interval. The number of iterations performed is 2000, which is the default in the R package forecast.
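A sketch of the procedure for a mean-zero AR(1) (the thesis uses the forecast package's implementation; the function name and parameter values here are illustrative): future innovations are resampled from the model's own residuals, many paths are simulated, and the interval endpoints are read off as percentiles at each horizon.

```python
import random

def bootstrap_pi(y, phi, f, level=0.80, n_sims=2000):
    """Bootstrap prediction intervals for a mean-zero AR(1):
    draw future errors from the fitted residuals instead of a normal law."""
    resid = [y[t] - phi * y[t - 1] for t in range(1, len(y))]
    paths = []
    for _ in range(n_sims):
        cur, path = y[-1], []
        for _ in range(f):
            cur = phi * cur + random.choice(resid)  # resampled innovation
            path.append(cur)
        paths.append(path)
    alpha = (1 - level) / 2
    intervals = []
    for step in range(f):
        vals = sorted(p[step] for p in paths)
        intervals.append((vals[int(alpha * n_sims)], vals[int((1 - alpha) * n_sims) - 1]))
    return intervals

random.seed(6)
y = [0.0]
for _ in range(300):
    y.append(0.6 * y[-1] + random.gauss(0, 1))

pis = bootstrap_pi(y, phi=0.6, f=5)  # interval widths grow with the horizon
```

Because the simulated errors come from the empirical residual distribution, skewness or heavy tails in the residuals carry over into asymmetric or wider intervals, which is exactly the robustness the normal intervals lack.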

2.3.2 Model Evaluation

Three types of error measurements are used in this thesis: the root mean squared error (RMSE), the mean absolute error (MAE) and the mean absolute percentage error (MAPE). All of these measure how wrong the estimates are. The RMSE is an estimate of the average error that penalizes large errors proportionally more. The MAE estimates how far off the model is on average in absolute terms, and the MAPE estimates how far off the estimates are relative to the absolute value of the dependent variable. If yt is the actual value of y at time t, and ŷt is the estimated value of y at time t, we can define the error measurements as:

RMSE = √( ∑_{t=1}^{T} (yt − ŷt)² / T )    (26)

MAE = ∑_{t=1}^{T} |yt − ŷt| / T    (27)

MAPE = (100/T) ∑_{t=1}^{T} |(yt − ŷt)/yt|.    (28)

The error measurements of the dynamic regression models are compared to the error measurements of a model using only the ARIMA error term explained in section 2.1.
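Equations 26-28 in code form, with toy numbers chosen so the values are easy to verify by hand:

```python
def rmse(actual, pred):
    """Root mean squared error, Equation 26."""
    return (sum((a - p) ** 2 for a, p in zip(actual, pred)) / len(actual)) ** 0.5

def mae(actual, pred):
    """Mean absolute error, Equation 27."""
    return sum(abs(a - p) for a, p in zip(actual, pred)) / len(actual)

def mape(actual, pred):
    """Mean absolute percentage error, Equation 28."""
    return 100 / len(actual) * sum(abs((a - p) / a) for a, p in zip(actual, pred))

actual = [100.0, 200.0, 400.0]
pred = [110.0, 180.0, 400.0]
# MAE = (10 + 20 + 0) / 3 = 10; MAPE = (100/3)(0.10 + 0.10 + 0) ≈ 6.67%
```

Note how the same absolute miss of 20 on the second observation weighs more in the RMSE than two misses of 10 would, and how the MAPE is undefined whenever an actual value is zero, which is why it suits a strictly positive series such as daily sessions.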

3 Data

3.1 Data Pre-Processing

The original data received from Logary Analytics are divided into two files, one containing data about advertising for several companies in different countries, and one containing session data from Google Analytics. Both data sets contain thousands of rows. Before analyzing the data, some pre-processing needs to be performed.

In processing the data, observations not referring to the specific company and country of interest are removed. The data set with information about advertising contains three relevant advertising platforms: Google, Facebook and Pinterest. We rename the rows referring to advertising on Google to gAds and gAds1, as the company uses two different accounts for advertising on Google: one for “performance advertising” and one for “brand advertising”. This is done to reflect the fact that the two different types of advertising could have different effects.

Since observations from the same day are spread out over several rows, we reorganize the data as one observation per day and add the amount of money spent on each platform as separate columns, so that for each day there are four variables denoting how much was spent on each platform that day. If there was no advertisement spending on a certain platform on a given day, its value is 0.
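The reshaping step can be sketched as follows. This is a Python illustration with made-up rows (the thesis's processing is presumably done in R); the platform column names gAds, gAds1, facebook and pinterest mirror the variables described above.

```python
from collections import defaultdict

# hypothetical long-format rows: (date, platform, amount_spent)
rows = [
    ("2020-02-19", "gAds", 120.0),
    ("2020-02-19", "facebook", 40.0),
    ("2020-02-20", "gAds1", 15.0),
]
platforms = ["gAds", "gAds1", "facebook", "pinterest"]

# one observation per day, one spend column per platform, 0 when nothing was spent
daily = defaultdict(lambda: {p: 0.0 for p in platforms})
for date, platform, amount in rows:
    daily[date][platform] += amount

wide = {d: daily[d] for d in sorted(daily)}
```

The defaultdict ensures every platform column exists for every observed day, which implements the rule that days without spending on a platform receive a 0 rather than a missing value.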

13 The other file, containing session data, also contains several observations per day and includes variables such as country, date and sessions. The data are reorganized in a manner similar to the other data set, so that the data frame contains daily observations of the sessions variable. This data set is then combined with the one containing the amount spent on advertising per day to create the final data set. We then create dummy variables for Black Friday as well as for weekdays. The dummy variable for Black Friday is 1 on Black Friday and the two following days, as its effect seems to be present on those days as well, which is visible in Figure 1.

3.1.1 Data Partitioning

The signal-to-noise ratio is the ratio of the variation in the dependent variable that can be attributed to the signal (in this case the effect of advertising) to the noise (unexplained variance). When there is a lot of noise in relation to the signal, a larger portion of the sample needs to be allocated to training, as more data is required to separate the signal from the noise. Time-related dimensions also affect uncertainty, as time-related events introduce noise if they cannot be explained. Due to these time disturbances, it seems reasonable to include a full year in the training data, yielding a relatively large training/test split of 82.6/17.4 compared to Hastie, Tibshirani and Friedman (2017, p. 222), who recommend that 25% of the total sample be allocated to the test set.
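Since observation order carries the signal in a time series, the partition is chronological rather than random. A trivial sketch (function name is ours):

```python
# Chronological train/test split for time series: the first n_train
# observations go to training, the remainder to testing. No shuffling,
# since temporal order matters.

def split_time_series(series, n_train):
    return series[:n_train], series[n_train:]
```

With 366 of the 443 daily observations in training, the split is 366/443 ≈ 82.6%.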

3.2 Final Data

The resulting data consist of 443 daily observations of money spent on advertising on three different platforms as well as the number of website sessions each day for a specific company in Sweden. The session data are collected from Google Analytics. The first observation is from February 19th 2020 and the last observation is from May 6th 2021. The first 366 observations belong to the training set, while the rest belong to the test set.

The three advertising platforms are Google, Facebook and Pinterest. From 2020-02-19 until 2020-09-29, the company mostly spent money on Google advertising, and less on the two other platforms. From that point until 2021-02-12, the company spent more than before on Facebook advertising. Out of respect for the company we are examining, the exact values of the advertising variables are omitted from the graphs. In Table 1 the descriptive statistics for the sessions variable are displayed.

Statistic   N     Mean        St. Dev.    Min     Pctl(25)   Pctl(75)   Max
Value       366   6,015.254   2,276.371   2,402   4,708.2    6,754      29,920

Table 1: Descriptive statistics for the sessions variable (training data)

3.3 Selecting the Number of Lags for the Predictors

The relationship between the predictors of interest (advertising) and the number of sessions on the website is believed to be straightforward most of the time: advertising leads people to click on a link and then browse the website. However, it is possible that the advertisements have a more indirect effect and that people visit the website later rather than on the day of the advertisement. This effect is covered by lagged versions of the advertising variables, which leads to the question of how many lags to include in the initial model. Adding more variables incurs a cost in terms of efficiency, since the number of models to test grows as (g + 1)g/2, where g is the number of predictor variables. On top of that, adding one more lag for the backward elimination algorithm to examine means adding one more lag for all the advertising platforms (unless there is good reason to have different numbers of lags for different predictors).

With this in mind, there is a practical upper limit to the number of lags that can be tested. Including multiple weeks of lags is not really necessary, since most of the effect should be direct (by clicking the ad and being taken to the company’s website) and the effect of each subsequent day should be small and diminishing. Therefore the first 6 lags are examined and are believed to capture most of the effect. This means that there are 4 · 7 + 6 + 1 = 35 variables to select from, resulting in 36 · 35/2 = 630 models to evaluate.
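The lag construction and the model count can be sketched as follows (a minimal Python illustration; function names are ours):

```python
# For each advertising variable the contemporaneous value plus lags 1..6
# are kept: 4 platforms * 7 columns = 28 spend variables; adding 6
# weekday dummies and the Black Friday dummy gives 35 candidates, so
# (35 + 1) * 35 / 2 = 630 models for backward elimination to evaluate.

def lagged(series, max_lag):
    """Return columns [lag 0, lag 1, ..., lag max_lag], trimmed to equal length."""
    n = len(series) - max_lag
    return [series[max_lag - k : max_lag - k + n] for k in range(max_lag + 1)]

def models_to_evaluate(g):
    """Number of model fits in backward elimination with g candidate variables."""
    return (g + 1) * g // 2
```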

4 Results

The initial dynamic regression model is defined as:

y_t = β_0 + β_1(B) gAds_t + β_2(B) gAds1_t + β_3(B) facebook_t + β_4(B) pinterest_t + β_5 blackfriday_t + β_6 sunday + β_7 tuesday + β_8 wednesday + β_9 thursday + β_{10} friday + β_{11} saturday + η_t    (29)

where y_t is the sessions variable, β_0 is a drift term, gAds_t and gAds1_t are the amounts spent on the two different accounts for Google ads, facebook_t is the amount spent on Facebook advertising per day and pinterest_t is the amount spent on Pinterest advertising on day t. The final seven variables are dummy variables for weekdays and Black Friday. η_t is an ARIMA error and

β_i(B) = 1 − β_{i1} B − β_{i2} B² − ⋯ − β_{i6} B⁶    (30)

where β_i(B) refers to the coefficients in the initial model and i is 1, 2, 3 or 4. B is the backshift operator, also known as the lag operator. It is an operator that, when applied to a variable, produces the variable from the previous period (see Equation 31).

B y_t = y_{t−1}
B² y_t = y_{t−2}
⋮
B^h y_t = y_{t−h}.    (31)
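The backshift operator of equation 31 can be illustrated as a one-line function on an indexed series (a minimal sketch; the function name is ours):

```python
# The backshift (lag) operator B as a function: B^h applied at time t
# returns the value from h periods earlier, i.e. B^h y_t = y_{t-h}.

def backshift(series, t, h=1):
    """B^h y_t = y_{t-h}; assumes t - h is a valid index."""
    return series[t - h]
```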

Section 4.2 specifies the ARIMA error of the model (η_t) and the order of differencing required for all variables, section 4.3 specifies the predictors chosen for the models using backward elimination, and section 4.4 estimates the models and analyses the residuals. Section 4.6 creates forecasts on the test set using the created models, compares the predicted values to the actual session values and computes the errors. It also estimates and evaluates the prediction intervals of the forecasts.

4.1 Stationarity of All Variables

The stationarity of the numeric variables is examined both visually, using time series graphs and correlograms, and using the Augmented Dickey-Fuller (ADF) test. Judging by the time series plot in Figure 1, there does not seem to be a trend component, nor does there seem to be any seasonality. There does not seem to be any weekly seasonality either, as there is no significant seventh lag. Extreme values occur during two periods especially: Black Friday at the end of November and the period between Christmas and New Year’s. The process has a drift.

For the variables gAds and gAds1, there does not seem to be any clear trend judging by Figure 2. gAds starts out with low values which then increase during the summer, decrease during the autumn and then go up again around Black Friday and Christmas. It is however difficult to determine if this is a sign of seasonality, as the data only span about a year and one can therefore not know if this pattern is repeated every year.

Figure 1: Time series plot and correlogram of sessions variable

Figure 2: Time series plots showing the amount of money spent on gAds and gAds1

Figure 3 shows that, up until the end of September 2020, no money was spent on advertising on Facebook. The amount spent per day remains quite stable after this, aside from the peak on Black Friday. There is no visible trend or seasonality in the data.

Pinterest, on the other hand, was zero for a period of time around April, as seen in Figure 4. For the rest of the year, it still holds values much lower than those of the gAds variables. It does not seem to have a trend or seasonality component.

Figure 3: Time series plot showing the amount of money spent on Facebook advertising

Figure 4: Time series plot showing the amount of money spent on Pinterest advertising

To make sure that no unit root is present for any of the variables, ADF tests are performed for each variable, using the version of the test that includes a drift but not a trend. As discussed in section 2.2.1, to avoid the multiple testing problem the significance level, which is usually set to α = 0.05, is set to 1 − (1 − 0.05)^5 ≈ 0.23.
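The adjusted level follows from the complement rule: with five tests each at α = 0.05, the probability of at least one false rejection is 1 − (1 − 0.05)^5. A one-line check (illustrative only):

```python
# Family-wise probability of at least one false rejection across five
# independent tests, each at the 0.05 level.
alpha_familywise = 1 - (1 - 0.05) ** 5  # ≈ 0.226
```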

For the sessions variable 1 lag is used for the test, as the dynamic regression model used contains an AR(1) error term. This is described in section 4.2. For the other variables, 6 lags are used. The choice of lags is explained further in section 3.3. As seen in Table 2, the p values for all variables are below 0.23. The null hypothesis that there exists a unit root in the time series is therefore rejected for all of the variables. Since these time series do not contain trends or unit roots, differencing is not required.

Variable    Lag   p value    ADF value
Sessions    2     <= 0.01    -7.97
gAds        6     0.0138     -3.37
gAds1       6     0.0725     -2.74
Facebook    6     <= 0.01    -4.61
Pinterest   6     <= 0.01    -3.51

Table 2: Results from ADF tests

Figure 5: Time series graph and correlogram for AR(1) residuals

4.2 Determining the ARIMA Error

According to the PACF in Figure 1 there seems to be only one significant lag, while the lags in the ACF are decreasing in value over time. These features are indicative of an AR(1) process.

After estimating and running an AR(1) model on the training data, we can examine the residuals shown in Figure 5. From the figure, it seems like the residuals approximate white noise: there are no significant lags, the mean seems centered around 0 and the process looks stationary. However, the residuals are quite large around Black Friday. By performing a Ljung-Box test on the residuals, the null hypothesis that all the autocorrelations are 0 is tested. As seen in Table 3, we do not reject the null hypothesis at the 5% significance level.
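The Ljung-Box statistic can be sketched from its definition, Q = n(n+2) Σ_{k=1}^{h} r_k² / (n − k), where r_k is the lag-k sample autocorrelation; under the null of no autocorrelation, Q is approximately chi-square with h degrees of freedom. A from-scratch illustration (in practice a statistics library would be used; names are ours):

```python
# Ljung-Box Q statistic computed from sample autocorrelations.

def autocorr(x, k):
    """Lag-k sample autocorrelation of the sequence x."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x)
    return sum((x[t] - mean) * (x[t - k] - mean) for t in range(k, n)) / var

def ljung_box_q(x, h):
    """Q = n(n+2) * sum_{k=1}^{h} r_k^2 / (n - k)."""
    n = len(x)
    return n * (n + 2) * sum(autocorr(x, k) ** 2 / (n - k) for k in range(1, h + 1))
```

A large Q relative to the chi-square critical value leads to rejecting the white-noise null.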

Since an AR(1) model seems to fit the sessions variable quite well, we use an AR(1) error term in the dynamic regression. This model is also used in section 4.6 to evaluate the dynamic regression model fit.

Figure 6: Values of AIC and AICc for the best models according to the backward elimination algorithm

4.3 Model Selection with Backward Elimination and Information Criteria

The model is optimised with respect to the AICc, AIC and BIC using the advertising variables, six lags of each advertising variable, dummy variables for weekdays and Black Friday as well as an AR(1) error.
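The criteria themselves follow the standard definitions AIC = 2k − 2 ln L, AICc = AIC + 2k(k + 1)/(n − k − 1) and BIC = k ln(n) − 2 ln L, where L is the maximized likelihood, k the number of estimated parameters and n the sample size (these formulas are standard and not spelled out in the text):

```python
import math

# Standard information criteria, computed from the maximized
# log-likelihood, the parameter count k, and the sample size n.

def aic(loglik, k):
    return 2 * k - 2 * loglik

def aicc(loglik, k, n):
    # Small-sample correction of AIC.
    return aic(loglik, k) + 2 * k * (k + 1) / (n - k - 1)

def bic(loglik, k, n):
    # Penalizes parameters more heavily than AIC for n > e^2.
    return k * math.log(n) - 2 * loglik
```

The heavier BIC penalty is what makes the BIC model more conservative in its choice of variables.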

By running the initial model, specified in equation 29, through the backward elimination algorithm using three different information criteria, we get two selected models. As seen in Figure 6, both AIC and AICc select the model with index number 29.

They both contain the same six predictor variables. If we let η_t represent the AR(1) error term, the model can be written as:

y_t = β_0 + β_1 gAds_t + β_2 gAds_{t−5} + β_3 gAds1_{t−6} + β_4 facebook_t + β_5 sunday + β_6 tuesday + η_t.    (32)

We will be referring to this model as the AIC model. When using BIC, model number 31 is instead chosen as the best model, as seen in Figure 7. Again, letting η_t represent the AR(1) error term, the model chosen via BIC can be written as:

y_t = β_0 + β_1 gAds_t + β_2 gAds_{t−5} + β_3 gAds1_{t−6} + β_4 facebook_t + η_t.    (33)

This model will be referred to as the BIC model.
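The selection procedure used for both models can be sketched generically: starting from the full variable set, repeatedly drop the variable whose removal gives the best criterion value, and keep the best set encountered. A simplified illustration in which `criterion` stands in for refitting the model and computing AIC or BIC (lower is better):

```python
# Greedy backward elimination over a set of candidate variables.

def backward_eliminate(variables, criterion):
    current = list(variables)
    best_set, best_score = list(current), criterion(current)
    while current:
        # Score every one-variable removal and keep the best reduction.
        candidates = [[v for v in current if v != drop] for drop in current]
        current = min(candidates, key=criterion)
        score = criterion(current)
        if score < best_score:
            best_set, best_score = list(current), score
    return best_set, best_score
```

For example, with a toy criterion that charges 1 per included variable and 10 for each of the "true" variables "a" and "b" left out, the procedure recovers exactly {"a", "b"}.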

Figure 7: Values of BIC for the best models according to the backward elimination algorithm

Model         p value
AR(1) model   0.82
AIC model     <0.001
BIC model     <0.001

Table 3: Ljung-Box tests

4.4 Model Evaluation Using Residuals

After estimating both models, we can examine the residuals. By looking at Figures 8 and 9, we can see that the residuals do not seem to follow a normal distribution for either of the two models. We see it again by looking at the QQ plots in Figure 13, shown in the appendix. The two sets of residuals might indicate a slight skew, but the biggest difference from the standard normal distribution comes in the form of excess kurtosis, as there are more observations close to zero and at the extremes. From the plots it is also evident that the residuals are more prone to take on extreme positive values rather than extreme negative values, as there is more deviation on the right side of the plot than on the left for both models.

The prediction intervals of the predictions are based on normality of the residuals. Since extreme values are present, this is likely to lead to upward bias in the estimates, which is why the normal prediction intervals are complemented with bootstrap-based prediction intervals. The actual accuracy of these intervals is evaluated in section 4.6.
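The idea behind the bootstrap-based intervals can be sketched as follows: resample historical residuals with replacement, add them to the point forecast to simulate possible outcomes, and take empirical percentiles. This avoids the normality assumption. A simplified one-step illustration, not the exact procedure of the thesis (names and defaults are ours):

```python
import random

# Bootstrap prediction interval for a single point forecast.

def bootstrap_interval(forecast, residuals, level=0.8, n_boot=2000, seed=1):
    rng = random.Random(seed)
    # Simulate outcomes by adding resampled historical residuals.
    sims = sorted(forecast + rng.choice(residuals) for _ in range(n_boot))
    lo = sims[int(n_boot * (1 - level) / 2)]
    hi = sims[int(n_boot * (1 + level) / 2) - 1]
    return lo, hi
```

Because the percentiles come from the empirical residual distribution, the resulting interval need not be symmetric around the forecast, unlike the normal-theory interval.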

Figure 8: Histogram of the standardized residuals from the AIC model along with the PDF of a standard normal distribution

By looking at Figure 12 in the appendix, we see that many of the lags are significantly different from zero for the residuals of both models. If we perform Ljung-Box tests on the residuals for both models, we reject the null hypothesis that there is no autocorrelation in the residuals at the 5% significance level. The p values for the tests are displayed in Table 3.

Since the residuals from the AR(1) model displayed no autocorrelation (see section 4.2), auto- correlation is likely to have been introduced by the added predictor variables. This is possible as the predictors themselves were lagged variables that correlate with the sessions variable. Therefore when the backward elimination process chooses more lags, more autocorrelation is introduced.

While some values of the ACF are significantly different from zero, they are not very large, as many of them are around 0.1. Therefore, it is not believed that the autocorrelation will have a large impact on the predictions. During backward elimination, the trade-off between the number of variables and the effects of autocorrelation is taken into account by choosing the models that minimize the AIC and BIC. The models with index number 35 (the last models from the backward elimination processes), which only contain the AR(1) error term and no predictor variables, do not have any autocorrelation in the residuals (as shown in section 4.2) but are likely to perform worse for prediction, as the AIC and BIC values are much higher.

Figure 9: Histogram of the standardized residuals from the BIC model along with the PDF of a standard normal distribution

4.5 Parameter Evaluation

The parameter estimates of the two selected models are shown in Table 6 in the appendix. AR(1) represents the estimate of the parameter in the AR(1) error term. The values within parentheses display the standard errors of the estimated parameters.

The assumptions behind the p values are invalidated by the model selection process, as described in section 2.2.2. The standard errors of the estimated parameters are also unreliable, as the residual distribution deviates from normality (discussed in section 4.4). The estimates displayed in Table 6 are therefore flawed and care should be taken when drawing conclusions from them.

4.6 Forecasting

Because of the non-normal residuals, we create bootstrap-based prediction intervals as complements to the prediction intervals using the normality assumption. Figures 10 and 11 display the actual values for the sessions variable in the test set, along with the forecasts produced by each model. The dotted lines display the bootstrap-based prediction intervals at the 80% and 95% levels, and the blue and grey areas represent the corresponding prediction intervals made under the assumption of normality.

Figure 10: AIC Model forecasts vs actual

The forecasts look quite similar, judging by Figures 10 and 11. One difference is that the seasonal variance seems slightly larger in the forecasts using the AIC model. This is reasonable, since the difference between the models is the inclusion of weekday variables in the AIC model. We can also see that in both forecasts, the predicted values were consistently under the actual values until mid-April, after which the forecasts overestimated the number of sessions.

By looking at Figures 10 and 11, we can see that the prediction intervals made using the normality assumption are symmetrical, while the bootstrap-based prediction intervals are not. This is due to extreme positive values being more common than extreme negative ones. The prediction intervals also seem quite pessimistic in relation to the variance of the actual observations. This is confirmed by Table 4, which shows that the prediction intervals from the dynamic regression models covered more than the expected share of the actual observations. The table shows the percentage of actual values of sessions in the test set that were covered by the prediction intervals.

Figure 11: BIC Model forecasts vs actual

Method:     Normal   Normal   Bootstrap   Bootstrap
Interval:   80%      95%      80%         95%
AIC         90.9%    100%     88.3%       100%
BIC         92.2%    100%     90.9%       100%
AR(1)       70.1%    96.1%    40.3%       98.7%

Table 4: Precision of prediction intervals
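The coverage percentages in Table 4 are simply the share of test observations falling inside the interval. A trivial sketch (function name is ours):

```python
# Empirical coverage of a prediction interval over a test set: the
# percentage of actual values y_t with lower_t <= y_t <= upper_t.

def coverage(actual, lower, upper):
    inside = sum(lo <= y <= hi for y, lo, hi in zip(actual, lower, upper))
    return 100 * inside / len(actual)
```

An 80% interval is well calibrated when this value is close to 80; values well above indicate a too-pessimistic (too wide) interval.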

The biggest difference in accuracy between the prediction intervals from the dynamic regression models and the AR(1) model is the 80% bootstrap-based interval, which only contained 40.3% of the observations for the AR(1) model. This is interesting, as this interval was the only one for which the assumption of no autocorrelation actually held (see Table 3). As the bootstrap-based residuals are based on the historical variation, its failings suggest that the AR(1) model is not enough to model the variation of the data. This suggests that the choice of predictor variables can be more important than the absence of autocorrelation when modeling correct prediction intervals. One way to improve on the interval would be to transform the sessions variable in order to make the residuals normal (discussed in section 5).

The dynamic regression models are characterized more by similarities than differences, but as seen in Table 5, the AIC model slightly outperforms the BIC model when using RMSE as the error measurement. However, the MAE and MAPE measurements indicate that the BIC model performs slightly better. The difference between the two models is not very big, judging by any of the three measurements in Table 5: it is 7 for RMSE, 2 for MAE and 0.04 percentage points for MAPE.

       AIC Model   BIC Model   AR(1) Model
RMSE   1218        1225        2518
MAE    1133        1131        2254.084
MAPE   14.73       14.69       37.38

Table 5: Error estimates

Relating the errors of the dynamic regression model to the AR(1) model, there is a clear difference. This signifies that the relationship between the predictors and the dependent variable is strong and therefore, taking advertising into account when predicting website visits is important for prediction accuracy.

5 Discussion

The discussion is centered around the research questions stated in section 1.

The study finds that the dynamic regression models created can predict website visits per day for the company under study quite well. The values of the error measurements are not very large, especially when comparing them to the actual values of the sessions variable. As the sessions variable has a standard deviation of 2276 (displayed in Table 1), MAE values of around 1130 for a forecast made 77 steps ahead show that the created dynamic regression models can prove useful for predicting website visits in practice.

As is visible in Table 5, there was no substantial difference in prediction accuracy between the AIC and BIC models. The reason that neither model outperformed the other is likely that a few variables contribute the majority of the predictive signal. This is backed by the results in Table 6, showing that the variables for Sunday and Tuesday did (on average) not contribute much in terms of signal. Therefore the conclusion is that the other four variables were key in predicting future values of the sessions variable.

It is possible that the dummy variable for Tuesdays included in the AIC model is spurious. While it is possible that workdays have a negative impact on website visits, there is no intuitive reason why Tuesdays specifically would impact website visits more than any other weekday. A possible solution to this would be to, instead of using dummy variables for every

day of the week, use a dummy variable for working days.

By comparing the error measurements for the AIC and BIC models with the AR(1) model, it is clear that both dynamic regression models outperform the AR(1) model when it comes to prediction, as the errors of the AR(1) model are around twice as high as the errors of the dynamic regression models. This is in line with the findings of Kongcharoen & Kruangpradit (2013), Tsui et al. (2014) and Anners (2017), who all noted that a dynamic regression model outperformed an ARIMA model for forecasting.

As Table 4 shows, the prediction intervals using bootstrap are more accurate than those based on normality. However, both types of intervals are too pessimistic, as they contain too many observations. The reason that the normal prediction intervals are too large is probably the long tails of the residual distribution: a few extreme observations have a large influence on the errors, and therefore also on the estimated standard deviation, which in turn leads to inflated prediction intervals. The reason that the bootstrap-based prediction intervals are too large might be the autocorrelation of the model residuals, but as is evident from the AR(1) model, misspecification can also result in considerable error. The intervals of the dynamic regression models also outperform those of the AR(1) model. The relative accuracy of the bootstrap-based prediction intervals from the dynamic regression compared to the AR(1) model suggests that the model is decently accurate in describing the data-generating process.

5.1 Suggestions for Future Studies

As the data set used for this study only contains data for about a year, it is not possible to inspect the effect of sales-impacting events that only occur once a year, such as Black Friday and Christmas sales. It would therefore be interesting to perform further studies using data from a longer time period. If the data had been collected for multiple years these time related effects could be estimated with yearly seasonal effects.

Furthermore, because of computational limitations, the number of lags used in the initial model (shown in equation 29) is limited. It would be interesting to compare the findings of the models presented in this study to a model using similar data, but with monthly observations. Such a model could possibly capture more of the long-term effects of marketing on website visits, as the lags included in the advertising variables would cover a longer time period. This would provide insight into what is “missed” when only using lags for six days.

Clearly, such a model would not be able to provide daily results, however.

It is possible that taking the natural logarithm of the sessions variable would dampen the impact of extreme values in the residuals. This might improve the accuracy of the normal prediction intervals. Another possibility is a Box-Cox transform of the sessions variable, which could make the residuals closer to normally distributed.
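Both transformations can be written as one small function, since the Box-Cox transform reduces to the natural logarithm as its parameter λ tends to 0 (an illustrative sketch):

```python
import math

# Box-Cox transform of a positive value y:
#   (y^lam - 1) / lam  for lam != 0, and log(y) for lam == 0.

def box_cox(y, lam):
    if lam == 0:
        return math.log(y)
    return (y ** lam - 1) / lam
```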

5.2 Alternative Methods

One alternative to using a dynamic regression model for predicting website visits would be to use a neural network. Neural networks are often used for prediction and work quite well for this purpose. The downside is that they can often be perceived as hard to understand and interpret. It would, however, be interesting to compare a neural network’s predictive capabilities to those of the models used in this thesis.

Lasso is an alternative method that is useful for model selection. It adds a penalty term containing the sum of the absolute values of the beta parameters, scaled by a regularization parameter λ. The lasso minimizes (James et al., 2017, p. 219):

Err_in + λ Σ_{j=1}^{p} |β_j|.    (34)

The model is then refit over a range of values of λ, and the value that results in the lowest out-of-sample error is chosen. The method is deemed on par with selecting models using AIC and BIC.

Also, in order to save degrees of freedom, a Fourier series could be applied to correct for weekday effects. Since this effect is small in our data, this was not deemed necessary.

Finally, a distributed lag model, in which the lag parameters are forced to decrease in some pattern, could be used. This avoids problems arising from collinearity between lags of the same variable. Since collinearity does not have a large effect on prediction, this is not implemented in this study.

References

Akaike, H. (1974) ‘A new look at the statistical model identification’, IEEE Transactions on Automatic Control, 19(6), pp. 716–723.

Anners, C. (2017) Forecasting energy usage in the industrial sector in Sweden using SARIMA and dynamic regression. Master’s thesis. Uppsala University, Department of Statistics.

Box, G. E. P. and Jenkins, G. M. (1970) Time series analysis: Forecasting and control. Holden-Day (Holden-day series in time series analysis and digital processing). Available at: https://books.google.se/books?id=5BVfnXaq03oC.

Cryer, J. D. and Chan, K.-S. (2008) Time series analysis: With applications in R. New York, NY: Springer New York.

Dickey, D. A. and Fuller, W. A. (1979) ‘Distribution of the estimators for autoregressive time series with a unit root’, Journal of the American Statistical Association, 74(366a), pp. 427–431. doi: 10.1080/01621459.1979.10482531.

Hamilton, J. D. (1994) Time series analysis. Princeton, N.J: Princeton Univ. Press.

Hastie, T., Tibshirani, R. and Friedman, J. (2017) The elements of statistical learning: Data mining, inference, and prediction, second edition. New York, NY: Springer New York.

Hyndman, R. J. and Athanasopoulos, G. (2018) Forecasting: Principles and practice. 2nd edn. Australia: OTexts.

James, G. et al. (2017) An introduction to statistical learning: With applications in R. New York, NY: Springer New York.

Kongcharoen, C. and Kruangpradit, T. (2013) ‘Autoregressive integrated moving average with explanatory variable (ARIMAX) model for Thailand export’, in.

Ljung, G. and Box, G. (1978) ‘On a measure of lack of fit in time series models’, Biometrika, 65. doi: 10.1093/biomet/65.2.297.

Said, S. E. and Dickey, D. A. (1984) ‘Testing for unit roots in autoregressive-moving average models of unknown order’, Biometrika, 71(3), pp. 599–607. Available at: http://www.jstor.org/stable/2336570.

Tsui, K. et al. (2014) ‘Forecasting of Hong Kong airport’s passenger throughput’, Tourism Management, 42, pp. 62–76. doi: 10.1016/j.tourman.2013.10.008.

Wiener, N. (1949) Extrapolation, interpolation, and smoothing of stationary time series: With engineering applications. Cambridge, Mass.: MIT Press.

A Appendix

Figure 12: ACF for the residuals of the estimated models

Figure 13: Q-Q Plot for AIC and BIC Models

Figure 14: Standardized residuals from the AR(1) model

Figure 15: Forecast made using the AR(1) model

Variables      AIC Estimates         BIC Estimates
AR(1)          0.77*** (0.03)        0.77*** (0.03)
intercept      3588.11*** (335.56)   3695.34*** (334.73)
gAds           1.11*** (0.09)        1.04*** (0.10)
facebook       0.88*** (0.05)        0.90*** (0.05)
gAds1_{t−6}    0.21 (0.12)           0.22 (0.12)
gAds_{t−5}     −0.24** (0.08)        −0.23** (0.08)
sunday         239.17* (108.18)      -
tuesday        −155.58 (100.45)      -

***p < 0.001; **p < 0.01; *p < 0.05

Table 6: Parameter estimates and standard errors for the AIC and BIC models
