
The Effect of Online Advertising in a Digital World: Predicting Website Visits with Dynamic Regression

Martin Björklund & Felix Hasselblad

Bachelor’s thesis in

Advisors: Tatjana Pavlenko & Henrik Feldt

2021

Acknowledgements

We would like to thank Tatjana Pavlenko for her dedication and enthusiasm in helping us produce this thesis, as well as for her statistical advice.

Thank you to Henrik Feldt, for giving us the opportunity to write our thesis in collaboration with Logary Analytics and for helping us formulate a relevant research question.

We would also like to thank Maximillian Mantei for his input regarding the accuracy of the models created in the thesis.

Abstract

The goal of the thesis is to accurately predict future values of a company's website visits and to estimate the uncertainty of those predictions. To achieve this, a dynamic regression model with an ARIMA error term is considered, using advertisement spending with lags and dummy variables for Black Friday and weekdays as predictors.

After dividing the data into a training set and a test set, the order of the ARIMA error term is specified using the Box-Jenkins methodology. The initial model is then run through a backward elimination algorithm, which selects two models based on the Akaike Information Criterion and Bayes Information Criterion. As expected, the model selected using Bayes Information Criterion is more conservative in its choice of variables than the model specified using the Akaike Information Criterion. The forecasts made on the test set are complemented with normal and bootstrap-based intervals in order to estimate the uncertainty of the predictions. These are then compared to the forecasts made using a simpler model, consisting of only the ARIMA error term.

The thesis concludes that the dynamic regression models are twice as accurate as the simpler model and that they were on average off by 14% from the actual values. The prediction intervals for the dynamic regression models are slightly too pessimistic as they overstate the uncertainty of the model by about 10 percentage points in the 80% prediction interval and by 5 percentage points in the 95% prediction interval.

There is no practical discrepancy in prediction power between the model selected using the Akaike Information Criterion and the one using Bayes Information Criterion. The accuracy of the prediction intervals is higher than in the simpler model, even though both dynamic regression models have more residual autocorrelation.

Contents

Acknowledgements

Abstract

Abbreviations

List of Figures

List of Tables

1 Introduction

2 Theory and Methodology
  2.1 The ARIMA Error Term
    2.1.1 Stationarity of the Dependent Variable
    2.1.2 Choosing the Order of the ARIMA Error Term
    2.1.3 Estimation
    2.1.4 Residual Diagnostics
  2.2 Dynamic Regression
    2.2.1 Stationarity of the Predictors
    2.2.2 Model Specification with Stepwise Selection and Information Criteria
    2.2.3 Parameter Estimation
    2.2.4 Residual Diagnostics
  2.3 Forecasting
    2.3.1 Prediction Intervals
    2.3.2 Model Evaluation

3 Data
  3.1 Data Pre-Processing
    3.1.1 Data Partitioning
  3.2 Final Data
  3.3 Selecting the Number of Lags for the Predictors

4 Results
  4.1 Stationarity of All Variables
  4.2 Determining the ARIMA Error
  4.3 Model Selection with Backward Elimination and Information Criteria
  4.4 Model Evaluation Using Residuals
  4.5 Parameter Evaluation
  4.6 Forecasting

5 Discussion
  5.1 Suggestions for Future Studies
  5.2 Alternative Methods

References

A Appendix

Abbreviations

ACF    Autocorrelation function
ADF    Augmented Dickey-Fuller
AIC    Akaike information criterion
AICc   Corrected Akaike information criterion
ARIMA  Autoregressive integrated moving average
BIC    Bayes information criterion
MAE    Mean absolute error
MAPE   Mean absolute percentage error
PACF   Partial autocorrelation function
RMSE   Root mean square error

List of Figures

1 Time series plot and histogram of the sessions variable
2 Time series plots showing the amount of money spent on gAds and gAds1
3 Time series plot showing the amount of money spent on Facebook advertising
4 Time series plot showing the amount of money spent on Pinterest advertising
5 Time series graph and correlogram for AR(1) residuals
6 Values of AIC and AICc for the best models according to the backward elimination algorithm
7 Values of BIC for the best models according to the backward elimination algorithm
8 Histogram of the standardized residuals from the AIC model along with the PDF of a standard normal distribution
9 Histogram of the standardized residuals from the AIC model along with the PDF of a standard normal distribution
10 AIC model forecasts vs actual
11 BIC model forecasts vs actual
12 ACF for the residuals of the estimated models
13 Q-Q plot for AIC and BIC models
14 Standardized residuals from the AR(1) model
15 Forecast made using the AR(1) model

List of Tables

1 Descriptive statistics for the sessions variable (training data)
2 Results from ADF tests
3 Ljung-Box tests
4 Precision of prediction intervals
5 Error estimates
6 Parameter estimates and standard errors for the AIC and BIC models

1 Introduction

Logary Analytics is a company that specializes in evaluating the effects of advertising on revenue. Their customers are other companies that want to get the most out of their investments in advertising.

The purpose of this paper is to propose a dynamic regression model for predicting the number of website visits (referred to as sessions) over time, using the amount of money spent on advertising on different platforms and calendar dummies.

The data is partitioned into a training set and a test set. Model selection is performed on the training set using backward elimination based on the Akaike Information Criterion, the Corrected Akaike Information Criterion and Bayes Information Criterion. The models' predictive accuracy is then evaluated using the test set.

A dynamic regression model allows the inclusion of lagged versions of predictor variables while also controlling for time effects in the dependent variable. The ARIMA error term adjusts for any time-dependencies not already included in the model and corrects for possible omitted variables through autocorrelation correction with lags for the dependent variable. The reason that this is preferable to estimation based on a simple ARIMA model alone is that the joint analysis can explain more variation in the data, and the prediction error will therefore be smaller. This has been noted empirically by, for example, Kongchaoren & Kruanpradit (2013), Tsui et al. (2014) and Anners (2017), where a dynamic regression model outperformed a simpler ARIMA model in forecasting. The novelty of this study lies in applying the methods to a new data set.

The research questions are therefore:

• How well can website visits per day be predicted using a dynamic regression model with advertising predictor variables chosen via backward elimination?
• Which information criterion used during backward elimination provides the best model for prediction?
• How well can prediction uncertainty be estimated using normality and bootstrap-based prediction intervals?
• When evaluated on the test data, how do the error measurements and the prediction intervals of the dynamic regression models compare to those of an ARIMA model?

The paper is organized as follows: the next section covers the methodology of the thesis, as well as the theory behind the methods used. Section 3 describes the data. Section 4 presents the results. Finally, section 5 concludes the thesis, followed by the appendix.

2 Theory and Methodology

A dynamic regression model can be defined as a model on the form:

yt = β0 + β1x1,t + ··· + βgxg,t + ηt    (1)

where yt is the outcome variable at time t, β0, . . . , βg are coefficients, x1, . . . , xg are the predictor variables, g is the number of predictor variables, and ηt is an ARIMA error term for yt.

Section 2.1 describes how the ARIMA error term is selected for the dynamic regression model. Section 2.2 describes how the dynamic regression model is created, given an ARIMA error term and section 2.3 is concerned with how the results from the final model are evaluated and how forecasts and prediction intervals are constructed.

2.1 The ARIMA Error Term

This section describes the theory behind ARIMA models, a type of model used as an error term in dynamic regression.

ARIMA modeling is a type of statistical modeling dedicated to determining how the value of a variable relates to previous values of said variable. It consists of autoregressive and moving average terms and can also be differenced. ARIMA models were developed from the work of Norbert Wiener (1949).

An autoregressive time series model of order p is specified as (Hyndman and Athanasopoulos, 2018, Ch. 8.3):

yt = c + φ1yt−1 + φ2yt−2 + ··· + φpyt−p + εt    (2)

where p < t, c is a constant term and φ1, . . . , φp are coefficients. The process has a drift if c ≠ 0.

A moving average time series model specifies that the modelled variable is dependent on the past values of a stochastic error term. It can intuitively be thought of as a model where

future values are predicted based on how wrong the previous estimates were. Hyndman & Athanasopoulos (2018, Ch. 8.4) specify a moving average of order q as:

yt = c + εt + θ1εt−1 + ··· + θqεt−q    (3)

where q < t, εt is the true error of the model at time t and θ1, . . . , θq are the parameters of the MA(q) data generating process. The process has a drift if the constant c ≠ 0.

A model with these parts combined is an ARMA(p, q) model. If differencing of order d is applied, the model is known as an ARIMA(p, d, q) model where the I stands for integrated. Differencing is discussed under section 2.1.1.

This thesis uses the Box-Jenkins approach to time series modeling, which was introduced by Box & Jenkins (1970). It can be broken down into the following steps:

1. Transformation of the data to ensure it is stationary.
2. Model identification, using the ACF and PACF. The values of p and q are determined.
3. Estimation of the parameters in the model. In this thesis, maximum likelihood is used.
4. Residual checking to ensure that there is no autocorrelation left. The residuals should resemble white noise. A Ljung-Box test may be used to test for autocorrelation in the error term after model fitting.

Each step is described in more detail in the following sections.

2.1.1 Stationarity of the Dependent Variable

To create an ARMA model, stationarity is required. Stationarity can roughly be defined as when the properties of the time series are the same no matter when it is observed (Hyndman and Athanasopoulos, 2018, Ch. 8.1). This means that a seasonal time series, or one with a trend, is not stationary. In this thesis, stationarity refers to weak stationarity, which is defined as when a stochastic process holds the following properties (Cryer and Chan, 2008):

E(yt) = µ for all t, and
γt,t−k = γ0,k for all t and all lags k < t    (4)

where γt,t−k is the covariance between observations t and t − k. If a trend or unit root is present in the data, one can transform the process to make it stationary by differencing it. The first difference is obtained by subtracting the value of the previous observation from each observation, as in Equation 5.

y′t = yt − yt−1.    (5)

Augmented Dickey-Fuller Test

To test for stationarity, one can use the Dickey-Fuller test for unit roots, developed by Dickey and Fuller (1979). A variant of this is the Augmented Dickey-Fuller (ADF) test, developed by Dickey and Said (1984), which is used in this thesis. It is similar to the Dickey-Fuller test but is more flexible. The test procedure works as follows (Cryer and Chan, 2008, p. 128). Consider the process:

yt = αyt−1 + xt for t = 1, 2, . . . , T    (6)

where xt is a stationary process. yt is stationary if |α| < 1 and not stationary if |α| = 1. If xt is assumed to be an AR(p) process, xt = φ1xt−1 + ··· + φpxt−p + εt, then xt = yt − yt−1 under the null hypothesis that α = 1. If we then let a = α − 1:

yt − yt−1 = (α − 1)yt−1 + xt
          = ayt−1 + φ1xt−1 + ··· + φpxt−p + εt    (7)
          = ayt−1 + φ1(yt−1 − yt−2) + ··· + φp(yt−p − yt−p−1) + εt

so that a = 0 under the Augmented Dickey-Fuller test's null hypothesis that the process contains a unit root, and a < 0 under the alternative hypothesis that it does not contain a unit root.

The hypotheses of the ADF test are:

H0 : The process contains a unit root, a = 0.

H1 : The process does not contain a unit root, a < 0.

There are three versions of the test, which allow for a constant term and a trend, by adding a constant c and/or the term βt to the right hand side in Equation 7. To perform the test, the number of lags used must first be determined, using the method described in section 2.1.2.

If â is the estimated value of a, obtained using the method described in section 2.1.3, and σ̂â is its standard error, the test statistic is defined as:

ADF = â / σ̂â.    (8)

The statistic does not follow any known distribution under the null. However, critical values have been calculated and are incorporated into the R function adf.test. The null hypothesis is rejected if the observed ADF statistic is smaller than the critical value at the 5% significance level.

When performing multiple tests, the multiple testing problem can arise. This means that the probability of getting at least one type 1 error is inflated from α to 1 − (1 − α)^n, where n is the number of tests performed.
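The mechanics of the test can be sketched in a few lines. The following is a minimal illustration, not the thesis's R code (which relies on adf.test): it fits the regression Δyt = a·yt−1 + et without augmentation lags or a constant, and compares the resulting t-statistic for a simulated stationary AR(1) series and a simulated random walk.

```python
import random

def dickey_fuller_stat(y):
    """t-statistic of a-hat in the regression dy_t = a * y_{t-1} + e_t.
    Strongly negative values are evidence against a unit root (a = 0)."""
    dy = [y[t] - y[t - 1] for t in range(1, len(y))]
    ylag = y[:-1]
    sxx = sum(x * x for x in ylag)
    a_hat = sum(x * d for x, d in zip(ylag, dy)) / sxx
    resid = [d - a_hat * x for x, d in zip(ylag, dy)]
    s2 = sum(r * r for r in resid) / (len(dy) - 1)
    return a_hat / (s2 / sxx) ** 0.5

random.seed(1)
stationary, walk = [0.0], [0.0]
for _ in range(500):
    e = random.gauss(0, 1)
    stationary.append(0.5 * stationary[-1] + e)  # AR(1) with |phi| < 1
    walk.append(walk[-1] + e)                    # unit root process

print(dickey_fuller_stat(stationary))  # strongly negative
print(dickey_fuller_stat(walk))        # near zero
```

In practice the augmented version adds lagged differences and a constant (and possibly a trend) to this regression, and the statistic is compared against the tabulated Dickey-Fuller critical values rather than a t-distribution.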

2.1.2 Choosing the Order of the ARIMA Error Term

The values of p and q of the ARIMA error terms are chosen by looking at plots of the ACF (autocorrelation function) and the PACF (partial autocorrelation function). A plot of the sample autocorrelation function simply shows Corr(yt, yt−k).

The partial autocorrelation function, on the other hand, can be interpreted as the correlation between yt and yt−k after removing the effect of yt−1, yt−2, . . . , yt−k+1. Assuming that yt is a normally distributed time series, it is written formally as (Cryer and Chan, 2008):

ψkk = Corr(yt, yt−k | yt−1, . . . , yt−k+1).    (9)

For an AR(p) process without seasonality, ψkk = 0 when k > p. This means that one can use the PACF plot to determine the order of an autoregressive model. p is based on the number of consecutive significant lags in the PACF.

For an MA(q) process, the ACF is zero for lags k > q. This means that an ACF plot can be used to determine the order of a moving average model: q is based on the number of consecutive significant lags in the ACF.
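The cutoff pattern in the PACF can be made concrete with a short sketch. The sample PACF is computed from the sample ACF via the Durbin-Levinson recursion; this is a simplified stand-in for R's pacf function, applied to a simulated AR(1) series whose PACF should be large at lag 1 and near zero beyond it.

```python
import random

def acf(y, k):
    """Sample autocorrelation at lag k."""
    n = len(y)
    m = sum(y) / n
    c0 = sum((v - m) ** 2 for v in y) / n
    ck = sum((y[t] - m) * (y[t - k] - m) for t in range(k, n)) / n
    return ck / c0

def pacf(y, kmax):
    """Sample PACF for lags 1..kmax via the Durbin-Levinson recursion."""
    r = [acf(y, k) for k in range(kmax + 1)]
    phi = [[0.0] * (kmax + 1) for _ in range(kmax + 1)]
    phi[1][1] = r[1]
    for k in range(2, kmax + 1):
        num = r[k] - sum(phi[k - 1][j] * r[k - j] for j in range(1, k))
        den = 1 - sum(phi[k - 1][j] * r[j] for j in range(1, k))
        phi[k][k] = num / den
        for j in range(1, k):
            phi[k][j] = phi[k - 1][j] - phi[k][k] * phi[k - 1][k - j]
    return [phi[k][k] for k in range(1, kmax + 1)]

random.seed(2)
y = [0.0]
for _ in range(2000):
    y.append(0.7 * y[-1] + random.gauss(0, 1))

p = pacf(y, 5)  # large at lag 1, close to zero at lags 2..5
```

In a plot, lags whose sample PACF falls outside roughly ±2/√T would be flagged as significant, and p is read off from the last of the consecutive significant lags.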

2.1.3 Estimation

Estimation of the parameters in an ARIMA model is done using maximum likelihood. Maximum likelihood estimation returns the parameter estimates that maximize the likelihood given the data.

There are two distinct methods that are used for parameter estimation: the exact maximum likelihood, which is computed numerically, and the conditional maximum likelihood (Hamilton, 1994, p. 125). Estimation of the parameters in an ARIMA model, as well as in a dynamic regression model, is performed using conditional maximum likelihood estimation. Here, an AR(1) process is used as an example to explain conditional maximum likelihood estimation.

Consider a Gaussian AR(1) process, defined as:

yt = c + φyt−1 + εt    (10)

where the errors are assumed to be normally, independently and identically distributed (εt ∼ NIID(0, σ²)). We can let the vector of estimated parameters be A = (c, φ, σ²). Since the process is Gaussian, the density of the first observation can be written as:

fy1(y1; A) = fy1(y1; c, φ, σ²)
           = (1 / (√(2π) √(σ²/(1 − φ²)))) exp[ −(y1 − c/(1 − φ))² / (2σ²/(1 − φ²)) ].    (11)

If we let fyT ,yT −1,...,y2|y1 (yT , yT −1, . . . , y2|y1; A) represent the joint density of observations yT , yT −1, . . . , y2, given observation y1, their likelihood can be calculated as:

fyT,yT−1,...,y2|y1(yT, yT−1, . . . , y2 | y1; A) = ∏_{t=2}^{T} fyt|yt−1(yt | yt−1; A).    (12)

In practice, the natural logarithm of the likelihood, the log likelihood, is often used instead of the likelihood, as it is easier to differentiate. To find the maximum likelihood estimates of the parameters in A, one must maximize:

log fyT,yT−1,...,y2|y1(yT, yT−1, . . . , y2 | y1; A)
  = −((T − 1)/2) log(2π) − ((T − 1)/2) log(σ²) − ∑_{t=2}^{T} (yt − c − φyt−1)² / (2σ²).    (13)

The values of A that maximize Equation 13 are the conditional maximum likelihood estimates. This is later generalized to dynamic regression modeling by simply including predictors and parameters in the same way as φ and lagged versions of y in the expression.

Even if the process is non-Gaussian, the estimates that maximize the Gaussian log likelihood provide consistent estimates of the parameters (Hamilton, 1994, p. 126).
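Equation 13 is straightforward to evaluate numerically. The sketch below is an illustration rather than the thesis's estimation code: it computes the conditional log likelihood of a simulated Gaussian AR(1) series and checks that it is higher at the true parameter values than at an arbitrary wrong guess; an optimizer would search over (c, φ, σ²) for the maximum.

```python
import math
import random

def ar1_cond_loglik(y, c, phi, sigma2):
    """Conditional log likelihood of a Gaussian AR(1), as in Equation 13:
    condition on y_1 and sum the log densities of y_2, ..., y_T."""
    T = len(y)
    sse = sum((y[t] - c - phi * y[t - 1]) ** 2 for t in range(1, T))
    return (-(T - 1) / 2 * math.log(2 * math.pi)
            - (T - 1) / 2 * math.log(sigma2)
            - sse / (2 * sigma2))

random.seed(3)
y = [0.0]
for _ in range(300):
    y.append(1.0 + 0.6 * y[-1] + random.gauss(0, 1))

ll_true = ar1_cond_loglik(y, c=1.0, phi=0.6, sigma2=1.0)
ll_wrong = ar1_cond_loglik(y, c=0.0, phi=0.0, sigma2=1.0)
```

Because σ² enters only through the last two terms, for fixed c and φ the maximizing σ² is simply SSE/(T − 1), which is why conditional maximum likelihood for an AR model closely resembles least squares.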

2.1.4 Residual Diagnostics

The final step of the Box-Jenkins methodology is to determine whether or not the residuals resemble a white noise process. A white noise process contains no autocorrelation, is stationary and is centered around zero. This is tested using a Ljung-Box test.

The hypotheses of the Ljung-Box test are:

H0 : the residuals are independently distributed.

H1 : the residuals are not independently distributed.

The test statistic is formulated as follows (Ljung and Box, 1978):

Q = T(T + 2) ∑_{k=1}^{h} r̂k² / (T − k)    (14)

where T is the sample size, h is the number of lags being tested and r̂k is the sample autocorrelation at lag k. Under the null hypothesis, Q is approximately χ²_{h−(p+q)} distributed, where p + q is the total number of parameters in the model. We reject the null hypothesis that the residuals contain no autocorrelation if and only if the observed Q exceeds χ²_{α,h−(p+q)}, the 1 − α quantile of the central χ² distribution with h − (p + q) degrees of freedom.
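Computed directly, the statistic looks as follows. This is a sketch of the formula itself (in practice one would use R's Box.test, as the thesis's analysis is done in R); white noise residuals give a moderate Q, while autocorrelated residuals give a very large one.

```python
import random

def ljung_box_q(resid, h):
    """Ljung-Box Q = T(T+2) * sum_{k=1}^{h} r_k^2 / (T - k), as in Equation 14."""
    T = len(resid)
    m = sum(resid) / T
    c0 = sum((e - m) ** 2 for e in resid)
    def r(k):
        return sum((resid[t] - m) * (resid[t - k] - m) for t in range(k, T)) / c0
    return T * (T + 2) * sum(r(k) ** 2 / (T - k) for k in range(1, h + 1))

random.seed(4)
white = [random.gauss(0, 1) for _ in range(400)]  # behaves like white noise
ar = [0.0]
for _ in range(399):
    ar.append(0.8 * ar[-1] + random.gauss(0, 1))  # strongly autocorrelated

q_white = ljung_box_q(white, 10)  # moderate, chi-square-like magnitude
q_ar = ljung_box_q(ar, 10)        # far out in the tail
```

The degrees-of-freedom correction h − (p + q) matters when the series being tested is a set of model residuals: the fitted ARMA parameters absorb some autocorrelation, so the reference distribution must be adjusted accordingly.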

2.2 Dynamic Regression

A dynamic regression model, sometimes known as a transfer function model, is a type of regression model where the error term is allowed to contain autocorrelation (Hyndman and Athanasopoulos, 2018, Ch. 9). The general dynamic regression model can be formulated as:

yt = β0 + β1x1,t + ··· + βkxk,t + ηt    (15)

where ηt is an ARIMA process, explained in detail in section 2.1. The dynamic regression model also allows for lagged predictor variables. A model with one predictor variable, using the predictor both for the current and previous time periods, can according to Hyndman and Athanasopoulos (2018, Ch. 9.6) be written as:

yt = β0 + γ0xt + γ1xt−1 + ··· + γhxt−h + ηt    (16)

where h is the number of lags applied to the predictor and γ0, . . . , γh are its coefficients. Lags are included for the advertising variables because a model without lags could not appropriately capture the relationship between advertising and the number of sessions: advertising likely affects not only sessions the same day, but also future sessions.

2.2.1 Stationarity of the Predictors

When constructing a dynamic regression model, it is required that all the variables used in the model are stationary. If they are not, the parameter estimates will be inconsistent. If one variable is found to be nonstationary, the same amount of differencing should be applied to all variables in the model in order to maintain the relationship between the variables (Hyndman and Athanasopoulos, 2018, Ch. 9.1). Evaluation of predictor stationarity is done in the same way as described in section 2.1.1.

2.2.2 Model Specification with Stepwise Selection and Information Criteria

The subject of interest in model selection is the model that minimizes the expected test error. However, the training error is a poor estimate of the test error, as the training error generally decreases as the complexity of the model increases. Complexity is in this case expressed as the number of predictor variables in the model. The reason for this is the bias-variance trade-off: as complexity increases, the model can adapt to more complex nuances in the training data and fit it well, but it will generalize worse to the test data due to larger variance.

This error decomposition for a test sample xtest with squared error loss can be described as follows (Hastie, Tibshirani and Friedman, 2017, p. 223):

Errout = E[(Y − f̂(x))² | X = xtest]
       = σ²ε + [E(f̂(xtest)) − f(xtest)]² + E[f̂(xtest) − E f̂(xtest)]²    (17)
       = Irreducible error + Bias² + Variance

where Y = f(x) + ε is the "true" data generating process.

Since bias typically decreases and variance increases as a function of model complexity, it is necessary to find a compromise that minimises the sum of these two errors. One way is to estimate the test error directly by holding out parts of the training data, and another is to estimate the difference between the training error and the test error. The first is best done with cross validation; the second can be estimated with Akaike's or Bayes Information Criterion.

Akaike's Information Criterion

Akaike's information criterion, AIC, was developed by Akaike (1974) and measures the relative quality of a model, which makes it useful in model selection. The AIC captures relative model quality by modeling the optimism of the training error estimate. The optimism (OP) is defined as the difference between the estimated training error, Errin, and the true error of the model, Errout.

OP = Errout − Errin. (18)

The estimated training error can be formulated as T · log(SSE/T), where T is the number of observations used for estimation and SSE is the sum of squared errors. The optimism can be modelled as 2(v + 2), where v is the number of estimated model parameters, which yields an estimate of the relative test error. Formally, the AIC is defined as (Hyndman and Athanasopoulos, 2018, Ch. 5.5):

AIC = T log(SSE/T) + 2(v + 2).    (19)

For small values of T, however, the AIC tends to overfit the data (Hyndman and Athanasopoulos, 2018, Ch. 5.5). Therefore the corrected AIC, AICc, can be used, as it penalizes the model more strictly for having many parameters. It is defined as:

AICc = AIC + 2(v + 2)(v + 3) / (T − v − 3)    (20)

where T is the number of observations and v is the number of estimated parameters.

Bayes Information Criterion

The BIC works similarly to the AIC but adds a larger penalty term. This means that it will select a model with fewer variables than the AIC (Hyndman and Athanasopoulos, 2018, Ch. 5.5). It is defined as:

BIC = T log(SSE/T) + (v + 2) log(T).    (21)
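The three criteria share the same T log(SSE/T) fit term and differ only in the penalty, which a short sketch makes concrete. The v and "+2" conventions follow Hyndman & Athanasopoulos; the SSE values below are made up for illustration.

```python
import math

def criteria(sse, T, v):
    """AIC, AICc and BIC as in Equations 19-21; v = number of estimated parameters."""
    aic = T * math.log(sse / T) + 2 * (v + 2)
    aicc = aic + 2 * (v + 2) * (v + 3) / (T - v - 3)
    bic = T * math.log(sse / T) + (v + 2) * math.log(T)
    return aic, aicc, bic

# a bigger model fits slightly better (lower SSE) but pays a larger penalty
small = criteria(sse=120.0, T=100, v=3)
big = criteria(sse=118.0, T=100, v=8)
# BIC punishes the five extra parameters harder than AIC does
```

Since log(T) > 2 whenever T > 7, the BIC penalty per parameter exceeds the AIC penalty for any realistic sample size, which is exactly why BIC tends to pick the smaller model.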

Stepwise Selection

Stepwise selection is an iterative method used to decide which variables to use in a regression model. The main approaches are forward selection and backward elimination, which accomplish the same thing but start from different directions. Backward elimination starts with the full model and at each step removes the variable that performs the worst according to the model assessment criterion. Information criteria are less computationally expensive than cross validation and lead to effective model selection (Hastie, Tibshirani and Friedman, 2017, p. 230). Therefore the Akaike Information Criterion and the Bayes Information Criterion are used for model assessment.

All procedures involving predictor selection invalidate the assumptions underlying the p-values (Hyndman and Athanasopoulos, 2018, Ch. 5.5). This is however not a problem here, as the focus of this thesis is on prediction.

According to James et al. (2017, p. 209), backward elimination can be described with the following algorithm:

Backward Elimination Algorithm

1. Let Mg denote the initial model, containing all g predictors.
2. For j = g, g − 1, . . . , 1:
   a. Create j models that each contain all but one of the predictors in Mj, i.e. j − 1 predictors.
   b. Choose the best among these j models, where best is defined as having the lowest information criterion. Call it Mj−1.
3. Select the single best of the models Mg, Mg−1, . . . , M0 using the information criterion.

2.2.3 Parameter Estimation

Since we are using the Arima function in R to estimate the dynamic regression model, estimation of the parameters works in the same way as described in section 2.1.3.

2.2.4 Residual Diagnostics

Here, just like in section 2.1.4, a Ljung-Box test is performed. However, the ARIMA error term was specified so that a model containing only the ARIMA error would have white noise residuals; since the predictors introduce multiple lags, some autocorrelation is therefore likely to remain in the residuals of the dynamic regression model.

The analysis of the residual distribution is done by standardizing the residuals and comparing them to a normal distribution, as well as by constructing a QQ plot.

If the distribution of the residuals is non-normal, or if the effects of autocorrelation are large, then the assumptions behind the normal prediction intervals are invalidated, which motivates the use of the bootstrap. This is discussed further in section 2.3.1.

2.3 Forecasting

When all relevant models have been examined and the best model is selected, it needs to be evaluated on previously unseen data. In order to estimate how well the model can predict new observations, a test data set is used. The appropriate split of observations between the training and test sets depends on the signal-to-noise ratio in the data and the training sample size (Hastie, Tibshirani and Friedman, 2017, p. 222). This is discussed in more depth in section 3.

Since predictor variables are used in the forecast, values for these predictors are needed in order to compute the predictions. The forecasts in this thesis are produced by applying the data in the test set to the model trained on the training set. This yields the forecasted values of the sessions variable.

2.3.1 Prediction Intervals

To estimate the uncertainty of the forecasts, prediction intervals are created based on the training data and then evaluated using the test data. The width of a prediction interval is a function of the forecast error variance, which grows with the forecast horizon because the errors compound.

Taking an AR(1) model as an example, the properties of the errors are displayed below:

yt = φyt−1 + εt,  εt ∼ NIID(0, σ²)    (22)

where σ² is the variance of the innovation term εt, estimated using bias-adjusted maximum likelihood estimation.

The prediction error f steps ahead is the model's predicted value minus the expected value of the future observation:

e(1) = ŷT+1 − E(yT+1) = εT+1
e(2) = ŷT+2 − E(yT+2) = φεT+1 + εT+2
e(3) = ŷT+3 − E(yT+3) = φ²εT+1 + φεT+2 + εT+3    (23)
...
e(f) = φ^(f−1)εT+1 + ··· + φεT+f−1 + εT+f.

The variance of e(f) in a stationary process can be written as (Hyndman and Athanasopoulos, 2018, Ch. 8.8):

Var(e(1)) = σ²
Var(e(2)) = σ²(1 + φ²)
Var(e(3)) = σ²(1 + φ² + φ⁴)    (24)
...
Var(e(f)) = σ² ∑_{i=0}^{f−1} φ^(2i)

where i indexes the forecasted days ahead and f is the total number of forecasted days ahead.

The rate of increase of the variance from the AR(1) model decreases with f as |φ| < 1, so the variance approaches a fixed limit.

The prediction interval of a given forecast is calculated using the standard error of the specific forecast; this is then multiplied by the z-score given by the desired significance level of the prediction interval.

The z-score is the quantile of the standard normal distribution such that (1 − α) · 100% of observations fall within ± that value, where α is the specified significance level.

A 95% prediction interval for ŷT+3, using a z-score of 1.96, is exemplified in Equation 25:

PI : ŷT+3 ± 1.96 √(σ̂²(1 + φ̂² + φ̂⁴)).    (25)
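For a mean-zero AR(1), this construction translates directly into code. The sketch below uses illustrative parameter values, not estimates from the thesis's data:

```python
def ar1_prediction_interval(y_last, phi, sigma2, f, z=1.96):
    """Normal prediction interval f steps ahead for a mean-zero AR(1),
    using Var(e(f)) = sigma2 * sum(phi^(2i) for i = 0..f-1)."""
    point = (phi ** f) * y_last  # point forecast
    var = sigma2 * sum(phi ** (2 * i) for i in range(f))
    half = z * var ** 0.5
    return point - half, point + half

lo1, hi1 = ar1_prediction_interval(2.0, phi=0.6, sigma2=1.0, f=1)
lo3, hi3 = ar1_prediction_interval(2.0, phi=0.6, sigma2=1.0, f=3)
# the interval widens with the horizon but its width stays bounded,
# since the variance converges to sigma2 / (1 - phi^2)
```

The bounded width is the AR(1) counterpart of the remark above: because |φ| < 1, the forecast error variance converges, so long-horizon intervals stop growing.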

In order for the prediction intervals to be accurate, the residuals need to be normally distributed and free of autocorrelation. If this is not the case, the prediction interval may contain too many or too few observations due to faulty estimates of σ² and a lack of convergence of ŷT+f to the normal distribution.

Prediction intervals are generally too narrow. According to Hyndman & Athanasopoulos (2018, Ch. 11.4), this is because some sources of uncertainty are not accounted for.

If the normality assumption does not hold, bootstrap-based prediction intervals are applied as complements to the normal-based prediction intervals and evaluated.

The bootstrap procedure only assumes that the error terms are uncorrelated (Hyndman and Athanasopoulos, 2018, Ch. 3.5). This allows εT+1 to be estimated by drawing from the residuals of previous observations. Repeating this for all f future values yields simulated values of yT+1, . . . , yT+f that should behave similarly to the actual values. Iterating this process yields many possible future paths, from which percentiles can be calculated; a (1 − α) · 100% prediction interval is one that contains (1 − α) · 100% of the simulated observations, where α is the proportion of observations allowed to fall outside the interval. The number of iterations performed is 2000, which is the default in the R package forecast.
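A sketch of the procedure for a mean-zero AR(1) (the thesis uses the forecast package's implementation; the function name and parameter values here are illustrative): future innovations are resampled from the model's own residuals, many paths are simulated, and the interval endpoints are read off as percentiles at each horizon.

```python
import random

def bootstrap_pi(y, phi, f, level=0.80, n_sims=2000):
    """Bootstrap prediction intervals for a mean-zero AR(1):
    draw future errors from the fitted residuals instead of a normal law."""
    resid = [y[t] - phi * y[t - 1] for t in range(1, len(y))]
    paths = []
    for _ in range(n_sims):
        cur, path = y[-1], []
        for _ in range(f):
            cur = phi * cur + random.choice(resid)  # resampled innovation
            path.append(cur)
        paths.append(path)
    alpha = (1 - level) / 2
    intervals = []
    for step in range(f):
        vals = sorted(p[step] for p in paths)
        intervals.append((vals[int(alpha * n_sims)], vals[int((1 - alpha) * n_sims) - 1]))
    return intervals

random.seed(6)
y = [0.0]
for _ in range(300):
    y.append(0.6 * y[-1] + random.gauss(0, 1))

pis = bootstrap_pi(y, phi=0.6, f=5)  # interval widths grow with the horizon
```

Because the simulated errors come from the empirical residual distribution, skewness or heavy tails in the residuals carry over into asymmetric or wider intervals, which is exactly the robustness the normal intervals lack.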

2.3.2 Model Evaluation

Three types of error measurements are used in this thesis: the root mean squared error (RMSE), the mean absolute error (MAE) and the mean absolute percentage error (MAPE). All of these measure how wrong the estimates are. The RMSE is an estimate of the average error that penalizes large errors proportionally more. The MAE estimates how far off the model is on average in absolute terms, and the MAPE estimates how far off the estimates are relative to the absolute value of the dependent variable. If yt is the actual value of y at time t, and ŷt is the estimated value of y at time t, we can define the error measurements as:

RMSE = √( ∑_{t=1}^{T} (yt − ŷt)² / T )    (26)

MAE = ∑_{t=1}^{T} |yt − ŷt| / T    (27)

MAPE = (100/T) ∑_{t=1}^{T} |(yt − ŷt)/yt|.    (28)

The error measurements of the dynamic regression models are compared to the error measurements of a model using only the ARIMA error term explained in section 2.1.
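Equations 26-28 in code form, with toy numbers chosen so the values are easy to verify by hand:

```python
def rmse(actual, pred):
    """Root mean squared error, Equation 26."""
    return (sum((a - p) ** 2 for a, p in zip(actual, pred)) / len(actual)) ** 0.5

def mae(actual, pred):
    """Mean absolute error, Equation 27."""
    return sum(abs(a - p) for a, p in zip(actual, pred)) / len(actual)

def mape(actual, pred):
    """Mean absolute percentage error, Equation 28."""
    return 100 / len(actual) * sum(abs((a - p) / a) for a, p in zip(actual, pred))

actual = [100.0, 200.0, 400.0]
pred = [110.0, 180.0, 400.0]
# MAE = (10 + 20 + 0) / 3 = 10; MAPE = (100/3)(0.10 + 0.10 + 0) ≈ 6.67%
```

Note how the same absolute miss of 20 on the second observation weighs more in the RMSE than two misses of 10 would, and how the MAPE is undefined whenever an actual value is zero, which is why it suits a strictly positive series such as daily sessions.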

3 Data

3.1 Data Pre-Processing

The original data received from Logary Analytics are divided into two files, one containing data about advertising for several companies in different countries, and one containing session data from Google Analytics. Both data sets contain thousands of rows. Before analyzing the data, some pre-processing needs to be performed.

In processing the data, observations not referring to the specific company and country of interest are removed. The data set with information about advertising contains three relevant advertising platforms: Google, Facebook and Pinterest. We rename the rows referring to advertising on Google to gAds and gAds1, as the company uses two different accounts for advertising on Google: one for “performance advertising” and one for “brand advertising”. This is done to reflect the fact that the two different types of advertising could have different effects.

Since observations from the same day are spread out over several rows, we reorganize the data as one observation per day and add the amount of money spent on each platform as separate columns, so that for each day there are four variables denoting how much was spent on each platform that day. If there was no advertisement spending on a certain platform on a given day, its value is 0.
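The reshaping step can be sketched as follows. This is a Python illustration with made-up rows (the thesis's processing is presumably done in R); the platform column names gAds, gAds1, facebook and pinterest mirror the variables described above.

```python
from collections import defaultdict

# hypothetical long-format rows: (date, platform, amount_spent)
rows = [
    ("2020-02-19", "gAds", 120.0),
    ("2020-02-19", "facebook", 40.0),
    ("2020-02-20", "gAds1", 15.0),
]
platforms = ["gAds", "gAds1", "facebook", "pinterest"]

# one observation per day, one spend column per platform, 0 when nothing was spent
daily = defaultdict(lambda: {p: 0.0 for p in platforms})
for date, platform, amount in rows:
    daily[date][platform] += amount

wide = {d: daily[d] for d in sorted(daily)}
```

The defaultdict ensures every platform column exists for every observed day, which implements the rule that days without spending on a platform receive a 0 rather than a missing value.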

13 The other file, containing session data, also contains several observations per day and includes variables such as country, date and sessions. The data are reorganized in a manner similar to the other data set, so that the data frame contains daily observations of the sessions variable. This data set is then combined with the one containing the amount spent on advertising per day to create the final data set. We then create dummy variables for Black Friday as well as for weekdays. The dummy variable for Black Friday is 1 on Black Friday and the two following days, as its effect seems to be present on those days as well, which is visible in Figure 1.

3.1.1 Data Partitioning

The signal-to-noise ratio is the ratio of the variation in the dependent variable that can be attributed to the signal (in this case the effect of advertising) to the noise (unexplained variance). When there is a lot of noise in relation to the signal, a larger portion of the sample needs to be allocated to training, as more data is required to separate the signal from the noise. Time-related dimensions also affect uncertainty, as time-related events introduce noise if they cannot be explained. Due to these time disturbances, it seems reasonable to include a full year in the training data, yielding a relatively large training/test split of 82.6/17.4 compared to Hastie, Tibshirani and Friedman (2017, p. 222), who recommend that 25% of the total sample be allocated to the test set.
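Since observation order carries the signal in a time series, the partition is chronological rather than random. A trivial sketch (function name is ours):

```python
# Chronological train/test split for time series: the first n_train
# observations go to training, the remainder to testing. No shuffling,
# since temporal order matters.

def split_time_series(series, n_train):
    return series[:n_train], series[n_train:]
```

With 366 of the 443 daily observations in training, the split is 366/443 ≈ 82.6%.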

3.2 Final Data

The resulting data consist of 443 daily observations of money spent on advertising on three different platforms as well as the number of website sessions each day for a specific company in Sweden. The session data are collected from Google Analytics. The first observation is from February 19th 2020 and the last observation is from May 6th 2021. The first 366 observations belong to the training set, while the rest belong to the test set.

The three advertising platforms are Google, Facebook and Pinterest. From 2020-02-19 until 2020-09-29, the company mostly spent money on Google advertising, and less on the two other platforms. From that point until 2021-02-12, the company spent more than before on Facebook advertising. Out of respect for the company we are examining, the exact values of the advertising variables are omitted from the graphs. In Table 1 the descriptive statistics for the sessions variable are displayed.

Statistic   N     Mean        St. Dev.    Min     Pctl(25)   Pctl(75)   Max
Value       366   6,015.254   2,276.371   2,402   4,708.2    6,754      29,920

Table 1: Descriptive statistics for the sessions variable (training data)

3.3 Selecting the Number of Lags for the Predictors

The relationship between the predictors of interest (advertising) and the number of sessions on the website is believed to be straightforward most of the time: advertising leads people to click on a link and then browse the website. However, it is possible that the advertisements have a more indirect effect and that people visit the website later rather than on the day of the advertisement. This effect is covered by lagged versions of the advertising variables, which leads to the question of how many lags to include in the initial model. Adding more variables incurs a cost in terms of efficiency, since the number of models to test grows as (g + 1)g/2, where g is the number of predictor variables. On top of that, adding one more lag for the backward elimination algorithm to examine means adding one more lag for all the advertising platforms (unless there is good reason to have different numbers of lags for different predictors).

With this in mind, there is a practical upper limit to the number of lags that can be tested. Including multiple weeks of lags is not really necessary, since most of the effect should be direct (by clicking the ad and being taken to the company’s website) and the effect of each subsequent day should be small and diminishing. Therefore the first 6 lags are examined and are believed to capture most of the effect. This means that there are 4 · 7 + 6 + 1 = 35 variables to select from, resulting in 36 · 35/2 = 630 models to evaluate.
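The lag construction and the model count can be sketched as follows (a minimal Python illustration; function names are ours):

```python
# For each advertising variable the contemporaneous value plus lags 1..6
# are kept: 4 platforms * 7 columns = 28 spend variables; adding 6
# weekday dummies and the Black Friday dummy gives 35 candidates, so
# (35 + 1) * 35 / 2 = 630 models for backward elimination to evaluate.

def lagged(series, max_lag):
    """Return columns [lag 0, lag 1, ..., lag max_lag], trimmed to equal length."""
    n = len(series) - max_lag
    return [series[max_lag - k : max_lag - k + n] for k in range(max_lag + 1)]

def models_to_evaluate(g):
    """Number of model fits in backward elimination with g candidate variables."""
    return (g + 1) * g // 2
```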

4 Results

The initial dynamic regression model is defined as:

y_t = β_0 + β_1(B) gAds_t + β_2(B) gAds1_t + β_3(B) facebook_t + β_4(B) pinterest_t + β_5 blackfriday_t + β_6 sunday + β_7 tuesday + β_8 wednesday + β_9 thursday + β_{10} friday + β_{11} saturday + η_t    (29)

where y_t is the sessions variable, β_0 is a drift term, gAds_t and gAds1_t are the amounts spent on the two different accounts for Google ads, facebook_t is the amount spent on Facebook advertising per day and pinterest_t is the amount spent on Pinterest advertising on day t. The final seven variables are dummy variables for weekdays and Black Friday. η_t is an ARIMA error and

β_i(B) = 1 − β_{i1} B − β_{i2} B² − ⋯ − β_{i6} B⁶    (30)

where β_i(B) refers to the coefficients in the initial model and i is 1, 2, 3 or 4. B is the backshift operator, also known as the lag operator. It is an operator that, when applied to a variable, produces the variable from the previous period (see Equation 31).

B y_t = y_{t−1}
B² y_t = y_{t−2}
⋮
B^h y_t = y_{t−h}.    (31)
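The backshift operator of equation 31 can be illustrated as a one-line function on an indexed series (a minimal sketch; the function name is ours):

```python
# The backshift (lag) operator B as a function: B^h applied at time t
# returns the value from h periods earlier, i.e. B^h y_t = y_{t-h}.

def backshift(series, t, h=1):
    """B^h y_t = y_{t-h}; assumes t - h is a valid index."""
    return series[t - h]
```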

Section 4.2 specifies the ARIMA error of the model (η_t) and the order of differencing required for all variables, section 4.3 specifies the predictors chosen for the models using backward elimination, and section 4.4 estimates the models and analyses the residuals. Section 4.6 creates forecasts on the test set using the created models, compares the predicted values to the actual session values and computes the errors. It also estimates and evaluates the prediction intervals of the forecasts.

4.1 Stationarity of All Variables

The stationarity of the numeric variables is examined both visually, using time series graphs and correlograms, and using the Augmented Dickey-Fuller (ADF) test. Judging by the time series plot in Figure 1, there does not seem to be a trend component, nor does there seem to be any seasonality. There does not seem to be any weekly seasonality either, as there is no significant seventh lag. Extreme values occur during two periods especially: Black Friday at the end of November and the period between Christmas and New Year’s. The process has a drift.

For the variables gAds and gAds1, there does not seem to be any clear trend judging by Figure 2. gAds starts out with low values which then increase during the summer, decrease during the autumn and then go up again around Black Friday and Christmas. It is however difficult to determine if this is a sign of seasonality, as the data only span about a year and one can therefore not know if this pattern is repeated every year.

Figure 1: Time series plot and correlogram of sessions variable

Figure 2: Time series plots showing the amount of money spent on gAds and gAds1

Figure 3 shows that, up until the end of September 2020, no money was spent on advertising on Facebook. The amount spent per day remains quite stable after this, aside from the peak on Black Friday. There is no visible trend or seasonality in the data.

Pinterest, on the other hand, was zero for a period of time around April, as seen in Figure 4. For the rest of the year, it still holds values much lower than those of the gAds variables. It does not seem to have a trend or seasonality component.

Figure 3: Time series plot showing the amount of money spent on Facebook advertising

Figure 4: Time series plot showing the amount of money spent on Pinterest advertising

To make sure that no unit root is present for any of the variables, ADF tests are performed for each variable, using the version of the test that includes a drift but not a trend. As discussed in section 2.2.1, to avoid the multiple testing problem the significance level, which is usually set to α = 0.05, is set to 1 − (1 − 0.05)^5 ≈ 0.23.
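The adjusted level follows from the complement rule: with five tests each at α = 0.05, the probability of at least one false rejection is 1 − (1 − 0.05)^5. A one-line check (illustrative only):

```python
# Family-wise probability of at least one false rejection across five
# independent tests, each at the 0.05 level.
alpha_familywise = 1 - (1 - 0.05) ** 5  # ≈ 0.226
```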

For the sessions variable 1 lag is used for the test, as the dynamic regression model used contains an AR(1) error term. This is described in section 4.2. For the other variables, 6 lags are used. The choice of lags is explained further in section 3.3. As seen in Table 2, the p values for all variables are below 0.23. The null hypothesis that there exists a unit root in the time series is therefore rejected for all of the variables. Since these time series do not contain trends or unit roots, differencing is not required.

Variable    Lag   p value    ADF value
Sessions    2     <= 0.01    -7.97
gAds        6     0.0138     -3.37
gAds1       6     0.0725     -2.74
Facebook    6     <= 0.01    -4.61
Pinterest   6     <= 0.01    -3.51

Table 2: Results from ADF tests

Figure 5: Time series graph and correlogram for AR(1) residuals

4.2 Determining the ARIMA Error

According to the PACF in Figure 1 there seems to be only one significant lag, while the lags in the ACF are decreasing in value over time. These features are indicative of an AR(1) process.

After estimating and running an AR(1) model on the training data, we can examine the residuals shown in Figure 5. From the figure, it seems like the residuals approximate white noise: there are no significant lags, the mean seems centered around 0 and the process looks stationary. However, the residuals are quite large around Black Friday. By performing a Ljung-Box test on the residuals, the null hypothesis that all the autocorrelations are 0 is tested. As seen in Table 3, we do not reject the null hypothesis at the 5% significance level.
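The Ljung-Box statistic can be sketched from its definition, Q = n(n+2) Σ_{k=1}^{h} r_k² / (n − k), where r_k is the lag-k sample autocorrelation; under the null of no autocorrelation, Q is approximately chi-square with h degrees of freedom. A from-scratch illustration (in practice a statistics library would be used; names are ours):

```python
# Ljung-Box Q statistic computed from sample autocorrelations.

def autocorr(x, k):
    """Lag-k sample autocorrelation of the sequence x."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x)
    return sum((x[t] - mean) * (x[t - k] - mean) for t in range(k, n)) / var

def ljung_box_q(x, h):
    """Q = n(n+2) * sum_{k=1}^{h} r_k^2 / (n - k)."""
    n = len(x)
    return n * (n + 2) * sum(autocorr(x, k) ** 2 / (n - k) for k in range(1, h + 1))
```

A large Q relative to the chi-square critical value leads to rejecting the white-noise null.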

Since an AR(1) model seems to fit the sessions variable quite well, we use an AR(1) error term in the dynamic regression. This model is also used in section 4.6 to evaluate the dynamic regression model fit.

Figure 6: Values of AIC and AICc for the best models according to the backward elimination algorithm

4.3 Model Selection with Backward Elimination and Information Criteria

The model is optimised with respect to the AICc, AIC and BIC using the advertising variables, six lags of each advertising variable, dummy variables for weekdays and Black Friday as well as an AR(1) error.
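The criteria themselves follow the standard definitions AIC = 2k − 2 ln L, AICc = AIC + 2k(k + 1)/(n − k − 1) and BIC = k ln(n) − 2 ln L, where L is the maximized likelihood, k the number of estimated parameters and n the sample size (these formulas are standard and not spelled out in the text):

```python
import math

# Standard information criteria, computed from the maximized
# log-likelihood, the parameter count k, and the sample size n.

def aic(loglik, k):
    return 2 * k - 2 * loglik

def aicc(loglik, k, n):
    # Small-sample correction of AIC.
    return aic(loglik, k) + 2 * k * (k + 1) / (n - k - 1)

def bic(loglik, k, n):
    # Penalizes parameters more heavily than AIC for n > e^2.
    return k * math.log(n) - 2 * loglik
```

The heavier BIC penalty is what makes the BIC model more conservative in its choice of variables.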

By running the initial model, specified in equation 29, through the backward elimination algorithm using three different information criteria, we get two selected models. As seen in Figure 6, both AIC and AICc select the model with index number 29.

They both contain the same six predictor variables. If we let η_t represent the AR(1) error term, the model can be written as:

y_t = β_0 + β_1 gAds_t + β_2 gAds_{t−5} + β_3 gAds1_{t−6} + β_4 facebook_t + β_5 sunday + β_6 tuesday + η_t.    (32)

We will be referring to this model as the AIC model. When using BIC, model number 31 is instead chosen as the best model, as seen in Figure 7. Again, letting η_t represent the AR(1) error term, the model chosen via BIC can be written as:

y_t = β_0 + β_1 gAds_t + β_2 gAds_{t−5} + β_3 gAds1_{t−6} + β_4 facebook_t + η_t.    (33)

This model will be referred to as the BIC model.
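The selection procedure used for both models can be sketched generically: starting from the full variable set, repeatedly drop the variable whose removal gives the best criterion value, and keep the best set encountered. A simplified illustration in which `criterion` stands in for refitting the model and computing AIC or BIC (lower is better):

```python
# Greedy backward elimination over a set of candidate variables.

def backward_eliminate(variables, criterion):
    current = list(variables)
    best_set, best_score = list(current), criterion(current)
    while current:
        # Score every one-variable removal and keep the best reduction.
        candidates = [[v for v in current if v != drop] for drop in current]
        current = min(candidates, key=criterion)
        score = criterion(current)
        if score < best_score:
            best_set, best_score = list(current), score
    return best_set, best_score
```

For example, with a toy criterion that charges 1 per included variable and 10 for each of the "true" variables "a" and "b" left out, the procedure recovers exactly {"a", "b"}.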

Figure 7: Values of BIC for the best models according to the backward elimination algorithm

Model         p value
AR(1) model   0.82
AIC model     <0.001
BIC model     <0.001

Table 3: Ljung-Box tests

4.4 Model Evaluation Using Residuals

After estimating both models, we can examine the residuals. By looking at Figures 8 and 9, we can see that the residuals do not seem to follow a normal distribution for either of the two models. We see it again by looking at the QQ plots in Figure 13, shown in the appendix. The two sets of residuals might indicate a slight skew, but the biggest difference from the standard normal distribution comes in the form of excess kurtosis, as there are more observations close to zero and at the extremes. From the plots it is also evident that the residuals are more prone to take on extreme positive values rather than extreme negative values, as there is more deviation on the right side of the plot than on the left for both models.

The prediction intervals of the predictions are based on normality of the residuals. Since extreme values are present, this is likely to lead to upward bias in the estimates, which is why the normal prediction intervals are complemented with bootstrap-based prediction intervals. The actual accuracy of these intervals is evaluated in section 4.6.
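The idea behind the bootstrap-based intervals can be sketched as follows: resample historical residuals with replacement, add them to the point forecast to simulate possible outcomes, and take empirical percentiles. This avoids the normality assumption. A simplified one-step illustration, not the exact procedure of the thesis (names and defaults are ours):

```python
import random

# Bootstrap prediction interval for a single point forecast.

def bootstrap_interval(forecast, residuals, level=0.8, n_boot=2000, seed=1):
    rng = random.Random(seed)
    # Simulate outcomes by adding resampled historical residuals.
    sims = sorted(forecast + rng.choice(residuals) for _ in range(n_boot))
    lo = sims[int(n_boot * (1 - level) / 2)]
    hi = sims[int(n_boot * (1 + level) / 2) - 1]
    return lo, hi
```

Because the percentiles come from the empirical residual distribution, the resulting interval need not be symmetric around the forecast, unlike the normal-theory interval.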

Figure 8: Histogram of the standardized residuals from the AIC model along with the PDF of a standard normal distribution

By looking at Figure 12 in the appendix, we see that many of the lags are significantly different from zero for the residuals of both models. If we perform Ljung-Box tests on the residuals for both models, we reject the null hypothesis that there is no autocorrelation in the residuals at the 5% significance level. The p values for the tests are displayed in Table 3.

Since the residuals from the AR(1) model displayed no autocorrelation (see section 4.2), auto- correlation is likely to have been introduced by the added predictor variables. This is possible as the predictors themselves were lagged variables that correlate with the sessions variable. Therefore when the backward elimination process chooses more lags, more autocorrelation is introduced.

While some values of the ACF are significantly different from zero, they are not very large, as many of them are around 0.1. Therefore, it is not believed that the autocorrelation will have a large impact on the predictions. During backward elimination, the trade-off between the number of variables and the effects of autocorrelation is taken into account by choosing the models that minimize the AIC and BIC. The models with index number 35 (the last models from the backward elimination processes), which only contain the AR(1) error term and no predictor variables, do not have any autocorrelation in the residuals (as shown in section 4.2) but are likely to perform worse for prediction, as the AIC and BIC values are much higher.

Figure 9: Histogram of the standardized residuals from the BIC model along with the PDF of a standard normal distribution

4.5 Parameter Evaluation

The parameter estimates of the two selected models are shown in Table 6 in the appendix. AR(1) represents the estimate of the parameter in the AR(1) error term. The values within parentheses display the standard errors of the estimated parameters.

The assumptions behind the p values are invalidated by the model selection process, as described in section 2.2.2. The standard errors of the estimated parameters are also unreliable, as the residual distribution deviates from normality (discussed in section 4.4). The estimates displayed in Table 6 are therefore flawed and care should be taken when drawing conclusions from them.

4.6 Forecasting

Because of the non-normal residuals, we create bootstrap-based prediction intervals as complements to the prediction intervals using the normality assumption. Figures 10 and 11 display the actual values for the sessions variable in the test set, along with the forecasts produced by each model. The dotted lines display the bootstrap-based prediction intervals at the 80% and 95% levels, and the blue and grey areas represent the corresponding prediction intervals made under the assumption of normality.

Figure 10: AIC Model forecasts vs actual

The forecasts look quite similar, judging by Figures 10 and 11. One difference is that the seasonal variance seems slightly larger in the forecasts using the AIC model. This is reasonable, since the difference between the models is the inclusion of weekday variables in the AIC model. We can also see that in both forecasts, the predicted values were consistently under the actual values until mid-April, after which the forecasts overestimated the number of sessions.

By looking at Figures 10 and 11, we can see that the prediction intervals made using the normality assumption are symmetrical, while the bootstrap-based prediction intervals are not. This is due to extreme positive values being more common than extreme negative ones. The prediction intervals also seem quite pessimistic in relation to the variance of the actual observations. This is confirmed by Table 4, which shows that the prediction intervals from the dynamic regression models covered more than the expected share of the actual observations. The table shows the percentage of actual values of sessions in the test set that were covered by the prediction intervals.

Figure 11: BIC Model forecasts vs actual

Method:     Normal   Normal   Bootstrap   Bootstrap
Interval:   80%      95%      80%         95%
AIC         90.9%    100%     88.3%       100%
BIC         92.2%    100%     90.9%       100%
AR(1)       70.1%    96.1%    40.3%       98.7%

Table 4: Precision of prediction intervals
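The coverage percentages in Table 4 are simply the share of test observations falling inside the interval. A trivial sketch (function name is ours):

```python
# Empirical coverage of a prediction interval over a test set: the
# percentage of actual values y_t with lower_t <= y_t <= upper_t.

def coverage(actual, lower, upper):
    inside = sum(lo <= y <= hi for y, lo, hi in zip(actual, lower, upper))
    return 100 * inside / len(actual)
```

An 80% interval is well calibrated when this value is close to 80; values well above indicate a too-pessimistic (too wide) interval.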

The biggest difference in accuracy between the prediction intervals from the dynamic regression models and the AR(1) model is the 80% bootstrap-based interval, which only contained 40.3% of the observations for the AR(1) model. This is interesting, as this interval was the only one for which the assumption of no autocorrelation actually held (see Table 3). As the bootstrap-based residuals are based on the historical variation, its failings suggest that the AR(1) model is not enough to model the variation of the data. This suggests that the choice of predictor variables can be more important than the absence of autocorrelation when modeling correct prediction intervals. One way to improve on the interval would be to transform the sessions variable in order to make the residuals normal (discussed in section 5).

The dynamic regression models are characterized more by similarities than differences, but as seen in Table 5, the AIC model slightly outperforms the BIC model when using RMSE as the error measurement. However, the MAE and MAPE measurements indicate that the BIC model performs slightly better. The difference between the two models is not very big, judging by any of the three measurements in Table 5: it is 7 for RMSE, 2 for MAE and 0.04 percentage points for MAPE.

       AIC Model   BIC Model   AR(1) Model
RMSE   1218        1225        2518
MAE    1133        1131        2254.084
MAPE   14.73       14.69       37.38

Table 5: Error estimates

Relating the errors of the dynamic regression model to the AR(1) model, there is a clear difference. This signifies that the relationship between the predictors and the dependent variable is strong and therefore, taking advertising into account when predicting website visits is important for prediction accuracy.

5 Discussion

The discussion is centered around the research questions stated in section 1.

The study finds that the dynamic regression models created can predict website visits per day for the company under study quite well. The values of the error measurements are not very large, especially when comparing them to the actual values of the sessions variable. As the sessions variable has a standard deviation of 2276 (displayed in Table 1), MAE values of around 1130 for a forecast made 77 steps ahead show that the created dynamic regression models can prove useful for predicting website visits in practice.

As is visible in Table 5, there was no substantial difference in prediction accuracy between the AIC and BIC models. The reason that neither model outperformed the other is likely that a few variables contribute the majority of the predictive signal. This is backed by the results in Table 6, showing that the variables for Sunday and Tuesday did (on average) not contribute much in terms of signal. Therefore the conclusion is that the other four variables were key in predicting future values of the sessions variable.

It is possible that the dummy variable for Tuesdays included in the AIC model is spurious. While it is possible that workdays have a negative impact on website visits, there is no intuitive reason why Tuesdays specifically would impact website visits more than any other weekday. A possible solution to this would be to, instead of using dummy variables for every

day of the week, use a dummy variable for working days.

By comparing the error measurements for the AIC and BIC models with the AR(1) model, it is clear that both dynamic regression models outperform the AR(1) model when it comes to prediction, as the errors of the AR(1) model are around twice as high as the errors of the dynamic regression models. This is in line with the findings of Kongcharoen & Kruangpradit (2013), Tsui et al. (2014) and Anners (2017), who all noted that a dynamic regression model outperformed an ARIMA model for forecasting.

As Table 4 shows, the prediction intervals using bootstrap are more accurate than those based on normality. However, both types of intervals are too pessimistic, as they contain too many observations. The reason that the normal prediction intervals are too large is probably the long tails of the residual distribution: a few extreme observations have a large influence on the errors, and therefore also on the estimated standard deviation, which in turn leads to inflated prediction intervals. The reason that the bootstrap-based prediction intervals are too large might be the autocorrelation of the model residuals, but as is evident from the AR(1) model, misspecification can also result in considerable error. The intervals of the dynamic regression models also outperform those of the AR(1) model. The relative accuracy of the bootstrap-based prediction intervals from the dynamic regression compared to the AR(1) model suggests that the model is decently accurate in describing the data-generating process.

5.1 Suggestions for Future Studies

As the data set used for this study only contains data for about a year, it is not possible to inspect the effect of sales-impacting events that only occur once a year, such as Black Friday and Christmas sales. It would therefore be interesting to perform further studies using data from a longer time period. If the data had been collected for multiple years these time related effects could be estimated with yearly seasonal effects.

Furthermore, because of computational limitations, the number of lags used in the initial model (shown in equation 29) is limited. It would be interesting to compare the findings of the models presented in this study to a model using similar data, but with monthly observations. Such a model could possibly capture more of the long-term effects of marketing on website visits, as the lags included in the advertising variables would cover a longer time period. This would provide insight into what is “missed” when only using lags for six days.

Clearly, such a model would not be able to provide daily results, however.

It is possible that taking the natural logarithm of the sessions variable would dampen the impact of extreme values in the residuals. This might improve the accuracy of the normal prediction intervals. Another possibility is a Box-Cox transform of the sessions variable, which could make the residuals closer to normally distributed.
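Both transformations can be written as one small function, since the Box-Cox transform reduces to the natural logarithm as its parameter λ tends to 0 (an illustrative sketch):

```python
import math

# Box-Cox transform of a positive value y:
#   (y^lam - 1) / lam  for lam != 0, and log(y) for lam == 0.

def box_cox(y, lam):
    if lam == 0:
        return math.log(y)
    return (y ** lam - 1) / lam
```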

5.2 Alternative Methods

One alternative to using a dynamic regression model for predicting website visits would be to use a neural network. Neural networks are often used for prediction and work quite well for this purpose. The downside is that they can often be perceived as hard to understand and interpret. It would, however, be interesting to compare a neural network’s predictive capabilities to those of the models used in this thesis.

Lasso is an alternative method that is useful for model selection. It adds a penalty term containing the sum of the absolute values of the beta parameters, scaled by a regularization parameter λ. The lasso minimizes (James et al., 2017, p. 219):

Err_in + λ Σ_{j=1}^{p} |β_j|.    (34)

The model is then refit over a range of values of λ, and the value that results in the lowest out-of-sample error is chosen. The method is deemed on par with selecting models using AIC and BIC.

Also, in order to save degrees of freedom, a Fourier series could be applied to correct for weekday effects. Since this effect is small in our data, this was not deemed necessary.

Finally, a distributed lag model, in which the lag parameters are forced to decrease in some pattern, could be used. This avoids problems arising from collinearity between lags of the same variable. Since collinearity does not have a large effect on prediction, this is not implemented in this study.

References

Akaike, H. (1974) ‘A new look at the statistical model identification’, IEEE Transactions on Automatic Control, 19(6), pp. 716–723.

Anners, C. (2017) Forecasting energy usage in the industrial sector in Sweden using SARIMA and dynamic regression. Master’s thesis. Uppsala University, Department of Statistics.

Box, G. E. P. and Jenkins, G. M. (1970) Time series analysis: Forecasting and control. Holden-Day (Holden-day series in time series analysis and digital processing). Available at: https://books.google.se/books?id=5BVfnXaq03oC.

Cryer, J. D. and Chan, K.-S. (2008) Time series analysis: With applications in R. New York, NY: Springer New York.

Dickey, D. A. and Fuller, W. A. (1979) ‘Distribution of the estimators for autoregressive time series with a unit root’, Journal of the American Statistical Association, 74(366a), pp. 427–431. doi: 10.1080/01621459.1979.10482531.

Hamilton, J. D. (1994) Time series analysis. Princeton, N.J: Princeton Univ. Press.

Hastie, T., Tibshirani, R. and Friedman, J. (2017) The elements of statistical learning: Data mining, inference, and prediction, second edition. New York, NY: Springer New York.

Hyndman, R. J. and Athanasopoulos, G. (2018) Forecasting: Principles and practice. 2nd edn. Australia: OTexts.

James, G. et al. (2017) An introduction to statistical learning: With applications in R. New York, NY: Springer New York.

Kongcharoen, C. and Kruangpradit, T. (2013) ‘Autoregressive integrated moving average with explanatory variable (ARIMAX) model for Thailand export’, in.

Ljung, G. and Box, G. (1978) ‘On a measure of lack of fit in time series models’, Biometrika, 65. doi: 10.1093/biomet/65.2.297.

Said, S. E. and Dickey, D. A. (1984) ‘Testing for unit roots in autoregressive-moving average models of unknown order’, Biometrika, 71(3), pp. 599–607. Available at: http://www.jstor.org/stable/2336570.

Tsui, K. et al. (2014) ‘Forecasting of Hong Kong airport’s passenger throughput’, Tourism Management, 42, pp. 62–76. doi: 10.1016/j.tourman.2013.10.008.

Wiener, N. (1949) Extrapolation, interpolation, and smoothing of stationary time series: With engineering applications. Cambridge, Mass.: MIT Press.

A Appendix

Figure 12: ACF for the residuals of the estimated models

Figure 13: Q-Q Plot for AIC and BIC Models

Figure 14: Standardized residuals from the AR(1) model

Figure 15: Forecast made using the AR(1) model

Variables      AIC Estimates         BIC Estimates
AR(1)          0.77*** (0.03)        0.77*** (0.03)
intercept      3588.11*** (335.56)   3695.34*** (334.73)
gAds           1.11*** (0.09)        1.04*** (0.10)
facebook       0.88*** (0.05)        0.90*** (0.05)
gAds1_{t−6}    0.21 (0.12)           0.22 (0.12)
gAds_{t−5}     −0.24** (0.08)        −0.23** (0.08)
sunday         239.17* (108.18)      -
tuesday        −155.58 (100.45)      -

***p < 0.001; **p < 0.01; *p < 0.05

Table 6: Parameter estimates and standard errors for the AIC and BIC models
