Regression analysis: An evaluation of the influences behind the pricing of beer
Regressionsanalys: En utvärdering av influenserna bakom prissättningen av öl

Sara Eriksson and Jonas Häggmark, Spring semester 2017

1 Preface

This project is a bachelor thesis in multiple linear regression created by Sara Eriksson and Jonas Häggmark during the spring semester of 2017.

We wish to thank our mentor, Pierre Nyquist, for all his advice and guidance throughout the project.

2 Abstract

This bachelor thesis in applied mathematics is an analysis of which factors affect the pricing of beer on the Swedish market. A multiple linear regression model is created with the statistical programming language R through a study of the influences of several explanatory variables. These variables include, for example, country of origin, beer style, volume sold and a Bayesian weighted mean rating from RateBeer, a popular website for beer enthusiasts. The main goal of the project is to find significant factors and, as follows directly, a significant model without any influence of multicollinearity.

The regression analysis is based on a data set with 1413 observations, which represent the beers that sold over 1000 liters, among further restrictions, and is created from Systembolaget's sales statistics for 2016 and RateBeer. This number of observations represents 43% of Systembolaget's total assortment of beer.

The model is developed through a thorough residual analysis, transformations of variables, determination of multicollinearity and a validation of the absence of outliers and high leverage points. All of this is done to reach significance at the 95% level. In addition to the regression model, two submodels with associated box plots for the variable groups Country of Origin and Beer Style are created for analyzing the importance of these variables relative to each other. A k-fold cross validation study and three different variable selections are carried out for further adequacy checking; these are also given as recommendations for continued analysis.

The result shows that there are several different factors that affect the pricing of beer. For example, higher alcohol by volume, sour beers and beers from New Zealand yield a higher price, while beers with high sales and Austrian beers show a negative tendency for the price. The result can be used as an example of the influences behind the pricing of beer in Sweden. The first model in the analysis has 41 explanatory variables, and in the final model the number of explanatory variables is reduced to 20, all of which are significant.

3 Sammanfattning

This bachelor thesis in applied mathematics is an analysis of which factors affect the pricing of beer on the Swedish market. A multiple regression model has been created with the statistical programming language R through a study of the influences of a number of regressor variables. These variables include, among others, country of origin, beer style, volume sold and a Bayesian weighted mean from RateBeer, which is a popular website for beer enthusiasts. The main goal of the project is to find significant factors and, as then follows, a significant model without any influence of multicollinearity.

The regression analysis is based on a data set of 1413 observations representing the beers that sold more than 1000 liters, among further restrictions, and is created from Systembolaget's sales statistics for 2016 and RateBeer. This number of observations represents 43% of Systembolaget's total beer assortment.

The model is developed through a thorough residual analysis, transformations of variables, determination of multicollinearity and a validation of the absence of outliers and high leverage points. All of this is done to reach a significance level of 95%. In addition to the regression model, two submodels with associated box plots are created for the variable groups Country of Origin and Beer Style in order to analyze the importance of these variables relative to each other. A k-fold cross validation study and three different variable selections are carried out as further adequacy checks, and these are also given as recommendations for continued analysis.

The result shows that there are several different factors that affect the pricing of beer. For example, increased alcohol content, sour beers and beers from New Zealand give a higher price, while high sales and Austrian beers show a negative tendency for the price. The result can be used as an example of the influences behind the pricing of beer in Sweden. The first model in the analysis has 41 regressors, and in the final model the number of regressors has been reduced to 20, all of which are significant.

List of Figures

1 Monks have a great historical influence on the art of brewing
2 Survey carried out by SurveyMonkey
3 Summary of the final cost of a craft beer due to each expense
4 Normal probability and histogram plot for the residuals
5 Ordinary and scaled residuals against the predicted values
6 The model residuals versus the fitted values for the original model
7 Normal Q-Q plot and histogram for the original model
8 Scale-location plot for the original model
9 Leverage plot for the original model
10 Logarithmic transformation of the response variable and the variable representing item price
11 Logarithmic transformation of the response variable and the variable representing volume in ml
12 Second model, developed with logarithmic transformations
13 Model developed through multicollinearity analysis
14 Model B, developed through analysis of dummy variables
15 Outliers and high leverage points
16 Predicted price against actual price, original model
17 Predicted price against actual price, final model
18 Submodel country of origin
19 Normal probability plots
20 Submodel beer styles
21 Countries of origin
22 Beer styles
23 Package
24 Organic
25 In stock
26 Rated
27 Final model

Contents

1 Preface
2 Abstract
3 Sammanfattning
4 Introduction
  4.1 Background
    4.1.1 The History of Beer
  4.2 Purpose
  4.3 Problem Definition
  4.4 Data Set
  4.5 Problem Restrictions
  4.6 Literature Analysis
5 Mathematical Theory
  5.1 Multiple Linear Regression
    5.1.1 Heteroscedasticity
  5.2 Residual Analysis
  5.3 Model Adequacy
    5.3.1 Residual Plots
  5.4 Outliers and High Leverage Points
  5.5 Transformations of Variables
  5.6 Multicollinearity
  5.7 Variable Selection
  5.8 Cross Validation
6 Data Set
  6.1 Restrictions in Data
  6.2 List of Variables
  6.3 About RateBeer
7 Analysis and Model Development
  7.1 Residual Analysis
  7.2 Transformations of Variables
  7.3 Multicollinearity
  7.4 Analysis of Significance
  7.5 Outliers and High Leverage Points
  7.6 Cross Validation
  7.7 Variable Selection
8 Submodels
  8.1 Country of Origin
  8.2 Beer Styles
  8.3 Box Plots
9 Results
  9.1 Final Model
  9.2 Submodels
  9.3 Calculation Example
10 Discussion
  10.1 Analysis of Variables
    10.1.1 Quantitative Variables
    10.1.2 Dummy Variables Except Beer Styles and Countries of Origin
    10.1.3 Beer Styles
    10.1.4 Countries of Origin
  10.2 Looking Back at the Literature Analysis
  10.3 Recommendations
11 Appendix
  11.1 Variable Selection Tables
    11.1.1 Best Subset Selection
    11.1.2 Forward Subset Selection
    11.1.3 Backward Subset Selection

4 Introduction

This project studies how different factors affected the pricing of beer at Systembolaget during 2016. The study analyzes a multiple linear regression model with several influential parameters.

4.1 Background

The Swedish government holds a monopoly on alcohol sales through Systembolaget. This means that Systembolaget is the only store in the country allowed to sell alcoholic beverages above 3.5% alcohol by volume. Since the beer industry is immensely trendy at the time of writing, a closer look at this branch is interesting. The trend is strongly influenced by the great increase in craft beers, and the current number of breweries in Sweden is the highest in history. These breweries vary greatly in size, from the smallest with just a few workers to the biggest companies working on an international scale. The beer market has expanded considerably over the last ten years and is still expanding due to demand and a growing interest. The trend suggests that this expansion will continue for at least another decade, and therefore an analysis of which parameters actually matter for the pricing of beer can be helpful when examining the market. To see which factors determine the pricing of beer in Sweden today, as well as their level of influence, a multiple regression model will be created and analyzed.

4.1.1 The History of Beer

About 10 000 - 15 000 years ago humans ceased to be migratory hunters and gatherers and instead formed organized communities where they started to grow cereals. The first type of bread was made of barley. The bread was crumbled and added to water, producing a mash which was then prepared into a beverage said to make people "exhilarated, wonderful and blissful". Remains of breweries and detailed descriptions of how to produce and drink beer, sometimes with complete registers of different types of beer, have been found at several locations in the former Sumerian kingdom in Mesopotamia. Similar remains have been found in other ancient civilizations, from the river Nile to Mount Ararat, from modern Egypt to Iraq and Iran. Objects from these findings are collected at the museum of the University of Pennsylvania, and through their research the remains of beer have been identified on a piece of earthenware from Iraq, more than 5 000 years old.

When the knowledge of growing cereals spread north and west, the method of brewing followed in its tracks. The Romans were accustomed to wine but noted that people in the north were drinking beer. Many of the places where the beer industry has its strongest roots today are areas where the Celts settled early on, from central Europe to Ireland. After the Middle Ages the Christian monasteries became centers of agriculture, knowledge and science. The art of brewing was improved, in the beginning to produce beer for the brothers and traveling pilgrims, and later as a way to finance the monastic life.

Figure 1: Monks have a great historical influence on the art of brewing.

In many countries with ancient brewing methods beer is considered part of the national identity. During the Middle Ages many royal courts were granted brewing rights in order to raise money for the city, and some noble families are still in the business. Finally, industrial capitalism gave the brewing industry the form it has today. [5]

4.2 Purpose

The results and the complete analysis of this thesis will be of interest to beer producers who are about to price and release a new product on the market. Analysts at Systembolaget may use the model to negotiate prices with producers and to compare the modelled prices with the market prices of products in their assortment. Customers at Systembolaget can compare the price of a desired beer with the model. The analysis can also have an impact on prices that are already set, since these can be compared with the model to determine whether a product is over- or underpriced.

4.3 Problem Definition

In this thesis an analysis of the pricing of beer is carried out. More precisely, the question is: which parameters affect the pricing of beer, and how significant are they? To answer this question a large data set is collected and analyzed in order to obtain a model that describes the problem. The main goal of the project is to find significant factors and, as follows directly, a significant model without any influence of multicollinearity.

4.4 Data Set

The data set in this project consists of several parameters such as volume, sales, beer styles and a qualitative rating variable. The parameters are collected from Systembolaget's sales statistics for 2016, Systembolaget's website and www.RateBeer.com. The qualitative rating from RateBeer was obtained in order to include an external variable apart from Systembolaget's own data.

4.5 Problem Restrictions

The sales statistics from Systembolaget for 2016 contain a large number of samples. Creating the data set proved to be unproblematic since all data was easily accessible on the web. The most difficult part of the data collection was to set the limit for the sales in such a way that enough observations were included while non-representative beers were removed. The dummy variables are grouped under the condition that the number of observations is sufficient.

4.6 Literature Analysis

The number of previous studies directly related to the subject of this thesis is strictly limited. However, regression models have been created with emphasis on other alcoholic beverages, such as wine and whisky, in previous bachelor theses. A bachelor thesis on the influences on the pricing of wine in Systembolaget's assortment [7] reports an estimated coefficient for alcohol by volume of almost 17, i.e. the logarithm of the price per liter of wine increases by the alcohol by volume multiplied by a factor of 17. By contrast, another bachelor thesis, on the influences on the pricing of malt whisky [11], states that the alcohol by volume is instead multiplied by a factor of $7.11 \cdot 10^{-2}$.

Alongside these mathematical theses there are several websites that discuss the different factors determining the prices of beers. According to SurveyMonkey [8], the factors that consumers are most interested in when choosing which beer to purchase are taste, price, style of beer and the brewery that produces the beer. The outcome of the survey is shown in the diagram in Figure 2.

Figure 2: Survey carried out by SurveyMonkey.

An article in Huffington Post [6] states that the price of craft beer is divided into parts by expense. The aim of the article is to describe why a craft beer may be more expensive than a beer distributed by what the article calls macrobreweries. The division of expenses is presented graphically in Figure 3.

Figure 3: Summary of the final cost of a craft beer due to each expense.

5 Mathematical Theory

Regression analysis is a widely used statistical technique for investigating and modeling the relationship between variables.

5.1 Multiple Linear Regression

The multiple linear regression model is defined by

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p + \varepsilon \tag{1}$$

where $y$ is the response variable, $x_i$, $i = 1, ..., p$, is the set of explanatory variables, $\beta_i$ are the regression coefficients describing the associations between the response and the explanatory variables, and $\varepsilon$ is the random error component.

It is assumed that:

1. The relationship between the response $y$ and the regressors is linear, at least approximately.
2. The error term $\varepsilon$ has zero mean.
3. The error term $\varepsilon$ has constant variance $\sigma^2$.
4. The errors are uncorrelated.
5. The errors are normally distributed.

These assumptions can be tested through residual analysis.

It is often more practical to write the multiple linear regression model in matrix notation.

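$$y = X\beta + \varepsilon$$

where

$$
y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}, \quad
X = \begin{bmatrix}
1 & x_{11} & x_{12} & \dots & x_{1p} \\
1 & x_{21} & x_{22} & \dots & x_{2p} \\
\vdots & \vdots & \vdots & & \vdots \\
1 & x_{n1} & x_{n2} & \dots & x_{np}
\end{bmatrix}, \quad
\beta = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p \end{bmatrix}, \quad
\varepsilon = \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{bmatrix}
$$

The vector of fitted values $\hat{y}_i$ can be written as

$$\hat{y} = X\hat{\beta} = X(X'X)^{-1}X'y = Hy \tag{2}$$

where $H$ is the commonly used hat matrix, an $n \times n$ matrix that maps the vector of observed values into a vector of fitted values. [1]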
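As an illustration of equation (2), the following is a minimal sketch in Python that builds the hat matrix for a small data set and uses it to obtain the fitted values. The data and the helper functions `transpose`, `matmul` and `inverse` are hypothetical and written only for this example; the thesis itself carries out its computations in R.

```python
def transpose(A):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*A)]

def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]

def inverse(A):
    """Invert a small square matrix by Gauss-Jordan elimination with pivoting."""
    n = len(A)
    M = [list(map(float, row)) + [1.0 if i == j else 0.0 for j in range(n)]
         for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        M[col], M[pivot] = M[pivot], M[col]
        scale = M[col][col]
        M[col] = [v / scale for v in M[col]]
        for r in range(n):
            if r != col:
                f = M[r][col]
                M[r] = [vr - f * vc for vr, vc in zip(M[r], M[col])]
    return [row[n:] for row in M]

# Hypothetical toy data with two explanatory variables (not thesis data).
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.0, 1.0, 4.0, 3.0, 6.0]
y = [3.1, 3.9, 7.2, 7.8, 11.1]

X = [[1.0, a, b] for a, b in zip(x1, x2)]   # leading column of ones for beta_0
Xt = transpose(X)
XtX_inv = inverse(matmul(Xt, X))

H = matmul(matmul(X, XtX_inv), Xt)          # hat matrix H = X (X'X)^-1 X'
beta_hat = [row[0] for row in matmul(matmul(XtX_inv, Xt), [[v] for v in y])]
y_hat = [sum(hij * yj for hij, yj in zip(row, y)) for row in H]  # y_hat = H y

print("beta_hat:", [round(b, 4) for b in beta_hat])
print("trace(H):", round(sum(H[i][i] for i in range(len(H))), 6))
```

The trace of $H$ equals the number of estimated coefficients, here three, which is a convenient sanity check on the construction.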

5.1.1 Heteroscedasticity

If assumption 3 is not fulfilled, i.e. the error term does not have constant variance, then heteroscedasticity is identified. The linear model's standard errors, confidence intervals and hypothesis tests rely on this assumption. One possible remedy for this problem is a transformation of the response $y$. [2]

5.2 Residual Analysis

The residual is the difference between the observed value $y_i$ and the corresponding fitted value $\hat{y}_i$ and is defined as

$$e_i = y_i - \hat{y}_i, \quad i = 1, ..., n. \tag{3}$$

The residuals have zero mean and their approximate average variance is estimated by the mean square error, $MS_{Res}$,

$$MS_{Res} = \frac{\sum_{i=1}^{n} e_i^2}{n - p} \tag{4}$$

where $p$ is the number of parameters, so that the denominator is $n - 2$ in simple linear regression. The residuals are not independent. However, this has little effect on their use for model adequacy checking as long as $n$ is not small relative to $p$. Scaling of the residuals is helpful in finding observations that are outliers or extreme values, see section Outliers and High Leverage Points. There are four different scalings of residuals, which are listed below.

1. Standardized Residuals. The residual is scaled by the square root of its approximate average variance, and therefore it has mean zero and approximately unit variance. This type of scaling is used to detect potential outliers, where a large value ($|d_i| > 3$) is considered to indicate such an observation. The standardized residuals are defined as

$$d_i = \frac{e_i}{\sqrt{MS_{Res}}}, \quad i = 1, ..., n. \tag{5}$$

2. Studentized Residuals. Since the mean square error used as variance is only an approximation, the residual scaling is improved by dividing the i:th residual by its exact standard deviation. The studentized residual is defined as

$$r_i = \frac{e_i}{\sqrt{MS_{Res}(1 - h_{ii})}}, \quad i = 1, ..., n \tag{6}$$

where $h_{ii}$ is the i:th diagonal element of the hat matrix. The variance of this scaled residual is constant and equal to 1, regardless of the location of $x_i$, when the form of the model is correct.

3. PRESS Residuals. If there is an observation $i$ that is very unusual with respect to the rest of the data set, the regression model may be overly influenced by this observation. As a result the fitted value $\hat{y}_i$ may be very similar to the observed value $y_i$, so that the residual $e_i$ is small and the outlier is difficult to detect. The i:th PRESS residual is calculated by

$$e_{(i)} = \frac{e_i}{1 - h_{ii}}, \quad i = 1, ..., n \tag{7}$$

and possible high influence points are indicated where the PRESS residual is large.

4. R-student. The R-student residual is defined as

$$t_i = \frac{e_i}{\sqrt{S_{(i)}^2 (1 - h_{ii})}}, \quad i = 1, ..., n \tag{8}$$

where

$$S_{(i)}^2 = \frac{(n - p)MS_{Res} - e_i^2/(1 - h_{ii})}{n - p - 1}, \quad i = 1, ..., n. \tag{9}$$

$S_{(i)}^2$ is an estimate of the variance $\sigma^2$ and $h_{ii}$ is the i:th diagonal element of the hat matrix. The R-student is similar to the studentized residual, although in many situations it will differ from it. If the i:th observation is influential, $S_{(i)}^2$ may differ from $MS_{Res}$ and the R-student will be more sensitive to this point. [1]
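The four scalings can be compared side by side on a small example. The sketch below is a minimal illustration in Python, assuming a simple linear regression with one regressor (so that the number of parameters is $p = 2$ and $h_{ii} = 1/n + (x_i - \bar{x})^2/S_{xx}$). The data are hypothetical, and the last observation is deliberately placed far from the others to show how the scalings react.

```python
import math

def fit_line(xs, ys):
    """Least-squares intercept and slope for a single regressor."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = s_xy / s_xx
    return y_bar - b1 * x_bar, b1

# Hypothetical toy data; the last point is deliberately unusual.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 15.0]
y = [2.0, 4.1, 5.9, 8.2, 9.9, 12.1, 14.0, 15.0]
n, p = len(x), 2  # p counts the parameters: intercept and slope

b0, b1 = fit_line(x, y)
x_bar = sum(x) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)

e = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]        # residuals, eq. (3)
h = [1.0 / n + (xi - x_bar) ** 2 / s_xx for xi in x]     # hat diagonal h_ii
ms_res = sum(ei ** 2 for ei in e) / (n - p)              # mean square error

d = [ei / math.sqrt(ms_res) for ei in e]                         # eq. (5)
r = [ei / math.sqrt(ms_res * (1 - hi)) for ei, hi in zip(e, h)]  # eq. (6)
press = [ei / (1 - hi) for ei, hi in zip(e, h)]                  # eq. (7)
s2_del = [((n - p) * ms_res - ei ** 2 / (1 - hi)) / (n - p - 1)
          for ei, hi in zip(e, h)]                               # eq. (9)
t = [ei / math.sqrt(s2 * (1 - hi))
     for ei, s2, hi in zip(e, s2_del, h)]                        # eq. (8)

for i in range(n):
    print(f"obs {i + 1}: d={d[i]:7.2f}  r={r[i]:7.2f}  "
          f"PRESS={press[i]:7.2f}  t={t[i]:8.2f}")
```

In this example the ordinary residual of the last observation is of the same size as several others, because the point pulls the line towards itself, while its PRESS and R-student residuals are far larger in magnitude.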

5.3 Model Adequacy

In order to quantify the extent to which the model fits the data, the residual standard error, the F- and t-statistics, the p-values and the $R^2$ statistic can be analyzed.

Roughly interpreted, a small p-value indicates that there is a relationship between the response $y$ and the explanatory variable $x$ to which the p-value belongs. Typical thresholds for a small value are 5% or 1%. This project aims to develop a model where all explanatory variables are significant, and therefore the p-value is of great interest.

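Residual Standard Error. With each observation there is an associated error term $\varepsilon$, and the RSE is an estimate of the standard deviation of these error terms. Roughly speaking, it is the average amount that the response will deviate from the true regression line, and it is computed by

$$RSE = \sqrt{\frac{1}{n - p - 1} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2} \tag{10}$$

where $p$ is the number of explanatory variables, so that the divisor reduces to $n - 2$ for a single explanatory variable. An acceptable RSE value depends on the problem context.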

$R^2$ statistic. Independent of the scale of the response $y$, the $R^2$ statistic takes on a value between 0 and 1 and measures the proportion of variance explained.

$$R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} \tag{11}$$

The fraction is the residual sum of squares, $RSS = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$, divided by the total sum of squares, $TSS = \sum_{i=1}^{n} (y_i - \bar{y})^2$. An $R^2$ statistic close to 1 indicates that a large proportion of the variability in the response has been explained by the regression model. In this report $R^2$ is also called Multiple $R^2$ since this is a multiple regression analysis, as opposed to Adjusted $R^2$, which adjusts the Multiple $R^2$ with respect to the number of explanatory variables as follows:

$$\text{Adjusted } R^2 = 1 - \frac{RSS/(n - p - 1)}{TSS/(n - 1)} \tag{12}$$
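Equations (11) and (12) can be illustrated with a short computation. The following Python sketch assumes a simple linear regression ($p = 1$ explanatory variable) on hypothetical data; the function name `r_squared` is chosen for this example only.

```python
def fit_line(xs, ys):
    """Least-squares intercept and slope for a single regressor."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = s_xy / s_xx
    return y_bar - b1 * x_bar, b1

def r_squared(y, y_hat, p):
    """Return (Multiple R^2, Adjusted R^2) for p explanatory variables."""
    n = len(y)
    y_bar = sum(y) / n
    rss = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))  # residual sum of squares
    tss = sum((yi - y_bar) ** 2 for yi in y)               # total sum of squares
    multiple = 1.0 - rss / tss
    adjusted = 1.0 - (rss / (n - p - 1)) / (tss / (n - 1))
    return multiple, adjusted

# Hypothetical toy data (not thesis data).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [10.0, 12.0, 15.0, 19.0, 24.0]

b0, b1 = fit_line(x, y)
y_hat = [b0 + b1 * xi for xi in x]
r2, adj_r2 = r_squared(y, y_hat, p=1)
print(f"Multiple R^2 = {r2:.4f}, Adjusted R^2 = {adj_r2:.4f}")
```

Since the adjustment penalizes the number of explanatory variables, the adjusted value is never larger than the multiple value.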

F-statistic. When testing the null hypothesis,

$$H_0 : \beta_1 = \beta_2 = \dots = \beta_p = 0 \tag{13}$$

versus

$$H_1 : \beta_j \neq 0 \text{ for at least one } j, \quad j = 1, ..., p \tag{14}$$

it is examined whether there is a relationship between the response $y$ and at least one of the predictors $x_i$, $i = 1, ..., p$, for $p$ predictors. In a regression model this can be tested by computing the F-statistic,

$$F = \frac{(TSS - RSS)/p}{RSS/(n - p - 1)}. \tag{15}$$

For a large number of observations, $n$, a value of the F-statistic that is just a little larger than 1 may still suggest that the null hypothesis can be rejected in favor of the hypothesis that there is a linear relationship between the response and the predictors.

t-statistic. Similar to the F-statistic, the value of the t-statistic is used for hypothesis testing. The t-statistic is defined by

$$t_i = \frac{\hat{\beta}_i - 0}{se(\hat{\beta}_i)} \tag{16}$$

and measures the number of standard deviations that $\hat{\beta}_i$ is away from 0, where $se(\hat{\beta}_i)$ denotes the standard error of the i:th regression coefficient. In other words, the value of $t_i$ is used to test whether the corresponding regression coefficient differs from zero. [2]
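A small numerical sketch may clarify how the F-statistic in (15) and the t-statistic in (16) are obtained. The Python code below assumes a simple linear regression with hypothetical data, for which the standard error of the slope is $\sqrt{\hat{\sigma}^2 / S_{xx}}$ with $\hat{\sigma}^2 = RSS/(n - 2)$. With a single predictor the two tests coincide, so $t^2 = F$.

```python
import math

# Hypothetical toy data (not thesis data).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [10.0, 12.0, 15.0, 19.0, 24.0]
n, p = len(x), 1  # p = number of predictors

x_bar, y_bar = sum(x) / n, sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b1 = s_xy / s_xx                 # estimated slope
b0 = y_bar - b1 * x_bar          # estimated intercept

rss = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
tss = sum((yi - y_bar) ** 2 for yi in y)

# F-statistic, eq. (15)
f_stat = ((tss - rss) / p) / (rss / (n - p - 1))

# t-statistic for the slope, eq. (16)
sigma2_hat = rss / (n - p - 1)
se_b1 = math.sqrt(sigma2_hat / s_xx)
t_stat = (b1 - 0.0) / se_b1

print(f"F = {f_stat:.3f}, t = {t_stat:.3f}, t^2 = {t_stat ** 2:.3f}")
```

To turn these statistics into p-values they would be compared with the F distribution with $(p, n - p - 1)$ degrees of freedom and the t distribution with $n - p - 1$ degrees of freedom, which is omitted here.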

5.3.1 Residual Plots

By plotting the residuals it is possible to validate the model adequacy. In the plot of the residuals against the predicted values, a horizontal band containing the residuals implies that the variance is constant, which is assumption 3 listed under section Multiple Linear Regression. The normal probability plot is analyzed to examine whether the residuals are approximately normally distributed. As a visual example, two plots from the model development in this thesis are presented. The normal probability plot and the histogram of the residuals, see Figure 4, indicate that the normality assumption is approximately fulfilled. However, it is also noticeable that some of the data diverges, as can be seen in the tails of the histogram and in the extremities of the normal probability plot.

[Figure 4 panels: Normal Q-Q Plot (Sample Quantiles against Theoretical Quantiles) and Histogram of modalecountry2$res (Frequency against modalecountry2$res).]

Figure 4: Normal probability and histogram plot for the residuals.

In Figure 5 the ordinary residuals and the square root of the absolute studentized residuals, respectively, are plotted against the fitted values. The plots indicate that the residuals depend on the fitted values, which is a sign of heteroscedasticity.

[Figure 5 panels: Residuals vs Fitted (Residuals against Fitted values) and Scale-Location (Standardized residuals against Fitted values); observations 34, 414 and 1257 are labeled in both panels.]

Figure 5: Ordinary and scaled residuals against the predicted values.

5.4 Outliers and High Leverage Points

A point for which $y_i$ is far from the value predicted by the model is called an outlier. Such points can arise for a variety of reasons and can have a severe effect on the regression model. Plotting the studentized residuals against the predicted values is helpful when investigating outliers. If an outlier comes from a faulty measurement or analysis it should be corrected or deleted from the data set.

Another problem that might appear in the model is an unusual value of $x_i$, a so-called high leverage point. A leverage point located remotely in x space compared to the rest of the sample may control certain model properties. If the point lies on the regression line it will not affect the estimates of the regression coefficients; however, it will have a large effect on model statistics such as $R^2$, defined in section Model Adequacy, and on the standard errors of the regression coefficients. An influence point has an unusual location in both x and y space, and has an impact on the regression line as it drags the line towards itself.

Influential points can be identified by analyzing the elements $h_{ij}$ of the hat matrix $H$, which can be interpreted as the amount of leverage exerted by the i:th observation $y_i$ on the j:th fitted value $\hat{y}_j$. The hat matrix diagonal is a standardized measure of the distance of the i:th observation from the center of the x space, so observations with large diagonal values are potentially influential when combined with a large residual.

To measure influence, Cook's distance, $D_i$, uses the squared distance between the least-squares estimate based on all $n$ points, $\hat{\beta}$, and the estimate obtained by deleting the i:th point, $\hat{\beta}_{(i)}$. Points with large values of $D_i$ have considerable influence on the least-squares estimates $\hat{\beta}$. The value of $D_i$ is usually compared with the F distribution, and by rule of thumb a point with $D_i$ above 1 is considered influential. $D_i$ can be expressed as

$$D_i = \frac{r_i^2}{p} \frac{Var(\hat{y}_i)}{Var(e_i)} = \frac{r_i^2}{p} \frac{h_{ii}}{1 - h_{ii}}, \quad i = 1, ..., n. \tag{17}$$

$D_i$ provides information about the effect of observations on the estimated coefficients $\hat{\beta}_i$ and the fitted values $\hat{y}_i$. [1] [2]
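The rule of thumb for Cook's distance can be tried out on a small example. The Python sketch below evaluates (17) for a simple linear regression, where $p = 2$ parameters are estimated and $h_{ii} = 1/n + (x_i - \bar{x})^2/S_{xx}$. The data are hypothetical, with the last observation placed far out in x space and well below the trend of the others.

```python
def fit_line(xs, ys):
    """Least-squares intercept and slope for a single regressor."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = s_xy / s_xx
    return y_bar - b1 * x_bar, b1

# Hypothetical toy data; the last point has high leverage and a poor fit.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 15.0]
y = [2.0, 4.1, 5.9, 8.2, 9.9, 12.1, 14.0, 15.0]
n, p = len(x), 2  # p counts the parameters: intercept and slope

b0, b1 = fit_line(x, y)
x_bar = sum(x) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)

e = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]       # ordinary residuals
h = [1.0 / n + (xi - x_bar) ** 2 / s_xx for xi in x]    # hat diagonal h_ii
ms_res = sum(ei ** 2 for ei in e) / (n - p)

# Squared studentized residuals and Cook's distance, eq. (17)
r_sq = [ei ** 2 / (ms_res * (1 - hi)) for ei, hi in zip(e, h)]
cooks_d = [ri / p * hi / (1 - hi) for ri, hi in zip(r_sq, h)]

# Observations flagged by the rule of thumb D_i > 1 (numbered from 1)
influential = [i + 1 for i, di in enumerate(cooks_d) if di > 1.0]

for i, (hi, di) in enumerate(zip(h, cooks_d), start=1):
    print(f"obs {i}: h_ii = {hi:.3f}, D_i = {di:.3f}")
print("flagged as influential:", influential)
```

Only the remote observation is flagged here, since it combines a large hat diagonal with a large studentized residual.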

5.5 Transformations of Variables

Transformation of variables is a method used when the assumptions listed under section Multiple Linear Regression are not fulfilled. A usual starting point in regression analysis is the assumption of a linear relationship between $y$ and the regressors. A suitable transformation may linearize a non-linear function. If the scatter diagram of $y$ against $x$ indicates curvature, it may be possible to present the data in a linearized form by using a transformation such as the logarithmic or the reciprocal. These types of transformations are selected empirically, but it is also possible to use objective techniques, such as Box-Cox, to specify an appropriate transformation.

The assumption that the variance is constant may also be violated. This is common if the response variable $y$ follows a probability distribution in which the variance is related to the mean. It is important to find and correct a non-constant error variance since it might lead to larger standard errors for the regression coefficients than necessary. The effect of a transformation is usually to give more precise estimates of the model parameters and increased sensitivity for the statistical tests. [1]
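As a minimal illustration of how a transformation can linearize a relationship, the Python sketch below generates hypothetical data from $y = a e^{bx}$ with the assumed values $a = 2$ and $b = 0.3$, and fits a straight line both to $y$ and to $\ln y$. On the logarithmic scale the relationship is exactly linear, $\ln y = \ln a + bx$, so the fit recovers the two constants.

```python
import math

def fit_line(xs, ys):
    """Least-squares intercept and slope for a single regressor."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = s_xy / s_xx
    return y_bar - b1 * x_bar, b1

def rss(xs, ys, b0, b1):
    """Residual sum of squares of the line b0 + b1 * x."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))

# Hypothetical data following y = 2 * exp(0.3 * x) exactly (assumed values).
x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0 * math.exp(0.3 * xi) for xi in x]
log_y = [math.log(yi) for yi in y]

b0_raw, b1_raw = fit_line(x, y)        # straight line on the original scale
b0_log, b1_log = fit_line(x, log_y)    # straight line on the log scale

rss_raw = rss(x, y, b0_raw, b1_raw)
rss_log = rss(x, log_y, b0_log, b1_log)

print(f"original scale: RSS = {rss_raw:.4f}")
print(f"log scale:      RSS = {rss_log:.2e}, "
      f"slope = {b1_log:.3f}, exp(intercept) = {math.exp(b0_log):.3f}")
```

Real data contain noise, so the fit on the transformed scale will not be exact; the residual plots of the transformed model then show whether the transformation was appropriate.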

5.6 Multicollinearity

Regressors are said to be orthogonal if there is no linear relationship between them. If the regressors are nearly perfectly linearly related, inferences based on the model may be misleading or erroneous. This problem is called multicollinearity. In the multiple regression model, the $X$ matrix contains the regressor variables. Its j:th column can be denoted $X_j$, so that $X = [X_1, ..., X_p]$. $X_j$ contains the $n$ levels of the j:th regressor variable. Multicollinearity is defined in terms of the linear dependence of the columns of $X$. The vectors are linearly dependent if there is a set of constants $s_1, ..., s_p$, not all zero, such that

$$\sum_{j=1}^{p} s_j X_j = 0. \tag{18}$$

There are some useful methods for diagnosing multicollinearity. Examining the off-diagonal elements $r_{ij}$ of the matrix $X'X$ in correlation form is one of these methods. If regressors $x_i$ and $x_j$ are nearly linearly dependent, then $|r_{ij}|$ will be near unity. Examining the simple correlation $r_{ij}$ between the regressors is helpful in detecting near-linear dependence between pairs of regressors only. When more than two regressors are involved in a near-linear dependence, there is no assurance that any of the pairwise correlations $r_{ij}$ will be large. The diagonal elements of the matrix $C = (X'X)^{-1}$ are useful in detecting multicollinearity. The variance inflation factor (VIF) is defined as

$$VIF_j = C_{jj} = (1 - R_j^2)^{-1} \tag{19}$$

where $R_j^2$ is the coefficient of determination obtained when $x_j$ is regressed on the remaining regressors. A VIF that exceeds 5 or 10 for a term in the model is an indication that the associated regression coefficient is poorly estimated because of multicollinearity. [1] [2]
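For a model with only two regressors, $R_j^2$ in (19) is simply the squared correlation between them, which makes the variance inflation factor easy to compute by hand. The Python sketch below uses this special case on hypothetical data; the function name `vif_two_regressors` is chosen for the example, and a model with more regressors would instead require regressing each $x_j$ on all the others.

```python
def squared_correlation(a, b):
    """Squared Pearson correlation between two equally long sequences."""
    n = len(a)
    a_bar, b_bar = sum(a) / n, sum(b) / n
    s_aa = sum((ai - a_bar) ** 2 for ai in a)
    s_bb = sum((bi - b_bar) ** 2 for bi in b)
    s_ab = sum((ai - a_bar) * (bi - b_bar) for ai, bi in zip(a, b))
    return s_ab ** 2 / (s_aa * s_bb)

def vif_two_regressors(a, b):
    """VIF shared by both regressors in a model with exactly two regressors."""
    return 1.0 / (1.0 - squared_correlation(a, b))

# Hypothetical regressors (not thesis data).
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [1.1, 1.9, 3.2, 3.9, 5.1, 6.0]   # nearly a copy of x1
x3 = [3.0, 1.0, 4.0, 1.0, 5.0, 2.0]   # only weakly related to x1

vif_collinear = vif_two_regressors(x1, x2)
vif_weak = vif_two_regressors(x1, x3)

print(f"VIF for x1 with x2: {vif_collinear:.1f}")   # far above the 5-10 range
print(f"VIF for x1 with x3: {vif_weak:.3f}")        # close to 1
```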

5.7 Variable Selection

The subset selection approach tries to identify which predictors are best related to the dependent variable, and then fits a least squares model with only those predictors. The reduced models may then be evaluated with different model evaluation criteria to choose the best model. Three of the most common selection approaches are the Best Subset Selection and the Forward/Backward Stepwise Selection algorithms.

The Best Subset Selection algorithm starts with no predictors. Initially it fits every model with only one predictor out of all $p$ possible and stores the best one, which is defined as the model with the largest value of $R^2$. It then fits the models with $2, \dots, p$ predictors and stores the best model for each number of predictors. Lastly, the single best model out of all stored models is chosen based on different evaluation criteria.

The Forward Stepwise Selection algorithm likewise starts with no predictors. It adds one predictor at a time, namely the predictor that gives the highest $R^2$ for the model, until there are $p$ predictors.

The Backward Stepwise Selection algorithm is similar to the Forward algorithm, but starts with a full model with $p$ predictors and instead removes one predictor at a time. [2]
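The forward algorithm described above can be sketched in a few lines. The Python code below is a minimal illustration under stated assumptions: the predictor names `abv`, `rating` and `volume` and all data values are hypothetical, and the least-squares fits are obtained by solving the normal equations with a small hand-written solver rather than with a statistics package.

```python
def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [list(map(float, A[i])) + [float(b[i])] for i in range(n)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            M[r] = [vr - f * vc for vr, vc in zip(M[r], M[c])]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        z[r] = (M[r][n] - sum(M[r][k] * z[k] for k in range(r + 1, n))) / M[r][r]
    return z

def r_squared(columns, y):
    """R^2 of the least-squares fit of y on an intercept plus the given columns."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(X[0])
    XtX = [[sum(X[i][a] * X[i][b] for i in range(n)) for b in range(k)]
           for a in range(k)]
    Xty = [sum(X[i][a] * y[i] for i in range(n)) for a in range(k)]
    beta = solve(XtX, Xty)
    y_hat = [sum(bj * xj for bj, xj in zip(beta, row)) for row in X]
    y_bar = sum(y) / n
    rss = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
    tss = sum((yi - y_bar) ** 2 for yi in y)
    return 1.0 - rss / tss

def forward_stepwise(predictors, y):
    """Return the order in which predictors enter, with R^2 after each step."""
    remaining = dict(predictors)
    chosen, path = [], []
    while remaining:
        # Add the predictor that gives the highest R^2 together with those chosen.
        best = max(remaining, key=lambda name: r_squared(
            [predictors[c] for c in chosen] + [remaining[name]], y))
        chosen.append(best)
        del remaining[best]
        path.append((best, r_squared([predictors[c] for c in chosen], y)))
    return path

# Hypothetical data: price per liter driven mainly by alcohol by volume.
predictors = {
    "abv":    [4.0, 5.0, 5.5, 6.0, 7.0, 8.0, 4.5, 9.0],
    "rating": [3.0, 3.2, 3.1, 3.6, 3.4, 3.9, 2.9, 3.7],
    "volume": [5.0, 3.0, 6.0, 2.0, 7.0, 1.0, 8.0, 4.0],
}
price = [60.5, 69.0, 76.1, 80.2, 89.5, 101.0, 64.3, 110.4]

path = forward_stepwise(predictors, price)
for step, (name, r2) in enumerate(path, start=1):
    print(f"step {step}: add {name:7s} R^2 = {r2:.4f}")
```

Since $R^2$ never decreases when a predictor is added, the stored models of different sizes have to be compared with a criterion that penalizes model size, such as the Adjusted $R^2$, when the final model is chosen.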

5.8 Cross Validation

In regression analysis there are different resampling methods, which are used to obtain additional information about the fitted model. The validation set approach is a technique used in model selection to estimate the test error of a predictive model. The data set is randomly divided into two sets, a training set and a validation set. A linear model is fitted to the training set and used to predict the responses for the observations in the validation set, which shows how well the fitted regression model estimates observations that were not used in the fit. From this, conclusions can be drawn about the model's performance in predicting new observations. There are two potential drawbacks with the validation set approach:

1. The estimate of the test error rate can be highly variable, depending on which observations end up in the training set and which in the validation set.

2. Only a subset of the observations is used to fit the model. The performance of the model usually decreases with fewer observations included. This can lead to an overestimation of the test error rate for the model fit to the entire data set.

In an attempt to address the drawbacks of the validation set approach, so-called k-fold cross validation can be applied to models created through model development. The set of observations is divided into $k$ folds of approximately equal size. The first fold is treated as a validation set and the model is fitted on the remaining $k - 1$ folds. The procedure is repeated $k$ times, using every fold as validation set once. Each time the mean square error, MSE, is calculated on the observations in the validation fold, and the k-fold CV estimate is the average of these values: [2]

$$CV_{(k)} = \frac{1}{k} \sum_{i=1}^{k} MSE_i. \tag{20}$$
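The procedure behind equation (20) can be sketched as follows. The Python code is a minimal illustration for a simple linear regression on hypothetical data. As a simplifying assumption, observation $i$ is assigned to fold $i \bmod k$ so that the result is reproducible, whereas in practice the observations would be shuffled randomly before being divided into folds.

```python
def fit_line(xs, ys):
    """Least-squares intercept and slope for a single regressor."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = s_xy / s_xx
    return y_bar - b1 * x_bar, b1

def k_fold_cv(x, y, k):
    """Return the k-fold CV estimate, eq. (20), and the MSE of each fold."""
    n = len(x)
    fold_mse = []
    for fold in range(k):
        valid = [i for i in range(n) if i % k == fold]   # validation fold
        train = [i for i in range(n) if i % k != fold]   # remaining k - 1 folds
        b0, b1 = fit_line([x[i] for i in train], [y[i] for i in train])
        mse = sum((y[i] - (b0 + b1 * x[i])) ** 2 for i in valid) / len(valid)
        fold_mse.append(mse)
    return sum(fold_mse) / k, fold_mse

# Hypothetical toy data (not thesis data).
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
y = [3.2, 4.9, 7.1, 9.0, 10.8, 13.1, 15.2, 16.9, 19.1, 20.8]

cv_estimate, fold_mse = k_fold_cv(x, y, k=5)
print("MSE per fold:", [round(m, 4) for m in fold_mse])
print("5-fold CV estimate:", round(cv_estimate, 4))
```

Every observation is used for validation exactly once and for fitting $k - 1$ times, which is what reduces the variability compared with a single split into training and validation sets.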

6 Data Set

The data set used in this project is created from Systembolaget's sales statistics for 2016 [13], and one explanatory variable is obtained from RateBeer [10]. A set of numerical variables was added to the sales statistics: alcohol by volume, number of months in stock up until the turn of the year 2016/2017 and the number of stores selling the product. In addition to the numerical variables, groups of dummy variables were created. These variables represent whether the product is in stock, the type of packaging, whether the product is organic, whether the product is rated at RateBeer, the product's country of origin and its beer style. In the dummy variable group Beer Style there is one category called other styles by Systembolaget, and in the Country of Origin group the variable other countries includes Barbados, Colombia, Trinidad, Japan, South Africa, India, Australia, Mexico, Turkey, Iceland, Serbia, Croatia, Cyprus, China, Israel, Greece, Bosnia, Jamaica, Estonia, Singapore, Kenya and Poland. None of these countries had more than 4 observations, and therefore the decision was made to group them together.
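The grouping of rare countries and the creation of dummy variables can be sketched as below. The Python code is a minimal illustration on a hypothetical list of countries. The threshold of 5 observations follows from the statement that no grouped country had more than 4 observations, while the use of Sweden as the reference category, which receives no dummy column of its own, is an assumption made for this example.

```python
from collections import Counter

def group_rare(values, min_count, other_label):
    """Replace categories with fewer than min_count observations by other_label."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other_label for v in values]

def dummy_encode(values, reference):
    """Create one 0/1 column per category, leaving out the reference category."""
    levels = sorted(set(values) - {reference})
    rows = [[1 if v == level else 0 for level in levels] for v in values]
    return levels, rows

# Hypothetical country labels for 14 beers (not thesis data).
countries = ["Sweden"] * 6 + ["USA"] * 5 + ["Japan"] * 2 + ["Kenya"]

grouped = group_rare(countries, min_count=5, other_label="Other countries")
levels, rows = dummy_encode(grouped, reference="Sweden")

print("dummy columns:", levels)
for country, row in zip(grouped, rows):
    print(f"{country:16s} {row}")
```

Leaving out one reference category per group avoids an exact linear dependence between the dummy columns and the intercept, which would otherwise cause perfect multicollinearity.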

6.1 Restrictions in Data

Initially the data set included 3312 beers. After removing beers with unusual packaging, such as kegs, magnum bottles and multi-packs, and beers that sold less than 1000 liters, 1413 beers were left to analyze. The restriction on packaging was made to eliminate probable outliers and high leverage points. The restriction on sales was made in order to keep a good sample size while removing the most unusual beers, since these would probably not be representative for the model. A beer needs to have received nine or more ratings in order to obtain a RateBeer rating; beers that have received fewer ratings have been regarded as unrated. It is assumed that the beers that are no longer in stock were available throughout 2016.

6.2 List of Variables

The response variable y is price per liter, expressed in Swedish kronor (SEK). It spans the interval from 21.6 SEK/liter to 252 SEK/liter with a median value of 71.21 SEK/liter. The following two tables describe the explanatory variables in the data set, including their numbers of observations; the first table lists the dummy variables and the second the numerical variables.

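Explanatory Variables: Dummy

| Variable | Description | Number of observations |
| --- | --- | --- |
| Package | 1 for bottles and 0 for cans. | 1239 |
| Not in stock | 1 for not in stock and 0 for in stock. | 263 |
| Organic | 1 for organic and 0 for non-organic. | 121 |
| Unrated | 1 for unrated and 0 for rated at RateBeer. | 81 |
| Belgian ale | Beer style | 98 |
| British American ale | Beer style | 573 |
| German ale | Beer style | 11 |
| Dark lager | Beer style | 27 |
| Light lager | Beer style | 429 |
| Medium dark lager | Beer style | 38 |
| Other styles | Beer style | 32 |
| Porter/stout | Beer style | 96 |
| Sour | Beer style | 44 |
| Unclassified ale | Beer style | 26 |
| Wheat | Beer style | 39 |
| Austria | Country of origin | 7 |
| Belgium | Country of origin | 74 |
| Canada | Country of origin | 7 |
| Czech Republic | Country of origin | 41 |
| Denmark | Country of origin | 18 |
| Finland | Country of origin | 6 |
| France | Country of origin | 5 |
| Germany | Country of origin | 55 |
| Great Britain | Country of origin | 80 |
| International | Country of origin | 48 |
| Ireland | Country of origin | 10 |
| Italy | Country of origin | 9 |
| Netherlands | Country of origin | 9 |
| New Zealand | Country of origin | 6 |
| Norway | Country of origin | 9 |
| Other countries | Country of origin | 34 |
| Spain | Country of origin | 12 |
| Sweden | Country of origin | 849 |
| Thailand | Country of origin | 8 |
| USA | Country of origin | 127 |

Note: International as a country means that the beer probably is from a large, global brewery and that it is licence brewed in several different countries. Therefore it is not possible to label it as a specific country. [12]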

Explanatory Variables: Numerical

| Variable | Description | Min. | Median | Max. |
| --- | --- | --- | --- | --- |
| Alcohol | Alcohol by volume (ABV) percentage. | 3.6 | 5.4 | 14 |
| Item price | Price per beer in Swedish kronor. | 8.9 | 24.9 | 189 |
| Months in stock | How many months the product had been in Systembolaget’s assortment up until the annual shift 2016/2017. | 0 | 15 | 735 |
| Number of stores | Number of Systembolaget stores selling the product as of when this report is written. This is regarded as a measurement of popularity. | 0 | 4 | 437 |
| RateBeer rating | A qualitative subjective rating ranging between 0 and 100, where the rating is rounded to the nearest integer, based on users’ ratings. RateBeer uses a Bayesian weighted mean for their ratings. | 0 | 45 | 100 |
| Sales | Total sales in liters in 2016. | 1001.55 | 3752.1 | 9893495 |
| Sales per month | Total sales in liters in 2016 divided by the number of months the product had been in stock 2016. | 83.4625 | 401.995 | 824457.9 |
| Volume | Volume in milliliters. | 250 | 330 | 1000 |

6.3 About RateBeer

RateBeer users can rate beers on a combination of scales. RateBeer uses a Bayesian weighted mean for its ratings, which means that the validity of an average increases with the number of ratings. As RateBeer themselves put it: "We use a Bayesian weighted mean so that more ratings increase the score’s validity. Simply put, a beer that has one hundred 5.0 scores will have a score just thousandths of a point under five, whereas a beer that has only ten 5.0 scores might have a score a few tenths below a five. This not only helps us combat abuse but ensures a greater validity to our beer lists". [9]

This weighted point average differs from RateBeer’s consumer-friendly 100-point scale score, which is the score used in this project. A beer’s score on the 100-point scale is based on its percentile ranking among all beers. To assure the quality of the ratings, RateBeer deletes obviously bogus ratings, does not let a user’s rating for a specific beer count until the user has rated at least ten beers, and does not let brewers or their affiliates rate their own beers. RateBeer is not the only well-known database to use the Bayesian formula. The Internet Movie Database, better known as IMDb, has also used the Bayesian formula to obtain a weighted mean for its movies based on user votes [4].

The Bayesian formula:

\[ \text{Weighted rank} = \frac{v}{v + m} \cdot R + \frac{m}{v + m} \cdot C \tag{21} \]

where R is the mean rating, C is the midpoint of the scale, v is the number of ratings for a beer and m is the minimum number of votes required to be listed in the top 50 beers list at RateBeer.
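The shrinkage effect of the Bayesian formula can be illustrated numerically. The following Python sketch evaluates the formula for a beer with few ratings and a beer with many ratings. The values m = 10 and C = 3.0 are assumptions chosen only for illustration and are not RateBeer’s actual parameters; the function name `weighted_rank` is likewise hypothetical.

```python
def weighted_rank(mean_rating, n_ratings, min_votes, midpoint):
    """Bayesian formula: shrink the mean rating R towards the scale midpoint C."""
    v, m = n_ratings, min_votes
    return v / (v + m) * mean_rating + m / (v + m) * midpoint

# Hypothetical inputs: m = 10 and C = 3.0 are illustrative values only.
few = weighted_rank(mean_rating=5.0, n_ratings=10, min_votes=10, midpoint=3.0)
many = weighted_rank(mean_rating=5.0, n_ratings=100, min_votes=10, midpoint=3.0)
print(round(few, 3), round(many, 3))
```

With the same mean rating, the beer with more ratings ends up closer to its own mean, while the beer with few ratings is pulled towards the midpoint of the scale.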

7 Analysis and Model Development

The multiple regression model is built and analyzed in R using the RStudio editor. The complete data set is read into R and a first multiple linear regression model is created. In this initial model the following categories form the base case and are absorbed into the intercept: beers in stock, cans, non-organic, rated at RateBeer, other styles and other countries. Which categories are placed in the intercept has no effect on the fit of the model, and the choice may therefore be made arbitrarily. However, the actual coefficients do depend on this choice, since each dummy variable coefficient is interpreted relative to the base case. The analysis of the model starts by interpreting the plots listed in the following section, Residual Analysis.
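The role of the base case can be shown with a small example. The Python sketch below uses made-up log prices for three hypothetical groups (not the thesis data) and fits the same dummy variable model twice with different base cases. For a single group of dummy variables the least squares solution has a closed form, which the hypothetical helper `dummy_fit` uses directly.

```python
from statistics import mean

# Hypothetical log prices for three made-up groups (not thesis data).
groups = {
    "lager": [3.9, 4.1, 4.0],
    "ale": [4.4, 4.6],
    "sour": [4.9, 5.1, 5.0],
}

def dummy_fit(groups, base):
    """Least-squares fit of y on group dummies with `base` in the intercept.

    With one categorical regressor the solution is closed form: the intercept
    is the base-group mean and each coefficient is a difference of means.
    """
    intercept = mean(groups[base])
    coefs = {g: mean(v) - intercept for g, v in groups.items() if g != base}
    return intercept, coefs

def fitted(intercept, coefs, group):
    """Fitted value for one group; the base group has no coefficient."""
    return intercept + coefs.get(group, 0.0)

fit_a = dummy_fit(groups, base="lager")
fit_b = dummy_fit(groups, base="sour")
for g in groups:
    print(g, round(fitted(*fit_a, g), 3), round(fitted(*fit_b, g), 3))
```

The two fits give identical fitted values for every group, while the coefficients differ because they measure the distance to different base cases.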

7.1 Residual Analysis

The plot of the residuals versus the fitted values, see Figure 6, indicates some tendency of non-linear behaviour in the model. The residuals and the fitted values seem to be dependent: the variance of the residuals is not constant, that is, the model shows heteroscedasticity.

[Plot: Residuals vs Fitted. x-axis: Fitted values; y-axis: Residuals. Labelled observations: 148, 1006, 1228.]

Figure 6: The model residuals versus the fitted values for the original model.

The quantile-quantile (Q-Q) plot is used to determine whether the residuals are normally distributed. This Q-Q plot shows heavy tails, which indicates that the data have more extreme values than expected for a normally distributed data set. This can also be seen in the tails of the histogram, see Figure 7.

[Plots: Normal Q-Q plot (x-axis: Theoretical Quantiles; y-axis: Sample Quantiles) and histogram of mod1$res (x-axis: mod1$res; y-axis: Frequency).]

Figure 7: Normal Q-Q plot and histogram for the original model.

In agreement with the plot of the residuals versus the fitted values, the scale-location plot supports the conclusion that the raw data is heteroscedastic, see Figure 8.

[Plot: Scale-Location. x-axis: Fitted values; y-axis: Standardized residuals. Labelled observations: 148, 1228, 1348.]

Figure 8: Scale-location plot for the original model.

By analyzing the leverage plot, see Figure 9, it is clear that the data set contains neither high leverage points nor outliers. The whole data set is located within the accepted region. This is expected since the number of observations, n, in the data set is 1413 and therefore considered large. Section Outliers and High Leverage Points presents a numerical analysis of these points.

[Plot: Residuals vs Leverage. x-axis: Leverage; y-axis: Standardized residuals; Cook’s distance contours at 0.5. Labelled observations: 148, 990, 1341.]

Figure 9: Leverage plot for the original model.

The summary of this initial model is noted for comparison with the other models throughout the development, see Table 1.

| Residual standard error | Multiple R^2 | Adjusted R^2 | F-statistic | P-value |
| --- | --- | --- | --- | --- |
| 7.769 | 0.938 | 0.936 | 509 | < 2.2 · 10^-16 |

Table 1: Summary of the original model.

7.2 Transformations of Variables

A logarithmic transformation is tested for the numerical variables, including the response variable; none of the dummy variables are transformed. If the relationship between a numerical variable and the response variable becomes more linear, that transformation is applied in the regression model. An improvement is visible in Figure 10, where the transformation of the response variable and of the variable representing item price is plotted step by step. In the first panel both variables are untransformed, in the second panel the response variable is transformed and in the last panel both variables are transformed. The relationship between the variables becomes noticeably more linear, and the item price variable is therefore transformed in the regression model.

[Plots: three scatter plots of price per liter against item price. Panel 1: databeer$pris.l against databeer$aktuellt.pris; panel 2: log(databeer$pris.l) against databeer$aktuellt.pris; panel 3: log(databeer$pris.l) against log(databeer$aktuellt.pris).]

Figure 10: Logarithmic transformation of the response variable and the variable representing item price.

A transformation does not always improve the relationship between the response and an explanatory variable. In Figure 11 it is clear that transforming the explanatory variable, volume in ml, does not improve the linearity of the relationship. The transformation is made in the same manner as in Figure 10, and the decision is made not to transform this explanatory variable in the model.

[Plots: three scatter plots of price per liter against volume. Panel 1: databeer$pris.l against databeer$volym.i.ml; panel 2: log(databeer$pris.l) against databeer$volym.i.ml; panel 3: log(databeer$pris.l) against log(databeer$volym.i.ml).]

Figure 11: Logarithmic transformation of the response variable and the variable representing volume in ml.

After each numerical explanatory variable has been tested in this way, four of them are transformed in the model: item price, alcohol, sales and sales per month. The logarithmic transformation of the response variable is also applied in the model. The new model is plotted for comparison with the original model, see Figure 12. Judging from the plots of the residuals against the fitted values and of the square root of the standardized residuals against the fitted values, this model is close to being homoscedastic. There are still tails in the Q-Q plot, and there are points that might be influential according to the leverage plot. A closer look at observation number 148, which differs from the remaining data, shows that this observation is the largest beer and the only one with a volume of 1000 ml.

[Plots: Residuals vs Fitted, Normal Q-Q, Scale-Location and Residuals vs Leverage (with Cook’s distance contours) for the second model. Labelled observations include 148, 475 and 1161.]

Figure 12: Second model, developed with logarithmic transformations.

The summary of this model displays a large improvement in the F-statistic and in the residual standard error, which is now close to zero, see Table 2. Note that the residual standard error is now measured on the logarithmic scale and is therefore not directly comparable with the value in Table 1. Out of the 41 explanatory variables 21 are significant, compared to 19 in the original model.

| Residual standard error | Multiple R^2 | Adjusted R^2 | F-statistic | P-value |
| --- | --- | --- | --- | --- |
| 0.0252 | 0.996 | 0.996 | 9090 | < 2.2 · 10^-16 |

Table 2: Summary of the transformed model.

7.3 Multicollinearity

To detect multicollinearity between explanatory variables, the variance inflation factor, VIF, is calculated in R for each explanatory variable. All variables with a value larger than 10 are listed in Table 3.

| Variable | Ale: British/American | Light lager | Sweden | Sales | Sales per month |
| --- | --- | --- | --- | --- | --- |
| VIF | 11.84 | 11.11 | 10.75 | 23.64 | 20.54 |

Table 3: List of explanatory variables with a VIF value larger than 10.
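The high VIF values for sales and sales per month can be understood from the definition VIF = 1/(1 − R^2), where R^2 comes from regressing one explanatory variable on the others. The Python sketch below is an illustration on synthetic data, not a reproduction of the calculation in R. It uses the special case of two predictors, where R^2 equals the squared correlation; the variable names and the simulated distributions are assumptions made for the example.

```python
import math
import random

def pearson_r(xs, ys):
    """Sample correlation coefficient between two equally long lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def vif_two_predictors(x1, x2):
    """VIF = 1 / (1 - R^2); with two predictors R^2 is the squared correlation."""
    r = pearson_r(x1, x2)
    return 1.0 / (1.0 - r ** 2)

# Synthetic stand-ins for the logged sales variables (assumed distributions).
rng = random.Random(0)
log_sales = [rng.gauss(8.0, 1.5) for _ in range(500)]
# log(sales per month) = log(sales) - log(months in stock during the year)
log_months = [math.log(rng.randint(1, 12)) for _ in range(500)]
log_sales_per_month = [s - m for s, m in zip(log_sales, log_months)]
unrelated = [rng.gauss(0.0, 1.0) for _ in range(500)]

vif_related = vif_two_predictors(log_sales, log_sales_per_month)
vif_unrelated = vif_two_predictors(log_sales, unrelated)
print(round(vif_related, 2), round(vif_unrelated, 2))
```

Two variables that are built from the same quantity inflate each other’s VIF, whereas an unrelated variable gives a value close to 1.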

From these results a new model is developed. Sales and sales per month are directly related, and therefore one of these variables is excluded from the model. The sales per month variable is assumed to give the best description with respect to seasonal beers and beers that are temporarily in stock, and it is therefore the variable kept in the model. Furthermore, the response variable is the quotient of the item price and volume variables, so these two variables are also excluded from the model. The data is fitted to this model and plotted for analysis, see Figure 13.
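The reason for excluding item price and volume can be made explicit. As a sketch, write $P$ for the item price in SEK, $V$ for the volume in ml and $y$ for the price per liter. The symbols $P$ and $V$ are notation introduced here, and the factor 1000 assumes the units given in section List of Variables:

\[
y = \frac{1000 \, P}{V}
\quad\Longrightarrow\quad
\log y = \log P - \log V + \log 1000 .
\]

The transformed response is thus an exact linear function of $\log P$ and $\log V$. With the logarithm of item price and the volume both present as regressors, the model can largely reproduce this identity instead of explaining the price, which is consistent with the near-perfect fit reported in Table 2.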

[Plots: Residuals vs Fitted, Normal Q-Q, Scale-Location and Residuals vs Leverage (with Cook’s distance contours) for the model developed through multicollinearity analysis. Labelled observations include 155, 990 and 1082.]

Figure 13: Model developed through multicollinearity analysis.

Compared with the previous model, Figure 12, this model is improved in the normal Q-Q plot, where the tails are lighter. The leverage plot no longer shows any possibly influential points. The VIF values are calculated again. The variables with values larger than 10, together with the variable that showed multicollinearity in the previous calculation, are presented in Table 4. After the sales variable has been excluded, the VIF value for sales per month no longer indicates multicollinearity.

| Variable | Ale: British/American | Light lager | Sweden | Sales per month |
| --- | --- | --- | --- | --- |
| VIF | 11.74 | 10.83 | 10.73 | 4.30 |

Table 4: List of explanatory variables with a VIF value larger than 10.

7.4 Analysis of Significance

After the numerical variables have been transformed, the model is further developed through an analysis of significance. As seen in section Multicollinearity, some of the dummy variables from the variable group beer styles have shown large VIF values. Because of their low significance and some high VIF values, a new model, Model A, is tested where all the different types of ale are grouped together as one dummy variable. The significance of the variables in this model is analyzed, and it is decided to test whether some of the countries in the Country of Origin group should be regrouped together with the variable other countries. The countries that are not significant are regrouped with the other countries located in the intercept, and a new model, Model B, is created. This regrouping means that the variables Belgium, Canada, Denmark, Finland, France, International, Ireland, Italy, Norway, Spain, Sweden, Thailand and USA are added to the intercept. The difference in significance is presented in Table 5. The significance codes in the table are: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1.

| Variable | Model A | Model B |
| --- | --- | --- |
| Alcohol | *** | *** |
| Months in stock | *** | *** |
| Number of stores | *** | *** |
| Not in stock | ** | ** |
| Ale | | |
| Light lager | *** | *** |
| Medium dark lager | *** | *** |
| Dark lager | *** | *** |
| Porter/Stout | * | * |
| Sour | *** | *** |
| Wheat | ** | ** |
| Package | *** | *** |
| Organic | | |
| Sales | *** | *** |
| RateBeer rating | *** | *** |
| Unrated | *** | *** |
| Austria | . | * |
| Belgium | | Regrouped |
| Canada | | Regrouped |
| Czech Republic | ** | *** |
| Denmark | | Regrouped |
| Finland | | Regrouped |
| France | | Regrouped |
| Germany | *** | *** |
| Great Britain | . | ** |
| International | | Regrouped |
| Ireland | | Regrouped |
| Italy | | Regrouped |
| Netherlands | . | . |
| New Zealand | * | * |
| Norway | | Regrouped |
| Spain | | Regrouped |
| Sweden | | Regrouped |
| Thailand | | Regrouped |
| USA | | Regrouped |

Table 5: Significance table.

Based on the resulting significance in Model B, this is the model chosen for further analysis. To meet the goal that all variables in the final model should be significant, the organic variable is removed from the model and the ale variable is regrouped together with the other styles variable, thereby adding it to the intercept. This gives the final model, which is presented in section Results. Model B is shown in Figure 14 and Table 6 for comparison with the final model.

[Plots: Residuals vs Fitted, Normal Q-Q, Scale-Location and Residuals vs Leverage (with Cook’s distance contours) for Model B. Labelled observations include 123, 1019 and 1359.]

Figure 14: Model B, developed through analysis of dummy variables.

| Residual standard error | Multiple R^2 | Adjusted R^2 | F-statistic | P-value |
| --- | --- | --- | --- | --- |
| 0.203 | 0.758 | 0.755 | 198 | < 2.2 · 10^-16 |

Table 6: Summary of Model B.

7.5 Outliers and High Leverage Points

Throughout the development, the leverage plots have varied in whether they indicate possibly influential points. The question is therefore analyzed numerically with Cook’s distance and leverage values, in order to decide whether any observations should be eliminated. These numerically calculated values are shown in Figure 15. The first plot shows Cook’s distance; as stated in section Mathematical Theory, a value larger than 0.5 indicates that the observation might be removed from the model. There are no values larger than 0.1, and therefore no observations are excluded due to Cook’s distance. The second plot shows the leverage values; if the value for an observation exceeds 0.25, that observation may be removed. The largest value is approximately 0.2, so no observation is deleted due to its leverage value.

[Plots: Cook’s distances against observation index (vertical axis from 0.00 to 0.08) and leverage against observation index (vertical axis from 0.00 to 0.20).]

Figure 15: Outliers and high leverage points.
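The two diagnostics can be illustrated for the simplest case of a regression with one explanatory variable, where the leverage has the closed form h_i = 1/n + (x_i − x̄)^2 / S_xx. The Python sketch below uses made-up data in which one point lies far from the others, and applies the thresholds 0.5 for Cook’s distance and 0.25 for leverage quoted above. The function name `influence_measures` is a hypothetical helper, and the project itself computed these values in R for the full model.

```python
def influence_measures(xs, ys):
    """Leverage h_i and Cook's distance D_i for the fit y = a + b*x."""
    n, p = len(xs), 2  # p = number of estimated coefficients
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    a = my - b * mx
    resid = [y - (a + b * x) for x, y in zip(xs, ys)]
    s2 = sum(e ** 2 for e in resid) / (n - p)  # residual variance estimate
    lev = [1 / n + (x - mx) ** 2 / sxx for x in xs]
    cook = [e ** 2 / (p * s2) * h / (1 - h) ** 2 for e, h in zip(resid, lev)]
    return lev, cook

# Made-up data: the last point is far out in x and off the trend.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]
ys = [1.1, 1.9, 3.2, 3.9, 5.1, 6.0, 6.8, 8.1, 9.0, 12.0]
lev, cook = influence_measures(xs, ys)
flagged = [i for i, (h, d) in enumerate(zip(lev, cook)) if h > 0.25 or d > 0.5]
print(flagged)
```

The leverage values always sum to the number of estimated coefficients, so with 1413 observations the average leverage is small, which is in line with no observation exceeding the threshold in Figure 15.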

7.6 Cross Validation

The k-fold cross validation is applied to three of the models obtained during the model development: the original linear regression model, the model with logarithmically transformed variables and the final model. k is set to ten, which generates ten folds that are each used as validation set once, and their respective MSE values are computed. Table 7 presents the average of these values, the CV estimate, for each model.

| Model | Original Model | Transformed Model | Final Model |
| --- | --- | --- | --- |
| CV (k = 10) | 65.2 | 0.0432 | 0.0423 |

Table 7: Cross validation.

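The table shows that the MSE has been reduced through the model development, and the final model is expected to predict new observations with an MSE of about 0.042. Note that the value for the original model is measured on the scale of price per liter, whereas the values for the other two models are measured on the logarithmic scale, so the values are not directly comparable across the transformation. When comparing the cross validation plot for the original model, Figure 16, with the one for the final model, Figure 17, there are some differences. The original model has problems predicting the most expensive observations whereas the final model does not.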
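Since the final model predicts the logarithm of the price, its cross validation error is easier to read after a back-transformation. The short Python sketch below converts the CV estimate for the final model into a typical multiplicative error, assuming that the natural logarithm was used (the default of `log` in R).

```python
import math

cv_mse_log = 0.0423          # final model, Table 7 (log price scale)
rmse_log = math.sqrt(cv_mse_log)
factor = math.exp(rmse_log)  # typical multiplicative prediction error

print(round(rmse_log, 3), round(factor, 3))
```

A root mean square error of roughly 0.21 on the logarithmic scale corresponds to predictions that are typically off by a factor of about 1.23, that is, by roughly 20 to 25 percent of the price per liter.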

[Plot: cross-validation predictions for folds 1 to 10, original model. x-axis: Predicted (fit to all data); y-axis: pris.l. Small symbols show cross-validation predicted values.]

Figure 16: Predicted price against actual price, original model.

[Plot: cross-validation predictions for folds 1 to 10, final model. x-axis: Predicted (fit to all data); y-axis: log(pris.l). Small symbols show cross-validation predicted values.]

Figure 17: Predicted price against actual price, final model.

7.7 Variable Selection

The three methods for variable selection listed in section Mathematical Theory are applied to the final model to examine which variables are of most importance. Models with fewer than 3 variables are not considered, since the aim is to create a model with a larger number of variables. The result is presented in the Appendix and is interpreted as follows. All three methods indicate that the variables alcohol, sales per month and RateBeer rating are of the utmost importance. For the dummy variable group countries of origin, the backward selection method removes the entire group from the models. The forward and the best subset methods both select Czech Republic and Germany for models with 8 or more variables. From the dummy variable group beer styles, the variables light lager and sour are chosen for models with 5 or more variables; the backward method also selects these styles for models with 5 or more variables. However, the backward method differs by selecting medium dark lager and dark lager for models with 8 or more variables, whereas the two other methods remove all other styles from models with 10 or fewer variables.

If it is preferable to increase the model adequacy beyond that of the chosen final model, changes could be made according to these results from the variable selection methods. The goal of this project is, as stated before, to develop a model where all variables are significant, and a model with a large number of variables is preferable. Because of this goal the final model is not modified.

8 Submodels

Submodels for Country of Origin and Beer Styles have been created to analyze the effect on price in models that contain only one of these groups of dummy variables. These models thus contain only dummy variables as regressors.

8.1 Country of Origin

A regression model with price per liter as response variable and the countries listed in section List of Variables as explanatory variables has been created to study the variation due to the origin of a beer. Figure 18 presents the standard plots for this model. Cook’s distance and leverage have been calculated numerically for the submodel, and no observations are deleted from the model due to these values.

[Plots: Residuals vs Fitted, Normal Q-Q, Scale-Location and Residuals vs Leverage (with Cook’s distance contours) for the country of origin submodel. Labelled observations include 155, 990 and 1282.]

Figure 18: Submodel country of origin.

When working with dummy variables there are no possible transformations for the explanatory variables. The response variable has been logarithmically transformed; without this transformation the residuals would have been skewed, see the normal probability plots in Figure 19. The left plot shows the model with the original response, which is clearly skewed. In the right plot the response variable has been transformed and the normal probability plot is improved; the skewness is no longer present.

[Plots: two Normal Q-Q plots, before and after the transformation of the response. x-axis: Theoretical Quantiles; y-axis: Standardized residuals.]

Figure 19: Normal probability plots.

8.2 Beer Styles

As in the previous section, a regression model with dummy variables as explanatory variables and price per liter as response is created. In this section the influence of the different styles of beer is analyzed. Again, the response variable is logarithmically transformed to prevent skewness. This model is presented in Figure 20. Cook’s distance and leverage have been calculated numerically for this submodel as well, and once more no observations are deleted from the model due to these values.

[Plots: Residuals vs Fitted, Normal Q-Q, Scale-Location and Residuals vs Leverage (with Cook’s distance contours) for the beer styles submodel. Labelled observations include 990, 1006 and 1082.]

Figure 20: Submodel beer styles.

8.3 Box Plots

In the regression model there are six groups of dummy variables. In addition to the submodels created for two of these, their box plots are analyzed. A box plot is used when it is of interest to visualize the spread for different dummy variables, as opposed to examining, for example, only the mean. Outliers that have not been detected earlier in the model development may appear. This is because these data points were not extraordinary in price per liter when they were examined together with all observations. [3]

In the box plot of Country of Origin, Figure 21, the spread is interpreted as the variation in price per liter. The box plots for the countries are placed side by side to show which countries have the highest medians and which have the greatest spread. Each box is limited by the upper and lower quartiles, the median is marked inside the box, the dotted whiskers show the range of the observations that are not classified as outliers, and outliers are plotted as dots. Conclusions drawn from this figure are that Belgium, Sweden and USA are the countries with the greatest variation in price per liter. Canada has the highest median; however, its number of observations is only 7 and therefore very low compared to the total number of observations. A more likely candidate for the highest median is Belgium with its 97 observations.

[Box plots: price per liter (vertical axis from 50 to 250) for each country of origin.]

Figure 21: Countries of origin.

In Figure 22 the box plots for the different styles of beer are plotted. Unclassified ale, porter/stout and sour beers show a wide variation in price per liter. The sour style has the highest median with its 44 observations and light lager the lowest median with its 497 observations. Given the high number of observations for light lagers compared to the total number of observations, light lager is considered very likely to in fact be the cheapest beer style in price per liter.

[Box plots: price per liter (vertical axis from 50 to 250) for each beer style: Ale Belgian, Ale Br.-Am., Ale Unclass., Ale German, Light L., Medium L., Dark L., Porter/Stout, Sour, Wheat and Other Styles.]

Figure 22: Beer styles.

The four remaining groups of dummy variables only contain two variables, true or false. In the plot for package, with bottles and cans, the variation in price is seemingly the same. The canned beers are indicated to be cheaper per liter by their median, see Figure 23. The division of observations between the organic beers and the ones that are not is heavily in favour of the ones that are not organic, since only 121 of the beers are. However, the variation in price per liter for the two variables is similar, and the box plot in Figure 24 indicates a difference in price per liter between the two groups according to the medians.

[Box plots: price per liter for Bottle and Can, and for Organic and Not Organic.]

Figure 23: Package. Figure 24: Organic.

In the data set there are 263 observations which are not in stock and 1178 observations which are. The box for the beers which are no longer in stock, see Figure 25, has a slightly wider variety in price per liter. This might be due to a large spread in styles, alcohol by volume and so on for beers that are in the temporary batch. These beers are also more expensive per liter than the beers which are still in stock. There are only 81 observations missing a rating at RateBeer, and the medians for these and the rated ones are very similar, see Figure 26. This indicates that the price per liter does not depend on whether a beer is rated or not. However, the variation in price is much larger for the rated observations.

[Box plots: price per liter (vertical axis from 50 to 250) for In Stock and Not In Stock, and for Rated and Unrated.]

Figure 25: In stock. Figure 26: Rated.

9 Results

9.1 Final Model

The final model is a multiple linear regression model with the logarithm of price per liter as response variable and 20 explanatory variables. The summary of these variables is presented in Table 8. All of the variables except Netherlands (p = 0.089) are significant at a level of 95%, which is one of the major aims of this project. The significance codes in the table are: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1. Variables marked (log) are logarithmically transformed.

Final model

| Coefficient | Estimate | Std. Error | t-value | Pr(>\|t\|) | Significance |
| --- | --- | --- | --- | --- | --- |
| Intercept | 3.50 | 6.31 · 10^-2 | 55.52 | < 2 · 10^-16 | *** |
| Alcohol (log) | 4.89 · 10^-1 | 3.00 · 10^-2 | 16.33 | < 2 · 10^-16 | *** |
| Sales per month (log) | -7.34 · 10^-4 | 1.07 · 10^-4 | -6.88 | < 2 · 10^-16 | *** |
| Months in stock | -1.53 · 10^-3 | 1.68 · 10^-4 | -9.02 | 9.1 · 10^-12 | *** |
| Number of stores | 2.90 · 10^-4 | 8.15 · 10^-5 | 3.56 | 0.00038 | *** |
| RateBeer rating | 4.08 · 10^-3 | 2.55 · 10^-4 | 16.02 | < 2 · 10^-16 | *** |
| Unrated | 1.97 · 10^-1 | 2.64 · 10^-2 | 7.46 | 1.5 · 10^-13 | *** |
| Package | 1.77 · 10^-1 | 2.01 · 10^-2 | 8.80 | < 2 · 10^-16 | *** |
| Not in stock | -4.43 · 10^-2 | 1.56 · 10^-2 | -2.84 | 0.00453 | ** |
| Dark lager | -1.89 · 10^-1 | 4.11 · 10^-2 | -4.61 | 4.4 · 10^-6 | *** |
| Light lager | -1.49 · 10^-1 | 1.66 · 10^-2 | -9.01 | < 2 · 10^-16 | *** |
| Medium dark lager | -1.71 · 10^-1 | 3.49 · 10^-2 | -4.88 | 1.2 · 10^-6 | *** |
| Porter/stout | -5.25 · 10^-2 | 2.32 · 10^-2 | -2.26 | 0.02403 | * |
| Sour | 2.80 · 10^-1 | 3.28 · 10^-2 | 8.53 | < 2 · 10^-16 | *** |
| Wheat | -9.78 · 10^-2 | 3.55 · 10^-2 | -2.75 | 0.00597 | ** |
| Austria | -1.59 · 10^-1 | 7.84 · 10^-2 | -2.03 | 0.04247 | * |
| Czech Republic | -1.35 · 10^-1 | 3.40 · 10^-2 | -3.99 | 7.0 · 10^-5 | *** |
| Germany | -1.46 · 10^-1 | 3.00 · 10^-2 | -4.85 | 1.3 · 10^-6 | *** |
| Great Britain | -7.61 · 10^-2 | 2.43 · 10^-2 | -3.13 | 0.00177 | ** |
| Netherlands | -1.17 · 10^-1 | 6.85 · 10^-2 | -1.70 | 0.08861 | . |
| New Zealand | 2.05 · 10^-1 | 8.44 · 10^-2 | 2.43 | 0.01516 | * |

Table 8: Table of the final model.
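Because the response is the logarithm of the price per liter, the estimates in Table 8 are easiest to read as relative effects. The Python sketch below converts a few of the dummy variable estimates into percentage changes and reads the alcohol coefficient as an elasticity. It assumes that natural logarithms were used, which is the default of `log` in R; the helper name `percent_effect` is hypothetical.

```python
import math

# Selected estimates copied from Table 8 (response: log price per liter).
dummy_coefs = {"Sour": 0.280, "Package (bottle)": 0.177, "Light lager": -0.149}
alcohol_elasticity = 0.489  # coefficient of log(alcohol)

def percent_effect(beta):
    """Percentage change in price per liter when a dummy switches from 0 to 1."""
    return (math.exp(beta) - 1.0) * 100.0

effects = {name: percent_effect(b) for name, b in dummy_coefs.items()}
for name, pct in effects.items():
    print(f"{name}: {pct:+.1f} %")

# Log-log term: a 10 % higher ABV scales the predicted price by 1.10 ** 0.489.
abv_effect = (1.10 ** alcohol_elasticity - 1.0) * 100.0
print(f"10 % higher ABV: {abv_effect:+.1f} %")
```

Under these assumptions a sour beer is predicted to cost roughly a third more per liter than an otherwise identical beer in the base case, while a light lager is predicted to cost roughly 14 percent less.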

The base case in the intercept of the final model consists of: beers in stock, cans, rated at RateBeer, other styles (including all ales) and other countries (the original other countries group together with Belgium, Canada, Denmark, Finland, France, International, Ireland, Italy, Norway, Spain, Sweden, Thailand and USA). The residual plots, the normal Q-Q plot and the leverage plot are shown in Figure 27. The model is close to homoscedastic, there are no visible outliers or high leverage points, and the normal Q-Q plot indicates that the residuals are close to normally distributed.

[Plots: Residuals vs Fitted, Normal Q-Q, Scale-Location and Residuals vs Leverage (with Cook’s distance contours) for the final model. Labelled observations include 123, 1019 and 1359.]

Figure 27: Final model.

The following tables are obtained from the final model’s summary and its calculated VIF values.

Residuals

| Min | 1Q | Median | 3Q | Max |
| --- | --- | --- | --- | --- |
| -0.9944 | -0.1156 | 0.0082 | 0.1189 | 0.8560 |

Table 9: The residuals of the final model.

Model Adequacy

| Residual standard error | Multiple R^2 | Adjusted R^2 | F-statistic | P-value |
| --- | --- | --- | --- | --- |
| 0.204 | 0.758 | 0.754 | 218 | < 2.2 · 10^-16 |

Table 10: Summary of the final model.
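The quantities in Table 10 are linked through the standard formulas for the adjusted R^2 and the overall F-statistic of a linear model with an intercept. As a consistency check, the Python sketch below recomputes both from the multiple R^2, the number of observations and the number of explanatory variables. Since the R^2 in the table is rounded to three decimals, the results can only be expected to agree with the table up to rounding.

```python
n, p = 1413, 20        # observations and explanatory variables (final model)
r2 = 0.758             # multiple R^2 from Table 10

# Adjusted R^2 penalizes the number of explanatory variables.
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
# Overall F-statistic: explained variation per regressor over residual variance.
f_stat = (r2 / p) / ((1 - r2) / (n - p - 1))

print(round(adj_r2, 4), round(f_stat, 1))
```

Both values come out close to the tabulated 0.754 and 218, which supports the reading that the final model has 20 explanatory variables fitted to all 1413 observations.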

The summary shows that the residual standard error is close to 0, which is an important result. Both the multiple and the adjusted R^2 are lower than in the original model, which is expected since these measures decrease when information, in the form of transformation and exclusion of variables, is removed from the model. The F-statistic is also lower than in the other models, but it is still considered sufficient: when the number of observations is large, an F-statistic only somewhat larger than 1 is enough to reject the null hypothesis in favor of the hypothesis that there is a dependence between the response and the regressors. [2]

As stated in section Mathematical Theory, a VIF value under 5 indicates that the variables are not multicollinear. Table 11 presents the VIF values for each explanatory variable in the final model; because the highest value is lower than 5, the model does not show any tendency of multicollinearity.

Variance Inflation Factor

| Variable | VIF |
| --- | --- |
| Alcohol | 1.63 |
| Sales per month | 4.22 |
| Months in stock | 1.42 |
| Number of stores | 3.33 |
| RateBeer rating | 2.57 |
| Unrated | 1.28 |
| Package | 1.49 |
| Not in stock | 1.25 |
| Dark lager | 1.08 |
| Light lager | 1.98 |
| Medium dark lager | 1.09 |
| Porter/stout | 1.17 |
| Sour | 1.11 |
| Wheat | 1.15 |
| Austria | 1.03 |
| Czech Republic | 1.11 |
| Germany | 1.15 |
| Great Britain | 1.07 |
| Netherlands | 1.01 |
| New Zealand | 1.03 |

Table 11: VIF values for the final model.
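The VIF of a regressor equals 1/(1 - R_j^2), where R_j^2 is the coefficient of determination obtained when that regressor is regressed on all the other regressors. The Python sketch below is illustrative only (the thesis values were computed in R). It takes the values of Table 11, checks them against the threshold of 5 and backs out the implied R_j^2 for each variable. The dictionary and variable names are chosen here for illustration.

```python
VIF_THRESHOLD = 5.0  # threshold used in the thesis

# VIF values copied from Table 11.
vif = {
    "Alcohol": 1.63, "Sales per month": 4.22, "Months in stock": 1.42,
    "Number of stores": 3.33, "RateBeer rating": 2.57, "Unrated": 1.28,
    "Package": 1.49, "Not in stock": 1.25, "Dark lager": 1.08,
    "Light lager": 1.98, "Medium dark lager": 1.09, "Porter/stout": 1.17,
    "Sour": 1.11, "Wheat": 1.15, "Austria": 1.03, "Czech Republic": 1.11,
    "Germany": 1.15, "Great Britain": 1.07, "Netherlands": 1.01,
    "New Zealand": 1.03,
}

# Variable with the largest VIF and a check of every value against the threshold.
worst_variable = max(vif, key=vif.get)
all_below_threshold = all(value < VIF_THRESHOLD for value in vif.values())

# Invert VIF = 1 / (1 - R_j^2) to get the share of each regressor's
# variance that is explained by the remaining regressors.
implied_r2 = {name: 1.0 - 1.0 / value for name, value in vif.items()}

print(f"largest VIF: {worst_variable} ({vif[worst_variable]})")
print(f"all below {VIF_THRESHOLD}: {all_below_threshold}")
print(f"implied R_j^2 for {worst_variable}: {implied_r2[worst_variable]:.3f}")
```

Read this way, the largest VIF in the table means that roughly three quarters of the variation in sales per month is explained by the other regressors, which is still below the level the threshold of 5 corresponds to (R_j^2 = 0.8).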

9.2 Submodels

Two submodels have been created and their summaries are presented in the following tables.

Countries of Origin
Residual standard error    Multiple R^2    Adjusted R^2    F-statistic    P-value
0.365                      0.219           0.209           20.6           < 2.2 · 10^-16

Table 12: Summary countries of origin.

Beer Style
Residual standard error    Multiple R^2    Adjusted R^2    F-statistic    P-value
0.303                      0.459           0.455           119            < 2.2 · 10^-16

Table 13: Summary beer styles.

An interesting result for these models is how their estimated coefficients compare to each other. The estimates are listed in Table 14; variables with positive estimates are more expensive than the median price for this regression model. According to this model, Canadian beers are the most expensive and Czech beers the least expensive. As seen in the plot for countries of origin in section Box Plots, these countries were also the most expensive and the cheapest, respectively.

Coefficient Estimates
Variable          Estimate
Intercept         3.99
Belgium           0.58
Denmark           0.29
Finland           -0.18
France            0.29
International     -0.07
Ireland           -0.06
Italy             0.34
Canada            0.92
Netherlands       -0.03
Norway            0.57
New Zealand       0.64
Spain             -0.06
Sweden            0.22
Great Britain     0.24
Thailand          -0.20
Czech Republic    -0.30
Germany           -0.08
USA               0.50
Austria           -0.06

Table 14: Countries of origin.

Comparing the coefficient estimates for the different beer styles shows that sour beers are the most expensive and light lagers the least expensive. Again, this agrees with the box plot for the different styles.

Coefficient Estimates
Variable                 Estimate
Intercept                4.37
Ale Belgian              0.13
Ale British/American     -0.01
Ale unclassified         0.17
Ale German               -0.11
Light lager              -0.53
Medium dark lager        -0.41
Dark lager               -0.32
Porter/stout             0.18
Sour                     0.34
Wheat                    -0.25

Table 15: Beer styles.
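The rankings read off Tables 14 and 15 can be reproduced with a few lines of code. The Python sketch below is illustrative and not part of the thesis. It assumes that the submodels, like the final model, use the natural logarithm of price per litre as the response, so that exp(intercept + estimate) gives an implied typical price in SEK per litre for each level. That assumption, and all names in the code, are introduced here for illustration only.

```python
import math

# Estimates copied from Table 14 (countries of origin).
COUNTRY_INTERCEPT = 3.99
country_estimates = {
    "Belgium": 0.58, "Denmark": 0.29, "Finland": -0.18, "France": 0.29,
    "International": -0.07, "Ireland": -0.06, "Italy": 0.34, "Canada": 0.92,
    "Netherlands": -0.03, "Norway": 0.57, "New Zealand": 0.64, "Spain": -0.06,
    "Sweden": 0.22, "Great Britain": 0.24, "Thailand": -0.20,
    "Czech Republic": -0.30, "Germany": -0.08, "USA": 0.50, "Austria": -0.06,
}

# Estimates copied from Table 15 (beer styles).
STYLE_INTERCEPT = 4.37
style_estimates = {
    "Ale Belgian": 0.13, "Ale British/American": -0.01,
    "Ale unclassified": 0.17, "Ale German": -0.11, "Light lager": -0.53,
    "Medium dark lager": -0.41, "Dark lager": -0.32, "Porter/stout": 0.18,
    "Sour": 0.34, "Wheat": -0.25,
}


def implied_prices(intercept: float, estimates: dict) -> dict:
    """Back-transform log-scale predictions (assumed log response) to prices."""
    return {name: math.exp(intercept + b) for name, b in estimates.items()}


country_prices = implied_prices(COUNTRY_INTERCEPT, country_estimates)
style_prices = implied_prices(STYLE_INTERCEPT, style_estimates)

most_expensive_country = max(country_prices, key=country_prices.get)
cheapest_country = min(country_prices, key=country_prices.get)
most_expensive_style = max(style_prices, key=style_prices.get)
cheapest_style = min(style_prices, key=style_prices.get)

print(most_expensive_country, cheapest_country)
print(most_expensive_style, cheapest_style)
```

Because the exponential function is increasing, the ranking of the implied prices is the same as the ranking of the raw estimates. The back-transformation only matters when the size of the price differences is of interest.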

9.3 Calculation Example

The model is applied to three beers from Systembolaget. Two of them were chosen by the authors for their popularity and the third for its sour beer style. Each beer's values, listed in Table 16, are inserted into the final model; dummy variables that are zero for all three beers are omitted. The intent of this analysis is to compare the model's price per litre with the price charged by Systembolaget, to see whether the item is under- or overpriced according to the model.

Variable values
Variable            Pistonhead    Sierra Nevada    Oude
Alcohol             4.5%          5.6%             6.5%
Sales per month     106263        21140            445
Months in stock     37            86               46
Number of stores    282           418              46
RateBeer rating     32            94               97
Package             Can           Bottle           Bottle
Light lager         1             0                0
Sour                0             0                1

Table 16: Variable values for three example beers.

Pistonhead Flat Tire
Predicted value from R: 4.441831
Fitted value calculated by the equation below:

\[
\begin{aligned}
\log(\text{price per volume}) ={}& 3.5 + 4.89 \cdot 10^{-1} \cdot \log(4.5) - 7.34 \cdot 10^{-4} \cdot \log(106263) - 1.53 \cdot 10^{-3} \cdot 37 \\
&+ 2.90 \cdot 10^{-4} \cdot 282 + 4.08 \cdot 10^{-3} \cdot 32 - 1.49 \cdot 10^{-1} = 4.233729
\end{aligned}
\]

The calculated price per liter is 68.97 SEK/liter while the actual price per liter from Systembolaget is 30.61 SEK/liter. According to the comparison between the calculated and actual prices, Pistonhead Flat Tire is greatly underpriced.

Sierra Nevada Pale Ale
Predicted value from R: 4.302991
Fitted value calculated by the equation below:

\[
\begin{aligned}
\log(\text{price per volume}) ={}& 3.5 + 4.89 \cdot 10^{-1} \cdot \log(5.6) - 7.34 \cdot 10^{-4} \cdot \log(21140) - 1.53 \cdot 10^{-3} \cdot 86 \\
&+ 2.90 \cdot 10^{-4} \cdot 418 + 4.08 \cdot 10^{-3} \cdot 94 + 1.77 \cdot 10^{-1} = 4.885283
\end{aligned}
\]

The calculated price per liter is 132.33 SEK/liter while the actual price per liter from Systembolaget is 67.32 SEK/liter. According to the comparison between the calculated and actual prices, Sierra Nevada Pale Ale is greatly underpriced.

Oude Kriek Boon
Predicted value from R: 4.271688
Fitted value calculated by the equation below:

\[
\begin{aligned}
\log(\text{price per volume}) ={}& 3.5 + 4.89 \cdot 10^{-1} \cdot \log(6.5) - 7.34 \cdot 10^{-4} \cdot \log(445) - 1.53 \cdot 10^{-3} \cdot 46 \\
&+ 2.90 \cdot 10^{-4} \cdot 46 + 4.08 \cdot 10^{-3} \cdot 97 + 1.77 \cdot 10^{-1} + 2.80 \cdot 10^{-1} = 5.206555
\end{aligned}
\]

The calculated price per liter is 182.46 SEK/liter while the actual price per liter from Systembolaget is 157.33 SEK/liter. According to the comparison between the calculated and actual prices, Oude Kriek Boon is a bit underpriced.
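The three hand calculations above can be reproduced programmatically. The Python sketch below is an illustration rather than the authors' R code. It uses the rounded coefficients that appear in the equations, takes log to be the natural logarithm, and treats the package dummy as 1 for a bottle and 0 for a can, as the equations suggest. The function and dictionary names are hypothetical.

```python
import math

# Rounded coefficients as they appear in the fitted equations of this section.
COEF = {
    "intercept": 3.5,
    "log_alcohol": 4.89e-1,
    "log_sales": -7.34e-4,
    "months": -1.53e-3,
    "stores": 2.90e-4,
    "rating": 4.08e-3,
    "bottle": 1.77e-1,       # package dummy, assumed 1 = bottle, 0 = can
    "light_lager": -1.49e-1,
    "sour": 2.80e-1,
}


def fitted_log_price(alcohol, sales, months, stores, rating,
                     bottle=0, light_lager=0, sour=0):
    """Evaluate the fitted equation; returns log(price per litre)."""
    return (COEF["intercept"]
            + COEF["log_alcohol"] * math.log(alcohol)
            + COEF["log_sales"] * math.log(sales)
            + COEF["months"] * months
            + COEF["stores"] * stores
            + COEF["rating"] * rating
            + COEF["bottle"] * bottle
            + COEF["light_lager"] * light_lager
            + COEF["sour"] * sour)


# Variable values from Table 16.
beers = {
    "Pistonhead Flat Tire": dict(alcohol=4.5, sales=106263, months=37,
                                 stores=282, rating=32, light_lager=1),
    "Sierra Nevada Pale Ale": dict(alcohol=5.6, sales=21140, months=86,
                                   stores=418, rating=94, bottle=1),
    "Oude Kriek Boon": dict(alcohol=6.5, sales=445, months=46,
                            stores=46, rating=97, bottle=1, sour=1),
}

fitted = {name: fitted_log_price(**values) for name, values in beers.items()}
prices = {name: math.exp(value) for name, value in fitted.items()}

for name in beers:
    print(f"{name}: log price = {fitted[name]:.6f}, "
          f"price = {prices[name]:.2f} SEK/litre")
```

Taking the exponential of each fitted value returns the model's price in SEK per litre, which is the figure compared with Systembolaget's actual price in the text.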

10 Discussion

The final model does not completely describe the pricing of beer at Systembolaget, since only beers that sold more than 1000 liters, subject to further restrictions (see section Restrictions in Data), have been analyzed. Restrictions are necessary in order to limit the amount of data to a reasonable size, and this limit yielded a good sample size. A full model, in which every beer was analyzed, would result in a computationally expensive sample size with several outliers and high leverage points, decreasing the level of significance.

10.1 Analysis of Variables

In the following section the explanatory variables from the final model are analyzed.

10.1.1 Quantitative Variables

Alcohol with an estimate of 0.489, Significance: ***
A higher alcohol by volume is associated with a higher price. This is partly because of the alcohol tax in Sweden, but mostly because these beers usually belong to more expensive beer styles. Some ales have a high alcohol by volume and are likely to be expensive, since these types of beer are often produced with fine ingredients. Producing a beer with a higher alcohol by volume requires an extended fermentation process, which is also assumed to contribute to a higher price.

Sales per month with an estimate of -0.000734, Significance: ***
The beers that people buy in bulk are assumed to be the cheaper ones, which makes a negative estimate reasonable.

Months in stock with an estimate of -0.00153, Significance: ***
Consumers generally like new, interesting beers, so beers that have been in stock for a longer period of time receive a penalty in price. Smaller breweries tend to launch new products, which are often expensive, whereas the largest brands usually stick to their established best sellers, which are often less expensive in order to attract buyers.

Number of stores with an estimate of 0.000290, Significance: ***
The estimated coefficient for number of stores is intuitively positive, since wider distribution makes the beer available to a greater number of consumers.

RateBeer rating with an estimate of 0.00408, Significance: ***
A higher rating corresponds to a better beer according to consumers, which reasonably raises the price since consumers will purchase their favourite beers.

10.1.2 Dummy Variables Except Beer Styles and Countries of Origin

Unrated with an estimate of 0.197, Significance: ***
With the RateBeer rating increasing the logarithm of the price per liter by 0.00408 per rating point, the estimate of the variable unrated is around 50 times larger, corresponding to an equivalent RateBeer rating of about 50. This equivalent rating could imply that the unrated beers are, on average, similar to beers that have actually received a RateBeer rating of around 50, which would be consistent with the unrated beers being distributed evenly throughout the 0-100 scale.

Package with an estimate of 0.177, Significance: ***
Beer in bottles seems more elegant and is nicer to drink directly from than beer in cans. Bottled beer can therefore also be priced higher.

Not in stock with an estimate of -0.0443, Significance: **
This variable represents beers that are no longer in Systembolaget's assortment. Many of these are seasonal beers, sold around Christmas and Easter, which usually belong to the beer styles porter, stout or dark lager. These beer styles have, as can be seen in the next section, a negative effect on the response variable, and the same is therefore assumed to hold for seasonal beers.
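Since the response in the fitted equations of section 9.3 is the logarithm of price per litre, a dummy estimate b corresponds to multiplying the price by exp(b), all else being equal. The Python sketch below is an illustration and not part of the thesis. It converts the three dummy estimates of this subsection into approximate percentage effects and computes the rating that the unrated estimate is equivalent to. The variable names are chosen here for illustration.

```python
import math

RATING_SLOPE = 0.00408  # estimate for RateBeer rating (log price per rating point)

# Dummy estimates from this subsection.
dummy_estimates = {
    "Unrated": 0.197,
    "Package": 0.177,
    "Not in stock": -0.0443,
}

# Rating at which a rated beer gets the same contribution as the unrated dummy.
equivalent_rating = dummy_estimates["Unrated"] / RATING_SLOPE

# Percentage change in price when the dummy switches from 0 to 1.
percent_effect = {
    name: (math.exp(b) - 1.0) * 100.0 for name, b in dummy_estimates.items()
}

print(f"equivalent rating for unrated beers: {equivalent_rating:.1f}")
for name, pct in percent_effect.items():
    print(f"{name}: {pct:+.1f} % in price per litre")
```

The equivalent rating comes out slightly below 50, which matches the "around 50" used in the discussion of the unrated variable.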

10.1.3 Beer Styles

Looking at which beer style variables are represented in the final model, all ales are located at the intercept. Since ales are generally fairly costly, all beer style estimates are negative except for one.

Dark lager with an estimate of -0.189, Significance: ***
Dark lager, also known as Bavarian, is one of the oldest beer styles. It is bottom-fermented, and its characteristic smell and taste are a result of the roasted malt. Its estimated coefficient is the lowest in this dummy variable group. A reason for this might be that this style of beer is often drunk in small volumes due to its heaviness and is usually combined with food.

Light lager with an estimate of -0.149, Significance: ***
This is the most common beer style, which yields a low overall price. Light lagers are made of the cheapest basic ingredients, and the brewing process is simple and may therefore be performed solely by machines.

Medium dark lager with an estimate of -0.171, Significance: ***
The medium dark lager is similar to dark lager, but the malt is only mildly roasted. This style is therefore appreciated by a wider range of consumers, which presumably yields a slightly higher estimated coefficient than for the dark lager style.

Porter/stout with an estimate of -0.0525, Significance: *
Most porters and stouts are not costly, but there are some exclusive beers in this group too, which reduces the absolute value of the estimate. Its wide range in price per litre, which can be seen in the box plot for beer styles, makes the variable difficult to fit in the model and is presumably the cause of its low significance.

Sour with an estimate of 0.280, Significance: ***
The sour beer style is a major trend, which means that these beers can be priced accordingly. They are usually sold in smaller bottles but still cost as much as or more than most beers, which explains the high price per liter. Sour beers are often produced by smaller breweries, and their brewing process is fragile because of the spontaneous fermentation. This beer style contains expensive ingredients such as fresh fruit and high quality hops.

Wheat with an estimate of -0.0978, Significance: **
This beer style's estimated coefficient is just a bit lower than the average cost of ales. Barley is the most common grain for beer production, with the exception of the wheat style which is, as the name states, made mostly from wheat. The beer is fermented at a high temperature and is most commonly unfiltered. This process is more complex than the bottom fermentation of lager beers, which might be a reason for the higher estimated coefficient compared with the lagers.

10.1.4 Countries of Origin

The coefficients of the following countries, relative to the reference level, are quite similar. This, combined with the low level of significance, suggests that a beer's origin is not as influential as the variables discussed above. However, some conclusions can be drawn from their respective estimated coefficients.

Austria with an estimate of -0.159, Significance: *
With only seven beers from Austria, combined with its low significance, it is difficult to pinpoint reasons for its negative coefficient.

Czech Republic with an estimate of -0.135, Significance: ***
The high level of significance for the Czech beers is a result of the good fit of its observations. The beer styles most produced are dark lagers and light lagers of the pilsner kind, which is more bitter than the light lager most common in Sweden, causing a negative tendency.

Germany with an estimate of -0.146, Significance: ***
Germany is the biggest producer of bottom-fermented beer. Its beers are classically brewed and are known for their well balanced style. In Germany there are several big brand breweries such as Beck's, Bitburger and Löwenbräu. These are all included in Systembolaget's German assortment, and therefore the estimated parameter is negative.

Great Britain with an estimate of -0.0761, Significance: **
Typical beers from Great Britain are red ales, bitters and porters, although most styles of beer are produced here. This is a country with a long history of beer production and an enormous number of . The estimated coefficient is negative but small, which might be a result of its large number of observations.

Netherlands with an estimate of -0.117, Significance: .
With only nine beers from the Netherlands, combined with its low significance, it is difficult to pinpoint reasons for its negative coefficient.

New Zealand with an estimate of 0.205, Significance: *
With only six beers from New Zealand, combined with its low significance, it is difficult to pinpoint reasons for its positive coefficient.

10.2 Looking Back at the Literature Analysis

Comparing the coefficient estimates for alcohol by volume from the literature analysis in section Introduction, this thesis's estimate is close to the estimate for whisky. The estimated coefficient for wine is considerably larger than the other two, see Table 17. Generally beer has the lowest alcohol by volume (around 5%), wines are somewhat higher (around 13%) and whisky is much higher (around 40%). The fact that alcohol by volume is a logarithmically transformed variable in this thesis, whereas it is untransformed in the whisky model, means that the estimates are even closer to each other. This suggests that the estimates for beer and whisky are intuitively correct compared to each other, while the estimate for wine is far out of range. It is noted that the models contain different explanatory variables, which also affects the outcome of the analyses.

Estimated Coefficients β̂
Beer             Whisky           Wine
4.89 · 10^-1     7.11 · 10^-2     16.6

Table 17: Estimated coefficients for alcohol by volume.

According to Huffington Post [6], craft beers are more expensive than beers from the larger brands, and craft beers are often associated with more extraordinary beer styles. In this thesis the beer style representing the craft beers would be the sour beers and, as can be seen in the model, these are in fact the most expensive ones. The sour beer style is the only beer style dummy variable with a positive estimate, relative to the intercept, which represents ales among some other styles. Comparing the final regression model with the research by SurveyMonkey [8], where style and taste were two of the top four factors for consumers, the following relation can be drawn. The variables number of stores and rating can both be regarded as measures of popularity, and both have positive estimates. This indicates that beers which are popular and considered tasty according to the ratings are pricier than others, as the article states.

10.3 Recommendations

In line with the stated goals of this project, the model was developed with the aims of significance and absence of multicollinearity. If the emphasis were placed on other aspects of regression analysis, the reader is recommended to analyze the variable selection and cross validation further. Continued work would preferably start from the model obtained after the transformations of variables have been carried out, since this is the model with the best values according to the model statistics, although it contains some non-significant variables. Applying the model to data from a source other than Systembolaget is also recommended.

References

[1] Douglas C. Montgomery et al. Introduction to Linear Regression Analysis, Fifth Edition. John Wiley & Sons, Inc., Hoboken, New Jersey, 2012.
[2] Gareth James et al. An Introduction to Statistical Learning. Springer Science+Business Media New York, 2013 (Corrected at 6th printing 2015).
[3] Richard D. De Veaux et al. Stats: Data and Models, Third Edition. Addison-Wesley, an imprint of Pearson Education Inc., 2012.
[4] Internet Movie Database. IMDb, another well known user of Bayesian formula. url: http://www.imdb.com/help/showleaf?votestopfaq.
[5] Michael Jackson. Stora Ölboken. Mitchell Beazley, Reed Consumer Books Limited, 1993.
[6] Huffington Post Joe Satran. Here's How A Six-Pack Of Craft Beer Ends Up Costing $12. url: http://www.huffingtonpost.com/2014/09/12/craft-beer-expensive-cost_n_5670015.html.
[7] Niclas Fuglesang Geuken Johan Kvastad. Analys av vinpriser genom regression. url: http://www.diva-portal.org/smash/get/diva2:736597/FULLTEXT01.pdf.
[8] SurveyMonkey Lara B. Drink it Up: What Consumers Want When Buying Beer. url: https://www.surveymonkey.com/blog/2014/05/01/what-kind-beer-people-like/.
[9] RateBeer. The calculation of the ratings. url: https://www.ratebeer.com/ratingsqa.asp.
[10] RateBeer. Widely known source of beer information. url: https://www.ratebeer.com.
[11] Emma Lundquist Sanne Bjartmar Hylta. Pricing Single Malt Whisky. url: http://www.diva-portal.org/smash/get/diva2:942654/FULLTEXT01.pdf.
[12] Systembolaget. Customer service. Apr. 28, 2017.
[13] Systembolaget. Sales statistics for 2016. url: https://www.systembolaget.se/imagelibrary/publishedmedia/yhv9nczrib3p4fofzbl0/2016_Artikellistan.xlsx.

11 Appendix

11.1 Variable Selection Tables

11.1.1 Best Subset Selection

(* = variable included, - = variable not included)

Number of Variables    3   4   5   6   7   8   9  10
Alcohol                *   *   *   *   *   *   *   *
Months in stock        -   -   -   -   *   *   *   *
Number of stores       -   -   -   -   -   -   -   -
Not in stock           -   -   -   -   -   -   -   -
Light lager            -   *   *   *   *   *   *   *
Medium dark lager      -   -   -   -   -   -   -   -
Dark lager             -   -   -   -   -   -   -   -
Porter/stout           -   -   -   -   -   -   -   -
Sour                   -   -   *   *   *   *   *   *
Wheat                  -   -   -   -   -   -   -   -
Package                -   -   -   -   *   *   *   *
Austria                -   -   -   -   -   -   -   -
Czech Republic         -   -   -   -   -   -   -   *
Germany                -   -   -   -   -   -   *   *
Great Britain          -   -   -   -   -   -   -   -
Netherlands            -   -   -   -   -   -   -   -
New Zealand            -   -   -   -   -   -   -   -
Sales per month        *   *   *   *   *   *   *   *
RateBeer rating        *   *   *   *   *   *   *   *
Unrated                -   -   -   *   *   *   *   *

Table 18: The final predictors chosen by the Best Subsets-algorithm.

11.1.2 Forward Subset Selection

(* = variable included, - = variable not included)

Number of Variables    3   4   5   6   7   8   9  10
Alcohol                *   *   *   *   *   *   *   *
Months in stock        -   -   -   -   -   *   *   *
Number of stores       -   -   -   -   -   -   -   -
Not in stock           -   -   -   -   -   -   -   -
Light lager            -   *   *   *   *   *   *   *
Medium dark lager      -   -   -   -   -   -   -   -
Dark lager             -   -   -   -   -   -   -   -
Porter/stout           -   -   -   -   -   -   -   -
Sour                   -   -   *   *   *   *   *   *
Wheat                  -   -   -   -   -   -   -   -
Package                -   -   -   -   *   *   *   *
Austria                -   -   -   -   -   -   -   -
Czech Republic         -   -   -   -   -   -   -   *
Germany                -   -   -   -   -   -   *   *
Great Britain          -   -   -   -   -   -   -   -
Netherlands            -   -   -   -   -   -   -   -
New Zealand            -   -   -   -   -   -   -   -
Sales per month        *   *   *   *   *   *   *   *
RateBeer rating        *   *   *   *   *   *   *   *
Unrated                -   -   -   *   *   *   *   *

Table 19: The final predictors chosen by the Forward Selection-algorithm.

11.1.3 Backward Subset Selection

(* = variable included, - = variable not included)

Number of Variables    3   4   5   6   7   8   9  10
Alcohol                *   *   *   *   *   *   *   *
Months in stock        -   -   -   -   *   *   *   *
Number of stores       -   -   -   -   -   -   -   -
Not in stock           -   -   -   -   -   -   -   -
Light lager            -   *   *   *   *   *   *   *
Medium dark lager      -   -   -   -   -   -   -   *
Dark lager             -   -   -   -   -   -   -   -
Porter/stout           -   -   -   -   -   -   -   -
Sour                   -   -   *   *   *   *   *   *
Wheat                  -   -   -   -   -   -   -   -
Package                -   -   -   *   *   *   *   *
Austria                -   -   -   -   -   -   -   -
Czech Republic         -   -   -   -   -   -   -   -
Germany                -   -   -   -   -   -   *   *
Great Britain          -   -   -   -   -   -   -   -
Netherlands            -   -   -   -   -   -   -   -
New Zealand            -   -   -   -   -   -   -   -
Sales per month        *   *   *   *   *   *   *   *
RateBeer rating        *   *   *   *   *   *   *   *
Unrated                -   -   -   -   -   *   *   *

Table 20: The final predictors chosen by the Backward Selection-algorithm.
