Automatic Device Segmentation for Conversion Optimization

A Forecasting Approach to Device Clustering Based on Multivariate Time Series Data from the Food and Beverage Industry

David Johansson

Computer Science and Engineering, master's level 2020

Luleå University of Technology
Department of Computer Science, Electrical and Space Engineering

Abstract

This thesis investigates a forecasting approach to clustering device behavior based on multivariate time series data. Identifying an equitable selection of devices to use in conversion optimization testing is a difficult task. As devices collect ever larger amounts of data about their behavior, it becomes increasingly difficult to rely on manual selection of segments in traditional conversion optimization systems. Forecasting the segments automatically can reduce the time spent on testing while increasing test accuracy and relevance. The thesis evaluates the results of utilizing multiple forecasting models, clustering models and data pre-processing techniques. Under optimal conditions, the proposed model achieves an average accuracy of 97.7%.

Preface

There are plenty of people who helped bring this thesis to fruition, and I am grateful to each and every one of them. The completion of this thesis would not have been possible without the support and insights of Robert Westerlund and Andreas Stormvinge at Future Ordering. I also had the great pleasure of working with Johan Kristiansson, whose knowledge and ideas surrounding data analysis had a great impact on my work. I would like to extend my sincere gratitude to Ahmed Elragal for supervising this thesis, along with all the valuable guidance he has provided. Especially helpful to me during this time were Adam Sawert and Johan Delissen, who were always there to discuss the difficult problems that I encountered while working on the thesis.

Contents

1 Introduction
  1.1 Background
  1.2 Motivation
  1.3 Problem definition
  1.4 Delimitations
  1.5 Thesis structure
2 Related Work
3 Theory
  3.1 Clustering
    3.1.1 Prototype Based Clustering
    3.1.2 Hierarchical Clustering
    3.1.3 Density Based Clustering
  3.2 Forecasting
    3.2.1 SARIMA
    3.2.2 Theta Model
    3.2.3 Vector Auto Regression
4 Method
  4.1 Data Pre-processing
    4.1.1 Data Overview
    4.1.2 Data Preparation
    4.1.3 Data Transformation
    4.1.4 Outlier Analysis
    4.1.5 Dimensionality Reduction
  4.2 Model Selection
    4.2.1 Selection of Clustering Models
    4.2.2 Selection of Forecasting Models
  4.3 Model Validation
  4.4 Experimental Methodology
  4.5 Software
5 Results
  5.1 Cluster Results
  5.2 Forecast Results
6 Discussion
  6.1 Generalizability
  6.2 Computational Complexity
7 Conclusions and Future Work
A Training Set Results

1 Introduction

1.1 Background

Future Ordering is a company that provides digital ordering solutions for the Food and Beverage industry. One of its solutions utilizes self-service kiosks, where customers place their orders through a digital interface. The software interface of such kiosks is optimized to maximize sales for the restaurant. A common strategy for increasing customer purchases in traditional e-commerce systems is conversion optimization. It allows businesses to compare the sales performance of different software versions. Comparisons are made by splitting devices into two equitable groups. One of the groups acts as a control and is left unchanged, while the other is updated to a newer version. Performance statistics are then collected over time and compared at the end of the test. It is also possible to take a multivariate approach that utilizes multiple test groups. The results of such a test can then be used to estimate the sales impact of new system updates. To acquire a more comprehensive view of the test results, user segmentation is often used[1]. By segmenting the results based on user type (e.g. new user, old user, high spender, low spender) it is possible to observe behaviors that are not applicable to the average user. This is because the effects a change has on one type of user might be negated by another type of user. The segmentation process is usually done by manually specifying rules that determine how the users should be divided. Since the kiosks only gather anonymous information about the ordering process for each order, it is not feasible to apply user segmentation when comparing kiosks. The behavior of the kiosks also fluctuates depending on external factors such as time of day. This means that comparisons must be performed over long periods of time to negate the impact of seasonal effects. The alternating behavior might also cause segments based on historical differences to correlate poorly with the actual differences during the test periods.
In an effort to make effective comparisons between kiosk versions Future Ordering is investigating alternative solutions to automate the segmentation process of self-service kiosks.

1.2 Motivation

The main purpose of this thesis is to increase the accuracy of self-service device comparisons. In addition to improvements in test accuracy and relevance, automating the segmentation process reduces the need for manual intervention in the segmentation step. Deploying tests within groups of kiosks that are predicted to exhibit similar behaviors in the future should decrease the impact of seasonal bias and, by extension, lower the required test duration. The results from this study could be used to segment digital devices in general and might therefore be applied in settings outside the Food and Beverage industry.

1.3 Problem definition

• Which clustering method is most suitable for clustering kiosk behavior?
• How does the length of the time interval used to create the input features affect the cluster quality?
• How does the frequency at which the order data is aggregated affect the cluster quality?
• Which forecasting method is most suitable for forecasting kiosk behavior?
• How does the length of the previously observed values affect the forecast accuracy?
• How does the length of the forecast horizon affect the forecast accuracy?

1.4 Delimitations

• The clustering models investigated in the thesis were limited to a selected subset of available clustering models (see section 4.2).
• The forecasting models investigated in the thesis were limited to a selected subset of available forecasting models (see section 4.2).
• Clustering and forecasting time periods exceeding four weeks were not investigated, since test periods that require more than four weeks were deemed to be too long.
• In the absence of labeled data, internal clustering measures were used to validate the performance of the different clustering models (see section 4.3).

1.5 Thesis structure

Section 2 presents previous works related to the problems and solutions introduced in the thesis. Section 3 introduces underlying theory about clustering and forecasting techniques. Section 4 starts by describing the data pre-processing and then moves on to outline the experimental methodology used in the thesis; it also introduces some of the theory that was not substantial enough to include in section 3. Section 5 presents the experimental results, with the exception of the training set results, which can be found in appendix A. Section 6 discusses the results from section 5 in relation to the problem definition in section 1.3. Section 7 gives a brief conclusion and discusses potential improvements and ways to continue the work started in this thesis.

2 Related Work

There are several instances where machine learning has successfully been used to automate the segmentation process for conversion optimization. Although most of the solutions utilize the K-means clustering algorithm to generate the segments[2][3], there is no established consensus on which clustering method performs the best for this particular task. Beyond K-means, there are examples utilizing hierarchical clustering[4][5] in e-commerce segmentation, as well as instances where soft[6] clustering was used. Most of the existing research on automatic behavior segmentation in e-commerce systems is based on the behavior of individual users. This thesis investigates solutions for segmenting the behavior of individual devices in environments where data on individual users is unobtainable. Many of the existing solutions base the segments on historical data gathered over long periods of time, resulting in generalized partitions. In comparison, this thesis takes a forecasting approach to the segmentation problem, basing the segments on forecasted values instead of historical values. The most frequently utilized forecasting techniques are based on statistical models, with ARIMA[7] being one of the more widely used. Recent years have introduced more sophisticated techniques based on machine learning and deep neural networks. Due to their increased complexity, deep neural networks are able to retain more information from the historical data compared to their statistical counterparts. Although this might sound like an improvement over the statistical models, it has been shown that information retention is a bad predictor of actual forecast accuracy[8]. Recent studies suggest that, in the general case, the statistical model with the worst accuracy outperforms the most accurate machine learning model[9].
Previous solutions for clustering multivariate time series utilize different combinations of distance measures, clustering algorithms and dimensionality reduction techniques. Among the solutions, there does not seem to be a single preferred choice of clustering algorithm. Due to the high dimensionality of multivariate time series, most solutions reduce the dimensions of the input space before applying the chosen clustering algorithm. Previous work shows that both principal component analysis[10] and creating embeddings with an auto-encoder[11] can be successfully utilized when reducing the dimensions of multivariate time series.

3 Theory

This section introduces the underlying theory used in the thesis, starting with clustering and then moving on to forecasting.

3.1 Clustering

The objective of a clustering algorithm is to partition a group of samples such that the samples within the same partition are more similar to each other than to samples from other partitions. While the objective of all clustering algorithms is the same, they all have their own strategies, which produce different results. Clustering is primarily an unsupervised form of machine learning. This means that there are no true observations to compare with the model output, which makes validating the result a difficult task. In the case of supervised clustering, external validation measures can be utilized. They compare the actual labels from the data set with the labels generated by the algorithm and calculate a score based on how well they match. In the unsupervised setting, internal validation measures are used instead of external ones. The internal measures calculate a score based on how distinct each individual cluster is. In other words, an internal measure seeks to maximize the average inter-cluster distance and minimize the average intra-cluster distance.

3.1.1 Prototype Based Clustering

The main idea of prototype based clustering algorithms revolves around each cluster having its own prototype, or centroid. The centroid is a point that describes the center of each cluster. The most popular algorithm in this category is K-means[12]. K-means tries to find k centroids that minimize the within-cluster sum-of-squares criterion:

$$\sum_{i=0}^{n} \min_{\mu_j \in C} \left( \lVert x_i - \mu_j \rVert^2 \right)$$

where k is equal to the number of specified clusters, x_i denotes the i-th sample, μ_j denotes the j-th cluster mean and C the set of all clusters. The utilization of the aforementioned criterion ensures that the variance within the resulting clusters is minimized. The number of clusters, k, must be specified prior to running the algorithm. In practical settings, the algorithm is often approximated using one of several heuristic algorithms. By applying Lloyd's[13] algorithm to the within-cluster sum-of-squares problem, the computational and space complexity of K-means is O(nkdi) and O((n + k)d) respectively, where n is the number of vectors being clustered, k is the number of specified clusters, d is the dimension of the vectors and i is the number of iterations until the algorithm converges. In cases where the data is clusterable, the number of iterations until convergence is usually negligible.
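The thesis does not list its implementation; as an illustration, a minimal K-means sketch using scikit-learn (an assumed dependency) on synthetic two-dimensional data rather than the kiosk vectors:

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated blobs; k must be chosen in advance, as noted above.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(5, 0.5, (20, 2))])

# n_clusters corresponds to k; scikit-learn runs Lloyd's algorithm by default.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = km.labels_    # cluster assignment per sample
inertia = km.inertia_  # the within-cluster sum of squares being minimized
```

With well-separated data like this, the algorithm converges in very few iterations, consistent with the complexity remark above.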

3.1.2 Hierarchical Clustering

The main idea of hierarchical clustering[14] algorithms is to merge clusters in an iterative process until the specified number of clusters has been achieved. The strategy used when selecting which clusters to merge is called a linkage criterion. Some of the most widespread criteria are single linkage, which minimizes the shortest distance between clusters; complete linkage, which minimizes the maximum distance between clusters; and Ward, which utilizes the same strategy as K-means by minimizing the variance of the clusters. Most of the hierarchical clustering algorithms have a computational complexity of O(n^3) and a space complexity of Ω(n^2), where n is the number of nodes being clustered. An exception to this is the single linkage criterion, which can utilize the SLINK[15] algorithm to achieve a computational complexity of O(n^2) and a space complexity of Ω(n).
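The three linkage criteria above can be compared side by side; a hedged sketch using scikit-learn's agglomerative implementation (an assumed dependency) on synthetic data:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (15, 2)), rng.normal(4, 0.3, (15, 2))])

# Fit the same data with each linkage criterion discussed above.
results = {}
for linkage in ("single", "complete", "ward"):
    model = AgglomerativeClustering(n_clusters=2, linkage=linkage).fit(X)
    results[linkage] = model.labels_
```

On clearly separated spherical blobs all three criteria agree; their behavior diverges mainly on noisy or elongated data, as described above.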

3.1.3 Density Based Clustering

Density based clustering algorithms seek to find areas of high density. Samples within the same high-density area are placed in the same cluster, while samples outside of the high-density areas are labeled as outliers. OPTICS[16] is an improved version of the popular DBSCAN algorithm. Running OPTICS requires the specification of the minimum number of samples that need to be present in an area for it to be considered dense. Compared to DBSCAN, OPTICS does not require the additional parameter that defines the maximum distance allowed between points for them to be considered part of the same dense area. The computational and space complexity of the OPTICS algorithm is O(n log n) and O(n) respectively, where n is the number of nodes being clustered.

3.2 Forecasting

The objective of a time series forecasting model is to predict future values in the series based on the trend present in the prior values of the series. Forecasting one series at a time is called univariate time series forecasting, while forecasting multiple series sharing the same time steps is called multivariate time series forecasting. Forecasting is a form of supervised learning. This means that the model output can be evaluated using the actual true values for the forecasted time steps. The forecasting models introduced in the following sections are all from the statistical category.

3.2.1 SARIMA

The SARIMA[17] model is an extension of the popular ARIMA model, with an added seasonal component. The ARIMA model bases its forecast on three terms calculated from the prior values. First, there is the autoregressive term, which determines how many of the previous values should be included when the forecast is calculated. Then there is the integrated term, which subtracts the previous value from the current value in an attempt to make the series stationary; it determines the minimum number of differencing iterations that are required. Lastly, there is the moving average term, which determines the number of previous forecast errors that should be used in the forecast. SARIMA has a computational complexity of O(D^3 N_tr N_te) and a space complexity of O(N_tr + N_te)[18], where D is the order of differencing, N_tr is the number of historical values and N_te is the number of forecasted values.

3.2.2 Theta Model

The Theta[19] model is only sensitive to one parameter, which determines how much the trend should be dampened. The model can be summarized into the following steps: first, test the time series for seasonality and deseasonalize if required; then apply a simple exponential smoother to the series; finally, forecast the series and re-seasonalize if the series was deseasonalized in the first step. The computational and space complexity of the Theta model is O(N_tr N_te) and O(N_tr + N_te) respectively, where N_tr is the number of historical values and N_te is the number of forecasted values.

3.2.3 Vector Auto Regression

The Vector Auto Regression[20] (VAR) model is an extension of the univariate autoregressive model found in ARIMA that can utilize information from multiple time series to produce the forecast. Just like in the normal autoregressive model, each time series has its own equation, but instead of only basing the forecast on its own previous values, it also incorporates prior values from the other time series. VAR has a computational complexity of O(N_tr N_te N_ts) and a space complexity of O((N_tr + N_te) N_ts)[18], where N_ts is the number of time series being forecast, N_tr is the number of historical values and N_te is the number of forecasted values.

4 Method

This thesis investigates the following approach to clustering the behavior of self-service kiosks:

1. Transform order data from each kiosk into a multivariate time series.
2. Forecast future values of each multivariate time series.
3. Cluster the kiosks based on the forecasted values.

The thesis evaluates the effects of applying different models for both the clustering step and the forecasting step. To evaluate how well each model adapts to new data, the data set was split into a training set and a test set. Parameters for each model were optimized by finding the parameter combination that scored the best on the training set. The best performing combination of parameters for each model was then evaluated on the test set. The rest of this section gives a more in-depth description of the previously explained method, starting with the data pre-processing and then moving on to the forecasting and clustering model evaluation strategy.

4.1 Data Pre-processing

This section starts with an overview of the data set used in the thesis. It then moves on to explain the transformations applied to the original data before it could be utilized.

4.1.1 Data Overview

The raw data set used in the thesis consists of a collection of just over four million orders, all gathered during the same 15 contiguous months in 2019 and 2020. All orders came from the same restaurant chain, encompassing 125 restaurants and 966 self-service kiosks. Each individual order was stored in a separate JSON file that contained information such as the date and time when the order was placed, which products the customer ordered and at which restaurant and kiosk the order was placed. In addition to the order data, there was also supplementary data about individual stores and menus. A schematic overview of the original data can be seen in figure 1.

Figure 1: Schema of the original data structure.

4.1.2 Data Preparation

The purpose of preparing the data was to make it more suitable for further data analysis. In some cases, the only difference between two products with different IDs was the size (e.g. small, medium, large). There were also a few cases in which the only difference was how the product name was formatted. These types of products were aggregated into a single product with only one ID. This was primarily done to increase the semantic difference between different products, as well as to reduce the required input size for the model. After all products had been aggregated, the total number of unique products was 195. The list of products from the raw data was transformed into a vector with one entry for each of the aggregated products. The value at each entry represented the quantity of that particular product that the customer had ordered. Entries for date, total price, and whether the food was ordered as take-away or not, were extracted without any special transformations.

Figure 2: Example of the data set after data preparation.

All store and kiosk information was removed except for the store ID and the kiosk ID. This was done to ensure that the model was based solely on the behavior and not the physical location of the kiosk. The store identifier was required for implementing the naive clustering solution, while the kiosk identifier was used to group orders in subsequent pre-processing steps. The extracted data was then stored in a single CSV file where each row represented an order. An example of the prepared data set can be seen in figure 2.
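The aggregation of product variants and the per-order quantity vectors described above can be sketched with pandas (an assumed dependency); the product names and the variant mapping are hypothetical, not from the thesis data:

```python
import pandas as pd

# Hypothetical raw order lines: product variants that differ only by size.
orders = pd.DataFrame({
    "order_id": [1, 1, 2, 3],
    "product_id": ["burger_s", "fries_m", "burger_l", "fries_s"],
    "quantity": [1, 1, 2, 1],
})

# Map each size variant to a single aggregated product ID.
variant_map = {"burger_s": "burger", "burger_l": "burger",
               "fries_s": "fries", "fries_m": "fries"}
orders["product_id"] = orders["product_id"].map(variant_map)

# One row per order, one column per aggregated product,
# holding the ordered quantity (zero when absent).
vectors = orders.pivot_table(index="order_id", columns="product_id",
                             values="quantity", aggfunc="sum", fill_value=0)
```

In the thesis this vector has 195 entries (one per aggregated product); here it has two, for readability.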

4.1.3 Data Transformation

To model the kiosks' behavior over time, each kiosk was represented as a multivariate time series. The time steps were evenly spaced and contained aggregated data from orders based on the given frequency. The entries for any given time step were the number of orders that had been placed, the number of orders that were identified as "take away", the total revenue and one entry for each product representing the number of sales for that product. Figure 3 illustrates the general structure of the multivariate time series.

 1 m #Orders1 Revenue1 T akeaway1 #P roduct1 ... #P roduct1 1 m #Orders2 Revenue2 T akeaway2 #P roduct ... #P roduct   2 2   . . . . .   ......  1 m #Ordersn Revenuen T akeawayn #P roductn ... #P roductn

Figure 3: The general structure of the multivariate time series matrix, where n is equal to the number of time steps and m is equal to the number of products.

Not all variables in the time series shared the same value range. To avoid the possibility that some of the variables would overshadow the rest, values belonging to the same variable series were normalized to have a value between zero and one. This ensured that every variable contributed equally when calculating the difference between different kiosks. After the data was normalized it was ready to be used in the forecasting step.
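The per-variable normalization to the range zero to one can be sketched in a few lines of numpy (an assumed dependency); the function name and sample values are illustrative:

```python
import numpy as np

def min_max_normalize(series: np.ndarray) -> np.ndarray:
    """Scale each column (variable) of a multivariate series to [0, 1]."""
    lo = series.min(axis=0)
    hi = series.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)  # avoid division by zero
    return (series - lo) / span

# Rows are time steps, columns are variables with very different ranges
# (e.g. order counts vs. revenue).
X = np.array([[10.0, 1000.0],
              [20.0, 3000.0],
              [30.0, 2000.0]])
X_norm = min_max_normalize(X)
```

After this step both columns span [0, 1], so neither dominates a Euclidean distance between kiosks.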

4.1.4 Outlier Analysis

The presence of outliers in the data was investigated using the unsupervised Histogram-based Outlier Score method (HBOS)[21]. It works by modeling the density of each feature with a histogram. An individual score is calculated for every observation and histogram. The scores are then aggregated into a single score for each observation. With its linear computational complexity, HBOS is significantly faster than its clustering based and nearest-neighbor based counterparts. This is due to its assumption that the data features are independent of each other. While this does come with a slight decrease in accuracy for data with highly correlated variables, it is not enough to warrant the use of more computationally complex solutions. Applying HBOS to the transformed data set yielded a relatively low number of outliers (<3%). Even though a small number of devices were identified as outliers by the HBOS algorithm, none of them were removed from the data set, for the following reasons. First, since the occurrence of outliers was rare, it was determined that their statistical impact on the final clustering result would be minimal. Secondly, a statistically deviating sample might not actually be an illegitimate sample, and removing it would therefore introduce a bias. Thirdly, removing outliers before clustering would undermine the built-in outlier detection of the OPTICS algorithm, and therefore make it hard to assess its effectiveness compared to the other clustering algorithms.
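The per-feature histogram scoring behind HBOS can be illustrated in plain numpy; this is a simplified sketch of the idea (sum of negative log densities under the independence assumption), not the reference implementation used in the thesis:

```python
import numpy as np

def hbos_scores(X: np.ndarray, bins: int = 10) -> np.ndarray:
    """Simplified HBOS: model each feature with a histogram and sum
    the negative log densities; higher score = more outlying."""
    n, d = X.shape
    scores = np.zeros(n)
    for j in range(d):
        hist, edges = np.histogram(X[:, j], bins=bins, density=True)
        # Map each value to its histogram bin.
        idx = np.clip(np.searchsorted(edges, X[:, j], side="right") - 1,
                      0, bins - 1)
        density = np.maximum(hist[idx], 1e-12)  # avoid log(0)
        scores += -np.log(density)
    return scores

rng = np.random.default_rng(6)
X = rng.normal(0, 1, (100, 3))
X[0] = [8.0, 8.0, 8.0]  # planted outlier, far from the bulk of the data
scores = hbos_scores(X)
```

Because every feature is scored independently against its own histogram, the whole procedure is linear in the number of observations, matching the complexity claim above.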

4.1.5 Dimensionality Reduction

Before the clustering step, each multivariate time series was flattened and saved as a row in a two-dimensional matrix, where each row represented a kiosk. This matrix tended to be sparse, especially for series with a large number of time steps. To decrease the computational complexity required during the clustering step, Halko's[22] randomized truncated singular value decomposition method was used to reduce the number of dimensions. This method works by projecting the original input space onto a number of principal components while retaining as much of the variation in the original data as possible. Halko's method has been shown to be one of the better performing dimension reduction techniques for large sparse matrices[23]. The dimensions were reduced to as few principal components as possible while still retaining 100% of the variance. Reducing the dimensions even further would have led to information loss, which in turn would have reduced the significance of the end results. Figure 4 illustrates the general structure of the final data set.
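A hedged sketch of the reduction step using scikit-learn's TruncatedSVD, whose randomized solver implements Halko's method (scikit-learn is an assumed dependency, and the matrix below is synthetic low-rank data, not kiosk series):

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(7)
# Rows stand in for flattened kiosk series; the matrix has rank 5
# by construction, so few components capture all the variance.
base = rng.normal(0, 1, (50, 5))
X = base @ rng.normal(0, 1, (5, 200))

# algorithm="randomized" selects the Halko-style randomized solver.
svd = TruncatedSVD(n_components=10, algorithm="randomized", random_state=0)
X_reduced = svd.fit_transform(X)
retained = svd.explained_variance_ratio_.sum()
```

In practice one would pick the smallest `n_components` for which `retained` reaches the desired variance threshold, mirroring the thesis's choice of retaining all of the variance.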

$$\begin{bmatrix}
p_1^1 & p_2^1 & \dots & p_{n-1}^1 & p_n^1 \\
p_1^2 & p_2^2 & \dots & p_{n-1}^2 & p_n^2 \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
p_1^{m-1} & p_2^{m-1} & \dots & p_{n-1}^{m-1} & p_n^{m-1} \\
p_1^m & p_2^m & \dots & p_{n-1}^m & p_n^m
\end{bmatrix}$$

Figure 4: The general data structure before clustering, where n is equal to the number of kiosks and m is equal to the number of principal components.

4.2 Model Selection

4.2.1 Selection of Clustering Models

The different clustering algorithms compared in this thesis were chosen from three popular categories[24], namely prototype based clustering, hierarchical clustering and density based clustering. The K-means[12] algorithm was chosen to represent the prototype based category because of its popularity and widespread use in the clustering field. One of the advantages of the K-means algorithm is its relatively low computational complexity. Among the density based algorithms, OPTICS[16] was selected to represent the category. The OPTICS algorithm performs similarly to the most common algorithm in the category, DBSCAN, while reducing the number of tunable parameters from two to one[24]. It also has the ability to identify samples as noise. This can result in more distinguishable clusters at the cost of reducing the number of samples included in the resulting clusters. While OPTICS can be used to identify clusters with arbitrary shapes, it cannot be used to find clusters with varying density, because the algorithm assumes that all clusters share the same density. Because of the distinct differences between the different linkage criteria used in hierarchical clustering[14], multiple algorithms were selected to represent the hierarchical category. Single linkage has a lower computational complexity compared to other hierarchical algorithms and has the ability to identify non-spherical patterns in the data, with varying results for data with a lot of noise. Complete linkage is more stable than single linkage and excels at finding clearly separated spherical clusters. Ward's linkage criterion has a relatively good performance in environments with noisy data. It also has a tendency to identify more evenly distributed clusters, at the cost of a longer average distance between samples that share the same cluster. All hierarchical clustering algorithms have the ability to identify clusters of varying sizes. To have a baseline that the more complex clustering models could be compared against, a naive clustering solution was included in the evaluation.
The naive solution clustered the kiosks based on the restaurant they were placed in. Kiosks from the same restaurant were placed in the same cluster, making the number of clusters equal to the number of restaurants. This mimics the selection of a simple rule-based segmentation system. Any of the more complex models should therefore be able to score better than the naive solution to justify its use. All clustering models used the Euclidean distance (L2 norm) to measure the difference between the vector-represented kiosks. The Euclidean distance is defined as follows:

$$d(p, q) = \sqrt{\sum_{i=1}^{n} (q_i - p_i)^2}$$

where p and q are one dimensional vectors of length n. Choosing the same similarity measure for all clustering models removed the potential for biased test results caused by contrasting measures. The reason for choosing the Euclidean distance was to expand the selection of available clustering algorithms. Both K-means and Ward's algorithm work on the assumption of Euclidean space and could not have been included in this thesis if another similarity measure had been selected.

4.2.2 Selection of Forecasting Models

Forecasting models can be divided into the following categories: statistical, machine learning and deep learning. Because of their overall poor performance[9] and time-consuming fine-tuning process, models from the machine learning and deep learning categories were not included in the forecasting experiments. This limited the selection to statistical forecasting models. The ARIMA model and the Theta model[19] have both shown great results when compared to other models in the field[8], with the former scoring better when the forecast horizon is short and the latter excelling at predicting longer forecast horizons. Both models were therefore included in the thesis. Since the data displayed a strong seasonal pattern, an extension of the standard ARIMA model, called SARIMA[17], was used instead. This allowed the seasonal component to be modeled directly. Both ARIMA and Theta are designed to forecast univariate time series. This means that they have to forecast each series in the multivariate time series separately from each other, without utilizing any information from the other series. The VAR[20] model was therefore selected to explore how the forecast would be affected by using the values from multiple series at the same time. The reason for choosing VAR over other multivariate forecasting models was its popularity and its relatively good performance in the field. As with the selection of clustering models, a naive forecasting solution was included to act as a baseline for the other models. The forecast that the naive model produced was equal to that of the preceding period. For example, if the naive model was tasked with forecasting the next week's values, the resulting forecast would be equal to the values of the previous week, regardless of any other factors.

4.3 Model Validation External validation methods could not be used in the cluster validation since the kiosk data did not have labels that determined the type of cluster that they be- longed to. This meant that internal validation methods had to be used instead. Using internal cluster validation metrics to compare different clustering solutions is not ideal because of the biases inherent in the metrics. In an attempt to mitigate this problem[25], two different metrics were included in the thesis. The first metric was Davies-Bouldin[26] (DB) score which should be minimized, with an optimal score of zero. The second metric was Silhouette[27] (SI) score which has a score ranging from -1 to 1, with an optimal score of 1. The internal validation mea- sures were determined by running the same clustering algorithm multiple times with different specified number of clusters, the measures were calculated for all of the different partitions, the score from the partition with the best score was then selected. Validation of the forecasting models was carried out by selecting two contiguous intervals in the data. The first interval was used as input to train the model, while the second was used as a validation set. Then, the model was used to forecast over a horizon equal to the length of the validation set. After that, the observed values in the validation set could be compared with the forecasted prediction. Two dif- ferent measurements were used to measure the forecast quality, Root mean square error[28] (RMSE) and Adjusted mutual information score [29] (AMI). RMSE was chosen because of its similarity with the distance metric used by the clustering models (euclidean distance). Since RMSE is the average error between the fore- casted and the observed values, it has an optimal value at zero. In traditional forecast validation, an error metric is enough to determine the fore- cast quality. 
For the purpose of this thesis, it is more important to verify that the forecast can accurately predict the kiosk clusters. This was ensured by including the AMI score. AMI compares the similarity between two partitions of the same data. If the two partitions are identical to each other the AMI score will be equal to one, while a score close to zero indicates that the two partitions are independent of each other. To calculate the AMI score, the forecasted and observed values had to be clustered, generating the two partitions.
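Assuming scikit-learn (which the thesis uses for clustering), the RMSE-plus-AMI validation scheme can be sketched as follows. K-means stands in here for whichever clustering model is being validated, and `forecast_quality`, `n_clusters=3` and the synthetic data are illustrative assumptions, not the thesis's fine-tuned setup:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_mutual_info_score

def forecast_quality(observed, forecast, n_clusters=3, seed=0):
    """RMSE between the raw values, plus AMI between cluster partitions of the
    observed and forecasted data (both are devices x features arrays)."""
    rmse = float(np.sqrt(np.mean((observed - forecast) ** 2)))
    cluster = lambda x: KMeans(n_clusters, n_init=10, random_state=seed).fit_predict(x)
    ami = adjusted_mutual_info_score(cluster(observed), cluster(forecast))
    return rmse, ami

rng = np.random.default_rng(0)
observed = rng.normal(size=(60, 8))
rmse, ami = forecast_quality(observed, observed)  # a perfect forecast
assert rmse == 0.0                 # no error between values
assert abs(ami - 1.0) < 1e-9       # identical partitions give an AMI of one
```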

4.4 Experimental Methodology

To test how well the models generalize to new data, the data set was split into a training set and a test set. Data from the first 12 months (80%) was included in the training set while data from the remaining 3 months (20%) was set aside as the test set. The training set was used for selecting optimal model parameters. Once the optimal parameters had been decided, the fine-tuned models were re-evaluated on the test set to get the final results. Evaluating the effectiveness of the models on a set of samples that were not used during the fine-tuning process provides an unbiased view of the models' performance [30].

The clustering comparisons evaluated the effects of choosing different clustering methods, sampling frequencies and time intervals. Samples were generated by utilizing a sliding window technique. The data set was divided into intervals of the length being investigated. Each interval was then transformed with the process outlined in sections 4.1.3 and 4.1.5. At least 10 samples were generated for any given time interval. To meet this criterion, samples of larger lengths had to be generated with some amount of overlapping data. The sampling frequencies investigated in the thesis ranged from 1 hour to 48 hours and the interval lengths ranged from 1 day to 28 days. When optimizing the clustering models, the set of parameter values that maximized the Silhouette score was selected. Given the large number of samples used in the experiments, no thorough analysis of the cluster contents was performed. The resulting clusters were therefore given generic numerical labels. If it had been possible to give meaningful and exhaustive labels to the identified clusters, the clusters could probably have been identified through manual selection, undermining the more complex solution proposed in the thesis.
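A sketch of the sliding-window sample generation described above. The thesis only states that at least 10 samples were produced, with overlap when necessary, so the evenly-spaced overlap scheme and the helper name below are assumptions:

```python
def window_starts(n_steps, window, min_samples=10):
    """Start indices for sliding windows of `window` steps over a series of
    `n_steps` steps: non-overlapping when enough disjoint windows fit,
    otherwise spaced evenly with overlap so that at least `min_samples`
    windows are produced."""
    if n_steps < window:
        return []
    n = max(n_steps // window, min_samples)
    if n == 1:
        return [0]
    stride = (n_steps - window) / (n - 1)  # a stride below `window` means overlap
    return [round(i * stride) for i in range(n)]

# A 3-month (91-day) test set only fits three disjoint 28-day windows,
# so overlapping windows are generated to reach ten samples.
starts = window_starts(n_steps=91, window=28)
assert starts == [0, 7, 14, 21, 28, 35, 42, 49, 56, 63]
```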
The forecasting comparisons evaluated the effects of choosing different clustering methods, lengths of prior values and forecasting horizons. Samples were generated with the same technique used when generating the clustering samples. The lengths of the time intervals were equal to the combined length of the prior values and the forecasting horizon. The lengths of prior values investigated in the thesis ranged from 1 day to 14 days and the forecasting horizons ranged from 1 day to 28 days. The sampling frequency used to create the samples was selected based on the cluster results from the training set. With the purpose of basing the forecasts on a model that yields the most distinct clusters, the sampling frequency with the highest Silhouette score was utilized in the forecasting comparisons. When optimizing the forecasting models, the set of parameter values that maximized the AMI score was selected. The clustering model used to calculate the AMI score was chosen by picking the model with the highest Silhouette score from the training set results.
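The selection rule above amounts to an argmax over training-set scores. A hypothetical sketch; the score values are illustrative numbers in the spirit of the training-set results, not an exhaustive grid:

```python
def pick_best(scores):
    """`scores` maps a (sampling frequency, clustering model) setting to its
    mean Silhouette score on the training set; pick the maximiser."""
    return max(scores, key=scores.get)

training_scores = {
    (12, "single linkage"): 0.702,
    (24, "single linkage"): 0.720,
    (24, "k-means"):        0.467,
}
assert pick_best(training_scores) == (24, "single linkage")
```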

4.5 Software

All code, including the data pre-processing, was written in Python. The clustering algorithms used Scikit-learn's [31] implementations of single linkage, complete linkage, Ward's algorithm, K-means and OPTICS. For reproducibility reasons it is worth noting that Scikit-learn's implementation of the hierarchical clustering algorithms might produce slightly different results compared to other implementations. When choosing the lowest-distance pair it is possible for more than one pair to share the same distance. In such cases, the implementation made by Scikit-learn might choose a different pair than other implementations would. The implementations used for the Theta model, SARIMA and VAR were imported from the Statsmodels [32] library.
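For concreteness, the named clustering models map onto scikit-learn as below. This is a sketch: `n_clusters=3`, OPTICS's `min_samples=5` and the synthetic blobs are illustrative choices, not the thesis's fine-tuned parameters:

```python
import numpy as np
from sklearn.cluster import OPTICS, AgglomerativeClustering, KMeans

models = {
    "single linkage":   AgglomerativeClustering(n_clusters=3, linkage="single"),
    "complete linkage": AgglomerativeClustering(n_clusters=3, linkage="complete"),
    "ward":             AgglomerativeClustering(n_clusters=3, linkage="ward"),
    "k-means":          KMeans(n_clusters=3, n_init=10, random_state=0),
    "optics":           OPTICS(min_samples=5),
}

# Three well-separated synthetic blobs of 30 points each; the hierarchical
# models and k-means should recover exactly three clusters here.
rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(c, 0.1, size=(30, 2)) for c in (0.0, 5.0, 10.0)])
labels = {name: m.fit_predict(X) for name, m in models.items()}
for name in ("single linkage", "complete linkage", "ward", "k-means"):
    assert len(set(labels[name])) == 3
```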

5 Results

This section presents the results obtained from the test set, with optimal model parameters. It starts with the results from the cluster model comparison and then moves on to the results obtained from the comparison of the different forecasting models. Results from the training set can be found in appendix A.

5.1 Cluster Results

The results presented in this section come from tests performed on the test set with optimal model parameters. Tests were performed on multiple samples with different combinations of time intervals and sample frequencies. The clustering quality was measured by calculating the DB score and the Silhouette score for every sample. Tables 1 through 5 show the average cluster quality, with accompanying standard deviation, for different combinations of time intervals, sample frequencies and cluster methods. The lowest recorded average DB score was 0,160, as can be seen in Table 3. It was achieved using the single linkage hierarchical model with an interval length of 7 days and a sampling frequency of 12 hours. The highest recorded average Silhouette score was 0,740, as can be seen in Table 3. It was achieved using the single linkage hierarchical model with an interval length of 7 days and a sampling frequency of 24 hours.

Frequency  Cluster Method  DB mean   DB std    SI mean   SI std
 1         Complete Link    0,370    0,1866     0,586    0,1293
           Single Link      0,213    0,0341     0,662    0,0585
           Ward Link        0,827    0,2041     0,414    0,0744
           K Means          0,742    0,1339     0,426    0,0589
           OPTICS           0,716    0,2333     0,449    0,2076
           Naive           13,100    0,6330    -0,476    0,0331
 6         Complete Link    0,509    0,2962     0,564    0,1624
           Single Link      0,193    0,0363     0,692    0,0629
           Ward Link        0,916    0,1754     0,387    0,0702
           K Means          0,854    0,1133     0,429    0,0498
           OPTICS           0,651    0,1903     0,496    0,1714
           Naive           11,906    0,6486    -0,457    0,0300
12         Complete Link    0,420    0,1872     0,624    0,1095
           Single Link      0,186    0,0325     0,704    0,0569
           Ward Link        0,877    0,0921     0,397    0,0721
           K Means          0,829    0,0482     0,446    0,0304
           OPTICS           0,634    0,1494     0,497    0,1392
           Naive           11,575    0,5944    -0,447    0,0293
24         Complete Link    0,427    0,1821     0,634    0,1004
           Single Link      0,182    0,0309     0,710    0,0548
           Ward Link        0,887    0,0961     0,399    0,0677
           K Means          0,836    0,0453     0,447    0,0269
           OPTICS           0,677    0,1899     0,473    0,1517
           Naive           11,196    0,5716    -0,437    0,0273

Table 1: Cluster results from the test set for intervals spanning 1 day. The frequency is measured in hours. Highlighted values indicate the best score for a particular frequency value.

Frequency  Cluster Method  DB mean   DB std    SI mean   SI std
 1         Complete Link    0,400    0,1975     0,556    0,1730
           Single Link      0,210    0,0403     0,668    0,0685
           Ward Link        0,841    0,1929     0,394    0,0721
           K Means          0,704    0,1542     0,442    0,0685
           OPTICS           0,699    0,2672     0,402    0,1887
           Naive           13,336    0,4970    -0,475    0,0324
 6         Complete Link    0,453    0,2546     0,585    0,1622
           Single Link      0,184    0,0344     0,710    0,0597
           Ward Link        0,828    0,1167     0,414    0,0656
           K Means          0,772    0,0891     0,466    0,0452
           OPTICS           0,798    0,2609     0,392    0,2539
           Naive           12,405    0,7054    -0,459    0,0370
12         Complete Link    0,429    0,1920     0,635    0,1019
           Single Link      0,174    0,0329     0,725    0,0571
           Ward Link        0,816    0,0598     0,417    0,0668
           K Means          0,774    0,0329     0,474    0,0239
           OPTICS           0,853    0,3126     0,354    0,2059
           Naive           12,077    0,5318    -0,458    0,0328
24         Complete Link    0,410    0,1753     0,645    0,1106
           Single Link      0,165    0,0289     0,739    0,0516
           Ward Link        0,800    0,0674     0,445    0,0646
           K Means          0,777    0,0299     0,481    0,0233
           OPTICS           0,668    0,2448     0,500    0,1958
           Naive           11,837    0,4801    -0,453    0,0298

Table 2: Cluster results from the test set for intervals spanning 3 days. The frequency is measured in hours. Highlighted values indicate the best score for a particular frequency value.

Frequency  Cluster Method  DB mean   DB std    SI mean   SI std
 1         Complete Link    0,278    0,1304     0,617    0,0783
           Single Link      0,204    0,0434     0,661    0,0740
           Ward Link        0,814    0,1582     0,400    0,0767
           K Means          0,602    0,1708     0,441    0,0523
           OPTICS           0,813    0,2840     0,395    0,1946
           Naive           13,434    0,2724    -0,469    0,0223
 6         Complete Link    0,443    0,1799     0,547    0,1739
           Single Link      0,172    0,0339     0,731    0,0568
           Ward Link        0,838    0,1330     0,438    0,0644
           K Means          0,728    0,0328     0,463    0,0225
           OPTICS           0,950    0,0547     0,284    0,0197
           Naive           12,712    0,3908    -0,487    0,0280
12         Complete Link    0,438    0,2081     0,632    0,1380
           Single Link      0,160    0,0333     0,728    0,0573
           Ward Link        0,820    0,0632     0,409    0,0494
           K Means          0,758    0,0235     0,500    0,0160
           OPTICS           0,568    0,3365     0,509    0,2915
           Naive           12,376    0,3635    -0,471    0,0246
24         Complete Link    0,523    0,2487     0,614    0,1479
           Single Link      0,168    0,0282     0,740    0,0489
           Ward Link        0,789    0,0612     0,464    0,0554
           K Means          0,736    0,0299     0,479    0,0221
           OPTICS           0,609    0,3720     0,503    0,3012
           Naive           12,188    0,4442    -0,461    0,0232
48         Complete Link    0,536    0,1388     0,554    0,1186
           Single Link      0,165    0,0293     0,739    0,0640
           Ward Link        0,808    0,0751     0,404    0,0881
           K Means          0,794    0,0228     0,483    0,0208
           OPTICS           0,632    0,1999     0,491    0,1655
           Naive           12,183    0,2762    -0,474    0,0203

Table 3: Cluster results from the test set for intervals spanning 7 days. The frequency is measured in hours. Highlighted values indicate the best score for a particular frequency value.

Frequency  Cluster Method  DB mean   DB std    SI mean   SI std
 1         Complete Link    0,428    0,2095     0,569    0,0788
           Single Link      0,214    0,0383     0,663    0,0643
           Ward Link        0,886    0,2734     0,396    0,0766
           K Means          0,685    0,1182     0,445    0,1037
           OPTICS           0,595    0,0764     0,512    0,0968
           Naive           13,690    0,2909    -0,472    0,0157
 6         Complete Link    0,374    0,2585     0,644    0,1105
           Single Link      0,189    0,0327     0,704    0,0558
           Ward Link        0,811    0,1067     0,424    0,0402
           K Means          0,725    0,0169     0,486    0,0162
           OPTICS           0,781    0,2065     0,403    0,1745
           Naive           12,918    0,2730    -0,477    0,0257
12         Complete Link    0,466    0,2246     0,610    0,1279
           Single Link      0,170    0,0303     0,723    0,0549
           Ward Link        0,778    0,0553     0,442    0,0531
           K Means          0,732    0,0124     0,488    0,0142
           OPTICS           0,926    0,2136     0,321    0,1217
           Naive           12,804    0,2904    -0,479    0,0144
24         Complete Link    0,596    0,3070     0,552    0,1402
           Single Link      0,167    0,0341     0,732    0,0697
           Ward Link        0,770    0,0327     0,443    0,0598
           K Means          0,730    0,0180     0,498    0,0195
           OPTICS           0,983    0,3953     0,283    0,1471
           Naive           12,697    0,2983    -0,474    0,0124
48         Complete Link    0,641    0,1176     0,525    0,0889
           Single Link      0,172    0,0328     0,721    0,0615
           Ward Link        0,799    0,0312     0,429    0,0493
           K Means          0,754    0,0209     0,494    0,0220
           OPTICS           0,558    0,0271     0,542    0,0748
           Naive           12,713    0,1819    -0,465    0,0186

Table 4: Cluster results from the test set for intervals spanning 14 days. The frequency is measured in hours. Highlighted values indicate the best score for a particular frequency value.

Frequency  Cluster Method  DB mean   DB std    SI mean   SI std
 1         Complete Link    0,459    0,1939     0,561    0,0542
           Single Link      0,209    0,0262     0,671    0,0451
           Ward Link        0,831    0,0245     0,392    0,0198
           K Means          0,574    0,2921     0,502    0,1054
           OPTICS           0,691    0,0737     0,411    0,0729
           Naive           13,804    0,0751    -0,455    0,0146
 6         Complete Link    0,445    0,2160     0,607    0,0746
           Single Link      0,182    0,0152     0,716    0,0259
           Ward Link        0,773    0,0191     0,398    0,0460
           K Means          0,727    0,0132     0,483    0,0128
           OPTICS           0,746    0,1736     0,393    0,0535
           Naive           13,250    0,1940    -0,453    0,0169
12         Complete Link    0,431    0,2201     0,643    0,0698
           Single Link      0,165    0,0127     0,735    0,0215
           Ward Link        0,791    0,0473     0,433    0,0271
           K Means          0,729    0,0109     0,486    0,0122
           OPTICS           0,762    0,2828     0,391    0,1251
           Naive           13,222    0,0062    -0,461    0,0172
24         Complete Link    0,697    0,1919     0,517    0,0914
           Single Link      0,163    0,0170     0,738    0,0427
           Ward Link        0,788    0,0294     0,414    0,0440
           K Means          0,721    0,0120     0,502    0,0121
           OPTICS           0,823    0,0697     0,352    0,0881
           Naive           13,160    0,0639    -0,462    0,0154
48         Complete Link    0,593    0,0764     0,574    0,0488
           Single Link      0,172    0,0168     0,718    0,0407
           Ward Link        0,798    0,0303     0,418    0,0446
           K Means          0,740    0,0084     0,499    0,0090
           OPTICS           0,968    0,0897     0,275    0,0665
           Naive           13,061    0,2555    -0,457    0,0098

Table 5: Cluster results from the test set for intervals spanning 28 days. The frequency is measured in hours. Highlighted values indicate the best score for a particular frequency value.

5.2 Forecast Results

The results presented in this section come from tests performed on the test set with optimal model parameters. Tests were performed on multiple samples with different combinations of training lengths and forecasting horizons. The forecast accuracy was measured by calculating the RMSE value and AMI score for every sample. Tables 6 and 7 show the average forecast accuracy, with accompanying standard deviation, for different combinations of training lengths, forecasting horizons and forecast methods. Note that the performance of the Naive forecast is not affected by the length of the training set, and is therefore constant for test results with the same forecast length. The lowest recorded average RMSE value was 0,069, as can be seen in Table 6. It was achieved using the VAR model with a training length of 7 days and a forecast horizon of 1 day. The highest recorded average AMI score was 0,977, as can be seen in Table 7. It was achieved using the SARIMA model with a training length of 14 days and a forecast horizon of 7 days.

Forecast  Train   Forecast  RMSE mean  RMSE std  AMI mean  AMI std
Length    Length  Method
 1         7      Naive      0,110      0,067     0,298     0,105
                  SARIMA     0,096      0,012     0,394     0,066
                  Theta      0,105      0,021     0,334     0,067
                  VAR        0,069      0,017     0,333     0,048
          14      Naive      0,110      0,067     0,298     0,105
                  SARIMA     0,088      0,018     0,536     0,089
                  Theta      0,098      0,014     0,369     0,062
                  VAR        0,077      0,015     0,373     0,047
          28      Naive      0,110      0,067     0,298     0,105
                  SARIMA     0,092      0,013     0,513     0,064
                  Theta      0,099      0,020     0,334     0,083
                  VAR        0,080      0,010     0,332     0,055
          56      Naive      0,110      0,067     0,298     0,105
                  SARIMA     0,091      0,018     0,390     0,049
                  Theta      0,100      0,013     0,335     0,042
                  VAR        0,078      0,019     0,335     0,084
 3         7      Naive      0,147      0,058    -0,003     0,087
                  SARIMA     0,090      0,018     0,452     0,056
                  Theta      0,103      0,017     0,335     0,056
                  VAR        0,079      0,011     0,136     0,027
          14      Naive      0,147      0,058    -0,003     0,087
                  SARIMA     0,086      0,022     0,395     0,049
                  Theta      0,096      0,019     0,375     0,062
                  VAR        0,078      0,016     0,356     0,089
          28      Naive      0,147      0,058    -0,003     0,087
                  SARIMA     0,089      0,013     0,402     0,067
                  Theta      0,101      0,020     0,335     0,042
                  VAR        0,081      0,014     0,335     0,048
          56      Naive      0,147      0,058    -0,003     0,087
                  SARIMA     0,090      0,013     0,429     0,086
                  Theta      0,099      0,016     0,332     0,055
                  VAR        0,079      0,020     0,335     0,042

Table 6: Forecast results from the test set for small forecast windows. The number of forecasting and training steps are measured in days. Highlighted values indicate the best score for a particular combination of training length and forecast horizon.

Forecast  Train   Forecast  RMSE mean  RMSE std  AMI mean  AMI std
Length    Length  Method
 7         7      Naive      0,108      0,058     0,566     0,156
                  SARIMA     0,108      0,022     0,823     0,103
                  Theta      0,113      0,014     0,787     0,157
                  VAR        0,088      0,018     0,455     0,057
          14      Naive      0,108      0,058     0,566     0,156
                  SARIMA     0,101      0,013     0,977     0,144
                  Theta      0,107      0,021     0,816     0,136
                  VAR        0,089      0,011     0,502     0,100
          28      Naive      0,108      0,058     0,566     0,156
                  SARIMA     0,955      0,136     0,580     0,116
                  Theta      0,936      0,156     0,454     0,076
                  VAR        0,916      0,183     0,455     0,076
          56      Naive      0,108      0,058     0,566     0,156
                  SARIMA     0,956      0,191     0,570     0,071
                  Theta      0,938      0,156     0,455     0,091
                  VAR        0,916      0,183     0,452     0,065
14        14      Naive      0,141      0,082     0,232     0,091
                  SARIMA     0,105      0,015     0,696     0,099
                  Theta      0,112      0,014     0,704     0,088
                  VAR        0,094      0,019     0,363     0,052
          28      Naive      0,141      0,082     0,232     0,091
                  SARIMA     1,004      0,143     0,668     0,134
                  Theta      1,011      0,253     0,750     0,150
                  VAR        0,974      0,144     0,335     0,048
          56      Naive      0,141      0,082     0,232     0,091
                  SARIMA     1,004      0,143     0,667     0,133
                  Theta      1,009      0,252     0,713     0,119
                  VAR        0,973      0,162     0,335     0,067

Table 7: Forecast results from the test set for large forecast windows. The number of forecasting and training steps are measured in days. Highlighted values indicate the best score for a particular combination of training length and forecast horizon.

6 Discussion

This section will start by addressing and answering all the research questions raised in section 1.3. It then goes on to discuss parts of the thesis that are not directly related to the research questions.

Which clustering method is most suitable for clustering kiosk behavior? Looking at Tables 1 through 5 we can see that all cluster models were able to score significantly better than the naive method. This indicates that automated device segmentation is a viable alternative to rule-based segmentation when the number of variables becomes too large for the latter to be practical. It is also clear that hierarchical clustering with the single linkage criterion has the best average performance regardless of the input characteristics. Its relatively good performance could indicate that, for the most part, clusters were of varying sizes and exhibited non-spherical shapes. While single linkage had the best average score, it is worth noting that it was not the most consistent, as is made evident by the standard deviation values. The most consistent models were the naive method and the K-means model. While using internal validation might not be the optimal validation solution, both metrics show that single linkage is the most suitable clustering method in this problem scenario.

How does the length of the time interval used to create the input features affect the cluster quality? The cluster scores are somewhat stable across all interval lengths, as can be seen in Tables 1 through 5. The best scoring observations had an interval length of 7 days and the worst scoring observations had an interval length of 1 day. Because of the slight difference between the lengths we cannot draw any clear conclusions about the significance of the results. Without further investigation, the current data seem to indicate that an interval length of 7 days is the sweet spot for identifying the most distinct clusters.
How does the frequency at which the order data is aggregated affect the cluster quality? From the results in Tables 1 through 5 it is clear that higher frequencies (1-6 hours) result in lower cluster quality. The reason for this behavior might be that data aggregated at higher frequencies exhibits a lower data density. This makes the data more sparse and increases the likelihood of sample features blending together, making it hard to clearly separate the samples from each other. The best scoring observations were made with frequencies of 12 and 24 hours, with most observations favoring the 24-hour sampling frequency. In comparison, observations with a sampling frequency of 48 hours saw a slight decrease in performance.

Which forecasting method is most suitable for forecasting kiosk behavior? From the results in Tables 6 and 7 we can see that SARIMA has the best observed AMI score. The performance of the different models seems to be context-dependent, with no clear winner in the general case. Even the naive model shows relatively competitive values, which in some cases surpass both the VAR model and the Theta model. Lacking any other insight, the current data seem to indicate that the SARIMA model is the most effective at forecasting the future behavior of kiosk order data. Another interesting observation is the lack of correlation between the RMSE value and the AMI score. The most obvious example of this can be seen in the VAR model observations. It has the lowest observed RMSE values for most of the comparisons. Despite this, it manages to have a relatively low AMI score; in some cases it is even lower than that of the naive method. The reason for this behavior might be that a model which produces a simple generalization of the underlying process can achieve a lower average error at the cost of losing the distinct differences between the devices.

How does the length of the previously observed values affect the forecast accuracy? Looking at Tables 6 and 7 it seems like the optimal number of prior values is equal to two times the length of the forecast horizon. This holds true for forecast horizons of length 3 days, 7 days and 14 days. The one step ahead forecast with a forecast horizon of one day seems to show the best results when the number of prior values is 14 days.

How does the length of the forecast horizon affect the forecast accuracy? The RMSE value seems to increase linearly with the length of the forecast horizon. Small forecast horizons (i.e. 1 day and 3 days) have a significantly lower AMI score compared to long forecast horizons (i.e. 7 days and 14 days), with most observations favoring horizons spanning 7 days.
This could be attributed to the fact that small forecast horizons are much more sensitive to individual errors while longer horizons are more forgiving. The Theta model performs better than other models on long horizons while SARIMA excels at low to medium horizons, confirming the results presented in [8].
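The RMSE/AMI mismatch discussed above can be reproduced on toy data. In this illustrative sketch (all numbers and names are assumptions, not thesis data), a "flat" forecast that collapses every device toward the global mean beats a structure-preserving but biased forecast on RMSE, yet loses the cluster information that AMI measures:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_mutual_info_score

rng = np.random.default_rng(1)
# 40 devices in two behavioral clusters, offset by 1.5 in every feature.
observed = np.concatenate([rng.normal(0.0, 1.0, (20, 5)),
                           rng.normal(1.5, 1.0, (20, 5))])

biased = observed + 2.0                        # keeps structure, large error
flat = np.tile(observed.mean(axis=0), (40, 1)) # "simple generalization" of the data
flat += rng.normal(0.0, 0.01, observed.shape)  # tiny jitter to avoid degenerate clustering

def rmse(a, b):
    return float(np.sqrt(np.mean((a - b) ** 2)))

def ami(a, b, k=2):
    part = lambda x: KMeans(k, n_init=10, random_state=0).fit_predict(x)
    return adjusted_mutual_info_score(part(a), part(b))

assert rmse(observed, flat) < rmse(observed, biased)  # flat wins on average error...
assert ami(observed, flat) < ami(observed, biased)    # ...but loses the cluster structure
```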

6.1 Generalizability

Comparing the results from the training set in appendix A with the test set in section 5 seems to indicate that all models generalize well to new data. The models that rely more heavily on the fine-tuning process (i.e. SARIMA, Theta and OPTICS) show a slight decrease in performance going from the training set to the test set. This slight decrease in performance, while present, is not significant enough to rule out the possibility that the changes can be attributed to chance.

The experiments performed in this study are all based on data from the Food and Beverage industry. This makes it difficult to reach any definitive conclusions about the performance of the proposed model in other settings. How successful the model would be when applied in other areas is determined by how closely the two data sets match. First, when faced with a new data set there is no guarantee that the data contains any clusters. In those cases it will not matter which solution is selected, since the data was not clusterable in the first place. Secondly, the data utilized in the thesis contains both a strong weekly seasonality and a strong daily seasonality. To take full advantage of the results in this study it is important that the data from the new setting shares the same seasonal behavior. In other situations where the objective is to cluster unlabeled multivariate time series, the properties of the data set may diverge from the data used in this thesis. In those cases it might not be feasible to utilize the exact parameters and models proposed in this thesis. As an alternative, the method outlined in section 4 could be used to investigate the optimal parameters and models for that specific setting. With this said, it is still possible that the method proposed in this thesis could be successfully deployed in areas outside of the Food and Beverage industry. Some of the more promising settings are self-checkout stations at retail stores, ATMs and other self-service terminals in general.

6.2 Computational Complexity

Utilizing the best scoring algorithms (i.e. the hierarchical clustering model with the single linkage criterion in conjunction with the SARIMA model) yields a computational complexity of O(N^2 + NVD^3PF) and a space complexity of O(N + NV(P + F)), where N is the number of devices, V is the number of variables tracked through time, D is the order of differencing, P is the number of previous values and F is the number of forecasted values. Using the previously identified optimal values (i.e. a sampling frequency of 24 hours, previous time steps spanning 14 days, a forecasting horizon of 7 days and a constant differencing factor) results in P = 14 and F = 7. With these values the computational complexity is equal to O(N^2 + 98NV) ∼ O(N^2 + NV) and the space complexity is equal to O(N + 21NV) ∼ O(NV).

Any further improvements to the space complexity are going to be minimal, since simply loading the data, without considering any constant factors, requires exactly NV space. In settings where the number of computations needs to be minimized, it might be worth considering a faster clustering technique, such as K-means. Running the optimal setup identified in this thesis, including the preceding data pre-processing, can be done several times an hour on data containing approximately 1000 devices with 200 parameters each. This means that the models can be continuously updated with new data throughout the day if required.

7 Conclusions and Future Work

This thesis has outlined a forecasting approach to clustering device behavior based on multivariate time series data. Multiple forecasting models, clustering models and data pre-processing techniques were compared and evaluated against each other. The forecasting results were highly dependent on the pre-processing technique used in combination with the different forecasting models. With optimal conditions (i.e. a sampling frequency of 24 hours, prior time steps spanning 14 days, a forecasting horizon of 7 days, utilizing the hierarchical clustering model with the single linkage criterion and forecasting values using the SARIMA model) the proposed model achieved an average AMI accuracy of 97,7%. The computational complexity of the proposed model grows quadratically with respect to the number of devices, while the space complexity exhibits linear growth.

Comparing the results from different clustering models without the presence of pre-labeled samples is a difficult task. The validation process used in the thesis has a lot of room for improvement. The ideal solution would be to perform the experiment on labeled data. That way, external cluster validation metrics could replace the internal cluster validation metrics, resulting in more significant results. In the absence of labeled data, the current method could be improved by extending the selection of internal cluster validation metrics. It is important that the selected metrics display a diverse set of properties to reduce the inherent bias present in every individual metric. The different clustering models and forecasting models investigated in the thesis are only a small subset of the plethora of models that are available today. Further investigating the effect of using other clustering models and forecasting models could lead to new discoveries and the identification of better performing models.
This is especially true for machine learning based forecasting models. Even though the average forecasting error of machine learning based forecasting models is relatively large when compared to their statistical counterparts, this thesis has shown that the average forecasting error is a bad predictor of how well a model is able to forecast the behavioral changes between devices. This means that the machine learning models might have an advantage in this specific problem scenario.

More work could also be put into investigating the effects of adding additional unstructured data to the model input. In the context of Food and Beverage, it is probable that adding weather data to the model would reveal new patterns in the model output. People's decisions are to some extent affected by the weather, which in turn changes the nature of the device interaction. Data collected from social media could assist in estimating the number of users that are likely to interact with a device at any given time. An increase in social media activity in the vicinity is likely a good indicator of an increase in the number of device interactions. Including historical and upcoming store maintenance and device maintenance in the input has the potential to increase the model accuracy. It is plausible that adding more information about irregular changes to the devices' duty cycles would have a positive effect on predicting device behavior. As the data space grows larger with an increasing number of data points, efficient dimensionality reduction becomes essential. It would therefore be interesting to research the effects of using other methods for reducing the dimensions of the input space, such as embedding the data with an auto-encoder.

References

[1] W. C. McDowell, R. C. Wilson, and C. O. Kile Jr, "An examination of retail website design and conversion rate," Journal of Business Research, vol. 69, no. 11, pp. 4837–4842, 2016.

[2] M. Namvar, M. R. Gholamian, and S. KhakAbi, "A two phase clustering method for intelligent customer segmentation," in 2010 International Conference on Intelligent Systems, Modelling and Simulation, 2010, pp. 215–219.

[3] P. Anitha and M. M. Patil, "RFM model for customer purchase behavior using k-means algorithm," Journal of King Saud University-Computer and Information Sciences, 2019.

[4] P. D. Hung, N. T. T. Lien, and N. D. Ngoc, "Customer segmentation using hierarchical agglomerative clustering," in Proceedings of the 2019 2nd International Conference on Information Science and Systems, ser. ICISS 2019, Tokyo, Japan: Association for Computing Machinery, 2019, pp. 33–37, isbn: 9781450361033. doi: 10.1145/3322645.3322677.

[5] K. Slaninová, R. Dolák, M. Miškus, J. Martinovič, and V. Snášel, "User segmentation based on finding communities with similar behavior on the web site," in 2010 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology, vol. 3, 2010, pp. 75–78.

[6] R.-S. Wu and P.-H. Chou, "Customer segmentation of multiple category data in e-commerce using a soft-clustering approach," Electronic Commerce Research and Applications, vol. 10, no. 3, pp. 331–341, 2011.

[7] R. Adhikari and R. K. Agrawal, "An introductory study on time series modeling and forecasting," arXiv preprint arXiv:1302.6613, 2013.

[8] S. Makridakis, E. Spiliotis, and V. Assimakopoulos, "Statistical and machine learning forecasting methods: Concerns and ways forward," PloS one, vol. 13, no. 3, e0194889, 2018.

[9] S. Makridakis, E. Spiliotis, and V. Assimakopoulos, "The M4 competition: Results, findings, conclusion and way forward," International Journal of Forecasting, vol. 34, no. 4, pp. 802–808, 2018.

[10] H. Li, "Multivariate time series clustering based on common principal component analysis," Neurocomputing, vol. 349, pp. 239–247, 2019.

[11] X. Wang, Y. Jin, and Y. Yu, "A mobile network performance evaluation method based on multivariate time series clustering with auto-encoder," in Proceedings of the 2nd International Conference on Telecommunications and Communication Engineering, ser. ICTCE 2018, Beijing, China: Association for Computing Machinery, 2018, pp. 33–37, isbn: 9781450365857. doi: 10.1145/3291842.3291859.

[12] A. K. Jain, "Data clustering: 50 years beyond k-means," Pattern Recognition Letters, vol. 31, no. 8, pp. 651–666, 2010.

[13] J. A. Hartigan and M. A. Wong, "Algorithm AS 136: A k-means clustering algorithm," Journal of the Royal Statistical Society. Series C (Applied Statistics), vol. 28, no. 1, pp. 100–108, 1979.

[14] S. C. Johnson, "Hierarchical clustering schemes," Psychometrika, vol. 32, no. 3, pp. 241–254, 1967.

[15] R. Sibson, "SLINK: An optimally efficient algorithm for the single-link cluster method," The Computer Journal, vol. 16, no. 1, pp. 30–34, 1973.

[16] M. Ankerst, M. M. Breunig, H.-P. Kriegel, and J. Sander, "OPTICS: Ordering points to identify the clustering structure," in Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data, ser. SIGMOD '99, Philadelphia, Pennsylvania, USA: Association for Computing Machinery, 1999, pp. 49–60, isbn: 1581130848. doi: 10.1145/304182.304187.

[17] M. Xie, C. Sandels, K. Zhu, and L. Nordström, "A seasonal ARIMA model with exogenous variables for Elspot electricity prices in Sweden," in 2013 10th International Conference on the European Energy Market (EEM), 2013, pp. 1–4.

[18] L. Lin, Q. Wang, S. Huang, and A. Sadek, "On-line prediction of border crossing traffic using an enhanced spinning network method," Transportation Research Part C: Emerging Technologies, Dec. 2013. doi: 10.1016/j.trc.2013.11.018.

[19] V. Assimakopoulos and K. Nikolopoulos, "The theta model: A decomposition approach to forecasting," International Journal of Forecasting, vol. 16, no. 4, pp. 521–530, 2000, The M3-Competition, issn: 0169-2070. doi: 10.1016/S0169-2070(00)00066-2.

[20] E. Zivot and J. Wang, "Vector autoregressive models for multivariate time series," Modeling Financial Time Series with S-PLUS, pp. 385–429, 2006.

[21] M. Goldstein and A. Dengel, "Histogram-based outlier score (HBOS): A fast unsupervised anomaly detection algorithm," KI-2012: Poster and Demo Track, pp. 59–63, 2012.

[22] N. Halko, P.-G. Martinsson, and J. A. Tropp, "Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions," SIAM Review, vol. 53, no. 2, pp. 217–288, 2011.
[23] K. Tsuyuzaki, H. Sato, K. Sato, and I. Nikaido, "Benchmarking principal component analysis for large-scale single-cell RNA-sequencing," Genome Biology, vol. 21, no. 1, p. 9, 2020.
[24] D. Xu and Y. Tian, "A comprehensive survey of clustering algorithms," Annals of Data Science, vol. 2, no. 2, pp. 165–193, 2015.
[25] J. Hämäläinen, S. Jauhiainen, and T. Kärkkäinen, "Comparison of internal clustering validation indices for prototype-based clustering," Algorithms, vol. 10, no. 3, p. 105, 2017.
[26] D. L. Davies and D. W. Bouldin, "A cluster separation measure," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. PAMI-1, no. 2, pp. 224–227, 1979.
[27] P. J. Rousseeuw, "Silhouettes: A graphical aid to the interpretation and validation of cluster analysis," Journal of Computational and Applied Mathematics, vol. 20, pp. 53–65, 1987.
[28] R. J. Hyndman and G. Athanasopoulos, Forecasting: Principles and Practice. OTexts, 2018.
[29] N. X. Vinh, J. Epps, and J. Bailey, "Information theoretic measures for clusterings comparison: Variants, properties, normalization and correction for chance," The Journal of Machine Learning Research, vol. 11, pp. 2837–2854, 2010.
[30] M. Kuhn and K. Johnson, Applied Predictive Modeling. Springer, New York, NY, 2013.
[31] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, "Scikit-learn: Machine learning in Python," Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.
[32] S. Seabold and J. Perktold, "Statsmodels: Econometric and statistical modeling with Python," in 9th Python in Science Conference, 2010.

A Training Set Results

Frequency   Method          DB Score (mean, std dev)   SI Score (mean, std dev)
1           Complete Link    0,357   0,1870             0,584   0,1304
            Single Link      0,204   0,0321             0,648   0,0583
            Ward Link        0,838   0,2049             0,400   0,0739
            K Means          0,761   0,1329             0,436   0,0585
            OPTICS           0,700   0,2298             0,473   0,2069
            Naive           13,050   0,6325            -0,481   0,0336
6           Complete Link    0,507   0,2959             0,557   0,1614
            Single Link      0,196   0,0353             0,683   0,0634
            Ward Link        0,912   0,1740             0,383   0,0702
            K Means          0,857   0,1135             0,446   0,0497
            OPTICS           0,631   0,1890             0,519   0,1687
            Naive           11,941   0,6516            -0,448   0,0315
12          Complete Link    0,415   0,1886             0,640   0,1093
            Single Link      0,183   0,0314             0,702   0,0577
            Ward Link        0,884   0,0914             0,408   0,0709
            K Means          0,826   0,0495             0,427   0,0319
            OPTICS           0,615   0,1460             0,498   0,1377
            Naive           11,536   0,5934            -0,466   0,0304
24          Complete Link    0,430   0,1809             0,645   0,1011
            Single Link      0,171   0,0300             0,720   0,0551
            Ward Link        0,897   0,0967             0,400   0,0667
            K Means          0,847   0,0472             0,467   0,0274
            OPTICS           0,663   0,1898             0,494   0,1488
            Naive           11,200   0,5764            -0,422   0,0278

Table 8: Cluster results from the training set for intervals spanning 1 day. The frequency is measured in hours. Highlighted values indicate the best score for a particular frequency value.
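Tables 8–12 report the Davies–Bouldin (DB) index [26], where lower is better, and the silhouette (SI) coefficient [27], where values near 1 indicate compact, well-separated clusters. The thesis computed these with scikit-learn [31]; as an illustrative sketch only (the function names and the `points`/`labels` inputs are hypothetical), both indices can be implemented with just the standard library for Euclidean data:

```python
from math import dist  # Euclidean distance between coordinate tuples (Python 3.8+)


def _group(points, labels):
    """Group points into a dict keyed by cluster label."""
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)
    return clusters


def silhouette_score(points, labels):
    """Mean silhouette coefficient: s(i) = (b - a) / max(a, b), where a is the
    mean distance to the sample's own cluster and b the lowest mean distance
    to any other cluster. Singleton clusters score 0 by convention."""
    clusters = _group(points, labels)
    total = 0.0
    for p, l in zip(points, labels):
        own = clusters[l]
        if len(own) == 1:
            continue  # contributes 0
        a = sum(dist(p, q) for q in own if q is not p) / (len(own) - 1)
        b = min(sum(dist(p, q) for q in other) / len(other)
                for k, other in clusters.items() if k != l)
        total += (b - a) / max(a, b)
    return total / len(points)


def davies_bouldin(points, labels):
    """DB index: average, over clusters, of the worst ratio of summed
    within-cluster scatter to between-centroid distance."""
    clusters = _group(points, labels)
    centroids = {l: tuple(sum(c) / len(m) for c in zip(*m))
                 for l, m in clusters.items()}
    scatter = {l: sum(dist(p, centroids[l]) for p in m) / len(m)
               for l, m in clusters.items()}
    keys = list(clusters)
    return sum(max((scatter[i] + scatter[j]) / dist(centroids[i], centroids[j])
                   for j in keys if j != i)
               for i in keys) / len(keys)
```

For two tight, well-separated clusters the sketch yields an SI score near 1 and a DB score near 0, matching the pattern seen for Single Link in the tables above.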

Frequency   Method          DB Score (mean, std dev)   SI Score (mean, std dev)
1           Complete Link    0,380   0,1961             0,567   0,1719
            Single Link      0,202   0,0414             0,649   0,0684
            Ward Link        0,844   0,1945             0,377   0,0724
            K Means          0,705   0,1533             0,427   0,0692
            OPTICS           0,666   0,2643             0,440   0,1852
            Naive           13,359   0,4958            -0,473   0,0317
6           Complete Link    0,437   0,2564             0,575   0,1635
            Single Link      0,194   0,0332             0,712   0,0585
            Ward Link        0,809   0,1169             0,401   0,0658
            K Means          0,763   0,0906             0,478   0,0459
            OPTICS           0,759   0,2593             0,412   0,2511
            Naive           12,425   0,7097            -0,468   0,0360
12          Complete Link    0,430   0,1935             0,626   0,1008
            Single Link      0,194   0,0310             0,738   0,0557
            Ward Link        0,818   0,0610             0,398   0,0649
            K Means          0,757   0,0320             0,478   0,0250
            OPTICS           0,824   0,3114             0,376   0,2048
            Naive           12,130   0,5264            -0,459   0,0315
24          Complete Link    0,397   0,1763             0,634   0,1099
            Single Link      0,157   0,0287             0,757   0,0514
            Ward Link        0,793   0,0669             0,453   0,0628
            K Means          0,773   0,0281             0,489   0,0219
            OPTICS           0,668   0,2440             0,520   0,1925
            Naive           11,884   0,4768            -0,454   0,0317

Table 9: Cluster results from the training set for intervals spanning 3 days. The frequency is measured in hours. Highlighted values indicate the best score for a particular frequency value.

Frequency   Method          DB Score (mean, std dev)   SI Score (mean, std dev)
1           Complete Link    0,283   0,1302             0,615   0,0781
            Single Link      0,233   0,0429             0,655   0,0751
            Ward Link        0,814   0,1575             0,416   0,0761
            K Means          0,627   0,1693             0,460   0,0545
            OPTICS           0,825   0,2837             0,385   0,1951
            Naive           13,419   0,2655            -0,465   0,0225
6           Complete Link    0,434   0,1809             0,547   0,1720
            Single Link      0,194   0,0334             0,715   0,0583
            Ward Link        0,826   0,1342             0,424   0,0647
            K Means          0,734   0,0330             0,471   0,0244
            OPTICS           0,982   0,0560             0,265   0,0196
            Naive           12,670   0,3946            -0,462   0,0266
12          Complete Link    0,428   0,2056             0,618   0,1365
            Single Link      0,179   0,0323             0,746   0,0565
            Ward Link        0,813   0,0624             0,418   0,0493
            K Means          0,733   0,0229             0,486   0,0125
            OPTICS           0,561   0,3378             0,541   0,2917
            Naive           12,356   0,3618            -0,487   0,0266
24          Complete Link    0,494   0,2467             0,589   0,1512
            Single Link      0,178   0,0280             0,746   0,0481
            Ward Link        0,798   0,0616             0,438   0,0560
            K Means          0,763   0,0303             0,505   0,0211
            OPTICS           0,611   0,3737             0,499   0,2998
            Naive           12,170   0,4391            -0,485   0,0229
48          Complete Link    0,561   0,1382             0,556   0,1192
            Single Link      0,154   0,0278             0,719   0,0624
            Ward Link        0,823   0,0766             0,412   0,0864
            K Means          0,788   0,0224             0,478   0,0205
            OPTICS           0,646   0,2008             0,499   0,1653
            Naive           12,222   0,2735            -0,458   0,0191

Table 10: Cluster results from the training set for intervals spanning 7 days. The frequency is measured in hours. Highlighted values indicate the best score for a particular frequency value.

Frequency   Method          DB Score (mean, std dev)   SI Score (mean, std dev)
1           Complete Link    0,411   0,2112             0,579   0,0790
            Single Link      0,231   0,0377             0,649   0,0639
            Ward Link        0,896   0,2724             0,392   0,0751
            K Means          0,669   0,1198             0,460   0,1042
            OPTICS           0,591   0,0757             0,550   0,0934
            Naive           13,645   0,2925            -0,489   0,0170
6           Complete Link    0,362   0,2589             0,656   0,1102
            Single Link      0,180   0,0324             0,718   0,0572
            Ward Link        0,822   0,1054             0,444   0,0420
            K Means          0,744   0,0163             0,481   0,0177
            OPTICS           0,753   0,2031             0,440   0,1734
            Naive           12,877   0,2752            -0,490   0,0269
12          Complete Link    0,448   0,2229             0,607   0,1292
            Single Link      0,158   0,0310             0,722   0,0552
            Ward Link        0,758   0,0559             0,443   0,0513
            K Means          0,728   0,0125             0,474   0,0150
            OPTICS           0,912   0,2112             0,352   0,1193
            Naive           12,782   0,2946            -0,477   0,0126
24          Complete Link    0,615   0,3050             0,542   0,1393
            Single Link      0,164   0,0338             0,741   0,0708
            Ward Link        0,757   0,0347             0,428   0,0585
            K Means          0,744   0,0183             0,481   0,0194
            OPTICS           0,979   0,3948             0,298   0,1444
            Naive           12,699   0,3016            -0,471   0,0106
48          Complete Link    0,660   0,1172             0,534   0,0903
            Single Link      0,159   0,0327             0,706   0,0611
            Ward Link        0,796   0,0302             0,445   0,0512
            K Means          0,757   0,0213             0,488   0,0239
            OPTICS           0,533   0,0257             0,556   0,0743
            Naive           12,661   0,1817            -0,466   0,0170

Table 11: Cluster results from the training set for intervals spanning 14 days. The frequency is measured in hours. Highlighted values indicate the best score for a particular frequency value.

Frequency   Method          DB Score (mean, std dev)   SI Score (mean, std dev)
1           Complete Link    0,466   0,1951             0,567   0,0553
            Single Link      0,197   0,0260             0,679   0,0462
            Ward Link        0,837   0,0261             0,407   0,0203
            K Means          0,587   0,2923             0,508   0,1041
            OPTICS           0,688   0,0715             0,439   0,0696
            Naive           13,772   0,0788            -0,465   0,0143
6           Complete Link    0,442   0,2170             0,611   0,0727
            Single Link      0,186   0,0170             0,733   0,0263
            Ward Link        0,768   0,0205             0,404   0,0458
            K Means          0,739   0,0151             0,493   0,0115
            OPTICS           0,716   0,1734             0,403   0,0525
            Naive           13,266   0,1985            -0,468   0,0186
12          Complete Link    0,446   0,2190             0,623   0,0708
            Single Link      0,176   0,0123             0,727   0,0229
            Ward Link        0,773   0,0477             0,422   0,0285
            K Means          0,726   0,0101             0,493   0,0118
            OPTICS           0,745   0,2793             0,410   0,1231
            Naive           13,200   0,0075            -0,470   0,0175
24          Complete Link    0,713   0,1922             0,501   0,0900
            Single Link      0,149   0,0178             0,752   0,0420
            Ward Link        0,781   0,0301             0,428   0,0435
            K Means          0,714   0,0124             0,521   0,0136
            OPTICS           0,785   0,0685             0,378   0,0862
            Naive           13,155   0,0669            -0,446   0,0146
48          Complete Link    0,580   0,0759             0,589   0,0505
            Single Link      0,164   0,0181             0,738   0,0420
            Ward Link        0,811   0,0318             0,409   0,0442
            K Means          0,756   0,0075             0,481   0,0072
            OPTICS           0,945   0,0866             0,280   0,0633
            Naive           13,020   0,2589            -0,454   0,0110

Table 12: Cluster results from the training set for intervals spanning 28 days. The frequency is measured in hours. Highlighted values indicate the best score for a particular frequency value.

Forecast   Train     Forecast   RMSE (mean, std dev)   AMI (mean, std dev)
Length     Length    Method
1          7         Naive      0,107   0,066          0,291   0,110
                     SARIMA     0,093   0,011          0,437   0,066
                     Theta      0,101   0,020          0,358   0,066
                     VAR        0,070   0,018          0,315   0,048
           14        Naive      0,107   0,066          0,291   0,110
                     SARIMA     0,082   0,015          0,598   0,087
                     Theta      0,091   0,013          0,395   0,057
                     VAR        0,078   0,015          0,386   0,046
           28        Naive      0,107   0,066          0,291   0,110
                     SARIMA     0,088   0,011          0,578   0,063
                     Theta      0,093   0,018          0,341   0,079
                     VAR        0,080   0,010          0,323   0,054
           56        Naive      0,107   0,066          0,291   0,110
                     SARIMA     0,085   0,015          0,423   0,045
                     Theta      0,091   0,010          0,385   0,040
                     VAR        0,082   0,019          0,334   0,085
3          7         Naive      0,143   0,058         -0,015   0,082
                     SARIMA     0,088   0,015          0,497   0,054
                     Theta      0,097   0,017          0,391   0,055
                     VAR        0,082   0,012          0,107   0,029
           14        Naive      0,143   0,058         -0,015   0,082
                     SARIMA     0,085   0,020          0,399   0,044
                     Theta      0,090   0,018          0,422   0,060
                     VAR        0,080   0,016          0,325   0,087
           28        Naive      0,143   0,058         -0,015   0,082
                     SARIMA     0,088   0,010          0,436   0,064
                     Theta      0,101   0,019          0,370   0,040
                     VAR        0,082   0,014          0,341   0,049
           56        Naive      0,143   0,058         -0,015   0,082
                     SARIMA     0,090   0,010          0,443   0,085
                     Theta      0,094   0,016          0,358   0,051
                     VAR        0,081   0,019          0,341   0,043

Table 13: Forecast results from the test set for small forecast windows. The forecast and training lengths are measured in days. Highlighted values indicate the best score for a particular combination of training length and forecast horizon.

Forecast   Train     Forecast   RMSE (mean, std dev)   AMI (mean, std dev)
Length     Length    Method
7          7         Naive      0,109   0,058          0,587   0,151
                     SARIMA     0,099   0,019          0,883   0,101
                     Theta      0,112   0,013          0,853   0,156
                     VAR        0,088   0,018          0,432   0,058
           14        Naive      0,109   0,058          0,587   0,151
                     SARIMA     0,095   0,011          0,986   0,140
                     Theta      0,103   0,019          0,859   0,134
                     VAR        0,093   0,012          0,476   0,102
           28        Naive      0,109   0,058          0,587   0,151
                     SARIMA     0,950   0,134          0,588   0,116
                     Theta      0,928   0,154          0,513   0,071
                     VAR        0,920   0,184          0,424   0,074
           56        Naive      0,109   0,058          0,587   0,151
                     SARIMA     0,946   0,189          0,610   0,068
                     Theta      0,930   0,155          0,476   0,088
                     VAR        0,920   0,185          0,424   0,065
14         14        Naive      0,145   0,081          0,251   0,093
                     SARIMA     0,101   0,013          0,750   0,096
                     Theta      0,105   0,013          0,764   0,086
                     VAR        0,090   0,018          0,368   0,054
           28        Naive      0,145   0,081          0,251   0,093
                     SARIMA     1,002   0,141          0,694   0,132
                     Theta      1,006   0,251          0,796   0,150
                     VAR        0,979   0,242          0,326   0,046
           56        Naive      0,145   0,081          0,251   0,093
                     SARIMA     1,002   0,140          0,688   0,130
                     Theta      1,000   0,249          0,741   0,115
                     VAR        0,972   0,161          0,318   0,066

Table 14: Forecast results from the test set for large forecast windows. The forecast and training lengths are measured in days. Highlighted values indicate the best score for a particular combination of training length and forecast horizon.
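Tables 13 and 14 score each forecasting model by RMSE (forecast error against the held-out series) and by AMI [29] (agreement between the clusterings obtained from forecast and actual data). As a minimal standard-library sketch of the two simplest pieces, assuming the naive baseline repeats the last observed seasonal cycle (the `seasonal_naive` helper is hypothetical, not the thesis's exact implementation):

```python
from math import sqrt


def seasonal_naive(history, season_length, horizon):
    """Forecast by repeating the last full seasonal cycle of the history.
    E.g. with hourly data and a daily season, season_length would be 24."""
    cycle = history[-season_length:]
    return [cycle[i % season_length] for i in range(horizon)]


def rmse(actual, forecast):
    """Root-mean-square error between two equal-length series."""
    return sqrt(sum((a - f) ** 2 for a, f in zip(actual, forecast)) / len(actual))
```

AMI itself involves an expectation correction over random labelings and is more involved; in a scikit-learn [31] environment it is available as `sklearn.metrics.adjusted_mutual_info_score`, which compares two label assignments and returns a chance-adjusted score with a maximum of 1.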
