UNIVERSITY OF LJUBLJANA FACULTY OF ECONOMICS

MASTER'S THESIS

THE EXTENT OF AUTONOMOUS DATA ANALYSIS FOR NON-IT STAFF WITH SELF-SERVICE TOOLS

Ljubljana, October 10th, 2018 Christian Piechorowski

AUTHORSHIP STATEMENT

The undersigned Christian Piechorowski, a student at the University of Ljubljana, Faculty of Economics, (hereafter: FELU), author of this written final work of studies with the title The extent of autonomous data analysis for non-IT staff with self-service business intelligence tools, prepared under supervision of Prof. dr. Jurij Jaklič and co-supervision of Prof. Tiago Oliveira

DECLARE

1. this written final work of studies to be based on the results of my own research;

2. the printed form of this written final work of studies to be identical to its electronic form;

3. the text of this written final work of studies to be language-edited and technically in adherence with the FELU’s Technical Guidelines for Written Works, which means that I cited and / or quoted works and opinions of other authors in this written final work of studies in accordance with the FELU’s Technical Guidelines for Written Works;

4. to be aware of the fact that plagiarism (in written or graphical form) is a criminal offence and can be prosecuted in accordance with the Criminal Code of the Republic of Slovenia;

5. to be aware of the consequences a proven plagiarism charge based on this written final work could have for my status at the FELU in accordance with the relevant FELU Rules;

6. to have obtained all the necessary permits to use the data and works of other authors which are (in written or graphical form) referred to in this written final work of studies and to have clearly marked them;

7. to have acted in accordance with ethical principles during the preparation of this written final work of studies and to have, where necessary, obtained permission of the Ethics Committee;

8. my consent to use the electronic form of this written final work of studies for the detection of content similarity with other written works, using similarity detection software that is connected with the FELU Study Information System;

9. to transfer to the University of Ljubljana free of charge, non-exclusively, geographically and time-wise unlimited the right of saving this written final work of studies in the electronic form, the right of its reproduction, as well as the right of making this written final work of studies available to the public on the World Wide Web via the Repository of the University of Ljubljana;

10. my consent to publication of my personal data that are included in this written final work of studies and in this declaration, when this written final work of studies is published.

Ljubljana, ______ (Month in words / Day / Year, e.g. June 1st, 2012) Author’s signature: ______

TABLE OF CONTENTS

INTRODUCTION
1. ANALYSIS OF POTENTIAL BENEFITS OF SSBI
1.1 Basic Concepts of Business Intelligence
1.2 Business Intelligence Architecture
1.2.1 Operational Source System
1.2.2 Extract, Transform, Load
1.2.3 Online Analytical Processing
1.2.4 Business Intelligence Applications
1.3 Dimensional Data Modelling
1.4 Self-Service and Self-Service Business Intelligence
1.5 Self-Service Business Intelligence: Benefits and Shortcomings
1.5.1 Levels of Self-Service Business Intelligence
1.5.2 Data Governance and SSBI
1.5.3 Data Quality and its Dimensions
1.5.3.1 Accuracy
1.5.3.2 Completeness
1.5.3.3 Time-related Dimensions
1.5.3.4 Consistency
1.5.3.5 Other Considerations
1.6 Self-Service Business Intelligence and the importance of correct Ontology
2 METHODOLOGY, LIMITATIONS, AND EXECUTION OF THE EXPERIMENT
3 RESULTS
4 DISCUSSION
CONCLUSION AND OUTLOOK
REFERENCES

LIST OF TABLES

Table 1: Overall results averaged over the Demographic Data

LIST OF FIGURES

Figure 1: DW/BI Architecture
Figure 2: Entity Classification of the AdventureWorks database
Figure 3: Hierarchies in the Sales case
Figure 4: Database after the transformation
Figure 5: Levels of SSBI
Figure 6: Gartner Magic Quadrant
Figure 7: Sales by Territory (2011-2014)
Figure 8: Average Sales Amount per Customer by Territory (2011-2014)
Figure 9: Sales by Quarters (2011-2014)
Figure 10: Direct Sales and Sales through Resellers (2011-2014)
Figure 11: Adjustment of the Ontology
Figure 12: Box Plot of the investigated Variables
Figure 13: Funnel with the Success Rate of the Experiment
Figure 14: Correlation Matrix of the Variables
Figure 15: Clustering of the relevance of Numeric Charts in connection with Avg. Perceived Usefulness
Figure 16: Clustering of total steps completed in connection with Avg. Ease of Use

LIST OF ABBREVIATIONS

3NF 3rd Normal Form
A Anticipated Use
ATM Automated Teller Machine
BI Business Intelligence
BIS Business Intelligence System
DAX Data Analysis Expressions
DM Dimensional Modelling
DQM Data Quality Management
DSA Data Staging Area
DW Data Warehouse
E Ease of Use
ERD Entity Relationship Diagram
ERP Enterprise Resource Planning
ETL Extract, Transform, Load
GDPR General Data Protection Regulation
GIGO Garbage In, Garbage Out
HOLAP Hybrid OnLine Analytical Processing
ID Identifier
IM Information Management
IT Information Technology
J Enjoyment
MDDB Multidimensional Database
MIS Management Information Systems
MOLAP Multidimensional OnLine Analytical Processing
MS Microsoft
O Perceived Characteristics of the Output
OLAP OnLine Analytical Processing
OLTP OnLine Transactional Processing
RDBMS Relational Database Management System
ROLAP Relational OnLine Analytical Processing
SDK Software Development Kit
SDLC Software Development Life Cycle
SQL Structured Query Language
SSBI Self-Service Business Intelligence
SSIS SQL Server Integration Services
SSMS SQL Server Management Studio
SST Self-Service Technology
TAM Technology Acceptance Model
TDQM Total Data Quality Management
TIST Trust in Information Systems Technology
U Perceived Usefulness

INTRODUCTION

Not only do companies nowadays compete in the efficient and effective use of human capital and other assets, but also in the use of data. Data has become an integral part of every business that is competing in the market today. There are many ways the data can be used; however, this thesis will focus on the field of Business Intelligence (BI). Effective decision making is not only important in our everyday lives, but also in the working world. Each day, decisions made in a company determine the efficiency and direction of the organization. As in our personal life, good choices are the key to organizational performance (Larson, 2009). BI traditionally describes the company-wide process of capturing, making accessible and analysing operational data, which is then made accessible to the employees through automated reporting tools. The field of Business Intelligence underwent two major changes in recent years (Alpar & Schulz, 2016). Firstly, the new breadth of available data, for example from phones, oftentimes differs in its structure, volume, and rate of growth. Secondly, the scope of BI has been extended from strategic questions to operational questions. In response to these two developments, the approach of Self-Service Business Intelligence (SSBI) was proposed. Self-service itself describes aspects of a business process that earlier required the company’s employees to be involved and are now made available to the customers, for example through the company’s website or self-checkout points in supermarkets (Oliver, Romm-Livermore & Sudweeks, 2009). Hence, it is interesting to see whether these new SSBI solutions can similarly be included in the workflow of employees in the business context, in the same way as the use of Excel is an integral part of a large number of positions.

According to Alpar and Schulz (2016), the two main goals of SSBI are, on the one hand, to make actionable information derived from multifaceted data available to casual users without having to contact any IT specialists and, on the other hand, to enable power users to accomplish their tasks more easily and faster than before. Since SSBI is still considered a rather new phenomenon, a lot of research still needs to be undertaken. As explained before, the two shifts in BI that made SSBI necessary are the availability of multifaceted data and the extension of the scope of BI to supporting operational decisions, not only strategic problems, in order to stay competitive. As SSBI is a new tool at the disposal of business users, they often do not know how to use it properly, nor does the IT personnel always have a good understanding of how to set up such a system so that it works well with the needs of the casual users. There are a variety of problems that play into this. One of them is having a proper ontology for the terms of the company. In the shipping department, the term “customer” might have a totally different meaning than in the marketing department. For example, for the shipping department, the only relevant information is the address, whereas for the marketing department the customer as a whole is of interest in order to formulate effective marketing strategies. Hence, different information needs lead to different definitions across the company (Li, Thomas & Osei-Bryson, 2017). This problem is closely linked to data governance, as these definitions need to be set up in the system itself so that the business users can easily understand which data they are dealing with. Oftentimes the metadata is simply taken from the best practices set up by the vendor who provided the system in the first place. In the case of SSBI, however, the metadata needs to be more business specific, as it is not only accessed by the IT personnel, which has a much deeper understanding of how the data is structured across the business. Therefore, during the process of setting up an SSBI system, appropriate communication is needed between the IT and the other departments in the company. The same holds true for the data warehouse design: if the data warehouse does not support specific query requirements, SSBI becomes virtually impossible. This requires the BI team to simplify the hierarchy so that the BI software can automatically generate the needed SQL queries (Weber, 2013). This thesis will provide a deeper understanding of the state of casual users and how accessible and useful they find such a system in their everyday work, especially considering that they have never had contact with such a system before. To do so, a system will be set up and, in an experiment, casual users will be exposed to a business analysis problem after a short introduction. As a study with this focus has not been done before, I hope to be able to contribute to the field.

There are a variety of goals that should be achieved with this study. One of them is to identify the tasks of SSBI and its related benefits. To reach this goal, a literature review will be conducted. Secondly, it is important to find out how to set up a proper ontology-based single version of the truth, which requires identifying the data governance requirements needed to achieve this. Another goal will be to determine which success factors need to be in place in order to have an effective SSBI solution running. Lastly, and most importantly, the main goal is to analyse the gap between SSBI and user capabilities, more specifically to find out which skills are required to properly use an SSBI system.

The thesis will be structured in such a way that the reader will first understand some of the basic concepts linked to Business Intelligence, so that SSBI will not seem as foreign. Some of these basic concepts include explanations about the architecture and its related technologies, as well as about dimensional modelling. In the next section, Self-Service BI will be covered. Some of the topics are, for example, its definition, advantages and shortcomings, as well as some challenges that need to be tackled to implement a successful SSBI solution, such as the focus on stringent data governance. After the theoretical part follow the methodology and execution of the experiment. This section also includes some limitations of the experiment. The following section will cover the results. In the discussion section, these results will be compared to other literature to either support or challenge them. Lastly, the reader will receive a conclusion and an outlook on further research in this area.

The theoretical framework that will guide this piece of work is the Technology Acceptance Model (TAM). The model is based on information sciences and describes how the system’s design influences the user’s acceptance of a computer-based system. The model proposes that when a user is presented with a new technology, there are several factors that determine how and when they will use it. The two main factors to be mentioned here are perceived usefulness and perceived ease of use. The latter explains how much a new user perceives the system to be free from effort, whereas the former deals with the user’s perception of the enhancement the system poses to their job performance (Davis, 1985). In 2000, Venkatesh and Davis proposed an extension of the original TAM. This new model also takes social influence and cognitive instrumental processes into consideration. The higher number of variables would require a much larger sample size to produce relevant results. Nonetheless, the three additional variables pose interesting questions that make sense to ask in the context of our exploratory analysis, even with a small sample size. The new variables are enjoyment, perceived characteristics of the output, and anticipated use. Additionally, it will also be asked how relevant such a tool would be for the participant’s workspace. The Unified Theory of Acceptance and Use of Technology was formulated by Venkatesh, Morris, Davis, and Davis in 2003. Here the proposed model was composed of a total of eight other models that were popular at the time. The paper showed that the model outperforms each of the models it took its influences from; however, it again has many variables that make a reliable result harder to achieve with a small sample size.

The main goal of the experiment will be to assess how well non-IT personnel can handle an SSBI system. Along the way, we might also find out how relevant such a system is for them, or whether they believe that they will use it in the future. The experiment itself will be conducted as a quasi-experiment, as it cannot be carried out in a “clean” environment due to the human involvement, which is why this kind of experiment was opted for. The results of the experiment will also be compared to other research papers on the topic of SSBI, which will help to validate or disprove the findings in the discussion section. The internal variables that will determine the user motivation are: Perceived Usefulness (U), Perceived Ease of Use (E), Perceived Characteristics of the Output (O), Enjoyment (J) and Anticipated Use (A). Together they allow observing the actual system use. The first four variables make up the cognitive response, the fifth the affective response, and the actual system use is the behavioural response (Davis, 1985a).

1. ANALYSIS OF POTENTIAL BENEFITS OF SSBI

This chapter will explore the basic concepts that make BI so beneficial for companies to employ. After the basic concepts of BI have been explored, the focus of the thesis will shift towards Self-Service Business Intelligence (SSBI): what it is, for whom it is made, its strengths and weaknesses, as well as some practices that are necessary to achieve a successful SSBI implementation and sustainable operation.

1.1 Basic Concepts of Business Intelligence

As described before, the goal of Business Intelligence is to provide the necessary information to the relevant decision makers in time, in order to support effective decision making. According to Larson (2009), effective decision making is relevant at all organizational levels; therefore, timely foundation and feedback information is needed throughout the organization. Generally, BI is often counted as part of the field of data mining. With increasing amounts of data, it became necessary to come up with new mathematical and statistical methods to model data, account for error and handle issues such as missing values and different scales or measures. The tools used in data mining stem from statistics, machine learning, computer science and mathematics. There are four typical tasks that are undertaken in data mining: predictive modelling, segmentation (data clustering), summarization and visualization (De Veaux, 2013). In today’s business world, decision-making power is becoming more and more horizontally distributed. This development is driven by technological advances, which result in flatter organizational structures. With an increasing number of knowledge workers, we also need them to have more aggregated information at their disposal. This is why Business Intelligence is so important. In terms of its architecture, Business Intelligence has been described in a more traditional IT way, that is, in terms of software, hardware, middleware, application suites, data warehouses and business transactions. However, many describe it in the form of a BI pyramid, which divides the different parts by the user groups they are targeted at. This pyramid consists of three stages. The first, topmost stage is used for executive interaction with the BIS. The second, middle stage is OnLine Analytical Processing (OLAP), which is used by middle management for ad hoc queries. Lastly, the lowest level of the pyramid consists of preformatted report generators, which are useful for operational management (Shariat & Hightower, 2007). Generally, IT systems in companies can be divided into two major groups: the aforementioned OLAP, which deals with the analysis of data, and OLTP (OnLine Transactional Processing), which provides the input data for OLAP (Schaffner, Bog, Krüger & Zeier, 2008). Chaudhuri and Dayal (1997) explain that an OLTP system usually automates bookkeeping tasks such as order entry and banking transactions, which make up the “bread and butter” of day-to-day operations. These can be of varying size, and maximizing transaction throughput is one of the key performance metrics. Data warehouses, or OLAP systems, are targeted at providing effective decision support and therefore provide a more aggregated view. They also hold a larger amount of historical data to provide more precise information. The queries that are performed here are much more complex and can access potentially millions of data entries at once that need to be joined, scanned and aggregated (Chaudhuri & Dayal, 1997).
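To make the contrast tangible, the following minimal sketch juxtaposes a narrow OLTP-style lookup with a wide OLAP-style aggregation; the in-memory SQLite table and its columns are purely illustrative assumptions, not the systems discussed in this thesis.

```python
# Illustrative contrast of an OLTP lookup vs. an OLAP aggregate on a tiny,
# assumed "sales" table held in an in-memory SQLite database.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE sales (
    order_id INTEGER, territory TEXT, order_date TEXT, amount REAL)""")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [(1, "Europe", "2013-05-02", 120.0),
     (2, "Europe", "2014-01-15", 80.0),
     (3, "North America", "2014-03-20", 200.0)])

# OLTP style: touch a single record, e.g. to process one order.
one_order = conn.execute(
    "SELECT * FROM sales WHERE order_id = ?", (2,)).fetchone()

# OLAP style: scan and aggregate many records at once for decision support,
# e.g. total sales per territory and year.
per_territory_year = conn.execute("""
    SELECT territory, substr(order_date, 1, 4) AS year, SUM(amount)
    FROM sales
    GROUP BY territory, year""").fetchall()

print(one_order)
print(per_territory_year)
```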

Humm and Wietek (2005) also provide a short summary of the history of the different IT systems that were used in companies for the purpose of decision support. From the 1960s to the 1970s, Management Information Systems (MIS) were used. They allowed for efficient data processing and integrated systems, and they gave the operators a vision of automated decision generation. In short, these systems represent the lowest level of decision support. From the 1970s to the 1980s, Decision Support Systems were introduced. Compared to the MIS, they allowed for much more complex analysis of the data, because the data stored in these systems was held in more rigid and complex structures, which allowed for statistical algorithms (i.e. what-if analyses). The most important change was the orientation towards a database. From the 1980s to the 1990s, Enterprise Information Systems (EIS) made their first appearance. This is the first time that the Enterprise Resource Planning (ERP) system was divided into an operational side and an analytical side; the EIS represented the latter. It was also the first time that multidimensional modelling was introduced; however, its use was limited to top management. In the following ten years, the first data warehouses started showing up. People started realizing the power of historical data, basically assuming that the data from yesterday can provide an appropriate representation of the future. The first data warehouses allowed for the integration of a variety of different data sources. The OLAP structure also made it possible to run interactive queries. In the early 2000s, BIS really took off. These at the time new systems make use of balanced scorecards and analytical applications and allow for data mining techniques. It needs to be mentioned that throughout the history of these systems the level of decision support was constantly increasing. All of the functionalities that were available in the aforementioned iterations of decision support systems are still available and are now commonly brought under the umbrella term that is Business Intelligence (Humm & Wietek, 2005).

1.2 Business Intelligence Architecture

As explained before, a BIS can be separated into three stages, each of which provides value to a certain user group. The core component of each Business Intelligence system, however, is the data warehouse. This section will shed light on how a data warehouse is structured.

When it comes to modelling, there are two established approaches: one proposed by Ralph Kimball, a long-term veteran in the development of data warehouses, and one proposed by Bill Inmon, who is widely considered the father of data warehousing. Each of the two provides his own definition of a data warehouse (DW). Inmon (2005) defines a DW as a subject-oriented, integrated, time-variant, and non-volatile collection of data in support of management’s decision-making process. It is subject oriented in the sense that the data is linked to the company’s business and organized by its functions. It is integrated because the data is obtained from several operational and external sources. Time-variant refers to the fact that in a data warehouse the data can be identified by specific periods; the reason for this is that it contains the history of all transactions. The data in a DW is also non-volatile, as only read functions are allowed and update operations are impossible. Kimball and Ross (2013), on the other hand, define a DW in simpler terms. For them, it is a copy of transactional data that is specifically structured for query and analysis, with the purpose of providing information to support the decision-making process in a company; this makes it a database for the sole use case of decision making and analysis.

When comparing the two definitions, it already becomes clear that the two pursue a different philosophy when it comes to the modelling of a data warehouse. Modelling in the context of a data warehouse refers to the process of setting it up (Yessad & Labiod, 2016).

Inmon's approach begins with the data model of the company. It requires the business analysts to identify the key subject areas and key entities the business uses. The entities can, for example, be customers, vendors, or profit centres and so forth. Once these have been identified, a logical model is created for each of these entities. For example, the vendor entity could be built into a model, and this entity could then have further entities under itself. This effectively means that all the data is stored in a normalized form, to avoid as much data redundancy as possible. This is precisely why the model does not have a problem when it comes to update anomalies. This is followed by building the physical model and the physical implementation of the data warehouse, which is also normalized. Here the single version of the truth comes together and is managed. When querying, this structure requires many joins across many tables; however, it makes loading the data less complex. Inmon then suggests creating data marts for each of the departments. These data marts act as small data warehouses that have more specific use case scenarios than the data warehouse itself, but get all their data from the data warehouse, which is then the only source of data for all departments. Here the data can be de-normalized to help with reporting (Inmon, 2005).

In our case, the Kimball model is of higher importance. The reason for this is that it is easier to implement, and the main advantages that the Inmon model offers, namely that update anomalies can be avoided and that it provides a true single version of the truth, are not relevant for the purpose of the experiment, because we only work with one database at a single point in time. Overall, the Kimball approach is simply the more pragmatic one of the two and makes more sense for the purpose of this work. Therefore, it will be explained in more detail in the following sections.

Figure 1: DW/BI Architecture

Source: Kimball & Ross (2013).

1.2.1 Operational Source System

The Operational Source System is oftentimes also referred to as the Online Transaction Processing system (OLTP). For these systems, we have two main priorities: performance and availability (Kimball & Ross, 2013). These OLTP systems hold the raw data that is necessary to calculate the measures embedded in a data warehouse. They cannot be queried in such broad and unexpected ways as, for example, a BI system allows (Schaffner, Bog, Krüger & Zeier, 2008). When talking about an OLTP system, there are a few problems that arise in the context of BI. By nature, an OLTP is optimized to run efficiently, allowing it to process more than one transaction at a time without them overlapping. Therefore, a relational database is the best solution for such a system. When wanting to calculate a measure, however, we are not aiming to look at one transaction at a time, but at the net value of multiple aggregated transactions over a specified period of time. Additionally, if we were to query for a complex aggregate, we might end up interfering with the business operations, as such a query takes a lot of resources in a relational database. Another worrisome topic in the context of OLTP is archiving. Since these systems’ predominant focus is to run the day-to-day operations, they oftentimes do not hold the historic information that is required in order to run a successful BI operation, where one might need to compare the performance of a measure over a span of several years. Another issue that arises with OLTPs is that they can pose significant problems for data integration, as in many cases they are divided along the different business functions (Larson, 2009).

1.2.2 Extract, Transform, Load

The extract, transform and load system includes everything between the operational system and the BI/DW environment. It is an essential piece of every BIS. As the name of the system suggests, it consists of three fundamental processes, the first one being the extract process (Kimball & Ross, 2013). From a distance, it might seem easy to migrate the data from the legacy systems to the data warehouse environment. However, once started, this is a complex task that needs to fulfill many requirements to be fully functional. In order to meet these requirements, the ETL system needs to come with a variety of different functionalities. One of the problems one might encounter when implementing a DW is that extracting the data from the operational environment requires a change in technology. This change often goes beyond switching the DBMS and also makes it compulsory to change the operating system, and even the hardware-based structure of the data (Inmon, 2005). El Akkaoui, Zimànyi, Mazón, and Trujillo (2011) describe these processes as the backbone component of the data warehouse because they provide it with the necessary integrated and reconciled data. The environment of the ETL process consists of two layers: the first is the layer where the ETL processes take place, and the second is the one where the data is stored and passes through. First, the data comes from the source stores, from where it is extracted and then moved on to the Data Staging Area (DSA). Here the data is transformed and cleaned. The cleansing of the data could be necessary for numerous reasons; there can, for example, be problems related to formats, misspellings or domain conflicts. This transformation and cleansing process adds value to the data by changing and enhancing it. Additionally, these activities can be architected in a way that they provide diagnostic metadata, which can then be used for reengineering the business process (Kimball & Ross, 2013). Once this process has concluded successfully, the data is moved on to the data warehouse, where it is made accessible to the presentation area (Vassiliadis, Simitsis & Skiadopoulos, 2002). These subsystems are critical, as the ETL system’s main goal is to hand over the data of the operational sources to the presentation area, where it can be analyzed.
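A deliberately minimal sketch of the three ETL steps is shown below; the source rows, the cleansing rules and the target table are assumptions for illustration only and have nothing to do with the SSIS packages or AdventureWorks data used later in the experiment.

```python
# Minimal, illustrative ETL sketch: extract raw rows from a source store,
# transform/clean them in a staging structure, and load them into a
# warehouse table held in an in-memory SQLite database.
import sqlite3

def extract():
    # Extract: pull raw rows from an operational source (hard-coded here).
    return [
        {"customer": " Alice ", "country": "Germnay", "amount": "120.50"},
        {"customer": "Bob",     "country": "Germany", "amount": "80.00"},
    ]

def transform(rows):
    # Transform: trim whitespace, fix a known misspelling (domain cleansing)
    # and cast the amount to a numeric type.
    fixes = {"Germnay": "Germany"}
    staged = []
    for r in rows:
        staged.append({
            "customer": r["customer"].strip(),
            "country": fixes.get(r["country"], r["country"]),
            "amount": float(r["amount"]),
        })
    return staged

def load(rows, conn):
    # Load: write the cleaned rows into the presentation-area table.
    conn.execute("CREATE TABLE IF NOT EXISTS fact_sales "
                 "(customer TEXT, country TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO fact_sales VALUES (:customer, :country, :amount)", rows)

dw = sqlite3.connect(":memory:")
load(transform(extract()), dw)
print(dw.execute(
    "SELECT country, SUM(amount) FROM fact_sales GROUP BY country").fetchall())
```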


1.2.3 Online Analytical Processing

The Online Analytical Processing system, or OLAP for short, is to the Business Intelligence system what the OLTP is to the ERP system. It represents the source of data that will then be analyzed through a variety of techniques and software. What makes OLAP so suitable for BI is that it can sustain a multidimensional data model, which is necessary to produce the desired measures in a data warehouse. OLAP supports operations such as aggregation, filtering, roll-up, drill-down and pivoting on the multidimensional view it provides (Chaudhuri, Dayal & Narasayya, 2011). Each of the dimensions is uniquely bound to a measure. A simple example of this would be the sales amount, whose dimensions can be the product category, the region, the time and so forth. Each of the dimensions can furthermore be split into more attributes. For example, the product dimension could be broken down into the category the product falls in, the average margin of the product or the year of introduction. These relationships between the attributes may in some cases also be structured in a hierarchical manner (Chaudhuri & Dayal, 1997). To put it in other words, a dimension allows applying a categorization to an aggregate measure. This measure, if spread out across dimensions, becomes a cube. For example, you might be looking for the highest selling products in 2003 by a specific region; doing this increases the dimensionality of the measure, which in this case is total sales. This example can be extended even further if we were to say that we want to see it by specific marketing campaigns, which would, in this case, give us a four-dimensional object (Larson, 2009). Another very popular way of employing measures is comparing the same dimensions over different time spans. Time is a very popular dimension and plays a particularly important role in decision support, for example in trend analysis (Chaudhuri & Dayal, 1997).
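As a small illustration of the cube idea, the following sketch (with assumed sample data) treats the sales amount as the measure and territory, year and product as dimensions; the pivot table is one two-dimensional slice of the cube, and the grouped view drills back down to the product level.

```python
# Illustrative sketch of measures and dimensions with pandas.
import pandas as pd

sales = pd.DataFrame({
    "territory": ["Europe", "Europe", "North America", "North America"],
    "year":      [2013, 2014, 2013, 2014],
    "product":   ["Bike", "Helmet", "Bike", "Bike"],
    "amount":    [120.0, 80.0, 200.0, 150.0],
})

# Roll-up: aggregate the measure over the product dimension,
# keeping territory and year as the remaining dimensions.
cube_slice = sales.pivot_table(index="territory", columns="year",
                               values="amount", aggfunc="sum")
print(cube_slice)

# Drill-down: bring the product dimension back in for more detail.
detail = sales.groupby(["territory", "year", "product"])["amount"].sum()
print(detail)
```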

From the conceptual view, multidimensional data is represented as a Cartesian product created out of the intersections of the different dimensions. Based on the concept of OLAP, there exists a variety of different implementation options. We will cover three in particular: Relational OLAP (ROLAP), Multidimensional OLAP (MOLAP) and Hybrid OLAP (HOLAP), each of which has its advantages and disadvantages for data analysis.

Multidimensional OLAP implements the multidimensional view directly and physically on the analysis data, by saving the facts, including their subtotals and derived key figures, directly to a multidimensional database (MDDB). Related facts are also referred to as cubes. MOLAP is recommended when one needs to run well-performing and complex data analysis queries on databases that are not too big. This approach requires storing more redundant data (Humm & Wietek, 2005).

Relational OLAP employs classical relational technology for saving the data. The multidimensionality is thereby created through special data modeling techniques (i.e. the star or snowflake schema). The OLAP server translates the commands into SQL queries, which are then forwarded to the relational database. After the queries have been processed, the OLAP server receives the results and prepares them for the front-end application of the user (Thurnheer, 2003). Apart from the metadata, no further data is physically saved in the OLAP server, which is one of the major differences when comparing MOLAP to ROLAP. This makes ROLAP very usable for data with a high degree of periodicity. ROLAP offers high performance, stability and operational security, also for bigger quantities of data. However, the number of operations that are enabled with a ROLAP solution is traditionally subpar to what a MOLAP solution allows for. The pre-aggregation that MOLAP allows for used to be a big advantage of MOLAP; now, however, it is becoming more and more common in ROLAP solutions as well. Generally, ROLAP is considered to be the easier to implement, more compact, and cheaper solution (Humm & Wietek, 2005).
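To make the ROLAP idea more concrete, the following sketch shows the kind of SQL a ROLAP server might generate for a request such as "sales amount by product category and year"; the small star schema and all its names are assumptions for illustration, not the schema used later in the experiment.

```python
# Sketch of a generated star-schema query run against an assumed,
# in-memory ROLAP-style relational store.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE dim_time    (time_key INTEGER PRIMARY KEY, year INTEGER);
CREATE TABLE fact_sales  (product_key INTEGER, time_key INTEGER, amount REAL);

INSERT INTO dim_product VALUES (1, 'Bikes'), (2, 'Accessories');
INSERT INTO dim_time    VALUES (1, 2013), (2, 2014);
INSERT INTO fact_sales  VALUES (1, 1, 120.0), (1, 2, 150.0), (2, 2, 80.0);
""")

# The kind of SQL an OLAP server could generate from the user's request:
generated_sql = """
SELECT p.category, t.year, SUM(f.amount) AS sales_amount
FROM fact_sales AS f
JOIN dim_product AS p ON p.product_key = f.product_key
JOIN dim_time    AS t ON t.time_key    = f.time_key
GROUP BY p.category, t.year
"""
print(conn.execute(generated_sql).fetchall())
```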

The hybrid solution is a mixture of the ROLAP and MOLAP solutions. HOLAP saves derived data that takes up a limited amount of storage space but is used relatively often directly in the multidimensional database. When accessing less frequently used but larger detailed data, the system uses the relational database structure. However, combining the two does not remove all of their disadvantages; the inefficient access to detailed data and the high redundancy still persist. Nevertheless, HOLAP offers one big advantage in the implementation process: developers do not have to settle on one technology at the beginning of development but can decide throughout the process (Thurnheer, 2003).

1.2.4 Business Intelligence Applications

BI applications can come in many different forms and variations. There are many vendors in the market and each of them has a different vision and idea of what they want to contribute. According to Kimball’s and Ross’s (2013) definition, a BI application queries the data from the DW/BI presentation area. They also mention ad hoc query tools that are used by a small population of business users. Since accessing the desired data through these tools is not always easy, most business users are dependent on prefabricated reports. This is now changing with the prevailing popularity of Self-Service Business Intelligence, which we will further explore in the coming sections.

1.3 Dimensional Data Modelling

After having explored what an OLAP cube is, it now needs to be understood how an OLAP cube functions on the conceptual level. The reason why this section was included in the thesis is that dimensional models are much easier to understand for casual users. This fact will be explored extensively in later sections.

Dimensional modeling is widely accepted as the standard for presenting analytical data. Its importance can be explained by the fact that it simultaneously addresses two requirements: its simple design allows delivering data to users in a way that is understandable, while at the same time offering high query performance. Assuming you are a business that sells products to different markets and tracks its performance over time, it is important for dimensional designers to emphasize the products, the market and time. For a dimensional designer, each of these words describes a dimension, and once they are combined, the resulting cube can be sliced and diced. The goal is to make everything as simple as possible, while not making it too simple. If a data model is overly complicated, it will only get more complicated over time and reduce query performance and understandability (Kimball & Ross, 2013).

Oftentimes dimensional models are still stored and instantiated in relational databases, even though they do not follow the principle of normalization, which is commonly executed in the third normal form (3NF). The advantage of normalization is that it avoids redundancies, thus making it higher performing for queries in the operational environment, which are by nature much smaller than queries by data analysts. A transactional interaction with the database only touches it in one place, whereas a typical query in a BI environment needs to grab data from many touchpoints. A 3NF database divides the data into many discrete entities. When a query is run by a program, it needs to navigate through all these entities, which are stored as relational tables. Each of them is connected by primary keys and foreign keys. Primary keys need to be uniquely identifiable, while, when used as foreign keys in another table, they can appear more than once. This is referred to as referential integrity, and it also entails that the deletion of a table or a change of its primary key can only be performed when the table in question does not show any connection to other entities (Shapiro, Bieniusa, Zeller & Petri, 2018). These 3NF models can be represented in different forms, such as entity relationship diagrams (ERDs). The 3NF data model is simply too complex for providing high query performance for BI applications. The unpredictability of the queries that are generated by users would make the use of such fully normalized models in the BI environment impossible because of the disastrous query performance. Not only that, but the queries that need to be constructed are very likely to bring users to their limits because of the complex navigation required (Inmon, 2005). ER modelling aims to remove redundancy in the data by inspecting even the most microscopic relationships between data elements.

As mentioned before, such a modeling technique is highly beneficial for transactional processing, as it makes the transactions very simple and deterministic (Kimball, 1997). The reason why data that is modeled dimensionally is easier to understand for humans is that we only have a limited capacity to process information. If this capability, which is said to be around seven pieces of information, is exceeded, knowledge degrades quickly; hence we humans arrange information in chunks of convenient sizes. The process of chunking is actively pursued when creating a star schema, which will be explained later. Another reason it is easier to understand is hierarchical structuring, which is the architecture of complexity. The reason star schemas are called star schemas is the appearance of their structure. It allows managing complexity by reducing the number of items that are stored at each level of the hierarchy. Each dimension in a star schema consists of one or two hierarchies and provides a way of classifying business events stored in the fact table, which in turn reduces the complexity (Sehgal & Ranga, 2016). The fact table is the main component of any dimensional model. It holds performance measurements that result from events that take place in the organization. Low-level measurement data should be stored in a single dimensional table. Each row in the fact table is a measurement event, and each measurement event is recorded at a certain level of granularity. This can be, for example, the number of sold items per sales transaction. Here it is important to note that each measurement in the fact table should have the same level of granularity. This ensures that measurements are not wrongfully double counted. A measurement event in the real world has a one-to-one translation to a single row in its related fact table. This is a foundational principle that needs to be considered throughout dimensional modelling.

The most useful and most common facts are numeric and additive. It is crucial that fact rows are additive: a query against a fact table typically retrieves thousands or even millions of rows at once; therefore, what makes the most sense to do with them is to add them up. There are also semi-additive facts and non-additive ones. Semi-additive facts are, for example, account balances, which cannot be added over a specific period of time (or the specified time dimensions), as they simply represent the status quo. Non-additive facts, on the other hand, cannot be added at all. An example of this would be unit prices; they could be printed one at a time, counted or averaged. The minimum number of foreign keys in a fact table is two; however, this is rarely the case, as a star schema with only two dimensions does not offer much usability or flexibility. The primary key of the fact table is called a composite key because it consists of all the foreign keys of the dimensions that are connected to it. It is also important to mention that fact tables always express many-to-many relationships (Kimball & Ross, 2013).
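The difference between the three kinds of facts can be made concrete with a small, assumed example; the column names and values below are illustrative only.

```python
# Additive, semi-additive and non-additive facts on assumed sample data:
# sales quantity can be summed freely, an account balance can be summed
# across accounts at one point in time but not across time, and a unit
# price should be averaged or counted rather than summed.
import pandas as pd

lines = pd.DataFrame({
    "month":      ["Jan", "Jan", "Feb"],
    "quantity":   [10, 5, 8],           # additive fact
    "unit_price": [3.0, 4.0, 3.5],      # non-additive fact
})
balances = pd.DataFrame({
    "month":   ["Jan", "Feb", "Jan", "Feb"],
    "account": ["A", "A", "B", "B"],
    "balance": [100.0, 120.0, 50.0, 60.0],  # semi-additive fact
})

print(lines["quantity"].sum())                      # total units sold: meaningful
print(lines["unit_price"].mean())                   # average unit price: meaningful
print(balances.groupby("month")["balance"].sum())   # per point in time: meaningful
# Summing balances across months (e.g. 100 + 120 for account A) would be
# meaningless, which is what makes the balance only semi-additive.
```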

After having explored the fact table, it is necessary to understand its companions, the dimension tables. They provide the context of the business process measurement event, such as the “who, what, where, when, how and why”. Compared to fact tables, dimension tables have more columns and attributes; however, there are also some cases of dimension tables that only contain a few columns. Each of the dimension tables provides a primary key, which ensures referential integrity when the dimension tables are joined with the fact table. The dimension table attributes are crucial in a Business Intelligence system. They are the source of all constraints and report labels, which are needed to make a Business Intelligence system more intelligible and serviceable. This is a semantic problem; rather than using technical abbreviations, it makes sense to use real-world words that fit within the context of the business process. The goal is to make the labels more user-friendly. The general consensus is that a data warehouse’s analytical power is proportional to the quality and the depth of the dimensional attributes. For data warehouse queries, the SQL statements virtually always take the headers of the dimension tables as the source (Kimball, 1997). When looking at operational source data, it can sometimes be unclear whether a numeric data element is a measurement or a dimensional attribute. To find out, one needs to ask whether the data element takes on lots of values and is part of calculations (which would make it a fact value) or whether it is a discretely valued description that is reasonably constant and is part of constraints and row labels (which would make it a dimensional value). Let us consider the unit price of a product; it may change so often that it is best seen as a measurement fact (Kimball & Ross, 2013). When trying to understand the relationship between ER and DM (Dimensional Modelling), we need to understand that an ER model can be broken down into many DMs. By representing multiple processes in one single ER diagram, it consists of multiple data sets, which makes it impossible for them to coexist at the same point in time; therefore, the first step in transforming an ER model into a dimensional model is to identify the single processes within the ER diagram and model each of them separately (Sehgal & Ranga, 2016).

DM has a variety of strengths that make it superior to the ER modelling technique. One of them is that it is simply more accessible to users. Understanding the relationships between the tables is much easier compared to an extensive ER diagram that spans many different processes at once. The predictable framework it offers is also much more advantageous in terms of processing. Once the user’s database constraints are applied, the database engine can engage the fact table and start creating the Cartesian product. A second strength presented by Kimball (1997) is that each of the dimensions in the DM is equivalent, which means that each of them can be seen as an equivalent entry point into the fact table. This makes it more resistant to unexpected user behavior. Not only is the user interface symmetrical, but so are the query strategies and the SQL statements that are generated against the dimensional model. A third strength that dimensional data modelling possesses is that it can be extended. This can be done in several ways. For example, the data can be changed with an alter command, which avoids having to reload the data. Columns can be added easily to the fact table in the form of measures. It also means that no reporting or query tool needs to be reprogrammed to accommodate the changes. Lastly, it means that all applications linked to the dimensional model continue to run without problems. The fourth strength of dimensional modelling is that it offers standard procedures for common modelling situations in the business world (Kimball, 1997). The most viable example of these business situations is slowly changing dimensions. In their Data Warehouse Toolkit book, Kimball and Ross (2013) describe a total of seven techniques to deal with slowly changing dimensions. In case the values never change, they suggest simply keeping the value and grouping the values accordingly. A good example of this would be a product that belongs to a certain product category: it will always belong to this product category, and it therefore makes sense to group it by that value. The third technique that is proposed is to add a new attribute that allows creating an alternate reality. Similarly to the type one way of dealing with slowly changing dimensions, the new value overwrites the old one; however, unlike the type one approach, the old value is kept in the additional attribute, which still offers the functionality of ordering according to the old or the new value and thereby preserves the history of the change that occurred over time.
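As a hedged sketch of two of the techniques described above, the following snippet (with assumed column names, not the actual dimension columns used later in the experiment) contrasts simply overwriting a changed attribute with keeping the prior value in an additional "alternate reality" attribute.

```python
# Sketch of handling a slowly changing customer attribute (territory).
import pandas as pd

dim_customer = pd.DataFrame({
    "customer_key": [1, 2],
    "name":         ["Alice", "Bob"],
    "territory":    ["Europe", "Europe"],
})

# Type 1 style: overwrite the attribute in place; the old value is lost.
scd_type1 = dim_customer.copy()
scd_type1.loc[scd_type1["customer_key"] == 1, "territory"] = "North America"

# Type 3 style ("alternate reality"): keep the previous value in an extra
# attribute before writing the new one, so both orderings stay possible.
scd_type3 = dim_customer.copy()
scd_type3["prior_territory"] = scd_type3["territory"]
scd_type3.loc[scd_type3["customer_key"] == 1, "territory"] = "North America"

print(scd_type1)
print(scd_type3)
```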

Now the question is how do we get from a relational model to a dimensional one? To illustrate this, the AdventureWorks database was used. It is the same database that is later used in the experiment in the form of a star schema.

In their comprehensive paper on how to transition from a relational model to a dimensional one, Sehgal and Ranga (2016) identified four steps: (1) classifying entities, (2) designing a high-level star schema, (3) designing a detailed fact table and (4) designing comprehensive dimension tables. In order to classify the entities, we first need to understand the three different kinds of entities that exist. Transaction entities record business events such as orders, payments, reservations and so forth. They play a crucial role in most decision support applications as they help to identify certain trends, patterns and potential problems and opportunities. An exception to this are so-called snapshot entities. In contrast to regular transaction entities, they are already aggregated as they record the status quo of different states, such as account balances. Secondly, we have component entities that explain the “who”, “what”, “when”, “where” and “how”. They are always connected through one-to-many relationships. For example, we have the SalesOrderDetail table that contains all the specifics of each line of the sale that was made, and then we have the SalesOrderHeader that is linked to multiple records of the SalesOrderDetail and usually contains the total due amount of the order and so forth. A sales transaction is usually defined by a number of components: for example, who was the customer, which product was sold, where was it sold, and the period, which defines when the product was sold. Lastly, we have classification entities. Their purpose is to classify the component entities. Usually, they are linked by a chain of one-to-many relationships (Moody & Kortink, 2003). They are functionally dependent on a component entity (directly or transitively). To further illustrate how the classification works, the following figure shows the relational model of the AdventureWorks database with its classified entities.

Figure 2: Entity Classification of the AdventureWorks database

Source: Own Work.

The second step is about designing a high-level star schema. Since the goal of a data warehouse is to aggregate business events in a meaningful way, transaction entities are always a prime candidate for the fact table. Component entities will then almost always form the dimension tables. Not all measures included in a transaction entity are relevant; therefore, the person responsible for the modeling needs to decide which measures to keep and which ones to drop. When designing the star schema, it is crucial to decide the granularity of the measures that will be included in the fact table, or in other words the level of summarization the measures go through. This can be un-summarized at the transaction level, or it can be summarized by a subset of dimensions or dimensional attributes. The lower the granularity, in other words the higher the summarization, the less storage space we need and the faster we can access the aggregates. As can be expected, summarization also comes with a disadvantage, namely the loss of detail and a resulting restriction on analysis techniques (Sehgal & Ranga, 2016). In the Kimball approach, it is of utmost importance that the dimension tables, and the views that are put on these tables, are the same in all star schemas, in order to be able to drill across to other star schemas. This does not mean that a star schema needs to be connected to all dimension tables, but rather that the product dimension, for example, should be the same across all star schemas. Commonly these are called conformed dimensions (Kimball & Ross, 2013). When designing the star schema, the responsible person needs to identify the relevant dimensions. Depending on the chosen level of granularity, some dimensions might not be suitable; additionally, explicit dimensions need to be identified.

A typical candidate for an explicit dimension is the time dimension, which is necessary to perform historical analyses. In the operational system, the longevity of records is limited, which is why they usually only get a time stamp, which then needs to be broken down in the ETL process into its components such as time of day, day, month and so forth. Step three is about designing a detailed fact table. The first thing that needs to be done to have a working fact table is defining its relationships with the surrounding dimension tables. The key that uniquely identifies each row of the fact table is always a combination of the keys of the dimension tables. The resulting key is, unlike in normal relational databases, not minimal but rather long. The most important part of defining the fact table is obviously the inclusion of meaningful measures. These can be, as previously mentioned, additive, semi-additive or non-additive (Kimball, 1997). The facts are dependent on the information that is collected by the operational system. Moody and Kortink (2003) also mention that whenever possible additive facts should be used, as they are less prone to errors in queries. Fact tables which are based on transaction entities should be broken down to the item level (Kimball & Ross, 2013). Again, looking at our example, the SalesOrderDetail table is much more suitable for the fact table than the SalesOrderHeader. If we want to analyse the effect of discounts on different products, for example, it would not be suitable to use the discount from the SalesOrderHeader, as it would result in counting the discount of the total order in which the product was involved multiple times, rather than the actual discount that was applied to the single product at the line level.
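The double-counting risk can be illustrated with a small, assumed example; the column names below are deliberately simpler than the actual AdventureWorks tables.

```python
# Why the discount should be taken at the order-line grain: repeating a
# header-level discount on every line of the same order counts it several
# times, while the line-level discount sums correctly per product.
import pandas as pd

order_lines = pd.DataFrame({
    "order_id":      [1, 1, 2],
    "product":       ["Bike", "Helmet", "Bike"],
    "line_discount": [10.0, 2.0, 5.0],
})
order_headers = pd.DataFrame({
    "order_id":        [1, 2],
    "header_discount": [12.0, 5.0],
})

# Correct: aggregate the line-level fact by product.
print(order_lines.groupby("product")["line_discount"].sum())

# Incorrect: joining the header discount to every line repeats it, so
# order 1's 12.0 discount is counted twice when summing by product.
joined = order_lines.merge(order_headers, on="order_id")
print(joined.groupby("product")["header_discount"].sum())
```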

The fourth and last step required to transform an ER model into a dimensional one is the detailed dimension table design. The first action that needs to be undertaken is defining the dimensional key. In most cases, this is done by simply taking the already existing numeric key from the component entity the dimension table is based on. However, when doing so, one needs to ask whether the key gets reused over time in the operational system or not. If so, it is highly recommended to implement a surrogate key, which is not as limited as the natural key. The same might hold true for slowly changing dimensions. The second action is collapsing the hierarchies of the relational model into a single dimension table. In our case, for example, the resulting customer dimension is a collapsed hierarchy of the Client table. The five surrounding tables (PersonDetails, Store, PhoneNumber, Email and CompanyContact) were collapsed into the single customer dimension that we have now. The following figure illustrates the sales case around which our dimensional model was built.
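A simplified sketch of the collapsing step and of the surrogate key assignment might look as follows; only two mock source tables with assumed columns are used here, not the actual five tables mentioned above.

```python
# Collapse a small hierarchy into one flat customer dimension and assign a
# warehouse-owned surrogate key that is independent of the natural key.
import pandas as pd

client = pd.DataFrame({
    "client_id": [101, 102],          # natural key from the source system
    "person_id": [1, 2],
})
person_details = pd.DataFrame({
    "person_id":  [1, 2],
    "first_name": ["Alice", "Bob"],
    "city":       ["Berlin", "Ljubljana"],
})

# Collapse: join the related tables into one flat dimension table.
dim_customer = client.merge(person_details, on="person_id")

# Surrogate key: survives key reuse and slowly changing attributes.
dim_customer.insert(0, "customer_key", range(1, len(dim_customer) + 1))
dim_customer = dim_customer.rename(columns={"client_id": "client_natural_key"})
print(dim_customer[["customer_key", "client_natural_key", "first_name", "city"]])
```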


Figure 3: Hierarchies in the Sales case

Source: Own Work.

Hierarchies are an extremely important concept in dimensional modelling and form the primary basis for deriving dimensional models from Entity Relationship models. A hierarchy in an Entity Relationship model is any sequence of entities joined together by one-to-many relationships, all aligned in the same direction. A hierarchy is called maximal if it cannot be extended upwards or downwards by including another entity. An entity is called minimal if it is at the bottom of a maximal hierarchy and maximal if it is at the top of one. Minimal entities can be easily identified as they are entities with no one-to-many relationships (“leaf” entities in hierarchical terminology), while maximal entities are entities with no many-to-one relationships (“root” entities). The outcome of the transformation is represented in Figure 4 below.
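The rule for spotting minimal and maximal entities can be sketched in a few lines of code; the one-to-many relationships below are assumed examples loosely inspired by the AdventureWorks sales case, not the full model used in the experiment.

```python
# Identify minimal ("leaf") and maximal ("root") entities from a list of
# one-to-many relationships written as (one_side, many_side).
one_to_many = [
    ("ProductCategory", "Product"),
    ("Product", "SalesOrderDetail"),
    ("Customer", "SalesOrderHeader"),
    ("SalesOrderHeader", "SalesOrderDetail"),
]

entities = {e for pair in one_to_many for e in pair}
one_sides = {one for one, _ in one_to_many}
many_sides = {many for _, many in one_to_many}

minimal = entities - one_sides   # no outgoing one-to-many relationship
maximal = entities - many_sides  # no incoming one-to-many relationship

print("minimal:", minimal)   # e.g. {'SalesOrderDetail'}
print("maximal:", maximal)   # e.g. {'ProductCategory', 'Customer'}
```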


Figure 4: Database after the transformation

Source: Own Work.

The names of the dimension tables have already been changed to conform better to the requirements of an SSBI system.

In the end, we ended up with five dimensions:

• Customer (Dim_Customer)
• Time (Dim_Time)
• Address (Dim_Address)
• Product (Dim_Product)
• Sales Details (DIM_SalesDetails)

The fact table is represented as:

• Fact Sales (FactSales)
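As a hedged sketch, the star schema listed above could be expressed in DDL roughly as follows; the key columns mirror the tables named in the lists, while every other column and measure is merely an assumed placeholder, since the actual tables used in the experiment are not reproduced here.

```python
# Assumed DDL sketch of the resulting star schema, executed against an
# in-memory SQLite database to keep the example self-contained.
import sqlite3

ddl = """
CREATE TABLE Dim_Customer     (CustomerKey INTEGER PRIMARY KEY, Name TEXT);
CREATE TABLE Dim_Time         (TimeKey INTEGER PRIMARY KEY, Date TEXT, Year INTEGER);
CREATE TABLE Dim_Address      (AddressKey INTEGER PRIMARY KEY, City TEXT, Territory TEXT);
CREATE TABLE Dim_Product      (ProductKey INTEGER PRIMARY KEY, ProductName TEXT, Category TEXT);
CREATE TABLE DIM_SalesDetails (SalesDetailsKey INTEGER PRIMARY KEY, OrderNumber TEXT);

CREATE TABLE FactSales (
    CustomerKey     INTEGER REFERENCES Dim_Customer(CustomerKey),
    TimeKey         INTEGER REFERENCES Dim_Time(TimeKey),
    AddressKey      INTEGER REFERENCES Dim_Address(AddressKey),
    ProductKey      INTEGER REFERENCES Dim_Product(ProductKey),
    SalesDetailsKey INTEGER REFERENCES DIM_SalesDetails(SalesDetailsKey),
    Quantity        INTEGER,   -- additive measure (placeholder)
    SalesAmount     REAL,      -- additive measure (placeholder)
    -- composite primary key made up of the dimension foreign keys
    PRIMARY KEY (CustomerKey, TimeKey, AddressKey, ProductKey, SalesDetailsKey)
);
"""
sqlite3.connect(":memory:").executescript(ddl)
```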


1.4 Self-Service and Self-Service Business Intelligence

The concept of self-service is by no means a new development. Common examples in which self-service has revolutionized the way consumers interact with companies are ATMs or gas stations, where it is now common practice that customers pump the gas into the car themselves. These two examples are only possible due to the emergence of new technologies that allowed making the interaction with the customer more efficient and convenient. No longer do you have to go into the bank and talk with an employee to withdraw cash, nor do you have to wait for an attendant who pumps the gasoline for you at a gas station. It is hard to say who was the first person to see self-service as an overarching concept that can be applied to many sectors; however, its popularity nowadays is stronger than ever. One of the earliest examples that can be found in regard to self-service is the “Self-Serving Store”. In 1916, Clarence Saunders applied for a patent for his invention. Before that, it was common practice that the customer gave an employee of the store a list of goods he required, which would then be brought to him. Saunders thought of a new design of the shop, which would make it easy for shoppers to find their desired products and present them to a cashier who would then give them the bill (Saunders, 1917). Almost a hundred years later, the retail sector is once again undergoing a fundamental change through self-service, this time however with the help of technology. According to Castro, Atkinson, and Ezell (2010), in 2008 North America already had 74,000 self-checkouts and Western Europe 15,000. They also claim that in North America 5% of sales in the retail sector were generated through the use of self-checkouts. Vakulenko, Hellstrom, and Oghazi (2018) argue that specifically in the retail business the move towards self-service checkouts is due to the change in the market environment. This change is characterized by the rapid increases in consumption, delivered and returned goods, urbanization and service orientation.

Considering self-service in the context of technological change, the general consensus is that it increases firm productivity while reducing the cost of service delivery at the same time. At an airport, for example, the number of passengers that check in can be increased by up to 50 percent with self-check-in options (Scherer, Wünderlich & von Wangenheim, 2015). This also allows online businesses to act much more competitively than traditional retail ones.

When it comes to the use of Self-Service Technologies (SSTs), technological anxiety, technological trust, and behavioral intention play an important role. Technological anxiety is one of the relatively often studied affective reactions towards SSTs (Liu, 2012). Technological anxiety has a long history. Technology is one of the main drivers of economies and has on many occasions been met with apprehension by the public. Sigmund Freud describes Man as a prosthetic God that is truly magnificent once he puts on all his auxiliary organs (technologies), but these often give him many troubles because they have not grown onto him (Freud, 1962). Mokyr, Vickers, and Ziebarth (2015) claim that technological anxiety can come in three different forms, of which the first two can be characterized from an optimistic viewpoint that technology will accelerate and grow. The first one is that technology substitutes labor and will effectively result in more economic inequality. The second anxiety stems from the industrial revolution and its dehumanizing effects on labour, particularly the repetitive nature of work that workers had to endure during that time. The last form that technological anxiety can take is the view that technology has become stagnant and does not offer any new value. Technological trust in its rawest form has its origins in communication theory. Here the notion that is explored is not the trust that is placed in a technology, but in an intimate object, which can be a person, place or object. In our case, an SST is an object upon which trust is bestowed. In her study to identify the one-sided trust relationship between humans and machines, Muir (1994) used an interpersonal approach and was able to find three commonalities. The first one is trust as an expectation or confidence. Trust here is seen as an expectation, requiring technology to function in a way that offers predictable outcomes. Secondly, trust is always focused on a person, place or object; therefore, trust is a state, condition or perception that is directed towards an intimate object. Lastly, trust always occurs in the presence of several characteristics of the referents, which are referred to as reliability, honesty and motivation. Lippert and Swiercz (2005) correctly identified that trust in a technology is not comparable to interpersonal trust, as it is not bi-directional; a machine does not have the capability to return trust. Lippert (2002) also came up with the Trust in Information Systems Technology (TIST) model. Here she argues that trust structures in information technology are bound by three measures: predictability, reliability, and utility. Reliability refers to the technology being available when needed during the day. Predictability refers to the user’s ability to gauge, based on previous experience, to what extent the technology will work in the future. Utility, the last measure, determines to what degree the information that was produced as the outcome of the information technology meets the user’s needs. All of these measures are then influenced by a hidden layer, namely the predilection to trust technology, which is the personal expectancy of being able to rely on the technology. Behavioral intentions, as the last piece of the puzzle regarding the use of SSTs, are indicators that represent whether the customer, or in our case the user, has defected from or stayed with the organization. Favorable behavioral intentions are, for example, those in which the user recommends the service to others (Liu, 2012).

After setting the stage with the basics of self-service, its history, and influencing factors, it will now be discussed what exactly Self-Service Business Intelligence (SSBI) is, why it is beneficial for a company and what the potential pitfalls of the technology might be. The BARC BI Survey from 2017 demonstrates the importance of SSBI in today's BI industry landscape. Together with data discovery, SSBI occupies the top two spots among the most important trends of 2017. SSBI has been among the two top-scoring trends since 2013, when it was first introduced. In 2017 it was the first BI technology to achieve 60% adoption; the trends that follow only appear after a drop of around 20 percent in use (Janoschek et al., 2017). The yearly Magic Quadrant for Business Intelligence and Analytics Platforms report by Gartner from February 2018 supports that prognosis. This report lists the strengths and weaknesses according to different use cases and critical capabilities that Analytics and Business Intelligence platforms should bring to the table in today's market. The Magic Quadrant is Gartner's largest product and is considered by many to offer great insights into the market, as Gartner is not endorsed by any of the vendors. In their market description of 2018, Howson et al. (2018) argue that visual-based data discovery, of which SSBI is a part, was and still is a big wave of disruption that took hold in 2004. This wave made the market move from IT-centric systems of record to more agile, business-centric solutions with self-service. Nowadays analytic platforms need to be easy to use while at the same time offering all the typical analytic workflow capabilities. Gartner decided to split the market into two segments: the more business-centric and the traditional BI segment. They see a clear winner in the first one, as sales figures in the traditional segment have been declining since 2015. Older vendors are now trying to catch up, whereas new market participants focus directly on the more agile approach, while also extending capabilities related to publishing and sharing as well as making their solutions more scalable. The new BI systems with a bigger focus on visual data discovery are now the mainstream and set the benchmark for Gartner when comparing the vendors to each other.

After having explored why Self-Service Business Intelligence is so important in the market and here to stay, we now need to understand what exactly it is. SSBI enables business users to explore the data that is available company-wide in a manner that provides them with the necessary decision-aiding information they can effectively act upon, without having to rely on the IT department. The users can interact with the analyses graphically to find issues that need to be addressed. For example, a manager could drill down to find out which members of the salesforce are underperforming in which region. This allows them to figure out which employees they should keep and which they should no longer keep in the organization (Heller, 2017). Before understanding what SSBI offers to an organisation and why its deployment oftentimes falters, we need to understand the users who are eventually exposed to the SSBI solution.

There exist two main types of Self-Service BI, which correspond to the two profiles that exist for the users: power users and casual users. Power users are comfortable accessing, analyzing and publishing reports on a regular basis. Casual users, on the other hand, use SSBI only to the extent of what Eckerson calls "ad hoc report navigation". At a macro level, they can further be differentiated by the fact that casual users rarely produce information but rather consume it, whereas power users actively consume and produce information (Eckerson, 2014a). Unlike the power users, casual users do not create custom reports but are dependent on predefined navigation paths. These two user groups can be further broken down. In terms of casual users, we have data consumers and data explorers. The consumer simply wants to use the pre-produced reports and might occasionally interact with them by further drilling down, whereas the data explorers sometimes need to create their own reports from scratch without coding. This can be achieved when the data discovery software provides drag-and-drop functionalities. In terms of the power users, we can identify three different subclasses: data analysts, data scientists, and statisticians (Eckerson, 2016). Data analysts are proficient in the business language and have some SQL and statistics skills. They are usually positioned in an expert role in a certain department (sales, HR, marketing etc.) to solve data-centric problems. They might for example carry out root-cause analyses or create pricing plans. For them, it is important to have far-reaching access to the corporate data (Esbensen, Guyot, Westad, & Houmoller, 2002). Data scientists usually have a computer science background and are proficient in using a variety of different programming languages. In the best case, they also have capabilities related to data mining and can create predictive models. Similarly to data scientists, statisticians are concerned with creating machine learning and predictive models; however, they do not have a computer science background and oftentimes come from fields such as econometrics, social sciences or mathematics (Karttunen, 2012). To support the activities of the business users, which include the power users and casual users, we need developers. Typical roles that need to be filled to achieve a successful SSBI system are: systems analysts, data engineers, business developers, and application developers. Systems analysts manage the organization's operational systems and can also be database administrators. Their role is hugely important because any small mistake that sneaks into the operational system can have severe consequences for the downstream analytical capabilities, which is why they need to stay in constant contact with analytical professionals. Data engineers, on the other hand, are given the role of managing the information supply chain. Traditionally they were called ETL developers or data architects. Their tasks entail mapping data flows, identifying source data, modelling databases, defining and monitoring jobs, as well as working with database developers to improve the performance of, or create and manage, databases. Thirdly we have business developers, whose job it is to build dashboards for business consumption. They can be BI developers located in the department they are concerned with, but they can also be business analysts who have become accustomed to building reports for their colleagues in a drag-and-drop manner. The last subclass of the developers is the application developer, who builds custom analytic applications with the help of APIs, various programming languages or software development kits (SDKs) (Eckerson, 2016).

The MAD framework provides an optimal way to prepare an SSBI environment that is able to fulfill the requirements of casual users. MAD stands for Monitor, Analyze and Drill to Detail.

Monitor:

This first layer, which provides the monitoring functionality, is tailored towards executives and managers. It presents graphical KPIs to the stakeholders, which contain financial information of the organization. For example, this functionality can take the form of a traffic light that switches from yellow to red once certain thresholds are surpassed (Eckerson, 2009).

Analyze:

The second layer is designed for managers and analysts. Once the light switches from yellow to red, the managers or analysts need to identify the source of the problem. The user has the option to analyze the KPIs by viewing them in different dimensions or by reviewing relevant fact tables; therefore, the tools need to be able to employ different filters. One reason a KPI could show a negative value can be an outlier that affected the value very strongly. Once this is the case, it needs to be questioned whether the KPI really represents the actual situation (Eckerson, 2009).

Drill to Detail:

This last layer contains very detailed and sensitive data; here operational queries and reports are used. Once the source of a problem has been identified in the previous layer, the operational staff or, in some cases, analysts identify which entities have been affected by the problem and whom they need to reach out to (Eckerson, 2009).
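To make the three MAD layers more concrete, the following minimal Python sketch illustrates the idea under assumed data: a monitoring function maps an aggregated KPI to a traffic-light status, an analysis function breaks the KPI down along a dimension, and a drill function returns the operational rows behind a slice. The KPI, thresholds and records are hypothetical and only serve to illustrate the layering described by Eckerson (2009).

```python
# Minimal sketch of the MAD layering (Monitor -> Analyze -> Drill to Detail).
# All names, thresholds and records are hypothetical illustrations.

ORDERS = [  # fictitious operational records (the Drill-to-Detail layer works on these)
    {"region": "Europe", "rep": "Alice", "revenue": 120_000},
    {"region": "Europe", "rep": "Bob", "revenue": 40_000},
    {"region": "Australia", "rep": "Carol", "revenue": 30_000},
]

def monitor(kpi_value, green=200_000, yellow=150_000):
    """Monitor layer: turn an aggregated KPI into a traffic-light status."""
    if kpi_value >= green:
        return "green"
    return "yellow" if kpi_value >= yellow else "red"

def analyze(records, dimension):
    """Analyze layer: break the KPI down along a chosen dimension."""
    totals = {}
    for r in records:
        totals[r[dimension]] = totals.get(r[dimension], 0) + r["revenue"]
    return totals

def drill_to_detail(records, **filters):
    """Drill-to-Detail layer: return the operational rows behind a slice."""
    return [r for r in records if all(r[k] == v for k, v in filters.items())]

total_revenue = sum(r["revenue"] for r in ORDERS)
print(monitor(total_revenue))                    # 'yellow' for these numbers
print(analyze(ORDERS, "region"))                 # revenue per region
print(drill_to_detail(ORDERS, region="Europe"))  # affected operational rows
```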

1.5 Self-Service Business Intelligence: Benefits and Shortcomings

The intention behind SSBI is to give people across the company the tools to generate reports and analytical queries built upon parameters they came up with themselves, without being dependent on the IT department. This seemingly simple process, however, is very much dependent on the individual's knowledge of the information available and its place of storage, so that they can draw decisions from it. Previously, business users had two options when they had to make a decision based on data: they could ask the BI team to construct a report for them, or they could simply guess. SSBI fully eliminates this second and quite risky option (Lennerholt & Söderström, 2018a). Besides this, it is also highly dependent on the individual's ability to use a variety of different technologies, tools, and parameters (Burke, Simpson, & Staples, 2016). The SSBI approach is not only beneficial for the end users, but also for the IT staff, who can now focus more on providing high-quality data (Kabakchieva, Stefanova, & Yordanova, 2013). It also gives the IT department more time for other value-adding activities such as developing new applications, incorporating new technologies to improve performance, or expanding the data from existing and new sources. The IT department also becomes a partner to the business users, rather than the roadblock it is oftentimes seen as. The IT can better support business needs, whereas the business users are put in a role where they are responsible for the BI capabilities; thus it also helps to cut resources (Imhoff & White, 2011). For knowledge workers in companies, gathering actionable information often takes up a lot of time. Once they have been given access to the corporate data with tools other than SSBI, however, this can introduce errors, rework and knowledge gaps. This has a lot to do with the system of names used in the database environment; for example, the column names can be inconsistent across tables, or the names might not carry any meaning for the user. According to Eckerson (2009b), the wide availability of SSBI within an organization enables companies to act proactively rather than reactively. SSBI makes it easy to access data, including fresh data, so as soon as the data arrives users can act upon it. This proactive approach can not only help to cut costs but also increase market share and support expansion in many ways (Lennerholt & Söderström, 2018).

When it comes to the implementation and actual use of SSBI, there are not only technical challenges to consider but also organizational ones. In their comprehensive literature review, Lennerholt and Söderström (2018) found two main categories of SSBI implementation challenges: access and use of data, as well as self-reliant users. It is critical to have an information-centric culture in place. In today's business environment this is basically a given, since everybody wants to be able to build upon their competitive advantage. However, adopting a new technology is not always met only by excitement but also by fear, which can be seen in the resistance of the users. This could be because they are already used to a certain BI tool, or are totally new to BI, which is now introduced in the form of SSBI. It is always hard to adapt to a new technology or a changed process. Hence effective change management is crucial, and the preparedness of the business users needs to be assessed carefully by the IT decision makers (Z. Davis, 2012).

Achieving this information-centric culture is also challenged by the fact that many users tend to use the BI tools to support their own opinion, instead of looking for errors in their own reasoning. To achieve the required level of analytical proficiency, the users need to attend trainings and educate themselves. What this boils down to is that the users should not only know how to use a BI tool, but also internalize the notion of analytics and know how to validate their analyses and compare results. It is not helpful if they can navigate the software with ease but do not know how to select and interpret the data depending on the analysis needed (Lennerholt & Söderström, 2018a). Due to differing functionalities, many enterprises employ more than one SSBI system (Kosambia, 2008). This makes it harder to maintain the desired single version of the truth. Here it is also important to carefully analyse the users' requirements. SSBI tools, especially for casual users, can offer many features and functions that can be overwhelming. Power users can embrace these features with ease; for the casual users, however, the features need to be simple, intuitive and easy to use. Many SSBI applications now take a context-specific approach: if a user is viewing a chart, the software automatically displays the editing options for charts (Eckerson, 2014a). Generally, it can be said that with more offered flexibility comes more complexity the user has to deal with, so finding the balance between complexity and flexibility for the individual user groups is crucial (Lennerholt & Söderström, 2018a).

There are two requirements for easy BI: the tools need to be easier to use for less experienced information producers when developing BI applications, and the BI tools must be easier to use for information consumers when consuming BI results. The latter differs from simply offering an easier-to-comprehend user interface, because easy BI focuses on making BI tools easier to use for all types of information workers, whereas a clean and easy interface only conveys BI results better to information consumers (Imhoff & White, 2011). Traditionally, power users would serve the casual users' needs by fulfilling their requests for certain reports; however, with the increasing volume and frequency of such requests, the casual users' needs cannot be met without SSBI in today's business environment. The goal is to make the casual users self-reliant; however, to get there, organizations need to have deep insight into what skills users require to properly operate SSBI software and its underlying data structures. This requires more reflection from the business users so that everything functions smoothly and effectively (Lennerholt & Söderström, 2018).

After having covered the challenges related to self-reliant users, we will now move on to the topic of access and use of data. In many cases the knowledge workers do not get to analyse because they spend a huge amount of time gathering the data. This can also result in costs, as these workers might feel that they are not using their skills effectively, which results in a higher turnover rate among them. Many companies who are struggling with SSBI ultimately decide to give all employees all the data. This is the wrong approach, because if you give your employees everything, they will tackle their problems in different ways, which results in multiple versions of the truth. This might be sustainable for some time, but will not work favourably for the company in the long run (Weber, 2013). The second challenge is that the SSBI needs to be developed with a good understanding of the business, which requires the BI specialists in the company to work closely with the business users during the development process. Only in the case of simple analytical requirements and business rules is it possible for the developers to produce meaningful outcomes on their own; however, this is the exception, as the organizational environment is usually more complex (Z. Davis, 2012). To circumvent this issue, it is necessary to offer high-quality data. But how exactly do you select high-quality data?

Abelló et al. (2013) propose three criteria: relevance to keywords, integrability, and owner and user ratings. The first criterion concerns data retrieval, for which the data should have tags and metadata that describe the source data. Secondly, the quality of external data is higher if we have identifiers that relate it to internal data. The last criterion proposes that data should be rated according to its completeness, correctness and freshness. Another set of challenges that impact the use and access of data are makeshift practices, which result in companies noticing that they cannot sustain or scale the SSBI system. For example, the data warehouse design needs to be impeccable. The hierarchies of the relationally stored data need to be broken up in a manner that allows all queries to be run automatically by the SSBI software. Even if a proposed star schema were to support 95% of all automatically generated queries, the other 5% might be crucial in order to create meaningful reports. It cannot be expected of the casual users that they know how to correctly navigate the relational database and write complex queries (Lennerholt & Söderström, 2018). In their "State of Self Service BI Report" from 2015, Logi Analytics found that spreadsheets are the primarily used BI tool in companies. The problem that arises through this, however, is that the use of spreadsheets creates a lack of control regarding integrity, security and distribution of the data. As more external data enters the system, the importance of integrity and security increases exponentially. This is why it is important to determine who can add and use which data, how long the data should be stored and what the minimal data quality requirements are (Alpar & Schulz, 2016). Besides this problem it is also important to avoid metadata development shortcuts. The metadata layer should be built closely following best practices; the assumption that it can be fixed in the reports where the queries are defined is wrong and leads to unsustainable SSBI. There are also certain practices that can affect the Software Development Lifecycle (SDLC); either they are too rigid or too lax. Approaches such as Scrum help the developers stay agile and quickly react to changes in requirements; however, if the goals are not properly defined, the developers might end up working against each other, as each one of them has a different viewpoint. In contrast, some very rigid SDLC or IT standards do not embrace the idea of users developing their own queries and reports in the live environment. This means that a nuanced SDLC approach is required that empowers users within their defined boundaries and provides feedback mechanisms for developers (Davis, 2012). It is important that access to data sources is accelerated and simplified. The users need to be able to integrate external and internal data sources, as well as structured and unstructured ones. In many cases users are not allowed to use data freely in a self-service way, which again makes them reliant on the reports traditionally provided by the IT personnel. There should be an organizational process that facilitates the modification of standard reports ('State of Self Service BI Report', 2015).

Being able to quickly access, manage and deploy a data warehouse does not only increase power users' productivity, but also allows them to focus on BI solutions that can be used in a self-service manner, which ultimately increases the value of BI (Lennerholt & Söderström, 2018a). To achieve effective SSBI, clear roles need to be distributed among the BI staff. Accountability plays a big part in this: if everybody is responsible for everything, it is impossible to assign responsibilities. Besides this, there are also substantial security implications related to this challenge. The access to the data should be role-specific, as the highly insightful reports that are produced have a clearly defined target audience and should not fall into the hands of a third party. Even though this process is time-consuming, for many companies that are only now realizing the need for effective data management it offers significant synergy potential to simultaneously review the existing data structures (Davis, 2012). The last challenge that needs to be tackled effectively is governance. This also includes attention to scalability and growth. Modularity plays an important role when selecting a tool, which means that strategic goals need to be defined in advance (Davis, 2012). Governance requires close teamwork between technical and management staff, as the data needs to be organized in a manner that resonates with the business. Defining the actual data terms and definitions is hard for two reasons: you need cross-organizational agreement on the terms, as well as people who are willing to review the proposed terms. Engagement is critical here (Schlesinger & Rahman, 2015). The strategic goals need to be communicated to the technical staff so that they can set the analytical requirements and goals (Weber, 2013). This also entails effective data governance policies that restrict access depending on the user and define when data is suitable for analysis. Even a slight mistake in the selection of the data and its quality criteria can lead to unforeseen problems for decision makers; therefore a solid foundation for SSBI data use is needed, which can be achieved through well-defined policies for data management and governance (Lennerholt & Söderström, 2018).

1.5.1 Levels of Self-Service Business Intelligence

SSBI can be broken down into a total of three levels, which are determined according to the tasks that can be fulfilled. Not all SSBI software supports all of the tasks represented in these levels; however, in many cases missing functionality can be added through add-ons. The levels in question are represented in Figure 5.


Figure 5: Levels of SSBI

Source: Alpar & Schulz (2016).

The first and lowest level is the usage of information. Here, users receive access to information that has been created a priori. They merely get the option to adjust the reports with a few parameters before accessing them. Clearly, this solution is tailored towards casual users who do not have any particular analytical or tool skills. This is enough to derive basic insights, but it makes it impossible for the users to find deeper, more individual insights. The previously explained MAD framework would fall under this level of SSBI; it is important to mention that it allows for "drill anywhere" functionality. Usually the users here move from a more aggregated view towards the operational view of the data (Eckerson, 2009). When introducing data that is not already part of the DW, from the users' perspective this is either intuitive (e.g. an Excel file) or not noticed at all, as the guidance of the analysis is fully prepared by BI specialists (Alpar & Schulz, 2016).

The middle level concerns the creation of information. Here users typically have access to the least aggregated data and can create information from it. The use of SQL has turned out to be too challenging for most business users, so SSBI tools allow creating a new view of the data on the fly; ironically, this means SSBI requires a high degree of standardization (Alpar & Schulz, 2016). Depending on the tool, the casual users can choose whether they want to use flat, multidimensional or relational files (Sallam et al., 2017). This allows them to be independent of the BI specialists in the data selection process. However, depending on the way the data governance is set up, ontology problems might arise, and the users might choose wrong excerpts or aggregates due to their lack of knowledge of the technological background. Some SSBI tools also give access to analytical functions that permit, for example, predictive analytics or text mining. Here again, we have the risk that the users cannot properly state their analysis requirements, as in many cases they lack the statistical background (Alpar & Schulz, 2016).


At the third and highest level, data preparation can be automated, which goes beyond the functionalities of traditional BI systems. Users can integrate new data sources with corporate data to create visualizations. Here it is important to mention that the data sources are only integrated temporarily for the desired purpose. This data can be part of an online resource or be stored locally on the user's computer (Alpar & Schulz, 2016). Issues that can arise here are the use of poor-quality data, as well as users sharing information with other users who should not have access to this data (Stodder, 2012). Additionally, the creation of presentation mashups is enabled by SSBI systems at this level. Here the user combines reusable components and data to personalize reports, dashboards and other BI views (Kobielus, 2009). The efforts that need to be undertaken to make this possible, due to the complexity of data governance, integration and modeling issues, are not visible to the end user. To circumvent these issues, Abelló et al. (2013) suggest a totally new BI architecture. This architecture builds on fusion cubes, which make it possible to temporarily combine multidimensional data from the Internet or other unstructured data sources with the corporate data cubes. The integration of such data, however, is by nature prone to quality problems.

1.5.2 Data Governance and SSBI

As previously inferred in many statements, data governance plays an important role in the effective and efficient usage of SSBI systems. Once a soda fountain is set up in a fast food restaurant, it needs to be cleaned, maintained, checked on a regular basis and refilled. This means that even though it is a self-service kiosk, it is not self-sufficient. Similarly, once an SSBI system is set up, the employees of the IT department need to monitor the data before serious issues arise that might lead business users to lose faith in the numbers. Before going deeper into the subject, however, we need to define what data governance is and why it needs to be considered whenever a new system is introduced into the enterprise.

In a white paper written by an employee at Microsoft, data governance is described as an overarching strategy that encompasses policies, processes, and people to protect data (Microsoft, 2017). It is true that due to the General Data Protection Regulation (GDPR) data governance has recently come to the foreground, especially in combination with compliance regarding the pseudonymization of personal information (Vojvodic, 2017). However, it is much more than that and covers a much wider breadth of topics. From a business administration perspective, governance describes the process of making sure that strategies are developed, checked and executed. Corporate governance, on the other hand, sets the institutional framework (OECD, 2015). From this, you can derive specific requirements and guidelines that are relevant for different departments within the organization, for example the financial department or the IT department. In this sense, data governance translates to data quality management (DQM). Data governance is understood as a framework that defines the duties and responsibilities that help to achieve good DQM (Otto & Weber, 2011). The concept of DQM can further be extended to Total Data Quality Management (TDQM). Similar to many other quality management areas, DQM here has a lifecycle that aims for continuous improvement by introducing a set of best-practice metrics (Eppler, 2006). It should also be mentioned that data management and data governance, even though they are oftentimes used interchangeably, are of a different nature. In simple terms, data management consists of policies, procedures, practices, and tools that are designed to enhance the use of data assets; data governance, conversely, is the enforcement and application of these (Meyers, 2014). Data governance basically sets the framework for how DQM can be executed according to its goals and is to be separated from the operational execution of the activities related to the DQM. Design elements of data governance are roles, duties, and responsibilities. If we were to draw a Venn diagram of these three, the responsibilities would lie entirely in the intersection of roles and duties; the responsibilities are therefore created out of the two other design elements (Otto & Weber, 2011). In many cases it is assumed that these tasks and roles are always the same in any organization, and precisely here lies the error that can arise when using a reference model. Usually these template roles do not fit the requirements of the company exactly and need to be adjusted adequately. The same holds true for the duties. But what exactly are duties and roles, and how do they create the responsibilities?

1.5.3 Data Quality and its Dimensions.

The expression "garbage in, garbage out", or GIGO for short, is very popular among IT professionals and mathematicians. The term refers to the concept that the quality of the output of an information system is only as good as the quality of the input (Sheposh, 2017). This simple reason is precisely why good data quality needs to be achieved; nobody wants to act on faulty information. Data quality issues even occur in our everyday lives. For example, the late delivery of a parcel is oftentimes blamed on the bad quality of the postal service; however, when taking a closer look, it is typically related to data issues. The error can lie with the address, originating from the address database. Data quality issues are thus moving into the public eye more than ever before. In 2003 a European directive was passed that aimed to make the large data assets owned by public bodies more reusable; therefore a large data cleaning campaign was started (Directive 2003/98/EC of the European Parliament and of the Council of 17 November 2003 on the re-use of public sector information, 2003). Similarly, whenever there is a data breach scandal in a large company, it is usually related not only to security but also to data quality, as the intrusion detection software deployed in a company relies on complete and accurate observational data about the company's IT environment. In the light of recent scandals (e.g. Facebook and Cambridge Analytica), Hoeren (2018) argues that the GDPR was motivated by them. It does not offer a complete framework of reference, which means that we are still not close to standardization and harmonization, especially across continents. So, the first reason why assuring good data quality is important is compliance: all organizations want to avoid being criminally sanctioned for inadequate data quality.

A second reason why data quality is important relates to data integration. Data that is stored in different places can be plagued by inconsistencies. This issue is especially important when it comes to integrating data for a data warehouse. In his literature review, Mouroutis (2015) found that poor data quality can have severe financial implications. It might simply translate into more work to consolidate data, hence higher costs and a loss of customers because of bad service quality. It can not only negatively impact the BI process, for example analytics and reporting, but can also cause confusion and chaos in the organization's environment. There is a total of four dimensions that make up the aspects of the umbrella term data quality and the related issues that can come up. Besides the accuracy dimension, which is the one people usually think of first and regard as the most important, there are three more: completeness, time-related dimensions and consistency.

1.5.3.1 Accuracy

In the simplest terms, accuracy answers the question of whether the data correctly reflects the real-world facts it is supposed to represent (Askham et al., 2013). In their book about data quality concepts and methodologies, Batini and Scannapieco (2006a) give a simple example to illustrate accuracy. Assume we have the values v and v'. The value v equals "John" and is correct, whereas the value v' equals "Jhn" and is thus incorrect. Another example would be the false association of v' = "John" with a position within the company that is actually held by v1 = "Bob". The first was an example of bad syntactic accuracy, whereas the second was an example of incorrect semantic accuracy. The second example is syntactically correct, as it holds a true value out of the domain of person names; however, it is semantically incorrect because it does not reflect the real value. A value can therefore be semantically wrong while still being syntactically correct. This simple example shows the difference between the two types of accuracy. It is obvious that a semantic error requires more human involvement, as it can in many cases not be detected automatically. This is generally referred to as the identification problem or record matching. This problem encompasses two main aspects: identification and decision. In terms of identification, records in different sources usually have differing identifiers. The solution here is to map identification codes (when available) or introduce matching keys in order to link the two records. After the two records have been successfully linked, a decision needs to be made as to whether they both represent the same real-world entity (Scannapieco, Missier, & Batini, 2005a).
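The difference between syntactic and semantic accuracy can be illustrated with a short, hedged Python sketch. The reference list of valid names and the position table are hypothetical; the point is only that a syntactic check can be automated against the value domain, while a semantic check needs reference data about the real-world state.

```python
import difflib

VALID_NAMES = {"John", "Bob", "Alice"}          # assumed domain of person names

def syntactically_accurate(value):
    """A value is syntactically accurate if it belongs to the value domain."""
    return value in VALID_NAMES

def closest_valid(value):
    """Suggest the nearest domain value for a syntactic error such as 'Jhn'."""
    matches = difflib.get_close_matches(value, VALID_NAMES, n=1)
    return matches[0] if matches else None

# Semantic accuracy needs a reference describing the real world:
REAL_POSITIONS = {"Sales Manager": "Bob"}       # hypothetical ground truth
stored_positions = {"Sales Manager": "John"}    # stored value is syntactically valid

print(syntactically_accurate("Jhn"), closest_valid("Jhn"))   # False, 'John'
for position, stored in stored_positions.items():
    print(position, "semantically accurate:", stored == REAL_POSITIONS[position])
```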

1.5.3.2 Completeness

Completeness is defined as "the extent to which data are of sufficient breadth, depth and scope for the task at hand" (Wang & Madnick, 1989). There exist three kinds of completeness. Schema completeness is the first one and describes the degree to which entities and attributes are missing from the schema. Population completeness, in contrast, evaluates missing values with respect to a reference population. The last one is column completeness, which is a function of the missing values in a column of a table (Batini & Scannapieco, 2006). Depending on the data model, completeness can be characterized in more detail.
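As a hedged illustration of how the column and population completeness described above could be quantified, the following Python sketch computes both for a small, made-up customer table; the reference population size is an assumption.

```python
# Hypothetical customer table with missing (None) values.
customers = [
    {"id": 1, "name": "Alice", "email": "alice@example.com"},
    {"id": 2, "name": "Bob",   "email": None},
    {"id": 3, "name": None,    "email": "carol@example.com"},
]

def column_completeness(rows, column):
    """Share of non-missing values in one column (column completeness)."""
    return sum(r[column] is not None for r in rows) / len(rows)

def population_completeness(rows, reference_population_size):
    """Share of the reference population actually represented in the table."""
    return len(rows) / reference_population_size

print(column_completeness(customers, "email"))   # approx. 0.67
print(population_completeness(customers, 5))     # assuming 5 real customers -> 0.6
```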

1.5.3.3 Time-related Dimensions

The relation between data and time is very strong, as can be seen in the literature. In terms of the time dimension, we have stable, long-term-changing and frequently-changing data (Eder & Koncilia, 1998). Stable data is, for example, the date of birth; long-term-changing data would be the address; and frequently-changing data could be, for example, the stock levels of a company. All three temporal types of data can be affected by different data quality aspects. There are also three principal time-related dimensions in terms of data quality: currency, volatility, and timeliness. Currency measures how promptly data is updated. It is usually measured by the "last updated" column that is available in many tables. Volatility measures the frequency with which data varies over time; this means that stable data would have a volatility of 0. Data types are inherently characterized by this dimension, hence there is no need for a measurement. Lastly, timeliness measures how up to date the data is and assesses whether the data is made available at the right time. For example, if you had a schedule for your lessons but only received it after the lessons had already started, it would not have been made available in a timely fashion. Measuring this dimension requires more complex measurements and will not be further explained here (Batini & Scannapieco, 2006).
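The currency and timeliness measurements mentioned above can be sketched in a few lines of Python. The "last updated" timestamps, the reference time and the deadline are invented purely for illustration; real measurements would of course be taken against the actual tables.

```python
from datetime import datetime

now = datetime(2018, 10, 1, 12, 0)                      # assumed reference time

def currency(last_updated, reference=now):
    """Currency: how long ago the value was last updated."""
    return reference - last_updated

def is_timely(available_at, needed_by):
    """Timeliness (simplified): data is timely if it arrives before it is needed."""
    return available_at <= needed_by

stock_level_updated = datetime(2018, 9, 30, 8, 0)       # frequently-changing data
print(currency(stock_level_updated))                    # 1 day, 4:00:00

schedule_published = datetime(2018, 10, 2, 9, 0)        # published after ...
lessons_start      = datetime(2018, 10, 1, 8, 0)        # ... the lessons started
print(is_timely(schedule_published, lessons_start))     # False -> not timely
```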

1.5.3.4 Consistency

Consistency answers the question of whether the data can be matched across the different data stores (Askham et al., 2013). It captures whether there is a semantic violation within a set of data. Here integrity constraints play an important role. These constraints stem from relational theory and can be divided into two types: intra- and inter-relational constraints. An example of an intra-relational constraint would be that the date of enrolment of a student must be after his or her date of birth. An example of an inter-relational constraint is that the date of enrolment stored in the school register must be the same as in the school database.
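The two kinds of integrity constraints can be illustrated with a short Python sketch; the student record and the two data stores are hypothetical and mirror the enrolment example above.

```python
# Hypothetical student record and two data stores holding the same fact.
student = {"name": "John", "birth_date": "2000-05-01", "enrolment_date": "1999-09-01"}
school_register = {"John": "2018-09-01"}   # enrolment date in the register
school_database = {"John": "2018-10-01"}   # enrolment date in the database

def intra_relational_ok(record):
    """Intra-relational constraint: enrolment must be after the date of birth."""
    return record["enrolment_date"] > record["birth_date"]   # ISO dates compare as strings

def inter_relational_ok(name):
    """Inter-relational constraint: both stores must agree on the enrolment date."""
    return school_register[name] == school_database[name]

print(intra_relational_ok(student))   # False -> violation within a single record
print(inter_relational_ok("John"))    # False -> inconsistency across data stores
```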

1.5.3.5 Other Considerations

In practice, data quality dimensions are not independent of each other but correlated. Considering one dimension more important than the others when making decisions about data quality can therefore have negative consequences (Batini & Scannapieco, 2006). However, in certain scenarios it is beneficial to give a certain dimension more priority than others. For example, when a university publishes new schedules, it is more important that the schedule is posted in time than that it is a hundred percent correct in regard to consistency, accuracy or completeness. In contrast, considering the use case of a banking application, the other three dimensions carry more importance than timeliness. Another interesting trade-off that can be observed between the dimensions is the one between consistency and completeness. The question here is: "Is it better to have a large amount of inconsistent data, or is it better to have less, but more consistent, data?" This is again, similarly to our earlier trade-off, very use-case specific. Assuming we are working with statistical analysis, it is more important to have complete data than perfectly consistent data; these inconsistencies can be tolerated or in many cases be accounted for by statistical techniques. In contrast, when we want to calculate the salaries of a company's employees, it is more important to have a consistent list than a complete one (Scannapieco, Missier & Batini, 2005). Askham et al. (2013) claim that there are also other considerations. They claim that in order to have good data quality, the data should be flexible, meaning that it is well-structured enough to be repurposed. Another consideration is whether the data offers good value in terms of cost/benefit. Lastly, it is also important to assess whether the data is usable (i.e. is it understandable, simple, relevant, maintainable and at the right level of precision?). Accessibility is also considered by many to be a data quality dimension. It measures how accessible the data is to the users and plays an especially important role in web applications where the users are geographically dispersed and need to access a network before being able to view the desired data (Batini & Scannapieco, 2006). Besides the aforementioned data quality dimensions, there also exist others that are specific to particular domains, such as the archival, statistical, and geographical or geospatial domains.

1.6 Self-Service Business Intelligence and the importance of correct Ontology

Ontology as such is a field of study in philosophy that deals with existence; to be more precise, it is concerned with which things actually do exist. In a broader sense, the study of ontology also examines how and why things exist. The etymological roots of the term lie in the Greek language and come from onto (being) and logia (the study of) (Sheposh, 2016). In his article "The Study of Ontology", Fine (1991) explains that an ontology is made up of all "those items that are, in an appropriate sense, accepted". The idea of ontology was first introduced by Greek philosophers such as Aristotle and was then not discussed for many years until it resurfaced at the end of the Renaissance, at the beginning of the 17th century (Sheposh, 2016). In philosophy, an item is in an ontology because it should be there, not because it was put there (Fine, 1991); hence it is the ontology that endorses the item, not the person. This is where the differences begin with ontology in the context of information technology.

Regarding SSBI, ontology plays an especially important role because the data stored in an enterprise system is not easily analyzed in an ad-hoc manner by knowledge workers who do not have the technical background. For them, it is not only hard to execute the process of analyzing, but in many cases they struggle to understand the data because of the technical way it is rendered within the system. Data stored in a database is always highly domain-specific, and for a decision maker it is almost impossible to have the knowledge of each of the domains required to retrieve actionable information. So, if the desired information cannot be fetched, or the information cannot be well understood, the integrity of an SSBI system is undermined. These two problems are the reason that SSBI systems were only popularized in recent years and that casual users were dependent on predefined reports provided by the IT personnel (Li, Thomas & Osei-Bryson, 2017). Decisions made on such poorly understood or irrelevant information can lead to irreparable damage and hurt the reputation of an organization. So, when a meaningful ontology is used, it provides the background to the domains and makes it easier to gain access to relevant information, which in turn improves effective decision making. The usual enterprise systems landscape consists of many different applications, each of which uses a different data model. ERP systems, for example, use very large and complex data models. The data stored in a Relational Database Management System (RDBMS) follows certain rules, such as normalization. In the case of aggregated data, the data may be stored in star or snowflake schemas. These rules were implemented with the design idea that the storage systems work in a high-performance manner, not with the main purpose of making these systems accessible to business users. Thus, Spahn, Kleb, Grimm, and Scheidl (2008) propose to implement a business-ontology-based abstraction layer that equips users with the business-relevant vocabulary, which is what the users are accustomed to and find more intuitive to use and easier to understand. Considerations that need to be addressed in this layer are technical and semantic integration. More importantly, however, it is imperative to reduce the potentially huge data model to its basic entities and relations. An ontology is a specification of a conceptualization. Concepts can be objects, events, relations and things that are necessary in order to exchange information (Girase, Patnaik, & Patil, 2016). Ontologies have always played an important role in Information Management because they offer information a common representation and semantics. Enabling a common understanding of a domain that can seamlessly be communicated between systems and people is key for effective decision making (Mikroyannidis & Theodoulidis, 2010). Knowledge needs to be formally represented to allow for a well-defined interpretation, and explicitly stated in order to make it processable by machines. It is presented in the form of concepts and their relations. Another aspect is that it is always restricted to a certain domain, or in other words a field of interest (Spahn, Kleb, Grimm & Scheidl, 2008). Let us now consider a simple example to understand ontology in action. The concept of "Address" stores all the relevant knowledge related to addresses. It can be further instantiated if we are looking for the address of a specific customer, which in this case is a specifically identifiable entity. The concept of "Address" holds many attributes, such as "Address.PostalCode". Furthermore, the concept of "Address" is related to "Customer" by a relation that has the property "Has". From this we can derive that an ontology is always constructed of concepts, attributes and properties, which altogether explain the objects in question, their relation to each other and the underlying axioms. Having a semantic relationship between the concepts makes it possible to develop machine systems that can automatically interpret and understand the meaning of the concepts used in ontologies (Martin & Maladhy, 2011). In the aforementioned abstraction layer, for example, it would be unnecessary to implement foreign-key relationships, which build the relations between tables in relational databases; instead, the relations between the entities can be explicitly stated.
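A hedged sketch of how such an ontology fragment might be represented as explicit concept–relation statements (triples) in Python is given below. The concepts "Customer" and "Address", the "Has" relation and the attribute names simply mirror the illustrative example above; the instance identifiers are hypothetical and the representation is not tied to any particular ontology language.

```python
# Tiny ontology fragment as subject-predicate-object triples.
# Concept and relation names mirror the illustrative example in the text.
triples = [
    ("Customer", "Has", "Address"),                     # explicit relation, no foreign keys
    ("Address",  "hasAttribute", "Address.PostalCode"),
    ("Address",  "hasAttribute", "Address.City"),
    ("Customer_4711", "instanceOf", "Customer"),        # hypothetical instance
    ("Customer_4711", "Has", "Address_0815"),
    ("Address_0815",  "instanceOf", "Address"),
]

def related(subject, predicate):
    """Return all objects linked to a subject by a given relation."""
    return [o for s, p, o in triples if s == subject and p == predicate]

print(related("Customer", "Has"))           # ['Address']
print(related("Customer_4711", "Has"))      # ['Address_0815']
```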

In her paper about ontology-based information extraction, Wimalasuriya (2009) explains the rationale why it can be beneficial to use multiple ontologies rather than only one. She clarifies that the conceptualization of a domain can be done differently if there is sufficient reason. There exist two categories of multiple ontologies. The first one consists of multiple ontologies that provide a different perspective. For example, we can have the two classes "Husband" and "Wife", whereas others would simply define the object property "isSpouseof". The second category is ontologies that specialize in certain subdomains. An example can be made in the domain of universities: in some countries there exist universities of applied sciences and regular universities. The two are very different from each other and are better served with more specialized ontologies that are more assertive in identifying the unique characteristics of the respective subdomains. Associated with multiple ontologies are some challenges and opportunities. One is that the use of multiple ontologies can improve recall. Together with precision, recall is a performance metric used in data mining. It describes the number of correctly identified items in comparison to the whole set in question, in the form of a percentage (Simovici & Djeraba, 2008). To prove this, Wimalasuriya (2009) further extends her "Husband and Wife" example. If you are trying to identify gay marriages, the classification into "Husband" and "Wife" is prone to have a low recall, whereas the object property "isSpouseof" would fare much better in that regard, though not when the task at hand is to identify heterosexual marriages. Additionally, as mentioned earlier, the use of multiple ontologies facilitates different perspectives on a domain; therefore, a query on the SSBI system can provide outputs from different perspectives. For example, we will be able to answer questions such as "Is person X a husband?" or "Who is person X's spouse?". To make this possible it can be necessary to translate an instance from one ontology to another. This is achieved by mapping concepts between ontologies. This process of ontology alignment needs to be undertaken when the sources should be consistent with each other but are still required to be kept separate. The process of ontology mapping is defined as transforming entities from a source ontology to the desired target ontology by using semantic relations. Mapping plays an important role when two ontologies need to be merged, or in data integration (Choi, Song, & Han, 2006). By closely observing the use of the ontology abstraction layer, special techniques could be applied to discover mappings between different ontologies (Wimalasuriya, 2009).
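Since recall is used here as the argument for multiple ontologies, a minimal Python sketch of the metric may help. The marriage records and the two extraction results are entirely made up; they only show how a coarser property such as "isSpouseof" can recover pairs that the "Husband"/"Wife" classification misses.

```python
# Hypothetical gold standard: all actual spouse pairs in a document collection.
actual_pairs = {("Ann", "Eve"), ("Tom", "Sue"), ("Max", "Jim")}

# Pairs found by an extractor based on the 'Husband'/'Wife' classes:
found_husband_wife = {("Tom", "Sue")}
# Pairs found by an extractor based on the generic 'isSpouseof' property:
found_is_spouse_of = {("Tom", "Sue"), ("Ann", "Eve"), ("Max", "Jim")}

def recall(found, actual):
    """Recall: share of the actually relevant items that were identified."""
    return len(found & actual) / len(actual)

print(recall(found_husband_wife, actual_pairs))   # approx. 0.33 -> same-sex pairs missed
print(recall(found_is_spouse_of, actual_pairs))   # 1.0
```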

In several papers, Mikroyannidis and Theodoulidis (2012) propose the Heraclitus II ontological framework. Heraclitus was an ancient Greek philosopher whose theory differed from that of his more mechanistic contemporaries, as he argued that the soul pervades all parts of the universe (Badessa, 2013). The framework sees ontologies as a semantically rich knowledge base for Information Management and proposes a methodology for the evolution and management of this knowledge base. In their survey, the authors examined existing approaches in Information Management and found some shortcomings. They found that ontology integration suffers significantly and that layering is often lacking. Additionally, it turned out that other approaches do not offer enough temporal semantics, which makes tracking the evolution of an ontology a hard task. Regarding ontology evolution, many key issues such as consistency preservation and change propagation seem not to be addressed adequately. Heraclitus II aims to eliminate these shortcomings.

Time is modeled bitemporally in this framework, meaning that it is divided into valid time and transaction time. Valid time describes when a fact is true in the modeled reality, which can lie in the past, present or future, and is usually specified by the ontology author. In contrast, the transaction time describes whether the fact is current in the knowledge base of the information system and can currently be retrieved; it is provided by the IM system and cannot be changed (Mikroyannidis & Theodoulidis, 2012). The framework can be represented in the form of a pyramid that consists of four layers, which are all connected by intra- and inter-layer ontology mappings. The lower layers describe more generic ontologies, while the higher ones are used for specific uses in the IM system. The layers are: Lexical Ontology, Domain Ontology, Data Source Ontology, and lastly Application Ontology. The Lexical Ontology layer contains domain-independent ontologies of a lexicographical nature. The ontologies included in this layer can be used to model all domains; lexicographical issues such as multilingualism are dealt with here. Ontologies for a specific domain are handled in the Domain Ontology layer; it is crucial that the data belongs to a certain domain. The collected data can come from structured and unstructured sources, such as the corporate database, APIs, or news publications. The Data Source Ontology layer specifically deals with the organization of such information (Mikroyannidis & Theodoulidis, 2006). Lastly, on top of the ontology pyramid, we have the Application Ontology layer, which basically represents the software organization of an IM system. Here it is possible to connect the software structures with ontological data, which facilitates the ontological software development process. The bitemporal modelling of time is important because it allows for a better evolution of the ontology: it allows pro-active as well as retroactive changes to be captured in the knowledge base. A retroactive change happens when the time at which the fact became true lies before the transaction time of the fact, while a pro-active change happens when the valid time is greater than the transaction time. These temporal semantics, which are used for the ontology evolution, have been part of the Temporal Information Management framework (TAU) (Mikroyannidis & Theodoulidis, 2010). Consistency preservation is another goal of ontology evolution. This is done in two ways, semantically and structurally. These techniques help to deal with issues that arise when the structure or semantics of an ontology become obsolete due to change. A change is then carried over to the other layers through the intra- and inter-ontology mappings (Mikroyannidis & Theodoulidis, 2012).
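A hedged sketch of the bitemporal bookkeeping described above: each fact carries a valid time and a transaction time, and comparing the two tells us whether a change was recorded retroactively or proactively. The fact itself and the dates are invented for illustration and are not taken from the Heraclitus II papers.

```python
from datetime import date

# A fact with bitemporal annotations (valid time vs. transaction time).
fact = {
    "statement": "Department X is renamed to Department Y",   # hypothetical fact
    "valid_time": date(2018, 1, 1),        # when it is true in the modeled reality
    "transaction_time": date(2018, 3, 15), # when it was recorded in the knowledge base
}

def change_type(f):
    """Retroactive if the fact became true before it was recorded, proactive otherwise."""
    if f["valid_time"] < f["transaction_time"]:
        return "retroactive change"
    if f["valid_time"] > f["transaction_time"]:
        return "proactive change"
    return "recorded at the time it became valid"

print(change_type(fact))   # 'retroactive change'
```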

2 METHODOLOGY, LIMITATIONS, AND EXECUTION OF THE EXPERIMENT

The theoretical framework that will guide this piece of work is the Technology Acceptance Model (TAM). The model is based on information science and describes how a system's design influences the user's acceptance of a computer-based system. The model proposes that when a user is presented with a new technology, there are several factors that determine how and when they will use it. The two main factors to be mentioned here are perceived usefulness and perceived ease of use. The latter explains how much a new user perceives the system to be free from effort, whereas the former deals with the user's perception of the enhancement the system poses to their job performance (Davis, 1985a). In 2000 Venkatesh and Davis proposed an extension of the old TAM. This new model also takes social influence and cognitive instrumental processes into consideration. The higher number of variables would require a much larger sample size to produce relevant results. Nonetheless, three additional variables pose interesting questions that make sense to ask in the context of our exploratory analysis, even with a small sample size. These variables are enjoyment, perceived characteristics of the output, and anticipated use. Additionally, it will also be asked how relevant such a tool would be for the participant's workplace. The Unified Theory of Acceptance and Use of Technology was formulated by Venkatesh, Morris, Davis, and Davis in 2003. Here the proposed model was composed of a total of eight other models that were popular at the time. The paper showed that the model outperforms each of the models it took its influences from; however, it again has many variables, which makes a reliable result harder to achieve with a small sample size.

The main goal of the experiment will be to assess how non-IT personnel can handle an SSBI system. Along the way, we might also find out how relevant such a system is for them, or whether they believe that they will use it in the future. The experiment itself will be conducted as a quasi-experiment, as it cannot be done in a "clean" environment. The reason for this is the human involvement, which is why this kind of experiment was chosen. The results of the experiment will also be compared to research papers on the topic of SSBI, which will help to validate or disprove the findings in the discussion section. The internal variables that will determine the user motivation are: Perceived Usefulness (U), Perceived Ease of Use (E), Perceived Characteristics of the Output (O), Enjoyment (J) and Anticipated Use (A). Together they allow observing the actual system use. The first four variables make up the cognitive response, the fifth the affective response, and the actual system use is the behavioral response (Davis, 1985a).
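To make the handling of these variables concrete, the sketch below shows one simple way the questionnaire responses could be aggregated per construct. The item codes, the 7-point scale and the example answers are assumptions for illustration, not the actual instrument used in the experiment.

```python
# Hypothetical 7-point Likert answers of one participant, grouped by TAM construct:
# U = usefulness, E = ease of use, O = output characteristics, J = enjoyment, A = anticipated use.
responses = {
    "U": [6, 5, 6],
    "E": [4, 5, 4],
    "O": [5, 5],
    "J": [6, 6],
    "A": [5],
}

def construct_scores(answers):
    """Average the item answers per construct to obtain one score each."""
    return {construct: sum(items) / len(items) for construct, items in answers.items()}

print(construct_scores(responses))
# e.g. {'U': 5.67, 'E': 4.33, 'O': 5.0, 'J': 6.0, 'A': 5.0}
```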

The experiment will be conducted in the form of a business problem in Microsoft Power BI. MS Power BI was chosen because most users are already familiar with the design language of MS products. The database that will be used for the experiment is the AdventureWorks sample database from Microsoft, as it provides a realistic level of detail in terms of a company ('AdventureWorks for SQL Server 2016 CTP', 2018). The tables it contains cover Sales, Purchasing, Person, Production, and HR. In order to integrate the provided data, I will use SQL Server Integration Services to connect a working prototype of the now transformed relational database to Microsoft Power BI on the MS Azure platform. Since the focus of this paper lies on the front end of the BI architecture, the data warehouse was prepared in advance as part of another project and in this case simply serves as an exemplary data source. For the data analysis problem, the users will need to follow a few steps that cover the basic functionalities of MS Power BI. To assure consistency, the participants will have 20 minutes for the experiment; this should also help to increase the response rate. Before going into the peculiarities of the experiment, it needs to be explained why MS Power BI was the solution selected for its execution.

The decision is based on Gartner's Magic Quadrant for Analytics and Business Intelligence Platforms report from 2018. Gartner evaluated several vendors with regard to self-service and found Power BI to be the best solution in that respect. This does not hold true for the whole range of Business Intelligence solutions that Microsoft offers; however, that is not important in the context of this master's thesis.


Figure 6: Gartner Magic Quadrant

Source: Howson et al. (2018).

Figure 6 shows the Magic Quadrant. It can be seen that Microsoft is positioned as the leading vendor, which has a lot to do with the Power BI solution it offers. The method used incorporates 15 product capabilities across five different use cases and is therefore rather comprehensive; explaining all of them would go beyond the scope of this thesis. Power BI offers data discovery, data preparation, augmented analytics and interactive dashboards in a single product. It can run in the cloud (which is the case here) or, since 2017, on-premises on the Power BI Report Server. Power BI is particularly interesting for many companies because it is comparatively cheap. A Pro license for a single user costs 9,99 € per month. There is also a Premium option that offers greater cloud storage and faster data refresh frequencies; it is more expensive (4.995 € per month) but does not require licenses for individual users.

In terms of customer experience, Microsoft scored the highest together with Sisense. This was achieved in particular by making information and usable insights available to a very large number of users. Another big factor that helped Microsoft become the leader is the ease of use and visual appeal that Power BI offers; for 14 % of the customers, this was the main buying criterion. Ease of use is achieved through a variety of features implemented in the solution, particularly the cloud implementation. One of its most important advantages, however, is its visual appeal, which is achieved by a design language that should be familiar to most users. The ease of use and the visual appeal made it especially suitable for this experiment. Power BI not only offers the necessary analytical capabilities to be examined, but is also easy to deploy via the Azure service and easy for first-time users to get into (Howson et al., 2018). Since the experiment should not result in hours of effort and confusion for the participants, Power BI seemed like a good solution to cut down the experiment time.

For the experiment it was necessary that the participants work with a data structure containing realistic data; hence the AdventureWorks database was used. The database is freely available and exists in different versions; the version used in the experiment is CTP3 (‘AdventureWorks for SQL Server 2016 CTP’, 2018). Out of the database, a company called AddBike was created, which serves as the example company for the experiment. It is a manufacturer of metal and bicycle components. The products are sold through two main sales channels: the online store and a variety of distributors. The customers are located across three continents: Australia, Europe, and North America. Over the period from 2011 to 2014, North America brought in 72% of overall turnover, followed by Europe with 18% and Australia with 10%, making North America by far the strongest market. To give the reader a quick overview of the contents of the database, the next paragraphs present some basic data from it.

Figure 7: Sales by Territory (2011–2014)

Source: Own Work.

When looking at the average sales amount per customer in the different territories, it also becomes apparent that customers in North America are willing to spend more than those situated in the other territories.


Figure 8: Average Sales Amount per Customer by Territory (2011–2014)

Source: Own Work.

AddBike's goal is to better target its best customers to expand its market share across the different territories, extend product availability by offering its products on an external website, and reduce the cost of sales by lowering production costs.

Figure 9: Sales by Quarters (2011–2014)

Source: Own Work.

Figure 9 not only presents the sales by quarter from 2011 to 2014 but also explains why the overall sales for 2011 and 2014 in Figure 7 are significantly lower than those for 2012 and 2013: the provided database is missing the first quarter of 2011 and the last two quarters of 2014. AddBike has two kinds of clients, individuals and stores (such as wholesalers). Together they make up 19.185 clients in total, with 18.484 and 701 clients respectively.


Figure 10: Direct Sales and Sales through Resellers (2011–2014)

Source: Own Work.

The column chart presented above (Figure 10) compares the two kinds of customers: individuals who buy directly from AddBike (direct sales) and stores that buy from AddBike. On average over the span of the 13 quarters, the sales to resellers were 15,42% stronger than the direct sales. This is a good indication that AddBike needs to revise its B2B marketing strategy.

An integral part of the execution of the experiment was getting the data warehouse to run. The first step was translating the relational database into a data warehouse by modeling it and running the ETL process. The modeling part and its related steps were already explained in the chapter on data modeling. The ETL process was done with the help of MS SQL Server Integration Services (SSIS). The whole experiment was conducted online; hence the data warehouse of AddBike also needed to be online, which was achieved via the MS Azure platform. Once the data warehouse was set up, it only needed to be uploaded to the service. This was done with MS SQL Server Management Studio (SSMS); the Azure integration in the software is quite intuitive and did not pose a challenge. I also created a new user, who was defined in the Power BI dataset as the user creating the report. To do so, I first had to create a new login and define a role with read-only authorization, which was linked to the newly created user. The last step needed to make the data warehouse presentable was adjusting the technical names so that they make ontological sense to the participants. If this had been done in SSMS via SQL statements, it would have led to conflicts with the relationships between the entities. MS Power BI offered an easy solution to this problem.
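The following is a minimal sketch of how such a read-only reporting account can be set up against the Azure-hosted warehouse with pyodbc; it is not the exact script used for the thesis, and the server, database, login names and passwords are hypothetical placeholders.

```python
# Sketch only: server, database, login and passwords are hypothetical placeholders.
import pyodbc

def connect(database: str) -> pyodbc.Connection:
    return pyodbc.connect(
        "DRIVER={ODBC Driver 17 for SQL Server};"
        "SERVER=addbike-dwh.database.windows.net;"   # hypothetical server name
        f"DATABASE={database};"
        "UID=dwh_admin;PWD=<admin-password>",        # hypothetical admin account
        autocommit=True,                             # run the DDL outside a transaction
    )

# On Azure SQL, the server-level login is created in the master database ...
with connect("master") as conn:
    conn.cursor().execute(
        "CREATE LOGIN powerbi_reader WITH PASSWORD = '<strong-password>'"
    )

# ... while the database user and its read-only authorization live in the warehouse.
with connect("AddBikeDW") as conn:                   # hypothetical warehouse name
    cur = conn.cursor()
    cur.execute("CREATE USER powerbi_reader FOR LOGIN powerbi_reader")
    # Membership in db_datareader grants SELECT on all tables, i.e. read-only access.
    cur.execute("ALTER ROLE db_datareader ADD MEMBER powerbi_reader")
```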


Figure 11: Adjustment of the Ontology

Source: Own Work.

Power BI offers a function in which synonyms can be set for the dimension and fact tables, as well as for the fields inside them. Additionally, fields can be excluded so that they are not visible to the end users. In this way, for example, all the IDs and Rowguid values, which are irrelevant for the end users, were excluded. This can only be done in the desktop version. In Figure 11 it can be seen that the dimension “Dim_Customer” has been renamed to “Customer” and that the field “isStore” is now the “Sales Channel” field. Many fields of a technical nature were also excluded, for example “Name_Style” or the surrogate key “SK_CustomerID”.
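Purely as an illustration of this mapping (the thesis did the renaming and hiding in the Power BI Desktop model view, not in code), the sketch below mirrors the same idea for the customer dimension in pandas; the column values are made up.

```python
# Illustration only: Power BI's model view was used in the thesis; this pandas
# sketch just mirrors the mapping from technical names to business terms.
import pandas as pd

dim_customer = pd.DataFrame(
    {
        "SK_CustomerID": [1, 2],                        # surrogate key (technical)
        "Name_Style": [0, 0],                           # technical flag
        "isStore": [False, True],                       # exposed as "Sales Channel"
        "CustomerName": ["Ann Smith", "Bike World Ltd."],
    }
)

# Hide purely technical fields from the end user ...
presentable = dim_customer.drop(columns=["SK_CustomerID", "Name_Style"])
# ... and translate the remaining technical names into business terms.
# (In Power BI the table itself was also renamed from Dim_Customer to Customer.)
presentable = presentable.rename(columns={"isStore": "Sales Channel"})
print(presentable)
```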

At the beginning of the PDF file in which the experiment and the tutorial were included, it was briefly explained to the users what the purpose of the experiment was and how it would be conducted. Before the users could get into the experiment, it was necessary that they first understood what Power BI is and how it works. To this end I created an example dashboard, which included all the functions (and more) the users would need during the experiment. The dashboard incorporated several visualizations in Power BI and allowed drilling through to individual report pages for country, sales channel, best customers, and colors used for products (to show a bit of variety). The example dashboard can be found in Appendix III. A few hints on how to navigate the example report were also included. During the experiment, the users were confronted with the data mart of the sales department and had to use it to create the outputs formulated in the steps. The steps were not too specific in their instructions, as that would have defeated the purpose of the experiment by turning it into a simple “click-exercise”. The experiment itself was given a time limit of 20 minutes, which was set after some initial testing with friends and colleagues; the experiment should not take too long, in order to increase the response rate of the potential participants by avoiding scaring them off.

Before the users were exposed to the actual experiment, I explained to them six basic functions that they would need in order to get through it. While doing so, I also familiarized them with some of the basic concepts, such as drill-down, and acquainted them with the vocabulary used in MS Power BI. The six functions ranged from simple to somewhat more complicated: how to add data, how to filter, how to visualize, how to add pages, how to drill down, and how to drill through.

As the online version of Power BI was used, not all functions were available; for example, users in the Power BI service cannot create hierarchies. This presented a small hurdle, because without a hierarchy a drill-down is not possible; therefore, the hierarchies had to be created in advance and pointed out during the experiment so that the users would be aware of them. Additionally, there is no option to use Power Query, which is an extraction and transformation engine that allows users to do mashups. Another thing that would have made sense in the experiment, but did not fit the format of an experiment in which the user comes into first contact with the software, was the use of custom measures. This functionality is very useful in Power BI because it lets you create new facts. However, it also has its limits; for example, it is not possible to normalize data in the online version via a formula, because the data needs to be fully imported and direct query is not supported. To create custom measures, the user needs some knowledge of DAX (Data Analysis Expressions), a library of commands and operators that can be combined to create expressions and formulas. Another thing that could not be tested within the format of the experiment was data preparation, because the online version of Power BI does not allow the user to access the dataset. It lets him or her create reports from a dataset, but does not allow changing the relationships between the entities or seeing the tabular view of the data. This also made data modeling impossible. All these functions are available in the desktop version.

Once again, it should not be necessary for the participants to download extra software onto their computers, as this would most likely bring down the response rate. By far the largest limitation was the number of participants. Since the experiment is quite time-intensive and caters to a specific target group, finding suitable participants presented a big challenge. In the end, there were 14 participants. In further research, this should definitely be done on a larger scale and with more available time. The experiment was followed up by a survey that used the TAM2 model as a framework for the questions. The whole survey can be found in Appendix II.


3 RESULTS

The results of the experiment are mixed. It seems that many people do not want to work with MS Power BI or cannot fully grasp the concept of SSBI.

Figure 12: Box Plot of the investigated Variables

Source: Own Work.

Figure 12 represents a box plot of the factors that were analyzed in the survey. The black line in the middle of each box is the median and separates the 2nd and 3rd quartiles; the dot in the boxes is the average value. The scale used in the survey was a Likert scale from 1 to 7, in which 1 always represented “Strongly Disagree” and 7 always represented “Strongly Agree”. During the survey, it was also asked how confident the participants were about their answers. The confidence ranged from 1 to 7 and was used to adjust the answer by a parameter ranging from -15 to +15 percentage points. The rationale behind this was to account for how confident the participants were about their answers, especially considering that they only had brief contact with the software as well as with the concept of BI. When looking at the box plots, it becomes clear that in many cases the median is far off the mean: where the median lies below the mean, there are outliers on the higher end of the scale, and where it lies above the mean, there are outliers on the lower end. This becomes especially clear with Perceived Ease of Use and Perceived Characteristics of the Output, and indicates that some of the participants found it particularly complicated to handle the software, while some found the outputs that Power BI is able to produce especially good.
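A minimal sketch of the confidence adjustment described above is shown below, assuming a linear mapping between the two anchor points given in the text (confidence 1 lowers the answer by 15 %, confidence 7 raises it by 15 %); the exact mapping used for the thesis data may differ.

```python
# Sketch of the confidence-based adjustment of Likert answers.
# Assumption: linear interpolation from confidence 1 (-15 %) to confidence 7 (+15 %).

def adjust_answer(answer: float, confidence: int) -> float:
    """Scale a 1-7 Likert answer by a confidence-dependent percentage."""
    adjustment = -0.15 + (confidence - 1) * (0.30 / 6)   # in the range -0.15 .. +0.15
    adjusted = answer * (1 + adjustment)
    return max(1.0, min(7.0, adjusted))                  # stay inside the 1-7 scale

if __name__ == "__main__":
    print(round(adjust_answer(5, 2), 2))   # low confidence:  5 * 0.90 = 4.5
    print(round(adjust_answer(5, 7), 2))   # high confidence: 5 * 1.15 = 5.75
```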


Figure 13: Funnel with the Success Rate of the Experiment

Source: Own Work.

Figure 13 represents the actual results of the experiment. The upper numbers are the individual steps the participants had to go through. It must be mentioned that steps three and six were the easiest, because the participants simply had to create a new page in the report, hence the high success rate. For the individual questions, please consult Appendix I. Curiously, the “Drill Down and Filter” part of the experiment has a higher success rate than the “Basics” part. All the participants managed to create a simple table listing all the products, but almost half of them failed to count the items in the table. The reason for this might be that the count function is quite hidden in the software; it requires the users to understand that the data can only be manipulated once the field of the table is in the value section of the visualization.

The second section dealt with understanding hierarchies and using filters, in this case to compare the net profit of 2013 and 2014; additionally, the Matrix visualization had to be used. Step four, in which the participants had to use the matrix with the Products hierarchy, was seemingly easy for most. Using the filter, however, was considerably harder. Here it is necessary to understand the three levels of filters Power BI offers: the visual-level filter, the page-level filter, and the report-level filter. If two different page-level filters are used for the two tables, one overwrites the other; thus the participants had to use the visual-level filter for one of the years in each of the tables.

The last section was the hardest one and contained the most points; it is also the part with the lowest success rate of 46%. The major reason for this can be assumed to be that the participants ran out of time. Step seven was basically the same as step four, but asked the participants to use net profit with the countries instead of the product categories; only half of the participants managed to do so. Step eight, which has a success rate of 50%, required the users to use a different visualization, called “Slicer”, to filter by a year. After this, the participants were asked to drill through to page two (which is part two of the experiment) for Germany. To successfully do so, the drill through needs to be activated for the visualization on page two and the “Country Name” field added as the value for the drill through. Out of the 14, only two managed to do so; the reason that so few managed this step is probably that you need to go back to page two. The last step was particularly hard. Similarly to step two, in which the participants had to use the count function, they now had to select the top five cities in Germany. To do so, the filter type needs to be changed to “Top N”, which only works for nominal values. However, just like the count function used in step two, it is hard to find.

Table 1: Overall results averaged over the Demographic Data

Age             No. of Participants    U      O      J      E      A
18-24           3                      3,48   4,42   4,52   4,50   5,24
25-34           7                      4,13   5,05   6,02   4,79   4,48
35-44           2                      3,00   1,85   1,85   2,53   3,95
55-64           2                      3,92   6,05   6,42   5,55   5,93
Total/Average   14                     3,80   4,60   5,16   4,52   4,77

Field of Work   No. of Participants    U      O      J      E      A
Accountancy     7                      4,26   4,07   5,22   4,21   4,42
Distribution    2                      2,92   4,65   5,21   3,94   4,93
Other           5                      3,51   5,33   5,05   5,18   5,21
Total/Average   14                     3,80   4,60   5,16   4,52   4,77

Source: Own Work.

The table above shows the averages of the investigated factors for the different kinds of information available about the participants. The abbreviations are as follows: Perceived Usefulness (U), Perceived Ease of Use (E), Perceived Characteristics of the Output (O), Enjoyment (J) and Anticipated Use (A). The participants were put into categories regarding their field of work and age. Half of the participants fall within the age bracket of 25-34, whereas the rest is fairly evenly split across the other age brackets. Regarding the field of work, half of the participants fall within the Accountancy category, 14 % into the Distribution category, and 36 % into the Other category. The two persons included in the Distribution category work in sales; the Other category contains people ranging from pharmaceutical consultants to human resources personnel. The overall number of participants was 14. The total averages of the different variables can also be seen here. The only one below 4 is Perceived Usefulness, which means that, in general, the participants did not believe that a SSBI solution would help them improve their performance at the workplace. The variable with the highest value is Enjoyment with 5,16.
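The group averages in Table 1 can be reproduced with a simple group-by aggregation; the sketch below shows the idea with pandas on a few hypothetical rows standing in for the 14 adjusted survey responses.

```python
# Sketch of how the averages in Table 1 are computed; the rows are hypothetical.
import pandas as pd

survey = pd.DataFrame(
    {
        "Age":   ["18-24", "25-34", "25-34", "35-44"],
        "Field": ["Accountancy", "Accountancy", "Other", "Distribution"],
        "U": [3.5, 4.1, 4.3, 3.0],
        "O": [4.4, 5.0, 5.1, 1.9],
        "J": [4.5, 6.0, 6.1, 1.8],
        "E": [4.5, 4.8, 4.8, 2.5],
        "A": [5.2, 4.5, 4.5, 4.0],
    }
)

variables = ["U", "O", "J", "E", "A"]
by_age = survey.groupby("Age")[variables].mean().round(2)      # upper half of Table 1
by_field = survey.groupby("Field")[variables].mean().round(2)  # lower half of Table 1
overall = survey[variables].mean().round(2)                    # Total/Average row

print(by_age, by_field, overall, sep="\n\n")
```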


Figure 14: Correlation Matrix of the Variables

Source: Own Work.

In this correlation matrix, it can be seen how the variables are correlated with each other. All of them are positively correlated or, in the case of a 0, not correlated at all. The latter case can, for example, be observed between Perceived Usefulness and Anticipated Use. This point in particular is hard to explain, unlike the strong positive correlation of 0,8 between Ease of Use and Enjoyment. Anticipated Use has its strongest positive correlation with Perceived Characteristics of the Output. Between Perceived Usefulness and Ease of Use there is also only a very weak positive correlation, from which it can be concluded that there is no linear relationship between the two variables. Together with Ease of Use, Enjoyment shows the highest values in terms of positive correlation. The fact that these two are also the factors with the highest overall scores and the widest spread of values supports this observation. Perceived Characteristics of the Output also shows a large spread of values and, similarly to the two previously mentioned variables, a strong positive correlation with all variables except Perceived Usefulness. This can probably be explained by the small number of participants in the sample; if the number were higher, there would be a larger variance within the values of the variables.
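The matrix in Figure 14 can be reproduced with a standard pairwise Pearson correlation; the sketch below shows this with pandas, where the column names follow the five survey variables and the few example rows are hypothetical placeholders for the 14 responses.

```python
# Sketch of the correlation matrix computation; example values are hypothetical.
import pandas as pd

responses = pd.DataFrame(
    {
        "Perceived Usefulness":   [3.5, 4.2, 2.9, 5.1, 3.8],
        "Perceived Ease of Use":  [4.0, 5.5, 3.1, 4.8, 4.3],
        "Output Characteristics": [4.4, 5.0, 2.0, 5.9, 4.6],
        "Enjoyment":              [4.6, 6.0, 1.9, 6.3, 5.0],
        "Anticipated Use":        [5.2, 4.5, 4.0, 5.8, 4.7],
    }
)

# Pairwise Pearson correlations between the five TAM2-based variables.
correlation_matrix = responses.corr(method="pearson").round(2)
print(correlation_matrix)
```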

Figure 15: Clustering of the relevance of Numeric Charts in connection with Avg. Perceived Usefulness

Source: Own Work.

One of the big advantages of Microsoft Power BI is that it makes the visualization of data more intuitive. It is therefore interesting to see whether any clusters can be found regarding the relevance of numeric charts at the workplace. On the x-axis of Figure 15 is the relevance of numeric charts for the participants in their jobs, and on the y-axis their average score from the perceived usefulness section of the survey. The 14 participants were put into three different categories: Accountancy (7), Distribution (2) and Other (5). To cluster the participants, a Power BI visualization using the k-means algorithm based on Euclidean distance was used; the algorithm minimizes the distance within each cluster and maximizes the distance between the clusters. Due to the low number of participants, the outcome of the clustering serves more as a scatter plot, as the clusters can easily be seen with the naked eye. For the participants in “Cluster 1”, the perceived usefulness is low, as is the relevance they assigned to numeric charts in their work environment; this cluster contains one participant from each field of work. The participants in the third cluster generally assign a lot of relevance to numeric charts at work. However, their values for perceived usefulness are lower than four, which means that they do not believe that SSBI would help them in their work. This is also the largest cluster, with a total of 7 participants, which is half of the sample. Cluster two describes participants for whom numeric charts are relevant at the workplace and who also believe that SSBI would help them to excel in their work; most of the participants in “Cluster 2” are in the Accountancy field.
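The clustering behind Figure 15 was done with a Power BI clustering visual; the sketch below reproduces the same idea with scikit-learn's k-means (Euclidean distance, three clusters), where the feature values are hypothetical placeholders for the 14 participants.

```python
# Sketch of the k-means clustering (Euclidean distance) behind Figure 15,
# here with scikit-learn instead of the Power BI visual; data is hypothetical.
import numpy as np
from sklearn.cluster import KMeans

# Column 0: relevance of numeric charts at work (1-7),
# Column 1: average Perceived Usefulness score from the survey (1-7).
participants = np.array(
    [
        [6.0, 4.8], [6.5, 5.1], [5.5, 4.2], [6.0, 4.5],
        [2.0, 2.1], [1.5, 2.8], [3.0, 2.5],
        [6.0, 3.2], [5.5, 3.0], [6.5, 3.6], [6.0, 3.8],
        [5.0, 3.1], [4.5, 2.9], [5.5, 3.3],
    ]
)

# k-means minimizes the within-cluster sum of squared Euclidean distances,
# which at the same time separates the cluster centres from one another.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(participants)

print("Cluster assignments:", labels)
print("Cluster centres:\n", kmeans.cluster_centers_.round(2))
```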

Figure 16: Clustering of total steps completed in connection with Avg. Ease of Use

Source: Own Work.

One very important aspect was to find out whether the participants were able to use the software. Ease of Use represents how easy working with the software was for them, whereas the total number of steps completed describes their actual system use. The age was also included here as a label for the data points; however, it does not offer any insight into why some participants rated Ease of Use lower than others. It also has to be mentioned that some of the participants went over the time limit of 20 minutes. This might explain why, especially in “Cluster 4”, some participants completed many steps of the experiment but felt that the software was not easy to work with. With five out of 14 participants, “Cluster 4” is the largest, followed by clusters one and three, which contain four participants each. All the participants in cluster four completed more than seven steps but rated the Ease of Use comparatively low. There is also one outlier, represented by “Cluster 2”. The participants in “Cluster 1” rated the Ease of Use low, which is also reflected in the steps they completed; most of them only managed to complete three steps. The next section helps to understand the findings by putting them into a theoretical context and discussing their meaning.


4 DISCUSSION

The survey was structured after the Technology Acceptance Model in its second version, as proposed by Venkatesh and Davis (2000). The experiment itself was designed in a way that used all the capabilities of the online version of MS Power BI. These functions are sufficient for casual users to build their reports in a business setting; the experiment was thus not geared towards power users, the second main user category (Eckerson, 2014). Before the participants started, it was made sure that the data quality was acceptable; the accuracy, completeness, time-related dimensions and consistency were checked (Scannapieco, Missier, & Batini, 2005). The ontology of the data fields was also transformed into natural language. It was important to make the report available to a larger crowd of people, which is why doing it online was the easiest solution to the problem; however, this did not allow the participants to try out the data preparation or mashup capabilities of MS Power BI. Generally, the capabilities of the Power BI online version are on the second level of SSBI. Power BI allows the use of parameterized reports that can be accessed by users, but also lets them use the data source in question to build their own visualizations with the data, and thus create new information. While they cannot use the DAX language, they are able to import visualizations from the Marketplace. Therefore, analytical requirements can also be met for most self-service users. However, what is missing for the online version of Power BI to reach the third level of SSBI is the ability to do mashups between locally saved data and the database (Alpar & Schulz, 2016).

During the experiment, the participants were exposed to the front-end of the BI architecture, consisting of the star schema and the BI application that came in the form of MS Power BI (Kimball & Ross, 2013). The variables inspected in the survey were Perceived Usefulness (U), Perceived Ease of Use (E), Perceived Characteristics of the Output (O), Enjoyment (J) and Anticipated Use (A). Davis (1985a) defines the benefit of the technology in question as the product it produces; in our case, the product produced by the technology would be better decision making. In this context, it is important to evaluate what Power BI brings to the table. The Gartner Magic Quadrant states that it makes visual data exploration easy to set up and intuitive (Howson et al., 2018). This also includes the process of working with the software, which is best captured by ease of use. Nevertheless, to explore data, the output produced by the software needs to be of high quality, so that it is understandable for the user and the audience it might be presented to; this is why the variable describing the characteristics of the output (O) is so important. Ease of Use should also be linked to Enjoyment: a system that is easier to use is generally more enjoyable to use, as it cuts down frustration and loss of time. When assessing anticipated use, not only perceived usefulness, an extrinsic factor, plays a role, but also enjoyment. Enjoyment describes intrinsic motivation; unlike extrinsic motivation, it does not offer any apparent reward except the activity itself. In other words, such activities are not means to an end but ends in themselves (Davis, 1985a). In Figure 14, which presents the correlation matrix, all the links just described can be observed, except those involving Perceived Usefulness. This variable should have a strong correlation with Ease of Use and the Perceived Characteristics of the Output; with the latter, the correlation is non-existent, and with Ease of Use the correlation is also basically not there, because the value is 0,10. What can be observed is that Ease of Use is strongly correlated with Enjoyment, with a value of 0,8, which is in harmony with the theoretical background. Anticipated Use also poses some problems that should not be there according to the theory: its correlation with Perceived Usefulness does not exist, as the value between the two variables is 0, and Enjoyment should be strongly correlated with it, but with a value of 0,2 this is not the case. From this, it can be concluded that there is a problem with either the questions in the survey or the small number of people included in the sample; the latter seems more probable. With such a small number of participants, the spread of the variable values is larger; additionally, the variance within the variables is smaller. This can lead to the effect seen here, in which almost everything is strongly positively correlated with everything else. The case of Perceived Usefulness is curious, as it does not offer results that make sense in the context of TAM2. For Ease of Use and the Characteristics of the Output, the box plot shows a lot of disagreement among the participants, because the median is far off the mean. In the case of Perceived Characteristics of the Output, there were a lot of answers on the higher end, whereas with Ease of Use there were a few outliers on the lower end. Again, this problem should be resolved with a larger sample size. The overall averages of the variables are all close to 4, ranging from 3,82 in the case of Perceived Usefulness to 5,12 for Enjoyment. These values already include the adjustment based on the participants' self-assessed confidence in their answers: if they answered with a 1 here, the result of the question was lowered by 15%, and if they answered with a 7, it was raised by 15%. This was done in anticipation of the participants not being sure about their answers and to avoid all the answers ending up somewhere in the middle. Even though this precaution was taken, the results still lead to the conclusion that the participants were not sure about their answers, because the overall averages, although almost all of them are above 4, are still in the area of 4, which represents indifference.

The participants were also asked about their age and the work they do. In the bracket from 35-44, especially low values can be observed: the two participants falling within this bracket gave 1,85 to the Characteristics of the Output and to Enjoyment. The highest value they have is for Anticipated Use with 3,95, which is still under 4. Overall, age did not seem to have a large effect on the results. The participants in the bracket from 55-64 assigned the highest values to the variables, but they were also only two people; with only two people it is hard to justify a conclusion here.

Since everybody has a different position, the work was grouped under umbrella terms: Accountancy, Distribution and Other. With 7 people, Accountancy was the largest group, making up half of the total sample. Their answers are all around 4, except Enjoyment, which with 5,22 has the highest value of all the groups. Probably they enjoyed playing around with the software. The same cannot be said for the Perceived Characteristics of the Output, where 4,07 is the lowest value. Especially for this group, higher values could have been expected, because some of their tasks include creating reports for the board of directors or for the controlling department. The group with the highest values overall is Other. It is hard to explain why this is so, as the work they do is usually not as reliant on reporting. It was also checked how relevant numeric charts (graphical representations of data) were for them at the workplace. Almost half of the participants assigned a high value of 6 to this. Therefore, one would think that Power BI would be a useful tool that makes the preparation of numeric charts easier and more intuitive. However, this is not the case, as can clearly be seen from the low values assigned to Perceived Usefulness. From the feedback, it emerged that many of the participants do not really require such a tool, because they do not need access to the whole database. For their everyday work, it seemed that they could easily work with Excel files being passed around among their peers. However, none of them worked for larger companies; this might also be a contributing factor that should be explored in further research. Also, as Lennerholt and Söderström (2018) describe in their paper about the implementation challenges of SSBI, users need a good understanding of the notion of analytics to use the tool; otherwise they will end up using it just to confirm their own views.

Since the participants saved their reports, it was possible to take a closer look at them and see which steps they were able to complete and which ones they failed. The questions were structured so that they went from easier to harder. What stuck out is that in step two only 7 people managed to count the products. From the feedback, it could be derived that the participants were struggling to find the small arrow in the values section. Looking at the rest of the results, it can be seen that the participants generally understood the structure of the visualizations pane, in which they can manipulate the values they add, change the filters and their form, and activate the drill-through function. When it came to drilling through, only one of the participants managed to do so. The main reason for this was probably that, in the short period of time they had, they did not get this far, or that they did not understand that the drill through needs to be activated on the visualization to which they want to drill through. Overall, they also used the right fields from the dimensions provided in the data warehouse. Hence, changing the names from technical terms to a more natural language, in order to adapt the ontology to its new context, was a necessary step to ensure the accessibility of the data. The two main problems that make this step necessary are that the information is always highly domain-specific and that the technical rendering of terms used in the data warehouse makes it hard to understand for unskilled knowledge workers (Li, Thomas & Osei-Bryson, 2017). It also proved that a dimensional model, such as a star schema, is more accessible than a relational model (Passlick, Lebek, & Breitner, 2017). In Figure 16 it was then explored whether there was a correlation between Ease of Use and the steps completed; in the best-case scenario, a linear relationship would have been visible here. Most of the participants in “Cluster 4” in particular finished more steps than the average but gave low ratings to Ease of Use; they tended to give values below 4, meaning that they were struggling with the software. It can be assumed that most of them took longer than 20 minutes and were keen on figuring out solutions to the steps of the experiment.

It is hard to compare this research to other work, as no one so far has investigated SSBI in this way. Usually the tools are assessed by experts, which is, for example, the case with the Gartner Magic Quadrant (Howson et al., 2018); in this research, however, the experts were left out and the users were directly confronted with a data analysis problem and a software none of them had used before. Also, the concept of Business Intelligence was new to most: out of the 14 participants, only two marked that they had had contact with Business Intelligence solutions before. For further research, I would suggest finding a more suitable timeframe for the experiment and conducting it in a corporate environment, perhaps within a single department. The last point is especially important considering that information is always highly domain-specific (Li, Thomas & Osei-Bryson, 2017). In the case of this research, it was impossible to find a sample with representative characteristics; hence the participants all had different backgrounds and were of varying ages (which might not be relevant). The biggest problem was the small sample size. In their paper about the forced use of self-service technologies, Liu (2012) used a total of 290 participants in their survey. Since this experiment took quite some time, the response rate was low: of the 110 people usually required for such studies, this research had only 14. To increase the response rate, some sort of incentive should be given to the participants, so that they are willing to spend their time on it.

Regarding the research question, based on the previously discussed findings, it can be said that the participants did not find a SSBI solution useful for their work, but also did not find it too hard to work with, which is a good sign for the accessibility of SSBI. A good understanding of SSBI and of analytics was not demonstrated by the participants of the experiment; hence, when introducing such a system, extensive training in analytics as well as in the SSBI tool should be undertaken.

CONCLUSION AND OUTLOOK

Self-Service Business Intelligence is supposed to make data analysis more accessible to personnel who previously did not have the opportunity to work with it, by combining it with a drag-and-drop approach. To do so, a number of things need to be ensured before making it available to the users. Business Intelligence is an umbrella term for applications, infrastructure, and tools that enable users to access their desired information and make it usable for data analysis. First of all, the users need to understand how it will improve their decision making. In today's business world, reaction time is a huge success factor. If users keep spreading decision-critical data across Excel files, their workload only increases and timely data cannot be ensured. The big advantage of SSBI is that it is directly linked to the data sources and is thus always representative of the reality the company is currently in. Additionally, SSBI allows for mashups of data. To make these data sources accessible to the users in a comprehensible way, the data source needs to be structured adequately. SSBI promises users independence from IT personnel when they need a new report. This is not entirely true, as an SSBI solution first needs to be implemented correctly. If the data source is modeled relationally, it would be hard for users to find the data they require; therefore, the data model needs to be simplified. A star schema, for example, is much easier to understand than a relational database in 3NF, which is designed for the sole purpose of transactional efficiency. Besides this, a relational database does not offer the performance required for data analysis problems, as the queries take too long to execute and, due to their length, are too complex. Even for BI power users, designing these queries is a hard and time-consuming task.

Two major approaches in data modeling are the Inmon and the Kimball approach, each with its advantages and disadvantages. In this thesis, the Kimball approach was chosen, as it is more department-centric. This also makes data governance easier, as it can be geared towards the different departments in a more effective manner. Assuming each department is a user group, the ontology of the terms can also be set in a more individual manner. Ontology in general is quite important in SSBI, even more so than in traditional BI applications. The same words derived from organizational naming conventions can have different meanings in different departments; the context is what counts. Looking at an address, for example: for the marketing department the city might be interesting for targeting advertisements in that area, whereas for the transportation/logistics department it only represents the destination or origin of the products. Ontology is also important because it takes the technicalities out of the terms; Sales.NetProfit might not be as easily understood as simply Net Profit. In terms of data governance, the data also needs to fulfill the different data quality requirements, such as completeness, consistency and so forth.

The goal of this thesis was to find out whether inexperienced casual users would be able to handle such a system and whether they would consider it meaningful, in the context of their work, as a complementary tool to Excel. To do so, the participants took part in an experiment in which they worked with MS Power BI. The experiment was done online and included a short explanation of the different functions of Power BI as well as an example dashboard with which they could play around to get some insight into the bigger picture by seeing what is possible with the tool. The experiment was especially catered towards people without an IT background. The data used for this experiment was a transformed version of the AdventureWorks sample database by Microsoft, which contained a total of 31.465 orders from customers spread across the globe. The final data warehouse was built from the perspective of the sales department. The survey that followed the experiment was designed with the Technology Acceptance Model in its second version in mind. This work is more qualitative; therefore, the model served mainly as a framework for the appropriate selection of questions. The five variables that were investigated are: Perceived Usefulness, Perceived Ease of Use, Perceived Characteristics of the Output, Enjoyment and Anticipated Use.

Before discussing the findings, it is important to mention some of the limitations of the experiment. By far the largest limitation was the small sample size of 14 participants. Each of them went through the process of doing the experiment, which took them approximately 30 minutes or more. It can be assumed that the intensity of the experiment scared many off, as it was clear from the beginning that it is quite time-consuming. As explained in the discussion chapter, the results of the survey that followed the experiment are mixed. Because of the small sample size, it was hard to derive demographic characteristics that might influence the participants' attitude towards SSBI; the same holds true for the area of work the participants are in. The only characteristic shared by all participants was that they had an academic background. Half of the sample was working as accountants, 14% in distribution and 36% in other business areas. The accountants scored indifferent values around 4 for the variables that were investigated; only the Enjoyment variable, which represents the fun they had working with the software, was higher than the others. Ease of Use seemed to be a highly debated topic, as some people thought that the software was hard to use. This also holds true for participants who managed to complete many of the steps in the experiment. From this, it can be concluded that even though some participants were able to solve the main part of the data analysis problem, it took them longer than the prescribed 20 minutes. The variable that scored the lowest overall was Perceived Usefulness, which describes the degree to which participants believe that SSBI would improve their work performance.

In other research, SSBI has been shown to improve job performance, and in the Gartner Magic Quadrant report from 2018 the authors even went as far as segmenting the market of BI solutions into a traditional segment and one that is more focused on visual exploratory data analysis (which includes SSBI). The most likely reason that this was not observed here is that the participants did not have enough time to fully understand the concept of SSBI. Out of the 14, only two had had access to a BI solution before, so the general concepts were totally new to them. Much of the feedback received indicated that the participants do not need to work with such large data structures and are fine with sticking to Excel sheets. One person from the HR department, for example, said that these sorts of problems are something she never has to work with. So at least it can be derived that SSBI has some specific use cases and should not be made accessible to all positions in an organization, because this would only be a waste of resources. On the other hand, it is interesting to see that a pharmaceutical consultant, who is not directly involved with the business, continuously works with BI solutions and considered Power BI to be really helpful. So, if only as a matter of saving resources in implementing and maintaining such a solution, it would be interesting to conduct a study in which the different use cases are further investigated and analyzed.

During the process of analyzing the results, it was also possible to look at the different steps and their success rates. In general, it can be said that the participants were able to navigate the software and understood the fields of the dimension tables and the fact table necessary to solve the data analysis problem. They also understood how to work with hierarchies, so that they could perform a drill-down. One area the participants struggled with was manipulating the data in the input fields for the visualizations; this includes, for example, changing the value from a sum to a mean or to a count of the items included in the field. The same held true for filtering actions. The last section of the experiment dealt with drilling through from one page of the report to another. Whereas with a drill-down/up you move vertically, a drill-through lets you move horizontally between two items via an established link; in other words, you move from one data view to another that offers more insight into the KPI in question. Here the success rate was the lowest. Two reasons come to mind: many of the participants ran out of time, or they did not grasp the concept of a drill-through. When implementing a SSBI solution, extensive training must be done in advance, not only regarding navigating the software through simple click-and-follow tutorials; the concepts related to BI and data analysis also need to be explained in a way that helps the users make better decisions. Otherwise, they have the software but cannot use it to its full capabilities, which will result in it becoming obsolete. This goes hand in hand with effective change management, which helps to circumvent resistance to adoption among the employees.

For further research, it should be made sure that the sample is adequate, so that a proper statistical analysis can be performed. Also, the sample should be placed in a specific scenario, such as a single department. Since such an experiment is quite time-consuming for the participants, it is suggested to offer them an incentive in order to increase the response rate. To better assess what went wrong in the experiment, it would also be interesting to conduct a few interviews with participants who produced interesting results in connection with Ease of Use, in order to better understand why they were not able to complete some steps.

REFERENCES

1. Abelló, A., Darmont, J., Etcheverry, L., Golfarelli, M., López, M., Norberto, J., … Vossen, G. (2013). Fusion cubes: towards self-service business intelligence. https://doi.org/10.4018/jdwm.2013040104
2. AdventureWorks for SQL Server 2016 CTP. (2018, April 13). Retrieved 13 April 2018, from https://www.microsoft.com/en-us/download/details.aspx?id=49502
3. Alpar, P., & Schulz, M. (2016). Self-Service Business Intelligence. Business & Information Systems Engineering, 58(2), 151–155. https://doi.org/10.1007/s12599-016-0424-6
4. Askham, N., Cook, D., Doyle, M., Fereday, H., Gibson, M., Landbeck, U., … Schwarzenbach, J. (2013). The Six Primary Dimensions for Data Quality Assessment. DAMA UK, 17.
5. Badessa, R. (2013). Heraclitus of Ephesus. Salem Press Biographical Encyclopedia.
6. Batini, C., & Scannapieca, M. (2006). Data quality: concepts, methodologies and techniques. Berlin; New York: Springer.
7. Burke, M., Simpson, W. & Staples, S. (2016). The Cure for Ailing Self-Service Business Intelligence. Business Intelligence Journal, 21(3). Retrieved from https://www.osti.gov/pages/biblio/1367536-cure-ailing-self-service-business-intelligence
8. Castro, D., Atkinson, R. D., & Ezell, S. J. (2010). Embracing the Self-Service Economy. SSRN Electronic Journal. https://doi.org/10.2139/ssrn.1590982
9. Chaudhuri, S. & Dayal, U. (1997). An Overview of Data Warehousing and OLAP Technology. SIGMOD Rec., 26(1), 65–74. https://doi.org/10.1145/248603.248616
10. Chaudhuri, S., Dayal, U. & Narasayya, V. (2011). An Overview of Business Intelligence Technology. Communications of the ACM, 54(8), 88–98. https://doi.org/10.1145/1978542.1978562
11. Choi, N., Song, I.-Y. & Han, H. (2006). A Survey on Ontology Mapping. SIGMOD Rec., 35(3), 34–41. https://doi.org/10.1145/1168092.1168097
12. Davis, F. D. (1985a). A technology acceptance model for empirically testing new end-user information systems: theory and results (Thesis). Massachusetts Institute of Technology. Retrieved from http://dspace.mit.edu/handle/1721.1/15192
13. Davis, F. D. (1985b). A technology acceptance model for empirically testing new end-user information systems: theory and results (Thesis). Massachusetts Institute of Technology. Retrieved from http://dspace.mit.edu/handle/1721.1/15192
14. Davis, Z. (2012). Top Five Considerations for Self-Service BI Dashboards. Ziff Davis. Retrieved from http://www.prostrategy.ie/wp-content/uploads/2015/09/Top-Five-Considerations-for-Self-Services-BI-Dashboards.pdf
15. De Veaux, R. (2013). Data mining. Salem Press Encyclopedia of Science.
16. Directive 2003/98/EC of the European Parliament and of the Council of 17 November 2003 on the re-use of public sector information, Pub. L. No. 32003L0098, OJ L 345 (2003). Retrieved from http://data.europa.eu/eli/dir/2003/98/oj/eng

17. Eckerson, W. W. (2009a). Beyond Reporting: Delivering Insights with Next-Generation Analytics, 32.
18. Eckerson, W. W. (2009b). TDWI Checklist Report: Self-Service BI, 8.
19. Eckerson, W. W. (2014a). Five Steps to delivering Self-Service BI to Everyone. TechTarget CustomMedia.
20. Eckerson, W. W. (2014b). Five Steps to delivering Self-Service BI to Everyone (p. 6). Newton, MA 02466: TechTarget CustomMedia.
21. Eckerson, W. W. (2016). A Reference Architecture for Self-Service Analytics. Retrieved from http://go.timextender.com/hubfs/EckersonGroup__ReferenceArchitectureforSelf-Service_Final_083116.pdf?_ga=2.170039904.1531123024.1509345044-1402613734.1507804061&t=1525989643382
22. Eder, J., & Koncilia, C. (1998). Evolution of dimension data in temporal data warehouses.
23. El Akkaoui, Z., Zimànyi, E., Mazón, J.-N. & Trujillo, J. (2011). A Model-driven Framework for ETL Process Development. In Proceedings of the ACM 14th International Workshop on Data Warehousing and OLAP (pp. 45–52). New York, NY, USA: ACM. https://doi.org/10.1145/2064676.2064685
24. Eppler, M. J. (2006). Managing Information Quality: Increasing the Value of Information in Knowledge-intensive Products and Processes (2nd ed.). Berlin Heidelberg: Springer-Verlag.
25. Esbensen, K. H., Guyot, D., Westad, F. & Houmoller, L. P. (2002). Multivariate Data Analysis: In Practice: an Introduction to Multivariate Data Analysis and Experimental Design. Multivariate Data Analysis.
26. Fine, K. (1991). The Study of Ontology. Noûs, 25(3), 263–294. https://doi.org/10.2307/2215504
27. Freud, S., & Sigmund Freud Collection (Library of Congress). (1962). Civilization and its discontents. New York: W.W. Norton.
28. Girase, A. V., Patnaik, G. K. & Patil, S. S. (2016). Developing knowledge driven ontology for decision making. In 2016 International Conference on Signal Processing, Communication, Power and Embedded System (SCOPES) (pp. 99–105). https://doi.org/10.1109/SCOPES.2016.7955610
29. Heller, M. (2017). 10 hot data analytics trends and 5 going cold: Big data, machine learning, data science - the data analytics revolution is evolving rapidly. CIO, 9–15.
30. Hoeren, T. (2018). Big Data and Data Quality. In T. Hoeren & B. Kolany-Raiser (Eds.), Big Data in Context (pp. 1–12). Cham: Springer International Publishing. https://doi.org/10.1007/978-3-319-62461-7_1
31. Howson, C., Sallam, R., Richardson, J. L., Tapadinhas, J., Idoin, C. J. & Woodward, A. (2018). Magic Quadrant for Analytics and Business Intelligence Platforms. Kentucky, USA: Gartner.
32. Humm, B., & Wietek, F. (2005). Architektur von Data Warehouses und Business Intelligence Systemen. Informatik-Spektrum, 28(1), 3–14. https://doi.org/10.1007/s00287-004-0450-5
33. Imhoff, C., & White, C. (2011). Self-Service Business Intelligence: Empowering Users to Generate Insights. TDWI, 40.
34. Inmon, W. H. (2005). Building the Data Warehouse (4th ed.). Wiley India Pvt. Limited.
35. Janoschek, N., Bange, C., Fuchs, C., Seidler, L., Tischler, R., Lartigue, E., … von Simson, C. (2017). The BI Survey 17 – The Results. BARC, 68.

36. Lippert, S. K. (2002). Technology Trust: An Inventory of Trust Infrastructures for Government and Commercial Information Systems In Support of Electronic Commerce. https://doi.org/10.28945/2527
37. Kabakchieva, D., Stefanova, K. & Yordanova, S. (2013). Latest Trends in Business Intelligence System Development. Presented at the 3rd International Conference on Application of Information and Communication Technology and Statistics in Economy and Education, Sofia.
38. Karttunen, S. (2012). Cultural policy indicators: reflections on the role of official statisticians in the politics of data collection. Cultural Trends, 21(2), 133–147. https://doi.org/10.1080/09548963.2012.674753
39. Kimball, R. (1997, August 2). A Dimensional Modeling Manifesto. Retrieved 3 July 2018, from https://www.kimballgroup.com/1997/08/a-dimensional-modeling-manifesto/
40. Kimball, R. & Ross, M. (2013). The Data Warehouse Toolkit: The Definitive Guide to Dimensional Modeling. John Wiley & Sons.
41. Kobielus, J. (2009). Mighty Mashups: Do-It-Yourself Business Intelligence For The New Economy. Forrester Research, Inc., 20.
42. Kosambia, S. (2008). Business Intelligence The Self-Service Way. DM Review, 18(7), 20–22.
43. Larson, B. (2009). Delivering business intelligence with Microsoft SQL server 2008. McGraw-Hill: New York.
44. Lennerholt, C., & Söderström, E. (2018a). Implementation Challenges of Self Service Business Intelligence: A Literature Review. Proceedings of the 51st Hawaii International Conference on System Sciences, 9.
45. Lennerholt, C. & Söderström, E. (2018b). Implementation Challenges of Self Service Business Intelligence: A Literature Review. In Proceedings of the 51st Hawaii International Conference on System Sciences (pp. 5055–5063). Hawaii.
46. Li, Y., Thomas, M. A. & Osei-Bryson, K.-M. (2017). Ontology-based data mining model management for self-service knowledge discovery. Information Systems Frontiers, 19(4), 925–943. https://doi.org/10.1007/s10796-016-9637-y
47. Lippert, S. K., & Michael Swiercz, P. (2005). Human resource information systems (HRIS) and technology trust. Journal of Information Science, 31(5), 340–353. https://doi.org/10.1177/0165551505055399
48. Liu, S. (2012). The impact of forced use on customer adoption of self-service technologies. Computers in Human Behavior, 28(4), 1194–1201. https://doi.org/10.1016/j.chb.2012.02.002
49. Martin, A. & Maladhy, D. (2011). A Framework for Business Intelligence Application Using Ontological Classification. International Journal of Engineering Science and Technology, 3(2), 9.
50. Meyers, C. (2014). How Data Management and Governance Can Enable Successful Self-Service BI. Business Intelligence Journal, 19(4), 23–27.
51. Microsoft. (2017, February). Whitepaper Data Governance for GDPR Compliance: Principles, Processes, and Practices. Retrieved 29 May 2018, from https://info.microsoft.com/DataGovernanceforGDPRCompliancePrinciplesProcessesandPractices-Registration.html


52. Mikroyannidis, A. & Theodoulidis, B. (2006). Heraclitus II: A Framework for Ontology Management and Evolution. In 2006 IEEE/WIC/ACM International Conference on Web Intelligence (WI 2006 Main Conference Proceedings) (WI’06) (pp. 514–521). https://doi.org/10.1109/WI.2006.90
53. Mikroyannidis, A. & Theodoulidis, B. (2010). Ontology management and evolution for business intelligence. International Journal of Information Management, 30(6), 559–566. https://doi.org/10.1016/j.ijinfomgt.2009.10.002
54. Mikroyannidis, A. & Theodoulidis, B. (2012). A framework for ontology-based temporal modelling of business intelligence. Knowledge Management Research & Practice, 10(2), 188–199. https://doi.org/10.1057/kmrp.2012.2
55. Mokyr, J., Vickers, C. & Ziebarth, N. L. (2015). The History of Technological Anxiety and the Future of Economic Growth: Is This Time Different? The Journal of Economic Perspectives, 29(3), 31–50.
56. Moody, D. L. & Kortink, M. A. R. (2003). From ER Models to Dimensional Models: Bridging the Gap, 18.
57. Mouroutis, S. (2015). Data quality: does poor data quality significantly impact the effectiveness of data analysis and other business intelligence activities?
58. Muir, B. M. (1994). Trust in automation: Part I. Theoretical issues in the study of trust and human intervention in automated systems. Ergonomics, 37(11), 1905–1922. https://doi.org/10.1080/00140139408964957
59. OECD. (2015). G20/OECD Principles of Corporate Governance. https://doi.org/10.1787/9789264236882-en
60. Oliver, D., Romm-Livermore, C. & Sudweeks, F. (Eds.). (2009). Self-service in the Internet age: expectations and experiences. London: Springer.
61. Otto, B. & Weber, K. (2011). Data Governance. In Daten- und Informationsqualität (pp. 277–295). Vieweg+Teubner. https://doi.org/10.1007/978-3-8348-9953-8_16
62. Passlick, J., Lebek, B. & Breitner, M. H. (2017). A Self-Service Supporting Business Intelligence and Big Data Analytics Architecture (pp. 12–15). Presented at the 13th International Conference of Business Informatics, St. Gallen, Switzerland.
63. Sallam, R., Howson, C., Idoine, C. J., Oestreich, T., Richardson, J. L. & Tapadinhas, J. (2017). Magic Quadrant for Business Intelligence and Analytics Platforms. Retrieved from www.gartner.com/home
64. Saunders, C. (1917, October 9). Self-serving store. Retrieved from https://patents.google.com/patent/US1242872A/en
65. Scannapieco, M., Missier, P. & Batini, C. (2005a). Data Quality at a Glance. Datenbank-Spektrum, 14, 6–14.
66. Schaffner, J., Bog, A., Krüger, J. & Zeier, A. (2008). A Hybrid Row-Column OLTP Database Architecture for Operational Reporting. In Business Intelligence for the Real-Time Enterprise (pp. 61–74). Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-03422-0_5
67. Scherer, A., Wünderlich, N. V. & von Wangenheim, F. (2015). The Value of Self-Service: Long-Term Effects of Technology-Based Self-Service Usage on Customer Retention. MIS Quarterly, 39(1), 177–200.


68. Schlesinger, P. A. & Rahman, N. (2015). Self-Service Business Intelligence Resulting in Disruptive Technology. Journal of Computer Information Systems, 56(1), 11–21.
69. Sehgal, S. & Ranga, K. K. (2016). Translation of Entity Relational Model to Dimensional Model. International Journal of Computer Science and Mobile Computing, 5(5), 439–447.
70. Shapiro, M., Bieniusa, A., Zeller, P. & Petri, G. (2018). Ensuring referential integrity under causal consistency.
71. Shariat, M. & Hightower, R. (2007). Conceptualizing Business Intelligence Architecture, 7.
72. Sheposh, R. (2016). Ontology. Salem Press Encyclopedia of Health.
73. Sheposh, R. (2017). Garbage in, garbage out (GIGO). Salem Press Encyclopedia of Science.
74. Simovici, D. A. & Djeraba, C. (2008). Topologies and Measures on Metric Spaces. In Mathematical Tools for Data Mining (pp. 423–458). Springer, London. https://doi.org/10.1007/978-1-84800-201-2_11
75. Spahn, M., Kleb, J., Grimm, S. & Scheidl, S. (2008). Supporting Business Intelligence by Providing Ontology-based End-user Information Self-service. In Proceedings of the First International Workshop on Ontology-supported Business Intelligence (pp. 10:1–10:12). New York, NY, USA: ACM. https://doi.org/10.1145/1452567.1452577
76. State of Self Service BI Report. (2015). Logi Analytics.
77. Stodder, D. (2012). TDWI Checklist Report | Seven Steps to Actionable Personal Analytics and Discovery. Transforming Data with Intelligence. Retrieved from https://tdwi.org/research/2012/06/tdwi-checklist-report-seven-steps-to-actionable-personal-analytics-and-discovery.aspx
78. Thurnheer, A. (2003, April 10). Temporale Auswertungsformen in OLAP [Temporal evaluation forms in OLAP]. Wirtschaftswissenschaftliche Fakultät der Universität Basel, Basel.
79. Vakulenko, Y., Hellstrom, D. & Oghazi, P. (2018). Customer value in self-service kiosks: a systematic literature review. International Journal of Retail & Distribution Management. https://doi.org/10.1108/IJRDM-04-2017-0084
80. Vassiliadis, P., Simitsis, A. & Skiadopoulos, S. (2002). Conceptual Modeling for ETL Processes. In Proceedings of the 5th ACM International Workshop on Data Warehousing and OLAP (pp. 14–21). New York, NY, USA: ACM. https://doi.org/10.1145/583890.583893
81. Venkatesh, V. & Davis, F. D. (2000). A Theoretical Extension of the Technology Acceptance Model: Four Longitudinal Field Studies. Management Science, 46(2), 186–204.
82. Venkatesh, V., Morris, M. G., Davis, G. B. & Davis, F. D. (2003). User Acceptance of Information Technology: Toward a Unified View. MIS Quarterly, 27, 425–478. https://doi.org/10.2307/30036540
83. Vojvodic, M. (2017). Addressing GDPR Compliance Using Oracle Data Integration and Data Governance Solutions. Oracle, 13.
84. Wang, Y. R. & Madnick, S. E. (1989). The Inter-Database Instance Identification Problem in Integrating Autonomous Systems. In Proceedings of the Fifth International Conference on Data Engineering (pp. 46–55). Washington, DC, USA: IEEE Computer Society.
85. Weber, M. (2013). Keys to Sustainable Self-Service Business Intelligence. Business Intelligence Journal, 18(1), 18–24.
86. Wimalasuriya, D. C. (2009). Ontology-Based Information Extraction. University of Oregon, 37.

87. Yessad, L. & Labiod, A. (2016). Comparative study of data warehouses modeling approaches: Inmon, Kimball and Data Vault. In 2016 International Conference on System Reliability and Science (ICSRS) (pp. 95–99). https://doi.org/10.1109/ICSRS.2016.7815845


APPENDICES

APPENDIX I

APPENDIX II

APPENDIX III