The Potential of Temporal for the Application in Data Analytics

Master Thesis

Alexander Menne July 2019

Thesis supervisors: Prof. Dr. Marko van Eekelen (ICIS, Radboud University) and Ronald van Herwijnen (Avanade Netherlands)

Second assessor: Dr. Stijn Hoppenbrouwers (ICIS, Radboud University)

Radboud University and Avanade Netherlands

Abstract

The concept of temporal databases has received much attention in computer science research, but there is a lack of literature concerning the practical application of temporal databases. This thesis examines the potential of temporal databases for data analytics. The main concepts of temporal databases and the role of temporal data for data analytics and data warehousing are studied using a literature review. Furthermore, the current implementations of temporal databases are discussed to exemplify the differences between the literature and practice. We compare temporal databases embedded in a data warehouse architecture with conventional data warehouses by means of two prototypes and five assessment criteria. The results of the assessment indicate that the use of temporal databases as an alternative to traditional data warehouses has great advantages. Temporal databases solve data integrity issues of the classical ETL process and enable a more direct data flow from the source to the business intelligence tool. Also, the integrated support of the temporal dimension reduces programming efforts and increases the maintainability of the system. Hence, we find that temporal databases have the potential to enhance the data-driven strategy of companies significantly.

Keywords: Temporal Databases, Data Analytics, Data Warehouse, ETL


Acknowledgements

I would like to thank Marko van Eekelen for the excellent guidance through our biweekly meetings, which provided me with very valuable insights into research methodology. Also, I want to thank my external supervisor Ronald van Herwijnen for sharing his expertise in data warehousing, which contributed significantly to the prototype construction. Furthermore, I want to thank Stijn Hoppenbrouwers for the time and effort he has devoted to reading and assessing this thesis. Finally, I want to express my gratitude to all the people that supported, encouraged, and inspired me on this academic journey in the Netherlands, which is a place I am proud to call home.


Table of Contents

1. Introduction
   1.1. Related work
   1.2. Research question
   1.3. Thesis outline

2. Research Methods
   2.1. Literature review
   2.2. Assessment
        2.2.1. Issues and criteria
        2.2.2. Measurement
        2.2.3. Technology

3. Theoretical Background
   3.1. Temporal Data
   3.2. Conventional Databases and Temporal Data
   3.3. Temporal Databases
        3.3.1. Implementation issues
   3.4. Data Analytics
        3.4.1. Big Data
        3.4.2. The role of temporal data
   3.5. Data Warehouses
        3.5.1. The temporal dimension

4. Practical Background
   4.1. SQL:2011 standard
   4.2. Microsoft SQL Server implementation
   4.3. Other implementations

5. Temporal Databases compared to Conventional Data Warehouses
   5.1. Prototype A: Conventional data warehouse
        5.1.1. Architecture
        5.1.2. Insights
   5.2. Prototype B: Data warehouse with system-versioned tables
        5.2.1. Architecture
        5.2.2. Insights
   5.3. Assessment results
        5.3.1. Performance
        5.3.2. Costs
        5.3.3. Data integrity
        5.3.4. Maintainability
        5.3.5. Acceptance
        5.3.6. Three issues of the classical ETL process

6. Discussion
   6.1. Interpretation of the findings
   6.2. Limitations

7. Conclusion
   7.1. Conclusions
   7.2. Future work

Glossary

References

A. Source Code
   A.1. Transfer between staging database and data warehouse
   A.2. Data warehouse transformation
   A.3. Views in temporal database

B. Technical specifications

C. Assessment
   C.1. Performance assessment results
   C.2. Assessment scripts

1 Introduction


Thirty years ago, the British computer scientist Tim Berners-Lee submitted a 'vague but exciting' proposal for an information system meant as a "free, open, permissionless space for all of humanity to share knowledge and ideas" [32]. Nowadays, we call this system the internet, and its purpose has expanded far beyond the spreading of knowledge. All kinds of data are stored online, which leads to the exponential growth of the internet. The fact that the data stored online doubles every 20 months gives an indication of the challenge that companies face to cope with the masses of collected data [26]. The field of data analytics aims at finding solutions for the increasingly important extraction of knowledge to empower data-driven company strategies. The internet has become not only a source of data but also a facilitator for data analytics due to the efficiency and flexibility that cloud computing offers.

In spite of the technological progress driven by the internet, the architecture of data warehouses has hardly changed in the last two decades. Databases usually represent the current state of an organisation and are updated when a change happens in the real world. Periodically, data is extracted, transformed, and loaded (ETL) into a data warehouse to gain insights for decision-making processes. In between two ETL processes, the propositions stored in the database can change multiple times, which is not reflected in the next ETL process, as only the current state of the database is extracted at the start of the ETL process. In this fast-changing world, this introduces a potential bias into the decision-making process, since ETL processes are often executed only once per day or less.

Temporal databases may solve this problem by recording all changes made to the database in tables. Hence, no data ever gets lost, as the archiving of the data is not dependent on an ETL process. This, however, also means that much data needs to be stored, which might be a reason why it took almost 30 years from the first idea of a temporal database to an implementation by a database management system vendor. Nowadays, the conditions have changed, as the prices for hardware have declined immensely, which makes it affordable to store large amounts of data even for small businesses. An illustration of the price decline of disk drives from 2004 to 2019 can be found in figure 1.1. Surprisingly, the low storage prices and available solutions have not led to an establishment of temporal databases in the industry yet.

Fig. 1.1.: Average disk drive price per gigabyte in US dollars, based on [21].

1.1. Related work

Much research on temporal databases has been done in the field of computer science, which mainly focuses on the underlying concepts and possible implementations of temporal databases in existing database management systems. With over 100 publications, Richard T. Snodgrass is the primary contributor to the design and implementation of temporal databases [31]. His early publications on temporal databases contribute significantly to the theoretical foundation this thesis is built on [10, 9]. Also, Snodgrass is co-director of the institution TimeCenter, which plays an essential role in the advancement of knowledge within the domain of temporal databases [17]. More recent works include the books of Johnston [18] and Date et al. [14], which combine the theoretical aspects of temporal data with a rather practical view on the design of temporal databases. In particular, Date et al. [14] offer an interesting discussion of the problems that arise when implementing a temporal database.

There is a relatively small body of literature that is solely concerned with data analytics in general. Most recent academic research focuses on trends within the field of data analytics, such as big data. Gandomi et al. [2] provide a comprehensive overview of relevant techniques in the field of big data analytics. The study of Russom [27] offers some important insights into the state of big data analytics and the best practices of the industry as reported by 325 data management professionals. However, the reader should bear in mind that the study was conducted in 2011.

Closely related to data analytics, data warehouses are foremost addressed by practical researchers and industry leaders. Ralph Kimball and the Kimball Group are important authorities in this field, as they established the best practices of data warehouse architecture with their 'Toolkit' books [20]. A main contribution of Kimball is the concept of dimensional modelling, which implies that tables in the data warehouse are modelled as a star schema with fact tables surrounded by dimension tables [19]. In contrast to the data warehouse design suggested by Inmon [16], the Kimball data warehouse is not normalised, in order to simplify the information retrieval process.

Overall, there is sufficient theoretical research on temporal databases and practical research on data analytics and data warehousing. Very little is known, however, about the use of temporal databases for the purposes of data analytics. Also, the use of a temporal database within a data warehouse architecture has not been investigated yet.

1.2. Research question

This research aims at closing the knowledge gap on the use of temporal databases in data warehouse architectures to achieve an enhancement in the field of data analytics. Specifically, we want to examine the potential of temporal databases to innovate the ETL process of data warehousing for the benefit of data analytics. The main question this research seeks to answer is:

What is the potential of temporal databases for the application in data analytics?

In order to answer this research question in a structured manner, we defined six sub-questions:

Q1 What are the main concepts of temporal databases?
Q2 How are temporal databases currently implemented?
Q3 What role does temporal data play in data analytics and data warehousing?
Q4 What are suitable criteria to assess the impact of temporal databases for the application in data analytics?
Q5 How well do temporal databases serve the purposes of data analytics compared to current databases when embedded in a data warehouse architecture?
Q6 Can temporal databases solve issues of the classical ETL process in data warehouse architectures?

1.3. Thesis outline

This research investigates the potential of temporal databases when applied in the context of data analytics. In the second chapter, we describe the methods used as part of this research. In the third chapter, the theoretical background of temporal databases, data analytics, and data warehouses is presented. The fourth chapter is concerned with the implementations of temporal databases and discusses the differences between them. In chapter five, the comparison between temporal databases and conventional data warehouses is presented using the two prototypes built. The prototypes are assessed by means of five assessment criteria and three issues of the classical ETL process. Finally, the results of this research are discussed and concluded in chapters six and seven.

2 Research Methods


In this research, we make use of a literature review and an assessment based on prototype construction. In the first section of this chapter, the methods used for the literature review are described. The second section concerns the assessment criteria and the methods used to measure them. Also, the technology behind the prototypes is introduced and motivated.

2.1. Literature review

In order to answer the first four sub-questions of this research, we make use of a literature review. For this, we apply the five-stage grounded-theory method of Wolfswinkel et al. [13], which suggests taking an iterative approach to enable the continuous development and refinement of themes and theories found in the literature. In the first stage of the method, criteria for inclusion and exclusion, the fields of research, the appropriate sources, and the search terms need to be defined. The criteria for inclusion in the literature review were chosen to be currentness and authoritativeness. The criterion of currentness was more important for the review on analytics and data warehouses than for temporal databases, as there has been much development in the fields of analytics and data warehousing in recent years. As for authoritativeness, we preferred peer-reviewed literature over non-peer-reviewed literature, except where no suitable peer-reviewed literature was available.

The fields of research we considered for the literature review were mainly Information Systems and Computer Science. The search terms used were 'temporal database', 'temporal table', 'bitemporal database', 'temporal data', 'analytics', 'analytics requirements', 'analytics success factors', 'business intelligence', 'use of temporal databases', 'teradata', 'data warehouse', 'data warehouse requirements', 'data warehouse criteria', 'data warehouse assessment', 'data analytics', 'sql:2011', 'sql server 2016', 'azure data warehouse', 'SQL complexity', and several combinations of these terms.

The second stage is concerned with the actual search for literature. As tools for the literature search, we made use of Google Scholar and the library search engine of Radboud University (RUQuest) [25]. The collected literature was then scanned and selected based on the criteria defined in the first stage. The fourth stage of Wolfswinkel's method proposes to iterate through the literature corpus randomly and to highlight significant parts of the text. We found it to be more effective to iterate through the literature sorted by topic and relevance and thus slightly deviated from the method. After the whole body of literature had been analyzed, we re-read the excerpts and assigned them to concepts. This was done using the iOS application Liquidtext. In the last stage, we sorted the excerpts into the suitable chapters of this thesis to prepare the text production.

2.2. Assessment

2.2.1. Issues and criteria

The sub-questions Q5 and Q6 are answered by assessing temporal databases and comparing them to a conventional data warehouse architecture. In order to answer Q6, we defined three issues of the ETL process used in conventional data warehouses based on the literature review. These issues are used to determine whether temporal databases are able to solve them. The identified issues of the classical ETL process (described in more detail in section 3.5.1) are:

Issue 1 Transactions executed between two ETL iterations are lost.
Issue 2 The time attributes reflect the times of ETL executions and not the actual times of the transactions.
Issue 3 It is not possible, or very costly, to realize (nearly) real-time processing of data.

In order to compare temporal databases with conventional data warehouse architectures and to answer Q5, we defined five assessment criteria based on computer science and data analytics literature. These criteria are used to assess both temporal databases and conventional data warehouses. The results of the assessment are compared, and a grounded judgement about the potential of temporal databases can be made. In the following, the five criteria are presented and motivated.

Performance
Performance is a frequently used assessment criterion applied in computer science to evaluate the responsiveness of a system [12]. This criterion is relevant for our research because data analytics projects often involve the processing of massive amounts of data, which should be analyzed in an efficient manner. Therefore, the time passed from a transaction in the source database to reflecting this transaction in the business intelligence tool should be minimal, so that business users have access to up-to-date information [27].

Costs
The initial and post-implementation costs are a key criterion for the choice of data warehouses [6]. In this research, we focus on cloud-based solutions, as data warehouses are rarely implemented on-premise anymore. Therefore, we disregard the initial costs and concentrate on the monthly running costs of both concepts. These costs are a factor which must not be neglected, as data analytics is not only for big businesses anymore [27]. After all, cloud services and lower storage costs have enabled smaller companies with budget constraints to use their data for decision-making.

Data integrity
Data integrity is a vital criterion for the evaluation of data warehouses, which commonly concerns the whole data flow of the ETL process [3]. For data analytics, data integrity is critical, as the information retrieved from the business intelligence tool is the foundation for a company's decision-making. Therefore, the data must be consistent, accurate, and complete, as false or lost data can have severe consequences for data-driven companies [19]. For the comparison between temporal databases and conventional data warehouses, we pay special attention to the integrity of the temporal attributes of the systems.

Maintainability
Maintainability refers to the ease with which changes can be made to a system. This criterion is particularly relevant for data warehouses, as they are often designed iteratively and regularly need to adapt to changes in the data model of the organisation [3]. After all, businesses are very dynamic nowadays and have constantly changing demands, which also applies to their information needs. Therefore, the data warehouse architecture should allow for easy adaptation and expansion in order to react to change [19]. Maintainability plays a key role in this ability to adapt, as it affects the time and costs needed to implement adjustments to the architecture.

Acceptance
The introduction of new technology can only be deemed successful if the end users accept the technology and incorporate it into their routines [19]. Therefore, acceptance is a frequently used criterion for evaluating new technology [1]. A temporal database embedded in a data warehouse architecture is a rather new concept, which is why it is necessary to assess to what extent it can be accepted by business users. In this respect, it is particularly important that the data retrieved in the business intelligence tool is similar between the two concepts, as the end users might otherwise resist the technological change.

2.2.2. Measurement

In order to compare temporal databases and conventional data warehouses, we built two prototypes representing the two concepts. This way, it is possible to compare the two concepts using the aforementioned issues and criteria in a realistic manner. The prototypes are both based on the Microsoft Azure data warehouse infrastructure due to the client's focus on Microsoft products. Prototype A has a conventional data warehouse architecture using a classical ETL process, while prototype B uses a temporal database as an alternative to the data warehouse.

As mentioned before, the ETL process of conventional data warehouses causes three issues. We use prototype B to determine whether temporal databases solve these issues. The first issue is considered solved if no data transferred from the source database to the BI dashboard gets lost at any time. The second issue is regarded as solved if the temporal attributes in the temporal database are equivalent to the actual execution times of the transactions in the source database. Finally, we find the third issue to be solved if the delay from executing a transaction on the source database to the reflection of that transaction in the BI dashboard is less than one minute. In the following, we present and motivate the metrics used for evaluating the prototypes in terms of the assessment criteria.

Performance
In order to assess the performance of the prototypes, we apply load testing, which is a frequently used testing practice in computer science involving the monitoring of the responsiveness of a system that is exposed to a certain work load [12]. In the context of data warehouses, such tests commonly focus on the processing time of the whole data flow of the ETL process [3]. We apply this approach by measuring the time separately for 1) transforming and loading the data, and 2) retrieving the data in the dashboard. The time for extracting the data is neglected, as there is no reliable method to measure the replication time, and we do not expect significant differences between the two prototypes. For prototype A, the time for transforming and loading the data is retrieved from Data Factory. For prototype B, this time is measured by timing how long it takes to query all views. For both prototypes, we measure the time consumed for retrieving the data in the dashboard by manually taking the time using a stopwatch app. We have taken some measures to make the results of the assessment more reliable and realistic. The prototypes are tested using a low-performance and a high-performance setup for the SQL databases used in the prototypes in order to gain insight into the effect of the hardware setup on the performance of the prototypes. Furthermore, the measurements are performed three times each for both prototypes and both configurations to eliminate possible external influences such as irregularities of the Azure servers. The results are then combined by calculating the average per prototype and configuration.
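To illustrate how the query time of prototype B's views can be taken, the sketch below times one full evaluation of a view; the view name dbo.DimProduct is a hypothetical placeholder and not part of the prototype code (the scripts actually used are listed in appendix C.2).

-- Minimal sketch: measure how long it takes to evaluate one view on
-- prototype B. The view name dbo.DimProduct is a hypothetical placeholder.
DECLARE @start DATETIME2 = SYSUTCDATETIME();

SELECT COUNT(*) AS RowsRead
FROM dbo.DimProduct;          -- forces the view to be evaluated completely

SELECT DATEDIFF(MILLISECOND, @start, SYSUTCDATETIME()) AS ElapsedMilliseconds;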

Costs
We measure the costs of each prototype in both configurations by calculating the overall monthly costs of the Azure services used. For prototype A, we calculate the monthly costs based on a daily ETL process and an hourly ETL process. This way, we get an understanding of the degree to which an increase in the ETL frequency affects the overall costs. The prices are retrieved from Microsoft's price calculator using the specifications of our prototypes [24]. We use the prices of the region 'West Europe'. If there are no monthly prices given for a service, we consider a month to be an interval of 30 days.

Data integrity
The data integrity of the prototypes is tested by comparing the effect that transactions have on the data warehouse of prototype A and the temporal database of prototype B. Both prototypes use the same source data, which means that they should in principle show the same data at any time. However, the currentness and the precision of the temporal attributes of the data may differ, which we measure by successively executing three sets of create, update, and delete transactions on the source database. Also, we measure the influence of the different ETL processes in the prototypes. After each set, we first determine the differences in both prototypes using SQL Server Management Studio, and again after an ETL iteration has been executed on prototype A. In the analysis, we focus on the differences in the temporal attributes and on potentially different or non-existent data.
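As an illustration, one such set of transactions might look as follows; the statements target real AdventureWorks tables, but the concrete values are made up for this example and are not necessarily the exact transactions used in the assessment.

-- Illustrative set of create, update, and delete transactions executed on the
-- AdventureWorks source database; the concrete values are examples only.

-- create: add a new product category
INSERT INTO Production.ProductCategory (Name)
VALUES (N'Test Category');

-- update: raise the list price of one product by 10 percent
UPDATE Production.Product
SET ListPrice = ListPrice * 1.10
WHERE ProductID = 707;

-- delete: remove the category created above again
DELETE FROM Production.ProductCategory
WHERE Name = N'Test Category';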

Maintainability
It is rather difficult to measure the maintainability of data warehouses, as the data flow involves several processes, and the manual implementations are done completely in SQL. Unlike other programming languages, the maintainability of SQL code cannot be measured by a prediction model using metrics such as cohesion, coupling, or complexity [3]. Consequently, commonly used code quality evaluation services such as Better Code Hub do not support SQL [29]. Hence, we apply a rather straightforward metric for this criterion, namely the lines of code of each prototype. This is measured by transforming all code of each prototype into a standard format using the online tool Instant SQL Formatter and calculating the sum of the lines of code per prototype [15]. Furthermore, we evaluate the ease of changing and testing code by examining the steps involved for each prototype.

Acceptance
The acceptance of a system is commonly measured by means of validation tests performed by end users [1]. However, this approach is not feasible for this research, as we do not have the capacity to achieve a reliable end user validation. Therefore, we assess the acceptance by analyzing the differences between the dashboards of the prototypes and determine whether the dashboard of prototype B offers a user experience that is equal to or possibly better than that of prototype A. In doing so, we focus on the usefulness of the temporal information and the loading time of the dashboards.

2.2.3. Technology

For creating the prototypes, we make use of Microsoft solutions because of the client's focus on Microsoft products. Irrespective of this, however, Microsoft offers a very comprehensive package of integrated cloud-based services to implement a modern data warehouse architecture, which makes Microsoft a suitable choice for our purposes. Due to the abundance of services, there are many ways to implement a data warehouse. Our approach was to build a realistic architecture that stays within the scope of this research and the financial budget. That is to say, we implemented an architecture according to industry best practices without making use of services that are not relevant for our research. For instance, we did not make use of services that are intended for processing big data, because they do not add value to our research (see section 3.4.2). In the following, the key elements of the architecture of the prototypes are presented and motivated.

SQL Server
A rather obvious choice for the implementation of the databases is SQL Server, as it is the only database management system from Microsoft which is available on Azure. Also, as described in section 4.2, SQL Server supports system-versioned tables, which is clearly necessary for prototype B. When setting up SQL Server, there are two options in Azure: creating a virtual machine or using a managed instance. In principle, a managed instance is the preferable option because of its scalability and easy setup. However, a virtual machine provides more control over the server, as the virtual machine simulates an on-premise server. Also, the managed instance uses a slightly different version of SQL Server, namely Azure SQL, which does not offer some features that are available in the original version. For the prototypes, we use both a managed Azure SQL instance and a virtual machine running SQL Server, for different purposes. The managed instance is used for the staging database and the data warehouse. Operational source databases are in practice often on-premise, which is why we chose to use a virtual machine for the source database. Another reason is that the original version of SQL Server supports replication, which we use for the synchronization between the source database and the staging database and the data warehouse.

Replication is a technology that enables the distribution and synchronization of database contents directly from publisher databases to subscriber databases. SQL Server provides four different publication types: snapshot, transactional, peer-to-peer, and merge. They differ in the frequency of synchronization and in the way the data is transferred. For instance, snapshot publications transfer a snapshot at scheduled intervals, while transactional publications transfer data immediately after the data has been added, changed, or deleted. Peer-to-peer publications enable replication with more than one publication database and stream data directly to the peer databases. Merge publications periodically consolidate the changes made to both the publisher database and the subscriber database. We chose to use a transactional publication because it makes it possible to synchronize the source database with the staging database and the temporal database in nearly real-time, and there is no need for more than one publisher database.
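As an illustration of the system-versioned tables mentioned above, the sketch below declares such a table in SQL Server 2016 and queries a past state of it; the table and column names are made up for the example and do not correspond to the prototype schema.

-- Sketch of a system-versioned (temporal) table in SQL Server 2016.
-- Table and column names are illustrative, not the prototype schema.
CREATE TABLE dbo.Product
(
    ProductId INT           NOT NULL PRIMARY KEY CLUSTERED,
    Name      NVARCHAR(50)  NOT NULL,
    Price     DECIMAL(9, 2) NOT NULL,
    ValidFrom DATETIME2 GENERATED ALWAYS AS ROW START NOT NULL,
    ValidTo   DATETIME2 GENERATED ALWAYS AS ROW END   NOT NULL,
    PERIOD FOR SYSTEM_TIME (ValidFrom, ValidTo)
)
WITH (SYSTEM_VERSIONING = ON (HISTORY_TABLE = dbo.ProductHistory));

-- The state of the table at an earlier point in time can then be retrieved with:
SELECT ProductId, Name, Price
FROM dbo.Product
FOR SYSTEM_TIME AS OF '2019-01-01T00:00:00';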

Data Factory
Data Factory is a data integration tool in Azure that offers a simple graphical user interface without the need to write one's own code. The core of Data Factory is the pipeline, which consists of activities that together fulfil an overlying task. These activities can concern straightforward tasks such as copying or transforming data, but there are also more complex activities that involve other Azure services. For instance, there are activities for big data analytics services, such as MapReduce. In prototype A, we make use of the copy activity, which copies data from a source database to a so-called sink database. One can execute select statements to retrieve data from several tables, which makes it possible to extract and transform the data into an analyzable format. Also, one can execute pre-copy scripts on the sink database, which is useful if one wants to modify or delete data in the sink database before inserting the data from the source database. A sketch of such a source query and pre-copy script is given at the end of this section. An alternative to Data Factory is SQL Server Integration Services (SSIS). SSIS is a component of SQL Server designed for performing ETL processes. We could have used SSIS for prototype A, but the integration within Azure and the good usability made us choose Data Factory as the service for the ETL process.

PowerBI Desktop
PowerBI is a business analytics service with a variety of data visualization tools. It makes it possible for users to create business intelligence dashboards using data from databases and other sources. PowerBI supports several platforms, such as Windows computers and mobile devices. Furthermore, Microsoft offers an online browser-based software-as-a-service solution for PowerBI. For the dashboards of the prototypes, we used PowerBI Desktop, as it is available for free and sufficient for our purposes.

AdventureWorks
As we aim at building realistic prototypes, we made use of the comprehensive sample database AdventureWorks, which is provided by Microsoft [22]. Adventure Works Cycles is a fictitious company selling bikes and sports accessories in multiple countries around the world. The database consists of 68 tables divided into human resources, person, production, purchasing, and sales tables. The human resources schema comprises six tables which give typical information needed for the HR department, such as the data of all employees, the departments within the company, and job candidates. In the person schema, there are 13 tables with various kinds of personal data, such as addresses and phone numbers, about employees, resellers, and other contacts of the business. The production schema consists of 25 tables with detailed data about the products and their production. The purchasing schema encompasses five tables with information about purchase orders and vendors. The sales schema has 19 tables with orders and customer and reseller data. All tables are filled with a large amount of sample data. For instance, there are 31 465 sales orders, 19 614 addresses, and 504 different products in the database, which makes the AdventureWorks database very realistic and therefore suitable for our research.
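By way of illustration, a source query and pre-copy script for such a copy activity could look as follows; the joined AdventureWorks tables exist in the sample database, but the sink table dbo.DimProduct is a hypothetical name, and these statements are not the prototype code (see appendix A for the scripts that were actually used).

-- Possible source query of a copy activity: flatten the AdventureWorks
-- product hierarchy into one analyzable record per product.
SELECT p.ProductID,
       p.Name  AS ProductName,
       ps.Name AS Subcategory,
       pc.Name AS Category,
       p.ListPrice
FROM Production.Product AS p
LEFT JOIN Production.ProductSubcategory AS ps
       ON p.ProductSubcategoryID = ps.ProductSubcategoryID
LEFT JOIN Production.ProductCategory AS pc
       ON ps.ProductCategoryID = pc.ProductCategoryID;

-- Possible pre-copy script on the sink database: empty the (hypothetical)
-- dimension table before the fresh load.
TRUNCATE TABLE dbo.DimProduct;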

3 Theoretical Background


In this chapter, the three main concepts this paper deals with are introduced, namely temporal databases, analytics, and data warehouses. Before defining what temporal databases are, it is necessary to understand the notion of temporal data and its terminology, and how conventional databases integrate temporal data. Building on that foundation, the key concepts of temporal databases are pointed out using several examples, and the history of temporal databases is explained. Also, the problems arising when implementing a temporal database are addressed. Finally, the issues that temporal databases solve are pointed out. In the Data Analytics section, we give an overview of the field of analytics, and the concept of big data analytics is elaborated on to illustrate the current trends within analytics. Furthermore, the role of temporal data for data analytics is discussed. In the last section, an introduction to data warehousing and the principles used in data warehouse architectures is given. Lastly, we discuss the temporal dimension in data warehouses.

3.1. Temporal Data

Temporal data can be defined as any data which may change over the course of time [10]. Since most data can, at least in theory, change over time, we add to this definition that the aspect of time needs to be significant for the purpose of the data. For instance, an e-commerce company might not be interested in the time at which a customer changed her email address, while the time at which a customer places an order is certainly important. At first, the notion of temporal data may seem rather trivial, since time is considered to be a generally known concept. However, there is a considerable number of terms used ambiguously in the academic literature about temporal databases. Hence, we define a concise nomenclature for temporal aspects in the following.

The purpose of temporal data is to model certain events occurring in reality by means of records, the so-called transactions [18]. These transactions contain information about an event and possibly a timestamp with certain semantics defined by either the user herself or the database management system, which we call a temporal attribute. Temporal attributes can give information about the truthfulness of a statement in the real world (the valid time) or as stated in the database (the transaction time) [14]. Also, a temporal attribute could have other semantics defined by the user, which are not interpreted by the database management system (user-defined time) [10]. As these terms are used throughout this paper, we recommend having a look at the glossary in the back matter of this paper for more comprehensive definitions of the introduced terms.

3.2. Conventional Databases and Temporal Data

Conventional databases model the state of an enterprise or organization at a certain point in time [10]. When information stored in the database is not believed to be true anymore and is modified, the old data is replaced by the new data and is not retrievable any longer [10]. The consequence of this behaviour is that no history of transactions is stored in conventional databases. This, however, does not mean that no temporal data at all is stored in these databases. For instance, user-defined time can be stored in conventional databases, but there is no support for automatically maintained temporal attributes such as transaction time [10, 14]. Therefore, any temporal attribute in conventional databases can be updated [14].

If the history of a company's data needs to be stored, a common solution is to regularly make backups of the databases (so-called snapshots). There is, however, often a need to include history in a conventional database, as this allows the users to query historical data from within the database. This request is commonly dealt with by adding timestamps to the primary key of the table for which historical information is needed [18]. Another option is to create a separate table with these timestamps added in order to keep the schema of the original table [18]. These approaches involve either the manual execution of all transactions or a trigger that modifies the update query before it is executed [18]. The need for these additional operations is best explained using example 3.1, which we will use and advance throughout this chapter.

Example 3.1
Let company CheeseHut be a chain store for cheese offering three different types of cheese, namely young, mature, and old cheese, with each cheese having its own price.

Row  id  name    price
1    1   Young   6
2    2   Mature  8
3    3   Old     11

Tab. 3.1.: Product table of CheeseHut

Table 3.1 illustrates the product table of CheeseHut, modelling the types of cheese and their prices per kilogram in euros. In 2014 CheeseHut changes the supplier for the old cheese, which results in a higher price of 12 euros per kilogram. Table 3.2 reflects this change.

Row  id  name    price
1    1   Young   6
2    2   Mature  8
3    3   Old     12

Tab. 3.2.: Product table of CheeseHut

Suppose now that in 2019, CheeseHut wants to change back to the former supplier of the old cheese. CheeseHut wants to sell the cheese for the same price as before, but none of the employees recalls the retail price of the old cheese in 2013 anymore. This information has been lost in the change made before. Fortunately, the IT department of CheeseHut implemented a history table before the change was made, which is illustrated in table 3.3.

Row  id  startTime   endTime     name    price
1    1   2011-01-01  9999-12-31  Young   6
2    2   2011-01-01  9999-12-31  Mature  8
3    3   2011-01-01  2014-01-01  Old     11
4    3   2014-01-01  9999-12-31  Old     12

Tab. 3.3.: Product history table of CheeseHut

As can be seen in this example, a conventional database without any temporal attributes comes with the risk of losing possibly valuable information about the history of the database. Adding a history table may mitigate this risk, but comes with additional operations, as mentioned before. When updating a row in the original table, this change also needs to be reflected in the history table, which can either be done manually or by setting up a trigger function. Furthermore, the end time of the existing record needs to be changed and a new row needs to be inserted. This effort, however, is still reasonable compared to the complex queries and constraints that need to be written to cope with more complicated transactions, which is why a database providing well-designed solutions for the processing of temporal data can be very valuable [14]. In the next section, we elaborate on these complex transactions and how they can be solved.
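A sketch of such a trigger is shown below; it assumes the tables of example 3.1 are stored as dbo.Product and dbo.ProductHistory with the columns shown in table 3.3, and it only handles updates, not inserts or deletes.

-- Sketch of a trigger that keeps the history table of example 3.1 up to date
-- when a product row is updated. Table and column names are assumptions
-- based on the example; inserts and deletes would need similar handling.
CREATE TRIGGER trg_Product_History
ON dbo.Product
AFTER UPDATE
AS
BEGIN
    -- Close the currently open history row of every updated product.
    UPDATE h
    SET h.endTime = SYSDATETIME()
    FROM dbo.ProductHistory AS h
    JOIN inserted AS i ON i.id = h.id
    WHERE h.endTime = '9999-12-31';

    -- Open a new history row with the new values.
    INSERT INTO dbo.ProductHistory (id, startTime, endTime, name, price)
    SELECT i.id, SYSDATETIME(), '9999-12-31', i.name, i.price
    FROM inserted AS i;
END;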

3.3. Temporal Databases

A temporal database is a database that contains time-varying data and offers built-in support for modelling the temporal dimension of the data [10]. When using a temporal database, retrieving information about the past is supported by built-in query functions, which makes the development and use of such databases more efficient and potentially increases the performance [10, 14]. Furthermore, any modifications to stored data are automatically dealt with, which reduces the manual workload involved in maintaining the database [10].

The concepts of valid time and transaction time are the core of temporal databases. Again, valid time defines the time interval during which a statement is true according to current beliefs, while transaction time specifies the time interval during which a statement was true according to the database [14]. If a database were always up-to-date and contained only correct data, then valid time and transaction time would be identical. This is, however, a rather utopian scenario, as beliefs about the truthfulness of statements constantly change due to new insights. Furthermore, an important distinction is that valid time is usually kept in the tables containing the temporal data, while transaction time is kept in a separate history table [14]. The difference between valid time and transaction time is further explained in example 3.2.

Example 3.2
As stated in example 3.1, the IT department of CheeseHut implemented a table with historical data of the products (see table 3.3). This table is a simple example of a separate history table with transactional times. For the purpose of this example, we add times to the history table:

Row  id  startTime            endTime              name    price
1    1   2011-01-01 08:15:26  9999-12-31 23:59:59  Young   6
2    2   2011-01-01 08:17:12  9999-12-31 23:59:59  Mature  8
3    3   2011-01-01 08:18:03  2014-01-01 15:47:52  Old     11
4    3   2014-01-01 15:47:52  9999-12-31 23:59:59  Old     12

Tab. 3.4.: History table with transactional times

Suppose now that the IT department also added valid times to the product table, as illustrated in table 3.5.

Row  id  startTime            endTime              name    price
1    1   2008-06-01 08:00:00  9999-12-31 23:59:59  Young   6
2    2   2008-06-01 08:00:00  9999-12-31 23:59:59  Mature  8
3    3   2008-06-01 08:00:00  2013-12-31 23:59:59  Old     11
4    3   2014-01-01 00:00:00  9999-12-31 23:59:59  Old     12

Tab. 3.5.: Product table with valid times

The times indicated in table 3.5 deviate to a great extent from those in the history table 3.4, as the transaction times differ greatly from reality. CheeseHut opened on 1 June 2008 at 8 o'clock, and therefore the prices were in fact valid from then on. The database, however, was only set up on 1 January 2011, which explains the deviation between the valid time and the transaction time. Furthermore, the price change of the old cheese was supposed to be enacted with the transition to the year 2014, but the database was only updated in the afternoon of 1 January.

Example 3.2 shows that valid time and transaction time can differ significantly. These two temporal attributes store different information, and a combination of both enables users to retrieve not only the time when a statement was true in reality, but also the time when that statement was true according to the database. Both attributes therefore have their own purpose and use cases.

Now that the purpose and usefulness of temporal databases are evident, it may come as a surprise that it took almost 30 years from the first concrete idea of a temporal database until temporal features were included in the standard SQL:2011 [10, 5]. The effort put into research was undoubtedly not the issue, as about 400 papers had already been written by 1992 [9]. More than 20 temporal data models and query languages had been proposed by that time, with each contributor having their own vision on the implementation of the temporal dimension in databases [9]. There has been a very active academic discussion about the terminology of temporal features and attributes, but a general consensus data model could not be established due to the profusion of proposals [14]. This was an immense stumbling block for the consolidation of knowledge within this research domain. A core issue which divided the research community revolved around the role of temporal data. One group of researchers took the position that temporal data should be treated as a special kind of data that should be represented with hidden timestamp attributes, accepting the divergence from general relational principles [14]. The other group of researchers, however, advocated sticking with relational principles by treating temporal data as much as possible just like any other data [14].

Nowadays, there is little discussion about the terms introduced in section 3.1, but there are still different data models for temporal databases. The key difference in data models lies in the inclusion of valid time, transaction time, or both. A valid-time relation supports only valid time, while a transaction-time relation supports only transaction time [10]. A bitemporal relation, however, supports both valid and transaction time, which has several advantages but also makes the implementation more complex [14].
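Returning to example 3.2, the difference between the two attributes also shows in how they are queried. The sketch below asks the same question once against the transaction-time history table (table 3.4) and once against the valid-time product table (table 3.5); the table names dbo.ProductHistory and dbo.Product are assumptions made for the example.

-- What price did the database record for old cheese (id = 3) on 1 July 2013?
-- (transaction time, table 3.4)
SELECT price
FROM dbo.ProductHistory
WHERE id = 3
  AND startTime <= '2013-07-01'
  AND endTime   >  '2013-07-01';

-- What price actually applied to old cheese on 1 July 2013?
-- (valid time, table 3.5)
SELECT price
FROM dbo.Product
WHERE id = 3
  AND startTime <= '2013-07-01'
  AND endTime   >  '2013-07-01';

-- For 1 July 2013 both queries return 11; for a date before the database was
-- set up (e.g. 2010-01-01) only the valid-time query returns a price.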

3.3.1. Implementation issues

As touched upon in section 3.2, the built-in support of temporal databases for temporal attributes can offer solutions to rather complex problems and atypical situations. Date et al. [14] identified several issues that arise when manually implementing a temporal database with valid times. In the following, we introduce two major problems that Date et al. addressed.

The redundancy problem
Undoubtedly, consistency is of major importance for structured databases. Data needs to be stored in a uniform way in order to routinize the information retrieval process. Therefore, redundancy needs to be avoided in order to guarantee the integrity of the database. Example 3.3 illustrates how a redundancy problem can occur. In table 3.6, the redundancy emerges from the fact that the third and fourth rows could be stored in one row without losing any information. The price effectively never changed, and therefore the fourth row should be deleted and the end time of the third row should be set to '9999-12-31 23:59:59'. This way, a query with the semantics of 'Since when does the old cheese cost 11 euros?' can be answered by 'Since 2008-06-01 08:00:00' instead of 'Since 2008-06-01 08:00:00 and since 2014-01-01 00:00:00', with the latter implying that an event occurred in between these times.

Example 3.3
Suppose that CheeseHut is not satisfied with the new supplier for the old cheese (see example 3.1). The management decides that they will not do business with the new supplier and stay with the old supplier. Consequently, the price change that was implemented before is reverted, as shown in table 3.6.

Row  id  startTime            endTime              name    price
1    1   2008-06-01 08:00:00  9999-12-31 23:59:59  Young   6
2    2   2008-06-01 08:00:00  9999-12-31 23:59:59  Mature  8
3    3   2008-06-01 08:00:00  2013-12-31 23:59:59  Old     11
4    3   2014-01-01 00:00:00  9999-12-31 23:59:59  Old     11

Tab. 3.6.: Redundancy problem
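A minimal sketch of how such redundancy could be detected is given below; it assumes the table of example 3.3 is stored as dbo.Product with closed intervals at one-second granularity, as in table 3.6.

-- Find pairs of rows for the same product whose periods meet and whose values
-- are identical: such pairs are redundant and should be coalesced into one row.
-- Table and column names are assumptions based on example 3.3.
SELECT a.id, a.startTime AS firstStart, b.endTime AS secondEnd
FROM dbo.Product AS a
JOIN dbo.Product AS b
  ON  a.id = b.id
  AND DATEADD(SECOND, 1, a.endTime) = b.startTime  -- periods meet
  AND a.name  = b.name
  AND a.price = b.price;                           -- values unchanged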

The contradiction problem
Temporal tables usually include the start and the end time of the temporal attribute in the primary key, which enables users to add rows which are, apart from the times, identical.

A row with identical times and different values is, due to the nature of primary keys, not possible, as this would establish a contradiction in the database. However, the start and end times in different rows can overlap while still being non-identical, which allows for unwanted contradictions. Example 3.4 demonstrates the contradiction problem. A query with the semantics of 'How much is young cheese on 2 June 2018?' would return contradictory results, since the database states in the first and fourth rows that young cheese costs both six euros and four euros.

Example 3.4
Suppose that CheeseHut celebrates its 10-year anniversary with a discount on young cheese. For two weeks starting from 1 June 2018, the stores sell young cheese for four euros, which is implemented in the database as shown in table 3.7.

Row  id  startTime            endTime              name    price
1    1   2008-06-01 08:00:00  9999-12-31 23:59:59  Young   6
2    2   2008-06-01 08:00:00  9999-12-31 23:59:59  Mature  8
3    3   2008-06-01 08:00:00  9999-12-31 23:59:59  Old     11
4    1   2018-06-01 08:00:00  2018-06-16 07:59:59  Young   4

Tab. 3.7.: Contradiction problem

These two problems have in common that they need to be solved by implementing rather complex constraints, which makes it evident why a self-made temporal database with valid times is a challenging endeavour. For a comprehensive discussion of the solutions to these problems, we refer to the work of Date et al. [14].
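As an impression of what such a constraint might look like, the sketch below uses a trigger to reject rows whose valid-time period overlaps an existing period for the same product; it assumes the table of example 3.4 is stored as dbo.Product, and a complete solution would need to cover more cases.

-- Sketch of a trigger that prevents the contradiction of example 3.4 by
-- rejecting rows whose valid-time period overlaps an existing period for the
-- same product. Table and column names are assumptions based on the example.
CREATE TRIGGER trg_Product_NoOverlap
ON dbo.Product
AFTER INSERT, UPDATE
AS
BEGIN
    IF EXISTS (
        SELECT 1
        FROM inserted AS i
        JOIN dbo.Product AS p
          ON  p.id = i.id
          -- a different row for the same product (the period is part of the key)
          AND NOT (p.startTime = i.startTime AND p.endTime = i.endTime)
          -- closed intervals overlap
          AND p.startTime <= i.endTime
          AND i.startTime <= p.endTime
    )
    BEGIN
        RAISERROR ('Overlapping valid-time period for this product.', 16, 1);
        ROLLBACK TRANSACTION;
    END
END;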

3.4. Data Analytics

Data analytics can be defined as the use of IT applications to support decision-making by analyzing large data sets [26]. Commonly, data analytics is used in the context of structured data stored in data warehouses to maximize the usefulness of the collected data [11]. The main goal of data analytics is to improve a company's performance by analyzing the past (descriptive analytics) and predicting the future (predictive analytics). More specifically, descriptive analytics aims to explain why something has happened by summarizing raw data into a format that is appealing to business users [8]. For instance, descriptive analytics could give an answer to why a company's sales declined. Predictive analytics, however, is used to evaluate the future and to forecast trends by means of prediction models and scoring [8]. An example of predictive analytics could be the prediction of a company's net profit at the end of the fiscal year.

Data analytics projects can be structured in phases, as suggested by Runkler's four-phase model [26]. The first step in a data analytics project is the preparation of the data, meaning that data is collected and a selection is made based on the information needs of the business users. Afterwards, the data needs to be preprocessed, as raw data is often not in an analyzable format and therefore needs to be cleaned, filtered, and transformed. When the data is in a suitable format, mathematical and statistical calculations can be applied to the data in order to gain valuable insights. These insights are commonly communicated in the form of visual dashboards which can be used by business users. In the last phase, the business users interpret and document the findings and possibly take action based on the gained insights. See figure 3.1 for an illustration of the four phases of data analytics projects.

Fig. 3.1.: Phases of data analytics projects, based on Runkler [26].

3.4.1. Big Data

It is estimated that the data stored online doubles every 20 months [26]. These massive amounts of continuously produced heterogeneous data are often referred to as big data. The largest part of big data is in unstructured form, such as photos, videos, or user posts, which poses a challenge for companies to analyze the data [2]. The challenges of big data are often referred to as the Three V's, which stand for Volume, Variety, and Velocity [8]. The volume, referring to the enormous size of the data, creates issues such as high storage and computational costs. The second V, variety, concerns the heterogeneity of the data sources, which can be, for example, a mixture of structured data created by employees and unstructured data generated by Internet of Things devices. Velocity can be described as the frequency at which data is generated, and the time it takes to be able to analyze the data. The challenge connected to the velocity of data is to process the data at the rate it is created.

Big data analytics seeks to cope with these challenges and endeavours to achieve the efficient processing and analysis of massive amounts of heterogeneous data. Current research mainly focuses on extracting information from unstructured data by means of data mining, which aims at deriving relationships and other insights from the data using statistical methods and machine learning algorithms [11]. A key issue for data mining is efficiency, as the methods and algorithms used often require complex computations. A popular framework for reducing processing time is MapReduce, which enables the parallel processing of massive amounts of (unstructured) data on a distributed network of servers [11].

3.4.2. The role of temporal data

Temporal data is a valuable source of information for analytics, as it gives insights on trends and other significant changes in data over time. Non-temporal databases do not offer enough information on the temporal dimension of the data, which may lead to biased insights. Both descriptive and predictive analytics can benefit a great deal from a complete history of data enabled by temporal databases. After all, the insights gathered using descriptive and predictive analytics are mostly based on historical data. Also, the inclusion of valid time can make analyses and predictions more precise, as it reduces the gap between reality and the modelled reality (i.e. the database).

As for big data analytics, temporal data is undoubtedly valuable. Temporal databases, however, are less relevant, because they are grounded on database management systems which require structured data. Even if unstructured data is processed into structured data and inserted into a relational database, the added value of temporal databases is limited. Valid times are assigned and maintained by humans, which is neither feasible nor logical for big data due to its high volume and the mostly external human sources or sensors the data is retrieved from. Furthermore, transaction times are useful for tracking the changes within a database, which may occur if the data refers to some object that changes over time. This is, however, rarely the case for big data, since big data often refers to either a state at a certain point in time or to an unchangeable object.

3.5. Data Warehouses

A data warehouse can be defined as a database which stores data from several sources and presents it in an integrated structure that is suitable for effective decision-making support [11]. Data warehouses are an integral part of data analytics projects, as they provide the infrastructure for effective analysis. Specifically, data warehouses focus on subjects of analysis that are valuable for decision-making, such as sales or customer satisfaction, while operational databases are normalized and store the data that is necessary for the effective functioning of the company's operations [11]. Furthermore, data warehouses store historical data and are optimized to support the efficient handling of complex queries by calculating and aggregating significant performance indicators beforehand [11].

Significant tools for monitoring the performance of a company are key performance indicators (KPIs), which are measurable organizational targets [11]. Data warehouses present these KPIs (also called measures) from different dimensions, which are perspectives such as time or location. For instance, the measure 'sales' can be seen from the time dimension with the query 'sales in 2015', or from the location dimension with the query 'sales in Berlin'. This principle is called multidimensional modelling and builds the foundation of modern data warehouses [11]. In practice, this concept is implemented by designing fact tables which are surrounded by connected dimension tables. This design is often referred to as a star schema because it resembles a star-like construction [19]. The fact tables mainly contain measures and references to dimension tables, while dimension tables provide background information to enable the business user to see the measures from different perspectives. Example 3.5 illustrates the principle of multidimensional modelling.

Example 3.5
Suppose that CheeseHut has experienced a substantial decrease in sales and the CEO wants to know the reason for that. The tables below show an excerpt of the sales fact table and the corresponding product and store dimension tables of CheeseHut's data warehouse.

Row  ProductKey  StoreKey  Quarter  Sales
1    1           2         2018Q3   9562
2    1           3         2018Q3   11279
3    1           2         2018Q4   6043
4    1           3         2018Q4   10862
5    1           2         2019Q1   4590
6    1           3         2019Q1   11054

Tab. 3.8.: Excerpt of sales fact table

Row  ProductKey  Name
1    1           Young
2    2           Mature
3    3           Old

Tab. 3.9.: Product dimension table

Row  StoreKey  Location
1    1         London
2    2         Berlin
3    3         Amsterdam

Tab. 3.10.: Store dimension table

The company's data analytics specialist looks into the data and sees that the measure 'sales' for the store with the ID '2' and the product '1' decreased significantly from the third quarter of 2018 to the first quarter of 2019. Looking into the product and store dimension tables, the expert concludes that the sales of young cheese in the store located in Berlin almost halved, while the sales in the store in Amsterdam remained stable. The specialist shares these insights with the CEO, thereby enabling her to make a data-driven decision.
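The specialist's analysis corresponds to a simple query over the star schema; the sketch below assumes the tables of example 3.5 are stored as FactSales, DimProduct, and DimStore.

-- Sales of young cheese per store and quarter, joining the fact table with
-- its dimension tables (table names assumed from example 3.5).
SELECT s.Location,
       f.Quarter,
       SUM(f.Sales) AS TotalSales
FROM FactSales AS f
JOIN DimProduct AS p ON p.ProductKey = f.ProductKey
JOIN DimStore   AS s ON s.StoreKey   = f.StoreKey
WHERE p.Name = 'Young'
GROUP BY s.Location, f.Quarter
ORDER BY s.Location, f.Quarter;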

Another central principle of data warehousing is ETL, which is the classical process of extracting significant data from different sources, transforming the data into a format that is suitable for analysis, and loading the transformed data into the data warehouse. The sources are often operational databases, but increasingly also include unstructured data from, for example, Internet of Things devices. The data that is significant for analyzing the company's performance therefore needs to be transformed into a format that can be used within data warehouses. This transformation process is commonly done in so-called staging databases, which serve as temporary data storage. Once the data is in the right format, the data is transferred from the staging database to the data warehouse. Clearly, this process implements the preparation and preprocessing phases of data analytics projects mentioned in section 3.4. An illustration of the ETL process is given in figure 3.2.

Fig. 3.2.: ETL process
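As a minimal illustration of the transform-and-load step, the sketch below moves cleaned product rows from a staging table into a dimension table of the data warehouse; all table and column names are hypothetical.

-- Minimal sketch of a load step from a staging table into a dimension table;
-- table and column names are hypothetical.
INSERT INTO dw.DimProduct (ProductKey, Name, Price, LoadDate)
SELECT s.ProductID,
       s.ProductName,
       s.ListPrice,
       SYSDATETIME()                -- timestamp added during the ETL process
FROM staging.Product AS s
WHERE NOT EXISTS (SELECT 1
                  FROM dw.DimProduct AS d
                  WHERE d.ProductKey = s.ProductID);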

3.5.1. The temporal dimension

An operational database usually does not store the history of its content. For the purposes of analysis, however, historical data is of high value. Therefore, an important function of the ETL process is the addition of the temporal dimension. This is done by adding timestamps to the data, such as the time when a certain row has been inserted or modified. If a business requires more detailed temporal information, one can even add transaction times to the data during the ETL process. This, however, is not the same as using a temporal database, as the ETL process is usually executed periodically. Accordingly, the data warehouse is only updated daily, weekly, or even monthly [11]. While this may be sufficient for many business applications, it poses the risk of losing data in the case of fast-changing data sources. For instance, if one row in a table in the operational database changes several times between two ETL processes, only the last version of the row is loaded into the data warehouse. Hence, the changes that happen between the two ETL processes are not reflected in the data warehouse. This issue is illustrated in example 3.6.

Example 3.6
The IT department of CheeseHut has configured the ETL process to run weekly every Sunday night. On Monday, 3 June 2019, CheeseHut introduces a new cheese with spinach flavour. The non-temporal product table of CheeseHut is shown in table 3.11.

Row  id  name     price
1    1   Young    6
2    2   Mature   8
3    3   Old      11
4    4   Spinach  9

Tab. 3.11.: Product table of CheeseHut

On Thursday, 6 June 2019 a customer complains about stomach ache after eating the spinach cheese. CheeseHut’s quality assurance department finds that some of the spinach cheeses are contaminated. CheeseHut immediately takes the spinach cheese off the shelves and consequently removes the cheese from the product table. On Sunday, 9 June 2019 the weekly ETL process is executed, and the product table is scanned for any updates. The table is the same as it has been the week before, and no changes are made to the product dimension table in the data warehouse.

Furthermore, business users nowadays require real-time performance indicators, which presents challenges to the ETL technology [11]. For one, the ETL process itself consumes time, which is a hurdle for the real-time analysis of the data. Also, the resulting delay means that the added timestamps refer to the time of the ETL process execution and do not represent the actual times at which the data has been added, updated, or deleted in the operational database. One approach to cope with these issues is to increase the execution frequency of the ETL process. However, the execution of ETL processes is rather costly, which justifies the search for alternative ways to process data in (nearly) real-time.

4 Practical Background

4.1. SQL:2011 standard ...... 31 4.2. Microsoft SQL Server implementation ...... 35 4.3. Other implementations ...... 37

This section gives an overview of several available implementations of temporal databases in order to exemplify the differences and similarities between the temporal database concepts as described in the literature and their realization in practice. First, the SQL:2011 standard is introduced, which defines the foundation for most available solutions. In the second section, the implementation of temporal databases in Microsoft SQL Server 2016 is described, which is used in the prototypes made within this research. Finally, two other implementations of temporal databases are presented.

4.1. SQL:2011 standard

In 1995, the International Organization for Standardization (ISO) started an effort to extend the SQL standard to support temporal data. However, the ISO SQL committee could not agree on a common proposal and the interest of DBMS vendors in supporting temporal data was rather low, which led to the cancellation of the project in 2001. Ten years later, the ISO and the International Electrotechnical Commission (IEC) published the SQL standard SQL:2011, which ultimately introduced support for temporal tables¹. In the following, the key definitions of the SQL:2011 standard are given, based on an article by Kulkarni and Michels [5].

Period definitions
SQL:2011 introduces period definitions which identify a pair of columns as a period, consisting of a start time and an end time. The standard defines that the start time is included in the period, while the end time is excluded.

¹ Before, we referred to this concept as temporal database because this term is mostly used in the literature. However, temporal table describes the concept more precisely and is commonly used in practice.


Naturally, the standard also defines the constraint that the end time of a period needs to be greater than its start time. Furthermore, SQL:2011 defines two time dimensions: the system-time period, which is the equivalent of what is referred to in the literature as transaction time, and the application-time period, which corresponds to valid time.

Application-time period tables
Application-time period tables are designed to meet the demand for capturing the time during which a certain proposition is true in the real world according to current beliefs. This time period is defined by the user and can be updated at any time. Using the syntax defined by SQL:2011, the product table with valid times, as shown in table 3.5 (p. 23), can be created using this code:

CREATE TABLE Products(
    id INTEGER,
    startTime TIMESTAMP(12),
    endTime TIMESTAMP(12),
    name VARCHAR(20),
    price INTEGER,
    PERIOD FOR ProductPeriod (startTime, endTime)
)

As can be seen, no particular column names are prescribed; the user can pick any suitable names. However, SQL:2011 sets the restriction that the data types of the start and end columns need to be either date or timestamp, and both columns need to have the same data type. After the application-time period has been defined, it is possible to execute update and delete statements that take changes in the application-time period into account. The standard defines the for portion of clause, which indicates the period for which the SQL statement is applied. For instance, table 3.5 could be changed so that the price of young cheese is set to 5 euros for the year 2013 with the following code:

UPDATE Products
FOR PORTION OF ProductPeriod
    FROM TIMESTAMP '2013-01-01 00:00:00'
    TO TIMESTAMP '2013-12-31 23:59:59'
SET price = 5
WHERE id = 1

Since the first row in table 3.5 already defines the price during this period, this statement replaces the row with the following three new rows:

Row  id  startTime            endTime              name   price
1    1   2008-06-01 08:00:00  2013-01-01 00:00:00  Young  6
2    1   2013-01-01 00:00:00  2013-12-31 23:59:59  Young  5
3    1   2013-12-31 23:59:59  9999-12-31 23:59:59  Young  6

Tab. 4.1.: Excerpt of product table with application-time periods

A delete statement with the for portion of clause behaves similarly, as it also creates new rows for the remaining periods during which a certain proposition is true in the modelled world. Furthermore, SQL:2011 defines seven period predicates that simplify querying data in application-time period tables: contains, equals, overlaps, precedes, immediately precedes, succeeds, and immediately succeeds. For example, if one wants to know how much the young cheese cost as of July 1, 2009, one can use the following query:

SELECT name, price
FROM Products
WHERE id = 1
AND ProductPeriod CONTAINS TIMESTAMP '2009-07-01 00:00:00'

The fact that the application-time periods are maintained by the user indicates that there must be constraints to ensure that the periods cannot lead to contradictions in the database. SQL:2011 proposes that application-time periods for the same object must not be overlapping. Note that this requirement, when correctly implemented, should solve both the redundancy problem and the contradiction problem presented in section 3.3.1. Another constraint defined by the standard prevents references from a child table to a parent table if the application-time periods of an object in the child table are not contained in the periods of the matching object in the parent table. These complex constraints indicate that the implementation of application-time period tables is a rather difficult endeavour.
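As a sketch of how such constraints look in SQL:2011 (the syntax follows the description by Kulkarni and Michels [5]; the child table OrderLines is an illustrative assumption, and support varies per DBMS):

-- Primary key that forbids overlapping application-time periods per product
ALTER TABLE Products
    ADD PRIMARY KEY (id, ProductPeriod WITHOUT OVERLAPS)

-- Foreign key whose periods must be contained in the referenced product's periods
ALTER TABLE OrderLines
    ADD FOREIGN KEY (productId, PERIOD OrderPeriod)
        REFERENCES Products (id, PERIOD ProductPeriod)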

System-versioned tables
System-versioned tables are designed for keeping track of all data changes in a table, which makes it possible to reconstruct the state of the table at a certain point in time. The SQL:2011 standard defines that all update or delete statements need to store the current state of a row prior to updating or deleting it. This happens completely automatically, which means that the system has control over the start and end times of the system-time periods. In contrast to application-time period tables, the user is not able to change these periods. This has the advantage that the history of the data changes is, in principle, immutable, which prevents human failures and makes system-versioned tables very suitable for auditing purposes. Interestingly, the SQL:2011 standard does not define that the system-time periods should be stored in a separate table. Thus, similar to the application-time period table, a system-versioned table is created by adding columns to the current table. The code for creating the product history table 3.4 (p. 22) is:

CREATE TABLE Products(
    id INTEGER,
    startTime TIMESTAMP(12) GENERATED ALWAYS AS ROW START,
    endTime TIMESTAMP(12) GENERATED ALWAYS AS ROW END,
    name VARCHAR(20),
    price INTEGER,
    PERIOD FOR SYSTEM_TIME (startTime, endTime)
) WITH SYSTEM VERSIONING

As both current and historical data are contained in one table, SQL:2011 proposes names to separate the data: current system rows and historical system rows. Plainly, current system rows are the rows whose system-time period contains the current time, and all other rows are historical system rows. Since the system exclusively manages the historical system rows, users can execute update or delete statements only on current system rows. An update statement first copies the old row and sets the period end time of this copy to the timestamp at which the statement is executed. The current row is then updated according to the statement, its period start time is set to the execution timestamp, and its end time is set to the default value of ’9999-12-31 23:59:59’. A delete statement simply sets the period end time to the execution timestamp of the statement.

The major use case of system-versioned tables is so-called ’time travel’, which in this context means presenting the state of the table at a certain point in time or period. Therefore, SQL:2011 defines the syntax for system_time as of, which makes it possible to query the content of a table at any point in time or for any time period. For instance, if one wants to retrieve the content of table 3.4 as of July 1, 2009 at 14:00, one can use the following select statement:

SELECT id, startTime, endTime, name, price
FROM Products FOR SYSTEM_TIME AS OF TIMESTAMP '2009-07-01 14:00:00'

For time periods, SQL:2011 defines the expressions between ... and ..., which includes the period end time, and from ... to ..., which does not include the period end time. Thus, if one wants to retrieve all rows that were current system rows between July 1, 2009 at 09:00 and (including) August 1, 2009 at 17:00, the following code can be used:

SELECT id, startTime, endTime, name, price
FROM Products FOR SYSTEM_TIME BETWEEN
    TIMESTAMP '2009-07-01 09:00:00' AND
    TIMESTAMP '2009-08-01 17:00:00'

Clearly, system-versioned tables are easier to implement than application-time period tables because the system-time periods cannot be tampered with by users. Therefore, only basic constraints such as primary key or foreign key constraints need to be enforced on the current system rows. The historical system rows are immutable and thus do not need any protection in the form of constraints.

Bitemporal tables
As already defined in section 3.3, bitemporal tables are tables which include valid time and transaction time. This concept is also included in SQL:2011, which defines that a table is bitemporal if it includes both application-time and system-time periods. The implementation is rather straightforward, as it is merely a combination of both concepts. A bitemporal table which combines tables 3.4 and 3.5 can be implemented using this code:

CREATE TABLE Products(
    id INTEGER,
    ApplicationStart TIMESTAMP(12),
    ApplicationEnd TIMESTAMP(12),
    SystemStart TIMESTAMP(12) GENERATED ALWAYS AS ROW START,
    SystemEnd TIMESTAMP(12) GENERATED ALWAYS AS ROW END,
    name VARCHAR(20),
    price INTEGER,
    PERIOD FOR ProductPeriod (ApplicationStart, ApplicationEnd),
    PERIOD FOR SYSTEM_TIME (SystemStart, SystemEnd)
) WITH SYSTEM VERSIONING

Bitemporal tables are very useful if there is a need to record both the times during which a proposition is believed to be true in the modelled reality and the times during which a proposition was recorded in the database. This way, complex scenarios can be captured in the database without a potential loss of temporal data.
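A bitemporal query can then combine both time dimensions. The following sketch, using the syntax introduced above with illustrative dates, asks what the database recorded on 1 July 2014 about the price that applied in the modelled world on 1 July 2009:

SELECT name, price
FROM Products FOR SYSTEM_TIME AS OF TIMESTAMP '2014-07-01 00:00:00'
WHERE id = 1
  AND ProductPeriod CONTAINS TIMESTAMP '2009-07-01 00:00:00'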

4.2. Microsoft SQL Server implementation

In 2016, Microsoft published its database management system SQL Server 2016, which supports temporal tables for the first time. This functionality is based on the SQL:2011 standard, but it has one significant restriction: SQL Server 2016 supports only system-versioned tables. This is also the case for the current 2017 version [28]. As mentioned in section 4.1, application-time period tables are considerably more complex, and the effort needed to implement this component is therefore higher, which might be the reason why Microsoft decided not to support application-time period tables. Hence, if one needs to store application-time periods in a database, one either needs to implement this functionality oneself or has to maintain the time periods manually.
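One way to approximate application-time periods in SQL Server is to maintain them as ordinary columns. The following is a minimal sketch (table, column, and constraint names are illustrative); rules such as non-overlapping periods would still have to be enforced by application logic or triggers:

CREATE TABLE ProductsWithValidTime (
    id        INT       NOT NULL,
    ValidFrom DATETIME2 NOT NULL,   -- manually maintained start of the valid-time period
    ValidTo   DATETIME2 NOT NULL,   -- manually maintained end of the valid-time period
    name      VARCHAR(20),
    price     INT,
    CONSTRAINT CK_ValidPeriod CHECK (ValidTo > ValidFrom)
);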

System-versioned tables
In contrast to the SQL:2011 standard, a system-versioned table is implemented as two separate tables: a current table and a history table. In both tables, there need to be two columns of type datetime2 which are maintained by the system and therefore cannot be changed by the user. When creating a system-versioned table, one only needs to create the current table and define that it is system-versioned. The history table is then created automatically by the system. The syntax for implementing a system-versioned table is similar to the SQL:2011 standard, as can be seen in the following code, which implements the product history table 3.4 of CheeseHut:

CREATE TABLE Products
(
    id INT CONSTRAINT PK_Products PRIMARY KEY,
    startTime DATETIME2 GENERATED ALWAYS AS ROW START NOT NULL,
    endTime DATETIME2 GENERATED ALWAYS AS ROW END NOT NULL,
    name VARCHAR(20),
    price INT,
    PERIOD FOR SYSTEM_TIME (startTime, endTime)
)
WITH (SYSTEM_VERSIONING = ON);

In the history table, only historical data is stored, which means that when creating and filling a system-versioned table, the associated history table is empty at first. Only after updating or deleting a row in the current table is the old row copied to the history table. This has the advantage that current data and historical data are strictly separated and that the current table behaves similarly to a normal, non-temporal table. This way, applications that do not support system-time periods can continue to be used without any adjustments.

In accordance with the SQL:2011 standard, users can only insert, change, or delete current data. In terms of Microsoft’s implementation, this means that users can only modify the content of the current table. Hence, the history table is fully maintained by the system. For instance, when a user executes an insert statement, a new row is added to the current table with the system-time period set by the system. The history table does not change, as there is no historical data to be added or updated. However, when a user executes an update or delete statement, a copy of the corresponding row is stored in the history table, and the row is updated or deleted in the current table. The system-time periods are updated according to the rules defined by the SQL:2011 standard. Example 4.1 illustrates the behaviour of system-versioned tables as implemented in SQL Server.

Example 4.1
CheeseHut’s IT department decided to use SQL Server to implement their products table as a system-versioned table to keep track of the data changes. On 20 May 2019, the department creates and fills the table. The history table is empty as there have not been any changes yet. Table 4.2 shows the current table of the new product table.

Row  id  startTime                endTime                  name    price
1    1   2019-05-20 07:50:45.045  9999-12-31 23:59:59.999  Young   6
2    2   2019-05-20 07:51:12.268  9999-12-31 23:59:59.999  Mature  8
3    3   2019-05-20 07:51:55.651  9999-12-31 23:59:59.999  Old     11

Tab. 4.2.: System-versioned current table

Two days later, the IT department notices that the price of the mature cheese should be 9 euros instead of 8 euros. They change the product table accordingly, and the current table and history table are updated as can be seen in tables 4.3 and 4.4.

Row  id  startTime                endTime                  name    price
1    1   2019-05-20 07:50:45.045  9999-12-31 23:59:59.999  Young   6
2    2   2019-05-22 11:02:18.492  9999-12-31 23:59:59.999  Mature  9
3    3   2019-05-20 07:51:55.651  9999-12-31 23:59:59.999  Old     11

Tab. 4.3.: System-versioned current table

Row  id  startTime                endTime                  name    price
1    2   2019-05-20 07:51:12.268  2019-05-22 11:02:18.492  Mature  8

Tab. 4.4.: System-versioned history table
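For reference, the price correction in example 4.1 corresponds to an ordinary update statement; copying the old row to the history table and stamping the periods is done entirely by the system:

UPDATE Products SET price = 9 WHERE id = 2;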

Regarding the querying of temporal data, SQL Server recognizes the clause for system_time and the corresponding sub-clauses as of, between ... and ..., and from ... to ... that were defined in SQL:2011. Additionally, Microsoft implemented the sub-clauses contained in and all. The sub-clause contained in can be used to retrieve all data that has a start time and an end time within the specified boundaries. The sub-clause all returns all rows, including current data and historical data, which would produce a combined table such as table 3.4.
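A sketch of the two additional sub-clauses (the boundary timestamps are illustrative):

-- All current and historical rows in one result set, as in table 3.4
SELECT id, startTime, endTime, name, price
FROM Products FOR SYSTEM_TIME ALL
ORDER BY id, startTime;

-- Only rows whose complete period lies within the boundaries
SELECT id, startTime, endTime, name, price
FROM Products FOR SYSTEM_TIME CONTAINED IN
    ('2019-05-20 00:00:00', '2019-05-23 00:00:00');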

4.3. Other implementations

There are a few other database management systems (DBMS) that offer built-in support for temporal tables. In this section, we present two exemplary DBMS to give a brief insight into the design choices made by other DBMS vendors. One of them is IBM, which implemented system-versioned tables, application-time period tables, and bitemporal tables with DB2 version 10, released in 2012. IBM followed the specifications of the SQL:2011 standard closely and only deviated from them in some details [7]. For instance, IBM uses the term business time instead of application time, and just as Microsoft, IBM decided to separate current rows and historical rows by means of a current table and a history table. In contrast to the implementation in SQL Server, however, DB2 does not create the history table automatically. Hence, the user first needs to create a current table and a history table and then alter the current table to be system-versioned (a sketch of this two-step approach is given at the end of this section).

Another database management system supporting temporal tables is Teradata, which implemented the functionality based on the TSQL2 model before the SQL:2011 standard was published [30]. While the essential functions correspond with the standard, there are some technical differences in the implementation. One significant distinction is the use of a period data type, which consists of two dates or timestamps. This design choice makes sense for temporal tables, but it has consequences that are worthy of attention. That is to say, the introduction of a new data type does not only affect the database language itself but also dependent programming languages and other technologies. Therefore, a new data type could have a negative influence on adoption, which is why SQL:2011 added period definitions instead of a new data type [5]. Teradata adopted this approach after the publication of the SQL:2011 standard and made its database management system compliant with the standard [30].
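The two-step approach of DB2 can be sketched roughly as follows; the syntax is approximated from IBM’s documentation for DB2 10 (for instance, the additional transaction-start-id column), so details may differ between versions and platforms:

-- Current table with system-time period columns (DB2 uses ROW BEGIN/ROW END)
CREATE TABLE Products (
    id        INT NOT NULL PRIMARY KEY,
    name      VARCHAR(20),
    price     INT,
    sys_start TIMESTAMP(12) NOT NULL GENERATED ALWAYS AS ROW BEGIN,
    sys_end   TIMESTAMP(12) NOT NULL GENERATED ALWAYS AS ROW END,
    ts_id     TIMESTAMP(12) NOT NULL GENERATED ALWAYS AS TRANSACTION START ID,
    PERIOD SYSTEM_TIME (sys_start, sys_end)
);

-- Unlike SQL Server, the history table is created explicitly and then linked
CREATE TABLE Products_History LIKE Products;
ALTER TABLE Products ADD VERSIONING USE HISTORY TABLE Products_History;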

5 Temporal Databases compared to Conventional Data Warehouses

5.1. Prototype A: Conventional data warehouse ...... 40 5.1.1. Architecture ...... 40 5.1.2. Insights ...... 40 5.2. Prototype B: Data warehouse with system-versioned tables 41 5.2.1. Architecture ...... 41 5.2.2. Insights ...... 42 5.3. Assessment results ...... 42 5.3.1. Performance ...... 43 5.3.2. Costs ...... 44 5.3.3. Data integrity ...... 45 5.3.4. Maintainability ...... 46 5.3.5. Acceptance ...... 47 5.3.6. Three issues of the classical ETL process ...... 47

In this section, we analyze the potential of a temporal database embedded in a data warehouse architecture compared to a conventional data warehouse. As part of this research, we made two prototypes representing these two concepts. We use the prototypes as a method to evaluate the use of system-versioned tables as an alternative to conventional data warehouses with a classical ETL process. One section each is devoted to the two prototypes: Prototype A, which implements a conventional setup, and Prototype B, which implements a data warehouse architecture using a temporal database. In each of the two sections, the architecture of the prototype is presented and the insights gained from making and using the prototype are discussed. In the third section, the results of the assessment are presented in terms of the five assessment criteria and the three issues of the classical ETL process.


5.1. Prototype A: Conventional data warehouse

5.1.1. Architecture

This prototype has a classical data warehouse architecture consisting of a source database, a staging database, a data warehouse, and a business intelligence dashboard. A high-level illustration of the architecture of this prototype is given in figure 5.1. The source database is filled with data using the AdventureWorks backup file as provided by Microsoft [22]. The transfer between the source database and the staging database is done using transactional replication. This means that a snapshot of the source database is made at the beginning and copied to the staging database. Afterwards, every change made to the source database is immediately applied to the staging database. The function of the staging database is to provide a non-operational space for the tables that are deemed useful for data analytics. For instance, in this prototype, only 31 of the 68 tables in the source database are transferred to the staging database.

The data in the staging database is transformed into an analyzable format, which mainly means that the data is inserted into dimension and fact tables. This is done using a pipeline of copy activities within Data Factory (see appendix A.1). These activities first delete the content of all tables and then insert the current content. A second pipeline then adds the temporal dimension to the data, which means that temporal attributes are added and the history of the data is preserved (see appendix A.2). In this prototype, we added the attributes ’ETLDate’ and ’SysEndDate’. ’ETLDate’ is filled with the date and time of the execution of the ETL process and is similar to the ’SysStartTime’ of system-versioned tables. The attribute ’SysEndDate’ has by default the value ’9999-12-31 23:59:59.000’ and is set to the ETL execution time if a row in the source database has been changed or deleted. Analogous to the behaviour of system-versioned tables, a new row is added to the data warehouse in case a row has been updated. However, it has to be noted that this functionality was added manually to this prototype to allow for a fair comparison between the two prototypes.

Once the analyzable data is loaded into the data warehouse, the data is imported into Power BI using the SQL Server database connector. This means that a snapshot of the data warehouse is made at the time of the first import. Afterwards, it is possible to refresh the data either on demand or on a schedule. The imported data in Power BI is then used to build business intelligence dashboards using various data visualization tools. When the dashboard is ready, it is published to Power BI on the web, which makes it accessible online to business users. These users can then use the data visualized in the BI report to support their decision-making processes.

Fig. 5.1.: Architecture of prototype A
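The actual pipeline code is in appendix A.2; purely as an illustration of the manually added temporal attributes described above, a dimension table in the data warehouse of prototype A could look roughly like this (column names other than ETLDate and SysEndDate are assumptions):

CREATE TABLE DimProduct (
    ProductKey INT IDENTITY(1,1) PRIMARY KEY,   -- surrogate key of the dimension
    ProductID  INT NOT NULL,                    -- business key from the source
    Name       VARCHAR(50),
    Price      INT,
    ETLDate    DATETIME2 NOT NULL,              -- filled with the ETL execution time
    SysEndDate DATETIME2 NOT NULL
        DEFAULT '9999-12-31 23:59:59.000'       -- closed when the source row changes
);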

5.1.2. Insights

While making and using this prototype, we gathered some insights that we discuss in the following. First, it was surprising that there is still a lot of manual coding work necessary to implement the ETL process. After all, ETL is the industry standard for the data flows within data warehouse architectures. On the one hand, Data Factory supports the user immensely by offering a graphical interface and integrated connectors to many data sources. On the other hand, if the ETL process involves more sophisticated actions than copying data from one table to another, there is no way around writing SQL queries. These SQL queries get rather advanced if one wants to implement an incremental load (see appendix A.2), which is necessary if the history of the data warehouse should be kept. Also, adding temporal attributes during the ETL process is part of the industry’s best practices, but there is no built-in solution for it. Hence, while Data Factory delivers a clean and simple user interface, it does not relieve the developer from manually implementing standard operations.

Furthermore, during the implementation of the prototype, it became apparent how many steps and different technologies are used in a conventional data warehouse architecture. While each part of the architecture has its own function and purpose, it is questionable whether it is necessary to have three databases which are all based on the same data. In principle, one could query all data from the operational database and possibly other sources directly using Power BI’s built-in connectors. Indeed, there are strong arguments against this approach, such as data size limits and performance issues. But as technology continually evolves and data storage becomes more affordable, it may be a possibility in the near future.

5.2. Prototype B: Data warehouse with system-versioned tables

5.2.1. Architecture

This prototype makes use of system-versioned tables as an alternative to a conventional data warehouse. An illustration of the architecture is given in figure 5.2. The prototype uses the same source database as prototype A and replicates the significant data to the temporal database. The temporal database has the function of both the staging database and the data warehouse, as the transformations and the storage of the analyzable data are done in this database. The technology behind it, however, is entirely different from the technology used in prototype A. The temporal database consists of the same tables as the staging database in prototype A, with the difference that the tables are system-versioned. This means that the addition of the temporal attributes as well as the incremental load are implemented by the system and therefore do not require manual coding.

The data is then transformed into dimension and fact tables using SQL views, which can be described as virtual tables with aggregated data from tables in the database. The significant difference between this approach and the use of Data Factory is that the transformation happens on demand and directly in the database. The views are created to show the same data as the dimension and fact tables in prototype A. They aggregate all data, including the historical data, by means of the for system_time all command (see section 4.2). The temporal attributes are taken from the system-versioned tables, which means that they reflect the times at which the data has been added, changed, or deleted in the system-versioned tables as a result of the transactional replication. The data in the views can be imported into Power BI in the same way as with prototype A.
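A minimal sketch of such a view (the actual definitions are in appendix A.3; the schema and column names are illustrative) shows how for system_time all exposes the full history including the temporal attributes:

CREATE VIEW dbo.DimProduct AS
SELECT p.ProductID,
       p.Name,
       p.ListPrice,
       p.SysStartTime,   -- temporal attributes maintained by the system
       p.SysEndTime
FROM dbo.Product FOR SYSTEM_TIME ALL AS p;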

Fig. 5.2.: Prototype B: Architecture

5.2.2. Insights

A rather straightforward insight gained from making this prototype is that it requires significantly less coding work than prototype A. The only code that needed to be written was the select statements for the views (see appendix A.3). Also, the system not only handles the temporal attributes and the history completely, but it also offers handy time-related commands, which make it possible to analyze the data within the database easily. Surprisingly, the temporal commands also work on the views, given that at least one temporal table is included. For business users, however, this does not add any functionality, as it is not possible to use the commands within Power BI. This is unfortunate, as a built-in function for point-in-time analysis would surely be of great use. Therefore, the impression arises that the potential of temporal databases has not been fully unleashed, possibly due to the restrained adoption in the industry.

5.3. Assessment results

As stated in section 2.2, we compare the prototypes on two configurations using five assessment criteria. Also, we determine whether prototype B solves the three detected issues of the classical ETL process. In the following, the results of the assessment are presented and analyzed. The exact results and techniques used for the assessment can be found in appendices C.1 and C.2.

5.3.1. Performance

The performance of the prototypes has been assessed by measuring the time of the data flow process, which we split into two parts:
• Transformation of the data into an analyzable format and loading it into the data warehouse.
• Importing the data from the data warehouse into Power BI.
In prototype A, the transformation and loading of the data are implemented in two pipelines in Data Factory. Therefore, we measure the execution times of these pipelines. In prototype B, however, the transformation happens inside the database, and the data does not need to be loaded because the temporal database functions as a data warehouse. For testing the performance of prototype B, we measured the time to query all views in the database. The replication process is left out of account, as there is no reliable time measurement method for it and it only takes a few seconds. In table 5.1, the average results of the performance measurements for both configurations are given.

Prototype  Part                        Time (low cfg.)  Time (high cfg.)
A          Transformation and Loading  44:11            15:57
B          Transformation and Loading  06:14            01:21
A          Power BI import             00:55            00:25
B          Power BI import             11:58            02:43

Tab. 5.1.: Performance assessment results (mm:ss)

Evidently, the performances of the prototypes differ significantly from each other on the low configuration. While the transformation and loading part is a major bottleneck for prototype A, the import into Power BI takes less than one minute. The opposite applies to prototype B: the Power BI import takes almost twelve minutes, while the transformation and loading part is done in less than one third of the time of prototype A. The reason for these differences is that the transformation process in prototype A is handled completely by the Data Factory pipelines. Hence, the data is already stored in an analyzable format in the data warehouse. In prototype B, however, the transformation is implemented in the views, which has the consequence that the transformation needs to be done every time the data is queried. For instance, when one wants to import the data into Power BI, the data from prototype A only needs to be loaded from the database, while the data from prototype B first needs to be transformed and is then loaded from the database.

Regarding the performances on the high configuration, it is apparent that the performance of prototype B improves a great deal. The times consumed by the transformation and loading part and the Power BI import are reduced by factors of 4.6 and 4.4 respectively. Interestingly, the increase in performance is significantly lower for prototype A, with time reduction factors of 2.8 for the transformation and loading process and 2.2 for the Power BI import.

This may be caused by the data transition time in the pipelines, which is not influenced by the performance of the SQL databases. Also, the performance of the Power BI import was already rather good on the low configuration, which means that an increase in hardware performance no longer has as much leverage.

5.3.2. Costs

The costs of the prototypes have been assessed using the pricing calculator provided by Microsoft [24]. Prototype A makes use of a virtual machine running SQL Server, two Azure SQL databases, and two pipelines in Data Factory. Prototype B makes use of the same virtual machine and one Azure SQL database. In principle, the difference between the lower and the higher configuration lies only in the performance setup of the Azure SQL databases. However, Data Factory activities are billed in consumed time units, and the consumed time is lower on the high configuration, as mentioned in the performance assessment. The costs per service and in total for prototypes A and B on both configurations are given in tables 5.2 and 5.3 respectively. For prototype A, we give the costs for both a daily ETL execution and an hourly ETL execution to emphasize the effect of an increase in the ETL execution frequency on the costs.

Service                          Costs (low cfg.)  Costs (high cfg.)
Virtual machine with SQL Server  338.69            338.69
2 SQL databases                  8.26              49.64
Daily ETL execution              18.97             6.69
Hourly ETL execution             455.45            160.61
Total daily ETL execution        365.93            395.02
Total hourly ETL execution       802.40            548.94

Tab. 5.2.: Monthly costs of prototype A (in euros)

Service                          Costs (low cfg.)  Costs (high cfg.)
Virtual machine with SQL Server  338.69            338.69
SQL database                     4.13              24.82
Total                            342.82            363.51

Tab. 5.3.: Monthly costs of prototype B (in euros)

Clearly, the virtual machine causes relatively high expenses. For instance, the costs for the SQL databases and the daily ETL run amount to less than 10 per cent of the total costs of prototype A on the low configuration. This has to do with the reserved storage space and computation power of virtual machines. Furthermore, the total costs of prototype B are slightly lower than the total costs of prototype A with daily ETL execution. While this price difference is negligible, the price difference between prototype B and prototype A with hourly ETL execution is tremendous. This shows that an increase in the frequency of ETL execution goes hand in hand with a significant increase in total costs. However, it is noteworthy that the total expenses for prototype A with hourly ETL execution are lower on the high configuration than on the low configuration.

This is caused by the faster execution of the Data Factory pipelines due to the higher performance of the SQL databases.

5.3.3. Data integrity

As described in section 2.2, we tested the prototypes on their data integrity by assessing the differences in the databases after executing three sets of create, update, and delete transactions on the source database. For prototype A, we evaluate the database at each step before and after an ETL execution.

Step 1: Inserting data
After the create statements have been executed on the source database, the transaction set is replicated to the staging database of prototype A and the temporal database of prototype B. As the views implemented in prototype B retrieve the new data directly within the database, the inserted data is added almost immediately and can then be loaded into Power BI. If one were to import data into Power BI using prototype A, the new data would not be shown, as the data is only added to the staging database. Thus, the data is only visible after an ETL process has been run. The time attributes in prototype A are less accurate than the ones in prototype B, as they only get added during the ETL process.

Step 2: Updating data
Likewise, any update statements executed on the source database are applied to prototype B almost directly, while they are applied to prototype A only after an ETL process has been executed. Apart from the time attributes, there are also differences in the data stored in the data warehouse of prototype A and the temporal database of prototype B. Two of the update statements refer to the same object, which means that only the last state of that object is loaded into the data warehouse of prototype A. This exemplifies the aforementioned issue that changes happening in between two ETL iterations get lost. In contrast, prototype B processes both update statements and records the first one in the history table and the second one in the current table. Surprisingly, the first record is not shown in the view, which is most likely a bug in SQL Server’s implementation. A possible cause is that the two records have the same SysStartTime because the update statements were replicated to the temporal database in one batch. In practice, however, this error is negligible, since more than one update of the same object in a short period of time is rather unusual.

Step 3: Deleting data
Similar observations have been made with regard to the effects of the delete statements. Rows that have been created and deleted in between two ETL iterations are not shown in the data warehouse or the dashboard of prototype A. This is a major issue for the data integrity of prototype A, as whole rows may be deleted by mistake and cannot be restored. Prototype B, however, stores all deleted rows directly in the history table. Since the history table is immutable for users, it is, without any changes to the database, impossible to erase any data completely.

5.3.4. Maintainability

The lines of code written for each prototype can give an indication of their maintainability. For the implementation of prototype A, T-SQL queries needed to be written to transform the data into an analyzable format and to incrementally load the data with temporal attributes into the data warehouse. For prototype B, the views in the temporal database needed to be programmed using T-SQL. In table 5.4, the lines of code per table and prototype are given. Prototype A is split into the first and the second Data Factory pipeline.

Table                  A: Transformation  A: Incremental load  B
DimAddress             15.00              97.00                45.00
DimCurrency            4.00               82.00                7.00
DimCustomer            29.00              116.00               54.00
DimDepartmentGroup     4.00               40.00                7.00
DimEmployee            49.00              168.00               78.00
DimGeography           16.00              88.00                43.00
DimProduct             44.00              118.00               65.00
DimProductCategory     5.00               76.00                8.00
DimProductSubCategory  9.00               81.00                39.00
DimReseller            28.00              93.00                51.00
DimSalesReason         6.00               79.00                9.00
DimSalesTerritory      10.00              84.00                39.00
FactInternetSales      81.00              140.00               80.00
FactResellerSales      83.00              144.00               81.00
FactSalesQuota         6.00               86.00                9.00
Total                  389.00             1492.00              615.00

Tab. 5.4.: Lines of code per table and prototype

As can be seen, the lines of code written for prototype A exceed those of prototype B by a great deal. This is primarily caused by the complexity of the incremental load, which is necessary for the implementation of the temporal dimension. For instance, if one row in the source database has been updated, both an update and an insert statement need to be executed on the data warehouse. The update statement changes the SysEndDate of the obsolete row to the current date, and the insert statement inserts the new, updated row with the current date as ETLDate. These operations consume considerable computation power because the new row needs to be compared entirely with the old row by means of hash values to check for any differences. Also, the ETLDate of the new row needs to match the SysEndDate of the old row in order to avoid any temporal gaps. Therefore, the SysEndDate of the old row needs to be selected in the insert statement, which adds additional complexity to the code.

In contrast, the lines of code written for prototype B amount to about one third of the lines of code of prototype A. This is because considerably less manually

implemented functionality required for prototype B. For instance, the incremental load is done entirely by the system. Furthermore, it has to be noted that all code for prototype B is stored within the database, which makes the maintenance easier because all changes can directly be tested on the database. Concerning prototype A, all code is stored in the Data Factory activities, which complicates any adaptations for several reasons. First, the code is split within the activity into a source query and a pre-copy script. As all tables have at least one activity in each of the two pipelines, the code for one table is divided into four pieces. Hence, if the schema of a table changes, one would need to make adaptations at four different locations within Data Factory. Furthermore, it is not recommended to test new code by running the pipelines in Data Factory. Instead, the code should be tested directly on the database, which means that there are additional steps required for testing.
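To give an impression of the pattern described above, the following is a condensed, illustrative sketch of the incremental load for one dimension table (the real code is in appendix A.2; the schema names staging and dw, the column list, and the hash expression are assumptions):

DECLARE @ETLDate DATETIME2 = SYSUTCDATETIME();  -- in the prototype, set by the pipeline

-- 1. Close rows whose source version has changed (compared via hash values)
UPDATE d
SET d.SysEndDate = @ETLDate
FROM dw.DimProduct AS d
JOIN staging.Product AS s ON s.ProductID = d.ProductID
WHERE d.SysEndDate = '9999-12-31 23:59:59.000'
  AND HASHBYTES('SHA2_256', CONCAT(s.Name, '|', s.Price))
   <> HASHBYTES('SHA2_256', CONCAT(d.Name, '|', d.Price));

-- 2. Insert the new versions; ETLDate matches the SysEndDate of the closed rows
INSERT INTO dw.DimProduct (ProductID, Name, Price, ETLDate, SysEndDate)
SELECT s.ProductID, s.Name, s.Price, @ETLDate, '9999-12-31 23:59:59.000'
FROM staging.Product AS s
WHERE NOT EXISTS (
    SELECT 1
    FROM dw.DimProduct AS d
    WHERE d.ProductID = s.ProductID
      AND d.SysEndDate = '9999-12-31 23:59:59.000'
);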

5.3.5. Acceptance

The acceptance of prototype B is tested by determining the differences in the prototypes’ Power BI user experience. A pronounced difference is the time that the data import takes. The data import of prototype B, especially on the low configuration, takes so long (see table 5.1) that it may be an issue for the acceptance of system-versioned tables as an alternative to classical data warehouses. However, Power BI offers two solutions to this problem. First, it is possible to schedule the data refresh so that the business user does not need to do it. This solution, however, has the disadvantage that the data is not always up to date. Another approach is to use the incremental refresh function, which only loads the new data instead of all data. Unfortunately, this function is only available in the Power BI Premium subscription, which costs around 5000 US dollars per month [23].

Regarding the dashboards, no significant difference between the prototypes could be observed, given that the data in the data warehouse of prototype A is up to date. The data is entirely equivalent except for the temporal attributes, which is decisive for the acceptance, as business users do not need to adapt their work routines. However, users should be aware of the different meaning of the temporal attributes. While ETLDate and SysEndDate of prototype A reflect the state of the data warehouse, the SysStartTime and SysEndTime attributes of prototype B concern the state of the source database¹. Assuming that the business user rather wants to see the data from the temporal dimension of the source database, it can be concluded that this difference has a positive impact on the acceptance of prototype B.

5.3.6. Three issues of the classical ETL process

The traditional ETL process has several problems which constitute considerable limitations for the conventional approach of data warehousing implemented in prototype A.

¹ To be precise, the attributes reflect the state of the temporal database. However, as the delay caused by the replication is very low, we assume the states to be equivalent.

In the following, we analyze to what extent prototype B solves the three issues defined in section 2.2.

Lost transactions
All transactions replicated to the temporal database of prototype B are in principle immutable. If a row gets updated, the system changes the SysEndTime of the current row and inserts a new row with the updated data. If a row gets deleted, the system sets the SysEndTime to the current time. This way, all changes to the source database are kept in the temporal database. Even in the event of a downtime of the temporal database, the source database keeps any transactions in its logs and replicates them as soon as the temporal database is online again. Hence, prototype B prevents the loss of transactions very effectively.

Inaccurate temporal attributes
The temporal attributes in prototype B are maintained entirely by the system. The attributes are added or changed as soon as a replicated transaction from the source database is executed on the temporal database. The delay between the execution of a transaction on the source database and the execution of the replicated transaction on the temporal database amounts to only a few seconds. In contrast, the temporal attributes in prototype A can have a delay of up to 24 hours, given a daily ETL execution. Therefore, it can be concluded that temporal databases offer a satisfactory solution to this issue.

No real-time data processing
Prototype B allows for analysis with less delay than prototype A, but one cannot call this real-time data processing. Even on the high configuration, the prototype has a delay of about three minutes (see table 5.1). However, the increase in performance from the low to the high configuration indicates that it may be possible to reduce the delay further. Furthermore, the incremental refresh feature of Power BI Premium could reduce the data refresh time tremendously. Hence, prototype B, in its present condition, does not support real-time data processing, but it is reasonable to believe that nearly real-time data processing is possible with other setups involving higher costs.

6 Discussion

6.1. Interpretation of the findings ...... 49 6.2. Limitations ...... 52

6.1. Interpretation of the findings

This research investigated the potential of temporal databases for the benefit of data analytics. We defined six sub-questions to answer this main research question. The first sub-question concerns the main concepts of temporal databases. In the literature, we found that the concepts of transaction time and valid time are an integral part of the notion of temporal databases. In particular, a lot of attention is drawn to the use cases of valid time and the issues connected with it. For instance, Date et al. [14] addressed two problems of the implementation of valid time in temporal databases, namely the redundancy problem and the contradiction problem. Surprisingly, this attention for the valid time attribute is not reflected in Microsoft’s implementation of temporal tables.

On account of the second sub-question, regarding the implementations of temporal databases, we found that Microsoft did not implement a valid time attribute in SQL Server, in spite of the attribute being defined in the SQL:2011 standard. On the one hand, one might argue that this is not particularly surprising, because the implementation of valid time is complex since it lies in the control of the user. On the other hand, other DBMS vendors such as IBM and Teradata did implement the valid time attribute, so it is reasonable to believe that this was a conscious decision by Microsoft rather than an implementation issue. Possibly, Microsoft is waiting for a broad establishment of its system-versioned tables in the industry before adding additional functionality.

Furthermore, we investigated the role of temporal data in data analytics and data warehouses on account of the third sub-question of this research. It has been found that data analytics can benefit significantly from a complete history of data and valid times. This, however, is only applicable to the analysis of structured data, as the notion of temporal databases is inextricably linked to relational database management systems.


Considering the increasing demand for big data analytics, the question arises whether it is possible to adapt the concept of temporal databases for use with massive amounts of unstructured data. After all, the idea behind temporal databases was conceived in the late 1980s, and there have not been any fundamental changes to it since. At that time, megabytes of structured data were collected, not petabytes of unstructured data.

Concerning data warehousing, the temporal dimension is a very important, if not the most important, element of data warehouses. A major function of data warehouses is to store a complete history of the analyzable data. Therefore, the use of temporal databases seems like a natural fit for this purpose, which raises the question of why this is not done in practice. Presumably, there is not enough awareness of the existence of temporal databases. Also, the classical approach to load the data into the data warehouse, ETL, typically adds the temporal dimension only at the end of the whole process. This entails several issues which can partly be solved by using temporal databases as an alternative.

The last three sub-questions concern the assessment of the concepts of temporal databases and conventional data warehouses. For that, we made a prototype with a traditional data warehouse architecture and a prototype with a temporal database. These prototypes have been compared using five assessment criteria: performance, costs, data integrity, maintainability, and acceptance. This part of the research aimed at gaining insights into the potential of a temporal database embedded in a data warehouse architecture for the purposes of data analytics. Conventional data warehouses have been found to have a rather cumbersome data flow with much manual coding work involved to implement the temporal dimension. A temporal database, however, supports the temporal dimension by design and therefore saves a great deal of manual coding.

Regarding the assessment, the results indicate that a data warehouse architecture using a temporal database has an overall better performance than a conventional data warehouse. This result may be explained by the fact that temporal databases have built-in, optimized support for incremental loading, which consumes much computation power in conventional data warehouses. Furthermore, a conventional data warehouse needs to transfer data between four systems, namely the source database, the staging database, the data warehouse, and the business intelligence tool. A temporal database can function as both staging database and data warehouse, which avoids delays caused by the data transfer.

However, the results of the assessment suggest that the direct import of data from the temporal database to the business intelligence tool consumes much more time than the import from the conventional data warehouse. While the ETL process transforms the data and stores the analyzable data in the data warehouse, the temporal database is transformed on demand when importing the data into the BI tool. This constitutes an undesirable bottleneck, as the business user needs to wait longer for the analysis of the data. Nonetheless, it has been found that the data import time decreases significantly with higher hardware performance. Also, an incremental load from the temporal database to the business intelligence tool could lower the delay, as not all data would have to be queried again.

Concerning the costs, we could not establish a significant difference between the two concepts. A conventional data warehouse uses one more database for staging and a data integration solution. Naturally, the costs of the data integration depend on the frequency of the ETL execution. For instance, if a company requires hourly updated data in the BI tool, the costs of a conventional data warehouse increase notably. It has to be noted that an hourly update of the data is still inferior to the on-demand update of the temporal database. However, it has been found that the costs of the ETL process can be reduced by increasing the performance of the staging database and the data warehouse. This, however, is only applicable to data integration solutions that are billed based on the execution time.

Regarding data integrity, temporal databases and conventional data warehouses differ greatly in the way the data history and the temporal attributes are handled. In a temporal database, the history table and the temporal attributes are automatically maintained as soon as transactions are transferred to the database. In a conventional data warehouse, these operations are done in the transformation part of the ETL process. As a consequence, only the last state of a row in the source database is transmitted to the data warehouse. Also, rows that are created and deleted in between two ETL iterations are not stored in the data warehouse. In contrast, a temporal database saves a complete history of all transactions transferred to it. Hence, according to the results of the assessment, temporal databases have clear advantages over conventional data warehouses in terms of data integrity.

The maintainability has been found to be fundamentally different between the two concepts. The implementation of the temporal dimension of a conventional data warehouse requires much manual coding, while temporal databases support the temporal dimension by design. It can, therefore, be assumed that changes to the conventional data warehouse require more effort than changes to the temporal database. Furthermore, conventional data warehouses rely on an external data integration solution which contains the whole logic of the temporal dimension. This means that any changes to the code cannot be tested directly, but first need to be extracted and then tested on the staging database and the data warehouse. In contrast, the testing can be done within the temporal database, which makes any changes easier to verify. Especially for data analytics, this is increasingly important due to the fast-changing requirements.

With regard to the acceptance of temporal databases, we identified the poor performance of the data import into the BI tool as a significant issue for the experience of the business users. However, as discussed above, there are solutions to this matter. Furthermore, the data input for the BI tool is equivalent except for the temporal attributes. This is vital for the acceptance of temporal databases, as business users do not need to change their routines. The temporal attributes have a different meaning, as the attributes of the conventional data warehouse architecture refer to the state of the data warehouse itself, while the attributes of the temporal database reflect the state of the source database. Since the purpose of the source database is to model the real world, the attributes of the temporal database are more adequate for the aims of data analytics. This advantage may foster the acceptance of temporal databases as an alternative for data warehouses.

The results of the prototype assessment show that temporal databases solve two of the three issues of the classical ETL process that we identified. In contrast to the classical approach, temporal databases reliably store all transactions transferred to them. Also, as mentioned before, the temporal attributes of temporal databases are more accurate than those of conventional data warehouses. However, we found that temporal databases do not enable real-time data processing, as there is a considerable delay between the execution of a transaction on the source database and the presentation of that transaction in the BI tool. Nevertheless, in general, we see a high potential for temporal databases to enable nearly real-time data processing, as there are the aforementioned solutions to improve the data import performance.

6.2. Limitations

There are some limitations of this research that should be noted. First, prior research relevant to this thesis was partly very limited. Little research has been done on the role of temporal data within data analytics and data warehouses, which limited our research on sub-question Q3. Furthermore, the evaluation of data warehouses has not been researched sufficiently, which was a hurdle for finding adequate assessment criteria and methods. In particular, there is a need for a maintainability prediction model for SQL code. Moreover, the literature on data analytics and data warehouses used in this research was mostly non-academic due to a lack of current, peer-reviewed articles in these domains.

Second, the prototypes made for comparing temporal databases with conventional data warehouses were limited by the financial budget and by the client’s focus on Microsoft products. With a higher financial budget, it would have been possible to make the prototypes more realistic and thereby increase the credibility of the assessment. For instance, higher configurations could have been used in order to have better evidence for the performance assessment. Furthermore, the restriction of the prototypes to Microsoft products meant that we could not assess the practical use of bitemporal databases for data analytics, as Microsoft did not implement the valid time attribute (see section 4.2).

Third, it has to be noted that the selection of the criteria and the technology used for the prototypes limits the generalizability of the assessment of the concepts in some aspects. The criteria were carefully selected from generally accepted assessment criteria in computer science and data analytics literature. However, the final selection of the criteria and measurement methods depends on the researcher, which may cause different findings for sub-questions Q4 and Q5. Also, the use of a different database management system for the prototypes may have an influence on the performance and cost assessments in particular.

7 Conclusion


7.1. Conclusions

This research set out to evaluate the potential of temporal databases for the application in data analytics. The literature review identified the temporal attributes transaction time and valid time as the main concepts of temporal databases. Transaction time refers to the period at which a certain statement was stored in a database, while valid time expresses the period at which a statement was, is, or will be true in the real world. The SQL:2011 standard suggested the implementation of these temporal attributes, which has been put into practice by a few DBMS vendors. For instance, IBM implemented both temporal attributes, whereas Microsoft only implemented transaction time.

It has been found that data analytics can benefit from the inclusion of temporal data, since this additional information provides a vital source for analysis. Data warehouses are intended to include temporal data, which is commonly implemented manually within the ETL process. This approach has the downside that the temporal dimension is only added at the last step of the data flow within the data warehouse architecture. The consequences are that transactions happening in between two ETL iterations are lost and the temporal attributes are delayed. Also, real-time data processing is not possible due to the delay caused by the ETL process. Temporal databases support the temporal dimension by design, which motivates the evaluation of temporal databases as an alternative to conventional data warehouses.

We compared the concepts of temporal databases and conventional data warehouses by means of two prototypes and five assessment criteria. The results of the assessment suggest that temporal databases are overall significantly more efficient than conventional data warehouses due to a more direct data flow.

However, the data import to the business intelligence tool has been found to be a major bottleneck for temporal databases, which limits the potential to realize real-time data processing. The two concepts do not differ significantly in terms of costs, but the results indicate that an increase in the ETL frequency of data warehouses causes a considerable increase in the total costs. Temporal databases, in contrast, are always up to date, which means that their costs are very stable.

Furthermore, the assessment results highlight the shortcomings of conventional data warehouses in terms of data integrity. Conventional data warehouses do not prevent data loss and do not record the temporal attributes accurately, which is a significant drawback for the purposes of data analytics. Temporal databases solve these issues by automatically maintaining the temporal attributes at the execution of a data transaction. Moreover, the results of the assessment suggest that temporal databases are significantly easier to maintain, as conventional data warehouses require more manual coding. Also, the data transformation code for the data warehouse is stored within the ETL tool, whereas the code for the temporal database is stored within the database itself. This may indicate that changes made to the temporal database are easier to test. Furthermore, it has been found that the data import time of the business intelligence tool is a potential hurdle for the acceptance of temporal databases embedded in a data warehouse architecture. Other than that, temporal databases provide the end user with more reliable temporal data, which may foster their acceptance as an alternative to conventional data warehouses.

In conclusion, we attribute to temporal databases a high potential for the use in data analytics. The use of temporal databases as an alternative to traditional data warehouses has been shown to have significant advantages. Consequently, organizations following a data-driven strategy should investigate the use of temporal databases. Broad adoption of temporal databases may have a positive influence on innovation within the field of data analytics, such as the native support of point-in-time analysis in business intelligence tools.
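To make the notion of point-in-time analysis concrete, the following minimal T-SQL sketch shows how a system-versioned table can be queried for its state at an arbitrary moment; the table dbo.dimproduct, its period columns, and the chosen timestamp are hypothetical and not taken from the prototypes.

-- Minimal sketch (hypothetical names): return the contents of a
-- system-versioned table as they were at a given point in time.
-- SQL Server combines the current and the history table automatically.
SELECT productkey,
       productname,
       listprice
FROM   dbo.dimproduct
FOR    SYSTEM_TIME AS OF '2019-06-01T00:00:00';

-- The full change history of a single row can be inspected as well:
SELECT productkey,
       listprice,
       sysstarttime,
       sysendtime
FROM   dbo.dimproduct
FOR    SYSTEM_TIME ALL
WHERE  productkey = 310
ORDER  BY sysstarttime;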

7.2. Future work

In terms of directions for future research, further work on the use of temporal data within data analytics is needed to better understand the role that temporal data plays in data-driven decision making. Also, further research is required on the measurement of the maintainability of SQL code. Moreover, it would be interesting to repeat the assessment using other database management systems in the prototypes in order to validate the generalizability of the results. In doing so, it is suggested to use high-performance hardware for the databases to gain insight into the feasibility of real-time data processing using temporal databases. Furthermore, the use of a database management system supporting the valid time attribute could give a better understanding of the practical benefit of valid time for data analytics. Finally, it would be interesting to assess whether the concept of temporal databases could be adapted in order to be applied to unstructured data.
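As an indication of what such an investigation of valid time could build on, the sketch below shows an application-time (valid time) table in the style of the SQL:2011 syntax described by Kulkarni and Michels [5]; the table and column names are hypothetical, and the exact syntax differs per vendor (IBM Db2, for example, uses the keyword BUSINESS_TIME).

-- Minimal sketch of an application-time (valid time) table following the
-- SQL:2011 period syntax [5]; names are hypothetical, vendor syntax varies.
CREATE TABLE employeedepartment
  (
     employeeid   INTEGER NOT NULL,
     departmentid INTEGER NOT NULL,
     validstart   DATE NOT NULL,
     validend     DATE NOT NULL,
     PERIOD FOR validperiod (validstart, validend)
  );

-- Rows can then be selected with the period predicates of SQL:2011:
SELECT employeeid,
       departmentid
FROM   employeedepartment
WHERE  validperiod CONTAINS DATE '2019-01-01';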

Glossary

Temporal Data: Data which may change over the course of time, provided that the aspect of time is significant for the purpose of the data [10].
Time instant: A specific point in time on the time line [10].
Event: A non-recurring occurrence which happens at a certain time instant [10, 18].
Time period: The time between two instants, which can be seen as a sequence of contiguous chronons [10, 4].
Time interval: A certain duration of time with a fixed length and unspecified begin and end time [14, 10].
Chronon: A time interval of a minimal atomic duration [4].
Timestamp: The time instant or interval associated with an event or object [4].
Transaction: The record associated with a certain event [18].
Temporal attribute: A timestamp with certain semantics defined by the user or the database management system.
Valid time: The time period at which a certain statement was, is, or will be true in reality according to current beliefs. Valid times can be updated if current beliefs about the truthfulness of a statement have changed [9, 4, 14].
Transaction time: The time period at which a certain statement was stored in a database. Transaction times refer to the history of a database and therefore cannot be updated [10, 4, 14].
User-defined time: A timestamp defined by the user which is not interpreted by the database management system, unlike valid time and transaction time [10].
Temporal Database: A database containing time-varying data and offering built-in support for modelling the temporal dimension of the data [10].
Snapshot: A backup of a database capturing its contents at a certain point in time [10].
Data Analytics: The use of IT applications to support decision-making by analyzing large data sets [26].

Descriptive Analytics: Data analytics aiming at reasoning why something has happened by summarizing raw data into a format that is appealing to business users [8].
Predictive Analytics: Data analytics used to evaluate the future and to forecast trends by means of prediction models and scoring [8].
Measure: Usually numeric indicators for the performance of an organization [11].
Dimension: A perspective from which measures can be looked at in a data warehouse [11].
ETL: The classical process of extracting significant data from different sources, transforming the data into a format that is suitable for analysis, and loading the transformed data into the data warehouse.
Period definition: A definition in SQL:2011 identifying a pair of date or timestamp columns as a period with a start time and an end time [5].
System-time period: The SQL:2011 equivalent of transaction time [5].
Application-time period: The SQL:2011 equivalent of valid time [5].
Current system row: A row with a system-time period containing the current time [5].
Historical system row: A row with a system-time period not containing the current time [5].
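To connect these SQL:2011 terms to the implementation used in the prototypes, the following minimal T-SQL sketch (with hypothetical table and column names) shows a system-versioned table in Microsoft SQL Server: the PERIOD FOR SYSTEM_TIME clause is the period definition, the two generated columns delimit the system-time period, current system rows remain in the main table, and historical system rows are moved to the history table automatically.

-- Minimal sketch (hypothetical names) of a system-versioned table.
CREATE TABLE dbo.customer
  (
     customerid   INT NOT NULL PRIMARY KEY CLUSTERED,
     customername NVARCHAR(100) NOT NULL,
     sysstarttime DATETIME2 GENERATED ALWAYS AS ROW START NOT NULL,
     sysendtime   DATETIME2 GENERATED ALWAYS AS ROW END NOT NULL,
     PERIOD FOR SYSTEM_TIME (sysstarttime, sysendtime)
  )
WITH (SYSTEM_VERSIONING = ON (HISTORY_TABLE = dbo.customerhistory));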

References

Academic sources

[1] Harrine Freeman. "Software testing". In: IEEE Instrumentation & Measurement Magazine 5.3 (2002), pp. 48–50.
[2] Amir Gandomi and Murtaza Haider. "Beyond the hype: Big data concepts, methods, and analytics". In: International Journal of Information Management 35.2 (2015), pp. 137–144. issn: 0268-4012. doi: 10.1016/j.ijinfomgt.2014.10.007. url: http://www.sciencedirect.com/science/article/pii/S0268401214001066.
[3] Matteo Golfarelli and Stefano Rizzi. "Data warehouse testing". In: International Journal of Data Warehousing and Mining (IJDWM) 7.2 (2011), pp. 26–43.
[4] Christian S Jensen et al. "The consensus glossary of temporal database concepts - February 1998 version". In: Temporal Databases: Research and Practice. Springer, 1998, pp. 367–405.
[5] Krishna Kulkarni and Jan-Eike Michels. "Temporal features in SQL:2011". In: ACM SIGMOD Record 41.3 (2012), pp. 34–43.
[6] Hua-Yang Lin, Ping-Yu Hsu, and Gwo-Ji Sheen. "A fuzzy-based decision-making procedure for data warehouse system selection". In: Expert Systems with Applications 32.3 (2007), pp. 939–953.
[7] Dušan Petkovic. "Temporal data in relational database systems: a comparison". In: New Advances in Information Systems and Technologies. Springer, 2016, pp. 13–23.
[8] Saumyadipta Pyne, BLS Prakasa Rao, and Siddani Bhaskara Rao. Big data analytics: Methods and applications. Springer, 2016.
[9] Richard T Snodgrass. "Temporal databases". In: Theories and Methods of Spatio-Temporal Reasoning in Geographic Space. Springer, 1992, pp. 22–64.
[10] Richard Thomas Snodgrass. "Temporal databases". In: IEEE Computer. 1986.

[11] Alejandro Vaisman and Esteban Zimányi. "Data warehouse systems". In: Data-Centric Systems and Applications (2014).
[12] Elaine J Weyuker. "Testing component-based software: A cautionary tale". In: IEEE Software 15.5 (1998), pp. 54–59.
[13] Joost F Wolfswinkel, Elfi Furtmueller, and Celeste P M Wilderom. "Using grounded theory as a method for rigorously reviewing literature". In: European Journal of Information Systems 22.1 (2013), pp. 45–55. issn: 1476-9344. doi: 10.1057/ejis.2011.51. url: https://doi.org/10.1057/ejis.2011.51.

Non-academic sources

[14] Christopher John Date, Hugh Darwen, and Nikos Lorentzos. Time and relational theory: temporal databases in the relational model and SQL. Morgan Kaufmann, 2014.
[15] Gudu Software. Instant SQL Formatter. http://www.dpriver.com/pp/sqlformat.htm (accessed June 11, 2019).
[16] William H Inmon. Building the data warehouse. John Wiley & Sons, 2005.
[17] Christian S Jensen and Richard T Snodgrass. TimeCenter. http://timecenter.cs.aau.dk/ (accessed April 11, 2019).
[18] Tom Johnston. Bitemporal data: theory and practice. Newnes, 2014.
[19] Ralph Kimball and Margy Ross. The data warehouse toolkit: the complete guide to dimensional modeling. John Wiley & Sons, 2011.
[20] Ralph Kimball et al. Kimball Group. https://www.kimballgroup.com/ (accessed April 12, 2019).
[21] John C McCallum. Disk Drive Prices. https://jcmit.net/diskprice.htm (accessed May 16, 2019).
[22] Microsoft. AdventureWorks Installation and configuration. https://docs.microsoft.com/en-us/sql/samples/adventureworks-install-configure (accessed June 12, 2019).
[23] Microsoft. Power BI pricing. https://powerbi.microsoft.com/en-us/pricing/ (accessed June 17, 2019).
[24] Microsoft. Pricing calculator Microsoft Azure. https://azure.microsoft.com/en-gb/pricing/calculator/ (accessed June 17, 2019).
[25] OCLC. RUQuest. https://ru.on.worldcat.org/discovery?lang=en (accessed April 15, 2019).
[26] Thomas A Runkler. Data Analytics. Wiesbaden: Springer, 2012.
[27] Philip Russom. "Big data analytics". In: TDWI best practices report, fourth quarter 19.4 (2011), pp. 1–34.

[28] Dejan Sarka, Milos Radivojevic, and William Durkin. SQL Server 2017 Developer's Guide. Packt Publishing Ltd, 2018.
[29] Software Improvement Group. Better Code Hub configuration manual. https://bettercodehub.com/docs/configuration-manual (accessed June 28, 2019).
[30] Teradata. ANSI Temporal Table Support. https://docs.teradata.com/reader/_LRlKl9_m2VqMOOEPhMinA/47Rv9E15F~JnVitHW3xXzQ (accessed May 28, 2019).
[31] The University of Arizona. Richard T. Snodgrass Publications. https://www2.cs.arizona.edu/~rts/publications.html (accessed April 10, 2019).
[32] World Wide Web Foundation. Interview with Tim Berners-Lee. http://time.com/5549635/tim-berners-lee-interview-web/ (accessed April 9, 2019).


A. Source Code

A.1. Transfer between staging database and data warehouse

The transfer between the staging database and the data warehouse is done using a pipeline of copy activities. In the following, the source data set query and the pre-copy script are given for each table in the data warehouse.

DimAddress

-- Source data set query
SELECT a.addressid,
       a.addressline1,
       a.addressline2,
       a.city,
       sp.NAME AS StateProvince,
       cr.NAME AS CountryRegion,
       a.postalcode,
       a.modifieddate
FROM   person.address a
       LEFT OUTER JOIN person.stateprovince sp
                    ON a.stateprovinceid = sp.stateprovinceid
       LEFT OUTER JOIN person.countryregion cr
                    ON sp.countryregioncode = cr.countryregioncode

-- Pre-copy script
DELETE FROM dbo.dimaddress;
DBCC checkident ('dbo.DimAddress', reseed, 0);

DimCurrency

-- Source data set query
SELECT c.CurrencyCode, c.Name
FROM   Sales.Currency c

-- Pre-copy script
DELETE FROM dbo.dimcurrency;
DBCC checkident ('dbo.DimCurrency', reseed, 0);


DimCustomer

-- Source data set query
SELECT c.customerid    AS [CustomerKey],
       dg.geographykey AS [GeographyKey],
       pp.title        AS [Title],
       pp.firstname    AS [FirstName],
       pp.middlename   AS [MiddleName],
       pp.lastname     AS [LastName],
       pe.emailaddress AS [EmailAddress],
       ppp.phonenumber AS [PhoneNumber],
       pa.addressline1 AS [AddressLine1],
       pa.addressline2 AS [AddressLine2],
       pa.city         AS [City],
       pa.postalcode   AS [PostalCode]
FROM   [Sales].[customer] c
       INNER JOIN person.person pp
               ON pp.businessentityid = c.personid
       INNER JOIN person.emailaddress pe
               ON pe.businessentityid = c.personid
       INNER JOIN person.businessentityaddress pbea
               ON pbea.businessentityid = c.personid
       INNER JOIN person.address pa
               ON pa.addressid = pbea.addressid
       INNER JOIN person.personphone ppp
               ON ppp.businessentityid = c.personid
       INNER JOIN temp.dimgeography dg
               ON dg.city = pa.city
                  AND dg.postalcode = pa.postalcode
ORDER  BY c.customerid;

-- Pre-copy script
DELETE FROM dbo.dimcustomer;
DBCC checkident ('DimCustomer', reseed, 0);

DimDepartmentGroup

-- Source data set query
SELECT DISTINCT humanresources.department.groupname AS DepartmentGroupName
FROM   humanresources.department

-- Pre-copy script
DELETE FROM dbo.dimdepartmentgroup;
DBCC checkident ('DimDepartmentGroup', reseed, 0);

DimEmployee

1 -- Source data set query 2 SELECT e.[businessentityid] AS BusinessEntityID, 3 e.[nationalidnumber] AS 4 [EmployeeNationalIDAlternateKey], 5 COALESCE(sp.[territoryid], 11) AS [SalesTerritoryKey], 6 co.[firstname] AS [FirstName], 7 co.[lastname] AS [LastName], 8 co.[middlename] AS [MiddleName], 9 e.[jobtitle] AS [Title], 10 e.[hiredate] AS [HireDate], A.1. Transfer between staging database and data warehouse 63

11 e.[birthdate] AS [BirthDate], 12 e.[loginid] AS [LoginID], 13 em.[emailaddress] AS [EmailAddress], 14 pp.phonenumber AS [Phone], 15 e.[maritalstatus] AS [MaritalStatus], 16 e.[salariedflag] AS [SalariedFlag], 17 e.[gender] AS [Gender], 18 eph.[payfrequency] AS [PayFrequency], 19 eph.[rate] AS [BaseRate], 20 e.[vacationhours] AS [VacationHours], 21 e.[sickleavehours] AS [SickLeaveHours], 22 e.[currentflag] AS [CurrentFlag], 23 d.[name] AS [DepartmentName], 24 COALESCE(edh.[startdate], e.[hiredate]) AS [StartDate], 25 edh.[enddate] AS [EndDate], 26 CASE 27 WHEN edh.[enddate] IS NULL THEN N’Current’ 28 ELSE NULL 29 END AS [Status] 30 FROM [HumanResources].[employee] e 31 INNER JOIN [Person].[person] co 32 ON e.[businessentityid] = co.[businessentityid] 33 INNER JOIN [Person].[personphone] pp 34 ON pp.businessentityid = e.businessentityid 35 INNER JOIN [Person].[emailaddress] em 36 ON e.[businessentityid] = em.businessentityid 37 INNER JOIN [Person].[businessentityaddress] ea 38 ON e.[businessentityid] = ea.[businessentityid] 39 INNER JOIN [Person].[address] a 40 ON ea.[addressid] = a.[addressid] 41 LEFT OUTER JOIN [Sales].[salesperson] sp 42 ON e.[businessentityid] = sp.[businessentityid] 43 LEFT OUTER JOIN [HumanResources].[employeedepartmenthistory] edh 44 ON e.businessentityid = edh.[businessentityid] 45 INNER JOIN [HumanResources].[department] d 46 ON edh.[departmentid] = d.[departmentid] 47 LEFT OUTER JOIN [HumanResources].[employeepayhistory] eph 48 ON e.[businessentityid] = eph.[businessentityid] 49 -- Pre-copy script 50 DELETE FROM dbo.dimemployee; 51 DBCC checkident (’dbo.DimEmployee’, reseed, 0);

DimGeography

-- Source data set query
SELECT DISTINCT a.[city]               AS [City],
                sp.[stateprovincecode] AS [StateProvinceCode],
                sp.[name]              AS [StateProvinceName],
                cr.[countryregioncode] AS [CountryRegionCode],
                cr.[name]              AS [CountryRegionName],
                a.[postalcode]         AS [PostalCode]
FROM   [Person].[address] AS a
       INNER JOIN [Person].[stateprovince] AS sp
               ON a.[stateprovinceid] = sp.[stateprovinceid]
       INNER JOIN [Person].[countryregion] AS cr
               ON sp.[countryregioncode] = cr.[countryregioncode]
ORDER  BY cr.[countryregioncode],
          sp.[stateprovincecode],
          a.[city];

-- Pre-copy script
DELETE FROM dbo.dimgeography;
DBCC checkident ('DimGeography', reseed, 0);

DimProduct

1 -- Source data set query 2 SELECT p.productnumber AS 3 ProductAlternateKey, 4 p.productsubcategoryid AS 5 ProductSubcategoryKey, 6 p.weightunitmeasurecode AS 7 WeightUnitMeasureCode, 8 p.sizeunitmeasurecode AS 9 SizeUnitMeasureCode, 10 p.[name] AS ProductName, 11 pch.standardcost AS StandardCost, 12 p.finishedgoodsflag AS 13 FinishedGoodsFlag, 14 COALESCE(p.color, ’NA’) AS Color, 15 p.safetystocklevel AS 16 SafetyStockLevel, 17 p.reorderpoint AS ReorderPoint, 18 plph.listprice AS ListPrice, 19 p.size AS Size, 20 CONVERT(FLOAT, p.weight) AS Weight, 21 p.daystomanufacture AS 22 DaysToManufacture, 23 p.productline AS ProductLine, 24 p.class AS Class, 25 p.style AS Style, 26 pm.[name] AS ModelName, 27 COALESCE(plph.startdate, pch.startdate, p.sellstartdate) AS StartDate, 28 COALESCE(plph.enddate, pch.enddate, p.sellenddate) AS EndDate, 29 CASE 30 WHEN COALESCE(plph.enddate, pch.enddate, p.sellenddate) IS NULL THEN 31 N’Current’ 32 ELSE NULL 33 END AS Status 34 FROM production.product p 35 LEFT OUTER JOIN production.productmodel pm 36 ON p.productmodelid = pm.productmodelid 37 LEFT OUTER JOIN production.productcosthistory pch 38 ON p.productid = pch.productid 39 LEFT OUTER JOIN production.productlistpricehistory plph 40 ON p.productid = plph.productid 41 AND pch.startdate = plph.startdate 42 AND COALESCE(pch.enddate, ’12-31-2020’) = 43 COALESCE(plph.enddate, ’12-31-2020’) 44 -- Pre-copy script 45 DELETE FROM dbo.dimproduct; A.1. Transfer between staging database and data warehouse 65

46 DBCC checkident (’DimProduct’, reseed, 0);

DimProductCategory

-- Source data set query
SELECT DISTINCT pc.productcategoryid AS ProductCategoryAlternateKey,
                pc.[name]            AS ProductCategoryName
FROM   [Production].[productcategory] pc

-- Pre-copy script
DELETE FROM dbo.dimproductcategory;
DBCC checkident ('DimProductCategory', reseed, 0);

DimProductSubcategory

-- Source data set query
SELECT DISTINCT ps.productsubcategoryid AS ProductSubcategoryKey,
                ps.productsubcategoryid AS ProductSubcategoryAlternateKey,
                ps.[name]               AS ProductSubcategoryName,
                dpc.productcategorykey  AS ProductCategoryKey
FROM   [Production].[productsubcategory] ps
       INNER JOIN [temp].[dimproductcategory] dpc
               ON ps.productcategoryid = dpc.productcategoryalternatekey

-- Pre-copy script
DELETE FROM dbo.dimproductsubcategory;
DBCC checkident ('DimProductSubcategory', reseed, 0);

DimReseller

-- Source data set query
SELECT DISTINCT s.[businessentityid] AS [ResellerKey],
                dg.[geographykey]    AS [GeographyKey],
                s.[name]             AS [ResellerName],
                a.addressline1       AS AddressLine1,
                a.addressline2       AS AddressLine2,
                a.city               AS City,
                a.postalcode         AS PostalCode,
                a.stateprovinceid    AS StateProvinceID
FROM   [Sales].[customer] cu
       INNER JOIN [Sales].[store] s
               ON cu.[storeid] = s.[businessentityid]
       INNER JOIN [Person].[businessentityaddress] bea
               ON cu.[storeid] = bea.[businessentityid]
       INNER JOIN [Person].[address] a
               ON bea.[addressid] = a.[addressid]
       INNER JOIN [Person].[stateprovince] sp
               ON a.[stateprovinceid] = sp.[stateprovinceid]
       INNER JOIN [Person].[countryregion] cr
               ON sp.[countryregioncode] = cr.[countryregioncode]
       INNER JOIN [temp].[dimgeography] dg
               ON a.[city] = dg.[city]
                  AND sp.[stateprovincecode] = dg.[stateprovincecode]
                  AND cr.[countryregioncode] = dg.[countryregioncode]
                  AND a.[postalcode] = dg.[postalcode]
WHERE  bea.[addresstypeid] = 3
ORDER  BY s.[name];

-- Pre-copy script
DELETE FROM dbo.dimreseller;
DBCC checkident ('DimReseller', reseed, 0);

DimSalesReason

-- Source data set query
SELECT DISTINCT sr.[salesreasonid] AS [SalesReasonAlternateKey],
                sr.[name]          AS [SalesReasonName],
                sr.[reasontype]    AS [SalesReasonReasonType]
FROM   [Sales].[salesreason] sr;

-- Pre-copy script
DELETE FROM dbo.dimsalesreason;
DBCC checkident ('DimSalesReason', reseed, 0);

DimSalesTerritory

-- Source data set query
SELECT st.[territoryid] AS [SalesTerritoryAlternateKey],
       st.[name]        AS [SalesTerritoryRegion],
       cr.[name]        AS [SalesTerritoryCountry],
       st.[group]       AS [SalesTerritoryGroup]
FROM   [Sales].[salesterritory] st
       INNER JOIN [Person].[countryregion] cr
               ON st.[countryregioncode] = cr.[countryregioncode]
ORDER  BY st.[name];

-- Pre-copy script
DELETE FROM dbo.dimsalesterritory;
DBCC checkident ('DimSalesTerritory', reseed, 0);

FactInternetSales

1 -- Source data set query 2 SELECT dp.[productkey] AS 3 [ProductKey], 4 soh.[orderdate] AS 5 [OrderDateKey], 6 soh.[duedate] AS 7 [DueDateKey], 8 soh.[shipdate] AS 9 [ShipDateKey], 10 soh.[customerid] AS 11 [CustomerKey], 12 sod.[specialofferid] AS 13 [PromotionKey], 14 COALESCE(dc.[currencykey], (SELECT currencykey 15 FROM [temp].[dimcurrency] 16 WHERE currencyalternatekey = N’USD’)) AS 17 [CurrencyKey], 18 soh.[territoryid] AS 19 [SalesTerritoryKey], 20 soh.[salesordernumber] AS 21 [SalesOrderNumber], 22 soh.[revisionnumber] AS 23 [RevisionNumber], A.1. Transfer between staging database and data warehouse 67

24 sod.[orderqty] AS 25 [OrderQuantity], 26 sod.[unitprice] AS 27 [UnitPrice], 28 sod.[orderqty] * sod.[unitprice] AS 29 [ExtendedAmount], 30 sod.[unitpricediscount] AS 31 [UnitPriceDiscountPct], 32 sod.[orderqty] * sod.[unitprice] * sod.[unitpricediscount] AS 33 [DiscountAmount], 34 pch.[standardcost] AS 35 [ProductStandardCost], 36 sod.[orderqty] * pch.[standardcost] AS 37 [TotalProductCost], 38 sod.[linetotal] AS 39 [SalesAmount], 40 CONVERT(MONEY, sod.[linetotal] * 0.08) AS 41 [TaxAmt], 42 CONVERT(MONEY, sod.[linetotal] * 0.025) AS 43 [Freight], 44 sod.[carriertrackingnumber] AS 45 [CarrierTrackingNumber], 46 soh.[purchaseordernumber] AS 47 [CustomerPONumber] 48 FROM [Sales].[salesorderheader] soh 49 INNER JOIN [Sales].[salesorderdetail] sod 50 ON soh.[salesorderid] = sod.[salesorderid] 51 INNER JOIN [Production].[product] p 52 ON sod.[productid] = p.[productid] 53 INNER JOIN [temp].[dimproduct] dp 54 ON dp.[productalternatekey] = 55 p.[productnumber] COLLATE 56 sql_latin1_general_cp1_ci_as 57 AND [dbo].[Udfminimumdate](soh.[orderdate], soh.[duedate]) 58 BETWEEN 59 dp.[startdate] AND COALESCE(dp.[enddate], ’12-31-9999’) 60 INNER JOIN [Sales].[customer] c 61 ON soh.[customerid] = c.[customerid] 62 LEFT OUTER JOIN [Production].[productcosthistory] pch 63 ON p.[productid] = pch.[productid] 64 AND [dbo].[Udfminimumdate](soh.[orderdate], 65 soh.[duedate]) 66 BETWEEN 67 pch.[startdate] AND COALESCE(pch.[enddate], 68 ’12-31-9999’) 69 LEFT OUTER JOIN [Sales].[currencyrate] cr 70 ON soh.[currencyrateid] = cr.[currencyrateid] 71 LEFT OUTER JOIN [temp].[dimcurrency] dc 72 ON cr.[tocurrencycode] = dc.[currencyalternatekey] COLLATE 73 sql_latin1_general_cp1_ci_as 74 LEFT OUTER JOIN [HumanResources].[employee] e 75 ON soh.[salespersonid] = e.[businessentityid] 76 LEFT OUTER JOIN [temp].[dimemployee] de 77 ON e.[nationalidnumber] = de.[employeenationalid] COLLATE 68 Appendix A. Source Code

78 sql_latin1_general_cp1_ci_as 79 WHERE soh.onlineorderflag = 1 80 ORDER BY [orderdatekey], 81 [customerkey]; 82 -- Pre-copy script 83 DELETE FROM dbo.factinternetsales;

FactResellerSales

1 -- Source data set query 2 SELECT dp.[productkey] AS 3 [ProductKey], 4 soh.[orderdate] AS 5 [OrderDate], 6 soh.[duedate] AS 7 [DueDate], 8 soh.[shipdate] AS 9 [ShipDate], 10 soh.[customerid] AS 11 [ResellerKey], 12 de.[employeekey] AS 13 [EmployeeKey], 14 sod.[specialofferid] AS 15 [PromotionKey], 16 COALESCE(dc.[currencykey], (SELECT currencykey 17 FROM [temp].[dimcurrency] 18 WHERE currencyalternatekey = N’USD’)) AS 19 [CurrencyKey], 20 soh.[territoryid] AS 21 [SalesTerritoryKey], 22 soh.[salesordernumber] AS 23 [SalesOrderNumber], 24 soh.[revisionnumber] AS 25 [RevisionNumber], 26 sod.[orderqty] AS 27 [OrderQuantity], 28 sod.[unitprice] AS 29 [UnitPrice], 30 sod.[orderqty] * sod.[unitprice] AS 31 [ExtendedAmount], 32 sod.[unitpricediscount] AS 33 [UnitPriceDiscountPct], 34 sod.[orderqty] * sod.[unitprice] * sod.[unitpricediscount] AS 35 [DiscountAmount], 36 pch.[standardcost] AS 37 [ProductStandardCost], 38 sod.[orderqty] * pch.[standardcost] AS 39 [TotalProductCost], 40 sod.[linetotal] AS 41 [SalesAmount], 42 CONVERT(MONEY, sod.[linetotal] * 0.08) AS 43 [TaxAmt], 44 CONVERT(MONEY, sod.[linetotal] * 0.025) AS 45 [Freight], 46 sod.[carriertrackingnumber] AS A.1. Transfer between staging database and data warehouse 69

47 [CarrierTrackingNumber], 48 soh.[purchaseordernumber] AS 49 [CustomerPONumber] 50 FROM [Sales].[salesorderheader] soh 51 INNER JOIN [Sales].[salesorderdetail] sod 52 ON soh.[salesorderid] = sod.[salesorderid] 53 INNER JOIN [Production].[product] p 54 ON sod.[productid] = p.[productid] 55 INNER JOIN [temp].[dimproduct] dp 56 ON dp.[productalternatekey] = 57 p.[productnumber] COLLATE 58 sql_latin1_general_cp1_ci_as 59 AND [dbo].[Udfminimumdate](soh.[orderdate], soh.[duedate]) 60 BETWEEN 61 dp.[startdate] AND COALESCE(dp.[enddate], ’12-31-9999’) 62 INNER JOIN [Sales].[customer] c 63 ON soh.[customerid] = c.[customerid] 64 LEFT OUTER JOIN [Production].[productcosthistory] pch 65 ON p.[productid] = pch.[productid] 66 AND [dbo].[Udfminimumdate](soh.[orderdate], 67 soh.[duedate]) 68 BETWEEN 69 pch.[startdate] AND COALESCE(pch.[enddate], 70 ’12-31-9999’) 71 LEFT OUTER JOIN [Sales].[currencyrate] cr 72 ON soh.[currencyrateid] = cr.[currencyrateid] 73 LEFT OUTER JOIN [temp].[dimcurrency] dc 74 ON cr.[tocurrencycode] = dc.[currencyalternatekey] COLLATE 75 sql_latin1_general_cp1_ci_as 76 LEFT OUTER JOIN [HumanResources].[employee] e 77 ON soh.[salespersonid] = e.businessentityid 78 LEFT OUTER JOIN [temp].[dimemployee] de 79 ON e.[nationalidnumber] = de.employeenationalid COLLATE 80 sql_latin1_general_cp1_ci_as 81 WHERE soh.onlineorderflag = 0 82 ORDER BY [orderdate], 83 [resellerkey]; 84 -- Pre-copy script 85 DELETE FROM dbo.factresellersales;

FactSalesQuota

-- Source data set query
SELECT DISTINCT spqh.businessentityid AS [EmployeeKey],
                spqh.[quotadate]      AS [Quotadate],
                spqh.[salesquota]     AS [SalesAmountQuota]
FROM   [Sales].[salespersonquotahistory] spqh

-- Pre-copy script
DELETE FROM dbo.factsalesquota;
DBCC checkident ('DimReseller', reseed, 0);

A.2. Data warehouse transformation

DimAddress

1 -- Source data set query 2 SELECT s.*, 3 Getdate() AS etldate, 4 ’9999-12-31 23:59:59’ AS enddate 5 FROM dbo.dimaddress s 6 LEFT OUTER JOIN target.dimaddress t 7 ON s.addressalternatekey = t.addressalternatekey 8 WHERE t.addressalternatekey IS NULL 9 -- Pre-copy script 10 -- DELETE 11 UPDATE target.dimaddress 12 SET sysenddate = Getdate() 13 WHERE addressalternatekey IN (SELECT t2.addressalternatekey 14 FROM target.dimaddress t2 15 LEFT OUTER JOIN dbo.dimaddress s 16 ON s.addressalternatekey = 17 t2.addressalternatekey 18 WHERE s.addressalternatekey IS NULL) 19 AND sysenddate = ’9999-12-31 23:59:59’ 20 21 -- UPDATE 22 UPDATE t 23 SET SysEndDate = Getdate() 24 FROM target.dimaddress t, 25 dbo.dimaddress s 26 WHERE t.addresskey IN (SELECT s.addresskey 27 FROM (SELECT s.addresskey, 28 Hashbytes(’SHA2_512’, 29 Concat_ws(’,’, s.addressalternatekey, 30 s.addressline1, 31 s.addressline2, s.city, 32 s.countryregion, 33 s.modifieddate, 34 s.postalcode, s.stateprovince)) AS hash 35 FROM dimaddress s) s 36 JOIN (SELECT t.addresskey, 37 Hashbytes(’SHA2_512’, 38 Concat_ws(’,’, 39 t.addressalternatekey, 40 t.addressline1, 41 t.addressline2, t.city, 42 t.countryregion, 43 t.modifieddate, 44 t.postalcode, t.stateprovince)) AS hash, 45 t.sysenddate 46 FROM target.dimaddress t) t 47 ON s.addresskey = t.addresskey 48 WHERE s.hash != t.hash 49 AND t.sysenddate = ’9999-12-31 23:59:59.000’) 50 AND s.addresskey = t.addresskey A.2. Data warehouse transformation 71

51 AND t.sysenddate = ’9999-12-31 23:59:59.000’ 52 53 INSERT INTO target.dimaddress 54 SELECT s.addresskey, 55 s.addressalternatekey, 56 s.addressline1, 57 s.addressline2, 58 s.city, 59 s.stateprovince, 60 s.countryregion, 61 s.postalcode, 62 s.modifieddate, 63 t.sysenddate, 64 ’9999-12-31 23:59:59’ 65 FROM dbo.dimaddress s, 66 (SELECT *, 67 Row_number() 68 OVER( 69 partition BY addresskey 70 ORDER BY sysenddate DESC) AS rn 71 FROM target.dimaddress) AS t 72 WHERE t.addresskey IN (SELECT s.addresskey 73 FROM (SELECT s.addresskey, 74 Hashbytes(’SHA2_512’, 75 Concat_ws(’,’, s.addressalternatekey, 76 s.addressline1, 77 s.addressline2, s.city, 78 s.countryregion, 79 s.modifieddate, 80 s.postalcode, s.stateprovince)) AS hash 81 FROM dimaddress s) s 82 JOIN (SELECT t.addresskey, 83 Hashbytes(’SHA2_512’, 84 Concat_ws(’,’, 85 t.addressalternatekey, 86 t.addressline1, 87 t.addressline2, t.city, 88 t.countryregion, 89 t.modifieddate, 90 t.postalcode, t.stateprovince)) AS hash 91 FROM (SELECT *, 92 Row_number() 93 OVER( 94 partition BY addresskey 95 ORDER BY sysenddate DESC) AS rn 96 FROM target.dimaddress) AS t 97 WHERE rn = 1) t 98 ON s.addresskey = t.addresskey 99 WHERE s.hash != t.hash) 100 AND s.addresskey = t.addresskey 101 AND rn = 1;

DimCurrency

1 -- Source data set query

2 SELECT s.*, 3 Getdate() AS etldate, 4 ’9999-12-31 23:59:59’ AS enddate 5 FROM dbo.dimcurrency s 6 LEFT OUTER JOIN target.dimcurrency t 7 ON s.currencyalternatekey = t.currencyalternatekey 8 WHERE t.currencyalternatekey IS NULL 9 -- Pre-copy script 10 -- DELETE 11 UPDATE target.dimcurrency 12 SET sysenddate = Getdate() 13 WHERE currencyalternatekey IN (SELECT t2.currencyalternatekey 14 FROM target.dimcurrency t2 15 LEFT OUTER JOIN dbo.dimcurrency s 16 ON s.currencyalternatekey = 17 t2.currencyalternatekey 18 WHERE s.currencyalternatekey IS NULL) 19 AND sysenddate = ’9999-12-31 23:59:59’ 20 21 -- UPDATE 22 UPDATE t 23 SET SysEndDate = Getdate() 24 FROM target.dimcurrency t, 25 dbo.dimcurrency s 26 WHERE t.currencykey IN (SELECT s.currencykey 27 FROM (SELECT s.currencykey, 28 Hashbytes(’SHA2_512’, 29 Concat_ws(’,’, s.currencyalternatekey, 30 s.currencyname)) AS 31 hash 32 FROM dbo.dimcurrency s) s 33 JOIN (SELECT t.currencykey, 34 Hashbytes(’SHA2_512’, 35 Concat_ws(’,’, 36 t.currencyalternatekey, 37 t.currencyname)) AS 38 hash, 39 t.sysenddate 40 FROM target.dimcurrency t) t 41 ON s.currencykey = t.currencykey 42 WHERE s.hash != t.hash 43 AND t.sysenddate = ’9999-12-31 23:59:59.000’) 44 AND s.currencykey = t.currencykey 45 AND t.sysenddate = ’9999-12-31 23:59:59.000’ 46 47 INSERT INTO target.dimcurrency 48 SELECT s.currencykey, 49 s.currencyalternatekey, 50 s.currencyname, 51 t.sysenddate, 52 ’9999-12-31 23:59:59’ 53 FROM dbo.dimcurrency s, 54 (SELECT *, 55 Row_number() A.2. Data warehouse transformation 73

56 OVER( 57 partition BY currencykey 58 ORDER BY sysenddate DESC) AS rn 59 FROM target.dimcurrency) AS t 60 WHERE t.currencykey IN (SELECT s.currencykey 61 FROM (SELECT s.currencykey, 62 Hashbytes(’SHA2_512’, 63 Concat_ws(’,’, s.currencyalternatekey, 64 s.currencyname)) AS 65 hash 66 FROM dbo.dimcurrency s) s 67 JOIN (SELECT t.currencykey, 68 Hashbytes(’SHA2_512’, 69 Concat_ws(’,’, 70 t.currencyalternatekey, 71 t.currencyname)) AS 72 hash 73 FROM (SELECT *, 74 Row_number() 75 OVER( 76 partition BY 77 currencykey 78 ORDER BY sysenddate 79 DESC) AS 80 rn 81 FROM target.dimcurrency) AS t 82 WHERE rn = 1) t 83 ON s.currencykey = t.currencykey 84 WHERE s.hash != t.hash) 85 AND s.currencykey = t.currencykey 86 AND rn = 1;

DimCustomer

1 -- Source data set query 2 SELECT s.*, 3 Getdate() AS etldate, 4 ’9999-12-31 23:59:59’ AS enddate 5 FROM dbo.dimcustomer s 6 LEFT OUTER JOIN target.dimcustomer t 7 ON s.customeralternatekey = t.customeralternatekey 8 WHERE t.customeralternatekey IS NULL; 9 -- Pre-copy script 10 UPDATE target.dimcustomer 11 SET sysenddate = Getdate() 12 WHERE customeralternatekey IN (SELECT t2.customeralternatekey 13 FROM target.dimcustomer t2 14 LEFT OUTER JOIN dbo.dimcustomer s 15 ON s.customeralternatekey = 16 t2.customeralternatekey 17 WHERE s.customeralternatekey IS NULL) 18 AND sysenddate = ’9999-12-31 23:59:59’ 19 20 UPDATE t 21 SET SysEndDate = Getdate() 74 Appendix A. Source Code

22 FROM target.dimcustomer t, 23 dbo.dimcustomer s 24 WHERE t.customerkey IN (SELECT s.customerkey 25 FROM (SELECT s.customerkey, 26 Hashbytes(’SHA2_512’, 27 Concat_ws(s.customeralternatekey, 28 s.geographykey, 29 s.title, 30 s.firstname, 31 s.middlename, 32 s.lastname, 33 s.emailaddress, s.phonenumber, 34 s.addressline1, 35 s.addressline2, 36 s.city, s.postalcode)) AS hash 37 FROM dbo.dimcustomer s) s 38 JOIN (SELECT t.customerkey, 39 Hashbytes(’SHA2_512’, 40 Concat_ws(t.customeralternatekey, 41 t.geographykey, 42 t.title, 43 t.firstname, 44 t.middlename, 45 t.lastname, 46 t.emailaddress, t.phonenumber, 47 t.addressline1, 48 t.addressline2, 49 t.city, t.postalcode)) AS hash, 50 t.sysenddate 51 FROM target.dimcustomer t) t 52 ON s.customerkey = t.customerkey 53 WHERE s.hash != t.hash 54 AND t.sysenddate = ’9999-12-31 23:59:59.000’) 55 AND s.customerkey = t.customerkey 56 AND t.sysenddate = ’9999-12-31 23:59:59.000’ 57 58 INSERT INTO target.dimcustomer 59 SELECT s.customerkey, 60 s.customeralternatekey, 61 s.geographykey, 62 s.title, 63 s.firstname, 64 s.middlename, 65 s.lastname, 66 s.emailaddress, 67 s.phonenumber, 68 s.addressline1, 69 s.addressline2, 70 s.city, 71 s.postalcode, 72 t.sysenddate, 73 ’9999-12-31 23:59:59’ 74 FROM dbo.dimcustomer s, 75 (SELECT *, A.2. Data warehouse transformation 75

76 Row_number() 77 OVER( 78 partition BY customerkey 79 ORDER BY sysenddate DESC) AS rn 80 FROM target.dimcustomer) AS t 81 WHERE t.customerkey IN (SELECT s.customerkey 82 FROM (SELECT s.customerkey, 83 Hashbytes(’SHA2_512’, 84 Concat_ws(s.customeralternatekey, 85 s.geographykey, 86 s.title, 87 s.firstname, 88 s.middlename, 89 s.lastname, 90 s.emailaddress, s.phonenumber, 91 s.addressline1, 92 s.addressline2, 93 s.city, s.postalcode)) AS hash 94 FROM dbo.dimcustomer s) s 95 JOIN (SELECT t.customerkey, 96 Hashbytes(’SHA2_512’, 97 Concat_ws(t.customeralternatekey, 98 t.geographykey, 99 t.title, 100 t.firstname, 101 t.middlename, 102 t.lastname, 103 t.emailaddress, t.phonenumber, 104 t.addressline1, 105 t.addressline2, 106 t.city, t.postalcode)) AS hash 107 FROM (SELECT *, 108 Row_number() 109 OVER( 110 partition BY customerkey 111 ORDER BY sysenddate DESC) AS 112 rn 113 FROM target.dimcustomer) AS t 114 WHERE rn = 1) t 115 ON s.customerkey = t.customerkey 116 WHERE s.hash != t.hash) 117 AND s.customerkey = t.customerkey 118 AND rn = 1;

DimDepartmentGroup

1 -- Source data set query 2 SELECT s.*, 3 Getdate() AS etldate, 4 ’9999-12-31 23:59:59’ AS enddate 5 FROM dbo.dimdepartmentgroup s 6 LEFT OUTER JOIN target.dimdepartmentgroup t 7 ON s.departmentgroupkey = t.departmentgroupkey 8 WHERE t.departmentgroupkey IS NULL; 9 -- Pre-copy script 76 Appendix A. Source Code

10 UPDATE target.dimdepartmentgroup 11 SET sysenddate = Getdate() 12 WHERE departmentgroupkey IN (SELECT t2.departmentgroupkey 13 FROM target.dimdepartmentgroup t2 14 LEFT OUTER JOIN dbo.dimdepartmentgroup s 15 ON s.departmentgroupkey = 16 t2.departmentgroupkey 17 WHERE s.departmentgroupkey IS NULL) 18 AND sysenddate = ’9999-12-31 23:59:59’ 19 20 UPDATE t 21 SET SysEndDate = Getdate() 22 FROM target.dimdepartmentgroup t, 23 dbo.dimdepartmentgroup s 24 WHERE t.departmentgroupkey = s.departmentgroupkey 25 AND t.departmentgroupname != s.departmentgroupname 26 AND t.sysenddate = ’9999-12-31 23:59:59.000’; 27 28 INSERT INTO target.dimdepartmentgroup 29 SELECT s.departmentgroupkey, 30 s.departmentgroupname, 31 t.sysenddate, 32 ’9999-12-31 23:59:59’ 33 FROM dbo.dimdepartmentgroup s, 34 (SELECT *, 35 Row_number() 36 OVER( 37 partition BY departmentgroupkey 38 ORDER BY sysenddate DESC) AS rn 39 FROM target.dimdepartmentgroup) AS t 40 WHERE t.departmentgroupkey = s.departmentgroupkey 41 AND t.departmentgroupname != s.departmentgroupname 42 AND rn = 1;

DimEmployee

1 -- Source data set query 2 SELECT s.*, 3 Getdate() AS etldate, 4 ’9999-12-31 23:59:59’ AS sysenddate 5 FROM dbo.dimemployee s 6 LEFT OUTER JOIN target.dimemployee t 7 ON s.employeekey = t.employeekey 8 WHERE t.employeekey IS NULL; 9 -- Pre-copy script 10 UPDATE target.dimemployee 11 SET sysenddate = Getdate() 12 WHERE employeekey IN (SELECT t2.employeekey 13 FROM target.dimemployee t2 14 LEFT OUTER JOIN dbo.dimemployee s 15 ON s.employeekey = t2.employeekey 16 WHERE s.employeekey IS NULL) 17 AND sysenddate = ’9999-12-31 23:59:59’ 18 19 UPDATE t A.2. Data warehouse transformation 77

20 SET SysEndDate = Getdate() 21 FROM target.dimemployee t, 22 dbo.dimemployee s 23 WHERE t.employeekey IN (SELECT s.employeekey 24 FROM (SELECT s.employeekey, 25 Hashbytes(’SHA2_512’, 26 Concat_ws(’,’, s.businessentityid, 27 s.employeenationalid, 28 s.salesterritorykey, 29 s.firstname, 30 s.lastname, 31 s.middlename, s.title, s.hiredate, 32 s.birthdate, 33 s.loginid, 34 s.emailaddress, s.phone, 35 s.maritalstatus, 36 s.salariedflag, 37 s.gender, 38 s.payfrequency, s.baserate, 39 s.vacationhours, 40 s.sickleavehours, 41 s.currentflag, 42 s.departmentname, 43 s.startdate, 44 s.enddate, 45 s.status)) 46 AS hash 47 FROM dbo.dimemployee s) s 48 JOIN (SELECT t.employeekey, 49 Hashbytes(’SHA2_512’, 50 Concat_ws(’,’, t.businessentityid, 51 t.employeenationalid, 52 t.salesterritorykey, 53 t.firstname, 54 t.lastname, 55 t.middlename, t.title, t.hiredate, 56 t.birthdate, 57 t.loginid, 58 t.emailaddress, t.phone, 59 t.maritalstatus, 60 t.salariedflag, 61 t.gender, 62 t.payfrequency, t.baserate, 63 t.vacationhours, 64 t.sickleavehours, 65 t.currentflag, 66 t.departmentname, 67 t.startdate, t.enddate, 68 t.status)) 69 AS hash, 70 t.sysenddate 71 FROM target.dimemployee t) t 72 ON s.employeekey = t.employeekey 73 WHERE s.hash != t.hash 78 Appendix A. Source Code

74 AND t.sysenddate = ’9999-12-31 23:59:59.000’) 75 AND s.employeekey = t.employeekey 76 AND t.sysenddate = ’9999-12-31 23:59:59.000’ 77 78 INSERT INTO target.dimemployee 79 SELECT s.employeekey, 80 s.businessentityid, 81 s.employeenationalid, 82 s.salesterritorykey, 83 s.firstname, 84 s.lastname, 85 s.middlename, 86 s.title, 87 s.hiredate, 88 s.birthdate, 89 s.loginid, 90 s.emailaddress, 91 s.phone, 92 s.maritalstatus, 93 s.salariedflag, 94 s.gender, 95 s.payfrequency, 96 s.baserate, 97 s.vacationhours, 98 s.sickleavehours, 99 s.currentflag, 100 s.departmentname, 101 s.startdate, 102 s.enddate, 103 s.status, 104 t.sysenddate, 105 ’9999-12-31 23:59:59’ 106 FROM dbo.dimemployee s, 107 (SELECT *, 108 Row_number() 109 OVER( 110 partition BY employeekey 111 ORDER BY sysenddate DESC) AS rn 112 FROM target.dimemployee) AS t 113 WHERE t.employeekey IN (SELECT s.employeekey 114 FROM (SELECT s.employeekey, 115 Hashbytes(’SHA2_512’, 116 Concat_ws(’,’, s.businessentityid, 117 s.employeenationalid, 118 s.salesterritorykey, 119 s.firstname, 120 s.lastname, 121 s.middlename, s.title, s.hiredate, 122 s.birthdate, 123 s.loginid, 124 s.emailaddress, s.phone, 125 s.maritalstatus, 126 s.salariedflag, 127 s.gender, A.2. Data warehouse transformation 79

128 s.payfrequency, s.baserate, 129 s.vacationhours, 130 s.sickleavehours, 131 s.currentflag, 132 s.departmentname, 133 s.startdate, 134 s.enddate, 135 s.status)) 136 AS hash 137 FROM dbo.dimemployee s) s 138 JOIN (SELECT t.employeekey, 139 Hashbytes(’SHA2_512’, 140 Concat_ws(’,’, t.businessentityid, 141 t.employeenationalid, 142 t.salesterritorykey, 143 t.firstname, 144 t.lastname, 145 t.middlename, t.title, t.hiredate, 146 t.birthdate, 147 t.loginid, 148 t.emailaddress, t.phone, 149 t.maritalstatus, 150 t.salariedflag, 151 t.gender, 152 t.payfrequency, t.baserate, 153 t.vacationhours, 154 t.sickleavehours, 155 t.currentflag, 156 t.departmentname, 157 t.startdate, t.enddate, 158 t.status)) 159 AS hash 160 FROM (SELECT *, 161 Row_number() 162 OVER( 163 partition BY employeekey 164 ORDER BY sysenddate DESC) AS rn 165 FROM target.dimemployee) AS t 166 WHERE rn = 1) t 167 ON s.employeekey = t.employeekey 168 WHERE s.hash != t.hash) 169 AND s.employeekey = t.employeekey 170 AND rn = 1;

DimGeography

1 -- Source data set query 2 SELECT s.*, 3 Getdate() AS etldate, 4 ’9999-12-31 23:59:59’ AS enddate 5 FROM dbo.dimgeography s 6 LEFT OUTER JOIN target.dimgeography t 7 ON s.geographykey = t.geographykey 8 WHERE t.geographykey IS NULL; 9 -- Pre-copy script 80 Appendix A. Source Code

10 UPDATE target.dimgeography 11 SET sysenddate = Getdate() 12 WHERE geographykey IN (SELECT t2.geographykey 13 FROM target.dimgeography t2 14 LEFT OUTER JOIN dbo.dimgeography s 15 ON s.geographykey = t2.geographykey 16 WHERE s.geographykey IS NULL) 17 AND sysenddate = ’9999-12-31 23:59:59’ 18 19 UPDATE t 20 SET SysEndDate = Getdate() 21 FROM target.dimgeography t, 22 dbo.dimgeography s 23 WHERE t.geographykey IN (SELECT s.geographykey 24 FROM (SELECT s.geographykey, 25 Hashbytes(’SHA2_512’, 26 Concat_ws(’,’, s.city, 27 s.stateprovincecode, 28 s.stateprovincename, 29 s.countryregioncode, 30 s.countryregionname, s.postalcode)) AS 31 hash 32 FROM dbo.dimgeography s) s 33 JOIN (SELECT t.geographykey, 34 Hashbytes(’SHA2_512’, Concat_ws(’,’, t.city, 35 t.stateprovincecode, 36 t.stateprovincename, 37 t.countryregioncode, 38 t.countryregionname, t.postalcode)) AS hash, 39 t.sysenddate 40 FROM target.dimgeography t) t 41 ON s.geographykey = t.geographykey 42 WHERE s.hash != t.hash 43 AND t.sysenddate = ’9999-12-31 23:59:59.000’) 44 AND s.geographykey = t.geographykey 45 AND t.sysenddate = ’9999-12-31 23:59:59.000’ 46 47 INSERT INTO target.dimgeography 48 SELECT s.geographykey, 49 s.city, 50 s.stateprovincecode, 51 s.stateprovincename, 52 s.countryregioncode, 53 s.countryregionname, 54 s.postalcode, 55 t.sysenddate, 56 ’9999-12-31 23:59:59’ 57 FROM dbo.dimgeography s, 58 (SELECT *, 59 Row_number() 60 OVER( 61 partition BY geographykey 62 ORDER BY sysenddate DESC) AS rn 63 FROM target.dimgeography) AS t A.2. Data warehouse transformation 81

64 WHERE t.geographykey IN (SELECT s.geographykey 65 FROM (SELECT s.geographykey, 66 Hashbytes(’SHA2_512’, 67 Concat_ws(’,’, s.city, 68 s.stateprovincecode, 69 s.stateprovincename, 70 s.countryregioncode, 71 s.countryregionname, s.postalcode)) AS 72 hash 73 FROM dbo.dimgeography s) s 74 JOIN (SELECT t.geographykey, 75 Hashbytes(’SHA2_512’, Concat_ws(’,’, t.city, 76 t.stateprovincecode, 77 t.stateprovincename, 78 t.countryregioncode, 79 t.countryregionname, t.postalcode)) AS hash 80 FROM (SELECT *, 81 Row_number() 82 OVER( 83 partition BY geographykey 84 ORDER BY sysenddate DESC) AS rn 85 FROM target.dimgeography) AS t 86 WHERE rn = 1) t 87 ON s.geographykey = t.geographykey 88 WHERE s.hash != t.hash) 89 AND s.geographykey = t.geographykey 90 AND rn = 1;

DimProduct

1 -- Source data set query 2 SELECT s.*, 3 Getdate() AS etldate, 4 ’9999-12-31 23:59:59’ AS sysenddate 5 FROM dbo.dimproduct s 6 LEFT OUTER JOIN target.dimproduct t 7 ON s.productkey = t.productkey 8 WHERE t.productkey IS NULL; 9 -- Pre-copy script 10 UPDATE target.dimproduct 11 SET sysenddate = Getdate() 12 WHERE productkey IN (SELECT t2.productkey 13 FROM target.dimproduct t2 14 LEFT OUTER JOIN dbo.dimproduct s 15 ON s.productkey = t2.productkey 16 WHERE s.productkey IS NULL) 17 AND sysenddate = ’9999-12-31 23:59:59’ 18 19 UPDATE t 20 SET SysEndDate = Getdate() 21 FROM target.dimproduct t, 22 dbo.dimproduct s 23 WHERE t.productkey IN (SELECT s.productkey 24 FROM (SELECT s.productkey, 25 Hashbytes(’SHA2_512’, 82 Appendix A. Source Code

26 Concat_ws(’,’, s.productalternatekey, 27 s.productsubcategorykey, 28 s.weightunitmeasurecode, 29 s.sizeunitmeasurecode, s.productname, 30 s.standardcost, 31 s.finishedgoodsflag, 32 s.color, 33 s.safetystocklevel, s.reorderpoint, 34 s.listprice, s.size, 35 s.weight, 36 s.daystomanufacture, s.productline, 37 s.class, s.style, 38 s.modelname, 39 s.startdate, s.enddate, s.status)) AS hash 40 FROM dbo.dimproduct s) s 41 JOIN (SELECT t.productkey, 42 Hashbytes(’SHA2_512’, 43 Concat_ws(’,’, t.productalternatekey, 44 t.productsubcategorykey, 45 t.weightunitmeasurecode, 46 t.sizeunitmeasurecode, t.productname, 47 t.standardcost, 48 t.finishedgoodsflag, 49 t.color, 50 t.safetystocklevel, t.reorderpoint, 51 t.listprice, t.size, 52 t.weight, 53 t.daystomanufacture, t.productline, 54 t.class, t.style, 55 t.modelname, 56 t.startdate, t.enddate, t.status)) AS hash, 57 t.sysenddate 58 FROM target.dimproduct t) t 59 ON s.productkey = t.productkey 60 WHERE s.hash != t.hash 61 AND t.sysenddate = ’9999-12-31 23:59:59.000’) 62 AND s.productkey = t.productkey 63 AND t.sysenddate = ’9999-12-31 23:59:59.000’ 64 65 INSERT INTO target.dimproduct 66 SELECT s.*, 67 t.sysenddate, 68 ’9999-12-31 23:59:59’ 69 FROM dbo.dimproduct s, 70 (SELECT *, 71 Row_number() 72 OVER( 73 partition BY productkey 74 ORDER BY sysenddate DESC) AS rn 75 FROM target.dimproduct) AS t 76 WHERE t.productkey IN (SELECT s.productkey 77 FROM (SELECT s.productkey, 78 Hashbytes(’SHA2_512’, 79 Concat_ws(’,’, s.productalternatekey, A.2. Data warehouse transformation 83

80 s.productsubcategorykey, 81 s.weightunitmeasurecode, 82 s.sizeunitmeasurecode, s.productname, 83 s.standardcost, 84 s.finishedgoodsflag, 85 s.color, 86 s.safetystocklevel, s.reorderpoint, 87 s.listprice, s.size, 88 s.weight, 89 s.daystomanufacture, s.productline, 90 s.class, s.style, 91 s.modelname, 92 s.startdate, s.enddate, s.status)) AS hash 93 FROM dbo.dimproduct s) s 94 JOIN (SELECT t.productkey, 95 Hashbytes(’SHA2_512’, 96 Concat_ws(’,’, t.productalternatekey, 97 t.productsubcategorykey, 98 t.weightunitmeasurecode, 99 t.sizeunitmeasurecode, t.productname, 100 t.standardcost, 101 t.finishedgoodsflag, 102 t.color, 103 t.safetystocklevel, t.reorderpoint, 104 t.listprice, t.size, 105 t.weight, 106 t.daystomanufacture, t.productline, 107 t.class, t.style, 108 t.modelname, 109 t.startdate, t.enddate, t.status)) AS hash 110 FROM (SELECT *, 111 Row_number() 112 OVER( 113 partition BY productkey 114 ORDER BY sysenddate DESC) AS rn 115 FROM target.dimproduct) AS t 116 WHERE rn = 1) t 117 ON s.productkey = t.productkey 118 WHERE s.hash != t.hash) 119 AND s.productkey = t.productkey 120 AND rn = 1;

DimProductCategory

1 -- Source data set query 2 SELECT s.*, 3 Getdate() AS etldate, 4 ’9999-12-31 23:59:59’ AS enddate 5 FROM dbo.dimproductcategory s 6 LEFT OUTER JOIN target.dimproductcategory t 7 ON s.productcategorykey = t.productcategorykey 8 WHERE t.productcategorykey IS NULL; 9 -- Pre-copy script 10 UPDATE target.dimproductcategory 11 SET sysenddate = Getdate() 84 Appendix A. Source Code

12 WHERE productcategorykey IN (SELECT t2.productcategorykey 13 FROM target.dimproductcategory t2 14 LEFT OUTER JOIN dbo.dimproductcategory s 15 ON s.productcategorykey = 16 t2.productcategorykey 17 WHERE s.productcategorykey IS NULL) 18 AND sysenddate = ’9999-12-31 23:59:59’ 19 20 UPDATE t 21 SET SysEndDate = Getdate() 22 FROM target.dimproductcategory t, 23 dbo.dimproductcategory s 24 WHERE t.productcategorykey IN (SELECT s.productcategorykey 25 FROM (SELECT s.productcategorykey, 26 Hashbytes(’SHA2_512’, 27 Concat_ws(’,’, s.productcategoryalternatekey, 28 s.productcategoryname)) AS hash 29 FROM dbo.dimproductcategory s) s 30 JOIN (SELECT t.productcategorykey, 31 Hashbytes(’SHA2_512’, 32 Concat_ws(’,’, 33 t.productcategoryalternatekey, 34 t.productcategoryname)) AS hash, 35 t.sysenddate 36 FROM target.dimproductcategory t) t 37 ON s.productcategorykey = t.productcategorykey 38 WHERE s.hash != t.hash 39 AND t.sysenddate = ’9999-12-31 23:59:59.000’) 40 AND s.productcategorykey = t.productcategorykey 41 AND t.sysenddate = ’9999-12-31 23:59:59.000’ 42 43 INSERT INTO target.dimproductcategory 44 SELECT s.*, 45 t.sysenddate, 46 ’9999-12-31 23:59:59’ 47 FROM dbo.dimproductcategory s, 48 (SELECT *, 49 Row_number() 50 OVER( 51 partition BY productcategorykey 52 ORDER BY sysenddate DESC) AS rn 53 FROM target.dimproductcategory) AS t 54 WHERE t.productcategorykey IN (SELECT s.productcategorykey 55 FROM (SELECT s.productcategorykey, 56 Hashbytes(’SHA2_512’, 57 Concat_ws(’,’, s.productcategoryalternatekey, 58 s.productcategoryname)) AS hash 59 FROM dbo.dimproductcategory s) s 60 JOIN (SELECT t.productcategorykey, 61 Hashbytes(’SHA2_512’, 62 Concat_ws(’,’, 63 t.productcategoryalternatekey, 64 t.productcategoryname)) AS hash 65 FROM (SELECT *, A.2. Data warehouse transformation 85

66 Row_number() 67 OVER( 68 partition BY 69 productcategorykey 70 ORDER BY sysenddate 71 DESC) AS rn 72 FROM target.dimproductcategory) 73 AS t 74 WHERE rn = 1) t 75 ON s.productcategorykey = t.productcategorykey 76 WHERE s.hash != t.hash) 77 AND s.productcategorykey = t.productcategorykey 78 AND rn = 1;

DimProductSubcategory

1 -- Source data set query 2 SELECT s.*, 3 Getdate() AS etldate, 4 ’9999-12-31 23:59:59’ AS enddate 5 FROM dbo.dimproductsubcategory s 6 LEFT OUTER JOIN target.dimproductsubcategory t 7 ON s.productcategorykey = t.productcategorykey 8 WHERE t.productcategorykey IS NULL; 9 -- Pre-copy script 10 UPDATE target.dimproductsubcategory 11 SET sysenddate = Getdate() 12 WHERE productsubcategorykey IN (SELECT t2.productsubcategorykey 13 FROM target.dimproductsubcategory t2 14 LEFT OUTER JOIN dbo.dimproductsubcategory s 15 ON s.productsubcategorykey = 16 t2.productsubcategorykey 17 WHERE s.productsubcategorykey IS NULL) 18 AND sysenddate = ’9999-12-31 23:59:59’ 19 20 UPDATE t 21 SET SysEndDate = Getdate() 22 FROM target.dimproductsubcategory t, 23 dbo.dimproductsubcategory s 24 WHERE t.productsubcategorykey IN (SELECT s.productsubcategorykey 25 FROM (SELECT s.productsubcategorykey, 26 Hashbytes(’SHA2_512’, 27 Concat_ws(’,’, s.productsubcategoryalternatekey, 28 s.productsubcategoryname, 29 s.productcategorykey)) AS 30 hash 31 FROM dbo.dimproductsubcategory s) s 32 JOIN (SELECT t.productsubcategorykey, 33 Hashbytes(’SHA2_512’, 34 Concat_ws(’,’, 35 t.productsubcategoryalternatekey, 36 t.productsubcategoryname, 37 t.productcategorykey)) AS 38 hash, 39 t.sysenddate 86 Appendix A. Source Code

40 FROM target.dimproductsubcategory t) t 41 ON s.productsubcategorykey = t.productsubcategorykey 42 WHERE s.hash != t.hash 43 AND t.sysenddate = ’9999-12-31 23:59:59.000’) 44 AND s.productsubcategorykey = t.productsubcategorykey 45 AND t.sysenddate = ’9999-12-31 23:59:59.000’ 46 47 INSERT INTO target.dimproductsubcategory 48 SELECT s.*, 49 t.sysenddate, 50 ’9999-12-31 23:59:59’ 51 FROM dbo.dimproductsubcategory s, 52 (SELECT *, 53 Row_number() 54 OVER( 55 partition BY productsubcategorykey 56 ORDER BY sysenddate DESC) AS rn 57 FROM target.dimproductsubcategory) AS t 58 WHERE t.productsubcategorykey IN (SELECT s.productsubcategorykey 59 FROM (SELECT s.productsubcategorykey, 60 Hashbytes(’SHA2_512’, 61 Concat_ws(’,’, s.productsubcategoryalternatekey, 62 s.productsubcategoryname, 63 s.productcategorykey)) AS 64 hash 65 FROM dbo.dimproductsubcategory s) s 66 JOIN (SELECT t.productsubcategorykey, 67 Hashbytes(’SHA2_512’, 68 Concat_ws(’,’, 69 t.productsubcategoryalternatekey, 70 t.productsubcategoryname, 71 t.productcategorykey)) AS 72 hash 73 FROM (SELECT *, 74 Row_number() 75 OVER( 76 partition BY productsubcategorykey 77 ORDER BY sysenddate DESC) AS rn 78 FROM target.dimproductsubcategory) AS t 79 WHERE rn = 1) t 80 ON s.productsubcategorykey = t.productsubcategorykey 81 WHERE s.hash != t.hash) 82 AND s.productsubcategorykey = t.productsubcategorykey 83 AND rn = 1;

DimReseller

1 -- Source data set query 2 SELECT s.*, 3 Getdate() AS etldate, 4 ’9999-12-31 23:59:59’ AS enddate 5 FROM dbo.dimreseller s 6 LEFT OUTER JOIN target.dimreseller t 7 ON s.resellerkey = t.resellerkey 8 WHERE t.resellerkey IS NULL; A.2. Data warehouse transformation 87

9 -- Pre-copy script 10 UPDATE target.dimreseller 11 SET sysenddate = Getdate() 12 WHERE resellerkey IN (SELECT t2.resellerkey 13 FROM target.dimreseller t2 14 LEFT OUTER JOIN dbo.dimreseller s 15 ON s.resellerkey = t2.resellerkey 16 WHERE s.resellerkey IS NULL) 17 AND sysenddate = ’9999-12-31 23:59:59’ 18 19 UPDATE t 20 SET SysEndDate = Getdate() 21 FROM target.dimreseller t, 22 dbo.dimreseller s 23 WHERE t.resellerkey IN (SELECT s.resellerkey 24 FROM (SELECT s.resellerkey, 25 Hashbytes(’SHA2_512’, 26 Concat_ws(’,’, s.reselleralternatekey, 27 s.geographykey, 28 s.resellername, 29 s.addressline1, 30 s.addressline2, 31 s.city, s.postalcode, 32 s.stateprovinceid)) AS hash 33 FROM dbo.dimreseller s) s 34 JOIN (SELECT t.resellerkey, 35 Hashbytes(’SHA2_512’, 36 Concat_ws(’,’, 37 t.reselleralternatekey, 38 t.geographykey, 39 t.resellername, 40 t.addressline1, 41 t.addressline2, 42 t.city, t.postalcode, 43 t.stateprovinceid)) AS hash, 44 t.sysenddate 45 FROM target.dimreseller t) t 46 ON s.resellerkey = t.resellerkey 47 WHERE s.hash != t.hash 48 AND t.sysenddate = ’9999-12-31 23:59:59.000’) 49 AND s.resellerkey = t.resellerkey 50 AND t.sysenddate = ’9999-12-31 23:59:59.000’ 51 52 INSERT INTO target.dimreseller 53 SELECT s.*, 54 t.sysenddate, 55 ’9999-12-31 23:59:59’ 56 FROM dbo.dimreseller s, 57 (SELECT *, 58 Row_number() 59 OVER( 60 partition BY resellerkey 61 ORDER BY sysenddate DESC) AS rn 62 FROM target.dimreseller) AS t 88 Appendix A. Source Code

63 WHERE t.resellerkey IN (SELECT s.resellerkey 64 FROM (SELECT s.resellerkey, 65 Hashbytes(’SHA2_512’, 66 Concat_ws(’,’, s.reselleralternatekey, 67 s.geographykey, 68 s.resellername, 69 s.addressline1, 70 s.addressline2, 71 s.city, s.postalcode, 72 s.stateprovinceid)) AS hash 73 FROM dbo.dimreseller s) s 74 JOIN (SELECT t.resellerkey, 75 Hashbytes(’SHA2_512’, 76 Concat_ws(’,’, 77 t.reselleralternatekey, 78 t.geographykey, 79 t.resellername, 80 t.addressline1, 81 t.addressline2, 82 t.city, t.postalcode, 83 t.stateprovinceid)) AS hash 84 FROM (SELECT *, 85 Row_number() 86 OVER( 87 partition BY resellerkey 88 ORDER BY sysenddate DESC) AS 89 rn 90 FROM target.dimreseller) AS t 91 WHERE rn = 1) t 92 ON s.resellerkey = t.resellerkey 93 WHERE s.hash != t.hash) 94 AND s.resellerkey = t.resellerkey 95 AND rn = 1;

DimSalesReason

1 -- Source data set query 2 SELECT s.*, 3 Getdate() AS etldate, 4 ’9999-12-31 23:59:59’ AS enddate 5 FROM dbo.dimsalesreason s 6 LEFT OUTER JOIN target.dimsalesreason t 7 ON s.salesreasonalternatekey = t.salesreasonalternatekey 8 WHERE t.salesreasonalternatekey IS NULL; 9 -- Pre-copy script 10 UPDATE target.dimsalesreason 11 SET sysenddate = Getdate() 12 WHERE salesreasonkey IN (SELECT t2.salesreasonkey 13 FROM target.dimsalesreason t2 14 LEFT OUTER JOIN dbo.dimsalesreason s 15 ON s.salesreasonkey = 16 t2.salesreasonkey 17 WHERE s.salesreasonkey IS NULL) 18 AND sysenddate = ’9999-12-31 23:59:59’ 19 A.2. Data warehouse transformation 89

20 UPDATE t 21 SET SysEndDate = Getdate() 22 FROM target.dimsalesreason t, 23 dbo.dimsalesreason s 24 WHERE t.salesreasonkey IN (SELECT s.salesreasonkey 25 FROM (SELECT s.salesreasonkey, 26 Hashbytes(’SHA2_512’, 27 Concat_ws(’,’, s.salesreasonalternatekey, 28 s.salesreasonname, 29 s.salesreasonreasontype)) AS 30 hash 31 FROM dbo.dimsalesreason s) s 32 JOIN (SELECT t.salesreasonkey, 33 Hashbytes(’SHA2_512’, 34 Concat_ws(’,’, t.salesreasonalternatekey, 35 t.salesreasonname, 36 t.salesreasonreasontype)) AS 37 hash, 38 t.sysenddate 39 FROM target.dimsalesreason t) t 40 ON s.salesreasonkey = t.salesreasonkey 41 WHERE s.hash != t.hash 42 AND t.sysenddate = ’9999-12-31 23:59:59.000’) 43 AND s.salesreasonkey = t.salesreasonkey 44 AND t.sysenddate = ’9999-12-31 23:59:59.000’ 45 46 INSERT INTO target.dimsalesreason 47 SELECT s.*, 48 t.sysenddate, 49 ’9999-12-31 23:59:59’ 50 FROM dbo.dimsalesreason s, 51 (SELECT *, 52 Row_number() 53 OVER( 54 partition BY salesreasonkey 55 ORDER BY sysenddate DESC) AS rn 56 FROM target.dimsalesreason) AS t 57 WHERE t.salesreasonkey IN (SELECT s.salesreasonkey 58 FROM (SELECT s.salesreasonkey, 59 Hashbytes(’SHA2_512’, 60 Concat_ws(’,’, s.salesreasonalternatekey, 61 s.salesreasonname, 62 s.salesreasonreasontype)) AS 63 hash 64 FROM dbo.dimsalesreason s) s 65 JOIN (SELECT t.salesreasonkey, 66 Hashbytes(’SHA2_512’, 67 Concat_ws(’,’, t.salesreasonalternatekey, 68 t.salesreasonname, 69 t.salesreasonreasontype)) AS 70 hash 71 FROM (SELECT *, 72 Row_number() 73 OVER( 90 Appendix A. Source Code

74 partition BY salesreasonkey 75 ORDER BY sysenddate DESC) AS rn 76 FROM target.dimsalesreason) AS t 77 WHERE rn = 1) t 78 ON s.salesreasonkey = t.salesreasonkey 79 WHERE s.hash != t.hash) 80 AND s.salesreasonkey = t.salesreasonkey 81 AND rn = 1;

DimSalesTerritory

1 -- Source data set query 2 SELECT s.*, 3 Getdate() AS etldate, 4 ’9999-12-31 23:59:59’ AS enddate 5 FROM dbo.dimsalesterritory s 6 LEFT OUTER JOIN target.dimsalesterritory t 7 ON s.salesterritorykey = t.salesterritorykey 8 WHERE t.salesterritorykey IS NULL; 9 -- Pre-copy script 10 UPDATE target.dimsalesterritory 11 SET sysenddate = Getdate() 12 WHERE salesterritorykey IN (SELECT t2.salesterritorykey 13 FROM target.dimsalesterritory t2 14 LEFT OUTER JOIN dbo.dimsalesterritory s 15 ON s.salesterritorykey = 16 t2.salesterritorykey 17 WHERE s.salesterritorykey IS NULL) 18 AND sysenddate = ’9999-12-31 23:59:59’ 19 20 UPDATE t 21 SET SysEndDate = Getdate() 22 FROM target.dimsalesterritory t, 23 dbo.dimsalesterritory s 24 WHERE t.salesterritorykey IN (SELECT s.salesterritorykey 25 FROM (SELECT s.salesterritorykey, 26 Hashbytes(’SHA2_512’, 27 Concat_ws(’,’, s.salesterritoryalternatekey, 28 s.salesterritoryregion, 29 s.salesterritorycountry, 30 s.salesterritorygroup)) AS hash 31 FROM dbo.dimsalesterritory s) s 32 JOIN (SELECT t.salesterritorykey, 33 Hashbytes(’SHA2_512’, 34 Concat_ws(’,’, 35 t.salesterritoryalternatekey, 36 t.salesterritoryregion, 37 t.salesterritorycountry, 38 t.salesterritorygroup)) AS hash, 39 t.sysenddate 40 FROM target.dimsalesterritory t) t 41 ON s.salesterritorykey = 42 t.salesterritorykey 43 WHERE s.hash != t.hash 44 AND t.sysenddate = A.2. Data warehouse transformation 91

45 ’9999-12-31 23:59:59.000’) 46 AND s.salesterritorykey = t.salesterritorykey 47 AND t.sysenddate = ’9999-12-31 23:59:59.000’ 48 49 INSERT INTO target.dimsalesterritory 50 SELECT s.*, 51 t.sysenddate, 52 ’9999-12-31 23:59:59’ 53 FROM dbo.dimsalesterritory s, 54 (SELECT *, 55 Row_number() 56 OVER( 57 partition BY salesterritorykey 58 ORDER BY sysenddate DESC) AS rn 59 FROM target.dimsalesterritory) AS t 60 WHERE t.salesterritorykey IN (SELECT s.salesterritorykey 61 FROM (SELECT s.salesterritorykey, 62 Hashbytes(’SHA2_512’, 63 Concat_ws(’,’, s.salesterritoryalternatekey, 64 s.salesterritoryregion, 65 s.salesterritorycountry, 66 s.salesterritorygroup)) AS hash 67 FROM dbo.dimsalesterritory s) s 68 JOIN (SELECT t.salesterritorykey, 69 Hashbytes(’SHA2_512’, 70 Concat_ws(’,’, 71 t.salesterritoryalternatekey, 72 t.salesterritoryregion, 73 t.salesterritorycountry, 74 t.salesterritorygroup)) AS hash 75 FROM (SELECT *, 76 Row_number() 77 OVER( 78 partition BY salesterritorykey 79 ORDER BY sysenddate DESC) AS rn 80 FROM target.dimsalesterritory) AS t 81 WHERE rn = 1) t 82 ON s.salesterritorykey = 83 t.salesterritorykey 84 WHERE s.hash != t.hash) 85 AND s.salesterritorykey = t.salesterritorykey 86 AND rn = 1;

FactInternetSales

1 -- Source data set query 2 SELECT s.*, 3 Getdate() AS etldate, 4 ’9999-12-31 23:59:59’ AS enddate 5 FROM dbo.factinternetsales s 6 LEFT OUTER JOIN target.factinternetsales t 7 ON s.salesordernumber = t.salesordernumber 8 AND s.productkey = t.productkey 9 WHERE t.salesordernumber IS NULL 10 AND t.productkey IS NULL; 92 Appendix A. Source Code

11 -- Pre-copy script 12 UPDATE target.factinternetsales 13 SET sysenddate = Getdate() 14 FROM target.factinternetsales t 15 WHERE NOT EXISTS (SELECT NULL 16 FROM dbo.factinternetsales s 17 WHERE s.salesordernumber = t.salesordernumber 18 AND s.productkey = t.productkey) 19 AND sysenddate = ’9999-12-31 23:59:59’ 20 21 UPDATE t 22 SET SysEndDate = Getdate() 23 FROM target.factinternetsales t, 24 dbo.factinternetsales s 25 WHERE Concat_ws(’,’, t.salesordernumber, t.productkey) IN (SELECT 26 Concat_ws(’,’, s.salesordernumber, s.productkey) 27 FROM 28 (SELECT s.salesordernumber, 29 s.productkey, 30 Hashbytes(’SHA2_512’, Concat_ws(’,’, s.orderdatekey, 31 s.duedatekey, 32 s.shipdatekey, s.customerkey, 33 s.promotionkey, 34 s.currencykey, 35 s.salesterritorykey, s.revisionnumber, 36 s.orderquantity, 37 s.unitprice, 38 s.extendedamount, s.unitpricediscountpct, 39 s.discountamount, 40 s.productstandardcost, 41 s.totalproductcost, 42 s.salesamount, 43 s.taxamt, s.freight, 44 s.carriertrackingnumber, 45 s.customerponumber)) AS hash 46 FROM 47 dbo.factinternetsales s) s 48 JOIN 49 (SELECT t.salesordernumber, 50 t.productkey, 51 Hashbytes(’SHA2_512’, Concat_ws(’,’, t.orderdatekey, 52 t.duedatekey, 53 t.shipdatekey, t.customerkey, 54 t.promotionkey, 55 t.currencykey, 56 t.salesterritorykey, t.revisionnumber, 57 t.orderquantity, 58 t.unitprice, 59 t.extendedamount, t.unitpricediscountpct, 60 t.discountamount, 61 t.productstandardcost, 62 t.totalproductcost, 63 t.salesamount, 64 t.taxamt, t.freight, A.2. Data warehouse transformation 93

65 t.carriertrackingnumber, 66 t.customerponumber)) AS hash, 67 t.sysenddate 68 FROM target.factinternetsales t) t 69 ON s.salesordernumber = t.salesordernumber 70 AND s.productkey = t.productkey 71 WHERE s.hash != t.hash 72 AND t.sysenddate = ’9999-12-31 23:59:59.000’) 73 AND s.salesordernumber = t.salesordernumber 74 AND s.productkey = t.productkey 75 AND t.sysenddate = ’9999-12-31 23:59:59.000’ 76 77 INSERT INTO target.factinternetsales 78 SELECT s.*, 79 t.sysenddate, 80 ’9999-12-31 23:59:59’ 81 FROM dbo.factinternetsales s, 82 (SELECT *, 83 Row_number() 84 OVER( 85 partition BY salesordernumber, productkey 86 ORDER BY sysenddate DESC) AS rn 87 FROM target.factinternetsales) AS t 88 WHERE Concat_ws(’,’, t.salesordernumber, t.productkey) IN (SELECT 89 Concat_ws(’,’, s.salesordernumber, s.productkey) 90 FROM 91 (SELECT s.salesordernumber, 92 s.productkey, 93 Hashbytes(’SHA2_512’, Concat_ws(’,’, s.orderdatekey, 94 s.duedatekey, 95 s.shipdatekey, s.customerkey, 96 s.promotionkey, 97 s.currencykey, 98 s.salesterritorykey, s.revisionnumber, 99 s.orderquantity, 100 s.unitprice, 101 s.extendedamount, s.unitpricediscountpct, 102 s.discountamount, 103 s.productstandardcost, 104 s.totalproductcost, 105 s.salesamount, 106 s.taxamt, s.freight, 107 s.carriertrackingnumber, 108 s.customerponumber)) AS hash 109 FROM 110 dbo.factinternetsales s) s 111 JOIN 112 (SELECT t.salesordernumber, 113 t.productkey, 114 Hashbytes(’SHA2_512’, Concat_ws(’,’, t.orderdatekey, 115 t.duedatekey, 116 t.shipdatekey, t.customerkey, 117 t.promotionkey, 118 t.currencykey, 94 Appendix A. Source Code

119 t.salesterritorykey, t.revisionnumber, 120 t.orderquantity, 121 t.unitprice, 122 t.extendedamount, t.unitpricediscountpct, 123 t.discountamount, 124 t.productstandardcost, 125 t.totalproductcost, 126 t.salesamount, 127 t.taxamt, t.freight, 128 t.carriertrackingnumber, 129 t.customerponumber)) AS hash 130 FROM (SELECT *, 131 Row_number() 132 OVER( 133 partition BY salesordernumber, productkey 134 ORDER BY sysenddate DESC) AS rn 135 FROM target.factinternetsales) AS t 136 WHERE rn = 1) t 137 ON s.salesordernumber = t.salesordernumber 138 AND s.productkey = t.productkey 139 WHERE s.hash != t.hash) 140 AND s.salesordernumber = t.salesordernumber 141 AND s.productkey = t.productkey 142 AND rn = 1;

FactResellerSales

1 -- Source data set query 2 SELECT s.*, 3 Getdate() AS etldate, 4 ’9999-12-31 23:59:59’ AS enddate 5 FROM dbo.factresellersales s 6 LEFT OUTER JOIN target.factresellersales t 7 ON s.salesordernumber = t.salesordernumber 8 AND s.productkey = t.productkey 9 WHERE t.salesordernumber IS NULL 10 AND t.productkey IS NULL; 11 -- Pre-copy script 12 UPDATE target.factresellersales 13 SET sysenddate = Getdate() 14 FROM target.factresellersales t 15 WHERE NOT EXISTS (SELECT NULL 16 FROM dbo.factresellersales s 17 WHERE s.salesordernumber = t.salesordernumber 18 AND s.productkey = t.productkey) 19 AND sysenddate = ’9999-12-31 23:59:59’ 20 21 UPDATE t 22 SET SysEndDate = Getdate() 23 FROM target.factresellersales t, 24 dbo.factresellersales s 25 WHERE Concat_ws(’,’, t.salesordernumber, t.productkey) IN (SELECT 26 Concat_ws(’,’, s.salesordernumber, s.productkey) 27 FROM 28 (SELECT s.salesordernumber, A.2. Data warehouse transformation 95

29 s.productkey, 30 Hashbytes(’SHA2_512’, Concat_ws(’,’, s.orderdate, 31 s.duedate, 32 s.shipdate, 33 s.resellerkey, s.employeekey, 34 s.promotionkey, 35 s.currencykey, 36 s.salesterritorykey, 37 s.revisionnumber, 38 s.orderquantity, 39 s.unitprice, s.extendedamount, 40 s.unitpricediscountpct, 41 s.discountamount, 42 s.productstandardcost, 43 s.totalproductcost, 44 s.salesamount, 45 s.taxamt, s.freight, 46 s.carriertrackingnumber, 47 s.customerponumber)) AS hash 48 FROM 49 dbo.factresellersales s) s 50 JOIN 51 (SELECT t.salesordernumber, 52 t.productkey, 53 Hashbytes(’SHA2_512’, Concat_ws(’,’, t.orderdate, t.duedate, 54 t.shipdate, 55 t.resellerkey, t.employeekey, 56 t.promotionkey, 57 t.currencykey, 58 t.salesterritorykey, 59 t.revisionnumber, 60 t.orderquantity, 61 t.unitprice, t.extendedamount, 62 t.unitpricediscountpct, 63 t.discountamount, t.productstandardcost, 64 t.totalproductcost, 65 t.salesamount, 66 t.taxamt, t.freight, 67 t.carriertrackingnumber, 68 t.customerponumber)) AS hash, 69 t.sysenddate 70 FROM target.factresellersales t) t 71 ON s.salesordernumber = t.salesordernumber 72 AND s.productkey = t.productkey 73 WHERE s.hash != t.hash 74 AND t.sysenddate = ’9999-12-31 23:59:59.000’) 75 AND s.salesordernumber = t.salesordernumber 76 AND s.productkey = t.productkey 77 AND t.sysenddate = ’9999-12-31 23:59:59.000’ 78 79 INSERT INTO target.factresellersales 80 SELECT s.*, 81 t.sysenddate, 82 ’9999-12-31 23:59:59’ 96 Appendix A. Source Code

83 FROM dbo.factresellersales s, 84 (SELECT *, 85 Row_number() 86 OVER( 87 partition BY salesordernumber, productkey 88 ORDER BY sysenddate DESC) AS rn 89 FROM target.factresellersales) AS t 90 WHERE Concat_ws(’,’, t.salesordernumber, t.productkey) IN (SELECT 91 Concat_ws(’,’, s.salesordernumber, s.productkey) 92 FROM 93 (SELECT s.salesordernumber, 94 s.productkey, 95 Hashbytes(’SHA2_512’, Concat_ws(’,’, s.orderdate, 96 s.duedate, 97 s.shipdate, 98 s.resellerkey, s.employeekey, 99 s.promotionkey, 100 s.currencykey, 101 s.salesterritorykey, 102 s.revisionnumber, 103 s.orderquantity, 104 s.unitprice, s.extendedamount, 105 s.unitpricediscountpct, 106 s.discountamount, 107 s.productstandardcost, 108 s.totalproductcost, 109 s.salesamount, 110 s.taxamt, s.freight, 111 s.carriertrackingnumber, 112 s.customerponumber)) AS hash 113 FROM 114 dbo.factresellersales s) s 115 JOIN 116 (SELECT t.salesordernumber, 117 t.productkey, 118 Hashbytes(’SHA2_512’, Concat_ws(’,’, t.orderdate, t.duedate, 119 t.shipdate, 120 t.resellerkey, t.employeekey, 121 t.promotionkey, 122 t.currencykey, 123 t.salesterritorykey, 124 t.revisionnumber, 125 t.orderquantity, 126 t.unitprice, t.extendedamount, 127 t.unitpricediscountpct, 128 t.discountamount, t.productstandardcost, 129 t.totalproductcost, 130 t.salesamount, 131 t.taxamt, t.freight, 132 t.carriertrackingnumber, 133 t.customerponumber)) AS hash 134 FROM (SELECT *, 135 Row_number() 136 OVER( A.2. Data warehouse transformation 97

137 partition BY salesordernumber, productkey 138 ORDER BY sysenddate DESC) AS rn 139 FROM target.factresellersales) AS t 140 WHERE rn = 1) t 141 ON s.salesordernumber = t.salesordernumber 142 AND s.productkey = t.productkey 143 WHERE s.hash != t.hash) 144 AND s.salesordernumber = t.salesordernumber 145 AND s.productkey = t.productkey 146 AND rn = 1;

FactSalesQuota

1 -- Source data set query 2 SELECT s.*, 3 Getdate() AS etldate, 4 ’9999-12-31 23:59:59’ AS enddate 5 FROM dbo.factsalesquota s 6 LEFT OUTER JOIN target.factsalesquota t 7 ON s.salesquotakey = t.salesquotakey 8 WHERE t.salesquotakey IS NULL; 9 -- Pre-copy script 10 UPDATE target.factsalesquota 11 SET sysenddate = Getdate() 12 WHERE salesquotakey IN (SELECT t2.salesquotakey 13 FROM target.factsalesquota t2 14 LEFT OUTER JOIN dbo.factsalesquota s 15 ON s.salesquotakey = 16 t2.salesquotakey 17 WHERE s.salesquotakey IS NULL) 18 AND sysenddate = ’9999-12-31 23:59:59’ 19 20 UPDATE t 21 SET SysEndDate = Getdate() 22 FROM target.factsalesquota t, 23 dbo.factsalesquota s 24 WHERE t.salesquotakey IN (SELECT s.salesquotakey 25 FROM (SELECT s.salesquotakey, 26 Hashbytes(’SHA2_512’, 27 Concat_ws(’,’, s.employeekey, 28 s.quotadate, 29 s.salesamountquota)) AS hash 30 FROM dbo.factsalesquota s) s 31 JOIN (SELECT t.salesquotakey, 32 Hashbytes(’SHA2_512’, 33 Concat_ws(’,’, t.employeekey, 34 t.quotadate, 35 t.salesamountquota)) AS hash, 36 t.sysenddate 37 FROM target.factsalesquota t) t 38 ON s.salesquotakey = t.salesquotakey 39 WHERE s.hash != t.hash 40 AND t.sysenddate = ’9999-12-31 23:59:59.000’) 41 AND s.salesquotakey = t.salesquotakey 42 AND t.sysenddate = ’9999-12-31 23:59:59.000’ 98 Appendix A. Source Code

43 44 INSERT INTO target.factsalesquota 45 (employeekey, 46 quotadate, 47 salesamountquota, 48 etldate, 49 sysenddate) 50 SELECT s.employeekey, 51 s.quotadate, 52 s.salesamountquota, 53 t.sysenddate, 54 ’9999-12-31 23:59:59’ 55 FROM dbo.factsalesquota s, 56 (SELECT *, 57 Row_number() 58 OVER( 59 partition BY salesquotakey 60 ORDER BY sysenddate DESC) AS rn 61 FROM target.factsalesquota) AS t 62 WHERE t.salesquotakey IN (SELECT s.salesquotakey 63 FROM (SELECT s.salesquotakey, 64 Hashbytes(’SHA2_512’, 65 Concat_ws(’,’, s.employeekey, 66 s.quotadate, 67 s.salesamountquota)) AS hash 68 FROM dbo.factsalesquota s) s 69 JOIN (SELECT t.salesquotakey, 70 Hashbytes(’SHA2_512’, 71 Concat_ws(’,’, t.employeekey, 72 t.quotadate, 73 t.salesamountquota)) AS hash 74 FROM (SELECT *, 75 Row_number() 76 OVER( 77 partition BY 78 salesquotakey 79 ORDER BY sysenddate 80 DESC) AS 81 rn 82 FROM target.factsalesquota) AS 83 t 84 WHERE rn = 1) t 85 ON s.salesquotakey = t.salesquotakey 86 WHERE s.hash != t.hash) 87 AND s.salesquotakey = t.salesquotakey 88 AND rn = 1; A.3. Views in temporal database

DimAddress

CREATE VIEW [dbo].[DimAddress] AS
(
    SELECT a.addressid,
           a.addressline1,
           a.addressline2,
           a.city,
           sp.NAME AS stateprovince,
           cr.NAME AS countryregion,
           a.postalcode,
           a.modifieddate,
           (SELECT Max(v)
            FROM (VALUES (a.sysstarttime), (sp.sysstarttime), (cr.sysstarttime)) AS value(v)) AS [SysStartTime],
           (SELECT Min(v)
            FROM (VALUES (a.sysendtime), (sp.sysendtime), (cr.sysendtime)) AS value(v)) AS [SysEndTime]
    FROM   person.address FOR system_time ALL a
           LEFT OUTER JOIN person.stateprovince FOR system_time ALL sp
                  ON a.stateprovinceid = sp.stateprovinceid
           LEFT OUTER JOIN person.countryregion FOR system_time ALL cr
                  ON sp.countryregioncode = cr.countryregioncode
                 -- Combine only row versions whose system-time periods overlap
                 -- (latest start before earliest end).
                 AND (SELECT Max(v)
                      FROM (VALUES (a.sysstarttime), (sp.sysstarttime), (cr.sysstarttime)) AS value(v)) <
                     (SELECT Min(v)
                      FROM (VALUES (a.sysendtime), (sp.sysendtime), (cr.sysendtime)) AS value(v))
)
go

DimCurrency

CREATE VIEW [dbo].[DimCurrency] AS
(
    SELECT c.CurrencyCode,
           c.Name,
           c.SysStartTime,
           c.SysEndTime
    FROM   Sales.Currency FOR SYSTEM_TIME ALL c
)
GO

DimCustomer

1 CREATE VIEW [dbo].[DimCustomer] AS 2 ( 3 SELECT c.customerid AS [CustomerKey], 4 pp.title AS [Title], 5 pp.firstname AS [FirstName], 6 pp.middlename AS [MiddleName], 7 pp.lastname AS [LastName], 8 pe.emailaddress AS [EmailAddress], 9 ppp.phonenumber AS [PhoneNumber], 10 pa.addressline1 AS [AddressLine1], 11 pa.addressline2 AS [AddressLine2], 12 pa.city AS [City], 13 pa.postalcode AS [PostalCode], 14 ( 15 SELECT Max(v) 16 FROM (VALUES 17 ( 18 c.sysstarttime 19 ) 20 , (pp.sysstarttime), (pe.sysstarttime), ( pa.sysstarttime), (ppp.sysstarttime)) AS value(v)) AS [SysStartTime], 21 ( 22 SELECT Min(v) 23 FROM (VALUES 24 ( 25 c.sysendtime 26 ) 27 , (pp.sysendtime), (pe.sysendtime), (pa. sysendtime), (ppp.sysendtime)) AS value(v)) AS [SysEndTime] 28 FROM [Sales].[Customer] FOR system_time ALL c 29 INNER JOIN person.person FOR system_time ALL pp 30 ON pp.businessentityid = c.personid 31 INNER JOIN person.emailaddress FOR system_time ALL pe 32 ON pe.businessentityid = c.personid 33 INNER JOIN person.businessentityaddress FOR system_time ALL pbea 34 ON pbea.businessentityid = c.personid 35 INNER JOIN person.address FOR system_time ALL pa 36 ON pa.addressid = pbea.addressid 37 INNER JOIN person.personphone FOR system_time ALL ppp 38 ON ppp.businessentityid = c.personid 39 AND 40 ( 41 SELECT max(v) 42 FROM (VALUES 43 ( 44 c.sysstarttime 45 ) A.3. Views in temporal database 101

46 , (pp.sysstarttime), (pe.sysstarttime), ( pa.sysstarttime), (ppp.sysstarttime)) AS value(v)) < 47 ( 48 SELECT min(v) 49 FROM (VALUES 50 ( 51 c.sysendtime 52 ) 53 , (pp.sysendtime), (pe.sysendtime), (pa. sysendtime), (ppp.sysendtime)) AS value(v)) 54 )go

DimDepartmentGroup

CREATE VIEW [dbo].[DimDepartmentGroup] AS
(
    SELECT DISTINCT humanresources.department.groupname AS departmentgroupname,
                    humanresources.department.sysstarttime,
                    humanresources.department.sysendtime
    FROM   humanresources.department FOR system_time ALL
)
go

DimEmployee

1 CREATE VIEW [dbo].[DimEmployee] AS 2 ( 3 SELECT e.[BusinessEntityID] AS businessentityid, 4 e.[NationalIDNumber] AS [ EmployeeNationalIDAlternateKey], 5 COALESCE(sp.[TerritoryID], 11) AS [ SalesTerritoryKey], 6 co.[FirstName] AS [FirstName], 7 co.[LastName] AS [LastName], 8 co.[MiddleName] AS [MiddleName], 9 e.[JobTitle] AS [Title], 10 e.[HireDate] AS [HireDate], 11 e.[BirthDate] AS [BirthDate], 12 e.[LoginID] AS [LoginID], 13 em.[EmailAddress] AS [EmailAddress], 14 pp.phonenumber AS [Phone], 15 e.[MaritalStatus] AS [MaritalStatus], 16 e.[SalariedFlag] AS [SalariedFlag], 17 e.[Gender] AS [Gender], 18 eph.[PayFrequency] AS [PayFrequency], 19 eph.[Rate] AS [BaseRate], 20 e.[VacationHours] AS [VacationHours], 21 e.[SickLeaveHours] AS [SickLeaveHours], 22 e.[CurrentFlag] AS [CurrentFlag], 23 d.[Name] AS [DepartmentName], 24 COALESCE(edh.[StartDate], e.[HireDate]) AS [ StartDate], 25 edh.[EndDate] AS [EndDate], 102 Appendix A. Source Code

26 CASE 27 WHEN edh.[EndDate] IS NULL THEN N’Current’ 28 ELSE NULL 29 END AS [Status], 30 ( 31 SELECT Max(v) 32 FROM (VALUES 33 ( 34 e.sysstarttime 35 ) 36 , (co.sysstarttime), (pp. sysstarttime), (em. sysstarttime), (sp. sysstarttime), (edh. sysstarttime), (d. sysstarttime), (eph. sysstarttime)) AS value(v)) AS [SysStartTime], 37 ( 38 SELECT Min(v) 39 FROM (VALUES 40 ( 41 e.sysendtime 42 ) 43 , (co.sysendtime), (pp.sysendtime ), (em.sysendtime), (sp. sysendtime), (edh.sysendtime) , (d.sysendtime), (eph. sysendtime)) AS value(v)) AS [SysEndTime] 44 FROM [HumanResources].[Employee] FOR system_time ALL e 45 INNER JOIN [Person].[Person] FOR system_time ALL co 46 ON e.[BusinessEntityID] = co.[BusinessEntityID] 47 INNER JOIN [Person].[PersonPhone] FOR system_time ALL pp 48 ON pp.businessentityid = e.businessentityid 49 INNER JOIN [Person].[EmailAddress] FOR system_time ALL em 50 ON e.[BusinessEntityID] = em.businessentityid 51 INNER JOIN [Person].[BusinessEntityAddress] FOR system_time ALL ea 52 ON e.[BusinessEntityID] = ea.[BusinessEntityID] 53 INNER JOIN [Person].[Address] FOR system_time ALL a 54 ON ea.[AddressID] = a.[AddressID] 55 LEFT OUTER JOIN [Sales].[SalesPerson] FOR system_time ALL sp 56 ON e.[BusinessEntityID] = sp.[BusinessEntityID] 57 LEFT OUTER JOIN [HumanResources].[EmployeeDepartmentHistory] FOR system_time ALL edh 58 ON e.businessentityid = edh.[BusinessEntityID] 59 INNER JOIN [HumanResources].[Department] FOR system_time ALL d 60 ON edh.[DepartmentID] = d.[DepartmentID] 61 LEFT OUTER JOIN [HumanResources].[EmployeePayHistory] FOR system_time ALL eph 62 ON e.[BusinessEntityID] = eph.[BusinessEntityID] A.3. Views in temporal database 103

63 AND 64 ( 65 SELECT max(v) 66 FROM (VALUES 67 ( 68 e.sysstarttime 69 ) 70 , (co.sysstarttime), (pp. sysstarttime), (em. sysstarttime), (sp. sysstarttime), (edh. sysstarttime), (d. sysstarttime), (eph. sysstarttime)) AS value(v)) < 71 ( 72 SELECT min(v) 73 FROM (VALUES 74 ( 75 e.sysendtime 76 ) 77 , (co.sysendtime), (pp.sysendtime ), (em.sysendtime), (sp. sysendtime), (edh.sysendtime) , (d.sysendtime), (eph. sysendtime)) AS value(v)) 78 )go

DimGeography

CREATE VIEW [dbo].[DimGeography] AS
(
    SELECT DISTINCT a.[City]               AS [City],
                    sp.[StateProvinceCode] AS [StateProvinceCode],
                    sp.[Name]              AS [StateProvinceName],
                    cr.[CountryRegionCode] AS [CountryRegionCode],
                    cr.[Name]              AS [CountryRegionName],
                    a.[PostalCode]         AS [PostalCode],
                    (SELECT Max(v)
                     FROM (VALUES (a.sysstarttime), (sp.sysstarttime), (cr.sysstarttime)) AS value(v)) AS [SysStartTime],
                    (SELECT Min(v)
                     FROM (VALUES (a.sysendtime), (sp.sysendtime), (cr.sysendtime)) AS value(v)) AS [SysEndTime]
    FROM   [Person].[Address] FOR system_time ALL AS a
           INNER JOIN [Person].[StateProvince] FOR system_time ALL AS sp
                   ON a.[StateProvinceID] = sp.[StateProvinceID]
           INNER JOIN [Person].[CountryRegion] FOR system_time ALL AS cr
                   ON sp.[CountryRegionCode] = cr.[CountryRegionCode]
                  AND (SELECT Max(v)
                       FROM (VALUES (a.sysstarttime), (sp.sysstarttime), (cr.sysstarttime)) AS value(v)) <
                      (SELECT Min(v)
                       FROM (VALUES (a.sysendtime), (sp.sysendtime), (cr.sysendtime)) AS value(v))
)
go

DimProduct

1 CREATE VIEW [dbo].[DimProduct] AS 2 ( 3 SELECT p.productnumber AS productalternatekey, 4 p.productsubcategoryid AS productsubcategorykey, 5 p.weightunitmeasurecode AS weightunitmeasurecode, 6 p.sizeunitmeasurecode AS sizeunitmeasurecode, 7 p.[Name] AS productname, 8 pch.standardcost AS standardcost, 9 p.finishedgoodsflag AS finishedgoodsflag, 10 COALESCE(p.color, ’NA’) AS color, 11 p.safetystocklevel AS safetystocklevel, 12 p.reorderpoint AS reorderpoint, 13 plph.listprice AS listprice, 14 p.size AS size, 15 CONVERT(FLOAT, p.weight) AS weight, 16 p.daystomanufacture AS daystomanufacture, 17 p.productline AS productline, 18 p.class AS class, 19 p.style AS style, 20 pm.[Name] AS modelname, 21 COALESCE(plph.startdate, pch.startdate, p. sellstartdate) AS startdate, A.3. Views in temporal database 105

22 COALESCE(plph.enddate, pch.enddate, p. sellenddate) AS enddate, 23 CASE 24 WHEN COALESCE(plph.enddate, pch .enddate, p.sellenddate) IS NULL THEN N’Current’ 25 ELSE NULL 26 END AS status, 27 ( 28 SELECT Max(v) 29 FROM (VALUES 30 ( 31 p.sysstarttime 32 ) 33 , (pm.sysstarttime), (pch. sysstarttime), (plph. sysstarttime)) AS value(v)) AS [SysStartTime], 34 ( 35 SELECT Min(v) 36 FROM (VALUES 37 ( 38 p.sysendtime 39 ) 40 , (pm.sysendtime), (pch. sysendtime), (plph.sysendtime )) AS value(v)) AS [ SysEndTime] 41 FROM production.product FOR system_time ALL p 42 LEFT OUTER JOIN production.productmodel FOR system_time ALL pm 43 ON p.productmodelid = pm.productmodelid 44 LEFT OUTER JOIN production.productcosthistory FOR system_time ALL pch 45 ON p.productid = pch.productid 46 LEFT OUTER JOIN production.productlistpricehistory FOR system_time ALL plph 47 ON p.productid = plph.productid 48 AND pch.startdate = plph.startdate 49 AND COALESCE(pch.enddate, ’12-31-2020’) = COALESCE(plph. enddate, ’12-31-2020’) 50 AND 51 ( 52 SELECT max(v) 53 FROM (VALUES 54 ( 55 p.sysstarttime 56 ) 57 , (pm.sysstarttime), (pch. sysstarttime), (plph. sysstarttime)) AS value(v)) < 58 ( 59 SELECT min(v) 60 FROM (VALUES 106 Appendix A. Source Code

61 ( 62 p.sysendtime 63 ) 64 , (pm.sysendtime), (pch. sysendtime), (plph.sysendtime )) AS value(v)) 65 )go

DimProductCategory

CREATE VIEW [dbo].[DimProductCategory] AS
(
    SELECT DISTINCT pc.productcategoryid AS productcategoryalternatekey,
                    pc.[Name]            AS productcategoryname,
                    pc.sysstarttime,
                    pc.sysendtime
    FROM   [Production].[ProductCategory] FOR system_time ALL pc
)
go

DimProductSubcategory

CREATE VIEW [dbo].[DimProductSubcategory] AS
(
    SELECT DISTINCT ps.productsubcategoryid         AS productsubcategorykey,
                    ps.productsubcategoryid         AS productsubcategoryalternatekey,
                    ps.[Name]                       AS productsubcategoryname,
                    dpc.productcategoryalternatekey AS productcategorykey,
                    (SELECT Max(v)
                     FROM (VALUES (ps.sysstarttime), (dpc.sysstarttime)) AS value(v)) AS [SysStartTime],
                    (SELECT Min(v)
                     FROM (VALUES (ps.sysendtime), (dpc.sysendtime)) AS value(v)) AS [SysEndTime]
    FROM   [Production].[ProductSubcategory] FOR system_time ALL ps
           INNER JOIN dbo.dimproductcategory dpc
                   ON ps.productcategoryid = dpc.productcategoryalternatekey
                  AND (SELECT Max(v)
                       FROM (VALUES (ps.sysstarttime), (dpc.sysstarttime)) AS value(v)) <
                      (SELECT Min(v)
                       FROM (VALUES (ps.sysendtime), (dpc.sysendtime)) AS value(v))
)
go

DimReseller

CREATE VIEW [dbo].[DimReseller] AS
(
    SELECT DISTINCT s.[BusinessEntityID] AS [ResellerKey],
                    s.[Name]             AS [ResellerName],
                    a.addressline1       AS addressline1,
                    a.addressline2       AS addressline2,
                    a.city               AS city,
                    a.postalcode         AS postalcode,
                    a.stateprovinceid    AS stateprovinceid,
                    (SELECT Max(v)
                     FROM (VALUES (s.sysstarttime), (a.sysstarttime)) AS value(v)) AS [SysStartTime],
                    (SELECT Min(v)
                     FROM (VALUES (s.sysendtime), (a.sysendtime)) AS value(v)) AS [SysEndTime]
    FROM   [Sales].[Customer] FOR system_time ALL cu
           INNER JOIN [Sales].[Store] FOR system_time ALL s
                   ON cu.[StoreID] = s.[BusinessEntityID]
           INNER JOIN [Person].[BusinessEntityAddress] FOR system_time ALL bea
                   ON cu.[StoreID] = bea.[BusinessEntityID]
           INNER JOIN [Person].[Address] FOR system_time ALL a
                   ON bea.[AddressID] = a.[AddressID]
           INNER JOIN [Person].[StateProvince] FOR system_time ALL sp
                   ON a.[StateProvinceID] = sp.[StateProvinceID]
           INNER JOIN [Person].[CountryRegion] FOR system_time ALL cr
                   ON sp.[CountryRegionCode] = cr.[CountryRegionCode]
                  AND (SELECT Max(v)
                       FROM (VALUES (s.sysstarttime), (a.sysstarttime)) AS value(v)) <
                      (SELECT Min(v)
                       FROM (VALUES (s.sysendtime), (a.sysendtime)) AS value(v))
    WHERE  bea.[AddressTypeID] = 3 -- Main Office
)
go

DimSalesReason

CREATE VIEW [dbo].[DimSalesReason] AS
(
    SELECT DISTINCT sr.[SalesReasonID] AS [SalesReasonAlternateKey],
                    sr.[Name]          AS [SalesReasonName],
                    sr.[ReasonType]    AS [SalesReasonReasonType],
                    sr.sysstarttime,
                    sr.sysendtime
    FROM   [Sales].[SalesReason] FOR system_time ALL sr
)
go

DimSalesTerritory

CREATE VIEW [dbo].[DimSalesTerritory] AS
(
    SELECT st.[TerritoryID] AS [SalesTerritoryAlternateKey],
           st.[Name]        AS [SalesTerritoryRegion],
           cr.[Name]        AS [SalesTerritoryCountry],
           st.[Group]       AS [SalesTerritoryGroup],
           (SELECT Max(v)
            FROM (VALUES (st.sysstarttime), (cr.sysstarttime)) AS value(v)) AS [SysStartTime],
           (SELECT Min(v)
            FROM (VALUES (st.sysendtime), (cr.sysendtime)) AS value(v)) AS [SysEndTime]
    FROM   [Sales].[SalesTerritory] FOR system_time ALL st
           INNER JOIN [Person].[CountryRegion] FOR system_time ALL cr
                   ON st.[CountryRegionCode] = cr.[CountryRegionCode]
                  AND (SELECT Max(v)
                       FROM (VALUES (st.sysstarttime), (cr.sysstarttime)) AS value(v)) <
                      (SELECT Min(v)
                       FROM (VALUES (st.sysendtime), (cr.sysendtime)) AS value(v))
)
go

FactInternetSales

1 CREATE VIEW [dbo].[FactInternetSales] AS 2 ( 3 SELECT dp.[ProductAlternateKey] AS [ProductKey] , 4 soh.[OrderDate] AS [OrderDateKey] , 5 soh.[DueDate] AS [DueDateKey] , 6 soh.[ShipDate] AS [ShipDateKey] , 7 soh.[CustomerID] AS [CustomerKey] , 8 sod.[SpecialOfferID] AS [PromotionKey] , 9 COALESCE(dc.[CurrencyCode], 10 ( 11 SELECT currencycode 12 FROM [dbo].[DimCurrency] 13 WHERE currencycode = N’USD’)) AS [ CurrencyKey] , 14 soh.[TerritoryID] AS [SalesTerritoryKey] , 15 soh.[SalesOrderNumber] AS [SalesOrderNumber] , 16 soh.[RevisionNumber] AS [RevisionNumber] , 17 sod.[OrderQty] AS [OrderQuantity] , 18 sod.[UnitPrice] AS [UnitPrice] , 19 sod.[OrderQty] * sod.[UnitPrice] AS [ ExtendedAmount] , 20 sod.[UnitPriceDiscount] AS [ UnitPriceDiscountPct] , 21 sod.[OrderQty] * sod.[UnitPrice] * sod.[ UnitPriceDiscount] AS [DiscountAmount] , 22 pch.[StandardCost] AS [ProductStandardCost] , 23 sod.[OrderQty] * pch.[StandardCost] AS [ TotalProductCost] , 24 sod.[LineTotal] AS [SalesAmount] , 25 CONVERT(MONEY, sod.[LineTotal] * 0.08) AS [ TaxAmt] , 26 CONVERT(MONEY, sod.[LineTotal] * 0.025) AS [ Freight] , 110 Appendix A. Source Code

27 sod.[CarrierTrackingNumber] AS [ CarrierTrackingNumber] , 28 soh.[PurchaseOrderNumber] AS [ CustomerPONumber], 29 ( 30 SELECT Max(v) 31 FROM (VALUES 32 ( 33 sod.sysstarttime 34 ) 35 , (soh.sysstarttime), (dp. sysstarttime), (pch. sysstarttime), (dc. sysstarttime)) AS value(v)) AS [SysStartTime], 36 ( 37 SELECT Min(v) 38 FROM (VALUES 39 ( 40 sod.sysendtime 41 ) 42 , (soh.sysendtime), (dp. sysendtime), (pch.sysendtime) , (dc.sysendtime)) AS value(v )) AS [SysEndTime] 43 FROM [Sales].[SalesOrderHeader] FOR system_time ALL soh 44 INNER JOIN [Sales].[SalesOrderDetail] FOR system_time ALL sod 45 ON soh.[SalesOrderID] = sod.[SalesOrderID] 46 INNER JOIN [Production].[Product] FOR system_time ALL p 47 ON sod.[ProductID] = p.[ProductID] 48 INNER JOIN [dbo].[DimProduct] dp 49 ON dp.[ProductAlternateKey] = p.[ProductNumber] COLLATE sql_latin1_general_cp1_ci_as 50 AND [dbo].[udfMinimumDate](soh.[OrderDate], soh.[DueDate]) BETWEEN dp.[StartDate] AND COALESCE(dp.[EndDate], ’ 12-31-9999’)-- Make sure we get all the Sales Orders! 51 INNER JOIN [Sales].[Customer] FOR system_time ALL c 52 ON soh.[CustomerID] = c.[CustomerID] 53 LEFT OUTER JOIN [Production].[ProductCostHistory] FOR system_time ALL pch 54 ON p.[ProductID] = pch.[ProductID] 55 AND [dbo].[udfMinimumDate](soh.[OrderDate], soh.[DueDate]) BETWEEN pch.[StartDate] AND COALESCE(pch.[EndDate], ’ 12-31-9999’)-- Make sure we get all the Sales Orders! 56 LEFT OUTER JOIN [Sales].[CurrencyRate] FOR system_time ALL cr 57 ON soh.[CurrencyRateID] = cr.[CurrencyRateID] 58 LEFT OUTER JOIN [dbo].[DimCurrency] dc 59 ON cr.[ToCurrencyCode] = dc.[CurrencyCode] COLLATE sql_latin1_general_cp1_ci_as 60 LEFT OUTER JOIN [HumanResources].[Employee] FOR system_time ALL e 61 ON soh.[SalesPersonID] = e.[BusinessEntityID] A.3. Views in temporal database 111

62 LEFT OUTER JOIN [dbo].[DimEmployee] de 63 ON e.[NationalIDNumber] = de.[EmployeeNationalIDAlternateKey ] COLLATE sql_latin1_general_cp1_ci_as 64 AND 65 ( 66 SELECT max(v) 67 FROM (VALUES 68 ( 69 sod.sysstarttime 70 ) 71 , (soh.sysstarttime), (dp. sysstarttime), (pch. sysstarttime), (dc. sysstarttime)) AS value(v)) < 72 ( 73 SELECT min(v) 74 FROM (VALUES 75 ( 76 sod.sysendtime 77 ) 78 , (soh.sysendtime), (dp. sysendtime), (pch.sysendtime) , (dc.sysendtime)) AS value(v )) 79 WHERE soh.onlineorderflag = 1 80 )go

FactResellerSales

1 CREATE VIEW [dbo].[FactResellerSales] AS 2 ( 3 SELECT dp.[ProductAlternateKey] AS [ProductKey], 4 soh.[OrderDate] AS [OrderDate], 5 soh.[DueDate] AS [DueDate], 6 soh.[ShipDate] AS [ShipDate], 7 soh.[CustomerID] AS [ResellerKey], 8 de.[BusinessEntityID] AS [EmployeeKey], 9 sod.[SpecialOfferID] AS [PromotionKey], 10 COALESCE(dc.[CurrencyCode], 11 ( 12 SELECT currencycode 13 FROM [dbo].[DimCurrency] 14 WHERE currencycode = N’USD’)) AS [ CurrencyCode], 15 soh.[TerritoryID] AS [SalesTerritoryKey], 16 soh.[SalesOrderNumber] AS [SalesOrderNumber], 17 soh.[RevisionNumber] AS [RevisionNumber], 18 sod.[OrderQty] AS [OrderQuantity], 19 sod.[UnitPrice] AS [UnitPrice], 20 sod.[OrderQty] * sod.[UnitPrice] AS [ ExtendedAmount], 21 sod.[UnitPriceDiscount] AS [ UnitPriceDiscountPct], 22 sod.[OrderQty] * sod.[UnitPrice] * sod.[ UnitPriceDiscount] AS [DiscountAmount], 112 Appendix A. Source Code

23 pch.[StandardCost] AS [ProductStandardCost], 24 sod.[OrderQty] * pch.[StandardCost] AS [ TotalProductCost], 25 sod.[LineTotal] AS [SalesAmount], 26 CONVERT(MONEY, sod.[LineTotal] * 0.08) AS [ TaxAmt], 27 CONVERT(MONEY, sod.[LineTotal] * 0.025) AS [ Freight], 28 sod.[CarrierTrackingNumber] AS [ CarrierTrackingNumber], 29 soh.[PurchaseOrderNumber] AS [ CustomerPONumber], 30 ( 31 SELECT Max(v) 32 FROM (VALUES 33 ( 34 soh.sysstarttime 35 ) 36 , (sod.sysstarttime), (dp. sysstarttime), (pch. sysstarttime), (dc. sysstarttime), (de. sysstarttime)) AS value(v)) AS [SysStartTime], 37 ( 38 SELECT Min(v) 39 FROM (VALUES 40 ( 41 soh.sysendtime 42 ) 43 , (sod.sysendtime), (dp. sysendtime), (pch.sysendtime) , (dc.sysendtime), (de. sysendtime)) AS value(v)) AS [SysEndTime] 44 FROM [Sales].[SalesOrderHeader] FOR system_time ALL soh 45 INNER JOIN [Sales].[SalesOrderDetail] FOR system_time ALL sod 46 ON soh.[SalesOrderID] = sod.[SalesOrderID] 47 INNER JOIN [Production].[Product] FOR system_time ALL p 48 ON sod.[ProductID] = p.[ProductID] 49 INNER JOIN [dbo].[DimProduct] dp 50 ON dp.[ProductAlternateKey] = p.[ProductNumber] 51 AND [dbo].[udfMinimumDate](soh.[OrderDate], soh.[DueDate]) BETWEEN dp.[StartDate] AND COALESCE(dp.[EndDate], ’ 12-31-9999’) 52 INNER JOIN [Sales].[Customer] FOR system_time ALL c 53 ON soh.[CustomerID] = c.[CustomerID] 54 LEFT OUTER JOIN [Production].[ProductCostHistory] FOR system_time ALL pch 55 ON p.[ProductID] = pch.[ProductID] 56 AND [dbo].[udfMinimumDate](soh.[OrderDate], soh.[DueDate]) BETWEEN pch.[StartDate] AND COALESCE(pch.[EndDate], ’ 12-31-9999’) A.3. Views in temporal database 113

57 LEFT OUTER JOIN [Sales].[CurrencyRate] FOR system_time ALL cr 58 ON soh.[CurrencyRateID] = cr.[CurrencyRateID] 59 LEFT OUTER JOIN [dbo].[DimCurrency] dc 60 ON cr.[ToCurrencyCode] = dc.[CurrencyCode] 61 LEFT OUTER JOIN [HumanResources].[Employee] FOR system_time ALL e 62 ON soh.[SalesPersonID] = e.businessentityid 63 LEFT OUTER JOIN [dbo].[DimEmployee] de 64 ON e.[BusinessEntityID] = de.businessentityid 65 AND 66 ( 67 SELECT max(v) 68 FROM (VALUES 69 ( 70 soh.sysstarttime 71 ) 72 , (sod.sysstarttime), (dp. sysstarttime), (pch. sysstarttime), (dc. sysstarttime), (de. sysstarttime)) AS value(v)) < 73 ( 74 SELECT min(v) 75 FROM (VALUES 76 ( 77 soh.sysendtime 78 ) 79 , (sod.sysendtime), (dp. sysendtime), (pch.sysendtime) , (dc.sysendtime), (de. sysendtime)) AS value(v)) 80 WHERE soh.onlineorderflag = 0 81 )go

FactSalesQuota

CREATE VIEW [dbo].[FactSalesQuota] AS
(
    SELECT DISTINCT spqh.businessentityid AS [EmployeeKey],
                    spqh.[QuotaDate]      AS [Quotadate],
                    spqh.[SalesQuota]     AS [SalesAmountQuota],
                    spqh.sysstarttime,
                    spqh.sysendtime
    FROM   [Sales].[SalesPersonQuotaHistory] FOR system_time ALL spqh
)
go
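The views above expose the combined system-time period of the joined tables as SysStartTime and SysEndTime, so a consumer can reconstruct the state of a dimension at any instant by filtering on these two columns. The following query is a minimal sketch of such a point-in-time lookup; the chosen timestamp is purely illustrative and not part of the prototype.

-- Illustrative point-in-time query against one of the views above.
DECLARE @AsOf datetime2 = '2019-06-01T00:00:00';

SELECT *
FROM   dbo.DimSalesTerritory
WHERE  SysStartTime <= @AsOf
  AND  SysEndTime   >  @AsOf;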

B Technical specifications

In the following, the technical specifications of the prototypes are given. Unless stated otherwise, the specifications apply to both configurations.

Virtual Machine: Standard D1 v2 with 1 vCPU and 3.5 GiB memory, running Windows Server 2016 Datacenter and SQL Server 2017. Region: West Europe.

SQL database (low configuration): Azure SQL Database on the 'Basic' tier with 5 DTU and 2 GB. Region: West Europe.

SQL database (high configuration): Azure SQL Database on the 'Standard' tier with 20 DTU and 2 GB. Region: West Europe.

Data Factory: Version 2. Region: West Europe.

Power BI: Desktop version 2.70.5494.761 (64-bit), running on a Windows 10 Home laptop (HP Pavilion x360 Convertible 14-ba1xx) with an internet connection of 200 Mbps on average.
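The database-side configuration can be verified from within SQL Server itself; the following queries are a sketch of such a check, assuming they are run against the prototype databases in Azure SQL (the catalog objects are standard, but these exact checks are not part of the assessment).

-- Sketch: report the current Azure SQL service objective (e.g. 'Basic' for the low configuration)
-- and list the system-versioned tables (relevant for prototype B only).
SELECT DATABASEPROPERTYEX(DB_NAME(), 'ServiceObjective') AS service_objective;

SELECT SCHEMA_NAME(schema_id) AS table_schema,
       name                   AS table_name,
       temporal_type_desc
FROM   sys.tables
WHERE  temporal_type_desc = 'SYSTEM_VERSIONED_TEMPORAL_TABLE';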


C Assessment

C.1. Performance assessment results

In the following, the detailed results of the performance assessment are given. All times are presented either in minutes and seconds (mm:ss) or in hours, minutes, seconds, and milliseconds (hh:mm:ss.ms).

Prototype A: Transfer between staging database and data warehouse

Table                   Low1   Low2   Low3   High1  High2  High3
DimAddress              0:36   0:38   0:36   0:31   0:25   0:28
DimCurrency             0:15   0:16   0:15   0:15   0:20   0:16
DimCustomer             1:15   1:18   1:11   0:37   0:32   0:22
DimDepartmentGroup      0:07   0:07   0:06   0:06   0:08   0:05
DimEmployee             0:22   0:20   0:16   0:21   0:20   0:15
DimGeography            0:22   0:20   0:21   0:24   0:19   0:15
DimProduct              0:08   0:07   0:06   0:07   0:09   0:05
DimProductCategory      0:07   0:06   0:05   0:07   0:08   0:05
DimProductSubCategory   0:17   0:16   0:15   0:16   0:22   0:15
DimReseller             0:19   0:18   0:19   0:17   0:17   0:16
DimSalesReason          0:17   0:16   0:15   0:17   0:19   0:14
DimSalesTerritory       0:16   0:20   0:16   0:16   0:20   0:15
FactInternetSales       3:32   3:35   3:33   1:23   1:18   1:11
FactResellerSales       3:24   3:25   3:20   1:13   1:12   1:11
FactSalesQuota          0:20   0:17   0:16   0:17   0:17   0:16
TempEmployee            0:17   0:18   0:15   0:15   0:16   0:15
TempGeography           0:06   0:05   0:06   0:07   0:07   0:05
TempCurrency            0:17   0:17   0:16   0:16   0:15   0:16
TempProduct             0:16   0:16   0:16   0:15   0:25   0:15
TempProductcategory     0:07   0:05   0:05   0:05   0:10   0:05
Total                   12:40  12:40  12:08  7:25   7:39   6:25


Prototype A: Data warehouse transformation

Table                   Low1   Low2   Low3   High1  High2  High3
DimAddress              2:22   2:20   2:03   0:26   0:47   0:36
DimCurrency             0:56   0:58   1:41   0:13   0:16   0:12
DimCustomer             2:36   2:26   2:25   0:41   0:55   0:42
DimDepartmentGroup      0:46   0:46   1:13   0:39   0:16   0:06
DimEmployee             1:10   1:17   1:33   0:35   0:34   0:22
DimGeography            1:27   1:29   1:21   0:33   0:18   0:22
DimProduct              1:38   1:33   1:11   0:40   0:34   0:21
DimProductCategory      1:29   1:36   1:31   0:39   0:17   0:07
DimProductSubCategory   2:26   2:20   2:18   0:34   0:37   0:25
DimReseller             1:27   1:29   1:22   0:40   0:20   0:22
DimSalesReason          0:52   0:57   1:29   0:36   0:28   0:07
DimSalesTerritory       0:48   0:50   0:59   0:32   0:29   0:05
FactInternetSales       6:15   6:05   6:16   1:01   1:40   1:20
FactResellerSales       6:45   6:35   6:19   1:23   1:49   1:32
FactSalesQuota          1:44   1:36   1:28   0:34   0:16   0:22
Total                   32:41  32:17  33:09  9:46   9:36   7:01

Prototype B: Loading the views

Table                   Low1          Low2          Low3
DimAddress              00:00:05.234  00:00:04.250  00:00:04.641
DimCurrency             00:00:00.250  00:00:00.109  00:00:00.140
DimCustomer             00:00:11.859  00:00:12.516  00:00:11.782
DimDepartmentGroup      00:00:00.078  00:00:00.063  00:00:00.156
DimEmployee             00:00:06.297  00:00:05.954  00:00:05.656
DimGeography            00:00:04.766  00:00:04.438  00:00:04.797
DimProduct              00:00:00.672  00:00:00.562  00:00:00.859
DimProductCategory      00:00:00.109  00:00:00.094  00:00:00.110
DimProductSubCategory   00:00:00.375  00:00:00.344  00:00:00.281
DimReseller             00:00:02.250  00:00:02.750  00:00:02.156
DimSalesReason          00:00:00.078  00:00:00.125  00:00:00.078
DimSalesTerritory       00:00:00.266  00:00:00.250  00:00:00.250
FactInternetSales       00:02:00.971  00:02:02.250  00:02:04.172
FactResellerSales       00:03:39.953  00:03:40.828  00:03:39.813
FactSalesQuota          00:00:00.172  00:00:00.094  00:00:00.109
Total                   00:06:13.330  00:06:14.627  00:06:15.000

Table                   High1         High2         High3
DimAddress              00:00:03.437  00:00:00.812  00:00:00.750
DimCurrency             00:00:00.563  00:00:00.235  00:00:00.344
DimCustomer             00:00:21.765  00:00:01.969  00:00:01.094
DimDepartmentGroup      00:00:00.157  00:00:00.265  00:00:00.156
DimEmployee             00:00:03.937  00:00:00.813  00:00:00.594
DimGeography            00:00:01.391  00:00:01.000  00:00:00.938
DimProduct              00:00:00.594  00:00:00.172  00:00:00.218
DimProductCategory      00:00:00.093  00:00:00.093  00:00:00.125
DimProductSubCategory   00:00:00.203  00:00:00.218  00:00:00.157
DimReseller             00:00:01.157  00:00:00.547  00:00:00.484
DimSalesReason          00:00:00.172  00:00:00.110  00:00:00.110
DimSalesTerritory       00:00:00.218  00:00:00.109  00:00:00.140
FactInternetSales       00:00:33.704  00:00:22.750  00:00:23.595
FactResellerSales       00:00:39.141  00:00:38.141  00:00:38.908
FactSalesQuota          00:00:00.187  00:00:00.141  00:00:00.109
Total                   00:01:46.719  00:01:07.375  00:01:07.722

Power BI data import

              Low1   Low2   Low3   High1  High2  High3
Prototype A   00:57  00:54  00:55  00:21  00:29  00:26
Prototype B   11:56  11:41  12:17  02:36  02:47  02:47

C.2. Assessment scripts

In the following, two scripts are given: one to measure the performance of prototype B and one to test the data integrity of both prototypes.

Performance testing script for prototype B

ALTER DATABASE SCOPED CONFIGURATION CLEAR PROCEDURE_CACHE;

PRINT SYSDATETIME();
GO

SELECT * FROM dbo.DimAddress;
GO
PRINT 'DimAddress' + convert(varchar, SysDateTime(), 21);
GO

SELECT * FROM dbo.DimCurrency;
GO
PRINT 'DimCurrency' + convert(varchar, SysDateTime(), 21);
GO

SELECT * FROM dbo.DimCustomer;
GO
PRINT 'DimCustomer' + convert(varchar, SysDateTime(), 21);
GO

SELECT * FROM dbo.DimDepartmentGroup;
GO
PRINT 'DimDepartmentGroup' + convert(varchar, SysDateTime(), 21);
GO

SELECT * FROM dbo.DimEmployee;
GO
PRINT 'DimEmployee' + convert(varchar, SysDateTime(), 21);
GO

SELECT * FROM dbo.DimGeography;
GO
PRINT 'DimGeography' + convert(varchar, SysDateTime(), 21);
GO

SELECT * FROM dbo.DimProduct;
GO
PRINT 'DimProduct' + convert(varchar, SysDateTime(), 21);
GO

SELECT * FROM dbo.DimProductCategory;
GO
PRINT 'DimProductCategory' + convert(varchar, SysDateTime(), 21);
GO

SELECT * FROM dbo.DimProductSubcategory;
GO
PRINT 'DimProductSubcategory' + convert(varchar, SysDateTime(), 21);
GO

SELECT * FROM dbo.DimReseller;
GO
PRINT 'DimReseller' + convert(varchar, SysDateTime(), 21);
GO

SELECT * FROM dbo.DimSalesReason;
GO
PRINT 'DimSalesReason' + convert(varchar, SysDateTime(), 21);
GO

SELECT * FROM dbo.DimSalesTerritory;
GO
PRINT 'DimSalesTerritory' + convert(varchar, SysDateTime(), 21);
GO

SELECT * FROM dbo.FactInternetSales;
GO
PRINT 'FactInternetSales' + convert(varchar, SysDateTime(), 21);
GO

SELECT * FROM dbo.FactResellerSales;
GO
PRINT 'FactResellerSales' + convert(varchar, SysDateTime(), 21);
GO

SELECT * FROM dbo.FactSalesQuota;
GO
PRINT 'FactSalesQuota' + convert(varchar, SysDateTime(), 21);
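The script above prints a timestamp after every SELECT, so the duration per view follows from the difference between consecutive timestamps. The same measurement could also be taken directly in T-SQL with DATEDIFF; the following fragment is a sketch for a single view and is not part of the assessment script.

-- Sketch: measure the load time of one view directly instead of
-- subtracting the printed timestamps afterwards.
DECLARE @t0 datetime2 = SYSDATETIME();

SELECT * FROM dbo.DimAddress;

PRINT 'DimAddress: '
      + CONVERT(varchar(12), DATEDIFF(MILLISECOND, @t0, SYSDATETIME()))
      + ' ms';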

Data integrity testing script

-- STEP 1: INSERT 10 rows in SalesOrderHeader/SalesOrderDetail

PRINT 'INSERT';
GO

INSERT INTO Sales.SalesOrderHeader (ModifiedDate, rowguid, OnlineOrderFlag, Status, RevisionNumber, OrderDate, DueDate, ShipDate, PurchaseOrderNumber, AccountNumber, CustomerID, SalesPersonID, TerritoryID, BillToAddressID, ShipToAddressID, ShipMethodID, CreditCardID, CreditCardApprovalCode, CurrencyRateID, SubTotal, TaxAmt, Freight)
VALUES (GETDATE(), NEWID(), 1, 5, 8, GETDATE(), GETDATE() + 5, GETDATE() + 3, 'PO522145787', '10-4020-000676', '29825', '279', '5', '985', '985', '5', '16281', '105041Vi84182', 4, '1002', '210', '10');
GO

INSERT INTO Sales.SalesOrderDetail (SalesOrderID, CarrierTrackingNumber, OrderQty, ProductID, SpecialOfferID, UnitPrice, UnitPriceDiscount, rowguid, ModifiedDate)
VALUES ((SELECT TOP(1) SalesOrderID FROM Sales.SalesOrderHeader WHERE subtotal = 1002 ORDER BY ModifiedDate DESC), '4911-403C-98', 2, 776, 1, '2024,994', '0,00', NewID(), GETDATE());
GO

INSERT INTO Sales.SalesOrderHeader (ModifiedDate, rowguid, OnlineOrderFlag, Status, RevisionNumber, OrderDate, DueDate, ShipDate, PurchaseOrderNumber, AccountNumber, CustomerID, SalesPersonID, TerritoryID, BillToAddressID, ShipToAddressID, ShipMethodID, CreditCardID, CreditCardApprovalCode, CurrencyRateID, SubTotal, TaxAmt, Freight)
VALUES (GETDATE(), NEWID(), 1, 5, 8, GETDATE(), GETDATE() + 5, GETDATE() + 3, 'PO522145787', '10-4020-000676', '29825', '279', '5', '985', '985', '5', '16281', '105041Vi84182', 4, '2002', '210', '10');
GO

INSERT INTO Sales.SalesOrderDetail (SalesOrderID, CarrierTrackingNumber, OrderQty, ProductID, SpecialOfferID, UnitPrice, UnitPriceDiscount, rowguid, ModifiedDate)
VALUES ((SELECT TOP(1) SalesOrderID FROM Sales.SalesOrderHeader WHERE subtotal = 2002 ORDER BY ModifiedDate DESC), '4911-403C-98', 2, 776, 1, '2024,994', '0,00', NewID(), GETDATE());
GO

INSERT INTO Sales.SalesOrderHeader (ModifiedDate, rowguid, OnlineOrderFlag, Status, RevisionNumber, OrderDate, DueDate, ShipDate, PurchaseOrderNumber, AccountNumber, CustomerID, SalesPersonID, TerritoryID, BillToAddressID, ShipToAddressID, ShipMethodID, CreditCardID, CreditCardApprovalCode, CurrencyRateID, SubTotal, TaxAmt, Freight)
VALUES (GETDATE(), NEWID(), 1, 5, 8, GETDATE(), GETDATE() + 5, GETDATE() + 3, 'PO522145787', '10-4020-000676', '29825', '279', '5', '985', '985', '5', '16281', '105041Vi84182', 4, '3002', '210', '10');
GO

INSERT INTO Sales.SalesOrderDetail (SalesOrderID, CarrierTrackingNumber, OrderQty, ProductID, SpecialOfferID, UnitPrice, UnitPriceDiscount, rowguid, ModifiedDate)
VALUES ((SELECT TOP(1) SalesOrderID FROM Sales.SalesOrderHeader WHERE subtotal = 3002 ORDER BY ModifiedDate DESC), '4911-403C-98', 2, 776, 1, '2024,994', '0,00', NewID(), GETDATE());
GO

INSERT INTO Sales.SalesOrderHeader (ModifiedDate, rowguid, OnlineOrderFlag, Status, RevisionNumber, OrderDate, DueDate, ShipDate, PurchaseOrderNumber, AccountNumber, CustomerID, SalesPersonID, TerritoryID, BillToAddressID, ShipToAddressID, ShipMethodID, CreditCardID, CreditCardApprovalCode, CurrencyRateID, SubTotal, TaxAmt, Freight)
VALUES (GETDATE(), NEWID(), 1, 5, 8, GETDATE(), GETDATE() + 5, GETDATE() + 3, 'PO522145787', '10-4020-000676', '29825', '279', '5', '985', '985', '5', '16281', '105041Vi84182', 4, '4002', '210', '10');
GO

INSERT INTO Sales.SalesOrderDetail (SalesOrderID, CarrierTrackingNumber, OrderQty, ProductID, SpecialOfferID, UnitPrice, UnitPriceDiscount, rowguid, ModifiedDate)
VALUES ((SELECT TOP(1) SalesOrderID FROM Sales.SalesOrderHeader WHERE subtotal = 4002 ORDER BY ModifiedDate DESC), '4911-403C-98', 2, 776, 1, '2024,994', '0,00', NewID(), GETDATE());
GO

INSERT INTO Sales.SalesOrderHeader (ModifiedDate, rowguid, OnlineOrderFlag, Status, RevisionNumber, OrderDate, DueDate, ShipDate, PurchaseOrderNumber, AccountNumber, CustomerID, SalesPersonID, TerritoryID, BillToAddressID, ShipToAddressID, ShipMethodID, CreditCardID, CreditCardApprovalCode, CurrencyRateID, SubTotal, TaxAmt, Freight)
VALUES (GETDATE(), NEWID(), 1, 5, 8, GETDATE(), GETDATE() + 5, GETDATE() + 3, 'PO522145787', '10-4020-000676', '29825', '279', '5', '985', '985', '5', '16281', '105041Vi84182', 4, '5002', '210', '10');
GO

INSERT INTO Sales.SalesOrderDetail (SalesOrderID, CarrierTrackingNumber, OrderQty, ProductID, SpecialOfferID, UnitPrice, UnitPriceDiscount, rowguid, ModifiedDate)
VALUES ((SELECT TOP(1) SalesOrderID FROM Sales.SalesOrderHeader WHERE subtotal = 5002 ORDER BY ModifiedDate DESC), '4911-403C-98', 2, 776, 1, '2024,994', '0,00', NewID(), GETDATE());
GO

-- STEP 2: UPDATE 10 rows in SalesOrderDetail

PRINT 'UPDATE';
GO

UPDATE Sales.SalesOrderDetail
SET    UnitPrice = CONVERT(MONEY, RAND() * 10000), ModifiedDate = GETDATE()
WHERE  SalesOrderID = 43659 AND SalesOrderDetailID = 1;
GO

UPDATE Sales.SalesOrderDetail
SET    UnitPrice = CONVERT(MONEY, RAND() * 10000), ModifiedDate = GETDATE()
WHERE  SalesOrderID = 43659 AND SalesOrderDetailID = 2;
GO

UPDATE Sales.SalesOrderDetail
SET    UnitPrice = CONVERT(MONEY, RAND() * 10000), ModifiedDate = GETDATE()
WHERE  SalesOrderID = 43659 AND SalesOrderDetailID = 3;
GO

UPDATE Sales.SalesOrderDetail
SET    UnitPrice = CONVERT(MONEY, RAND() * 10000), ModifiedDate = GETDATE()
WHERE  SalesOrderID = 43659 AND SalesOrderDetailID = 4;
GO

UPDATE Sales.SalesOrderDetail
SET    UnitPrice = CONVERT(MONEY, RAND() * 10000), ModifiedDate = GETDATE()
WHERE  SalesOrderID = 43659 AND SalesOrderDetailID = 5;
GO

UPDATE Sales.SalesOrderDetail
SET    UnitPrice = CONVERT(MONEY, RAND() * 10000), ModifiedDate = GETDATE()
WHERE  SalesOrderID = 43659 AND SalesOrderDetailID = 6;
GO

UPDATE Sales.SalesOrderDetail
SET    UnitPrice = CONVERT(MONEY, RAND() * 10000), ModifiedDate = GETDATE()
WHERE  SalesOrderID = 43659 AND SalesOrderDetailID = 7;
GO

UPDATE Sales.SalesOrderDetail
SET    UnitPrice = CONVERT(MONEY, RAND() * 10000), ModifiedDate = GETDATE()
WHERE  SalesOrderID = 43659 AND SalesOrderDetailID = 8;
GO

UPDATE Sales.SalesOrderDetail
SET    UnitPrice = CONVERT(MONEY, RAND() * 10000), ModifiedDate = GETDATE()
WHERE  SalesOrderID = 43659 AND SalesOrderDetailID = 9;
GO

UPDATE Sales.SalesOrderDetail
SET    UnitPrice = CONVERT(MONEY, RAND() * 10000), ModifiedDate = GETDATE()
WHERE  SalesOrderID = 43659 AND SalesOrderDetailID = 10;
GO

-- STEP 3: DELETE THE 10 ROWS CREATED BEFORE

PRINT 'DELETE';
GO

DELETE FROM Sales.SalesOrderHeader WHERE ModifiedDate > '2019-06-17';
GO
DELETE FROM Sales.SalesOrderDetail WHERE ModifiedDate > '2019-06-17' AND SalesOrderID != 43659 AND SalesOrderID != 43661;
GO