Data Warehousing: The Layer between Raw Data Sources and Knowledge

Sivanujann Selliah s093042

Kongens Lyngby, August 2014

Technical University of Denmark
DTU Compute, Department of Applied Mathematics and Computer Science
Richard Petersens Plads, Building 324
DK-2800 Kgs. Lyngby, Denmark
Phone +45 45 25 30 31
[email protected]
www.compute.dtu.dk

Abstract

Data warehousing is an essential part of delivering Business Intelligence (BI). It is the foundation which makes it possible to analyse and get answers to many different questions about large data sets.

The Technical University of Denmark (DTU) is for the second time participating in a competition to build an energy efficient house. The competition is called Solar Decathlon Europe. For the 2014 edition of this competition the entry from DTU is called EMBRACE. This thesis will investigate how a data warehouse and BI solution can be made for the EMBRACE house. The principal research question of this thesis is how to give occupants and other subsystems of an EMBRACE house insights into the functioning of the house, based on measurements made by the control systems. The proposed solution is to make a data warehouse which is exposed through an OLAP cube, which is based on a dimensional model. For reporting insights a BI application is also proposed.

The data warehouse was built using an ETL system which integrated four different operational source systems. These systems were the Hardware System Integration, Data Collection, Weather Data, and SDE Measurements. All the data from these systems was put into a star schema dimensional model, with multiple fact tables and dimension tables. The ETL system consisted of two types of ETL processing, one based on Entity Framework (EF) and another based on SSIS. The SSIS processing outperformed the EF-based ETL, but the EF-based ETL was more flexible. The data warehouse was exposed through the OLAP system SSAS, which made query performance much better by pre-calculating aggregates, and it provided hierarchical intelligence for the dimensions.

The final implementation was evaluated through a number of integration tests done through black-box testing. All the integration tests succeeded, which means that the components of the DW/BI solution and the integration between these work as intended. To test the functional requirements of the DW/BI solution a number of system tests were done through black-box testing. It was shown that all the functional requirements are met by the DW/BI solution implemented. Besides system testing, acceptance testing was also conducted, which included performance testing. These tests showed that the processing times set in the non-functional requirements are mostly met by the DW/BI solution, especially by the SSIS package. Lastly, to evaluate the DW/BI solution a number of BI reports were made, which showed that valuable knowledge can be gained from the insights provided by the DW/BI solution.

The thesis shows that because BI solutions are quite generic it is actually possible to build a data warehouse and BI solution for an intelligent house context.

Keywords: Data Warehouse, Business Intelligence, Dimensional Modelling, Cube, OLAP, Star Schema, Intelligent Housing, Home Control, Home Automation

Resumé

Data warehousing is a necessary part of Business Intelligence (BI). It is the foundation which makes it possible to analyse and get answers to many different questions about large data sets.

The Technical University of Denmark (DTU) is participating for the second time in a competition for energy efficient houses. The competition is called Solar Decathlon Europe. For the 2014 edition of the competition, DTU's entry is a house called EMBRACE. This thesis will investigate how a data warehouse and BI solution can be made for the EMBRACE house. The principal research question of the thesis is how occupants and other subsystems of the EMBRACE house can be given insight into the functioning of the house based on measurements made by the control systems in the house. The proposed solution is to make a data warehouse which is exposed through an OLAP cube, based on a star schema, which is a dimensional data model. For insight reporting a BI application is also proposed.

The finished data warehouse was built with an ETL system which integrates four different operational source systems. These systems were Hardware System Integration, Data Collection, Weather Data and SDE Measurements. All data from these systems was put into the star schema, with a number of fact and dimension tables. The ETL system consisted of two different types of ETL processing, one based on Entity Framework and one based on SSIS. The SSIS processing outperformed the EF-based ETL, but the EF-based ETL was more flexible. The OLAP system SSAS was used to expose the finished data warehouse, which made query performance much better by pre-calculating aggregates, and it also provided hierarchical intelligence for the dimensions.

The implementation was evaluated through a number of integration tests, which were carried out using black-box testing. All integration tests were successful, which means that the components of the DW/BI solution and the integration between them work as expected. To test the functional requirements of the DW/BI solution, a number of system tests were carried out using black-box testing. It turned out that all functional requirements were met by the DW/BI solution. In addition to the system tests, acceptance tests were also carried out, which included performance tests. These tests showed that the processing times set in the non-functional requirements were partly met, especially for the SSIS package. Finally, a number of BI reports were made, which showed that valuable knowledge can be gained from the insights delivered by the DW/BI solution.

The thesis shows how generic BI solutions can actually be used to build a data warehouse and BI solution in an intelligent house context.

Keywords: Data Warehouse, Business Intelligence, Dimensional Modelling, Cube, OLAP, Star Schema, Intelligent Housing, Home Control, Home Automation

Preface

This master's thesis was prepared at DTU Compute, Department of Applied Mathematics and Computer Science, at the Technical University of Denmark (DTU), in fulfilment of the requirements for acquiring a Master of Science in Engineering (Computer Science and Engineering). The thesis was written by Sivanujann Selliah (s093042). The thesis supervisor was Christian Damsgaard Jensen.

The data warehouse is available through http://bi.epiccloud.solardecathlon.dk/.

Kongens Lyngby, August 2014

Sivanujann Selliah

Contents

1 Introduction 1

2 Context 3
2.1 Intelligent Houses ...... 3
2.2 Solar Decathlon Europe 2014 ...... 4
2.2.1 Team DTU ...... 5
2.2.2 The ten contests ...... 5
2.3 EMBRACE ...... 6
2.3.1 Urban context ...... 7
2.3.2 Competition ...... 8
2.3.3 Save, smart and share ...... 8
2.3.4 Appliances and control systems ...... 8
2.3.5 Results from SDE14 ...... 9
2.3.6 After Versailles ...... 9
2.4 EPIC Cloud ...... 10
2.4.1 Logical components ...... 11
2.4.2 Technology used ...... 12
2.5 Business Intelligence ...... 12
2.5.1 Motivation for using BI ...... 14
2.6 Data Warehouse ...... 14
2.6.1 Motivation for using DW ...... 15
2.7 Scope ...... 15
2.8 Summary ...... 16

3 State of the Art 17
3.1 Third Normal Form ...... 17
3.2 Dimensional Modelling ...... 18
3.3 Facts ...... 20
3.3.1 Structure ...... 21
3.3.2 Keys ...... 21
3.3.3 Additivity ...... 22
3.3.4 Grain ...... 23
3.3.5 Null values ...... 24
3.3.6 Types of fact tables ...... 24
3.3.7 Factless ...... 25
3.3.8 Numeric values as dimensions ...... 25
3.3.9 Units of ...... 26
3.4 Dimensions ...... 26
3.4.1 Structure ...... 27
3.4.2 Flags, codes and indicators ...... 28
3.4.3 Junk ...... 28
3.4.4 Keys ...... 28
3.4.5 Drilling down and other dimensional operations ...... 29
3.4.6 Hierarchies ...... 30
3.4.7 Multivalued ...... 30
3.4.8 Date and time of day dimensions ...... 31
3.4.9 Conformed dimensions ...... 31
3.4.10 Drilling across ...... 32
3.4.11 Slowly changing dimensions ...... 32
3.5 Star Schema ...... 33
3.6 OLAP Cubes ...... 35
3.6.1 MDX query language ...... 36
3.7 Design Process ...... 39
3.8 Data Warehouse and Business Intelligence components ...... 40
3.9 Summary ...... 41

4 Goals 43
4.1 Stakeholders ...... 43
4.2 Project Goals ...... 44
4.2.1 Research Questions ...... 44
4.2.2 Principal Research Question ...... 45
4.3 Non-functional requirements ...... 45

5 Analysis 47
5.1 Project goals from given context ...... 47
5.2 Sub-problems ...... 48
5.3 Solutions for sub-problems ...... 50
5.3.1 Extending operational source systems ...... 50
5.3.2 3NF relational model ...... 52
5.3.3 Star schema model ...... 54
5.3.4 Multidimensional system with an OLAP cube ...... 56
5.4 The proposed solution ...... 59
5.5 Business Intelligence Reporting ...... 59
5.5.1 House measurement types ...... 60
5.5.2 Weather measurement types ...... 61
5.5.3 Competition measurement types ...... 61
5.6 Operational Source Systems ...... 61
5.6.1 Hardware System Integration ...... 61
5.6.2 Data Collection ...... 65
5.6.3 Weather Data ...... 65
5.6.4 SDE Measurements ...... 66
5.7 Requirements specification ...... 66
5.7.1 Functional requirements ...... 66
5.7.2 Non-functional requirements ...... 68

6 Design 71
6.1 Architecture ...... 71
6.2 Data Collection ...... 74
6.2.1 Data model ...... 74
6.2.2 Data Collection application layers ...... 76
6.2.3 Helper service ...... 78
6.3 Data Warehouse ...... 81
6.3.1 Design process ...... 81
6.3.2 Cube ...... 87
6.4 ETL ...... 89
6.4.1 Data processing ...... 89
6.4.2 ETL application layers ...... 91
6.5 BI Application ...... 93
6.5.1 Data model ...... 93
6.5.2 BI application layers ...... 95
6.5.3 App Service ...... 96

7 Implementation 99
7.1 Details on Data Collection ...... 99
7.1.1 DataCollectionService ...... 100
7.1.2 DataCollectionHelperService ...... 102
7.2 Details on Data Warehouse ...... 103
7.2.1 Data Warehouse ...... 103
7.2.2 SSAS cube ...... 104
7.3 Details on ETL ...... 112
7.3.1 ETL solution ...... 112
7.3.2 SSIS ETL Package ...... 117
7.4 Details on BI application ...... 129
7.4.1 BI application ...... 129
7.4.2 BI website ...... 131
7.4.3 App service ...... 132

8 Evaluation 141
8.1 Integration tests ...... 141
8.1.1 Data Collection Service ...... 141
8.1.2 Data Collection Helper Service ...... 143
8.1.3 ETL ...... 144
8.1.4 BI application ...... 148
8.2 System tests ...... 148
8.2.1 Gain knowledge about the functioning of the house ...... 149
8.2.2 See how elements relate ...... 149
8.2.3 See how elements of the house and outside conditions relate ...... 149
8.2.4 Perceive the numbers as tangible elements ...... 150
8.2.5 Drilling through data to analyse it ...... 150
8.2.6 Slice and dice data cubes ...... 150
8.2.7 Get relevant and valid data ...... 151
8.3 Acceptance tests ...... 151
8.3.1 Performance tests ...... 151
8.3.2 User acceptance tests ...... 155
8.4 BI reports ...... 156

9 Conclusion 161

Appendix 165

A Third Normal Form 165
A.1 First normal form ...... 165
A.2 Second normal form ...... 165
A.3 Third normal form ...... 166

B Additional dimensional modelling techniques 169
B.1 Facts ...... 169
B.1.1 Textual measures ...... 169
B.1.2 types ...... 169
B.1.3 Centipede ...... 170
B.2 Dimension ...... 170
B.2.1 ...... 171
B.2.2 Outriggers ...... 171
B.2.3 Degenerate dimensions ...... 171
B.2.4 Role playing dimensions ...... 172

C SSAS Cube 173
C.1 Calculation script ...... 173
C.2 Dimension usage ...... 174

D ETL 177
D.1 Data Warehouse entities ...... 177

Bibliography 183

List of Figures

2.3.1 EMBRACE in Versailles ...... 6
2.3.2 Ground floor and stairs. ...... 7
2.3.3 Ground floor with kitchen and living room. ...... 7
2.3.4 Ground floor with kitchen and living room. ...... 7
2.3.5 First floor and door to private terrace. ...... 7
2.3.6 Sheltered garden below the weather shield. ...... 7
2.3.7 Sheltered garden with hobby room. ...... 7
2.3.8 Final scores ...... 10
2.4.1 Logical components of EPIC cloud ...... 11

3.2.1 Example of cube with floor, device and time as dimensions ...... 19
3.3.1 Example of a fact table ...... 21
3.4.1 Example of a date dimension table ...... 27
3.5.1 Example of a star-like structure for a Temperature fact table and five dimensions ...... 34

4.2.1 Project goals ...... 45

5.2.1 Use cases from the goals ...... 48
5.6.1 Excerpt of HSI common interfaces and entities ...... 63

6.1.1 Architecture of the DW/BI solution ...... 72
6.2.1 Subcomponents of the Data Collection component ...... 75
6.2.2 Data model for Data Collection ...... 76
6.2.3 Data Collection packages ...... 77
6.2.4 Sequence for Data Collection Helper service ...... 79
6.2.5 Optional sequence for Data Collection Helper service ...... 80
6.3.1 Subcomponents of the Data Warehouse component ...... 81
6.3.2 Class diagrams for dimensions ...... 83
6.3.3 Class diagram for DeviceGroupFact ...... 84
6.3.4 Class diagram for records ...... 85
6.3.5 Bool records ...... 86
6.3.6 Numeric records ...... 86
6.3.7 Numeric records with duration ...... 87


6.3.8 Class diagram for WeatherFact ...... 88
6.3.9 Classes for SDE measurement facts ...... 88
6.4.1 Subcomponents of the ETL component ...... 89
6.4.2 Processing flow for the ETL ...... 90
6.4.3 Package diagram for ETL ...... 91
6.5.1 Subcomponents of the BI component ...... 93
6.5.2 Data model for BI application ...... 97
6.5.3 Class diagram for BI application ...... 98

7.1.1 Class diagram for Data Collection ...... 100
7.2.1 Date dimension and hierarchies ...... 110
7.2.2 Date dimension attribute relations ...... 110
7.2.3 Device dimension and hierarchies ...... 110
7.2.4 Device dimension attribute relations ...... 111
7.2.5 Time dimension and hierarchies ...... 111
7.2.6 Time dimension attribute relations ...... 111
7.2.7 Weather dimension ...... 111
7.3.1 Service related classes ...... 114
7.3.2 Console class ...... 114
7.3.3 Web related classes ...... 118
7.3.4 ETL business logic classes ...... 119
7.3.5 ETL common classes ...... 120
7.3.6 classes ...... 121
7.3.7 Destination classes ...... 122
7.3.8 Sources classes ...... 123
7.3.9 SSIS Package control flow ...... 124
7.3.10 SSIS Package numeric data flow ...... 125
7.3.11 SSIS Package bool data flow ...... 126
7.4.1 Sequence of getting measures ...... 134
7.4.2 Sequence of getting dimensions ...... 135
7.4.3 Sequence of getting report ...... 136
7.4.4 BI application classes ...... 137
7.4.5 BI application data model ...... 138
7.4.6 BI website classes ...... 139

8.3.1 Processing time vs. number of rows - linear trendline ...... 154
8.3.2 Processing time vs. number of rows - exponential trendline ...... 154
8.4.1 Average temperature for the floor, the indoor and outside temperatures for Versailles during a full day ...... 156
8.4.2 Average indoor, SDE and weather humidity in Versailles during week 27 for a full day ...... 157


8.4.3 Average dimmable light duration in hours by percent level ...... 157
8.4.4 Average dimmable light level by a full day and by floors ...... 158
8.4.5 Energy consumption divided by the different groups of power meters ...... 158
8.4.6 Total amount of kWh consumed and produced by weeks ...... 159
8.4.7 Total amount of kWh consumed and produced by the hours of a full day ...... 159
8.4.8 KPI value and status for energy production vs. energy consumption ...... 160

C.2.1 SSAS cube dimension usage part 1 ...... 175
C.2.2 SSAS cube dimension usage part 2 ...... 175
C.2.3 SSAS cube dimension usage part 3 ...... 175
C.2.4 SSAS cube dimension usage part 4 ...... 176

D.1.1 Bool record entities ...... 178
D.1.2 Numeric record entities ...... 179
D.1.3 Numeric record with duration entities ...... 180
D.1.4 Other fact entities ...... 181
D.1.5 SDE fact entities ...... 182

List of Tables

4.1.1 Stakeholders and their respective stakes ...... 44

5.6.1 Boolean item types ...... 64
5.6.2 Numeric item types ...... 64

8.1.1 Integration tests for Data Collection Service ...... 143
8.1.2 Integration tests for Data Collection Helper Service ...... 144
8.1.3 Integration tests for ETL Data and Time ...... 144
8.1.4 Integration tests for ETL HSI ...... 145
8.1.5 Integration tests for ETL Weather ...... 146
8.1.6 Integration tests for ETL SDE ...... 146
8.1.7 Integration tests for ETL Data Collection ...... 147
8.1.8 Integration tests for SSAS cube ...... 147
8.1.9 Integration tests for BI application ...... 148
8.2.1 System tests for gain knowledge about the functioning of the house ...... 149
8.2.2 System tests for see how elements relate ...... 149
8.2.3 System tests for see how elements of the house and outside conditions relate ...... 150
8.2.4 System tests for perceive the numbers as tangible elements ...... 150
8.2.5 System tests for drilling through data to analyse it ...... 150
8.2.6 System tests for slice and dice data cubes ...... 151
8.2.7 System tests for get relevant and valid data ...... 151
8.3.1 Performance for ETL processing ...... 153
8.3.2 Performance for BI application ...... 155

A.1.1 Denormalised table ...... 165
A.1.2 1NF normalised table ...... 165
A.2.1 1NF normalised table for House entities ...... 166
A.2.2 1NF normalised table for Device entities ...... 166
A.2.3 2NF normalised table for House entities ...... 166
A.2.4 2NF normalised table for Device entities ...... 166
A.3.1 2NF normalised table for House entities ...... 167
A.3.2 2NF normalised table for Device entities ...... 167
A.3.3 3NF normalised table for House entities ...... 167


A.3.4 3NF normalised table for Device entities ...... 167

Acronyms

Acronym  Description
AD       Active Directory
BI       Business Intelligence
CRM      Customer Relationship Management
CRUD     Create, Read, Update and Delete
DAL      Data Access Layer
DW       Data Warehouse / Data Warehousing
DW/BI    Data Warehouse and Business Intelligence
EF       Entity Framework
ERP      Enterprise Resource Planning
ETL      Extract, Transform and Load
HSI      Hardware System Integration
HVAC     Heating, Ventilation, and Air Conditioning
IHC      Intelligent House Concept
IIS      Microsoft Internet Information Services
KPI      Key Performance Indicators
LINQ     Language-Integrated Query
MDX      Multidimensional Expression Language
NF       Normal Form
OLAP     Online Analytical Processing
PIR      Passive Infrared sensors
PLC      Programmable Logic Controller
POS      Point of Sales
PV       Photovoltaic Panels
RDBMS    Relational Database Management System
REST     Representational State Transfer
ROI      Return On Investment
SCD      Slowly Changing Dimension attributes
SDE      Solar Decathlon Europe
SQL      Structured Query Language
SSAS     Microsoft SQL Server Analysis Services
SSIS     Microsoft SQL Server Integration Services

Chapter 1
Introduction

This chapter introduces the thesis by giving an outline of data warehousing and Business Intelligence as well as the EMBRACE house and the competition in which it is competing.

Data warehousing is an essential part of delivering Business Intelligence (BI). It is the foundation which makes it possible to analyse and get answers to many different questions about large data sets. BI is basically about getting knowledge out of large data sets and using it to support decision making. This is why it is being deployed more and more in companies around the world. By basing decisions on hard facts, businesses can improve their competitive advantage and seek new business opportunities. Nowadays BI solutions are quite generic, which makes it possible to apply the same technology to many different areas. In this thesis BI and data warehousing will be used in an intelligent house context instead of a normal business context.

Intelligent houses can help occupants utilise a house more efficiently, which includes reducing the amount of energy consumed. These houses usually handle a lot of information, which is never shown to the occupants. But if awareness of one's energy consumption and overall usage of a house can be presented in a proper way, it might help occupants utilise the house even better, and thereby e.g. reduce the energy consumption even further.

The Technical University of Denmark (DTU) is for the second time participating in a competition to build an energy efficient house. The competition is called Solar Decathlon Europe. For the 2014 edition of this competition the entry from DTU is called EMBRACE, which is a house split into two main parts: a thermal envelope, which is the dwelling, and a sheltered garden. The competition is between 20 houses from around the world. The houses are evaluated in ten separate contests, which are about different aspects of an energy efficient house. For this house a cloud-based control system is made, which includes a data warehouse. The data warehouse will collect data and expose it through a BI application, which can be used by the occupants to monitor and quickly react on the overall house functioning, e.g. to reduce the energy consumption.

This thesis will investigate how a data warehouse and BI solution can be made for the EMBRACE house. This requires an understanding of data warehousing and how data can be modelled to facilitate the BI applications to generate insights for occupants and subsystems of the cloud-based control system.

The principal research question of this thesis is how to give occupants and other subsystems of an EMBRACE house insights into the functioning of the house, based on measurements made by the control systems. The main goal of this thesis is inherited from the EMBRACE project, which is to save energy. This goal is decomposed into the goals of providing insights by analysing data through finding interesting relations, drilling through data, slicing and dicing data cubes, and doing this in an intuitive and simple way for end-users.

The proposed solution is to make a data warehouse which is exposed through an OLAP cube, which is based on a star schema dimensional model. For reporting insights a BI application is also proposed. The data warehouse and BI solution will integrate several operational source systems, to be able to provide valuable insights about the house functioning. The solution is implemented and is then evaluated by making integration tests, system tests and acceptance tests. Finally a number of example insight reports are made to see whether the solution can be used in an intelligent house context.

Because BI solutions are quite generic, it is shown in this thesis that it is actually possible to make a data warehouse and BI solution for an intelligent house context, and use it for gaining insights into house functioning. This can help both occupants and subsystems to utilise an EMBRACE house better.

Chapter 2
Context

”One of the most important assets of any organization is its information”
—Ralph Kimball, Margy Ross, 2013

Contents
2.1 Intelligent Houses ...... 3
2.2 Solar Decathlon Europe 2014 ...... 4
2.3 EMBRACE ...... 6
2.4 EPIC Cloud ...... 10
2.5 Business Intelligence ...... 12
2.6 Data Warehouse ...... 14
2.7 Scope ...... 15
2.8 Summary ...... 16

This chapter sets the context for the project by introducing intelligent houses, the Solar Decathlon competition, the EMBRACE house and the EPIC cloud.

2.1 Intelligent Houses

A house that knows when you are at home, and can open the shades and turn on the lights when a room is dark. A house which knows that when the windows are opened, it probably should not have full floor heating on. A house that can monitor the energy consumption and production, and give the occupants an instant view of how these are divided between the different installations and appliances. A house which can be controlled through a single and simple interface, e.g. where lighting and temperature settings can be set. These are just a few examples of houses which can be viewed as intelligent.

Most of this can be done using home automation, which is an installation where most electrical components and Heating, Ventilation, and Air Conditioning (HVAC) systems communicate with a central processing unit, which runs automation software. Home automation is increasingly being deployed in newer houses as part of intelligent installations, because of the increased ease and comfort, as well as the energy saving aspects, which are provided by the automatic functioning of the most energy consuming installations and appliances, like the HVAC systems and lighting. Intelligent House Concept (IHC) and Programmable Logic Controller (PLC) hardware are examples of home automation hardware; these are also referred to as control systems.

Most home automation installations are hardware and software dependent, and the hardware can only be used by the proprietary software. This means that occupants will have to use several different devices to control their home, or use the same vendor for all installations. The automatic functioning is mostly done by checking for specific scenarios and setting set points on installations or separate systems. The monitoring of energy usage and the overall usage of the house is mostly used for the internal operation of the home automation software, and the information is rarely given to the occupants.

Home automation tries to make houses more intelligent and efficient by automating house functioning of e.g. the HVAC systems, which are often just set once by the occupants and never adjusted, making these systems function inefficiently in the ever changing environment of a house. However, it is still important for the occupants to be able to override the settings of the home automation, so the indoor climate and general house functioning are to their preferences.

2.2 Solar Decathlon Europe 2014

An international competition is held every other year among universities to promote research and development of efficient houses. The competition is called Solar Decathlon Europe (SDE), which is a complementary competition to the U.S. Department of Energy Solar Decathlon. The two competitions are held in alternating years.

The U.S. Department of Energy Solar Decathlon was introduced in 2002; it tries to unite the best universities around the world to design, build and operate a solar-powered house which is full-scale and fully functional. The Solar Decathlon Europe competition has taken place in 2010 and 2012. The 2014 edition of the Solar Decathlon Europe is held at Versailles, France, where 20 different proposals for efficient houses, from 16 countries, will be competing in ten separate contests, which is why it is a decathlon.

The objective of the competition is to design and build houses which are efficient. One aspect of this is to consume as few natural resources, e.g. energy, as possible. This can be done by producing enough energy from renewable sources, e.g. solar energy, to accommodate the total energy consumption of the houses. When houses can produce more energy than is needed by the houses themselves, the houses are called surplus-energy houses. Another aspect of efficiency is that the houses should produce a minimum of waste. It is important that indoor climate and energy consumption are addressed as early as possible in the design process of the houses, so these can be fully integrated.

The teams assemble their houses for the competition in Versailles and they will be competing during July 2014. The competition site, a micro-city called Cité du Soleil, is open to the public, where visitors can get inspiration on how different teams around the world envision energy efficient houses.


The competition is supported by universities, institutions and companies, who provide technical know-how, products and financial subsidies. The proposals and projects are however mostly student-driven and are advised by faculty advisers.

2.2.1 Team DTU

The Technical University of Denmark (DTU) is participating in the Solar Decathlon Europe competition for the second time. The first house, which competed in Madrid, Spain, in 2012, was called Fold. It ranked 10th out of the 18 competing houses. It received an award for the best photovoltaic panel (PV) integration. In the 2014 edition of the competition, the entry from DTU is called EMBRACE.

The main department responsible for the project is DTU Civil Engineering, while several other departments are also participating, including DTU Compute. The project tries to set new standards in Denmark for low-energy houses. There is a huge potential for low-energy houses, because almost half of the energy used in Danish houses is consumed by HVAC and lighting. The EU has also put forward a requirement to make all new houses zero-energy by 2020.

The team from DTU consists of approximately 60 engineering students from various study backgrounds, ranging from architectural engineering and sustainable energy to computer science. Apart from the study backgrounds the team is also very diverse; it includes 13 different nationalities. To make the EMBRACE project possible, 31 sponsors have contributed with technical know-how, products and financial subsidies.

2.2.2 The ten contests

The ten contests are in Architecture; House Functioning; Engineering and Construction; Communication and Social Awareness; Energy Efficiency; Urban Design, Transportation and Affordability; Electrical Energy Balance; Innovation; Comfort Conditions; and Sustainability. The total number of points available from these contests is 1000 points. The entire competition is evaluated through juries, monitoring and task completions.

• The Architecture contest is evaluated on the design coherence, the flexibility and maximisation of the space used by the house, the technologies and the bioclimatic strategies used.
• The House Functioning contest is evaluated on the functionality and efficiency of the appliances used in the house.
• The Engineering and Construction contest is evaluated on the structure of the house, the envelope, and the functionality of the electricity, plumbing and solar systems.
• The Communication and Social Awareness contest is evaluated on how the teams have found creative, effective and efficient ways of communicating the project.
• The Energy Efficiency contest is evaluated on how the systems and the house are designed to reduce energy consumption.
• The Urban Design, Transportation and Affordability contest is evaluated on how the housing units are grouped in a social and urban context.
• The Electrical Energy Balance contest is evaluated on electrical energy self-sufficiency and efficiency of the energy balance.
• The Innovation contest is evaluated on the innovation made based on changes from previous competitions.
• The Comfort Conditions contest is evaluated on the interior comfort parameters like temperature, humidity, acoustics and lighting.
• The Sustainability contest is evaluated on the sustainability of the project and the reduction of the impact on the environment.

To focus on as many aspects of each contest as possible, the DTU team is split into ten sub-teams, which all try to compromise with the other sub-teams to include as many of the important aspects from the individual sub-teams as possible. This approach will hopefully make the project better in all of the ten contests.

2.3 EMBRACE

The EMBRACE project is based on the main concepts and technologies from the 2012 house, Fold, with a new context of modern dense cities. It is designed for two-person families. The house is seen in Figure 2.3.1. The reason it is called EMBRACE is the embracing envelope created by the split of the house into two parts, a thermal envelope (the dwelling) and a weather shield.

Figure 2.3.1: EMBRACE in Versailles

The weather shield protects the building e.g. from rain, which means it can be used for most of the year. The weather shield can be used for multiple dwellings when they are placed next to each other. The area under the weather shield is called the sheltered garden and is considered a semi-private/semi-public area, which makes it easier to get in contact with neighbours. To keep a private sphere for each dwelling, a private terrace is made on the first floor.

The weather shield integrates small photovoltaic panels (PV) in a special pattern, which provides solar energy as well as shading for the sheltered garden. The weather shield is seen in Figure 2.3.6, where the black marks are the PV panels. It is projected that the large PV panels can produce around 185% of the electrical consumption of the house. Solar thermal collectors, which provide thermal energy, are placed above the thermal envelope. A skylight is also integrated in this to get daylight into the dwelling. The entire weather shield is opened to allow better natural ventilation.

Figure 2.3.2: Ground floor and stairs.
Figure 2.3.3: Ground floor with kitchen and living room.
Figure 2.3.4: Ground floor with kitchen and living room.
Figure 2.3.5: First floor and door to private terrace.
Figure 2.3.6: Sheltered garden below the weather shield.
Figure 2.3.7: Sheltered garden with hobby room.

To nudge the occupants of the house to use the sheltered garden, the size of the thermal envelope has been reduced, so that it only includes a bathroom, a living-room, which includes the kitchen and the technical core, on the ground floor, and a bedroom on the first floor. The ground floor is seen in Figures 2.3.2, 2.3.3 and 2.3.4. The first floor is seen in Figure 2.3.5; the door seen in the figure leads to the private terrace.

In the sheltered garden a shed is also included; this room is both meant to be used as a hobby room and as a place for shareable appliances, which can be shared with neighbours, e.g. washing machines. The hobby room is situated right below the private terrace. The wall with the window, seen in Figure 2.3.7, is the hobby room. The technical core includes all the technical equipment used for the house, including the HVAC systems and the IHC/PLC systems.

2.3.1 Urban context

The house is designed to fit into the Copenhagen CO2-neutral 2025 plan, which includes a new smart-grid in a ”dense/lowrise/mix-use” urban area around the northern harbour of Copenhagen called Nordhavn. However, the concept is open for adaptation.


Copenhagen is growing in population every year, and to accommodate the denser population of this soon-to-be mega-city, the non-utilised space on the roofs of existing buildings can be used. Here the EMBRACE house can easily fit, and with the introduction of smart-grids the surplus energy produced can be used by the existing buildings below. The project tries to put a suburban atmosphere into the city centre by creating a community of EMBRACE dwellings side-by-side on top of existing buildings in the city. Instead of having people live in the suburbs and work in the inner cities, commuting back and forth, it will be more attractive for these people to live in an EMBRACE house; hence the need for a car is reduced and they will be more likely to use public transport and bicycles, thereby reducing the overall pollution and the pressure on the existing infrastructure.

2.3.2 Competition

The house is transported to France after being constructed at DTU. This is the reason why it is built of easily prefabricated modules, making it easy to transport and assemble. The house is mainly built using wood, because this is more sustainable and it is lightweight, which is important since these houses should be put on top of existing buildings. The rooms are strategically placed so that piping and wiring from the technical core are kept to a minimum. This also makes it easily maintainable.

2.3.3 Save, smart and share

The keywords for the project are life quality, sobriety and sustainability: save, smart and share.

• The EMBRACE house is Smart because of its design, where innovative systems and architectural design are combined; this is done by having all involved parties take part from the very start of the project. This makes the project more integrated. The smart part also refers to the usage and control of the house.
• The EMBRACE project tries to Save energy, space and money. Saving is a main parameter of the Solar Decathlon competition. The house is built using sustainable materials, and the energy required for the construction and overall use throughout the life cycle of the house is reduced to a minimum. By building on top of existing buildings it also saves space. And by the use of solar energy it reduces the amount of energy taken from natural resources, and it can even send the surplus energy into the smart grid, and thereby e.g. to the host building.
• The EMBRACE house is designed for Sharing. The sheltered garden creates a community for neighbours, where they get to know each other better and share common things. The house also includes the dwelling, which makes it possible for the occupants to have the privacy of their own home.

2.3.4 Appliances and control systems

To reduce electrical consumption to a minimum, artificial lighting is kept to a minimum by using daylight and efficient LED lamps. Also, only efficient appliances have been chosen. For heating and cooling an efficient air-to-water pump is used. When heating is required the pump circulates cold water in the pipes, and the hot air or surface, e.g. in the middle of the day when the sun is highest, will heat the water. When cooling is required the pump circulates hot water through the pipes, and the cold air or surface, e.g. at night, will cool the water. The hot and cold water is kept in storage tanks for later usage.

Fitting the EMBRACE house to an electrical smart-grid makes it possible to schedule energy consumption according to tariff-pricing of electricity. E.g. it is possible to turn on home appliances when the tariff prices are lower, since the house produces more energy than it consumes.

All the electrical components are interconnected with the control systems, the IHC together with the PLC. This makes it possible for the house to function without interaction from the occupants. But it also allows the occupants to control the house through tablets. The control systems can control the windows, shadings, heating, cooling, lights and sockets. The systems are controlled through routers in the dwellings, which send information to a cloud service. The information gathered makes it possible for occupants to monitor their use of the house and their energy consumption and production, and thereby take action on reducing the energy consumption of different appliances or devices used in the dwelling.

2.3.5 Results from SDE14

The competition in Versailles has finished and the final scores for each contest are given below.

• Architecture: 78 points (ranked #12)
• Engineering & Construction: 69.6 points (ranked #7)
• Energy Efficiency: 71.84 points (ranked #9)
• Electrical Energy Balance: 79.22 points (ranked #7)
• Comfort Conditions: 99.23 points (ranked #8)
• House Functioning: 90.66 points (ranked #11)
• Communication & Social Awareness: 64 points (ranked #8)
• Urban design, Transportation & Affordability: 101.65 points (ranked #4)
• Innovation: 59.81 points (ranked #9)
• Sustainability: 68 points (ranked #6)

Additionally, the project was given 2 penalty points and no bonus points, because of a delay in the construction phase. The total number of points was 780.01, which ranked the project #8 out of 20, an advancement of 2 ranks compared to the 2012 house. The final scores of all the competing houses are seen in Figure 2.3.8.

2.3.6 After Versailles

Aer the competition in Versailles, the Embrace house will be assembled and exhibited in Universe Expe- rience Park, Denmark.


Figure 2.3.8: Final scores

2.4 EPIC Cloud

To be able to control and monitor the EMBRACE house, a server needs to communicate with the home control systems.

In the Fold house for the 2012 competition, a server was situated in the house, which gave several significant problems, such as unavailable access from the internet, and the energy consumption was so high that most of the energy produced by the house was consumed by the server, which is not very convenient when participating in an energy saving competition. The problems faced in 2012 mean that a new solution must be found. One solution is to locate the server in the cloud, and with the rise of cloud computing and services, this makes a lot of sense. However, even though most of these services are relatively cheap, the costs are not justified for a student project. Another solution, which is also cloud based, is to have a server on-premises acting as a cloud; this solution requires a server to be located at DTU and accessible from the internet on a select range of ports. This solution is chosen for this project.

Since the cloud needs to be on the same network as the home control systems, a router is connected to the home control systems in the house and the IP traffic is relayed through port forwarding from the WAN IP. This makes it possible for the cloud to connect to the home control systems. Another solution is to run a VPN server on the router, which would also encrypt all traffic; however, this option is not possible for the competition, since the traffic is relayed through several NATs.

The cloud is called EPIC cloud. EPIC is an acronym for Energy Preservation and Information Computation. The name represents the desire to preserve energy while being able to compute as much information as possible. It is able to preserve energy by not consuming any energy from the house, and since multiple houses can be connected to the same cloud, the amount of energy consumed is shared by these, hence making the cloud more efficient. If a cloud provider is chosen, these also pool all the servers in big server farms, where they usually use a green source of energy, e.g. from hydroelectric power. And the cooling is usually provided by the cool air outside the farms, which are located in the colder northern regions.

2.4.1 Logical components


Figure 2.4.1: Logical components of EPIC cloud

The cloud, seen in Figure 2.4.1, is split into the following logical components: Kademlia P2P network, Home Connector, App Service, Artificial Intelligence, Social Network and Data Warehouse/Business Intelligence.

• The Kademlia P2P network is the backbone of the communication between the logical components. It is based on the Kademlia DHT system, and provides a publish-subscribe system for messaging in the network of components in the cloud.
• The Home Connector component enables the cloud to communicate with the home control systems, like the IHC, the PLC and the Uponor system (an HVAC system). This component also includes a logger feature which sends out current values of the devices in the home control systems to the network.
• The App Service is the single source for the applications, e.g. a tablet application, to communicate with the different components in the P2P network over the internet. E.g. it is used to communicate with the Home Connector to control the lights and with the Business Intelligence component to get insights.
• The Artificial Intelligence component makes models of the usage patterns of e.g. the lighting using motion detectors, and tries to optimise when lighting should be turned on and off.
• The Social Network component enables the occupants of the EMBRACE houses to share events, favours and produced products.
• The Business Intelligence component enables the occupants and researchers to gain insights into e.g. how the energy consumption is distributed, and how the houses are being used in terms of the measurements made by the home control systems.
• The Data Warehouse component enables the Business Intelligence component to obtain the information necessary to produce reporting (insights) for different measures and dimensions. This component also collects all the available information from the network as well as other data sources.

2.4.2 Technology used

The technologies used for this thesis are the Microsoft .NET Framework, C#, Microsoft SQL Server, Microsoft SQL Server Analysis Services (SSAS) and Microsoft SQL Server Integration Services (SSIS).

2.5 Business Intelligence

The field of Business Intelligence (BI) comprises several activities, applications, theories, architectures and best practices. It primarily concerns analysis of raw data from the operational source systems of a company, e.g. Customer Relationship Management (CRM), Enterprise Resource Planning (ERP) and Point of Sales (POS) systems. These systems usually contain a vast amount of business critical information, which is typically ”hidden” by application layers. This makes it harder for individual employees of a company to analyse the data. The most significant activities related to BI are online analytical processing (OLAP), querying and reporting of data.

Today companies use BI mostly for decision making support, which can help to cut costs and find new business opportunities, and thereby strengthen the competitive advantage of a company. Decisions in a company range from strategic, which are made by top management, to tactical, which are made by middle management, to operational, which are made by common employees. BI applications should be able to support decision making for all levels of employees in a company. Giving employees access to quickly and easily analyse and make sense of business critical information stored in the operational source systems makes them better prepared to handle even the smallest decisions, which will improve the overall performance of an entire company.

BI should not just be used for e.g. annual reporting; the reporting facilities should be the basis which is used to help make decisions with hard facts, i.e. the numbers from the reports. Having hard facts can improve any decisions made by employees, and help persuade management to be more confident about taking necessary decisions. But BI is not just about showing reports with hard facts, it is also about analysing the numbers and gaining valuable insights into the business, and thereby improving business processes. Numbers alone do not give much insight or value; the context of the numbers gives more insight. E.g. sales figures alone from the two previous months, where the amount is larger one month than the other, might not give much insight into why this happened. To gain more insight the BI application should be able to identify why, e.g. what was changed, and thereby give context to the numbers.

The basis for implementing BI solutions in a company is the raw data sources; these should be extracted, cleaned and transformed to fit the needs and loaded into a proper analytical data format, which ensures the integrity and quality of the numbers that employees can get out, e.g. from querying and reporting. This point is very important, because if the basis or foundation of the numbers is not reliable, then the reporting and querying facilities cannot be used for anything. Another point in this area is to make sure that the business actually has the data that is needed to record the individual business processes, otherwise the BI applications might not be able to give a correct and complete picture of these business processes.

A proper BI solution should also be able to handle change, because business environments change often and quickly to maintain competitive advantage. If the BI applications cannot handle this, then they are not of much use. A perfect BI solution however does not exist, and instead of trying to make one, the focus should be on the business needs and goals, and thereby helping support decision making as much as possible, e.g. by analysing the current decision making and how management currently uses numbers. Return On Investment (ROI) can be significantly improved by focusing on the business needs and goals instead of making a perfect BI solution. The ROI on BI solutions ranges from smaller to very large figures, which is why it is being deployed in more and more companies around the world.

BI applications can help in several ways to optimise a business, like making sure the entire business is aligned with the overall strategy of the company. This can e.g. be done by setting up Key Performance Indicators (KPI), which measure the performance based on the overall strategy; this makes it possible to quickly identify areas which are not aligned with the overall strategy. It can also help in benchmarking, from overall to specific areas of a company, e.g. how the supply chains are performing. This information can even be used for leverage to make better deals with suppliers. In addition to this, BI applications can help identify the more profitable clients or customers and find ways to keep them satisfied by the services provided, e.g. by analysing areas where the company can give some goodwill, i.e. better deals for the client but also for the company.

The field of Business Intelligence is constantly moving, especially in the techniques for analysing data and giving proper feedback about this information, e.g. to advise and help decision making. The field of modelling the data to fit the requirements needed by BI has however been stable for many years now. The most commonly used data modelling for BI is dimensional modelling. One of the most acclaimed books on dimensional modelling is the Data Warehouse Toolkit, the first edition of which came out in 1996. And not much of the techniques and best practices have changed since then.
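To make the KPI idea mentioned above more concrete, the following is a minimal, hedged SQL sketch; the table and column names (EnergyFact, DateDimension, EnergyProducedKWh, EnergyConsumedKWh) are hypothetical and not taken from the actual data warehouse, but the indicator itself mirrors the energy production versus consumption KPI used later in the thesis.

-- Hypothetical KPI: weekly energy balance (produced vs. consumed).
-- A value of 1.0 or more means the house produced at least as much energy as it consumed.
SELECT dd.[Year],
       dd.[Week],
       SUM(f.EnergyProducedKWh) / NULLIF(SUM(f.EnergyConsumedKWh), 0) AS EnergyBalanceKpi
FROM   EnergyFact AS f
       JOIN DateDimension AS dd ON dd.DateKey = f.DateKey
GROUP BY dd.[Year], dd.[Week]
ORDER BY dd.[Year], dd.[Week];

A KPI of this kind can be evaluated against a target threshold, so areas (or weeks) that fall short of the overall strategy are immediately visible.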


2.5.1 Motivation for using BI

Nowadays most BI solutions are so generic that it is possible to use these applications for much more than just gaining company insights. E.g. gaining insights from an intelligent house like the EMBRACE house should be possible by using BI applications. These insights could be about the energy consumption but also about the overall usage and functioning of the house.

Business Intelligence brings together the world of business and the world of data, and this is what makes the topic interesting: the fact that the business need of getting answers to many different kinds of questions can be served by the vast amounts of data often hidden away in operational source systems.

2.6 Data Warehouse

To be able to provide business intelligence, a data warehouse (DW) is often essential; it provides the analytical data to be processed by the BI applications. Data warehousing and business intelligence (DW/BI) systems are often delivered together as a BI solution.

The term data warehousing is composed of data and warehousing, i.e. the concept of having products available on all shelves of a warehouse. This metaphor illustrates quite well the intention of building a data warehouse: it provides data in an easy way for end-consumers to take down from the shelves as they desire.

The biggest difference between operational source systems and data warehouses is that while operational source systems perform operations, e.g. update and add, on individual records, data warehouses perform operations on large sets of records. Operational systems are fully optimised to use their data for themselves, meaning it might result in poor performance if a BI application uses such a data source directly to execute large queries. Operational source systems are used by the employees in a company for the individual business processes, while data warehouses are used by the employees in a company to evaluate and test the performance of both individual business processes and subsets of them.

Operational source systems are hence built to handle individual transactional records efficiently; these systems usually update the state of the processes, by records, without recording the individual transactional step. For the DW, on the other hand, it is almost required that steps in the processes can be traced using transactional records. However, DW users rarely use the individual records; they just use them for summarising the processes executed, e.g. to see how long a process has taken, from and to a specific step in the process. Some operational source systems do not keep records for long; this is however essential for DW systems, since this provides the historical context of the entire organisation.

From the above it should be clear that using a BI application on top of existing operational source systems without a data warehouse can result in unreliable data, poor performance and hence poor usability of the DW/BI system.
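To illustrate the difference in access patterns, the two SQL sketches below are hedged examples; the table and column names (DeviceState, TemperatureFact, DeviceDimension, DateDimension) are illustrative assumptions rather than the actual schema. The operational query touches a single record, while the data warehouse query summarises a large set of transactional records.

-- Operational source system: update the current state of a single device.
UPDATE DeviceState
SET    CurrentValue = 21.5,
       LastUpdated  = GETDATE()
WHERE  DeviceId = 42;

-- Data warehouse: summarise many records, e.g. average temperature
-- per floor per day for a whole year.
SELECT d.[Floor],
       dd.[Date],
       AVG(f.Temperature) AS AvgTemperature
FROM   TemperatureFact AS f
       JOIN DeviceDimension AS d  ON d.DeviceKey = f.DeviceKey
       JOIN DateDimension   AS dd ON dd.DateKey  = f.DateKey
WHERE  dd.[Year] = 2014
GROUP BY d.[Floor], dd.[Date];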


2.6.1 Motivation for using DW

Data warehousing is the glue that binds the data from the operational source systems and the BI applications together; without it, the BI applications are not as useful as they could be. This is why data warehousing is an interesting topic to look into. And the fact that it can be built upon operational source systems that control an intelligent house makes it a completely new and innovative way of looking at the data being collected. Also, since all houses in Denmark will get digital electrical meters over the coming years, this collected data can help researchers and even the occupants of these houses to easily gain insight into what their energy consumption is spent on, and if they have control systems, they can even gain insights into the overall functioning of the house.

2.7 Scope

This thesis will investigate the use of data warehousing to deliver Business Intelligence reporting for data collected from the EMBRACE house, which is the entry from DTU in the Solar Decathlon 2014 competition. BI reports can be used for analysing and benchmarking the EMBRACE house on different measurements and parameters.

Business Intelligence is being deployed more and more in corporations around the world to make better decisions on business strategy and management. In this thesis BI will be used to help occupants and researchers of the EMBRACE house to save energy and use the house smarter, thereby making it a more intelligent house.

A large part of the process is to make an extract, transform and load (ETL) system, which can retrieve data from different data sources, transform or clean the data so it fits the needs of doing BI analysis, and lastly put the data into a data warehouse, which is consumed by BI applications. The data will originate from different data sources, i.e. the operational source systems, which are the home control systems. An essential part of BI reporting is the ability to analyse and interpret the data, which should give the analyst knowledge of the current and historic facts on the functioning of the house, and thereby make recommendations e.g. on how/where to reduce the energy consumption.

Data warehousing is a part of the umbrella term Business Intelligence, which covers theory, methods, processes, architectures, and technologies. To be able to build a data warehouse, the data should use dimensional modelling, which consists of facts and dimensions. The reporting can then be done through OLAP cubes, which consume the facts and dimensions, and let end-users slice and dice cubes for relevant data in BI analysis tools.

One way of evaluating the performance of a house is the use of key performance indicators. This thesis will only cover how to create a key performance indicator and how it can be used. A data warehouse is also a good foundation for doing e.g. data mining, and the use of machine learning can help with this. This thesis will however not consider these techniques.


2.8 Summary

Intelligent houses use control systems for home automation to control e.g. HVAC, appliances and lighting. These can be integrated to make houses more intelligent and thereby help occupants use the houses more efficiently, e.g. to reduce the overall energy consumption.

The Solar Decathlon Europe competition is an international competition where the overall goal is to make energy efficient houses. The competition consists of ten contests, which evaluate how well each house performs as an energy efficient house and how well it would be to live in. DTU is participating for the second time in the Solar Decathlon Europe competition. In 2012 the house, Fold, was ranked #10. For the 2014 edition the entry from DTU is EMBRACE, which consists of a thermal envelope and a sheltered garden. The house was ranked #8 in the competition.

For the EMBRACE house a cloud-based control system is made. The cloud consists of multiple logical components, including a Data Warehouse and a Business Intelligence component.

Business Intelligence is mainly used for decision making support in larger companies, where it is built upon operational source systems. Common usage of BI is reporting and querying, which is the basis for the decision making support analysis. Data warehouses are often an essential part of delivering BI solutions, because they are the data foundation. Data warehouses consume data from operational source systems, cleanse it and make it available for BI analysis tools.

The reason for focusing this project on DW and BI is that BI solutions and applications are now so generic that they can be used for multiple purposes, e.g. for getting insights into the functioning of the EMBRACE house. Delivering a proper BI solution includes a DW that can contain a vast amount of data and expose it in a way such that it can easily be used by end-users.

Chapter 3 State of the Art

”Innovation is taking two things that already exist and putting them together in a new way.” —Tom Freston, 2012

Contents
3.1 Third Normal Form ...... 17
3.2 Dimensional Modelling ...... 18
3.3 Facts ...... 20
3.4 Dimensions ...... 26
3.5 Star Schema ...... 33
3.6 OLAP Cubes ...... 35
3.7 Design Process ...... 39
3.8 Data Warehouse and Business Intelligence components ...... 40
3.9 Summary ...... 41

This chapter provides an insight into the state of the art in data warehousing.

3.1 Third Normal Form

Relational database design often uses normalisation and the third normal form. The point of using this is to normalise the data as much as possible, thereby removing redundancies and dependencies, which affect the number of operations required to insert (create), read, update and delete (CRUD) data. E.g. if redundant data is present, it will have to be updated as well when an update is done on one part of the data. This also wastes disk space. Inconsistent dependencies can arise if data in one table is missing a link or relationship to another table; hence the path to find the right data can sometimes be impossible, especially if several tables exist. Links or relationships can also be broken, e.g. if updating of the data is not done right.

Normalisation is achieved by creating more tables and creating relationships between them. Normalising a database happens in accordance with rules; these rules are called normal forms (NF). If a database design is compliant with the first rule, it is a database in first normal form (1NF).

The rules build upon each other, which means that if a database is in third normal form (3NF) it is also in first and second normal form (2NF). The first three normal forms are the most widely used in most applications. There exist a few more normal forms beyond these first three; they are however not used very often, mainly because omitting them does not hinder any functionality of the database, even though they would provide a better database design.

It is important to note that even though most relational database designs use the third normal form, it is not necessary for an application to run; however, if it is not used, the application should be prepared to handle the redundant data and inconsistent dependencies. More detail about the first three normal forms is given in Appendix A.
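As a minimal sketch of the idea (the table and column names are illustrative and not taken from the operational source systems), a denormalised measurement table can be brought closer to third normal form by moving the repeated device description into its own table and referencing it by a key:

-- Denormalised: the device name and floor are repeated for every reading,
-- so renaming a device must touch every row (an update anomaly).
CREATE TABLE Reading (
    ReadingId   INT PRIMARY KEY,
    DeviceName  VARCHAR(50),
    FloorName   VARCHAR(50),
    ReadingTime DATETIME,
    Temperature DECIMAL(5,2)
);

-- Closer to 3NF: device attributes live in one place and the reading
-- only keeps a foreign key to the device.
CREATE TABLE Device (
    DeviceId   INT PRIMARY KEY,
    DeviceName VARCHAR(50),
    FloorName  VARCHAR(50)
);

CREATE TABLE ReadingNormalised (
    ReadingId   INT PRIMARY KEY,
    DeviceId    INT NOT NULL REFERENCES Device(DeviceId),
    ReadingTime DATETIME,
    Temperature DECIMAL(5,2)
);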

3.2 Dimensional Modelling

Even though complying fully with the third normal form rule is considered best practice for database design, the data is not optimally structured for fast querying of many records, e.g. as required by BI applications; this can however be achieved by using dimensional modelling. Dimensional modelling tries to help end-users easily understand the content of the data, and it is considered the best and preferred way of storing analytical data to be processed by BI applications.

The dimensional structure of tables makes the data much simpler in the way it is presented; however, because of this simplicity, it cannot be normalised to the same degree as with 3NF. This is the most significant difference between dimensional models and 3NF compliant models. Dimensional modelling is mostly used for the presentation area for end-users, i.e. in a data warehouse.

The basics of dimensional modelling is the concept of cubes, where the edges are different dimensions, and when this cube is sliced and diced, the set of facts or measurements for these defined dimensions is revealed. An example of this could be a house where temperature and humidity measurements are done by devices located on different floors, and the measurements are made every hour. In this description the dimensions for the cube reveal themselves as one for devices, one for floors and one for time. The facts are the temperature and humidity measures, which are recorded as single dots inside the cube along the dimensions. The cube is seen in Figure 3.2.1. Now if the cube is sliced and diced so that the measurements made by a specific device are selected, two floors are selected and a range of time from morning to noon is selected, then the set of measurements revealed by the cube will be the temperature and humidity for these specified dimensions.

The above example shows how tangible and simple the idea of dimensional modelling is, which is the main reason why it is so often used for BI applications. And because measurements are recorded along all dimensions/axes of the cube, the data is quickly retrievable.

Dimensional models can be stored in any relational database management system (RDBMS) because they still use the same data structures as traditional relational database designs.


Figure 3.2.1: Example of cube with floor, device and time as dimensions

As discussed above, 3NF tries to remove as many redundancies as possible by making different tables for each separate entity. This model performs well for operational source systems because it requires fewer operations to maintain the data, e.g. when CRUD operations are performed. This model is however not optimal when it comes to dimensional modelling, because of the simplicity required by the dimensional model. The simplicity ensures that users can get the data they need without having to navigate and join through a lot of foreign tables. Even though the dimensional model also uses foreign tables, they are kept to a minimum by only having a foreign table for each dimension, whereas a traditional entity-based relational model would have several tables for the values stored in each dimension. Both models can represent exactly the same data; it is only the format or structure of the data that differs.

3NF relational database models are sometimes called entity relationship models, where entity relationship diagrams are often used to represent the design. However, since dimensional models also consist of joined/linked relational tables, like traditional entity relationship models, entity relationship diagrams can also be used for these models.

Most BI applications will execute queries defined by the end-user, and these can be very unpredictable. This unpredictability makes it nearly impossible for RDBMS systems to perform efficiently on a fully 3NF compliant relational table model. However, when the data is modelled using dimensional modelling, it can perform better because of the minimal navigation and joining of foreign tables.

Dimensional models are very flexible, which makes it possible to modify the model during its lifetime without causing major modifications to the existing infrastructure or applications (a minimal sketch of such changes is given at the end of this section).

• New measures can be added to a fact table; this is done by adding a new column with the new measure. The column should be nullable if there exist fact rows which will not contain the new measure. It is important when adding a new measure that it complies with the existing grain, since it would otherwise violate the rule that grains should never be mixed. If the new measures are at a different grain, they belong in another fact table. Grain defines the granularity of rows in a fact table; this is discussed in more detail later.

• New dimensions can be added to fact tables as long as they do not change the grain of the fact table. The new dimension is added as a column in the fact table and a reference (foreign key) to the dimension row is populated in the fact table. If not all facts need the new dimension, these should be populated with a foreign key to a default dimension row, which is an ”unknown” or ”not applicable” dimension row.


• New attributes/columns can be added to a dimension table. If the value of the new attribute/column is only applicable for a subset of the dimension rows, the remaining rows should be populated with a ”not applicable” text string instead of a null value.

Once the new changes have been checked and validated, they can be introduced in the BI applications, which will then need to be modified and reloaded to accommodate the new changes.
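A minimal sketch of these three kinds of changes, assuming hypothetical table and column names and that the surrogate key 0 is reserved for the ”unknown” dimension row:

-- New measure at the same grain: a nullable column on the fact table.
ALTER TABLE FactIndoorTemperature ADD HumidityPercent DECIMAL(5,2) NULL;

-- New dimension: a foreign key column, defaulted to the "unknown" row (key 0)
-- for the existing fact rows that have no value for it.
ALTER TABLE FactIndoorTemperature ADD GroupingKey INT NOT NULL DEFAULT 0;

-- New dimension attribute: populated with 'Not applicable' where it does not apply.
ALTER TABLE DimDevice ADD Manufacturer VARCHAR(100) NOT NULL DEFAULT 'Not applicable';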

3.3 Facts

Facts or fact tables contain the measurements in a dimensional model. This is also why facts are sometimes referred to as measures. These measurements come from the operational source systems. A fact could e.g. be the temperature or humidity of a house. Since these measurements will be done quite often, it is important not to store them in many different places, e.g. for each floor. Instead the measurements should only be contained in the operational source system and a single data warehouse. This ensures that data is consistent and it reduces the amount of space required.

There should only be one row for each measurement event, i.e. the same measurement should not be replicated or otherwise modified, because this might cause incorrect calculations later. This also includes ”zero”-measurements, i.e. measurements where nothing has happened yet; these should not be added as new rows in the fact table, since they would cause incorrect calculations as well as consume a large amount of space, because a ”nothing-has-happened-yet” measurement will occur more often than an actual measurement.

One of the main principles in dimensional modelling is that a row in a fact table corresponds to a single measurement event. Hence a single row in a fact table has a one-to-one relationship with the grain of the fact table, which represents a single actual measurement event.

Measurement events in a business context are usually bound to a business process, where the measurements come from different steps in the process. The measurement is a performance metric which can be analysed. Since several measurements can be made before, during and after a step in a business process, these measures can spawn several fact tables depending on the grain and intervals at which they are measured by the operational source systems. However, only the key metrics should be considered first; these are usually related to the key steps in a business process.

Measurements are bound by the source data realities: if the measurements are not done, e.g. by an operational source system, then they cannot be analysed. So if something needs to be analysed, the data needs to be clearly identified in the operational source systems.

Measurements in general can range from numeric values to textual string values. The most useful measurements in dimensional modelling are numeric values, preferably the ones which are additive. Most fact tables only use numeric values. An example of a fact table is seen in Figure 3.3.1. More detail on the above points is given below, and additional dimensional modelling techniques regarding fact tables are given in Appendix B.1.


Figure 3.3.1: Example of a fact table

3.3.1 Structure

The structure of fact tables is a combination of numeric measured values together with foreign keys to dimension table rows. The numeric values in a single row (i.e. at the lowest grain) should correspond to a single event that happened, e.g. a measuring of the temperature. The purpose of the fact tables is to be used during the calculations behind the queries that end-users request.

Derived facts can either be put into a fact table or they can be added to a view. It is important that all end-users then use the view instead of the physical fact table; otherwise end-users might be confused and try to do the derived calculations themselves, which can be error-prone.

It is possible to have measurements in multiple or single fact tables. Usually, if the grain is different and/or the dimensionality is different, i.e. some facts have relations to more dimensions than others, then separate fact tables should be used. If they have the same grain and/or dimensionality, it depends on how the end-users use the data. If the contents/measurements have different contextual meaning they should be split into separate fact tables. The reasoning for having measurements with the same grain and/or dimensionality in the same fact table is that it is possible to add a dimension which separates the measurements, e.g. a type dimension.

Fact tables are often named with a ”Fact” prefix.
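As an illustrative sketch of this structure (the table names follow the star schema example in Figure 3.5.1, but the column types are assumptions), a temperature fact table is simply a set of dimension foreign keys plus the numeric measure recorded at the declared grain:

CREATE TABLE FactTemperature (
    DateKey     INT NOT NULL REFERENCES DimDate(DateKey),
    TimeKey     INT NOT NULL REFERENCES DimTimeOfDay(TimeKey),
    DeviceKey   INT NOT NULL REFERENCES DimDevice(DeviceKey),
    HouseKey    INT NOT NULL REFERENCES DimHouse(HouseKey),
    GroupingKey INT NOT NULL REFERENCES DimGrouping(GroupingKey),
    -- The measured value for the single event described by the keys above.
    Temperature DECIMAL(5,2) NULL
);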

3.3.2 Keys

Fact tables should not have primary keys, because they are not supposed to be browsed through these. However, an ETL system might assign the fact table rows surrogate keys to quickly identify rows, e.g. if the rows need to be updated or removed.

Fact tables consist of several foreign keys to the primary keys of dimension tables. These dimension table primary keys are used to browse the fact tables. Facts include a foreign key for each of the dimensions which have a context with the fact. This could e.g. be a foreign key for a row in the time dimension, a foreign key for the sensor device dimension and a foreign key for the floor dimension. The facts can then be found by using e.g. the time dimension, where a single row can give multiple results (a set of measures) from the fact table.

When all foreign keys in a fact table match the primary keys of dimension tables, the table satisfies referential integrity; this integrity is usually handled by the RDBMS system.

When specific facts are needed in a retrieval from a fact table, a composite key is usually used. The composite key combines the foreign keys into a single key, which should make it possible to get the individual facts. If multiple facts are found for the full composite key, then duplicates might exist. I.e. if a fact table is missing a relation to a dimension, e.g. the device dimension, then facts could be misinterpreted as duplicates, even though they might not be. Keys are discussed in more detail in the dimension section.

3.3.3 Additivity

Additive means that the numbers can be added or summed when a set of these numbers is used for a calculation. When slicing and dicing a cube, large sets of numbers are usually retrieved, and one of the most simple calculations to do with these large amounts of numbers is to sum them. This requires that the numbers can be added, and that the resulting value is as valid as the single measurement.

There exist three types or categories of measurement additivity: additive, semi-additive and non-additive. The additive measures are the ones that can be summed across all dimensions. The semi-additive measures are the ones that can be summed across some dimensions; most semi-additive measurements cannot be summed across the time dimension, but they can be summed across others. The non-additive measures are the ones which cannot be added across any dimension. The semi- and non-additive measurements can usually be used with counts and averages. Non-additive measurements are usually ratios; because they are ”pre-calculated” they cannot be summed. Since these ratios might be pre-calculated they should, if possible, be calculated by the BI applications instead, and the base measures for the calculated ratio should be stored in the fact tables. This approach might not always be possible if the actual measurements give ratios.

In a house context an example of an additive measure is the amount of energy consumed in kWh, but only when the amount is reset every time the measure is read. This gives a measurement per transaction. These measurements are additive, since they can be summed across all dimensions.

An example of a semi-additive measure is the amount of energy consumed in kWh when the amount is not reset every time the measurement is read. Since this measurement grows for each unit of time, it cannot be added across e.g. the time dimension, since this would result in invalid values. E.g. if the amount of energy consumed at 1 o'clock is 12 kWh and the amount of energy consumed at 2 o'clock is 14 kWh, these two numbers cannot be added to get a valid total across the time dimension. The measurements can however be summed across e.g. the metering devices dimension. So if the amount of energy consumed is 12 kWh for the lights meter and 12 kWh for the HVAC meter, then the total amount can be summed to 24 kWh, which is valid. For the time dimension a count could be calculated instead, so that it can be seen that two measurements were recorded from 1 o'clock to 2 o'clock.

An example of a non-additive measure could be temperature measurements.

These cannot be added across any dimension, because the temperature lies within a fixed range. It does not make any sense to sum across the time dimension or a temperature sensor device dimension. E.g. for a time dimension, a measurement of 20 degrees Celsius at 1 o'clock and a measurement of 21 degrees Celsius at 2 o'clock summed would give a total of 41 degrees Celsius, and this temperature never occurred. The same goes for a device dimension: if one temperature sensor measures 20 degrees Celsius and another temperature sensor measures 21 degrees Celsius, a summed total of 41 degrees Celsius does not make any sense. However, it is possible to calculate an average instead for non-additive measures, so that for both the time dimension and the device dimension an average of 20.5 degrees Celsius is calculated, which is a valid value.

Measures which record a static level, e.g. a measure of intensity like room temperature, are inherently non-additive for some dimensions, most likely the date and time dimensions. These measures can however be used together with the date dimension by using averages and counts.

It should be noted that when calculating averages over a date or time dimension, this should not necessarily be done using the SQL AVG function, since this averages the values over the resulting fact table rows and not over the number of rows that the dimension table has. E.g. if a weekly average of energy consumption is needed, and the energy consumption is measured twice, then the AVG function will give an average by summing these values and dividing by two; however, a week has seven days (seven rows in a date dimension), so the average should have been divided by seven, not two. Hence it should be noted that there are two different kinds of averages and both can be used, but for different measures.

From the above it is seen that the most flexible measurements are the fully additive, because they can easily be summed and used across all dimensions related to a fact table.
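A sketch of the two kinds of averages, using assumed table and column names in the style of the star schema example in Section 3.5:

-- Average per recorded measurement: AVG divides by the number of fact rows.
SELECT AVG([EnergyConsumptionKwh]) AS [AvgPerMeasurement]
FROM [FactEnergy]
JOIN [DimDate] ON [DimDate].[DateKey] = [FactEnergy].[DateKey]
WHERE [WeekNumberOfYearName] = 'Week 27, 2014';

-- Average per day: divide the summed measure by the number of days the
-- date dimension has for that week, not by the number of fact rows.
SELECT SUM([EnergyConsumptionKwh]) * 1.0
       / (SELECT COUNT(*) FROM [DimDate]
          WHERE [WeekNumberOfYearName] = 'Week 27, 2014') AS [AvgPerDay]
FROM [FactEnergy]
JOIN [DimDate] ON [DimDate].[DateKey] = [FactEnergy].[DateKey]
WHERE [WeekNumberOfYearName] = 'Week 27, 2014';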

3.3.4 Grain

The specific level of detail at which measurements are done is called the grain of a fact table. E.g. if temperature measurements are done every minute, the fact table containing these measurements is said to have a one minute time grain. Note that this grain is related to the time dimension. The grain can also refer to other dimensions, or more likely several dimensions, e.g. the devices which do the measurements; together these define the grain of the fact table. Hence the grain defines what a single row in a fact table corresponds to.

It is important that the grain is never changed during the lifetime of the fact table, e.g. going from a time grain of one minute to a ”summed” grain of ten minutes. This could cause multiple measurement events to be calculated more than once, and would hence give incorrect calculations. If the grain is to be changed, the new grained measurements should be placed in another fact table. It is also important that grains are not mixed in a single fact table; each declared grain should be in a different fact table. E.g. measurements done every hour should not be mixed with measurements done every second, especially if the hourly measurement is an accumulating measurement. Also, measurements done for e.g. the temperature should not be mixed with the energy consumption, because these are from different contexts.

In general, when considering the grain of a fact table, three different types of grain exist: transactional, periodic snapshot and accumulating snapshot. Transactional is the most common, since the facts are usually retrieved from the operational source systems, which mostly use transactional data.


The lowest level of granularity is always preferred when choosing the grain. These atomic measurements can be used for much more than accumulated measurements, because they can always be rolled up to the accumulated measurements, while it is not possible to ”drill down” into more detail with an accumulated grain. The grain is dependent on the data realities of the operational source systems: if the desired grain is not available from the measurements done by the operational source system, the grain should be redefined to what is possible.

3.3.5 Null values

Non-value measurements for the numeric values might occur; these should be saved as null values. Most RDBMS systems can gracefully handle these null values without miscalculations. E.g. sum, count and others either ignore these values or can be set to ignore them. It is important to tell the BI application that there might be null values in a fact table, so that these can be handled correctly, i.e. so that it can be set to only retrieve non-null values from the fact tables. Non-measurements, i.e. inserting a row in a fact table when nothing has happened, should not be done however, because this would fill up most fact tables with useless information (for transactional fact tables).

Null values should never occur for foreign keys to dimensions. If an ”unknown”, ”missing” or ”not applicable” dimension is needed, this should be added to the dimension table as a separate row and the key for this dimension row should be set in the fact table instead. For RDBMS systems these key columns should have a not null constraint.
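A minimal sketch of the reserved dimension row, assuming a simple device dimension whose surrogate keys are assigned by the ETL system:

-- Reserved dimension row for measurements whose device context is missing
-- in the source; the surrogate key 0 is set aside for it.
INSERT INTO DimDevice (DeviceKey, DeviceName)
VALUES (0, 'Unknown');

-- The ETL system then maps a missing device to key 0 when loading facts,
-- so the NOT NULL constraint on the fact table's DeviceKey column holds.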

3.3.6 Types of fact tables

There are three types of fact tables: transaction fact tables, periodic snapshot fact tables and accumulating snapshot fact tables. Each of them can be used in separate fact tables; they should however never be combined, because this would violate the grain. The periodic and accumulating snapshot fact tables are described in Appendix B.1.2.

3.3.6.1 Transaction based

For transaction fact tables a row corresponds to a transaction; this transaction could be an event that occurred during or after a process has been done, e.g. a temperature measuring event. Transactions usually come directly from the operational source systems, which are mostly designed to handle transactional records. The transactions might also come from calculations done in the ETL system.

The best transaction facts or measurements are the ones that are at the most atomic level possible; these usually have the transactional grain. These measurements are more dimensional and robust in the sense that they can be sliced and diced the most, i.e. an end-user can drill all the way down to the actual records, and not just some aggregate measure.

Transaction fact tables often have the same number of rows as the transactional records of the operational source systems, i.e. no rows are usually added besides these, which means that if a transaction event occurred in the operational source system, one row is inserted in the fact table.

Transaction fact tables always have all the dimension foreign keys filled, because an event could not occur without the context from the dimensions. I.e. when the fact row was inserted, a number of context dimension rows were true at that event, and these dimension row foreign keys will hence always be filled.

Transaction fact tables are usually the largest fact tables, because they include a row for every transactional event, so storage capacity should be considered, e.g. by estimating the number of rows which will be put into the fact table over a defined time period.

3.3.7 Factless

A special type of fact table is the factless fact table. These tables record that an event occurred, but no numeric values, measures or facts were recorded. It could e.g. be that the alarm went off in a house context. The fact table would then include which dimensions were true when the event occurred, but since there are no measurements, this is a factless fact table.

Factless fact tables are good to use when analysing what did not happen. This could e.g. be modelled by having two factless fact tables, one for the possible events and one for the events that occurred; taking the set difference of these two tables gives the events that did not happen.

Factless fact tables are also used for many-to-many relations between dimensions, e.g. a device grouping in a house, where a number of devices can be in a number of groups, and a number of groups can contain a number of devices. The devices and groups would then be dimensions and a factless fact table could be used for the relationship between these two.
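As a sketch of the ”what did not happen” analysis, assuming two factless fact tables with hypothetical names, one holding the possible alarm events and one holding the alarm events that actually occurred:

-- The set difference yields the date/device combinations for which an alarm
-- was possible but never went off.
SELECT [DateKey], [DeviceKey] FROM [FactAlarmPossible]
EXCEPT
SELECT [DateKey], [DeviceKey] FROM [FactAlarmOccurred];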

3.3.8 Numeric values as dimensions

Numeric values are not only facts; sometimes they can be used as dimensions. This is often determined by whether the values are being used for calculations or for filtering/grouping. If they are used for calculations, the numeric values belong in a fact table, and if they are used for filtering and grouping they belong in a dimension table. If the values fall into intervals/ranges, these can also be modelled as dimensions, where e.g. a range between 12-24 degrees Celsius can be modelled as a single dimension row. When numeric values are used for dimension rows, they should be accompanied by a text string which includes the value, which gives the numbers more context. Numeric values can also be used as both dimensions and facts, and hence should be placed in both if the numbers are used for calculations as well as for filtering/grouping in reporting. Usually numeric values which are continuous belong in a fact table, while numeric values which are discrete should be made into dimension rows.


3.3.9 Units of measure

When multiple units of measure can or should be used, they should not be inserted as separate measures (columns) in a fact table. Instead a specific unit of measure should be used, and this can then be converted into the other units of measure by using defined conversion factors, which are stored in the fact table. The conversion factors should not be stored in dimension tables, since the dimension table might not be used in a specific report where different units of measure are needed.

The fact tables can then be exposed through views which include the different units of measure which the end-users use. This could e.g. be done for power meters which are counters: instead of storing the same information in a multitude of different units of measure in different columns, only the count and the duration between the counts could be recorded in a fact table, and the power in kWh, Wh and kW, W could then be calculated and shown using a view. When using a view the conversion factors might not be needed in the fact tables, since the measures are already shown to the end-users in the specific units of measure. It is important that the different units of measure are clearly labelled in the columns, so end-users can be sure about which columns to choose.
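A sketch of such a view over a counter-based power meter fact table; the table name, column names and the conversion factor of 1000 pulses per kWh are assumptions for the example:

CREATE VIEW FactPowerMeterView AS
SELECT DateKey,
       TimeKey,
       DeviceKey,
       PulseCount,
       DurationSeconds,
       -- Assumed meter constant: 1000 pulses per kWh.
       PulseCount / 1000.0                                AS EnergyKwh,
       PulseCount * 1.0                                   AS EnergyWh,
       (PulseCount / 1000.0) / (DurationSeconds / 3600.0) AS AveragePowerKw
FROM FactPowerMeter;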

3.4 Dimensions

Dimensions or dimension tables are the descriptive context labels for the measurements in the fact tables. They describe the ”wh”-questions, e.g. ”when”, ”where”, etc. of a measurement event. The descriptive context labels are used for filtering, querying and grouping the measurements from the fact tables. The columns are also called ”by” words, since they describe how an end-user can see measurements by a dimension table. E.g. if a user wants to be able to see energy consumption by different devices, the device labels would have to be in a dimension column. This makes the dimension columns essential to dimensional modelling and the BI applications, since they provide the constraints and labels that are available in the BI applications.

Because of this, it is important to give proper descriptions to the column names and values so they make sense to the end-users; otherwise they might be confused about the content of their query. The more verbose the labels are, the more understandable they are for the end-users. Hence codes and specific numbers should not be used as descriptive labels. If the codes have special meaning, their verbose meaning should be available as a label, and if needed the code can be added as a separate column together with the verbose labels. These points should be kept in mind when building any DW/BI solution, since these dimension labels will be the foundation of any analysis done by the BI application; they are the only way to access the measurements in a fact table.

Dimension tables usually have several attributes or columns, and usually more than fact tables. These columns are mostly textual string values. Dimension tables usually have fewer rows than fact tables. The dimension rows have a primary key associated with them; this primary key is used for the foreign keys in the fact tables. When identifying dimensions, all the discrete textual strings will be attributes/columns of a dimension, and these text strings should be consistent with end-users' expectations.


A reference from a fact table should only be to one dimension row, where the attributes only have a single value. This makes it possible to identify the context of a measurement in a fact table row. For many-to-one relationships the attribute value might occur in several dimension rows, but only once in a single row, i.e. it should not be repeated in another column in the dimension table.

Dimensions have a tendency to describe hierarchical relationships; they contain a many-to-one relationship in a single dimension table, thereby increasing the verbosity of the columns. This is done to improve query performance and to increase the usability and understandability of the dimensions.

Striving for dimensions which are conformed, i.e. which can be used across fact tables from different operational source systems, is key in a DW/BI solution which integrates these different systems.

An example of a date dimension table is seen in Figure 3.4.1. More detail on the above points is given below, and additional dimensional modelling techniques regarding dimension tables are given in Appendix B.2.

Figure 3.4.1: Example of a date dimension table

3.4.1 Structure

The structure of a dimension table consists of a primary key column. The primary key is used as a foreign key in fact tables. A fact table row will have a foreign key to the dimension table row for which the context was true, thereby attaching a descriptive context to a measurement in the fact table.

Dimension tables are denormalised and include verbose text strings. Indicators and flags like enumerations and boolean values, as well as codes, can also be included as columns in dimension tables; these do not belong in fact tables, however giving verbose descriptions to these indicators, flags and codes is preferred. The attributes of a dimension table row are used as labels and constraints, i.e. for filtering and grouping; hence giving the attributes proper verbose descriptions helps the end-users.

Most dimensions can be filled with all or most possible dimension rows from the start, since these hold the context that measurements can or will be in when they are inserted in a fact table.

E.g. the date dimension can often be filled with rows for the next 10-20 years at the start of the usage of the DW/BI solution, and more rows can always be added to accommodate the next 10 years. The same goes for e.g. device dimensions in a house context: since most of the devices are known from the start, they can all be loaded into the dimension table before they are actually used in the fact table. There are some situations where it is not advisable to put all possible values into a dimension table, e.g. when they will never be used or have not been used yet; then these rows can be added when the value is actually used in a fact table. This is decided by the ETL system.

Dimension tables are often named with a ”Dim” prefix.

3.4.2 Flags, codes and indicators

Flag and indicator values in dimensions should be filled with proper text, instead of a boolean value or a yes/no value. E.g. if a weekday indicator is used in a date dimension, then the words ”weekday” and ”weekend” should be used instead of ”true” and ”false”. This gives the end-users much better reporting capabilities. Flags, codes and indicators usually come from the operational source systems, where they are often used. In addition to this, null values should never be used in dimension tables; if it is necessary to represent a missing value, a more descriptive text should be used, e.g. ”not applicable” or ”unknown”.

3.4.3 Junk

The above flags and indicators might be used a lot in the operational source systems, and instead of creating a separate dimension table for each type of flag and indicator, a junk dimension can be used. A junk dimension includes all the indicators and flags that are in the data from the source systems. Not all possible values should however be loaded, only the ones that are used. Junk dimensions can give significantly higher query performance than having these rows in different dimension tables. Each type of indicator is made into a separate column in the dimension table. When indicators can be used together, the columns for these are also populated; otherwise the text ”not applicable” or ”unknown” is inserted into these columns. The junk dimension is never referred to as a junk dimension to end-users, for whom it is usually called a transaction indicator dimension or transaction profile dimension.

3.4.4 Keys

The primary key for a dimension table should never use the natural key from an operational source system. These keys might be used for several dimension rows, maybe even in the same dimension table. Since the natural keys are from the operational source systems, they might also occur in several systems, making it impossible to know which system the key was from. And since these natural keys are administered by the operational source systems, they can easily be changed without the DW/BI system knowing, e.g. the keys might be reassigned to other entities in the operational source systems.

Instead a surrogate key should be used, which is a simple continuous integer value. These keys are much more robust since they are simple and maintained by the DW/BI system. Using these keys also makes the foreign keys in the fact tables much simpler, which makes a difference in the amount of space needed for the fact tables, which are the biggest consumers of space in a dimensional model.

The natural keys will most likely be used by the ETL system to assign the fact tables with foreign keys to the dimension table rows. Hence the natural keys will usually be saved either in the dimension table or inside metadata kept by the ETL system. The natural key is also referred to as an alternate key.

The date and time of day dimensions however usually use a slightly different key numbering scheme. Since these dimensions are fully known from the start and almost never changed, they are given keys in the format DDMMYYYY for date and PHHMMSS for time, where P is a prefix digit used to be able to keep the keys as integer values instead of string values.

Besides the natural key, which is the key used by the operational source systems, and the surrogate keys, which are continuously and sequentially assigned integer values, durable keys are sometimes needed. These keys will never change even if the surrogate key is changed. This could e.g. be used to identify a specific device in a house. When changes are made to this device in the home control system, the data warehouse might give the device a new surrogate key; however, to keep a record of the specific device a durable key can be used to identify this exact device even though it has been given a new surrogate key. These keys are mostly used when there is a need to keep history records, where e.g. a dimension row is kept for the old device information and a new row is inserted for the new device information.

Joins should always be made on surrogate keys, never on natural keys. Using surrogate keys helps when multiple systems are integrated and it improves performance. It makes it possible to uniquely identify not applicable dimension rows, and it makes it possible to have several rows with the same natural key.
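A sketch of a device dimension carrying all three kinds of keys; the column names are assumptions for the example:

CREATE TABLE DimDevice (
    DeviceKey        INT PRIMARY KEY,      -- surrogate key, assigned sequentially by the DW/ETL system
    DeviceDurableKey INT NOT NULL,         -- durable key, stable even when a new surrogate key is issued
    DeviceNaturalId  VARCHAR(50) NOT NULL, -- natural/alternate key from the operational source system
    DeviceName       VARCHAR(100) NOT NULL
);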

3.4.5 Drilling down and other dimensional operations

Drilling down using a dimension is one of the core features of a dimensional model. Drilling down means getting further into the measurement details. This is done by adding another dimension attribute to a query, and thereby getting closer to the atomic measurement. When adding another dimension attribute to the query, it is added to the GROUP BY clause of the SQL statement. The fact that drilling down to get closer to the atomic measurements is so simple is one of the most significant features of dimensional modelling. Note that this feature works without any special predefined drill-down paths or programming; it works ”out-of-the-box” with dimensional modelling. However, to give the end-users a better experience these summarised queries should be pre-calculated to give a fast query response.

Drilling down means adding a row header from an attribute in a dimension, while drilling up (rolling up) means removing a row header from the query. Hence drilling down with all dimension attributes would give the most atomic data, at the most granular level, and hence show all details of the measurements in a fact table, while rolling all the way up to no row headers would result in the most summarised values of the measurements from the fact table. Drilling down and rolling up aggregate the data within a single dimension attribute, not for all the attributes in a dimension.

Other dimensional operations include dimensionality reduction, pivoting, slicing and dicing.

These operations usually operate on a whole dimension, i.e. all of its attributes. When aggregates are rolled up to the highest level, this is a type of dimensionality reduction, i.e. multiple dimensions are reduced to a single base dimension. When aggregates are rolled up to two dimensions, the view is referred to as pivoting. Slicing is selecting data for a defined dimension, while dicing is selecting the measurements that are wanted.
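A sketch of a drill-down as a GROUP BY extension; the table and column names are assumptions in the style of the star schema example in Section 3.5, and the fact table is assumed to carry a DeviceKey:

-- Summarised by day only.
SELECT [DayName], SUM([MeasurementDelta]) AS [Delta]
FROM [FactIndoorTemperature]
JOIN [DimDate] ON [DimDate].[DateKey] = [FactIndoorTemperature].[DateKey]
GROUP BY [DayName];

-- Drilled down: the device name is added as a row header and to the GROUP BY,
-- revealing more detail without any predefined drill-down path.
SELECT [DayName], [DeviceName], SUM([MeasurementDelta]) AS [Delta]
FROM [FactIndoorTemperature]
JOIN [DimDate]   ON [DimDate].[DateKey]     = [FactIndoorTemperature].[DateKey]
JOIN [DimDevice] ON [DimDevice].[DeviceKey] = [FactIndoorTemperature].[DeviceKey]
GROUP BY [DayName], [DeviceName];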

3.4.6 Hierarchies

Dimension tables should be denormalised such that hierarchies are flattened inside the same table. Dimensions can have several hierarchies built into the same dimension table. This could e.g. be a date dimension, which could have a week calendar, where the hierarchy would be year to week to date, as well as a month calendar, where the hierarchy would be year to month to date; these could even be put together to create a third hierarchy, which would be year to month to week to date. These different hierarchies should not be put into different dimension tables; they can coexist in the same dimension table. Descriptive names for the levels of the hierarchy are essential, because this helps end-users use the proper levels.

When a many-to-one relationship needs to be modelled as a hierarchy in a flattened denormalised dimension table, the levels are made into separate columns in the table. E.g. for a year to month to day hierarchy to be modelled, the year, the month and the day would all be separate columns in the dimension table. A row is then made for each day, which includes the month and the year as columns.

3.4.7 Multivalued

There exist two main methods to model many-to-many relationships, e.g. for when a dimension should be multivalued. The two design methods are called positional and bridge table.

The easiest is to model the multivalued attribute with several columns, one for each possibility, i.e. using a positional design. E.g. if it is possible to have devices in groups, then each group name would have a column. This design is not very scalable or flexible, since new columns would have to be added as more possibilities arise.

The bridge table design uses a bridge table, which is a factless fact table that has foreign keys to both sides of the dimensions; see e.g. the grouping example under factless fact tables. This design is much more flexible and scalable, since new combinations can be made by adding new rows. It has the disadvantage, though, that it is not possible to get null values for non-existing combinations. E.g. it is possible to find all devices without a specific group with the positional design, since this is just a query to find all the devices where the specified group column has a null value; however, this is not directly possible when using a bridge table design, because only the groups that are used exist in the bridge table.

Using a bridge table has the advantage that it is possible to add a weight to each combination by adding another column. This could e.g. be used for device groups, where some groups have more devices than other groups, and hence the probability of an event occurring needs a weighting factor when this measure is reviewed by device groups.
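A sketch of the bridge table design for the device grouping example, with assumed names and an optional weighting factor per combination:

-- Bridge (factless) table for the many-to-many relationship between devices
-- and groups; a new combination is simply a new row.
CREATE TABLE BridgeDeviceGrouping (
    DeviceKey       INT NOT NULL REFERENCES DimDevice(DeviceKey),
    GroupingKey     INT NOT NULL REFERENCES DimGrouping(GroupingKey),
    WeightingFactor DECIMAL(5,4) NOT NULL DEFAULT 1.0,
    PRIMARY KEY (DeviceKey, GroupingKey)
);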


3.4.8 Date and time of day dimensions

One of the most significant dimensions is the date dimension; this dimension is used in almost every DW/BI solution, because it enables end-users to navigate through dates. The date dimension has most of the descriptive contexts for all days in the years that the data warehouse is used. It includes e.g. the month, the week number, the day of week, and maybe a holiday indicator to signify that a date is a holiday.

As mentioned above under keys, date is one of the dimensions which does not use a surrogate key; instead it uses an integer value which is a combination of the year, month and day. This is typically done because of the nature of the date dimension: it is almost never changed, and thereby these durable keys can be used as the primary key instead of a surrogate key. The number of rows for this dimension table is relatively small, considering a row for each day; e.g. for 25 years only 25 · 365 = 9125 rows are needed. The date dimension should be made as a physical dimension table and not as a view, since many of the attributes cannot be easily derived using SQL, e.g. getting the week number for a specific date is not directly possible in all RDBMS systems.

The date dimension should be used instead of datetime types through SQL, even though these give the capability of filtering on specific dates. It is much easier to join on a number of date dimension rows given by a hierarchy than specifying these sets of dates in SQL. E.g. filtering on weekdays, in alternating years for a range between two years, is much more cumbersome to do in SQL than just joining on these from the date dimension.

Another date and time related dimension is the time of day dimension, which is used to give end-users the ability to drill down using the time as well as the date. This violates the above mentioned rule about flattening hierarchies, because the time of day is closely related to the date dimension; however, the time of day dimension contains 60 · 60 · 24 = 86400 rows, and even more if milliseconds are used as the smallest grain. Appending 86400 rows for each day in the date dimension would not be advisable; even for just one year the number of rows to be loaded would be 365 · 86400 = 31536000. This is a waste of space, since the same information and capabilities are possible with much less, using two separate dimension tables. The time of day dimension includes columns like hour, minute and second, and it could also contain half hour and quarter hour.

3.4.9 Conformed dimensions

The best dimensions are the conformed dimensions; these dimensions can be used across multiple fact tables, and sometimes across data from different operational source systems. These dimensions can often be found by the use of the same name for columns in different dimensions and the same contents of rows. If the domain context is the same for these dimensions, they should be consolidated into a conformed dimension. The most basic conformed dimension is a date dimension, because this dimension can be used with nearly every fact table that has a reference to a date. Conformed dimensions can either exist physically as one dimension table in one database, or they can exist in different dimension tables, maybe in other databases; however, to ensure that they can be used together, the keys and contents of the dimension tables need to be exactly the same. This makes it possible to use them as conformed dimensions.

Conformed dimensions make it possible to drill across. These dimensions show the true integration between operational source systems that is provided by a DW/BI solution.

3.4.10 Drilling across

Drilling across is the ability to query different fact tables using the same dimensions. This is only possible if conformed dimensions are used; otherwise the query would result in different headings/labels for each fact table. It is important that this operation is not done using joins between the different fact tables, because this would cause performance problems and might even cause miscalculations. Instead a merge function should be used to produce the result after receiving the results of separate queries against the fact tables. Most BI applications support these types of queries, and they are essential to give the end-users the ability to easily compare different facts, e.g. from different operational source systems. This could e.g. be used to see the trends for the energy production vs. the energy consumption along a date dimension, if these facts are in separate fact tables and both use the date dimension.
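A sketch of a drill-across for the production vs. consumption example, using assumed fact table and column names. Each fact table is aggregated separately against the conformed date dimension, and only the two small result sets are merged on the shared row header:

SELECT COALESCE(p.[DayName], c.[DayName]) AS [DayName],
       p.[ProducedKwh],
       c.[ConsumedKwh]
FROM (SELECT [DayName], SUM([EnergyKwh]) AS [ProducedKwh]
      FROM [FactEnergyProduction]
      JOIN [DimDate] ON [DimDate].[DateKey] = [FactEnergyProduction].[DateKey]
      GROUP BY [DayName]) AS p
FULL OUTER JOIN
     (SELECT [DayName], SUM([EnergyKwh]) AS [ConsumedKwh]
      FROM [FactEnergyConsumption]
      JOIN [DimDate] ON [DimDate].[DateKey] = [FactEnergyConsumption].[DateKey]
      GROUP BY [DayName]) AS c
  ON p.[DayName] = c.[DayName];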

3.4.11 Slowly changing dimensions

Dimension attributes in rows can often change, due to changes in the operational source systems. These changing attributes are called slowly changing dimension (SCD) attributes. There are several ways to handle this. The first five types are standalone/basic types, while several others are composites which build upon these; those are called hybrid types.

3.4.11.1 Retain original

Type 0 is called retain original. When using this type, the attribute is never changed, even if it is changed in the operational source systems. Often this type is used together with the word original in the column name, because the original value is always kept. Durable keys always use type 0 for changes, since these keys are never changed after they have been set.

3.4.11.2 Overwrite

Type 1 is called overwrite. When using this type, the attribute value is changed to the newest and current value every time it is changed in the operational source system. This makes it possible for end-users always to know the value, because it corresponds to the current value in the operational source system. When overwriting attribute values, BI applications should be reloaded because the old value cannot be used e.g. as a filter any more. This type has the disadvantage that history is not kept, because the value is just overwritten. It should only be used when keeping history of the changes is not necessary.

3.4.11.3 Add new row

Type 2 is called add new row. When using this type, a new dimension row is inserted with the new value(s) together with the old values for the unchanged attributes. When facts are loaded from this point on, they will use the new primary key as the foreign key to the dimension row.

This type may require three new attributes or columns to be added to the dimension table: the timestamps for when the new row starts and stops being valid, and a boolean which indicates whether it is the current row that must be used when facts are loaded from this point on. This makes it possible to e.g. filter on changes in dimension row values, i.e. see if there are differences between before and after the change, since there are now two or more (if several changes have occurred) dimension rows. Hence it is possible to see differences in history based on these changes. This type is the most commonly used, since it keeps the history of changes in a flexible way.
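A sketch of the type 2 mechanics for the device dimension sketched earlier; the administrative column names, dates and key values are assumptions:

-- The three administrative columns used for type 2 history.
ALTER TABLE DimDevice ADD RowEffectiveDate  DATE NOT NULL DEFAULT '1900-01-01';
ALTER TABLE DimDevice ADD RowExpirationDate DATE NOT NULL DEFAULT '9999-12-31';
ALTER TABLE DimDevice ADD IsCurrentRow      BIT  NOT NULL DEFAULT 1;

-- Handling a changed device name: expire the current row and insert a new one.
-- The durable key (42 here) stays the same; a new surrogate key is assigned.
UPDATE DimDevice
SET RowExpirationDate = '2014-07-01', IsCurrentRow = 0
WHERE DeviceDurableKey = 42 AND IsCurrentRow = 1;

INSERT INTO DimDevice (DeviceKey, DeviceDurableKey, DeviceNaturalId, DeviceName,
                       RowEffectiveDate, RowExpirationDate, IsCurrentRow)
VALUES (117, 42, 'HVAC-01', 'HVAC Meter (relocated)', '2014-07-01', '9999-12-31', 1);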

3.4.11.4 Add new attribute

Type 3 is called add new attribute. When using this type, a new column is added to the dimension table which contains the old value, and the new value overwrites the attribute, i.e. a type 1 change. This type is not very flexible if all the history of changes needs to be kept, since this would mean a new column for each change. And if only a subset of rows or single rows have their attributes changed, this would cause a lot of not applicable or unknown fields. This type is not used very much, mostly because of its inflexible nature, especially for frequently changing attributes.

3.4.11.5 Add mini-dimension

Type 4 is called add mini-dimension. This type is mostly used for high frequency changes. When using this type a new ”mini”-dimension table is added, which contains the dimension rows for the frequently changed attributes, and they are referred to by a separate surrogate primary key. This type is only useful when many changes are done at a time and when they are done frequently. The fact table then needs to have a foreign key to the normal dimension and a foreign key to the ”mini”-dimension. This type does not keep any history of the changes; it is purely done for performance.

3.5 Star Schema

When data is represented in a dimensional model and deployed in an RDBMS system, and hence uses a relational table structure, it is referred to as a star schema. It is called a star schema because of the relationship between fact tables and multiple dimension tables, which forms a star-like structure. An example of the star-like structure is seen in Figure 3.5.1, where a Temperature fact table has multiple relations to dimensions. Each relation is a one-to-many relation, i.e. multiple fact table rows can be retrieved by querying for a single dimension row. The basis for the star schema is the fact tables and dimension tables discussed above.

The core of dimensional modelling is simplicity and symmetry, which is the basis for its widespread use in data warehousing. It makes it easy to navigate to data, and it helps describe the context. The simplicity also helps the overall query performance, because fewer joins across multiple tables are needed to get a full picture of the data.


An example of an SQL query to get data out of the star schema, and thereby the dimensional model, is seen in Listing 3.5.1.

SELECT SUM([MeasurementDelta]) AS [Delta], [DayName]
FROM [FactIndoorTemperature]
JOIN [DimDate] ON [DimDate].[DateKey] = [FactIndoorTemperature].[DateKey]
WHERE [WeekNumberOfYearName] = 'Week 27, 2014'
GROUP BY [DayName]

Listing 3.5.1: SQL to query a star schema

Here the sum of the MeasurementDelta from the fact table is retrieved along with the dimension table context header DayName, by which the data is grouped. The data is also filtered to only include the measurements for week 27 of 2014. This results in two columns, a Delta and a DayName column, and a row for each day in week 27 for which there is data.

Figure 3.5.1: Example of a star-like structure for a Temperature fact table and five dimensions

Note how simply the data has been retrieved. If this data was only in an operational source system, with the transactional records for the temperature measurements hidden away in multiple tables because of a 3NF model used by the operational application, multiple joins and obscure table names would probably have to be navigated, and defining week 27 of 2014 would involve much more than just a single WHERE clause with the string ”Week 27, 2014”. – This is dimensional modelling.

The star schema and dimensional model are easily extendable, because a row is simply added to add new measurements, dimension labels or levels of a hierarchy. And since most BI applications just use the same query, no data needs to be reloaded or applications adjusted to accommodate these changes. In addition, it is relatively simple to add new columns to accommodate new measurements and new dimension columns, since this can be done using SQL.

An important note regarding the simplicity is to keep the model as simple as possible and not try to optimise it against a specific report or report type, or by trying to apply 3NF. The end-users will always have new, different questions which they want to query upon, so optimising the model against one type of querying could mean a decrease in performance for new types of queries.

To give the end-users the best possible end result, measurements should be done at the most granular level, i.e. the more atomic the data can be measured, the better, since accumulated data can only be used down to a certain level of detail and thereby inhibits the ability to drill through the data. Note that once the measurements have been made it is always possible to roll them up to a lower detail level, but it is not possible to drill down to a higher detail level than was recorded.

3.6 OLAP Cubes

When data is represented in a dimensional model and deployed in a multidimensional database system, it is referred to as an OLAP cube. It is called online analytical processing because of the implementation of the cube, which requires index and calculation preprocessing to be done when data is loaded and cached in the system. This preprocessing requires more time when the data is loaded, which, depending on the amount of data, can potentially mean that data is updated a little more slowly, i.e. because of the additional processing time. Loading of new data might not be done as often as with a regular star schema. However, because of the preprocessing it delivers exceptionally better query performance for the end-users than a regular star schema, which is dependent on the performance of the RDBMS system.

The number of dimensions does not have to be three, as the name ”cube” implies, and a cube can easily consist of several more dimensions than a three dimensional cube; the cube name just makes it more tangible for end-users. Data cubes have been around for several years; the cube is actually a simplification of a statistical cross-tabulation.

The inherent nature of OLAP cubes makes them better at doing more complex calculations as well as manipulating the data in each row. One of the biggest advantages of OLAP products is the fact that they can handle parent-child relationships in a much better way than e.g. a star schema, where this relationship needs more modelling; with an OLAP cube it is sufficient to have a single value/key, which can roll up the parent-child hierarchy. Another advantage of OLAP cubes is that, since precalculations are done, it is much faster to build a solution which can analyse events that ”did not happen”. Since the cube calculates the entire set of facts and dimensions, it is possible to select all the dimension and fact rows where an event did not happen, i.e. where the opposite context of the specified dimension rows was true.

A multidimensional system like an OLAP cube presents data in a multidimensional way, i.e. the cube, which makes it a good tool when implementing a multidimensional model; it lets end-users easily slice and dice the cube of data. By defining the relationships between tables explicitly and by precalculating summaries, OLAP cubes outperform all RDBMS systems. And they can be used for much more analytical processing, because they have many built-in calculation capabilities which do not exist for SQL and RDBMS systems.


Even though the multidimensional model consumed by the cube can be made directly from operational source systems, the OLAP system prefers that the source is already a multidimensional model, e.g. a star schema, since it then does not have to handle any ETL processing, which is done for the star schema. This also makes the processing of the OLAP cube much faster. It should however be noted that OLAP cubes need to be reloaded and processed when dimension rows are added or changed, i.e. when slowly changing dimensions are used in a multidimensional design.

Star schemas and OLAP cubes can be used together to create a better performing DW/BI solution; the star schema can thereby be used as the foundation for the OLAP cube. The content of a cube is often equal to the content of a star schema; it often just includes pre-calculated summary data for faster querying.

An advantage of OLAP cubes is added security, which in most OLAP systems can be defined down to the level of detail at which end-users have access to data. This is not possible for a regular star schema in an RDBMS system, where security can usually only be applied to tables and databases. OLAP cubes have the disadvantage that porting them from one vendor to another requires much more effort than porting a star schema, since the star schema can be implemented in any RDBMS system.

OLAP cubes are not queried through SQL like a star schema; instead MDX and sometimes XMLA is used. MDX is discussed further below. XMLA is an XML based query language, which can e.g. be used to send processing requests to an OLAP system.

The preprocessing of the cube can include the creation of aggregate fact tables; these fact tables include summarised values for the facts. This makes query response much quicker, because the aggregates are already calculated. These summaries are then indexed to include the foreign keys for the dimensions from which they are most likely to be queried. Hence a cube consists of a multidimensional model and the aggregates for (nearly) all possible combinations.

OLAP systems which internally are based on relational models are referred to as ROLAP, while systems which are internally based on multidimensional models are referred to as MOLAP. Note that this is not dependent on whether a star schema is used or not; it is how the data is represented internally in the system.

3.6.1 MDX query language

The Multidimensional Expressions (MDX) language resembles the SQL query language, but MDX has more analytical capabilities through the way it queries the cube. The basis for MDX queries is the SELECT statement, which resembles the SELECT statement from SQL. In SQL the SELECT statement returns a number of rows and columns. In MDX the SELECT statement returns a dimensional result, which can e.g. be in two dimensions and thereby be returned as rows and columns. A basic MDX SELECT statement is given in Listing 3.6.1.

SELECT FROM [Name of Cube]

Listing 3.6.1: Example of a basic MDX SELECT query


When the cube processes a SELECT statement it automatically defines a tuple with all dimensions, including a "Measures" dimension, which contains the measures or facts. When nothing is specified for this tuple, the default values for the dimensions are used. The values for the dimensions are called members. Members are the "children" of the attribute hierarchies.
A cube named "DW Cube", which has the three dimensions date, time and device, will be used in the following examples. In addition to these three dimensions, the special "Measures" dimension is also included in the tuples used to query the cube. The SELECT statement in Listing 3.6.1 would result in a query where the following tuple is used:

([Date].[Date Hierarchy].DefaultMember, [Time].[Time Hierarchy].DefaultMember, [Device].[Device Hierarchy].DefaultMember, [Measures].DefaultMember)

The [Date Hierarchy] is the default attribute hierarchy for the date dimension. The DefaultMember is the default member of that hierarchy. The default member of most dimensions is the [(All)] member, which includes data for all the members in the dimension, i.e. all data. The default member of the measures is usually the first measure that was defined. For this example the default measure will be "Average Temperature". The default members can be explicitly defined in the cube. The above tuple hence translates to the MDX statement given in Listing 3.6.2.

SELECT {[Date].[Date Hierarchy].[(All)],
        [Time].[Time Hierarchy].[(All)],
        [Device].[Device Hierarchy].[(All)],
        [Measures].[Average Temperature]} ON COLUMNS
FROM [DW Cube]

Listing 3.6.2: Example of a basic MDX SELECT query with all tuple values defined

A set of dimensions is defined by the following syntax:

{[Dimension 1], [Dimension 2]}

The ON COLUMNS in the statement in Listing 3.6.2 means that the resulting set should be placed on the columns axis. The listing also shows how to define the query dimensions in a SELECT statement. An example without the whole tuple is given in Listing 3.6.3.

SELECT {[Date].[Date Hierarchy].[Year]} ON COLUMNS
FROM [DW Cube]

Listing 3.6.3: Example of a SELECT query

This would result in the following tuple being sent to the cube:

([Date].[Date Hierarchy].[Year].Members, [Time].[Time Hierarchy].[(All)], [Device].[Device Hierarchy].[(All)], [Measures].[Average Temperature])

As seen in the tuple, the Members function does not need to be written explicitly when querying the cube. Members results in the explicit members of the year, e.g. 2014, 2015 etc., being shown on the columns, while the average temperature is shown in the first row. Another example query is given in Listing 3.6.4.


SELECT {[Device].[Device Hierarchy].[Sensors]} ON COLUMNS,
       {[Date].[Date Hierarchy].[Year]} ON ROWS
FROM [DW Cube]

Listing 3.6.4: Example of an MDX query with multiple members on columns and rows

This would result in the following tuples:

([Date].[Date Hierarchy].[Year].&[2014], [Time].[Time Hierarchy].[(All)], [Device].[Device Hierarchy].[Sensor].Members, [Measures].[Average Temperature])
([Date].[Date Hierarchy].[Year].&[2015], [Time].[Time Hierarchy].[(All)], [Device].[Device Hierarchy].[Sensor].Members, [Measures].[Average Temperature])

and so on for each row. The result will be the average temperature for each year on the rows, and the columns will contain the different sensor devices, e.g. the first floor temperature sensor and the ground floor temperature sensor. Notice the "&", which is used to uniquely identify a member.
A "slicer" can be added to a query, e.g. to get different measures. The slicer is a WHERE clause in the SELECT statement. An example is given in Listing 3.6.5.

SELECT {[Device].[Device Hierarchy].[Sensors]} ON COLUMNS,
       {[Date].[Date Hierarchy].[Year]} ON ROWS
FROM [DW Cube]
WHERE ([Measures].[Average Energy Consumption])

Listing 3.6.5: Example of MDX with a WHERE slicer

This would give the same result as in the previous example but with the average energy consumption shown. The slicer can also be used to constrain the query, e.g. if only a specific device is needed, the query could look like the one given in Listing 3.6.6.

SELECT {[Device].[Device Hierarchy].[Sensors]} ON COLUMNS,
       {[Date].[Date Hierarchy].[Year]} ON ROWS
FROM [DW Cube]
WHERE ([Measures].[Average Energy Consumption], [Device].[Device Hierarchy].[Devices].&[HVAC Meter])

Listing 3.6.6: Example of a WHERE slicer used as a constraint

This gives the same result as above, but it only shows the measure for the "HVAC Meter" device and hence has only one column.
When using e.g. the "Average Energy Consumption" measure, some devices, e.g. the temperature sensors, will result in (null) values in the result view. To get the proper values and discard these (null) values, the NON EMPTY clause can be used in the SELECT statement, as given in Listing 3.6.7.

SELECT NON EMPTY {[Device].[Device Hierarchy].[Sensors]} ON COLUMNS,
       {[Date].[Date Hierarchy].[Year]} ON ROWS
FROM [DW Cube]
WHERE ([Measures].[Average Energy Consumption])

Listing 3.6.7: Example of a NON EMPTY query

To get more dimensions into the result, a cross join of the dimensions can be used for a set of dimensions. The set operator "*" (asterisk) can be used as given in Listing 3.6.8.


SELECT NON EMPTY {[Device].[Device Hierarchy].[Sensors]} ON COLUMNS,
       ([Date].[Date Hierarchy].[Year] * [Time].[Time Hierarchy].[Hour]) ON ROWS
FROM [DW Cube]
WHERE ([Measures].[Average Energy Consumption])

Listing 3.6.8: Example of a cross join of dimensions

This query results in a three-dimensional output, where the dimensions are "Year" and "Hour" as well as "Devices". Since the cross join was done on the rows axis, the result is flattened to fit into the rows; this e.g. results in rows of "2014 - Hour 1", "2014 - Hour 2", and so on for each year.
When defining sets, the WITH statement can be used to make temporary set "variables". An example is given in Listing 3.6.9.

WITH SET [Important Devices] AS
    {[Device].[Device Hierarchy].[Devices].&[HVAC Meter], [Device].[Device Hierarchy].[Devices].&[Lights Meter]}
SELECT [Important Devices] ON COLUMNS
FROM [DW Cube]

Listing 3.6.9: Example of a WITH statement for a set

This query then uses the set defined in the WITH statement. The WITH statement can also be used to define calculations, as given in Listing 3.6.10.

WITH MEMBER [Measures].[Average Dimmable Light Level] AS
    [Measures].[Dimmable Light Level Sum] / [Measures].[Dimmable Light Level Count]
SELECT [Measures].[Average Dimmable Light Level] ON COLUMNS
FROM [DW Cube]

Listing 3.6.10: Example of a WITH statement for a calculation

Note that the measures can also be placed as query dimensions on columns as well as rows. The following set operations are valid in MDX:
• - (except) for set difference
• * (cross join) for Cartesian product
• + (union) for set union
• : (range) for ranges, e.g. [Year].&[2014] : [Year].&[2016], which gives the set containing the members 2014, 2015 and 2016.
Other SQL-like constructs are also valid, such as comments, arithmetic operations, string concatenation and logical expressions.

3.7 Design Process

The design process and methodology for dimensional modelling recommended by [Kimball and Ross, 2013] is a four-step design process.


The process starts by considering the business processes that need to be in the dimensional model. Next the grain is clearly defined. Then the dimensions are identified. And lastly the facts are identified. (A small sketch applying the four steps is given at the end of this section.)

1. Understanding the business processes and what performance metrics can be extracted from them is key.
2. Clearly defining the grain helps to establish a firm foundation for the dimensions and facts. The grain should be defined in business terms, e.g. one row per sales transaction line. This keeps it general enough to avoid premature commitments about the dimensions. However, it might be easier to define the grain precisely in terms of dimensions after the design has been completed.
3. The dimensions are identified by looking at the context of the business processes and the steps involved when the metrics are measured. When the grain is clearly defined, the dimensions are usually the answers to the "wh"-questions, e.g. what, when, who, where etc.; if these are not clear, the grain must be redefined.
4. The facts are identified by looking at the metrics needed and the metrics available from the business processes. The metrics might not be available because they are not measured in the operational source systems at the desired grain, but they might exist at another grain, which means that the grain must be redefined.

It should be clear from the above that the design process can be an iterative one until the right grain is defined from the business processes.
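As a rough illustration of the four steps applied to a sensor-measurement process, consider the sketch below; the process, grain and table/column names (FactIndoorClimate, DimDate, DimTime, DimDevice) are hypothetical and are not taken from this project's design, which is developed later. The business process could be indoor climate measurement, the grain one row per sensor reading, the dimensions the answers to "when" (date, time) and "what/where" (device), and the facts the measured values:

-- Hypothetical fact table declared at the grain "one row per sensor reading".
CREATE TABLE FactIndoorClimate (
    DateKey     INT          NOT NULL,  -- "when": FK to DimDate
    TimeKey     INT          NOT NULL,  -- "when": FK to DimTime
    DeviceKey   INT          NOT NULL,  -- "what/where": FK to DimDevice
    Temperature DECIMAL(5,2) NULL,      -- fact (measurement)
    Humidity    DECIMAL(5,2) NULL       -- fact (measurement)
);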

3.8 Data Warehouse and Business Intelligence components

A DW/BI solution is frequently composed of four components: access to a number of operational source systems, the ETL system, the data warehouse (also called the presentation area) and the BI applications.
• The operational source systems are the systems which contain the, often transactional, measurement records from systems used for operational purposes. This could e.g. be the home control systems. The DW/BI solution needs access to these.
• The ETL system is the component which handles the data transition from the operational source systems to the data warehouse.
  – The extract process handles the extraction of data from the operational source systems, which includes extracting it through various data access layers, e.g. by database operations or by web service calls.
  – The transformation process handles the cleansing of data, which involves e.g. handling of duplicates and conflicts, and conforming the data to a certain data format or standard, so that the data is comparable.
  – Lastly the data is loaded into the data warehouse, where e.g. surrogate keys are assigned to dimensions. Natural keys are usually the keys used to identify a row in the operational source system. Using this key makes it possible for the ETL to find and look up the right dimension row for facts extracted from the operational source system (a small lookup sketch is given after this list).


• The data warehouse is also called the presentation area, because it is where the data is presented to the BI applications, and sometimes directly to the end-users. This is where the dimension tables and fact tables are stored, either in a star schema and/or in a cube, which also contains pre-calculated measures. The data in the data warehouse should always be kept at the most atomic level possible, i.e. the most granular level, so end-users have the ability to drill down to the actual measurements without having to go to an operational source system for this information.
• The last component in a DW/BI solution is the BI application, which consumes the data warehouse and makes it possible for end-users to query and make calculations on the measures from the fact tables. The querying and filtering of measures is done by using the descriptive context dimensions.
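A minimal sketch of the surrogate key lookup in the load step is given below, assuming a hypothetical staging table StagingSensorReading and the hypothetical DimDevice and FactIndoorClimate tables from the earlier sketch; none of these names are taken from the actual implementation.

-- Resolve the source system's natural key (DeviceId) to the dimension's surrogate key (DeviceKey)
-- while loading facts from staging into the fact table.
INSERT INTO FactIndoorClimate (DateKey, TimeKey, DeviceKey, Temperature, Humidity)
SELECT s.DateKey,
       s.TimeKey,
       d.DeviceKey,
       s.Temperature,
       s.Humidity
FROM   StagingSensorReading s
JOIN   DimDevice d ON d.DeviceNaturalKey = s.DeviceId;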

3.9 Summary

Third normal form (3NF) is considered best practice for database designs. However, for BI and data warehousing, dimensional modelling, which is a denormalised data structure, is considered best practice. The reason dimensional modelling is denormalised is that it increases query performance and makes the model much easier to understand for end-users. A dimensional model in a data warehouse can be based on a 3NF model from operational source systems, which often are modelled using 3NF. Dimensional modelling consists of two types of tables, facts and dimensions. Fact tables contain measurements (also called metrics or facts) and foreign keys to dimension table rows. Dimension tables contain a primary key and descriptive context with textual strings, which often describe the "wh"-questions like "when", "where" etc. of a fact measurement. Fact and dimension tables form a cube, where the inner points of the cube are facts and the edges are dimensions. This model makes data more tangible for end-users.
Additivity for facts categorises whether they can be summed across all dimensions (fully additive), a subset of dimensions (semi-additive), or no dimensions (non-additive). The grain of a fact table describes the level of detail of each row in the table. Defining the grain is important, because grains should never be mixed in the same fact table. Most fact tables are transactional, because of the transactional data from the operational source systems. Dimension tables use surrogate keys, e.g. for the primary key, instead of the natural key from the operational source system. This makes it possible to use the same natural key for multiple rows in a dimension table. Durable keys are used when history of changed dimension rows is needed.
Using dimensions, it is possible to do dimensional operations on fact tables, like drilling down, which is done by adding a dimension row header. Rolling up summarised data is done by removing a dimension row header. Slicing and dicing is done by defining dimensions and measures. Dimension tables often represent a hierarchical structure inside the dimension table. Since dimensions are often dependent on operational source systems, slowly changing dimension attributes are commonly used, and there are several ways to handle these changes. A dimensional model deployed in an RDBMS is referred to as a star schema. A dimensional model deployed in a multidimensional system is referred to as an OLAP cube. OLAP cubes are queried using MDX. A DW/BI solution consists of access to a number of operational source systems, an ETL system, a data warehouse and BI applications.

Chapter 4 Goals

"Simplicity is prerequisite for reliability" —Edsger W. Dijkstra, 1975

Contents
4.1 Stakeholders
4.2 Project Goals
4.3 Non-functional requirements

This chapter defines the goals and research questions for the project.

4.1 Stakeholders

To be able to identify the right goals for this project, it is important to consider the stakeholders and their respective stakes in the context presented in the preceding chapters. The identified main stakeholders for the data warehouse project for the EMBRACE house are: the occupants of the house, a tablet application developer, the students from the EMBRACE project, the researchers, the sponsors and the public visiting the EMBRACE house. As described in the context, the ten contests of the Solar Decathlon Europe competition evaluate how well a house is to live in for its occupants; hence the focus of the stakeholder analysis is on the competition. The stakes for the respective stakeholders are given in Table 4.1.1.

Stakeholder | Stake | Importance
Occupants (External) | Get relevant data by easy to understand business intelligence reports (insight reports) | High
Tablet application developer (Internal) | Get relevant data and the ability to show graphs from these | High
Students (Internal) | Get relevant data for theses and analyse the performance of specific parts of the house | High
Researchers (Internal and External) | Get relevant data to analyse the overall performance of the house and the ability to drill through to get specific data | Medium
Sponsors (External) | Be able to showcase the capabilities of a DW/BI solution which can show the distribution of energy consumption | Medium
Public (External) | Be able to see the innovation in using data warehousing and business intelligence in intelligent houses | Low

Table 4.1.1: Stakeholders and their respective stakes

4.2 Project Goals

From the stakeholders and the context the following goals are identified. The main goal for the EMBRACE project is to save energy. This goal can be refined to the following: for the occupants of the EMBRACE house to consume less energy and to gain awareness of their energy consumption. The gain-awareness goal can be refined further: to provide and gain insights into the house and thereby get a complete picture of the house's functioning and usage. This should be done by ensuring that the data is valid and by providing the ability to analyse the data. The analyse-data goal can be further refined into these specific goals: help occupants find relations in the data, give them the ability to drill through the data, and offer ways to slice and dice the data. The analysis should be done using a simple interface for reporting which is intuitive to use. A non-project goal is to experiment with data warehousing technologies like ETL and OLAP cubes. Figure 4.2.1 shows the hierarchical order of the project goals.

4.2.1 Research Questions

From the above goals the following main problems will be investigated in this thesis:
• What is data warehousing? And how does it relate to Business Intelligence?
• What is possible with a data warehouse? And is it possible to use it in an intelligent house context instead of a business context?
• Which components are needed for a business intelligence solution using a data warehouse, and how do the components interact?
• How can data be presented in an efficient and intuitive way for end-users?
• What is the best way to model a data warehouse?
• What is needed for an efficient ETL system?
• How can business intelligence reporting be used to gain knowledge about the performance of a house?


Figure 4.2.1: Project goals (goal hierarchy: Save energy; below it Consume less and Gain awareness; below Gain awareness, Provide insights; below that, Analyse data and Valid data; and below Analyse data the leaf goals Drill through data, Find relations, Slice and Dice, and Intuitive and Simple)

4.2.2 Principal Research Question

The project goals and research questions lead to the principal research question which this project needs to address: how to give the occupants and other subsystems of an EMBRACE house insights into the functioning of the house, based on the measurements done by the control systems.

4.3 Non-functional requirements

For the data warehouse project to succeed, a number of non-functional requirements must be considered together with the goals.
The performance must be adequate; it does not help the end-users if queries take too long to execute. Hence query response and refresh times should not be longer than one second. Besides query response time, the data warehouse must process new data, and the processing time should be minimised as much as possible so that new data is served to the end-users as fast as possible. Before the data warehouse can process new data it must extract, transform and load data into the data warehouse. The time used on the first load should not exceed one hour, depending on the amount of data in the sources. Because this depends on the amount of data, a processing time per row should also be defined; this should be a maximum of 100 ms per row. Subsequent ETL operations should take less than 15 minutes, again with a maximum of 100 ms per row. If the first processing takes approximately an hour, a fall-back data warehouse is needed. This fall-back should be used during maintenance, where a first-run processing is needed.
The system must be extendible, so it can accommodate new data sources as these arrive. This also means that the data warehouse system should be easy to maintain.
The interface and data presented to the end-users must be intuitive and easy to use; the usability of the system must adhere to this.
The data should be secured in the cloud and only be available to authenticated users. Authorisation can also be considered for the access, since the data warehouse is able to serve several EMBRACE houses.
The integrity and validity of the data must be adequate to give a complete picture. This means that data which is invalid, e.g. above a set maximum or below a set minimum, should be discarded. It also means that data which cannot be mapped properly should be discarded. All discarded data should still be logged in order to allow auditing at a later stage. The data must be consistent, meaning labels and descriptions should be the same for the same data, i.e. if two labels are the same, then the data should be the same.
The storage needed for the data warehouse will grow as more and more data is loaded into it. Hence the storage capacity should be extendible. The initial storage capacity for the data warehouse should not exceed 250 GB.
The number of transactions the data warehouse should be able to handle must accommodate the usage from occupants, researchers, students and the public: if they use the data warehouse for 25 queries per hour on average, and the number of people/devices using it every hour is about 250 on average, then the system should handle 6250 queries per hour on average. However, this should also be scalable to accommodate more queries and more people/devices.
The availability should be as high as possible; however, since the service is an added service for the occupants of the EMBRACE house, 95% availability is sufficient, meaning 365.242 days · (1 − 0.95) = 18.3 days of unavailability a year.
Recoverability of the data warehouse should be established by ensuring that the data warehouse can be regenerated from the source data at any point in time; as long as the source data is intact, the data warehouse should be able to recover. Hence backup of the data warehouse is not required, but it is required for the source data. The time estimates for the regeneration of the data warehouse should adhere to the initial ETL loading times described above, and if possible the fall-back data warehouse should be available during the first load.
Compatibility is limited to a Windows environment with the .NET Framework, since C# is used by the other components in the system. To be able to communicate with other devices, the system will have to use REST web services and the HTTP application layer. To communicate with other components of the system, the Kademlia P2P Publish/Subscribe network will be used.

Chapter 5 Analysis

"The purpose of computing is insight, not numbers" —Richard Hamming, 1962

Contents
5.1 Project goals from given context
5.2 Sub-problems
5.3 Solutions for sub-problems
5.4 The proposed solution
5.5 Business Intelligence Reporting
5.6 Operational Source Systems
5.7 Requirements specification

This chapter analyses the problem space, the different operational source systems and the contents of the business intelligence reports. This leads to a requirements specification at the end of the chapter.

5.1 Project goals from given context

The project goals are derived from the top-most goal inherited from the EMBRACE project, which is to save energy. This goal is part of the three main concepts of EMBRACE: Save, Smart and Share. This thesis will focus on the defined goals, which are highly influenced by the Solar Decathlon Europe 2014 competition. However, since the main goals for the competition are the ten contests, which try to evaluate the best house on different parameters, these goals are also valid for a house built for occupants who want to live in it. Because of this correlation between competition goals and goals for occupants, the focus of this thesis will be on the competition goals.
As defined in the project goals, the top-most goal can be combined with the Smart and Share concepts as well, which gives rise to this data warehouse project, because saving energy in a smart way, which makes it possible to share the saved energy with other houses, can be enabled by making the occupants more aware of their energy consumption. Awareness about consumption patterns might contribute to less energy consumption, which is another goal for this project.
Drilling further down, these energy consumption goals only get this far. Going further into how they can be achieved, however, results in more goals for this project: gaining awareness of the energy consumption can be further decomposed into providing insights into the house functioning and getting a complete picture of the house usage. The insights should be presented and manipulated intuitively and should be easy to use. The data which is the foundation of these insights should be valid. Taking these goals further into solution-oriented goals, the following goals emerge: the ability to drill through and analyse the data, and the ability to find relations using the data. These goals can be realised by making it possible to slice and dice the data cubes into specific areas of interest.

5.2 Sub-problems

The idea of insights from the principal research question can be further decomposed into the following sub-problems. The sub-problems can also be viewed as use cases for the project, as shown in Figure 5.2.1.

Figure 5.2.1: Use cases from the goals

1. Gain knowledge about the functioning of the house
   The idea of gaining knowledge about the functioning of the house can be further decomposed into the following components:
   (a) Gain knowledge about energy consumption and the context in which it is consumed
   (b) Gain knowledge about energy production and the context in which it is produced
   (c) Gain knowledge about how the devices in the house are used and how long they are in different states
   (d) Gain knowledge about the temperature, humidity and other indoor comfort conditions which might affect the usage of the house and the context for these
2. See how elements relate
   The idea of seeing how elements relate can be further decomposed into:
   (a) Seeing differences in curves in a graph for elements in a house
   (b) Seeing similarities in curves in a graph for elements in a house
   (c) Ability to compare graphs to further analyse relationships
3. See how elements of the house and outside conditions relate
   The idea of seeing how elements of the house and outside conditions relate can be further decomposed into:
   (a) Ability to see outside conditions like temperature and cloud cover
   (b) Ability to compare graphs of outside conditions with indoor comfort conditions
4. Perceive the numbers as tangible elements
   The idea of perceiving numbers as tangible elements can be further decomposed into:
   (a) Ability to easily add graphs to be generated in reports, instead of showing raw numbers
   (b) Ability to summarise as many numbers as possible, so no additional calculations are needed to get the full picture
   (c) Ability to change the formatting of numbers so they seem familiar to the occupants
   (d) Ensure understandability of the data model, e.g. by using a tangible object like a cube
5. Drilling through data to analyse it
   The idea of drilling through data to analyse it can be further decomposed into:
   (a) Ability to summarise data from a hierarchy level above the current summarised view (roll up)
   (b) Ability to summarise data from a hierarchy level below the current summarised view (drill down)
   (c) Ability to summarise data from different measurements (drill across)
   (d) Ability to drill all the way down to the raw data (drill through)
6. Slice and dice data cubes
   The idea of slicing and dicing data cubes can be further decomposed into:
   (a) Ability to specify the view for specific facts
   (b) Ability to specify the view for specific dimensions to see the facts
   (c) Ability to give numbers context by means of dimensions
7. Get relevant and valid data
   The idea of getting relevant and valid data can be further decomposed into:
   (a) Ability to calculate with known metrics
   (b) Having valid data

5.3 Solutions for sub-problems

The sub-problems identified above can be solved in several ways.
• The first solution is to use the operational source systems and modify or extend them to accommodate the capabilities necessary to get an overview of the data.
• The second solution is to use an ordinary 3NF relational model for the data. The data is already stored in 3NF relational models in the operational source systems, so these can even be used without the need to "relocate" the data.
• The third solution is to use a star schema model for the data. This requires that the data is "relocated" from the operational source systems, where the data is stored in a 3NF relational model.
• The fourth solution is to use a multidimensional system with an OLAP cube to present the data. This can be done using either the raw data from the operational source systems, the 3NF relational model or the star schema model.

5.3.1 Extending operational source systems

Extending the operational source systems is one solution to the given sub-problems.

5.3.1.1 Pros

Since the data is already located in these systems, it makes sense to let these systems generate the overviews. An advantage is that no overhead is produced, e.g. by having additional systems transferring data, and hence the overviews will also be up to date with the current data in these systems. There is no need to increase storage capacity either, since all the data is already in the databases.

5.3.1.2 Cons

This solution would, however, require the systems to handle the transactional data in a very different way than they already do. Most operational source systems handle transactional data one row at a time, which they are designed for and can handle quite well. This does not work well for analysis, though, so these systems would have to be extended to handle and calculate hundreds of thousands, even millions, of records to get summarised data. In addition to handling the vast amounts of records, these systems are not integrated, which means that the overviews that can be generated from them will not give the complete picture; they will only show the summarised data for the particular system. These integrations could also be built, which would however cost more in overall development.
Access rights to the data can be limited to the same authorisation levels as the existing systems use. This is both good and bad. Since most systems use very restricted access rights, it will be impossible for unauthorised people to get any data, which is good. But it also means that people who only want the summarised data will need access to the systems, which also includes the raw data, which is bad.
Another problem with this solution is the fact that these systems are typically business critical, in the sense that constantly making large queries against their databases might interfere with the operational use of these systems. This also means that scaling to accommodate an increase in users or queries made by users of the DW/BI system will have an adverse effect on the operational use of the systems; hence scalability is almost non-existent for this solution. Also, just extending these systems to handle analysis does not make them very extendible when new data sources or new types of queries are needed.
The query performance of this solution is not good, because these systems are not designed to analyse the data; they are designed to handle normal operation, which most often means handling single transaction rows. With this solution, data quality and validity must be defined for each system, which can make comparison of raw and summarised data between systems impossible or, worse, incorrect, if users think the data can be compared.

5.3.1.3 Sub-problems covered

The sub-problems covered by this solution include (1) and to some degree (4) and (5).
• Sub-problem 1 is covered because the solution makes it possible to gain knowledge about the functioning of the house, by being able to summarise the data and thereby give indications of what can be done to e.g. reduce energy consumption.
• Sub-problems 2 and 3 are not covered because for each of the operational systems only the data for that particular system is available to be analysed, and hence comparisons cannot be made to e.g. see relations.
• Sub-problem 4 is covered to some degree because the solution could include functionality to show graphs and format the numbers so they seem familiar. However, many calculations would probably have to be done after the reports/summarised data are generated, because occupants want answers to very different questions, which cannot all be predefined. This can make the system very tedious and cumbersome to use and will hence not give the full picture required.
• Sub-problem 5 is covered to some degree because the raw data is available in these systems, making it possible to retrieve it. But this retrieval might not be as intuitive if it is not done with the same filters which are used to drill down to the current summarised views.
• Sub-problem 6 is not covered because using the transactional data means that it is not modelled as a cube, and hence dimensions and facts cannot be defined to slice and dice the data to the needs of the analysis.
• Sub-problem 7 is not covered because the data validity needs to be defined for each operational source system, and hence data validity cannot be ensured.
From the above it is seen that this solution only covers very few of the sub-problems. Hence this solution is not the proposed solution for this project.


5.3.2 3NF relational model

Using a 3NF relational model is another solution to the given sub-problems. This solution can be split into two variants, one where the databases of the operational source systems are used and another where the data is replicated and relocated to another database.

5.3.2.1 Pros

The pros of a 3NF solution using the databases of the operational source systems are basically the same as for the above solution. However, since the operational source systems themselves are not used, only their data model, it allows for the use of better analysis and database tools on the data model. This can e.g. make it possible for end-users to query the specific data that they need. The data is also always up to date with the operational source systems.

The most significant pro of a 3NF solution which replicates the data into another, standalone 3NF data model is that, when large queries are issued, the operational source systems will still be able to perform as they usually do, since the data is queried elsewhere. Again, the fact that the operational source systems are not used directly means that better analysis and database tools can be used.

5.3.2.2 Cons

The cons of a 3NF solution using the databases of the operational source systems are also basically the same as for the above solution. However, there will also be security and usability issues with this solution. Most RDBMS systems grant access based on databases and tables, which means that when an end-user is given access to the database, all data is available, even the raw data, i.e. not just summarised data. The usability of this solution is very low, since end-users must learn the underlying data model/structure, which for a highly optimised operational source system will include dozens, maybe hundreds, of tables and links between these. This will overwhelm most end-users, especially non-technical users. The end-users will not have any tangible feel for the tables or the numbers, since they are in 3NF. Another problem is that the data for different operational source systems will be in different databases, which can make it impossible to get a complete picture.
A significant problem with this solution is that the queries are handled by the same RDBMS systems which also handle the operational source systems' usage of the data, which can lead to decreased performance for these. Note that these systems are often business critical, so decreased performance is not acceptable. Another issue with the query performance is when calculating summaries, which still need to be calculated from single transaction rows for each query; this means that query performance will be low, especially if multiple tables are joined.
Integration between operational source systems is non-existent, which means end-users will have to manually load the data into an analysis tool to be able to drill across. It also means that data quality and validity are defined per system, so the numbers might not be comparable.
For the 3NF solution which replicates data, all of the above cons also apply; however, since the data is physically located in other tables/databases, the querying does not affect the performance of the operational source systems. On the other hand, since the data needs to be replicated, it might not always be up to date with the latest data from the operational source systems. This solution also requires the data storage capacity to be increased, and it requires separate components which handle the data transition, i.e. ETL components, to be implemented.

5.3.2.3 Sub-problems covered

The sub-problems covered by these solutions include (1) and to some degree (2), (4) and (5).
• Sub-problem 1 is mostly covered since it should be possible to get some measured numbers out of the 3NF data models. However, the context for the numbers might not be easily retrieved through these models, since this data is probably hidden away in the data model because of the application layers.
• Sub-problem 2 can be covered by the use of proper analysis and database tools; however, the data might need to be loaded into these tools manually if the analysis tools are independent of the data source. These tools will be able to show visualisations of the data, but to get the most out of the data, it may need to be transformed or calculated before it can be used directly by the visualisation tools. Hence the data model needs some work before it can be properly used by the analysis tools.
• Sub-problem 3 is not covered by this solution because this data will most likely come from two operational source systems, and getting a complete picture will be almost impossible.
• Sub-problem 4 is somewhat covered because it should be possible to add graphs and summarise some of the metrics in the 3NF data model. This is however likely to be done by the analysis tools and not by the data model, which can affect query performance a lot, since it is the analysis tool that needs to do the calculations. The data model will most likely not be easy to understand for end-users, which will make using it very tedious and cumbersome, since most of the data needs to be processed manually to fit the analysis tools. Also, if end-users do not clearly understand the data model, they will most likely have a very hard time figuring out e.g. which tables to get specific metrics from.
• Sub-problem 5 is nearly covered since it should be possible to get summarised data, and the raw data is easily available. However, it is not possible to drill across, because the data might not be comparable and the metrics will most likely come from different operational source systems, making it even harder to drill across. Also, the data taken directly from the tables might not be in hierarchical order, which means that it is not easily rolled up for the summarised views. This will make it very hard to drill down and roll up.
• Sub-problem 6 is not covered since the data is not in a dimensional model, meaning it is impossible to slice and dice the data directly from the data model using dimensions and facts. And since dimensions are not used, the context for the measurements might be harder to get.
• Sub-problem 7 is not covered since data validity needs to be defined for each data model in different databases. And since the 3NF model does not necessarily use known metrics, it will be hard to get relevant data out.


From the above it is seen that this solution only covers very few of the sub-problems. Hence this solution is not the proposed solution for this project.

5.3.3 Star schema model

Another solution to the sub-problems is to model the data as a dimensional model in a relational database, also known as a star schema.

5.3.3.1 Pros

When the data is in a dimensional model, like a star schema, it is optimised to handle varied querying of large sets of data. This means that the overall query performance is significantly better than when using a 3NF model.
The star schema enables better integration between operational source systems in a single database, even in single tables. This makes it possible for end-users to query exactly the data they want, no matter which operational source system the data comes from, i.e. it makes it possible to drill across, given that conformed dimensions are defined in the dimensional model. This can give a more complete picture of the house functioning.
Extendibility is also easily achieved when using a star schema, since new fact rows and dimension rows can be added. If changes to the dimensional model are needed, e.g. to accommodate a new fact/metric or dimension foreign key in a fact table, this can usually be done without affecting the existing BI applications. The same goes for adding new descriptive attributes to dimensions; these can also be added without affecting existing BI applications. However, when these changes need to be exposed in the BI application, it will need to be modified. Likewise, removing fact/metric columns in a fact table or attributes/columns in a dimension table will require modifications in the BI applications.
Since the data is located in another database/other tables than the databases/tables that the operational source systems use, the varied, sometimes large and computationally heavy, querying will not affect the operational source systems. This makes it possible for these systems to keep their level of reliability and availability.
Data quality and validity in a star schema can be enforced and ensured across operational source systems, which makes it possible to easily compare raw and summarised data across these systems, if they are comparable. This can also be indicated to the end-users through labelling and naming. The relations between dimension tables and fact tables are ensured by referential integrity enforced by the RDBMS, which means that foreign keys will always resolve to a proper dimension row.
More and better analysis and database tools can handle a dimensional model, because they do not need to alter the way the data is modelled before using it. This also makes it easier for end-users to use these tools, because they do not need to modify the data before using it. The way the data is modelled also gives users a more tangible feel for the data, since they can imagine it as a cube. The data is also modelled for ease of use, so no confusing joins are needed, and given that the labelling and naming is done for use by end-users and not by operational source systems, it is possible for end-users to easily identify the desired tables and attributes which they need for their analysis. This means increased usability.
The scalability with respect to the number of users and their queries is much better when using a star schema, since these are handled separately from the operational source systems.

5.3.3.2 Cons

Since the data needs to be loaded into the star schema, and hence be relocated, the data will not always be up to date with the operational source systems. However, the load can be run more frequently to keep the data up to date at a certain level.
A dimensional model will in general always need more storage capacity than a 3NF model, since the data is denormalised.
Since the data is in an RDBMS, the query performance for summarised data is not optimised, because the summaries need to be (re-)calculated during querying, and these will most likely not be cached, so subsequent queries will also perform badly.
The attribute hierarchies of the dimension tables do not have a strong relationship, which means that rolling up summarised data is done manually by the end-users or the BI applications. This both decreases performance and means that end-users will not be able to fully utilise the intelligence provided by the attribute hierarchies.
Building a proper dimensional model from the data models used by the operational source systems will take longer to implement than using a 3NF model or the existing systems, because it requires a completely new structure for the data as well as several other components which handle the transition of the data, e.g. ETL components.
Having a star schema in an RDBMS does not provide any fine-grained authorisation other than the ability to give access rights to certain tables or databases, which means that end-users given access will not only have access to summarised data, they will also have access to the raw data (the most granular level defined for a fact table).

5.3.3.3 Sub-problems covered

The sub-problems covered by this solution include (1), (2), (3), (4), (7) and to some degree (5) and (6).
• Sub-problem 1 is covered since the data is modelled for easy retrieval of the facts (metrics) as well as their context through dimensions.
• Sub-problem 2 is covered since the data model allows for quickly getting summarised data to generate graphs, e.g. using an analysis tool.
• Sub-problem 3 is covered since conformed dimensions allow for comparing across different metrics, e.g. from different operational source systems.
• Sub-problem 4 is covered since graphs should be easy to add without much manual data processing, because the data is modelled for most analysis tools, which also means that summaries can be quickly generated by these tools. The model supports understandability to a high degree since it can be imagined as a tangible object like a cube, with the edges as dimensions and the inner points of the cube as facts.
• Sub-problem 5 is mostly covered; however, since there are no strong relations between the attributes, and thereby no enforced attribute hierarchies in the dimension tables, it is up to the end-users to use the attributes properly as hierarchies. Drilling across is available in a dimensional model as long as conformed dimensions are used. The raw data will also be available as long as the grain is defined to include the raw data; otherwise only the most granular level defined is available. The model supports drilling through nevertheless.
• Sub-problem 6 is also mostly covered since the star schema is a dimensional model, where it is hence possible to specify the dimensions and facts for the desired view of summarised data, i.e. slicing and dicing, and the model includes dimensions to give context. However, since the model needs to calculate all the summaries at query time, the performance can worsen for large queries.
• Sub-problem 7 is also covered because data validity is defined for the star schema, which includes data from multiple sources, and the known metrics for end-users are defined clearly in the star schema through proper labelling, so they are easily available.
From the above it is seen that this solution covers all of the sub-problems. Hence it is a viable candidate for the proposed solution for this project. However, because of the query performance, another solution might be better.

5.3.4 Multidimensional system with an OLAP cube

The last solution for the sub-problems is to use an OLAP cube, which is deployed in a multidimensional system. This solution can be split into two variants, because the cube needs to get its data from a data source, and this data source can either be a 3NF model or a star schema.

5.3.4.1 Pros

The pros of a cube based on a 3NF model, in addition to the pros given for the 3NF solution, are the following:
It transforms the 3NF model into a dimensional model, thereby providing dimension tables and fact tables based on the model. The dimensional model is highly optimised for querying, making this solution much better than the ordinary 3NF model solutions. Most OLAP systems can access multiple operational source systems, which enables easy integration between these systems. This makes it possible for end-users to query across the 3NF models by using the dimensions and facts of the dimensional model provided through the cube. Since the data is provided in a dimensional model, it does not need to be remodelled or prepared before it can be loaded into analysis tools. The data is always up to date if the 3NF model and the databases of the operational source systems are used. Using the existing databases also means that little extra storage capacity is needed; however, the cached data as well as the preprocessed data needs to be saved.


The end-users will not have to learn the complex data model used by the operational source systems, but can be given a subset of the data model, easing the usage.

The pros of a cube based on a star schema, in addition to the pros given for the previous solution, are the following:
Since the star schema is already a dimensional model, the data is accessed even more quickly by the OLAP system. Implementing and deploying the OLAP cube is very easy when it is based on a dimensional model like a star schema, since every dimension and fact in the cube has a one-to-one relation with the underlying model. Referential integrity is ensured by the underlying data source, ensuring correct behaviour and calculations by the cube. Data quality and validity are ensured by the ETL system which populates the star schema; hence the data quality and validity of the cube are better than for a 3NF-based solution. Extending the star schema makes no difference to the cube provided by the OLAP system, and when the model in the OLAP cube needs to be modified, these changes are easily loaded into the model because of the one-to-one relation. The operational source systems can perform as designed, since the querying and pre-calculations are done elsewhere than in the databases used by these systems.

The pros of both solutions are the following:
Using an OLAP cube makes it possible to use some of the best analysis tools, making it easier to query the data and hence produce better reporting. The model provided by the cube is a dimensional model, making it more tangible for end-users. Since the summarised data is pre-calculated and cached, the query performance is much better than in the other solutions. The attribute hierarchies are predefined in an OLAP system, meaning the hierarchies can be used properly and as intended by the end-users. Authorisation can be managed at several levels of detail, e.g. it is possible to give access rights only to summarised data, while access to the most granular (raw) data can be given to a select number of end-users. The dimensional model means that data can be compared across the fact tables, if conformed dimensions are available.

5.3.4.2 Cons

The cons of a cube based on a 3NF model are the following:
Even though the query performance is optimised, basing the cube on a 3NF model decreases the processing performance a lot, since the data needs to be loaded through different data sources, and the model does not have a one-to-one relation with the dimensional model, which means that this mapping needs to be processed as well. Query performance is also decreased because the data needs to be retrieved e.g. from several operational source systems, and depending on the complexity of the model, expensive joins are necessary to make a dimensional model with the right measures out of the 3NF model. No real referential integrity is guaranteed, since facts and dimensions are retrieved separately, which might result in unexpected behaviour and calculations by the cube. Extendibility is also a problem with this solution, because extending the model to include new fact tables and dimension tables requires the OLAP system to process even more joins and navigations to get the proper data out of the operational source systems.
If the same databases and tables are used by the OLAP cube and the operational source systems, querying will affect the performance of these systems negatively, since the cube will make varied and often large queries to the database, especially during pre-processing and during end-user querying. Data quality and validity cannot be ensured to a very high level; the OLAP system can only filter rows out based on basic SQL, and if more advanced filtering and cleansing is needed, this has to be done in an ETL system. This is a significant problem when using the databases of the operational source systems as the base for an OLAP cube. What is essentially being done with this solution is building an ETL component inside the OLAP system, which should never be done, because the lack of flexibility makes it almost impossible to build a proper ETL system. It also delivers an overall inferior performance because of the tight coupling between the data model of the operational source systems and the dimensional model provided by the cube. The OLAP system should be kept as a presentation area for a data warehouse.

The cons of a cube based on a star schema are the following:
The most significant con is that the data is not always up to date with the operational source systems, because of the loose coupling between these systems and the data source for the cube. The data is also relocated into a denormalised format, which requires increased storage capacity.

The cons of both solutions are the following:
Most OLAP vendors provide proprietary analysis tools, which means these tools are often only available on certain platforms. Using an OLAP system requires the data to be pre-processed, which means that it takes longer before the data is available to end-users.

5.3.4.3 Sub-problems covered

The sub-problems covered by these solutions include (1), (2), (3), (4), (5), (6) and (7).
• Sub-problem 1 is covered since all the data/metrics from the facts are easily available through the dimensional model provided by the cube. The context is provided by the descriptive dimensions.
• Sub-problem 2 is covered since OLAP cubes can be used by advanced analysis tools, which provide easy use of graphs and other visualisation techniques.
• Sub-problem 3 is covered when using conformed dimensions, since these make it possible to see relations between fact tables of different operational source systems.
• Sub-problem 4 is covered since most of this is possible using analysis tools that work with OLAP cubes. The cube design also makes it easy for end-users to understand the data and hence makes it more tangible for them. Calculations are done by the OLAP system, which means that the analysis tools do not have to calculate anything to present summarised data. And since analysis tools work directly with OLAP cubes, no manual effort is needed before the tools can be used.
• Sub-problem 5 is covered since OLAP cubes have strongly defined hierarchies, which makes it possible to utilise the intelligence of hierarchies fully to both roll up and drill down summarised data. If conformed dimensions are used, drilling across is also possible. Using an OLAP cube also gives the ability to drill all the way through to the most granular data. And because the OLAP system pre-calculates the summarised data, query performance is much better than in any of the other solutions. If a star schema is used as the foundation for the OLAP cube, the query performance is even better, especially when drilling down in the data.
• Sub-problem 6 is also covered since the data is provided in a dimensional model, where facts and dimensions can be specified to slice and dice a cube, and the context is given by the dimensions. Since most data is pre-calculated, the query performance, even for larger queries, is better than when using a star schema alone.
• Sub-problem 7 is also covered because data validity can be defined both in the underlying data model, if a star schema is used, and in the OLAP cube itself. The known metrics for end-users can exist in the star schema, and even more known metrics can be defined in the OLAP cube.
From the above it is seen that this solution covers all of the sub-problems. The pros and cons show that an OLAP solution based on a star schema is the most viable and should hence be considered as the proposed solution.

5.4 The proposed solution

From the above it is clear that the best and most viable solution is a combination of a star schema and a multidimensional system with an OLAP cube. This solution will require an ETL system for the star schema and an OLAP system for the data cube. This will require more development, but the outcome should be significantly better than any of the other solutions in terms of processing, querying and overall performance, as well as in data quality and validity. And because of the underlying dimensional model, end-users will understand the data much more easily. The solution also makes it possible to use some of the best analysis and database tools, making reporting, querying and decision support much easier for end-users.

5.5 Business Intelligence Reporting

One of the purposes of the DW/BI solution made for this thesis is to provide reports, which is also known as Business Intelligence Reporting. Such reports are often called "flash reports" in a business context, because they give a quick overview of the state and performance of a business, and they are usually directed at top and middle management. In this thesis these reports will be called "Insight Reports" (Insights), because they will provide occupants and other stakeholders with quick overviews of and insights into the overall functioning of the EMBRACE house.


The overviews are made by selecting dimensions and facts. These will be identified and determined in the design chapter. The following are the minimum required measurement types that should be available in the insight reports.

5.5.1 House measurement types

The following measurement types should be available for the insight reports in the DW/BI solution to be able to provide proper insights into the functioning of the house. The list is prioritised, with the most important measurement types first.
1. Energy consumption divided by the different types of consumption in the EMBRACE house. This includes the energy consumed by the following appliances: Dishwasher, Dryer, Fridge, Lights, Oven, Stove and Washing machine. It also includes the energy consumed by the following devices: Heat pump, Nilan system, Control systems (IHC and PLC) and other pumps.
2. Total energy consumption. This measures the total energy consumed from the power grid, i.e. it does not include the power sent to the power grid from the energy produced by the PV panels.
3. Energy production from the PV panels. It should be noted that, during the competition, the energy produced is not consumed by the EMBRACE house itself; instead all the energy is sent to the power grid.
4. Dimmable light levels and the duration for each level; the duration is calculated based on the transition between levels.
5. Indoor and floor temperature measurements from different places in the house.
6. Humidity measurements.
7. Light switch states and the duration for each state; the duration is calculated based on the transition between states.
8. Water flow/pressure for the solar panels.
9. Water temperature and general temperature for the solar panels and storage tanks.
10. Door and window open states and the duration of these states; the duration is calculated based on the transition between states.
11. Motion detected by Passive Infrared (PIR) sensors; this is recorded as states, which means that the duration can also be calculated. There are two types of PIRs: alarm PIRs and regular motion detectors. The alarm PIRs are less sensitive and more accurate.
12. Power outlet states and the duration of these states. It should be noted that the power outlets of the EMBRACE house are double outlets, where one is constantly turned on.
13. Brightness or daylight sensor, which captures whether there is light outside. This sensor is state based, hence the duration can also be calculated.
14. Maintenance of the solar panels, which is state based, so the duration can be calculated.


5.5.2 Weather measurement types

In addition to the above house measurement types, the following forecast weather measurement types should be available for the insight reports. The list is prioritised, with the most important measurement types first.
1. Weather description, e.g. clear sky, raining, etc., which gives a rough description of the weather.
2. Cloud cover, which is the percentage of cloud cover. This can e.g. be used to see why energy production is low or high.
3. Humidity, for the outside humidity.
4. Pressure, which can be used to determine whether stable or unstable weather is expected.
5. Amount of precipitation, e.g. rain or snow.
6. Temperature: expected actual, high and low temperatures.
7. Wind speed and direction.
The above measurements will mostly be expected (forecast) measurements; however, if the forecast for the current hour is used, they should still be of value to the insight reports.

5.5.3 Competition measurement types

During the competition the Solar Decathlon Europe organisation will also be making measurements. The following measurement types should be available for the insight reports. The list is prioritised, with the most important measurement types first.
1. Indoor air quality.
2. Temperatures in rooms, dishwasher, washing machine, freezer, fridge and oven.
3. Indoor humidity.
4. Power measurements including PV, consumption for different elements of the house and the total power taken from the grid. The total power taken from the grid includes the power produced, i.e. a negative value means that more energy was produced than was taken from the grid.
These measurements will mostly be used by researchers and students after the competition; however, they can also be used later when comparing the current situation with the situation during the competition, and hence they enable better insight reports.

5.6 Operational Source Systems

The following operational source systems have been identified as the primary raw data sources.

5.6.1 Hardware System Integration

The hardware system integration (HSI) service integrates the control systems in the EMBRACE house(s). It enables control of these systems by setting set points and reading values, and it provides access to items, which are generic devices. The integration makes it possible to manipulate these items through three basic operations: UpdateEvent, Read and Update.


• The UpdateEvent operation is for event handlers, which are fired when a new value is available for the item.
• The Read operation reads the current value from the control system.
• The Update operation is used for setting a set point in the control system.
The service can integrate multiple control systems, called hardware systems. For the EMBRACE house, an Intelligent House Concept (IHC) integration, a Programmable Logic Controller (PLC) integration and an Uponor integration are built into it. The items exposed by the service originate from these control systems, which is why the items are generic and are accessed through the interface IItem. The relation between an item and the control system is based on item types. Item types are also generic, and hence interfaces are used when working with these objects. The integrations also provide a number of grouping objects: one for houses, one for floors and one for groups.

5.6.1.1 Interfaces and Entities

The generic item type, IItemType, consists of a name, a description and a type, as well as whether the type is an interval or has states; if it has states, these are defined in a list in the item type. There exist three types of item types, which are the entities implementing the generic item type: one for numeric values, called ItemTypeNumeric; one for boolean values, called ItemTypeBool; and one for a custom number of states, called ItemTypeCustom.
• ItemTypeNumeric objects contain a maximum and a minimum value. The numeric item type is an interval, which is set on the generic item type.
• ItemTypeBool objects contain a true and a false state. States are objects which have a string value; using states makes it possible to give item values a user-friendly label, e.g. "Window Opened"/"Window Closed" instead of "True"/"False". The bool item type is a simplified custom item type, which only has two allowed states, for true and false.
• ItemTypeCustom objects contain a list of states which can be used for this item type.
The different entities from the HSI system are described below:
• The generic item IItem consists of a name, a description, a list of groups, the item type and a value, which is a generic object.
• An item object, Item, contains a unique id, a name, a description, both the generic item type and the actual item type, a system variable (which is used to identify the item in the originating control system), a value, a list of groups and the floor.
• A group object contains an id, a name, a description and a reference to a parent group. Groups are made as trees of groups and items. Groups are independent of items, floors and houses, hence they could group items across houses and floors.
• A floor object contains an id, a name, a description, a list of items and a reference to a house object.
• A house object contains an id, a name, a description and a list of floors.
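To make the entity descriptions above more concrete, the following is a minimal C# sketch of the item-type entities; the property names follow the description, but the actual HSI interfaces may differ in detail (ItemTypeCustom is omitted, since it is not used for the EMBRACE house).

```csharp
using System.Collections.Generic;

// Illustrative sketch of the HSI item-type entities described above.
// Property names follow the text; the real HSI interfaces may differ.
public interface IItemType
{
    string Name { get; set; }
    string Description { get; set; }
    bool IsInterval { get; }           // true for interval (numeric) types
    IList<string> States { get; }      // allowed states, empty for interval types
}

public class ItemTypeNumeric : IItemType
{
    public string Name { get; set; }
    public string Description { get; set; }
    public bool IsInterval { get { return true; } }
    public IList<string> States { get { return new List<string>(); } }
    public double MinValue { get; set; }    // e.g. -5 for an indoor temperature sensor
    public double MaxValue { get; set; }    // e.g. 40
}

public class ItemTypeBool : IItemType
{
    public string Name { get; set; }
    public string Description { get; set; }
    public bool IsInterval { get { return false; } }
    public string TrueState { get; set; }   // e.g. "Window Opened"
    public string FalseState { get; set; }  // e.g. "Window Closed"
    public IList<string> States { get { return new List<string> { TrueState, FalseState }; } }
}
```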


Figure 5.6.1: Excerpt of HSI common interfaces and entities

The most essential interfaces and entities from the HSI system used for the DW/BI solution can be seen in Figure 5.6.1.

5.6.1.2 Item types

The following item types are available from the integration service. The boolean item types are seen in Table 5.6.1 and the numeric item types in Table 5.6.2. There are no custom item types available for the EMBRACE house, so this type will not be supported in the DW/BI solution.

Name                  Hardware system  True label  False label
Alarm noise           IHC              Activated   Passive
Alarm PIR             IHC              Activated   Passive
Darkness sensor       IHC              Dark        Light
Diode                 IHC              On          Off
Door open sensor      IHC              Open        Closed
Light                 IHC              On          Off
On/off                PLC              On          Off
PIR                   IHC              Activated   Passive
Power plug            IHC              On          Off
Power sensor pulse    IHC              Activated   Passive
Pulse counter reset   IHC              On          Off
Push button           IHC              Activated   Passive
Window open sensor    IHC              Open        Closed

Table 5.6.1: Boolean item types

Name                          Hardware system  Min value  Max value
CO2                           PLC              0          100
Dimmable light                IHC              0          100
Floor temperature             Uponor           10         30
Humidity                      PLC              0          100
Indoor Temperature Sensor     IHC              -5         40
Lux sensor                    IHC              0          100
Outdoor Temperature Sensor    IHC              -40        40
Power sensor accumulated      IHC              0          9999
Pulse counter                 IHC              0          999999
Technical Temperature Sensor  IHC              -100       999
Water pressure                PLC              0          20
Water temperature             PLC              0          200
Window                        IHC              0          100

Table 5.6.2: Numeric item types

Note the "Pulse counter" and "Pulse counter reset" item types. These are the power meters, which count pulses for each 1/100 or 1/1000 kWh passed through the meter. They are used to measure the energy consumption and production of the house. To calculate the additive measure kWh and the non-additive measure kW from these power meters, the following equations are needed. For the normal power meters, which count one pulse per 1/100 kWh:

n/100 = x in kWh (5.6.1)
(n/100)/(s/3600) = x in kW

For the main power meter, which counts one pulse per 1/1000 kWh:

n/1000 = x in kWh (5.6.2)
(n/1000)/(s/3600) = x in kW


Here n is the number of pulses, s is the duration between readings in seconds, and x is the resulting measure. Note that to get the measures the counters need to be reset, which is done by sending a true followed by a false flag to the "Pulse counter reset" item.
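As a worked example of the two equations, a small helper could perform the conversion as sketched below; the class and method names are illustrative and not part of the HSI or ETL code.

```csharp
// Sketch of equations 5.6.1 and 5.6.2: converting a pulse count into energy
// (kWh) and average power (kW). The class and method names are illustrative.
public static class PulseConversion
{
    // pulsesPerKwh is 100 for the normal power meters and 1000 for the main meter.
    public static double ToKwh(long pulses, int pulsesPerKwh)
    {
        return (double)pulses / pulsesPerKwh;
    }

    public static double ToKw(long pulses, int pulsesPerKwh, double durationSeconds)
    {
        return ToKwh(pulses, pulsesPerKwh) / (durationSeconds / 3600.0);
    }
}

// Example: 50 pulses on a 1/100 meter over a 900 second reading interval
// gives 0.5 kWh and an average load of 2.0 kW.
```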

5.6.1.3 Values

The values inside the item objects contain the interesting metrics. These are the values which are read from the control systems. The value field is set on the response messages to UpdateRequest and ReadRequest, and the value is also present in UpdateEvent messages. The response messages are sent out in the P2P network when other system components request updates or reads, and UpdateEvent messages are sent whenever a new value is available.

5.6.2 Data Collection

The data collection service collects the data from the control systems which are integrated in the HSI. This system does not exist as part of the HSI system, hence it should be built as part of the data warehouse solution. As indicated above, the HSI system sends response messages for the requests sent by other system components in the P2P network. By subscribing to these response and update event messages, it is possible to get a copy of them, which makes it possible to save the measurements in an operational 3NF model. By using a 3NF model the service can be optimised to handle each message as a transaction, and thereby handle messages one by one, ensuring that all data sent in the P2P network is collected. This system is essential, because there are no other data sources for these measurements after the competition.

5.6.3 Weather Data

Weather data can be provided by any web-service-based weather service. Two alternatives have been found: OpenWeatherMap.org and yr.no. Both provide XML-based weather data for the required locations, which are Kongens Lyngby, Denmark (where the house was built), Versailles, France (where the competition is held) and Nordborg, Denmark (where the house will stay after the competition). The yr.no service provides most of the weather information required for the insight reports, including weather description, precipitation, wind direction and speed, as well as current temperature and pressure. However, the cloud cover and humidity are not available, and the cloud cover is quite essential for comparing cloud cover with the energy produced by the PV panels. OpenWeatherMap.org provides all of the weather information required for the insight reports, including cloud cover and humidity, and it also provides expected high and low temperatures in addition to the current temperature. Another difference between the two services is that yr.no forecasts weather in periods of six hours, while OpenWeatherMap.org makes forecasts for every hour.


OpenWeatherMap.org should therefore be used for this DW/BI solution.

5.6.4 SDE Measurements

During the competition in Versailles the organisation will collect measurements from the houses. These measurements are:
• Indoor air quality in ppm.
• Temperatures for two rooms, dishwasher, washing machine, freezer, fridge and oven in degrees Celsius.
• Indoor humidity in percent.
• Power measurements including PV, consumption for appliances, lighting and home automation, consumption for home electronics and the total power taken from the grid. The total power taken from the grid includes the power produced, i.e. a negative value means that more energy was produced than was taken from the grid. The power measurements are in W.
The measurements are available through the official website for the competition. The website provides the raw data as JSON, which is usually consumed by JavaScript on websites. The data is separated into several files, one for each type of measurement, and the JSON contains data for all houses. The EMBRACE house is given a specific id, which needs to be used when retrieving the data. The general format of the JSON is an array with an element per timestamp, starting from July 30th, 2014 at midnight; the timestamp is the number of minutes since this date and time. For each element in the time array there is another array with an element for each house; the EMBRACE house is identified as number 16. For the house element a third array is defined which contains the individual measurements. If a measurement is not available it is filled with a null value.
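A minimal sketch of how such a file could be read, using Json.NET; the URL is omitted and the exact nesting of the arrays is an assumption based on the description above, so it would need to be verified against the actual files.

```csharp
using System;
using Newtonsoft.Json.Linq;   // Json.NET, assumed available

// Sketch of reading one SDE measurement file. The assumed layout is an outer
// array over time (minutes since 2014-07-30 00:00), each element holding the
// timestamp and an array of houses, each house holding an array of
// measurements (null when a value is missing).
public static class SdeMeasurementReader
{
    private static readonly DateTime Epoch = new DateTime(2014, 7, 30, 0, 0, 0);
    private const int EmbraceHouseIndex = 16;   // the EMBRACE house is number 16

    public static void PrintEmbraceMeasurements(string json)
    {
        JArray timeline = JArray.Parse(json);
        foreach (JToken entry in timeline)
        {
            int minutes = entry[0].Value<int>();          // assumed: first element is the timestamp
            DateTime timestamp = Epoch.AddMinutes(minutes);
            JArray house = (JArray)entry[1][EmbraceHouseIndex];

            for (int i = 0; i < house.Count; i++)
            {
                double? value = house[i].Type == JTokenType.Null
                    ? (double?)null
                    : house[i].Value<double>();
                Console.WriteLine("{0:u}  measurement[{1}] = {2}",
                    timestamp, i, value.HasValue ? value.Value.ToString() : "missing");
            }
        }
    }
}
```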

5.7 Requirements specification

The above analysis gives rise to the following summary of the functional and non-functional requirements for the DW/BI solution.

5.7.1 Functional requirements

The DW/BI solution should be made using an OLAP system based on a star schema dimensional model. The functional requirements for the DW/BI solution are given below:
1. Gain knowledge about the functioning of the house
(a) Gain knowledge about energy consumption and the context in which it is consumed
(b) Gain knowledge about energy production and the context in which it is produced
(c) Gain knowledge about how the devices in the house are used and how long they are in different states


(d) Gain knowledge about the temperature, humidity and other indoor comfort conditions which might affect the usage of the house and the context for these
2. See how elements relate
(a) Seeing differences in curves in a graph for elements in a house
(b) Seeing similarities in curves in a graph for elements in a house
(c) Ability to compare graphs to further analyse relationships
3. See how elements of the house and outside conditions relate
(a) Ability to see outside conditions like temperature and cloud cover
(b) Ability to compare graphs from outside conditions with indoor comfort conditions
4. Perceive the numbers as tangible elements
(a) Ability to easily add graphs to be generated in reports, instead of showing raw numbers
(b) Ability to summarise as many numbers as possible, so no additional calculations are needed to get the full picture
(c) Ability to change formatting of numbers so they seem familiar to the occupants
(d) Ensure understandability of the data model, e.g. by using a tangible object like a cube
5. Drilling through data to analyse it
(a) Ability to summarise data from a hierarchy level above the current summarised view (Roll up)
(b) Ability to summarise data from a hierarchy level below the current summarised view (Drill down)
(c) Ability to summarise data from different measurements (Drill across)
(d) Ability to drill all the way down to the raw data (Drill through)
6. Slice and dice data cubes
(a) Ability to specify the view for specific facts
(b) Ability to specify the view for specific dimensions to see the facts
(c) Ability to give numbers context by means of dimensions
7. Get relevant and valid data
(a) Ability to calculate with known metrics
(b) Having valid data

5.7.1.1 Measurement types

The measurement types that should be available for insight reports are given below:
1. House measurement types
(a) Energy consumption divided by the different types of consumption in the EMBRACE house. This includes the energy consumed by the following appliances: Dishwasher, Dryer, Fridge, Lights, Oven, Stove and Washing machine. It also includes the energy consumed by the following devices: Heat pump, Nilan system, Control systems (IHC and PLC) and other pumps.
(b) Total energy consumption
(c) Energy production from the PV panels
(d) Dimmable light levels and the duration for each level


(e) Indoor and floor temperature measurements from different places in the house
(f) Humidity measurements
(g) Light switch states and the duration for each state
(h) Water flow/pressure for the solar panels
(i) Water temperature and general temperature for the solar panels and storage tanks
(j) Door and window open states and the duration for these states
(k) Motion detected by Passive Infrared (PIR) sensors
(l) Power outlet states and the duration for these states
(m) Brightness or daylight sensor
(n) Maintenance of the solar panels
2. Weather measurement types
(a) Weather description
(b) Cloud cover
(c) Humidity
(d) Pressure
(e) Amount of precipitation, e.g. rain or snow
(f) Temperature, expected actual, high and low temperatures
(g) Wind speed and direction
3. Competition measurement types
(a) Indoor air quality
(b) Temperatures in rooms, dishwasher, washing machine, freezer, fridge and oven
(c) Indoor humidity
(d) Power measurements including PV, consumption for different elements of the house and the total power taken from the grid

5.7.2 Non-functional requirements

The non-functional requirements for the DW/BI solution are given below:
1. The response times and refresh times of queries made to the DW should not be longer than one second.
2. The time used on the first loading and processing of the DW should not be longer than one hour, depending on the amount of data in the sources.
3. Subsequent ETL operations should take less than 15 minutes.
4. The processing time per row should not exceed 100 ms.
5. A fall-back data warehouse should be used during maintenance.
6. The system must be extendible to be able to accommodate new data sources.
7. The interface and data presented to the end-users must be intuitive and easy to use.
8. The data should be secured in the cloud and only available to authenticated users.
9. The integrity and validity of the data must be adequate to give a complete picture.
10. All discarded data should still be logged in order to allow auditing at a later stage.


11. The data must be consistent.
12. The storage capacity for the DW should be extendible. The initial storage capacity for the data warehouse should not exceed 250 GB.
13. The system should initially handle 6250 queries per hour on average.
14. The availability should be 95%, meaning 365.242 days · (1 − 0.95) = 18.3 days of unavailability per year.
15. Recoverability of the data warehouse should be established.
16. Compatibility is limited to a Windows environment with the .NET Framework.
17. Communication with other components of the cloud is done through Kademlia P2P and REST web services.

Chapter 6 Design

"There are two ways of constructing a software design: One way is to make it so simple that there are obviously no deficiencies, and the other way is to make it so complicated that there are no obvious deficiencies. The first method is far more difficult." —Charles Antony Richard Hoare, 1980

Contents
6.1 Architecture 71
6.2 Data Collection 74
6.3 Data Warehouse 81
6.4 ETL 89
6.5 BI Application 93

This chapter provides the overall design and architecture of the data warehouse solution.

6.1 Architecture

Based on the requirements defined in the previous chapter, the overall project must be split into at least three service applications. One service application collects the data which is sent in the P2P network and is thereby an operational source system in the DW/BI solution context; this application will be called Data Collection. Another service application handles the data from the different operational source systems and does the overall ETL processing; this application will be called ETL. And finally a BI service application exposes the data from the data warehouse in cubes and enables querying of these; this application will be called BI. These three applications should work as standalone applications, hence no tight coupling should exist between them. The Data Collection and BI applications will have to be connected to the P2P network to be able to communicate with the other systems in the network, notably the Hardware System Integration from the home connector system and the App Service, which handles communication outside the P2P network and thereby e.g. with the tablet application. The overall architecture of the DW/BI solution can be seen in Figure 6.1.1.


Figure 6.1.1: Architecture of the DW/BI solution

The figure shows the operational source systems: Data Collection, Hardware System Integration, Weather Service and SDE Measurements. It also shows the DW/BI solution, which consists of the ETL, DataWarehouse and BI components, and lastly the components which consume the BI service, i.e. the Analysis Tools and the Application Service (App Service). The Data Collection component requires the HSI, which it communicates with through P2P (denoted by DCHSI and HSIP2P in the figure). The Data Collection service subscribes to messages sent from the HSI. A helper service in the Data Collection component also sends requests, i.e. publishes messages to the HSI, which is also done through P2P. Note that the port is only one-way; this is because the Data Collection component knows of the existence of the HSI component, but the HSI component does not know of the existence of the Data Collection component. The operational source systems are split into two types: one which is accessible through SQL (denoted by the SQL ending of DCSQL and HSISQL in the figure) and another which is accessible through REST web services (denoted by the REST ending of WSREST and SDEREST). The Data Collection and HSI components are accessible through SQL; the Weather Service and SDE Measurements are accessible through REST web services. Both types are consumed by the ETL component (denoted by the ports ETLDC, ETLHSI, ETLWS and ETLSDE), which makes it possible for the ETL to process data from the four operational source systems. Note that the four operational source systems only expose interfaces, hence these systems do not know of the existence of the ETL component, but the ETL component knows of the existence of these four systems.

The ETL processing results in data which needs to be put into a data warehouse. The Data Warehouse component exposes a SQL port (denoted DWSQL) which is used by the ETL component to load data into the data warehouse. In total the Data Warehouse component exposes two ports, a SQL port and an OLAP port, and both can be used to query the data warehouse. The SQL port is connected to a RDBMS system, while the OLAP port is connected to an OLAP system in the Data Warehouse component. Note that the DWSQL port exposes SQL from the RDBMS system, which does not know of the existence of the ETL component; the ETL component, however, knows that it should use this port to load data into the Data Warehouse. The same principle applies to the DWOLAP port, which exposes the OLAP part of the Data Warehouse component and does not know of the existence of its consumers.

The BI component handles the communication between the OLAP port of the Data Warehouse and the applications, i.e. the Analysis Tools and the Application Service. The component acts both as a relay or proxy for the Analysis Tools and as a translation component for the Application Service. The proxy port, denoted BIAT, is consumed by the Analysis Tools through the port ATBI. The translation port, denoted BIP2P, is consumed by the Application Service through the port ASBI. Note that the BIP2P port subscribes to P2P messages, which can be published by any system in the P2P network. Both the BIP2P and BIAT ports are one-way, since the BI component does not know of the existence of any of its consumers. The Application Service uses the BI component through the ASBI port, which it accesses through P2P. The Application Service is a relay or proxy component which makes it possible for systems outside the P2P network to use the systems inside the P2P network; it exposes a REST web service through the ASREST port. This component also uses other subsystems, e.g. the HSI component, but this is not of concern to the DW/BI solution and hence not shown in the figure. The Analysis Tools component represents analysis tools which can consume the data from an OLAP port, e.g. the relayed port BIAT through the ATBI port. Note that these tools can also consume data directly from the OLAP port of the Data Warehouse component, but this usually requires that the Analysis Tools are used on the same network as the one where the Data Warehouse is deployed. For this DW/BI solution it is most likely not the case that the analysis tools are used on the same network, hence the BI component needs to act as a relay or proxy.

It should be noted that the components of the DW/BI solution should work independently of each other. E.g. if the Data Collection component is not running, it should still be possible for the rest of the solution to work. The same applies for the ETL component: if it is not running, the Data Warehouse should still work, and the Data Collection component should still collect data. If the BI component is not running, it should still be possible for the ETL component to load data into the Data Warehouse component. However, if the Data Warehouse component is not running, it is acceptable that the BI component and ETL component are not able to work properly, as long as this is handled gracefully. The largest component of the DW/BI solution is the ETL component, which needs to handle multiple operational source systems, transform the data and load it into the Data Warehouse.


6.2 Data Collection

The data collection service application collects data which is sent in the P2P network, notably the set points and values sent by the HSI service as responses to requests sent by other service applications in the network. This means that the data collection service needs to subscribe to the same response messages as the other service applications in the network. However, the data collection service should not send requests into the network unless specific values must be saved which are not used by any other services in the network. By focusing on collecting data and not sending requests it is possible to make the service application as robust and reliable as possible, so that (nearly) all messages in the network are received/intercepted and saved. Not all messages can be received and saved, because of the service architecture used in the network, where it is only guaranteed that messages are received as long as the subscription information is in the network; this information might be lost if nodes/services in the network stop unexpectedly. The messages received should just be saved, without any filtering, if possible. Because the HSI system sends values of different types for the three different discriminators, the data collection service should handle these by means of type casting and conversion, and it should save the values in three different tables, one for each discriminator. The three discriminators are boolean, numeric and custom. The boolean-typed item values are always of type boolean; these boolean values have a certain state defined in the HSI system, one for true and one for false. The numeric-typed item values can be of any numeric type, e.g. int, long or double; these types should be converted to an appropriate container type which can hold the values. The last discriminator type is custom, which can use any number of states for its value. The custom type is not used for the EMBRACE house, and hence this type should not be considered, for now, by the data collection system. Figure 6.2.1 shows the components of the Data Collection component. The Data Collection component consists of three subcomponents: two services and a database. The two services communicate with the HSI component through the DCHSI port. Both services should use a separate P2P broker, so that both services can operate as robustly and reliably as possible. The Service component is the primary component of the Data Collection component; it handles all received messages and saves them in the Database component. The Helper Service component is only used to make requests to the HSI system through P2P. The Database component contains all the collected data and exposes it through SQL, which can be consumed by any database reader.

6.2.1 Data model

The data model for the Data Collection component is kept as simple as possible. Figure 6.2.2 shows the data model for the entities used by the data collection service. The data model consists of an interface (IRecord) which defines the fields required of any type of Record. A Record is an entity for a collected message from the P2P network. The response messages should always contain an item id, a sent field containing a timestamp and a value. The interface defines that a Record should have an Id, an ItemId, a Sent and a Type field. The Type field is a RecordType enumeration, which indicates the type of response message, i.e. a ReadResponse, UpdateEvent or UpdateResponse. The UpdateRequest type should be added if such messages need to be collected as well; this is however not done for the EMBRACE house.


Figure 6.2.1: Subcomponents of the Data Collection component

For numeric records a LongValue and a DoubleValue field are used to hold the value received in the response message; these are of type nullable long and double, since it is not guaranteed that any numeric data is present in the received value. For bool records a BoolValue of type bool is recorded; this is not nullable, since a boolean value will always be either true or false. Received messages which do not fit the criteria of being either bool or numeric, or in which certain fields are missing, should be collected as error records. This entity has an XmlValue, which is a string holding an XML-serialised value; this makes it possible to identify the correct type when these records are reviewed. Another important field is the Message field, which is used to indicate what is wrong with the received message. All the other fields are nullable, since they might not have been set in the received message. The difference between the Sent and Received fields is that Sent is the timestamp set in the received message, while Received is the timestamp at which the Data Collection service received the message. The numeric and bool record types reflect the usage from the HSI system; the last record type, error, is used to collect and keep a record of incorrectly received messages and values. The two value-carrying record types should reflect the actual type of the value in the message, not the discriminator type of the HSI system, since the HSI system makes no distinction on the actual types of the data; inside the HSI system the values are stored and handled as objects, which is problematic when values need to be recorded, e.g. in a database. Hence the record type should reflect the actual type of the values, so that no additional information, e.g. the full type name, is required to be stored.


As long as the data is stored in a container type which can hold the same amount of information as the actual values, these types can be used for the record keeping.

Figure 6.2.2: Data model for Data Collection
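A minimal code-first sketch of the record entities described above; the names follow the description, while the nullability and any additional columns in the real model may differ.

```csharp
using System;

// Sketch of the record entities described above, e.g. as Entity Framework
// code-first classes. Names follow the description; the real model may
// contain additional columns.
public enum RecordType { ReadResponse, UpdateResponse, UpdateEvent }

public interface IRecord
{
    int Id { get; set; }
    int ItemId { get; set; }
    DateTime Sent { get; set; }       // timestamp set in the received message
    RecordType Type { get; set; }
}

public class NumericRecord : IRecord
{
    public int Id { get; set; }
    public int ItemId { get; set; }
    public DateTime Sent { get; set; }
    public DateTime Received { get; set; }   // when the Data Collection service got it
    public RecordType Type { get; set; }
    public long? LongValue { get; set; }     // nullable: a numeric payload is not guaranteed
    public double? DoubleValue { get; set; }
}

public class BoolRecord : IRecord
{
    public int Id { get; set; }
    public int ItemId { get; set; }
    public DateTime Sent { get; set; }
    public DateTime Received { get; set; }
    public RecordType Type { get; set; }
    public bool BoolValue { get; set; }      // not nullable: always true or false
}

public class ErrorRecord
{
    public int Id { get; set; }
    public int? ItemId { get; set; }         // nullable: may be missing in a bad message
    public DateTime? Sent { get; set; }
    public DateTime Received { get; set; }
    public string Message { get; set; }      // what is wrong with the received message
    public string XmlValue { get; set; }     // XML-serialised value kept for later review
}
```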

6.2.2 Data Collection application layers

The service application consists of three layers, which form an n-tier application structure. The bottom layer handles the data access to the database. The middle layer handles all business logic between the bottom and top layers. The top layer is the 'presentation' layer, which consists of the service that communicates with the other services in the network. Since this application is a service, it has no actual presentation layer apart from what the other services in the network can communicate with. The classes of the different layers (packages) are seen in Figure 6.2.3.

6.2.2.1 Service Layer

The service layer communicates with the P2P network through a broker node. Both services use a separate P2P broker. The primary service subscribes to the three message types which include a reading of the current value of an HSI item (hardware device). The three messages are UpdateResponse, ReadResponse and UpdateEvent. The UpdateResponse message is a response from the HSI system, sent to the network when a request to update the set point of a hardware device is received by the HSI system. The ReadResponse message is a response from the HSI system, sent to the network when a request to read the current value of a hardware device is received by the HSI system.


Figure 6.2.3: Data Collection packages

The UpdateEvent message is a notification sent to the network because a hardware device updated its current value or set point. This event notification can be sent multiple times for the same event, so some filtering must be done here; the filtering could e.g. be done by checking whether the same message was received within a given time period. The helper service sends requests (UpdateRequest and ReadRequest) out into the network for HSI items which are not used/read by other systems in the network. This is e.g. necessary for making readings of the pulse counters mentioned in the Analysis chapter.
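One way the filtering of repeated UpdateEvent messages could be done is sketched below; the time window, the key choice and the names are assumptions for illustration, not the actual Data Collection implementation.

```csharp
using System;
using System.Collections.Generic;

// Sketch of the suggested duplicate filtering: an UpdateEvent is skipped if an
// identical value for the same item was already seen within a time window.
// The window length and the key choice are assumptions.
public class UpdateEventFilter
{
    private readonly TimeSpan window;
    private readonly Dictionary<string, DateTime> lastSeen = new Dictionary<string, DateTime>();

    public UpdateEventFilter(TimeSpan window)
    {
        this.window = window;
    }

    public bool ShouldSave(int itemId, object value, DateTime sent)
    {
        string key = itemId + "|" + value;    // item and value together identify the event
        DateTime previous;
        if (lastSeen.TryGetValue(key, out previous) && sent - previous < window)
        {
            return false;                     // duplicate within the window: skip it
        }
        lastSeen[key] = sent;
        return true;
    }
}

// Usage in the primary service, assuming a 5 second window:
//   var filter = new UpdateEventFilter(TimeSpan.FromSeconds(5));
//   if (filter.ShouldSave(item.Id, item.Value, message.Sent)) { /* AddRecord(...) */ }
```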

6.2.2.2 Business Logic Layer

The business logic layer contains the managers which handle the interaction between the service layer and the data access layer. These managers should handle the extraction of item-related information from the messages received by the primary service, and they should also handle the conversion and filtering of data. The SaveEventInDb methods handle the callbacks from the P2P broker.

These methods call AddRecord, which handles the extraction of item-related information, the conversion of the value as well as the filtering of messages. If e.g. the item or the value in the received message cannot be properly inserted into a Record, the SendToCatchAll method is used to retrieve as much information from the message as possible and save it.

6.2.2.3 Data Access Layer

The data access layer and package consists of the database access layer, which handles all transactions with the database. The package also includes the three types of entities available to the rest of the application. These entities are called records to reflect the essence of logging and keeping records of the operation of the services for the EMBRACE house. The DAL contains a Context which handles all the data access to the database.

6.2.3 Helper service

The Helper service is used to retrieve messages which are not sent in the P2P network during normal operation; this is e.g. done so that the primary service receives messages from the pulse counters. The Helper service is a timer, which at periodic intervals publishes request messages in the P2P network. Figures 6.2.4 and 6.2.5 illustrate the sequence of this. The lifelines from left to right are: DataCollectionService, the primary service; DataCollectionHelperService, the helper service; P2P Broker DCS, the P2P broker for the primary service; P2P Broker DCS Helper, the P2P broker for the helper service; P2P Broker HSI, the P2P broker for the HSI system; and HardwareSystemIntegration, the HSI system. First both the primary service and the HSI system subscribe to their individual messages. When the timer ticks in the helper service, a ReadRequest with an IItem is published to the network, which is received by the HSI. The HSI reads the value from the integration and publishes a ReadResponse with the IItem. This response is received both by the primary service and by the helper service, since a synchronous request was sent by the latter. The primary service saves the event using SaveEventInDb. Depending on the value from the response, the helper service will publish an UpdateRequest with an IItem, which is received by the HSI system, which publishes an UpdateResponse. This UpdateResponse is also received by both the primary service and the helper service. A second UpdateRequest might need to be published by the helper service; this too causes the HSI to publish an UpdateResponse with an IItem, which is again received by both the primary service and the helper service. This pattern is used e.g. for the pulse counters, where, after reading a value above zero, the pulse counter needs to be reset. The reset is done by setting a true boolean value on the reset item, followed by setting a false boolean value on the reset item; hence two update requests are sent to the HSI by the helper service. The sequence diagrams also illustrate how the primary service receives response messages when other systems in the network publish requests to the HSI system, since the helper service could easily be another system in the network. (It actually acts as a separate system in the network, since it has its own P2P broker.)
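A sketch of one timer tick in the helper service for a pulse counter, following the sequence described above; the broker interface and the method names are illustrative stand-ins for the actual P2P API.

```csharp
// Illustrative timer callback: read the pulse counter and, if pulses were
// counted, reset it by setting the reset item to true and then false.
// IBroker and its methods are stand-ins for the real P2P broker API.
public class PulseCounterPoller
{
    private readonly IBroker broker;
    private readonly int pulseCounterItemId;
    private readonly int pulseCounterResetItemId;

    public PulseCounterPoller(IBroker broker, int counterItemId, int resetItemId)
    {
        this.broker = broker;
        this.pulseCounterItemId = counterItemId;
        this.pulseCounterResetItemId = resetItemId;
    }

    public void OnTimerTick()
    {
        // Publish a ReadRequest; the ReadResponse is also received (and saved)
        // by the primary Data Collection service.
        long pulses = broker.PublishReadRequest(pulseCounterItemId);

        if (pulses > 0)
        {
            // Reset the counter: true followed by false on the reset item,
            // i.e. two UpdateRequests, each answered by an UpdateResponse.
            broker.PublishUpdateRequest(pulseCounterResetItemId, true);
            broker.PublishUpdateRequest(pulseCounterResetItemId, false);
        }
    }
}

public interface IBroker
{
    long PublishReadRequest(int itemId);                  // synchronous read, returns the value
    void PublishUpdateRequest(int itemId, object value);  // set a set point
}
```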

Figure 6.2.4: Sequence for the Data Collection Helper service

Figure 6.2.5: Optional sequence for the Data Collection Helper service


6.3 Data Warehouse

e data warehouse is oen referred to as the presentation area of the data, which is generated from the ETL system. is is the data seen/used by the end-consumers, which can either be end-users or end-user analysis tools. Figure 6.3.1 shows the two main components of the Data Warehouse component. e Data

Figure 6.3.1: Subcomponents of the Data Warehouse component

The Data Warehouse consists of a database in a RDBMS system and a cube in an OLAP system. The database is a regular relational database with tables. The OLAP system consumes the relational database and exposes an OLAP cube on top of this data. This means that end-users can use either OLAP-based or SQL-based analysis tools. The database exposes a SQL port (denoted DWSQL) which is used by the OLAP system, by the ETL system, which populates the data, and by end-users using a SQL-based analysis tool. The OLAP system exposes a proprietary OLAP port (denoted DWOLAP), hence this port can usually only be used by OLAP tools provided by the OLAP system vendor.

6.3.1 Design process

Using the dimensional modelling design process, it is possible to identify the dimensions and facts of the data model, which the data warehouse consists of.

6.3.1.1 Business process

First the business processes of the operational source systems are identified. The four operational source systems handle different operational business processes. The HSI handles integration with the different control systems in the EMBRACE house. This integration has one major process, which is to provide a transparent layer above the different control systems, in order to control them through a single interface. The Data Collection handles the collection of data which is sent in the P2P network from the HSI system; this is the main process of this system.

The main process of the Weather Service is to report forecasts for different locations. The main process of the SDE Measurements is to measure different parameters during the competition in Versailles.

6.3.1.2 Grain

With the processes identified it is possible to define the grain for each type of process. The grain for the HSI system is one row per item, per house, per floor and per group. These grains refer to descriptive attributes, since no numeric metrics are measurable for the identified process. The grain for the Data Collection is one row per response message. The response messages can come from a read of a value by the HSI, an update of a set point in the HSI, or an update event of a changed value in the HSI. In the data collection each response has one record transaction row. This means that a one-to-one relationship can be established by using a one-row-per-response-message grain, which makes it possible to drill all the way through to the most atomic data. The grain for the Weather service is one row per weather forecast per location; the forecasts are made every hour for every location. The grain for the SDE Measurements is one row per measurement. Most measurements in the SDE Measurements are made every five minutes. The measurements are not accumulated, which means that the most atomic data from this operational source system is one row for (nearly) every five minutes.

6.3.1.3 Dimensions

With the grains defined for the different systems and processes, the dimensions, which preferably should be conformed, can be identified. The first two dimensions relate to the fact that all the operational source systems have a date and time defined for their metrics; hence date and time dimensions should be used. Since these dimensions can be used for all the operational source systems, they are conformed dimensions. Two dimensions are made instead of a single date-and-time dimension because of performance. The HSI system can provide a dimension for the devices it integrates, which are called items. Because of the relation between items and the house, floor, item type and hardware system entities, these can be consolidated into the same Device dimension. For group entities, which have a many-to-many relation with items, a secondary dimension should be made, called Grouping; this dimension should be made together with a factless fact table which can be used as a bridge table. Another dimension should be made for the state entities from the HSI system; this dimension should be called SwitchButtonDimension, since all of the states come from switch states, i.e. boolean item types. These three dimensions can be used for the measurements made by the Data Collection system. Lastly a junk dimension needs to be made for the different attributes from the Weather service. This dimension is a junk dimension since none of the attributes (columns) have a functional relationship to the primary key. The dimension consists of a number of different low-cardinality attributes, i.e. enumerations, and it can only be used with the measures from the Weather service. The identified dimensions can be seen in Figure 6.3.2. The DateDimension contains a number of attributes which describe the date, e.g. the quarter (CalendarQuarter), the Day, the Month and the Year. Note that all the attributes have a Name variation, which is the label shown to the end-users, while the ordinary attribute defines the value key of the attribute. The DateKey is the primary key, and the DateAlternateKey is a DateTime object which holds the date and can be used for lookups during ETL processing (this attribute should not be used by end-users).


Figure 6.3.2: Class diagrams for dimensions

The DateKey has a special format, YYYYMMDD, and is an integer, hence this key can also be derived from timestamps. The TimeOfDayDimension contains a number of attributes which describe the time of day. This includes the Hour, Minute and Seconds, as well as more analytical attributes like HalfHour and QuarterHourOfDay. Again these attributes have Name variations which are used by the end-users. The TimeKey is the primary key, and the TimeAlternateKey is a TimeSpan object which contains the time component of a timestamp. The TimeKey has a special format, DHHMMSS, where D is a dummy digit, e.g. 9, and the rest is the time; hence this key can also be derived from timestamps. The DeviceDimension contains the Description and Name of the HSI item, house, floor and item type. It also contains the TypeDiscriminator and the hardware system (System) from the item type. The DeviceKey is the primary key, and the DeviceAlternateKey is the key used for the item in the HSI system. The alternate keys for house, floor and type are also contained in the DeviceDimension. The GroupingDimension only contains a Name and a Description, together with the primary key (GroupingKey) and the alternate key from the HSI system.


The SwitchButtonDimension holds the Label and the LabelFor, which indicates which boolean value the Label is for. It also contains the alternate key together with the SwitchButtonKey, which is the primary key. The WeatherDimension contains a number of different descriptive attributes for a weather forecast. This includes the CloudCoverName, for the cloud condition; LocationCityName, for the city; WeatherName, for the overall description of the weather; WindDirectionCode and WindDirectionName, which indicate the wind direction; and WindSpeedName, which is a description of the wind speed. The WeatherKey is the primary key. The WeatherAlternateKey is bound to the WeatherName, not to the WeatherDimension as a whole. The contents of this dimension table are combinations of the different enumeration values for the different attributes, and only the combinations which actually occur in the weather forecasts. This means that a much smaller dimension table can be created, and it also means that when new forecast combinations occur, these should be loaded together with the fact loading. All the primary keys, except for the DateDimension and TimeOfDayDimension, use surrogate keys, so the primary keys have no relation to the keys in the operational source systems. The keys from the operational source systems are included as alternate keys for ETL processing only, and should never be used by the end-users.
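Because the DateKey and TimeKey have fixed formats (YYYYMMDD and DHHMMSS), the ETL can derive them directly from a fact's timestamp instead of looking them up; a minimal sketch, with illustrative method names:

```csharp
using System;

// Derives the smart keys of the DateDimension (YYYYMMDD) and the
// TimeOfDayDimension (DHHMMSS) from a timestamp, as described above.
// The dummy digit 9 for the TimeKey follows the example in the text.
public static class DimensionKeys
{
    public static int ToDateKey(DateTime timestamp)
    {
        return timestamp.Year * 10000 + timestamp.Month * 100 + timestamp.Day;
    }

    public static int ToTimeKey(DateTime timestamp)
    {
        return 9000000 + timestamp.Hour * 10000 + timestamp.Minute * 100 + timestamp.Second;
    }
}

// Example: 2014-07-30 14:05:09 gives DateKey 20140730 and TimeKey 9140509.
```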

6.3.1.4 Facts

With the dimensions identified it is now possible to identify the individual facts. The factless fact table (DeviceGroupFact) for the groups from the HSI system, which is used as a bridge table, is seen in Figure 6.3.3.

Figure 6.3.3: Class diagram for DeviceGroupFact

The fact table has no attributes other than an Id and two foreign keys, which bridge a GroupingDimension row with a DeviceDimension row and vice versa. This is why it is referred to as a factless fact table: there are no measures in it. Each row of the fact table is a relation between one row in each dimension, so if e.g. a device dimension row has multiple relations with grouping dimension rows, each relation will be a row in the fact table. Using this fact table makes it possible to select a set of devices based on a grouping, and to select a set of groupings based on a device, i.e. it models a many-to-many relation. For the measurements from the Data Collection service, two major types of facts are identified: the numeric and the boolean measurements.

For the numeric measurements, the measurement from a NumericRecord row is put into the fact table for numeric facts. These facts are usually not fully additive, so a second measure can be calculated from this measurement: the delta, or change of the measurement since the last measurement. For the bool measurements, the measurement from a BoolRecord row is put into the fact table for this measurement, and a Duration measure can be calculated based on this measure. The duration is the number of seconds since the last measurement with the same value. When the value has changed, i.e. from true to false or false to true, a second row is needed in the fact table with the duration of the previous value, plus a new row for the current measurement, whose duration is zero seconds, since the change has just occurred. A third type is also identified, for numeric measurements of discrete values. Because the value is discrete it is possible to calculate a duration like for the boolean facts. This fact table contains the measurements like the normal numeric fact table, i.e. measurement and measurement delta, and it includes a duration measure, which is the number of seconds the measurement value has been the same. A second row is also needed for this type when a change occurs: one with the previous value and its duration, and another with the changed measurement value and a duration of zero seconds. The three types of fact tables can be seen in Figure 6.3.4.

Figure 6.3.4: Class diagram for records

In addition to the measures and an Id, the fact tables have a foreign key for the primary key of the DeviceDimension (DeviceKey), the TimeOfDayDimension (TimeKey) and the DateDimension (DateKey). For bool fact tables a foreign key for the SwitchButtonDimension (SwitchButtonKey) is also included. It should be noted in the figure that there is a one-to-many relationship between the fact tables and the dimension tables. This is the essence of dimensional modelling, where it is possible to specify a dimension table row and get a set of fact table rows out of a single query. All the facts from the Data Collection service could be modelled using just these three fact tables; this would however require a new dimension which can differentiate the different types of measurements made by the Data Collection service, e.g. temperature measurements and energy measurements.

If these cannot be differentiated, e.g. through a new dimension, the summarised data would not make much sense. The reason why this is possible at all is that the measurements have the same grain. However, since the measurements have significantly different meanings, they should be split into several fact tables, one for each type of measurement. This also increases usability, since a 'type' dimension is not required at the beginning of every analysis. Figure 6.3.5 shows the identified bool fact tables; they correspond to the different ItemTypeBool entities from the HSI system. Figure 6.3.6 shows the identified numeric fact tables; they correspond to the subset of ItemTypeNumeric entities from the HSI system which have numeric, non-discrete values.

Figure 6.3.5: Bool records

Figure 6.3.7 shows the identified numeric fact tables with duration; they correspond to the subset of ItemTypeNumeric entities from the HSI system which have numeric, discrete values.

Figure 6.3.6: Numeric records

The attributes of the bool, numeric and numeric-with-duration facts correspond to the attributes defined above for the different types of fact tables. For the measurements from the weather service, the forecasts give the following measures: a percentage of cloud cover, a relative humidity, a pressure, a rain precipitation, a snow precipitation, a maximum, minimum and current temperature, a wind direction, and a wind speed. All of these measures are non-additive; however a delta, which is a semi-additive measure, can be calculated for each of them except the wind direction, since that would hardly add value to any analysis. The Weather fact table is seen in Figure 6.3.8. The weather fact table has foreign keys to the TimeOfDayDimension (TimeKey), the DateDimension (DateKey) and the WeatherDimension (WeatherKey).


Figure 6.3.7: Numeric records with duration
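A sketch of how the duration measure and the extra row on a state change could be derived for the bool facts described above; the field set is simplified and the logic is an illustration under these assumptions, not the thesis' ETL code.

```csharp
using System;
using System.Collections.Generic;

// Sketch of the two-row pattern for bool facts described above: while the
// state is unchanged the duration accumulates; on a change the previous state
// is closed with its duration and the new state starts at zero seconds.
// The fact class is simplified (no DateKey/TimeKey/SwitchButtonKey here).
public class BoolFactRow
{
    public int DeviceKey;
    public DateTime Timestamp;
    public bool Measurement;
    public double DurationSeconds;
}

public static class BoolFactBuilder
{
    // readings must be ordered by timestamp for a single device.
    public static IEnumerable<BoolFactRow> Build(int deviceKey, IEnumerable<Tuple<DateTime, bool>> readings)
    {
        DateTime? lastTime = null;
        bool lastValue = false;

        foreach (var reading in readings)
        {
            if (lastTime.HasValue)
            {
                double seconds = (reading.Item1 - lastTime.Value).TotalSeconds;
                if (reading.Item2 != lastValue)
                {
                    // State changed: one row closing the previous state...
                    yield return new BoolFactRow { DeviceKey = deviceKey, Timestamp = reading.Item1, Measurement = lastValue, DurationSeconds = seconds };
                    // ...and one row for the new state with a duration of zero.
                    yield return new BoolFactRow { DeviceKey = deviceKey, Timestamp = reading.Item1, Measurement = reading.Item2, DurationSeconds = 0 };
                }
                else
                {
                    // Same state: duration since the previous reading.
                    yield return new BoolFactRow { DeviceKey = deviceKey, Timestamp = reading.Item1, Measurement = reading.Item2, DurationSeconds = seconds };
                }
            }
            lastTime = reading.Item1;
            lastValue = reading.Item2;
        }
    }
}
```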

For the measurements from the SDE Measurements, the following fact tables have been identified: SDEFridgeFreezerTemperatureFact, SDEPowerFact, SDEDishwasherWashingMachineTemperatureFact, SDEAirQualityFact, SDEOvenTemperatureFact, SDERoomTemperatureFact and SDEHumidityFact. The facts and measures correspond to the measures found in the analysis. Figure 6.3.9 shows an overview of these fact tables and also the two foreign keys for the two dimensions of each fact table: one for the DateDimension (DateKey) and another for the TimeOfDayDimension (TimeKey). The full specification of the attributes can be found in the implementation chapter.

6.3.2 Cube

The cube of the Data Warehouse should have a one-to-one relation with the fact tables and dimension tables defined above. Since the cube can provide additional calculations and since most of the facts defined above are non-additive, averages should be included for the facts. Another calculation which will help end-users is to add derived duration measures, e.g. minutes, hours and days; the seconds measure should also be kept. This can be done by using a view. In addition, the energy-related numeric fact tables, i.e. TotalEnergyConsumptionFact, EnergyConsumptionFact and EnergyProductionFact, only include the number of pulses and the duration, so a view should be used to derive the kW and kWh measures. Note that for the TotalEnergyConsumptionFact a conversion factor of 1/1000 is used instead of the 1/100 used for the other two facts. Another dimension which could be helpful to add in the cube is a Percent dimension for the discrete numeric values, as well as for the cloud cover.


Figure 6.3.8: Class diagram for WeatherFact

Figure 6.3.9: Classes for SDE measurement facts


6.4 ETL

The ETL component handles the extraction, transformation and loading of data. The subcomponents of the ETL component can be seen in Figure 6.4.1.

Figure 6.4.1: Subcomponents of the ETL component

The main component is the Service component, which handles all the processing of the data. The data comes from the four different source components (Data Collection Source, HSI Source, Weather Source and SDE Measurement Source). These components handle and consume the sources through the ports ETLDC and ETLHSI for database access through SQL, and through the ports ETLSDE and ETLWS for the REST web services. To support the continuing processing of the sources, a Metadata component is used, which holds metadata such as the latest extraction ids, performance metrics for the processing and the configuration of the processing. To configure and monitor the Service component a Web component is needed, which exposes a web site where configuration is done using the metadata component; the monitoring is also done through the metadata. As the data is processed by the Service component, it is loaded into the Data Warehouse using the ETLDW port, which is a SQL port.

6.4.1 Data processing

The flow of the processing of the data from the different operational source systems is illustrated in Figure 6.4.2. First all dimensions need to be processed, since foreign keys for these are needed by the facts. The dimensions are processed in the following order: DateDimension, TimeOfDayDimension, GroupingDimension, DeviceDimension, SwitchButtonDimension. Note that the WeatherDimension is not mentioned here; because of the nature of this dimension (it is a junk dimension), it needs to be processed together with the Weather facts, where the combinations of Weather dimension rows are generated and inserted if necessary.


Figure 6.4.2: Processing flow for the ETL

Once the TimeOfDayDimension has been processed, it does not need any further processing in future iterations of the processing. For the DateDimension a number of years should be loaded once, and when a year has passed, i.e. at the first processing on January 1st, a new year should be loaded. The processing of HSI-related dimensions uses slowly changing dimension type 1 for changes, since a change of these parameters does not necessarily mean that e.g. an item is now another item, and hence a new entity from this point on; the change is most likely made because the entity was e.g. not named properly from the beginning. Dimension table rows are never deleted, since foreign keys to older rows might exist in the fact tables; changes are hence only made for changed attributes and new rows. E.g. if an HSI item no longer exists in the HSI system, the dimension row should not be removed.

After the dimensions have been processed, the facts are processed, in the following order: DeviceGroupFact, WeatherFact, SDE facts, NumericRecord facts, BoolRecord facts. All the facts except DeviceGroupFact are loaded such that only new facts are inserted and the existing facts in these fact tables are kept in the Data Warehouse; this means more efficient loading of facts, since they are continuously loaded. This is possible since all the fact tables are transaction fact tables. For the DeviceGroupFact an iteration of all the items and states from the HSI system is done, and any changes are reflected in the fact table, e.g. if a relation is no longer in the HSI system, the relation is removed from the fact table. All processing of fact tables uses lookup attributes (alternate keys) on the dimension tables to get the surrogate keys, which are the primary keys of the dimensions. For date and time, the primary keys are derived from the timestamp of the facts, because these primary keys have a special structure. After all these have been processed, a processing of the cube is required, since new dimensions and new summarised data need to be included in the cube.
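To illustrate the special key structure mentioned above, the date and time surrogate keys can be derived directly from a fact timestamp. The sketch below assumes the DateKey format YYYYMMDD used for the DateDimension and an HHMMSS format for the TimeKey; the actual derivation lives in the Common::Extensions package of the ETL application.

using System;

public static class KeyDerivationSketch
{
    // 2014-08-15 -> 20140815
    public static int ToDateKey(DateTime timestamp)
    {
        return timestamp.Year * 10000 + timestamp.Month * 100 + timestamp.Day;
    }

    // 13:05:09 -> 130509 (assumed format)
    public static int ToTimeKey(DateTime timestamp)
    {
        return timestamp.Hour * 10000 + timestamp.Minute * 100 + timestamp.Second;
    }
}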

6.4.2 ETL application layers

The ETL application is split into three layers, which form a 3-tier architecture following the n-tier design pattern. The layers are the presentation layer, the business logic layer and the data access layer. The ETL application consists of the packages seen in Figure 6.4.3.

Figure 6.4.3: Package diagram for ETL

6.4.2.1 Presentation layer

The first layer is the presentation layer, which includes a service, a console and a web interface. These are seen in Figure 6.4.3. These interfaces are the end-points for the end-users. The service is special because it has no real interaction with end-users other than a start and stop method. The service has a timer and sends requests to process the data warehouse when an interval has elapsed (a tick). The console interface makes it possible for end-users to run individual steps of the overall processing, as well as the overall processing, if this is required. The individual steps of the processing are seen in Figure 6.4.2, where each column is a processing step. The web interface makes it possible for end-users to configure the ETL processing, i.e. the mapping between the NumericRecord and BoolRecord fact tables and HSI items, and the weather locations which need to be loaded. The web interface also gives an overview of the processing by the use of metadata from the ETL processing. These metadata include how long different processing steps take, how much is loaded and whether the ETL service is currently running. The web site uses the Model View Controller paradigm, which is also a type of n-tier architecture. The views are the web interface; these are built upon a view model, and the user interaction is done through a controller. In addition to these, a number of utilities are needed which can be shared across the controllers. These packages are also seen in Figure 6.4.3.

6.4.2.2 Business logic layer

The second layer is the business logic layer, which contains the manager classes that handle all the processing. These managers should have Populate and/or Process methods which can be called by the presentation layer. Separate manager classes should be used for the different types/steps of processing. To be able to get metadata and access to other data, a manager base class should be used, so all managers get the same access to the data and can load the data into the data warehouse. The manager base can be inherited by the other manager classes. Commonly used extensions, which include deriving the DateKey and TimeKey, are placed in a Common::Extensions package. The two packages are seen in Figure 6.4.3. A minimal sketch of this layering is given below.
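The sketch assumes two placeholder context classes standing in for the EF contexts of the actual solution; only the inheritance structure and the Populate naming convention are taken from the design above.

// Placeholder contexts; the real solution uses EF DbContext classes.
public class DataWarehouseContext { }
public class MetadataContext { }

public abstract class ManagerBase
{
    protected readonly DataWarehouseContext Destination = new DataWarehouseContext();
    protected readonly MetadataContext Metadata = new MetadataContext();
}

public class DateTimeManager : ManagerBase
{
    public void PopulateDateDimension()
    {
        // Load one row per calendar day for the configured range of years.
    }

    public void PopulateTimeDimension()
    {
        // Load one row per second of the day; only done on the first run.
    }
}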

6.4.2.3 Data access layer

The bottom layer of the ETL application is the data access layer (DAL). For the ETL application multiple DALs are needed, since it has access to multiple data sources. This includes a Destination DAL, which is the data warehouse, and Source DALs, which are auto-generated DALs for the data sources, in this case the Data Collection and HSIDatabase databases. The last DAL is the data access to the Metadata. These packages can be seen in Figure 6.4.3, where they are denoted Destination, Sources and Metadata. The DAL packages include Context classes which give generic access to the different types of entities in the databases. For the Metadata, the entities which are mapped to the data from the database are located in the Metadata::Entities package. For the Destination, the entities which are mapped to the data from the database are located in the Dimensions and Facts packages. For the auto-generated sources the entities are included in the DAL. The reason for using auto-generated DALs for the sources is to make the ETL application independent of the operational source systems, by making a "copy" of the entities and DALs from the operational source systems based on the databases. This makes it possible to operate the ETL application even if new columns or tables are added to the operational source systems. However, if tables or columns are removed, or new columns or tables need to be used, a new auto-generation of the entities and DALs is needed. This also makes the overall maintenance of the sources in the ETL application much easier.


6.5 BI Application

The BI component/application consists of a service component and a web site component. The subcomponents of the BI component are seen in Figure 6.5.1.

Figure 6.5.1: Subcomponents of the BI component

The BI application uses the OLAP port (denoted BIOLAP) from the Data Warehouse, which is a proprietary port, which means that the BI application needs to use OLAP vendor-specific libraries to interact with the Data Warehouse. To be able to use the Data Warehouse OLAP port from the internet, a relay/proxy needs to be included in the Web component; this relays all communication between the BIAT and BIOLAP ports. The BIAT port is used by analysis tools. Another client/analysis tool is the service, which also consumes the BIOLAP port. This service component makes it possible to use P2P messages to query and generate reports from the data warehouse, through the BIP2P port. The service is in essence a basic analysis tool, because it can generate reports based on the queries made to the data warehouse.

6.5.1 Data model

The data model for the BI application is seen in Figure 6.5.2, where the Requests and Responses packages are the P2P messages, which the service subscribes to and publishes, respectively. The GetDimensionsRequest and GetMeasuresRequest do not have any parameters, and the respective GetDimensionsResponse and GetMeasuresResponse contain all the dimensions and measures, respectively. Included in the Response messages is also an ExceptionMessage field, which is populated with an error message if the request was not handled successfully. A derived boolean field, Exception, indicates whether a response contains an exception message or a valid response is provided. Included in a GetDimensionsResponse is a list of Dimension objects; the Dimension class shows the attributes of a dimension, and the UniqueName field is used as a parameter in the GetInsightReportRequest. For the BI service a dimension is flattened, and hence hierarchies and levels of a hierarchy are also treated as a "dimension". To get the hierarchy, the ParentUniqueName and ParentDimension refer to the parent "dimension". For the lowest level of a dimension hierarchy the field IsLevel is set to true. The parent of a level is a hierarchy, and the parent of a hierarchy is a "real" dimension, i.e. a dimension table. To provide a complete list of dimensions a "ROOT" parent dimension is provided, where the child dimensions have a ParentUniqueName which is ROOT; however, the ROOT dimension should not be included in the list of dimensions in a GetDimensionsResponse, since it is not a valid dimension. The MeasureGroupNames is a list of the valid measure groups that the dimension is mapped to. If a measure is chosen which is part of a measure group that is not in the MeasureGroupNames of a dimension, the resulting report will include the dimension, but the measures (metrics of a fact) will not be split properly between the rows of the dimension. Another interesting field of the Dimension class is IsNamedSet. A named set is a set of dimension rows, which can e.g. be used to filter queries; the named sets are defined in the OLAP cube. Another option which is defined in the OLAP cube is a display folder, which can be used to split dimensions into different folders; this is usually only used for larger DW/BI solutions.

Included in the GetMeasuresResponse is a list of Measure objects. The Measure class includes a UniqueName which is used as a parameter in the GetInsightReportRequest. A FormatString is provided to format the value of a report; this format string is defined in the OLAP cube. A Units and DataType are also defined in the OLAP cube, which can be used when rendering the values from a report. The Measure class also has a MeasureGroupName, which is the measure group that the measure belongs to. The two display name fields, i.e. DisplayName and ParentDisplayMeasureGroup, are user-friendly strings which can be shown to the end-user.

The GetInsightReportRequest takes a number of dimension and measure unique names. The FilterUniqueNames is a list of dimension unique names; this parameter is convenient when e.g. using named sets. The MeasureUniqueNames is a list of measure unique names. When multiple measures are selected, the different measures are split into two axes, one for the measure and one for the different types of measure. The PrimaryDimensionUniqueNames is a list of dimension unique names, where the measures should only be split between one axis of dimension rows. The SecondaryDimensionUniqueNames is a list of dimension unique names, where the measure should be split into two axes of dimension rows, e.g. one for the primary dimension and one for the secondary. When using the primary dimensions, the axes are flattened by making a cross join. When using multiple measures it is not possible to use the secondary dimension parameter, since this would cause multiple axes to be retrieved when querying. The secondary dimensions are also called column dimensions, because using this type, columns of dimension table rows are made, whereas for the primary dimensions only rows are made, and the measures are on the columns. When using a measure and a column dimension the two are concatenated, i.e. the column label will be "measure name - dimension row name". The GetInsightReportResponse includes an Insight object, which is the insight report. The report includes the titles of the rows and columns, i.e. x- and y-axis titles, respectively. Included are also the labels for the individual rows and columns. For the actual summarised data, a list of InsightCells objects is included in the Insight object. This class includes a single x-axis value and a list of yValues for that x-axis value. The x-axis is dimension rows, while the yValues are the measures; multiple yValues occur when multiple measures are selected or secondary dimensions are used. Note that the xValues are strings, because the OLAP cube controls the order of the x-axis; to provide an ordering for the x-axis to the analysis tool, the ordinal value (xOrdinal) is provided as an integer value. Also note that the yValues are nullable double values, because an OLAP cube supports empty values for summarised data, e.g. for when there is no data (measures) available for a dimension row.
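A minimal sketch of the Insight and InsightCells classes as they are described above; the property names follow the description and may differ slightly from the actual implementation.

using System.Collections.Generic;

public class Insight
{
    public string RowTitle { get; set; }          // x-axis title
    public string ColumnTitle { get; set; }       // y-axis title
    public List<string> RowLabels { get; set; }
    public List<string> ColumnLabels { get; set; }
    public List<InsightCells> Cells { get; set; }
}

public class InsightCells
{
    public string XValue { get; set; }            // dimension row label; ordering is controlled by the cube
    public int XOrdinal { get; set; }             // explicit ordering for the analysis tool
    public List<double?> YValues { get; set; }    // one value per measure/column dimension, nullable for empty cells
}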

6.5.2 BI application layers

The BI application is split into two layers: a presentation layer, which consists of a service and a web site, and a business logic layer, which includes a manager class that handles the querying and generation of reports. The application classes are seen in Figure 6.5.3.

6.5.2.1 Presentation layer

The service has its own P2P broker, which is used to get callbacks to the SendResponse methods seen in Figure 6.5.3. The SendResponse methods call the respective methods in the manager. A web site is also included in the BI application; this web site is mainly for the relay of the OLAP cube, but it is also used to get access to the OLAP cube, e.g. by providing user names and passwords for the cube, and for quick guides to get end-users started with using the OLAP cube. The web site uses the Model View Controller paradigm, to make the web site more modular and extendible.

6.5.2.2 Business logic layer

The business logic package seen in Figure 6.5.3 shows the BI manager class, which is used to handle the requests from the service. The manager contains a connection to the OLAP cube (denoted _conn) and a field which defines which cube or perspective to use. This field is populated from a configuration file, together with a connection string used to get the _conn connection. A perspective is a subset of a cube; the cube consists of measures and dimensions. The primary methods of the manager class are GetDimensions, GetMeasures and GetInsightReport. These are the methods called by the presentation layer, and they provide the list of dimensions and measures of the cube. The GetInsightReport calls GenerateQuery based on the parameters sent, and afterwards it calls GetQueryResult, whose result is then processed to fit into the Insight and InsightCells objects and returned to the presentation layer, which sends these as response messages. Additional methods are provided by the BI manager, including GetDimensionByUniqueName and GetMeasureByUniqueName, which return a single dimension or measure, respectively. The GetMeasureGroup method gets the measure group based on a unique name for a measure; to get all measure groups the GetMeasureGroups method can be used. Lastly, the ProcessDatabase method enables the manager to reprocess the OLAP cube, which can be helpful if the newest data is not yet available in the cube. However, this is also done periodically together with the ETL processing, so this method is mainly provided because the BI manager already queries the OLAP cube, and the processing of the cube is also just a query. All the methods are public since there might be a need to use these methods outside the scope of the manager class.
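The OLAP vendor-specific library is not named in the design; assuming ADOMD.NET (the standard client library for SSAS), a generated MDX statement could be executed and read back roughly as follows. The cube name and the MDX text are illustrative only.

using System;
using Microsoft.AnalysisServices.AdomdClient;

public static class CubeQuerySketch
{
    public static void Run(string connectionString)
    {
        using (var conn = new AdomdConnection(connectionString))
        {
            conn.Open();
            // Illustrative MDX: average dimmable light level per month.
            var mdx = "SELECT [Measures].[Dimmable Light Average] ON COLUMNS, " +
                      "[Date].[Month Calendar].[Month].MEMBERS ON ROWS " +
                      "FROM [DataWarehouse]";

            using (var cmd = new AdomdCommand(mdx, conn))
            {
                var cellSet = cmd.ExecuteCellSet();
                // One position per month member on the ROWS axis.
                foreach (Position position in cellSet.Axes[1].Positions)
                {
                    Console.WriteLine(position.Members[0].Caption);
                }
            }
        }
    }
}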


6.5.3 App Service

One of the consumers of the BI service is the App Service, which is used as a relay to make services inside the P2P network usable from the internet. This is done through a REST web service. The App Service uses the BI service to relay lists of dimensions and measures and relays report requests to the BI service. The requests and responses from the BI application have a one-to-one relation with the requests and responses done through the REST web service. The App Service is used by a tablet application and can be used by any device which can handle REST web services.


Figure 6.5.2: Data model for BI application


Figure 6.5.3: Class diagram for BI application

Chapter 7 Implementation

”There are only two kinds of languages: the ones people complain about and the ones nobody uses” —Bjarne Stroustrup, 2007

Contents
7.1 Details on Data Collection
7.2 Details on Data Warehouse
7.3 Details on ETL
7.4 Details on BI application

This chapter provides the specific details on the implementation of the DW/BI solution.

7.1 Details on Data Collection

The implementation of the Data Collection service application is done using the same data model defined in the design and most of the methods from the design. The implementation is done using C#.Net, and the Entity Framework (EF) is used for Code-First database creation, which makes it possible to write classes and generate a database schema based on those classes, which are called entities. EF also makes it possible to make a transparent DAL, where the instances of the entities are available through sets, called IDbSet. These sets have a number of basic operations, which make it possible to do Create, Read, Update and Delete (CRUD) operations on the entities; these methods are called Add, Update and Remove. Reading is done by finding the instance of the entity, which is possible through e.g. the use of LINQ, a programmatic querying language which somewhat resembles SQL. A small sketch of these operations is given below. The final class diagram of the implementation is seen in Figure 7.1.1.

The services are made as Windows Services; these require an OnStart and OnStop method, which are called when the service is starting and stopping, respectively. The OnStop method is not used for the two services, since no clean-up is necessary for them. Additionally, a service installer is generated for each Windows Service, which is used to define the service when it is installed. This includes the start type, which for both services is set to automatic, meaning the services will start when the server starts.
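As a minimal, self-contained sketch of the CRUD operations described above (the entity and context shown here are simplified stand-ins for the actual Data Collection classes):

using System;
using System.Data.Entity;
using System.Linq;

public class NumericRecord
{
    public long Id { get; set; }
    public double DoubleValue { get; set; }
    public DateTime Sent { get; set; }
}

public class DataCollectionContext : DbContext
{
    public IDbSet<NumericRecord> NumericRecords { get; set; }
}

public static class CrudSketch
{
    public static void Run()
    {
        using (var context = new DataCollectionContext())
        {
            // Create
            context.NumericRecords.Add(new NumericRecord { DoubleValue = 21.5, Sent = DateTime.UtcNow });

            // Read via LINQ
            var latest = context.NumericRecords
                .OrderByDescending(r => r.Sent)
                .FirstOrDefault();

            // Delete
            if (latest != null)
                context.NumericRecords.Remove(latest);

            context.SaveChanges();
        }
    }
}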


Figure 7.1.1: Class diagram for Data Collection

The Windows account for the services is set to User, which means a Windows user account is needed when installing the service; this user account will be used when interacting with the database, where access rights should be given to this user account.

7.1.1 DataCollectionService

The DataCollectionService is a separate Windows Service. The OnStart method in the DataCollectionService instantiates a P2P broker and subscribes to the response messages provided by the HSI system. All the messages which are sent in the P2P network are included in an EPIC.Common NuGet package, which makes it possible to have different versions of the common messages; however, the latest package should be used for the best compatibility with the other systems in the network. Anonymous callback methods are used for the subscribed messages, which instantiate a new DataCollectionManager and call the respective SaveEventInDb method. For numeric records the following code is used to determine whether a LongValue can be set on a NumericRecord. It checks whether the epsilon (also called the error) of the double value is below a threshold; if it is, the value is considered a long value. The code for this is seen in Listing 7.1.1.


var isLong = false;
if (doubleVal.HasValue && longVal.HasValue)
{
    var epsi = Math.Abs(doubleVal.Value - longVal.Value);
    if (epsi < 0.01)
    {
        isLong = true;
    }
}

if (!isLong)
    longVal = null;

Listing 7.1.1: Determination of long or double

The DoubleValue of a NumericRecord is always set. Since the value in the response message is an object, the doubleVal and longVal are set using a type conversion through the GetDoubleValue and GetLongValue methods, which call Convert.ToDouble or Convert.ToInt64, respectively, and return null if the conversion could not be done. If a response message cannot be saved properly, e.g. if the Value of the HSI IItem is another type than a numeric or bool, SendToCatchAll is called, which saves an ErrorRecord. This method uses ObjectToXml from the ObjectExtensions class to serialise the value, to keep the type and value intact in the XmlValue field of the ErrorRecord. Filtering is done for both BoolRecords and NumericRecords, but only for UpdateEvent messages. The code for the filtering of BoolRecords is seen in Listing 7.1.2.

if (recordType == RecordType.UpdateEvent)
{
    var latestRecord = Context.BoolRecords.Where(x => x.ItemId == item.Id).OrderByDescending(y => y.Sent).Take(1).SingleOrDefault();
    if (latestRecord != null)
    {
        if (latestRecord.Type == RecordType.UpdateEvent && latestRecord.BoolValue == boolVal)
        {
            if (timestamp.Value.Subtract(latestRecord.Sent).TotalSeconds < 1d)
            {
                // if the bool is the same and the last record was from less than a second ago, discard the record
                return false;
            }
        }
    }
}

Listing 7.1.2: Bool record filtering


The filtering for the NumericRecords is done slightly differently: because of the double value, a threshold needs to be used to check whether the value was already recorded. The code for the filtering is seen in Listing 7.1.3.

if (recordType == RecordType.UpdateEvent)
{
    var latestRecord = Context.NumericRecords.Where(x => x.ItemId == item.Id).OrderByDescending(y => y.Sent).Take(1).SingleOrDefault();
    if (latestRecord != null)
    {
        if (latestRecord.Type == RecordType.UpdateEvent && latestRecord.DoubleValue.HasValue && doubleVal.HasValue && Math.Abs(latestRecord.DoubleValue.Value - doubleVal.Value) < 0.001)
        {
            if (timestamp.Value.Subtract(latestRecord.Sent).TotalSeconds < 1d)
            {
                // if the double is the same and the last record was from less than a second ago, discard the record
                return false;
            }
        }
    }
}

Listing 7.1.3: Numeric record filtering

For numeric records the number of pulses for the pulse counters is also filtered: if the number of pulses is zero, the record is discarded, because these responses are useless and would require additional filtering in the ETL if not done in the Data Collection. This is a special opportunity which is only possible when an operational source system is made together with a DW/BI solution. Usually all the operational source systems are pre-existing systems, but since this operational source system will not be used for any purpose other than the DW/BI solution, it is possible to make implementation decisions like this.
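As a supplement, the GetDoubleValue and GetLongValue conversion helpers mentioned at the start of this section could be implemented along the following lines; the exact signatures in the implementation may differ.

using System;

public static class ValueConversionSketch
{
    // Returns null when the value cannot be converted, as described above.
    public static double? GetDoubleValue(object value)
    {
        try { return Convert.ToDouble(value); }
        catch (Exception) { return null; }
    }

    public static long? GetLongValue(object value)
    {
        try { return Convert.ToInt64(value); }
        catch (Exception) { return null; }
    }
}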

7.1.2 DataCollectionHelperService

The DataCollectionHelperService is a separate Windows Service. The OnStart method instantiates a P2P broker, which is used for the entire operation of the service, and sets up a Timer which is initially set to 30 seconds. On the timer tick the TimerOnElapsed method is called. AutoReset for the timer is set to false, because the timer should not send another tick after the first 30 seconds; the timer should only be enabled again after a tick has completed. The TimerOnElapsed method runs the sequence seen in Figures 6.2.4 and 6.2.5 in the Design chapter. After a tick is complete the timer is set to an interval of 5 minutes, which means another tick is executed after 5 minutes. Again the AutoReset is set to false, so if the TimerOnElapsed takes more than 5 minutes, another tick is not executed in parallel.
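A minimal sketch of this timer behaviour (30-second initial interval, AutoReset disabled, re-armed with a 5-minute interval only after the tick has completed):

using System.Timers;

public class HelperServiceTimerSketch
{
    private readonly Timer _timer = new Timer(30000) { AutoReset = false };

    public void Start()
    {
        _timer.Elapsed += TimerOnElapsed;
        _timer.Start();
    }

    public void Stop()
    {
        _timer.Stop();
    }

    private void TimerOnElapsed(object sender, ElapsedEventArgs e)
    {
        // ... run the collection sequence from the design chapter ...

        // Re-arm the timer only after the work has completed, so ticks never overlap.
        _timer.Interval = 5 * 60 * 1000;
        _timer.Start();
    }
}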


7.2 Details on Data Warehouse

The data warehouse components are split into two different parts: the database part is moved into the ETL solution, while the OLAP cube part is moved into the BI solution. This also means that the entire DW/BI solution is split into three solutions: a DataCollection solution, an ETL solution and a BI solution.

7.2.1 Data Warehouse database

To minimise redundancies, the database using the RDBMS component from the design is moved into the ETL solution, where EF Code-First is used to generate the database schema. The resulting entities from the data model are seen in Appendix D.1. To enforce referential integrity between dimension tables and fact tables, references to the dimension entities are made from the fact entities. Since the foreign key naming in the design differs from the EF Code-First default, which is "referred entity name_Id", the ForeignKey attribute is used to decorate the references to the dimension entities, and a specified foreign key property is introduced. An example is given in Listing 7.2.1.

[Table("FactDimmableLight")]
public class DimmableLightFact
{
    public long Id { get; set; }

    // dimension keys
    [ForeignKey("DateKey")]
    public DateDimension Date { get; set; }
    public int DateKey { get; set; }

    [ForeignKey("TimeKey")]
    public TimeOfDayDimension Time { get; set; }
    public int TimeKey { get; set; }

    [ForeignKey("DeviceKey")]
    public DeviceDimension Device { get; set; }
    public int DeviceKey { get; set; }

    // facts
    public double Measurement { get; set; }
    public double MeasurementDelta { get; set; }

    public double Duration { get; set; }
}

Listing 7.2.1: Example of fact entity

EF Code-First requires all entities to have explicit Id fields, which are auto-incremented. The default naming and auto-incrementation of the Id should not be used for e.g. the DateDimension, where the primary key has the special format YYYYMMDD. To override this, the Key and DatabaseGenerated attributes can be used to decorate the primary key field. The names of the entities follow a regular naming convention for entities; however, a different naming convention is used for dimension and fact tables in data warehouse databases, which are usually called DimName and FactName, respectively, where Name is the name of the table, e.g. Date. Hence the DateDimension entity should be named DimDate in the database and the DimmableLightFact entity should be named FactDimmableLight in the database. To override this, the Table attribute is used to decorate the entity class. An example of the DateDimension is given in Listing 7.2.2.

[Table("DimDate")]
public class DateDimension
{
    [Key]
    [DatabaseGenerated(DatabaseGeneratedOption.None)]
    public int DateKey { get; set; }
    public DateTime DateAlternateKey { get; set; }
    public int Year { get; set; }
    public string YearName { get; set; }
    public int Month { get; set; }
    public string MonthName { get; set; }
    public int Day { get; set; }
    public string DayName { get; set; }
    public int DayOfWeek { get; set; }
    public string DayOfWeekName { get; set; }
    public int DayOfYear { get; set; }
    public string DayOfYearName { get; set; }
    public int WeekNumberOfYear { get; set; }
    public string WeekNumberOfYearName { get; set; }
    public int CalendarQuarter { get; set; }
    public string CalendarQuarterName { get; set; }
    public int CalendarSemester { get; set; }
    public string CalendarSemesterName { get; set; }
    public int CalendarTrimester { get; set; }
    public string CalendarTrimesterName { get; set; }
    public string DateName { get; set; }
}

Listing 7.2.2: Example of dimension entity

The RDBMS system used for this DW/BI solution is Microsoft SQL Server 2014.

7.2.2 SSAS cube

The OLAP component is put into the BI solution, because the OLAP cube contains BI-specific calculations. The OLAP system used for this DW/BI solution is Microsoft SQL Server 2014 Analysis Services (SSAS). The SSAS system contains Data Sources, Data Source Views, Cubes, Dimensions, Mining Structures, Roles, Assemblies and Miscellaneous. The Data Sources are the sources of the data; in this solution the only data source is the DataWarehouse database. The Data Source Views are different views for all the tables available from the data sources; for this solution, only a single data source view is used. The Cubes are the different cubes in the OLAP system; the cubes define which dimensions and measures are used and how they relate through measure groups. The cube also contains a set of calculations and named sets.


KPIs are also defined for the cubes. Additionally, actions, partitions, aggregations, perspectives and translations are also defined in the cube. The actions are different actions possible through the analysis tools; this includes drill-through actions, where the columns of the drilled-through data can be specified. This feature is not used in this solution. The partitions can be used to define logical parts of the data; this feature is not used in this solution. The aggregations are used to define special types of aggregations, e.g. for optimising pre-calculated summaries; this feature is not used in this solution. Perspectives make it possible to make subsets of the cube available to end-users. A perspective is used for the BI application, such that it is possible to control which dimensions and measures are available in the BI application and thereby in the App Service. Translations are used for translating elements of the cube, e.g. the dimensions and measures; this could be used if multilingual support is needed, however it is not needed in this solution. The Dimensions define all the dimensions available for the cubes. A dimension is defined by selecting a number of attributes; the attributes can have a NameColumn and KeyColumn defined if these are not the same. If special ordering is needed, OrderBy and OrderByAttribute settings are available. In the dimension, attribute relationships are also defined, i.e. for hierarchies. When the attribute relationships are defined, hierarchies can be defined, which are shown in the analysis tools. The Mining Structures are for data mining; this feature is not used for this solution. The Roles are used to define specific access rights; using roles it is possible to filter the cube of data for specific roles, which users can have membership of. An External Users role is included in this solution, to be able to define which data is available to external users. The Assemblies are used if custom code is needed in the cube; this is not used for this solution. The Miscellaneous folder is for miscellaneous files, which is not used for this solution either.

7.2.2.1 Data Source View

Most of the cube is made with a one-to-one relation to the fact and dimension tables which are in the data warehouse database. A new dimension is however introduced, a Percent dimension called DimPercent, which is based on a view with the SQL seen in Listing 7.2.3. The resulting view gives a Percent column and a PercentName column. The Percent column contains the integer values from zero to 100 and is used as the primary key. The PercentName is the value of the Percent column concatenated with a percent sign.

SELECT
    CAST(number AS int) AS [Percent], CAST(number AS varchar) + '%' AS [PercentName]
FROM master..spt_values
WHERE type = 'p' AND number BETWEEN 0 AND 100

Listing 7.2.3: SQL view for the Percent dimension

The Percent dimension is used with the FactWeather and FactDimmableLight tables. For FactDimmableLight a foreign key is added to the Percent dimension with the value from the Measurement column of the fact table; the foreign key is called MeasurementPercentFK. For FactWeather a foreign key is added to the Percent dimension with the value from the CloudsCoverValue column of the fact table; the foreign key is called CloudsCoverPercentFK. For both tables the foreign key is derived by casting the value to an integer; this truncates any decimal points and corresponds to a primary key in the Percent dimension. The Percent dimension is introduced in the OLAP cube to test whether this dimension can be a candidate for a proper dimension, which can be used for more fact tables. This is easily accomplished in an SSAS cube, where views and derived columns (called Named Calculations) can be added to the Data Source View.

The fact tables FactEnergyConsumption, FactTotalEnergyConsumption and FactEnergyProduction are converted to views, because the contents of these tables are a little different from the other fact tables. The Measurement column is a count of the pulses, which will most likely not be used for analysis; instead, derived measurements are needed to get kW and kWh measurements out. In addition to this, the fact tables contain zero counts with a duration of zero, because the normal ETL processing is used where the duration is calculated. These zero counts should not be included in the data which is provided by the cube. Hence the views based on the SQL listed in Listings 7.2.4, 7.2.5 and 7.2.6 are used.

SELECT Id, DateKey, TimeKey, DeviceKey, Measurement, MeasurementDelta, Duration,
    Measurement / 100 AS MeasurementAskWh,
    (Measurement / 100) / (Duration / 3600) AS MeasurementAskW
FROM FactEnergyProduction
WHERE (Duration > 0)

Listing 7.2.4: SQL view for FactEnergyProduction

SELECT Id, DateKey, TimeKey, DeviceKey, Measurement, MeasurementDelta, Duration,
    Measurement / 100 AS MeasurementAskWh,
    (Measurement / 100) / (Duration / 3600) AS MeasurementAskW
FROM FactEnergyConsumption
WHERE (Duration > 0)

Listing 7.2.5: SQL view for FactEnergyConsumption

SELECT Id, DateKey, TimeKey, DeviceKey, Measurement, MeasurementDelta, Duration,
    Measurement / 1000 AS MeasurementAskWh,
    (Measurement / 1000) / (Duration / 3600) AS MeasurementAskW
FROM FactTotalEnergyConsumption
WHERE (Duration > 0)

Listing 7.2.6: SQL view for FactTotalEnergyConsumption

Derived columns/Named Calculations are also added for all the fact tables which have a Duration column. Since this column contains the duration in seconds, a named calculation is done to get the duration in minutes, hours and days. The named calculations are called DurationInMinutes, DurationInHours and DurationInDays. The DurationInMinutes is derived by: [Duration]/60. The DurationInHours is derived by: ([Duration]/60)/60. And the DurationInDays is derived by: (([Duration]/60)/60)/24.


Because of the foreign key constraints already in the database, all relationships are automatically generated in the data source view. This is one of the big benefits of having a proper ETL layer, instead of having the ETL done inside the data source view.

7.2.2.2 Cube

A single cube is defined in the SSAS system. All the fact tables are made into measure groups; the measures defined for the measure groups are the sums of the metric columns and a count of all the rows of the fact table. For the SDE measurements the count is replaced by a count of the non-empty rows for each of the metric columns, except the Delta columns. This is done because the SDE Measurements can have null values, and to calculate proper summarised data the count should be of the non-empty rows. The counts and sums are used to make average calculations. The calculations are defined in a calculation script, which is an MDX script that generates MDX members. The averages are made for the durations and measurements from the Numeric and Bool Record facts. For the energy consumption, production and total consumption, the average is made for the kW and kWh measures. For the Weather measures, averages are calculated for all of them; the same is done for the SDE Measurements. In addition to the average calculations, named sets are defined for the Date and Time dimensions, e.g. for Today, Last month, etc., and for time e.g. Last hour, Evening, etc. An excerpt of the calculation script can be seen in Appendix C.1. An example of a calculated average using the calculations in the cube is seen in Listing 7.2.7. A member is created under the measures member of the current cube. The measure is called Dimmable Light Average and has the unique name [Measures].[Dimmable Light Average]. It is calculated using [Measures].[Dimmable Light Measurement] / [Measures].[Dimmable Light Count], which is the sum divided by the count.

CREATE MEMBER CURRENTCUBE.[Measures].[Dimmable Light Average]
AS [Measures].[Dimmable Light Measurement] / [Measures].[Dimmable Light Count],
VISIBLE = 1, ASSOCIATED_MEASURE_GROUP = 'Dimmable Light';

Listing 7.2.7: Calculating average using MDX

Two examples of named sets using the calculations in the cube are seen in Listing 7.2.8. A named set called "Date - Last 6 Months" is created in the current cube. The set is defined by a string, because it uses the current time (Now()) to get the year and month, hence the STRTOMEMBER method needs to be used. STRTOMEMBER gives the current month member; to get the last 6 months, the LAG method is used, which gives the month member that is 6 members behind the current member. The last six months do not include the current month, hence the set should end at the previous month member, which is why LAG(1) is used at the end. The FORMAT(Now(), 'yyyy') and FORMAT(Now(), '%M') parts supply the keys used to get the current month member. A simpler set is defined for Time - Evening, where the hour keys 18 to 21 are used to define the named set.

CREATE DYNAMIC SET CURRENTCUBE.[Date - Last 6 Months]
AS STRTOMEMBER('[Date].[Month Calendar].[Month].&['+FORMAT(Now(),'yyyy')+']&['+FORMAT(Now(),'%M')+']').LAG(6) : STRTOMEMBER('[Date].[Month Calendar].[Month].&['+FORMAT(Now(),'yyyy')+']&['+FORMAT(Now(),'%M')+']').LAG(1);

CREATE DYNAMIC SET CURRENTCUBE.[Time - Evening]
AS [Time].[Full Day].[Hour].&[18]:[Time].[Full Day].[Hour].&[21];

Listing 7.2.8: MDX for sets

The relation between the measure groups and dimensions is defined in the Dimension Usage. For the Numeric and Bool Record facts the dimensions Date, Device, Device Grouping and Time are defined; for the Bool Record facts the State dimension is also used. The State dimension is the SwitchButtonDimension; it is called State to give a clearer meaning of the dimension. For the Weather facts the Date, Percent, Time and Weather dimensions are defined. For the SDE Measurements only the Date and Time dimensions are defined. For the Date dimension the primary key DateKey is mapped to the fact tables' foreign key named DateKey; this type of relation is called a Regular type in SSAS. A regular type is also used for the Device dimension, where the DeviceKey (primary and foreign key) is used, for Time, where the TimeKey (primary and foreign key) is used, and for State for the Bool Record facts, where the SwitchButtonKey (primary and foreign key) is used. A Many-to-Many relationship type is defined for the Device Grouping dimension. This type of relation uses an intermediate measure group, which in this case is the DeviceGroupFact. The Device Group measure group uses the Device and Device Grouping dimensions, which are defined using a regular type, where the DeviceKey (primary and foreign key) and GroupingKey (primary and foreign key) are used. By using this intermediate measure group it is possible for SSAS to find groups based on the device, and devices based on the group. The Dimension Usage is seen in Figures C.2.1, C.2.2, C.2.3 and C.2.4 in Appendix C.2.

A single KPI is defined in the SSAS cube, which is the "Energy Production vs Energy Consumption KPI". This KPI is defined by the value [Measures].[Energy Production Measurement As kWh] / [Measures].[Total Energy Consumption Measurement As kWh], the goal, which is 1, and a status expression, which is given in Listing 7.2.9. This basically means that if the fraction value/goal is greater than or equal to 1, the status is green; if the fraction is between 1 and 0.85 the status is yellow; otherwise the status is red.

Case
    When
        KpiValue("Energy Production vs Energy Consumption KPI") / KpiGoal("Energy Production vs Energy Consumption KPI") >= 1
    Then 1
    When
        KpiValue("Energy Production vs Energy Consumption KPI") / KpiGoal("Energy Production vs Energy Consumption KPI") < 1
        And
        KpiValue("Energy Production vs Energy Consumption KPI") / KpiGoal("Energy Production vs Energy Consumption KPI") >= .85
    Then 0
    Else -1
End

Listing 7.2.9: KPI value MDX


A KPI makes it possible to monitor different measures (values), set a goal, and define a status. The KPI is available in the analysis tools. A "BI Service" perspective is defined in the cube, which defines e.g. which dimensions and measures are available in the BI service application. For now, all dimensions and measures are available.

7.2.2.3 Dimensions

The Date dimension is defined with the attributes and hierarchy seen in Figure 7.2.1; the attribute relationships are seen in Figure 7.2.2. Because of the hierarchies, SSAS needs to know how to process the levels; this is defined through key columns, which are composite keys, since otherwise the key used would occur multiple times (once for each parent level). Hence the Month attribute has a composite key of Year, Month, in this order. The Quarter attribute has a composite key of Year, Calendar Quarter, in this order. The Semester attribute has a composite key of Year, Calendar Semester, in this order. The Trimester attribute has a composite key of Year, Calendar Trimester, in this order. All other attributes use their own key. The Device dimension is defined with the attributes and hierarchy seen in Figure 7.2.3; the attribute relationships are seen in Figure 7.2.4. The hierarchy in this dimension does not require composite keys, since a device will always only be under one Floor level, and a floor will also be under only one House level. The Time dimension is defined with the attributes and hierarchy seen in Figure 7.2.5; the attribute relationships are seen in Figure 7.2.6. The hierarchies in this dimension require composite keys: for the Minute attribute the composite key is Hour, Minute, in this order; for the Quarter Hour attribute the composite key is HalfHour, QuarterHour, in this order; for the Quarter Hour Of Day attribute the composite key is HalfHourOfDay, QuarterHourOfDay, in this order; and for the Second attribute the composite key is Hour, Minute, Second, in this order. All other attributes use their own key. The Weather dimension is defined with the attributes seen in Figure 7.2.7. This dimension has no attribute relationships, because no hierarchies are defined in this dimension; this also means no composite keys need to be defined. Note that this is a junk dimension. The State, Percent and Device Grouping dimensions only have a single attribute, which is the key attribute. For all the key attributes of all the dimensions a Key column is defined, which is the primary key of the dimension table, as well as a Name column, which can be found in the design chapter. All the dimensions are ordered by the key column.

7.2.2.4 Roles

An External Users role is defined for the cube. This role has only been given read access, hence users with this role are not allowed to administer or process the cube, only read from it. An ExternalUser group is defined in the Active Directory (AD); this AD user group is the only member of the External Users role. The role gives access to read and drill through in the cube, hence it is possible for these users to get access to the most granular level of data in the cube. For now the role has access to all dimensions and measures, however this can be configured later, e.g. to only give access to data for a specific house, or only access to summarised data.


Figure 7.2.1: Date dimension and hierarchies

Figure 7.2.2: Date dimension attribute relations

Figure 7.2.3: Device dimension and hierarchies


Figure 7.2.4: Device dimension attribute relations

Figure 7.2.5: Time dimension and hierarchies

Figure 7.2.6: Time dimension attribute relations

Figure 7.2.7: Weather dimension


7.3 Details on ETL

The implementation of the ETL service application is done using the packages defined in the design chapter. The implementation is done using C#.Net, and the Entity Framework is used for all data access. The data warehouse destination is handled through EF Code-First, which makes it possible to quickly edit the data model and generate a database schema based on the entities. EF Database-First is used to auto-generate entity classes from an existing database schema; this is done for the two database sources, Data Collection and HSI. If these models are updated, EF can auto-generate new entity classes which match the database schema. When using Database-First, the DAL is made such that an exception is thrown if the auto-generated classes are used in a Code-First operation to generate a database schema. This ensures that the source databases are never accidentally changed through the ETL solution. Code-First is also used for the Metadata database. For the Numeric and Bool Records from the Data Collection operational source system, Microsoft SQL Server Integration Services (SSIS) is used. SSIS can process large amounts of data much more efficiently than EF. Hence for the smaller data sources EF is used, while the larger, i.e. the data from the Data Collection service, is loaded through SSIS.
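The Database-First guard mentioned above matches the pattern emitted by the standard EF DbContext generator, where the generated context throws if it is ever used in a Code-First pipeline. A sketch, with a placeholder context name:

using System.Data.Entity;
using System.Data.Entity.Infrastructure;

// HsiSourceContext is a placeholder name for one of the auto-generated source contexts.
public partial class HsiSourceContext : DbContext
{
    public HsiSourceContext()
        : base("name=HsiSourceContext")
    {
    }

    protected override void OnModelCreating(DbModelBuilder modelBuilder)
    {
        // The source schema must never be (re)generated from these classes.
        throw new UnintentionalCodeFirstException();
    }
}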

7.3.1 ETL solution

The ETL solution is formed from three layers: the service layer, the business logic layer and the data access layer.

7.3.1.1 Service

The service layer consists of three presentation applications: a Windows Service, a console application and a web site. The Windows Service is seen in Figure 7.3.1. The service consists of a Timer, which is instantiated and started in the OnStart method of the service, called when the service is starting. The Timer is initially set to an interval of 30 seconds, and AutoReset is set to false, because the timer should not automatically execute another tick after the first tick has been executed. The tick executes the ProcessingTimerOnElapsed method, which calls the Process method of the ProcessingHelper class, which runs all the processing steps defined in the design chapter. After the processing is done, the timer interval is set to a processing interval which is defined in a configuration file; the default value is 15 minutes. This means that the ETL will process data every 15 minutes after it has finished processing the previous time. This is important: since a tick is run on the thread pool, race conditions could happen if AutoReset is not set to false. Race conditions can even happen if the interval is set to a very low value, because the stopping of the timer is done in another thread than the elapsed tick method. The OnStop method makes sure to stop the timer, to avoid the processing being executed during the stop procedure of the Windows Service.

The ProcessingHelper class contains methods for the different processing steps, i.e. ProcessDateTime, ProcessHSI, ProcessWeather, ProcessSDE, ProcessSSIS and ProcessSSAS; this is also the order in which the Process method calls these methods. ProcessDateTime instantiates a DateTimeManager and executes PopulateDateDimension followed by PopulateTimeDimension of the manager. ProcessHSI instantiates an HSIManager and executes the PopulateGroupingDimension, PopulateDeviceDimension, PopulateSwitchButtonDimension and PopulateDeviceGroupFacts methods of the manager, in this order. ProcessWeather instantiates a WeatherManager and executes the PopulateWeatherFacts method of the manager. ProcessSSIS instantiates an SSISManager and executes the Execute method of the manager. Lastly, ProcessSSAS instantiates an SSASManager, which executes the Connect method followed by the ProcessDatabase method of the manager.

The Console application is seen in Figure 7.3.2. This console application takes one parameter, which can be "all", "datetime", "hsi", "sde", "ssas", "ssis" or "weather". Based on the option, the respective method is called in the ProcessingHelper class; for "all" the Process method is called (see the sketch at the end of this subsection). The console application hence enables separate processing of the different steps, which are all processed by the Windows Service.

The web site is seen in Figure 7.3.3 and is based on the MVC paradigm. There are three different pages, and hence three controllers. The first page is a Home page, which is controlled by the HomeController, which controls the HomeIndex view, populated with information from the HomeIndexViewModel. The second page is a Relation management page, which is controlled by the RelationController, which controls the RelationIndex, RelationCreate, RelationEdit and RelationDelete views; the RelationIndex view uses the RelationIndexViewModel and the RelationCreate and RelationEdit views use the RelationCreateViewModel. The last page is a Weather location management page, which is controlled by the WeatherController, which controls the WeatherIndex, WeatherDetails, WeatherCreate, WeatherEdit and WeatherDelete views. The different methods in the controllers populate the view models. The two management pages are used for CRUD operations on SourceDestinationRelation and WeatherLocation entities. Because of the simplicity of the WeatherLocation entity, the default auto-generated boilerplate is used. For the SourceDestinationRelation, the SourceRowIdentifier field means that the controller, view models and views are made manually. The SourceRowIdentifier is a dictionary which is not easily mapped to web site controls; the solution is to use SelectLists of the different types of identifiers for the relations, i.e. a SelectList for HSI Groups, Items and ItemTypes. SelectLists are also used for end-user convenience for the Source and Destination tables. Since the identifiers are used for all the Data Collection sources, this field is pre-populated with the value GenericRecords, and since the only source is the Data Collection, this field is also pre-populated. The destination table list is a list of all the Numeric Record and Bool Record fact tables. The HomeIndex view is used to show all the metadata from the Metadata entities, as well as to give a summary of the latest load of the Numeric and Bool records, where unmapped records are put into UnmappedGenericFact entities. All the views use the SharedLayout to give the same look and feel for the pages; the default look and feel is used, which is based on Bootstrap.

To secure the web site from unauthorised users the AdGroupAuthorize class is used, which is decorated as an attribute on the methods returning an ActionResult from the controllers. The AdGroupAuthorize checks the LogonUserIdentity which is sent in the request; if it is not sent, the web site will prompt the user for a valid Windows user name and password, which is on the AD or the local server. Since external users are given access through an external user account which is also a valid Windows user, the AdGroupAuthorize checks the associated user groups for the user. An "InternalUsers" user group is made in the AD or the local server, which includes the authorised users of the ETL configuration.
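A minimal sketch of the console dispatch described above; it assumes the ProcessingHelper class from this section and is therefore not a complete program on its own.

public static class EtlConsoleSketch
{
    public static void Main(string[] args)
    {
        var helper = new ProcessingHelper();
        var step = args.Length > 0 ? args[0].ToLowerInvariant() : "all";

        switch (step)
        {
            case "datetime": helper.ProcessDateTime(); break;
            case "hsi":      helper.ProcessHSI();      break;
            case "weather":  helper.ProcessWeather();  break;
            case "sde":      helper.ProcessSDE();      break;
            case "ssis":     helper.ProcessSSIS();     break;
            case "ssas":     helper.ProcessSSAS();     break;
            default:         helper.Process();         break; // "all"
        }
    }
}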


Figure 7.3.1: Service related classes

Figure 7.3.2: Console class

7.3.1.2 Business logic

The business logic is seen in Figure 7.3.4, where the different managers for the different processing steps are shown. The ManagerBase is the base manager, which includes access to all the data access, i.e. the two data sources, the data warehouse destination and the metadata. Because the Delta measure is calculated so often, a static GetDelta method is included in the ManagerBase.

The DateTimeManager has two methods, one for populating the TimeOfDayDimension and one for populating the DateDimension. Because of the amount of data in the TimeOfDayDimension, the AutoDetectChangesEnabled parameter of the data warehouse context Configuration is set to false; otherwise the initial load of especially the TimeOfDayDimension would take several minutes. Before populating any of the dimensions the metadata is checked to see whether the dimensions are already loaded, and if so, the populating is skipped.

The HSIManager has four methods: one for populating the GroupingDimension, one for populating the DeviceDimension, one for populating the SwitchButtonDimension and one for populating the DeviceGroupFact. The GroupingDimension is populated with Group entities from the HSI. The DeviceDimension is populated with the Item entities from the HSI; the HSI Floor, House and ItemType are flattened/denormalised into the device dimension, e.g. the Embrace House entity is populated for all the items which have a reference to this house entity. The SwitchButtonDimension is populated by iterating all the HSI ItemTypes which have the Discriminator ItemTypeBool; for each of these entities the True and False State entities are loaded into the SwitchButtonDimension. For all the HSI dimensions the AlternateKey is used to either add or update the dimension entities, which is done using the AddOrUpdate method of the DAL context. Lastly, the DeviceGroupFact is populated by iterating all the devices in the DeviceDimension and finding the HSI Item. For all the HSI Groups associated with the item, the Grouping is found in the GroupingDimension; when the DeviceDimension entity and the GroupingDimension entity are found, a new DeviceGroupFact entity is made and added or updated using the AddOrUpdate method. This operation only adds and keeps existing many-to-many relations. To remove the groups that are no longer valid for the HSI Item, the list of valid groups (GroupingKeys) is kept for each item, and the set of DeviceGroupFact entities where the DeviceKey is the current device in the iteration and where the GroupingKey is not contained in the valid group list is removed from the DeviceGroupFact context. This results in an up-to-date list of many-to-many relations between groups and items.

The WeatherManager has a single method which is used to populate the weather facts and the junk weather dimension. The manager uses XDocument to load the REST XML which is provided by OpenWeatherMap.org, and LINQ to XML is used to get the individual fields of the XML. Each weather location from the metadata is iterated. When the XML is loaded, the "lastupdate" field is checked against the latest loaded weather forecast for the location in the WeatherFact context; if the lastupdate value corresponds to the DateKey and TimeKey of that fact, the iteration is skipped. Otherwise the fields of the weather forecast are loaded into a new WeatherFact entity and added to the DAL context. To calculate the delta measures for the different fields, the same object which was used to check the lastupdate is used. Before adding the new WeatherFact entity to the DAL context, the dimension combination is checked: if the combination exists, the key is set in the WeatherKey foreign key of the WeatherFact; otherwise the combination is added as a new dimension entity in the WeatherDimension, and the added key is used for the foreign key in the WeatherFact entity.

The SDEManager has a method for populating the different measurements. The measurements from the SDE are in REST JSON format, so the same loading as for the Weather cannot be used; instead the Parse method from JArray, which comes from the Newtonsoft.Json library, is used. This can parse a JSON object, like the JSON arrays which are provided by the SDE, into dynamic objects in C# (a small sketch of this parsing is shown below). Dynamic objects are resolved at runtime and make it possible to work with the object as with normal properties and fields of objects. To calculate the delta measures the latest fact row is retrieved from the DAL context. Before loading the different measures, the timestamp is checked against the metadata, to ensure only the newest facts are loaded. Once the facts are loaded, the timestamp is saved in the metadata.

The SSISManager has an Execute method, because the loading of data is done through an SSIS package, which is a definition package that defines the data sources, the destinations and a number of transformations. Before executing the package, metadata is loaded which includes the latest loaded id for both Bool and Numeric Records. The SourceRowIdentifier relations are also loaded. These parameters are loaded into the package before executing it. After the execution of the package the errors and warnings are collected to see whether the execution was successful. These metrics, together with the duration of the processing time, are saved in the metadata, where the package has also saved an updated value for the latest loaded id for both types of Records.
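A small sketch of the dynamic JSON parsing described above; the field names in the payload are assumptions, since the real SDE response defines the actual names.

using System;
using Newtonsoft.Json.Linq;

public static class SdeJsonSketch
{
    public static void Run(string json)
    {
        // Parse the JSON array returned by the SDE REST service into dynamic objects.
        dynamic measurements = JArray.Parse(json);

        foreach (dynamic measurement in measurements)
        {
            // "timestamp" and "value" are illustrative field names.
            string timestamp = (string)measurement.timestamp;
            double? value = (double?)measurement.value;
            Console.WriteLine("{0}: {1}", timestamp, value);
        }
    }
}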
Lastly, the SSASManager has two methods: one for connecting to the OLAP cube and one for issuing a processing request. The duration of the processing is saved in the metadata for monitoring. For all the populating methods of the different managers, the duration of the processing is saved in the metadata. The common extensions seen in Figure 7.3.5 are used between the managers, e.g. to derive the DateKey and TimeKey. The DictionaryToXml and the reverse DictionaryFromXml are also used by most of the managers; however, this is done transparently through the entities. Since it is not possible to save a Dictionary object natively in a database using EF, a workaround is used where the Dictionary is serialised to an XML string, which is then saved in the database. To make the use of this as transparent as possible, the Dictionary is available through the entity. Listing 7.3.1 shows an example of the implementation of this. The NotMapped attribute, which decorates the Dictionary property, makes sure that EF does not handle this property. However, EF does handle the AdditionalMetadataXml property: it calls the get method when saving the value of the property to the database, and when loading the entity it calls the set method, which sets the Dictionary property, which can hence be used transparently.


    [Table("Metadata")]
    public class Metadata
    {
        public int Id { get; set; }
        public string System { get; set; }
        public string TableName { get; set; }
        public string Description { get; set; }
        public long? LatestExtractionId { get; set; }
        public DateTime? LatestExtractionDateTime { get; set; }

        // serialise dictionary
        public string AdditionalMetadataXml
        {
            get { return AdditionalMetadata.DictionaryToXml(); }
            set { AdditionalMetadata = value.DictionaryFromXml(); }
        }

        [NotMapped]
        public Dictionary<string, object> AdditionalMetadata { get; private set; }

        public Metadata()
        {
            AdditionalMetadata = new Dictionary<string, object>();
        }
    }

Listing 7.3.1: Metadata class with XML serialisation
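The DictionaryToXml and DictionaryFromXml extension methods themselves are not shown in the listing. A minimal sketch of how such extensions could look is given below, assuming a simple element-per-entry XML layout; note that this simplified version reads values back as strings, so it only illustrates the idea and is not the actual implementation.

    using System.Collections.Generic;
    using System.Linq;
    using System.Xml.Linq;

    // Hypothetical extension methods matching the calls in Listing 7.3.1.
    public static class DictionaryXmlExtensions
    {
        // Serialise the dictionary to an XML string with one <Entry> element per pair.
        public static string DictionaryToXml(this Dictionary<string, object> dictionary)
        {
            var root = new XElement("AdditionalMetadata",
                dictionary.Select(pair => new XElement("Entry",
                    new XAttribute("Key", pair.Key),
                    pair.Value)));
            return root.ToString(SaveOptions.DisableFormatting);
        }

        // Deserialise the XML string back into a dictionary; values come back as
        // strings in this simplified sketch.
        public static Dictionary<string, object> DictionaryFromXml(this string xml)
        {
            var dictionary = new Dictionary<string, object>();
            if (string.IsNullOrWhiteSpace(xml))
                return dictionary;

            foreach (var entry in XElement.Parse(xml).Elements("Entry"))
                dictionary[(string)entry.Attribute("Key")] = entry.Value;

            return dictionary;
        }
    }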

7.3.1.3 Data access

The data access layer contains the contexts for the different database sources; the Sources are auto-generated using EF Database-First. The Destination and Metadata use EF Code-First to generate the database schema. The Metadata package includes four different entities: one for Metadata, one for SourceDestinationRelations for the Data Collection, one for UnmappedGenericRecords, which is used for auditing of the data from the Data Collection, and one for WeatherLocations. The Metadata package is seen in Figure 7.3.6. The Destination package is seen in Figure 7.3.7 and the Sources package is seen in Figure 7.3.8.

The WeatherLocation only has a City field, which holds which city forecasts should be loaded into the data warehouse. The Metadata holds fields for the System and TableName as well as a Description for the Metadata. The LatestExtractionId and LatestExtractionDateTime are used to keep the latest extraction information of the loading; this makes it possible to only load the newest data and ignore the already loaded data. The AdditionalMetadata is used for additional metadata such as performance metrics like the duration of the latest processing, and for values which are not fields, e.g. for the TimeOfDayDimension a flag for whether the dimension is loaded is set in the additional metadata.

The SourceDestinationRelation entity defines a mapping between the records in the Data Collection and the Numeric and Bool Record fact tables. The mapping is used in the SSIS Package. The Order field determines which matching relation should be checked first; this makes it possible to make generic "catch-all" relations and specific relations for more specific items. The matching is done using the SourceRowIdentifier, which is a dictionary that can e.g. contain a HSI Item id; if this matches any of the processed records in the SSIS Package, the row is sent to the DestinationTable. An IncludeDuration field is also made; this is used to minimise unnecessary processing: if the value of this field is false, a duration is not calculated during processing of numeric records. The duration is always calculated for Bool records.

When a processed row in the SSIS package cannot, e.g., be mapped to a destination, the row should not just be discarded; instead the UnmappedGenericRecords is used. Hence this table contains auditing information for the unmapped processed records. This can help to find more SourceRowIdentifiers which are currently not set, or to figure out why a record was not put into a destination table, e.g. if the value was above the set maximum. This makes it possible to optimise the cleaning of the processing of the data. If the unmapped records need to be reloaded, the data warehouse needs to be reloaded, i.e. the data warehouse database should be deleted and the database schema should be regenerated.
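Based on the fields described above, the SourceDestinationRelation entity could be shaped roughly as in the following sketch; the exact property names and types are assumptions, and the SourceRowIdentifier dictionary is assumed to be serialised to XML in the same way as in Listing 7.3.1.

    using System.Collections.Generic;
    using System.ComponentModel.DataAnnotations.Schema;

    // Hypothetical shape of the SourceDestinationRelation entity described above.
    [Table("SourceDestinationRelations")]
    public class SourceDestinationRelation
    {
        public int Id { get; set; }

        // Lower Order values are matched first, so specific relations can take
        // precedence over generic "catch-all" relations.
        public int Order { get; set; }

        // Name of the fact table the matching records should be sent to,
        // or a marker value such as "Ignore".
        public string DestinationTable { get; set; }

        // If false, no duration is calculated for numeric records during processing.
        public bool IncludeDuration { get; set; }

        // Serialised form of the SourceRowIdentifier dictionary (e.g. an HSI Item id).
        public string SourceRowIdentifierXml
        {
            get { return SourceRowIdentifier.DictionaryToXml(); }
            set { SourceRowIdentifier = value.DictionaryFromXml(); }
        }

        [NotMapped]
        public Dictionary<string, object> SourceRowIdentifier { get; private set; }

        public SourceDestinationRelation()
        {
            SourceRowIdentifier = new Dictionary<string, object>();
        }
    }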

7.3.2 SSIS ETL Package

The SSIS package control flow is seen in Figure 7.3.9. It is seen that first the Numeric Data Flow is executed, followed by the execution of the Boolean Data Flow. The Numeric Data Flow Task is seen in Figure 7.3.10. The Boolean Data Flow Task is seen in Figure 7.3.11. The only transformations used in the two data flow tasks are non-blocking transformations. There exist three types of transformations: blocking, non-blocking and partially-blocking transformations. To minimise the processing time, only non-blocking transformations are used. Blocking transformations require all the data to be loaded into the SSIS package before the data can "flow" to the next transformation; a typical blocking transformation is sorting. Sorting is necessary to optimise the calculation of the delta measures, so that the previous record of an item is always loaded right before the next, in a sequence sorted by the date and time and HSI items. By doing this it is not necessary to make as many SQL queries to the source to get the latest value; instead the latest value is loaded in the previous iteration of the transformation, and only the first record of an item needs an SQL query to the source. Because sorting is required and only non-blocking transformations are wanted in the SSIS package, the ordering is done in the source, i.e. by the database, which is much quicker than first loading all data into the SSIS package. Partially-blocking transformations need to load a certain amount of data before rows can continue to the next transformation, and hence these should also be avoided in the SSIS package. Transformations edit the content of columns or add additional columns.


Figure 7.3.3: Web related classes


Figure 7.3.4: ETL business logic classes


Figure 7.3.5: ETL common classes


Figure 7.3.6: Metadata classes


Figure 7.3.7: Destination classes


Figure 7.3.8: Sources classes


Figure 7.3.9: SSIS Package control flow

Figure 7.3.10: SSIS Package numeric data flow

Figure 7.3.11: SSIS Package bool data flow


The Numeric Data Flow, seen in Figure 7.3.10, starts with an OLE DB source with an SQL command from a variable. The variable contains:

    "SELECT * FROM [dbo].[NumericRecords] WHERE [Id] > " + (DT_WSTR, 500) @[User::LastLoadedId] + " ORDER BY [ItemId], [Sent]"

Listing 7.3.2: SQL variable for numeric records

The [User::LastLoadedId] is loaded from the SSIS manager; since it is a long integer, the variable is cast to a string. Next the Audit transform adds the execution start time as a column, which is used for the UnmappedGenericRecords. Next a script component is used to count the rows and get the execution start time, updating the metadata for [System] = 'DataCollection' and [TableName] = 'NumericRecords' with LastLoadRowsLoaded set to the count of rows and ExecutionStartTime set to the execution start time, together with the LatestExtractionDateTime, which is set to the current timestamp. The LastLoadRowsLoaded and ExecutionStartTime are in the AdditionalMetadata. Next the DateKey and TimeKey are derived from the Sent column and added as DateKey and TimeKey columns. Additionally, a SourceSystem = "DataCollection" and a SourceTableName = "NumericRecords" column are inserted, which are used for the UnmappedGenericRecords.

Next a lookup is done to get the DeviceKey and TypeAlternateKey from the DimDevice. These are added as new columns. The lookup is done by using the DeviceAlternateKey from DimDevice and the ItemId from NumericRecords. If the lookup has no match, the row is sent to the UnmappedGenericRecords. Next a lookup is done to get the MinValue and MaxValue from the ItemTypes in the HSIDatabase. The MinValue and MaxValue are added as columns. The lookup is done by using the TypeAlternateKey column and the Id from the ItemTypes table in the HSIDatabase. If the lookup has no match, the row is sent to the UnmappedGenericRecords. Next a conditional split is done to filter rows where the DoubleValue is below the MinValue or the DoubleValue is above the MaxValue; all other rows are sent to the next transform. All the filtered rows are sent to the UnmappedGenericRecords.

Next a script component is used to find the fact table destination based on the SourceRowIdentifiers. If a destination cannot be found or it is set to be ignored, the fact table is set to Error or Ignore and the transformation returns, so no further processing is done for these rows. Next, if the include duration option is set, the duration is calculated based on the LongValue. If the previous LongValue is not the same as the LongValue of the current row, an additional row is inserted in the flow. If the LongValue is the same, a single row is continued in the flow. Next a delta measure is calculated from the DoubleValue. The number of rows processed by the script component is counted and inserted in the AdditionalMetadata with the key LastLoadRowsLoadedByCalc. Next a conditional split is used to route the row to the found destination fact table. If the fact table is set to Ignore or Error, the rows are sent to the UnmappedGenericRecords. Otherwise they are sent to the respective destination. All the destinations use Bulk Insert, which increases the insert performance.

The Boolean Data Flow, seen in Figure 7.3.11, starts with an OLE DB source with an SQL command from a variable. The variable contains:

    "SELECT * FROM [dbo].[BoolRecords] WHERE [Id] > " + (DT_WSTR, 500) @[User::BoolLastLoadedId] + " ORDER BY [ItemId], [Sent]"

Listing 7.3.3: SQL variable for bool records

The [User::BoolLastLoadedId] is loaded from the SSIS manager; since it is a long integer, the variable is cast to a string. The flow is basically the same as for the Numeric Data Flow until the Lookup Type. The type lookup is used to get the TrueState_Id and FalseState_Id from the ItemTypes table in the HSIDatabase. If the lookup has no match, the row is sent to the UnmappedGenericRecords. Next the SwitchButtonKey for the true value is found, based on the SwitchButtonAlternateKey from the DimSwitchButton and the TrueState_Id found previously. If no match is found, the row is sent to the UnmappedGenericRecords. Next the SwitchButtonKey for the false value is found, based on the SwitchButtonAlternateKey from the DimSwitchButton and the FalseState_Id found previously. If no match is found, the row is sent to the UnmappedGenericRecords. Next the fact table and duration are found, just like for the Numeric Data Flow, except the duration is calculated for all rows. If the fact table was not found or it is set to Ignore, the processing of the row is returned, so the duration is not calculated for these. Next a conditional split is done to send the rows to the found fact table destinations. If the fact table is set to Ignore or Error, the rows are sent to the UnmappedGenericRecords. Otherwise the rows are sent to the respective destinations. Bulk Insert is also used for these destinations to increase performance.
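To make the per-item previous-value technique used in both data flows concrete, the following is a minimal sketch of the delta and duration logic, assuming rows arrive sorted by item and timestamp; it does not reproduce the actual SSIS script component code.

    using System;
    using System.Collections.Generic;

    // Illustrative sketch of the per-item delta/duration logic. Because the source
    // rows arrive ordered by ItemId and Sent, only the first row of each item would
    // need a database lookup for its previous value (omitted here).
    public class DeltaCalculator
    {
        private class Previous
        {
            public double Value;
            public DateTime Sent;
        }

        private readonly Dictionary<int, Previous> _previousByItem = new Dictionary<int, Previous>();

        // Returns the delta against the previous value of the same item and the time
        // the previous value was in effect (both zero for the first row of an item).
        public void Process(int itemId, double value, DateTime sent, out double delta, out double durationSeconds)
        {
            delta = 0;
            durationSeconds = 0;

            Previous previous;
            if (_previousByItem.TryGetValue(itemId, out previous))
            {
                delta = value - previous.Value;
                durationSeconds = (sent - previous.Sent).TotalSeconds;
            }

            // Carry the current row forward as the previous value for the next row.
            _previousByItem[itemId] = new Previous { Value = value, Sent = sent };
        }
    }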


7.4 Details on BI application

The implementation of the BI application follows the design. Additional methods are however introduced in the BusinessIntelligenceManager, to be able to generate the reports from the cube query output. The implementation uses C#.NET. The ADOMD.NET library is used to communicate with the SSAS cube. For the relaying of the OLAP cube through the internet, the ISAPI extension MSMDPUMP is used and set up on the Microsoft Internet Information Services (IIS) web server. The MSMDPUMP is put into a separate folder in the web site for the BI solution.
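A minimal sketch of opening a connection to the cube through the MSMDPUMP relay with ADOMD.NET is shown below; the host name, folder and catalog name are placeholders, not the actual deployment values.

    using Microsoft.AnalysisServices.AdomdClient;

    public static class CubeConnectionSketch
    {
        public static AdomdConnection Open()
        {
            // The data source points at the msmdpump.dll relay hosted in IIS;
            // the URL and catalog name below are hypothetical.
            var connection = new AdomdConnection(
                "Data Source=http://bi.example.solardecathlon.dk/OLAP/msmdpump.dll;" +
                "Catalog=EmbraceDataWarehouse");
            connection.Open();
            return connection;
        }
    }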

7.4.1 BI application

The BI application is made as a Windows Service, which has an OnStart and an OnStop method. The OnStop method is not used in this application. The OnStart method instantiates a new P2P broker and subscribes to the messages GetMeasuresRequest, GetDimensionsRequest and GetInsightReportRequest. The respective callback methods are SendResponse(GetMeasuresRequest), SendResponse(GetDimensionsRequest) and SendResponse(GetInsightReportRequest). How these interact with the business logic and the SSAS cube is illustrated in Figures 7.4.1, 7.4.2 and 7.4.3.

7.4.1.1 Get measures

In Figure 7.4.1 the sequence of getting measures is shown. It is seen that first a connection is made to the SSAS cube; then the GetAllMeasures method is called, which sends an SQL schema command to the SSAS cube to get the measure group names. Next the list of measures is retrieved from the cube specified in the configuration file; for this, the cubes on the server are retrieved and the measures list is read. The measure groups and measures are put into Measure objects, and a list of these is published using the GetMeasuresResponse. When the manager is disposed, the connection to the SSAS cube is closed. The reason why an SQL schema command is sent to get the measure groups is because these are not available through the ADOMD.NET library.
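A sketch of such a schema query is shown below, assuming an open AdomdConnection and the MDSCHEMA_MEASUREGROUPS schema rowset; the thesis does not state exactly which schema command is used, so this is only an illustration of the approach.

    using System;
    using System.Data;
    using Microsoft.AnalysisServices.AdomdClient;

    public static class MeasureGroupSketch
    {
        // Reads the measure group names from a schema rowset, since they are not
        // exposed directly by the ADOMD.NET object model.
        public static void PrintMeasureGroups(AdomdConnection connection)
        {
            var restrictions = new AdomdRestrictionCollection();
            DataSet schema = connection.GetSchemaDataSet("MDSCHEMA_MEASUREGROUPS", restrictions);

            foreach (DataRow row in schema.Tables[0].Rows)
            {
                Console.WriteLine(row["MEASUREGROUP_NAME"]);
            }
        }
    }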

7.4.1.2 Get dimensions

In Figure 7.4.2 the sequence of getting dimensions is shown. It is seen that first a connection is made to the SSAS cube; then the GetDimensions method is called, which gets the list of dimensions from the cube specified in the configuration file. For this, the cubes on the server are retrieved and the dimensions list is read. Next an SQL schema command is sent to the SSAS cube to get the measure group names and associated dimension unique names. Next the hierarchies and levels are flattened and turned into Dimension objects. Next the list of named sets is retrieved from the cube already read from SSAS; these named sets are also turned into Dimension objects. The measure groups and dimensions are also put into Dimension objects, and a list of all the dimensions, hierarchies, levels and named sets is published using the GetDimensionsResponse. When the manager is disposed, the connection to the SSAS cube is closed. The reason why an SQL schema command is sent to get the measure groups is because these are not available through the ADOMD.NET library.

7.4.1.3 Generate insight report

In Figure 7.4.3 the sequence of generating reports is shown. It is seen that first a connection is made to the SSAS cube. Next the GetInsightReport method is called, which generates the query using the GenerateQuery method, which is seen in Listing 7.4.1. Next the query is used to get a CellSet from the SSAS cube. The CellSet is a set of cells and axes. The CellSet axes are iterated to get the row and column labels, i.e. axis labels, which are strings and specify the dimensions and measures used in the result. These are multidimensional arrays of strings. Next GetCells is called on the CellSet to get all the cells in a multidimensional array of nullable doubles. To generate the Insight object, i.e. the report, the column and row labels are flattened by concatenating the labels to a single list of column labels and row labels. Next the nullable doubles for the cells are inserted into a list of InsightCells objects. The column and row labels, together with the InsightCells and titles for the column and row, i.e. axis titles, are put into an Insight object and published with a GetInsightReportResponse.

    public string GenerateQuery(List<string> measures, List<string> dimensions,
        List<string> columnDimensions, List<string> filterDimensions)
    {
        var measureList = measures.Aggregate("", (current, measure) =>
            current + (", " + measure)).Remove(0, 2); // include ", " in front
        var columnDimensionsList = columnDimensions.Aggregate("", (current, dimension) =>
            current + (", " + dimension)); // include ", " in front
        var dimensionList = dimensions.Aggregate("", (current, dimension) =>
            current + (" * " + dimension)).Remove(0, 3); // include " * " in front
        var filterList = filterDimensions.Aggregate("", (current, filterDimension) =>
            current + " FROM ( SELECT ( { " + filterDimension + " } ) ON COLUMNS "); // include a FROM SELECT for the filter

        // add missing end-parentheses
        var afterFilterList = "";
        var noOfFilters = filterDimensions.Count;
        for (var j = 0; j < noOfFilters; j++)
        {
            afterFilterList += " ) ";
        }

        // handle multiple measures xor multiple dimensions
        var columns = (!string.IsNullOrEmpty(columnDimensionsList) && measures.Count == 1)
            ? " ( " + measureList + columnDimensionsList + " ) "
            : measureList;

        // resulting MDX query
        return "SELECT NON EMPTY { " + columns + " } ON COLUMNS, NON EMPTY { ( " + dimensionList +
            " ) } ON ROWS " + filterList + " FROM [ " + _cubePerspective + " ] " + afterFilterList;
    }

Listing 7.4.1: GenerateQuery method from BI application

The GenerateQuery method is seen in Listing 7.4.1. The functionality of this method is to make a generic MDX query, based on the four lists of unique name parameters, to get an insight report. The "base" query is seen in the resulting return statement; the variables are inserted where they are needed. The handling of multiple measures or multiple column dimensions makes it possible to use the query result without needing to flatten an extra axis into the insight report.
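As an illustration of the output, a hypothetical call with one measure, one row dimension and no column or filter dimensions (the member names are taken from the report example later in this chapter) would produce roughly the MDX shown in the comment; the cube perspective name is assumed.

    using System.Collections.Generic;

    public static class GenerateQueryUsageSketch
    {
        // Hypothetical usage; assumes GenerateQuery is exposed on the BI manager
        // and that the parameter order follows Listing 7.4.1.
        public static string BuildWeeklyConsumptionQuery(BusinessIntelligenceManager manager)
        {
            return manager.GenerateQuery(
                new List<string> { "[Measures].[Energy Consumption Measurement As kWh]" }, // measures
                new List<string> { "[Date].[Week Calendar].[Week]" },                      // dimensions (rows)
                new List<string>(),                                                        // columnDimensions
                new List<string>());                                                       // filterDimensions

            // With _cubePerspective set to e.g. "Embrace", the result is approximately:
            // SELECT NON EMPTY { [Measures].[Energy Consumption Measurement As kWh] } ON COLUMNS,
            //   NON EMPTY { ( [Date].[Week Calendar].[Week] ) } ON ROWS  FROM [ Embrace ]
        }
    }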

7.4.1.4 Business logic

The additionally introduced methods for the BusinessIntelligenceManager are seen in Figure 7.4.4. Most of the additionally introduced methods are used to get the measures, get the dimensions and generate the insight report, which are specific for the SSAS implementation. The *Count and GetLast* methods are used by the BI web site for a summary of the cube. In addition to the BusinessIntelligenceManager, two new manager classes are also introduced; these are also used by the BI web site. The ActiveDirectoryManager handles resetting the password for the Windows user accounts, which are used externally. The DataWarehouseManager is used to get a summary of the data warehouse; the only method for now is used to get the size of the data warehouse database.
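A sketch of how the database size could be obtained is shown below, assuming direct T-SQL against the data warehouse database; the thesis does not state the exact query the DataWarehouseManager uses.

    using System.Data.SqlClient;

    public static class DatabaseSizeSketch
    {
        // Returns the approximate size of the data warehouse database in MB, summing
        // data and log files from sys.database_files (size is given in 8 KB pages).
        public static double GetDatabaseSizeInMb(string connectionString)
        {
            using (var connection = new SqlConnection(connectionString))
            using (var command = new SqlCommand(
                "SELECT CAST(SUM(size) AS float) * 8 / 1024 FROM sys.database_files", connection))
            {
                connection.Open();
                return (double)command.ExecuteScalar();
            }
        }
    }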

7.4.1.5 Data model

The data model is also slightly changed to add the possibility to get Dimensions and Measures as a Dictionary, with the ParentUniqueNames as keys. The final data model can be seen in Figure 7.4.5.

7.4.2 BI website

The implementation of the BI website is seen in Figure 7.4.6. The BI website is implemented using an MVC paradigm. The website consists of three pages: a home page, which is controlled by the HomeController, a get access page, which is controlled by the AccessController, and a quick guide page, which is controlled by the GuideController. All the pages have one view, which is the index view, i.e. HomeIndex, AccessIndex and GuideIndex. The SharedLayout view contains the shared look and feel of the web site; the default layout for MVC applications is used, which is based on Bootstrap. The quick guide page does not have a model because it is just a text page. The get access page has a confirm checkbox and uses a "Completely Automated Public Turing test to tell Computers and Humans Apart" (CAPTCHA) control. The confirm checkbox is in the AccessModel. The CAPTCHA control used is based on Google's reCAPTCHA plugin, and the HTML for this is generated using the RecaptchaExtension. This is used to protect the site from web crawlers/robots, so the access file, which is a Microsoft Office Data Connection file, is only downloaded by actual users. The home page contains stats about the data warehouse, which are saved in the HomeModel. The stats include the database size of the data warehouse, the number of dimensions, KPIs, measures and named sets in the cube, and when the cube was last updated and processed. The AdGroupAuthorize from the ETL website is also included in the BI website; however, this is not currently used, since the BI website is a public website.

The BI website also includes the MSMDPUMP ISAPI extension. This dll file is in a separate folder, called OLAP, in the website, and the ISAPI extension is set up on the IIS web server. The MSMDPUMP is a relay component provided by Microsoft to relay the SSAS cube over an IIS web server. The configuration of the MSMDPUMP is set to use the localhost SSAS cube. This can be configured in the configuration file in the OLAP folder.

7.4.3 App service

The BI service application is used by the App service component of the EPIC cloud. The format of the exposed BI requests and responses is JSON. The app service is made using ASP.NET Web API, and exposes the objects available from the EPIC Common solution/package. An example of an insight report request is seen in Listing 7.4.2. An example of the corresponding insight report response is seen in Listing 7.4.3. The insight report gives the Energy Consumption Measurement As kWh measure and the Energy Production Measurement As kWh measure, which are defined in the cube, over the three weeks that the house was online: the first two during the competition and the last during the disassembly. This report can then be used e.g. to make graphs in analysis tools.

    PUT http://appservice.epiccloud.solardecathlon.dk/api/Report HTTP/1.1
    User-Agent: Fiddler
    Content-Type: application/json
    Host: appservice.epiccloud.solardecathlon.dk
    Content-Length: 209

    {
      "measures": [
        "[Measures].[Energy Consumption Measurement As kWh]",
        "[Measures].[Energy Production Measurement As kWh]"
      ],
      "primarydimensions": [
        "[Date].[Week Calendar].[Week]"
      ]
    }

Listing 7.4.2: HTTP PUT request

    HTTP/1.1 200 OK
    Cache-Control: no-cache
    Pragma: no-cache
    Content-Type: application/json; charset=utf-8
    Expires: -1
    Server: Microsoft-IIS/8.5
    X-AspNet-Version: 4.0.30319
    X-Powered-By: ASP.NET
    Date: Tue, 29 Jul 2014 16:17:30 GMT
    Content-Length: 556

    {
      "InsightCells": [
        { "xOrdinal": 0, "xValue": "Week 27, 2014", "yValues": [ 24.450000000000589, 66.630000000000081 ] },
        { "xOrdinal": 1, "xValue": "Week 28, 2014", "yValues": [ 37.530000000000868, 105.5300000000001 ] },
        { "xOrdinal": 2, "xValue": "Week 29, 2014", "yValues": [ 4.9699999999999891, 23.279999999999998 ] }
      ],
      "ColumnLabels": [
        "Energy Consumption Measurement As kWh",
        "Energy Production Measurement As kWh"
      ],
      "RowLabels": [ "Week 27, 2014", "Week 28, 2014", "Week 29, 2014" ],
      "RowTitle": "Week",
      "ColumnTitle": "Energy Consumption Measurement As kWh - Energy Production Measurement As kWh"
    }

Listing 7.4.3: HTTP response
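A minimal sketch of a client issuing the request above with HttpClient is shown below; in the actual system the tablet application goes through the App service, so this only illustrates the HTTP exchange shown in Listings 7.4.2 and 7.4.3.

    using System;
    using System.Net.Http;
    using System.Text;
    using System.Threading.Tasks;

    public static class ReportClientSketch
    {
        // Sends the insight report request from Listing 7.4.2 and prints the JSON response.
        public static async Task RequestReportAsync()
        {
            const string json = @"{
                ""measures"": [
                    ""[Measures].[Energy Consumption Measurement As kWh]"",
                    ""[Measures].[Energy Production Measurement As kWh]""
                ],
                ""primarydimensions"": [ ""[Date].[Week Calendar].[Week]"" ]
            }";

            using (var client = new HttpClient())
            {
                var content = new StringContent(json, Encoding.UTF8, "application/json");
                var response = await client.PutAsync(
                    "http://appservice.epiccloud.solardecathlon.dk/api/Report", content);
                Console.WriteLine(await response.Content.ReadAsStringAsync());
            }
        }
    }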


Figure 7.4.1: Sequence of getting measures


Figure 7.4.2: Sequence of getting dimensions

Figure 7.4.3: Sequence of getting report


Figure 7.4.4: BI application classes


Figure 7.4.5: BI application data model


Figure 7.4.6: BI website classes

Chapter 8 Evaluation

"Testing shows the presence, not the absence of bugs" —Edsger W. Dijkstra, 1969

Contents 8.1 Integration tests ...... 141 8.2 System tests ...... 148 8.3 Acceptance tests ...... 151 8.4 BI reports ...... 156

This chapter provides the evaluation and test of the system. The tests range from integration testing to acceptance testing.

8.1 Integration tests

The integration tests consider the integration of the individual components of the DW/BI solution. The integration tests below are done through black-box testing.

8.1.1 Data Collection Service

The Data Collection Service is tested by sending response messages in the P2P network, followed by checking the database tables to see whether the corresponding new rows are created.

All of the following tests passed.

Response to NumericRecords (int), ReadResponse: When a ReadResponse is received, a new NumericRecord entity with RecordType set to RecordType.ReadResponse should be created in the database.
Response to NumericRecords (int), UpdateResponse: When an UpdateResponse is received, a new NumericRecord entity with RecordType set to RecordType.UpdateResponse should be created in the database.
Response to NumericRecords (int), UpdateEvent: When an UpdateEvent is received, a new NumericRecord entity with RecordType set to RecordType.UpdateEvent should be created in the database.
Response to NumericRecords (short): When a response contains a short integer, both the DoubleValue and LongValue should be set in the row.
Response to NumericRecords (int): When a response contains an integer, both the DoubleValue and LongValue should be set in the row.
Response to NumericRecords (long): When a response contains a long integer, both the DoubleValue and LongValue should be set in the row.
Response to NumericRecords (float): When a response contains a float, the DoubleValue should be set in the row.
Response to NumericRecords (double): When a response contains a double, the DoubleValue should be set in the row.
Response to NumericRecords (decimal): When a response contains a decimal, the DoubleValue should be set in the row.
Response to NumericRecords (double), epsilon larger than 0.01: When a response contains a double with an epsilon larger than 0.01, only the DoubleValue should be set in the row. The LongValue should be null.
Response to NumericRecords (double), epsilon smaller than 0.01: When a response contains a double with an epsilon smaller than 0.01, both the DoubleValue and LongValue should be set in the row. The DoubleValue should contain all the decimals, while the LongValue contains a truncated DoubleValue.
Response to BoolRecords (true): When a response contains a boolean which is true, a new BoolRecord entity should be created in BoolRecords with a true BoolValue.
Response to BoolRecords (false): When a response contains a boolean which is false, a new BoolRecord entity should be created in BoolRecords with a false BoolValue.
Response to ErrorRecords (Item = null): When a response contains a null Item, a new ErrorRecord entity should be created in ErrorRecords with the Message "item is null".
Response to ErrorRecords (Timestamp not set): When a response does not contain a Sent timestamp, a new ErrorRecord entity should be created in ErrorRecords with the Message "timestamp not set" and with the XmlValue set to the value inside the Item.
Response to ErrorRecords (Value = null): When a response contains a null Value inside the Item, a new ErrorRecord entity should be created in ErrorRecords with the Message "value is null".
UpdateEvent filtering: When multiple UpdateEvents are received within the same second with the same ItemId and Value, only one record should be created.
Power meter pulse counter filtering: When power meter pulse counter items are received, only the ones where the value is above zero should be created as a row.

Table 8.1.1: Integration tests for Data Collection Service

8.1.2 Data Collection Helper Service

The Data Collection Helper Service is tested by letting the helper service send request messages in the P2P network to the HSI system, followed by checking the database tables to see whether the corresponding new rows are created by the Data Collection Service, and by checking whether the counter was reset.

All of the following tests passed.

Request reading of power meter counter item, count is above zero: When the Helper service requests a reading of a power meter counter item which has a count above zero, the reading should be saved in the database as a new entity in the NumericRecords, and the counter should be reset.
Request reading of power meter counter item, count is zero: When the Helper service requests a reading of a power meter counter item which has a zero count, the reading should not be saved in the database, and the counter should not be reset.

Table 8.1.2: Integration tests for Data Collection Helper Service

8.1.3 ETL

The ETL system is tested by letting the ETL run each of the processing steps individually; the result in the destination database (the data warehouse) is then checked to see whether the corresponding rows are created, edited or deleted.

8.1.3.1 Date and Time

All of the following tests passed.

Date dimension populated with rows for the next 4 years: The date dimension should be populated with a row for each day of the next 4 years, from January 1st, 2014 to December 31st, 2017, which is a total of 1461 rows.
Subsequent processing of date: The date dimension should not be populated with any more rows for subsequent processing of the date dimension, as long as it is within the same year. The next year should first be loaded after December 31st.
Time of day dimension populated with rows for one day: The time of day dimension should be populated with a row for each second of one day, which is a total of 86400 rows.
Subsequent processing of time: The time of day dimension should not be populated with any more rows after the first processing.

Table 8.1.3: Integration tests for ETL Date and Time

8.1.3.2 HSI

All of the following tests passed.

Items in Device dimension: All HSI Items should be in the Device dimension.
Updated HSI Items in Device dimension: When a HSI Item is updated, e.g. the Name, the Device entity is updated in the Device dimension.
Deleted HSI Items in Device dimension: When a HSI Item is deleted, the corresponding Device entity should not be deleted in the Device dimension.
Houses in Device dimension: All Device dimension entities should have the HSI House related to the item.
Updated HSI Houses in Device dimension: When a HSI House is updated, e.g. the Name, the HouseName is updated in the Device dimension.
Floors in Device dimension: All Device dimension entities should have the HSI Floor related to the item.
Updated HSI Floors in Device dimension: When a HSI Floor is updated, e.g. the Name, the FloorName is updated in the Device dimension.
Groups in Grouping dimension: All HSI Groups should be in the Grouping dimension.
Updated HSI Groups in Grouping dimension: When a HSI Group is updated, e.g. the Name, the Grouping entity is updated in the Grouping dimension.
Deleted HSI Groups in Grouping dimension: When a HSI Group is deleted, the corresponding Grouping entity should not be deleted in the Grouping dimension.
States in Switch Button dimension: All HSI States used by bool items loaded in the Device dimension should be in the Switch Button dimension.
Updated HSI States in Switch Button dimension: When a HSI State is updated, e.g. the Name, the Switch Button entity is updated in the Switch Button dimension.
Deleted HSI States in Switch Button dimension: When a HSI State is deleted, the corresponding Switch Button entity should not be deleted in the Switch Button dimension.
Group relation in FactDeviceGroup: All HSI Group relations should be loaded into the FactDeviceGroup.
Deleted group relation in FactDeviceGroup: All deleted HSI Group relations should be deleted from the FactDeviceGroup.

Table 8.1.4: Integration tests for ETL HSI

Note: If a HSI House or Floor is deleted the related items will no longer have a house or floor, hence this is handled the same way as with deleted items, i.e. that the Device entity is kept.


8.1.3.3 Weather

All of the following tests passed.

Weather forecast for locations: A weather forecast should be loaded into the FactWeather table for each location defined in the metadata.
Weather forecast dimension: A combination of the descriptive context should be loaded into the DimWeather table for the weather forecast.
Subsequent weather forecast processing: Only a single row should be loaded for each forecast; if a subsequent processing is done when there is no new forecast available, no more rows should be loaded.

Table 8.1.5: Integration tests for ETL Weather

8.1.3.4 SDE measurements

All of the following tests passed.

All measurements: All the measurements should be loaded into the respective fact tables.

Table 8.1.6: Integration tests for ETL SDE

Note: Sample data points (measurements) have been cross-checked with the corresponding JSON file.

8.1.3.5 Data Collection

All of the following tests passed.

Numeric records: A single NumericRecord row should be loaded into the determined fact table, based on the SourceRowIdentifier, as a single fact row.
Numeric records, duration: A single NumericRecord row which has a different value than the preceding loaded row for the same Item should generate two fact rows: one with the duration and the value of the preceding row, and one fact row with the new value and a zero duration set.
Bool records: A single BoolRecord row should be loaded into the determined fact table, based on the SourceRowIdentifier, as a single fact row.
Bool records, duration: A single BoolRecord row which has the opposite value of the preceding loaded row for the same Item should generate two fact rows: one with the duration and the value of the preceding row, and one fact row with the new value and a zero duration set.
Bool records, no SourceRowIdentifier: If a BoolRecord entity has not been matched with a fact table, the row should be sent to the UnmappedGenericFacts table.
Numeric records, no SourceRowIdentifier: If a NumericRecord entity has not been matched with a fact table, the row should be sent to the UnmappedGenericFacts table, with an "Unable to determine relation to fact table" message.
Numeric records above max value: If a NumericRecord entity has a value above the defined MaxValue in the HSI system, the row should be sent to the UnmappedGenericFacts table, with a "The value is above max value" message.
Numeric records below min value: If a NumericRecord entity has a value below the defined MinValue in the HSI system, the row should be sent to the UnmappedGenericFacts table, with a "The value is below min value" message.
Records ignored: If a NumericRecord or BoolRecord is matched to an Ignore using the SourceRowIdentifier, the row should be sent to the UnmappedGenericFacts table, with a "Measurement ignored by relation filter." message.

Table 8.1.7: Integration tests for ETL Data Collection

8.1.3.6 Cube

All of the following tests passed.

Process cube: When a process cube query is sent, the cube should be processed.

Table 8.1.8: Integration tests for SSAS cube


Note: e cube processing is tested by editing e.g. the name of a device, when the cube has been processed, the new name should appear.

8.1.4 BI application

The BI application is tested by using the App service.

All of the following tests passed.

Get dimensions: All dimensions defined in the SSAS cube should be returned as a list.
Get measures: All measures defined in the SSAS cube should be returned as a list.
Get report (one primary dimension, no secondary dimensions, one measure, no filters): A generated report should be returned with a number of x-axis values (primary dimension rows) and one y-value for each (the measure).
Get report (one primary dimension, no secondary dimensions, one measure, one filter): A generated report should be returned with a number of x-axis values (primary dimension rows) and one y-value for each (the measure); the x-axis is only a subset.
Get report (one primary dimension, one secondary dimension, one measure, no filters): A generated report should be returned with a number of x-axis values (primary dimension rows) and multiple y-values for each (one measure value for each secondary dimension row).
Get report (one primary dimension, no secondary dimensions, two measures, no filters): A generated report should be returned with a number of x-axis values (primary dimension rows) and two y-values for each (one for each measure).
Get report (one primary dimension, one secondary dimension, two measures, no filters): A generated report should be returned with a number of x-axis values (primary dimension rows) and two y-values for each (one for each measure). This should return the same as the above test, because a secondary dimension cannot be defined if multiple measures are defined.

Table 8.1.9: Integration tests for BI application

8.2 System tests

The system tests validate the entire system as a whole. They mostly validate whether the functional requirements are met. The system tests are done through black-box testing. The functional requirements listed in the Analysis chapter are tested using the solution made. The following has been tested using the OLAP cube with the analysis tool Excel. Note that these BI reports can just as well be made using the BI application, e.g. through the App service.

8.2.1 Gain knowledge about the functioning of the house

All of the following tests passed.

Gain knowledge about energy consumption and the context in which it is consumed: Make a BI report with the energy consumption and use e.g. the dimensions Device and Time.
Gain knowledge about energy production and the context in which it is produced: Make a BI report with the energy production with the dimensions Date and Time together with the weather facts.
Gain knowledge about how the devices in the house are used and how long they are in different states: Make a BI report with e.g. the light switch duration measure and use the dimensions State and Time.
Gain knowledge about the temperature, humidity and other indoor comfort conditions which might affect the usage of the house and the context for these: Make a BI report with e.g. the indoor temperature average measure and use the dimensions Date and Time.

Table 8.2.1: System tests for gain knowledge about the functioning of the house

8.2.2 See how elements relate

All of the following tests passed.

Seeing differences in curves in a graph for elements in a house: Add a PivotChart to one of the above made BI reports.
Seeing similarities in curves in a graph for elements in a house: Add a PivotChart to one of the above made BI reports.
Ability to compare graphs to further analyse relationships: Add two PivotCharts for two of the above made BI reports.

Table 8.2.2: System tests for see how elements relate

8.2.3 See how elements of the house and outside conditions relate

All of the following tests passed.

Ability to see outside conditions like temperature and cloud cover: Make a BI report with the measures from the Weather fact table.
Ability to compare graphs from outside conditions with indoor comfort conditions: Make a BI report with the measures from the Weather fact table and add e.g. the indoor temperature average measure.


Table 8.2.3: System tests for see how elements of the house and outside conditions relate

8.2.4 Perceive the numbers as tangible elements

All of the following tests passed.

Ability to easily add graphs to be generated in reports, instead of showing raw numbers: Add a PivotChart.
Ability to summarise as many numbers as possible, so no additional calculations are needed to get the full picture: Dragging measures down to the Values area of a PivotTable automatically gives summarised data.
Ability to change formatting of numbers so they seem familiar to the occupants: This is either done through the cube set-up or by using formatting in Excel.
Ensure understandability of the data model, e.g. by using a tangible object like a cube: Since the data is in a dimensional model, the data model can be thought of as a cube.

Table 8.2.4: System tests for perceive the numbers as tangible elements

8.2.5 Drilling through data to analyse it

All of the following tests passed.

Ability to summarise data from a hierarchy level above the current summarised view (Roll up): In a BI report, remove one of the dimensions or roll up a level for the current dimension to roll up the summarised data.
Ability to summarise data from a hierarchy level below the current summarised view (Drill down): In a BI report, add another dimension or drill down a level for the current dimension to drill down the summarised data.
Ability to summarise data from different measurements (Drill across): Add several measures to a PivotTable in Excel; this uses drill across.
Ability to drill all the way down to the raw data (Drill through): In a PivotTable in Excel, double click on a measure value; this will drill through the data and give the most granular data available for the summarised data.

Table 8.2.5: System tests for drilling through data to analyse it

8.2.6 Slice and dice data cubes

All of the following tests passed.

Ability to specify the view for specific facts: Add a measure to a PivotTable in Excel.
Ability to specify the view for a specific dimension to see the facts: Add a dimension to a PivotTable in Excel.
Ability to give numbers context by means of dimensions: Adding a dimension e.g. to the rows or filter in a PivotTable in Excel makes it possible to find/filter on the context for the numbers.

Table 8.2.6: System tests for slice and dice data cubes

8.2.7 Get relevant and valid data

All of the following tests passed.

Ability to calculate with known metrics: Use e.g. kWh and kW for energy consumption instead of the count of a power meter pulse counter.
Having valid data: By drilling through the various numeric measures, it should be clear that the data is properly cleaned, with no values above the defined maximum value and no values below the defined minimum value.

Table 8.2.7: System tests for get relevant and valid data

Note to the last test: the auditing of the cleaned data during the competition showed that the humidity sensors were not properly set up, since values above 100% were received. This data does not make sense because the humidity sensors give values of relative humidity, so humidity levels above 100% are properly discarded and never shown to the end-users.

8.3 Acceptance tests

The acceptance tests validate whether the user can use the system and thereby whether the non-functional requirements are met, which also includes performance testing. The acceptance tests are done through black-box testing.

8.3.1 Performance tests

Performance tests are conducted to see how well the subsystems perform and how well the overall system performs. The performance tests are conducted on a laptop with an Intel Core i7-2670QM 2.20GHz processor and 16GB of RAM. This laptop is significantly faster than the server used for the cloud, so the processing times below are significantly lower than what is achieved by the server.


8.3.1.1 Processing steps

The processing times for the individual processing steps in the ETL system are given in Table 8.3.1. The average is based on processing 5 times; each time the database is cleared. This means that the processing times are for the first processing. Subsequent processing is not measured, since this will vary depending on how data is added between subsequent processings. Most of the processing time for the subsequent processing will be spent in the SSIS package.

Processing step: average (standard deviation); average per row (standard deviation per row)

Populating DimDate: 2397.16 ms (91.48); 1.64 ms (0.06)
Populating DimTimeOfDay: 93356.61 ms (2237.67); 1.08 ms (0.03)
Populating DimGrouping: 1144.4 ms (61.97); 95.37 ms (5.16)
Populating DimDevice: 1212 ms (81.67); 9.05 ms (0.61)
Populating DimSwitchButton: 109.6 ms (4.16); 4.22 ms (0.16)
Populating FactDeviceGroup: 1193.6 ms (33.15); 8.91 ms (0.25)
Populating FactWeather and DimWeather: 1361.99 ms (225.52); 454 ms (75.17)
Populating FactSDEPower: 20208.22 ms (673.8); 5.85 ms (0.2)
Populating FactSDERoomTemperature: 23050.30 ms (884.89); 6.7 ms (0.26)
Populating FactSDEHumidity: 28715.87 ms (1000.91); 8.31 ms (0.29)
Populating FactSDEAirQuality: 34767.68 ms (1499.45); 10.06 ms (0.43)
Populating FactSDEFridgeFreezerTemperature: 41398.94 ms (2121.82); 11.98 ms (0.61)
Populating FactSDEDishwasherWashingMachineTemperature: 47342.73 ms (2436.13); 13.7 ms (0.70)
Populating FactSDEOvenTemperature: 53283.73 ms (3536.32); 15.42 ms (1.02)
Populating from SSIS Package (10000 rows, 25% bool, 75% numeric): 3152 ms (65.62); 0.32 ms (0.01)
Processing SSAS cube: 14710.01 ms (3858.47); N/A

Table 8.3.1: Performance for ETL processing

Note that for the SSIS package 25% of the rows were bool while 75% were numeric; this was approximately the distribution of the data after 2 weeks of the competition (281253 numeric rows ~77%, 83869 bool rows ~23%). Note that for the SSAS processing the number of rows is not determined, so the average per row is not calculated.

8.3.1.2 Overall processing

The overall processing takes on average 367404 ms (~6 min) for a first processing, with the SSIS package processing 10000 rows.

8.3.1.3 SSIS package

To see how the SSIS package processing scales, different numbers of rows have been tested. Still, the distribution used is 25% bool records and 75% numeric records. In Figure 8.3.1 the processing time vs. number of rows is seen, including a linear trendline. In Figure 8.3.2 the processing time vs. number of rows is seen, including an exponential trendline. The R-squared value reveals that the linear trendline is more appropriate to model the processing times. Hence the processing time vs. number of rows has linear growth. To give an estimate of how long a subsequent processing takes, the above figures can be used.
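As a rough illustration, assuming the measured rate of about 0.32 ms per row for the SSIS package and ignoring fixed overhead, processing 100000 rows would take roughly 100000 x 0.32 ms = 32 s. At the competition rate of roughly 365000 records over two weeks (about 26000 records per day), a daily incremental run of the package would then take on the order of 8 to 9 seconds, plus the roughly 15 seconds of SSAS cube processing measured above.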


Figure 8.3.1: Processing time vs. number of rows - linear trendline

Figure 8.3.2: Processing time vs. number of rows - exponential trendline

8.3.1.4 CPU profiling

To give a better overview of the CPU usage during ETL processing, CPU profiling has been done. The analysis shows that most of the CPU is, during the first processing, used on the following methods: ProcessSDE (69.89%) and ProcessDateTime (27.49%). When looking at the individual methods it is found that the SaveChanges method from the DbContext uses the most CPU (86.81%). Note that since the SSIS package processing is done externally from the ETL process, the processing of this is not included in the above analysis (only the loading of the package).

The analysis shows that most of the CPU is, during subsequent processing, used on the following methods: ProcessSSIS (33.64%), ProcessDateTime (23.34%), ProcessSDE (20.98%) and ProcessHSI (19.91%). When looking at the individual methods it is found that the LoadPackage method used in the SSISManager uses the most CPU (32.66%). 20.66% is used on executing LINQ Where statements, and 8.91% is used on the AddOrUpdate method. Note that since the SSIS package processing is done externally from the ETL process, the processing of this is not included in the above analysis (only the loading of the package).


8.3.1.5 BI application

The performance for the BI application is tested using the App service. The tests were done 5 times and an average was calculated. The results are given in Table 8.3.2.

Test: average (standard deviation)

Getting dimensions: 189.62 ms (11.28)
Getting measures: 180.42 ms (9.84)
Getting generated report with one measure and one primary dimension: 142.81 ms (8.53)
Getting generated report with two measures and one primary dimension: 139.01 ms (6.29)
Getting generated report with one measure, one primary dimension and one secondary dimension: 138.61 ms (4.39)

Table 8.3.2: Performance for BI application

8.3.2 User acceptance tests

Below is an example of end-user usage of the data warehouse. The data warehouse was used during the competition to monitor the energy consumption from different power meters, because the house was not performing as well as other houses. Through a quick analysis done with the data warehouse it was apparent that most of the energy consumed at night was consumed by the pumps and the control systems. This made it possible to act quickly to reduce the amount of energy consumed by the pumps by adjusting the pumps at night and thereby get more points in the overall competition.


8.4 BI reports

The following is a number of example BI reports, most of which can be made through the insight reports from the BI application. If an occupant wants to find out whether there is a relation between the floor, the indoor and the outside weather temperatures during a normal day, the report chart seen in Figure 8.4.1 can be used. This shows the average temperature for the floor, the indoor and the outside temperatures for Versailles during a full day. The dimensions used are the time of day and, to filter the Versailles weather, the weather dimension. The measures used are the average temperatures for the floor temperature, the indoor temperature and the weather temperature. Note that this is an example of a drill across operation between two operational source systems: the Data Collection and the Weather service. In fact the data from the Data Collection comes from two different hardware systems: the Uponor system (floor temperature) and the IHC system (indoor temperature). It is seen in the figure that the indoor and floor temperatures are very slightly dependent on the outside temperature; a small drop is seen during the morning and a small rise is seen during the evening. However, considering the fluctuations of the weather temperature, the indoor and floor temperatures are very stable.

Figure 8.4.1: Average temperature for the floor, the indoor and outside temperatures for Versailles during a full day

If an occupant wants to find out whether there is a relation between the indoor humidity and the outside weather humidity during a normal day in week 27 in Versailles, and check it against the humidity measurement from the SDE organisation, the report chart seen in Figure 8.4.2 can be used. This shows the average indoor, SDE and weather humidity in Versailles during week 27 for a full day. The dimensions used are the time of day, the weather dimension to filter the Versailles weather, and the date dimension to filter the week. The measures used are the average humidity for the indoor humidity, the SDE humidity measurements and the weather humidity. Note that this is also an example of a drill across operation between three operational source systems: the Data Collection, the SDE measurements and the Weather service. It is seen in the figure that the indoor humidity is somewhat dependent on the humidity outside.

Figure 8.4.2: Average indoor, SDE and weather humidity in Versailles during week 27 for a full day

If an occupant wants to find out how many hours on average the dimmable lights are at the different percentage levels, the report chart seen in Figure 8.4.3 can be used. This shows the average dimmable light duration in hours by the percent dimension. Note that this is for all the dimmable lights in the house. It is clearly seen that the lights are used most at 62%.

Figure 8.4.3: Average dimmable light duration in hours by percent level

If an occupant wants to find out what dimmable light level is used during a full day on average by floor, the report chart seen in Figure 8.4.4 can be used. This shows the average dimmable light level measure by a full day on the time dimension and by floors on the device dimension. Note that this is for all the dimmable lights in the house. It is seen in the figure that the light level is constantly around 50% on the ground floor, and almost always turned off on the first floor. At night, however, the light level on both floors is set at a high level. This is because during the competition all lights were required to be on during the night.

Figure 8.4.4: Average dimmable light level by a full day and by floors

If an occupant wants to find out which elements of the house are consuming the most energy, the report chart seen in Figure 8.4.5 can be used. This shows the energy consumption divided by the different groups of power meters. The measure used is the energy consumption in kWh and the dimension used is the device grouping. It is clearly seen that the appliances and the climate systems consume the most energy.

Figure 8.4.5: Energy consumption divided by the different groups of power meters

If an occupant wants to find out how much energy was consumed and how much energy was produced by week, the report chart seen in Figure 8.4.6 can be used. This shows the total amount of kWh consumed and produced by week. The measures used are the total energy consumption in kWh and energy production in kWh by the date dimension. It is clearly seen that the energy produced is much higher than the energy consumed, and that the most energy was produced during week 28. This is most likely due to the fact that only week 28 contains full data: for week 27 the data for the first days was not collected and in week 29 the house was disassembled.

Figure 8.4.6: Total amount of kWh consumed and produced by weeks

If an occupant wants to drill a little further to find out when on a full day the energy consumption and energy production are highest, the report chart seen in Figure 8.4.7 can be used. This shows the total amount of kWh consumed and produced by the hours of a full day. The measures used are the total energy consumption in kWh and energy production in kWh by the time dimension. It is clearly seen that during the night no energy is produced and very little energy is consumed compared to the energy produced during the day. A spike in energy consumption and production is seen at 16 o'clock.

Figure 8.4.7: Total amount of kWh consumed and produced by the hours of a full day

If a researcher, who has defined a KPI for the energy production vs. the energy consumption, wants to find out when during a day the ratio is below the set goal, the BI report seen in Figure 8.4.8 can be used. The figure shows the KPI and a status: when the status is red, the value is below the set goal, and when the status is green, the value is above the set goal. The grand total shows that the set goal is overall achieved for the KPI. It is seen from the KPI value for the grand total that the projected 185% production vs. consumption mentioned in the Context chapter is a very conservative projection, since the value was actually 259% during the competition.

Figure 8.4.8: KPI value and status for energy production vs. energy consumption

Chapter 9 Conclusion

This chapter concludes the thesis by drawing conclusions from the developed data warehouse and Business Intelligence solution.

The principal research question of this thesis was how to give the occupants and other subsystems of an EMBRACE house the insights of the functioning of the house, by measurements done by the control systems. This question was based on the project goals, of which the main goal was inherited from the EMBRACE project, which was to save energy. The project goals decomposed from this were to provide insights by analysing data through finding interesting relations, drilling through data, and slicing and dicing data cubes, and the analysis should be intuitive and simple. The EMBRACE house and the Solar Decathlon Europe 2014 competition were the context for this thesis.

The proposed solution was to build a data warehouse and BI solution which was based on an OLAP cube and a star schema dimensional model. A BI application was also proposed for reporting insights to occupants.

The data warehouse was built using an ETL system which integrated four different operational source systems. These systems were the Hardware System Integration, Data Collection, Weather Data, and SDE Measurements. All the data from these systems were put into a star schema dimensional model, with multiple fact tables and dimension tables. The ETL system consisted of two types of ETL processing, one which was based on Entity Framework and another which was based on SSIS. The SSIS processing outperforms the EF-based ETL, but the EF-based ETL was more flexible.

The data warehouse was exposed through the OLAP system SSAS, which made query performance much better by pre-calculating aggregates, and it provided hierarchical intelligence for the dimensions. The OLAP cube was quickly set up because of the foundation, which was a star schema dimensional model. This meant that the dimensions and measures in the cube had a one-to-one relation with the dimension and fact tables in the star schema.

A BI application was made to make it possible for other subsystems in the cloud-based control system to get generated insight reports. This was mainly used by the App service, which is connected to a tablet application, which can visualise the summarised numbers from the insight reports. The BI application also consisted of a relay, which made it possible to use the cube through the internet

and use it in analysis tools like Excel.

The final implementation was evaluated through a number of integration tests done through black-box testing, which tested the Data Collection service, the ETL system, the BI application, and the integration between these systems. All the integration tests succeeded, which means that the components of the DW/BI solution and the integration between them work as intended. To test the functional requirements of the DW/BI solution, a number of system tests were done through black-box testing. It was shown that all the functional requirements are met by the implemented DW/BI solution. This means that it is possible to gain knowledge about the functioning of the house, see how elements relate, see how elements of the house and outside conditions relate, perceive numbers as tangible elements, drill through data to analyse it, slice and dice data cubes, and get relevant and valid data.

Besides system testing, acceptance testing was also conducted, which included performance testing. These tests showed that the processing times set in the non-functional requirements are mostly met by the DW/BI solution, especially by the SSIS package. They also showed that the SSIS package processing time depends linearly on the number of records processed. CPU profiling of the ETL system was also done, and it showed that during the first processing most CPU time was used to process the SDE measurements and the date and time dimensions. For subsequent processing, the loading of the SSIS package (not its execution) uses the most CPU. It should be noted that the execution of the SSIS package processing was not part of the CPU profiling.

A user acceptance test was also done during the competition, where the data warehouse and BI application were used to get insights about energy consumption. The analysis of the insights resulted in immediate actions to reduce the energy consumption. This showed user acceptance towards the use of the DW/BI solution. Lastly, to evaluate the DW/BI solution, a number of BI reports were made, which showed that valuable knowledge can be gained from the insights provided by the DW/BI solution.

Data warehousing is a concept of delivering data to end-users when and how they want it. This is usually the foundation for delivering Business Intelligence. The data warehouse makes it possible to get relevant data and its context by using dimensional modelling, which is the best way to model a data warehouse. The thesis has shown that because BI solutions are quite generic, it is actually possible to build a data warehouse and BI solution for an intelligent house context.

The typical components of a data warehouse and BI solution are access to a number of operational source systems, an ETL system, a presentation area (referred to as a data warehouse), and BI applications to analyse the data. The data is loaded into the presentation area from the operational source systems using the ETL system. The BI application uses the data warehouse in the presentation area. Data is often presented using dimensions and facts through an OLAP cube which is consumed by BI applications. This is the most efficient, simple and intuitive way to present data to end-users, who can imagine a tangible cube of data consisting of dimensions and facts. To make an efficient ETL system, specific ETL tools can be used, e.g. SSIS. BI reporting can be used to easily gain knowledge by giving an outline and graphical view of the house

functioning and performance.

Returning to the principal research question, the answer must be that it is possible to provide insights into the functioning of the house to occupants and subsystems by using a data warehouse and a BI solution. The DW/BI solution integrates multiple operational source systems through an ETL system and exposes the data in a dimensional model through an OLAP cube, which can be used by a BI application. This allows occupants and subsystems to find interesting relations, drill through data, and slice and dice data cubes, all in an intuitive and simple way of analysing insights. These insights help occupants gain awareness of the energy consumption and thereby save energy, which was the main goal.

Appendix A Third Normal Form

A.1 First normal form

The first normal form rule considers the elimination of repeating groups or items in a single table; these items should be put into a separate table for each group of related items. The new items in the separate table should then be identified by the primary key of the related item. Databases not adhering to the first normal form can usually be identified by the use of numbers after a column name, e.g. Device1, Device2. They can also be identified by a one-to-many relationship where the one side and the many side are in the same table. So instead of having multiple columns in a table for the same group of data, these should be put into a new table. For example, instead of having a house table with multiple columns for each device, the devices should be put into a separate table. If this rule is not adhered to, introducing a new device would require a new column in the house table, which in turn would require database and application modifications. Following this rule makes it possible to add and remove data more dynamically.

House    Device1  Device2  Device3
Embrace  PIR      Light    Power meter

Table A.1.1: Denormalised table

HouseId  House    Device
1        Embrace  PIR
1        Embrace  Light
1        Embrace  Power meter

Table A.1.2: 1NF normalised table
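A minimal T-SQL sketch of the structures shown in Tables A.1.1 and A.1.2 could look as follows. The table and column names are illustrative only and are not taken from the implemented data warehouse.

-- Denormalised table (violates 1NF): one column per device.
CREATE TABLE HouseDenormalised (
    House   NVARCHAR(50) NOT NULL,
    Device1 NVARCHAR(50) NULL,
    Device2 NVARCHAR(50) NULL,
    Device3 NVARCHAR(50) NULL
);

-- 1NF: the repeating device columns become rows in a separate table,
-- identified by the key of the related house.
CREATE TABLE HouseDevice (
    HouseId INT          NOT NULL,
    House   NVARCHAR(50) NOT NULL,
    Device  NVARCHAR(50) NOT NULL
);

INSERT INTO HouseDevice (HouseId, House, Device)
VALUES (1, 'Embrace', 'PIR'),
       (1, 'Embrace', 'Light'),
       (1, 'Embrace', 'Power meter');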

A.2 Second normal form

The second normal form rule considers sets of values which apply to multiple records and multiple tables. Values which are used in more than one table or record should be moved to, or stay in, a single table, where they can be referred to by a foreign key. Databases not adhering to the second normal form can usually be identified by the same values being stored in multiple tables. So instead of having the same values in columns of multiple tables, these values should be put into a separate table. For example, instead of storing the house name in both the house table and a device table, and possibly also in other tables such as the recorded measurements from a sensor device, this information should be kept in the house table or a separate table, and the device and measurement tables should refer to the house information using a foreign key. This makes it possible to find all houses for a given device.

Id  House    System
1   Embrace  IHC
1   Embrace  PLC

Table A.2.1: 1NF normalised table for House entities

Id  House    Device       System
1   Embrace  PIR          IHC
2   Embrace  Light        IHC
3   Embrace  Power meter  PLC

Table A.2.2: 1NF normalised table for Device entities

Id  House    System
1   Embrace  IHC
2   Embrace  PLC

Table A.2.3: 2NF normalised table for House entities

Id  Device       HouseId
1   PIR          1
2   Light        1
3   Power meter  2

Table A.2.4: 2NF normalised table for Device entities
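A hedged T-SQL sketch of the 2NF tables shown in Tables A.2.3 and A.2.4 is given below; the names are illustrative, and each house/system value is stored once and referenced by a foreign key instead of being repeated on every device row.

-- 2NF: each house/system combination is stored once and referenced by a foreign key.
CREATE TABLE House (
    Id     INT          NOT NULL PRIMARY KEY,
    House  NVARCHAR(50) NOT NULL,
    System NVARCHAR(50) NOT NULL
);

CREATE TABLE Device (
    Id      INT          NOT NULL PRIMARY KEY,
    Device  NVARCHAR(50) NOT NULL,
    HouseId INT          NOT NULL REFERENCES House (Id)
);

INSERT INTO House (Id, House, System)
VALUES (1, 'Embrace', 'IHC'),
       (2, 'Embrace', 'PLC');

INSERT INTO Device (Id, Device, HouseId)
VALUES (1, 'PIR', 1),
       (2, 'Light', 1),
       (3, 'Power meter', 2);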

A.3 Third normal form

The third normal form rule considers the elimination of fields which are not dependent on the key. This means that values which do not functionally belong to the other fields in the table should be moved to a separate table. This is done for two reasons: first, it makes it possible to identify records in a table based on the values, and second, it means that updating the values is only required once instead of multiple times in the same table. Databases not adhering to the third normal form can usually be identified by values put into a table where the value is not dependent on the primary key of that table, but may be dependent on a primary key in another table. So instead of having values which depend on keys other than the primary key, these values should be moved to the table of the key they depend on. For example, instead of having the system in the house table, the system, which is not dependent on the house primary key but on the device primary key, should be moved to the device table. This makes it possible to find all houses where a device with a given system is defined, and it ensures that the system only needs to be updated once instead of multiple times in the house table. Adhering to the third normal form is best practice; however, in some applications it might not be practical to put all values which are used in more than one table into separate tables, because this might produce a large number of very small tables, which performance-wise is not advisable. However, if the values in these small tables change frequently, they should stay in a separate table.

Id  House    System
1   Embrace  IHC
2   Embrace  PLC

Table A.3.1: 2NF normalised table for House entities

Id  Device       HouseId
1   PIR          1
2   Light        1
3   Power meter  2

Table A.3.2: 2NF normalised table for Device entities

Id  House
1   Embrace

Table A.3.3: 3NF normalised table for House entities

Id  Device       System  HouseId
1   PIR          IHC     1
2   Light        IHC     1
3   Power meter  PLC     1

Table A.3.4: 3NF normalised table for Device entities
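A corresponding T-SQL sketch of the 3NF tables shown in Tables A.3.3 and A.3.4, where System has been moved to the device it actually depends on, could be written as follows. The names are illustrative; the 3NF suffix is only used to avoid clashing with the 2NF sketch above.

-- 3NF: System depends on the device key, not on the house key, so it is moved
-- to the device table; the house table keeps only attributes of the house itself.
CREATE TABLE House3NF (
    Id    INT          NOT NULL PRIMARY KEY,
    House NVARCHAR(50) NOT NULL
);

CREATE TABLE Device3NF (
    Id      INT          NOT NULL PRIMARY KEY,
    Device  NVARCHAR(50) NOT NULL,
    System  NVARCHAR(50) NOT NULL,
    HouseId INT          NOT NULL REFERENCES House3NF (Id)
);

INSERT INTO House3NF (Id, House)
VALUES (1, 'Embrace');

INSERT INTO Device3NF (Id, Device, System, HouseId)
VALUES (1, 'PIR', 'IHC', 1),
       (2, 'Light', 'IHC', 1),
       (3, 'Power meter', 'PLC', 1);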

Appendix B Additional dimensional modelling techniques

B.1 Facts

Additional dimensional modelling techniques for fact tables are given below.

B.1.1 Textual measures

Textual string values in fact tables should be avoided, since most BI applications cannot do any proper analysis on these. Instead, if the textual values come from a discrete range, they should be put into a dimension table with a reference to it from the fact table. Then most BI applications can do much more, e.g. slice and dice a cube based on these values and thereby summarise data in the fact tables. Textual string values in free-form format, e.g. comments, should be modelled in other ways for a BI application to do any proper analysis. This could for example be to keep a count of the number of occurrences of a word from a discrete range.
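As a hedged sketch, with table and column names invented for illustration rather than taken from the implemented schema, a discrete textual value recorded with a measurement could be modelled like this:

-- Discrete textual values live in a small dimension table...
CREATE TABLE DimWindowState (
    WindowStateKey INT          NOT NULL PRIMARY KEY,
    WindowState    NVARCHAR(20) NOT NULL  -- e.g. 'Open', 'Closed' or 'Tilted'
);

-- ...and the fact table stores only the surrogate key and numeric measures,
-- so a BI application can slice and dice on the state instead of parsing free text.
CREATE TABLE FactWindowMeasurement (
    DateKey           INT NOT NULL,  -- reference to the date dimension
    TimeKey           INT NOT NULL,  -- reference to the time dimension
    WindowStateKey    INT NOT NULL REFERENCES DimWindowState (WindowStateKey),
    DurationInSeconds INT NOT NULL
);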

B.1.2 Fact table types

B.1.2.1 Periodic snapshot based

Periodic snapshot fact tables are summaries of several events, e.g. from a transaction fact table. These summaries are made with a specified periodicity, which could for example be by day, week or year; the grain is then defined by this period. Since these fact tables are summaries, they usually include several facts or measurements. And since they are populated at the specified frequency, they will have a row for each tick, so the fact tables tend to contain a vast number of rows. Even if no measurements were taken, rows will still be present with zero or null values. Periodic snapshot fact tables are usually calculated by the ETL system, since these measurement values might not exist in the operational source system. The number of rows sometimes means that a separate periodic fact table is created to hold data at a reduced frequency after a certain period of time has passed for the first table. For example, instead of keeping fact table rows with a daily periodicity for years, the last three months could be kept in that fact table and a new fact table could then be used to keep older data with a quarterly periodicity.
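A minimal sketch of what a daily periodic snapshot fact table could look like is given below. The table, keys and measures are assumptions for illustration and not the fact tables actually built in the thesis.

-- Periodic snapshot: one row per house per day (the grain is the period),
-- summarising the underlying transaction-level measurements.
CREATE TABLE FactEnergyDailySnapshot (
    DateKey                INT           NOT NULL,  -- reference to the date dimension
    HouseKey               INT           NOT NULL,  -- reference to the house dimension
    EnergyConsumptionInKWh DECIMAL(10,3) NOT NULL,  -- zero when nothing was measured
    EnergyProductionInKWh  DECIMAL(10,3) NOT NULL,
    MeasurementCount       INT           NOT NULL,
    CONSTRAINT PK_FactEnergyDailySnapshot PRIMARY KEY (DateKey, HouseKey)
);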

B.1.2.2 Accumulating snapshot based

The last type of fact table is the accumulating snapshot fact table, which is also a summarised fact table. These fact tables have measurements for specific events and usually model steps in a flow, where each column defines a step. The columns could e.g. be foreign keys to a date dimension for when the event/step occurred, or foreign keys for the state at each step. The columns are mostly foreign keys to dimensions and a few numeric values. For this type of fact table most foreign keys to dimensions might point to "not applicable" or "not occurred yet" dimension rows. The rows of an accumulating snapshot fact table will usually be instances of the process it models. These fact tables usually only contain "lag" measurements and counts as numeric values; the lag is the time between one step and another, also called the duration, and the counts track progress, e.g. how far in the process the instance is. These fact table rows are constantly being updated, e.g. by the ETL system, unlike the other two types of fact tables, which are only processed once. This is done to fill the remaining columns as the process progresses. Since this type of fact table usually has several references to a date dimension, dimension roles are often used to uniquely identify the differences between the dates. These date dimensions will also need a "to be determined" and/or an "unknown" dimension row for when the columns are not yet populated, because the step has not yet occurred or is not done. This type of fact table can be used if a process has definite steps. It is mostly used to model pipeline processes, and it can be used to identify bottlenecks, since these can be found by the lag calculations.
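A hedged sketch of an accumulating snapshot fact table for an imagined maintenance flow (not a process from the implemented warehouse) could look like this, with one date foreign key per step and lag measures that are updated as the process progresses.

-- Accumulating snapshot: one row per process instance, with a date key per step.
-- Steps that have not occurred yet point at a 'to be determined'/'unknown' date row.
CREATE TABLE FactDeviceMaintenance (
    DeviceKey              INT NOT NULL,
    ReportedDateKey        INT NOT NULL,  -- step 1: issue reported
    InspectedDateKey       INT NOT NULL,  -- step 2: updated when the step occurs
    RepairedDateKey        INT NOT NULL,  -- step 3: updated when the step occurs
    ReportToInspectLagDays INT NULL,      -- lag (duration) between step 1 and step 2
    InspectToRepairLagDays INT NULL,      -- lag (duration) between step 2 and step 3
    CompletedStepCount     INT NOT NULL   -- how far this instance has progressed
);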

B.1.3 Centipede

A centipede fact table can occur when a junk dimension is not used and a multitude of dimensions are used for a single fact table. This should be avoided and can be avoided by using a junk dimension. It usually happens when dimensions are not combined properly. The reason they are called centipede fact tables is that they have a multitude of dimension foreign keys. It is often seen when dimension hierarchies are normalised and thereby placed in separate dimension tables, instead of a single denormalised dimension table. It can also occur when various flags and indicators are in separate dimension tables, instead of a junk dimension. This type of fact table should be avoided, since it will end up consuming a lot of space, the users will have a hard time joining the tables because there are so many, and the query performance will also deteriorate because of the many joins needed.
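The sketch below, with invented flag names, shows how a handful of low-cardinality flags can be combined into a single junk dimension, so the fact table carries one foreign key instead of one per flag.

-- Junk dimension: every combination of the miscellaneous flags becomes one row,
-- so the fact table needs a single IndicatorKey instead of many flag foreign keys.
CREATE TABLE DimIndicator (
    IndicatorKey     INT NOT NULL PRIMARY KEY,
    IsManualOverride BIT NOT NULL,
    IsVacationMode   BIT NOT NULL,
    IsEstimatedValue BIT NOT NULL
);

CREATE TABLE FactMeasurement (
    DateKey      INT           NOT NULL,
    TimeKey      INT           NOT NULL,
    DeviceKey    INT           NOT NULL,
    IndicatorKey INT           NOT NULL REFERENCES DimIndicator (IndicatorKey),
    Measurement  DECIMAL(10,3) NOT NULL
);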

B.2 Dimension

Additional dimensional modelling techniques for dimension tables are given below.


B.2.1 Snowflake schema

Sometimes dimension tables are normalised using third normal form for many-to-one relationships, which is called snowflaking because of the increase in joins needed to get the full verbose dimension table. This normalisation causes a decrease in performance, and if the dimensions are presented in this way, it will also decrease the understandability of the dimensions, so it should be avoided. The reason it is called a snowflake schema is the relationships from a dimension to many separate tables. The same information in snowflake schema dimensions can be represented in a single flattened, denormalised dimension. Normalising a dimension table is usually a waste of time, since the space required for dimensions is significantly less than what is used by fact tables. A denormalised dimension table is also easier to use, since multiple dimension tables do not need to be joined when querying a fact table. This also improves performance, because fewer joins are needed for the same results.
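As a hedged illustration with invented names, a device dimension that would otherwise be snowflaked out to separate house and floor tables can instead be flattened into one denormalised dimension table:

-- Flattened, denormalised dimension: house and floor attributes are repeated on
-- every device row, so no extra joins are needed when querying a fact table.
CREATE TABLE DimDevice (
    DeviceKey  INT          NOT NULL PRIMARY KEY,
    Device     NVARCHAR(50) NOT NULL,
    DeviceType NVARCHAR(50) NOT NULL,
    House      NVARCHAR(50) NOT NULL,
    Floor      NVARCHAR(50) NOT NULL
);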

B.2.2 Outriggers

Dimensions can refer to other dimensions; these are called outrigger dimensions. Outriggers and snowflake schemas are two different types of design; however, using multiple outriggers can lead to a snowflake schema design, which should be avoided. An outrigger could e.g. be a floor dimension that is referred to by a house dimension. If possible, a fact table should include both dimensions and the reference from the dimension to the outrigger should be removed. It should be noted that the columns in a dimension table should be constrained to only the columns or attributes which define a specific context for the end-user. For example, it might be reasonable to have a device dimension table with a house and a floor column, but it is just as reasonable to have separate house and floor dimension tables. Depending on the usage by the end-user, one or the other can be chosen. Outrigger dimensions are often used when multiple date dimension references are needed for a single fact table. The outrigger dimension would then include a date role dimension, i.e. a date dimension exposed in a view with a different name. An outrigger can also be used to reuse/repurpose a dimension for multiple usages, as well as to keep low-cardinality information for a dimension which could otherwise distract from the usage of the dimension. This low-cardinality data could e.g. be information about a house location: the town, the number of inhabitants and the yearly average number of mm of rain. This type of information would distract from the usage of a house dimension, so a location dimension outrigger could be made to hold it. Because outriggers have the same effects on a dimensional model as a snowflake schema, they should be kept to a minimum.
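A small sketch of an outrigger, again with hypothetical names, could be a location dimension referenced from the house dimension, keeping the rarely used, low-cardinality attributes out of the main dimension:

-- Outrigger: low-cardinality location details live in their own dimension table
-- and are referenced from the house dimension.
CREATE TABLE DimLocation (
    LocationKey         INT          NOT NULL PRIMARY KEY,
    Town                NVARCHAR(50) NOT NULL,
    Inhabitants         INT          NOT NULL,
    YearlyAverageRainMm INT          NOT NULL
);

CREATE TABLE DimHouse (
    HouseKey    INT          NOT NULL PRIMARY KEY,
    House       NVARCHAR(50) NOT NULL,
    LocationKey INT          NOT NULL REFERENCES DimLocation (LocationKey)
);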

B.2.3 Degenerate dimensions

There are situations where a separate dimension table is not needed. These situations happen when a dimension only has a primary key that can be used as a dimension; since it does not make sense to create a dimension table for these keys alone, they are left in the fact table. Such dimensions are referred to as degenerate dimensions. They often represent a transaction number identifying a transaction from an operational source system; these numbers can be used as dimensions, but they have no relevant context other than the number itself. Degenerate dimensions can often be found in dimensional models where a dimension has almost the same number of rows as the fact table it is related to. This can be a sign that a degenerate dimension exists in the dimension table and should be moved to the fact table.
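A degenerate dimension can be sketched as follows (illustrative names): the transaction number stays directly in the fact table, and no dimension table is created for it.

-- Degenerate dimension: TransactionNumber can be grouped and filtered on like a
-- dimension, but there is no dimension table because it has no other attributes.
CREATE TABLE FactEnergyReading (
    DateKey           INT           NOT NULL,
    TimeKey           INT           NOT NULL,
    DeviceKey         INT           NOT NULL,
    TransactionNumber NVARCHAR(30)  NOT NULL,  -- degenerate dimension
    EnergyInKWh       DECIMAL(10,3) NOT NULL
);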

B.2.4 Role playing dimensions

Sometimes a dimension can be used more than once in the same fact table for different purposes. This could e.g. be the date and time dimensions, where the fact table contains different values for when a light was turned on and when it was turned off. When the same dimension is used in different contexts, it is referred to as a role-playing dimension. The fact table will have a foreign key to a view of the same dimension with different table and column names, which makes the roles appear separate to the end-users, even though the same dimension table is actually being used behind the scenes.
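A hedged T-SQL sketch of how a role-playing date dimension can be exposed through views is given below; the names are illustrative and not taken from the implemented cube. The fact table then references one role-specific view per purpose.

-- One physical date dimension...
CREATE TABLE DimDate (
    DateKey INT  NOT NULL PRIMARY KEY,
    [Date]  DATE NOT NULL
);
GO

-- ...played in two roles through views with role-specific table and column names.
CREATE VIEW DimDateTurnedOn AS
    SELECT DateKey AS TurnedOnDateKey, [Date] AS TurnedOnDate FROM DimDate;
GO

CREATE VIEW DimDateTurnedOff AS
    SELECT DateKey AS TurnedOffDateKey, [Date] AS TurnedOffDate FROM DimDate;
GO

-- The fact table refers to the same dimension twice, once per role.
CREATE TABLE FactLight (
    TurnedOnDateKey   INT NOT NULL,
    TurnedOffDateKey  INT NOT NULL,
    DurationInSeconds INT NOT NULL
);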

Appendix C SSAS Cube

C.1 Calculation script

An excerpt of the calculation script defined in the SSAS cube is given in Listing C.1.1.

CALCULATE;

/* Average duration */
CREATE MEMBER CURRENTCUBE.[Measures].[Dimmable Light Average Duration In Seconds]
    AS [Measures].[Dimmable Light Duration In Seconds] / [Measures].[Dimmable Light Count],
    VISIBLE = 1, ASSOCIATED_MEASURE_GROUP = 'Dimmable Light';

/* Measurement averages */
CREATE MEMBER CURRENTCUBE.[Measures].[Humidity Average]
    AS [Measures].[Humidity Measurement] / [Measures].[Humidity Count],
    VISIBLE = 1, ASSOCIATED_MEASURE_GROUP = 'Humidity';
CREATE MEMBER CURRENTCUBE.[Measures].[Indoor Temperature Average]
    AS [Measures].[Indoor Temperature Measurement] / [Measures].[Indoor Temperature Count],
    VISIBLE = 1, ASSOCIATED_MEASURE_GROUP = 'Indoor Temperature';

/* Date named sets */
CREATE DYNAMIC SET CURRENTCUBE.[Date - Today]
    AS STRTOMEMBER('[Date].[Month Calendar].[Date].&[' + FORMAT(Now(), 'yyyyMMdd') + ']');
CREATE DYNAMIC SET CURRENTCUBE.[Date - Yesterday]
    AS STRTOMEMBER('[Date].[Month Calendar].[Date].&[' + FORMAT(Now(), 'yyyyMMdd') + ']').LAG(1);
CREATE DYNAMIC SET CURRENTCUBE.[Date - This Month]
    AS STRTOMEMBER('[Date].[Month Calendar].[Month].&[' + FORMAT(Now(), 'yyyy') + ']&[' + FORMAT(Now(), '%M') + ']');
CREATE DYNAMIC SET CURRENTCUBE.[Date - Last Month]
    AS STRTOMEMBER('[Date].[Month Calendar].[Month].&[' + FORMAT(Now(), 'yyyy') + ']&[' + FORMAT(Now(), '%M') + ']').LAG(1);
CREATE DYNAMIC SET CURRENTCUBE.[Date - This Year]
    AS STRTOMEMBER('[Date].[Month Calendar].[Year].&[' + FORMAT(Now(), 'yyyy') + ']');
CREATE DYNAMIC SET CURRENTCUBE.[Date - Last 12 Months]
    AS STRTOMEMBER('[Date].[Month Calendar].[Month].&[' + FORMAT(Now(), 'yyyy') + ']&[' + FORMAT(Now(), '%M') + ']').LAG(12)
     : STRTOMEMBER('[Date].[Month Calendar].[Month].&[' + FORMAT(Now(), 'yyyy') + ']&[' + FORMAT(Now(), '%M') + ']').LAG(1);
CREATE DYNAMIC SET CURRENTCUBE.[Date - Last 6 Months]
    AS STRTOMEMBER('[Date].[Month Calendar].[Month].&[' + FORMAT(Now(), 'yyyy') + ']&[' + FORMAT(Now(), '%M') + ']').LAG(6)
     : STRTOMEMBER('[Date].[Month Calendar].[Month].&[' + FORMAT(Now(), 'yyyy') + ']&[' + FORMAT(Now(), '%M') + ']').LAG(1);
CREATE DYNAMIC SET CURRENTCUBE.[Date - Last 3 Months]
    AS STRTOMEMBER('[Date].[Month Calendar].[Month].&[' + FORMAT(Now(), 'yyyy') + ']&[' + FORMAT(Now(), '%M') + ']').LAG(3)
     : STRTOMEMBER('[Date].[Month Calendar].[Month].&[' + FORMAT(Now(), 'yyyy') + ']&[' + FORMAT(Now(), '%M') + ']').LAG(1);

/* Time named sets */
CREATE DYNAMIC SET CURRENTCUBE.[Time - Midnight]
    AS [Time].[Full Day].[Hour].&[0] : [Time].[Full Day].[Hour].&[1];
CREATE DYNAMIC SET CURRENTCUBE.[Time - Late Night]
    AS [Time].[Full Day].[Hour].&[1] : [Time].[Full Day].[Hour].&[5];
CREATE DYNAMIC SET CURRENTCUBE.[Time - Morning]
    AS [Time].[Full Day].[Hour].&[5] : [Time].[Full Day].[Hour].&[12];
CREATE DYNAMIC SET CURRENTCUBE.[Time - Noon]
    AS [Time].[Full Day].[Hour].&[12] : [Time].[Full Day].[Hour].&[13];
CREATE DYNAMIC SET CURRENTCUBE.[Time - Afternoon]
    AS [Time].[Full Day].[Hour].&[13] : [Time].[Full Day].[Hour].&[18];
CREATE DYNAMIC SET CURRENTCUBE.[Time - Evening]
    AS [Time].[Full Day].[Hour].&[18] : [Time].[Full Day].[Hour].&[21];
CREATE DYNAMIC SET CURRENTCUBE.[Time - Night]
    AS [Time].[Full Day].[Hour].&[21] : [Time].[Full Day].[Hour].&[23];
CREATE DYNAMIC SET CURRENTCUBE.[Time - This Hour]
    AS STRTOMEMBER('[Time].[Full Day].[Hour].&[' + FORMAT(Now(), 'HH') + ']');
CREATE DYNAMIC SET CURRENTCUBE.[Time - Last Hour]
    AS STRTOMEMBER('[Time].[Full Day].[Hour].&[' + FORMAT(Now(), 'HH') + ']').LAG(1);

Listing C.1.1: Excerpt of calculation script

C.2 Dimension usage

The dimension usage defined in the SSAS cube is seen in Figures C.2.1, C.2.2, C.2.3 and C.2.4.

Figure C.2.1: SSAS cube dimension usage part 1
Figure C.2.2: SSAS cube dimension usage part 2
Figure C.2.3: SSAS cube dimension usage part 3

Figure C.2.4: SSAS cube dimension usage part 4

Appendix D ETL

D.1 Data Warehouse entities

The data warehouse entities from the ETL implementation are seen in Figures 6.3.2, D.1.1, D.1.2, D.1.3, D.1.4 and D.1.5. Figure 6.3.2 is in the Design chapter.


Figure D.1.1: Bool record entities


Figure D.1.2: Numeric record entities


Figure D.1.3: Numeric record with duration entities


Figure D.1.4: Other fact entities


Figure D.1.5: SDE fact entities

184