6 Designing Conventional Data Warehouses
Total Page:16
File Type:pdf, Size:1020Kb
6 Designing Conventional Data Warehouses The development of a data warehouse is a complex and costly endeavor. A data warehouse project is similar in many aspects to any software develop- ment project and requires definition of the various activities that must be performed, which are related to requirements gathering, design, and imple- mentation into an operational platform, among other things. Even though there is an abundant literature in the area of software development (e.g., [48, 248, 282]), few publications have been devoted to the development of data warehouses. Some of these publications [15, 114, 119, 146, 242] have been written by practitioners and are based on their experience in building data warehouses. On the other hand, the scientific community has proposed a variety of approaches for developing data warehouses [28, 29, 35, 42, 47, 79, 82, 114, 167, 203, 221, 237, 246]. Nevertheless, many of these approaches target a specific conceptual model and are often too complex to be used in real-world environments. As a consequence, there is still a lack of a method- ological framework that could guide developers in the various stages of the data warehouse development process. This situation results from the fact that the need to build data warehouse systems arose before the definition of formal approaches to data warehouse development, as was the case for operational databases [166]. In this chapter, we propose a general method for conventional-data- warehouse design that unifies existing approaches. This method allows de- signers and developers to better understand the alternative approaches that can be used for data warehouse design, helping them to choose the one that best fits their needs. To keep the scope and complexity of this chapter within realistic boundaries, we shall not cover the overall data warehouse develop- ment process, but shall focus instead only on data warehouse design. The chapter starts by presenting in Sect. 6.1 an overview of the existing approaches to data warehouse design. Then, in Sect. 6.2, we refer to the various phases that make up the data warehouse design process. Since data warehouses are a particular type of databases devoted to analytical processing [29], our approach is in line with traditional database design, i.e., it includes the phases 252 6 Designing Conventional Data Warehouses of requirements specification, conceptual design, logical design, and physical design. In Sect. 6.3, we present a university case study that we then use throughout the chapter to show the applicability of the proposed method and to facilitate understanding of the various phases. The subsequent sections are devoted to more detailed descriptions of each design phase. In Sect. 6.4, we propose three different approaches to require- ments specification. These approaches differ in that users, source systems, or both may be the driving force for specifying the requirements. Section 6.5 cov- ers the conceptual-design phase for data warehouses. With respect to database design, this phase includes some additional steps required for delivering con- ceptual schemas that represent users’ requirements, but also take into account the availability of data in the source systems. Section 6.6 summarizes the three approaches to data warehouse development, highlighting their advantages and disadvantages. Next, in Sects. 6.7 and 6.8, we provide insights into the logical and physi- cal design phases for conventional data warehouses. Both of these phases con- sider the structure of the data warehouse, as well as the related extraction- transformation-loading (ETL) processes. Since we have already referred in Sect. 3.5 to the logical representation of data warehouses, in this chapter we present only a short example of a transformation of a conceptual multidimen- sional schema into a logical schema using the SQL:2003 standard. Then, we identify several physical features that can considerably improve the perfor- mance of analytical queries. We give examples of their implementation using Oracle 10g. Sect. 6.9 summarizes the proposed methodological framework for conventional-data-warehouse design. Finally, Sect. 6.10 surveys related work, and Sect. 6.11 summarizes this chapter. 6.1 Current Approaches to Data Warehouse Design There is a wide variety of approaches that have been proposed for designing data warehouses. They differ in several aspects, such as whether they target data warehouse or data mart design, the various phases that make up the design process, and the methods used for specifying requirements. This sec- tion highlights some of the essential characteristics of the current approaches according to these aspects. 6.1.1 Data Mart and Data Warehouse Design A data warehouse includes data about an entire organization that is used by users at high management levels in order to support strategic decisions. However, these decisions may also be taken at lower organizational levels related to specific business areas, in which case only a subset of the data contained in a data warehouse is required. This subset is typically contained in a data mart (see Sect. 2.9), which has a similar structure to a data warehouse 6.1 Current Approaches to Data Warehouse Design 253 but is smaller in size. Data marts may be physically collocated with the data warehouse or may have their own separate platform [114]. Similarly to the design of operational databases (Sect. 2.1), there are two major methods for the design of a data warehouse and its related data marts [146]: • Top-down design: The requirements of users at different organizational levels are merged before the design process begins, and one schema for the entire data warehouse is built. Afterwards, separate data marts are tailored according to the particular characteristics of each business area or process. • Bottom-up design: A separate schema is built for each data mart, taking into account the requirements of the decision-making users responsible for the corresponding specific business area or process. Later, these schemas are merged, forming a global schema for the entire data warehouse. The planning and implementation of an enterprise-wide data warehouse using the top-down approach is an overwhelming task for most organizations in terms of cost and duration. It is also a challenging activity for designers [146] because of its size and complexity. On the other hand, the reduced size of data marts allows a business to earn back the cost of building them in a shorter time period and facilitates the design and implementation processes. Leading practitioners usually use the bottom-up approach in developing data warehouses [114, 146]. However, this requires a global data warehouse framework to be established so that the data marts are built considering their future integration into a whole data warehouse. A lack of such a global framework may cause a data warehouse project to fail [221], since different data marts may contain the same data using different formats or structures. Various frameworks can be applied to achieve this. For example, in the framework of Imhoff et al. [114] a global data warehouse schema is first devel- oped. Then, a prototype data mart is built and its structure is mapped into the data warehouse schema. The mapping process is repeated for each subsequent data mart. On the other hand, Kimball et al. [146] have proposed a frame- work called the data warehouse bus architecture, in which dimensions and facts shared between different data marts must be conformed. A dimension is con- formed when it is identical in each data mart that uses it. Similarly, a fact is conformed if it has the same semantics (for instance, the same terminology, granularity, and units for its measures) across all data marts. Once conformed dimensions and facts have been defined, new data marts may be brought into the data warehouse in an incremental manner, ensuring their compatibility with the already-existing data marts. An approach to building a data warehouse by integrating heterogeneous data marts has been studied in [37, 38, 34, 300]. Cabibbo and Torlone [37] introduced the notion of dimension compatibility, which extends the notion of conformed dimensions in [146]. Intuitively, two dimensions belonging to differ- ent data marts are compatible when their common information is consistent. 254 6 Designing Conventional Data Warehouses Compatible dimensions allow the data of various data marts to be combined and correlated. These authors proposed two strategies for the integration of data marts. The first one consists in identifying the compatible dimensions and performing drill-across queries over autonomous data marts [37]. The sec- ond approach allows designers to merge the various data marts by means of materialized views and to perform queries against these views [38]. A tool that integrates these two strategies has been implemented [34, 300]. 6.1.2 Design Phases To the best of our knowledge, relatively few publications (e.g., [15, 47, 83, 114, 119, 146, 169, 247]) have proposed an overall method for data warehouse design. However, these publications do not agree on the phases that should be followed in designing data warehouses. Some authors consider that the tradi- tional phases of developing operational databases (as described in Sect. 2.1), i.e., requirements specification, conceptual design, logical design, and physical design, can also be used in developing data warehouses or data marts. Other authors ignore some of these phases, especially the conceptual-design phase. Many publications refer to only one of the phases, without considering the subsequent transformations needed to achieve implementable solutions. Some proposals consider that the development of data warehouse systems is rather different from the development of operational database systems. On the one hand, they include additional phases, such as workload refinement [83] or the ETL process [146, 169]. On the other hand, they provide various meth- ods for the requirements specification phase, to which we refer in Sect.