
(Appears in ACM SIGMOD Record, March 1997)

An Overview of Data Warehousing and OLAP Technology

Surajit Chaudhuri, Microsoft Research, Redmond ([email protected])
Umeshwar Dayal, Hewlett-Packard Labs, Palo Alto ([email protected])

Abstract

Data warehousing and on-line analytical processing (OLAP) are essential elements of decision support, which has increasingly become a focus of the database industry. Many commercial products and services are now available, and all of the principal database management system vendors now have offerings in these areas. Decision support places some rather different requirements on database technology compared to traditional on-line transaction processing applications. This paper provides an overview of data warehousing and OLAP technologies, with an emphasis on their new requirements. We describe back end tools for extracting, cleaning and loading data into a data warehouse; multidimensional data models typical of OLAP; front end client tools for querying and data analysis; server extensions for efficient query processing; and tools for metadata management and for managing the warehouse. In addition to surveying the state of the art, this paper also identifies some promising research issues, some of which are related to problems that the database research community has worked on for years, but others are only just beginning to be addressed. This overview is based on a tutorial that the authors presented at the VLDB Conference, 1996.

1. Introduction

Data warehousing is a collection of decision support technologies, aimed at enabling the knowledge worker (executive, manager, analyst) to make better and faster decisions. The past three years have seen explosive growth, both in the number of products and services offered, and in the adoption of these technologies by industry. According to the META Group, the data warehousing market, including hardware, database software, and tools, is projected to grow from $2 billion in 1995 to $8 billion in 1998. Data warehousing technologies have been successfully deployed in many industries: manufacturing (for order shipment and customer support), retail (for user profiling and inventory management), financial services (for claims analysis, risk analysis, credit card analysis, and fraud detection), transportation (for fleet management), telecommunications (for call analysis and fraud detection), utilities (for power usage analysis), and healthcare (for outcomes analysis). This paper presents a roadmap of data warehousing technologies, focusing on the special requirements that data warehouses place on database management systems (DBMSs).

A data warehouse is a "subject-oriented, integrated, time-varying, non-volatile collection of data that is used primarily in organizational decision making."1 Typically, the data warehouse is maintained separately from the organization's operational databases. There are many reasons for doing this. The data warehouse supports on-line analytical processing (OLAP), the functional and performance requirements of which are quite different from those of the on-line transaction processing (OLTP) applications traditionally supported by the operational databases.

OLTP applications typically automate clerical data processing tasks such as order entry and banking transactions that are the bread-and-butter day-to-day operations of an organization. These tasks are structured and repetitive, and consist of short, atomic, isolated transactions. The transactions require detailed, up-to-date data, and read or update a few (tens of) records accessed typically on their primary keys. Operational databases tend to be hundreds of megabytes to gigabytes in size. Consistency and recoverability of the database are critical, and maximizing transaction throughput is the key performance metric. Consequently, the database is designed to reflect the operational semantics of known applications, and, in particular, to minimize concurrency conflicts.

Data warehouses, in contrast, are targeted for decision support. Historical, summarized and consolidated data is more important than detailed, individual records. Since data warehouses contain consolidated data, perhaps from several operational databases, over potentially long periods of time, they tend to be orders of magnitude larger than operational databases; enterprise data warehouses are projected to be hundreds of gigabytes to terabytes in size. The workloads are query intensive with mostly ad hoc, complex queries that can access millions of records and perform a lot of scans, joins, and aggregates. Query throughput and response times are more important than transaction throughput.

To facilitate complex analyses and visualization, the data in a warehouse is typically modeled multidimensionally. For example, in a sales data warehouse, time of sale, sales district, salesperson, and product might be some of the dimensions of interest. Often, these dimensions are hierarchical; time of sale may be organized as a day-month-quarter-year hierarchy, product as a product-category-industry hierarchy.

Typical OLAP operations include rollup (increasing the level of aggregation) and drill-down (decreasing the level of aggregation or increasing detail) along one or more dimension hierarchies, slice_and_dice (selection and projection), and pivot (re-orienting the multidimensional view of data).

Given that operational databases are finely tuned to support known OLTP workloads, trying to execute complex OLAP queries against the operational databases would result in unacceptable performance. Furthermore, decision support requires data that might be missing from the operational databases; for instance, understanding trends or making predictions requires historical data, whereas operational databases store only current data. Decision support usually requires consolidating data from many heterogeneous sources: these might include external sources such as stock market feeds, in addition to several operational databases. The different sources might contain data of varying quality, or use inconsistent representations, codes and formats, which have to be reconciled. Finally, supporting the multidimensional data models and operations typical of OLAP requires special data organization, access methods, and implementation methods, not generally provided by commercial DBMSs targeted for OLTP. It is for all these reasons that data warehouses are implemented separately from operational databases.

Data warehouses might be implemented on standard or extended relational DBMSs, called Relational OLAP (ROLAP) servers. These servers assume that data is stored in relational databases, and they support extensions to SQL and special access and implementation methods to efficiently implement the multidimensional data model and operations. In contrast, multidimensional OLAP (MOLAP) servers are servers that directly store multidimensional data in special data structures (e.g., arrays) and implement the OLAP operations over these special data structures.

There is more to building and maintaining a data warehouse than selecting an OLAP server and defining a schema and some complex queries for the warehouse. Different architectural alternatives exist. Many organizations want to implement an integrated enterprise warehouse that collects information about all subjects (e.g., customers, products, sales, assets, personnel) spanning the whole organization. However, building an enterprise warehouse is a long and complex process, requiring extensive business modeling, and may take many years to succeed. Some organizations are settling for data marts instead, which are departmental subsets focused on selected subjects (e.g., a marketing data mart may include customer, product, and sales information). These data marts enable faster roll out, since they do not require enterprise-wide consensus, but they may lead to complex integration problems in the long run, if a complete business model is not developed.

In Section 2, we describe a typical data warehousing architecture, and the process of designing and operating a data warehouse. In Sections 3-7, we review relevant technologies for loading and refreshing data in a data warehouse, warehouse servers, front end tools, and warehouse management tools. In each case, we point out what is different from traditional database technology, and we mention representative products. In this paper, we do not intend to provide comprehensive descriptions of all products in every category. We encourage the interested reader to look at recent issues of trade magazines such as Databased Advisor, Database Programming and Design, Datamation, and DBMS Magazine, and vendors' Web sites for more details of commercial products, white papers, and case studies. The OLAP Council2 is a good source of information on standardization efforts across the industry, and a paper by Codd, et al.3 defines twelve rules for OLAP products. Finally, a good source of references on data warehousing and OLAP is the Data Warehousing Information Center4.

Research in data warehousing is fairly recent, and has focused primarily on query processing and view maintenance issues. There still are many open research problems. We conclude in Section 8 with a brief mention of these issues.

2. Architecture and End-to-End Process

Figure 1 shows a typical data warehousing architecture.

[Figure 1. Data Warehousing Architecture: operational databases and external sources feed extract, transform, load and refresh tools; the data warehouse and departmental data marts are served to OLAP servers and to analysis, query/reporting, and data mining tools; a metadata repository plus monitoring and administration tools oversee the whole system.]

It includes tools for extracting data from multiple operational databases and external sources; for cleaning, transforming and integrating this data; for loading data into the data warehouse; and for periodically refreshing the warehouse to reflect updates at the sources and to purge data from the warehouse, perhaps onto slower archival storage. In addition to the main warehouse, there may be several departmental data marts. Data in the warehouse and data marts is stored and managed by one or more warehouse servers, which present multidimensional views of data to a variety of front end tools: query tools, report writers, analysis tools, and data mining tools. Finally, there is a repository for storing and managing metadata, and tools for monitoring and administering the warehousing system.

The warehouse may be distributed for load balancing, scalability, and higher availability. In such a distributed architecture, the metadata repository is usually replicated with each fragment of the warehouse, and the entire warehouse is administered centrally. An alternative architecture, implemented for expediency when it may be too expensive to construct a single logically integrated enterprise warehouse, is a federation of warehouses or data marts, each with its own repository and decentralized administration.

Designing and rolling out a data warehouse is a complex process, consisting of the following activities5.
• Define the architecture, do capacity planning, and select the storage servers, database and OLAP servers, and tools.
• Integrate the servers, storage, and client tools.
• Design the warehouse schema and views.
• Define the physical warehouse organization, data placement, partitioning, and access methods.
• Connect the sources using gateways, ODBC drivers, or other wrappers.
• Design and implement scripts for data extraction, cleaning, transformation, load, and refresh.
• Populate the repository with the schema and view definitions, scripts, and other metadata.
• Design and implement end-user applications.
• Roll out the warehouse and applications.

3. Back End Tools and Utilities

Data warehousing systems use a variety of data extraction and cleaning tools, and load and refresh utilities for populating warehouses. Data extraction from "foreign" sources is usually implemented via gateways and standard interfaces (such as Information Builders EDA/SQL, ODBC, Oracle Open Connect, Sybase Enterprise Connect, Informix Enterprise Gateway).

Data Cleaning

Since a data warehouse is used for decision making, it is important that the data in the warehouse be correct. However, since large volumes of data from multiple sources are involved, there is a high probability of errors and anomalies in the data. Therefore, tools that help to detect data anomalies and correct them can have a high payoff. Some examples where data cleaning becomes necessary are: inconsistent field lengths, inconsistent descriptions, inconsistent value assignments, missing entries and violation of integrity constraints. Not surprisingly, optional fields in data entry forms are significant sources of inconsistent data.

There are three related, but somewhat different, classes of data cleaning tools. Data migration tools allow simple transformation rules to be specified; e.g., "replace the string gender by sex". Warehouse Manager from Prism is an example of a popular tool of this kind. Data scrubbing tools use domain-specific knowledge (e.g., postal addresses) to do the scrubbing of data. They often exploit parsing and fuzzy matching techniques to accomplish cleaning from multiple sources. Some tools make it possible to specify the "relative cleanliness" of sources. Tools such as Integrity and Trillium fall in this category. Data auditing tools make it possible to discover rules and relationships (or to signal violation of stated rules) by scanning data. Thus, such tools may be considered variants of data mining tools. For example, such a tool may discover a suspicious pattern (based on statistical analysis) that a certain car dealer has never received any complaints.
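The kind of simple transformation rule cited above can be expressed directly in SQL. The sketch below is illustrative only: the staging table customer_stage and its columns are hypothetical, and the exact ALTER TABLE syntax varies across DBMSs. It renames a field and standardizes inconsistent value assignments before the data moves on to the warehouse.

    -- Rule 1: "replace the string gender by sex" (a simple field rename).
    ALTER TABLE customer_stage RENAME COLUMN gender TO sex;

    -- Rule 2: standardize inconsistent value assignments coming from
    -- different sources; unknown codes are set to NULL for later auditing.
    UPDATE customer_stage
    SET sex = CASE
                WHEN UPPER(sex) IN ('F', 'FEMALE') THEN 'F'
                WHEN UPPER(sex) IN ('M', 'MALE')   THEN 'M'
                ELSE NULL
              END;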
Load

After extracting, cleaning and transforming, data must be loaded into the warehouse. Additional preprocessing may still be required: checking integrity constraints; sorting; summarization, aggregation and other computation to build the derived tables stored in the warehouse; building indices and other access paths; and partitioning to multiple target storage areas. Typically, batch load utilities are used for this purpose. In addition to populating the warehouse, a load utility must allow the system administrator to monitor status, to cancel, suspend and resume a load, and to restart after failure with no loss of data integrity.

The load utilities for data warehouses have to deal with much larger data volumes than for operational databases. There is only a small time window (usually at night) when the warehouse can be taken offline to refresh it. Sequential loads can take a very long time, e.g., loading a terabyte of data can take weeks and months! Hence, pipelined and partitioned parallelism are typically exploited6. Doing a full load has the advantage that it can be treated as a long batch transaction that builds up a new database. While it is in progress, the current database can still support queries; when the load transaction commits, the current database is replaced with the new one. Using periodic checkpoints ensures that if a failure occurs during the load, the process can restart from the last checkpoint.

However, even using parallelism, a full load may still take too long. Most commercial utilities (e.g., RedBrick Table Management Utility) use incremental loading during refresh to reduce the volume of data that has to be incorporated into the warehouse. Only the updated tuples are inserted. However, the load process now is harder to manage. The incremental load conflicts with ongoing queries, so it is treated as a sequence of shorter transactions (which commit periodically, e.g., after every 1000 records or every few seconds), but now this sequence of transactions has to be coordinated to ensure consistency of derived data and indices with the base data.
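To make the distinction between a full load and an incremental load concrete, the following hedged sketch uses plain SQL over hypothetical sales_stage and sales_fact tables (transaction syntax and CREATE TABLE AS support vary by DBMS; autocommit is assumed to be off). The incremental variant is issued as a series of short, individually committed batches, exactly the pattern described above.

    -- Full load: build the new fact table as one long batch transaction and
    -- switch queries over to it only after the load commits.
    CREATE TABLE sales_fact_new AS
      SELECT * FROM sales_stage;
    -- (build indices and check integrity constraints here)
    COMMIT;

    -- Incremental load during refresh: insert only the changed tuples,
    -- committing in short batches so ongoing queries are not blocked for long.
    INSERT INTO sales_fact SELECT * FROM sales_stage WHERE batch_id = 1;
    COMMIT;
    INSERT INTO sales_fact SELECT * FROM sales_stage WHERE batch_id = 2;
    COMMIT;
    -- ...one short transaction per batch; each batch must also be reflected in
    -- the summary tables and indices, which is the coordination problem noted above.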

Refresh

Refreshing a warehouse consists in propagating updates on source data to correspondingly update the base data and derived data stored in the warehouse. There are two sets of issues to consider: when to refresh, and how to refresh. Usually, the warehouse is refreshed periodically (e.g., daily or weekly). Only if some OLAP queries need current data (e.g., up to the minute stock quotes), is it necessary to propagate every update. The refresh policy is set by the warehouse administrator, depending on user needs and traffic, and may be different for different sources.

Refresh techniques may also depend on the characteristics of the source and the capabilities of the database servers. Extracting an entire source file or database is usually too expensive, but may be the only choice for legacy data sources. Most contemporary database systems provide replication servers that support incremental techniques for propagating updates from a primary database to one or more replicas. Such replication servers can be used to incrementally refresh a warehouse when the sources change. There are two basic replication techniques: data shipping and transaction shipping.

In data shipping (e.g., used in the Oracle Replication Server, Praxis OmniReplicator), a table in the warehouse is treated as a remote snapshot of a table in the source database. After_row triggers are used to update a snapshot log table whenever the source table changes; and an automatic refresh schedule (or a manual refresh procedure) is then set up to propagate the updated data to the remote snapshot.
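The snapshot-log mechanism behind data shipping can be sketched with an after-row trigger. The example below uses Oracle-flavored trigger syntax and hypothetical orders tables; it is an illustration of the idea, not the implementation of any particular replication server.

    -- Source side: a snapshot log table and an after-row trigger that records
    -- which rows of the replicated table have changed.
    CREATE TABLE orders_snapshot_log (
      order_no  NUMBER,
      change_ts DATE
    );

    CREATE OR REPLACE TRIGGER orders_after_row
    AFTER INSERT OR UPDATE ON orders
    FOR EACH ROW
    BEGIN
      INSERT INTO orders_snapshot_log VALUES (:NEW.order_no, SYSDATE);
    END;

    -- Periodic (or manually invoked) refresh: ship the logged rows to the
    -- warehouse-side snapshot, then clear the log. Rows that were updated
    -- would first have to be removed from the snapshot (omitted for brevity).
    INSERT INTO orders_snapshot
      SELECT o.* FROM orders o
      WHERE o.order_no IN (SELECT order_no FROM orders_snapshot_log);
    DELETE FROM orders_snapshot_log;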

In transaction shipping (e.g., used in the Sybase Replication Server and Microsoft SQL Server), the regular transaction log is used, instead of triggers and a special snapshot log table. At the source site, the transaction log is sniffed to detect updates on replicated tables, and those log records are transferred to a replication server, which packages up the corresponding transactions to update the replicas. Transaction shipping has the advantage that it does not require triggers, which can increase the workload on the operational source databases. However, it cannot always be used easily across DBMSs from different vendors, because there are no standard APIs for accessing the transaction log.

Such replication servers have been used for refreshing data warehouses. However, the refresh cycles have to be properly chosen so that the volume of data does not overwhelm the incremental load utility.

In addition to propagating changes to the base data in the warehouse, the derived data also has to be updated correspondingly. The problem of constructing logically correct updates for incrementally updating derived data (materialized views) has been the subject of much research7 8 9 10. For data warehousing, the most significant classes of derived data are summary tables, single-table indices and join indices.

4. Conceptual Model and Front End Tools

A popular conceptual model that influences the front-end tools, database design, and the query engines for OLAP is the multidimensional view of data in the warehouse. In a multidimensional data model, there is a set of numeric measures that are the objects of analysis. Examples of such measures are sales, budget, revenue, inventory, ROI (return on investment). Each of the numeric measures depends on a set of dimensions, which provide the context for the measure. For example, the dimensions associated with a sale amount can be the city, product name, and the date when the sale was made. The dimensions together are assumed to uniquely determine the measure. Thus, the multidimensional data model views a measure as a value in the multidimensional space of dimensions. Each dimension is described by a set of attributes. For example, the Product dimension may consist of four attributes: the category and the industry of the product, year of its introduction, and the average profit margin. For example, the soda Surge belongs to the category beverage and the food industry, was introduced in 1996, and may have an average profit margin of 80%. The attributes of a dimension may be related via a hierarchy of relationships. In the above example, the product name is related to its category and the industry attribute through such a hierarchical relationship.

[Figure 2. Multidimensional data: a measure arranged over the dimensions Product, City, and Date, with hierarchical summarization paths Industry > Category > Product, Country > State > City, and Year > Quarter > Month/Week > Date.]

Another distinctive feature of the conceptual model for OLAP is its stress on aggregation of measures by one or more dimensions as one of the key operations; e.g., computing and ranking the total sales by each county (or by each year). Other popular operations include comparing two measures (e.g., sales and budget) aggregated by the same dimensions. Time is a dimension that is of particular significance to decision support (e.g., trend analysis). Often, it is desirable to have built-in knowledge of calendars and other aspects of the time dimension.

Front End Tools

The multidimensional data model grew out of the view of business data popularized by PC spreadsheet programs that were extensively used by business analysts. The spreadsheet is still the most compelling front-end application for OLAP. The challenge in supporting a query environment for OLAP can be crudely summarized as that of supporting spreadsheet operations efficiently over large multi-gigabyte databases. Indeed, the Essbase product of Arbor Corporation uses Microsoft Excel as the front-end tool for its multidimensional engine.

We shall briefly discuss some of the popular operations that are supported by the multidimensional spreadsheet applications. One such operation is pivoting. Consider the multidimensional schema of Figure 2 represented in a spreadsheet where each row corresponds to a sale. Let there be one column for each dimension and an extra column that represents the amount of sale. The simplest view of pivoting is that it selects two dimensions that are used to aggregate a measure, e.g., sales in the above example. The aggregated values are often displayed in a grid where each value in the (x,y) coordinate corresponds to the aggregated value of the measure when the first dimension has the value x and the second dimension has the value y. Thus, in our example, if the selected dimensions are city and year, then the x-axis may represent all values of city and the y-axis may represent the years. The point (x,y) will represent the aggregated sales for city x in the year y. Thus, what were values in the original spreadsheets have now become row and column headers in the pivoted spreadsheet.
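In relational terms, the pivot just described amounts to aggregating the measure over the two chosen dimensions; each resulting group supplies one cell of the pivoted grid. A minimal sketch, assuming a hypothetical flattened sales table with one row per sale:

    -- Aggregate the sale amount over the two selected dimensions; the front
    -- end lays the groups out as a grid with city on one axis, year on the other.
    SELECT city, year, SUM(sale_amount) AS total_sales
    FROM sales
    GROUP BY city, year;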

Other operators related to pivoting are rollup or drill-down. Rollup corresponds to taking the current data object and doing a further group-by on one of the dimensions. Thus, it is possible to roll-up the sales data, perhaps already aggregated on city, additionally by product. The drill-down operation is the converse of rollup. Slice_and_dice corresponds to reducing the dimensionality of the data, i.e., taking a projection of the data on a subset of dimensions for selected values of the other dimensions. For example, we can slice_and_dice sales data for a specific product to create a table that consists of the dimensions city and the day of sale. The other popular operators include ranking (sorting), selections and defining computed attributes.

Although the multidimensional spreadsheet has attracted a lot of interest since it empowers the end user to analyze business data, this has not replaced traditional analysis by means of a managed query environment. These environments use stored procedures and predefined complex queries to provide packaged analysis tools. Such tools often make it possible for the end-user to query in terms of domain-specific business data. These applications often use raw data access tools and optimize the access patterns depending on the back end database server. In addition, there are query environments (e.g., Microsoft Access) that help build ad hoc SQL queries by "pointing-and-clicking". Finally, there are a variety of data mining tools that are often used as front end tools to data warehouses.

5. Database Design Methodology

The multidimensional data model described above is implemented directly by MOLAP servers. We will describe these briefly in the next section. However, when a relational ROLAP server is used, the multidimensional model and its operations have to be mapped into relations and SQL queries. In this section, we describe the design of relational database schemas that reflect the multidimensional views of data.

Entity Relationship diagrams and normalization techniques are popularly used for database design in OLTP environments. However, the database designs recommended by ER diagrams are inappropriate for decision support systems where efficiency in querying and in loading data (including incremental loads) are important. Most data warehouses use a star schema to represent the multidimensional data model. The database consists of a single fact table and a single table for each dimension. Each tuple in the fact table consists of a pointer (foreign key, often using a generated key for efficiency) to each of the dimensions that provide its multidimensional coordinates, and stores the numeric measures for those coordinates. Each dimension table consists of columns that correspond to attributes of the dimension. Figure 3 shows an example of a star schema.

[Figure 3. A Star Schema: a central fact table (OrderNo, SalespersonID, CustomerNo, ProdNo, DateKey, CityName, Quantity, TotalPrice) surrounded by the dimension tables Order, Salesperson, Customer, Product, Date, and City.]
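The star schema of Figure 3 can be written down directly in SQL. The sketch below is deliberately simplified: only three of the dimensions are shown, attribute lists are abbreviated, and the Date dimension is named Date_dim to avoid a reserved word. It is followed by a typical star-join query that restricts on dimension attributes and aggregates the measure.

    CREATE TABLE Product (
      ProdNo     INTEGER PRIMARY KEY,   -- generated key
      ProdName   VARCHAR(40),
      Category   VARCHAR(20),
      UnitPrice  DECIMAL(9,2)
    );
    CREATE TABLE City (
      CityName   VARCHAR(30) PRIMARY KEY,
      State      VARCHAR(20),
      Country    VARCHAR(20)
    );
    CREATE TABLE Date_dim (
      DateKey    INTEGER PRIMARY KEY,
      SaleDate   DATE,
      Month      INTEGER,
      Quarter    INTEGER,
      Year       INTEGER
    );
    -- The fact table holds one foreign key per dimension plus the measures
    -- (the Order, Customer, and Salesperson dimensions are omitted here).
    CREATE TABLE Sales_fact (
      ProdNo     INTEGER REFERENCES Product,
      CityName   VARCHAR(30) REFERENCES City,
      DateKey    INTEGER REFERENCES Date_dim,
      Quantity   INTEGER,
      TotalPrice DECIMAL(12,2)
    );

    -- A typical star join: restrict on dimension attributes, join to the
    -- fact table, and aggregate the measure.
    SELECT p.Category, d.Year, SUM(f.TotalPrice) AS total_sales
    FROM Sales_fact f
    JOIN Product  p ON f.ProdNo   = p.ProdNo
    JOIN Date_dim d ON f.DateKey  = d.DateKey
    JOIN City     c ON f.CityName = c.CityName
    WHERE c.State = 'WA'
    GROUP BY p.Category, d.Year;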

Star schemas do not explicitly provide support for attribute hierarchies. Snowflake schemas provide a refinement of star schemas where the dimensional hierarchy is explicitly represented by normalizing the dimension tables, as shown in Figure 4. This leads to advantages in maintaining the dimension tables. However, the denormalized structure of the dimensional tables in star schemas may be more appropriate for browsing the dimensions.

Fact constellations are examples of more complex structures in which multiple fact tables share dimensional tables. For example, projected expense and the actual expense may form a fact constellation since they share many dimensions.

[Figure 4. A Snowflake Schema: the schema of Figure 3 with normalized dimension tables, e.g., Product split into Product and Category tables, Date split into Date, Month, and Year tables, and City split into City and State tables.]
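Continuing the SQL sketch above, snowflaking the Product dimension normalizes its hierarchy into a separate Category table (names are again illustrative); browsing by category then costs one extra join compared to the star schema.

    -- Snowflaked Product dimension: the category level gets its own table.
    CREATE TABLE Category (
      CategoryName  VARCHAR(20) PRIMARY KEY,
      CategoryDescr VARCHAR(80)
    );
    CREATE TABLE Product_sf (
      ProdNo       INTEGER PRIMARY KEY,
      ProdName     VARCHAR(40),
      CategoryName VARCHAR(20) REFERENCES Category  -- hierarchy made explicit
    );

    -- The same aggregation as before now traverses the normalized hierarchy.
    SELECT c.CategoryDescr, SUM(f.TotalPrice) AS total_sales
    FROM Sales_fact f
    JOIN Product_sf p ON f.ProdNo = p.ProdNo
    JOIN Category   c ON p.CategoryName = c.CategoryName
    GROUP BY c.CategoryDescr;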

In addition to the fact and dimension tables, data warehouses store selected summary tables containing pre-aggregated data. In the simplest cases, the pre-aggregated data corresponds to aggregating the fact table on one or more selected dimensions. Such pre-aggregated summary data can be represented in the database in at least two ways. Let us consider the example of a summary table that has total sales by product by year in the context of the star schema of Figure 3. We can represent such a summary table by a separate fact table which shares the dimension Product and also a separate shrunken dimension table for time, which consists of only the attributes of the dimension that make sense for the summary table (i.e., year). Alternatively, we can represent the summary table by encoding the aggregated tuples in the same fact table and the same dimension tables without adding new tables. This may be accomplished by adding a new level field to each dimension and using nulls: we can encode a day, a month or a year in the Date dimension table as follows: (id0, 0, 22, 01, 1960) represents a record for Jan 22, 1960, (id1, 1, NULL, 01, 1960) represents the month Jan 1960, and (id2, 2, NULL, NULL, 1960) represents the year 1960. The second attribute represents the new attribute level: 0 for days, 1 for months, 2 for years. In the fact table, a record containing the foreign key id2 represents the aggregated sales for a product in the year 1960. The latter method, while reducing the number of tables, is often a source of operational errors since the level field needs to be carefully interpreted.
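A hedged sketch of the first representation described above, reusing the hypothetical tables from the earlier star-schema sketch: a separate summary fact table that shares the Product dimension and references a shrunken time dimension holding only the year (the alternative level-field encoding is not shown).

    -- Shrunken time dimension: only the attribute meaningful at this level.
    CREATE TABLE Year_dim (
      YearKey INTEGER PRIMARY KEY,
      Year    INTEGER
    );
    INSERT INTO Year_dim (YearKey, Year)
    SELECT DISTINCT Year, Year FROM Date_dim;

    -- Separate summary fact table: total sales by product by year.
    CREATE TABLE Sales_by_product_year (
      ProdNo     INTEGER REFERENCES Product,
      YearKey    INTEGER REFERENCES Year_dim,
      TotalPrice DECIMAL(14,2)
    );
    INSERT INTO Sales_by_product_year
    SELECT f.ProdNo, y.YearKey, SUM(f.TotalPrice)
    FROM Sales_fact f
    JOIN Date_dim d ON f.DateKey = d.DateKey
    JOIN Year_dim y ON y.Year = d.Year
    GROUP BY f.ProdNo, y.YearKey;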

6. Warehouse Servers

Data warehouses may contain large volumes of data. To answer queries efficiently, therefore, requires highly efficient access methods and query processing techniques. Several issues arise. First, data warehouses use redundant structures such as indices and materialized views. Choosing which indices to build and which views to materialize is an important physical design problem. The next challenge is to effectively use the existing indices and materialized views to answer queries. Optimization of complex queries is another important problem. Also, while for data-selective queries, efficient index scans may be very effective, data-intensive queries need the use of sequential scans. Thus, improving the efficiency of scans is important. Finally, parallelism needs to be exploited to reduce query response times. In this short paper, it is not possible to elaborate on each of these issues. Therefore, we will only briefly touch upon the highlights.

Index Structures and their Usage

A number of query processing techniques that exploit indices are useful. For instance, the selectivities of multiple conditions can be exploited through index intersection. Other useful index operations are unions of indexes. These index operations can be used to significantly reduce, and in many cases eliminate, the need to access the base tables.

Warehouse servers can use bit map indices, which support efficient index operations (e.g., union, intersection). Consider a leaf page in an index structure corresponding to a domain value d. Such a leaf page traditionally contains a list of the record ids (RIDs) of records that contain the value d. However, bit map indices use an alternative representation of the above RID list as a bit vector that has one bit for each record, which is set when the domain value for that record is d. In a sense, the bit map index is not a new index structure, but simply an alternative representation of the RID list. The popularity of the bit map index is due to the fact that the bit vector representation of the RID lists can speed up index intersection, union, join, and aggregation11. For example, if we have a query of the form column1 = d & column2 = d', then we can identify the qualifying records by taking the AND of the two bit vectors. While such representations can be very useful for low cardinality domains (e.g., gender), they can also be effective for higher cardinality domains through compression of bitmaps (e.g., run length encoding). Bitmap indices were originally used in Model 204, but many products support them today (e.g., Sybase IQ). An interesting question is to decide on which attributes to index. In general, this is really a question that must be answered by the physical database design process.

In addition to indices on single tables, the specialized nature of star schemas makes join indices especially attractive for decision support. While traditionally indices map the value in a column to a list of rows with that value, a join index maintains the relationships between a foreign key and its matching primary keys.

In the context of a star schema, a join index can relate the values of one or more attributes of a dimension table to matching rows in the fact table. For example, consider the schema of Figure 3. There can be a join index on City that maintains, for each city, a list of RIDs of the tuples in the fact table that correspond to sales in that city. Thus a join index essentially precomputes a binary join. Multikey join indices can represent precomputed n-way joins. For example, over the Sales database it is possible to construct a multidimensional join index from (Cityname, Productname) to the fact table. Thus, the index entry for (Seattle, jacket) points to RIDs of those tuples in the Sales table that have the above combination. Using such a multidimensional join index can sometimes provide savings over taking the intersection of separate indices on Cityname and Productname. Join indices can be used with bitmap representations for the RID lists for efficient join processing12.

Finally, decision support databases contain a significant amount of descriptive text and so indices to support text search are useful as well.

Materialized Views and their Usage

Many queries over data warehouses require summary data, and, therefore, use aggregates. Hence, in addition to indices, materializing summary data can help to accelerate many common queries. For example, in an investment environment, a large majority of the queries may be based on the performance of the most recent quarter and the current fiscal year. Having summary data on these parameters can significantly speed up query processing.

The challenges in exploiting materialized views are not unlike those in using indices: (a) identify the views to materialize, (b) exploit the materialized views to answer queries, and (c) efficiently update the materialized views during load and refresh. The currently adopted industrial solutions to these problems consider materializing views that have a relatively simple structure. Such views consist of joins of the fact table with a subset of dimension tables (possibly after some selections on those dimensions), with the aggregation of one or more measures grouped by a set of attributes from the dimension tables. The structure of these views is a little more complex when the underlying schema is a snowflake.

Despite the restricted form, there is still a wide choice of views to materialize. The selection of views to materialize must take into account workload characteristics, the costs for incremental update, and upper bounds on storage requirements. Under simplifying assumptions, a greedy algorithm was shown to have good performance13. A related problem that underlies optimization as well as the choice of materialized views is that of estimating the effect of aggregation on the cardinality of the relations.

A simple, but extremely useful, strategy for using a materialized view is to use selection on the materialized view, or rollup on the materialized view by grouping and aggregating on additional columns. For example, assume that a materialized view contains the total sales by quarter for each product. This materialized view can be used to answer a query that requests the total sales of Levi's jeans for the year by first applying the selection and then rolling up from quarters to years. It should be emphasized that the ability to do roll-up from a partially aggregated result relies on algebraic properties of the aggregating functions (e.g., Sum can be rolled up, but some other statistical function may not be).
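A minimal sketch of this selection-plus-rollup strategy, again over the hypothetical star-schema tables introduced earlier (some systems would declare the summary as a materialized view rather than populate a table explicitly):

    -- Materialized summary: total sales by product by quarter.
    CREATE TABLE sales_by_product_quarter AS
    SELECT p.ProdName, d.Year, d.Quarter, SUM(f.TotalPrice) AS total_sales
    FROM Sales_fact f
    JOIN Product  p ON f.ProdNo  = p.ProdNo
    JOIN Date_dim d ON f.DateKey = d.DateKey
    GROUP BY p.ProdName, d.Year, d.Quarter;

    -- "Total sales of Levi's jeans for the year" answered from the summary:
    -- select the product, then roll up from quarters to the year (valid
    -- because SUM of sums equals the overall sum).
    SELECT Year, SUM(total_sales) AS yearly_sales
    FROM sales_by_product_quarter
    WHERE ProdName = 'Levi''s jeans'
    GROUP BY Year;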
In general, there may be several candidate materialized views that can be used to answer a query. If a view V has the same set of dimensions as Q, if the selection clause in Q implies the selection clause in V, and if the group-by columns in V are a subset of the group-by columns in Q, then view V can act as a generator of Q. Given a set of materialized views M and a query Q, we can define a set of minimal generators M' for Q (i.e., the smallest set of generators such that all other generators generate some member of M'). There can be multiple minimal generators for a query. For example, given a query that asks for total sales of clothing in Washington State, the following two views are both generators: (a) total sales by each state for each product, (b) total sales by each city for each category. The notion of minimal generators can be used by the optimizer to narrow the search for the appropriate materialized view to use. On the commercial side, HP Intelligent Warehouse pioneered the use of the minimal generators to answer queries. While we have defined the notion of a generator in a restricted way, the general problem of optimizing queries in the presence of multiple materialized views is more complex. In the special case of Select-Project-Join queries, there has been some work in this area.14 15 16

Transformation of Complex SQL Queries

The problem of finding efficient techniques for processing complex queries has been of keen interest in query optimization. In a way, decision support systems provide a testing ground for some of the ideas that have been studied before. We will only summarize some of the key contributions.

There has been substantial work on "unnesting" complex SQL queries containing nested subqueries by translating them into single block SQL queries when certain syntactic restrictions are satisfied17 18 19 20. Another direction that has been pursued in optimizing nested subqueries is reducing the number of invocations and batching invocations of inner subqueries by semi-join like techniques21 22.
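As a simple illustration of unnesting, the hedged sketch below rewrites a nested aggregate subquery over the hypothetical tables used earlier into an equivalent single-block query; such rewritings are valid only under syntactic restrictions of the kind discussed in the cited work.

    -- Nested form: cities whose total sales exceed 1,000,000.
    SELECT c.CityName
    FROM City c
    WHERE 1000000 < (SELECT SUM(f.TotalPrice)
                     FROM Sales_fact f
                     WHERE f.CityName = c.CityName);

    -- Unnested, single-block form: one grouped scan of the fact table.
    SELECT f.CityName
    FROM Sales_fact f
    GROUP BY f.CityName
    HAVING SUM(f.TotalPrice) > 1000000;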

Likewise, the problem of flattening queries containing views has been a topic of interest. The case where participating views are SPJ queries is well understood. The problem is more complex when one or more of the views contain aggregation23. Naturally, this problem is closely related to the problem of commuting group-by and join operators. However, commuting group-by and join is applicable in the context of single block SQL queries as well.24 25 26 An overview of the field appears in a recent paper27.

Parallel Processing

Parallelism plays a significant role in processing massive databases. Teradata pioneered some of the key technology. All major vendors of database management systems now offer data partitioning and parallel query processing technology. The article by DeWitt and Gray provides an overview of this area28. One interesting technique relevant to the read-only environment of decision support systems is that of piggybacking scans requested by multiple queries (used in Redbrick). Piggybacking scans reduces the total work as well as response time by overlapping scans of multiple concurrent requests.

Server Architectures for Query Processing

Traditional relational servers were not geared towards the intelligent use of indices and other requirements for supporting multidimensional views of data. However, all relational DBMS vendors have now moved rapidly to support these additional requirements. In addition to the traditional relational servers, there are three other categories of servers that were developed specifically for decision support.

• Specialized SQL Servers: Redbrick is an example of this class of servers. The objective here is to provide advanced query language and query processing support for SQL queries over star and snowflake schemas in read-only environments.

• ROLAP Servers: These are intermediate servers that sit between a relational back end server (where the data in the warehouse is stored) and client front end tools. Microstrategy is an example of such servers. They extend traditional relational servers with specialized middleware to efficiently support multidimensional OLAP queries, and they typically optimize for specific back end relational servers. They identify the views that are to be materialized, rephrase given user queries in terms of the appropriate materialized views, and generate multi-statement SQL for the back end server. They also provide additional services such as scheduling of queries and resource assignment (e.g., to prevent runaway queries). There has also been a trend to tune the ROLAP servers for domain specific ROLAP tools. The main strength of ROLAP servers is that they exploit the scalability and the transactional features of relational systems. However, intrinsic mismatches between OLAP-style querying and SQL (e.g., lack of sequential processing, column aggregation) can cause performance bottlenecks for OLAP servers.

• MOLAP Servers: These servers directly support the multidimensional view of data through a multidimensional storage engine. This makes it possible to implement front-end multidimensional queries on the storage layer through direct mapping. An example of such a server is Essbase (Arbor). Such an approach has the advantage of excellent indexing properties, but provides poor storage utilization, especially when the data set is sparse. Many MOLAP servers adopt a 2-level storage representation to adapt to sparse data sets and use compression extensively. In the two-level storage representation, a set of one or two dimensional subarrays that are likely to be dense are identified, through the use of design tools or by user input, and are represented in the array format. Then, the traditional indexing structure is used to index onto these "smaller" arrays. Many of the techniques that were devised for statistical databases appear to be relevant for MOLAP servers.

SQL Extensions

Several extensions to SQL that facilitate the expression and processing of OLAP queries have been proposed or implemented in extended relational servers. Some of these extensions are described below.

• Extended family of aggregate functions: These include support for rank and percentile (e.g., all products in the top 10 percentile or the top 10 products by total sale) as well as support for a variety of functions used in financial analysis (mean, mode, median).

• Reporting Features: The reports produced for business analysis often require aggregate features evaluated on a time window, e.g., moving average. In addition, it is important to be able to provide breakpoints and running totals. Redbrick's SQL extensions provide such primitives.

• Multiple Group-By: Front end tools such as multidimensional spreadsheets require grouping by different sets of attributes. This can be simulated by a set of SQL statements that require scanning the same data set multiple times, but this can be inefficient. Recently, two new operators, Rollup and Cube, have been proposed to augment SQL to address this problem29. Thus, Rollup of the list of attributes (Product, Year, City) over a data set results in answer sets with the following applications of group by: (a) group by (Product, Year, City), (b) group by (Product, Year), and (c) group by Product. On the other hand, given a list of k columns, the Cube operator provides a group-by for each of the 2^k combinations of columns. Such multiple group-by operations can be executed efficiently by recognizing commonalities among them30. Microsoft SQL Server supports Cube and Rollup (a small SQL sketch of these operators follows this list).

• Comparisons: An article by Ralph Kimball and Kevin Strehlo provides an excellent overview of the deficiencies of SQL in being able to do comparisons that are common in the business world, e.g., compare the difference between the total projected sale and total actual sale by each quarter, where projected sale and actual sale are columns of a table31. A straightforward execution of such queries may require multiple sequential scans. The article provides some alternatives to better support comparisons. A recent research paper also addresses the question of how to do comparisons among aggregated values by extending SQL32.
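The sketch below illustrates the two multiple group-by operators on the running example, using the GROUP BY ROLLUP / GROUP BY CUBE syntax that relational products eventually adopted; the tables are the hypothetical ones from the earlier star-schema sketch, and note that the SQL ROLLUP operator also produces a grand-total row in addition to the three groupings listed above.

    -- Rollup over (Product, Year, City): groups by (Product, Year, City),
    -- (Product, Year), (Product), plus a grand total.
    SELECT p.ProdName, d.Year, c.CityName, SUM(f.TotalPrice) AS total_sales
    FROM Sales_fact f
    JOIN Product  p ON f.ProdNo   = p.ProdNo
    JOIN Date_dim d ON f.DateKey  = d.DateKey
    JOIN City     c ON f.CityName = c.CityName
    GROUP BY ROLLUP (p.ProdName, d.Year, c.CityName);

    -- Cube over the same list: all 2^3 = 8 grouping combinations in one statement.
    SELECT p.ProdName, d.Year, c.CityName, SUM(f.TotalPrice) AS total_sales
    FROM Sales_fact f
    JOIN Product  p ON f.ProdNo   = p.ProdNo
    JOIN Date_dim d ON f.DateKey  = d.DateKey
    JOIN City     c ON f.CityName = c.CityName
    GROUP BY CUBE (p.ProdName, d.Year, c.CityName);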
7. Metadata and Warehouse Management

Since a data warehouse reflects the business model of an enterprise, an essential element of a warehousing architecture is metadata management. Many different kinds of metadata have to be managed. Administrative metadata includes all of the information necessary for setting up and using a warehouse: descriptions of the source databases, back-end and front-end tools; definitions of the warehouse schema, derived data, dimensions and hierarchies, predefined queries and reports; data mart locations and contents; physical organization such as data partitions; data extraction, cleaning, and transformation rules; data refresh and purging policies; and user profiles, user authorization and access control policies. Business metadata includes business terms and definitions, ownership of the data, and charging policies. Operational metadata includes information that is collected during the operation of the warehouse: the lineage of migrated and transformed data; the currency of data in the warehouse (active, archived or purged); and monitoring information such as usage statistics, error reports, and audit trails.

Often, a metadata repository is used to store and manage all the metadata associated with the warehouse. The repository enables the sharing of metadata among tools and processes for designing, setting up, using, operating, and administering a warehouse. Commercial examples include Platinum Repository and Prism Directory Manager.

Creating and managing a warehousing system is hard. Many different classes of tools are available to facilitate different aspects of the process described in Section 2. Development tools are used to design and edit schemas, views, scripts, rules, queries, and reports. Planning and analysis tools are used for what-if scenarios such as understanding the impact of schema changes or refresh rates, and for doing capacity planning. Warehouse management tools (e.g., HP Intelligent Warehouse Advisor, IBM Data Hub, Prism Warehouse Manager) are used for monitoring a warehouse, reporting statistics and making suggestions to the administrator: usage of partitions and summary tables, query execution times, types and frequencies of drill downs or rollups, which users or groups request which data, peak and average workloads over time, exception reporting, detecting runaway queries, and other quality of service metrics. System and network management tools (e.g., HP OpenView, IBM NetView, Tivoli) are used to measure traffic between clients and servers, between warehouse servers and operational databases, and so on. Finally, only recently have workflow management tools been considered for managing the extract-scrub-transform-load-refresh process. The steps of the process can invoke appropriate scripts stored in the repository, and can be launched periodically, on demand, or when specified events occur. The workflow engine ensures successful completion of the process, persistently records the success or failure of each step, and provides failure recovery with partial roll back, retry, or roll forward.

8. Research Issues

We have described the substantial technical challenges in developing and deploying decision support systems. While many commercial products and services exist, there are still several interesting avenues for research. We will only touch on a few of these here.

Data cleaning is a problem that is reminiscent of heterogeneous data integration, a problem that has been studied for many years. But here the emphasis is on data inconsistencies instead of schema inconsistencies. Data cleaning, as we indicated, is also closely related to data mining, with the objective of suggesting possible inconsistencies.

The problem of physical design of data warehouses should rekindle interest in the well-known problems of index selection, data partitioning and the selection of materialized views. However, while revisiting these problems, it is important to recognize the special role played by aggregation. Decision support systems already provide the field of query optimization with increasing challenges in the traditional questions of selectivity estimation and cost-based algorithms that can exploit transformations without exploding the search space (there are plenty of transformations, but few reliable cost estimation techniques and few smart cost-based algorithms/search strategies to exploit them). Partitioning the functionality of the query engine between the middleware (e.g., ROLAP layer) and the back end server is also an interesting problem.

The management of data warehouses also presents new challenges. Detecting runaway queries, and managing and scheduling resources are problems that are important but have not been well solved. Some work has been done on the logical correctness of incrementally updating materialized views, but the performance, scalability, and recoverability properties of these techniques have not been investigated.

In particular, failure and checkpointing issues in load and refresh in the presence of many indices and materialized views need further research. The adaptation and use of workflow technology might help, but this needs further investigation.

Some of these areas are being pursued by the research community33 34, but others have received only cursory attention, particularly in relationship to data warehousing.

Acknowledgement

We thank Goetz Graefe for his comments on the draft.

References

1. Inmon, W.H. Building the Data Warehouse. John Wiley, 1992.
2. http://www.olapcouncil.org
3. Codd, E.F., S.B. Codd, C.T. Salley. "Providing OLAP (On-Line Analytical Processing) to User Analyst: An IT Mandate." Available from Arbor Software's web site http://www.arborsoft.com/OLAP.html.
4. http://pwp.starnetinc.com/larryg/articles.html
5. Kimball, R. The Data Warehouse Toolkit. John Wiley, 1996.
6. Barclay, T., R. Barnes, J. Gray, P. Sundaresan. "Loading Databases using Dataflow Parallelism." SIGMOD Record, Vol. 23, No. 4, Dec. 1994.
7. Blakeley, J.A., N. Coburn, P. Larson. "Updating Derived Relations: Detecting Irrelevant and Autonomously Computable Updates." ACM TODS, Vol. 4, No. 3, 1989.
8. Gupta, A., I.S. Mumick. "Maintenance of Materialized Views: Problems, Techniques, and Applications." Data Eng. Bulletin, Vol. 18, No. 2, June 1995.
9. Zhuge, Y., H. Garcia-Molina, J. Hammer, J. Widom. "View Maintenance in a Warehousing Environment." Proc. of SIGMOD Conf., 1995.
10. Roussopoulos, N., et al. "The Maryland ADMS Project: Views R Us." Data Eng. Bulletin, Vol. 18, No. 2, June 1995.
11. O'Neil, P., D. Quass. "Improved Query Performance with Variant Indices." To appear in Proc. of SIGMOD Conf., 1997.
12. O'Neil, P., G. Graefe. "Multi-Table Joins through Bitmapped Join Indices." SIGMOD Record, Sep. 1995.
13. Harinarayan, V., A. Rajaraman, J.D. Ullman. "Implementing Data Cubes Efficiently." Proc. of SIGMOD Conf., 1996.
14. Chaudhuri, S., R. Krishnamurthy, S. Potamianos, K. Shim. "Optimizing Queries with Materialized Views." Intl. Conference on Data Engineering, 1995.
15. Levy, A., A. Mendelzon, Y. Sagiv. "Answering Queries Using Views." Proc. of PODS, 1995.
16. Yang, H.Z., P.A. Larson. "Query Transformations for PSJ Queries." Proc. of VLDB, 1987.
17. Kim, W. "On Optimizing a SQL-like Nested Query." ACM TODS, Sep. 1982.
18. Ganski, R., H.K.T. Wong. "Optimization of Nested SQL Queries Revisited." Proc. of SIGMOD Conf., 1987.
19. Dayal, U. "Of Nests and Trees: A Unified Approach to Processing Queries that Contain Nested Subqueries, Aggregates and Quantifiers." Proc. VLDB Conf., 1987.
20. Muralikrishna, M. "Improved Unnesting Algorithms for Join Aggregate SQL Queries." Proc. VLDB Conf., 1992.
21. Seshadri, P., H. Pirahesh, T. Leung. "Complex Query Decorrelation." Intl. Conference on Data Engineering, 1996.
22. Mumick, I.S., H. Pirahesh. "Implementation of Magic Sets in Starburst." Proc. of SIGMOD Conf., 1994.
23. Chaudhuri, S., K. Shim. "Optimizing Queries with Aggregate Views." Proc. of EDBT, 1996.
24. Chaudhuri, S., K. Shim. "Including Group By in Query Optimization." Proc. of VLDB, 1994.
25. Yan, P., P.A. Larson. "Eager Aggregation and Lazy Aggregation." Proc. of VLDB, 1995.
26. Gupta, A., V. Harinarayan, D. Quass. "Aggregate-Query Processing in Data Warehouse Environments." Proc. of VLDB, 1995.
27. Chaudhuri, S., K. Shim. "An Overview of Cost-based Optimization of Queries with Aggregates." IEEE Data Engineering Bulletin, Sep. 1995.
28. DeWitt, D.J., J. Gray. "Parallel Database Systems: The Future of High Performance Database Systems." CACM, June 1992.
29. Gray, J., et al. "Data Cube: A Relational Aggregation Operator Generalizing Group-by, Cross-Tab and Sub Totals." Data Mining and Knowledge Discovery Journal, Vol. 1, No. 1, 1997.
30. Agrawal, S., et al. "On the Computation of Multidimensional Aggregates." Proc. of VLDB Conf., 1996.
31. Kimball, R., K. Strehlo. "Why Decision Support Fails and How to Fix It." Reprinted in SIGMOD Record, 24(3), 1995.
32. Chatziantoniou, D., K. Ross. "Querying Multiple Features in Relational Databases." Proc. of VLDB Conf., 1996.
33. Widom, J. "Research Problems in Data Warehousing." Proc. 4th Intl. CIKM Conf., 1995.
34. Wu, M-C., A.P. Buchmann. "Research Issues in Data Warehousing." Submitted for publication.
