Decision Support Systems 42 (2006) 267–282 www.elsevier.com/locate/dsw

Denormalization strategies for data retrieval from data warehouses

Seung Kyoon Shin a, G. Lawrence Sanders b

a College of Business Administration, University of Rhode Island, 7 Lippitt Road, Kingston, RI 02881-0802, United States
b Department of Management Science and Systems, School of Management, State University of New York at Buffalo, Buffalo, NY 14260-4000, United States

Available online 20 January 2005

Abstract

In this study, the effects of denormalization on relational system performance are discussed in the context of using denormalization strategies as a database design methodology for data warehouses. Four prevalent denormalization strategies have been identified and examined under various scenarios to illustrate the conditions where they are most effective. The relational algebra, query trees, and join cost function are used to examine the effect on the performance of relational systems. The guidelines and analysis provided are sufficiently general and they can be applicable to a variety of databases, in particular to data warehouse implementations for decision support systems.
© 2004 Elsevier B.V. All rights reserved.

Keywords: Database design; Denormalization; Decision support systems; Data warehouse; Data mining

1. Introduction

With the increased availability of data collected from the Internet and other sources and the implementation of enterprise-wide data warehouses, the amount of data that companies possess is growing at a phenomenal rate. It has become increasingly important for companies to better manage their data warehouses as issues related to database design for high performance receive more attention. Database design is still an art that relies heavily on human intuition and experience. Consequently, its practice is becoming more difficult as the applications that databases support become more sophisticated [32]. Currently, most popular database implementations for business applications are based on the relational model. It is well known that the relational model is simple but somewhat limited in supporting real world constructs and application requirements [2]. It is also readily observable that there are indeed wide differences between the academic and the practitioner focus on database design.

Denormalization is one example that has not received much attention in academia but has been a viable database design strategy in real world practice.

From a database design perspective, normalization has been the rule to abide by during database design processes. The normal forms and the process of normalization have been studied by many researchers since Codd [10] initiated the subject. The objective of normalization is to organize data into normal forms and thereby minimize update anomalies and maximize data accessibility. While normalization provides many benefits and is indeed regarded as the rule for relational database design, there is at least one major drawback, namely "poor system performance" [15,31,33,50]. Such poor performance can be a major deterrent to the effective managerial use of corporate data.

In practice, denormalization techniques are frequently utilized for a variety of reasons. However, denormalization still lacks a solid set of principles and guidelines and, thus, remains a human intensive process [40]. There has been little research illustrating the effects of denormalization on database performance and query response time, or identifying effective denormalization strategies. The goal of this paper is to provide comprehensive guidelines regarding when and how to effectively exercise denormalization. We further propose common denormalization strategies, analyze their costs and benefits, and illustrate relevant examples of when and how each strategy can be applied.

It should be noted that the intention of this study is not to promote the concept of denormalization. Normalization is, arguably, the mantra upon which the infrastructure of all database systems should be built and one of the foundational principles of the field [20,25]. It is fundamental to translating requirement specifications into semantically stable and robust physical schema. There are, however, instances where this principle of database design can be violated or at least compromised to increase system performance and present the user with a more simplified view of the data structure [39]. Denormalization is not necessarily an incorrect decision when it is implemented wisely, and it is the objective of this paper to provide a set of guidelines for applying denormalization to improve database performance.

This paper is organized as follows: Section 2 discusses the relevant research on denormalization and examines the effects of incorporating denormalization techniques into decision support system databases. Section 3 presents an overview of how denormalization fits in the database design cycle and develops the criteria to be considered in assessing database performance. Section 4 summarizes the commonly accepted denormalization strategies. In Section 5, relational algebra, query trees, and a join cost function approach are applied in an effort to estimate the effect of denormalization on database performance. We conclude in Section 6 with a brief discussion of future research.

2. Denormalization and data warehouses

2.1. Normalization vs. denormalization

Although conceptual and logical data models encourage us to generalize and consolidate entities to better understand the relationships between them, such generalization does not guarantee the best performance but may lead to more complicated database access paths [4]. Furthermore, normalized schema may create retrieval inefficiencies when a comparatively small amount of data is being retrieved from complex relationships [50]. A normalized data schema performs well with a relatively small amount of data and transactions. However, as the workload on the database engine increases, the relational engine may not be able to handle transaction processing with normalized schema in a reasonable time frame because relatively costly join operations are required to combine normalized data. As a consequence, database designers occasionally trade off the aesthetics of data normalization with the reality of system performance.

There are many cases when a normalized schema does not satisfy the system requirements. For example, existing normalization concepts are not applicable to temporal relational data models [9] because these models employ relational structures that are different from conventional relations [29]. In addition, relational models do not handle incomplete or unstructured data in a comprehensive manner, while for many real-world business applications such as data warehouses, where multiple heterogeneous data sources are consolidated into a large data repository, the data are frequently incomplete and uncertain [19].

In an OLAP environment, the highly normalized form may not be appropriate because retrieval often entails a large number of joins, which greatly increases response times, and this is particularly true when the data warehouse is large [48]. Database performance can be substantially improved by minimizing the number of accesses to secondary storage during transaction processing. Denormalizing relations reduces the number of physical tables that need to be accessed to retrieve the desired data by reducing the number of joins needed to derive a query [25].

2.2. Previous work on denormalization

Normalization may cause significant inefficiencies when there are few updates and many query retrievals involving a large number of join operations. On the other hand, denormalization may boost query speed but also degrade data integrity. The trade-offs inherent in normalization/denormalization should be a natural area for scholarly activity. The pioneering work by Schkolnick and Sorenson [44] introduced the notion of denormalization. They argue that, to improve database performance, the data model should be viewed by the user because the user is capable of understanding the underlying semantic constraints.

Several researchers have developed lists of normalization and denormalization types, and have also suggested that denormalization should be carefully deployed according to how the data will be used [24,41]. The common denormalization types include pre-joined tables, report tables, mirror tables, split tables, combined tables, redundant data, repeating groups, derivable data, and speed tables. The studies [24,41] show that denormalization is useful when there are two entities with a one-to-one relationship and a many-to-many relationship with non-key attributes.

There are two approaches to denormalization [49]. In the first approach, the entity relationship diagram (ERD) is used to collapse the number of logical objects in the model. This will in effect shorten application call paths that traverse the database objects when the structure is transformed from logical to physical objects. This approach should be exercised when validating the logical model. In the second approach, denormalization is accomplished by moving or consolidating entities and creating entities or attributes to facilitate special requests, and by introducing redundancy or synthetic keys to encompass the movement or change of structure within the physical model. One deficiency of this approach is that future development and maintenance are not considered. Denormalization procedures can be adopted as pre-physical database design processes and as intermediate steps between logical and physical modeling, and provide an additional view of logical database refinement before the physical design [7]. The refinement process requires a high level of expertise on the part of the database designer, as well as appropriate knowledge of application requirements.

There are possible drawbacks of denormalization. Denormalization decisions usually involve trade-offs between flexibility and performance, and denormalization requires an understanding of flexibility requirements, awareness of the update frequency of the data, and knowledge of how the database management system, the operating system and the hardware work together to deliver optimal performance [12].

From the previous discussion, it is apparent that there has been little research regarding a comprehensive procedure for invoking and assessing a process for denormalization.

2.3. Denormalizing with data warehouses and data marts

Data warehousing and OLAP are long established fields of study with well-established technologies. The notion of OLAP refers to the technique for performing complex analysis over the information stored in a data warehouse. A data warehouse (or smaller-scale data mart) is typically a specially created data repository supporting decision making and, in general, involves a "very large" repository of historical data pertaining to an organization [28]. The goal of the data warehouse is to put enterprise data at the disposal of organizational decision makers. The typical data warehouse is a subject-oriented corporate database that involves multiple data models implemented on multiple platforms and architectures. While there are data warehouses for decision support based on the star schema (see footnote 2), which is a suitable alternative, there are also some cases where a star schema is not the best database design [39]. It is important to recognize that star schemas do not represent a panacea for data warehouse database design.

It should be noted that denormalization can indeed speed up data retrieval and that there is a downside of denormalization in the form of potential update anomalies [17]. But updates are not typical in data warehouse applications, since data warehouses and data marts involve relatively few data updates, and most transactions in a data warehouse involve data retrieval [30]. In essence, data warehouses and data marts are considered good candidates for applying denormalization strategies because individual attributes are rarely updated.

Denormalization is particularly useful in dealing with the proliferation of star schemas that are found in many data warehouse implementations [1]. In this case, denormalization may provide better performance and a more intuitive data structure for data warehouse users to navigate. One of the prime goals of most analytical processes with the data warehouse is to access aggregates such as sums, averages, and trends, which frequently lead to resource-consuming calculations at runtime. While typical production systems usually contain only the basic data, data warehousing users expect to find aggregated and time-series data that is in a form ready for immediate display. In this case, having pre-computed data stored in the database reduces the cost of executing aggregate queries.

Footnote 2: A star schema is a database design in which dimensional data are separated from fact data. A star schema consists of a fact table with a single table for each dimension.

3. Database design with denormalization

The database design process, in general, includes the following phases: conceptual, logical, and physical design [25,43]. Denormalization can be separated from those steps because it involves aspects that are not purely related to either logical or physical design. We propose that the denormalization process should be implemented between the data model mapping and the physical design and that it is integrated with logical and physical design. In fact, it has been asserted by Inmon [26] that data should be normalized as the design is being conceptualized and then denormalized in response to the performance requirements. Fig. 1 presents a suggested database design procedure incorporating the denormalization process.

A primary goal of denormalization is to improve the effectiveness and efficiency of the logical model implementation in order to fulfill application requirements. However, as in any denormalization process, it may not be possible to accomplish a full denormalization that meets all the criteria specified. In such cases, the database designer will have to make trade-offs based on the importance of each criterion and the application processing requirement, taking into account the pros and cons of the denormalization implementation.

The process of information requirements determination or logical database design produces a description of the data content and processing activities that must be supported by a database system. In addition, in order to assess the efficiency of denormalization, the hardware environment, such as page size and storage and access costs, must also be described. This information is needed to develop the criteria for denormalization.

A generally accepted goal in both denormalization and physical design is to minimize the operational costs of using the database. Thus, a framework for information requirements for physical design can be applied to denormalization. Carlis et al. [6] developed a framework for information requirements that can be readily applied to denormalization. Based on this framework and the literature review related to denormalization, Table 1 presents a more detailed overview of the numerous processing issues that need to be considered when embarking on a path to denormalization [3,13,21,25,32,37]. The main inputs to the denormalization process are the logical structure, volume analysis (e.g. estimates of tuples, attributes, and cardinality), transaction analysis (e.g. statement of performance requirements and transaction assessment in terms of frequency and profile), and volatility analysis.

An estimate of the average/maximum possible number of instances per entity is a useful criterion for assessing the model's ability to fulfill access requirements. In developing a transaction analysis, the major transactions required against the database need to be identified. Transactions are logical units of work made up of pure retrievals, insertions, updates, or deletions, or a mixture of all four.

Fig. 1. Database design cycle incorporating denormalization (flow): Start → Development of Conceptual Data Model (ERD) → Refinement and Normalization → Identifying Candidates for Denormalization → Denormalization Procedure (Identifying Denormalization Model → Determining the Effects of Denormalization → Cost-Benefit Analysis; if Benefit > Cost, Apply Denormalization) → Physical Design → Implementation.
Each transaction is analyzed to determine the access paths used and the estimated frequency of use. Volatility analysis is useful in that the volatility of a file has a clear impact on its expected update and retrieval performance [47].

4. Denormalization strategies for increasing performance

In this section, we present denormalization patterns that have been commonly adopted by experienced database designers. We also develop a mathematical model for assessing the benefits of each pattern using cost-benefit analysis. After an extensive literature review, we identified and classified four prevalent models for denormalization in Table 2 [7,15,16,23,24,38,49]. They are collapsing relations (CR), partitioning a relation (PR), adding redundant attributes (RA), and adding derived attributes (DA).

Our mathematical model for the benefits of denormalization relies on the performance of query processing measured by the elapsed time difference of each query with normalized and denormalized schemas. Suppose that there are i (i = 1, 2, ..., I) data retrieval queries and j (j = 1, 2, ..., J) update queries processing with a normalized schema. The schema can be denormalized with n (n = 1, 2, ..., N) denormalization models, and i' (i' = 1, 2, ..., I') data retrieval queries and j' (j' = 1, 2, ..., J') update queries are modified to retrieve the same data from the denormalized schema as those with the normalized one. Each query is executed t_i times per hour. Let x_i denote the elapsed time of query i, so that the query performance gained by the denormalized model can be measured by the time difference of the two queries, X_i = x_i - x_{i'}. The benefit, K, of denormalization model DN (DN = {CR, PR, RA, DA}) can be stated as

K_{DN} = \sum_i X_i t_i + \sum_j X_j t_j.   (1)

It is assumed that \sum_i X_i t_i > 0, as this is the goal of denormalization. On the contrary, \sum_j X_j t_j \le 0 because most denormalization models entail data duplication that may increase the volume of data to be updated as well as require additional update transactions.
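To make the sign convention concrete, the following worked instance of Eq. (1) uses purely hypothetical timings that are not taken from the paper: two retrieval queries that each become 2 s faster and run 100 times per hour, and one update query that becomes 1 s slower and runs 50 times per hour.

K_{DN} = \underbrace{(2)(100) + (2)(100)}_{\text{retrieval gains}} + \underbrace{(-1)(50)}_{\text{update penalty}} = 350 \text{ s saved per hour.}

Under these assumed values the denormalization model is beneficial because the retrieval gains outweigh the update penalty.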

Table 1
Information requirements for denormalization

Type of information        Variables to be considered during denormalization
Data description           Cardinality of each relation; optionality of each relation; data volume of each relation; volatility of data
Transaction activity       Data retrieval and update patterns of the user community; number of entities involved by each transaction; transaction type (update/query, OLTP/OLAP); transaction frequency; number of rows accessed by each transaction; relations between transactions and relations of entities involved; access paths needed by each transaction
Hardware environment       Pages/blocks size; access costs; storage costs; buffer size
Other consideration        Future application development and maintenance considerations

Note that any given physical design is good for some queries but bad for others [17]. While there are A queries (q_a, a = 1, 2, ..., A) that are to be enhanced by the denormalized schema ((\forall q_a)(X_a > 0)), there may be I - A queries (q_b, b = 1, 2, ..., I - A) that are degraded ((\forall q_b)(X_b \le 0)). This results in the model structure below:

K_{DN} = \sum_a X_a t_a + \sum_b X_b t_b + \sum_j X_j t_j   (2)

where \sum_a X_a t_a > 0 and \sum_b X_b t_b \le 0. The mathematical model for selecting the optimal set of relations needs to deal with two requirements. First, the objective of denormalization should be to enhance data retrieval performance (\sum_i X_i > 0), trading against the costs of update transactions caused by data redundancy (\sum_j X_j \le 0). However, this cost would be negligible in data warehouses and data marts because the databases for those systems do not involve a great deal of update transactions. Second, as denormalization results in the costs of data duplication, if the schema involves update transactions, the cost of update transactions should not exceed the sum of the benefits. We can state these requirements in the model structure shown below.

Max K_{DN} = \sum_a X_a t_a + \sum_b X_b t_b + \sum_j X_j t_j   (3)

Subject to \sum_a X_a t_a + \sum_b X_b t_b > 0   (4)

\sum_a X_a t_a > 0   (5)

\sum_a X_a t_a + \sum_b X_b t_b > \sum_j X_j t_j   (6)

Based on this formula, the benefits of the four denormalization models, CR, PR, RA, and DA, will be discussed.

Table 2
Denormalization models and patterns

Denormalization strategies         Denormalization patterns
Collapsing relations (CR)          Two relations with a one-to-one relationship; two relations with a one-to-many relationship; two relations with a many-to-many relationship
Partitioning a relation (PR)       Horizontal partitioning; vertical partitioning
Adding redundant attributes (RA)   Reference data in lookup/code relations
Derived attributes (DA)            Summary data, total, and balance

4.1. Collapsing relations (CR)

One of the most commonly used denormalization models involves the collapsing of one-to-one relationships [38]. Consider the normalized schema with two relations in a one-to-one relationship, CUSTOMER and CUSTOMER_ACCOUNT, as shown in Fig. 2. If there are a large number of queries requiring a join operation between the two relations, collapsing both relations into CUSTOMER_DN may enhance query performance. There are several benefits of this model in the form of a reduced number of foreign keys on relations, a reduced number of indexes (since most indexes are based on primary/foreign keys), reduced storage space, and fewer join operations.

Normalized Schema

CUSTOMER (Customer_ID, Customer_Name, Address) CUSTOMER_ACCOUNT (Customer_ID, Account_Bal, Market_Segment)

Denormalized Schema

CUSTOMER_DN (Customer_ID, Customer_Name, Address, Account_Bal, Market_Segment)

Fig. 2. An example of collapsing a one-to-one relationship.

Collapsing a one-to-one relationship does not cause data duplication and, in general, update anomalies are not a serious issue (\sum_j X_j = 0).

Combining two relations in a one-to-one relationship does not change the business view but does decrease overhead access time because of fewer join operations. In general, collapsing tables in a one-to-one relationship has fewer drawbacks than other denormalization approaches.

One-to-many relationships of low cardinality can be a candidate for denormalization, although data redundancy may interfere with updates [24]. Consider the normalized relations CUSTOMER and CITY in Fig. 3, where Customer_ID and Zip are, respectively, the primary keys for each relation. These two relations are in 3rd normal form and the relationship between CITY and CUSTOMER is one-to-many. Suppose the address of a CUSTOMER does not change frequently; this structure does not necessarily represent a better design from an efficiency perspective, particularly when most queries require a join operation. On the contrary, the denormalized relation is superior because it will substantially decrease the number of join operations. In addition, the costs for data anomalies caused by the transitive functional dependency (Customer_ID → Zip, Zip → {City, State}) may be minimal when the "one" side relation (CITY) does not involve update transactions.

If a reference table in a one-to-many relationship has very few attributes, it is a candidate for collapsing relations [25]. Reference data typically exists on the "one" side of a one-to-many relationship. This entity typically does not participate in a different type of relationship because a code table typically participates in relationships only as a code table. However, as the number of attributes in the reference relation increases, it may be reasonable to consider RA rather than CR (this will be discussed in detail in Section 4.3). In general, it is appropriate to combine the two relations in a one-to-many relationship when the "many" side relation is associated with low cardinality. Collapsing one-to-many relationships can be relatively less efficient than collapsing one-to-one relationships because of the data maintenance processing for duplicated data. In the case of a reference table, the update costs are negligible (\sum_j X_j t_j \approx 0) because update transactions rarely occur in reference relations (t_j \approx 0).

A many-to-many relationship can also be a candidate for CR when the cardinality and the volatility of the involved relations are low [25,41]. The typical many-to-many relationship is represented in the physical database structure by three relations: one for each of the two primary entities and another for cross-referencing them. A cross-reference or intersection between the two entities in many instances also represents a business entity. These three relations can be merged into two if one of them has little data apart from its primary key. Such a relation could be merged into the cross-reference relation by duplicating the attribute data.
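As a concrete sketch of the one-to-one collapse in Fig. 2, the statements below build CUSTOMER_DN by joining the two normalized relations. The column types and the CREATE TABLE ... AS form are illustrative assumptions of this sketch (some dialects use SELECT ... INTO instead), not part of the original design.

-- Normalized relations (illustrative types)
CREATE TABLE CUSTOMER (
  Customer_ID    INTEGER PRIMARY KEY,
  Customer_Name  VARCHAR(60),
  Address        VARCHAR(120)
);

CREATE TABLE CUSTOMER_ACCOUNT (
  Customer_ID     INTEGER PRIMARY KEY REFERENCES CUSTOMER(Customer_ID),
  Account_Bal     DECIMAL(12,2),
  Market_Segment  VARCHAR(30)
);

-- Collapsed (denormalized) relation: one row per customer, no join needed at query time
CREATE TABLE CUSTOMER_DN AS
SELECT c.Customer_ID, c.Customer_Name, c.Address,
       a.Account_Bal, a.Market_Segment
FROM   CUSTOMER c
JOIN   CUSTOMER_ACCOUNT a ON a.Customer_ID = c.Customer_ID;

A query such as SELECT Customer_Name, Account_Bal FROM CUSTOMER_DN then touches a single table instead of performing the join.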

Normalized Schema

CUSTOMER (Customer_ID, Customer_Name, Address, ZIP) CITY (Zip, City, State)

Denormalized Schema

CUSTOMER_DN (Customer_ID, Customer_Name, Address, ZIP, City, State)

Fig. 3. An example of collapsing a one-to-many relationship.
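A corresponding sketch for the one-to-many collapse of Fig. 3; the column types are assumed, and CREATE TABLE ... AS is again used only for brevity. The denormalized table simply carries City and State on every customer row, accepting the transitive dependency noted above.

-- Reference ("one" side) and target ("many" side) relations
CREATE TABLE CITY (
  Zip    CHAR(5) PRIMARY KEY,
  City   VARCHAR(40),
  State  CHAR(2)
);

CREATE TABLE CUSTOMER (
  Customer_ID    INTEGER PRIMARY KEY,
  Customer_Name  VARCHAR(60),
  Address        VARCHAR(120),
  Zip            CHAR(5) REFERENCES CITY(Zip)
);

-- Denormalized relation: City and State are duplicated per customer
CREATE TABLE CUSTOMER_DN AS
SELECT cu.Customer_ID, cu.Customer_Name, cu.Address, cu.Zip, ci.City, ci.State
FROM   CUSTOMER cu
JOIN   CITY ci ON ci.Zip = cu.Zip;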

CR eliminates the join, but the net result is that there is a loss at the abstract level because there is no conceptual separation of the data. However, Bracchi et al. [5] argue, "we see that from the physical point of view, an entity is a set of related attributes that are frequently retrieved together. The entity should be, therefore, an access concept, not a logical concept, and it corresponds to the physical record." Even though the distinction between entities and attributes is real, some designers conclude it should not be preserved, at least for conceptual modeling purposes [36]. These discussions present us with the opportunity to explore strategies for denormalization and free us from the stringent design restrictions based on the normalization rules.

4.2. Partitioning a relation (PR)

Database performance can be enhanced by minimizing the number of data accesses made to secondary storage when processing transactions. Partitioning a relation (PR) in data warehouses and data marts is a major step in database design for fewer disk accesses [8,34,35]. When separate parts of a relation are used by different applications, the relation may be partitioned into multiple relations for each application processing group. A relation can be partitioned either vertically or horizontally. It is the process of dividing a logical relation into several physical objects. The resulting file fragments consist of a smaller number of attributes (if vertically partitioned) or tuples (if horizontally partitioned), and therefore fewer accesses are made to secondary storage to process a transaction.

4.2.1. Vertical partitioning (VP)

Vertical partitioning (VP) involves splitting a relation by columns so that a group of columns is placed into a new relation and the remaining columns are placed in another [13,38]. PART in Fig. 4 consists of quite a few attributes associated with part specification and inventory information. If query transactions can be classified into two groups in terms of their association with the two attribute groups, and there are few transactions involving the two groups at the same time, PART can be partitioned into PART_SPEC_INFO and PART_INVENTORY_INFO to decrease the number of pages needed for each query. In general, VP is particularly effective when there are lengthy text fields with a large number of columns and these lengthy fields are accessed by a limited number of transactions. In actuality, a view of the joined relations may make these partitions transparent to the users. A vertically partitioned relation should contain one row per primary key, as this facilitates data retrieval across relations.

VP does not cause much data duplication except for the shared primary keys among the partitioned relations. As those primary keys are rarely modified, the cost caused by the data duplication can be ignored (\sum_j X_j \approx 0).

4.2.2. Horizontal partitioning (HP)

Horizontal partitioning (HP) [13] involves a row split, and the resulting rows are classified into groups by key ranges. This is similar to partitioning a relation during physical database design, except that each resulting relation has its own name. HP can be used when a relation contains a large number of tuples, as in the sales database of a super store. SALE_HISTORY in the super store database in Fig. 5 is a common case of a relation containing a large number of historical sales transactions to be considered for HP. It is likely that most OLAP transactions with this relation include a time-dependent variable. As such, HP based on the analysis period considerably decreases unnecessary data access to the secondary data storage.
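A minimal sketch of period-based horizontal partitioning of SALE_HISTORY (Fig. 5). The period boundaries, the column types, the renaming of the Timestamp column to Sale_Time (to avoid the SQL reserved word), and the SALE_HISTORY_ALL view are illustrative assumptions; the point is that a query restricted to one period touches only one physical table.

-- One physical table per analysis period (illustrative layout)
CREATE TABLE SALE_HISTORY_Period_1 (
  Sale_ID      INTEGER PRIMARY KEY,
  Sale_Time    TIMESTAMP,        -- rows whose Sale_Time falls in period 1
  Store_ID     INTEGER,
  Customer_ID  INTEGER,
  Amount       DECIMAL(12,2)
);

CREATE TABLE SALE_HISTORY_Period_2 (
  Sale_ID      INTEGER PRIMARY KEY,
  Sale_Time    TIMESTAMP,        -- rows whose Sale_Time falls in period 2
  Store_ID     INTEGER,
  Customer_ID  INTEGER,
  Amount       DECIMAL(12,2)
);

-- Optional view reassembling the full history; UNION ALL is safe only if
-- no row is duplicated across the period tables
CREATE VIEW SALE_HISTORY_ALL AS
SELECT * FROM SALE_HISTORY_Period_1
UNION ALL
SELECT * FROM SALE_HISTORY_Period_2;

-- A period-bound OLAP query reads a single partition table
SELECT Store_ID, SUM(Amount)
FROM   SALE_HISTORY_Period_2
GROUP BY Store_ID;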

Normalized Schema

PART (Part_ID, Width, Length, Height, Weight, ..., Price, Stock, Supplier_ID, Warehouse, ...)

Denormalized Schema

PART_SPEC_INFO (Part_ID, Width, Length, Height, Weight, Strength, ...)  [part specification]
PART_INVENTORY_INFO (Part_ID, Price, Stock, Supplier_ID, Warehouse, ...)  [part inventory]

Fig. 4. An example of vertical partitioning.
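A sketch of the vertical split in Fig. 4. The column types and the PART_ALL view name are assumptions; each fragment repeats only the primary key, and a view can hide the split from users, as noted above.

-- Specification fragment: one row per part, keyed by Part_ID
CREATE TABLE PART_SPEC_INFO (
  Part_ID  INTEGER PRIMARY KEY,
  Width    DECIMAL(8,2),
  Length   DECIMAL(8,2),
  Height   DECIMAL(8,2),
  Weight   DECIMAL(8,2)
);

-- Inventory fragment: same key, disjoint non-key columns
CREATE TABLE PART_INVENTORY_INFO (
  Part_ID      INTEGER PRIMARY KEY REFERENCES PART_SPEC_INFO(Part_ID),
  Price        DECIMAL(10,2),
  Stock        INTEGER,
  Supplier_ID  INTEGER,
  Warehouse    VARCHAR(30)
);

-- Optional view that makes the partitioning transparent to users
CREATE VIEW PART_ALL AS
SELECT s.Part_ID, s.Width, s.Length, s.Height, s.Weight,
       i.Price, i.Stock, i.Supplier_ID, i.Warehouse
FROM   PART_SPEC_INFO s
JOIN   PART_INVENTORY_INFO i ON i.Part_ID = s.Part_ID;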

Normalized Schema

SALE_HISTORY (Sale_ID, Timestamp, Store_ID, Customer_ID, Amount, ...)

Denormalized Schema

SALE_HISTORY_Period_1 (Sale_ID, Timestamp, Store_ID, Customer_ID, Amount, ...)
SALE_HISTORY_Period_2 (Sale_ID, Timestamp, Store_ID, Customer_ID, Amount, ...)
SALE_HISTORY_Period_3 (Sale_ID, Timestamp, Store_ID, Customer_ID, Amount, ...)
...

Fig. 5. An example of horizontal partitioning.

HP is usually applied when the table partition corresponds to a natural separation of the rows, such as in the case of different geographical sites or when there is a natural distinction between historical and current data. HP is particularly effective when historical data is rarely used and current data must be accessed promptly. A database designer must be careful to avoid duplicating rows in the new tables so that "UNION ALL" does not generate duplicated results when applied to the new tables. In general, HP may add a high degree of complexity to applications because it usually requires different table names in queries, according to the values in the tables. This complexity alone usually outweighs the advantages of table partitioning in most database applications.

Partitioning a relation either vertically or horizontally does not cause data duplication (\sum_j X_j = 0). Therefore, data anomalies are not an issue in this case.

4.3. Adding redundant attributes (RA)

Adding redundant attributes (RA) can be applied when data from one relation is accessed in conjunction with only a few columns from another table [13]. If this type of table join occurs frequently, it may make sense to add the column to the first relation and carry it as redundant data [24,38]. Fig. 6 presents an example of RA. Suppose that PART information is more useful with Supplier_Name (not Supplier_ID), that the supplier name does not change frequently due to the company's stable business relationships with part suppliers, and that joins are frequently executed on PART and SUPPLIER in order to retrieve part data from PART and only one attribute, Supplier_Name, from SUPPLIER. In the normalized schema, such queries require costly join operations, while a join operation is not required in the denormalized schema. The major concern of denormalization, the occurrence of update anomalies, would be minimal in this case because Supplier_Name would not be frequently modified.

In general, adding a few redundant attributes does not change either the business view or the relationships with other relations. The concerns related to future development and database design flexibility, therefore, are not critical in this approach. Normally, \sum_b X_b for RA is negligible (\sum_b X_b \approx 0). However, the costs of update transactions may be critical when the volatility of those attributes is high.

Reference data, which relates a set of codes in one relation to a description in another, is accessed via a join, and it presents a natural case where redundancy can pay off [3]. While CR is good for short reference data, adding redundant columns is suitable for a lengthy reference field. A typical strategy for a relation with long reference data is to duplicate the short descriptive attribute in the target entity, which would otherwise contain only a code. The result is a redundant attribute in the target entity that is functionally independent of the primary key. The code attribute is an arbitrary convention invented to normalize the corresponding description and, of course, reduce update anomalies.

Normalized Schema PART (Part_ID, Width, Length, Height, Weight, Price, Stock, Supplier_ID, ···) SUPPLIER (Supplier_ID, Supplier_Name, Address, State, ZIP, Phone, Sales_Rep, ···)

Denormalized Schema PART_DN (Part_ID, Width, Length, Height, Weight, Price, Stock, Supplier_ID, Supplier_Name, ···) SUPPLIER (Supplier_ID, Supplier_Name, Address, State, ZIP, Phone, Sales_Rep, ···)

Fig. 6. An example of adding redundant columns.
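A sketch of the redundant-attribute pattern in Fig. 6. The column types and the literal values in the UPDATE are assumptions; Supplier_Name is copied into PART_DN so the common lookup no longer needs a join, while the SUPPLIER relation is kept for all other supplier attributes.

-- Denormalized part relation carrying the redundant Supplier_Name column
CREATE TABLE PART_DN (
  Part_ID        INTEGER PRIMARY KEY,
  Width          DECIMAL(8,2),
  Weight         DECIMAL(8,2),
  Price          DECIMAL(10,2),
  Stock          INTEGER,
  Supplier_ID    INTEGER,
  Supplier_Name  VARCHAR(60)   -- duplicated from SUPPLIER; rarely updated
);

-- The frequent query now reads a single table ...
SELECT Part_ID, Price, Supplier_Name
FROM   PART_DN
WHERE  Stock > 0;

-- ... at the cost of an extra statement whenever a supplier is renamed
UPDATE PART_DN
SET    Supplier_Name = 'New Supplier Name'
WHERE  Supplier_ID = 42;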

Denormalizing a reference data relation does not necessarily reduce the total number of relations because, in many cases, it is necessary to keep the reference data table for use as a data input constraint. However, it does remove the need to join target entities with reference data tables, as the description is duplicated in each target entity. In this case, the goal should be to reduce redundant data and keep only the columns necessary to support the redundancy and infrequently updated columns used in query processing [24].

4.4. Derived attributes (DA)

With the growing number of large data warehouses for decision support applications, efficiently executing aggregate queries, involving aggregations such as MIN, MAX, SUM, AVG, and COUNT, is becoming increasingly important. Aggregate queries are frequently found in decision support applications where relations with large amounts of history data are often joined with other relations and aggregated. Because the relations are large, the optimization of aggregation operations has the potential to provide huge gains in performance.

To reduce the cost of executing aggregate queries, frequently used aggregates are often precomputed and materialized in an appropriate relation. In statistical database applications, derived attributes and powerful statistical query formation capabilities are necessary to model the data and to respond to statistical queries [18,45]. Business managers typically concentrate first on aggregate values and then delve into more detail. Summary data and derived data, therefore, are important in decision support systems (DSS) based on a data warehouse [22]. The nature of many DSS applications dictates frequent use of data calculated from naturally occurring large volumes of information. When a large volume of data is involved, even simple calculations can consume processing time as well as increase batch-processing time to unacceptable levels. Moreover, if such calculations are performed on the fly, the user's response time can be seriously affected.

An example of derived attributes is presented in Fig. 7. Suppose that a manager frequently needs to obtain information regarding the sum of purchases of each customer. In the normalized schema, the queries require join operations on the CUSTOMER and ORDER relations and an aggregate operation for the total balance of individual customers. In the denormalized schema containing the derived attribute Sum_Of_Purchase in CUSTOMER_DN, those queries do not require join or aggregate operations but rather a select operation to obtain the data. However, in order to keep the data current in the denormalized format, additional transactions are needed to keep the derived attributes updated in the operational database, which in turn increases the costs of additional update transactions.

As DA does not complicate data accessibility or change the business view, \sum_b X_b is negligible (\sum_b X_b \approx 0). It is worth noting that the effects of DA may decrease, particularly when the data volume of the relation increases.

Storing such derived data in the database can substantially improve performance by saving both CPU cycles and data read time, although it violates normalization principles [24,38]. Optimal database design with derived attributes, therefore, is sensitive to the capability of the aggregation functionality of the database server [37]. This technique can also be combined effectively with other forms of denormalization.

DA reduces processing time at the expense of higher update costs [33]. Update costs (\sum_j X_j t_j), therefore, may become the main deterrent for adopting derived attributes [42], and the access frequency, the response time requirement, and increased maintenance costs should be considered [22].

Normalized Schema CUSTOMER (Customer_ID, Customer_Name, Address, ZIP) ORDER (Order_ID, Customer_ID, Order_Date, Standard_Price, Quantity)

Denormalized Schema CUSTOMER_DN (Customer_ID, Customer_Name, Address, ZIP, Sum_Of_Purchase) ORDER (Order_ID, Customer_ID, Order_Date, Standard_Price, Quantity)

Fig. 7. An example of a derived attribute.

A compromise is to determine the dimension of the derived data at a lower level so that the update cost can be reduced. Maintaining data integrity can be complex if the data items used to calculate a derived data item change unpredictably or frequently, and a new value for the derived data must be recalculated each time a component data item changes.

Data retention requirements for summary and detail records may be quite different and must also be considered, as well as the level of granularity for the summarized information [25]. If summary tables are created at a certain level and the detailed data is destroyed, there will be a problem if users suddenly need a slightly lower level of granularity.

5. Justifying denormalization

Determining the criteria for denormalization is a complex task. Successful denormalization of a database requires a thorough examination of the effects of denormalization in terms of 1) cardinality relationships, 2) transaction relationships, and 3) access paths required for each transaction/module.

In this section, we develop a framework for assessing denormalization effects using relational algebra, query tree analysis and the join cost function. The primary purpose of the relational algebra is to represent a collection of operations on relations that is performed on a relational database [11]. In relational algebra, an arbitrary relational query can be systematically transformed to a semantically equivalent algebraic expression in which the operations and the volume of associated data can be evaluated [14]. The relational algebra is appropriate for assessing denormalization effects from a cost-based perspective.

A query tree is a tree data structure that corresponds to a relational algebra expression. It represents the input relations of the query as leaf nodes of the tree and the relational algebra operations as internal nodes. Relational algebra and query tree analysis provide not only an assessment of approximate join costs but also join selectivity and access paths, and they can be used to assist the database designer in obtaining a clear-cut assessment of denormalization effects.

In this study, we assume that the database processing capacity is a function of the CPU processing capacity and that it is measured in terms of the volume of data accessed per unit time. We also assume that the operational intricacies can be neglected for reasons of analytical tractability [46].

5.1. Algebraic approach to assess denormalization effects

To illustrate the approach, we will use the example of CR for a many-to-many relationship from Section 4.1. Suppose there are three relations in 3rd normal form, R = {r1, r2, r3}, S = {s1, s2}, and T = {t1, t2, t3}, as shown in Fig. 8(A), where S is the M:N relationship between entities R and T (see footnote 3), and s1 and s2 are the foreign keys of R and T, respectively. When the cross-reference relation S stores only foreign keys for R and T (or perhaps a few additional attributes), collapsing relation S into T (or R) can be a good strategy to increase database performance, such that after denormalization we have two denormalized relations, R = {r1, r2, r3} and T' = {t1, s1, t2, t3}, as shown in Fig. 8(B).

Now, we consider two queries running Cartesian products with non-indexed tables with the normalized and denormalized schemas, respectively. As illustrated in Fig. 9, these are two sets of SQL statements and relational algebra operations for the nested-loop join queries that retrieve identical data sets from each data structure. We also present two query trees (see Fig. 10) that correspond to the relational algebra of each schema.

Both queries retrieve the appropriate tuples from each schema. Suppose that there are n tuples in both R and T, and m·n tuples in the cross-reference relation S, assuming that the cardinality between the primary and cross-reference relations is 1:m. Let s denote the selection cardinality (see footnote 4) satisfying the selection condition (\sigma_{r1='x'}(R)). The number of resulting join operations on R and S (Res1 \bowtie_{r1=s1} S) is m·n·s. Likewise, as the cardinality of the first join's result is m·s, the final join operation (Res2 \bowtie_{s2=t1} T) produces m·n·s operations, which result from the Cartesian product operation with the three relations.

Footnote 3: In order to clearly illustrate and amplify the effects of denormalization, it is assumed that all R.r1 are related to S.s1, (\forall S)(R.r1 = S.s1), and all T.t1 are related to S.s2, (\forall S)(T.t1 = S.s2), respectively. Also, we use weak entity types that do not have key attributes of their own, although many cases in reality involve strong entity types that do have key attributes of their own.
Footnote 4: The selection cardinality is the average number of records that will satisfy an equality selection condition on that attribute.
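For readers who prefer DDL to set notation, the example relations of Section 5.1 could be declared as follows. The INTEGER types are assumptions, T' is written as T_PRIME because the quote character is not a legal identifier, and no key constraints are declared so as to stay close to the weak-entity assumption in footnote 3.

-- Normalized schema: R and T linked through the cross-reference relation S
CREATE TABLE R (
  r1  INTEGER,   -- join attribute, referenced by S.s1
  r2  INTEGER,
  r3  INTEGER
);
CREATE TABLE T (
  t1  INTEGER,   -- join attribute, referenced by S.s2
  t2  INTEGER,
  t3  INTEGER
);
CREATE TABLE S (
  s1  INTEGER,   -- foreign key to R.r1
  s2  INTEGER    -- foreign key to T.t1
);

-- Denormalized schema: S collapsed into T, giving T' with s1 carried on every row
CREATE TABLE T_PRIME (
  t1  INTEGER,
  s1  INTEGER,   -- foreign key to R.r1, duplicated per cross-reference row
  t2  INTEGER,
  t3  INTEGER
);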

[Fig. 8: A) Normalized — entities R (r1, r2, r3) and T (t1, t2, t3) linked by the M:N relationship S (s1, s2); B) Denormalized — R (r1, r2, r3) and T' (t1, s1, t2, t3).]

Fig. 8. ER diagrams of the exemplary data structure.

Thus, the total number of operations produced from the normalized data structure is 2·m·n·s.

After collapsing the two relations S and T, we have m·n tuples in the denormalized T' (as S and T have been merged into the new entity T', the number of tuples of relation T' should be identical to the number of tuples of relation S). In order to retrieve the same dataset from the denormalized schema, the total number of join operations produced is m·n·s. Notice that the join operation with the denormalized schema is 50% ((m·n·s)/(2·m·n·s)) of the join operations required for the normalized schema.

5.2. Cost-based approach to assess denormalization effects

The effects of denormalization can also be assessed by the join operation cost function [20]. Consider the following formula:

Cost = b_A + (b_A \cdot b_B) + ((js \cdot r_A \cdot r_B)/bfr_{A-B})

where A and B denote the outer and inner loop files, respectively; r_A and r_B indicate the number of tuples stored in b_A and b_B disk blocks, respectively; js is the join selectivity (see footnote 5); and bfr_{A-B} denotes the blocking factor for the resulting join file (records/block) [20]. Notice that the 3rd term of the right-hand side of the formula is the cost of writing the resulting file to disk.

Assuming that the lengths of all columns are equal, the blocking factors of the relations are bfr_R = 4, bfr_S = 6, bfr_{R-S} = 3, bfr_T = 4, bfr_{RS-T} = 2, bfr_{T'} = 3 and bfr_{R-T'} = 2 (see footnote 6), with the join cardinalities js_{RS} = 1/1000, js_{RST} = 1/2000, and js_{RT'} = 1/2000 (see footnote 7). We estimate the costs for the join operation with the normalized schema, C_N, and with the denormalized schema, C_D, as follows:

C_N = (b_R + (b_R \cdot b_S) + ((js_{RS} \cdot r_R \cdot r_S)/bfr_{R-S})) + (b_{RS} + (b_{RS} \cdot b_T) + ((js_{RST} \cdot r_{RS} \cdot r_T)/bfr_{RS-T}))
    = (4 + (4 \cdot 334) + ((1/1000) \cdot 1000 \cdot 2000)/3) + (7 + (7 \cdot 250) + ((1/2000) \cdot 20 \cdot 1000)/2)
    = 3768

C_D = b_R + (b_R \cdot b_{T'}) + ((js_{RT'} \cdot r_R \cdot r_{T'})/bfr_{R-T'})
    = 4 + (4 \cdot 667) + ((1/2000) \cdot 1000 \cdot 2000)/2
    = 3172

Footnote 5: Join selectivity (js) is an estimate of the size of the file that results after the join operation, obtained as the ratio of the size of the resulting join file to the size of the Cartesian product file (js = |A \bowtie_c B| / |A \times B|) [20].
Footnote 6: Blocking factors (bfr = records/block) are calculated based on the assumption that the size of all columns across entities is identical and that the density of blocks across entities is equal, so that the analysis takes into account the change of data volume caused by denormalization. In this case, we assume that bfr_R = 4 and bfr_S = 6 because entity R has 3 columns and entity S has 2 columns. Based on the blocking factors, we also calculated the number of blocks used for each entity (b_R = 4, b_S = 334, b_{RS} = 7, b_T = 250, and b_{T'} = 667). b_R is obtained after the first selection operation (\sigma_{r1='001'}(R)). Since the cost of this operation is common to C_N and C_D, it was not considered in this estimation.
Footnote 7: Join cardinality (js) is calculated based on the assumption that the relationships between entities R and S, and T and S, are 1:2, that there are 1000 tuples in each of relations R and T and 2000 tuples in the cross-reference relation S, and that the selection cardinality of entities R and T is equally 10. In this example, as the cardinality increases, the benefit from denormalization generally increases.
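The relative saving quoted in the next paragraph follows directly from these two figures:

(C_N - C_D)/C_N = (3768 - 3172)/3768 \approx 0.158,

i.e., roughly a 15.8% reduction in estimated join cost for the denormalized schema.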

A) SQL and relational algebra with normalization:

Select R.r1, R.r2, R.r3, T.t1, T.t2, T.t3
From   R, S, T
Where  R.r1 = '001' and
       R.r1 = S.s1 and
       S.s2 = T.t1;

Res1 ← σ_{r1='001'}(R)
Res2 ← (Res1 ⋈_{r1=s1} S)
Res3 ← (Res2 ⋈_{s2=t1} T)
Result ← π_{r1, r2, r3, t1, t2, t3}(Res3)

B) SQL and relational algebra with denormalization:

Select R.r1, R.r2, R.r3, T'.t1, T'.t2, T'.t3
From   R, T'
Where  R.r1 = '001' and
       R.r1 = T'.s1;

Res1 ← σ_{r1='001'}(R)
Res2 ← (Res1 ⋈_{r1=s1} T')
Result ← π_{r1, r2, r3, t1, t2, t3}(Res2)

Fig. 9. SQL and relational algebra with normalization/denormalization.

The result shows that the suggested denormalization technique leads to a decrease of 15.8% in cost over the normalized schema. While the example uses a minimal number of attributes as well as a relatively small number of tuples in each entity, most databases have more attributes and more tuples, and the benefits could be greater as long as the cardinalities remain low. In reality, the positive effect of CR on a many-to-many relationship is expected to increase in proportion to the number of attributes and tuples involved. By using the relational algebra and join cost function approaches, it is concluded that the join process with the denormalized data structure is less expensive than with the normalized data structure. However, the actual benefit will depend on a variety of factors, including the length or number of tuples, join and selection cardinalities, and blocking factors.

6. Concluding remarks

The effects of denormalization on relational database system performance have been discussed in the context of using denormalization strategies as a database design methodology. We present four prevalent denormalization strategies, evaluate the effects of each strategy, and illustrate the conditions where they are most effective. The relational algebra, query trees, and the join cost function are adopted to examine the effects of denormalization on the performance of relational systems.

[Fig. 10: A) Normalized query tree — π_{r1, r2, r3, t1, t2, t3} over (σ_{r1='X'}(R) ⋈_{r1=s1} S) ⋈_{s2=t1} T; B) Denormalized query tree — π_{r1, r2, r3, t1, t2, t3} over σ_{r1='X'}(R) ⋈_{r1=s1} T'.]

Fig. 10. Query trees analysis.

Two significant conclusions can be derived from these analytical techniques: 1) the denormalization strategies may offer positive effects on database performance, and 2) the analytical methods provide an efficient way to assess the effects of the denormalization strategies. The guidelines and analysis provided are sufficiently general and they can be applicable to a variety of databases, in particular to data warehouse implementations.

One of the powerful features of relational algebra is that it can be efficiently used in the assessment of denormalization effects. The approach clarifies denormalization effects in terms of mathematical methods and maps them to a specific query request. The methodological approach of the current study may, however, be somewhat restricted in that it focuses on the computation costs of normalized versus denormalized data structures. In order to further explicate the effects of denormalization, future studies are needed to develop a more comprehensive way of measuring the impact of denormalization on storage costs, memory usage costs, and communications costs.

It is important to note that although denormalization may speed up data retrieval, it may lead to additional update costs as well as slow the data modification processes due to the presence of duplicate data. In addition, since denormalization is frequently case-sensitive and reduces the flexibility of design inherent in normalized databases, it may hinder future development [27]. It is, thus, critical for a database designer to understand the effects of denormalization before implementing the strategies.

While the results of this study are generally applicable to database design, they are particularly applicable to data warehouse applications. Future research on denormalization should focus on a wide variety of operational databases as well as on examining the effects of denormalization on a variety of data models, including semantic, temporal and object-oriented implementations. Denormalizing databases is a critical issue because of the important trade-offs between system performance, ease of use, and data integrity. Thus, a database designer should not blindly denormalize data without good reason but should carefully balance the level of normalization with performance considerations in concert with the application and system needs.

Acknowledgement

The authors are grateful to the anonymous referees for their constructive suggestions and detailed comments on earlier drafts of this article.

References

[1] R. Barquin, H. Edelstein, Planning and Designing the Data Warehouse, Prentice Hall, Upper Saddle River, NJ, 1997.
[2] D. Batra, G.M. Marakas, Conceptual data modelling in theory and practice, European Journal of Information Systems 4 (1995) 185–193.
[3] P. Beynon-Davies, Using an entity model to drive physical database design, Information and Software Technology 34 (12) (1992) 804–812.
[4] N. Bolloju, K. Toraskar, Data clustering for effective mapping of object models to relational models, Journal of Database Management 8 (4) (1997) 16–23.
[5] G. Bracchi, P. Paolini, G. Pelagatti, Data independent descriptions and the DDL specifications, in: B.C.M. Douque, G.M. Nijssen (Eds.), Data Base Description, North-Holland, Amsterdam, 1975, pp. 259–266.
[6] J.V. Carlis, S.T. March, G.W. Dickson, Physical database design: a DSS approach, Information & Management 6 (4) (1983) 211–224.
[7] N. Cerpa, Pre-physical data base design heuristics, Information & Management 28 (6) (1995) 351–359.
[8] C.H. Cheng, Algorithms for vertical partitioning in database physical design, OMEGA International Journal of Management Science 22 (3) (1993) 291–303.
[9] J. Clifford, A. Croker, A. Tuzhilin, On data representation and use in a temporal relational DBMS, Information Systems Research 7 (3) (1996) 308–327.
[10] E.F. Codd, A relational model of data for large shared data banks, Communications of the ACM 13 (6) (1970) 377–387.
[11] E.F. Codd, Relational completeness of data base sublanguages, in: R. Rustin (Ed.), Data Base Systems, Prentice-Hall, Englewood Cliffs, NJ, 1972, pp. 65–98.
[12] G. Coleman, Normalizing not only way, Computerworld 1989 (12) (1989) 63–64.
[13] C.E. Dabrowski, D.K. Jefferson, J.V. Carlis, S.T. March, Integrating a knowledge-based component into a physical database design system, Information & Management 17 (2) (1989) 71–86.
[14] M. Dadashzadeh, Improving usability of the relational algebra interface, Journal of Systems Management 40 (9) (1989) 9–12.
[15] C.J. Date, The normal is so...interesting, Database Programming & Design (1997 (November)) 23–25.
[16] C.J. Date, The birth of the relational model, Intelligent Enterprise Magazine (1998 (November)) 56–60.
[17] C.J. Date, An Introduction to Database Systems, Pearson Addison Wesley, Reading, MA, 2002.

[18] D.E. Denning, Secure statistical databases with random sample queries, ACM Transactions on Database Systems 5 (3) (1980) 291–315.
[19] D. Dey, S. Sarkar, Modifications of uncertain data: a Bayesian framework for belief revision, Information Systems Research 11 (1) (2000) 1–16.
[20] R. Elmasri, S.B. Navathe, Fundamentals of Database Systems, 3rd ed., Addison-Wesley, New York, 2000.
[21] S. Finkelstein, M. Schkolnick, P. Tiberio, Physical database design for relational databases, ACM Transactions on Database Systems 13 (1) (1988) 91–128.
[22] S.F. Flood, Managing data in a distributed processing environment, in: B. von Halle, D. Kull (Eds.), Handbook of Data Management, Auerbach, Boston, MA, 1993, pp. 601–617.
[23] J. Hahnke, Data model design for business analysis, Unix Review 14 (10) (1996) 35–44.
[24] M. Hanus, To normalize or denormalize, that is the question, Proceedings of the 19th International Conference for the Management and Performance Evaluation of Enterprise Computing Systems, San Diego, CA, 1994, pp. 416–423.
[25] J.A. Hoffer, M.B. Prescott, F.R. McFadden, Modern Database Management, 6th ed., Prentice Hall, Upper Saddle River, NJ, 2002.
[26] W.H. Inmon, What price normalization? Computerworld 22 (10) (1988) 27–31.
[27] W.H. Inmon, Information Engineering for the Practitioner: Putting Theory into Practice, Yourdon Press, Englewood Cliffs, NJ, 1989.
[28] W.H. Inmon, Building the Data Warehouse, 2nd ed., Wiley & Sons Inc., New York, 1996.
[29] C.S. Jensen, R.T. Snodgrass, Temporal data management, IEEE Transactions on Knowledge and Data Engineering 11 (1) (1999) 36–44.
[30] M.Y. Kiang, A. Kumar, An evaluation of self-organizing map networks as a robust alternative to factor analysis in data mining applications, Information Systems Research 12 (2) (2001) 177–194.
[31] H. Lee, Justifying database normalization: a cost/benefit model, Information Processing & Management 31 (1) (1995) 59–67.
[32] S.T. March, Techniques for structuring database records, ACM Computing Surveys 15 (1) (1983) 45–79.
[33] D. Menninger, Breaking all the rules: an insider's guide to practical normalization, Data Based Advisor 13 (1) (1995) 116–120.
[34] S.B. Navathe, S. Ceri, G. Wiederhold, J. Dou, Vertical partitioning algorithms for database design, ACM Transactions on Database Systems 9 (4) (1984) 680–710.
[35] S.B. Navathe, M. Ra, Vertical partitioning for database design: a graphical algorithm, Proceedings of the International Conference on Management of Data and Symposium on Principles of Database Systems, Portland, OR, 1989, pp. 440–450.
[36] G.M. Nijssen, T.A. Halpin, Conceptual Schema and Relational Database Design, Prentice-Hall, Sydney, 1989.
[37] P. Palvia, Sensitivity of the physical database design to changes in underlying factors, Information & Management 15 (3) (1988) 151–161.
[38] K. Paulsell, Sybase Adaptive Server Enterprise Performance and Tuning Guide, S.P. Group, Sybase, Inc., Emeryville, CA, 1999.
[39] V. Poe, L.L. Reeves, Building a Data Warehouse for Decision Support, 2nd ed., Prentice Hall, Upper Saddle River, NJ, 1998.
[40] M.J. Prietula, S.T. March, Form and substance in physical database design: an empirical study, Information Systems Research 2 (4) (1991) 287–314.
[41] U. Rodgers, Denormalization: why, what, and how? Database Programming & Design 1989 (12) (1989) 46–53.
[42] D. Rotem, Spatial join indices, Proceedings of the Seventh International Conference on Data Engineering, 1991, pp. 500–509.
[43] G.L. Sanders, Data Modeling, International Thomson Publishing, Danvers, MA, 1995.
[44] M. Schkolnick, P. Sorenson, Denormalization: a performance oriented database design technique, Proceedings of the AICA, Bologna, Italy, 1980, pp. 363–367.
[45] J. Schloerer, Disclosure from statistical databases: quantitative aspects of trackers, ACM Transactions on Database Systems 5 (4) (1980) 467–492.
[46] J.A. Stankovic, Misconceptions about real-time computing: a serious problem for next-generation systems, IEEE Computer 21 (10) (1988) 10–19.
[47] F. Sweet, Process-driven data design (Part 3): objects and events, Datamation (1985 (September)) 152.
[48] H. Thomas, A. Datta, A conceptual model and algebra for on-line analytical processing in decision support databases, Information Systems Research 12 (1) (2001) 83–102.
[49] C. Tupper, The physics of logical modeling, Database Programming & Design (1998 (September)) 56–59.
[50] J.C. Westland, Economic incentives for database normalization, Information Processing & Management 28 (5) (1992) 647–662.

Seung Kyoon Shin is an assistant professor of Information Systems in the College of Business Administration at the University of Rhode Island. Prior to joining academia, he worked in industry as a software specialist and system integration consultant. His research has appeared in Communications of the ACM, Decision Support Systems, and International Journal of Operations & Production Management. His current research interests include strategic database design for data warehousing and data mining, web-based information systems success, and economic and cultural issues related to intellectual property rights.

G. Lawrence Sanders, Ph.D., is professor and chair of the Department of Management Science and Systems in the School of Management at the State University of New York at Buffalo. He has taught MBA courses in the People's Republic of China and Singapore. His research interests are in the ethics and economics of digital piracy, systems success measurement, cross-cultural implementation research, and systems development. He has published papers in outlets such as The Journal of Management Information Systems, The Journal of Business, MIS Quarterly, Information Systems Research, the Journal of Strategic Information Systems, the Journal of Management Systems, Decision Support Systems, Database Programming and Design, and Decision Sciences. He has also published a book on database design and co-edited two other books.