A SOPHISTICATE'S INTRODUCTION To NORMALIZATION THEORY+

Catriel Beeri Philip A. Bernstein Nathan Goodman

Computer Science Department , Aiken Computation Lab. Computer Corp. of America The Hebrew University Harvard University 575 Technology Square Jerusalem, ISRAEL Cambridge, MA 02138 Cambridge, WA 02139

Abstract

Formal database semantics has concentrated on whose columns are labelled-with attributes dependency constraints, such as functional and and whose rows depict . Fig. 1 illustrates multivalued dependencies, and on normal forms a in this way. The data manipulation for relations. Unfortunately, much of this work operators used in this paper are projection and has been inaccessible to researchers outside natural join. The projection of relation R(X) on this field, due to the unfamiliar formalism in attributes T is denoted R[Tl. If V=X-T, which the work is couched. In addition, the R[T] = ilER(X)} , and is defined iff T 5 X. lack of a single set of definitions has confused (If we visualize R as a table, R[TI is those columns the relationships among certain results. This of R labelled with elements of T.) The natural join paper is intended to serve the two-fold purpose of relations R and S is denoted R*S. Given R(X,Y) of introducing the main issues and theorems of and S(Y,Z), where X,Y,Z are disjoint sets, R*S = formal database semantics to the uninitiated, (lER and Es). and to clarify the terminology of the field. Functional and multivalued dependencies are predicates on relations. Intuitively, a (abbr. FD) f:X-tY holds in R(X,Y,Z) iff each value of x in R is associated with exactly one value of Y (see Fig. 1). The truth-value of f can 1. INTRODUCTION of course vary over time, since the contents of R can vary over time. A multivalued dependency 1.1 Database Semantics (abbr. MVD) g:X*Y holds in R iff each X-value in R is associated with a set of Y-values in a way A database is a collection of information about that does not depend on Z-values [see Fig. 1). FDs some enterprise in the world. The role of &tabase and WVDs are defined formally in the next section. semrmtics is to ensure that stored information accurately represents the enterprise. Database. FIGURE 1. A relation with functional and multi- semantics studies the creation, maintenance, and valued dependencies interpretation of as models of external activities. A wide variety of database semantic Relation: RENTAL-UWITS tools exist, ranging from data type constraints, to Attributes: LANDLORD,ADDRESS,APT#,RENT,OCCUPANT,PETS integdty con&mints, to semantic modelling Functional dependencies: structures used in Artificial Intelligence [26,36,391. ADDRFSS,APT#*RENT--Each unit has one rental This paper is concerned with a specif& type of OCCUPANT+ADDRESS;APT#--Every occupant lives database semantic tool, namely da& depada&eS-- in one unit both functional and multi-valued dependencies. This Multivalued dependencies: paper surveys the major results in this area. Our LANDLORD~ADDRESS--Each landlord can own many aim is to provide a unified framework for under- buildings standing these results. OCCuPANT*PETS --Each occupant may have several nets l-2 Database Models Tuples: LXNDLORD, ADDRESS, AeT#, RENT, OCCUPANT, PETS Most work on,data dependencies uses the re- lathed data model, with which we assume reader Wizard, Ox* x3, $ 50, Tinman, Oilcan familiarity at the level of [%I. Briefly, a re- Wizard, -9 $1‘ $ 50, Witch, Bat lational database consists of a set of relations Wizard, OS, Xl, $ 50, Witch,,. Snake defined on certain attributes. RIX) is our notation Wizard* OZ# 112, $ 75, Lion, &owe for a relation named R defined on a set of attsi- cod& 3 NF St, 111, $500, Beeri, Fish butes x.1 The relation R(X) is.a set of m-fxptes-, codd, 3 NF St, #I, $500, Bernstein, Dog where m=/Xl. A relation can be visualized as a Codd, 3 NF st, #l, $500, Bernstein, Rhino Codd, 3 NF St, t2, $600, Goodman, Cat t This work was supported in part by the Mtional %lore generally, the notation R(X,Y,Z,...) denotes Science Foundation under Grant WCS-77435314. relation R defined on XUYUZU... . _,

113 CHl3894/78/0000-0113$00.75 0 1978 IEEE 1.3 Description vs. Content We limit our attention to the first area, though The interplay between database description and some of the work we cover has application in the database content is a major theme in database second area also. The problem of schema design is: semantics. A database description is called a Given an initial schema, find an equivalent one schema, and contains descriptions for each relation that is better in some respect. As we will see, in the database. The description of a single re- different definitions of "equivalent" and "better" lation is called a relation scheme and consists of lead to startlingly different results. the relation name, its attributes and a set of data The paper is organized as follows. Section 2 dependencies. E: denotes a relation scheme R formally defines data dependencies and reviews with attributes T and dependencies I' (see Fig. 2): their basic properties. Section 3 states the schema design problem more precisely. Then We sometimes use the notation -R(T) when r is either unknown or irrelevant. Sections 4, 5 and 6 examine several definitions of schema "equivalence" and several criteria for one FIGURE 2. Formal notation for a relation scheme schema to be "better" than another. Section 7 ties based on Figure 1. these ideas together by looking at specific schema design methods. We conclude with an historical RENTAL-UNITS = look at our field and predications for its future. <{LAND~~,ADDRESS,APT#,RENT,OCCUPANT,PETS'), IADDRZSS,APT# +RR~JT; OCCUPAWT+ ADDRRSS,APT#; 2. DATA DEPENDENCIES LAwDIORD++ADDRRSS; oCCUPAWT++PRTS1> 2.1 Definition and Basic Properties An FD is a statement of the form F:X+Y, where The contents of a relation is called the state X and Y are sets of attributes. f is defGaed for a or extensia of the corresponding scheme, and is relation R(T) or a relation scheme R(T) if X and Y a set of tuples as stated above. R(T) denotes an are subsets of T. If f is defined for R, then f is extension of R=. If R(T) satisfies all a predicate on R's state; f is Valid in R iff every dependencies Tn r, it is called an in8tance of 5 two tuples of R that have the same X-value also (notationally, R denotes an instance of R). A have the same Y-value. From the definition we see re&ztiona~ database for a schema is a coilection of that f's validity depends only on the values instances, one for each relation scheme in the assigned to X and Y. We say that FDs enjoy the schema. projectivity and inverse projectivity properties: In summary, schema and scheme are syntactic For sets X,Y C_ T' C_ T, X+Y is valid in R(T) iff it objects; database and relation refer to database is valid in R[T']. content. The distinction between schema-related An MVD is a statement of the form g:X*Y. g and content-related concepts is often subtle yet is defined for R(T) or R(T) if X and Y are subsets important, and we keep it sharp in this paper. of T. Let Z==T-(XUY). For a Z-value, z, we define Y,,=, fo~~;i~iz:~!. su~hi~$l~d i;; :ff 1.4 The Universal Relation Assumption yxz = yxz I are nonempty. This ieiinition implieythat g7g Most work on data dependencies assumes that validity depends on values assigned to Z, not just all relations in a database are projections of a XandY. If g is valid in R(T), then it is valid single relation. Formally, suppose Rl(Tl), in all projections of R(T); the converse, however, R,(T,] is a database of interest, and R2(T2),...r does not hold. MVDs thus enjoy the projectivity let T = Ulcicn Ti. It is assumed that a tm&ersaZ property but not inverse projectivity. The PD X+Y states that a unique Y-value is rekd~n U?Tj exists, such that Ri = U[Ti] for associated with,each X-value: the MVD X-Y States l

II5 Definition Repl. SD represents the same in- FD X*2 (which is in rf,) is valid in R More- formation as S4 if they contain the same attributes; 12' over, by the universal relation assumption Rl and that is, if uni=l T i =T. R2 are projections of U, as is R,,tXZl. Conse- This definition is inadequate because it ignores relationships among attributes. By this quently the user can obtain the "extension" of definition, the schemas in Figs. 4 and 5 are egui- X+2 from SD, even though X+2 is not explicitly valent to the one in Fig. 2, even though they con- represented. In fact, all FD inference rules can tain no data dependencies. be "simulated" by relational operators applied to relations containing the FDs. It follows that all FIGURE 4. A schema equivalent to one in Fig. 2 FDs in F+ can be retrieved from SD if SD's schemes under Def. Repl. contain a covering of r. MVDs, on the other hand, do not possess the SD = i$, s2, R3} inverse projectivity property, and definition Rep2 is not easily generalized to them. More research is needed to formulate a suitable generalization 51 = of Rep2 for MVDs. 52 = Definition Rep3. SD represents the same in- E3 = formation as S,$ if they have the same attributes and the databases of SD contain the same data as the databases of S+. FIGURE 5. Another schema equivalent to one in Fig. 2 under Repl. In contrast to Rep2, this definition stresses the data component of equivalence. Two schemas are equivalent under Rep3 if at all times their databases contain the same information, albeit in different formats. ahNDLoRD,ib The definition is formalized by the concept of ZossZess join [l]. Suppose U(T) is an instance of U and the corresponding set of instances of SD is TRi(Ti)=U[Ti]Ii=l,...,n). To answer a query in- QPT#, iI> volving, say, all attributes of T, we must re- construct U from fRiJ via the join operator. If U=Rl*...*Rn, then U can be precisely reconstructed -PETS,{)> from its projections. If, however, UtR,...*Rn, the join contains tuples that are not in U, and ERi) is not a faithful representation. This Definition Rep2. SD represents the same phenomenon is called a bossy join and is information as S if they have the same attributes illustrated in Fig. 6. # and the same data dependencies. FIGURE 6. An Example of a lossy join. When only FDs are involved, this definition Let S = {RENTAL-UNITS) defined in Fig. 2, with can be made precise. The FDs of S$ are r+. The 4 FDs of SD are (Uzxlri)+. SD represents S.4 if instance of Fig. 1. Let SD= {LAND-APT#,APT#-RENT,PERS~N-PETS~ r+ = (U"i=lri)+, i.e., if (Uni=lri) is a covering of r. However, there is a problem with the defini- LAND-APT# = tion as stated. The inference rules in Sec. 2 are APT#-RENT = <{APT#,REW~,~b PERSON-PETS = &CCUPANT,PETS,ADDRESS}, only defined with respect to dependencies in a single relation. since SD involves m&tip&? re- ~OCCUPANT-~ADDRESS;OCCUPANT-HPETS}> lations, it is not obvious that those inference Instances corresponding to Fig. 1 are rules can validly be applied to it. Suppose, LAND-AFT# (LANDLORD,hFT#) hF'T#-RENT fAPT#,RENT) Wizard, $1 #3, s 50 SD represents S4 by the above definition. Yet SD Wizard, tl #l, $ 50 does not even contain a relation scheme in which Wizard, 12 #2, $ 75 X+2 is defined: - Codd, #l Cl, $500 This problem is rectified by the "universal coda, #2 #2, $600 relation assumption" (Sec. 11 and the "inverse projectivity property of FDs" fSec. 21. Let Rl LAND-APT#*hPT#-RENT = and R2 be instances of 3 and s; define Attributes: LhNDLOitD, APT++, RENT R =

116 FIGURE 6 continued by a unique database of SD. Rep2 says that every database of SD satisfies the same dependencies as Attributes: LANDLORD, APT#, RENT So, and hence represents a legal database of S . Together, Rep2 and Rep3 imply Repl. 4 Tuples: Codd, #l, $ 50 If only FDs are given, Rep4 is identical to Codd, #l, $500 the notion of independent components 1311, and the Codd, #2, $ 75 following is proved: Codd, #2, $600 Let u=, z$= and s2=. Note that each LANDLORDis associated with RENTS charged by the other. {R,,R,) are independent components of g iff (a; (zlUF2)+ = F+, and (b) F+ contains Tl flT2+Tl Formally we say that SD has the Zoss&?ss join or T1 nT2+T2 [311. property if for each instance U of g, n Comparison of Definitions. Fig. 5 illustrated U = * UITil. a schema equivalent to the schema of Fig. 2 by Rep1 i=l but not by Rep2, Rep3, or Repl. Fig. J differenti- ates between Rep2 and Rep3. Fig. J(a) is similar When only pairs of relations are considered, we to an example in [16, p. 1651. SD is equivalent to have the following results. S,+,under Rep3 but not Rep2; it would be considered FACT 1: If U= (that is, only FDs are algood design by [16,231, but not by [Jl. In Fig. given) then for sets Tl,T2 such that TlUT2 = T, J(b), SD is equivalent to S+ by Rep2 but not by Rep3; it would be approved by [Jl, but not [23,3Jl. {El(Tl), s2((T2)) has the lossless join property These differences of opinion are examined further iff either TlnT2-+Tl or TlnT2+T 2 is in r+[331. in later sections.

FACT 2: For U= and for Tl,T2 as above, FIGURE 7. Situations where Rep2 differs from Rep3. {El(Tl), s2(T2)) has the lossless join property s+= IDWELLER=<{ADDRESS,APT#,OCCUPANT), iff Tl flT2++Tl (and, by rule MVDO, TlnT2++T2) is {ADDRESS,APT#+OCCUPANT; in I'+ [231. OCCUPANT~ADDRESS;OCCUPANT~APT#)>) sD= IADD-occ=; versally quantified sets of instances; i.e., the APT-occ <{APT#,occUPANT},{OCCUPANT'APT#)>} conditions of Facts 1 Ei 2 hold iff all instances SD does not Rep2-represent S . SD Rep3-represents of the given schemas have lossless joins. It is f4 possible, though, for the conditions not to hold, %- yet for specific instances to have lossless joins, S+= (RENTAL=<{LANDLORD,ADDRESS,APT#,OCCUPANT}, nonetheless. Facts 1 & 2 can be adapted for specific instances as follows. (LANDLORD*ADDRESS; ADDRESS,APT#+OCCUPANT; OCCUPANT~+DRESS;OCCUP~~APT#}>} FACT 1': Given -U=, and Tl,T2 as above. An instance U=U[T.]*U[T-I if (but ?Wi! only if) SD= (OWNER = <{LANDLOPD,ADDRESS~, I L ~LAEDLORD+ADDRESS~>; T1”T2+T or T1nT2 5. -T L-.#>-110ux5 In1- .U. . 1 2 DWELLER=<{ADDRESS,APT#,OCCUPANT), FACT 2': Given E=, and Tl,T2 as above. ~ADDRESS,APT#+OCCUPAET; OCCUPANT~ADDRESS;OCCUP~~APT#)>) An instance U=U[Tll*U[T21 iff Tl llT2*Tl (and by SD RepZ-represents S . SD does not Rep3-represents MVDS T1nT2++T2) holds in U. s . (4 + When more than pairs of relations are con- sidered, the situation is more complex. An algo- 5. TEE PRINCIPLE OF SEPARATION rithm for deciding the lossless join property in general is presented in [l]. The algorithm re- The next question is to understand how SD can quires polynomial time for FDs but may require be "better than" S,#,. One way is for "independent Another interesting exponential time for MVDs. relationships"to be represented by SD in indepen- result is thatforall n>2 there are sets of n dent relation schemes. To illustrate this point, relation schemes that have the lossless join let S+ be the RENTAL-UNITS scheme of Figs. 1 & 2, property, for which no proper subset has this and suppose we want to add a new LANDLORDto the property. database. This can only be done if values for other attributes are given, too. The new LANDLORD SD represents the same Definition Rep4. must be associated with an ADDRESS; the ADDRESS information as S,$ iff there exists a one-to-one must be associated with an APT#; the ADDRRs,APT# mapping between the databases of S and databases 0 pair requires a RENT and an OCCUPANT; and the of s D' OCCUPANTneeds PETS. So to add a new LANDLORD, S, forces us to add information that is at most Y Rep4 combines definitions Rep2 and P.ep3. distantly related to him. By the same token, when Rep3 says that every database of S,$ is represented

117 the last PET of the last OCCUPANTof the last APT# RepZ-representing S need always exist. of a given ADDRESS runs away, the association 4 between LANDLORDand ADDRESS is also destroyed. FACT 6: There always exists a 4NF schema that Another problem with S is data redundancy. Rep3-represents S 1231. It follows that a BCNF schema Rep3-representing S is always achievable, Each LANDLORDis represente 2 in four tuples although 4 each only owns one building. To change the too. building owned by Codd, say, requires that all four tuples with ADDRESS= "3 NF St" be updated. If FIGURE 8. 4NF schema and instance corresponding to some of these tuples were forgotten, the database RENTAL-UNITS (Figs. 1 & 2) would be inconsistat, meaning that some depen- dencies would no longer hold. In this case, sD= {owNs,cBARGES,LIVES,L~VES~-~-- LANDLORD-HADDRESS would no longer hold. These difficulties are caused by a lack of OWNS <(LANDLORD,ADDRESS~,{LANDLORD"ADDRESS)> separation in S$. To overcome these difficulties, CHARGES:<~ADDRESS,APT#,RENT~,~ADDRESS,~T#'~NT~> a series of &h&se ncPm&fO~shave been proposed, LIVES =<(OCCUPANT,ADDRESS,APT#), four of which are of interest. Before defining {OCCUPANT+ADDRESS,APT#)> them, we present several preliminary concepts. LOVES = <{OCCUPANT,PETS},{OCCUPANT*PETS~> Let S= &=(i=l,...,n) be a schema OWNS(LANDLORD,ADDRESS)LIVES(OCCUPANT,ADDRESS,APT#) and let r= (Uy,lri). (1) Superkey--Let XET.; X + Wizard OZ Tinman, 02, #3 is a superkey of iR if X+T. is in r . 3. (2.) key-- Codd 3 NF St Witch, 02, #l Let XGT.; X is a key of R. if X is a superkey and Lion, oz., #2 no X'cx'is. (3) prime ;;iftribute--Let AcTi; A CBARGES(ADDRESS,APT#,RENT) Beeri, 3 NF St,#l is prime in.% if A is in any key of 3. 02, #3 $ 50 Bernstein,3 NF St,#l (4) !l'ransit%ve dependence--Let AETi and XETi; A 02, #l $ 50 Goodman, 3 NF St,#2 is transitively dependent on X in 9 if there 02, #2 $ 75 L~VFS(~CCUPANTS,PETS) exists YcTi such that X+YEr+, Y+AEr+, Y+Xer+, 3 NF St, #l $500 and A$! < (5) TKviQZ FD--x+Y is trivial, 3 N-F St, #2 $600 Tinman, Oilcan meaning it holds in all relations, if YEX. Witch, Bat (6) tiviaz m--X*4 and X*Ti-X are trivial Witch, Snake in I. Lion, Mouse We now define four normal forms of interest. Beeri, Fish 1. (abbr. 3NF)*: [141 &ES is Bernstein, Dog in 3NF if none of its nonprimeattributes is Bernstein, Rhino tra=ively dependent on any of its keys Goodman, Cat 2. Boyce-Codd Normal Form (BCNF): [15] Let f:X+Y be any nontrivial FD in r*, defined on %iES. % is in BCNF if for all such f, X is a superkey of Ri. 3. Weak Fourth Normal Form (MNF): Let g:X++YEr+ FIGURE 9. w4NF schema and instance. be any nontrivial MVD in RJ ES. s is in W4NF if it is in 3NF and all such g are FDs. s = IRENTA~UNITS' = 4. Fourth normal Form (4NF): [23) Let g:X+YEr+ 4 <{LANDLORD,ADDRESS,APT#,RENT,OCCUPANT,PETS), be any nontrivial MVD in &ES. 3 is in 4NF if for all such g, X is a superkey of 3. IADDRESS,APT#+ P.ENT;~CCUPANT+ADDRESS,APT#, LANDLORD-HADDRESS;OCCUPANT*PETS; Notice that 3NF is a weak version of BCNF, and W4NF is a weak version of 4NF. Also W4NF implies ADDRESS,APT#+LANDLORD)> 3NF and 4NF implies BCNF. BCNF and 4NF always SD= {OWNS',CBARGES,LIVBS,LOVES~, CBARGES,LIVES,LOVES succeed in separating independent relationships into -- --_I same as In Fig. 8. separate schemes. This is illustrated in Fig. 8. Notice that SD in Fig. 8 RepZ-represents RENTAL- OWNS' = <{LANDLORD,ADDRESS,AT#),CLANDLORD-HADDRESS; UNITS. There are cases, though, where the stronger ADDP.ESS,APT#+LANDLORD~> normal forms cannot be achieved and we must settle CBARGES,LIVES,LOVES in 4NF (from Fig. 8) for the weaker forms. Fig. 9 shows an example of -- this sort. The following formalize this observ- OWNS' in W4NF since LANDLORD'ADDRESS implied by ation. Let s,+,= {c=). S 's dependencies: 4 FACT 3: There always exists a 3NF schema that (1) OCCUPANT'ADDRESS,APT#KJCCUPANT'ADDR=S(FD6) Repl-represents So (7). (2) LANDLORDHADDRESS and OCCUPANT'ADDRESS I, LANDLORD+ADDRESS(FD-MVD~) FACT 4: There need not exist a BCNF schema that RepZ-represents S+. Moreover the question, Extension of OWNS', given data in Figs. 1 and 8. "Is schema S in BCNF?" is NP-hard [51. OWNS'(LANDLORD, ADDRESS, APT#) There need not exist a 4NF schema FACT 5: Wizard, #3 that RepZ-represents S$. (Follows from Fact 4 when 02, Wizard, 02, #l G= 4.) It is not known whether a W4NF scheme Wizard, oz, #2 *~NF simply requires that relations be "flat", non- Codd, 3NFst, #l hierarchical. ~NF is a weak form of 3NF and is Codd, 3NFst, #2 subsumed by it [14,161.

118 Another observation to make from Fig. 9 is ADD-OCC*APT-OCC = that W4NF doesn't achieve total separation in the way 4NF does. OWNS' has redundant information and Attributes: ADDRESS, APT#, OCCUPANT suffers the same kind of update anomalies as Tuples: 02, #3, Tinman RENTAL-UNITS does. The same is true of 3NF vs. 02, #l, Witch BCNF. And since 4NF and BCNF cannot always be 02, ?a, Lion achieved under Rep2 (Facts 4 & 51, we must conclude 02, #2, Scrarecrow that Rep2 and total separation are incompatible concepts. This result is both surprising and (b) fundamental; it holds for non-computerized data- bases as well as computerized ones, and has applicability in all data models. 6. THE PRINCIPLE OF MINIMAL REDUNDANCY This result has been interpreted differently by some workers [16,24] who argue that BCNF and Another goal in designing SD is minimal ra- 4NF schemas should be obtained even if Rep2 is not dundrmcy; SD must contain the information needed achieved. We saw such a case in Fig. 7(a), which to represent S$ but it should not contain the in- we replicate in Fig. 10. In that example, SD formation redundantly. The meaning of minimal violates Rep2 because it does not include redundancy depends on the definition of represent- ADDRBSS,APT#'OCCUPANT. Without this FD legal ation. Only by knowing what it means to represent instances of SD can correspond to illegal in- information can we judge whether a certain re- presentation is redundant. stances of S@' and may represent illegal conditions in the real world (see Fig. 10(b)). It is Virtually all work on schema design adopts suggested in [16] that these illegal instances be some notion of minimal redundacy, although often prevented by adding ADDRESS,APT#+OCCUPANT to SD this point is addressed intuitively. Consequently as an "interrelational constraint." However, our treatment of redundancy must be sketchier than because Rep2 is incompatible with BCNF in this the previous sections. We present here different case, this suggestion is futile. If we add the definitions of redundancy analogous to the defi- suggested interrelational constraint, the two nitions of representation in Sec. 4. In the relations can no longer be updated independently, following, let SD=(~i=li=l,...,n}. which simply defeats the original goal of sepa- ration. Definition Redl. EiiEsD is redundant if In other words, while total separation is a T.cUn This approach, like Definition goal of schema design, there simply are cases where i- j=l,j#iTj' it cannot be achieved. Rep1 , is unsatisfactory since it does not account for relationships among attributes. Also, minimal redundancy under Red1 is always attained in S FIGURE 10. An instance of SD that is not an 4 instance of S since each attribute appears only once. 4' Definition Red2. SiEsD is redundant if 3's s+= {DWELLER= ) be made formal. R. ES, is redundant if (T=lrj)+ sD= {ADD-• CC= ; j=l,j+irj) e APT-OCC=<~APT#,OCCUPANT};(OCCUPANT~APT#}>}. be explicitly represented. Rather, they need only be derivable from the FDs in the other schemes. SD does not Rep2-represent S . SD Rep3-represents As for Rep2, this definition does not easily 4 generalize to MVDs since rules for manipulating %' (a) MVDs in different relations are not known. Relation: ADD-OCC Definition Red3. $6 SD is redundant if for Attributes: ADDRFSS,OCCUPANT each database of SD, the data in Ri is contained in (RjIj=l,...,n, j#i). For this definition to be Tuples: Oz, Tinman meaningful, a database of SD must be viewed as a Witch 02, set of related relations, since if relations can Lion 02, assume independent values, no relation scheme is Scarecrow 02, ever redundant. The universal relation assumption (Sec.1) provides the necessary connection and leads Relation: APT-OCC to the following. R+ is redundant in SD if Attributes: APT#, OCCUPANT f R. = 0 for all databases of SD. j=l j j=l,j#iRj' ! Tuples: #3, Tinman #I, Witch Definition Red4. si.SD is redundant if there #2, Lion is a one-to-one correspondence between the set of a, Scarecrow instances of SD and the set of instances of SD-&_i). This definition, like Repl, combines data and Each relation has legal contents--all dependencies dependency aspects of schema design. hold. But, ADDRESS,APT#+OCCUPANT does not hold in ADD-OCC*APT-OCC. We note, in conclusion, that other approaches to redundancy are possible, e.g., using as a

119 measure the number of data items in relations, etc. 7.1 The Synthesis Approach We discuss the synthesis approach in terms of 7. SCHEMADESIGN METHODS the specific method of [7]. A central concept of this method is embodied FDS, which are FDS implied Traditionally, schema design has been called by keys. Formally, given si=, X+A iS "" in the literature in this embodied in % if X'AErir and X is a superkey of area and two approaches are prominent: synthesis Ri* Fig. 12 presents a simplified synthesis algo- [5,7], and decomposition [14,20,21,25,41]. This rithm (called SYNl) that uses embodied FDs to con- section describes both approaches, explaining how struct an SD Repa-representing S,$. SYNl is a first they interpret and achieve the schema design step towards a correct synthesis algorithm. SYNl principles discussed earlier. is not yet correct because SD is not necessarily The key difference between synthesis and de- in 3NF, so transitive dependencies can be composition lies in the definition of representation that each adopts. In synthesis SD RepZ-represents FIGURE 12. Simplified Synthesis Algorithm input S , whereas with decomposition SD Rep3-re 8 resents s f’* This difference leads to a Algorithm SYNl series of other d screpancies between the methods: (1) Since Rep2 is not compatible with total separa- Input: S4 = fIJ = j tion (Sec.S), synthesis can only achieve 3NF and output: SD = {R. not higher normal forms: decomposition, on the -1 = )i,l ,...,n other hand, is not limited in this way. (2) Rep2 leads to the Red2 definition of redundancy, while 1. AFind Covering]. Find a nonredundant covering Rep3 leads to Red3; therefore synthesis strives F of F. for minimality of dependencies while decomposition strives for minimality of data content. (3) Because 2. (). Partition F^ into "groups", Fir definitions Rep2 and Red2 do not easily extend to i=l,..., n, such that all FDs in each Fi have MVDs,itisnotknownhow [or if) synthesiscanhandle the same left hand side, and no two groups have MVDS; decomposition,ontheotherhand, is straight- the same left hand side. forwardlyextendabletoMVDs. (4) Finally, as ex- 3. (Construct Relations). For each Fi construct a plainedinsec. 5, RepSdoesnotguaranteethatallin- relation scheme s = where Ti = all stances of SD correspondtolegalinstancesofS4; thus attributes appearing in Fi, schemas producedbydecompositionadmit instancesthat would not be permitted by synthesized schemas. These Important Fact : The left hand side of every FD in differences are summarized in Figure 11. Fi is a superkey of 3; each FD in Fi is em- bodied in R.. FIGURE 11. Differences between principal Normalization Methods

Method Synthesis Decomposition Decomposition exhibited within individual FDs, due to extraneous attributes in their left hand sides. An attribute is extraneous in an FD if it could be eliminated Describei Bernstein Fagin [25] Rissanen [31] from the FD without affecting the closure (F+). by r71 Let us precede SYNl with a step that elimi- nates extraneous attributes from the left sides of Defini- Reps, "SD Rep3, "SD ,,sD and ReP4, FDs in F, and call the resulting algorithm SYNl'. tion of has same has same S$ databases SYNl' produces schemas that Rep2-represent the in- Represen- dependen- data con- are l-to-l" put and are guaranteed to be in 3NF. Algorithm tation cies as tent as S n SYNl' thus meets the representation and separation s n 4 $ goals of schema synthesis. h The next step is to achieve minimality. Let F Dependen- FDs FDs + MVDs FDs be the nonredundant covering of F obtained by cies SYNC' af$er excising extraneous attributes, and suppose F includes V+W and X+Y, X#V. Clearly Normal 3NF 4NF,BCNF 3NF these FDs will be embodied in different relation Form schemes. But suppose V and X are eqU&&Zxt; i.e., Then V-*X and X+V can Red4,"both V+X and X+V are in F+. Defini- Red2 "re- Red3,"redun- be embedded in one relation scheme with both V and tion of dundincy dancy of data Red2 + Red3" Doing so reduces the number of synthe- of depen- content*' (not (not attained X as keys. Redun- sized schemes and makes explicit the equivalence dencies" attained by by current al- dancy of X and V. attained) current algo- gorithms] So, we add a stage to SYNl' to find and merge rithms] all relation schemes with equivalent keys. Dn- Instan- Same as More than SameasS fortunately, this modification takes a step back- ces ad- 4 ward: it no longer produces 3NF schemes! _When we mitted % % merge relation schemes we also add FDs to F from by SD -- *A decomE sition app ach is suggested by [Rissanen,77] in which SD Rep4-represents S,+,. This approach is algorithmically similar to the other decomposition approaches so will not be discussed separately.

120 (F+-G) which may thereby cause F to become redun- FIGURE 14. Basic Decomposition Algorithm dant. A final stage is needed to eliminate this redundancy. This modification brings us to algo- Algorithm DEC ritlxnSYN2 (Fig. 13), which is our final schema synthesis algorithm. The following facts, are Input: = fu = } proved in [71, establishing that SYN2 achieves the % three schema design principles. Given output: SD = {R+ = li=l,...,n) S+= lg=) and SD = the result of applying sYN2 to s 1. (Initialize) Set k:=@. 4' 2. (Test for Separation) If all schemes in Sk are FACT 7: SD RepZ-represents S . in 4NF, then output SD:=Sk. 4 FACT 8: SD is in 3NF. 3. (Decompose) Set Sk+l:=$. Let %= be any non-4NF scheme in S and set Sk:=Sk-$. FACT 9: SD is minimally redundant under de- k' Decompose $ into R+ 1 and R+ 2 as follows: finition Red2. In fact SD is minimal in an even stronger sense. Let SD = {S;lSA Rep2-represents S (1) Let X*Y be any'non-trivial MVD in r de- and all FDs in St, are embodced FDS). sD contains 4 no more relation schemes than any other scheme in SD- In other words, SD is the smallest schema that can Rep2-represent S using just keys. 4 FIGURE 13. A Correct Synthesis Algorithm [5]. R+fS . k Algorithm SYN2 4. (Eliminate Some Redundancy) For each gi,%6Sk+l. If TiGT. set S Input: S+ = (II= }. k+l:=Sk+l - $1. 5. (Iterate)' Set k:=k+l. Go to step 2. output: SD = {Ei = li=l,...,n)

1. (Eliminate Extraneous Attributes) Eliminates extraneous attributes from the left side of FACT 11: SD is in 4NF. each FD in F, producing the set F'. Reason: DEC will not stop until SD is in 2. (Find Covering) Find a nonredundant covering F^ 4NF. of F'. FACT 12: SD is not necessarily minimally re- 3. (Partition) Partition F^ into groups Fir dundant under Rep3 (or most other reasonable de- i=l,...,n , as in step 2 of SyNl. finitions). 4. (Merge Equivalent Keys) Set J:=@. For each Reason: Fig. 15 shows two ways DEC could pair of groups Fir Fj with left hand sides Xi, decompose the same schema, one of which is minimal Xj do the following: If Xij . EF+ and and one of which is not. Few minimality facts XjjXiEF', merge Fi and Fj, a"a d XiTXj and have been established regarding decomposition, and Xj+Xi to J, and remove them from F. it is not even known whether minimal schemas can be produced by non-deterministic decomposition. Find a 5. (Eliminate Transitive Dependencies) Also, it is not known whether decomposition can minimal F^' C F^ such that ($+J)+= ($'+J). Delete consider coverings of dependencies rather than each element of Fh-Fh' from the group in which it entire closures; in the specific case of Fig. 15(a) appears. For each Xi+Xj in J, add it to the minimality would be guaranteed if S$ were decom- corresponding group. posed using a nonredundant, nonextraneous covering 6. (Construct Relations) For each Fi construct a of r. relation scheme % = where Ti = all Algorithmic aspects of decomposition have not attributes appearing in F.. been considered fully either, and current algorithms 1 have high computational complexity. For example, DEC is probably very slow, because the question 7.2 The Decomposition Approach "Is schema S in 4NF?" is NP-hard and DEC asks this Fig. 14 shows a typical decomposition algo- question repeatedly. A related problem is caused rithm (which we call algorithm DEC) adapted from by using closures rather than coverings. Closures WI. DEC achieves the representation and separa- can be exponentially large and their use can lead tion goals of decomposition but does not achieve to exponential worst case running time. minimal redundancy. These conclusions are stated Another problem is that decompositions are formally as follows. not unique. At each stage the algorithm may have SD = the result of app?~~~~ ig %~~"" and several decomposition choices with different choices leading to very different outputs (e.g., FACT 10: SD Rep3-represents S 0' Fig. 15). Some choices produce "natural looking" Reason: Whenever a scheme & is decomposed schemas while other choices may lead to bizarre into 3 and 3 (in step 3 of DRC) &,3 has the results (see Fig. 16). Also, the dependencies in lossless join property (by Fact 2). the output schema can depend idiosynchratically on the input and the algorithm (see Fig. 17).

121 Notice in Fig. 17 that Sl Rep3-represents Sl LANDLORD,ADDRESS,APT#,PENT, D1 cp' c Rep3-represents S2 and S and S2 both Rep3-repre2 ~CCUPANT,PETS #' 4 cp sent a third schema S = {g=

FIGURE 15. Algorithm D does not achieve minimal /\ redundancy (italicized attributes LANDLORD, APT#, RENT APT#, OCCUPANT become relations schemes). (b) S; s$=I~=~

ABCD FIGURE 17. Idiosyncratic behavior of decomposition

ABC

//I+\ //.?( (Note : d)+ = (F2)+). s; = $= AB BC BD CD R1= -2 (a) 177=)

ABCD /\ S; = h;= E;={}>

I%%=) .

8. HISTORY, CONCLUSIONS, AND FUTURE WORK

The history of database normalization theory AB ;sC BC CD begins with Codd's early work [14]. Codd intro- duced the notion of FD, but did not formalize it. (b) The first mathematizations of FDs were by Delobel FIGURE 16. Natural and unnatural 4NF schemas [17], Rissanen and Delobel [331, and Delobel and produced by decomposition. Casey [201; these authors concentrated on formal properties of dependencies and their relationship S = completeness of a set of rules for FDs. This work laid the groundwork for the formal theory that has LANDMRD,ADDRESS,APT#,RENT, developed since. The earliest synthesis algorithm OCCUPANT.PETS was an informal one described by Wang and Wedekind t381. Bernstein [173 followed with a synthesis algorithm that used Armstrong's theory to prove LANDLORD;ADDRESS,APT#, ADDRESS, APTc#, RENT properties of synthesized schemas. Bernstein's OCCUPANT,PETS algorithm was the first to use a formal definition of representation. This algorithm was subsequently ADDRESS,APT#, OCCUPANT enhanced by Bernstein and Beeri [5,81 who improved /' its running time. LANDLORD,OCCUPANT,PETS The first generalization of FDs was the con- cept of first order hierarchical decomposition by LANDLORD, UC&&NT h&ANT, PETS Delobel 1181 and Delobel and Leonard [211. The related concept of MVD was introduced by Fagin [231 (a) Si and Zaniolo [41], and 4NF was introduced by Fagin

122 E31. Completeness of inference rules for MVDs is REFERENCES treated by Beeri, Fagin and Howard [6] and Mendel- zohn 1291, and algorithmic questions about MVDS by Beeri [4]. Recently, attention has been directed 1. A.V. Aho, C. Beeri, and J.D. Ullmann, "The to the representation principle by the work of Aho, Theory of Joins in Relational Databases," Beeri, and Ullman [l] and Rissanen [31]. Proc. 18th IEEE Symp. on Foundations of These references are a mere sketch of the Computer Science, Oct. 1977. history of normalization theory; a more complete bibliography follows. 2. A.V. Aho, J.E. Hopcroft, and J.D. Ullman, The A variety of important results appear in these Design and Anal&s of Computer Algorithms, papers, but the lack of uniform definitions has Addison-Wesley, Reading, Mass., 1974. obscured the relationships among many works. We hope the paper will clear up some of the confusion 3. W.W. Armstrong, "Dependency Structures of by comparing the major definitions and outlining Database Relationships," Proc. IFIP 74, North a general framework in which all can be embedded. Holland, 1974, pp. 580-583. Our main theme is that schema design is directed by the three principles of representation, 4. C. Beeri, "On the Membership Problem for separation, and minimal redundancy. A goal of Multivalued Dependencies in Relational Data- research in schema design is to develop a design bases," TR-229, Dept. of Elec. Eng. and Comp. methodology that satisfies these three principles. Science, Princeton Univ., Princeton, N-J., Specific formulations of the principles depend upon Sept. 1977. the types of constraints involved, so a thorough understanding of the formal properties of FDS and 5. Beeri, C. and P.A. Bernstein, "Computational MVDs is a prerequisite for achieving this goal. Problems Regarding the Design of Normal Form Many questions still remain unanswered. We Relational Schemas," ACMT~ans. on Database list four important areas where more work is sys., to appear. needed: 1. Other dependency structures--An MVD can hold in 6. C. Beeri, R. Fagin, and J.H. Howard, "A Com- a projection of a relation, although it does not pleteAxiomatization for Functional and Multi- hold in the entire relation 119,251. These em- valued Dependencies," Proc. ACM-SIGMODConf., bedded MVDs (abbr. EMVD) may appear when decompo- Toronto, Aug. 1977, pp. 47-61. sing a relation scheme into smaller schemes. While some inference rules for EMVDs have appeared, a 7. P.A. Bernstein, "Synthesizing Third Normal complete set is not currently known [19]. Form Relations from Functional Dependencies," MVDs characterize lossless joins between two ACMTrans. on Database Sys., Vol. 1, No. 4 relations. Dependency structures that characterize (Dec. 1976), pp. 277-298. lossless joins among N relations have recently been suggested, and should be integrated into the theory 8. P.A. Bernstein and C. Beeri, "An Algorithmic [30,321. In addition, the concept of representation Approach to Normalization of Relational Data- (particularly Rep2, Rep4) has only been developed base Schemas," TR CSRG-73, Computer Systems for FDs. Representation questions about MVDs and Research Group, Univ. of Toronto, Sept. 1976. other dependency structures are open. 2. Semantic operations on dependencies--Dependency 9. P.A. Bernstein, J.R. Swenson, and D.C. structures can be used to guide correct retrievals Tsichritzis, "A Unified Approach to Functional given only minimal logical access path information Dependencies and Relations," Proc. ACM-SIGMOD [11,34]. However, the influence of dependency Conf., San Jose, Cal., 1975, pp. 237-245. structures on data operations and the constraints that hold in a relation constructed by operations 10. J-M. Cadiou, "On Semantic Issues in the Re- lational Model of Data," Proc. Intern. Symp. are only known for special cases. 3. UniversaZ PeZation asswnption--This assumption on Math. Foundations of Comp. Science, Gdansk, simplifies many theoretical problems but apparently Poland, Sept. 1975, Springer-Verlag Lecture does not hold in practice. It should either be Notes in Computer Science. abandoned or adapted for practical situations in Carlson, C.R. and R.S. Kaplan, "A Generalized some way. 11. Access Path Model and its Application to a 4. Design tools--Mechanical procedures must be developed to assist the database designer. A Systems," Proc. 1976 ACM- schema synthesis algorithm that takes FDs and MVDs SIGMOD Conf., ACM, N.Y., pp. 143-156. as input could be one such design aid. Mechanical "The Entity-Relationship Model: mappings from high level data descriptions (e.g., 12. P.P-S. Chen, Toward a Unified View of Data," ACM!i?~a?z~. on [36]) into dependency structures are also needed. The true test of the theory is demonstrating its Database Sys., vol. 1, No. 1 (Sept. 1976), effectiveness in solving day to day pp. 9-36. problems. On this metric the theory will live or E.F. Codd, "A for Large Shared die. 13. Data Bases," CACM,Vol. 13, No. 6 (June, 1970), pp. 377-387.

123 14. E.F. Codd, "Further Normalization of the Data 32. J. Rissanen, "Theory of Relations for Data- Base Relational Model, m in Data Base Systems bases--A Tutorial Survey," IBM Research Lab., (R. Rustin, ea.), Prentice-Hall, Englewood San Jose, Cal., Apr. 1978. Cliffs, N-J., 1972, pp. 33-64. 33. J. Rissanen and C. Delobel, "Decomposition of 15. E.F. Codd, "Recent Investigations in Relational Files--A Basis for Data Storage and Retrieval," Data Base Systems," Proc. IFIP 74, North- IBM Res. Rep. RJ1220, San Jose, Cal., May Holland, 1974, pp. 1017-1021. 1973. 16. C.J. Date, An Introduction to Database Systems 34. K.L. Schenk and J.R. Pinkert, "An Algorithm (2nd ea.), Addison-Wesley, Reading, MA, 1977. for Servicing Multi-Relational Queries," 17. C. Delobel, "A Theory About Data in an Inform- Proc. 1977 ACM-SIGMODConf., ACM, N.Y., ation Systems," IBM Res. Rep. RJ964, Jan. 1972. pp. 10-19. 18. C. Delobel, "Contributions Thgoretiques 2 la 35. H.A. Schmid and J.R. Swenson, "On the Semantics Conception d'un Systkne d'Informations," Ph.D. of the Relational Data Model," Proc.1975 ACM- Thesis, Univ. of Grenoble, Oct. 1973. SIGMOD Conf., San Jose, Cal., pp. 211-223. 19. C. Delobel, "Semantics of Relations and Decom- 36. J.M. Smith and D.C.P. Smith, "Database Ab- position Process in the Relational Data Model," stractions: Aggregation and Generalization," Computer Laboratory, Univ. of Grenoble, 1977. ACM Trans. on Database Sys., Vol. 2, No. 2 (June 19771, pp. 105-133. 20. C. Delobel and R.C. Casey, "Decomposition of a Data Base and the Theory of Boolean Switching 37. Y. Tanaka and T. Tsuda, "Decomposition and Functions," IBM J. of Res. and Dev., 17:5 Composition of a Relational Database,' Proc. (Sept. 1972), pp. 370-386. 3rd VLDB Conf., Tokyo, Oct. 1977, pp. 454-461. 38. C.P. Wang and H.H. Wedekind, "Segment Synthe- 21. C. Delobel and M. Leonard, "The Decomposition Process in a Relational Model," Int. Workshop sis in Logical Data Base Design," IBM J. of on Data Structures, IRIA, Namur (Belgium), Res. and Dev., Vol. 19, No. 1 (Jan. 1975), May 1974. pp. 71-77. 22. R. Fadous and J. Forsythe, "Finding Candidate 39. H.K.T. Wong and J. Mylopoulos, "Two Views of Keys for Relational Data Bases," Proc. 1st Data Semantics: A Survey of Data Models in ACM-SIGMODConf., pp. 203-210, San Jose, Cal., Artificial Intelligence and Data Management," 1975. Tech. Rep., Computer Science Dept., Univ. of Toronto, 1977. 23. R. Fagin, "Multivalued Dependencies and a New Normal Form for Relational Databases," ACM 40. C.T. Yu and D.T. Johnson, "On the Complexity Trcms. on Database Sys., Vol. 2, No. 3 (Sept. of Finding the Set of Candidate Keys for a 19771, pp. 262-278. Given Set of Functional Dependencies,' Infon- ation Processing f&tters, Vol. 5, No. 4 24. R. Fagin, "The Decomposition Versus the (Oct. 19761, pp. 100-101. Synthetic Approach to Relational Database C. Eaniolo, "Analysis and Design of Relational Design," Proc. 3rd VLDB Conf., Tokyo, Oct. 41. Scemata for Database Systems," Tech. Rep. 1977, pp. 441-446. UCLA-ENG-7769, Dept. of Computer Science, 25. R. Fagin, "Functional Dependencies in a Re- UCLA, July 1976. lational Database and Propositonal Logic," IBM J. of Res. ad Dev., Vol. 21, No. 6, Acknowledgments (NOV. 1977), pp. 534-544. We gratefully acknowledge the assistance of 26. M.M. Hammer and D.J. McLeod, "Semantic Inte- Renate D'Arcangelo for her expert preparation of grity in a Relational Database System," Int. this manuscript. Conf. on Very Large Data Bases, ACM, N.Y., pp. 25-47, 1975. 27. W. Kent, "A Primer of Normal Forms," IBM Sys. Dev. Div., TRO2.600, San Jose, Cal., 1973. 28. C.L. Lucchesi and S.L. Osborn, "Candidate Keys for Relations," Tech. Rep. Univ. of Waterloo, Waterloo, Ontario, Canada, 1976. 29. A. 0. Mendelzon, "On Axiomatizing Multivalued Dependencies in Relational Databases," Dept. of Electr. Eng. and Computer Science, Princeton Univ., Princeton, N.J., July 1977. 30. J.M. Nicolas, "Mutual Dependencies and Some Results on Undecomposable Relations," ONERA- CERT, Toulouse, France, Feb. 1978. 31. J. Rissanen, "Independent Components of Relations ,'I ACMTrans. on Database Sys., Vol. 2, No. 4 (Dec. 19771, pp. 317-325.

124