fi,'1.,.!r'$

tt Uqt has a very practical exterior but a very theoretical interior. That interior is the relational rnodel.The relational model came before SQLand created a need for SQL.The power of SQL liesnot in the languageitself but in the conceptsset forth in the relationalmodel. These concepts form the basisof SQL'sdesign and operation. The relational model is a powerful and elegantidea that has pervadednot only computer sciencebut our daily lives aswell, whether we know it or not. Like the automobile, or penicillin, it is one of the many great examplesof human thought and discoverythat has fundamentally impacted the way the world works. The relational model has spawneda billion-dollar industry, and has become an integral part of almost every other industry. You'll seethe relational model at work nearly everywherethat dealswith information of any kind-in Fortune 500companies, universities,hospitals, grocery stores,websites, routers, MP3 players,cell phones, and even smart cards.It is truly pervasive. Background The relationalmodel wasborn in 1969,inside of IBM. A researchernamed E. F. Codd distributed an internal paper titled "Derivability, Redundancy,and Consistencyof RelationsStored in LargeData Banks,"which defined the basic theory of the relational model. This paper was circulated internally and not widely distributed. In 1970,Codd published a more refined version of this paper in Communicationsof theACM called "A RelationalModel for large SharedData Banks,"which is more widely recognizedas the seminal work on relational theory. This is the paper that changedthe industry. Despite the enormous influence of this paper, the relational model in its entirety did not appearovernight, or within the scopeof a singlehistoric paper. It evolved,grew, and expanded over time and was the product of many minds, not just Codd alone.For example,the term "relational model" wasn't coined until 1979,nine yearsafter publication of the paper proposing it (seeDate, 1999,in the Referencessection). Codd's 12 rules, now famousas the working defi- nition of relational, weren't posited until 1985,appearing in a wvo-partseries published by Computerworldmagadne.Thus, while there was a seminal paper proposing the generalconcept, much of the relational model aswe know it today is actually the result of a seriesof papers,articles, and debatesby many people over a 30-yearperiod, and indeed continues to develop to this very day. That is alsowhy in some of the various quotes included in this chapter you may see terms such as "data base,"which may initially appear to be a typo. They aren't. At the time these statementswere made, the terminology still had not congealedand all sorts of various terms were being thrown around. 47 48 CHAPTER3 :.,;,THE RELATI0NAL M0DEL

The Three Components This chapteri of it. It is merelya enoughwas known the relational principle By 1980, about model for Codd to identiff three presentsonly the r components: underpinnings of , . The structural component: This component defineshow information is structured, or in generaland in t represented.Specifically, all information is representedas relatioru,which are composedof splendor is far bel tuples,which in turn are composed of attribute and ualuecomponents. The relation is stamina to fully dr the sole data structure used to representall information in the database.Relations as learn more. The ru defined in the relational model are derived from relations in set theory, a branch of Computerworldat mathematics, and sharemany of their properties.They are formally defined in Codd's 1970paper. TheStruc . The integrity component: This component definesmethods that enforce relationships within and betweenrelations (or tables)in the structural component. Thesemethods The structural cor are called constraints,and are expressedin the form of rules. There are three principle componentsbuil( types of integrity: domain integrity, governingvalues in columns; entity integrity, governing first of Codd's 12r rows in tables;and referentialintegrity, goveming how tablesrelate to one another. Integrity has no analogin set theory, but rather is unique to relational theory. It was initially The Informt addressedin Codd's 1970paper and greatlyexpanded upon in the 1980sby Codd The first rule, call and others. defined as follows . The manipulative component: This aspectdefines the methods with which to operate on or manipulate information. Like relations, theseoperations also have their roots in 1. The Inforr mathematics.They are formalized in relational algebraand relational calculzsas origi- explicitly at tt nally presentedin Codd's 1972paper "RelationalCompleteness of Data Base Sublanguages." Date summarizes

SQL,likewise, is structured similarly along theselines-so similarly, in fact, that it is hard Theentire inJ to talk about SQLwithout addressingthe relational model to somedegree, directly or indirectly. namely as exl It is true that SQLis for the most part a straighdorward language,which to many people appearsErs an island unto itself. But in reality, it is the offspring of the relational model, and in manyways 'l'hereare naroimg is clearly a reflectlon of it. Iogicalrepresenta within it. It is a kin SQLand the RelationalModel tlepictionof data, This chapterpresents the theoreticalroots of SQL,and examinesthe power and elegance The view pr behind SQL.It preparesyou to dealwith SQLnot in isolationbut in the contextof the relational aremade u you powerfi,ll, model. If notNng else,it shouldgive an appreciationfor how elegant, and complex The view is a beasta relational databasereally is. You may be surprisedby what eventhe most rudimentary or hardwar relational databasesare capableof. You don't haveto read this chapter in order to understandSQL; it merely providesa theoretical The logical re and historical backdrop.To this end, I presentthe theoretical aspectsin terms of severalinfluential rlatabaseis imple papersand articleswritten by Codd between lg70 and 1985that define the essenceof the relational l:rttercomponent model, alongwith the ideasand work of other contributorsto the field. Two of Codd'spapers rq)resentation-t mentioned here are availableonline, and are listed in the Referencessection at the end of the tlbles in a differe chapter.Again, others contributed to the developmentof the relationalmodel, but this chapter lrhysicalrepresen will draw primarily from Codd's work. tlreway in which rlcnt of physicalr Fi.:ffi

CHAPTER3 tir,THE RELATI0NAL M0DEt 49

This chapteris not an exhaustivetreatment of the relationalmodel, or a completehistory of it. tt is merelyan appetizer.The SQLchapter that follows is the main course.This chapter ntiffthree principle presentsonly the minimal materialneeded to get an appreciationof the origin and theoretical underpinningsof SQL.We illustratethe relationalmodel in termsof both its threecomponents ationisstructured, or in generaland in the contextof Codd's 12rules in particular.The relationalmodel in its full whicharecomposed of splendoris far beyondthe scopeof this book, and I haveneither the qualificationsnor the nents,The relation is stamina to fully describeit. Seethe Referencessection at the end of this chapter if you want to tabase.Relations as learn more. The rulesas they are presentedhere are taken directlyfrom Codd'sOctober 14,1985, heory,abranch of Comp uter wor ld article. dlydefinedin Codd's TheStructural Gomponent enforcerelationships rent.These methods The structuralcomponent of the relationalmodel laysthe foundationupon which the other reare three principle components build. It definesthe form in which information is represented.It is defined by the iryintegrity, governing first of Codd's 12rules, which is the cornerstoneof the relational model. rofleilother.Integrity ry,ltwasinitially The Information Principle he1980s by Codd The first rule, called the Information RuIe,is also known as the Information Principle. It is defined as follows: /ithwhichto operate ;ohavetheir roots in I. The Information Rule. All information in a rel"ational data base is represented ywlcalculusasorigi- explicitly at the logical level and in exactly one way-by ualuesin tables. fDataBase Date summarizesthis rule as follows:

infact,that it is hard The entire information content of the databaseis representedin one and only one way, ,dfectly or indirectly. namely as explicit ualuesin column positions in rows in tables. ranypeopleappeils as lel,and in manyways There are two import expressionshere: "logical level" and "valuesin tables."The logic level, or logical representation,refers to the way that you, the user, seethe databaseand information within it. It is a kind of ideal worldview for information. The logical level is a consistent,uniform depiction of data, which has Wvoimportant properties: verandelegance . The view presentedin the logicallevel consists of tables,made up of rows,which in turn ntextofthe relational are made up of values. or.verfrd,and complex . The view is completely independent of the databasesystem-the technolory (software hemostrudimentary or hardware) that enablesit. providesatheoretical The logical representationis a world unto itself,which is completely distinct from how the sofseveralinfluential databaseis implemented, or how it storesdata physically,or how it operatesinternally. These senceoftherelational latter components-software, operating system,and hardware-are referredto as the physical woofCodd's papers representation-the technology of the system.If the databasevendor decidesto store database ionat the end of the tables in a different way on disk, the Information Principle mandatesthat such a changein cdel,but this chapter physicalrepresentation can in no way affect or changethe logical representationof that data- the way in which you (or your programs) seethat data. The logical representationis indepen- dent of physical representation.A result of the Information Principle is that it is possibleto 50 CHAPTER3 ffi THERELATIONAL MODET

representthe sameinformation in the sameway acrossmultiple databaseimplementations on 11.Distributio multiple operating systemson different hardware, as shown in Figure 3-1. it cannot impa

12.Nonsubuer, subuertthe intt

The separatior opening of his origi

Future usersoJ is organized i terminals anc I I internal repres representation Windows Mac0S X 'l'he relational mod tied applicationsto rnodel challengedt

It prouides a m superimposing it prouidesaba betweenprogrr data on the otl Figure3-1. Logical and physical representation Soin the relatir You can createa relation (or table) in Oracle on Solaristhat is representedin exactlythe which information sameway as a table in PostgreSQlon Linux, or SQLiteon Mac OSKl The Information Principle cxclusivelyin termr guaranteesyou a consistentlogical view of information regardlessof how the databasesoft- l'rom the underlyin ware is implemented, or the operating systemor hardware it runs on. rvay.As statedbefo

The Sanctity of the Logical Level 'l'he Anatoml So important are thesetwo constraints in the relational model that they iue expandedupon I'helogical level is r and reinforced in severalother rules (8,9, 11,and 12)so as to eliminate any possibleambiguity. I.ypes.Inrelational p In short, Codd says: t,alues,and domait theorytends to use B. Physical Data Independence. The logical uiew can in no way be impaired by the termsused almost ir underlying software or hardware. rngly,there is one te tablesand relationr 9. Logical Data Independence. Application programs and terminal actiuities remain This big soup ( Iogimlly unimpaired when information-preseruingchanges of any kind that theoretically txplained in detail permit un-impairment are made to the basetables. ;rreillustrated in Fi: At the centerol rntegitf, and manil operationsin the rel l. In reality,this is not 100percent true (it'smore like 95 percent true). There are slight differencesin database implementations so that relationsin one databasemay contain featuresnot presentin other data- ;rredefinedwithin r bases.Nevertheless, they still all adhereto the samegeneral structure: relations made of tuples made rrnderstandingof rr ofvalues. l'irstunderstand tul

3* ,#:.*:. -:5i .W CHAPTER3 THERELATIONAL MODEL 51

rlementationson I1. Distribution Independ.ence.Euen if the databaseis spread auoss uarious locations, it cannot impact the logical uiew of data.

12. Nonsubuersion Rule. Thedatabase software may not prouide any facility which can subuert the integrity constraints of the logical uiew.

The separationof logical from physicalwas very important to Codd from the outset. In the opening of his original paper he had strong words concerning this separation:

Future usersof large data banks must be protectedfrom hauing to know how the data is organized in the machine (the internal representation)... Actiuities of usersat terminals and most application programs should remain unaffected when the internal representation of data is changed and euenwhen some aspectsof the external representation are changed.

'fhe relational model was in part a reaction to the databasesystems of the day, which closely tied applications to both databaseimplementation and data format on disk. Codd's relational rnodel challengedthis:

It provides a means of describing data with its natural structure only-that is, without superimposinganyadditionnl strrcrurefor mnchine representationpurposes. Acnrdingly, it prouidesa basisfor a high leueldata langtage which will yield maximal independence betweenprograms on the one hand and machine representation and, organization of data on the other.

Soin the relationalmodel, the Information Principleprovides alevel of abstractionthrough I in exactlythe which information can be representedin a consistentway. The user seesand works with data nation Principle exclusivelyin terms of this logical representation.This representationis completely insulated databasesoft- from the underlying technology.It cannot be undermined, influenced, or affectedby it in any way. As statedbefore, the Information Principle is the foundation of the relational model.

The fuiatomy of the Logical Level panded upon The logical level is made up of more than relations.It is made up of tables,rows, columns, and iible ambiguity. types.Inrelational parlance,these are often referredto more formally as relation uariabbs, tuples, values,and domainq respectively.SQL, for example,uses many of the former, and relational theory tends to use many of the latter. Throughout the literature, however,you will seethese ired by the terms used almost interchangeably.Even in Codd's papers,both setsof terms are used.Interest- ingly, there is one term that both lexiconshave in common: relations.(In caseyou're wondering, tablesand relations iue not the same thing; we'll addressthe differencelater in this chapter.) ties remain 'heoretically This big soup of terminology comprisesthe anatomy of the relational body, which will be explainedin detail over the next severalsections. The relationships betweenall of theseterms are illustrated in Figure 3-2. At the center of everything is the relation. It is the central object aroundwhich the structural, integrity, and manipulative components of the relational model are built. All of the fundamental encesin database operationsin the relational model :ue expressedin terms of relations,and all integrity constraints n other data- are defined within relations.Your understandingof the relational model is only asgood asyour of tuplesmade understanding of relations themselves.To have a good graspof relations,however, you must first understand tuples. 52 CHAPTEB3 'r.,riiTHE RELATI0NAL M0DEL

l{ame Oomain For examplr a tuple is sir Relation num , real that and a type. As st A as a tuple's attril I Attibutes information abo hnlinality (Heading) Tuple I Y Values

I I :<-- Degree-;'l I I Component

Figure3-2. The logical representationof data EF Tuples 1 TheSr { A tuple is a set of ualues,each of which has an associatedattribute. Anattribute defines a value's jl TheFu nameand domain. The attribute'sname is usedto identiSrthe value in the tuple, and its domain 2 : defines the kind of information stored within it. The combination of attribute and value is called 3 'TheM a component.lnFigure 3-2,you can seethat the first component is named num, and that it has a valueof 3.14and a domain of real (asin real numbers). A component is kind of a tidy, self-containedunit of data. It has three essentialingredients: a name, a domain, and a value.Its name givesit identity, its domain a descriptionof its content, and its value the content itself. Together,components aggregateinto larger structuressuch as tables and relations, and their qualities propagateinto those structures,imparting to them identity, description, and content as well. Figure3-3. Heade The name and value parts of a component are easyenough to understand,but what exactlyis a domain? The word domain,like the words relationand, tuple, comes from mathematics.In Similarly, a mathematics, the domain of a function is the set of all valuesfor which the function is defined. compositedata r Somefamiliar domains in mathematicsare the setsof all integers,rational numbers, real numbers, in the structure i and complex numbers. Generally,the term domain correspondsto a set of permissiblevalues. domain. A domain is sometimesreferred to as a type,andis synonymousto a data type in programming Thus,tupler languages.A domain, especiallyin the relational sense,also implies a set of operatorsthat can spreadsheet.Thr be used to operate on its associatedvalues. For example,common operators associatedwith informationwith integers are addition, subtraction, multiplication, and division. than with rows i So the job of a domain is to define a finite or an infinite set of permissiblevalues along with But the anal, ways of operating on them. In a way, the domain controls or restrictsthe value of an attribute, of an array of str but it doesso only by providing information. The databaseactually restricts an attribute'svalue type,just asrelar so that it conforms to its associateddomain. As you will seelater, this particular restriction is The bottom defined in the integrity component of the relational model, and is called domain integrity.The is defined by the group of collective attributes in a tuple is called its heading.Iust as an attribute definesthe properties of its associatedvalue, the heading defines the properties of its tuple. Degreeand Carr Associatedwith Relations fancy words fon A relation,simply enough,is a set of one or more tuplesthat sharethe sameheading.Iust as of attributesin itr domains have analogsin programming languages,so do tuples and relations.And even if you this is called carc are not a programmer, the analogyis quite helpful in illustrating the relationship between and five tuples ra tuples,relations, and headings. CHAPTEB3 .i.i..:THE RELATI0NAL M00EL 53

For example,consider the relation and its C equivalent shown in Figure3-3. You could say that a tuple is similar to a C structure.They both are made up of attributesthat havea name ;rnd a type.As shown in the figure, a C structure's attributes are defined in its declaration,just Itsa tuple'sattributes are definedin its heading.The declarationand headingboth provide U Attributes irrformationabout the contentsof their respectivedata structures. -f }1 lireaorng; --T1: tt #values tl 4l ------J Declaration typedefstruct { int id; char* name' ) episode; 1 TheSoup Nazi :tedefines a value's Z TheFusilliJerry ple, and its domain Instances and value is called 3 , TheMutfin Tops episoderows[3] = n,arrdthatit hasa { {1, "TheSoup Nazi"}, 'ntial ingedients: {2, "TheFusilli Jerry"}, rtion of its content, {3, "TheMuffin Tops"} h structuressuch as parting to them Figure3-3. Header as decl^aration;tuple as instance , but what exactlyis r mathematics. In Similarly, a tuple's valuesare analogousto an instanceof a C structure. Its valuesform a unction is defined. composite data stnlture that correspondsto the attributes defined in its heading. Each value bers,real numbers, in the structure is identifiable by an attribute name, and eachvalue is restricted by an attribute lermissible values. domain. re in programming Thus, tuples are more than just rows of amorphous valueslike what you might find in a operatorsthat can spreadsheet.They are well-defined data structureswith a high degreeof specificity over the :s associatedwith information within them. They havemore in common with constructsin programming languages than with rows in a spreadsheet. : valuesalong with But the analogydoesn't stop there.As alsoshown in Figure3-3, a relation is the C equivalent lue of an attribute, of an array of structures.Each stmcture in the array along with the uurayitself sharesthe same rn attribute'svalue type,just as relations and their tuples sharethe same heading. :ular restriction is The bottom line is that relations and tuples are highly structured. Furthermore, this structue nain integriry.The is defined by their corrunon heading. 'ute definesthe rple. Degreeand Cardinality Associatedwith tuples and relations are the notions of.degree and. cardinality. Theseare just fancy words for width and height, respectively.You could say a relation's width is the number : heading.Just as of attributesin its heading.This is calledthe degree.Its height is the number of tuplesit contains; rs.And even if you this is called cardinality.These terms are illustrated in Figure3-2. Arelation with four attributes nshipbetween and five tuples would be said to be of degree4 and cardinality 5. 54 CHAPTER3 BS THE RELATIONAL MODEL

While tuples don't have cardinality, they nonethelesshave their own fancy terms.A tuple A relation oue of degree I is said to be a unarytuple. Tuples of degree2, 3, and 4 aresaid to be binary, teruMry, and quaternAry,respectively. Generally, a tuple of degreeNis said to be an N-ary tuple. In fact, This is a somewha the word tupleis taken from the N-ary form. In the strictestsense, a unary tuple is called a unfamiliar concet monad, a binary tuple a pair, a ternary tuple a triple, a quaternary tuple a tetrad, and so on, quite simple,how, asshown in Table3-1. (}veryother domai eachnamefin F.1 Figure3-4. Note tl Tabfe3-1 . TupleTerminologr valuesto {-1,0, 1}. Degree Qualification Designation Example I Unary Monad I Integers

2 Binary Pair, twin (0,l) llrst_names 3 Ternary Triple, triad (0,1,2) Integersx first_names 4 Quaternary Quadruple, tetrad (0,l, 2,3) N N-ary N-tuple (tuple) (0,1,2, ... N)

Typically, the term tuple is used as a generic catchall for tuples of all degrees,and using terms kke monadand triodis often more confusing than it is precise.As with tuples,a relation composed of binary tuples is a binary relation. A relation composedof N-ary tuples is an N-ary relation, and so forth. Figure3-4. The cros MathematicalRelations What doesthe The notion of a relation in relational theory is taken directly from the relation in set theory, rsa setcontaining with a few modifications. To really understand relations, as defined in relational theory, you l'hatis, everytuple must understand their predecessorsin mathematics. ;rsits secondvalue Just as in relational theory, mathematical relations are setsof mathematical tuples. Like ;rlbeita very big or mathematical relations, mathematical tuples also differ somewhatfrom their relational name- tlrt:domains I and sakes.In mathematics,tuples are ordered sequencesand relations are unordered sets.Sets by rr"relation over I a theirvery nature are ambivalent to order. That is, the order of elementsin a set doesnot affect Put more siml the fundamental identity of that set,whereas the order of items in a sequencedoes affect its {rrrsubset) of their identity. trrplethat couldev To begin with, everyvalue in a tuple has a specific domain associatedwith it. Supposewe lxpressinga relatio have a tuple composed of the following two domains: orrrrelation to incl . The domainof all integers{...,-1,0,1,...}, denoted by I rrlation R overI, F,

. The domain of the first names of all Seinfeldcharacters 'Cosmo', '', ...}, {'Ierry', RelationalRelatiol denotedby F llrt:reis one funda Now, supposewe have a relation composed of such tuples.The relation then is also composed tlrr.ory.In mathem (in of the domains I and F that order). The first column of the relation must consist of integer llris is becauseme values,and the second column must consist of the first names of Seinfeldcharacters. Using trl'r'them.Mathem this as an example,the formal definition of a relation can be expressedas follows: .rlrvaysof the form r orrv€Dtionis not r CHAPTER3 ffi THERELATIONAL MODEL 55 ncyterms. A tuple A relation ouerI and F is any subsetof the crossproduct of I and F, representedby IxF. be binary,ternary, arytuple.In fact, This is a somewhatannoying definition becauseit definesa relation in terms of another perhaps rpleis calleda unfamiliar concept: the crossproduct. The cross product, also called the Cartesian product, is trad,and so on, quite simple, however.It is the combination of everyvalue in everydomainwith everyvalue in everyother domain. To compute IxF, for example,you take each integer iin I, and pair it with eachname/in F. This yields an infinitely large set of (l,l tuples (binary tuples),as illustrated in Figure3-4. Note that the figure is somewhat simplistic and depicts the set of all integers as the valuesto {-1,0, l}.

integen = (...,-1,0,1,2,3...)

i. ,i::. first_names = Uery,Cosmo, Newman,... I ,,i$qgCr'. Ifirst-naifirst-narile -1 JenY integersxfirst_names = { (-1,Jerry), i (0,Jeny), 0 ; Jeny (1,Jerry), 1 j Jeny ..- !r .r, .r;i.--. (-1,Cosmo), co:rg. ".-."-:L.-." - "i --"-,.". (0,Cosmo), 0 ' Cosmo (1,Cosmo), -.'.: (-1,Newman), 1 I Cosmo rees,and using (0,Newman), -1 Newman (1, tuples, a relation Newman)) .. .."9"."-....-.. ..:. N-gwmap. . . uples is an N-ary 1 Newman

Figure3-4, The crossproduct of integersand characterfirst names

What doesthe crossproduct do here?Why is it used?The crossproduct of domains I and F r in set theory, is a set containing every tuple that could ever exist by the combination of thesetwo domains. nal theory, you That is, everytuple that has an integer as its first value and the first name of a Seinfeldcharacter as its secondvalue is contained in the crossproduct IxF. The crossproduct is itself just a set, cal tuples. Like albeit a very big one. It is a set defining the limit of every tuple that could ever be formed from relational nilne- the domains I and F. That being said, any subsetof this crossproduct is a relation, specifically :red sets.Sets by a "relation over I and F." That is a mathematical relation: any subsetof a crossproject. :t doesnot affect Put more simply, a relation is defined in terms of its constituent domains-it is a portion does affect its (or subset)of their crossproduct. This crossproduct is a supersetcontaining every possible tuple that could ever appear in the relation. (I know this seemscircuitous.) The proper way of e it. Supposewe expressinga relation is to speakof a relntion R ouerthe domains X Y,and Z.lf we were to expand our relation to include the domain of all Seinfeldepisodes, denoted by E, it would then be "a relation Rover I, F, and E"-represented by IxFxE, which in turn is an even larger set than IxF.

,'Newman',...1, RelationalRelations There is one fundamental difference between tuples in mathematics and tuples in relational ralsocomposed theory. In mathematics,the order of valuesin a tuple is significant; in relational theory it isn't. nsist of integer This is becausemembers (attributes) of relational tuples have names,which are used to iden- racters.Using tify them. Mathematical ones do not. Therefore,mathematical tuples in relations over IxF are rWSl alwaysof the form (i,/ becausethis is the onlyway to identify them-by ordinal positions. This convention is not needed in relational tuples becausethey have attributes that have names to 56 CHAPTER3 ,.'I1gE RELATIONALMODEL

identify them, rather than ordinal positions. Order, then, in a relational tuple offers no real l>. advantage.The same appliesto relational relations,as they sharethe same attributes through their common heading. In his original paper, Codd differentiated mathematical relations (which he called domain- ordered relations)from his relational relations (calleddomain-unordered relations)by referring to the latter as "relationships":

Accordingly, we propose that usersdeal, not with relations which are domain-ordered, but with relationshipswhich are their domain-unorderedcounterparts.To accomplish this, domains must be uniquely identifiable at leastwithin any giuen relation, without using their position.

Domains (or attributes) are uniquely identifiable through attribute names.Therefore, both column and row order do not matter in the relational model. This is the principal way in which the relationsof settheory and the relationsof relationaltheory differ. In all other respects, tuples and relations in the relational theory shareall the sameproperties of their counterparts in set theory. And that is a relation. It is the fundamental object upon which relational theory is built. tf you understand it, you are well on your way to understanding the core mechanicsof relational theory. Relationaltheory is fitted so as to mirror the true mathematical relation as closelyas possible.The reasonis that the closerthe relationalmodel fits the mathematicalmodel, the more it benefitsfrom other facetsof mathematicsalreadywell established. A prime exampleis relationalalgebra, which is part of the manipulativecomponent of the relationalmodel. Many of the operationsdefined for relationsin settheory carryover to relationalalgebra because relational relations are sufficiently similar. The same is true for relational calculus,which employs methodsof formal logic.As Codd put it

Moreouer, the (relational) approach has a closetie tofirst-order predicate logic-a logic on which most of mathematics is based, hence a logic which can be expectedto haue Figure3-5. Relatio strength,endurance, and many applications.

In practice,t Codd recognizednot only that the eleganceof mathematicsis beneficialas a basisto build oftenreferred to. upon, but that it also offersa vast reservoirof existingknowledge that can be harnessedas well. v:rlue;the otherir variable-to whic Tables:Relation Variables I'heirname is jus A relation, though it contains values,is itself just a value,just like an integer or a string. The irlgebra,such as-r value of a relation is given by the particular set of tuples it contains. Likewise,the value of each (lreading,degree, tuple in turn is given by the specificvalues within it. Thus, the value of a relation is determined With this in r by the sum of its parts.Figure 3-5 illustratesthis. It depictsthree relations, represented by R,t, S()L,you probabl R2, and R3, taken over IxF. Each representsa different value, or relation. s.uneas changinl RI, R2,and R3 ale not relations but rather relation uariables.They representrelations. l)artsas individur Codd referredto them as named relations.Date (2003)calls them reluars.SQL calls them tables. rnodifyparts of a They are also known as basetables. The bottom line is that they are simply variableswhose ;rnything,you chi valuesare relations.I will simply refer to them here as tables,as that is what you will be dealing ;rren'tchanging a with in SQL. orre.where the nr I know that i ilrtegervariable,' CHAPTER3 THERELATIONAL MODEL 57 rle offers no real lxF attributesthrough integer first_name re calleddomain- ions)byreferring to -OO Jerry integer first_name : : Jerry tmain-ordered, -1 Jerry -1 To accomplish 0 Jerry -1 Cosmo R1 lation, without 1 ,rl -1 Newman

Therefore,both : integer : first_name cipalwayinwhich +oo Jerry other respects, 1 .. Co.smo -oo Cosmo R2 their counterparts 1 :Jerry : : | , Newman rI theory is built. If -1 Cosmo ranicsof relational 0 Cosmo tion asclosely as integer rticalmodel, the 1 Cosmo first_name r prime exampleis -1 Jeny : , onal model.Many R3 0 Jeny Llgebrabecause +oo Cosmo , us,which employs 1 Jerry -OO Newman

: : t logic-a logic cectedto haue Figure3-5. Relations ouer IxF a basisto build In practice,the precisemeaning of "relation" can sometimesbe a bit murky. Relationsare harnessedas well. often referred to in the same context as tables.However, they are not the same thing. One is a value;the other is a variable.A relationis a ualue,likean integervalue is l, 2, or 3. A table is a variable-to which relations are assigned.Tables, like variables,have both a name and a value. Their name is just a symbol. Their value is a relation. They are no different than variablesin or a string. The algebra,such as x and yin the equation of a line. Tablesshare all the properties of relations , the value of each (heading,degree, cardinality, etc.)just as integer variablesshare the properties of integers. ion is determined With this in mind, what doesit mean to update or moditr a table?If you are familiar with :presentedby Rl, SQL,you probably think of it in terms of modifuing a single value, or a few rows. You seeit the same as changing cells in a spreadsheet.You modify parts of the spreadsheet.You seethose sentrelations. parts as individual values.The relational view is different. In the relational view, you don't calls them tables. modify parts of a table.A table is a variableholding an entire relation as its value. If you change uiables whose anything, you change the entire relation-no matter how small the differencemight be. You ou will be dealing aren't changing a row or column;you are actually swapping one relation with an entirely new one, where the new relation contains the rows or columns you want changed. I know that it may seem like frivolous semantics.But just as you don't modify part of an integer variable,you don't modify part of a relation variable either. There is nothing wrong 58 CHAPTER3 S# THE RELATIONAL MODEL

with thinking in terms of the SQLview-where you are changing values in a row. It is logically ,l;ry.Rather, it mea equivalent: in a SQLupdate, you are articulating the changeyou want to make in a statement, I'rrrlge the gapif sh and that statementproduces a newrelation. But that is where you should make the distinction: rlrr.klgical view see the changeproducesanewrelation, not a patched-upversion of the old one.Thatnewrelation rrrtotwo smalleron becomesa new value that the table holds. r,';rlitya relationalt Be aware that when you talk about tables in a database,you are referring to variables, rrurrsknow no diffe specificallyrelation variables-not relations.If a databasewere composedstrictly of relations, then its contentswould be fixed values,not subject to change.But databasesare dynamic. And UpdatableViews the reasonis becausetheir contents-tables-iue relation variables,which are subject to change. llrrt They changeby assigningthoserelation variablesnew relation values,not by adding, subtracting, logicaldata ind or changing tuples. t.rrrriliarwith views rlr,rlyoushotrldbei tl.rt;riltd€pendence Views: Virtual Tables Views are virtual tables.They look like tablesand act like tables,but they're not. Views are rela- 6. View Updatt tional expressionsthat yield relations. It is like sayingthat the algebraicexpression x+yis not a the system. number per se,though it can yield a number if the expressionis evaluated,provided we have lirrlr'(i specific valuesfor .r and y. The sameis true for views.Codd (1980)described them as follows: isessential f i r)lr\tructed in sucl A (table) l'\l)(fctivebase tab view is a virtual relation defined by means of an expressionor sequenceof 'l'hat commands.Although not directly supportedby actual data, aview appears to a useras said,Rule if were an additional basetable kept up-to-date and in the state of integrity with other rrntl'rrlly supported basenbLes. Wews are wefulfor permittingapplication programsand usersat terminnls to l'r u[r'ilfitflling a sys interact with constant view stntctures, euenwhen the basetabla themseluesare under- , "rlttrlrrs is not exac going strucrural changesat the logical leuel... , runnlon. In fact,it

LogicalData Independence l'he System( In one sense,views are simply a matter of convenience.They let you assigna name to a set of i', ,rrlatabase is cor operations the and treat result like a relation. They are a kind of shorthand.This is perhaps the lrrrrlottt exactlywh most colnmon use of views.Another use was for security. In some systems,views can be used .,n(l views describin to selectivelypresent parts of tables,while excluding other sensitiveor restrictedparts. But the \ rr.rvs,constraints, original intent views of was much more than these two applications.The main application for rr'lr.r'rcd to as the s' views was facilitatingwhat is called logical dataindependence,which is defined as follows: Ilule 4. Dynar 9. Logical Data Independence. Application programs and terminal actiuities remain tlescription is t Iogically unimpaired when informatian-preseruingchanges of any kind that theoretically authorized ust permit un-impairment are made to the basetables. tryply to the re,

Logical data independencemeans that applications and usersshould theoreticallybe able to llrl system catalog seethe samelogical structure (e.g.,the sameattributes in a relation), evenif it is changed(e.g., ,r rt.l;rtiond format an attribute in that relation is moved to another relation). An exampleis when a relation is lrr';tns that even m decomposedinto two relations for the sakeof normalization, as describedin the section ,, l,rtionally.In mal "Normalization." ,| | rlllcrnentedas vi, More precisely,logical data independencemeans that usersand applications should be .,r ir r lo extend or er: able to be insulated from changesat the logical level of the database.That doesn't mean the ,rrlilr rrlttiOn yOu fit databaseis supposedto automatically shield usersand programs from the databaseadminis- r lr,tsiS. trator (DBA)doing somethingreally bone-headedlike dropping a table and going home for the CHAPTER3 +..8 THE RELATIONAL MODEL 59

I row. It is logically tlay. Rather,it means that the relational model provides a means for the administrator to rke in a statement, bridge the gap if she wants to make substantivechanges at the logical level that would impact rke the distinction: the logical view seenby usersand applications. For instance,if the DBA decomposesa table . That new relation into two smaller ones,she can createa view that looks like the original table, though it is in reality a relational expressionthat combines (joins) the two new tables.The usersand applica- ng to variables, tions know no differentlv. ;trictly of relations, s are dynamic. And UpdatableViews : subjectto change. llut logical data independenceentails more than just viewing data. Many people who ,dding,subtracting, are larniliar with views understandthem asread-only. However,the relational model clearly states that you should be ableto write to viewsas well, just as if they wereordinary tables.While logical tlataindependence would seemto imply this, Rule6 makesit explicit: not. Viewsare rela- 6. View Updating Rule. All uiewsthat are theoretically updatable are also updatable by ressionx+yis not a the system. provided we have Itule 6 is essentialfor logical independence."Theoretically updatable" view is d them as follows: data means the t:onstructedin such a way that it is theoretically possibleto map changesmade on it to its respectivebase tables. That is, if it can be done in theory,the systemshould be able to do it. or sequenceof (sometimes tars to a useras That said, Rule 6 and updatable views referredto as materializedviews) are not fully supported in all relational databases,simply becausethey are not easyto implement. irity with other I attermirak to l)rogramming a systemto knowwhat is "theoretically updatable" in relational algebra and/or iues are under- calculusis not exactlya weekend project. On the other hand, read-only views are extremely common. In fact, it is almost hard to find relational databasestoday that don't support them.

The SystemCatalog I a name to a set of As a databaseis composed of tables and views,lou might at some point wonder howyou can This is perhapsthe find out exactlywhat is in a given database.As it turns out, a relational databasecontains tables , views can be used and views describingits tablesand views-information about the information. That is, all tables, icted parts.But the views,constraints, and other databaseobjects belonging in a databaseare registeredin what is rain application for referred to as the systemcatalog. Per RuIe 4: ined as follows: Rule 4. Dynamic On-Line Catnlog Based on the Relntional Model. The data base :tiuities remain description is representedat the logical leuel in the seme way as ordinary data, so that 'tat theoretically authorized userscan apply the same relational language to its interuogation as they apply to the regular data. retically be able to 'fhe systemcatalog is subject to the Information Principle: it is required to be representedin f it is changed(e.g., a relational format that can be queried in the same way as other relations in the system.This hen a relation is rneansthat evenmetadata (information about information) in a databasehas to be represented in the section relationally.In many databaseproducts, many of the tablesin the systemcatalog are often implemented as views.The beauty of views and the systemcatalog is that together they enable cationsshouldbe you to extend or enhancethe catalogitself. You can createyour own catalogviews that contain loesn't mean the information you find useful or informative. Thoseviews in turn may even use catalogtables as databaseadminis- a basis. going home for the 60 CHAPTER3 ,..,..',THE RELATI0NAL MODEL

Thelntegrity The primary key is Gomponent definitionof a prir While the structural aspectof the model relatesto the structure of information, the integrity doesin defininga aspectrelates to the information within the structure. Information can be arrangedin relations in a way that givesrise to various relationships,both within columns of a single relation and Foreign Key between columns of different relations.These relationships provide additional structure and indeed add even more information to what is already present. With keyscomes The integrity aspectof the relational model provides a way to explicitly define and protect tuples within a sir such relationships.Although their usein a databaseis entirelyoptional, the degreeto which betweentuples in they are employed can ultimately determine the consistencyof your data. tuple in one relati common keyvalu correspondsto, or Primary Keys the tuplesin eacht Codd'ssecond rule is the startingpoint of data integrity,and dealswith the nature of data Eachrow in foods within relations. This rule is called the GuaranteedAccess Rule and is defined as follows: the name of whicl classifications(e.1 2. Guaranteed AccessRule. Each and euerydatum (atomic ualue) in a relational data usinga common I base is gu.aranteed to be logically accessibleby resorting to a combination of table a valuein the prir name, primary key ualueand column name.

The relational model requiresthat everyrelation have a primary key.The primary key is the set foods of attributes in a relation that uniquely identifies each tuple within it. This rule carrieswith it an important corollary:relations may not contain duplicatetuples. r,M The GuaranteedAccess Rule statesthat everyfield (or value in a tuple) in a relational data- 244 , Mackit basemust be addressable.To be addressable,it must be identifiable.To be identifiable,each tuple must be distinguishablein some way from all other tuples in the relation, which is to say 245 Cheerit unique (thus,duplicate tuples would violatethis constraint). 244 Chocol Uniquenessis the businessof the primary key. A key is a designatedattribute (or group of attributes) in a relation such that 245 Junior

1. The value (or combinedvalues) of that attribute (or attributes)is unique for everytuple Figure3-6. Foreigr, in the relation.

2. If the key is composedof more than one attribute, all of the attributes that define the From a relati key must be necessaryto ensureuniqueness. That is, everyattribute in the key is sufficient nlore than that. Tt to ensureuniqueness, but also necessaryas well-if one were absent,then the uniqueness hasreal meaning condition would not hold. individualtables. It is easyto sr If both conditions I and 2 are met, then the resultingattribute or group of attributesis a key ;rpplicationcome (alsocalled a candidatekey). If condition I is met but not condition 2, then the attribute (or that a tuple in foo group of attributes)is calleda superkey.Itis a key that could standto losesome weight. That is, tleletesall the tup it has more attributes than necessaryto ensure uniqueness:a smaller key containing fewer tuplespointing tc attributescould be definedthat still guaranteesuniqueness. For that matt A relationmay haveone or more candidatekeys. If it does,then which one of them that is Ir thousandidenti defined as the primary key is arbitrary: l)rotectand ensur rrrles, like laws,at I,Vheneuera relation has two or more nonredundant primary keys,one of them is arbitrarily selectedand called the primary key of the relation. g' ',-e;

CHAPTERJ :::.fl{E RELATI0NALM0DEL

l'heprimary key isjust a rule that requiresthat everyrelation haveat leastone candidatekey. The definition of a primary key servesmore as an affirmation of the GuaranteedAccessRule than it ration, the integrity doesin defininga new relationalconcept. arrangedin relations singlerelation and Foreign Keys tional structureand With keyscomes more than simply identification. \tVhilekeys define relationshipsbetween ly defineand protect tupleswithin a singlerelation (you might say vertically), they can also define relationships he degreeto which betweentuples in different relations (horizontally).e key's identification property allows a tuple in one relation to identify (or reference)a specific tuple in another relation by way of a common key value. This is called aforeign key relationship.Specifically, a key in one table correspondsto, or references,the primary key in another table (theforeign key),thereby relating thetuples in eachtable. Take, for example,the foodsand food_typestables, as shown in Figure3-6. te nature of data Eachrow in foodscorresponds to a distinct food item (Iunior Mints, MackinawPeaches, etc.), ined as follows: the name of which ls storedin the nameattribute. Each row in food_typesstores various food classifications(e.g., Junk Food, Fruit, etc.).Each row in foodsreferences a row in food_typesby z rehtional data usinga common key.foods contains a key calledtype_id, everyvalueof which correspondsto ination of table a valuein the primary key (id) of the food_types,as illustratedby the arrow in Figure3-6. rrimarykey is the set foods ; rule carrieswith it

I in a relational r data- 244 MackinawPeaches I 1 Bakery e identifiable,each tion, which is to say 245 Cheerios 2 2 Cereal 244 GhocolateBobka 1 I Fruit ttribute (or group of 245 JuniorMints I I Junkfood

Liquefor everytuple Figure3-6. Foreign key relationship betweenfoods and food_types tes that define the From a relational standpoint, this is merely a foreign key relationsNp. But in reality it is r the key is suficient more than that. This relationship models a relationship in the real world. It has a basisin reality, hen the uniqueness has real meaning, and adds information above and beyond the information contained in the individual tables.Therefore, it is a critical part of the information itself. It is easyto seethat suchrelationships, if not properlymaintained, can be precarious.If an attributesis a key applicationcomes to assumethis relationship,and its normal operationdepends on the fact r the attribute (or that a tuple in foods alwayshas a correspondingtuple in food_types,what happensif someone deletesall the tuples from food_types?Where does that leavethe type_id valuesin the foods 'me weight. That is, ontaining fewer tuples pointing to now?What has become of this relationship? For that matter, what is to stop a user from just ignoring the primary key rule and jamming one of them that is a thousandidentical food_types tuples back into the database?If there is nothing in placeto protect and ensurethese relationships, they can be as destructiveas they are beneficial.Such rules, like laws, are worthlesswithout enforcement. one of them is 62 CHAPTER3 K THERELATIONAL MODEL

Constraints Null Values Enter the constrainr.Constraints are relational cops.They enforce databaserules and relation- Closelyassociated v ships and preserveorder in general.They bring consistencyand uniformity to information which existsboth in within a database.With constraints,you can rest assuredthat tuples in the foods table will a value and is called alwaysreference legitimate tuples in food_types.Primary keyswill alwaysexist in tables,and their valueswill alwaysbe unique. This kind of uniformity and consistencyis called dnta integrity. 3. Systematic T, It is the integrity component of the relational model. stTingor a strin Constraintswork by governing databaseoperations. Like the precogsin Minority Report, supportedinful they stop bad things before they happen. If a user or an application issuesa requestthat would information in result in an inconsistent relationship or data, the databaserefuses to carry out the operation 'l'here and issuesan error, called a constraint uiolation. In the relational model, constraints fall into aremultiple ir four general classesof integrity: to be "unknown." B rnayhave an attribu ' Domain integrity: Domain integrity is the relationship between attribute valuesand A tuple for a person their associateddomains. In the relational model, domain integrity is instituted through llr this case,the nulj 'l'hus, domain constraints.A domain constraint requires that eachattribute value in a tuple a valuemay b exist within its associateddomain. For example,if the type_id attribute in the foods table is rrncertain(it is not I declaredas an integer,then the correspondingvaluesof type_id in all tuples in the foods trrple(or employee) table must be integervalues-not floating-point numbers or strings.Domain integrity is The inclusion o also referred to as attribute integrity-it pertains to the attributes of a relation. Domain years,and there are integrity is not limited to iust checking that a given value residesin a given domain. ( lodd,for example,1 It includes additional constraints as well, such as CHECKconstraints. CHECK constraints (which €uecovered in Chapter4) can define arbitrarily complex rules on what constitutesa In general, cont permissible value for a given attribute. mation in datt complexities oJ . Entity integrity: This form of integrity is mandated by the Guaranteed AccessRule: each information an tuple in a relation must be uniquely identifiable. Whereasdomain integrity is concerned with a relation's attribute values,entiry integrity is concernedwith its tuples. The term l):rt€,Codd's colleal "entity" here is a rather loose term for table. It originatesfrom databasemodeling (i.e., entity-relationship diagrams).In this particular context,an entity simply refersto anything ...weshould mr in the real world that must be representedin a database. too, we hasten t, formal systemli ' Referential integrity: This form of integrity pertains to relationshipsbetween tables, specificallythe preservationof foreign key relationships.Whereas entity integrity 'l'he "3VL" acronym pertains to tuples in a relation, referential integrity pertains to tuples between relations. rrrrllsare evaluatedi ' User-defined integrity: User-definedintegrity encompassesany form of integrity not defined in the other forms. Many relational databasesoffer various facilities that go beyond the normal constraint mechanisms.One such example is triggers,which are Normalizal coveredin Chapter4. lhe implicationof n The relational model requiresthat databasesand query languagessupport integrity constraints. nrrportantconcept i Furthermore, theseconstraints, like all other data and metadatain the database,must also be n'ith the organizatio defined directly in the database,specifically in the systemcatalog: rlrrplication,asyou\ It increasesthe oppo 1o. Integrity Independence.Integrity constraints specific to a particular relational data I'ourdatabase to be basemust be definable in the relational data sub-language and storable in the catalog, While basedon not in the application programs. rsomeeven sav an a CHAPTER3 $il THERELATIONAL MOOEL 63

Null Values rules and relation- Closelyassociated with data integrity in the relational model is a specialvalue (or lack thereof), to information which existsboth inside and outside of everydomain. This specialvalue denotes the absenceof foodstable will a value and is called a null uAlue,or nuII for short. Nulls are prescribedin Codd's third rule: xist in tables,and ilLeddataintegrity. 3. Systematic Treatment of NulI Values. NuII ualues(distinct from the empty character string or a string of blank charactersand distinct from zero or any other number) are t Minority Report, supportedin fully relntional DBMSfor repraenting missinginformation and inapplicable 'equestthat would information in a systematicway, independent of data type. rut the operation nstraints fall into There are multiple interpretations for what null valuesmean. The prevailingview of null seems to be "unknown." But there areothers. For example,a tuple containingemployee information rnayhave an attribute for the employee'smiddle name. But not everyonehas a middle name. rute valuesand A tuple for a person without a middle name might use a null value for that particular attribute. instituted through In this case,the null value doesn't necessarilymean "unknown" but rather "not applicable." value in a tuple Thus, a value may be null becauseit is either missing (a value existsbut was not input) or n the foodstable is uncertain (it is not known whether a value existsat all), or it is simply not applicable for the ruplesin the foods tuple (or employee)in question. romain integrity is The inclusion of nulls in the relational model has been a sourceof controversyfor many relation. Domain yeius,and there are peopleon both sidesof the debatewho feelvery strongtyfor their positions. liven domain. Codd, for example,felt the need for nulls: IECKconstraints what constitutesa In general, controuersystill surrounds the problem of missing and inapplicable infor- mation in data bases.It seems to me that those who complnin loudly about the complexities of manipulating nulls are ouerlooking the fact that handling missing AccessRule: each info rmatio n and inapplicable infor matio n i s inherently complicate d. grityis concerned tuples. The term Date,Codd's colleagueand well-known authority on the relational model, is opposed to them: ;e modeling (i.e., refers to anything ...we should make it ueryclear that in our opinion (and in that of many other writers too, we hasten to add), nulls and 3VL are a seriousmistake and haue no plnce in a clean formal systemlike the relational model. etween tables, ty integrity The "3VL" acronym in the quote standsfor "three-valuedlogic," which correspondsto how retween relations. nulls are evaluatedin logical expressions.This is addressedin Chapter 4. of integrity not :ilitiesthat go ers,which are Normalization The implication of no duplicate tuples provided by the GuaranteedAccessRule givesrise to an :grityconstraints. important concept in databasedesign called normalizaton. Normalization concerns itself so minimize duplication of data. Data ase,must also be with the organization of attributes within relations as to duplication, asyou will see,has more deleteriouseffects than just taking up unnecessaryspace. It increasesthe opportunity for databaseinconsistencies to arise.Normalization is about designing lationaldata your databaseto be more resistantto the ill effectsof thoughtlessusers and buggy programs. n the catalog, While basedon principles of the relational model, normalization is somewhat a subject (someeven say an art) in itself. It can become quite complicated,introducing a considerable 8l CHAPTER3,:i.i THERELATIONAL MODEL

new amount of conceptsand terminology. A proper treatment of the subject is beyond the Table3-2. The Unn' scope of this book. What follows is a very brief introduction. season wee Normal Forms I 2 As stated,the chief aim of normalizationis to eradicateduplication. Relations that havedupli- cation removed are said to be normalized.However, there are degreesof normalization. These I degreesare called normalforms.The first degreeis calledfirst normalform, followed by second 2 normalform, and so on. They are abbreviated lNF, zNF, 3NF, and so on. Eachnormal form definesspecific conditions a relation must meet in order to be so classified.Thus a relation that 2l meets the conditions of first normal form is said to be "in first normal form." Normal forms 25 build on each other so that higher normal forms require all the conditions of lower forms as prerequisites.For data to be in 2NF,it must alsobe in lNF. Essentially,the higher the normal form of a relation,the lessduplication it has,and the more resistantit is to inconsistencies.The Many times,f most common and perhapsmost widely used normal form is 3NF, although there are even table.And already more advancednormal formssuch as Boyce-Codd normal form (BCNF),4NF,5NF, and higher. lircta correlationI The first three normal forms are easyenough to describe. rrrodifiesthe first r .lhat relationship First Normal Form yt.ar=1992,and ar lrothI992 and l9€ Firstnormal form simplystatesthat all attributesin a relationuse domains that aremade up of srrfllcientnormali atomic values.The working definition of "atomic" is the sameas other disciplines,meaning lnoreinteresting i simply "that which cannot be broken dolrm further." For example,integer valueswould seem lrroblem.It is purr to be atomic, asyou can't break them down further. But you could argue that integer values Functionaldr could be decomposedinto their prime factors.Atomicity, then, is determinedby the method of rnirrant)determint decomposition,which can be a subjectivematter. For all practicalpurposes, it is the database rrrirkingupthe det m€rnagementsystem that decideswhat is atomic. And the domains provided by the systemcan pl:rce,you canexl safelybe consideredas such. That said, first normal form basicallymeans that a singleattribute cannot hold more than 1. B is functi, a singlevalue. For example,take the episodestable shown in Table3-2 in the next section.INF statesthat you couldn't store both the values 1992and 1993(two integer values)in a year attribute 2. A function of a singletuple. It soundsso silly that it can be kind of hard to imagine.But it's that simple. I lrereis evena fat It is so simple, in fact, that you have to work at violating INF; most databaseswon't even give r;rlrrcof B hasson you the meansto do it.

Functional Dependencies Second Norr rrlr'ond normal fo To understand secondand third normal form, you have to first understandfunctional depen- lrr.in first normal dencies.The simple definition of a functional dependencyis a correlation between columns in ,rrtrit)utesof the p a table.If the valuesof one column (or set of columns)correlate to the valuesof another,then r rrttingout duplic they have a functional dependency.More precisely,a fi.rnctionaldependency describes a relation- i,rrtrltrplication. ship betweentwo or more attributesin a relation such that the valueof one attribute (or set of You have alre attributes) can be inferred from the value of another attribute (or attributes) for everytuple in .r'1iurexample. Tt the relation.This is more easilyillustrated by example.Consider the table,called episodes, rlr.tlyear is functi shown in Table3-2. u'tl llag). Specifi, There is a functionaldependency between season and year.For any givenvalue of season, r..no reason to inr you will find the samevalue of year.If you evercome acrossa tuple with a valueof 4 for season, Ilrlt's duplicatior you know thc valuefor year will be 1992.lfyou know the former,you can determinethe latter. CHAPTER3 .Ii:'THE RELATIONAL MODEL 65

ect is bevondthe Tabfe3-2. The Unnormalized episodesTable season week year name I 1992 2 r992 The ions that havedupli- Smelly Car ormalization.These I 1993 r,followed by second 2 1993 The Puffu Shirt lach normal form Thus a relationthat 2l I994 The Fusilli Jerry n." Normal forms 25 1994 The Understudy ; of lower forms as : higher the normal nconsistencies. The Many times, functional dependenciesare a warning sign that duplication lurks within a there 3h are even table.And alreadyyou can seein this examplehow inconsistenciescan crop up. If there is in \trF,5NF, and higher. lact a correlationbetween year and season,what happensto that relationshipif someone rnodifiesthe first row so that its valuefor year is 1999(but fails to do so for the secondrowf 'l'hatrelationship has been compromised.It is inconsistent.One tuple with season=4has year=1992,and anotherwith season=4has !ear=1999. How can season4 havehappened in both 1992and 1999?This is logicallyinconsistent, and the functionaldependency (or lack of that aremade up of sufficient normalization) is what made it possibleto introduce this inconsistency.What is even ciplines,meaning rnore interesting is that there is no standard integrify constraint designedto guard againstthis valueswould seem problem.It is purelythe resultof bad design. hat integer values Functionaldependencies always involve exactlytwo setsof attributes:one set (the deter- ed by the method of rninant)determines (or relatesto) the valueof another (thedependent). Call the attribute(s) :s,it is the database rnakingup the determinant A and the attribute(s) making up the dependent B. With this in d by the systemcan place,you can expressthis relationship using the following two (equivalent)statements: not hold more than 1. Bis functionallydependent upon A. le next section.INF s) in ayear attribute 2. A functionally determines B. rt it's that simple. There is a fancy notation for as well:A -+ B. The line: for any value of A, the ieswon't evengive even this bottom valueof B has somekind of correlation.

SecondNormal Form Secondnormal form is defined in terms of functional dependencies.It requires that a relation I functional depen- be in first normal form andthat non-key attributes be functionally dependentupon all etweencolumns in all attributesof the primary key (not just part of them). Rememberthat normalizationis about esof another,then so while soundslike an arbitrary rule, it is in fact aimed at weeding Cescribesa relation- cutting out duplication, this out duplication. : attribute(or setof You havealready seen what duplication looks like when 2NFis not followed,using episodes ;) for everytuple in The primary key is both seasonand week.You know calledepisodes, as an example. in episodes composedof that year is functionallydependent on season,which in turn is only part of the primary key (red-flag).Specifically, for everySedson=.r (e.9., 4) you havethe sameyear-y (e.g.,1992). There 'en valueof season, is no reasonto include the year attribute in the relationif its valueis correlatedwith season. rlueof 4 for season, That's duplication.It must go. eterminethe latter. 66 CHAPTER3 W THERELATIONAL MODEL

Sowhat do you do?You decomposethe table into two tables.Cut out year, and move it to 'l.hird Norma its own table: call it the seasonstable. Now episodesis decomposedinto two tableswith a foreign key relationshipwhere episodes.season references seasons. season (see Tables 3-3 and 3-4). llrird normalform tlcncieson anyoth, rlt.pendenciescalle Table3-3. The episodesTable in SecondNormal Form nrorefunctional de1 lrorexample, say yr season week name lrrirnarykey,and B 4 I The Iunior Mint 'l: it dependson,4 i 4 2 The Smelly Car rlrrplication.Third r ,rlsohave no transit 5 I The Mango To picrure this, 5 2 tlrc candidatekey is 6 2l The Fusilli Jerry 6 25 The Understudy Tabfe3-5. An Unorn

id season 4 Tabfe3-4. Theseasons Table from Normalizingepisodes 4 season year 5 4 1992 5 5 1993 6 6 1994 6

This is like factoringout a common variablein an algebraicexpression. Now episodesis in 2NF. Furthermore, no information is lost becauseyear is functionally determined by season. In this version lurrctionally But look what elseyou get the new designallows referential integrity to back you up: you have depen lr.rckwards. a foreign key constraint guarding this relationship (the correlation between seasonand year), Askyot lrurn which you couldn't get in the previous form. What was previously only implied is now both season?Yes.Tl ,urtl explicit and enforceable. seasonfunction ,k'llendent Now consider the specific casementioned earlier where someonemodifies the first row. on a nor rrrto The logicalinconsistency is no longer possible:it is impossibleto messup the year,because it itsown table,a1 is not in episodes.If someonechanges !ear=1992 to 1999in seasons,it may be inaccurate,but it is still logicallyconsistent. That is,both rows that referto it in episodes(season=4) will still be Iabfe3-6. The episot in agreementabout the year in which season4 took place.Normalization cannot make you get your facts right, but it can ensure that at leastyour data is consistentabout them. rtJ season Secondnormal form works by tightening the specificity betweenthe primary key and the 4 non-key attributes. If the valuesof a non-key attribute correlateto only partof the primary key, 4 then logically that attribute can introduce duplication. \ryhy?Because only part of a primary key is not unique (allof the primary key is requiredfor uniqueness,as mentioned earlier), and 5 a correspondencebetween two columns, neither of which is unique, opens the possibility for 5 duplication, and duplication invites inconsistency. 6

6 '.: ,:i:€t-

CHAPTER3 :f]iTHE RELATIONAL MODEL 67

/ear, and move it to 'fhird Normal Form tableswith a foreign rbles3-3 and 3-4). I'hirdnormal form shiftsattention from functionaldependencies on the primary key to depen- rlencieson any other non-key attributesin a relation.It roots out a specialclass of functional tfcpendencies called ffansitive dependencies.A transitive dependencyis just a chain of two or nlore ftrnctionaldependencies spanning two or more attribute groups(two or more correlations). ljor example,say you have a relation with three setsof attributes,A, B, and C where A is the lrrimary key, and B is a candidate key. If A -+ B and B -+ C, then C is transitively dependent on .4:it dependson A indirectlythrough its relationshipwith B.Transitive dependencies harbor duplication. Third normal form dictatesthat a relation must be in secondnormal form and ;rlsohave no transitivedependencies. To picture this, let's rig the original episodestable to have an integer primary key called id, and the candidatekey is now the combinationof seasonand week.This relation is shown in Table3-5.

Tabfe3-5. An Unormalized episodesTable with an Integer Primary Key id season week year name l4 I 1992 The Junior Mint 24 2 r992 The Smelly Car

:J5 I 1993 The Mango

45 2 1993 The Puffy Shirt

56 2l 1994 The Fusilli Ierry

(i6 25 1994 The Understudy

Iowepisodesis in rined by season. In this version of episodes,year is functionally dependenton season,which in turn is you up: you have functionally dependenton id. Sowe have id + season-+ year. How do you know this?Work ;easonand year), backwards. Ask yourself if you can determine seasonfrom id. Yes.Then can you determine year ed is now both from season?Yes. Therefore, you have a transitive dependency:id functionally determines season and seasonfunctionally determinesyear. To be in third normal form, year, which is frrnctionally ies the first row. dependenton a non-keyattribute,must go.And aswith the previousexample, you relegateyear ] year, becauseit into its own table,again splitting episodes into two tables,as shown in Tables3-6and 3-7. e inaccurate,but on={) will still be Tabfe3-6. The episodesTable in Third Normal Form rot makeyou get em. id season week name nary key and the t4 I The Iunior Mint the primary key, 24 2 The Smelly Car rt of a primary red earlier),and 35 I The Mango e possibiliry for 45 2 The Puffy Shirt

56 2L The Fusilli Jerry 66 25 The Understudy 68 CHAPTER3.tT THERELATIONAL MOOEL { .l: ,;' Iust as in the 2NF example,decomposition made an implicit relationshipexplicit. :i A relational presented for €. calculus)into Tabfe3-7. The seasonsTable from Normalizing episodes Finally, some season year uersus algebr 4 1992 highly discrin 5 1993 l'hesetwo "pure" 6 1994 theoryrather than or "querylanguag rnanipulativepart Sincefunctional dependenciesbetween non-key attributes are not allowedin a table,you l.ikewise,relationi may wonder why it is okay for functional dependenciesto exist betweenkeys. The answer is pcrspective,relati simple:uniqueness. Even if there is a correlationbetween one key and another,the fact that tional calculusis r they are unique gu€uanteesthat their conelation is also unique; therefore,no duplication exists. nroredetail in Ch: In the end, 2NF and 3NF aim not to introduce confusingrules, but to seekout and destroy As statedin tl duplication and the inconsistenciesthat can arise from it. It is essentialto proper database rlifferent,they are design.The more important your data,the more attentionyou should pay to ensurethat your cln alsobe expres databaseis properly designedand normalized. two languagesha pcrformed or exp

TheManipulative Gomponent Relational Q The manipulative component of the relational model definesthe ways in which information logether,relation can be manipulatedand changed.It is the dynamic part of the model that connectsthe datain rt, for querylangut the logical view to the outside world. rrllof the fundamt ralationally comp Relational Algebra and Calculus The query lar rr:lationalmodel i In his originalpaper, Codd describeda languagethat could be usedto operateon data: 5. Comprehe, Theadoption of a relatiornl model of data, as describedaboue, permits the deuelopment languages ar of a uniuersal data sublangunge based on an applied predicate calcuhts... Such a mode). Howe language would prouide a yardstick of linguistic power for all other proposed data per some we languages, and would itself be a strong candidate for embedding (with appropriate supporting at syntactic modification) in a uariety of host languages (programming, command- or problem-oriented). Data Definitt Integrity Cor, This "universaldata sublanguage" would havea sound mathematicalbasis. Codd definedthis and rollback. basisin the form of relational algebraand relational calculus,illustrated in his 1972paper "RelationalCompleteness of DataBase Sublanguages." As describedin the abstractof that paper: Additionally, l;rnguage)not onl In the nearfuture, we can expecta great variety of languagesto be proposedfor interro- lor the purposeo1 gating and updating data bases.This paper attempts to prouide a theoretical basis which may be used to determine how complete a selection capability is prouided in a 7. High Leuel proposed data sublanguage independently of any host language in which the sublan- update,and t guage may be embedded. CHAPTEtt .., THERELATI0NAL M0DEL 69

ship explicit. A relational algebra and relational cahulus are defined. Then, an algorithm is presented for reducing an arbitrary relation-defining expression (based on the calculus) into a semantically equiualent expressionof the relational algebra.

Finally, some opinions are stated regarding the relatiue merits of calculus-oriented uersus algebra-oriented sublanguagesfrom the standpoint of optimal search and highly discriminating authorization schemes.

Thesetwo "pure" languages-relational algebraand calculus-focused on mathematical theory rather than a particular languagesyntax. The latter is the job of the "data sublanguage," or "query language,"as we know it today.Like the structuralpart of the relationalmodel, the manipulative part drewheavily from mathematics.Relational algebra has its basisin set theory. rwedin a table,you Likewise,relational calculushas its basisin predicate calculus.From a computer science rys.The answer is perspective,relational algebracan be consideredmore of a procedurallanguagewhile rela- ther, the fact that tional calculusis more of a declaratiuelanguage.The meanings of theseterms are explained in luplication exists. more detail in Chapter4. rekout and destrov As statedin the quote, while the particular forms of expressionin the nrvolanguages are lroper database different, they are neverthelesslogically equivalent.That is, any operation in relational algebra to ensure that vour can also be expressedin terms of calculus,and vice versa.Another way of sayingthis is that the two languageshave the same expressiuepower-the same fundamental operations can be performed or expressedin either system.

Relational Query Language vhichinformation Together,relational algebraand calculusserye as a guideline, or a yardstick as Codd describes onnectsthe data in it, for query languagesimplemented in relational databases.Any query languagethat can express all of the fundamental operationsset forth in relational algebraand/or calculus is said to be relationally complete. The query languagemust also addressthe other aspects(structural and integrity) of the relational model as well. This is summed up in Codd's fifth rule, defined as follows: rate on data: 5. Comprehensiue Datq Sublanguage Rul.e.A reLational system may support seueral he deuelopment Ianguages and uariotu modes of terminal use (for example, the fill-in-the-blanks :ulus... Such a mode). Howeuer, there must be at least one language whosestatements are expressible, proposed data per some well-defined syntax, as character strings and that is comprehensiue in ith appropriate supporting all thefollowing items: , commAnd- or Data Definition, View Definition, Data Manipulation (Interactiueand by program), Integrity Constraints, and Authorization, Transaction boundaries (begin, commit, ;.Codd definedthis and rollback). his 1972paper stractof that paper: Additionally, Codd's seventhrule requires that the database(and by extensionthe query language)not only use relations for storage,manipulation, and retrieval,but also as operands xed for interro- for the purposeof modifying the database: heoretical basis is prouided in a 7. High Leuel Insert, Update, and Delete. The systemmust support set at a time insert, fich the sublan- update, and deleteoperators. CHAPTER3 $S THE RELATIONAL MODEL

A user then should be able to insert tuples by providing a relation composedof the tuples to be Despiteits crit inserted,and similarly for updating and deleting information within the system.While this rule rrndoubtedlythe n keepsthings consistent-using relations-its primary intent at the time was actuallyin optimizing rrslongstanding dc databaseperformance. The idea was that in some casesthe databasecould make better opti- t rnrein the near ful mizations by seeingmodifications together as a set than it could by processingthem individually. So,first, a relational databasemust provide at least one query languagethat addressesthe structural, integrity, and manipulative aspectsof the relational model. And with respectto the TheMeani manipulative aspect,it must be capableof expressingthe mathematical conceptsset forth in :j ( ,ivcn all this inforn relational algebraand/or calculus.Furthermore, it must acceptrelations as a means for modi- :: '* rl;rtabasesderive tL fying data in the system.Note that Codd never mandated a particular query language.He only ..*: r olrrrnninanother mandated that a relational databaseprovide one and what it must do. 3,r' ..r(.tr)sfrom the cer rr basedon the mal The Advent of SQL rr'lationsas the sol Thus, over the years,there havebeen multiple competing querylanguages.However, the most lrrlirrmationPrinci popular and widely adopted of theselanguages today is undoubtedly SQL.SQL is a relationally A more specifi complete querylanguage that exhibits aspectsof both relational algebraand relational calculus. That is, it has both declarativefeatures (calculus)and procedural features(algebra). In fact, as Rule Zero. For you will seein the next chapter, SQLincludes almost all of the operatorsdefined in relational ftrafta$€ftEtlt; its relational c algebra. SQLalso reflectseach aspectof the relational model. Part of its languageis dedicatedto I or o databaseto t working with the structural aspectof the model, specifically to creating, altering, and destroying trorr:rlmodel. That tables.This part of the languageis called data definition language(DDL). Within DDL lies also , overedevery one the integrity aspect,allowing the creation of keys and various databaseconstraints. Ukewise, lo be relational.Dr part of the languageis dedicatedto the operational aspect,called data manipulation language (DML). And as stated,it includes ideasfrom both relational algebraand calculus. Ironically, despitethe clearinfluence of the relational model on SQL,the crrrent SQlstandard doesnot mention the relational model or userelational terminology.2And while SQLis relationally Summary complete, in manyways it falls short of the true power of the relational model. Although it was l lriswas a nickelt primarily inspired by relational calculus,some have claimed that there are ways in which SQL pioirlwas to giveyo also violates it. Furthermore, SQLlacks some relational operations that some consider impor- rrntlerstaDdsome, tant, such as relational assignment,and has a number of redundant features.Part of the reason rrrurlelhasclearly i for this was that the organization(s)responsible for creating the SQLstandard felt that it was l'he relational more important to releasea standard as early as possiblein order to establisha basethat data- rlr r lxrct not only or baseimplementations could build upon (Connolly, 2001).Thus, when the initial standard was ,rtnryof electronic releasedin 1987,though practical, it was found wanting in many ways by researchersinvolved 'fhe relationa with the relational model (Codd included). Among other things, it omitted some relational rlurtis independel operations and included no mention of referentialintegrityconstraints. Subsequentstandards lorth in mathemat filled in some of the gapshere and there, but this seemsto have done little to appeaseits detractors ,,1information, po or deter databasevendors from "extending" their SQLdialects to include proprietary features. r orrsist€ocyand ir Entire books and websitesare devotedto both SQL'sinadequacies as well aswhat an ideal There are mat query languageshould be. Furthermore, alternative query languageshave been proposed and ',rrlrjectof entireb, implemented (both before and after SQL)that in the minds of their creatorsare more expressive l,rrgesubject, its cl and better reflect the principles and intent of the relational model.3 .rrrtlform the basir If you are ne\^ '',rsiertime graspin 2. Seehttp: //en. wikipedia. orglwiki/Relational_model. lrilt oSa gatewayil 3. TutorialD is one suchexample. See wr+,r. thethirdmanifesto. com for more information. lrr':rvilygeared to r CHAPTER3 IiIi16E RELATIONALMODEL 71

of the tuplesto be Despite its criticisms and shortcomings,however, SQLis what we have to work with. It is em.While this rule rrttdoubtedlythe most popular and widely adopted query languagein the industry, and given nrallyin optimizing its longstandingdominance in the marketplace,it is unlikely that its position will change any make better opti- tirnein the near future. them individually. that addressesthe with respectto the TheMeaning of Relational rceptsset forth in (iiven all this information, what exactlyis relntionnBIt is a common misconception that relational a meansfor modi- tlatabasesderive the name "relational" from their ability to "relate" a column in one table to a language.He only r:olumn in another through a foreign key relationship.The true meaning of relational,however, stemsfrom the central structural component of the relational model: the relation, which itself is basedon the mathematical concept.First and foremost, a relational databaseis one that uses relationsas the sole structural unit in which to represent information, as mandated by the However,the most Information Principle. A more specific definition of relational, however,is provided by QL is a relationally Codd's Rule Zero: relationalcalculus. rlgebra).In fact, as Rule Zero, For any systemthat is aduertiseda.s, or claimed to be, a relational data base fined in relational management system,that systemmust be able to manage data basesentirely through its r elatio nal capab ili ties. is dedicatedto ;e For a databaseto be called relational, it must provide all of the facilities required by the rela- ng, and destroying tional model. That is, it must conform to all of the other 12rules. And believeit or not, you have ithin DDLlies also coveredevery one of them. You therefore should have a good idea at th-ispoint what it means straints.Likewise, to be relational.Don't get too cocky,though, as Codd later expandedthe 12rules to over 300. pulation language culus. LrrentSQL standard e SQLisrelationally Summary el.Although itwas 'fhis was a nickel tour through the relational model. While our discussionisn't definitive, my araysin which SQL goal was to give you enough history to make the topic enjoyableand enough theory for you to re considerimpor- understand some of the thinking behind SQL-why it works the way it does.The relational . Part of the reason model has clearly influenced its designand has provided it with a solid theoretical foundation. rrd felt that it was The relational model has proven itself over its 30-yearhistory. It has had an enormous ;h a basethat data- impact not only on computing but on the way we do business.Today, it can be found in a wide ritial standardwas array of electronic machinery, ranging from mainframes to cell phones. ;earchersinvolved The relational model was createdto provide a logical, consistentrepresentation of data iome relational that is independent of hardware and software.The model built on well-founded theory set sequentstandards forth in mathematics.This model provides databaseusers with a consistent,unchangingview peaseits detractors of information, powerful methods to operate on it, and mechanismsto protect and ensure its 'oprietaryfeatures. consistencyand integrity. :ll aswhat an ideal There are many more aspectsto the relational model, and much that builds on it. It is the een proposedand subject of entire books, some of which I have included in the "References"section. While it is a l more expressive largesubject, its core conceptsare logical and straightforward.These concepts ground, frame, and form the basisof the subject coveredin the next chapter: SQL. If you are new to SQLand you've patiently endured this chapter,you should have a much easiertime graspingthe conceptsin the next chapter.You will seeSQL not asan arbitrary language but as a gatewayinto a powerful databasemanagement system.You will find that its syntax is tion. heavily gearedto what you've learned here in this chapter. CHAPTER3 -,i:J11E RELATI0NAL M0DEL ,i ;i;; ReferenGes

Codd, E. F. 1970."A relationalmodel for largeshared data banks." Communications of ACM,13(6):377-387. SQT Codd, E. F. 1972.Relational Completeness of Data BaseSublanguages in Data Base Systems.In Rustin,R. I. (ed.),Data BaseSystems, Courant Computer SymposiaSeries, v. 6. EnglewoodCliffs, NJ: Prentice-Hall.

Codd, E. F. 1979."Extending the DatabaseRelational Model to capturemore meaning." ACM Transactionson DatabaseSystems. 4(4):397434. T Codd, E. F. 19B0.Data Models in DatabaseManagement. In Proceedingsof the l9B0 I his chapteris a Workshopon DataAbstraction, Databasesand ConceptualModeling(Pingree Park, CO, irrparticular. It asr Iune23-26, l9B0).NewYork: ACM Press,ll2-114. irrenew to SQL,S( Codd, E. F. 1982."Relational database: a practicalfoundation for productivity." rionaldatabases. I Communicationsof ACM 25(2):109-l I 7. clrapteron the rel rt:laxed and practi Codd,E. F. 1985."ls your DBMSreally relational?," Computerworld (Part l: October14, 1985, rlidn'tread the Prr Part2 October21, l9B5). rturterialcovered i n('cessaryto learn Connolly, C. E. B. 2001.Database Systems: A PracticalApproachto Design,Implementation SQLis the sol and Management.Boston: Addison-Wesley. rlrriabase.It is a la Date, C. I. 1999.Thirty Yearsof Relational:Rel"ational (series of l2 articles),Intelligent \tntcturing,readi Enterprisel, Nos. l-3 and 2, Nos. l-9 (October1998 onward). Note: Most installments ,rHt{regating,and i after the first publication in the online portion of the magazineat SQLis an intt r,wvw.intelligententerprise. com,accessed on March 16,2006. ol'th€fun thingsa tlriltyou can alwa Date, C. I. 2003.An Introduction to DatabaseSystems, Bth ed. Boston:Addison-Wesley. oltcn many ways Grimaldi, R. P. 1998.Discrete and Combinatorial Mathematics,4thed. Boston: ru l'indmore effic. Addison-Wesley. nroreelegant app rrl th€ languageyc Reiter,R. 1978.On ClosedWorld DataBases. In Gallaire,H. and Minker, I. (eds.),Logic and \'()urexperience. Data Bases.NewYork: Plenum, ll9-140. The goal of tt ,rrrdperhaps a fev Silberschatz,A., H. F. Korth, and S. Sidharshan.2002.Database System Concepts. New .rsl)cctsto SQL.T York: McGraw Hill. Iogicalorder that rlrcreis a lot of gr rlrroughthis chag rrra database.

TheRelal \s you'll rememt ,,riginally propos lr:rst-'sprovide a q