Chapter 3: The Relational Data Model and Relational Algebra

The Relational Model of Data

[Figure: a relation between sets A and B, drawn as a subset of A × B]

The basis of the model is the concept of relation, as found in mathematics: in set theory and in mathematical logic, in particular in predicate logic. A model is given in terms of relations between elements of a domain.

A relational schema contains the basic elements of a relational data model. The schema is application dependent.

A relational schema S contains:

• The data domain D, which is a possibly infinite set.

• A finite collection of relations (relation names) R1,...,Rn over D, each of finite and fixed arity. That is, for each relation name R ∈ S, its potential extensions will be subsets of D^k = D × ··· × D (k times), for some natural number k that depends on R. We write R(·,...,·), with k arguments. A k-ary relation can be seen as a table with k columns:

    R
    ·   ···   ···   ·
    ·   ···   ···   ·
    ←   k columns   →

• A finite collection of attributes (attribute names) A1,...,Am. They are associated with the different relations to denote their arguments, or “columns”. They can be identified with unary relations (unary predicates, properties) over D. That is, they can be identified with subsets (sub-domains) of the domain D.

    R
    A   ···   ···   C
    ·   ···   ···   ·
    ·   ···   ···   ·

A,...,C are attributes of relation R.

Example: Schema S with

Domain D = {john, peter, mary, ..., 1, 2, 3, 4, ...}
Binary relation People(·, ·)
Attributes for People, in this order: Name, Age

    People
    Name   Age
    ·      ·

No contents or extensions so far; the schema describes the structure of the model. Schemas are domain/application dependent. We can see that attributes can be treated as (yet to be filled) subsets of the domain.

Two attributes with different names can later have the same extensions and still be different (that is why treating them as functions is more precise). Example: Schema with domain D = {john, peter, mary, ...} and relation

    Manager
    Boss   Subordinate
    ·      ·

Schemas can be filled with data in many different ways. An instance D compatible with a given schema S is a collection of finite extensions for the relation names in the schema.

Example: For the schema S with

Domain D = {john, peter, mary, ken, carol, steve, ..., 1, 2, 3, 4, ....}

Binary relations People(·, ·), Manager(·, ·)
Attributes for People, in this order: Name, Age
Attributes for Manager, in this order: Boss, Subordinate

This is an instance compatible with the schema:

D1:

    People
    Name   Age
    john   35
    mary   25
    ken    40

    Manager
    Boss   Subordinate
    ken    john
    john   mary

This is another compatible instance:

D2:

    People
    Name    Age
    mary    35
    mary    25
    peter   40

    Manager
    Boss    Subordinate
    ken     steve
    carol   steve
    john    mary

The sub-domains for the attributes Boss and Subordinate are the same, namely the subset {john, peter, mary, ken, carol, steve, ...} of the database domain D.

Example: (different notations for the same thing) The table

    Accounts
    Account#   Name      Balance
    12345      Raoul     400,00
    34567      Rupert    354,60
    12338      Rumilde   1234,30
    34561      Sulema    34445,23

is an instance of a relation between the attributes Account#, Name, and Balance. Each attribute has an associated (sub)domain. Here, the account number 12345, the name Raoul and the numerical value 400,00 are mutually related through the relation.

The schema of the relation is: Accounts(Account#, Name, Balance)

Bank Example: Some abbreviations

clientn = client name
cladd = client address
clneigh = client neighborhood
branch = branch name
acc# = account number

Schema: Deposit(branch, acc#, clientn, balance), Client(clientn, cladd, clneigh)

An instance:

    Deposit
    branch      acc#   clientn   balance
    Carleton    101    Jim       500
    Downtown    215    Sandy     700
    Barrhaven   304    Alvin     1300

    Client
    clientn     cladd            neighcl
    Jim         101 Queensbury   Barrhaven
    Sandy       40 Stone         Nepean
    Hernandez   15 Laurier       Downtown
    Alvin       17 Clyde         Altavista
    John        89 Case          Centrepoint

What is the right schema? What about this one, a single universal relation?

Bank(branch, acc#, clientn, balance, cladd, neighcl)

It depends on the application and on other practical, DB-oriented issues:

• If a client has several accounts, there is redundancy of information. The DB becomes unnecessarily large, and inconsistencies become more likely to occur.
• If a client has an account but no address, we have to fill in more (null) values than desired. Null values are not easy to handle.

We will come back to design issues later on ... For the moment, this one seems to be a better schema:

Deposit(branch,acc#,clientn,balance) Client(clientn,cladd,clneigh)

Consider the relation Deposit

We have 4 (sub)domains D1, D2, D3, D4, one for each of its 4 attributes, from which they take values (branch names, account numbers, client names, balances). Any row in the table (extension of the relation) is a 4-tuple (v1, v2, v3, v4) with

v1 ∈ D1, v2 ∈ D2, v3 ∈ D3, v4 ∈ D4

That is,

    Deposit
    branch      acc#   clientn   balance
    Carleton    101    Jim       500
    Downtown    215    Sandy     700
    Barrhaven   304    Alvin     1300

is a subset of D1 × D2 × D3 × D4. Any instance of the relation Deposit will be a subset of D1 × D2 × D3 × D4.

We use relation and table as synonyms, and likewise tuple and row. If t is a tuple and R is a relation (extension), then:

• We say that t ∈ R if the tuple belongs to the relation R (relation extensions are sets).
• Let A be the name of the attribute in the nth column of relation R. If t ∈ R, then t[n] and t[A] denote the value of the attribute A in the tuple t.

For example, if t denotes the first tuple in the table Deposit, then t[2] = t[acc#] = 101.

Useful notation: Since the same attribute may appear in different tables, we distinguish the occurrences of the attribute by using the relation name followed by “.” as a prefix, e.g. Deposit.acc#, Deposit.clientn, Client.clientn.

Queries

For the bank instance above (relations Deposit and Client), give me the addresses, with balances, of the clients who have a balance higher than 600.

Answer:

    40 Stone   700
    17 Clyde   1300

The answer is a set of tuples, a new relation (extension). We can say that a query is a mapping that sends DB instances to new DB instances (possibly with a different schema).
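To make the idea of a query as a mapping from instances to new relations concrete, here is a minimal Python sketch of the query above. Representing relations as sets of tuples, and the function name `addresses_with_high_balance`, are assumptions made for illustration only.

```python
# Relations as sets of tuples, copied from the bank instance in the notes.
deposit = {
    ("Carleton", 101, "Jim", 500),
    ("Downtown", 215, "Sandy", 700),
    ("Barrhaven", 304, "Alvin", 1300),
}
client = {
    ("Jim", "101 Queensbury", "Barrhaven"),
    ("Sandy", "40 Stone", "Nepean"),
    ("Hernandez", "15 Laurier", "Downtown"),
    ("Alvin", "17 Clyde", "Altavista"),
    ("John", "89 Case", "Centrepoint"),
}

def addresses_with_high_balance(deposit, client, threshold=600):
    """Map an instance (deposit, client) to a new binary relation (address, balance)."""
    return {
        (cladd, balance)
        for (_branch, _acc, d_name, balance) in deposit
        for (c_name, cladd, _neigh) in client
        if d_name == c_name and balance > threshold
    }

answer = addresses_with_high_balance(deposit, client)
print(sorted(answer))  # [('17 Clyde', 1300), ('40 Stone', 700)]
```

The result has a schema (address, balance) that appears in neither input relation, which is what "possibly with a different schema" refers to.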

Several issues:

• How to specify a query? How to write it? In what language?
• What is the precise meaning of a query?
• How to compute the answer?

There are several query languages for RDBs, some more used in practice than others. But those of a more theoretical nature are the basis for those most used in practice.

The distinction between declarative and procedural query languages is always relevant. The former express what the user wants to obtain from the database; the latter express a particular way to compute the answer.

Relational Algebra as a Query Language

Idea: Relations are sets (subsets of cartesian products) constructed on top of other sets (domain or subdomains). Query answers are new relations. Thus, in order to obtain new relations (e.g. query answers), do set-theoretic algebra on existing relations: operate on sets and relations in order to obtain new sets or relations.

The Relational Algebra (RA)

• It provides algebraic operations over relations that produce new relations.
• The operations are based on set-theoretic operations. Some of them come directly from set theory; others are specific, ad hoc, to the RA. The latter are applicable to relations (as opposed to sets in general).
• It provides a procedural query language for RDBs (because it is based on explicit operations).
• The RA is one of the strengths of the relational model.
• RA can be used to give a precise, set-theoretic semantics to other query languages.

Queries in RA:

It is possible to answer a query by applying a sequence of algebraic (relational) operations, starting from the original database instance. Even if the RDBMS offers a different query language, e.g. a declarative one, a query will be compiled into a sequence of algebraic operations on the DB.

Summary of basic operations of RA:

• Union and Intersection: R1 ∪ R2, R1 ∩ R2. They can be applied to similar relations, i.e. of the same arity (and data types), as for normal sets.

• Difference: R1 \ R2. Again for similar relations, as for normal sets.

• Product: R1 × R2. This is essentially the cartesian product of two relations taken as normal sets. E.g. for R = {(a, b), (c, d)}, S = {(1, 2), (2, 3)}:

  R × S = {(a, b, 1, 2), (a, b, 2, 3), (c, d, 1, 2), (c, d, 2, 3)}

[Figure: Venn diagrams over the domain D showing R1 ∪ R2, R1 ∩ R2 and R1 \ R2]
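The set-theoretic operations above map directly onto Python's built-in sets. The following is a minimal sketch, assuming relations are modelled as sets of tuples; the helper name `ra_product` and the small two-column relations r1 and r2 are made up for illustration.

```python
from itertools import product

# Two "similar" relations (same arity and data types), as sets of tuples.
r1 = {(100, "Volnay"), (130, "Tokay")}
r2 = {(130, "Tokay"), (140, "Chenas")}

union = r1 | r2          # R1 ∪ R2
intersection = r1 & r2   # R1 ∩ R2
difference = r1 - r2     # R1 \ R2

def ra_product(r, s):
    """Cartesian product: concatenate each tuple of r with each tuple of s."""
    return {t + u for t, u in product(r, s)}

# The product example from the notes.
r = {("a", "b"), ("c", "d")}
s = {(1, 2), (2, 3)}
print(sorted(ra_product(r, s)))
```

The product of relations with m and n tuples always has m·n tuples, which is why it is usually applied only after the inputs have been reduced.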

• Projection: Π_A(R(···, A, ···)), i.e. the projection of relation R on attribute A.

[Figure: a relation R drawn in the plane with axes A and B, together with its projection Π_A(R) on the A axis]

Here, A is one of the attributes of R. The projection could be on several attributes of R. This is a unary operation: it takes one relation as input (the previous ones are binary). It is an operation specific to relations. It deletes (ignores, projects out) entire “columns” from a relation; it projects R over one (or several) “coordinates” (attributes). It generates a new relation, with a subset of the attributes (columns). Its logical counterpart is existential quantification. For the relation in the figure:

  Π_A(R(A, B)) = {a ∈ A | there exists b ∈ B such that (a, b) ∈ R}

• Selection: σ_condition(R)

A unary operation, specific to relations. It selects the tuples of the relation R that satisfy the condition. The condition can be expressed in a (limited) logical language. It generates a new relation, with the same attributes, but possibly fewer tuples (rows).

• Join: R1 ⋈ R2

A binary operator, essential in RA. It composes two relations through the common values taken by a distinguished attribute that is shared by the two relations (or by two different attributes with the same data type or domain). It is similar to the composition of two relations as seen in set theory: R◦S. It is essential for combining tables in a natural way, without appealing to their possibly large and computationally expensive product. There are generalizations of this basic, natural join.

Notice: There is no (set-theoretic) complement operation in RA (as found in set theory).

[Figure: a relation R inside the domain D; what is its complement R^c?]

In principle there could be one, given that relations are sets, but what is the “meaning” of the complement of a relation? Actually, it could be infinite, because the DB domain is possibly infinite. The difference (\) is only a relative complement, relative to a given relation.

    Deposit
    branch      acc#   clientn   balance
    Carleton    101    Jim       500
    Downtown    215    Sandy     700
    Barrhaven   304    Alvin     1300

Which could be the tuples that are not in Deposit? Which of those make sense?

Examples:

Union: Two relations with the same schema

    WINE1
    W#    GRAPE      VINTAGE   PERCENTAGE
    100   Volnay     1978      12.5
    110   Chablis    1979      12.0
    120   Sancerre   1980      12.5
    130   Tokay      1980      12.5

    ∪

    WINE2
    W#    GRAPE      VINTAGE   PERCENTAGE
    130   Tokay      1980      12.5
    140   Chenas     1981      12.7
    150   Volnay     1978      12.5

    =

    WINE3
    W#    GRAPE      VINTAGE   PERCENTAGE
    100   Volnay     1978      12.5
    110   Chablis    1979      12.0
    120   Sancerre   1980      12.5
    130   Tokay      1980      12.5
    140   Chenas     1981      12.7
    150   Volnay     1978      12.5

Similarly, there is the intersection of the two relations:

    WINE4
    W#    GRAPE      VINTAGE   PERCENTAGE
    130   Tokay      1980      12.5

Difference: Two relations with the same schema. For the same WINE1 and WINE2 as above, WINE1 \ WINE2 is:

    WINE4
    W#    GRAPE      VINTAGE   PERCENTAGE
    100   Volnay     1978      12.5
    110   Chablis    1979      12.0
    120   Sancerre   1980      12.5

It should be clear from these examples that the complement of a table does not make much sense ...

Product: Two relations, not necessarily with the same schema

    GRAPE
    GRAPE         AREA         COUNTRY
    Chenas        Beaujolais   France
    Volnay        Bourgogne    France
    Chanturgues   Auvergne     France

    ×

    YEAR
    VINTAGE   QUALITY
    1979      Good
    1980      Average

    =

    G/Y
    GRAPE         AREA         COUNTRY   VINTAGE   QUALITY
    Chenas        Beaujolais   France    1979      Good
    Chenas        Beaujolais   France    1980      Average
    Volnay        Bourgogne    France    1979      Good
    Volnay        Bourgogne    France    1980      Average
    Chanturgues   Auvergne     France    1979      Good
    Chanturgues   Auvergne     France    1980      Average

A huge table; many of the combinations may not make much sense. The product is an expensive operation that we may want to avoid, or apply only after we have reached smaller tables using other operations. Usually it makes more sense, from the application point of view, to combine tables via a join.

Join: First the natural join (there are more general ones). It is an essential binary operator of RA. Relations are composed via the common values taken by attributes in common (or of a similar data type).

    WINE
    W#    GRAPE     VINTAGE   QUALITY
    100   Chenas    1977      Good
    200   Chenas    1980      Excellent
    300   Chablis   1977      Good
    400   Chablis   1978      Bad
    500   Volnay    1980      Average

    ⋈

    LOCATION
    GRAPE     AREA         AVG-QUALITY
    Chenas    Beaujolais   Good
    Chablis   Bourgogne    Average
    Chablis   California   Bad

    =

    W/L
    W#    GRAPE     VINTAGE   QUALITY     AREA         AVG-QUAL
    100   Chenas    1977      Good        Beaujolais   Good
    200   Chenas    1980      Excellent   Beaujolais   Good
    300   Chablis   1977      Good        Bourgogne    Average
    300   Chablis   1977      Good        California   Bad
    400   Chablis   1978      Bad         Bourgogne    Average
    400   Chablis   1978      Bad         California   Bad

This is a common but expensive operation in RDBs. In general one applies it once tables have been reduced using other operations. The intersection, difference, selection and projection all reduce relations. In (syntactic) query optimization, sequences of operations are rearranged to make the whole evaluation less expensive.

Join operations can be more general than this basic one; joins can be performed under more complex “join conditions”. Here, in formal terms, the simple condition can be specified with the join:

  WINE ⋈_{WINE.GRAPE=LOCATION.GRAPE} LOCATION

The join is performed under the condition that the values of the GRAPE attribute in the two tables coincide.
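The natural join of WINE and LOCATION can be sketched as a nested loop in Python. This is an illustrative sketch only: rows are modelled as dictionaries keyed by attribute name, and `natural_join` is a made-up helper, not how a DBMS actually evaluates joins.

```python
# Rows as dictionaries keyed by attribute name (data copied from the notes).
wine = [
    {"W#": 100, "GRAPE": "Chenas", "VINTAGE": 1977, "QUALITY": "Good"},
    {"W#": 200, "GRAPE": "Chenas", "VINTAGE": 1980, "QUALITY": "Excellent"},
    {"W#": 300, "GRAPE": "Chablis", "VINTAGE": 1977, "QUALITY": "Good"},
    {"W#": 400, "GRAPE": "Chablis", "VINTAGE": 1978, "QUALITY": "Bad"},
    {"W#": 500, "GRAPE": "Volnay", "VINTAGE": 1980, "QUALITY": "Average"},
]
location = [
    {"GRAPE": "Chenas", "AREA": "Beaujolais", "AVG-QUALITY": "Good"},
    {"GRAPE": "Chablis", "AREA": "Bourgogne", "AVG-QUALITY": "Average"},
    {"GRAPE": "Chablis", "AREA": "California", "AVG-QUALITY": "Bad"},
]

def natural_join(r1, r2):
    """Combine rows of r1 and r2 that agree on all attribute names they share."""
    result = []
    for t1 in r1:
        for t2 in r2:
            shared = set(t1) & set(t2)
            if all(t1[a] == t2[a] for a in shared):
                joined = {**t1, **t2}
                if joined not in result:  # keep set semantics: no duplicates
                    result.append(joined)
    return result

w_l = natural_join(wine, location)
print(len(w_l))  # 6
```

The Volnay wine does not appear in the result because no LOCATION tuple shares its GRAPE value.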

Projection:

    WINE
    W#    GRAPE     VINTAGE   PERCENTAGE   QUALITY
    100   Volnay    1979      12.7         Good
    110   Chablis   1980      11.8         Average
    120   Tokay     1981      12.1         Excellent
    130   Chenas    1979      12.0         Good
    140   Volnay    1980      11.9         Average

    Π_{VINTAGE,QUALITY}    (a unary operator)

    YEAR
    VINTAGE   QUALITY
    1979      Good
    1980      Average
    1981      Excellent
    1979      Good
    1980      Average

Giving a new name to the result is not part of the operation (but helps). The rows of YEAR are listed as they come out of WINE; since relations are sets, the repeated rows count only once.

A tuple t is in the result (the projection) iff there is a tuple t' in the original relation that, restricted to the attributes indicated in Π, gives t:

  t'[VINTAGE, QUALITY] = t

In other words, (1979, Good) ∈ Π_{VINTAGE,QUALITY}(WINE) because there exist values, say x, y, z, for the attributes W#, GRAPE, PERCENTAGE (we do not care which ones) such that the tuple (x, y, 1979, z, Good) belongs to relation WINE.
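The projection above can be sketched in Python as follows. Modelling the relation as a set of tuples plus a separate tuple of attribute names, and the helper name `project`, are assumptions of this sketch.

```python
WINE_ATTRS = ("W#", "GRAPE", "VINTAGE", "PERCENTAGE", "QUALITY")
wine = {
    (100, "Volnay", 1979, 12.7, "Good"),
    (110, "Chablis", 1980, 11.8, "Average"),
    (120, "Tokay", 1981, 12.1, "Excellent"),
    (130, "Chenas", 1979, 12.0, "Good"),
    (140, "Volnay", 1980, 11.9, "Average"),
}

def project(relation, attrs, wanted):
    """Keep only the columns named in `wanted`; the result is again a set."""
    positions = [attrs.index(a) for a in wanted]
    return {tuple(t[i] for i in positions) for t in relation}

year = project(wine, WINE_ATTRS, ("VINTAGE", "QUALITY"))
print(sorted(year))  # [(1979, 'Good'), (1980, 'Average'), (1981, 'Excellent')]
```

Because the result is built as a set, the two WINE tuples that both project to (1979, Good) contribute a single tuple, in line with the existential reading of the projection.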

Selection:

    WINE
    W#    GRAPE     VINTAGE   PERCENTAGE   QUALITY
    100   Volnay    1979      12.7         Good
    110   Chablis   1980      11.8         Average
    120   Tokay     1981      12.1         Excellent
    130   Chenas    1979      12.0         Good
    140   Volnay    1980      11.9         Average

    σ_{QUALITY=Good}

    GOOD-WINE
    W#    GRAPE     VINTAGE   PERCENTAGE   QUALITY
    100   Volnay    1979      12.7         Good
    130   Chenas    1979      12.0         Good

Here the condition is very simple.

It is possible to express more complex selection conditions using a more expressive language that may use:

• Attribute names
• Logical, boolean (propositional) operations (AND, OR, NOT)
• Built-in relations (=, <, ≤, >, ≥, ≠) applied to attribute names and domain elements

Built-in relations have a fixed semantics, and fixed and possibly infinite extensions, as opposed to relations in the schema, which have variable extensions depending on the application and the state of the DB. E.g. the < built-in relation on the data type integer has an infinite, fixed extension that the DBMS can simply use.

    <
    Smaller   Bigger
    0         1
    0         2
    ···       ···
    1000      1500
    ···       ···

    ≠
    String   String
    john     peter
    peter    mary
    ···      ···
    mary     john
    ···      ···

So a selection could be

  σ_{VINTAGE>1980 OR QUALITY=Good}(WINE)

This boolean language for expressing conditions can also be used to extend the join operation ⋈ with conditions, just as we can express selections σ with complex conditions:

  WINE ⋈_{W.GRAPE=L.GRAPE AND QUALITY=AVG-QUALITY} LOCATION
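A selection with a boolean condition can be sketched in Python by passing the condition as a function. The namedtuple field names and the helper `select` are assumptions of this sketch; the data is the five-tuple WINE relation used in the selection example.

```python
from collections import namedtuple

# Field names are illustrative stand-ins for W#, GRAPE, VINTAGE, PERCENTAGE, QUALITY.
Wine = namedtuple("Wine", ["wine_no", "grape", "vintage", "percentage", "quality"])

wine = {
    Wine(100, "Volnay", 1979, 12.7, "Good"),
    Wine(110, "Chablis", 1980, 11.8, "Average"),
    Wine(120, "Tokay", 1981, 12.1, "Excellent"),
    Wine(130, "Chenas", 1979, 12.0, "Good"),
    Wine(140, "Volnay", 1980, 11.9, "Average"),
}

def select(relation, condition):
    """Keep the tuples that satisfy the condition; attributes are unchanged."""
    return {t for t in relation if condition(t)}

# σ_{VINTAGE>1980 OR QUALITY=Good}(WINE)
result = select(wine, lambda t: t.vintage > 1980 or t.quality == "Good")
print(sorted(t.wine_no for t in result))  # [100, 120, 130]
```

The comparison operators inside the lambda play the role of the built-in relations: their meaning is fixed by the language, not stored in the database.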

Queries Expressed in RA

A query can be expressed as a sequence of operations of RA applied to the original tables and/or intermediate results.

Example: Consider the schemas

DRINKER(DRINKER#, SURNAME, FNAME, TYPE)
DRINKS(DRINKER#, WINE#, DATE, QUANTITY)
WINE(WINE#, GRAPE, VINTAGE, PERCENTAGE)

Query 1: Obtain the percentages of alcohol in the wines of grape Morgon, vintage 1979.

Answer 1:

  R1 := σ_{GRAPE=Morgon}(WINE)
  R2 := σ_{VINTAGE=1979}(WINE)
  R3 := R1 ∩ R2
  ANS := Π_{PERCENTAGE}(R3)

Answer 2: (same values)

  ANS = Π_{PERCENTAGE}(σ_{GRAPE=Morgon AND VINTAGE=1979}(WINE))

Notice the correspondence between the set-theoretic operation (∩) and the logical operation (AND).
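The two answers to Query 1 can be compared on a small instance. The WINE instance below is invented purely for illustration (the notes give no data for this schema), as are the namedtuple field names and the helpers `select` and `project_percentage`.

```python
from collections import namedtuple

Wine = namedtuple("Wine", ["wine_no", "grape", "vintage", "percentage"])

# Hypothetical instance, not taken from the notes.
wine = {
    Wine(10, "Morgon", 1979, 12.5),
    Wine(20, "Morgon", 1980, 12.0),
    Wine(30, "Chenas", 1979, 12.7),
    Wine(40, "Morgon", 1979, 13.0),
}

def select(relation, condition):
    return {t for t in relation if condition(t)}

def project_percentage(relation):
    return {(t.percentage,) for t in relation}

# Answer 1: two selections, intersection, then projection.
r1 = select(wine, lambda t: t.grape == "Morgon")
r2 = select(wine, lambda t: t.vintage == 1979)
r3 = r1 & r2
answer_1 = project_percentage(r3)

# Answer 2: one selection with a conjunctive condition, then projection.
answer_2 = project_percentage(
    select(wine, lambda t: t.grape == "Morgon" and t.vintage == 1979)
)

print(answer_1 == answer_2)  # True
```

Agreement on one instance is only a sanity check; the equivalence of the two expressions holds for every instance because intersecting two selections on the same relation is the same as selecting with the conjunction of their conditions.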

Query 2: Obtain last and first names of drinkers of Morgon or Chenas.

Now we need to combine the three original tables:

  R1 := σ_{GRAPE=Morgon}(WINE)
  R2 := σ_{GRAPE=Chenas}(WINE)
  R3 := R1 ∪ R2
  R4 := R3 ⋈_{WINE#} DRINKS      (R3 is smaller than WINE)
  R5 := R4 ⋈_{DRINKER#} DRINKER
  ANS := Π_{SURNAME,FNAME}(R5)

Notice that we selected before the join, so the join becomes smaller. Doing it the other way around would have been semantically the same.

Query 3: Obtain last and first names of drinkers who have tried in one day more than 10 samples of Chablis, vintage 1976, together with the percentage of alcohol of the wine.

  R1 := σ_{QUANTITY>10}(DRINKS)
  R2 := σ_{GRAPE=Chablis}(WINE)
  R3 := σ_{VINTAGE=1976}(WINE)
  R4 := R2 ∩ R3
  R5 := R1 ⋈_{WINE#} R4
  R6 := Π_{DRINKER#,PERCENTAGE}(R5)
  R7 := R6 ⋈_{DRINKER#} DRINKER
  ANS = Π_{SURNAME,FNAME,PERCENTAGE}(R7)

Warning!: RA is based on set-theoretic operations, i.e. operations that take and produce sets. In consequence, “duplicates” (multiple occurrences of the same tuple) do not appear anywhere. It is possible to extend these operations to “multi-sets” that may have duplicates (we do not do this for the moment, though).

Exercise: Illustrate the computations of queries 1-3 using concrete initial instances and producing all the intermediate relations that lead to the final answer.

Exercise: Assume we have the following schema

Frequents(Drinker, Bar)
Serves(Bar, Beer)
Likes(Drinker, Beer)

Express in RA the following queries:

1. Which bars serve the beer John likes?
2. Which drinkers frequent at least one bar that serves some beer they like?
3. Which drinkers frequent only bars that serve at least one beer they like?
4. Which drinkers do not frequent any bar that serves some beer they like?

Remarks, Extensions, Limitations

• RA provides a procedural language for querying RDBs.
• We presented the most common relational operations, but there are others (cf. the textbook).
• There may be many different RA expressions (formulas) that can be used to compute the same query answer. Which one to use depends on efficiency issues.
• Queries (computations thereof) can be optimized by rearranging them into semantically equivalent RA formulas (i.e. with the same meaning).
• Space is always an issue; DBs can be very large and computations take place in main memory.
• RDBMSs have built-in query optimizers that take care of optimizing the query.
• The notion of “semantic equivalence” of queries, in particular of relational expressions, is well-defined and precise: two RA queries are equivalent if for every instance they produce the same result (i.e. the same query instance). This is just like when we say that the (numerical) algebraic expressions x + y + 0 and x(y + 1) − xy + y − 1 + 1 are equivalent: for any values of x, y the result is the same.
• A strength of RA: the semantics of the language is clear, precise, formal and well-studied. It is grounded on set theory and predicate logic.
• There is a purely “logical counterpart” to the RA. The relational calculus is a declarative query language that is based directly on predicate logic (we saw informal examples before). Relational algebra and relational calculus are equivalent in terms of the queries they can express (more on this later).

It is possible to define new operations for the RA using algebraic expressions that use the already defined operations (and nothing more).

Example: We could define the “symmetric difference” of two similar relations:

  R1 ∆ R2 := (R1 \ R2) ∪ (R2 \ R1)

[Figure: Venn diagram over the domain D showing R1 ∆ R2, the parts of R1 and R2 outside their intersection]

Notice that the new operation (∆) on the LHS is being defined by means of a fixed algebraic formula that uses already defined operations (\, ∪). It is a single formula or definition that can be applied to any instance, i.e. the definition is independent of the instance to which it is applied.

Two relations with the same attributes:

    WINE1
    W#    GRAPE      VINTAGE   PERCENTAGE
    100   Volnay     1978      12.5
    110   Chablis    1979      12.0
    120   Sancerre   1980      12.5
    130   Tokay      1980      12.5

    ∆

    WINE2
    W#    GRAPE      VINTAGE   PERCENTAGE
    130   Tokay      1980      12.5
    140   Chenas     1981      12.7

    =

    WINE5
    W#    GRAPE      VINTAGE   PERCENTAGE
    100   Volnay     1978      12.5
    110   Chablis    1979      12.0
    120   Sancerre   1980      12.5
    140   Chenas     1981      12.7
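The definition of ∆ by a fixed formula can be mirrored in Python: the function below uses only set difference and union, never inspects the data, and so works for any pair of similar relations. The function name `symmetric_difference` is just an illustrative choice.

```python
def symmetric_difference(r1, r2):
    """R1 ∆ R2 := (R1 \\ R2) ∪ (R2 \\ R1), using only difference and union."""
    return (r1 - r2) | (r2 - r1)

wine1 = {
    (100, "Volnay", 1978, 12.5),
    (110, "Chablis", 1979, 12.0),
    (120, "Sancerre", 1980, 12.5),
    (130, "Tokay", 1980, 12.5),
}
wine2 = {
    (130, "Tokay", 1980, 12.5),
    (140, "Chenas", 1981, 12.7),
}

wine5 = symmetric_difference(wine1, wine2)
print(sorted(t[0] for t in wine5))  # [100, 110, 120, 140]
```

Python's own `^` operator on sets computes the same thing, which gives a quick cross-check of the derived definition.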

Exercise: Invent and define new operations for the RA.

Actually, we have not been very economical when we listed the basic RA operations. Some of them could have been defined in terms of the others, so they are theoretically redundant (but not necessarily practically redundant).

Exercise: Define the join in terms of the product, the selection and the projection.

The Transitive Closure

Example:

    Paternity
    Father   Son
    Eric     Luis
    Eric     Juan
    Juan     Carlos
    Juan     Sergio
    Luis     Tomas
    Tomas    Pedro

We want to define and compute a new relation Ancestry that contains all (and only) the tuples that can be obtained by transitive paternity, i.e.

    Ancestry
    Ancestor   Descendant
    Eric       Luis
    Eric       Juan
    Juan       Carlos
    Juan       Sergio
    Luis       Tomas
    Tomas      Pedro
    Eric       Tomas
    Eric       Carlos
    Eric       Sergio
    Luis       Pedro
    Eric       Pedro

Ancestry is the transitive closure of Paternity, i.e. the smallest transitive relation that includes Paternity. The transitive closure of a relation is something we use, compute, and need all the time.

Computation? An iterative procedure computes it:

    Ancestry
    Ancestor   Descendant
    Eric       Luis          step 0
    Eric       Juan
    Juan       Carlos
    Juan       Sergio
    Luis       Tomas
    Tomas      Pedro
    Eric       Tomas         step 1
    Eric       Carlos
    Eric       Sergio
    Luis       Pedro
    Eric       Pedro         step 2

The length of the iteration depends on the initial instance; it is not bounded a priori.
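The iterative procedure can be sketched in Python as a loop that keeps composing the current result with the original relation until nothing new appears. This is one straightforward way to do it, written for illustration; the function name `transitive_closure` is an assumption of the sketch.

```python
def transitive_closure(relation):
    """Iterate until no new pairs appear; the number of rounds depends on the data."""
    closure = set(relation)  # step 0: the relation itself
    while True:
        new_pairs = {
            (a, d)
            for (a, b) in closure
            for (c, d) in relation
            if b == c
        } - closure
        if not new_pairs:
            return closure
        closure |= new_pairs

paternity = {
    ("Eric", "Luis"), ("Eric", "Juan"), ("Juan", "Carlos"),
    ("Juan", "Sergio"), ("Luis", "Tomas"), ("Tomas", "Pedro"),
}

ancestry = transitive_closure(paternity)
print(len(ancestry))  # 11
```

The `while` loop is the part that has no counterpart in a fixed RA expression: each round is a join followed by a projection and a union, but how many rounds are needed is only known at run time.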

Can we define the transitive closure using a general and fixed formula of RA?

  TC(R) := ···

With a formula on the RHS in terms of the operations of RA we saw (and nothing else)? In particular, not depending on the instance at hand ...

It can be mathematically proved that it is not possible to define the TC of a relation by means of a fixed and general formula of RA. The TC is not part of the RA, and cannot be defined by means of a fixed formula that uses (the other) relational operators. The theorem is easier to state and prove in the “logical counterpart” of the RA, i.e. in the relational calculus.

This is a result about the (limited) expressive power of a particular query language: there are things that cannot be expressed in it.

In order to compute the TC from an RDB, an iterative procedure can be programmed in interaction with the DB. Ideally, a query language provided by an RDBMS should offer the possibility to define and express the TC. The newer SQL standard (SQL99) supports this.

(We will come back to this in the context of (extended) logical query languages ...)

Integrity Constraints

So far we have no way to capture requirements or conditions that our DB model (and DB) should satisfy in order to:

• Be an accurate model of the outside reality being modeled (cf. instance D2 above, with two ages for mary ...)
• Impose or contain more meaning, more semantics with respect to the modeled domain
• Stay in correspondence with the modeled domain
• Make sure that when the data changes, the meaning and correspondence are kept

[Figure: integrity constraints (ICs) mediating between the outside reality and the DB]

ICs are statements (sentences, propositions, ...) that have to be satisfied in every stable, valid, legal state of the database. It is not difficult to express semantic or integrity constraints (ICs) in languages of predicate logic. They can be expressed as formal, symbolic sentences in those languages.

    People
    Name    Age   Degree
    mary    35    law
    mary    25    medicine
    peter   40    CS
    peter   40    math

∀x∀y∀z∀u∀v(People(x, y, z) ∧ People(x, u, v) → y = u)

The instance above is not admissible if the IC is to be satisfied, because it does not satisfy the IC. The database instance is inconsistent, in the sense that it does not satisfy (does not make true) the ICs.

In principle, ICs expressed in such languages could be processed by a DBMS, and the DBMS could make sure that the actual DB instance does satisfy them.

How?

• Rejecting changes (updates) that violate them
• Compensating, with additional, internal, automatic updates, for those updates issued by users or application programs
• Notifying the user or applications about violations of ICs before committing changes
• ...

However, to privilege efficiency, those mechanisms are not always implemented or offered by DBMSs. In general, commercial DBMSs provide automated, built-in support only for some restricted, limited classes of ICs.

Thus, in some cases the user has to find alternatives

• Keeping the IC satisfied (DB maintenance) through external application programs that interact with the DB. The external program could issue a query to the database in order to detect whether there is a violation of the IC. Depending on the answer returned by the DBMS, the application program has alternatives on how to proceed.

Which could be the query to detect whether there is a violation of the IC on People given above?

An external program that interacts with the DBMS can create a view that captures the violations of the IC. For example, the following “violation view”:

  V(x): ∃y∃z∃v∃w(People(x, y, v) ∧ People(x, z, w) ∧ y ≠ z)

This view will contain all the names that have more than one age. It is expected that the contents of the view are always empty. If not, there is a violation, and the application program can do something about it. In the example, mary would be caught by the view.
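The violation view can be sketched in Python as a set comprehension over pairs of People tuples, which mirrors the existentially quantified formula. The function name `violation_view` is an illustrative assumption; the data is the People instance with the Degree column shown earlier.

```python
people = {
    ("mary", 35, "law"),
    ("mary", 25, "medicine"),
    ("peter", 40, "CS"),
    ("peter", 40, "math"),
}

def violation_view(people):
    """Names x for which two People tuples exist with different ages."""
    return {
        x1
        for (x1, y, _v) in people
        for (x2, z, _w) in people
        if x1 == x2 and y != z
    }

violations = violation_view(people)
print(violations)  # {'mary'}
```

peter is not reported: his two tuples differ only in Degree, which the constraint does not mention.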

• Defining triggers, to be stored in the DB. Triggers (aka active rules) react automatically when a violation occurs, with:
  • notification messages
  • rejection of updates
  • additional compensating updates, as programmed in the trigger
  • ...

They are of the form: Event & Condition ⇒ Action

For the IC on People given above, the Event could be an update of the table People, like an insertion (deletions of tuples are not relevant for this functional dependency). The Condition could be a check for the occurrence of some tuple in the violation view. The Action does something to restore consistency, e.g. delete the old conflicting tuple and accept the new one. Crossing fingers ...

Some Classes of ICs

1. Functional Dependencies: In some cases, a subset of the attributes functionally determines (or is expected to determine) another subset of the attributes.

Example:

    Students
    Number    Name           Study        Sport
    9901254   John Stanley   CS           Soccer
    9910803   Sue Jones      Math         Skating
    9910803   Sue Jones      Math         Soccer
    9901254   John Stanley   Literature   Handball

“Every student number is associated with at most one student name”, or “Every two students that coincide in student number coincide in student name”. Number functionally determines Name, denoted

  Students : Number → Name      (the arrow is not logical implication)

2. Key Dependencies (constraints): A particular case of functional dependency, where a subset of the attributes functionally determines all the attributes in the relation.

    Students
    Number    Name           Study   Sport
    9901254   John Stanley   CS      Soccer
    9910803   Sue Jones      Math    Skating

“Student number determines all the other attributes of the student”. Number is a key of the relation, i.e.

  Students : Number → {Number, Name, Study, Sport}

This is satisfied by this instance, but not by the previous one.

Example:

    WINE
    GRAPE     VINTAGE   VINEYARD    QUALITY
    Chenas    1977      Laphite     Good
    Chenas    1980      Mouton      Excellent
    Chablis   1977      Rotschild   Good
    Chablis   1978      Crepeau     Bad
    Volnay    1980      Satie       Average

The set of attributes {GRAPE, VINTAGE, VINEYARD} forms a key:

  {GRAPE, VINTAGE, VINEYARD} → QUALITY

(If the relation name is clear from the context, we omit it.)

3. Range Constraints: They restrict the values that can be taken by some attributes.

Example: “A CEO cannot make less than 80,000 per year”, “An employee must be over 18”.

Both are satisfied by

    Employee
    Name   Position     Salary   Age
    john   clerk        40 K     35
    mary   CEO          100 K    45
    ken    accountant   60 K     40

but neither of the two is satisfied by

    Employee
    Name    Position     Salary   Age
    john    clerk        40 K     35
    mary    CEO          70 K     45
    ken     accountant   60 K     40
    carol   programmer   90 K     16

In predicate logic: ∀w∀x∀y∀z(Employee(w, x, y, z) → z > 18) (exercise: express the other).

4. NOT NULL Constraints: They restrict the values taken by some attributes to be non-NULL.

A NULL value is used in databases to represent missing, unknown, non-applicable, ... information (there are different, mostly informal, semantics for NULL values).

    Emp
    Name    Posit      Sal    Age
    john    clerk      40 K   35
    mary    CEO        85 K   NULL
    ken     account.   60 K   40
    carol   NULL       90 K   19

This instance satisfies “Name cannot be NULL”, but not “Age cannot be NULL”.

Normally, when a key constraint declares that a set of attributes is a key, it is also (compulsorily) required that its attributes cannot be NULL. Key constraints and NOT NULL constraints go together.

If Name above has been declared a key (i.e. this IC is imposed), it should not take the value NULL. This has to do with the way NULL values are treated by the DBMS: it does not know whether a NULL represents a value that is equal to or different from the other certain (or null) values.

5. Referential Constraints: They require that all the values of some attributes in a relation also appear in attributes of another relation. The first relation refers to the second relation (which can be thought of as a relation containing official data).

Example: “Every student in relation UnivTeams must be a registered student”

    UnivTeams
    Number    Team
    9910803   Basketball
    ......

    Students (official students table)
    Number    Name           Study
    9901254   John Stanley   CS
    9910803   Sue Jones      Math
    9901254   John Stanley   Literature

A referential IC: UnivTeams.Number refers to Students.Number. It is also called an inclusion dependency: UnivTeams.Number is included in Students.Number:

  UnivTeams.Number ⊆ Students.Number

  (or UnivTeams[Number] ⊆ Students[Number])

This is not a full inclusion dependency, in the sense that not all the attributes of UnivTeams participate in the inclusion. The referring and referred attributes may have different names, e.g. we could have UnivTeams[Number] ⊆ Students[ID] (as long as the data types match).

In the language of predicate logic:

  ∀x∀y(UnivTeams(x, y) → ∃z∃w Students(x, z, w))

6. Foreign Key Constraints: A combination of a referential constraint and a key constraint. In addition to the referential constraint, it is required that the referred attributes in the second relation form a key for that relation. That is, if the referential IC requires

  R[Ai1,...,Aim] ⊆ S[Bj1,...,Bjm],

then we also require that {Bj1,...,Bjm} is a key for S.

Example: We want UnivTeams.Number to be a foreign key for relation Students, i.e. that it refers to the attribute Students.Number, which is a key of Students.

Inconsistent instance!

    UnivTeams
    Number    Team
    9910803   Basketball
    ......

    Students
    Number    Name           Study
    9901254   John Stanley   CS
    9910803   Sue Jones      Math
    9901254   John Stanley   Literature

Consistent instance!

    UnivTeams
    Number    Team
    9910803   Basketball
    ......

    Students (official students table)
    Number     Name           Study
    9901254    John Stanley   CS
    9910803    Sue Jones      Math
    99052454   Ken Scott      Literature
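A foreign key check combines the two conditions just described: the inclusion dependency and the key constraint on the referred attribute. The Python sketch below tests both on the two Students instances from the example; the helper names `is_key` and `satisfies_foreign_key`, and the choice of column position 0 for Number, are assumptions made for illustration.

```python
def is_key(relation, position):
    """True if no two tuples share the same value at the given column position."""
    values = [t[position] for t in relation]
    return len(values) == len(set(values))

def satisfies_foreign_key(univ_teams, students):
    """UnivTeams[Number] ⊆ Students[Number], and Number is a key of Students."""
    team_numbers = {t[0] for t in univ_teams}
    student_numbers = {t[0] for t in students}
    return team_numbers <= student_numbers and is_key(students, 0)

univ_teams = {(9910803, "Basketball")}

students_bad = {
    (9901254, "John Stanley", "CS"),
    (9910803, "Sue Jones", "Math"),
    (9901254, "John Stanley", "Literature"),
}
students_ok = {
    (9901254, "John Stanley", "CS"),
    (9910803, "Sue Jones", "Math"),
    (99052454, "Ken Scott", "Literature"),
}

print(satisfies_foreign_key(univ_teams, students_bad))  # False
print(satisfies_foreign_key(univ_teams, students_ok))   # True
```

The first instance fails even though the inclusion holds, because Number 9901254 occurs in two Students tuples and so is not a key.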

Example: A referential IC

    Loan(branchn, loan#, clientn, amount)
      ↓
    Branch(branchn, actives, branchNeigh)

Example: Foreign key constraints

DRINKER(DRINKER#, SURNAME, FNAME, TYPE)
DRINKS(DRINKER#, WINE#, DATE, QUANTITY)
WINE(WINE#, GRAPE, VINTAGE, PERCENTAGE)

(Thinking in terms of the E/R model, the relation in the middle seems to come from a Relationship, and the other two from Entities.)

Final Remarks

• There are several other classes of ICs for the relational model. We presented those most common in practice.
• For most of them, commercial RDBMSs provide automatic support (database maintenance with respect to them).
• Integrity constraints become part of the schema.
• Those declared in the schema are expected to be satisfied by all the instances that are compatible with the schema. In that case we say that the database (instance, extension) is consistent with respect to the declared ICs.
• We will see other kinds of ICs later on, in other contexts.

Chapter 5: Relational Algebra and Relational Calculus

Relational Calculus

The relational algebra is an algebraic and procedural query language for relational databases. A few examples have shown that it is also possible to use languages from predicate logic to:

• Pose queries to the database
• Specify integrity constraints
• Define views of the database
• In general, express metadata, i.e. data about the data, i.e. about the structure and organization of data

The “logical counterpart” of the RA is called the Relational Calculus (RC), and it comes in two flavors:

• The Tuple Calculus (TC): Basically, the “atomic values” of data are complete tuples in relations. The language has variables to refer to tuples.
• The Domain Calculus (DC): The atomic values of data are those taken by the attributes (i.e. columns), as opposed to whole rows. The reason for the name is that the values taken by the attributes are drawn from the underlying database domain. Variables of the language refer to elements of (values in) the database domain.

There are transformations between TC and DC, and basically the same can be expressed in both. Since DC is easier to explain and closer to classical predicate logic, we will concentrate on the DC. RC is a declarative language to express queries (a query language), ICs, view definitions, etc.

We review some elements of predicate logic, at least those that are relevant to RC:

We introduce a formal, symbolic object language to talk about a database, so we need some symbolic ingredients:

• First, we need symbolic names for the data items, i.e. for elements of the database domain D. Actually, we will use in the formal language of RC the same names that we use in the metalanguage for the elements of D.
• Symbolic names for the relations (predicates) in the schema.
• It may be useful to introduce unary symbolic predicates for (the domains of) the attributes A. Alternatively, we could have unary predicates to refer to the subdomains D(A) of D (cf. chapter 3 of these notes).

Example: Consider the relational schema S with

Domain D = {john, peter, mary, ken, carol, steve, ..., 0, 1, 2, 3, 4, ...}
Binary relations People(·, ·), Manager(·, ·)
Attributes for People, in this order: Name, Age
Attributes for Manager, in this order: Boss, Subordinate

This schema S has an associated language L(S) of predicate logic, based on the following symbols:

• Names for domain individuals, to denote them: john, peter, mary, ken, carol, steve, ...
• Predicate symbols: People(·, ·), Manager(·, ·)
• Logical symbols: ¬, ∧, ∨, →, ↔, ∀, ∃
• An infinite but countable, official set of variables: x1, x2, x3, ... (sometimes we will use other variables)
• Possibly a set of logical predicates (aka evaluable or built-in predicates): =, ≠, <, ... They have a fixed interpretation (extension) given by the logic and depending on the underlying domain (cf. chapter 3 of these notes), as opposed to the predicate symbols in the second item of this list, which can have different interpretations (extensions).
• Symbols for subdomain (attribute) predicates: Name(·), Age(·), Boss(·), Subordinate(·). These predicates for the domains of the attributes (or subdomains of the domain) also have fixed interpretations (extensions), in the sense that they depend on the domain, and not on the relations of the database. The extension for Name is {john, peter, mary, ken, carol, steve, ...}, the same for Boss and Subordinate, but the extension of Age is {0, 1, 2, 3, ...}.

Using these symbols it is possible to build formulas of the language L(S), e.g.

1. People(john, 35), People(john, mary), People(x3, 20), People(x3, x5), Age(35), Age(x10), john = mary, x2 = ken, 35 < 12, ken ≠ john, ...
These are all atomic formulas: a predicate applied to names and/or variables.

2. More complex, non-atomic formulas:
a) People(john, 32) ∧ ¬People(mary, 23)
b) People(peter, x) → Age(x)
c) ∀x∀y∀z(Manager(x, y) ∧ Manager(z, y) → x = z)
d) ∃x∀y(y ≠ x → Manager(x, y))
e) Manager(peter, x) ∧ ∃y(People(x, y) ∧ y < 30)
f) ∀x∀y(People(x, y) → Name(x) ∧ Age(y))

Some of these formulas do not have variables outside the scope of a quantifier (∃, ∀). They are called sentences, e.g. People(john, 35), 2.(a), and 2.(c).

Notice that atomic sentences correspond to data in the database, i.e. tuples in tables.

Sentence 2.(c) could be an integrity constraint; if it is imposed as an IC with the schema, it is expected to be true in the legal instances of the schema.

Sentence 2.(f) could be seen as a condition on the schema: the arguments for People are elements of Name and Age, resp.

Formula 2.(e) could be seen as a query: “Give me the values for x such that the condition on it becomes true” (in the instance of the database at hand).

The notion of “being true” seems to be crucial here! Formulas and sentences are purely symbolic objects, but they become true (or false) when they are interpreted. In predicate logic, formulas are interpreted in structures, and database instances can be seen as structures. So, formulas can be interpreted in database instances, and they become true or false in them. The semantics of the RC languages is inherited from the semantics for languages of predicate logic:

A database instance D for a relational schema S = {R, S, ...} can be seen as a finite structure D = ⟨D, R^D, S^D, ...⟩, in the following sense:

• It has a domain, namely D (possibly infinite).
  Example: D = {john, peter, ..., 0, 1, ...}
• The names for individuals in L(S) are interpreted by themselves.
  Example: peter of the object language is interpreted as the element peter of the domain.
• Every non-logical, domain dependent predicate has a finite extension (i.e. a finite number of tuples in the table).
  Example: People^D = {(john, 34), (peter, 37), ..., (mary, 25)}
  Manager^D = {(john, peter), (peter, mary), ..., (john, ken)}
• Built-in predicates have fixed extensions, possibly infinite.
  Example:
  • {(john, john), (peter, peter), ...} are the tuples in the interpretation of =
  • {(john, peter), (peter, mary), ...} are the tuples in the interpretation of ≠
  • < has the extension {(0, 1), (0, 2), ..., (1, 2), (1, 3), ...}

We can see that, given the schema S and a fixed domain D, the different instances will differ only in the extensions of the non-logical predicates, i.e. People, Manager.

Now we can apply the classical definition of truth of symbolic formulas in structures (Alfred Tarski, early 1930s). It is a recursive (inductive) definition that can be made precise and general, but we illustrate it with examples.

Given a schema S, a formula ϕ ∈ L(S), and a database instance D compatible with S, we want to define when ϕ is true in D.

Notice that ϕ could have free variables; in that case we need to indicate the values in the domain that we assign to the free variables.

Example: These are two instances compatible with the schema:

D1:

People
Name   Age
john   35
mary   25
ken    40

Manager
Boss   Subordinate
ken    john
john   mary

D2:

People
Name   Age
mary   35
mary   25
peter  40

Manager
Boss   Subordinate
ken    steve
carol  steve
john   mary

1. People(john, 35) is true in D1, but false in D2:

D1 ⊨ People(john, 35), because (john, 35) ∈ People^D1, i.e. the tuple (john, 35) belongs to the extension of the predicate in the DB.

But D2 ⊭ People(john, 35) (the tuple does not belong to the extension). Actually, it is considered to be false by applying the Closed World Assumption (CWA) on databases: the only true atomic knowledge is the one explicitly contained in the tables.

2. D1 ⊨ john = john, because (john, john) is in the extension for =

D1 ⊨ john ≠ mary, because (john, mary) is in the extension for ≠

D1 ⊨ 5 < 40, because (5, 40) is in the extension for <

D1 ⊭ john ≠ john, because (john, john) is not in the extension for ≠

3. People(mary, x) is true in D2 when x takes the value 25:

D2 ⊨ People(mary, x)[25]

But D2 ⊭ People(mary, x)[10]

4. Actually, for the values 35 or 25 for x, the formula (query?) People(mary, x) becomes true in D2:

D2 ⊨ People(mary, x)[25] and D2 ⊨ People(mary, x)[35]

5. D2 ⊨ ¬People(john, 35) by definition, because D2 ⊭ People(john, 35) (cf. 1. and the use of the CWA)

6. D1 ⊨ (People(john, 35) ∧ Manager(ken, john)) by definition, because

D1 ⊨ People(john, 35) and D1 ⊨ Manager(ken, john)

7. D2 ⊨ (People(peter, 10) ∨ Manager(ken, steve)) by definition, because

D2 ⊨ People(peter, 10) or D2 ⊨ Manager(ken, steve) (here it is the second disjunct that holds)

8. D2 ⊨ ∃x Manager(x, steve) by definition, because there exists a value in the domain for x, namely x = ken, such that D2 ⊨ Manager(x, steve)[ken]

9. D2 ⊨ ∀y(People(mary, y) → y = 35 ∨ y = 25) by definition, because for all the values a ∈ D, i.e. in the domain, it holds that

D2 ⊨ (People(mary, y) → y = 35 ∨ y = 25)[a]   (a is the value for y)

E.g. D2 ⊨ (People(mary, y) → y = 35 ∨ y = 25)[10] (the antecedent is false)

E.g. D2 ⊨ (People(mary, y) → y = 35 ∨ y = 25)[25] (antecedent and consequent are both true)
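The recursive definition illustrated by these examples can be sketched as a small evaluator. This is only an illustration under stated assumptions: formulas are encoded as nested Python tuples, the class `Var` and the function `holds` are hypothetical names, and the possibly infinite domain D is replaced by a finite sample (a few names and the numbers 0 to 100) so that quantifiers can be checked by enumeration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Var:
    """A variable of the object language (hypothetical helper class)."""
    name: str

def value(term, env):
    """Variables are looked up in the assignment; names denote themselves."""
    return env[term] if isinstance(term, Var) else term

def holds(db, domain, formula, env=None):
    """Compositional truth of `formula` in instance `db` under assignment `env`."""
    env = env or {}
    op = formula[0]
    if op == "atom":
        _, pred, args = formula
        vals = tuple(value(t, env) for t in args)
        if pred == "=":                      # built-in, fixed extension
            return vals[0] == vals[1]
        if pred == "<":                      # built-in, defined on numbers only
            a, b = vals
            return isinstance(a, int) and isinstance(b, int) and a < b
        return vals in db[pred]              # closed world assumption
    if op == "not":
        return not holds(db, domain, formula[1], env)
    if op == "and":
        return holds(db, domain, formula[1], env) and holds(db, domain, formula[2], env)
    if op == "or":
        return holds(db, domain, formula[1], env) or holds(db, domain, formula[2], env)
    if op == "implies":
        return (not holds(db, domain, formula[1], env)) or holds(db, domain, formula[2], env)
    if op in ("exists", "forall"):
        _, var, body = formula
        results = (holds(db, domain, body, {**env, var: a}) for a in domain)
        return any(results) if op == "exists" else all(results)
    raise ValueError(f"unknown connective: {op}")

D1 = {"People": {("john", 35), ("mary", 25), ("ken", 40)},
      "Manager": {("ken", "john"), ("john", "mary")}}
D2 = {"People": {("mary", 35), ("mary", 25), ("peter", 40)},
      "Manager": {("ken", "steve"), ("carol", "steve"), ("john", "mary")}}
# Assumption: a finite sample standing in for the domain D.
DOMAIN = {"john", "peter", "mary", "ken", "carol", "steve"} | set(range(0, 101))

x, y = Var("x"), Var("y")
item1 = ("atom", "People", ("john", 35))
item8 = ("exists", x, ("atom", "Manager", (x, "steve")))
item9 = ("forall", y, ("implies", ("atom", "People", ("mary", y)),
                       ("or", ("atom", "=", (y, 35)), ("atom", "=", (y, 25)))))

print(holds(D1, DOMAIN, item1), holds(D2, DOMAIN, item1))            # True False
print(holds(D2, DOMAIN, ("atom", "People", ("mary", x)), {x: 25}))  # True
print(holds(D2, DOMAIN, item8), holds(D2, DOMAIN, item9))            # True True
```

Each case of `holds` mirrors one clause of the definition: atoms are looked up in the tables (or in the fixed extension of a built-in), connectives combine the truth values of subformulas, and quantifiers range over the domain.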

With this recursive definition it is possible to evaluate the truth of any syntactically well-formed sentence or formula (provided we give values to the free variables) in a database instance.

Since the formulas are symbolic, with a precise syntax, and the extensions of the relevant relations are finite, all the “reasonable” formulas (cf. below) can be evaluated by a computational system, like an RDBMS.

Notice that the evaluation of the truth of a formula is compositional: the truth of a formula is based on the truth (or not) of its subformulas, which makes evaluation easier and clearer.

A sentence (written in the logical language associated to the schema) can be algorithmically determined as true or false in the DB.

Exercise: Say in English what is expressed by the symbolic sentence

∀x∃y∀z(Manager(x, y) ∧ People(y, z) ∧ z > 30 → ∃w(People(w, 25) ∧ ¬Manager(x, w)))

Determine if this sentence is true in the instances D1, D2 by using the inductive (recursive) definition of truth in a structure (database instance):

? D1 ⊨ ∀x∃y∀z(Manager(x, y) ∧ People(y, z) ∧ z > 30 → ∃w(People(w, 25) ∧ ¬Manager(x, w)))

? D2 ⊨ ∀x∃y∀z(Manager(x, y) ∧ People(y, z) ∧ z > 30 → ∃w(People(w, 25) ∧ ¬Manager(x, w)))

The sentence has to be true or false in each of D1, D2.

For an instance D for a schema S and a formula of L(S) with free variables, it is possible to algorithmically determine if there are values in the database domain for those variables, so that the formula becomes true in D.

If those values exist, they can also be determined algorithmically as a part of the same evaluation process.

So, we can use the RC as a query language. This language (or family of relational languages) has a precise, clear, and well-studied semantics.

Example: Pose and answer the following queries to the database instances above.

1. Return the managers who have subordinates that are younger than 27:

∃y∃z(Manager(x, y) ∧ People(y, z) ∧ z < 27)

The answers are collected as the values for the only free variable, namely x.

? D1 ⊨ ∃y∃z(Manager(x, y) ∧ People(y, z) ∧ z < 27)

Answer: x = john, because

D1 ⊨ ∃y∃z(Manager(x, y) ∧ People(y, z) ∧ z < 27)[john]

Notice how the join of the tables is captured through the shared variable y and the conjunction. Projection is captured by existential quantifiers; selections by conjuncts expressed in terms of built-in predicates.
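To make the correspondence with join, selection, and projection concrete, query 1 can be sketched as a Python set comprehension over D1. The function name and the encoding of tables as sets of tuples are assumptions for illustration.

```python
# Instance D1 from the example above, as sets of tuples.
people_d1 = {("john", 35), ("mary", 25), ("ken", 40)}
manager_d1 = {("ken", "john"), ("john", "mary")}

def managers_of_young(people, manager, limit=27):
    """Values of x satisfying  ∃y∃z(Manager(x, y) ∧ People(y, z) ∧ z < limit)."""
    return {
        x                              # only x is kept: projection (∃y∃z)
        for (x, y) in manager          # Manager(x, y)
        for (y2, z) in people          # People(y, z)
        if y == y2 and z < limit       # shared variable y (join), built-in < (selection)
    }

print(managers_of_young(people_d1, manager_d1))  # {'john'}
```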

2. Return names and ages of the employees who are not a boss (of any of the people):

People(x, y) ∧ ∀z∀w(People(z, w) → ¬Manager(x, z))

Answers in D1: {(mary, 25)}

Answers in D2: {(mary, 25), (mary, 35), (peter, 40)}

Alternatively (but this is a different query, with a different meaning):

People(x, y) ∧ ¬∃z Manager(x, z)

Exercise: Show that the two queries above have different meanings by providing an instance of S where the answers are different.
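The two formulations of query 2 can be sketched side by side in Python (function names are hypothetical). On D1 and D2 they return the same answers, so these two instances do not settle the exercise; a distinguishing instance still has to be constructed.

```python
people_d1 = {("john", 35), ("mary", 25), ("ken", 40)}
manager_d1 = {("ken", "john"), ("john", "mary")}
people_d2 = {("mary", 35), ("mary", 25), ("peter", 40)}
manager_d2 = {("ken", "steve"), ("carol", "steve"), ("john", "mary")}

def not_boss_of_listed_people(people, manager):
    """People(x, y) ∧ ∀z∀w(People(z, w) → ¬Manager(x, z))."""
    return {(x, y) for (x, y) in people
            if all((x, z) not in manager for (z, _w) in people)}

def not_boss_at_all(people, manager):
    """People(x, y) ∧ ¬∃z Manager(x, z)."""
    bosses = {boss for (boss, _sub) in manager}
    return {(x, y) for (x, y) in people if x not in bosses}

print(sorted(not_boss_of_listed_people(people_d1, manager_d1)))
# [('mary', 25)]
print(sorted(not_boss_of_listed_people(people_d2, manager_d2)))
# [('mary', 25), ('mary', 35), ('peter', 40)]
print(not_boss_at_all(people_d1, manager_d1) == not_boss_of_listed_people(people_d1, manager_d1))  # True
print(not_boss_at_all(people_d2, manager_d2) == not_boss_of_listed_people(people_d2, manager_d2))  # True
```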

RA vs. RC

In RC we can express all the operations of RA, i.e. we can define by means of logical formulas the relations that result from applying the RA operations. We introduce in the logical language a new predicate Ans to collect the result of the operation, and next we define it by a sentence.

Selection: σ_ϕ(R(A1, ..., An))
Here R is a relation and ϕ is a condition on the values of the attributes Ai.

∀x1 ··· ∀xn(Ans(x1, ..., xn) ↔ R(x1, ..., xn) ∧ ϕ)

E.g. σ_{A=a}(R(A, B)) can be defined by
∀x∀y(Ans(x, y) ↔ R(x, y) ∧ x = a)

Intersection: R(A, B) ∩ S(A, B)
∀x∀y(Ans(x, y) ↔ R(x, y) ∧ S(x, y))
Instead of using the answer predicate, we can use the formula R(x, y) ∧ S(x, y), with free variables x, y, wherever needed as a (sub)formula to capture the intersection.

Union: R(A, B) ∪ S(A, B)
∀x∀y(Ans(x, y) ↔ R(x, y) ∨ S(x, y))

Projection: Π_A(R(A, B))
∀x(Ans(x) ↔ ∃y R(x, y))

Join: R(A, B) ⋈_{B=C} S(C, D)
∀x∀y∀z(Ans(x, y, z) ↔ R(x, y) ∧ S(y, z))

Cartesian Product: R(A, B) × S(C, D)
∀x∀y∀z∀w(Ans(x, y, z, w) ↔ R(x, y) ∧ S(z, w))

Difference: R(A, B) ∖ S(A, B)
∀x∀y(Ans(x, y) ↔ R(x, y) ∧ ¬S(x, y))
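Each of these defining formulas reads almost literally as a set comprehension. The following sketch uses small made-up relations R, S, and T (T plays the role of S(C, D) in the join and product); the data values are hypothetical and serve only to show the shape of each definition.

```python
# Hypothetical sample relations, as sets of tuples.
R = {(1, "a"), (2, "b"), (3, "c")}          # R(A, B)
S = {(2, "b"), (4, "d")}                    # S(A, B)
T = {("a", 10), ("b", 20)}                  # second relation, with attributes (C, D)

# Selection with A = 1:   Ans(x, y) ↔ R(x, y) ∧ x = 1
selection = {(x, y) for (x, y) in R if x == 1}
# Intersection:           Ans(x, y) ↔ R(x, y) ∧ S(x, y)
intersection = {(x, y) for (x, y) in R if (x, y) in S}
# Union:                  Ans(x, y) ↔ R(x, y) ∨ S(x, y)
union = R | S
# Projection on A:        Ans(x) ↔ ∃y R(x, y)
projection = {x for (x, y) in R}
# Join on B = C:          Ans(x, y, z) ↔ R(x, y) ∧ T(y, z)
join = {(x, y, z) for (x, y) in R for (c, z) in T if y == c}
# Cartesian product:      Ans(x, y, z, w) ↔ R(x, y) ∧ T(z, w)
product = {(x, y, z, w) for (x, y) in R for (z, w) in T}
# Difference:             Ans(x, y) ↔ R(x, y) ∧ ¬S(x, y)
difference = {(x, y) for (x, y) in R if (x, y) not in S}

print(selection)      # {(1, 'a')}
print(intersection)   # {(2, 'b')}
print(len(product))   # 6
```

In every case the candidate answers are drawn from tuples already stored in the tables, so no value outside the relations ever has to be considered.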

We can see that all the RA can be expressed in the RC. Thus complex RA expressions (RA queries) can be translated into declarative RC formulas (queries). Actually, with the syntax and semantics of RC we could go beyond ...

Example: Give me those who are not bosses

¬∃y Manager(x, y)   (*)

Who should be the answers in, say, D1? mary? susan? (with susan ∈ D)

As a formula, the query is O.K., and its semantics as a logical formula is also O.K. But as a DB query? Notice that the “corresponding RA query” would be the complement

(Π_Boss(Manager))^c

We do not have complement in RA. We do have it in RC (or logic), but in DB we do not want to use it. We restrict ourselves to the so-called domain independent or safe queries (the “reasonable” queries mentioned before).

Those are the queries that can be evaluated without appealing to the whole, possibly infinite, underlying DB domain D.

Domain independent queries can be evaluated by concentrating on the active domain of the DB: the subset of D that contains the data items that appear in some of the finite DB tables.

activeDom(D1) = {john, 35, mary, 25, ken, 40}

The query 2. above (as formulated) is safe, because the non-bosses are found among the people, and the latter appear in a table.

The RA difference is always safe: R(x, y) ∧ ¬S(x, y), because the answers are all among the rows in table R.

The query (*) is not domain independent.
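The difference between the unsafe query (*) and a safe one can be seen by evaluating both over two different domains. In this sketch the function names are hypothetical, and the two "domains" are the active domain of D1 and the same set extended with the extra individual susan.

```python
D1 = {"People": {("john", 35), ("mary", 25), ("ken", 40)},
      "Manager": {("ken", "john"), ("john", "mary")}}

def active_domain(db):
    """All data items that appear in some table of the instance."""
    return {v for rows in db.values() for row in rows for v in row}

def non_bosses_unsafe(db, domain):
    """Query (*): ¬∃y Manager(x, y), with x ranging over the given domain."""
    bosses = {boss for (boss, _sub) in db["Manager"]}
    return {x for x in domain if x not in bosses}

def non_bosses_safe(db, domain):
    """∃y People(x, y) ∧ ¬∃z Manager(x, z): candidates come from table People."""
    bosses = {boss for (boss, _sub) in db["Manager"]}
    return {x for (x, _age) in db["People"] if x in domain and x not in bosses}

adom = active_domain(D1)
bigger = adom | {"susan"}          # same instance, larger underlying domain

print(adom == {"john", 35, "mary", 25, "ken", 40})                             # True
print(non_bosses_unsafe(D1, adom) == non_bosses_unsafe(D1, bigger))            # False
print(non_bosses_safe(D1, adom) == non_bosses_safe(D1, bigger))                # True
```

The answer to (*) changes when the domain grows (susan, and even the numbers 35, 25, 40, qualify as "non-bosses"), while the safe formulation returns only mary in both cases.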

Finally, it can be proved that it is not possible to define the transitive closure of a relation using the RC language. This is expected given the correspondence between the RA and the RC. Actually, this impossibility result is usually proved in the context of the RC, not directly for RA.

So the RC has limited expressive power for some natural applications. Are there any reasonable, well-behaved, and more expressive extensions of the RC?

Other Uses of the RC Languages

The RC languages can be used for many purposes, not only for query formulation, with the same advantages of having a language that is suitable for computational processing, has clear syntax and semantics, and is highly expressive for DB purposes.

1. Metadata: We can express, in an RC language, conditions on the structure of the data.

Example: ∀x∀y(People(x, y) → Name(x) ∧ Age(y))

This is saying that the values in attributes have to be taken from the right subdomain.

This opens the possibility of expressing more complex conditions on the data types for the different attributes:

∀x∀y(People(x, y) → CharString(x) ∧ Integer(y)),

where CharString(·), Integer(·) are recognized by the system (built-in types).

Or even more complex:

∀x∀y(R(x, y) → Type1(x) ∧ Type2(y)),

where Type1(·), Type2(·) are defined by means of additional logical formulas.

Conditions like these can be checked by the system.

Metadata is crucial in many applications of databases today, because data is integrated from different databases. The integration is usually virtual: the data stay at their sources. Think of data sources integrated through the WWW.

The metadata of each source provides (some) information about what is found in a data source and how, because it is data about data ...

2. Integrity Constraints: ICs can be expressed as sentences of an RC language.

D:

Employee
Name   Position     Salary   Age
john   clerk        40K      35
mary   CEO          100K     45
ken    accountant   60K      40

∀x∀y∀z∀w(Employee(x, y, z, w) ∧ y = CEO → z > 90K)

This range constraint is satisfied by the DB instance D:

D ⊨ ∀x∀y∀z∀w(Employee(x, y, z, w) ∧ y = CEO → z > 90K)
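Checking this range constraint amounts to testing the implication on every tuple of Employee. A minimal sketch, assuming salaries are stored as plain integers (40K as 40_000) and using a hypothetical function name:

```python
# Instance D from above; salaries written out as integers (an encoding assumption).
employee = {
    ("john", "clerk", 40_000, 35),
    ("mary", "CEO", 100_000, 45),
    ("ken", "accountant", 60_000, 40),
}

def ceo_salary_ic(rows, minimum=90_000):
    """∀x∀y∀z∀w(Employee(x, y, z, w) ∧ y = CEO → z > minimum)."""
    return all(z > minimum for (_x, y, z, _w) in rows if y == "CEO")

print(ceo_salary_ic(employee))  # True
```

Only tuples that satisfy the antecedent (position CEO) need to be inspected; all the others make the implication true trivially.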

ICs are also metadata. They embody knowledge (about the data) that can be used. For example, it can be publicized as, or provided as, a semantic layer for a data source, in this way conveying “meaning” (semantics) about/of the data source.

For example, if different data sources about salaries of employees are virtually integrated and we want to find those CEOs who make less than 80K, we do not have to search inside a data source that is exposed to the outside world as satisfying the range constraint above.

We can also express functional dependencies, e.g. Employee : Name → Age

∀x∀y1∀y2∀z1∀z2∀w1∀w2(Employee(x, y1, z1, w1) ∧ Employee(x, y2, z2, w2) → w1 = w2)

It is satisfied by D.

It is easy to express that Name is a key of the relation: write down an axiom like this one for each of the attributes other than those in the key.
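A functional dependency can be checked by making sure that each left-hand-side value is paired with a single right-hand-side value. The sketch below uses a hypothetical helper `satisfies_fd` with 0-based attribute positions, and expresses "Name is a key" as one dependency per non-key attribute, as suggested above.

```python
employee = {
    ("john", "clerk", 40_000, 35),
    ("mary", "CEO", 100_000, 45),
    ("ken", "accountant", 60_000, 40),
}

def satisfies_fd(rows, lhs, rhs):
    """True if any two rows that agree on positions `lhs` also agree on `rhs`."""
    seen = {}
    for row in rows:
        left = tuple(row[i] for i in lhs)
        right = tuple(row[i] for i in rhs)
        # setdefault stores the first right-hand side seen for this left-hand side
        if seen.setdefault(left, right) != right:
            return False
    return True

# Employee : Name → Age  (Name is position 0, Age is position 3)
print(satisfies_fd(employee, [0], [3]))  # True

# Name is a key: one dependency Name → A for each attribute A outside the key.
name_is_key = all(satisfies_fd(employee, [0], [i]) for i in (1, 2, 3))
print(name_is_key)  # True
```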

UnivTeams
Number    Team
9910803   Basketball
......

Students
Number    Name          Study
9901254   John Stanley  CS
9910803   Sue Jones     Math

Referential IC: UnivTeams.Number ⊆ Students.Number

∀x∀y∃z∃w(UnivTeams(x, y) → Students(x, z, w))

The corresponding foreign key constraint can be expressed by the conjunction of this sentence plus the sentence that says that Number is a key of Students (do it!).

We can see that by using RC languages to express ICs:

• It is clear what it means for a DB to satisfy an IC.
• We can express complex ICs; actually, we can go much beyond what commercial RDBMSs support.
• IC checking becomes machine processable.
• Since ICs are syntactic objects, they can in principle be stored in the DB and used as extra knowledge about the domain. This knowledge can be used for other purposes, e.g. query optimization, more precisely semantic query optimization.

Example: Return students (numbers) that participate in a team and are registered students

∃y∃z∃v(UnivTeams(x, y) ∧ Students(x, z, v))

If the system knows that the RIC is satisfied, there is no need to go to table Students: every Number in UnivTeams already appears in Students, so the query returns the same answers as ∃y UnivTeams(x, y).
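The semantic query optimization in this example can be sketched as follows. Both function names are hypothetical; the second one is only a valid replacement for the first on instances that satisfy the referential IC, which the sketch checks explicitly.

```python
univ_teams = {(9910803, "Basketball")}
students = {(9901254, "John Stanley", "CS"), (9910803, "Sue Jones", "Math")}

def team_students_full(univ_teams, students):
    """∃y∃z∃v(UnivTeams(x, y) ∧ Students(x, z, v)): needs both tables."""
    student_numbers = {number for (number, _name, _study) in students}
    return {x for (x, _team) in univ_teams if x in student_numbers}

def team_students_optimized(univ_teams):
    """∃y UnivTeams(x, y): equivalent to the query above when the RIC holds."""
    return {x for (x, _team) in univ_teams}

def ric_holds(univ_teams, students):
    """UnivTeams.Number ⊆ Students.Number."""
    return ({x for (x, _team) in univ_teams}
            <= {number for (number, _name, _study) in students})

print(ric_holds(univ_teams, students))  # True
print(team_students_full(univ_teams, students) == team_students_optimized(univ_teams))  # True
```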

ICs should be checkable in the active domain of the database, so ICs are expected to be domain independent sentences.

Exercise: Check that all the ICs we have encountered so far are domain independent.

There is a syntactic characterization of the safe formulas, so in DB applications one restricts the RC to its safe portion. (To be precise, the class of safe formulas is a proper subset of the class of domain independent formulas, but safeness is good enough for applications.)

Example: ∀x∃y Student(x, y) is an RC sentence, but it is not domain independent.

[Figure: tables R and S, and a virtual table V]

3. View Definitions: Views are (usually) virtual tables that “contain” data that come from other (usually material) base tables, those in the original schema.

Views can be defined using the RC: first introducing a name for the view, i.e. a new predicate, and then an RC formula that defines it (as we did with the Ans predicate before).

D:

People
Name   Age
john   35
mary   25
ken    40

Manager
Boss   Subordinate
ken    john
john   mary

The view that shows (“contains”) the bosses:

∀x(Bosses(x) ↔ ∃y Manager(x, y))

The “extension” of the view on D is Bosses(D) = {ken, john}

The view containing “top bosses”:

∀x(TopBoss(x) ↔ ∃y Manager(x, y) ∧ ¬∃z Manager(z, x))

TopBoss(D) = {ken}

Bosses with their ages:

∀x∀z(BossAge(x, z) ↔ ∃y Manager(x, y) ∧ People(x, z))

BossAge(D) = {(ken, 40), (john, 35)}

Notice that views are defined by a query, the one on the RHS. So, a view is any relation (usually virtual) that is defined in terms of already existing relations (usually material tables) by using a suitable query language.

We can see that:

• The semantics of a view is clear.
• We have a rich, expressive language for defining views.
• It is easy to compute their extensions if wanted.
• Views are useful to represent different perspectives and/or uses of the data in the DB.
• They allow us to combine data into new relations.
• They can also be used for security purposes: certain users may have access to certain views of the DB only.
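Since a view is defined by a query, its extension can be computed on demand from the base tables. A minimal sketch of the three views above, with hypothetical function names and the instance D encoded as sets of tuples:

```python
people = {("john", 35), ("mary", 25), ("ken", 40)}
manager = {("ken", "john"), ("john", "mary")}

def bosses(manager):
    """Bosses(x) ↔ ∃y Manager(x, y)."""
    return {x for (x, _y) in manager}

def top_boss(manager):
    """TopBoss(x) ↔ ∃y Manager(x, y) ∧ ¬∃z Manager(z, x)."""
    subordinates = {y for (_x, y) in manager}
    return {x for x in bosses(manager) if x not in subordinates}

def boss_age(people, manager):
    """BossAge(x, z) ↔ ∃y Manager(x, y) ∧ People(x, z)."""
    return {(x, z) for (x, z) in people if x in bosses(manager)}

print(sorted(bosses(manager)))            # ['john', 'ken']
print(top_boss(manager))                  # {'ken'}
print(sorted(boss_age(people, manager)))  # [('john', 35), ('ken', 40)]
```

The views are functions of the base tables rather than stored data, which is what "virtual" means here: recomputing them after an update to `people` or `manager` always gives the current extension.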

Problem: How to speed up updating of the view when base relations change?

Views are very important today. The emphasis is on integration of data sources. Think again of data sources used/accessed through the WWW: actually virtual integration (data is not collected into a single and huge physical repository).

View definitions provide a way to define correspondences, mappings, semantic bridges between separate and autonomous data sources