SIMCA

1. Introduction

1.0 Database Management Systems

1. A database management system (DBMS), or simply a database system (DBS), consists of
→ A collection of interrelated and persistent data (usually referred to as the database (DB)).
→ A set of application programs used to access, update and manage that data (which form the data management system (MS)).

2. The goal of a DBMS is to provide an environment that is both convenient and efficient to use in
→ Retrieving information from the database.
→ Storing information into the database.

3. Databases are usually designed to manage large bodies of information. This involves
→ Definition of structures for information storage (data modeling).
→ Provision of mechanisms for the manipulation of information (file and systems structure, query processing).
→ Providing for the safety of information in the database (crash recovery and security).
→ Concurrency control if the system is shared by users.

1.1 Purpose of Database Systems

1. To see why database management systems are necessary, let's look at a typical "file-processing system" supported by a conventional operating system.

E.g. The application is a savings bank:
→ Savings account and customer records are kept in permanent system files.
→ Application programs are written to manipulate files to perform the following tasks:
● Debit or credit an account.
● Add a new account.
● Find an account balance.
● Generate monthly statements.

2. Development of the system proceeds as follows:

→ New application programs must be written as the need arises.
→ New permanent files are created as required.
→ But over a long period of time files may be in different formats, and
→ Application programs may be in different languages.

3. So we can see there are problems with the straight file-processing approach:

● Data redundancy and inconsistency
→ Same information may be duplicated in several places.
→ All copies may not be updated properly.

● Difficulty in accessing data
→ May have to write a new application program to satisfy an unusual request. E.g. find all customers with the same postal code.
→ Could generate this data manually, but that would be a long job.

● Data isolation
→ Data in different files.
→ Data in different formats.
→ Difficult to write new application programs.

● Multiple users
→ Want concurrency for faster response time.
→ Need protection for concurrent updates. E.g. two customers withdraw Rs. 100 and Rs. 50 from the same account at the same time, and the account has Rs. 500 in it. Without protection the result could be Rs. 350 (the correct value), Rs. 400 or Rs. 450.

● Security problems
→ Every user of the system should be able to access only the data they are permitted to see. E.g. payroll people only handle employee records, and cannot see customer accounts; tellers only access account data and cannot see payroll data.
→ Difficult to enforce this with application programs.

● Integrity problems
→ Data may be required to satisfy constraints. E.g. no account balance below $25.00.
→ Again, difficult to enforce or to change constraints with the file-processing approach.

These problems and others led to the development of database management systems.
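
The concurrent-withdrawal example under "Multiple users" can be replayed as a small simulation. This is only a sketch: the function name run, the customer labels A and B, and the read/write step encoding are assumptions made for illustration. Each customer reads the balance into a private copy and later writes back the copy minus the withdrawal, with no locking.

```python
def run(schedule, start=500):
    """Replay (customer, action) steps against one shared balance, no locking."""
    amounts = {"A": 100, "B": 50}   # the two withdrawals from the example
    balance = start
    private = {}                    # each customer's privately read copy
    for customer, action in schedule:
        if action == "read":
            private[customer] = balance
        else:                       # "write": store stale copy minus withdrawal
            balance = private[customer] - amounts[customer]
    return balance

# One withdrawal finishes completely before the other starts.
serial = [("A", "read"), ("A", "write"), ("B", "read"), ("B", "write")]
# Both read Rs. 500 first; whoever writes last wipes out the other update.
b_writes_last = [("A", "read"), ("B", "read"), ("A", "write"), ("B", "write")]
a_writes_last = [("A", "read"), ("B", "read"), ("B", "write"), ("A", "write")]

print(run(serial))         # 350
print(run(b_writes_last))  # 450
print(run(a_writes_last))  # 400
```

Only the serial schedule gives Rs. 350. In the other two, one withdrawal is silently lost, which is why a shared system needs concurrency control.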

1.2. DBMS functions.

● Data definition.

This includes describing:
⌐ FILES
⌐ RECORD STRUCTURES
⌐ FIELD NAMES, TYPES and SIZES
⌐ RELATIONSHIPS between records of different types
⌐ Extra information to make searching efficient, e.g. INDEXES.

● Data entry and validation.

Validation may include:
⌐ TYPE CHECKING
⌐ RANGE CHECKING
⌐ CONSISTENCY CHECKING
In an interactive data entry system, errors should be detected immediately - some can be prevented altogether by keyboard monitoring - and recovery and re-entry permitted.

● Updating.

Updating involves:
⌐ Record INSERTION
⌐ Record MODIFICATION
⌐ Record DELETION.
At the same time any background data such as indexes or pointers from one record to another must be changed to maintain consistency. Updating may take place interactively, or by submission of a file of transaction records; handling these may require a program of some kind to be written, either in a conventional programming language (a host language, e.g. COBOL or C) or in a language supplied by the DBMS for constructing command files.

● Data retrieval on the basis of selection criteria.

For this purpose most systems provide a QUERY LANGUAGE with which the characteristics of the required records may be specified. Query languages differ enormously in power and sophistication but a standard which is becoming increasingly common is based on the so-called RELATIONAL operations. These allow:
⌐ selection of records on the basis of particular field values.
⌐ selection of particular fields from records to be displayed.
⌐ linking together records from two different files on the basis of matching field values.
Arbitrary combinations of these operators on the files making up a database can answer a very large number of queries without requiring users to resort to one-record-at-a-time processing.

● Report definition.

Most systems provide facilities for describing how summary reports from the database are to be created and laid out on paper. These may include obtaining:
⌐ COUNTS
⌐ TOTALS
⌐ AVERAGES
⌐ MAXIMUM and MINIMUM values
over particular CONTROL FIELDS. They also cover the specification of PAGE and LINE LAYOUT, HEADINGS, PAGE-NUMBERING, and other narrative to make the report comprehensible.

● Security.

This has several aspects:
⌐ Ensuring that only those authorized to do so can see and modify the data, generally by some extension of the password principle.
⌐ Ensuring the consistency of the database where many users are accessing and updating it simultaneously.
⌐ Ensuring the existence and INTEGRITY of the database after hardware or software failure. At the very least this involves making provision for back-up and re-loading.
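
The three relational operations listed under data retrieval, and the summary values listed under report definition, can be tried out with Python's built-in sqlite3 module. This is a minimal sketch: the tables customer and account, their columns and the sample rows are assumptions made for illustration, and SQLite stands in for whatever DBMS is actually in use.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (cname TEXT, ccity TEXT);
    CREATE TABLE account  (account_no INTEGER, cname TEXT, balance INTEGER);
    INSERT INTO customer VALUES ('Asha', 'Pune'), ('Ravi', 'Mumbai');
    INSERT INTO account  VALUES (101, 'Asha', 500), (102, 'Ravi', 1200),
                                (103, 'Asha', 300);
""")

# Selection of records on the basis of a field value.
selection = conn.execute(
    "SELECT * FROM account WHERE balance > 400 ORDER BY account_no").fetchall()

# Selection of particular fields (projection).
projection = conn.execute(
    "SELECT cname FROM customer ORDER BY cname").fetchall()

# Linking records from two tables on matching field values (join).
joined = conn.execute("""
    SELECT c.cname, c.ccity, a.account_no
    FROM customer c JOIN account a ON a.cname = c.cname
    ORDER BY a.account_no""").fetchall()

# Report-style summary: count, total and maximum per control field (cname).
summary = conn.execute("""
    SELECT cname, COUNT(*), SUM(balance), MAX(balance)
    FROM account GROUP BY cname ORDER BY cname""").fetchall()

print(selection, projection, joined, summary, sep="\n")
```

None of the four queries loops over records explicitly; the query language states which records are wanted and the system works out how to find them.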

1.3 Data Abstraction

1. The major purpose of a database system is to provide users with an abstract view of the system. The system hides certain details of how data is stored and maintained. Complexity should be hidden from database users.

2. There are several levels of abstraction:
(a) Physical Level:
→ How the data are stored. E.g. index, B-tree, hashing.
→ Lowest level of abstraction.
→ Complex low-level structures described in detail.

(b) Conceptual Level:
→ Next highest level of abstraction.
→ Describes what data are stored.
→ Describes the relationships among data.
→ Database administrator level.

(c) View Level:
→ Highest level.
→ Describes part of the database for a particular group of users.
→ Can be many different views of a database. E.g. tellers in a bank get a view of customer accounts, but not of payroll data.
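
As a rough illustration of the view level, the teller example can be written as an SQL view using Python's sqlite3 module. The table names, column names and the view name teller_view are assumptions made for this sketch. SQLite has no user accounts, so the view only shows what a teller's window onto the data would contain; it does not enforce who may query the underlying tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Conceptual level: everything the bank stores.
    CREATE TABLE account (account_no INTEGER, cname TEXT, balance INTEGER);
    CREATE TABLE payroll (employee TEXT, salary INTEGER);
    INSERT INTO account VALUES (101, 'Asha', 500);
    INSERT INTO payroll VALUES ('Meera', 40000);

    -- View level: the part of the database a teller works with.
    CREATE VIEW teller_view AS
        SELECT account_no, cname, balance FROM account;
""")

cursor = conn.execute("SELECT * FROM teller_view")
teller_columns = [d[0] for d in cursor.description]
teller_rows = cursor.fetchall()

print(teller_columns)  # no payroll columns appear in the teller's view
print(teller_rows)
```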

1.4 Instances and Schemes

1. Databases change over time.
2. The information in a database at a particular point in time is called an instance of the database.
3. The overall design of the database is called the database scheme.
4. Analogy with programming languages:
● Data type definition -- scheme
● Value of a variable -- instance
5. There are several schemes, corresponding to levels of abstraction:
● Physical scheme
● Conceptual scheme
● Subscheme (can be many)
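
The programming-language analogy can be made concrete in a few lines of Python. The class Account and the sample values are assumptions chosen for illustration: the class definition plays the role of the scheme, and each list of objects is one instance of the database at a point in time.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Account:
    """Scheme: which fields an account has and what type each one is."""
    number: int
    balance: int

# Two instances of the same scheme, observed at different points in time.
instance_monday = [Account(101, 500), Account(102, 1200)]
instance_tuesday = [Account(101, 350), Account(102, 1200), Account(103, 75)]

# The data changed between the two days; the scheme did not.
print(instance_monday != instance_tuesday)
print({type(a).__name__ for a in instance_monday + instance_tuesday})
```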

1.5 Data Independence

1. The ability to modify a scheme definition in one level without affecting a scheme definition in a higher level is called data independence.
2. There are two kinds:
● Physical data independence
The ability to modify the physical scheme without causing application programs to be rewritten
-- Modifications at this level are usually to improve performance.
● Logical data independence
The ability to modify the conceptual scheme without causing application programs to be rewritten
-- Usually done when the logical structure of the database is altered
3. Logical data independence is harder to achieve as the application programs are usually heavily dependent on the logical structure of the data. An analogy is made to abstract data types in programming languages.
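
Physical data independence can be observed on a small scale with Python's sqlite3 module. In this sketch (table, index name and data are illustrative assumptions) the query string stands in for an application program. An index is then added, which changes only how the data are reached on disk, and the unchanged query still returns the same answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (account_no INTEGER, cname TEXT, balance INTEGER)")
conn.executemany("INSERT INTO account VALUES (?, ?, ?)",
                 [(101, "Asha", 500), (102, "Ravi", 1200)])

# The "application program": it names data, not storage structures.
QUERY = "SELECT balance FROM account WHERE account_no = ?"

before = conn.execute(QUERY, (102,)).fetchall()

# A change to the physical scheme, made purely for performance.
conn.execute("CREATE INDEX idx_account_no ON account (account_no)")

after = conn.execute(QUERY, (102,)).fetchall()

print(before, after)  # same result, and the query text was not touched
```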

1.6 Data Definition Language (DDL)

1. Used to specify a database scheme as a set of definitions expressed in a DDL.
2. DDL statements are compiled, resulting in a set of tables stored in a special file called a data dictionary or data directory.
3. The data directory contains metadata (data about data).
4. The storage structure and access methods used by the database system are specified by a set of definitions in a special type of DDL called a data storage and definition language.
5. Basic idea: hide implementation details of the database schemes from the users.
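
SQLite offers a small-scale analogue of the data dictionary: every DDL statement it accepts is recorded in a catalog table named sqlite_master, which can be queried like any other table. The sketch below uses Python's sqlite3 module; the customer table and the index name are assumptions for illustration, and other systems organise their dictionaries differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# DDL statements: they define structure, they do not store customer data.
conn.execute("CREATE TABLE customer (cname TEXT, ccity TEXT)")
conn.execute("CREATE INDEX idx_ccity ON customer (ccity)")

# Metadata (data about data) kept by the system itself.
catalog = conn.execute(
    "SELECT type, name, tbl_name FROM sqlite_master ORDER BY name").fetchall()

# Column names of one table, as recorded by the system (row[1] is the name).
columns = [row[1] for row in conn.execute("PRAGMA table_info(customer)")]

print(catalog)
print(columns)
```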

1.7 Data Manipulation Language (DML)

1. Data Manipulation is:
● retrieval of information from the database
● insertion of new information into the database
● deletion of information in the database
● modification of information in the database

2. A DML is a language which enables users to access and manipulate data. The goal is to provide efficient human interaction with the system.
3. There are two types of DML:
● procedural: the user specifies what data is needed and how to get it
● nonprocedural: the user only specifies what data is needed
-- Easier for user
-- May not generate code as efficient as that produced by procedural languages
4. A query language is a portion of a DML involving information retrieval only. The terms DML and query language are often used synonymously.
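
The difference between procedural and nonprocedural data manipulation can be seen by answering the same question twice. In this sketch the function name, table and sample rows are illustrative assumptions. The first version spells out how to scan the records, while the second states only what is wanted and leaves the method to SQLite.

```python
import sqlite3

rows = [(101, "Asha", 500), (102, "Ravi", 1200), (103, "Asha", 300)]

def balances_procedural(records, name):
    """Procedural: say what is needed AND how to get it, record by record."""
    result = []
    for account_no, cname, balance in records:
        if cname == name:
            result.append(balance)
    return result

# Nonprocedural: say only what is needed.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (account_no INTEGER, cname TEXT, balance INTEGER)")
conn.executemany("INSERT INTO account VALUES (?, ?, ?)", rows)
balances_declarative = [
    balance for (balance,) in conn.execute(
        "SELECT balance FROM account WHERE cname = ? ORDER BY account_no",
        ("Asha",))
]

print(balances_procedural(rows, "Asha"))
print(balances_declarative)
```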

1.8 Database Manager

1. The database manager is a program module which provides the interface between the low-level data stored in the database and the application programs and queries submitted to the system.
2. Databases typically require lots of storage space (gigabytes). This must be stored on disks. Data is moved between disk and main memory (MM) as needed.
3. The goal of the database system is to simplify and facilitate access to data. Performance is important. Views provide simplification.
4. So the database manager module is responsible for
● Interaction with the file manager: Storing raw data on disk using the file system usually provided by a conventional operating system. The database manager must translate DML statements into low-level file system commands (for storing, retrieving and updating data in the database).
● Integrity enforcement: Checking that updates in the database do not violate consistency constraints (e.g. no bank account balance below $25).
● Security enforcement: Ensuring that users only have access to information they are permitted to see.
● Backup and recovery: Detecting failures due to power failure, disk crash, software errors, etc., and restoring the database to its state before the failure.
● Concurrency control: Preserving data consistency when there are concurrent users.
5. Some small database systems may lack some of these features, resulting in simpler database managers. (For example, no concurrency is required on a PC running MS-DOS.) These features are necessary on larger systems.
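
Integrity enforcement can be demonstrated with a declared constraint. The sketch below uses Python's sqlite3 module and an SQL CHECK clause to express the "no balance below 25" rule from the list above; the table layout and the amounts are assumptions made for illustration. The point is that the rule lives in the database, so every update is checked without any application program having to remember it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE account (
        account_no INTEGER PRIMARY KEY,
        balance    INTEGER NOT NULL CHECK (balance >= 25)  -- consistency constraint
    )""")
conn.execute("INSERT INTO account VALUES (101, 100)")
conn.commit()

try:
    # Would leave a balance of 10, which violates the constraint.
    conn.execute("UPDATE account SET balance = balance - 90 WHERE account_no = 101")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

final_balance = conn.execute(
    "SELECT balance FROM account WHERE account_no = 101").fetchone()[0]

print(rejected, final_balance)  # the update was refused and the balance is unchanged
```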

1.9 Database Administrator

1. The database administrator is a person having central control over data and programs accessing that data. Duties of the database administrator include:
● Scheme definition: the creation of the original database scheme. This involves writing a set of definitions in a DDL (data definition language), compiled by the DDL compiler into a set of tables stored in the data dictionary.
● Storage structure and access method definition: writing a set of definitions translated by the data storage and definition language compiler.
● Scheme and physical organization modification: writing a set of definitions used by the DDL compiler to generate modifications to appropriate internal system tables (e.g. data dictionary). This is done rarely, but sometimes the database scheme or physical organization must be modified.
● Granting of authorization for data access: granting different types of authorization for data access to various users.
● Integrity constraint specification: generating integrity constraints. These are consulted by the database manager module whenever updates occur.

1.10 Database Users

1. The database users fall into several categories:
● Application programmers are computer professionals interacting with the system through DML calls embedded in a program written in a host language (e.g. C, PL/1, Pascal).
⌐ These programs are called application programs.
⌐ The DML precompiler converts DML calls (prefaced by a special character like $, #, etc.) to normal procedure calls in a host language.
⌐ The host language compiler then generates the object code.
⌐ Some special types of programming languages combine Pascal-like control structures with control structures for the manipulation of a database.
⌐ These are sometimes called fourth-generation languages.
⌐ They often include features to help generate forms and display data.
● Sophisticated users interact with the system without writing programs.
⌐ They form requests by writing queries in a database query language.
⌐ These are submitted to a query processor that breaks a DML statement

1.11 Overall System Structure

1. Database systems are partitioned into modules for different functions. Some functions (e.g. file systems) may be provided by the operating system.
2. Components include:
● File manager: manages allocation of disk space and data structures used to represent information on disk.
● Database manager: the interface between low-level data and application programs and queries.
● Query processor: translates statements in a query language into low-level instructions the database manager understands. (May also attempt to find an equivalent but more efficient form.)
● DML pre-compiler: converts DML statements embedded in an application program to normal procedure calls in a host language. The pre-compiler interacts with the query processor.
● DDL compiler: converts DDL statements to a set of tables containing metadata stored in a data dictionary.

In addition, several data structures are required for physical system implementation:
→ Data files: store the database itself.
→ Data dictionary: stores information about the structure of the database. It is used heavily. Great emphasis should be placed on developing a good design and efficient implementation of the dictionary.
→ Indices: provide fast access to data items holding particular values.

Figure shows these components.

2. Modeling Techniques

CONCEPTUAL DATA MODELLING is the first stage in the process of TOP-DOWN DATABASE DESIGN. The aim is to describe the information used by an organization in a way which is not governed by implementation-level issues and details.

A common method of analysis involves identifying:

1. The ENTITIES (persons, places, things etc.) which the organisation has to deal with. 2. The ATTRIBUTES - the items of information which characterise and describe these entities. 3. The RELATIONSHIPS between entities which exist and must be taken into account when processing information.

In the abstract, or illustrated by superficial examples, this looks a very simple idea. An explanation may put forward a car as a typical entity, point to make, colour, registration number as obvious attributes, and suggest owning and driving as relationships in which it takes part. But applying the method of analysis to some useful purpose in a working organisation will be difficult, simply because the world does not fit so neatly into boxes.

2.1. Entities.

1. Defining.

In a global data model for a hospital, for example, the entity names employee and patient are likely to occur. In some situations the same individual may be both an employee and a patient; is s/he one entity or two? In such a case it is more accurate to define a fundamental entity such as a person able to take a role in the relationship.

Note: the name given to an entity should always be a singular noun descriptive of each item to be stored in it. e.g. patient NOT patients.

2. Composite entities.

Sometimes an entity as defined above may be considered as a single entity or as multiple entities. For example, a firm may have a number of different premises for the delivery of goods or supply of services - multiple entities - but one head office for the payment of accounts - a single entity. Similarly, a single object - e.g. an aeroplane - may be made up of many components. These components have a hierarchical / recursive relationship: e.g. A is made up of B and C which in turn are made up of X, Y and Z. It is impossible to treat such entities at a single level; for example the CAA has rules about the expected life of different aircraft components but these must be related to the flight history (take-offs, landings, flight hours) of the plane in which they are fitted.

3. Aggregates.

For some applications, e.g. stock control of low cost items, we are not concerned with individuals but with AGGREGATES - e.g. how many cans of baked beans are in the warehouse? It is easy to overlook the fact that a definition refers not to an individual but to a type of object. More problems arise when manufactured objects change status part-way through their lives - a car on an assembly line may be indistinguishable from those immediately before and after it, but it must at some point acquire an individual identity for sale and registration.

4. Sub-classification of entities.

The need for sub-classification of entities arises frequently. For example, a transport pool handles resources which in a global data model might be all classed as vehicles. Although certain attributes and relationships may be common to all vehicles, they are not interchangeable and for practical purposes must be subdivided into vans, lorries, etc. each with their own particular characteristics.

5. Physical boundaries.

It may be difficult to determine the physical boundaries of entities which are identified primarily by geographical location. For example, a motorway is a road which is part of one continuous system; it may, however, be necessary to specify points on the motorway, where one section stops and the next begins.

6. Events.

So far we have been talking about entities which are fairly tangible objects; however, there are many more that are more abstract, for example a sale or a birth. In terms of the overall model these are events which trigger off processes. Nevertheless, we may wish to count them, attach attributes to them, and relate them to one another, as if they were entities. The more abstract the entity, the more difficult it is to decide whether it should really be classed as an event or a relationship.

7. Messages.

Similar difficulties occur with entities which exist only as messages within an existing clerical information system. As a piece of paper, a receipt is a tangible object, but it is doubtful whether it should be a first class citizen of a conceptual data model, since in different technological situations its functions could equally well be performed by a notch on a stick or a series of pulses on an EDI network.

As these examples indicate, entities are not easy to pin down and compromise is required between a fruitless search for some ultimate truth and a short-term expedient which will break down as soon as it is seriously tested. It is necessary to consider the purpose for which the database is being designed: what sort of questions are likely to be asked and how future developments may require the initial requirements specification to be extended.

2.2. Attributes.

Attributes are pieces of information ABOUT entities. The analysis must of course identify those which are actually relevant to the proposed application. Attributes will give rise to recorded items of data in the database so it is important that nothing potentially useful is omitted here even though not all may be included later.

At this level we need to know such things as:

● Attribute name.
● The domain from which attribute values are taken.
● Whether the attribute is part of the entity identifier.
● Whether it is permanent or time-varying.
● Whether it is required or optional for the entity.

1. Attribute names.

Names should be explanatory words or phrases; codings and/or abbreviations should be postponed until the implementation level. They should enable users of the data model to see what is being recorded, and identify where the same piece of information occurs in different places (e.g. through a DATA DICTIONARY.)

2. Domains.

A DOMAIN is a set of values from which attribute values may be taken. Examples are: dates, sums of money, temperatures, grid references, colours, gradings and nationalities. In one way the more definite we can be here the better - it is more helpful to know that an attribute is a linear measurement than that it is a number - but precise formatting specifications are not required. A date is a date, however it is stored or presented, and the conceptual data model is not the place to specify that it is a six-digit number in the format ddmmyy. On the other hand it will be useful to know the kind of accuracy which may be expected for this attribute - is the measurement to the nearest day, week or year?

3. Identifier attributes.

We must distinguish between attributes which just describe an entity (and perhaps change over time), and those which help to identify it uniquely. This will certainly influence the lower level design. But some care is required. In natural discourse very few things have one unique name or identifying attribute - they are distinguished by a combination e.g. name, address and date of birth for a person. In formal information systems, it is convenient to give unique but arbitrary labels to persons or objects (e.g. National Insurance Number, Vehicle Registration Number) so as to have quicker and easier means of identification. But however convenient such KEYS are at the implementation level they should not be introduced into the conceptual design unless they are already familiar to users of the information, and there is an understandable process for getting from the entity itself to its identifying attribute.

4. Time-varying attributes.

It is important to know which attributes may change their values over time, as they will have to be updated when the system is implemented. It may also be necessary to hold previous values somewhere for AUDITING or the production of HISTORICAL REPORTS.

5. Optional attributes.

Entities may have attributes whose values will sometimes be unknown or irrelevant. This happens with very general entities such as vehicle more often than with those more narrowly defined. A record of which attributes are OPTIONAL or MANDATORY will help when defining general consistency constraints for the database at lower levels.

2.3. Relationships.

In many applications one external event or process may affect several related entities, requiring the setting of LINKS from one part of the database to another. Important information to be recorded is:

1. Number and type of roles.

Like an attribute, a relationship should be named by a word or phrase which explains its function. The same applies to the role names within the relationship, e.g. owner, driver etc. Role names are different from the names of entities forming the relationship: one entity may take on many roles, the same role may be played by different entities. Where a relationship is symmetrical (e.g. 2 roads intersecting) role names are arbitrarily assigned.

An important point about a relationship is how many entities participate in it. In practice BINARY RELATIONSHIPS are most convenient for data modeling; however it is by no means the case that all relationships must necessarily be binary in this sense. For example, consider a model built around vehicle manufacturers, types of vehicle and garages. In such a model we may wish to represent the information specified by a TERNARY RELATIONSHIP e.g.

● (Vehicle_Manufacturer, Vehicle_Type, Garage)
● Vehicle_Manufacturer supplies Vehicle_Type to Garage.

This ternary relationship is not in general equivalent to the combination of the three binary relationships:

● Vehicle_Manufacturer SUPPLIES Vehicle_Type
● Garage SELLS Vehicle_Type
● Garage IS SUPPLIED BY Vehicle_Manufacturer

For example the information that:

1. Ford supplies cars to Hills

tells us more than the combination:

2. Ford supplies cars
3. Hills sells cars
4. Hills is supplied by Ford

Relationships (2), (3) and (4) only tell us that:

● Ford supplies cars to some garage(s)
● Hills sells cars supplied by some supplier
● Ford supplies some type of vehicle to Hills.

One cannot validly infer (1) from (2), (3) and (4); e.g. it may be that Hills only sells Ford trucks, not cars. The false inference (2, 3, 4) => (1) is sometimes called the connection trap.
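
The connection trap can be reproduced with plain Python sets. In this sketch the ternary facts are made-up sample data (only Ford and Hills come from the example above; Vauxhall and Smiths are hypothetical). The ternary relationship is split into the three binary ones and then recombined, and the recombination produces a fact that was never recorded.

```python
# Ternary relationship: (manufacturer, vehicle type, garage).
supplies_to = {
    ("Ford", "truck", "Hills"),     # Hills only gets trucks from Ford
    ("Ford", "car", "Smiths"),
    ("Vauxhall", "car", "Hills"),
}

# The three binary relationships obtained by dropping one role each.
supplies = {(m, t) for m, t, g in supplies_to}      # manufacturer SUPPLIES type
sells = {(g, t) for m, t, g in supplies_to}         # garage SELLS type
supplied_by = {(g, m) for m, t, g in supplies_to}   # garage IS SUPPLIED BY manufacturer

# Recombine: accept (m, t, g) whenever all three binary facts hold.
rejoined = {
    (m, t, g)
    for (m, t) in supplies
    for (g, t2) in sells
    if t == t2 and (g, m) in supplied_by
}

spurious = rejoined - supplies_to
print(spurious)  # {('Ford', 'car', 'Hills')}
```

Ford supplies cars, Hills sells cars, and Hills is supplied by Ford, yet the recorded data never said that Ford supplies cars to Hills. The binary relationships cannot tell these two situations apart.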

2. Degree.

Relationships may be:

● ONE-TO-ONE, e.g. Building - Location.
● ONE-TO-MANY, e.g. Hospital - Patient.
● MANY-TO-MANY, e.g. Author - Book.
● RECURSIVE, e.g. Manager - Employee.

To decide the DEGREE of a relationship one must be clear about whether individuals or types are involved. The relationship (vehicle_manufacturer, vehicle_type) is 1-to-many if we are talking about an individual manufacturer, but many-to-many where (vehicle_manufacturer, vehicle_type) are types.

Consider also the time dimension - whether the relationship is being recorded at a single point in time or over a period of time. For example, in Britain the legal relationship husband-wife is one to one in the first case, but potentially many to many in the second.

3. Optionality.

A relationship may be required or optional for either participant, e.g. a piece of property must be owned by a person but not all persons need own a piece of property.

4. Permanence.

A relationship may be defined as permanent (e.g. parenthood) or temporary (e.g. ownership or employment). If temporary but required by one of the participants (e.g. ownership) it must be transferable.

5. Dependencies.

One relationship may necessarily exclude or imply another, or be excluded or implied by another. Membership of the City University Students Union necessarily implies membership of the University itself.

Many of the above constraints will require consistency checks on incoming data. Some DBMSs allow such checks to be declared as part of the logical data model so that they are applied automatically whenever the database is updated.

2.4 Data Models

Data models are a collection of conceptual tools for describing data, data relationships, data semantics and data constraints. There are three different groups:

(a) Object-based Logical Models. (b) Record-based Logical Models. (c) Physical Data Models.

(a) Object-based Logical Models.

Object-based logical models:
⌐ Describe data at the conceptual and view levels.
⌐ Provide fairly flexible structuring capabilities.
⌐ Allow one to specify data constraints explicitly.
⌐ Over 30 such models, including
● Entity-relationship model.
● Object-oriented model.
● Binary model.
● Semantic data model.
● Infological model.
● Functional data model.

We'll take a closer look at the entity-relationship (E-R) and object-oriented models.

→ The E-R Model

The entity-relationship model is based on a perception of the world as consisting of a collection of basic objects (entities) and relationships among these objects.

● An entity is a distinguishable object that exists. ● Each entity has associated with it a set of attributes describing it. E.g. number and balance for an account entity. ● A relationship is an association among several entities. e.g. A cust acct relationship associates a customer with each account he or she has. ● The set of all entities or relationships of the same type is called the entity set or relationship set. ● Another essential element of the E-R diagram is the mapping cardinalities, which express the number of entities to which another entity can be associated via a relationship set.

→ The overall logical structure of a database can be expressed graphically by an E-R diagram: ● rectangles: represent entity sets. ● ellipses: represent attributes. ● diamonds: represent relationships among entity sets. ● lines: link attributes to entity sets and entity sets to relationships.

→ The Object-Oriented Model

The object-oriented model is based on a collection of objects, like the E-R model.
● An object contains values stored in instance variables within the object.
● Unlike the record-oriented models, these values are themselves objects.
● Thus objects contain objects to an arbitrarily deep level of nesting.
● An object also contains bodies of code that operate on the object.
● These bodies of code are called methods.
● Objects that contain the same types of values and the same methods are grouped into classes.
● A class may be viewed as a type definition for objects.
● Analogy: the programming language concept of an abstract data type.
● The only way in which one object can access the data of another object is by invoking the method of that other object.
● This is called sending a message to the object.
● Internal parts of the object, the instance variables and method code, are not visible externally.
● Result is two levels of data abstraction.

For example, consider an object representing a bank account.
⌐ The object contains instance variables number and balance.
⌐ The object contains a method pay-interest which adds interest to the balance.
⌐ Under most data models, changing the interest rate entails changing code in application programs.
⌐ In the object-oriented model, this only entails a change within the pay-interest method.

Unlike entities in the E-R model, each object has its own unique identity, independent of the values it contains:
● Two objects containing the same values are distinct.
● Distinction is maintained at the physical level by assigning distinct object identifiers.
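
The bank account object described above can be sketched as a Python class. The class name, the 5 percent rate and the whole-number arithmetic are assumptions made for illustration. Python hides instance variables only by naming convention (the leading underscore), whereas an object-oriented database would enforce the hiding.

```python
class Account:
    """An account object: instance variables plus the methods that act on them."""

    RATE_PERCENT = 5  # hypothetical rate; changing it touches only this class

    def __init__(self, number, balance):
        self._number = number     # instance variables, internal by convention
        self._balance = balance

    def pay_interest(self):
        """The pay-interest method: adds interest to the balance."""
        self._balance += self._balance * self.RATE_PERCENT // 100

    def balance(self):
        """Other code sees the balance only by sending this message."""
        return self._balance


a = Account(101, 1000)
b = Account(101, 1000)   # same values as a, but a separate object

a.pay_interest()         # "sending a message" to object a
print(a.balance())       # 1050
print(b.balance())       # 1000
print(a is b)            # False: identity does not depend on the stored values
```

Code that calls pay_interest does not need to change when the rate or the interest calculation changes. Only the method body does.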

(b) Record-based Logical Models

Record-based logical models:
● Also describe data at the conceptual and view levels.
● Unlike object-oriented models, are used to specify overall logical structure of the database, and provide a higher-level description of the implementation.
● Named so because the database is structured in fixed-format records of several types.
● Each record type defines a fixed number of fields, or attributes.
● Each field is usually of a fixed length (this simplifies the implementation).
● Record-based models do not include a mechanism for direct representation of code in the database.
● Separate languages associated with the model are used to express database queries and updates.
● The three most widely-accepted models are the relational, network, and hierarchical.

→ The Relational Model
● Data and relationships are represented by a collection of tables.
● Each table has a number of columns with unique names, e.g. customer, account.
● Figure shows a sample relational database.

→ The Network Model
● Data are represented by collections of records.
● Relationships among data are represented by links.
● Organization is that of an arbitrary graph.
● Figure shows a sample network database.

→ The Hierarchical Model
● Similar to the network model.
● Organization of the records is as a collection of trees, rather than arbitrary graphs.
● Figure shows a sample hierarchical database.

(c) Physical Data Models

● Are used to describe data at the lowest level. ● Very few models, e.g. → Unifying model. → Frame memory.

2.5 Introduction to Entity-Relationship Diagrams

Entity-Relationship (ER) Diagram.
● The ER model allows us to describe the data involved in a real world enterprise in terms of objects (entities) and their relationships.
● Provides the initial framework for developing an initial DB design.
● Other variations of the ER model exist, differing mainly in the way entities and their relationships are graphically represented.

ER Notation Explanation

Additional Features of the ER Model
● Key Constraints: the "at least / at most" question
→ 1:1 : Each professor works in at most 1 department. In each department at most 1 professor works.
→ 1:N : Each professor works in at most N departments. In each department at most 1 professor works.
→ N:M : Each professor works in at most N departments. In each department at most M professors work.
● Participation Constraints (partial vs. total)
1. Employee works in at least 1, at most N Departments (TOTAL participation in relationship)
2. Employee works in at least 0, at most N Departments (PARTIAL participation in relationship)
● Weak Entities
1. A weak entity can be identified uniquely ONLY by considering the primary key of another (owner) entity.
2. Employee(ssn) -- ||Has|| <= Dependents(pname, age)
● Class Hierarchies & Aggregation
1. Not used in most design tools.
2. Might be covered by your teacher.

3. Relational Databases

The relational model.

The first database systems were based on the network and hierarchical models. The relational model was first proposed by E.F. Codd in 1970, and the first such systems (notably INGRES and System R) were developed in the 1970s. The relational model is now the dominant model for commercial data processing applications.

Attribute Name Abbreviations

The text uses fairly long attribute names, which are abbreviated in the notes as follows.
● customer-name becomes cname
● customer-city becomes ccity
● branch-city becomes bcity
● branch-name becomes bname
● account-number becomes account#
● loan-number becomes loan#
● banker-name becomes banker

1. Structure of Relational Database

● A relational database consists of a collection of tables, each having a unique name. A row in a table represents a relationship among a set of values. Thus a table represents a collection of relationships. ● There is a direct correspondence between the concept of a table and the mathematical concept of a relation. A substantial theory has been developed for relational databases.

1 Basic Structure

1. Figure shows the deposit and customer tables for our banking example.

● The deposit relation has four attributes. ● For each attribute there is a permitted set of values, called the domain of that attribute. E.g. the domain of bname is the set of all branch names.

Let D1 denote the domain of bname, and D2, D3 and D4 the remaining attributes' domains respectively. Then any row of deposit consists of a four-tuple (v1, v2, v3, v4) where v1 ∈ D1, v2 ∈ D2, v3 ∈ D3, v4 ∈ D4. In general, deposit contains only a subset of the set of all possible rows. That is, deposit is a subset of

D1 × D2 × D3 × D4, abbreviated to × (i = 1..4) Di

In general, a table of n columns must be a subset of

× (i = 1..n) Di (all possible rows)

2. Mathematicians define a relation to be a subset of a Cartesian product of a list of domains. You can see the correspondence with our tables. We will use the terms relation and tuple in place of table and row from now on.

3. Some more formalities: ● Let the tuple variable t refer to a tuple of the relation r. ● We say t ∈ r to denote that the tuple t is in relation r. ● Then t[bname] = t[1] = the value of t on the bname attribute. ● So t[bname] = t[1] = "Downtown", ● and t[cname] = t[3] = "Johnson".

4. We'll also require that the domains of all attributes be indivisible units. ● A domain is atomic if its elements are indivisible units. For example, the set of integers is an atomic domain. ● The set of all sets of integers is not. ● Why? Integers do not have subparts, but sets do: the integers comprising them. ● We could consider integers non-atomic if we thought of them as ordered lists of digits.
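The idea that a relation is a subset of the Cartesian product of its domains can be sketched in a few lines of Python. The domains and rows below are tiny hypothetical values chosen only for illustration; they are not the contents of the figure.

```python
from itertools import product

# Hypothetical, deliberately tiny domains (assumptions for illustration).
D1 = {"Downtown", "SFU"}       # domain of bname
D2 = {101, 102}                # domain of account#
D3 = {"Johnson", "Hayes"}      # domain of cname
D4 = {500, 900}                # domain of balance

# D1 × D2 × D3 × D4: the set of all possible rows.
all_rows = set(product(D1, D2, D3, D4))

# A relation is any subset of that product.
deposit = {
    ("Downtown", 101, "Johnson", 500),
    ("SFU", 102, "Hayes", 900),
}

is_relation = deposit <= all_rows   # subset test
print(len(all_rows), is_relation)   # 16 True
```

With two values in each of the four domains there are 2 × 2 × 2 × 2 = 16 possible rows, of which deposit holds only two.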

2 Database Schema

1. We distinguish between a database scheme (logical design) and a database instance (data in the database at a point in time). 2. A relation scheme is a list of attributes and their corresponding domains. 3. The text uses the following conventions: ● italics for all names ● lowercase names for relations and attributes ● names beginning with an uppercase for relation schemes

These notes will do the same. For example, the relation scheme for the deposit relation: Deposit-scheme = (bname, account#, cname, balance) We may state that deposit is a relation on scheme Deposit-scheme by writing deposit (Deposit-scheme).

If we wish to specify domains, we can write: (bname: string, account#: integer, cname: string, balance: integer).

Note that customers are identified by name. In the real world, this would not be allowed, as two or more customers might share the same name.

4. The relation schemes for the banking example used throughout the text are:
Branch-scheme = (bname, assets, bcity)
Customer-scheme = (cname, street, ccity)
Deposit-scheme = (bname, account#, cname, balance)
Borrow-scheme = (bname, loan#, cname, amount)

Note: some attributes appear in several relation schemes (e.g. bname, cname). This is legal, and provides a way of relating tuples of distinct relations.

5. Why not put all attributes in one relation?

Suppose we use one large relation instead of customer and deposit: Account-scheme = (bname, account#, cname, balance, street, ccity) If a customer has several accounts, we must duplicate her or his address for each account. If a customer has an account but no current address, we cannot build a tuple, as we have no values for the address. We would have to use null values for these fields. Null values cause difficulties in the database. ● By using two separate relations, we can record such a customer without using null values.

3 Keys

1. The notions of superkey, candidate key and primary key all apply to the relational model. 2. For example, in Branch-scheme, ● {bname} is a superkey. ● {bname, bcity} is a superkey. ● {bname, bcity} is not a candidate key, as the superkey {bname} is contained in it. ● {bname} is a candidate key. ● {bcity} is not a superkey, as branches may be in the same city. ● We will use {bname} as our primary key. 3. The primary key for Customer-scheme is {cname}. 4. More formally, if we say that a subset K of R is a superkey for R, we are restricting consideration to relations r(R) in which no two distinct tuples have the same values on all attributes in K. In other words, if t1 and t2 are in r, and t1 ≠ t2, then t1[K] ≠ t2[K].
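The formal condition in item 4 can be checked against one particular instance. The sketch below is an illustration only: the function name is_superkey and the branch rows are assumptions, and since a superkey is a constraint on every legal instance, a single instance can refute a key but never prove one.

```python
def is_superkey(rows, key):
    """Return True if no two distinct rows agree on every attribute in key."""
    seen = set()
    for row in rows:
        projected = tuple(row[a] for a in key)   # t[K]
        if projected in seen:
            return False                         # t1 != t2 but t1[K] == t2[K]
        seen.add(projected)
    return True

# Hypothetical instance of Branch-scheme = (bname, assets, bcity).
branch = [
    {"bname": "Downtown", "assets": 9000000, "bcity": "Brooklyn"},
    {"bname": "Redwood", "assets": 2100000, "bcity": "Palo Alto"},
    {"bname": "Brighton", "assets": 7100000, "bcity": "Brooklyn"},
]

print(is_superkey(branch, ["bname"]))            # True
print(is_superkey(branch, ["bname", "bcity"]))   # True
print(is_superkey(branch, ["bcity"]))            # False
```

Two branches share the city Brooklyn, so {bcity} fails, matching the discussion above.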

4 Query Languages

1. A query language is a language in which a user requests information from a database. These are typically higher-level than programming languages. They may be one of: ● Procedural, where the user instructs the system to perform a sequence of operations on the database. This will compute the desired information. ● Nonprocedural, where the user specifies the information desired without giving a procedure for obtaining the information.

2. A complete query language also contains facilities to insert and delete tuples as well as to modify parts of existing tuples.

5 The Relational Algebra

The relational algebra is a procedural query language. ● Six fundamental operations: select (unary), project (unary), rename (unary), cartesian product (binary), union (binary), set-difference (binary).

● Several other operations, defined in terms of the fundamental operations: set-intersection, natural join, division, assignment.

Operations produce a new relation as a result.

Fundamental Operations

1. The Select Operation

Select selects tuples that satisfy a given predicate. Select is denoted by a lowercase Greek sigma (σ), with the predicate appearing as a subscript. The argument relation is given in parentheses following the σ. For example, to select tuples (rows) of the borrow relation where the branch is "SFU", we would write σ bname="SFU"(borrow)

The figure above shows the borrow and branch relations in the banking example.

The new relation created as the result of this operation consists of one tuple: (SFU, 15, Hayes, 1500).

We allow comparisons using =, ≠, <, ≤, > and ≥ in the selection predicate. We also allow the logical connectives V (or) and ^ (and). For example: σ bname="Downtown" ^ amount > 1200 (borrow)

Suppose there is one more relation, client, shown in the following figure, with the scheme Client-scheme = (cname, banker)

We might write σ cname=banker(client) to find clients who have the same name as their banker.

2. The Project Operation

Project copies its argument relation for the specified attributes only. Since a relation is a set, duplicate rows are eliminated. Projection is denoted by the Greek capital letter pi (Π). The attributes to be copied appear as subscripts. For example, to obtain a relation showing customers and branches, but ignoring amount and loan#, we write Π bname, cname (borrow)

We can perform these operations on the relations resulting from other operations. To get the names of customers having the same name as their bankers, Π cname (σ cname=banker(client))

Think of select as taking rows of a relation, and project as taking columns of a relation.
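The row/column picture can be made concrete with a small Python sketch. Relations are modelled here as lists of dictionaries; the helper names select and project and all rows other than the SFU loan are assumptions made for illustration.

```python
def select(relation, predicate):
    """σ: keep the rows that satisfy the predicate."""
    return [row for row in relation if predicate(row)]

def project(relation, attributes):
    """Π: keep only the named columns and drop duplicate rows."""
    seen, result = set(), []
    for row in relation:
        values = tuple(row[a] for a in attributes)
        if values not in seen:
            seen.add(values)
            result.append(dict(zip(attributes, values)))
    return result

# Hypothetical borrow relation on (bname, loan#, cname, amount).
borrow = [
    {"bname": "SFU", "loan#": 15, "cname": "Hayes", "amount": 1500},
    {"bname": "Downtown", "loan#": 17, "cname": "Jones", "amount": 1000},
    {"bname": "Downtown", "loan#": 23, "cname": "Jones", "amount": 2000},
]

sfu_loans = select(borrow, lambda t: t["bname"] == "SFU")
pairs = project(borrow, ["bname", "cname"])
print(len(sfu_loans), len(pairs))   # 1 2
```

Jones has two loans at Downtown, but the projection onto (bname, cname) keeps a single row for that pair because a relation is a set.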

3. The Cartesian Product Operation

The cartesian product of two relations is denoted by a cross (×), written r1 × r2 for relations r1 and r2. The result of r1 × r2 is a new relation with a tuple for each possible pairing of tuples from r1 and r2. In order to avoid ambiguity, the attribute names have attached to them the name of the relation from which they came. If no ambiguity will result, we drop the relation name. The result client × customer is a very large relation. If r1 has n1 tuples, and r2 has n2 tuples, then r = r1 × r2 will have n1n2 tuples.

The resulting scheme is the concatenation of the schemes of r1 and r2, with relation names added as mentioned. To find the clients of banker Johnson and the city in which they live, we need information in both client and customer relations. We can get this by writing

σ banker="Johnson"(client × customer)

However, the customer.cname column contains customers of bankers other than Johnson. (Why?) We want rows where client.cname = customer.cname. So we can write

σ client.cname=customer.cname (σ banker="Johnson"(client × customer))
to get just these tuples. Finally, to get just the customer's name and city, we need a projection:
Π client.cname, ccity (σ client.cname=customer.cname (σ banker="Johnson"(client × customer)))
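The three steps of this query (product, two selections, projection) can be traced in Python. This is a sketch under assumptions: the helper cartesian_product and the sample client and customer rows are hypothetical, and attribute names are always qualified with the relation name.

```python
def cartesian_product(r1, name1, r2, name2):
    """r1 × r2, with each attribute prefixed by its relation name."""
    return [
        {**{f"{name1}.{k}": v for k, v in t1.items()},
         **{f"{name2}.{k}": v for k, v in t2.items()}}
        for t1 in r1
        for t2 in r2
    ]

# Hypothetical sample data.
client = [
    {"cname": "Turner", "banker": "Johnson"},
    {"cname": "Hayes", "banker": "Jones"},
    {"cname": "Smith", "banker": "Johnson"},
]
customer = [
    {"cname": "Turner", "street": "Putnam", "ccity": "Stamford"},
    {"cname": "Hayes", "street": "Main", "ccity": "Harrison"},
    {"cname": "Smith", "street": "North", "ccity": "Rye"},
]

pairs = cartesian_product(client, "client", customer, "customer")
johnson = [t for t in pairs if t["client.banker"] == "Johnson"]
matched = [t for t in johnson if t["client.cname"] == t["customer.cname"]]
result = {(t["client.cname"], t["customer.ccity"]) for t in matched}
print(len(pairs), len(johnson), len(matched))   # 9 6 2
```

The first selection alone leaves six rows, four of which pair a Johnson client with some other customer's address; the equality test on cname removes them.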

4. The Rename Operation

The rename operation solves the naming problems that occur when performing the cartesian product of a relation with itself. Suppose we want to find the names of all the customers who live on the same street and in the same city as Smith. We can get the street and city of Smith by writing

Π street, ccity (σ cname="Smith"(customer))
To find other customers with the same information, we need to reference the customer relation again:

σ P (customer × (Π street, ccity (σ cname="Smith"(customer))))

Where P is a selection predicate requiring street and ccity values to be equal.

Problem: how do we distinguish between the two street values appearing in the Cartesian product, as both come from a customer relation?

Solution: use the rename operator, denoted by the Greek letter rho (ρ). We write ρ x(r) to get the relation r under the name of x.

If we use this to rename one of the two customer relations we are using, the ambiguities will disappear.

Π customer.cname (σ cust2.street=customer.street ^ cust2.ccity=customer.ccity (customer × (Π street, ccity (σ cname="Smith"(ρ cust2(customer))))))

5. The Union Operation

The union operation is denoted U as in set theory. It returns the union (set union) of two compatible relations. For a union operation r U s to be legal, we require that ● r and s must have the same number of attributes. ● The domains of the corresponding attributes must be the same.

To find all customers of the SFU branch, we must find everyone who has a loan or an account or both at the branch. We need both borrow and deposit relations for this:

Π cname (σ bname="SFU"(borrow)) U Π cname (σ bname="SFU"(deposit))

As in all set operations, duplicates are eliminated, giving the relation of Figure

6. The Set Difference Operation

Set difference is denoted by the minus sign ( - ). It finds tuples that are in one relation, but not in another. Thus r - s results in a relation containing tuples that are in r but not in s. To find customers of the SFU branch who have an account there but no loan, we write

Π cname (σ bname="SFU"(deposit)) - Π cname (σ bname="SFU"(borrow))

The result is shown in Figure

We can do more with this operation. Suppose we want to find the largest account balance in the bank.

Strategy: ● Find a relation r containing all balances that are not the largest. ● Compute the set difference of the balances in the deposit relation and r.

To find r, we write

Π deposit.balance (σ deposit.balance < d.balance (deposit × ρ d(deposit)))

This resulting relation contains all balances except the largest one. (See Figure).

Now we can finish our query by taking the set difference:

Π balance (deposit) - Π deposit.balance (σ deposit.balance < d.balance (deposit × ρ d(deposit)))

Figure shows the result.
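The same strategy can be followed with Python sets. The balances below are hypothetical, and the sketch works directly on the projected balance column rather than on whole deposit tuples.

```python
# Π balance(deposit), with hypothetical values.
balances = {500, 900, 700}

# Balances that are smaller than some other balance: this mirrors
# σ deposit.balance < d.balance (deposit × ρ d(deposit)).
not_largest = {b1 for b1 in balances for b2 in balances if b1 < b2}

# The set difference leaves only the largest balance.
largest = balances - not_largest
print(sorted(not_largest), largest)   # [500, 700] {900}
```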

6. Formal Definition of Relational Algebra

1. A basic expression consists of either ● A relation in the database. ● A constant relation.

2. General expressions are formed out of smaller sub-expressions using
● σ p(E1) select (p a predicate)
● Π s(E1) project (s a list of attributes)
● ρ x(E1) rename (x a relation name)
● E1 U E2 union
● E1 - E2 set difference
● E1 × E2 cartesian product

7. Additional Operations

Additional operations are defined in terms of the fundamental operations. They do not add power to the algebra, but are useful to simplify common queries.

1. The Set Intersection Operation

Set intersection is denoted by ∩, and returns a relation that contains tuples that are in both of its argument relations. It does not add any power, as r ∩ s = r - (r - s)

To find all customers having both a loan and an account at the SFU branch, we write Π cname (σ bname="SFU"(borrow)) ∩ Π cname (σ bname="SFU"(deposit))
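The identity r ∩ s = r - (r - s) is easy to confirm on a small case. The two sets of customer names below are hypothetical.

```python
# Hypothetical results of the two projections for the SFU branch.
borrowers = {"Hayes", "Adams", "Turner"}
depositors = {"Hayes", "Turner", "Lindsay"}

# Intersection written using set difference only.
via_difference = borrowers - (borrowers - depositors)

print(sorted(via_difference))                    # ['Hayes', 'Turner']
print(via_difference == borrowers & depositors)  # True
```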

2. The Natural Join Operation

Often we want to simplify queries on a cartesian product. For example, to find all customers having a loan at the bank and the cities in which they live, we need borrow and customer relations:

Π borrow.cname, ccity (σ borrow.cname=customer.cname (borrow × customer))

Our selection predicate obtains only those tuples pertaining to only one cname.

This type of operation is very common, so we have the natural join, denoted by a ►◄ sign. Natural join combines a cartesian product and a selection into one operation. It performs a selection forcing equality on those attributes that appear in both relation schemes. Duplicates are removed as in all relation operations.

To illustrate, we can rewrite the previous query as

Π cname, ccity (borrow ►◄ customer)

The resulting relation is shown in Figure.

We can now make a more formal definition of natural join. → Consider R and S to be sets of attributes. → We denote attributes appearing in both relations by R ∩ S. → We denote attributes in either or both relations by R U S. → Consider two relations r(R) and s(S). → The natural join of r and s, denoted by r ►◄ s is a relation on scheme R U S. → It is a projection onto R U S of a selection on r × s where the predicate requires r.A = s.A for each attribute A in R ∩ S.

Formally, r ►◄ s = Π R U S (σ r.A1=s.A1 ^ r.A2=s.A2 ^ … ^ r.An=s.An (r × s)) where R ∩ S = {A1, A2, …, An}.

To find the assets and names of all branches which have depositors living in Stamford, we need customer, deposit and branch relations:

Π bname, assets (σ ccity="Stamford" (customer ►◄ deposit ►◄ branch))

Note that ►◄ is associative.

To find all customers who have both an account and a loan at the SFU branch: Π cname (σ bname="SFU"(borrow ►◄ deposit))

This is equivalent to the set intersection version we wrote earlier. We see now that there can be several ways to write a query in the relational algebra.

If two relations r(R) and s(S) have no attributes in common, then R ∩ S = ∅, and r ►◄ s = r × s.
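A minimal sketch of the natural join, assuming relations are lists of dictionaries keyed by attribute name. The function natural_join and the sample rows are hypothetical; the nested loop is chosen for clarity, not efficiency.

```python
def natural_join(r, s):
    """Pair up tuples that agree on every attribute in R ∩ S."""
    result = []
    for tr in r:
        for ts in s:
            common = tr.keys() & ts.keys()            # R ∩ S
            if all(tr[a] == ts[a] for a in common):
                merged = {**tr, **ts}                 # tuple on R U S
                if merged not in result:              # relations are sets
                    result.append(merged)
    return result

# Hypothetical sample data.
borrow = [
    {"bname": "SFU", "loan#": 15, "cname": "Hayes", "amount": 1500},
    {"bname": "Downtown", "loan#": 17, "cname": "Jones", "amount": 1000},
]
customer = [
    {"cname": "Hayes", "street": "Main", "ccity": "Harrison"},
    {"cname": "Jones", "street": "Main", "ccity": "Harrison"},
    {"cname": "Smith", "street": "North", "ccity": "Rye"},
]

joined = natural_join(borrow, customer)
names_and_cities = {(t["cname"], t["ccity"]) for t in joined}

# With no common attributes the join degenerates to the cartesian product.
no_common = natural_join([{"a": 1}, {"a": 2}], [{"b": 3}])
print(len(joined), len(no_common))   # 2 2
```

Smith has no loan, so no joined tuple mentions Smith. In the second call R ∩ S is empty, the equality test is vacuously true, and every pairing is kept.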

3. The Division Operation

Division, denoted ÷, is suited to queries that include the phrase “for all". Suppose we want to find all the customers who have an account at all branches located in Brooklyn. Strategy: think of it as three steps.

We can obtain the names of all branches located in Brooklyn by r1 = Π bname (σ bcity="Brooklyn"(branch))

We can also find all cname, bname pairs for which the customer has an account by r2 = Π cname, bname(deposit)

Now we need to find all customers who appear in r2 with every branch name in r1. The divide operation provides exactly those customers.

Π cname, bname (deposit) ÷ Π bname (σ bcity="Brooklyn"(branch)) which is simply r2 ÷ r1.

Formally, ● Let r(R) and s(S) be relations. ● Let S ⊆ R. ● The relation r ÷ s is a relation on scheme R - S. ● A tuple t is in r ÷ s if for every tuple ts in s there is a tuple tr in r satisfying both of the following: tr[S] = ts[S] and tr[R - S] = t[R - S]. ● These conditions say that the R - S portion of a tuple t is in r ÷ s if and only if r contains tuples with that R - S portion and the S portion, for every value of the S portion in relation s.

The division operation can be defined in terms of the fundamental operations.

r ÷ s = Π R-S(r) - Π R-S((Π R-S(r) × s) - r)

4. The Assignment Operation

Sometimes it is useful to be able to write a relational algebra expression in parts using a temporary relation variable (as we did with r1 and r2 in the division example). The assignment operation, denoted ←, works like assignment in a programming language. We could rewrite our division definition as

temp1 ← Π R-S(r)
temp2 ← Π R-S((temp1 × s) - r)
result ← temp1 - temp2

No extra relation is added to the database, but the relation variable created can be used in subsequent expressions. Assignment to a permanent relation would constitute a modification to the database.
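The temp1/temp2 formulation of division translates almost line for line into Python. In this sketch r is a set of (cname, bname) pairs and s is a set of branch names; the function name divide and the sample data are assumptions.

```python
def divide(r, s):
    """r ÷ s for r on (cname, bname) and s on (bname)."""
    temp1 = {c for (c, b) in r}                    # Π R-S(r)
    # Customers for whom some (c, b) pairing is missing from r,
    # i.e. Π R-S((temp1 × s) - r).
    temp2 = {c for c in temp1 for b in s if (c, b) not in r}
    return temp1 - temp2

# Hypothetical data: r2 = Π cname, bname(deposit), r1 = Brooklyn branches.
r2 = {
    ("Johnson", "Downtown"),
    ("Johnson", "Brighton"),
    ("Hayes", "Downtown"),
}
r1 = {"Downtown", "Brighton"}

print(divide(r2, r1))   # {'Johnson'}
```

Hayes has no account at Brighton, so the pair (Hayes, Brighton) lies in temp1 × s but not in r, which places Hayes in temp2 and removes that name from the result.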

8. Aggregate functions

1. Aggregate functions: sum, avg, count, min, max. 2. The collections on which aggregate functions are applied can have multiple occurrences of a value: the order in which the value appears is irrelevant. Such collections are called multi-sets. For example, Sum salary (pt_works)

3. To eliminate multiple occurrences of a value prior to computing an aggregate function, we append the hyphenated string -distinct to the function name. For example, count-distinct bname (pt_works).

4. Grouping and then aggregating: To find the total salary sum of all part-time employees at each branch (not the entire bank!), we write bname G sum salary (pt_works).

5. The general form of the aggregation operation G is as follows: G1, G2, …, Gn G F1 A1, F2 A2, …, Fm Am (E)

where E is any relational-algebra expression, G1, G2, …, Gn constitute a list of attributes on which to group, each Fi is an aggregate function, and each Ai is an attribute name.

The meaning of the operation: ● The tuples in the result of expression E are partitioned into groups. All tuples in a group have the same values for G1, G2, …, Gn, and tuples in different groups have different values. ● For each group (g1, g2, …, gn), the result has a tuple (g1, g2, …, gn, a1, a2, …, am) where, for each i, ai is the result of applying the aggregate function Fi to the multi-set of values for attribute Ai in the group.

6. Example. Find the sum and max of salary for part-time employees at each branch. bname G sum salary, max salary (pt_works)
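The partition-then-aggregate meaning of G can be sketched as follows. The function group_aggregate, the pt_works rows, and the output column names such as sum_salary are assumptions made for this illustration.

```python
from collections import defaultdict

def group_aggregate(rows, group_by, aggregates):
    """Partition rows on group_by, then apply each (function, attribute) pair."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[g] for g in group_by)].append(row)

    result = []
    for key, members in groups.items():
        out = dict(zip(group_by, key))
        for func, attr in aggregates:
            # The values form a multi-set: duplicates are kept.
            out[f"{func.__name__}_{attr}"] = func([m[attr] for m in members])
        result.append(out)
    return result

# Hypothetical pt_works relation.
pt_works = [
    {"ename": "Adams", "bname": "Downtown", "salary": 1500},
    {"ename": "Brown", "bname": "Downtown", "salary": 1300},
    {"ename": "Gopal", "bname": "SFU", "salary": 5300},
]

totals = group_aggregate(pt_works, ["bname"], [(sum, "salary"), (max, "salary")])
for row in totals:
    print(row)
```

This prints one row per branch, for example a Downtown row with sum_salary 2800 and max_salary 1500.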

9. Modification of the Database

Up until now, we have looked at extracting information from the database. We also need to add, remove and change information. Modifications are expressed using the assignment operator.

1 Deletion
Deletion is expressed in much the same way as a query. Instead of being displayed, the selected tuples are removed from the database. We can only delete whole tuples. In relational algebra, a deletion is of the form r ← r - E where r is a relation and E is a relational algebra query. Tuples of r that appear in the result of E are deleted. Some examples: 1. Delete all of Smith's account records. deposit ← deposit - σ cname="Smith"(deposit)

2. Delete all loans with loan numbers between 1300 and 1500. borrow ← borrow - σ loan# ≥ 1300 ^ loan# ≤ 1500 (borrow)

3. Delete all accounts at Branches located in Needham.

r1 ← σ bcity="Needham"(deposit ►◄ branch)
r2 ← Π bname, account#, cname, balance (r1)
deposit ← deposit - r2

2 Insertion
To insert data into a relation, we either specify a tuple, or write a query whose result is the set of tuples to be inserted. Attribute values for inserted tuples must be members of the attribute's domain. An insertion is expressed by r ← r U E where r is a relation and E is a relational algebra expression. Some examples: 1. To insert a tuple for Smith who has $1200 in account 9372 at the SFU branch. deposit ← deposit U {("SFU", 9372, "Smith", 1200)}

2. To provide all loan customers in the SFU branch with a $200 savings account.
r1 ← (σ bname="SFU"(borrow))
r2 ← Π bname, loan#, cname (r1)
deposit ← deposit U (r2 × {(200)})

3 Updating
Updating allows us to change some values in a tuple without necessarily changing all. We use the update operator, δ, with the form δ A ← E (r) where r is a relation with attribute A, which is assigned the value of expression E. The expression E is any arithmetic expression involving constants and attributes in relation r.

Some examples: 1. To increase all balances by 5 percent. δ balance ← balance * 1.05(deposit) This statement is applied to every tuple in deposit.

2. To make two different rates of interest payment, depending on balance amount:
δ balance ← balance * 1.06 (σ balance > 10000 (deposit))
δ balance ← balance * 1.05 (σ balance ≤ 10000 (deposit))

Note: in this example the order of the two operations is important.
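Why the order matters can be seen by running the two updates in both orders. This is a sketch under assumptions: balances are whole dollars, interest is computed with integer arithmetic (truncating), and the helper update and the two sample balances are hypothetical.

```python
def update(balances, condition, percent):
    """Apply 'balance ← balance * (1 + percent/100)' to the selected balances."""
    return [b * (100 + percent) // 100 if condition(b) else b for b in balances]

def high(b):
    return b > 10000

def low(b):
    return b <= 10000

balances = [9800, 20000]

# Order used in the notes: the 6% statement first, then the 5% statement.
source_order = update(update(balances, high, 6), low, 5)

# Reversed order: 9800 grows to 10290, crosses the threshold,
# and is then selected again by the 6% statement.
reversed_order = update(update(balances, low, 5), high, 6)

print(source_order)     # [10290, 21200]
print(reversed_order)   # [10907, 21200]
```

In the reversed order the smaller account receives both rates, because the second statement's selection is evaluated on the already updated balances.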

10. Views

We have assumed up to now that the relations we are given are the actual relations stored in the database. For security and convenience reasons, we may wish to create a personalized collection of relations for a user. We use the term view to refer to any relation, not part of the conceptual model, which is made visible to the user as a “virtual relation”. As relations may be modified by deletions, insertions and updates, it is generally not possible to store views. Views must then be recomputed for each query referring to them.

1 View Definition
A view is defined using the create view command: create view v as <query expression> where <query expression> is any legal query expression. The view created is given the name v. To create a view listing all branches and their customers:

create view all_customer as Π bname, cname (deposit) U Π bname, cname (borrow)

Having defined a view, we can now use it to refer to the virtual relation it creates. View names can appear anywhere a relation name can.

We can now find all customers of the SFU branch by writing Π cname (σ bname="SFU"(all_customer))

2 Updates Through Views and Null Values

Updates, insertions and deletions using views can cause problems. The modifications on a view must be transformed to modifications of the actual relations in the conceptual model of the database. An example will illustrate: consider a clerk who needs to see all information in the borrow relation except amount. Let the view loan-info be given to the clerk:
create view loan-info as Π bname, loan#, cname (borrow)
Since SQL allows a view name to appear anywhere a relation name may appear, the clerk can write:
loan-info ← loan-info U {("SFU", 3, "Ruth")}
This insertion is represented by an insertion into the actual relation borrow, from which the view is constructed. However, we have no value for amount. A suitable response would be one of the following: ● Reject the insertion and inform the user. ● Insert ("SFU", 3, "Ruth", null) into the relation. The symbol null represents a null or place-holder value. It says the value is unknown or does not exist.

Another problem with modification through views: consider the view
create view branch_city as Π bname, ccity (borrow ►◄ customer)
This view lists the cities in which the borrowers of each branch live. Now consider the insertion
branch_city ← branch_city U {("Brighton", "Woodside")}
Using nulls is the only possible way to do this. Suppose we do this insertion with nulls, and then consider the expression the view actually corresponds to:

Π bname, ccity (borrow ►◄ customer)

As comparisons involving nulls are always false, this query misses the inserted tuple. To understand why, think about the tuples that got inserted into borrow and customer. Then think about how the view is recomputed for the above query.

3 Views Defined Using Other Views

Views can be defined using other views. E.g., create view sfu-customer as Π cname (σ bname="SFU"(all_customer)) where all_customer is itself a view.

Dependency relationship of views: ● A view v1 is said to depend directly on v2 if v2 is used in the expression defining v1. ● A view v1 is said to depend on v2 if and only if there is a path in the dependency graph from v2 to v1. ● A view relation v is said to be recursive if it depends on itself.

View expansion: a way to define the meaning of views in terms of other views. Let e1 be an expression that may contain view relations.
repeat
  Find any view relation vi in e1
  Replace the view relation vi by the expression defining vi
until no more view relations are present in e1

As long as the view definition is not recursive, this loop will terminate.
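The expansion loop can be sketched with plain string substitution. This is an illustration only: expressions are treated as text, the function expand is hypothetical, and underscores are used in view names so that each name is a single word for the regular expression. As in the notes, a recursive definition would make the loop run forever.

```python
import re

def expand(expression, views):
    """Replace view names by their definitions until none remain."""
    if not views:
        return expression
    pattern = re.compile("|".join(rf"\b{re.escape(name)}\b" for name in views))
    while True:
        expanded = pattern.sub(lambda m: "(" + views[m.group(0)] + ")", expression)
        if expanded == expression:      # no view relation left in e1
            return expression
        expression = expanded

# View definitions taken from the examples above, written as text.
views = {
    "all_customer": "Π bname,cname(deposit) U Π bname,cname(borrow)",
    "sfu_customer": 'Π cname(σ bname="SFU"(all_customer))',
}

e1 = expand("sfu_customer", views)
print(e1)
```

After two passes the expression mentions only the stored relations deposit and borrow.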

Codd's Rules

As the benefits of the relational approach have become more widely perceived, DBMS vendors increasingly claim that their products are 'relational', not always with justification. In 1985 Codd produced the following set of 'rules' by which systems should be judged:

1. Information.

All information is represented at the logical level and in exactly one way - by values in tables.

2. Guaranteed access.

Each datum (atomic value) in a relational database is guaranteed to be logically accessible through a combination of table-name, primary key value and column-name.

3. Systematic treatment of null values.

Null values (distinct from the empty character string or a string of blank characters and distinct from zero or any other number) are supported in a fully relational DBMS for representing missing information and inapplicable information in a systematic way, independent of data type.

4. Database description.

The database description is represented at the logical level in the same way as ordinary data, so that authorized users can apply the same relational language to its interrogation as they apply to the regular data.

5. Comprehensive sub-language.

A relational system may support several languages and various modes of terminal use (for example the fill-in-the-blanks mode). However there must be at least one language whose statements are expressible through some well defined syntax as character strings, and that is comprehensive in supporting all the following items:

1. Data definition 2. View definition 3. Data manipulation (interactive and by program) 4. Integrity constraints 5. Authorization 6. Transaction boundaries (begin, commit and rollback)

6. View updating.

All theoretically-updatable views are also updatable by the system. (A view is theoretically updatable if there is a time-independent algorithm for unambiguously determining a single series of changes to the base tables, having as their effect precisely the requested changes in the view.)

7. Insert and update.

The capability of handling a base table or a derived table as a single operand applies not only to retrieval of data but also to insertion, updating, and deletion. (This allows the system to optimize its execution sequence by determining the best access paths. It may be important in obtaining efficient handling of transactions across a distributed database, avoiding the communications costs of transmitting separate requests for each record obtained from remote sites.)

8. Physical data independence.

Application programs and terminal activities remain logically unimpaired whenever any changes are made in either storage representations or access methods. (There must be a clear distinction between logical and physical design levels.)

9. Logical data independence.

Application programs and terminal activities remain logically unimpaired when information-preserving changes of any kind that theoretically permit un-impairment are made to the base tables. (This rule permits logical database design to be changed dynamically, e.g. by splitting or joining base tables in ways which do not entail loss of information.)

10. Integrity independence.

Integrity constraints must be definable in the relational data sub-language and storable in the catalogue, not in the application program. Certain integrity constraints hold for every relational database; further application-specific rules may be added. The general rules relate to:

1. Entity integrity : no component of a primary key may have a null value. 2. Referential integrity : for each distinct non-null 'foreign key' value in the database, there must exist a matching primary key value from the same domain.

11. Distribution independence.

A relational DBMS has distributional independence - i.e. if a distributed database is used it must be possible to execute all relational operations upon it without knowing or being constrained by the physical locations of data. This must apply both when distribution is originally introduced, and when data is redistributed.

12. Non-subversion.

A low-level (single record-at-a-time) language cannot be used to subvert or bypass the integrity rules and constraints expressed in the higher-level (multiple record-at-a-time) language.

Comparison between HDB, NDB, RDB.

1. The Hierarchical Data Model

The Hierarchical Data Model can be represented as follows: Figure - The Hierarchical Data Model

A hierarchical database consists of the following: ● It contains nodes connected by branches. ● The top node is called the root. ● If multiple nodes appear at the top level, the nodes are called root segments. ● The parent of node nx is a node directly above nx and connected to nx by a branch. ● Each node (with the exception of the root) has exactly one parent. ● The child of node nx is the node directly below nx and connected to nx by a branch. ● One parent may have many children.

By introducing data redundancy, complex network structures can also be represented as hierarchical databases. This redundancy is eliminated in physical implementation by including a 'logical child'. The logical child contains no data but uses a set of pointers to direct the database management system to the physical child in which the data is actually stored. Associated with a logical child are a physical parent and a logical parent. The logical parent provides an alternative (and possibly more efficient) path to retrieve logical child information.

2. The Network Data Model

The Network Data Model can be represented as follows: Figure - The Network Data Model

Like the Hierarchical Data Model, the Network Data Model also consists of nodes and branches, but a child may have multiple parents within the network structure.

HDB and NDB suffer from the following deficiencies (when compared with relational databases): ● Access to the database was not via SQL query strings, but by a specific set of APIs. ● It was not possible to provide a variable WHERE clause. The only selection mechanism was to read entries from a child table for a specific entry on a related parent table, with any filtering being done within the application code. ● It was not possible to provide an ORDER BY clause. Data was presented in the order in which it existed in the database. This mechanism could be tuned by specifying sort criteria to be used when each record was inserted, but this had several disadvantages: → Only a single sort sequence could be defined for each path (link to a parent), so all records retrieved on that path would be provided in that sequence. → It could make inserts rather slow when attempting to insert into the middle of a large collection, or where a table had multiple paths each with its own set of sort criteria.

Normal forms and Normalization

INTRODUCTION

The normal forms defined in relational database theory represent guidelines for record design. The guidelines corresponding to first through fifth normal forms are presented here. The design guidelines are meaningful even if one is not using a relational database system. We present the guidelines without referring to the concepts of the relational model in order to emphasize their generality, and also to make them easier to understand. This conveys an intuitive sense of the intended constraints on record design, although in its informality it may be imprecise in some technical details.

The normalization rules are designed to prevent update anomalies and data inconsistencies. With respect to performance tradeoffs, these guidelines are biased toward the assumption that all non-key fields will be updated frequently. They tend to penalize retrieval, since data which may have been retrievable from one record in an un-normalized design may have to be retrieved from several records in the normalized form. There is no obligation to fully normalize all records when actual performance requirements are taken into account.

A) FIRST NORMAL FORM

First normal form deals with the "shape" of a record type.

Under first normal form, all occurrences of a record type must contain the same number of fields.

First normal form excludes variable repeating fields and groups. This is not so much a design guideline as a matter of definition. Relational database theory doesn't deal with records having a variable number of fields.

B) SECOND AND THIRD NORMAL FORMS

Second and third normal forms deal with the relationship between non-key and key fields.

Under second and third normal forms, a non-key field must provide a fact about the key, the whole key, and nothing but the key. In addition, the record must satisfy first normal form.

We deal now only with "single-valued" facts. The fact could be a one-to-many relationship, such as the department of an employee, or a one-to-one relationship, such as the spouse of an employee. Thus the phrase "Y is a fact about X" signifies a one-to-one or one-to-many relationship between Y and X. In the general case, Y might consist of one or more fields, and so might X. In the following example, QUANTITY is a fact about the combination of PART and WAREHOUSE.

1 Second Normal Form

Second normal form is violated when a non-key field is a fact about a subset of a key. It is only relevant when the key is composite, i.e., consists of several fields. Consider the following inventory record:

---------------------------------------------------
| PART | WAREHOUSE | QUANTITY | WAREHOUSE-ADDRESS |
====================-------------------------------

The key here consists of the PART and WAREHOUSE fields together, but WAREHOUSE-ADDRESS is a fact about the WAREHOUSE alone. The basic problems with this design are:

● The warehouse address is repeated in every record that refers to a part stored in that warehouse.
● If the address of the warehouse changes, every record referring to a part stored in that warehouse must be updated.
● Because of the redundancy, the data might become inconsistent, with different records showing different addresses for the same warehouse.
● If at some point in time there are no parts stored in the warehouse, there may be no record in which to keep the warehouse's address.

To satisfy second normal form, the record shown above should be decomposed into (replaced by) the two records:

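-------------------------------  ---------------------------------
| PART | WAREHOUSE | QUANTITY |  | WAREHOUSE | WAREHOUSE-ADDRESS |
====================-----------  =============--------------------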

When a data design is changed in this way, replacing un-normalized records with normalized records, the process is referred to as normalization. The term "normalization" is sometimes used relative to a particular normal form. Thus a set of records may be normalized with respect to second normal form but not with respect to third.

The normalized design enhances the integrity of the data, by minimizing redundancy and inconsistency, but at some possible performance cost for certain retrieval applications. Consider an application that wants the addresses of all warehouses stocking a certain part. In the un-normalized form, the application searches one record type. With the normalized design, the application has to search two record types, and connect the appropriate pairs.
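The decomposition and its retrieval cost can be sketched in Python. This is only an illustration: the sample rows, the field names and the helper `addresses_stocking` are assumptions made for the example, not part of the original design.

```python
# Un-normalized inventory records; the key is (part, warehouse).
# All sample values are hypothetical.
inventory = [
    {"part": "bolt", "warehouse": "W1", "quantity": 10, "address": "1 Dock Rd"},
    {"part": "nut",  "warehouse": "W1", "quantity": 25, "address": "1 Dock Rd"},
    {"part": "bolt", "warehouse": "W2", "quantity": 7,  "address": "9 Pier St"},
]

# Decompose into two record types that satisfy second normal form.
stock = [
    {"part": r["part"], "warehouse": r["warehouse"], "quantity": r["quantity"]}
    for r in inventory
]
# One address per warehouse: keying by warehouse removes the repetition.
warehouses = {r["warehouse"]: r["address"] for r in inventory}


def addresses_stocking(part):
    """Addresses of all warehouses stocking `part`.

    The normalized design has to consult both record types and connect
    the matching pairs, which is the retrieval cost described above.
    """
    return sorted(warehouses[s["warehouse"]] for s in stock if s["part"] == part)


print(addresses_stocking("bolt"))
```

Each warehouse address is now stored once, so a change of address touches a single entry, at the price of the extra lookup inside `addresses_stocking`.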

2 Third Normal Form

Third normal form is violated when a non-key field is a fact about another non-key field, as in

------------------------------------
| EMPLOYEE | DEPARTMENT | LOCATION |
============------------------------

The EMPLOYEE field is the key. If each department is located in one place, then the LOCATION field is a fact about the DEPARTMENT -- in addition to being a fact about the EMPLOYEE. The problems with this design are the same as those caused by violations of second normal form:

● The department's location is repeated in the record of every employee assigned to that department.
● If the location of the department changes, every such record must be updated.
● Because of the redundancy, the data might become inconsistent, with different records showing different locations for the same department.
● If a department has no employees, there may be no record in which to keep the department's location.

To satisfy third normal form, the record shown above should be decomposed into the two records:

-------------------------  -------------------------
| EMPLOYEE | DEPARTMENT |  | DEPARTMENT | LOCATION |
============-------------  ==============-----------

To summarize, a record is in second and third normal forms if every field is either part of the key or provides a (single-valued) fact about exactly the whole key and nothing else.

3 Functional Dependencies

In relational database theory, second and third normal forms are defined in terms of functional dependencies, which correspond approximately to our single-valued facts. A field Y is "functionally dependent" on a field (or fields) X if it is invalid to have two records with the same X-value but different Y-values. That is, a given X-value must always occur with the same Y-value. When X is a key, then all fields are by definition functionally dependent on X in a trivial way, since there can't be two records having the same X value.
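The definition translates directly into a check over a set of records. The sketch below assumes records are Python dictionaries; the function name `satisfies_fd` and the sample employees are hypothetical. Note that such a check only tells us whether the data at hand is consistent with a dependency, whereas the dependency itself is a design rule about all valid data.

```python
def satisfies_fd(records, x_fields, y_fields):
    """Return True if no two records agree on x_fields but differ on y_fields."""
    seen = {}
    for rec in records:
        x = tuple(rec[f] for f in x_fields)
        y = tuple(rec[f] for f in y_fields)
        # The first Y-value seen for this X-value is remembered;
        # any later, different Y-value breaks the dependency.
        if seen.setdefault(x, y) != y:
            return False
    return True


# Hypothetical sample data.
employees = [
    {"employee": "Jones", "department": "Sales", "location": "Boston"},
    {"employee": "Smith", "department": "Sales", "location": "Boston"},
    {"employee": "Brown", "department": "Audit", "location": "Denver"},
]

print(satisfies_fd(employees, ["department"], ["location"]))  # True
print(satisfies_fd(employees, ["location"], ["employee"]))    # False
```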

There is a slight technical difference between functional dependencies and single-valued facts as we have presented them. Functional dependencies only exist when the things involved have unique and singular identifiers (representations). For example, suppose a person's address is a single-valued fact, i.e., a person has only one address. If we don't provide unique identifiers for people, then there will not be a functional dependency in the data:

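----------------------------------------------
| PERSON     | ADDRESS                       |
|------------+-------------------------------|
| John Smith | 123 Main St., New York        |
| John Smith | 321 Center St., San Francisco |
----------------------------------------------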

Although each person has a unique address, a given name can appear with several different addresses. Hence we do not have a functional dependency corresponding to our single-valued fact.

Similarly, the address has to be spelled identically in each occurrence in order to have a functional dependency. In the following case the same person appears to be living at two different addresses, again precluding a functional dependency.

---------------------------------------
| PERSON     | ADDRESS                |
|------------+------------------------|
| John Smith | 123 Main St., New York |
| John Smith | 123 Main Street, NYC   |
---------------------------------------

We are not defending the use of non-unique or non-singular representations. Such practices often lead to data maintenance problems of their own. We do wish to point out, however, that functional dependencies and the various normal forms are really only defined for situations in which there are unique and singular identifiers. Thus the design guidelines as we present them are a bit stronger than those implied by the formal definitions of the normal forms.

For instance, we as designers know that in the following example there is a single-valued fact about a non-key field, and hence the design is susceptible to all the update anomalies mentioned earlier.

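----------------------------------------------------------
| EMPLOYEE  | FATHER     | FATHER'S-ADDRESS              |
|===========+------------+-------------------------------|
| Art Smith | John Smith | 123 Main St., New York        |
| Bob Smith | John Smith | 123 Main Street, NYC          |
| Cal Smith | John Smith | 321 Center St., San Francisco |
----------------------------------------------------------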

However, in formal terms, there is no functional dependency here between FATHER'S-ADDRESS and FATHER, and hence no violation of third normal form.

C) FOURTH AND FIFTH NORMAL FORMS

Fourth and fifth normal forms deal with multi-valued facts. The multi-valued fact may correspond to a many-to-many relationship, as with employees and skills, or to a many-to-one relationship, as with the children of an employee (assuming only one parent is an employee). By "many-to-many" we mean that an employee may have several skills, and a skill may belong to several employees.

Note that we look at the many-to-one relationship between children and fathers as a single-valued fact about a child but a multi-valued fact about a father.

In a sense, fourth and fifth normal forms are also about composite keys. These normal forms attempt to minimize the number of fields involved in a composite key, as suggested by the examples to follow.

1 Fourth Normal Form

Under fourth normal form, a record type should not contain two or more independent multi-valued facts about an entity. In addition, the record must satisfy third normal form.

The term "independent" will be discussed after considering an example.

Consider employees, skills, and languages, where an employee may have several skills and several languages. We have here two many-to-many relationships, one between employees and skills, and one between employees and languages. Under fourth normal form, these two relationships should not be represented in a single record such as

-------------------------------
| EMPLOYEE | SKILL | LANGUAGE |
===============================

Instead, they should be represented in the two records

--------------------  -----------------------
| EMPLOYEE | SKILL |  | EMPLOYEE | LANGUAGE |
====================  =======================

Note that other fields, not involving multi-valued facts, are permitted to occur in the record, as in the case of the QUANTITY field in the earlier PART/WAREHOUSE example.

The main problem with violating fourth normal form is that it leads to uncertainties in the maintenance policies. Several policies are possible for maintaining two independent multi-valued facts in one record:

(1) A disjoint format, in which a record contains either a skill or a language, but not both:

-------------------------------
| EMPLOYEE | SKILL | LANGUAGE |
|----------+-------+----------|
| Smith    | cook  |          |
| Smith    | type  |          |
| Smith    |       | French   |
| Smith    |       | German   |
| Smith    |       | Greek    |
-------------------------------

This is not much different from maintaining two separate record types. (We note in passing that such a format also leads to ambiguities regarding the meanings of blank fields. A blank SKILL could mean the person has no skill, or the field is not applicable to this employee, or the data is unknown, or, as in this case, the data may be found in another record.)

(2) A random mix, with three variations:

(a) Minimal number of records, with repetitions:

-------------------------------
| EMPLOYEE | SKILL | LANGUAGE |
|----------+-------+----------|
| Smith    | cook  | French   |
| Smith    | type  | German   |
| Smith    | type  | Greek    |
-------------------------------

(b) Minimal number of records, with null values:

-------------------------------
| EMPLOYEE | SKILL | LANGUAGE |
|----------+-------+----------|
| Smith    | cook  | French   |
| Smith    | type  | German   |
| Smith    |       | Greek    |
-------------------------------

(c) Unrestricted:

-------------------------------
| EMPLOYEE | SKILL | LANGUAGE |
|----------+-------+----------|
| Smith    | cook  | French   |
| Smith    | type  |          |
| Smith    |       | German   |
| Smith    | type  | Greek    |
-------------------------------

(3) A "cross-product" form, where for each employee, there must be a record for every possible pairing of one of his skills with one of his languages:

-------------------------------
| EMPLOYEE | SKILL | LANGUAGE |
|----------+-------+----------|
| Smith    | cook  | French   |
| Smith    | cook  | German   |
| Smith    | cook  | Greek    |
| Smith    | type  | French   |
| Smith    | type  | German   |
| Smith    | type  | Greek    |
-------------------------------

Other problems caused by violating fourth normal form are similar in spirit to those mentioned earlier for violations of second or third normal form. They take different variations depending on the chosen maintenance policy:

● If there are repetitions, then updates have to be done in multiple records, and they could become inconsistent.
● Insertion of a new skill may involve looking for a record with a blank skill, or inserting a new record with a possibly blank language, or inserting multiple records pairing the new skill with some or all of the languages.
● Deletion of a skill may involve blanking out the skill field in one or more records (perhaps with a check that this doesn't leave two records with the same language and a blank skill), or deleting one or more records, coupled with a check that the last mention of some language hasn't also been deleted.

Fourth normal form minimizes such update problems.

Independence

We mentioned independent multi-valued facts earlier, and we now illustrate what we mean in terms of the example. The two many-to-many relationships, employee:skill and employee:language, are "independent" in that there is no direct connection between skills and languages. There is only an indirect connection because they belong to some common employee. That is, it does not matter which skill is paired with which language in a record; the pairing does not convey any information. That's precisely why all the maintenance policies mentioned earlier can be allowed.

In contrast, suppose that an employee could only exercise certain skills in certain languages. Perhaps Smith can cook French cuisine only, but can type in French, German, and Greek. Then the pairings of skills and languages become meaningful, and there is no longer an ambiguity of maintenance policies. In the present case, only the following form is correct:

-------------------------------
| EMPLOYEE | SKILL | LANGUAGE |
|----------+-------+----------|
| Smith    | cook  | French   |
| Smith    | type  | French   |
| Smith    | type  | German   |
| Smith    | type  | Greek    |
-------------------------------

Thus the employee:skill and employee:language relationships are no longer independent. These records do not violate fourth normal form. When there is an interdependence among the relationships, then it is acceptable to represent them in a single record.

Multi-valued Dependencies

For readers interested in pursuing the technical background of fourth normal form a bit further, we mention that fourth normal form is defined in terms of multivalued dependencies, which correspond to our independent multi-valued facts. Multivalued dependencies, in turn, are defined essentially as relationships which accept the "cross-product" maintenance policy mentioned above. That is, for our example, every one of an employee's skills must appear paired with every one of his languages. It may or may not be obvious to the reader that this is equivalent to our notion of independence: since every possible pairing must be present, there is no "information" in the pairings. Such pairings convey information only if some of them can be absent, that is, only if it is possible that some employee cannot perform some skill in some language. If all pairings are always present, then the relationships are really independent.
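The "every pairing must be present" condition can be tested mechanically. The following Python sketch assumes records are (employee, skill, language) tuples; the function name `is_cross_product` is hypothetical. It reports whether a set of records follows the cross-product policy, which is the situation in which the pairings carry no information.

```python
from itertools import product


def is_cross_product(records):
    """True if, for each employee, every skill is paired with every language."""
    skills, languages, pairs = {}, {}, {}
    for emp, skill, lang in records:
        skills.setdefault(emp, set()).add(skill)
        languages.setdefault(emp, set()).add(lang)
        pairs.setdefault(emp, set()).add((skill, lang))
    return all(
        pairs[emp] == set(product(skills[emp], languages[emp])) for emp in pairs
    )


# Independent facts: all six pairings are present.
independent = [
    ("Smith", s, l) for s in ("cook", "type") for l in ("French", "German", "Greek")
]
# Dependent facts: Smith cooks French cuisine only.
dependent = [
    ("Smith", "cook", "French"),
    ("Smith", "type", "French"),
    ("Smith", "type", "German"),
    ("Smith", "type", "Greek"),
]

print(is_cross_product(independent))  # True
print(is_cross_product(dependent))    # False
```

When the check can fail for valid data, as in the second case, the absent pairings mean something and the single three-field record is the right representation.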

We should also point out that multivalued dependencies and fourth normal form apply as well to relationships involving more than two fields. For example, suppose we extend the earlier example to include projects, in the following sense:

● An employee uses certain skills on certain projects.
● An employee uses certain languages on certain projects.

If there is no direct connection between the skills and languages that an employee uses on a project, then we could treat this as two independent many-to-many relationships of the form EP:S and EP:L, where "EP" represents a combination of an employee with a project. A record including employee, project, skill, and language would violate fourth normal form. Two records, containing fields E,P,S and E,P,L, respectively, would satisfy fourth normal form.

2 Fifth Normal Form

Fifth normal form deals with cases where information can be reconstructed from smaller pieces of information that can be maintained with less redundancy. Second, third, and fourth normal forms also serve this purpose, but fifth normal form generalizes to cases not covered by the others.

We will not attempt a comprehensive exposition of fifth normal form, but illustrate the central concept with a commonly used example, namely one involving agents, companies, and products. If agents represent companies, companies make products, and agents sell products, then we might want to keep a record of which agent sells which product for which company. This information could be kept in one record type with three fields:

-----------------------------
| AGENT | COMPANY | PRODUCT |
|-------+---------+---------|
| Smith | Ford    | car     |
| Smith | GM      | truck   |
-----------------------------

This form is necessary in the general case. For example, although agent Smith sells cars made by Ford and trucks made by GM, he does not sell Ford trucks or GM cars. Thus we need the combination of three fields to know which combinations are valid and which are not.

But suppose that a certain rule was in effect: if an agent sells a certain product, and he represents a company making that product, then he sells that product for that company.

-----------------------------
| AGENT | COMPANY | PRODUCT |
|-------+---------+---------|
| Smith | Ford    | car     |
| Smith | Ford    | truck   |
| Smith | GM      | car     |
| Smith | GM      | truck   |
| Jones | Ford    | car     |
-----------------------------

In this case, it turns out that we can reconstruct all the true facts from a normalized form consisting of three separate record types, each containing two fields:

-------------------   ---------------------   -------------------
| AGENT | COMPANY |   | COMPANY | PRODUCT |   | AGENT | PRODUCT |
|-------+---------|   |---------+---------|   |-------+---------|
| Smith | Ford    |   | Ford    | car     |   | Smith | car     |
| Smith | GM      |   | Ford    | truck   |   | Smith | truck   |
| Jones | Ford    |   | GM      | car     |   | Jones | car     |
-------------------   | GM      | truck   |   -------------------
                      ---------------------

These three record types are in fifth normal form, whereas the corresponding three-field record shown previously is not.

Roughly speaking, we may say that a record type is in fifth normal form when its information content cannot be reconstructed from several smaller record types, i.e., from record types each having fewer fields than the original record. The case where all the smaller records have the same key is excluded. If a record type can only be decomposed into smaller records which all have the same key, then the record type is considered to be in fifth normal form without decomposition. A record type in fifth normal form is also in fourth, third, second, and first normal forms.

Fifth normal form does not differ from fourth normal form unless there exists a symmetric constraint such as the rule about agents, companies, and products. In the absence of such a constraint, a record type in fourth normal form is always in fifth normal form.

One advantage of fifth normal form is that certain redundancies can be eliminated. In the normalized form, the fact that Smith sells cars is recorded only once; in the unnormalized form it may be repeated many times.

It should be observed that although the normalized form involves more record types, there may be fewer total record occurrences. This is not apparent when there are only a few facts to record, as in the example shown above. The advantage is realized as more facts are recorded, since the size of the normalized files increases in an additive fashion, while the size of the unnormalized file increases in a multiplicative fashion. For example, if we add a new agent who sells x products for y companies, where each of these companies makes each of these products, we have to add x+y new records to the normalized form, but xy new records to the unnormalized form.

It should be noted that all three record types are required in the normalized form in order to reconstruct the same information. From the first two record types shown above we learn that Jones represents Ford and that Ford makes trucks. But we can't determine whether Jones sells Ford trucks until we look at the third record type to determine whether Jones sells trucks at all.

The following example illustrates a case in which the rule about agents, companies, and products is satisfied, and which clearly requires all three record types in the normalized form. Any two of the record types taken alone will imply something untrue.

-----------------------------
| AGENT | COMPANY | PRODUCT |
|-------+---------+---------|
| Smith | Ford    | car     |
| Smith | Ford    | truck   |
| Smith | GM      | car     |
| Smith | GM      | truck   |
| Jones | Ford    | car     |
| Jones | Ford    | truck   |
| Brown | Ford    | car     |
| Brown | GM      | car     |
| Brown | Toyota  | car     |
| Brown | Toyota  | bus     |
-----------------------------

Fifth Normal Form

-------------------   ---------------------   -------------------
| AGENT | COMPANY |   | COMPANY | PRODUCT |   | AGENT | PRODUCT |
|-------+---------|   |---------+---------|   |-------+---------|
| Smith | Ford    |   | Ford    | car     |   | Smith | car     |
| Smith | GM      |   | Ford    | truck   |   | Smith | truck   |
| Jones | Ford    |   | GM      | car     |   | Jones | car     |
| Brown | Ford    |   | GM      | truck   |   | Jones | truck   |
| Brown | GM      |   | Toyota  | car     |   | Brown | car     |
| Brown | Toyota  |   | Toyota  | bus     |   | Brown | bus     |
-------------------   ---------------------   -------------------

Observe that:

● Jones sells cars and GM makes cars, but Jones does not represent GM.
● Brown represents Ford and Ford makes trucks, but Brown does not sell trucks.
● Brown represents Ford and Brown sells buses, but Ford does not make buses.

Fourth and fifth normal forms both deal with combinations of multivalued facts. One difference is that the facts dealt with under fifth normal form are not independent, in the sense discussed earlier. Another difference is that, although fourth normal form can deal with more than two multivalued facts, it only recognizes them in pairwise groups. We can best explain this in terms of the normalization process implied by fourth normal form. If a record violates fourth normal form, the associated normalization process decomposes it into two records, each containing fewer fields than the original record. Any of those violating fourth normal form is again decomposed into two records, and so on until the resulting records are all in fourth normal form. At each stage, the set of records after decomposition contains exactly the same information as the set of records before decomposition.

In the present example, no pairwise decomposition is possible. There is no combination of two smaller records which contains the same total information as the original record. All three of the smaller records are needed. Hence an information-preserving pairwise decomposition is not possible, and the original record is not in violation of fourth normal form. Fifth normal form is needed in order to deal with the redundancies in this case.
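The claim that all three smaller record types are needed can be checked on the ten-record example. The sketch below is an illustration in Python using sets of tuples; the variable names are assumptions. It projects the three-field records onto the three two-field record types, rejoins them, and compares the result with a join of only two of them.

```python
# The ten agent/company/product records from the example above.
records = {
    ("Smith", "Ford", "car"), ("Smith", "Ford", "truck"),
    ("Smith", "GM", "car"), ("Smith", "GM", "truck"),
    ("Jones", "Ford", "car"), ("Jones", "Ford", "truck"),
    ("Brown", "Ford", "car"), ("Brown", "GM", "car"),
    ("Brown", "Toyota", "car"), ("Brown", "Toyota", "bus"),
}

# The three two-field record types of the normalized form.
agent_company = {(a, c) for a, c, p in records}
company_product = {(c, p) for a, c, p in records}
agent_product = {(a, p) for a, c, p in records}

# Join of only two record types: agent-company with company-product.
two_way = {
    (a, c, p)
    for a, c in agent_company
    for c2, p in company_product
    if c == c2
}
# Join of all three: keep a combination only if the agent sells the product.
three_way = {(a, c, p) for a, c, p in two_way if (a, p) in agent_product}

print(three_way == records)       # True: nothing lost, nothing invented
print(sorted(two_way - records))  # untrue facts implied by two record types alone
```

The second line of output lists Brown selling Ford and GM trucks, which is exactly the kind of untrue fact that the third record type rules out.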

Integrity Constraints

Integrity constraints provide a way of ensuring that changes made to the database by authorized users do not result in a loss of data consistency.

We saw a form of integrity constraint with E-R models:
♦ key declarations: stipulation that certain attributes form a candidate key for the entity set.
♦ form of a relationship: mapping cardinalities 1-1, 1-many and many-many.

An integrity constraint can be any arbitrary predicate applied to the database. Such predicates may be costly to evaluate, so we will only consider integrity constraints that can be tested with minimal overhead.

1 Domain Constraints

A domain of possible values should be associated with every attribute. These domain constraints are the most basic form of integrity constraint. They are easy to test for when data is entered.

Domain types
(a) Attributes may have the same domain, e.g. cname and employee-name.
(b) It is not as clear whether bname and cname domains ought to be distinct.
(c) At the implementation level, they are both character strings.
(d) At the conceptual level, we do not expect customers to have the same names as branches, in general.
(e) Strong typing of domains allows us to test for values inserted, and whether queries make sense. Newer systems, particularly object-oriented database systems, offer a rich set of domain types that can be extended easily.

The check clause in SQL-92 permits domains to be restricted in powerful ways that most programming language type systems do not permit.

(a) The check clause permits the schema designer to specify a predicate that must be satisfied by any value assigned to a variable whose type is the domain.
(b) Examples:

    create domain hourly-wage numeric(5,2)
        constraint wage-value-test check(value >= 4.00)

Note that "constraint wage-value-test" is optional (it gives a name to the test, to signal which constraint is violated).

    create domain account-number char(10)
        constraint account-number-null-test check(value not null)

    create domain account-type char(10)
        constraint account-type-test check(value in ("Checking", "Saving"))
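The effect of such a test can be tried out with Python's standard sqlite3 module. This is a sketch under stated assumptions: SQLite has no create domain statement, so the same predicate is attached to a column as a named CHECK constraint instead, and the employee table and its rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite has no CREATE DOMAIN; the wage test is written as a column constraint.
conn.execute(
    """
    CREATE TABLE employee (
        name        TEXT NOT NULL,
        hourly_wage NUMERIC CONSTRAINT wage_value_test CHECK (hourly_wage >= 4.00)
    )
    """
)
conn.execute("INSERT INTO employee VALUES ('Jones', 12.50)")  # passes the test

try:
    conn.execute("INSERT INTO employee VALUES ('Smith', 3.25)")  # below 4.00
    rejected = False
except sqlite3.IntegrityError as exc:
    rejected = True
    print("rejected:", exc)

row_count = conn.execute("SELECT COUNT(*) FROM employee").fetchone()[0]
print(rejected, row_count)
```

The violating row is refused when it is entered, which is what makes domain constraints cheap to test.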

2 Referential Integrity

Often we wish to ensure that a value appearing in a relation for a given set of attributes also appears for another set of attributes in another relation. This is called referential integrity.

1 Basic Concepts

1. Dangling tuples.
♦ Consider a pair of relations r(R) and s(S), and the natural join r ⋈ s.
♦ There may be a tuple tr in r that does not join with any tuple in s.
♦ That is, there is no tuple ts in s such that tr[R ∩ S] = ts[R ∩ S].
♦ We call this a dangling tuple.
♦ Dangling tuples may or may not be acceptable.

2. Suppose there is a tuple t1 in the account relation with the value t1[bname] = "Lunartown", but no matching tuple in the branch relation for the Lunartown branch. This is undesirable, as t1 should refer to a branch that exists. Now suppose there is a tuple t2 in the branch relation with t2[bname] = "Mokan", but no matching tuple in the account relation for the Mokan branch. This means that a branch exists for which no accounts exist. This is possible, for example, when a branch is being opened. We want to allow this situation.

3. Note the distinction between these two situations: bname is the primary key of branch, while it is not for account. In account, bname is a foreign key, being the primary key of another relation.
● Let r1(R1) and r2(R2) be two relations with primary keys K1 and K2 respectively.
● We say that a subset α of R2 is a foreign key referencing K1 in relation r1 if it is required that for every tuple t2 in r2 there must be a tuple t1 in r1 such that t1[K1] = t2[α].
● We call these requirements referential integrity constraints.
● They are also known as subset dependencies, as we require Πα(r2) ⊆ ΠK1(r1).

Referential Integrity in the E-R Model

These constraints arise frequently. Every relation arising from a relationship set has referential integrity constraints. Figure shows an n-ary relationship set R relating entity sets E1, E2, …, En.

● Let Ki denote the primary key of Ei.
● The attributes of the relation scheme for relationship set R include K1 ∪ K2 ∪ … ∪ Kn.
● Each Ki in the scheme for R is a foreign key that leads to a referential integrity constraint.

Relation schemes for weak entity sets must include the primary key of the strong entity set on which they are existence dependent. This is a foreign key, which leads to another referential integrity constraint.

Database Modification

Database modifications can cause violations of referential integrity. To preserve the referential integrity constraint Πα(r2) ⊆ ΠK1(r1), the following tests are needed:

♦ Insert: if a tuple t2 is inserted into r2, the system must ensure that there is a tuple t1 in r1 such that t1[K1] = t2[α], i.e. t2[α] ∈ ΠK1(r1).

♦ Delete: if a tuple t1 is deleted from r1, the system must compute the set of tuples in r2 that reference t1: σα=t1[K1](r2). If this set is not empty, either reject the delete command, or also delete the tuples that reference t1.

♦ Update: two cases

i) Updates to the referencing relation: a test similar to the insert case must be made, ensuring that if t'2 is the new value of the tuple, then t'2[α] ∈ ΠK1(r1).

ii) Updates to the referenced relation: a test similar to the delete case must be made. If the update modifies values for the primary key, the system must compute σα=t1[K1](r2) to ensure that we are not removing a value referenced by tuples in r2.
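The insert and delete tests can be observed with Python's standard sqlite3 module. This is a sketch under stated assumptions: SQLite only enforces foreign keys after `PRAGMA foreign_keys = ON`, the column names avoid the # character, and the Downtown branch and account A-101 are hypothetical sample data. The helper `attempt` is likewise only for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # not enforced by SQLite by default
conn.execute("CREATE TABLE branch (bname TEXT PRIMARY KEY)")
conn.execute(
    "CREATE TABLE account ("
    " account_no TEXT PRIMARY KEY,"
    " bname TEXT NOT NULL REFERENCES branch(bname),"
    " balance NUMERIC)"
)
conn.execute("INSERT INTO branch VALUES ('Mokan')")     # no accounts yet: allowed
conn.execute("INSERT INTO branch VALUES ('Downtown')")
conn.execute("INSERT INTO account VALUES ('A-101', 'Downtown', 500)")


def attempt(sql):
    """Run one statement; return False if it violates an integrity constraint."""
    try:
        conn.execute(sql)
        return True
    except sqlite3.IntegrityError:
        return False


# Insert into the referencing relation with no matching branch: rejected.
insert_dangling = attempt("INSERT INTO account VALUES ('A-102', 'Lunartown', 100)")
# Delete from the referenced relation while an account still refers to it: rejected.
delete_referenced = attempt("DELETE FROM branch WHERE bname = 'Downtown'")
# Delete a branch that no account references: accepted.
delete_unreferenced = attempt("DELETE FROM branch WHERE bname = 'Mokan'")

print(insert_dangling, delete_referenced, delete_unreferenced)
```

Here the system takes the "reject" option for deletes. Declaring the column with `ON DELETE CASCADE` would select the other option named above, deleting the referencing tuples as well.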

Assertions

1. An assertion is a predicate expressing a condition we wish the database to always satisfy.
2. Domain constraints, functional dependencies and referential integrity are special forms of assertion.
3. Where a constraint cannot be expressed in these forms, we use an assertion, e.g.
● Ensuring the sum of loan amounts for each branch is less than the sum of all account balances at the branch.
● Ensuring every loan customer keeps a minimum of $1000 in an account.
4. An assertion takes the form

    create assertion {assertion-name} check {predicate}

5. The two assertions mentioned above can be written as follows. Ensuring the sum of loan amounts for each branch is less than the sum of all account balances at the branch:

    create assertion sum-constraint check
        (not exists (select * from branch
            where (select sum(amount) from loan
                   where loan.bname = branch.bname)
               >= (select sum(amount) from account
                   where account.bname = branch.bname)))

6. Ensuring every loan customer keeps a minimum of $1000 in an account:

    create assertion balance-constraint check
        (not exists (select * from loan L
            where not exists (select * from borrower B, depositor D, account A
                where L.loan# = B.loan#
                  and B.cname = D.cname
                  and D.account# = A.account#
                  and A.balance >= 1000)))

7. When an assertion is created, the system tests it for validity. If the assertion is valid, any further modification to the database is allowed only if it does not cause that assertion to be violated. This testing may result in significant overhead if the assertions are complex. Because of this, assertions should be used with great care.

8. Some system developers omit support for general assertions, or provide specialized forms of assertion that are easier to test.

Triggers

1. A trigger is a statement that is automatically executed by the system as a side effect of a modification to the database.
2. We need to
♦ Specify the conditions under which the trigger is executed.
♦ Specify the actions to be taken by the trigger.
3. For example, suppose that an overdraft is intended to result in the account balance being set to zero, and a loan being created for the overdraft amount. The trigger actions for tuple t with a negative balance are then
● Insert a new tuple s in the borrow relation with
    s[bname] = t[bname]
    s[loan#] = t[account#]
    s[amount] = -t[balance]
    s[cname] = t[cname]
● We need to negate balance to get amount, as balance is negative.
● Set t[balance] to 0.
Note that this is not a good example. What would happen if the customer already had a loan?
4. To write this trigger in terms of the original System R trigger:

    define trigger overdraft
    on update of account T
    (if new T.balance < 0
     then (insert into loan values
               (T.bname, T.account#, - new T.balance)
           insert into borrower
               (select cname, account#
                from depositor
                where T.account# = depositor.account#)
           update account S
           set S.balance = 0
           where S.account# = T.account#))
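For comparison, the same overdraft behaviour can be sketched with Python's standard sqlite3 module. The assumptions are stated plainly: this uses SQLite's own CREATE TRIGGER syntax rather than the System R form above, the column names avoid the # character, and the Downtown account and depositor Jones are hypothetical sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE account   (account_no TEXT PRIMARY KEY, bname TEXT, balance NUMERIC);
    CREATE TABLE depositor (cname TEXT, account_no TEXT);
    CREATE TABLE loan      (bname TEXT, loan_no TEXT, amount NUMERIC);
    CREATE TABLE borrower  (cname TEXT, loan_no TEXT);

    -- Condition: an update leaves the balance negative.
    CREATE TRIGGER overdraft AFTER UPDATE OF balance ON account
    WHEN NEW.balance < 0
    BEGIN
        -- Actions: create a loan for the overdraft amount (negated balance),
        -- record the account's depositors as borrowers, reset the balance.
        INSERT INTO loan VALUES (NEW.bname, NEW.account_no, -NEW.balance);
        INSERT INTO borrower
            SELECT cname, account_no FROM depositor
            WHERE depositor.account_no = NEW.account_no;
        UPDATE account SET balance = 0 WHERE account_no = NEW.account_no;
    END;

    INSERT INTO account   VALUES ('A-101', 'Downtown', 100);
    INSERT INTO depositor VALUES ('Jones', 'A-101');
    UPDATE account SET balance = balance - 250 WHERE account_no = 'A-101';
    """
)

balance = conn.execute("SELECT balance FROM account").fetchone()[0]
loans = conn.execute("SELECT bname, loan_no, amount FROM loan").fetchall()
borrowers = conn.execute("SELECT cname, loan_no FROM borrower").fetchall()
print(balance, loans, borrowers)
```

Withdrawing 250 from a balance of 100 leaves the account at zero and creates a loan of 150. The weakness noted above remains: the loan number is simply the account number, so a second overdraft on the same account would produce a second loan tuple with the same number.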

Functional Dependencies

Basic Concepts

1. Functional dependencies.
● Functional dependencies are a constraint on the set of legal relations in a database.
● They allow us to express facts about the real world we are modeling.
● The notion generalizes the idea of a superkey.
● Let α ⊆ R and β ⊆ R.
● Then the functional dependency α → β holds on R if in any legal relation r(R), for all pairs of tuples t1 and t2 in r such that t1[α] = t2[α], it is also the case that t1[β] = t2[β].
● Using this notation, we say K is a superkey of R if K → R.
● In other words, K is a superkey of R if, whenever t1[K] = t2[K], then t1[R] = t2[R] (and thus t1 = t2).

2. Functional dependencies allow us to express constraints that cannot be expressed with superkeys.

3. Consider the scheme

    Loan-info-schema = (bname, loan#, cname, amount)

If a loan may be made jointly to several people (e.g. husband and wife), then we would not expect loan# to be a superkey. That is, there is no such dependency

    loan# → cname

We do, however, expect the functional dependencies

    loan# → amount
    loan# → bname

to hold, as a loan number can only be associated with one amount and one branch.

4. A set F of functional dependencies can be used in two ways:
● To specify constraints on the set of legal relations. (Does F hold on R?)
● To test relations to see if they are legal under a given set of functional dependencies. (Does r satisfy F?)

5. Figure shows a relation r that we can examine.

6. We can see that A → C is satisfied (in this particular relation), but C → A is not. AB → C is also satisfied.
7. Functional dependencies are called trivial if they are satisfied by all relations.
8. In general, a functional dependency α → β is trivial if β ⊆ α.
9. In the customer relation, we see that street → ccity is satisfied by this relation. However, as in the real world two cities can have streets with the same names (e.g. Main, Broadway, etc.), we would not include this functional dependency in our list meant to hold on Customer-scheme.

10. The list of functional dependencies for the example database is:

● On Branch-scheme: bname → bcity, bname → assets
● On Customer-scheme: cname → ccity, cname → street
● On Loan-scheme: loan# → amount, loan# → bname
● On Account-scheme: account# → balance, account# → bname

There are no functional dependencies for Borrower-scheme, nor for Depositor-scheme.

Closure of a Set of Functional Dependencies

1. We need to consider all functional dependencies that hold. Given a set F of functional dependencies, we can prove that certain others also hold. We say these are logically implied by F.

2. Suppose we are given a relation scheme R = (A, B, C, G, H, I) and the set of functional dependencies:

A → B
A → C
CG → H
CG → I
B → H

Then the functional dependency A → H is logically implied.

3. To see why, let t1 and t2 be tuples such that t1[A] = t2[A]. As we are given A → B, it follows that t1[B] = t2[B]. Further, since we also have B → H, we must have t1[H] = t2[H]. Thus, whenever two tuples have the same value on A, they must also have the same value on H, and we can say that A → H.

4. The closure of a set F of functional dependencies is the set of all functional dependencies logically implied by F.

5. We denote the closure of F by F+.

6. To compute F+, we can use some rules of inference called Armstrong's Axioms:

● Reflexivity rule: if α is a set of attributes and β ⊆ α, then α → β holds.
● Augmentation rule: if α → β holds, and γ is a set of attributes, then γα → γβ holds.
● Transitivity rule: if α → β holds, and β → γ holds, then α → γ holds.
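The three axioms can be read as operations that build new dependencies from old ones. The sketch below is illustrative only: it assumes single-letter attribute names and represents a dependency as a pair of frozensets, and the helper names fd, reflexivity, augmentation, transitivity and show are hypothetical. It replays the earlier argument that A → H follows from A → B and B → H.

```python
def fd(lhs, rhs):
    """Build a dependency lhs -> rhs from strings of single-letter attributes."""
    return (frozenset(lhs), frozenset(rhs))


def reflexivity(alpha, beta):
    """If beta is a subset of alpha, then alpha -> beta holds."""
    alpha, beta = frozenset(alpha), frozenset(beta)
    if not beta <= alpha:
        raise ValueError("reflexivity requires beta to be a subset of alpha")
    return (alpha, beta)


def augmentation(dep, gamma):
    """From alpha -> beta, derive (gamma alpha) -> (gamma beta)."""
    alpha, beta = dep
    gamma = frozenset(gamma)
    return (gamma | alpha, gamma | beta)


def transitivity(dep1, dep2):
    """From alpha -> beta and beta -> gamma, derive alpha -> gamma."""
    alpha, beta = dep1
    beta2, gamma = dep2
    if beta != beta2:
        raise ValueError("right side of dep1 must equal left side of dep2")
    return (alpha, gamma)


def show(dep):
    """Format a dependency with attributes in alphabetical order."""
    lhs, rhs = dep
    return "".join(sorted(lhs)) + " -> " + "".join(sorted(rhs))


a_to_b = fd("A", "B")
b_to_h = fd("B", "H")

a_to_h = transitivity(a_to_b, b_to_h)
print(show(a_to_h))                     # A -> H
print(show(reflexivity("AB", "A")))     # AB -> A
print(show(augmentation(a_to_b, "G")))  # AG -> BG
```

Each function raises an error when its precondition fails, so a derivation that runs to completion uses only valid rule applications.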

7. These rules are sound: they do not generate any incorrect functional dependencies. They are also complete: they generate all of F+.

8. To make life easier we can use some additional rules, derivable from Armstrong's Axioms:

● Union rule: if α → β and α → γ, then α → βγ holds.
● Decomposition rule: if α → βγ holds, then α → β and α → γ both hold.
● Pseudotransitivity rule: if α → β holds, and γβ → δ holds, then αγ → δ holds.

9. Applying these rules to the scheme and set F mentioned above, we can derive the following:

● A → H, by the transitivity rule, as we saw.
● CG → HI, by the union rule applied to CG → H and CG → I.
● AG → I, by several steps:
♦ Note that A → C holds.
♦ Then AG → CG, by the augmentation rule.
♦ Since CG → I is given, AG → I follows by transitivity.
(You might notice that this is actually pseudotransitivity if done in one step.)
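One way to double-check derivations like these is the attribute closure test, a standard technique that these notes have not described at this point: α → β is in F+ exactly when β is contained in the set of attributes determined by α under F. The sketch below is a minimal version under the assumption of single-letter attribute names; the function names attribute_closure and implies are hypothetical.

```python
def attribute_closure(attrs, fds):
    """Return the set of attributes determined by attrs under fds.

    fds is a list of (lhs, rhs) pairs, each a string of single-letter
    attribute names.
    """
    closure = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            # If the whole left side is already determined, so is the right.
            if set(lhs) <= closure and not set(rhs) <= closure:
                closure |= set(rhs)
                changed = True
    return closure


def implies(fds, lhs, rhs):
    """Return True if lhs -> rhs is logically implied by fds."""
    return set(rhs) <= attribute_closure(lhs, fds)


# The set F on R = (A, B, C, G, H, I) from the example.
F = [("A", "B"), ("A", "C"), ("CG", "H"), ("CG", "I"), ("B", "H")]

print(implies(F, "A", "H"))    # True
print(implies(F, "CG", "HI"))  # True
print(implies(F, "AG", "I"))   # True
print(implies(F, "C", "A"))    # False
print(sorted(attribute_closure("AG", F)))  # ['A', 'B', 'C', 'G', 'H', 'I']
```

The last line shows that AG determines every attribute of R under F, which is consistent with AG → I.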