University of Northumbria at Newcastle

School of Informatics

BSc in Computing

CM036 Advanced Database

Date: January 2004
Time Allowed: 3 hours
Reading Time: 10 minutes

Instructions to candidates

1. FIVE questions are set.
   o Candidates must attempt THREE questions:
     1. Question 1 in Section A is compulsory;
     2. ONE question from Section B; and
     3. ONE question from Section C.
   o The questions are of equal value.
   o The number of marks for each part of each question is shown in brackets.

2. This is an Open Book Examination.

3. Students are allowed to use calculators.

4. Questions 2-5 refer to the attached Appendix: DreamHome Estate Agent Database System.

SECTION A: YOU MUST ANSWER THIS QUESTION

Q. 1 This is a multiple-choice question. A correct answer scores 2.5 marks (making a possible total of 25 marks); a wrong answer or an unattempted part scores zero. Read all of the options for each part before choosing an answer. Choose only one answer for each part of the question.

(I) Which one of the following statements about a B+ tree Index is not true?

a) In practice each node in a tree is actually a page, so we could randomly access one record in a file containing 100 million records using 3 or 4 I/Os.

b) A B+ tree index is better than a Hash index for the following selection query: σ StaffId = 6500 (R), where R is a relation with the columns (StaffId, Name, Sal, Dept), and StaffId is the unique identifier (on which Hash index is defined).

c) It is costly in terms of update time to maintain a balanced tree, as every time the file is updated (a new record inserted or deleted), the index needs to be updated.

d) Because of the structure of the tree, i.e. by keeping it balanced with a constant depth, it always takes approximately the same time to access any data record in the file.

e) A B+ tree index should not be used for the following selection query: σ StaffId op 6500 (R), where R is a relation with the columns (StaffId, Name, Sal, Dept), StaffId is the unique identifier (on which Hash index is defined) and the operator op is either < or >.

(II) A transaction is a logical unit of work on a database, and should either be performed in its entirety or not at all. The operations of two transactions may be interleaved to achieve concurrent execution. Which one of the following statements is true?

a) Locking and time stamping are techniques used to ensure that if one of a concurrent set of transactions fails to commit, the other(s) will also have to be ‘locked’, i.e. not committed.

b) If two transactions achieve correct results when executed, then the interleaving of these two transactions will always produce the same correct results.

c) In order to achieve a correct result, one transaction must be committed before the other is allowed to begin.

d) Serializability is the concurrent execution of two transactions in such a way that the result produced is the same as would have occurred if the two transactions were run separately in a sequential order.

e) None of the above is true

(III) Which one of the following pairs of relational algebra expressions is equivalent?

a) (A ⋈ B) - C and A ⋈ (C - B)

b) (σ salary<15000 (A)) ⋃ B and σ salary<15000 (A ⋂ B)

c) (A ⋈ B) ⋈ C and C ⋈ (B ⋈ A)

d) Expressions in options (a) to (c) are equivalent.

e) Expressions in options (a) to (c) are not equivalent.

(IV) A distributed database can be fragmented in a number of different ways, each of which may be of benefit to a different proposed usage of the database. Which one of the following statements is true?

a) Complete replication is of most benefit where storage cost and communication costs for updates are of primary concern, whilst performance is of lesser concern.

b) Horizontal fragmentation is of most benefit where a number of separate applications located in different locations need to access the same tuples, and those tuples need to appear in all fragments.

c) Vertical fragmentation is of most benefit where the selected columns in the different fragments cannot be used to reconstruct the original relation.

d) Mixed fragmentation is of most benefit only if the fragments satisfy the correctness rules of completeness, reconstruction and disjointness.

e) Selective replication has none of the benefits of the other strategies (i.e., a, b, c and d). It is only used where the disadvantages of the other strategies need to be avoided.

(V) Object-Oriented Databases have been developed to enable users to take advantage of the object-oriented concepts being used by systems analysts and programmers. Which one of the following statements is not true?

a) Object identifiers (OIDs) are used to model the bi-directional relationships between objects.

b) Many-to-many associations can be implemented directly with the use of collection types.

c) All the benefits of objects can be incorporated, e.g. inheritance, encapsulation.

d) Abstract Data Types can be used by the developer to model the complex data structures present in the real world.

e) Associations between object types can be modelled in a similar way to relationships between entities. When implementing these associations, the primary key/foreign key mechanism should be used.

(VI) A trigger is a compound statement that is executed automatically by the DBMS when a modification is made to a table. Which one of the following statements is true?

a) A trigger can be used for purposes such as maintaining complex integrity constraints, maintaining audit information, supporting replication, and validating input data.

b) An AFTER INSERT trigger is executed after a new row has been inserted into a table, and requires a FOR EACH clause to state that the trigger will be fired for every row that is inserted.

c) A BEFORE UPDATE trigger is executed before a row has been updated, and does not require a FOR EACH clause, as an UPDATE can validly update multiple rows in a table.

d) (a) – (c) are all true

e) (a) – (c) are all false

(VII) Which one of the following queries will be the most efficient (i.e. will have the least cost in terms of I/Os)? Note that StaffId and StaffNo are unique identifiers of the relations A and B, respectively; hash indexes are available on StaffId and StaffNo.

a) П StaffId, StaffName(StaffId=2340(A ⋈ StaffId= StaffNo B))

b) П StaffId, StaffName ((StaffId=2340 (A)) ⋈ StaffId= StaffNo (StaffNo=2340(B)))

c) П StaffId, StaffName ((StaffId=2340 (A)) ⋈ StaffId= StaffNo(B))

d) П StaffId, StaffName (StaffId=2340 (A ⋈ StaffId= StaffNo (StaffId=2340(B))))

e) All have a similar performance

(VIII) The use of a cursor in PL/SQL allows you to manipulate the result of a query having more than one tuple. Which one of the following statements about a cursor is true?

a) A cursor must be declared and opened before it can be used. You may either do this in the DECLARE section, or by simply opening it in the statements section when its definition will be assumed to be the DEFAULT value.

b) LOOP-EXIT, FOR-LOOP or WHILE-LOOP blocks may be used to loop round all the tuples returned by a query.

c) When the cursor is first opened in the statements section, the SQL query is executed and the resulting rows retrieved. The first row is returned via the cursor, and you may then loop round all the retrieved rows and perform the required processing.

d) Parameters cannot be passed to the cursor, as the initial declaration of a cursor may not be changed.

e) None of the above are true

(IX) Consider these statements in the context of optimising queries involving Selection (σ):

i) Use the binary search method if the selection condition corresponds to an attribute on which the table is sorted and the query is unlikely to select the entire table.

ii) Use a B+ tree index if the selection condition is based on inequality (i.e., <, >).

iii) Use a Hash index if the selection is based on equality (i.e., =).

Which combination of these will give the best policy for recommending optimization?

a) i and ii only

b) i and iii only

c) i, ii, and iii

d) ii and iii only

e) None, they are all bad policies.

(X) Given the following 2 tables, which one of these relational algebra statements gives a different result from the others?

Prod

ProdCode  ProdDesc          Price   Colour
A105      Axminster         £24.99  Red
C999      Cord              £3.99   Cream
B001      Bathroom          £8.99   Purple
L002      Laminate (Click)  £14.99  Cherry
L003      Laminate (Click)  £14.99  Mahogany

Order

OrdNo  Product  Qty
1      A105     20
2      L002     2
3      L003     10
4      L002     8
5      A105     15

a) П ProdCode, OrdNo (Prod ProdCode=Product Order)

b) П Product, OrdNo (Order) ⋃ (П ProdCode, OrdNo (Prod ⋈ ProdCode=Product Order))

c) П ProdCode, OrdNo (Order Product=ProdCode Prod)

d) (П Product, OrdNo (Order)) Product=ProdCode (П ProdCode, NULL (Prod))

e) (П ProdCode, OrdNo (Prod ProdCode=Product Order)) ⋃ (П ProdCode, OrdNo (Prod ⋈ ProdCode=Product Order))

SECTION B: ANSWER ONE QUESTION ONLY

Q. 2

Consider the following SQL query that retrieves names of those clients who are prepared to pay a rent more than £595 and information about the flats (i.e. properties of type ‘F’) they have already viewed for renting out on the 10th January 2004 in the Newcastle area:

SELECT C.fName, C.lName, P.street, P.city, P.rooms, P.rent
FROM PropertyForRent P, Client C, Viewing V
WHERE P.propertyNo = V.propertyNo
  AND V.clientNo = C.clientNo
  AND V.viewDate = '10-JAN-2004'
  AND P.city = 'Newcastle'
  AND P.type = 'F'
  AND C.maxRent > 595

Assume the following information:

• The page size for the database is 2048 bytes.
• The size of the PropertyForRent table is 5000 pages:
  o There are 90,000 records;
  o Each record is of 112 bytes;
  o 18 records occupy one page on disk; and
  o $\log_2 5000 = 12$
• The size of the Client table is 4500 pages:
  o There are 103,500 records;
  o Each record is of 88 bytes;
  o 23 records occupy one page on disk; and
  o $\log_2 4500 = 12$
• The size of the Viewing table is 5000 pages:
  o There are 145,000 records;
  o Each record is of 70 bytes;
  o 29 records occupy one page on disk; and
  o $\log_2 5000 = 12$
• There are 5800 records in the Viewing table with viewDate = ‘10-JAN-2004’.
• There are 31,500 records in the PropertyForRent table with type = ‘F’.
• There are 1800 records in the PropertyForRent table with city = ‘Newcastle’; however, only 900 of these properties are flats (i.e. with type = ‘F’).
• There are 20,700 records in the Client table with maxRent > 595.
• The result of the SQL query contains 2700 records and the size of each record is 146 bytes.
• The number of buffer pages available during query processing is 5, i.e. B = 5. The constant cost factor used in the evaluation of projection is 2, given as follows:

  $\log_{B-1}(2B) = \log_{5-1}(2 \times 5) = \log_4 10 = \log 10 / \log 4 = 1.66096 \approx 2$

• Hash indexes (clustered) are available on the city and type attributes of the PropertyForRent table.
• A B+ tree index (clustered) is available on the maxRent attribute of the Client table.
• The Viewing table is sorted in ascending order of viewDate.
• The join algorithms used by the DBMS for evaluating joins are page-oriented nested loops and block-nested loops. Also note that you should consider the cost of writing to temporary tables for intermediate results.
• While calculating the evaluation cost of the query, ignore the cost of writing out the final result.
• The relational algebra query tree representing the above SQL query is given in FIGURE 1 (on the next page).
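The figures stated above can be cross-checked with a few lines of arithmetic. The following is a minimal sketch, not part of the question: the function names are hypothetical, and it assumes that records are not split across pages and that the projection cost factor is the logarithm rounded up to a whole number.

```python
import math

PAGE_SIZE = 2048  # bytes, as stated in the question

# (records, record size in bytes, records per page, pages), all as stated in the question
TABLES = {
    "PropertyForRent": (90_000, 112, 18, 5000),
    "Client": (103_500, 88, 23, 4500),
    "Viewing": (145_000, 70, 29, 5000),
}


def records_per_page(page_size, record_size):
    """Whole records that fit in one page (assumes records never span pages)."""
    return page_size // record_size


def pages_needed(records, per_page):
    """Pages required to hold the given number of records."""
    return math.ceil(records / per_page)


def projection_cost_factor(buffer_pages):
    """Assumption: log base (B-1) of 2B, rounded up to a whole number."""
    return math.ceil(math.log(2 * buffer_pages, buffer_pages - 1))


# Check that the stated table statistics are mutually consistent.
for name, (records, size, per_page, pages) in TABLES.items():
    assert records_per_page(PAGE_SIZE, size) == per_page, name
    assert pages_needed(records, per_page) == pages, name

# Unrounded logarithm and the rounded cost factor for B = 5.
print(round(math.log(10, 4), 5), projection_cost_factor(5))
```

The checks only confirm that the given statistics agree with one another; they say nothing about which evaluation plan is cheapest.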

π fName, lName, street, city, rooms, rent
  |
σ viewDate = '10-JAN-2004' and city = 'Newcastle' and type = 'F' and maxRent > 595
  |
⋈ propertyNo = propertyNo
  ├── PropertyForRent
  └── ⋈ clientNo = clientNo
        ├── Client
        └── Viewing

FIGURE 1: Relational Algebra Tree for the given SQL Query.

Answer the following questions:

(a) Apply logical query optimisation (i.e. transformation/re-writing rules and algebraic equivalence) to the query tree shown in FIGURE 1 and draw the logically optimised query tree thus obtained. (4 marks)

(b) Re-draw the query tree from part (a) and devise a physical plan for executing the relational operations involved in the tree by indicating which method/algorithm will be used for evaluating each separate operator. (4 marks)

(c) Compute the cost for evaluating the physical plan from part (b) in terms of I/O operations. (13 marks)

(d) Discuss by referring to your answers to parts (a), (b), and (c) how optimisation techniques influence the overall evaluation cost of the given SQL query. (4 marks)

Q. 3

(a) When a new client registers with the DreamHome estate agent a transaction is invoked for setting up the client account. A requirement of this transaction is that the client’s number is unique and sequential. Within the database schema it is identified as a primary key.

(i) Explain how a trigger could be used to achieve the above requirement for client number regardless of the value for clientNo entered by a user of the system for new clients. (3 marks)

(ii) Using your answer to part (i) as a design, produce code for your trigger. (5 marks)

(b) At the present moment, the existing system does not incorporate any concurrency handling protocols within the database system. There are two parties interested in renting property number 3224. Each interested party approaches different members of staff for renting out the property (staffNo’s 43 & 22).

(i) Draw database state transaction diagrams and point out the shared data, critical states and interfering operations involved in the two transactions when the clients attempt to confirm the rental. (Suggestion: You may use UML state diagrams, dataflow diagrams or simple state charts.) (6 marks)

(ii) Illustrate the schedule for parallel execution of the two transactions making use of read and write locks. (6 marks)

(c) At the start of a new financial year the rents of all properties increase by the rate of inflation. For the year 2003/2004 the rent should increase by 3.25%. Write code for a cursor inside a skeleton procedure that will be executed annually to update all the rentals (i.e. the values of rent attribute) in the PropertyForRent table. (Hint: you will need to define the outline of the procedure and declare the input variables). (5 marks)

SECTION C: ANSWER ONE QUESTION ONLY

Q. 4

(a) The UML class diagram (i.e. the conceptual model of the DreamHome database) shows a many-to-many association between PropertyForRent and Client in the form of an association class (i.e. with the Views association label). This association class is represented as the Viewing relation in the relational schema of the DreamHome database.

Provide an object-oriented representation of the classes: PropertyForRent and Client. Describe the association between these classes. [You need to include only important attributes and you may omit other associations related to these classes. You may use ODL or generic OO class definitions to answer this part.] (4 marks)

(b) The UML class diagram in the Appendix shows, among other things, the class Owner and its sub-classes PrivateOwner and BusinessOwner. Note that the specialization/generalization relationship of the Owner class is mandatory and disjoint (shown as {Mandatory, Or}), i.e. an owner must be either a private or a business owner but cannot be both. These classes are represented as the tables PrivateOwner and BusinessOwner in the relational schema (also in the Appendix).

Provide an object-oriented representation of the classes: Owner, PrivateOwner and BusinessOwner. [You should fully describe these classes including all of their attributes and associations in which they participate. You may use ODL or generic OO class definitions to answer this part.] (5 marks)

(c) Compare and contrast the object-oriented representations from your answers to parts (a) and (b) with their counterparts in the relational schema of the DreamHome database. Your answer must include any advantages and disadvantages of these representations over each other. [Arguments for or against either representation should be specific to the scenario and not generic arguments.] (8 marks)

(d)

(i) Write a query in SQL (over the relational schema given in the Appendix) for listing the fName and lName of the clients who have viewed 5 or more properties for rent such that maxRent of each client is equal to or less than the rent of the properties they viewed. (2 marks)

(ii) Express the same query over an equivalent object-oriented representation (e.g. see your answers to parts (a) and (b)) using either:

• OQL (Max. of 4 marks)
• Plain English (Max. of 2 marks)

(iii) Compare and contrast your answers to sub-parts (i) and (ii) of part (d) above. (2 marks)

Q. 5

The relational implementation of the DreamHome database is not distributed at the present time. The management of the company plans to expand the business and develop regional branches in four other cities in the UK (Edinburgh, Birmingham, Manchester and London). This expansion will ensure that the UK-wide coverage by the company is considerably increased, which will in turn increase the company’s market share and profits.

The company’s head office will remain in the original site of Newcastle upon Tyne.

a) Explain why the current database schema is not suitable for a distributed database and provide an effective database design that takes these problems on board. (2 marks)

b) Opening up the additional branches (details above) brings certain challenges. There is a need for the database to be fragmented or allocated over a number of servers. Data would also need to be stored by some method at relevant sites for ease of access. Explain why this would be the case and illustrate your answer with extracts from the fragmentation schema that would be required. You should mention replication or Snapshot views of data in your answer. Note that answers without illustrations will receive a maximum of 2 marks. (4 marks)

c) Propose an allocation schema for the new distributed database, indicating which sites would retain the original data and which would retain either replicated views or Snapshots. You should illustrate your answer for only two sites: head office and one other branch. Explain your reasoning in order to achieve maximum marks. Note that an allocation schema with no reasoning will obtain a maximum of 3 marks. (8 marks)

d) C. J. Date proposed an ideal model/rule base for a distributed database system (DDBS). Describe in detail each of the rules and explain the feasibility of the rule base. (3 marks)

e) The company had a previous database system, which was used in isolation; this is to be integrated with a new database system that is to be distributed throughout. Explain the problems faced when moving from an autonomous and standalone system to a decentralised system. What are the key issues to be overcome by the technical staff, and what is the impact of this move on the company? (5 marks)

f) Explain the differences between the use of replication and the use of snapshots when transferring data between sites. What considerations would you, as a developer, need to make to ensure that the system works efficiently? Illustrate your answer by referring to the scenario and the preamble to this question. (3 marks)

APPENDIX DreamHome Estate Agent Database System

1. SCENARIO

The scenario is based on, and is a subset of, the DreamHome database application described in “Database Systems”, 3rd Edition, by Thomas Connolly and Carolyn Begg, 2002. The DreamHome estate agent serves both private and business owners in renting out their properties to clients. The DreamHome database stores information about:

• Staff working for the estate agent.
• Owners of the properties for rent, where an owner can be private or business.
• Clients (and their preferences) who are looking to take properties on rent.
• Properties for rent.
• Property viewings (clients may request to view properties before agreeing on tenancy).
• Leases, which are tenancy contracts between clients and properties.

2. Conceptual model for the DreamHome database in UML

3. Relational Schema for the DreamHome database

4. Data types/Domains of Attributes

Attributes: Domain/Data type

staffNo, propertyNo, clientNo, leaseNo, ownerNo, supervisorStaffNo: Integer (positive) up to 6 digits
fName, lName, position, contactName: VarChar(30)
sex: Char(1), possible values are ‘F’, ‘M’
DOB, rentStart, rentFinish, viewDate: Date
bName: VarChar(40)
bType: VarChar(20)
address, comment: VarChar(50)
telNo: Char(12)
type, prefType: Char(1), possible values are ‘F’ for flat, ‘D’ for detached house, ‘S’ for semi-detached house, ‘M’ for maisonette, ‘T’ for terraced house, ‘B’ for bungalow, and ‘H’ for any type of house.
paymentMethod: Char(2), possible values are ‘CC’ for credit card, ‘SD’ for switch/delta, ‘DD’ for direct debit, ‘CQ’ for cheque, and ‘CP’ for cash payment.
depositPaid: Char(1) or Boolean, possible values are ‘Y’ or ‘N’
rent, maxRent, rooms: Integer (positive) up to 3 digits
street, city: VarChar(40)
postcode: Char(7)
