Jordan University of Science & Technology Computer Science Department CS 728: Advanced Systems Midterm Exam First 2009/2010 Student Name: ID: Part 1: Multiple-Choice Questions (17 questions, 1 point each) 1. Select the TRUE statement concerning normalization. a. Performance is increased. b. Data consistency is improved. c. Redundancy is increased. d. The number of tables is reduced. e. Functional dependencies increase.

2. A lack of normalization can lead to which one of the following problems: a. Deadlock b. Lost Updates c. Insertion problems d. Deferred updates e. Deletion of data

3. Each of the following is an argument which might be used to support the use of relations which are not fully normalized. Select the weakest argument. a. A fully normalized database may have too many tables b. Full normalization may compromise existing applications/systems c. Full normalization may make some queries too complicated d. A fully normalised database may result in tables which are too large e. A fully normalized database may perform too slowly

4. If a non-key attribute of a table can be , that table automatically violates which normal form (choose the lowest one): NONE 1NF 2NF 3NF BCNF 4NF 5NF

5. If an attribute of a table can have multiple values, that table automatically violates which normal form (choose the lowest one): NONE 1NF 2NF 3NF BCNF 4NF 5NF

6. Given the following and dependences, state which normal form the relation is in. R(p,q,r,s,t) p,q -> r,s,t r,s -> p,q,t t -> s a. 1NF b. 2NF c. 3NF d. BCNF e. Unnormalized 7. Which of the following is the highest normal form by which the R relation can be classified? R(patient, consultant, hospital, address, date, time) Given patient, consultant -> hospital, address, date, time hospital -> address a. 1NF b. Unnormalised c. 2NF d. BCNF e. 3NF

8. Assume the relation R(A, B, C ,D, E) is in at least 3NF. Which of the following functional dependencies must be FALSE? a. A, B -> C b. A, C -> E c. A, B -> D d. C, D -> E e. None of the above

9. Consider the relational schema R(A,B,C,D,E) with non-key functional dependencies C,D -> E and B -> C. Select the strongest statement that can be made about the schema R a. R is in b. R is in c. R is in BCNF normal form d. R is in first normal form e. None of the above

10. Consider the following functional dependencies: a, b -> c, d e -> c b -> e, f Given the same functional dependencies as shown above, which option shows the relations normalized to 3NF of: R(a, b, c, d, e, f) a. R(a,b,c,d,e,f) R(e,c) R(b,e,f) b. R(a,b,c,d) R(c,e) R(e,f,b) c. R(a,b,c,d) R(c,e) R(b,e,f) d. R(a,b,d) R(e,c) R(b,e,f) e. R(a,b,c,d,e,f) 11. Given the following relation and dependencies, select the option that is the result of fully normalizing the relation to BCNF. R(a,b,c,d,e) a -> c d -> c,e a. R1(a,c) R2(d,e) R(a,b,d) b. R1(a,c) R2(d,c,e) R(a,b,d) c. R1(d,c,e) R(a,b,d) d. R(a,b,c,d,e) e. None of the above

12. If a relation schema contains two different multi-valued dependencies, that table automatically violates which normal form (choose the lowest one): NONE 1NF 2NF 3NF BCNF 4NF 5NF

13. In spatial , what does MBR refer to? a. Minimum Box Rectangle b. Minimum Box Region c. Minimum Bounding Region d. Minimum Bounding Rectangle e. Minimum Bounding Box

14. The two steps (in order) that are performed when answering a spatial query using MBR are a. filtering and refinement steps b. refinement and filtering steps c. filtering and screening steps b. screening and filtering steps

15. The first step(s) in the knowledge discovery process is/are a. data selection b. visualization c. data mining d. data transformation e. data cleaning and integration

16. The measure of interestingness used by association rule mining that is based on how frequently an itemset appears in data is called a. correlation b. confidence c. support d. importance e. none of the above

17. Association rule mining is an example of a. Descriptive data mining b. Unsupervised learning c. Predictive data mining d. Learning by example e. NOA Part 2: Essay Questions (83 points all) 1.

a. Is the following decomposition of Book lossless? Why or why not? 4 R1(A, T, P), R2(I, P, Y ), R3(I, T) I A T P Y R1 b11 b12 b13 b14 b15 R2 b21 b22 b23 b24 b25 R3 b31 b32 b33 b34 b35

I A T P Y R1 b11 a2 a3 a4 b15 R2 a1 b22 b23 a4 a5 R3 a1 b32 a3 b34 b35 TP → I No changes to table S AP → T No changes to table S I → ATP Table S is changed to I A T P Y R1 b11 a2 a3 a4 b15 R2 a1 b22 a3 a4 a5 R3 a1 b22 a3 a4 b35 TP → I Table S is changed to I A T P Y R1 a1 a2 a3 a4 b15 R2 a1 b22 a3 a4 a5 R3 a1 b22 a3 a4 b35 AP → T No changes to table S I → ATP Table S is changed to I A T P Y R1 a1 a2 a3 a4 b15 R2 a1 a2 a3 a4 a5 R3 a1 a2 a3 a4 b35 Since row 2 is all "a" symbols, then the decomposition is lossless. b. Does the decomposition in a preserve all functional dependencies in F+? If so, simply state Yes. If not, give a that is not preserved. 4 No, because TP → I is not preserved. Here is why. Restrictions of F+ to each relation in the decomposition: • R1 = ATP: {AP → T; TP → A} • R2 = IPY: {I → P} • R3 = IT: {I → T} Testing FD preservation: test each FD in F to see if it is implied by the restricted FDs above. • TP → I: (TP)+ = TPA so TP → I is not preserved. • AP → T: (AP)+ = APT so AP → T is preserved. • I → ATP: I+ = IATP so I → ATP is preserved. Another justification: F+ = {TP → I, TP → A, AP → T, I → A, I → T, I → P}

+ + (πR1(F) ∪ πR2(F) ∪ πR3(F)) = (AP → T, TP → A, I → P, I → T) {AP → T, TP → A, I → P, I → T, I → A} ≠ F+ c. Consider replacing F by an alternative set of functional dependencies G: 4 AP-> IT I -> ATP Is the effect of G the same as the effect of F? In other words, does G+ = F+? Why or why not? Under these FDs (G), we have TP+ = TP, so TP → I is not implied, hence G+ ≠ F+. Under F Under G (TP)+ = TPIA (TP)+ = TP (AP)+ = APIT (AP)+ = APIT I+ = IATP I+ = IATP

2.

a. Consider the following database instance D1 of R: 4

Is D1 consistent with the dependencies specified above? Why or why not? No, because the instant D1 violates (does not preserve) the dependency S → D. We have Novell (S) paying dividends (D) $0.05 and $0.10. b. Give a lossless decomposition of R into Boyce-Codd Normal Form. 4 The Key of R is IS, since (IS)+ = R. F+ = F ∪ {I → O} Solution I: 1. Decompose R by I → B into R1 = IB and R2 = IOSQD. 2. R1 is in BCNF. 3. Decompose R2 by S → D into R3 = SD and R4 = IOSQ. 4. R3 is in BCNF. 5. Decompose R4 by I → O into R5 = IO and R6 = ISQ. 6. R5 is in BCNF. 7. R6 is in BCNF.

Solution II: {BO, IB, SD, ISQ}, which does preserve functional dependencies. c. Does your answer to Question b preserve all given and implied functional dependencies? If No, state which dependencies that are not preserved. 4 Solution I: does not preserve FD B → O Solution II: does preserve all FDs

3. Consider the following function dependencies for a relation R(X, Y, Z, W): 4 WX -> Y X -> Z Z -> WY Give a derivation of X -> Y from the given functional dependencies above. Justify your steps with Armstrong’s axioms. Z -> WY Z -> Y X -> Z X -> Y 4. Consider a relation R(A, B, C, D, E, F, G, H) with Functional Dependencies 21 AB → E C → D F → GH B → F

(a) Consider R1(A, B, C, D) with the above functional dependencies. What would be a candidate

key for R1? Solution I: A+ = A Not a key B+ = B Not a key C+ = CD Not a key (AB)+ = AB Not a key + (ABC) = ABCD = R1 is a key Solution II: Assum that the key K is ABCD

K - D → ABCD = R1

K - DC → AB ≠ R1

K - DA → BCD ≠ R1

K - DB → ACD ≠ R1 So the key is K – D = ABC

(b) Is R1(A, B, C, D) in second normal form? If not, say why. No, because of C → D (partial dependency), D is not fully dependent on the key ABC.

(c) Is R1(A, B, C, D) and R2(A, B, E) a lossless join decomposition of R’(A, B, C, D, E)? Explain why or why not. Since the decomposition is binary, we can test the lossless property using the following rules:

If R1 ∩ R2 → R1 - R2 or R2 - R1 then it is lossless.

Since R1 ∩ R2 = AB, R2 - R1 = E, and AB → E, then decomposition is lossless.

(d) Is R1(A, B, C, D), R2(A, B, E), R3(F, G, H) and R4(B, F) a dependency preserving decomposition? Explain why or why not. Yes.

AB → E is covered by R2 C → D is covered by R1

F → GH is covered by R3

B → F is covered by R4 Another justification: F+ = {AB → E, C → D, F → GH, B → FGH} + (πR1(F) ∪ πR2(F) ∪ πR3(F) ∪ πR4(F)) = (C → D, AB → E, F → GH, B → F)+ {AB → E, C → D, F → GH, B → FGH} = F+ (e) Now start with these functional dependencies: AB → E C → D F → GH FG → GH B → FG Find a minimal covering of these functional dependencies. Then use it to synthesize a set of dependency preserving, 3NF relations with a lossless join. I.e. use algorithm 11.4 from the text book. Minimal Covering: Step 1. Write the FDs with singleton RHSs AB → E C → D F → G F → H FG → G FG → H B → F B → G Step 2. Remove redundant LHS attributes. FG → G becomes F → G FG → H becomes F → H Step 3. Remove redundant FDs Remove one copy of F → G Remove one copy of F → H Also remove B → G The minimal covering is {AB → E, C → D, F → G, F → H, B → F} Now on to the 3NF algorithm: Step 1 is done. Step 2, write down relations: R1(A, B, E), R2(C, D), R3(F, G, H) and R4(B, F) Step 3, no relation is a key for the whole thing. Add another relation R5(A, B, C) 5. List three spatial data types. 3 a. point b. line c. region 6. What are the three spatial operators? Give an example of each operator. 6 a. Topological operators: Overlap b. Direction operators: North c. Distance operators: Near 7. List two types of spatial queries. Give an example of each one. 4 a. Spatial Range Queries Find all cities within 50 miles of Madison Find me buildings that are adjacent to the Railway Stations? b. Nearest-Neighbor Queries Find the 10 cities nearest to Madison Find me the nearest fire station to Clementi Ave. 3? c. Window Range Query Find me data points that satisfy the conditions x1 < A1 < x2, y1

B 1 2 3 F E 4 5 6

C 11 G 7 8 H 9 12 10

A: BC

B: E F C: GH

E F G 4 5 6 1 2 3 10 11 1 2 H

7 8 9

10. Consider the following set of spatial objects and the corresponding R-tee. How many pages (nodes, I/O) are read in order to search for object #5? 3

a b 1 6

2 5 7

3 4 89 cd

x 10 12 13

f 14 11 y 15 16 e g

x y

a b c d e f g

10, 11 12, 13, 14 15, 16 1,2,3 6,7 4,5 8,9 4 pages

11. What is the main difference between classification and clustering? 3 • Clustering (Unsupervised learning) vs. Classification (Supervised learning) No prior knowledge (Number of clusters, Meaning of clusters) 12. List three factors that affect classification based on decision trees. 3 • Choosing Splitting Attributes Ordering of Splitting Attributes • Splits Predicates Tree Structure • Stopping Criteria Training Data 13. List three methods used to represent the distance between clusters 3 • Single/Complete/Average Link Centroid Medoid 14. Consider the following set of transactions. What is the support of PeanutButter & Jelly? 3 1/5 = 20%