
CS377: Database Systems

Distributed Databases

Li Xiong Department of Mathematics and Computer Science Emory University

Centralized DBMS on a Network

[Figure: Sites 1 to 5 connected by a communication network]

Distributed DBMS Environment

[Figure: Sites 1 to 5 connected by a communication network]

Distributed Database System

• A distributed database (DDB) is a collection of multiple, logically interrelated databases distributed over a computer network.

• A distributed database management system (D-DBMS) is the software that manages the DDB and provides an access mechanism that makes this distribution transparent to the users.

• Distributed database system (DDBS) = DDB + D-DBMS

Distributed Database System

The EMPLOYEE, PROJECT, and WORKS_ON tables may be fragmented horizontally and stored with possible replication.

[Figure: example fragmentation and replication of the three tables]

Distributed DBMS Promises

Transparent management of distributed, fragmented, and replicated data

Improved reliability/availability through distributed transactions

Improved performance

Easier and more economical system expansion

Distributed DBMS Issues

• Distributed Database Design

  • How to distribute the database

• Query Processing

  • Optimize cost = data transmission + local processing

Distributed DBMS Issues

• Concurrency Control

  • Synchronization of concurrent accesses

  • Consistency and isolation of transactions' effects

  • Deadlock management

• Reliability

  • How to make the system resilient to failures

  • Atomicity and durability

Distributed database design

• Data distribution
  • Top-down: mostly in designing systems from scratch
  • Bottom-up: when the databases already exist at a number of sites
• Unit of distribution: fragments of relations (subrelations)
  • Data are inherently fragmented, e.g. by locality
  • Fragments allow concurrent execution of a number of transactions that access different portions of a relation

Example

Employee relation E(#, name, loc, sal, …)

40% of queries: Qa: select * from E where loc = Sa and …
40% of queries: Qb: select * from E where loc = Sb and …

Motivation: two sites, Sa and Sb. Qa is submitted at Sa and Qb at Sb.

Fragmentation Alternatives – Horizontal

PROJ
PNO  PNAME              BUDGET  LOC
P1   Instrumentation    150000  Montreal
P2   Database Develop.  135000  New York
P3   CAD/CAM            250000  New York
P4   Maintenance        310000  Paris
P5   CAD/CAM            500000  Boston

PROJ1: projects with budgets less than $200,000
PROJ2: projects with budgets greater than or equal to $200,000

PROJ1
PNO  PNAME              BUDGET  LOC
P1   Instrumentation    150000  Montreal
P2   Database Develop.  135000  New York

PROJ2
PNO  PNAME              BUDGET  LOC
P3   CAD/CAM            250000  New York
P4   Maintenance        310000  Paris
P5   CAD/CAM            500000  Boston

Fragmentation Alternatives – Vertical

PROJ
PNO  PNAME              BUDGET  LOC
P1   Instrumentation    150000  Montreal
P2   Database Develop.  135000  New York
P3   CAD/CAM            250000  New York
P4   Maintenance        310000  Paris
P5   CAD/CAM            500000  Boston

PROJ1: information about project budgets
PROJ2: information about project names and locations

PROJ1
PNO  BUDGET
P1   150000
P2   135000
P3   250000
P4   310000
P5   500000

PROJ2
PNO  PNAME              LOC
P1   Instrumentation    Montreal
P2   Database Develop.  New York
P3   CAD/CAM            New York
P4   Maintenance        Paris
P5   CAD/CAM            Boston

Data Fragmentation, Replication and Allocation

• Horizontal fragmentation
  • A horizontal subset of a relation that contains the tuples satisfying a selection condition.
  • E.g., the Employee relation with selection condition (DNO = 5).
  • Can be specified by a σ_Ci(R) operation in the relational algebra.
• Complete horizontal fragmentation
  • A set of horizontal fragments whose conditions C1, C2, …, Cn include all the tuples in R; that is, every tuple in R satisfies (C1 OR C2 OR … OR Cn).
  • Disjoint complete horizontal fragmentation: no tuple in R satisfies (Ci AND Cj) where i ≠ j.
• How to reconstruct R from complete horizontal fragments?
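As a rough illustration of the definitions above, the following Python sketch fragments the PROJ relation from the earlier example by budget, checks completeness and disjointness, and rebuilds the relation as the union of its fragments. The in-memory list of dictionaries and the helper names (`fragment`, `is_complete`, `is_disjoint`, `reconstruct`) are assumptions made for this sketch, not part of any DBMS API.

```python
# Hypothetical in-memory copy of the PROJ relation (primary key: PNO).
PROJ = [
    {"PNO": "P1", "PNAME": "Instrumentation", "BUDGET": 150000, "LOC": "Montreal"},
    {"PNO": "P2", "PNAME": "Database Develop.", "BUDGET": 135000, "LOC": "New York"},
    {"PNO": "P3", "PNAME": "CAD/CAM", "BUDGET": 250000, "LOC": "New York"},
    {"PNO": "P4", "PNAME": "Maintenance", "BUDGET": 310000, "LOC": "Paris"},
    {"PNO": "P5", "PNAME": "CAD/CAM", "BUDGET": 500000, "LOC": "Boston"},
]

# Selection conditions C1 and C2 (the budget split from the PROJ example).
conditions = [
    lambda t: t["BUDGET"] < 200000,
    lambda t: t["BUDGET"] >= 200000,
]


def fragment(relation, conds):
    """Return one horizontal fragment, sigma_Ci(R), per condition."""
    return [[t for t in relation if c(t)] for c in conds]


def is_complete(relation, conds):
    """Every tuple satisfies (C1 OR C2 OR ... OR Cn)."""
    return all(any(c(t) for c in conds) for t in relation)


def is_disjoint(relation, conds):
    """No tuple satisfies (Ci AND Cj) for i != j."""
    return all(sum(1 for c in conds if c(t)) <= 1 for t in relation)


def reconstruct(fragments):
    """Union of the fragments; duplicates are removed by primary key PNO."""
    merged = {}
    for frag in fragments:
        for t in frag:
            merged[t["PNO"]] = t
    return sorted(merged.values(), key=lambda t: t["PNO"])


proj1, proj2 = fragment(PROJ, conditions)
print([t["PNO"] for t in proj1], [t["PNO"] for t in proj2])
print(is_complete(PROJ, conditions), is_disjoint(PROJ, conditions))
print(reconstruct([proj1, proj2]) == PROJ)
```

The reconstruction step is a plain union, which is enough here because the fragments are complete; the de-duplication by key only matters when fragments overlap.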

Three common horizontal partitioning techniques

• Round robin
• Hash partitioning
• Range partitioning

• Round robin

R     D0   D1   D2
t1    t1
t2         t2
t3              t3
t4    t4
...        t5

• Hash partitioning

R                 D0   D1   D2
t1 → h(k1) = 2              t1
t2 → h(k2) = 0    t2
t3 → h(k3) = 0    t3
t4 → h(k4) = 1         t4
...

• Range partitioning

Partitioning vector: V0 = 4, V1 = 7

R            D0   D1   D2
t1: A = 5         t1
t2: A = 8              t2
t3: A = 2    t3
t4: A = 3    t4
...
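The three partitioning techniques can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tuples are hypothetical dictionaries, the hash values are hard-coded to match the hash partitioning example, and the range boundaries 4 and 7 are read from the partitioning vector in the range example. The rule that a value equal to a boundary goes to the lower partition is an assumption of this sketch; the slides do not say which side a boundary value falls on.

```python
from bisect import bisect_left

# Hypothetical tuples: id, hash key k, and range attribute A.
TUPLES = [
    {"id": "t1", "k": "k1", "A": 5},
    {"id": "t2", "k": "k2", "A": 8},
    {"id": "t3", "k": "k3", "A": 2},
    {"id": "t4", "k": "k4", "A": 3},
]

# Hash values taken from the hash partitioning example.
SLIDE_HASH = {"k1": 2, "k2": 0, "k3": 0, "k4": 1}


def round_robin(tuples, n_disks):
    """The i-th tuple (counting from 0) goes to disk i mod n."""
    disks = [[] for _ in range(n_disks)]
    for i, t in enumerate(tuples):
        disks[i % n_disks].append(t)
    return disks


def hash_partition(tuples, n_disks, key, h):
    """A tuple goes to disk h(key) mod n."""
    disks = [[] for _ in range(n_disks)]
    for t in tuples:
        disks[h(key(t)) % n_disks].append(t)
    return disks


def range_partition(tuples, vector, key):
    """Partitions: A <= V0, V0 < A <= V1, ..., A > V(n-1) (assumed boundaries)."""
    disks = [[] for _ in range(len(vector) + 1)]
    for t in tuples:
        disks[bisect_left(vector, key(t))].append(t)
    return disks


def ids(disks):
    """Show only the tuple ids stored on each disk."""
    return [[t["id"] for t in d] for d in disks]


rr = ids(round_robin(TUPLES + [{"id": "t5", "k": "k5", "A": 6}], 3))
hp = ids(hash_partition(TUPLES, 3, lambda t: t["k"], SLIDE_HASH.__getitem__))
rp = ids(range_partition(TUPLES, [4, 7], lambda t: t["A"]))
print(rr, hp, rp)
```

Round robin spreads tuples evenly but gives no help in locating a tuple by value; hash partitioning supports exact-match lookups on the key; range partitioning also supports range scans, at the risk of skew if the vector is chosen poorly.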

Data Fragmentation, Replication and Allocation

• Vertical fragmentation
  • A vertical subset of a relation that contains a subset of columns.
  • E.g., for the Employee relation: a vertical fragment of Name, Bdate, Sex.
  • Can be specified by a Π_Li(R) operation in the relational algebra.
  • Each fragment must include the primary key attribute of the parent relation Employee.
• Complete vertical fragmentation
  • A set of vertical fragments whose projection lists L1, L2, …, Ln include all the attributes in R but share only the primary key of R.
  • L1 ∪ L2 ∪ ... ∪ Ln = ATTRS(R)
  • Li ∩ Lj = PK(R) for any i ≠ j
• How to reconstruct R from complete vertical fragments?
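The conditions for a complete vertical fragmentation can be checked mechanically. The Python sketch below uses a hypothetical Employee relation with a reduced attribute set (SSN as primary key) and hypothetical helper names. It projects two fragments, verifies both conditions, and rebuilds the relation by matching fragment tuples on the primary key, which for complete fragments of the same relation gives the same result as a natural join on the key.

```python
# Hypothetical Employee rows with a reduced attribute set; SSN is the primary key.
ATTRS = ["SSN", "Name", "Bdate", "Sex", "Salary"]
PK = ["SSN"]
EMPLOYEE = [
    {"SSN": "111", "Name": "Ann", "Bdate": "1980-01-01", "Sex": "F", "Salary": 50000},
    {"SSN": "222", "Name": "Bob", "Bdate": "1975-06-30", "Sex": "M", "Salary": 42000},
]

# Projection lists L1 and L2; each one includes the primary key.
L1 = ["SSN", "Name", "Bdate", "Sex"]
L2 = ["SSN", "Salary"]


def project(relation, attrs):
    """Pi_L(R): keep only the listed attributes of every tuple."""
    return [{a: t[a] for a in attrs} for t in relation]


def is_complete_vertical(all_attrs, pk, lists):
    """L1 U ... U Ln = ATTRS(R) and Li & Lj = PK(R) for every i != j."""
    covers = set().union(*lists) == set(all_attrs)
    share_only_pk = all(
        set(a) & set(b) == set(pk)
        for i, a in enumerate(lists)
        for b in lists[i + 1:]
    )
    return covers and share_only_pk


def reconstruct(fragments, pk):
    """Merge fragment tuples that agree on the primary key."""
    merged = {}
    for frag in fragments:
        for t in frag:
            key = tuple(t[a] for a in pk)
            merged.setdefault(key, {}).update(t)
    return list(merged.values())


frag1, frag2 = project(EMPLOYEE, L1), project(EMPLOYEE, L2)
print(is_complete_vertical(ATTRS, PK, [L1, L2]))
print(reconstruct([frag1, frag2], PK) == EMPLOYEE)
```

Repeating the primary key in every fragment is what makes the reconstruction lossless: without it there would be no way to pair up the pieces of a tuple.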

Data Fragmentation, Replication and Allocation

• Mixed (hybrid) fragmentation
  • A combination of vertical fragmentation and horizontal fragmentation.
  • This is achieved by SELECT-PROJECT operations, represented by Π_Li(σ_Ci(R)).

Data Fragmentation, Replication and Allocation

• Fragmentation schema
  • A definition of a set of fragments (horizontal, vertical, or mixed) from which the original database can be reconstructed.
• Allocation schema
  • Distribution of fragments to the sites of the distributed database. The database can be fully replicated, partially replicated, or partitioned.
• Data replication
  • Full replication: the database is replicated to all sites.
  • Partial replication: some selected part is replicated.

Distributed Database System

The EMPLOYEE, PROJECT, and WORKS_ON tables may be fragmented horizontally and stored with possible replication.

[Figure: example fragmentation and replication of the three tables]

Distributed DBMS Issues

• Distributed Database Design

  • How to distribute the database

• Query Processing

  • Optimize cost = data transmission + local processing

Query Processing in Distributed Databases

• Cost of transferring data (files and results) over the network is usually high
• Example: Employee at site 1 and Department at site 2
  • Employee at site 1: 10,000 rows. Row size = 100 bytes. Table size = 10^6 bytes.
    Attributes: Fname, Minit, Lname, SSN, Bdate, Address, Sex, Salary, Superssn, Dno
  • Department at site 2: 100 rows. Row size = 35 bytes. Table size = 3,500 bytes.
    Attributes: Dname, Dnumber, Mgrssn, Mgrstartdate
• Q submitted at site 3: retrieve each employee's name and the name of the department where the employee works.
  • Q: Π_Fname,Lname,Dname (Employee ⋈_Dno=Dnumber Department)
  • Result has 10,000 tuples and each result tuple is 40 bytes

Query Processing in Distributed Databases

• Strategies:
  1. Transfer Employee and Department to site 3.
     • Total transfer size = ?
  2. Transfer Employee to site 2, execute the join at site 2, and send the result to site 3.
     • Total transfer size = ?
  3. Transfer Department to site 1, execute the join at site 1, and send the result to site 3.
     • Total bytes transferred = ?
• Optimization criterion: minimizing data transfer.
• Which strategy?

Query Processing in Distributed Databases

• Strategies:
  1. Transfer Employee and Department to site 3.
     • Total transfer size = 1,000,000 + 3,500 = 1,003,500 bytes.
  2. Transfer Employee to site 2, execute the join at site 2, and send the result to site 3.
     • Query result size = 40 * 10,000 = 400,000 bytes.
     • Total transfer size = 400,000 + 1,000,000 = 1,400,000 bytes.
  3. Transfer Department to site 1, execute the join at site 1, and send the result to site 3.
     • Total bytes transferred = 400,000 + 3,500 = 403,500 bytes.
• Optimization criterion: minimizing data transfer.
• Preferred approach: strategy 3.
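The comparison above is simple arithmetic, and a short Python sketch makes it easy to rerun with different table sizes. The variable names are hypothetical; the row counts and row sizes are the ones given in the example, and only bytes sent over the network are counted (local processing cost is ignored, as in the optimization criterion above).

```python
# Sizes from the example: Employee at site 1, Department at site 2, Q at site 3.
EMP_ROWS, EMP_ROW_BYTES = 10_000, 100
DEPT_ROWS, DEPT_ROW_BYTES = 100, 35
RESULT_ROWS, RESULT_ROW_BYTES = 10_000, 40

emp_bytes = EMP_ROWS * EMP_ROW_BYTES          # whole Employee table
dept_bytes = DEPT_ROWS * DEPT_ROW_BYTES       # whole Department table
result_bytes = RESULT_ROWS * RESULT_ROW_BYTES  # join result

# Bytes sent over the network for each strategy.
strategies = {
    1: emp_bytes + dept_bytes,     # ship both tables to site 3
    2: emp_bytes + result_bytes,   # ship Employee to site 2, result to site 3
    3: dept_bytes + result_bytes,  # ship Department to site 1, result to site 3
}

best = min(strategies, key=strategies.get)
for number, cost in strategies.items():
    print(f"strategy {number}: {cost:,} bytes")
print("preferred:", best)
```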

Query Processing in Distributed Databases

• What if Q is submitted at site 2?
• Example: Employee at site 1 and Department at site 2
  • Employee at site 1: 10,000 rows. Row size = 100 bytes. Table size = 10^6 bytes.
    Attributes: Fname, Minit, Lname, SSN, Bdate, Address, Sex, Salary, Superssn, Dno
  • Department at site 2: 100 rows. Row size = 35 bytes. Table size = 3,500 bytes.
    Attributes: Dname, Dnumber, Mgrssn, Mgrstartdate
• Q submitted at site 2: retrieve each employee's name and the name of the department where the employee works.
  • Q: Π_Fname,Lname,Dname (Employee ⋈_Dno=Dnumber Department)
  • Result has 10,000 tuples and each result tuple is 40 bytes

Query Processing in Distributed Databases

• Semijoin: the objective is to reduce the number of tuples in a relation before transferring it to another site.
• Example execution of Q:
  1. Project the join attributes of Department at site 2, and transfer them to site 1. For Q, 4 * 100 = 400 bytes are transferred.
  2. Join the transferred file with the Employee relation at site 1, and transfer the required attributes from the resulting file to site 2. For Q, 32 * 10,000 = 320,000 bytes are transferred.
  3. Execute the query by joining the transferred file with Department and present the result to the user at site 2.

Semijoin

• Left semijoin: R ⋉ S = Π_R (R ⋈ S), where Π_R projects onto the attributes of R.
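The three-step semijoin execution can be traced on a tiny example. The Python sketch below is a minimal illustration: the rows, the helper `semijoin`, and the variable names are hypothetical, and the "transfers" are just list copies standing in for network messages. The last lines add up the byte counts quoted in the steps above.

```python
# Hypothetical data: Department lives at site 2, Employee at site 1.
DEPARTMENT = [
    {"Dname": "Research", "Dnumber": 5},
    {"Dname": "Headquarters", "Dnumber": 1},
]
EMPLOYEE = [
    {"Fname": "John", "Lname": "Smith", "Dno": 5},
    {"Fname": "Ann", "Lname": "Lee", "Dno": 1},
    {"Fname": "Bob", "Lname": "Ray", "Dno": 9},  # no matching department
]


def semijoin(r, s, r_attr, s_attr):
    """Left semijoin: the tuples of r that have a join partner in s."""
    s_values = {t[s_attr] for t in s}
    return [t for t in r if t[r_attr] in s_values]


# Step 1 (site 2 -> site 1): ship only the join column of Department.
f = [{"Dnumber": d["Dnumber"]} for d in DEPARTMENT]

# Step 2 (site 1 -> site 2): reduce Employee, then ship only the needed attributes.
reduced = [
    {"Fname": e["Fname"], "Lname": e["Lname"], "Dno": e["Dno"]}
    for e in semijoin(EMPLOYEE, f, "Dno", "Dnumber")
]

# Step 3 (site 2): finish the join locally against Department.
dname_by_number = {d["Dnumber"]: d["Dname"] for d in DEPARTMENT}
result = [(e["Fname"], e["Lname"], dname_by_number[e["Dno"]]) for e in reduced]
print(result)

# Total bytes moved, using the figures from the steps above.
semijoin_bytes = 4 * 100 + 32 * 10_000
print(f"{semijoin_bytes:,} bytes")
```

In this toy data the semijoin drops the employee with no matching department; in the example from the slides every employee matches, so the savings come from shipping only the join column and the required attributes rather than whole tables.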

Parallel Databases

• Parallel database
  • Using parallel processors
• Architectures
  • Shared memory
  • Shared disk
  • Shared nothing
• Data partitioning (sharding)