Heterogeneous / Federated / Multi- Systems Transaction Management

Dr. Denise Ecklund 16 October 2002

©2002 Vera Goebel & Denise Ecklund HDBMS-TM-1

Contents: Heterogeneous DBSs
• A ten minute review from last week ☺
  – What is a HDBMS?
  – Main Problems (we already covered):
    • Defining a Global Data Model
    • Query Processing and Optimization
• Transaction Management

• Summary and Conclusion

Pensum: This set of presentation slides.


Components of a Multi-DBMS

[Figure: The USER sends User Requests to, and receives System Responses from, the Multi-DBMS Layer. Below that layer sit several component DBMSs, each with its own Query Processor, Transaction Manager, Scheduler, Recovery Manager, and Runtime Support Processor.]


Components of a Distributed Multi-DBMS

[Figure: Several Multi-DBMS Layers, each serving its own USER (User Requests in, System Responses out). Each Multi-DBMS Layer sits on top of component DBMSs, and each DBMS has a Query Processor, Transaction Manager, Scheduler, Recovery Manager, and Runtime Support Processor.]

Multi-DB integration layers act as peers in a homogeneous system
- Use the global data model and global access language
- Users submit queries to any Multi-DB site
- Distributed control over transaction execution


Definition - Heterogeneous DBS (HDBS)

A HDBS comprises a software layer (integration layer) and multiple DBSs and/or file systems to be integrated.

Users can transparently access the integrated DBSs and/or file systems via the interface provided by the integration layer.
• Defines a global data model
• Supports a Data Definition Language (DDL)
• Supports a Data Manipulation Language (DML)
• Management
• Transparent integration of the underlying, disparate DBSs

The integrated, local DBSs are autonomous and can also be used as stand-alone systems. Local applications are unchanged and unknown to the HDBS.


Concepts in the Integration Layer

• Global data model

• Global schema and meta data management

• Distributed query processing and optimization

• Distributed transaction management

• Extensible software construction (to allow the “easy” integration of additional system components)


Schema Architecture of HDBS - 2

5-layer schema architecture

[Figure: the five schema layers, from top to bottom.
 1. external schemas (Multi-lingual App View Defn)
 2. federated schemas, with auxiliary schemas (Integration)
 3. export schemas (Multi-Use, Global View Defn, Multiple Views)
 4. component schemas in the global data model (Translation)
 5. local schemas in the local data models]

Query Processing and Optimization

• The HDBMS has
  – A global Data Definition Language (DDL)
  – A global Data Manipulation Language (DML)
  – A set of local DMLs

• The HDBMS Query Processing Goal:
  – Given a query stated in the global query language (DML), execute that query, in an optimal manner, using the local database management systems


Query Planning and Optimization in a Distributed Multi-DBMS

[Figure: A global query passes through query localization, which yields localized multi-DB queries 1 ... m (and PQ 1 ... PQ r). Query fragmentation and global optimization turn a localized query into subqueries SQ 1 ... SQ n (and PQ 1 ... PQ k, labelled "Another Multi-DBMS"). Query translators 1 ... n map SQ 1 ... SQ n to TQ 1 ... TQ n, which run on DB 1 ... DB n. Result data are sorted and unioned, and intermediate results are joined.]


GenCompact For Data Sources with Varying Capabilities
• Use Simple Source Description Language (SSDL) to describe a data source’s query processing capabilities
• Represent a query as a condition tree (CT)
• GenCompact runtime support:
  – creates distributed query plans (based on capabilities)
  – selects and executes the cheapest plan

[Figure: GenCompact pipeline. Query as a condition tree (CT) → Rewrite (using Rewrite Rules) → equiv CTs → Mark (using SSDL descr) → marked CTs → Generate → feasible query plans → Cost (using Cost Model of LDBs) → Best query plan.]
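The last two pipeline stages (keep only plans the source can execute, then pick the cheapest) can be sketched as follows. This is a minimal illustration, not GenCompact itself: the `Plan` class, the representation of capabilities as a set of comparison operators, and the cost numbers are all assumptions made for this example.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Plan:
    name: str
    pushed_ops: frozenset  # operators the data source would have to evaluate
    cost: float            # estimate from some cost model (assumed given)

def best_feasible_plan(plans, source_ops):
    """Keep the plans whose pushed-down operators the source supports,
    then return the cheapest one (or None if no plan is feasible)."""
    feasible = [p for p in plans if p.pushed_ops <= source_ops]
    return min(feasible, key=lambda p: p.cost) if feasible else None

# Hypothetical source that can only evaluate '=' and '<' conditions.
source_ops = frozenset({"=", "<"})
plans = [
    Plan("push everything", frozenset({"=", "<", "like"}), 10.0),
    Plan("push '=' only", frozenset({"="}), 40.0),
    Plan("push '=' and '<'", frozenset({"=", "<"}), 25.0),
]
best = best_feasible_plan(plans, source_ops)
```

The cheapest plan overall is rejected because the source cannot evaluate `like`; the cheapest feasible plan wins.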


SchemaSQL for Accessing Heterogeneous Relational DBs

• Syntax definitions
  – Extends the variables and ranges defined by SQL
  – Valid SchemaSQL ranges are:

    Symbol          Meaning
    →               Set of database names in the federation
    db→             Set of relation names in the database db
    db::rel→        Set of attribute names in relation rel in the database db
    db::rel         Set of tuples in the relation rel in the database db
    db::rel.attr    Set of values in the column named attr in the relation rel in the database db

• A valid SchemaSQL variable is of the form


SchemaSQL – Examples: 3 CIS

Electricity Database
  CustInfo (Name, Address1, Address2, CustType, RatePerKwH)

Natural Gas Database
  Home     (Name, Address, RateCategory)
  Biz      (Name, ServiceAddr, BillingAddr, RateCategory)
  Industry (Name, ServiceAddr, BillingAddr, RateCategory)
  Rates    (RateCategory, RatePerCuM, Fees)

Oil Database
  DelN (Name, DeliveryAddr, MailingAddr, TankCapacity, DeliveryFreq, PriceRate)
  DelS (Name, DeliveryAddr, MailingAddr, TankCapacity, DeliveryFreq, PriceRate)
  DelE (Name, DeliveryAddr, MailingAddr, TankCapacity, DeliveryFreq, PriceRate)
  DelW (Name, DeliveryAddr, MailingAddr, TankCapacity, DeliveryFreq, PriceRate)
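To make the SchemaSQL ranges concrete, the sketch below models the three CIS schemas as a nested Python dictionary and walks it the way a from-clause with database, relation, and attribute variables would. The dictionary layout and the helper names (`FEDERATION`, `databases`, `relations`, `attributes`) are assumptions for illustration only; they are not SchemaSQL syntax, and no tuples are stored.

```python
# Hypothetical in-memory catalog of the three CIS schemas (names only).
DELIVERY = ["Name", "DeliveryAddr", "MailingAddr",
            "TankCapacity", "DeliveryFreq", "PriceRate"]
FEDERATION = {
    "Electricity": {
        "CustInfo": ["Name", "Address1", "Address2", "CustType", "RatePerKwH"],
    },
    "Natural Gas": {
        "Home": ["Name", "Address", "RateCategory"],
        "Biz": ["Name", "ServiceAddr", "BillingAddr", "RateCategory"],
        "Industry": ["Name", "ServiceAddr", "BillingAddr", "RateCategory"],
        "Rates": ["RateCategory", "RatePerCuM", "Fees"],
    },
    "Oil": {rel: list(DELIVERY) for rel in ("DelN", "DelS", "DelE", "DelW")},
}

def databases():
    """Range of '→': database names in the federation."""
    return list(FEDERATION)

def relations(db):
    """Range of 'db→': relation names in database db."""
    return list(FEDERATION[db])

def attributes(db, rel):
    """Range of 'db::rel→': attribute names of relation rel in database db."""
    return list(FEDERATION[db][rel])

# Nested iteration mirrors three chained range variables (D, R, A).
RATE_ATTRS = ("RatePerKwH", "PriceRate", "RatePerCuM")
rate_columns = [(d, r, a)
                for d in databases()
                for r in relations(d)
                for a in attributes(d, r)
                if a in RATE_ATTRS]
```

`rate_columns` lists every (database, relation, attribute) triple that holds a rate, found purely by querying metadata.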


SchemaSQL – Example #3

Problem: Create a view containing the "Low Rate Payers" from all the federated CIS databases (less than 0.035 for each unit of fuel).

    create view LowRatePayers::CInfo(Name, FuelType, Rate)
    select NameRel.Name, DBname, RateValue
    from   → DBname,
           DBname→ NameRel,
           DBname→ RateRel,
           DBname::RateRel→ RateAttr,
           DBname::RateRel.RateAttr RateValue
    where  (Dbname = "Natural Gas"
            and NameRel <> "Rates"
            and RateRel = "Rates"
            and NameRel.RateCategory = RateRel.RateCategory
            and RateRel.RateValue < 0.035)
        or (NameRel = RateRel
            and ((RateAttr = "RatePerKwH")
                 or (RateAttr = "PriceRate")
                 or (RateAttr = "RatePerCuM"))
            and RateValue < 0.035)

    Variable             Values
    Dbname               Electricity, Natural Gas, Oil
    NameRel & RateRel    CustInfo, Home, …, DelN, …
    RateAttr             Name, Address1, …, Name, Address, …, Name, DeliveryAddr, …


SchemaSQL Service Architecture

[Figure: The Federation User submits a SchemaSQL Query to the SchemaSQL Server, which uses the FST and sends optimized local SQL queries Q1 ... Qn to RDBMS1 ... RDBMSn. Answers Q1 ... Qn are returned and post-processed by the Resident SQL Engine (PostProc Queries), which produces the Final Answer for the user.]

Federation System Table (FST) – stores database names, relation names, attribute names, and statistical information on the local RDBMSs


Transaction Management in HDBMSs


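HDBS Transaction Model

[Figure: Global transactions GTi and GTj are submitted to the GTM (global transaction manager), which produces the global subtransactions { GSTi1, GSTj1, GSTi2, GSTj2 }. At each site a server (proxy for the GTM) submits the subtransactions to the local DBMS: GSTi1 and GSTj1 to DBMS 1, GSTi2 and GSTj2 to DBMS n. Local transactions (LTk, LTl, LTm, LTn) are submitted directly to DBMS 1 ... DBMS n.]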


Transaction Management

• Local transactions: access data at a single site outside of the global HDBS control.

• Global transactions: are executed under the HDBS control.

Local DBMSs have three types of autonomy:
• Design autonomy
  – Definition: No changes can be made to the local DBMS software to support the HDBMS
  – Resulting problem: Non-serializable schedule for global transactions
• Execution autonomy
  – Definition: Each local DBMS controls execution of global subtransactions and local transactions (the commit/abort decision)
  – Resulting problem: Non-atomic & non-durable global transactions
• Communication autonomy
  – Definition: Local DBMSs do not communicate with each other and they do not exchange execution control information
  – Resulting problem: Distributed deadlock cannot be detected


Global Serializability Problem
• GTM is responsible for
  – A serializable schedule for the set of global transactions
  – Coordination of submission and execution of global subtransactions among the local DBMSs
• Serializing the global schedule?

GT1 consists of GST11 and GST12; GT2 consists of GST21, GST22, and GST23, spread over Local DBMS-1, Local DBMS-2, and Local DBMS-3.

If GST11 〈 GST22 at site DBMS-1, then GT1 〈 GT2, and it must be the case that GST12 〈 GST23 at site DBMS-2.

If GST23 〈 GST12 at site DBMS-2, then GT2 〈 GT1: a non-serializable schedule!
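The condition can be checked mechanically: merge the serialization orders observed at each site into one graph over the global transactions and look for a cycle. The sketch below is a minimal illustration; the function name `globally_serializable` and the input format (per-site lists of "earlier, later" pairs) are assumptions for this example.

```python
def globally_serializable(site_orders):
    """site_orders maps a site name to a list of (earlier, later) pairs of
    global transactions. Returns True if the union of all local orders is
    acyclic, i.e. one global serial order is consistent with every site."""
    graph = {}
    for pairs in site_orders.values():
        for before, after in pairs:
            graph.setdefault(before, set()).add(after)
            graph.setdefault(after, set())

    WHITE, GREY, BLACK = 0, 1, 2          # unvisited, on DFS stack, finished
    color = {node: WHITE for node in graph}

    def has_cycle(node):
        color[node] = GREY
        for nxt in graph[node]:
            if color[nxt] == GREY:        # back edge: cycle found
                return True
            if color[nxt] == WHITE and has_cycle(nxt):
                return True
        color[node] = BLACK
        return False

    return not any(color[n] == WHITE and has_cycle(n) for n in graph)

# Both sites order GT1 before GT2: serializable.
consistent = {"DBMS-1": [("GT1", "GT2")], "DBMS-2": [("GT1", "GT2")]}
# DBMS-2 orders GT2 before GT1: the two local orders contradict each other.
conflicting = {"DBMS-1": [("GT1", "GT2")], "DBMS-2": [("GT2", "GT1")]}
```

Note that this check needs the local serialization orders, which an autonomous local DBMS does not normally report to the GTM.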


Local Transactions and the Global Serializable Schedule
• Local transactions execute outside the control of the GTM
• Local transactions create indirect conflicts with global transactions
• GTM is not aware of local transactions and these indirect conflicts
• In general, the GTM cannot ensure global serializability

Example (data items a and b are stored at LDBMS-1, c and d at LDBMS-2):

GT1: r1(a) r1(c)
GT2: r2(b) r2(d)
LT3: w3(a) w3(b)   (local transaction at LDBMS-1)
LT4: w4(c) w4(d)   (local transaction at LDBMS-2)

GTM believes GT1 〈 GT2 at both sites

LDBMS-1: r1(a) c1 w3(a) w3(b) c3 r2(b) c2
LDBMS-2: w4(c) r1(c) c1 r2(d) c2 w4(d) c4

=> LDBMS-1: GT1 〈 LT3 〈 GT2
=> LDBMS-2: GT2 〈 LT4 〈 GT1
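The two orderings can be derived from the local schedules by collecting conflict edges (two operations of different transactions on the same item, at least one of them a write). The sketch below is a minimal illustration under that textbook definition; the helper names `conflict_edges` and `reaches` are assumptions for this example, and transactions are identified by their number only.

```python
import re

def conflict_edges(schedule):
    """Return the set of (Ti, Tj) pairs such that an operation of Ti precedes
    and conflicts with an operation of Tj in the given local schedule."""
    matches = [re.fullmatch(r"([rw])(\d+)\((\w+)\)", tok)
               for tok in schedule.split()]
    ops = [m.groups() for m in matches if m]   # commits such as 'c1' are skipped
    edges = set()
    for i, (kind1, txn1, item1) in enumerate(ops):
        for kind2, txn2, item2 in ops[i + 1:]:
            if txn1 != txn2 and item1 == item2 and "w" in (kind1, kind2):
                edges.add((txn1, txn2))
    return edges

def reaches(edges, start, goal):
    """True if goal can be reached from start by following conflict edges."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for src, dst in edges:
            if src == node and dst not in seen:
                seen.add(dst)
                stack.append(dst)
    return goal in seen

site1 = conflict_edges("r1(a) c1 w3(a) w3(b) c3 r2(b) c2")
site2 = conflict_edges("w4(c) r1(c) c1 r2(d) c2 w4(d) c4")

gt1_before_gt2_at_site1 = reaches(site1, "1", "2")   # via LT3
gt2_before_gt1_at_site2 = reaches(site2, "2", "1")   # via LT4
global_cycle = reaches(site1 | site2, "1", "1")
```

GT1 and GT2 never conflict directly; the cycle exists only through the local transactions 3 and 4, which the GTM cannot see.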


Controlling the Execution Order of Global Subtransactions
• Four Strategies:
1) Execute global transactions serially
   • No concurrent execution for global transactions!
   • Does not solve indirect conflicts with local transactions
   • Costs: Heavy CC processing at the GTM; low query processing throughput
2) Define a specific order over the global transactions and use the mechanism of each local DBMS to enforce that order
   • Every local DB stores one “ticket” object
   • Extend every global subtransaction to access the ticket
       GT1: r1(a) w1(a)    newGT1: r1(ticketS1) r1(a) w1(a) w1(ticketS1) c1
       GT2: r2(b) w2(b)    newGT2: r2(ticketS1) r2(b) w2(b) w2(ticketS1) c2
   • Means GT1 and GT2 will be correctly serialized with respect to all global transactions and all local transactions executed by the local DBMS at S1
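The ticket idea of strategy 2 can be sketched as follows: every global subtransaction reads and increments the site's ticket, so the ticket values expose the order in which the local DBMS serialized the global subtransactions. One possible use, assumed here, is for the GTM to compare the ticket orders of all sites. The `Site` class and `ticket_order_is_serializable` are hypothetical names; real local concurrency control is not modeled, and `graphlib` requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter, CycleError

class Site:
    """Hypothetical local DBMS that stores a single ticket object."""
    def __init__(self, name):
        self.name = name
        self.ticket = 0

    def take_ticket(self):
        # r(ticket) ... w(ticket): the read-increment-write creates a direct
        # conflict between any two global subtransactions at this site.
        value = self.ticket
        self.ticket = value + 1
        return value

def ticket_order_is_serializable(tickets):
    """tickets: {site: {global transaction: ticket value}}.
    True if the ticket orders of all sites fit into one global order."""
    sorter = TopologicalSorter()
    for per_site in tickets.values():
        ordered = sorted(per_site, key=per_site.get)
        for earlier, later in zip(ordered, ordered[1:]):
            sorter.add(later, earlier)    # 'later' must follow 'earlier'
    try:
        sorter.prepare()                  # raises CycleError on a cycle
    except CycleError:
        return False
    return True

s1, s2 = Site("S1"), Site("S2")
agreeing = {
    "S1": {"GT1": s1.take_ticket(), "GT2": s1.take_ticket()},
    "S2": {"GT1": s2.take_ticket(), "GT2": s2.take_ticket()},
}
disagreeing = {
    "S1": {"GT1": s1.take_ticket(), "GT2": s1.take_ticket()},
    "S2": {"GT2": s2.take_ticket(), "GT1": s2.take_ticket()},
}
```

In the second run S2 hands out its tickets in the opposite order to S1, so one of the two global transactions would have to be aborted.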


Controlling the Execution Order of Global Subtransactions
3) Use local DBs deploying rigorous CC algorithms
   • If all LDBMSs use rigorous 2-phase locking and support a “prepare-to-commit” interface, then
     – Global transactions are serializable without a CC algorithm at the GTM
     – Local transactions cannot cause indirect conflicts
   Ex: (w4(c) r1(c) c1 r2(d) c2 w4(d) c4) is not a rigorous local schedule. In R2PL, T4 holds all locks until commit, so T1 cannot read object c until after T4 commits.
4) Relax the serializability requirement
   • Use “strong correctness” instead
   • Most indirect conflicts have no effect on correctness
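Whether a local schedule is rigorous can be tested directly: no operation may conflict with an earlier operation of a transaction that has not yet committed. The sketch below is a minimal checker under that definition; the function name `is_rigorous` is an assumption for this example, and only reads, writes, and commits are handled (no aborts).

```python
import re

def is_rigorous(schedule):
    """Check a local schedule such as 'w4(c) r1(c) c1'.
    Rigorous: an operation never conflicts with an earlier operation of
    another transaction that is still uncommitted."""
    active_ops = []                        # (kind, txn, item) of uncommitted txns
    for tok in schedule.split():
        commit = re.fullmatch(r"c(\d+)", tok)
        if commit:
            # Commit releases everything the transaction held.
            active_ops = [op for op in active_ops if op[1] != commit.group(1)]
            continue
        kind, txn, item = re.fullmatch(r"([rw])(\d+)\((\w+)\)", tok).groups()
        for kind2, txn2, item2 in active_ops:
            if txn2 != txn and item2 == item and "w" in (kind, kind2):
                return False               # conflicts with an uncommitted txn
        active_ops.append((kind, txn, item))
    return True

slide_example = "w4(c) r1(c) c1 r2(d) c2 w4(d) c4"
t4_runs_first = "w4(c) w4(d) c4 r1(c) c1 r2(d) c2"
```

The slide's schedule is rejected at `r1(c)`, because T4 has written c and not yet committed; under R2PL the same operations could only execute in an order like the second schedule.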


Alternative Consistency Models
Constraint-based strategies
• Global schedule is not serializable; it is strongly correct
  – Global transactions preserve all data consistency constraints
• Local serializability: Some HDBS applications have no global constraints because each DBS is (and should be) independent from each other => no global concurrency control mechanism needed. So, local serializability ensures strong correctness of global executions.
  Ex application: travel reservation service for planes, trains, ferries, hotels, etc.
• Limited global constraints: Some applications need global constraints. Define 2 types of data: global data and local data. Global constraints may only span global data, and local transactions may not write to global data. Use two-level serializability (2LSR): local-SR and global-SR.
  Artificial solution: local site has no autonomy over or direct access to global data; local site must submit transactions to GTM to update global data stored at the local site => master-slave relationship.


Alternative Consistency Models
Non-constraint-based strategies
• Diverge from strong correctness and serializability
1) Epsilon Serializability
   • Allows a specified number of nonserializable conflicts
2) Sets of Compatible Transactions
   • Assume a set of known transactions
   • Pre-analyze the transactions for conflicts
   • Group non-conflicting transactions into compatible sets
   • No CC control required among transactions in a compatible set
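The pre-analysis for compatible sets can be sketched with read and write sets: two transactions conflict if one writes an item the other reads or writes. The greedy grouping below is only one possible way to form the sets, chosen for brevity; the transaction names and their read/write sets are hypothetical.

```python
def conflicts(t1, t2):
    """t1 and t2 are (read set, write set) pairs. They conflict if one
    writes an item that the other reads or writes."""
    reads1, writes1 = t1
    reads2, writes2 = t2
    return bool(writes1 & (reads2 | writes2) or writes2 & reads1)

def compatible_sets(txns):
    """Greedily place each transaction into the first group in which it
    conflicts with no member; open a new group otherwise."""
    groups = []
    for name, sets in txns.items():
        for group in groups:
            if not any(conflicts(sets, txns[other]) for other in group):
                group.append(name)
                break
        else:
            groups.append([name])
    return groups

# Hypothetical known transactions: name -> (read set, write set).
txns = {
    "T1": ({"a"}, {"a"}),
    "T2": ({"b"}, {"b"}),
    "T3": ({"a", "b"}, set()),
    "T4": ({"c"}, {"c"}),
}
groups = compatible_sets(txns)
```

T3 reads items that T1 and T2 write, so it ends up in a set of its own; T1, T2, and T4 touch disjoint items and need no concurrency control among themselves.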


Global Atomicity and Recovery Problem
• The GTM must guarantee that a global transaction commits at all sites or aborts at all sites
• Local DBMSs wish to preserve their execution autonomy
  – May not implement or export a “prepare-to-commit” interface

[Figure: For GT1, the GTM runs 2PC with a GTM Proxy at each site, but there is no 2PC between a proxy and its LDBMS. One LDBMS aborts GST11 while the other commits GST12.]

• A local DBMS can unilaterally abort a subtransaction anytime – Results in non-atomic global transactions and incorrect global schedules – Local transactions and global subtransactions see committed partial results

Note: The first heterogeneous systems did not support update transactions!
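The atomicity problem can be reproduced with a toy coordinator. The sketch below is a simplified model, not a full 2PC implementation (no logging, timeouts, or recovery); the `Participant` class, its `supports_prepare` and `will_fail` flags, and the site names are assumptions for this example.

```python
class Participant:
    """Hypothetical local DBMS as seen through its GTM proxy."""
    def __init__(self, name, supports_prepare, will_fail=False):
        self.name = name
        self.supports_prepare = supports_prepare
        self.will_fail = will_fail
        self.state = "active"

    def prepare(self):
        # Vote yes only if the subtransaction can still be committed.
        self.state = "aborted" if self.will_fail else "prepared"
        return self.state == "prepared"

    def commit(self):
        # Without a prepared state the LDBMS may still abort unilaterally.
        if self.state != "prepared" and self.will_fail:
            self.state = "aborted"
        else:
            self.state = "committed"

    def abort(self):
        self.state = "aborted"

def two_phase_commit(participants):
    # Phase 1: only participants exporting prepare-to-commit can vote.
    votes = [p.prepare() for p in participants if p.supports_prepare]
    decision = "commit" if all(votes) else "abort"
    # Phase 2: send the decision to every participant.
    for p in participants:
        if decision == "commit":
            p.commit()
        else:
            p.abort()
    return decision, {p.name: p.state for p in participants}

# Both sites export prepare-to-commit: the failure is caught in phase 1.
atomic = two_phase_commit([
    Participant("LDBMS-1", supports_prepare=True, will_fail=True),
    Participant("LDBMS-2", supports_prepare=True),
])
# LDBMS-1 exports no prepare-to-commit and aborts after the decision.
non_atomic = two_phase_commit([
    Participant("LDBMS-1", supports_prepare=False, will_fail=True),
    Participant("LDBMS-2", supports_prepare=True),
])
```

In the second run the GTM decides to commit, yet one subtransaction aborts and the other commits, which is exactly the non-atomic outcome described above.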


Approaches to Achieve Atomicity and Durability
• If all LDBMSs export a “prepare-to-commit” interface, then use 2PC between the proxy and the LDBMS

• If some LDBMSs do not export “prepare-to-commit”, then four approaches:
1) Modify each global subtransaction to “callback to the proxy” just before local commit
   • Blocks the global subtransaction until the GTM completes 2PC with the proxies
   • Possible only if the LDBMS supports a client callback service
   • Fails if the LDBMS uses optimistic concurrency control
[Figure: 2PC runs between the GTM and the GTM Proxy; there is no 2PC between the GTM Proxy and the LDBMS.]


Approaches to Achieve Atomicity and Durability
• If any global subtransaction aborts
2) REDO failed write operations from global subtransactions
   - Performed by the proxy, who must maintain a local redo log
3) RETRY failed global subtransactions (read & write operations)
   - Performed by the proxy
   - Inappropriate semantics for many applications or transactions
   - No guarantee that the retry can ever be committed
     Ex: Banking application – withdrawing money can fail "forever"
4) UNDO committed global subtransactions by executing compensating transactions
   - Performed by the GTM
   - Can provide semantic atomicity (called a saga)
   - Inconsistent data is temporarily visible to other transactions!
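Approach 4 can be sketched as a small saga runner: each subtransaction is paired with a compensating action, and when a later step fails the already committed steps are compensated in reverse order. The account names, amounts, and function names below are hypothetical, and the sketch ignores failures of the compensations themselves.

```python
def run_saga(steps):
    """steps: list of (action, compensation) callables.
    Runs the actions in order; if one fails, runs the compensations of the
    steps that already committed, in reverse order."""
    done = []
    for action, compensate in steps:
        try:
            action()
        except Exception:
            for undo in reversed(done):
                undo()
            return False
        done.append(compensate)
    return True

# Hypothetical transfer: the debit commits at one site, the credit fails.
balances = {"A": 100, "B": 0}
observed = []   # what other transactions could see between the steps

def debit():
    balances["A"] -= 30
    observed.append(dict(balances))   # committed partial result is visible

def refund():
    balances["A"] += 30

def credit_fails():
    raise RuntimeError("subtransaction aborted by the local DBMS")

ok = run_saga([(debit, refund), (credit_fails, lambda: None)])
```

After the run the balances are restored, but `observed` shows that the intermediate state (30 units missing) was visible before compensation, which is why a saga gives semantic atomicity only.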


Global Deadlock Problem
• Same problem as in distributed homogeneous DBMSs

[Figure: a global deadlock across two sites.
 Site X: T1x holds Lx; T2x holds lock Lb; T2x waits for T1x to release Lx.
 Site Y: T1y holds lock La; T2y holds lock Ly; T1y waits for T2y to release Ly.
 Across sites: T1x needs a and waits for T1y to complete; T2y needs b and waits for T2x to complete.]

• We solved the problem by exchanging lock information to construct the global “waits-for” graph
  – This violates design autonomy and communication autonomy
• Therefore the GTM will be unaware of a global deadlock.
• There are no complete solutions to the global deadlock problem for autonomous multi-database systems.
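The role of the global waits-for graph can be illustrated with a small cycle check. The edges below are one reading of the two-site example (T2x waits for T1x, T1x for T1y, T1y for T2y, T2y for T2x) and should be treated as an assumption; `has_deadlock` is a hypothetical helper, and `graphlib` requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter, CycleError

def has_deadlock(waits_for):
    """waits_for: iterable of (waiter, holder) edges.
    A deadlock exists exactly when the waits-for graph contains a cycle."""
    sorter = TopologicalSorter()
    for waiter, holder in waits_for:
        sorter.add(waiter, holder)
    try:
        sorter.prepare()                  # raises CycleError on a cycle
    except CycleError:
        return True
    return False

# Edges each site can see locally.
site_x = [("T2x", "T1x")]                 # T2x waits for T1x to release Lx
site_y = [("T1y", "T2y")]                 # T1y waits for T2y to release Ly
# Edges that exist only between sites (subtransactions of the same GT).
cross_site = [("T1x", "T1y"), ("T2y", "T2x")]

local_x = has_deadlock(site_x)
local_y = has_deadlock(site_y)
global_view = has_deadlock(site_x + site_y + cross_site)
```

Neither site sees a cycle on its own; the deadlock appears only once the local graphs are merged, and autonomous LDBMSs do not export the lock information needed to build that merged graph.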


Status: Transaction Management for HDBS

• Transaction management for HDBSs is a very active research area.
• Distributed transactions over the define new semantics for transaction consistency, allowing development of new solutions.

Open issues: • What can be done if some of the local subsystems (e.g., file systems) do not support transaction management?

• Performance implications of transaction management strategy?

• Handling of different degrees of consistency?


Conclusions

HDBS allows a uniform view on the combination of data maintained by different autonomous database systems.

• available: prototypes & commercial products with a set of fixed / specific drivers (so-called gateways) for existing, widely used data management systems (conventional DBS and file systems)

• missing: systematic support for individual integration of arbitrary data management systems
  – Examples: geographical DBs, multimedia DBs, Internet storefronts, etc.
