IP, TCP, UDP and RPC

† The most accepted standard for † Yet sockets are quite low level for network communication is IP many applications, thus, RPC (Internet Protocol) which provides (Remote Procedure Call) appeared unreliable delivery of single packets as a way to to one-hop distant hosts Ä hide communication details † IP was designed to be hidden behind behind a procedural call other software layers: Ä bridge heterogeneous Ä TCP (Transport Control environments Protocol) implements connected, † RPC is the standard for distributed reliable message exchange (client-server) computing Ä UDP (User Datagram Protocol) Remote Procedure Call (RPC) implements unreliable datagram based message exchanges RPC † TCP/IP and UDP/IP are visible to Gustavo Alonso applications through sockets. The SOCKETS Computer Science Department purpose of the socket interface was Swiss Federal Institute of Technology (ETHZ) to provide a UNIX-like abstraction TCP, UDP [email protected] http://www.iks.inf.ethz.ch/ IP

©Gustavo Alonso, ETH Zürich. 2

The basics of client/server Problems to solve

† Imagine we have a program (a † Ideally, we want he programs to † How to make the service invocation † How to find the service one actually server) that implements certain behave like this (sounds simple?, part of the language in a more or wants among a potentially large services. Imagine we have other well, this idea is only 20 years old): less transparent manner. collection of services and servers. programs (clients) that would like to Ä Don’t forget this important Ä The goal is that the client does invoke those services. aspect: whatever you design, not necessarily need to know Machine A Machine B † To make the problem more others will have to program and where the server resides or even interesting, assume as well that: (client) (server) use which server provides the Ä client and server can reside on service. different computers and run on Service request different operating systems † How to exchange data between † How to deal with errors in the Ä the only form of communication machines that might use different service invocation in a more or less is by sending messages (no representations for different data elegant manner: shared memory, no shared disks, types. This involves two aspects: Ä server is down, etc.) Service response Ä data type formats (e.g., byte Ä communication is down, Ä some minimal guarantees are to orders in different architectures) be provided (handling of Ä server busy, Ä data structures (need to be Ä duplicated requests ... failures, call semantics, etc.) flattened and the reconstructed) Ä we want a generic solution and not a one time hack Execution Thread Message ©Gustavo Alonso, ETH Zürich. 3 ©Gustavo Alonso, ETH Zürich. 4 Programming languages Interoperability

† The notion of distributed service † Once we are working with remote † When exchanging data between † The concept of transforming the data invocation became a reality at the procedures in mind, there are several clients and servers residing in being sent to an intermediate beginning of the 80’s when aspects that are immediately different environments (hardware or representation format and back has procedural languages (mainly C) determined: software), care must be taken that different (equivalent) names: were dominant. Ä data exchange is done as input the data is in the appropriate format: Ä marshalling/un-marshalling † In procedural languages, the basic and output parameters of the Ä byte order: differences between Ä serializing/de-serializing module is the procedure. A procedure call little endian and big endian † The intermediate representation procedure implements a particular Ä pointers cannot be passed as architectures (high order bytes format is typically system function or service that can be used parameters in RPC, opaque first or last in basic data types) dependent. For instance: anywhere within the program. references are needed instead so data structures: like trees, hash Ä Ä SUN RPC: XDR (External Data † It seemed natural to maintain this that the client can use this tables, multidimensional arrays, same notion when talking about reference to refer to the same or records need to be flattened Representation) distribution: the client makes a data structure or entity at the (cast into a string so to speak) † Having an intermediate procedure call to a procedure that is server across different calls. The before being sent representation format simplifies the design, otherwise a node will need implemented by the server. Since server is responsible for † This is best done using an the client and server can be in providing this opaque intermediate representation format to be able to transform data to any different machines, the procedure is references. possible format remote. † Client/Server architectures are based on Remote Procedure Calls (RPC)

©Gustavo Alonso, ETH Zürich. 5 ©Gustavo Alonso, ETH Zürich. 6

Example (XDR in SUN RPC) Binding

† Marshalling or serializing can be † SUN XDR follows a similar † A service is provided by a server † With a distributed binder, several done by hand (although this is not approach: located at a particular IP address and general operations are possible: desirable) using (in C) sprintf and Ä messages are transformed into a listening to a given port Ä REGISTER (Exporting an sscanf: sequence of 4 byte objects, each † Binding is the process of mapping a interface): A server can register byte being in ASCII code service name to an address and port service names and the Message= “Alonso” “ETHZ” “2001” Ä it defines how to pack different that can be used for communication corresponding port data types into these objects, purposes Ä WITHDRAW: A server can char *name=“Alonso”, place=“ETHZ”; which end of an object is the † Binding can be done: withdraw a service int year=2001; most significant, and which byte Ä locally: the client must know the Ä LookUP (Importing an of an object comes first sprintf(message, “%d %s %s %d %d”, name (address) of the host of the interface): A client can ask the strlen(name), name, strlen(place), place, Ä the idea is to simplify server binder for the address and port year); computation at the expense of Ä distributed: there is a separate of a given service bandwidth service (service location, name † There must also be a way to locate Message after marshalling = and directory services, etc.) in the binder (predefined location, “6 Alonso 4 ETHZ 2001” 6 String length charge of mapping names and environment variables, broadcasting A l o n addresses. These service must to all nodes looking for the binder) † Remember that the type and number s o be reachable to all participants of parameters is know, we only need 4 String length to agree on the syntax ... E T H Z 2 0 0 1 cardinal

©Gustavo Alonso, ETH Zürich. 7 ©Gustavo Alonso, ETH Zürich. 8 Call semantics Making it work in practice

† A client makes an RPC to a service † At least-once: the procedure will be † One cannot expect the programmer at a given server. After a time-out executed if the server does not fail, to implement all these mechanisms CLIENT Client process expires, the client may decide to re- but it is possible that it is executed every time a distributed application call to remote procedure send the request. If after several tries more than once. This may happen, is developed. Instead, they are there is no success, what may have for instance, if the client re-sends the provided by a so called RPC system happened depends on the call request after a time-out. If the server (our first example of low level CLIENT stub procedure semantics: is designed so that service calls are middleware) Bind idempotent (produce the same † What does an RPC system do? Marshalling outcome given the same input), this Communication † Maybe: no guarantees. The Ä Provides an interface definition Send procedure may have been executed might be acceptable. language (IDL) to describe the module (the response message(s) was lost) † At most-once: the procedure will be services executed either once or not at all. or may have not been executed (the Ä Generates all the additional code request message(s) was lost). It is Re-sending the request will not result in the procedure executing necessary to make a procedure Communication very difficult to write programs call remote and to deal with all SERVER stub procedure several times module based on this type of semantics since the communication aspects Unmarshalling the programmer has to take care of Return all possibilities Ä Provides a binder in case it has a distributed name and directory Dispatcher service system (select SERVER stub) remote procedure Server process

©Gustavo Alonso, ETH Zürich. 9 ©Gustavo Alonso, ETH Zürich. 10

In more detail IDL (Interface Definition Language)

Client Client Comm. Comm. Server Server † All RPC systems have a language † Given an IDL specification, the code stub Module Binder that allows to describe services in an interface compiler performs a variety module stub code abstract manner (independent of the of tasks: programming language used). This Register service request † generates the client stub procedure for language has the generic name of each procedure signature in the RPC bind ACK Look up request IDL (e.g., the IDL of SUN RPC is interface. The stub will be then XDR) compiled and linked with the client Look up response † The IDL allows to define each code service in terms of their names, and † Generates a server stub. It can also send input and output parameters (plus RPC request create a server main, with the stub and maybe other relevant aspects). the dispatcher compiled and linked † An interface compiler is then used to into it. This code can then be extended call generate the stubs for clients and by the designer by writing the servers (rpcgen in SUN RPC). It implementation of the procedures might also generate procedure † It might generate a *.h file for return headings that the programmer can importing the interface and all the then used to fill out the details of the necessary constants and types RPC response implementation.

return

©Gustavo Alonso, ETH Zürich. 11 ©Gustavo Alonso, ETH Zürich. 12 Programming RPC RPC Application

† RPC usually provides different † The Simplified Interface (in SUN levels of interaction to provide RPC) has only three calls: SALES POINT CLIENT Server 1 different degrees of control over the Ä rpc_reg() registers a procedure IF no_customer_# system: as a remote procedure and THEN New_customer New_customer returns a unique, system-wide ELSE Lookup_customer S Lookup_customer M Customer identifier for the procedure B Simplified Interface Check_inventory Delete_customer D database Ä rpc_call() given a procedure IF enough_supplies Update_customer Top Level identifier and a host, it makes a THEN Place_order call to that procedure ELSE ... Server 2 Intermediate Level Ä rpc_broadcast() is similar to rpc_call() but broadcasts the New_product Expert Level INVENTORY CONTROL S message instead Lookup_product M Products

B

CLIENT Delete_product D Bottom Level † Additional levels allow more control database of transport protocols, binding Lookup_product Update_product procedures, etc. Check_inventory IF supplies_low Server 3 † Each level adds more complexity to THEN the interface and requires the Place_order

Place_order S Inventory programmer to take care of more Cancel_order M

Update_inventory B and order

aspects of a distributed system Update_inventory D ... database Check_inventory

©Gustavo Alonso, ETH Zürich. 13 ©Gustavo Alonso, ETH Zürich. 14

RPC in perspective RPC system issues

ADVANTAGES DISADVANTAGES † RPC was one of the first tools that † RPC can be used to build systems † RPC provided a mechanism to † RPC is not a standard, it is an idea allowed the modular design of with many layers of abstraction. implement distributed applications that has been implemented in many distributed applications However, every RPC call implies: in a simple and efficient manner different ways (not necessarily † RPC implementations tend to be Ä several messages through the † RPC followed the programming compatible) quite efficient in that they do not add network techniques of the time (procedural † RPC allows designers to build too much overhead. However, a Ä at least one context switch (at languages) and fitted quite well with distributed systems but does not remote procedure is always slower the client when it places the call, the most typical programming solve many of the problems than a local procedure: but there might be more) languages (C), thereby facilitating distribution creates. In that regard, it Ä should a remote procedure be † When a distributed application is its adoption by system designers is only a low level construct transparent (identical to a local complex, deep RPC chains are to be † RPC allowed the modular and † RPC was designed with only one procedure)? (yes: easy of use; avoided hierarchical design of large type of interaction in mind: no: increase programmer distributed systems: client/server. This reflected the awareness) Ä client and server are separate hardware architectures at the time Ä should location be transparent? entities when distribution meant small (yes: flexibility and fault terminals connected to a mainframe. tolerance; no: easier design, less Ä the server encapsulates and hides the details of the back end As hardware and networks evolve, overhead) systems (such as databases) more flexibility was needed Ä should there be a centralized name server (binder)?

©Gustavo Alonso, ETH Zürich. 15 ©Gustavo Alonso, ETH Zürich. 16 From RPC we go to ... DCE

Stored procedures Distributed environments † The Distributed Computing † Two tier architectures are, in fact, † When designing distributed Environment is a standard Distributed Applications client/server systems. They need applications, there are a lot of implementation of RPC and a some sort of interface to allow crucial aspects common to all of distributed run-time environment clients to invoke the functionality of them. RPC does not address any of provided by the Open Software Foundation (OSF). It provides: the server. RPC is the ideal interface these issues Distributed Time for client/server interactions on a † To support the design and Ä RPC File Service Service LAN deployment of distributed systems, Ä Cell Directory: A sophisticated † To add flexibility to their servers, programming and run time Name and Directory Service software vendors added to them the environments started to be created. Ä Time: for clock synchronization RPC possibility of programming These environments provide, on top across all nodes procedures that will run inside the of RPC, much of the functionality server and that could be invoked needed to build and run a distributed Ä Security: secure and through RPC application authenticated communication Cell Directory Security Ä Distributed File: enables sharing Service † This turned out to be very useful for † The notion of distributed Service databases where such procedures environment is what gave rise to of files across a DCE could be used to hide the schema environment middleware. During the course, we Thread Service and the SQL programming from the will see many examples of such Ä Threads: support for threads and DCE clients. The result was stored environments. multiprocessor architectures procedures, a common mechanism found in all database systems Transport Service/ OS

©Gustavo Alonso, ETH Zürich. 17 ©Gustavo Alonso, ETH Zürich. 18

DCE architecture DCE’s model and goals

client processDCE server process † Not intended as a final product but † Encina (a TP-Monitor) is an development as a basic platform to build more example of an extension of DCE: environment client server IDL sophisticated middleware tools code code † Its services are provided as the most basic services needed in any Distributed Applications IDL distributed system. Any other sources functionality needs to be language specific language specific implemented on top of it call interface call interface Encina Monitor † DCE is not just an specification of a client stub IDL compiler server stub standard (e.g., CORBA) but an implementation that acts as the Structured Peer to Reliable standard. Since the API is the same File Peer Queuing RPC API RPC API across all platforms, interoperability Service Comm Service RPC run time interface RPC run time is always guaranteed service library headers service library † DCE is packaged in a modular way so that services that are not used do Encina Encina Toolkit not need to be licensed

RPC security cell distributed thread OSF DCE protocols service service file service service

DCE runtime environment ©Gustavo Alonso, ETH Zürich. 19 ©Gustavo Alonso, ETH Zürich. 20 Encina’s extensions to DCE † Encina tailors DCE to transactional Ä Encina Monitor: execution environments. It is used very much support for starting/stopping like RPC but the range of services services, failure detection, available is much wider: system management, etc. Ä Encina toolkit: support for † Encina extends IDL with interactions with databases, XA transactions (TIDL transactional interface, and distributed IDL) and even provides its own transactions language support in the form of Ä Structured File Service: record transactional C and transactional oriented file server with C++ Extending RPC: transactional properties † Why all this is necessary will be Ä Peer to peer communication: for discussed when we cover TP- Message Oriented Middleware connectivity with mainframes Monitors Ä Reliable Queuing Service: transactional queues for Gustavo Alonso asynchronous interaction Computer Science Department Swiss Federal Institute of Technology (ETHZ) [email protected] http://www.iks.inf.ethz.ch/

©Gustavo Alonso, ETH Zürich. 21

Synchronous Client/Server Disadvantages of sync C/S

† The most straightforward interaction † Synchronous interaction requires INVENTORY CONTROL Call between components is the both parties to be “on-line”: the Receive request/response model in which the IF supplies_low caller makes a request, the receiver client sends a request and waits until THEN gets the request, processes the idle time the server provides a response: BOT request, sends a response, the caller Place_order Response Ä closely resembles the way we receives the response. Answer program (hence RPC as the Update_inventory † The caller must wait until the basic mechanism to support this EOT response comes back. The receiver idea) does not need to exist at the time of † Because it synchronizes client and server, this mode of operation has Ä the model is simple and intuitive the call (TP-Monitors, CORBA or Server 2 (products) Server 3 (inventory) DCOM create an instance of the several disadvantages: Ä well supported by RPC and the Ä connection overhead systems built around RPC New_product Place_order service/server /object when called if it does not exist already) but the Ä higher probability of failures (TRPC, TP-Monitors and even Lookup_product Cancel_order interaction requires both client and Object Monitors) Delete_product Update_inventory Ä difficult to identify and react to server to be “alive” at the same time failures Ä needs additional infrastructure Update_product Check_inventory when interactions becomes more Ä it is a one-to-one system; it is complex (e.g., nested) but this not really practical for nested infrastructure is available calls and complex interactions

S S Inventory (the problems becomes even

M Products M

B B and order more acute)

D database D database

©Gustavo Alonso, ETH Zürich. 23 ©Gustavo Alonso, ETH Zürich. 24 Overhead of synchronism Failures in synchronous calls † Synchronous invocations require to † If the client or the server fail, the maintain a session between the context is lost and resynchronization caller and the receiver. request() might be difficult. request() 1 † Maintaining sessions is expensive receive Ä If the failure occurred before 1, 2 and consumes CPU resources. There session nothing has happened receive process is also a limit on how many sessions duration Ä If the failure occurs after 1 but process can be active at the same time (thus return before 2 (receiver crashes), then return limiting the number of concurrent the request is lost clients connected to a server) 4 3 do with answer Ä If the failure happens after 2 but do with answer † For this reason, client/server systems before 3, side effects may cause often resort to connection pooling to inconsistencies optimize resource utilization request() Ä If the failure occurs after 3 but 1 Ä have a pool of open connections before 4, the response is lost but request() receive 2 Ä associate a thread with each the action has been performed receive

connection process s (do it again?) e

c process i

allocate connections as needed return v † Finding out when the failure took Ä r

e return † When the interaction is not one-to- s place may not be easy. Worse still, if d 3 do with answer e there is a chain of invocations, the

one, the context (the information t a

c failure can occur anywhere. do with answer i

defining a session) needs to be l p

passed around. The context is e timeout receive r 2’ usually volatile receive process try again process return Context is lost do with answer return Needs to be restarted!! 3’ ©Gustavo Alonso, ETH Zürich. 25 ©Gustavo Alonso, ETH Zürich. 26

Failure semantics Two solutions

† A great deal of the functionality † The more elements are involved in Enhanced Support ASYNCHRONOUS INTERACTION built around RPC tries to address the an interaction, the higher the † Client/Server middleware provides a † Using asynchronous interaction, the problem of failure semantics, i.e., probability that the interaction will number of mechanisms to deal with caller sends a message that gets determine what has happened after a fail (a failure in anyone of the the problems created by stored somewhere until the receiver failure elements results is enough) synchronous interaction: reads it and sends a response. The † Exactly-once semantics solves this † The more elements are required to Ä Transactional RPC: to enforce response is sent in a similar manner problem but it has hidden costs: be alive for an interaction to exactly once execution † Asynchronous interaction can take Ä it implies atomicity in all succeed, the more difficult it is to semantics and enable more place in two forms: maintain the system: operations complex interactions with some Ä non-blocking invocation (RPC Ä the server must support some Ä even if it is modular, the execution guarantees but the call returns immediately form of 2PC; if it is a database, components cannot do anything Ä Service replication and load without waiting for a response, then one can use the XA without the rest of the system balancing: to prevent the system similar to batch jobs) interface, otherwise one needs a Ä upgrades, corrections, general from having to shut down if a Ä persistent queues (the call and TP-Monitor to make the server maintenance becomes very given service is not available; the response are actually transactional difficult because they might this also gives a chance to persistently stored until they are Ä it usually requires a coordinator require to shut the system down maintain and upgrade the accessed by the client and the to oversee the interaction system while keeping it online server)

©Gustavo Alonso, ETH Zürich. 27 ©Gustavo Alonso, ETH Zürich. 28 TP-Monitors Reliable queuing

† The problems of synchronous † Reliable queuing turned out to be a

client b server u

interaction are not new. The first t very good idea and an excellent

service call s service

systems to provide alternatives were r complement to synchronous e

TP-Monitors which offered two v interactions: get results r return results choices: e Client stub S Ä Suitable to modular design: the Ä asynchronous RPC: client code for making a request can makes a call that returns be in a different module (even a RPC support request() immediately; the client is different machine!) than the queue responsible for making a second code for dealing with the receive call to get the results Input queue Output queue response Reliable process Ä Reliable queuing systems (e.g., queuing Ä It is easier to design return Encina, Tuxedo) where instead sophisticated distribution modes system queue of through procedure calls, external (multicast, transfers, replication, client and server interact by application Output coalescing messages) an it also do with answer exchanging messages. Making queue helps to handle communication the messages persistent by sessions in a more abstract way storing them in queues added Input Ä More natural way to implement considerable flexibility to the complex interactions (see next) system queue

external application

©Gustavo Alonso, ETH Zürich. 29 ©Gustavo Alonso, ETH Zürich. 30

Queuing systems Transactional queues

† Queuing systems implement † Persistent queues are closely tied to client asynchronous interactions. transactional interaction: † Each element in the system Ä to send a message, it is written in external communicates with the rest via the queue using 2PC application persistent queues. These queues store Ä messages between queues are messages transactionally, guaranteeing exchanged using 2PC Input queue Output queue that messages are there even after Ä reading a message from a queue, 2PC 2PC failures occur. processing it and writing the reply † Queuing systems offer significant to another queue is all done under advantages over traditional solutions in 2PC terms of fault tolerance and overall Output queue Input queue Reliable queuing system † This introduces a significant overhead Monitoring system flexibility: applications do not but it also provides considerable need to be there at the time a request is 2PC 2PC Administration made! advantages. The overhead is not that Persistent storage important with local transactions † Queues provide a way to communicate (writing or reading to a local queue). across heterogeneous networks and † Using transactional queues, the Input queue Output queue Input queue Output queue systems while still being able to make processing of messages and sending and some assumptions about the behavior of receiving can be tied together into one the messages. single transactions so that atomicity is 2PC † They can be used embedded (workflow, guaranteed. This solves a lot of external TP-Monitors) or by themselves problems! external application (MQSeries, Tuxedo/Q). application

©Gustavo Alonso, ETH Zürich. 31 ©Gustavo Alonso, ETH Zürich. 32 Problems solved (I) Problems solved (II)

SENDING RECEIVING † An application can bundle within a single transaction reading a message Input queue Output queue from a queue, interacting with other systems, and writing the response to a external external application application queue. 2PC † If a failure occur, in all scenarios consistency is ensured: external Ä if the transaction was not application 2PC 2PC completed, any interaction with other applications is undone and the reading operation from the input queue is not committed: the message remains in the input queue. Upon recovery, the message can be Message is either read or written Message is now persistent. If the node Arriving messages remain in the queue. processed again, thereby achieving Input queue Output queue exactly once semantics. crashes, the message remains in the If the node crashes, messages are not Ä If the transaction was completed, queue. Upon recovery, the application lost. The application can now take its the write to the output queue is 2PC can look in the queue and see which time to process messages. It is also committed, i.e., the response remains in the queue and can be external messages are there and which are possible for several applications to read sent upon recovery. application not. Multiple applications can write to from the same queue. This allows to Ä If replicated services are used, if one fails and the message remains the same queue, thereby “multiplexing” implement replicated services, do load in the input queue, it is safe for the channel. balancing, and increase fault tolerance. other services to take over this This is undone if necessary message. ©Gustavo Alonso, ETH Zürich. 33 ©Gustavo Alonso, ETH Zürich. 34

Simple implementation Queues in practice

† Persistent queues can be † To access a queue, a client or a

b

g )

implemented as part of a database server uses the queuing services, client u

t

e

n

s i

since the functionality needed is external e.g., : c u

put t

vi e

application n

exactly that of a database: r u

Ä put (enqueue) to place a e e

… i

q

l s Ä a transactional interface message in a given queue ( Ä persistence of committed Ä get (dequeue) to read a message C transactions from a queue Ä advanced indexing and search Output queue Input queue Ä mput to put a message in capabilities multiple queues put get mput † Thus, messages in a queue become Ä transfer a message from a queue simple entries in a table. These to another entries can be manipulated like any † In TP-Monitors, these services are queue other data in a database so that implemented as RPC calls to an persistent applications using the queue can internal resource manager (the management repository

assign priorities, look for messages reliable queuing service) Queuing server with given characteristics, trigger certain actions when messages of a MSSG QUEUE † These calls can be made part of particular kind arrive … m1 q1 transaction using the same m2 q3 mechanisms of TRPC (the queuing m3 q7 system uses an XA interface and

works like any other resource b g m6 q5 )

server u

t e

manager) n

s i

m4 q1 c

i u

get t

v

e n

m5 q1 r

u

e e

… i

q

l

s

( C

©Gustavo Alonso, ETH Zürich. 35 ©Gustavo Alonso, ETH Zürich. 36 Advanced functionality Types and messages

† Queues allow to implement complex interaction patterns between modules: † Queues are very useful but they also † The way to develop a system is as Ä 1-to-1 interaction with failure resilience have their disadvantages from the follows: programming point of view: Ä 1-to-many (multicast: put in a queue and then send from this queue to many Ä define message formats by other queues) this is very helpful for “subscriptions”. The fact that the Ä In RPC, the type of the creating complex types (records, queues are implemented in the database even helps with performance since parameters exchanged between objects) the logic for distribution can be embedded in the database itself client and server is determined Ä create the queues and the access by the IDL definition and policies for those queues Ä many-to-1 many modules send their request to a single module that can then available in the stubs. The RPC assign priorities, reorder, compare, etc. infrastructure takes care of Ä program the server and clients Ä many-to-many as in replicated services for large amount of clients according to the type definitions marshalling, unmarshalling, of the messages † In some cases queues are being used for interactions that are also on-line. If the serializing, etc. † The system uses the types defined queues are fast enough (like in a cluster) one can take advantage of the Ä When queues are used, there is properties of queues at the expense of performance. Building computer farms no IDL determining the for the messages to set up the RPC becomes easier since messages are one more element that can be moved, copied interface. The type and format calls needed to do marshalling, and stored. of the data in a queue must be unmarshalling, serialization, etc. † Incorporating queues into databases provides databases with a very powerful agreed upon before hand but the tool for designing distributed applications (TP-light). system does not have much control over it Ä The role of IDL is now taken over by the message format (it is not in the stubs)

©Gustavo Alonso, ETH Zürich. 37 ©Gustavo Alonso, ETH Zürich. 38

Beyond client/server Message brokers

† Persistent queues are most useful † Because these interactions are also † Message brokers add logic to the queues when the interactions are not simple very common and have increased in and at the level of the messaging client/server calls importance, queuing systems are no infrastructure. Ä workflow processes can be longer just one more module in TP- † Messaging processing is no longer just easily implemented as a Monitors but have become products moving messages between locations but in their own right (e.g., MQSeries of designers can associate rules and sequence of services that pass processing steps to be executed when messages to each other along a IBM) given messages are moved around well defined set of queues † Once they became products, † The downside of this approach is that in basic MOM it is Ä information dissemination and queuing systems started to be the logic associated with the queues and the sender who event notification can be subjected to the same evolutionary the messaging middleware might be specifies the very difficult to understand since it is identity of the directly and efficiently forces as other forms of middleware: receivers implemented on top of queues integration in larger, more distributed and there is no coherent view Ä of the whole Ä publish/subscribe systems are, comprehensive tools in essence, event systems Ä enhancements to the basic sender receiver implemented on top of modified functionality by making the queuing systems queues active processing entities with message = Information Brokers brokers, custom message routing logic can be defined at the message broker core message broker level or at the message broker queue level

©Gustavo Alonso, ETH Zürich. 39 ©Gustavo Alonso, ETH Zürich. 40 Publish/Subscribe Subscription in message brokers † Standard client/server architectures and queuing systems assume the server client at systems startup time (can occur in client and the server know each publish subscribe any order, but all must occur before other (through an interface or a RFQs are executed) queue) … … get RFQ processing A: subscription to message quote † In many situations, it is more useful B: subscription to message to implement systems where the quoteRequest interaction is based on announcing A61 5 C: subscription to message newQuote given events: Publish a service publishes Subscription server Ä List subscriptions check subscriptions messages/events of given type put in queues message broker at run time: processing of a request Ä clients subscribe to different for a message type for quote. types of messages/events Subscribe to a subscribe BC24 7 1: publication of a quoteRequest message type check subscriptions message Ä when a service publishes an SmartQuotation SmartForecasting put in queues 2: delivery of message quoteRequest event, the system looks at a adapter adapter table of subscriptions and 3: synchronous invocation of the forwards the event to the subscriptions get 3 8 getQuote function interested clients; this is usually read from queue 4:6: publication of a quote message done in the form of a message put into a queue for that client 5: delivery of message quote Reliable queuing system 8:publication of a newQuote message † publish, subscribe, get, .. are also 7: delivery of message newQuote RPC calls to a resource manager SmartQuotation SmartForecasting RPC support (DCE, invocation of the TP-Monitor, …) createForecastEntry procedure

©Gustavo Alonso, ETH Zürich. 41 ©Gustavo Alonso, ETH Zürich. 42

What is SOAP?

† The W3C started working on SOAP in 1999. The current W3C recommendation is Version 1.2 † SOAP covers the following four main areas: Ä A message format for one-way communication describing how a message can be packed into an XML document Ä A description of how a SOAP message (or the XML document that makes up a SOAP message) should be transported using HTTP (for Web based interaction) or SMTP(for e-mail based interaction) Ä A set of rules that must be followed when processing a SOAP message and RPC for the Internet: a simple classification of the entities involved in processing a SOAP message. It also specifies what parts of the messages should be read by Simple Object Access Protocol (SOAP) whom and how to react in case the content is not understood Ä A set of conventions on how to turn an RPC call into a SOAP message and back as well as how to implement the RPC style of interaction (how the Gustavo Alonso client makes an RPC call, this is translated into a SOAP message, Computer Science Department forwarded, turned into an RPC call at the server, the reply of the server Swiss Federal Institute of Technology (ETHZ) converted into a SOAP message, sent to the client, and passed on to the [email protected] client as the return of the RPC call) http://www.iks.inf.ethz.ch/

©Gustavo Alonso, ETH Zürich. 44 The background for SOAP SOAP messages

† SOAP was originally conceived as the minimal possible infrastructure necessary † SOAP is based on message to perform RPC through the Internet: exchanges SOAP Envelope use of XML as intermediate representation between systems † Messages are seen as envelops Ä SOAP header Ä very simple message structure where the application encloses the data to be sent Ä mapping to HTTP for tunneling through firewalls and using the Web † A message has two main parts: infrastructure Header Block † The idea was to avoid the problems associated with CORBA’s IIOP/GIOP Ä header: which can be divided (which fulfilled a similar role but using a non-standard intermediate into blocks representation and had to be tunneled through HTTP any way) Ä body: which can be divided into SOAP Body † The goal was to have an extension that could be easily plugged on top of blocks existing middleware platforms to allow them to interact through the Internet † SOAP does not say what to do with rather than through a LAN as it is typically the case. Hence the emphasis on the header and the body, it only Body Block RPC from the very beginning (essentially all forms of middleware use RPC at states that the header is optional and one level or another) the body is mandatory † Eventually SOAP started to be presented as a generic vehicle for computer † Use of header and body, however, is driven message exchanges through the Internet and then it was open to support implicit. The body is for application interactions other than RPC and protocols other then HTTP. This process, level data. The header is for however, is only in its very early stages. infrastructure level data

©Gustavo Alonso, ETH Zürich. 45 ©Gustavo Alonso, ETH Zürich. 46

For the XML fans (SOAP, body only) SOAP example, header and body

0

0

0

XML name space identifier for SOAP serialization 2

y XML name space identifier for SOAP envelope a xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/"

M

8

0 SOAP-ENV:encodingStyle="http://schemas.xmlsoap.org/soap/encoding/"/>

e

xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/" N

C

3

SOAP-ENV:encodingStyle="http://schemas.xmlsoap.org/soap/encoding/"> W xmlns:t="some-URI"

©

.

1

.

1 SOAP-ENV:mustUnderstand="1">

)

P

A 5

O

S

(

DIS l

o

c

o

t

o

r

P

s

s

e

c

c

A

t c

e

j From the: Simple Object Access Protocol (SOAP) 1.1. ©W3C Note 08 May 2000 b DEF

O

e

l

p

m

i

S

:

e

h

t

m

o

r

F ©Gustavo Alonso, ETH Zürich. 47 ©Gustavo Alonso, ETH Zürich. 48 The SOAP header The SOAP body

† The header is intended as a generic place holder for information that is not † The body is intended for the application specific data contained in the message necessarily application dependent (the application may not even be aware that a † A body entry (or a body block) is syntactically equivalent to a header entry with header was attached to the message). attributes actor= ultimateReceiver and mustUnderstand = 1 † Typical uses of the header are: coordination information,identifiers (for, e.g., † Unlike for headers, SOAP does specify the contents of some body entries: transactions), security information (e.g., certificates) Ä mapping of RPC to a collection of SOAP body entries † SOAP provides mechanisms to specify who should deal with headers and what to do with them. For this purpose it includes: Ä the Fault entry (for reporting errors in processing a SOAP message) † The fault entry has four elements (in 1.1): Ä SOAP actor attribute: who should process that particular header entry (or header block). The actor can be either: none, next, ultimateReceiver. None Ä fault code: indicating the class of error (version, mustUnderstand, client, is used to propagate information that does not need to be processed. Next server) indicates that a node receiving the message can process that block. Ä fault string: human readable explanation of the fault (not intended for ultimateReceiver indicates the header is intended for the final recipient of automated processing) the message Ä fault actor: who originated the fault Ä mustUnderstand attribute: with values 1 or 0, indicating whether it is Ä detail: application specific information about the nature of the fault mandatory to process the header. If a node can process the message (as indicated by the actor attribute), the mustUnderstand attribute determines whether it is mandatory to do so. Ä SOAP 1.2 adds a relay attribute (forward header if not processed)

©Gustavo Alonso, ETH Zürich. 49 ©Gustavo Alonso, ETH Zürich. 50

SOAP Fault element (v 1.2) Message processing

† In version 1.2, the fault element is specified in more detail. It must contain two † SOAP specifies in detail how messages must be processed (in particular, how mandatory sub-elements: header entries must be processed) Ä Code: containing a value (the code for the fault) and possibly a subcode (for Ä Each SOAP node along the message path looks at the role associated with application specific information) each part of the message Ä Reason: same as fault string in 1.1 Ä There are three standard roles: none, next, or ultimateReceiver † and may contain a few additional elements: Ä Applications can define their own roles and use them in the message Ä detail: as in 1.1 Ä The role determines who is responsible for each part of a message Ä node: the identification of the node producing the fault (if absent, it defaults † If a block does not have a role associated to it, it defaults to ultimateReceiver to the intended recipient of the message) † If a mustUnderstand flag is included, a node that matches the role specified must Ä role: the role played by the node that generated the fault process that part of the message, otherwise it must generate a fault and do not † Errors in understanding a mandatory header are responded using a fault element forward the message any further but also include a special header indicating which one o f the original headers † SOAP 1.2 includes a relay attribute. If present, a node that does not process that was not understood. part of the message must forward it (i.e., it cannot remove the part) † The use of the relay attribute, combined with the role next, is useful for establishing persistence information along the message path (like session information)

©Gustavo Alonso, ETH Zürich. 51 ©Gustavo Alonso, ETH Zürich. 52 From TRPC to SOAP messages SOAP and HTTP

RPC Request † A binding of SOAP to a transport protocol is a description of how a SOAP Envelope SOAP message is to be sent using that transport protocol HTTP POST SOAP header † The typical binding for SOAP is SOAP Envelope RPC Response (one of the two) HTTP SOAP header Transactional SOAP Envelope SOAP Envelope † Transactional context SOAP can use GET or POST. With context GET, the request is not a SOAP SOAP header SOAP header message but the response is a SOAP SOAP Body Transactional Transactional message, with POST both request Name of Procedure SOAP Body context context and response are SOAP messages Name of Procedure (in version 1.2, version 1.1 mainly Input parameter 1 considers the use of POST). Input parameter 2 Input param 1 SOAP Body SOAP Body † SOAP uses the same error and status codes as those used in HTTP so that Return parameter Fault entry HTTP responses can be directly Input param 2 interpreted by a SOAP module

©Gustavo Alonso, ETH Zürich. 53 ©Gustavo Alonso, ETH Zürich. 54

In XML (a request) In XML (the response)

POST /StockQuote HTTP/1.1

0

0

0

0 0

0 HTTP/1.1 200 OK 2

2 Host: www.stockquoteserver.com

y

y

a

a M

M Content-Type: text/; charset="utf-8" Content-Type: text/xml; charset="utf-8"

8

8

0

0

e

e t

t Content-Length: nnnn o

o Content-Length: nnnn

N

N

C

C 3

3 SOAPAction: "Some-URI"

W

W

©

©

.

.

1

1

.

.

1

1

) )

P xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/"

P

A

A O

O SOAP-ENV:encodingStyle="http://schemas.xmlsoap.org/soap/encoding/"/>

S

S (

( xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/"

l

l

o

o

c c

o

o

t

t o

o SOAP-ENV:encodingStyle="http://schemas.xmlsoap.org/soap/encoding/">

r

r

P

P

s

s

s

s e

e

c

c

c c

A 34.5

A

t t

c

c

e

e

j

j b

b O

O DIS

e

e

l

l p

p

m

m i

i

S

S

:

: e

e

h

h t

t

m

m

o

o

r

r F F

©Gustavo Alonso, ETH Zürich. 55 ©Gustavo Alonso, ETH Zürich. 56 HTTP POST All together SOAP Envelope SOAP summary SOAP header Transactional † SOAP, in its current form, provides a basic mechanism for: context Ä encapsulating messages into an XML document SOAP Body Ä mapping the XML document with the SOAP message into an HTTP request Name of Procedure Ä transforming RPC calls into SOAP messages Input parameter 1 Ä simple rules on how to process a SOAP message (rules became more precise SERVICE REQUESTER SERVICE PROVIDER and comprehensive in v1.2 of the specification) Input parameter 2 † SOAP takes advantage of the standardization of XML to resolve problems of Procedure data representation and serialization (it uses XML Schema to represent data and RPC call data structures, and it also relies on XML for serializing the data for transmission). As XML becomes more powerful and additional standards around XML appear, SOAP can take advantage of them by simply indicating SOAP SOAP engine engine what schema and encoding is used as part of the SOAP message. Current HTTP Acknowledgement HTTP engine HTTP engine schema and encoding are generic but soon there will be vertical standards SOAP Envelope implementing schemas and encoding tailored to a particular application area SOAP header (e.g., the efforts around EDI) Transactional context † SOAP is a very simple protocol intended for transferring data from one middleware platform to another. In spite of its claims to be open (which are SOAP Body true), current specifications are very tied to RPC and HTTP. Return parameter

©Gustavo Alonso, ETH Zürich. 57 ©Gustavo Alonso, ETH Zürich. 58

SOAP and the client server model A first use of SOAP Web services interfaces † The close relation between SOAP, RPC and HTTP has two main reasons: † Some of the first systems to Ä SOAP has been initially designed for client server type of interaction which incorporate SOAP as an access HTTP is typically implemented as RPC or variations thereof method have been databases. The engine process is extremely simple: Ä RPC, SOAP and HTTP follow very similar models of interaction that can be client very easily mapped into each other (and this is what SOAP has done) Ä a stored procedure is essentially XML HTTP an RPC interface † The advantages of SOAP arise from its ability to provide a universal vehicle for mapping wrapping conveying information across heterogeneous middleware platforms and Ä Web service = stored procedure SOAP engine applications. In this regard, SOAP will play a crucial role in enterprise Ä IDL for stored procedure = application integration efforts in the future as it provides the standard that has translated into WSDL been missing all these years Ä call to Web service = use SOAP stored procedure API † The limitations of SOAP arise from its adherence to the client server model: engine to map to call to stored Ä data exchanges as parameters in method invocations procedure Stored procedure interfaces engine Ä rigid interaction patterns that are highly synchronous † This use demonstrates how well Database SOAP fits with conventional † and from its simplicity: middleware architectures and stored procedure Ä SOAP is not enough in a real application, many aspects are missing interfaces. It is just a natural extension to them external application database database management system

resource manager

©Gustavo Alonso, ETH Zürich. 59 ©Gustavo Alonso, ETH Zürich. 60 Automatic conversion RPC - SOAP SOAP exchange patterns (v 1.2) SOAP response message exchange SOAP request-response message stubs, SOAP system Wrap doc exchange client HTTP runtime in HTTP † It involves a request which is not a call support service Serialized POST / SOAP message (implemented as an † It involves sending a request as a location XML doc M-POST HTTP GET request method which SOAP message and getting a second eventually includes the necessary SOAP message with the response to information as part of the requested the request RPC based middleware URL) and a response that is a SOAP † This is the typical mode of operation message for most Web services and the one † This pattern excludes the use of any used for mapping RPC to SOAP. header information (as the request † This exchange pattern is also the one has no headers) that implicitly takes advantage of the

NETWORK binding to HTTP and the way HTTP works RPC based middleware

SOAP system server stubs, Retrieve HTTP procedure runtime doc from support adapters Serialized HTTP XML doc packet

©Gustavo Alonso, ETH Zürich. 61 ©Gustavo Alonso, ETH Zürich. 62

How to implement this with SOAP? Implementing message queues

† In principle, it is not impossible to implement asynchronous queues with SOAP: Ä SOLUTION A: • use SOAP to encode the messages • create an HTTP based interface for the queues • use an RPC/SOAP based engine to transfer data back and forth between the queues Ä SOLUTION B: • use SOAP to encode the messages • create appropriate e-mail addresses for each queue • use an e-mail (SMTP) binding for transferring messages † Both options have their advantages and disadvantages but the main problem is that none is standard. Hence, there is no guarantee that different queuing systems with a SOAP will be able to talk to each other: many advantages of SOAP are lost † The fact that SOAP is so simple also makes it difficult to implement these solutions: a lot additional functionality is needed to implement reliable, practical queue systems

©Gustavo Alonso, ETH Zürich. 63 ©Gustavo Alonso, ETH Zürich. 64 The need for attachments A possible solution

† SOAP is based on XML and relies From SOAP Version 1.2 Part 0: Primer. † There is a “SOAP messages with † Problems with this approach: on XML for representing data types 11.12.02 that addresses this problem dragging the attachment along, make all data exchanged explicit in † It uses MIME types (like e-mails) which can have performance the form of an XML document New York and it is based in including the implications for large messages Los Angeles much like what happens with IDLs 2001-12- SOAP message into a MIME Ä scalability can be seriously in conventional middleware 14 element that contains both the affected as the attachment is platforms late SOAP message and the attachment sent in one go (no streaming) (see next page) † This approach reflects the implicit afternoon Ä not all SOAP implementations assumption that what is being aisle † The solution is simple and it follows support attachments exchanged is similar to input and the same approach as that taken in e- output parameters of program mail messages: include a reference Ä SOAP engines must be extended Los Angeles to deal with MIME types (not invocations New York and have the actual attachment at the too complex but it adds † This approach makes it very difficult 2001-12 end of the message overhead) to use SOAP for exchanging 20 † The MIME document can be complex data types that cannot be mid- embedded into an HTTP request in † There are alternative proposals like easily translated to XML (and there morning the same way as the SOAP message DIME of Microsoft (Direct Internet is no reason to do so): images, Message Encapsulation) and WS- † The Apache SOAP 2.2 toolkit attachments binary files, documents, proprietary supports this approach representation formats, embedded SOAP messages, etc.

©Gustavo Alonso, ETH Zürich. 65 ©Gustavo Alonso, ETH Zürich. 66

Attachments in SOAP The problems with attachments

MIME-Version: 1.0 † Attachments are relatively easy to include in a message and all proposals Content-Type: Multipart/Related; boundary=MIME_boundary; type=text/xml; (MIME or DIME based) are similar in spirit start="" Content-Description: This is the optional message description. † The differences are in the way data is streamed from the sender to the receiver --MIME_boundary and how these differences affect efficiency Content-Type: text/xml; charset=UTF-8 Ä MIME is optimized for the sender but the receiver has no idea of how big a Content-Transfer-Encoding: 8bit message it is receiving as MIME does not include message length for the Content-ID: parts it contains Ä this may create problems with buffers and memory allocation

W3C Note 11 December 2000 W3C Note 11 December

© MIME boundaries between the different parts (DIME explicitly specifies the length of each part which can be use to skip what is not relevant) .. † All these problems can be solved with MIME as it provides mechanisms for .. adding part lengths and it could conceivably be extended to support some basic form of streaming SOAP MESSAGE † Technically, these are not very relevant issues and have more to do with --MIME_boundary Content-Type: image/tiff marketing and control of the standards Content-Transfer-Encoding: binary ATTACHMENT † The real impact of attachments lies on the specification of Web services Content-ID: (discussed later on) ...binary TIFF image...

From Messages with Attachments. SOAP --MIME_boundary--

©Gustavo Alonso, ETH Zürich. 67 ©Gustavo Alonso, ETH Zürich. 68 SOAP as simple protocol Beyond SOAP

† SOAP does not include anything about: † Not everybody agrees to the procedure of SOAP + WS-”extensions”, some Ä reliability organizations insist that a complete protocol specification for Web services needs to address much more than just getting data across Ä complex message exchanges † ebXML, as an example, proposes its own messaging service that incorporates Ä transactions many of the additional features missing in SOAP. This messaging service can be Ä security built using SOAP as a lower level protocol but it considers the messaging Ä … problem as a whole † As such, it is not adequate by itself to implement industrial strength applications † The idea is not different from SOAP ... that incorporate typical middleware features such as transactions or reliable delivery of messages † SOAP does not prevent such features from being implemented but they need to Abstract ebXML Messaging Service be standardized to be useful in practice: Ä WS-security Messaging service layer Ä WS-Coordination (maps the abstract interface to the Ä WS-Transactions Ä … transport service) † A wealth of additional standards are being proposed to add the missing functionality Transport service(s)

† but extended to incorporate additional features (next page)

©Gustavo Alonso, ETH Zürich. 69 ©Gustavo Alonso, ETH Zürich. 70

ebXML messaging service ebXML and SOAP

† The ebXML Messaging specification clarifies in great detail how to use SOAP MESSAGING SERVICE INTERFACE and how to add modules implementing additional functionality: Ä ebXML message = MIME/Multipart message envelope according to “SOAP AUTHENTICATION, AUTHORIZATION with attachments” specification AND REPUDIATION SERVICES Ä ebXML specified standard headers: • MessageHeader: id, version, mustUnderstand flag to 1, from, to, HEADER PROCESING conversation id, duplicate elimination, etc. Ä ebXML recommends to use the SOAP body to declare (manifest) the data ENCRYPTION, being transferred rather than to carry the data (the data would go in pther DIGITAL SIGNATURE parts of the MIME message) Ä ebXML defines a number of core modules and how information relevant to these modules is to be exchanged: MESSAGE PACKAGING MODULE • security (for encryption and signature handling)

DELIVERY MODULE • error handling (above the SOAP error handling level) SEND/RECEIVE • sync/reply (to maintain connections open across intermediaries) TRANSPORT MAPPING AND BINDING

FTP HTTP IIOP SMTP

©Gustavo Alonso, ETH Zürich. TRANSPORT SERVICES 71 ©Gustavo Alonso, ETH Zürich. 72 Additional features of ebXML messages

† Reliable messaging module Ä a protocol that guarantees reliable delivery between two message handlers. It includes persistent storage of the messages and can be used to implement a wide variety of delivery guarantees † Message status service Ä a service that allows to ask for the status of a message previously sent † Message ping service Ä to determine if there is anybody listening at the other end of the line † Message order module Ä to deliver messages to the receiver in a particular order. It is based on Transactions in distributed settings sequence numbers † Multi-hop messaging module Prof. Dr. Gustavo Alonso Ä for sending messages through a chain of intermediaries and still achieve reliability Institute for Pervasive Computing Computer Science Department † This are all typical features of a communication protocol that are needed anyway (including practical SOAP implementations) ETH Zürich [email protected] http://www.inf.ethz.ch/~alonso

©Gustavo Alonso, ETH Zürich. 73

Transaction Processing Why is transaction processing relevant? Ä Most of the information systems used in businesses are transaction based (either databases or TP-Monitors). The market for transaction processing is many tens billions of dollars per year Ä Not long ago, transaction processing was used mostly in large companies (both users and providers). This is no longer the case (CORBA, WWW, Commodity TP-Monitors, Internet providers, distributed computing) Ä Transaction processing is not just database technology, it is core distributed systems technology

Why distributed transaction processing? Basics of transaction processing Ä It is an accepted, proven, and tested programming model and computing paradigm for complex applications Ä The convergence of many technologies (databases, networks, workflow management, ORB frameworks, clusters of workstations …) is largely based on distributed transactional processing

©Gustavo Alonso, ETH Zürich. 75 ©Gustavo Alonso, ETH Zürich. 76 From business to transactions The problems of electronic transactions

† A business transaction usually involves an exchange between two or more Transactions are a great idea: entities (selling, buying, renting, booking …). † Hack a small, cute program and that’s it. † When computers are considered, these business transactions become electronic transactions: Unfortunately, they are also a complex idea: † From a programming point of view, one must be able to encapsulate the transaction (not everything is a transaction). TRANSACTION † One must be able to run high volumes of these transactions (buyers want fast BUYER SELLER response, sellers want to run many transactions cheaply). † Transactions must be correct even if many of them are running concurrently (= at the same time over the same data). † Transactions must be atomic. Partially executed transactions are almost always STATE STATE STATE incorrect (even in business transactions). † While the business is closed, one makes no money (in most business). † The ideas behind a business transactionbook-keeping are intuitive. These same ideas are used in electronic transactions. Transactions are “mission critical”. † Legally, most business transactions require a written record. So do electronic † Electronic transactions open up many possibilities that are unfeasible with traditional accounting systems. transactions.

©Gustavo Alonso, ETH Zürich. 77 ©Gustavo Alonso, ETH Zürich. 78

What is a transaction? Transactional properties Transactions originated as “spheres of control” in which to encapsulate certain These systems would have been very difficult to build without the concept of behavior of particular pieces of code. transaction. To understand why, one needs to understand the four key properties † A transaction is basically a set of service invocations, usually from a program of a transaction: (although it can also be interactive). † A transaction is a way to help the programmer to indicate when the system † ATOMICITY: necessary in any distributed system (but also in centralized ones). should take over certain tasks (like semaphores in an operating system, but much A transaction is atomic if it is executed in its entirety or not at all. more complicated). † Transactions help to automate many tedious and complex operations: † CONSISTENCY: used in database environments. A transactions must preserve Ä record keeping, the data consistency. Ä concurrency control, Ä recovery, † ISOLATION: important in multi-programming, multi-user environments. A Ä durability, transaction must execute as if it were the only one in the system. Ä consistency. † It is in this sense that transactions are considered ACID (Atomic, Consistent, † DURABILITY: important in all cases. The changes made by a transaction must Isolated, and Durable). be permanent (= they must not be lost in case of failures).

©Gustavo Alonso, ETH Zürich. 79 ©Gustavo Alonso, ETH Zürich. 80 Transactional properties Transactional atomicity

† Transactional atomicity is an “all or nothing” property: either the entire C transaction takes place or it does not take place at all. consistent Transaction consistent † A transaction often involves several operations that are executed at different database database times (control flow dependencies). Thus, transactional atomicity requires a mechanism to eliminate partial, incomplete results (a recovery protocol). A Txn consistent inconsistent database database Failure Txn I Txn 1 consistent inconsistent consistent consistent database database database database Txn 2 Failure inconsistent database RECOVERY database log D MANAGER consistent Txn consistent recovery database database Txn Failure system recovery inconsistent crash database consistent consistent inconsistent database database database

©Gustavo Alonso, ETH Zürich. 81 ©Gustavo Alonso, ETH Zürich. 82

Transactional isolation Transactional consistency

† Isolation addresses the problem of ensuring correct results even when there are † Concurrency control and recovery protocols are based on a strong assumption: many transactions being executed concurrently over the same data. the transaction is always correct. † The goal is to make transactions believe there is no other transaction in the † In practice, transactions make mistakes (introduce negative salaries, empty system (isolation). social security numbers, different names for the same person …). These † This is enforced by a concurrency control protocol, which aims at guaranteeing mistakes violate database consistency. serializability. † Transaction consistency is enforced through integrity constraints: Ä Null constrains: when an attribute can be left empty. Ä Foreign keys: indicating when an attribute is a key in another table. Ä Check constraints: to specify general rules (“employees must be either consistent Txn 1 consistent managers or technicians”). database database † Thus, integrity constraints acts as filters determining whether a transaction is Txn 2 inconsistent acceptable or not. database † NOTE: integrity constraints are checked by the system, not by the transaction programmer. Txn 1 CONCURRENCY Txn 2 CONTROL

Txn 1 Txn 2 consistent consistent consistent database database database

©Gustavo Alonso, ETH Zürich. 83 ©Gustavo Alonso, ETH Zürich. 84 Transactional durability A Simple Transaction Manager (I)

† Transactional system often deal with valuable information. There must be a guarantee that the changes introduced by a transaction will last. transactions TRANSACTION † This means that the changes introduced by a transaction must survive failures (if (r, w, c, a) MANAGER you deposit money in your bank account, you don’t want the bank to tell you they have lost all traces of the transaction because there was a disk crash). restart read, write, commit, abort (after system SCHEDULER † In practice, durability is guaranteed by using replication: database backups, failure) mirrored disks. (concurrency control † Often durability is combined with other desirable properties such as availability: read, write, Ä Availability is the percentage of time the system can be used for its intended commit, abort purpose (common requirement: 99.86% or 1 hour a month of down time). RECOVERY MANAGER Ä Availability plays an important role in many systems. Consider, for instance, the name server used in a CORBA implementation. fetch, flush read, write read, write CACHE MANAGER STABLE DATABASE DATA MANAGER CACHE LOG stable storage main memory

©Gustavo Alonso, ETH Zürich. 85 ©Gustavo Alonso, ETH Zürich. 86

A Simple Transaction Manager (II) Example Application (ATM)

† Each one of the modules shown is a complex component that can be optimized Example 1: Automated Teller Machines (ATM) by using clever tricks (engineering, not theory). † Tables: † In practice, these modules tend to be heavily interconnected (if one wants Ä AccountBalance (Acct#, balance): the accounts and the money in them performance, forget about modularity and nice, clear cut interfaces). Ä HotCard-List (Acct#): card that have been canceled/stolen/suspended. † A crucial aspect of a transaction manager is to ensure that operations are Ä AccountVelocity (Acct#,SumWithdrawals): stores the latest transactions and executed in the proper order. the accumulated amount. † When a module indicates “A should be executed before B”, this should be the case at all levels, independently of the optimizations performed at each level. Ä PostingLog (Acct#,ATMid,Amount): a record of each operation. There are two ways to guarantee such property: † Typical operation (money withdrawal): Ä FIFO queues between each module: force sequential processing but have Ä Get input (Acct#, ATM#, type, PIN, Txn-id, Amount). problems with threads and multi-processing. Ä Write request to PostingLog. Ä Handshaking: If A must happen before B, B is not passed to a lower module Ä Check PIN until the execution of A has not been confirmed by the lower module. Ä Check Acct# with HotCard-List table. Ä Check Acct# with AccountVelocity table Ä Update AccountVelocity table. Ä Update balance in AccountBalance table. Ä Write withdrawal record to PostingLog Ä Commit transaction and dispense money.

©Gustavo Alonso, ETH Zürich. 87 ©Gustavo Alonso, ETH Zürich. 88 Example Application (ATM) Example Application (Stock Exchange)

† Size: with several hundred ATMs and about one million customers, the database Example 2: Stock Exchange. takes 25-50 MB. † Tables: † Load: the system is configured to deal with the peak load: about one transaction Ä Users: list of traders and market watchers. per minute per ATM. Under normal circumstances, one mirrored disk (two disks doing the same operations) can handle 5 transactions per second (tps). Assuming Ä Stocks: list of traded stocks. 5 I/O operations per transaction, one mirrored disk can then handle about 300 Ä BuyOrders/SellOrders: all the orders entered during the day ATMs. Ä Trades: all trades executed during the day. † The PostingLog can be updated off-line (at night). Ä Price: buy and sell total volume, and numbers of orders for each stock and † Before, these systems were based on snapshot replication. Today, many of these price. systems access on-line databases. Ä Log: all users’ requests and system replies. Ä NotificationMssgs: all messages sent to the users (usually, confirmations of an operation). † Typical operation (Execute Trade): Ä Read information about the stock from the Stocks table. Ä Get timestamp. Ä Read scheduled trading periods for the stock. Ä Check validity of operation (time, value, prices). Ä If valid, find a matching trade operation, update Trades, NotificationMssgs, Orders, Prices, Stocks.

©Gustavo Alonso, ETH Zürich. 89 ©Gustavo Alonso, ETH Zürich. 90

Example Application (Stock Exchange) Ä Write the system’s response to the log. Ä Commit the transaction. Ä Broadcast the new book situation. † Size: 10 stock exchanges connected, real-time distributed trading, total database size 2.6 GB. † Load: Peak daily load is 140.000 orders. The peak-per-second load involves 180 disk I/Os and executing 300 million instructions per second.

Some basic advanced transaction models

©Gustavo Alonso, ETH Zürich. 91 ©Gustavo Alonso, ETH Zürich. 92 Distributed Transactions Pros and Cons of Atomicity

† The transaction model described so far is known as “flat model”, i.e., transactions have only two levels, the parent transaction and the children transaction † Distributed transactions are difficult to model with flat transactions (for instance, consistent Transaction consistent a chain of TRPCs), hence more complex models are needed database database † The most common model for distributed transactions is the nested model in which operations of a transaction can be transactions themselves † One of the most important aspects of distributed transactions is the problem of atomic commitment. Nested transactions help with this problem by indicating consistent Txn inconsistent when transactions at different systems need to be committed database database Abort

Recovery

If a program finds there is some error, it suffices to abort the transaction. The recovery mechanism ensures the effects of the transaction will be eliminated. This is both good and bad.

©Gustavo Alonso, ETH Zürich. 93 ©Gustavo Alonso, ETH Zürich. 94

Pros and Cons of Atomicity Savepoints

† If you are a transaction programmer, every time there is something that goes † To avoid this problem, savepoints are used. wrong (there is not enough funds, for instance), it is enough to execute † A savepoint records the current state of the execution of a transaction. ROLLBACK WORK to go back to the beginning of the transaction: † When invoking ROLLBACK WORK, one indicates to which point one wants to rollback (to the beginning or to a savepoint).

consistent Txn database BEGIN (1) ROLLBACK OP 1 WORK OP 2 SAVEP (2) Recovery OP 3 OP 4 † But if transactions are long, or complex, aborting the entire transaction may be a ROLLB(2) OP 5 waste of effort. In many cases, one does not want to go all the way to the SAVEP (3) beginning but to some intermediate point where one is sure things were correct, and then take again from there. OP 6 OP 9 OP 7 OP 10 OP 8 OP 11 ROLLB(3) COMMIT

©Gustavo Alonso, ETH Zürich. 95 ©Gustavo Alonso, ETH Zürich. 96 Triggered Commits Persistent Savepoints ?

† How can savepoints be implemented ? † Savepoints are a great idea: † Consider each set of operations between two savepoints as an atomic unit. Ä if something goes wrong, we can rollback to different parts of the execution † A ROLLBACK(x) aborts all atomic units all the way back to savepoint x. and resume from there. † A COMMIT at the end, triggers a chain of commits for each atomic unit. Ä rollback is performed by the system using the standard recovery mechanism. † Can this great idea be generalized? † If savepoints are made persistent, we may be able to resume the execution of a transaction even after crash failures! BEGIN (1) † In principle yes, but in practice: OP 1 Ä The database can recover to a savepoint, but the application program may OP 2 not be able to do the same. SAVEP (2) OP 3 OP 4 User program ABORT(2) OP 5 SAVEP (3) crash OP 6 OP 9 failure OP 7 OP 10 restart OP 8 OP 11 Database ABORT(3) COMMIT transaction restart

©Gustavo Alonso, ETH Zürich. 97 ©Gustavo Alonso, ETH Zürich. 98

Chained Transactions Non-Flat Transaction Models

† Note that the atomic units used in savepoints are almost like a transaction. † Savepoints and chained transactions point the need to structure sequences of † The important difference is that the transaction context is kept (locks are not interactions with the database. released until commit, not needed ones can be released). † The transactions we have been discussed so far are known as flat transactions. † Chained transactions allow to commit one transaction and pass its context to the † A chained transaction can be seen as a first step towards a non-flat transaction next. model: † If a failure occurs, committed transactions are safe, only the last active transaction will be aborted. † However, there is no possibility of rollback to a previous transactions (now they are really committed). parent

children BEGIN (1)

OP 1 † When these ideas are generalized, one arrives at the nested transaction model. OP 2 OP 3 OP 5 OP 4 OP 6 CHAIN CHAIN OP 7 COMMIT

©Gustavo Alonso, ETH Zürich. 99 ©Gustavo Alonso, ETH Zürich. 100 Nested Transactions Nested Transaction Structure

† A nested transaction is a tree of transactions. † Transactions at the leaves are flat transactions. † Transactions can either commit or rollback. The commit is conditional to the parent transaction’s commit (hence, transactions commit only if the root BEGIN COMMIT commits). † Rollback of a transaction causes all of its children to also rollback.

root BOT C BOT C

BCC B R B R B

B C

©Gustavo Alonso, ETH Zürich. 101 ©Gustavo Alonso, ETH Zürich. 102

Nested Transaction Rules

† Nested Transactions are like a combination of savepoints and chained transactions. † They follow these three rules (node = txn.): Ä Commit Rule: When a node wants to commit, it passes its context to the parent node (like in chained txns.), it will actually commit when the root node commits (like in savepoints). Ä Rollback Rule: If a node does a rollback, all of its children must also rollback. Ä Visibility Rule: When a node “commits” all of its changes become visible to the parent (because the child passes its context to the parent). Concurrent siblings are isolated from each other (they see each other as different Atomic commitment in practice (2PC- transactions). The parent can make certain objects accessible to the children, thereby allowing the context of a child to pass to another child. 3PC) † All other notions (serializability, recoverability …) still apply across different nested transactions † For distributed transactions, the important aspect is how to commit all transactions, not so much the isolation aspects

©Gustavo Alonso, ETH Zürich. 103 ©Gustavo Alonso, ETH Zürich. 104 Atomic Commitment The Consensus (agreement) Problem

† Distributed consensus is the problem The impossibility result implies that of reaching an agreement among all there is always a chance to remain working processes on the value of a uncertain (unable to make a The variable decision), hence: Consensus † Consensus is not a difficult problem † If failures may occur, then all 3 Phase if the system is reliable (no site entirely asynchronous commit Problem failures, no communication failures) protocols may block. Commit † Asynchronous = no timing † No commit protocol can guarantee assumptions can be made about the independent recovery (if a site fails speed of processes or the network when being uncertain, upon delay (it is not possible to recovery it will have to find out distinguish between a failure and a from others what the decision was). 2 Phase slow system) † This is a very strong result with Commit important implications in any distributed system.

Applications In an asynchronous environment where failures can occur reaching consensus may be impossible

©Gustavo Alonso, ETH Zürich. 105 ©Gustavo Alonso, ETH Zürich. 106

Generals problem Atomic Commitment

† To succeed the generals must attack † The impossibility in the generals Properties to enforce: at the same time problem arises from the need to † AC1 = All processors that reach a decision reach the same one (agreement, † The generals can only communicate have complete knowledge: I need to consensus). through messages know my state, the other’s state, that the other knows my state, that the † AC2 = A processor cannot reverse its decision. † The system is asynchronous: other knows that I know her state, † AC3 = Commit can only be decided if all processors vote YES (no imposed messages can be lost or delayed decisions). indefinitely that the other knows that I know that she knows my state … † AC4 = If there are no failures and all processors voted YES, the decision will be † If the system is entirely to commit (non triviality). asynchronous, this problem cannot † AC5 = Consider an execution with normal failures. If all failures are repaired be solved by simply exchanging and no more failures occur for sufficiently long, then all processors will Under these circumstances, messages eventually reach a decision (liveness). the generals will never † There are many forms of this problem and atomic commitment is be able to agree on a one of them: simultaneous attack, Ä all sites must decide on whether that is, they can never reach to commit or abort a transaction and all must make the same consensus decision

©Gustavo Alonso, ETH Zürich. 107 ©Gustavo Alonso, ETH Zürich. 108 Simple 2PC Protocol and its correctness Timeout Possibilities PROTOCOL: CORRECTNESS: † Coordinator send VOTE-REQ to all The protocol meets the 5 AC conditions participants. (I - V): COORDINATOR † Upon receiving a VOTE-REQ, a † ACI = every processor decides what send participant sends a message with the coordinator decides (if one COMMIT YES or NO (if the vote is NO, the decides to abort, the coordinator will COMMIT participant aborts the transaction and decide to abort). stops). † AC2 = any processor arriving at a all vote YES † Coordinator collects all votes: decision “stops”. send wait Ä All YES = Commit and send † AC3 = the coordinator will decide to VOTE-REQ for votes COMMIT to all others. commit if all decide to commit (all some vote NO Ä Some NO = Abort and send vote YES). ABORT to all which voted † AC4 = if there are no failures and YES. everybody votes YES, the decision send † A participant receiving COMMIT or will be to commit. ABORT ABORT ABORT messages from the † AC5 = the protocol needs to be coordinator decides accordingly and extended in case of failures (in case stops. of timeout, a site may need to “ask around”).

©Gustavo Alonso, ETH Zürich. 109 ©Gustavo Alonso, ETH Zürich. 110

Timeout Possibilities Timeout and termination In those three waiting periods: When in doubt, ask. If anybody has PARTICIPANT † If the coordinator times-out waiting decided, they will tell us what the COMMIT for votes: it can decide to abort decision was: (nobody has decided anything yet, † There is always at least one wait for received or if they have, it has been to abort) processor that has decided or is able COMMIT vote YES decision † If a participant times-out waiting for to decide (the coordinator has no VOTE-REQ: it can decide to abort uncertainty period). Thus, if all (nobody has decided anything yet, failures are repaired, all processors wait for or if they have, it has been to abort) will eventually reach a decision ABORT † If the coordinator fails after VOTE-REQ † If a participant times-out waiting for received a decision: it cannot decide anything receiving all YES votes but before unilaterally, it must ask (run a sending any COMMIT message: all Cooperative Termination Protocol). participants are uncertain and will vote NO If everybody is in the same situation not be able to decide anything until ABORT no decision can be made: all the coordinator recovers. This is the processors will block. This state is blocking behavior of 2PC (compare called uncertainty period with the impossibility result discussed previously)

©Gustavo Alonso, ETH Zürich. 111 ©Gustavo Alonso, ETH Zürich. 112 Recovery and persistence Linear 2PC

Processors must know their state to be † When sending VOTE-REQ, the † Linear 2PC commit exploits a particular network configuration to minimize the able to tell others whether they have coordinator writes a START-2PC number of messages: reached a decision. This state must log record (to know the be persistent: coordinator). † If a participant votes YES, it writes YES † Persistence is achieved by writing a a YES record in the log BEFORE it log record. This requires flushing send its vote. If it votes NO, then it the log buffer to disk (I/O). writes a NO record. † This is done for every state change † If the coordinator decides to commit in the protocol. or abort, it writes a COMMIT or YES ABORT record before sending any † This is done for every distributed transaction. message. † After receiving the coordinator’s ... † This is expensive! decision, a participant writes its own YES decision in the log.

COM

©Gustavo Alonso, ETH Zürich. 113 ©Gustavo Alonso, ETH Zürich. 114

Linear 2PC 3 Phase Commit Protocol † The total number of messages is 2n instead of 3n, but the number of rounds is 2n 2PC may block if the coordinator fails We will consider two versions of 3PC: instead of 3 after having sent a VOTE-REQ to † One capable of tolerating only site all processes and all processes vote failures (no communication YES. It is possible to reduce the failures). Blocking occurs only window of vulnerability even further when there is a total failure (every YES by using a slightly more complex process is down). This version is protocol (3PC). useful if all participants reside in the In practice 3PC is not used. It is too same site. expensive (more than 2PC) and the † One capable of tolerating both site YES probability of blocking is considered and communication failures (based to be small enough to allow using on quorums). But blocking is still 2PC instead. possible if no quorum can be formed. NO NO But 3PC is a good way to understand better the subtleties of atomic commitment NO NO

©Gustavo Alonso, ETH Zürich. 115 ©Gustavo Alonso, ETH Zürich. 116 Blocking in 2PC Avoiding Blocking in 3PC Why does a process block in 2PC? The reason why uncertain process The NB rule guarantees that if anybody is uncertain, nobody can have decided to † If a process fails and everybody else cannot make a decision is that being commit. Thus, when running the cooperative termination protocol, if a process is uncertain, there is no way to know uncertain does not mean all other finds out that everybody else is uncertain, they can all safely decide to abort. whether this process has committed processes are uncertain. Processes † The consequence of the NB rule is that the coordinator cannot make a decision or aborted (NOTE: the coordinator may have decided and then failed. by itself as in 2PC. Before making a decision, it must be sure that everybody is has no uncertainty period. To block To avoid this situation, 3PC out of the uncertainty area. Therefore, the coordinator, must first tell all the coordinator must fail). enforces the following rule: processes what is going to happen: (request votes, prepare to commit, commit). † Note, however, that the fact that This implies yet another round of messages! everybody is uncertain implies † NB rule: No operational process can everybody voted YES! decide to commit if there are † Why, then, uncertain processes operational processes that are cannot reach a decision among uncertain. themselves? How does the NB rule prevent blocking?

©Gustavo Alonso, ETH Zürich. 117 ©Gustavo Alonso, ETH Zürich. 118

3PC Coordinator 3PC Participant

bcast wait * COMMIT bcast wait for send wait for pre-commit for ACKs commit pre-commit ACK commit COMMIT commit all vote YES all ACKs vote YES pre-commit received received received bcast wait wait for vote-req for votes vote-req abort received some vote NO vote NO

bcast ABORT ABORT abort

Possible time-out actions Possible time-out actions

©Gustavo Alonso, ETH Zürich. 119 ©Gustavo Alonso, ETH Zürich. 120 3PC and Knowledge (using the NB rule) Persistence and recovery in 3PC 3PC is interesting in that the processes The NB rule is used when time-outs Similarly to 2PC, a process has to Processes in 3PC cannot independently know what will happen before it occur (remember, however, that remember its previous actions to be recover unless they had already happens: there are no communication able to participate in any decision. reached a decision or they have not † Once the coordinator reaches the failures): This is accomplished by recording participated at all: “bcast pre-commit”, it knows the † If coordinator times out waiting for every step in the log: † If the coordinator recovers and finds decision will be to commit. votes = ABORT. † Coordinator writes “start-3PC” a “start 3PC” record in its log but no † Once a participant receives the pre- † If participant times out waiting for record before doing anything. It decision record, it needs to ask commit message from the vote-req = ABORT. writes an “abort” or “commit” around to find out what the decision record before sending any abort or was. If it does not find a “start 3PC”, coordinator, it knows that the † If coordinator times out waiting for decision will be to commit. ACKs = ignore those who did not commit message. it will find no records of the Why is the extra-round of messages sent the ACK! (at this stage † Participant writes its YES vote to transaction, then it can decide to useful? everybody has agreed to commit). the log before sending it to the abort. coordinator. If it votes NO, it writes † If a participant has a YES vote in its † The extra round of messages is used † If participant times out waiting for it to the log after sending it to the log but no decision record, it must to spread knowledge across the pre-commit = still in the uncertainty coordinator. When reaching a ask around. If it has not voted, it can system. They provide information period, ask around. decision, it writes it in the log (abort decide to abort. about what is going on at other † If participant times out waiting for or commit). processes (NB rule). commit message = not uncertain any more but needs to ask around!

©Gustavo Alonso, ETH Zürich. 121 ©Gustavo Alonso, ETH Zürich. 122

Termination Protocol Partition and total failures

† Elect a new coordinator. TR4 is similar to 3PC, have we actually This protocol does not tolerate Total failures require special treatment, † New coordinator sends a “state req” solved the problem? communication failures: if after the total failure every process to all processes. participants send † Yes, failures of the participants on † A site decides to vote NO, but its is still uncertain, it is necessary to their state (aborted, committed, the termination protocol can be message is lost. find out which process was the last uncertain, committable). ignored. At this stage, the on to fail. If the last one to fail is † All vote YES and then a partition found and is still uncertain, then all † TR1 = If some “aborted” received, coordinator knows that everybody is occurs. Assume the sides of the then abort. uncertain, those who have not sent partition are A and B and all can decide to abort. an ACK have failed and cannot have † Why? Because of partitions. † TR2 = If some “committed” processes in A are uncertain and all received, then commit. made a decision. Therefore, all processes in B are committable. Everybody votes YES, then all remaining can safely decide to When they run the termination processes in A fail. Processes in B † TR3 = If all uncertain, then abort. commit after going over the pre- protocol, those in A will decide to will decide to commit once the † TR4 = If some “committable” but no commit and commit phases. abort and those in B will decide to coordinator times out waiting for “committed” received, then send † The problem is when the new commit. ACKs. Then all processes in B fail. “PRE-COMMIT” to all, wait for Processes in A recover. They run the coordinator fails after asking for the † This can be avoided if quorums are ACKs and send commit message. state but before sending any pre- used, that is, no decision can be termination protocol and they are all commit message. In this case, we made without having a quorum of uncertain. Following the termination have to start all over again. processes who agree (this protocol will lead them to abort. reintroduces the possibility of blocking, all processes in A will block).

©Gustavo Alonso, ETH Zürich. 123 ©Gustavo Alonso, ETH Zürich. 124 2PC in Practice 2PC

† 2PC is a protocol used in many applications from distributed systems to Internet environments COORDINATOR Commit-Log COMMIT † 2PC is not only a database protocol, it is used in many systems that are not ES l Y necessarily databases but, traditionally, it has been associated with transactional Al systems Start Prepare-Log Wait EOT-Log † 2PC appears in a variety of forms: distributed transactions, transactional remote S procedure calls, Object Transaction Services, Transaction Internet Protocol … om e N † In any of these systems, it is important to remember the main characteristic of Vote Commit O Abort-Log ABORT 2PC: if failures occur the protocol may block. In practice, in many systems, t es blocking does not happen but the outcome is not deterministic and requires qu Ack Re e manual intervention ot mit V Com rt Abo

Ack Vote Yes -Log Wait Commit-Log COMMIT

Wait Vote Abort Abort-Log ABORT PARTICIPANT Vote Abort

©Gustavo Alonso, ETH Zurich. 125 ©Gustavo Alonso, ETH Zürich. 126

ORB Object Transaction Service

† The OTS provides transactional guarantees to the execution of invocations between different components of a distributed application built on top of the Application Objects Common Facilities ORB † The OTS is fairly similar to a TP-Monitor and provides much of the same functionality discussed before for RPC and TRPC, but in the context of the CORBA standard † Regardless of whether it is a TP-monitor or an OTS, the functionality needed to support transactional interactions is the same: Ä transactional protocols (like 2PC) Ä knowing who is participating SOFTWARE BUS (ORB) Ä knowing the interface supported by each participant

Common Object Services

... naming events security transactions

©Gustavo Alonso, ETH Zurich. 127 ©Gustavo Alonso, ETH Zurich. 128 Object Transaction Service Object Transaction Service

Assume App A wants to update its database and also that in B

Application Application Application Application DBAB DB DBAB DB

ORB BEGIN ORB TXN

Object Object Transaction Transaction Service Service

©Gustavo Alonso, ETH Zurich. 129 ©Gustavo Alonso, ETH Zurich. 130

Object Transaction Service Object Transaction Service

… but the transaction does not commit TXN(1)

Application Application Application Application DBAB DB DBAB DB

Register ORB ORB DB

OTS now knows Object Object that there is database Transaction Transaction behind App A Service Service

©Gustavo Alonso, ETH Zurich. 131 ©Gustavo Alonso, ETH Zurich. 132 Object Transaction Service Object Transaction Service

Application Application Application Application DBAB DB DBAB DB Call B txn(1) ORB ORB Register DB

Object Object OTS now knows Transaction Transaction that there is database Service Service behind App B

©Gustavo Alonso, ETH Zurich. 133 ©Gustavo Alonso, ETH Zurich. 134

Object Transaction Service Object Transaction Service

… but the transaction does not commit TXN(1)

Application Application Application Application DBAB DB DBAB DB

ORB ORB COMMIT

Object Object Transaction Transaction Service Service

©Gustavo Alonso, ETH Zurich. 135 ©Gustavo Alonso, ETH Zurich. 136 Object Transaction Service OTS Sequence of Messages

DB A APP Abegin OTS APP B DB B register Application Application TXN DBAB DB invoke register TXN ORB commit prepare prepare vote yes vote yes Object 2PC 2PC commit commit Transaction Service

©Gustavo Alonso, ETH Zurich. 137 ©Gustavo Alonso, ETH Zurich. 138

pg Registration

† When a call is made to another † Registration is necessary in order to server, somebody has to know that tell the OTS who will participate in this call belongs to a given the 2PC protocol and what type of transaction. There are two ways of interface is supported. Registration doing this: can be manual or automatic † Explicit (manual): the invocation † Manual registration implies the the itself contains the transaction user provides an implementation of identifier. Then, when the the resource. This implementation application registers the resource acts as an intermediary between the manager, it uses this transaction OTS and the actual resource identifier to say to which transaction manager (useful for legacy Transaction Processing Monitors (TP- it is “subscribing” applications that need to be † Implicit (automatic): the call is made wrapped) monitors) through the OTS, which will † Automatic registration is used when forward the transaction identifier the resource manager understands along with the invocation. This transactions (i.e., it is a database), in requires to link with the OTS library which case it will support the XA and to make all methods involved interface for 2PC directly. A transactional resource are registered only once, and implicit propagation is used to check which transactions go there

©Gustavo Alonso, ETH Zürich. 139 ©Gustavo Alonso, ETH Zürich. 140 Outline Client, server, and databases

† Historical perspective: † Processing, storing, accessing and INVENTORY CONTROL Ä The problem: synchronization and atomic interaction retrieving data has always been one IF supplies_low of the key aspects of enterprise Ä The solution: transactional RPC and additional support THEN computing. Most of this data resides † TP Monitors BOT in relational database management Ä Example and Functionality Place_order systems, which have well defined Ä Architectures Update_inventory interfaces and provided very clear Structure E guarantees to the operations Ä OT performed over the data. Ä Components † However: † TP Monitor functionality in CORBA Server 2 (products) Server 3 (inventory) Ä not all the data can reside in the New_product Place_order same database Lookup_product Cancel_order Ä the application is built on top of the database. The guarantees Delete_product Update_inventory provided by the database need to Update_product Check_inventory be understood by the application running on top

S S Inventory M Products M B B and order D database D database

©Gustavo Alonso, ETH Zürich. 141 ©Gustavo Alonso, ETH Zürich. 142

The nice thing about databases ... One at a time interaction † … is that they take care of all † Unfortunately, these properties are † Databases follow a single thread DBMS enforces Database CLIENT aspects related to data management, provided only to operations execution model where a client can transactional B from physical storage to performed within the database. In only have one outstanding call to OT brackets concurrency control and recovery principle, they do not apply when: one and only one server at any time. ... † Using a database can reduce the Ä An operation spawns several The basic idea is one call per EOT

amount of code necessary in a large databases process (thread). S M database application by about 40 % the operations access data not in † Databases provide no mechanism to B

Ä D † From a client/server perspective, the the database (e.g., in the server) bundle together several requests into a single work unit databases help in: † To help with this problem, the Ä concurrency control: many Distributed Transaction processing † The XA interface solves this Additional layer problem for databases by providing Database CLIENT enforces servers can be connected in Model was created by X/Open (a B transactional parallel to the same database standard’s body). The heart of this an interface that supports a 2 Phase OT brackets and the database will still have model is the XA interface for 2 Commit protocol. However, without ... correct data Phase Commit, which can be used to any further support, the client EOT becomes the one responsible for 2 Phase Commit Ä recovery: if a server fails in the ensure that an operation spawning coordinator several databases enjoy the same running the protocol which is highly middle of an operation, the impractical database makes sure this does atomicity properties as if it were XA XA not affect the data or other executed in one database. † An intermediate layer is needed to servers run the 2PC protocol DBMS DBMS

database database

©Gustavo Alonso, ETH Zürich. 143 ©Gustavo Alonso, ETH Zürich. 144 2 Phase Commit Interactions through RPC

BASIC 2PC What is needed to run 2PC? † RPC has the same limitations as a (a)

† Coordinator send PREPARE to all † Control of Participants: A database: it was designed for one at r

S participants. transaction may involve many a time interactions between two end e

M

v r points. In practice, this is not B database resource managers, somebody has to e client D

† Upon receiving a PREPARE s message, a participant sends a keep track of which ones have enough: message with YES or NO (if the participated in the execution a) the call is executed but the (b) vote is NO, the participant aborts the † Preserving Transactional Context: response does not arrive or the

client fails. When the client r

transaction and stops). During a transaction, a participant S e

M

recovers, it has no way of v

may be invoked several times on r

† Coordinator collects all votes: B database e client D behalf of the same transaction. The knowing what happened s Ä All YES = Commit and send resource manager must keep track of b) c) it is not possible to join two COMMIT to all others. calls and be able to identify which calls into a single unit (neither

Ä Some NO = Abort and send ones belong to the same transaction the client nor the server can do S

M

ABORT to all which voted by using a transaction identifier in this) B database YES. D

all invocations r

S e

M

† A participant receiving COMMIT or v r

† Transactional Protocols: somebody B database e client D ABORT messages from the acting as the coordinator in the 2PC s coordinator decides accordingly and protocol stops.

† Make sure the participants r e

S

(c) v

M understand the protocol (this is what r

e B database s

the XA interface is for) D

©Gustavo Alonso, ETH Zürich. 145 ©Gustavo Alonso, ETH Zürich. 146

Transactional RPC Basic TRPC (making calls)

† The limitations of RPC can be resolved client by making RPC calls transactional. In practice, this means that they are controlled by a 2PC protocol TM Client 12Client stub Transaction Manager (TM) BOT Get tid Generate tid † As before, an intermediate entity is needed to run 2PC (the client and server … from TM store context for tid could do this themselves but it is neither practical nor generic enough) TP Service_call Add tid to Associate server to tid † This intermediate entity is usually called TM monitor TM … call a transaction manager (TM) and acts as intermediary in all interactions between server server clients, servers, and resource managers 4 † When all the services needed to support RPC, transactional RPC, and additional features are added to the intermediate 3 Server stub Server layer, the result is a TP-Monitor Get tid DBMS XA XA DBMS register with database database the TM Invoke service Service 5 return procedure

©Gustavo Alonso, ETH Zürich. 147 ©Gustavo Alonso, ETH Zürich. 148 Basic TRPC (committing calls) One step beyond ...

† The previous example assumes the server is transactional and can run 2PC. This could be, for instance, a Client Client stub 1 Transaction Manager (TM) client stub RPC Look up tid stored procedure interface within a ... database. However, this is not the support Service_call usual model 2 Run 2PC with all servers client stub … † Typically, the server invokes a EOT Send to TM associated with tid resource manager (e.g., a database) Transaction manager commit(tid) 3 that is the one actually running the Confirm commit transaction TP-Monitor † This makes the interaction more complicated as it adds more Additional participants but the basic concept is features the same: Server stub Server Ä the server registers the resource Participant manager(s) it uses in 2PC Ä the TM runs 2PC with those server stub server stub resources managers instead of with the server (see OTS at the end) database database

©Gustavo Alonso, ETH Zürich. 149 ©Gustavo Alonso, ETH Zürich. 150

TP-Monitors = transactional RPC TP-Monitor functionality

† A TP-Monitor allows building a † TP-Monitors appeared because † A TP-Monitor addresses the problems of common interface to several operating systems are not suited for sharing data from heterogeneous, client applications while maintaining or adding transactional processing. TP-Monitors distributed sources, providing clean transactional properties. Examples: are built as operating systems on top of interfaces and ensuring ACID CICS, Tuxedo, Encina. operating systems. properties. services † A TP-Monitor extends the transactional † As a result, TP-Monitor functionality is † A TP-Monitor extrapolates the functions capabilities of a database beyond the not well defined and very much system of a transaction manager (locking, database domain. It provides the dependent. scheduling, logging, recovery) and transactional mechanisms and tools necessary to build controls the distributed execution. As coordination † A TP-Monitor tries to cover the applications in which transactional deficiencies of existing “all purpose” such, TP-Monitor functionality is at the guarantees are provided. systems. What it does is determined by core of the integration efforts of many TP-Monitor † TP-Monitors are, perhaps, the best, the systems it tries to ”improve”. software producers (databases, workflow systems, CORBA providers, oldest, and most complex example of † A TP-Monitor is basically an integration middleware. Some even try to act as tool. It allows system designers to tie …). distributed operating systems providing together heterogeneous system † A TP-Monitor also controls and file systems, communications, security components using a number of utilities manages distributed computations. It controls, etc. that can be mixed and matched performs load balancing, monitoring of † TP-Monitors have traditionally been depending on the particular components, starting and finishing associated to the mainframe world. characteristics of each case. components as needed, routing of requests, recovery of components, Application 1Application 2 Application 3 Their functionality, however, has long † Using the tools provided by the TP- since migrated to other environments Monitor, the integration effort becomes logging of all operations, assignment of and has been incorporated into most more straightforward as most of the priorities, scheduling, etc. In many cases middleware tools. it has its own transactional file system, needed functionality is directly becoming almost indistinguishable from supported by the TP-Monitor. a distributed operating system. ©Gustavo Alonso, ETH Zürich. 151 ©Gustavo Alonso, ETH Zürich. 152 Transactional properties TRAN-C (Encina)

† The TP-monitor tries to encapsulate the # include services provided within transactional client client brackets. This implies RPC augmented inModule(“helloWorld”); with: Ä atomicity: a service that produces void Main () { services modifications in several int i; components should be executed inFunction(“main”); entirely and correctly in each transactional initTC(); /* initializes transaction manager */ server component or should not be user coordination executed at all (in any of the program components). transaction { /* starts a transaction */ TP-Monitor Ä isolation: if several clients request printf(“Hello World - transaction %d\n”, getTid()); the same service at the same time if (I % 2) abort (“Odd transactions are aborted”); and access the same data, the overall result will be as if they were } alone in the system. onCommit Ä consistency: a service is correct printf(“Transaction Comitted”); when executed in its entirety (it onAbort does not introduce false or incorrect printf(“Abort in module: %s\n \t %s\n”, abortModuleNAme(), abortReason()); Application 1Application 2 Application 3 data into the component databases) } Ä durability: the system keeps track of what has been done and is capable of redoing and undoing changes in case of failures.

©Gustavo Alonso, ETH Zürich. 153 ©Gustavo Alonso, ETH Zürich. 154

TP-Monitor, generic architecture Tasks of a TP Monitor

Interfaces to user defined services Core services Additional services † Transactional RPC: Implements † Server monitoring and Programs implementing the services RPC and enforces transactional administration: starting, stopping Yearly balance ? Monthly average revenue ? semantics, scheduling operations and monitoring servers; load accordingly balancing † Transaction manager: runs 2PC and † Authentication and authorization: TP-Monitor Front end Control (load balancing, takes care of recovery operations checking that a user can invoke a environment cc and rec., replication, † Log manager: records all changes given service from a given terminal, distribution, scheduling, at a given time, on a given object priorities, monitoring …) done by transactions so that a consistent version of the system can and with a given set of parameters

’ (the OS does not do this)

2

1 be reconstructed in case of failures

r

1

r

r e

e recoverable

e

v

v † Data storage: in the form of a

r

v r queue † Lock manager: a generic mechanism

r e

e

s

e

s

s transactional file system user user user user

p to regulate access to shared data

p

p

p program program program program

p

p a

a a app server 3 outside the resource managers † Transactional queues: for asynchronous interaction between components wrappers † Booting, system recovery, and other Branch 1 Branch 2 Finance Dept. administrative chores

©Gustavo Alonso, ETH Zürich. 155 ©Gustavo Alonso, ETH Zürich. 156 Structure of TP-Monitors (I) Structure of TP-Monitors (II)

† TP-Monitors try in many aspects to replace the operating system so as to provide more efficient transactional properties. Depending what type of operating system they try to replace, they have a different structure: Ä Monolithic: all the functionality of the TP-Monitor is implemented

within one single process. The r

design is simpler (the process can s o

Term. Term. Term. s

t e

control everything) but restrictive Terminal handling (multithreaded) Terminal handling (multithreaded) i

c

n interf. interf. interf. o

(bottleneck, single point of failure, o

r p must support all possible protocols M in one single place). Monitor process Ä Layered: the functionality is divided Application handling (multithreaded) App 1 App 1 App 1 in two layers. One for terminal handling and several processes for App 1 App 1 App 1 interaction with the resource managers. The design is still simple but provides better performance and resilience. Ä Multiprocessor: the functionality is db db db db db db db db divided among many independent, db db db distributed processes. Monolithic structure Layered structure Multiprocessor structure ©Gustavo Alonso, ETH Zürich. 157 ©Gustavo Alonso, ETH Zürich. 158

TP-Monitor components (generic) Example: BEA Tuxedo bulletin Administration and operations interfaces client dll client board server named resource process routine handler name server process service manager persistent service queue Monitor services call client scheduling load balancing program forward call locate library server client queues server class server location

Authentication queue binding context context forward call client database read invoke transaction data internal dictionary Application Log Recovery resource Presentation services program manager manager managers queue response

persistent queue response

external database resource manager From “Transaction Processing” Gray&Reuter. Morgan Kaufmann 1993 ©Gustavo Alonso, ETH Zürich. 159 ©Gustavo Alonso, ETH Zürich. 160 Example: BEA Tuxedo TP-Monitor components (Encina)

† The client uses DLL (Dynamic Link † The current trend is towards a “family of products” instead of a single system. Libraries) routines to interact with client the TP-Monitor Each element can be used by itself (reduced footprint) and, in some cases, † The Monitor Process or Tuxedo dll routine can be used completely independent of Monitor Communication server implements all system the TP-Monitor. scheduling admin services (name services, transaction † Monitor: execution environment peer to peer management, load balancing, etc) providing integrity, availability, server mgmt. and acts as the control point for all security, fast response time and high queues interactions Monitor process throughput. It includes tools for administration and installation of data mgmt. † Application services are known as components and the development named services. These named environment. services interact with the system Txn.-RPC Txn. services server process † Communication services: protocols and through a local server process mechanisms for persistent messages and RPC concurrency control † Interaction across components is peer to peer communication. through message queues rather than Named service † Transactional RPC: basic interaction T-RPC logging recovery direct calls (although clients and mechanism servers may interact synchronously) † Transactional services: supporting Persistent storage DBMS concurrency control, recovery, logging and transactional programming. txnal. file system. database Behavior of the system can be tailored (advances transaction models, selective Resource database manager logging, ad-hoc recovery …) † Persistent storage ©Gustavo Alonso, ETH Zürich. 161 ©Gustavo Alonso, ETH Zürich. 162

External interfaces Monitor services

With clients With administrators † Monitor services are those facilities † Load balancing: tries to optimize the † The main interface is through the † The TP-Monitor needs to be that provide the basic functionality resources of the system by providing presentation services. In old maintained and administered like of the TP-Monitor. They can be an accurate picture of the ongoing systems, presentation services any other system. Today there are a implemented as part of the TP- and scheduled work included terminal handling and wide variety of tools for doing so. Monitor process or as external † Context management: a key service format control for presentation on a They include: resource managers in TRPC that is also used in keeping screen. Today, the presentation Ä node monitoring † Server class: each application context across transaction services are mostly interfaces to program implementing services has boundaries or to store and forward other systems that take care of data Ä service monitoring a server class in the monitor. The data between different resource presentation (mainly web servers) Ä load monitoring server class starts and stops the managers and servers † The most important part of the Ä configuration tools application, creates message queues, † Communication services (queue presentation services still in use Ä programming support monitors the load, etc. In general, it management and networking) are manages one application program today is the RPC (TRPC) stubs and Ä … usually implemented as external libraries used on the client side for † Binding: acts as the name and resource managers. They take care † Another important part of the invoking services implemented interfaces to the system are the directory services and offers similar of transactional queuing and any within the TP-Monitor development environments which functionality as the binder in RPC. It other aspect of message passing tend to be similar in nature to that of might be coupled with the load RPC systems balancing service for better distribution

©Gustavo Alonso, ETH Zürich. 163 ©Gustavo Alonso, ETH Zürich. 164 Resource managers Transaction processing components Internal Resource Managers External Resource Managers † These are modules that implement a † These are the systems the TP- Client particular service in the TP-Monitor. Monitor has to integrate Service Request/Response There are two kinds: † The typical resource manager is a † Application programs: programs that database management system with implement a collection of services an SQL/XA interface. It can also be TP forward request that can be invoked by the clients of a legacy application, in which case transaction the TP-Monitor. They define the wrappers are needed to bridge the Monitor Service application built upon the TP- interface gap. A typical example are response Monitor screen scraping modules that database Begin Work Resource interact with mainframe based managers † Internal services: like logging, Save Work locking, recovery, or queuing. applications by posing as dumb send / Save Work terminals Commit Implementing these services as receive Rollback Checkpoint Log resource managers gives more † The number and type of external Prepare Records register incoming / Commit modularity to the system and even resource managers keeps growing outgoing transactions Transaction allows to use other systems for this and a resource manager can be communication manager Log purpose (like queue management another TP monitor. manager UNDO/ Savepoint manager Savepoint systems) † The WWW is slowly also becoming REDO Prepared a resource manager Prepare Log Committed Savepoint Commit Records Completed Prepare Checkpoint Commit From “Transaction Processing” Gray&Reuter. Morgan Kaufmann 1993

©Gustavo Alonso, ETH Zürich. 165 ©Gustavo Alonso, ETH Zürich. 166

TP-Monitors vs. OS Advantages of TP-Monitors

processing data communication † TP-Monitors are a development and run-time platform for distributed applications TP Services Admin interface Databases Name server † The separation between the monitor and the transaction manager was a practical Configuration tools Disaster recovery Server invocation TP consideration but turned out to be a significant advantage as many of the Load balancing Protected user Monitor features provided by the monitor are as valuable as transactions Programming tools Resource managers interface † The move towards more modular architectures prepared TP-Monitors for Flow of control changes that had not been foreseen but turned be quite advantageous: TP internal Txn identifiers Transaction manager Transactional RPC Ä the web as the main interface to applications: the presentation services system services Server class Logs and context Transactional included an interface so that requests could be channeled through a web Scheduling Durable queues Sessions server Authentication Transactional files RPC Ä queuing as a form of middleware in itself (Message Oriented Middleware, OS Process – Threads Repository IPC MOM): once the queuing service was an internal resource manager, it was Address space Simple sessions not too difficult to adapt the interface so that the TP-Monitor could talk with Scheduling File System Naming other queuing systems Local naming protection Blocks, paging Authentication Ä Distributed object systems (e.g., CORBA) required only a small syntactic File security layer in the development tools and the presentation services so that services Hardware CPU Memory Wires, switches will appear as objects and TRPC would be come a method invocation to those objects.

From “Transaction Processing” Gray&Reuter. Morgan Kaufmann 1993

©Gustavo Alonso, ETH Zürich. 167 ©Gustavo Alonso, ETH Zürich. 168 TP-Heavy vs. TP-Light = 2 tier vs. 3 tier TP-light: databases and the 2 tier approach

† A TP-heavy monitor provides: † A TP-Light is an extension to a † Databases are traditionally used to Ä a full development environment database: manage data. (programming tools, services, Ä it is implemented as threads, instead client † However, simply managing data is not libraries, etc.), of processes, an end in itself. One manages data Ä additional services (persistent Ä it is based on stored procedures because it has some concrete application queues, communication tools, ("methods" stored in the database logic in mind. This is often forgotten transactional services, priority that perform an specific set of when considering databases (specially scheduling, buffering), operations) and triggers, database management system benchmarking) and has allowed SAP to take over a significant market share Ä support for authentication (of users Ä it does not provide a development before any other vendors reacted. and access rights to different environment. Database † But if the application logic is what services), † Light Monitors are appearing as developing environment matters, why not move the application Ä its own solutions for databases become more sophisticated logic into the database? These is what communication, replication, load and provide more services, such as user defined many vendors are advocating. By doing balancing, storage management ... integrating part of the functionality of a application logic this, they propose a 2 tier model with the (most of the functionality of an TP-Monitor within the database. database providing the tools necessary to operating system). † Instead of writing a complex query, the implement complex application logic. † Its main purpose is to provide an query is implemented as a stored † These tools include triggers, replication, execution environment for resource procedure. A client, instead of running stored procedures, queuing systems, managers (applications), and do all this the query, invokes the stored procedure. external application standard access interfaces (ODBC, with guaranteed reasonable performance † Stored procedure languages: Sybase's database JDBC) .. which are already in place in (e.g., > 1000 txns. per second). Transact-SQL, Oracle's PL/SQL. many databases. † This is the traditional monitor: CICS, Encina, Tuxedo. resource manager

©Gustavo Alonso, ETH Zürich. 169 ©Gustavo Alonso, ETH Zürich. 170

TP-Heavy: 3-tier middleware Object Transaction Service

† TP-heavy are middleware platforms for † An OTS provides transactional † ... and two ways to register Clients developing 3-tier architectures. They guarantees to the execution of resources (necessary in order to tell provide all the functionality necessary invocations between different the OTS who will participate in the for such an architecture to work. components of a distributed 2PC protocol and what type of † A system designer only need to program application built on top of an ORB. interface is supported) terminal Transaction the services (which will run within the It is part of the CORBA standard It † Manual registration implies the the handling control scope of the TP-Monitor; the services is identical to a basic TP-Monitor user provides an implementation of are linked to a number of TP libraries † There are two ways to trace calls: the resource. This implementation connecting logic providing the needed functionality), the Ä Explicit (manual): the invocation acts as an intermediary between the wrappers (if they are not already itself contains the transaction OTS and the actual resource provided), and the clients. The TP- identifier. Then, when the manager (useful for legacy TP monitor Services Monitors takes these components and application registers the resource applications that need to be embeds them within the overall system manager, it uses this transaction wrapped) as interconnected components. identifier to say to which † Automatic registration is used when † The TP-Monitor provides the transaction it is “subscribing” the resource manager understands infrastructure for the components to Ä Implicit (automatic): the call is transactions (i.e., it is a database), in wrappers work and the tools necessary to build made through the OTS, which which case it will support the XA services, wrappers and clients. In some will forward the transaction interface for 2PC directly. A cases, it provides even its own identifier along with the resource are registered only once, invocation. This requires to link and implicit propagation is used to programming language (e.g., check which transactions go there Resource Transational-C of Encina). with the OTS library and to managers make all methods involved transactional 2 tier systems

©Gustavo Alonso, ETH Zürich. 171 ©Gustavo Alonso, ETH Zurich. 172 Running a distributed transaction (1) Running a distributed transaction (2)

5) App A now calls App B 7) App B runs the txn but does not 1) Assume App A wants to update databases A and B 3) App A registers the database for that transaction Call for commit at the end txn Txn

App A App B App A App B App A App B App A App B DB DB DB DB DB DB DB DB

ORB Register ORB ORB ORB db Txn has part Object Object executed in Object Object Transaction Transaction database A Transaction Transaction Service Service Service Service

4) App A runs the txn but does not 2) App A obtains a txn identifier for the operation 6) App B registers the database for that transaction 2) App A request commit and the OTS runs 2PC txn commit at the end

App A App B App A App B App A App B App A App B DB DB DB DB DB DB DB DB

Begin ORB ORB ORB Register Commit ORB txn db txn Txn has part Object Object executed in Object Object Transaction Transaction database B Transaction Transaction Service Service Service 2PC Service 2PC

©Gustavo Alonso, ETH Zurich. 173 ©Gustavo Alonso, ETH Zurich. 174

The future of TP-Monitors

† TP-Monitors are the best example of middleware and the most successful implementation both in terms of performance and functionality. † Together with object brokers, TP-Monitors form the foundation of today’s distributed data management products. Enterprise Application Integration is still largely based on TP-Monitor technology. † TP-Monitors are the main reference for implementing middleware: Ä in terms of performance, TP-Monitors are orders of magnitude ahead of other middleware systems Ä in terms of functionality, TP-Monitors offer a quite complete, well integrated platform that can be extended to provide the functionality needed in other middleware systems WS-Coordination † Unlike other forms of middleware, TP-Monitors have proven to be quite resilient in time: some product lines are almost 30 years old already. Although the technology changes, the answer to fundamental design problems is well understood in TP-Monitors. These expertise will still have a significant impact on any emerging form of middleware.

©Gustavo Alonso, ETH Zürich. 175 ©Gustavo Alonso, ETH Zürich. 176 WS-Coordination Basics of WS-Coordination (1)

† WS-Coordination is intended as a generic infrastructure to implement coordination protocols between Web Register for a services Start a coordination coordination APPLICATION APPLICATION protocol † Its main goal is to serve as a generic protocol A B platform for implementing advanced transaction models but it can be used to implement a wide variety of coordination protocols between Context (Ca) services (including some forms of Activation Registration activity identifier A1 conversations) service service coordination type Q supporting protocol Y † WS-Coordination encompasses a set COORDINATOR portReference for the registration service A of behaviors and APIs that conform Coord. Coord.

a module that will extend Web for coordination type Q Protocol Protocol services with coordination ... CreateCoordinationContext capabilities A B Activation Registration Registration Activation † It mirrors the behavior of service service A service B service transactional services in conventional middleware platforms Operations of Operations of Coord. Coord. coordination coordination Protocol Ya Protocol Yb protocol protocol COORDINATOR A COORDINATOR B

©Gustavo Alonso, ETH Zürich. 177 ©Gustavo Alonso, ETH Zürich. 178

Basics of WS-Coordination (2) Basics of WS-Coordination (3)

APPLICATION MESSAGE (e.g., SOAP message) APPLICATION APPLICATION APPLICATION APPLICATION A HTTP POST B A B SOAP Envelope Header: Context Ca Context (Cb) activity identifier A1 coordination type Q supporting protocol Y

Body Ca portReference for the registration service B CreateCoordinationContext

Activation Registration Registration Activation Activation Registration Registration Activation service service A service B service service service A service B service

Coord. Coord. Coord. Coord. Protocol Ya Protocol Yb Protocol Ya Protocol Yb

COORDINATOR A COORDINATOR B COORDINATOR A COORDINATOR B

©Gustavo Alonso, ETH Zürich. 179 ©Gustavo Alonso, ETH Zürich. 180 Basics of WS-Coordination (4) Basics of WS-Coordination (5)

APPLICATION APPLICATION APPLICATION APPLICATION A B A B

Application B registers for coordination protocol Y Registration service B forwards the registration of and passes portReference to the protocol application B to the registration service A along with information about coordination protocol Yb The registration service A can then link both ends of the coordination protocol

Activation Registration Registration Activation Activation Registration Registration Activation service service A service B service service service A service B service

Coord. Coord. Coord. Coord. Protocol Ya Protocol Yb Protocol Ya Protocol Yb

COORDINATOR A COORDINATOR B COORDINATOR A COORDINATOR B

©Gustavo Alonso, ETH Zürich. 181 ©Gustavo Alonso, ETH Zürich. 182

Messages and interfaces WS-Coordinator in XML

† The coordinator defined by WS-Coordination is described using WSDL and offers a number of services to the application. ACTIVATION SERVICE: † The application accesses these services by sending, e.g., SOAP messages to the coordinator which then responds with new SOAP messages. Interactions with the protocol would then also be in terms of SOAP messages (but other protocols are possible, one needs only o provide alternative bindings for the coordinator services) † The example shown considers the case where application B decides to use its own coordinator. Application B could also decide to use the same coordinator as application A but in the cases where A and B are independent services provided RESPONSE ACTIVATION SERVICE by different organizations a coordinator per application makes more sense † WS-Coordination is an attempt at standardizing: Ä the use of SOAP headers for coordination protocols Ä the basic operations for most coordination protocols Ä the functionality a Web service middleware platform must support for allowing coordination protocols to be implemented

From Web Services Coordination (WS-Coordination) 9 August 2002

©Gustavo Alonso, ETH Zürich. 183 ©Gustavo Alonso, ETH Zürich. 184 WS-Transactions

† WS-Transactions builds directly upon WS-Coordination to specify different coordination protocols related to transaction processing Ä atomic transactions (governed by 2 Phase Commit) Ä business activities (transactional but based on compensation activities) • business agreement • business agreement with complete † WS-Transactions specifies the coordination protocol to be used as part of WS- Coordination. The specification deals with the nature of the interaction, the syntax and semantics of the messages to exchange as part of the coordination protocol, and the expected responses of all participants involved † Like WS-Coordination, WS-Transactions follows very closely the transactional WS-Transactions model found in conventional middleware platforms

©Gustavo Alonso, ETH Zürich. 185 ©Gustavo Alonso, ETH Zürich. 186

Coordination protocol for 2PC Business agreement

From Web Services Transaction (WS-Transaction) 9 August 2002 From Web Services Transaction (WS-Transaction) 9 August 2002 ©Gustavo Alonso, ETH Zürich. 187 ©Gustavo Alonso, ETH Zürich. 188 Business agreement with completion WS-Transactions

APPLICATION APPLICATION A B

Activation Registration Registration Activation service B service service service A WS-TRANSACTION PROTOCOL Coord. Coord. Protocol Ya Protocol Yb

From Web Services Transaction (WS-Transaction) 9 August 2002 COORDINATOR A COORDINATOR B

©Gustavo Alonso, ETH Zürich. 189 ©Gustavo Alonso, ETH Zürich. 190

CORBA transactions (1) CORBA transactions (2)

1) Assume App A wants to update databases A and B 3) App A registers the database for that transaction 5) App A now calls 7) App B runs the txn but does not Call for commit it at the end txn App B Txn

App A App B App A App B App A App B App A App B DB DB DB DB DB DB DB DB

ORB Register ORB ORB ORB db Txn has part Object Object executed in Object Object Transaction Transaction database A Transaction Transaction Service Service Service Service

2) App A obtains a txn identifier for the operation 4) App A runs the txn but does not 6) App B registers the database for that transaction 8) App A request commit and the OTS runs 2PC txn commit it at the end

App A App B App A App B App A App B App A App B DB DB DB DB DB DB DB DB

Begin ORB ORB ORB Register Commit ORB txn db txn Txn has part Object Object executed in Object Object Transaction Transaction database B Transaction Transaction Service Service Service 2PC Service 2PC

©Gustavo Alonso, ETH Zurich. 191 ©Gustavo Alonso, ETH Zurich. 192