Transaction Processing, Databases, and Fault-tolerant Systems

Digital Technical Journal Digital Equipment Corporation

Volume 3 Number 1

Winter 1991 Editorial Jane C. Blake, Editor Kathleen M. Stetson, Associate Editor

Circulation Catherine M. Phillips, Administrator Suzanne J. Babineau, Secretary

Production Helen L. Patterson, Production Editor Nancy Jones, Typographer Peter Woodbury, Illustrator

Advisory Board Samuel H. Fuller, Chairman Richard W. Beane Robert M. Glorioso Richard J. Hollingsworth John W. McCredie Alan G. Nemeth Mahendra R. Patel F. Grant Saviers Robert K. Spitz Victor A. Vyssotsky Gayn B. Winters

The Digital Technical Journal is published quarterly by Digital Equipment Corporation, 146 Main Street MLO1-3/B68, Maynard, Massachusetts 01754-2571. Subscriptions to the Journal are $40.00 for four issues and must be prepaid in U.S. funds. University and college professors and Ph.D. students in the electrical engineering and computer science fields receive complimentary subscriptions upon request. Orders, inquiries, and address changes should be sent to The Digital Technical Journal at the published-by address. Inquiries can also be sent electronically to DTJ@CRL.DEC.COM. Single copies and back issues are available for $16.00 each from Digital Press of Digital Equipment Corporation, 12 Crosby Drive, Bedford, MA 01730-1493.

Digital employees may send subscription orders on the ENET to RDVAX::JOURNAL or by interoffice mail to mailstop MLO1-3/B68. Orders should include badge number, cost center, site location code and address. All employees must advise of changes of address.

Comments on the content of any paper are welcomed and may be sent to the editor at the published-by or network address.

Copyright

The information in this journal is subject to change without notice and should not be construed as a commitment by Digital Equipment Corporation. Digital Equipment Corporation assumes no responsibility for any errors that may appear in this journal.

ISSN 0898-901X

Documentation Number EY-F588E-DP

The following are trademarks of Digital Equipment Corporation: DEC, DECforms, DECintact, DECnet, DECserver, DECtp, Digital, the Digital logo, LAT, Rdb/VMS, TA, VAX ACMS, VAX CDD, VAX COBOL, VAX DBMS, VAX Performance Advisor, VAX RALLY, VAX Rdb/VMS, VAX RMS, VAX SPM, VAX SQL, VAX 6000, VAX 9000, VAXcluster, VAXft, VAXserver, VMS.

IBM is a registered trademark of International Business Machines Corporation.

TPC Benchmark is a trademark of the Transaction Processing Performance Council.

Cover Design
Transaction processing is the common theme for papers in this issue. The automatic teller machine on our cover represents one of the many businesses that rely on TP systems. If we could look behind the familiar machine, we would see the products and technologies - here symbolized by linked databases - that support reliable and speedy processing of transactions worldwide.

The cover was designed by Dave Bryant of Digital's Media Communications Group.

Book production was done by Digital's Educational Services Media Communications Group in Bedford, MA.

Contents

8 Foreword Carlos G. Borgialli

Transaction Processing, Databases, and Fault-tolerant Systems

10 DECdta - Digital's Distributed Transaction Processing Architecture Philip A. Bernstein, William T. Emberton, and Vijay Trehan

18 Digital's Transaction Processing Monitors Thomas G. Speer and Mark W. Storm

33 Transaction Management Support in the VMS Kernel William A. Laing, James E. Johnson, and Robert V. Landau

45 Performance Evaluation of Transaction Processing Systems Walter H. Kohler, Yun-Ping Hsu, Thomas K. Rogers, and Wael H. Bahaa-El-Din

58 Tools and Techniques for Preliminary Sizing of Transaction Processing Applications William Z. Zahavi, Frances A. Habib, and Kenneth J. Omahen

65 Availability for Transaction Processing Ananth Raghavan and T. K. Rengarajan

70 Designing an Optimized Transaction Protocol Peter M. Spiro, Ashok M. Joshi, and T. K. Rengarajan

79 Verification of the First Fault-tolerant VAX System William F. Bruckert, Carlos Alonso, and James M. Melvin

Editor's Introduction

Jane C. Blake
Editor

Digital's transaction processing systems are integrated hardware and software products that operate in a distributed environment to support commercial applications, such as bank cash withdrawals, credit card transactions, and global trading. For these applications, data integrity and continuous access to shared resources are necessary system characteristics; anything less would jeopardize the revenues of business operations that depend on these applications. Papers in this issue of the Journal look at some of Digital's technologies and products that provide these system characteristics in three areas: distributed transaction processing, database access, and system fault tolerance.

Opening the issue is a discussion of the architecture, DECdta, which ensures reliable interoperation in a distributed environment. Phil Bernstein, Bill Emberton, and Vijay Trehan define some transaction processing terminology and analyze a TP application to illustrate the need for separate architectural components. They then present overviews of each of the components and interfaces of the distributed transaction processing architecture, giving particular attention to transaction management.

Two products, the ACMS and DECintact monitors, implement several of the functions defined by the DECdta architecture and are the twin topics of a paper by Tom Speer and Mark Storm. Although based on different implementation strategies, both ACMS and DECintact provide TP-specific services for developing, executing, and managing TP applications. Tom and Mark discuss the two strategies and then highlight the functional similarities and differences of each monitor product.

The ACMS and DECintact monitors are layered on the VMS operating system, which provides base services for distributed transaction management. Described by Bill Laing, Jim Johnson, and Bob Landau, these VMS services, called DECdtm, are an addition to the operating system kernel and address the problem of integrating data from multiple systems and databases. The authors describe the three DECdtm components, an optimized implementation of the two-phase commit protocol, and some VAXcluster-specific optimizations.

The next two papers turn to the issues of measuring TP system performance and of sizing a system to ensure a TP application will run efficiently. Walt Kohler, Yun-Ping Hsu, Tom Rogers, and Wael Bahaa-El-Din discuss how Digital measures and models TP system performance. They present an overview of the industry-standard TPC Benchmark A and Digital's implementation, and then describe an alternative to benchmark measurement - a multilevel analytical model of TP system performance that simplifies the system's complex behavior to a manageable set of parameters. The discussion of performance continues but takes a different perspective in the paper on sizing TP systems. Bill Zahavi, Fran Habib, and Ken Omahen have written about a methodology for estimating the appropriate system size for a TP application. The tools, techniques and algorithms they describe are used when an application is still in its early stages of development.

High performance must extend to the database system. In their paper on database availability, Ananth Raghavan and T. K. Rengarajan examine strategies and novel techniques that minimize the effects of downtime situations. The two databases referenced in their discussion are the VAX Rdb/VMS and VAX DBMS systems. Both systems use a database kernel called KODA, which provides transaction capabilities and commit processing. Peter Spiro, Ashok Joshi, and T. K. Rengarajan explain the importance of commit processing relative to throughput and describe new designs for improving the performance of group commit processing. These designs were tested, and the results of these tests and the authors' observations are presented.

Equally as important in TP systems as database availability is system availability. The topic of the final paper in this issue is a system designed to be continuously available, the fault-tolerant VAXft 3000 system. Authors Bill Bruckert, Carlos Alonso, and Jim Melvin give an overview of the system and then focus on the four-phase verification strategy devised to ensure transparent system recovery from errors.

I thank Carlos Borgialli for his help in preparing this issue and for writing the issue's Foreword.

Biographies

Carlos Alonso A principal software engineer, Carlos Alonso is a team leader for the project to port the System-V operating system to the VAXft 3000. Previously, he was the project leader for various VAXft 3000 system validation development efforts. As a member of the research group, Carlos developed the test bed for evaluating algorithms using the VMS Distributed Lock Manager, and he designed the prototype alternate lock rebuild algorithm for cluster transitions. He holds a B.S.E.E. (1979) from Tulane University and an M.S.C.S. (1980) from Boston University.

Wael Hilal Bahaa-El-Din Wael Bahaa-El-Din joined Digital in 1987 as a senior consultant to the Systems Performance Group, Database Systems. He has led a number of studies to evaluate the performance of database and transaction processing systems under response time constraints. After receiving his Ph.D. (1984) in computer and information science from Ohio State University, Wael spent three years as an assistant professor at the University of Houston. He is a member of ACM and IEEE, and he has written numerous articles for professional journals and conferences.

Philip A. Bernstein As a senior consultant engineer, Philip Bernstein is both an architectural consultant in the Transaction Processing Systems Group and a researcher at the Cambridge Research Laboratory. Prior to joining Digital in 1987, he was a professor at Wang Institute of Graduate Studies and at Harvard University, a vice president at Sequoia Systems, and a researcher at the Computer Corporation of America. He has published over 60 papers and coauthored two books. Phil received a B.S. (1971) in engineering from Cornell University and a Ph.D. (1975) in computer science from the University of Toronto.

William F. Bruckert William Bruckert is a consulting engineer who joined Digital in 1969 after receiving a B.S.E.E. degree from the University of Massachusetts. He received an M.S.E.E./C.E. degree from the same university in 1981. Beginning as a worldwide product support engineer, Bill later worked on a number of DECsystem-10/20 designs. He developed the cache, memory, and I/O subsystem for the VAX 8600 processor and was the system architect of the VAX 8650 processor. His most recent role was as the architect of the VAXft 3000 system. Bill currently holds seven patents.

William T. Emberton As a principal software engineer, William Emberton is currently involved in the development of Queue Management Architecture. He is also involved in X/Open and POSIX TP Standards work and is a member of the team that is developing the overall DECtp product architecture. Previously, he worked on the initial versions of the DECdta architecture. Before coming to Digital in 1987, Bill held positions as Director of Software Development at National Semiconductor and Manager of Systems Development for International Retail Systems at NCR. He was educated at London University.

Frances A. Habib Fran Habib is a principal software engineer involved with the development of transaction processing workload characterization and sizing tools. Previously, Fran worked at Data General and GTE Laboratories as a management science consultant. She holds an M.S. in operations research from MIT and a B.S. in engineering and applied science from Harvard. Fran is a full member of ORSA and belongs to ACM, IEEE, and the ACM SIGMETRICS special interest group on modeling and performance evaluation of computer systems.

Yun-Ping Hsu Yun-Ping is currently a principal software engineer in the Transaction Processing Systems Performance and Characterization Group. He joined Digital in October 1987, after receiving his master's degree in electrical and computer engineering from the University of Massachusetts at Amherst. In his position, Yun-Ping is responsible for performance modeling and benchmark measurement of both ACMS- and DECintact-based TP systems. He also participated in the TPC Benchmark A standardization activity during 1989. He is a member of ACM and IEEE.

James E. Johnson A consulting software engineer, Jim Johnson has worked for the VMS Engineering Group since joining Digital in 1984. He is currently a project leader for VMS Engineering in Europe. Prior to this work, Jim led the RMS project, and after relocating to the UK three years ago, he was responsible for much of the design and implementation of the DECdtm services. At the same time, Jim was an active participant in the transaction management architecture review group. He has applied for a patent pertaining to the two-phase commit protocol optimization currently used in DECdtm services.

Ashok M. Joshi Ashok Joshi is a principal software engineer interested in database systems, transaction processing, and object-based programming. He is presently working on the KODA subsystem, which provides record storage for Rdb/VMS and DBMS software. For the Rdb/VMS project, he developed hash indexing and record placement features, and he worked on optimizing the lock protocols. Ashok came to Digital after receiving a bachelor's degree in electrical engineering from the Indian Institute of Technology, Bombay, and a master's degree in computer science from the University of Wisconsin, Madison.

Walter H. Kohler As a software engineering senior manager, Walt is responsible for TP system performance measurement and analysis and leads Digital's TP benchmark standards activities. Before joining Digital in 1988, Walt was a visiting scientist and technical consultant to Digital and a professor of electrical and computer engineering at the University of Massachusetts at Amherst. He holds B.S., M.S., and Ph.D. degrees in electrical engineering, all from Princeton University. Walt recently received the IEEE/CS Meritorious Service Award, and he has published over 25 technical articles.

William A. Laing William Laing is a senior consultant engineer based in Newbury, England. He is the technical leader for production systems support for the VMS operating system. During five years spent in the U.S., Bill was responsible for the design and initial development of symmetrical multiprocessing support in the VMS system. He joined Digital in 1981, after doing research on operating systems at Edinburgh University for nine years. Bill holds a B.Sc. (1972) in mathematics and computer science and an M.Phil. (1976) in computer science, both from Edinburgh University.

Robert V. Landau Principal software engineer Robert Landau is a member of the VMS Engineering Group, based in Newbury, England. He is currently the project leader of a VMS advanced development team investigating a high-performance, transaction-based, flat file system. Before joining Digital in 1987, Bob worked for a variety of software houses specializing in database-related products. He studied botany at London University and, subsequently, obtained a teaching qualification from Hereford College.

James M. Melvin As a principal design engineer, Jim was responsible for the specification of hardware error-handling mechanisms in the VAXft system and is presently an engineering project leader for future VAXft systems. He also specified and led the implementation of the hardware system simulation platform and the hardware design verification test plan. Jim joined Digital in 1984 and holds a B.S.E.E. (1984) and an M.S. (1989) in engineering management from Worcester Polytechnic Institute. He holds three patents on the VAXft 3000 system, all related to error handling in a fault-tolerant system.

Kenneth J. Omahen A principal engineer, Kenneth Omahen is developing object-oriented queuing network solvers. He designed a variety of performance tools and performed design support studies which influenced a number of Digital products. Prior to joining Digital, Ken worked at Bell Telephone Laboratories, lectured at the University of Newcastle-Upon-Tyne, and was a faculty member at Purdue University. He received a B.S. degree in science engineering from Northwestern University and M.S. and Ph.D. degrees in information sciences from the University of Chicago.


Ananth Raghavan Since joining Digital in 1988, Ananth Raghavan has been a software engineer who has led projects for the KODA/Rdb Group. Previous to this position, he was a teaching assistant in the computer science department of the University of Wisconsin. Ananth holds a B.S. (1985) degree in mechanical engineering from the Indian Institute of Technology, Madras, and an M.S. (1987) degree in computer science from the University of Wisconsin, Madison. He has two patent applications pending for his work on undo and undo/redo database algorithms.

T. K. Rengarajan T. K. Rengarajan has been a member of the Database Systems Group since 1987 and works on the KODA software kernel for database management systems. He is involved in the support for WORM devices and global buffer management in the VAXcluster environment. His work in the areas of boundary element methods and database management systems is reported in several published papers and patent applications. Ranga holds an M.S. degree in computer-aided design from the University of Kentucky and an M.S. in computer science from the University of Wisconsin.

Thomas K. Rogers Thomas Rogers is a project leader for the Transaction Processing Systems Performance and Characterization Group. He is responsible for testing the VAX 9000 Model 210 system using the TPC Benchmark A standard. Prior to joining Digital in January 1988, Tom worked for Sperry Corporation as a technical specialist for the Northeast region. He received a bachelor of science degree in mathematical sciences in 1979 from Johns Hopkins University.

Thomas G. Speer As a principal software engineer in the DECtp/East Engineering Group, Thomas Speer is currently leading the DECintact V2.0 project. In this position, his major responsibility is defining the requirements for DECintact support of DECdtm services, client/server database access, and support for the DECforms product. Since joining Digital in 1981, Tom has worked on several development projects, including FORTRAN-10/20 and RMS-20. He holds degrees from Harvard University, Rutgers University, and Simmons College. He is a member of Phi Beta Kappa.

Peter M. Spiro Peter Spiro, a principal software engineer, is currently involved in optimizing database technology for RISC machines. He has worked on database facilities such as access methods, journaling and recovery, transaction protocols, and buffer management. Peter joined Digital in 1985, after receiving M.S. degrees in forest science and computer science from the University of Wisconsin. He has a patent pending for a method of database journaling and recovery, and he authored a paper for an earlier issue of the Digital Technical Journal. In addition, Peter enjoys the game of Ping-Pong.

Mark W. Storm Consulting engineer Mark Storm was one of the original designers of the ACMS monitor, and he has been involved in the development of TP products for more than ten years. Currently, he is acting technical director for the East Coast Transaction Processing Engineering Group, as well as managing a small advanced development group. After joining Digital in 1976, Mark worked on COBOL compilers for the PDP-11 systems and developed the first native COBOL compiler for the VAX computer. He holds a B.S. (with honors) in computer science from the University of Southern Mississippi.

Vijay Trehan Since joining Digital in 1978, Vijay Trehan has contributed to several architecture projects. He is the technical director responsible for DECtp architecture, design, and standards work. Prior to this assignment, Vijay was the architect for the DECdtm protocol, architect for the DDIS data interchange format, and initiator of work on the DDIF document interchange format and compound document strategy. He holds a B.S. (1972) in mechanical engineering from the Indian Institute of Technology and an M.S. (1974) in operations research from Syracuse University.

William Z. Zahavi As an engineering manager, Bill is responsible for the design and development of predictive sizing tools for transaction processing applications. Before joining Digital in 1987, he was a technical consultant for Sperry Corporation, specializing in systems performance analysis and capacity planning. Bill received an M.B.A. from Northeastern University and a B.S. in mathematics from the University of Virginia. He is an active member of the Computer Measurement Group, and frequently presents at CMG conferences.

Foreword

Carlos G. Borgialli
Senior Manager, DECtp Software Engineering

Transaction processing is one of the largest, most rapidly growing segments of the computer industry. Digital's strategy is to be a leader in transaction processing, and toward that end we are making technological advances and delivering products to meet the evolving needs of businesses that rely on transaction processing systems.

Because of the speed and reliability with which transaction processing systems capture and display up-to-date information, they enable businesses to make well-informed, timely decisions. Industries for which transaction processing systems are a significant asset include banking, laboratory automation, manufacturing, government, and insurance. For these industries and others, transaction processing is an information lifeline that supports the achievement of daily business objectives and in many instances provides a competitive advantage.

Many older transaction processing systems on which businesses rely are centralized and tied to a particular vendor. A great deal of money and time has been invested in these systems to keep pace with business expansion. As expansion continues beyond geographic boundaries, however, the centralized, single-vendor transaction processing systems are less and less likely to offer the flexibility needed for round-the-clock, reliable, business operations conducted worldwide. Transaction processing technology therefore must evolve to respond to the new business environment and at the same time protect the investment made in existing systems.

Our research efforts and innovative products provide the transaction processing systems that businesses need today. The demand for distributed rather than centralized systems has focused attention on system management. Queuing services, highly available systems, heterogeneous environments, security services, and computer-aided software engineering (CASE) are a few examples of areas in which research and advanced development efforts have had and will continue to have a major impact on the capabilities of transaction processing systems.

Transaction processing solutions require the application of a wide range of technology and the integration of multiple software and hardware products: from desktop to mainframe; from presentation services and user interfaces to TP monitors, database systems, and computer-aided software engineering tools; from optimization of system performance to optimization of availability. Making all of this technology work well together is a great challenge, but a challenge Digital is uniquely positioned to meet.

Digital ensures broad application of its transaction processing technology by defining an architecture, the Digital Distributed Transaction Architecture (DECdta). DECdta, about which you will read in this issue, defines the major components of a Digital TP system and the way those components can form an integrated transaction processing system. The DECdta architecture describes how data and processing are easily distributed among multiple VAX processors, as well as how the components can interoperate in a heterogeneous environment.

The DECdta architecture is based on the client/server computing model, which allows Digital to apply its traditional strengths in networking and expandability to transaction processing system solutions. In the DECdta client/server computing model, the client portion interacts with the user to create processing requests, and the server portion performs the data manipulation and computation to execute the processing request. This computing model facilitates the division of a TP system into small components in three ways. It allows for distribution of functions among VAX processors; it partitions the work performed by one or more of the components to allow for parallel processing; or it replicates functions to achieve higher availability goals. These options permit the customer to purchase the configuration that meets present needs, confident that the system will allow smooth expansion in the future.


Further, the DECdta architecture sets a direction for its evolution through different products in a coordinated manner. It provides for the cooperation and interoperation of components implemented on different platforms, and it supports the expansion of customer applications to meet growth requirements. The DECdta architecture is designed to work with other Digital architectures such as the Digital Network Architecture (DNA), the network application services (NAS), and the Digital database architecture (DDA). Moreover, the DECdta architecture supports industry standards that enable the portability of applications and their interoperation in a heterogeneous environment, such as the standard application programming interfaces being developed by the X/Open Transaction Processing Working Group and the IEEE POSIX. Standard wire protocols that provide for systems interoperation in a multivendor, heterogeneous environment are being developed by the International Standards Organization as part of the Open System Interconnection activities.

Among the products Digital has developed specifically for TP systems are the TP monitors. These monitors provide the system integration "glue," if you will. Rather than act as their own systems integrators, customers who use Digital's TP monitors are able to spend more time on solving business problems and less time on solving software integration problems, such as how to make forms and database products work together smoothly.

Digital's TP monitors run on all types of hardware configurations, including local area networks (LANs), wide area networks (WANs), and VAXcluster systems. The DECdta client/server computing model provides the necessary flexibility to change hardware configurations, thus allowing reconfiguration without the need for any source code changes. The two TP monitors, DECintact and VAX ACMS, integrate vital Digital technologies such as the Digital Distributed Transaction Manager (DECdtm) and products such as Digital's forms systems (DECforms) and our Rdb/VMS or VAX DBMS database products. DECdtm uses the two-phase commit protocol to solve the complex problem of coordinating updates to multiple data resources or databases.

Major developments in Digital's database products have enhanced the strengths of its overall product offerings. The two mainstream database products noted above, Rdb/VMS and VAX DBMS, layer on top of a database kernel called KODA, thus providing data access independent of any data model. The services made available by KODA, besides its high performance, allow Digital's database products to efficiently support TP applications as well as to provide rich functionality for general-purpose database applications.

For those TP systems that require user interfaces, DECforms provides a device-independent, easy-to-use human interface and permits the support of multiple devices and users within a single application.

TP systems that require high availability or continuous operations are supported by the VAX family of hardware and software. The introduction of the fault-tolerant VAXft 3000 system, added to the successful VAXcluster system, allows for a high level of system availability. Performance needs also are being met by a combination of hardware resources, including the VAX 9000 system.

This combination of architecture, software, and hardware technology, and support for emerging industry standards places Digital in an excellent position to become the industry leader for distributed, portable transaction processing systems. The papers in this issue of the Journal provide a view of the key elements of Digital's distributed transaction processing technologies.

Many individuals, teams, organizations, and business partners are responsible for bringing Digital's TP vision to fruition. Their dedication, hard work, and creativity will continue to drive the development of new technologies that enhance our family of products and services.

Philip A. Bernstein

William T. Emberton Vijay Trehan

DECdta - Digital's Distributed Transaction Processing Architecture

Digital's Distributed Transaction Processing Architecture (DECdta) describes the modules and interfaces that are common to Digital's transaction processing (DECtp) products. The architecture allows easy distribution of DECtp products. In particular, it supports client/server style applications. Distributed transaction management is the main function that ties DECdta modules together; it ensures that application programs, database systems, and other resource managers interoperate reliably in a distributed system.

Transaction processing (TP) is the activity of executing requests to access shared resources, typically databases. A computer system that is configured to execute TP applications is called a TP system.

A transaction is an execution of a set of operations on shared resources that has the following properties:

• Atomicity. Either all of the transaction's operations execute, or the transaction has no effect at all.

• Serializability. The set of all operations that execute on behalf of the transaction appears to execute serially with respect to the set of operations executed by every other transaction.

• Durability. The effects of the transaction's operations are resistant to failures.

A transaction terminates by executing the commit or abort operation. Commit tells the system to install the effect of the transaction's operations permanently. Abort tells the system to undo the effects of the transaction's operations.

For enhanced reliability and availability, a TP application uses transactions to execute requests. That is, the application receives a request message (from a display, computer, or other device), executes one or more transactions to process the request, and possibly sends a reply to the originator of the request or to some other party specified by the originator.
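The commit and abort operations can be made concrete with a small example. The sketch below is illustrative only and is not a DECtp interface: the names tm, accounts, begin, commit, and abort are hypothetical stand-ins for a transaction manager and a recoverable resource manager.

    # Illustrative sketch: 'tm' and 'accounts' are hypothetical stand-ins for a
    # DECdta transaction manager and a recoverable resource manager.
    def transfer(tm, accounts, source, target, amount):
        """Move funds atomically: either both updates happen or neither does."""
        txn = tm.begin()                         # start a transaction, get its identifier
        try:
            accounts.debit(txn, source, amount)  # operations are tagged with the transaction
            accounts.credit(txn, target, amount)
            tm.commit(txn)                       # install both effects permanently
        except Exception:
            tm.abort(txn)                        # undo every effect of the transaction
            raise

If the credit fails, the abort undoes the debit as well, which is exactly the atomicity property defined above.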

TP applications are essential to the operation of many industries, such as finance, retail, health care, transportation, government, communications, and manufacturing. Given the broad range of applications of TP, Digital offers a wide variety of products with which to build TP systems.

DECtp is an umbrella term that refers to Digital's TP products. The goal of DECtp is to offer an integrated set of hardware and software products that supports the development, execution, and management of TP applications for enterprises of all sizes.

DECtp systems include software components that are specialized for TP, notably TP monitors such as the ACMS and DECintact TP monitors, and transaction managers such as the DECdtm transaction manager.1,2 DECtp systems also require the integration of general-purpose hardware products (processors, storage, communications, and terminals) and software products (operating systems, database systems, and communication gateways). These products are typically integrated as shown in Figure 1.

Figure 1 Layering of Products to Support a TP Application (the figure shows the TP application layered over a TP monitor, database systems, and a forms manager, which are in turn layered over the operating system and communication system)


Applications on DECtp systems can be designed using a client/server paradigm. This paradigm is especially useful for separating the work of preparing a request from that of running transactions. Request preparation can be done by a front-end system, that is, one that is close to the user, in which processor cycles are inexpensive and interactive feedback is easy to obtain. Transaction execution can be done by a larger back-end system, that is, one that manages large databases and may be far from the user. Back-end systems may themselves be distributed. Each back-end system manages a portion of the enterprise database and executes applications, usually ones that make heavy use of the database on that back end. DECtp products are modularized to allow easy distribution across front ends and back ends, which enables them to support client/server style applications. DECtp systems thereby simplify programming and reconfiguration in a distributed system.

Digital's Distributed Transaction Processing Architecture (DECdta) defines the modularization and distribution structure that is common to DECtp products. Distributed transaction management is the main function that ties this structure together. This paper describes the DECdta structure and explains how DECdta components are integrated by distributed transaction management.

Current versions of DECtp products implement most, but not all, modules and interfaces in the DECdta architecture. Gaps between the architecture and products will be filled over time. DECtp products that currently implement DECdta components are referenced throughout the paper.

TP Application Structure
By analyzing TP applications, we can see where the need arises for separate DECdta components. A typical TP application is structured as follows:

Step 1: The client application interacts with a user (a person or machine) to gather input, e.g., using a forms manager.

Step 2: The client maps the user's input into a request, that is, a message that asks the system to perform some work. The client sends the request to a server application to process the request. A request may be direct or queued. If direct, the client expects a server to process the request right away. If queued, the client deposits the request in a queue from which a server can dequeue the request later.

Step 3: A server processes the request by executing one or more transactions. Each transaction may

a. Access multiple resources
b. Call programs, some of which may be remote
c. Generate requests to execute other transactions
d. Interact with a user
e. Return a reply when the transaction finishes

Step 4: If the transaction produces a reply, then the client interacts with the user to display that reply, e.g., using a forms manager.
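The four steps can be read as a small front-end loop. The following sketch is illustrative only; the Request class and the forms and request_mgr objects are hypothetical stand-ins for a presentation manager and a request manager, not DECdta interfaces.

    from dataclasses import dataclass, field

    @dataclass
    class Request:
        """A request message (Step 2): which program to run and its input."""
        program: str
        user: str
        data: dict = field(default_factory=dict)

    def front_end_loop(forms, request_mgr):
        """Steps 1 through 4 as seen from a front-end client (hypothetical interfaces)."""
        while True:
            fields = forms.gather_input("order_entry")   # Step 1: gather input through
                                                         # the presentation manager
            req = Request(program="process_order",       # Step 2: map the input to a request
                          user=fields["user"], data=fields)
            reply = request_mgr.send_direct(req)         # Step 3 runs on a back end: the
                                                         # server executes transactions
            if reply is not None:
                forms.display("order_reply", reply)      # Step 4: display the reply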


Each of the above steps involves the interaction of two or more programs. In many cases, it is desirable that these programs be distributed. To distribute them conveniently, it is important that the programs be in separate components. For example, consider the following:

• The presentation service that operates the display and the application that controls which form to display may be distributed. One may want to off-load presentation services and related functions to front ends, while allowing programs on back ends to control which forms are displayed to users. This capability is useful in Steps 1, 3d, and 4 above to gather input and display output. To ensure that the presentation service and application can be distributed, the presentation service should correspond to a separate DECdta component.

• The client application that sends a request and the server application that processes the request may be distributed. The applications may communicate through a network or a queue. In Step 2, front-end applications may want to send requests directly to back-end applications or to place requests in queues that are managed on back ends. Similarly, in Step 3c, a transaction, T, may enqueue a request to run another transaction, where the queue resides on a different system than T. To maximize the flexibility of distributing request management, request management should correspond to a separate DECdta component.

• Two transaction managers that want to run a commit protocol may be distributed.

For a transaction to be distributed across different systems, as in Step 3b, the transaction management services must be distributed. To ensure that each transaction is atomic, the transaction managers on these systems must control transaction commitment using a common commit protocol. To complicate matters, there is more than one widely used protocol for transaction commitment. To the extent possible, a system should allow interoperation of these protocols.

To ensure that transaction managers can be distributed, the transaction manager should be a component of DECdta. To ensure that they can interoperate, their transaction protocol should also be in DECdta. To ensure that different commit protocols can be supported, the part of transaction management that defines the protocol for interaction with remote transaction managers should be separated from the part that coordinates transaction execution across local resources. In the DECdta architecture, the former is called a communication manager, and the latter is called a transaction manager.

Interoperation of transaction managers and resource managers, such as database systems, also affects the modularization of DECdta components. A transaction may involve different types of resources, as in Step 3a. For example, it may update data that is managed by different database systems. To control transaction commitment, the transaction manager must interact with different resource managers, possibly supplied by different vendors. This requires that resource managers be separate components of DECdta.

The DECdta Architecture
Having seen where the need for DECdta components arises, we are now ready to describe the DECdta architecture as a whole, including the functions of and interfaces to each component.

Most DECdta interfaces are public. Some of the public interfaces are controlled by official standards bodies and industry consortia; i.e., they are "open" interfaces. Others are controlled solely by Digital. DECdta interfaces and protocols will be published and aligned with industry standards, as appropriate.

DECdta components are abstract entities. They do not necessarily map one-to-one to hardware components, software components (e.g., programs or products), or execution environments (e.g., a single-threaded process, a multithreaded process, or an operating system service). Rather, a DECdta component may be implemented as multiple software components, for example, as several processes. Alternatively, several DECdta components may be implemented as a single software component. For example, an operating system or TP monitor typically offers the facilities of more than one DECdta component.

The following are the components of DECdta:

• An application program is any program that uses services of DECdta components.

• A resource manager manages resources that support transaction semantics.

• A transaction manager coordinates transaction termination (i.e., commit and abort).

• A communication manager supports a transaction communication protocol between TP systems.

• A presentation manager supports device-independent interactions with a presentation device.

• A request manager facilitates the submission of requests to execute transactions.

DECdta components are layered on services that are provided by the underlying operating system and distributed system platform, and are not specific to TP, as shown in Figure 2.


Figure 2 DECdta Components and Interfaces (the figure shows application programs, that is, request initiators and transaction servers, calling TP services: request managers, resource managers, a presentation manager, communication managers, and a transaction manager; these are layered on operating system and distributed system services: thread management, distributed UID service, distributed name service, distributed time service, and authentication service)

Application Program
We use the term application program to mean a program that uses the services provided by other DECdta components. An application program could be a customer-written program, a layered product, or a DECdta component.

In the DECdta architecture, we distinguish two special types of application program: request initiators and transaction servers. A request initiator is a DECdta component that prepares and submits a request for the execution of a transaction. To create a request, the request initiator usually interacts with a presentation manager that provides an interface to a device, such as a terminal, a workstation, a digital private branch exchange, or an automated teller machine.

A transaction server can demarcate a transaction, interact with one or more resource managers to access recoverable resources on behalf of the transaction, invoke other transaction servers, and respond to calls from request initiators.

For a simple request, a transaction server receives the request, processes it, and optionally returns a reply to the request initiator. A conversational request is like a simple request, except that while processing the request, the transaction server exchanges one or more messages with the user, usually through the request initiator.

In principle, a request initiator could also execute transactions (not shown in Figure 2). That is, the distinction between request initiators and transaction servers is for clarity only, and does not restrict an application from performing request initiation functions in a transaction. Architecturally, this amounts to saying that request initiation functions can execute in a transaction server.

Resource Manager
A resource manager performs operations on shared resources. We are especially interested in recoverable resource managers, those that obey transaction semantics. In particular, a recoverable resource manager undoes a transaction's updates to the resources if the transaction aborts. Other recoverable resource manager activities in support of transactions are described in the next section. In the rest of this paper, we use "resource manager" to mean "recoverable resource manager."

In a TP system, the most common kind of resource manager is a database system. Some presentation managers and communication managers may also be resource managers. A resource manager may be written by a customer, a third party, or Digital.

Each resource manager type offers a resource-manager-specific interface that is used by application programs to access and modify recoverable resources managed by the resource manager. A description of these resource manager interfaces is outside the scope of DECdta. However, many of these resource manager interfaces have architectures defined by industry standards, such as SQL (e.g., the VAX Rdb/VMS product), CODASYL data manipulation language (e.g., the VAX DBMS product), and COBOL file operations (e.g., RMS in the VMS system).

One type of resource manager that plays a special role in TP systems is a queue resource manager. It manages recoverable queues, which are often used to store requests.4 It allows application programs to place elements into queues and retrieve them, so that application programs can communicate even though they execute independently and asynchronously. For example, an application program that sends elements can communicate with one that receives elements even if the two application programs are not operational simultaneously. This communication arrangement improves availability and facilitates batch input of elements.


A queue resource manager interface supports such operations as open-queue, close-queue, enqueue, dequeue, and read-element. The ACMS and DECintact TP monitors both have queue resource managers as components.
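The operations named above can be pictured as a small interface. The sketch below is an illustrative, in-memory rendering only; it is not the ACMS or DECintact queue interface, and it omits the recoverability (undo on abort, durability on commit) that a real queue resource manager provides.

    from collections import deque

    class SimpleQueueManager:
        """Hypothetical, in-memory sketch of the queue operations named above."""
        def __init__(self):
            self._queues = {}

        def open_queue(self, name):
            return self._queues.setdefault(name, deque())

        def close_queue(self, name):
            pass                          # nothing to release in this sketch

        def enqueue(self, queue, element):
            queue.append(element)         # sender and receiver need not run at the same time

        def dequeue(self, queue):
            return queue.popleft() if queue else None

        def read_element(self, queue):
            return queue[0] if queue else None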

Transaction Manager
A transaction manager supports the transaction abstraction. It is responsible for ensuring the atomicity of each transaction by telling each resource manager in a transaction when to commit. It uses a two-phase commit protocol to ensure that either all resource managers accessed by a transaction commit the transaction or they all abort the transaction.3 To support transaction atomicity, a transaction manager provides the following functions:

• Transaction demarcation operations allow application programs or resource managers to start and commit or abort a transaction. (Resource managers sometimes start a transaction to execute a resource operation if the caller is not executing a transaction. The SQL standard requires this.)

• Transaction execution operations allow resource managers and communication managers to declare themselves part of an existing transaction.

• Two-phase commit operations allow resource managers and communication managers to change a transaction's state (to "prepared," "committed," or "aborted").

The serializability of transactions is primarily the responsibility of the resource managers. Usually, a resource manager ensures serializability by setting locks on resources accessed by each transaction, and by releasing the locks after the transaction manager tells the resource manager to commit. (The latter activity makes serializability partly the responsibility of the transaction manager.) If transactions become deadlocked, a resource manager may detect the deadlock and abort one of the deadlocked transactions.

The durability of transactions is a responsibility of transaction managers and resource managers. The transaction manager is responsible for the durability of the commit or abort decision. A resource manager is responsible for the durability of operations of committed transactions. Usually, it ensures durability by storing a description of each transaction's resource operations and state changes in a stable (e.g., disk-resident) log. It can later use the log to reconstruct transactions' states while recovering from a failure.

A detailed description of the DECdta transaction manager component appears in the Transaction Manager Architecture section.

Communication Manager
A communication manager provides services for communication between named objects in a TP system, such as application programs and transaction managers. Some communication managers participate in coordinating the termination of a transaction by propagating the transaction manager's two-phase commit operations as messages to remote communication managers. Other communication managers propagate application data and transaction context, such as a transaction identifier, from one node to another. Some do both.

A TP system can support multiple communication managers. These communication managers can interact with other nodes using different commit protocols or message-passing protocols, and may be part of different name spaces, security domains, system management domains, etc. Examples are an IBM SNA LU6.2 communication manager or an ISO-TP communication manager.

By supporting multiple communication managers, the DECdta architecture enhances the interoperability of TP systems. Different TP systems can interoperate by executing a transaction using different commit protocols.

A communication manager offers an interface for application programs to communicate with other application programs. Different communication managers may offer different communication paradigms, such as remote procedure call or peer-to-peer message passing.

A communication manager also has an interface to its local transaction manager. It uses this interface to tell the transaction manager when a transaction has spread to a new node and to obtain information about transaction commitment, which it exchanges with communication managers on remote nodes.

Presentation Manager
A presentation manager provides an application program with a record-oriented interface to a presentation device. Its services are used by application programs, usually request initiators. By using presentation manager services, instead of directly accessing a presentation device, application programs become device independent.
Win ter 14 Vo l. 3 No. 1 1991 Digital Te cl micaljournal DECdta - Digital'sDistribu ted Tra nsaction Processing Architecture

A forms manager is one type of presentation manager. Just as a database system supports operations to define, open, close, and access databases, a forms manager supports operations to define, enable, disable, and access forms. A form includes the definition of the fields (with different attributes) that make up the form. It also includes services to map the fields into device-independent application records, to perform data validation, and to perform data conversion to map fields onto device-specific frames.

One presentation manager is Digital's DECforms forms management product. The DECforms product is the first implementation of the ANSI/ISO Forms Interface Management Systems standard (CODASYL FIMS).5

Request Manager
A request manager provides services to authenticate the source of requests (a user and/or a presentation device), to submit requests, and to receive replies from the execution of requests. It supports such operations as send-request and receive-reply. Send-request must provide the identity of the source device, the identity of the user who entered the request, the identity of the application program to be invoked, and the input data to the program.

A request manager can either pass the request directly to an application program, or it can store requests in a queue. In the latter case, another request manager can subsequently schedule the request by dequeuing the request and invoking an application program. The ACMS System Interface is an example of an existing request manager interface for direct requests. The ACMS Queued Transaction Initiator is an example of a request manager that schedules queued requests.1
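As an illustration of send-request and the identities it must carry, here is a hypothetical sketch; the field and method names are assumptions and do not describe the ACMS System Interface or the Queued Transaction Initiator.

    from dataclasses import dataclass

    @dataclass
    class TPRequest:
        source_device: str    # identity of the device the request came from
        user: str             # identity of the user who entered the request
        program: str          # application program to be invoked
        input_data: dict      # input data to the program

    class RequestManager:
        """Hypothetical request manager supporting direct and queued submission."""
        def __init__(self, request_queue, servers):
            self._queue = request_queue   # recoverable queue used for queued requests
            self._servers = servers       # mapping: program name -> transaction server

        def send_request(self, request, queued=False):
            if queued:
                self._queue.append(request)   # another request manager dequeues and
                return None                   # schedules the request later
            return self._servers[request.program](request)   # direct: reply returned here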

Transaction Manager Architecture
DECdta components are tied together by the transaction abstraction. Transactions allow application programs, resource managers, request managers (indirectly through queue resource managers), and communication managers to interoperate reliably. Since transactions play an especially important role in the DECdta architecture, we describe the transaction management functions in more detail.

The DECdta architecture includes interfaces between transaction managers and application programs, resource managers, and communication managers, as shown in Figure 3. It also includes a transaction manager protocol, whose messages are propagated by communication managers. This protocol is used by Digital's DECdtm distributed transaction manager.2

Figure 3 Transaction Manager Architecture

From a transaction manager's viewpoint, a transaction consists of transaction demarcation operations, transaction execution operations, two-phase commit operations, and recovery operations.

• The transaction demarcation operations are issued by an application program to a transaction manager and include operations to start and either end or abort a transaction.

• Transaction execution operations are issued by resource managers and communication managers to a transaction manager. They include operations

  - For a resource manager or communication manager to join an existing transaction

  - For a communication manager to tell a transaction manager to start a new branch of a transaction that already exists at another node


• Two-phase commit operations are issued by a transaction manager to resource managers, communication managers, and through communication managers to other transaction managers, and vice versa. They include operations

  - For a transaction manager to ask a resource manager or communication manager to prepare, commit, or abort a transaction

  - For a resource manager or communication manager to tell a transaction manager whether it has prepared, committed, or aborted a transaction

  - For a communication manager to ask a transaction manager to prepare, commit, or abort a transaction

  - For a transaction manager to tell a communication manager whether it has prepared, committed, or aborted a transaction

• Recovery operations are issued by a resource manager to its transaction manager to determine the state of a transaction (i.e., committed or aborted).

In response to a start operation invoked by an application program, the transaction manager dispenses a unique transaction identifier for the transaction. The transaction manager that processes the start operation is that transaction's home transaction manager.

When an application program invokes an operation supported by a resource manager, the resource manager must find out the transaction identifier of the application program's transaction. This can happen in different ways. For example, the application program may tag the operation with the transaction identifier, or the resource manager may look up the transaction identifier in the application program's context. When a resource manager receives its first operation on behalf of a transaction, T, it must join T, meaning that it must tell a transaction manager that it is a subordinate for T. Alternatively, the DECdta architecture supports a model in which a resource manager may ask to be joined automatically to all transactions managed by its transaction manager, rather than asking to join each transaction separately.

A transaction, T, spreads from one node, Node 1, to another node, Node 2, by sending a message (through a communication manager) from an application program that is executing T at Node 1 to an application program at Node 2. When T sends a message from Node 1 to Node 2 for the first time, the communication managers at Node 1 and Node 2 must perform branch registration. This function may be performed automatically by the communication managers. Or, it may be done manually by the application program, which tells the communication managers at Node 1 and Node 2 that the transaction has spread to Node 2. In either case, the result is as follows: the communication manager at Node 1 becomes the subordinate of the transaction manager at Node 1 for T and the superior of the communication manager at Node 2 for T; and the communication manager at Node 2 becomes the superior of the transaction manager at Node 2 for T. This arrangement allows the commit protocol between transaction managers to be propagated properly by communication managers.

After the transaction is done with its application work, the application program that started transaction T may invoke an "end" operation at the home transaction manager to commit T. This causes the home transaction manager to ask its subordinate resource managers and communication managers to try to commit T. The transaction manager does this by using a two-phase commit protocol. The protocol ensures that either all subordinate resource managers commit the transaction or they all abort the transaction.

In phase 1, the home transaction manager asks its subordinates for T to prepare T. A subordinate prepares T by doing what is necessary to guarantee that it can either commit T or abort T if asked to do so by its superior; this guarantee is valid even if it fails immediately after becoming prepared. To prepare T,

• Each subordinate for T recursively propagates the prepare request to its subordinates for T

• Each resource manager subordinate writes all of T's updates to stable storage

• Each resource manager and transaction manager subordinate writes a prepare-record to stable storage

A subordinate for T replies with a "yes" vote if and when it has completed its stable writes and all of its subordinates for T have voted "yes"; otherwise, it votes "no." If any subordinate for T does not reply within the timeout period, then the home transaction manager aborts T; the effect is the same as issuing an abort operation.

In phase 2, when the home transaction manager has received "yes" votes from all of its subordinates for T, it decides to commit T. It writes a commit record for T to stable storage and tells its subordinates for T to commit T. Each subordinate for T writes a commit record for T to stable storage and recursively propagates the commit request to its subordinates for T. A subordinate for T replies with an acknowledgment if and when it has committed the transaction (in the case of a resource manager subordinate) and has received acknowledgments from all subordinates for T. When the home transaction manager receives acknowledgments from all of its subordinates for T, the transaction commitment is complete.
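Collapsing the stable-storage writes, message passing, and recursive propagation into method calls gives the following simplified sketch of the home transaction manager's logic. It is a sketch of the general two-phase commit idea under those assumptions, not the DECdtm implementation or its optimizations.

    def two_phase_commit(home_log, subordinates, T, timeout):
        """Simplified home transaction manager logic for committing transaction T."""
        # Phase 1: ask every subordinate to prepare T and collect votes; a subordinate
        # that does not reply within the timeout is treated as a "no" vote.
        votes = [sub.prepare(T, timeout=timeout) for sub in subordinates]
        if not all(vote == "yes" for vote in votes):
            for sub in subordinates:
                sub.abort(T)              # same effect as issuing an abort operation
            return "aborted"
        # Phase 2: make the decision durable, then tell the subordinates to commit.
        home_log.write(("commit", T))
        for sub in subordinates:
            sub.commit(T)                 # each subordinate acknowledges after it and its
                                          # own subordinates have committed T
        return "committed"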


To recover from a failure, all resource managers that participated in a transaction must examine their logs on stable storage to determine what to do. If the log contains a commit or abort record for T, then T completed. No action is required. If the log contains no prepare, commit, or abort record for T, then T was active. T must be aborted. If the log contains a prepare record for T, but no commit or abort record for T, T was between phases 1 and 2. The resource manager must ask its superior transaction manager whether to commit or abort the transaction.

An inherent problem in all two-phase commit protocols is that a resource manager is blocked between phases 1 and 2, that is, after voting "yes" and before receiving the commit or abort decision. It cannot commit or abort the transaction until the transaction manager tells it which to do. If its transaction manager fails, the resource manager may be blocked indefinitely, until either the transaction manager recovers or an external agent, such as a system manager, steps in to tell the resource manager whether to commit or abort.

A transaction T may spontaneously abort due to system errors at any time during its execution. Or, an application program (prior to completing its work) or a resource manager (prior to voting "yes") may tell its transaction manager to abort T. In either case, the transaction manager then tells all of its subordinates for T to undo the effects of T's resource manager operations. Subordinate resource managers abort T, and subordinate communication managers recursively propagate the abort request to their subordinates for T.

The two-phase commit protocol is optimized for those cases in which the number of messages exchanged can be reduced below that of the general case (e.g., if there is only one subordinate resource manager, if a resource manager did not modify resources, or if the presumed-abort protocol was used to save acknowledgments).
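Because the recovery rules above depend only on which records appear in the log, they can be expressed as a single decision function. This is a minimal sketch under assumed record names ("prepare", "commit", "abort"); it is not an actual DECdta log format.

```python
def recovery_action(log_records, tid):
    """Decide what a resource manager does for transaction tid after a
    failure, following the rules above.  Record names are assumptions."""
    kinds = {kind for (kind, t) in log_records if t == tid}
    if "commit" in kinds or "abort" in kinds:
        return "no action required"        # T completed before the failure
    if "prepare" not in kinds:
        return "abort T"                   # T was still active
    # Prepared, but no outcome recorded: T was between phases 1 and 2.
    return "ask the superior transaction manager whether to commit or abort"
```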

Summary

We have presented an overview of the DECdta architecture. As part of this overview, we introduced the components and explained the function of each interface. We also described the DECdta transaction management architecture in some detail. Over time, many interfaces of the DECdta model will be made public via product offerings or architecture publications.

Acknowledgments

This architecture grew from discussions with many colleagues. We thank them all for their help, especially Dieter Gawlick, Bill Laing, Dave Lomet, Bruce Mann, Barry Rubinson, Diogenes Torres, and the TP architecture group, including Edward Braginsky, Tony DellaFera, George Gajnak, Per Gyllstrom, and Yoav Raz.

References

1. T. Speer and M. Storm, "Digital's Transaction Processing Monitors," Digital Technical Journal, vol. 3, no. 1 (Winter 1991, this issue): 18-32.

2. W. Laing, J. Johnson, and R. Landau, "Transaction Management Support in the VMS Operating System Kernel," Digital Technical Journal, vol. 3, no. 1 (Winter 1991, this issue): 33-44.

3. P. Bernstein, V. Hadzilacos, and N. Goodman, Concurrency Control and Recovery in Database Systems (Reading, MA: Addison-Wesley, 1987).

4. P. Bernstein, M. Hsu, and B. Mann, "Implementing Recoverable Requests Using Queues," Proceedings 1990 ACM SIGMOD Conference on Management of Data (May 1990).

5. FIMS Journal of Development (Norfolk, VA: CODASYL FIMS Committee, July 1990).

6. C. Mohan, B. Lindsay, and R. Obermarck, "Transaction Management in the R* Distributed Database Management System," ACM Transactions on Database Systems, vol. 11, no. 4 (December 1986).

Thomas G. Speer
Mark W. Storm

Digital's Transaction Processing Monitors

Digital provides two transaction processing (TP) monitor products - ACMS (Application Control and Management System) and DECintact (Integrated Application Control). Each monitor is a unified set of transaction processing services for the application environment. These services are layered on the VMS operating system. Although there is a large functional overlap between the two, both products achieve similar goals by means of some significantly different implementation strategies. Flow control and multithreading in the ACMS monitor is managed by means of a fourth-generation language (4GL) task definition language. Flow control and multithreading in the DECintact monitor is managed at the application level by third-generation language (3GL) calls to a library of services. The ACMS monitor supports a deferred task model of queuing, and the DECintact monitor supports a message-based model. Over time, the persistent distinguishing feature between the two monitors will be their different application programming interfaces.

Transaction processing is the execution of an application that performs an administrative function by accessing a shared database. Within transaction processing, processing monitors provide the software "glue" that ties together many software components into a transaction processing system solution.

A typical transaction processing application involves interaction with many terminal users by means of a presentation manager or forms system to collect user requests. Information gathered by the presentation manager is then used to query or update one or more databases that reflect the current state of the business. A characteristic of transaction processing systems and applications is many users performing a small number of similar functions against a common database. A transaction processing monitor is a system environment that supports the efficient development, execution, and management of such applications.

Processing monitors are usually built on top of or as extensions to the operating system and other products such as database systems and presentation services. By so doing, additional components can be integrated into a system and can fill "holes" by providing functions that are specifically needed by transaction processing applications. Some examples of these functions are application control and management, transaction-processing-specific execution environments, and transaction-processing-specific programming interfaces.

Digital provides two transaction processing monitors: the Application Control and Management System (ACMS) and the DECintact monitor. Both monitors are built on top of the VMS operating system. Each monitor provides a unified set of transaction-processing-specific services to the application environment, and a large functional overlap exists between the services each monitor provides. The distinguishing factor between the two monitors is in the area of application programming styles and interfaces - fourth-generation language (4GL) versus third-generation language (3GL). This distinction represents Digital's recognition that customers have their own styles of application programming. Those that prefer 4GL styles should be able to build transaction processing applications using Digital's TP monitors without changing their style. Similarly, those that prefer 3GL styles should also be able to build TP applications using Digital's TP monitors without changing their style.

The ACMS monitor was first introduced by Digital in 1984. The ACMS monitor addresses the requirements of large, complex transaction processing applications by making them easier to develop and manage. The ACMS monitor also creates an efficient execution environment for these applications.
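To make the 4GL/3GL distinction concrete before the detailed sections that follow, the sketch below expresses the same two-step function in both styles. It is illustrative Python only: the declarative step list, the run_task engine, and the services and db objects are hypothetical and do not correspond to ACMS task definition language or to DECintact service calls.

```python
# Illustrative contrast only; neither half is real ACMS or DECintact syntax.

# 4GL style: the flow is described declaratively, and an engine supplied
# by the monitor (modeled by run_task) drives the steps.
DEPOSIT_TASK = [
    ("exchange", "deposit_form"),      # collect input from the terminal user
    ("processing", "update_account"),  # run a user-written subroutine
]

def run_task(task, forms, procedures):
    workspace = {}                     # record passed between the steps
    for kind, name in task:
        if kind == "exchange":
            workspace.update(forms[name]())     # presentation work
        else:
            procedures[name](workspace)         # database or file work
    return workspace

# 3GL style: the application itself calls monitor services in the order
# it chooses, managing its own flow control.
def deposit(services, db):
    workspace = dict(services.display_form("deposit_form"))
    services.start_recovery_unit()
    db.update_account(workspace)       # application-managed database I/O
    services.commit_recovery_unit()
    return workspace
```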


The DECintact monitor (Integrated Application Control) was originally developed by a third-party vendor. Purchased and introduced by Digital in 1988, it has been installed in major financial institutions and manufacturing sites. The DECintact monitor includes its own presentation manager, support for DECforms, a recoverable queuing subsystem, a transaction manager, and a resource manager that provides its own recovery of RMS (Record Management Services) files.

This paper highlights the important similarities and differences of the ACMS and DECintact monitors in terms of goals and implementation strategies.

Development Environment

Transaction processing monitors provide a view of the transaction processing system for application development. Therefore, the ACMS and DECintact monitors must embody a style of program development.

ACMS Programming Style

A "divide and conquer" approach was used in the ACMS monitor. The work typically involved in developing a TP application was divided into logically separate functions described below. Each of these functions was then "conquered" by a special utility or approach.

In the ACMS monitor, an "application" is defined as a collection of selectable units of work called tasks. A separate application definition facility isolates the system management characteristics of the application (such as resource allocation, file location, and protection) from the logic of the application.

The specification of menus is also decoupled from the application. A nonprocedural (4GL) method of defining menu layouts is used in which the layouts are compiled into form files and data structures to be used at run-time. Each menu entry points either to another menu or to an application and a task. (Decoupling menus from the application allows user menus to be independent of how the tasks are grouped into applications.)

In addition to separate menu specification and system management characteristics, the application logic is broken down into the three logical parts of interactive TP applications:

• Exchange steps support the exchange of data with the end user. This exchange is typically accomplished by displaying a form on a terminal screen and collecting the input.

• Processing steps perform computational processing and database or file I/O through standard subroutines. The subroutines are written in any language that accepts records passed by reference.

• The task definition language defines the flow of control between processing steps and exchange steps and specifies transaction demarcation.

Workspaces are special records that the ACMS monitor provides to pass data between the task definition, exchange steps, and processing steps.

A compiler, called the application definition utility (ADU), is implemented in the ACMS monitor to compile the task definition language into binary data structures. The run-time system is driven, rather than interpreted, by these structures. Digital is the only vendor that supplies this "divide and conquer" solution to building large complex TP applications. We believe this approach - unique in the industry - reduces complexity, thus making applications easier to produce and to manage.

DECintact Programming Style

The approach to application development used in the DECintact monitor provides the application developer with 3GL control over the transaction processing services required. This approach allows application prototyping and development to be done rapidly. Moreover, the application can make the most efficient use of monitor services by selecting and controlling only those services required for a particular task.

In the DECintact monitor, an application is defined as one or more programs written entirely in 3GL and supported by the VMS system. The code written by the application developer manages all flow control, user interaction, and data manipulation through the utilities and service libraries provided by the DECintact monitor. All DECintact services are callable, including most services provided by the DECintact utilities. The DECintact services are as follows:

• A library of presentation services used for all interaction with users. The application developer includes calls to these services for form manipulation and display. Forms are created with a forms editor utility and can be updated dynamically. Forms are displayed by the DECintact terminal manager in emulated block mode. Device- and terminal-dependent information is completely separated from the implementation of the application.


• The separation of specification of menus from the application. DECintact menus are defined by means of a menu database and are compiled into data structures accessed at run-time. The menus are tree-structured. Each entry points either to another menu entry or to an executable application image. The specification of menus is linked to the DECintact monitor's security subsystem. The DECintact terminal user sees only those specific menu entries for which the user has been granted access.

• A library of services for the control of file and queue operations. In addition to layered access to the RMS file system, the DECintact monitor supports its own hash file format (a functional analog to single-keyed indexed files in RMS) which provides very fast, efficient record retrieval. The application developer includes calls to these services for managing RMS and hash file I/O operations, demarcating recovery unit boundaries, creating queues, placing data items on queues, and removing data items from queues. The queuing subsystem is typically an integral part of application design and work flow control. Application-defined DECintact recovery units ensure that RMS, hash, and queue operations can be committed or aborted atomically; that is, either all permanent effects of the recovery unit happen, or none happen.

Because of DECintact's 3GL development environment, application programmers who are accustomed to calling procedure libraries from standard VMS languages or who are familiar with other transaction processing monitors can easily learn DECintact's services. Application prototypes can be produced quickly because only skills in 3GL are required. Further, completed applications can be produced quickly because training time is minimal.

On-line Execution Environment

Transaction processing monitors provide an execution environment tailored to the characteristics and needs of transaction processing applications. This environment generally has two aspects: on-line, for interactive applications that use terminals; and off-line, for noninteractive applications that use other devices.

Traditional VMS timesharing applications are implemented by allocating one VMS process to each terminal user when the user logs in to the system. An image activation is then done each time the terminal user invokes a new function. This method is most beneficial in simple transaction processing applications that have a relatively small number of users. However, as the number of users grows or as the application becomes larger and more complex, several problem areas may arise with this method:

• Resource use. As the number of processes grows, more and more memory is needed to run the system effectively.

• Start-up costs. Process creation, image activation, file opens, and database binds are expensive operations in terms of system resources utilized and time elapsed. These operations can degrade system performance if done frequently.

• Contention. As the number of users simultaneously accessing a database or file grows, contention for locks also increases. For many applications, lock contention is a significant factor in throughput.

• Processing location. Single-process implementations limit distribution options.

ACMS On-line Execution

To address the problems listed above, Digital implemented a client/server architecture in the ACMS monitor. (Client/server is also called request/response.) The basic run-time architecture consists of three types of processes, as shown in Figure 1: the command process, execution controller, and procedure servers.

An agent in the ACMS monitor is a process that submits work requests to an application. In the ACMS system, the command process is a special agent responsible for interactions with the terminal user. (In terms of the DECdta architecture, the command process implements the functions of a request initiator, presentation manager, and request manager for direct requests.)¹ The command process is generally created at system start-up time, although ACMS commands allow it to be started at other times. The process is multithreaded through the use of VMS asynchronous system traps (ASTs). Thus, one command process per node is generally sufficient for all terminals handled by that node.

There are two subcomponents of the ACMS monitor within the command process:

• System interface, which is a set of services for submitting work requests and for interacting with the ACMS application


[Figure 1 Basic Run-time Architecture of the ACMS Monitor: command process (system interface, DECforms server), execution controller, and procedure servers on the back-end node, with the menu database, task definition, and user databases]

• DECforms, Digital's forms management product, which implements the ANSI/ISO Forms Interface Management System (FIMS) that provides the presentation server for executing the exchange steps

The command process reads the menu definition for a particular terminal user and determines which menu to display. When the terminal user selects a particular menu entry, the command process calls the ACMS system interface services to submit the task. The system interface uses logical names from the VMS system to translate the application name into the address of the execution controller that represents that application. The system interface then sends a message to the execution controller. The message contains the locations of the presentation server and an index into the task definition tables for the particular task. The status of the task is returned in the response. During the course of task execution, the command process accepts callbacks from the task to display a form for interaction with the terminal user.

The execution controller executes the task definition language and creates and manages procedure servers. The controller is created at application start-up time and is multithreaded by using VMS ASTs. There is one execution controller per application. (In terms of the DECdta architecture, the execution controller and the procedure servers implement the functions of a transaction server.)¹

When the execution controller receives a request from the command process, it invokes DECdtm (Digital Distributed Transaction Manager) services to join the transaction if the agent passes the transaction identifier. If the agent does not pass a transaction identifier, there is no transaction to join and a DECdtm or resource-manager-specific transaction is started as specified in the task definition. The execution controller then uses the task index to find the tables that represent the task. When the execution of a task reaches an exchange step, the execution controller sends a callback to the command process for a form to be displayed and the input to be collected for the task. When the request to display a form is sent to the command process, the execution controller dismisses the AST to enable other threads to execute. When the response to the request arrives from the exchange step, an AST is added to the queue for the execution controller.

When a task comes to a processing step, the execution controller allocates a free procedure server to the task. It then sends a request to the procedure server to execute the particular procedure and dismisses the AST. If no procedure server is free, the execution controller puts the request on a waiting list and dismisses the AST.


When a procedure server becomes free, the execution controller checks the wait list and allocates the procedure server to the next task, if any, on the wait list.

Procedure servers are created and deleted by the execution controller. Procedure servers are a collection of user-written procedures that perform computation and provide database or file accesses for the application. The procedures are written in standard languages and use no special services. The ACMS system creates a transfer vector from the server definition. This transfer vector is linked into the server image. With this vector, the ACMS system code can receive incoming messages and translate them into calls to the procedure.

A procedure server is specified with initialization and termination procedures, which are routines supplied by the user. The ACMS monitor calls these procedures whenever a procedure server is created and deleted. The initialization procedure opens files and performs database bind operations. The termination procedure does clean-up work, such as closing files prior to process exit.

The ACMS architecture addresses the problem areas discussed in the On-line Execution Environment section in several ways.

Resource Use

Because procedure servers are allocated only for the time required to execute a processing step, the servers are available for other use while a terminal user types in data for the form. Thus, the system can execute efficiently with fewer procedure servers than active terminal users. Improvement gains in resource use can vary, depending on the application. Our debit and credit benchmark experiments with the ACMS monitor and the Rdb/VMS system indicated that the most improvement occurs with one procedure server for every one or two transactions per second (TPS). These benchmarks equate to one procedure server for every 10 to 20 active terminal users.

The use of procedure servers and the multithreaded character of the execution controller and the command process allow the architecture to reduce the number of processes and, therefore, the number of resources needed. The optimal solution for resource use would consist of one large multithreaded process that performed all processing. However, we chose to trade off some resource use in the architecture in favor of other gains.

• Ease of use - Multithreaded applications are generally more difficult to code than single-threaded applications. For this reason, procedure server subroutines in the ACMS system can be written in a standard fashion by using standard calls to Rdb/VMS and the VMS system.

• Error isolation - In one large multithreaded process, the threads are not completely protected within the process. An application logic error in one thread can corrupt data in a thread that is executing for a different user. A severe error in one thread could potentially bring down the entire application. The multithreaded processes in the ACMS architecture (i.e., the execution controller and command process) are provided by Digital. Because no application code executes directly in these processes, we can guarantee that no application coding error can affect them. Procedure servers are single-threaded. Therefore, an application logic error in a procedure server is isolated to affect only the task that is executing in the procedure server.

Start-up Costs

The run-time environment is basically "static," which means that the start-up costs (i.e., system resources and elapsed time) are incurred infrequently (i.e., at system and application start-up time). A timesharing user who is running many different applications causes image activations and rundowns by switching among images. Because the terminal user in the ACMS system is separated from the applications processes, the process of switching applications involves only changing message destinations and incurs minimal overhead.

Contention

The database accesses in the ACMS environment are channeled through a relatively few, but heavily used, number of processes. The typical VMS timesharing environment uses a large number of lightly used processes. By reducing the number of processes that access the database, the contention for locks is reduced.

Processing Location

Because the ACMS monitor is a multiprocess architecture, the command process and forms processing can be done close to the terminal user on small, inexpensive machines. This method takes advantage of the inexpensive processing power available on these smaller machines while the rest of the application executes on a larger VAX.
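The procedure server allocation policy described earlier in this section (allocate a free server to a processing step, otherwise place the request on a waiting list, and recheck the list whenever a server is released) can be sketched as a small pool object. The class below is hypothetical and ignores the message-passing and AST details.

```python
from collections import deque

class ProcedureServerPool:
    """Hypothetical model of execution-controller server allocation: a
    processing step gets a free server at once or waits until one is
    released."""
    def __init__(self, servers):
        self.free = deque(servers)
        self.wait_list = deque()

    def request(self, task):
        if self.free:
            return self._dispatch(task, self.free.popleft())
        self.wait_list.append(task)        # no server free: the task waits
        return None

    def release(self, server):
        if self.wait_list:                 # a server freed: check the wait list
            return self._dispatch(self.wait_list.popleft(), server)
        self.free.append(server)
        return None

    def _dispatch(self, task, server):
        # In ACMS this would send a message to the procedure server and
        # dismiss the AST; here the pairing is simply returned.
        return (task, server)
```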


DECintact On-line Execution

Although the specific components of the DECintact monitor vary from those of the ACMS monitor, the basic architecture is very similar. Figure 2 shows the application configured locally to the front end. The run-time architecture consists of three types of DECintact system processes - terminal manager/dispatcher, DECforms servers, server manager - and, typically, one or more application processes. When forms processing is distributed, the same application is configured as shown in Figure 3.

The DECintact monitor can run in multiple copies on any one VAX node. Each copy can be an independent run-time environment; or it can share data and resources, such as user security profiles and menu definitions, with other copies on the same system. Thus, independent development, testing, and production environments can reside on the same node.

In the DECintact system, the terminal manager/dispatcher process (one per copy) is responsible for the following:

• Displaying DECintact forms

• Coordinating DECforms forms display

• Interacting with local applications

• Communicating, through DECnet, with remote DECintact copies

• Maintaining security authorization, including the dynamic generation of user-specific menus

Applications designated in the local menu database as remote applications cause the front-end terminal manager/dispatcher process to communicate with the cooperating back-end terminal manager/dispatcher process through a task-to-task DECnet link. (In terms of the DECdta architecture, the terminal manager/dispatcher implements the functions of presentation manager, request initiator, and request manager for direct requests.)¹ When a user selects the remote task, that user's request is sent to the back end and is treated by the application as a local request. The terminal manager/dispatcher process is started automatically as part of a copy start-up and is multithreaded. Therefore, one such process can handle all the terminal users for a particular DECintact copy.

When the terminal user selects a menu task, one of the following actions occurs, depending on whether the task is local or remote and whether it is single- or multithreaded.

If the application is local and single-threaded, a VMS process may be created that activates the application image associated with this task. The terminal manager/dispatcher, upon start-up, may create a user-specified number of application shell VMS processes to activate subsequent application images. If such a shell exists when the user selects a task, this process is used to run the application image. Each user who selects a given menu entry receives an individual VMS process and image.

[Figure 2 Basic Run-time Architecture of the DECintact Monitor: terminal manager/dispatcher, multithreaded application, and database servers with user databases]


[Figure 3 DECintact Basic Architecture with Distributed Forms Processing: a front-end terminal manager/dispatcher communicating with a back-end terminal manager/dispatcher, multithreaded application, and user databases]

If the application is local and multithreaded, the terminal manager/dispatcher first determines whether this task has already been activated by previous users. If the task has not been activated and a shell is not available, the terminal manager/dispatcher creates a VMS process for the application and activates the image. If the task is already activated, the terminal manager/dispatcher connects the user to the active task. The user becomes another thread of execution within the image. Multithreaded applications handle many simultaneous users within the context of one VMS process and image.

Remote applications, whether single- or multithreaded, route the menu task selection to a remote terminal manager/dispatcher process. On receipt of the request, the remote terminal manager/dispatcher processes the selection locally by using the same procedures as described above.

Local DECintact forms interaction is handled in the following manner by the local terminal manager/dispatcher. The application's call to display a form sends a request to the terminal manager. The terminal manager locates the form in its database of active forms, displays the form on the user's terminal, and returns control to the application when the user has entered all data in the form. If the application is remote, form information is sent between cooperating local and remote terminal manager processes; the interface is transparent to the application.

In addition to supporting DECintact forms, the DECintact monitor also supports applications that use DECforms as their presentation service. The implementation of this support follows the same client/server model used by the ACMS system's support for DECforms and shares much of the underlying run-time interprocess communication code used by the ACMS monitor. Functionally, the two implementations of DECforms support are also similar to the ACMS monitor. Both implementations offer transparent support for distributed DECforms processing, automatic forms caching (i.e., propagation of updated DECforms in a distributed environment), and DECforms session caching for increased performance.


The DECintact monitor supports application-level, single- and multithreaded environments. The DECintact monitor's threading package allows application programmers to use standard languages supported by the VMS system to write multithreaded processes. Applications declare themselves as either single- or multithreaded. With the exception of the declaration, there is little difference between the way an on-line multithreaded application and its single-threaded counterpart must be coded. For on-line applications, thread creation, deletion, and management are automatic. New threads are created when a terminal user selects the multithreaded application and are deleted when the user leaves the application.

In a single-threaded application, the following occurs:

• Each user receives an individual VMS process and image context (e.g., 200 users, 200 processes).

• All terminal and file I/O is synchronous.

• The application image normally exits when the application work is completed.

In a multithreaded on-line application, the following occurs:

• One VMS process/image can handle many simultaneous users.

• All terminal and file I/O is asynchronous.

• New threads are created automatically when new users are connected to the process.

• The application image does not exit when all currently allocated threads have completed execution but remains for use by new on-line users.

For each thread in a multithreaded application image, the DECintact system maintains thread context and state information. Each I/O request is issued asynchronously. Immediately after control is returned, but before the I/O request completes, the DECintact system saves the currently executing thread's context and schedules another thread to execute. When the thread's I/O completion AST is delivered, the thread's context is restored, and the thread is inserted on an internally maintained list of threads eligible for execution.

A thread's context consists of the following:

• An internally maintained thread block containing state information

• The stack

• Standard DECintact work spaces that are allocated to each thread and that maintain terminal and file management context

• Local storage (e.g., the $LOCAL PSECT in COBOL applications) that the application has designated as thread-specific

The PSECT naming convention allows the application to decide which variable storage is thread-specific and which is process-global. Thread-specific storage is unavailable to other threads in the same process because it is saved and restored on each thread switch. Process-global storage is always available to all threads in the process and can be used when interthread communication or synchronization is desired.

The use of multithreading in the DECintact system is appropriate for higher volume multiuser applications that perform frequent I/O. Such application usage is typical in transaction processing environments. Because thread switches occur only when I/O is requested or when locking requests are issued, this environment may not be recommended for applications that perform infrequent I/O or that expect very small numbers of concurrent users, such as end-of-day accounting programs or other batch-oriented processing. These kinds of applications typically choose to declare themselves as single-threaded.

All I/O from within a multithreaded DECintact application process is asynchronous. Therefore, the DECintact system provides a client/server interface between multithreaded applications and synchronous database systems, such as VAX DBMS (Database Management System) and Rdb/VMS systems. The interface is provided because calling a synchronous database operation directly from within a multithreaded application would stall the calling thread and all other threads until the call completed. Figure 2 shows that a typical on-line DECintact application accessing Rdb/VMS, for example, is written in two pieces:

• A multithreaded, on-line piece (the client), that handles forms requests from multiple users

• A single-threaded, database server piece (a server instance), that performs the actual synchronous database I/O

This client/server approach to database access is functionally very similar to that of ACMS procedure servers and offers similar benefits. Like the ACMS monitor, the DECintact monitor offers system management facilities to define pools of servers and to adjust them dynamically at run-time in accordance with load.

ami to start up new instances, as necessary. The once started, do not exit but wait fo r new user DECintact server code, like the ACMS procedure threads as users select the application. Thus, the server code, can define initialization and termina­ IXCintact terminal user can switch between appli­ tion procedures to perform once-only start-up and cation images and incur only an inexpensive shut-down processing. With DECintact transaction thread creation. semantics, which are layered on DECdtm services, a client can declare a global transaction that the Contention As in the ACMS monitor, database server instance will join. The server instance can accesses in the DECintact client/server environ­ also declare its own independent transaction or no ment are channeled through a relatively few, but transaction. (In terms of the DECdta architecture, heavily used, number of processes rather than this client/server approach implements the func­ through a large number of lightly used processes. tions of a transaction server.)' The principal diffe r­ This reduction decreases lock contention. ence between the DECintact and ACMS approach is that DECintact clients and servers use a message­ Processing Location Forms processing can be based 3GL communications interface to send and off- loaded to a front end and brought closer to the receive work requests. Control in the ACMS moni­ terminal user. Thus smaller, less expensive CPUs tor resides in the execution controller. can be used while the rest of the application exe­ As the ACMS monitor does, the DECintact archi­ cutes on a larger back-end machine or cluster. In tecture addresses the problem areas discussed in the DECintact monitor, the front end can consist of the On-line Execution section in several ways. fo rms processing on.ly or a mix of fo rms process­ AJ so, as with the ACMS approach, the factors we ing and application remote queuing work. chose to trade off allowed us to achieve better effi­ ciency, performance, and ease of use. Off-lineExec ution Many transaction processing applications are used Resource Use The DECintact system's multi­ with nonterminal devices, such as a bar code threaded methodology economizes on VMS reader or a communications link used fo r an elec­ resources. Similar to the method used in the ACMS tronic fu nds transfer application. Because there is monitor, the system reduces process creations no human interaction with these applications, and image activations. A major diffe rence between they have two requirements that diffe r from the the ACMS and DECintact architectures is the way requirements of interactive appl ications: tasks must the DECintact monitor implements multithread­ be simple data entries, and the system must handle ing support. The transparent implementation of fa ilures transparently. threading capabi lities means that coding multi­ threaded applications is no more difficult than ACMS Off line Execution coding traditional single-threaded applications. As The ACMS monitor's goal for off-line processing is with any application-level threading scheme, how­ to allow simple transaction capture to continue ever, the responsibility fo r ensuring that a logic when the application is not available. A typical error in one thread is isolated to that thread lies example is the continued capture of data on a man­ with the application. The DECintact client/server ufacturing assembly line by a MicroVAX system faci lities fo r accessing databases, like those used in when the application is unava i !able. 
The ACMS the ACMS monitor, can realize similar benefits in monitor provides two mechanisms fo r support­ process reuse, throughput, and error isolation. ing nonterminal devices: queuing agents and user­ written agents. Start-up Costs The DECintact architecture, like Figure 4 illustrates the ACMS queuing model. the ACMS architecture, distributes start-up costs A queuing system is a resource manager that (i.e., system resources and elapsed time) between processes entries, with priorities, in first-in, first­ two points: the start of the DECintact system, and out (FIFO) order. (In terms of DECdta, this is the the start of applications. System start-up can queue resource manager.)' The ACMS queuing faci l­ involve prestarting VMS process shells (as dis­ ity is built upon RMS-indexed files. The primary cussed previously) fo r subsequent application goal of ACMS queuing is to provide a store­ image activation. On-line application start-up is and-forward mechanism to allow task requests executed on demand when the first user selects a to be collected for later execution. By using particular menu task. Multithreaded applications, the ACMS$ENQlJE_TASK service, a user can write


By using the ACMS$ENQUE_TASK service, a user can write a process that captures a task request and safely stores the task on a local disk queue.

The ACMS monitor provides a special agent, called the queued task initiator (QTI), which takes a task entry from the queue and submits it to the appropriate execution controller. The QTI starts a DECdtm transaction, removes the task entry from the queue within that transaction, invokes the ACMS task, and passes the transaction identifier. (In terms of the DECdta architecture, the QTI implements the functions of a request manager for queued requests.)¹ The task then joins that transaction. The removal from the queue is atomic with the commit of the task, and no task entry is lost or executed twice.

[Figure 4 ACMS Queuing Agents]

Figure 5 shows the ACMS user-written agent model for off-line processing. With the ACMS system interface, users may write their own versions of the command process. Note that because these agents cannot be safely stored on disks, this method is generally not as reliable as using queues. User-written agents can be used, however, with DECdtm and the fault-tolerant VAXft 3000 system to produce a reliable front-end system. To do so, a user writes an agent that captures the input for the task and then starts a DECdtm transaction. The agent uses the system interface services to invoke the ACMS task and passes the transaction identifier and the input data. When the task call completes, the agent commits the transaction. If DECdtm returns an error on the commit, the agent loops back to start another transaction and to resubmit the task. If a VAXcluster system is used for the application, this configuration will survive any single point of failure.

[Figure 5 ACMS User-written Agent Model for Off-line Processing: bar code readers feeding a user-written agent on a VAXft 3000 front end, with an execution controller and procedure servers on the back-end node]

DECintact Off-line Execution

The DECintact monitor provides several facilities for applications to perform off-line processing. These facilities allow applications to

• Interface with and process data from nonterminal devices and asynchronous events

• Control transaction capture, store and forward, interprocess communication, and business work flow through the DECintact queuing subsystem
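Both ACMS off-line mechanisms described above follow the same pattern: perform the dequeue (or input capture) and the task invocation inside one transaction, and retry after a failure. The sketch below uses hypothetical queue and transaction objects rather than real ACMS system interface or DECdtm calls.

```python
def queued_task_initiator(queue, invoke_task, begin_transaction):
    """Sketch of the QTI behavior described above: removing the task entry
    from the queue is atomic with the commit of the task itself."""
    while True:
        entry = queue.peek()            # next task entry, if any
        if entry is None:
            break
        txn = begin_transaction()       # stands in for a DECdtm transaction
        try:
            queue.remove(entry, txn)    # dequeue within the transaction
            invoke_task(entry, txn)     # the task joins the same transaction
            txn.commit()                # entry is neither lost nor run twice
        except Exception:
            txn.abort()                 # entry stays on the queue for retry


def user_written_agent(capture_input, invoke_task, begin_transaction):
    """Sketch of the user-written agent loop: if the commit fails, start a
    new transaction and resubmit the task."""
    data = capture_input()
    while True:
        txn = begin_transaction()
        invoke_task(data, txn)
        try:
            txn.commit()
            return
        except Exception:
            continue                    # loop back and resubmit the task
```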


Off-line Multithreading

Off-line, multithreaded DECintact applications are typically used to service asynchronous events, such as the arrival of an electronic funds transfer message or the addition to the queue of an item already on a DECintact queue. The application programmer explicitly controls how many threads are created, when they are created, and which execution path or paths each thread will follow. Off-line, multithreaded applications are well-suited to message switching systems and other aspects of electronic funds transfer in which each thread may be dedicated to servicing a different kind of event.

DECintact Queues

The primary goal of the DECintact queuing subsystem is to support a work flow model of business transactions. (In the DECdta architecture, the DECintact queuing subsystem implements the functions of a queue resource manager and request initiator for queued requests.)¹ In a typical DECintact application that relies on queuing, the state of the business transaction may be represented by the queue on which a particular queue item resides at the moment. An item moves from queue to queue as the item's processing state changes, much as a work item moves from desk to desk. The superset of queue items that reside on queues throughout the application at any one time represents the state of transactions currently executing. Depending on the number of programs that need to process data during the course of a transaction, a queue item may be inserted on several different queues before the transaction completes. The application also may wish to chain together several small transactions within the context of a larger business transaction. The DECintact queuing system functions throughout the application: from the front end, where queues collect and route incoming data; to the back end, where queues can be integrated with data files in recovery units; and in between, where different programs in the application can use queues to share data.

The DECintact queuing subsystem consists of a comprehensive set of callable services for the creation and manipulation of queues, queue sets, and queue items. Queue item operations performed within the context of a DECintact transaction are fully atomic along with DECintact file operations. In addition to overall workflow control, the DECintact queuing system allows the following:

• Deferred processing - An item can be queued by one process and then removed from the queue later by another process for processing. Deferred processing is useful when the volume of data entry is concentrated at particular times of day; applications can assign themselves to one or more queues and can be notified when an item is inserted on the queue.

• Store-and-forward processing - When users at the front end of the system write items local to queues, data entry can be continuous in the event of back-end system failure or whenever a program that is needed to process data is temporarily unavailable.

• Interprocess communication - Locally between applications sharing a node and by means of the DECintact remote queuing facility, applications can use the queuing system to reliably exchange application data between processes and applications.

A fundamental difference between ACMS queues and DECintact queues is that the ACMS system inserts tasks onto the queues, and the DECintact system inserts data items. In DECintact queuing, each data item contains both user-supplied data and a header that includes an item key and other control information. The header is used by the queuing system to control the movement of the item from queue to queue. Each queue item can be assigned an item priority. Items can be removed from the queue in FIFO order, in FIFO order within item priority, or by direct access using the item key. Queues can be stopped and started for insertion, removal, or both. Queues can also be redirected transparently at the system management level to running applications.

In the DECintact monitor, alert thresholds can be specified on a queue-by-queue basis to alert the system manager when queue levels reach defined amounts. Individual queue items can be held against removal or released. Queues can be grouped together into logical entities, called queue sets, which look and behave to the application the same as individual queues. Queue sets have added facilities for broadcast insertion on all members of a queue set and a choice of removing algorithms that can weight relative item- and queue-level priorities from the queue.

DECintact queues can be automatically distributed. At the system management level, a local queue can be designated as remote outbound. That is to say, items added to this queue are shipped transparently across the network to a corresponding remote inbound queue on the destination node. The transfer is handled by the DECintact queuing system by using exactly-once semantics (i.e., the item is guaranteed to be sent once and only once). From the point of view of the application that is adding or removing items from the queue, remote queues behave exactly as local queues behave.

To better understand some of the uses for DECintact queuing, consider a simplified but representative electronic funds transfer example built on the DECintact monitor. Figure 6 shows the elements of such an application. In this application, transactions might be initiated either locally by clerks entering data into the system from user-generated documents or by an off-line application that receives data from another branch or bank. The transactions are verified or repaired by other clerks in a different department of the bank. The transactions are then sent to destination banks over one or more network services.

To implement this application, the developer uses queues to route, safely store, and synchronize data as it progresses through the system, and to prioritize data items. Data items are given priority levels, based on application-defined criteria, such as transfer amount, destination bank, or time-to-closing.

(i.e., the item is guaranteed to be sent once and generated documents or by an off-line application only once). From the point of view of the appli­ that receives data from another branch or bank. cation that is adding or removing items from the The transactions are verified or repaired by other queue, remote queues behave exactly as local clerks in a different department of the bank. The queues behave. transactions are then sent to destination banks To better understand some of the uses for over one or more network services. DECintact queuing, consider a simplified but repre­ To implement this application, the developer uses sentative electronic funds transfer example built queues to route, safely store, and synchronize data on the DECintact monitor. Figure 6 shows the ele­ as it progresses through the system, and to priori­ ments of such an application. In this application, tize data items. Data items are given priority levels, transactions might be initiated either locally by based on application-defined criteria, such as trans­ clerks entering data into the system from user- fe r amount, destination bank, or time-to-closing.

[Figure 6 Elements of a DECintact Electronic Funds Transfer: terminal manager/dispatcher, Data Entry application, Verify and Repair application, database servers with user databases, and a Fedwire network connection]


As illustrated in Figure 6, the terminal manager controls terminals for the Data Entry and Verify and Repair applications. Clerks enter data from user-generated documents on-line as complete messages. Verification and repair clerks receive these messages as work items from the verify and repair queue through the Verify and Repair application. The result of verification is either a validated message, which is ultimately sent to a destination bank, or an unverifiable message, which is routed to the supervisor queue for special handling. After special handling, the message rejoins the processing flow by returning to the verify and repair queue. After validation, the messages are inserted in the Fedwire Xmt queue and sent over the network to the Federal Reserve System. The Fedwire Process application controls the physical interface to the communication line and implements the Fedwire protocol. The validated messages are also used to update a local database by means of database server programs.

The Fedwire Xmt queue could be defined as a queue set, which would permit the Fedwire Process application to remove items from the queue by a number of algorithms that bias the transfer amount by queue and item priority. Similarly, this queue set could be passively reprioritized near the close of the business day. In other words, the DECintact system administrator could use the DECintact queue utility near the end of the day to change queue-wide priorities and ensure that items with a higher priority level in the queue set would be sent over the Fedwire first, without changing any application code.

Application Management

Typically, transaction processing applications are crucial to the business running the applications. If the applications cannot perform their functions reliably or securely, business activity may have to cease altogether or be curtailed, as in the case of an inventory control application or electronic funds processing application. Therefore, the applications require additional controls to ensure that the applications and the access by users to the applications are limited to exactly what is needed for the business.

ACMS Application Management

Of the many features and tools for monitoring and controlling the system offered in the ACMS monitor, three areas are most often used.

• Controlling and restricting terminal user environments

• Controlling and restricting the application

• Ability to dynamically make changes to the application without stopping work

In addition to using the VMS user authorization file (VMS SYSUAF), the ACMS monitor provides utilities to define which users and terminals have access to the ACMS system. Controlled terminals are terminals defined by one of these utilities to be owned by the ACMS monitor. These terminals are allocated by the ACMS monitor when the ACMS system is started. When a user presses the Return key, the ACMS monitor displays its login prompt. Unless the user has login access, the VMS system cannot be accessed. The user's access is restricted to only those ACMS functions that the user is permitted to invoke. This restriction prevents a user from damaging the integrity of data on the system. The ACMS monitor also allows access support for terminals that are automatically logged in to the ACMS system, such as a terminal on a shop floor. Such access is useful for unprivileged users who are not accustomed to computers. They can enter data without understanding the process for logging in to the system.

For application control, the ACMS monitor uses a protected directory, ACMS$DIRECTORY, to store the application definition files. The application authorization utility (AAU) ensures that special authorization is required for a user to make changes to an application.

In the ACMS monitor, the application is a single point of control. The ACMS/START APPLICATION and ACMS/STOP APPLICATION commands cause the execution controller for the application to be created and deleted. An operator can control the times when an application is accessible. For example, an application can be controlled to run only on Fridays or only between certain hours. The control of access times can also be used to restrict access while changes or repairs are made to the application. This type of access control is difficult to achieve with only the VMS system because the VMS system does not provide these capabilities.

The execution controller does access-control list checking that is specified for each task. This mechanism can restrict user access by function.
The control inventory control application or electronic fu nds of access times can also be used to restrict access processing application. Therefore, the applica­ while changes or repairs are made to the applica­ tions require additional controls to ensure that the tion. This type of access control is difficult to applications and the access by users to the appli­ achieve with only the VMS system because the VMS cations are limited to exactly what is needed fo r system does not provide these capabilities. the business. The execution controller does access-control list checking that is specified fo r each task. This ACMS Application Management mechanism can restrict user access by fu nction. Of the many features and tools for monitoring and For example, a user could have the privilege to controlling the system offered in the AG>IS moni­ make a particular update to a database but not have tor, three areas are most often used. access to read or make changes to any other parts

1 Winter 30 Vo l. j No. 199/ Digital Te cb11ical jo unwl Digital's Tra nsaction Processing Monitors

of that database. The execution controller achieves a much finer level of control than do the mechanisms of the VMS system or the database system.

DECintact Application Management

The DECintact monitor controls access to the whole system and to individual tasks by means of a security subsystem. The subsystem adds transaction-processing-specific features to basic VMS security.

• User security profiles specify the DECintact user name and password (DECintact users are not required to have an entry in the VMS SYSUAF file); levels of security entitlement; inclusive and exclusive hours of permissible sign-on; menu entries authorized for the user. Only one user under a given DECintact user name can be signed on to the DECintact system at any one time on any one node.

• Dedicated terminal security profiles are used, in conjunction with user security profiles, to provide geographic entitlement.

• CAPTIVE and INITIAL_MENU user attributes restrict users to a specific menu level of functions and prevent the user from accessing outer levels.

• User-specific menus are menu entries for which explicit authorization has been granted in the user profile and are the only menu items visible on the menu presented to terminal users. The DECintact monitor does include an exception for users who have an auditor privilege. Auditors can see all menu functions but must be specifically authorized to execute any single function.

• The subsystem provides the ability to dynamically enable or disable specific menu functions.

• Password revalidation is an attribute that can be associated with a menu function. If set, the user must reenter the DECintact user name and password before being allowed to access the function.

The DECintact monitor supports both controlled or dedicated terminals and terminals assigned LAT terminal server application ports, as does the ACMS monitor. These terminals are owned by, and allocated to, the DECintact system. When a user types any character at these terminals, a DECintact sign-on screen is displayed, and the user is prevented from logging in to the VMS system.

Geographic entitlement limits certain DECintact terminal-based functions to certain terminals or even to certain users on certain terminals. The three elements in geographic entitlement are as follows:

• The user security profile enables a function to be accessed by a certain user.

• The terminal security profile enables a function to be accessed at a certain terminal.

• A GEOG attribute is associated with a menu entry in the terminal manager/dispatcher's menu database. This attribute, when associated with a function, demands that there be an applicable terminal security profile before the function can be accessed.

Normally, if a function is enabled in a user profile, the user can access the function without further checks. If the GEOG attribute is associated with the function, however, that function must be enabled in the user profile and in the terminal profile before it can be accessed.

Geographic entitlement is frequently a requirement in financial environments which have specific and rigid security protocols. For example, a bank officer may be authorized to execute certain sensitive functions available only at dedicated terminals when the officer is signed in at the home office. The same officer may be authorized to execute only a subset of less sensitive functions when signed in from a branch office. Such sensitive functions can be protected by requiring that the user profile and the dedicated terminal profile enable the function.
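The GEOG rule just described amounts to a two-key check: the user profile must always enable the function, and the terminal profile must also enable it whenever the GEOG attribute is present. The following sketch illustrates that logic only; the profile dictionaries, field names, and function names are invented for the example and are not DECintact data structures.

    def can_access(function, user_profile, terminal_profile, menu_db):
        """Return True if this user at this terminal may invoke the menu function."""
        if function not in user_profile["enabled_functions"]:
            return False                     # never enabled for this user
        if "GEOG" not in menu_db.get(function, set()):
            return True                      # no geographic entitlement required
        # GEOG attribute set: an applicable terminal profile must also enable it.
        return (terminal_profile is not None
                and function in terminal_profile["enabled_functions"])

    menu_db = {"WIRE_TRANSFER": {"GEOG"}, "BALANCE_INQUIRY": set()}
    officer = {"enabled_functions": {"WIRE_TRANSFER", "BALANCE_INQUIRY"}}
    home_office_terminal = {"enabled_functions": {"WIRE_TRANSFER"}}

    print(can_access("WIRE_TRANSFER", officer, home_office_terminal, menu_db))  # True
    print(can_access("WIRE_TRANSFER", officer, None, menu_db))                  # False at a branch terminal
    print(can_access("BALANCE_INQUIRY", officer, None, menu_db))                # True anywhere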


Applications and resources are controlled within the context of a DECintact copy's run-time and management environment. Multiple copies can be established on the same VMS system. Different groups of users can maintain a certain level of autonomy (e.g., separate applications and data files), but all users can also share some or all functions and resources of a given DECintact version.

A typical example of this concept, that is, the ability to create multiple DECintact copies for isolation and partitioning, is the common practice of establishing development, acceptance testing, and production DECintact environments. Managing applications and resources within a development environment, for example, can differ from managing applications and resources within a production environment with a different system manager.

Access to menu functions is controlled by the INTACT MANAGE DISABLE/ENABLE command. This command removes or restores specified functions dynamically from all menus in the DECintact copy and disables or enables their selection by subsequent users. (Current accessors of the specified function are allowed to complete the function.) The execution of single- and multithreaded applications or DECintact system components can be shut down by the INTACT MANAGE SHUTDOWN command. This command issues a mailbox request to the application or component, which then initiates an orderly shutdown. Access to the system by inclusive and exclusive time of day is controlled on a per-user basis through the DECintact security subsystem. In addition to these commands and functions, the queuing subsystem is managed by means of a queue management utility. This utility creates and deletes queues and queue sets, modifies queue and queue set attributes, and performs all other functions necessary for managing the DECintact queuing subsystem.

In general, the DECintact monitor's security and application control focuses on the front end by concentrating access checking at the point of system sign-in and menu generation. The ACMS system concentrates more on the back-end parts of the system by means of VMS access control lists (ACL) on specified tasks. The ACMS approach is built on VMS security and system access (the SYSUAF file) and reflects an environment in which the VMS system and the transaction processing security functions are typically performed by the same system management agency. The DECintact monitor's system access is handled more independently of the VMS system and reflects an environment in which transaction-processing-specific security functions may be performed by a different department from those of the general VMS security system.

Conclusion

The ACMS and DECintact transaction processing monitors provide a unified set of transaction-processing-specific services to the application environment. A large functional overlap exists between the services each monitor provides. Where the functions provided by each monitor are identical or similar (e.g., client/server database access and support for DECforms), the factors that distinguish one from the other are primarily a result of the use of 4GL and 3GL application programming styles and interfaces. Where notable functional differences remain (as in each product's respective queuing or security systems), the differences are primarily ones of emphasis rather than functional incompatibility. The set of common features shared by both monitors has been growing with the latest releases of the ACMS and DECintact monitors. This external convergence has been fostered and made possible by an internal convergence, which is based on sharing the underlying code that supports the common features of each monitor. As more common features are introduced and enhanced in the DECtp system, the investment in applications built on either monitor can be protected and the distinctive programming styles of both can be preserved.

Reference

1. P. A. Bernstein, W. T. Emberton, and V. Trehan, "DECdta - Digital's Distributed Transaction Processing Architecture," Digital Technical Journal, vol. 3, no. 1 (Winter 1991, this issue): 10-17.

William A. Laing    James E. Johnson    Robert V. Landau

Transaction Management Support in the VMS Operating System Kernel

Distributed transaction management support is an enhancement to the VMS operating system. This support provides services in the VMS operating system for atomic transactions that may span multiple resource managers, such as those for flat files, network databases, and relational databases. These transactions may also be distributed across multiple nodes in a network, independent of the communications mechanisms used by either the application programs or the resource managers. The Digital distributed transaction manager (DECdtm) services implement an optimized variant of the two-phase commit protocol to ensure transaction atomicity. Additionally, these services take advantage of the unique VAXcluster capabilities to greatly reduce the potential for blocking that occurs with the traditional two-phase commit protocol. These features, now part of the VMS operating system, are readily available to multiple resource managers and to many applications outside the traditional transaction processing monitor environment.

Businesses are becoming critically dependent on the availability and integrity of data stored on computer systems. As these businesses expand and merge, they acquire ever greater amounts of on-line data, often on disparate computer systems and often in disparate databases. The Digital distributed transaction manager (DECdtm) services described in this paper address the problem of integrating data from multiple computer systems and multiple databases while maintaining data integrity under transaction control.

The DECdtm services are a set of transaction processing features embedded in the VMS operating system. These services support distributed atomic transactions and implement an optimized variant of the well-known, two-phase commit protocol.

Design Goals

Our overall design goal was to provide base services on which higher layers of software could be built. This software would support reliable and robust applications, while maintaining data integrity.

Many researchers report that an atomic transaction is a very powerful abstraction for building robust applications that consistently update data. Supporting such an abstraction makes it possible both to respond to partial failures and to maintain data consistency. Moreover, a simplifying abstraction is crucial when one is faced with the complexity of a distributed system.

With increasingly reliable hardware and the influx of more general-purpose, fault-tolerant systems, the focus on reliability has shifted from hardware to software.1 Recent discussions indicate that the key requirements for building systems with a 100-year mean time between failures may be (1) software-fault containment, using processes, and (2) software-fault masking, using process checkpointing and transactions.

It was clear that we could use transactions as a pervasive technique to increase application availability and data consistency. Further, we saw that this technique had merit in a general-purpose operating system that supports transaction processing, as well as timesharing, office automation, and technical computing.

The design of DECdtm services also reflects several other Digital and VMS design strategies:

• Pervasive availability and reliability. As organizations become increasingly dependent on their information systems, the need for all applications to be universally available and highly reliable increases. Features that ensure application availability and data integrity, such as journaling and two-phase commit, must be available to all applications, and not limited to those traditionally thought of as "transaction processing."


• Operating environment consistency. Embedding features in the operating system that are required by a broad range of utilities ensures consistency in two areas: first, in the functionality across all layered software products, and, second, in the interface for developers. For instance, if several products require the two-phase commit protocol, incorporating the protocol into the underlying system allows programmers to focus on providing "value-added" features for their products instead of re-creating a common routine or protocol.

• Flexibility and interoperability. Our vision includes making DECdtm interfaces available to any developer or customer, allowing a broad range of software products to take advantage of the VMS environment. Future DECdtm services are also being designed to conform to de facto and international standards for transaction processing, thereby ensuring that VMS applications can interoperate with applications on other vendors' systems.

Transaction Manager - Some Definitions

To grasp the concept of transaction manager, some basic terms must first be understood:

• Resource manager. A software entity that controls both the access and recovery of a resource. For example, a database manager serves as the resource manager for a database.

• Transaction. The execution of a set of operations with the properties of atomicity, serializability, and durability on recoverable resources.

• Atomicity. Either all the operations of a transaction complete, or the transaction has no effect at all.

• Serializability. All operations that executed for the transaction must appear to execute serially, with respect to every other transaction.

• Durability. The effects of operations that executed on behalf of the transaction are resilient to failures.

A transaction manager supports the transaction abstraction by providing the following services:

• Demarcation operations to start, commit, and abort a transaction

• Execution operations for resource managers to declare themselves part of a transaction and for transaction branch managers to declare the distribution of a transaction

• Two-phase commit operations for resource managers and other transaction managers to change the transaction state (to either "preparing" or "committing") or to acknowledge receipt of a request to change state

Benefits of Embedding Transaction Semantics in the Kernel

Several benefits are achieved by embedding transaction semantics in the kernel of the VMS operating system. Briefly, these benefits include consistency, interoperability, and flexibility. Embedding transaction semantics in the kernel makes a set of services available to different environments and products in a consistent manner. As a consequence, interoperability between products is encouraged, as well as investment in the development of "value-added" features. The inherent flexibility allows a programmer to choose a transaction processing monitor, such as VAX ACMS, and to access multiple databases anywhere in the network. The programmer may also write an application that reads a VAX DBMS CODASYL database, updates an Rdb/VMS relational database, and writes report records to a sequential VAX RMS file - all in a single transaction. Because all database and transaction processing products use DECdtm services, a failure at any point in the transaction causes all updates to be backed out and the files to be restored to their original state.

Two-phase Commit Protocol

DECdtm services use an optimized variant of the technique referred to as two-phase commit. The technique is a member of the class of protocols known as Atomic Commit Protocols. This class guarantees two outcomes: first, a single yes or no decision is reached among a distributed set of participants; and, second, this decision is consistently propagated to all participants, regardless of subsequent machine or communications failures. This guarantee is used in transaction processing to help achieve the atomicity property of a transaction.


The basic two-phase commit protocol is straightforward and well known. It has been the subject of considerable research and technical literature for several years. The following section describes in detail this general two-phase commit protocol for those who wish to have more information on the subject.

The Basic Two-phase Commit Protocol

The two-phase commit protocol occurs between two types of participants: one coordinator and one or more subordinates. The coordinator must arrive at a yes or no decision (typically called the "commit decision") and propagate that decision to all subordinates, regardless of any ensuing failures. Conversely, the subordinates must maintain certain guarantees (as described below) and must defer to the coordinator for the result of the commit decision. As the name suggests, two-phase commit occurs in two distinct phases, which the coordinator drives.

In the first phase, called the prepare phase, the coordinator issues "requests to prepare" to all subordinates. The subordinates then vote, either a "yes vote" or a "veto." Implicit in a "yes vote" is the guarantee that the subordinate will neither commit nor abort the transaction (decide yes or no) without an explicit order from the coordinator. This guarantee must be maintained despite any subsequent failures and usually requires the subordinate to place sufficient data on disk (prior to the "yes vote") to ensure that the operations can be either completed or backed out.

The second phase, called the commit phase, begins after the coordinator receives all expected votes. Based on the subordinate votes, the coordinator decides to commit if there are no "veto" votes; otherwise, it decides to abort. The coordinator propagates the decision to all subordinates as either an "order to commit" or an "order to abort."
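To make the two phases concrete, the following sketch simulates one coordinator and a set of subordinates in a single process. It is an illustration of the generic protocol just described, not DECdtm code; the class names, the in-memory log lists standing in for forced disk writes, and the vote logic are all invented for the example.

    COMMIT, ABORT = "commit", "abort"

    class Subordinate:
        def __init__(self, name, will_vote_yes=True):
            self.name = name
            self.will_vote_yes = will_vote_yes
            self.log = []                      # stands in for the node's disk log

        def request_to_prepare(self):
            if not self.will_vote_yes:
                return False                   # veto: no log record is needed
            self.log.append("prepare")         # forced write before the yes vote
            return True                        # yes vote: now bound to the coordinator

        def order(self, decision):
            self.log.append(decision)          # commit or abort record
            return "done"                      # acknowledgment lets the coordinator forget

    class Coordinator:
        def __init__(self, subordinates):
            self.subordinates = subordinates
            self.log = []

        def end_transaction(self):
            # Phase 1: collect votes from all subordinates.
            votes = [s.request_to_prepare() for s in self.subordinates]
            decision = COMMIT if all(votes) else ABORT
            self.log.append(decision)          # decision record must survive failures
            # Phase 2: propagate the decision and collect acknowledgments.
            acks = [s.order(decision) for s in self.subordinates]
            if all(a == "done" for a in acks):
                self.log.append("forget")      # storage for this transaction can be reclaimed
            return decision

    subs = [Subordinate("RM_1"), Subordinate("RM_2")]
    print(Coordinator(subs).end_transaction())   # -> "commit"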

[Figure 1: Simple Two-phase Commit Time Line. The application's END_TRANS call triggers the coordinator's request to prepare; the subordinate force-writes a "prepare" record and returns its yes vote; the coordinator force-writes a "commit" record (the commit point), notifies the application, and sends the order to commit; the subordinate lazily writes its "commit" record and replies "done"; both sides then lazily write "forget" records.]


Because the coordinator's decision must survive failures, a record of the decision is usually stored on disk before the orders are sent to the subordinates. When the subordinates complete processing, they send an acknowledgment back to the coordinator that they are "done." This allows the coordinator to reclaim disk storage from completed transactions. Figure 1 shows a time line of the two-phase commit sequence.

A subordinate node may also function as a superior (intermediate) node to follow-on subordinates. In such cases, there is a tree-structured relationship between the coordinator and the full set of subordinates. Intermediate nodes must propagate the messages down the tree and collect responses back up the tree. Figure 2 shows a time line for a two-phase commit sequence with an intermediate node.
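The intermediate-node arrangement can also be expressed as a small recursive sketch: an intermediate node forwards the request to prepare down its subtree, votes yes only if its whole subtree votes yes, and later relays the order it receives from above. The class and names below are hypothetical illustrations of that tree-structured message flow, not DECdtm interfaces.

    class Node:
        """A participant that may have subordinate participants of its own."""
        def __init__(self, name, children=(), votes_yes=True):
            self.name = name
            self.children = list(children)
            self.votes_yes = votes_yes

        def prepare(self):
            # Propagate the request down the tree; vote yes only if this node
            # and every descendant vote yes.
            return self.votes_yes and all(c.prepare() for c in self.children)

        def deliver(self, decision):
            # Relay the coordinator's order down the tree and return an acknowledgment.
            for c in self.children:
                c.deliver(decision)
            return "done"

    leaves = [Node("RM_B1"), Node("RM_B2")]
    intermediate = Node("NODE_B", children=leaves)
    coordinator = Node("NODE_A", children=[intermediate])

    decision = "commit" if coordinator.prepare() else "abort"
    coordinator.deliver(decision)
    print(decision)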

[Figure 2: Three-node Two-phase Commit Time Line. As in Figure 1, but an intermediate node relays the request to prepare to its subordinate, force-writes its own "prepare" record before passing the yes vote upward, relays the order to commit, and lazily writes its "commit" and "forget" records.]


Most of us have had direct contact with the two-phase commit protocol. It occurs in many activities. Consider the typical wedding ceremony as presented below, which is actually a very precise two-phase commit.

Official: Will you, Mary, take John ...?
Bride: I will.
Official: Will you, John, take Mary ...?
Groom: I will.
Official: I now pronounce you man and wife.

The above dialog can be viewed as a two-phase commit:

Coordinator: Request to Prepare?
Participant 1: Yes Vote.
Coordinator: Request to Prepare?
Participant 2: Yes Vote.
Coordinator: Commit Decision. Order to Commit.

The basic two-phase commit protocol is straightforward, survives failures, and produces a single, consistent yes or no decision. However, this protocol is rarely used in commercial products. Optimizations are often applied to minimize message exchanges and physical disk writes. These optimizations are particularly important to the transaction processing market because the market is very performance sensitive, and two-phase commit occurs after the application is complete. Thus, two-phase commit is reasonably considered an added overhead cost. We have endeavored to reduce the cost in a number of ways, resulting in low overhead and a scalable protocol embodied in the DECdtm services. Some of the optimizations are described later in another section.

Components of the DECdtm Services

The DECdtm services were developed as three separate components: a transaction manager, a log manager, and a communication manager. Together, these components provide support for distributed transaction management. The transaction manager is the central component. The log manager services enable the transaction manager to store data on nonvolatile storage. The communication manager provides a location-independent interprocess communication service used by the transaction and log managers. Figure 3 shows the relationships among these components.

The Digital Distributed Transaction Manager

As the central component of the DECdtm services, the transaction manager is responsible for the application interface to the DECdtm services. This section presents the system services the transaction manager comprises.

The transaction coordinator is the core of the transaction manager. It implements the transaction state machine and knows which resource managers and subordinate transaction managers are involved in a transaction. The coordinator also controls what is written to nonvolatile storage and manages the volatile list of active transactions.

The user services are routines that implement the START_TRANSACTION, END_TRANSACTION, and ABORT_TRANSACTION transaction system services.

[Figure 3: Components of the DECdtm Services. The transaction coordinator sits at the center, with a resource manager interface and registry, a volatile registry, a logging interface to hardened services, and a communication interface to remote DECdtm transaction managers; branch, user, information, and management services form the external interface.]


They validate user parameters, dispense a transaction identifier, pass state transition requests to the transaction coordinator, and return information about the transaction outcome.

The branch management services support the creation and demarcation of branches in the distributed transaction tree. New branches are constructed when subordinate application programs are invoked in a distributed environment. The services are called on to attach an application program to the transaction, to demarcate the work done in that application as part of the transaction, and finally to return information about the transaction outcome.

The resource manager services are routines that provide the interface between the DECdtm services and the cooperating resource managers. This interface allows resource managers to declare themselves to the transaction manager and to register their involvement in the "voting" stage of the two-phase commit process of a specific transaction.

Finally, the information services routines are the interface that allows resource managers to query and update transaction information stored by DECdtm services. This information is stored in either the volatile active transaction list or the nonvolatile transaction log. Resource managers may resolve and possibly modify the state of "in-doubt" transactions through these services.

The Log Manager

The log manager provides the transaction manager with an interface for storing sufficient information in nonvolatile storage to ensure that the outcome of a transaction can be consistently resolved. This interface is available to operating system components. The log manager also supports the creation, deletion, and general management of the transaction logs used by the transaction manager. An additional utility enables operators to examine transaction logs and, in extreme cases, makes it possible to change the state of any transaction.

The Communication Manager

The communication manager provides a command/response message-passing facility to the transaction manager and the log manager. The interface is specifically designed to offer high-performance, low-latency services to operating system components. The command/response, connection-oriented, message-passing system allows clients to exchange messages. The clients may reside on the same node, within the same cluster, or within a homogeneous VMS wide area network. The communication manager also provides highly optimized local (that is, intranode) and intracluster transports. In addition, this service component multiplexes communication links across a single, cached DECnet virtual circuit to improve the performance of creating and destroying wide area links.

Transaction Processing Model

Digital's transaction processing model entails the cooperation of several distinct elements for correct execution of a distributed transaction. These elements are (1) the application programmer, (2) the resource managers, (3) the integration of the DECdtm services into the VMS operating system, (4) transaction trees, and (5) vote-gathering and the final outcome.

Application Programmer

The application programmer must bracket a series of operations with START_TRANSACTION and END_TRANSACTION calls. This bracketing demarcates the unit of work that the system is to treat as a single atomic unit. The application programmer may call the DECdtm services to create the branches of the distributed transaction tree.

Resource Managers

Resource managers, such as VAX RMS, VAX Rdb/VMS, and VAX DBMS, that access recoverable resources during a transaction inform the DECdtm services of their involvement in the transaction. The resource managers can then participate in the voting phase and react appropriately to the decision on the final outcome of the transaction. Resource managers must also provide recovery mechanisms to restore resources they manage to a transaction-consistent state in the event of a failure.

Integration in the Operating System

The DECdtm services are a basic component of the VMS operating system. These services are responsible for maintaining the overall state of the distributed transaction and for ensuring that sufficient information is recorded on stable storage. Such information is essential in the event of a failure so that resource managers can obtain a consistent view of the outcome of transactions.

Each VMS node in a network normally contains one transaction manager object. This object maintains a list of participants in transactions that are active on the node. This list consists of resource managers local to the node and the transaction manager objects located on other nodes.


Transaction Trees

The node on which the transaction originated (that is, the node on which the START_TRANSACTION service was called) may be viewed as the "root" of a distributed transaction tree. The transaction manager object on this node is usually responsible for coordinating the transaction commit phase of the transaction. The transaction tree grows as applications call on the branch management services of the transaction manager object.

The transaction identifier dispensed by the START_TRANSACTION service is an input parameter to the branch services. This parameter identifies two concerns for the local transaction manager object: (1) to which transaction tree the new branch should be added, and (2) which transaction manager object is the immediate superior in the tree. Resource managers join specific branches in a transaction tree by calling the resource manager services of the local transaction manager object.

Vote-gathering and the Final Outcome

When the "commit" phase of the transaction is entered (triggered by an application call to END_TRANSACTION), each transaction manager object involved in the transaction must gather the "votes" of the locally registered resource managers and the subordinate transaction manager objects. The results are forwarded to the coordinating transaction manager object.

The coordinating transaction manager object eventually informs the locally registered resource managers and the subordinate transaction manager objects of the final outcome of the transaction. The subordinate transaction manager objects, in turn, propagate this information to locally registered resource managers as well as to any subordinate transaction manager objects.

Protocol Optimizations

The DECdtm services use several previously published optimizations and extend those optimizations with a number that are unique to VAXcluster systems. In this section we present these general optimizations, a discussion of VAXcluster considerations, and two VAXcluster-specific optimizations.

General Optimizations

The following sections describe some previously published optimizations.

Presumed Abort

DECdtm services use the "presumed abort" optimization. This optimization states that, if no information can be found for a transaction by the coordinator, the transaction aborts. This removes the need to write an abort decision to disk and to subsequently acknowledge the order to abort. In addition, subordinates that do not modify any data during the transaction (that is, they are "read only") avoid writing information to disk or participating in the commit phase.
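A sketch of the recovery-time consequence of presumed abort: when a participant asks about a transaction after a failure, the coordinator answers "abort" whenever it finds no record, so abort decisions never have to be forced to disk. The dictionary standing in for the coordinator's log and the transaction identifiers are, of course, only illustrative.

    def resolve_transaction(coordinator_log, tid):
        # Presumed abort: the absence of any log record for a transaction
        # identifier is itself the answer -- the transaction aborted.
        return coordinator_log.get(tid, "abort")

    log = {"txn-17": "commit"}                   # only commit decisions were forced to disk
    print(resolve_transaction(log, "txn-17"))    # -> "commit"
    print(resolve_transaction(log, "txn-99"))    # -> "abort" (no record was ever written)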
Lazy Commit Log Write

The DECdtm services can act as intermediate nodes in a distributed transaction. In this mode, they write a "prepare" record prior to responding with a "yes vote." They also write a "commit" record upon receipt of an order to commit. This latter record is written so that the coordinator need not be asked about the commit decision should the intermediate node fail. This refinement isolates the intermediate node's recovery from communication failures between it and the coordinator.

Performance is enhanced when the DECdtm services write the commit record on an intermediate node in a "nonurgent" or "lazy" manner.10 The lazy write buffers the information and waits for an urgent request to trigger the group commit timer to write the data to disk. Typically, this operation avoids a disk write at the intermediate node. The increase in the length of time before the commit record is written is negligible.
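One way to picture the lazy write is as a buffer that reaches disk either when an urgent (forced) write needs the log anyway or when a group commit timer expires. The class below is a simplified, assumption-laden sketch of that buffering idea; the flush interval, the flush() callback, and the exact trigger rules are placeholders rather than DECdtm parameters.

    import time

    class LazyLog:
        def __init__(self, flush, interval=0.1):
            self.flush = flush            # callback that actually writes to disk
            self.interval = interval      # group commit timer period (assumed value)
            self.buffer = []
            self.deadline = None

        def lazy_write(self, record):
            # Buffer the record; it reaches disk with the next urgent write or
            # when the group commit timer expires, whichever comes first.
            self.buffer.append(record)
            if self.deadline is None:
                self.deadline = time.monotonic() + self.interval

        def force_write(self, record):
            # An urgent write carries any buffered lazy records along with it.
            self.buffer.append(record)
            self._flush()

        def poll(self):
            # Called periodically; flushes when the group commit timer expires.
            if self.deadline is not None and time.monotonic() >= self.deadline:
                self._flush()

        def _flush(self):
            if self.buffer:
                self.flush(list(self.buffer))
                self.buffer.clear()
            self.deadline = None

    log = LazyLog(flush=lambda records: print("disk write:", records))
    log.lazy_write("commit txn-17")       # buffered, no disk I/O yet
    log.force_write("prepare txn-18")     # one disk write covers both records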

One-phase Commit

A key consideration in the design of the DECdtm services was to incur minimal impact on the performance of Digital's database products. We exploited two attributes to achieve this goal. First, all current users are limited to non-distributed transactions (those that involve only a single subordinate). Second, the two-phase commit protocol requires that all subordinates respond with a "yes vote" to commit the transaction. This allows a highly optimized path for single subordinate transactions. Such transactions require no writes to disk by the DECdtm services and execute in one phase. The subordinate is told that it is the only voting party in the transaction and, if it is willing to respond with a "yes vote," it should proceed and perform its order to commit processing.
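The single-subordinate shortcut can be expressed as a simple branch in the commit path: with exactly one voting participant, the coordinator writes nothing and delegates the decision. The sketch below is a hypothetical illustration in the spirit of the earlier two-phase example, not the DECdtm implementation.

    class Subordinate:
        def __init__(self, vote_yes=True):
            self.vote_yes = vote_yes
            self.log = []

        def request_to_prepare(self):
            if self.vote_yes:
                self.log.append("prepare")      # forced write needed on the two-phase path
            return self.vote_yes

        def order(self, decision):
            self.log.append(decision)

        def sole_voter_commit(self):
            # One-phase path: the subordinate knows it is the only voter, so a
            # willing "yes" becomes an immediate commit in its own log.
            if self.vote_yes:
                self.log.append("commit")
            return self.vote_yes

    def end_transaction(subordinates, coordinator_log):
        if len(subordinates) == 1:
            # Non-distributed transaction: no coordinator log writes, one phase.
            return "commit" if subordinates[0].sole_voter_commit() else "abort"
        decision = "commit" if all(s.request_to_prepare() for s in subordinates) else "abort"
        coordinator_log.append(decision)        # decision record only on the two-phase path
        for s in subordinates:
            s.order(decision)
        return decision

    print(end_transaction([Subordinate()], coordinator_log=[]))                 # one-phase commit
    print(end_transaction([Subordinate(), Subordinate()], coordinator_log=[]))  # full two-phase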


VAXcluster Considerations

The optimizations listed above (and others not described here) provide the DECdtm services with a competitive two-phase commit protocol. VAXcluster technology, though, offers other untapped potential. VAXcluster systems offer several unique features, in particular, the guarantee against partitioning, the distributed lock manager, and the ability to share disk access between CPUs.11 Within a VAXcluster system, use of these unique features allows the DECdtm services to avoid a blocked condition which occurs during the short period of time when a subordinate node responds with a "yes vote" and communication with its coordinator is lost. Normally, the subordinate is unable to proceed with that transaction's commit until communications have been restored.

Outside a VAXcluster system, the DECdtm services would indeed be blocked. If, however, the subordinate and its coordinator are in the same VAXcluster system, this will not occur. If communication is lost, a subordinate node knows, as a result of the guarantee against partitioning, that its coordinator has failed. Because a subordinate node can access the transaction log of the failed coordinator, it may immediately "host" its failed coordinator's recovery. Communications to the hosted coordinator are quickly restored, and the subordinate node is able to complete the transaction commit.

VAXcluster-specific Optimizations

Once the blocking potential was removed from intra-VAXcluster transactions, several additional protocol optimizations became practical. The optimizations described in this section are dynamically enabled if the subordinate and its coordinator are both in the same VAXcluster system.

Early Prepare Log Write

As mentioned earlier, an intermediate node must write a "prepare" record prior to responding with a "yes vote." The presence of this record in an intermediate node's log indicates that the node must get the outcome of the transaction from the coordinator and, thus, it is subject to blocking. Therefore, the prepare record is typically written after all the expected votes are returned, which adds to commit-time latency.

The DECdtm services are free from blocking concerns within a VAXcluster system; the vast majority of transactions do commit. This factor prompted an optimization that writes a prepare record while simultaneously collecting the subordinate votes. This reduces commit-time latency.

No Commit Log Write

The lazy commit log write optimization described above causes the intermediate node's commit record to be written and, thus, minimizes the potential for blocking should the intermediate node fail. Note that this is not a concern for the intra-VAXcluster case. Therefore, no commit record is written at the intermediate node.
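A hypothetical way to picture the early prepare log write described above: the intermediate node starts its forced prepare-record write and the vote collection at the same time, and only answers "yes" once both have finished, rather than serializing the two. The thread pool, the sleep-based stubs, and the timings below are illustrative assumptions, not the DECdtm mechanism.

    import time
    from concurrent.futures import ThreadPoolExecutor

    def force_write_prepare_record():
        time.sleep(0.05)                 # stands in for a forced disk write
        return True

    def collect_vote(subordinate_id):
        time.sleep(0.05)                 # stands in for a message round trip
        return True                      # yes vote

    def vote_with_early_prepare(subordinate_ids):
        # Start the prepare-record write while the votes are still being
        # collected, instead of waiting for all votes before forcing the record.
        with ThreadPoolExecutor() as pool:
            prepare = pool.submit(force_write_prepare_record)
            votes = [pool.submit(collect_vote, sid) for sid in subordinate_ids]
            all_yes = all(v.result() for v in votes)
            prepare.result()             # the record is on disk before we answer
        return all_yes                   # safe to send our own "yes vote" upward

    start = time.monotonic()
    print(vote_with_early_prepare(["RM_1", "RM_2"]), round(time.monotonic() - start, 2))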

Performance Evaluation

Table 1 describes the message and log write costs of the DECdtm services protocol and compares it to the basic two-phase commit protocol, as well as to the standard presumed abort variant previously described.

Table 1  Logging and Message Cost by Two-phase Commit (2PC) Protocol Variant

                           Coordinator                 Intermediate
                           Log Write      Message      Log Write      Message
    Basic 2PC:             2, 1 forced    2N           2, 2 forced    2
    Presumed Abort:        2, 1 forced    2N           2, 2 forced    2
      (RO intermediate)    2, 1 forced    1N           0              1
    Normal DECdtm:         2, 1 forced    2N           2, 1 forced    2
      (RO intermediate)    2, 1 forced    1N           0              1
    Intracluster:          2, 1 forced    2N           1, 1 forced*   2
      (RO intermediate)    2, 1 forced    1N           0              1
    DECdtm 1PC:            0

Notes:

Log writes are total writes, forced. The table entry "2, 1 forced" means that there are two total log writes, one of which is forced. A forced write must complete before the protocol makes a transition to the next state.

RO means Read Only.

Where a message is listed as xN, N represents the number of intermediates that fit that category.

* In this instance, forced means that the log write is initiated optimistically; thus, it has lower latency.

Ease-of-use Evaluation

A primary goal in providing transaction processing primitives within the VMS kernel was to supply many disparate applications with a straightforward interface to distributed transaction management. This contrasts with most commercially available systems, where distributed transaction management functionality is available only from a transaction processing monitor. This latter form restricts the functionality to applications written to execute under the control of the transaction processing monitor, and it effectively precludes other applications from making use of the technology.

From the outset of development, we endeavored to provide an interface that was suitable for as many applications as possible. We made early versions of the DECdtm services available within Digital to decrease the "time to market" for software products that wished to exploit distributed transaction processing technology. As of July 1990, at least seven Digital software products have been modified to use the DECdtm services. These products are VAX Rdb/VMS, VAX DBMS, VAX RMS Journaling, VAX ACMS, DECintact, VAX RALLY, and VAX SQL.

In general, the modifications to these products have been relatively minor, as might be inferred from the short time it took to make the required changes. Based on this experience, we expect third-party software vendors to rapidly take advantage of the DECdtm services as they become available as part of the standard VMS operating system.

To incorporate the DECdtm services into a recoverable resource manager, the existing internal transaction management module must be replaced with calls to the DECdtm services. The resource manager must also be modified to correctly respond to the prepare and commit callbacks by the DECdtm services. Further, the recovery logic of the resource manager must be modified to obtain from the DECdtm services the state of "in doubt" transactions.

Example of DECdtm Usage

The model and pseudocode shown in Figures 4a and 4b illustrate the use of DECdtm services in a simple example of a distributed transaction. The transaction spans two nodes, NODE_A and NODE_B, in a VMS network. During the course of the transaction, recoverable resources managed by resource managers, RM_A and RM_B, are modified. Two "application" programs, APPL_A and APPL_B, that run on NODE_A and NODE_B, respectively, make normal procedural calls to RM_A and RM_B. APPL_A

[Figure 4a: Model Illustrating the Use of DECdtm Services. NODE_A and NODE_B each run DECdtm with branch, user, and resource manager services; APPL_A calls RM_A and communicates with APPL_B over an IPC connection, and APPL_B calls RM_B. Key: IPC connection, procedure call, RPC, system service call; RM = resource manager, APPL = application.]


PROGRAM APPL_A

    ! Establish communications with remote application
    IPC_LINK (node="NODE_B", application="APPL_B", link=link_id);

    ! Exchange transaction manager names
    LIB$GETJPI (JPI$_COMMIT_DOMAIN,,,my_cd);
    IPC_TRANSCEIVE (link=link_id, send_data=my_cd, receive_data=your_cd);

    ! Start a transaction
    $START_TRANSW (iosb=status, tid=tid);

    ! Make a procedural call to RM_A to perform an operation
    RM_A (tid, requested_operation);

    ! Now create a transaction branch for the remote application
    $ADD_BRANCHW (iosb=status, tid=tid, branch=bid, cd_name=your_cd);

    ! Ask APPL_B to do something as part of this transaction
    IPC_TRANSCEIVE (link=link_id, send_data=(tid, bid, data), receive_data=status);

    ! And end the transaction
    $END_TRANSW (iosb=status, tid=tid);

PROGRAM APPL_B (link_id)

    ! Exchange transaction manager names
    IPC_RECEIVE (link=link_id, data=sup_cd);
    LIB$GETJPI (JPI$_COMMIT_DOMAIN,,,my_cd);
    IPC_REPLY (link=link_id, data=my_cd);

    ! Now we execute transaction requests
    loop;
        IPC_RECEIVE (link=link_id, data=(tid, bid, data));

        ! Start the transaction branch created by APPL_A
        $START_BRANCHW (iosb=status, tid=tid, branch=bid, cd_name=sup_cd);

        ! Make a procedural call to RM_B to perform an operation
        RM_B (tid, requested_operation);

        ! Tell APPL_A we are done
        IPC_REPLY (link=link_id, data=SS$_NORMAL);

        ! Declare that we are finished for this transaction and wait for it to complete
        $READY_TO_COMMITW (iosb=status, tid=tid);
    end_loop;

ROUTINE RM_A (tid, requested_operation)

    ! If this is the first operation, register with DECdtm services as a
    ! resource manager. As part of the registration we declare an event
    ! routine that will be called during the voting process.
    if first time we've been called then
        $DECLARE_RMW (iosb=status, name="RM_A", evtrtn=RM_A_EVENT, rm_id=rm_handle);

    ! Inform DECdtm services of our interest in this transaction
    if tid has not previously been seen then
        $JOIN_RMW (iosb=status, rm_id=rm_handle, tid=tid, part_id=participant);

    ! Perform the requested operation
    DO_OPERATION (requested_operation);
    RETURN

ROUTINE RM_A_EVENT (event_block)

    ! Select action from the DECdtm services event type
    CASE event_block.DDTM$L_OPTYPE FROM ... TO ...

        ! Do "request to prepare" processing
        [DDTM$K_PREPARE]:
            DO_PREPARE_ACTIVITY (result=status, tid=event_block.DDTM$A_TID);

        ! Do "order to commit" processing
        [DDTM$K_COMMIT]:
            DO_COMMIT_ACTIVITY (result=status, tid=event_block.DDTM$A_TID);

        ! Do "order to abort" processing
        [DDTM$K_ABORT]:
            DO_ABORT_ACTIVITY (result=status, tid=event_block.DDTM$A_TID);

    ESAC;

    ! Inform the DECdtm services of the final status of our event processing
    $FINISH_RMOPW (iosb=iosb, part_id=event_block.DDTM$L_PART_ID, retsts=status);
    RETURN

Figure 4b  Pseudocode Illustrating the Use of DECdtm Services

and APPL_B use an interprocess communication mechanism to communicate information across the network. The DECdtm service calls are prefixed with a dollar sign ($).

The code for the resource managers, RM_A and RM_B, is identical with respect to calls for the DECdtm services. The resource manager routine, ROUTINE RM_A_EVENT, is invoked by the DECdtm services during transaction state transitions.

Conclusions

The addition of a distributed transaction manager to the kernel of the general-purpose VMS operating system makes distributed transactions available to a wide spectrum of applications.


This design and implementation was accomplished with comparative ease and with quality performance. In addition to utilizing the most commonly described optimizations of the two-phase commit protocol, we have used optimizations that exploit some of the unique benefits of the VAXcluster system.

Acknowledgments

We wish to gratefully acknowledge the contributions of all the transaction processing architects involved, and in particular Vijay Trehan, for delivering to us an understandable and implementable architecture. We also extend our thanks to Phil Bernstein for his encouragement and advice, and to our initial users, Bill Wright, Peter Spiro, and Lenny Szubowicz, for their persistence and good nature.

Finally, and most importantly, we would like to thank all the DECdtm development engineers and the others who helped ship the product: Stuart Bayley, Cathy Foley, Mike Grossmith, Tom Harding, Tony Hasler, Mark Howell, Dave Marsh, Julian Palmer, Kevin Playford, and Chris Whitaker.

References

1. R. Haskin, Y. Malachi, W. Sawdon, and G. Chan, "Recovery Management in QuickSilver," ACM Transactions on Computer Systems, vol. 6, no. 1 (February 1988).

2. A. Spector et al., Camelot: A Distributed Transaction Facility for Mach and the Internet - An Interim Report (Pittsburgh: Carnegie Mellon University, Department of Computer Science, June 1987).

3. W. Bruckert, C. Alonso, and J. Melvin, "Verification of the First Fault-tolerant VAX System," Digital Technical Journal, vol. 3, no. 1 (Winter 1991, this issue): 79-85.

4. J. Gray, "A Census of Tandem System Availability between 1985 and 1990," Tandem Technical Report 90.1, part no. 33579 (January 1990).

5. P. Bernstein, V. Hadzilacos, and N. Goodman, Concurrency Control and Recovery in Database Systems (Reading, MA: Addison-Wesley, 1987).

6. J. Gray, "Notes on Database Operating Systems," in Operating Systems: An Advanced Course (Berlin: Springer-Verlag, 1978).

7. B. Lampson, "Atomic Transactions," in Distributed Systems - Architecture and Implementation: An Advanced Course, edited by G. Goos and J. Hartmanis (Berlin: Springer-Verlag, 1981).

8. C. Mohan, B. Lindsay, and R. Obermarck, "Transaction Management in the R* Distributed Database Management System," ACM Transactions on Computer Systems, vol. 11, no. 4 (December 1986).

9. C. Mohan and B. Lindsay, "Efficient Commit Protocol for the Tree of Processes Model of Distributed Transactions," Proceedings of the 2nd ACM SIGACT/SIGOPS Symposium on Principles of Distributed Computing (Montreal, August 1983).

10. D. Duchamp, "Analysis of Transaction Management Performance," Proceedings of the Twelfth ACM Symposium on Operating Systems Principles (Special issue), vol. 23, no. 5 (December 1989): 177-190.

11. N. Kronenberg, H. Levy, and W. Strecker, "VAXclusters: A Closely-Coupled Distributed System," ACM Transactions on Computer Systems, vol. 4, no. 2 (May 1986).

Walter H. Kohler    Yun-Ping Hsu    Thomas K. Rogers    Wael H. Bahaa-El-Din

Performance Evaluation of Transaction Processing Systems

Performance and price/performance are important attributes to consider when evaluating a transaction processing system. Two major approaches to performance evaluation are measurement and modeling. TPC Benchmark A is an industry standard benchmark for measuring a transaction processing system's performance and price/performance. Digital has implemented TPC Benchmark A in a distributed transaction processing environment. Benchmark measurements were performed on the VAX 9000 Model 210 and the VAX 4000 Model 300 systems. Further, a comprehensive analytical model was developed and customized to model the performance behavior of TPC Benchmark A on Digital's transaction processing platforms. This model was validated using measurement results and has proven to be an accurate performance prediction tool.

Transaction processing systems are complex in nature and are usually characterized by a large number of interactive terminals and users, a large volume of on-line data and storage devices, and a high volume of concurrent and shared database accesses. Transaction processing systems require layers of software components and hardware devices to work in concert. Performance and price/performance are two important attributes for customers to consider when selecting transaction processing systems. Performance is important because transaction processing systems are frequently used to operate the customer's business or handle mission-critical tasks. Therefore, a certain level of throughput and response time guarantee are required from the systems during normal operation. Price/performance is the total system and maintenance cost in dollars, normalized by the performance metric.

The performance of a transaction processing system is often measured by its throughput in transactions per second (TPS) that satisfies a response time constraint. For example, 90 percent of the transactions must have a response time that is less than 2 seconds. This throughput, qualified by the associated response time constraint, is called the maximum qualified throughput (MQTh). In a transaction processing environment, the most meaningful response time definition is the end-to-end response time, i.e., the response time observed by a user at a terminal. The end-to-end response time represents the time required by all components that compose the transaction processing system.

The two major approaches used for evaluating transaction processing system performance are measurement and modeling. The measurement approach is the most realistic way of evaluating the performance of a system. Performance measurement results from standard benchmarks have been the most accepted form of performance assessment of transaction processing systems. However, due to the complexity of transaction processing systems, such measurements are usually very expensive, very time-consuming, and difficult to perform.

Modeling uses simulation or analytical modeling techniques. Compared to the measurement approach, modeling makes it easier to produce results and requires less computing resources. Performance models are also flexible. Models can be used to answer "what-if" types of questions and to provide insights into the complex performance behavior of transaction processing systems, which is difficult (if not impossible) to observe in the measurement environment. Performance models are widely used in research and engineering communities to provide valuable analysis of design alternatives, architecture evaluation, and capacity planning.


Simplifying assumptions are usually made in the modeling approach. Therefore, performance models require validation, through detailed simulation or measurement, before predictions from the models are accepted.

This paper presents Digital's benchmark measurement and modeling approaches to transaction processing system performance evaluation. The paper includes an overview of the current industry standard transaction processing benchmark, the TPC Benchmark A, and a description of Digital's implementation of the benchmark, including the distinguishing features of the implementation and the benchmark methodology. The performance measurement results that were achieved by using the TPC Benchmark A are also presented. Finally, a multilevel analytical model of the performance behavior of transaction processing systems with response time constraints is presented and validated against measurement results.

TPC Benchmark A - An Overview

The TPC Benchmark A simulates a simple banking environment and exercises key components of the system under test (SUT) by using a simple, update-intensive transaction type. The benchmark is intended to simulate a class of transaction processing application environments, not the entire range of transaction processing environments. Nevertheless, the single transaction type specified by the TPC Benchmark A standard provides a simple and repeatable unit of work.

The benchmark can be run in either a local area network (LAN) or a wide area network (WAN) configuration. The related throughput metrics are tpsA-Local and tpsA-Wide, respectively. The benchmark specification defines the general application requirements and scaling rules, testing and pricing guidelines, full disclosure report requirements, and an audit checklist. The following sections provide an overview of the benchmark.

Application Environment

The TPC Benchmark A workload is patterned after a simplified banking application. In this model, the bank contains one or more branches. Each branch has 10 tellers and 100,000 customer accounts. A transaction occurs when a teller enters a deposit or a withdrawal for a customer against an account at a branch location. Each teller enters transactions at an average rate of one every 10 seconds. Figure 1 illustrates this simplified banking environment.

Transaction Logic

The transaction logic of the TPC Benchmark A workload can be described in terms of the bank environment shown in Figure 1. A teller deposits in or withdraws money from an account, updates the current cash position of the teller and branch, and makes an entry of the transaction in a history file. The pseudocode shown in Figure 2 represents the transaction.

[Figure 1: TPC Benchmark A Banking Environment. One or more branches, each with 100,000 accounts, 10 tellers, and their customers.]


Read 100 bytes including Bid, Tid, Aid, Delta from terminal
BEGIN TRANSACTION
    Update Account where Account_ID = Aid:
        Read Account_Balance from Account
        Set Account_Balance = Account_Balance + Delta
        Write Account_Balance to Account
    Write to History: Aid, Tid, Bid, Delta, Time_Stamp
    Update Teller where Teller_ID = Tid:
        Set Teller_Balance = Teller_Balance + Delta
        Write Teller_Balance to Teller
    Update Branch where Branch_ID = Bid:
        Set Branch_Balance = Branch_Balance + Delta
        Write Branch_Balance to Branch
COMMIT TRANSACTION
Write 200 bytes including Aid, Tid, Delta, Account_Balance to terminal

Figure 2  TPC Benchmark A Transaction Pseudocode
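As a concrete rendering of the transaction profile in Figure 2, the sketch below expresses the same steps with Python's built-in sqlite3 module. The schema, table names, and in-memory database are conveniences invented for the example; they are not part of the benchmark implementation described in this paper, which uses VAX Rdb/VMS.

    import sqlite3
    import time

    def setup(conn):
        # Minimal schema for the four benchmark files/tables, with one row each.
        conn.executescript("""
            CREATE TABLE branch  (branch_id INTEGER PRIMARY KEY, balance INTEGER);
            CREATE TABLE teller  (teller_id INTEGER PRIMARY KEY, branch_id INTEGER, balance INTEGER);
            CREATE TABLE account (account_id INTEGER PRIMARY KEY, branch_id INTEGER, balance INTEGER);
            CREATE TABLE history (account_id INTEGER, teller_id INTEGER, branch_id INTEGER,
                                  amount INTEGER, time_stamp REAL);
            INSERT INTO branch  VALUES (1, 0);
            INSERT INTO teller  VALUES (1, 1, 0);
            INSERT INTO account VALUES (1, 1, 0);
        """)

    def tpca_transaction(conn, aid, tid, bid, delta):
        """One deposit or withdrawal, applied atomically as in Figure 2."""
        try:
            conn.execute("BEGIN")
            conn.execute("UPDATE account SET balance = balance + ? WHERE account_id = ?", (delta, aid))
            (balance,) = conn.execute(
                "SELECT balance FROM account WHERE account_id = ?", (aid,)).fetchone()
            conn.execute("INSERT INTO history VALUES (?, ?, ?, ?, ?)",
                         (aid, tid, bid, delta, time.time()))
            conn.execute("UPDATE teller SET balance = balance + ? WHERE teller_id = ?", (delta, tid))
            conn.execute("UPDATE branch SET balance = balance + ? WHERE branch_id = ?", (delta, bid))
            conn.execute("COMMIT")
            return balance                       # echoed back to the terminal in the benchmark
        except Exception:
            conn.execute("ROLLBACK")
            raise

    conn = sqlite3.connect(":memory:", isolation_level=None)   # explicit BEGIN/COMMIT
    setup(conn)
    print(tpca_transaction(conn, aid=1, tid=1, bid=1, delta=100))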

Terminal Communication

For each transaction, the originating terminal is required to transmit data to, and receive data from, the system under test. The data sent to the system under test must consist of at least 100 alphanumeric data bytes, organized as at least four distinct fields: Account_ID, Teller_ID, Branch_ID, and Delta. The Branch_ID identifies the branch where the teller is located. The Delta is the amount to be credited to, or debited from, the specified account. The data received from the system under test consists of at least 200 data bytes, organized as the above four input fields and the Account_Balance that results from the successful commit operation of the transaction.

Implementation Constraints

The TPC Benchmark A imposes several conditions on the test environment.

• The transaction processing system must support atomicity, consistency, isolation, and durability (ACID) properties during the test.

• The tested system must preserve the effects of committed transactions and ensure database consistency after recovering from

  - The failure of a single durable medium that contains database or recovery log data
  - The crash and reboot of the system
  - The loss of all or part of memory

• Eighty-five percent of the accounts processed by a teller must belong to the home branch (the one to which the teller belongs). Fifteen percent of the accounts processed by a teller must be owned by a remote branch (one to which the teller does not belong). Accounts must be uniformly distributed and randomly selected.

Database Design

The database consists of four individual files/tables: Branch, Teller, Account, and History, as defined in Table 1. The overall size of the database is determined by the throughput capacity of the system. Ten tellers, each entering transactions at an average rate of one transaction every 10 seconds, generate what is defined as a one-TPS load. Therefore, each teller contributes one-tenth (1/10) TPS. The history area must be large enough to store the history records generated during 90 eight-hour days of operation at the published system TPS capacity. For a system that has a processing capacity of x TPS, the database is sized as shown in Table 2.

For example, to process 20 TPS, a system must use a database that includes 20 branch records, 200 teller records, and 2,000,000 account records. Because each teller uses a terminal, the price of the system must include 200 terminals. A test that results in a higher TPS rate is invalid unless the size of the database and the number of terminals are increased proportionately.
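The scaling rules above reduce to simple arithmetic. The small helper below, with an invented function name and dictionary keys, computes the database population and terminal count that a claimed TPS rating implies.

    def tpca_sizing(tps):
        """Database population and terminal count implied by a claimed TPS rating."""
        return {
            "branch_records": 1 * tps,
            "teller_records": 10 * tps,
            "account_records": 100_000 * tps,
            "history_records": 2_592_000 * tps,   # 90 eight-hour days at one txn per teller per 10 s
            "terminals": 10 * tps,                # one terminal per teller
        }

    print(tpca_sizing(20))
    # {'branch_records': 20, 'teller_records': 200, 'account_records': 2000000,
    #  'history_records': 51840000, 'terminals': 200}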


Table 1  Database Entities

    Record     Bytes    Fields Required     Description
    Branch     100      Branch_ID           Identifies the branch across the range of branches
                        Branch_Balance      Contains the branch's current cash balance
    Teller     100      Teller_ID           Identifies the teller across the range of tellers
                        Branch_ID           Identifies the branch where the teller is located
                        Teller_Balance      Contains the teller's current cash balance
    Account    100      Account_ID          Identifies the customer account uniquely for the entire database
                        Branch_ID           Identifies the branch where the account is held
                        Account_Balance     Contains the account's current cash balance
    History    50       Account_ID          Identifies the account updated by the transaction
                        Teller_ID           Identifies the teller involved in the transaction
                        Branch_ID           Identifies the branch associated with the teller
                        Amount              Contains the amount of credit or debit (delta) specified by the transaction
                        Time_Stamp          Contains the date and time taken between the BEGIN TRANSACTION and COMMIT TRANSACTION statements

Table 2  Database Sizing

    Number of Records    Record Type
    1 x X                Branch records
    10 x X               Teller records
    100,000 x X          Account records
    2,592,000 x X        History records

Benchmark Metrics

TPC Benchmark A uses two basic metrics:

• Transactions per second (TPS) - throughput in TPS, subject to a response time constraint, i.e., the MQTh, measured while the system is in a sustainable steady-state condition.

• Price per TPS (K$/TPS) - the purchase price and five-year maintenance costs associated with one TPS.

Transactions per Second

To guarantee that the tested system provides fast response to on-line users, the TPC Benchmark A imposes a specific response time constraint on the benchmark. Ninety percent of all transactions must have a response time of less than two seconds. The TPC Benchmark A standard defines transaction response time as the time interval between the transmission from the terminal of the first byte of the input message to the system under test and the arrival at the terminal of the last byte of the output message from the system under test.

The reported TPS is the total number of committed transactions that both started and completed during an interval of steady-state performance, divided by the elapsed time of the interval. The steady-state measurement interval must be at least 15 minutes, and 90 percent of the transactions must have a response time of less than 2 seconds.

Price per TPS

The K$/TPS price/performance metric measures the total system price in thousands of dollars, normalized by the TPS rating of the system. The priced system includes all the components that a customer requires to achieve the reported performance level and is defined by the TPC Benchmark A standard as

• Price of the system under test, including all hardware, software, and maintenance for five years.

• Price of the terminals and network components, and their maintenance for five years.

• Price of on-line storage for 90 days of history records at the published TPS rate, which amounts to 2,592,000 records per TPS. A storage medium is considered to be on-line if any record can be accessed randomly within one second.

• Price of additional products required for the operation, administration, or maintenance of the priced systems.

• Price of products required for application development.

All hardware and software used in the tested configuration must be announced and generally available to customers.
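The reported-TPS, response-time, and price/performance rules above can be checked mechanically. The function below is a sketch with invented parameter names and arbitrary example inputs: it takes the response times of the committed transactions observed in a steady-state interval, decides whether the run qualifies, and derives K$/TPS from a total five-year price.

    def tpc_a_report(response_times_s, interval_s, total_price_dollars):
        """response_times_s: response times (seconds) of transactions that both
        started and completed inside a steady-state interval of interval_s seconds."""
        if interval_s < 15 * 60:
            raise ValueError("steady-state interval must be at least 15 minutes")
        tps = len(response_times_s) / interval_s
        # 90 percent of transactions must respond in under 2 seconds.
        frac_under_2s = sum(1 for t in response_times_s if t < 2.0) / len(response_times_s)
        qualifies = frac_under_2s >= 0.90
        price_per_tps_k = (total_price_dollars / 1000.0) / tps
        return tps, frac_under_2s, qualifies, price_per_tps_k

    times = [0.4] * 950 + [2.5] * 50        # 1,000 committed transactions (example values only)
    print(tpc_a_report(times, interval_s=900, total_price_dollars=3_000_000))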


TPC Benchmark A Implementation

Digital's implementation of the TPC Benchmark A goes beyond the minimum requirements of the TPC Benchmark A standard and uses Digital's distributed approach to transaction processing. For example, Digital's TPC Benchmark A implementation includes forms management and transaction processing monitor software that are required in most real transaction processing environments but are not required by the benchmark. The following sections provide an overview of Digital's approach and implementation.

Transaction Processing Software Environment

The three basic functions of a general-purpose transaction processing system are the user interface (forms processing), applications management, and database management. Digital has developed a distributed transaction architecture (DECdta) to define how the major functions are partitioned and supported by components that fit together to form a complete transaction processing system. Table 3 shows the software components in a typical Digital transaction processing environment.

Table 3  Transaction Processing Software Components

    Component           Example
    Operating system    VMS
    Communications      LAT, DECnet
    Database            VAX Rdb/VMS
    TP monitor          VAX ACMS, DECintact
    Forms               DECforms
    Application         COBOL

Distributed Transaction Processing Approach

Digital transaction processing systems can be distributed by placing one or more of the basic system functions (i.e., user interface, application manager, database manager) on separate computers. In the simplest form of a distributed transaction processing system, the user interface component runs on a front-end processor, and the application and database components run on a back-end processor. The configuration allows terminal and forms management to be performed at a remote location, whereas the application is processed at a central location. The Digital transaction processing software components are separable because their clearly defined interfaces can be layered transparently onto a network. How these components may be partitioned in the Digital distributed transaction processing environment is illustrated in Figure 3.

- FORMS FRONT END PROCESSORS TP MONITOR OPERATING SYSTEM COMMUNICATIONS

- BACK END 1 -A P_ P_ L_ICA_T_I O_N -1 PROCESSORS -- _ _ TP MONITOR DATABASE DATABASE STORAGE OPERATING SYSTEM

COMMUNICATIONS

Figure 3 Distributed Transaction Processing Environment

Digital Te chnical jo urnal Vo l. 3 No. I Winter 1')91 49 Transaction Processing, Databases, and Fault-tolerant Systems

software components shown in Figure 3. The user interface component runs on one or more front­ end processors, whereas the application and database components run on one or more back­ end processors. Transactions are entered from teller terminals, which communicate with the front-end processors. The front-end processors then communicate with the back-end processors to invoke the application servers and perform database operations. The communications can DATABASE take place over either a local area or a wide area Figure Te st Sys tem Configuration network. However, to simplify testing, the TPC 4 Benchmark A standard allows sponsors to use TPCBenc hmark A Results remote terminal emulators (RTEs) rather than real terminals. Therefore, the TPC Benchmark A tests We now present the results of two TPC Benchmark base performance and price/performance results A tests based on audited benchmark experiments on two distinctly configured systems, the target performed on the VAX 9000 Model 210 and the system and the test system. VAX 4000 Model 300 systems.' ' These two systems The target system is the configuration of hard­ are representative of Digital's large and small trans­ ware and software components that customers can action processing platforms. The benchmark was use to perform transaction processing. With the implemented using the VAX ACMS transaction pro­ Digital distributed transaction processing approach, cessing monitor, the VAX Rdb/YMS relational data­ user terminals initiate transactions and communi­ base management system, and the DECforms fo rms cate with the front-end processors. Front-end pro­ management system on the VMS operating system. cessors communicate with a back-end processor Ta bles 4 and 5 show the back-end system configu­ using the DECnet protocol. rations fo r the VAX 9000 Model 210 and the VA X 4000 The test system is the configuration of com­ Model 300 systems, respectively. Ta ble 6 shows the ponents used in the lab to measure the perfor­ system configuration of the front-end systems. mance of the target system. The test system uses RTEs, rather than user terminals, to generate the Measurement Results workload and measure response time. (Note: In The maximum qualified throughput and response previously published reports, based on Digital's time results fo r the TPC Benchmark A are summa­ DebitCredit benchmark, the RTE emulated front­ rized in Ta ble 7 fo r the V�'\ 9000 Model 210 and the end processors. In the TPC Benchmark A standard, VAX 4000 Model 300 systems. Both configurations the RTE emulates only the user terminals.) The have sufficient main memory and disk drives such RTE component Ta ble 4 VAX 9000 Model 21 0 Back-end Emulates the behavior of terminal users accord­ • System Configuration ing to the benchmark specification (e.g., think time, transaction parameters) Component Product Quantity

Processor VAX 9000 Model 21 0 1 • Emulates terminal devices (e.g., conversion and multiplexing into the local area transport Memory 256 MB [LAT) protocol used by the OECserver terminal Ta pe drive TA 81 1 servers) Disk controller KDM70 2 Disks RA92 16 • Records transaction messages and response Operat ing system VMS 5.4 times (e.g., the starting and ending times of Communications DECnet-VMS Phase IV individual transactions from each emulated termi nal device) TP monitor VAX ACMS V3.1 Dictionary VAX COD/Plus V4.1

Figure 4 depicts the test system configuration in Application VAX COBOL V4.2 the environment with one back-end proces­ LAN Database system VAX RdbNMS V4.0 sor, multiple front-end processors, and multiple Forms management DECforms V1 .2 remote terminal emulators.

Vo l. I<)')I so 3 No. I Win ter Digital Te ch11icaljo ur11al PerformanceEvalu ation of Tr ansaction Processing Sy stems

Ta ble 5 VAX4000 Model 300 Back-end 2.0 System Configuration (j) 0 z 1.8 0 Component Product Quantity (.) IJ.J 1.6 � 1.4 Processor VAX 4000 Model 300 IJ.J ::; 1.2 Memory 64 MB f= IJ.J Ta pe drive TK70 1 rn 1.0 z 0 Disk controller DSSI 3 c._ 0.8 rn IJ.J 0.6 Disks RF31 18 a: Operating system VMS 5.4 0.4 20 30 40 50 60 70 80 Communications DECnet-VMS Phase IV TRANSACTIONS PER SECOND KEY: TP monitor VAX ACMS V3.1

Dictionary VAX CDD/Pius V4.1 t::,---{:,. G AVERA E Application VAX COBOL V4.2 ..__..... 90TH PERCENTILE Database system VAX RdbNMS V4.0 Forms management DECforms V1 .2 Figure 5 VAX 9000 Response Time in Relationship to Tra nsactions per Second that the processors are effectively utilized with no other bottleneck. Both systems achieved well over response times at 100 percent and approximately 90 percent CPU utilization at the maximum quali­ 80 percent and 50 percent of the maximum fied throughput under the response time constraint. qualified throughput. The mean transaction In addition to the throughput and response time, response time still grows linearly with the the TPC Benchmark A specification requires that transaction rate up to the 70 TPS level, but the several other data points and graphs be reported. ninetieth percentile response time curve has We demonstrate these data and graphs by using started to rise quickly due to the high CPU uti­ the VAX 9000 Model 210 TPC Benchmark A results. lization and random arrival of transactions.

• Response Time in Relationship to TPS. Figure 5 • Response Time Frequency Distribution. Figure 6 shows the ninetieth percentile and average is a graphical representation of the transaction

Ta ble 6 Front-end Run-time System Configuration

Component Product Quantity

Processor VAXserver31 00 Model 1 0 1 0 for VAX 9000 back-end 3 for VAX 4000 back-end

Memory 16MB for VAX 9000 back-end 12MB for VAX 4000 back-end

Disks RZ23 (1 04 MB) 16

Operat ing system VMS 5.3 1 for VAX 9000 back-end VMS 5.4 1 for VAX 4000 back-end Communications DECnet-VMS Phase IV 1

TP monitor VAX ACMS V3.1 Forms management DECforms V1 .2

Ta ble 7 Maximum Qualified Throughput

Response Time (seconds) System TPS (tpsA-Local) Ave rage 90 percent Maximum

VAX 9000 Model 21 0 69.4 1.20 1.74 5.82

VAX 4000 Model 300 21 .6 1.39 1.99 4.81

I Win ter 1991 Digital Te chnical jo urnal Vo l. 3 No. 51 Transaction Processing, Databases, and Fault-tolerant Systems

(/) 14000 0 80 z I+- AVERAGE = 1.20 SECONDS z 0 0 12000 0 70 f= I I 90TH PERCENTILE = 1.74 SECONDS w 0 (/) 60 <{ 10000 I (/) I MAXIMUM = 5.82 SECONDS [[ z w I 0.. 50 <{ 8000 I [[ (/) f- 40 DATA COLLECTION I z I- -! � 6000 0 INTERVAL I f= 30 I I 0 ffi 4000 II <{ 20 I CD (/) I z I <{ 10 I I 3 2000 [[ z I f- I I 5 0 2 4 6 8 10 12 14 16 18 20 0 10 15 20 25 30 35 TIME (MINUTES) RESPONSE TIME (SECONDS)

Figure 6 VAX 9000 Response Time Frequency Figure 7 VAX 9000 Tra nsactions per Second Distribution ouer Time

response time distribution. The average, nine­ and policies. Such a model, after proper validation, tieth percentile, and max imum transaction can be a powerful tool fo r many types of analysis, response times are also marked on the graph. as well as a perfo rmance pred iction tooL Results can be obtai ned quickly fo r any combination of • Transactions per Second over Time. The results parameters. shown in Figure 7 demonstrate the sustainable A comprehensive analytical model of the perfor­ maximum qualified throughput. The one-minute mance behavior of transaction processing systems running average transaction throughputs dur­ with a response time constraint was developed ing the warm-up and data col lection periods of and validated aga inst measurement results. This the experiment are plotted on the graph. This model is hierarch ical and flexible for extension. graph shows that the throughput was steady The fo llowing sections describe the basic con­ during the period of data collection. struction of the model and the customization made

• Ave rage Response Time over Time. The results to model the execution of TPC Benchmark A on shown in Figure 8 demonstrate the sustain­ Digital's transaction processing systems. The able average response time in the experiment. model can also be used to study diffe rent trans­ The one-minute running average transaction action processing workloads in addition to the response times during the war m-up and data TPC Benchmark A. collection periods of the experiment are plotted Response Time Components on the graph. This graph shows that the mean response time was steady during the period of The main metric used in the model is the maxi­ data collection. mum qualified throughput under a response time constra int. The response time constra int is in the Comprehensive Analytical Model

6 Modeling techniques can be used as a supplement w ::;;; or an alternative to the measurement approach. f= 5 w The performance behavior of complex transaction 'QUJ 4 processing systems can be characterized by a set of Oo o.. z I_ DATA COLLECTION ! parameters, a set of performance metrics, and the - �8 3 I INTERVAL I [[w relationships among them. These parameters can � I lli 2 i be used to describe the different resources avail­ <{ [[ I I w able in the system, the database operations of trans­ > <{ actions, and the workload that the transaction processing system undergoes. To completely rep­ 0 10 15 20 25 30 35 resent such a system, the size of the parameter set TIME (MINUTES) wo uld be too huge to manage. An analytical model simplifies, through abstraction, the complex behav­ Figure S VA..-Y9000 Auerage Response Time ior of a system into a manageable set of parameters ouer Ti tne

'52 Vo l. 3 No. I Wi nter 1991 Digital Te e/m ica/ jo urnal Performance Evaluation of Transaction Processing Sy stems

fo rm of "x percent of transaction response times Within the back-end system, the transaction are less thany seconds." response time is further decomposed into two To evaluate throughput under such response time additional components, CPU delays and non-CPU, constraint, the distribution of transaction response nonoverlapping delays. CPU delays include both times is determined by first decomposing the trans­ the CPU service and the CPU waiting times of trans­ action response time into nonoverlapping and actions. Non-CPU, nonoverlapping delays include: independent components. The distribution of each • Logging delays, which include the time fo r trans­ component is then evaluated. Finally, the overall action log writes and commit protocol delays transaction response time distribution is derived from the mathematical convolution of the compo­ • Database 1/0 delays, which include both waiting nent response time distributions. and service times for accessing storage devices The logical flow of a transaction in a front-end • Other delays, which include delays that result and back-end distributed transaction processing from concurrency control (e.g., waiting fo r locks) system that is used to implement TPC Benchmark A and waiting fo r messages is depicted in Figure 9. The response time of a transaction consists of three basic components: Tw o-level Approach front-end processing, back-end processing, and The model is configured in a two-level hierarchy, a communication delays. high level and a detailed level. The use of a hierarchy

• Front-end process ing usually includes terminal allows a complex and detai led model that considers 1/0 processing, forms/presentation services, and many components and involves many parameters communication with the back-end systems. In to be constructed easily. Because of the hierarchical the benchmark experiments, no disk 1/0 activity approach, the model also provides flexibility fo r was involved during the front-end processing. modifications and extensions, and validation of separate submodels. • Back-end processing includes the execution of The high-level model assumes the decomposition application, database access, concurrency control, of transaction response times, as described in the and transaction commit processing. The back-end Response Time Components section, and models processing usually involves a high degree of con­ the behavior of the transaction processing system currency and many disk 1/0 activities. by an open queuing system, as shown in Figure 10.

• Communication delays primarily include the The queuing system consists of servers and delay communications between the user terminal and centers, which are connected in a queuing net­ the front-end node, and the front-end and back­ work with the following assumptions: end interactions. • The front-end processing does not involve any (Note: These response time components do not disk 1/0 operation, and the load on the front­ overlap with each other.) end systems is equally balanced.

...... 1

I COMMUNICATION I FRONT-END I COMMUNICATION I BACK-END

------� END-TO-END RESPONSE TIME

Figure 9 Response Time Components

Digital Te chnical jo urnal Vo l. 3 No. I Winter 1991 53 Transaction Processing, Databases, and Fault-tolerant Systems

COMMUNICATION {8

DELAYS r COMMUNICATION � :

FRONT-END BACK-END 1/0 PROCESSORS PROCESSORS8 DEVELOPMENT

1/0 DEVELOPMENT

Figure 10 High-level Queuing Modelfo r a Transaction Processing Sy stem

• The back-end is a shared-memory multiprocessor • Number of CPUs in the back-end system and the system with symmetrical loads on all processors back-end CPU service time per transaction (or it can be simply a uniprocessor). • Sum of the back-end database 1/0 response time,

• No intratransaction parallelism exists within journaling 1/0 response time, and other delay individual transaction execution. times (i.e., the mean fo r the LOD delay center's 2-Erlang distribution) • No mutual dependency exists between trans­

action response time components. • Response time constraint (in the fo rm of x per­ centile less thany seconds) • Transaction arrivals tO the processors have a Poisson distribu tion. The main result from the high-level model is the MQTh. This high-level model presents a global pic­ These assumptions correspond to Digital's TPC Bench­ ture of the performance behavior and manife sts the mark A testing methodology and implementation. relationship between the most important parameters The front-end CPU is modeled as an M/M/1 queu­ of the transaction processing system and MQTh. ing center, and the back-end CPU is modeled as an Some of the input parameters in the high-level M/M/m queuing center. The transactions' CPU times model are dynamic. The CPU service time of a trans­ on the front-end and back-end systems are assumed action may vary with the throughput or number of to be exponentially distributed (coefficient of vari­ processors, and the database 1/0 or other delays ation equal tO 1) due to the single type of trans­ may also depend on the throughput. A good exam­ action in the benchmark. (Note: approximation An ple of a dynamic model is a tightly coupled multi­ of M/G/m can be used to consider a coefficient of processor system, with one bus interconnecting variation other than 1 fo r the back-end transaction the processors and with a shared common memory service time, especially in the mult iprocessor CPU (e.g., a VAX 6000 Model 440 system). Such a system case when the bus is highly utilized.) Database 1/0, would run a single copy of the symmetrical multi­ logging 1/0, and other delays are modeled as delay processing operating system (e.g., the VMSsystem). centers, with appropriate delay distributions. For The average CPU service time of transactions is the model of the Benchmark A workJoad, the TPC affected by both hardware and software factors, database I/0, journaling 1/0, and other communi­ such as cation and synchronization delays are combined

into one delay center, called the LOD delay center, • Hardware contention that results from conflict­ which is represented by a 2-Erlang distribution. ing accesses to the shared bus and main memory The major input parameters fo r this high-level and that causes processor speed degradation

model are the and longer CPU service time.

• Number of front-end systems and the front-end • Processor synchronization overhead that results CPU service time per transaction from the serialization of accesses to shared data

54 Vo l. 3 No. 1 Winter 1991 Digital Te chnical jo unml PerformanceEvalua tion of Tra nsaction Processing Sys tems

structures. Many operating systems use spin­ Integrating the Tw o Levels of the Model locks as the mechanism fo r processor-level The two levels of the model are integrated by using synchronization, and the processor spins (i.e., an iterative procedure outlined in Figure 12. It busy-waits) in the case of a conflict. In the starts at the detai led-level submodels, with initial model, the busy-wait overhead is considered values for the MQTh, the transaction path length, to be part of the transaction code path, and the busy-wait overhead, and the CPU utilization. such contention elongates the transaction CPU By applying the initialized parameters to the service time. submodels, the values of these parameters are refined and input to the high-level model. The out­ Four detailed-level submodels are used to put parameters from the high-level model are then account for the dynamic behavior of these param­ fed back to the detai led-level submodels, and this eters: CPU-cache-bus-memory, busy-wait, I/0 group, iterative process continues until the MQTh con­ and LOD. verges. In most cases, convergence is reached The CPU-cache-bus-memory submodel consists within a few iterations. of many low-level parameters associated with the workload, processor, cache, bus, and memory com­ ponents of multiprocessor systems. It models these Model Predictions components by using a mixed queuing network The back-end portion of the model was validated model that consists of both open and closed chains, against measurement results from numerous as shown in Figure 11. The most important output DebitCredit benchmarks (Digital's precursor of the from this submodel is the average number of CPU TPC Benchmark A) on many VAX computers with clock cycles per instruction. the VMS operating system, running VAX ACMS and The busy-wa it submodel models the spin-lock VAX Rdb/VMS software.1 With sufficient detailed contention that is associated with the two major parameters available (such as transaction instruc­ VMS spin-locks, called SCHED and IOLOCK8. This sub­ tion count, instruction cycle time, bus/memory model divides the state of a processor into several access time, cache hit ratio), the model correctly nonoverlapping states and uses probability analy­ estimated the MQTh and many intermediate results sis to derive busy-wait time. The I/O grouping sub­ fo r several multiprocessor VAX systems. The model model models the group commit and group write was then extended to include the front-end sys­ mechanisms of the VAX Rdb/VMS relational database tems. In this section, we discuss applying this com­ management system. This submodel affects the path plete end-to-end model to the TPC Benchmark A length of transaction because of the amortization on two VAX platforms, the VAX 9000 Model 210 and of disk I/O processing among grouped trans­ the VAX 4000 Model 300 systems, and then compare actions. The LOD submodel considers the disk I/0 the results. The benchmark environment and imple­ times and the lock contention of certain critical mentation are described in the TPC Benchmark A resources in the VAX Rdb/VMS system. Implementation section of this paper.

�------� MEMORY 1 SINK I I I I �------I I I I I I -r - ----1 I I I I I I I m 1 MEMORY I I I • ______!

KEY: SOURCE S CLO ED CHAIN E OP N CHAIN

Figure 11 CP U-cache-bus-memory Submodel

Digital Te cb,tcaljournal Vol. 3 No. I Winter 1991 55 Transaction Processing, Databases, and Fault-tolerant Systems

INITIALIZE: TxnPL,MQTh,BusyWa itP L,CpuUti lization; LOD-submode l(input :MQTh;output :LOD) REPEAT I/0-Groupi ng-submodel(input : MQTh;output :DioPerTxn,TxnPL); REPEAT REPEAT BusyWa it-submode l(input : TxnPL,BusyWa itP L, CpuUtilization, DioPe rTxn;output : BusyWaitPL); UNTIL(BusyWaitPL converges); CPU-Cache-Bus-Memory-submodel(input : TxnPL,BusyWa itP L; output : CpuUtilization,AvgCpuSvcTime); UNTIL(CpuUti Lization converges); REPEAT MQTh-mode l(input :AvgCpu SvcTime,LOD;output :MQTh,Cpu Utilization); LOD-submode l(input : MQTh;output :LOD); UNTIL(MQTh converges); UNTIL(MQTh converges);

Figure 12 Tbe Iterative Procedure to Integrating Submodels

Because both the VAX 9000 Model 210 and the The high-level model predicted a range of MQTh, VAX 4000 Model 300 systems are uniprocessor with a high end of 70 TPS and with a strong proba­ systems, there is no other processor contending bility that the high-end performance was achievable. for the processor-memory interconnect and mem­ Additional predictions were made later, when an ory subsystems. Such contention effects can there­ early prototype version of the VAX9000 Model 210 fo re be ignored when modeling a uniprocessor system was available fo r testing. A variant of the system. The transaction processing perfo rmance DebitCredit benchmark, much smaller in scale and prediction for the VAX9000 Model 210 system is a easier to run, was performed on the prototype successful example of the application of our analyt­ system, with the emphasis on measuring the CPU ical model. perfo rmance in a transaction processing environ­ We needed an accurate estimate of TPC Bench­ ment. The result was used to extrapolate the CPU mark A performance on the VAX 9000 Model 210 service time of the TPC Benchmark A transactions system before a VAX9000 system was actually avail­ on the VAX 9000 Model 210 system and to refine able fo r resting. The high-level (MQTh) model was the early estimate. The results of these mod ifica­ used with estimated values fo r the input parame­ tions supported the previous high-end estimate of ters, LOD and transaction CPU service time. The performance of 70 TPS and refined the low-end estimated LOD was based on previous measure­ performance to be 62 TPS. The final, audited TPC ment observations from the VAX6000 systems. The Benchmark A measurement result of the VAX 9000 other parameter, back-end transaction CPU service Model 210 system showed 69.4 TPS, which closely time, was derived from the matches the prediction. Ta ble 8 compares the results from benchmark measurement and the • Timing information of the VAX 9000 CPU analyticalmodel outputs. • Memory access time and cache miss penalty of the VA.'\ 9000 CPU Ta ble 8 Measurement Compared to Model • Prediction of cache hit ratio of the VAX 9000 sys­ Predictions tem under the TPC Benchmark A workload Measured Modeled • Transaction path length of the TPC Benchmark A System MOTh MOTh implementation VAX 9000 Model 21 0 69.4 70.0 • Instruction profi le of the TPC Benchmark A VAX 4000 Model 300 21 .5 20.8 implementation

56 Vo l. 3 No. I If/inter 1991 Digital Te chnical Jo urnal PerformanceEvalua tion of Tra nsaction Processing Sy stems

The VAX 4000 Model 300 TPC Benchmark A paper was validated and used extensively in many results were also used as a validation case. VAX 4000 engineering performance studies. The model also Model 300 systems use the same CMOS chip as helped the benchmark process to size the hard­ the VAX 6000 Model 400 series and the same ware during preparation (e.g., the number of 28-nanosecond (ns) CPU cycle time. However, in RTE and front-end systems needed, the size of the VAX 4000 series, the CPU-memory interconnect the database) and to provide an MQTh goal as a is not the XMI bus but a direct primary memory sanity check and a tuning aid. The model could interconnect. This direct memory interconnect be extended to represent additional distributed results in fast main memory access. The processor, configurations, such as shared-disk and "shared­ cache, and main memory subsystems are otherwise nothing" back-end transaction processing systems, the same as in the VAX 6000 Model 400 systems. and could be applied to additional transaction pro­ Therefore, the detailed-level model and associated cessing workloads. parameters fo r the VAX 6000 Model 410 system can be used by ignoring the bus access time. The Acknowledgments TPC Benchmark A measurement results are within The Digital TPC Benchmark A implementation and 7 percent of the model prediction, which means measurements are the result of work by many that our assumption on the memory access time individuals within Digital. The authors would like is acceptable. especially to thank Jim McKenzie, Martha Ryan, Hwan Shen, and Bob Ta nski fo r their work in the Conclusion TPC Benchmark A measurement experiments; and Performance is one of the most important attrib­ Per Gyllstrom and Rabah Mediouni for their con­ utes in evaluating a transaction processing system. tributions to the analytical model and validation. However, because of the complex nature of trans­ action processing systems, a universal assessment References of transaction processing system performance is 1. Tra nsaction Processing Performance Council, impossible. The performance of a transaction pro­ TPC Benchmark A Standard Sp ecification cessing system is workload dependent, configura­ (Menlo Park, CA: Wa terside Associates, tion dependent, and implementation dependent. A November 1989). standard benchmark, like TPC Benchmark A, is a step toward a fa ir comparison of transaction pro­ 2. Transaction Processing Sy stems Handbook cessing performance by different vendors. But it is (Maynard: Digital Equipment Corporation, only one transaction processing benchmark that Order No. EC-H0650-57, 1990). represents a limited class of applications. When 3. TPC Benchmark: A Report fo r the VAX 9000 evaluating transaction processing systems perfor­ Model 210Sy stem (Maynard: Digital Equipment mance, a good understanding of the targeted appli­ Corporation, Order No. EC-N0302-57, 1990). cation environment and requirements is essential before using any ava ilable benchmark result. 4. TPC Benchmark: A Report fo r the VAX 4000 Additional benchmarks that represent a broader Model 300 Sy stem (Maynard: Digital Equipment range of commercial applications are expected to Corporation, Order No. EC-N0301-57, 1990). be standardized by the Transaction Processing 5. L. Wright, W Ko hler, and W Zahavi, "The Digital Performance Council (TPC) in the com ing years. 
DebitCredit Benchmark: Methodology and Performance modeling is an attractive alterna­ Results," Proceedings of the International tive to benchmark measurement because it is less Co nference on Management and Pe rformance expensive to perform and results can be compiled Evaluation of Computer Systems (December more quickly. Modeling provides more insight 1989): 84-92. into the behavior of system components that are treated as black boxes in most measurement exper­ iments. Modeling helps system designers to better understand performance issues and to discover existing or potential performance problems. Model­ ing also provides solutions fo r improving perfor­ mance by modeling different tuning or design alternatives. The analytical model presented in this

Digital Te chnical journal Vo l. 3 No. I Winter 1991 57 Wi lliam Z Zahavi Frances A Ha bib Ke t�neth J. Omahen 1I

To ols and Te chniquesfo r Preliminary Sizing of Transaction Processing Applications

Sizing transaction processing systems correctly is a difficult task. By nature, trans­ action processing applications are not predefi ned and can vmy fr om tbe simple to tbe complex. Sizing during the analysis and design stages of tbe application devel­ op ment cy cle is particularly difficult. It is impossible to measure tbe resource requirements of an application which is not yet written or fu lly implemented. To make sizing easier and more accurate in these stages, a sizing methodology was developed that uses measurements fr om systems on which industry-standard benchmarks have been run and employs standardsy stems analysis techniques fo r acquiring sizing information. Th ese metrics are tben used to predictfu ture trans­ action resource usage.

The transaction processing marketplace is domi­ The development of Digital's transaction process­ nated by commercial applications that support ing sizing methodology was guided by several prin­ businesses. These applications contribute substan­ ciples. The first principle is that the methodology tially to the success or fa ilure of a business, based on should rely heavily upon measurements of Digital the level of performance the application provides. systems running industry-standard transaction In transaction processing, poor application perfor­ processing benchmarks. These benchmarks pro­ mance can translate directly into lost revenues. vide valuable data that quantifies the perfo rmance The risk of implementing a transaction process­ characteristics of different hardware and software ing application that performs poorly can be mini­ configurations. mized by estimating the proper system size in the The second principle is that systems analysis early stages of application development. Sizing esti­ methodologies should be used to provide a frame­ mation includes configuring the correct processor work fo r acquiring sizing information. In partic­ and proper number of disk drives and controllers, ular, a multilevel view of a customer's business given the characteristics of the applicat ion. is adopted. This approach recognizes that a man­ The sizing of transaction processing systems is ager's view of the business functions performed by a difficult activity. Unlike traditional applications an organization is diffe rent from a computer ana­ such as mail, transaction processing applications lyst's view of the transaction processing activity. are not predefi ned. Each customer's requirement The sizing methodology should accommodate both is different and can vary from simple to complex. these views. Therefore, Digital chose to develop a sizing method­ The third principle is that the sizing methodol­ ology that specifically meets the unique require­ ogy must employ tools and teclmiques appropriate ments of transaction processing customers. The to the current stage of the customer's application goal of this effort was to develop sizing tools and design cycle. Early in the effo rt to develop a sizing techniques that would help marketing groups and methodology, it was fo und that a distinction must design consultants in recommending configura­ be made between prel im inary sizing and sizing tions that meet the needs of Digital's customers. during later stages of the application development Digital's methodology evolved over time, as experi­ cycle. Preliminary sizing occurs during the analysis ence was gained in dealing with the real-world and design stages of the application development problems of transaction processing system sizing. cycle. Therefore, no application software exists

58 Vo l. 3 No. I Wi11ter 1991 Digital Te ch nical jo urnal To ols and Te chniquesfo r Preliminary Sizing of Tr ansaction Processing Applications

which can be measured. Application software does The hardware platform specifies the desired exist in later stages of the application development topology or processing style. For example, process­ cycle, and its measurement provides valuable input ing style includes a centralized configuration and a for more precise sizing activities. front-end and back-end configuration as valid alter­ For example, if a customer is in the analysis or natives. The hardware platform may also include design stages of the application development cycle, specific hardware components within the process­ it is unlikely that estimates can be obtained fo r ing style. (In this paper, the term processor refers such quantities as paging rates or memory usage. to the overall processing unit, which may be com­ However, if the application is fully implemented, posed of multiple CPUs.) then tools such as the VA.Xcluster Performance The software platform identifies the set of layered Advisor (VPA) and the DECcp capacity planning products to be used by the transaction processing products can be used fo r sizing. These tools pro­ application, with each software product identified vide fa cilities fo r measuring and analyzing data by its name and version number. In the transaction from a running system and fo r using the data as processing environment, a software platform is input to queuing models. composed of the transaction processing monitor, The term sizing, as used in this paper, refers to fo rms manager, database management system, appli­ preliminary sizing. The paper presents the metrics cation language, and operating system. and algebra used in the sizing process fo r DECtp Different combinations of software platforms applications. It also describes the individuaJ tools may be configured, depending on the hardware plat­ developed as part of Digital's transaction process­ fo rm used. A centralized configuration contains ing sizi ng effort. all the software components on the same system. A distributed system is comprised of a front-end pro­ Sizing cessor and a back-end processor; different software The purpose of sizing tools is twofold. First, sizing platforms may exist on each processor. tools are used to select the appropriate system components and to estimate the performance level Levels of Business Metrics of the system in terms of device utilization and The term business metrics refers collectively to user response times. Second, sizing tools bridge the the various ways to measure the work associated gap between business specialists and computer with a customer's business. In this section, various specialists. This bridge translates the business units levels of business metrics are identified and the into functions that are performed on the system relationship between metrics at different levels is and, ultimately, into units of work that can be quan­ described.' As mentioned earlier, the levels corre­ tified and measured in terms of system resources. spond to the multilevel view of business operation In the sections that fo llow, a number of important typically used for systems analysis. The organi­ elements of the sizing methodology are described. zation or personnel most interested in a metric in The first of these elements is the platform on which relation to its business operation is noted in the the transaction processing system will be imple­ discussion of each metric. mented. 
It is assumed that the customer will supply The decomposition of the business application general preferences for the software and hardware requirements into components that can be counted configuration as part of the platform infor mation. and quantified in terms of resource usage requires The Levels of Business Metrics section details the that a set of metrics be defined. These metrics multilevel approach used to describe the work per­ reflect the business activity and the system load. fo rmed by the business. The Sizing Metrics and The business metrics are the fo undation fo r the Sizing Formulas sections describe the algorithms development of several transaction processing siz­ that use platform and business metric information ing tools and for a consistent algebra that connects to perform transaction processing system sizi ng. the business units with the computer units. The business metrics are natural fo recasting units, Platforms business functions, transactions, and the number The term platform is used in transaction process­ of I/Os per transaction. The relationship among ing sizing methodology to encompass general cus­ these levels is shown in Figure 1. In general, a busi­ tomer preferences fo r the hardware and software ness may have one or more natural forecasting upon which the transaction processing application units. Each natural fo recasting unit may drive one or will run. more business fu nctions. A business function may

Digital Te chnical jo urnal Vo l. 3 Nu. I Wirzter 1991 59 Transaction Processing, Databases, and Fault-tolerant Systems

1/0 ACTIVITY REQUIREMENTS FILES UPDATE UPDATE READS INSERTS READS WAITES

Figure 1 Levels of Business Activity Ch aracterization

have multiple transactions, and a single transaction managers to track the activity of their departments. may be activated by different business fu nctions. A transaction is an atomic unit of work fo r an Every transaction issues a variety of 1/0 operations application, and transaction response time is the to one or more fi les, which may be physically primary performance measure seen by a user. Each located on zero, one, or more disks. This section of the interactions mentioned in the above busi­ discusses the business metrics but does not dis­ ness fu nction is a transaction. The measurement cuss the physical distribution of 1/0s across disks, metric fo r a transaction is the number of trans­ which is an implementation-specific item. action occurrences per business function. Trans­ A natural forecasting unit is a macrolevel indica­ actions may be used by low-level managers to track tor of business volume. (It is also called a key vol­ the activity of their groups. ume indicator.) A business generally uses a volume The bulk of commercial applications involves indicator to measure the level of success of the the maintaining and moving of information. This business. The vo lume is often measured in time information is data that is often stored on perma­ intervals that reflect the business cycle, such as nent storage devices such as rotational disks, solid weekly, monthly, or quarterly. For example, if busi­ state disks, or tapes. An l/0 operation is the process ness volume indicators were "number of ticket sales by which a transaction accesses that data. The mea­ per quarter," or "monthly production of widgets," surement metric fo r the I/O profile is the number then the corresponding natural fo recasting units of l/0 operations per transaction. 1/0 operations would be "ticket sales" and "widgets." Natural fo re­ by each transaction are important to programmers casting units are used by high-level executives to or system analysts. track the health of the overall business. In addition to issuing 1/0s, each transaction Business fu nctions are a logical unit of work per­ requires a certain amount of CPU time to handle fo rmed on behalf of a natural fo recasting unit. For fo rms processing. (Forms processing time is not example, within an airline reservation system, a illustrated in Figure 1.) The measurement metric common business fu nction might be "selling air­ fo r fo rms processing time is the expected number line tickets." This business fu nction may consist of fields. The number of input and output fields of multiple interactions with the computer (e .g., per fo rm are important metrics fo r users of a trans­ flight inquiry, customer credit check). The comple­ action processing application or programmer/ tion of the sale terminates the business fu nction, system analysts. and "airline ticket" acts as a natural fo recasting unit By collecting information about a transaction fo r the enterprise selling the tickets. The measure­ processing application at various levels, high-level ment metric fo r business fu nctions is the num­ vo lume indicators are mapped to low-level units ber of business function occurrences per hour. of 1/0 activity. This mapping is fu ndamental to the Business fu nctions may be used by middle-level transaction processing sizing methodology.

60 Vo l. 3 No. I Winter 1991 Digital Te ch t�icaljourtlal To ols and Te chniquesfo r PreliminarySizing of Transaction Processing Applications

Performance goals play a particularly important ally constitute the major expense in any proposed role in the sizing of transaction processing systems.2 system solution. Mistakes in processor sizing are The major categories of performance goals com­ very expensive to fix, both in terms of customer monly encountered in the transaction processing satisfaction and cost. marketplace are bounds fo r

• Device utilization(s) Sizing Me trics Transaction processing applications permit a large • Average response time fo r transactions number of users to share access to a common data­ • Response time quantiles fo r transactions base crucial to the business and usually residing on disk memory. In an interactive transaction process­ For example, a customer might specifya required ing environment, transactions generally involve processor utilization of less than 70 percent. Such a some number of disk I/O operations, although the constraint reflects the fact that system response number is relatively small compared to those time typically rises dramatically at higher proces­ generated by batch transaction processing appli­ sor utilizations. A common performance goal fo r cations. CPU processing also is generally small and response time is to use a transaction's average consists primarily of overhead for layered trans­ response time and response time quantiles. For action processing software products. Although example, the proposed system should have an aver­ these numbers are small, they did influence the age response time of x seconds, with 95 percent sizing methodology in several ways. of all responses completing in less than or equal Ratings fo r relative processor capacity in a trans­ to y seconds, where x is less than y. Transaction action processing environment were developed response times are crucial for businesses. Poor to reflect the ability of a processor to support disk response times translate directly into decreased 1/0 activity (as observed in benchmark tests). In productivity and lost revenues. addition, empirical studies of transaction process­ When a customer generates a fo rmal Request For ing applications showed that, fo r purposes of pre­ Proposal (RFP), the performance goals fo r the liminary sizing, the number of disk 1/0s generated transaction processing system typically are speci­ by a transaction provides a good predict ion of the fied in deta il. The specification of goals makes required amount of CPU processing.; Numerous it easier to define the perfo rmance bounds. For industry-standard benchmark tests fo r product customers who supply only general performance positioning were run on Digital's processors. These goals, it is assumed that the performance goal takes processors were configured as back-end proces­ the fo rm of bounds fo r device utilizations. sors in a distributed configuration with different Overall response time consists of incremental software platforms. contributions by each major component of the The base workload fo r this benchmark testing is overall system: currently the Transaction Processing Performance • Front-end processor Council's TPC Benchmark A (TPC-A, fo rmerly the DebitCredit benchmark)�·'6 The most complete • Back-end processor set of benchmark testing was run under Digital's • Communications network VAX ACMS transaction processing monitor and VAX Rdb/VMS relational database. Therefore, results • Disk subsystem from this softwareplatf orm on all Digi tal proces­ A main objective in this approach to sizing was sors were used to compute the first sizing metric to identifyand use specific metrics that could be called the base load fa ctor. easily counted fo r each major component. 
For The base load factor is a high-level metric that instance, the number of fields per fo rm could be incorporates the contribution by all layered soft­ a metric used fo r sizing front-end processors ware products on the back-end processor to the because that number is specific and easily counted. total CPU time per 1/0 operation. Load factors are As the path of a transaction is fo llowed through the computed by dividing the total CPU utilization by overall system, the units of work appropriate fo r the number of achieved disk 1/0 operations per each component become clear. These units become second. (The CPU utilization is normalized in the the metrics fo r sizing that particular component. event that the processor is a Symmetrical Multi­ The fo cus of this paper is on processor sizing with processing [SMP] system, to ensure that its value bounds on processor utilization. Processors gener- falls within the range of 0 to 100 percent.) The

Digital Te ch nical jo urnal Vo l. 3 No. I Winter 1991 61 Tra nsaction Processing, Databases, and Fault-tolerant Systems

calculation of load factor yields the total CPU time, application/database and fo rms management were in centiseconds (hundredths of seconds), required developed. These fo rmulas are used separately fo r to support an application's single physical 1/0 sizing back-end and fr ont-end processors in a dis­ operation. tribu ted configuration. The individual contribu­ The base load fa ctors give the CPU time per 1/0 tions of the fo rmulas are combined fo r sizing a required to run the base workload, TPC-A, on any centralized configuration. Digital processor in a back-end configuration using The application/database is the work that takes the AC MS/Rdb. The CPU time per 1/0 can be esti­ place on the back-end processor of a distributed mated fo r any workload. This generalized metric is configuration. It is a fu nction of physical disk called the application load factor. accesses. To determine the minimal CPU time To relate the base load factors to workloads other required to handle this load, processor utilization than the base, an additional metric was defined is used as the performance goal, setting up an cal led the intensity factor. The metric calculation inequality that is solved to obtain a corresponding fo r the intensity fa ctor is the application load load factor. The resulting load fa ctor is then com­ factor divided by the base load factor. The value in pared to the table of base load fa ctors to obtain a using intensity factors is that, once estimated (or recommendation fo r a processor type. To rein­ calculated fo r ru nning appl icat ions), intensity fac­ fo rce this dependence of load factors on processor tors can be used to characterize any application in types, load fa ctor x refers to the associated pro­ a way that can be applied across all processor types cessor type x in the fo llowing calculations. to estimate processor requirements. One method fo r estimating the average CPU time Intensity fa ctors vary based on the software per transaction is to multiply the number of 1/0s platform used. If a software plattorm other than a per transaction by the load fa ctor x and the inten­ combined VAX ACMS and VAX Rdb/YMS platform is sity factor. This yields CPU time per transaction, selected, the estimate of the intensity factor must expressed in centiseconds per transaction. By mul­ be adjusted to reflect the resource usage character­ tiplying this product by the transactions per sec­ istics of the selected DECtp software platform. ond rate, an expression fo r processor utilization is To estimate an appropriate intensity factor fo r a derived. Thus processor utilization (expressed as a nonexistent application, judgment and experience percentage scaled between 0 and 100 percent) is with similar applications are required. However, the number of transactions per second, times the measured cases from a range of DECtp appl ications number ofi/Os per transaction, times load factor x, shows relatively little variation in intensity fa ctors. times the intensity factor. Guidelines to help determine intensity factors are The performance goal is a CPU utilization that is included in the documentation fo r Digital's inter­ less than the utilization specified by the customer. nally developed transaction processing sizing tools. 
Therefore, the calculation used to derive the load The work required by any transaction pro­ factor is the utilization percentage provided by the cessing application is composed of two parts: the customer, divided by the number of transactions application/database and the fo rms management. per second, times the number of l/Os per trans­ This division of work corresponds to what occurs action, times the intensity factor. in a distributed configuration, where the fo rms pro­ Once computed, the load fa ctor is compared to cessing is off-loaded to one or more front-end pro­ those values in the base load fa ctor table. The base cessors. Load factors and intensity factors are load factor equal to or Jess than the computed value metrics that were developed to size the application/ is selected, and its corresponding processor type, database. To estimate the amount of CPU time x, is returned as the minimal processor required to required fo r fo rms management, a fo rms-specific hand le this workload . metric is required. For a first-cut approximation, The fo ur input parameters that need to be esti­ the expected number of (input) fields is used as the mated fo r inclusion in this inequality are sizing metric. This number is obta ined easily from • Processor utilization performance goal (tradi­ the business-level description of the application. tionally set at around 70 percent, but may be set higher fo r Digital's newer, fa ster processors) Sizing Fo rmulas This section describes the underlying algebra devel­ • Ta rget transactions per second (which may be oped fo r processor selection. Different fo rmulas derived from Digital's multilevel mapping of to estimate the CPU time required fo r both the business metrics)

62 Vo l. 3 No. I Winter 1991 Digital Te chnical jo urnal To ols and Te chniques fo r Prelim inary Sizing of Tra nsaction Processing Applications

• l/Os per transaction (estimated from application two inputs generally involves the most uncer­ description and database expertise) tainty, the spreadsheet allows the user to input a range of values fo r each.) The second screen turns • Intensity fa ctor (estimated from experience with the analysis around, showing the resulting trans­ similar applications) action-per-second ranges that can be supported by Note: Response time performance goals do not the processor type selected by the user, given the appear in this fo rmula. This sizing for mula deals target processor utilization, expected number of strictly wi th ensuring adequate processor capacity. fields, and possible intensity fa ctors and number of However, these performance parameters (includ­ 1/0s per transaction. ing the CPU service time per transaction) are fed The basic sizing fo rmula addresses issues that into an analytic queuing solver embedded in some deal specifically with capacity but not with per­ of the transaction processing sizing tools, which fo rmance. To predict behavior such as response produces estimates of response times. times and queue lengths, modeling techniques that Forms processing is the work that occurs either employ analytic solvers or simulators are needed. on the front-end processor of a distributed config­ A second tool embeds an analytic queuing solver uration or in a centralized configuration. It is not a within itself to produce performance estimates. fu nction of physical disk accesses; rather, fo rms This tool is an automated system (i.e., a DECtp processing is CPU intensive. To estimate the CPU application) that requests information from the time (in seconds) required fo r fo rms processing, user according to the multilevel workload charac­ the fo llowing simple linear equation is used: terization methodology. This starts from general business-level info rmation and proceeds to request

y = c(a + bz) successively more detailed information about the application. The tool also contains a knowledge wherey equals the CPU time fo r fo rms processing; base of Digital's product characteristics (e.g., pro­ a equals the CPU time per fo rm per transaction cessor and disk) and measured DECtp applications. instance, depending on the fo rms manager used; The user can search through the measured cases to b equals the CPU time per field per transaction find a similar case, which could then be used to instance, depending on the fo rms manager used; provide a starting point fo r estimating key applica­ z equals the expected number of fields; and c equals tion parameters. The built-in product characteris­ the scaling ratio, depending on the processor type. tics shield the user from the numeric details of the This equation was developed by fe ed ing the results sizing algorithms. of controlled fo rms testing into a linear regression A third tool is a spin-off from the second tool. model to estimate the CPU cost per fo rm and per This tool is a standalone analytic queuing solver with field (i.e., a and b). The multiplicative term, c, is a simple textual interface. The tool is intended fo r used to el iminate the dependence of factors a and the sophisticated user and assumes that the user bon the hardware platform used to run these tests. has completed the level of analysis required to be able to supply the necessary technical input param­ Sizing To ols eters. No automatic table lookups are provided . Several sizing tools were constructed by using the However, for a completely characterized applica­ above fo rmulas as starting points. These tools dif­ tion, this tool gives the sophisticated user a quick fe r in the range of required inputs and outputs, and means to obtain performance estimates and run in the expected technical sophistication of the user. sensitivity analyses. The complete DECtp software The first tool developed is fo r quick, first­ platform necessary to run the second tool is not approximation processor sizing. Currently embod­ required fo r this tool. ied as a DECalc spreadsheet, with one screen fo r processor selection and one fo r transactions-per­ Data Collection second sensitivity analysis, it can handle back-end, To use the sizing tools fu lly, certain data must be front-end, or centralized sizing. The first screen available, which allows measured workloads to be shows the range of processors required, given the used to establish the basic metrics. Guidance in target processor utilization, target transactions sizing unmeasured transaction processing applica­ per second, expected number of fields, and the tions is highly dependent on developing a knowl­ possible intensity factors and number of 1/0s per edge base of real-world transaction processing transaction. (Because the estimation of these last application descriptions and measurements. The

Digital Te cbnicalJourual VrJ I. 3 No. I Win ter 1991 63 Transaction Processing, Databases, and Fault-tolerant Systems

kinds of data that need to be stored within the Acknowledgments knowledge base require the data collection tools to The authors would like to acknowledge our col­ gather information consistent with the transaction leagues in the Transaction Processing Systems processing sizing algebra. Performance Group whose efforts led to the devel­ For each transaction type and for the aggregate opment of these sizing tools, either through prod­ of all the transaction types, the fo llowing informa­ uct characterization, system support, objective tion is necessary to perform transaction process­ critique, or actual tool development. In particular, ing system sizi ng: we would like to acknowledge the contributions • CPU time per disk 1/0 made by Jim Bouhana to the development of the sizing methodology and tools. • Disk 1/0 operations per transaction

• Transaction rates References • Logical-to-physical disk 1/0 ratio 1. W Zahavi and ]. Bouhana, "Business-Level Des­ The CPU to 1/0 ratio can be derived from Digital's cription of Transaction Processing Applications," existing instrumentation products, such as the VAX CM G '88 Proceedings (1988): 720 -726. Software Performance Monitor (SPM) and VA Xcluster K. Omahen, "Practical Strategies fo r Config­ Performance Advisor (VPA) products.' Both prod­ 2. uring Balanced Transaction Processing Sys tems," ucts can record and store data that reflects CPU IEEE COMPCON usage levels and physical disk 1/0 rates. Sp ring '89 Proceedings (1989): The DECtrace product collects event-driven data. 554-559. It can collect resource items from layered soft­ 3. W Zahavi, "A First Approximation Sizing ware products, including VA X ACMS monitor, the Te chnique -The 1/0 Operation as a Metric of VAX Rdb/VMS and DBMS database systems, and if CPU Power," CM G '90 Conference Proceedings instrumented, from the application program itself. (forthcoming December 10-14, 1990). As an event collector, the DECtrace product can be "TPC BENCHMARK A - Standard Specification," used to track the rate at which events occur. 4. (Transaction Processing Performance Council, The methods fo r determining the logical-to­ November physical disk 1/0 ratio per transaction remain open 1989). fo r continuing study. Physical disk 1/0 operations 5. "A Measure of Transaction Processing Power," are issued based on logical commands from the Datamation, vo l. 31, no. 7 (Ap ril 1, 1985): 112-118. application. The find, update, or fetch commands 6. L. Wright, W Kohler, and W Zahavi, "The Digital from an SQL program translate into from zero to DebitCredit Benchmark: Methodology and many thousands of physical disk I/O operations, Results," CMG '89 Co nference Pro ceedings depending upon where and how data is stored . Characteristics that affect this ratio include the (1989): 84-92. length of the data tables, number of index keys, and 7 F Habib, Y. Hsu, and K. Omahen, "Software access methods used to reach individual data items Measurement To ols for VA X/VMS Systems," CMG (i.e., sequential, random). Tra nsactions (Summer 1988): 47-78. Few tools currently available can provide data on physical l/0 operations for workloads in the design stage. A knowledge base that stores the logical-to-physical disk 1/0 activity ratio is the best method avai lable at this time for predicting that value. The knowledge base in the second sizing tool is beginning to be populated with application descriptions that include this type of information. It is anticipated that, as this tool becomes widely used in the field, many more application descrip­ tions will be stored in the knowledge base. Pooling individual application experiences into one central repository will create a valuable source of knowl­ edge that may be utilized to provide better infor­ mation for future sizing exercises.

Ananth Raghavan   T. K. Rengarajan

Database Availability for Transaction Processing

A transaction processing system relies on its database management system to supply high availability. Digital offers a network-based product, the VAX DBMS system, and a relational database product, the VAX Rdb/VMS database system, for its transaction processing systems. These database systems have several strategies to survive failures, disk head crashes, revectored bad blocks, database corruptions, memory corruptions, and memory overwrites by faulty application programs. They use base hardware technologies and also employ novel software techniques, such as parallel transaction recovery, recovery on surviving nodes of a VAXcluster system, restore and roll-forward operations on areas of the database, on-line backup, verification and repair utilities, and executive mode protection of trusted database management system code.

Modern businesses store critical data in database management systems. Much of the daily activity of business includes manipulation of data in the database. As businesses extend their operations worldwide, their databases are shared among office locations in different parts of the world. Consequently, these businesses require transaction processing systems to be available for use at all times. This requirement translates directly to a goal of perfect availability for database management systems.

VAX DBMS and VAX Rdb/VMS database systems are based on network and relational data models, respectively. Both systems use a kernel of code that is largely responsible for providing high availability. This layer of code is maintained by the KODA group. KODA is the physical subsystem for VAX DBMS and VAX Rdb/VMS database systems. It is responsible for all I/O, buffer management, concurrency control, transaction consistency, locking, journaling, and access methods.

In this paper, we define database availability and describe downtime situations and how such situations can be resolved. We then discuss the mechanisms that have been implemented to provide minimal loss of availability.

Database Availability
The unit of work in transaction processing systems is a transaction. We therefore define database availability as the ability to execute transactions. One way the database management system provides high availability is by guaranteeing the properties of transactions: atomicity, serializability, and durability.1 For example, if a transaction that has made updates to the database is aborted, other transactions must not be allowed to see these updates; the updates made by the aborted transaction must be removed from the database before other transactions may use that data. Yet, data that has not been accessed by the aborted transaction must continue to be available to other transactions.

Downtime is the term used to refer to periods when the database is unavailable. Downtime is caused by either an unexpected failure (unexpected downtime) or scheduled maintenance on the database (scheduled downtime). Such classifications of downtime are useful. Unexpected downtime is caused by factors that are beyond the control of the transaction processing system. For example, a disk failure is quite possible at any time during normal processing of transactions. However, scheduled downtime is entirely within the control of the transaction processing system. High availability demands that we eliminate scheduled downtime and ensure fast system recovery from unexpected failures.

The layers of the software and hardware services which compose a transaction processing system are dependent on one another for high availability. The dependency among these services is illustrated in Figure 1. Each service depends on the availability of the service in the lower layers.


Errors and failures can occur in any layer, but may not be detected immediately. For example, in the case of a database management system, the effects of a database corruption may not be apparent until long after the corruption (error) has occurred. Hence it is difficult to deal with such errors. On the other hand, failures are noticed immediately. Failures usually make the system unavailable and are the cause of unexpected downtime.

Each layer can provide only as much availability as the immediate lower layer. Hence we can also express the perfect-availability goal of a database management system as the goal of matching the availability of the immediately lower layer, which in our case is the operating system. At the outset, it is clear that a database management system is layered on top of an operating system and hence is only as available as the underlying operating system. However, a database management system is in general not as available as the underlying layer because of the need to guarantee the properties of transactions.

Figure 1   Layers of Availability in Transaction Processing Systems (application program, database management system, operating system (VMS), hardware (CPU, disk), general environment)

Unexpected Downtime
In this section we discuss the causes of unexpected downtime and the techniques that minimize downtime.

A database monitor must be started on a node before a user's process running on that node can access a database. The monitor oversees all database activity on the node. It allows processes to attach to and detach from databases and detects failures. On detecting a failure, the monitor starts a process to recover the transactions that did not complete because of the failure. Note that this database monitor is different from the TP monitor.2

Application Program Exceptions
Although transaction processing systems are based on the client/server architecture, Digital's database systems are process based. The privileged database management system code is packaged in a shareable library and linked with the application programs. Therefore, bugs in the applications have a good chance of affecting the consistency of the database. Such bugs in applications are one type of failure that can make the database unavailable.

The VAX DBMS and VAX Rdb/VMS systems guard against this class of failure by executing the database management system code in the VAX executive mode. Since application programs execute in user mode, they do not have access to data structures used by the database management system. When a faulty application program attempts such an access, the VMS operating system detects it and generates an exception. This exception then forces an image rundown of the application program.

In general, when an image rundown is initiated, Digital's database management products use the condition-handling facility of VMS to abort the transaction. Condition handling of image rundown is performed at two levels. Two condition handlers are established, one in user mode and the other in kernel mode. The user mode exit handler is usually invoked, which rolls back the current transaction and unbinds it from the database. In this case, the rest of the users on the system are not affected at all. The database remains available. The execution of the user mode exit handler is, however, not guaranteed by the VMS operating system. Under some abnormal circumstances, the user mode exit handlers may not be executed at all. In such circumstances, the kernel mode exit handler is invoked by the VMS system. This handler resides in the database monitor. The monitor starts a database recovery (DBR) process. It is the responsibility of the DBR process to roll back the effects of the aborted transaction. To do this, the DBR process first establishes a database freeze. This freeze prevents other processes from acquiring locks that were held by the aborted transaction, and hence from seeing and updating uncommitted data. (The VMS lock manager releases all locks held by a process when that process dies.) The DBR process then proceeds to roll back the aborted transaction.


Code Corruptions
It is important to prevent coding mistakes within the DBMS from irretrievably corrupting the database. To protect the database management system from coding mistakes, internal data structure consistency is examined at different points in the code. If any inconsistency is found, a bug-check utility is called that dumps the internal database format to a file. The utility then raises an exception that is handled by the monitor, and the DBR process is started as described above.

To deal with corruptions to the database that are undetected with this mechanism, an explicit utility is provided that verifies the structural consistency of the database. This verify utility may be executed on-line, while users are still accessing the database. Such verification may also be executed by a database administrator (DBA) in response to a bug-check dump. Once such a corruption is detected, an on-line utility provides the ability to repair the database.

In general, corruption in databases causes unexpected downtime. Digital provides the means of detecting such corruption on-line and repairing it on-line through recovery utilities.

Process Failure
In the VMS system, a process failure is always preceded by image rundown of the current image running as part of the process. Therefore, a process failure is detected by the database monitor, which then starts a DBR process to handle recovery.

Node Failure
Among the many mechanisms Digital provides for availability is node failover within a cluster. When a node fails, another node on the cluster detects the failure and rolls back the lost transactions from the failed node. Thus the failure of one node does not cause transactions on other active nodes of the cluster to come to a halt (except for the time the DBR process enforces a freeze). It is the database monitor that detects node failure and starts a recovery process for every lost transaction on the failed node. The database becomes available as soon as recovery is complete for all the users on the failed node.

Power Failure
Power failure is a hardware failure. As soon as power is restored, the VMS system boots. When a process attaches to the database, a number of messages are passed between the process that is attaching and the monitor. If the database is corrupt (because of power failure), the monitor is so informed by the attaching process, and again the monitor starts recovery processes to return the database to a consistent state. The database becomes available as soon as recovery is complete for all such failed users.

As described above, recovery is always accomplished by the monitor process starting DBR processes to do the recovery. The only difference in the case of process, node, or cluster failure is the mechanism by which the monitor is informed of the failure.

Disk Head Crash
Some failures can result in the loss or corruption of the data on the stable storage device (disk). Digital has a mechanism for bringing the database back to a consistent state in such cases.

A disk head crash is a failure of hardware that is usually characterized by the inability to read from or write to the disk. Hence database storage areas residing on that disk are unavailable and possibly irretrievable. A disk head crash automatically aborts transactions that need to read from or write to that disk. In addition, recovery of these aborted transactions is not possible since the recovery processes need access to the same disk. In this case, the database is shut down and access is denied until the storage areas on the failed disk are brought on-line. Areas are restored from backups and then rolled forward until consistent with the rest of the database. The after-image journal (AIJ) files are used to roll the areas forward. As soon as all the areas on the failed disk have been restored onto a good disk and rolled forward, the database becomes available.

Bad Disk Blocks
Bad blocks are hardware errors that often are not detected when they happen. The bad blocks are revectored, and the next time the disk block is read, an error is reported. Bad blocks simply mean that the contents of a disk block are lost forever. The database administrator detects the problem only when a database application fails to fetch data on the revectored block. Such an error may cause a certain transaction or a set of transactions to fail, no matter how many attempts are made to execute the transactions.


This failure constitutes reduced availability; parts of the database are unavailable to transactions. Exactly how much of the database remains available depends on which blocks were revectored.

The mechanism provided to reduce the possible downtime is early detection. Digital's database systems provide a verification utility that can be executed while users are running transactions. The verification utility checks the structural consistency of the database. Once a bad block is detected by such a utility, that area of the database may be restored and rolled forward. These two operations make the whole database temporarily unavailable; however, the bad block is corrected, and future downtime is avoided. The downtime caused by the bad block may be traded off against the downtime needed to restore and roll forward.

Site Failure
A site failure occurs when neither the computers nor the disks are available. A site failure is usually caused by a natural disaster such as an earthquake. The best recourse for recovery is archival storage. Digital provides mechanisms to back up the database and AIJ files to tape. These tapes must then be stored at a site away from the site at which the database resides. Should a disaster happen, these backup tapes can be used to restore the database. However, the recovery may not be complete. It cannot restore the effects of those committed transactions that were not backed up to tape.

After a disaster, the database can be restored and rolled forward to the state of the completion of the last AIJ that was backed up to tape. Any transactions that committed after the last AIJ was backed up cannot be recovered at the alternate site. Such transaction losses can be minimized by frequently backing up the AIJ files.

Memory Errors
Memory errors are quite infrequent, and when they happen, they usually are not detected. If the error happens to a data record, it may never be detected by any utility, but may be seen as incorrect data by the user. If the verification utility is run on-line, it may also detect the errors. Again, the database may only be partially available, as in the case of bad blocks. However, it is possible to repair the database while users are still accessing the database. Digital's database management products provide explicit repair facilities for this purpose. The loss of availability during repair is not worse than the loss due to the memory error itself.

As explained previously, the database monitor plays an important part in ensuring database consistency and availability. Most unexpected failure scenarios are detected by the monitor, which then starts recovery processes. In addition, some failures might require the use of backup files to restore the database.

Scheduled Downtime
Most database systems have scheduled maintenance operations that require a database shutdown. Database backup for media recovery and verification to check structural consistency are examples of operations that may require scheduled downtime. In this section we describe ways to perform many of these operations while the database is executing transactions.

Backup
Digital's database systems allow two types of transactions: update and "snapshot." The ability to back up data on-line depends on the snapshot transaction capability of the database.

Database backup is a standard way of recovering from media failures. Digital's database systems provide the ability to do transaction-consistent backups of data on-line while users continue to change the database.

The general scheme for snapshot transactions is as follows. The update transactions of the database preserve the previous versions of the database records in the snapshot file. All versions of a database record are chained. Only the current version of the record is in the database area. The older versions are kept in the snapshot area. The versions of the records are tagged with transaction numbers (TSNs). When a snapshot transaction (for example, a database backup) needs to read a database record, it traverses the chain for that database record and then uses the appropriate version of the record.

There are two modes of database operation with respect to snapshot activity. In one mode, all update transactions write snapshot copies of any records they update. In the deferred snapshot mode, the updates cause snapshot copies to be written only if a snapshot transaction is active and requires old versions of a record. In this mode, a snapshot transaction cannot start until all currently active update transactions (which are not writing snapshot records) have completed; that is, the snapshot transaction must wait for a quiet point in time. If there are either active or pending snapshot transactions when an update transaction starts, the update transaction must write snapshot copies.
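The version-chain read just described can be sketched as follows. This is an illustrative outline only, not KODA source; the record layout, the field names, and the visibility rule (a snapshot sees the newest version whose TSN does not exceed the snapshot's own TSN) are our assumptions.

```python
# A minimal sketch of reading a record version chain tagged with TSNs.
from dataclasses import dataclass
from typing import Optional

@dataclass
class RecordVersion:
    tsn: int                                   # TSN of the transaction that wrote this version
    data: bytes                                # record contents for this version
    older: Optional["RecordVersion"] = None    # chain to the previous (snapshot-area) version

def snapshot_read(current: RecordVersion, snapshot_tsn: int) -> Optional[bytes]:
    """Return the newest version written no later than the snapshot's TSN.

    'current' is the version in the live database area; older versions live
    in the snapshot area and are reached by following the chain.
    """
    version = current
    while version is not None:
        if version.tsn <= snapshot_tsn:        # assumed visibility rule
            return version.data
        version = version.older                # walk back to an older version
    return None                                # record did not exist at snapshot time

# Example: a backup running as snapshot TSN 41 sees the version written by
# TSN 37, not the newer update written by TSN 44.
v_old = RecordVersion(tsn=37, data=b"balance=100")
v_new = RecordVersion(tsn=44, data=b"balance=90", older=v_old)
assert snapshot_read(v_new, snapshot_tsn=41) == b"balance=100"
```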


Here we see a trade-off between update transactions and snapshot transactions. The database is completely available to snapshot transactions if all update transactions always write snapshot copies. On the other hand, if the deferred snapshot mode is enabled, update transactions need not write snapshot copies if a snapshot transaction is not active. This approach obviously results in some loss of availability to snapshot transactions.

Verification
Database corruption can also result in downtime. Although database corruption is not probable, it is possible. Any database system that supports critical data must provide facilities to ensure the consistency of the database. Digital's database management systems provide verification utilities that scan the database to check the structural consistency of the database. These utilities may also be executed on-line through the use of snapshot transactions.

AIJ Backup
The backup and the AIJ log are the two mechanisms that provide media recovery for Digital's database management products. The AIJ file is continuously written to by all user processes updating the database. We need to provide some ability to back up the AIJ file since it monotonically increases in size and eventually fills up the disk. Digital's database systems offer the ability to back up the AIJ file to tape (or another device) on-line. The only restriction is that a quiet point must be established for a short period during which the backup operation takes place. A quiet point is defined as a point when the database is quiescent, i.e., there are no active transactions.

On-line Schema Changes
Digital's database management systems allow users to change metadata on-line, while users are still accessing the database. Although this may be standard for relational database management systems, it is not standard for network databases. The VAX DBMS system provides a utility called the database restructuring utility (DRU) to allow for on-line schema modifications.

Acknowledgments
Many engineers have contributed to the development of the algorithms described in this paper. We have chosen not to enumerate all such contributions. However, we would like to recognize the contributions of Peter Spiro, Ashok Joshi, Jeff Arnold, and Rick Anderson who, together with the authors, are members of the KODA team.

References
1. P. Bernstein, W. Emberton, and V. Trehan, "DECdta - Digital's Distributed Transaction Processing Architecture," Digital Technical Journal, vol. 3, no. 1 (Winter 1991, this issue): 10-17.
2. T. Speer and M. Storm, "Digital's Transaction Processing Monitors," Digital Technical Journal, vol. 3, no. 1 (Winter 1991, this issue): 18-32.

Peter M. Spiro   Ashok M. Joshi   T. K. Rengarajan

Designing an Optimized Transaction Commit Protocol

Digital's database products, VAX Rdb/VMS and VAX DBMS, share the same database kernel called KODA. KODA uses a grouping mechanism to commit many concurrent transactions together. This feature enables high transaction rates in a transaction processing (TP) environment. Since group commit processing affects the maximum throughput of the transaction processing system, the KODA group designed and implemented several grouping algorithms and studied their performance characteristics. Preliminary results indicate that it is possible to achieve up to a 66 percent improvement in transaction throughput by using more efficient grouping designs.

Digital has two general-purpose database products, Rdb/VMS software, which supports the relational data model, and VAX DBMS software, which supports the CODASYL (Conference on Data Systems Languages) data model. Both products layer on top of a database kernel called KODA. In addition to other database services, KODA provides the transaction capabilities and commit processing for these two products.

In this paper, we address some of the issues relevant to efficient commit processing. We begin by explaining the importance of commit processing in achieving high transaction throughput. Next, we describe in detail the current algorithm for group commit used in KODA. We then describe and contrast several new designs for performing a group commit. Following these discussions, we present our experimental results. And, finally, we discuss the possible direction of future work and some conclusions. No attempt is made to present formal analysis or exhaustive empirical results for commit processing; rather, the focus is on an intuitive understanding of the concepts and trade-offs, along with some empirical results that support our conclusions.

Commit Processing
To follow a discussion of commit processing, two basic terms must first be understood. We begin this section by defining a transaction and the "moment of commit."

A transaction is the execution of one or more statements that access data managed by a database system. Generally, database management systems guarantee that the effects of a transaction are atomic, that is, either all updates performed within the context of the transaction are recorded in the database, or no updates are reflected in the database.

The point at which a transaction's effects become durable is known as the "moment of commit." This concept is important because it allows database recovery to proceed in a predictable manner after a transaction failure. If a transaction terminates abnormally before it reaches the moment of commit, then it aborts. As a result, the database system performs transaction recovery, which removes all effects of the transaction. However, if the transaction has passed the moment of commit, recovery processing ensures that all changes made by the transaction are permanent.

Transaction Profile
For the purpose of analysis, it is useful to divide a transaction processed by KODA into four phases: the transaction start phase, the data manipulation phase, the logging phase, and the commit processing phase. Figure 1 illustrates the phases of a transaction in time sequence. The first three phases are collectively referred to as "the average transaction's CPU cost (excluding the cost of commit)" and the last phase (commit) as "the cost of writing a group commit buffer."


Figure 1   Phases in the Execution of a Transaction

The transaction start phase involves acquiring a transaction identifier and setting up control data structures. This phase usually incurs a fixed overhead.

The data manipulation phase involves executing the actions dictated by an application program. Obviously, the time spent in this phase and the amount of processing required depend on the nature of the application.

At some point a request is made to complete the transaction. Accordingly, in KODA, the transaction enters the logging phase, which involves updating the database with the changes and writing the undo/redo information to disk. The amount of work done in the logging phase is usually small and constant (less than one I/O) for transaction processing.

Finally, the transaction enters the commit processing phase. In KODA, this phase involves writing commit information to disk, thereby ensuring that the transaction's effects are recorded in the database and now visible to other users.

For some transactions, the data manipulation phase is very expensive, possibly requiring a large number of I/Os and a great deal of CPU time. For example, if 500 employees in a company were to get a 10 percent salary increase, a transaction would have to fetch and modify every employee/salary record in the company database. The commit processing phase, in this example, represents 0.2 percent of the transaction duration. Thus, for this class of transaction, commit processing is a small fraction of the overall cost. Figure 2 illustrates the profile of a transaction modifying 500 records.

Figure 2   Profile of a Transaction Modifying 500 Records

In contrast, for transaction processing applications such as hotel reservation systems, banking applications, stock market transactions, or the telephone system, the data manipulation phase is usually short (requiring few I/Os). Instead, the logging and commit phases comprise the bulk of the work and must be optimized to allow high transaction throughput. The transaction profile for a transaction modifying one record is shown in Figure 3. Note that the commit processing phase represents 36 percent of the transaction duration, in this example.

Figure 3   Profile of a Transaction Modifying One Record

Group Commit
Generally, database systems must force-write information to disk in order to commit transactions. In the event of a failure, this operation permits recovery processing to determine which failed transactions were active at the time of their termination and which ones had reached their moment of commit. This information is often in the form of lists of transaction identifiers, called commit lists.

Many database systems perform an optimized version of commit processing where commit information for a group of transactions is written to disk in one I/O operation, thereby amortizing the cost of the I/O across multiple transactions. So, rather than having each transaction write its own commit list to disk, one transaction writes to disk a commit list containing the commit information for a number of other transactions. This technique is referred to in the literature as "group commit."

Group commit processing is essential for achieving high throughput. If every transaction that reached the commit stage had to actually perform an I/O to the same disk to flush its own commit information, the throughput of the database system would be limited to the I/O rate of the disk. A magnetic disk is capable of performing 30 I/O operations per second. Consequently, in the absence of group commit, the throughput of the system is limited to 30 transactions per second (TPS). Group commit is essential to breaking this performance barrier.


There are several variations of the basic algorithms for grouping multiple commit lists into a single I/O. The specific group commit algorithm chosen can significantly influence the throughput and response times of transaction processing. One study reports throughput gains of as much as 25 percent by selecting an optimal group commit algorithm.1

At high transaction throughput (hundreds of transactions per second), efficient commit processing provides a significant performance advantage. There is little information in the database literature about the efficiency of different methods of performing a group commit. Therefore, we analyzed several grouping designs and evaluated their performance benefits.

Factors Affecting Group Commit
Before proceeding to a description of the experiments, it is useful to have a better understanding of the factors affecting the behavior of the group commit mechanism. This section discusses the group size, the use of timers to stall transactions, and the relationship between these two factors.

Group Size   An important factor affecting group commit is the number of transactions that participate in the group commit. There must be several transactions in the group in order to benefit from I/O amortization. At the same time, transactions should not be required to wait too long for the group to build up to a large size, as this factor would adversely affect throughput.

It is interesting to note that the incremental advantage of adding one more transaction to a group decreases as the group size increases. The incremental savings is equal to 1/(G x (G + 1)), where G is the initial group size. For example, if the group consists of 2 transactions, each of them does one-half a write. If the group size increases to 3, the incremental savings in writes will be (1/2 - 1/3), or 1/6 per transaction. If we do the same calculation for a group size incremented from 10 to 11, the savings will be (1/10 - 1/11), or 1/110 of a write per transaction.

In general, if G represents the group size, and I represents the number of I/Os per second for the disk, the maximum transaction commit rate is I x G TPS. For example, if the group size is 45 and the rate is 30 I/Os per second to disk, the maximum transaction commit rate is 30 x 45, or 1350 TPS. Note that a grouping of only 10 will restrict the maximum TPS to 300 TPS, regardless of how powerful the computer is. Therefore, the group size directly affects the maximum transaction throughput of the transaction processing system.

Use of Timers to Stall Transactions   One of the mechanisms to increase the size of the commit group is the use of timers. Timers are used to stall the transactions for a short period of time (on the order of tens of milliseconds) during commit processing. During the stall, more transactions enter the commit processing phase and so the group size becomes larger. The stalls provided by the timers have the advantage of increasing the group size, and the disadvantage of increasing the response time.

Trade-offs   This section discusses the trade-offs between the size of the group and the use of timers to stall transactions. Consider a system where there are 50 active database programs, each repeatedly processing transactions against a database. Assume that on average each transaction takes between 0.4 and 0.5 seconds. Thus, at peak performance, the database system can commit approximately 100 transactions every second, each program actually completing two transactions in the one-second time interval. Also, assume that the transactions arrive at the commit point in a steady stream at different times.

If transaction commit is stalled for 0.2 seconds to allow the commit group to build up, the group then consists of about 20 transactions (0.2 seconds x 100 TPS). In this case, each transaction only incurs a small delay at commit time, averaging 0.10 seconds, and the database system should be able to approach its peak throughput of 100 TPS. However, if the mechanism delays commit processing for one second, an entirely different behavior sequence occurs. Since the transactions complete in approximately 0.5 seconds, they accumulate at the commit stall and are forced to wait until the one-second stall completes. The group size then consists of 50 transactions, thereby maximizing the I/O amortization. However, throughput is also limited to 50 TPS, since a group commit is occurring only once per second.

Thus, it is necessary to balance response time and the size of the commit group. The longer the stall, the larger the group size; the larger the group size, the better the I/O amortization that is achieved. However, if the stall time is too long, it is possible to limit transaction throughput because of wasted CPU cycles.
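The grouping arithmetic above can be reproduced with a few lines of code. The sketch below simply restates the formulas from the text, using the paper's example numbers; it is not part of KODA.

```python
# Back-of-the-envelope sketch of the group-commit arithmetic described above.

def incremental_write_savings(g: int) -> float:
    """Savings in commit writes per transaction when a group of size g grows to g + 1."""
    return 1.0 / g - 1.0 / (g + 1)          # equals 1 / (g * (g + 1))

def max_commit_rate(disk_ios_per_second: float, group_size: int) -> float:
    """Maximum transaction commit rate (TPS) = I x G."""
    return disk_ios_per_second * group_size

print(incremental_write_savings(2))         # 1/6 of a write per transaction
print(incremental_write_savings(10))        # 1/110 of a write per transaction
print(max_commit_rate(30, 45))              # 1350 TPS
print(max_commit_rate(30, 10))              # 300 TPS
```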


Motivation for Our Work
The concept of using commit timers is discussed in great detail by Reuter.1 However, there are significant differences between his group commit scheme and our scheme. These differences prompted the work we present in this paper.

In Reuter's scheme, the timer expiration triggers the group commit for everyone. In our scheme, no single process is in charge of commit processing based on a timer. Our commit processing is performed by one of the processes desiring to write a commit record. Our designs involve coordination between the processes in order to elect the group committer (a process).

Reuter's analysis to determine the optimum value of the timer based on system load assumes that the total transaction duration, the time taken for commit processing, and the time taken for performing the other phases are the same for all transactions. In contrast, we do not make that assumption. Our designs strive to adapt to the execution of many different transaction types under different system loads. Because of the complexity introduced by allowing variations in transaction classes, we do not attempt to calculate the optimal timer values as does Reuter.

Cooperative Commit Processing
In this section, we present the stages in performing the group commit with cooperating processes, and we describe, in detail, the grouping design currently used in KODA, the Commit-Lock Design.

Group Committer
Assume that a number of transactions have completed all data manipulation and logging activity and are ready to execute the commit processing phase. To group the commit requests, the following steps must be performed in KODA:

1. Each transaction must make its commit information available to the group committer.

2. One of the processes must be selected as the "group committer."

3. The other members of the group need to be informed that their commit work will be completed by the group committer. These processes must wait until the commit information is written to disk by the group committer.

4. Once the group committer has written the commit information to stable storage, it must inform the other members that commit processing is completed.

Commit-Lock Design
The Commit-Lock Design uses a VMS lock to generate groups of committing transactions; the lock is also used to choose the group committer.

Once a process completes all its updates and wants to commit its transaction, the procedure is as follows. Each transaction must first declare its intent to join a group commit. In KODA, each process uses the interlocked queue instructions of the VAX system running VMS software to enqueue a block of commit information, known as a commit packet, onto a globally accessible commit queue. The commit queue and the commit packets are located in a shared, writeable global section.

Each process then issues a lock request for the commit lock. At this point, a number of other processes are assumed to be going through the same sequence; that is, they are posting their commit packets and making lock requests for the commit lock. One of these processes is granted the commit lock. For the time being, assume that the process that acquires the lock acts as the group committer.

The group committer first counts the number of entries on the commit queue, providing the number of transactions that will be part of the group commit. Because of the VAX interlocked queue instructions, scanning to obtain a count and concurrent queue operations by other processes can proceed simultaneously. The group committer uses the information in each commit packet to format the commit block which will be written to disk. In KODA, the commit block is used as a commit list, recording which transactions have committed and which ones are active. In order to commit for a transaction, the group committer must mark each current transaction as completed. In addition, as an optimization, the group committer assigns a new transaction identifier for each process's next transaction. Figure 4 illustrates a commit block ready to be flushed to disk.

Once the commit block is modified, the group committer writes it to disk in one atomic I/O. This is the moment of commit for all transactions in the group. Thus, all transactions that were active and took part in this group commit are now stably marked as committed. In addition, as explained above, these transactions now have new transaction identifiers. Next, the group committer sets a commit flag in each commit packet for all recently committed transactions, removes all commit packets from the commit queue, and, finally, releases the commit lock. Figure 5 illustrates a committed group with new transaction identifiers and with commit flags set.
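The commit-lock procedure can be summarized in the following sketch. It is a simplified illustration rather than Digital's implementation: ordinary in-process locks and events stand in for the VAX interlocked queue instructions, the shared global section, the VMS lock, and the commit flags, and names such as CommitPacket and flush_commit_block are ours.

```python
# Simplified sketch of the Commit-Lock grouping idea using threads.
import threading
from collections import deque

class CommitPacket:
    def __init__(self, tid):
        self.tid = tid
        self.committed = threading.Event()    # plays the role of the commit flag

commit_queue = deque()                        # stands in for the global commit queue
queue_guard = threading.Lock()                # stands in for interlocked queue operations
commit_lock = threading.Lock()                # stands in for the VMS commit lock

def flush_commit_block(packets):
    """Pretend to write one commit block to disk in a single atomic I/O."""
    pass                                      # a real system would force-write here

def commit_transaction(tid):
    packet = CommitPacket(tid)
    with queue_guard:
        commit_queue.append(packet)           # declare intent to join a group commit

    with commit_lock:                         # request the commit lock
        if packet.committed.is_set():
            return                            # another group committer did our work
        # We are the group committer: drain the queue and commit the whole group.
        with queue_guard:
            group = list(commit_queue)
            commit_queue.clear()
        flush_commit_block(group)             # moment of commit for the whole group
        for p in group:
            p.committed.set()                 # set each member's commit flag

# Example: several concurrent committers form groups automatically.
threads = [threading.Thread(target=commit_transaction, args=(i,)) for i in range(8)]
for t in threads: t.start()
for t in threads: t.join()
```

As in the text, waiters are effectively serialized behind the lock: each one is granted it in turn, notices its commit flag is already set, and releases it.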


Figure 4   Commit Block Ready to be Flushed to Disk (commit packets on the commit queue carry a current transaction identifier, CURR_TID, a next transaction identifier, NEXT_TID, and a commit flag, COMMIT_FLG; the commit group is formatted into the commit block)

At this point, the remaining processes that were part of the group commit are, in turn, granted the commit lock. Because their commit flags are already set, these processes realize they do not need to perform a commit and, thus, release the commit lock and proceed to the next transaction. After all these committed processes release the commit lock, a process that did not take part in the group commit acquires the lock, notices it has not been committed, and, therefore, initiates the next group commit.

There are several interesting points about using the VMS lock as the grouping mechanism. Even though all the transactions are effectively committed after the commit block I/O has completed, the transactions are still forced to proceed serially; that is, each process is granted the lock, notices that it is committed, and then releases the lock. So there is a serial procession of lock enqueues/dequeues before the next group can start.

Figure 5   Committed Group (commit flags set and new transaction identifiers assigned in each commit packet)


This serial procession can be made more concurrent by, first, requesting the lock in a shared mode, hoping that all processes committed are granted the lock in unison. However, in practice, some processes that are granted the lock are not committed. These processes must then request the lock in an exclusive mode. If this lock request is mastered on a different node in a VAXcluster system, the lock enqueues/dequeues are very expensive.

Also, there is no explicit stall time built into the algorithm. The latency associated with the lock enqueue/dequeue requests allows the commit queue to build up. This stall is entirely dependent on the contention for the lock, which in turn depends on the throughput.

Group Commit Mechanisms - Our New Designs
To improve on the transaction throughput provided by the Commit-Lock Design, we developed three different grouping designs, and we compared their performances at high throughput. Note that the basic paradigm of group commit for all these designs is described in the Group Committer section. Our designs are as follows.

Commit-Stall Design
In the Commit-Stall Design, the use of the commit lock as the grouping mechanism is eliminated. Instead, a process inserts its commit packet onto the commit queue and, then, checks to see if it is the first process on the queue. If so, the process acts as the group committer. If not, the process schedules its own wake-up call, then sleeps. Upon waking, the process checks to see if it has been committed. If so, the process proceeds to its next transaction. If not, the process again checks to see if it is first on the commit queue. The algorithm then repeats, as described above.

This method attempts to eliminate the serial wake-up behavior displayed by using the commit lock. Also, the duration for which each process stalls can be varied per transaction to allow explicit control of the group size. Note that if the stall time is too small, a process may wake up and stall many times before it is committed.
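A rough sketch of the Commit-Stall loop is shown below. It is illustrative only; the stall length, the helper names, and the in-process queue guard are assumptions standing in for the interlocked queue operations and scheduled wake-up calls described above.

```python
# Illustrative sketch of the Commit-Stall design: no lock, the head of the
# commit queue acts as group committer, everyone else sleeps and rechecks.
import threading, time
from collections import deque

commit_queue = deque()
queue_guard = threading.Lock()
STALL_SECONDS = 0.02                 # tens of milliseconds, per the text

class CommitPacket:
    def __init__(self, tid):
        self.tid = tid
        self.committed = False

def flush_commit_block(packets):
    pass                             # stands in for the single atomic commit I/O

def commit_transaction(tid):
    packet = CommitPacket(tid)
    with queue_guard:
        commit_queue.append(packet)
    while True:
        with queue_guard:
            if packet.committed:
                return               # another group committer finished our work
            i_am_first = bool(commit_queue) and commit_queue[0] is packet
            if i_am_first:
                group = list(commit_queue)
                commit_queue.clear()
        if i_am_first:
            flush_commit_block(group)        # moment of commit for the group
            with queue_guard:
                for p in group:
                    p.committed = True
            return
        time.sleep(STALL_SECONDS)    # schedule our own wake-up, then recheck
```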
Willing-to-Wait Design
As we have seen before, a delay in the commit sequence is a convenient means of converting a response time advantage into a throughput gain. If we increase the stall time, the transaction duration increases, which is undesirable. At the same time, the grouping size for group commit increases, which is desirable. The challenge is to determine the optimal stall time. Reuter presented an analytical way of determining the optimal stall time for a system with transactions of the same type.1

Ideally, we would like to devise a flexible scheme that makes the trade-off we have just described in real time and determines the optimum commit stall time dynamically. However, we cannot determine the optimum stall time automatically, because the database management system cannot judge which is more important to the user in a general customer situation - the transaction response time or the throughput.

The Willing-to-Wait Design provides a user parameter called WTW time. This parameter represents the amount of time the user is willing to wait for the transaction to complete, given that this wait will benefit the complete system by increasing throughput. WTW time may be specified by the user for each transaction. Given such a user specification, it is easy to calculate the commit stall to increase the group size. This stall equals the WTW time minus the time taken by the transaction thus far, but only if the transaction has not already exceeded the WTW time. For example, if a transaction comes to commit processing in 0.5 second and the WTW time is 2.0 seconds, the stall time is then 1.5 seconds. In addition, we can make a further improvement by reducing the stall time by the amount of time needed for group commit processing. This delta time is constant, on the order of 50 milliseconds (one I/O plus some computation).

The WTW parameter gives the user control over how much of the response time advantage (if any) may be used by the system to improve transaction throughput. The choice of an abnormally high value of WTW by one process only affects its own transaction response time; it does not have any adverse effect on the total throughput of the system. A low value of WTW would cause small commit groups, which in turn would limit the throughput. However, this can be avoided by administrative controls on the database that specify a minimum WTW time.
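The stall computation described above amounts to a one-line formula. The sketch below restates it with the paper's example values; the function and constant names are ours.

```python
# Sketch of the Willing-to-Wait stall computation described above.
GROUP_COMMIT_DELTA = 0.050   # seconds: roughly one I/O plus some computation

def wtw_stall(elapsed_seconds: float, wtw_seconds: float) -> float:
    """Stall = WTW time minus time already spent, minus the commit-processing
    delta, but never negative (no stall if the WTW time is already exceeded)."""
    return max(0.0, wtw_seconds - elapsed_seconds - GROUP_COMMIT_DELTA)

# The paper's example: 0.5 s spent so far with a 2.0 s WTW time gives a stall
# of about 1.5 s (1.45 s after subtracting the 50 ms commit-processing delta).
print(wtw_stall(0.5, 2.0))   # 1.45
```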


Hiber Design
The Hiber Design is similar to the Commit-Stall Design, but, instead of each process scheduling its own wake-up call, the group committer wakes up all processes in the committed group. In addition, the group committer must wake up the process that will be the next group committer.

Note, this design exhibits a serial wake-up behavior like the Commit-Lock Design; however, the mechanism is less costly than the VMS lock used by the Commit-Lock Design. In the Hiber Design, if a process is not the group committer, it simply sleeps; it does not schedule its own wake-up call. Therefore, each process is guaranteed to sleep and wake up at most once per commit, in contrast to the Commit-Stall Design. Another interesting characteristic of the Hiber Design is that the group committer can choose to either wake up the next group committer immediately, or it can actually schedule the wake-up call after a delay. Such a delay allows the next group size to become larger.

Experiments
We implemented and tested the Commit-Lock, the Commit-Stall, and the Willing-to-Wait designs in KODA. The objectives of our experiments were:

• To find out which design would yield the maximum throughput under response time constraints

• To understand the performance characteristics of the designs

In the following sections, we present the details of our experiments, the results we obtained, and some observations.

Details of the Experiments
The hardware used for all of the following tests was a VAX 6340 with four processors, each rated at 3.6 VAX units of performance (VUP). The total possible CPU utilization was 400 percent, and the total processing power of the computer was 14.4 VUPs. As the commit processing becomes more significant in a transaction (in relation to the other phases), the impact of the grouping mechanism on the transaction throughput increases. Therefore, in order to accentuate the performance differences between the various designs, we performed our experiments using a transaction that involved no database activity except to follow the commit sequence. So, for all practical purposes, the TPS data presented in this paper can be interpreted as "commit sequences per second." Also, note that our system imposed an upper limit of 50 on the grouping size.

Results
Using the Commit-Lock Design, transaction processing bottlenecked at 300 TPS. Performance greatly improved with the Commit-Stall Design; the maximum throughput was 464 TPS. The Willing-to-Wait Design provided the highest throughput, 500 TPS. Using this last design, it was possible to achieve up to a 66 percent improvement over the less-efficient Commit-Lock Design. Although both timer schemes, i.e., the Commit-Stall and Willing-to-Wait designs, needed tuning to set the parameters and the Commit-Lock Design did not, we observed that the maximum throughput obtained using timers is much better than that obtained with the lock. These results were similar to those of Reuter.

For our Willing-to-Wait Design, the minimum transaction duration is the WTW time. Therefore, the maximum TPS, the number of servers, and the WTW stall time, measured in milliseconds, are related by the formula: number of servers x 1000/WTW = maximum TPS. For example, our maximum TPS for the WTW design was obtained with 50 servers and 90 milliseconds WTW time. Using the formula, 50 x 1000/90 = 555. The actual TPS achieved was 500, which is 90 percent of the maximum TPS. This ratio is also a measure of the effectiveness of the experiment.

During our experiments, the maximum group size observed was 45 (with the Willing-to-Wait Design). This is close to the system-imposed limit of 50 and, so, we may be able to get better grouping with higher limits on the size of the group.

Observations
In the Commit-Stall and the Willing-to-Wait designs, given a constant stall, if the number of servers is increased, the TPS increases and then decreases. The rate of decrease is slower than the rate of increase. The TPS decrease is due to CPU overloading. The TPS increase is due to more servers trying to execute transactions and better CPU utilization. Figure 6 illustrates how TPS varies with the number of servers, given a constant stall time.

Again, in the stalling designs, for a constant number of servers, if the stall is increased, the TPS increases and then decreases. The TPS increase is due to better grouping and the decrease is due to CPU underutilization. Figures 7 and 8 show the effects on TPS when you vary the commit-stall time or the WTW time, while keeping the number of servers constant.


Figure 6   Transactions per Second in Relationship to the Number of Servers, Given a Constant Willing-to-Wait Time (the Willing-to-Wait stall time is a constant 100 milliseconds)

Figure 8   Transactions per Second in Relationship to the WTW Time, Given a Constant Number of Servers (the number of servers equals 65)

Continue until the CPU is overloaded; then, increase ments of this design are presented in Ta ble 2. As the stall time. CPU utilization decreases, but the we have seen before, the maximum TPS with this TPS increases due to the larger group size. design is inversely proportional to the wrw time, This algorithm demonstrates that increasing while CPU is not fully utilized . The first fo ur rows the number of servers and the stall by small of Ta ble 2 illustrate this behavior. The rest of the amounts at a time increases the TPS, but only up table fo llows the same pattern as Ta ble 1. to a limit. After this point, the TPS drops. When The Willing- to-Wait Design performs slightly close to the limit, the two factors may be varied better than the Commit-Stall Oesign by adjusting alternately in order to find the true maximum. to the variations in the speed at which different Ta ble 1 shows the pe rformance measurements of servers arrive at the commit point. Such variations the Commit-Stall Design. Comments are included are compensated fo r by the variable stalls in the in the table to highlight the performance behavior Willing-to-Wa it Design. Therefore, if the variation the data supports. is high and the commit sequence is a significant The same mountain-climbing algorithm is modi­ portion of the transaction, we expect the Willing­ fied slightly to obtain the maximum TPS with the to-Wa it Design to perfo rm much better than the Willing-to-Wa it Design. The performance measure- Commit-Stall Design.

0 z 440 Future Wo rk 0 0 There is scope fo r more interesting work to fu rther w 400 (/) optimize commit processing in the KODA database a: w (l_ 360 kernel. First, we would like to perform experi­ (/) z ments on the Hiber Design and compare it to the 0 320 i= other designs. Next, we would like to explore ways 0 <{ of combining the Hiber Design wi th either of the (/) 280 z <{ two timer designs, Commit-Stall or Willing-to­ a: 1- 240 Wa it. This may be the best design of all the above, 10 20 30 40 50 60 70 80 with a good mixture of automatic stall, low over­ COMMIT-STALL TIME (MILLISECONDS) head , and explicit control over the total stall time. NOTE THE NUMBER OF SERVERS EQUALS 50 In addition, we would like to investigate the use of timers to ease system management. For example, a Fig ure 7 Tr ansactions per Second in system administrator may increase the stalls fo r Relationship to the Com mit-Stall all transactions on the system in order to ease CPU Ti me, Given a Constant Number contention, thereby increasing the overall effective­ of Servers ness of the system.


Table 1   Commit-Stall Design Performance Data

Number of   Commit Stall     CPU Utilization
Servers     (Milliseconds)   (Percent)*        TPS   Comments
50          20               360               425   Starting point
55          20               375               454   Increased number of servers, therefore, higher TPS
60          20               378               457   Increased number of servers, therefore, CPU saturated
60          30               340               461   Increased stall, therefore, CPU less utilized
65          30               350               464   Increased number of servers, maximum TPS
70          30               360               456   "Over-the-hill" situation, same strategy of further increasing the number of servers does not increase TPS
70          40               330               451   No benefit from increasing number of servers and stall
65          40               329               448   No benefit from just increasing stall

* Four processors were used in the experiments. Thus, the total possible CPU utilization is 400 percent.

Table 2   Willing-to-Wait Performance Data

Number of   Willing-to-Wait Stall   CPU Utilization
Servers     (Milliseconds)          (Percent)*        TPS   Comments
45          100                     285               426   Starting point, CPU not saturated
45          90                      295               466   Decreased stall to load CPU, CPU still not saturated
45          80                      344               498   Decreased stall again
45          70                      363               471   Further decreased stall, CPU almost saturated
50          80                      372               485   Increased number of servers, CPU more saturated
50          90                      340               500   Increased stall to lower CPU usage, maximum TPS
55          90                      349               465   "Over-the-hill" situation, same strategy of further increasing number of servers does not increase TPS
50          100                     324               468   No benefit from just increasing stall

* Four processors were used in the experiments. Thus, the total possible CPU utilization is 400 percent.

Conclusions
We have presented the concept of group commit processing as well as a general analysis of various options available, some trade-offs involved, and some performance results indicating areas for possible improvement. It is clear that the choice of the algorithm can significantly influence performance at high transaction throughput. We are optimistic that with some further investigation an optimal commit sequence can be incorporated into Rdb/VMS and VAX DBMS with considerable gains in transaction processing performance.

Acknowledgments
We wish to acknowledge the help provided by Rabah Mediouni in performing the experiments discussed in this paper. We would like to thank Phil Bernstein and Dave Lomet for their careful reviews of this paper. Also, we want to thank the other KODA group members for their contributions during informal discussions. Finally, we would like to acknowledge the efforts of Steve Klein who designed the original KODA group commit mechanism.

References
1. P. Helland, H. Sammer, J. Lyon, R. Carr, P. Garrett, and A. Reuter, "Group Commit Timers and High Volume Transaction Processing Systems," High Performance Transaction Systems, Proceedings of the 2nd International Workshop (September 1987).
2. D. Gawlick and D. Kinkade, "Varieties of Concurrency Control in IMS/VS Fast Path," Database Engineering (June 1985).

William F. Bruckert   Carlos Alonso   James M. Melvin

Verification of the First Fault-tolerant VAX System

The fault-tolerant character of the VAXft 3000 system required that plans be made early in the development stages for the verification and test of the system. To ensure proper test coverage of the fault-tolerant features, engineers built fault-insertion points directly into the system hardware. During the verification process, test engineers used hardware and software fault insertion in directed and random test forms. A four-phase verification strategy was devised to ensure that the VAXft system hardware and software was fully tested for error recovery that is transparent to applications on the system.

The VAXft 3000 system provides transparent fault tolerance for applications that run on the system. Because the 3000 includes fault-tolerant features, verification of the system was unlike that ordinarily conducted on VAX systems. To facilitate system test, the verification strategy outlined a four-phase approach which would require hardware to be built into the system specifically for test purposes. This paper presents a brief overview of the VAXft system architecture and then describes the methods used to verify the system's fault tolerance.

VAXft 3000 Architectural Overview
The VAXft fault-tolerant system is designed to recover from any single point of hardware failure. Fault tolerance is provided transparently for all applications running under the VMS operating system. This section reviews the implementation of the system to provide background for the main discussion of the verification process.

The system comprises two duplicate systems, called zones. Each zone is a fully functional computer with enough elements to run an operating system. These two zones, referred to as zone A and zone B, are shown in Figure 1, which illustrates the duplication of the system components. The two independent zones are connected by duplicate cross-link cables. The cabinet of each zone also includes a battery, a power regulator, cooling fans, and an AC power input. Each zone's hardware has sufficient error checking to detect all single faults within that zone.

Figure 2 is a block diagram of a single zone with one I/O adapter. Note the portions of the zone labeled dual-rail and single-rail. The dual-rail portions of the system have two independent sets of hardware performing the same operations. Correct operation is verified by comparison. The fault-detection mechanism for the single-rail I/O modules combines checking codes and communication protocols.

The system performs I/O operations by sending and receiving message packets. The packets are exchanged between the CPU and various servers, including disks, Ethernet, and synchronous lines. These message packets are formed and interpreted in the dual-rail portion of the system. They are protected in the single-rail portion of the machine by check codes which are generated and checked in the dual-rail portion of the machine. Corrupted packets can be retransmitted through the same or alternate paths.

In the normal mode of fault-tolerant operation, both zones execute the same instruction at the same time. The four processors (two in each zone) appear to the operating system as a single logical CPU. The hardware supplies the detection and recovery facilities for faults detected in the CPU and memory portions of the system. A defective CPU module and its memory are automatically removed from service by the hardware, and the remaining CPU continues processing.

Error handling for the I/O interconnections is managed differently. The paths to and from I/O adapters are duplicated for checking purposes. If a fault is detected, the hardware retries the operation. If the retry is successful, the error is logged, and operation continues without software assistance.
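This check-code-and-retry scheme, together with the escalation to software recovery when retries fail (described below), can be summarized in a brief sketch. The sketch is illustrative only, not the VAXft firmware interface: the path objects, the CRC-32 check code, and the software_recovery callback stand in for mechanisms the paper does not specify in detail.

import zlib

def send_packet(payload, paths, max_retries=3, software_recovery=None):
    """Illustrative I/O send loop: protect the packet with a check code,
    retry on the same or an alternate path, and fall back to layered
    software recovery if hardware retries fail."""
    check = zlib.crc32(payload)                 # check code generated in the dual-rail logic
    for attempt in range(max_retries):
        path = paths[attempt % len(paths)]      # retry on the same or an alternate path
        received = path.transfer(payload)       # hypothetical single-rail path object
        if received is not None and zlib.crc32(received) == check:
            if attempt > 0:
                print("error log: transient I/O fault recovered by hardware retry")
            return received                     # success, no software assistance needed
    # Hardware retries exhausted: hand the failure to software error recovery.
    if software_recovery is not None:
        return software_recovery(payload, paths)
    raise IOError("unrecoverable I/O fault on all paths")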


Figure 1  A Dual-zone VAXft System (block diagram: zones A and B, each with an AC box, fan, disk or tape, memory modules, and CPU, connected by cross-link cables)

Figure 2  Single-zone Structure of a VAXft 3000 System (block diagram: dual-rail CPU module and memory on the memory interface bus, cross-link cables to zone B, a module interconnect, and firewalls with EDC leading to a single-rail I/O adapter)


If the retry is unsuccessful, the Fault-tolerant System Services (FTSS) software performs error recovery. FTSS is a layered software product that is utilized with every VAXft 3000 system. It provides the software necessary to complete system error recovery. For system recovery from a failed I/O device, an alternate path or device is used. All recoverable faults have an associated maximum threshold value. If this threshold is exceeded, FTSS performs appropriate device reconfiguration.

Verification of a Fault-tolerant VAX System
This section entails a discussion of the types of system tests and the fault-insertion techniques used to ensure the correct operation of the VAXft system. In addition, the four-phase verification strategy and the procedures involved in each phase are reviewed.

There are two types of system tests: directed and random. Directed tests, which test specific hardware or software features, are used most frequently in computer system verification and follow a strict test sequence. Complex systems, however, cannot be completely verified in a directed fashion. As a case in point, an operating system running on a processor has innumerable states. Directed tests verify functional operation under a particular set of conditions. They may not, however, be used to verify that same functionality under all possible system conditions.

In comparison, random testing allows multiple test processes to interact in a pseudo-random or random fashion. In random testing, test coverage is increased with additional run-time. Thus, once the proper test processes are in place, the need to develop additional tests in order to increase coverage is eliminated. This type of testing also reduces the effects of the biases of the engineers generating the tests. While directed testing can provide only a limited level of coverage, this coverage level can be well understood. Random testing offers a potentially unbounded level of coverage; however, quantifying this coverage is difficult if not impossible. To achieve the proper level of verification, the VAXft verification utilized a balance of directed and random testing. Directed testing was used to achieve a certain base level of functionality, and random testing was used to expand the level of coverage.

To permit testing of system fault tolerance in a practical amount of time, some form of fault insertion is required. The reliability of components used in computer systems has been improving, and more importantly, the number of components used to implement any function has been dramatically decreasing. These factors have produced a corresponding reduction in system failure rates. Given the high reliability of today's machines, it is not practical from a verification standpoint to verify a system by letting it run until failures occur.

Conceptually, faults can be inserted in two ways. First, memory locations and registers can be corrupted to mimic the results of gate-level faults (software fault insertion). Second, gate-level faults may be inserted directly into the hardware (hardware fault insertion). There are advantages to both techniques. One advantage of software-implemented fault insertion is that no embedded hardware support is required. The advantage of hardware fault insertion, on the other hand, is that faults are more representative of actual hardware failures and can reveal unanticipated side effects from a gate-level failure. To utilize hardware fault insertion, either a mechanism must be designed into the system, or an external insertion device must be developed once the hardware is available. Given the physical feature size of the components used today, it is virtually impossible to achieve adequate fault-insertion coverage through an external fault-insertion mechanism.

The error detection and recovery mechanism determines which fault-insertion technique is suitable for each component. Some examples illustrate this point. For the lockstep portion of the VAXft 3000 CPUs, software fault insertion is not suitable because the lockstep functionality prevents corruption of memory or registers when faults occur. Therefore, hardware faults cannot be mimicked by modifying memory contents. However, the software fault-insertion technique was suitable to test the I/O adapters since the system handles faults in the adapters by detecting the corruption of data. Hardware fault insertion was not suitable because the I/O adapters were implemented with standard components that did not support hardware fault insertion.

Because the verification strategy for the 3000 was considered a fundamental part of the system development effort, fault-insertion points were built directly into the system hardware. The amount of logic necessary to implement fault insertion is relatively small. The goals of the fault-insertion hardware were to


• Eliminate any corruption of the environment under test that could result from fault insertion. For example, if a certain type of system write operation is required to insert a fault, then every test case will be done on a system that is in a "post-fault-insertion" state.

• Enable the user to distribute faults randomly across the system.

• Allow insertion of faults during system operation.

• Enable testing of transient and solid faults.

The fault-insertion points are accessed through a separate serial interface bus isolated from the operating hardware. This separate interface ensures that the environment under test is unbiased by fault insertion.

Even with hardware support for fault insertion, only a small number of fault-insertion points can be implemented relative to the total number possible. Where the number of fault-insertion points is small, the selection of the fault-insertion points is important to achieve a random distribution. Fault-insertion points were designed into most of the custom chips in the VAXft system. When the designers were choosing the fault-insertion points, a single bit of a data path was considered sufficient for data path coverage. Since a significant portion of the chip area is consumed by data paths, a high level of coverage of each chip was achieved with relatively few fault-insertion points. The remaining fault-insertion points could then be applied to the control logic. Coverage of this logic was important because control logic faults result in error modes that are more unpredictable than data path failures.

The effect that a given fault has on the system depends on the current system operation and when in that operation the fault was inserted. In the 3000, for example, a failure of bit 3 in a data path will have significantly different behavior depending upon whether the data bit was incorrect during the address transmission portion of a cycle or during the succeeding data portion. Therefore, the timing of the fault insertion was pseudo-random. The choice of pseudo-random insertion was based on the fact that the fault-insertion hardware operated asynchronously to the system under test. This meant that faults could be inserted at any time, without correlation to the activity of the system under test.

Faults may be transient or solid in nature. For design purposes, a solid fault was defined as a failure that will be present on retry of an operation. A transient fault was defined as a fault that will not be present on retry of the operation. Transient faults do not require the removal of the device that experienced the fault; solid faults do require device removal. Since the system reacts differently to transient and hard faults, both types of faults had to be verified in the VAXft system. Therefore, it was required that the fault-insertion hardware be capable of inserting solid or transient faults. Solid faults were inserted by continually applying the fault-insertion signal. Transient faults were inserted by applying the fault-insertion signal only until the machine detected an error.
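The insertion mechanics described above (pseudo-random timing, solid faults asserted continuously, transient faults asserted only until an error is detected) can be summarized in a short sketch. The fault_bus interface and its methods (assert_fault, release_fault, error_detected) are a hypothetical abstraction of the separate serial fault-insertion bus, not its actual register-level protocol.

import random
import time

def insert_fault(fault_bus, point, solid, timeout=5.0):
    """Illustrative fault-insertion controller; it runs asynchronously to
    the system under test, so the moment of insertion is effectively
    pseudo-random with respect to the system's activity."""
    time.sleep(random.uniform(0.0, 1.0))    # insertion time uncorrelated with system activity
    fault_bus.assert_fault(point)           # hypothetical call on the separate serial bus
    if solid:
        return                              # solid fault: leave the signal continually applied
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:      # transient fault: hold the signal only until
        if fault_bus.error_detected():      # the machine detects an error
            break
        time.sleep(0.001)
    fault_bus.release_fault(point)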


As noted earlier, the verification strategy utilized both hardware and software fault insertion. The hardware fault-insertion mechanisms allowed faults to be inserted into any system environment, including diagnostics, exercisers, and the VMS operating system. As such, it was used for initial verification as well as regression testing of the system. The verification strategy for the VAXft 3000 system involved a multiphase effort. Each of the following four verification phases built upon the previous phase:

1. Hardware verification under simulation

2. Hardware verification with system exerciser and fault insertion

3. System software verification with fault insertion

4. System application verification with fault insertion

Figure 3 shows the functional layers of the VAXft 3000 system in relation to the verification phases. The numbered brackets to the right of the diagram correlate to the testing coverage of each layer. For example, the system software verification, phase 3, verified the VMS system, Fault-tolerant System Services (FTSS), and the hardware platform.

The following sections briefly describe the four phases of the VAXft verification.

Figure 3  Functional Layers of the VAXft 3000 System in Relation to the Verification Phases (layers, from top to bottom: user application; host-based volume shadowing; VMS 5.4; Fault-tolerant System Services; VAXft 3000 hardware, with brackets marking the test-phase coverage of each layer)

Hardware Verification under Simulation
Functional design verification using software simulation is inherently slow in a design as large as the VAXft 3000 system. To use resources most efficiently, a verification effort must incorporate a number of different modeling levels, which means trading off detail to achieve other goals such as speed.

VAXft 3000 simulation occurred at two levels: the module level and the system level. Module-level simulation verified the base functionality of each module. Once this verification was complete, a system-level model was produced to validate the intermodule functionality. The system-level model consisted of a full dual-rail, dual-zone system with an I/O adapter in each zone. At the final stage, full system testing was performed.

More than 500 directed error test cases were developed for gate-level system simulation. For each test, the test environment was set up on a fully operational system model, and then the fault was inserted. A simulation controller was developed to coordinate the system operations in the simulation environment. The simulation controller provided the following control over the testing:

• Initialization of all memory elements and certain system registers to reduce test time

• Setup of all memory data buffers to be used in testing

• Automated test execution

• Automated checking of test results

• Log of test results

For each test case, the test environment was selected from the following: memory testing, I/O register access, direct memory access (DMA) traffic, and interrupt cycles. In any given test case, any number of the previous tests could be run. These environments could be run with or without faults inserted. In addition, each environment consisted of multiple test cases. In an error handling test case, the proper system environment required for the test was set, and then the fault was inserted into the system. The logic simulator used was designed to verify logic design. When an illegal logic condition was detected, it produced an error response. When a fault insertion resulted in an illegal logic condition, the simulator responded by invalidating the test. Because of this, a great deal of time was spent to ensure that faults were inserted in a way that would not generate illegal conditions. Each test case was considered successful only when the system error registers contained the correct data and the system had the ability to continue operation after the fault.
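Each directed error case therefore followed the same pattern: establish a test environment, insert one fault, then judge success by correct error-register contents and continued operation. A schematic sketch of that pattern follows; the model object and its methods are placeholders for the simulation controller, whose interface is not described in this paper.

def run_directed_error_case(model, setup, fault, expected_error_regs):
    """Illustrative directed error test case: success requires both correct
    error-register contents and the ability to continue operation."""
    model.initialize_memory_and_registers()    # reduce test time
    setup(model)                               # memory test, I/O registers, DMA traffic, interrupts
    model.insert_fault(fault)
    model.run_until_quiescent()
    regs_ok = model.read_error_registers() == expected_error_regs
    still_running = model.can_continue_operation()
    return regs_ok and still_running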


Hardware Verification with System Exerciser and Fault Insertion
After the prototypes were available, the verification effort shifted from simulation to fault insertion on the hardware. The goal was to insert faults using an exerciser that induced stressful, reproducible hardware activity and that allowed us to analyze and debug the fault easily.

Exerciser test cases were developed to stress the various hardware functions. The tests were designed to create maximum interrupt and data transfer activity between the CPU and the I/O adapters. These functions could be tested individually or simultaneously. The exerciser scheduler provided a degree of randomness such that the interaction of functions was representative of a real operating system. The fault-insertion hardware was used to achieve a random distribution of fault cases across the system.

Because it was possible to insert initial faults while specific functions were performed, a great degree of reproducibility was achieved that aided debug efforts. Once the full suite of tests worked correctly, fault insertion was performed while the system continually switched between all functions. This testing was more representative of actual faults in customer environments, but was less reproducible.

As previously mentioned, the hardware fault-insertion tool allowed the insertion of both transient and solid failures. The VAXft 3000 hardware recovers from transient failures and utilizes software recovery for hard failures. Since the goal of phase 2 testing was to verify the hardware, the focus was on transient fault insertion. Two criteria for each error case determined the success of the test. First and foremost, the system must continue to run and to produce correct results. Second, the error data that the system captures must be correct based on the fault that was inserted. Correct error data is important because it is used to identify the failing component both for software recovery and for servicing.

Although the simulation environment of phase 1 was substantially slower than phase 2, it provided the designers with more information. Therefore, when problems were discovered on the prototypes used in phase 2, the failing case was transferred to the simulator for further debugging. The hardware verification also validated the models and test procedures used in the simulation environment.

System Software Verification with Fault Insertion
In parallel with hardware verification, the VAXft 3000 system software error handling capabilities were tested. This phase represented the next higher level of testing. The goal was to verify the VAX functionality of the 3000 system as well as the software recovery mechanisms.

Digital has produced various test packages to verify VAX functionality. Since the VAXft 3000 system incorporates a VAX chip set used in the VAX 6000 series, it was possible to use several standard test packages that had been used to verify that system. Fault-tolerant verification, however, was not addressed by any of the existing test packages. Therefore, additional tests were developed by combining the existing functional test suite with the hardware fault-insertion tool and software fault-insertion routines. Test cases used included cache failure, clock failure, memory failure, interconnect failures, and disk failures. These failures were applied to the system during various system operations. In addition, servicing errors were also tested by removing cables and modules while the system was running. The completion criteria for tests included the following:

• Detection of the fault

• Isolation of the failed hardware

• Continuation of the test processes without interruption

System Application Verification with Fault Insertion
The goals for the final phase of the VAXft 3000 verification were to run an application with fault insertion and to demonstrate that any system fault recovery action had no effect on the process integrity and data integrity of the application. The application used in the testing was based on the standard DebitCredit banking benchmark and was implemented using the DECintact layered product. The bank has 10 branches, 100 tellers, and 3,600 customer accounts (10 tellers and 360 accounts per branch). Traffic on the system was simulated using terminal emulation process (VAX RTE) scripts representing bank teller activity. The transaction rate was initially one transaction per second (TPS) and was varied up to the maximum TPS rate to stress the system load.

The general test process can be described as follows:

1. Started application execution. The terminal emulation processes emulating the bank tellers were started and continued until the system was operating at the desired TPS rating.

2. Invoked fault insertion. A fault was selected at random from a table of hardware and software faults. The terminal emulation process submitted stimuli to the application before, during, and after fault insertion.

3. Stopped terminal emulation process. The application was run until a quiescent state was reached.

4. Performed result validation. The process integrity and data integrity of the application were validated.

All the meaningful events were logged and time-stamped during the experiments. Process integrity was proved by verifying continuity of transaction processing through failures. The time stamps on the transaction executions and the system error logs allowed these two independent processes to be correlated.

The proof of data integrity consisted of using the following consistency rules for transactions (a sketch of such a check follows the list):

1. The sum of the account balances is equal to the sum of the teller balances, which is equal to the sum of the branch balances.

2. For each branch, the sum of the teller balances is equal to the branch balance.

3. For each transaction processed, a new record must be added to the history file.
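These three rules lend themselves to a mechanical check after each test run. The sketch below is illustrative only: it assumes simple in-memory tables (accounts, tellers, branches, history) rather than the DECintact implementation actually used in the tests.

def check_consistency(accounts, tellers, branches, history, transactions_processed):
    """Illustrative data-integrity check for a DebitCredit-style workload.
    accounts and tellers are lists of dicts with 'branch' and 'balance' keys;
    branches is a list of dicts with 'id' and 'balance' keys; history holds
    one record per processed transaction."""
    # Rule 1: account totals equal teller totals equal branch totals.
    total_accounts = sum(a["balance"] for a in accounts)
    total_tellers = sum(t["balance"] for t in tellers)
    total_branches = sum(b["balance"] for b in branches)
    assert total_accounts == total_tellers == total_branches

    # Rule 2: within each branch, teller balances sum to the branch balance.
    for b in branches:
        branch_total = sum(t["balance"] for t in tellers if t["branch"] == b["id"])
        assert branch_total == b["balance"]

    # Rule 3: one new history record per transaction processed.
    assert len(history) == transactions_processed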


Application verification under fault insertion served as the final level of fault-tolerant validation. Whereas the previous phases ensured that the various components required for fault tolerance operated properly, the system application verification demonstrated that these components could operate together to provide a fully fault-tolerant system.

Conclusions
The process of verifying fault tolerance requires a strong architectural test plan. This plan must be developed early in the design cycle because hardware support for testing may be required. The verification plan must demonstrate cognizance of the capabilities and limitations at each phase of the development cycle. For example, the speed of simulation prohibits verification of software error recovery in a simulation environment. Also, when a system is implemented with VLSI technology, the ability to physically insert faults into the system by means of an external mechanical mechanism may not be adequate to properly verify the correct system error recovery. These and other issues must be addressed before the chips are fabricated, or adequate error recovery verification may not be possible. Inadequate error recovery verification directly increases the risk of real, unrecoverable faults resulting in system outages.

The verification plan for the VAXft 3000 system consisted of the following phases and objectives:

1. Hardware simulation with fault insertion verified error detection, hardware recovery, and error data capture.

2. System exerciser with fault insertion enhanced the coverage of the hardware simulation effort.

3. System software with fault insertion verified software error recovery and reporting.

4. System application verification with fault insertion verified the transparency of the system error recovery to the application running on the system.

The test of any fault-tolerant system is to survive a real fault while running a customer application. Removing a module from a machine may be an impressive test, but machines do not fail as a result of modules falling out of the backplane. The initial test of the VAXft 3000 system showed that the system would survive most of the faults introduced. Tests also revealed problems that would have resulted in system outages if left uncorrected. System enhancements were made in the areas of system recovery actions and repair callout. Whereas some of the problems were simple coding errors, others were errors in carefully reviewed and documented algorithms. Simply put, the collective wisdom of the designers was not always sufficient to reach the degree of accuracy desired for this fault-tolerant system.

As the VAXft product family evolves, performance and functional enhancements will be available. The test processes described in this paper will remain in use, so that every future release of software will be better than the previous one. The combination of hardware and software fault insertion, coupled with physical system disruption, allows testing to occur at such a greatly accelerated rate that all testing performed will be repeated for every new release.

References

1. J. Croll, L. Camilli, and A. Vaccaro, "Test and Qualification of the VAX 6000 Model 400 System," Digital Technical Journal, vol. 2, no. 2 (Spring 1990): 73-83.

2. J. Barron, E. Czeck, Z. Segall, and D. Siewiorek, "Fault Injection Experiments Using FIAT (Fault Injection-based Automated Testing)," IEEE Transactions on Computers, vol. 39, no. 4 (April 1990).

3. R. Calcagni and W. Sherwood, "VAX 6000 Model 400 CPU Chip Set Functional Design Verification," Digital Technical Journal, vol. 2, no. 2 (Spring 1990): 64-72.

